This project demonstrates how to use Scrapy for web scraping within Intuned's browser automation environment. It includes two approaches:
- `scrapy-crawler`: Uses Scrapy's built-in HTTP request system for scraping static websites
- `scrapy-crawler-js`: Uses Playwright to render JavaScript-heavy pages, then parses the HTML with Scrapy
Open this project in Intuned by clicking the button below.
NOTE: All commands support the `--help` flag for more information about the command, its arguments, and options.
Install dependencies:

```shell
uv sync
```

After installing dependencies, the `intuned` command should be available in your environment.
Run the crawlers:

```shell
# Scrapy crawler (static sites)
uv run intuned run api scrapy-crawler .parameters/api/scrapy-crawler/default.json

# Scrapy crawler with JavaScript rendering (using Playwright)
uv run intuned run api scrapy-crawler-js .parameters/api/scrapy-crawler-js/default.json
```

Provision and deploy the project:

```shell
uv run intuned provision
uv run intuned deploy
```

This project uses Scrapy, a powerful web scraping framework for Python. Scrapy provides:
- Built-in HTTP request handling
- Powerful CSS and XPath selectors
- Item pipelines for data processing
- Built-in support for pagination and following links
This project uses Intuned browser SDK for browser automation. For more information, check out the Intuned Browser SDK documentation.
The scrapy-crawler-js API uses Playwright to render JavaScript-heavy pages before parsing with Scrapy. This allows you to scrape dynamic content that requires JavaScript execution.
The project structure is as follows:
```
/
├── api/                       # API endpoints
│   ├── scrapy-crawler.py      # Scrapy crawler using Scrapy's HTTP requests
│   └── scrapy-crawler-js.py   # Scrapy crawler using Playwright + Scrapy parsing
├── collector/                 # Item collection utilities
│   └── item_collector.py      # Collects scraped items via Scrapy signals
├── utils/                     # Utility modules
│   └── types_and_schemas.py   # Pydantic models for parameters and data
├── Intuned.jsonc              # Intuned project configuration file
└── pyproject.toml             # Python project dependencies
```
- `scrapy-crawler`: Uses Scrapy's `CrawlerRunner` to make HTTP requests and scrape static websites. Best for sites that don't require JavaScript rendering.
- `scrapy-crawler-js`: Uses Playwright to navigate and render pages, then creates Scrapy `HtmlResponse` objects for parsing. Best for JavaScript-heavy websites.
- `QuotesSpider`: Scrapy spider class that defines how to parse quotes from the target website
- `ItemCollector`: Collects scraped items via Scrapy's signal system
- `ListParams`: Pydantic model for API parameters (`url`, `max_pages`)
- `Quote`: Pydantic model for scraped quote data (`text`, `author`, `tags`)
To adapt this example for your own scraping needs:
1. Update the Spider: Modify the `QuotesSpider` class in the API files:
   - Change CSS selectors to match your target website
   - Update the data structure being yielded
   - Adjust pagination logic if needed
2. Update Data Models: Modify `utils/types_and_schemas.py`:
   - Update `ListParams` for your API parameters
   - Update `Quote` (or create new models) for your scraped data
3. Choose the Right Approach:
   - Use `scrapy-crawler` for static websites
   - Use `scrapy-crawler-js` for JavaScript-heavy sites
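The models in `utils/types_and_schemas.py` might look roughly like this, with field names taken from the component descriptions above (the default value for `max_pages` is an assumption):

```python
from pydantic import BaseModel


class ListParams(BaseModel):
    """API parameters for the crawler endpoints."""

    url: str
    max_pages: int = 1  # assumed default; check types_and_schemas.py


class Quote(BaseModel):
    """One scraped quote."""

    text: str
    author: str
    tags: list[str]
```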
Example `Intuned.jsonc`:

```jsonc
{
  // Your Intuned workspace ID.
  // Optional - If not provided here, it must be supplied via the `--workspace-id` flag during deployment.
  "workspaceId": "your_workspace_id",

  // The name of your Intuned project.
  // Optional - If not provided here, it must be supplied via the command line when deploying.
  "projectName": "your_project_name",

  // Replication settings
  "replication": {
    // The maximum number of concurrent executions allowed via Intuned API. This does not affect jobs.
    // A number of machines equal to this will be allocated to handle API requests.
    // Not applicable if API access is disabled.
    "maxConcurrentRequests": 1,

    // The machine size to use for this project. This is applicable for both API requests and jobs.
    // "standard": Standard machine size (6 shared vCPUs, 2GB RAM)
    // "large": Large machine size (8 shared vCPUs, 4GB RAM)
    // "xlarge": Extra large machine size (1 performance vCPU, 8GB RAM)
    "size": "standard"
  },

  // Auth session settings
  "authSessions": {
    // Whether auth sessions are enabled for this project.
    // If enabled, the "auth-sessions/check.ts" API must be implemented to validate the auth session.
    "enabled": true,

    // Whether to save Playwright traces for auth session runs.
    "saveTraces": false,

    // The type of auth session to use.
    // "API" type requires implementing the "auth-sessions/create.ts" API to create/recreate the auth session programmatically.
    // "MANUAL" type uses a recorder to manually create the auth session.
    "type": "API",

    // Recorder start URL for the recorder to navigate to when creating the auth session.
    // Required if "type" is "MANUAL". Not used if "type" is "API".
    "startUrl": "https://example.com/login",

    // Recorder finish URL for the recorder. Once this URL is reached, the recorder stops and saves the auth session.
    // Required if "type" is "MANUAL". Not used if "type" is "API".
    "finishUrl": "https://example.com/dashboard",

    // Recorder browser mode.
    // "fullscreen": Launches the browser in fullscreen mode.
    // "kiosk": Launches the browser in kiosk mode (no address bar, no navigation controls).
    // Only applicable for "MANUAL" type.
    "browserMode": "fullscreen"
  },

  // API access settings
  "apiAccess": {
    // Whether to enable consumption through the Intuned API. If this is false, the project can only be consumed through jobs.
    // This is required for projects that use auth sessions.
    "enabled": true
  },

  // Whether to run the deployed API in a headful browser. Running headful can help with some anti-bot
  // detections. However, it requires more resources and may run slower or crash if the machine size is "standard".
  "headful": false,

  // The region where your Intuned project is hosted.
  // For a list of available regions, contact support or refer to the documentation.
  // Optional - Default: "us"
  "region": "us"
}
```