# Scrapy Example Intuned Project

This project demonstrates how to use Scrapy for web scraping within Intuned's browser automation environment. It includes two approaches:

1. **scrapy-crawler**: uses Scrapy's built-in HTTP request system to scrape static websites
2. **scrapy-crawler-js**: uses Playwright to render JavaScript-heavy pages, then parses the HTML with Scrapy

## Run on Intuned

Open this project in Intuned by clicking the button below.

[Run on Intuned]

## Development

> **NOTE**: All commands support the `--help` flag for more information about a command and its arguments and options.

### Install dependencies

```shell
uv sync
```

After installing dependencies, the `intuned` command should be available in your environment.

### Run an API

Run either crawler with its default parameters:

```shell
# Scrapy crawler (static sites)
uv run intuned run api scrapy-crawler .parameters/api/scrapy-crawler/default.json

# Scrapy crawler with JavaScript rendering (using Playwright)
uv run intuned run api scrapy-crawler-js .parameters/api/scrapy-crawler-js/default.json
```
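The parameter files passed to these commands follow the `ListParams` model described below (`url`, `max_pages`). A plausible `.parameters/api/scrapy-crawler/default.json` — the values here are illustrative, not the file's actual contents:

```json
{
  "url": "https://quotes.toscrape.com",
  "max_pages": 2
}
```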

### Save project

```shell
uv run intuned provision
```

### Deploy project

```shell
uv run intuned deploy
```

## Technologies Used

### Scrapy

This project uses Scrapy, a powerful web scraping framework for Python. Scrapy provides:

- Built-in HTTP request handling
- Powerful CSS and XPath selectors
- Item pipelines for data processing
- Built-in support for pagination and following links

### Intuned Browser SDK

This project uses the Intuned browser SDK for browser automation. For more information, check out the Intuned Browser SDK documentation.

### Playwright

The `scrapy-crawler-js` API uses Playwright to render JavaScript-heavy pages before parsing with Scrapy. This allows you to scrape dynamic content that requires JavaScript execution.

## Project Structure

The project structure is as follows:

```
/
├── api/                      # API endpoints
│   ├── scrapy-crawler.py     # Scrapy crawler using Scrapy's HTTP requests
│   └── scrapy-crawler-js.py  # Scrapy crawler using Playwright + Scrapy parsing
├── collector/                # Item collection utilities
│   └── item_collector.py     # Collects scraped items via Scrapy signals
├── utils/                    # Utility modules
│   └── types_and_schemas.py  # Pydantic models for parameters and data
├── Intuned.jsonc             # Intuned project configuration file
└── pyproject.toml            # Python project dependencies
```

## API Endpoints

- **scrapy-crawler**: Uses Scrapy's `CrawlerRunner` to make HTTP requests and scrape static websites. Best for sites that don't require JavaScript rendering.
- **scrapy-crawler-js**: Uses Playwright to navigate and render pages, then creates Scrapy `HtmlResponse` objects for parsing. Best for JavaScript-heavy websites.

## Key Components

- **`QuotesSpider`**: Scrapy spider class that defines how to parse quotes from the target website
- **`ItemCollector`**: Collects scraped items via Scrapy's signal system
- **`ListParams`**: Pydantic model for API parameters (`url`, `max_pages`)
- **`Quote`**: Pydantic model for scraped quote data (`text`, `author`, `tags`)
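The two Pydantic models might look roughly like this — a sketch based only on the field names listed above (the `max_pages` default and types are assumptions):

```python
from pydantic import BaseModel


class ListParams(BaseModel):
    """API parameters: the URL to crawl and a page limit."""

    url: str
    max_pages: int = 1


class Quote(BaseModel):
    """One scraped quote: its text, author, and tags."""

    text: str
    author: str
    tags: list[str] = []
```

Pydantic validates incoming parameter JSON against `ListParams` and gives each scraped item a typed shape via `Quote`.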

## Customizing for Your Use Case

To adapt this example for your own scraping needs:

1. **Update the spider**: Modify the `QuotesSpider` class in the API files:
   - Change CSS selectors to match your target website
   - Update the data structure being yielded
   - Adjust pagination logic if needed
2. **Update data models**: Modify `utils/types_and_schemas.py`:
   - Update `ListParams` for your API parameters
   - Update `Quote` (or create new models) for your scraped data
3. **Choose the right approach**:
   - Use `scrapy-crawler` for static websites
   - Use `scrapy-crawler-js` for JavaScript-heavy sites

## `Intuned.jsonc` Reference

```jsonc
{
  // Your Intuned workspace ID.
  // Optional - If not provided here, it must be supplied via the `--workspace-id` flag during deployment.
  "workspaceId": "your_workspace_id",

  // The name of your Intuned project.
  // Optional - If not provided here, it must be supplied via the command line when deploying.
  "projectName": "your_project_name",

  // Replication settings
  "replication": {
    // The maximum number of concurrent executions allowed via the Intuned API. This does not affect jobs.
    // A number of machines equal to this will be allocated to handle API requests.
    // Not applicable if API access is disabled.
    "maxConcurrentRequests": 1,

    // The machine size to use for this project. This applies to both API requests and jobs.
    // "standard": Standard machine size (6 shared vCPUs, 2GB RAM)
    // "large": Large machine size (8 shared vCPUs, 4GB RAM)
    // "xlarge": Extra large machine size (1 performance vCPU, 8GB RAM)
    "size": "standard"
  },

  // Auth session settings
  "authSessions": {
    // Whether auth sessions are enabled for this project.
    // If enabled, the "auth-sessions/check.ts" API must be implemented to validate the auth session.
    "enabled": true,

    // Whether to save Playwright traces for auth session runs.
    "saveTraces": false,

    // The type of auth session to use.
    // "API" type requires implementing the "auth-sessions/create.ts" API to create/recreate the auth session programmatically.
    // "MANUAL" type uses a recorder to manually create the auth session.
    "type": "API",

    // Recorder start URL for the recorder to navigate to when creating the auth session.
    // Required if "type" is "MANUAL". Not used if "type" is "API".
    "startUrl": "https://example.com/login",

    // Recorder finish URL for the recorder. Once this URL is reached, the recorder stops and saves the auth session.
    // Required if "type" is "MANUAL". Not used if "type" is "API".
    "finishUrl": "https://example.com/dashboard",

    // Recorder browser mode
    // "fullscreen": Launches the browser in fullscreen mode.
    // "kiosk": Launches the browser in kiosk mode (no address bar, no navigation controls).
    // Only applicable for "MANUAL" type.
    "browserMode": "fullscreen"
  },

  // API access settings
  "apiAccess": {
    // Whether to enable consumption through the Intuned API. If this is false, the project can only be consumed through jobs.
    // This is required for projects that use auth sessions.
    "enabled": true
  },

  // Whether to run the deployed API in a headful browser. Running headful can help with some anti-bot detections, but it requires more resources and may be slower or crash on the "standard" machine size.
  "headful": false,

  // The region where your Intuned project is hosted.
  // For a list of available regions, contact support or refer to the documentation.
  // Optional - Default: "us"
  "region": "us"
}
```

## Learn More