native-crawler

A simple, library-free web crawler demonstrating Intuned's extend_payload and persistent_store features for parallel crawling with deduplication.

Run on Intuned

Open this project in Intuned by clicking the Run on Intuned button.

Architecture

This project showcases two key Intuned runtime features:

extend_payload

Dynamically spawn new payloads within a job. This enables a fan-out pattern where one API call triggers many others, all within the same job run. Reference: https://docs.intunedhq.com/docs/05-references/runtime-sdk-python/extend-payload
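
For illustration, a fan-out from one crawl payload could look like the sketch below. The import path and call shape are assumptions (only the payload fields mirror the job payload shown later in this README); the linked reference is authoritative.

# Sketch only: the import path and exact signature of extend_payload are
# assumptions; see the runtime SDK reference linked above for the real API.
from runtime import extend_payload  # assumed import path

async def fan_out(links: list[str], depth: int) -> None:
    # Queue one additional `crawl` payload per discovered link, all inside the
    # same job run (the fan-out pattern described above).
    for link in links:
        await extend_payload({
            "api": "crawl",
            "parameters": {"url": link, "depth": depth + 1},
        })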

persistent_store

A shared key-value store that persists across all payloads in a job. Used here for URL deduplication — preventing the same page from being crawled multiple times. Reference: https://docs.intunedhq.com/docs/05-references/runtime-sdk-python/persistent-store
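
A sketch of URL deduplication with this store, under the assumption that attempt_store succeeds only for the first writer of a key (the method name appears later in this README; its exact semantics are hedged here):

# Sketch only: the import path and attempt_store semantics are assumptions
# based on the persistent-store reference linked above.
from runtime import persistent_store  # assumed import path

async def should_crawl(url: str) -> bool:
    # Assumed semantics: attempt_store writes the key only if it is absent and
    # reports whether this call was the first writer, so concurrent payloads
    # agree on exactly one owner per URL.
    return await persistent_store.attempt_store(f"visited:{url}", True)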

Flow

                    ┌─────────────────────────────────────────────────┐
                    │                   JOB RUN                       │
                    │                                                 │
                    │    ┌─────────────────────────────────────────┐  │
                    │    │     SHARED persistent_store             │  │
                    │    │  (tracks visited URLs)                  │  │
                    │    └─────────────────────────────────────────┘  │
                    │                                                 │
  Start Job ───────►│  1. crawl(seed_url)                            │
                    │     ├─► Extract markdown content               │
                    │     ├─► Discover 50 links                      │
                    │     └─► extend_payload(crawl) × 50             │
                    │                      │                         │
                    │     ┌────────────────┼────────────────┐        │
                    │     ▼                ▼                ▼        │
                    │  2. crawl(link1)  crawl(link2)    crawl(...)   │
                    │     ├─► Extract    ├─► Extract    ├─► ...      │
                    │     ├─► Discover   ├─► Discover   │            │
                    │     └─► extend...  └─► extend...  │            │
                    │                                                 │
                    │  Continues until depth/page limits reached     │
                    │  All results sent to job sink (webhook/S3)     │
                    └─────────────────────────────────────────────────┘

Project Structure

native-crawler/
├── api/
│   └── crawl.py          # Main API: extract content + discover links + recurse
├── utils/
│   ├── __init__.py
│   ├── content.py        # extract_page_content() - markdown extraction
│   └── links.py          # extract_links() - link discovery + normalization
├── Intuned.jsonc
└── README.md

API

crawl

Crawls a URL: extracts content, discovers links, and queues them for further crawling.

Parameter   Type     Default    Description
url         string   required   URL to crawl
max_depth   int      2          Maximum crawl depth from seed
max_pages   int      50         Maximum total pages to process
depth       int      0          Current depth (set internally by extend_payload)
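
A matching .parameters/api/crawl/default.json for local runs might look like this (the seed URL is only an example):

{
  "url": "https://books.toscrape.com",
  "max_depth": 2,
  "max_pages": 50
}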

Usage

Local Development

# Install dependencies
uv sync

# Run the crawler
uv run intuned run api crawl .parameters/api/crawl/default.json

As a Job (Production)

When run as a job, extend_payload spawns parallel payloads and attempt_store deduplicates across all of them:

curl -X POST "https://api.intunedhq.com/projects/{project}/jobs" \
  -H "Authorization: Bearer {api_key}" \
  -d '{
    "payload": {
      "api": "crawl",
      "parameters": {
        "url": "https://books.toscrape.com",
        "max_depth": 2,
        "max_pages": 100
      }
    },
    "sink": {
      "type": "webhook",
      "url": "https://your-webhook.com/results"
    }
  }'

Structured Data Extraction (Schema)

You can extract structured data instead of markdown by providing a generic JSON schema. This uses Intuned's AI extraction model.

Edit .parameters/api/crawl/default.json to include a schema:

{
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "max_depth": 0,
  "schema": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "price": { "type": "string" },
      "availability": { "type": "string" }
    },
    "required": ["title", "price"]
  }
}

Then run:

uv run intuned run api crawl .parameters/api/crawl/default.json

Download Attachments

You can automatically find and download files (PDFs, images, etc.) to S3 by enabling include_attachments.

Edit .parameters/api/crawl/default.json:

{
  "url": "https://sandbox.intuned.dev/pdfs",
  "max_depth": 1,
  "include_attachments": true
}

Then run:

uv run intuned run api crawl .parameters/api/crawl/default.json

Utils

utils/content.py

  • extract_page_content(page) — Returns {title, markdown, markdown_length}
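
A simplified sketch of that return shape, assuming a Playwright page object and substituting plain inner text for the real markdown conversion:

# Simplified sketch: the actual utils/content.py converts the page to markdown;
# here inner_text stands in for that conversion to keep the example dependency-free.
async def extract_page_content(page):
    title = await page.title()
    markdown = await page.inner_text("body")
    return {
        "title": title,
        "markdown": markdown,
        "markdown_length": len(markdown),
    }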

utils/links.py

  • extract_links(page, base_domain) — Returns list of normalized internal URLs
  • normalize_url(url) — Normalize URL (remove fragments, trailing slashes)
  • get_base_domain(url) — Extract domain from URL
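
A minimal standard-library sketch of the helpers listed above, assuming a Playwright page for link discovery (the actual utils/links.py may differ in details):

from urllib.parse import urlparse, urlunparse

def normalize_url(url: str) -> str:
    # Drop the fragment and any trailing slash so equivalent URLs map to one key.
    parts = urlparse(url)
    path = parts.path.rstrip("/") or "/"
    return urlunparse((parts.scheme, parts.netloc, path, parts.params, parts.query, ""))

def get_base_domain(url: str) -> str:
    # Extract just the host, e.g. "books.toscrape.com".
    return urlparse(url).netloc

async def extract_links(page, base_domain: str) -> list[str]:
    # Collect href values, keep same-domain links, normalize, and dedupe.
    hrefs = await page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
    links = {normalize_url(h) for h in hrefs if get_base_domain(h) == base_domain}
    return sorted(links)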

Deduplication Keys

The persistent_store uses these key patterns:

Key Pattern       Purpose
visited:{url}     Tracks URLs that have been crawled
__page_count__    Global counter for pages processed
__max_depth__     Stored config: max depth
__max_pages__     Stored config: max pages
__base_domain__   Stored config: base domain for filtering
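
As an illustration of how these keys might be written and read, here is a hedged sketch; the get method and the write-once use of attempt_store are assumptions, and only the key names come from this project:

# Sketch only: the `get` method and attempt_store semantics are assumptions;
# only the key names below are taken from the table above.
from runtime import persistent_store  # assumed import path

async def store_crawl_config(max_depth: int, max_pages: int, base_domain: str) -> None:
    # The seed payload runs first, so write-once semantics are sufficient here.
    await persistent_store.attempt_store("__max_depth__", max_depth)
    await persistent_store.attempt_store("__max_pages__", max_pages)
    await persistent_store.attempt_store("__base_domain__", base_domain)

async def under_page_limit() -> bool:
    # Compare the shared page counter against the stored limit before crawling.
    count = await persistent_store.get("__page_count__") or 0
    limit = await persistent_store.get("__max_pages__")
    return count < limit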

Learn More