Norns Web Crawling

Ingest web content into the Norns RAG system using Firecrawl.


Overview

The Web Crawling feature lets you scrape and ingest web pages into Norns' knowledge base. Content is automatically chunked, embedded, and stored for RAG retrieval. Three modes are supported: single-page scraping, full-site crawling, and URL mapping.

| Property | Value |
| --- | --- |
| UI Location | norns.ravenhelm.dev/documents |
| API Endpoint | POST /api/documents/crawl |
| Backend | Firecrawl API |
| Storage | PostgreSQL + pgvector |

Features

| Feature | Description |
| --- | --- |
| Single Page | Scrape a single URL and extract content |
| Site Crawl | Crawl an entire website following links |
| URL Mapping | Discover all URLs on a site without extracting content |
| Document Types | Categorize content (Web, Technical, Reference, General) |
| Auto-Chunking | Adaptive chunking strategy for optimal retrieval |
| Embedding | Automatic vector embedding for semantic search |

Crawl Modes

Single Page (single)

Scrapes a single URL and extracts its main content.

Best for:

  • Individual articles or blog posts
  • Specific documentation pages
  • Landing pages

Example: Scraping a single API documentation page

Site Crawl (crawl)

Follows links from the starting URL and crawls multiple pages.

Best for:

  • Documentation sites
  • Knowledge bases
  • Blog archives

Parameters:

  • max_pages: Maximum number of pages to crawl (1-100, default: 10)
  • max_depth: How many levels deep to follow links

Example: Crawling an entire documentation site

URL Mapping (map)

Discovers all URLs on a site without extracting content. Useful for reconnaissance before a full crawl.

Best for:

  • Understanding site structure
  • Planning which sections to crawl
  • Estimating crawl scope

Returns: List of discovered URLs with count
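
All three modes go through the same POST /api/documents/crawl endpoint and differ only in the request payload (documented in full under API Reference below). As a quick sketch, the payloads might look like this in Python; the URLs and user_id values are placeholders:

# Request payloads for the three crawl modes (fields follow the Parameters table).
single_page = {
    "url": "https://example.com/docs/getting-started",
    "user_id": "your-user-id",
    "document_type": "technical",
    "crawl_mode": "single",
}

site_crawl = {
    "url": "https://docs.example.com",
    "user_id": "your-user-id",
    "document_type": "technical",
    "crawl_mode": "crawl",
    "max_pages": 20,   # 1-100, default 10
    "max_depth": 2,    # how many link levels to follow
}

url_map = {
    "url": "https://example.com",
    "user_id": "your-user-id",
    "crawl_mode": "map",  # returns discovered URLs only, no content
}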


User Interface

Access the web crawling feature from the Documents page:

https://norns.ravenhelm.dev/documents

UI Components

| Component | Description |
| --- | --- |
| URL Input | Enter the website URL to crawl |
| Document Type | Select category for the content |
| Crawl Mode | Choose single, crawl, or map |
| Max Pages | Set page limit for site crawls |
| Submit Button | Start the crawl operation |
| Progress | Real-time status and results |

Document Types

| Type | Use Case |
| --- | --- |
| Web | General web content |
| Technical | API docs, code references, technical guides |
| Reference | Manuals, specifications, standards |
| General | Miscellaneous content |

API Reference

Endpoint

POST /api/documents/crawl

Request Body

{
  "url": "https://example.com/docs",
  "user_id": "uuid-string",
  "document_type": "technical",
  "crawl_mode": "crawl",
  "max_pages": 20,
  "max_depth": 3,
  "only_main_content": true
}

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | - | URL to crawl |
| user_id | UUID | Yes | - | User identifier |
| document_type | string | No | "web" | Content category |
| crawl_mode | string | No | "single" | single, crawl, or map |
| max_pages | int | No | 10 | Max pages (1-100) |
| max_depth | int | No | null | Max link depth (1-10) |
| only_main_content | bool | No | true | Extract main content only |

Response (Single/Crawl)

{
  "crawl_mode": "crawl",
  "url": "https://example.com/docs",
  "documents_created": 15,
  "total_chunks": 127,
  "documents": [
    {
      "document_id": "uuid",
      "document_name": "Getting Started",
      "source_url": "https://example.com/docs/getting-started",
      "chunks_created": 8
    }
  ]
}

Response (Map)

{
  "crawl_mode": "map",
  "url": "https://example.com",
  "discovered_urls": [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/docs"
  ],
  "url_count": 47
}
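
A minimal Python client sketch that ties the modes and responses together: it maps a site first, then crawls it and prints the fields documented above. It assumes the X-API-Key header shown in the Usage Examples below and a NORNS_API_KEY environment variable; adjust to your own auth setup:

import os
import requests

API_URL = "https://norns.ravenhelm.dev/api/documents/crawl"
HEADERS = {"X-API-Key": os.environ["NORNS_API_KEY"]}

def crawl(payload: dict) -> dict:
    """POST a crawl request and return the parsed JSON response."""
    resp = requests.post(API_URL, json=payload, headers=HEADERS, timeout=300)
    resp.raise_for_status()
    return resp.json()

# 1. Map the site to see what is there.
mapped = crawl({"url": "https://docs.example.com",
                "user_id": "your-user-id",
                "crawl_mode": "map"})
print(f"Discovered {mapped['url_count']} URLs")

# 2. Crawl the same site and summarize what was ingested.
result = crawl({"url": "https://docs.example.com",
                "user_id": "your-user-id",
                "document_type": "technical",
                "crawl_mode": "crawl",
                "max_pages": 20})
print(f"Created {result['documents_created']} documents, "
      f"{result['total_chunks']} chunks")
for doc in result["documents"]:
    print(f"  {doc['document_name']}: {doc['chunks_created']} chunks "
          f"({doc['source_url']})")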

Setup

Prerequisites

  1. Firecrawl API Key - Required for web crawling functionality
  2. PostgreSQL with pgvector - For storing embeddings
  3. Embedding Service - For generating vector embeddings

Configuration

Add the Firecrawl API key to your environment:

# In /Users/ravenhelm/ravenhelm/secrets/.env
FIRECRAWL_API_KEY=fc-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Get Firecrawl API Key

  1. Visit firecrawl.dev
  2. Create an account
  3. Navigate to API Keys
  4. Generate a new key
  5. Add to .env file

Free Tier:

  • 500 credits/month
  • Single page scrapes: 1 credit
  • Site crawls: 1 credit per page
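
For example, a site crawl with max_pages: 50 can use up to 50 credits, so the free tier covers roughly ten crawls of that size (or 500 single-page scrapes) per month.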

Verify Configuration

# Check if Firecrawl is configured
ssh ravenhelm@100.115.101.81 "docker exec norns-agent env | grep FIRECRAWL"

# Restart agent after configuration
ssh ravenhelm@100.115.101.81 "docker restart norns-agent"

Architecture

Data Flow

Web Crawl Flow

  Norns Admin UI
        │
        ▼
  Backend Endpoint (POST /api/documents/crawl)
        │
        ▼
  Firecrawl API (web scraping)
        │
        ▼
  Markdown content + metadata
        │
        ▼
  Document Processor
        │
        ▼
  Adaptive Chunking
        │
        ▼
  Embedding Service
        │
        ▼
  PostgreSQL (pgvector)

Database Schema

Crawled content is stored in the document_chunks table:

CREATE TABLE document_chunks (
  id SERIAL PRIMARY KEY,
  document_id UUID NOT NULL,
  user_id UUID NOT NULL,
  document_name VARCHAR(500),
  document_type VARCHAR(50),
  chunk_index INTEGER,
  chunk_content TEXT,
  chunk_embedding vector(1536),
  content_type VARCHAR(50),
  metadata JSONB,
  created_at TIMESTAMP DEFAULT NOW()
);
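
To spot-check that crawled chunks landed in the table, you can query it directly. Below is a minimal sketch using psycopg2 and pgvector's cosine-distance operator (<=>); the connection string is a placeholder, and the query embedding must come from the same 1536-dimension embedding service used at ingest time:

import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/norns")  # placeholder DSN

def search_chunks(query_embedding: list[float], user_id: str, limit: int = 5):
    """Return the chunks closest to the query embedding by cosine distance."""
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT document_name, chunk_content,
                   chunk_embedding <=> %s::vector AS distance
            FROM document_chunks
            WHERE user_id = %s
            ORDER BY distance
            LIMIT %s
            """,
            (vector_literal, user_id, limit),
        )
        return cur.fetchall()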

Metadata Fields

Each chunk stores rich metadata:

{
  "source_url": "https://example.com/page",
  "crawl_mode": "crawl",
  "crawled_at": "2025-01-08T12:00:00Z",
  "token_count": 512
}
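
Because the metadata is stored as JSONB, it can be queried directly as well, for example to count how many chunks each source URL produced (handy when checking for the duplicate-content issue described under Troubleshooting). A small sketch, reusing the psycopg2 connection from the previous example:

def chunks_per_url(user_id: str):
    """Count stored chunks grouped by the source_url recorded in metadata."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT metadata->>'source_url' AS source_url, COUNT(*) AS chunk_count
            FROM document_chunks
            WHERE user_id = %s
            GROUP BY metadata->>'source_url'
            ORDER BY chunk_count DESC
            """,
            (user_id,),
        )
        return cur.fetchall()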

Usage Examples

Crawl Documentation Site

curl -X POST "https://norns.ravenhelm.dev/api/documents/crawl" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $NORNS_API_KEY" \
  -d '{
    "url": "https://docs.example.com",
    "user_id": "your-user-id",
    "document_type": "technical",
    "crawl_mode": "crawl",
    "max_pages": 50
  }'

Scrape Single Article

curl -X POST "https://norns.ravenhelm.dev/api/documents/crawl" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $NORNS_API_KEY" \
  -d '{
    "url": "https://blog.example.com/article",
    "user_id": "your-user-id",
    "document_type": "web",
    "crawl_mode": "single"
  }'

Map Site URLs

curl -X POST "https://norns.ravenhelm.dev/api/documents/crawl" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $NORNS_API_KEY" \
  -d '{
    "url": "https://example.com",
    "user_id": "your-user-id",
    "crawl_mode": "map"
  }'

Troubleshooting

Issue: "Web crawling not available"

Symptoms: Error message about Firecrawl not being installed

Cause: FIRECRAWL_API_KEY not configured

Solution:

# Add API key to environment
ssh ravenhelm@100.115.101.81 "echo 'FIRECRAWL_API_KEY=your-key' >> ~/ravenhelm/secrets/.env"

# Restart agent
ssh ravenhelm@100.115.101.81 "docker restart norns-agent"

Issue: "Crawl failed" error

Symptoms: Crawl operation fails with generic error

Solutions:

  1. Check if URL is accessible:
    curl -I "https://example.com"
  2. Verify Firecrawl API key is valid
  3. Check agent logs:
    ssh ravenhelm@100.115.101.81 "docker logs norns-agent 2>&1 | tail -50"

Issue: Empty content scraped

Symptoms: Documents created but no chunks

Cause: Site may block scraping or use JavaScript rendering

Solutions:

  1. Try only_main_content: false to capture more content
  2. Check if site requires JavaScript rendering
  3. Verify site doesn't block bots (check robots.txt)

Issue: Too many chunks created

Symptoms: Single page creates excessive chunks

Solutions:

  1. Content may be very large - this is expected
  2. Review document type selection (affects chunking)
  3. Consider using max_pages: 1 for testing

Issue: Duplicate content

Symptoms: Same content appears multiple times

Cause: The site has multiple URLs pointing to the same content

Solutions:

  1. Use URL mapping first to identify duplicates
  2. Set appropriate max_depth to limit crawl scope
  3. Manually remove duplicate documents after crawl

Best Practices

Before Crawling

  1. Test with single page first - Verify content extraction works
  2. Use URL mapping - Understand site structure before full crawl
  3. Set reasonable limits - Start with max_pages: 10-20
  4. Choose correct document type - Affects RAG retrieval quality

During Crawling

  1. Monitor progress - Watch the UI for status updates
  2. Check for errors - Review any failed pages
  3. Be patient - Large sites take time to process

After Crawling

  1. Verify in knowledge base - Check documents appear correctly
  2. Test retrieval - Ask questions about the crawled content
  3. Remove duplicates - Clean up any duplicate content

Limitations

| Limitation | Details |
| --- | --- |
| JavaScript Sites | Limited support for JS-rendered content |
| Authentication | Cannot crawl login-protected pages |
| Rate Limits | Firecrawl API has usage limits |
| Large Sites | Max 100 pages per crawl operation |
| File Types | Only extracts text content (not PDFs, images) |