Norns Web Crawling

Ingest web content into the Norns RAG system using Firecrawl.


Overview

The Web Crawling feature lets you scrape and ingest web pages into Norns' knowledge base. Content is automatically chunked, embedded, and stored for RAG retrieval. Three modes are supported: single-page scraping, full-site crawling, and URL mapping.

| Property | Value |
| --- | --- |
| UI Location | norns.ravenhelm.dev/documents |
| API Endpoint | POST /api/documents/crawl |
| Backend | Firecrawl API |
| Storage | PostgreSQL + pgvector |

Features

| Feature | Description |
| --- | --- |
| Single Page | Scrape a single URL and extract content |
| Site Crawl | Crawl an entire website following links |
| URL Mapping | Discover all URLs on a site without extracting content |
| Document Types | Categorize content (Web, Technical, Reference, General) |
| Auto-Chunking | Adaptive chunking strategy for optimal retrieval |
| Embedding | Automatic vector embedding for semantic search |

Crawl Modes

Single Page (single)

Scrapes a single URL and extracts its main content.

Best for:

  • Individual articles or blog posts
  • Specific documentation pages
  • Landing pages

Example: Scraping a single API documentation page

Site Crawl (crawl)

Follows links from the starting URL and crawls multiple pages.

Best for:

  • Documentation sites
  • Knowledge bases
  • Blog archives

Parameters:

  • max_pages: Maximum number of pages to crawl (1-100, default: 10)
  • max_depth: How many levels deep to follow links

Example: Crawling an entire documentation site

URL Mapping (map)

Discovers all URLs on a site without extracting content. Useful for reconnaissance before a full crawl.

Best for:

  • Understanding site structure
  • Planning which sections to crawl
  • Estimating crawl scope

Returns: List of discovered URLs with count
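
All three modes go through the same POST /api/documents/crawl endpoint and differ only in the request payload (documented in full under API Reference below). As a quick sketch, the payloads might look like this in Python; the URLs and user_id values are placeholders:

# Request payloads for the three crawl modes (fields follow the Parameters table).
single_page = {
    "url": "https://example.com/docs/getting-started",
    "user_id": "your-user-id",
    "document_type": "technical",
    "crawl_mode": "single",
}

site_crawl = {
    "url": "https://docs.example.com",
    "user_id": "your-user-id",
    "document_type": "technical",
    "crawl_mode": "crawl",
    "max_pages": 20,   # 1-100, default 10
    "max_depth": 2,    # how many link levels to follow
}

url_map = {
    "url": "https://example.com",
    "user_id": "your-user-id",
    "crawl_mode": "map",  # returns discovered URLs only, no content
}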


User Interface

Access the web crawling feature from the Documents page:

https://norns.ravenhelm.dev/documents

UI Components

| Component | Description |
| --- | --- |
| URL Input | Enter the website URL to crawl |
| Document Type | Select category for the content |
| Crawl Mode | Choose single, crawl, or map |
| Max Pages | Set page limit for site crawls |
| Submit Button | Start the crawl operation |
| Progress | Real-time status and results |

Document Types

| Type | Use Case |
| --- | --- |
| Web | General web content |
| Technical | API docs, code references, technical guides |
| Reference | Manuals, specifications, standards |
| General | Miscellaneous content |

API Reference

Endpoint

POST /api/documents/crawl

Request Body

{
  "url": "https://example.com/docs",
  "user_id": "uuid-string",
  "document_type": "technical",
  "crawl_mode": "crawl",
  "max_pages": 20,
  "max_depth": 3,
  "only_main_content": true
}

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | - | URL to crawl |
| user_id | UUID | Yes | - | User identifier |
| document_type | string | No | "web" | Content category |
| crawl_mode | string | No | "single" | single, crawl, or map |
| max_pages | int | No | 10 | Max pages (1-100) |
| max_depth | int | No | null | Max link depth (1-10) |
| only_main_content | bool | No | true | Extract main content only |

Response (Single/Crawl)

{
  "crawl_mode": "crawl",
  "url": "https://example.com/docs",
  "documents_created": 15,
  "total_chunks": 127,
  "documents": [
    {
      "document_id": "uuid",
      "document_name": "Getting Started",
      "source_url": "https://example.com/docs/getting-started",
      "chunks_created": 8
    }
  ]
}

Response (Map)

{
  "crawl_mode": "map",
  "url": "https://example.com",
  "discovered_urls": [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/docs"
  ],
  "url_count": 47
}
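
A minimal Python client sketch that ties the modes and responses together: it maps a site first, then crawls it and prints the fields documented above. It assumes the X-API-Key header shown in the Usage Examples below and a NORNS_API_KEY environment variable; adjust to your own auth setup:

import os
import requests

API_URL = "https://norns.ravenhelm.dev/api/documents/crawl"
HEADERS = {"X-API-Key": os.environ["NORNS_API_KEY"]}

def crawl(payload: dict) -> dict:
    """POST a crawl request and return the parsed JSON response."""
    resp = requests.post(API_URL, json=payload, headers=HEADERS, timeout=300)
    resp.raise_for_status()
    return resp.json()

# 1. Map the site to see what is there.
mapped = crawl({"url": "https://docs.example.com",
                "user_id": "your-user-id",
                "crawl_mode": "map"})
print(f"Discovered {mapped['url_count']} URLs")

# 2. Crawl the same site and summarize what was ingested.
result = crawl({"url": "https://docs.example.com",
                "user_id": "your-user-id",
                "document_type": "technical",
                "crawl_mode": "crawl",
                "max_pages": 20})
print(f"Created {result['documents_created']} documents, "
      f"{result['total_chunks']} chunks")
for doc in result["documents"]:
    print(f"  {doc['document_name']}: {doc['chunks_created']} chunks "
          f"({doc['source_url']})")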

Setup

Prerequisites

  1. Firecrawl API Key - Required for web crawling functionality
  2. PostgreSQL with pgvector - For storing embeddings
  3. Embedding Service - For generating vector embeddings

Configuration

Add the Firecrawl API key to your environment:

# In /Users/ravenhelm/ravenhelm/secrets/.env
FIRECRAWL_API_KEY=fc-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Get Firecrawl API Key

  1. Visit firecrawl.dev
  2. Create an account
  3. Navigate to API Keys
  4. Generate a new key
  5. Add to .env file

Free Tier:

  • 500 credits/month
  • Single page scrapes: 1 credit
  • Site crawls: 1 credit per page
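
For example, a site crawl with max_pages: 50 can use up to 50 credits, so the free tier covers roughly ten crawls of that size (or 500 single-page scrapes) per month.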

Verify Configuration

# Check if Firecrawl is configured
ssh ravenhelm@100.115.101.81 "docker exec norns-agent env | grep FIRECRAWL"

# Restart agent after configuration
ssh ravenhelm@100.115.101.81 "docker restart norns-agent"

Architecture

Data Flow

Web Crawl Flow

  Norns Admin UI
        │
        ▼
  Backend Endpoint (POST /api/documents/crawl)
        │
        ▼
  Firecrawl API (web scraping)
        │
        ▼
  Markdown content + metadata
        │
        ▼
  Document Processor
        │
        ▼
  Adaptive Chunking
        │
        ▼
  Embedding Service
        │
        ▼
  PostgreSQL (pgvector)

Database Schema

Crawled content is stored in the document_chunks table:

CREATE TABLE document_chunks (
  id SERIAL PRIMARY KEY,
  document_id UUID NOT NULL,
  user_id UUID NOT NULL,
  document_name VARCHAR(500),
  document_type VARCHAR(50),
  chunk_index INTEGER,
  chunk_content TEXT,
  chunk_embedding vector(1536),
  content_type VARCHAR(50),
  metadata JSONB,
  created_at TIMESTAMP DEFAULT NOW()
);
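
To spot-check that crawled chunks landed in the table, you can query it directly. Below is a minimal sketch using psycopg2 and pgvector's cosine-distance operator (<=>); the connection string is a placeholder, and the query embedding must come from the same 1536-dimension embedding service used at ingest time:

import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/norns")  # placeholder DSN

def search_chunks(query_embedding: list[float], user_id: str, limit: int = 5):
    """Return the chunks closest to the query embedding by cosine distance."""
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT document_name, chunk_content,
                   chunk_embedding <=> %s::vector AS distance
            FROM document_chunks
            WHERE user_id = %s
            ORDER BY distance
            LIMIT %s
            """,
            (vector_literal, user_id, limit),
        )
        return cur.fetchall()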

Metadata Fields

Each chunk stores rich metadata:

{
  "source_url": "https://example.com/page",
  "crawl_mode": "crawl",
  "crawled_at": "2025-01-08T12:00:00Z",
  "token_count": 512
}
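
Because the metadata is stored as JSONB, it can be queried directly as well, for example to count how many chunks each source URL produced (handy when checking for the duplicate-content issue described under Troubleshooting). A small sketch, reusing the psycopg2 connection from the previous example:

def chunks_per_url(user_id: str):
    """Count stored chunks grouped by the source_url recorded in metadata."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT metadata->>'source_url' AS source_url, COUNT(*) AS chunk_count
            FROM document_chunks
            WHERE user_id = %s
            GROUP BY metadata->>'source_url'
            ORDER BY chunk_count DESC
            """,
            (user_id,),
        )
        return cur.fetchall()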

Usage Examples

Crawl Documentation Site

curl -X POST "https://norns.ravenhelm.dev/api/documents/crawl" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $NORNS_API_KEY" \
  -d '{
    "url": "https://docs.example.com",
    "user_id": "your-user-id",
    "document_type": "technical",
    "crawl_mode": "crawl",
    "max_pages": 50
  }'

Scrape Single Article

curl -X POST "https://norns.ravenhelm.dev/api/documents/crawl" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $NORNS_API_KEY" \
  -d '{
    "url": "https://blog.example.com/article",
    "user_id": "your-user-id",
    "document_type": "web",
    "crawl_mode": "single"
  }'

Map Site URLs

curl -X POST "https://norns.ravenhelm.dev/api/documents/crawl" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $NORNS_API_KEY" \
  -d '{
    "url": "https://example.com",
    "user_id": "your-user-id",
    "crawl_mode": "map"
  }'

Troubleshooting

Issue: "Web crawling not available"

Symptoms: Error message about Firecrawl not being installed

Cause: FIRECRAWL_API_KEY not configured

Solution:

# Add API key to environment
ssh ravenhelm@100.115.101.81 "echo 'FIRECRAWL_API_KEY=your-key' >> ~/ravenhelm/secrets/.env"

# Restart agent
ssh ravenhelm@100.115.101.81 "docker restart norns-agent"

Issue: "Crawl failed" error

Symptoms: Crawl operation fails with generic error

Solutions:

  1. Check if URL is accessible:
    curl -I "https://example.com"
  2. Verify Firecrawl API key is valid
  3. Check agent logs:
    ssh ravenhelm@100.115.101.81 "docker logs norns-agent 2>&1 | tail -50"

Issue: Empty content scraped

Symptoms: Documents created but no chunks

Cause: Site may block scraping or use JavaScript rendering

Solutions:

  1. Try only_main_content: false to capture more content
  2. Check if site requires JavaScript rendering
  3. Verify site doesn't block bots (check robots.txt)

Issue: Too many chunks created

Symptoms: Single page creates excessive chunks

Solutions:

  1. Content may be very large - this is expected
  2. Review document type selection (affects chunking)
  3. Consider using max_pages: 1 for testing

Issue: Duplicate content

Symptoms: Same content appears multiple times

Cause: The site has multiple URLs pointing to the same content

Solutions:

  1. Use URL mapping first to identify duplicates
  2. Set appropriate max_depth to limit crawl scope
  3. Manually remove duplicate documents after crawl

Best Practices

Before Crawling

  1. Test with single page first - Verify content extraction works
  2. Use URL mapping - Understand site structure before full crawl
  3. Set reasonable limits - Start with max_pages: 10-20
  4. Choose correct document type - Affects RAG retrieval quality

During Crawling

  1. Monitor progress - Watch the UI for status updates
  2. Check for errors - Review any failed pages
  3. Be patient - Large sites take time to process

After Crawling

  1. Verify in knowledge base - Check documents appear correctly
  2. Test retrieval - Ask questions about the crawled content
  3. Remove duplicates - Clean up any duplicate content

Limitations

| Limitation | Details |
| --- | --- |
| JavaScript Sites | Limited support for JS-rendered content |
| Authentication | Cannot crawl login-protected pages |
| Rate Limits | Firecrawl API has usage limits |
| Large Sites | Max 100 pages per crawl operation |
| File Types | Only extracts text content (not PDFs, images) |