Norns Web Crawling
Ingest web content into the Norns RAG system using Firecrawl.
Overview
The Web Crawling feature lets you scrape web pages and ingest them into Norns' knowledge base. Content is automatically chunked, embedded, and stored for RAG retrieval. It supports single-page scraping, full-site crawling, and URL mapping.
| Property | Value |
|---|---|
| UI Location | norns.ravenhelm.dev/documents |
| API Endpoint | POST /api/documents/crawl |
| Backend | Firecrawl API |
| Storage | PostgreSQL + pgvector |
Features
| Feature | Description |
|---|---|
| Single Page | Scrape a single URL and extract content |
| Site Crawl | Crawl an entire website following links |
| URL Mapping | Discover all URLs on a site without extracting content |
| Document Types | Categorize content (Web, Technical, Reference, General) |
| Auto-Chunking | Adaptive chunking strategy for optimal retrieval |
| Embedding | Automatic vector embedding for semantic search |
Crawl Modes
Single Page (single)
Scrapes a single URL and extracts its main content.
Best for:
- Individual articles or blog posts
- Specific documentation pages
- Landing pages
Example: Scraping a single API documentation page
Site Crawl (crawl)
Follows links from the starting URL and crawls multiple pages.
Best for:
- Documentation sites
- Knowledge bases
- Blog archives
Parameters:
- max_pages: Maximum number of pages to crawl (1-100, default: 10)
- max_depth: How many levels deep to follow links
Example: Crawling an entire documentation site
URL Mapping (map)
Discovers all URLs on a site without extracting content. Useful for reconnaissance before a full crawl.
Best for:
- Understanding site structure
- Planning which sections to crawl
- Estimating crawl scope
Returns: List of discovered URLs with count
User Interface
Access the web crawling feature from the Documents page:
https://norns.ravenhelm.dev/documents
UI Components
| Component | Description |
|---|---|
| URL Input | Enter the website URL to crawl |
| Document Type | Select category for the content |
| Crawl Mode | Choose single, crawl, or map |
| Max Pages | Set page limit for site crawls |
| Submit Button | Start the crawl operation |
| Progress | Real-time status and results |
Document Types
| Type | Use Case |
|---|---|
| Web | General web content |
| Technical | API docs, code references, technical guides |
| Reference | Manuals, specifications, standards |
| General | Miscellaneous content |
API Reference
Endpoint
POST /api/documents/crawl
Request Body
{
"url": "https://example.com/docs",
"user_id": "uuid-string",
"document_type": "technical",
"crawl_mode": "crawl",
"max_pages": 20,
"max_depth": 3,
"only_main_content": true
}
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | URL to crawl |
| user_id | UUID | Yes | - | User identifier |
| document_type | string | No | "web" | Content category |
| crawl_mode | string | No | "single" | single, crawl, or map |
| max_pages | int | No | 10 | Max pages (1-100) |
| max_depth | int | No | null | Max link depth (1-10) |
| only_main_content | bool | No | true | Extract main content only |
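Only url and user_id are required; every other field falls back to the defaults above. A minimal request body (illustrative values):
{
  "url": "https://example.com/docs",
  "user_id": "uuid-string"
}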
Response (Single/Crawl)
{
"crawl_mode": "crawl",
"url": "https://example.com/docs",
"documents_created": 15,
"total_chunks": 127,
"documents": [
{
"document_id": "uuid",
"document_name": "Getting Started",
"source_url": "https://example.com/docs/getting-started",
"chunks_created": 8
}
]
}
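When scripting against this response, a jq one-liner can summarize the result (a sketch; response.json stands in for a saved response body):
# Chunks created per document, from a saved crawl response
jq -r '.documents[] | "\(.document_name): \(.chunks_created) chunks"' response.json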
Response (Map)
{
"crawl_mode": "map",
"url": "https://example.com",
"discovered_urls": [
"https://example.com/",
"https://example.com/about",
"https://example.com/docs"
],
"url_count": 47
}
Setup
Prerequisites
- Firecrawl API Key - Required for web crawling functionality
- PostgreSQL with pgvector - For storing embeddings
- Embedding Service - For generating vector embeddings
Configuration
Add the Firecrawl API key to your environment:
# In /Users/ravenhelm/ravenhelm/secrets/.env
FIRECRAWL_API_KEY=fc-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Get Firecrawl API Key
- Visit firecrawl.dev
- Create an account
- Navigate to API Keys
- Generate a new key
- Add the key to your .env file
Free Tier:
- 500 credits/month
- Single page scrapes: 1 credit
- Site crawls: 1 credit per page
Verify Configuration
# Check if Firecrawl is configured
ssh ravenhelm@100.115.101.81 "docker exec norns-agent env | grep FIRECRAWL"
# Restart agent after configuration
ssh ravenhelm@100.115.101.81 "docker restart norns-agent"
Architecture
Data Flow
┌──────────────────────────────────────────────────────────────────┐
│ Web Crawl Flow │
├──────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Norns │───▶│ Backend │───▶│ Firecrawl API │ │
│ │ Admin UI │ │ Endpoint │ │ (Web Scraping) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Document │◀───│ Markdown Content │ │
│ │ Processor │ │ + Metadata │ │
│ └─────────────┘ └─────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Adaptive │ │
│ │ Chunking │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Embedding │───▶│ PostgreSQL │ │
│ │ Service │ │ (pgvector) │ │
│ └─────────────┘ └─────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
Database Schema
Crawled content is stored in the document_chunks table:
CREATE TABLE document_chunks (
id SERIAL PRIMARY KEY,
document_id UUID NOT NULL,
user_id UUID NOT NULL,
document_name VARCHAR(500),
document_type VARCHAR(50),
chunk_index INTEGER,
chunk_content TEXT,
chunk_embedding vector(1536),
content_type VARCHAR(50),
metadata JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
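As a sketch of how retrieval might query this table, assuming pgvector's cosine-distance operator (<=>) and a 1536-dimension query embedding bound as $1 (the exact query Norns runs may differ):
-- Top 5 chunks for a user, ranked by cosine distance to a query embedding
SELECT document_name, chunk_content,
       chunk_embedding <=> $1::vector AS distance
FROM document_chunks
WHERE user_id = $2
ORDER BY chunk_embedding <=> $1::vector
LIMIT 5;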
Metadata Fields
Each chunk stores rich metadata:
{
"source_url": "https://example.com/page",
"crawl_mode": "crawl",
"crawled_at": "2025-01-08T12:00:00Z",
"token_count": 512
}
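Because metadata is JSONB, standard PostgreSQL JSON operators apply. For example, listing the chunks crawled from one page (URL illustrative):
-- All chunks whose metadata records a given source page
SELECT document_name, chunk_index
FROM document_chunks
WHERE metadata->>'source_url' = 'https://example.com/page'
ORDER BY chunk_index;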
Usage Examples
Crawl Documentation Site
curl -X POST "https://norns.ravenhelm.dev/api/documents/crawl" \
-H "Content-Type: application/json" \
-H "X-API-Key: $NORNS_API_KEY" \
-d '{
"url": "https://docs.example.com",
"user_id": "your-user-id",
"document_type": "technical",
"crawl_mode": "crawl",
"max_pages": 50
}'
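Depth-Limited Crawl
Combine max_pages with max_depth to keep a crawl both small and shallow (values illustrative):
curl -X POST "https://norns.ravenhelm.dev/api/documents/crawl" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $NORNS_API_KEY" \
  -d '{
    "url": "https://docs.example.com",
    "user_id": "your-user-id",
    "document_type": "technical",
    "crawl_mode": "crawl",
    "max_pages": 50,
    "max_depth": 2
  }'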
Scrape Single Article
curl -X POST "https://norns.ravenhelm.dev/api/documents/crawl" \
-H "Content-Type: application/json" \
-H "X-API-Key: $NORNS_API_KEY" \
-d '{
"url": "https://blog.example.com/article",
"user_id": "your-user-id",
"document_type": "web",
"crawl_mode": "single"
}'
Map Site URLs
curl -X POST "https://norns.ravenhelm.dev/api/documents/crawl" \
-H "Content-Type: application/json" \
-H "X-API-Key: $NORNS_API_KEY" \
-d '{
"url": "https://example.com",
"user_id": "your-user-id",
"crawl_mode": "map"
}'
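The map response pipes cleanly into jq when planning a crawl (a sketch):
curl -s -X POST "https://norns.ravenhelm.dev/api/documents/crawl" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $NORNS_API_KEY" \
  -d '{"url": "https://example.com", "user_id": "your-user-id", "crawl_mode": "map"}' \
  | jq -r '.discovered_urls[]'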
Troubleshooting
Issue: "Web crawling not available"
Symptoms: Error message about Firecrawl not being installed
Cause: FIRECRAWL_API_KEY not configured
Solution:
# Add API key to environment
ssh ravenhelm@100.115.101.81 "echo 'FIRECRAWL_API_KEY=your-key' >> ~/ravenhelm/secrets/.env"
# Restart agent
ssh ravenhelm@100.115.101.81 "docker restart norns-agent"
Issue: "Crawl failed" error
Symptoms: Crawl operation fails with generic error
Solutions:
- Check if URL is accessible: curl -I "https://example.com"
- Verify Firecrawl API key is valid
- Check agent logs:
ssh ravenhelm@100.115.101.81 "docker logs norns-agent 2>&1 | tail -50"
Issue: Empty content scraped
Symptoms: Documents created but no chunks
Cause: Site may block scraping or use JavaScript rendering
Solutions:
- Try only_main_content: false to capture more content (see the example after this list)
- Check if site requires JavaScript rendering
- Verify site doesn't block bots (check robots.txt)
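For example, re-scraping a page with main-content extraction disabled (other values illustrative):
curl -X POST "https://norns.ravenhelm.dev/api/documents/crawl" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $NORNS_API_KEY" \
  -d '{
    "url": "https://example.com/article",
    "user_id": "your-user-id",
    "crawl_mode": "single",
    "only_main_content": false
  }'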
Issue: Too many chunks created
Symptoms: Single page creates excessive chunks
Solutions:
- Content may be very large - this is expected
- Review document type selection (affects chunking)
- Consider using max_pages: 1 for testing
Issue: Duplicate content
Symptoms: Same content appears multiple times
Cause: Site has multiple URLs pointing to same content
Solutions:
- Use URL mapping first to identify duplicates
- Set appropriate max_depth to limit crawl scope
- Manually remove duplicate documents after crawl (a query sketch for spotting candidates follows)
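One heuristic for spotting candidates, assuming duplicate pages were stored under the same document name (a sketch, not a guaranteed match):
-- Document names that map to more than one document_id
SELECT document_name, COUNT(DISTINCT document_id) AS copies
FROM document_chunks
GROUP BY document_name
HAVING COUNT(DISTINCT document_id) > 1;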
Best Practices
Before Crawling
- Test with single page first - Verify content extraction works
- Use URL mapping - Understand site structure before full crawl
- Set reasonable limits - Start with max_pages: 10-20 (a scripted sketch of these steps follows this list)
- Choose correct document type - Affects RAG retrieval quality
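Scripted, the first three steps might look like this (a sketch reusing the endpoint and illustrative values from Usage Examples):
API="https://norns.ravenhelm.dev/api/documents/crawl"
# 1. Map the site to gauge scope
curl -s -X POST "$API" -H "Content-Type: application/json" -H "X-API-Key: $NORNS_API_KEY" \
  -d '{"url": "https://docs.example.com", "user_id": "your-user-id", "crawl_mode": "map"}' | jq '.url_count'
# 2. Verify extraction on a single page
curl -s -X POST "$API" -H "Content-Type: application/json" -H "X-API-Key: $NORNS_API_KEY" \
  -d '{"url": "https://docs.example.com/intro", "user_id": "your-user-id", "crawl_mode": "single"}'
# 3. Run a limited crawl
curl -s -X POST "$API" -H "Content-Type: application/json" -H "X-API-Key: $NORNS_API_KEY" \
  -d '{"url": "https://docs.example.com", "user_id": "your-user-id", "crawl_mode": "crawl", "max_pages": 10}'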
During Crawling
- Monitor progress - Watch the UI for status updates
- Check for errors - Review any failed pages
- Be patient - Large sites take time to process
After Crawling
- Verify in knowledge base - Check documents appear correctly
- Test retrieval - Ask questions about the crawled content
- Remove duplicates - Clean up any duplicate content
Limitations
| Limitation | Details |
|---|---|
| JavaScript Sites | Limited support for JS-rendered content |
| Authentication | Cannot crawl login-protected pages |
| Rate Limits | Firecrawl API has usage limits |
| Large Sites | Max 100 pages per crawl operation |
| File Types | Only extracts text content (not PDFs, images) |
Related Documentation
- Norns Admin - Web interface overview
- Norns Memory System - How RAG retrieval works
- Norns Knowledge Domain - Knowledge domain agents