URL Ingestion
URL ingestion lets you import content directly from web pages without downloading files. Simply provide a URL, and Airmailer fetches, processes, and stores the content automatically.
How It Works
Enter URL → Airmailer fetches page → Content extracted → Stored as document
- Fetch: Airmailer retrieves the webpage content
- Parse: HTML is analyzed to identify main content
- Clean: Navigation, headers, and footers removed
- Convert: Content converted to Markdown
- Store: Document saved to your knowledge base
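The parse, clean, and convert steps above can be illustrated with a small sketch. This is not Airmailer's actual extractor, whose rules are not documented here; the `MarkdownExtractor` class, the `html_to_markdown` function, and the set of skipped tags are all assumptions chosen for illustration, using only Python's standard library.

```python
from html.parser import HTMLParser

# Elements treated as page chrome and dropped. This set is an assumption
# for the sketch, not the product's real rule set.
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}


class MarkdownExtractor(HTMLParser):
    """Collects main-content text and emits a rough Markdown rendering."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.blocks = []     # finished Markdown blocks
        self.current = []    # text fragments of the block being built
        self.prefix = ""     # Markdown prefix for the current block

    def _flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.current = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            if tag in HEADING_PREFIX:
                self._flush()
                self.prefix = HEADING_PREFIX[tag]
            elif tag == "li":
                self._flush()
                self.prefix = "- "
            elif tag == "p":
                self._flush()
            elif tag in ("b", "strong"):
                self.current.append("**")
            elif tag in ("i", "em"):
                self.current.append("*")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            if tag in HEADING_PREFIX or tag in ("p", "li"):
                self._flush()
            elif tag in ("b", "strong"):
                self.current.append("**")
            elif tag in ("i", "em"):
                self.current.append("*")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any text left over after the last tag
    return "\n\n".join(parser.blocks)
```

Feeding this a page with a `<nav>` block, an `<h1>`, a paragraph, and a `<footer>` yields only the heading and paragraph as Markdown. A production extractor would also need to handle links, tables, and pages that do not use semantic tags.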
Using URL Ingestion
1. Navigate to Documents in the sidebar
2. Click Import from URL
3. Enter the webpage URL
4. Provide a title for the document
5. Select a document type
6. Click Import
URL Requirements
Supported URLs
- Public webpages (no login required)
- HTTPS URLs (HTTP URLs are automatically upgraded)
- Pages with up to 2 MB of content
Unsupported URLs
- Password-protected pages
- Content behind paywalls
- Dynamic/JavaScript-only content
- Private IP addresses
- Localhost URLs
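If you want to screen URLs before submitting them, the rules above can be approximated locally. The following is a minimal sketch under stated assumptions: `check_url` is a hypothetical helper, not part of Airmailer, and it only inspects the URL text. It cannot tell whether a page needs a login, sits behind a paywall, or relies on JavaScript, and it does not resolve DNS names, so a hostname that points at a private address would pass.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlsplit


def check_url(url: str) -> Optional[str]:
    """Return a reason the URL looks unsupported, or None if it passes."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return "only http and https URLs can be fetched"
    host = parts.hostname
    if not host:
        return "URL has no hostname"
    if host == "localhost" or host.endswith(".localhost"):
        return "localhost URLs are not supported"
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so it is a DNS name; resolving it is
        # outside the scope of this sketch.
        return None
    if address.is_private or address.is_loopback or address.is_link_local:
        return "private IP addresses are not supported"
    return None
```

For example, `check_url("https://yoursite.com/faq")` returns `None`, while `check_url("https://192.168.1.10/admin")` returns the private-address message.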
Content Extraction
What's Captured
- Main article/page content
- Headings and structure
- Text formatting (bold, italic)
- Lists and tables
- Inline links
What's Removed
- Navigation menus
- Site headers and footers
- Sidebar content
- Cookie notices
- Advertisement blocks
- Social sharing buttons
- Comments sections
Security Features
Airmailer includes security measures for URL ingestion:
| Protection | Description |
|------------|-------------|
| HTTPS Required | All URLs upgraded to HTTPS |
| Private IP Blocked | Cannot fetch from internal networks |
| Size Limit | Maximum 2 MB content |
| Timeout | 10-second fetch timeout |
| Redirect Limit | Maximum 3 redirects followed |
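To make the HTTPS upgrade, size limit, and timeout concrete, here is a hedged sketch of how a fetcher could apply them. The function names and constants are assumptions for illustration, and 2 MB is taken to mean 2 × 1024 × 1024 bytes, which the source does not specify. The redirect limit is not enforced in this sketch; `urlopen` follows redirects using its own defaults.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen

MAX_CONTENT_BYTES = 2 * 1024 * 1024  # assumed meaning of "2 MB"
FETCH_TIMEOUT_SECONDS = 10


def upgrade_to_https(url: str) -> str:
    """Rewrite an http:// URL to https://; leave other URLs unchanged."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


def read_capped(stream, limit: int = MAX_CONTENT_BYTES) -> bytes:
    """Read a response body, raising if it exceeds the size limit."""
    body = stream.read(limit + 1)  # one extra byte reveals an oversized body
    if len(body) > limit:
        raise ValueError("Content too large")
    return body


def fetch(url: str) -> bytes:
    """Fetch a page with the timeout and size cap applied."""
    with urlopen(upgrade_to_https(url), timeout=FETCH_TIMEOUT_SECONDS) as response:
        return read_capped(response)
```

Reading one byte past the limit lets the function detect an oversized body without loading the whole response into memory.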
Best Practices
Choose the Right Pages
Good candidates for import:
- FAQ pages
- Policy pages (returns, privacy, terms)
- Product description pages
- Help center articles
- Blog posts with evergreen content
Avoid importing:
- Pages with mostly images
- Pages whose content changes frequently
- Pages with minimal text
- User-generated content
Verify After Import
After importing, review the document:
- Check content was extracted correctly
- Verify formatting looks right
- Edit title if needed
- Confirm document type is appropriate
Handling Import Issues
Page Not Loading
Possible causes:
- Page requires authentication
- Server blocking automated requests
- Page doesn't exist (404)
- Network timeout
Solution: Try downloading the page manually and uploading it as an HTML file.
Content Missing
If the imported content is incomplete:
- The page may use JavaScript rendering
- Content might be in an iframe
- The main content detection may have missed areas
Solution: Export the page to HTML and upload manually.
Wrong Content Extracted
If navigation or sidebar content appears:
- The page structure may be non-standard
- Content detection found the wrong area
Solution: Edit the document to remove unwanted content, or re-import from a cleaner source.
Common Use Cases
Import Your FAQ Page
URL: https://yoursite.com/faq
Title: "Frequently Asked Questions"
Type: FAQ
Import Return Policy
URL: https://yoursite.com/returns
Title: "Return and Refund Policy"
Type: Returns
Import Product Info
URL: https://yoursite.com/products/widget
Title: "Widget Product Information"
Type: Other
Refreshing Content
If the source webpage changes:
- Delete the existing document, then re-import from the same URL
- Or manually update the document content
Airmailer doesn't automatically sync with source URLs; imported content is a snapshot taken at import time.
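Because imports are snapshots, you may want your own way to notice when a source page has changed. One possible approach, sketched here with hypothetical helper names, is to keep a hash of the text you imported and compare it against the page's current text. How you obtain the current text is up to you; nothing here is an Airmailer feature.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace, so reflowed lines
    do not count as changes."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def source_changed(stored_fingerprint: str, current_text: str) -> bool:
    """Return True when the current text no longer matches the stored hash."""
    return fingerprint(current_text) != stored_fingerprint
```

Store the fingerprint when you import, then call `source_changed` with freshly copied page text to decide whether a re-import is worthwhile.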
Limits
| Limit | Value |
|-------|-------|
| Content size | 2 MB maximum |
| Fetch timeout | 10 seconds |
| Max redirects | 3 |
| Rate limit | No specific limit |
Troubleshooting
"Failed to fetch URL"
- Verify the URL is correct and accessible
- Check if the page requires login
- Try accessing the URL in an incognito browser window
"Content too large"
- The page exceeds 2 MB
- Try importing a more focused sub-page
- Download and manually trim the HTML
"Timeout error"
- The server took too long to respond
- Try again later
- Download the page manually instead
"No content found"
- The page may be mostly JavaScript-rendered
- The content structure may not have been recognized
- Use manual HTML upload instead
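A quick way to tell whether a page is mostly JavaScript-rendered is to measure how much visible text its raw HTML contains. The sketch below is a rough heuristic, not Airmailer's detection logic; the helper names and the 200-character threshold are assumptions you would tune for your own pages.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text that is not inside <script> or <style> elements."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth == 0:
            self.chunks.append(data)


def visible_text_length(html: str) -> int:
    """Count visible characters after collapsing whitespace."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join("".join(parser.chunks).split()))


def looks_js_rendered(html: str, min_chars: int = 200) -> bool:
    """Guess that a page is JavaScript-rendered when its raw HTML
    holds very little visible text. The threshold is arbitrary."""
    return visible_text_length(html) < min_chars
```

If `looks_js_rendered` returns `True` for the HTML you get from "View Source", saving the fully rendered page from your browser and uploading that HTML is more likely to succeed.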