# Web Scraper

Extract content from any webpage automatically.
## What does this node do?

The Web Scraper node fetches and extracts content from any webpage. It's one of the most commonly used nodes for gathering data from websites, whether for content analysis, data collection, or research.
Common uses:
- Extract article content for AI analysis
- Gather product information
- Collect competitor data
- Build content datasets
## Quick setup

1. **Add the Web Scraper node.** Find it in Integrations → Web Scraper.
2. **Enter or connect a URL.** Provide the webpage URL to scrape.
3. **Choose a content template (optional).** Select a template for structured extraction.
4. **Run and get content.** Execute the node to receive the extracted content.
## Configuration

### Required fields
**`url`** (string, required): The webpage URL to scrape.

Examples:

- Static: `https://example.com/article`
- Dynamic: `{{Text_0.value}}` (from input)
- From loop: `{{Loop_0.currentItem.url}}`
### Optional fields

**`content_type`** (string, default: `No Template`): Pre-built extraction template for common page types.
| Template | Extracts |
|---|---|
| No Template | Raw page content |
| Article | Title, author, date, body, images |
| ArticleList | List of article links with titles |
| Product | Name, price, description, specs |
| ProductList | List of products with details |
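
These templates are built into the node, so you never implement them yourself. As a rough mental model only, the Article template behaves something like the lxml sketch below; the meta-tag fallbacks and field names are illustrative assumptions, not the node's actual logic.

```python
# Rough sketch of Article-style extraction, assuming common page
# conventions (og: meta tags, an <article> element). Illustrative only;
# the node's real template logic is internal and may differ.
import requests
from lxml import html

def extract_article(url):
    page = html.fromstring(requests.get(url, timeout=10).content)

    def first(xpath):
        matches = page.xpath(xpath)
        return matches[0].strip() if matches else None

    return {
        "title": first("//meta[@property='og:title']/@content") or first("//h1/text()"),
        "author": first("//meta[@name='author']/@content"),
        "publishDate": first("//meta[@property='article:published_time']/@content"),
        "body": " ".join(t.strip() for t in page.xpath("//article//p/text()")),
        "images": page.xpath("//article//img/@src"),
    }
```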
**`xpath_1`** (string): Custom XPath expression for targeted extraction.

Examples:

- Main content: `//article` or `//div[@class='content']`
- All paragraphs: `//p`
- Specific element: `//div[@id='main-text']`
**`xpath_2`** (string): Second XPath for additional extraction.

**`xpath_3`** (string): Third XPath for additional extraction.
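
If you want to sanity-check an XPath expression before pasting it into one of these fields, a short local script helps. A minimal sketch using Python's lxml, with three placeholder expressions standing in for `xpath_1` through `xpath_3`:

```python
# Sketch: test candidate XPath expressions locally before using them
# in xpath_1 / xpath_2 / xpath_3. Assumes requests and lxml are installed.
import requests
from lxml import html

url = "https://example.com/article"  # placeholder URL
expressions = ["//article", "//p", "//div[@id='main-text']"]

page = html.fromstring(requests.get(url, timeout=10).content)
for i, xp in enumerate(expressions, start=1):
    nodes = page.xpath(xp)
    print(f"xpath_{i} ({xp}): {len(nodes)} match(es)")
```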
## Output

The node returns the extracted content:

```json
{
  "url": "https://example.com/article",
  "title": "Article Title",
  "metaDescription": "Meta description text...",
  "content": "Main article content...",
  "h1": ["Main Heading"],
  "h2": ["Subheading 1", "Subheading 2"],
  "h3": ["Section 1", "Section 2"],
  "images": [
    {"src": "image1.jpg", "alt": "Description"}
  ],
  "links": [
    {"href": "https://...", "text": "Link text"}
  ],
  "wordCount": 1500,
  "html": "<div>Raw HTML...</div>"
}
```
### Accessing output

- `{{WebScraper_0.content}}` → Main text content
- `{{WebScraper_0.title}}` → Page title
- `{{WebScraper_0.metaDescription}}` → Meta description
- `{{WebScraper_0.wordCount}}` → Word count
- `{{WebScraper_0.h1}}` → Array of H1 headings
- `{{WebScraper_0.html}}` → Raw HTML
## Examples

### Basic content extraction

URL: `https://blog.example.com/seo-tips`

Output:

```json
{
  "title": "10 SEO Tips for 2024",
  "content": "Search engine optimization continues to evolve...",
  "wordCount": 2500,
  "h2": ["Tip 1: Focus on E-E-A-T", "Tip 2: Optimize Core Web Vitals", ...]
}
```
### Article template

Content Type: `Article`

Enhanced output:

```json
{
  "title": "10 SEO Tips for 2024",
  "author": "John Smith",
  "publishDate": "2024-01-15",
  "content": "...",
  "categories": ["SEO", "Digital Marketing"],
  "estimatedReadTime": "8 min"
}
```
### Custom XPath extraction

XPath 1: `//div[@class='pricing']//span[@class='price']`

Extracts: all price elements from the pricing section.
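
You can verify an expression like this locally before using it in the node. A self-contained lxml sketch against illustrative HTML:

```python
# Sketch: the pricing XPath above, run against sample markup.
from lxml import html

doc = html.fromstring("""
<div class="pricing">
  <div><span class="price">$9</span></div>
  <div><span class="price">$29</span></div>
</div>
""")
prices = doc.xpath("//div[@class='pricing']//span[@class='price']/text()")
print(prices)  # ['$9', '$29']
```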
## Common patterns

### Scrape and analyze

```mermaid
graph LR
    A[URL] --> B[Web Scraper]
    B --> C[HTML to Markdown]
    C --> D[LLM Analysis]
```
### Batch scraping

```mermaid
graph LR
    A[URL List] --> B[Loop]
    B --> C[Web Scraper]
    C --> D[Save to Sheets]
```
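
Outside the workflow, the same pattern is a loop with a polite delay between fetches. A Python sketch, with a local CSV file standing in for the Save to Sheets step (URLs and field choices are illustrative):

```python
# Sketch of the batch pattern: loop over URLs, scrape each page,
# collect rows. CSV stands in for the Sheets step.
import csv
import time
import requests
from lxml import html

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder list

with open("scraped.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title", "word_count"])
    for url in urls:
        page = html.fromstring(requests.get(url, timeout=10).content)
        title = (page.xpath("//title/text()") or [""])[0].strip()
        writer.writerow([url, title, len(page.text_content().split())])
        time.sleep(2)  # polite delay between requests
```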
### Content comparison

```mermaid
graph LR
    A[Our Page] --> C[LLM Compare]
    B[Competitor Page] --> C
    C --> D[Report]
```
## XPath reference

### Common selectors

| Goal | XPath |
|---|---|
| All paragraphs | `//p` |
| All links | `//a` |
| By class | `//div[@class='content']` |
| By ID | `//div[@id='main']` |
| Contains class | `//div[contains(@class, 'article')]` |
| By tag + class | `//article[@class='post']` |
| Nested | `//div[@class='content']//p` |
### Extracting specific content

| Goal | XPath |
|---|---|
| Article body | `//article` or `//main` |
| Navigation | `//nav` |
| Header | `//header` |
| Footer | `//footer` |
| All images | `//img/@src` |
| Link URLs | `//a/@href` |
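
A quick way to build intuition for these selectors is to run them against a tiny sample document. Note that attribute XPaths such as `//img/@src` return plain strings rather than elements. A minimal lxml sketch:

```python
# Sketch: a few of the selectors above against sample markup.
from lxml import html

doc = html.fromstring("""
<main>
  <article class="post">
    <p>First paragraph.</p>
    <img src="hero.jpg" alt="Hero">
    <a href="https://example.com">Read more</a>
  </article>
</main>
""")
print(doc.xpath("//p/text()"))    # ['First paragraph.']
print(doc.xpath("//img/@src"))    # ['hero.jpg']
print(doc.xpath("//a/@href"))     # ['https://example.com']
print(doc.xpath("//article[@class='post']//p/text()"))  # nested selector
```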
## Best practices

### Respect websites

Always respect website terms of service and robots.txt, and add delays between requests when scraping multiple pages (see the sketch after this list).

- Check robots.txt before scraping
- Add 2-3 second delays between requests
- Don't overload servers with rapid requests
- Identify your scraper with a proper user-agent
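
Here is that checklist as a minimal Python sketch, using only the standard library's urllib.robotparser; the user-agent string and URLs are placeholders:

```python
# Sketch: check robots.txt and throttle requests before scraping.
import time
from urllib import robotparser

AGENT = "MyScraperBot/1.0 (contact@example.com)"  # identify your scraper

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for url in ["https://example.com/page-1", "https://example.com/page-2"]:
    if not rp.can_fetch(AGENT, url):
        print(f"Skipping disallowed URL: {url}")
        continue
    # ... fetch and process the page here ...
    time.sleep(2)  # 2-3 second delay between requests
```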
### Handle errors gracefully

```mermaid
graph LR
    A[Scrape] --> B{Success?}
    B -->|Yes| C[Process]
    B -->|No| D[Log Error]
    D --> E[Continue/Retry]
```
Use Conditional nodes to check for errors:

```
If {{WebScraper_0.content}} is_empty
  → Log "Failed to scrape URL"
  → Skip to next
Else
  → Continue processing
```
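
Expressed as plain code, the same check-and-retry logic might look like the sketch below; `scrape()` is a stand-in for the Web Scraper call, not a real API:

```python
# Sketch of the is_empty check with a simple retry, in plain Python.
import time

def scrape(url):
    """Stand-in for the Web Scraper node; returns extracted text or ''."""
    ...

def scrape_with_retry(url, attempts=3):
    for attempt in range(1, attempts + 1):
        content = scrape(url)
        if content:                    # the is_empty check from above
            return content
        print(f"Failed to scrape {url} (attempt {attempt})")
        time.sleep(2 ** attempt)       # back off before retrying
    return None                        # caller logs and skips to the next URL
```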
### Optimize for AI processing

Convert HTML to Markdown before sending content to an LLM:

```mermaid
graph LR
    A[Web Scraper] --> B[HTML to Markdown]
    B --> C[LLM]
```

AI models work better with clean Markdown than with raw HTML.
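
To replicate this step outside the workflow, the third-party markdownify package gives a reasonable HTML-to-Markdown conversion. A sketch assuming `pip install markdownify`; the prompt wording is illustrative:

```python
# Sketch: scrape, convert HTML to Markdown, then build an LLM prompt.
# Assumes the third-party markdownify package (pip install markdownify).
import requests
from markdownify import markdownify as md

html_text = requests.get("https://example.com/article", timeout=10).text
markdown = md(html_text)

prompt = f"Summarize the key points of this article:\n\n{markdown}"
# ... send `prompt` to your LLM of choice ...
```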
## Common issues

### Content is empty

- The page may require JavaScript (not supported)
- Check that the URL is correct and accessible
- Try different XPath selectors
- The page may block scrapers

### Getting wrong content

- Use a specific XPath to target the correct element
- Try the Article template for blog posts
- Check for multiple matching elements

### Request blocked or 403 error

- The site may block automated requests
- Try adding delays between requests
- Check robots.txt for restrictions

### Content is garbled

- The page may use an unusual encoding
- Try the HTML Cleaner node after scraping
- Use HTML to Markdown for clean text
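
When debugging encoding problems outside the node, one common culprit is a server that omits the charset header, which makes many HTTP clients fall back to ISO-8859-1. A sketch of letting Python's requests library re-detect the encoding from the response body:

```python
# Sketch: fix garbled text by re-detecting the response encoding.
import requests

resp = requests.get("https://example.com/page", timeout=10)
if resp.encoding and resp.encoding.lower() == "iso-8859-1":
    # No charset header; let requests guess from the body instead.
    resp.encoding = resp.apparent_encoding
print(resp.text[:200])
```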