## What does this node do?
The Web Scraper node fetches and extracts content from any webpage. It's one of the most used nodes for gathering data from websites, whether for content analysis, data collection, or research. Common uses:

- Extract article content for AI analysis
- Gather product information
- Collect competitor data
- Build content datasets
## Quick setup

1. **Add the Web Scraper node.** Find it under Integrations → Web Scraper.
2. **Enter or connect a URL.** Provide the webpage URL to scrape.
3. **Choose a content template (optional).** Select a template for structured extraction.
4. **Run and get content.** Execute the node to receive the extracted content.
## Configuration

### Required fields

**URL**: the webpage URL to scrape. Examples:

- Static: `https://example.com/article`
- Dynamic: `{{Text_0.value}}` (from an input node)
- From a loop: `{{Loop_0.currentItem.url}}`

### Optional fields

**Content template**: a pre-built extraction template for common page types.
| Template | Extracts |
|---|---|
| No Template | Raw page content |
| Article | Title, author, date, body, images |
| ArticleList | List of article links with titles |
| Product | Name, price, description, specs |
| ProductList | List of products with details |
**XPath 1**: a custom XPath expression for targeted extraction. Examples:

- Main content: `//article` or `//div[@class='content']`
- All paragraphs: `//p`
- Specific element: `//div[@id='main-text']`

**XPath 2**: a second XPath for additional extraction.

**XPath 3**: a third XPath for additional extraction.
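If you're unsure what an expression will match, you can preview it outside the workflow. Here is a minimal Python sketch using the `requests` and `lxml` packages (a way to test selectors, not a description of the node's internals; the URL is illustrative):

```python
import requests
from lxml import html

# Illustrative URL; substitute the page you plan to scrape.
url = "https://example.com/article"
tree = html.fromstring(requests.get(url, timeout=10).content)

# The same expressions as the examples above.
paragraphs = tree.xpath("//p")                # all <p> elements
main = tree.xpath("//div[@id='main-text']")   # one specific element
content = tree.xpath("//article") or tree.xpath("//div[@class='content']")

print(f"{len(paragraphs)} paragraphs matched")
for p in paragraphs[:3]:
    print(p.text_content().strip()[:80])
```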
## Output

The node returns the extracted content.

### Accessing output
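Downstream nodes reference the scraped content with the same template syntax used elsewhere in this guide. Assuming the node is named `WebScraper_0` and exposes its result as `output` (an assumption; check your node's output panel for the exact property name):

```
{{WebScraper_0.output}}
```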
## Examples

### Basic content extraction

URL: `https://blog.example.com/seo-tips`

Output: the raw text content of the page (no template applied).
### Article template

Content Type: Article

With the Article template, the output is structured: the title, author, date, body, and images are returned as separate fields.

### Custom XPath extraction
XPath 1: `//div[@class='pricing']//span[@class='price']`

Extracts all price elements from the pricing section, as the toy example below shows.
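To see why this expression returns a list, here is the same XPath run against a made-up HTML fragment with Python's `lxml` (illustrative markup only):

```python
from lxml import html

page = """
<div class="pricing">
  <span class="price">$9.99</span>
  <span class="price">$19.99</span>
</div>
"""
tree = html.fromstring(page)

# text() selects the text inside each matched <span>.
print(tree.xpath("//div[@class='pricing']//span[@class='price']/text()"))
# ['$9.99', '$19.99']
```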
## Common patterns

### Scrape and analyze

Chain the Web Scraper into an LLM node to summarize or analyze the extracted content.

### Batch scraping

Put the Web Scraper inside a Loop node and pass `{{Loop_0.currentItem.url}}` as the URL to scrape a list of pages.
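As a plain-Python sketch of the same batch pattern (illustrative only; inside the platform the Loop node handles the iteration for you):

```python
import time
import requests

# Hypothetical list of pages to scrape.
urls = [
    "https://example.com/articles/1",
    "https://example.com/articles/2",
]

results = {}
for url in urls:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()   # fail loudly on 4xx/5xx
    results[url] = resp.text
    time.sleep(2)             # polite delay between requests
```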
### Content comparison

Scrape two pages with separate Web Scraper nodes and feed both outputs to a downstream node to compare them.
## XPath reference

### Common selectors

| Goal | XPath |
|---|---|
| All paragraphs | `//p` |
| All links | `//a` |
| By class | `//div[@class='content']` |
| By ID | `//div[@id='main']` |
| Contains class | `//div[contains(@class, 'article')]` |
| By tag + class | `//article[@class='post']` |
| Nested | `//div[@class='content']//p` |
### Extracting specific content

| Goal | XPath |
|---|---|
| Article body | `//article` or `//main` |
| Navigation | `//nav` |
| Header | `//header` |
| Footer | `//footer` |
| All images | `//img/@src` |
| Link URLs | `//a/@href` |
## Best practices

### Respect websites

- Check robots.txt before scraping (see the sketch after this list)
- Add 2-3 second delays between requests
- Don't overload servers with rapid requests
- Identify your scraper with a proper user-agent
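For context, this is what the robots.txt check and user-agent identification look like in plain Python (a sketch of the etiquette, not the node's internals; the bot name and URLs are made up):

```python
import requests
from urllib.robotparser import RobotFileParser

UA = "MyScraperBot/1.0 (+https://example.com/bot-info)"  # identify yourself

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/article"
if rp.can_fetch(UA, url):
    resp = requests.get(url, headers={"User-Agent": UA}, timeout=10)
else:
    print("robots.txt disallows this URL; skip it")
```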
### Handle errors gracefully

Use Conditional nodes to check for errors before passing scraped content downstream; the sketch below shows the kind of check a Conditional node would perform.
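Conceptually, the check looks like this (a Python sketch with illustrative names; in the workflow you express the same condition in the Conditional node's settings):

```python
def is_usable(scraped: str) -> bool:
    """Proceed only if the scrape produced real content."""
    if not scraped or not scraped.strip():
        return False                # empty: JS-rendered or blocked page
    if "403 Forbidden" in scraped:
        return False                # the scraper received an error page
    return True
```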
### Optimize for AI processing

Convert HTML to Markdown before sending content to an LLM: AI models work better with clean Markdown than with raw HTML.
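Outside the workflow, a library such as `html2text` (a third-party Python package) performs the same conversion:

```python
import html2text  # pip install html2text

raw = "<article><h1>SEO Tips</h1><p>Write for <b>humans</b> first.</p></article>"
markdown = html2text.html2text(raw)
print(markdown)
# # SEO Tips
#
# Write for **humans** first.
```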
## Common issues

### Content is empty
- Page may require JavaScript (not supported)
- Check if URL is correct and accessible
- Try different XPath selectors
- Page may block scrapers
### Getting wrong content
- Use a specific XPath to target the correct element
- Try Article template for blog posts
- Check for multiple matching elements
### Request blocked or 403 error
- Site may block automated requests
- Try adding delays between requests
- Check robots.txt for restrictions
### Content is garbled
- Page may have unusual encoding (see the sketch after this list)
- Try the HTML Cleaner node after scraping
- Use HTML to Markdown conversion for clean text
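A plain-Python illustration of the encoding fix, using `requests`, whose `apparent_encoding` sniffs the charset from the response bytes (the URL is illustrative):

```python
import requests

resp = requests.get("https://example.com/page", timeout=10)

# requests trusts the Content-Type header; if that's wrong, the text is garbled.
if resp.encoding != resp.apparent_encoding:
    resp.encoding = resp.apparent_encoding  # re-decode with the sniffed charset
clean_text = resp.text
```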

