What does this node do?

The Web Scraper node fetches and extracts content from any webpage. It’s one of the most used nodes for gathering data from websites, whether for content analysis, data collection, or research. Common uses:
  • Extract article content for AI analysis
  • Gather product information
  • Collect competitor data
  • Build content datasets

Quick setup

1. Add the Web Scraper node
   Find it under Integrations → Web Scraper.

2. Enter or connect a URL
   Provide the webpage URL to scrape.

3. Choose a content template (optional)
   Select a template for structured extraction.

4. Run and get content
   Execute the node to receive the extracted content.

Configuration

Required fields

url
string
required
The webpage URL to scrape. Examples:
  • Static: https://example.com/article
  • Dynamic: {{Text_0.value}} (from input)
  • From loop: {{Loop_0.currentItem.url}}

Optional fields

content_type
string
Default: "No Template"
Pre-built extraction template for common page types.

Template        Extracts
No Template     Raw page content
Article         Title, author, date, body, images
Article List    List of article links with titles
Product         Name, price, description, specs
Product List    List of products with details
xpath_1
string
Custom XPath expression for targeted extraction. Examples:
  • Main content: //article or //div[@class='content']
  • All paragraphs: //p
  • Specific element: //div[@id='main-text']
xpath_2
string
Second XPath for additional extraction.
xpath_3
string
Third XPath for additional extraction.
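The XPath fields above can be approximated in plain Python. This is an illustrative sketch only: it uses the stdlib's xml.etree.ElementTree, which supports a restricted XPath subset (the node itself presumably evaluates full XPath against real HTML), and the sample HTML is invented.

```python
# Sketch of xpath_1-style extraction with the stdlib's limited XPath
# subset. Sample HTML is made up for illustration.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div class="content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
  <div class="sidebar"><p>Ignore me.</p></div>
</body></html>
"""

root = ET.fromstring(html)

# Equivalent of //div[@class='content']//p, done in two steps because
# ElementTree's XPath support is partial.
content_div = root.find(".//div[@class='content']")
paragraphs = [p.text for p in content_div.findall(".//p")]
# paragraphs → ["First paragraph.", "Second paragraph."]
```

A full scraper would use an HTML-tolerant parser with complete XPath 1.0 support (e.g. lxml); ElementTree requires well-formed markup.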

Output

The node returns extracted content:
{
  "url": "https://example.com/article",
  "title": "Article Title",
  "metaDescription": "Meta description text...",
  "content": "Main article content...",
  "h1": ["Main Heading"],
  "h2": ["Subheading 1", "Subheading 2"],
  "h3": ["Section 1", "Section 2"],
  "images": [
    {"src": "image1.jpg", "alt": "Description"}
  ],
  "links": [
    {"href": "https://...", "text": "Link text"}
  ],
  "wordCount": 1500,
  "html": "<div>Raw HTML...</div>"
}

Accessing output

{{WebScraper_0.content}}           → Main text content
{{WebScraper_0.title}}             → Page title
{{WebScraper_0.metaDescription}}   → Meta description
{{WebScraper_0.wordCount}}         → Word count
{{WebScraper_0.h1}}                → Array of H1 headings
{{WebScraper_0.html}}              → Raw HTML

Examples

Basic content extraction

URL: https://blog.example.com/seo-tips

Output:
{
  "title": "10 SEO Tips for 2024",
  "content": "Search engine optimization continues to evolve...",
  "wordCount": 2500,
  "h2": ["Tip 1: Focus on E-E-A-T", "Tip 2: Optimize Core Web Vitals", ...]
}

Article template

Content Type: Article

Enhanced output:
{
  "title": "10 SEO Tips for 2024",
  "author": "John Smith",
  "publishDate": "2024-01-15",
  "content": "...",
  "categories": ["SEO", "Digital Marketing"],
  "estimatedReadTime": "8 min"
}

Custom XPath extraction

XPath 1: //div[@class='pricing']//span[@class='price']

Extracts: All price elements from the pricing section

Common patterns

  • Scrape and analyze: pass {{WebScraper_0.content}} to an AI node for summarization or analysis
  • Batch scraping: loop over a list of URLs with {{Loop_0.currentItem.url}}
  • Content comparison: scrape several pages and compare the extracted content

XPath reference

Common selectors

Goal             XPath
All paragraphs   //p
All links        //a
By class         //div[@class='content']
By ID            //div[@id='main']
Contains class   //div[contains(@class, 'article')]
By tag + class   //article[@class='post']
Nested           //div[@class='content']//p

Extracting specific content

Goal           XPath
Article body   //article or //main
Navigation     //nav
Header         //header
Footer         //footer
All images     //img/@src
Link URLs      //a/@href
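Attribute selectors such as //img/@src and //a/@href return attribute values rather than elements. A hedged stdlib sketch of the equivalent: ElementTree cannot select attributes inside a path expression, so the idiom is to match the elements and read .get() on each. The HTML sample is invented.

```python
# Equivalent of //img/@src and //a/@href using stdlib ElementTree,
# which matches elements and then reads their attributes.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <a href="https://example.com/about">About</a>
  <img src="logo.png" alt="Logo"/>
  <img src="hero.jpg" alt="Hero"/>
</body></html>
"""
root = ET.fromstring(html)

image_sources = [img.get("src") for img in root.findall(".//img")]
link_urls = [a.get("href") for a in root.findall(".//a")]
# image_sources → ["logo.png", "hero.jpg"]
# link_urls → ["https://example.com/about"]
```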

Best practices

Respect websites

Always respect website terms of service and robots.txt. Add delays between requests when scraping multiple pages.
  • Check robots.txt before scraping
  • Add 2-3 second delays between requests
  • Don’t overload servers with rapid requests
  • Identify your scraper with a proper user-agent
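The etiquette above can be sketched with the stdlib's robots.txt parser and a fixed delay. The robots.txt content is inlined here for illustration; a real scraper would fetch the site's actual /robots.txt, and MyScraper/1.0 is a made-up user-agent string.

```python
# Polite-scraping sketch: honor robots.txt and pace requests.
import time
from urllib.robotparser import RobotFileParser

# Inlined for illustration; normally fetched from the target site.
robots_txt = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

urls = ["https://example.com/article", "https://example.com/private/data"]
allowed = [u for u in urls if rp.can_fetch("MyScraper/1.0", u)]
# allowed → ["https://example.com/article"]

for url in allowed:
    # The actual fetch would go here, sent with an identifying
    # User-Agent header.
    time.sleep(2)  # 2-3 second delay between requests
```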

Handle errors gracefully

Use Conditional nodes to check for errors:
If {{WebScraper_0.content}} is_empty
  → Log "Failed to scrape URL"
  → Skip to next
Else
  → Continue processing
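The same check, sketched in Python for clarity; handle_result and its return strings are hypothetical stand-ins for the workflow's logging and branching steps.

```python
# Hypothetical mirror of the Conditional-node check above.
def handle_result(scrape: dict) -> str:
    content = scrape.get("content", "")
    if not content.strip():            # the is_empty condition
        return "Failed to scrape URL"  # log, then skip to next
    return "Continue processing"

handle_result({"content": ""})          # empty scrape → logged and skipped
handle_result({"content": "Main article content..."})
```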

Optimize for AI processing

Convert HTML to Markdown before sending content to an LLM: AI models work better with clean Markdown than with raw HTML.
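As a rough illustration of why that conversion helps, here is a minimal stdlib sketch that handles only headings and paragraphs; production pipelines typically use a dedicated library (e.g. html2text or markdownify) rather than hand-rolled parsing.

```python
# Minimal HTML-to-Markdown sketch (headings and paragraphs only).
from html.parser import HTMLParser

class MarkdownConverter(HTMLParser):
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append("\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        if data.strip():  # drop whitespace-only runs between tags
            self.parts.append(data)

    def to_markdown(self):
        return "".join(self.parts).strip()

conv = MarkdownConverter()
conv.feed("<h2>Tip 1</h2><p>Focus on <strong>quality</strong> content.</p>")
markdown = conv.to_markdown()
# markdown → "## Tip 1\nFocus on quality content."
```

Note how the noise of tags and attributes disappears while headings keep their structure, which is exactly what an LLM prompt benefits from.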

Common issues

Empty or missing content

  • Page may require JavaScript rendering (not supported)
  • Check that the URL is correct and publicly accessible
  • Try different XPath selectors
  • Page may block scrapers

Wrong content extracted

  • Use a more specific XPath to target the correct element
  • Try the Article template for blog posts
  • Check whether multiple elements match your selector

Blocked requests

  • Site may block automated requests
  • Try adding delays between requests
  • Check robots.txt for restrictions

Garbled text

  • Page may use an unusual encoding
  • Try an HTML Cleaner node after scraping
  • Use HTML to Markdown conversion for clean text