# Web Scraper

Extract content from any webpage automatically.
## What does this node do?

The Web Scraper node fetches and extracts content from any webpage. It's one of the most commonly used nodes for gathering data from websites, whether for content analysis, data collection, or research.
Common uses:
- Extract article content for AI analysis
- Gather product information
- Collect competitor data
- Build content datasets
## Quick setup

1. **Add the Web Scraper node.** Find it in Integrations → Web Scraper.
2. **Enter or connect a URL.** Provide the webpage URL to scrape.
3. **Choose a content template (optional).** Select a template for structured extraction.
4. **Run and get content.** Execute the node to receive the extracted content.
## Configuration

### Required fields
**`url`** (string, required): The webpage URL to scrape.

Examples:

- Static: `https://example.com/article`
- Dynamic: `{{Text_0.value}}` (from input)
- From loop: `{{Loop_0.currentItem.url}}`
### Optional fields

**`content_type`** (string, default: `No Template`): Pre-built extraction template for common page types.
| Template | Extracts |
|---|---|
| No Template | Raw page content |
| Article | Title, author, date, body, images |
| ArticleList | List of article links with titles |
| Product | Name, price, description, specs |
| ProductList | List of products with details |
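
These templates are built into the node, so you never implement them yourself. As a rough mental model only, the Article template behaves something like the lxml sketch below; the meta-tag fallbacks and field names are illustrative assumptions, not the node's actual logic.

```python
# Rough sketch of Article-style extraction, assuming common page
# conventions (og: meta tags, an <article> element). Illustrative only;
# the node's real template logic is internal and may differ.
import requests
from lxml import html

def extract_article(url):
    page = html.fromstring(requests.get(url, timeout=10).content)

    def first(xpath):
        matches = page.xpath(xpath)
        return matches[0].strip() if matches else None

    return {
        "title": first("//meta[@property='og:title']/@content") or first("//h1/text()"),
        "author": first("//meta[@name='author']/@content"),
        "publishDate": first("//meta[@property='article:published_time']/@content"),
        "body": " ".join(t.strip() for t in page.xpath("//article//p/text()")),
        "images": page.xpath("//article//img/@src"),
    }
```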
**`xpath_1`** (string): Custom XPath expression for targeted extraction.

Examples:

- Main content: `//article` or `//div[@class='content']`
- All paragraphs: `//p`
- Specific element: `//div[@id='main-text']`
**`xpath_2`** (string): Second XPath for additional extraction.

**`xpath_3`** (string): Third XPath for additional extraction.
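
If you want to sanity-check an XPath expression before pasting it into one of these fields, a short local script helps. A minimal sketch using Python's lxml, with three placeholder expressions standing in for `xpath_1` through `xpath_3`:

```python
# Sketch: test candidate XPath expressions locally before using them
# in xpath_1 / xpath_2 / xpath_3. Assumes requests and lxml are installed.
import requests
from lxml import html

url = "https://example.com/article"  # placeholder URL
expressions = ["//article", "//p", "//div[@id='main-text']"]

page = html.fromstring(requests.get(url, timeout=10).content)
for i, xp in enumerate(expressions, start=1):
    nodes = page.xpath(xp)
    print(f"xpath_{i} ({xp}): {len(nodes)} match(es)")
```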
## Output

The node returns the extracted content:

```json
{
  "url": "https://example.com/article",
  "title": "Article Title",
  "metaDescription": "Meta description text...",
  "content": "Main article content...",
  "h1": ["Main Heading"],
  "h2": ["Subheading 1", "Subheading 2"],
  "h3": ["Section 1", "Section 2"],
  "images": [
    {"src": "image1.jpg", "alt": "Description"}
  ],
  "links": [
    {"href": "https://...", "text": "Link text"}
  ],
  "wordCount": 1500,
  "html": "<div>Raw HTML...</div>"
}
```
### Accessing output

- `{{WebScraper_0.content}}` → Main text content
- `{{WebScraper_0.title}}` → Page title
- `{{WebScraper_0.metaDescription}}` → Meta description
- `{{WebScraper_0.wordCount}}` → Word count
- `{{WebScraper_0.h1}}` → Array of H1 headings
- `{{WebScraper_0.html}}` → Raw HTML
## Examples

### Basic content extraction

URL: `https://blog.example.com/seo-tips`

Output:

```json
{
  "title": "10 SEO Tips for 2024",
  "content": "Search engine optimization continues to evolve...",
  "wordCount": 2500,
  "h2": ["Tip 1: Focus on E-E-A-T", "Tip 2: Optimize Core Web Vitals", ...]
}
```
### Article template

Content Type: `Article`

Enhanced output:

```json
{
  "title": "10 SEO Tips for 2024",
  "author": "John Smith",
  "publishDate": "2024-01-15",
  "content": "...",
  "categories": ["SEO", "Digital Marketing"],
  "estimatedReadTime": "8 min"
}
```
### Custom XPath extraction

XPath 1: `//div[@class='pricing']//span[@class='price']`

Extracts: all price elements from the pricing section.
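
You can verify an expression like this locally before using it in the node. A self-contained lxml sketch against illustrative HTML:

```python
# Sketch: the pricing XPath above, run against sample markup.
from lxml import html

doc = html.fromstring("""
<div class="pricing">
  <div><span class="price">$9</span></div>
  <div><span class="price">$29</span></div>
</div>
""")
prices = doc.xpath("//div[@class='pricing']//span[@class='price']/text()")
print(prices)  # ['$9', '$29']
```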
## Common patterns

### Scrape and analyze

```mermaid
graph LR
    A[URL] --> B[Web Scraper]
    B --> C[HTML to Markdown]
    C --> D[LLM Analysis]
```
### Batch scraping

```mermaid
graph LR
    A[URL List] --> B[Loop]
    B --> C[Web Scraper]
    C --> D[Save to Sheets]
```
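
Outside the workflow, the same pattern is a loop with a polite delay between fetches. A Python sketch, with a local CSV file standing in for the Save to Sheets step (URLs and field choices are illustrative):

```python
# Sketch of the batch pattern: loop over URLs, scrape each page,
# collect rows. CSV stands in for the Sheets step.
import csv
import time
import requests
from lxml import html

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder list

with open("scraped.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title", "word_count"])
    for url in urls:
        page = html.fromstring(requests.get(url, timeout=10).content)
        title = (page.xpath("//title/text()") or [""])[0].strip()
        writer.writerow([url, title, len(page.text_content().split())])
        time.sleep(2)  # polite delay between requests
```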
### Content comparison

```mermaid
graph LR
    A[Our Page] --> C[LLM Compare]
    B[Competitor Page] --> C
    C --> D[Report]
```
## XPath reference

### Common selectors

| Goal | XPath |
|---|---|
| All paragraphs | `//p` |
| All links | `//a` |
| By class | `//div[@class='content']` |
| By ID | `//div[@id='main']` |
| Contains class | `//div[contains(@class, 'article')]` |
| By tag + class | `//article[@class='post']` |
| Nested | `//div[@class='content']//p` |
### Extracting specific content

| Goal | XPath |
|---|---|
| Article body | `//article` or `//main` |
| Navigation | `//nav` |
| Header | `//header` |
| Footer | `//footer` |
| All images | `//img/@src` |
| Link URLs | `//a/@href` |
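
A quick way to build intuition for these selectors is to run them against a tiny sample document. Note that attribute XPaths such as `//img/@src` return plain strings rather than elements. A minimal lxml sketch:

```python
# Sketch: a few of the selectors above against sample markup.
from lxml import html

doc = html.fromstring("""
<main>
  <article class="post">
    <p>First paragraph.</p>
    <img src="hero.jpg" alt="Hero">
    <a href="https://example.com">Read more</a>
  </article>
</main>
""")
print(doc.xpath("//p/text()"))    # ['First paragraph.']
print(doc.xpath("//img/@src"))    # ['hero.jpg']
print(doc.xpath("//a/@href"))     # ['https://example.com']
print(doc.xpath("//article[@class='post']//p/text()"))  # nested selector
```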
## Best practices

### Respect websites

Always respect website terms of service and robots.txt, and add delays between requests when scraping multiple pages (see the sketch after this list).

- Check robots.txt before scraping
- Add 2-3 second delays between requests
- Don't overload servers with rapid requests
- Identify your scraper with a proper user-agent
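
Here is that checklist as a minimal Python sketch, using only the standard library's urllib.robotparser; the user-agent string and URLs are placeholders:

```python
# Sketch: check robots.txt and throttle requests before scraping.
import time
from urllib import robotparser

AGENT = "MyScraperBot/1.0 (contact@example.com)"  # identify your scraper

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for url in ["https://example.com/page-1", "https://example.com/page-2"]:
    if not rp.can_fetch(AGENT, url):
        print(f"Skipping disallowed URL: {url}")
        continue
    # ... fetch and process the page here ...
    time.sleep(2)  # 2-3 second delay between requests
```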
### Handle errors gracefully

```mermaid
graph LR
    A[Scrape] --> B{Success?}
    B -->|Yes| C[Process]
    B -->|No| D[Log Error]
    D --> E[Continue/Retry]
```
Use Conditional nodes to check for errors:

```
If {{WebScraper_0.content}} is_empty
  → Log "Failed to scrape URL"
  → Skip to next
Else
  → Continue processing
```
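
Expressed as plain code, the same check-and-retry logic might look like the sketch below; `scrape()` is a stand-in for the Web Scraper call, not a real API:

```python
# Sketch of the is_empty check with a simple retry, in plain Python.
import time

def scrape(url):
    """Stand-in for the Web Scraper node; returns extracted text or ''."""
    ...

def scrape_with_retry(url, attempts=3):
    for attempt in range(1, attempts + 1):
        content = scrape(url)
        if content:                    # the is_empty check from above
            return content
        print(f"Failed to scrape {url} (attempt {attempt})")
        time.sleep(2 ** attempt)       # back off before retrying
    return None                        # caller logs and skips to the next URL
```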
### Optimize for AI processing

Convert HTML to Markdown before sending content to an LLM:

```mermaid
graph LR
    A[Web Scraper] --> B[HTML to Markdown]
    B --> C[LLM]
```

AI models work better with clean Markdown than with raw HTML.
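
To replicate this step outside the workflow, the third-party markdownify package gives a reasonable HTML-to-Markdown conversion. A sketch assuming `pip install markdownify`; the prompt wording is illustrative:

```python
# Sketch: scrape, convert HTML to Markdown, then build an LLM prompt.
# Assumes the third-party markdownify package (pip install markdownify).
import requests
from markdownify import markdownify as md

html_text = requests.get("https://example.com/article", timeout=10).text
markdown = md(html_text)

prompt = f"Summarize the key points of this article:\n\n{markdown}"
# ... send `prompt` to your LLM of choice ...
```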
## Common issues

### Content is empty

- The page may require JavaScript (not supported)
- Check that the URL is correct and accessible
- Try different XPath selectors
- The page may block scrapers

### Getting wrong content

- Use a specific XPath to target the correct element
- Try the Article template for blog posts
- Check for multiple matching elements

### Request blocked or 403 error

- The site may block automated requests
- Try adding delays between requests
- Check robots.txt for restrictions

### Content is garbled

- The page may use an unusual encoding
- Try the HTML Cleaner node after scraping
- Use HTML to Markdown for clean text
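
When debugging encoding problems outside the node, one common culprit is a server that omits the charset header, which makes many HTTP clients fall back to ISO-8859-1. A sketch of letting Python's requests library re-detect the encoding from the response body:

```python
# Sketch: fix garbled text by re-detecting the response encoding.
import requests

resp = requests.get("https://example.com/page", timeout=10)
if resp.encoding and resp.encoding.lower() == "iso-8859-1":
    # No charset header; let requests guess from the body instead.
    resp.encoding = resp.apparent_encoding
print(resp.text[:200])
```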