What does this node do?

The Web Scraper node fetches and extracts content from any webpage. It’s one of the most used nodes for gathering data from websites, whether for content analysis, data collection, or research. Common uses:
  • Extract article content for AI analysis
  • Gather product information
  • Collect competitor data
  • Build content datasets

Quick setup

1. Add the Web Scraper node
   Find it under Integrations → Web Scraper.

2. Enter or connect a URL
   Provide the webpage URL to scrape.

3. Choose a content template (optional)
   Select a template for structured extraction.

4. Run and get content
   Execute the node to receive the extracted content.

Configuration

Required fields

url
string
required
The webpage URL to scrape. Examples:
  • Static: https://example.com/article
  • Dynamic: {{Text_0.value}} (from input)
  • From loop: {{Loop_0.currentItem.url}}

Optional fields

content_type
string
Default: "No Template"
Pre-built extraction template for common page types.

Template        Extracts
No Template     Raw page content
Article         Title, author, date, body, images
Article List    List of article links with titles
Product         Name, price, description, specs
Product List    List of products with details
xpath_1
string
Custom XPath expression for targeted extraction. Examples:
  • Main content: //article or //div[@class='content']
  • All paragraphs: //p
  • Specific element: //div[@id='main-text']
xpath_2
string
Second XPath for additional extraction.
xpath_3
string
Third XPath for additional extraction.
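The XPath fields above can be approximated in plain Python. This is an illustrative sketch only: it uses the stdlib's xml.etree.ElementTree, which supports a restricted XPath subset (the node itself presumably evaluates full XPath against real HTML), and the sample HTML is invented.

```python
# Sketch of xpath_1-style extraction with the stdlib's limited XPath
# subset. Sample HTML is made up for illustration.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div class="content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
  <div class="sidebar"><p>Ignore me.</p></div>
</body></html>
"""

root = ET.fromstring(html)

# Equivalent of //div[@class='content']//p, done in two steps because
# ElementTree's XPath support is partial.
content_div = root.find(".//div[@class='content']")
paragraphs = [p.text for p in content_div.findall(".//p")]
# paragraphs → ["First paragraph.", "Second paragraph."]
```

A full scraper would use an HTML-tolerant parser with complete XPath 1.0 support (e.g. lxml); ElementTree requires well-formed markup.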

Output

The node returns extracted content:
{
  "url": "https://example.com/article",
  "title": "Article Title",
  "metaDescription": "Meta description text...",
  "content": "Main article content...",
  "h1": ["Main Heading"],
  "h2": ["Subheading 1", "Subheading 2"],
  "h3": ["Section 1", "Section 2"],
  "images": [
    {"src": "image1.jpg", "alt": "Description"}
  ],
  "links": [
    {"href": "https://...", "text": "Link text"}
  ],
  "wordCount": 1500,
  "html": "<div>Raw HTML...</div>"
}

Accessing output

{{WebScraper_0.content}}           → Main text content
{{WebScraper_0.title}}             → Page title
{{WebScraper_0.metaDescription}}   → Meta description
{{WebScraper_0.wordCount}}         → Word count
{{WebScraper_0.h1}}                → Array of H1 headings
{{WebScraper_0.html}}              → Raw HTML

Examples

Basic content extraction

URL: https://blog.example.com/seo-tips

Output:
{
  "title": "10 SEO Tips for 2024",
  "content": "Search engine optimization continues to evolve...",
  "wordCount": 2500,
  "h2": ["Tip 1: Focus on E-E-A-T", "Tip 2: Optimize Core Web Vitals", ...]
}

Article template

Content Type: Article

Enhanced output:
{
  "title": "10 SEO Tips for 2024",
  "author": "John Smith",
  "publishDate": "2024-01-15",
  "content": "...",
  "categories": ["SEO", "Digital Marketing"],
  "estimatedReadTime": "8 min"
}

Custom XPath extraction

XPath 1: //div[@class='pricing']//span[@class='price']

Extracts: All price elements from the pricing section

Common patterns

  • Scrape and analyze: pass {{WebScraper_0.content}} to an AI node for summarization or analysis
  • Batch scraping: loop over a list of URLs with {{Loop_0.currentItem.url}}
  • Content comparison: scrape several pages and compare the extracted content

XPath reference

Common selectors

Goal             XPath
All paragraphs   //p
All links        //a
By class         //div[@class='content']
By ID            //div[@id='main']
Contains class   //div[contains(@class, 'article')]
By tag + class   //article[@class='post']
Nested           //div[@class='content']//p

Extracting specific content

Goal           XPath
Article body   //article or //main
Navigation     //nav
Header         //header
Footer         //footer
All images     //img/@src
Link URLs      //a/@href
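Attribute selectors such as //img/@src and //a/@href return attribute values rather than elements. A hedged stdlib sketch of the equivalent: ElementTree cannot select attributes inside a path expression, so the idiom is to match the elements and read .get() on each. The HTML sample is invented.

```python
# Equivalent of //img/@src and //a/@href using stdlib ElementTree,
# which matches elements and then reads their attributes.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <a href="https://example.com/about">About</a>
  <img src="logo.png" alt="Logo"/>
  <img src="hero.jpg" alt="Hero"/>
</body></html>
"""
root = ET.fromstring(html)

image_sources = [img.get("src") for img in root.findall(".//img")]
link_urls = [a.get("href") for a in root.findall(".//a")]
# image_sources → ["logo.png", "hero.jpg"]
# link_urls → ["https://example.com/about"]
```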

Best practices

Respect websites

Always respect website terms of service and robots.txt. Add delays between requests when scraping multiple pages.
  • Check robots.txt before scraping
  • Add 2-3 second delays between requests
  • Don’t overload servers with rapid requests
  • Identify your scraper with a proper user-agent
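The etiquette above can be sketched with the stdlib's robots.txt parser and a fixed delay. The robots.txt content is inlined here for illustration; a real scraper would fetch the site's actual /robots.txt, and MyScraper/1.0 is a made-up user-agent string.

```python
# Polite-scraping sketch: honor robots.txt and pace requests.
import time
from urllib.robotparser import RobotFileParser

# Inlined for illustration; normally fetched from the target site.
robots_txt = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

urls = ["https://example.com/article", "https://example.com/private/data"]
allowed = [u for u in urls if rp.can_fetch("MyScraper/1.0", u)]
# allowed → ["https://example.com/article"]

for url in allowed:
    # The actual fetch would go here, sent with an identifying
    # User-Agent header.
    time.sleep(2)  # 2-3 second delay between requests
```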

Handle errors gracefully

Use Conditional nodes to check for errors:
If {{WebScraper_0.content}} is_empty
  → Log "Failed to scrape URL"
  → Skip to next
Else
  → Continue processing
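The same check, sketched in Python for clarity; handle_result and its return strings are hypothetical stand-ins for the workflow's logging and branching steps.

```python
# Hypothetical mirror of the Conditional-node check above.
def handle_result(scrape: dict) -> str:
    content = scrape.get("content", "")
    if not content.strip():            # the is_empty condition
        return "Failed to scrape URL"  # log, then skip to next
    return "Continue processing"

handle_result({"content": ""})          # empty scrape → logged and skipped
handle_result({"content": "Main article content..."})
```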

Optimize for AI processing

Convert HTML to Markdown before sending content to an LLM: AI models work better with clean Markdown than with raw HTML.
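As a rough illustration of why that conversion helps, here is a minimal stdlib sketch that handles only headings and paragraphs; production pipelines typically use a dedicated library (e.g. html2text or markdownify) rather than hand-rolled parsing.

```python
# Minimal HTML-to-Markdown sketch (headings and paragraphs only).
from html.parser import HTMLParser

class MarkdownConverter(HTMLParser):
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append("\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        if data.strip():  # drop whitespace-only runs between tags
            self.parts.append(data)

    def to_markdown(self):
        return "".join(self.parts).strip()

conv = MarkdownConverter()
conv.feed("<h2>Tip 1</h2><p>Focus on <strong>quality</strong> content.</p>")
markdown = conv.to_markdown()
# markdown → "## Tip 1\nFocus on quality content."
```

Note how the noise of tags and attributes disappears while headings keep their structure, which is exactly what an LLM prompt benefits from.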

Common issues

Empty or missing content

  • Page may require JavaScript rendering (not supported)
  • Check that the URL is correct and publicly accessible
  • Try different XPath selectors
  • Page may block scrapers

Wrong content extracted

  • Use a more specific XPath to target the correct element
  • Try the Article template for blog posts
  • Check whether multiple elements match your selector

Blocked requests

  • Site may block automated requests
  • Try adding delays between requests
  • Check robots.txt for restrictions

Garbled text

  • Page may use an unusual encoding
  • Try an HTML Cleaner node after scraping
  • Use HTML to Markdown conversion for clean text