Web Scraper

Extract content from any webpage automatically

What does this node do?

The Web Scraper node fetches and extracts content from any webpage. It’s one of the most used nodes for gathering data from websites, whether for content analysis, data collection, or research.

Common uses:

  • Extract article content for AI analysis
  • Gather product information
  • Collect competitor data
  • Build content datasets

Quick setup

Add the Web Scraper node

Find it under Integrations → Web Scraper

Enter or connect a URL

Provide the webpage URL to scrape

Choose a content template (optional)

Select a template for structured extraction

Run and get content

Execute to receive extracted content

Configuration

Required fields

url string required

The webpage URL to scrape.

Examples:

  • Static: https://example.com/article
  • Dynamic: {{Text_0.value}} (from input)
  • From loop: {{Loop_0.currentItem.url}}

Optional fields

content_type string default: No Template

Pre-built extraction template for common page types.

Template        Extracts
No Template     Raw page content
Article         Title, author, date, body, images
Article List    List of article links with titles
Product         Name, price, description, specs
Product List    List of products with details

xpath_1 string

Custom XPath expression for targeted extraction.

Examples:

  • Main content: //article or //div[@class='content']
  • All paragraphs: //p
  • Specific element: //div[@id='main-text']

xpath_2 string

Second XPath for additional extraction.

xpath_3 string

Third XPath for additional extraction.
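
To make the XPath fields concrete, here is a sketch of what such expressions select, using Python's standard library. This is an assumption for illustration only: xml.etree.ElementTree supports just a subset of XPath and requires a leading `.`, while the node itself presumably runs a fuller XPath engine.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for a fetched page.
page = ET.fromstring(
    "<html><body>"
    "<div class='content'><p>First paragraph.</p><p>Second paragraph.</p></div>"
    "<div id='main-text'>Target text</div>"
    "</body></html>"
)

# //p  -> every paragraph on the page
paragraphs = [p.text for p in page.findall(".//p")]

# //div[@id='main-text']  -> one specific element
target = page.find(".//div[@id='main-text']").text

print(paragraphs)  # ['First paragraph.', 'Second paragraph.']
print(target)      # Target text
```

Each xpath field runs independently against the same page, so xpath_1 through xpath_3 let you pull up to three targeted fragments in one scrape.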

Output

The node returns extracted content:

{
  "url": "https://example.com/article",
  "title": "Article Title",
  "metaDescription": "Meta description text...",
  "content": "Main article content...",
  "h1": ["Main Heading"],
  "h2": ["Subheading 1", "Subheading 2"],
  "h3": ["Section 1", "Section 2"],
  "images": [
    {"src": "image1.jpg", "alt": "Description"}
  ],
  "links": [
    {"href": "https://...", "text": "Link text"}
  ],
  "wordCount": 1500,
  "html": "<div>Raw HTML...</div>"
}

Accessing output

{{WebScraper_0.content}}           → Main text content
{{WebScraper_0.title}}             → Page title
{{WebScraper_0.metaDescription}}   → Meta description
{{WebScraper_0.wordCount}}         → Word count
{{WebScraper_0.h1}}                → Array of H1 headings
{{WebScraper_0.html}}              → Raw HTML

Examples

Basic content extraction

URL: https://blog.example.com/seo-tips

Output:

{
  "title": "10 SEO Tips for 2024",
  "content": "Search engine optimization continues to evolve...",
  "wordCount": 2500,
  "h2": ["Tip 1: Focus on E-E-A-T", "Tip 2: Optimize Core Web Vitals", ...]
}

Article template

Content Type: Article

Enhanced output:

{
  "title": "10 SEO Tips for 2024",
  "author": "John Smith",
  "publishDate": "2024-01-15",
  "content": "...",
  "categories": ["SEO", "Digital Marketing"],
  "estimatedReadTime": "8 min"
}

Custom XPath extraction

XPath 1: //div[@class='pricing']//span[@class='price']

Extracts: All price elements from the pricing section

Common patterns

Scrape and analyze

graph LR
    A[URL] --> B[Web Scraper]
    B --> C[HTML to Markdown]
    C --> D[LLM Analysis]

Batch scraping

graph LR
    A[URL List] --> B[Loop]
    B --> C[Web Scraper]
    C --> D[Save to Sheets]

Content comparison

graph LR
    A[Our Page] --> C[LLM Compare]
    B[Competitor Page] --> C
    C --> D[Report]

XPath reference

Common selectors

Goal             XPath
All paragraphs   //p
All links        //a
By class         //div[@class='content']
By ID            //div[@id='main']
Contains class   //div[contains(@class, 'article')]
By tag + class   //article[@class='post']
Nested           //div[@class='content']//p

Extracting specific content

Goal           XPath
Article body   //article or //main
Navigation     //nav
Header         //header
Footer         //footer
All images     //img/@src
Link URLs      //a/@href
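
Attribute selectors like //img/@src and //a/@href return attribute values rather than elements. A sketch of the same idea with Python's standard library (an illustration, not the node's implementation; ElementTree returns elements, so attributes are read with .get()):

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<body>"
    "<img src='image1.jpg' alt='Hero' />"
    "<a href='https://example.com'>Home</a>"
    "</body>"
)

# //img/@src -> image URLs; //a/@href -> link URLs
image_sources = [img.get("src") for img in page.findall(".//img")]
link_urls = [a.get("href") for a in page.findall(".//a")]

print(image_sources)  # ['image1.jpg']
print(link_urls)      # ['https://example.com']
```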

Best practices

Respect websites

Warning

Always respect website terms of service and robots.txt. Add delays between requests when scraping multiple pages.

  • Check robots.txt before scraping
  • Add 2-3 second delays between requests
  • Don’t overload servers with rapid requests
  • Identify your scraper with a proper user-agent
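
The robots.txt check from the list above can be sketched with Python's standard library. The user-agent string and paths here are illustrative; in a real workflow you would fetch the live file with set_url() and read() instead of parsing it inline, and sleep 2-3 seconds between requests.

```python
import time
import urllib.robotparser

USER_AGENT = "MyWorkflowScraper/1.0"  # illustrative user-agent string

# Parse a robots.txt offline for the sketch; in practice call
# rp.set_url("https://example.com/robots.txt") followed by rp.read().
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

allowed = rp.can_fetch(USER_AGENT, "https://example.com/article")
blocked = rp.can_fetch(USER_AGENT, "https://example.com/private/data")

print(allowed, blocked)  # True False
# Between real requests: time.sleep(2.5)
```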

Handle errors gracefully

graph LR
    A[Scrape] --> B{Success?}
    B -->|Yes| C[Process]
    B -->|No| D[Log Error]
    D --> E[Continue/Retry]

Use Conditional nodes to check for errors:

If {{WebScraper_0.content}} is_empty
  → Log "Failed to scrape URL"
  → Skip to next
Else
  → Continue processing
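
The same check-empty-then-skip pattern, sketched in plain Python; the result dictionaries are assumptions standing in for the Web Scraper node's output:

```python
def process_results(results):
    """Split scrape results into processed content and failed URLs."""
    processed, failed = [], []
    for result in results:
        if not result.get("content"):        # the is_empty check
            failed.append(result["url"])     # log "Failed to scrape URL"
            continue                         # skip to next
        processed.append(result["content"].strip())  # continue processing
    return processed, failed

processed, failed = process_results([
    {"url": "https://example.com/a", "content": "Body text "},
    {"url": "https://example.com/b", "content": ""},
])
print(processed)  # ['Body text']
print(failed)     # ['https://example.com/b']
```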

Optimize for AI processing

Convert HTML to Markdown before sending to LLM:

graph LR
    A[Web Scraper] --> B[HTML to Markdown]
    B --> C[LLM]

AI models work better with clean Markdown than raw HTML.
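
To see why, here is a toy HTML-to-Markdown conversion using only the standard library. It is deliberately minimal (headings and paragraphs only); the HTML to Markdown node would handle far more, but the output shape is the point: tags become lightweight markers an LLM reads easily.

```python
from html.parser import HTMLParser

class TinyMarkdown(HTMLParser):
    """Convert h2 headings and paragraphs to Markdown, dropping other tags."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.out.append("## ")

    def handle_endtag(self, tag):
        if tag in ("h2", "p"):
            self.out.append("\n\n")

    def handle_data(self, data):
        self.out.append(data)

md = TinyMarkdown()
md.feed("<h2>SEO Tips</h2><p>Optimize <b>Core Web Vitals</b>.</p>")
markdown = "".join(md.out).strip()
print(markdown)
# ## SEO Tips
#
# Optimize Core Web Vitals.
```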

Common issues

Content is empty
  • Page may require JavaScript (not supported)
  • Check if URL is correct and accessible
  • Try different XPath selectors
  • Page may block scrapers
Getting wrong content
  • Use specific XPath to target correct element
  • Try Article template for blog posts
  • Check for multiple matching elements
Request blocked or 403 error
  • Site may block automated requests
  • Try adding delays between requests
  • Check robots.txt for restrictions
Content is garbled
  • Page may have unusual encoding
  • Try HTML Cleaner node after scraping
  • Use HTML to Markdown for clean text