Web Scraper

The Web Scraper node fetches a web page and extracts its content using built-in templates or custom XPath selectors, returning the result as a single string for downstream processing.

What does the Web Scraper node do?

The Web Scraper node fetches a web page from a URL and extracts its content. It can run in raw mode (no template) and return the page payload as text, apply one of four built-in templates tuned for common page types (articles, article lists, products, product lists), or target very specific elements with up to three XPath selectors. The node returns a single string that downstream nodes can clean, parse, or feed to an LLM.

Common use cases:

  • Pulling article bodies from blog or news pages before summarising them with an LLM.
  • Collecting product details (name, price, description) from e-commerce listings into a structured dataset.
  • Extracting specific elements from a known page structure using XPath (prices, ratings, hidden fields).
  • Iterating over a list of URLs inside a Loop to build a content corpus or competitor watch.

Quick setup

Follow these steps to add and configure the Web Scraper node in your workflow:

Add the node to the canvas

Open the Node Library, go to Integrations, then drag and drop the Web Scraper node onto your workspace.

Connect or set the URL

Either type a static URL directly into the Url(s) input, or connect the output of an upstream node (Text Input, Loop, JSON Path Extractor) that provides the URL to scrape.

Pick a content template

In the node settings, choose a Content Type: keep No Template to get the raw page, or pick Article, ArticleList, Product, or ProductList to apply a pre-built extraction profile.

(Optional) Add XPath selectors

Open the XPath Selectors section and fill XPath 1, XPath 2, and/or XPath 3 to target specific DOM nodes (e.g. //div[@class='product-price']).

Choose how to handle errors

Pick an Error Handling strategy: None to fail the workflow run on error, or Skip & Continue to return an empty string for that URL and keep going.

Connect the output

Connect the output port (on the right of the node) to the next node, and create your own variable name in that next node to receive the scraped content.

Configuration parameters

On top of the standard identification fields (Name and Description), the Web Scraper exposes one input port, Url(s), and the business parameters Content Type, Error Handling, and up to three XPath selectors.

Required fields

Name string required default: Scraping Tool

Node name — Important for quickly identifying this node’s role (e.g. Scrape competitor product page) when running and debugging the workflow.

Description string required default: A tool to scrape web content using XPath selectors

Node description — A short phrase describing what this scraping node fetches in the context of your workflow.

Url(s) string required

URL to scrape — The web page URL to fetch. Can be a hard-coded string or a variable injected from an upstream node (Text Input, Loop iteration, JSON Path Extractor, etc.).

Content Type string required default: No Template

Extraction template — Selects how the page is extracted. Available values:

Value | Behaviour
No Template | Returns the raw page content with no template applied.
Article | Extracts a single article (title, body, metadata).
ArticleList | Extracts a list of article items from an index page.
Product | Extracts a single product (name, price, description).
ProductList | Extracts a list of products from a listing page.

Error Handling string required default: None

Error handling strategy — Controls how the node reacts when the page cannot be fetched or parsed:

Value | Behaviour
None | When an error occurs, the node stops and the workflow run fails.
Skip & Continue | If an error occurs, the node returns an empty string for that URL and execution continues.
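
The two strategies can be pictured with a small Python sketch. This is an illustration of the documented behaviour only, not the node's implementation: the `scrape` function, the `fake_fetch` helper, and the in-memory `PAGES` dictionary are all hypothetical stand-ins so the example runs without network access.

```python
from typing import Callable

def scrape(url: str, fetch: Callable[[str], str], error_handling: str = "None") -> str:
    """Return the page content, mimicking the two documented strategies."""
    try:
        return fetch(url)
    except Exception:
        if error_handling == "Skip & Continue":
            return ""  # empty string for this URL, execution continues
        raise  # "None": the error propagates and the run fails

# Hypothetical in-memory fetcher standing in for a real HTTP request.
PAGES = {"https://example.com/ok": "<html><body>hello</body></html>"}

def fake_fetch(url: str) -> str:
    if url not in PAGES:
        raise ConnectionError(f"cannot fetch {url}")
    return PAGES[url]

urls = ["https://example.com/ok", "https://example.com/missing"]
results = [scrape(u, fake_fetch, "Skip & Continue") for u in urls]
```

With Skip & Continue, `results` holds the page for the first URL and an empty string for the second. Calling `scrape` on the missing URL with `"None"` would raise instead, which corresponds to a failed run.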

Optional fields

XPath 1 string default: Empty

First XPath selector — Custom XPath expression used to target a specific element on the page (e.g. //div[@class='product-title']). Combined with the chosen content template when both are provided.

XPath 2 string default: Empty

Second XPath selector — Additional XPath, typically used to extract a second piece of information (e.g. //div[@class='product-price']).

XPath 3 string default: Empty

Third XPath selector — Additional XPath, typically used to extract a third piece of information (e.g. //div[@class='product-description']).

Tip

Start with No Template and a single XPath when prototyping, inspect the raw output, then switch to a template (Article, Product…) only once you know which fields you actually need downstream.

What does the node output?

The node outputs a single string named html that contains the scraped content. The exact shape of that string depends on the chosen Content Type and on the XPath selectors:

  • With No Template, the output is the raw page payload (typically HTML).
  • With a template (Article, ArticleList, Product, ProductList), the output is a serialised text representation of the extracted fields.
  • With XPath selectors, the output focuses on the matched DOM nodes.

How to use the output

In Draft & Goal you don’t need to look up a system-generated variable name. To use the result:

  1. Draw a connection from the Web Scraper output port.
  2. Connect it to the next node’s input (HTML to Markdown, HTML Cleaner, JSON Path Extractor, LLM, etc.).
  3. In that next node, create and name your own variable (for example, scraped_page). The scraped content will be injected into it automatically.

html string

The scraped content, returned as a string. Empty when the URL fails to load and Error Handling is set to Skip & Continue.

Usage examples

Example 1: Scrape an article and summarise it with an LLM

You want to turn any article URL into a short briefing.

Workflow:

  1. Text Input holds the article URL.
  2. Web Scraper fetches it with Content Type = Article and Error Handling = Skip & Continue.
  3. HTML to Markdown cleans the output for the LLM.
  4. LLM receives the markdown and produces the summary.

Web Scraper configuration:

  • Url(s) = {{Text_0.value}}
  • Content Type = Article
  • Error Handling = Skip & Continue

Example 2: Targeted product extraction with XPath

You monitor a competitor’s product page and only need the title, price, and description.

[Figure: Web Scraper settings with three XPath rows filled in for competitor product scraping]

Web Scraper configuration:

  • Url(s) = https://shop.example.com/product/123
  • Content Type = Product
  • XPath 1 = //div[@class='product-title']
  • XPath 2 = //div[@class='product-price']
  • XPath 3 = //div[@class='product-description']
  • Error Handling = None

The html output then contains the three targeted blocks, ready to be parsed by a JSON Path Extractor or sent to an LLM for normalisation.
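
To get a feel for what three selectors return, here is a rough offline sketch in Python. It makes several assumptions: the sample `PAGE` is invented, the standard-library `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, and joining the matches with newlines is a guess at one plausible serialisation. The node's real engine and output layout may differ.

```python
import xml.etree.ElementTree as ET
from typing import List

# Invented, well-formed sample page (real pages are rarely this tidy).
PAGE = """<html><body>
  <div class="product-title">Trail Shoe</div>
  <div class="product-price">89.00 EUR</div>
  <div class="product-description">Lightweight shoe for <b>rocky</b> paths.</div>
  <div class="footer">Unrelated footer text</div>
</body></html>"""

def extract(page: str, xpaths: List[str]) -> str:
    """Apply each non-empty XPath and join the matched text blocks."""
    root = ET.fromstring(page)
    blocks = []
    for xpath in xpaths:
        if not xpath:
            continue  # an empty XPath field is simply ignored
        # ElementTree only accepts relative paths, so "//div" becomes ".//div".
        for element in root.findall("." + xpath):
            blocks.append("".join(element.itertext()).strip())
    return "\n".join(blocks)  # assumption: one matched block per line

html = extract(PAGE, [
    "//div[@class='product-title']",
    "//div[@class='product-price']",
    "//div[@class='product-description']",
])
```

In this sketch `html` contains the title, price, and description text and leaves out the footer, which is the kind of focused payload a downstream parser or LLM can normalise.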

Example 3: Bulk scraping with a Loop

You hold a list of URLs and want to scrape each one and store the result.

Workflow:

  1. Create List (or an upstream API Connector) provides the URL list.
  2. Loop iterates over each item.
  3. Web Scraper scrapes the current URL with Url(s) = {{Loop_0.currentItem}} and Error Handling = Skip & Continue so a single failing page does not abort the run.
  4. Save / Append the result downstream (Sheets, database, file).

Common issues

The output is empty although the URL works in my browser

Cause: The page may render its content with JavaScript after load, or it may block automated requests, in which case the raw HTML the scraper sees does not contain the expected text.

Solution: Open the page source (not the dev-tools DOM) to confirm the data is actually present in the initial HTML. If it is rendered client-side, the Web Scraper cannot reach it. If the page blocks scrapers, switch to an upstream API or a different source.
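
The same check can be scripted once you have the raw payload (for example, the output of a No Template run). The sketch below is only an illustration: both sample payloads and the `present_in_initial_html` helper are hypothetical.

```python
# Invented payloads: one server-rendered, one an empty shell filled in by JavaScript.
SERVER_RENDERED = "<html><body><span class='price'>89.00 EUR</span></body></html>"
CLIENT_RENDERED = "<html><body><div id='app'></div><script src='/bundle.js'></script></body></html>"

def present_in_initial_html(raw_html: str, expected_text: str) -> bool:
    """True when the expected text is already in the HTML sent by the server."""
    return expected_text.casefold() in raw_html.casefold()

server_ok = present_in_initial_html(SERVER_RENDERED, "89.00 EUR")
client_ok = present_in_initial_html(CLIENT_RENDERED, "89.00 EUR")
```

If the check is false for text you can see in the browser, the content is most likely injected client-side.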

An XPath returns nothing

Cause: The XPath does not match the real DOM structure, or it targets attributes that differ from what is rendered server-side.

Solution: Test the XPath in your browser’s dev tools first ($x("//div[@class='product-title']")). Prefer robust selectors (contains(@class, 'price')) over fragile ones tied to volatile class names.
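
The following Python sketch shows why an exact class match is fragile. It is an illustration under assumptions: the sample markup is invented, and because the standard-library `xml.etree.ElementTree` does not implement `contains()`, the robust variant is emulated with a small hypothetical `has_class` helper that checks class tokens, which is close in spirit to (though stricter than) `contains(@class, 'price')`.

```python
import xml.etree.ElementTree as ET

# Invented markup: the element carries two classes, as is common on real sites.
PAGE = """<html><body>
  <div class="price price--sale">79.00 EUR</div>
</body></html>"""

root = ET.fromstring(PAGE)

# Exact attribute match: finds nothing, because the attribute value is
# "price price--sale", not "price".
exact = root.findall(".//div[@class='price']")

def has_class(element, name):
    """True when `name` is one of the element's space-separated classes."""
    return name in element.get("class", "").split()

# Token-based match: still works when extra classes are added.
robust = [el for el in root.iter("div") if has_class(el, "price")]
robust_text = [el.text for el in robust]
```

Here `exact` is empty while `robust` finds the price element, which mirrors what happens when a site appends a modifier class to an element you were matching exactly.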

My workflow run fails on the first bad URL in a Loop

Cause: Error Handling is set to None, so the first non-200 response or parsing error stops the run.

Solution: Set Error Handling to Skip & Continue. The node will return an empty string for the failing URL and the Loop will move on to the next item.

The output is hard to feed to an LLM

Cause: Raw HTML contains a lot of markup the model does not need.

Solution: Place an HTML Cleaner or HTML to Markdown node between the Web Scraper and the LLM to reduce noise and token usage.
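
To illustrate how much noise a cleaning step can remove, here is a minimal Python sketch built on the standard-library `html.parser`. It is not how the HTML Cleaner or HTML to Markdown nodes work internally. The `TextExtractor` class, the `html_to_text` function, and the sample payload are hypothetical, and the sketch only keeps visible text while dropping tags, scripts, and styles.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

raw = ("<html><head><style>p{color:red}</style></head><body>"
       "<h1>Title</h1><p>Some <b>bold</b> text.</p>"
       "<script>track()</script></body></html>")
clean = html_to_text(raw)
```

The cleaned string is a fraction of the raw payload's length, which is the effect you want before paying for LLM tokens.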

Best practices and pitfalls

Tip

Always set Error Handling to Skip & Continue when the Web Scraper sits inside a Loop. One unreachable URL out of a hundred should not abort the entire batch.

Warning

Respect target sites. Check the site’s terms of use and robots.txt before scraping at scale, throttle your Loops to avoid hammering servers, and prefer official APIs when they exist.
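
If you prepare the URL list in code before it reaches the Loop, both checks can be sketched with the Python standard library. This is an assumption-laden illustration: the `ROBOTS_TXT` content, the `example-bot` user agent, and the helper names `allowed_urls` and `throttled` are made up, and inside Draft & Goal the pacing would be configured in the workflow rather than with `time.sleep`.

```python
import time
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; in practice, read it from the target site.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

def allowed_urls(urls, robots_txt, user_agent="example-bot"):
    """Keep only the URLs that the robots.txt rules allow for this agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]

def throttled(items, delay_seconds):
    """Yield items one by one, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index:
            time.sleep(delay_seconds)
        yield item

urls = [
    "https://shop.example.com/product/123",
    "https://shop.example.com/private/admin",
]
to_scrape = list(throttled(allowed_urls(urls, ROBOTS_TXT), delay_seconds=0.1))
```

Only the product URL survives the filter here, and the delay spaces out requests when several URLs remain.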

How does it fit into a workflow?

The Web Scraper typically sits between a node that produces URLs and a node that cleans or interprets the result. Here is a typical batch-scraping pattern with cleanup and LLM analysis:

graph LR
    Source[URL list / API Connector] --> Loop[Loop]
    Loop --> Scraper[Web Scraper]
    Scraper --> Clean[HTML to Markdown]
    Clean --> LLM[LLM Analysis]
    LLM --> Out[Save results]