Node Description
The HTML Cleaner node processes HTML content and removes specified elements, tags, and attributes based on your configuration. This is useful for simplifying HTML, removing unnecessary metadata, or extracting readable content.
Node Inputs
Required Fields
- HTML: The raw HTML content to be cleaned.
Example:
Optional Fields
You can enable or disable the removal of specific HTML elements.
- Remove <iframe>: Remove all <iframe> elements from the HTML. Default: Enabled
- Remove <header>: Remove all <header> tags. Default: Enabled
- Remove <nav>: Remove all <nav> tags. Default: Enabled
- Remove <footer>: Remove all <footer> tags. Default: Enabled
- Remove Attributes: Remove all attributes from the tags, leaving only the bare tags. Default: Enabled
- Remove Additional Tags (Optional): <script>, <meta>, <link>, <style>, <noscript>, <head>, <img>, <svg>, and <video>
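To make the effect of these options concrete, here is a minimal Python sketch of tag and attribute removal built on the standard library's html.parser. It is not the node's actual implementation: the names TagStripper and clean_html are hypothetical, and the sketch assumes that removing an element also removes everything nested inside it, which the description above does not state explicitly. Comments and doctype declarations are simply dropped.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so dropping one needs no skip state.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagStripper(HTMLParser):
    """Hypothetical cleaner: drops listed elements (with their contents) and,
    optionally, every attribute on the tags that remain."""

    def __init__(self, remove_tags, remove_attributes=True):
        # convert_charrefs=False keeps entities such as &amp; intact on output.
        super().__init__(convert_charrefs=False)
        self.remove_tags = {t.lower() for t in remove_tags}
        self.remove_attributes = remove_attributes
        self.skip_tag = None   # name of the removed element we are inside
        self.skip_depth = 0    # nesting depth of that same tag name
        self.parts = []

    def _open(self, tag):
        # Emit a bare tag, or the original source text when attributes are kept.
        self.parts.append(f"<{tag}>" if self.remove_attributes
                          else self.get_starttag_text())

    def handle_starttag(self, tag, attrs):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth += 1
        elif tag in self.remove_tags:
            if tag not in VOID_TAGS:
                self.skip_tag, self.skip_depth = tag, 1
        else:
            self._open(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <img/>: keep or drop, never enter skip mode.
        if not self.skip_tag and tag not in self.remove_tags:
            self._open(tag)

    def handle_endtag(self, tag):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
        elif tag not in self.remove_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_tag:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.skip_tag:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_tag:
            self.parts.append(f"&#{name};")


def clean_html(html, remove_tags, remove_attributes=True):
    parser = TagStripper(remove_tags, remove_attributes)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<div class="page"><nav><a href="/">Home</a></nav><p id="intro">Fish &amp; chips</p></div>'
print(clean_html(sample, remove_tags=["nav"]))
# prints: <div><p>Fish &amp; chips</p></div>
```

Passing remove_attributes=False in this sketch keeps each remaining start tag exactly as it appeared in the input.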
Output Format
- Output Type
  - Text: Extracts clean text content only.
  - HTML: Returns the cleaned HTML structure.
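The difference between the two output types can be sketched in Python as follows. This is an illustration under assumptions, not the node's code: TextExtractor and render are hypothetical names, and the choices to skip <script> and <style> contents, decode entities, and collapse whitespace in Text mode are guesses at what "clean text" means here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()  # default convert_charrefs=True decodes &amp; etc.
        self.chunks = []
        self.hidden = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.hidden:
            self.hidden -= 1

    def handle_data(self, data):
        if not self.hidden:
            self.chunks.append(data)


def render(cleaned_html, output_type):
    """Return the markup unchanged for "HTML", or only its text for "Text"."""
    if output_type == "HTML":
        return cleaned_html
    parser = TextExtractor()
    parser.feed(cleaned_html)
    parser.close()
    # Joining chunks with a space keeps block-level text apart; the trade-off
    # is an extra space where inline tags split a single word.
    return " ".join(" ".join(parser.chunks).split())


markup = "<div><h1>Title</h1><p>Tom &amp; Jerry</p></div>"
print(render(markup, "Text"))  # prints: Title Tom & Jerry
print(render(markup, "HTML"))  # prints the markup unchanged
```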
Node Output
The HTML Cleaner node provides the following output:
- Output: Cleaned HTML or plain text depending on the selected format.
Example Usage
1. Extract Clean Text
- HTML Input:
- Configuration:
  - Remove <header>: Enabled
  - Remove <script>: Enabled
  - Output: Text
2. Simplify HTML Structure
- HTML Input:
- Configuration:
  - Remove <footer>: Enabled
  - Output: HTML
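Because five of the removal options default to Enabled, a configuration that mentions only one or two of them still removes the others. The Python sketch below shows one way the two example configurations could resolve once defaults are merged in. The setting keys (remove_iframe and so on) and the effective_settings helper are hypothetical names chosen for illustration; the node's real field names may differ.

```python
# Defaults as documented in the Optional Fields list: enabled unless turned off.
DEFAULTS = {
    "remove_iframe": True,
    "remove_header": True,
    "remove_nav": True,
    "remove_footer": True,
    "remove_attributes": True,
}


def effective_settings(overrides, additional_tags, output_type):
    """Merge explicit choices over the defaults and list the tags to remove."""
    settings = {**DEFAULTS, **overrides}
    tags = [name[len("remove_"):] for name, enabled in settings.items()
            if enabled and name != "remove_attributes"]
    # Additional tags (for example script) are appended without duplicates.
    tags.extend(t for t in additional_tags if t not in tags)
    return {
        "remove_tags": tags,
        "remove_attributes": settings["remove_attributes"],
        "output_type": output_type,
    }


# Example 1: header and script removal, Text output.
example_1 = effective_settings({"remove_header": True}, ["script"], "Text")
# Example 2: footer removal, HTML output.
example_2 = effective_settings({"remove_footer": True}, [], "HTML")

print(example_1["remove_tags"])  # ['iframe', 'header', 'nav', 'footer', 'script']
print(example_2["remove_tags"])  # ['iframe', 'header', 'nav', 'footer']
```

To keep an element that is removed by default, the override would set its key to False, for example {"remove_nav": False}.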
Node Functionality
The HTML Cleaner node is perfect for:
- Simplifying raw HTML before text extraction.
- Removing clutter like ads, metadata, or scripts from web-scraped content.
- Preprocessing content for downstream workflows like NLP or data analysis.