HTML Cleaner
This document explains the HTML Cleaner node, which allows you to clean HTML content by removing unwanted tags and attributes, making the content easier to work with.
Node Description
The HTML Cleaner node processes HTML content and removes specified elements, tags, and attributes based on your configuration. This is useful for simplifying HTML, removing unnecessary metadata, or extracting readable content.
Node Inputs
Required Fields
- HTML
The raw HTML content to be cleaned.
Example:
Optional Fields
You can enable or disable the removal of specific HTML elements.
- Remove <iframe>
Remove all <iframe> elements from the HTML.
Default: Enabled
- Remove <header>
Remove all <header> tags.
Default: Enabled
- Remove <nav>
Remove all <nav> tags.
Default: Enabled
- Remove <footer>
Remove all <footer> tags.
Default: Enabled
- Remove Attributes
Remove all attributes from the tags, leaving only the bare tags.
Default: Enabled
- Remove Additional Tags (Optional)
<script>, <meta>, <link>, <style>, <noscript>, <head>, <img>, <svg>, and <video>.
All are toggled on by default, but you can customize based on your needs.
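The options above can be sketched roughly as follows, using only Python's standard `html.parser` module. This is an illustration, not the node's actual implementation: the `HtmlCleaner` class, the `clean_html` function, and their parameters are hypothetical names, and the sketch assumes that a removed element is dropped together with everything inside it.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag, so they cannot open a skipped region.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Assumed defaults, mirroring the toggles listed above.
DEFAULT_REMOVED = {"iframe", "header", "nav", "footer", "script", "meta", "link",
                   "style", "noscript", "head", "img", "svg", "video"}


class HtmlCleaner(HTMLParser):
    """Hypothetical cleaner: drops listed elements and, optionally, all attributes."""

    def __init__(self, remove_tags, remove_attributes=True):
        super().__init__(convert_charrefs=True)
        self.remove_tags = set(remove_tags)
        self.remove_attributes = remove_attributes
        self.skipping = None  # name of the removed element we are currently inside
        self.depth = 0        # nesting depth of that same element name
        self.parts = []       # pieces of the cleaned output

    def handle_starttag(self, tag, attrs):
        if self.skipping:
            if tag == self.skipping:
                self.depth += 1
            return
        if tag in self.remove_tags:
            if tag not in VOID_TAGS:
                self.skipping, self.depth = tag, 1
            return
        if self.remove_attributes or not attrs:
            self.parts.append(f"<{tag}>")
        else:
            rendered = "".join(
                f" {name}" if value is None else f' {name}="{escape(value)}"'
                for name, value in attrs
            )
            self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if self.skipping:
            if tag == self.skipping:
                self.depth -= 1
                if self.depth == 0:
                    self.skipping = None
            return
        if tag not in VOID_TAGS and tag not in self.remove_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is kept only outside removed elements; it is re-escaped on output.
        if not self.skipping:
            self.parts.append(escape(data, quote=False))


def clean_html(html, remove_tags=DEFAULT_REMOVED, remove_attributes=True):
    cleaner = HtmlCleaner(remove_tags, remove_attributes)
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)


sample = ('<head><title>T</title></head><nav><a href="/">Home</a></nav>'
          '<p class="lead">Hello <b>world</b></p><img src="a.png">')
print(clean_html(sample))  # <p>Hello <b>world</b></p>
```

Comments and doctype declarations are silently dropped here because the sketch does not override those handlers; whether the node does the same is not stated in this document.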
Output Format
- Output Type
- Text: Extracts clean text content only.
- HTML: Returns the cleaned HTML structure.
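To illustrate what the Text output type might involve, here is a minimal sketch that reduces HTML to plain text with Python's standard `html.parser` module. The `TextExtractor` class and `to_text` function are hypothetical names, and the choices to skip `<script>` and `<style>` contents and to collapse whitespace are assumptions rather than documented behavior of the node.

```python
from html.parser import HTMLParser

# Assumption: contents of these elements are never part of the readable text.
SKIPPED = {"script", "style"}

# Assumption: these elements separate words, so a space is inserted around them.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "section", "article",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Hypothetical helper that collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skipping = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skipping += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            if self.skipping:
                self.skipping -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        if not self.skipping:
            self.chunks.append(data)


def to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the result reads as plain text.
    return " ".join("".join(parser.chunks).split())


page = ("<div><h1>Title</h1><p>Hello <b>wor</b>ld &amp; more</p>"
        "<script>var x = 1;</script></div>")
print(to_text(page))  # Title Hello world & more
```

Inline tags such as `<b>` add no spacing, so a word split across them stays intact, while block-level tags keep adjacent paragraphs from running together.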
Node Output
The HTML Cleaner node provides the following output:
- Output: Cleaned HTML or plain text depending on the selected format.
Example Output (Text):
Example Output (HTML):
Example Usage
1. Extract Clean Text
- HTML Input:
- Configuration:
- Remove <header>: Enabled
- Remove <script>: Enabled
- Output: Text
Output:
2. Simplify HTML Structure
- HTML Input:
- Configuration:
- Remove <footer>: Enabled
- Output: HTML
Output:
Node Functionality
The HTML Cleaner node is well suited to:
- Simplifying raw HTML before text extraction.
- Removing clutter like ads, metadata, or scripts from web-scraped content.
- Preprocessing content for downstream workflows like NLP or data analysis.
Used this way, the node ensures that later steps in a workflow receive only the relevant, cleaned content.