Node Description

The HTML Cleaner node takes HTML content and removes the elements and attributes you select in its configuration. This is useful for simplifying HTML, removing unnecessary metadata, or extracting readable text.


Node Inputs

Required Fields

  1. HTML
    The raw HTML content to be cleaned.
    Example:
    <html>
    <head><title>Example</title></head>
    <body><h1>Hello, World!</h1><script>console.log("Hi!")</script></body>
    </html>
    

Optional Fields

You can enable or disable the removal of specific HTML elements.

  1. Remove <iframe>
    Remove all <iframe> elements from the HTML.
    Default: Enabled

  2. Remove <header>
    Remove all <header> elements, including their content.
    Default: Enabled

  3. Remove <nav>
    Remove all <nav> elements, including their content.
    Default: Enabled

  4. Remove <footer>
    Remove all <footer> elements, including their content.
    Default: Enabled

  5. Remove Attributes
    Remove all attributes from the remaining tags, leaving only the bare tags. For example, <a href="/home" class="link"> becomes <a>.
    Default: Enabled

  6. Remove Additional Tags

    • <script>
    • <meta>
    • <link>
    • <style>
    • <noscript>
    • <head>
    • <img> and <svg>
    • <video>

    All of these are enabled by default. Disable any that you want to keep in the output.
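The toggles above can be pictured as a set of boolean options that resolve to a list of tag names. The following is a minimal sketch in Python, not the node's actual implementation: the `CleanerOptions` class, its field names, and the `tags_to_remove` helper are all hypothetical, and treating <img> and <svg> as a single toggle is an assumption based on how they are listed together.

```python
from dataclasses import dataclass, fields


@dataclass
class CleanerOptions:
    """Hypothetical mirror of the node's toggles; names are illustrative only."""
    remove_iframe: bool = True
    remove_header: bool = True
    remove_nav: bool = True
    remove_footer: bool = True
    remove_attributes: bool = True  # strips attributes, not a tag toggle
    remove_script: bool = True
    remove_meta: bool = True
    remove_link: bool = True
    remove_style: bool = True
    remove_noscript: bool = True
    remove_head: bool = True
    remove_img_svg: bool = True     # assumption: one toggle covers <img> and <svg>
    remove_video: bool = True

    def tags_to_remove(self) -> set[str]:
        """Return the tag names whose toggle is currently enabled."""
        tags = set()
        for f in fields(self):
            if f.name == "remove_attributes" or not getattr(self, f.name):
                continue
            # "remove_img_svg" -> {"img", "svg"}; "remove_nav" -> {"nav"}
            tags.update(f.name[len("remove_"):].split("_"))
        return tags


# Keep <nav> in the output, remove everything else that is on by default.
options = CleanerOptions(remove_nav=False)
print(sorted(options.tags_to_remove()))
```

Modeling every toggle as a boolean that defaults to `True` matches the "Default: Enabled" entries listed above, so a caller only has to name the elements they want to keep.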


Output Format

  1. Output Type
    • Text: Returns only the text content, with all markup removed.
    • HTML: Returns the cleaned HTML structure.
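The difference between the two output types can be sketched with Python's standard `html.parser` module. This is an illustration under assumptions, not the node's code: `TextCollector` and `render_output` are hypothetical names, the input is assumed to be HTML that has already been cleaned, and joining text blocks with newlines is a guess at how the node separates them.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.parts.append(text)


def render_output(cleaned_html: str, output_type: str) -> str:
    """Return the cleaned markup as-is ("HTML") or only its text ("Text")."""
    if output_type == "HTML":
        return cleaned_html
    if output_type != "Text":
        raise ValueError(f"unknown output type: {output_type!r}")
    collector = TextCollector()
    collector.feed(cleaned_html)
    collector.close()  # flush any text still buffered by the parser
    # Assumption: separate text blocks are joined with newlines.
    return "\n".join(collector.parts)


print(render_output("<h1>Hello, World!</h1>", "HTML"))
print(render_output("<h1>Hello, World!</h1>", "Text"))
```

Because the sketch only strips tags, it relies on <script> and <style> elements having been removed beforehand; otherwise their contents would show up in the text output.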

Node Output

The HTML Cleaner node provides the following output:

  • Output: Cleaned HTML or plain text, depending on the selected Output Type.

Example Output (Text), for the example HTML input above with default settings:

Hello, World!

Example Output (HTML), for the same input:

<h1>Hello, World!</h1>

Example Usage

1. Extract Clean Text

  • HTML Input:
    <html>
    <header></header>
    <body>
       <h1>Welcome!</h1>
       <script>console.log('Hi')</script>
    </body>
    </html>
    
  • Configuration:
    • Remove <header>: Enabled
    • Remove <script>: Enabled
    • Output: Text

  • Output:
    Welcome!

2. Simplify HTML Structure

  • HTML Input:
    <html>
    <body>
       <div>
          <h1>Main Title</h1>
          <footer>Footer Content</footer>
       </div>
    </body>
    </html>
    
  • Configuration:
    • Remove <footer>: Enabled
    • Output: HTML

  • Output:
    <div>
       <h1>Main Title</h1>
    </div>
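The removal step in the two examples above can be approximated outside the node with Python's standard `html.parser` module. This is a rough sketch under assumptions rather than the node's actual logic: `ElementStripper` and `strip_elements` are hypothetical names, comments and the doctype are dropped as a side effect, and the sketch leaves any <html> and <body> wrappers in place, so it is shown here on a fragment.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag or content (HTML void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ElementStripper(HTMLParser):
    """Drops selected elements with their content; optionally strips attributes."""

    def __init__(self, remove_tags, remove_attributes=True):
        # convert_charrefs=False keeps references such as &amp; untouched
        super().__init__(convert_charrefs=False)
        self.remove_tags = set(remove_tags)
        self.remove_attributes = remove_attributes
        self.skip_tag = None   # tag name of the element being removed
        self.skip_depth = 0    # > 0 while inside a removed element
        self.out = []

    def _open_tag(self, tag):
        return f"<{tag}>" if self.remove_attributes else self.get_starttag_text()

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == self.skip_tag:   # same tag nested inside the removed one
                self.skip_depth += 1
            return
        if tag in self.remove_tags:
            if tag not in VOID_TAGS:   # void elements have no content to skip
                self.skip_tag, self.skip_depth = tag, 1
            return
        self.out.append(self._open_tag(tag))

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit once, never start skipping.
        if not self.skip_depth and tag not in self.remove_tags:
            self.out.append(self._open_tag(tag))

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        if tag not in self.remove_tags:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_elements(html, remove_tags, remove_attributes=True):
    parser = ElementStripper(remove_tags, remove_attributes)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


fragment = '<div class="page"><h1 id="t">Main Title</h1><footer>Footer Content</footer></div>'
print(strip_elements(fragment, {"footer"}))
```

Tracking a skip depth instead of a simple flag keeps the sketch correct when a removed element contains another element of the same name, for example a <footer> nested inside a <footer>.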

Node Functionality

The HTML Cleaner node is well suited to:

  • Simplifying raw HTML before text extraction.
  • Removing clutter such as scripts, metadata, or embedded iframes from web-scraped content.
  • Preprocessing content for downstream workflows like NLP or data analysis.

With the default settings, the node keeps a page's main content and drops scripts, styles, media, and page chrome such as headers, navigation, and footers.