html2text is a widely used utility and programming library that converts raw HTML into clean, readable Markdown-formatted plain text. Originally written in Python by internet activist Aaron Swartz, it has since been reimplemented and extended in other programming languages, notably Rust and Go, and is also commonly used as a command-line interface (CLI) tool.
Instead of stripping away all tags and leaving a clump of unformatted text, html2text parses structural elements (like headers, lists, and links) and translates them into their Markdown equivalents.

Key Features
Header Conversion: It turns HTML heading tags (<h1>, <h2>, and so on) into # structures.
Hyperlink Handling: It rewrites <a href="url">text</a> tags into standard inline Markdown links of the form [text](url).
Customization Options: Users can toggle options to ignore links entirely, bypass table conversion, or escape special characters to avoid formatting issues.
Layout Conservation: It mimics paragraph breaks, list structures, and simple text indentation to preserve readability.

Most Common Use Cases
AI Training & LLMs: Scraping web pages to feed clean, formatted context to Large Language Models (LLMs) like ChatGPT without cluttering the prompt with raw HTML markup.
Email Fallbacks: Creating the text-only fallback part (text/plain) that accompanies automated HTML marketing emails.
Web Scraping: Parsing unstructured blog posts, documentation, or news sites directly into text files for indexing and data analysis.

Implementation Examples

1. Python Implementation
You can install the package from PyPI with pip install html2text and run the following script:
```python
import html2text

# Sample HTML input (example.com is a placeholder URL)
html_content = '<p>Hello, please visit the official <a href="https://example.com/">html2text GitHub</a>.</p>'

# Initialize the converter engine
converter = html2text.HTML2Text()
converter.ignore_links = False  # Set to True if you want plain text without URLs

# Generate clean markdown
markdown_output = converter.handle(html_content)
print(markdown_output)
# Output: Hello, please visit the official [html2text GitHub](https://example.com/).
```

2. Command-Line (CLI) Usage
You can run html2text straight from your terminal to parse files or live sites. Usage details are documented in html2text/docs/usage.md in the project's GitHub repository.
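To make the tag-to-Markdown translation a little more concrete without relying on the package or its command-line options, here is a minimal sketch built only on Python's standard library. It is not html2text's actual implementation: the names MiniMarkdownParser and convert are hypothetical, only headings, paragraphs, list items, and links are handled, and there is no line wrapping, character escaping, or table support.

```python
import re
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")
BLOCKS = ("p", "ul", "ol")


class MiniMarkdownParser(HTMLParser):
    """Hypothetical, deliberately tiny HTML-to-Markdown translator."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []  # output fragments, joined in convert()
        self.hrefs = []  # href values of the <a> tags currently open

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # <h2> becomes "## ", <h3> becomes "### ", and so on
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag in BLOCKS:
            self.parts.append("\n\n")
        elif tag == "li":
            # Simplification: ordered and unordered items both use "* "
            self.parts.append("\n* ")
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href") or "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in HEADINGS or tag in BLOCKS:
            self.parts.append("\n\n")
        elif tag == "a" and self.hrefs:
            # Close the inline link: [text](url)
            self.parts.append("](" + self.hrefs.pop() + ")")

    def handle_data(self, data):
        # Collapse whitespace runs the way a browser would render them
        self.parts.append(re.sub(r"\s+", " ", data))


def convert(html):
    """Return a Markdown approximation of the given HTML string."""
    parser = MiniMarkdownParser()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).split("\n")]
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return text.strip() + "\n"


sample = (
    '<h2>Docs</h2><p>Visit <a href="https://example.com/">the site</a>.</p>'
    "<ul><li>one</li><li>two</li></ul>"
)
print(convert(sample))
```

Saved as a script, replacing the final print call with sys.stdout.write(convert(sys.stdin.read())) (plus an import sys line) would turn the sketch into a simple filter that reads HTML on standard input and writes Markdown to standard output.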