Comparison of HTML parsers
HTML parsers are software components that automatically parse Hypertext Markup Language (HTML). They serve two main purposes:
- HTML traversal: offering programmers an interface to access and modify the parsed HTML markup. Canonical example: DOM parsers.
- HTML cleaning: fixing invalid HTML and improving the layout and indent style of the resulting markup. Canonical example: HTML Tidy.
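As a rough illustration of the traversal purpose, the sketch below collects link targets from an HTML string. It uses Python's standard-library `html.parser` module, which is an event-based tokenizer rather than a DOM parser and is not one of the parsers compared in this article; the names `LinkCollector` and `extract_links` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag encountered (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, where value may be None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href targets found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

A call such as `extract_links('<a href="a.html">A</a>')` returns a list of the link targets found. A DOM-style parser would instead build a tree that can be queried and modified after parsing.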
Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
---|---|---|---|---|---|---|---|
HTML Tidy | W3C license | ANSI C | 2021-07-17[2] | Yes[3] | Yes | Yes[3] | Yes |
HtmlUnit | Apache License 2.0 | Java | 2023-10-31[4] | Yes | ? | No | No |
Beautiful Soup | MIT License | Python | 2023-04-07[5] | Yes | Yes | ? | No |
jsoup | MIT License | Java | 2023-12-29[6] | Yes | Yes | Yes | Yes |
- * Date of the latest release that contained significant changes.
- ** Sanitizes HTML code (generating a standards-compatible web page, reducing spam, etc.) and cleans it (stripping out surplus presentational tags, removing XSS code, etc.).
- *** Updates HTML 4.x to XHTML or to HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
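The CENTER-to-DIV conversion mentioned in the last note can be sketched with Python's standard-library `html.parser`. This is a minimal illustration under stated assumptions, not how HTML Tidy or jsoup implement it: the names `CenterRewriter` and `rewrite_center` are hypothetical, attributes on the original CENTER tag are discarded, and processing instructions are not preserved.

```python
from html.parser import HTMLParser


class CenterRewriter(HTMLParser):
    """Re-emit markup, replacing <center> with a styled <div> (illustrative)."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Assumption: any attributes on <center> are dropped.
            self.out.append('<div style="text-align:center;">')
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite_center(html_text):
    """Return html_text with CENTER elements converted to styled DIVs."""
    parser = CenterRewriter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Because the parser lowercases tag names, end tags are re-emitted in lowercase even when the input used uppercase.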