Comparison of HTML parsers

HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:

HTML traversal: offering an interface through which programmers can easily access and modify the HTML markup. Canonical example: DOM parsers.
HTML cleaning: fixing invalid HTML and improving the layout and indentation style of the resulting markup. Canonical example: HTML Tidy.
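
As a rough illustration of the traversal purpose, the sketch below collects link targets from a fragment of markup. It assumes Python's standard-library `html.parser` module, which is not one of the parsers compared in this article and is an event-based tokenizer rather than a DOM or HTML5-compliant tree builder. The names `LinkCollector`, `collect_links`, and `sample` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, with value None for valueless attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html):
    """Return the href values of all links in the given markup."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Illustrative input: mixed-case tags and an unclosed <p> are tolerated.
sample = '<p>See <a href="https://example.org/a">A</a> and <A HREF="/b">B</a>.'
print(collect_links(sample))  # ['https://example.org/a', '/b']
```

A DOM-style parser would instead build a tree that can be queried and modified after parsing; the callback approach here only shows the access side of traversal.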

Parser	License	Implementation language(s)	Latest release date*	HTML parsing^[1]	HTML5-compliant parsing	Clean HTML**	Update HTML***
HTML Tidy	W3C license	ANSI C	2021-07-17^[2]	Yes^[3]	Yes	Yes^[3]	Yes
HtmlUnit	Apache License 2.0	Java	2023-10-31^[4]	Yes	?	No	No
Beautiful Soup	MIT License	Python	2023-04-07^[5]	Yes	Yes	?	No
jsoup	MIT License	Java	2023-12-29^[6]	Yes	Yes	Yes	Yes

* Date of the latest release containing significant changes.

** Sanitizes (generates a standards-compatible web page, reduces spam, etc.) and cleans (strips out surplus presentational tags, removes XSS code, etc.) HTML code.

*** Updates HTML 4.x to XHTML or to HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
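
The CENTER-to-DIV conversion mentioned in the last note can be sketched as follows. This is a minimal illustration assuming Python's standard-library `html.parser`, not the method used by HTML Tidy or jsoup. The names `CenterToDiv`, `update_center`, and `CENTER_STYLE` are hypothetical. The sketch re-serializes the token stream, so it does not repair invalid nesting, and a production tool would handle many more cases.

```python
from html import escape
from html.parser import HTMLParser

CENTER_STYLE = "text-align:center;"


class CenterToDiv(HTMLParser):
    """Re-emits markup, rewriting <center> as <div style="text-align:center;">."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_start(self, tag, attrs, closing=">"):
        if tag == "center":
            tag = "div"
            # Merge any existing style attribute into the centering style.
            existing = "".join(v or "" for n, v in attrs if n == "style")
            attrs = [(n, v) for n, v in attrs if n != "style"]
            attrs.append(("style", CENTER_STYLE + existing))
        rendered = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}{closing}")

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, closing=" />")

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def update_center(html):
    """Return the markup with every CENTER element rewritten as a styled DIV."""
    parser = CenterToDiv()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(update_center("<CENTER>Hi &amp; bye</CENTER>"))
# <div style="text-align:center;">Hi &amp; bye</div>
```

Tag names are lowercased on output because the tokenizer reports them that way; other elements pass through with their attributes re-quoted.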

References

[1] 12.2 Parsing HTML documents - HTML Standard. Archived 2013-01-16 at the Wayback Machine.
[2] HTML Tidy release 5.8.0
[3] What is Tidy?
[4] HtmlUnit 3.7.0
[5] Beautiful Soup release 4.10
[6] jsoup Java HTML Parser release 1.17.2
