Allow extracting of comments from an HTML document #121

Closed
@tobstarr

I wonder if there is an easy way to extract comments embedded inside an HTML document.

I tried using html5ever with Floki. With the default parser, comments are present in the parsed document as

{:comment, "My Comment"}

but when I switch the parser to html5ever, they are stripped. This can be reproduced by running:

html = """
<html><title>Some Title</title><body><!-- some comment --></body></html>
"""

Floki.parse_document(html)
|> IO.inspect()

Floki.parse_document(html, html_parser: Floki.HTMLParser.Html5ever)
|> IO.inspect()

which produces this output:

{:ok,
 [
   {"html", [],
    [{"title", [], ["Some Title"]}, {"body", [], [comment: " some comment "]}]}
 ]}
{:ok,
 [
   {"html", [],
    [{"head", [], [{"title", [], ["Some Title"]}]}, {"body", [], ["\n"]}]}
 ]}
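
For the default parser's output, the comments can be collected with a small recursive walk over the parsed tree. This is a minimal sketch that assumes the tuple shapes shown in the first output above; `CommentExtractor` is a hypothetical helper, not part of Floki, and it cannot recover anything from the html5ever result, where the comments are already missing from the tree.

```elixir
defmodule CommentExtractor do
  # Hypothetical helper (not part of Floki): collects the text of every
  # {:comment, text} node found anywhere in a parsed tree.

  # A list of nodes: collect comments from each node in order.
  def comments(nodes) when is_list(nodes), do: Enum.flat_map(nodes, &comments/1)

  # A comment node: return its text.
  def comments({:comment, text}), do: [text]

  # An element node {tag, attributes, children}: descend into the children.
  def comments({_tag, _attrs, children}), do: comments(children)

  # Text nodes and anything else carry no comments.
  def comments(_other), do: []
end

# The tree below is copied from the default parser's output in this issue.
tree = [
  {"html", [],
   [{"title", [], ["Some Title"]}, {"body", [], [comment: " some comment "]}]}
]

CommentExtractor.comments(tree)
|> IO.inspect()
# prints [" some comment "]
```

Running the same function on the html5ever result returns an empty list, which matches the behaviour reported here.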
