Can't Parse HTML from HTMLDocument #2780

liamh101 opened this issue May 21, 2025 · 2 comments

liamh101 commented May 21, 2025

Describe the bug and add attachments

When building a document from HTML with the static call Html::addHtml(), the call fails if the content contains an img tag inside a p tag. It throws the DOMDocument exception DOMDocument::loadXML(): Opening and ending tag mismatch.

The HTML was generated with PHP 8.4's Dom\HTMLDocument class.

Expected behavior

The HTML is accepted as valid HTML.

Is there an easy way to mitigate this?

Steps to reproduce

<?php

// Assumes PHPWord was installed with Composer.
require __DIR__ . '/vendor/autoload.php';

$dom = Dom\HTMLDocument::createFromString(<<<'HTML'
<!DOCTYPE html>
<html>
<body>
   <p><img style="aspect-ratio:12/13;" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAwAAAANCAYAAACdKY9CAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAAEnQAABJ0Ad5mH3gAAAJuaVRYdFNuaXBNZXRhZGF0YQAAAAAAeyJjbGlwUG9pbnRzIjpbeyJ4IjoxLCJ5IjozfSx7IngiOjIsInkiOjR9LHsieCI6MiwieSI6NX0seyJ4IjozLCJ5Ijo2fSx7IngiOjQsInkiOjd9LHsieCI6NCwieSI6OH0seyJ4Ijo0LCJ5Ijo5fSx7IngiOjUsInkiOjEwfSx7IngiOjUsInkiOjExfSx7IngiOjYsInkiOjEyfSx7IngiOjYsInkiOjEzfSx7IngiOjcsInkiOjEzfSx7IngiOjcsInkiOjE0fSx7IngiOjgsInkiOjE0fSx7IngiOjksInkiOjE0fSx7IngiOjEwLCJ5IjoxNH0seyJ4IjoxMSwieSI6MTR9LHsieCI6MTEsInkiOjEzfSx7IngiOjEyLCJ5IjoxMn0seyJ4IjoxMiwieSI6MTF9LHsieCI6MTIsInkiOjEwfSx7IngiOjEyLCJ5Ijo5fSx7IngiOjEyLCJ5Ijo4fSx7IngiOjEyLCJ5Ijo3fSx7IngiOjEyLCJ5Ijo1fSx7IngiOjEyLCJ5Ijo0fSx7IngiOjEyLCJ5IjozfSx7IngiOjEyLCJ5IjoyfSx7IngiOjExLCJ5IjoxfSx7IngiOjEwLCJ5IjowfSx7IngiOjksInkiOjB9LHsieCI6OCwieSI6MH0seyJ4Ijo3LCJ5IjowfSx7IngiOjYsInkiOjB9LHsieCI6NSwieSI6MH0seyJ4Ijo0LCJ5IjowfSx7IngiOjMsInkiOjB9LHsieCI6MiwieSI6MH0seyJ4IjoxLCJ5IjowfSx7IngiOjAsInkiOjB9XX0Gg0zKAAAAg0lEQVQoU42PwQ2AIAxFW3ULEoi6iGzkJo6EI7gCB&#43;8OAEEbGxIjobwLtMnPf0Wtxw0QVxCIIdrz9DsaMy8JkuN9FQp1AOHgWaQfetd57y9K8k4E&#43;QVtpsTfKo/SS2tLbiCUMgt58ljkEyAktazUyi8g3fJTImpaRaVaS7GBKLXEEO0NwE0ruorm1rsAAAAASUVORK5CYII&#61;" width="12" height="13" /></p>
</body>
</html>
HTML);

$doc = new PhpOffice\PhpWord\PhpWord();
$section = $doc->addSection([
    'headerHeight' => PhpOffice\PhpWord\Shared\Converter::cmToTwip(1.54),
]);

$html = $dom->saveHtml();

PhpOffice\PhpWord\Shared\Html::addHtml($section, $html);

PHPWord version(s) where the bug happened

1.3.0

PHP version(s) where the bug happened

8.4

Priority

  • I want to crowdfund the bug fix (with @algora-io) and fund a community developer.
  • I want to pay the bug fix and fund a maintainer for that. (Contact @Progi1984)
@Progi1984 Progi1984 added the HTML label May 28, 2025
@Progi1984 Progi1984 added this to the 1.4.0 milestone May 28, 2025
@michalschroeder
Contributor

Hey @liamh101

The issue isn’t related to the img tag being inside a p element. The actual problem is the unclosed img tag in the HTML that’s passed to Html::addHtml(). $dom->saveHtml() outputs void elements like <img> without a closing slash, and because the input is then parsed as XML (the error comes from DOMDocument::loadXML()), the unclosed tag produces the tag mismatch.

To ensure proper formatting and valid output, especially for tags like img, br, and hr, please use $dom->saveXml() instead. This will generate properly closed tags in the output.
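
The difference between the two serializers can be seen with a minimal sketch. It assumes PHP 8.4 or later, and the markup is a shortened stand-in for the document in the report (the file name a.png is made up for illustration). The exact whitespace of the XML output is not asserted here.

```php
<?php
// Requires PHP 8.4+ for Dom\HTMLDocument.
// Shortened stand-in for the reported markup; "a.png" is a placeholder.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><html><body><p><img src="a.png" width="12" height="13"></p></body></html>'
);

// HTML serialization: void elements such as <img> get no closing slash.
$asHtml = $dom->saveHtml();

// XML serialization: the same element is written as a self-closed tag.
$asXml = $dom->saveXml();

echo $asHtml, PHP_EOL;
echo $asXml, PHP_EOL;
```

Only the second string is well-formed XML, which is what an XML parser such as DOMDocument::loadXML() needs.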

Author

liamh101 commented Jun 1, 2025

Hey @michalschroeder,

That is exactly what we did to mitigate this. A note for anyone else: make sure to pass the LIBXML_NOXMLDECL flag as an option to saveXml(), so that the output does not include an XML declaration.
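
A minimal sketch of that flag in use, assuming PHP 8.4 or later. The markup is again a shortened stand-in for the document in the report, and the variable names are illustrative only.

```php
<?php
// Requires PHP 8.4+ for Dom\HTMLDocument.
// Shortened stand-in for the reported markup; "a.png" is a placeholder.
$dom = Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><html><body><p><img src="a.png" width="12" height="13"></p></body></html>'
);

// Default XML serialization starts with an <?xml ...?> declaration.
$withDecl = $dom->saveXml();

// First argument: node to serialize (null means the whole document).
// Second argument: libxml option flags; LIBXML_NOXMLDECL drops the declaration.
$withoutDecl = $dom->saveXml(null, LIBXML_NOXMLDECL);

echo $withDecl, PHP_EOL;
echo $withoutDecl, PHP_EOL;
```

In the reproduction script above, $withoutDecl would take the place of the $html value passed to Html::addHtml().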

@Progi1984 Progi1984 modified the milestones: 1.4.0, 2.0.0 Jun 5, 2025