I have some terribly formed HTML, thanks to MS Word 10’s “Save as htm, html”. Here’s a sample of what I’m trying to sanitize:
<html xmlns:v="urn:schemas-microsoft-com:vml"... other xmlns>
<head>
<meta tags, title, styles, a couple comments too (they are irrelevant to the question)>
</head>
<body lang=EN-US link=blue vlink=purple style='tab-interval:36.0pt'>
<div class=WordSection1>
<h1>Pros and Cons of a Website</h1>
<p class=MsoBodyText align=left style='a long irrelevant list'><span style='long list'><o:p> </o:p></span></p>
(The line above is a sample of what Word uses as line breaks. Take note of the <o:p> tag.)
<p class=MsoBodyText style='margin-right:5.75pt;line-height:115%'>
A<span style='letter-spacing:.05pt'> </span>SAMPLE<span style='letter-spacing:.05pt'> </span>TEXT
</p>
</div>
<div class=WordSection2>...same pattern in div 1</div>
<div class=WordSection3>...same...</div>
</body>
</html>
What I need from all of this is:
<div>...A SAMPLE TEXT</div>
<div>...same pattern in div 1</div>
<div>...same...</div>
What I have so far:
$dom = new DOMDocument;
$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$body = $xpath->query('//html/body');
$nodes = $body->item(0)->getElementsByTagName('*');
foreach ($nodes as $node) {
    if ($node->tagName == 'script') $node->parentNode->removeChild($node);
    if ($node->tagName == 'a') continue;
    $attrs = $xpath->query('@*', $node);
    foreach ($attrs as $attr) {
        $attr->parentNode->removeAttribute($attr->nodeName);
    }
}
echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($body->item(0)));
It gives me:
<body lang="EN-US" link="blue" vlink="purple" style="tab-interval:36.0pt">
<div>
<h1>Pros and Cons of a Website</h1>
<p><p> </p></p>
<p>A SAMPLE TEXT</p>
</div>
<div>...same pattern in div 1</div>
<div>...same...</div>
</body>
That is mostly what I want, but I also want the body tag removed. I want the h1 and its content removed too, but when I change the first check to:
if($node->tagName=='script' || $node->tagName=='h1') $node->parentNode->removeChild($node);
something weird happens:
<p><p> </p></p> becomes <p class="MsoBodyText" ...all the very long attributes I was trying to remove in the first place><p> </p></p>
I’ve come across some very good answers like:
- How to get innerHTML of DOMNode? (I don’t know how to properly implement Haim Evgi’s answer or Keyacom’s answer. Marco Marsala’s answer is the closest I got, but the divs all kept their classes.)
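One possible explanation for the h1 behaviour, offered as a sketch rather than a confirmed diagnosis: getElementsByTagName() returns a live list, so removing a node inside the loop can make the iteration skip the element that follows it, which would leave that element's attributes untouched. The version below copies the list with iterator_to_array() before looping and then prints only the children of body, which is the innerHTML idea from the linked question. The $filecontent value here is a shortened, made-up stand-in for the Word export, and the $html variable is a name introduced only for this example.

```php
<?php
// Hypothetical, shortened stand-in for the Word export (assumption, not the real file).
$filecontent = <<<'HTML'
<html><head><title>t</title></head>
<body lang=EN-US link=blue vlink=purple>
<div class=WordSection1>
<h1>Pros and Cons of a Website</h1>
<p class=MsoBodyText align=left><span style='x'><o:p> </o:p></span></p>
<p class=MsoBodyText style='line-height:115%'>A<span style='letter-spacing:.05pt'> </span>SAMPLE<span style='letter-spacing:.05pt'> </span>TEXT</p>
</div>
<div class=WordSection2>second</div>
</body></html>
HTML;

libxml_use_internal_errors(true); // keep parser warnings about Word's tags quiet
$dom = new DOMDocument;
$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$body = $dom->getElementsByTagName('body')->item(0);

// Copy the live list into a plain array so removals cannot shift the iteration.
$nodes = iterator_to_array($body->getElementsByTagName('*'));

foreach ($nodes as $node) {
    // Drop unwanted elements entirely and move on to the next node.
    if ($node->tagName == 'script' || $node->tagName == 'h1') {
        $node->parentNode->removeChild($node);
        continue;
    }
    // Leave anchors untouched, as in the original loop.
    if ($node->tagName == 'a') {
        continue;
    }
    // Copy the attribute map too before removing from it.
    foreach (iterator_to_array($node->attributes) as $attr) {
        $node->removeAttributeNode($attr);
    }
}

// Serialize only the children of body, so the body tag itself is left out.
$html = '';
foreach ($body->childNodes as $child) {
    $html .= $dom->saveHTML($child);
}

// Same span stripping as before; it only matches spans that have no attributes left.
$html = str_ireplace(['<span>', '</span>'], '', $html);
echo $html;
```

This keeps the p tags inside each div, so further unwrapping would still be needed if only the text should remain.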