Situation
My goal is to remove ALL text from an HTML file. The file is part of an ePub (electronic book).
Example input and output
(It is a chapter of a novel, contained within a single xhtml file.)
<body id="">
<div>
<h3 class="h3"><a id="_idTextAnchor026"></a><a id="_idTextAnchor027"></a> Chapter title</h3>
<p>Paragraph 1. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>Paragraph 2. It contains an `a` element.<a id="_idTextAnchor026"></a>. The `a` element should remain intact!</p>
</div>
</body>
After processing the file or string, the output should be like this:
<body id="">
<div>
<h3 class="h3"><a id="_idTextAnchor026"></a><a id="_idTextAnchor027"></a></h3>
<p></p>
<p><a id="_idTextAnchor026"></a></p>
</div>
</body>
Success conditions:
- In the output, all HTML elements are intact. Therefore the structure of both the HTML file and the ePub is also intact.
- The special focus is on <a> elements, which are a vital part of the whole book’s structure. They are referred to in the table of contents, file manifests and so on.
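To make the conditions above concrete, here is a minimal sketch of the transformation. It assumes PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than any third-party parser, and it assumes the chapter file is well-formed XML with no undeclared named entities. The function name `stripBodyText` is hypothetical.

```php
<?php
// Hypothetical helper: removes every non-whitespace text node inside <body>
// while leaving all elements and attributes (including empty <a id="..."> anchors) in place.
function stripBodyText(string $xhtml): string
{
    $doc = new DOMDocument();
    // ePub chapters are XHTML, so they are parsed as XML here (assumption: well-formed input).
    if (!$doc->loadXML($xhtml)) {
        throw new InvalidArgumentException('Input is not well-formed XML.');
    }

    $xpath = new DOMXPath($doc);
    // local-name() matches <body> whether or not the XHTML namespace is declared.
    // normalize-space() skips whitespace-only nodes, so indentation between tags survives.
    $query = '//*[local-name()="body"]//text()[normalize-space()]';

    // Copy the result into an array first, then remove each text node from its parent.
    foreach (iterator_to_array($xpath->query($query)) as $textNode) {
        $textNode->parentNode->removeChild($textNode);
    }

    // LIBXML_NOEMPTYTAG writes <a id="x"></a> instead of <a id="x"/>.
    return $doc->saveXML(null, LIBXML_NOEMPTYTAG);
}
```

Note that LIBXML_NOEMPTYTAG applies to every empty element, so a `<br/>` would also be written as `<br></br>`. Whether that is acceptable depends on the ePub validator in use.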
What I have tried
I am a beginner in HTML parsing. All the solutions I found lack documentation, which blocks my progress in resolving issues.
I have used this repo:
https://github.com/paquettg
and tried to modify this code:
$dom = new Dom;
$dom->loadStr('<div class="all"><p>Hey bro, <a href="google.com">click here1</a><a href="google.com">click here2</a><br /> :)</p></div>');
/** @var Dom\Node\InnerNode $a */
$a = $dom->find('a');
$a->childNodes()->setText('(deleted)');
echo $dom;
Issues
- I don’t know how to find all <a> elements. Replacing firstChild with my best guesses has not worked.
- Also, I have no clue how to handle multiple element types: p, a, h1 to h5, span, b, i, cite, div.
I was unable to find a command to select all <a>, <h3> and <p> elements, and I found no documentation for it.
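One possible way to select several element types at once is an XPath predicate built from a list of tag names. The sketch below again assumes PHP's bundled DOM extension (not the library used in the attempt above) and PHP 7.4 or later for the arrow function. The helper name `clearTextIn` is hypothetical, and the tag names are assumed to come from a fixed, trusted list.

```php
<?php
// Hypothetical helper: removes text nodes that are direct children of the listed
// element types and returns how many text nodes were removed.
function clearTextIn(DOMDocument $doc, array $tagNames): int
{
    $xpath = new DOMXPath($doc);

    // Build a predicate such as: local-name()="p" or local-name()="h3"
    $tests = array_map(fn(string $tag): string => sprintf('local-name()="%s"', $tag), $tagNames);
    $query = '//*[' . implode(' or ', $tests) . ']/text()';

    $removed = 0;
    // Copy the result into an array first, then remove each text node from its parent.
    foreach (iterator_to_array($xpath->query($query)) as $textNode) {
        $textNode->parentNode->removeChild($textNode);
        $removed++;
    }
    return $removed;
}
```

Called as `clearTextIn($doc, ['p', 'a', 'h3'])`, this only touches text sitting directly inside those elements. Text inside an unlisted element such as `<span>` is left alone, so the list would need to include every element type that can hold text.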
Just a sidenote
- My purpose is to create an ePub file with a book sample. The task is to keep the first 30% of a book and delete all remaining chapters. The annex and everything at the end of the book should remain as is.
- The common practices are time-consuming and/or damage the ePub structure. As a result, the samples often do not pass automated checks.
- After I solve this issue, I plan to develop a clean-up app to remove all unwanted class and id attributes, replace spaces with non-breaking spaces and so on.