How to remove all text from inside HTML elements of certain types in an XHTML file (in ePub)

Situation

My goal is to remove ALL text from an HTML file. The file is part of an ePub (electronic book).

Example input and output

(It is a chapter of a novel, contained within a single xhtml file.)

<body id="">
  <div>
    <h3 class="h3"><a id="_idTextAnchor026"></a><a id="_idTextAnchor027"></a> Chapter title</h3>

    <p>Paragraph 1. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
    <p>Paragraph 2. It contains an `a` element.<a id="_idTextAnchor026"></a>. The `a` element should remain intact!</p>

After processing the file or string, the output should be like this:

<body id="">
  <div>
    <h3 class="h3"><a id="_idTextAnchor026"></a><a id="_idTextAnchor027"></a></h3>

    <p></p>
    <p><a id="_idTextAnchor026"></a></p>

Success conditions:

  • In the output, all HTML elements are intact. Therefore the structure of both the HTML file and the ePub is also intact.
  • The special focus is on <a> elements, which are a vital part of the whole book’s structure. They are referred to in the table of contents, file manifests and so on.
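
A minimal sketch of these success conditions, assuming PHP's built-in DOM extension (DOMDocument and DOMXPath) is an acceptable substitute for a third-party parser and that the chapter is well-formed XHTML. The function name stripAllText and the sample string are hypothetical. The XPath expression matches on local-name() so that it should also work when the file declares the XHTML namespace.

```php
<?php
// Sketch only: removes every text node inside <body>, keeps all elements and attributes.
// Assumption: the input is well-formed XML (XHTML), as ePub content documents should be.

function stripAllText(string $xhtml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xhtml);

    $xpath = new DOMXPath($doc);
    // Every text node that has a <body> ancestor, regardless of namespace.
    $textNodes = $xpath->query('//*[local-name()="body"]//text()');

    // Copy to an array first so removal cannot disturb the iteration.
    foreach (iterator_to_array($textNodes) as $textNode) {
        $textNode->parentNode->removeChild($textNode);
    }

    // LIBXML_NOEMPTYTAG writes <p></p> instead of <p/>.
    return $doc->saveXML($doc->documentElement, LIBXML_NOEMPTYTAG);
}

// Hypothetical sample input modelled on the chapter above.
$input = '<body id=""><div>'
    . '<h3 class="h3"><a id="_idTextAnchor026"></a> Chapter title</h3>'
    . '<p>Text before.<a id="x"></a>. Text after.</p>'
    . '</div></body>';

$output = stripAllText($input);
echo $output;
```

Two side effects are worth checking before relying on this: whitespace-only text nodes are removed too, so the indentation of the file is lost, and LIBXML_NOEMPTYTAG also expands void elements such as <br/> into <br></br>.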

What I have tried

I am a beginner in HTML parsing. The solutions I have found lack documentation, which blocks my progress in resolving issues.

I have used this repo:
https://github.com/paquettg
and tried to modify this code:

$dom = new Dom;
$dom->loadStr('<div class="all"><p>Hey bro, <a href="google.com">click here1</a><a href="google.com">click here2</a><br /> :)</p></div>');
/** @var DomNodeInnerNode $a */
$a = $dom->find('a');
$a->childNodes()->setText('(deleted)');
echo $dom;

Issues

  • I don’t know how to find all <a> elements. Replacing firstChild with my best guesses has not worked.
  • Also, I have no clue how to handle multiple element types: p, a, h1 to h5, span, b, i, cite, div.
    I was unable to find a command that selects all <a>, <h3> and <p> elements at once, and I found no documentation for it.
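
One way the multi-element selection might be approached is sketched below. It assumes PHP's built-in DOMDocument and DOMXPath rather than the library used above, and needs PHP 7.4 or later for the arrow function. The function name stripTextInside and the sample markup are hypothetical. A single XPath expression is built from a list of tag names, and only the text that sits directly inside those elements is removed.

```php
<?php
// Sketch only: removes text that is a direct child of the listed element types.
// Assumptions: well-formed XHTML input; tag names are plain names without quotes.

function stripTextInside(string $xhtml, array $tagNames): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xhtml);
    $xpath = new DOMXPath($doc);

    // Builds e.g.: local-name()="h3" or local-name()="p" or local-name()="a"
    $conditions = array_map(
        fn(string $name): string => sprintf('local-name()="%s"', $name),
        $tagNames
    );
    $query = '//*[' . implode(' or ', $conditions) . ']/text()';

    // Copy to an array first so removal cannot disturb the iteration.
    foreach (iterator_to_array($xpath->query($query)) as $textNode) {
        $textNode->parentNode->removeChild($textNode);
    }

    // LIBXML_NOEMPTYTAG writes <a id="..."></a> instead of <a id="..."/>.
    return $doc->saveXML($doc->documentElement, LIBXML_NOEMPTYTAG);
}

// Hypothetical sample: span is not in the list, so its text stays.
$chapter = '<div><h3><a id="t1"></a> Title</h3><p>Body <span>kept</span> text</p></div>';
$result = stripTextInside($chapter, ['h3', 'p', 'a']);
echo $result;
```

Adding more names to the array (h1 to h5, span, b, i, cite, div) widens the selection without changing the query logic.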

Just a sidenote

  • My purpose is to create an ePub file with a book sample. The task is to keep the first 30% of a book and delete all remaining chapters. The annex and everything else at the end of the book should remain as is.
  • The common practices are time-consuming and/or damage the ePub structure. As a result, the samples often do not pass automatic validation checks.
  • After I solve this issue, I plan to develop a clean-up app to remove all unwanted class and id attributes, replace spaces with non-breaking spaces and so on.