How do you convert HTML to text efficiently in Node.js, i.e. outside of the browser? I also want to convert entities like &auml; to ä, etc., and not just remove the tags from the HTML.
Here is a Jest unit test for a function convertHtmlToText that does this conversion:
it('when extract from partial html should extract text', () => {
  const html = `<p> &auml;&uuml;
\t<img alt="" src="http://www.test.org:80/imageupload/userfiles/2/images/world med new - 2022.jpg" style="width: 2000px; height: 1047px; max-width: 100%; height: auto;" /></p>
<p>
\tAn evening of music, silence and guiding thoughts to help us experience inner peace, connect with the Divine and share loving vibrations with the world. Join millions of people throughout the world to contribute in creating a wave of peace.</p>
<div>
\t </div>
<div>
\t<strong>Please join ....</strong></div>
<div>
\t </div>
<div>
\t<strong>Watch live: <a href="https://test.org/watchlive" target="_blank">test.org/watchlive</a></strong></div>`;
  const text = convertHtmlToText(html);
  console.log(text);
  expect(text).toContain("ä");
  expect(text).toContain("ü");
  // expect.not.stringContaining(...) on its own only builds a matcher and asserts nothing
  expect(text).not.toContain("<");
  expect(text).not.toContain(">");
});
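
For reference, here is a rough, dependency-free sketch of what such a convertHtmlToText could look like. It is only an illustration under some assumptions: tags are stripped with regular expressions rather than a real HTML parser, and NAMED_ENTITIES and decodeEntities are hypothetical names with a deliberately tiny entity table that would need extending for real input.

```javascript
// Sketch only: regex-based tag stripping plus decoding of a small,
// hand-picked entity table (an assumption, not a complete list).
const NAMED_ENTITIES = {
  amp: "&", lt: "<", gt: ">", quot: '"', apos: "'", nbsp: "\u00a0",
  auml: "ä", ouml: "ö", uuml: "ü", Auml: "Ä", Ouml: "Ö", Uuml: "Ü", szlig: "ß",
};

// Decode numeric (&#228; / &#xE4;) and known named (&auml;) entities.
// Unknown or out-of-range entities are left unchanged.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

function convertHtmlToText(html) {
  const withoutTags = html
    // Drop script/style elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    // Replace every remaining tag with a space so adjacent words stay separated.
    .replace(/<[^>]*>/g, " ");
  // Decode entities after stripping tags, then collapse runs of whitespace
  // (including the non-breaking space produced by &nbsp;).
  return decodeEntities(withoutTags).replace(/\s+/g, " ").trim();
}
```

Note that this approach breaks on attribute values containing ">" and, because entities are decoded after the tags are removed, an escaped &lt; in the input will legitimately show up as "<" in the output. For arbitrary real-world HTML, a parser-based package (for example html-to-text from npm) is probably the more robust choice.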