I’m trying to extract the text from a PDF using Smalot PdfParser.
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($myfile);
$text = $pdf->getText();
It works fine, except that non-ASCII characters (like æ ø å ü ẞ) seem to cut words up. A word like “Banegård” comes out as the two words “Baneg” and “rd”, and the å character is gone.
This is from very simple PDF files that were written with LibreOffice Writer using default settings. So nothing fancy.
I’m a bit surprised that Googling gives me nothing. Surely this is a fairly straightforward thing here in 2024? What am I missing?