I’m building a Laravel application that aggregates news from multiple websites using DomCrawler. The scraper keeps collecting duplicate content: articles that have the same meaning but different wording.
What I’ve tried:
- Basic cosine similarity with TF-IDF vectors
- Installing sentence-transformers (installation failed)
- Exact string matching with hashes
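
For illustration, here is a minimal plain-PHP sketch of a TF-IDF cosine comparison of the kind listed above. It is not the application's actual code: the function names `tokenize`, `tfidfVectors`, and `cosineSimilarity`, the smoothed IDF formula, and the sample headlines are all assumptions made for this example.

```php
<?php

declare(strict_types=1);

/**
 * Lowercase a text and split it into word tokens (hypothetical helper).
 *
 * @return list<string>
 */
function tokenize(string $text): array
{
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    return $words === false ? [] : $words;
}

/**
 * Build one sparse TF-IDF vector (term => weight) per document.
 *
 * @param list<string> $documents
 * @return list<array<string, float>>
 */
function tfidfVectors(array $documents): array
{
    $tokenized = array_map('tokenize', $documents);
    $docCount = count($documents);

    // Document frequency: the number of documents each term occurs in.
    $df = [];
    foreach ($tokenized as $tokens) {
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $vectors = [];
    foreach ($tokenized as $tokens) {
        $vector = [];
        $total = max(count($tokens), 1);
        foreach (array_count_values($tokens) as $term => $count) {
            // Smoothed IDF (an assumption): terms found in every document keep a non-zero weight.
            $idf = log((1 + $docCount) / (1 + $df[$term])) + 1.0;
            $vector[$term] = ($count / $total) * $idf;
        }
        $vectors[] = $vector;
    }

    return $vectors;
}

/**
 * Cosine similarity between two sparse term => weight vectors.
 */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        $dot += $weight * ($b[$term] ?? 0.0);
    }

    $normA = sqrt(array_sum(array_map(fn (float $w): float => $w * $w, $a)));
    $normB = sqrt(array_sum(array_map(fn (float $w): float => $w * $w, $b)));

    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // An empty document is treated as similar to nothing.
    }

    return $dot / ($normA * $normB);
}

// Made-up headlines: the first two are identical, the third is a paraphrase.
$vectors = tfidfVectors([
    'The central bank raised interest rates on Tuesday',
    'The central bank raised interest rates on Tuesday',
    'Borrowing costs were hiked by monetary policymakers this week',
]);

printf("identical:  %.3f\n", cosineSimilarity($vectors[0], $vectors[1]));
printf("paraphrase: %.3f\n", cosineSimilarity($vectors[0], $vectors[2]));
```

The paraphrased headline shares no tokens with the first one, so its score is zero even though the meaning is the same. That is the limitation of purely lexical similarity that this question is about.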
Current setup:
- Laravel 10
- PHP 8.2
- guzzlehttp/guzzle for HTTP requests
- symfony/dom-crawler for parsing
Error with sentence-transformers:
How can I implement effective semantic deduplication in a PHP/Laravel environment? Are there native PHP solutions or reliable API services for this purpose?