How to prevent duplicate news articles with similar meanings in a Laravel web scraping project?

I’m building a Laravel application that aggregates news from multiple websites using DomCrawler. The scraper ends up storing duplicates: articles from different sites that cover the same story with the same meaning but different wording.

What I’ve tried:

  • Basic cosine similarity with TF-IDF vectors
  • Attempted using sentence-transformers (installation failed)
  • Exact string matching with hashes
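
For context, the TF-IDF attempt amounts to roughly the sketch below. It is a simplified illustration, not the project's actual code: the function names (`tokenize`, `tfidfVectors`, `cosineSimilarity`), the regex tokenizer, and the smoothed IDF formula are all assumptions made for the example.

```php
<?php

declare(strict_types=1);

/** Lowercase the text and split it into word tokens (hypothetical helper). */
function tokenize(string $text): array
{
    return preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

/**
 * Build one sparse TF-IDF vector (term => weight) per document.
 * Assumes smoothed IDF: log((1 + N) / (1 + df)) + 1.
 */
function tfidfVectors(array $documents): array
{
    $tokenized = array_map('tokenize', $documents);
    $n = count($tokenized);

    // Document frequency: number of documents each term appears in.
    $df = [];
    foreach ($tokenized as $tokens) {
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $vectors = [];
    foreach ($tokenized as $tokens) {
        $vector = [];
        foreach (array_count_values($tokens) as $term => $count) {
            $idf = log((1 + $n) / (1 + $df[$term])) + 1;
            $vector[$term] = ($count / count($tokens)) * $idf;
        }
        $vectors[] = $vector;
    }

    return $vectors;
}

/** Cosine similarity between two sparse term => weight vectors. */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        $dot += $weight * ($b[$term] ?? 0.0);
    }
    $normA = sqrt(array_sum(array_map(fn ($w) => $w * $w, $a)));
    $normB = sqrt(array_sum(array_map(fn ($w) => $w * $w, $b)));

    return ($normA > 0 && $normB > 0) ? $dot / ($normA * $normB) : 0.0;
}

// Made-up headlines for illustration only.
$articles = [
    'Central bank raises interest rates to fight inflation',
    'Interest rates hiked by central bank as inflation climbs',
    'Local team wins the championship final',
];
$vectors = tfidfVectors($articles);

printf("%.3f\n", cosineSimilarity($vectors[0], $vectors[1])); // same story, partly shared wording
printf("%.3f\n", cosineSimilarity($vectors[0], $vectors[2])); // no shared terms, so 0.000
```

Because the score depends entirely on shared tokens, two articles that describe the same event with different vocabulary score low, which is exactly the case this approach misses.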

Current setup:

  • Laravel 10
  • PHP 8.2
  • guzzlehttp/guzzle for HTTP requests
  • symfony/dom-crawler for parsing

Error with sentence-transformers:

How can I implement effective semantic deduplication in a PHP/Laravel environment? Are there native PHP solutions or reliable API services for this purpose?