I’ve tried searching for something like this online but haven’t actually found any solutions for my problem.
I’m trying to make a price-tracker website for the products a store sells. Since I’ve only just started building it, I first need to load all of the products into my database so they can be tracked at all. The problem is that the store’s full product sitemap doesn’t seem to be up to date with their products, so I can’t use that; instead I’m scraping the regular product-list page.
Now, the actual issue: when you request a URL with a parameter to pick a particular page, the server always returns the content for page 1, and JavaScript then updates the HTML to the correct content for the requested page number. I’m using requests and BeautifulSoup to fetch the page and parse it.
It’s not entirely relevant, but here is my code:
import requests
from bs4 import BeautifulSoup


class CategoryScraper:
    def __init__(self, url):
        self.url = url
        self.headers = {
            'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0'
        }
        self.products = []
        self.html = None

    def get_data(self):
        self.products = []
        self.get_html()
        product_list = self.html.find('div', attrs={'class': 'productList'})
        product_containers = product_list.find_all('div', attrs={'class': 'itemContainer'})
        for product in product_containers:
            anchor = product.find('a')
            product_name = anchor.find('div', attrs={'class': 'itemTitle'}).get_text()
            # The price span contains a non-breaking space between the
            # currency symbol and the amount; split on it to keep the amount.
            product_price = anchor.find('div', attrs={'class': 'itemPrice'}).find('span').get_text().split('\xa0')[1]
            product_url = anchor['href']
            self.products.append(
                {'product_name': product_name, 'product_price': product_price, 'product_url': product_url})

    def get_html(self):
        page = requests.get(self.url, headers=self.headers)
        self.html = BeautifulSoup(page.content, 'html5lib')

    def change_url(self, url):
        self.url = url
        self.get_html()

    def get(self):
        self.get_data()
        return self.products
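For context, here is the parsing logic in isolation, run against a small inline HTML sample. The markup is made up by me to mirror the class names the scraper targets, and it assumes the price span looks like `€\xa012,99` (non-breaking space between currency symbol and amount):

```python
from bs4 import BeautifulSoup

# Made-up HTML mirroring the class names the scraper targets.
sample = """
<div class="productList">
  <div class="itemContainer">
    <a href="/products/widget">
      <div class="itemTitle">Widget</div>
      <div class="itemPrice"><span>\u20ac\xa012,99</span></div>
    </a>
  </div>
</div>
"""

soup = BeautifulSoup(sample, 'html.parser')
product_list = soup.find('div', attrs={'class': 'productList'})
for item in product_list.find_all('div', attrs={'class': 'itemContainer'}):
    anchor = item.find('a')
    name = anchor.find('div', attrs={'class': 'itemTitle'}).get_text()
    # Splitting on the non-breaking space keeps only the numeric part.
    price = anchor.find('div', attrs={'class': 'itemPrice'}).find('span').get_text().split('\xa0')[1]
    print(name, price)  # Widget 12,99
```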
I’m aware I might need a different library that waits for the JavaScript to load and finish before grabbing the page data, but I only started web scraping today, so I don’t really know what libraries exist or what they can do.
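For example, from what I’ve read, something like Selenium might work. This is just an untested sketch of the idea — the URL is a placeholder, and the wait condition is a guess based on the class names above:

```python
# Rough sketch with Selenium (untested): load the page in a real browser,
# wait until the product list has been rendered by JavaScript, then hand
# the final HTML to BeautifulSoup as before.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://example.com/products?page=2')  # placeholder URL
# Wait up to 10 seconds until at least one product container exists;
# the class name matches the one the requests-based scraper targets.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'itemContainer'))
)
html = BeautifulSoup(driver.page_source, 'html5lib')
driver.quit()
```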