The Zyte tutorial “Create your first spider” crawls this page, which has a pager with a “normal” next link. But what if the next link contains only href="#" and executes JavaScript instead, as many websites do nowadays? In that case, you have no URL for your next_page_links and cannot call response.follow_all, right?
The chapter “Handle JavaScript” of the Zyte tutorial suggests using browser automation, and the example given there demonstrates how this works with the scrollBottom action for http://quotes.toscrape.com/scroll.
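For reference, the request meta from that tutorial example looks roughly like the sketch below. The helper name scroll_bottom_meta is hypothetical and only there for illustration; the sketch also assumes scrapy-zyte-api is installed and configured in the project, which is not shown here.

```python
# Minimal sketch of the request meta for the scrollBottom example.
# scroll_bottom_meta is a hypothetical helper, not part of Scrapy or Zyte API.

SCROLL_URL = "http://quotes.toscrape.com/scroll"


def scroll_bottom_meta():
    """Build meta that asks Zyte API for browser HTML after scrolling down."""
    return {
        "zyte_api_automap": {
            "browserHtml": True,
            "actions": [
                {"action": "scrollBottom"},
            ],
        }
    }


if __name__ == "__main__":
    # In a spider (assuming scrapy-zyte-api is configured) this would be used as:
    #   yield scrapy.Request(SCROLL_URL, meta=scroll_bottom_meta(), callback=self.parse)
    print(scroll_bottom_meta())
```

The action runs only after the browser has loaded the requested URL, so even this example starts from a request.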
Unfortunately, there is no example of how to handle a click action on a next link so that the next results are loaded with JavaScript. As a proof of concept, clicking the link should work even with a normal link like the one on http://books.toscrape.com.
This is what I tried:
import scrapy
from scrapy import Request


class BooksToScrapeSpider(scrapy.Spider):
    name = "books_toscrape"
    start_urls = [
        "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html"
    ]

    def parse(self, response):
        # Extract book data
        for book in response.css("article.product_pod"):
            yield {
                "name": book.css("h3 a::attr(title)").get(),
                "price": book.css(".price_color::text").get(),
            }

        # Find the "next" link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            self.logger.info(f"Found next page: {next_page}")
            yield Request(
                # response.urljoin(next_page),
                "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html#",
                meta={
                    "zyte_api_automap": {
                        "browserHtml": True,
                        "actions": [
                            {
                                "action": "click",
                                "selector": {"type": "css", "value": "li.next a"},
                            },
                            {
                                "action": "waitForSelector",
                                "selector": {
                                    "type": "css",
                                    "value": "li.previous a",
                                },
                            },
                        ],
                    }
                },
                callback=self.parse,
            )
        else:
            self.logger.info("No next page found")
To perform Zyte browser automation, I first need a request, right? So it doesn’t work without a URL. In my fictitious case, the URL is actually http://books.toscrape.com/catalogue/category/books/mystery_3/index.html#. But I do not want to fire a request and then perform an action. What I want is to perform an action without a request (as an onclick event does), where the action itself then does something, for example sends a request.
I’ve been racking my brains for days on how to do this – to no avail. Does anyone have any ideas for me?