Can’t select language code in web-scraping script (RSelenium, Chrome)

First of all, it’s important to know that I did not create the script I will be talking about myself; I received it. I am very new to R and have no programming background at all, so very simple explanations or direct changes would be extremely appreciated. Thank you in advance for your patience!

The script allows you to scrape Goodreads reviews with RSelenium. I run it in RStudio on a Windows computer (not sure whether this is relevant or not). You put in the URL of the book in question; the script then opens the Chrome browser, navigates to that URL, and scrapes the 300 reviews shown. However, Goodreads allows you to filter the reviews by language, and the script has a piece of code so you can select which language you want (e.g. only English reviews). For this, you can enter either the language code (‘en’ for English) or the number of the position of this language in the drop-down menu. The position of a language varies from book to book, as the languages are ordered alphabetically and different books may have reviews in different languages.
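
To make the two options concrete, here is a minimal sketch of how I understand the two kinds of selector. The helper name `languageXPath` is my own invention for illustration and is not part of the script I received; the XPath strings follow the pattern the script uses.

```r
# Hypothetical helper (not part of the original script): build the XPath
# for one <option> of the Goodreads language drop-down.
languageXPath <- function(language) {
  base <- "//select[@id='language_code']/option"
  if (is.numeric(language)) {
    # Select by position in the menu, e.g. 3 -> the third <option>
    sprintf("%s[%d]", base, as.integer(language))
  } else {
    # Select by language code, e.g. "en" -> <option value="en">
    sprintf("%s[@value='%s']", base, language)
  }
}

languageXPath("en")  # selector by language code
languageXPath(3)     # selector by position
```

The position-based form depends on the order of the menu for each book, while the code-based form should stay the same from book to book, which is why I would prefer it.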

My problem is that the language-code part of the selector does not work: it only accepts the numeric position. It’s very tiring to have to look up the position of a specific language for every single book and then fill it in. I just want to be able to enter the language code.

This is the full script (just set the output-directory at the end if you wish to run it):

library(rJava)        # Required to use RSelenium
library(data.table)   # Required for rbindlist
library(dplyr)        # Required to use the pipes %>% and some table manipulation commands
library(magrittr)     # Required to use the pipes %>%
library(rvest)        # Required for read_html
library(RSelenium)    # Required for webscraping with javascript
library(lubridate)    # Required to scrape the correct dates
library(stringr)      # Required to cut off any leading or trailing whitespace from text
library(purrr)


options(stringsAsFactors = F) #needed to prevent errors when merging data frames

#Paste the GoodReads Url
url <- "https://www.goodreads.com/book/show/96290.Die_unendliche_Geschichte"

englishOnly = F #If FALSE, all languages are chosen

#Set your browser settings (if chrome not working, pick closest version)
rD <- rsDriver(port = 9516L, browser = "chrome", chromever = "110.0.5481.30")
remDr <- rD[["client"]]
remDr$setTimeout(type = "implicit", 2000)
remDr$navigate(url)

bookTitle = unlist(remDr$getTitle())
finalData = data.frame()

# Main loop going through the website pages
morePages = T
pageNumber =  1
while(morePages){
  
  #Select reviews in correct language
  #Go to the goodreads page of the book in Chrome and right-click.
  #Click on "View Page Source".
  #Look for the language code, it will look like this:
  #<select name="language_code" id="language_code"><option value="">All Languages</option><option value="de">Deutsch &lrm;(9)</option>
  #<option value="en">English &lrm;(9)</option><option value="es">Español &lrm;(1)</option>
  #The numeral language code is the sequence, so here "All Languages" is 1, "Deutsch" is 2, "English" is 3...
  #This sequence is not the same for every book, so check it each time!
  #It is sufficient if you only fill in the numeral language code.
  selectLanguage = if(englishOnly){
    selectLanguage = remDr$findElement("xpath", "//select[@id='language_code']/option[@value='de']")
  } else {
    selectLanguage = remDr$findElement("xpath", "//select[@id='language_code']/option[4]")
  }
  
  selectLanguage$clickElement()
  Sys.sleep(1)
  
  #Expand all reviews
  expandMore <- remDr$findElements("link text", "...more")
  expandMore = sapply(expandMore, function(x) x$clickElement())
  
  #Extracting the reviews from the page
  reviews <- remDr$findElements("css selector", "#bookReviews .stacked")
  reviews.html <- lapply(reviews, function(x){x$getElementAttribute("outerHTML")[[1]]})
  
  #Remove double text when expanded
  reviews.html <- lapply(reviews.html, function(x){
    if(str_count(x, "span id=\"freeText") > 1) {
      str_remove(x, "<span id=\"freeTextContainer.*")
    } else {
      x
    }
  })
  
  reviews.list <- lapply(reviews.html, function(x){read_html(x) %>% html_text()} )
  reviews.text <- unlist(reviews.list)
  
  #Some reviews have only rating and no text, so we process them separately
  onlyRating = unlist(map(1:length(reviews.text), function(i) str_detect(reviews.text[i], "^\n\n")))
  
  #Full reviews
  if(sum(!onlyRating) > 0){
    
    filterData = reviews.text[!onlyRating]
    fullReviews = purrr::map_df(seq(1, length(filterData), by=2), function(i){
      review = unlist(strsplit(filterData[i], "\n"))
      
      data.frame(
        date = mdy(review[2]), #date
        username = str_trim(review[5]), #user
        rating = str_trim(review[9]), #overall
        comment = str_trim(review[12]) #comment
      )
    })
    
    #Add review text to full reviews
    fullReviews$review = unlist(purrr::map(seq(2, length(filterData), by=2), function(i){
      str_trim(str_remove(filterData[i], "\\s*\\n\\s*\\(less\\)"))
    }))
    
  } else {
    fullReviews = data.frame()
  }
  
  #partial reviews (only rating)
  if(sum(onlyRating) > 0){
    
    filterData = reviews.text[onlyRating]
    partialReviews = purrr::map_df(1:length(filterData), function(i){
      review = unlist(strsplit(filterData[i], "\n"))
      
      data.frame(
        date = mdy(review[9]), #date
        username = str_trim(review[4]), #user
        rating = str_trim(review[8]), #overall
        comment = "",
        review = ""
      )
    })
    
  } else {
    partialReviews = data.frame()
  }
  
  #Get the review ID's from all the links
  reviewId = reviews.html %>% str_extract("/review/show/\\d+")
  partialId = reviewId[(length(reviewId) - nrow(partialReviews) + 1):length(reviewId)] %>%
    str_extract("\\d+")
  if(nrow(fullReviews) > 0){
    reviewId = reviewId[1:(length(reviewId) - nrow(partialReviews))]
    reviewId = reviewId[seq(1, length(reviewId), 2)] %>% str_extract("\\d+")
  } else {
    reviewId = NULL
  }
  
  if(nrow(partialReviews) > 0){
    reviewId = c(reviewId, partialId)
  }
  
  finalData = rbind(finalData, cbind(reviewId, rbind(fullReviews, partialReviews)))
  
  #Go to next page if possible
  nextPage = remDr$findElements("xpath", "//a[@class='next_page']")
  if(length(nextPage) > 0){
    message(paste("PAGE", pageNumber, "Processed - Going to next"))
    nextPage[[1]]$clickElement()
    pageNumber = pageNumber + 1
    Sys.sleep(2)
  } else {
    message(paste("PAGE", pageNumber, "Processed - Last page"))
    morePages = FALSE
  }
  
}   
#end of the main loop

#Replace missing ratings by 'not rated'
finalData$rating = ifelse(finalData$rating == "", "not rated", finalData$rating)

#Stop server
remDr$close()
rD$server$stop()
rm(rD, remDr)
gc()
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)

#set directory to where you wish the file to go
#copy your working directory and exchange all backward slashes for forward slashes
getwd()
setwd("")

#Write results
write.csv(finalData, paste0(bookTitle, ".csv"), row.names = F)
message("FINISHED!")

The part I’m talking about is this:

  selectLanguage = if(englishOnly){
    selectLanguage = remDr$findElement("xpath", "//select[@id='language_code']/option[@value='de']")
  } else {
    selectLanguage = remDr$findElement("xpath", "//select[@id='language_code']/option[4]")
  }
  
  selectLanguage$clickElement()
  Sys.sleep(1)

In this case, it only seems to take the “4” into consideration and completely ignores the “de”. I don’t really need the numeric version; I just want to be able to enter the language code and get going. Would someone be willing to help me?
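
To narrow this down without opening a browser, here is a small sketch of the same if/else with the RSelenium calls replaced by plain strings. `chooseXPath` is a hypothetical name for illustration, not something from the script. It only shows which XPath each setting of `englishOnly` would lead to.

```r
# Hypothetical stand-in for the selector block: instead of calling
# remDr$findElement(), it just returns the XPath that would be used.
chooseXPath <- function(englishOnly) {
  if (englishOnly) {
    # Branch that selects by language code
    "//select[@id='language_code']/option[@value='de']"
  } else {
    # Branch that selects by position in the drop-down menu
    "//select[@id='language_code']/option[4]"
  }
}

chooseXPath(FALSE)  # englishOnly = F, as set near the top of the script
chooseXPath(TRUE)   # englishOnly = T
```

With `englishOnly = F`, the sketch returns the `option[4]` XPath; the `@value='de'` XPath only comes back when `englishOnly` is `T`. That may be why the “de” never seems to be used, but I am not sure how to change the script so that I only have to enter the language code.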

Because of my lack of experience, I first tried to use ChatGPT to come up with a solution, but that went about as well as you’d expect.