I wrote a web crawler in Node.js that sends GET requests to about 300 URLs.
Here is the main loop:
for (let i = 1; i <= 300; i++) {
  let page = `https://xxxxxxxxx/forum-103-${i}.html`
  await getPage(page, (arr) => {
    console.log(`page ${i}`)
  })
}
Here is the function getPage(url, callback):
export default async function getPage(url, callback) {
  await https.get(url, (res) => {
    let html = ""
    res.on("data", data => {
      html += data
    })
    res.on("end", () => {
      const $ = cheerio.load(html)
      let obj = {}
      let arr = []
      obj = $("#threadlisttableid tbody")
      for (let i in obj) {
        if (obj[i].attribs?.id?.substr(0, 6) === 'normal') {
          arr.push(`https://xxxxxxx/${obj[i].attribs.id.substr(6).split("_").join("-")}-1-1.html`)
        }
      }
      callback(arr)
      console.log("success!")
    })
  }).on('error', (e) => {
    console.log(`Got error: ${e.message}`)
  })
}
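(As an aside: I'm not sure the await on https.get actually waits for anything, since https.get takes a callback rather than returning a Promise. This untested sketch is the kind of wrapper I thought might be needed; getPageBody is my own name:)

import https from "https"

// Untested sketch: wrap https.get in a Promise so that
// await really waits until the whole response body has arrived.
function getPageBody(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let html = ""
      res.on("data", chunk => {
        html += chunk
      })
      res.on("end", () => resolve(html))
    }).on("error", reject)
  })
}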
I use cheerio to parse the HTML and push all the information I need into a variable named 'arr'.
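(A side note on the cheerio part: I iterate with for...in because the loaded object looked array-like to me; maybe .each() is the intended way. An untested sketch of what I mean:)

// Untested alternative: iterate the matched elements with cheerio's .each(),
// which skips the non-element keys that for...in also visits.
$("#threadlisttableid tbody").each((i, el) => {
  const id = el.attribs?.id
  if (id?.startsWith("normal")) {
    arr.push(`https://xxxxxxx/${id.slice(6).split("_").join("-")}-1-1.html`)
  }
})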
The program runs normally for a while and then reports errors like this:
...
success!
page 121
success!
page 113
success!
page 115
success!
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
Got error: connect ETIMEDOUT 172.67.139.206:443
I have two questions:
1. What is the reason for the error? Is it because I am sending too many GET requests? How can I limit the request frequency? (A sketch of the kind of throttling I mean follows the questions.)
2. As you can see, the order in which the pages are accessed is chaotic. How can I control it?
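To clarify what I mean by limiting the frequency in question 1: something like a fixed pause between requests, as in this untested sketch (sleep is my own hypothetical helper):

const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms))

for (let i = 1; i <= 300; i++) {
  let page = `https://xxxxxxxxx/forum-103-${i}.html`
  await getPage(page, (arr) => {
    console.log(`page ${i}`)
  })
  await sleep(1000) // hypothetical: wait one second before the next request
}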
I have tried using other modules to send the GET requests (such as Axios), but it didn't work either.
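For reference, my Axios attempt looked roughly like this (reconstructed and simplified, so the exact code differed):

import axios from "axios"

// Roughly what I tried: axios.get returns a Promise, so await should work here.
const res = await axios.get(url)
const $ = cheerio.load(res.data)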