Topic: https://intoli.com/blog/scrape-infinite-scroll/

DaveBowman 6y, 272d ago

Thanks for the nice article and code — clean and simple!

Why is the post marked as published on Jan 11, 2017, though? I think Puppeteer was released in August 2017, and the first commit in the repo was made on May 10, 2017 [1].

[1] https://github.com/GoogleChrome/puppeteer/commit/ebac2114111c37986b460e2d04d22b7879b36ced

andre 6y, 270d ago

Ah, the year was wrong. Nice catch; it's corrected to Jan 11, 2018 now.

unverified 6y, 264d ago

How do I scrape all of the items, given that we have to set itemTargetCount here?

andre 6y, 263d ago [edited]

Inside the scrapeInfiniteScrollItems function, you can change

while (items.length < itemTargetCount) {

to

while (itemTargetCount == null || items.length < itemTargetCount) {

Then, if you don't specify a target count, the loop will keep running until the page stops growing and this call times out:

await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);

and the error will be caught by the try-catch block surrounding the while loop.

Alternatively, you could keep track of the scraped item count, and condition the while loop on that number changing.
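
For reference, the whole function would then look roughly like this. It's just a sketch assembled from the snippets above rather than the exact code from the post; the scrollDelay parameter and the setTimeout-based delay are assumptions.

async function scrapeInfiniteScrollItems(page, extractItems, itemTargetCount, scrollDelay = 1000) {
  let items = [];
  try {
    let previousHeight;
    // With itemTargetCount left out (null/undefined), the loop keeps going until
    // page.waitForFunction times out because the page has stopped growing.
    while (itemTargetCount == null || items.length < itemTargetCount) {
      items = await page.evaluate(extractItems);
      previousHeight = await page.evaluate('document.body.scrollHeight');
      await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
      await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
      await new Promise(resolve => setTimeout(resolve, scrollDelay));
    }
  } catch (e) {
    // The timeout error lands here; we fall through and return whatever was scraped.
  }
  return items;
}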

unverified 6y, 238d ago

How does querySelectorAll know to only pick newly loaded items and not the whole page each time it's called?

evan 6y, 236d ago

It doesn't; all of the items are extracted each time in this example code. The element selection and text extraction are generally quite efficient, so this shouldn't have much of a performance impact unless you're dealing with a very, very large number of items. If you are in that situation, then instead of terminating the loop in scrapeInfiniteScrollItems based on the total number of items, you could specify a maximum number of scroll iterations to complete before breaking, and then call page.evaluate(extractItems) only once after the loop terminates, as in the sketch below.
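
Here is a rough sketch of that variant; the function name and the maxScrolls parameter are made up for illustration.

async function scrapeWithScrollLimit(page, extractItems, maxScrolls, scrollDelay = 1000) {
  try {
    let previousHeight;
    // Scroll a fixed number of times without extracting anything yet.
    for (let i = 0; i < maxScrolls; i++) {
      previousHeight = await page.evaluate('document.body.scrollHeight');
      await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
      await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
      await new Promise(resolve => setTimeout(resolve, scrollDelay));
    }
  } catch (e) {
    // A waitForFunction timeout just means the feed stopped growing early.
  }
  // Extract everything in a single pass after the scrolling is done.
  return page.evaluate(extractItems);
}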

unverified 6y, 221d ago [edited]

If you wanted to scroll the entire feed, is there a way to check if the page stops scrolling?

await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);

This will hang and then throw an error, correct?

andre 6y, 215d ago

That is actually already handled. The page.waitForFunction call will time out (after 30 seconds by default) and then throw an error, as you said, but the error is caught by the surrounding try-catch block and the already extracted items are returned.
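
If you want it to give up sooner once the feed stops growing, you can also pass an explicit timeout to that call; the 5000 ms value here is just an example.

// Wait at most 5 seconds for the page height to increase; otherwise a
// TimeoutError is thrown and caught by the try-catch block around the loop.
await page.waitForFunction(
  `document.body.scrollHeight > ${previousHeight}`,
  { timeout: 5000 }
);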

NUVV1X71 6y, 210d ago

Sorry if this info is available somewhere, but what is that IDE theme? I really like it!

andre 6y, 206d ago

It's gruvbox dark. Here's the vim version; you should be able to find the theme for other editors, too.

unverified 6y, 91d ago

Excellent, thanks!

unverified 5y, 159d ago [edited]

Thanks for the nice article and code — clean and simple!
