Thanks for the nice article and code — clean and simple!
Why is the post marked as published on Jan 11, 2017, though? I think Puppeteer was released in August 2017, and the first commit in the repo was made on May 10, 2017 [1]
[1] https://github.com/GoogleChrome/puppeteer/commit/ebac2114111c37986b460e2d04d22b7879b36ced
How do I scrape all items? Here we have to set itemTargetCount.
Inside the scrapeInfiniteScrollItems function, you can change

while (items.length < itemTargetCount) {

to

while (itemTargetCount == null || items.length < itemTargetCount) {
Then, by not specifying a target count, the loop will run until this call times out once the page stops growing:

await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);

and the resulting error will be caught by the try-catch block surrounding the while loop.
Alternatively, you could track the scraped item count between iterations and exit the while loop once it stops changing.
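Putting that together, a minimal sketch of the modified function might look like the following. The names scrapeInfiniteScrollItems and extractItems follow the article; the scrollDelay parameter and the exact scrolling calls are assumptions about how the original example is structured.

```javascript
// Pure helper for the loop condition: with no target count, keep going
// until page.waitForFunction times out and breaks us out via the catch.
function shouldKeepScrolling(currentCount, itemTargetCount) {
  return itemTargetCount == null || currentCount < itemTargetCount;
}

// Sketch of the modified scraper (names follow the article's example;
// scrollDelay and the scroll/wait calls are assumptions).
async function scrapeInfiniteScrollItems(page, extractItems, itemTargetCount, scrollDelay = 1000) {
  let items = [];
  try {
    let previousHeight;
    while (shouldKeepScrolling(items.length, itemTargetCount)) {
      items = await page.evaluate(extractItems);
      previousHeight = await page.evaluate('document.body.scrollHeight');
      await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
      // Throws a timeout error once the page stops growing, ending the loop.
      await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
      await new Promise(resolve => setTimeout(resolve, scrollDelay));
    }
  } catch (e) {
    // Reaching the end of the feed surfaces here as a timeout.
  }
  return items;
}
```

Calling it without a target count, e.g. scrapeInfiniteScrollItems(page, extractItems), then scrapes until the feed is exhausted.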
How does querySelectorAll know to only pick newly loaded items and not the whole page each time it's called?
It doesn't; all of the items are extracted on each call in this example code. Element selection and text extraction are generally quite efficient, so this shouldn't have much of a performance impact unless you're dealing with a very, very large number of items. If you're in that situation, then instead of terminating the loop in scrapeInfiniteScrollItems based on the total number of items, you could specify a maximum number of iterations to complete before breaking, and then call page.evaluate(extractItems) only after the loop terminates.
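For that large-feed case, a sketch of the variant might look like this. The maxScrolls parameter is hypothetical (not from the article), and the scroll/wait calls are assumptions about the original example's structure; the key point is that extractItems runs once, after the loop.

```javascript
// Variant that bounds the number of scroll iterations instead of counting
// items on every pass; extractItems runs only once, after scrolling stops.
// maxScrolls is a hypothetical parameter, not from the original article.
async function scrapeWithMaxScrolls(page, extractItems, maxScrolls, scrollDelay = 1000) {
  try {
    let previousHeight;
    for (let i = 0; i < maxScrolls; i++) {
      previousHeight = await page.evaluate('document.body.scrollHeight');
      await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
      // A timeout here means the feed stopped growing before maxScrolls.
      await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
      await new Promise(resolve => setTimeout(resolve, scrollDelay));
    }
  } catch (e) {
    // End of feed reached early; fall through to the single extraction.
  }
  // Extract everything in a single pass at the end.
  return page.evaluate(extractItems);
}
```

This trades the per-iteration item count for a fixed scroll budget, so the expensive extraction happens exactly once.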
If you wanted to scroll the entire feed, is there a way to check if the page stops scrolling?
await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
This will hang and eventually throw a timeout error, correct?