How to crawl an infinite-scrolling AJAX website

Infinite scrolling is a web design technique that loads content on list pages continuously as the user scrolls down the page in the browser, eliminating the need for pagination with next/previous buttons. It is usually implemented with JavaScript/AJAX, and the responses to those background requests are mostly in JSON or XML format. When the user scrolls the page or clicks Load more, a JavaScript function fires that sends an HTTP GET or POST request to the server and fetches the data; the function then parses the response and appends the content to the web page.

So scraping infinite-scrolling pages is a bit different than usual web scraping. We will create our scraping agent to scrape the data directly from those internal JSON or XML pages for faster speed, instead of loading and executing the entire JavaScript library, which would slow down the crawl and also make your crawler easily traceable by webmasters.

To start scraping infinite-scrolling web pages, the first step is to open developer tools (press F12) in your Chrome or Firefox web browser. Then go to the Network tab; by default the browser will show all requests, including images, CSS, fonts etc., but you can click the XHR filter to show AJAX requests only, as I did in the screenshot below.

Note: These requests fire only when you scroll down the page or click the Load more button, if there is one.

Developer Mode

AJAX requests in infinite-scrolling website scraping

Most likely you will see GET or POST requests there, going to internal endpoints that look something like the ones below.

GET Method

GET /api/v2/product?page=1
GET /api/v2/product?page=2
GET /api/v2/product?page=3
GET /api/v2/product?page=4
GET /api/v2/product?page=5
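
As a reference, here is a minimal Python sketch of crawling these paged GET requests directly. The host is hypothetical; the endpoint and page parameter are taken from the example requests above, and I am assuming the endpoint returns a JSON list:

import requests

# Hypothetical base URL; a real site will have its own host and endpoint.
BASE_URL = "https://www.example.com/api/v2/product"

for page in range(1, 6):
    # The same request the page's JavaScript fires on each scroll.
    response = requests.get(BASE_URL, params={"page": page}, timeout=30)
    response.raise_for_status()
    items = response.json()  # assuming the endpoint returns a JSON list
    print(f"page {page}: {len(items)} items")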

POST Method

POST /api/v2/product?start=1&display=20
POST /api/v2/product?start=21&display=20
POST /api/v2/product?start=41&display=20
POST /api/v2/product?start=61&display=20
POST /api/v2/product?start=81&display=20
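
The POST variant is similar. Here is a hedged sketch, again with a hypothetical host; note that some sites expect start/display in the query string rather than the form body, so check the request details in the Network tab:

import requests

# Hypothetical base URL; the start/display values mirror the requests above.
BASE_URL = "https://www.example.com/api/v2/product"
DISPLAY = 20

for start in range(1, 101, DISPLAY):
    # Some sites expect these in the query string (params=) instead of the
    # form body (data=); check the request in the Network tab to be sure.
    response = requests.post(BASE_URL, data={"start": start, "display": DISPLAY}, timeout=30)
    response.raise_for_status()
    print(f"start {start}: fetched {len(response.json())} items")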

Result

AJAX website scraping

So once you have found the actual back-end pages the data is fetched from, why extract it from HTML at all? Using Agenty, we can extract data directly from these internal JSON or XML pages using the GET or POST method.

This way you will be extracting data directly from the data source used by the website itself, which can increase the agent's speed by 2-3x, because no HTML, CSS or JavaScript needs to be loaded in the scraping engine.

If it's a GET method, simply append the query-string parameters to the URL and enter those URLs in the agent's input URL list to crawl.
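
For example, a quick Python sketch to generate such a URL list (hypothetical host, and a guessed depth of 50 pages; adjust both for the real site):

# Hypothetical endpoint; change the page range to match the site's real depth.
urls = [f"https://www.example.com/api/v2/product?page={page}" for page in range(1, 51)]
print("\n".join(urls))  # paste the output into the agent's input URL list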

If it's a POST request, enter the URL-encoded data to be posted in a separate column and crawl with the HTTP POST method.
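
For instance, a short sketch (using the same hypothetical start/display parameters as above) for generating those URL-encoded bodies:

from urllib.parse import urlencode

# One URL-encoded body per request, e.g. "start=21&display=20".
for start in range(1, 101, 20):
    print(urlencode({"start": start, "display": 20}))  # goes in the POST-data column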