Scraping Details Pages from Listings

Using Agenty, we can extract details page data starting from a list page. Most websites display a search or list page for exploring their products when we search for something or browse by category. But those list items don't carry much information, and we may need to click through each item to grab further product details. So it's very important for a web scraper to have a feature that automates scraping details pages from a list page.

Agenty scraping agents have a very useful feature called Connecting agents, which uses the URL from source agent option in the input to connect multiple agents. We can use this feature to scrape list and details pages automatically. To automate this task, we need to create 2 agents:

  • 1st scraping agent for the list page: This agent will visit the category/search page and extract the basic details (if available) and the details page URL. Let's assume we named that field DETAILS_PAGE_URL.
  • 2nd scraping agent for the details page: Now we need to create a 2nd agent to extract the information from the details page. We can use any example details page link to set up the agent and add as many fields as needed. Once the agent is created, we change its input type to URL from source agent by selecting our 1st agent in the input, and then point it to the field that holds the URL, for example DETAILS_PAGE_URL.
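
To make the data flow between the two agents concrete, here is a minimal Python sketch of the idea. It is not Agenty's implementation: the function names `run_details_agent` and `fake_extract_details`, and the sample rows, are assumptions made up for illustration. It only shows that the details stage takes its input URLs from the DETAILS_PAGE_URL field of the list stage's output.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


def run_details_agent(
    list_rows: Iterable[Row],
    extract_details: Callable[[str], Row],
    url_field: str = "DETAILS_PAGE_URL",
) -> List[Row]:
    """Feed each URL produced by the list stage into the details stage."""
    results = []
    for row in list_rows:
        url = row.get(url_field)
        if not url:
            continue  # skip list rows that carry no details link
        results.append(extract_details(url))
    return results


# Stand-in for the 1st agent's output (hypothetical sample rows).
list_output = [
    {"NAME": "Book A", "DETAILS_PAGE_URL": "http://books.toscrape.com/catalogue/book-a/index.html"},
    {"NAME": "Book B", "DETAILS_PAGE_URL": "http://books.toscrape.com/catalogue/book-b/index.html"},
    {"NAME": "No link"},
]


def fake_extract_details(url: str) -> Row:
    # A real extractor would download and parse the page; this stub only echoes the URL.
    return {"PAGE_URL": url, "NAME": url.rstrip("/").split("/")[-2]}


details_output = run_details_agent(list_output, fake_extract_details)
```

The extractor is passed in as a parameter, so the same chaining function could be reused for a third or fourth stage by feeding one stage's output rows into the next.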

This way we can connect as many agents as needed to reach any depth of a website. Here is a detailed example:

List Scraping Agent

  1. Go to the website list/search page URL.
  2. Open the Agenty Chrome extension and set up the fields.

    list details page scraping
     
  3. As the screenshot above shows, the DETAILS_PAGE_URL field contains a relative URL. That is because Agenty extracts the value exactly as it appears in the HTML, but we can always add a post-processing function that turns it into a full URL by inserting the domain.
  4. Let's save the scraping agent first; it will run automatically.

    product list scraper
     
  5. Now we can edit the agent to add a post-processing function that transforms the relative URL into a full URL.
  6. Edit the agent by clicking on the edit tab, then click on the Add post-processing button for the respective field.
  7. Add the Insert function; the dialog box below will appear.

    relative to full url
     
  8. Now enter the value to insert into the field result using the "Input" parameter, as I did in the screenshot above to insert http://books.toscrape.com/
  9. Then save the scraping agent configuration.
  10. Finally, re-run your agent to apply the changes.

    retail product list scraping

Remember, we only need the DETAILS_PAGE_URL field in our 1st agent; the other fields are optional.
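
The relative-to-full URL step can be illustrated outside Agenty with a small Python sketch. The helper names `insert_prefix` and `to_absolute` are hypothetical, and the relative path is just a sample value. The first helper mimics a plain "insert the domain in front" transformation; the second uses the standard library's `urljoin`, a different technique that also leaves already-absolute links untouched.

```python
from urllib.parse import urljoin

BASE_URL = "http://books.toscrape.com/"


def insert_prefix(value: str, prefix: str) -> str:
    """Mimic a simple 'insert at the start' transformation."""
    return prefix + value


def to_absolute(value: str, base: str = BASE_URL) -> str:
    """Resolve a relative link against the base; absolute links pass through unchanged."""
    return urljoin(base, value)


# Sample relative link as it might appear in the list page HTML (assumed value).
relative = "catalogue/a-light-in-the-attic_1000/index.html"

print(insert_prefix(relative, BASE_URL))
print(to_absolute(relative))
print(to_absolute("http://books.toscrape.com/index.html"))
```

Plain prefixing is enough when every extracted value is relative to the site root, as in this example. A resolver like `urljoin` is the safer choice when a page mixes relative and absolute links.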

Details Scraping Agent

If you've followed the steps so far, we have created our 1st agent to extract the data from the list page. We are now ready to make our 2nd agent, which will extract the details page for each listing by connecting both agents through the URL from source agent option in the input type. Let's create the 2nd agent:

Steps :

  1. Go to any example details page URL from the list produced by agent #1.
  2. Open the Agenty extension and set up all the fields you want to include in your 2nd agent for details page scraping.

    product details scraping
     
  3. Save the agent. (The agent will auto-execute using the default input type, SOURCE URL, when saved in Agenty cloud.)

    details page scraped
     
  4. Now we need to change its input type to URL from source agent and point the agent to our books list agent (#1). That way, the 1st agent's output is used as the 2nd agent's input, and this agent grabs DETAILS_PAGE_URL from the source agent to extract details for each product.

    connect agents using source url
  5. Now both agents are connected, so we can run the 2nd agent to extract all the details pages from the list.

    details page scraping result
     
  6. As the screenshot above shows, Agenty grabbed all 20 pages from the list agent's output and then extracted the NAME, PRICE, AVAILABILITY and DESCRIPTION fields from each details page.
  7. I also added the default field REQUEST_URL under the name PAGE_URL to better analyze the scraped result, that is, to see which page produced which output row. You can learn more about adding default fields here.
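
Keeping PAGE_URL in the details output also makes it possible to match each details row back to its list row after export. The Python sketch below shows one way to do that offline. The function name `merge_by_url` and all sample values are assumptions for illustration, and this is not an Agenty feature.

```python
from typing import Dict, List

Row = Dict[str, str]


def merge_by_url(list_rows: List[Row], detail_rows: List[Row]) -> List[Row]:
    """Attach each details row to the list row whose DETAILS_PAGE_URL matches its PAGE_URL."""
    details_by_url = {row["PAGE_URL"]: row for row in detail_rows}
    merged = []
    for row in list_rows:
        details = details_by_url.get(row["DETAILS_PAGE_URL"])
        if details is None:
            continue  # no details row was produced for this listing
        merged.append({**row, **details})
    return merged


# Hypothetical exported rows from the two agents.
list_rows = [
    {"DETAILS_PAGE_URL": "http://books.toscrape.com/catalogue/book-a/index.html"},
    {"DETAILS_PAGE_URL": "http://books.toscrape.com/catalogue/book-b/index.html"},
]
detail_rows = [
    {
        "PAGE_URL": "http://books.toscrape.com/catalogue/book-a/index.html",
        "NAME": "Book A",
        "PRICE": "10.00",
        "AVAILABILITY": "In stock",
        "DESCRIPTION": "Sample description",
    },
]

merged_rows = merge_by_url(list_rows, detail_rows)
```

Listings without a matching details row are dropped here; comparing `len(merged_rows)` with `len(list_rows)` gives a quick check for pages that failed to scrape.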

Scraping more pages in List

In most cases, you'd want to extract more than one page in your list scraper from a product category or search page. You can follow this pagination tutorial to get the full product list before extracting the details of each product.
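
Conceptually, pagination means following the "next" link from one list page to the following one until no link remains. The Python sketch below illustrates that loop under stated assumptions: `iter_list_pages` is a hypothetical helper, the next-link lookup is passed in as a function, and a small dictionary stands in for a real website. It does not describe how Agenty handles pagination internally.

```python
from typing import Callable, Iterator, Optional


def iter_list_pages(
    start_url: str,
    next_page_url: Callable[[str], Optional[str]],
    max_pages: int = 50,
) -> Iterator[str]:
    """Yield list-page URLs by following 'next' links until none remain."""
    url: Optional[str] = start_url
    seen = set()
    # Stop on a missing link, a repeated page (loop), or the page limit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        yield url
        url = next_page_url(url)


# Hypothetical three-page category: each page maps to its 'next' link.
fake_next_links = {
    "page-1.html": "page-2.html",
    "page-2.html": "page-3.html",
    "page-3.html": None,
}

pages = list(iter_list_pages("page-1.html", fake_next_links.get))
```

The `seen` set and `max_pages` limit guard against sites whose last page links back to an earlier one.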

You can also use the "Start an Agent" trigger to automatically run the 2nd agent when the 1st agent completes a job and the list is ready for details scraping.