Using Agenty, we can extract details page data starting from a list page. Most websites show a search or list page for exploring their products when we search for something or browse by category. But the items in those lists don't carry much information, so we may need to click through each item to grab further product details. It is therefore important for a web scraper to be able to automate scraping details pages from a list page.
Agenty scraping agents have a very useful feature called connecting agents, which uses the "URL from source agent" option in input to connect multiple agents. We can use this feature to scrape list and details pages automatically. To automate this task, we need to create 2 agents:
- 1st scraping agent for the list page: This agent will go to the category/search page and extract the basic details (if available) and the details page URL. Let's assume we named it
- 2nd scraping agent for the details page: Now we need to create a 2nd agent to extract the information from the details page. We can use any example details page link to set up our agent and add as many fields as needed. Then, once the agent is created, we will change its input type to "URL from source agent" by selecting our 1st agent in input, and then point it to the field which has the URL, for example
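Conceptually, connecting two agents means that the URL field in the 1st agent's output becomes the input list of the 2nd agent. The following is a minimal Python sketch of that idea only, not Agenty's implementation. The function names, the sample rows, the example.com URLs and the stand-in details scraper are all hypothetical; DETAILS_PAGE_URL is the field name used later in this tutorial.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def run_connected(
    source_rows: List[Row],
    url_field: str,
    scrape_details: Callable[[str], Row],
) -> List[Row]:
    """Pass every URL found in the source agent's output to the details scraper."""
    results: List[Row] = []
    for row in source_rows:
        url = row.get(url_field)
        if not url:
            continue  # a list row without a details link is skipped
        results.append(scrape_details(url))
    return results


# Hypothetical output of the 1st (list) agent.
list_output = [
    {"TITLE": "Item A", "DETAILS_PAGE_URL": "https://example.com/items/a"},
    {"TITLE": "Item B", "DETAILS_PAGE_URL": "https://example.com/items/b"},
    {"TITLE": "Item C"},  # no details link was extracted for this row
]


def fake_details_agent(url: str) -> Row:
    """Stand-in for the 2nd agent; a real one would fetch and parse the page."""
    return {"URL": url, "SLUG": url.rsplit("/", 1)[-1]}


details_output = run_connected(list_output, "DETAILS_PAGE_URL", fake_details_agent)
for row in details_output:
    print(row)
```

Because the details step only needs a list of rows and the name of the field holding the URL, the same pattern can be repeated with the 2nd agent's output as the source for a 3rd agent.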
This way we can connect as many agents as needed to reach any depth of a website. Here is a detailed example:
List Scraping Agent
- Go to the website list/search page URL.
- Open the Agenty Chrome extension and set up the fields.
- In the screenshot above, the DETAILS_PAGE_URL field shows a relative URL, because Agenty extracts the value exactly as it appears in the HTML. We can always add a post-processing function to turn it into a full URL by inserting the domain.
- Let's save the scraping agent first; it will run automatically.
- Now we can edit the agent to add a post-processing function that transforms the relative URL into a full URL.
- Edit the agent by clicking on the edit tab, then click on the "Add post-processing" button for the respective field.
- Add the Insert function; the dialog box below will appear.
- Now enter the value we need to insert in the field result using the "Input" parameter, as I did in the screenshot above to insert the
- Then save the scraping agent configuration.
- Finally, re-run your agent to apply the changes.
Remember, we only need the DETAILS_PAGE_URL field in our 1st agent; the other fields are optional.
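To illustrate what inserting the domain does to the extracted value, here is a small Python sketch. It is not Agenty's Insert function, only an equivalent transformation written by hand; the domain example.com, the sample path and both helper names are assumptions made for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://example.com"  # hypothetical domain of the scraped website


def insert_domain(relative_url: str, domain: str = BASE_URL) -> str:
    """Plain string insertion: put the domain in front of the extracted value."""
    return domain + relative_url


def to_absolute(url: str, base: str = BASE_URL) -> str:
    """More defensive variant: urljoin leaves already-absolute URLs unchanged."""
    return urljoin(base, url)


relative = "/products/item-1.html"  # value as it appears in the HTML
full_a = insert_domain(relative)
full_b = to_absolute(relative)
already_full = to_absolute("https://example.com/products/item-2.html")

print(full_a)
print(full_b)
print(already_full)
```

Plain insertion matches the step described above. The urljoin variant is shown only as an alternative for sites whose list pages mix relative and absolute links.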
Details Scraping Agent
If you've reached here by following the steps, we have created our 1st agent to extract the data from the list page. We are now ready to make our 2nd agent, which will extract the details page for each listing by connecting both agents using the "URL from source agent" option in input type. Let's create the 2nd agent:
- Go to any example details page URL from the list (agent #1).
- Open the Agenty extension and set up all the fields you want to include in your 2nd agent for details page scraping.
- Save the agent. (The agent will auto-execute using the default input type, SOURCE URL, when saved in Agenty cloud.)
- Now we need to change its input type to "URL from source agent" to point the agent to our books list #1 agent. That way, the 1st agent's output is used as the 2nd agent's input, and this agent will grab the DETAILS_PAGE_URL from the source agent to extract details for each product.
- Now both agents are connected, so we can run the 2nd agent to extract all the details pages from the list.
- As the screenshot above shows, Agenty grabbed all 20 pages from the list agent's output and then extracted each details page for
- I also added the default field REQUEST_URL with the name PAGE_URL to better analyze my scraped result, that is, to see which page returned which output. You can learn more about adding default fields here.
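One practical use of the PAGE_URL field is matching each details row back to the list row it came from. The sketch below shows that join in Python under stated assumptions: both outputs have been exported as lists of dictionaries, and the TITLE and PRICE fields, the sample URLs and the function name are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def merge_by_page_url(list_rows: List[Row], details_rows: List[Row]) -> List[Row]:
    """Attach each details row to its list row by comparing the two URL fields."""
    details_by_url = {row["PAGE_URL"]: row for row in details_rows}
    merged: List[Row] = []
    for item in list_rows:
        # Rows with no matching details page keep only their list fields.
        details = details_by_url.get(item["DETAILS_PAGE_URL"], {})
        merged.append({**item, **details})
    return merged


# Hypothetical output of the list agent.
list_rows = [
    {"TITLE": "Item A", "DETAILS_PAGE_URL": "https://example.com/items/a"},
    {"TITLE": "Item B", "DETAILS_PAGE_URL": "https://example.com/items/b"},
]

# Hypothetical output of the details agent, in a different order.
details_rows = [
    {"PAGE_URL": "https://example.com/items/b", "PRICE": "12.50"},
    {"PAGE_URL": "https://example.com/items/a", "PRICE": "9.99"},
]

merged = merge_by_page_url(list_rows, details_rows)
for row in merged:
    print(row["TITLE"], row["PRICE"])
```

Keying on the URL rather than on row order keeps the match correct even when the details output arrives in a different order than the list.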
Scraping more pages in List
In most cases, you'd want to extract more than one page in your list scraper from a product category or search page. You can follow this pagination tutorial to get the full product list before extracting the details of each product.
You can also use the "Start an Agent" trigger to automatically run the 2nd agent when the 1st agent completes a job and the list is ready for details scraping.