Using Agenty, we can extract details page data starting from a list page. Most websites display a search or list page when we search for something or browse by category. These list items usually carry limited information, so we need to click through to each item to collect the full product details. It's therefore very important for a web scraper to have a feature that automates scraping details pages from a list page.
Agenty scraping agents have a tremendously useful feature called Connecting agents, which uses the `URL from source agent` option in the input settings to chain multiple agents together. We can use this feature to scrape list and details pages automatically. To automate this task, we need to create 2 agents:
- 1st scraping agent for the list page: this agent will traverse the category/search page and extract the basic details (if available) along with the details page URL; let's assume we named that field `DETAILS_PAGE_URL`.
- 2nd scraping agent for the details page: now we need to create a 2nd agent to extract the information from the details page. We can use any example details page link to set up the agent and add as many fields as needed. Then, once the agent is created, we will change its input type to `URL from source agent`, select our 1st agent as the input, and point it to the field that holds the URL, for example `DETAILS_PAGE_URL`.
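The two-agent flow can be sketched in plain Python. Everything here is hypothetical sample data for illustration (only the `DETAILS_PAGE_URL` field name comes from this tutorial); Agenty runs the equivalent logic for you in the cloud:

```python
# Minimal sketch of the list -> details pipeline that connecting
# two Agenty agents automates. All sample data is invented.

def list_agent(list_items):
    """1st agent: scrape the list page, keeping any basic details
    plus the DETAILS_PAGE_URL needed by the 2nd agent."""
    return [{"NAME": item["title"], "DETAILS_PAGE_URL": item["href"]}
            for item in list_items]

def details_agent(source_rows, scrape_details):
    """2nd agent: its input is the 1st agent's output; it visits
    each DETAILS_PAGE_URL and extracts the full product record."""
    return [scrape_details(row["DETAILS_PAGE_URL"]) for row in source_rows]

# Stand-in for a real details-page scrape (no network calls here).
fake_catalog = {
    "/item-1.html": {"NAME": "Item 1", "PRICE": "9.99"},
    "/item-2.html": {"NAME": "Item 2", "PRICE": "4.50"},
}

rows = list_agent([{"title": "Item 1", "href": "/item-1.html"},
                   {"title": "Item 2", "href": "/item-2.html"}])
details = details_agent(rows, fake_catalog.get)
print(details)  # one full product record per list item
```

The key point the sketch shows: the 2nd stage never discovers URLs itself; it only consumes the URL field produced by the 1st stage.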
In this way, we can connect as many agents as needed to reach any depth of a website. Here is the detailed example:
List Scraping Agent
Go to the website list/search page URL.
Open the Agenty Chrome extension and set up the fields.
In the screenshot above, the `DETAILS_PAGE_URL` field shows a relative URL, because Agenty extracts the value exactly as it appears in the HTML. We can always add a Post-processing function to make it a full URL by inserting the domain.
`Save` the scraping agent first, and it will run automatically.
Now we can edit the agent to add a Post-processing function that transforms the relative URL into a full URL.
Just edit the agent by clicking on the `edit` tab, then click on the `Add post-processing` button for the respective field. Select the `Insert` function, and the dialog box below will appear.
Now, enter the value we need to insert into the field result using the `Input` parameter, as I did in the screenshot above to insert the domain.
`Save` the scraping agent configuration.
And finally, re-run your agent to apply the changes.
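The `Insert` post-processing step is essentially a string prefix on the extracted value. Here is an equivalent sketch in Python using the standard library's `urljoin`; the base domain below is an assumption (the books demo site referenced later in this tutorial), so substitute your own site's domain:

```python
from urllib.parse import urljoin

# Assumed base domain (the books demo site); replace with your own.
base = "http://books.toscrape.com/"

# A relative URL, exactly as Agenty would extract it from the HTML.
relative = "catalogue/a-light-in-the-attic_1000/index.html"

# Equivalent of the Insert post-processing: prepend the domain.
full_url = urljoin(base, relative)
print(full_url)
# http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```

Unlike a plain string concatenation, `urljoin` also resolves `../` segments and absolute paths correctly, which is handy when the list page links in different formats.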
Remember, we only need the `DETAILS_PAGE_URL` field in our 1st agent; the other fields are optional.
Details Scraping Agent
If you've reached here by following the steps, we have created our 1st agent to extract the data from the website list page, and we're ready to make our 2nd agent, which will extract the details page of each listing by connecting both agents with the `URL from source agent` option in the input type. Let's create the 2nd agent:
Go to any example details page URL from list #1.
Open the Agenty extension and set up all the fields you want to include in your 2nd agent for details page scraping.
Save the agent (the agent will auto-execute using the default input type, `SOURCE URL`, when saved in the Agenty cloud).
Now, we need to change its input type to `URL from source agent` to point the agent to our books list #1 agent. That way, the 1st agent's output is treated as the 2nd agent's input, and this agent will grab the `DETAILS_PAGE_URL` from the source agent to extract the details of each product.
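Conceptually, the `URL from source agent` option just lifts one field out of the 1st agent's result rows and uses it as the 2nd agent's URL list. A small sketch with invented sample rows:

```python
# Result rows as the list agent might produce them (sample data;
# real rows come from your own list agent's output).
first_agent_output = [
    {"NAME": "Book A", "DETAILS_PAGE_URL": "http://example.com/a.html"},
    {"NAME": "Book B", "DETAILS_PAGE_URL": "http://example.com/b.html"},
]

# The 2nd agent's input: the DETAILS_PAGE_URL field of each row.
second_agent_input = [row["DETAILS_PAGE_URL"] for row in first_agent_output]
print(second_agent_input)
```

Any other fields in the 1st agent's output are simply ignored by the connection; only the selected URL field matters.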
Now both agents are connected, so we can run the 2nd agent to extract all the details pages from the list.
As the screenshot above shows, Agenty grabbed all 20 pages from the list agent's output and then extracted the `NAME`, `PRICE`, `AVAILABILITY`, and `DESCRIPTION` fields from each details page.
I also added the default field `REQUEST_URL`, named `PAGE_URL`, to better analyze my scraped result and see which page returned which output. You can learn more about adding default fields here.
Scraping More Pages in List
In most cases, you'll want to extract more than one page in your list scraper from a product category or search page. You can follow this pagination tutorial to get the complete product list before extracting the details of each product.
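For illustration, if the list pages follow a numbered-page URL scheme (the pattern below is an assumption; real schemes are site-specific and the pagination tutorial covers discovering next-page links), pagination simply expands one category URL into many list page URLs before the details stage runs:

```python
# Hypothetical page-numbered URL pattern; adjust to your target site.
pattern = "http://example.com/catalogue/page-{}.html"

# Expand into the list page URLs the list agent should visit.
list_page_urls = [pattern.format(n) for n in range(1, 4)]  # pages 1..3
print(list_page_urls)
```

Each of these list pages then contributes its own batch of `DETAILS_PAGE_URL` values for the details agent.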
And you can also use the "Start an Agent" plugin to automatically run the 2nd agent when the 1st agent completes a job and the list is ready for details scraping.