How to create a scraping agent to extract data from websites?

Agenty scraping agents are an easy and powerful tool for website scraping. Using a scraping agent, you can create your web scraper online and run it on Agenty's cloud-based web scraping software (or via our API) to scrape data from thousands of websites in minutes.

To create a custom scraping agent, first install our Chrome extension from the Chrome store.

Once the extension is installed, go to the web page you want to scrape data from. Then launch the extension by clicking on the robot icon at the top right. It will display a panel on the right side, as in this screenshot.

Web Scraping Chrome Extension Preview

Text Scraping

Once the extension panel is up and visible:

  • Click on the New button to add a field and give it a name. I named mine ProductName.
  • Then click on the (asterisk) button to enable the point-and-click feature, which automatically generates a CSS selector when you click on the HTML element you want to scrape. For example, I wanted to scrape product names in this field, so I clicked on a product name element on the page. The extension automatically generated the selector for that element and highlighted the other products on the page that match the same selector.

Sometimes other, unwanted items are selected as well because they share the same CSS class or selector. In that case, click on the yellow-highlighted items to reject them, or write your selector manually.

The extension highlights the matching results and also shows a result preview under the field. Once you are satisfied with the result and the number of records matches your expectation, click on the Accept button to save that field in your scraping agent configuration.

Scraping Text From Website
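
To make it clearer what a class-based selector matches, here is a minimal Python sketch using only the standard library. It is not how Agenty works internally: the sample markup, the class name product-name and the helper names TextByClass and extract_text are assumptions for illustration, and it only handles a simple tag.class selector without nested elements of the same tag.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the class names are assumptions for this sketch.
SAMPLE_HTML = """
<h3 class="product-name">Tipping the Velvet</h3>
<h3 class="product-name">Another Book</h3>
<h3 class="sidebar-title">Categories</h3>
"""


class TextByClass(HTMLParser):
    """Collects the text of elements matching a simple `tag.class` selector."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.inside = False  # True while we are inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.inside = True
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.results[-1] += data


def extract_text(html, tag, css_class):
    """Return the stripped text of every element matching tag.css_class."""
    parser = TextByClass(tag, css_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


product_names = extract_text(SAMPLE_HTML, "h3", "product-name")
print(product_names)
```

The third heading shares the tag but not the class, so it is left out. This is the same idea as rejecting unwanted highlighted items in the extension until only the elements you want remain.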

Now follow the same process to add as many fields as you want for text, attribute or HTML items, to scrape anything from an HTML page.

Hyperlinks Scraping

To scrape hyperlink URLs from websites, we need to extract the href attribute value. After generating the CSS selector for the hyperlink (a) element:

  • Select the ATTR option in extract type
  • Enter href in the attribute name box, to tell Agenty that you want the value of href in the output instead of the plain text or HTML.

Scraping Hyper-links from Website
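
An href value in a page's markup is often relative, like catalogue/tipping-the-velvet_999/index.html. If the values you get back are relative and you need full URLs, one way to resolve them after scraping is sketched below with Python's standard urljoin. The page URL and the second sample value are assumptions for illustration only.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page that was scraped (an assumption for this sketch).
PAGE_URL = "https://example.com/shop/index.html"

# Values as an href field might return them: one relative, one already absolute.
scraped_hrefs = [
    "catalogue/tipping-the-velvet_999/index.html",
    "https://example.org/about.html",
]

# urljoin resolves relative values against the page URL and
# leaves absolute URLs unchanged.
absolute_links = [urljoin(PAGE_URL, href) for href in scraped_hrefs]

for link in absolute_links:
    print(link)
```

The same approach applies to relative src values when scraping images.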

Images Scraping

To scrape images from websites, we need to extract the src attribute value. After generating the CSS selector for the image element:

  • Select the ATTR option in extract type
  • Enter src in the attribute text box, to tell Agenty that you want the value of src in the output for image scraping.

Scraping Images from Website

HTML Scraping

To scrape the full HTML of an element instead of its plain text or one of its attributes:

  • Write your selector
  • Select HTML as the extract type

Scraping HTML from Website

Attribute Scraping

The ATTR (attribute) option in the scraping agent is a very powerful feature for extracting any attribute from an HTML element. For example:

  • We used the src attribute for image scraping
  • And the href attribute for URL scraping
  • Similarly, we can extract HTML data-* attributes, class, id or any other attribute present in the HTML to include in our scraped data
    <div class="image_container">
    <a href="catalogue/tipping-the-velvet_999/index.html">
    <img src="media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg" alt="Tipping the Velvet" class="thumbnail">
    </a>
    </div>

In this example, I scraped the alt attribute from the HTML above and named the field ImageALT.

Scraping Attribute from Website
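
To see what each attribute name would return for the sample HTML above, here is a minimal Python sketch using only the standard library. It is an illustration, not Agenty's implementation: AttrCollector and extract_attr are hypothetical helper names, and matching by tag name alone stands in for a full CSS selector.

```python
from html.parser import HTMLParser

# The sample HTML from this section.
SNIPPET = """
<div class="image_container">
<a href="catalogue/tipping-the-velvet_999/index.html">
<img src="media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg" alt="Tipping the Velvet" class="thumbnail">
</a>
</div>
"""


class AttrCollector(HTMLParser):
    """Collects one attribute's value from every element with a given tag."""

    def __init__(self, tag, attr):
        super().__init__()
        self.tag = tag
        self.attr = attr
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            value = dict(attrs).get(self.attr)
            if value is not None:  # skip elements that lack the attribute
                self.values.append(value)


def extract_attr(html, tag, attr):
    """Return the values of `attr` for all `tag` elements in `html`."""
    parser = AttrCollector(tag, attr)
    parser.feed(html)
    parser.close()
    return parser.values


print(extract_attr(SNIPPET, "img", "alt"))    # the ImageALT field
print(extract_attr(SNIPPET, "img", "src"))    # image scraping
print(extract_attr(SNIPPET, "a", "href"))     # hyperlink scraping
print(extract_attr(SNIPPET, "img", "class"))  # any other attribute
```

Only the attribute name changes between the four calls, which mirrors how the ATTR option works with a different name in the attribute box.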

Preview Result

You can preview or download the scraped data in JSON, CSV or TSV format in the extension itself:

  • Click on the Options drop-down button
  • Then choose the Preview result option, which opens a dialog box with the results of all fields combined into a JSON array of objects.

Web Data Scraping Preview
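
As a rough picture of what "a JSON array of objects" means here, the sketch below combines per-field results into one object per record and renders the same rows as JSON, CSV and TSV. The sample values and the helper names combine_fields and to_delimited are assumptions for illustration; the exact output Agenty produces may differ.

```python
import csv
import io
import json

# Hypothetical per-field results; the field names come from this tutorial,
# the second record is made up for the example.
field_results = {
    "ProductName": ["Tipping the Velvet", "Another Book"],
    "ImageALT": ["Tipping the Velvet", "Another Book"],
}


def combine_fields(results):
    """Zip per-field value lists into a list of one dict per record."""
    names = list(results)
    return [dict(zip(names, values)) for values in zip(*results.values())]


def to_delimited(rows, delimiter=","):
    """Render rows as CSV (default) or TSV (delimiter='\\t') text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = combine_fields(field_results)
json_preview = json.dumps(rows, indent=2)
csv_preview = to_delimited(rows)
tsv_preview = to_delimited(rows, delimiter="\t")

print(json_preview)
print(csv_preview)
```

Each object in the array is one scraped record, with one key per field you set up in the extension.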

Save the Agent

Once you are done with setting up all the fields in your agent, click on the Save button to save your web scraper in your Agenty account.

If you are using the extension for the first time, it will ask you to sign in to your account before you can save the agent. Create your free Agenty account or enter your credentials to log in.

Remember: the Chrome extension is only used to set up the fields of the scraper initially for a particular website. After that, the agent should be stored in your Agenty account for advanced features like scheduling, batch crawling, connecting multiple agents, plugins, etc.

Save the Web Scraper

Once the agent is saved in your account, it will look like this:

Run the Web Scraping Software

You no longer need to go back to the Chrome extension. Simply click on the Start button to start scraping on demand, or use our API to run the agent from a programming language such as Python, Perl, Ruby, Java, PHP or C#.

Crawl more pages

The scraping agent can be used to crawl any number of web pages with a similar structure. All you need to do is enter the URLs in the input for batch crawling, or use the Lists feature to upload a file and select it as your agent input.

  • Go to the Input tab
  • Select the input type MANUAL
  • Enter the URLs and save the input configuration
  • Now just start the agent to crawl all the web pages.
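
Before pasting a long list of URLs into the input, it can help to clean it up first. The following is a small Python sketch that trims whitespace, drops entries that are not http or https URLs, and removes duplicates while keeping the original order. The helper name clean_url_list and the example.com URLs are assumptions for illustration, and printing one URL per line is just a convenient format for copying, not a documented Agenty requirement.

```python
from urllib.parse import urlsplit


def clean_url_list(raw_urls):
    """Return valid, de-duplicated http(s) URLs in their original order."""
    seen = set()
    cleaned = []
    for raw in raw_urls:
        url = raw.strip()
        parts = urlsplit(url)
        # Keep only entries with an http(s) scheme and a host name.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned


# Hypothetical raw list with a duplicate, stray whitespace and invalid entries.
raw_urls = [
    "https://example.com/catalogue/page-1.html",
    "  https://example.com/catalogue/page-2.html  ",
    "https://example.com/catalogue/page-1.html",
    "not a url",
    "",
]

input_urls = clean_url_list(raw_urls)
print("\n".join(input_urls))
```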

Crawl 1000 URLs simultaneously