Field Types in Web Scraping Agent

In this tutorial, we will learn how to use the field type in a scraping agent. The type attribute specifies the type of element to display. The default type is text. There are 6 field types available:

  1. CSS
  2. REGEX
  3. DEFAULT
  4. INPUT
  5. INLINECSS
  6. INLINEREGEX

CSS Field Type

The CSS field type selects elements with a CSS selector and extracts their text or an attribute value. For example, take the page https://cdn.agenty.com/sample_content/list/ecommerce-product-list.html, from which I want to extract these fields: ProductName, ProductPrice, ProductImage, and ProductCartLink.

Steps

  1. Log in to your Agenty account.
  2. Extract the fields (ProductName, ProductPrice, ProductImage, ProductCartLink) as shown in the screenshot below.

    [Screenshot: product]

  3. Save the extracted fields; you can now see your agent.

    [Screenshot: product1]

  4. Click the Edit tab, go to the Fields and Collections section, and check that the selected field type is CSS.

    [Screenshot: Capture]
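
To make the idea concrete, here is a minimal Python sketch of what a CSS-type field does conceptually: it matches elements and reads either their text or one attribute. This is an illustration only, not Agenty's implementation. The markup, the class names (product, name, price, image, cart), and the extract_products helper are hypothetical and are not taken from the sample page. The standard library's ElementTree path syntax stands in for real CSS selectors, for example `.//*[@class='name']` in place of `.name`, and it only matches an exact class attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; the real sample page may be structured differently.
SAMPLE = """
<div>
  <div class="product">
    <h4 class="name">Blue Shirt</h4>
    <span class="price">$19.99</span>
    <img class="image" src="/img/blue-shirt.jpg" />
    <a class="cart" href="/cart/add/1">Add to cart</a>
  </div>
  <div class="product">
    <h4 class="name">Red Mug</h4>
    <span class="price">$7.50</span>
    <img class="image" src="/img/red-mug.jpg" />
    <a class="cart" href="/cart/add/2">Add to cart</a>
  </div>
</div>
"""


def extract_products(markup):
    """Return one dict per product block, with one key per field."""
    root = ET.fromstring(markup)
    rows = []
    for product in root.findall(".//div[@class='product']"):
        rows.append({
            # Text fields: read the element's text content.
            "ProductName": product.find(".//*[@class='name']").text.strip(),
            "ProductPrice": product.find(".//*[@class='price']").text.strip(),
            # Attribute fields: read a single attribute of the matched element.
            "ProductImage": product.find(".//*[@class='image']").get("src"),
            "ProductCartLink": product.find(".//*[@class='cart']").get("href"),
        })
    return rows


for row in extract_products(SAMPLE):
    print(row)
```

Each field pairs a selector with what to read from the match (text or an attribute), which is the same split the four fields above rely on.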

REGEX Field Type

Agenty offers several options for extracting data from websites, and one of them is the REGEX type. While CSS selectors are recommended whenever possible, sometimes REGEX is required to extract content that is not part of the HTML structure but still needs to be parsed into the agent result, for example the value of a JavaScript variable inside a script tag. In this example, I will show you how to extract the fields of an HTML table using the REGEX type, so you can see how the REGEX option can be used to extract anything you want from the page content. I am going to use this example page: https://cdn.agenty.com/examples/example-1.html

Steps

  1. Create a new web scraping agent using the Chrome extension, or use an example agent from Samples.

  2. Go to the example page (or the page you want to extract) and open the HTML source code in an editor or with the "View source" option in your browser.
    HTML source:

    <!DOCTYPE html>
    <html>
    <head>
    <style>
    table, th, td {
        border: 1px solid black;
        border-collapse: collapse;
    }
    th, td {
        padding: 15px;
    }
    </style>
    </head>
    <body>
    <h1>HTML Table</h1>
    <table style="width:60%">
      <tbody>
    	  <tr>
    		<td>Jill</td>
    		<td>Smith</td>		
    		<td>50</td>
    	  </tr>
    	  <tr>
    		<td>Eve</td>
    		<td>Jackson</td>		
    		<td>94</td>
    	  </tr>
    	  <tr>
    		<td>John</td>
    		<td>Doe</td>		
    		<td>80</td>
    	  </tr>
    	  <tr>
    		<td>Altay</td>
    		<td>Doe</td>		
    		<td>30</td>
    	  </tr>
    	  <tr>
    		<td>Nick</td>
    		<td>Smith</td>		
    		<td>34</td>
    	  </tr>
    	  <tr>
    		<td>Rob</td>
    		<td>Milbern</td>		
    		<td>45</td>
    	  </tr>
    	  <tr>
    		<td>Scoot</td>
    		<td>Sam</td>		
    		<td>65</td>
    	  </tr>
    	</tbody>
    </table>
    </body>
    </html>
    
  3. Use any REGEX editor tool to write and test your REGEX pattern. I am using rubular.com in this example.

    [Screenshot: HTML table scraping REGEX]

  4. Once the REGEX expression is ready, edit the agent by clicking the Edit tab and go to the Fields and Collections section. Select the field type REGEX and paste the expression into the "REGEX pattern" box. Because the expression I created matches an entire row (all 3 fields), I can use the same expression in all 3 fields, changing the "Group index" to 1, 2, and 3 for the respective field.
    [Screenshot: capture]

  5. Save the agent configuration and re-run your agent to see the updated result.
    [Screenshot: regex]
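
The pattern itself only appears in the screenshot, so the following Python sketch uses an assumed pattern of my own that matches one table row and captures its three cells. It shows how one expression can serve three fields through different group indexes. The extract_group helper and the variable names are hypothetical; only the table content comes from the example page (shortened here to two rows).

```python
import re

# Two rows copied from the example page's table.
HTML = """
<table style="width:60%">
  <tbody>
    <tr>
      <td>Jill</td>
      <td>Smith</td>
      <td>50</td>
    </tr>
    <tr>
      <td>Eve</td>
      <td>Jackson</td>
      <td>94</td>
    </tr>
  </tbody>
</table>
"""

# Assumed pattern: one <tr> with exactly three <td> cells, one capture group per cell.
ROW_PATTERN = re.compile(
    r"<tr>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*</tr>"
)


def extract_group(html, group_index):
    """Return the chosen capture group from every row match."""
    return [match.group(group_index) for match in ROW_PATTERN.finditer(html)]


first_column = extract_group(HTML, 1)   # group index 1
second_column = extract_group(HTML, 2)  # group index 2
third_column = extract_group(HTML, 3)   # group index 3
print(first_column, second_column, third_column)
```

Keeping one row-level pattern and varying only the group index means the three fields always come from the same matched row, so the columns stay aligned.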

DEFAULT Field Type

The DEFAULT field type is self-explanatory: it fills a field with a default value. It can be used in an agent wherever we want a field set to a fixed default. For example, we have the scraping agent "Default InputType Example" and want to add another field named URL, whose value defaults to "REQUEST_URL".

Before DEFAULT Field Type

[Screenshot: Default InputType Example-before]

Steps

  1. Edit the scraping agent by clicking the Edit tab and go to the Fields and Collections section.
  2. Add a field named URL.
  3. Go to the URL field and click the Edit button.
  4. Select DEFAULT as the field type and REQUEST_URL as the From value.
    [Screenshot: Capture]
  5. Save the scraping agent configuration.
  6. Finally, re-run your agent to see the updated result.

After DEFAULT Field Type

In the screenshot below, the URL field has been added to the agent, and the DEFAULT field type has set its value to the default "REQUEST_URL".

[Screenshot: Default InputType Example-after]
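
As a rough illustration of the result, the sketch below adds a URL field with the same value to every extracted row. It assumes that REQUEST_URL resolves to the URL of the page being scraped, which the name suggests but this tutorial does not state. The add_default_field helper, the sample rows, and the example URL are hypothetical, and this is not Agenty's implementation.

```python
def add_default_field(rows, field_name, value):
    """Return copies of the rows with field_name set to the same value in each."""
    return [{**row, field_name: value} for row in rows]


# Hypothetical request URL and extracted rows.
request_url = "https://example.com/page-1"
rows = [{"Title": "First"}, {"Title": "Second"}]

result = add_default_field(rows, "URL", request_url)
for row in result:
    print(row)
```

The original rows are left untouched, so the default can be added or dropped without affecting the extracted fields.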

INPUT Field Type

The INPUT field type can be used when we want to use one agent's field values in another agent. For example, take the source URL https://news.ycombinator.com/news: if you look at the content displayed there, you will find a different Page URL corresponding to each "Website URL".

Steps

  1. Create two agents. The first is List Scraping Agent-(Example), which has two fields (Website_URL, Page_URL); the second is Details Scraping Agent-(Example), which has four fields (Title, User_name, Votes, Comments). (Note: to create a list scraping agent and a details scraping agent, see the Connecting Agents documentation.)

    [Screenshot: list]

    [Screenshot: details]

  2. Edit the Details Scraping Agent by clicking the Edit tab and go to the Fields and Collections section.

  3. Click the Add Field button, add a field named Page_URL, and then edit it.

  4. Select INPUT as the Type, because this field depends on the parent agent, and set the From input to the same value as the Name input (Page_URL here), as shown in the screenshot below.
    [Screenshot: Capture]

  5. Add the function AutoFillBlankCell to the Page_URL field.

  6. Save the scraping agent configuration and re-run your agent to see the updated result.

After INPUT Field Type

In the screenshot below, you will find the extracted field.

[Screenshot: details]
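
The steps above can be pictured with a small Python sketch. It rests on two assumptions that this tutorial does not confirm: that an INPUT field places the parent agent's value on the first row produced for that input, and that AutoFillBlankCell then copies the most recent non-blank value into the blank cells below it. The function names, sample rows, and URLs are hypothetical, and this is not Agenty's implementation.

```python
def attach_input(parent_row, child_rows, field):
    """Put the parent's value on the first child row and leave the rest blank (assumed behavior)."""
    return [
        {**row, field: parent_row[field] if index == 0 else ""}
        for index, row in enumerate(child_rows)
    ]


def auto_fill_blank_cell(rows, field):
    """Fill blank cells with the last non-blank value above them (assumed semantics)."""
    last_value = ""
    filled = []
    for row in rows:
        if row.get(field):
            last_value = row[field]
        filled.append({**row, field: last_value})
    return filled


# Hypothetical output of the list agent (parent) and the details agent (child).
parent = {
    "Website_URL": "https://example.com",
    "Page_URL": "https://news.ycombinator.com/item?id=1",
}
details = [
    {"Title": "Example story", "User_name": "alice", "Votes": "10", "Comments": "3"},
    {"Title": "Example story", "User_name": "bob", "Votes": "10", "Comments": "3"},
]

rows = auto_fill_blank_cell(attach_input(parent, details, "Page_URL"), "Page_URL")
for row in rows:
    print(row)
```

With the fill applied, every detail row carries the Page_URL it came from, so the details result can be matched back to the list agent's rows.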