How to extract an HTML table using REGEX

Agenty offers several options for extracting data from websites, and REGEX is one of them. CSS selectors are recommended whenever possible, but sometimes REGEX is required to extract content that is not part of the HTML structure yet still needs to be parsed into the agent result, for example the value of a JavaScript variable inside a script tag. In this example I will show you how to extract the fields of an HTML table using REGEX, so you can learn how the REGEX option can be used to extract anything you want from the page content.
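To make the JavaScript case concrete, here is a minimal sketch in Python. The page content, the variable name `productId` and its value are all made up for illustration, and Python's `re` module only stands in for whatever REGEX engine your tool uses, so treat it as a sketch rather than Agenty's own behaviour.

```python
import re

# Hypothetical page content: the value we want lives inside a <script> tag,
# so there is no HTML element for a CSS selector to target.
page = """
<html>
<body>
<script>
  var productId = "SKU-1042";
  var price = 19.99;
</script>
</body>
</html>
"""

# Capture group 1 holds the text between the quotes of the assignment.
match = re.search(r'var\s+productId\s*=\s*"([^"]*)"', page)

# Fall back to None when the variable is not present on the page.
product_id = match.group(1) if match else None
print(product_id)
```

The part of the pattern in parentheses is the capture group; only that part is returned, while the surrounding text merely locates it.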

Note: The Agenty Chrome extension can't be used to set up REGEX fields, so we need to create a dummy agent, or use one from the samples, and then edit that agent in the agent editor to add the REGEX fields.

In this example I am going to use this page: https://cdn.agenty.com/examples/example-1.html

  1. Create a new web scraping agent using the Chrome extension, or use an example agent from the samples.
  2. Edit the agent in the agent editor and go to the Collection > Fields section.
  3. Go to the example page (or the page you want to extract) and open the HTML source code in an editor or with the browser's "View source" option.
    HTML Source:
    
    <!DOCTYPE html>
    <html>
    <head>
    <style>
    table, th, td {
        border: 1px solid black;
        border-collapse: collapse;
    }
    th, td {
        padding: 15px;
    }
    </style>
    </head>
    <body>
    <h1>HTML Table</h1>
    <table style="width:60%">
      <tbody>
    	  <tr>
    		<td>Jill</td>
    		<td>Smith</td>		
    		<td>50</td>
    	  </tr>
    	  <tr>
    		<td>Eve</td>
    		<td>Jackson</td>		
    		<td>94</td>
    	  </tr>
    	  <tr>
    		<td>John</td>
    		<td>Doe</td>		
    		<td>80</td>
    	  </tr>
    	  <tr>
    		<td>Altay</td>
    		<td>Doe</td>		
    		<td>30</td>
    	  </tr>
    	  <tr>
    		<td>Nick</td>
    		<td>Smith</td>		
    		<td>34</td>
    	  </tr>
    	  <tr>
    		<td>Rob</td>
    		<td>Milbern</td>		
    		<td>45</td>
    	  </tr>
    	  <tr>
    		<td>Scoot</td>
    		<td>Sam</td>		
    		<td>65</td>
    	  </tr>
    	</tbody>
    </table>
    </body>
    </html>
  4. Now use any REGEX testing tool to write and test your REGEX pattern. I am using rubular.com in this example and created this permanent link if you want to try it out: http://rubular.com/r/ubMF1glSP4

    HTML table scraping REGEX
     
  5. Once the REGEX expression is ready, go to the Agenty agent editor, edit the field, and paste the expression in the "REGEX pattern" box. Because the expression I created matches an entire row (all 3 fields), I can use the same expression in all 3 fields and change the "Group index" to 1, 2 and 3 for the respective field.

    REGEX fields in Agenty
     
  6. Now click the "Save" button to save the agent configuration, then return to the main agent page.
  7. Click the "Start" button to start the scraping agent and wait for the job to complete. Once the job is completed, you can see the extracted result, as in the screenshot below:

    scraping data using regex
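The group-index idea from step 5 can also be tried outside Agenty. The sketch below is in Python and uses a hypothetical row pattern written for this illustration; it is not necessarily the expression behind the Rubular link, and REGEX flavours differ between tools, so a pattern may need small adjustments when moved from one to another. The helper `extract_field` and the variable names are assumptions as well.

```python
import re

# Abbreviated copy of the example page's table (first two rows only).
html = """
<table style="width:60%">
  <tbody>
    <tr>
      <td>Jill</td>
      <td>Smith</td>
      <td>50</td>
    </tr>
    <tr>
      <td>Eve</td>
      <td>Jackson</td>
      <td>94</td>
    </tr>
  </tbody>
</table>
"""

# Hypothetical pattern: one match per table row, one capture group per cell.
# \s* absorbs the line breaks and indentation between the tags.
ROW_PATTERN = re.compile(
    r"<tr>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*</tr>"
)


def extract_field(page, group_index):
    """Return the chosen capture group from every row match on the page."""
    return [m.group(group_index) for m in ROW_PATTERN.finditer(page)]


# Same pattern for all three fields; only the group index changes.
column_1 = extract_field(html, 1)
column_2 = extract_field(html, 2)
column_3 = extract_field(html, 3)

# Stitch the three fields back together into rows.
rows = list(zip(column_1, column_2, column_3))
print(rows)
```

Reusing one pattern for every field keeps the fields aligned: each match covers a whole row, so the values at the same position in the three lists always come from the same row.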