Extracting a website using Data Scraping Studio

This first post covers the basics of web scraping and shows how to extract your first HTML web page using Data Scraping Studio. Data Scraping Studio is powered by two web extraction engines (CSS selector and REGEX) for website scraping and one post-processing engine (JavaScript) for modifying the scraped data. In this tutorial I will cover how it works and then walk through scraping an example web page using both jQuery-style CSS selectors and REGEX.

Architecture of Data Scraping Studio :

Data Scraping Studio is an easy-to-use yet powerful suite of website scraping tools. Use the point-and-click element selector in the Chrome app to create a web scraping agent with an instant preview of the extracted data. Then use the desktop app or the hosted app for advanced features such as batch crawling, scheduling, scraping multiple websites simultaneously and more.

The desktop app, inspired by Microsoft SQL Server, can extract data from any number of websites simultaneously and automate the web data collection process. The process is managed and executed through a scraping agent file (*.scraping) in a multi-threaded architecture designed to run as many scraping agents in parallel as you want. A scraping agent is simply a file that stores, locally, all the configuration for a particular website you are going to scrape: for example, the fields you want to scrape, the list of URLs and so on.
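To make the idea of an agent as "a bundle of configuration that can run in parallel with other agents" more concrete, here is a minimal Python sketch. It does not reflect the real *.scraping file format or the app's internals, neither of which is described in this post: the `ScrapingAgent` class, `run_agent` and `run_agents_in_parallel` are hypothetical names, and the run step only summarizes the configuration instead of fetching pages.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class ScrapingAgent:
    """Hypothetical in-memory stand-in for a *.scraping agent file."""
    name: str
    fields: dict = field(default_factory=dict)  # field name -> extraction pattern
    urls: list = field(default_factory=list)    # input URLs to crawl


def run_agent(agent):
    """Placeholder run: summarize the configuration instead of fetching pages."""
    return f"{agent.name}: {len(agent.fields)} field(s) x {len(agent.urls)} URL(s)"


def run_agents_in_parallel(agents, max_workers=4):
    """Run each agent on a worker thread; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_agent, agents))


agents = [
    ScrapingAgent(
        "phones",
        {"NAME": '<div class="name">([^<]+)</div>'},
        ["https://cdn.agenty.com/sample_content/simple-list.html"],
    ),
    ScrapingAgent("empty-agent"),
]
summaries = run_agents_in_parallel(agents)
```

Keeping each agent's configuration self-contained is what makes it safe to hand agents to separate threads: no agent needs to read or modify another agent's state.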

web scraper online


In this tutorial, we will cover basic list page scraping using the CSS and REGEX data extraction engines in Data Scraping Studio.

  • Option 1 - Using CSS Selector
  • Option 2 - Using REGEX

Sample HTML Content

Web page URL : https://cdn.agenty.com/sample_content/simple-list.html

We will use the above URL to extract the NAME, BRAND, COLOR and PRICE fields from this web page, first with CSS selectors and then with REGEX patterns.

<!DOCTYPE html>
<title>List Extraction</title>
<meta name="robots" content="noindex, nofollow" />
<style>body{font-family:Arial;color:#333}#list-1,#list-2,#list-3,#list-4{border:2px solid #ddd;padding:10px;margin-bottom:10px}.label{font-weight:700;float:left}</style>
<h1>List extraction</h1>
<h2>This example is used to extract the list with Data Scraping Studio</h2>

<div id="list-1">
  <div class="label">Name : </div><div class="name">Xiaomi Redmi 2 Prime</div>
  <div class="label">Brand : </div><div class="brand">Xiaomi</div>
  <div class="label">Color : </div><div class="color">Red</div>
  <div class="label">Price : </div><div class="price">$199</div>
</div>

<div id="list-2">
  <div class="label">Name : </div><div class="name">Asus Zenfone 2 Laser ZE550KL</div>
  <div class="label">Brand : </div><div class="brand">Asus</div>
  <div class="label">Color : </div><div class="color">Black</div>
  <div class="label">Price : </div><div class="price">$299</div>
</div>

<div id="list-3">
  <div class="label">Name : </div><div class="name">Intex Aqua Trend</div>
  <div class="label">Brand : </div><div class="brand">Intex</div>
  <div class="label">Color : </div><div class="color">White</div>
  <div class="label">Price : </div><div class="price">$230</div>
</div>

<div id="list-4">
  <div class="label">Name : </div><div class="name">Sony Xperia C5 Ultra Dual</div>
  <div class="label">Brand : </div><div class="brand">Sony</div>
  <div class="label">Color : </div><div class="color">Grey</div>
  <div class="label">Price : </div><div class="price">$147</div>
</div>



Using CSS Selector

Step 1 : Install the Chrome extension from the Chrome Web Store.

Step 2 : Navigate to this sample page : https://cdn.agenty.com/sample_content/simple-list.html and then launch the app.

Step 3 : Click the "New" button to add a field and give the field a friendly name. Then click the HTML element on the web page that you want to extract. The app will automatically generate the best possible selector for the element and display the number of matching results. You may also click the "Show" hyperlink to see an instant preview and select the type of data (TEXT, HTML, ATTR) you want to extract. (see this tutorial for advanced options)

point and click screen scraper

Step 4 : Repeat Step 3 for all the fields you want to extract.
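To see what a class-based CSS selector matches on the sample page outside the app, here is a rough Python sketch using only the standard library. It assumes the generated selector is a simple class selector such as `.name` (the post does not show the exact selector the app produces), and `ClassTextExtractor` and `select_text` are hypothetical helper names. It is not a full CSS engine: it only handles class matches and does not cope with unclosed void elements such as `<br>` inside a match.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element whose class list contains class_name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside a matching element
        self.buffer = []    # text pieces of the element being collected
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.class_name in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def select_text(html, class_name):
    """Return the text of all elements matching the class selector .class_name."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    return parser.results


sample = """
<div id="list-1">
  <div class="label">Name : </div><div class="name">Xiaomi Redmi 2 Prime</div>
  <div class="label">Price : </div><div class="price">$199</div>
</div>
<div id="list-2">
  <div class="label">Name : </div><div class="name">Asus Zenfone 2 Laser ZE550KL</div>
  <div class="label">Price : </div><div class="price">$299</div>
</div>
"""
names = select_text(sample, "name")
prices = select_text(sample, "price")
```

Each field in the agent corresponds to one such selector, and the number of entries in the returned list plays the role of the matching results count shown by the app.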

You may also download the on-page output in CSV, TSV or JSON format from the Chrome extension.
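For readers unfamiliar with these three formats, the following Python sketch serializes two rows from the sample page to each of them. The column layout is an assumption for illustration (one column per field name); the exact files produced by the extension may differ, and `to_delimited` is a hypothetical helper.

```python
import csv
import io
import json

rows = [
    {"NAME": "Xiaomi Redmi 2 Prime", "BRAND": "Xiaomi", "COLOR": "Red", "PRICE": "$199"},
    {"NAME": "Asus Zenfone 2 Laser ZE550KL", "BRAND": "Asus", "COLOR": "Black", "PRICE": "$299"},
]


def to_delimited(rows, delimiter=","):
    """Serialize rows to delimited text with a header row (comma = CSV, tab = TSV)."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(rows[0]), delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


csv_text = to_delimited(rows)
tsv_text = to_delimited(rows, delimiter="\t")
json_text = json.dumps(rows, indent=2)
```

CSV and TSV are convenient for spreadsheets, while JSON keeps the field names attached to every record, which is usually easier to consume from code.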

Once done with the setup, save the scraping agent by clicking the “Done” button. The *.scraping file will be saved to your local download directory.

download scraping agent

Now the *.scraping agent file can be executed with the desktop app or the hosted app, which offer more advanced features like batch crawling, scheduling and more. Double-click the scraping agent file to execute it in the Data Scraping Studio desktop app.

Free data scraper software


Using REGEX

Step 1 : Click the "New Agent" button to create a new scraping agent.

Step 2 : Give your scraping agent a friendly name and browse to the path where you want to save it, then add all the fields you want to extract along with their REGEX patterns. For example, to extract the name of the product:

Where the HTML is

<div class="name">Xiaomi Redmi 2 Prime</div>

So the REGEX will be

<div class="name">([^<]+)</div>
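If you want to try this pattern outside Data Scraping Studio, a quick check with Python's `re` module looks like the sketch below. Python is used here only as a stand-in tester; the regex engine inside Data Scraping Studio may differ in some details, although this particular pattern uses only basic syntax. The capture group `([^<]+)` grabs everything up to the next `<`, which is the text inside the `div`.

```python
import re

# The pattern from the tutorial; group 1 captures the product name.
NAME_PATTERN = re.compile(r'<div class="name">([^<]+)</div>')

html = (
    '<div class="label">Name : </div><div class="name">Xiaomi Redmi 2 Prime</div>'
    '<div class="label">Name : </div><div class="name">Intex Aqua Trend</div>'
)

# findall returns the captured group for every match, in document order.
names = NAME_PATTERN.findall(html)
```

Note that the `label` elements are not matched, because the pattern requires the exact `class="name"` attribute.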

You can use the built-in REGEX tester tool of Data Scraping Studio or any third-party tool of your choice. As the screenshot below shows, I've added four REGEX fields (NAME, BRAND, COLOR, PRICE) to the setup and one DEFAULT column (RESPONSE_URL).

setup a web scraping agent

Data Scraping Studio will search RESPONSE_CONTENT for every match of the given pattern and return all matching results.
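As a rough illustration of searching the page content once per field and returning all matches, here is a Python sketch. Several things in it are assumptions: the BRAND, COLOR and PRICE patterns are written by analogy with the NAME pattern, the way matches are aligned into rows (the i-th match of each field forms the i-th row) is a guess at reasonable behavior rather than a description of the product, and `extract_rows` is a hypothetical helper.

```python
import re

# NAME comes from the tutorial; the other three patterns are assumed by analogy.
FIELD_PATTERNS = {
    "NAME": r'<div class="name">([^<]+)</div>',
    "BRAND": r'<div class="brand">([^<]+)</div>',
    "COLOR": r'<div class="color">([^<]+)</div>',
    "PRICE": r'<div class="price">([^<]+)</div>',
}


def extract_rows(response_url, response_content):
    """Find all matches per field, then align the i-th matches into one row."""
    columns = {
        name: re.findall(pattern, response_content)
        for name, pattern in FIELD_PATTERNS.items()
    }
    rows = []
    # zip stops at the shortest column, so incomplete trailing items are dropped.
    for values in zip(*columns.values()):
        row = dict(zip(columns.keys(), values))
        row["RESPONSE_URL"] = response_url  # default column added to every row
        rows.append(row)
    return rows


url = "https://cdn.agenty.com/sample_content/simple-list.html"
content = (
    '<div class="name">Xiaomi Redmi 2 Prime</div><div class="brand">Xiaomi</div>'
    '<div class="color">Red</div><div class="price">$199</div>'
    '<div class="name">Sony Xperia C5 Ultra Dual</div><div class="brand">Sony</div>'
    '<div class="color">Grey</div><div class="price">$147</div>'
)
rows = extract_rows(url, content)
```

Aligning by position works for the sample page because every item has all four fields; on pages where a field can be missing, the columns would drift out of step and a per-item pattern would be safer.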

Now that we have defined which items need to be extracted from the website, the next step is to provide the list of URLs. Data Scraping Studio will traverse all of these web pages and extract the information for us.

Step 3 :

Click the Next button to go to the Input tab and paste the list of URLs into the grid there.

You can also use other advanced input options, e.g. reading the URLs from a web API or from a CSV/TSV/JSON file located on your hard drive.
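To give a feel for what file-based URL input amounts to, here is a small Python sketch that reads URL lists from delimited text and from JSON. The layouts are assumptions made for the example, not the formats the app requires: a delimited file with a header row and a column named `url`, and a JSON file holding a plain array of strings. Both function names are hypothetical.

```python
import csv
import io
import json


def urls_from_delimited(text, column="url", delimiter=","):
    """Read URLs from CSV/TSV text with a header row; blank cells are skipped."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [row[column].strip() for row in reader if row.get(column)]


def urls_from_json(text):
    """Read URLs from a JSON array of strings."""
    return [str(u).strip() for u in json.loads(text)]


csv_text = "url\nhttps://cdn.agenty.com/sample_content/simple-list.html\n"
tsv_text = "id\turl\n1\thttps://cdn.agenty.com/sample_content/simple-list.html\n"
json_text = '["https://cdn.agenty.com/sample_content/simple-list.html"]'

from_csv = urls_from_delimited(csv_text)
from_tsv = urls_from_delimited(tsv_text, delimiter="\t")
from_json = urls_from_json(json_text)
```

When reading from real files, you would open them with `open(path, newline="")` for CSV/TSV and pass the file contents to the same functions.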

list of url in scraping

Leave all other OUTPUT and ADVANCE SETTING options at their defaults for now; we will learn about those features in the advanced web scraping tutorials.

To finish the setup, click the Save button.

Step 4 :

  1. Find the newly created agent (my-first-agent) in the left panel tree, then right-click it and choose Start to run the agent.
  2. Data Scraping Studio will crawl each page from the input provided during setup and display the results in the grid.
  3. More detailed log messages are available in the logs window.
