Scraping Agent

A scraping agent is a set of configuration (fields, selectors, headers, and so on) for scraping a particular website. A scraping agent can be used to extract data from public websites, password-protected websites, sitemaps, RSS feeds, XML pages, web APIs, JSON pages, and many other sources on the web.

A scraping agent can be created using our Chrome extension, available in the Chrome Web Store.

Batch URL Crawling

A single scraping agent can extract data from any number of pages, whether 100 or millions of similarly structured pages. You just need to supply the URLs using one of the input type options available in the agent, or use advanced features such as pagination, password-protected site crawling (with the credentials supplied automatically), and scripting to clean, validate, or manipulate the result data.

Most websites structure their page content and body differently, so a single scraping agent can only extract data from the particular website it was set up for. It can, however, extract any number of similarly structured pages by using pagination or by adding a URL list, as in the sketch below.
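For example, crawling a batch of URLs instead of the agent's source URL is controlled through the input object (shown in the full configuration below, where it is set to "SOURCE"). The "LIST" type name and the reference value here are illustrative assumptions, not documented values; the exact options depend on the input types available in your agent:

"input": {
  "type": "LIST",
  "reference": "my-url-list"
}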

Example Configuration

{
  "name": "Books price scraping agent",
  "description": "This agent will extract the product list, prices, image and detail page hyperlink from books.toscrape.com website",
  "type": "scraping",
  "config": {
    "sourceurl": "http://books.toscrape.com/",
    "collections": [
      {
        "name": "Collection1",
        "fields": [
          {
            "name": "NAME",
            "type": "CSS",
            "selector": "h3 a",
            "extract": "TEXT",
            "attribute": null,
            "from": null,
            "visible": true,
            "cleantrim": true,
            "joinresult": false,
            "postprocessing": [],
            "formatter": []
          },
          {
            "name": "PRICE",
            "type": "CSS",
            "selector": ".price_color",
            "extract": "TEXT",
            "attribute": "",
            "from": null,
            "visible": true,
            "cleantrim": true,
            "joinresult": false,
            "postprocessing": [
              {
                "function": "Insert",
                "parameters": [
                  {
                    "name": null,
                    "value": "http://books.toscrape.com/"
                  }
                ]
              }
            ],
            "formatter": []
          },
          {
            "name": "IMAGE",
            "type": "CSS",
            "selector": ".thumbnail",
            "extract": "ATTR",
            "attribute": "src",
            "from": null,
            "visible": true,
            "cleantrim": true,
            "joinresult": false,
            "postprocessing": [
              {
                "function": "Insert",
                "parameters": [
                  {
                    "name": "Input",
                    "value": "http://books.toscrape.com/"
                  }
                ]
              }
            ],
            "formatter": []
          },
          {
            "name": "DETAILS_PAGE_URL",
            "type": "CSS",
            "selector": ".product_pod h3 a",
            "extract": "ATTR",
            "attribute": "href",
            "from": null,
            "visible": true,
            "cleantrim": true,
            "joinresult": false,
            "postprocessing": [
              {
                "function": "Insert",
                "parameters": [
                  {
                    "name": "Input",
                    "value": "http://books.toscrape.com/"
                  }
                ]
              }
            ],
            "formatter": []
          }
        ]
      }
    ],
    "engine": {
      "name": "default",
      "loadjavascript": true,
      "loadimages": false,
      "timeout": 30,
      "viewport": {
        "width": 1280,
        "height": 600
      }
    },
    "waitafterpageload": null,
    "login": {
      "enabled": false,
      "type": null,
      "data": []
    },
    "logout": null,
    "pagination": {
      "enabled": true,
      "type": "CLICK",
      "selector": ".next a",
      "maxpages": 50
    },
    "header": {
      "method": "GET",
      "encoding": "utf-8",
      "data": [
        {
          "key": "Accept",
          "value": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
        },
        {
          "key": "User-Agent",
          "value": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
        },
        {
          "key": "Accept-Language",
          "value": "*"
        }
      ]
    },
    "autoredirect": {
      "enabled": true,
      "maxautoredirect": 3
    },
    "failretry": {
      "enabled": true,
      "maxtry": 3,
      "tryinterval": 2,
      "timeout": 0
    },
    "proxy": {
      "enabled": false,
      "type": null,
      "reference": null
    },
    "throttling": {
      "enabled": false,
      "type": null,
      "seconds": 0
    },
    "formsubmit": {
      "enabled": false,
      "data": []
    },
    "meta": null,
    "input": {
      "type": "SOURCE",
      "reference": null
    }
  }
}

Engine

The engine object is mandatory and indicates the type of web scraping engine (or browser) that will be used in the cloud to execute your agent.

  1. Default - The Default engine is fast and supports all the features needed to scrape any website. It is selected by default and is the right choice if you are not sure which engine to use.
  2. FastBrowser - The FastBrowser engine is faster than the Default engine but has JavaScript disabled permanently. It should be used for websites that do not require JavaScript to be loaded in order to crawl their data.
  3. HttpClient - The HttpClient engine is the fastest of all, but it cannot log in or submit forms and has no JavaScript support. It can be used to scrape static websites, XML, web APIs, JSON pages, etc.; an illustrative selection is sketched after the snippet below.
"engine": {
      "name": "default",
      "loadjavascript": true,
      "loadimages": false,
      "timeout": 30,
      "viewport": {
        "width": 1280,
        "height": 600
      }
    }
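
For a static website, XML feed, or web API, the HttpClient engine can be selected instead. The snippet below is a sketch based on the schema above; the exact name string (for example "HttpClient" versus "httpclient") is an assumption to verify in the agent editor, and the viewport block is omitted since no browser rendering is involved:

"engine": {
  "name": "HttpClient",
  "loadjavascript": false,
  "loadimages": false,
  "timeout": 30
}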

Wait after page load

The waitafterpageload object is optional and indicates whether Agenty should wait for some element to appear on the page, or for a fixed number of seconds, before running the field extractors.

"waitafterpageload": {
        "enabled": true,
        "type": "SELECTOR",
        "timeout": 10,
        "selector": ".item"
    }

The waitafterpageload option is enabled by default when the Default engine is used, and Agenty will, by default, wait for the first CSS-type field selector before running the extraction.
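
To wait a fixed number of seconds instead of waiting for a selector, the same object shape should apply with a different type. The "SECONDS" value below is an assumption for illustration only; this document does not name the exact type value for a fixed delay, so verify it in the agent editor:

"waitafterpageload": {
  "enabled": true,
  "type": "SECONDS",
  "timeout": 10,
  "selector": null
}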

Login

The login object is optional and indicates whether the scraping agent needs to log in before it starts scraping pages. The login commands should be added to the data[] array under the login object, as in the illustrative sketch after the snippet below.

"login": {
      "enabled": true,
      "type": "FORM",
      "data": []
    }
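
For illustration, a form login would typically type the credentials and click the submit button. The command shape below (type, selector, value) is an assumed structure, not the documented schema; consult the agent editor for the actual command format:

"login": {
  "enabled": true,
  "type": "FORM",
  "data": [
    { "type": "TYPE", "selector": "#username", "value": "my-user" },
    { "type": "TYPE", "selector": "#password", "value": "my-secret" },
    { "type": "CLICK", "selector": "button[type='submit']" }
  ]
}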

Logout

The logout object is optional and indicates whether the scraping agent needs to log out before completing the job and closing the instance. The logout commands should be added to the data[] array under the logout object.

"logout": {
      "enabled": true,
      "data": []
    }

Pagination

The pagination object is optional and identifies the next-page button or hyperlink to click through a listing or search results page.

 "pagination": {
        "enabled": true,
        "type": "CLICK",
        "selector": ".next a",
        "maxpages": 5
    }

One Agent for Many Websites

There are cases where one agent can be used for many websites, for example:

  1. Meta Tags Scraping: If you want to extract the meta tags (title, description, canonical, etc.) from thousands of websites, mainly for SEO purposes, you can use the same agent for all of them, because every website uses the same structure for meta tags, like below (a hedged field configuration follows the snippet):
<title>page title here</title>
<meta name="description" content="page description here" />
<link rel="canonical" href="page url here" />
  2. Structured Data Scraping: Google and most other search engines use structured data to display instant results about an organization, product, review, rating, and more, and most popular websites use structured data markup to keep their pages search-engine friendly and well ranked. So you can create an agent to extract the structured data and then use that same agent for the many websites that carry it, for example (a hedged field sketch follows the snippet):

    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "Organization",
      "url": "http://www.example.com",
      "name": "Unlimited Ball Bearings Corp.",
      "contactPoint": {
        "@type": "ContactPoint",
        "telephone": "+1-401-555-1212",
        "contactType": "Customer service"
      }
    }
    </script>
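
An agent for this case might define a single field that captures the raw JSON-LD block, again using the field schema from the example configuration; the field name is illustrative and the other field properties are omitted as above:

{
  "name": "STRUCTURED_DATA",
  "type": "CSS",
  "selector": "script[type='application/ld+json']",
  "extract": "TEXT",
  "attribute": null
}

The extracted text can then be parsed as JSON in a later step (for example, a post-processing script) to pick out values such as @type, name, or contactPoint.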