Default Fields in Scraping Agent

Agenty's CSS selector and REGEX extractor engines can extract anything from the source content, but sometimes you may want your output to include a few fields that are not part of the website's source content. For example, you may be scraping stock information for 100 companies from a stock website and also want a DATETIME field in your output to record when each page was fetched and its information extracted.

For these cases, Agenty offers built-in default fields that you can add to your agent when needed:

Name

Type

Description

REQUEST_URL

string

URL of the web page, as provided in the input

RESPONSE_URL

string

URL of the web page returned by the web server after any redirects.
(E.g. if your input has the old URL http://www.domain.com/some-old-product-page.html but the web server returns a 301 redirect and serves the new URL http://www.domain.com/new-page.html, the RESPONSE_URL field will contain the new URL.)

RESPONSE_HTTP_CODE

int

HTTP status code of a successful web request (e.g. 200, 301)

RESPONSE_HTTP_STATUS

string

HTTP status of a successful web request (e.g. OK, Moved Permanently)

RESPONSE_ERROR_CODE

int

HTTP error code of a failed web request (e.g. 404, 408, 503)

RESPONSE_ERROR_MSG

string

HTTP error message of a failed web request. E.g.

  • Not Found
  • Request Time Out
  • Website under maintenance etc.

See more about HTTP status codes on the W3 website.

RESPONSE_HEADER

string

Collection of web response headers. E.g.

Cache-Control:private
Connection:Keep-Alive
Content-Encoding:gzip
Content-Length:8922
Content-Type:text/html; charset=utf-8
Date:Thu, 10 Dec 2015 11:44:36 GMT
Proxy-Connection:Keep-Alive
Server:Microsoft-IIS/7.5
Vary:Accept-Encoding
X-AspNet-Version:4.0.30319
X-Powered-By:ASP.NET

RESPONSE_CONTENT

string

Complete source code of the requested web page. E.g.

<!DOCTYPE html>
<html>
<head>
<title>Sample page</title>
</head>		
<body>
<h1>Page Heading</h1>
<h2>This is an example page</h2>
.....
.....
.....
</body>
</html>

RESPONSE_DATETIME

DateTime

The date and time when a particular request was fetched (format: MM/dd/yyyy hh:mm:ss), in GMT (UTC+0)

RESPONSE_DATE

Date

The date when a particular request was fetched (format: MM/dd/yyyy), in GMT (UTC+0)

RESPONSE_TIME

Time

The time when a particular request was fetched (format: hh:mm:ss), in GMT (UTC+0)
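As a rough illustration of how these fields might be consumed downstream, the sketch below post-processes one exported result row in Python. The row dictionary, its sample values, and the helper names are assumptions made for this example and are not part of Agenty. It also assumes that the default fields arrive as plain strings in the formats listed above, with RESPONSE_HEADER holding one "Name:value" pair per line.

```python
from datetime import datetime, timezone


def parse_response_datetime(value):
    """Parse a RESPONSE_DATETIME string (MM/dd/yyyy hh:mm:ss, GMT) into an aware datetime."""
    # Assumption: hours are read with %H (00-23). If the export uses a
    # 12-hour clock without an AM/PM marker, the half of the day cannot be recovered.
    parsed = datetime.strptime(value, "%m/%d/%Y %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def parse_response_header(value):
    """Split a RESPONSE_HEADER string into a {name: value} dictionary."""
    headers = {}
    for line in value.splitlines():
        # Split on the first colon only, since values such as Date contain colons.
        name, separator, header_value = line.partition(":")
        if separator:
            headers[name.strip()] = header_value.strip()
    return headers


def was_redirected(row):
    """Return True when the server answered from a different URL than the input."""
    return row["REQUEST_URL"] != row["RESPONSE_URL"]


# Hypothetical result row, built from the sample values in the table above.
row = {
    "REQUEST_URL": "http://www.domain.com/some-old-product-page.html",
    "RESPONSE_URL": "http://www.domain.com/new-page.html",
    "RESPONSE_DATETIME": "12/10/2015 11:44:36",
    "RESPONSE_HEADER": "Content-Type:text/html; charset=utf-8\n"
                       "Date:Thu, 10 Dec 2015 11:44:36 GMT",
}

fetched_at = parse_response_datetime(row["RESPONSE_DATETIME"])
headers = parse_response_header(row["RESPONSE_HEADER"])

print(fetched_at.isoformat())
print(headers["Content-Type"])
print(was_redirected(row))
```

Attaching the UTC time zone explicitly keeps later comparisons with local timestamps unambiguous, since the field values themselves carry no offset.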

How to add a DEFAULT Field

  1. Edit your agent in the agent editor (create one if you don't have an agent yet; DEFAULT fields can be added once the agent is created)
  2. Go to the Collections tab and click the Add field button to add a field
  3. Change the Type to DEFAULT
  4. Select the default field you want to add from the From drop-down box, as in the screenshot below

(Screenshot: add default field)

  5. Finally, save the scraping agent configuration.

The steps above change only the scraping agent configuration; you need to re-run your agent to extract the default field value whenever a field is added or updated.
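Because results produced before the configuration change will not contain the new field, a quick check on exported rows can confirm whether a re-run is still needed. The sketch below is hypothetical: it assumes the results are available as a list of dictionaries keyed by field name (for example, after loading a CSV or JSON export), and `rows_missing_field` is a helper name invented for this example.

```python
def rows_missing_field(rows, field_name):
    """Return the indexes of rows where the field is absent or empty."""
    return [
        index
        for index, row in enumerate(rows)
        if row.get(field_name) in (None, "")
    ]


# Hypothetical exports: one from before the default field was added, one from after.
old_run = [{"REQUEST_URL": "http://www.domain.com/new-page.html"}]
new_run = [
    {
        "REQUEST_URL": "http://www.domain.com/new-page.html",
        "RESPONSE_DATETIME": "12/10/2015 11:44:36",
    }
]

print(rows_missing_field(old_run, "RESPONSE_DATETIME"))  # rows that need a re-run
print(rows_missing_field(new_run, "RESPONSE_DATETIME"))  # empty when the field is populated
```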