Scraping Web API Data Using JSONPath Query Selectors

On December 1st, 2018, we are adding a new feature to the scraping agent that allows users to extract data from JSON web APIs using JSONPath selectors. JSONPath is a query language for JSON that lets us refer to parts of a JSON object structure in the same way XPath expressions do for XML documents. In Agenty, you can apply JSONPath expressions to refer to specific objects or elements in the page response and extract any field from JSON using our scraping agents.

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is widely used by web APIs to present data in a structured way or to integrate with other apps. Because many websites offer API access, a scraping tool needs this capability these days: scraping JSON directly allows Agenty users to centralize all of their data collection agents on a single platform. In this tutorial, we will learn how to scrape an example API using JSONPath query selectors.

JSON Example

For this demo, I have created this JSON example page https://cdn.agenty.com/examples/json/json-example-1.json. If you look at the JSON content, it is an array of objects where each object has 5 properties (rank, cnt, tt, au, yr) and their corresponding values.

[
  {
    "rank": 1,
    "cnt": 5,
    "tt": "The Great Gatsby",
    "au": "F. Scott Fitzgerald",
    "yr": "1990"
  },
  {
    "rank": 2,
    "cnt": 5,
    "tt": "The Grapes of Wrath",
    "au": "John Steinbeck",
    "yr": "1972"
  },
  {
    "rank": 3,
    "cnt": 5,
    "tt": "The Catcher in the Rye",
    "au": "J.D. Salinger",
    "yr": "1993"
  },
  {
    "rank": 4,
    "cnt": 4,
    "tt": "Invisible Man",
    "au": "Ralph Ellison",
    "yr": "1988"
  },
  {
    "rank": 5,
    "cnt": 4,
    "tt": "The Sound and the Fury",
    "au": "William Faulkner",
    "yr": "1987"
  },
  {
    "rank": 6,
    "cnt": 4,
    "tt": "The Sun Also Rises",
    "au": "",
    "yr": "1988"
  },
  {
    "rank": 7,
    "cnt": 4,
    "tt": "Things Fall Apart",
    "au": "Chinua Achebe",
    "yr": "1996"
  },
  {
    "rank": 8,
    "cnt": 4,
    "tt": "Lolita",
    "au": "Vladimir Vladimirovich Nabokov",
    "yr": "1983"
  },
  {
    "rank": 9,
    "cnt": 4,
    "tt": "A Passage to India",
    "au": "E. M. Forster",
    "yr": "1984"
  },
  {
    "rank": 10,
    "cnt": 4,
    "tt": "1984",
    "au": "George Orwell",
    "yr": "1977"
  },
  {
    "rank": 11,
    "cnt": 4,
    "tt": "Beloved",
    "au": "Toni Morrison",
    "yr": "1987"
  },
  {
    "rank": 12,
    "cnt": 4,
    "tt": "Native Son",
    "au": "Richard T. Wright",
    "yr": "1940"
  },
  {
    "rank": 13,
    "cnt": 4,
    "tt": "Catch-22",
    "au": "Joseph Heller",
    "yr": ""
  },
  {
    "rank": 14,
    "cnt": 4,
    "tt": "Go Tell it on the Mountain",
    "au": "James Baldwin",
    "yr": "1954"
  },
  {
    "rank": 15,
    "cnt": 4,
    "tt": "On the Road",
    "au": "Jack Kerouac",
    "yr": "1991"
  },
  {
    "rank": 16,
    "cnt": 3,
    "tt": "ULYSSES",
    "au": "James Joyce",
    "yr": "1961"
  },
  {
    "rank": 17,
    "cnt": 3,
    "tt": "Don Quixote",
    "au": "Miguel de Cervantes",
    "yr": "1982"
  },
  {
    "rank": 18,
    "cnt": 3,
    "tt": "To the Lighthouse",
    "au": "Virginia Woolf",
    "yr": "1982"
  },
  {
    "rank": 19,
    "cnt": 3,
    "tt": "Madame Bovary",
    "au": "Gustave Flaubert",
    "yr": "1998"
  },
  {
    "rank": 20,
    "cnt": 3,
    "tt": "An American Tragedy",
    "au": "Theodore Dreiser",
    "yr": "2000"
  }
]
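
Before writing any selectors, it can help to confirm the shape of the data. The following is a minimal sketch in plain Python, not part of Agenty; it inlines the first two records from the example above so it runs without network access, and the variable names are only illustrative.

```python
import json

# First two records from the example page, inlined so the sketch runs offline.
SAMPLE = """[
  {"rank": 1, "cnt": 5, "tt": "The Great Gatsby", "au": "F. Scott Fitzgerald", "yr": "1990"},
  {"rank": 2, "cnt": 5, "tt": "The Grapes of Wrath", "au": "John Steinbeck", "yr": "1972"}
]"""

books = json.loads(SAMPLE)

# The top level is an array (a Python list) and every item is an object (a dict).
is_array_of_objects = isinstance(books, list) and all(isinstance(b, dict) for b in books)

# Collect every property name that appears in any object.
property_names = sorted({key for book in books for key in book})

print(is_array_of_objects)
print(property_names)

# rank and cnt are JSON numbers, while yr is stored as a string.
print(type(books[0]["rank"]).__name__, type(books[0]["yr"]).__name__)
```

Note that yr is a quoted string in the source data, so it comes back as text rather than a number.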

Since this is not an HTML page, we cannot use our Chrome Extension to generate CSS selectors automatically. Instead, we need to create the agent manually and then edit it in Agenty to add or update the fields and URL. So let's create a placeholder agent from the samples; you can also create one from any website.

Create Agent

  1. Go to agents
  2. Click on "New Agent"
  3. Get any of the example agents available in the "Sample Agent" section. (Because we are going to edit the URL, selectors, fields, etc., any demo agent will do as a starting point.)

JSONPath Reference

We will use JSONPath selectors to extract individual property values from the JSON objects, so use any online JSONPath testing tool to build or test your selector. I am going to use jsonpath.com in this example because it shows the result instantly as we type the selector. We enter the sample JSON in the "Inputs" box, and the tool displays the result as we type our query selector. Here is the complete JSONPath reference:

Expression | Description
$ | The root object or an array.
object.property or ['object'].['property'] | Selects the property named "property" in the object named "object". Note: use the latter notation if the name of the property includes special characters (for example, spaces) or begins with a character other than A..Z, a..z, or _.
[n] | Selects the n-th element from an array. Indexes start from 0.
[n1,n2] | Selects the n1 and n2 array items. Returns a list.
..property | Performs a deep scan for the specified property in all available objects. Always returns a list of properties, even for a single match.
* | Wildcard. Selects all elements in an object or array, regardless of their names or indexes.
[n1:n2] or [n1:] | Selects array items from n1 up to n2. The n2 element is not included in the selection. Remove n2 from the expression to select array items from n1 to the end of the array. Returns a list.
[:n] | Selects the first n items of the array. Returns a list.
[-n:] | Selects the last n items of the array. Returns a list.
[?(expression)] | Filter. Selects all elements in an object or array that match the specified filter. Returns a list.
@ | Used in filter expressions. Refers to the current node for further processing.

Note: All JSONPath expressions (including property names and values) are case-sensitive. See more examples in the JsonPath GitHub repository.
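
To make the reference more concrete, here is a rough sketch of what several of these expressions select, written as plain-Python equivalents rather than with a JSONPath library. It is only an illustration of the semantics (it is not how Agenty or jsonpath.com evaluate selectors), it uses the first five records of the example JSON, and the helper names titles and deep_scan are made up for this sketch.

```python
import json

# First five records from the example page, inlined so the sketch runs offline.
books = json.loads("""[
  {"rank": 1, "cnt": 5, "tt": "The Great Gatsby", "au": "F. Scott Fitzgerald", "yr": "1990"},
  {"rank": 2, "cnt": 5, "tt": "The Grapes of Wrath", "au": "John Steinbeck", "yr": "1972"},
  {"rank": 3, "cnt": 5, "tt": "The Catcher in the Rye", "au": "J.D. Salinger", "yr": "1993"},
  {"rank": 4, "cnt": 4, "tt": "Invisible Man", "au": "Ralph Ellison", "yr": "1988"},
  {"rank": 5, "cnt": 4, "tt": "The Sound and the Fury", "au": "William Faulkner", "yr": "1987"}
]""")


def titles(records):
    """Return just the tt values, to keep the printed results short."""
    return [record["tt"] for record in records]


def deep_scan(node, name):
    """Collect every value stored under `name` at any depth (like ..name)."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                found.append(value)
            found.extend(deep_scan(value, name))
    elif isinstance(node, list):
        for item in node:
            found.extend(deep_scan(item, name))
    return found


first = books[0]                                    # $[0]
picked = [books[0], books[2]]                       # $[0,2]
middle = books[1:3]                                 # $[1:3]  (index 3 is excluded)
head = books[:2]                                    # $[:2]
tail = books[-2:]                                   # $[-2:]
cnt_is_5 = [b for b in books if b["cnt"] == 5]      # $[?(@.cnt == 5)]
all_titles = deep_scan(books, "tt")                 # $..tt

print(first["tt"])
print(titles(picked))
print(titles(middle))
print(titles(head))
print(titles(tail))
print(titles(cnt_is_5))
print(all_titles)
```

In the filter line, the loop variable b plays the role of @, the current node being tested.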

JSONPath Notation

A JSONPath expression describes the path to a single property or set of properties in a JSON structure. So, we can use any of the following notations:

  • Dot notation:

    $.[*].rank
    Extracts the rank property from all objects

  • Bracket notation:

    $[0].[rank]
    Extracts the rank property from the first object in the array only

So, using the above reference, if we want to extract the rank field from the example JSON, our JSONPath query selector will be $.[*].rank

Similarly, for the next field cnt the JSONPath will be $.[*].cnt

And for the tt field it will be $.[*].tt

We can now create the selectors for the au and yr fields in the same way.
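
As a sanity check on what the wildcard selectors should return, here is a small plain-Python sketch. The helper select_all is hypothetical and only mimics the pattern $.[*].<field>; it is not Agenty's parser. The three records are taken from the example JSON and were chosen because two of them contain an empty au or yr value.

```python
# Three records from the example JSON; ranks 6 and 13 have an empty value.
records = [
    {"rank": 5, "cnt": 4, "tt": "The Sound and the Fury", "au": "William Faulkner", "yr": "1987"},
    {"rank": 6, "cnt": 4, "tt": "The Sun Also Rises", "au": "", "yr": "1988"},
    {"rank": 13, "cnt": 4, "tt": "Catch-22", "au": "Joseph Heller", "yr": ""},
]


def select_all(items, field):
    """Plain-Python counterpart of the selector $.[*].<field>."""
    return [item[field] for item in items]


authors = select_all(records, "au")   # $.[*].au
years = select_all(records, "yr")     # $.[*].yr

print(authors)
print(years)
```

In this sketch an empty string in the source still produces one entry per object, so every field yields a list of the same length. When testing your own selectors, it is worth checking that the number of results matches across fields.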

Add Fields

Now that we have tested the JSONPath query selectors for all the fields we want to scrape from the JSON, we need to edit the scraping agent and add these fields, selecting JSON as the field type.

Edit the scraping agent by clicking on the "Edit" tab on the scraping agent page.

  1. Add a new field and give it a name, such as "rank".
  2. Now select the field "Type" as "JSON" and paste your JSONPath query selector in the "JSON Path" box.
  3. Similarly, add the next field "cnt" and enter its JSONPath query selector in the "JSON Path" box.
  4. Add the next field "tt" and enter its JSONPath query selector in the "JSON Path" box.
  5. Add the next field "au" and enter its JSONPath query selector in the "JSON Path" box.
  6. Add the next field "yr" and enter its JSONPath query selector in the "JSON Path" box.
  7. In the same way, we can add any number of fields and enter their JSONPath query selectors, selecting JSON as the field type. This tells Agenty to use the JSON parser to extract those fields.
  8. Now "Save" the scraping agent configuration. (Remember, saving the agent only updates the configuration; we need to re-run the agent by clicking the "Start" button for the changes to show up in the result.)
  9. Change the SOURCE URL to the page with the JSON content, if it is not already set: https://cdn.agenty.com/examples/json/json-example-1.json. Or you can upload a list if you have multiple URLs.
  10. Then re-run your agent to refresh the result with the changes made to the agent configuration.
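
Conceptually, the field list configured above is a mapping from field names to selectors, and each selector contributes one column of the result. The sketch below illustrates that idea in plain Python under loud assumptions: FIELDS, evaluate, and build_rows are hypothetical names, the tiny evaluator only understands the wildcard pattern used in this tutorial, and none of this reflects Agenty's actual configuration format or internals.

```python
import csv
import io
import json
import re

# Hypothetical field configuration: field name -> JSONPath selector.
FIELDS = {
    "rank": "$.[*].rank",
    "cnt": "$.[*].cnt",
    "tt": "$.[*].tt",
    "au": "$.[*].au",
    "yr": "$.[*].yr",
}

# Toy evaluator: accepts only "$.[*].name" (or "$[*].name").
WILDCARD = re.compile(r"^\$\.?\[\*\]\.(\w+)$")


def evaluate(selector, records):
    """Return one value per record for a wildcard selector."""
    match = WILDCARD.match(selector)
    if match is None:
        raise ValueError(f"unsupported selector in this sketch: {selector}")
    name = match.group(1)
    return [record.get(name) for record in records]


def build_rows(fields, records):
    """Evaluate every field, then zip the columns into one row per record."""
    columns = {field: evaluate(selector, records) for field, selector in fields.items()}
    return [dict(zip(columns, values)) for values in zip(*columns.values())]


# First two records from the example page, inlined so the sketch runs offline.
records = json.loads("""[
  {"rank": 1, "cnt": 5, "tt": "The Great Gatsby", "au": "F. Scott Fitzgerald", "yr": "1990"},
  {"rank": 2, "cnt": 5, "tt": "The Grapes of Wrath", "au": "John Steinbeck", "yr": "1972"}
]""")

rows = build_rows(FIELDS, records)

# Write the rows as CSV text, one column per configured field.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(FIELDS))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Keeping the field names and selectors in one mapping mirrors the edit screen: adding a field means adding one more name and selector pair.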

Execution

Once the job has completed, we can see the JSON scraping result in the "Result" tab, and we can add any number of URLs for APIs with a similar structure to scrape data from JSON.

Try it out

We know you want to try it out, so we've uploaded this agent to the scraping agent demos. Log in to your account, go to New Agent, and then open the Sample agents tab.

Now, click on the Get it button to clone the agent into your account.
