Scraping Data from JSON using REGEX

In this tutorial, we will learn how to extract data from JSON pages or APIs using a scraping agent with Agenty's fast Regular Expression extractor. JSON (JavaScript Object Notation) is a lightweight data-interchange format, widely used by websites and APIs to deliver data in a structured way.

JSON Example

For this demo, I have created a JSON example page, https://cdn.agenty.com/examples/json/json-example-1.json, whose content is served as a JSON file, shown below. The content is an array of objects, where each object has 5 properties (rank, cnt, tt, au, yr) and their corresponding values.

[
  {
    "rank": 1,
    "cnt": 5,
    "tt": "The Great Gatsby",
    "au": "F. Scott Fitzgerald",
    "yr": "1990"
  },
  {
    "rank": 2,
    "cnt": 5,
    "tt": "The Grapes of Wrath",
    "au": "John Steinbeck",
    "yr": "1972"
  },
  {
    "rank": 3,
    "cnt": 5,
    "tt": "The Catcher in the Rye",
    "au": "J.D. Salinger",
    "yr": "1993"
  },
  {
    "rank": 4,
    "cnt": 4,
    "tt": "Invisible Man",
    "au": "Ralph Ellison",
    "yr": "1988"
  },
  {
    "rank": 5,
    "cnt": 4,
    "tt": "The Sound and the Fury",
    "au": "William Faulkner",
    "yr": "1987"
  },
  {
    "rank": 6,
    "cnt": 4,
    "tt": "The Sun Also Rises",
    "au": "",
    "yr": "1988"
  },
  {
    "rank": 7,
    "cnt": 4,
    "tt": "Things Fall Apart",
    "au": "Chinua Achebe",
    "yr": "1996"
  },
  {
    "rank": 8,
    "cnt": 4,
    "tt": "Lolita",
    "au": "Vladimir Vladimirovich Nabokov",
    "yr": "1983"
  },
  {
    "rank": 9,
    "cnt": 4,
    "tt": "A Passage to India",
    "au": "E. M. Forster",
    "yr": "1984"
  },
  {
    "rank": 10,
    "cnt": 4,
    "tt": "1984",
    "au": "George Orwell",
    "yr": "1977"
  },
  {
    "rank": 11,
    "cnt": 4,
    "tt": "Beloved",
    "au": "Toni Morrison",
    "yr": "1987"
  },
  {
    "rank": 12,
    "cnt": 4,
    "tt": "Native Son",
    "au": "Richard T. Wright",
    "yr": "1940"
  },
  {
    "rank": 13,
    "cnt": 4,
    "tt": "Catch-22",
    "au": "Joseph Heller",
    "yr": ""
  },
  {
    "rank": 14,
    "cnt": 4,
    "tt": "Go Tell it on the Mountain",
    "au": "James Baldwin",
    "yr": "1954"
  },
  {
    "rank": 15,
    "cnt": 4,
    "tt": "On the Road",
    "au": "Jack Kerouac",
    "yr": "1991"
  },
  {
    "rank": 16,
    "cnt": 3,
    "tt": "ULYSSES",
    "au": "James Joyce",
    "yr": "1961"
  },
  {
    "rank": 17,
    "cnt": 3,
    "tt": "Don Quixote",
    "au": "Miguel de Cervantes",
    "yr": "1982"
  },
  {
    "rank": 18,
    "cnt": 3,
    "tt": "To the Lighthouse",
    "au": "Virginia Woolf",
    "yr": "1982"
  },
  {
    "rank": 19,
    "cnt": 3,
    "tt": "Madame Bovary",
    "au": "Gustave Flaubert",
    "yr": "1998"
  },
  {
    "rank": 20,
    "cnt": 3,
    "tt": "An American Tragedy",
    "au": "Theodore Dreiser",
    "yr": "2000"
  }
]

Since this is not an HTML page, we can't use our Chrome Extension to generate CSS selectors automatically. Instead, we need to create the agent manually and then edit it in Agenty to add or update the fields and URL. So, let's create a placeholder agent from the samples; you can also create one from any website.

Create Agent

  1. Go to agents 
  2. Click on "New Agent"
  3. Pick any of the example agents available in the "Sample agent" section. (Because we are going to edit the URL, selector, fields, etc., any demo agent will do: create one and then edit it as needed.)

Regex

We will use REGEX to extract the individual field values from the JSON objects. You can use any online REGEX tester to build your expressions; I am going to use rubular.com in this example to demonstrate the expression and its instant result. Enter the sample values in the "Your test string" box, and the tool will display the match result and groups as soon as we type our expression.

To write the pattern for our first field, "rank", we use the property name literally and an expression for the value, because the value changes from object to object. Looking at the content, the value is always a number, so we can use (\d+), which captures one or more digits.

The final REGEX for the "rank" field will be: "rank": (\d+),

json regex rank

json group rank

Similarly, the next field, "cnt", is also a number, so we can just use "cnt": (\d+),

json regex cnt

json match cnt
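The two numeric patterns above can also be tried outside an online tester. The following is a minimal sketch in Python, assuming its re module treats these simple patterns the same way rubular.com does; the sample string is a shortened copy of the example JSON, and the variable names are only illustrative.

```python
import re

# A shortened copy of the sample JSON from this tutorial (first two objects).
sample = '''[
  {
    "rank": 1,
    "cnt": 5,
    "tt": "The Great Gatsby",
    "au": "F. Scott Fitzgerald",
    "yr": "1990"
  },
  {
    "rank": 2,
    "cnt": 5,
    "tt": "The Grapes of Wrath",
    "au": "John Steinbeck",
    "yr": "1972"
  }
]'''

# The property name is matched literally; (\d+) captures the digits of the value.
rank_pattern = r'"rank": (\d+),'
cnt_pattern = r'"cnt": (\d+),'

# With a single capture group, re.findall returns group 1 of every match.
ranks = re.findall(rank_pattern, sample)
counts = re.findall(cnt_pattern, sample)

print(ranks)   # ['1', '2']
print(counts)  # ['5', '5']
```

Note that the captured values are strings, even though they are numbers in the JSON.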

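The third field, "tt", is not a number, so we can't use (\d+) here, because (\d+) matches digits only. Instead, we use ([^"]*): [^"] matches any character except a double quote, and * repeats it 0 or more times, so the group captures everything up to the closing quote of the value. (With + in place of *, at least one character is required, so empty values such as "au": "" would not match.) So, to extract the value from "tt", the final expression is: "tt": "([^"]*)",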

json group tt

json group tt
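To see what the quantifier after [^"] changes, here is a small Python sketch run against one object from the sample data. The "au" patterns are written by analogy with the "tt" pattern and are an assumption for illustration, not something taken from the screenshots.

```python
import re

# One object from the sample data, with an empty "au" value.
sample = '''
    "tt": "The Sun Also Rises",
    "au": "",
    "yr": "1988"
'''

# [^"] matches any single character that is not a double quote.
# * allows zero or more of them; + requires at least one.
tt_values = re.findall(r'"tt": "([^"]*)",', sample)
au_star = re.findall(r'"au": "([^"]*)",', sample)
au_plus = re.findall(r'"au": "([^"]+)",', sample)

print(tt_values)  # ['The Sun Also Rises']
print(au_star)    # ['']  (the empty value is still matched)
print(au_plus)    # []    (the empty value is skipped entirely)
```

Which behavior is preferable depends on whether an empty value should produce an empty result or no result at all.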

Similarly, we can build and test the REGEX expressions for the 4th and 5th fields ("au" and "yr") as well.
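The tutorial does not spell out the expressions for "au" and "yr", so the patterns below are assumptions derived from the sample data. Two details matter: "yr" is the last property in each object, so there is no trailing comma to match, and its value is a quoted string rather than a bare number.

```python
import re

# Two objects from the sample data: one with an empty "au", one with an empty "yr".
sample = '''[
  {
    "rank": 6,
    "cnt": 4,
    "tt": "The Sun Also Rises",
    "au": "",
    "yr": "1988"
  },
  {
    "rank": 13,
    "cnt": 4,
    "tt": "Catch-22",
    "au": "Joseph Heller",
    "yr": ""
  }
]'''

# Assumed pattern for "au", by analogy with "tt".
au_pattern = r'"au": "([^"]*)",'
# Assumed pattern for "yr": quoted digits, possibly empty, no trailing comma.
yr_pattern = r'"yr": "(\d*)"'

authors = re.findall(au_pattern, sample)
years = re.findall(yr_pattern, sample)

print(authors)  # ['', 'Joseph Heller']
print(years)    # ['1988', '']
```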

Add Fields

Now we have the REGEX expression and the matching group number for every field we want to scrape from the JSON. Next, we edit the scraping agent and add each field's expression and group index, selecting "REGEX" as the field type.

  1. Edit the scraping agent by clicking on the "Edit" tab on the agent page,
  2. Add a new field and give it a name, as I did in the screenshot below ("nm"),
  3. Now select the "Type" as "REGEX", paste your regular expression in the "Regex Pattern" box, and enter the group number in the "Group Index" box.

    json field rank
  4. Similarly, add the next field, "cnt", and enter its expression in the "Regex Pattern" box and the group number in the "Group Index" box.
    json field cnt
  5. "tt" field
    json field tt
  6. "au" field
    json field au
  7. "yr" field
    json field yr
     
  8. In the same way, we can add any number of fields, each with its REGEX expression and matching group number.
  9. Now "Save" the scraping agent configuration. (Remember, saving the agent only updates the configuration; we need to re-run the agent by clicking the "Start" button for the changes to be reflected in the result.)
  10. Change the SOURCE URL to the page with the JSON content, if not already set: https://cdn.agenty.com/examples/json/json-example-1.json.
  11. Re-run your agent to refresh the result with the updated agent configuration.
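Conceptually, each field configured above pairs a pattern with a group index. The sketch below imitates that idea in Python to make the field setup concrete. It is not Agenty's implementation: the FIELDS dictionary, the extract_rows function, and the "au" and "yr" patterns are hypothetical names and assumptions made for illustration.

```python
import re

# Hypothetical field configuration: name -> (regex pattern, group index),
# mirroring the "Regex Pattern" and "Group Index" boxes.
FIELDS = {
    "rank": (r'"rank": (\d+),', 1),
    "cnt": (r'"cnt": (\d+),', 1),
    "tt": (r'"tt": "([^"]*)",', 1),
    "au": (r'"au": "([^"]*)",', 1),   # assumed pattern
    "yr": (r'"yr": "([^"]*)"', 1),    # assumed pattern, no trailing comma
}

def extract_rows(text, fields=FIELDS):
    """Apply every field pattern to the text and combine the matches into rows."""
    columns = {}
    for name, (pattern, group_index) in fields.items():
        columns[name] = [m.group(group_index) for m in re.finditer(pattern, text)]
    # Rows can only be aligned by position if every field matched equally often.
    if len({len(values) for values in columns.values()}) > 1:
        raise ValueError("fields matched a different number of times")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]

# A shortened copy of the sample JSON (first two objects).
sample = '''[
  {
    "rank": 1,
    "cnt": 5,
    "tt": "The Great Gatsby",
    "au": "F. Scott Fitzgerald",
    "yr": "1990"
  },
  {
    "rank": 2,
    "cnt": 5,
    "tt": "The Grapes of Wrath",
    "au": "John Steinbeck",
    "yr": "1972"
  }
]'''

rows = extract_rows(sample)
print(len(rows))  # 2
print(rows[0])
```

Because the rows are assembled by position, patterns that skip empty values (for example, + instead of *) would shift later values into the wrong row; the length check in this sketch raises an error in that case instead of silently misaligning the data.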

Execution

Once the job is completed, we can see the JSON scraping result in the "Result" tab. We can also add any number of URLs with a similar structure to scrape data from other JSON pages or APIs.

json final result