Web Scraping with Regular Expressions

Whether regular expressions (also called regex or RE) are a good fit for website scraping has long been a topic of debate among developers. Even so, we can't ignore the fact that we sometimes depend on regex to scrape data that CSS selectors or XPath can't reach, e.g. a value embedded in the inline JavaScript of a web page.
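
To illustrate the inline JavaScript case, here is a minimal Python sketch. The HTML fragment, the variable name productConfig, and its fields are hypothetical, made up for this example. The sketch also assumes the assigned value is a flat object that happens to be valid JSON.

```python
import json
import re

# Hypothetical page fragment: the data lives inside a <script> block,
# so CSS selectors or XPath cannot address the individual values.
html = """
<script>
  var productConfig = {"id": 17839, "price": "129.00"};
</script>
"""

# Capture the object literal assigned to the (assumed) variable productConfig.
# The non-greedy .*? stops at the first "};", which is enough for a flat object.
match = re.search(r"var\s+productConfig\s*=\s*(\{.*?\});", html, re.DOTALL)

# Parse the captured text as JSON; None if the pattern did not match.
config = json.loads(match.group(1)) if match else None
print(config)
```

Nested objects or JavaScript that is not valid JSON would need a more careful approach than this pattern.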

In this tutorial we will learn how to use regular expressions to scrape data from HTML web pages, and how to write the correct pattern for each item to be extracted, using some live examples. Each example shows the HTML snippet first, then the regex pattern, and finally the output of each capture group.

Extract Heading (h1) tag

<h1>This is a heading</h1>

<h1>([^<]+)</h1>

Item 1 : This is a heading
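
A minimal Python sketch of how such a pattern might be applied with the standard re module. The pattern used here, <h1>([^<]+)</h1>, is an assumption written in the same style as the other patterns in this tutorial.

```python
import re

html = "<h1>This is a heading</h1>"

# Assumed pattern: capture everything between the h1 tags
# up to (but not including) the next "<".
pattern = r"<h1>([^<]+)</h1>"

match = re.search(pattern, html)
# group(1) is the first capture group, i.e. "Item 1".
heading = match.group(1) if match else None
print("Item 1 :", heading)
```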

Extract Hyperlink from (A) tag

<a href="https://www.agenty.com">Typical Website Link</a>

<a href="([^"]+)">Typical Website Link</a>

Item 1 : https://www.agenty.com

Extract Hyperlink and Anchor Text from (A) tag

<a href="https://www.agenty.com">Typical Website Link</a>

<a href="([^"]+)">([^<]+)</a>

Item 1 : https://www.agenty.com
Item 2 : Typical Website Link
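
When a page contains several links, the same pattern can be applied repeatedly. The sketch below uses Python's re.findall, which returns one tuple per match with one element per capture group. The second link in the sample HTML is hypothetical and was added only to show more than one match.

```python
import re

# The first link is the tutorial's sample; the second is a made-up addition.
html = (
    '<a href="https://www.agenty.com">Typical Website Link</a>'
    '<a href="https://example.com/page">Another Link</a>'
)

pattern = r'<a href="([^"]+)">([^<]+)</a>'

# Each element of links is a (href, anchor_text) tuple.
links = re.findall(pattern, html)
for url, text in links:
    print(url, "->", text)
```

Note that this pattern only matches anchors whose first and only attribute is href; tags with extra attributes would need a looser pattern.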

Extract Image alt text and source from (IMG) tag

<img alt="screen scraping" src="https://cdn.agenty.com/images/create-a-web-scraping-agent.jpg">

<img alt="([^"]+)" src="([^"]+)"\s*/?>

Item 1 : screen scraping
Item 2 : https://cdn.agenty.com/images/create-a-web-scraping-agent.jpg
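
The sample tag above ends with > rather than />, so a pattern that tolerates both endings is safer. A minimal Python sketch, assuming the attributes always appear in the order alt, src:

```python
import re

html = '<img alt="screen scraping" src="https://cdn.agenty.com/images/create-a-web-scraping-agent.jpg">'

# \s*/?> accepts ">", "/>" and " />" as the end of the tag.
pattern = r'<img alt="([^"]+)" src="([^"]+)"\s*/?>'

match = re.search(pattern, html)
# groups() returns all capture groups as a tuple: (alt, src).
alt, src = match.groups() if match else (None, None)
print("Item 1 :", alt)
print("Item 2 :", src)
```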

Extract data attribute and price from (DIV) tag

<div data-id="17839" data-availability="InStock">USD 129.00</div>

<div data-id="(\d+)" data-availability="(\w+)">USD\s*([^<]+)</div>

Item 1 : 17839
Item 2 : InStock
Item 3 : 129.00
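
With three or more groups, numbered items become harder to track. One option, sketched here in Python, is to use named groups. The group names id, availability, and price are chosen for this example, and the price group is narrowed to digits and dots so it can be converted to a number.

```python
import re

html = '<div data-id="17839" data-availability="InStock">USD 129.00</div>'

# (?P<name>...) gives each capture group a name instead of a number.
pattern = re.compile(
    r'<div data-id="(?P<id>\d+)" '
    r'data-availability="(?P<availability>\w+)">'
    r'USD\s*(?P<price>[\d.]+)</div>'
)

match = pattern.search(html)
# groupdict() maps each group name to its captured text.
item = match.groupdict() if match else {}
if item:
    item["price"] = float(item["price"])  # convert the price text to a number
print(item)
```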

Extract text from (STRONG) tag

<strong>My Bold text</strong>

<strong>([^<]+)</strong>

Item 1 : My Bold text

Extract text from (span) tag with some CSS class

<span class="some-css-class">My Favorite Data</span>

<span class="some-css-class">([^<]+)</span>

Item 1 : My Favorite Data

Extract META description content value

<meta name="description" content="the SEO description of web page in heading section" />

<meta name="description" content="([^"]+)" />

Item 1 : the SEO description of web page in heading section
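
Real pages vary in letter case and spacing, so a slightly looser variant of the pattern can help. The Python sketch below relaxes the whitespace and makes the match case-insensitive. It still assumes the name attribute comes before content and that both use double quotes.

```python
import re

html = '<meta name="description" content="the SEO description of web page in heading section" />'

# \s+ tolerates extra whitespace between attributes; \s*/?> accepts ">" or "/>".
pattern = r'<meta\s+name="description"\s+content="([^"]+)"\s*/?>'

# re.IGNORECASE also matches variants such as <META NAME="Description" ...>.
match = re.search(pattern, html, re.IGNORECASE)
description = match.group(1) if match else None
print("Item 1 :", description)
```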

