Scraping iframes with Puppeteer

Scraping data from iframes can be tricky because you need to make sure the iframe has loaded completely before you read from it. If the iframe is not part of the initial HTML document but is injected by JavaScript or another technique, your Puppeteer script should wait until the iframe's dynamic content has rendered before running the scraping function.

In this article I will show you how to scrape data from iframes using Agenty’s Puppeteer API: wait for the iframe to load, then extract the data once a chosen selector appears inside the iframe.

What is an iframe

An iframe is an HTML page embedded inside another page of a website. It is defined with the <iframe> tag in HTML and is mostly used to render external content on a website.

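An example iframe element looks like <iframe id="iframeResult" name="iframeResult"></iframe> in the page markup (other attributes omitted). This is the iframe targeted by the code in this article.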

Scraping an iframe

To scrape data from a page that contains an iframe, we need to wait for the iframe to render on the page and then let the scraping agent extract the data we want from the iframe.

  • Navigate to the page using page.goto()
  • Find the iframe by name() or url()
  • Wait for a selector inside the iframe using frame.waitForSelector('selector here') to make sure the iframe has loaded
  • Extract data, capture a screenshot, etc.

// Read the `url` from the request, go to the page, wait for the iframe, capture a screenshot and return the results

module.exports = async ({ page, request }) => {
    const response = await page.goto(request.url, {
        waitUntil: 'networkidle2',
        timeout: 30000
    });  
    
    // Find the iframe by name() or url()
    const frame = page.frames().find(f => f.name() === 'iframeResult');

    if (frame) {
        // Wait for h2 inside the iframe
        await frame.waitForSelector('h2');
    } else {
        console.log("iFrame not found");
    }
    
    console.log(`statusCode : ${response.status()}`);
    
    // Capture the screenshot
    await page.screenshot({path : 'iframe-screenshot.png'});
    
    return {
        data : {},
        type : 'application/json'
    };   
};

Try the code here - https://chrome.agenty.com
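
The example above looks the frame up once, right after page.goto() resolves. If the iframe is injected by JavaScript later, page.frames() may not list it yet. Below is a minimal sketch of a polling helper for that case. It assumes only Puppeteer's page.frames(), frame.name() and frame.url(); the helper name pollForFrame, its timeout and interval options, and the '/embed/' URL fragment in the usage comment are hypothetical and not part of Agenty's API.

```javascript
// Hypothetical helper: poll page.frames() until a frame matches `predicate`.
// Resolves with the matching frame, or with null if `timeout` ms pass first.
async function pollForFrame(page, predicate, { timeout = 10000, interval = 250 } = {}) {
    const deadline = Date.now() + timeout;
    while (true) {
        // page.frames() returns every frame currently attached to the page
        const frame = page.frames().find(predicate);
        if (frame) {
            return frame;
        }
        if (Date.now() >= deadline) {
            return null;
        }
        // Sleep briefly before checking again
        await new Promise(resolve => setTimeout(resolve, interval));
    }
}

// Usage inside the agent function, matching by name or by part of the URL:
// const frame = await pollForFrame(page, f => f.name() === 'iframeResult');
// const frame = await pollForFrame(page, f => f.url().includes('/embed/'));
```

Returning null on timeout, rather than throwing, keeps the same "if (frame) ... else ..." pattern used in the example above.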

Find iframe by CSS selector

Sometimes the iframe has no name or fixed URL defined in the HTML, so it cannot be found with the page.frames().find() method. In that case, we can select the iframe element with a CSS selector and call contentFrame() on the resulting elementHandle.

For example, this code works the same as the example above, but it uses the ID CSS selector #iframeResult to find the element instead of the name.

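// Find the <iframe> element by its CSS selector
const elementHandle = await page.$('#iframeResult');
// Get the frame that holds the element's content
const frame = await elementHandle.contentFrame();
// Wait for h2 inside the iframe
await frame.waitForSelector('h2');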
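
The snippet above stops once the h2 is present and assumes the iframe element exists. As a rough sketch of the remaining "extract data" step, the helper below adds null checks and reads the element's text with frame.$eval(). The helper name readTextFromIframe and the `heading` field in the usage comment are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper: locate an iframe by CSS selector, wait for an element
// inside it, and return that element's trimmed text (or null if no iframe).
async function readTextFromIframe(page, iframeSelector, innerSelector) {
    // page.$() resolves to null when nothing matches the selector
    const elementHandle = await page.$(iframeSelector);
    if (!elementHandle) {
        return null;
    }
    // contentFrame() resolves to null when the element is not an iframe
    const frame = await elementHandle.contentFrame();
    if (!frame) {
        return null;
    }
    // Wait until the target element exists inside the iframe
    await frame.waitForSelector(innerSelector);
    // Run the callback inside the iframe and return its result
    return frame.$eval(innerSelector, el => el.textContent.trim());
}

// Usage inside the agent function:
// const heading = await readTextFromIframe(page, '#iframeResult', 'h2');
// return { data: { heading }, type: 'application/json' };
```

Note that frame.waitForSelector() still throws if the inner selector never appears within its default timeout, so callers may want to wrap the call in try/catch.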