Crawling a Password-Protected Website with Data Scraping Studio

Data Scraping Studio and the Agenty hosted app use the same login engine, so in most cases you can run the same scraping agent in either the desktop app or the hosted app. If you already have an agent online that crawls a password-protected site, you can download it as a *.scraping file using the "Download Scraping Agent" feature available on agent pages. Otherwise, follow the steps below to enable scraping data from behind a login in the desktop app:

  • Edit the scraping agent from the agent explorer tree in Data Scraping Studio.
  • Go to the "Advance settings" tab.
  • Click the enable check box.

password protected site crawling

Click the "Add" button to add the login events to the scraping agent one by one.

add login commands for scraping

The login events must be in order. For example, we can't click the Submit button before navigating to the login page.

Login to website for crawling
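
To make the ordering rule concrete, here is a minimal Python sketch that checks a list of login events before they are run. It is not how Data Scraping Studio stores or validates events. The `LoginEvent` class, the action names, the selectors and the example URL are all hypothetical, and the check itself is deliberately simplified.

```python
from dataclasses import dataclass


@dataclass
class LoginEvent:
    # Hypothetical action names: "navigate", "type", "wait", "click", "submit"
    action: str
    target: str = ""  # URL for "navigate", element selector otherwise
    value: str = ""   # text to type, or number of seconds to wait


def check_event_order(events):
    """Return a list of ordering problems (an empty list means the order looks fine)."""
    problems = []
    if not events or events[0].action != "navigate":
        problems.append("first event must navigate to the login page")
    typed = False
    for i, ev in enumerate(events):
        if ev.action == "type":
            typed = True
        elif ev.action in ("click", "submit") and not typed:
            problems.append(f"event {i} submits before any credentials are typed")
    return problems


# Placeholder URL and selectors, for illustration only
good = [
    LoginEvent("navigate", "https://example.com/login"),
    LoginEvent("type", "#username", "alice"),
    LoginEvent("type", "#password", "secret"),
    LoginEvent("wait", value="2"),
    LoginEvent("click", "#submit"),
]
bad = [
    LoginEvent("click", "#submit"),
    LoginEvent("navigate", "https://example.com/login"),
]

print(check_event_order(good))  # no problems reported
print(check_event_order(bad))   # two problems reported
```

A real login page may need a click before typing (for example, to open a login dialog), so treat the second rule as an illustration rather than a general check.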

Now save the scraping agent and re-run it. The agent will log in to the website first and then start scraping the internal URLs from the input file, since it is set up to extract data from internal pages.

website login successfully for crawling

If you look closely at the logs window in the screenshot above, Data Scraping Studio executes all the events provided in the scraping agent in the same order to log in successfully. Once logged in, it maintains the cookie/session for scraping all internal pages until the agent completes or the application is closed. So the interaction events must be in the same order a human would follow when logging in:

- Navigate to page
- Type username and password
- Wait a few seconds
- Click the submit button or submit the form
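
For readers who want to see the same sequence outside the desktop app, the steps above can be sketched as a plain HTTP analogue using only the Python standard library. This is an assumption-laden illustration, not the Data Scraping Studio login engine: the form field names `username` and `password` are guesses, the site is assumed to accept a simple form POST at the login URL, and the helper names are made up for this example.

```python
import http.cookiejar
import time
import urllib.parse
import urllib.request


def make_session():
    """Create an opener that stores and resends cookies, like a browser session."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_form(username, password):
    """Encode the credentials as a form body (the field names are assumptions)."""
    fields = {"username": username, "password": password}
    return urllib.parse.urlencode(fields).encode("utf-8")


def login(opener, login_url, username, password, pause=2.0):
    """Run the login steps in order and return the final URL after redirects."""
    opener.open(login_url).close()          # 1. navigate to the login page
    body = encode_form(username, password)  # 2. fill in username and password
    time.sleep(pause)                       # 3. wait a few seconds
    with opener.open(login_url, data=body) as resp:  # 4. submit the form (POST)
        return resp.geturl()
```

Because every later request goes through the same `opener`, the cookies set during login are sent again when fetching internal pages, which mirrors the cookie/session behaviour described above. Sites that rely on JavaScript or hidden form tokens will need more than this sketch.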

To verify that we've logged in successfully, I'm using the default RESPONSE_URL field here, because the website redirects to the login page if authentication fails. We can also always go to the network tab to see the HTML response from the server.

HTML response of server after login
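
The RESPONSE_URL check described above can be expressed as a small comparison: if the final URL is still the login page, authentication most likely failed. The sketch below is a hedged illustration in Python. The function name `login_succeeded` and the example URLs are hypothetical, and it assumes the site signals failure by redirecting back to the login page.

```python
from urllib.parse import urlsplit


def login_succeeded(response_url, login_url):
    """Return True if the final URL is not the login page."""
    final, login = urlsplit(response_url), urlsplit(login_url)
    same_page = (
        final.netloc.lower() == login.netloc.lower()
        and final.path.rstrip("/") == login.path.rstrip("/")
    )
    return not same_page


# Placeholder URLs, for illustration only
print(login_succeeded("https://example.com/account", "https://example.com/login"))        # True
print(login_succeeded("https://example.com/login?error=1", "https://example.com/login"))  # False
```

The comparison ignores the query string, so a redirect such as `/login?error=1` still counts as the login page.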

