Advanced tips for scraping password protected websites

Could you give us some more information about scraping password protected websites? I am using the Desktop application to scrape a password protected website and have followed this tutorial.

I can set up the scraping job correctly with the username, the password, and a click on the login button on the desired page. However, the target website seems to have some security measure or setup that requires cookies, JavaScript, or some sort of session to be present. I have tried using the tool to log in to several websites, and some of the errors I get back in the built-in response variables are things like:

"Your session has expired or you are not authorized to access this page."

or the login page on the target website simply fails to log in and returns me to the login screen.

I note that you have options in the Advanced Setting section such as "Fetch pages via browser". I have played around with these features but haven't been able to get anything working for password protected websites.

Posted by Mathew 1 year ago

As discussed over personal support, this issue was caused by a minor bug in the Data Scraping Studio desktop app that prevented AJAX from being enabled on password protected websites. It is fixed in version 2.0, and the working agent has been sent to your email address.

Data Scraping Studio uses a browser instance to crawl password protected websites, and all cookies and sessions are maintained on that instance by default, so you don't need to add or send anything extra.
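To illustrate why keeping cookies on a single instance matters, here is a minimal Python sketch using only the standard library. It is not how Data Scraping Studio is implemented. The tiny local `DemoSite` server, the `/login` and `/members` paths, the `sid` cookie, and the demo credentials are all hypothetical stand-ins for a real password protected site.

```python
import http.cookiejar
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

EXPIRED = "Your session has expired or you are not authorized to access this page."


class DemoSite(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a password protected website."""

    def do_POST(self):
        # Treat every POST as a login attempt and set a session cookie on success.
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode())
        ok = form.get("username") == ["demo"] and form.get("password") == ["secret"]
        self.send_response(200)
        if ok:
            self.send_header("Set-Cookie", "sid=abc123; Path=/")
        self.end_headers()
        self.wfile.write(b"welcome" if ok else b"login failed")

    def do_GET(self):
        # Protected pages are served only when the session cookie comes back.
        authorized = "sid=abc123" in self.headers.get("Cookie", "")
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"member data" if authorized else EXPIRED.encode())

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(opener, url, data=None, timeout=10):
    with opener.open(url, data=data, timeout=timeout) as resp:
        return resp.read().decode()


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoSite)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    creds = urllib.parse.urlencode({"username": "demo", "password": "secret"}).encode()
    try:
        # Client without a cookie jar: the login succeeds but is forgotten.
        plain = urllib.request.build_opener()
        fetch(plain, base + "/login", creds)
        without_jar = fetch(plain, base + "/members")

        # Client with a cookie jar: the session cookie is sent on later requests.
        jar = http.cookiejar.CookieJar()
        session = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
        fetch(session, base + "/login", creds)
        with_jar = fetch(session, base + "/members")
    finally:
        server.shutdown()
        server.server_close()
    return without_jar, with_jar


if __name__ == "__main__":
    for label, body in zip(("without cookie jar", "with cookie jar"), run_demo()):
        print(f"{label}: {body}")
```

The client without a cookie jar gets the "session has expired" message even after a successful login, which is the same symptom described in the question. A real site may additionally depend on JavaScript, which this sketch does not cover.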

Here are the tips.

  • Use the website's dedicated login page instead of a dialog box or popup login wherever possible. You can usually find it by logging in and then logging out, since most websites redirect you to their dedicated login page after logout.
  • Add a 10 second wait as the last of your login events.
  • If the website requires AJAX or JavaScript, go to Advanced Setting > Web Browser and enable fetching pages with the AJAX/JavaScript enabled browser.
  • Enable the JavaScript browser only if you need it, because it may slow down crawling: the extractor waits for the complete page to load, including the internal and external JavaScript on each web page. You might also need to increase your timeout setting if the website takes longer than usual to load. (Go to Advanced Setting > Headers and increase the connection timeout to something higher than the default 6 seconds.)
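The tips above amount to a simple escalation rule, sketched below in Python. This is only an illustration under stated assumptions: `CrawlSettings` is a hypothetical record, not the tool's real configuration format, and the 30 second timeout and the doubling step are arbitrary choices. Only the 10 second wait and the 6 second default come from this thread.

```python
from dataclasses import dataclass, replace

# Phrases from the error quoted in this thread, plus a generic sign that the
# login form was served again (the password field marker is an assumption).
FAILURE_MARKERS = (
    "your session has expired",
    "not authorized to access this page",
    'type="password"',
)


@dataclass(frozen=True)
class CrawlSettings:
    """Hypothetical settings record used only for this illustration."""
    post_login_wait_s: float = 10.0  # wait after the last login event
    use_js_browser: bool = False     # off by default because it slows crawling
    timeout_s: float = 6.0           # default connection timeout from the tips


def login_failed(html: str) -> bool:
    """Return True if the page looks like a rejected or expired login."""
    text = html.lower()
    return any(marker in text for marker in FAILURE_MARKERS)


def escalate(settings: CrawlSettings) -> CrawlSettings:
    """Pick the next settings to try after a failed login."""
    if not settings.use_js_browser:
        # First step: turn on the JavaScript browser and allow more load time.
        return replace(settings, use_js_browser=True,
                       timeout_s=max(settings.timeout_s, 30.0))
    # Already using the JavaScript browser: just give the site more time.
    return replace(settings, timeout_s=settings.timeout_s * 2)
```

For example, if `login_failed(page_html)` is true for the first page fetched after login, `escalate(CrawlSettings())` returns settings with the JavaScript browser enabled and a longer timeout, so the slower mode is used only when the plain fetch does not work.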

advanced password protected website crawling

Posted by anonymous 1 year ago

Topic closed. This question is closed and no longer accepts posts.
