Download Images, PDF Files in Web Scraping

The download feature in Agenty allows you to download product images, PDFs, screenshots, or other documents from the web using a data scraping agent, and then upload them to your S3 bucket automatically.

All paid customers can now extract images and documents and download them to their S3 bucket, along with the extracted web data in CSV or JSON format.

Prerequisites

  1. An S3 bucket in any region.
  2. A scraping agent with a field that contains a valid hyperlink to the image, PDF, screenshot, SWF, or other file to download.

For example, I have this scraping agent with a ProductImage column, which holds a valid image URL scraped from the SRC attribute. The image can be downloaded from that URL.

Download image and pdf with web scraping

If you do not know how to extract images, see this forum page: https://forum.agenty.com/t/can-i-extract-images-from-website/24

Download options

  1. Click on the edit tab to change the agent settings.
  2. Scroll down to the field that contains the file to download.
  3. Click on the add post processing function button.
  4. Select the DownloadToS3 function and enter your S3 details, as shown in this screenshot.

s3 settings to upload documents

Note: The download URL must be a valid, full HTTP or HTTPS URL including the domain, not a relative file path. If the URL is relative on the website you are scraping, you may use the insert function, placed before the Download function, to convert the relative URL into a full URL by prepending the domain.
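To check locally what a full URL should look like before configuring the agent, here is a minimal Python sketch using only the standard library. It is not Agenty's insert function; the page URL, the relative path and the helper names `to_absolute` and `is_downloadable` are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse

# Assumed page URL for illustration; use the page your agent actually scrapes.
PAGE_URL = "http://books.toscrape.com/catalogue/page-1.html"


def to_absolute(src: str, page_url: str = PAGE_URL) -> str:
    """Resolve a scraped src/href value against the page it came from."""
    return urljoin(page_url, src)


def is_downloadable(url: str) -> bool:
    """True when the URL has an http(s) scheme and a domain."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical relative value, as it might appear in an SRC attribute.
relative = "../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"
absolute = to_absolute(relative)
print(absolute)
```

A value that fails `is_downloadable` is the kind of relative path that needs the domain added before the download step.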

Now run your scraping agent, and it will download the images to your S3 bucket automatically while the web scraping job runs on the cloud server.
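The sample log entry in the Logs section suggests that the file keeps the last segment of its URL as its name and is stored under the configured S3 path. The sketch below reproduces that naming so you can predict where a file should land. It is an assumption based on that one log line, not Agenty's documented behaviour, and `s3_key_for` is a hypothetical helper.

```python
import posixpath
from urllib.parse import unquote, urlparse


def s3_key_for(url: str, s3_path: str) -> str:
    """Build the expected S3 key: <s3_path>/<file name taken from the URL>."""
    name = posixpath.basename(unquote(urlparse(url).path))
    if not name:
        raise ValueError(f"URL has no file name: {url}")
    # Strip slashes so "data/images" and "data/images/" give the same key.
    return posixpath.join(s3_path.strip("/"), name)


key = s3_key_for(
    "http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg",
    "data/images/",
)
print(key)
```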

Logs

Click on the Logs tab on your agent page to see the complete trace logs, with details about each image (or file) downloaded, what it is named, and where it was saved in your S3 bucket.

2019-07-25 13:05:15.9668	TRACE	Download success - url: http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg; name: 2cdad67c44b002e7ead0cc35693c0e8b.jpg; s3path: data/images/2cdad67c44b002e7ead0cc35693c0e8b.jpg

image download logs
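If you copy the logs out of the Logs tab and want to list the downloaded files, a line like the one above can be split into its fields. This is a rough sketch that assumes every success line follows the same layout as the sample. The pattern and the `parse_download_log` helper are my own and not part of Agenty.

```python
import re

# Layout assumed from the sample line: timestamp, level, then the message.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+ \S+)\s+(?P<level>\w+)\s+Download success - "
    r"url: (?P<url>[^;]+); name: (?P<name>[^;]+); s3path: (?P<s3path>\S+)$"
)


def parse_download_log(line: str):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


line = (
    "2019-07-25 13:05:15.9668\tTRACE\tDownload success - "
    "url: http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg; "
    "name: 2cdad67c44b002e7ead0cc35693c0e8b.jpg; "
    "s3path: data/images/2cdad67c44b002e7ead0cc35693c0e8b.jpg"
)
record = parse_download_log(line)
print(record["name"], "->", record["s3path"])
```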

Result

To check your downloaded images, log in to your S3 account and go to the bucket > path to find all images.

I am using the free S3 browser software in this screenshot to view my images. You can also bulk download these images (or files) to a local directory using the Download option and then selecting the folder.

images download to s3

Additional parameters

  • You may also use dynamic parameters to download your files to a dynamically generated path on S3. For example, to download the images into a job_id folder inside data/images/, you can use dynamic parameters to generate the path at run time, as described in the documentation.
  • Using data/images/{{job_id}}/ in the s3Path variable replaces {{job_id}} with the unique job id of the scraping job, 106423 in this example.
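To picture the substitution, here is a small Python sketch of how a {{job_id}} placeholder could be expanded. It only imitates the behaviour described above. The `render_s3_path` helper and the regular expression are hypothetical, and Agenty's own parameter handling may differ.

```python
import re

# Matches placeholders such as {{job_id}}, allowing spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_s3_path(template: str, params: dict) -> str:
    """Replace each {{name}} in the template with the matching value."""
    def replace(match):
        key = match.group(1)
        if key not in params:
            raise KeyError(f"No value for dynamic parameter: {key}")
        return str(params[key])

    return PLACEHOLDER.sub(replace, template)


path = render_s3_path("data/images/{{job_id}}/", {"job_id": 106423})
print(path)
```

Raising an error on an unknown parameter is a design choice in this sketch, so that a mistyped name does not silently end up as a literal folder name.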