While checking out my great stats on Lijit recently, I started to see a pattern. A large part of my Re-Search (search engine) traffic was coming from a post I did about reading and writing files in Python back in 2005. In an effort to shamelessly attract more traffic on this topic, I have decided to flesh that post out a bit.
A common task that I run into, both in my work life and in my personal life, is programmatically downloading content from the interwebs. This little code example illustrates how to use urllib2 to download a file and save its contents locally. You may be saying to yourself, “Self, can’t I do this in my favorite web browser?” The answer is “YES”, but it’s a pain in the ass if you have more than 5 files you want to download.
Assume there is a set of images on your favorite website, and they are all named image1.jpg, image2.jpg, image3.jpg, etc. Now imagine there are 50 images using this naming convention. How do you download them all using Python, without struggling to do it one image at a time in your browser? Look below!
# Let's create a function that downloads a file and saves it locally.
# This function accepts a file name, a read/write mode (binary or text),
# and the base url.
def stealStuff(file_name, file_mode, base_url):
    from urllib2 import Request, urlopen, URLError, HTTPError

    # create the url and the request
    url = base_url + file_name
    req = Request(url)

    # Open the url
    try:
        f = urlopen(req)
        print "downloading " + url

        # Open our local file for writing
        local_file = open(file_name, "w" + file_mode)
        # Write to our local file
        local_file.write(f.read())
        local_file.close()

    # handle errors
    except HTTPError, e:
        print "HTTP Error:", e.code, url
    except URLError, e:
        print "URL Error:", e.reason, url

# Set the range of images to 1-50. It says 51 because the
# range function never gets to the endpoint.
image_range = range(1, 51)

# Iterate over image range
for index in image_range:
    base_url = 'http://www.techniqal.com/'
    # create file name based on known pattern (image1.jpg, image2.jpg, ...)
    file_name = "image" + str(index) + ".jpg"
    # Now download the image. If these were text files,
    # or other ascii types, just pass an empty string
    # for the second param ala stealStuff(file_name,'',base_url)
    stealStuff(file_name, "b", base_url)
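The listing above is Python 2 (urllib2 and print statements). On Python 3 the same idea might look roughly like the sketch below. It assumes the image1.jpg naming pattern described earlier. The names download_file and download_range and the dest_dir parameter are made up for this example, and it always writes in binary mode instead of taking a mode argument.

```python
import os
import shutil
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def download_file(base_url, file_name, dest_dir="."):
    """Fetch base_url + file_name and save it under dest_dir.

    Returns True on success and False if the request failed.
    (Hypothetical helper, not part of the original script.)
    """
    url = base_url + file_name
    try:
        # The response is closed automatically when the with-block exits.
        with urlopen(url) as response:
            print("downloading " + url)
            local_path = os.path.join(dest_dir, file_name)
            # Always write bytes; that is safe for images and text alike.
            with open(local_path, "wb") as local_file:
                shutil.copyfileobj(response, local_file)
        return True
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print("HTTP Error:", e.code, url)
    except URLError as e:
        print("URL Error:", e.reason, url)
    return False


def download_range(base_url, start, stop, dest_dir="."):
    """Try image<start>.jpg up to image<stop - 1>.jpg; return the names that failed."""
    failed = []
    for index in range(start, stop):
        # Assumed naming pattern: image1.jpg, image2.jpg, ...
        file_name = "image" + str(index) + ".jpg"
        if not download_file(base_url, file_name, dest_dir):
            failed.append(file_name)
    return failed
```

Calling download_range('http://www.techniqal.com/', 1, 51) would mirror the loop in the original listing, and the returned list tells you which files are worth retrying.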
That’s it. Not only does it report any errors it encounters while downloading, but think of all the time you just saved… Really though, how important is your time to you if you’re reading this blog?

