Jul 31, 2008
 

Update: Looking for how to download files using Python 3 and urllib? Check out my post here.

While checking out my great stats on Lijit recently, I started to see a pattern. A large part of my Re-Search (search engine) traffic was coming from a post I did about reading and writing to files in Python back in 2005. In an effort to shamelessly attract more traffic on this topic, I have decided to flesh that post out a bit.

A common task that I run into, both in my work life and my personal life, is programmatically downloading content from the interwebs. This little code example illustrates how to use urllib2 to download a file and save its contents locally. You may be saying to yourself, “Self, can’t I do this in my favorite web browser?” The answer is “YES”, but it’s a pain in the ass if you have more than 5 files you want to download.

Assume there is a set of images on your favorite website, and they are all named image1.jpg, image2.jpg, image3.jpg, etc. Now imagine there are 50 images using this naming convention. How do you download them all using Python, without struggling to do it one image at a time in your browser? Look below!


# Let's create a function that downloads a file, and saves it locally.
# This function accepts a file name, a read/write mode(binary or text),
# and the base url.

def stealStuff(file_name,file_mode,base_url):
	from urllib2 import Request, urlopen, URLError, HTTPError
	
	#create the url and the request
	url = base_url + file_name
	req = Request(url)
	
	# Open the url
	try:
		f = urlopen(req)
		print "downloading " + url
		
		# Open our local file for writing
		local_file = open(file_name, "w" + file_mode)
		#Write to our local file
		local_file.write(f.read())
		local_file.close()
		
	#handle errors
	except HTTPError, e:
		print "HTTP Error:",e.code , url
	except URLError, e:
		print "URL Error:",e.reason , url


# Set the range of images to 1-50. It says 51 because
# range() stops just before its endpoint.
image_range = range(1,51)

# Iterate over image range
for index in image_range:
	
	base_url = 'http://www.techniqal.com/'
	#create file name based on known pattern 
	file_name = "image" + str(index) + ".jpg"
	# Now download the image. If these were text files, 
	# or other ascii types, just pass an empty string 
	# for the second param ala stealStuff(file_name,'',base_url)
	stealStuff(file_name,"b",base_url)

That’s it. It reports any errors it encountered while downloading, and think of all of the time you just saved… Really though, how important is your time to you if you’re reading this blog???
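
For readers on Python 3, where urllib2 was folded into urllib.request and urllib.error, here is a minimal sketch of the same idea. The name download_file and the dest_dir parameter are made up for illustration and are not part of the script above. The file is always written in binary mode, since urlopen hands back bytes.

```python
import os
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def download_file(file_name, base_url, dest_dir="."):
    """Fetch base_url + file_name and save it under dest_dir.

    Returns True on success and False if the request failed.
    """
    url = base_url + file_name
    try:
        # The response is closed automatically when the block exits
        with urlopen(url) as response:
            data = response.read()
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it has to be caught first
        print("HTTP Error:", e.code, url)
        return False
    except URLError as e:
        print("URL Error:", e.reason, url)
        return False

    # urlopen returns bytes, so write in binary mode
    with open(os.path.join(dest_dir, file_name), "wb") as local_file:
        local_file.write(data)
    print("downloaded " + url)
    return True


# Hypothetical usage, following the image naming pattern from the post:
# for index in range(1, 51):
#     download_file("image%d.jpg" % index, "http://www.techniqal.com/")
```

Returning True or False instead of only printing is a design choice of this sketch; it lets a calling loop count failures or retry them later.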

 
 Posted by at 4:01 pm

  18 Responses to “Python File Read Write with Urllib2”

  1. It works! I found your site googling for “urllib2 loop read”. Thanks for the script. Won't it have problems with large files?

  2. Hey daonb,

    The only reason I could see for an issue with large files is if the socket connection times out.
    If you see this happening, check out the socket module. You can import it within this function and set temporary timeout settings.
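
To make the timeout suggestion concrete, here is a small Python 3 sketch (urllib.request rather than urllib2). It passes a per-request timeout to urlopen and copies the response to disk in chunks, so a large file is never held in memory all at once. The function name download_large_file, the 30-second default, and the 64 KB chunk size are illustrative assumptions, not values from the post.

```python
import shutil
import socket
from urllib.error import URLError
from urllib.request import urlopen


def download_large_file(url, local_path, timeout=30, chunk_size=64 * 1024):
    """Stream url to local_path; return True on success, False on failure."""
    try:
        # timeout applies to blocking socket operations such as connect and read
        with urlopen(url, timeout=timeout) as response, \
                open(local_path, "wb") as local_file:
            # Copy chunk_size bytes at a time instead of one big read()
            shutil.copyfileobj(response, local_file, chunk_size)
    except socket.timeout:
        # Raised when a read stalls partway through the download
        print("Timed out:", url)
        return False
    except URLError as e:
        # Covers HTTPError too; connection failures are usually wrapped here
        print("URL Error:", e.reason, url)
        return False
    return True
```

If a read times out partway through, a partial file is left at local_path, so a caller may want to delete it before retrying. Calling socket.setdefaulttimeout() instead would change the default for every new socket in the process, which is why this sketch prefers the per-request argument.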

  3. Thanks a lot, it works! Now, let's steal some stuff !!!!! XD

  4. Thanks very much for this script. I’ve downloaded the google logo based on your example; now I just need to tweak your example a bit to download a series of PDF documents instead of images.

  5. Works great. Thank you.

  6. Thanks, how does this work with basic authentication?

  7. You need to install a basic auth handler for urllib2. There are 2 methods to do that.
    1. Detect a 401 error, and send back the appropriate headers.
    2. If you know the uri you are logging into you can install it with the appropriate credentials before you call urlopen.

    Great example here from http://docs.python.org/library/urllib2.html:
    import urllib2
    # Create an OpenerDirector with support for Basic HTTP Authentication...
    auth_handler = urllib2.HTTPBasicAuthHandler()
    auth_handler.add_password(realm='PDQ Application',
                              uri='https://mahler:8092/site-updates.py',
                              user='klem',
                              passwd='kadidd!ehopper')
    opener = urllib2.build_opener(auth_handler)
    # ...and install it globally so it can be used with urlopen.
    urllib2.install_opener(opener)
    urllib2.urlopen('http://www.example.com/login.html')
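
The first method (detect the 401, then send the headers yourself) can be sketched as well. This is a Python 3 version (urllib.request rather than urllib2) that tries the request anonymously and, on a 401, retries once with a hand-built Authorization header. The helper names basic_auth_header and open_with_basic_auth are hypothetical, and the sketch assumes the server wants the Basic scheme; it does not inspect the WWW-Authenticate header to confirm that.

```python
import base64
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")


def open_with_basic_auth(url, user, password):
    """Open url, retrying once with credentials if the server answers 401."""
    try:
        return urlopen(url)
    except HTTPError as e:
        if e.code != 401:
            # Anything other than "authentication required" is not ours to fix
            raise
        # Resend the same request with the credentials attached
        headers = {"Authorization": basic_auth_header(user, password)}
        return urlopen(Request(url, headers=headers))
```

Basic credentials are only base64-encoded, not encrypted, so this is best kept to https URLs.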

  8. Whoah daddy, thanks for this!

  9. This is the perfect post, and the script helped me more than 3 days of chit-chat on stackoverflow… Thank you

  10. Thanks. It’s a pity that with python 3, urllib no longer works the same.

  11. I agree Haroldo. Maybe I’ll do a post about how to get the same thing achieved using Python 3. Thanks for checking out my post.

    UPDATE: I wrote a new post outlining how to do the same thing with Python3. python-file-read-write-with-urllib2

  12. This is also a great introduction for someone to hit the ground running using Python in an applicable manner. Have you ever messed with multi-threaded processes in Python? In the case you REALLY want to speed some things up.. :-D

  13. Thank you for this post, it is really helpful for me as a beginner with Python. I wonder how we could use it to download videos, say from YouTube. Any ideas?

  14. Hey, thanks for posting this script! It works great! It saves me a bunch of time.

  15. Great post, keep up the good work. I have added your site to my RSS feed reader.

  16. It is nice to find working code; it sure makes it easier for me.

  17. Awesome!! Thanx

  18. [...] all the front pages of the newspaper Clarín within a given date range. Tinkering a bit with another script, I put together [...]
