Python Web-page Scraping - Installing lxml and Beautiful Soup

I've always used regular expressions to scrape my data, and for a noob like me that can get pretty tedious. I've managed with them, but it's a hassle, and until a few days ago I thought they were my only route.

Fortunately for me, a few super-smart-engineer-entrepreneur friends (Noah Ready-Campbell and Calvin Young) told me about lxml and Beautiful Soup. They said the two were a little tricky to install, but I didn't believe them... until I tried it myself and had a lot of trouble getting them going. Eventually I stumbled upon something that made it pretty easy for me, and I'm hoping to turn that around and make it even easier for you.

So here it goes (disclosure: this worked on my Ubuntu EC2 AMI and Ubuntu home machine):

How To Install lxml:

UPDATE: 

Try:

sudo apt-get install python-lxml python-beautifulsoup

Thanks to lamby of HN for this! I just tried it and it worked on my new Ubuntu EC2 AMI... if anyone finds out this doesn't work, please report it to me/someone-actually-important!

/UPDATE

The usual problem is that lxml has a lot of dependencies, and if any one of them is missing the install fails. So here is what we'll end up installing:

  • libxml2 - the XML parsing C library that lxml is built on
  • libxslt1.1 - the XSLT C library, which lxml also depends on
  • libxml2-dev - the libxml2 development headers
  • libxslt1-dev - the libxslt development headers
  • python-libxml2 - Python bindings for libxml2
  • python-libxslt1 - Python bindings for libxslt
  • python-dev - the Python development headers
  • python-setuptools - the package that provides easy_install

So here's what it should all look like:
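
For reference, here is a rough sketch of how the package list above could be turned into install commands. It is not the original snippet from this post: the helper names are made up for illustration, and finishing with easy_install lxml is an assumption based on python-setuptools being in the list. The script only builds and prints the commands; it does not run them.

```python
import shlex

# Package names copied from the dependency list above.
LXML_DEPENDENCIES = [
    "libxml2",
    "libxslt1.1",
    "libxml2-dev",
    "libxslt1-dev",
    "python-libxml2",
    "python-libxslt1",
    "python-dev",
    "python-setuptools",
]


def apt_install_command(packages):
    """Build (but do not run) one apt-get call that installs every package."""
    return ["sudo", "apt-get", "install"] + list(packages)


def easy_install_command(package):
    """Build (but do not run) the easy_install call for one Python package."""
    return ["sudo", "easy_install", package]


def format_command(argv):
    """Render an argument list as a single shell-safe line."""
    return " ".join(shlex.quote(part) for part in argv)


if __name__ == "__main__":
    # Assumption: lxml itself is installed with easy_install once the
    # system packages are in place.
    print(format_command(apt_install_command(LXML_DEPENDENCIES)))
    print(format_command(easy_install_command("lxml")))
```

Printing the commands first makes it easy to read them over before pasting them into a terminal.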

And if some of these exact commands don't work, try searching for the package (apt-cache search) or refreshing your package index (sudo apt-get update).

Boom! Should be done. If you guys are running Ubuntu and have issues with this feel free to email me: wesley.zhao@gmail.com
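
If you want to double-check that the install actually took, something like the sketch below might help. It assumes Python 3.4 or newer (for importlib.util.find_spec), and the function name is my own. The module names in the demo are also assumptions: lxml for lxml, and BeautifulSoup for the Beautiful Soup 3 package (newer releases import as bs4 instead).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module names; adjust them to match what you installed.
    wanted = ["lxml", "BeautifulSoup"]
    missing = missing_modules(wanted)
    if missing:
        print("Not importable yet: " + ", ".join(missing))
    else:
        print("All set: " + ", ".join(wanted))
```

An empty result means every module on the list can be found by the interpreter you ran the check with.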

How To Install BeautifulSoup:

This is much easier and hardly needs any instruction.

First go here to find the file you want: Beautiful Soup Downloads

Then, depending on which download you chose and where you saved it, this is how you install it:

Once you have these installed, check this post out: lxml; an underappreciated web scraping library.

The post has some great examples of how easy it is to scrape with lxml and BeautifulSoup. It's practically like grabbing elements with CSS selectors!
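
To give a taste of what that looks like, here is a minimal sketch that assumes lxml is installed. The sample HTML, the class names, and the function name are all invented for illustration. It uses XPath rather than a CSS selector because XPath ships with lxml itself; the expression below plays the role of the selector "div.post a" for this simple markup.

```python
import lxml.html

# Made-up page fragment, just for the example.
SAMPLE_HTML = """
<html><body>
  <div class="post"><a href="/first">First post</a></div>
  <div class="post"><a href="/second">Second post</a></div>
  <div class="sidebar"><a href="/about">About</a></div>
</body></html>
"""


def post_links(html_text):
    """Return (href, link text) pairs for links inside <div class="post">."""
    tree = lxml.html.fromstring(html_text)
    # Matches only divs whose class attribute is exactly "post".
    anchors = tree.xpath('//div[@class="post"]//a')
    return [(a.get("href"), a.text_content().strip()) for a in anchors]
```

Calling post_links(SAMPLE_HTML) picks up the two post links and skips the sidebar one, with no regular expressions involved.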

Again if anyone has any questions feel free to email me! I know the set-up process can be a huge pain so...yeah.

15 responses
Posterous doesn't like Gists (or more accurately JavaScript) ... which is why this article appears to be missing some content.

And that's kind of ironic. How would you scrape that? (thought it was funny that an article about scraping appears to have content that was scraped away.)

Whoops! The missing/scraped-away content was operator error. Just fixed it all, I believe.

To scrape the Gists off Posterous specifically I would probably get the link from the Gist js src file, follow the link back, then scrape it off the GitHub site.
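
A rough sketch of the first two steps, using only the Python 3 standard library. It rests on an assumption: that a Gist embed is a script tag whose src is the Gist page URL with ".js" on the end. The class and function names are made up, and actually fetching the Gist page from GitHub is left out.

```python
from html.parser import HTMLParser


class GistScriptFinder(HTMLParser):
    """Collect the src of every <script> tag that points at gist.github.com."""

    def __init__(self):
        super().__init__()
        self.script_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src") or ""
        if "gist.github.com" in src:
            self.script_urls.append(src)


def gist_page_urls(page_html):
    """Return the Gist page URLs implied by the embed scripts in page_html."""
    finder = GistScriptFinder()
    finder.feed(page_html)
    # Assumption: dropping the trailing ".js" gives the Gist's own page.
    return [u[:-3] if u.endswith(".js") else u for u in finder.script_urls]
```

Each URL returned would then be the page to scrape on the GitHub side.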

You're much better off installing lxml and Beautiful Soup (and just about any software package, for that matter) through a package manager like apt-get, exactly like how you're installing all of lxml's dependencies. A package manager will automatically pull in dependencies, and can also keep track of which file belongs to which package to make upgrades and uninstalls easy.
Simply trying to apt-get all-in-one did not seem to work for me, but it's worth giving a try next time!
The system package manager isn't good if you have to use multiple versions of a library, which is why I tend to build in a virtualenv. You can still use the package manager to build dependencies for you using build-dep like so (on Natty):

aptitude build-dep python2.7-lxml

If you installed ActiveState Python, it's really easy to install lxml and BeautifulSoup.
`pypm install BeautifulSoup lxml`
@joshbohde The original post actually uses Aptitude as well. Why is the system package manager not good? Sorry, I'm not that familiar with all this.

@Joshua installing ActiveState was not on my radar, but again it's good to know for the future! I was hoping more newbs who start from scratch would find my tutorial useful.

Sometimes you work on multiple projects that require different versions of the same library. When using aptitude, you'll have a difficult time installing multiple versions. This is the problem the tool virtualenv was designed to solve. If you run the above build-dep, then you can easily run `pip install lxml==2.2.8` in one virtual Python environment, while running `pip install lxml==2.2.4` in another.
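
Here is a small sketch of that idea. It swaps in Python 3's built-in venv module for the virtualenv tool mentioned above, so treat it as an illustration rather than the exact workflow. The environment names and helper functions are invented; the version pins are the ones from this comment. It creates the environments and returns the pip command for each one without running pip.

```python
import os
import tempfile
import venv

# Pins from the comment above; the environment names are made up.
PINS = {"legacy": "lxml==2.2.4", "current": "lxml==2.2.8"}


def create_env(env_dir, with_pip=True):
    """Create an isolated environment and return the path to its interpreter."""
    venv.create(env_dir, with_pip=with_pip)
    if os.name == "nt":
        return os.path.join(env_dir, "Scripts", "python.exe")
    return os.path.join(env_dir, "bin", "python")


def plan_installs(pins=PINS, base_dir=None, with_pip=True):
    """Create one environment per pin and return the pip command for each."""
    # With no base_dir, the environments go into a fresh temporary directory.
    base_dir = base_dir or tempfile.mkdtemp(prefix="envs-")
    commands = {}
    for name, requirement in pins.items():
        python = create_env(os.path.join(base_dir, name), with_pip=with_pip)
        # Each environment has its own site-packages, so the pins never collide.
        commands[name] = [python, "-m", "pip", "install", requirement]
    return commands
```

Because every command starts with that environment's own interpreter, each install lands in its own site-packages and the system-wide packages stay untouched.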
joshbohde that seems pretty advanced. Though it makes sense and I think I get it. Is this an issue people run into a lot?
I deal with it every day. A big source of this is frameworks like Django. I may have some legacy app still on 1.1, but do all of my new development on 1.3. If I installed Django system-wide, I'd have to do some crazy stuff in order to switch dev environments.
sudo pip install lxml

That's it. Done.

I'd suggest using html5lib instead of BeautifulSoup. BS hasn't been properly maintained for a long time, and though it claims to not break on invalid pages, it doesn't do a very good job. http://code.google.com/p/html5lib/

Moreover, expat is a good alternative to lxml if you care about speed and memory usage. lxml parses out all of the XML (even the stuff you don't want) and leaves a lot of data structures in memory.
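
As a rough illustration of the streaming style, here is a sketch using the expat bindings in Python's standard library (xml.parsers.expat). The function name and the sample document are made up, and note that expat needs well-formed XML, so messy real-world HTML will make it raise an error. The handler keeps only the href values and never builds a tree.

```python
import xml.parsers.expat


def collect_hrefs(xml_text):
    """Return the href of every <a> element, in document order."""
    hrefs = []

    def start_element(name, attrs):
        # Called once per opening tag; everything else is simply skipped.
        if name == "a" and "href" in attrs:
            hrefs.append(attrs["href"])

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return hrefs


if __name__ == "__main__":
    sample = '<root><a href="/x">x</a><b/><a href="/y">y</a></root>'
    print(collect_hrefs(sample))
```

Only the list of links stays in memory here; the price is that you write the bookkeeping yourself instead of querying a parsed document.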

Joshbohde - Django breaks backwards compatibility? Uh oh that's good to know since I'm just starting out on Django.

Max - haha if that works as simply as that then I'm glad I know now!

_nygren - html5lib huh? Do you lose anything from using it?

Most of the time Django is backwards compatible, with a few exceptions such as security fixes. You can see details at https://docs.djangoproject.com/en/dev/misc/api-stability/