So I've always used regular expressions to scrape all my data, and honestly it gets pretty tedious for a noob like me. I've been able to make it work, but it's a hassle, and until a few days ago I thought it was my only option.
Fortunately for me, a few super-smart-engineer-entrepreneur friends (Noah Ready-Campbell and Calvin Young) told me about lxml and Beautiful Soup. They said it was a little tricky to install, but I didn't believe them... I tried it out for myself and actually had a lot of trouble getting it going. Eventually I stumbled upon something that made it pretty easy for me, but I'm hoping to turn that around and make it even easier for you to get.
So here goes (disclosure: this worked on my Ubuntu EC2 AMI and my Ubuntu home machine):
How To Install lxml:
sudo apt-get install python-lxml python-beautifulsoup
Thanks to lamby of HN for this! I just tried it and it worked on my new Ubuntu EC2 AMI. If anyone finds out this doesn't work, please report it to me/someone-actually-important!
The problem people usually run into is that lxml has a lot of dependencies, and if any of them is missing it seems like the install never works. So here is what we'll end up installing:
- libxml2 - the C XML parsing library that lxml is built on
- libxslt1.1 - the C XSLT library, which lxml also depends on
- libxml2-dev - the libxml dev header
- libxslt1-dev - the libxslt dev header
- python-libxml2 - python bindings for libxml2
- python-libxslt1 - python bindings for libxslt1
- python-dev - python dev headers
- python-setuptools - the thing that lets you run easy_install
So here's what it should all look like:
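Here is a rough sketch of the install commands, built from the package list above. I'm assuming everything goes into a single apt-get call and that lxml itself is then installed with easy_install (which is what python-setuptools is on the list for); the helper names `build_install_commands` and `as_shell` are just made up for this example. It only prints the commands, it doesn't run anything.

```python
import shlex

# Package names copied from the list above.
PACKAGES = [
    "libxml2",
    "libxslt1.1",
    "libxml2-dev",
    "libxslt1-dev",
    "python-libxml2",
    "python-libxslt1",
    "python-dev",
    "python-setuptools",
]


def build_install_commands(packages=PACKAGES):
    """Return the install commands as argument lists, without running them."""
    return [
        # Assumption: one apt-get call for all of the system packages.
        ["sudo", "apt-get", "install"] + list(packages),
        # Assumption: lxml itself is then built and installed via easy_install.
        ["sudo", "easy_install", "lxml"],
    ]


def as_shell(command):
    """Join an argument list into a line you can paste into a terminal."""
    return " ".join(shlex.quote(part) for part in command)


if __name__ == "__main__":
    for command in build_install_commands():
        print(as_shell(command))
```

Copy the printed lines into your terminal one at a time, so you can see which step fails if something goes wrong.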
And if some of these exact commands don't work, try searching for the package by name with apt-cache search, or refresh your package index with sudo apt-get update and try again.
Boom! Should be done. If you guys are running Ubuntu and have issues with this feel free to email me: email@example.com
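If you want to double-check that the install actually took, here is a small sketch that just tries to import the module and reports what it found. The function name `check_modules` is my own invention, and it treats any import failure as "missing". The same function should work for BeautifulSoup once you've installed it in the next section (I'm assuming the module is called BeautifulSoup for the old 3.x release and bs4 for the newer one).

```python
import importlib


def check_modules(names):
    """Try to import each module name and report whether it worked."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except Exception:
            # Any failure to import counts as "not usable" here.
            status[name] = False
    return status


if __name__ == "__main__":
    for name, ok in check_modules(["lxml"]).items():
        print(name, "OK" if ok else "missing")
```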
How To Install BeautifulSoup:
This is much easier and hardly needs any instruction.
First go here to find the file you want: Beautiful Soup Downloads
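Then, depending on where you saved the file and which download you chose, install it. For the tarball, that usually means unpacking it and running python setup.py install from the extracted directory.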
Once you have these installed check this post out: lxml; an underappreciated web scraping library.
The post has some great examples of how easy it is to scrape with lxml and BeautifulSoup. It's practically like being able to pick out elements with CSS selectors!
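To give you a taste, here is a minimal sketch of my own (not taken from that post) that pulls links out of a page with lxml. The HTML in `PAGE` and the function name `extract_links` are made up for illustration, and I'm using an XPath query rather than real CSS selectors because XPath comes built into lxml.

```python
try:
    from lxml import html as lxml_html
except ImportError:
    # lxml isn't installed yet; go back to the install steps above.
    lxml_html = None

# Made-up sample page, just for this example.
PAGE = """
<html><body>
  <div class="post"><a href="/posts/1">First post</a></div>
  <div class="post"><a href="/posts/2">Second post</a></div>
  <div class="sidebar"><a href="/about">About</a></div>
</body></html>
"""


def extract_links(page_source):
    """Return (link text, href) pairs for links inside div.post elements."""
    if lxml_html is None:
        raise RuntimeError("lxml is not installed")
    tree = lxml_html.fromstring(page_source)
    # Roughly what the CSS selector "div.post > a" would match.
    links = tree.xpath('//div[@class="post"]/a')
    return [(a.text_content().strip(), a.get("href")) for a in links]


if __name__ == "__main__":
    if lxml_html is None:
        print("lxml is not installed yet")
    else:
        for text, href in extract_links(PAGE):
            print(text, href)
```

Compare that one XPath line with the regex you'd need to do the same thing and skip the sidebar link.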
Again if anyone has any questions feel free to email me! I know the set-up process can be a huge pain so...yeah.