Feb 07

brekiri and crawling

\\//,

I promised to post about Redmine and the plugins we use, but as we progress with the project there are far more interesting things to blog about. So I decided to talk a bit about the crawling we do.

Crawling is something a lot of people do, for all kinds of reasons. We crawl to fill an index of business-related information. But to be able to do that you have to clear the following hurdles:

  • filter non business sites
  • block content spammers
  • keep the depth small
  • verify before you scrape
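To make a few of these hurdles concrete, here is a minimal sketch of a cheap check that could run before a URL is handed to the crawler. It is not the code we run: the function name, the blocklist and the depth limit are made up for illustration, and it does not try to decide whether a site is a business site, which needs more than a URL.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration only; real lists would come from your own data.
BLOCKED_DOMAINS = {"spam.example"}
MAX_DEPTH = 2


def should_crawl(url, depth, blocked=BLOCKED_DOMAINS, max_depth=MAX_DEPTH):
    """Return True if the URL passes the cheap pre-crawl checks."""
    # Keep the depth small: refuse anything deeper than the limit.
    if depth > max_depth:
        return False

    # Verify before you scrape: only well-formed http(s) URLs with a host.
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    if not host:
        return False

    # Block content spammers: reject a listed domain and any of its subdomains.
    for domain in blocked:
        if host == domain or host.endswith("." + domain):
            return False
    return True
```

Checks like these are cheap, so it makes sense to run them before any request goes out and to keep the expensive verification for the URLs that survive.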

    To be able to do that you need a crawler: one that can do multiple things, and preferably one that is easily adaptable and has very lean code. We chose Scrapy. Now Scrapy has advantages and disadvantages, and my feelings about it are continuously changing. Let's go over the pros.

    • It's Python
    • has a nice set of features
    • works with basic tools, keeps footprint small
    • really easy to write middleware for it (lets you extend the features)
    I've had plenty of fun writing the tools I needed, and they will be open sourced when I have a bit more time. Basically I wrote tools for storing the information, verifying the links before crawling, and checking certain parameters before scraping. It does what I want, how I want it, except for one thing: speed. Scrapy slows down a lot when you have a lot of start URLs. I had 160,000 URLs that I wanted scraped, and after 4 days it had only scraped 20,000 of them. This hurt. When I took just 10 start URLs the scraping went a whole lot faster, which shouldn't happen.

    So what will I have to do in the future? 1) create a sharded version, and 2) allow Scrapy to fetch the URLs to scrape from a queue.
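Neither of these exists yet, but the idea can be sketched in plain Python. The sketch below is an assumption about how it might look, not a design I have settled on: the function names are invented, the hostname hashing is just one way to shard, and the in-memory deque stands in for whatever real queue ends up being used.

```python
import hashlib
from collections import deque
from urllib.parse import urlparse


def shard_for(url, num_shards):
    """Map a URL to a shard by hashing its hostname, so one site stays on one shard."""
    host = (urlparse(url).hostname or url).lower()
    digest = hashlib.sha256(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def split_into_shards(urls, num_shards):
    """Distribute URLs over num_shards lists; each list could feed its own crawler."""
    shards = [[] for _ in range(num_shards)]
    for url in urls:
        shards[shard_for(url, num_shards)].append(url)
    return shards


def take_batch(queue, size):
    """Pop up to `size` URLs from the front of the queue (a deque stands in for a real queue)."""
    batch = []
    while queue and len(batch) < size:
        batch.append(queue.popleft())
    return batch
```

The point of the batch function is that the crawler would only ever see a small number of start URLs at a time and ask for more when it runs dry, instead of being handed all 160,000 up front.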

    I do hope this will be possible without too big a change; if not, I might have to look at other options.

    As I am a flexible person, I'm willing to switch to another solution if one comes along, but it must be Python. The reason is that we already have a pretty big technology footprint. Having multiple languages for multiple tools might be a good idea in the future, but right now it would only make our life harder, so I prefer to keep things as simple as possible.

    llap!

    About

    I'm Jochen Maes, a nerd, enough said! (contact info on the about page)
