brekiri and crawling
\\//,
I promised to post about Redmine and the plugins we use, but as the project progresses there are far more interesting things to blog about, so I decided to talk a bit about the crawling we do.
Crawling is something a lot of people do, for various reasons. We crawl to fill an index with business-related information. To do that you have to clear the following hurdles:
- filter out non-business sites
- block content spammers
- keep the depth small
- verify before you scrape
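To make two of these hurdles concrete (blocking known spammers and keeping the depth small), here is a minimal sketch of a check that could run before a page is fetched. It is not our actual code: `should_crawl`, `BLOCKED_DOMAINS` and `MAX_DEPTH` are names and values I made up for illustration, and the harder problem of deciding whether a site is a business site is left out.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration only.
BLOCKED_DOMAINS = {"spam.example"}
MAX_DEPTH = 2


def should_crawl(url, depth, blocked=BLOCKED_DOMAINS, max_depth=MAX_DEPTH):
    """Return True if the URL passes the cheap pre-fetch checks."""
    # Keep the depth small: refuse links that are too far from the start page.
    if depth > max_depth:
        return False
    host = (urlparse(url).hostname or "").lower()
    if not host:
        # Not an absolute URL with a host, so there is nothing to fetch.
        return False
    # Block the listed domain itself and any of its subdomains.
    return not any(host == d or host.endswith("." + d) for d in blocked)
```

For the depth part alone, Scrapy also has a `DEPTH_LIMIT` setting, so a hand-written check like this is mostly useful when the rules get more specific.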
To do that you need a crawler: one that can do multiple things, is easily adaptable, and has very lean code. We chose Scrapy. Scrapy has advantages and disadvantages, and my feelings about it keep changing. Let's go over the pros.
- it's Python
- it has a nice set of features
- it works with basic tools, which keeps the footprint small
- it is really easy to write middleware for (which lets you extend the features)
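To give an idea of how small such a middleware can be, here is a sketch of a downloader middleware that drops requests to blocked hosts. Treat it as an illustration under assumptions, not our production code: `BlocklistMiddleware` and the `BLOCKED_HOSTS` setting are hypothetical names, and the `process_request(request, spider)` signature is the one I know from the Scrapy versions I have used, so check it against the version you run.

```python
from urllib.parse import urlparse

try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:
    # Stand-in so the sketch can be run without Scrapy installed.
    class IgnoreRequest(Exception):
        pass


class BlocklistMiddleware:
    """Hypothetical downloader middleware that drops requests to blocked hosts."""

    def __init__(self, blocked_hosts):
        self.blocked_hosts = {h.lower() for h in blocked_hosts}

    @classmethod
    def from_crawler(cls, crawler):
        # BLOCKED_HOSTS is a made-up setting name for this example.
        return cls(crawler.settings.getlist("BLOCKED_HOSTS"))

    def process_request(self, request, spider):
        host = (urlparse(request.url).hostname or "").lower()
        if host in self.blocked_hosts:
            # Raising IgnoreRequest tells Scrapy not to download this request.
            raise IgnoreRequest(f"blocked host: {host}")
        # Returning None lets the request continue through the other middlewares.
        return None
```

A class like this would be switched on through the `DOWNLOADER_MIDDLEWARES` setting of the project.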
So what will I have to do in the future? 1) create a sharded version, and 2) allow Scrapy to fetch the URLs to scrape from a queue.
I do hope that will be possible without too big a change; if not, I might have to look at other options.
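Neither of the two exists yet, so the following is only a rough sketch of the idea, with no Scrapy in it at all. It assumes that sharding means "all URLs of one host go to the same crawler instance" and uses in-process `queue.Queue` objects where a real setup would use some external queue. `shard_for`, `route` and `drain` are hypothetical helper names.

```python
import hashlib
import queue
from urllib.parse import urlparse


def shard_for(url, num_shards):
    """Map a URL to a shard by hashing its host, so one site stays on one shard."""
    host = (urlparse(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def route(urls, shard_queues):
    """Put every URL on the queue of the shard that owns its host."""
    for url in urls:
        shard_queues[shard_for(url, len(shard_queues))].put(url)


def drain(url_queue):
    """Yield queued URLs until the queue is empty."""
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        yield url
```

Each crawler instance would then read only from its own queue, for example by building its start requests from `drain(...)` instead of from a fixed list of start URLs.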
I'm a flexible person and willing to switch to another solution if one comes along, but it must be Python. We already have a pretty big technology footprint. Having multiple languages for multiple tools might be a good idea in the future, but right now it would only make our life harder, so I prefer to keep things as simple as possible.
llap!