Scrawler: programmable web crawler

A couple of days back my little blog unnoticeably turned nine. Realizing that what started almost a decade ago as a result of my boredom is still more or less alive to this day is not only a surprise to me but also a welcome change from most of my other projects, half of which didn't even get to the point of publishment.

It was supposed to change at the end of February. I had to collect some data from quite popular Polish site for further usage and analysis. I started to create a commandline script to automate the process and had a fleeting thought of publishing it. However, I realized that combination of one cURL call and couple of regexes is not complicated enough to deserve a full-blown project. Once again I started to dream of coming up with some reusable library, to finally have at least one piece of software put under my name on Packagist. I was going to drop the conception of scrapper altogether and suddenly epithany had come. The idea of Scrawler was born.

The general concept was to have the crawler and the library combined. I was wondering how to design it to make it useful not only for me and this particular scenario but in a way it could serve as a solid foundation for more cases related to data scrapping.

Scrawler became a declarative web robot that you can use to visit almost any website you want, specify the rules to gather the data by and process it into the format you want. Of course, it doesn't do all the work for you but it lets you avoid all of the hassle of handling the HTTP communication, parsing HTML, respecting robots.txt, getting results in some conveniently abstracted form etc. It allows you to skip the implementation details and get straight to the point, like most of libraries do.

Writing a scrapper for a particular data source was easy but making it really flexible turned that task into a challenge. In fact, it is a case where good programming practices are actually enforced by the business goal – having a library for such vague task like web crawling that you cannot further extend makes it completely useless. Finally, I decided to give myself a week to create a proof of concept or fail miserably. Seven days and around two thousand lines of code later I had a script that would go to the given URL, scrap parts of the contents using provided patterns and map them over the Data Transfer Objects. I called it a day.

Now, three more months and two internal releases later I feel like it's time to show my monstrosity to the world. It still isn't very stable, lacks a bunch of features I planned for the release – but it is mine. I actually did put so much emphasis to make it mine that even to this day I haven't checked the existing libraries in that field, not even once. I know they are there. Probably much better than mine, perhaps they have well-established community and support but I didn't want to work under their influnce or get intimidated. The whole design of the extensibility and the flow is entirerly my ~~fault~~ idea.

Scrawler aims to help you with web crawling and data scrapping (hence the name, don't get confused by the poor quality of this writeup!). The process consists of steps like going to the URL, gathering bits of response to save them and looking for new URLs… just like most other software of this type. What makes Scrawler special is the abbility to tell it how exactly these steps should be performed. Each one of them has its own OOP interface plus couple of built-in implementations you may swap with a single line of code.

<?php

$scrawler = new Configuration();

$scrawler
    ->setOperationName('Sobakowy Blog')
    ->setBaseUrl('https://sobak.pl')
    ->addUrlListProvider(new ArgumentAdvancerUrlListProvider('/page/%u', 2))

Do you want to follow every found URL that matches the domain? There's an implementation waiting for you. To get URLs from external list you have prepared? I got you covered, just use the other class for finding URLs. You need to increment a fragment of the URL (e.g. page number) to get more data? Done. And what if you want to pass every third word of Fifty Shades of Grey as a search parameter on a site with cute hamsters? Well, conscience clause stops me from distributing it along Scrawler but hey, you can write your own class! I can guarantee that you'll fit into twenty lines of code and Scrawler will still do the rest. The same point applies to all remaining steps of what Scrawler does. It gives you the tools to do the dirty work quickly.

Given varying output formats like CSV, JSON or any Doctrine-supported database, Scrawler can be used either as a standalone tool or as a part of the toolchain where its results are further processed.

There are surely some limitations given Scrawler's early development stage and the room for improvement is definitely big but that's just one more reason to release it to the public. Aside from the fact that something what has started out as an experiment and a time-killer has already proven to be useful for me.

World – meet the Sobak's crawler!

Make sure to visit Scrawler's website with the detailed documentation of available features. Feel free to tinker with it, report issues or submit pull requests to the repository.

Sobakowy Blog

Trochę o wszystkim i wiele o niczym

Scrawler: programmable web crawler

Komentarze wyłączone