07/04/2014

A patch to limit PHPCrawl crawling depth

PHPCrawl is a webcrawler/webspider-library written in PHP. It supports filters, limiters, cookie-handling, robots.txt-handling, multiprocessing and much more.

Unluckily, there isn’t a way to limit PHPCrawl crawl depth. Here I propose a patch for its current version (0.82), that adds two methods to the PHPCrawler class: getMaxDepth and setMaxDepth.

The usage is intuitive:

$crawler = new PHPCrawler();
 $crawler->setURL($startingURL);
 $crawler->setMaxDepth($n);
 $crawler->go();

The crawler will get pages from level 0 (the $startingURL) to level $n – 1.

By default, the crawling depth limit is set to PHPCrawler::UNLIMITED_CRAWLING_DEPTH = 0. This means that the crawler will get any web page, regardless of its depth from the starting URL.

To apply the patch, download it and give:

patch -p1 -d PHPCrawl_082/ < PHPCrawl_082_maxcrawlingdepth_rev_2_1.path

from PHPCrawler source code parent directory.

Download: patch (revision 2)

UPDATE: In the comments section, Hiruka suggested that this patch it is not easily applicable using NetBeans patch facility. I recommend using the patch command from the command line, or some other tool. For instance, Hiruka succeeded to apply the patch using git.

UPDATE 2 (28/07/2014): I have uploaded a patch revision. This should fix the small bug reported by Sylvain LAVIELLE in the comment section (‘undefined offset’).

UPDATE 3 (28/08/2014): some more bug fixes. Also, I introduced an (experimental) feature to set the HTTP Accept-Language Header

// set preferred language
$crawler->setAcceptLanguage("it, en;q=0.8");