
A patch to limit PHPCrawl crawling depth

PHPCrawl is a web crawler/spider library written in PHP. It supports filters, limiters, cookie handling, robots.txt handling, multiprocessing and much more.

Unfortunately, there is no built-in way to limit PHPCrawl's crawling depth. Here I propose a patch against its current version (0.82) that adds two methods to the PHPCrawler class: getMaxDepth and setMaxDepth.

The usage is intuitive:

$crawler = new PHPCrawler();
$crawler->setURL($startingURL);
$crawler->setMaxDepth($n);
$crawler->go();

The crawler will fetch pages from level 0 (the $startingURL) up to level $n - 1.

By default, the crawling depth limit is set to PHPCrawler::UNLIMITED_CRAWLING_DEPTH = 0. This means that the crawler will get any web page, regardless of its depth from the starting URL.
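
To give an idea of what the limit does, here is a minimal sketch of the concept (this is NOT the patch's actual code; crawl() and extractLinks() are illustrative names): a breadth-first crawl that records each URL's distance from the start URL and stops following links once the configured depth is reached.

<?php
// Sentinel value mirroring the patch's default: 0 means "no depth limit".
const UNLIMITED_CRAWLING_DEPTH = 0;

// Naive link extraction, just enough for the sketch.
function extractLinks($html)
{
    preg_match_all('/href="(https?:\/\/[^"]+)"/i', $html, $matches);
    return array_unique($matches[1]);
}

function crawl($startUrl, $maxDepth = UNLIMITED_CRAWLING_DEPTH)
{
    $queue = array(array($startUrl, 0)); // pairs of (url, depth)
    $seen  = array($startUrl => true);

    while (!empty($queue)) {
        list($url, $depth) = array_shift($queue);

        $html = @file_get_contents($url); // error handling omitted
        if ($html === false) {
            continue;
        }
        // ... process the page here ...

        // With a limit of $n the crawler fetches levels 0 .. $n - 1,
        // so links are followed only while the next level is below $n.
        if ($maxDepth !== UNLIMITED_CRAWLING_DEPTH && $depth + 1 >= $maxDepth) {
            continue;
        }
        foreach (extractLinks($html) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = array($link, $depth + 1);
            }
        }
    }
}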

To apply the patch, download it and run:

patch -p1 -d PHPCrawl_082/ < PHPCrawl_082_maxcrawlingdepth_rev_2_1.patch

from the parent directory of the PHPCrawl source tree.

Download: patch (revision 2)

UPDATE: In the comments section, Hiruka pointed out that this patch is not easily applied with NetBeans' patch facility. I recommend using the patch command from the command line, or some other tool. For instance, Hiruka succeeded in applying the patch using git.

UPDATE 2 (28/07/2014): I have uploaded a patch revision. This should fix the small bug ('undefined offset') reported by Sylvain LAVIELLE in the comments section.

UPDATE 3 (28/08/2014): Some more bug fixes. I also introduced an (experimental) feature to set the HTTP Accept-Language header:

// set preferred language
$crawler->setAcceptLanguage("it, en;q=0.8");
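
With this setting the crawler asks servers for Italian content first, falling back to English with relative preference 0.8, following the standard q-value syntax of the Accept-Language header.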


A social networks hub


In December 2012, the student Daniele Orlando organized a hackathon at my university. The contest topic was the social-networking divide. For various reasons, I couldn't attend.
My idea would have been to create a hub for my degree course. We currently have (at least) three information channels: the official site, the Facebook group, and the super-cool Google group. Following, or updating, all of these can get pretty chaotic.
My idea for the contest was to create a single point of access to these channels, leveraging technologies like RSS and the Facebook API. I have since implemented it. It was quite easy: it took me probably less than a day to develop the proof of concept you'll find attached to this post. You can also find a running demo here.

My social hub uses SimplePie to access RSS content (which covers the website and the Google group) and the Facebook Graph API to access open Facebook groups. The hub can also republish the aggregated content via RSS; the sketch below gives the gist of how it works.
Since I suck at web design, I've just used a template by Srinivas Tamada.
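
Here is a rough sketch of the hub's core loop (not the attached code; the feed URL, GROUP_ID and the APP_ID|APP_SECRET token are placeholders): it pulls items from an RSS feed with SimplePie and from an open Facebook group via the Graph API, then merges everything into one chronological list.

<?php
require_once 'simplepie/autoloader.php'; // adjust to your SimplePie install

// 1. RSS side: works for both the website feed and the Google Group feed.
$feed = new SimplePie();
$feed->set_feed_url('https://example.org/news/feed'); // placeholder URL
$feed->init();

$items = array();
foreach ($feed->get_items() as $item) {
    $items[] = array(
        'title' => $item->get_title(),
        'link'  => $item->get_permalink(),
        'time'  => (int) $item->get_date('U'), // Unix timestamp
    );
}

// 2. Facebook side: read the open group's feed with an app access token.
$token = 'APP_ID|APP_SECRET'; // placeholder credentials
$json  = file_get_contents('https://graph.facebook.com/GROUP_ID/feed'
       . '?access_token=' . urlencode($token));
$data  = json_decode($json, true);

if (isset($data['data'])) {
    foreach ($data['data'] as $post) {
        $message = isset($post['message']) ? $post['message'] : '';
        $items[] = array(
            'title' => mb_substr($message, 0, 80),
            'link'  => 'https://www.facebook.com/' . $post['id'],
            'time'  => strtotime($post['created_time']),
        );
    }
}

// 3. Merge the two sources, newest first, ready to render or re-feed.
usort($items, function ($a, $b) {
    return $b['time'] - $a['time'];
});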

To run my code on your machine you'll just need a LAMP installation plus a Facebook application key and secret. The Facebook app is needed only if you want to read from an open Facebook group; make sure to request the user_groups permission when creating it.

Be aware!! This is just a proof-of-concept release!!

Download: SocialHub.tar.gz
