A patch to limit PHPCrawl crawling depth
PHPCrawl is a webcrawler/webspider-library written in PHP. It supports filters, limiters, cookie-handling, robots.txt-handling, multiprocessing and much more.
Unfortunately, PHPCrawl offers no way to limit the crawling depth. Here I propose a patch against its current version (0.82) that adds two methods to the PHPCrawler class: getMaxDepth and setMaxDepth.
The usage is intuitive:
$crawler = new PHPCrawler();
$crawler->setURL($startingURL);
$crawler->setMaxDepth($n);
$crawler->go();
The crawler will get pages from level 0 (the $startingURL) up to level $n - 1.
By default, the crawling depth limit is set to PHPCrawler::UNLIMITED_CRAWLING_DEPTH = 0. This means that the crawler will get any web page, regardless of its depth from the starting URL.
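To make the behaviour concrete, here is a minimal sketch of a depth-limited crawl. It follows the usual PHPCrawl pattern of subclassing PHPCrawler and overriding handleDocumentInfo (the override must keep a signature compatible with the parent class, as one commenter discovered the hard way). The URL and class name are illustrative.

```php
<?php
// Assumes PHPCrawl 0.82 with this patch applied.
require_once 'PHPCrawl_082/libs/PHPCrawler.class.php';

class MyCrawler extends PHPCrawler
{
  // The signature must stay compatible with PHPCrawler::handleDocumentInfo().
  function handleDocumentInfo(PHPCrawlerDocumentInfo $info)
  {
    echo $info->url . " (" . $info->http_status_code . ")\n";
  }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.example.com/");
$crawler->setMaxDepth(2); // fetch levels 0 and 1 only
$crawler->go();
```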
To apply the patch, download it and give:
patch -p1 -d PHPCrawl_082/ < PHPCrawl_082_maxcrawlingdepth_rev_2_1.patch
from PHPCrawler source code parent directory.
Download: patch (revision 2)
UPDATE: In the comments section, Hiruka reported that this patch is not easily applicable using the NetBeans patch facility. I recommend applying it with the patch command from the command line, or with some other tool. For instance, Hiruka succeeded in applying the patch using git.
UPDATE 2 (28/07/2014): I have uploaded a patch revision. This should fix the small bug reported by Sylvain LAVIELLE in the comment section (‘undefined offset’).
UPDATE 3 (28/08/2014): some more bug fixes. Also, I introduced an (experimental) feature to set the HTTP Accept-Language header:
// set the preferred language
$crawler->setAcceptLanguage("it, en;q=0.8");
Hiruka
May 05, 2014 @ 06:11:17
First of all, I want to say thank you for your working to make this patch!
But after I applied it to my code, I got this error:
Fatal error: Call to undefined method PHPCrawlerHTTPRequest::setMaxDepth() in PHPCrawler.class.php on line 228.
Could you please check it again?
Matteo Catena
May 05, 2014 @ 10:53:53
Hi Hiruka,
I’m glad you are interested in this patch.
However, I couldn’t reproduce that fatal error using a clean PHPCrawl 0.82. Also, setMaxDepth() is correctly present in PHPCrawlerHTTPRequest after patching.
Can you please describe to me how the problem arose?
Hiruka
May 07, 2014 @ 14:58:53
I just used NetBeans to apply your patch and ran your example2 file to test the new function, and I got this error: Fatal error: Declaration of MyCrawler::handleDocumentInfo() must be compatible with that of PHPCrawler::handleDocumentInfo() in C:\AppServ\www\PHPtest\PHPCrawl_082\example2.php on line 11.
Hiruka
May 07, 2014 @ 15:00:05
Sorry, this is the correct one.
Fatal error: Call to undefined method PHPCrawlerHTTPRequest::setMaxDepth() in C:\AppServ\www\PHPtest\PHPCrawl_082\libs\PHPCrawler.class.php on line 228
Matteo Catena
May 07, 2014 @ 16:15:17
I’ve never used that NetBeans facility. Mind that the patch involves several classes: does NetBeans apply the patch to the whole class set?
Also, is your PHPCrawl a CLEAN 0.82 version?
Then, can you try to patch a clean copy of PHPCrawl 0.82 using the patch command, as described in my blog post? This is to check whether the problem is with my patch or with the way NetBeans applies it.
Thank you!
Hiruka
May 07, 2014 @ 18:00:34
Seems like there’s something wrong with using NetBeans to patch your code. I tried using Git and everything was fine. Maybe you should post a small note about that ^^.
Thank you very much for your great code! I’m looking forward to more of it :D.
Matteo Catena
May 08, 2014 @ 13:24:04
Thank you too, Hiruka. Please let me know if the patch is working for you and whether there are other issues.
Sylvain LAVIELLE
Jul 04, 2014 @ 14:55:38
Seems you have a little glitch in your patching command line, I think:
patch -p0 < PHPCrawl_082_maxcrawldepth.patch
should suit your patch link better.
Thanks for this patch !
Matteo Catena
Jul 04, 2014 @ 15:07:46
Thanks!
Sylvain LAVIELLE
Jul 04, 2014 @ 15:43:45
Although the patch applies nicely, it seems that it doesn’t work well for me.
$crawler->setMaxDepth(1) gets only one page (that may be normal if the root page is considered to be at level 1), but $crawler->setMaxDepth(2) crawls pages deeper than level 2: among my crawled pages I have some whose Referer-page is not the root page, and that should not be the case with $crawler->setMaxDepth(2).
Seems there is some problem in the PHPCrawlerMemoryURLCache->addURL method: $this->max_depth does not seem to have the correct value.
Matteo Catena
Jul 04, 2014 @ 15:46:26
Hi Sylvain,
can you please provide me with an example, so that I can try to reproduce the bug and fix it?
Matteo
Sylvain LAVIELLE
Jul 04, 2014 @ 16:24:44
Yep, just get the example script here http://phpcrawl.cuab.de/example.html and replace
$crawler->setTrafficLimit(1000 * 1024);
with
$crawler->setMaxDepth(1);
I get only one page:
Page requested: http://www.php.net (200)
Referer-page:
Content received: 29510 bytes
Summary:
Links followed: 1
Documents received: 1
Bytes received: 30100 bytes
Process runtime: 0.93943810462952 sec
So … Nice
But if I use
$crawler->setMaxDepth(2);
Then the crawling process goes on, and after a while I get some pages with a Referer-page that is not http://www.php.net
Example:
Page requested: http://www.php.net/archive/2012.php (200)
Referer-page: http://www.php.net/conferences/
Content received: 67279 bytes
For some pages, I had some PHP notices too:
PHP Notice: Undefined offset: 0 in /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/UrlCache/PHPCrawlerMemoryURLCache.class.php on line 93
PHP Stack trace:
PHP 1. {main}() /home/user1/scripts/crawler/test.php:0
PHP 2. PHPCrawler->go() /home/user1/scripts/crawler/test.php:61
PHP 3. PHPCrawler->startChildProcessLoop() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/PHPCrawler.class.php:356
PHP 4. PHPCrawler->processUrl() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/PHPCrawler.class.php:586
PHP 5. PHPCrawlerMemoryURLCache->addURLs() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/PHPCrawler.class.php:733
PHP 6. PHPCrawlerMemoryURLCache->addURL() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/UrlCache/PHPCrawlerMemoryURLCache.class.php:132
PHP Warning: Invalid argument supplied for foreach() in /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/UrlCache/PHPCrawlerMemoryURLCache.class.php on line 93
PHP Stack trace:
PHP 1. {main}() /home/user1/scripts/crawler/test.php:0
PHP 2. PHPCrawler->go() /home/user1/scripts/crawler/test.php:61
PHP 3. PHPCrawler->startChildProcessLoop() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/PHPCrawler.class.php:356
PHP 4. PHPCrawler->processUrl() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/PHPCrawler.class.php:586
PHP 5. PHPCrawlerMemoryURLCache->addURLs() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/PHPCrawler.class.php:733
PHP 6. PHPCrawlerMemoryURLCache->addURL() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/UrlCache/PHPCrawlerMemoryURLCache.class.php:132
Thanks
Sylvain LAVIELLE
Jul 04, 2014 @ 16:28:00
PHP version : PHP 5.5.9-1ubuntu4 (cli) (built: Apr 9 2014 17:11:57)
Matteo Catena
Jul 06, 2014 @ 16:25:55
Dear Sylvain,
thank you for your feedback. I fear that what you are experiencing is part of the standard PHPCrawl behaviour. In fact, http://www.php.net/conferences/ returns an HTTP 301, and by default PHPCrawl follows HTTP header redirects. To override this, call $crawler->setFollowRedirects(false);. This way, the patch works as expected. Can you please confirm this?
Anyway, thank you for spotting the ‘undefined offset’ bug. I’m going to fix it in the next days.
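To illustrate the fix Matteo describes, here is a sketch of the configuration with redirect-following disabled (the URL and depth value are illustrative):

```php
<?php
require_once 'PHPCrawl_082/libs/PHPCrawler.class.php';

$crawler = new PHPCrawler(); // or a subclass overriding handleDocumentInfo()
$crawler->setURL("http://www.php.net/");
// Disable following of HTTP 30x redirects, so redirect targets
// do not pull in pages beyond the intended depth.
$crawler->setFollowRedirects(false);
$crawler->setMaxDepth(2); // crawl levels 0 and 1
$crawler->go();
```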
Sylvain LAVIELLE
Jul 08, 2014 @ 09:37:14
Hello Matteo
You’re absolutely right: with $crawler->setFollowRedirects(false), $crawler->setMaxDepth(1) behaves correctly!
Thanks a lot.
Looking forward to your patch update for the ‘undefined offset’ bug.
Thanks again
Stanley
Sep 08, 2014 @ 08:49:45
I downloaded the latest version, PHPCrawl_082.zip (411.0 kB), from http://sourceforge.net/projects/phpcrawl/files/PHPCrawl/
and then downloaded PHPCrawl_082_maxcrawlingdepth_rev_2.patch.
■ uploaded to a Linux server and created a temp dir containing:
–
– PHPCrawl_082_maxcrawlingdepth_rev_2.patch
■ executed patch:
[temp]$ patch -p0 < PHPCrawl_082_maxcrawlingdepth_rev_2.patch
■ result:
patching file PHPCrawl_082/.buildpath
patching file PHPCrawl_082/example2.php
patching file PHPCrawl_082/example.php
Hunk #1 FAILED at 44.
1 out of 1 hunk FAILED — saving rejects to file PHPCrawl_082/example.php.rej
patching file PHPCrawl_082/libs/PHPCrawler.class.php
Hunk #1 FAILED at 9.
1 out of 1 hunk FAILED — saving rejects to file PHPCrawl_082/libs/PHPCrawler.class.php.rej
patching file PHPCrawl_082/libs/PHPCrawlerHTTPRequest.class.php
Hunk #1 FAILED at 176.
Hunk #2 FAILED at 907.
Hunk #3 FAILED at 962.
Hunk #4 FAILED at 1220.
4 out of 4 hunks FAILED — saving rejects to file PHPCrawl_082/libs/PHPCrawlerHTTPRequest.class.php.rej
patching file PHPCrawl_082/libs/PHPCrawlerLinkFinder.class.php
Hunk #1 FAILED at 68.
Hunk #2 FAILED at 121.
Hunk #3 FAILED at 244.
Hunk #4 FAILED at 278.
4 out of 4 hunks FAILED — saving rejects to file PHPCrawl_082/libs/PHPCrawlerLinkFinder.class.php.rej
patching file PHPCrawl_082/libs/PHPCrawlerRobotsTxtParser.class.php
Hunk #1 FAILED at 19.
Hunk #2 FAILED at 222.
2 out of 2 hunks FAILED — saving rejects to file PHPCrawl_082/libs/PHPCrawlerRobotsTxtParser.class.php.rej
patching file PHPCrawl_082/libs/PHPCrawlerURLDescriptor.class.php
Hunk #1 FAILED at 43.
Hunk #2 FAILED at 55.
2 out of 2 hunks FAILED — saving rejects to file PHPCrawl_082/libs/PHPCrawlerURLDescriptor.class.php.rej
patching file PHPCrawl_082/libs/UrlCache/PHPCrawlerMemoryURLCache.class.php
Hunk #1 FAILED at 7.
1 out of 1 hunk FAILED — saving rejects to file PHPCrawl_082/libs/UrlCache/PHPCrawlerMemoryURLCache.class.p hp.rej
patching file PHPCrawl_082/libs/UrlCache/PHPCrawlerSQLiteURLCache.class.php
Hunk #1 FAILED at 7.
Hunk #2 FAILED at 125.
Hunk #3 FAILED at 250.
Hunk #4 FAILED at 280.
4 out of 4 hunks FAILED — saving rejects to file PHPCrawl_082/libs/UrlCache/PHPCrawlerSQLiteURLCache.class. php.rej
patching file PHPCrawl_082/libs/UrlCache/PHPCrawlerURLCacheBase.class.php
Hunk #1 FAILED at 19.
Hunk #2 FAILED at 137.
2 out of 2 hunks FAILED — saving rejects to file PHPCrawl_082/libs/UrlCache/PHPCrawlerURLCacheBase.class.ph p.rej
patching file PHPCrawl_082/.project
patching file PHPCrawl_082/.settings/org.eclipse.php.core.prefs
Can you help me figure out which step fails? Thanks.
Matteo Catena
Sep 08, 2014 @ 09:48:00
Hi Stanley,
thank you for your comment. Can you please download the patch once again (rev 2.1 now)?
Then download and decompress PHPCrawl_082.
Put the patch in the parent directory of PHPCrawl_082.
From the parent directory, give:
patch -p1 -d PHPCrawl_082/ < PHPCrawl_082_maxcrawlingdepth_rev_2_1.patch
Let me know if this is working for you, please.
Stanley
Sep 08, 2014 @ 17:19:34
hi Matteo
first, thank you for your attention, but it still fails for me 🙁
$ patch -p1 -d PHPCrawl_082/ < PHPCrawl_082_maxcrawlingdepth_rev_2_1.patch
patching file .buildpath
patching file example2.php
patching file example.php
Hunk #1 FAILED at 44.
1 out of 1 hunk FAILED — saving rejects to file example.php.rej
patching file libs/PHPCrawler.class.php
Hunk #1 FAILED at 9.
1 out of 1 hunk FAILED — saving rejects to file libs/PHPCrawler.class.php.rej
patching file libs/PHPCrawlerHTTPRequest.class.php
Hunk #1 FAILED at 176.
Hunk #2 FAILED at 907.
Hunk #3 FAILED at 962.
Hunk #4 FAILED at 1220.
4 out of 4 hunks FAILED — saving rejects to file libs/PHPCrawlerHTTPRequest.class.php.rej
patching file libs/PHPCrawlerLinkFinder.class.php
Hunk #1 FAILED at 68.
Hunk #2 FAILED at 121.
Hunk #3 FAILED at 244.
Hunk #4 FAILED at 278.
4 out of 4 hunks FAILED — saving rejects to file libs/PHPCrawlerLinkFinder.class.php.rej
patching file libs/PHPCrawlerRobotsTxtParser.class.php
Hunk #1 FAILED at 19.
Hunk #2 FAILED at 222.
2 out of 2 hunks FAILED — saving rejects to file libs/PHPCrawlerRobotsTxtParser.class.php.rej
patching file libs/PHPCrawlerURLDescriptor.class.php
Hunk #1 FAILED at 43.
Hunk #2 FAILED at 55.
2 out of 2 hunks FAILED — saving rejects to file libs/PHPCrawlerURLDescriptor.class.php.rej
patching file libs/UrlCache/PHPCrawlerMemoryURLCache.class.php
Hunk #1 FAILED at 7.
1 out of 1 hunk FAILED — saving rejects to file libs/UrlCache/PHPCrawlerMemoryURLCache.class.php.rej
patching file libs/UrlCache/PHPCrawlerSQLiteURLCache.class.php
Hunk #1 FAILED at 7.
Hunk #2 FAILED at 125.
Hunk #3 FAILED at 250.
Hunk #4 FAILED at 280.
4 out of 4 hunks FAILED — saving rejects to file libs/UrlCache/PHPCrawlerSQLiteURLCache.class.php.rej
patching file libs/UrlCache/PHPCrawlerURLCacheBase.class.php
Hunk #1 FAILED at 19.
Hunk #2 FAILED at 137.
2 out of 2 hunks FAILED — saving rejects to file libs/UrlCache/PHPCrawlerURLCacheBase.class.php.rej
patching file .project
patching file resumable_example.php
Hunk #1 FAILED at 19.
Hunk #2 FAILED at 28.
2 out of 2 hunks FAILED — saving rejects to file resumable_example.php.rej
patching file .settings/org.eclipse.php.core.prefs
Matteo Catena
Sep 15, 2014 @ 16:00:07
Hi Stanley,
sorry for the late answer. Just to be sure, I re-uploaded the patch (now compressed as .tar.gz) and tried it on a friend’s laptop. It works fine for him.
Are you still experiencing the issue? In that case, I can think of just two differences in our setups:
– you are using PHPCrawl in its .zip format, while I’m using the .tar.gz
– the patch command version (mine is 2.6.1)
Let me know.
border
Oct 27, 2014 @ 04:06:06
With PhpStorm 8.0.1 the patch applies OK. You forgot your .settings file lol
border
Oct 28, 2014 @ 13:46:05
Patch tested and working.
Kieran Headley
Jan 21, 2015 @ 14:23:23
Thanks for the modification. Is there any way to use this to return the depth at which the current page being crawled was found? For example:
Home Page (Level 1)
  > News Feed (Level 2)
    > News Item 1 (Level 3)
    > News Item 2 (Level 3)
    > News Item 3 (Level 3)
  > About Us (Level 2)
  > Our Services (Level 2)
Matteo Catena
Jan 21, 2015 @ 15:12:03
Not currently, but it should be possible with a minor modification to the code.
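As a rough illustration of such a modification (entirely hypothetical — this is not part of the patch): if the depth value the patch tracks internally were also copied into each PHPCrawlerDocumentInfo under a field such as $info->depth, the document handler could simply report it. Both the field name and its availability are assumptions.

```php
<?php
require_once 'PHPCrawl_082/libs/PHPCrawler.class.php';

class DepthReportingCrawler extends PHPCrawler
{
  function handleDocumentInfo(PHPCrawlerDocumentInfo $info)
  {
    // HYPOTHETICAL: assumes the patch were extended to expose the
    // crawling depth of each document as $info->depth.
    echo $info->url . " found at level " . $info->depth . "\n";
  }
}
```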