07/04/2014

A patch to limit PHPCrawl crawling depth

PHPCrawl is a webcrawler/webspider-library written in PHP. It supports filters, limiters, cookie-handling, robots.txt-handling, multiprocessing and much more.

Unfortunately, PHPCrawl offers no way to limit its crawl depth. Here I propose a patch for the current version (0.82) that adds two methods to the PHPCrawler class: getMaxDepth and setMaxDepth.

The usage is intuitive:

$crawler = new PHPCrawler();
$crawler->setURL($startingURL);
$crawler->setMaxDepth($n);
$crawler->go();

The crawler will get pages from level 0 (the $startingURL) to level $n – 1.

By default, the crawling depth limit is set to PHPCrawler::UNLIMITED_CRAWLING_DEPTH = 0. This means that the crawler will get any web page, regardless of its depth from the starting URL.
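The depth semantics can be illustrated without PHPCrawl itself. What follows is only a minimal sketch, not the code in the patch: it walks a small in-memory link graph breadth-first and stops at the requested depth. The function name crawlWithDepthLimit, the UNLIMITED_DEPTH constant, and the sample graph are all hypothetical and exist only for this illustration.

```php
<?php
// Hypothetical stand-in for PHPCrawler::UNLIMITED_CRAWLING_DEPTH (0 means "no limit").
const UNLIMITED_DEPTH = 0;

/**
 * Breadth-first walk over an in-memory link graph with a depth limit.
 * $links maps a URL to the list of URLs it links to.
 * Returns URL => depth for every page that would be fetched.
 */
function crawlWithDepthLimit(array $links, string $startUrl, int $maxDepth = UNLIMITED_DEPTH): array
{
    $visited = [$startUrl => 0]; // the starting URL is level 0
    $queue = [$startUrl];

    while ($queue) {
        $url = array_shift($queue);
        $childDepth = $visited[$url] + 1;

        // With a limit of n, only levels 0 .. n-1 are fetched,
        // so links found on level n-1 are not followed.
        if ($maxDepth !== UNLIMITED_DEPTH && $childDepth >= $maxDepth) {
            continue;
        }

        foreach ($links[$url] ?? [] as $child) {
            if (!isset($visited[$child])) {
                $visited[$child] = $childDepth;
                $queue[] = $child;
            }
        }
    }

    return $visited;
}

// Made-up site structure, for illustration only.
$links = [
    'home'  => ['news', 'about'],
    'news'  => ['item1', 'item2'],
    'item1' => ['archive'],
];

print_r(crawlWithDepthLimit($links, 'home', 1)); // only 'home' (level 0)
print_r(crawlWithDepthLimit($links, 'home', 2)); // 'home' plus 'news' and 'about' (level 1)
print_r(crawlWithDepthLimit($links, 'home'));    // every reachable page
```

In this sketch a limit of 1 yields the starting page alone, which matches the "level 0 to level $n – 1" rule stated above.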

To apply the patch, download it and run:

patch -p1 -d PHPCrawl_082/ < PHPCrawl_082_maxcrawlingdepth_rev_2_1.patch

from the parent directory of the PHPCrawl source code.

Download: patch (revision 2)

UPDATE: In the comments section, Hiruka reported that this patch is not easy to apply with the NetBeans patch facility. I recommend using the patch command from the command line, or some other tool. For instance, Hiruka managed to apply the patch using git.

UPDATE 2 (28/07/2014): I have uploaded a patch revision. This should fix the small bug reported by Sylvain LAVIELLE in the comment section (‘undefined offset’).

UPDATE 3 (28/08/2014): some more bug fixes. I also introduced an (experimental) feature to set the HTTP Accept-Language header:

// set preferred language
$crawler->setAcceptLanguage("it, en;q=0.8");
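For readers unfamiliar with the header format: each comma-separated entry is a language range, optionally followed by a q weight between 0 and 1, and an entry without a weight counts as 1. The snippet below is only an illustrative sketch of how such a value can be read. The function parseAcceptLanguage is hypothetical, is not part of the patch or of PHPCrawl, and does not validate its input.

```php
<?php
/**
 * Parse an Accept-Language value into language => quality, highest quality first.
 * An entry without an explicit q parameter defaults to 1.0.
 * Hypothetical helper for illustration; it performs no strict validation.
 */
function parseAcceptLanguage(string $header): array
{
    $prefs = [];

    foreach (explode(',', $header) as $part) {
        $pieces = explode(';', trim($part));
        $lang = trim($pieces[0]);
        if ($lang === '') {
            continue;
        }

        $q = 1.0;
        foreach (array_slice($pieces, 1) as $param) {
            if (preg_match('/^\s*q\s*=\s*([0-9.]+)\s*$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }

        $prefs[$lang] = $q;
    }

    arsort($prefs); // sort by quality, descending, keeping the language keys
    return $prefs;
}

// "it" has no weight, so it counts as 1.0 and ranks before "en" at 0.8.
print_r(parseAcceptLanguage("it, en;q=0.8"));
```

So the value used above asks for Italian first and accepts English as a second choice.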

24 Comments

Hiruka
May 05, 2014 @ 06:11:17

First of all, I want to say thank you for your work on this patch!
But after I applied it to my code, I got this error:
Fatal error: Call to undefined method PHPCrawlerHTTPRequest::setMaxDepth() in file PHPCrawler.class.php. line 228.
Could you please check it again?

- Matteo Catena
  May 05, 2014 @ 10:53:53
  
  Hi Hiruka,
  I’m glad you are interested in this patch.
  However, I couldn’t reproduce that fatal error using a clean PHPCrawl 0.82. Also, setMaxDepth() is correctly present in PHPCrawlerHTTPRequest after patching.
  Can you please describe to me how the problem arose?
  
Hiruka
May 07, 2014 @ 14:58:53

I just used NetBeans to apply your patch and ran your example2 file to test the new function, and I got this error: Fatal error: Declaration of MyCrawler::handleDocumentInfo() must be compatible with that of PHPCrawler::handleDocumentInfo() in C:\AppServ\www\PHPtest\PHPCrawl_082\example2.php on line 11.

- Hiruka
  May 07, 2014 @ 15:00:05
  
  Sorry, this is the correct one.
  Fatal error: Call to undefined method PHPCrawlerHTTPRequest::setMaxDepth() in C:\AppServ\www\PHPtest\PHPCrawl_082\libs\PHPCrawler.class.php on line 228
  
- Matteo Catena
  May 07, 2014 @ 16:15:17
  
  I’ve never used that NetBeans facility. Keep in mind that the patch touches several classes. Does NetBeans apply the patch to the whole set of classes?
  Also, is your PHPCrawl a CLEAN 0.82 version?
  Then, can you try to patch a clean copy of PHPCrawl 0.82 using the patch command, as described in my blog post? This is to check whether the problem lies in my patch or in the way NetBeans applies it.
  
  Thank you!
  
  - Hiruka
    May 07, 2014 @ 18:00:34
    
    Seems like there’s something wrong with using NetBeans to apply your patch. I tried using Git and everything was fine. Maybe you should post a small note about that ^^.
    Thank you very much for your great code! I’m looking forward to more of it :D.
    
    - Matteo Catena
      May 08, 2014 @ 13:24:04
      
      Thank you too, Hiruka. Please let me know if the patch is working for you and if there are other issues.
      
Sylvain LAVIELLE
Jul 04, 2014 @ 14:55:38

Seems you have a little glitch in your patching command line I think …

patch -p0 < PHPCrawl_082_maxcrawldepth.patch

should work better with the patch you link to

Thanks for this patch !

- Matteo Catena
  Jul 04, 2014 @ 15:07:46
  
  Thanks!
  
Sylvain LAVIELLE
Jul 04, 2014 @ 15:43:45

However, even though the patch applies nicely, it doesn’t seem to work well for me.

$crawler->setMaxDepth(1) gets only one page (that may be normal if the root page is considered to be at level 1), but $crawler->setMaxDepth(2) crawls pages deeper than level 2: among my crawled pages, some have a Referer-page that is not the root page, which should not happen with $crawler->setMaxDepth(2).

There seems to be a problem in the PHPCrawlerMemoryURLCache->addURL method: $this->max_depth does not seem to have the correct value.

- Matteo Catena
  Jul 04, 2014 @ 15:46:26
  
  Hi Sylvain,
  
  can you please provide me with an example, so that I can try to reproduce the bug and fix it?
  
  Matteo
  
  - Sylvain LAVIELLE
    Jul 04, 2014 @ 16:24:44
    
    Yep, just get the example script here http://phpcrawl.cuab.de/example.html and replace
    
    $crawler->setTrafficLimit(1000 * 1024);
    
    with
    
    $crawler->setMaxDepth(1);
    
    I get only one page:
    
    Page requested: http://www.php.net (200)
    Referer-page:
    Content received: 29510 bytes
    
    Summary:
    Links followed: 1
    Documents received: 1
    Bytes received: 30100 bytes
    Process runtime: 0.93943810462952 sec
    
    So … Nice
    
    But if I use
    
    $crawler->setMaxDepth(2);
    
    then the crawling process goes on, and after a while I get some pages whose Referer-page is not http://www.php.net
    
    Example :
    
    Page requested: http://www.php.net/archive/2012.php (200)
    Referer-page: http://www.php.net/conferences/
    Content received: 67279 bytes
    
    For some pages, I got some PHP notices too:
    
    PHP Notice: Undefined offset: 0 in /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/UrlCache/PHPCrawlerMemoryURLCache.class.php on line 93
    PHP Stack trace:
    PHP 1. {main}() /home/user1/scripts/crawler/test.php:0
    PHP 2. PHPCrawler->go() /home/user1/scripts/crawler/test.php:61
    PHP 3. PHPCrawler->startChildProcessLoop() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/PHPCrawler.class.php:356
    PHP 4. PHPCrawler->processUrl() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/PHPCrawler.class.php:586
    PHP 5. PHPCrawlerMemoryURLCache->addURLs() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/PHPCrawler.class.php:733
    PHP 6. PHPCrawlerMemoryURLCache->addURL() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/UrlCache/PHPCrawlerMemoryURLCache.class.php:132
    PHP Warning: Invalid argument supplied for foreach() in /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/UrlCache/PHPCrawlerMemoryURLCache.class.php on line 93
    PHP Stack trace:
    PHP 1. {main}() /home/user1/scripts/crawler/test.php:0
    PHP 2. PHPCrawler->go() /home/user1/scripts/crawler/test.php:61
    PHP 3. PHPCrawler->startChildProcessLoop() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/PHPCrawler.class.php:356
    PHP 4. PHPCrawler->processUrl() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/PHPCrawler.class.php:586
    PHP 5. PHPCrawlerMemoryURLCache->addURLs() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/PHPCrawler.class.php:733
    PHP 6. PHPCrawlerMemoryURLCache->addURL() /home/user1/scripts/crawler/libs/PHPCrawl_082/libs/UrlCache/PHPCrawlerMemoryURLCache.class.php:132
    
    Thanks
    
  - Sylvain LAVIELLE
    Jul 04, 2014 @ 16:28:00
    
    PHP version : PHP 5.5.9-1ubuntu4 (cli) (built: Apr 9 2014 17:11:57)
    
  - Matteo Catena
    Jul 06, 2014 @ 16:25:55
    
    Dear Sylvain,
    
    thank you for your feedback. I fear that what you are experiencing is part of the standard PHPCrawl behaviour. In fact, http://www.php.net/conference returns an HTTP 301, and PHPCrawl follows HTTP header redirects by default. To override this, call $crawler->setFollowRedirects(false);. This way, the patch works as expected. Can you please confirm this?
    
    Anyway, thank you for spotting the ‘undefined offset’ bug. I’m going to fix it in the next few days.
    
  - Sylvain LAVIELLE
    Jul 08, 2014 @ 09:37:14
    
    Hello Matteo
    
    You’re absolutely right: with $crawler->setFollowRedirects(false), $crawler->setMaxDepth(1) behaves correctly!
    
    Thanks a lot.
    
    Looking forward to your patch update for the ‘undefined offset’ bug
    
    Thanks again
    
Stanley
Sep 08, 2014 @ 08:49:45

I downloaded the latest version, PHPCrawl_082.zip (411.0 kB), from http://sourceforge.net/projects/phpcrawl/files/PHPCrawl/

and then downloaded PHPCrawl_082_maxcrawlingdepth_rev_2.patch

■ uploaded them to a Linux server and created a temp dir

–
– PHPCrawl_082_maxcrawlingdepth_rev_2.patch

■ execute patch
[temp]$ patch -p0 < PHPCrawl_082_maxcrawlingdepth_rev_2.patch

■ result
patching file PHPCrawl_082/.buildpath
patching file PHPCrawl_082/example2.php
patching file PHPCrawl_082/example.php
Hunk #1 FAILED at 44.
1 out of 1 hunk FAILED — saving rejects to file PHPCrawl_082/example.php.rej
patching file PHPCrawl_082/libs/PHPCrawler.class.php
Hunk #1 FAILED at 9.
1 out of 1 hunk FAILED — saving rejects to file PHPCrawl_082/libs/PHPCrawler.class.php.rej
patching file PHPCrawl_082/libs/PHPCrawlerHTTPRequest.class.php
Hunk #1 FAILED at 176.
Hunk #2 FAILED at 907.
Hunk #3 FAILED at 962.
Hunk #4 FAILED at 1220.
4 out of 4 hunks FAILED — saving rejects to file PHPCrawl_082/libs/PHPCrawlerHTTPRequest.class.php.rej
patching file PHPCrawl_082/libs/PHPCrawlerLinkFinder.class.php
Hunk #1 FAILED at 68.
Hunk #2 FAILED at 121.
Hunk #3 FAILED at 244.
Hunk #4 FAILED at 278.
4 out of 4 hunks FAILED — saving rejects to file PHPCrawl_082/libs/PHPCrawlerLinkFinder.class.php.rej
patching file PHPCrawl_082/libs/PHPCrawlerRobotsTxtParser.class.php
Hunk #1 FAILED at 19.
Hunk #2 FAILED at 222.
2 out of 2 hunks FAILED — saving rejects to file PHPCrawl_082/libs/PHPCrawlerRobotsTxtParser.class.php.rej
patching file PHPCrawl_082/libs/PHPCrawlerURLDescriptor.class.php
Hunk #1 FAILED at 43.
Hunk #2 FAILED at 55.
2 out of 2 hunks FAILED — saving rejects to file PHPCrawl_082/libs/PHPCrawlerURLDescriptor.class.php.rej
patching file PHPCrawl_082/libs/UrlCache/PHPCrawlerMemoryURLCache.class.php
Hunk #1 FAILED at 7.
1 out of 1 hunk FAILED -- saving rejects to file PHPCrawl_082/libs/UrlCache/PHPCrawlerMemoryURLCache.class.php.rej
patching file PHPCrawl_082/libs/UrlCache/PHPCrawlerSQLiteURLCache.class.php
Hunk #1 FAILED at 7.
Hunk #2 FAILED at 125.
Hunk #3 FAILED at 250.
Hunk #4 FAILED at 280.
4 out of 4 hunks FAILED -- saving rejects to file PHPCrawl_082/libs/UrlCache/PHPCrawlerSQLiteURLCache.class.php.rej
patching file PHPCrawl_082/libs/UrlCache/PHPCrawlerURLCacheBase.class.php
Hunk #1 FAILED at 19.
Hunk #2 FAILED at 137.
2 out of 2 hunks FAILED -- saving rejects to file PHPCrawl_082/libs/UrlCache/PHPCrawlerURLCacheBase.class.php.rej
patching file PHPCrawl_082/.project
patching file PHPCrawl_082/.settings/org.eclipse.php.core.prefs

Can you help me find out which step failed? Thanks

- Matteo Catena
  Sep 08, 2014 @ 09:48:00
  
  Hi Stanley,
  
  thank you for your comment. Can you please download the patch once again (rev 2.1 now)?
  Then download and decompress PHPCrawl_082.
  Put the patch in the parent directory of PHPCrawl_082.
  From the parent directory, run:
  patch -p1 -d PHPCrawl_082/ < PHPCrawl_082_maxcrawlingdepth_rev_2_1.patch
  Let me know if this works for you, please.
  
  - Stanley
    Sep 08, 2014 @ 17:19:34
    
    hi Matteo
    
    first, thank you for your attention. But it still fails for me 🙁
    
    $ patch -p1 -d PHPCrawl_082/ < PHPCrawl_082_maxcrawlingdepth_rev_2_1.patch
    
    patching file .buildpath
    patching file example2.php
    patching file example.php
    Hunk #1 FAILED at 44.
    1 out of 1 hunk FAILED — saving rejects to file example.php.rej
    patching file libs/PHPCrawler.class.php
    Hunk #1 FAILED at 9.
    1 out of 1 hunk FAILED — saving rejects to file libs/PHPCrawler.class.php.rej
    patching file libs/PHPCrawlerHTTPRequest.class.php
    Hunk #1 FAILED at 176.
    Hunk #2 FAILED at 907.
    Hunk #3 FAILED at 962.
    Hunk #4 FAILED at 1220.
    4 out of 4 hunks FAILED — saving rejects to file libs/PHPCrawlerHTTPRequest.class.php.rej
    patching file libs/PHPCrawlerLinkFinder.class.php
    Hunk #1 FAILED at 68.
    Hunk #2 FAILED at 121.
    Hunk #3 FAILED at 244.
    Hunk #4 FAILED at 278.
    4 out of 4 hunks FAILED — saving rejects to file libs/PHPCrawlerLinkFinder.class.php.rej
    patching file libs/PHPCrawlerRobotsTxtParser.class.php
    Hunk #1 FAILED at 19.
    Hunk #2 FAILED at 222.
    2 out of 2 hunks FAILED — saving rejects to file libs/PHPCrawlerRobotsTxtParser.class.php.rej
    patching file libs/PHPCrawlerURLDescriptor.class.php
    Hunk #1 FAILED at 43.
    Hunk #2 FAILED at 55.
    2 out of 2 hunks FAILED — saving rejects to file libs/PHPCrawlerURLDescriptor.class.php.rej
    patching file libs/UrlCache/PHPCrawlerMemoryURLCache.class.php
    Hunk #1 FAILED at 7.
    1 out of 1 hunk FAILED — saving rejects to file libs/UrlCache/PHPCrawlerMemoryURLCache.class.php.rej
    patching file libs/UrlCache/PHPCrawlerSQLiteURLCache.class.php
    Hunk #1 FAILED at 7.
    Hunk #2 FAILED at 125.
    Hunk #3 FAILED at 250.
    Hunk #4 FAILED at 280.
    4 out of 4 hunks FAILED — saving rejects to file libs/UrlCache/PHPCrawlerSQLiteURLCache.class.php.rej
    patching file libs/UrlCache/PHPCrawlerURLCacheBase.class.php
    Hunk #1 FAILED at 19.
    Hunk #2 FAILED at 137.
    2 out of 2 hunks FAILED — saving rejects to file libs/UrlCache/PHPCrawlerURLCacheBase.class.php.rej
    patching file .project
    patching file resumable_example.php
    Hunk #1 FAILED at 19.
    Hunk #2 FAILED at 28.
    2 out of 2 hunks FAILED — saving rejects to file resumable_example.php.rej
    patching file .settings/org.eclipse.php.core.prefs
    
    - Matteo Catena
      Sep 15, 2014 @ 16:00:07
      
      Hi Stanley,
      
      sorry for the late answer. Just to be sure, I re-uploaded the patch (now compressed as .tar.gz) and tried it on a friend’s laptop. It works fine for him.
      
      Are you still experiencing the issue? If so, I can think of just two differences between our setups:
      – you are using PHPCrawl in its .zip format, while I’m using its .tar.gz
      – the patch command version (mine is 2.6.1)
      
      Let me know.
      
border
Oct 27, 2014 @ 04:06:06

With PhpStorm 8.0.1 the patch applies OK. You forgot to leave out your .settings file, lol

border
Oct 28, 2014 @ 13:46:05

patch tested and working

Kieran Headley
Jan 21, 2015 @ 14:23:23

Thanks for the modification. Is there any way to use this to return the depth at which the page currently being crawled was found?

Home Page (Level 1)
> News Feed (Level 2)
> News Item 1 (Level 3)
> News Item 2 (Level 3)
> News Item 3 (Level 3)
> About Us (Level 2)
> Our Services (Level 2)

- Matteo Catena
  Jan 21, 2015 @ 15:12:03
  
  Not currently, but it should be possible with a minor modification to the code.
  
A patch to limit PHPCrawl crawling depth – UPDATE
Jan 28, 2015 @ 13:47:12

[…] « A patch to limit PHPCrawl crawling depth […]


nicecode.eu
