WebCrawler

Author: RocketBoots
Last Updated: November 6, 2010 4:55 AM
Version: 1.0
Views: 12,901
Downloads: 934
License: GPL (GNU General Public License), Version 2

Description:

Here is a useful extract from the RocketBoots com.rocketboots.util.web library that we've decided to open source. It is the core of a multi-threaded web spider that we have used to index large numbers of sites.

To use it, write a cfc that implements the IWebVisitor interface:


/**
 * Tell the crawler if we are interested in looking at a URL
 *
 * @param    url    fully qualified URL
 * @returns  true if you would like to process this URL
 */
boolean match(url)

/**
 * Give the crawler information about our cached version of the URL, if any
 *
 * @param    url    fully qualified URL
 * @returns  structure with two optional keys: eTag and lastModified. If one
 *           or both are specified, they are used to qualify our request so
 *           that the web server only returns content if it has been updated.
 */
struct cacheInfo(url)

/**
 * Process the contents of a URL where match(url) = true and our cache
 * (if any) was out of date
 *
 * @param    url      fully qualified URL
 * @param    headers  HTTP headers
 * @param    content  URL content
 * @returns  array of additional URLs to process
 */
array process(url, headers, content)
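
The eTag and lastModified values presumably become a conditional request (If-None-Match / If-Modified-Since headers), so the server can answer 304 Not Modified instead of resending the page; that reading is inferred from the description above. A cacheInfo() that replays validators saved on an earlier crawl might look like this sketch, where getSavedPage() is a hypothetical helper:

<cffunction name="cacheInfo" returntype="struct" output="false">
    <cfargument name="url" type="string" required="true">
    <!--- getSavedPage() is a hypothetical helper returning a struct with
          eTag and lastModified captured when the URL was last fetched,
          or an empty struct if we have never seen it --->
    <cfset var info = structNew()>
    <cfset var cached = getSavedPage(arguments.url)>
    <cfif not structIsEmpty(cached)>
        <cfset info.eTag = cached.eTag>
        <cfset info.lastModified = cached.lastModified>
    </cfif>
    <cfreturn info>
</cffunction>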

The main work your implementation needs to do is extract the URLs from the page passed to process() - regular expressions make it relatively easy.
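
To make that concrete, here is a minimal visitor sketch. The implements path, the site URL, and the component name are assumptions; it skips local caching (see the cacheInfo sketch above for that case) and harvests href attributes with a regular expression:

<!--- ExampleVisitor.cfc: a minimal, hypothetical visitor --->
<cfcomponent implements="IWebVisitor" output="false">

    <cffunction name="match" returntype="boolean" output="false">
        <cfargument name="url" type="string" required="true">
        <!--- only follow links that stay on the site being indexed --->
        <cfreturn findNoCase("http://www.example.com", arguments.url) eq 1>
    </cffunction>

    <cffunction name="cacheInfo" returntype="struct" output="false">
        <cfargument name="url" type="string" required="true">
        <!--- no local cache, so always fetch fresh content --->
        <cfreturn structNew()>
    </cffunction>

    <cffunction name="process" returntype="array" output="false">
        <cfargument name="url" type="string" required="true">
        <cfargument name="headers" type="struct" required="true">
        <cfargument name="content" type="string" required="true">
        <cfset var links = arrayNew(1)>
        <cfset var hit = "">
        <cfset var start = 1>
        <!--- collect every href="..." value on the page --->
        <cfloop condition="true">
            <cfset hit = reFindNoCase("href=""([^""]+)""", arguments.content, start, true)>
            <cfif hit.pos[1] eq 0>
                <cfbreak>
            </cfif>
            <cfset arrayAppend(links, mid(arguments.content, hit.pos[2], hit.len[2]))>
            <cfset start = hit.pos[1] + hit.len[1]>
        </cfloop>
        <cfreturn links>
    </cffunction>

</cfcomponent>

A real visitor would also resolve relative links against arguments.url before returning them, since the crawler expects fully qualified URLs.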

Then create an instance of WebCrawler, pass an instance of your cfc to it and start the crawler with some seed URLs:

// assumed component path, based on the package name mentioned above
wc = createObject("component", "com.rocketboots.util.web.WebCrawler");
wc.setVisitor(myVisitor);
wc.crawl(["someurl","someotherurl"]);

The crawler will keep calling your visitor until it runs out of URLs to process. Your visitor can call a database or do whatever it likes with the visited pages.
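
For instance, an indexing visitor could persist each fetched page from inside process() before returning the extracted links; the datasource and table here are hypothetical:

<!--- inside process(): store the page before returning the links.
      "crawlerDB" and the page table are hypothetical. --->
<cfquery datasource="crawlerDB">
    INSERT INTO page (url, body, fetched)
    VALUES (
        <cfqueryparam value="#arguments.url#" cfsqltype="cf_sql_varchar">,
        <cfqueryparam value="#arguments.content#" cfsqltype="cf_sql_longvarchar">,
        <cfqueryparam value="#now()#" cfsqltype="cf_sql_timestamp">
    )
</cfquery>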

Requirements:

Tested on CF8.01