GrubNG Protocol

Server functions

  • Checks robots.txt on the sites to be crawled.
  • Prepares a workunit of URLs for the client (currently 250 URLs per workunit). A workunit cannot contain more than one link to the same page.
  • Checks the arc.gz files uploaded by the client: verifies the number and order of the links and the correctness of the arc file.

More information about workunits can be found in the article Grub Workunit.
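
The server-side steps above can be sketched roughly as follows. This is only an illustration, not the actual Grub server code: the function names (is_allowed, build_workunits, check_upload) are hypothetical, the "GrubNG" robots.txt token is an assumption, and the upload check assumes that the client may skip URLs that gave no response but must keep the remaining ones in workunit order.

```python
from urllib.robotparser import RobotFileParser

WORKUNIT_SIZE = 250            # current number of URLs per workunit
ROBOTS_USER_AGENT = "GrubNG"   # assumption: token matched against robots.txt


def is_allowed(url, robots_txt, user_agent=ROBOTS_USER_AGENT):
    """Return True if the given robots.txt text permits crawling url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def build_workunits(urls, size=WORKUNIT_SIZE):
    """Drop duplicate URLs, then split the rest into workunits of `size`."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return [unique[i:i + size] for i in range(0, len(unique), size)]


def check_upload(workunit, returned_urls):
    """Check that the returned URLs are an in-order subset of the workunit.

    Each membership test advances the shared iterator, so a URL that is
    unknown, out of order, or repeated makes the check fail.
    """
    remaining = iter(workunit)
    return all(url in remaining for url in returned_urls)
```

For example, 600 distinct URLs would yield three workunits of 250, 250 and 100 URLs. Checking the correctness of the arc file itself is not covered by this sketch.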


Client functions

  • Downloads a prepared workunit from the server.
  • Crawls the URLs given by the workunit. For each URL in a workunit, sends one request to the given host. If a response is received, writes the response text to an .arc file. The client does not follow any links on the crawled pages and does not follow any redirects.
  • After crawling, compresses the .arc file and sends it to the server.
  • There are a few differences between the available clients. For the C client, workunits need to be manually downloaded from the server (for example using wget). The C# client can run a few crawlers simultaneously.

You can find information about the available clients in the article Grub Clients.

Additional information

Current User-Agent name for crawler: GrubNG 20080128

Retrieved from "http://search.wikia.com/wiki/GrubNG_Protocol"

This page was last modified 23:28, 9 July 2008. GFDL