GrubNG Protocol
Server functions
- Checks robots.txt on the sites to be crawled.
- Prepares a workunit of URLs for the client (currently 250 URLs per workunit). A workunit cannot contain more than one link to the same page.
- Checks the arc.gz files uploaded by the client: verifies the number and order of the links and the correctness of the arc file.
More information about workunits can be found in the article Grub Workunit.
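The server-side steps above can be sketched in Python as follows. This is only an illustration and not the actual GrubNG server code: the function names, the "GrubNG" robots.txt token, and the rule that an upload must list its URLs as an in-order subset of the workunit are all assumptions made for this example. Only the limit of 250 URLs per workunit and the no-duplicates rule come from the list above.

```python
from urllib.robotparser import RobotFileParser

WORKUNIT_SIZE = 250        # URLs per workunit, as stated in this article
ROBOTS_AGENT = "GrubNG"    # assumed robots.txt token, not confirmed by the article


def allowed_by_robots(url, robots_lines, agent=ROBOTS_AGENT):
    """Return True if the given robots.txt lines allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse an already downloaded robots.txt
    return parser.can_fetch(agent, url)


def build_workunits(urls, size=WORKUNIT_SIZE):
    """Drop duplicate URLs (keeping first occurrences) and split into workunits."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return [unique[i:i + size] for i in range(0, len(unique), size)]


def upload_matches_workunit(workunit_urls, uploaded_urls):
    """Assumed check: uploaded URLs appear in workunit order, none extra.

    URLs that produced no response may be missing from the upload, so the
    upload is accepted if it is an in-order subsequence of the workunit.
    """
    remaining = iter(workunit_urls)
    # `url in remaining` advances the iterator until it finds `url`.
    return all(url in remaining for url in uploaded_urls)
```

For example, build_workunits applied to 600 distinct URLs would yield three workunits of 250, 250 and 100 URLs. Checking the correctness of the arc file itself is left out, because the article does not describe that check.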
Client functions
- Downloads a prepared workunit from the server.
- Crawls the URLs given by the workunit. For each URL in a workunit, sends one request to the given host. If a response is received, writes the response text in an .arc file. The client does not follow any links on the crawled pages and does not follow any redirects.
- After crawling, compresses the .arc file and sends it to the server.
- There are a few differences between the available clients. For the C client, workunits need to be manually downloaded from the server (for example using wget). The C# client can run a few crawlers simultaneously.
You can find information about the available clients in the article Grub Clients.
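The client loop described above (one request per URL, no redirects followed, responses collected and compressed) could look roughly like the Python sketch below. It is not taken from any of the Grub clients. The function names are hypothetical, and the record layout (a "URL status length" line followed by the body) is a simplified placeholder, not the real ARC format. The User-Agent value is the one given in this article.

```python
import gzip
import http.client
from urllib.parse import urlsplit

USER_AGENT = "GrubNG 20080128"  # User-Agent string quoted in this article


def fetch_once(url, timeout=10):
    """Send exactly one GET request; return (status, body) or None on failure.

    http.client never follows redirects on its own, so a 3xx response is
    returned to the caller like any other response.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": USER_AGENT})
        response = conn.getresponse()
        return response.status, response.read()
    except (OSError, http.client.HTTPException):
        return None  # no response received: nothing is written for this URL
    finally:
        conn.close()


def crawl_workunit(urls, fetch=fetch_once):
    """Fetch every URL once, in order, and return the gzip-compressed result."""
    records = []
    for url in urls:
        result = fetch(url)
        if result is None:
            continue
        status, body = result
        # Placeholder record header; the real ARC header has a different layout.
        header = f"{url} {status} {len(body)}\n".encode("utf-8")
        records.append(header + body + b"\n")
    return gzip.compress(b"".join(records))
```

Passing the fetch function as a parameter keeps the loop testable without network access; a dictionary's get method, for instance, can stand in for fetch_once. Uploading the compressed data to the server is omitted, since the article does not specify the upload interface.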
Additional information
Current User-Agent name for the crawler: GrubNG 20080128