Nutch/it

Nutch, Hadoop, Lucene

Lucene, originally created by Doug Cutting, is an open source project for indexing and searching documents. Lucene uses inverted indexes. Nutch, a subproject of Lucene, is a search engine. Nutch is roughly composed of the following parts: a fetcher, a parser, and an indexer. Nutch uses Hadoop, which implements Google's map/reduce computing paradigm and a distributed file system (DFS).
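To make the idea of an inverted index concrete, here is a minimal sketch in Python. It is not Lucene's implementation or API: the function names `build_inverted_index` and `search`, the whitespace tokenizer, and the sample documents are all assumptions chosen for illustration. It only shows the core idea of mapping each term to the documents that contain it.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of ids of documents containing it.

    `docs` is a dict of {doc_id: text}. Tokenization is a plain whitespace
    split, which is an assumption made to keep the sketch short.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the sorted ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Intersect the posting lists of all query terms.
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Nutch is a search engine",
    2: "Lucene is a search library",
    3: "Hadoop runs map reduce jobs",
}
index = build_inverted_index(docs)
```

With this index, `search(index, "search engine")` looks up the posting list of each term and intersects them, so no document text has to be scanned at query time. A real engine would add proper tokenization, ranking, and on-disk storage.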


Description

Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
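The map/reduce paradigm described above can be sketched with a word count in plain Python. This is a single-process illustration, not the Hadoop API: the names `map_phase`, `shuffle`, `reduce_phase`, and `run_job`, as well as the input fragments, are assumptions made for this example. In a real cluster each fragment would be handled by a separate task on some node, and a failed task would simply be run again on its fragment.

```python
from collections import defaultdict


def map_phase(fragment):
    """Emit a (word, 1) pair for every word in one fragment of the input.

    The function depends only on its input, so it can be executed again
    on the same fragment (for example after a node failure) with the
    same result.
    """
    for word in fragment.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as done between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def run_job(fragments):
    """Run map over every fragment, group the output by key, then reduce each group."""
    pairs = []
    for fragment in fragments:
        pairs.extend(map_phase(fragment))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input, already split into small fragments of work.
fragments = [
    "nutch uses hadoop",
    "hadoop implements map reduce",
    "nutch uses lucene",
]
counts = run_job(fragments)
```

Because each map call reads only its own fragment and each reduce call reads only the values for its own key, the work can be spread over many machines without the tasks having to coordinate with each other.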

The intent is to scale Hadoop up to handle thousands of computers. Hadoop has been tested on clusters of 600 nodes.

Hadoop is a Lucene sub-project that contains the distributed computing platform that was formerly a part of Nutch. This includes the Hadoop Distributed Filesystem (HDFS) and an implementation of map/reduce.


Hadoop based on Nutch

Nutch is a project to develop an open source search engine. Nutch is supported by the Apache Software Foundation and has been a subproject of Lucene since 2005.

Highlights

  • Nutch includes a high-performance multithreaded crawler
  • Nutch parses and indexes many document formats out of the box
  • Nutch uses a distributed computing platform called Hadoop, an open source implementation of MapReduce. This makes it easy to deploy a Nutch solution across a large number of servers. Furthermore, Hadoop can now run natively on Amazon S3
  • Webpages are stored in a Lucene index, allowing for high-performance retrieval
  • Nutch is highly customizable; one can extend it by creating plugins (several plugins are already available)
  • Nutch is open source and has a very healthy and friendly community
  • Nutch is coded in Java, and thus runs on Windows, OS X, and Linux

External links

  • Wikipedia article: Lucene
  • Wikipedia article: Nutch
  • Wikipedia article: Hadoop

Retrieved from "http://search.wikia.com/wiki/Nutch/it"

This page was last modified on 28 December 2007, at 20:48. GFDL