Tech/Scoring and Configuration

Please feel free to fix the formatting of this page as I am new to mediawiki

This page is just getting started. It will hold information about the Nutch setup and configuration; for now it covers only the basics.

The current Wikia search engine is an implementation of Nutch, an open source Apache project. If you are not familiar with Nutch, some of the information below will probably not make sense. Here are the configuration changes we have made to Nutch:

  • We use only the prefix and suffix URL filters. Our prefix filter is set to accept only http://, and our suffix filter weeds out things like images, PDFs, scripts, etc. Here is our current plugin.includes config:
protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic).  
[Does the dot at the end of the previous line belong there? Can you somehow make that expression need less width? ]
Note that we parse only HTML and text, nothing else. Scoring is still OPIC-based.
  • The first release of the search didn't index inbound anchor text, which we found had a very detrimental effect on the quality of search results. On Jan. 10-11 we reindexed all of the current shards, and on Jan. 12 we were merging them and pushing them out. They should be live by Jan. 13.
  • db.score.link.internal is set to 0. We give no value to internal links, only to external links. We may give value to internal links in the future, but first we need a way to score them without letting them override the scores from external links.
  • generate.max.per.host is set to 500, so we fetch at most 500 pages per host. For some of the bigger, high-quality domains such as Wikipedia, this restriction is not in effect.
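
For readers who want to try the settings listed above, here is a rough sketch of how they might look as overrides in Nutch's conf/nutch-site.xml. The three property names and their values are the ones given on this page; the file layout is the usual Nutch override format, and this is not a copy of our actual file. The plugin.includes value is written without the trailing dot, on the assumption that the dot is only sentence punctuation. The exemption for large domains such as Wikipedia is not shown, since this page does not say how it is configured.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml overrides; values are taken from the list above. -->
<configuration>

  <!-- Only HTTP, prefix/suffix URL filters, text and HTML parsing, OPIC scoring.
       Trailing dot omitted (assumed to be punctuation, not part of the regex). -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Internal links contribute nothing to a page's score. -->
  <property>
    <name>db.score.link.internal</name>
    <value>0</value>
  </property>

  <!-- Cap on the number of pages generated for fetching per host. -->
  <property>
    <name>generate.max.per.host</name>
    <value>500</value>
  </property>

</configuration>
```

Properties set in nutch-site.xml take precedence over the defaults in nutch-default.xml, so only the values that differ from the defaults need to appear here.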



Retrieved from "http://search.wikia.com/wiki/Tech/Scoring_and_Configuration"

This page was last modified 19:48, 14 January 2008. GFDL