Tech/Open Index

The Search Wikia and Visvo indexes are currently open and can be queried through web 2.0 clients. Here is some basic information on querying the indexes directly (i.e., not through a browser). Both indexes follow the same parameter structure.

URLs

Example

Design

Queries are passed to the index servers via HTTP GET requests. In principle, any HTTP client (or even telnet) can be used to send such a query. For example, you can copy the following query URL into a browser's address field and submit it:

http://search.isc.swlabs.org/nutchsearch?query=obscure%20innocuous&hitsPerPage=1

The response will look something like this:

 <results>
  <search>
   <query>obscure innocuous</query>
   <queryString>obscure+innocuous</queryString>
   <numberOfHits>123</numberOfHits>
   ...
  </search>
  <documents>
   <document indexDocumentNo="343740" indexNo="2">
    <summary>
     [One-line excerpt from the site...]
    </summary>
    <fields>
     [analysis of the page at this URL...]
    </fields>
   </document>
   [further search results...]
  </documents>
 </results>
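
The query URL above can also be built and sent from a program instead of a browser. The following is a minimal Python sketch, not the project's own client: the helper names build_query_url and fetch are made up for illustration, and only the endpoint and the two parameters come from the example above. Whether the server is reachable is not assumed, so fetch is defined but not called.

```python
from urllib.parse import urlencode, quote
from urllib.request import urlopen

# Endpoint taken from the example query above.
BASE_URL = "http://search.isc.swlabs.org/nutchsearch"

def build_query_url(query, hits_per_page=1, base_url=BASE_URL):
    """Return a GET URL for the index (hypothetical helper).

    quote_via=quote encodes spaces as %20, as in the example URL.
    """
    params = {"query": query, "hitsPerPage": hits_per_page}
    return base_url + "?" + urlencode(params, quote_via=quote)

def fetch(url, timeout=10):
    """Send the GET request and return the raw response body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()

url = build_query_url("obscure innocuous")
print(url)
```

Building the URL with urlencode rather than by string concatenation keeps the query correctly encoded when it contains characters such as "&" or "=".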
 

The results are designed to be consumed by web 2.0 clients, that is, by programs rather than by people. By default, the results are formatted as XML. With a suitable script, a web client can transform the XML into HTML, which is then used to present the results to the user.
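
As a rough illustration of that XML-to-HTML step, here is a small Python sketch. It reads only the elements that appear in the example response (query, numberOfHits, document, summary); the sample summary text and the function name xml_to_html are invented for this example, and the elided parts of the real response are left out rather than guessed.

```python
import xml.etree.ElementTree as ET
from html import escape

# Sample modelled on the example response; the summary text is made up.
SAMPLE_XML = """<results>
 <search>
  <query>obscure innocuous</query>
  <queryString>obscure+innocuous</queryString>
  <numberOfHits>123</numberOfHits>
 </search>
 <documents>
  <document indexDocumentNo="343740" indexNo="2">
   <summary>A made-up excerpt &amp; nothing more</summary>
  </document>
 </documents>
</results>"""

def xml_to_html(xml_text):
    """Turn an XML result document into a simple HTML fragment."""
    root = ET.fromstring(xml_text)
    query = root.findtext("search/query", default="")
    hits = root.findtext("search/numberOfHits", default="0")
    items = []
    for doc in root.findall("documents/document"):
        summary = (doc.findtext("summary") or "").strip()
        # Escape the text so it cannot inject markup into the page.
        items.append("<li>%s</li>" % escape(summary))
    header = "<p>%s hits for %s</p>" % (escape(hits), escape(query))
    return header + "<ul>" + "".join(items) + "</ul>"

page = xml_to_html(SAMPLE_XML)
print(page)
```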

If the "type" parameter in the query is set to "json", the results are formatted as JavaScript data in JavaScript Object Notation (JSON), wrapped in a function call. Here is the example query from above with the added parameter "type=json":

http://search.isc.swlabs.org/nutchsearch?query=obscure%20innocuous&type=json&hitsPerPage=1

The response will now look something like this:

 processJSON({
	"search": {
		"query": "obscure innocuous",
		"queryString": "obscure+innocuous",
		"numberOfHits": 123,
		[more header attributes...],
		"documents": [
			{
				"indexNo": 2,
				"indexDocumentNo": 343740,
				"summary": "[...]",
				"fields": {
					"url": ["http://www...../.../....html"],
					"title": ["..."],
					[other result attributes...]
				}
			}
		]
	}
 }
 )
 

To use this, a search engine's results page can simply evaluate the response string shown above. For this to work, the processJSON function must be defined. The results page loads the function from a JavaScript library, a static .js file served by the search engine server.

For example, the Wikia Search results page loads the JavaScript library http://re.search.wikia.com/search-a3.js, which defines, among other things, the required function processJSON.
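
A client that is not a browser cannot evaluate the function call, but it can strip the processJSON(...) wrapper and parse what is left as ordinary JSON. The Python sketch below shows one way to do that. The helper name unwrap is hypothetical, and the sample values (summary, url, title) are placeholders, not real index data.

```python
import json
import re

# Sample shaped like the example response; all values are placeholders.
SAMPLE_RESPONSE = '''processJSON({
  "search": {
    "query": "obscure innocuous",
    "numberOfHits": 123,
    "documents": [
      {"indexNo": 2, "indexDocumentNo": 343740, "summary": "made-up text",
       "fields": {"url": ["http://example.org/"], "title": ["Example"]}}
    ]
  }
})'''

# Matches processJSON( ... ) with optional surrounding whitespace
# and an optional trailing semicolon.
WRAPPER = re.compile(r"^\s*processJSON\s*\((.*)\)\s*;?\s*$", re.DOTALL)

def unwrap(response_text):
    """Strip the function-call wrapper and parse the JSON payload."""
    match = WRAPPER.match(response_text)
    if match is None:
        raise ValueError("response is not wrapped in processJSON(...)")
    return json.loads(match.group(1))

data = unwrap(SAMPLE_RESPONSE)
print(data["search"]["numberOfHits"])
```

Parsing the payload with a JSON parser, rather than evaluating it, also avoids running whatever code the server happens to send.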

Parameters

  • query = any URL-encoded query
  • hitsPerSite = the maximum number of results to return per site; 1 means no duplicates. Results are deduplicated by hostname, not by actual content, although that might change in the future.
  • lang = two-letter language code; currently not implemented
  • hitsPerPage = the number of results to return per page; 10 by default
  • start = the zero-based number of the first result to return. For example, with 10 results per page, page 2 starts at 10.
  • type = the format of the returned results, either json or xml; xml by default. We will probably add further types in the future.
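
The relationship between page numbers and the start parameter can be sketched in Python as follows. The function page_params is a hypothetical helper; the parameter names and defaults follow the list above, and lang is left out because it is not implemented.

```python
from urllib.parse import urlencode

def page_params(query, page, hits_per_page=10, result_type="xml", hits_per_site=1):
    """Build the query parameters for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if result_type not in ("xml", "json"):
        raise ValueError("type must be 'xml' or 'json'")
    return {
        "query": query,
        "hitsPerSite": hits_per_site,
        "hitsPerPage": hits_per_page,
        # Results are numbered from 0, so page 2 of 10 starts at 10.
        "start": (page - 1) * hits_per_page,
        "type": result_type,
    }

params = page_params("obscure innocuous", page=2)
print(params["start"])
print(urlencode(params))
```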

Search Results

The returned search results contain the following information:

  • query = the display value of the query
  • queryString = the URL-encoded value of the query
  • numberOfHits = the total number of search results for the query
  • lang = language
  • dedupField = the index field used to remove duplicate results. When duplicates are removed, the highest-scoring result is returned.
  • hitsPerDup = the number of duplicates to allow
  • reverse = not implemented
  • start = the starting search result number
  • end = the ending search result number (not inclusive)
  • documents = the actual search results. Each result has an index number, an index document number, a summary, and index fields.
    • indexNo = the index number of the search result. Our indexes are split into shards, with pieces of the index stored on multiple machines; this value identifies the specific index or shard the result is from.
    • indexDocumentNo = the document number within the index shard for the given search result. Once you know the shard (indexNo), you also need the document within that shard (indexDocumentNo).
    • summary = a text summary of the search result
    • fields = the index fields
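
To show how these values fit together, here is a short Python sketch that works on an already parsed JSON result. It assumes that start, end and numberOfHits appear as integers in the "search" object; the example response elides those header attributes, so this is an assumption. Both function names are hypothetical.

```python
def describe_range(search):
    """Return a display string such as 'Results 11-20 of 123'."""
    start = search["start"]
    end = search["end"]  # not inclusive, so it is also the last 1-based number
    total = search["numberOfHits"]
    if end <= start:
        return "No results"
    return "Results %d-%d of %d" % (start + 1, end, total)

def document_key(document):
    """Identify a result by its shard and its document number in that shard."""
    return (document["indexNo"], document["indexDocumentNo"])

header = {"start": 10, "end": 20, "numberOfHits": 123}
print(describe_range(header))
print(document_key({"indexNo": 2, "indexDocumentNo": 343740}))
```

Because indexDocumentNo is only meaningful within one shard, the pair of both numbers is used as the key rather than either number alone.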

Retrieved from "http://search.wikia.com/wiki/Tech/Open_Index"

This page was last modified 04:44, 10 July 2008. GFDL