Forum:The Plan
This is a plan for implementing a distributed, community-driven search engine. The plan starts with a stated problem and covers how the problem will be solved, who the competitors are, how the technology will be developed and deployed, and how the product will gain distribution.
The Problem
It has been proposed that search is currently "broken". Stating exactly what is broken with search is the first step to solving it. The following are known issues with current search technologies.
Search Context
Search engines typically don't understand the context of your search terms. For example, "Windows", the operating system, is not the same as "windows", those pieces of glass in your house. Currently the only way to work around this issue is to manually add new key terms.
Search engines are not capable of adding more context in an interactive way. For example, a human might ask "Did you mean a pane of glass?", whereas a search engine will just give you the most relevant results it has for a particular key term.
A curious paradox is that it often seems that the more obvious and well known the answer to a question, the more difficult it is to find by searching. As a rule, any question that might reasonably be expected to be addressed by a Wikipedia article is better pursued there than by a global Web search, and often leads directly to relevant links.
Relevancy
Relevancy is defined as something having significant and demonstrable bearing on the matter at hand. Most search engines have to be coaxed into providing relevant results, or lack the information to provide relevant results.
- Search engines typically weigh result sets using a combination of link weighting, such as Google's PageRank, and a few other simple text-matching algorithms. As a result, pages with more inbound links are returned ahead of pages with more relevant content but fewer links. This creates SEO problems for legitimate sites.
- High relevancy pages for a particular high-demand search term can be faked by evil-doers. This creates relevancy problems for the search engine.
- Search engines typically do not use user generated feedback mechanisms to weigh result sets, such as those employed by Digg, Reddit and Slashdot.
- Complex algorithms that give additional relevancy, such as a semantic text parse, are prohibitively expensive to do on billions of pages. Current technologies are limited by the computational power available to the search cluster.
- Sites with large amounts of dynamic content, including content locked up in Flash and JavaScript, go largely unindexed.
- The "Dark Web" is largely ignored by current search engines. Databases containing valuable, relevant items are left unindexed.
- Search result display context is broken. Searching for "funny cat videos" on Google does not return a page full of links to videos of funny cats. A search for "sample dotcom business plans" does not return a page of PDF document thumbnails.
- Update times are limited by the time it takes to detect changes, re-crawl content, and re-index it.
- Relevancy cannot always be directly deduced from the search keyword or phrase. For example, if you "Google up" an error message you got while installing something, the search engine cannot tell which of the pages containing the search phrase actually include a good solution to the problem.
- Web pages with a very wide range of content, such as dictionaries or news archives, can defeat attempts to find content using multiple relevant keywords. This is especially frustrating when such pages are intended as advertisements or botnet recruiters.
- By indexing the full text of subscription-limited articles, Web engines such as Google may serve readers who have extensive access or are willing to pay, but they make searches much more difficult for everyone else.
- By automatically applying synonyms and removing special characters, some search engines may reduce the number of search term variants, but this can interfere with specific requests.
Local Searching
Local search for a given website must be done either by running it through a remote search engine and aggregating the results on a page on the website, or by the company licensing an existing search product, such as Google's Search Appliance.
Additionally, search terms are not kept intact when navigating to a relevant site's page. A site knows nothing about why the user is on a particular page. For example, a search for "funny cat videos" returns several pages on YouTube, but once the user navigates to a given video's page, the original search term is lost, and additional relevant results on the site are no longer accessible.
Number of Results
Search engines give an excessive number of low-relevancy results. For example, Google returned this for a simple search: "Results 1 - 10 of about 2,970,000,000 for search"
It is assumed users will not look past the second page of results.
Solution
The problems above can be solved by changing the way in which the current search pipeline operates. The idea of what is returned for a given user's search needs to be changed substantially, as do the methods by which the data behind the search is collected, analyzed, and weighted.
Focus on the top 1 million unique search queries run on search engines today. These are estimated to cover 99% of the searches done on sites like Google, Yahoo, and MSN. How to deal with the other 1% of search queries is still to be determined.
A suite of open-source software will be written that provides a distributed search infrastructure, including centralized caching/coordinating servers, and distributed worker clients.
Distribute the unique search queries to each client on a daily basis for processing.
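One possible way to split the daily query list among clients is sketched below. The plan does not specify an assignment scheme, so the hash-based partitioning, the function name `assign_queries`, and the client identifiers are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def assign_queries(queries, client_ids):
    """Assign each unique query to exactly one client.

    Hypothetical scheme: a stable hash of the query text picks the client,
    so the same query lands on the same client as long as the set of
    clients does not change.
    """
    if not client_ids:
        raise ValueError("at least one client is required")
    clients = sorted(client_ids)  # fixed order makes the mapping repeatable
    batches = defaultdict(list)
    for query in sorted(set(queries)):  # drop duplicate queries
        digest = hashlib.sha256(query.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(clients)
        batches[clients[index]].append(query)
    return dict(batches)


if __name__ == "__main__":
    daily = ["big green cars", "funny cat videos", "sample dotcom business plans"]
    for client, batch in assign_queries(daily, ["client-a", "client-b"]).items():
        print(client, batch)
```

A simple modulo mapping like this reshuffles most queries whenever a client joins or leaves; a real deployment would probably need something more tolerant of churn.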
Data Collection (Crawling)
Shift the collection of the data to the source. This allows the data to be collected faster, and allows access to data that requires localized access. Software available for download will provide methods for detecting changes, local database access, and mapping content to URLs.
"Crawled" content is no longer represented by simple URLs representing web pages. HTML files, XML documents, PDFs, images, videos, audio files, aggregated search results, tags, database entries, etc. are represented in the context in which they can be accessed. If a video exists both on its own web page, and a search result page, then the system is aware of both contexts.
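As a rough sketch of the two ideas above (detecting changes at the source, and tracking every context in which a piece of content can be accessed), the following keeps a local catalog keyed by content hash. The class and method names (`LocalCatalog`, `record`, `contexts_for`) and the example URLs are hypothetical; the plan does not define a data model.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    content_hash: str
    media_type: str
    contexts: set = field(default_factory=set)  # URLs where the content appears


def fingerprint(data: bytes) -> str:
    """Identify content by a hash of its bytes (an assumed approach)."""
    return hashlib.sha256(data).hexdigest()


class LocalCatalog:
    """Hypothetical catalog kept by the software running at the source."""

    def __init__(self):
        self._items = {}      # content hash -> ContentItem
        self._last_seen = {}  # context URL -> content hash last seen there

    def record(self, url: str, data: bytes, media_type: str) -> bool:
        """Record content found at url; return True if it is new or changed there."""
        digest = fingerprint(data)
        previous = self._last_seen.get(url)
        if previous == digest:
            return False  # nothing changed at this URL
        if previous is not None:
            # The old content is no longer reachable through this context.
            self._items[previous].contexts.discard(url)
        self._last_seen[url] = digest
        item = self._items.setdefault(digest, ContentItem(digest, media_type))
        item.contexts.add(url)
        return True

    def contexts_for(self, data: bytes) -> set:
        """Return every known context in which this exact content appears."""
        item = self._items.get(fingerprint(data))
        return set(item.contexts) if item else set()
```

With this shape, a video recorded once under its own page URL and once under a search result URL resolves to a single item with two contexts, which is the behavior the paragraph above describes.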
Indexing
Utilize the resources of the clients to tackle additional high cost algorithms. By implementing algorithms such as semantic text parsing, the client can extract additional meta information that can be used later to match it up with particular search terms. Note: this can be done today by Google, for example, but doing it for every page on the Internet is costly.
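To make the client-side step concrete, here is a deliberately simplified stand-in. It is not semantic parsing: it only counts word frequencies to pick candidate keywords, which shows where a more expensive algorithm would plug in. The function name `extract_keywords` and the small stop-word list are assumptions for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real client would need a fuller one.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"})


def extract_keywords(text: str, limit: int = 5) -> list:
    """Return up to `limit` frequent words as candidate metadata.

    Term frequency is a placeholder for the costlier analysis (such as a
    semantic text parse) that the plan wants clients to run.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    # Sort by descending count, then alphabetically, so output is deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in ranked[:limit]]


if __name__ == "__main__":
    print(extract_keywords("The big green cars and the big red cars", limit=2))
```

The extracted terms could then be sent to the coordinating servers alongside the content, to be matched against search terms later.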
Search Interface
Search Results
Instead of pages and pages of search results, a single page is shown for each of a set of popular search terms. For example, 'big green cars' would have its own page (provided it is in the top 1 million search terms), which would contain content relevant to the search, including a few search results, related search terms (like 'big yellow cars'), and comments by the community.
An important open question is how the mini articles would help users in their search. A discussion has started about how to match keywords with the multiple related mini articles.
Post Indexing
The community can edit and maintain search term pages. Allow voting and comments on results, à la Digg/Reddit, and provide a history of changes. For example, a page would be auto-created for 'big green cars', which could then be modified by editors.
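A search term page with community voting and a change history might be modeled roughly as follows. This is only a sketch: the class names (`SearchTermPage`, `Result`), the one-point up/down voting rule, and the shape of the history entries are assumptions, not part of the plan.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    url: str
    votes: int = 0


@dataclass
class SearchTermPage:
    """Hypothetical model of the single page shown for a popular search term."""
    term: str
    results: dict = field(default_factory=dict)        # url -> Result
    related_terms: list = field(default_factory=list)
    comments: list = field(default_factory=list)
    history: list = field(default_factory=list)        # (editor, description)

    def add_result(self, editor: str, url: str) -> None:
        """Add a result and log the edit so changes can be reviewed later."""
        if url not in self.results:
            self.results[url] = Result(url)
            self.history.append((editor, f"added {url}"))

    def vote(self, url: str, delta: int) -> None:
        """Apply one up-vote (+1) or down-vote (-1) to an existing result."""
        if delta not in (1, -1):
            raise ValueError("delta must be +1 or -1")
        self.results[url].votes += delta

    def ranked_results(self) -> list:
        """Order results by votes, highest first; ties break by URL."""
        return sorted(self.results.values(), key=lambda r: (-r.votes, r.url))


if __name__ == "__main__":
    page = SearchTermPage("big green cars", related_terms=["big yellow cars"])
    page.add_result("alice", "http://example.com/a")
    page.add_result("bob", "http://example.com/b")
    page.vote("http://example.com/b", 1)
    print([r.url for r in page.ranked_results()])
```

An auto-created page would start from crawler-supplied results and then diverge as editors and voters reshape it; the history list is what makes such edits reversible.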
Funding/Revenue
The project must be funded to succeed, but it may or may not have a revenue model. Question: Are there going to be ads on the main search site?
- There are other possible funding sources besides advertising. One example is "Pay to Play": companies that have internal databases may be willing to pay for the work of making some of their content externally searchable (for example, companies like Oracle or HP may want to externalize their tech support knowledge bases). Even if the software to do that (domain-specific plug-ins, etc.) is open source, the work itself may be done for profit.
- Quality is crucial to advertising, and imposing standards on advertisers can preserve the value of ad space and increase its acceptance by users. For example, if a user looks for an Edsel in North Dakota, many sites will provide advertising links that proclaim they sell Edsels in North Dakota, but which in reality have few if any listings for any Edsels or any cars in North Dakota, let alone both at once. Obviously users who click on this type of ad once or twice give up on them entirely, at which point they are worthless. Therefore the project should require that all advertisers be able to deliver any product they advertise, even if it means sacrificing some revenue in the short term.
Competition
Many competitors exist in this market, and have erected numerous barriers to entry for new competitors. This is a list of current major competitors:
- Yahoo
- Ask
- MSN Search
It often seems, de facto if not de jure, that there is no idea too obvious to patent, and one expects a major stumbling block for the project to come from claims that certain practices violate existing software or business model patents. This raises substantial questions, namely:
- How do community members find out and agree on what the existing legal barriers are which prevent improvements to the search engine?
- How do community members prevent their ideas from becoming, or inspiring, software patents that competing firms hold in order to block their implementation?
- How do community members discuss ways to circumvent or find loopholes in existing patents without compromising the legal position of the project in the case of a dispute?