Brainstorming

Let's start brainstorming ideas for a new search engine.

Just throw in your ideas!

Ideas about the Search Functionality in General

  • Have multiple types of results across the "need states" of users, such as "the actual page for something", "news about something", and "blog posts about something". These categories could come with some implied structure, which a user could add to.
  • Allow people to integrate their social media sites straight into the search interface if they are logged in through Wikia Search.
  • Creation of a tags/comments page for every indexed page, as well as community rating (mimicking the functionality of StumbleUpon[1], but integrating ratings with search rank)
  • Local Search - by city (for pizza, niche shops, etc., allow businesses to include metadata like items they carry or foods served at restaurants)
  • Search weights - Allow the user to enter extra words that should boost results if present, but aren't sine qua non. Like: 'carthage book !must-read'.
  • User algorithms - let users combine algorithm criteria and assign weights (i.e. for research studies).
  • Search options: rank higher pages with a) higher word count b) words from a glossary c) statistically uncommon words d) 'cite'd blockquotes.
  • Fine tuned search by date.
  • Image search - search by exact pixel counts.
  • Image search - search by metadata
  • whitelists: what sites are worth crawling?
  • blacklist: what sites are spam / dangerous (e.g. phishing)?
  • mark results:
    • do I have to pay for the online content?
    • do I have to have a login to see the content?
  • An option to only search secure sites.
  • An option to only search FTP sites.
  • Here's something I don't know if I've mentioned before: One thing we could do to apply the community to improve search results is to constantly rotate in different algorithms and let users rate the result relevance. Sure, maybe Joe Averageuser wouldn't want to have much to do with that, but even a dedicated core of few hundred power users could quickly create significant statistics about what algorithms are *actually* working better. But you'd never want to stop - constantly take in new ideas about what data to collect and how to weight it, then collect data about how well the algorithm is working in real time. Use the feedback from that to "evolve" the algorithms.
  • edit search results - drag and drop to order the results, rather than simply rate each one individually
  • open licensed materials I can remix and redistribute
  • Proximity based ranking
  • Popularity-aware ranking
  • Integrate TIP (Total Information Pages) technology to enhance the search platform. TIP is a compiler that links every word in a document to thousands of sources; there are TIPed documents that are outsourced to wiki on the internet now. The educational community would benefit greatly from this type of search technology.
  • Why not let every website spider (index) its own content and submit the results (like they do a sitemap)? - I agree, it would be useful to have a PHP, ASP and JavaScript crawler (factory?); this could then be run as a cron job and be built into CMSs such as Drupal, Mambo etc.
  • Don't re-invent the wheel(search engine) - focus on PRIVACY
  • SearchPlatform
  • CommunityRanking
  • semi-automatic semantic tagging
  • transparent engine that explains why this link is the first
    • maybe information on what top-rated results were skipped for results with lower rating is even more interesting, since top-rated results will be popular anyway - such a negative rating might increase the chance of previously less rated links to "move up" more easily?
  • Pareto-Efficiency-Criteria Ranking [2]
  • Multi-search sessions
  • site thumbnails
    • thumbnails, but just on request
  • flexible user-defined search with choice of ranking and voting panel to rank suitability
    • NPOV criteria for improving ranking by generic relevance using aggregated tagging, ranking, of material to specified criteria?
    • POV search by person's own criteria, eg person values x over y sources, open licenses, local to me, most recent? first.
  • date: how old is a page? - yes, along with filtering results by date (useful for searching for news)
  • Thesaurus style search grouped by similarity of meaning. Owners place pages in contextual boxes.
  • integrate basic dictionary spell-checking functionality
  • context-sensitive hints (like amazon, etc.: Others also have searched for ..., ..., ...)
  • Use web servers as self-defined search agents: each website could describe its available content with tags and ranks and serve it to other web servers in P2P agent networks, following semantic web concepts
  • Visual Searching: Use metadata about picture elements (and each pixel's immediate neighbor) to find similar images. Diana Day points us to: Hermitage Museum rbg search

See also: eVision and Able Image from Mu Labs... web-based, drag-and-drop, searches that find blue pictures, not just pictures with blue.png filenames.
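
One possible reading of "find blue pictures, not just pictures with blue.png filenames" is to compare coarse color histograms. The sketch below is only an illustration under that assumption: the function names, the bucket count, and the tiny hand-made pixel lists are all hypothetical, and it does not describe how the Hermitage search, eVision, or Able Image actually work.

```python
from collections import Counter

def color_histogram(pixels, levels=4):
    """Coarse, normalized RGB histogram; pixels is an iterable of (r, g, b) in 0..255."""
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bucket: n / total for bucket, n in counts.items()}

def similarity(hist_a, hist_b):
    """Histogram intersection: 1.0 for identical color distributions, 0 for disjoint ones."""
    return sum(min(share, hist_b.get(bucket, 0.0)) for bucket, share in hist_a.items())

# Tiny hand-made "images" as flat pixel lists (hypothetical data).
sky = [(10, 20, 240)] * 8 + [(250, 250, 250)] * 2
sea = [(0, 40, 220)] * 9 + [(20, 120, 60)] * 1
rose = [(230, 10, 30)] * 10

# Rank the candidate images by how closely their colors match the query image.
query = color_histogram(sky)
ranked = sorted(
    {"sea": sea, "rose": rose}.items(),
    key=lambda item: similarity(query, color_histogram(item[1])),
    reverse=True,
)
print([name for name, _ in ranked])
```

A real system would also need the neighbor-pixel metadata mentioned above; a plain histogram ignores where in the picture each color appears.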

  • Allow contributors to edit categories, not results. Use solid searching, but allow searchers to edit "cluster" categories and directory hierarchies...
  • Allow users to categorize the content, but do not force them to do so completely: e.g. allow a user to categorize an article under chemistry without requiring a further categorization into organic/inorganic chemistry -> sequential improvement of categorization.
  • Allow every aspect of an article to be categorized; do not require a single categorization.
  • Allow a content provider to categorize an article; perhaps allow users to overrule this categorization in order to limit abuse.
  • Check experiences of categorization schemes like the Japanese Patent Classification, International Patent Classification etc. and design a compatible system or use at least the work provided by these organizations.
  • Having induction periods where new websites added are displayed in a random order and whichever gets the most clicks comes first
  • Tagging can be used to gain buzzwords describing a found page. These tags can be used as index.
    • Tags can be gained from the meta data of the page
      • The contents of the page can further be analyzed semantically for a second-level index. The contents contain many words and phrases that are common and not specific to the subject of the page; this information should therefore not be weighted too much.
    • Allow the user to supply tags to a search result (generate a tag cloud for each page). The more users supply a result with the same tags, the more appropriate those tags are.
  • A page that is often visited seems to be important. To find out how well received the contents of the page is there are other resources.
    • How many entries for a page can be found at del.icio.us or digg can indicate how widely the page is known and how well it is appreciated.
    • Some sites have their own critic section à la 'How helpful was this article?' Make use of this. There may be some need for standardisation but there is already good data out there.
    • Allow a page to indicate that it was visited. The webmaster of a site adds some lines to his page and every click on the page sends an event to the ranking mechanism. It must then be decided whether this was a regular click or a fraudulent click, and the ranking of the page adjusted accordingly.
  • It must be easy for the user to 'comment' a search result. The easier the more it will be used. The comments should be available in various granularity, eg. good or bad, rate between 0..9, some written comment. Such a thing can be done with an Add-on or another browser extension.
  • Harvest External Links from Wikipedia articles
this is kinda what Wikiseek does.
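
The user-tagging idea in the list above ("the more users supply a result with the same tags, the more appropriate they are") could be sketched roughly as follows. This is a minimal illustration, not a proposed design: `tag_cloud`, the `min_users` threshold, and the sample data are assumptions made for the example.

```python
from collections import Counter

def tag_cloud(user_tags, min_users=2):
    """Count how many distinct users applied each tag to one page.

    user_tags maps a user name to the tags that user supplied. Tags used by
    fewer than min_users people are dropped as too weakly supported.
    """
    counts = Counter()
    for tags in user_tags.values():
        # Normalize and de-duplicate so one user cannot vote twice for a tag.
        counts.update({tag.strip().lower() for tag in tags})
    return {tag: n for tag, n in counts.most_common() if n >= min_users}

# Hypothetical tags supplied by three users for the same search result.
votes = {
    "alice": ["History", "carthage", "rome"],
    "bob": ["carthage", "history ", "history"],
    "carol": ["carthage", "travel"],
}
print(tag_cloud(votes))
```

The per-tag counts could then serve both as tag-cloud font weights and as index weights for the page.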


Ideas about what should surround the search functionality

  • Serve Tag Cloud of related or surrounding search terms
  • if a search term has a Wikipedia page, place the summary in the sidebar, always (see google extra greasemonkey script)
  • Add an option to serve equivalent yahoo/ask/google search results alongside the wikia search results. This may seem to somewhat defeat the purpose of our work, but it could help an individual to "hand copy then prune" traditional search results.
  • Use the plain-jane sidebar to allow searchers to add custom functionality to make up for the fact that wikia search can't show locations or local results, but for an address, it could show a google map api mashup of top local results for all ::address:: or ::directions: . other examples could be news items, specialized results from a given webpage, images, video etc.
  • a more robust comment system incl. rating
  • Collaborative Search -- generate a unique session ID and allow two or more people to tick off results from the list
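
As a rough sketch of the collaborative search idea above, a shared session could be little more than a unique ID plus a record of who ticked off which result. Everything here (the class name, its methods, the in-memory storage, the example URLs) is hypothetical; a real version would need persistence and access control.

```python
import uuid

class SharedSearchSession:
    """In-memory sketch of a result list that several people can tick off."""

    def __init__(self, query, results):
        self.session_id = uuid.uuid4().hex  # random ID to share with collaborators
        self.query = query
        self.results = list(results)
        self.checked = {}  # result URL -> set of users who ticked it

    def tick(self, user, url):
        """Record that a user has dealt with one of the session's results."""
        if url not in self.results:
            raise ValueError(f"{url!r} is not in this session's result list")
        self.checked.setdefault(url, set()).add(user)

    def remaining(self):
        """Results nobody has ticked off yet, in original order."""
        return [url for url in self.results if url not in self.checked]

session = SharedSearchSession("carthage book", ["a.example", "b.example", "c.example"])
session.tick("alice", "a.example")
session.tick("bob", "c.example")
print(session.session_id, session.remaining())
```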


Other Ideas

  • integrate wikia search more closely with Wikipedia and other wikis
  • mobile version to search with PDAs etc.
  • personal homepages with customisable content
  • peer to peer search spider
  • proxy server option (for the schoolkids lol)
  • Anti search - show what sites don't want to be indexed. A db of every site's robots.txt would be nice.
  • A secure version of the site.
  • For performance reasons it may be helpful to treat frequent queries differently, possibly by a) showing the most popular terms to users, b) keeping wikia search most relevant... See Mahalo for an idea of how to address this issue (although not a great way to do so imho, but an idea nonetheless)
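
For the "anti search" idea above, Python's standard `urllib.robotparser` can already answer which URLs a robots.txt file excludes. The sketch below assumes the robots.txt text has been fetched and stored elsewhere; `excluded_urls` and the example URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def excluded_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text asks crawlers not to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]

# Hypothetical stored robots.txt for one site.
robots = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""
candidates = [
    "http://example.com/index.html",
    "http://example.com/private/notes.html",
    "http://example.com/drafts/idea.html",
]
print(excluded_urls(robots, candidates))
```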


ideas for Grub

  • user rating of search results / certain pages
  • distributed spidering and indexing of web pages using networked video game systems (xbox, etc.)
  • Centralized control with all your Grub clients, eg. from the grub.org webpage. I would like to send schedules to all clients at one time, disable/enable all clients and so on.
  • The possibility for the Grub client to use more of the allocated resources. When I set up the client to use 100% CPU and 100% network usage, the client still only utilizes a fraction of this power. Maybe this should be done as a "dedicated crawler" install, e.g. to be used by ISPs who would like to put dedicated machines into the fight, to ensure their sites get indexed (local crawling).
  • Make local crawling available on a team level. E.g. I would like to get everyone at our office to install the grub client, but I would also like for them all to crawl the websites we produce. As this is several hundred pages, I would like to add these pages to the team, instead of to each client.
  • It would also be nice to be able to add sites to be crawled locally, from code. This could be used for ISPs who create many websites.
  • can we make a grub firefox plugin?



other stuff I didn't know how to categorize

  • Something similar to Google's "Subscribed Links", where people can add specific data on a specialized subject, or access to a specific web application (e.g. searching for "whois:www.domain.com" will provide the whois data for that domain, or searching for "where is company ABC" will bring up their address from an address book)
  • In addition to searching the Internet as a whole, offer an option to search ONLY a list of familiar sites.
  • Google does not serve more than 1000 results for any query. Why not serve ALL results? And show some options with the results. (Maybe this work could be done in the user's computer, maybe by a browser extension) Some options could be:
    • Order results in reverse mode.
    • Or shuffle mode.
    • Or specifying a rank.
    • Or I want to watch the development of a topic online, show me a date view
  • It does not matter if all results are returned, 88% of users only look at the first 3 pages before changing their search query.
    • But being able to do a more complete search could be a niche to fill, or an advantage over what Google already does. That about 85% of users use Google, Yahoo or MSN doesn't mean that there shouldn't be a wikia search engine. (Search Engine Watch, Sept '07)
  • Search should focus on the high-precision results (the first n results)
  • When using P2P to store your index you want incremental results OR just the top n results
  • Let webservers index their own content and let them act like peers in a hybrid P2P network (for an overview, see a paper I wrote: Towards large scale P2P web search)
  • Arrange META data -and other factors- obtained from each site to try a "semantic" search engine
  • I really want the ability to do search stemming (e.g., the search term "*librar*" can find pages to do with libraries, librarians, metalibraries, etc.).
  • Allow users to tag search results, and allow users to choose whether or not their search results are influenced by the tagging. Allow users to "second" a tag, giving it more weight if the tag is seen as appropriate
  • Firefox/IE Toolbar --COMPLETED!!!!
  • Add a "freshness" rating. An article may be highly relevant when first published, but after a period of time, the data in the article becomes dated or useless.
  • Pruning of dead links and stale articles.
  • Session generation -- automatically generate Firefox/Opera sessions of each search result in a tab, with the option of excluding search results
  • Search session history -- ability to exclude certain results over long periods of time (see the persistent search session history idea above)
  • RSS/Atom feed for search results on particular topics
  • Special ASCII character search
  • We could have all the different search tools available in a similar way to facebook applications, in that people can seek out the search tool they most like the look of and have that on their personalised homepage. This would lead to things like popularity rankings for the search tools themselves...
  • Today much information is found in blogs and forums. Information is only valuable if it is correct, and correctness is something only a human can judge after studying an article. However, if the user knew the date of the article beforehand, they could decide whether it is out of date or worth pursuing. The date need not be the page's actual last-modified date at the time of the search request; it can be the last-modified date as of the last crawl.
  • It must be transparent how the crawling mechanism works so the user can decide whether the found content is out of date. For example, suppose the result page announces some event in the past: to find out whether only the displayed search result text is old or the page itself is, the user has to open the page.
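
The "*librar*" request in the list above is, strictly speaking, wildcard matching rather than linguistic stemming. A minimal sketch of that matching, using the standard `fnmatch` module over a handful of made-up documents, might look like this; `wildcard_search` and the sample data are assumptions for illustration, and a real engine would match against its term index instead of scanning text.

```python
import re
from fnmatch import fnmatchcase

def wildcard_search(pattern, documents):
    """Return titles of documents containing a word that matches a '*' pattern."""
    pattern = pattern.lower()
    hits = []
    for title, text in documents.items():
        # Split the text into lowercase words and test each against the pattern.
        words = re.findall(r"[a-z]+", text.lower())
        if any(fnmatchcase(word, pattern) for word in words):
            hits.append(title)
    return hits

# Hypothetical mini-corpus.
docs = {
    "doc1": "Opening hours of public libraries",
    "doc2": "A day in the life of a librarian",
    "doc3": "Cooking with seasonal vegetables",
    "doc4": "Building metalibraries for research",
}
print(wildcard_search("*librar*", docs))
```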

See also

A new Method for Product Search - BrandPages

Retrieved from "http://search.wikia.com/wiki/Brainstorming"

This page was last modified 19:34, 9 August 2008. GFDL