
Search Engines

These are the workhorses of the CI industry, and new versions are popping up all the time. Search engines can be stand-alone, or they can be "meta" search engines, which take your query and submit it to a number of other engines. They can be located on a server or on your own machine. A recent category of engine searches the internet directly rather than a database created by "spidering", and can be more effective if you already have a clear idea of where to start looking. Another category is the directory, where data has been pre-classified. The basic search engine is a good place to start if you do not have a clear idea of where to go. It is also a bad place to go, because it will deliver lots of inappropriate data if the keywords used in the search have more than one meaning or form part of a longer word: on many engines, searching for "Mobil", for example, will also pull in anything containing "mobile".
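The "Mobil" versus "mobile" problem comes down to substring matching versus whole-word matching. The following is a minimal sketch in Python, not a description of how any particular engine works; the function names and the two sample page titles are invented for illustration.

```python
import re

def substring_match(keyword, text):
    # Naive matching: the keyword may sit inside a longer word.
    return keyword.lower() in text.lower()

def whole_word_match(keyword, text):
    # \b anchors the keyword at word boundaries, so "mobile" is not a hit for "Mobil".
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Hypothetical page titles, used only to show the difference.
pages = [
    "Mobil announces a new refinery upgrade",
    "Best mobile phone deals this week",
]

substring_hits = [p for p in pages if substring_match("Mobil", p)]
whole_word_hits = [p for p in pages if whole_word_match("Mobil", p)]
```

With these sample titles, the substring approach returns both pages, while the whole-word approach returns only the first. This is the same effect an "exact phrase" or whole-word option has on an engine that supports one.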

A new category of search engine is one that searches online databases, which are invisible to normal spiders. See "Invisible site searches" below.

Bear in mind that any single search engine has only indexed a part of the spectrum, so if you don't find what you're looking for, try another engine, or better still, try a meta search engine.

News

One of the most important aspects of CI is access to recent information. Search engines can take weeks or even months to revisit a site. Consequently they are not good at finding recently posted information. One way around this is to use the search engines of the major news sites. A number of sites, however, crawl the news sites every few hours:

AlltheWeb

Google

Infomart ($)
Claims to be the largest electronic resource of Canadian news and business information products, delivered via multiple formats. Offers same-day and archival access to over 160 national and regional sources, daily/weekly newspapers, magazines, trade publications, broadcast transcripts and newswires - all full text; many exclusive to Infomart.

Moreover

Rocket News

History

Another disadvantage of most conventional search engines is that they normally only reference current sites. Once a site goes down, the information goes with it. Equally, if information changes, you only get the current information, not how it was.

Google offers a partial remedy through its cache option - you can see how the page looked when the site was last spidered. Google also provides the Deja archive of all the Newsgroups going back to 1995.

Wayback This site can give you a view of pages as they were on specific dates. The archive goes back to 1996 thanks to Alexa. The archive is claimed to have 10 billion pages and 100 TB (terabytes) of data (October 2001). There is an option to have one's own site made inaccessible, so check on your competitors now before they hear about this facility!

Basic search engines

The phrase "Lies, damn lies and statistics" applies to search engine usage claims - most companies seem to be able to drum up reasons why theirs is the best.   My own advice is to try them all and see which appears to work best for you.

Alta Vista
Claimed in November 2003 to have 580 million searchable files indexed, with the most extensive multi-media index - 50 million images, video and audio clips. Guarantees index "freshness" rate - updated at least every 28 days. A unique feature, which I have found very useful, is the Babelfish translator (remember Hitchhiker's Guide to the Galaxy?), which will translate web pages from a selection of languages.   The result is useable if not necessarily syntactically accurate - I once retrieved a German article on the Shell/Esso Brent Spar which translated as the Brent "Save"!

AskMSR
This isn't even available to the public yet, but it's getting good reviews from users. It's a Microsoft product, and uses basic language rules to interpret plain language queries and translate them into sophisticated boolean search strings.

Atomica
Atomica is a free one-click information service that works whenever you're online. Previously known as GuruNet, it has been revised to add pay-for extras. It automatically analyzes pointed-to text in context and pops up a simple window without linking or leaving your document. You don't even have to select the word. GuruNet's got reference information (dictionary, thesaurus and encyclopedia) and real-time information (e.g. news, sports, weather or stock quotes). Similar to Nano and Zapper in that it is an application that runs on your own computer.

Dipsie
A new search engine to be launched in summer 2004. This is designed to mine the "deep web". Chicago-based Dipsie has a crawler that burrows deep through forms and database interfaces to get to the specific information you require. Having already indexed a whole database it can present the required page without the need to go via the intermediate levels. At launch it is planned to have 10 billion documents, some three times the size of Google's.

Direct Hit
Direct Hit provides relevance for any Internet search by analyzing the activity of millions of previous Internet searchers. In February 2000, Direct Hit was acquired by Ask Jeeves, Inc., and its technology has since been incorporated into Teoma.

Eurekster
Eurekster is a privately-held US company with offices in San Francisco and New Zealand. Eurekster (www.eurekster.com) is the only Internet search engine powered by social networking technologies - delivering results that matter most to users and their networks of friends and contacts. Eurekster continuously "learns" from the behavior of users and their social networks to deliver personalized search results and instant sharing of their popular Web destinations and searches. The service allows users to share their knowledge and experiences with each other, confidentially and privately.

Excite
I've experienced good results with this engine, which uses a "more like this" concept for following links from relevant hits.

FAST (AlltheWeb)
AlltheWeb indexes over 2.1 billion web pages, 118 million multimedia files, 132 million FTP files, two million MP3s, 15 million PDF files and supports 49 languages, making it one of the largest search engines available to search enthusiasts. AlltheWeb claims to provide the freshest information because "we update our index every 7 to 11 days and index up to 800 news stories per minute from 3,000 news sources".

Gigablast
Small but effective engine running (April 2004) on 8 large but fairly standard PCs. The code is capable of handling 40 search requests per second and indexing eight million web pages per day.

Gold Rush http://www.wisdombuilder.com/prGoldRush.htm
Gold Rush is an artificial intelligence-based meta "find" engine from the makers of Wisdombuilder.

Google
I find it can often locate items missed by other engines. Has some sponsored advertising linked to the query, but these are clearly marked as such. One of the best features is that it caches each page when indexing it, so that even if the page disappears from the original site, you can still retrieve it from the Google cache (click on the cache link). It can also search for images. Now has some 72 interface languages including Klingon! http://www.google.com/advanced_search?hl=xx-klingon. The latest add-on is a news section, and an interesting search by sets. Recently (March 2004), there have been reports of shortcomings: its link-search function does not reveal all of the pages in its database, its advanced search function ignores any data beyond 101kb in HTML Web pages and 120kb in PDF files, and its OR-search operator has not worked properly since November 2002. Yahoo recently stopped using Google as its search engine and has begun using its own platform.

Guidestar
GuideStar is a national database of U.S. charitable organizations. It gathers and distributes data on more than 850,000 IRS-recognized nonprofits.

GuruNet See Atomica
GuruNet is a free new one-click information service that works whenever you're online. It automatically analyzes pointed-to text in context and pops up a simple window without linking or leaving your document. You don't even have to select the word. GuruNet's got reference information (dictionary, thesaurus and encyclopedia) and real-time information (e.g. news, sports, weather or stock quotes). Similar to Nano and Zapper in that it is an application that runs on your own computer.

Hotbot
This is one of my favourites as it allows you to specify an "exact phrase" search, which eliminates most irrelevant hits. However, I'm increasingly using Google.

iWon
iWon.com, based in Irvington, NY, is a CBS-backed destination Internet portal powered by Inktomi. It encourages use of its services by offering cash prizes for using the system. It was launched in October 1999.

Jayde (WebProNews)
Jayde.com is a cross between a directory and a search engine and now boasts over 1.2 million site listings. Unlike site directories that are organized in a series of nested sub-categories which force the user to drill down to find meaningful results, Jayde offers visitors fast sub-category searches by making each sub-category a keyword or keyword phrase that is searched in real-time across the entire database.

Karnak
Atlanta-based VedaSource LLC unveiled a new Internet search service called Karnak in early 1999. This is a service that searches the internet directly, and subsequently maintains a watch on the sites of interest for any changes that occur while the user is off-line. It is a free service for a single search profile, but multiple profiles may be obtained for what seems to be a reasonable monthly fee. Can take weeks to get all the relevant hits.

Looksmart
Looksmart claims a Web index of more than 1.1 billion indexed documents, and a professionally edited directory (November 2003).

Mooter
Mooter is a new type of search engine that organises data into clusters. This narrows the search to the areas you are interested in and avoids trawling through page after page of irrelevant hits. This is an Australian company, and the engine works by analyzing the potential meanings and permutations of the submitted keywords and presenting the data graphically as a star of clusters emanating out from the core words. The system learns as it goes on, recognising areas of interest inherent in the choice of subjects.

Northern Light
Northern Light's search services claimed to have over 25 million Special Collection documents from over 7,100 premium sources, and over 315 million Web pages (Feb 2001). A useful feature of Northern Light is that it dynamically groups hits into meaningful categories, thus allowing the searcher to identify the desired context. Premium information is also available on a pay-per-item basis.

RocketLinks (Xuppa)
Launched in March 2000, Rocketlinks is an advertiser-geared search engine in which the advertiser dictates his site's placement by bidding on keywords. Uses Google search technology.

Search.msn.com (July 2004)
Search engine from Microsoft using their prototype MSNBot. This may or may not be accessible when you read this, as Microsoft plans to take it down after a while to fiddle with it based on user feedback. See the Tech Preview page.

Subjex
A plain language search engine, designed to provide a very focussed search. I have not had much success so far with it, but it is early days.

Superpages
A yellow pages lookup engine (US only).

Teoma
Also uses peer sites to rank results. Incorporates DirectHit. Teoma (Gaelic for "expert") was designed at Rutgers, N.J. and powers Ask Jeeves. It has been getting good reviews in that it looks for authorities within the communities associated with the keywords and whether the authorities are peer listed. Ranking is based on how often each page is cited by authority pages. The result is a search in greater depth for the selected domain.
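The idea of ranking a page by how often authority pages cite it can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Teoma's actual algorithm: the link graph, the page names and the set of authorities are all invented, and how authorities are identified in the first place is taken as given.

```python
def rank_by_authority_citations(links, authorities):
    # links maps each page to the list of pages it links to.
    counts = {}
    for source, targets in links.items():
        if source not in authorities:
            continue  # only citations from authority pages count
        for target in set(targets):  # count each authority at most once per page
            counts[target] = counts.get(target, 0) + 1
    # Most-cited first; ties broken alphabetically for a stable order.
    return sorted(counts, key=lambda page: (-counts[page], page))

# Hypothetical link graph.
links = {
    "hub1": ["pageA", "pageB"],
    "hub2": ["pageA"],
    "random": ["pageB", "pageB", "pageC"],
}
ranked = rank_by_authority_citations(links, authorities={"hub1", "hub2"})
```

Here pageA is cited by both authorities and ranks first, pageB is cited by one, and pageC, linked only from a non-authority page, does not appear at all.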

Wisenut
This engine is claimed to be able to crawl and index 50M Web documents a day. It is also claimed that the full production version of the WISEnut search engine will have crawled and indexed more Web pages than any competitor, and will be able to refresh this comprehensive database once a month using only a fraction of the computing resources others use. The relevancy ranking system uses context-sensitive link analysis not only to measure the relative importance of a given page, but also to determine the relative relevancy of that page for a given query.

Yahoo
Yahoo announced in late February 2004 that it had deployed its own algorithmic search technology and expected to continue the process on a worldwide basis over the following several weeks. It has stopped using the Google search engine. The Web portal also introduced a Content Acquisition Program in March 2004, designed to index the billions of documents that are contained in public databases but are commonly inaccessible to search engines, or what's called the invisible or deep Web. To this end, it has aligned with the Library of Congress, the University of California at Los Angeles, National Public Radio, the University of Michigan and Project Gutenberg, among others, to begin seeding its index with fresh, searchable material for Web surfers' queries.

Meta Search Engines

Meta search engines submit your search parameters to several search engines at a time.   Some of them combine results.  Certainly you get increased coverage, but also increased junk.
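As a rough illustration of what "combining results" can mean, here is a minimal Python sketch that sends one query to several engines and merges the ranked lists. The two engine functions are hypothetical stand-ins that return canned URLs, and the scoring rule (summing 1/rank across engines) is a simple scheme chosen for the example, not the method used by any meta search engine listed here.

```python
from collections import defaultdict

def engine_a(query):
    # Hypothetical stand-in for a real engine; returns URLs in ranked order.
    return ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

def engine_b(query):
    # A second hypothetical engine with partly overlapping results.
    return ["http://example.com/b", "http://example.com/d"]

def meta_search(query, engines):
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            # A URL earns more for a higher rank and for appearing in several engines.
            scores[url] += 1.0 / rank
    # Duplicates collapse into one entry; best combined score first.
    return sorted(scores, key=lambda url: (-scores[url], url))

results = meta_search("Brent Spar", [engine_a, engine_b])
```

In this example the URL returned by both engines comes out on top, and each URL appears only once. An engine that does not combine results would simply show the lists side by side, duplicates included.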

Ask Jeeves
Uses normal English syntax to enter a query. Re-branded as Ask.com with extra pay-for services.

Ask.com See above.

Beaucoup
Meta search engine run by one person, but quite good, because of the claimed 2,500 search engines linked. Uses the Mamma meta search engine, but categorises searches to provide better focus.

Copernic
This is a piece of software that you download and run on your machine. It updates itself over the internet to incorporate new search engines as they become available. Recommended. Additionally, when installing Copernic on your machine you get the excellent Gist translator from Alis Technologies.

Dogpile

Highway 61
This is an online meta search engine with a refreshing style of dialogue, and some interesting quotations while you are browsing.

Mamma
Mamma.com claims to be the largest independently owned Meta Search Engine on the Internet.

Mega Spider

MESA University of Hanover site meta search engine for email addresses.

MetaCrawler (Go2Net) Meta Find
Has an exact phrase option.

Savvy Search (CNET)
Also has a phrase option.

Visit
This is a brand new meta search engine that shows results as icons with the most relevant towards the centre. Arrows show how the pages are linked to one another, including back-links. Placing the mouse pointer over an icon reveals the content of the page (if activated). This is supposed to help users more easily identify pages relevant to their search. It is a beta version from the University of Illinois at Urbana-Champaign. It requires a dedicated browser to be downloaded (free).

Vivisimo
New meta search engine from Carnegie Mellon University.

Newsgroups

Deja has until recently been the place to search for items appearing in newsgroups. On February 12, 2001, Google acquired the entire database (press release). The Deja service has folded, but Google has launched http://groups.google.com/ - which as of May 2001 contains the entire archive again.

Invisible site searches

The invisible, or deep, web refers to data that is held in formats not accessible to a search engine, or that is buried too deep for the engine to reach, for example due to memory constraints.

Dipsie
A new search engine to be launched in summer 2004. Chicago-based Dipsie has a crawler that burrows deep through forms and database interfaces to get to the specific information you require. Having already indexed a whole database it can present the required page without the need to go via the intermediate levels. At launch it is planned to have 10 billion documents, some three times the size of Google's.

Google
Google now indexes PDF, Microsoft Word, Excel, PowerPoint, Rich Text Format and PostScript files. You can restrict the search to a specific filetype by using the filetype: command e.g. widgets filetype:doc
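If you build such queries often, the filetype: restriction is easy to script. The sketch below, in Python, assembles the query string and a search URL; the helper names are invented, and the base URL and the q parameter are assumptions about Google's search address rather than anything stated on this page.

```python
from urllib.parse import urlencode

def filetype_query(keywords, filetype):
    # Append the filetype: restriction to the keywords, e.g. "widgets filetype:doc".
    return f"{keywords} filetype:{filetype}"

def search_url(query, base="http://www.google.com/search"):
    # The base URL and the "q" parameter are assumed; urlencode escapes spaces and colons.
    return base + "?" + urlencode({"q": query})

query = filetype_query("widgets", "doc")
url = search_url(query)
```

Swapping "doc" for "pdf", "xls" or "ppt" restricts the same search to the other file types mentioned above.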

Invisibleweb.com
Produced by Intelliseek.

Streamsage
One of the invisible areas not indexed by search engines has been audio/visual material. StreamSage claims it has developed a technology which understands the context of the information contained within audio/video content. This system identifies the relevant portions of an audio/video file for any given term or concept. These relevant sections can then be used for a variety of powerful search & retrieval, knowledge management, data mining, or content management applications.

Yahoo
The Web portal introduced a Content Acquisition Program in March 2004 designed to index the billions of documents contained in public databases but that are commonly inaccessible to search engines. To this end, it has aligned with the Library of Congress, the University of California at Los Angeles, National Public Radio, the University of Michigan and Project Gutenberg, among others, to begin adding fresh, searchable material to its index.

Profusion
Intelliseek's main engine.

Try their beta site, which claims:

Wayback This site can give you a view of sites as they were on specific dates. The archive goes back to 1996 thanks to Alexa. The archive is claimed to have 10 billion pages and 100 TB (terabytes) of data.

And if you still haven't found what you're looking for, try http://websearch.about.com/internet/websearch/mbody.htm

Directories of Search Engines

General

http://www.searchability.com/

http://library.albany.edu/internet/engines.html

Spider Foods - This site has lots of good info on use of search engines and optimising your site.

Canadian search engines

http://www.journalismnet.com/canada/searchengines.htm