Search engines are the workhorses of the CI industry, and new versions are popping up all the time. Search engines can be stand-alone, or can be "meta" search engines, which take your query and submit it to a number of other engines. They can be located on a server or on your own machine. A recent category of engine searches the internet directly rather than a database created by "spidering", and can be more effective if you already have a clear idea of where to start looking. Another category is the directory, where data has been pre-classified. The basic search engine is a good place to start if you do not have a clear idea of where to go. It can also be a bad place to go, because it will deliver lots of inappropriate data if the keywords used in the search have more than one meaning - searching for "Mobil", for example, will pull in anything containing "mobile" as well on many engines.
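The "Mobil" versus "mobile" problem comes down to substring matching versus whole-word matching. The following is only a minimal sketch of that difference, not how any particular engine works; the function names and the two sample headlines are made up for illustration.

```python
import re


def substring_match(keyword, text):
    """Loose match: the keyword may appear inside a longer word."""
    return keyword.lower() in text.lower()


def whole_word_match(keyword, text):
    """Strict match: the keyword must stand alone as a complete word."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Hypothetical sample documents, invented for this example.
docs = [
    "Mobil announces quarterly refinery results",
    "New mobile phone tariffs launched",
]

loose = [d for d in docs if substring_match("Mobil", d)]
strict = [d for d in docs if whole_word_match("Mobil", d)]
```

Here `loose` keeps both headlines while `strict` keeps only the first, which is the behaviour you want when a company name is also a fragment of a common word.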
A new category of search engine is one that searches on-line databases, which are invisible to normal spiders. See below.
Bear in mind that any single search engine has only indexed a part of the spectrum, so if you don't find what you're looking for, try another engine, or better still, try a meta search engine.
One of the most important aspects of CI is access to recent information. Search engines can take weeks or even months to revisit a site. Consequently they are not good at finding recently posted information. One workaround is to use the search engines of the major news sites. A number of sites, however, crawl the news sites every few hours:
Infomart ($)
Claims to be the largest electronic resource of Canadian news and business information
products, delivered via multiple formats. Offers same-day and archival access
to over 160 national and regional sources, daily/weekly newspapers, magazines,
trade publications, broadcast transcripts and newswires - all full text; many
exclusive to Infomart.
Another disadvantage of most conventional search engines is that they normally only reference current sites. Once a site goes down, the information goes with it. Equally, if information changes, you only get the current information, not how it was.
Google offers a partial remedy through its cache option - you can see how the page looked when the site was last spidered. Google also provides the Deja archive of all the Newsgroups going back to 1995.
Wayback This site can give you a view of pages as they were on specific dates. The archive goes back to 1996 thanks to Alexa. The archive is claimed to have 10 billion pages and 100 TB (terabytes) of data (October 2001). There is an option to have one's own site made inaccessible, so check on your competitors now before they hear about this facility!
The phrase "Lies, damn lies and statistics" applies to search engine usage claims - most companies seem to be able to drum up reasons why theirs is the best. My own advice is to try them all and see which appears to work best for you.
Alta Vista
Claimed in November 2003 to have 580 million searchable files indexed, with
the most extensive multi-media index - 50 million images, video and audio clips.
Guarantees index "freshness" rate - updated at least every 28 days. A unique
feature, which I have found very useful, is the Babelfish translator (remember Hitchhiker's
Guide to the Galaxy?), which will translate web pages from a selection of languages.
The result is useable if not necessarily syntactically accurate - I once
retrieved a German article on the Shell/Esso Brent Spar which translated as
the Brent "Save"!
AskMSR
This isn't even available to the public yet, but it's getting good reviews from
users. It's a Microsoft product, and uses basic language rules to interpret
plain language queries and translate them into sophisticated boolean search
strings.
Atomica
Atomica is a free one-click information service that works whenever you're online. Previously
known as GuruNet, it has been revised to add pay-for extras. It automatically
analyzes pointed-to text in context and pops up a simple window without linking
or leaving your document. You don't even have to select the word. GuruNet's
got reference information (dictionary, thesaurus and encyclopedia) and real-time
information (e.g. news, sports, weather or stock quotes). Similar to Nano and
Zapper in that it is an application that runs on your own computer.
Dipsie
A new search engine to be launched in summer 2004. This is designed to mine
the "deep web". Chicago-based Dipsie has a crawler that burrows deep
through forms and database interfaces to get to the specific information you
require. Having already indexed a whole database it can present the required
page without the need to go via the intermediate levels. At launch it is planned
to have 10 billion documents, some three times the size of Google's.
Direct Hit
Direct Hit provides relevance for any Internet search by analyzing the activity
of millions of previous Internet searchers. In February 2000, Direct Hit was
acquired by Ask Jeeves, Inc., and its technology was later incorporated into Teoma.
Eurekster
Eurekster is a privately-held US company with offices in San Francisco and New
Zealand. Eurekster (www.eurekster.com) is the only Internet search engine powered
by social networking technologies - delivering results that matter most to users
and their networks of friends and contacts. Eurekster continuously "learns"
from the behavior of users and their social networks to deliver personalized
search results and instant sharing of their popular Web destinations and searches.
The service allows users to share their knowledge and experiences with each
other, confidentially and privately.
Excite
I've experienced good results with this engine, which uses a "more like this"
concept for following links from relevant hits.
FAST (AlltheWeb)
AlltheWeb indexes over 2.1 billion web pages, 118 million multimedia files,
132 million FTP files, two million MP3s, 15 million PDF files and supports 49
languages, making it one of the largest search engines available to search enthusiasts.
AlltheWeb claims to provide the freshest information because "we update
our index every 7 to 11 days and index up to 800 news stories per minute from
3,000 news sources".
Gigablast
Small but effective engine running (April 2004) on 8 large but fairly standard
PCs. The code is capable of handling 40 search requests per second and indexing
eight million web pages per day.
Gold Rush http://www.wisdombuilder.com/prGoldRush.htm
Gold Rush is an artificial intelligence-based meta "find" engine from the makers
of Wisdombuilder.
Google
I find it can often locate items missed by other engines. Has some sponsored
advertising linked to the query, but these are clearly marked as such. One of
the best features is that it caches each page when indexing it so that even
if the page disappears on the original site, you can still retrieve it from
the Google cache. (Click on the cache link). It also can search for images.
Now has some 72 interface languages including Klingon! http://www.google.com/advanced_search?hl=xx-klingon.
Latest add-on is a news section, and an interesting search by sets. Recently
(March 2004), there have been reports of shortcomings: its link-search
function does not reveal all of the pages in its database, its advanced search
function ignores any data beyond 101kb in HTML
Web pages and 120kb in PDF files, and its OR-search operator has not worked
properly since November 2002. Yahoo recently stopped using Google as its search
engine and has begun using its own platform.
Guidestar
GuideStar is a national database of U.S. charitable organizations. It gathers
and distributes data on more than 850,000 IRS-recognized nonprofits.
GuruNet See Atomica
Hotbot
This is one of my favourites as it allows you to specify an "exact phrase"
search. This eliminates most irrelevant hits. However I'm increasingly
using Google.
iWon
iWon.com, based in Irvington, NY, is a CBS-backed destination Internet portal
powered by Inktomi. It encourages use of its services by offering cash prizes
for using the system. It was launched in October 1999.
Jayde (WebProNews)
Jayde.com is a cross between a directory and a search engine and now boasts
over 1.2 million site listings. Unlike site directories that are organized in
a series of nested sub-categories which force the user to drill down to find
meaningful results, Jayde offers visitors fast sub-category searches by making
each sub-category a keyword or keyword phrase that is searched in real-time
across the entire database.
Karnak
Atlanta-based VedaSource LLC unveiled a new Internet search service in early 1999
called Karnak. This is a service that searches the internet directly, and subsequently
maintains a watch on the sites of interest for any changes that occur while
the user is off-line. It is a free service for a single search profile, but
multiple profiles may be obtained for what seems to be a reasonable monthly
fee. Can take weeks to get all the relevant hits.
Looksmart
Looksmart claims a Web index of more than 1.1 billion indexed documents, and
a professionally edited directory (November 2003).
Mooter
Mooter is a new type of search engine that organises data into clusters. This
narrows the search to areas you are interested in and avoids trawling through
page after page of irrelevant hits. This is an Australian company, and the engine
works by analyzing the potential meanings and permutations of the submitted
keywords and presenting the data graphically as a star of clusters emanating
out from the core words. The system learns as it goes on, recognising areas
of interest inherent in the choice of subjects.
Northern Light
Northern Light's search service claimed to have over 25 million Special Collection
documents from over 7100 premium sources, and over 315 million Web pages (Feb
2001). A useful feature of Northern Light is that it dynamically groups
hits into meaningful categories, thus allowing the
searcher to identify the desired context. Premium information is also
available on a pay-per-item basis.
RocketLinks (Xuppa)
Launched in March 2000, Rocketlinks is
an advertiser-geared search engine in which the advertiser dictates his site's
placement by bidding on keywords. Uses Google search technology.
Search.msn.com (July 2004)
Search engine from Microsoft using their prototype MSNBot. This may or may not
be accessible when you read this, as Microsoft plans to take it down after a
while to fiddle with it based on user feedback. See the Tech
Preview page.
Subjex
A plain language search engine, designed to provide a very focussed search.
I have not had much success so far with it, but it is early days.
Superpages
A yellow pages lookup engine (US only).
Teoma
Also uses peer sites to rank results. Incorporates DirectHit. Teoma (Gaelic
for "expert") was designed at Rutgers, N.J. and powers Ask Jeeves.
It has been getting good reviews in that it looks for authorities within the
communities associated with the keywords and whether the authorities are peer
listed. Ranking is based on how often each page is cited by authority pages.
The result is a search in greater depth for the selected domain.
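The idea of ranking by how often authority pages cite a page can be illustrated with a toy example. This is only a rough sketch under stated assumptions, not Teoma's actual algorithm: the link graph, the page names, and the set of "authority" pages are all invented, and deciding which pages count as authorities is simply taken as given.

```python
def rank_by_authority_citations(links, authorities):
    """Count, for each page, how many authority pages link to it.

    links: dict mapping a source page to the list of pages it links to.
    authorities: set of pages already judged to be authorities (assumed given).
    Returns (page, count) pairs, most cited first, ties in alphabetical order.
    """
    counts = {}
    for source, targets in links.items():
        if source not in authorities:
            continue  # citations from non-authority pages are ignored
        for target in set(targets):
            counts[target] = counts.get(target, 0) + 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical link graph for illustration only.
links = {
    "hub1": ["page1", "page2"],
    "hub2": ["page1"],
    "random-site": ["page3"],
}
ranking = rank_by_authority_citations(links, authorities={"hub1", "hub2"})
```

In this toy graph `page1` ranks first because both authority pages cite it, while `page3` does not appear at all because its only citation comes from a page outside the authority set.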
Wisenut
This engine is claimed to be able to crawl and index 50M Web documents a day.
It is also claimed that the full production version of the WISEnut search engine will have
crawled and indexed more Web pages than any competitor and will be able to
refresh this comprehensive database once a month, using only a fraction of the
computing resources others use. The relevancy ranking system uses context-sensitive
link analysis not only to measure the relative importance of a given page, but
also to determine the relative relevancy of that page for a given query.
Yahoo
Yahoo announced in late February 2004 that it had deployed its own algorithmic
search technology and expected to continue the rollout on a worldwide basis over the
following several weeks. It has stopped using the Google search engine. The Web
portal also introduced a Content Acquisition Program in March 2004 designed
to index the billions of documents contained in public databases but that are
commonly inaccessible to search engines, or what's called the invisible or deep
Web. To this end, it has aligned with the Library of Congress, the University
of California at Los Angeles, National Public Radio, the University of Michigan
and Project Gutenberg, among others, to begin seeding its index with fresh,
searchable material for Web surfers' queries.
Meta search engines submit your search parameters to several search engines at a time. Some of them combine results. Certainly you get increased coverage, but also increased junk.
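The basic mechanics of a meta search (send one query to several engines, then combine the results) can be sketched as follows. This is an illustrative sketch only: `engine_a` and `engine_b` are hypothetical stand-ins that return canned URLs, not real engine APIs, and ranking by the number of engines that agree is just one plausible way of combining results.

```python
def engine_a(query):
    """Hypothetical engine returning canned results for illustration."""
    return ["http://example.com/a", "http://example.com/shared"]


def engine_b(query):
    """Second hypothetical engine with partly overlapping results."""
    return ["http://example.com/shared", "http://example.com/b"]


def meta_search(query, engines):
    """Submit the query to every engine and merge the results.

    Duplicate URLs are collapsed, and each URL remembers which engines
    returned it. URLs found by more engines are listed first; ties keep
    the order in which they were first seen.
    """
    found_by = {}
    for name, search in engines.items():
        for url in search(query):
            found_by.setdefault(url, []).append(name)
    return sorted(found_by.items(), key=lambda item: -len(item[1]))


results = meta_search("brent spar", {"A": engine_a, "B": engine_b})
```

The merged list shows the trade-off described above: coverage goes up because each engine contributes hits the other missed, but so does the volume of results to wade through, which is why some form of de-duplication and ranking is needed.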
Ask Jeeves
Uses normal English syntax to enter a query. Re-branded as Ask.com
with extra pay-for services.
Ask.com See above.
Beaucoup
Meta search engine run by one person, but quite good, because of the claimed
2,500 search engines linked. Uses the Mamma meta search engine, but categorises
searches to provide better focus.
Copernic
This is a piece of software that you download and run on your machine.
It updates itself on the internet to incorporate new search engines as they
become available. Recommended. Additionally, when installing Copernic
on your machine you get the excellent Gist translator from Alis Technologies.
Highway 61
This is an online meta search engine with a refreshing style of dialogue, and
some interesting quotations while you are browsing.
Mamma
Mamma.com claims to be the largest independently owned Meta Search Engine on
the Internet.
MESA
University of Hanover meta search engine for email addresses.
MetaCrawler (Go2Net), MetaFind
Has an exact phrase option.
Savvy Search (CNET)
Also has a phrase option.
Visit
This is a brand new meta search engine that shows results as icons with the
most relevant towards the centre. Arrows show how the pages are linked to one
another, including back-links. Placing the mouse pointer over an icon reveals
the content of the page (if activated). This is supposed to help users more
easily identify pages relevant to their search. It is a beta version from the
University of Illinois at Urbana-Champaign. It requires a dedicated browser
to be downloaded (free).
Vivisimo
New meta search engine from Carnegie Mellon University.
Deja was until recently the place to search for items appearing in newsgroups. On February 12, 2001, Google acquired the entire database (press release). The Deja service has folded, but Google has launched http://groups.google.com/, which as of May 2001 contains the entire archive again.
The invisible, or deep, web refers to data that is held in formats not accessible to a search engine, or buried too deep for the engine to reach, for example due to memory constraints.
Dipsie
A new search engine to be launched in summer 2004. Chicago-based Dipsie has
a crawler that burrows deep through forms and database interfaces to get to
the specific information you require. Having already indexed a whole database
it can present the required page without the need to go via the intermediate
levels. At launch it is planned to have 10 billion documents, some three times
the size of Google's.
Google
Google now indexes PDF, Microsoft Word, Excel, PowerPoint, Rich Text Format
and PostScript files. You can restrict the search to a specific filetype by
using the filetype: command, e.g. widgets filetype:doc
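If you run the same search against several file types, it can be handy to build the query strings programmatically. The helper below is a small sketch: only the `filetype:` operator itself comes from the text above, while the function name and the way it normalises the extension are assumptions made for illustration.

```python
def restrict_to_filetype(terms, extension):
    """Append a filetype: restriction to a set of search terms.

    A leading dot and upper-case letters in the extension are tolerated,
    so ".PDF" and "pdf" produce the same query.
    """
    ext = extension.lstrip(".").lower()
    return f"{terms.strip()} filetype:{ext}"


# Build one query per document format of interest.
queries = [restrict_to_filetype("widgets", ext) for ext in ("doc", "pdf", "xls")]
```

The resulting strings, such as `widgets filetype:doc`, can be pasted straight into the search box.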
Invisibleweb.com
Produced by Intelliseek.
Streamsage
One of the invisible areas not indexed by search engines has been audio/visual
material. StreamSage claims it has developed a technology which understands
the context of the information contained within audio/video content. This system
identifies the relevant portions of an audio/video file for any given term or
concept. These relevant sections can then be used for a variety of powerful
search & retrieval, knowledge management, data mining, or content management
applications.
Yahoo
The Web portal introduced a Content Acquisition Program in March 2004 designed
to index the billions of documents contained in public databases but that are
commonly inaccessible to search engines. To this end, it has aligned with the
Library of Congress, the University of California at Los Angeles, National Public
Radio, the University of Michigan and Project Gutenberg, among others, to begin
adding fresh, searchable material to its index.
Profusion
Intelliseek's main engine.
Try their beta site.
If you still haven't found what you're looking for, try:
http://websearch.about.com/internet/websearch/mbody.htm
http://library.albany.edu/internet/engines.html
Spider Foods - This site has lots of good info on use of search engines and optimising your site.
http://www.journalismnet.com/canada/searchengines.htm