In this lesson we discuss how search engines work in general terms; we
do not cover every possible scenario (or every search algorithm!).
What is a search engine, and how does it work?
What we think of as a search engine is really a team effort. The team has
three "members": a mechanism that identifies web pages to be included
in the database, a mechanism that indexes those pages, and a searching
mechanism with an interface that scans the index for keywords.
Users search the index (and hence the database or web documents) through
a query box or a template. Documents in which the search terms occur
are presented as "hits."
Although some facilities incorporate "natural language" searching
(searching by asking a question such as "Where are the doughnuts?"),
most search tools retrieve "hits" or "matches" by seeking occurrences of
your search terms within their databases and by attempting to match the
terms (converted to a "string" of data bits) against their indexes.
Because the terms are converted to a digital string, the search engine
must somehow be instructed to include plurals and alternate forms of a
term.
Note:
Although some search tools automatically include plurals, many do not.
If you are interested in "dogs," search for "dog or dogs" or use a
wildcard such as *. (A wildcard is a typed symbol which simply means
"put any character here.")
Some search engines also allow "stemming." (This involves using a special
character symbol which simply means "put any ending here after this
point.") An example: the term comput& (where & is the stemming symbol)
would bring up hits for the following words: computer, computers,
computing, computation, etc.
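The wildcard and stemming ideas in the note above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the function names are made up for this example, `*` is treated as "exactly one character" (following the definition in the note), and `&` is treated as "any ending, including none."

```python
import re

def term_to_regex(term, stem_symbol="&", wildcard="*"):
    """Translate a search term into a regular expression.

    Assumptions for this sketch: the wildcard stands for exactly one
    character, and the stemming symbol stands for any ending.
    """
    parts = []
    for ch in term:
        if ch == stem_symbol:
            parts.append(r"\w*")   # any ending, possibly empty
        elif ch == wildcard:
            parts.append(r"\w")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matching_words(term, words):
    """Return the words that match the whole term pattern."""
    regex = term_to_regex(term)
    return [w for w in words if regex.fullmatch(w)]

words = ["computer", "computers", "computing", "computation",
         "compact", "dog", "dogs", "dig"]

print(matching_words("comput&", words))
# ['computer', 'computers', 'computing', 'computation']
print(matching_words("d*g", words))
# ['dog', 'dig']
```

Note that "dogs" is not matched by `d*g`, which is why the note suggests searching for "dog or dogs" when a tool does not handle plurals for you.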

What's a 'bot?
A 'bot, otherwise known as an intelligent agent, spider, crawler, robot,
or worm, is an automated software program which may be programmed to
search for terms (data "strings") matching certain criteria. In terms of
web search engines, a 'bot identifies and notes the URLs of web pages to
be included in the database. Later, another 'bot comes along and works on
the interiors of the web documents, recording occurrences of
words and their positions within the text. This information is used to
create a huge index. 'Bots travel along the links of a web site; that is,
they crawl or traverse from one hypertext link to another.
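As a rough sketch of how a 'bot "crawls," the Python below follows links through a tiny made-up web held in memory. The URLs and the `LINKS` table are hypothetical, and a real 'bot would fetch pages over the network and extract their links; this only shows the traversal from one link to the next.

```python
from collections import deque

# Hypothetical miniature "web": each URL maps to the links on that page.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

def crawl(start_url, links):
    """Follow links breadth first and return URLs in the order first noted."""
    noted = [start_url]      # URLs to be included in the database
    seen = {start_url}       # guards against revisiting a page
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                noted.append(target)
                queue.append(target)
    return noted

print(crawl("http://example.com/", LINKS))
# ['http://example.com/', 'http://example.com/a',
#  'http://example.com/b', 'http://example.com/c']
```

The `seen` set matters: pages link back to one another, and without it the 'bot would go around in circles.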
What's the index for?
The index is how the search engine locates the URLs
which match your request. The web documents containing the query keywords
are presented as a listing which may include a brief summary of each site.
A simple way to understand the index is to think of it as a computerized
book index. To discover where a topic occurs in a book, we would look up
the word in the index, which would indicate the page number(s) where the
term occurs. Now imagine that every single word in the book is included
in the index. A computerized version might be represented like this:
| Keyword | Number of times keyword occurs in book | Position(s) in book of keyword | Page number(s) |
|---|---|---|---|
| Apple | 175 | title page; page 1: first paragraph, word #5; page 2: first paragraph, word #20, second paragraph, word #15; page 5: 2nd paragraph, word #21; etc.; in summary | title page, table of contents, pages 1, 2, 5, etc., 12, 25 |
| Orange | 22 | table of contents; page 3: first paragraph, word #3; page 17: first paragraph, word #30; page 21; etc. | table of contents, pages 3, 17, 21, etc. |
| Grape | 3 | page 50: 2nd paragraph, word #18; page 52: 1st paragraph, word #41; page 53: 1st paragraph, word #4 | pages 50, 52, 53 |
Some immediate observations might include:
- a) the word apple occurs a lot in the database
- b) the word apple occurs in the title
- c) the words apple and orange occur in the table of contents
- d) the word grape does not occur in the title or table of contents.
A search engine uses its index to retrieve web documents in which your
search terms occur. The index lists the term and where it occurs (the URL
or address of the web page), much like a book index.
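The book-index idea can be sketched as a small Python structure that maps each word to the pages (URLs) where it occurs and to its positions within each page. The pages and function names below are invented for illustration; real search engine indexes are far larger and more elaborate.

```python
from collections import defaultdict

# Hypothetical pages: URL -> text of the page.
PAGES = {
    "http://example.com/fruit": "apple orange apple",
    "http://example.com/pie": "apple pie recipe",
    "http://example.com/wine": "grape harvest",
}

def build_index(pages):
    """Map each word to {url: [positions of the word on that page]}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(url, []).append(position)
    return index

def search(index, term):
    """Return the URLs whose indexed text contains the term."""
    return sorted(index.get(term.lower(), {}))

index = build_index(PAGES)
print(index["apple"])
# {'http://example.com/fruit': [1, 3], 'http://example.com/pie': [1]}
print(search(index, "grape"))    # ['http://example.com/wine']
print(search(index, "banana"))   # []
```

The last call returns nothing because no indexed page contains the word: the search only ever consults the index, never the live web.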
Remember: a search engine returns hits only from its own database, that
is, web pages which it has indexed. So if the site you are looking
for has not yet been indexed, it won't be in the results listing no
matter how magnificent your search strategy or statement.

How does a search engine decide how to list web sites matching the search
terms I use?
Each search engine uses a different algorithm or method to calculate
something called "relevance," which it then "ranks." Have you ever
noticed the numbers which sometimes appear next to the URLs in a listing
of search results? This is the "relevance ranking." Relevance means the
probability that the "hit" or "match" is on-target with your query. The
creators of search engines change the way they calculate relevance and do
not tell us mere users their methodology; being high in the major search
engines' rankings on a topic means big business.
Sometimes Web site owners try to skew the
odds of appearing on the first page of "results" for folks searching
specific keywords. Being on the first page or in the top results increases
the likelihood that the site will be seen and hence selected by the user.
Unscrupulous folks "spam" the search engine to try to improve their
rankings (and hence their Web-based business) by a variety of methods,
including using "invisible text" (where text is colored the same as the
background) or repeatedly using keywords in "meta-tags" (descriptive
information not usually seen by the user except when viewing the "page
source" -- see below).
Perhaps most unsettling is the rising trend
of some search engines which, in effect, sell higher ratings to
companies willing to pay for the privilege. Most users will be unaware
that the set of search results has in effect been manipulated to boost
these companies' ratings artificially.
Exactly how relevance is calculated is protected, proprietary
information, but it is important to be aware that search engine providers
may have alliances or agreements with other businesses (reciprocal and/or
financial) which may affect search results.
In general, however, relevance is calculated
by noting where the term occurs within the text and assigning this
position a "weight" or level of importance. Some search utilities also
include a popularity element in the relevance calculation; that
is, the more a site is linked to or used, the higher the rating. Search
terms occurring in the title, summary, in key positions within a paragraph
or appearing several times within a paragraph usually carry more
"weight" because there is a higher probability that terms in these
positions indicate significant material on the topic.
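Since real relevance formulas are proprietary, the Python below is only a toy sketch of the general idea just described: occurrences in more important positions get more weight, and a popularity element is added on top. The weights, field names, and sample documents are all assumptions chosen for illustration.

```python
# Assumed weights, picked only for illustration; real engines keep theirs secret.
FIELD_WEIGHTS = {"title": 5.0, "summary": 3.0, "body": 1.0}
POPULARITY_WEIGHT = 0.5  # assumed bonus per inbound link

def relevance(term, document):
    """Score a document for one term using position weights plus popularity."""
    term = term.lower()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = document.get(field, "").lower().split()
        score += weight * words.count(term)
    if score == 0.0:
        return 0.0  # the term never occurs, so popularity alone earns nothing
    return score + POPULARITY_WEIGHT * document.get("inbound_links", 0)

DOCS = {
    "http://example.com/apples": {
        "title": "Apple growing",
        "summary": "How to grow an apple tree",
        "body": "plant the apple tree in spring",
        "inbound_links": 4,
    },
    "http://example.com/fruit": {
        "title": "Fruit basket",
        "summary": "Assorted fruit",
        "body": "an apple and an orange",
        "inbound_links": 1,
    },
}

ranked = sorted(DOCS, key=lambda url: relevance("apple", DOCS[url]), reverse=True)
for url in ranked:
    print(url, relevance("apple", DOCS[url]))
# http://example.com/apples 11.0
# http://example.com/fruit 1.5
```

Changing the assumed weights changes the order of the results, which is one reason different search engines list the same pages differently.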
This is very similar to our book index
example above: because the term apple occurs many times and
in key positions (title, table of contents, beginning of paragraphs), there
is a high probability that the document contains significant information
about apple. Note that orange also occurs in
the table of contents, an indication of the term's relative importance (it
is a significant topic, but not as important as apple). The
algorithm of the search engine and the methodology it uses to calculate
relevance emulate the observations and judgments we make based on our
experience. A search engine will return the book in our index example
as a hit when the search terms apple and
grape are requested, whereas a human might judge that
although the two terms occur within the document, there is no significant
relationship between them and the document is hence irrelevant.
Some search engines index documents by looking only in certain
fields, such as the title field, the first paragraph, and
something called "meta-tags." Meta-tags allow the creator of a web
site to add descriptive keywords which are not displayed in the actual web
documents; they are there specifically to enhance retrieval of the
document. As people "spam" the search engines (for example, by repeating
terms over and over again), meta-tags are decreasing in importance because
the folks that program the 'bots train them to overlook repetitions and
other clues to "spamming."
Note:
Because each search engine assigns relevancy rankings differently,
if you execute exactly the same search in several search engines you
will get different results in terms of how and where the URLs are
listed (even if the database contents are identical).
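To make the earlier point about meta-tags and "spamming" concrete, here is a small Python sketch that reads the keywords meta-tag from a page and flags keywords repeated more than a chosen number of times. The sample page, the class and function names, and the repetition limit are assumptions for illustration; how real 'bots detect spamming is not public.

```python
from collections import Counter
from html.parser import HTMLParser

class MetaKeywordParser(HTMLParser):
    """Collect the comma-separated entries of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if tag == "meta" and name == "keywords":
            content = attributes.get("content") or ""
            for keyword in content.split(","):
                if keyword.strip():
                    self.keywords.append(keyword.strip().lower())

def repeated_keywords(html, limit=2):
    """Return keywords that appear more than `limit` times (assumed threshold)."""
    parser = MetaKeywordParser()
    parser.feed(html)
    counts = Counter(parser.keywords)
    return {word: n for word, n in counts.items() if n > limit}

PAGE = (
    '<html><head><title>Fox facts</title>'
    '<meta name="keywords" content="fox, red fox, fox, fox, fox, wildlife">'
    '</head><body>All about foxes.</body></html>'
)

print(repeated_keywords(PAGE))
# {'fox': 4}
```

A 'bot that spots this kind of repetition could simply ignore the extra copies, which is why stuffing meta-tags with keywords buys less and less.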

What's the best search engine?
This is going to disappoint a lot of folks, but the answer is "the best
search engine is the one that fits the task" rather than a
recommendation of a particular utility. Until
you have some experience with knowledge seeking tools and, importantly,
with identifying your real information need, it may be difficult to
ascertain which tool is best for your purpose. For example, a query
on "Leonardo da Vinci's Mona Lisa" is likely to be more successful than
"that lady with the smile by a Renaissance artist" (or simply "da Vinci"),
and "dosage and usage guidelines for St. John's Wort" is likely to be more
successful than "St. John's Wort." But the good news is, you will make
better choices with experience.

What are simple ways to make my search more effective?
A very effective way to increase the relevance or precision of "hits"
is to search as a phrase. In most cases this simply means putting quotation
marks around the search terms. "Red Fox" is a different search
than red fox in most search engines. What you are actually doing
by searching as a phrase is using the concept of proximity, which
concerns the terms' physical closeness to one another. A document with
red and fox occurring close
or next to each other is more likely to be on target than a document with
red in the title and fox buried in the text.
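The difference between a phrase search and a plain search for both words can be sketched in Python as follows. The two sample documents and the function names are made up for this example, and the word splitting is deliberately simplistic.

```python
def phrase_match(text, phrase):
    """True if the words of the phrase occur next to each other, in order."""
    words = text.lower().split()
    target = phrase.lower().split()
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )

def min_distance(text, first, second):
    """Smallest number of word positions between two terms, or None."""
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    if not first_pos or not second_pos:
        return None
    return min(abs(i - j) for i in first_pos for j in second_pos)

doc_a = "the red fox hunts at night"
doc_b = "red barns line the road where a fox was seen"

print(phrase_match(doc_a, "red fox"), min_distance(doc_a, "red", "fox"))
# True 1
print(phrase_match(doc_b, "red fox"), min_distance(doc_b, "red", "fox"))
# False 7
```

Both documents contain red and fox, so a plain search would return both; only the first survives the phrase search, and its small distance value reflects the proximity of the two terms.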
Another way to increase your search
effectiveness is to be as specific as possible; that is, include as many
terms and synonyms as you can think of to fully describe your topic.
Instead of
women and computers
try
(woman or women) and (technology or computer) and (training or
professional development) and (barriers or problems)
Note: search utilities may not support
the use of parentheses (called nesting) in basic searches although
many support them in their "advanced" searches.
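A nested query like the one above can be sketched in Python as a list of groups: the terms inside a group are joined by "or," and the groups are joined by "and." The sample documents are invented, and the matching uses simple substring tests (so "computer" also matches "computers"), which is a simplification rather than how any given search utility behaves.

```python
# Each inner list is one parenthesized "or" group; groups are "and"-ed together.
NESTED_QUERY = [
    ["woman", "women"],
    ["technology", "computer"],
    ["training", "professional development"],
    ["barriers", "problems"],
]
SIMPLE_QUERY = [["women"], ["computers"]]

def matches(text, query):
    """True if every group has at least one term occurring in the text."""
    text = text.lower()
    return all(any(term in text for term in group) for group in query)

doc_1 = "Problems women face in professional development with computers"
doc_2 = "Women and computers in advertising"

print(matches(doc_1, SIMPLE_QUERY), matches(doc_2, SIMPLE_QUERY))
# True True
print(matches(doc_1, NESTED_QUERY), matches(doc_2, NESTED_QUERY))
# True False
```

The simple query returns both documents, while the more specific nested query keeps only the one that is actually about the topic.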
So to recap, phrase
searching and specificity are two simple ways to
increase precision in searching.

What are the most popular and useful search utilities? (the "major" search
engines)
Ok folks. We are looking at a sampling of search engines and
describing generalities; we are not attempting to create a definitive
listing. For example, we'll be discussing meta search engines in
Chapter 7, so you won't find them listed here.
- Alta Vista (http://www.altavista.com)
Originally developed by Digital
Equipment Corporation, Alta Vista searches the Web and
Usenet. Its very large database supports both simple and advanced
searching, with the ability to limit searches to select portions of web
documents. For example, it is possible to limit searches to titles,
domains, images and links within Web documents and to particular
newsgroups or subjects in Usenet. It also offers the ability to browse by
subject (although this is rather slow).
- Excite (http://www.excite.com/)
Search site featuring a very large
database and a lot of "extras" such as Excite Channels (a guide to sites
by subject), stock quotes, news, TV and searching of newsgroups.
Offers concept searching.
- HotBot (http://www.hotbot.com/)
Voted no. 1 among search engines by PC
Magazine, HotBot offers a sophisticated interface with a vast array of
options such as searching by date, by certain domains in the U.S.
(e.g. .com, .org, .edu, .gov), and by media type (e.g. image, audio,
video). It also offers a huge database, powerful advanced searching
options, access to other search tools by type and a subject guide.
There are more "major search engines" for you to evaluate in
Assignments.

Specialized Search Engines and Collections:
Specialized search engines are most often programmed to "collect" web
documents along a topical theme, for example in the Arts, Science, or
Health-related topics, or in even more specialized subjects such as
Ancient History of the Mediterranean.
Also fitting in this category are "search
tools" that really calculate rather than retrieve information (such as
those fitting in the "distance between two points" or "salary
differential" categories). Since it is impossible to list specific tools
here, the following are sites which group or list subject specific search
engines or tools:
- Beaucoup (http://www.beaucoup.com)
Beaucoup is a collection of approximately 1000
search engines, directories and indices from all over the world,
organized into categories such as General Searchers, Reviewed
Sites/What's New, Software, Reference, Education, Art/Graphics,
Social/Environmental/Political Concerns, and Consumer Medicine. Good
starting point for popular subjects.
- Internet Search Engine Collections
(http://library.albany.edu/internet/engines.html#collections)
from the University at Albany by Laura Cohen.
