iPedia.net
Home - Information - Dictionary - Articles - Video - Web

Introduction to Information Retrieval-Search Engines

This article aims to provide readers with an overview of the very basics of information retrieval. Understanding these principles can help you to optimise your website content for the search engines and also help you to analyse search engine algorithm changes. However, the details in this article are not intended to describe how modern search engines work, as they use many additional factors, including link analysis.


This article aims to provide readers with an overview of the very basics of information retrieval. Understanding these principles can help you to optimise your website content for the search engines and also help you to analyse search engine algorithm changes. However, the details in this article are not intended to describe how modern search engines work, as they use many additional factors, including link analysis.

Information retrieval (IR) is the science of searching for documents / within documents. Information retrieval techniques form some of the most fundamental elements of web search engine technology. This article will discuss information retrieval in the context of search engines.

Indexes

It is unrealistic to remotely access documents in real-time when performing a search, as it would be exceptionally slow and unreliable. Therefore a local index is created, which for search engines is done by a crawler (aka spider). Thus, when you perform a search you are not actually searching the web, but are searching a version of the web as seen and stored by the crawler at some point in the past.

The index would not usually contain the whole document (this may, however, be stored in a separate document cache), but stores a representation of the terms relevant to the document that is quickly and easily searchable. There are various stages to this process (not all systems will include all of these stages):

  1. Document

    This is the document in its raw format with all text, structure and formatting.

  2. Structure Analysis

    Recognising headings, paragraphs, titles, bold text, lists, ..., etc.

  3. Lexical Analysis

    Converting the characters in the document into a list of words. This process may include analysing digits, hyphens, punctuation and the case of letters. Proper Noun Analysis can use the case and format of words/phrases to identify important information such as names, places, dates and organisations.

  4. Stopwords Removal

    The removal of words which occur very often and provide no ability to discriminate between documents. For example: "the", "it", "is". However, it can be seen that some search engines leave these words in the index and remove them at the user query level. This allows "+word" queries to be performed.

  5. Stemming

    This is a conflation procedure which reduces variations of a word into a single root. For example, both "worked" and "working" may be reduced to "work". The Porter Stemming Algorithm can be used to perform stemming.

After these processes have been performed we have a list of index terms for this particular document.

Index Term Weighting

We now need to calculate to what degree a term is relevant to a particular document. The following is an example of a weighting scheme:

Other items may be a factor in deciding weight, such as: the terms position in the document, whether it was in the title, whether it was bold, whether it was in a list, ..., etc.

Reverse Index

We now have a list of terms (with their weights) for a given document. However, a list of documents that contain a particular word would be much more useful, rather than a list of words for a particular document. This is called a reverse index.

For example, if we had the following three documents:

  1. This is a file about website search engine optimisation
  2. A website design tutorial file
  3. A file about bespoke software design and development
Then the index terms for each document may be as follows (weights would be in parentheses):

  1. file(?), website(?), search(?), engine(?), optimisation(?)
  2. website(?), design(?), tutorial(?), file(?)
  3. file(?), bespoke(?), software(?), design(?), development(?)
However, the reverse index would be:

file: document1(?), document2(?), docuement3(?)

website:

document1(?), document2(?)

search:

document1(?)

engine:

document1(?)

optimisation:

document1(?)

design:

document2(?), document3(?)

tutorial:

document2(?)

bespoke:

document3(?)

software:

document3(?)

development:

document3(?)

The reverse index then allows us to easily find the relevant documents for a particular word

Similarity Matching

This is the process for computing the relevance of a document to a particular query. It can comprise:

Refreshing the Index

Documents can continually change, therefore the index needs to be continually refreshed. The crawler needs to decide how often to reindex particular documents, based on how often they are updated. If a document is not updated very often, then reindexing it very often would be a waste of resources. However, documents that are always changing need to be continually reindexed as they may no longer be relevant to terms they are currently indexed for.

Measuring Accuracy of IR Systems

Two of the simplest ways to assess the accuracy of a basic information retrieval system are Precision and Recall. These are calculated using the number of relevant documents and the number of retrieved documents (the documents perceived to be relevant by the system), the documents actually returned to the user are where these two sets of document overlap.

The documents actually returned from the retrieved documents set will be decided using some form of ranking mechanism (discussion of this is beyond the scope of this article).

Generally, there is a compromise between precision and recall, as increasing the number of documents retrieved is likely to also increase the number of irrelevant documents in the set of retrieved documents.

Web Search Engines

Web search engines (such as Google, Yahoo! and MSN) usually combine information retrieval techniques with link structure analysis, as well as many other unknown techniques. Obviously, the above techniques are very easily spammed, so any useful search engine would need to try to filter out spamming where possible.

Michael Hawthornthwaite works at Acid Computer Services in Manchester, they are web designers / developers and bespoke software developers.


...click on link for more information and related articles.


AddThis Social Bookmark Button    Digg this article.

Other articles

Article Categories
 

Albums, songs and lyrics - Hotels & Maps - Music & Cinema Encyclopedia
All content on this website, including articles, information, pictures, dictionary, and other reference data is for informational purposes only. This information should not be considered complete, up to date, and is not intended to be used in place of a visit, consultation, or advice of a legal, medical, or any other professional.
page counter