Information about Stemming

Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form — generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The algorithm has been a long-standing problem in computer science; the first paper on the subject was published in 1968. The process of stemming, often called conflation, is useful in search engines for query expansion or indexing and other natural language processing problems.

Stemming programs are commonly referred to as stemming algorithms or stemmers.

Examples

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".

History

The first ever published stemmer was written by Julie Beth Lovins in 1968. [1] This paper was remarkable for its early date and had great influence on later work in this area.

A later stemmer was written by Martin Porter and was published in the July 1980 issue of the journal Program. This stemmer was very widely used and became the de-facto standard algorithm used for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval.

Many implementations of this algorithm were written and freely distributed; however, many of these implementations contained subtle flaws. As a result, these stemmers did not match their potential. To eliminate this source of error, Martin Porter released an official free-software implementation of the algorithm around the year 2000. He extended this work over the next few years by building Snowball, a framework for writing stemming algorithms, and implemented an improved English stemmer together with stemmers for several other languages.

Algorithms

There are several types of stemming algorithms which differ in respect to performance and accuracy and how certain stemming obstacles are overcome.

Brute Force Algorithms

The term brute force is borrowed from a concept in artificial intelligence research and problem solving in mathematics denoted as brute force search. Brute force stemmers employ a lookup table which contains relations between root forms and inflected forms. To stem a word, the table is queried to find a matching inflection. If a matching inflection is found, the associated root form is returned.

Brute force approaches are criticized for their general lack of elegance in that no algorithm is applied that would more quickly converge on a solution. In other words, there are more operations performed during the search than should be necessary. Brute force searches consume immense amounts of storage to host the list of relations (relative to the task). The algorithm is only accurate to the extent that the inflected form already exists in the table. Given the number of words in a given language, like English, it is unrealistic to expect that all word forms can be captured and manually recorded by human action alone. Manual training of the algorithm is overly time-intensive and the ratio between the effort and the increase in accuracy is marginal at best.

Brute force algorithms do overcome some of the challenges faced by the other approaches. Not all inflected word forms in a given language "follow the rules" appropriately. While "running" might be easy to stem to "run" in a suffix stripping approach, the alternate inflection, "ran", is not. Suffix stripping algorithms are somewhat powerless to overcome this problem, short of increasing the number and complexity of the rules, but brute force algorithms only require storing a single extra relation between "run" and "ran". While that is true, this assumes someone bothered to store the relation in the first place, and one of the major problems of improving brute force algorithms is the coverage of the language.

Brute force algorithms are initially very difficult to design given the immense amount of relations that must be initially stored to produce an acceptable level of accuracy (the number can span well into the millions). However, brute force algorithms are easy to improve in that decreasing the stemming error is only a matter of adding more relations to the table. Someone with only a minor experience in linguistics is capable of improving the algorithm, unlike the suffix stripping approaches which require a solid background in linguistics.

For technical accuracy, some programs may use suffix stripping to generate the lookup table given a text corpus, and then only consult the lookup table when stemming. This is not regarded as a brute force approach, although a lookup table is involved.

Suffix Stripping Algorithms

Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms and root form relations. Instead, a typically smaller list of "rules" are stored which provide a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include:
  • if the word ends in 'ed', remove the 'ed'
  • if the word ends in 'ing', remove the 'ing'
  • if the word ends in 'ly', remove the 'ly'
Suffix stripping approaches enjoy the benefit of being much simpler to maintain than brute force algorithms, assuming the maintainer is sufficiently knowledgeable in the challenges of linguistics and morphology and encoding suffix stripping rules. Suffix stripping algorithms are sometimes regarded as crude given the poor performance when dealing with exceptional relations (like 'ran' and 'run'). The solutions produced by suffix stripping algorithms are limited to those lexical categories which have well known suffices with few exceptions. This, however, is a problem, as not all parts of speech have such a well formulated set of rules. Lemmatisation attempts to improve upon this challenge.

Lemmatisation Algorithms

A more complex approach to the problem of determining a stem of a word is lemmatisation. This process involves first determining the part of speech of a word, and applying different normalization rules for each part of speech. The part of speech is first detected prior to attempting to find the root since for some languages, the stemming rules change depending on a word's part of speech.

This approach is highly conditional upon obtaining the correct lexical category (part of speech). While there is overlap between the normalization rules for certain categories, identifying the wrong category or being unable to produce the right category limits the added benefit of this approach over suffix stripping algorithms. The basic idea is that, if we are able to grasp more information about the word to be stemmed, then we are able to more accurately apply normalization rules (which are, more or less, suffix stripping rules).

Stochastic Algorithms

Stochastic algorithms involve using probability to identify the root form of a word. Stochastic algorithms are trained (they "learn") on a table of root form to inflected form relations to develop a probabilistic model. This model is typically expressed in the form of complex linguistic rules, similar in nature to those in suffix stripping or lemmatisation. Stemming is performed by inputting an inflected form to the trained model and having the model produce the root form according to its internal ruleset, which again is similar to suffix stripping and lemmatisation, except that the decisions involved in applying the most appropriate rule, or whether or not to stem the word and just return the same word, or whether to apply two different rules sequentially, are applied on the grounds that the output word will have the highest probability of being correct (which is to say, the smallest probability of being incorrect, which is how it is typically measured).

Some lemmatisation algorithms are stochastic in that, given a word which may belong to multiple parts of speech, a probability is assigned to each possible part. This may take into account the surrounding words, called the context, or not. Context-free grammars do not take into account any additional information. In either case, after assigning the probabilities to each possible part of speech, the most likely part of speech is chosen, and from there the appropriate normalization rules are applied to the input word to produce the normalized (root) form.

Hybrid Approaches

Hybrid approaches use one or more of the approaches described above in unison. A simple example is a suffix tree algorithm which first consults a lookup table using brute force. However, instead of trying to store the entire set of relations between words in a given language, the lookup table is kept small and is only used to store a minute amount of "frequent exceptions" like "ran => run". If the word is not in the exception list, apply suffix stripping or lemmatisation and output the result.

Affix Stemmers

In linguistics, the term affix refers to either a prefix or a suffix. In addition to dealing with suffices, several approaches also attempt to remove common prefixes. For example, given the word indefinitely, identify that the leading "in" is a prefix that can be removed. Many of the same approaches mentioned earlier apply, but go by the name affix stripping.

Matching Algorithms

In this algorithm first of all we have a set of documents that contain stem word that stems can be not valid roots because staring common characters save as stem of course the length of stem of a word must be at a specific range according of the main word. For example "be" and "beside" between these two word "be" is not accepted as stem but "browse" and "browsing" stem will be "brows".

Language Challenges

While much of the work in this area has focused on the English language (with significant use of the Porter Stemmer algorithm), other languages have been investigated including at least French, Italian, Spanish, Portuguese, German, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Slovenian, Hebrew, and Arabic. Hebrew and Arabic are still considered difficult research languages for stemming. English stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person singular present form of the verb "dry", "axes" being the plural of "axe" as well as "axis"); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language becomes more complex. For example, an Italian stemmer is more complex than an English one (because of more possible verb inflections), a Russian one is more complex (more possible noun declensions), a Hebrew one is even more complex (due to non-catenative morphology and a writing system without vowels), and so on. On the other hand, stemmers for true isolating languages such as Vietnamese can be even simpler than those for English.

Error metrics

There are two error measurements in stemming algorithms, overstemming and understemming. Overstemming is an error where two separate inflected words are stemmed to the same root, but should not have been - a false positive. Understemming is an error where two separate inflected words should be stemmed to the same root, but are not - a false negative. Stemming algorithms attempt to minimize each type of error, although reducing one type can lead to increasing the other.

Applications

Information Retrieval

Stemmers are common elements in query systems such as Web search engines, since a user who runs a query on "daffodils" would probably also be interested in documents that contain the word "daffodil" (without the s). The effectiveness of stemming for English query systems were soon found to be rather limited, however, and this has led early Information retrieval researchers to deem stemming irrelevant in general.[2]. An alternative approach, based on searching for n-grams rather than stems, may be used instead. Also, recent research has shown greater benefits for retrieval in other languages.[3][4]

Usage in commercial products

Google search adopted word stemming in 2003[5]. Previously a search for "fish" would not have returned "fishing". Other software search algorithms vary in their use of word stemming. Programs that simply search for substrings obviously will find "fish" in "fishing" but when searching for "fishes" will not find occurrences of the word "fish".

See also

References

1. ^ Julie Beth Lovins (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics 11:22–31.
2. ^ Ricardo Baeza-Yates and Berthier Ribeiro-Neto (1999). Modern Information Retrieval. ACM Press/Addison Wesley.
3. ^ Jaap Kamps, Christof Monz, Maarten de Rijke and Börkur Sigurbjörnsson (2004). Language-dependent and Language-independent Approaches to Cross-Lingual Text Retrieval. In: C. Peters, J. Gonzalo, M. Braschler and M. Kluck, eds. Comparative Evaluation of Multilingual Information Access Systems. Springer Verlag, pp. 152–165.
4. ^ Eija Airio (2006). Word normalization and decompounding in mono- and bilingual IR. Information Retrieval 9:249–271.
5. ^ The Essentials of Google Search. Web Search Help Center. Google Inc.

Further reading

  • Dawson, J.L. (1974) Suffix Removal for Word Conflation, Bulletin of the Association for Literary and Linguistic Computing, 2(3): 33-46
  • Frakes, W.B. (1984) Term Conflation for Information Retrieval, Cambridge University Press
  • Frakes, W.B. & Fox, C.J. (2003) Strength and Similarity of Affix Removal Stemming Algorithms, SIGIR Forum, 37: 26-30
  • Frakes, W. B. (1992) Stemming algorithms, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ
  • Hafer, M.A. & Weiss, S.F., (1974) Word segmentation by letter successor varieties, Information Processing & Management 10 (11/12), 371-386.
  • Harman, D., (1991) How effective is suffixing? Journal of the American Society for Information Science 42 (1), 7-15.
  • Hull, D.A. (1996) Stemming Algorithms – A Case Study for Detailed Evaluation, JASIS, 47(1): 70-84
  • Hull, D.A. & Grefenstette, G. (1996) A Detailed Analysis of English Stemming Algorithms, Xerox Technical Report
  • Kraaij, W. & Pohlmann, R., 1996: Viewing stemming as recall enhancement, in H-P. Frei, D. Harman, P. Schauble & R. Wilkinson (eds.), Proceedings of the 17th ACM SIGIR conference held at Zurich, August 18-22, pp.40-48.
  • Krovetz, R. (1993) Viewing Morphology as an Inference Process, In Proceedings of ACM-SIGIR93, pp191-203
  • Lennon, M., Pierce, D.S., Tarry, B.D. & Willett, P. (1981) An Evaluation of some Conflation Algorithms for Information Retrieval, Journal of Information Science, 3: 177-183
  • Lovins, J. (1971) Error Evaluation for Stemming Algorithms as Clustering Algorithms, JASIS, 22: 28-40
  • Lovins, J. B. "Development of a Stemming Algorithm." Mechanical Translation and Computational Linguistics 11, 1968, 22--31.
  • Marie-Claire, J. and Smith, D. (2005) Conservative stemming for search and indexing
  • Paice, C.D. (1990) Another Stemmer, SIGIR Forum, 24: 56-61
  • Paice, C.D. (1996) Method for Evaluation of Stemming Algorithms based on Error Counting, JASIS, 47(8): 632-649
  • Popovic, M. and Willett, P., (1992) The effectiveness of stemmng for natural language access to Slovene textual data, Journal of the American Society for Information Science, 43(5), 384-390.
  • Porter, M.F. (1980) An Algorithm for Suffix Stripping, Program, 14(3): 130-137
  • Savoy, J., (1993) Stemming of French words based on grammatical categories Journal of the American Society for Information Science, 44(1), 1-9.
  • Ulmschneider, J.E. & Doszkocs, (1983) A practical stemming algorithm for online search assistance, Online Review, 7(4), 301-318.
  • Xu, J. & Croft, W.B., (1998) Corpus-based stemming using coocurrence of word variants, ACM Transactions on Information Systems, 16(1), 61-81.

External links

This article was originally based on material from the Free On-line Dictionary of Computing, which is licensed under the GFDL.


    In linguistics, a stem is the part of a word that is common to all its inflected variants. Stems are often roots, i.e. atomic (unanalyzable) lexical morphemes, but a stem can also be morphologically complex, as seen with compound words (cf.
    ..... Click the link for more information.
    The root is the primary lexical unit of a word, which carries the most significant aspects of semantic content and cannot be reduced into smaller constituents. Content words in nearly all languages contain, and may consist only of, root morphemes.
    ..... Click the link for more information.
    The root is the primary lexical unit of a word, which carries the most significant aspects of semantic content and cannot be reduced into smaller constituents. Content words in nearly all languages contain, and may consist only of, root morphemes.
    ..... Click the link for more information.
    In mathematics, computing, linguistics, and related disciplines, an algorithm is a finite list of well-defined instructions for accomplishing some task that, given an initial state, will proceed through a well-defined series of successive states, eventually terminating in an
    ..... Click the link for more information.
    Computer science, or computing science, is the study of the theoretical foundations of information and computation and their implementation and application in computer systems.
    ..... Click the link for more information.
    19th century - 20th century - 21st century
    1930s  1940s  1950s  - 1960s -  1970s  1980s  1990s
    1965 1966 1967 - 1968 - 1969 1970 1971

    Year 1968 (MCMLXVIII
    ..... Click the link for more information.
    search engine is an information retrieval system designed to help find information stored on a computer system. Search engines help to minimize the time required to find information and the amount of information which must be consulted, akin to other techniques for managing
    ..... Click the link for more information.
    Query expansion (QE) is the process of reformulating a seed query to improve retrieval performance in information retrieval operations.[1] In the context of web search engines, query expansion involves evaluating a user's input (what words were typed into the search
    ..... Click the link for more information.
    This article or section needs copy editing for grammar, style, cohesion, tone and/or spelling.
    You can assist by [ editing it] now. A how-to guide is available, as is general .
    This article has been tagged since January 2007.
    ..... Click the link for more information.
    Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics. It studies the problems of automated generation and understanding of natural human languages.
    ..... Click the link for more information.
    English}}} 
    Writing system: Latin (English variant) 
    Official status
    Official language of: 53 countries
    Regulated by: no official regulation
    Language codes
    ISO 639-1: en
    ISO 639-2: eng
    ISO 639-3: eng  
    ..... Click the link for more information.
    A string literal is the representation of a string value within the source code of a computer program. There exist numerous alternate notations for specifying string literals, and the exact notation depends on the individual programming language in question.
    ..... Click the link for more information.
    Dr. Martin Porter is the inventor of the Porter Stemmer and the Snowball programming framework. He has a personal homepage at [1]
    ..... Click the link for more information.
    19th century - 20th century - 21st century
    1950s  1960s  1970s  - 1980s -  1990s  2000s  2010s
    1977 1978 1979 - 1980 - 1981 1982 1983

    Year 1980 (MCMLXXX
    ..... Click the link for more information.
    The Strix award is an annual award for outstanding contributions to the field of information retrieval.

    The award has been presented since 1998 in memory of Dr Tony Kent, a past Fellow of the Institute of Information Scientists (IIS), who died in 1997.
    ..... Click the link for more information.
    20th century - 21st century
    1970s  1980s  1990s  - 2000s -  2010s  2020s  2030s
    1997 1998 1999 - 2000 - 2001 2002 2003

    2000 by topic:
    News by month
    Jan - Feb - Mar - Apr - May - Jun
    ..... Click the link for more information.


    Snowball is a small string-handling programming language whose name was chosen as a tribute to the SNOBOL programming language, with which it shares the concept of string patterns delivering signals
    ..... Click the link for more information.
    In grammar, a lexical category (also word class, lexical class, or in traditional grammar part of speech) is a linguistic category of words (or more precisely lexical items
    ..... Click the link for more information.
    In computing, lemmatisation is the process of determining the lemma for a given word. Since the process involves determining the part of speech of a word in a sentence, it requires knowledge of the grammar of a language, and it can therefore be a great deal of work to implement a
    ..... Click the link for more information.
    In grammar, a lexical category (also word class, lexical class, or in traditional grammar part of speech) is a linguistic category of words (or more precisely lexical items
    ..... Click the link for more information.
    Linguistics is the scientific study of language, which can be theoretical or applied. Someone who engages in this study is called a linguist.
    ..... Click the link for more information.
    An affix is a morpheme that is attached to a base morpheme such as a root or to a stem, to form a word. Affixes may be derivational, like English -ness and pre-, or inflectional, like English plural -s and past tense -ed.
    ..... Click the link for more information.
    French (français, pronounced [fʁɑ̃ˈsɛ]) is a Romance language originally spoken in France, Belgium, Luxembourg, and Switzerland, and today by about 300 million people around the world as either
    ..... Click the link for more information.
    Italian}}} 
    Official status
    Official language of:  European Union
     European Union
     Switzerland
     San Marino
    Vatican City
    Sovereign Military Order of Malta

    The template is . Please use instead.

    ..... Click the link for more information.

     Spanish, Castilian
    }}} 
    Writing system: Latin (Spanish variant)
    Language codes
    ISO 639-1: none
    ISO 639-2:
    ISO 639-3: —

    Spanish (
    ..... Click the link for more information.
    Portuguese}}} 
    Writing system: Latin alphabet (Portuguese variant) 
    Official status
    Official language of: Angola
    Brazil
    Cape Verde
    East Timor
    Equatorial Guinea
    Guinea-Bissau
    Macau (PRC)
    Mozambique
    Portugal
    São Tomé and Príncipe
    ..... Click the link for more information.
    German language (Deutsch, ] ) is a West Germanic language and one of the world's major languages.
    ..... Click the link for more information.
    Dutch}}} 
    Writing system: Latin alphabet (Dutch variant) 
    Official status
    Official language of:  Aruba
     Belgium
     European Union
     European Union
     Netherlands Antilles
     Suriname
    ..... Click the link for more information.
    Swedish}}} 
    Official status
    Official language of:  European Union
     European Union (in Noarootsi along with Estonian) [1]
     Finland
     Sweden (de facto)
    Nordic Council
    ..... Click the link for more information.
    Norwegian}}} 
    Official status
    Official language of:  Norway
    Nordic Council
    Regulated by: Norwegian Language Council
    Language codes
    ISO 639-1: no — Norwegian
    nb — Bokml
    nn — Nynorsk
    ..... Click the link for more information.


    This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.
    Herod_Archelaus


    page counter