Information about Controlled Vocabulary
Controlled vocabularies are used in subject indexing schemes, subject headings, thesauri and taxonomies. Controlled vocabulary schemes mandate the use of predefined, authorised terms that have been preselected by the designer of the vocabulary, as opposed to natural language vocabularies, where there is no restriction on the terms that can be used.
Controlled vocabulary in library and information science
In library and information science, a controlled vocabulary is a carefully selected list of words and phrases that are used to tag units of information (documents or works) so that they may be more easily retrieved by a search.[1][2] Controlled vocabularies solve the problems of homographs, synonyms and polysemes by ensuring that each concept is described using only one authorised term and that each authorised term in the controlled vocabulary describes only one concept. In short, controlled vocabularies ensure consistency and reduce the ambiguity inherent in normal human languages, where the same concept can be given different names.
For example, in the Library of Congress Subject Headings (a subject heading system that uses controlled vocabulary), authorised terms (subject headings in this case) have to be chosen to settle choices between variant spellings of the same concept (American versus British), between popular and scientific terms (cockroaches versus Periplaneta americana), and between synonyms (automobiles versus cars), among other difficult issues.
Choices of authorised terms are based on the principles of user warrant (what terms users are likely to use), literary warrant (what terms are generally used in the literature and documents) and structural warrant (terms chosen by considering the structure and scope of the controlled vocabulary).
Controlled vocabularies also typically handle the problem of homographs with qualifiers. For example, the term "pool" has to be qualified to refer to either the swimming pool or the game of pool, so that each authorised term or heading refers to only one concept.
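The one-term-per-concept rule and the use of qualifiers can be pictured as a small lookup table. The Python sketch below is illustrative only: the entry terms, the qualified headings such as "Pool (Game)" and the function name resolve are assumptions made for this example and are not taken from any real subject heading list.

```python
# Hypothetical authority file: every variant (entry term) points to exactly
# one authorised heading. All headings here are invented for illustration.
AUTHORITY = {
    "automobiles": "Automobiles",
    "cars": "Automobiles",   # synonym mapped to a single authorised term
    "colour": "Color",       # variant spelling mapped to a single authorised term
    "color": "Color",
}

# Homographs cannot be resolved automatically; each sense gets a qualifier.
HOMOGRAPHS = {
    "pool": ["Pool (Game)", "Pool (Swimming)"],
}


def resolve(term):
    """Return the authorised headings that a searcher's term may map to."""
    key = term.strip().lower()
    if key in HOMOGRAPHS:
        # Ambiguous term: the searcher or indexer must pick one qualified heading.
        return list(HOMOGRAPHS[key])
    if key in AUTHORITY:
        return [AUTHORITY[key]]
    return []  # the term is not in the vocabulary


print(resolve("Cars"))  # ['Automobiles']
print(resolve("pool"))  # ['Pool (Game)', 'Pool (Swimming)']
```

A term that maps to more than one heading signals that a choice between qualified senses is still required before indexing or searching.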
There are two main kinds of controlled vocabulary tools used in libraries: subject headings and thesauri. While the differences between the two are diminishing, there are still some minor differences.
Historically, subject headings were designed for catalogers to describe books in library catalogs, while thesauri were used by indexers to apply index terms to documents and articles. Subject headings tend to be broader in scope, describing whole books, while thesauri tend to be more specialised, covering very specific disciplines. Because of the card catalog system, subject headings also tend to have terms in indirect order (though with the rise of automated systems this is being removed), while thesaurus terms are always in direct order. Subject headings tend to use more pre-coordination of terms, in which the designer of the controlled vocabulary combines several concepts into one authorised subject heading (e.g., children and terrorism), while thesauri tend to use single direct terms. Lastly, thesauri list not only equivalent terms but also narrower, broader and related terms among the various authorised and non-authorised terms, while historically most subject headings did not.
For example, the Library of Congress Subject Headings did not have much syndetic structure until 1943, and it was not until 1985 that it began to adopt the thesaurus-style labels "Broader term" and "Narrower term".
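The cross-references that a thesaurus records (broader, narrower, related and equivalent terms) can be sketched as a simple data structure. The Python example below is a minimal illustration under stated assumptions: the class name Term, the helper expand_narrower and the small sports hierarchy are all hypothetical and are not drawn from an actual thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One authorised term with thesaurus-style cross-references."""
    label: str
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    related: list = field(default_factory=list)   # RT
    used_for: list = field(default_factory=list)  # UF: non-authorised equivalents


# Invented miniature thesaurus, keyed by authorised label.
THESAURUS = {
    "Sports": Term("Sports", narrower=["Team sports"]),
    "Team sports": Term("Team sports", broader=["Sports"],
                        narrower=["Association football", "Rugby football"]),
    "Association football": Term("Association football", broader=["Team sports"],
                                 related=["Rugby football"], used_for=["Soccer"]),
    "Rugby football": Term("Rugby football", broader=["Team sports"],
                           related=["Association football"]),
}


def expand_narrower(label):
    """Collect a term and every term below it in the hierarchy."""
    found = [label]
    for child in THESAURUS[label].narrower:
        found.extend(expand_narrower(child))
    return found


print(expand_narrower("Sports"))
# ['Sports', 'Team sports', 'Association football', 'Rugby football']
```

Expanding a search term to its narrower terms in this way is one common use of the hierarchy; the recursion assumes the narrower-term links contain no cycles.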
The terms are chosen and organized by trained professionals (including librarians and information scientists) who possess expertise in the subject area. Controlled vocabulary terms can accurately describe what a given document is actually about, even if the terms themselves do not occur within the document's text. Well-known subject heading systems include the Library of Congress Subject Headings, MeSH and Sears. Well-known thesauri include the Art and Architecture Thesaurus and the ERIC Thesaurus.
Choosing the authorised terms to be used is a tricky business. Besides the areas already considered above, the designer has to consider the specificity of the term chosen, whether to use direct entry, inter-consistency, and the stability of the language. Lastly, the balance between pre-coordination (in which case the degree of enumeration versus synthesis becomes an issue) and post-coordination in the system is another important issue.
Controlled vocabulary terms tagged to documents are metadata.
Types of indexing language
There are three main types of indexing languages.
- Controlled indexing language - Only approved terms can be used by the indexer to describe the document.
- Natural language indexing language - Any term from the document in question can be used to describe the document.
- Free indexing language - Any term (not only from the document) can be used to describe the document.
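The difference between the three types can be expressed as a rule for which proposed index terms are accepted. The Python sketch below is only an illustration: the function name allowed_terms, its parameters and the sample document are assumptions made for this example, and the natural language case handles single-word terms only, for simplicity.

```python
import re


def allowed_terms(kind, document_text, proposed, approved):
    """Filter proposed index terms according to the indexing language in use."""
    if kind == "controlled":
        # Only terms from the approved list may be used.
        return [t for t in proposed if t in approved]
    if kind == "natural":
        # Only words that occur in the document itself may be used.
        words = set(re.findall(r"[a-z]+", document_text.lower()))
        return [t for t in proposed if t.lower() in words]
    if kind == "free":
        # Any term may be used, whether or not it occurs in the document.
        return list(proposed)
    raise ValueError(f"unknown indexing language: {kind}")


doc = "The match report praises the goalkeeper."
proposed = ["goalkeeper", "Soccer", "exciting"]
approved = {"Soccer", "Rugby"}

print(allowed_terms("controlled", doc, proposed, approved))  # ['Soccer']
print(allowed_terms("natural", doc, proposed, approved))     # ['goalkeeper']
print(allowed_terms("free", doc, proposed, approved))        # ['goalkeeper', 'Soccer', 'exciting']
```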
In recent years, free text search has become a popular means of access to documents. This involves using natural language indexing with indexing exhaustivity set to maximum (every word in the text is indexed). Many studies have compared the efficiency and effectiveness of free text searches against searches of documents that have been indexed by experts using a few well-chosen controlled vocabulary descriptors.
Controlled vocabularies are often claimed to improve the accuracy of free text searching, for instance by reducing the number of irrelevant items in the retrieval list. These irrelevant items (false positives) are often caused by the inherent ambiguity of natural language. Take the English word football, for example. Football is the name given to a number of different team sports. Worldwide, the most popular of these is association football, which is also called soccer in several countries. The word football is also applied to rugby football (rugby union and rugby league), American football, Australian rules football, Gaelic football and Canadian football. A search for football will therefore retrieve documents about several completely different sports. A controlled vocabulary solves this problem by tagging the documents in such a way that the ambiguities are eliminated.
Compared to free text searching, the use of a controlled vocabulary can dramatically increase the performance of an information retrieval system, if performance is measured by precision (the percentage of documents in the retrieval list that are actually relevant to the search topic).
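Precision, together with recall (the share of all relevant documents that the search actually retrieves, discussed in the following paragraphs), can be computed directly from sets of document identifiers. The Python sketch below uses invented document identifiers and result lists purely for illustration; the numbers do not come from any study.

```python
def precision(retrieved, relevant):
    """Share of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Invented example: four documents are truly relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}
free_text_hits = {"d1", "d2", "d3", "d5", "d6", "d7", "d8", "d9"}
controlled_hits = {"d1", "d2"}

print(precision(free_text_hits, relevant), recall(free_text_hits, relevant))    # 0.375 0.75
print(precision(controlled_hits, relevant), recall(controlled_hits, relevant))  # 1.0 0.5
```

In this made-up case the controlled vocabulary search returns only relevant documents but misses half of them, while the free text search finds more of the relevant documents at the cost of many false positives.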
In some cases controlled vocabulary can enhance recall as well, because unlike natural language schemes, once the correct authorised term is searched, you don't need to worry about searching for other terms that might be synonyms of that term.
However, a controlled vocabulary search may also lead to unsatisfactory recall, in that it will fail to retrieve some documents that are actually relevant to the search question.
This is particularly problematic when the search question involves terms that are sufficiently tangential to the subject area that the indexer might have tagged the document using a different term (one that the searcher might nonetheless consider equivalent). Essentially, this can be avoided only by an experienced user of the controlled vocabulary whose understanding of the vocabulary coincides with the way it is used by the indexer.
Another possibility is that the article is just not tagged by the indexer because indexing exhaustivity is low. For example an article might mention football as a secondary focus, and the indexer might decide not to tag it with "football" because it is not important enough compared to the main focus. But it turns out that for the searcher that article is relevant and hence recall fails. A free text search would automatically pick up that article regardless.
On the other hand, free text searches have high exhaustivity (every word is searchable), so they have the potential for high recall (assuming the searcher solves the problem of synonyms by entering every combination), but they will have much lower precision.
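The trade-offs described above can be seen on a toy collection. In the Python sketch below, the three documents, their tags and the function names free_text_search and tag_search are all invented for illustration; real retrieval systems tokenise and rank far more carefully.

```python
# Invented three-document collection with indexer-assigned tags.
DOCS = {
    "d1": {"text": "The soccer league announced new fixtures.",
           "tags": {"Association football"}},
    "d2": {"text": "City budget debate touches on the football stadium.",
           "tags": {"Public finance"}},
    "d3": {"text": "American football rules on the forward pass.",
           "tags": {"American football"}},
}


def free_text_search(word):
    """Match the query word against every word of every document."""
    word = word.lower()
    return sorted(doc_id for doc_id, d in DOCS.items()
                  if word in d["text"].lower().split())


def tag_search(heading):
    """Match the query against the indexer-assigned headings only."""
    return sorted(doc_id for doc_id, d in DOCS.items()
                  if heading in d["tags"])


print(free_text_search("football"))        # ['d2', 'd3']
print(tag_search("Association football"))  # ['d1']
```

For a searcher interested in association football, the free text search misses d1 (which says "soccer") and returns d3 as a false positive, while the tag search finds d1 but misses d2, where football was only a secondary topic that the indexer chose not to tag.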
Controlled vocabularies can also quickly become outdated: in fast-developing fields of knowledge, the authorised terms needed might not be available if the vocabulary is not updated regularly. Even in the best-case scenario, controlled language is often not as specific as the words of the text itself. Indexers trying to choose the appropriate index terms might misinterpret the author, while a free text search is in no danger of doing so, because it uses the author's own words.
The use of controlled vocabularies can be costly compared to free text searches because human experts or expensive automated systems are necessary to index each entry. Furthermore, the user has to be familiar with the controlled vocabulary scheme to make the best use of the system. But as already mentioned, the control of synonyms and homographs can help increase precision.
Numerous methodologies have been developed to assist in the creation of controlled vocabularies, including faceted classification, which enables a given data record or document to be described in multiple ways.
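Faceted classification can be pictured as describing each record along several independent dimensions and then filtering on any combination of them. The Python sketch below is a minimal illustration: the facet names (sport, region, format), the records and the helper filter_by_facets are assumptions made for this example.

```python
# Invented records, each described along three independent facets.
RECORDS = [
    {"id": "r1", "facets": {"sport": "Association football",
                            "region": "Europe", "format": "Article"}},
    {"id": "r2", "facets": {"sport": "Association football",
                            "region": "South America", "format": "Book"}},
    {"id": "r3", "facets": {"sport": "Rugby football",
                            "region": "Europe", "format": "Article"}},
]


def filter_by_facets(records, **wanted):
    """Return ids of records whose facets match every requested value."""
    return [r["id"] for r in records
            if all(r["facets"].get(name) == value
                   for name, value in wanted.items())]


print(filter_by_facets(RECORDS, region="Europe"))                         # ['r1', 'r3']
print(filter_by_facets(RECORDS, sport="Association football",
                       format="Article"))                                 # ['r1']
```

Because each facet is assigned independently, the same record can be reached by region, by sport or by format without building a combined heading for every possible combination in advance.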
Applications
Controlled vocabularies, such as the Library of Congress Subject Headings, are an essential component of bibliography, the study and classification of books. They were initially developed in library and information science. In the 1950s, government agencies began to develop controlled vocabularies for the burgeoning journal literature in specialized fields; an example is the Medical Subject Headings (MeSH) developed by the U.S. National Library of Medicine. Subsequently, for-profit firms (called abstracting and indexing services) emerged to index the fast-growing literature in every field of knowledge. In the 1960s, an online bibliographic database industry developed based on dialup X.25 networking. These services were seldom made available to the public because they were difficult to use; specialist librarians called search intermediaries handled the searching job. In the 1980s, the first full text databases appeared; these databases contain the full text of the indexed articles as well as the bibliographic information. Online bibliographic databases have migrated to the Internet and are now publicly available; however, most are proprietary and can be expensive to use. Students enrolled in colleges and universities may be able to access some of these services without charge; some of these services may also be accessible without charge at a public library.
In large organizations, controlled vocabularies may be introduced to improve technical communication. The use of a controlled vocabulary ensures that everyone is using the same word to mean the same thing. This consistency of terms is one of the most important concepts in technical writing and knowledge management, where effort is expended to use the same word throughout a document or organization instead of slightly different ones to refer to the same thing.
Web searching could be dramatically improved by the development of a controlled vocabulary for describing Web pages; the use of such a vocabulary could culminate in a Semantic Web, in which the content of Web pages is described using a machine-readable metadata scheme. One of the first proposals for such a scheme is the Dublin Core Initiative.
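As a rough illustration of machine-readable page description, the Python sketch below renders a few Dublin Core elements as HTML meta tags, following the common convention of a DC. prefix with a schema link to the Dublin Core element set namespace. The function name dublin_core_meta and the sample record are assumptions made for this example; the subject values stand in for terms that would come from a controlled vocabulary.

```python
from html import escape


def dublin_core_meta(record):
    """Render a dict of Dublin Core elements as HTML meta tags."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, values in record.items():
        for value in values:
            # Escape quotes and angle brackets so the attribute stays valid.
            content = escape(value, quote=True)
            lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


# Invented sample record; each element may repeat, so values are lists.
page = {
    "title": ["Fixtures for the new season"],
    "creator": ["Example Author"],
    "subject": ["Association football", "Sports leagues"],
}

print(dublin_core_meta(page))
```

Keeping the subject values restricted to authorised terms is what would let a search tool treat pages tagged by different authors consistently.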
It is unlikely that a single metadata scheme will ever succeed in describing the content of the entire Web.[3] To create a Semantic Web, it may be necessary to draw from two or more metadata systems to describe a Web page's contents. The eXchangeable Faceted Metadata Language (XFML) is designed to enable controlled vocabulary creators to publish and share metadata systems. XFML is designed on faceted classification principles.[4]
References
- [1] Amy Warner, A taxonomy primer.
- [2] Karl Fast, Fred Leise and Mike Steckel, What is a controlled vocabulary?
- [3] Cory Doctorow, Metacrap.
- [4] Mark Pilgrim, This is XFML.
- Controlled Vocabularies Links to examples of thesauri and classification schemes.
- Controlled Vocabularies Links to examples of thesauri and classification schemes used in the domain of Agriculture, Fisheries, Forestry etc.
See also
- Authority control
- Controlled natural language
- Faceted classification
- Full text search
- Information retrieval
- Metadata
- Metadata registry
- Ontology (computer science)
- Semantic spectrum
- Terminology
- Technical terminology
- Text retrieval
- Thesaurus
- Vocabulary-based transformation
External links
- Controlled vocabularies: a glosso-thesaurus
- controlledvocabulary.com — explains how controlled vocabularies are useful in describing images and information for classifying content in electronic databases.
- GoPubMed - Explore PubMed/MEDLINE with the Controlled Vocabulary Gene Ontology
- MeshPubMed - Explore PubMed/MEDLINE with the Controlled Vocabulary MeSH
- ANSI/NISO Z39.19 - 2005 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
- Vocabulary Links: Thesaurus Design for Information Systems, seminar by Dr. Bella Hass Weinberg
This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.