Information about Parser



Enlarge picture
An example of parsing a mathematical expression.
In computer science and linguistics, parsing (more formally syntactic analysis) is the process of analyzing a sequence of tokens to determine its grammatical structure with respect to a given formal grammar. A parser is the component of a compiler that carries out this task.

Parsing transforms input text into a data structure, usually a tree, which is suitable for later processing and which captures the implied hierarchy of the input. Lexical analysis creates tokens from a sequence of input characters and it is these tokens that are processed by a parser to build a data structure such as parse tree or abstract syntax trees.

Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of inflected languages, such as the Roman languages or Latin.

Parser generators are tools that can automatically generate a parser (in some programming language) from a grammar written in Backus-Naur form (e.g. Yacc - yet another compiler compiler).

Human languages

In some machine translation and natural language processing systems, human languages are parsed by computer programs. Human sentences are not easily parsed by programs, as there is substantial ambiguity in the structure of human language. In order to parse natural language data, researchers must first agree on the grammar to be used. The choice of syntax is affected by both linguistic and computational concerns; for instance some parsing systems use lexical functional grammar, but in general, parsing for grammars of this type is known to be NP-complete. Head-driven phrase structure grammar is another linguistic formalism which has been popular in the parsing community, but other research efforts have focused on less complex formalisms such as the one used in the Penn Treebank. Shallow parsing aims to find only the boundaries of major constituents such as noun phrases. Another popular strategy for avoiding linguistic controversy is dependency grammar parsing.

Most modern parsers are at least partly statistical; that is, they rely on a corpus of training data which has already been annotated (parsed by hand). This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts. (See machine learning.) Approaches which have been used include straightforward PCFGs (probabilistic context free grammars), maximum entropy, and neural nets. Most of the more successful systems use lexical statistics (that is, they consider the identities of the words involved, as well as their part of speech). However such systems are vulnerable to overfitting and require some kind of smoothing to be effective.

Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties as with manually-designed grammars for programming languages. As mentioned earlier some grammar formalisms are very computationally difficult to parse; in general, even if the desired structure is not context-free, some kind of context-free approximation to the grammar is used to perform a first pass. Algorithms which use context-free grammars often rely on some variant of the CKY algorithm, usually with some heuristic to prune away unlikely analyses to save time. (See chart parsing.) However some systems trade speed for accuracy using, eg, linear-time versions of the shift-reduce algorithm. A somewhat recent development has been parse reranking in which the parser proposes some large number of analyses, and a more complex system selects the best option.

Programming languages

The most common use of a parser is as a component of a compiler. This parses the source code of a computer programming language to create some form of internal representation. Programming languages tend to be specified in terms of a context-free grammar because fast and efficient parsers can be written for them. Parsers are usually not written by hand but are generated by parser generators.

Context-free grammars are limited in the extent to which they can express all of the requirements of a language. Informally, the reason is that the memory of such a language is limited. The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars that can express this constraint, however, cannot be parsed efficiently. Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out.

Overview of process

The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.

The first stage is the token generation, or lexical analysis, by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions. For example, a calculator program would look at an input such as "12*(3+4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^ and 2, each of which is a meaningful symbol in the context of an arithmetic expression. The parser would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated.

The next stage is syntactic parsing or syntactic analysis, which is checking that the tokens form an allowable expression. This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear. However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. These rules can be formally expressed with attribute grammars.

The final phase is semantic parsing or analysis, which is working out the implications of the expression just validated and taking the appropriate action. In the case of a calculator, the action is to evaluate the expression; a compiler, on the other hand, would generate code. Attribute grammars can also be used to define these actions.

Types of parsers

The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in essentially two ways:
  • Top-down parsing - A parser can start with the start symbol and try to transform it to the input. Intuitively, the parser starts from the largest elements and breaks them down into incrementally smaller parts. LL parsers are examples of top-down parsers.
  • Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parser is Shift-Reduce parsing.
Another important distinction is whether the parser generates a leftmost derivation or a rightmost derivation (see context-free grammar). LL parsers will generate a leftmost derivation and LR parsers will generate a rightmost derivation (although usually in reverse).

Examples of parsers

Top-down parsers

Some of the parsers that use top-down parsing include:

Bottom-up parsers

Some of the parsers that use bottom-up parsing include:

References

See also

Parsing concepts

Parser Development Software

See also the comparison table at List of parser generators.

Free Software

Wikimedia

Commercial

Computer science, or computing science, is the study of the theoretical foundations of information and computation and their implementation and application in computer systems.
..... Click the link for more information.
Linguistics is the scientific study of language, which can be theoretical or applied. Someone who engages in this study is called a linguist.
..... Click the link for more information.
In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. Programs performing lexical analysis are called lexical analyzers or lexers. A lexer consists of a scanner and a tokenizer.
..... Click the link for more information.
In computer science and linguistics, a formal grammar, or sometimes simply grammar, is a precise description of a formal language — that is, of a set of strings over some alphabet.
..... Click the link for more information.
compiler is a computer program (or set of programs) that translates text written in a computer language (the source language) into another computer language (the target language).
..... Click the link for more information.
In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. Programs performing lexical analysis are called lexical analyzers or lexers. A lexer consists of a scanner and a tokenizer.
..... Click the link for more information.
A parse tree or concrete syntax tree is a tree that represents the syntactic structure of a string according to some formal grammar. A program that produces such trees is called a parser.
..... Click the link for more information.
In computer science, an abstract syntax tree (AST) is a finite, labeled, directed tree, where the internal nodes are labeled by operators, and the leaf nodes represent the operands of the operators. Thus, the leaves are NULL operators and only represent variables or constants.
..... Click the link for more information.
inflection or inflexion is the modification or marking of a word (or more precisely lexeme) to reflect grammatical (that is, relational) information, such as gender, tense, number or person.
..... Click the link for more information.
Romance languages (sometimes referred to as Romanic languages) are a branch of the Indo-European language family that comprisies all the languages that descend from Latin, the language of the Roman Empire.
..... Click the link for more information.
Latin}}} 
Official status
Official language of: Vatican City
Used for official purposes, but not spoken in everyday speech
Regulated by: Opus Fundatum Latinitas
Roman Catholic Church
Language codes
ISO 639-1: la
ISO 639-2: lat
..... Click the link for more information.
The computer program yacc is a parser generator developed by Stephen C. Johnson at AT&T for the Unix operating system. The name is an acronym for "Yet Another Compiler Compiler.
..... Click the link for more information.
Machine translation, sometimes referred to by the acronym MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another.
..... Click the link for more information.
Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics. It studies the problems of automated generation and understanding of natural human languages.
..... Click the link for more information.
Syntactic ambiguity is a property of sentences which may be reasonably interpreted in more than one way, or reasonably interpreted to mean more than one thing. Ambiguity may or may not involve one word having two parts of speech or homonyms.
..... Click the link for more information.
Grammar is the study of the rules governing the use of a given natural language, and as such a field of linguistics. Traditionally, grammar included morphology and syntax, in modern linguistics commonly expanded by the subfields of phonetics, phonology, orthography, semantics, and
..... Click the link for more information.
Linguistic may refer to:
  • Natural language, human language that is spoken, written, or signed for general communication
  • Linguistics, the scientific study of human language
See also:
  • Neuro-linguistic programming (NLP)

..... Click the link for more information.
Lexical functional grammar (LFG) is a grammar framework in theoretical linguistics, a variety of generative grammar. The development of the theory was initiated by Joan Bresnan and Ronald Kaplan in the 1970s, in reaction to the direction research in the area of transformational
..... Click the link for more information.
In complexity theory, the NP-complete problems are the most difficult problems in NP ("non-deterministic polynomial time") in the sense that they are the smallest subclass of NP that could conceivably remain outside of P, the class of deterministic polynomial-time problems.
..... Click the link for more information.
Head-driven phrase structure grammar (HPSG) is a highly lexicalized, non-derivational generative grammar theory developed by Carl Pollard and Ivan Sag (1985). It is the immediate successor to generalized phrase structure grammar.
..... Click the link for more information.
A treebank is a text corpus in which each sentence has been annotated with syntactic structure. Syntactic structure is commonly represented as a tree structure, hence the name treebank.
..... Click the link for more information.
Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the constituents (noun groups, verbs,...), but does not specify their internal structure, nor their role in the main sentence.
..... Click the link for more information.
Dependency grammar (DG) is a class of syntactic theories developed by Lucien Tesnière. It is distinct from phrase structure grammars, as it lacks phrasal nodes. Structure is determined by the relation between a word (a head) and its dependents.
..... Click the link for more information.
Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities.
..... Click the link for more information.
machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn". At a general level, there are two types of learning: inductive, and deductive.
..... Click the link for more information.
The principle of maximum entropy is a method for analyzing the available information in order to determine a unique epistemic probability distribution. It states that the least biased
..... Click the link for more information.
Traditionally, the term neural network had been used to refer to a network or circuitry of biological neurons. The modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes.
..... Click the link for more information.
In grammar, a lexical category (also word class, lexical class, or in traditional grammar part of speech) is a linguistic category of words (or more precisely lexical items
..... Click the link for more information.
overfitting is fitting a statistical model that has too many parameters. An absurd and false model may fit perfectly if the model has enough complexity by comparison to the amount of data available. Overfitting is generally recognized to be a violation of Occam's razor.
..... Click the link for more information.

..... Click the link for more information.


This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.
Herod_Archelaus


page counter