Information about Tagged Union

In computer science, a tagged union, also called a variant, variant record, discriminated union, or disjoint union, is a data structure used to hold a value that could take on several different, but fixed types. Only one of the types can be in use at any one time, and a tag field explicitly indicates which one is in use. It can be thought of as a type which has several "cases," each of which should be handled correctly when that type is manipulated. Like ordinary unions, tagged unions can save storage by overlapping storage areas for each type, since only one is in use at a time.

Tagged unions are most important in functional languages such as ML and Haskell, where they are called datatypes (see algebraic data type) and the compiler is able to verify that all cases of a tagged union are always handled, avoiding many types of errors. They can, however, be constructed in nearly any language, and are much safer than untagged unions, often simply called unions, which are similar but do not explicitly keep track of which member of the union is currently in use.

Tagged unions are often accompanied by the concept of a constructor, which is similar but not the same as a constructor for a class. Constructors produce a tagged union value, given the initial tag value and a value of the corresponding type.

Mathematically, tagged unions correspond to disjoint or discriminated unions, usually written using +. Given an element of a disjoint union A + B, it is possible to determine whether it came from A or B. If an element lies in both, there will be two effectively distinct copies of the value in A + B, one from A and one from B.

In type theory, a tagged union is called a sum type. Notations vary, but usually the sum type comes with two introduction forms and . The elimination form is case analysis: if has type and and have type under the assumptions and respectively, then the term has type . The sum type corresponds to logical disjunction under the Curry-Howard correspondence.

Advantages and disadvantages

The primary advantage of a tagged union over an untagged union is that all accesses are safe, and the compiler can even check that all cases are handled. Untagged unions depend on program logic to correctly identify the currently active field, which may result in strange behavior and hard-to-find bugs if that logic fails.

The primary advantage of a tagged union over a simple record containing a field for each type is that it saves storage by overlapping storage for all the types. Some implementations reserve enough storage for the largest type, while others dynamically adjust the size of a tagged union value as needed. When the value is immutable, or cannot be changed, it is simple to allocate just as much storage as is needed.

The main disadvantage of tagged unions is that the tag occupies space. Since there are usually a small number of alternatives, the tag can often be squeezed into 2 or 3 bits wherever space can be found, but sometimes even these bits are not available. In this case, a helpful alternative may be folded or encoded tags, where the tag value is dynamically computed from the contents of the union field. The most common examples of this are the use of reserved values, where, for example, a function returning a positive number may return -1 to indicate failure. Another common example is the tagged pointer.

Sometimes, untagged unions are used to perform bit-level conversions between types, called reinterpret casts in C++. Tagged unions are not intended for this purpose; typically a new value is assigned whenever the tag is changed.

Many languages support, to some extent, a universal data type, which is a type that includes every value of every other type, and often a way is provided to test the actual type of a value of the universal type. These are sometimes referred to as variants. While universal data types are comparable to tagged unions in their formal definition, typical tagged unions include a relatively small number of cases, and these cases form different ways of expressing a single coherent concept, such as a data structure node or instruction. Also, there is an expectation that every possible case of a tagged union will be dealt with when it is used. The values of a universal data type are not related and there is no feasible way to deal with them all.

Examples

Say we wanted to build a binary tree of integers. In ML, we would do this by creating a datatype like this:

>
datatype tree = Leaf
              | Node of (int * tree * tree)


This is a tagged union with two cases: one, the leaf, is used to terminate a path of the tree, and functions much like a null value would in imperative languages. The other branch holds a node, which contains an integer and a left and right subtree. Leaf and Node are the constructors, which enable us to actually produce a particular tree, such as:

>
Node(5, Node(1,Leaf,Leaf), Node(3, Leaf, Node(4, Leaf, Leaf)))


which corresponds to this tree:

The tree produced by the above constructors


Now we can easily write a typesafe function that, say, counts the number of nodes in the tree:

>
fun countNodes(Leaf) = 0
  | countNodes(Node(int,left,right)) =
      1 + countNodes(left) + countNodes(right)

Time line of language support

1960s

In Algol 68, tagged unions are called united modes, the tag is implicit, and the case construct is used to determine which field is tagged:

mode node = union (real, int, compl, string);

Usage example for union case of node:

node n := "1234";   case n in (real r): print(("real:", r)), (int i): print(("int:", i)), (compl c): print(("compl:", c)), (string s): print(("string:", s)) out print(("?:", n)) esac

1970s & 1980s

Although primarily only functional languages such as ML and Haskell (from 1990s) give a central role to tagged unions and have the power to check that all cases are handled, other languages have support for tagged unions as well. However, in practice they can be less efficient in non-functional languages due to optimizations enabled by functional language compilers that can eliminate explicit tag checks and avoid explicit storage of tags.

Pascal, Ada, and Modula-2 call them variant records, and require the tag field to be manually created and the tag values specified, as in this Pascal example:

>
type shapeKind = (square, rectangle, circle);
     shape = record
                centerx : integer;
                centery : integer;
                case kind : shapeKind of
                   square : (side : integer);
                   rectangle : (length, height : integer);
                   circle : (radius : integer);
	      end;


In C and C++, a tagged union can be created from untagged unions using a strict access discipline where the tag is always checked:

>
enum ShapeKind { Square, Rectangle, Circle };
struct Shape {
    int centerx;
    int centery;
    enum ShapeKind kind;
    union {
        struct { int side; }           squareData;
        struct { int length, height; } rectangleData;
        struct { int radius; }         circleData;
    } shapeKindData;
};
int getSquareSide(struct Shape* s) {
    assert(s->kind == Square);
    return s->shapeKindData.squareData.side;
}
void setSquareSide(struct Shape* s, int side) {
    s->kind = Square;
    s->shapeKindData.squareData.side = side;
}
/* and so on */


As long as the union fields are only accessed through the functions, the accesses will be safe and correct. The same approach can be used for encoded tags; we simply decode the tag and then check it on each access. If the inefficiency of these tag checks is a concern, they may be automatically removed in the final version.

C and C++ also have language support for one particular tagged union: the possibly-null pointer. This may be compared to the option type in ML or the Maybe type in Haskell, and can be seen as a tagged union (with an encoded tag) of two types:
  • Valid pointers,
  • A type with only one value, null, indicating an exceptional condition.
Unfortunately, C compilers do not verify that the null case is always handled, and this is a particularly prevalent source of errors in C code, since there is a tendency to ignore exceptional cases.

2000s

One advanced dialect of C called Cyclone has extensive built-in support for tagged unions. See the tagged union section of the on-line manual for more information.

Class hierarchies as tagged unions

In a typical class hierarchy in object-oriented programming, each subclass can encapsulate data unique to that class. The metadata used to perform virtual method lookup (for example, the object's vtable pointer in most C++ implementations) identifies the subclass and so effectively acts as a tag identifying the particular data stored by the instance. An object's constructor sets this tag, and it remains constant throughout the object's lifetime.

External links

Computer science, or computing science, is the study of the theoretical foundations of information and computation and their implementation and application in computer systems.
..... Click the link for more information.
In set theory, a disjoint union (or discriminated union) is a modified union operation which indexes the elements according to which set they originated in.

Formally, let be a family of sets indexed by I.
..... Click the link for more information.
data structure is a way of storing data in a computer so that it can be used efficiently. Often a carefully chosen data structure will allow the most efficient algorithm to be used. The choice of the data structure often begins from the choice of an abstract data type.
..... Click the link for more information.
In computer science, a union is a data structure that stores one of several types of data at a single location. There are only two safe ways of accessing a union object. One is to always read the field of a union most recently assigned; tagged unions enforce this restriction.
..... Click the link for more information.
ML
Paradigm: multi-paradigm: imperative, functional
Appeared in: 1973
Designed by: Robin Milner & others at the University of Edinburgh
Typing discipline: static, strong, inferred
Dialects: Standard ML, OCaml, F#
Influenced by: ISWIM
..... Click the link for more information.
Haskell

Paradigm: functional, non-strict, modular
Appeared in: 1990
Designed by: Simon Peyton-Jones, Paul Hudak[1], Philip Wadler, et al
Typing discipline: static, strong, inferred
Major implementations: GHC, Hugs, NHC , JHC , Yhc
..... Click the link for more information.
In computer programming, an algebraic data type is a datatype each of whose values is data from other datatypes wrapped in one of the constructors of the datatype. Any wrapped data is an argument to the constructor.
..... Click the link for more information.
In object-oriented programming, a constructor (sometimes shortened to ctor) in a class is a special block of statements called when an object is created, either when it is declared (statically constructed on the stack, possible in C++ but not in Java and other
..... Click the link for more information.
In set theory, a disjoint union (or discriminated union) is a modified union operation which indexes the elements according to which set they originated in.

Formally, let be a family of sets indexed by I.
..... Click the link for more information.
In mathematics, logic and computer science, type theory is any of several formal systems that can serve as alternatives to naive set theory, or the study of such formalisms in general.
..... Click the link for more information.
or, also known as logical disjunction or inclusive disjunction is a logical operator that results in true whenever one or more of its operands are true. In grammar, or is a coordinating conjunction.
..... Click the link for more information.
In computer science, object composition (not to be confused with function composition) is a way and practice to combine simple objects or data types into more complex ones.
..... Click the link for more information.
In computer science, a tagged pointer is a common example of a tagged union, where the primary type of data to be stored in the union is a pointer. Often, the tag in a tagged pointer will be "folded" into the data representing the pointer, taking advantage of one of the
..... Click the link for more information.
In computer science, a binary tree is a tree data structure in which each node has at most two children. Typically the child nodes are called left and right. One common use of binary trees is binary search trees; another is binary heaps.
..... Click the link for more information.
Functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast with the imperative programming style that emphasizes changes in state.
..... Click the link for more information.
ML
Paradigm: multi-paradigm: imperative, functional
Appeared in: 1973
Designed by: Robin Milner & others at the University of Edinburgh
Typing discipline: static, strong, inferred
Dialects: Standard ML, OCaml, F#
Influenced by: ISWIM
..... Click the link for more information.
Haskell

Paradigm: functional, non-strict, modular
Appeared in: 1990
Designed by: Simon Peyton-Jones, Paul Hudak[1], Philip Wadler, et al
Typing discipline: static, strong, inferred
Major implementations: GHC, Hugs, NHC , JHC , Yhc
..... Click the link for more information.
Pascal is a structured imperative computer programming language, developed in 1970 by Niklaus Wirth as a language particularly suitable for structured programming. A derivative known as Object Pascal was designed for object oriented programming.
..... Click the link for more information.
Ada

Paradigm: multi-paradigm: concurrent, distributed, generic-programming, imperative, object-oriented
Appeared in: 1983, last revised 2005
Designed by: Jean Ichbiah, extended
by S.
..... Click the link for more information.
Modula-2
Paradigm: imperative, structured, modular
Appeared in: 1978
Designed by: Niklaus Wirth
Typing discipline: strong, static
Major implementations: Gardens Point , p1 , Native XDS-x86
Dialects: PIM2, PIM3, PIM4, ISO
..... Click the link for more information.
C

The C Programming Language, Brian Kernighan and Dennis Ritchie, the original edition that served for many years as an informal specification of the language.
..... Click the link for more information.
C++
Paradigm: Multi-paradigm
Appeared in: 1983
Designed by: Bjarne Stroustrup
Typing discipline: Static, unsafe, nominative
Major implementations: G++, Microsoft Visual C++, Borland C++ Builder
Dialects: ISO/IEC C++ 1998, ISO/IEC C++ 2003
..... Click the link for more information.
pointer is a programming language data type whose value refers directly to (or “points to”) another value stored elsewhere in the computer memory using its address. Obtaining the value to which a pointer refers is called dereferencing the pointer.
..... Click the link for more information.
The Cyclone programming language is intended to be a safe dialect of the C programming language. Cyclone is designed to avoid buffer overflows and other vulnerabilities that are endemic in C programs, without losing the power and convenience of C as a tool for systems programming.
..... Click the link for more information.
As in taxonomy, the classifications of species, a class hierarchy in computer science is a classification of object types, denoting objects as the instantiations of classes (class is like a blueprint, the object is what is built from that blueprint) inter-relating the various
..... Click the link for more information.
Object-oriented programming (OOP) is a programming paradigm that uses "objects" and their interactions to design applications and computer programs. It is based on several techniques, including inheritance, modularity, polymorphism, and encapsulation.
..... Click the link for more information.
In object-oriented programming (OOP), a virtual function or virtual method is a function whose behavior, by virtue of being declared "virtual," is determined by the definition of a function with the same signature furthest in the inheritance lineage of the instantiated
..... Click the link for more information.
A virtual method table, virtual function table, dispatch table, or vtable, is a mechanism used in programming language implementations in order to support dynamic polymorphism, i.e., run-time method binding.
..... Click the link for more information.
In object-oriented programming, a constructor (sometimes shortened to ctor) in a class is a special block of statements called when an object is created, either when it is declared (statically constructed on the stack, possible in C++ but not in Java and other
..... Click the link for more information.
D
Paradigm: multiparadigm
Appeared in: 1999
Designed by: Walter Bright
Latest release: 1.022 (stable)/ October 1, 2007[1]
Typing discipline: strong, static
Major implementations: DMD , GDC
Influenced by: C, C++, C#, Java, Eiffel

The
..... Click the link for more information.


This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.
Herod_Archelaus


page counter