Information about Skip List

Invented in 1990 by William Pugh, a skip list is a probabilistic data structure, based on parallel linked lists, with efficiency comparable to a binary search tree (order O(log n) average time for most operations).

Underlying the skip list is an augmentation of an ordered linked list with additional forward links, added in a randomized way with a geometric/negative binomial distribution (see William Pugh's original paper), so that a search in the list may quickly skip parts of the list (hence the name). Insert, search and delete operations are performed in logarithmic randomized time.

Description

A skip list is built in layers. The bottom layer is an ordinary ordered linked list. Each higher layer acts as an "express lane" for the lists below, where an element in layer i appears in layer i+1 with some fixed probability p (two commonly-used values for p are 1/2 or 1/4). On average, each element appears in 1/(1-p) lists, and the tallest element (usually a special head element at the front of the skip list) appears in lists.

1 1
>4
>6 1
>3->4
>6
>9 1->2->3->4->5->6->7->8->9->10


A search for a target element is started with the head element in the top list, and proceeds horizontally until the current element is greater than or equal to the target. If the current element is equal to the target, it has been found. If the current element is greater than the target, go back to the previous element and drop down vertically to the next lower list and repeat the procedure. The expected number of steps in each linked list is easily seen to be 1/p, by tracing the search path backwards from the target until reaching an element that appears in the next higher list. So the total cost of a search is , which is O(log n) when p is a constant. By choosing different values of p, it is possible to trade search costs against storage costs.

Implementation Details

Although the above diagram is drawn as several, separate linked lists with some repeated node values--implying a total of 1+3+5+10 = 19 total nodes, in practice one would not implement a skip list this way because of the "drop down vertically" step. In fact there are only 10 total nodes in the diagram, but several of them contain more than one pointer.

Insertions and deletions are implemented much like the corresponding linked-list operations, except that "tall" elements must be inserted into or deleted from more than one linked list.

Θ(n) operations, which force us to visit every node in ascending order (PrintEntireListToScreen(), etc.) provide the opportunity to perform a behind-the-scenes derandomization of the level structure of the skip-list in an optimal way (Choose the level of the i'th finite node to be 1 plus the number of times we can repeatedly divide i by 2 before it becomes odd. Also, i=0 for the negative infinity header as we have the usual special case of choosing the highest possible level for negative and/or positive infinite nodes.). However this also allows someone to know where all of the higher-than-level 1 nodes are and delete them.

Alternatively, we could make the level structure quasi-random in the following way:

make all nodes level 1 j = 1 while the number of nodes at level j > 1 for each i'th node at level j if i is odd if i is not the last node at level j randomly choose whether or not to promote it to level j+1 else do not promote end if else if i is even and node i-1 was not promoted promote it to level j+1 end if end for j = j + 1 end while

Like the derandomized version, quasi-randomization is only done when there is some other reason to be running a Θ(n) operation (which visits every node).

The advantage of this quasi-randomness is that it doesn't give away nearly as much level-structure related information to an adversarial user as the de-randomized one. This is desirable because an adversarial user who is able to tell which nodes are not at the lowest level can pessimize performance by simply deleting higher-level nodes. The search performance is still guaranteed to be logarithmic.

It would be tempting to make the following "optimization": In the part which says "Next, for each i'th...", forget about doing a coin-flip for each even-odd pair. Just flip a coin once to decide whether or not to promote only the even ones or only the odd ones. Instead of Θ(n lg n) coin flips, there would only be Θ(lg n) of them. Unfortunately, this gives the adversarial user a 50/50 chance of being correct upon guessing that all of the even numbered nodes (among the ones at level 1 or higher) are higher than level one. This is despite the property that he has a very low probability of guessing that a particular node is at level N for some integer N.

The following proves these two claims concerning the advantages of quasi-randomness over the totally derandomized version. First, to prove that the search time is guaranteed to be logarithmic. Suppose a node n is searched for, where n is the position of the found node among the nodes of level 1 or higher. If n is even, then there is a 50/50 chance that it is higher than level 1. However, if it is not higher than level 1 then node n-1 is guaranteed to be higher than level 1. If n is odd, then there is a 50/50 chance that it is higher than level 1. Suppose that it is not; there is a 50/50 chance that node n-1 is higher than level 1. Suppose that this is not either; we are guaranteed that node n-2 is higher than level 1. The analysis can then be repeated for nodes of level 2 or higher, level 3 or higher, etc. always keeping in mind that n is the position of the node among the ones of level k or higher for integer k. So the search time is constant in the best case (if the found node is the highest possible level) and 2 times the worst case for the search time for the totally derandomized skip-list (because we have to keep moving left twice rather than keep moving left once).

Next, an examination of the probability of an adversarial user's guess of a node being level k or higher being correct. First, the adversarial user has a 50/50 chance of correctly guessing that a particular node is level 2 or higher. This event is independent of whether or not the user correctly guesses at some other node being level 2 or higher. If the user knows the positions of two consecutive nodes of level 2 or higher, and knows that the one on the left is in an odd numbered position among the nodes of level 2 or higher, the user has a 50/50 chance of correctly guessing which one is of level 3 or higher. So, the user's probability of being correct, when guessing that a node is level 3 or higher, is 1/4. Inductively continuing this analysis, we see that the user's probability of guessing that a particular node is level k or higher is 1/(2^(k-1)).

The above analyses only work when the number of nodes is a power of two. However, because of the third rule which says, "Finally, if i is odd and also the last node at level 1 then do not promote." (where we substitute the appropriate level number for 1) it becomes a sequence of exact-power-of-two-sized skiplists, concatenated onto each other, for which the analysis does work. In fact, the exact powers of two correspond to the binary representation for the number of nodes in the whole list.

A skip list, upon which we have not recently performed either of the above mentioned Θ(n) operations, does not provide the same absolute worst-case performance guarantees as more traditional balanced tree data structures, because it is always possible (though with very low probability) that the coin-flips used to build the skip list will produce a badly balanced structure. However, they work well in practice, and the randomized balancing scheme has been argued to be easier to implement than the deterministic balancing schemes used in balanced binary search trees. Skip lists are also useful in parallel computing, where insertions can be done in different parts of the skip list in parallel without any global rebalancing of the data structure.

There has been some evidence that skip lists have worse real-world performance and space requirements than B trees due to memory locality and other issues [1].

History

Skip lists were invented by William Pugh. He details how they work in Skip lists: a probabilistic alternative to balanced trees in Communications of the ACM, June 1990, 33(6) 668-676. See also citations and downloadable documents.

To quote the inventor:

Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space.

References

  • Pugh, William (June 1990). "Skip lists: a probabilistic alternative to balanced trees". Communications of the ACM 33: 668-676. 

See also

  • Deterministic skip list

External links

William Pugh (Bill Pugh) is the inventor of the skip list, and was highly influential in the development of the current memory model of the Java language.

He is currently a professor of computer science at the University of Maryland, College Park.
..... Click the link for more information.
data structure is a way of storing data in a computer so that it can be used efficiently. Often a carefully chosen data structure will allow the most efficient algorithm to be used. The choice of the data structure often begins from the choice of an abstract data type.
..... Click the link for more information.
linked list is one of the fundamental data structures, and can be used to implement other data structures. It consists of a sequence of nodes, each containing arbitrary data fields and one or two references ("links") pointing to the next and/or previous nodes.
..... Click the link for more information.
In computer science, efficiency is used to describe several desirable properties of an algorithm or other construct, besides clean design, functionality, etc. Efficiency is generally contained in two properties: speed (the time it takes for an operation to complete), and space (the
..... Click the link for more information.
binary search tree (BST) is a binary tree data structure which has the following properties:
  • Each node has a value.
  • A total order is defined on these values.
  • The left subtree of a node contains only values less than the node's value.

..... Click the link for more information.
In computational complexity theory, big O notation is often used to describe how the size of the input data affects an algorithm's usage of computational resources (usually running time or memory).
..... Click the link for more information.
linked list is one of the fundamental data structures, and can be used to implement other data structures. It consists of a sequence of nodes, each containing arbitrary data fields and one or two references ("links") pointing to the next and/or previous nodes.
..... Click the link for more information.
Randomization is the process of making something random; this can mean:
  • Generating a random permutation of a sequence (such as when shuffling cards).
  • Selecting a random sample of a population (important in statistical sampling).

..... Click the link for more information.
linked list is one of the fundamental data structures, and can be used to implement other data structures. It consists of a sequence of nodes, each containing arbitrary data fields and one or two references ("links") pointing to the next and/or previous nodes.
..... Click the link for more information.
Express Lanes are according to FHWA "A lane or set of lanes physically separated or barriered from the general-purpose capacity provided within major roadway corridors. Express lane access is managed by limiting the number of entranced and exit points to the facility.
..... Click the link for more information.
In computer science, a self-balancing binary search tree or height-balanced binary search tree is a binary search tree that attempts to keep its height, or the number of levels of nodes beneath the root, as small as possible at all times, automatically.
..... Click the link for more information.
Parallel computing is the simultaneous execution of some combination of multiple instances of programmed instructions and data on multiple processors in order to obtain results faster.
..... Click the link for more information.
B-tree is a tree data structure that keeps data sorted and allows searches, insertions, and deletions in logarithmic amortized time. It is most commonly used in databases and filesystems.
..... Click the link for more information.
Memory locality (or data locality) is a term in computer science used to denote the temporal or spatial proximity of memory access in computer programs. There are two basic types of locality of memory reference: temporal and spatial:

Temporal locality

..... Click the link for more information.
William Pugh (Bill Pugh) is the inventor of the skip list, and was highly influential in the development of the current memory model of the Java language.

He is currently a professor of computer science at the University of Maryland, College Park.
..... Click the link for more information.
Communications of the ACM (CACM) is the flagship monthly journal of the Association for Computing Machinery (ACM). First published in 1957, CACM is sent to all ACM members.
..... Click the link for more information.
The Dictionary of Algorithms and Data Structures is a dictionary style reference for many algorithms, "algorithmic techniques", "archetypal problems" and data structures found in the field of Computer Science [1]. The dictionary is maintained by Paul E.
..... Click the link for more information.


This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.
Herod_Archelaus


page counter