Information about Frequency Analysis

A typical distribution of letters in English language text. Weak ciphers do not sufficiently mask the distribution, and this might be exploited by a cryptanalyst to read the message.
In cryptanalysis, frequency analysis is the study of the frequency of letters or groups of letters in a ciphertext. The method is used as an aid to breaking classical ciphers.

Frequency analysis is based on the fact that, in any given stretch of written language, certain letters and combinations of letters occur with varying frequencies. Moreover, there is a characteristic distribution of letters that is roughly the same for almost all samples of that language. For instance, in a section of English text, E tends to be very common, while X is very rare. Likewise, ST, NG, TH, and QU are common pairs of letters (termed bigrams or digraphs), while NZ and QJ are rare. The phrase "ETAOIN SHRDLU" lists the 12 most frequent letters in typical English language text, in approximate order of frequency.

In some ciphers, such properties of the natural language plaintext are preserved in the ciphertext, and these patterns have the potential to be exploited in a ciphertext-only attack.

Frequency analysis for simple substitution ciphers

In a simple substitution cipher, each letter of the plaintext is replaced with another, and any particular letter in the plaintext will always be transformed into the same letter in the ciphertext. For instance, if all occurrences of the letter e turn into the letter X, a ciphertext message containing numerous instances of the letter X would suggest to a cryptanalyst that X represents e.

The basic use of frequency analysis is to first count the frequency of ciphertext letters and then associate guessed plaintext letters with them. More X's in the ciphertext than anything else suggests that X corresponds to e in the plaintext, but this is not certain; t and a are also very common in English, so X might be either of them also. It is unlikely to be a plaintext z or q which are less common. Thus the cryptanalyst may need to try several combinations of mappings between ciphertext and plaintext letters.
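
The counting step can be sketched in Python. This is a minimal illustration rather than a full solver: the function name first_guess and the sample string are made up for the example, and the reference order uses only the twelve letters of ETAOIN SHRDLU. The pairing it returns is only a starting hypothesis that the cryptanalyst may have to revise.

```python
from collections import Counter

# The twelve most frequent English letters, per the "ETAOIN SHRDLU" phrase.
COMMON_ORDER = "etaoinshrdlu"

def first_guess(ciphertext):
    """Pair the most frequent ciphertext letters with common plaintext letters.

    Hypothetical helper: the result is a tentative hypothesis, not a decryption.
    """
    counts = Counter(ch for ch in ciphertext if ch.isalpha())
    ranked = [letter for letter, _ in counts.most_common()]
    # zip stops at the shorter sequence, so at most twelve letters are paired.
    return dict(zip(ranked, COMMON_ORDER))

# Made-up sample in which X dominates, so X is tentatively mapped to e.
sample = "XQXZXXQXA"
guess = first_guess(sample)
```

Here guess maps X to e and Q to t. With real messages the ranking is rarely this clean, so the second and third candidates (t and a for the top ciphertext letter) should be kept in mind as alternatives.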

More complex uses of statistics are possible, such as counting pairs of letters (bigrams), triplets (trigrams), and so on. These counts give the cryptanalyst more information: for instance, Q and U nearly always occur together in that order in English, even though Q itself is rare.
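
Counting bigrams and trigrams can be sketched the same way. The helper name ngram_counts is an assumption made for this example. It discards spaces and punctuation before counting, so groups may span word boundaries, which suits cryptograms written without spaces.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive letters in text (hypothetical helper)."""
    letters = [ch for ch in text.upper() if ch.isalpha()]
    # Slide a window of width n over the letters; yields nothing if text is too short.
    return Counter("".join(letters[i:i + n]) for i in range(len(letters) - n + 1))

# Made-up sample: the letters are THETHEN once the space is dropped.
bigrams = ngram_counts("THE THEN", 2)
trigrams = ngram_counts("THE THEN", 3)
```

In this sample TH and HE each occur twice and THE is the most common trigram. Calling most_common() on the returned Counter gives the ranking a cryptanalyst would compare against the expected ranking for the language.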

An example

Suppose Eve has intercepted the cryptogram below, and it is known to be encrypted using a simple substitution cipher:

LIVITCSWPIYVEWHEVSRIQMXLEYVEOIEWHRXEXIPFEMVEWHKVSTYLXZIXLIKIIXPIJVSZEYPERRGERIM WQLMGLMXQERIWGPSRIHMXQEREKIETXMJTPRGEVEKEITREWHEXXLEXXMZITWAWSQWXSWEXTVEPMRXRSJ GSTVRIEYVIEXCVMUIMWERGMIWXMJMGCSMWXSJOMIQXLIVIQIVIXQSVSTWHKPEGARCSXRWIEVSWIIBXV IZMXFSJXLIKEGAEWHEPSWYSWIWIEVXLISXLIVXLIRGEPIRQIVIIBGIIHMWYPFLEVHEWHYPSRRFQMXLE PPXLIECCIEVEWGISJKTVWMRLIHYSPHXLIQIMYLXSJXLIMWRIGXQEROIVFVIZEVAEKPIEWHXEAMWYEPP XLMWYRMWXSGSWRMHIVEXMSWMGSTPHLEVHPFKPEZINTCMXIVJSVLMRSCMWMSWVIRCIGXMWYMX

For this example, uppercase letters are used to denote ciphertext, lowercase letters are used to denote plaintext (or guesses at such), and X~t is used to express a guess that ciphertext letter X represents the plaintext letter t.

Eve could use frequency analysis to help solve the message along the following lines: counts of the letters in the cryptogram show that I is the most common single letter, XL is the most common bigram, and XLI is the most common trigram. In English, e is the most common letter, th is the most common bigram, and the is the most common trigram. This strongly suggests that X~t, L~h and I~e. The second most common letter in the cryptogram is E; since the first and second most frequent letters in English, e and t, are already accounted for, Eve guesses that E~a, the third most frequent letter. Tentatively making these assumptions, the following partially decrypted message is obtained.

heVeTCSWPeYVaWHaVSReQMthaYVaOeaWHRtatePFaMVaWHKVSTYhtZetheKeetPeJVSZaYPaRRGaReM WQhMGhMtQaReWGPSReHMtQaRaKeaTtMJTPRGaVaKaeTRaWHatthattMZeTWAWSQWtSWatTVaPMRtRSJ GSTVReaYVeatCVMUeMWaRGMeWtMJMGCSMWtSJOMeQtheVeQeVetQSVSTWHKPaGARCStRWeaVSWeeBtV eZMtFSJtheKaGAaWHaPSWYSWeWeaVtheStheVtheRGaPeRQeVeeBGeeHMWYPFhaVHaWHYPSRRFQMtha PPtheaCCeaVaWGeSJKTVWMRheHYSPHtheQeMYhtSJtheMWReGtQaROeVFVeZaVAaKPeaWHtaAMWYaPP thMWYRMWtSGSWRMHeVatMSWMGSTPHhaVHPFKPaZeNTCMteVJSVhMRSCMWMSWVeRCeGtMWYMt
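
Applying a set of tentative guesses is a mechanical step, sketched below in Python. The helper name apply_guesses is an assumption for this example. It follows the convention of this section: guessed letters appear in lowercase and unsolved ciphertext letters stay in uppercase. Only the first 25 letters of the cryptogram are used to keep the example short.

```python
def apply_guesses(ciphertext, guesses):
    """Substitute guessed letters; leave every other ciphertext letter unchanged."""
    return "".join(guesses.get(ch, ch) for ch in ciphertext)

# Eve's first four guesses: X~t, L~h, I~e, E~a.
guesses = {"X": "t", "L": "h", "I": "e", "E": "a"}

# The opening 25 letters of the intercepted cryptogram.
prefix = "LIVITCSWPIYVEWHEVSRIQMXLE"
partial = apply_guesses(prefix, guesses)
```

The value of partial is "heVeTCSWPeYVaWHaVSReQMtha", which matches the start of the partially decrypted message. Adding further entries to the dictionary, or removing one that turns out to be wrong, and re-running the function is how the guessing and backtracking proceeds.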

Using these initial guesses, Eve can spot patterns that confirm her choices, such as "that". Moreover, other patterns suggest further guesses. "Rtate" might be "state", which would mean R~s. Similarly "atthattMZe" could be guessed as "atthattime", yielding M~i and Z~m. Furthermore, "heVe" might be "here", giving V~r. Filling in these guesses, Eve gets:

hereTCSWPeYraWHarSseQithaYraOeaWHstatePFairaWHKrSTYhtmetheKeetPeJrSmaYPassGasei WQhiGhitQaseWGPSseHitQasaKeaTtiJTPsGaraKaeTsaWHatthattimeTWAWSQWtSWatTraPistsSJ GSTrseaYreatCriUeiWasGieWtiJiGCSiWtSJOieQthereQeretQSrSTWHKPaGAsCStsWearSWeeBtr emitFSJtheKaGAaWHaPSWYSWeWeartheStherthesGaPesQereeBGeeHiWYPFharHaWHYPSssFQitha PPtheaCCearaWGeSJKTrWisheHYSPHtheQeiYhtSJtheiWseGtQasOerFremarAaKPeaWHtaAiWYaPP thiWYsiWtSGSWsiHeratiSWiGSTPHharHPFKPameNTCiterJSrhisSCiWiSWresCeGtiWYit

In turn, these guesses suggest still others (for example, "remarA" could be "remark", implying A~k) and so on, and it is relatively straightforward to deduce the rest of the letters, eventually yielding the plaintext.

In this example, Eve's guesses were all correct. This would not always be the case, however; the variation in statistics for individual plaintexts can mean that initial guesses are incorrect. It may be necessary to backtrack from incorrect guesses or to analyze the available statistics in much more depth than the somewhat simplified justifications given in the above example.

It is also possible that the plaintext does not exhibit the expected distribution of letter frequencies. Shorter messages are likely to show more variation. It is also possible to construct artificially skewed texts. For example, entire novels have been written that omit the letter "e" altogether — a form of literature known as a lipogram.

History and usage

First page of Al-Kindi's 9th century Manuscript on Deciphering Cryptographic Messages


The first known recorded explanation of frequency analysis (indeed, of any kind of cryptanalysis) was given by the 9th-century Arab polymath Abu Yusuf Yaqub ibn Ishaq al-Sabbah Al-Kindi in A Manuscript on Deciphering Cryptographic Messages (Ibrahim Al-Kadi, 1992; see References). It has been suggested that close textual study of the Qur'an first brought to light that Arabic has a characteristic letter frequency. The technique spread, and by the Renaissance it was so widely used by European states that cryptographers invented several schemes to defeat it. These included:
  • Use of homophones: several alternatives to the most common letters in otherwise monoalphabetic substitution ciphers (for example, for English, both X and Y in the ciphertext might mean plaintext E);
  • Polyalphabetic substitution, that is, the use of several alphabets, chosen in assorted, more or less devious, ways (Leone Alberti seems to have been the first to propose this); and
  • Polygraphic substitution, schemes where pairs or triplets of plaintext letters are treated as units for substitution, rather than single letters (for example, the Playfair cipher, invented by Charles Wheatstone in the mid-1800s).
A disadvantage of all these attempts to defeat frequency counting attacks is that they complicate both enciphering and deciphering, leading to mistakes. Famously, a British Foreign Secretary is said to have rejected the Playfair cipher because, even if schoolboys could cope successfully, as Wheatstone and Playfair had shown, 'our attachés could never learn it!'
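
The effect of homophones on letter counts can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the helper name homophonic_encrypt, the two-letter table, and the round-robin choice of homophone (a real scheme would more likely choose among the alternatives unpredictably). It is not a reconstruction of any historical cipher.

```python
from collections import Counter
from itertools import cycle

def homophonic_encrypt(plaintext, table):
    """Replace each plaintext letter with the next symbol from its pool of homophones.

    Hypothetical helper; every plaintext letter must have an entry in table.
    """
    pools = {letter: cycle(symbols) for letter, symbols in table.items()}
    return "".join(next(pools[ch]) for ch in plaintext)

# Hypothetical table: e has two homophones (X and Y), t has one (Q).
table = {"e": "XY", "t": "Q"}

# Plaintext with e twice as common as t.
ciphertext = homophonic_encrypt("eeeett", table)
counts = Counter(ciphertext)
```

The plaintext contains four e's and two t's, but the ciphertext "XYXYQQ" shows X, Y and Q twice each, so a simple letter count no longer singles out the symbol standing for e. The extra bookkeeping needed to manage the table is an instance of the added complication described above.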

The rotor machines of the first half of the 20th century (for example, the Enigma machine) were essentially immune to straightforward frequency analysis. However, other kinds of analysis ("attacks") successfully decoded messages from some of those machines.

Frequency analysis requires only a basic understanding of the statistics of the plaintext language and some problem-solving skills, and, if performed by hand, some tolerance for extensive letter bookkeeping. During World War II (WWII), both the British and the Americans recruited codebreakers by placing crossword puzzles in major newspapers and running contests to see who could solve them the fastest. Several of the ciphers used by the Axis powers were breakable using frequency analysis (for example, some of the consular ciphers used by the Japanese). Mechanical methods of letter counting and statistical analysis (generally IBM card type machinery) were first used in WWII, possibly by the US Army's SIS. There are lurid tales of midnight expeditions by the cryptographers to machines in another Department. Today, the hard work of letter counting and analysis has been replaced by computer software, which can carry out such analysis in seconds. With modern computing power, classical ciphers are unlikely to provide any real protection for confidential data.

Frequency analysis in fiction

Frequency analysis has been described in fiction. Edgar Allan Poe's The Gold-Bug and Sir Arthur Conan Doyle's Sherlock Holmes tale The Adventure of the Dancing Men are examples of stories which describe the use of frequency analysis to attack simple substitution ciphers. The cipher in the Poe story is encrusted with several deception measures, but this is more a literary device than anything significant cryptographically.

Part of the cryptogram in The Dancing Men

References

  • Helen Fouché Gaines, "Cryptanalysis", Dover, 1939. ISBN 0-486-20097-3.
  • Ibrahim A. Al-Kadi, "The origins of cryptology: The Arab contributions", Cryptologia, 16(2) (April 1992), pp. 97–126.
  • Abraham Sinkov, "Elementary Cryptanalysis: A Mathematical Approach", The Mathematical Association of America, 1966. ISBN 0-88385-622-0.

This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.