Information about Data Conversion

Data conversion is the conversion of computer data from one format to another, usually for the sake of application interoperability or to make new features available. At the simplest level, data conversion can be exemplified by conversion of a text file from one character encoding to another. More complex conversions are those of office file formats, and conversions of image and audio file formats are more involved still, typically requiring specialized tools and some understanding of the formats concerned.

Information basics

Before any data conversion is carried out, the user or application programmer should keep a few basics of computing and information theory in mind. These include:
  • Information can easily be discarded using the computer, but adding information takes effort.
  • The computer can be used to add information only in a rule-based fashion; most additions of information that users want can be done only with human judgement.
  • Upsampling the data or converting to a more feature-rich format does not add information; it merely makes room for that addition, which a human must usually perform.
For example, a truecolor image can easily be converted to grayscale, while the opposite conversion is a painstaking process. Converting a Unix text file to a Microsoft (DOS/Windows) text file involves adding information, namely a CR (hexadecimal 0D) byte before each LF (0A) byte, but that addition is easily done with a computer, since it is rule-based. The addition of color information to a grayscale image, by contrast, cannot be done programmatically in the same way, since only a human knows which colors are needed for each section of the picture; there are no simple rules that can be used to automate that process.

Converting a 24-bit PNG to a 48-bit one does not add information to it; it only pads the existing RGB pixel values with zeroes, so that a pixel with a value of FF C3 56, for example, becomes FF00 C300 5600. The conversion makes it possible to change a pixel to have a value of, for instance, FF80 C340 56A0, but the conversion itself does not do that; only further manipulation of the image can.

Converting an image or audio file in a lossy format (like JPEG or Vorbis) to a lossless (like PNG or FLAC) or uncompressed (like BMP or WAV) format only wastes space, since the target holds the same image or sound, complete with the artifacts of lossy compression and without the information that was discarded. A JPEG image can never be restored to the quality of the original lossless image from which it was made, no matter how often the user applies the "JPEG Artifact Removal" feature of an image manipulation program.
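The rule-based conversions mentioned above can be sketched in a few lines of Python. This is a minimal illustration rather than a production converter: the function names are invented for this example, the grayscale weights are the common Rec. 601 luma coefficients (an assumption, since the text names no formula), and the zero-padding follows the description in the text. Real image tools often widen samples differently, for instance by replicating the byte so that FF becomes FFFF.

```python
def unix_to_dos(data):
    """Add a CR (0x0D) before every LF (0x0A) that lacks one."""
    # Normalize first so that existing CR LF pairs are not doubled.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def rgb_to_gray(pixel):
    """Collapse an 8-bit RGB pixel to one gray level; the hue is discarded."""
    r, g, b = pixel
    # Integer Rec. 601 luma weights (assumed; other weightings exist).
    return (299 * r + 587 * g + 114 * b) // 1000


def widen_8_to_16(pixel):
    """Pad each 8-bit channel with a zero low byte, e.g. FF -> FF00."""
    return tuple(channel << 8 for channel in pixel)


print(unix_to_dos(b"a\nb\n"))                       # b'a\r\nb\r\n'
print(rgb_to_gray((255, 0, 0)), rgb_to_gray((76, 76, 76)))   # 76 76
print([f"{c:04X}" for c in widen_8_to_16((0xFF, 0xC3, 0x56))])
# ['FF00', 'C300', '5600']
```

Pure red and a mid-gray both map to gray level 76 here, which is why the reverse conversion has nothing to work from: the gray value alone does not say which color it came from.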

Because of these realities of computing and information theory, data conversion is often a complex and error-prone process that calls for expert help. It is fair to say that only major advances in artificial intelligence could make data conversion specialists unnecessary.

Pivotal conversion

Data conversion can be directly from one format to another, but many applications that convert between multiple formats use a pivotal encoding by way of which any source format is converted to its target. For example, it is possible to convert Cyrillic text from KOI8-R to Windows-1251 using a lookup table between the two encodings, but the modern approach is to convert the KOI8-R file to Unicode first and from that to Windows-1251. This is a more manageable approach: an application specializing in character encoding conversion would have to keep hundreds of lookup tables, for all the permutations of character encoding conversions available, while keeping lookup tables just for each character set to Unicode scales down the number to a few tens.
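As a rough sketch of the pivot approach, Python's built-in codecs can serve as the per-encoding tables, with the Python string (Unicode) acting as the pivot. The helper names `convert_via_unicode` and `table_counts` are invented for this illustration, and the count assumes separate one-way tables in each direction.

```python
def convert_via_unicode(data, source, target):
    """Convert encoded text by decoding to Unicode, then re-encoding."""
    text = data.decode(source)      # source encoding -> Unicode pivot
    return text.encode(target)      # Unicode pivot -> target encoding


def table_counts(n):
    """Return (direct, pivot) counts of one-way tables for n encodings."""
    direct = n * (n - 1)   # one table per ordered pair of encodings
    pivot = 2 * n          # one table to and one from the pivot per encoding
    return direct, pivot


koi8 = "Привет".encode("koi8_r")
cp1251 = convert_via_unicode(koi8, "koi8_r", "cp1251")
print(cp1251 == "Привет".encode("cp1251"))   # True
print(table_counts(30))                      # (870, 60)
```

With 30 encodings, direct conversion would need 870 one-way tables against 60 for the pivot approach, which is the scaling argument made above.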

Pivotal conversion is similarly used in other areas. Office applications, when employed to convert between office file formats, use their internal, default file format as a pivot. For example, a word processor may convert an RTF file to a WordPerfect file by converting the RTF to OpenDocument and then that to WordPerfect format. An image conversion program does not convert a PCX image to PNG directly; instead, when loading the PCX image, it decodes it to a simple bitmap format for internal use in memory, and when commanded to convert to PNG, that memory image is converted to the target format. An audio converter that converts from FLAC to AAC decodes the source file to raw PCM data in memory first, and then performs the lossy AAC compression on that memory image to produce the target file.
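The same architecture can be sketched with a registry of decoders and encoders that meet at an in-memory representation. The example below is only an analogy: it uses CSV and JSON from the Python standard library as stand-ins for formats such as PCX and PNG, a list of dictionaries as the pivot, and invented names (`DECODERS`, `ENCODERS`, `convert`). It does not describe how any particular office, image or audio application is implemented.

```python
import csv
import io
import json


# Decoders turn format-specific text into the in-memory pivot (a list of
# dicts); encoders turn the pivot back into format-specific text.
def decode_csv(text):
    return list(csv.DictReader(io.StringIO(text)))


def encode_csv(rows):
    # Assumes at least one row, whose keys give the column order.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def decode_json(text):
    return json.loads(text)


def encode_json(rows):
    return json.dumps(rows)


DECODERS = {"csv": decode_csv, "json": decode_json}
ENCODERS = {"csv": encode_csv, "json": encode_json}


def convert(text, source, target):
    """Convert by way of the pivot: decode the source, encode the target."""
    pivot = DECODERS[source](text)
    return ENCODERS[target](pivot)


print(convert("name,format\nlogo,PCX\n", "csv", "json"))
# [{"name": "logo", "format": "PCX"}]
```

Adding a new format to such a design means writing one decoder and one encoder, not a converter for every existing format.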

Lossy and inexact data conversion

For any conversion to be carried out without loss of information, the target format must support the same features and data constructs present in the source file. Conversion of a word processing document to a plain text file necessarily involves loss of information, because plain text format does not support word processing constructs such as marking a word as boldface. For this reason, conversion from one format to another that has fewer features is rarely carried out, though it may be necessary for interoperability, e.g. converting a file from one version of Microsoft Word to an earlier version for the sake of those who do not have the latest version of Word installed.
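The loss can be made concrete with a small sketch in which HTML stands in for a word processing format (an assumption made only to keep the example within the Python standard library). The class `TextOnly` and the function `to_plain_text` are invented names.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and ignore every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(to_plain_text("<p>This is <b>very</b> important.</p>"))
# This is very important.
print(to_plain_text("<p>This is very <b>important</b>.</p>"))
# This is very important.
```

Two documents that emphasize different words produce identical plain text, so the emphasis cannot be recovered from the converted file.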

Loss of information can be mitigated by approximation in the target format. There is no way of converting a character like ä to ASCII, since the ASCII standard lacks it, but the information may be retained by approximating the character as ae. Of course, this is not an optimal solution, and can impact operations like searching and copying; and if a language makes a distinction between ä and ae, then that approximation does involve loss of information.
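A minimal sketch of such an approximation follows. The mapping table is an assumption that extends the ä example to the other German umlauts, and `to_ascii` is an invented name.

```python
# Hypothetical approximation table; only the ä -> ae pair comes from the text.
APPROXIMATIONS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def to_ascii(text):
    """Approximate known characters, then encode strictly as ASCII."""
    # Characters without an approximation still raise UnicodeEncodeError.
    return text.translate(APPROXIMATIONS).encode("ascii")


try:
    "Bär".encode("ascii")
except UnicodeEncodeError:
    print("ASCII has no code for ä")

print(to_ascii("Bär"))                        # b'Baer'
print(to_ascii("Bär") == to_ascii("Baer"))    # True
```

The last line shows the residual loss: once converted, a word spelled with ä can no longer be told apart from one spelled with ae.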

Data conversion can also suffer from inexactitude, the result of converting between formats that are conceptually different. One example is the WYSIWYG paradigm, found in word processors and desktop publishing applications, versus the structural-descriptive paradigm, found in SGML, XML and many applications derived therefrom, like HTML and MathML. Using a WYSIWYG HTML editor conflates the two paradigms, and the result is HTML files with suboptimal, if not nonstandard, code. In the WYSIWYG paradigm a double linebreak signifies a new paragraph, as that is the visual cue for such a construct, but a WYSIWYG HTML editor will usually convert such a sequence to <BR><BR>, which is structurally no new paragraph at all.

As another example, converting from PDF to an editable word processor format is a tough chore, because PDF records the textual information like engraving on stone, with each character given a fixed position and linebreaks hard-coded, whereas word processor formats accommodate text reflow. PDF does not know of a word space character: the space between two letters and the space between two words differ only in size. Therefore, a title with ample letter-spacing for effect will usually end up with spaces in the word processor file; for example, INTRODUCTION set with a letter-spacing of 1 em becomes I N T R O D U C T I O N in the word processor.
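The letter-spacing problem can be illustrated with a toy model of positioned glyphs. Everything here is hypothetical: the `(char, x, width)` tuples, the fixed advance of 0.6 em, and the gap threshold of 0.25 em are assumptions chosen for the sketch and do not reflect the data model or heuristics of any actual PDF library.

```python
def place(text, font_size, advance=0.6, letter_spacing=0.0):
    """Lay out glyphs left to right; advance and spacing are given in em."""
    glyphs, x = [], 0.0
    for char in text:
        width = advance * font_size
        glyphs.append((char, x, width))
        x += width + letter_spacing * font_size
    return glyphs


def rebuild_text(glyphs, font_size, gap_ratio=0.25):
    """Guess word spaces: emit a space wherever the gap exceeds a threshold."""
    out = []
    prev_end = None
    for char, x, width in glyphs:
        if prev_end is not None and x - prev_end > gap_ratio * font_size:
            out.append(" ")
        out.append(char)
        prev_end = x + width
    return "".join(out)


tight = place("INTRODUCTION", font_size=12)
spaced = place("INTRODUCTION", font_size=12, letter_spacing=1.0)
print(rebuild_text(tight, 12))    # INTRODUCTION
print(rebuild_text(spaced, 12))   # I N T R O D U C T I O N
```

Because the converter sees only positions, a gap of 1 em between letters is indistinguishable from a gap between words, whatever threshold is chosen.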

Open vs. secret specifications

Successful data conversion requires thorough knowledge of the workings of both source and target formats. Where the specification of a format is not available, reverse engineering is needed to carry out conversion. Reverse engineering can achieve a close approximation of the original specification, but errors and missing features can still result. The binary formats of Microsoft Office documents (DOC, XLS, PPT and the rest) were for many years undocumented, and anyone who sought interoperability with those formats had to reverse-engineer them. Such efforts have been fairly successful, so that most Microsoft Word files open without any ill effect in the competing OpenOffice.org Writer, but the few that do not, usually very complex ones that use the more obscure features of the DOC file format, serve to show the limits of reverse engineering.

This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.