Information about GPGPU

General-purpose computing on graphics processing units (GPGPU, also referred to as GPGP and, to a lesser extent, GP²) is a recent trend of using the GPU rather than the CPU to perform computation. The addition of programmable stages and higher-precision arithmetic to the rendering pipeline has allowed software developers to use GPUs for non-graphics applications. By exploiting the GPU's highly parallel architecture through stream processing approaches, many real-time computing problems can be sped up considerably.

GPU improvements

For many years GPU functionality was very limited: the GPU was used only to accelerate certain parts of the graphics pipeline. Several improvements were needed before GPGPU became feasible.

Programmability

Programmable vertex and fragment shaders were added to the graphics pipeline to enable game programmers to generate more realistic effects. Vertex shaders allow the programmer to alter per-vertex attributes, such as position, color, texture coordinates, and normal vector. Fragment shaders calculate the color of each fragment, that is, they operate per pixel. Programmable fragment shaders allow the programmer to substitute, for example, a lighting model other than those provided by default by the graphics card, typically simple Gouraud shading. Shaders have enabled graphics programmers to create lens effects, displacement mapping, and depth of field.

The programmability of the pipelines has followed Microsoft’s DirectX specification, with DirectX8 introducing Shader Model 1.1, DirectX8.1 introducing Pixel Shader Models 1.2, 1.3 and 1.4, and DirectX9 defining Shader Model 2.x and 3.0. Each shader model increased the flexibility and capability of the programming model, and conforming hardware followed suit. The DirectX10 specification unifies the programming specification for vertex, geometry (“Geometry Shaders” are new to DirectX10) and fragment processing, which fits unified shader hardware better and provides a single computational pool of programmable resources.

Data types

Pre-DirectX9 graphics cards only supported paletted or integral color types. Various formats are available, each containing a red element, a green element, and a blue element. Sometimes an additional alpha value is added, to be used for transparency. Common formats are:
  • 8 bits per pixel - Palette mode, where each value is an index into a table whose entries hold the real color value in one of the other formats. Alternatively, 3 bits for red, 3 bits for green, and 2 bits for blue.
  • 16 bits per pixel - Usually allocated as 5 bits for red, 6 bits for green, and 5 bits for blue.
  • 24 bits per pixel - 8 bits for each of red, green, and blue
  • 32 bits per pixel - 8 bits for each of red, green, blue, and alpha
For early fixed-function or limited-programmability graphics (i.e. up to and including DirectX8.1-compliant GPUs) this was sufficient, because it is also the representation used in displays. The representation does have limitations, however: each channel is a small integer of at most 8 bits, so range and precision are restricted. Given sufficient graphics processing power, even graphics programmers would like to use better formats, such as floating point data formats, to obtain effects such as high dynamic range imaging. Many GPGPU applications require floating point accuracy, which arrived with graphics cards conforming to the DirectX9 specification.
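
To make the precision limit of the integral formats concrete, the following is a minimal sketch that packs 8-bit channels into the 16-bit 5-6-5 layout listed above and unpacks them again. The function names are illustrative assumptions, not part of any graphics API.

```python
def pack_rgb565(r, g, b):
    """Pack 8-bit red, green and blue values into one 16-bit 5-6-5 word."""
    # Keep the top 5 bits of red, top 6 of green, top 5 of blue.
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def unpack_rgb565(word):
    """Expand a 5-6-5 word back to 8-bit channels; the dropped low bits stay zero."""
    r = (word >> 11) & 0x1F
    g = (word >> 5) & 0x3F
    b = word & 0x1F
    return (r << 3, g << 2, b << 3)


packed = pack_rgb565(200, 100, 50)
print(hex(packed), unpack_rgb565(packed))
```

The blue channel does not survive the round trip exactly, because its three low-order bits are discarded when packing.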

DirectX9 Shader Model 2.x suggested the support of two precision types: full and partial precision. Full precision could be either FP32 or FP24 (floating point, 24 bits per component) or greater, while partial precision was FP16. ATI’s R300 series of GPUs supported only FP24 precision in the programmable fragment pipeline (although FP32 was supported in the vertex processors), while NVIDIA’s NV30 series supported both FP16 and FP32; other vendors such as S3 Graphics and XGI supported a mixture of formats up to FP24.

Shader Model 3.0 altered the specification, raising the full precision requirement to a minimum of FP32 in the fragment pipeline. ATI’s Shader Model 3.0 compliant R5xx generation (Radeon X1000 series) supports only FP32 throughout the pipeline, while NVIDIA’s NV4x and G7x series continued to support both FP32 full precision and FP16 partial precision. Although not stipulated by Shader Model 3.0, both ATI’s and NVIDIA’s Shader Model 3.0 GPUs introduced support for blendable FP16 render targets, making High Dynamic Range Rendering easier to support.

The implementations of floating point on NVIDIA GPUs are IEEE compliant; however, this is not true across all vendors[1]. This has implications for correctness, which some scientific applications consider important. While 64-bit floating point values (double precision floats) are commonly available on CPUs, they are not currently available on GPUs. Some applications require at least double precision floating point values and thus cannot currently be ported to GPUs. There have been efforts to emulate double precision floating point values on GPUs[2].

Most operations on the GPU operate in a vectorized fashion: a single operation can be performed on up to four values at once. For instance, if one color <R1, G1, B1> is to be modulated by another color <R2, G2, B2>, the GPU can produce the resulting color <R1*R2, G1*G2, B1*B2> in a single operation. This functionality is useful in graphics because almost every basic data type is a vector (2, 3, or 4 dimensional). Examples include vertices, colors, normal vectors, and texture coordinates. Many other applications can put this to good use, and for this reason vector instructions (SIMD) have already been added to CPUs.
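
The color modulation described above can be sketched in a few lines of Python. This only illustrates the arithmetic: the loop here runs sequentially on the CPU, whereas the GPU performs the component-wise products in one vector operation. The name modulate is an assumption chosen for this example.

```python
def modulate(c1, c2):
    """Component-wise product of two colors given as (R, G, B) tuples."""
    return tuple(a * b for a, b in zip(c1, c2))


# <R1, G1, B1> modulated by <R2, G2, B2> gives <R1*R2, G1*G2, B1*B2>.
print(modulate((0.5, 1.0, 0.25), (0.5, 0.5, 1.0)))
```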

In November 2006 NVIDIA launched the GeForce 8800, which supports CUDA, an SDK and API that allows a programmer to use the C programming language to code algorithms for execution on the GPU. ATI/AMD offers a similar SDK for its ATI-based GPUs, called CTM (Close to Metal), designed to compete directly with NVIDIA's CUDA. CTM provides a thin hardware interface. AMD has also announced the AMD Stream Processor product line (combining CPU and GPU technology on one chip). Compared, for example, to traditional floating point accelerators such as the 64-bit CSX600 boards from ClearSpeed used in today's supercomputers, current GPUs from NVIDIA and AMD/ATI are only 32-bit, providing single-precision rather than the double-precision (64-bit) capability of today's supercomputers[4]. NVIDIA, however, stated in the CUDA Release Notes Version 0.8 that NVIDIA GPUs supporting double-precision (64-bit) floating point arithmetic in hardware will become available in late 2007[5]. Even without true double-precision arithmetic in hardware, CUDA and CTM are a significant step toward broader use of GPGPU technology.

GPGPU programming concepts

GPUs are designed specifically for graphics and are therefore very restrictive in terms of operations and programming. Because of their nature, GPUs are only effective at tackling problems that can be solved using stream processing, and the hardware can only be used in certain ways.

Stream processing

GPUs can only process independent vertices and fragments, but can process many of them in parallel. This is especially effective when the programmer wants to process many vertices or fragments in the same way. In this sense, GPUs are stream processors - processors that can operate in parallel by running a single kernel on many records in a stream at once.

A stream is simply a set of records that require similar computation. Streams provide data parallelism. Kernels are the functions that are applied to each element in the stream. On GPUs, vertices and fragments are the elements of the streams, and vertex and fragment shaders are the kernels run on them. Since GPUs process elements independently, there is no way to have shared or static data. For each element we can only read from the input, perform operations on it, and write to the output. It is permissible to have multiple inputs and multiple outputs, but never a piece of memory that is both readable and writable.

Arithmetic intensity is defined as the number of operations performed per word of memory transferred. It is important for GPGPU applications to have high arithmetic intensity; otherwise memory access latency will limit computation speed.
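
As a small worked example of this definition, the sketch below computes arithmetic intensity for two kernels. Both kernels and their operation counts are assumptions picked for illustration: a scaled vector addition y[i] = a * x[i] + y[i], and a hypothetical degree-16 polynomial evaluated with Horner's rule.

```python
def arithmetic_intensity(operations, words_transferred):
    """Operations performed per word of memory transferred."""
    return operations / words_transferred


# y[i] = a * x[i] + y[i]: one multiply and one add (2 operations),
# reading x[i] and y[i] and writing y[i] (3 words).
scaled_add = arithmetic_intensity(2, 3)

# Hypothetical degree-16 polynomial via Horner's rule: 16 multiplies and
# 16 adds (32 operations), reading one input and writing one output (2 words).
polynomial = arithmetic_intensity(32, 2)

print(scaled_add, polynomial)
```

Under these assumed counts the polynomial kernel does far more work per word moved, so it is the better fit for a GPU of the two.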

Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements.

GPU programming concepts

Computational resources

There are a variety of computational resources available on the GPU:
  • Programmable processors - Vertex, primitive and fragment pipelines allow the programmer to run kernels on streams of data
  • Rasterizer - creates fragments and interpolates per-vertex constants such as texture coordinates and color
  • Texture Unit - read-only memory interface
  • Framebuffer - write-only memory interface
In fact, the programmer can substitute a write-only texture for output instead of the framebuffer. This is accomplished through Render-To-Texture (RTT), Render-To-Backbuffer-Copy-To-Texture (RTBCTT), or the more recent stream-out.

Textures as stream

The most common form for a stream to take in GPGPU is a 2D grid because this fits naturally with the rendering model built into GPUs. Many computations naturally map into grids: matrix algebra, image processing, physically based simulation, and so on.

Since textures are used as memory, texture lookups are then used as memory reads. Certain operations can be done automatically by the GPU because of this.

Kernels

Kernels can be thought of as the body of loops. For example, a programmer operating on a grid on the CPU might have code that looks like this:

/* Pseudocode */
width = 1e8
height = 1e8
make array of width by height
for each x in 0 .. width-1 {    // Loop this block 1e8 times
  for each y in 0 .. height-1 { // Loop this block 1e8 times
    do_some_hard_work(x, y)     // This is done 1e16 times (10 000 000 000 000 000)
  }
}


On the GPU, the programmer only specifies the body of the loop as the kernel and what data to loop over by invoking geometry processing.
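
The same idea can be sketched in Python. Here run_kernel is a hypothetical stand-in for the GPU invoking the kernel once per grid cell, and do_some_hard_work is given an arbitrary body for illustration; neither is a real graphics API call. On real hardware the invocations would run in parallel and in no guaranteed order.

```python
def run_kernel(kernel, width, height):
    """Call the kernel once per grid cell and collect the results row by row."""
    return [[kernel(x, y) for x in range(width)] for y in range(height)]


def do_some_hard_work(x, y):
    """Illustrative kernel body: depends only on its own coordinates."""
    return x * x + y * y


# The programmer supplies only the loop body and the extent of the grid.
grid = run_kernel(do_some_hard_work, 4, 3)
print(grid[2][3])
```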

Flow control

In regular programs it is possible to control the flow of the program using if-then-else statements and various forms of loops. Such flow control structures have only recently been added to GPUs. Previously, conditional writes could be accomplished using a series of simpler instructions, but looping and conditional branching were not possible.

Recent GPUs allow branching, but usually with a performance penalty. Branching should generally be avoided in inner loops, whether in CPU or GPU code, and various techniques, such as static branch resolution, pre-computation, and Z-cull[3] can be used to achieve branching when hardware support does not exist.
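
One way to express a conditional write with simpler instructions is an arithmetic select, sketched below. This is an illustration of the general idea under stated assumptions, not a description of any specific GPU instruction set; the function names are made up for the example, and the trick assumes finite inputs.

```python
def select(mask, if_true, if_false):
    """Branch-free select: mask must be exactly 0.0 or 1.0."""
    return mask * if_true + (1.0 - mask) * if_false


def clamp_negative_to_zero(value):
    """Return value if it is positive, otherwise 0.0, without an if statement."""
    mask = float(value > 0.0)  # the comparison result becomes 1.0 or 0.0
    return select(mask, value, 0.0)


print(clamp_negative_to_zero(3.0), clamp_negative_to_zero(-2.5))
```

Both candidate results are always computed, so the cost is the same whichever way the condition falls.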

GPU techniques

Map

The map operation simply applies the given function (the kernel) to every element in the stream. A simple example is multiplying each value in the stream by a constant (increasing the brightness of an image). The map operation is simple to implement on the GPU. The programmer generates a fragment for each pixel on screen and applies a fragment program to each one. The result stream of the same size is stored in the output buffer.
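
A minimal CPU-side sketch of map, using the brightness example from the paragraph above. The names map_stream and brighten, the gain value, and the clamp to 1.0 are assumptions made for the example.

```python
def map_stream(kernel, stream):
    """Apply the kernel to every element; the output has the same size as the input."""
    return [kernel(element) for element in stream]


def brighten(pixel, gain=2.0):
    """Multiply a pixel intensity by a constant, clamped to 1.0 (assumed range)."""
    return min(pixel * gain, 1.0)


image = [0.125, 0.25, 0.75]
print(map_stream(brighten, image))
```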

Reduce

Some computations require calculating a smaller stream (possibly a stream of only 1 element) from a larger stream. This is called a reduction of the stream. Generally a reduction can be accomplished in multiple steps. The results from the previous step are used as the input for the current step and the range over which the operation is applied is reduced until only one stream element remains.
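
The multi-step reduction can be sketched as follows. Each pass combines neighbouring pairs, so the stream roughly halves in length until one element remains. The pairing scheme and the function names are assumptions; the text above does not prescribe a particular layout.

```python
def reduce_pass(stream, op):
    """One pass: combine neighbouring pairs, roughly halving the stream length."""
    out = [op(stream[i], stream[i + 1]) for i in range(0, len(stream) - 1, 2)]
    if len(stream) % 2:
        out.append(stream[-1])  # an odd element carries over unchanged
    return out


def reduce_stream(stream, op):
    """Repeat passes until one element remains; return it with the pass count."""
    if not stream:
        raise ValueError("cannot reduce an empty stream")
    passes = 0
    while len(stream) > 1:
        stream = reduce_pass(stream, op)
        passes += 1
    return stream[0], passes


print(reduce_stream([3, 1, 4, 1, 5, 9, 2, 6], max))
```

Within a pass every pair is independent, which is what lets a GPU process the whole pass in parallel.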

Stream filtering

Stream filtering is essentially a non-uniform reduction. Filtering involves removing items from the stream based on some criteria.
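
The text above does not say how filtering is implemented. One possible approach, sketched here as an assumption, is to flag the elements to keep, take an exclusive prefix sum of the flags to find each survivor's output slot, and then write the survivors to those slots.

```python
def filter_stream(stream, keep):
    """Remove elements for which keep(element) is false, preserving order."""
    flags = [1 if keep(element) else 0 for element in stream]

    # Exclusive prefix sum: slots[i] counts the kept elements before index i.
    slots, total = [], 0
    for flag in flags:
        slots.append(total)
        total += flag

    out = [None] * total
    for element, flag, slot in zip(stream, flags, slots):
        if flag:
            out[slot] = element
    return out


print(filter_stream([5, -2, 7, 0, -1, 3], lambda v: v > 0))
```

The output size depends on the data, which is why filtering is described as a non-uniform reduction.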

Scatter

The scatter operation is most naturally defined on the vertex processor. The vertex processor is able to adjust the position of the vertex, which allows the programmer to control where information is deposited on the grid. Other extensions are also possible, such as controlling how large an area the vertex affects.

The fragment processor cannot perform a direct scatter operation because the location of each fragment on the grid is fixed at the time of the fragment's creation and cannot be altered by the programmer. However, a logical scatter operation may sometimes be recast or implemented with an additional gather step. A scatter implementation would first emit both an output value and an output address. An immediately following gather operation uses address comparisons to see whether the output value maps to the current output slot.
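
The scatter-as-gather recasting can be sketched as below. The first stage is assumed to have already emitted pairs of output value and output address; each output slot then scans those pairs and keeps the value whose address matches its own. The function name, the default value, and the rule that the last matching value wins are assumptions made for this example.

```python
def scatter_via_gather(values, addresses, size, default=0):
    """Emulate out[addresses[i]] = values[i] using only reads per output slot."""
    out = []
    for slot in range(size):  # one "fragment" per output slot
        result = default
        for value, address in zip(values, addresses):
            if address == slot:  # the address comparison
                result = value
        out.append(result)
    return out


print(scatter_via_gather([10, 20, 30], [4, 0, 2], size=5))
```

Every slot inspects every emitted pair, so this costs far more than a direct scatter would.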

Gather

The fragment processor is able to read textures in a random access fashion, so it can gather information from any grid cell, or multiple grid cells, as desired.
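
A gather from multiple grid cells might look like the following sketch, which averages a cell with its four neighbours. The nested fetch function plays the role of a texture lookup; clamping coordinates at the border is an assumption made for this example.

```python
def gather_average(grid, x, y):
    """Average of a cell and its four neighbours, clamping at the borders."""
    height, width = len(grid), len(grid[0])

    def fetch(cx, cy):
        # Stand-in for a texture read with clamped coordinates.
        return grid[min(max(cy, 0), height - 1)][min(max(cx, 0), width - 1)]

    total = (fetch(x, y) + fetch(x - 1, y) + fetch(x + 1, y)
             + fetch(x, y - 1) + fetch(x, y + 1))
    return total / 5.0


grid = [[0, 0, 0],
        [0, 5, 0],
        [0, 0, 0]]
print(gather_average(grid, 1, 1), gather_average(grid, 1, 0))
```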

Sort

The sort operation transforms an unordered set of elements into an ordered set of elements. The most common implementation on GPUs uses sorting networks[3].
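
As an illustration of a sorting network, the sketch below uses odd-even transposition sort. It was chosen here because it is one of the simplest networks, not because the cited work uses it. The sequence of compare-exchange steps is fixed in advance and does not depend on the data, and the pairs within one round are disjoint, so a GPU could process a whole round in parallel.

```python
def compare_exchange(data, i, j):
    """Put the smaller of data[i], data[j] at index i and the larger at j."""
    if data[i] > data[j]:
        data[i], data[j] = data[j], data[i]


def odd_even_transposition_sort(values):
    """Sort with n rounds of compare-exchanges on alternating neighbour pairs."""
    data = list(values)
    n = len(data)
    for step in range(n):  # n rounds are always enough for n elements
        start = step % 2   # even rounds pair (0,1),(2,3)...; odd rounds (1,2),(3,4)...
        for i in range(start, n - 1, 2):
            compare_exchange(data, i, i + 1)
    return data


print(odd_even_transposition_sort([5, 2, 9, 1, 7, 3]))
```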

Search

The search operation allows the programmer to find a particular element within the stream, or possibly find neighbors of a specified element. The GPU is not used to speed up the search for an individual element, but instead is used to run multiple searches in parallel.
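
The idea of running many searches side by side can be sketched as follows. Each lookup is an ordinary binary search and is independent of the others, so on a GPU each could be handled by its own kernel invocation. Searching a sorted stream with binary search is an assumption made for the example; the function names are illustrative.

```python
from bisect import bisect_left


def search_one(sorted_stream, key):
    """Binary search: index of key in sorted_stream, or -1 if it is absent."""
    i = bisect_left(sorted_stream, key)
    if i < len(sorted_stream) and sorted_stream[i] == key:
        return i
    return -1


def search_many(sorted_stream, keys):
    """Run one independent search per key; no search depends on another."""
    return [search_one(sorted_stream, key) for key in keys]


print(search_many([1, 3, 5, 7, 9, 11], [7, 4, 1, 11]))
```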

Data structures

A variety of data structures can be represented on the GPU:
  • Dense arrays
  • Sparse arrays - static or dynamic
  • Adaptive structures

Applications

The following are some of the non-graphics areas where GPUs have been used for general purpose computing:

References

  1. Double precision on GPUs (Proceedings of ASIM 2005): Dominik Göddeke, Robert Strzodka, and Stefan Turek. Accelerating Double Precision FEM Simulations with GPUs. Proceedings of ASIM 2005 - 18th Symposium on Simulation Technique, 2005.
  2. GPGPU survey paper: John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Tim Purcell. "A Survey of General-Purpose Computation on Graphics Hardware". Computer Graphics Forum, volume 26, number 1, 2007, pp. 80-113.
  3. Mapping computational concepts to GPUs: Mark Harris. Mapping computational concepts to GPUs. In ACM SIGGRAPH 2005 Courses (Los Angeles, California, July 31 - August 04, 2005). J. Fujii, Ed. SIGGRAPH '05. ACM Press, New York, NY, 50.
  4. [1]
  5. [2]

This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.