Using ÜberLame for GPGPU

What is GPGPU, anyway?

GPGPU means using GPUs (graphics processing units) for general-purpose calculations. General purpose means calculations the GPU was not originally intended to do, possibly even unrelated to graphics.

Why go GPGPU, anyway?

Using the GPU for general-purpose calculations is a pretty common thing nowadays, thanks to the GPU's high processing power. Common GPUs in 2008 have around 128 cores (a lot, compared to the 2 cores in a Core 2 Duo) and processing power peaking in hundreds of GFLOPS.

How to GPGPU, anyway?

In the beginning, GPGPU meant using a graphics library to evaluate some expression. A scratching-your-right-ear-with-your-left-hand sort of thing. But it works. Today, we have special libraries for GPGPU.

NVidia's solution is called CUDA and it's a compiler. It can be integrated into the Visual C++ IDE and works as a pre-compile step. Source code can contain mixed C/C++/CUDA code (some functions execute on the CPU, some on the GPU, depending on their declaration) and that's all the programmer needs to worry about. The CUDA compiler compiles its functions, then lets the Visual C++ compiler do the rest. The result is a pretty ordinary exe, except it can only execute on machines with recent NVidia GPUs.

In CUDA, the programmer sees the GPU as a parallel device. The functions are executed in threads on several cores. For further reading, refer to the NVidia CUDA homepage.

Note there have been rumors that ATI is threatening the world with its own GPGPU solution. Well, I didn't check into that; I'm not brave enough to let ATI drivers into my system.

What does ÜberLame do?

This is the other approach. We're going to use OpenGL for general-purpose calculations. Did I just say it's clumsy to use a graphics library for GP? Why do that, anyway? Two basic reasons ... because we can, and because CUDA can't run everywhere (older NVidia boards, ATI boards).

Using OpenGL for GPGPU can, however, be pretty tedious: setting up all the textures, compiling shaders, stuff like that. That's why ÜberLame offers convenient GPGPU classes.

ÜberLame code will run on Linux / Windows, and on any machine with OpenGL and shader support. It does not use Cg, but the native OpenGL shading language, called simply 'GLSL' or 'GL-slang'.

Essential GPGPU

First of all, it's necessary to clarify how we are going to proceed. We are going to be interested in three kinds of OpenGL objects: textures, shaders and framebuffers.

To avoid common confusion, a shader is not some routine uploaded to the GPU and then executed on command. A shader is used for shading: it replaces the traditional OpenGL lighting calculation subroutine and gives the programmer the means to calculate his own 'shading'. We will in fact be doing some per-pixel shading, only the shading calculation doesn't actually have anything to do with light.

But to answer the question 'How to execute a shader?', one can pose a similar question: 'How to execute the lighting calculation?'. Lighting calculations are carried out when drawing something. That's right: to execute a shader, we draw something, some geometry. What should it be? It depends, but the most used is the so-called fullscreen quad (a quadrilateral covering the whole output framebuffer).

Imagine a simple case where we have lots of quadratic equations and we want to solve them. A quadratic equation is written as:

    a * x^2 + b * x + c = 0

The equation is described by its coefficients a, b and c. We need to store those coefficients in a texture; the a, b, c values map easily to RGB pixels. Each pixel of the texture represents one equation.

Results will be in the form of triplets. The first two values are the equation's solutions (a quadratic equation has two real roots, sometimes equal), the third value is 1 if a solution exists, or 0 otherwise.

This is how we are going to proceed:

void main()
{
    vec3 texel = ReadTexture2D(); // read texture

    float a = texel.r;
    float b = texel.g;
    float c = texel.b; // get quadratic equation coefficients

    float D = b * b - 4 * a * c; // calculate the discriminant

    if(D < 0) {
        Output = RGB(0, 0, 0); // equation has no solution
    } else {
        D = sqrt(D); // square root of the discriminant

        float x1 = (-b - D) / (2 * a);
        float x2 = (-b + D) / (2 * a); // calculate roots

        Output = RGB(x1, x2, 1); // equation has one or two solutions
    }
    }
}
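For reference, the same per-pixel computation can be prototyped on the CPU, which is handy for checking GPU results later. A minimal C++ sketch (SolveQuadratic and TQuadResult are our own names, not ÜberLame API):

```cpp
#include <cassert>
#include <cmath>

// solves a*x^2 + b*x + c = 0 and returns the same triplet the shader
// produces: two roots and a 'solvable' flag (1 if real roots exist, else 0)
struct TQuadResult { float x1, x2, f_solvable; };

TQuadResult SolveQuadratic(float a, float b, float c)
{
    float D = b * b - 4 * a * c; // the discriminant
    if(D < 0)
        return TQuadResult{0, 0, 0}; // no real solution
    D = sqrtf(D); // square root of the discriminant
    return TQuadResult{(-b - D) / (2 * a), (-b + D) / (2 * a), 1};
}
```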

In case we wanted the results in a texture (for further GPU processing), we'd have to create one more texture for the results and a framebuffer object to redirect the drawn pixels to the texture.

This is basically how GPGPU works. There can be more input textures, more output textures; the alpha, stencil or depth test can somehow be involved. The possibilities are endless.

The following part of the article focuses more on the implementation details.

OpenGL textures revisited

Textures are used as 'pixel arrays' to hold calculation data. We have to decide which texture format to use and how to organize our data in the texture. Note standard OpenGL textures have limited size and can hold power-of-two images only (images with resolutions that are powers of two, such as 256x256, called POTS). In image / video processing it's often necessary to process non-power-of-two images, such as 640x480. Those are called NPOTS. There are two extensions enabling creation of non-power-of-two textures.

The first of them is ARB_texture_non_power_of_two. It simply relaxes the power-of-two restriction (for all texture dimensions, not only 2D); no changes to the API are made. As far as I know, there are no ATI cards supporting this extension.

The second of them is ARB_texture_rectangle. It adds a whole new texture target: it's no longer GL_TEXTURE_2D, it's GL_TEXTURE_RECTANGLE_ARB. Texture coordinates are no longer normalized (in the [0, 1] range), but are in pixels. Also, texture wrap and filtering modes are limited (read the spec). So it takes a bit more effort to use them, but they are widely supported.
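The coordinate difference matters when addressing individual texels for GPGPU: the center of texel i lies at (i + 0.5) / size in normalized GL_TEXTURE_2D coordinates, but simply at i + 0.5 in GL_TEXTURE_RECTANGLE_ARB coordinates. A tiny sketch (the helper names are ours, not ÜberLame API):

```cpp
#include <cassert>
#include <cmath>

// center of texel n_texel along one axis, in the two addressing conventions
float f_TexelCenter_2D(int n_texel, int n_size)
{
    return (n_texel + .5f) / n_size; // normalized [0, 1] coordinates
}

float f_TexelCenter_Rect(int n_texel)
{
    return n_texel + .5f; // pixel coordinates
}
```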

The other important thing is the texture format. Today's (2008) OpenGL implementations offer the following texture formats:

That's right, there are read-only formats. These are formats that are not color-renderable. So basically we have 3- or 4-component formats. NVidia's OpenGL implementations will act like they support 4-bit, 12-bit or 16-bit depths as well, but the data are going to be stored in 8 bits. You can find more info at the NVidia texture format support page.

To fill the gap in writable monochrome and two-component formats, there's the ARB_texture_rg extension (can't say it's widely supported). It defines (amongst others) the two following formats:

It can be desirable to store floating-point data. There's ARB_texture_float; it defines (among others) the following formats:

Note the 16-bit float formats use the so-called half-float. It's much quicker (at least on most older GPUs it is), of course at some precision cost. But its precision is sufficient in most common cases. The full specification can be found in ARB_half_float_pixel.
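To get a feel for the format: a half-float has a sign bit, 5 exponent bits (bias 15) and 10 mantissa bits, i.e. roughly 3 decimal digits of precision. A simplified C++ conversion sketch (truncating, no rounding; denormals flushed to zero; our own helpers, not necessarily how the GPU implements it):

```cpp
#include <cassert>
#include <cmath>
#include <cstring>
#include <stdint.h>

// float to half-float; truncates the mantissa, flushes denormals to zero
uint16_t n_FloatToHalf(float f)
{
    uint32_t n; memcpy(&n, &f, 4);
    uint16_t n_sign = (n >> 16) & 0x8000;
    int32_t n_exp = int32_t((n >> 23) & 0xff) - 127 + 15; // rebias exponent
    uint32_t n_mantissa = n & 0x7fffff;
    if(n_exp <= 0)
        return n_sign; // underflow to (signed) zero
    if(n_exp >= 31)
        return n_sign | 0x7c00; // overflow to infinity
    return n_sign | uint16_t(n_exp << 10) | uint16_t(n_mantissa >> 13);
}

// half-float back to float (denormals not handled in this sketch)
float f_HalfToFloat(uint16_t n_half)
{
    uint32_t n_sign = uint32_t(n_half & 0x8000) << 16;
    uint32_t n_exp = (n_half >> 10) & 0x1f;
    uint32_t n_mantissa = n_half & 0x3ff;
    uint32_t n;
    if(n_exp == 0)
        n = n_sign; // zero
    else if(n_exp == 31)
        n = n_sign | 0x7f800000 | (n_mantissa << 13); // infinity / NaN
    else
        n = n_sign | ((n_exp - 15 + 127) << 23) | (n_mantissa << 13);
    float f; memcpy(&f, &n, 4);
    return f;
}
```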

Again, in case it's required to render monochrome, there's a remedy in the form of another extension, NV_float_buffer; it adds (among others) the following formats:

This is a complete floating-point format set, all of them writable. However, there's no substitute for this extension on ATI boards. When using RGB / RGBA formats, it's better to stick with the ARB versions (the enums aren't identical).

It is a great thing that OpenGL 3.0 adds some additional core formats. Those are:

Those are the same as the ARB_texture_float formats, only all of them are now color-renderable.

ÜberLame offers CGLTexture_1D, CGLTexture_2D, CGLTexture_Rect, CGLTexture_3D and CGLTexture_Cube for different texture dimensionality.

Framebuffer objects

Framebuffer objects (FBOs) are defined by the EXT_framebuffer_object extension. They add the ability to render to an off-screen framebuffer (meaning a framebuffer other than the one visible in the application window). They can have a color buffer, depth buffer and stencil buffer, just like an ordinary framebuffer. They also expose the very useful ability to render directly to textures. (Yes, there are depth and stencil texture formats.)

Framebuffers are also very useful with the ARB_draw_buffers extension. It enables simultaneous rendering to multiple render buffers (multiple textures) at once. It can be very useful in case your algorithm needs more than the 4 scalar outputs that fit into an RGBA texture. Currently a maximum of 8 draw buffers is supported. When rendering to multiple draw buffers, all of them will contain the same output. That doesn't seem very useful, at least not until a shader is bound: shaders can write different output to each draw buffer.

ÜberLame contains the CGLFrameBuffer_FBO class, nicely encapsulating both of the above extensions.

Shaders

The word 'shader' is (thanks to graphics card manufacturers' advertising) almost dark magic for most people. There's nothing difficult about shaders. Shaders are short programs, written in an assembly or C-like language, executed upon OpenGL objects. (To be precise, 'programs' are in assembly, 'shaders' are C-like.) There are vertex shaders, geometry shaders (we're not going to discuss geometry shaders here) and fragment shaders. (Sometimes 'pixel shaders' is used, but that is incorrect for OpenGL.)

There are a lot of specs on shader extensions, but OpenGL 2.0 defines the core shader functionality we're going to use.

As said, the vertex shader is executed upon the corresponding OpenGL object - upon a vertex. It gets the vertex position, texture coordinates, transformation matrices, lighting and all associated OpenGL state as global variables. Its task is to calculate the vertex position (transform it by the matrices) and to generate vertex color, texture coordinates, fog ... and anything else useful for whatever the goal is. Vertex shaders are not used very much in common GPGPU.

The fragment shader is executed upon a fragment. This is important. When a pixel is about to be drawn to the framebuffer, the fragment shader gets executed. It gets interpolated vertex texture coordinates, color, fog coordinate and textures. Its task is to calculate the output pixel color. The fragment shader may also calculate depth (what is going to be written to the depth buffer), but doesn't have to do so - in that case the interpolated vertex depth value is automatically used.

Shaders are accessed in OpenGL using shader objects. Each shader object has an associated id (similar to a texture id) and can be bound (like a texture). The shader object with id 0 represents the fixed-function pipeline. A shader object can be composed of one or more stages: either a sole vertex shader, a sole fragment shader, or cooperating vertex and fragment shaders.

ÜberLame offers the CGL_ARB_program, CGL_ARB_shader and CGL_Core_shader classes, covering low-level shader functionality. Then there are TGLShaderInfo and CGLShaderObject for greater convenience (the second-layer shader system).

Hardcore GPGPU

In the past, branching instructions (if) were expensive on GPUs. Some branching can be avoided by properly arranging the output data so the branch condition corresponds to a well-representable area of the output texture. Then, instead of a fullscreen quad, proper geometry is used so the branching condition is constant over the geometric primitives.

Consider, for example, a simple decision whether the output pixel is on the edge of the output image. It can be written as:

if(x == 0 || x == w - 1 || y == 0 || y == h - 1) {
    // border
} else {
    // inside
}

But that brings a condition (if) into our shader. However, the output image edge is a well-representable area of the output texture. We will have two shaders, one with the 'border' branch and the other with the 'inside' branch. We use a one-pixel-smaller fullscreen quad for the 'inside' shader and four lines along the edges for the 'border' shader. Branching is gone.
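It's worth a sanity check that the two pieces of geometry tile the output exactly, with no overlap and no gap. On the CPU, counting pixels only (the function names are ours):

```cpp
#include <cassert>

// number of pixels covered by the one-pixel-smaller 'inside' quad
int n_InsidePixel_Num(int n_width, int n_height)
{
    return (n_width - 2) * (n_height - 2);
}

// number of pixels covered by the four 'border' lines along the edges
int n_BorderPixel_Num(int n_width, int n_height)
{
    return 2 * n_width + 2 * (n_height - 2); // top + bottom + left + right
}
```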

Also, some algorithms are iterative. The output of one pass is the input to the following pass. This requires two textures, one acting as input, the other as output, and alternating them. This is very common and is called ping-ponging.
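The alternation itself is trivial bookkeeping: just swap which texture plays which role after each pass. A minimal index-only sketch (real code would bind the textures / FBO attachments accordingly; the names are ours):

```cpp
#include <cassert>

// ping-pong bookkeeping: texture n_src is read, texture n_dest is written,
// and the roles are exchanged after every pass
struct TPingPong {
    int n_src, n_dest;

    TPingPong() : n_src(0), n_dest(1) {}

    void Swap() // call once per iteration, after drawing
    {
        int n_temp = n_src;
        n_src = n_dest;
        n_dest = n_temp;
    }
};
```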

ÜberLame demo 2

You might wonder why demo no. 2 comes first. It was written later, because the first demo is a little bit more complicated. Demo 2 is just 250 lines of code, including some comments.

Demo 2 demonstrates the quadratic equation solver. It creates the equations from a simple raytracing problem - ray-sphere intersection. The demo then displays the results of those quadratic equations: the near intersection in the red channel, the far intersection in the blue channel and the 'solvable' flag in the green / alpha channels.

It begins with OpenGL initialization:

int main(int n_arg_num, const char **p_arg_list)
{
    glutInit(&n_arg_num, (char**)p_arg_list);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DEPTH | GLUT_DOUBLE);
    glutInitWindowSize(n_width, n_height);
    glutInitWindowPosition(200, 200);
    glutCreateWindow("ÜberLame GLUT window");
    // init OpenGL using GLUT

What you see above is pretty common GLUT initialization code.

Next, we want to check for some special extensions:

    if(!CGLExtensionHandler::b_SupportedExtension("GL_ARB_texture_float")) {
        fprintf(stderr, "error: ARB_texture_float is not supported\n");
        Cleanup();
        return -1;
    }
    // need ARB_texture_float extension to create floating-point texture

We need the ARB_texture_float extension so it's possible to create a floating-point RGB texture to store the quadratic equation coefficients.

Next, we want to create OpenGL state guard object:

    if(!(p_state = new CGLState())) {
        fprintf(stderr, "error: not enough memory\n");
        Cleanup();
        return -1;
    }
    // create OpenGL state guard

The OpenGL state guard is quite a useful object, containing a copy of the OpenGL state. When changing some state, such as calling glEnable(GL_TEXTURE_2D), the driver needs to go through some decision tree to identify what GL_TEXTURE_2D means. Then the state change is queued, and the actual state is changed at the point where it takes effect (when drawing something, in this case). While this is a lot of fuss for the driver, it's simple to keep track of whether the texture is enabled or disabled ourselves and to call the driver only in case it's really necessary. That is what the OpenGL state guard does.
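The principle can be illustrated with a tiny sketch (this shows just the idea, not the actual CGLState implementation):

```cpp
#include <cassert>

// caches the GL_TEXTURE_2D enable state and only 'calls the driver'
// (counted here instead of calling glEnable / glDisable) on a real change
class CStateGuardSketch {
    bool m_b_texture_2d;
    int m_n_driver_call_num;

public:
    CStateGuardSketch()
        :m_b_texture_2d(false), m_n_driver_call_num(0)
    {}

    void EnableTexture2D()
    {
        if(m_b_texture_2d)
            return; // redundant change, skip the driver
        m_b_texture_2d = true;
        ++ m_n_driver_call_num; // the real glEnable(GL_TEXTURE_2D) would go here
    }

    void DisableTexture2D()
    {
        if(!m_b_texture_2d)
            return; // redundant change, skip the driver
        m_b_texture_2d = false;
        ++ m_n_driver_call_num; // the real glDisable(GL_TEXTURE_2D) would go here
    }

    int n_DriverCall_Num() const { return m_n_driver_call_num; }
};
```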

Now we're ready to create texture containing the equations coefficients:

    float *p_coeffs;
    if(!(p_coeffs = p_GenerateEquationCoefficients(n_width, n_height))) {
        fprintf(stderr, "error: not enough memory\n");
        Cleanup();
        return -1;
    }
    // generate equation coefficients

    if(!(p_equations = new CGLTexture_2D(p_state, n_width, n_height, GL_RGB16F_ARB,
       false, 0, GL_RGB, GL_FLOAT, p_coeffs)) || !p_equations->b_Status()) {
        fprintf(stderr, "error: failed to create equations texture\n");
        delete[] p_coeffs;
        Cleanup();
        return -1;
    }
    // create floating-point RGB texture, containing equation coefficients

    delete[] p_coeffs;
    // don't need this anymore

Most of this is quite easy to understand. We call some function p_GenerateEquationCoefficients, which creates an array of n_width x n_height x 3 floats, then we create a CGLTexture_2D object (the parameters are almost identical to those of glTexImage2D) with the GL_RGB16F_ARB pixel format.
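p_GenerateEquationCoefficients itself isn't listed here, but for the ray-sphere problem demo 2 solves, the per-pixel coefficients come from expanding |o + t*d - s|^2 = r^2 (ray origin o, direction d, sphere center s, radius r). A hypothetical sketch of that math (our own illustration, not the demo's actual code):

```cpp
#include <cassert>

struct TVec3 { float x, y, z; };

static float f_Dot(TVec3 a, TVec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// quadratic coefficients of the ray-sphere intersection:
// (d.d) * t^2 + 2 * d.(o - s) * t + (|o - s|^2 - r^2) = 0
void RaySphere_Coefficients(TVec3 o, TVec3 d, TVec3 s, float r,
    float &a, float &b, float &c)
{
    TVec3 v = {o.x - s.x, o.y - s.y, o.z - s.z}; // sphere center to ray origin
    a = f_Dot(d, d);
    b = 2 * f_Dot(d, v);
    c = f_Dot(v, v) - r * r;
}
```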

We're going to display the results in the application window, so we don't need to allocate a framebuffer and a result texture. But we do need a shader to calculate the results:

    const char *p_s_solve_quad =
        "uniform sampler2D n_texture;\n"
        "\n"
        "void main()\n"
        "{\n"
        "    vec3 texel = texture2D(n_texture, gl_TexCoord[0].st).rgb; // read texture\n"
        "\n"
        "    float a = texel.r;\n"
        "    float b = texel.g;\n"
        "    float c = texel.b; // get quadratic equation coefficients\n"
        "\n"
        "    float D = b * b - 4.0 * a * c; // calculate discriminant squared\n"
        "\n"
        "    if(D < 0.0) {\n"
        "        gl_FragColor = vec4(0); // equation has no sollution\n"
        "    } else {\n"
        "        D = sqrt(D); // calculate discriminant\n"
        "\n"
        "        float x1 = (-b - D) / (2.0 * a);\n"
        "        float x2 = (-b + D) / (2.0 * a); // calculate roots\n"
        "\n"
        "        gl_FragColor = vec3(x1, x2, 1.0).xyzz; // equation has one or two sollutions\n"
        "    }\n"
        "}\n";

    if(!(p_shader = new CGL_ARB_shader(0, p_s_solve_quad))) {
        fprintf(stderr, "error: not enough memory\n");
        Cleanup();
        return -1;
    }
    // create shader for solving quadratic equations

The code above initializes a new CGL_ARB_shader object. But the shader itself isn't compiled yet.

    char *p_s_info_log = 0;
    if(!p_shader->Compile(p_state, p_s_info_log)) {
        fprintf(stderr, "error: error compiling shader:\n%s\n",
            (p_s_info_log)? p_s_info_log : "(null)");
        if(p_s_info_log)
            delete[] p_s_info_log;
        return -1;
    }
    if(p_s_info_log) {
        fprintf(stderr, "warning: while compiling shader:\n%s\n", p_s_info_log);
        delete[] p_s_info_log;
        p_s_info_log = 0;
    }
    if(!p_shader->Link(p_s_info_log)) {
        fprintf(stderr, "error: error linking shader:\n%s\n",
            (p_s_info_log)? p_s_info_log : "(null)");
        if(p_s_info_log)
            delete[] p_s_info_log;
        return -1;
    }
    if(p_s_info_log) {
        fprintf(stderr, "warning: while linking shader:\n%s\n", p_s_info_log);
        delete[] p_s_info_log;
        p_s_info_log = 0;
    }
    // compile and link the shader

This fragment of code is quite long, but it actually does only two important things: it calls Compile() and Link(). The rest is just error checking. The compiler may also generate some warnings / errors, which are copied to p_s_info_log and have to be printed for the user to see them.

The shader uses a texture to read its input data. We need to get the index of the register on the GPU so we can set it later. It is done by a simple call to the CGL_ARB_shader member function n_GetUniformLocation:

    if((n_texture_param_register = p_shader->n_GetUniformLocation("n_texture")) < 0) {
        fprintf(stderr, "error: can't find \'n_texture\' parameter\n");
        Cleanup();
        return -1;
    }
    // find index of register on GPU where "n_texture" uniform is stored

And that is all the initialization. Now all we need is to render a fullscreen quad while the shader is bound. We must not forget to set the texture unit index after binding the shader (but before actually using it):

static const float p_vertex[] = {0, 0, -1, -1, 0,
    1, 0, 1, -1, 0, 1, 1, 1, 1, 0, 0, 1, -1, 1, 0};
// fullscreen quad vertex coordinates

void GPGPU()
{
    glViewport(0, 0, n_width, n_height);
    // set viewport

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    // clear

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    // set ortho projection

    p_equations->Bind_Enable(p_state); // set texture with equations to solve
    p_shader->Bind(p_state); // enable shader
    p_shader->SetParam1i(p_state, n_texture_param_register, 0); // texture is in unit 0
    // use shader to show equations sollutions

    glInterleavedArrays(GL_T2F_V3F, 0, p_vertex);
    glDrawArrays(GL_QUADS, 0, 4);
    glDisableClientState(GL_VERTEX_ARRAY);
    glDisableClientState(GL_TEXTURE_COORD_ARRAY);
    // draw fullscreen quad
}

And that's it. The demo renders a nice sphere. You can go and download ÜberLame GPGPU demo 2.

ÜberLame demo

This is a simple demo (roughly 600 lines of code), introducing the basics of working with shaders in ÜberLame. The demo creates a couple of textures and framebuffers, renders some geometry, applies a simple box filter shader and saves all the results as images.

At the beginning, there's common GLUT initialization code:

int main(int n_arg_num, const char **p_arg_list)
{
    glutInit(&n_arg_num, (char**)p_arg_list);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DEPTH | GLUT_DOUBLE);
    glutInitWindowSize(1, 1);
    glutCreateWindow("ÜberLame OpenGL window");
    // init OpenGL using GLUT

Next, we need to create the OpenGL state guard. As we already know from the previous demo, it is an object containing a copy of the OpenGL state, which keeps track of state changes and calls the driver only when it's really necessary.

    printf("create state guard\n");

    if(!(p_state = new CGLState())) {
        fprintf(stderr, "error: not enough memory\n");
        return -1;
    }
    // create OpenGL state guard

Now we need to create some textures, which is really simple:

    printf("create two empty textures\n");

    CGLTexture_2D *p_texture;
    if(!(p_texture = new CGLTexture_2D(p_state, 256, 256, GL_RGBA8, true, 0))) {
        fprintf(stderr, "error: not enough memory\n");
        return -1;
    }
    // create 2D texture with unspecified contents

    CGLTexture_2D *p_texture2;
    if(!(p_texture2 = new CGLTexture_2D(p_state, 256, 256, GL_RGBA8, true, 0))) {
        fprintf(stderr, "error: not enough memory\n");
        return -1;
    }
    // create 2D texture with unspecified contents

The parameters to CGLTexture_2D are the state guard object, texture width and height, internal format, a flag whether mip-maps are required, and border width - quite similar to those of glTexImage2D. Image format, data type and a pointer to the bitmap may follow, as shown in the following snippet:

    printf("create noise texture\n");

    unsigned char *p_image_data;
    if(!(p_image_data = new unsigned char[3 * 256 * 256])) {
        fprintf(stderr, "error: not enough memory\n");
        return -1;
    }
    for(unsigned char *p_dest = p_image_data, *p_end = p_image_data + 256 * 256 * 3; p_dest != p_end;)
        *p_dest ++ = rand();
    CGLTexture_2D *p_noise_texture;
    if(!(p_noise_texture = new CGLTexture_2D(p_state, 256, 256,
       GL_RGBA8, true, 0, GL_RGB, GL_UNSIGNED_BYTE, p_image_data))) {
        fprintf(stderr, "error: not enough memory\n");
        return -1;
    }
    delete[] p_image_data;
    // create 2D texture with random contents

Here we go ... allocate an array of 256x256 8-bit RGB pixels, fill it with random values and create a texture from it. It is almost identical to the code above that creates the empty textures.

The next thing we're going to try is some offscreen rendering. For that we need a framebuffer object (FBO):

    printf("create render-buffer for r2t (with depth buffer)\n");

    if(!CGLFrameBuffer::b_Supported()) {
        fprintf(stderr, "error: CGLFrameBuffer not supported\n");
        return -1;
    }

    CGLFrameBuffer *p_frame_buffer;
    if(!(p_frame_buffer = new CGLFrameBuffer(256, 256, 1, true, GL_RGBA8,
       true, false, GL_DEPTH_COMPONENT24, false, false, 0))) {
        fprintf(stderr, "error: not enough memory\n");
        return -1;
    }

Calling CGLFrameBuffer::b_Supported() checks support for EXT_framebuffer_object and loads all the function entry points.

The CGLFrameBuffer constructor needs the output resolution, the number of draw buffers (1), a flag whether we're rendering to a texture (true means we are, false means an offscreen renderbuffer is required), the internal texture format (required only in case we're not rendering to a texture), a flag whether we need a depth buffer, a flag whether we want to have the depth buffer in a texture, the depth buffer format (required only in case we're not rendering depth to a texture), a flag whether we need a stencil buffer, a flag whether we want to have the stencil buffer in a texture, and the stencil buffer format (required only in case we're not rendering stencil to a texture). This long constructor can be used to create any desirable framebuffer.

Next step is - of course, binding framebuffer and texture to render to:

    printf("render something to the texture\n");

    {
        if(!p_frame_buffer->b_Status() ||
           !p_frame_buffer->Bind() ||
           !p_frame_buffer->BindColorTexture_2D(*p_noise_texture, GL_TEXTURE_2D)) {
            fprintf(stderr, "error: failed to bind FBO for r2t\n");
            return -1;
        }
        glViewport(0, 0, p_frame_buffer->n_Width(), p_frame_buffer->n_Height());
        // bind FBO for rendering to texture

We're going to render to the texture containing the noise. It's also necessary to set the proper viewport.

All we need to do now is to draw something:

        glMatrixMode(GL_PROJECTION);
        glLoadIdentity();
        glMatrixMode(GL_MODELVIEW);
        glLoadIdentity();
        // ortho

        p_state->EnableDepthTest();
        glClear(GL_DEPTH_BUFFER_BIT);
        // clear depth

        p_state->LineWidth(3);
        p_state->DisableTexture2D();

        glColor3f(.75f, .75f, .75f);
        glBegin(GL_POLYGON);
        for(int i = 0; i < 360; i += 15) {
            float f_angle = i / 180.0f * f_pi;
            glVertex2f(sin(f_angle) * .95f, cos(f_angle) * .95f);
        }
        glEnd();
        // draw polygon

This should be clear to anyone who has ever used OpenGL. We're deliberately not clearing the color buffer; we want to draw over the contents of the texture. Once we're done drawing to the texture, we have to release the texture and the render buffer before we can use it.

        if(!p_frame_buffer->BindColorTexture_2D(0, GL_TEXTURE_2D) ||
           !p_frame_buffer->Release()) {
            fprintf(stderr, "error: failed to release FBO\n");
            return -1;
        }
        // release FBO

That was simple. Bind texture object 0 to release previously bound texture. Now we could use texture to draw something.

In case the texture has mip-maps, we need to generate them. We rendered the top level and we want OpenGL to update the rest. There has been an extension for automatic mipmap generation (SGIS_generate_mipmap); it allowed mipmaps to be generated automatically after rendering to the texture. But we're going to use the simple glGenerateMipmapEXT function (part of EXT_framebuffer_object, which we already require) instead:

        p_noise_texture->Bind_Enable(p_state);
        glGenerateMipmapEXT(GL_TEXTURE_2D);

Mipmaps updated. That's it.

We're now going to render to two textures at once. We're going to need a new framebuffer which would allow us to bind two textures:

    printf("create render-buffer for multi r2t (with depth buffer)\n");

    CGLFrameBuffer *p_frame_buffer2;
    if(!(p_frame_buffer2 = new CGLFrameBuffer(256, 256, 2, true, GL_RGBA8,
       true, false, GL_DEPTH_COMPONENT24, false, false, 0))) {
        fprintf(stderr, "error: not enough memory\n");
        return -1;
    }

We already know we need a shader to produce different outputs to different textures. So we have to build a shader. We're going to use the simpler wrapper CGL_Core_shader (which stands for the OpenGL 2.0 core high-level shader).

First thing to do is to check if the shader is actually supported:

    if(!CGL_Core_shader::b_Supported()) {
        fprintf(stderr, "error: CGL_Core_shader not supported\n");
        return -1;
    }

Now, without hesitation we create a new CGL_Core_shader object:

    CGLShader *p_shader;

    const char *p_s_red_green = "void main() { gl_FragData[0] ="
        " vec4(1, 0, 0, 1); gl_FragData[1.0] = vec4(0, 1, 0, 1); }";
    // with intentional warning

    if(!(p_shader = new CGL_Core_shader(0, p_s_red_green))) {
        fprintf(stderr, "error: not enough memory\n");
        return -1;
    }

This creates a very simple shader. The first parameter of the CGL_Core_shader constructor is a pointer to the vertex shader source, the second is the fragment shader source (there's a constructor with a geometry shader as well). We need a fragment shader only, so we pass 0 instead of the vertex shader source. For the sake of clarity, here's the fragment shader source code (GLSL, not C++) once again:

void main()
{
    gl_FragData[0] = vec4(1, 0, 0, 1);
    gl_FragData[1] = vec4(0, 1, 0, 1);
}

It is very simple. This main() function is called once per output pixel. It outputs a constant red color (specified as an RGBA vector) to gl_FragData[0] (the first draw buffer) and a constant green color to gl_FragData[1] (the second draw buffer). The only thing that remains is to compile the shader. That is what we do now:

    char *p_s_info_log = 0;
    if(!p_shader->Compile(p_state, p_s_info_log)) {
        fprintf(stderr, "error: error compiling shader:\n%s\n",
            (p_s_info_log)? p_s_info_log : "(null)");
        if(p_s_info_log)
            delete[] p_s_info_log;
        return -1;
    }
    if(p_s_info_log) {
        //fprintf(stderr, "warning: while compiling shader:\n%s\n", p_s_info_log);
        delete[] p_s_info_log;
        p_s_info_log = 0;
    }
    if(!p_shader->Link(p_s_info_log)) {
        fprintf(stderr, "error: error linking shader:\n%s\n",
            (p_s_info_log)? p_s_info_log : "(null)");
        if(p_s_info_log)
            delete[] p_s_info_log;
        return -1;
    }
    if(p_s_info_log) {
        //fprintf(stderr, "warning: while linking shader:\n%s\n", p_s_info_log);
        delete[] p_s_info_log;
        p_s_info_log = 0;
    }
    // compile and link the shader

Even though the code is longer, we only call two functions to compile and link the shader. They might return compiler logs (error messages).

The whole process of rendering to two textures is almost identical to rendering to a single texture, so I'm not going to split the code into basic sections:

    {
        if(!p_frame_buffer2->b_Status() ||
           !p_frame_buffer2->Bind() ||
           !p_frame_buffer2->BindColorTexture_2D(*p_texture, GL_TEXTURE_2D, 0) ||
           !p_frame_buffer2->BindColorTexture_2D(*p_texture2, GL_TEXTURE_2D, 1)) {
            fprintf(stderr, "error: failed to bind FBO for r2t\n");
            return -1;
        }
        glViewport(0, 0, p_frame_buffer2->n_Width(), p_frame_buffer2->n_Height());
        // bind FBO for rendering to texture

        glMatrixMode(GL_PROJECTION);
        glLoadIdentity();
        glMatrixMode(GL_MODELVIEW);
        glLoadIdentity();
        // ortho

        p_state->ClearColor4f(1, 1, 1, 1);
        p_state->EnableDepthTest();
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        // clear color & depth

        p_shader->Bind(p_state);
        // bind shader

        glBegin(GL_QUADS);
        glVertex2f(-1, -1);
        glVertex2f( 1, -1);
        glVertex2f( 1,  1);
        glVertex2f(-1,  1);
        glEnd();
        // draw fullscreen quad

        if(!p_frame_buffer2->BindColorTexture_2D(0, GL_TEXTURE_2D, 0) ||
           !p_frame_buffer2->BindColorTexture_2D(0, GL_TEXTURE_2D, 1) ||
           !p_frame_buffer2->Release()) {
            fprintf(stderr, "error: failed to release FBO\n");
            return -1;
        }
        // release FBO

        p_state->BindProgramObject(0);
        // release shader
    }
    // render to two textures at once

That's actually all there is to image processing in OpenGL. The shader here is only used to render a red and a green quad, but it could do much more.

In order to write complex shaders, one needs to know how to sample textures in shaders and how to specify shader parameters. There are functions to specify shader parameters in CGL_Core_shader, but there is one more complete class called CGLShaderObject, which automates parameter specification. CGLShaderObject doesn't take a simple string, but a more complete shader specification:

    TShaderInfo t_filter("filter", TShaderInfo::proc_Fragment,
        "uniform sampler2D n_tex;\n"
        "uniform vec4 v_pix;\n"
        "void main()\n"
        "{\n"
        "    vec4 v_accum = texture2D(n_tex, gl_TexCoord[0].xy, 1.0);\n"
        "    v_accum += texture2D(n_tex, gl_TexCoord[0].xy + v_pix.xw, 1.0) * .5;\n"
        "    v_accum += texture2D(n_tex, gl_TexCoord[0].xy - v_pix.xw, 1.0) * .5;\n"
        "    v_accum += texture2D(n_tex, gl_TexCoord[0].xy + v_pix.wy, 1.0) * .5;\n"
        "    v_accum += texture2D(n_tex, gl_TexCoord[0].xy - v_pix.wy, 1.0) * .5;\n"
        "    v_accum += texture2D(n_tex, gl_TexCoord[0].xy + v_pix.xy, 1.0) * .25;\n"
        "    v_accum += texture2D(n_tex, gl_TexCoord[0].xy - v_pix.xy, 1.0) * .25;\n"
        "    v_accum += texture2D(n_tex, gl_TexCoord[0].xy + v_pix.xz, 1.0) * .25;\n"
        "    v_accum += texture2D(n_tex, gl_TexCoord[0].xy - v_pix.xz, 1.0) * .25;\n"
        "    gl_FragColor = v_accum * .25;\n" // normalize (the kernel weights sum to 4)
        "}\n", "n_tex");

This is the specification of the source code of a single shader, named 'filter'. It is going to run on the fragment processor. The source code is a very simple 3x3 convolution filter. There are two important things in there.

'uniform sampler2D n_tex' basically defines an integer variable, containing the index of a texturing unit. The keyword 'uniform' means it is a shader parameter; it must be filled after the shader is bound and before anything is drawn. CGLShaderObject does this automatically.
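Under the hood, filling a sampler uniform amounts to two plain OpenGL 2.0 calls (a rough sketch with made-up variable names; CGLShaderObject issues the equivalent for you):

```cpp
GLint n_location = glGetUniformLocation(n_program, "n_tex");
// query the uniform's location in the linked program (-1 if it doesn't exist)
glUniform1i(n_location, 0);
// a sampler uniform holds the index of a texturing unit, here unit 0
```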

'uniform vec4 v_pix' is a parameter as well; it is a vector of four floats and it contains the size of a pixel in the source texture (so the convolution filter knows where the neighbor pixels are).

The function texture2D gets a texture sample from the texturing unit given by n_tex (n_tex must be a uniform sampler, it can't be a constant). The sample coordinate is gl_TexCoord[0] (the interpolated vertex texture coordinate). The '.xy' means we only want to use two of its components (we need a 2D texture coordinate; OpenGL texture coordinates are, in fact, 4D).
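To see exactly what the shader computes, here is the same 3x3 kernel evaluated on the CPU (a self-contained sketch; the helper names are made up, and clamping at the image border stands in for the texture's wrap mode):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// fetch a pixel, clamping coordinates at the image border
static float f_Sample(const std::vector<float> &r_image,
    int n_width, int n_height, int x, int y)
{
    x = (x < 0)? 0 : (x >= n_width)? n_width - 1 : x;
    y = (y < 0)? 0 : (y >= n_height)? n_height - 1 : y;
    return r_image[x + y * n_width];
}

// the same kernel the 'filter' shader evaluates per fragment
static float f_Filter_3x3(const std::vector<float> &r_image,
    int n_width, int n_height, int x, int y)
{
    float f_accum = f_Sample(r_image, n_width, n_height, x, y); // center, weight 1
    f_accum += (f_Sample(r_image, n_width, n_height, x - 1, y) +
        f_Sample(r_image, n_width, n_height, x + 1, y) +
        f_Sample(r_image, n_width, n_height, x, y - 1) +
        f_Sample(r_image, n_width, n_height, x, y + 1)) * .5f; // edge neighbors
    f_accum += (f_Sample(r_image, n_width, n_height, x - 1, y - 1) +
        f_Sample(r_image, n_width, n_height, x + 1, y - 1) +
        f_Sample(r_image, n_width, n_height, x - 1, y + 1) +
        f_Sample(r_image, n_width, n_height, x + 1, y + 1)) * .25f; // corners
    return f_accum * .25f; // the weights sum to 4, so this normalizes the result
}
```

A flat image stays flat under this filter (the weights sum to one after normalization), while an isolated bright pixel gets smeared over its 3x3 neighborhood.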

The last parameter contains only 'n_tex'. It is used to assign each sampler its texturing unit. If there were more samplers, they would be separated by '|' characters, for example: "n_texture|n_lightmap" (the texture is in the first texturing unit, the lightmap in the second).
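For instance, a hypothetical two-sampler shader would be declared like this (the identifiers here are made up, the pattern is the same as above):

```cpp
TShaderInfo t_lightmap("lightmap", TShaderInfo::proc_Fragment,
    "uniform sampler2D n_texture;\n"
    "uniform sampler2D n_lightmap;\n"
    "void main()\n"
    "{\n"
    "    gl_FragColor = texture2D(n_texture, gl_TexCoord[0].xy) *\n"
    "        texture2D(n_lightmap, gl_TexCoord[0].xy);\n"
    "}\n", "n_texture|n_lightmap");
// n_texture gets texturing unit 0, n_lightmap gets texturing unit 1
```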

What TShaderInfo does should be pretty straightforward now. Compiling time.

    CGLShaderObject *p_shader_object;
    if(!(p_shader_object = new CGLShaderObject(0, 0, &t_filter))) {
        fprintf(stderr, "error: not enough memory\n");
        return -1;
    }
    char *p_s_compile_log = 0;
    char *p_s_link_log = 0;
    if(!p_shader_object->Compile_Link(p_state, p_s_compile_log, p_s_link_log)) {
        fprintf(stderr, "error: error compiling shader:\n%s\n%s\n",
            (p_s_compile_log)? p_s_compile_log : "(null)",
            (p_s_link_log)? p_s_link_log : "(null)");
        if(p_s_compile_log)
            delete[] p_s_compile_log;
        if(p_s_link_log)
            delete[] p_s_link_log;
        return -1;
    }
    if(p_s_compile_log) {
        fprintf(stderr, "warning: while compiling shader:\n%s\n", p_s_compile_log);
        delete[] p_s_compile_log;
    }
    if(p_s_link_log) {
        fprintf(stderr, "warning: while linking shader:\n%s\n", p_s_link_log);
        delete[] p_s_link_log;
    }
    // compile and link shader

The compiling code is pretty similar to what we've seen already, only the compile and link functions have been merged. Now the shader is ready to use. We only need to set the value of the 'v_pix' parameter (n_tex is set automatically and we don't have to think about it, unless we want to change it at runtime).

Changing a parameter is pretty easy. Find the parameter by name and call SetValue(). It sets the value of the parameter and the shader will remember it. It might as well be set to a different value every frame (set it before binding the shader, so the bind function can pass the parameters to the GPU). This is how to do that:

    if(!p_shader_object->t_Find_Parameter("v_pix").SetValue(2.0f / 256, 2.0f / 256, -2.0f / 256, 0))
        fprintf(stderr, "warning: v_pix uniform not found\n");
    // set pixel size

    p_shader_object->Bind(p_state);
    // bind shader

There might be cases when a shader runs with several parameter combinations that don't change. Then it's possible to create a few detached parameter sets and use them when binding the shader. We're going to try that:

    CGLShaderParams *p_shader_params;
    if(!(p_shader_params = p_shader_object->p_GetParameterBuffer())) {
        fprintf(stderr, "error: not enough memory\n");
        return -1;
    }
    // get shader parameter buffer

    if(!p_shader_params->t_Find_Parameter("v_pix").SetValue(2.0f / 256, 2.0f / 256, -2.0f / 256, 0))
        fprintf(stderr, "warning: v_pix uniform not found\n");
    // set pixel size

When binding the shader, it's required to pass the shader parameter buffer in order for the shader to use it:

    p_shader_object->Bind(p_state, p_shader_params);
    // bind shader

We already know the rest. Bind the framebuffer and the texture to render to, draw a fullscreen quad, done. That's what GPGPU is about. To get started, though, it's good to read the GLSL specification (to learn about the built-in variables, functions, ...).
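A typical GPGPU pattern built from these pieces is "ping-pong" rendering: two framebuffers of the same size take turns as source and destination, so a filter can be applied repeatedly. A rough sketch of one pass, reusing the pieces shown above (the surrounding FBO binding steps are indicated by comments, since they look exactly like the earlier snippets):

```cpp
for(int n_pass = 0; n_pass < n_pass_num; ++ n_pass) {
    // bind the destination FBO and attach its color texture
    // (exactly as in the render-to-texture snippet above,
    // alternating between the two FBO:s every pass)

    // bind the source FBO's color texture to texturing unit 0,
    // so the shader's n_tex sampler reads the previous pass' result

    p_shader_object->Bind(p_state, p_shader_params);
    // bind the filter shader with its parameter buffer

    glBegin(GL_QUADS);
    glVertex2f(-1, -1);
    glVertex2f( 1, -1);
    glVertex2f( 1,  1);
    glVertex2f(-1,  1);
    glEnd();
    // each fullscreen quad applies the filter once more

    // release the destination FBO; source and destination swap roles
    // on the next iteration
}
```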

Anyway, you can download the ÜberLame GPGPU demo. You will need ÜberLame r18 to build it.

Go to ÜberLame project page.

References, further reading