Overview
A compute shader is a programmable shader stage that expands OpenGL beyond graphics programming. Like other programmable shaders, a compute shader is written in GLSL. It provides a single-stage SIMD pipeline parallelized on the GPU, and adds memory sharing and thread synchronization features that allow more effective parallel programming methods.
Create a Compute Shader Program:
- glCreateShader(GL_COMPUTE_SHADER) - create a compute shader
- glShaderSource() - set the shader source
- glCompileShader() - compile the shader
- glCreateProgram() - create a shader program
- glAttachShader() - attach the shader to the shader program
- glLinkProgram() - link all shaders in the shader program
- glUseProgram() - set the current program to execute on a GPGPU pass
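A minimal host-side sketch of this sequence, assuming an OpenGL 4.3+ context and a function loader (GLEW is used here only as an illustrative choice); error checking is omitted:

#include <GL/glew.h>

GLuint create_compute_program(const GLchar *src)
{
    GLuint shader = glCreateShader(GL_COMPUTE_SHADER);  // create a compute shader
    glShaderSource(shader, 1, &src, NULL);              // set the shader source
    glCompileShader(shader);                            // compile the shader

    GLuint program = glCreateProgram();                 // create a shader program
    glAttachShader(program, shader);                    // attach the shader to the program
    glLinkProgram(program);                             // link all shaders in the program

    glUseProgram(program);                              // make it current for the GPGPU pass
    return program;
}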
Dispatch the Compute Pipeline:
- glGenBuffers() - create a buffer
- glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER) - bind the buffer - set it as the dispatch-indirect buffer in the context
- glBufferData() - pass a struct containing 3 x GLuint fields giving the number of workgroups to launch in x, y and z
- glDispatchComputeIndirect() - dispatch the compute shader, with a GLintptr offset pointing to the location of those workgroup counts within the bound buffer
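A host-side sketch of an indirect dispatch, assuming `program` was built as above (the workgroup counts are illustrative):

typedef struct { GLuint num_groups_x, num_groups_y, num_groups_z; } DispatchIndirectCommand;

DispatchIndirectCommand cmd = { 64, 64, 1 };     // example workgroup counts

GLuint dib;
glGenBuffers(1, &dib);                           // create a buffer
glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, dib);  // bind it as the dispatch-indirect buffer
glBufferData(GL_DISPATCH_INDIRECT_BUFFER, sizeof(cmd), &cmd, GL_STATIC_DRAW);  // upload the 3 GLuint counts

glUseProgram(program);
glDispatchComputeIndirect((GLintptr)0);          // offset 0: the counts start at the beginning of the buffer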
Built-in Variables:
- uvec3 gl_WorkGroupSize - local workgroup size (3D volume of work items)
- uvec3 gl_NumWorkGroups - global workgroup count (workgroups launched in each dimension)
- uvec3 gl_LocalInvocationID - work item ID relative to the local workgroup
- uvec3 gl_WorkGroupID - local workgroup ID within the global dispatch
- uvec3 gl_GlobalInvocationID - work item ID relative to the global dispatch
- uint gl_LocalInvocationIndex - a 1D array index representation of gl_LocalInvocationID
These are related by gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.
Synchronization Functions:
- barrier() - local workgroup execution sync - e.g. write to shared variables from one invocation and read from another.
- memoryBarrier() - globally force memory transactions to complete in order
- groupMemoryBarrier() - force memory transactions to complete in order within the local workgroup
- memoryBarrierAtomicCounter() - wait for writes to atomic counters before continuing.
- memoryBarrierBuffer() - wait for writes to buffer variables before continuing.
- memoryBarrierImage() - wait for writes to image variables before continuing.
- memoryBarrierShared() - wait for writes to shared variables before continuing.
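As an illustrative sketch of barrier() and memoryBarrierShared(), shown here as a host-side GLSL source string (the buffer names and std430 layout are assumptions for this example): each workgroup stages values in a shared array, synchronizes, then reads a slot written by a different invocation.

static const GLchar *shared_example_src =
    "#version 440 core\n"
    "layout (local_size_x = 128) in;\n"
    "layout (std430, binding = 0) buffer InBuf  { float inBuf[];  };\n"
    "layout (std430, binding = 1) buffer OutBuf { float outBuf[]; };\n"
    "shared float tile[128];\n"                                 // shared across the local workgroup
    "void main(void) {\n"
    "    uint i = gl_LocalInvocationID.x;\n"
    "    tile[i] = inBuf[gl_GlobalInvocationID.x];\n"           // each invocation writes one slot
    "    memoryBarrierShared();\n"                              // make the shared writes visible
    "    barrier();\n"                                          // wait until every invocation in the group has written
    "    outBuf[gl_GlobalInvocationID.x] = tile[127u - i];\n"   // read a slot written by another invocation
    "}\n";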
Implementation Details
- parallelism is explicitly specified in a 3D hierarchy
- global workgroups contain local workgroups, which contain work items
- in GLSL, read data from a specific location in an input array or set the value of elements in an output array
- local workgroup size is defined in GLSL as an input layout qualifier via local_size_x, local_size_y, local_size_z - each defaults to 1. The number of workgroups to launch is supplied separately at dispatch time (e.g. the parameters read by glDispatchComputeIndirect), so local size and workgroup counts together define the total grid of work items.
- read / write an imageBuffer / uniform image2D with the imageLoad() / imageStore() functions, specifying a location (and, for a write, a value)
- parallel invocations of work items can access GLSL variables declared with the shared qualifier (shared across the local workgroup) and communicate via shared GPU memory.
- glGetIntegerv(GL_MAX_COMPUTE_SHARED_MEMORY_SIZE) - query the maximum capacity (in bytes) available for shared variables (see the query sketch after this list)
- Run async – a compute dispatch runs in a non-blocking manner; use a sync object (e.g. glFenceSync / glClientWaitSync) or an appropriate glMemoryBarrier() before consuming the results
- Choose an optimal local workgroup size - appropriate for the workload and hardware – small enough to fit, but big enough to leverage the GPU's parallelism
- use shared variables - better performance than repeated access to images or shader storage buffers
- sync when necessary - avoid race conditions by using barriers
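A minimal sketch of querying the limits that inform these choices (variable names are illustrative):

GLint shared_bytes = 0, max_invocations = 0;
glGetIntegerv(GL_MAX_COMPUTE_SHARED_MEMORY_SIZE, &shared_bytes);         // shared-variable capacity, in bytes
glGetIntegerv(GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS, &max_invocations);  // max work items per local workgroup

GLint max_count[3], max_size[3];
for (int i = 0; i < 3; ++i) {
    glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, i, &max_count[i]);  // per-axis limit on workgroup counts
    glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE, i, &max_size[i]);    // per-axis limit on local size
}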
Eg a Canonical Compute Shader
#version 440 core
layout (local_size_x = 16, local_size_y = 16) in;
// e.g. specify the local workgroup to be a 16 x 16 ( x 1) 2D matrix
layout (rgba32f, binding = 0) uniform imageBuffer _buffer;
uniform float c;
void main(void){
    // flatten the 2D global invocation ID into a unique 1D index into the image buffer
    uint width = gl_NumWorkGroups.x * gl_WorkGroupSize.x;
    int idx = int(gl_GlobalInvocationID.y * width + gl_GlobalInvocationID.x);
    vec4 v = imageLoad(_buffer, idx);   // read one texel
    v.xyz += v.xyz * c;                 // scale the colour channels
    imageStore(_buffer, idx, v);        // write back the updated vec4 (not the scalar c)
}
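A host-side sketch for running this shader, assuming `program` was built as shown earlier, `tbo` is a buffer object already filled with rgba32f (vec4) data, and `num_elements` is a multiple of 256 (16 x 16); these names are illustrative, and a direct glDispatchCompute is used here for brevity instead of the indirect path:

GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_BUFFER, tex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, tbo);                         // back the buffer texture with tbo

glUseProgram(program);
glBindImageTexture(0, tex, 0, GL_FALSE, 0, GL_READ_WRITE, GL_RGBA32F);   // matches binding = 0 in the shader
glUniform1f(glGetUniformLocation(program, "c"), 0.5f);                   // set the scale factor

glDispatchCompute(num_elements / (16 * 16), 1, 1);                       // one 16 x 16 workgroup per 256 elements
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);                     // order the image writes before later shader image accesses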