A comprehensive guide to parallel video decoding
August 26, 2011 15 Comments
As promised, today we’ll talk about video decoding. We will review the most important operations that a decoder has to fulfill, and for each case see what kind of speed boost we can expect with a shader based video decoding. Because video decoding is a complex process and one blog post can hardly be thorough, I’ll provide related links for each chapter, if you wish to start your own research on a particular subject ;-)
A little bit of background is needed before going further.
1/ Colors coding
Colors are usually represented in the 24 bits RGB scheme (8 bits for each of the 3 components). It is the “native format” for most of the screens and most importantly for the human eye.
But when it comes to video compression, a different scheme is used. It’s called YCbCr, and you may know it as “YUV”. It is not a different colorspace than RGB but rather a different way to write it. The fact is that the human eye can see higher difference of luminosity than difference of colors. As for a RGB pixel, a YCbCr pixel has 3 components : Y is the “luma” component, Cb and Cr the “chroma” blue and red. The whole components of the same type of a picture are call a plane. The trick is to store the two chroma components at quarter resolution and the luma component at full resolution. This is called “chroma sub-sampling”, and the VP8 codec only support “4:2:0” mode.
So with a 4:2:0 sub-sampling, we have an overall of 50% of the components removed, for a drop of quality almost imperceptible !
2/ Block based coding
Most of the computation that we are going to study don’t work on the whole picture, but instead the video encoders divide the picture in smaller block of data, easier and especially faster to process.
First we have the “macroblocks” which are 16×16 pixels wide, used by the inter-prediction process. Then we have the “blocks” of 4×4 pixels, used by the intra-prediction process. Each macroblock includes 16 4×4 blocks of luma components, and 32 2×2 blocks of chroma components (8 for chroma blue, 8 for chroma red).
In this post I’ll mostly talk of the treatments applied on the 4×4 luma blocks in order to save time, not going too deep on the details.
3/ Entropy decoding
The first process in “decoding order” is the entropy decoder. Lets paraphrase wikipedia :
In information theory an entropy encoding is a lossless data compression scheme that is independent of the specific characteristics of the medium.
One of the main types of entropy coding creates and assigns a unique prefix-free code to each unique symbol that occurs in the input. These entropy encoders then compress data by replacing each fixed-length input symbol by the corresponding variable-length prefix-free output codeword. The length of each codeword is approximately proportional to the negative logarithm of the probability. Therefore, the most common symbols use the shortest codes.
The only thing we need to known, is that entropy decoding is a very linear process and cannot be parallelized. The symbols are decompressed one by one (the bitstream is processed bit by bit in the case of arithmetic coding, i/o are bufferized however) and each symbol decoded updates the probability context of the decoder. Loose one bit, loose one frame.
Conclusion ? The entropy decoding stays in the CPU. What we need to do though is to decode the bitstream content for one frame in one pass, and not to start other processes as soon as enough data are available. That way we will have all the data to start other processes in parallel.
4/ Inverse transformation
The DCT, initials for Discrete Cosine Transform, is the operation at the heart of today video compression formats. The DCT is a reversible and lossless operation, meaning that the encoder and the decoder do the exact same thing, and that it does not directly help data compression, but just reorganize the coefficients of a block by frequency order, with the high frequencies at the beginning of the block, and the low frequencies at the end.
Note that the first coefficient of the block is called the “DC” coefficient, and represents the average of all the values of the coefficients of the block. The other coefficients are called “AC”.
The VP8 codec uses a 2d DCT-II on AC coefficients of each 4×4 block of luma pixel in frame.
The DC coefficients are then bundled by groups of 16 (so each DC coefficient of all 16 4×4 blocks of a macroblock) and applied with a secondary WHT transform.
As the DCT and WHT transformations are basically matrix calculus, they does fit well inside shaders. Plus, the blocks of data to process are independent upon each other, which makes parallelization easy. The same DCT is applied to all of the blocks of the picture. But there is a major drawback that we will saw in chapter 6.
5/ Inverse quantization
The quantization stage reduces the amount of information by dividing each transformed blocks of coefficient (a 2d matrix) by a “quantization matrix” and then by a “quantization parameter (used to adjust the final quality of the block) to reduce the quantity of possible values that value could have. All near zero values are zeroed.
Because this makes the values fall into a narrower range, this allows entropy coding to express the values more compactly.
Here is a quick overview of the quantity of data in a block before and after transformation + quantization :
A simple way to gain a lot of speed using shaders is to process several 4×4 blocks of data simultaneously as there is no dependency between blocks (in the limits of available shader units), and to process 4 data at a time inside each block.
Intra-predictions use spatial redundancy inside a frame to reduce the amount of data necessary to encode adjacent blocks. It is a major compression feature, able to greatly reduce the size of a single picture. VP8 has 10 available intra-prediction schemes, which consist in copying the edge of the adjacent block(s) through another block, as seen in the following picture :
The major problem of the intra-prediction is that it reads the edges of adjacent reconstructed blocks. To reconstruct a block, the transformation, quantization and intra-prediction must have been performed on that block.
As the intra-prediction is in the middle of the operations needed to reconstruct a block, it mean that we cannot process these blocks in parallel, because each block can depend on the left, up-left, up-right block.
So basically each time we want to intra-predict a block we have to wait for the result of the previous intra-prediction process. It means that it suppresses the gain of doing parallel block transformation + quantization, because it would need to wait for the intra-prediction results to reconstruct the blocks anyway.
In a CPU implementation we do things like that :
- for each macroblock of a picture
* entropy decode a macroblock
– for each 16 blocks of the macroblock
* entropy decode a block
* block reconstruction
Once a set of data is loaded, it is better to perform all of the operations on that set while the data are still “hot” into the CPU cache, avoiding cache miss and expensive memory reads.
In a GPU implementation, we would try to do that :
- entropy decode everything in a frame
- upload data to the GPU memory
- transform all the blocks (in parallel)
- quantifiy all the blocks (in parallel)
- intra-predict all the blocks
- reconstruct all the blocks (in parallel)
So what to do with the intra-prediction stage ?
When you start several intra-predictions processes at different places of the picture, ignoring the dependency between block is not an option. Do one mistake during the decoding, and that mistake will propagate itself :
One another obstacle is that as each block uses a different intra-prediction scheme, we would need to setup shader programs for each scheme. Each of these shader programs must have a list of blocks it can handle, and they would have to wait for the availability of the reconstructed adjacent blocks of these inside the list to perform their task. But doing so (meaning doing linear processing using a parallel processor) could be counter productive performance wise.
The other solution would be to download transformed and quantified data (after parallel processing) to the main memory, do the intra-prediction in the CPU and re-upload the data for the reconstruction stage and further processing.
So far, my decision is to leave all the treatments before block reconstruction to the CPU, and concentrate on the parallelization of heavier treatments. The SIMD CPU optimizations (MMX, SSE…) will be reintroduced.
7/ Inter-prediction (motion compensation)
The inter-prediction, or motion compensation, is a practical way to use the temporal redundancy between frames. As two frames which follow each other have a great chance to be very similar, each macroblock of the picture can be just copied into another picture instead of being recoded. If needed, the copied macroblock can be slightly moved and interpolated to match the destination picture. This process can be quite intensive, up to 70% of CPU time, depending on the amount of inter-predicted blocks in the frame of course.
The VP8 specification allows up to 3 frames to be used as reference.
If the reference frames are already loaded into the GPU memory, recomposing a frame by copying macroblocks between GPU textures is faster than doing the same thing across CPU memory.
One other advantage is that sometimes an interpolation filter must be applied to the block of texture copied, and texture filtering is one thing a GPU do fast, or at least faster than a CPU.
8/ Loop filter
As we talked about on this post, most of the processing during video decoding are done on small blocks of data. One side effect is that differences can appear between blocks when they are compressed with slightly different parameters.
To counter this “blocking effect”, a “loop filter”, or “deblocking filter” must be applied on the final picture. This filter is usually applied as a post processing filter for low bitrate videos.
In VP8 as in the H.264, the loop filter is a mandatory feature, allowing the encoder to lower the bitrate by 10%, regained in quality by applying the loop filter in both encoding and decoding paths. The only problem is that the loop filter is the most expensive operation of the VP8 decoding process, from 30% and up to 60% of the total decoding time, especially with high definition videos.
VP8 has two filter modes, a low complexity “simple filter” and a high complexity “normal filter” (used in almost every video). These two modes can be adjusted on a frame basis. The filter is somewhat adaptive, as some filtering parameter can be adjusted for each macroblock.
The goal is to smooth the rough edges between each block of the three planes (Y, Cb and Cr) for the normal filter, and only the Y planes for the simple filter. Depending if we are on a block or macroblock boundaries the process is not exactly the same, as 16×16 macroblock can pack parameters specific to a group of 16 of 4×4 blocks, there can be bigger differences at these macroblocks edges, so adjacent blocks edges within a macroblock may be concatenated and processed at once in their entirety.
- left macroblock edge (16 pixels wide)
- each left edge (4 pixels wide) for every 16 blocks inside the macroblock
- top marcoblock edge (16 pixels wide)
- each top edge (4 pixels wide) for every 16 blocks inside the macroblock
The deblocking process reads and then writes to the same values. But the shader programs usually input and output to different textures. The idea would be to create a new texture to output only delta values for each pixel, and then to compose the “delta texture” to the “frame texture”.
This VP8 deblocking process is linear, each macroblock must be fully processed before staring the next one. In order to parallelize it, it should be safe to start several deblocking process at different macroblocks (chosen depending on their filter strength) at the expense of some particular edges left unfiltered.
9/ Color space conversion
Color space conversion is the last process in “decoding order”. As we’ve seen in chapter 1, the video formats don’t store their pixels in RGB, but in subsampled YCbCr format.
To convert a pixel between the two color models, the VP8 uses the ITU recommendation 601, which defines among some other things how to convert an analog RGB signal to a digital YCbCr signal. From there an equation to convert digital YCbCr to digital RGB can be extrapolated :
Where R, G and B are the 3 components of a 24 bits RGB color space, and Y, Cb and Cr are the 3 components of a 24 bits YCbCr color space. The prime denotes a gamma-corrected component.
The YCbCr to RGB conversion can be intense on the CPU, for each pixel, 9 divisions, 13 multiplications, 11 additions/subtractions. With some rounding and factorisation we can find a faster implementation, with still 7 multiplications, 7 bitshift and 7 additions/subtractions :
Once again, to do that operation faster in a shader unit, convert 4 components at a time, and start simultaneously several conversion processes at different places of the frame.
Video codecs are not conceived with performance in mind, as the primary goal is to save a lot of data for transmission. Video decoding is not really a “parallel job”. Some operations are by nature linear, and their handling by a GPU have to be carefully thought. The raw power of a GPU is only the addition of the raw power of its many computing units. When some operations cannot be parallelized, a GPU will end up being slower than a CPU for the same task.
The hard thing when it comes to GPU decoding is to properly identify which parts of the decoding can be parallelized and wich parts cannot. The key aspect is to do the maximum of operations inside the GPU memory, with shader programs working directly in that memory, and to try to have as little interaction as possible between CPU and GPU. Best case scenario, a shader outputs its results directly to another shader inputs. Uploading data to the GPU memory or scheduling shader programs introduces some latency which could became a net performance penalty in the decoding if done too frequently.
If you have any question regarding the content of this post, ask these in the comments and I’ll do my best to answer it.
11/ Summer of Code conclusion
As you may know the Google Summer of Code 2011 ends today.
I did not finish my project and only completed half of it. The “client side” of the VP8 VDPAU implementation is working and is currently being reviewed by the libvdpau maintainers. The mplayer/mplayer2 and ffmpeg/libav patches are also in shape and depending on the feedback on the libvdpau patch they will be sent upstream for review.
The Gallium3D decoder implementation inside Mesa is up and running, tested on r600g and softpipe drivers, but is only a slimmed down port of the libvpx decoder. Currently the only hardware accelerated part is the color space conversion stage, mainly because of the work of Younes Manton and Christian König on the G3DVL interface and the VDPAU state tracker.
My work here is not done, and the ideas expressed in this post have yet to be concretized. I still have some free time, and as I start working late on this project it’s only fair that I dedicate this time to move forward with the decoder implementation.
I’d like to thanks every one that has been involved in this project, starting with my mentor Jakob Bornecrantz who has been very patient with all my questions these past months, Younes Manton and Chrisitan Konïg who have been working on the current G3DVL interface and video decoders, Matt Dew for helping GSoC students since day one, and thanks to everyone else !
Special thanks to Chrisitan Konïg and AMD for providing hardware for the project !
Finally a big thanks to Google, to give students around the world an awesome learning opportunity with the Summer of Code program, and a great way of starting working inside a free software community.