A comprehensive guide to parallel video decoding

As promised, today we’ll talk about video decoding. We will review the most important operations a decoder has to perform, and for each one see what kind of speed boost we can expect from shader-based video decoding. Because video decoding is a complex process and one blog post can hardly be thorough, I’ll provide related links in each chapter in case you wish to start your own research on a particular subject 😉

Let’s begin with a quick overview of the most important operations of a VP8 video decoding process:
[Diagram: the VP8 decoding pipeline]

My GSoC project status

Hi!
It’s been a long time since I last blogged about my Google Summer of Code project, and I’m sorry about that. But today I have some good news!
I have successfully set up a VP8 decoder inside Mesa, built from Google’s libvpx. Let’s walk through the different parts of this new decoding stack.

We need a VP8 video to begin with. As the current decoder is based on libvpx, the official VP8 Codec SDK, all VP8 features are supported, and so every webm file can be used. That’s an important point, and it’s one of the reasons why I decided to start from an existing decoder rather than write a new one from scratch: I can progressively replace key computations with shaders while keeping a solid base known to work in every situation.

libvdpau

To link a video player and a video decoder through the VDPAU API, the first component used is libvdpau. Formerly part of the NVIDIA video driver, it is now a standalone package that allows third parties to implement the API in video players and video decoders.
This library has been patched to support the VP8 codec, and to handle a VdpPictureInfoVP8 structure. That structure is filled by a VDPAU video player with information contained in each VP8 frame header, and is then passed to a VDPAU decoder along with the bitstream buffer.
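
To give an idea of the data involved, here is a minimal sketch of what such a structure could look like. The field names below are illustrative assumptions, not the exact layout of the patch:

    /* Hypothetical sketch of a VdpPictureInfoVP8 structure;
     * field names are assumptions, not the exact patch. */
    typedef struct {
        /* Values parsed from the VP8 frame header by the player */
        uint32_t key_frame;     /* 1 = key frame, 0 = inter frame  */
        uint32_t version;       /* VP8 version number (0-3)        */
        uint32_t show_frame;    /* whether this frame is displayed */
        uint32_t width;
        uint32_t height;

        /* Already-decoded frames used for inter prediction */
        VdpVideoSurface last_frame;
        VdpVideoSurface golden_frame;
        VdpVideoSurface altref_frame;
    } VdpPictureInfoVP8;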

mplayer/mplayer2 and ffmpeg/libav

Then, we need a media player. I have patched mplayer (and its recent fork mplayer2) together with ffmpeg (and its recent fork libav) to support VP8 decoding through VDPAU. The patches for mplayer and mplayer2 are almost the same, but the two players’ actual VDPAU implementations differ, with the advantage going to mplayer2. The ffmpeg and libav patches are identical.

To launch mplayer, the following arguments must be used:

mplayer -vo vdpau -vc ffvp8vdpau myvideofile.webm

  • -vo vdpau tells mplayer to initialize a VDPAU video output
  • -vc ffvp8vdpau tells mplayer to use the ffvp8vdpau video codec, a slightly modified version of the native VP8 codec of ffmpeg/libav

So basically, mplayer uses libvdpau to find an available decoder, initializes it, creates a surface that is going to be filled by the decoder, and then, after each frame is decoded, draws the surface on screen.

As mplayer/mplayer2 rely on ffmpeg/libav to do the frame decoding itself, a hook has to be added to the frame decoding process, to bypass the regular decoder and send the data to the VDPAU decoder.
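
As an illustration, here is a rough sketch of what such a hook can look like, loosely modeled on how ffmpeg’s existing VDPAU codecs hand data over to the player. The function and helper names are assumptions, not the exact patch:

    /* Illustrative sketch: instead of decoding, the "codec" forwards
     * the compressed frame to the player's VDPAU render state.
     * get_render_state() and fill_vp8_picture_info() are hypothetical
     * helpers; assume the bitstream buffer array is already allocated. */
    static int vdpau_vp8_decode_picture(AVCodecContext *avctx,
                                        const uint8_t *buf, int buf_size)
    {
        struct vdpau_render_state *render = get_render_state(avctx);

        /* 1. Fill the picture info from the parsed frame header */
        fill_vp8_picture_info(avctx, &render->info);

        /* 2. Hand over the raw bitstream instead of decoding it */
        render->bitstream_buffers[0].struct_version  = VDP_BITSTREAM_BUFFER_VERSION;
        render->bitstream_buffers[0].bitstream       = buf;
        render->bitstream_buffers[0].bitstream_bytes = buf_size;
        render->bitstream_buffers_used = 1;

        /* The player later calls VdpDecoderRender() with this state
         * and draws the resulting surface. */
        return 0;
    }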

Mesa

The “big part” of this GSoC is obviously the VP8 decoder implementation living inside Mesa. This piece of code is going to be identified as a “device” by libvdpau. While most of the time a device is a hardware driver, the Mesa decoder simply registers itself as an available video decoder. This device is going to be called by ffmpeg/libav for every frame to decode, with these arguments:

  1. The content of the frame header (with various information like frame size, type, …)
  2. The bitstream buffer, which contains the compressed data representing exactly one frame
  3. Up to 3 “reference frames”, which are already decoded frames used for motion compensation

With that, the decoder can do its job and decode frames. When a frame is ready to be drawn on screen, the decoder must load that frame into the surface provided by the VDPAU video output created by mplayer.
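
In VDPAU terms, those three arguments map onto a single per-frame call. Here is a minimal sketch, where frame_data, frame_size, decoder, target and picture_info are placeholders, and VP8 support in the picture info comes from the libvdpau patch:

    /* One VdpDecoderRender() call decodes exactly one frame:
     * picture_info carries the frame header fields and the reference
     * surfaces, the bitstream buffer carries the compressed data,
     * and 'target' is the surface the decoder must fill. */
    VdpBitstreamBuffer buffer = {
        .struct_version  = VDP_BITSTREAM_BUFFER_VERSION,
        .bitstream       = frame_data,   /* one compressed VP8 frame */
        .bitstream_bytes = frame_size,
    };

    VdpStatus st = vdp_decoder_render(decoder, target,
                                      (VdpPictureInfo const *)&picture_info,
                                      1, &buffer);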

  • The first step was to create a new decoder stub, based on the existing g3dvl interface (where all the Gallium3D video decoding work takes place), and add it to the VDPAU state tracker, to advertise VP8 decoding capabilities.
  • The second step was to plug a working decoder into these new function stubs (a simplified sketch follows).
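
Here is a simplified illustration of what such a stub can look like; the names are mine, and the real g3dvl interface differs in its details:

    /* Simplified sketch of a decoder stub in the g3dvl style.
     * Names are illustrative, not the actual Mesa code. */
    struct vp8_video_decoder {
        struct pipe_video_decoder base;  /* common g3dvl interface */
        void *libvpx_state;              /* stripped-down libvpx   */
    };

    static void
    vp8_decode_bitstream(struct pipe_video_decoder *decoder,
                         unsigned num_bytes, const void *data)
    {
        /* Feed one frame's worth of compressed data to libvpx,
         * then copy the decoded picture into the target surface. */
    }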

I used the official libvpx (close to version 0.9.7) and stripped out several features. The goal, of course, is to have a lightweight standalone decoder. Examples of removed code are the VP8 encoder, multi-threading (which was actually counterproductive in the decoder part), the libvpx API, CPU runtime detection, the frame scaling functionality, a custom memory manager, etc.

And it worked!

This new decoder can be used by the patched mplayer, through the VDPAU Gallium3D state tracker, to decode VP8 videos. Seeking, pausing and everything else work as intended.

Summary

Let’s see how our different components interact:

[Diagram: the VP8 decoding stack]

So far only mplayer can use the VP8 decoder inside Mesa, but implementations can be provided for other popular media players like VLC and, who knows, maybe even the Flash player?! Similarly, the patched version of mplayer can only use the Mesa VP8 decoder, but that is only because there is no other VDPAU VP8 implementation in the wild right now.

What’s left

The work on libvdpau and mplayer/ffmpeg is done and only needs to be reviewed.

Right now the VP8 decoder is a pure C implementation running on the CPU (only color space conversion is done by the GPU, tested with r600g), and it is about 3 times slower than the regular libvpx (mostly because all the CPU SIMD code has been removed). Ultimately, different parts of the decoder are going to be rewritten to fit GPU decoding, and I don’t intend to port any more code from libvpx.

Next week, a more technical post where we will see which decoding operations are the biggest CPU time eaters, and what we are going to do about them 🙂

If you have any questions, go for it!

A VP8 hardware decoding stack

Hi! Today, let’s talk about the different components needed to provide a working VP8 hardware decoding stack.

The goal of my Google Summer of Code project is to provide a generic way of decoding VP8 videos using hardware acceleration. This “hardware acceleration” is going to be provided by modern graphics cards and their “streaming processors”, “shader units”, “CUDA cores” or whatever names they take depending on the vendor’s marketing department. Basically, they are small processors built into GPUs that can process a lot of data simultaneously.

When I say a “generic way to use hardware acceleration”, I mean that a lot of media players should be able to benefit from it, and a lot of graphics cards should be able to provide it.

Decoding back-end

First, a hardware-accelerated decoder has to be built. Because writing even a purely CPU-based video decoder is a heavy task (one that by itself would take longer than the time allocated by the Google Summer of Code program), an existing decoder is going to be used first. Then, the heaviest computational tasks are going to be progressively rewritten to use shaders.

The libvpx library is going to be used and built inside the Mesa 3D Graphics Library. That should allow video card drivers using Mesa (r300g, r600g and nouveau are targeted) to ship with VP8 hardware decoding support.
libvpx is the VP8 reference implementation, supported by Google, and has a BSD-style license compatible with Mesa’s MIT license, which makes it a great candidate for inclusion.

In order to be used, the decoder located within Mesa must advertise its capabilities through a Gallium3D state tracker supporting the VDPAU API.
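
Concretely, “advertising its capabilities” means answering VDPAU capability queries. A minimal sketch of the idea, with purely illustrative limits and a VP8 profile constant that only exists in the patched libvdpau:

    /* Illustrative sketch: the state tracker answers capability
     * queries so that players can discover VP8 support.
     * The limits below are made up for the example. */
    VdpStatus
    decoder_query_capabilities(VdpDevice device, VdpDecoderProfile profile,
                               VdpBool *is_supported, uint32_t *max_level,
                               uint32_t *max_macroblocks,
                               uint32_t *max_width, uint32_t *max_height)
    {
        *is_supported    = (profile == VDP_DECODER_PROFILE_VP8);
        *max_level       = 0;
        *max_width       = 1920;
        *max_height      = 1088;
        *max_macroblocks = (1920 / 16) * (1088 / 16);
        return VDP_STATUS_OK;
    }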

API

So an API is needed to make the link between the video decoder and a media player. When a media player starts decoding a video, it must check whether your system is capable of hardware decoding and, if so, bypass its regular CPU-based decoder to use the GPU-based one. Several APIs can help with that; let’s review some of them:

  • XvMC (X-Video Motion Compensation)

Built as an X.Org extension, and based on the even older X video extension, XvMC allows media players to offload a limited number of operations (motion compensation/inter-frame prediction and iDCT/inverse transformation) to capable GPUs. The XvMC design is quite old and was not conceived with recent video formats in mind.
Its primary targets are MPEG-1/2 videos.

  • VA API (Video Acceleration API)

Originally designed by Intel, VA API’s main motivation was to supersede XvMC with a new design and much-extended capabilities. In addition to motion compensation and inverse transformation, VA API can also handle the deblocking filter, intra-frame prediction and bitstream processing. Like XvMC, VA API only exposes particular chunks of data and their associated processing to the GPU, so a lot of the video decoding logic stays inside the regular CPU-based decoder. Another particularity of VA API is that it handles video encoding as well as video decoding.
Its primary targets are H.264, VC-1, MPEG-2 and MPEG-4 videos. VA API is currently implemented by the Intel Linux driver.

  • VDPAU (Video Decode and Presentation API for Unix)

VDPAU was designed by NVIDIA to offload video decoding and post-processing effects from the CPU. A media player has to start the decoding, but it then passes large portions of the bitstream to VDPAU and gets back fully decoded frames. As almost all of the decoding process can be offloaded, VDPAU allows great flexibility in the implementation of a GPU-based decoder.
Its primary targets are H.264, VC-1, MPEG-2 and MPEG-4 videos. The libvdpau library needs to be slightly patched in order to support VP8 decoding. VDPAU is currently implemented by the NVIDIA closed-source driver, available for the Linux, FreeBSD and Solaris operating systems. VDPAU is well supported by media players, which is why it has been chosen.

* XvMC: Wikipedia / www.xfree86.org/~mvojkovi/XvMC_API.txt
* VA API: Wikipedia / www.freedesktop.org/wiki/Software/vaapi
* VDPAU: Wikipedia / download.nvidia.com/XFree86/vdpau/doxygen/html
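
To make this division of labor concrete, here is a minimal sketch of the sequence a VDPAU player goes through. Error handling is omitted, display, screen, width, height, picture_info and bitstream_buffer are placeholders, and VDP_DECODER_PROFILE_VP8 is the hypothetical profile added by the patched libvdpau:

    /* Minimal VDPAU decode sequence (error handling omitted). */
    VdpDevice device;
    VdpGetProcAddress *get_proc_address;
    vdp_device_create_x11(display, screen, &device, &get_proc_address);

    /* The actual entry points (vdp_decoder_create, ...) are fetched
     * at runtime through get_proc_address(); this step is omitted. */

    VdpDecoder decoder;
    vdp_decoder_create(device, VDP_DECODER_PROFILE_VP8,
                       width, height, 3 /* max references */, &decoder);

    VdpVideoSurface surface;
    vdp_video_surface_create(device, VDP_CHROMA_TYPE_420,
                             width, height, &surface);

    /* For every compressed frame: one render call, one decoded surface */
    vdp_decoder_render(decoder, surface,
                       (VdpPictureInfo const *)&picture_info,
                       1, &bitstream_buffer);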

Media player

Last but not least, you’ll need a media player. The media player loads a given video file, parses its container to gather some information about the file’s content (video length, definition, audio and video codecs, etc.), then launches the decoding process and finally draws the decoded pictures onto the screen.

Today, most of the available media players don’t do media decoding themselves, but instead use libraries dedicated to this task. The most widely used one is ffmpeg, with its recent fork libav; we can also mention GStreamer and xine-lib. These libraries are available on a wide range of operating systems and architectures, and can decode pretty much every video or audio file format available in the wild (and that is a lot).

I chose to add VP8 VDPAU support to ffmpeg/libav. They are well-known libraries, they already support VDPAU for other video formats, and they are used, among others, by the famous VLC media player and MPlayer.

* ffmpeg www.ffmpeg.org
* libav libav.org
* GStreamer gstreamer.freedesktop.org
* xine www.xine-project.org

Exams finished!

Hi there!

I have just finished my exams, and so my academic year is now over. Time to start working full-time on my Google Summer of Code project. The purpose of this blog is to keep you posted on my progress. I will write it in English, so I apologize in advance for any errors that may slip into my sentences.

The goal of this project is to write a Gallium3D state tracker capable of hardware-accelerated VP8 video decoding through the VDPAU API. This would allow every graphics card with a Gallium3D driver (the primary targets are the r300g, r600g and Nouveau drivers) to decode VP8 videos, and every VDPAU-enabled multimedia application to play these videos.

Hardware acceleration will be built upon the graphics card’s shader units, which can take care of heavy computations like motion compensation, intra prediction, the iDCT and the deblocking filter.

More information about the project can be found on this page, and a brief presentation of myself and contact information can be found on that page.

Feel free to contact me at any time if you have any questions or comments about this work!