xpra and pixel formats

Pixels all the way down


Editor’s, who is also me, note: This was meant to be a somewhat organised overview of how xpra handles pixels, based on the dive into the source code I’ve done over the last day or so. Explaining things forces you to lay them out logically, which helps your own understanding. However, it basically morphed into “look at this stuff on pixel formats”. I’m leaving it up because, even if it doesn’t illuminate or educate, it’ll at least be a testament to the time I spent looking over the xpra/ffmpeg code and API.

tl;dr of the editor’s note: this is a bit confusing to follow, mostly because it’s a braindump for context.


I’ve been thoroughly nerdsniped by xpra. Tracking git head and compiling it myself makes for some interesting errors. For example:

2021-08-26 12:35:25,797 unknown output pixel format: GBRP9LE, expected BGR0 for 'BGRX'
2021-08-26 12:35:25,800 Error: decode failed on 42280 bytes of h264 data
2021-08-26 12:35:25,801  128x320 pixels using avcodec
2021-08-26 12:35:25,801  frame options:
2021-08-26 12:35:25,801    frame=0
2021-08-26 12:35:25,801    pts=0
2021-08-26 12:35:25,802    csc=BGRX
2021-08-26 12:35:25,802    type=IDR
2021-08-26 12:35:25,803    window-size=[1604, 921]
2021-08-26 12:35:25,803    encoding=h264

I’ve been traversing the source code to figure out the workflow. I’ve not used Cython myself, but the underlying Python is pretty understandable, at least in isolation. Join me as we go down the rabbit hole…

Let’s do some basic debugging backwards from the site of the error:

    self.actual_pix_fmt = av_frame.format
    if self.actual_pix_fmt not in ENUM_TO_FORMAT:
        av_frame_unref(av_frame)
        av_frame_free(&av_frame)
        log.error("unknown output pixel format: %s, expected %s for '%s'",
                  FORMAT_TO_STR.get(self.actual_pix_fmt, self.actual_pix_fmt),
                  FORMAT_TO_STR.get(self.pix_fmt, self.pix_fmt),
                  self.colorspace)

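So the error gets tripped because self.actual_pix_fmt (which is just av_frame.format) is not in ENUM_TO_FORMAT. In other words, the decoder handed back a frame in a pixel format that xpra has a name for (GBRP9LE) but does not know how to handle.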
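To make that check concrete, here is a minimal Python sketch of the same lookup logic. The two dictionaries and their integer keys are made up for illustration (the real values come from libavutil’s AVPixelFormat enum and xpra’s own tables), and check_output_format is a hypothetical helper, not something in xpra.

```python
# Hypothetical enum values, for illustration only.
ENUM_TO_FORMAT = {1: "YUV420P", 2: "BGR0"}                 # formats we can handle
FORMAT_TO_STR = {1: "YUV420P", 2: "BGR0", 3: "GBRP9LE"}    # names used for logging


def check_output_format(actual_pix_fmt, expected_pix_fmt, colorspace):
    """Return the format name, or raise if the decoder produced an unhandled format."""
    if actual_pix_fmt not in ENUM_TO_FORMAT:
        # Fall back to the raw number if we do not even have a name for it.
        raise ValueError("unknown output pixel format: %s, expected %s for '%s'" % (
            FORMAT_TO_STR.get(actual_pix_fmt, actual_pix_fmt),
            FORMAT_TO_STR.get(expected_pix_fmt, expected_pix_fmt),
            colorspace))
    return ENUM_TO_FORMAT[actual_pix_fmt]
```

Calling check_output_format(3, 2, "BGRX") with these made-up tables raises an error with the same wording as the log line at the top of the post.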

Pixel Formats

You might be wondering: “hang on, what is a pixel format? I know a pixel is a dot on the screen, and I know you can format things, but I don’t see how you can format a pixel”.

The Wikipedia explanation is a reasonable place to start:

pixel format refers to the format in which the image data output by a digital camera is represented. 

That doesn’t actually tell us terribly much, though, so let’s break it down further. A digital image can be thought of as a (typically) rectangular set of pixels. Being a rectangle, it carries spatial information: you can think of rows and columns of pixels, or x and y, etc.

Now let’s think about the pixels themselves. You want to know what a pixel looks like. Say you were only interested in representing a monochrome, pure black and white image. You might have a grid of 6 x 6 pixels that look something like:

0 0 0 0 0 0
0 1 0 0 1 0
0 0 0 0 0 0
0 1 0 0 1 0
0 0 1 1 0 0
0 0 0 0 0 0 

which would give a pretty small image.

[Figure: the 6 x 6 image at actual size, then scaled up. Caption: It’s smiling because it understands pixels]
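As a rough sketch of what “scaled up” means, the grid above can be enlarged by repeating every pixel in both directions (nearest-neighbour scaling). The function names here are my own, and I’m assuming 1 means “pixel set”.

```python
GRID = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def scale_up(grid, factor):
    """Nearest-neighbour upscale: repeat each pixel `factor` times in x and y."""
    scaled = []
    for row in grid:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        # Repeat the widened row `factor` times (copying so rows are independent).
        scaled.extend(list(wide_row) for _ in range(factor))
    return scaled


def render(grid):
    """Draw the grid as text: '#' for a set pixel, '.' for an unset one."""
    return "\n".join("".join("#" if p else "." for p in row) for row in grid)


big = scale_up(GRID, 2)
print(render(big))
```

With a factor of 2 the 6 x 6 grid becomes 12 x 12, and each original pixel turns into a 2 x 2 block.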

Of course, real images have more than simple black and white! You might recall that red, green and blue are often mixed to give a colour. For example, 200 red, 200 green and 50 blue gives a yellowish colour. That is 3 channels: R, G and B. There’s also luminance (Y’) and chroma (UV) for a different 3-channel representation. A fourth channel, alpha (A), is sometimes needed for transparency.
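To make the channel idea a bit more tangible, here is a small sketch of how the same pixel ends up as different bytes depending on the format. pack_pixel and to_planes are hypothetical helpers of my own; the only assumptions are that each channel is 8 bits, that “X” is an unused padding byte (written as 0 here), and that planar GBR stores its planes in G, B, R order.

```python
def pack_pixel(r, g, b, order):
    """Pack one pixel into bytes, following the channel order named by `order`.

    'X' is a padding byte (arbitrarily 0 here), 'A' is a fully opaque alpha.
    """
    channels = {"R": r, "G": g, "B": b, "X": 0, "A": 255}
    return bytes(channels[c] for c in order)


def to_planes(pixels):
    """Split a list of (r, g, b) pixels into three planes, in G, B, R order."""
    g_plane = bytes(p[1] for p in pixels)
    b_plane = bytes(p[2] for p in pixels)
    r_plane = bytes(p[0] for p in pixels)
    return g_plane, b_plane, r_plane


pixel = (200, 200, 50)
rgb = pack_pixel(*pixel, "RGB")     # 3 bytes per pixel, packed
bgrx = pack_pixel(*pixel, "BGRX")   # 4 bytes per pixel, packed
planes = to_planes([pixel, (1, 2, 3)])
```

Packed formats interleave the channels pixel by pixel, while planar formats keep one array per channel. That distinction is what the “P” in names like GBRP refers to.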

The ffmpeg wiki has more information and explanations on colorspaces and their representation.

Why Doesn’t

Overview of the Video Pipeline

NB: this is my working understanding, which is necessarily simplified and potentially incomplete! For example, I will only refer to a single window/application here.

The problematic pathway here is the server encoding video and how the client now decodes that as of ffmpeg 4.4.

Encoding

xpra has a bunch of different options for encoding: as an image, as video, using hardware devices, etc. The video path goes through do_video_encode:

    def do_video_encode(self, encoding, image, options : dict):
        """
            This method is used by make_data_packet to encode frames using video encoders.
            Video encoders only deal with fixed dimensions,
            so we must clean and reinitialize the encoder if the window dimensions
            has changed.

            Runs in the 'encode' thread.
        """

        # *** SNIPPED ***

        if SAVE_VIDEO_FRAMES:
            from xpra.os_util import memoryview_to_bytes
            from PIL import Image
            img_data = image.get_pixels()
            rgb_format = image.get_pixel_format() #ie: BGRA
            rgba_format = rgb_format.replace("BGRX", "BGRA")
            img = Image.frombuffer("RGBA", (w, h), memoryview_to_bytes(img_data), "raw", rgba_format, stride)

Decoding

I’ve focused more on the decoding side as that’s where the error crops up.

file xpra/codecs/dec_avcodec2/decoder.pyx:

    with nogil:
        avpkt = av_packet_alloc()
        avpkt.data = <uint8_t *> (padded_buf)
        avpkt.size = buf_len
        ret = avcodec_send_packet(self.codec_ctx, avpkt)

Let’s look over at avcodec_send_packet’s interface:

/**
 * Supply raw packet data as input to a decoder.
 * @param avctx codec context
 * @param[in] avpkt The input AVPacket. Usually, this will be a single video
 *                  frame, or several complete audio frames.
 *                  Ownership of the packet remains with the caller, and the
 *                  decoder will not write to the packet. The decoder may create
 *                  a reference to the packet data (or copy it if the packet is
 *                  not reference-counted).
 *                  Unlike with older APIs, the packet is always fully consumed,
 *                  and if it contains multiple frames (e.g. some audio codecs),
 *                  will require you to call avcodec_receive_frame() multiple
 *                  times afterwards before you can send a new packet.
 *                  It can be NULL (or an AVPacket with data set to NULL and
 *                  size set to 0); in this case, it is considered a flush
 *                  packet, which signals the end of the stream. Sending the
 *                  first flush packet will return success. Subsequent ones are
 *                  unnecessary and will return AVERROR_EOF. If the decoder
 *                  still has frames buffered, it will return them after sending
 *                  a flush packet.
 *
 * @return 0 on success, otherwise negative error code:

That’s a good explanation for what xpra is doing in the Cython code above. There’s a higher-level explanation further up in the interface definition:

/**
 * @ingroup libavc
 * @defgroup lavc_encdec send/receive encoding and decoding API overview
 * @{
 *
 * The avcodec_send_packet()/avcodec_receive_frame()/avcodec_send_frame()/
 * avcodec_receive_packet() functions provide an encode/decode API, which
 * decouples input and output.

So you send the codec (decoder or encoder) data using avcodec_send_* (packet or frame), and receive data back via avcodec_receive_* (frame or packet), with the return code telling you whether you need to send more data, read more data, etc.

Basically, for encoding:
– send frames, receive packets (which can then be written/transmitted/etc)
while for decoding:
– send packets, receive frames (which can then be displayed)
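The decode half of that can be sketched with a toy in pure Python. To be clear, this is not libavcodec: ToyDecoder, Again and decode_all are all made up, and the “decoder” just holds back one packet to imitate the buffering a real decoder does. The real API also reports end-of-stream separately (AVERROR_EOF), which this toy glosses over.

```python
from collections import deque


class Again(Exception):
    """Stand-in for AVERROR(EAGAIN): no output until more input is sent."""


class ToyDecoder:
    """Hypothetical decoder that holds back `delay` packets before emitting frames."""

    def __init__(self, delay=1):
        self.delay = delay
        self.pending = deque()
        self.flushing = False

    def send_packet(self, packet):
        if packet is None:
            self.flushing = True      # flush packet: end of stream
        else:
            self.pending.append(packet)

    def receive_frame(self):
        # While flushing, drain everything; otherwise keep `delay` packets buffered.
        threshold = 0 if self.flushing else self.delay
        if len(self.pending) <= threshold:
            raise Again()
        return "frame(%s)" % self.pending.popleft()


def decode_all(decoder, packets):
    """Send each packet, then receive frames until the decoder asks for more input."""
    frames = []
    for packet in list(packets) + [None]:   # the trailing None is the flush packet
        decoder.send_packet(packet)
        while True:
            try:
                frames.append(decoder.receive_frame())
            except Again:
                break
    return frames


frames = decode_all(ToyDecoder(), ["p0", "p1", "p2"])
```

The point is the shape of the loop: input and output are decoupled, so one send may yield zero, one or several frames, and the last frames only come out after the flush.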

It’s a good interface.

Choices

The server has to capture the pixels of the application in order to encode them. They arrive as BGRX (BGR0 in ffmpeg’s naming), a 32 bits-per-pixel format (per pixfmt.h), so it makes sense to encode them as they are. The pixels get sent to x264 for encoding, then transmitted to the client as a data packet. On the client side, the packet is passed to ffmpeg (via libavcodec) for decoding.

The behaviour has changed: instead of decoding to 8-bit GBRP (8 bits per channel, 24 bits per pixel), it now decodes to GBRP9LE (“planar GBR 4:4:4 27bpp, little-endian”, also from pixfmt.h), which carries 9 bits per channel. This seems to be to avoid loss of precision.
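The “bpp” figures in pixfmt.h are easy to sanity-check with a little arithmetic. This sketch assumes three full-resolution planes (that is what 4:4:4 means), samples rounded up to whole bytes in memory, and no row padding; planar_444_sizes is a name I made up. The 128x320 size is taken from the error log at the top.

```python
def planar_444_sizes(width, height, bits_per_sample, planes=3):
    """Return (nominal bits per pixel, bytes per stored sample, total frame bytes)."""
    nominal_bpp = planes * bits_per_sample
    # Samples wider than 8 bits are stored in 16-bit words.
    bytes_per_sample = (bits_per_sample + 7) // 8
    frame_bytes = width * height * planes * bytes_per_sample
    return nominal_bpp, bytes_per_sample, frame_bytes


gbrp_8bit = planar_444_sizes(128, 320, 8)
gbrp_9bit = planar_444_sizes(128, 320, 9)
```

So the 9-bit format is nominally 27 bits per pixel, but because each 9-bit sample occupies two bytes, a frame takes twice the memory of its 8-bit counterpart.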

“But can’t you set the desired pixel format when decoding?”, you might ask. Good question! (It’s the one I asked.) But as avcodec.h says:

 /**
     * Pixel format, see AV_PIX_FMT_xxx.
     * May be set by the demuxer if known from headers.
     * May be overridden by the decoder if it knows better.
     *
     * @note This field may not match the value of the last
     * AVFrame output by avcodec_receive_frame() due frame
     * reordering.
     *
     * - encoding: Set by user.
     * - decoding: Set by user if known, overridden by libavcodec while
     *             parsing the data.
     */
    enum AVPixelFormat pix_fmt;

(thanks to JEEB in #ffmpeg for pointing this out)

libavcodec overrides the pixel format when decoding! D’oh.

I’ve been diving around the ffmpeg source code since then to see if there’s a flag that could be set to tell the h264 decoder to ignore loss of precision (or to force the output pixel format), but if there is, I have yet to find it.

The other option on the client side would be to accept decoding to GBRP9LE, and then do a colourspace conversion (csc) to a format that can be used for drawing (painting) a window.
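As an illustration of what such a conversion involves, here is a naive pure-Python sketch that turns GBRP9LE planes into packed BGRX bytes. It is only meant to show the idea: in practice you would hand this to a proper csc module (libswscale, for instance) rather than loop over pixels in Python. I’m assuming planes in G, B, R order, one little-endian 16-bit word per sample, no row padding, and that simply dropping the lowest bit is an acceptable way to get from 9 bits to 8. The function name is my own.

```python
import struct


def gbrp9le_to_bgrx(g_plane, b_plane, r_plane, width, height):
    """Convert three 9-bit planar planes to packed 8-bit BGRX bytes."""
    count = width * height
    fmt = "<%dH" % count            # `count` little-endian unsigned 16-bit words
    g = struct.unpack(fmt, g_plane)
    b = struct.unpack(fmt, b_plane)
    r = struct.unpack(fmt, r_plane)
    out = bytearray(count * 4)
    for i in range(count):
        # Keep the 9 significant bits, then drop the lowest one: 0..511 -> 0..255.
        out[4 * i] = (b[i] & 0x1FF) >> 1
        out[4 * i + 1] = (g[i] & 0x1FF) >> 1
        out[4 * i + 2] = (r[i] & 0x1FF) >> 1
        out[4 * i + 3] = 0          # X: unused padding byte
    return bytes(out)


# One pixel: 9-bit values 400, 400, 100 correspond to 8-bit 200, 200, 50.
one_pixel = gbrp9le_to_bgrx(
    struct.pack("<H", 400),   # G
    struct.pack("<H", 100),   # B
    struct.pack("<H", 400),   # R
    1, 1)
```

Dropping a bit throws away exactly the extra precision the decoder was trying to preserve, but that is fine if the target is an 8-bit-per-channel window anyway.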

“What if you changed the encoding format on the server side?”, you may also ask. That could be an option, but to keep things fast it makes sense to shove the pixel data at the encoder in whatever format it is acquired, rather than changing the format or colorspace first. It’s better to handle these things on the client side, where there is more likely to be faster or dedicated hardware for doing conversions.
