Wednesday, November 29, 2017

VP9 ELI5 Part 6

Now it's time to create the actual image data. We're at the section labeled decode_tiles(sz - headerBytes), near the bottom. These functions are deeply nested, but by the end it will only be 4 levels so don't worry if it seems like each function leads to something else.


Here's what decode_tiles() looks like.


Now we need to define some state variables. Open Notepad++ and begin by adding the lines tileCols, tileRows, MiRowStart, MiRowEnd, MiColStart, and MiColEnd. Save it as state.txt.


There are a few more to add. Just add the name (ie MiCols) and the end value, not the formulas. For example, you should have "MiCols 120".

     MiCols = (FrameWidth + 7) >> 3 = 120
     MiRows = (FrameHeight + 7) >> 3 = 68
     Sb64Cols = (MiCols + 7) >> 3 = 15
     Sb64Rows = (MiRows + 7) >> 3 = 9

We'll need to keep this updated as we go along because some future functions depend on these values and it's easy to find them at the beginning.

Now let's fill in the values. We'll start with tileCols and tileRows. They depend on log2 functions. For our image, we opted (in the uncompressed header) to leave the log2 values at 0 because 2 ^ 0 = 1 and we only want 1 tile. So, tileCols = 1 << 0 (1 left-shift 0), which equals 1. tileRows works the same way so it will also equal 1. Therefore, we have 1 tile column and 1 tile row for a total of 1 tile. Fill in tileCols and tileRows in the text file.

Do you see the nested FOR loops, for (tileRow = 0... and for (tileCol = 0...? Since we only have 1 tileCol and 1 tileRow, these will execute only once. Therefore, we can pretend these loops don't exist and just perform everything once.

The first thing being done is the statement

     lastTile = (tileRow = tileRows - 1) AND (tileCol = tileCols - 1)
     = (0 = 0) AND (0 = 0)
     = (TRUE) AND (TRUE)

Both conditions are met, so lastTile will be TRUE. Now we look at the IF block, if (lastTile). Since lastTile is TRUE, all we have to do is tile_size = sz. But don't worry about that statement. We can't know it right now anyway.

Skipping the ELSE block, we have the MiRow/MiCol statements. Let's start evaluating these. tileRow and tileCol will be 0.

     MiRowStart = get_tile_offset( tileRow, MiRows, tile_rows_log2 )
     MiRowEnd = get_tile_offset( tileRow + 1, MiRows, tile_rows_log2 )
     MiColStart = get_tile_offset( tileCol, MiCols, tile_cols_log2 )
     MiColEnd = get_tile_offset( tileCol + 1, MiCols, tile_cols_log2 )

Here's what get_tile_offset() looks like.


     MiRowStart = 0
     MiRowEnd = 68
     MiColStart = 0
     MiColEnd = 120

Copy and paste these values into your text file. You'll need them later on.

The next statement is init_bool(tile_size), but since we're encoding we don't worry about that. That's for decoders to do. What we will do, however, is start our boolean encoder once we have some values to write.

Finally, the exciting part: decode_tile(). This is where our image data will go.

decode_tile()


To decode the single tile in the image, we have 2 nested FOR loops. The outer loop iterates through Rows, and the inner loop iterates through Columns. In X/Y terms, the row is Y and the column is X. So, you can see that the outer loop runs through Y while the inner loop runs through X. These FOR loops increment in steps of 8 because the Mi series of variables represents 8-pixel increments. Incrementing these by 8 equates to incrementing by 64, the size of a superblock.

Inside the nested loops, we have clear_left_context(). This just says that every time we finish a horizontal line and jump back to the left, we have to forget everything about the previous line's left side. Notice that it's only inside the outer loop.

In the inner loop, we iterate through superblocks (64x64 blocks). That's what decode_partition(r, c, BLOCK_64X64) does. Recall at the beginning how I said that VP9 images are broken into 64x64 superblocks. This is where that happens. Basically, this loop goes left-to-right from the top to the bottom, processing superblocks.

So for the first superblock, we'll be executing the statement decode_partition(0, 0, BLOCK_64X64).


These decoding functions are already nested pretty deep. Later I'll draw a diagram to show which functions we're running.

We can forget the first IF statement since it only applies to right or bottom edges of the image.

Add the following variables to your text file:
num8x8 = 8
halfBlock8x8 = 4
hasRows = true
hasCols = true

Now here's a hard part. partition is a Tree, a binary tree specifically. A video on these is shown below. We need to encode a value that says we want PARTITION_SPLIT so we can split the superblock into 4 32x32 blocks. We could say PARTITION_NONE and it would still yield 4 32x32 blocks, but then they would share a prediction mode (smudge function) and for better compression, we want each block to specify its own prediction mode. Using PARTITION_SPLIT lets us choose on a per-block basis.




I had trouble with this part, but fortunately Ronald Bultje explained it to me and I drew a diagram of what he described. Here's my drawing of what the partition binary tree should look like.



We want Split, which breaks the block into 4 equal parts. In this case, that would break the 64x64 superblock into 4 32x32 blocks. That means we need to write 111 to the bitstream (using the bool-coder, of course).

(Advanced theory, not necessary to continue)
Credit: Ronald Bultje, in a chat with me on Nov 29, 2017


Here's the decoding side of this tree. I show it so we can work it backwards to encode.


The T[] array is the tree, which according to the spec is an array of integers. In this case it's partition_tree[] which may be found in the spec PDF.

Notice that read_bool() expects a probability, P(n >> 1). The >> is a logical shift of 1 bit, to the right. To find the probability to use when writing, we need to first run a complicated unnamed function (below).


To make a long story short, ctx will equal 15 at the end.

When the tree decoding function (9.3.3 Tree decoding process, above) first runs, n = 0. That means read_bool() will use P(0 >> 1) or just P(0) as the probability. Let's derive P(0).


hasRows = 1 (true) and hasCols = 1 (true). FrameIsIntra = 1 (true). This means we'll get our probability from partition_probs[ctx][node2]. We know that ctx is 15 and node2 = node = 0. So, we will retrieve the value partition_probs[15][0].

But there's a problem here. There's no such table as partition_probs in the VP9 spec PDF. How will we get the number we need? It turns out that we will use default_partition_probs[15][0], since we decided before that we would just use default probabilities.

default_partition_probs[15][0] = 10. So, our function will be read_bool(10). This means we need to write a 1 to the bitstream with a probability of 10. Let's keep a running tally of our bits and their probabilities, since we can't stop to write them now.

Running tally of bits
1, p = 10

Because we wrote a 1, read_bool() will return 1 when the video is decoded. Now refer to the binary tree shown above. Since we wrote a 1, we have a choice to write either 0 for a horizontal split (2 64x32's) or write another 1. We write another 1, and notice that we have to write just one more 1 to get Split.

Here are the passes the function will go through on the decode side:
  1. n starts at 0. At the end, n = 2, which equals T[0 + read_bool(P(0 >> 1))] = T[1]
  2. n starts at 2. At the end, n = 4, which equals T[2 + read_bool(P(2 >> 1))] = T[3]
  3. n starts at 4. At the end, n = 5, which equals T[4 + read_bool(P(4 >> 1))] = T[5]
Since T[5] = -PARTITION_SPLIT (negative PARTITION_SPLIT) and the loop condition is that n is above 0, we exit the loop and return PARTITION_SPLIT without the negative sign. This means the decode_partition() function (section 6.4.3, shown near the beginning) now knows that we want the current superblock to be broken into 4 32x32 blocks. The partition variable in that function will equal PARTITION_SPLIT.

Before we start writing more bits, we need to know the probabilities to use. Notice that, during our tree loop, P(n >> 1) was changing. First it was P(0), then P(1), then P(2). We know that default_partition_probs[15][0] = 10, but what about default_partition_probs[15][1] and default_partition_probs[15][2]? Well, those are 7 and 6, respectively. The designers knew that the chance of a 0 diminishes as we go further down the tree, so they set the default probabilities accordingly.

Now that we have our probabilities, let's add the bits to our list.



Running tally of bits
1, p = 10
1, p = 7
1, p = 6

That was quite a lot. So where are we? Now that we know how to tell the video player that we want 4 32x32 blocks, we can go back to section 6.4.3, decode_partition().

Just under partition is subsize = subsize_lookup[ partition][ bsize ]. We know partition now, so we can plug it in. bsize is BLOCK_64X64. Basically, subsize_lookup lets us plug in a partition type along with our current block size and see how big the resulting block(s) would be. In our case, subsize will be 9 which corresponds to BLOCK_32X32.

Now we have a bunch of IF statements to go through. The first one checks if subsize is less than 8x8 or equal to 64x64. Neither of those is the case, so we move on to check if it's a horizontal split (64x32 in our case) and then we check if it's a vertical split (32x64 in our case). The ELSE block at the end of the IF statement reflects that if none of those conditions were met, which they weren't, then the only other possibility is a PARTITION_SPLIT.

Notice that inside the ELSE block, we have 4 decode_partition() statements. These are processed in raster order within the superblock. This means that superblocks are processed in raster order, and blocks within superblocks are processed the same way.

We're done here since we're calling decode_partition() again. When a function calls itself, as decode_partition() is doing right now, it's called recursion. In this case we have to recursively call ourselves to decode each of the 4 32x32's. Now would be a good time for a diagram of where we are.


Saturday, October 7, 2017

VP9 ELI5 Part 5

This is a challenging chapter. If anything is unclear, please leave a comment. Having Q/A under the post will benefit everyone.

Now that we've completed the uncompressed header, we need to create the compressed header. Once we create it, we can find its length (in bytes) and fill in the final field in the uncompressed header that we couldn't fill during the previous chapter.

You may be wondering how compression works. There are many types of compression and they fall into 2 categories: lossy and lossless. MP3 sound and MP4 video are lossy compressors (they lose some quality) while Zip files are lossless (they don't lose anything). VP9 overall is a lossy compressor, but it includes a lossless compression step. That's what the compressed header is compressed with.

So how do we compress this header data without losing anything? We're going to use a very clever system called a range encoder. In this case it's a binary arithmetic coder because we are writing 0's and 1's.

The range encoder depends on probabilities. Don't worry, this isn't going to be hard like Prob and Stats in college. The basic rule is that a probability is a number anywhere from 0 to 1. To find a probability, we use the following formula:

P will be between 0 and 1, inclusive.

0 means guaranteed never to happen, and 1 means guaranteed to happen. Anything in between reflects a chance that something will happen.

For example, let's find the probability of getting heads when flipping a coin. You have 1 (the number of "heads" sides on a coin) divided by 2 (the number of sides on a coin). Your probability will be 1/2 or 0.5 because either side is equally likely to occur.

For another example, let's say you're watching a game show and the host spins a wheel to see if a player won a prize. If there are 50 places on the spinner and only 1 has the prize, then you have a probability of 1/50 or 0.02.

But what event are we expecting when compressing VP9? In other words, what event do we want to find the chance of? It's quite simple. We want to know the chance that the next bit in the bitstream will be 0.

Once we know that, we multiply our probability value (P, between 0 and 1) by 256 and round to the nearest whole number to get our VP9 probability. So if we have a 1 out of 50 chance (shown above) of encountering a 0 in the bitstream then P would be equal to 0.02. We multiply 0.02 by 256 to get 5.12 and round to 5. 5 is the probability we'd use to compress.

Why do I need to know the chance of finding a 0?

It's because the compressor takes two inputs: bits, and the overall chance that a bit will be 0. The output is a string of bytes that we will put in a hex editor. There are YouTube videos you can watch (one is shown below) explaining exactly why, but for simplicity's sake just accept that if you know how many 0's to expect, you can write less (a LOT less) data. What kind of "a LOT less", you ask? Like going from 100 MiB to 15 MiB without losing anything.

(Advanced theory, not necessary to continue)


Of course, this only works if the decompressor knows the probability of finding a 0, so we tell the video player beforehand. That's what the compressed header is mainly about: telling the player what the chances are of finding a 0 in the bitstream.

The decompression process is pretty straightforward, but I couldn't figure out how to reverse it and compress. Fortunately, I don't have to understand exactly how to compress and neither do you because there is free source code that does it. The license is very permissive so you can even use it in your own programs.

Because we will be using someone else's algorithm for the range compressor, all we need to worry about is what data to write, and the probability that a bit will be 0.

And, to make things easier, a lot of the time we will just pretend like we're flipping a coin by setting the probability to 0.5. That means the VP9 probability would be 128. A lot of bits in VP9 are compressed using a probability of 128. There is no compression benefit when the chances of 0 and 1 are the same, but they still do it in some parts of VP9.

[Just a reference; not necessary in this chapter] The source code we need is located at https://github.com/webmproject/libvpx/blob/master/vpx_dsp/. We are interested in the files bitwriter.c and bitwriter.h. However, at this stage it's just for reference. I've adapted the C++ code to Java and will show you the finished output. Later you will need to either copy or adapt the code so you can compress image data in bulk.

Here is the Java code we'll start with.


As you can see, we need to insert some code in the middle. See the line that says "vpx_write_bit(br, 0);"? We need to add some more lines like that. When finished, this code will compress the bits we provide into a string of bytes that it will print in the console at the bottom. When we're done, we'll enter those bytes into a hex editor.

Now that we know how to compress, we need to look at what to compress.

This process will be similar to what we did for the uncompressed header. I'll go bit-by-bit and explain everything. Here is what we need to encode for the compressed header.

(Taken from the VP9 Spec PDF)

We actually don't have much to encode here. We only have 3 sections: read_tx_mode, read_coef_probs, and read_skip_prob. The vast majority of what remains is for non-intra frames, which doesn't apply to what we're encoding since this is an intra frame.

So this is all we actually have to write:



First we'll collect our bits in a list like in the previous chapter. Then I'll show you how to compress them.

The idea here is that VP9 has default probabilities for everything. To avoid the complexity of having to find our probabilities and write them here, we'll just write a header that says we'll only be using default probabilities.

read_tx_mode: (All probabilities = 128)
      tx_mode: 2 bits (11) for ALLOW_32X32
      tx_mode_select: 1 bit (0)

read_coef_probs: (All probabilities = 128)
      update_probs: 1 bit (0)
      Do not update any probabilities for 4x4 transform coefficients

      update_probs: 1 bit (0)
      Do not update any probabilities for 8x8 transform coefficients

      update_probs: 1 bit (0)
      Do not update any probabilities for 16x16 transform coefficients

      update_probs: 1 bit (0)
      Do not update any probabilities for 32x32 transform coefficients

read_skip_prob: (All probabilities = 252)
These bits are known to be 0 most of the time, so the designers chose 252/256 as the probability of finding a 0.

      Change the skip prob for context 0? 1 bit (0)
      We do not want to change any probabilities, so this is 0.

      Change the skip prob for context 1? 1 bit (0)
      We do not want to change any probabilities, so this is 0.

      Change the skip prob for context 2? 1 bit (0)
      We do not want to change any probabilities, so this is 0.

Writing the code

Let's put what we've learned into action and get our VP9 bitstream moving forward again.


We have a function called vpx_write_bit(). That writes a bit to the compressed bitstream with a probability of 128. But we don't just want to write single bits; we have a multi-bit number called tx_mode at the top. How do we write that? We use a function called vpx_write_literal(). In VP9, a literal is a multi-bit number that is written with equal probability of finding a 0 or 1. The individual bits are simply written from left (most significant) to right (least significant). For example, a literal of 100 (binary) would be written as a 1 and then two 0's, in that order.

You may be wondering why there's already a vpx_write_bit(br, 0); line. Why do we start by writing a 0? Because the player expects to see a 0 at the beginning of any compressed section. If it doesn't, it will consider the bitstream invalid. The first 0 is discarded when playing the video.

Now that we've established that, we need to start writing the useful data. Let's add a line to our code:

      vpx_write_literal(br, 3, 2); //ALLOW_32X32

The br is the bitwriter object, 3 is the decimal equivalent of 11 binary, and 2 means we want 2 bits.

Now we need to write some single bits.

vpx_write_bit(br, 0); //tx_mode_select
vpx_write_bit(br, 0); // read_coef_probs 4x4
vpx_write_bit(br, 0); // read_coef_probs 8x8
vpx_write_bit(br, 0); // read_coef_probs 16x16
vpx_write_bit(br, 0); // read_coef_probs 32x32
vpx_write(br, 0, 252); //read_skip_prob, context 0
vpx_write(br, 0, 252); //read_skip_prob, context 1
vpx_write(br, 0, 252); //read_skip_prob, context 2


Here's what the code should look like at the end:



Notice that we have vpx_stop_encode(br) at the end. Throughout the compression process, we are modifying a string of bytes and we need to cleanly end the compression process before we can output them.

Run the program and it will give the following output:


The single output byte, 96, is a decimal value. In hex it's 0x60. We need to ensure that a video player won't run out of bytes when it's reading the compressed header, so we'll add 1 padding byte, a value of 0, just to be safe. That would be 96 0 (decimal) or 0x60 0x00 (hex).

That means that the length of our compressed header is 2 bytes. Now we can go back and fill in the header_size_in_bytes field we skipped in the previous chapter.

Start your hex editor and open uncompressed_header, the file we saved in the last chapter.



We need to change the last 2 bytes from 00 00 to 00 02. Since the first byte was already 00, I only had to change the second one (shown in red).



Save the file. Your uncompressed header is now complete.

Compressed header in hex

This part is very easy. Create a new file in the hex editor and enter 60 00 in the hex area on the left (not the text area on the right). Make sure you're in hex mode (you will be by default; you'd know if you weren't)

You should have an empty hex file and have the text/selection cursor at the very beginning as shown.


Enter 60 00 in the hex area on the left.


Now save it as compressed_header.

I recommend moving your headers into a folder called VP9 Files.



You now have valid headers for a VP9 intra frame. The next step is to encode the image data.

Wednesday, October 4, 2017

VP9 ELI5 Part 4

Okay, so now that we've examined what RAW video data is composed of, we can get to work creating a valid VP9 video file.

For this step, we'll need to look at the VP9 bitstream specification PDF. Don't worry, I'll walk you through the whole thing. But first, read the section on video containers.

Video Containers

Once we have a valid VP9 frame we will need a container to hold our finished VP9 video. A container is simply a file format that contains information for a video player like Media Player Classic to find individual frames in a file that would otherwise have been RAW.

Good container examples are *.MP4 and *.WAV files. MP4 isn't actually a video format, it's a file structure that can contain several different video formats, such as H.264. In a similar way, WAV isn't an audio format, it's simply a file that contains information about how to play RAW audio, usually LPCM.

We will be using an *.IVF (Indeo Video File) video container to hold our finished VP9 stream. It's really simply and easy to build in a hex editor.

Putting together a VP9 video

A standard VP9 bitstream is not a whole video as you might have assumed, but instead an individual frame. Think of a container (in our case, *.IVF file) as a ZIP archive with individual images. When encoding VP9, instead of compressing the entire video into one bitstream, we encode one frame, write it to the container file, and repeat until the end.

So a VP9 bitstream is just one frame. Here is how it is structured:

Adapted from the VP9 Bitstream Specification PDF.

We start with the uncompressed header (easy to generate), then write the compressed header, and finally start writing tile (image) data.

Advanced info: For simplicity and coding efficiency, we will only be dealing with 1 tile in our VP9 frames.

Here's the structure of a VP9 frame in text form. Don't worry that this doesn't make sense yet. We'll go bit-by-bit (literally).


Let's open Notepad++ and get started hand-crafting our bitstream.

We'll skip the first line startBitPos = get_position() and move on to uncompressed_header().

The bitstream spec defines different data structures depending on certain video parameters, so we need to define our parameters beforehand. We know it's a 960x540 image, but what Profile will the video be? VP9 supports 4 profiles, 0-3. Profile 0 is the default that YouTube videos are saved as. It can only carry 8-bit YUV 4:2:0. Profile 2 (the third profile) supports 10-bit pixels with YUV 4:2:0 color, so we'll be using Profile 2.

Here are the parameters we need to write, in order, in our case.

Uncompressed Header


frame_marker: 2 bits (1 0)
Always the case

profile_low_bit: 1 bit (0)
profile_high_bit: 1 bit (1)

Since Profile is 2 and 2 in binary is 10, the high bit is 1 and the low bit is 0.

show_existing_frame: 1 bit (0)
We are not trying to show a previously-decoded frame

frame_type: 1 bit (0)
0 means key frame, or initial image

show_frame: 1 bit (1)
Yes, we want to show the frame

error_resilient_mode: 1 bit (1)
Enable error resilient mode

Since frame_type = 0 (keyframe),
      frame_sync_code: 24 bits (01001001 10000011 01000010)
      color_config:
            ten_or_twelve_bit: 1 bit (0)
            0 because this is 10-bit data.

            color_space: 3 bits (010)
            Color space equals 2 (binary 010) for BT.709

            color_range: 1 bit (0)
            This designates Studio Swing, the 16-235 and 64-940 range limit we discussed earlier.
            However, it is not enforced by the VP9 spec. It must be enforced by players.

      frame_size:
            frame_width_minus_1: 16 bits (0000001110111111)
            959 in binary form

            frame_height_minus_1: 16 bits (0000001000011011)
            539 in binary form

      render_size:
            render_and_frame_size_different: 1 bit (0)
            The render size will be the same as the frame size, 960x540

frame_context_idx: 2 bits (00)
Save our frame context into the 0 position
(not really important since we reset after each frame)

loop_filter_params:
      loop_filter_level: 6 bits (000011)
      Let's set this to 3 (11 binary)

      loop_filter_sharpness: 3 bits (011)
      Let's set this to 3 (11 binary)

      loop_filter_delta_enabled: 1 bit (0)
      Do not enable loop filter delta (being able to change the loop filter for
      different parts of the image)

quantization_params:
      base_q_idx: 8 bits (00000001)
      Base quantizer (divisor for shrinking large values later on)
      This provides the lossy part of VP9 compression. I've set it to 1 because
      this is more about understanding the format and creating a valid image than
      about compressing.

      delta_coded: 1 bit (0)
      No change to the quantizer required for the DC component of transforms of B/W image
      (transforms are covered later)

      delta_coded: 1 bit (0)
      No change to the quantizer required for the DC component of transforms of
      color add-on data

      delta_coded: 1 bit (0)
      No change to the quantizer required for the AC components of transforms of
      color add-on data

segmentation_params:
      segmentation_enabled: 1 bit (0)
      Segmentation is a really cool feature but for simplicity we won't be using it.

tile_info:
      increment_tile_cols_log2: 1 bit (0)
      Do not increment that variable. Keep it at 0 since 2 ^ 0 = 1 and we only want 1 tile column.

      tile_rows_log2: 1 bit (0)
      Do not increment that variable. Keep it at 0 since 2 ^ 0 = 1 and we only want 1 tile row.

header_size_in_bytes: 16 bits (Can't know it yet; must complete next chapter first)
We'll pretend it's 16 0's (0000000000000000) for now.


Putting it together


So if you enter all these bits in order into Notepad++, you should start to have something like this:

(Obviously I superimposed this blog entry over Notepad++; I did not somehow manage to type formatted text in there)

Putting spaces between the fields is optional. I did it for clarity.

Anyway, once you've entered all the bits into Notepad++ (including the 16 0's for the end data we don't know), then you will need to press Ctrl+H to get the Find/Replace dialog if you added spaces. Skip this paragraph if you didn't add spaces. Clear the search field and press the spacebar once. Clear the Replace With field and press Replace All. Now there shouldn't be any spaces in the file.

Now that the spaces are gone, look at the status bar at the bottom of the window. If you entered the bits properly and removed any spaces, it should say Length: 112.


112 is divisible by 8 so this will fit in 14 bytes (112 / 8) without a need for extra empty bits to make a whole byte.

Now you'll need to separate the string of bits into groups of 8. Just start at the beginning, use your left arrow key to advance 8 characters, press Enter, and repeat. I know this is tedious, but don't worry, we only do this once.

It should look like this when it's done.


Now you can either manually copy-paste this one-by-one into MS Calculator in Bin mode and convert to Dec (tedious), or you can copy-paste the whole list into the top box at https://www.textfixer.com/numbers/binary-to-decimal.php (recommended) and use that tool to get the whole list at once.


Now open a hex editor with decimal input support, such as HHD Hex Editor Neo. Create a new file, press Ctrl+1 (for Hex Editor Neo) to set it to decimal mode, click in the first blank entry in the large area on the left and start typing these numbers in order, pressing Enter after each one.


Save this file as "uncompressed_header". We will come back to it after we finish the next chapter.

Tuesday, October 3, 2017

VP9 ELI5 Part 3

For this part, I'll need to define how the input image data (what we're encoding) is stored.

Pixels and color storage methods

Up until now, I only said that image data is not RGB, but rather YUV. But that doesn't tell you anything about how it's stored on the computer. The image, as you know, is 960x540 pixels. That means 960 pixels wide by 540 pixels high. When it was taken by the camera and when it's shown on your screen, it is merely a 2-dimensional array of pixels. It was not subdivided because it's just a plain image, not yet encoded with VP9.

Let's zoom in so you can get an idea of the individual pixels in the B/W image.



If you use the eyedropper tool in MS Paint, you can pick up a color from an image. You can see just what that color's RGB value is by going into Edit Colors.



As you can see, white is 255, 255, 255. As we discussed earlier, pixel values in VP9 follow a similar system but with YUV. Shades of gray, which is what the B/W YUV image is made up of, can be made in RGB by setting all three numbers to the same value.

Most importantly, why do you think MS Paint is limited to 255? It's because colors on a computer are almost always 8 bits (bits means 0's and 1's) per channel. On a computer screen, channels are RGB, hence MS Paint's use of RGB instead of YUV.

If you have 8 bit fields capable of holding 0 or 1, you can think of it as a row of light switches. How many combinations of On and Off can you come up with by flipping 8 light switches? The answer is 2 raised to the 8th power, or 256. That's because you have 2 states (On/Off) and you have 8 places that can be set to either state. But if you have 10 light switches, that would be 2 to the 10th, or 1024.

You can't reach 256 with 8 bits because 256 is how many states you have. One of them is 0. It would be like having 60 minutes in an hour but ending each hour on a clock at 60 instead of 59.

In this tutorial we will be dealing with 10-bit pixels, capable of ranging in value from 0 to 1023 instead of the usual 8-bit values you may be familiar with that range from 0-255. This is because it's a proven fact that 10-bit video carries higher quality at a given bitrate than 8-bit video. I'm not joking when I say that a 10-bit 1080p video at 1 Mbps can actually look a lot better than an 8-bit 1080p video at 2 Mbps.

Understanding RAW image data

You may have heard of RAW images taken by expensive DSLR cameras, but that's not what we're talking about here. In video, RAW data is a bit different.

A common format for RAW video data is YUV420p. 8 bits per value is implied because a bit depth wasn't specified. You can export to this format using a tool called FFmpeg. I'll describe how YUV420p files are structured before we move on because although we won't be using YUV420p, we will be using YUV420p10LE which is quite similar but more complicated.

In a RAW YUV420p video file, each frame is stored one after the other with no separators. It is simply the Y (B/W image), U and V. Then it repeats for the next frame.

You're probably familiar with the fact that computer data is stored as bytes, and that bytes are 8 bits each. Therefore, a byte can represent any number from 0 to 255. In a readable text file, each letter or other character is simply a number (usually less than 128) that your text reader knows corresponds to a letter or character. But if you've ever opened an image or program in a text editor, you know that they contain weird random characters. That's because the bytes in the file are storing numbers that don't correspond to readable characters.

(A PNG file opened in Notepad)

In a YUV420p video, we first store the B/W image pixel-by-pixel in raster scan fashion, meaning left to right and top to bottom. Note that there is no subdivision here. The raster scan does the whole image line by line.


So you see, raster scanning begins at the absolute top-left pixel, scans to the right until the end, goes down one, jumps back to the left, and goes again.

[You don't need to know the following formulas unless you choose to follow along. If you're just reading then you'll be fine without them]

That will take up (width * height) bytes, since each pixel takes one byte. Then we need to store the U and V images. Well, since the width and height of those are 1/2 the dimensions of the B/W image, the data for U and V will each be ((width / 2) * (height / 2)) bytes. Since we must copy U and V like this, that will be 2 * ((width / 2) * (height / 2)) bytes.

So now we have (width * height) + (2 * ((width / 2) * (height / 2)) bytes of data for just one frame. With an image size of 960x540, that comes out to 777,600 bytes total.

So for each frame in our YUV420p RAW video, the first 518,400 bytes will be the B/W image, the next 129,600 bytes will be the U image, and the remaining 129,600 bytes will be the V image.

Understanding YUV420p10le RAW image data

YUV420p only handles values from 0 to 255 because it's an 8-bit format. Since we want to work with 10-bit image data, we need a way to store values from 0 to 1023. This is where YUV420p10le comes in.

Since files are stored as bytes and bytes are multiples of 8 bits, we can't easily store 10-bit values because we'd have a fraction of a byte left over after almost every 10-bit value we write. This is why YUV420p10le stores the 10-bit values inside a larger 16-bit value. Since 16-bit numbers can range from 0 to 65,535 it will easily store our 0-1023 values.

So now we have to store 2 bytes for each value (16 bits = 2 bytes). The file structure stays the same so our file size doubles to 1,555,200 bytes or 1.48 MiB.

By now you may have guessed that having a higher range of available values makes for a higher-quality image. In a normal PC image, for example, 8 bits per color channel times 3 channels (RGB) equals 24 bits per pixel overall, or 24.7 million different colors. But what if you only have 256 colors at your disposal? The image will look quite bad.

(The image in only 256 colors. Click to enlarge and see how bad it is)

Videos encoded with 8-bit YUV420p look great. In fact, it's the bit depth that all YouTube videos are encoded with, even the ones that claim to be HDR which is at least 10-bit. But 10-bit video is better because you can express a far greater range of light and dark. Converting from 8-bit to 10-bit increases the range by simply multiplying the values to bring them up to the 10-bit range, but there will not be any increase in actual quality because you can't get information that was never there to begin with.

If that sounds confusing, let's say you have a gradient of 8-bit pixels: {15, 16, 17, 18, 19}. Now if you wanted to convert them to 10-bit, you would multiply each entry by 4 to get {60, 64, 68, 72, 76}. But notice that you don't have any values in-between whole multiples of 4. If you had recorded the video with a 10-bit camera you could have any value in the 10-bit range (theoretically), but since you converted from 8-bit you will only have the 256 different 8-bit levels spread evenly within 1024 total levels.

When you convert an 8-bit video to 10-bit, this is all you're doing.

This means that in our case the only purpose of converting to 10-bit is so we can have 10-bit values for our VP9 encoder to work with.

Opening RAW videos in a hex editor

Let's see what RAW video data looks like in a hex editor. What you're looking at in these hex screenshots are the pixel values at the beginning of the file, the first few pixels in the top row of the B/W image. These are shades of gray expressed as numbers, but if that doesn't make sense then I'll try to explain.

Here is what the 8-bit image looks like in a hex editor:

On the left we have hex digits and on the right, bytes as they'd be seen in a text editor

And this is the image in 10-bit format:

In this 10-bit example, each pixel value takes up 2 bytes.

There is an annoying quirk you have to watch out for on multi-byte numbers. Most computers read and write them as Little Endian which means byte-reversed.

Let's take the example of 255 which is 0xFF in hex. Hex works by using 0-9 and A-F to make a base-16 number system. 2 hex digits make a byte.

If we have a hex byte of 0xFF (255 decimal) and add 1 to make 256, it would "overflow" and we would get 2 hex bytes: 0x01 0x00. On Windows Vista and up, you can try setting Windows calculator to Programmer and use the Hex mode to enter 0100, then click Dec and it will say 256.

So now that we've established that, let's say you wanted to record that number in a file. You know about place value in the decimal system (tens, hundreds, etc) so you'd expect to just write 0x0100 to your file, right? Well, it turns out that due to a computer quirk carried from the 1970's, the computer would prefer it if you did it in reverse as 0x00 0x01. This applies even to longer hex numbers like 0xDEADBEEF. Each pair of hex digits must be written in reverse order; do not write the single digits in reverse order. 0xDEADBEEF should become 0xEFBEADDE, not 0xFEEBDAED.

Since this weird reverse notation is called Little Endian, it makes sense that the normal order you'd expect to use is called Big Endian. Keep in mind that the bytes (pairs of hex digits) are what's reversed, never bits or single hex digits.

Little Endian is what the LE in YUV420pLE stands for.

So now let's set our hex editor to group by Words (Word = 16 bits).

By default, it's set to Little Endian so these hex bytes are reversed to Big Endian for proper viewing.

Let's press Ctrl+1 or go to View -> Display As -> Decimal. Now you can see that the values are all normal human-readable numbers and that they never exceed 940.


[Random trivia, not necessary to continue] Why don't the values exceed 940 when we were supposed to convert to a 0-1023 range? Because video of photographic subjects (ie not computer screen captures) customarily ranges not from 0-255 or 0-1023, but rather from 16-235 or from 64-940 (16-235 times 4).

Now you can see what I mean by shades of gray expressed as numbers.

VP9 ELI5 Part 2

Here's the photo split into its 3 YUV components:
The U and V are the color add-on data. Notice how they are 1/4 the size of the B/W image.

Most video formats, VP9 included, depend on both intra and inter frames. An intra frame is a normal picture while an inter frame is just the difference between the current frame and a previous one. Think of it as only saving what moved in the current frame instead of saving the whole picture again.

Obviously we need to start a video with an intra frame, also known as a key frame, so that we can build future inter frames based on it.

Unlike BMP's, we don't save VP9 images pixel-by-pixel in order. Instead, we divide the image on several levels. The first, highest level of division is into 64x64 superblocks.

64x64 is a very large subdivision size. In MP4 (H.264), the largest size available is 16x16. Notice the difference below:

16x16

64x64

The superblocks are processed in raster order. That means left to right, top to bottom. In other words, each horizontal line of superblocks is processed from left to right, in order from top to bottom.

Each superblock is then optionally subdivided further. Here are some possible subdivisions:



Notice what subdivisions are allowed. Shown from left to right, we can split a superblock:
  • Vertically to make 16x32,
  • Horizontally to make 32x16, or
  • Four ways to make 4 32x32 blocks

After that, we can split the 32x32 blocks the same way or we could leave them alone. However, we may not further split one of the non-square sizes. Any non-square split is final in VP9. Also note that 4x4 is technically as small as we can get but things get messy at that size so for this tutorial we will not go smaller than 8x8.

This is just the first level of subdivision. The next level is the transform level and is more interesting.

VP9 ELI5 Part 1

Welcome to my VP9 Explain Like I'm 5 series. I'm going to explain the VP9 codec so anyone can understand it and perhaps even create their own encoder.

What is VP9 and why should I care?

First things first, let's establish what VP9 is. It's a video codec created by Google and it's what almost all YouTube videos are saved as on the server. Previously, YouTube videos were all MP4 (specifically, H.264). There were two problems with that. The first is that it's patented and copyrighted, so you have to pay the MPEG group. Second, it takes a lot of data for any given quality level. Mobile consumers aren't the only ones who pay by the gigabyte for their data usage; businesses have to pay for that as well to serve people. Normally video has a tradeoff: you can have higher quality for more data usage, or less data usage with bad quality. VP9, amazingly enough, allows Google to pack higher quality into less data, achieving both benefits at once. Unfortunately, only Google and Netflix use VP9; everyone else still uses MP4.

Video definition

You probably already know it's a series of still pictures called frames. If you took pictures really fast with your still picture camera, you could play them quickly to see a crude video. Video is merely a set of still pictures played one after the other really fast.

What are these pictures composed of?

You may be thinking, "A picture is simply dots (pixels). Each one has an RGB (red, green, blue) color value. I know because I've done pixel art in MS Paint." Well, I've got to tell you, that's not how images work in a video.

To explain this, I'll need to give a brief history lesson. You see, video frames could easily use RGB colors like you might have assumed. However, the US TV system was defined near the beginning of World War II and only carried black-and-white video until the 1950's. So instead of red, green, and blue, the old TV system only carried brightness values (shades of gray). You can approximate that in MS Paint by using the same number for red, green, and blue.

In the 50's, they wanted to add color to TV without breaking compatibility with existing TV's. They weren't willing to go the 2009 route and make everyone buy a new TV. What they did was mix a small amount of color data onto the existing black-and-white signal. That way, old TV's would still decode it while color TV's would add the color info on top of the black-and-white picture.

The FCC had allocated 6 MHz bandwidth for TV channels. The engineers couldn't add too much extra color without the signal bleeding onto other channels, so they devised a system of saving only a low-res copy of the image in color format while saving the full-res image in B/W format. The low-quality color image was enlarged and super-imposed over the B/W image, making good color because the human eye sees more brightness than color.

Even though NTSC is nearly dead, this system lives on in computers as YUV 4:2:0. If you want to read more, check out the YUV Wikipedia page.

Let's get started

For this series, we'll be working with the following 960x540 color photo. In case you didn't know, that's 1/4 the size of 1920x1080.

Credit: Lt. Zachary West, 100th MPAD, Flickr, CC BY 2.0

This wasn't actually 960x540 but I cropped and resized so it would be a nice easy size.

If you want to follow along, you can download the image.

Monday, September 18, 2017

OpenCodecDev Discord channel

I just created a new Discord channel called OpenCodecDev, for the purpose of discussing open-source media codecs. It's public so anyone can join. We also have a voice channel.

Here's what it will look like in the Discord desktop app. I'm under Admin, and Foxx is a newcomer.


Notable features of Discord include:

  • Posting images, whether photos or pasting from PrintScreen.
  • Blog URL's become a block with an excerpt from the post
  • YouTube URL's are expanded into a player in the chat history
  • Unlimited backlog which is stored on the server. You don't miss out on messages that happened while you were logged out, unlike IRC, and no past message is ever lost.
  • The backlog is searchable.
This will be more convenient than the Google group, and it has a better GUI than IRC.