14. DCT and WHT Inversion and Macroblock Reconstruction 14.1. Dequantization After decoding the DCTs/WHTs as described above, each (quantized) coefficient in each subblock is multiplied by one of six dequantization factors, the choice of factor depending on the plane (Y2, Y, or chroma) and position (DC = coefficient zero, AC = any

other coefficient). If the current macroblock has overridden the quantizer level (as described in Section 10), then the six factors are looked up from two dequantization tables with appropriate scaling and clamping using the single index supplied by the override. Otherwise, the frame-level dequantization factors (as described in Section 9.6) are used. In either case, the multiplies are computed and stored using 16-bit signed integers. The two dequantization tables, which may also be found in the reference decoder file dequant_data.h (Section 20.3), are as follows. ---- Begin code block -------------------------------------- static const int dc_qlookup[QINDEX_RANGE] = { 4, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 16, 17, 17, 18, 19, 20, 20, 21, 21, 22, 22, 23, 23, 24, 25, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 91, 93, 95, 96, 98, 100, 101, 102, 104, 106, 108, 110, 112, 114, 116, 118, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 143, 145, 148, 151, 154, 157, }; static const int ac_qlookup[QINDEX_RANGE] = { 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100, 102, 104, 106, 108, 110, 112, 114, 116, 119, 122, 125, 128, 131, 134, 137, 140, 143, 146, 149, 152, 155, 158, 161, 164, 167, 170, 173, 177, 181, 185, 189, 193, 197, 201, 205, 209, 213, 217, 221, 225, 229, 234, 239, 245, 249, 254, 259, 264, 269, 274, 279, 284, }; ---- End code block ---------------------------------------- Lookup values from the above two tables are directly used in the DC and AC coefficients in Y1, respectively. For Y2 and chroma, values from the above tables undergo either scaling or clamping before the multiplies. Details regarding these scaling and clamping processes can be found in related lookup functions in dixie.c (Section 20.4).

14.2. Inverse Transforms If the Y2 residue block exists (i.e., the macroblock luma mode is not SPLITMV or B_PRED), it is inverted first (using the inverse WHT) and the element of the result at row i, column j is used as the 0th coefficient of the Y subblock at position (i, j), that is, the Y subblock whose index is (i * 4) + j. As discussed in Section 13, if the luma mode is B_PRED or SPLITMV, the 0th Y coefficients are part of the residue signal for the subblocks themselves. In either case, the inverse transforms for the sixteen Y subblocks and eight chroma subblocks are computed next. All 24 of these inversions are independent of each other; their results may (at least conceptually) be stored in 24 separate 4x4 arrays. As is done by the reference decoder, an implementation may wish to represent the prediction and residue buffers as macroblock-sized arrays (that is, a 16x16 Y buffer and two 8x8 chroma buffers). Regarding the inverse DCT implementation given below, this requires a simple adjustment to the address calculation for the resulting residue pixels. 14.3. Implementation of the WHT Inversion As previously discussed (see Sections 2 and 13), for macroblocks encoded using prediction modes other than B_PRED and SPLITMV, the DC values derived from the DCT transform on the 16 Y blocks are collected to construct a 25th block of a macroblock (16 Y, 4 U, 4 V constitute the 24 blocks). This 25th block is transformed using a Walsh-Hadamard transform (WHT). The inputs to the inverse WHT (that is, the dequantized coefficients), the intermediate "horizontally detransformed" signal, and the completely detransformed residue signal are all stored as arrays of 16-bit signed integers. Following the tradition of specifying bitstream format using the decoding process, we specify the inverse WHT in the decoding process using the following C-style source code:

---- Begin code block -------------------------------------- void vp8_short_inv_walsh4x4_c(short *input, short *output) { int i; int a1, b1, c1, d1; int a2, b2, c2, d2; short *ip = input; short *op = output; int temp1, temp2; for(i=0;i<4;i++) { a1 = ip[0] + ip[12]; b1 = ip[4] + ip[8]; c1 = ip[4] - ip[8]; d1 = ip[0] - ip[12]; op[0] = a1 + b1; op[4] = c1 + d1; op[8] = a1 - b1; op[12]= d1 - c1; ip++; op++; } ip = output; op = output; for(i=0;i<4;i++) { a1 = ip[0] + ip[3]; b1 = ip[1] + ip[2]; c1 = ip[1] - ip[2]; d1 = ip[0] - ip[3]; a2 = a1 + b1; b2 = c1 + d1; c2 = a1 - b1; d2 = d1 - c1;

op[0] = (a2+3)>>3; op[1] = (b2+3)>>3; op[2] = (c2+3)>>3; op[3] = (d2+3)>>3; ip+=4; op+=4; } } ---- End code block ---------------------------------------- In the case that there is only one non-zero DC value in input, the inverse transform can be simplified to the following: ---- Begin code block -------------------------------------- void vp8_short_inv_walsh4x4_1_c(short *input, short *output) { int i; int a1; short *op=output; a1 = ((input[0] + 3)>>3); for(i=0;i<4;i++) { op[0] = a1; op[1] = a1; op[2] = a1; op[3] = a1; op+=4; } } ---- End code block ---------------------------------------- It should be noted that a conforming decoder should implement the inverse transform using exactly the same rounding to achieve bit-wise matching output to the output of the process specified by the above C source code. The reference decoder WHT inversion may be found in the file idct_add.c (Section 20.8).

14.4. Implementation of the DCT Inversion All of the DCT inversions are computed in exactly the same way. In principle, VP8 uses a classical 2-D inverse discrete cosine transform, implemented as two passes of 1-D inverse DCT. The 1-D inverse DCT was calculated using a similar algorithm to what was described in [Loeffler]. However, the paper only provided the 8-point and 16-point version of the algorithms, which was adapted by On2 to perform the 4-point 1-D DCT. Accurate calculation of 1-D DCT of the above algorithm requires infinite precision. VP8 of course can use only a finite-precision approximation. Also, the inverse DCT used by VP8 takes care of normalization of the standard unitary transform; that is, every dequantized coefficient has roughly double the size of the corresponding unitary coefficient. However, at all but the highest datarates, the discrepancy between transmitted and ideal coefficients is due almost entirely to (lossy) compression and not to errors induced by finite-precision arithmetic. The inputs to the inverse DCT (that is, the dequantized coefficients), the intermediate "horizontally detransformed" signal, and the completely detransformed residue signal are all stored as arrays of 16-bit signed integers. The details of the computation are as follows. It should also be noted that this implementation makes use of the 16-bit fixed-point version of two multiplication constants: sqrt(2) * cos (pi/8) sqrt(2) * sin (pi/8) Because the first constant is bigger than 1, to maintain the same 16-bit fixed-point precision as the second one, we make use of the fact that x * a = x + x*(a-1) therefore x * sqrt(2) * cos (pi/8) = x + x * (sqrt(2) * cos(pi/8)-1)

---- Begin code block -------------------------------------- /* IDCT implementation */ static const int cospi8sqrt2minus1=20091; static const int sinpi8sqrt2 =35468; void short_idct4x4llm_c(short *input, short *output, int pitch) { int i; int a1, b1, c1, d1; short *ip=input; short *op=output; int temp1, temp2; int shortpitch = pitch>>1; for(i=0;i<4;i++) { a1 = ip[0]+ip[8]; b1 = ip[0]-ip[8]; temp1 = (ip[4] * sinpi8sqrt2)>>16; temp2 = ip[12]+((ip[12] * cospi8sqrt2minus1)>>16); c1 = temp1 - temp2; temp1 = ip[4] + ((ip[4] * cospi8sqrt2minus1)>>16); temp2 = (ip[12] * sinpi8sqrt2)>>16; d1 = temp1 + temp2; op[shortpitch*0] = a1+d1; op[shortpitch*3] = a1-d1; op[shortpitch*1] = b1+c1; op[shortpitch*2] = b1-c1; ip++; op++; } ip = output; op = output; for(i=0;i<4;i++) { a1 = ip[0]+ip[2]; b1 = ip[0]-ip[2]; temp1 = (ip[1] * sinpi8sqrt2)>>16; temp2 = ip[3]+((ip[3] * cospi8sqrt2minus1)>>16); c1 = temp1 - temp2; temp1 = ip[1] + ((ip[1] * cospi8sqrt2minus1)>>16);

temp2 = (ip[3] * sinpi8sqrt2)>>16; d1 = temp1 + temp2; op[0] = (a1+d1+4)>>3; op[3] = (a1-d1+4)>>3; op[1] = (b1+c1+4)>>3; op[2] = (b1-c1+4)>>3; ip+=shortpitch; op+=shortpitch; } } ---- End code block ---------------------------------------- The reference decoder DCT inversion may be found in the file idct_add.c (Section 20.8). 14.5. Summation of Predictor and Residue Finally, the prediction and residue signals are summed to form the reconstructed macroblock, which, except for loop filtering (taken up next), completes the decoding process. The summing procedure is fairly straightforward, having only a couple of details. The prediction and residue buffers are both arrays of 16-bit signed integers. Each individual (Y, U, and V pixel) result is calculated first as a 32-bit sum of the prediction and residue, and is then saturated to 8-bit unsigned range (using, say, the clamp255 function defined above) before being stored as an 8-bit unsigned pixel value. VP8 also supports a mode where the encoding of a bitstream guarantees all reconstructed pixel values between 0 and 255; compliant bitstreams of such requirements have the clamp_type bit in the frame header set to 1. In such a case, the clamp255 function is no longer required. The summation process is the same, regardless of the (intra or inter) mode of prediction in effect for the macroblock. The reference decoder implementation of reconstruction may be found in the file idct_add.c.

15. Loop Filter Loop filtering is the last stage of frame reconstruction and the next-to-last stage of the decoding process. The loop filter is applied to the entire frame after the summation of predictor and residue signals, as described in Section 14. The purpose of the loop filter is to eliminate (or at least reduce) visually objectionable artifacts associated with the semi- independence of the coding of macroblocks and their constituent subblocks. As was discussed in Section 5, the loop filter is "integral" to decoding, in that the results of loop filtering are used in the prediction of subsequent frames. Consequently, a functional decoder implementation must perform loop filtering exactly as described here. This is distinct from any postprocessing that may be applied only to the image immediately before display; such postprocessing is entirely at the option of the implementor (and/or user) and has no effect on decoding per se. The baseline frame-level parameters controlling the loop filter are defined in the frame header (Section 9.4) along with a mechanism for adjustment based on a macroblock's prediction mode and/or reference frame. The first is a flag (filter_type) selecting the type of filter (normal or simple); the other two are numbers (loop_filter_level and sharpness_level) that adjust the strength or sensitivity of the filter. As described in Sections 9.3 and 10, loop_filter_level may also be overridden on a per-macroblock basis using segmentation. Loop filtering is one of the more computationally intensive aspects of VP8 decoding. This is the reason for the existence of the optional, less-demanding simple filter type. Note carefully that loop filtering must be skipped entirely if loop_filter_level at either the frame header level or macroblock override level is 0. In no case should the loop filter be run with a value of 0; it should instead be skipped. We begin by discussing the aspects of loop filtering that are independent of the controlling parameters and type of filter chosen.

15.1. Filter Geometry and Overall Procedure The Y, U, and V planes are processed independently and identically. The loop filter acts on the edges between adjacent macroblocks and on the edges between adjacent subblocks of a macroblock. All such edges are horizontal or vertical. For each pixel position on an edge, a small number (two or three) of pixels adjacent to either side of the position are examined and possibly modified. The displacements of these pixels are at a right angle to the edge orientation; that is, for a horizontal edge, we treat the pixels immediately above and below the edge position, and for a vertical edge, we treat the pixels immediately to the left and right of the edge. We call this collection of pixels associated to an edge position a segment; the length of a segment is 2, 4, 6, or 8. Excepting that the normal filter uses slightly different algorithms for, and either filter may apply different control parameters to, the edges between macroblocks and those between subblocks, the treatment of edges is quite uniform: All segments straddling an edge are treated identically; there is no distinction between the treatment of horizontal and vertical edges, whether between macroblocks or between subblocks. As a consequence, adjacent subblock edges within a macroblock may be concatenated and processed in their entirety. There is a single 8-pixel-long vertical edge horizontally centered in each of the U and V blocks (the concatenation of upper and lower 4-pixel edges between chroma subblocks), and three 16-pixel-long vertical edges at horizontal positions 1/4, 1/2, and 3/4 the width of the luma macroblock, each representing the concatenation of four 4-pixel sub-edges between pairs of Y subblocks. The macroblocks comprising the frame are processed in the usual raster-scan order. Each macroblock is "responsible for" the inter-macroblock edges immediately above and to the left of it (but not the edges below and to the right of it), as well as the edges between its subblocks. For each macroblock M, there are four filtering steps, which are, (almost) in order: 1. If M is not on the leftmost column of macroblocks, filter across the left (vertical) inter-macroblock edge of M. 2. Filter across the vertical subblock edges within M.

3. If M is not on the topmost row of macroblocks, filter across the top (horizontal) inter-macroblock edge of M. 4. Filter across the horizontal subblock edges within M. We write MY, MU, and MV for the planar constituents of M, that is, the 16x16 luma block, 8x8 U block, and 8x8 V block comprising M. In step 1, for each of the three blocks MY, MU, and MV, we filter each of the (16 luma or 8 chroma) segments straddling the column separating the block from the block immediately to the left of it, using the inter-macroblock filter and controls associated to the loop_filter_level and sharpness_level. In step 4, we filter across the (three luma and one each for U and V) vertical subblock edges described above, this time using the inter-subblock filter and controls. Steps 2 and 4 are skipped for macroblocks that satisfy both of the following two conditions: 1. Macroblock coding mode is neither B_PRED nor SPLITMV; and 2. There is no DCT coefficient coded for the whole macroblock. For these macroblocks, loop filtering for edges between subblocks internal to a macroblock is effectively skipped. This skip strategy significantly reduces VP8 loop-filtering complexity. Edges between macroblocks and those between subblocks are treated with different control parameters (and, in the case of the normal filter, with different algorithms). Except for pixel addressing, there is no distinction between the treatment of vertical and horizontal edges. Luma edges are always 16 pixels long, chroma edges are always 8 pixels long, and the segments straddling an edge are treated identically; this of course facilitates vector processing. Because many pixels belong to segments straddling two or more edges, and so will be filtered more than once, the order in which edges are processed given above must be respected by any implementation. Within a single edge, however, the segments straddling that edge are disjoint, and the order in which these segments are processed is immaterial. Before taking up the filtering algorithms themselves, we should emphasize a point already made: Even though the pixel segments associated to a macroblock are antecedent to the macroblock (that is, lie within the macroblock or in already-constructed macroblocks), a

macroblock must not be filtered immediately after its "reconstruction" (described in Section 14). Rather, the loop filter applies after all the macroblocks have been "reconstructed" (i.e., had their predictor summed with their residue); correct decoding is predicated on the fact that already-constructed portions of the current frame referenced via intra-prediction (described in Section 12) are not yet filtered. 15.2. Simple Filter Having described the overall procedure of, and pixels affected by, the loop filter, we turn our attention to the treatment of individual segments straddling edges. We begin by describing the simple filter, which, as the reader might guess, is somewhat simpler than the normal filter. Note that the simple filter only applies to luma edges. Chroma edges are left unfiltered. Roughly speaking, the idea of loop filtering is, within limits, to reduce the difference between pixels straddling an edge. Differences in excess of a threshold (associated to the loop_filter_level) are assumed to be "natural" and are unmodified; differences below the threshold are assumed to be artifacts of quantization and the (partially) separate coding of blocks, and are reduced via the procedures described below. While the loop_filter_level is in principle arbitrary, the levels chosen by a VP8 compressor tend to be correlated to quantizer levels. Most of the filtering arithmetic is done using 8-bit signed operands (having a range of -128 to +127, inclusive), supplemented by 16-bit temporaries holding results of multiplies. Sums and other temporaries need to be "clamped" to a valid signed 8-bit range: ---- Begin code block -------------------------------------- int8 c(int v) { return (int8) (v < -128 ? -128 : (v < 128 ? v : 127)); } ---- End code block ----------------------------------------

Since pixel values themselves are unsigned 8-bit numbers, we need to convert between signed and unsigned values: ---- Begin code block -------------------------------------- /* Convert pixel value (0 <= v <= 255) to an 8-bit signed number. */ int8 u2s(Pixel v) { return (int8) (v - 128);} /* Clamp, then convert signed number back to pixel value. */ Pixel s2u(int v) { return (Pixel) (c(v) + 128);} ---- End code block ---------------------------------------- Filtering is often predicated on absolute-value thresholds. The following function is the equivalent of the standard library function abs, whose prototype is found in the standard header file stdlib.h. For us, the argument v is always the difference between two pixels and lies in the range -255 <= v <= +255. ---- Begin code block -------------------------------------- int abs(int v) { return v < 0? -v : v;} ---- End code block ---------------------------------------- An actual implementation would of course use inline functions or macros to accomplish these trivial procedures (which are used by both the normal and simple loop filters). An optimal implementation would probably express them in machine language, perhaps using single instruction, multiple data (SIMD) vector instructions. On many SIMD processors, the saturation accomplished by the above clamping function is often folded into the arithmetic instructions themselves, obviating the explicit step taken here. To simplify the specification of relative pixel positions, we use the word "before" to mean "immediately above" (for a vertical segment straddling a horizontal edge) or "immediately to the left of" (for a horizontal segment straddling a vertical edge), and the word "after" to mean "immediately below" or "immediately to the right of". Given an edge, a segment, and a limit value, the simple loop filter computes a value based on the four pixels that straddle the edge (two either side). If that value is below a supplied limit, then, very roughly speaking, the two pixel values are brought closer to each other, "shaving off" something like a quarter of the difference. The

same procedure is used for all segments straddling any type of edge, regardless of the nature (inter-macroblock, inter-subblock, luma, or chroma) of the edge; only the limit value depends on the edge type. The exact procedure (for a single segment) is as follows; the subroutine common_adjust is used by both the simple filter presented here and the normal filters discussed in Section 15.3. ---- Begin code block -------------------------------------- int8 common_adjust( int use_outer_taps, /* filter is 2 or 4 taps wide */ const Pixel *P1, /* pixel before P0 */ Pixel *P0, /* pixel before edge */ Pixel *Q0, /* pixel after edge */ const Pixel *Q1 /* pixel after Q0 */ ) { cint8 p1 = u2s(*P1); /* retrieve and convert all 4 pixels */ cint8 p0 = u2s(*P0); cint8 q0 = u2s(*Q0); cint8 q1 = u2s(*Q1); /* Disregarding clamping, when "use_outer_taps" is false, "a" is 3*(q0-p0). Since we are about to divide "a" by 8, in this case we end up multiplying the edge difference by 5/8. When "use_outer_taps" is true (as for the simple filter), "a" is p1 - 3*p0 + 3*q0 - q1, which can be thought of as a refinement of 2*(q0 - p0), and the adjustment is something like (q0 - p0)/4. */ int8 a = c((use_outer_taps? c(p1 - q1) : 0) + 3*(q0 - p0)); /* b is used to balance the rounding of a/8 in the case where the "fractional" part "f" of a/8 is exactly 1/2. */ cint8 b = (c(a + 3)) >> 3; /* Divide a by 8, rounding up when f >= 1/2. Although not strictly part of the C language, the right shift is assumed to propagate the sign bit. */ a = c(a + 4) >> 3; /* Subtract "a" from q0, "bringing it closer" to p0. */ *Q0 = s2u(q0 - a);

/* Add "a" (with adjustment "b") to p0, "bringing it closer" to q0. The clamp of "a+b", while present in the reference decoder, is superfluous; we have -16 <= a <= 15 at this point. */ *P0 = s2u(p0 + b); return a; } ---- End code block ---------------------------------------- ---- Begin code block -------------------------------------- void simple_segment( uint8 edge_limit, /* do nothing if edge difference exceeds limit */ const Pixel *P1, /* pixel before P0 */ Pixel *P0, /* pixel before edge */ Pixel *Q0, /* pixel after edge */ const Pixel *Q1 /* pixel after Q0 */ ) { if ((abs(*P0 - *Q0)*2 + abs(*P1 - *Q1)/2) <= edge_limit)) common_adjust(1, P1, P0, Q0, Q1); /* use outer taps */ } ---- End code block ---------------------------------------- We make a couple of remarks about the rounding procedure above. When b is zero (that is, when the "fractional part" of a is not 1/2), we are (except for clamping) adding the same number to p0 as we are subtracting from q0. This preserves the average value of p0 and q0, but the resulting difference between p0 and q0 is always even; in particular, the smallest non-zero gradation +-1 is not possible here. When b is one, the value we add to p0 (again except for clamping) is one less than the value we are subtracting from q0. In this case, the resulting difference is always odd (and the small gradation +-1 is possible), but the average value is reduced by 1/2, yielding, for instance, a very slight darkening in the luma plane. (In the very unlikely event of appreciable darkening after a large number of interframes, a compressor would of course eventually compensate for this in the selection of predictor and/or residue.)

The derivation of the edge_limit value used above, which depends on the loop_filter_level and sharpness_level, as well as the type of edge being processed, will be taken up after we describe the normal loop filtering algorithm below. 15.3. Normal Filter The normal loop filter is a refinement of the simple loop filter; all of the general discussion above applies here as well. In particular, the functions c, u2s, s2u, abs, and common_adjust are used by both the normal and simple filters. As mentioned above, the normal algorithms for inter-macroblock and inter-subblock edges differ. Nonetheless, they have a great deal in common: They use similar threshold algorithms to disable the filter and to detect high internal edge variance (which influences the filtering algorithm). Both algorithms also use, at least conditionally, the simple filter adjustment procedure described above. The common thresholding algorithms are as follows. ---- Begin code block -------------------------------------- /* All functions take (among other things) a segment (of length at most 4 + 4 = 8) symmetrically straddling an edge. The pixel values (or pointers) are always given in order, from the "beforemost" to the "aftermost". So, for a horizontal edge (written "|"), an 8-pixel segment would be ordered p3 p2 p1 p0 | q0 q1 q2 q3. */ /* Filtering is disabled if the difference between any two adjacent "interior" pixels in the 8-pixel segment exceeds the relevant threshold (I). A more complex thresholding calculation is done for the group of four pixels that straddle the edge, in line with the calculation in simple_segment() above. */

int filter_yes( uint8 I, /* limit on interior differences */ uint8 E, /* limit at the edge */ cint8 p3, cint8 p2, cint8 p1, cint8 p0, /* pixels before edge */ cint8 q0, cint8 q1, cint8 q2, cint8 q3 /* pixels after edge */ ) { return (abs(p0 - q0)*2 + abs(p1 - q1)/2) <= E && abs(p3 - p2) <= I && abs(p2 - p1) <= I && abs(p1 - p0) <= I && abs(q3 - q2) <= I && abs(q2 - q1) <= I && abs(q1 - q0) <= I; } ---- End code block ---------------------------------------- ---- Begin code block -------------------------------------- /* Filtering is altered if (at least) one of the differences on either side of the edge exceeds a threshold (we have "high edge variance"). */ int hev( uint8 threshold, cint8 p1, cint8 p0, /* pixels before edge */ cint8 q0, cint8 q1 /* pixels after edge */ ) { return abs(p1 - p0) > threshold || abs(q1 - q0) > threshold; } ---- End code block ---------------------------------------- The subblock filter is a variant of the simple filter. In fact, if we have high edge variance, the adjustment is exactly as for the simple filter. Otherwise, the simple adjustment (without outer taps) is applied, and the two pixels one step in from the edge pixels are adjusted by roughly half the amount by which the two edge pixels are adjusted; since the edge adjustment here is essentially 3/8 the edge difference, the inner adjustment is approximately 3/16 the edge difference.

---- Begin code block -------------------------------------- void subblock_filter( uint8 hev_threshold, /* detect high edge variance */ uint8 interior_limit, /* possibly disable filter */ uint8 edge_limit, cint8 *P3, cint8 *P2, int8 *P1, int8 *P0, /* pixels before edge */ int8 *Q0, int8 *Q1, cint8 *Q2, cint8 *Q3 /* pixels after edge */ ) { cint8 p3 = u2s(*P3), p2 = u2s(*P2), p1 = u2s(*P1), p0 = u2s(*P0); cint8 q0 = u2s(*Q0), q1 = u2s(*Q1), q2 = u2s(*Q2), q3 = u2s(*Q3); if (filter_yes(interior_limit, edge_limit, q3, q2, q1, q0, p0, p1, p2, p3)) { const int hv = hev(hev_threshold, p1, p0, q0, q1); cint8 a = (common_adjust(hv, P1, P0, Q0, Q1) + 1) >> 1; if (!hv) { *Q1 = s2u(q1 - a); *P1 = s2u(p1 + a); } } } ---- End code block ---------------------------------------- The inter-macroblock filter has potentially wider scope. If the edge variance is high, it performs the simple adjustment (using the outer taps, just like the simple filter and the corresponding case of the normal subblock filter). If the edge variance is low, we begin with the same basic filter calculation and apply multiples of it to pixel pairs symmetric about the edge; the magnitude of adjustment decays as we move away from the edge and six of the pixels in the segment are affected.

---- Begin code block -------------------------------------- void MBfilter( uint8 hev_threshold, /* detect high edge variance */ uint8 interior_limit, /* possibly disable filter */ uint8 edge_limit, cint8 *P3, int8 *P2, int8 *P1, int8 *P0, /* pixels before edge */ int8 *Q0, int8 *Q1, int8 *Q2, cint8 *Q3 /* pixels after edge */ ) { cint8 p3 = u2s(*P3), p2 = u2s(*P2), p1 = u2s(*P1), p0 = u2s(*P0); cint8 q0 = u2s(*Q0), q1 = u2s(*Q1), q2 = u2s(*Q2), q3 = u2s(*Q3); if (filter_yes(interior_limit, edge_limit, q3, q2, q1, q0, p0, p1, p2, p3)) { if (!hev(hev_threshold, p1, p0, q0, q1)) { /* Same as the initial calculation in "common_adjust", w is something like twice the edge difference */ const int8 w = c(c(p1 - q1) + 3*(q0 - p0)); /* 9/64 is approximately 9/63 = 1/7, and 1<<7 = 128 = 2*64. So this a, used to adjust the pixels adjacent to the edge, is something like 3/7 the edge difference. */ int8 a = c((27*w + 63) >> 7); *Q0 = s2u(q0 - a); *P0 = s2u(p0 + a); /* Next two are adjusted by 2/7 the edge difference */ a = c((18*w + 63) >> 7); *Q1 = s2u(q1 - a); *P1 = s2u(p1 + a); /* Last two are adjusted by 1/7 the edge difference */ a = c((9*w + 63) >> 7); *Q2 = s2u(q2 - a); *P2 = s2u(p2 + a);

} else /* if hev, do simple filter */ common_adjust(1, P1, P0, Q0, Q1); /* using outer taps */ } } ---- End code block ---------------------------------------- 15.4. Calculation of Control Parameters We conclude the discussion of loop filtering by showing how the thresholds supplied to the procedures above are derived from the two control parameters sharpness_level (an unsigned 3-bit number having maximum value 7) and loop_filter_level (an unsigned 6-bit number having maximum value 63). While the sharpness_level is constant over the frame, individual macroblocks may override the loop_filter_level with one of four possibilities supplied in the frame header (as described in Section 10). Both the simple and normal filters disable filtering if a value derived from the four pixels that straddle the edge (2 either side) exceeds a threshold / limit value. ---- Begin code block -------------------------------------- /* Luma and Chroma use the same inter-macroblock edge limit */ uint8 mbedge_limit = ((loop_filter_level + 2) * 2) + interior_limit; /* Luma and Chroma use the same inter-subblock edge limit */ uint8 sub_bedge_limit = (loop_filter_level * 2) + interior_limit; ---- End code block ---------------------------------------- The remaining thresholds are used only by the normal filters. The filter-disabling interior difference limit is the same for all edges (luma, chroma, inter-subblock, inter-macroblock) and is given by the following.

---- Begin code block -------------------------------------- uint8 interior_limit = loop_filter_level; if (sharpness_level) { interior_limit >>= sharpness_level > 4 ? 2 : 1; if (interior_limit > 9 - sharpness_level) interior_limit = 9 - sharpness_level; } if (!interior_limit) interior_limit = 1; ---- End code block ---------------------------------------- Finally, we give the derivation of the high edge-variance threshold, which is also the same for all edge types. ---- Begin code block -------------------------------------- uint8 hev_threshold = 0; if (we_are_decoding_akey_frame) /* current frame is a key frame */ { if (loop_filter_level >= 40) hev_threshold = 2; else if (loop_filter_level >= 15) hev_threshold = 1; } else /* current frame is an interframe */ { if (loop_filter_level >= 40) hev_threshold = 3; else if (loop_filter_level >= 20) hev_threshold = 2; else if (loop_filter_level >= 15) hev_threshold = 1; } ---- End code block ----------------------------------------

16. Interframe Macroblock Prediction Records We describe the layout and semantics of the prediction records for macroblocks in an interframe. After the feature specification (which is described in Section 10 and is identical for intraframes and interframes), there comes a Bool(prob_intra), which indicates inter-prediction (i.e., prediction from prior frames) when true and intra-prediction (i.e., prediction from already-coded portions of the current frame) when false. The zero-probability prob_intra is set by field J of the frame header. 16.1. Intra-Predicted Macroblocks For intra-prediction, the layout of the prediction data is essentially the same as the layout for key frames, although the contexts used by the decoding process are slightly different. As discussed in Section 8, the "outer" Y mode here uses a different tree from that used in key frames, repeated here for convenience. ---- Begin code block -------------------------------------- const tree_index ymode_tree [2 * (num_ymodes - 1)] = { -DC_PRED, 2, /* root: DC_PRED = "0", "1" subtree */ 4, 6, /* "1" subtree has 2 descendant subtrees */ -V_PRED, -H_PRED, /* "10" subtree: V_PRED = "100", H_PRED = "101" */ -TM_PRED, -B_PRED /* "11" subtree: TM_PRED = "110", B_PRED = "111" */ }; ---- End code block ---------------------------------------- The probability table used to decode this tree is variable. As described in Section 11, it (along with the similarly treated UV table) can be updated by field J of the frame header. Similar to the coefficient-decoding probabilities, such updates are cumulative and affect all ensuing frames until the next key frame or explicit update. The default probabilities for the Y and UV tables are: ---- Begin code block -------------------------------------- Prob ymode_prob [num_ymodes - 1] = { 112, 86, 140, 37}; Prob uv_mode_prob [num_uv_modes - 1] = { 162, 101, 204}; ---- End code block ----------------------------------------

These defaults must be restored after detection of a key frame. Just as for key frames, if the Y mode is B_PRED, there next comes an encoding of the intra_bpred mode used by each of the sixteen Y subblocks. These encodings use the same tree as does that for key frames but, in place of the contexts used in key frames, these encodings use the single fixed probability table. ---- Begin code block -------------------------------------- const Prob bmode_prob [num_intra_bmodes - 1] = { 120, 90, 79, 133, 87, 85, 80, 111, 151 }; ---- End code block ---------------------------------------- Last comes the chroma mode, again coded using the same tree as that used for key frames, this time using the dynamic uv_mode_prob table described above. The calculation of the intra-prediction buffer is identical to that described for key frames in Section 12. 16.2. Inter-Predicted Macroblocks Otherwise (when the above bool is true), we are using inter-prediction (which of course only happens for interframes), to which we now restrict our attention. The next datum is then another bool, B(prob_last), selecting the reference frame. If 0, the reference frame is the previous frame (the last frame); if 1, another bool (prob_gf) selects the reference frame between the golden frame (0) and the altref frame (1). The probabilities prob_last and prob_gf are set in field J of the frame header. Together with setting the reference frame, the purpose of inter-mode decoding is to set a motion vector for each of the sixteen Y subblocks of the current macroblock. These settings then define the calculation of the inter-prediction buffer (detailed in Section 18). While the net effect of inter-mode decoding is straightforward, the implementation is somewhat complex; the (lossless) compression achieved by this method justifies the complexity.

After the reference frame selector comes the mode (or motion vector reference) applied to the macroblock as a whole, coded using the following enumeration and tree. Setting mv_nearest = num_ymodes is a convenience that allows a single variable to unambiguously hold an inter- or intra-prediction mode. ---- Begin code block -------------------------------------- typedef enum { mv_nearest = num_ymodes, /* use "nearest" motion vector for entire MB */ mv_near, /* use "next nearest" "" */ mv_zero, /* use zero "" */ mv_new, /* use explicit offset from implicit "" */ mv_split, /* use multiple motion vectors */ num_mv_refs = mv_split + 1 - mv_nearest } mv_ref; const tree_index mv_ref_tree [2 * (num_mv_refs - 1)] = { -mv_zero, 2, /* zero = "0" */ -mv_nearest, 4, /* nearest = "10" */ -mv_near, 6, /* near = "110" */ -mv_new, -mv_split /* new = "1110", split = "1111" */ }; ---- End code block ---------------------------------------- 16.3. Mode and Motion Vector Contexts The probability table used to decode the mv_ref, along with three reference motion vectors used by the selected mode, is calculated via a survey of the already-decoded motion vectors in (up to) 3 nearby macroblocks. The algorithm generates a sorted list of distinct motion vectors adjacent to the search site. The best_mv is the vector with the highest score. The mv_nearest is the non-zero vector with the highest score. The mv_near is the non-zero vector with the next highest score. The number of motion vectors coded using the SPLITMV mode is scored using the same weighting and is returned with the scores of the best, nearest, and near vectors.

The three adjacent macroblocks above, left, and above-left are considered in order. If the macroblock is intra-coded, no action is taken. Otherwise, the motion vector is compared to other previously found motion vectors to determine if it has been seen before, and if so contributes its weight to that vector; otherwise, it enters a new vector in the list. The above and left vectors have twice the weight of the above-left vector. As is the case with many contexts used by VP8, it is possible for macroblocks near the top or left edges of the image to reference blocks that are outside the visible image. VP8 provides a border of 1 macroblock filled with 0x0 motion vectors left of the left edge, and a border filled with 0,0 motion vectors of 1 macroblocks above the top edge. Much of the process is more easily described in C than in English. The reference code for this can be found in modemv.c (Section 20.11). The calculation of reference vectors, probability table, and, finally, the inter-prediction mode itself is implemented as follows. ---- Begin code block -------------------------------------- typedef union { unsigned int as_int; MV as_mv; } int_mv; /* facilitates rapid equality tests */ static void mv_bias(MODE_INFO *x,int refframe, int_mv *mvp, int * ref_frame_sign_bias) { MV xmv; xmv = x->mbmi.mv.as_mv; if ( ref_frame_sign_bias[x->mbmi.ref_frame] != ref_frame_sign_bias[refframe] ) { xmv.row*=-1; xmv.col*=-1; } mvp->as_mv = xmv; } ---- End code block ----------------------------------------

---- Begin code block -------------------------------------- void vp8_clamp_mv(MV *mv, const MACROBLOCKD *xd) { if ( mv->col < (xd->mb_to_left_edge - LEFT_TOP_MARGIN) ) mv->col = xd->mb_to_left_edge - LEFT_TOP_MARGIN; else if ( mv->col > xd->mb_to_right_edge + RIGHT_BOTTOM_MARGIN ) mv->col = xd->mb_to_right_edge + RIGHT_BOTTOM_MARGIN; if ( mv->row < (xd->mb_to_top_edge - LEFT_TOP_MARGIN) ) mv->row = xd->mb_to_top_edge - LEFT_TOP_MARGIN; else if ( mv->row > xd->mb_to_bottom_edge + RIGHT_BOTTOM_MARGIN ) mv->row = xd->mb_to_bottom_edge + RIGHT_BOTTOM_MARGIN; } ---- End code block ---------------------------------------- In the function vp8_find_near_mvs(), the vectors "nearest" and "near" are used by the corresponding modes. The vector best_mv is used as a base for explicitly coded motion vectors. The first three entries in the return value cnt are (in order) weighted census values for "zero", "nearest", and "near" vectors. The final value indicates the extent to which SPLITMV was used by the neighboring macroblocks. The largest possible "weight" value in each case is 5. ---- Begin code block -------------------------------------- void vp8_find_near_mvs ( MACROBLOCKD *xd, const MODE_INFO *here, MV *nearest, MV *near, MV *best_mv, int cnt[4], int refframe, int * ref_frame_sign_bias )

{ const MODE_INFO *above = here - xd->mode_info_stride; const MODE_INFO *left = here - 1; const MODE_INFO *aboveleft = above - 1; int_mv near_mvs[4]; int_mv *mv = near_mvs; int *cntx = cnt; enum {CNT_ZERO, CNT_NEAREST, CNT_NEAR, CNT_SPLITMV}; /* Zero accumulators */ mv[0].as_int = mv[1].as_int = mv[2].as_int = 0; cnt[0] = cnt[1] = cnt[2] = cnt[3] = 0; /* Process above */ if (above->mbmi.ref_frame != INTRA_FRAME) { if (above->mbmi.mv.as_int) { (++mv)->as_int = above->mbmi.mv.as_int; mv_bias(above, refframe, mv, ref_frame_sign_bias); ++cntx; } *cntx += 2; } /* Process left */ if (left->mbmi.ref_frame != INTRA_FRAME) { if (left->mbmi.mv.as_int) { int_mv this_mv; this_mv.as_int = left->mbmi.mv.as_int; mv_bias(left, refframe, &this_mv, ref_frame_sign_bias); if (this_mv.as_int != mv->as_int) { (++mv)->as_int = this_mv.as_int; ++cntx; } *cntx += 2; } else cnt[CNT_ZERO] += 2; } /* Process above left */ if (aboveleft->mbmi.ref_frame != INTRA_FRAME) { if (aboveleft->mbmi.mv.as_int) { int_mv this_mv; this_mv.as_int = aboveleft->mbmi.mv.as_int; mv_bias(aboveleft, refframe, &this_mv, ref_frame_sign_bias);

if (this_mv.as_int != mv->as_int) { (++mv)->as_int = this_mv.as_int; ++cntx; } *cntx += 1; } else cnt[CNT_ZERO] += 1; } /* If we have three distinct MVs ... */ if (cnt[CNT_SPLITMV]) { /* See if above-left MV can be merged with NEAREST */ if (mv->as_int == near_mvs[CNT_NEAREST].as_int) cnt[CNT_NEAREST] += 1; } cnt[CNT_SPLITMV] = ((above->mbmi.mode == SPLITMV) + (left->mbmi.mode == SPLITMV)) * 2 + (aboveleft->mbmi.mode == SPLITMV); /* Swap near and nearest if necessary */ if (cnt[CNT_NEAR] > cnt[CNT_NEAREST]) { int tmp; tmp = cnt[CNT_NEAREST]; cnt[CNT_NEAREST] = cnt[CNT_NEAR]; cnt[CNT_NEAR] = tmp; tmp = near_mvs[CNT_NEAREST].as_int; near_mvs[CNT_NEAREST].as_int = near_mvs[CNT_NEAR].as_int; near_mvs[CNT_NEAR].as_int = tmp; } /* Use near_mvs[0] to store the "best" MV */ if (cnt[CNT_NEAREST] >= cnt[CNT_ZERO]) near_mvs[CNT_ZERO] = near_mvs[CNT_NEAREST]; /* Set up return values */ *best_mv = near_mvs[0].as_mv; *nearest = near_mvs[CNT_NEAREST].as_mv; *near = near_mvs[CNT_NEAR].as_mv; vp8_clamp_mv(nearest, xd); vp8_clamp_mv(near, xd); vp8_clamp_mv(best_mv, xd); //TODO: Move this up before the copy } ---- End code block ----------------------------------------

The mv_ref probability table (mv_ref_p) is then derived from the census as follows. ---- Begin code block -------------------------------------- const int vp8_mode_contexts[6][4] = { { 7, 1, 1, 143, }, { 14, 18, 14, 107, }, { 135, 64, 57, 68, }, { 60, 56, 128, 65, }, { 159, 134, 128, 34, }, { 234, 188, 128, 28, }, } ---- End code block ---------------------------------------- ---- Begin code block -------------------------------------- vp8_prob *vp8_mv_ref_probs(vp8_prob mv_ref_p[VP8_MVREFS-1], int cnt[4]) { mv_ref_p[0] = vp8_mode_contexts [cnt[0]] [0]; mv_ref_p[1] = vp8_mode_contexts [cnt[1]] [1]; mv_ref_p[2] = vp8_mode_contexts [cnt[2]] [2]; mv_ref_p[3] = vp8_mode_contexts [cnt[3]] [3]; return p; } ---- End code block ---------------------------------------- Once mv_ref_p is established, the mv_ref is decoded as usual. ---- Begin code block -------------------------------------- mvr = (mv_ref) treed_read(d, mv_ref_tree, mv_ref_p); ---- End code block ---------------------------------------- For the first four inter-coding modes, the same motion vector is used for all the Y subblocks. The first three modes use an implicit motion vector.

+------------+------------------------------------------------------+ | Mode | Instruction | +------------+------------------------------------------------------+ | mv_nearest | Use the nearest vector returned by | | | vp8_find_near_mvs. | | | | | mv_near | Use the near vector returned by vp8_find_near_mvs. | | | | | mv_zero | Use a zero vector; that is, predict the current | | | macroblock from the corresponding macroblock in the | | | prediction frame. | | | | | NEWMV | This mode is followed by an explicitly coded motion | | | vector (the format of which is described in the next | | | section) that is added (component-wise) to the | | | best_mv reference vector returned by find_near_mvs | | | and applied to all 16 subblocks. | +------------+------------------------------------------------------+ 16.4. Split Prediction The remaining mode (SPLITMV) causes multiple vectors to be applied to the Y subblocks. It is immediately followed by a partition specification that determines how many vectors will be specified and how they will be assigned to the subblocks. The possible partitions, with indicated subdivisions and coding tree, are as follows. ---- Begin code block -------------------------------------- typedef enum { mv_top_bottom, /* two pieces {0...7} and {8...15} */ mv_left_right, /* {0,1,4,5,8,9,12,13} and {2,3,6,7,10,11,14,15} */ mv_quarters, /* {0,1,4,5}, {2,3,6,7}, {8,9,12,13}, {10,11,14,15} */ MV_16, /* every subblock gets its own vector {0} ... {15} */ mv_num_partitions } MVpartition;

const tree_index mvpartition_tree [2 * (mvnum_partition - 1)] = { -MV_16, 2, /* MV_16 = "0" */ -mv_quarters, 4, /* mv_quarters = "10" */ -mv_top_bottom, -mv_left_right /* top_bottom = "110", left_right = "111" */ }; ---- End code block ---------------------------------------- The partition is decoded using a fixed, constant probability table: ---- Begin code block -------------------------------------- const Prob mvpartition_probs [mvnum_partition - 1] = { 110, 111, 150}; part = (MVpartition) treed_read(d, mvpartition_tree, mvpartition_probs); ---- End code block ---------------------------------------- After the partition come two (for mv_top_bottom or mv_left_right), four (for mv_quarters), or sixteen (for MV_16) subblock inter-prediction modes. These modes occur in the order indicated by the partition layouts (given as comments to the MVpartition enum) and are coded as follows. (As was done for the macroblock-level modes, we offset the mode enumeration so that a single variable may unambiguously hold either an intra- or inter-subblock mode.) Prior to decoding each subblock, a decoding tree context is chosen as illustrated in the code snippet below. The context is based on the immediate left and above subblock neighbors, and whether they are equal, are zero, or a combination of those. ---- Begin code block -------------------------------------- typedef enum { LEFT4x4 = num_intra_bmodes, /* use already-coded MV to my left */ ABOVE4x4, /* use already-coded MV above me */ ZERO4x4, /* use zero MV */ NEW4x4, /* explicit offset from "best" */ num_sub_mv_ref }; sub_mv_ref;

const tree_index sub_mv_ref_tree [2 * (num_sub_mv_ref - 1)] = { -LEFT4X4, 2, /* LEFT = "0" */ -ABOVE4X4, 4, /* ABOVE = "10" */ -ZERO4X4, -NEW4X4 /* ZERO = "110", NEW = "111" */ }; /* Choose correct decoding tree context * Function parameters are left subblock neighbor MV and above * subblock neighbor MV */ int vp8_mvCont(MV *l, MV*a) { int lez = (l->row == 0 && l->col == 0); /* left neighbor is zero */ int aez = (a->row == 0 && a->col == 0); /* above neighbor is zero */ int lea = (l->row == a->row && l->col == a->col); /* left neighbor equals above neighbor */ if (lea && lez) return SUBMVREF_LEFT_ABOVE_ZED; /* =4 */ if (lea) return SUBMVREF_LEFT_ABOVE_SAME; /* =3 */ if (aez) return SUBMVREF_ABOVE_ZED; /* =2 */ if (lez) return SUBMVREF_LEFT_ZED; /* =1*/ return SUBMVREF_NORMAL; /* =0 */ } /* Constant probabilities and decoding procedure. */ const Prob sub_mv_ref_prob [5][num_sub_mv_ref - 1] = { { 147,136,18 }, { 106,145,1 }, { 179,121,1 }, { 223,1 ,34 }, { 208,1 ,1 } }; sub_ref = (sub_mv_ref) treed_read(d, sub_mv_ref_tree, sub_mv_ref_prob[context]); ---- End code block ----------------------------------------

The first two sub-prediction modes simply copy the already-coded motion vectors used by the blocks above and to the left of the subblock at the upper left corner of the current subset (i.e., collection of subblocks being predicted). These prediction blocks need not lie in the current macroblock and, if the current subset lies at the top or left edges of the frame, need not lie in the frame. In this latter case, their motion vectors are taken to be zero, as are subblock motion vectors within an intra-predicted macroblock. Also, to ensure the correctness of prediction within this macroblock, all subblocks lying in an already-decoded subset of the current macroblock must have their motion vectors set. ZERO4x4 uses a zero motion vector and predicts the current subset using the corresponding subset from the prediction frame. NEW4x4 is exactly like NEWMV except that NEW4x4 is applied only to the current subset. It is followed by a two-dimensional motion vector offset (described in the next section) that is added to the best vector returned by the earlier call to find_near_mvs to form the motion vector in effect for the subset. Parsing of both inter-prediction modes and motion vectors (described next) can be found in the reference decoder file modemv.c (Section 20.11).