In Google Summer of Code 2007 I worked on mode decision and motion estimation on the Dirac reference codec with BBC Research and Development.

Project synposis

Motion estimation and mode decision incur a large proportion of the costs in compressing video. I propose a project to utilise more efficient algorithms for these features in the Dirac reference encoder. The focus will be on selection of an algorithm/algorithms which provide excellent performance whilst being conducive to lower-level optimization in the Schroedinger implementation of Dirac by use of the liboil library.

Deliverables

A library of algorithms for motion estimation and encoding mode decision, which perform well both in terms of speed and accuracy. The excellent performance of the Dirac reference codec will be maintained, whilst Schroedinger will use the same algorithms optimized using liboil and with threshold changes to perform as close to real-time as possible (at the expense of accuracy).

If the implementation performs well, given sufficient time I will look at quantization selection. I will produce documentation for the features, and if the optimizations fall under the scope of the specification I will write documentation for this as well. I will avoid algorithms which use patented techniques.

Project details

The project will first involve selection of a suitable algorithm (or algorithms) which try to determine the optimal choice of macroblock partitioning (whether one, four or sixteen sub-macroblocks) so that motion vectors steer appropriate sizes of sub-macroblocks in the image; if too few vectors are used the residual error will be significant, whilst if too many are used there will be redundancy in the abstract representation before entropy encoding. The coding mode decision also determines whether a macroblock should predict based on a single reference frame (and if so, which), two reference frames (for bi-directional, potentially weighted prediction) or whether the block should be skipped or encoded as an intra-block (in which case no partitioning is required).

These decisions normally have a large performance hit, because in the simple algorithm it is necessary to try each possibility and for complex codecs such as Dirac and H.264, which have various partitioning configurations, choices for reference material, and so on, there are a large number of possibilities. There are various algorithms which address the problem of making the mode decision quickly whilst maintaining coding efficiency.

In MPEG-2 and other, older codecs, there were fewer possibilities for mode decision to choose between, and so more work was concentrated on making motion estimation fast. There is therefore a large body of literature on this topic, investigating ways of making estimates of the most likely motion vectors for a block quickly. More recently, quarter-pixel motion estimation in H.264 has motivated more research in this area, as the number of search positions is greater for the same motion range.

RDO (Rate Distortion Optimization) is based around minimizing the number of bits required for encoding based on some constraints. The simple algorithm applies ME and mode decision separately, using an iterative approach to motion search (first searching integer pixel distances then refining estimates down to quarter-pixel accuracy).

To speed up this approach, motion estimation and mode decision can be combined, so that no time is wasted doing motion estimation if a particular encoding mode can be eliminated immediately. [1] describes one such algorithm. Mode decision is based around a series of tests which perform motion estimation and then calculate the SAE/SAD for the sub-macroblock. Whether a particular test is performed may be predicated on the results of other tests, so that at the end of mode decision possibly only a subset of all tests will have been done. From these, the best mode is chosen. For example, a check for monotonicity with three square block sizes is performed before checking a reduced set of non-square block sizes based on this. There is less of a performance hit than the simple algorithm because some tests are not performed at all. Various thresholds have to be selected, and must be chosen with care. It may also be possible to reduce the number of tests by taking into account decisions with neighbouring blocks.

Motion estimation must also be efficient to make the most of the algorithm. The fast algorithms for ME typically involve a iteratively updated search pattern based on neighbouring blocks, partitioning or fixed patterns (such as a square or diamond search pattern). An adaptive threshold decides whether early-stopping is a good strategy. EPZS (Enhance Predictive Zonal Search) is one such strategy. So, choosing a good early stopping threshold and set of search patterns are the most important tasks for fast ME algorithms. Various schemes are available.

Other possibilities include doing mode decisions and approximate motion estimation based on transformed data (so the cost of transforming to the spatial domain is removed) but this will likely be difficult in Dirac due to the use of Wavelets, though it may be worth investigating. (It has been shown to be useful in DCT-based codecs, however.) [2] takes this approach with H.264.

There is a large body of literature to investigate before I can make concrete decisions on which algorithms are the best ones to try with Dirac. However, the core of the project will involve setting up suitable infrastructure in Dirac to perform mode decision/motion estimation optimizations, implementing the algorithms in a library of functions which is suitable for use either in Schroedinger or the Dirac reference codec, and refining the best-performing algorithms so that they can be optimized with liboil. During the course of implementation, I will have to avoid algorithms which are patented or have patented elements.

References

Motion estimation results

Vector distributions

Robin Hood

RobinHood 3/4 RobinHood side

Mrs Beeton

MrsBeeton 3/4 MrsBeeton side

To try to capture movement as accurately as possible, I have used quite dense patterns so that the method of motion estimation does not affect the distribution too much; the idea is to compare the results from a much less computationally expensive motion estimation scheme with these results to see how much of the motion is captured. Method: using (0,0), temporal, median spatial, left, up and diagonally up and right predictors. A larger range pattern of squares is put around the two most successful predictors at 4, 8, 12, 16, 24, 32 and 64. These are refined using two diamond patterns around the best and second-best vectors.

There is a side view and a three-quarters view for each sequence. The vertical scale is logarithmic, showing that the (0, 0) predictor is selected (or refined to) very frequently, as expected. The x and y scales (on the ground plane) show the distance in pixels in x and y divided by 64.

Both sequences have skewed distributions, probably due to the fact that the sequences are relatively short. However, this could also be caused by the cumulative following effect of using a temporal predictor: this predictor is selected 15% of the time in RobinHood and 16% of the time in MrsBeeton. The temporal prediction may often also be propagated by other predictors (e.g. a 'left' spatial predictor following a block horizontally which used a temporal predictor). The distribution therefore contains vectors as far as 1500 pixels away, at the extremity of the search range; it seems as though the vectors are 'running away' towards the side of the picture when this many long-range patterns are used.

Discolouration

Encoded version

Original

Use of predictors

Mrs Beeton

Output

nb. the logging of vectors has takes a bit of time per-frame, and other programs were running, so the time per-frame measure is not indicative of the actual cost.

Total bits for sequence=272803784
Of these: 

218590768 were Y, 
11175248 were U, 
13696000 were V, and 
29245024 were motion vector data.The resulting bit-rate at 25Hz is 11944123 bits/sec.
Time per frame: 4.38989


Overall mean PSNR values
------------------------
Y: 39.56

U: 44.73

V: 45.82

Mean PSNR values by frame type and component
--------------------------------------------

                 ||       Y       ||       U       ||       V       ||
=================||===================================================
           Intra ||     40.74     ||     46.86     ||     48.33     ||    
-----------------||---------------------------------------------------
       Inter Ref ||     39.77     ||     44.65     ||     45.77     ||    
-----------------||---------------------------------------------------
   Inter Non Ref ||     38.98     ||     44.56     ||     45.54     ||     
-----------------||---------------------------------------------------

Predictors

Robin Hood

Output

Total bits for sequence=205319696
Of these: 

145131144 were Y, 
12757760 were U, 
13945488 were V, and 
33413424 were motion vector data.The resulting bit-rate at 25Hz is 12106114 bits/sec.
Time per frame: 4.28092


Overall mean PSNR values
------------------------
Y: 41.98

U: 43.32

V: 45.06

Mean PSNR values by frame type and component
--------------------------------------------

                 ||       Y       ||       U       ||       V       ||
=================||===================================================
           Intra ||      44.3     ||     46.87     ||     48.29     ||    
-----------------||---------------------------------------------------
       Inter Ref ||     42.23     ||     43.24     ||     45.01     ||    
-----------------||---------------------------------------------------
   Inter Non Ref ||     41.18     ||     42.92     ||     44.67     ||     
-----------------||---------------------------------------------------

Predictors

Vector distances distribution

These plots show the distribution of vector distances. The y-axis is (logarithm of) frequency and the x-axis gives the displacement from the stationarity vector (0, 0).

Distances

Simplified motion estimation

The distribution of vectors changes when using simpler, faster motion estimation. Here I use only two range-adaptive patterns; the two best predictors have patterns at 8 pixels away, and the best predictor has an additional pattern at 32 pixels away to capture large displacements. Two diamond refinement patterns are used.

Output

RobinHood

Total bits for sequence=196957920
Of these: 

139123248 were Y, 
12317400 were U, 
13281768 were V, and 
32163664 were motion vector data.The resulting bit-rate at 25Hz is 11613084 bits/sec.
Time per frame: 2.76901


Overall mean PSNR values
------------------------
Y: 41.88

U: 43.39

V: 45.09

Mean PSNR values by frame type and component
--------------------------------------------

                 ||       Y       ||       U       ||       V       ||
=================||===================================================
           Intra ||     43.53     ||     46.48     ||     47.96     ||    
-----------------||---------------------------------------------------
       Inter Ref ||     42.16     ||     43.31     ||     45.03     ||    
-----------------||---------------------------------------------------
   Inter Non Ref ||     41.07     ||        43     ||     44.68     ||     
-----------------||---------------------------------------------------

Using the same type of chart as above, the vector distribution is:

A side view comparing the two:

The distribution from the simpler ME is less skewed and slightly more concentrated around (0, 0). However, there are still points far away from the origin which may indicate that temporal tracking of movement is still occurring even with a smaller pattern.

MrsBeeton

Total bits for sequence=271461680
Of these: 

217044984 were Y, 
11111352 were U, 
13478760 were V, and 
29729848 were motion vector data.The resulting bit-rate at 25Hz is 11885362 bits/sec.
Time per frame: 2.92907


Overall mean PSNR values
------------------------
Y: 39.56

U: 44.75

V: 45.83

Mean PSNR values by frame type and component
--------------------------------------------

                 ||       Y       ||       U       ||       V       ||
=================||===================================================
           Intra ||     40.68     ||      46.8     ||     48.26     ||    
-----------------||---------------------------------------------------
       Inter Ref ||     39.78     ||     44.67     ||     45.78     ||    
-----------------||---------------------------------------------------
   Inter Non Ref ||     38.98     ||     44.58     ||     45.55     ||     
-----------------||---------------------------------------------------

Vector distributions:

A side view comparing the two:

The simpler ME is noticeably shallower to the left of the origin from this viewpoint, compared to the more thorough search above. The PSNR is almost as high as the PSNR from the fuller search, which might show that these vectors which the simpler method 'missed' were not reflecting real motion (perhaps matching noisy areas instead, for example).

Vector distances

The two motion estimation methods produce similar results for the RobinHood sequence, but for MrsBeeton, the simpler motion estimation algorithm chooses fewer vectors with large displacements (see where the red crosses are below the green x's). This could indicate better behaviour in the presence of noise; comparing with bit-rates and PSNR, the more thorough search had a higher bit-rate, and yet the simpler search beats it in overall PSNR per-component. (And though the Y components are the same in overall PSNR, the colour components are better in the simpler motion estimation scheme which might mean that more 'true'-motion is captured rather than chaotic motion for noise matching.)

Conclusions

Simplifying the motion estimation in terms of distances being checked has not improved quality by a large amount; even with a shorter search range, large displacements are still captured by the simpler motion estimation, as can be seen in the distribution charts. This may indicate that the predictors are working well.