2010-12-19

Android's Stagefright AAC Encoder or Reference Code by Any Other Name Would Smell as Sweet

Android and AAC

As readers of this blog know, AAC encoding holds a spot close to my heart. I'm responsible for getting the worst AAC encoder of all time included in FFmpeg and was never successful at fixing it.

The recent Android 2.3 "Gingerbread" release contains an Apache-licensed AAC encoder as part of Android's new Stagefright media library. I was excited that the free software community might finally have a good free AAC encoder. Unfortunately, the Stagefright AAC encoder seems to be little more than an optimized version of 3GPP's 26.411 fixed-point reference encoder.

Origins

Looking at the source trees, there is an immediate resemblance between the 3GPP code and the Stagefright code:

$ ls 26411-900/26411-900-ANSI-C_source_code/3GPP_enhanced_aacPlus_etsiopsrc_200907/ETSI_aacPlusenc/etsiop_fastaacenc/src
aac_ram.c       bitenc.c        interface.c          psy_main.c   stat_bits.h
aac_ram.h       bitenc.h        interface.h          psy_main.h   stprepro.c
aac_rom.c       block_switch.c  line_pe.c            qc_data.h    stprepro.h
aac_rom.h       block_switch.h  line_pe.h            qc_main.c    tns.c
aacenc.c        channel_map.c   ms_stereo.c          qc_main.h    tns.h
adj_thr.c       channel_map.h   ms_stereo.h          quantize.c   tns_func.h
adj_thr.h       dyn_bits.c      pre_echo_control.c   quantize.h   tns_param.c
adj_thr_data.h  dyn_bits.h      pre_echo_control.h   sf_estim.c   tns_param.h
band_nrg.c      fft.c           psy_configuration.c  sf_estim.h   transform.c
band_nrg.h      fft.h           psy_configuration.h  spreading.c  transform.h
bit_cnt.c       grp_data.c      psy_const.h          spreading.h
bit_cnt.h       grp_data.h      psy_data.h           stat_bits.c

$ ls base/media/libstagefright/codecs/aacenc/src/
aac_rom.c      bitbuffer.c     line_pe.c            quantize.c
aacenc.c       bitenc.c        memalign.c           sf_estim.c
aacenc_core.c  block_switch.c  ms_stereo.c          spreading.c
adj_thr.c      channel_map.c   pre_echo_control.c   stat_bits.c
asm/           dyn_bits.c      psy_configuration.c  tns.c
band_nrg.c     grp_data.c      psy_main.c           transform.c
bit_cnt.c      interface.c     qc_main.c

As you can see, almost all the files in the Stagefright aacenc are named identically to files in 26.411 v9.0.0. Still, both are fixed-point AAC encoders, so it is reasonable to expect similar names. Let's use Warren Toomey's Ctcompare tool to compare content.

./ctcompare -r 3gpp.ctf stagefright.ctf
5473  libstagefright/codecs/aacenc/src/aac_rom.c:1507-2262  ETSI_aacPlusenc/etsiop_fastaacenc/src/aac_rom.c:701-1459
2270  libstagefright/codecs/aacenc/src/aac_rom.c:1044-1338  ETSI_aacPlusenc/etsiop_fastaacenc/src/aac_rom.c:293-587
523  libstagefright/codecs/aacenc/basic_op/oper_32b.c:270-345  ETSI_aacPlusenc/etsiop_ffrlib/src/transcendent_enc.c:13-85
523  libstagefright/codecs/aacenc/basic_op/oper_32b.c:270-345  ETSI_aacPlusdec/etsiop_ffrlib/src/transcendent_enc.c:13-85
491  libstagefright/codecs/aacenc/src/bitenc.c:398-540  ETSI_aacPlusenc/etsiop_fastaacenc/src/bitenc.c:407-553
279  libstagefright/codecs/aacenc/src/aac_rom.c:1368-1403  ETSI_aacPlusenc/etsiop_fastaacenc/src/aac_rom.c:593-628
243  libstagefright/codecs/aacenc/inc/aac_rom.h:65-95  ETSI_aacPlusenc/etsiop_fastaacenc/src/aac_rom.h:38-69
218  libstagefright/codecs/aacenc/inc/bit_cnt.h:28-106  ETSI_aacPlusenc/etsiop_fastaacenc/src/bit_cnt.h:9-87
210  libstagefright/codecs/aacenc/src/aac_rom.c:2260-2347  ETSI_aacPlusenc/etsiop_fastaacenc/src/aac_rom.c:1572-1659
199  libstagefright/codecs/aacenc/basic_op/oper_32b.c:42-179  ETSI_aacPlusenc/etsioplib/oper_32b.c:45-182
199  libstagefright/codecs/aacenc/basic_op/oper_32b.c:42-179  ETSI_aacPlusdec/etsioplib/oper_32b.c:45-182
196  libstagefright/codecs/aacenc/src/sf_estim.c:831-881  ETSI_aacPlusenc/etsiop_fastaacenc/src/sf_estim.c:768-809
194  libstagefright/codecs/aacenc/inc/tns.h:31-108  ETSI_aacPlusenc/etsiop_fastaacenc/src/tns.h:12-89
190  libstagefright/codecs/aacenc/src/psy_main.c:424-451  ETSI_aacPlusenc/etsiop_fastaacenc/src/psy_main.c:387-414
165  libstagefright/codecs/aacenc/inc/qc_data.h:23-92  ETSI_aacPlusenc/etsiop_fastaacenc/src/qc_data.h:4-73
157  libstagefright/codecs/aacenc/inc/tns_func.h:31-75  ETSI_aacPlusenc/etsiop_fastaacenc/src/tns_func.h:10-54
157  libstagefright/codecs/aacenc/src/tns.c:64-96  ETSI_aacPlusenc/etsiop_fastaacenc/src/tns.c:34-65
156  libstagefright/codecs/aacenc/inc/dyn_bits.h:23-80  ETSI_aacPlusenc/etsiop_fastaacenc/src/dyn_bits.h:4-61
147  libstagefright/codecs/aacenc/src/dyn_bits.c:305-351  ETSI_aacPlusenc/etsiop_fastaacenc/src/dyn_bits.c:319-363
145  libstagefright/codecs/aacenc/src/psy_main.c:607-656  ETSI_aacPlusenc/etsiop_fastaacenc/src/psy_main.c:559-600
139  libstagefright/codecs/aacenc/src/psy_main.c:397-414  ETSI_aacPlusenc/etsiop_fastaacenc/src/psy_main.c:359-376
131  libstagefright/codecs/aacenc/src/psy_main.c:381-397  ETSI_aacPlusenc/etsiop_fastaacenc/src/psy_main.c:342-358
130  libstagefright/codecs/aacenc/src/aac_rom.c:1393-1401  ETSI_aacPlusdec/etsiop_aacdec/src/aac_rom.c:928-936
128  libstagefright/codecs/aacenc/src/block_switch.c:81-110  ETSI_aacPlusenc/etsiop_fastaacenc/src/block_switch.c:53-75
126  libstagefright/codecs/aacenc/src/psy_main.c:42-78  ETSI_aacPlusenc/etsiop_fastaacenc/src/psy_main.c:24-60
123  libstagefright/codecs/aacenc/src/stat_bits.c:28-55  ETSI_aacPlusenc/etsiop_fastaacenc/src/stat_bits.c:10-32
122  libstagefright/codecs/aacenc/inc/psy_const.h:28-76  ETSI_aacPlusenc/etsiop_fastaacenc/src/psy_const.h:9-55
119  libstagefright/codecs/aacenc/src/bitenc.c:639-664  ETSI_aacPlusenc/etsiop_fastaacenc/src/bitenc.c:632-657
119  libstagefright/codecs/aacenc/src/bitenc.c:337-396  ETSI_aacPlusenc/etsiop_fastaacenc/src/bitenc.c:339-403
118  libstagefright/codecs/aacenc/src/block_switch.c:33-77  ETSI_aacPlusenc/etsiop_fastaacenc/src/block_switch.c:14-49
113  libstagefright/codecs/aacenc/inc/psy_data.h:23-66  ETSI_aacPlusenc/etsiop_fastaacenc/src/psy_data.h:4-47
110  libstagefright/codecs/aacenc/src/bitenc.c:620-639  ETSI_aacPlusenc/etsiop_fastaacenc/src/bitenc.c:613-632
109  libstagefright/codecs/aacenc/src/psy_main.c:382-393  ETSI_aacPlusenc/etsiop_fastaacenc/src/psy_main.c:361-372
109  libstagefright/codecs/aacenc/src/psy_main.c:399-410  ETSI_aacPlusenc/etsiop_fastaacenc/src/psy_main.c:343-354
108  libstagefright/codecs/aacenc/inc/qc_data.h:93-136  ETSI_aacPlusenc/etsiop_fastaacenc/src/qc_data.h:73-116
105  libstagefright/codecs/aacenc/basic_op/typedef.h:23-63  ETSI_aacPlusdec/etsioplib/typedef.h:15-55
105  libstagefright/codecs/aacenc/basic_op/typedef.h:23-63  ETSI_aacPlusenc/etsioplib/typedef.h:15-55
104  libstagefright/codecs/aacenc/inc/adj_thr_data.h:30-69  ETSI_aacPlusenc/etsiop_fastaacenc/src/adj_thr_data.h:13-52
103  libstagefright/codecs/aacenc/src/aac_rom.c:1489-1491  ETSI_aacPlusdec/etsiop_aacdec/src/aac_rom.c:72-79
102  libstagefright/codecs/aacenc/inc/block_switch.h:30-62  ETSI_aacPlusenc/etsiop_fastaacenc/src/block_switch.h:12-44
...
The top four matches, in aac_rom.c and oper_32b.c, are all data tables. Since both are fixed-point AAC encoders, a lot of the data may be required by both, but I'd expect to see some tables wind up in a different order, or slightly modified to be more useful to a particular implementation. Instead the files have large sections of identical data and identical comments. In fact, the first match in both files opens with:
/*
  these tables are used only for counting and 
  are stored in packed format
*/

This is followed by a series of tables with identical names and formatting, down to the spaces. Ignoring these tables, bitenc.c, tns*, psy_main.c, and sf_estim.c all have huge portions of identical code and comments.

108  ../base/media/libstagefright/codecs/aacenc/inc/qc_data.h:93-136  ../26411-900/26411-900-ANSI-C_source_code/3GPP_enhanced_aacPlus_etsiopsrc_200907/ETSI_aacPlusenc/etsiop_fastaacenc/src/qc_data.h:73-116
  Word16          staticBitsUsed; /* for verification purposes */
  Word16          dynBitsUsed;    /* for verification purposes */
  Word16          pe;
  Word16          ancBitsUsed;
  Word16          fillBits;
} QC_OUT_ELEMENT;

typedef struct
{
  QC_OUT_CHANNEL  qcChannel[MAX_CHANNELS];
  QC_OUT_ELEMENT  qcElement;
  Word16          totStaticBitsUsed; /* for verification purposes */
  Word16          totDynBitsUsed;    /* for verification purposes */
  Word16          totAncBitsUsed;    /* for verification purposes */
  Word16          totFillBits;
  Word16          alignBits;
  Word16          bitResTot;
  Word16          averageBitsTot;
} QC_OUT;

typedef struct {
  Word32 chBitrate;
  Word16 averageBits;               /* brutto -> look ancillary.h */
  Word16 maxBits;
  Word16 bitResLevel;
  Word16 maxBitResBits;
  Word16 relativeBits;            /* Bits relative to total Bits scaled down by 2 */
} ELEMENT_BITS;

typedef struct
{
  /* this is basically struct QC_INIT */
  Word16 averageBitsTot;
  Word16 maxBitsTot;
  Word16 globStatBits;
  Word16 nChannels;
  Word16 bitResTot;

  Word16 maxBitFac;

  PADDING   padding;

  ELEMENT_BITS  elementBits;
  ADJ_THR_STATE adjThr;
=====================================
  Word16          staticBitsUsed; /* for verification purposes */
  Word16          dynBitsUsed;    /* for verification purposes */
  Word16          pe;
  Word16          ancBitsUsed;
  Word16          fillBits;
} QC_OUT_ELEMENT;

typedef struct
{
  QC_OUT_CHANNEL  qcChannel[MAX_CHANNELS];
  QC_OUT_ELEMENT  qcElement;
  Word16          totStaticBitsUsed; /* for verification purposes */
  Word16          totDynBitsUsed;    /* for verification purposes */
  Word16          totAncBitsUsed;    /* for verification purposes */
  Word16          totFillBits;
  Word16          alignBits;
  Word16          bitResTot;
  Word16          averageBitsTot;
} QC_OUT;

typedef struct {
  Word32 chBitrate;
  Word16 averageBits;               /* brutto -> look ancillary.h */
  Word16 maxBits;
  Word16 bitResLevel;
  Word16 maxBitResBits;
  Word16 relativeBits;            /* Bits relative to total Bits scaled down by 2 */
} ELEMENT_BITS;

typedef struct
{
  /* this is basically struct QC_INIT */
  Word16 averageBitsTot;
  Word16 maxBitsTot;
  Word16 globStatBits;
  Word16 nChannels;
  Word16 bitResTot;

  Word16 maxBitFac;

  PADDING   padding;

  ELEMENT_BITS  elementBits;
  ADJ_THR_STATE adjThr;

Right there we can see large runs of identical declarations, down to the typedef'd names and comments. As far as actual code goes, and not just declarations, consider these three functions from Stagefright's tns.c:

/**
*
* function name: TnsDetect
* description:  Calculate TNS filter and decide on TNS usage 
* returns:  0 if success
*
*/
Word32 TnsDetect(TNS_DATA* tnsData,        /*!< tns data structure (modified) */
                 TNS_CONFIG tC,            /*!< tns config structure */
                 Word32* pScratchTns,      /*!< pointer to scratch space */
                 const Word16 sfbOffset[], /*!< scalefactor size and table */
                 Word32* spectrum,         /*!< spectral data */
                 Word16 subBlockNumber,    /*!< subblock num */
                 Word16 blockType,         /*!< blocktype (long or short) */
                 Word32 * sfbEnergy)       /*!< sfb-wise energy */
{

  Word32  predictionGain;
  Word32  temp;
  Word32* pWork32 = &pScratchTns[subBlockNumber >> 8];
  Word16* pWeightedSpectrum = (Word16 *)&pScratchTns[subBlockNumber >> 8];

                                                                                                    
  if (tC.tnsActive) {
    CalcWeightedSpectrum(spectrum,
                         pWeightedSpectrum,
                         sfbEnergy,
                         sfbOffset,
                         tC.lpcStartLine,
                         tC.lpcStopLine,
                         tC.lpcStartBand,
                         tC.lpcStopBand,
                         pWork32);

    temp = blockType - SHORT_WINDOW;                                                          
    if ( temp != 0 ) {
        predictionGain = CalcTnsFilter( &pWeightedSpectrum[tC.lpcStartLine],
                                        tC.acfWindow,
                                        tC.lpcStopLine - tC.lpcStartLine,
                                        tC.maxOrder,
                                        tnsData->dataRaw.tnsLong.subBlockInfo.parcor);


        temp = predictionGain - tC.threshold;                                                  
        if ( temp > 0 ) {
          tnsData->dataRaw.tnsLong.subBlockInfo.tnsActive = 1;                                      
        }
        else {
          tnsData->dataRaw.tnsLong.subBlockInfo.tnsActive = 0;                                      
        }

        tnsData->dataRaw.tnsLong.subBlockInfo.predictionGain = predictionGain;                      
    }
    else{

        predictionGain = CalcTnsFilter( &pWeightedSpectrum[tC.lpcStartLine],
                                        tC.acfWindow,
                                        tC.lpcStopLine - tC.lpcStartLine,
                                        tC.maxOrder,
                                        tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].parcor);

        temp = predictionGain - tC.threshold;                                                 
        if ( temp > 0 ) {
          tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].tnsActive = 1;                     
        }
        else {
          tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].tnsActive = 0;                     
        }

        tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].predictionGain = predictionGain;     
    }

  }
  else{

    temp = blockType - SHORT_WINDOW;                                                          
    if ( temp != 0 ) {
        tnsData->dataRaw.tnsLong.subBlockInfo.tnsActive = 0;                                        
        tnsData->dataRaw.tnsLong.subBlockInfo.predictionGain = 0;                                   
    }
    else {
        tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].tnsActive = 0;                       
        tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].predictionGain = 0;                  
    }
  }

  return(0);
}


/*****************************************************************************
*
* function name: TnsSync
* description: update tns parameter
*
*****************************************************************************/
void TnsSync(TNS_DATA *tnsDataDest,
             const TNS_DATA *tnsDataSrc,
             const TNS_CONFIG tC,
             const Word16 subBlockNumber,
             const Word16 blockType)
{
   TNS_SUBBLOCK_INFO *sbInfoDest;
   const TNS_SUBBLOCK_INFO *sbInfoSrc;
   Word32 i, temp;

   temp =  blockType - SHORT_WINDOW;                                                           
   if ( temp != 0 ) {
      sbInfoDest = &tnsDataDest->dataRaw.tnsLong.subBlockInfo;                                      
      sbInfoSrc  = &tnsDataSrc->dataRaw.tnsLong.subBlockInfo;                                       
   }
   else {
      sbInfoDest = &tnsDataDest->dataRaw.tnsShort.subBlockInfo[subBlockNumber];                     
      sbInfoSrc  = &tnsDataSrc->dataRaw.tnsShort.subBlockInfo[subBlockNumber];                      
   }

   if (100*abs_s(sbInfoDest->predictionGain - sbInfoSrc->predictionGain) <
       (3 * sbInfoDest->predictionGain)) {
      sbInfoDest->tnsActive = sbInfoSrc->tnsActive;                                                 
      for ( i=0; i< tC.maxOrder; i++) {
        sbInfoDest->parcor[i] = sbInfoSrc->parcor[i];                                               
      }
   }
}

/*****************************************************************************
*
* function name: TnsEncode
* description: do TNS filtering
* returns:     0 if success
*
*****************************************************************************/
Word16 TnsEncode(TNS_INFO* tnsInfo,     /*!< tns info structure (modified) */
                 TNS_DATA* tnsData,     /*!< tns data structure (modified) */
                 Word16 numOfSfb,       /*!< number of scale factor bands */
                 TNS_CONFIG tC,         /*!< tns config structure */
                 Word16 lowPassLine,    /*!< lowpass line */
                 Word32* spectrum,      /*!< spectral data (modified) */
                 Word16 subBlockNumber, /*!< subblock num */
                 Word16 blockType)      /*!< blocktype (long or short) */
{
  Word32 i;
  Word32 temp_s;
  Word32 temp;
  TNS_SUBBLOCK_INFO *psubBlockInfo;

  temp_s = blockType - SHORT_WINDOW;                                                             
  if ( temp_s != 0) {                                                                               
    psubBlockInfo = &tnsData->dataRaw.tnsLong.subBlockInfo;
 if (psubBlockInfo->tnsActive == 0) {
      tnsInfo->tnsActive[subBlockNumber] = 0;                                                       
      return(0);
    }
    else {

      Parcor2Index(psubBlockInfo->parcor,
                   tnsInfo->coef,
                   tC.maxOrder,
                   tC.coefRes);

      Index2Parcor(tnsInfo->coef,
                   psubBlockInfo->parcor,
                   tC.maxOrder,
                   tC.coefRes);

      for (i=tC.maxOrder - 1; i>=0; i--)  {
        temp = psubBlockInfo->parcor[i] - TNS_PARCOR_THRESH;         
        if ( temp > 0 )
          break;
        temp = psubBlockInfo->parcor[i] + TNS_PARCOR_THRESH;         
        if ( temp < 0 )
          break;
      }
      tnsInfo->order[subBlockNumber] = i + 1;                                                    


      tnsInfo->tnsActive[subBlockNumber] = 1;                                                       
      for (i=subBlockNumber+1; i<TRANS_FAC; i++) {
        tnsInfo->tnsActive[i] = 0;
      }
      tnsInfo->coefRes[subBlockNumber] = tC.coefRes;                                                
      tnsInfo->length[subBlockNumber] = numOfSfb - tC.tnsStartBand;                                 


      AnalysisFilterLattice(&(spectrum[tC.tnsStartLine]),
                            (min(tC.tnsStopLine,lowPassLine) - tC.tnsStartLine),
                            psubBlockInfo->parcor,
                            tnsInfo->order[subBlockNumber],
                            &(spectrum[tC.tnsStartLine]));

    }
  }     /* if (blockType!=SHORT_WINDOW) */
  else /*short block*/ {                                                                            
    psubBlockInfo = &tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber];
 if (psubBlockInfo->tnsActive == 0) {
      tnsInfo->tnsActive[subBlockNumber] = 0;                                                       
      return(0);
    }
    else {

      Parcor2Index(psubBlockInfo->parcor,
                   &tnsInfo->coef[subBlockNumber*TNS_MAX_ORDER_SHORT],
                   tC.maxOrder,
                   tC.coefRes);

      Index2Parcor(&tnsInfo->coef[subBlockNumber*TNS_MAX_ORDER_SHORT],
                   psubBlockInfo->parcor,
                   tC.maxOrder,
                   tC.coefRes);
      for (i=(tC.maxOrder - 1); i>=0; i--)  {
        temp = psubBlockInfo->parcor[i] - TNS_PARCOR_THRESH;    
         if ( temp > 0 )
          break;

        temp = psubBlockInfo->parcor[i] + TNS_PARCOR_THRESH;    
        if ( temp < 0 )
          break;
      }
      tnsInfo->order[subBlockNumber] = i + 1;                                                    

      tnsInfo->tnsActive[subBlockNumber] = 1;                                                       
      tnsInfo->coefRes[subBlockNumber] = tC.coefRes;                                                
      tnsInfo->length[subBlockNumber] = numOfSfb - tC.tnsStartBand;                             


      AnalysisFilterLattice(&(spectrum[tC.tnsStartLine]), (tC.tnsStopLine - tC.tnsStartLine),
                 psubBlockInfo->parcor,
                 tnsInfo->order[subBlockNumber],
                 &(spectrum[tC.tnsStartLine]));

    }
  }

  return(0);
}

Here are the same three functions, in the same order, with nearly identical implementations in the 3GPP code:

/*!
  \brief   Calculate TNS filter and decide on TNS usage 
  \return  zero
*/
Word32 TnsDetect(TNS_DATA* tnsData,        /*!< tns data structure (modified) */
                 TNS_CONFIG tC,            /*!< tns config structure */
                 Word32* pScratchTns,      /*!< pointer to scratch space */
                 const Word16 sfbOffset[], /*!< scalefactor size and table */
                 Word32* spectrum,         /*!< spectral data */
                 Word16 subBlockNumber,    /*!< subblock num */
                 Word16 blockType,         /*!< blocktype (long or short) */
                 Word32 * sfbEnergy)       /*!< sfb-wise energy */
{

  Word16  predictionGain;
  Word16  temp;
  Word32* pWork32 = &pScratchTns[mult(subBlockNumber,FRAME_LEN_SHORT)];
  Word16* pWeightedSpectrum = (Word16 *)&pScratchTns[mult(subBlockNumber,FRAME_LEN_SHORT)];

                                                                                                   test();
  if (tC.tnsActive) {
    CalcWeightedSpectrum(spectrum,
                         pWeightedSpectrum,
                         sfbEnergy,
                         sfbOffset,
                         tC.lpcStartLine,
                         tC.lpcStopLine,
                         tC.lpcStartBand,
                         tC.lpcStopBand,
                         pWork32);

    temp = sub( blockType, SHORT_WINDOW );                                                         test();
    if ( temp != 0 ) {
        predictionGain = CalcTnsFilter( &pWeightedSpectrum[tC.lpcStartLine],
                                        tC.acfWindow,
                                        sub(tC.lpcStopLine,tC.lpcStartLine),
                                        tC.maxOrder,
                                        tnsData->dataRaw.tnsLong.subBlockInfo.parcor);


        temp = sub( predictionGain, tC.threshold );                                                test(); 
        if ( temp > 0 ) {
          tnsData->dataRaw.tnsLong.subBlockInfo.tnsActive = 1;                                     move16();
        }
        else {
          tnsData->dataRaw.tnsLong.subBlockInfo.tnsActive = 0;                                     move16();
        }

        tnsData->dataRaw.tnsLong.subBlockInfo.predictionGain = predictionGain;                     move16();
    }
    else{

        predictionGain = CalcTnsFilter( &pWeightedSpectrum[tC.lpcStartLine],
                                        tC.acfWindow,
                                        sub(tC.lpcStopLine, tC.lpcStartLine),
                                        tC.maxOrder,
                                        tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].parcor);

        temp = sub( predictionGain, tC.threshold );                                                test();
        if ( temp > 0 ) {
          tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].tnsActive = 1;                    move16();
        }
        else {
          tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].tnsActive = 0;                    move16();
        }

        tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].predictionGain = predictionGain;    move16();
    }

  }
  else{

    temp = sub( blockType, SHORT_WINDOW );                                                         test();
    if ( temp != 0 ) {
        tnsData->dataRaw.tnsLong.subBlockInfo.tnsActive = 0;                                       move16();
        tnsData->dataRaw.tnsLong.subBlockInfo.predictionGain = 0;                                  move16();
    }
    else {
        tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].tnsActive = 0;                      move16();
        tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].predictionGain = 0;                 move16();
    }
  }

  return(0);
}


/*****************************************************************************
    functionname: TnsSync
    description:

*****************************************************************************/
void TnsSync(TNS_DATA *tnsDataDest,
             const TNS_DATA *tnsDataSrc,
             const TNS_CONFIG tC,
             const Word16 subBlockNumber,
             const Word16 blockType)
{
   TNS_SUBBLOCK_INFO *sbInfoDest;
   const TNS_SUBBLOCK_INFO *sbInfoSrc;
   Word16 i, temp;

   temp = sub( blockType, SHORT_WINDOW );                                                          test();
   if ( temp != 0 ) {
      sbInfoDest = &tnsDataDest->dataRaw.tnsLong.subBlockInfo;                                     move32();
      sbInfoSrc  = &tnsDataSrc->dataRaw.tnsLong.subBlockInfo;                                      move32();
   }
   else {
      sbInfoDest = &tnsDataDest->dataRaw.tnsShort.subBlockInfo[subBlockNumber];                    move32();
      sbInfoSrc  = &tnsDataSrc->dataRaw.tnsShort.subBlockInfo[subBlockNumber];                     move32();
   }

   if (100*abs(sbInfoDest->predictionGain - sbInfoSrc->predictionGain) <
       (3 * sbInfoDest->predictionGain)) {
      sbInfoDest->tnsActive = sbInfoSrc->tnsActive;                                                move16();
      for ( i=0; i< tC.maxOrder; i++) {
        sbInfoDest->parcor[i] = sbInfoSrc->parcor[i];                                              move32();
      }
   }
}


/*!
  \brief    do TNS filtering
  \return   zero
*/
Word16 TnsEncode(TNS_INFO* tnsInfo,     /*!< tns info structure (modified) */
                 TNS_DATA* tnsData,     /*!< tns data structure (modified) */
                 Word16 numOfSfb,       /*!< number of scale factor bands */
                 TNS_CONFIG tC,         /*!< tns config structure */
                 Word16 lowPassLine,    /*!< lowpass line */
                 Word32* spectrum,      /*!< spectral data (modified) */
                 Word16 subBlockNumber, /*!< subblock num */
                 Word16 blockType)      /*!< blocktype (long or short) */
{
  Word16 i;
  Word16 temp_s;
  Word32 temp;

  temp_s = sub(blockType,SHORT_WINDOW);                                                            test();
  if ( temp_s != 0) {                                                                              test();
    if (tnsData->dataRaw.tnsLong.subBlockInfo.tnsActive == 0) {
      tnsInfo->tnsActive[subBlockNumber] = 0;                                                      move16();
      return(0);
    }
    else {

      Parcor2Index(tnsData->dataRaw.tnsLong.subBlockInfo.parcor,
                   tnsInfo->coef,
                   tC.maxOrder,
                   tC.coefRes);

      Index2Parcor(tnsInfo->coef,
                   tnsData->dataRaw.tnsLong.subBlockInfo.parcor,
                   tC.maxOrder,
                   tC.coefRes);

      for (i=sub(tC.maxOrder,1); i>=0; i--)  {
        temp = L_sub( tnsData->dataRaw.tnsLong.subBlockInfo.parcor[i], TNS_PARCOR_THRESH );        test();
        if ( temp > 0 )
          break;
        temp = L_add( tnsData->dataRaw.tnsLong.subBlockInfo.parcor[i], TNS_PARCOR_THRESH );        test();
        if ( temp < 0 )
          break;
      }
      tnsInfo->order[subBlockNumber] = add(i,1);                                                   move16();


      tnsInfo->tnsActive[subBlockNumber] = 1;                                                      move16();
      for (i=add(subBlockNumber,1); i<TRANS_FAC; i++) {
        tnsInfo->tnsActive[i] = 0;                                                                 move16();
      }
      tnsInfo->coefRes[subBlockNumber] = tC.coefRes;                                               move16();
      tnsInfo->length[subBlockNumber] = sub(numOfSfb,tC.tnsStartBand);                             move16();   


      AnalysisFilterLattice(&(spectrum[tC.tnsStartLine]),
                            sub(S_min(tC.tnsStopLine,lowPassLine),tC.tnsStartLine),
                            tnsData->dataRaw.tnsLong.subBlockInfo.parcor,
                            tnsInfo->order[subBlockNumber],
                            &(spectrum[tC.tnsStartLine]));

    }
  }     /* if (blockType!=SHORT_WINDOW) */
  else /*short block*/ {                                                                           test();
    if (tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].tnsActive == 0) {
      tnsInfo->tnsActive[subBlockNumber] = 0;                                                      move16();
      return(0);
    }
    else {

      Parcor2Index(tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].parcor,
                   &tnsInfo->coef[subBlockNumber*TNS_MAX_ORDER_SHORT],
                   tC.maxOrder,
                   tC.coefRes);

      Index2Parcor(&tnsInfo->coef[subBlockNumber*TNS_MAX_ORDER_SHORT],
                   tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].parcor,
                   tC.maxOrder,
                   tC.coefRes);
      for (i=sub(tC.maxOrder,1); i>=0; i--)  {
        temp = L_sub( tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].parcor[i], TNS_PARCOR_THRESH );   test();
         if ( temp > 0 )
          break;

        temp = L_add( tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].parcor[i], TNS_PARCOR_THRESH );   test();
        if ( temp < 0 )
          break;
      }
      tnsInfo->order[subBlockNumber] = add(i,1);                                                   move16();

      tnsInfo->tnsActive[subBlockNumber] = 1;                                                      move16();
      tnsInfo->coefRes[subBlockNumber] = tC.coefRes;                                               move16();
      tnsInfo->length[subBlockNumber] = sub(numOfSfb, tC.tnsStartBand);                            move16();


      AnalysisFilterLattice(&(spectrum[tC.tnsStartLine]), sub(tC.tnsStopLine,tC.tnsStartLine),
                 tnsData->dataRaw.tnsShort.subBlockInfo[subBlockNumber].parcor,
                 tnsInfo->order[subBlockNumber],
                 &(spectrum[tC.tnsStartLine]));

    }
  }

  return(0);
}

I think at this point you'd have to be a crazy person to believe that the Stagefright AAC encoder is an independent implementation rather than a derivative of 26.411.

Licensing

Nowhere do VisualOn and Android acknowledge that their encoder is derived from the 3GPP reference encoder. The only copyright header that appears on the Stagefright encoder's files is:

/*
 ** Copyright 2003-2010, VisualOn, Inc.
 **
 ** Licensed under the Apache License, Version 2.0 (the "License");
 ** you may not use this file except in compliance with the License.
 ** You may obtain a copy of the License at
 **
 **     http://www.apache.org/licenses/LICENSE-2.0
 **
 ** Unless required by applicable law or agreed to in writing, software
 ** distributed under the License is distributed on an "AS IS" BASIS,
 ** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 ** See the License for the specific language governing permissions and
 ** limitations under the License.
 */

The licensing of the 3GPP encoder is somewhat ambiguous on its own. There is no mention of copyright in the 3GPP bundle except for this notice on the cover of the documentation:

Copyright Notification
No part may be reproduced except as authorized by written permission.
The copyright and the foregoing restriction extend to reproduction in all media

and this function in the decoder:
static void display_copyright_message(void)
{
  fprintf(stderr,"\n");
  fprintf(stderr,"*************************************************************\n");
  fprintf(stderr,"* Enhanced aacPlus 3GPP ETSI-op Reference Decoder\n");
  fprintf(stderr,"* Build %s, %s\n", __DATE__, __TIME__);
  fprintf(stderr,"*\n");
  fprintf(stderr,"*************************************************************\n\n");
}

It seems pretty clear that 3GPP intends for their reference code to be used, but the terms of such use are unknown. The 3GPP code was provided by a company called Coding Technologies, which has since been acquired by Dolby and is now called Dolby International. Dolby isn't the most open-source-friendly company out there.

Some 3GPP source files bear similarities to MPEG reference source files, which carry the notice:

/**********************************************************************
 
SC 29 Software Copyright Licencing Disclaimer:

This software module was originally developed by


and edited by


in the course of development of the ISO/IEC 13818-7 and ISO/IEC 14496-3 
standards for reference purposes and its performance may not have been 
optimized. This software module is an implementation of one or more tools as 
specified by the ISO/IEC 13818-7 and ISO/IEC 14496-3 standards.
ISO/IEC gives users free license to this software module or modifications 
thereof for use in products claiming conformance to audiovisual and 
image-coding related ITU Recommendations and/or ISO/IEC International 
Standards. ISO/IEC gives users the same free license to this software module or 
modifications thereof for research purposes and further ISO/IEC standardisation.
Those intending to use this software module in products are advised that its 
use may infringe existing patents. ISO/IEC have no liability for use of this 
software module or modifications thereof. Copyright is not released for 
products that do not conform to audiovisual and image-coding related ITU 
Recommendations and/or ISO/IEC International Standards.
The original developer retains full right to modify and use the code for its 
own purpose, assign or donate the code to a third party and to inhibit third 
parties from using the code for products that do not conform to audiovisual and 
image-coding related ITU Recommendations and/or ISO/IEC International Standards.
This copyright notice must be included in all copies or derivative works.
Copyright (c) ISO/IEC 1997.
**********************************************************************/

"Copyright is not released for products that do not conform to audiovisual and image-coding related ITU Recommendations and/or ISO/IEC International Standards" is generally viewed as the problematic clause in this license. This clause was problematic for LAME before they rewrote the last of the dist10 reference code, and it has made FAAC undistributable. To put this code under an Apache license you would need to track down the copyright holders listed at the top and ask them to relicense. However, Dolby can clearly relicense the code it owns and the former CT code without asking anyone.

Community

From a community standpoint, my mind was initially boggled as to why the documentation is Proprietary & Confidential. Line endings are mixed and there is plenty of trailing whitespace in the source. This sort of thing wouldn't fly in many large open source projects. Then I realized they simply bought an encoder from one of the many companies selling optimized multimedia reference code. It was probably cheaper than having employees write their own or port the reference code themselves. Still, the license on their documentation doesn't give me much faith in their attention to detail on such matters.

Maybe a community effort can build on top of this like opencore-amr did with the AMR code from earlier Android releases, and we can finally have a decent OSS AAC encoder, but I'm not going to hold my breath. Punting code over a wall usually isn't a good strategy for building community.

2010-11-29

In Defense of Lossy+Lossless Hybrid Audio Codecs

There have been many complaints leveled against lossy+lossless hybrid audio codecs lately. A lossy+lossless codec is one that codes a lossy layer and has an optional enhancement layer to bring the quality up to lossless. Some examples of hybrid lossy+lossless audio formats are HD-AAC (MPEG-4 SLS), mp3HD, and WavPack. WavPack and MPEG-4 SLS can also run in pure lossless mode.

Lossy+lossless audio codecs do have many shortcomings. They require a canonical bit-exact version of the lossy decoder. This doesn't align well with modern popular pure lossy codecs. They (generally, WavPack does not) require a completely different coding scheme for the lossless enhancement layer that has similar complexity to a pure lossless coder on its own. They need to keep the lossless correction layers synchronized to the lossy file either on the file system or with special tools to attach and separate them. The sum filesize of both the lossy and lossless content is generally larger than what you would get using a pure lossless coder. All of these things certainly serve as barriers to adoption and good reasons to avoid lossy+lossless codecs should you not need their particular benefits.

The biggest benefit of lossy+lossless is that if you need to have a lossy copy as well as a lossless copy the sum space consumed will be less than having a lossy copy and a pure lossless copy (assuming you are using a good lossy+lossless codec).

The reason you might want a lossy copy rather than transcoding on the fly is for syncing a portable device where the whole library doesn't fit. In that case the user may swap music in and out on a regular basis. Transcoding 72 hours of music well (72 hours × 128 kbps ≈ 4.1 GB) is a slow process, even on a fast machine.

I do recognize that this is a niche use case and that lossy+lossless adds a lot of complexity to the system. As storage on portables keeps getting larger and larger this advantage seems less and less important.

I think if anybody could have popularized such a codec it would have been Apple during the heyday of the iPod shuffle. iPods use a database with a syncing tool (iTunes) rather than USB mass storage making the single file solution an obvious choice. Remuxing on the fly is much faster than transcoding on the fly. But seeing as they haven't adopted lossy+lossless yet, I doubt they will bother with it at all.

2010-09-20

AAC Channel Model Revisited

Recently I wrote about the AAC channel model. Since then I tested a variety of real world broken files (synthetically recreated) with the reference decoder, faad, WinAMP, iTunes, and the Microsoft Windows 7 Decoder. I've placed the results in the multimedia wiki. I can't seem to get the CSS right to display it here.

Looking at the results of the other decoders I managed to fix bad_concat while removing special hacks for streams like elem_id0 with a new approach in FFmpeg. FFmpeg now treats streams with PCEs according to the letter of the channel configuration in the PCE. For streams lacking a PCE, element instance tags are now ignored and channels are treated purely positionally.

One other thing worth noting is that I couldn't make iTunes produce Parametric Stereo effects on any of the seven signalling streams from CT. Supposedly iTunes has supported PS since version 9.2. Perhaps it is due to the use of pure upsampling SBR.

AAC Bitstream Flaws Part 2: AAC-960, Zero Sized Sections, and ADIF

AAC-960

AAC has a 1024 MDCT samples per frame variant and a 960 MDCT samples per frame variant. You are probably using the 1024 samples per frame variant. Most applications including FFmpeg, WinAMP, and iTunes don't support AAC-960. However until very recently the spec required the AAC/HE-AAC/HE-AACv2 profiles to support both 960 and 1024 variants.

Hilariously nobody seemed to notice this issue until April 2008 when MPEG working document M15437: "proposed clarification for ISO/IEC 14496-4" was published.

Dolby explains the problem much better than I can in MPEG working document M15641: "further considerations regarding the 960 transform length in the AAC profile, the HE AAC profile and the HE AAC v2 profile":

The MPEG-4 AAC-LC AOT can operate in different flavours, either with a frame length of 1024 samples or with a frame length of 960 samples. The AAC profile, the High Efficiency AAC profile, and the High Efficiency AAC v2 profile do not impose any restriction on the frame length that is used.

Typically application standards which build on one of these profiles use only one frame length (either 1024 or 960) and do not require support for both. An incomplete list of application standards that build on these profiles can be found in the Annex.

Implementers of decoders that comply with these profiles were not able to test their implementations for conformance until early this year when conformance sequences for the AAC AOT with a frame length of 960 samples have been made available. Such test sequences are still missing for the SBR AOT and the PS AOT.

There are popular players around that do not support the 960 frame length. Examples of these would be winamp, iTunes or the iPod.

The code from 3GPP is a very popular reference implementation that has been used many times as a starting point for a port to embedded devices. This code does not support the 960 frame length. As a result embedded players based on this code lack support for the 960 frame length.

One important part of the MPEG-4 reference software (mp4mcdec) does not support the 960 frame length. Instead it disregards the frameLengthFlag and consequently plays back at wrong speed.

A profile defines a subset of features that gives content producers the certainty that their content will play on all devices that support the profile. On the other hand there is a high amount of devices that do not comply with the profile by not supporting the 960 frame length, which makes using the 960 frame length a bad choice for creating content.

The solution chosen was to partition the AAC/HE-AAC/HE-AACv2 profiles into regular (1024) and 960 variants. This was clearly the best solution. Still the AAC-960 profile doesn't strike me as particularly useful. It makes short windows only 16 samples (2 ms) shorter. It makes frames 128 samples shorter (16 ms at 8 kHz), however there are Low Delay AAC variants that attack this problem in a more comprehensive manner.

Zero Sized Sections

AAC scalefactor bands are sectioned into groups of adjacent bands that use the same Huffman codebook. The way these sections are written into the bitstream is that a 4-bit field specifying the literal codebook number is written, followed by the section length (3 bits for short windows and 5 bits for long windows, with an all-bits-set escape mechanism for longer sections). This is repeated until all the scalefactor bands up to the maximum scalefactor band have codebooks assigned. Sections with a size of zero are allowed. In theory you can have as many zero sized sections as you want (until you hit the frame size cap in bits). Now combine that with an all zero buffer (like a get_bits() implementation that returns zeros when the input buffer is exhausted) and you have a recipe for an infinite loop (thanks to the Google Chrome team for finding this).

Now this isn't the end of the world; a buffer exhaustion check can be added to the sectioning loop to fix it. But these zero sized sections serve no constructive purpose. Zero sized sections could be forbidden, or better yet the size minus one could be coded, or the zero value could be used for the escape mechanism.
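To make the failure mode concrete, here is a minimal sketch of a long-window sectioning loop with the exhaustion check added. This is not FFmpeg's actual code; the bit reader and the function names are made up for illustration.

```c
#include <assert.h>

/* Hypothetical bit reader that, like the one described above, returns
 * zeros once the input buffer is exhausted. */
typedef struct {
    const unsigned char *buf;
    int size_bits;   /* total bits available */
    int pos_bits;    /* bits consumed so far */
} BitReader;

static int get_bits(BitReader *br, int n)
{
    int v = 0;
    while (n--) {
        int bit = 0;
        if (br->pos_bits < br->size_bits)
            bit = (br->buf[br->pos_bits >> 3] >> (7 - (br->pos_bits & 7))) & 1;
        br->pos_bits++;  /* past the end this reader synthesizes zeros */
        v = (v << 1) | bit;
    }
    return v;
}

/* Parse section data for a long window: a 4-bit codebook, then a 5-bit
 * length (3 bits for short windows) with an all-ones escape.  Returns 0 on
 * success, -1 on a truncated stream.  Without the pos_bits check, a
 * zero-filled tail yields endless zero-sized sections and the outer loop
 * never terminates. */
static int parse_sections(BitReader *br, int max_sfb, int codebooks[])
{
    int band = 0;
    while (band < max_sfb) {
        int cb  = get_bits(br, 4);
        int len = 0, part;
        do {
            if (br->pos_bits >= br->size_bits)
                return -1;               /* buffer exhausted: bail out */
            part = get_bits(br, 5);
            len += part;
        } while (part == 31);            /* all-ones escape continues */
        if (band + len > max_sfb)
            return -1;
        while (len--)
            codebooks[band++] = cb;
        /* len == 0 sections are legal and simply loop again */
    }
    return 0;
}
```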

ADIF

While not technically AAC itself, this encapsulation format appears in the AAC specification as an annex and is AAC specific. It consists of a single header up front followed by raw concatenated variable length AAC frames with no additional framing information. People seem to like to use ADIF when it is not an appropriate choice. It has no error resilience and no seekability due to the way it is framed. Worse yet, the ADIF global header is not compatible with the MPEG-4 global header (Audio Specific Config) when the MPEG-4 global header could have been used here. This would be a great way to test Audio Specific Config parsing without a full MPEG-4 demuxer. (I invented my own ADIF-like format for this very purpose.) At least the reason for the different header format is probably a legacy MPEG-2 issue.

2010-08-31

Why you don't want to build your Chromium packages against a system copy of FFmpeg

Why you don't want to build your Chromium packages against a system copy of FFmpeg:

  1. Chromium's internal FFmpeg copy is based on FFmpeg-mt. FFmpeg-mt is an experimental multithreaded branch of FFmpeg. It hasn't been deemed ready to merge into FFmpeg proper. Chromium only uses a subset of FFmpeg functionality and is thus less likely to experience regressions from using FFmpeg-mt. FFmpeg-mt is however much faster than vanilla FFmpeg on multicore systems.
  2. Chromium's FFmpeg is heavily patched. Some of the patches allow for a checked get_bits() implementation. Better input buffer checking without the overhead of a checked get_bits() will probably require a major bump in libavcodec. Some simply remove code that happens to be dead in Chromium's FFmpeg but is useful in general. Some seem to have gotten stuck in patch hell.
  3. Chromium makes use of new FFmpeg features very quickly. Chromium first made use of av_register_protocol2() just one month after it was added to FFmpeg. If you are using system FFmpeg with Chromium either video playback breaks or the browser update is blocked by a system wide FFmpeg update.

If you are concerned about shipping FFmpeg, you can build Chromium against its internal libffmpegsumo and just not ship libffmpegsumo. Using the version number it is easy for a third party to reconstruct a libffmpegsumo that matches your Chromium build.

2010-08-25

AAC Bitstream Flaws Part 1: The Channel Model

This post is the first in a series about things that I consider to be flaws in the AAC bitstream format itself.

The biggest problem with the AAC bitstream format is the channel model. The whole AAC channel model is fucked.


AAC has 8 types of channel elements (0 = single channel element (SCE), 1 = channel pair element (CPE), 2 = channel coupling element (CCE), 3 = low frequency element (LFE), 4 = data stream element (DSE), 5 = program config element (PCE), 6 = fill element (FIL), 7 = END). Each channel element instance has an associated element instance tag (elem_id); the only exception is END. This instance tag is usually, but not always, used to group instances that are considered to belong together, e.g. a SCE that represents the same channel would have the same element instance tag in each frame in which it appears.
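As a sketch of what the top-level element loop looks like, here is a hypothetical helper (names and the tiny bit reader are made up; a real decoder dispatches to per-type payload parsers rather than just counting headers):

```c
#include <assert.h>

enum { ID_SCE = 0, ID_CPE, ID_CCE, ID_LFE, ID_DSE, ID_PCE, ID_FIL, ID_END };

/* Tiny MSB-first bit reader for the sketch (hypothetical, not FFmpeg's). */
typedef struct { const unsigned char *buf; int pos; } Bits;

static int get_bits(Bits *b, int n)
{
    int v = 0;
    while (n--) {
        v = (v << 1) | ((b->buf[b->pos >> 3] >> (7 - (b->pos & 7))) & 1);
        b->pos++;
    }
    return v;
}

/* The raw_data_block() framing: a 3-bit element type, then (for every type
 * except END) a 4-bit element instance tag -- which for FIL elements
 * doubles as a size so unknown extensions can be skipped. */
static int count_elements(Bits *b)
{
    int n = 0;
    for (;;) {
        int type = get_bits(b, 3);
        if (type == ID_END)
            return n;
        get_bits(b, 4);  /* elem_id (a length count for ID_FIL) */
        n++;
        /* a real decoder would parse the element payload here */
    }
}
```

Since this sketch skips payloads, it only "parses" streams of bare element headers; it is meant to show the framing, not to decode anything.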
There are two sorts of AAC channel configurations: indexed and PCE-based. Every AAC stream has a 4-bit parameter called channelConfiguration. The indexed configurations are defined in the "Channel Configuration" table in subpart 1 of 14496-3.

value  # of chans  syntax elements, in order received  channel to speaker mapping
0      -           -                                   defined in AOTSpecificConfig
1      1           SCE                                 center front speaker
2      2           CPE                                 left, right front speakers
3      3           SCE, CPE                            center front speaker; left, right front speakers
4      4           SCE, CPE, SCE                       center front speaker; left, right front speakers; rear surround speaker
5      5           SCE, CPE, CPE                       center front speaker; left, right front speakers; left surround, right surround rear speakers
6      5.1         SCE, CPE, CPE, LFE                  center front speaker; left, right front speakers; left surround, right surround rear speakers; front low frequency effects speaker
7      7.1         SCE, CPE, CPE, CPE, LFE             center front speaker; left, right center front speakers; left, right outside front speakers; left surround, right surround rear speakers; front low frequency effects speaker
8-15   -           -                                   reserved

It seems pretty sane at first, though it is a little tricky that the channel count is identical to the index until you get to configuration 7. But what if you need something not on the list, like 7 channels, true dual mono, or more than 8 channels? Then channelConfiguration gets set to 0 and an AOTSpecificConfig is used. For non-ER AAC variants (like AAC-LC/HE-AAC/HE-AACv2) this probably means a PCE is to follow.
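The indexed configurations fit in a small static table. This C sketch (a hypothetical struct, not taken from any real decoder) also makes the count-vs-index quirk easy to check:

```c
#include <assert.h>

/* Element types needed for the table sketch. */
typedef enum { SCE, CPE, LFE } ElemType;

/* Hypothetical sketch: the element sequence implied by each indexed
 * channelConfiguration value 1..7, per the table above.  nb_channels
 * counts the LFE in ".1" layouts as one channel. */
static const struct {
    int nb_channels;
    int nb_elems;
    ElemType elems[5];
} chan_config[8] = {
    [1] = { 1, 1, { SCE } },
    [2] = { 2, 1, { CPE } },
    [3] = { 3, 2, { SCE, CPE } },
    [4] = { 4, 3, { SCE, CPE, SCE } },
    [5] = { 5, 3, { SCE, CPE, CPE } },
    [6] = { 6, 4, { SCE, CPE, CPE, LFE } },      /* 5.1 */
    [7] = { 8, 5, { SCE, CPE, CPE, CPE, LFE } }, /* 7.1: index != count */
};
```

Note that index 7 is the first entry whose channel count (8) no longer matches its index, which is exactly the trickiness mentioned above.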

I say probably because if it is MPEG-2 AAC, the decoder can implicitly figure out a channel mapping from the syntax elements present and needs no PCE. An MPEG-4 AAC decoder is forbidden from doing so. If you remux such a file from an MPEG-2 ADTS stream to a .mp4 file, you have probably screwed everything up. I say probably because MPEG-2 objectTypeIndication 0x67 can be used to indicate MPEG-2 AAC in MPEG-4. However, in practice MPEG-2 AAC usually gets remuxed to objectTypeIndication 0x40 AAC, since MPEG-4 AAC is widely considered a superset of MPEG-2 even though we have just demonstrated it is not a strict superset.

The first field in a PCE is an element instance tag. Yes, you may have up to 16 independent PCEs. What does that mean? The spec doesn't say how to interpret such a setup, just that it's legal, and it doesn't place prohibitions on such a thing in any of the useful profiles. But for the sake of argument let's assume we only have one PCE.

The next field is a two bit object_type that is equal to the AOT-1 for MPEG-2 compatibility. Now let's not forget that the very existence of a PCE is AOT specific, so we are just sending duplicate data at this point. What if this object_type doesn't match the outer AOT? The spec doesn't actually address that, but the experts say such a thing is forbidden. Despite being forbidden, the official systems conformance streams don't have PCE object_types that match the outer AOTs. Assuming the object_types line up nicely, we get to the sampling_frequency_index. The same problems apply.

Finally the PCE coarsely groups the logical output stream into front, side, back, and LFE channels. For each non-LFE it tells you whether the element is a SCE or CPE along with its element instance tag. All LFEs are represented with the LFE syntax element, so they just have LFEs enumerated. You are then on your own for mapping this mess into a speaker configuration. If you have 22.2 channels or fewer you can use the "informative" ISO 13818-7 Annex H. This Annex is not reprinted in the MPEG-4 edition.

In addition to listing these output channels, the PCE also enumerates DSEs which hold ancillary data streams, CCEs which hold coupling elements used in the decoding of the output channels, and mixdowns. Here is what the spec says about mixdowns: "The matrix-mixdown provision enables a mode of operation which may be beneficial in some circumstances. However, it is advised that this method should not be used."

At the end of the PCE is a comment field containing a Pascal string that describes the PCE. Before the comment a byte align is required. This makes it much more difficult to move the PCE around. In an MP4 file the PCE lives in the global header and is byte aligned relative to the start of the global header. In an ADTS file the PCE is in the actual frame payload and must be byte aligned relative to the start of the frame.
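The alignment headache boils down to one line of arithmetic. In this hypothetical helper, `anchor_bits` is the bit position the container defines as offset zero (the start of the frame for ADTS, the start of the global header for MP4), which is why moving a PCE between containers changes where the padding falls:

```c
#include <assert.h>

/* Bits to skip so a bit reader at pos_bits becomes byte aligned relative
 * to anchor_bits.  Hypothetical helper for illustration. */
static int align_bits(int pos_bits, int anchor_bits)
{
    return (8 - ((pos_bits - anchor_bits) & 7)) & 7;
}
```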

So now that we have our duplicate codec parameters, a nest of output channels, a list of coupling channels, ancillary data streams, mixdowns that we shouldn't use, and a byte aligned comment, we are done with the PCE. What happens if we get another PCE? Well... "An MPEG-4 decoder is always required to parse any program_config_element() inside the AAC payload. However, the decoder is only required to evaluate it, if no channel configuration is given outside the AAC payload." The decoder has to parse it but may or may not evaluate it. Wonderful.

The observant reader will notice that nowhere in the mess of syntax elements that we pulled out of the PCE was their order mentioned. The elements can arrive in any arbitrary order. The order does not even have to be consistent from frame to frame. One of the official test vectors (al17_*) is a dual-mono stream that alternates between {SCE.0}{SCE.15}{END} and {SCE.15}{SCE.0}{END}. This is very flexible, but I have been unable to figure out how this flexibility can ever be used in a beneficial manner. CCEs sometimes come before the channels they modify; they sometimes come afterward. In low memory situations it is useful for them to come before, but the decoder can't depend on that due to this flexibility.

To me it seems likely that at one time people may have wanted to do AAC domain mixing. The spec mentions: "programs could share a channel_pair_element() and use distinct coupling channels for voice over in different languages" [ISO/IEC 13818-7:2004(E) 8.5.2.2 Explicit channel mapping using a program_config_element()]. This justifies things like multiple PCEs but still doesn't justify this completely arbitrary and dynamic channel ordering.

Just when you think this crazy ordering is a pain in the ass, but at least it is super flexible, along comes SBR. SBR data is contained in FIL elements, and those must directly follow the elements they modify. On a FIL element the element instance tag is actually a size, so that the element can be skipped by decoders that don't support such extensions.

Because syntax element order implies speaker order in the non-PCE-based channel configurations 1-7, decoders that don't support PCEs sometimes completely ignore element instance tags. Then simple encoders started assigning zeros to all element instances, even instances of the same syntax element in the same frame, e.g. there are 5.1 streams floating around that have a frame structure of {SCE.0}{CPE.0}{CPE.0}{LFE.0}{END}. In addition, ADTS files are widely considered to be concatenatable if they contain the same channel count. This creates problems when the streams have the same channel order but different element instance tags on the channels, e.g. one stereo stream may use 0 as the element instance tag for its CPE while the next stream uses 15. Concatenating them makes the stream change element instance tags midway through. This could be solved by requiring the instances of each syntax element to count up from zero, e.g. requiring 5.1 to use {SCE.0}{CPE.0}{CPE.1}{LFE.0}. The early authors of the FFmpeg AAC decoder thought this was the case; sadly it is not.

The good news is that for complex multichannel files MPEG-D MPEG Surround might be a better choice. MPEG Surround can use AAC as its core coder. The problem is that MPEG Surround doesn't have anywhere near the AAC install base.

Next time: zero sized sections, AAC-960, and ADIF.

2010-08-05

Announcing AACX

AACX or AAC eXaminer is an AAC frame analysis tool inspired by LAME's mp3x. The program is still in its very early stages and is not yet complete. It requires a PCM decoded copy of the stream and a custom format log file. A patch to modify FFmpeg into generating the log file is included with the program.

AACX does not yet support stereo or surround streams. The interface is still very ugly. The coding style is also pretty ugly. Still the program has passed the threshold where it may be useful to others so I'm making the source public under the GPLv2 (or at your option any later version).

Happy Hacking!

2010-07-28

StarCraft 2 Cutscenes

As has been well established, the new international gaming sensation StarCraft 2 uses the Theora codec to compress its prerendered video content. I've selected a few stills from one of the cinematics to look at the quality of their result. Maybe they just didn't use a high enough bitrate, but these stills look subpar to me (though much better than the Smacker cutscenes from StarCraft 1).
Input #0, ogg, from 'cinematic_thedream.ogv':
  Duration: 00:02:45.70, start: 0.000000, bitrate: 3984 kb/s
    Stream #0.0: Data: skeleton
    Stream #0.1: Video: theora, yuv420p, 1280x720, 24 fps, 24 tbr, 24 tbn, 24 tbc
    Stream #0.2: Audio: vorbis, 44100 Hz, stereo, s16, 160 kb/s
    Metadata:
      ENCODER         : ffmpeg2theora-0.24
The thumbnails link to full size stills.

Frame 484
This looks very blocky to me.

Frame 548
There seems to be ringing around Kerrigan's body, especially her legs.

Frame 1072
More blocking.

I'm not saying that these cutscenes are necessarily representative of Theora's top quality. I merely think we should take the quality of the result into consideration when scoring this as a victory for Theora. Perhaps the cutscenes should have been encoded at a higher quality at the expense of releasing on Blu-ray or multiple DVDs, or some more content should have been pushed off the disc onto the release day patch. If they were absolutely stuck with this amount of space for cutscenes, I would have gladly paid a few extra cents for H.264 cutscenes.

2010-07-11

AAC Verification

For a long time I've been using my ugly homespun aac-conf-tools to verify FFmpeg's decoder against the MPEG reference decoder over the ISO test vectors. This approach has one huge problem: it requires the non-free, unportable, and hard-to-build ISO reference software.

Luckily Mans Rullgard has come to my rescue and added off-by-one testing to FATE. This allows us to compare FFmpeg's output to predecoded streams. While migrating to this method it seemed worthwhile to use the output streams provided by ISO rather than decode ideal output on my system with FFmpeg or the reference decoder. In particular I don't trust the sloppy reference code on a modern compiler.
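The core of an off-by-one test is simple to state. Here is a hedged sketch of the idea (a hypothetical function, not FATE's actual implementation):

```c
#include <stdlib.h>

/* Sketch of the idea behind off-by-one audio testing: decoded 16-bit PCM
 * matches the stored reference if every sample differs by at most 1,
 * absorbing harmless rounding differences between decoders. */
static int pcm_matches_off_by_one(const short *ref, const short *dec, int n)
{
    for (int i = 0; i < n; i++)
        if (abs(dec[i] - ref[i]) > 1)
            return 0;
    return 1;
}
```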

However, this has caused several problems. Most importantly, it appears that the output for the al##/am## series starts 2048 samples late compared to the reference decoder. For now I've generated silence (for streams that open with silence) or decoded the first 2048 samples with the reference decoder (for streams that don't) and prepended it to those streams as appropriate. The PNS (perceptual noise substitution) tool added in MPEG-4 AAC takes noisy parts of the signal, describes the noise parametrically, and allows the decoder to regenerate it. FFmpeg uses an RNG to generate noise that differs from the reference decoder's, so our results are different from the reference rendering but still fall within the requirements for conformance.

For the time being I've added five tests to try to cover the bulk of AAC features. Let's look at the tests individually.

  • fate-aac-al04_44: This test covers AAC-LC mono at 48000 Hz with the following bitstream features: program config element, data stream element, pulse data, TNS, and window shape switching.
  • fate-aac-al07_96: This test covers AAC-LC 5.1 at 96000 Hz with the following bitstream features: program config element, intensity stereo, mid/side stereo, TNS, and dependent coupling.
  • fate-aac-am00_88: This test covers AAC-Main mono at 88200 Hz with the following bitstream features: program config element, window shape switching, and backwards prediction.
  • fate-aac-al_sbr_hq_cm_48_2: This test covers AAC-LC stereo at 24000 Hz + SBR with the following bitstream features: indexed channel configuration, mid/side stereo, TNS, window shape switching, pure upsampling SBR, and upsampled SBR synthesis.
  • fate-aac-al_sbr_ps_06_ur: This test covers AAC-LC mono at 16000 Hz + SBR + PS with the following bitstream features: program config element, window shape switching, pure upsampling SBR, upsampled SBR synthesis, PS IID data, PS ICC data, PS mixing mode A, and PS iid-/icc-mode.

Things that are missing: syntax element order switching, PNS (explained above), non-meaningful window transitions (which FFmpeg handles differently than the spec does), independent coupling, downsampled SBR, the detailed SBR tool tests, PS IPD/OPD, PS mixing mode B, other sampling frequencies including 7350 Hz (missing from the conformance suite), and HE-AAC signaling (the CT suite has its own problems). In addition, the unofficial extensions FFmpeg supports, like relaxed channel ordering and 6 patches in SBR, are also missing.

2010-06-27

Adding HE-AAC support to ffaacdec

Over the past eight months I've been working on adding HE-AACv1 and HE-AACv2 support to the FFmpeg AAC decoder (ffaacdec). This work is now complete. Rob Swain wrote much of the HE-AACv1 decoder and deserves his share of the credit for it.

Right now the de facto standard AAC decoder is libfaad2. Faad has supported decoding of HE-AAC(v2) for quite some time. Faad has a few problems though. Libfaad2 is pure GPL, which is less than ideal for some applications; ffaacdec is 100% LGPL. (If you are taking advantage of this licensing distinction, please consider donating to fund further work.) Libfaad2 also reinvents the wheel in regard to several standard bitstream and signal processing blocks. As Ronald says (paraphrased): FFmpeg already has very fast wheels. This makes our decoder much faster on floating point systems (like my Intel Core2).

HE-AACv1 (also known as aacPlus or AAC+) is the combination of MPEG-4 AAC-LC (which FFmpeg has supported natively since 2008) with a tool called Spectral Band Replication (SBR). SBR generates high frequency signal components based on the low frequency components and bitstream guidance. SBR uses a 32-band, 32-time-slot analysis QMF to generate a joint time-frequency representation of each AAC frame. SBR then generates 32 new high frequency bands and transforms the result back to the time domain.

HE-AACv2 (also known as aacPlusv2 or eAAC+) is the combination of mono HE-AAC with a tool called Parametric Stereo (PS). PS takes the mono SBR QMF output and generates left and right SBR QMF outputs. It does this by further splitting the QMF into 71 or 91 bands and using a few very small parameters to mix the input signal with a decorrelated variant of itself. The parameters it uses are interchannel intensity difference (IID), interchannel coherence (ICC), and phase rotations (IPD/OPD). The ICC mode is also used to choose which mixing mode is used. PS also comes in a baseline flavor which always uses 71 bands, always uses mixing mode A, and does not use IPD/OPD.

A few implementation notes:

  • Right now the FFmpeg decoder supports unrestricted PS. It can emulate baseline behavior by setting the PS_BASELINE define to 1 in aacps.c. This however does not fully optimize for the baseline case.
  • It seems that mainstream encoders focus on the baseline feature set, especially in regard to IPD/OPD.
  • The SBR filterbank has been optimized on top of a DCT-IV which in turn is implemented on top of the FFmpeg half IMDCT. This is based on a paper by Han-Wen Hsu, et al. Please note that many of the final equations presented by the paper are obviously incorrect but if you follow their work you should be able to arrive at the correct equations.
  • The PS filterbank remains (largely) unoptimized.
  • Coding Technologies (creators of SBR and PS) had an HE-AAC(v2) decoder test package focused mainly on signaling the presence of the SBR and PS extensions. While it disappeared after Dolby acquired CT, it is mirrored in the mplayer samples archive.
  • The SBR spec says the maximum number of patches allowed is 5, the stream in the CT package has six. This leads me to believe that some version of the CT encoder didn't strictly enforce that limit and thus decoders should follow suit.
The reference software and the spec seem to disagree about when to apply IPD/OPD smoothing. The spec says only to apply it when IPD/OPD are enabled. The reference software applies it constantly. These may seem equivalent, but in actuality the reference decoder applies it to envelopes past the end of the IPD/OPD-enabled envelopes due to the 3-tap smoother. For now FFmpeg follows the spec here.

Related future work:

  • MPEG Surround is a more flexible multichannel coding scheme based on PS.
  • Allow the user to optionally decode without SBR and/or PS. In addition 3GPP specifies an SBR domain downmix that should be used if a mono channel configuration is requested.
  • Add support for less popular AAC variants to the decoder including LTP, LD, and 960.
  • The current experimental FFmpeg AAC-LC encoder is buggy and suboptimal.