Audio Coding
Yuli You
Audio Coding Theory and Applications
Yuli You, Ph.D. University of Minnesota in Twin Cities
ISBN 978-1-4419-1753-9
e-ISBN 978-1-4419-1754-6
DOI 10.1007/978-1-4419-1754-6
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010931862
© Springer Science+Business Media, LLC 2010
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
To My parents and Wenjie, Amy and Alan
Preface
Since branching out of speech coding in the early 1970s, audio coding has slipped into our daily lives in a variety of applications, such as mobile music/video players, digital television/audio broadcasting, optical discs, online media streaming, and electronic games. It has become one of the essential technologies in today’s consumer electronics and broadcasting equipment. In its more than 30 years of evolution, many audio coding technologies have come into the spotlight and then become obsolete; only a minority have survived and are deployed in major modern audio coding algorithms. While covering all the major turns and branches of this evolution is valuable for technology historians or for readers with an intense interest, it is distracting and even overwhelming for most readers. Therefore, those historic events are omitted and this book focuses instead on the current state of this evolution. Such a focus also helps to provide full coverage of the selected topics.
This state of the art is presented from the perspective of a practicing engineer and adjunct associate professor who single-handedly developed the whole DRA audio coding standard, from algorithm architecture to assembly-code implementation and subjective listening tests. This perspective has a clear focus on “why” and “how to.” In particular, many purely theoretical details, such as proofs of the perfect reconstruction property of various filter banks, are omitted. Instead, the emphasis is on the motivation for a particular technology, why it is useful, what it is, and how it is integrated into a complete algorithm and implemented in practical products. Consequently, many practical aspects of audio coding technologies that are normally excluded from audio coding books, such as transient detection and implementation of decoders on low-cost microprocessors, are covered in this book.
This book should help readers to grasp the state-of-the-art audio coding technologies and build a solid foundation for them to either understand and implement various audio coding standards or develop their own should the need arise. It is, therefore, a valuable reference for engineers in the consumer electronics and broadcasting industries and for graduate students of electrical engineering.
Audio coding seeks to achieve data compression by removing perceptual irrelevance and statistical redundancy from a source audio signal, and the removal efficiency is powerfully augmented by data modeling, which compacts and/or decorrelates the source signal. Therefore, the presentation of this book is centered around these three basic elements and organized into the following five parts.
Part I gives an overview of audio coding, describing the basic ideas, the key challenges, important issues, fundamental approaches, and the basic codec architecture.
Part II is devoted to quantization, the tool for removing perceptual irrelevance. Chapter 2 delineates scalar quantization, which quantizes a source signal one sample at a time. Both uniform and nonuniform quantization, including the Lloyd–Max algorithm, are discussed. Companding is posed as a structured and simple method to implement nonuniform quantization.
Chapter 3 describes vector quantization (VQ), which quantizes two or more samples of a source signal as one block each time. Also included is the Linde–Buzo–Gray (LBG) or k-means algorithm, which builds an optimal VQ codebook from a set of training data.
Part III is devoted to data modeling, which transforms a source signal into a representation that is energy-compact and/or decorrelated. Chapter 4 describes linear prediction, which uses a linear combination of the past samples of the source signal as a prediction for the current sample so as to arrive at a prediction error signal that has lower energy and is decorrelated. It first explains why quantizing the prediction error signal, instead of the source signal, can dramatically improve coding efficiency. It then presents open-loop DPCM and DPCM, the two most common forms of linear prediction, derives the normal equation for optimal prediction, presents the Levinson–Durbin algorithm that iteratively solves the normal equation, shows that the prediction error signal has a white spectrum and is thus decorrelated, and illustrates that the prediction decoder filter provides an estimate of the spectrum of the source signal.
Finally, a general framework for linear prediction that can shape the spectrum of quantization noise to desirable shapes, such as that of the absolute threshold of hearing, is presented.
Chapter 5 deals with transforms, which linearly transform a block of source signal samples into another block of coefficients whose energy is compacted into a minority of them. It first explains, through the AM–GM inequality and the associated optimal bit allocation strategy, why this compaction of energy leads to dramatically improved coding efficiency. It then derives the Karhunen–Loeve transform from the search for the optimal transform. Finally, it presents suboptimal but practical transforms, such as the discrete Fourier transform (DFT) and the discrete cosine transform (DCT).
Chapter 6 presents subband filter banks as extended transforms in which past blocks of source samples overlap with the current block. It describes various aspects of subband coding, including reconstruction error and polyphase representation, and illustrates that the dramatically improved coding efficiency is again achieved through energy compaction.
Chapter 7 is devoted to cosine-modulated filter banks (CMFB), whose structure is amenable to fast implementation. It first builds this filter bank from the DFT and explains that it has the structure of a prototype filter plus cosine modulation. It then presents nonperfect reconstruction and perfect reconstruction CMFB and their efficient implementation structures. Finally, it illustrates that the modified discrete cosine transform (MDCT), the most widely used filter bank in audio coding, is a special and simple case of CMFB.
Part IV is devoted to entropy coding, the tool for removing statistical redundancy. Chapter 8 establishes that entropy is determined by the probability distribution of the source signal and is the fundamental lower limit of bit rate reduction. It then shows that any meaningful entropy code has to be uniquely decodable and, to be practically implementable, should be instantaneously decodable. Finally, it illustrates that prefix-free codes are just such codes and further proves Shannon’s noiseless coding theorem, which essentially states that the entropy can be asymptotically approached by a prefix-free code if source symbols are coded as blocks and the block size goes to infinity.
Chapter 9 presents the Huffman code, an optimal prefix-free code widely used in audio coding. It first presents Huffman’s algorithm, which is an iterative procedure to build a prefix-free code from the probability distribution of the source signal, and then proves its optimality. It also addresses some practical issues related to the application of Huffman coding, emphasizing the importance of coding source symbols as longer blocks.
While the previous parts can be applied to signal coding in general, Part V is devoted to audio. Chapter 10 covers perceptual models, which determine which part of the source signal is inaudible (perceptually irrelevant) and thus can be removed. It starts with the absolute threshold of hearing, which is the absolute sensitivity level of the human ear. It then illustrates that the human ear processes audio signals in the frequency domain using nonlinear and analog subband filters and presents the Bark scale and critical bands as tools to describe the nonuniform bandwidth of these subband filters. Next, it covers masking effects, which describe the phenomenon that a weak sound becomes less audible due to the presence of a strong sound nearby.
Both simultaneous and temporal masking are covered, but emphasis is given to the former because it is more thoroughly studied and extensively used in audio coding. The rest of the chapter addresses a few practical issues, such as perceptual bit allocation, converting the masked threshold to the subband domain, perceptual entropy, and an example perceptual model.
Chapter 11 addresses the resolution challenge posed by transients. It first illustrates that audio signals are mostly quasistationary, and hence need fine frequency resolution to maximize energy compaction, but are frequently interrupted by transients, which require fine time resolution to avoid “pre-echo” artifacts. The challenge, therefore, arises: according to the Fourier uncertainty principle, a filter bank cannot have fine frequency and time resolution simultaneously. It then states that one approach to address this challenge is to adapt frequency resolution in time to the presence and absence of transients, and further presents switched-window MDCT as an embodiment: switching the window length of MDCT in such a way that short windows are applied to transients and long ones to quasistationary episodes. Two such examples are given, which can switch between two and three window lengths, respectively. For the double window length example, two more techniques, temporal noise shaping and transient localization, are given, which can further improve the temporal resolution of the short windows. Practical methods for transient detection are finally presented.
Chapter 12 deals with joint channel coding. Only two widely used methods are covered: joint intensity coding and sum/difference (M/S stereo) coding. Methods to deal with low-frequency effect (LFE) channels are also included.
Chapter 13 covers a few practical issues frequently encountered in the development of audio coding algorithms, such as how to organize various data, how to assign entropy codebooks, how to optimally allocate bit resources, how to organize bits representing various compressed data and control commands into a bit stream suitable for transmission over various channels, and how to make the algorithm amenable to implementation on low-cost microprocessors.
Chapter 14 is devoted to performance assessment, which, for a given bit rate, becomes an issue of how to evaluate coding impairments. It first points out that objective methods are highly desired but are generally inadequate, so subjective listening tests are necessary. The double-blind principle of subjective listening tests is then presented, along with the two methods that implement it, namely the ABX test and ITU-R BS.1116.
Finally, Chapter 15 presents the Dynamic Resolution Adaptation (DRA) audio coding standard as an example to illustrate how to integrate the technologies described in this book to create a practical audio coding algorithm. The DRA algorithm has been approved by the Blu-ray Disc Association as part of its BD-ROM 2.3 specification and by the Chinese government as its national standard.
Yuli You
Adjunct Associate Professor
Department of Electrical and Computer Engineering
[email protected]
Contents
Part I Prelude
1 Introduction
  1.1 Audio Coding
  1.2 Basic Idea
  1.3 Perceptual Irrelevance
  1.4 Statistical Redundancy
  1.5 Data Modeling
  1.6 Resolution Challenge
  1.7 Perceptual Models
  1.8 Global Bit Allocation
  1.9 Joint Channel Coding
  1.10 Basic Architecture
  1.11 Performance Assessment
Part II Quantization
2 Scalar Quantization
  2.1 Scalar Quantization
  2.2 Re-Quantization
  2.3 Uniform Quantization
    2.3.1 Formulation
    2.3.2 Midtread and Midrise Quantizers
    2.3.3 Uniformly Distributed Signals
    2.3.4 Nonuniformly Distributed Signals
  2.4 Nonuniform Quantization
    2.4.1 Optimal Quantization and Lloyd–Max Algorithm
    2.4.2 Companding
3 Vector Quantization
  3.1 The VQ Advantage
  3.2 Formulation
  3.3 Optimality Conditions
  3.4 LBG Algorithm
  3.5 Implementation
Part III Data Model
4 Linear Prediction
  4.1 Linear Prediction Coding
  4.2 Open-Loop DPCM
    4.2.1 Encoder and Decoder
    4.2.2 Quantization Noise Accumulation
  4.3 DPCM
    4.3.1 Quantization Error
    4.3.2 Coding Gain
  4.4 Optimal Prediction
    4.4.1 Optimal Predictor
    4.4.2 Levinson–Durbin Algorithm
    4.4.3 Whitening Filter
    4.4.4 Spectrum Estimator
  4.5 Noise Shaping
    4.5.1 DPCM
    4.5.2 Open-Loop DPCM
    4.5.3 Noise-Feedback Coding
5 Transform Coding
  5.1 Transform Coder
  5.2 Optimal Bit Allocation and Coding Gain
    5.2.1 Quantization Noise
    5.2.2 AM–GM Inequality
    5.2.3 Optimal Conditions
    5.2.4 Coding Gain
    5.2.5 Optimal Bit Allocation
    5.2.6 Practical Bit Allocation
    5.2.7 Energy Compaction
  5.3 Optimal Transform
    5.3.1 Karhunen–Loeve Transform
    5.3.2 Maximal Coding Gain
    5.3.3 Spectrum Flatness
  5.4 Suboptimal Transforms
    5.4.1 Discrete Fourier Transform
    5.4.2 DCT
6 Subband Coding
  6.1 Subband Filtering
    6.1.1 Transform Viewed as Filter Bank
    6.1.2 DFT Filter Bank
    6.1.3 General Filter Banks
  6.2 Subband Coder
  6.3 Reconstruction Error
    6.3.1 Decimation Effects
    6.3.2 Expansion Effects
    6.3.3 Reconstruction Error
  6.4 Polyphase Implementation
    6.4.1 Polyphase Representation
    6.4.2 Noble Identities
    6.4.3 Efficient Subband Coder
    6.4.4 Transform Coder
  6.5 Optimal Bit Allocation and Coding Gain
    6.5.1 Ideal Subband Coder
    6.5.2 Optimal Bit Allocation and Coding Gain
    6.5.3 Asymptotic Coding Gain
7 Cosine-Modulated Filter Banks
  7.1 Cosine Modulation
    7.1.1 Extended DFT Bank
    7.1.2 2M-DFT Bank
    7.1.3 Frequency-Shifted DFT Bank
    7.1.4 CMFB
  7.2 Design of NPR Filter Banks
  7.3 Perfect Reconstruction
  7.4 Design of PR Filter Banks
    7.4.1 Lattice Structure
    7.4.2 Linear Phase
    7.4.3 Free Optimization Parameters
  7.5 Efficient Implementation
    7.5.1 Even m
    7.5.2 Odd m
  7.6 Modified Discrete Cosine Transform
    7.6.1 Window Function
    7.6.2 MDCT
    7.6.3 Efficient Implementation
Part IV Entropy Coding
8 Entropy and Coding
  8.1 Entropy Coding
  8.2 Entropy
    8.2.1 Entropy
    8.2.2 Model Dependency
  8.3 Uniquely and Instantaneously Decodable Codes
    8.3.1 Uniquely Decodable Code
    8.3.2 Instantaneous and Prefix-Free Code
    8.3.3 Prefix-Free Code and Binary Tree
    8.3.4 Optimal Prefix-Free Code
  8.4 Shannon’s Noiseless Coding Theorem
    8.4.1 Entropy as the Lower Bound
    8.4.2 Upper Bound
    8.4.3 Shannon’s Noiseless Coding Theorem
9 Huffman Coding
  9.1 Huffman’s Algorithm
  9.2 Optimality
    9.2.1 Codeword Siblings
    9.2.2 Proof of Optimality
  9.3 Block Huffman Code
    9.3.1 Efficiency Improvement
    9.3.2 Block Encoding and Decoding
  9.4 Recursive Coding
  9.5 A Fast Decoding Algorithm
Part V Audio Coding
10 Perceptual Model
  10.1 Sound Pressure Level
  10.2 Absolute Threshold of Hearing
  10.3 Auditory Subband Filtering
    10.3.1 Subband Filtering
    10.3.2 Auditory Filters
    10.3.3 Bark Scale
    10.3.4 Critical Bands
    10.3.5 Critical Band Level
    10.3.6 Equivalent Rectangular Bandwidth
  10.4 Simultaneous Masking
    10.4.1 Types of Masking
    10.4.2 Spread of Masking
    10.4.3 Global Masking Threshold
  10.5 Temporal Masking
  10.6 Perceptual Bit Allocation
  10.7 Masked Threshold in Subband Domain
  10.8 Perceptual Entropy
  10.9 A Simple Perceptual Model
11 Transients
  11.1 Resolution Challenge
    11.1.1 Pre-Echo Artifacts
    11.1.2 Fourier Uncertainty Principle
    11.1.3 Adaptation of Resolution with Time
  11.2 Switched-Window MDCT
    11.2.1 Relaxed PR Conditions and Window Switching
    11.2.2 Window Sequencing
  11.3 Double-Resolution Switched MDCT
    11.3.1 Primary and Transitional Windows
    11.3.2 Look-Ahead and Window Sequencing
    11.3.3 Implementation
    11.3.4 Window Size Compromise
  11.4 Temporal Noise Shaping
  11.5 Transient-Localized MDCT
    11.5.1 Brief Window and Pre-Echo Artifacts
    11.5.2 Window Sequencing
    11.5.3 Indication of Window Sequence to Decoder
    11.5.4 Inverse TLM Implementation
  11.6 Triple-Resolution Switched MDCT
  11.7 Transient Detection
    11.7.1 General Procedure
    11.7.2 A Practical Example
12 Joint Channel Coding
  12.1 M/S Stereo Coding
  12.2 Joint Intensity Coding
  12.3 Low-Frequency Effect Channel
13 Implementation Issues
  13.1 Data Structure
    13.1.1 Frame-Based Processing
    13.1.2 Time–Frequency Tiling
  13.2 Entropy Codebook Assignment
    13.2.1 Fixed Assignment
    13.2.2 Statistics-Adaptive Assignment
  13.3 Bit Allocation
    13.3.1 Inter-Frame Allocation
    13.3.2 Intra-Frame Allocation
  13.4 Bit Stream Format
    13.4.1 Frame Header
    13.4.2 Audio Channels
    13.4.3 Error Protection Codes
. . . . . . . . . . . . .. . . . . . . . . . . . . . . . .245 13.4.4 Auxiliary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .245 13.5 Implementation on Microprocessors . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .246 13.5.1 Fitting to Low-Cost Microprocessors.. . . . . .. . . . . . . . . . . . . . . . .246 13.5.2 Fixed-Point Arithmetic .. . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .247
xvi
Contents
14 Quality Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .251 14.1 Objective Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .252 14.2 Subjective Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .252 14.2.1 Double-Blind Principle .. . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .253 14.2.2 ABX Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .253 14.2.3 ITU-R BS.1116 .. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .253 15 DRA Audio Coding Standard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .255 15.1 Design Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .255 15.2 Architecture.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .256 15.3 Bit Stream Format .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .258 15.3.1 Frame Synchronization .. . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .259 15.3.2 Frame Header .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .262 15.3.3 Audio Channels .. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .264 15.3.4 Window Sequencing for LFE Channels . . . .. . . . . . . . . . . . . . . . .278 15.3.5 End of Frame Signature . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .279 15.3.6 Auxiliary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .280 15.3.7 Unpacking the Whole Frame. . . . . . . . . . . . 
. . . .. . . . . . . . . . . . . . . . .280 15.4 Decoding.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .281 15.4.1 Inverse Quantization .. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .281 15.4.2 Joint Intensity Decoding . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .282 15.4.3 Sum/Difference Decoding.. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .283 15.4.4 De-Interleaving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .285 15.4.5 Window Sequencing .. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .286 15.4.6 Inverse TLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .289 15.4.7 Decoding the Whole Frame . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .289 15.5 Formal Listening Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .290 Large Tables . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .293 A.1 Quantization Step Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .293 A.2 Critical Bands for Short and Long MDCT . . . . . . . . . . .. . . . . . . . . . . . . . . . .294 A.3 Huffman Codebooks for Codebook Assignment . . . .. . . . . . . . . . . . . . . . .301 A.4 Huffman Codebooks for Quotient Width of Quantization Indexes .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .303 A.5 Huffman Codebooks for Quantization Indexes in Quasi-Stationary Frames .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .304 A.6 Huffman Codebooks for Quantization Indexes in Frames with Transients.. . . . . . . . . . . . . . . . . . 
. . . . . . . . . .. . . . . . . . . . . . . . . . .318 A.7 Huffman Codebooks for Indexes of Quantization Step Sizes .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .332 References .. . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .335 Index . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .339
Part I
Prelude
Chapter 1
Introduction
Sounds are physical waves that propagate in the air or other media. Such waves, which may be expressed as changes in air pressure, may be transformed by an analog audio system using a transducer, such as a microphone, into continuous electrical waves in the form of current and/or voltage changes. This transformation of sounds into an electrical representation, which we call an audio signal, facilitates the storage, transmission, duplication, amplification, and other processing of sounds. To reproduce the sounds, the electrical representation, or audio signal, is converted back into physical waves via loudspeakers.

Since electronic circuits and storage/transmission media are inherently noisy and nonlinear, audio signals are susceptible to noise and distortion, resulting in loss of sound quality. Consequently, modern audio systems are mostly digital in that the audio signals obtained above are sampled into discrete-time signals and then digitized into numerical representations, which we call digital audio signals. Once in the digital domain, many technologies can be deployed to ensure that no inadvertent loss of audio quality occurs.

Pulse-code modulation (PCM) is usually the standard representation format for digital audio signals. To obtain a PCM representation, the waveform of an analog audio signal is sampled at uniform intervals (the sampling period) to generate a sequence of samples (a discrete-time signal), which are then quantized to generate a sequence of symbols, each represented as a numerical (usually binary) code. The Nyquist–Shannon sampling theorem states that an analog signal can be perfectly reconstructed from its samples if the sample rate exceeds twice the highest frequency in the original analog signal [68]. To ensure that this condition is satisfied, the input analog signal is usually filtered with a low-pass filter whose stopband corner frequency is less than half of the sample rate.
Since the human ear’s perceptual range for pure tones is widely believed to be between 20 Hz and 20 kHz (see Sect. 10.2) [102], such low-pass filters may be designed so that the cutoff frequency is 20 kHz and a few kilohertz are allowed as the transition band before the stopband. For example, the sample rate is 44.1 kHz for compact discs (CD) and 48 kHz for sound tracks in DVD-Video. Some people, however, believe that the human ear can perceive frequencies much higher than 20 kHz, especially when transients are present, so sampling rates as high as 192 kHz are used in some audio systems, such as DVD-Audio.

Y. You, Audio Coding: Theory and Applications, DOI 10.1007/978-1-4419-1754-6_1, © Springer Science+Business Media, LLC 2010

Note that there is still power in the stopband of any practical low-pass filter, so the condition for perfect reconstruction is only approximately satisfied.

The subsequent quantization process also introduces noise. The more bits are used to represent each audio sample, the lower the quantization noise becomes (see Sect. 2.3). Compact discs, for example, use 16 bits to represent each sample. Due to the limited resolution and dynamic range of the human ear, 16 bits per sample are argued by some to be sufficient to deliver the full dynamics of almost all music, but higher resolution audio formats are called for in applications such as soundtracks in feature films, where there is often a very wide dynamic range between whispered conversations and explosions. A higher resolution format also leaves more headroom for audio processing, which may inadvertently or intentionally introduce noise. Twenty-four bits per sample, used by DVD, are widely believed to be sufficient for most applications, but 32 bits are not uncommon.

Digital audio signals rarely come as a single channel or monaural sound. The CD delivers two channels (stereo) and the DVD up to 7.1 channels (surround sound), which consist of seven normal channels (front left, front right, center, surround left, surround right, back left, and back right, for example) and one low-frequency effect (LFE) channel. The notation 0.1 is used to indicate that an LFE channel has a very low bandwidth, usually no more than 120 Hz. The number of channels appears to be on an increasing trend. For example, Japan’s NHK demonstrated 22.2 channel surround sound in 2005 and China’s DRA audio coding standard (see Chap. 15) allows for 64.3 channel surround sound.
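The PCM process described above, uniform sampling followed by quantization to a fixed word length, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 997 Hz test tone, the amplitude of 0.9, and the helper names `quantize` and `pcm_snr_db` are chosen here for the example and do not come from the text.

```python
import math

def quantize(x, bits):
    """Uniformly quantize x in [-1, 1) to a signed integer code of the given width."""
    levels = 1 << (bits - 1)                      # e.g. 32768 for 16 bits
    code = round(x * levels)
    return max(-levels, min(levels - 1, code))    # clip to the representable range

def pcm_snr_db(bits, sample_rate=48000, freq=997.0, num_samples=4800, amplitude=0.9):
    """Signal-to-quantization-noise ratio (dB) of a sampled and quantized sine."""
    levels = 1 << (bits - 1)
    signal_power = 0.0
    noise_power = 0.0
    for n in range(num_samples):
        x = amplitude * math.sin(2 * math.pi * freq * n / sample_rate)  # sampling
        x_hat = quantize(x, bits) / levels                              # code back to a value
        signal_power += x * x
        noise_power += (x - x_hat) ** 2
    return 10 * math.log10(signal_power / noise_power)

for bits in (8, 16, 24):
    print(bits, "bits:", round(pcm_snr_db(bits), 1), "dB")
```

Running the loop shows the quantization noise falling as the word length grows, by roughly 6 dB for every added bit, which is the usual rule of thumb for uniform quantization.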
1.1 Audio Coding

Higher audio quality demands a higher sample rate, more bits per sample, and more channels. But all of these come at a significant cost: a large number of bits to represent and transfer the digital audio signals. Let $b$ denote the number of bits used to represent each PCM sample and $F_s$ the sample rate in samples per second. The bit rate needed to represent and transfer an $N_{ch}$-channel audio signal is

$$B_0 = b \cdot F_s \cdot N_{ch} \qquad (1.1)$$

in bits per second. As an example, let us consider a moderate case which is typically deployed by DVD-Video: 48 kHz sample rate and 24 bits per sample. This amounts to a bit rate of $48 \times 24 = 1{,}152$ kbps (kilobits per second) for each channel. The total bit rate becomes 2,304 kbps for stereo, 6,912 kbps for 5.1, 9,216 kbps for 7.1, and 27,648 kbps for 22.2 surround sound, respectively. And this is not the end of the story. If the 192 kHz sample rate is used, for example, these bit rates quadruple.

For mass consumption, audio signals need to be delivered to consumers through some sort of communication/broadcasting or storage channel whose capacity is usually very limited.
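The arithmetic behind these figures is easy to reproduce. The sketch below evaluates Eq. (1.1) for the channel layouts mentioned above; the function name and the choice to express the sample rate in kHz (so that the result comes out in kbps) are conveniences of this example.

```python
def pcm_bit_rate_kbps(bits_per_sample, sample_rate_khz, num_channels):
    """Eq. (1.1): B0 = b * Fs * Nch, with Fs in kHz so the result is in kbps."""
    return bits_per_sample * sample_rate_khz * num_channels

# Channel counts include the LFE channel(s): 5.1 -> 6, 7.1 -> 8, 22.2 -> 24.
layouts = {"mono": 1, "stereo": 2, "5.1": 6, "7.1": 8, "22.2": 24}
for name, channels in layouts.items():
    print(name, pcm_bit_rate_kbps(24, 48, channels), "kbps")
# mono 1152, stereo 2304, 5.1 6912, 7.1 9216, 22.2 27648
```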
Storage channels usually have the largest channel capacity. DVD-Video, for example, is designed to hold at least two hours of film with standard-definition video and 5.1 surround sound. It was given a capacity of 4.7 GB (gigabytes), which was the state of the art when DVD was standardized. If two hours of 5.1 surround sound is to be delivered in the standard PCM format (24 bits per sample and 48 kHz sample rate), it needs about 6.22 GB of storage space. This is more than the capacity of the whole DVD disc, and there is no capacity left for the video, whose demand for bit rate is usually more than ten times that of audio signals.

Clearly, there is a problem of insufficient channel capacity for the delivery of audio signals. This problem is much more acute with wireless channels. For example, over-the-air audio and/or television broadcasting usually allocates no more than 64 kbps of channel capacity to deliver stereo audio. If delivered as PCM at 24 bits per sample and a 48 kHz sample rate, a stereo audio signal needs a bit rate of 2,304 kbps, which is 36 times the allocated channel capacity.

This problem of insufficient channel capacity for delivering audio signals may be addressed by either allocating more channel capacity or reducing the demand for it. Allocating more channel capacity is usually very expensive and even physically impossible in situations such as wireless communication or broadcasting. It is often more practical and effective to pursue the demand reduction route: reducing the bit rate necessary for delivering audio signals. This is the task of digital audio (compression) coding.

Audio coding achieves this goal of bit rate reduction through an encoder and a decoder, as shown in Fig. 1.1. The encoder obtains a compact representation of the input audio signal, often referred to as the source signal, that demands fewer bits.
The bits for this compact representation are delivered through a communication/broadcasting or storage channel to the decoder, which then reconstructs the original audio signal from the received compact representation. Note that the term “channel” used here is an abstraction or aggregation of the channel coder, modulator, physical channel, channel demodulator, and channel decoder in the communication literature. Since the channel is well known for introducing bit errors, the compact representation received by the decoder may differ from that at the output of the encoder. From the viewpoint of audio coding, however, the channel may be assumed to be error-free, but an audio coding system must be designed in such a way that it can tolerate a certain degree of channel errors. At the very least, the decoder must be able to detect and recover from most channel errors.

Fig. 1.1 Audio coding involves an encoder to transform a source audio signal into a compact representation for transmission through a channel and a decoder to decode the compact representation received from the channel to reconstruct the source audio signal

Let $B$ be the bit rate needed to deliver the compact representation. The performance of an audio coding algorithm may be assessed by the compression ratio defined below:

$$r = \frac{B_0}{B}. \qquad (1.2)$$

For the previous example of over-the-air audio and/or television broadcasting, the required compression ratio is 36:1.

The compact representation obtained by the encoder may allow the decoder to perfectly reconstruct the original audio signal, i.e., the reconstructed audio signal at the output of the decoder is an exact copy, bit by bit, of the source audio signal input to the encoder. This audio coding process is called lossless. Otherwise, it is called lossy, meaning that the reconstructed audio signal is only an approximate copy of the source audio signal: some information is lost in the coding process and the audio signal is irreversibly distorted (ideally without the distortion being perceived).
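The two capacity examples in this section, the 36:1 ratio required for broadcasting and the 6.22 GB needed for two hours of 5.1 PCM audio, can be checked with a short calculation. The helper names below are illustrative, and the storage figure assumes 1 GB = 10^9 bytes.

```python
def compression_ratio(source_kbps, coded_kbps):
    """Eq. (1.2): r = B0 / B."""
    return source_kbps / coded_kbps

def storage_gigabytes(bit_rate_kbps, seconds):
    """Storage needed for a stream, assuming 1 GB = 10**9 bytes."""
    return bit_rate_kbps * 1000 * seconds / 8 / 1e9

stereo_pcm_kbps = 24 * 48 * 2                                  # 2,304 kbps
print(compression_ratio(stereo_pcm_kbps, 64))                  # 36.0
print(round(storage_gigabytes(24 * 48 * 6, 2 * 3600), 2))      # 6.22
```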
1.2 Basic Idea

According to information theory [85, 86], the minimal average bit rate that is necessary to transmit a source signal is its entropy, which is determined by the probability distribution of the source signal (see Sect. 8.2). Let $H$ denote the entropy of the source signal, the following difference:

$$R = B_0 - H \qquad (1.3)$$
is the component in the source signal that is statistically redundant for the purpose of transmitting the source signal to the decoder and is thus called statistical redundancy. The goal of lossless audio coding is to remove as much statistical redundancy from the source signal as possible so that it is delivered to the decoder with a bit rate $B$ as close as possible to the entropy. This is illustrated in Fig. 1.2. Note that, while entropy coding is a general term for coding techniques that remove statistical redundancy, a lossless audio coding algorithm usually also involves sophisticated data modeling (to be discussed in Sect. 1.5), so the “entropy coding” block in Fig. 1.2 is an oversimplification in the context of lossless coding algorithms and may be understood to include data modeling.

The compression ratio achievable by lossless audio coding is usually very limited; an overall compression ratio of 2:1 may be considered high. This level of compression cannot satisfy many practical needs. As stated before, over-the-air audio and/or television broadcasting, for example, may require a compression ratio of 36:1. To achieve this level of compression, some information in the source signal has to be irreversibly discarded by the encoder.
Fig. 1.2 A lossless audio coder uses entropy coding to remove statistical redundancy from the source audio signal to arrive at a compact representation
Fig. 1.3 A lossy audio coder removes both perceptual irrelevance and statistical redundancy from the source audio signal to achieve a much higher compression ratio
This irreversible loss of information causes distortion in the reconstructed audio signal at the decoder output. The distortion may be significant if assessed using objective measures such as mean square error, but is perceived differently by the human ear, which audio coding serves. Proper coder design can ensure that no distortion is perceived by the human ear, even if the distortion is large when assessed by objective measures. Furthermore, even if some distortion can be perceived, it may still be tolerated if it is not “annoying.” The portion of information in the source signal whose removal leads to either imperceptible or non-annoying distortion is, therefore, perceptually irrelevant and may be removed from the source signal to significantly reduce the bit rate.

After removal of perceptual irrelevance, there is still statistical redundancy in the remaining signal components, which can be further removed through entropy coding. Therefore, a lossy coder usually consists of two modules, as shown in Fig. 1.3.
Note that, while quantization is a general term for coding techniques that remove perceptual irrelevance, a lossy audio coding algorithm usually also involves sophisticated data modeling, so the “quantization” block in Fig. 1.3 is an oversimplification in the context of lossy audio coding algorithms and may be understood to include data modeling.
1.3 Perceptual Irrelevance

The basic approach to removing perceptual irrelevance is quantization, which involves representing the samples of the source signal with lower resolution (see Chaps. 2 and 3). For example, the integer value of 1,000, which needs 10 bits for binary representation, may be quantized by a scalar quantizer (SQ) with a quantization step size of 9 as

$$1{,}000 \div 9 \approx 111,$$

which now needs only 7 bits. At the decoder side, the original value may be reconstructed as

$$111 \times 9 = 999$$

for a quantization error of

$$1{,}000 - 999 = 1.$$

Considering the value of 1,000 above as a sample of a 10-bit PCM signal (unsigned), the above quantization process may be applied to all samples of the PCM signal to generate another PCM signal of only 7 bits, for a compression ratio of 10:7. Of course, the original 10-bit PCM signal cannot be perfectly reconstructed from the 7-bit one due to quantization error. The quantization error obviously depends on the quantization step size: the larger the step size, the larger the quantization error. If the level of quantization error above is considered perceptually irrelevant, we have effectively compressed the 10-bit PCM signal into a 7-bit one. Otherwise, the quantization step size needs to be reduced until the quantization error is perceptually irrelevant. To optimize compression performance, the step size can be adjusted to an optimal value which gives a quantization error that is just not perceivable.

The quantization scheme illustrated above is the simplest uniform scalar quantization (see Sect. 2.3), which is characterized by a constant quantization step size applied to the whole dynamic range of the input signal. The quantization step size may be made variable depending on the values of the input signal so as to better adapt to the perceived quality of the quantized and reconstructed signal. This amounts to nonuniform quantization (see Sect. 2.4).
To exploit the inter-sample structure and correlation among the samples of the input signal, a block of these samples may be grouped together and quantized as a vector, amounting to vector quantization (VQ) (see Chap. 3).
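The uniform scalar quantization example of this section (value 1,000, step size 9) can be reproduced as follows. This is a minimal sketch for unsigned integers; the function names and the rounding-to-nearest rule are assumptions made for the illustration.

```python
def sq_encode(value, step):
    """Uniform scalar quantizer: map a non-negative integer to the nearest index."""
    return (value + step // 2) // step

def sq_decode(index, step):
    """Reconstruct the value represented by a quantization index."""
    return index * step

value, step = 1000, 9
index = sq_encode(value, step)
recon = sq_decode(index, step)
print(index, recon, value - recon)               # 111 999 1
print(value.bit_length(), index.bit_length())    # 10 7

# A larger step size needs fewer bits but allows a larger worst-case error
# over the whole 10-bit input range.
for s in (3, 9, 27):
    worst = max(abs(v - sq_decode(sq_encode(v, s), s)) for v in range(1024))
    print(s, sq_encode(1023, s).bit_length(), worst)
```

The final loop makes the trade-off explicit: each increase in step size shortens the index word length while increasing the largest possible quantization error.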
1.4 Statistical Redundancy

The basic approach to removing statistical redundancy is entropy coding, whose basic idea is to use long codewords to represent less frequent sample values and short codewords for more frequent ones. As an example, let us consider the four PCM sample values listed in the first column of Table 1.1, which have the probability distribution listed in the second column of the same table. Since there are four PCM sample values, we need at least 2 bits per sample to represent a PCM signal that draws sample values from this set. However, if we use the codewords listed in the third column of Table 1.1 to represent the PCM sample values, we end up with the following average bit rate:

$$1 \times \frac{1}{2} + 2 \times \frac{1}{4} + 3 \times \frac{1}{8} + 4 \times \frac{1}{8} = 1.875 \text{ bits},$$

which amounts to a compression ratio of 2:1.875.

The code in Table 1.1 is a variant of the unary code, which is not optimal for the probability distribution in the table. For an arbitrary probability distribution, if there is an optimal code which uses the least average number of bits to code the samples of the source signal, the Huffman code is one such code [29] (see Chap. 9).
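To see why the code of Table 1.1 is not optimal, its average codeword length can be compared with the entropy of the distribution. In the sketch below, `shorter_code` is a prefix code assumed here for comparison (it only shortens the last codeword by one bit); it is not taken from the text.

```python
import math

probabilities = [1/2, 1/4, 1/8, 1/8]
table_code = ["1", "01", "001", "0001"]     # codewords from Table 1.1
shorter_code = ["1", "01", "001", "000"]    # assumed alternative prefix code

def average_length(probs, code):
    """Average number of bits per sample for a given code."""
    return sum(p * len(word) for p, word in zip(probs, code))

def entropy(probs):
    """Entropy in bits per sample."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(average_length(probabilities, table_code))     # 1.875
print(average_length(probabilities, shorter_code))   # 1.75
print(entropy(probabilities))                        # 1.75
```

For this particular distribution the shorter code reaches the entropy exactly, so no code can do better on average.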
Table 1.1 The basic idea of entropy coding is to use long codewords to represent less frequent sample values and short codewords for more frequent ones

PCM sample value    Probability    Entropy code
0                   1/2            1
1                   1/4            01
2                   1/8            001
3                   1/8            0001

1.5 Data Modeling

If audio coding involved only techniques for removing perceptual irrelevance and statistical redundancy, it would be a much simpler field of study and coding performance would also be significantly limited. Fortunately, there is another class of techniques that makes audio coding much more effective and also more complex. It is data modeling.

Audio signals, like many other signals, are usually strongly correlated and have internal structures that can be expressed via data models. As an example, let us consider the 1,000 Hz sinusoidal signal shown at the top of Fig. 1.4, which is represented using 16-bit PCM with a sample rate of 48 kHz. Its periodicity shows that it is strongly correlated.

Fig. 1.4 A 1,000 Hz sinusoidal signal represented as 16-bit PCM with a sample rate of 48 kHz (top), its linear prediction error signal (middle), and its DFT spectrum (bottom)

One simple approach to modeling the periodicity so as to remove correlation is linear prediction (see Chap. 4). Let $x(n)$ denote the $n$th sample of the sinusoidal signal and $\hat{x}(n)$ its predicted value. An extremely simple prediction scheme is to use the immediately preceding sample value as the prediction for the current sample value:

$$\hat{x}(n) = x(n-1).$$

This prediction is, of course, not perfect, so there is a prediction error, or residue,

$$e(n) = x(n) - \hat{x}(n),$$

which is shown in the middle of Fig. 1.4. If we elect to send this residue signal, instead of the original signal, to the decoder, we will end up with a much smaller number of bits due to its significantly reduced dynamic range. In fact, its dynamic range is $[-4278, 4278]$, which can be represented using 14-bit PCM, resulting in a compression ratio of 16:14.
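The previous-sample predictor described above is simple enough to try directly. The sketch below generates a 1,000 Hz tone at 48 kHz, forms the residue, and compares the word lengths needed for the signal and for the residue. The amplitude (32767) and zero starting phase are assumptions of this example, so the exact residue range may differ slightly from the figure quoted in the text.

```python
import math

def bits_needed(values):
    """Smallest two's-complement word length that holds every value."""
    bits = 1
    while min(values) < -(1 << (bits - 1)) or max(values) > (1 << (bits - 1)) - 1:
        bits += 1
    return bits

sample_rate, freq, amplitude = 48000, 1000, 32767    # amplitude and phase are assumptions
x = [round(amplitude * math.sin(2 * math.pi * freq * n / sample_rate))
     for n in range(960)]                            # 20 ms of signal

# Predict each sample by its predecessor; the residue is what would be sent.
e = [x[n] - x[n - 1] for n in range(1, len(x))]

# The decoder rebuilds the signal from the first sample plus the residue.
rebuilt = [x[0]]
for r in e:
    rebuilt.append(rebuilt[-1] + r)

print(bits_needed(x), bits_needed(e), rebuilt == x)  # 16 14 True
```

Because the decoder can undo the prediction exactly, the two bits saved per sample come at no loss of information.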
Another approach to decorrelation is orthogonal transforms (see Chap. 5) that, when properly designed, can transform the input signal into coefficients which are decorrelated and whose energy is compacted into a small number of coefficients. This compaction of energy is illustrated at the bottom of Fig. 1.4, which plots the logarithmic magnitude of the discrete Fourier transform (DFT) (see Sect. 5.4.1) of the 1,000 Hz sinusoidal signal at the top of Fig. 1.4. Instead of dealing with the periodically occurring large sample values of the original sinusoidal signal in the time domain, there are only a small number of large DFT coefficients in the frequency domain and the rest are extremely close to zero. A bit allocation strategy can be deployed which allocates bits to the representation of the DFT coefficients based on their respective magnitudes. Due to the energy compaction, only a small number of large DFT coefficients demand a significant number of bits and the remaining majority demand few, if any, so a tremendous degree of compression can be achieved.

The DFT is rarely used in practical audio coding algorithms, partly because it is a real-to-complex transform: for a block of $N$ real input samples, it generates a block of $N$ complex coefficients, which actually consist of $N$ real and $N$ imaginary parts. Discrete cosine transforms (DCT), which are real-to-real transforms, are more widely used in place of the DFT. Note that $N$ is hereafter referred to as the block size or block length.

When blocks of transform coefficients are coded independently of each other, discontinuities occur at the block boundaries. Referred to as the blocky effect, this discontinuity causes a periodic “clicking” sound in the reconstructed audio and is usually very annoying. To overcome this blocky effect, lapped transforms that overlap between blocks were developed [49], which may be considered as special cases of subband filter banks [93] (see Chap. 6). Another benefit of overlapping between blocks is that the resultant transforms have sharper frequency responses and thus better energy compaction performance.

To mitigate codec implementation cost, structured filter banks that are amenable to fast algorithms are mostly deployed in practical audio coding algorithms. Prominent among them are cosine modulated filter banks (CMFB), whose implementation cost is essentially that of a prototype FIR filter and a DCT (see Chap. 7). A special case, the modified discrete cosine transform (MDCT) (see Sect. 7.6), whose prototype filter is only twice as long as the block size, has essentially dominated various audio coding standards.
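The energy compaction of the DFT for a sinusoid can be observed with a direct, unoptimized transform. In this sketch the block size of 96 samples is an assumption chosen so that the block holds exactly two periods of the 1,000 Hz tone; with a block that does not hold a whole number of periods, the energy would spread over more coefficients.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT; adequate for a small illustrative block."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

block_size = 96    # two periods of 1 kHz at 48 kHz (assumed for this example)
x = [round(30000 * math.sin(2 * math.pi * 1000 * n / 48000)) for n in range(block_size)]

energy = [abs(c) ** 2 for c in dft(x)]
largest = sorted(range(block_size), key=lambda k: energy[k], reverse=True)[:2]
share = sum(energy[k] for k in largest) / sum(energy)

print(sorted(largest), share > 0.999)    # [2, 94] True
```

Two of the 96 coefficients carry practically all of the energy, so a bit allocation driven by coefficient magnitude would spend almost nothing on the other 94.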
1.6 Resolution Challenge

The compression achieved through energy compaction is based on two assumptions. The first is that the input signal is quasistationary, full of fine frequency structures, such as the one shown at the top of Fig. 1.5. This assumption is mostly correct because audio signals are quasistationary most of the time. The second assumption is that the transform or subband filter bank has a frequency resolution good enough to resolve these fine frequency structures. Since the frequency resolution of a transform or filter bank is largely proportional to the block size, this second assumption calls for the deployment of transforms or filter banks with large block sizes. To achieve a high degree of energy compaction, the block size should be as large as possible, limited only by the physical memory of the encoder/decoder as well as the delay associated with buffering a long block of samples.

Fig. 1.5 Audio signals are quasistationary (such as the one shown at the top) most of the time, but are frequently interrupted by dramatic transients (such as the one shown at the bottom)

Unfortunately, the first assumption above is not correct all the time: quasistationary episodes of audio signals are frequently interrupted by dramatic transients which may rise from absolute quiet to extreme loudness within a few samples. Examples of such transients include sudden gunshots and explosions. A less dramatic transient produced by a musical instrument is shown at the bottom of Fig. 1.5. For such a transient, it is well known that a long transform or filter bank would produce a flat spectrum that corresponds to little, if any, energy compaction, resulting in poor compression performance. To mitigate this problem, a short transform or filter bank should be used that has fine time resolution to localize the transient in the time domain.

To be able to code all audio signals with high coding performance all the time, a transform/filter bank that has good resolution in both the time and frequency domains is desired. Unfortunately, the Fourier uncertainty principle [75], which is related to the Heisenberg uncertainty principle [90], states that this is impossible: a transform/filter bank can have good resolution in either the time or the frequency domain, but not both (see Sect. 11.1). This poses one of the most difficult challenges in audio coding.

This challenge is usually addressed by adapting the resolution of transforms/filter banks over time to the changing resolution demands of the input signal: applying long block sizes to quasistationary episodes and short ones to transients. There are a variety of ways to implement this scheme; the most dominant among them seems to be the switched block-size MDCT (see Sect. 11.2). In order for the resolution adaptation to occur on the fly, a transient detection mechanism that detects the occurrence of transients and identifies their locations (see Sect. 11.7) is needed.
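The idea of switching block size on a detected transient can be illustrated with a deliberately crude detector. This is a sketch under assumptions, not the procedure of Sect. 11.7: the sub-block energy comparison, the threshold of 16, and the block sizes of 1024 and 128 are all hypothetical values picked for the example.

```python
import math

def find_transient(frame, num_sub=8, ratio=16.0, floor=1e-9):
    """Return the index of the first sub-block whose energy jumps, else None."""
    sub_len = len(frame) // num_sub
    energies = [sum(s * s for s in frame[i * sub_len:(i + 1) * sub_len])
                for i in range(num_sub)]
    for i in range(1, num_sub):
        if energies[i] > ratio * max(energies[i - 1], floor):
            return i
    return None

def choose_block_size(frame, long_size=1024, short_size=128):
    """Long block for quasistationary frames, short block when a transient is found."""
    return short_size if find_transient(frame) is not None else long_size

tone = [math.sin(2 * math.pi * 1000 * n / 48000) for n in range(1024)]
attack = [0.0] * 600 + tone[:424]      # silence followed by a sudden onset

print(choose_block_size(tone), choose_block_size(attack))    # 1024 128
```

The detector also reports where the onset falls (sub-block 4 in the `attack` frame), which is the kind of location information a window-switching scheme needs.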
1.7 Perceptual Models

While quantization is the tool for removing perceptual irrelevance, a question remains as to what is the optimal degree of perceptual irrelevance that can be safely removed without audible distortion. This question is addressed by perceptual models that mimic the psychoacoustic behavior of the human ear. When a source signal is fed to a perceptual model, the model provides as output some kind of description as to which parts of the source audio signal are perceptually irrelevant. This description usually comes in the form of a threshold of power, called the masking threshold, below which sound cannot be perceived by the human ear and thus can be removed.

Since the human ear does most signal processing in the frequency domain, a perceptual model is best built in the frequency domain, with the masking threshold given as a function of frequency. Consequently, the data model should desirably be a frequency transform/filter bank so that the results from the perceptual model, such as the masking threshold, can be readily and effectively utilized. It is, therefore, not a surprise that most modern audio coders operate in the frequency domain. It is still possible for an audio coder to operate in other domains, but there should be a mechanism to bridge that domain and the frequency domain in which the human ear mostly processes sound signals.
1.8 Global Bit Allocation
The adjustment of the quantization step size affects proportionally the level of quantization noise and inversely the number of bits needed to represent transform coefficients or subband samples. A small quantization step size can ensure that the quantization noise is not perceivable, but at the expense of consuming a large number of bits. A large quantization step size, on the other hand, demands a small number of bits, but at the expense of a high level of quantization noise. Since a lossy audio coder usually operates under a tight bit rate budget with a limited number of bits that can be used, a global bit allocation mechanism needs to be installed to optimally allocate the limited bit resource so as to minimize the total perceived power of quantization noise.
The basic bit allocation strategy is to allocate bits (by decreasing the quantization step size) iteratively to a group of transform coefficients or subband samples whose quantization noise is most audible until either the bit pool is exhausted or quantization noises for all transform coefficients/subband samples are below the masking thresholds.
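The iterative strategy above can be sketched as a greedy loop. The sketch below is a minimal illustration under stated assumptions, not the allocation procedure of any particular coder: the function name `allocate_bits`, the per-band noise and masking levels in dB, the cost of one bit per allocation step, and the rule that one extra bit lowers a band's noise by about 6.02 dB (the uniform-quantizer figure derived in Sect. 2.3.3) are all chosen for illustration.

```python
def allocate_bits(noise_db, mask_db, bit_pool, db_per_bit=6.02):
    """Greedy bit allocation sketch (hypothetical helper).

    noise_db : initial quantization noise level of each band, in dB
    mask_db  : masking threshold of each band, in dB
    bit_pool : total number of bits that may be spent
    Assumption: each allocated bit costs one bit from the pool and
    lowers the noise of the chosen band by db_per_bit.
    """
    noise = list(noise_db)
    bits = [0] * len(noise)
    while bit_pool > 0:
        # Noise-to-mask ratio: a positive value means the noise is audible.
        nmr = [n - m for n, m in zip(noise, mask_db)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[worst] <= 0:
            break  # all noise is below the masking thresholds
        bits[worst] += 1
        noise[worst] -= db_per_bit
        bit_pool -= 1
    return bits, noise


bits, noise = allocate_bits([30.0, 10.0, 0.0], [20.0, 12.0, 5.0], bit_pool=10)
print(bits)
```

The loop stops on exactly the two conditions named in the text: the bit pool is exhausted, or no band has noise above its masking threshold.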
1.9 Joint Channel Coding
The discrete channels of a multichannel audio signal are coordinated, and synchronized in particular, to produce dynamic sound imaging, so the inter-channel correlation in a multichannel audio signal is very strong. This statistical redundancy can be exploited through some form of joint channel coding, either in the temporal or the transform/subband domain. The human ear relies on many cues in the audio signal to achieve sound localization, and the processing involved is very complex. However, many psychoacoustic experiments have consistently indicated that some components of the audio signal are either insignificant or even irrelevant for sound localization, and thus can be removed for bit rate reduction. Joint channel coding is the general term for the techniques that exploit inter-channel statistical redundancy and perceptual irrelevance. Unfortunately, this is a less studied area and the existing techniques are rather primitive. The ones primarily used by most audio coding algorithms are sum/difference coding (M/S stereo coding) and joint intensity coding, both of which are discussed in Chap. 12.
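As a rough illustration of sum/difference coding, the sketch below converts a left/right pair into mid (sum) and side (difference) channels and back. The function names and the scaling by 1/2 are assumptions made here for illustration; scaling conventions vary, and the treatment in Chap. 12 may differ. When the two channels are strongly correlated, the side channel carries little energy and is cheap to code.

```python
def ms_encode(left, right):
    """Sum/difference (M/S) transform; the 1/2 scaling is an assumed convention."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Inverse of ms_encode: left = mid + side, right = mid - side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


left = [1.0, 0.5, -0.25]
right = [0.75, 0.5, -0.5]
mid, side = ms_encode(left, right)
print(mid, side)
print(ms_decode(mid, side))
```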
1.10 Basic Architecture
The various techniques discussed in the earlier sections can now be put together to arrive at the basic audio encoder architecture shown in Fig. 1.6. The multiplexer in the figure is a necessary module that packs all elements of the compressed audio data into a coherent bit stream adhering to a specific format suitable for transmission over various communication channels. The corresponding basic decoder architecture is shown in Fig. 1.7. Each module in this figure simply performs the inverse, and usually simpler, operation of the corresponding module in the encoder. Perceptual model, transient detection and global bit allocation are usually complex and computationally expensive, so are not suitable for inclusion in the decoder. In addition, the decoder usually does not have the relevant information to perform those operations. Therefore, these modules usually do not have counterparts in the decoder. All that the decoder needs are the results from these modules and they can be packed into the bit stream as part of the side information.
Fig. 1.6 The basic audio encoder architecture. The solid lines represent movement of audio data and the dashed line indicates control information
Fig. 1.7 The basic audio decoder architecture. The solid lines represent movement of audio data and the dashed line indicates control information
In addition to the modules shown in Figs. 1.6 and 1.7, the development of an audio coding algorithm also involves many practical and important issues, including:
Data Structure. How transform coefficients or subband samples from all audio channels are organized and accessed by the audio coding algorithm.
Bit Stream Format. How entropy codes and bits representing other control information are packed into a coherent bit stream.
Implementation. How to structure the arithmetic of the algorithm to make the encoder and, especially, the decoder amenable to easy implementation on cheap hardware such as fixed-point microprocessors.
An important but often overlooked issue is the necessity for frame-based processing. An audio signal may last as long as a few hours, so encoding/decoding it as a monolithic piece takes a long time and demands tremendous hardware resources. The resulting monolithic encoded bit stream also makes real-time delivery and decoding impossible. A practical approach is to segment a source audio signal into consecutive frames, usually ranging from 2 to 50 ms in duration, and code each of them in sequence. Most transforms/filter banks, such as MDCT, are block-based, so the size of the frames can be conveniently set to be either the same as the block size or a multiple of it.
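Frame-based processing can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper `split_into_frames` that rounds the requested frame duration to a whole number of transform blocks and zero-pads the last frame; real coders handle block overlap and end-of-stream padding in their own ways.

```python
def split_into_frames(samples, sample_rate, frame_ms, block_size):
    """Segment a signal into consecutive frames (hypothetical helper).

    The frame length is the multiple of block_size closest to the
    requested duration frame_ms, and at least one block long.
    """
    target_len = sample_rate * frame_ms / 1000.0
    blocks_per_frame = max(1, round(target_len / block_size))
    frame_len = blocks_per_frame * block_size
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        frame += [0] * (frame_len - len(frame))  # zero-pad the final frame
        frames.append(frame)
    return frames


frames = split_into_frames(list(range(10000)), 48000, 20, 1024)
print(len(frames), len(frames[0]))
```

Each frame can then be coded in sequence and delivered as soon as it is complete, which is what makes real-time delivery and decoding possible.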
1.11 Performance Assessment
Performance evaluation is an essential part of any algorithm development. The intuitive performance measure for audio coding is the compression ratio defined in (1.2). Although simple, its effectiveness is very limited, mostly because it changes with time for a given audio signal and even more dramatically between different audio signals. A rational approach is to use the worst compression ratio for a set of difficult audio signals as the compression ratio for the audio coding algorithm. This is usually enough for lossless audio coding. For lossy audio coding, however, there is another factor that affects the usefulness of the compression ratio: the perception or assessment of coding distortion. The compression ratio defined in (1.2) assumes that there is no audible distortion in the decoded audio signal. This is a critical assumption that renders the compression ratio meaningful. If this assumption were removed, any algorithm could achieve the maximal possible compression ratio, which is infinity, by not sending any bits to the decoder. Of course, this results in the maximal distortion: no audio signal is output by the decoder. At the other extreme, we can give the encoder an excessive number of bits to push the distortion far below the threshold of perception, wasting the precious bit resource in the process. It is, therefore, necessary to establish the level of just inaudible distortion before the compression ratio can be calculated. It is the compression ratio calculated at this point that authentically reflects the coding performance of the underlying audio coding algorithm. A more widely used approach to performance assessment, especially when different audio coding algorithms are compared, is to perceptually evaluate the level of distortion for a given bit rate and a selected set of critical test signals. The perceptual evaluation of distortion must ultimately be performed by the human ear through listening tests.
For the same piece of decoded audio signal, different people are likely to hear differently: some may hear distortion and some may not. Playback equipment and listening conditions also significantly impact the audibility of distortion. Therefore, a set of procedures and methods for conducting casual and formal listening tests is needed; these are discussed in Chap. 14.
Part II
Quantization
Quantization literally is a process of converting samples of a discrete-time source signal into a digital representation with reduced resolution. It is a necessary step for converting analog signals in the real world to digital signals, which enables digital signal processing. During this process of conversion, quantization also achieves a tremendous deal of compression because an analog sample is considered as having infinite resolution, thus requiring an infinite number of bits to represent, while a digital sample is of limited resolution and is represented using a limited number of bits. This conversion also means that a tremendous amount of information is lost forever. This loss of information might be a serious concern, but can be made imperceptible or tolerable by properly designing the quantization process. The human ear, for example, is widely believed to be unable to perceive resolution higher than 24 bits per sample. Any information or resolution more than this may be considered as irrelevant, hence can be discarded through quantization. When a digital signal is acquired through quantizing an analog one, the primary concern is to make sure that the digital signal is obtained at the desired resolution, or all relevant information is not lost. There is little, if any, attempt to seek a compact representation for the acquired digital signal. A compact representation is pursued afterwards, usually when the need for storage or transmission arises. Once the digital signal is inspected under the spotlight of compact representation, one may be surprised by the amount of unnecessary or irrelevant information that it still contains. This irrelevance can be removed by re-quantizing the already-quantized digital signal. There is essentially no difference in methodology between quantizing an analog signal and re-quantizing a digital signal, so they will not be distinguished in the treatment in this book. 
Scalar quantization (SQ) quantizes a source signal one sample at a time. It is simple, but its performance is not as good as the more sophisticated vector quantization (VQ) which quantizes a block of input samples each time.
Chapter 2
Scalar Quantization
An audio signal is a representation of sound waves, usually in the form of sound pressure level that varies with time. Such a signal is continuous both in value and time, hence carries an infinite amount of information. The first step of significant compression is accomplished when a continuous-time audio signal is converted into a discrete-time signal using sampling. In uniform sampling, the simplest sampling method, the continuous-time signal is sampled at a regular interval $T$, called the sampling period. According to the Nyquist–Shannon sampling theorem [65, 68], the original continuous-time signal can be perfectly reconstructed from the sampled discrete-time signal if the continuous-time signal is band-limited and its bandwidth is no more than half of the sample rate ($1/T$). Therefore, sampling accomplishes a tremendous amount of lossless compression if the source signal is ideally band-limited. After sampling, each sample of the discrete-time signal has a value that is continuous, so the number of possible distinct output values is infinite. Consequently, the number of bits needed to represent and/or convey such a value exactly to a recipient is unlimited. For the human ear, however, an exact continuous sample value is unnecessary because the resolution that the ear can perceive is very limited. Many believe that it is less than 24 bits. So a simple scheme of replacing an analog sample value with the integer value that is closest to it would not only satisfy the perceptual capability of the ear, but also remove a tremendous amount of imperceptible information from a continuously valued signal. For example, the hypothetical “analog” samples in the left column of Table 2.1 may be represented by the respective integer values in the right column. This process is called quantization.
The underlying mechanism for quantizing the sample values in Table 2.1 is to divide the real number line into intervals and then map each such interval to an integer value. This is shown in Table 2.2, which is called a quantization table. The quantization process actually involves three steps as shown in Fig. 2.1 and explained below:
Forward Quantization. A source sample value is used to look up the left column to find the interval, referred to as the decision interval, that it falls into; the corresponding index in the center column, referred to as the quantization index, is then identified. This mapping is referred to as encoder mapping.
Table 2.1 An example of mapping “analog” sample values to integer values that would take place in a process called quantization
“Analog” sound pressure level    Integer sound pressure level
-3.4164589759                    -3
-3.124341                        -3
-2.14235                         -2
-1.409086743                     -1
-0.61341984378562890423          -1
0.37892458                       0
0.61308                          1
1.831401348156                   2
2.8903219654710                  3
3.208913064                      3
Table 2.2 Quantization table that maps source sample intervals in the left column to integer values in the right column
Sample value interval    Index    Integer value
(-∞, -2.5)               0        -3
[-2.5, -1.5)             1        -2
[-1.5, -0.5)             2        -1
[-0.5, 0.5)              3        0
[0.5, 1.5)               4        1
[1.5, 2.5)               5        2
[2.5, ∞)                 6        3
Fig. 2.1 Quantization involves an encoding or forward quantization stage represented by “$Q$”, which maps an input value to the quantization index, and a decoding or inverse quantization stage represented by “$Q^{-1}$”, which maps the quantization index to the quantized value
Index Transmission. The quantization index is transmitted to the receiver.
Inverse Quantization. Upon receiving the quantization index, the receiver uses it to read out the integer value, referred to as the quantized value, in the right column. This mapping is referred to as decoder mapping.
The quantization table above maps sound pressure levels with infinite range and resolution into seven integers, which need only 3 bits to represent, thus achieving a great deal of data compression. However, this comes at a price: much of the original resolution is lost forever. This loss of information may be significant, but it is deliberate: the lost pieces of information are irrelevant to our needs or perception, so we can afford to discard them.
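The three steps can be sketched with a direct table lookup. The sketch below encodes Table 2.2 as a sorted list of interior decision boundaries and uses the standard-library `bisect` module for the lookup; the names `BOUNDARIES`, `QUANTIZED_VALUES`, `forward_quantize`, and `inverse_quantize` are illustrative choices, not taken from the text.

```python
import bisect

# Table 2.2: interior decision boundaries and the quantized value per index.
BOUNDARIES = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
QUANTIZED_VALUES = [-3, -2, -1, 0, 1, 2, 3]


def forward_quantize(x):
    """Encoder mapping: sample value -> quantization index.

    bisect_right counts the boundaries that are <= x, which is the index
    of the half-open decision interval [lower, upper) containing x.
    """
    return bisect.bisect_right(BOUNDARIES, x)


def inverse_quantize(q):
    """Decoder mapping: quantization index -> quantized value."""
    return QUANTIZED_VALUES[q]


for x in [-3.4164589759, -0.61341984378562890423, 0.37892458, 3.208913064]:
    q = forward_quantize(x)       # forward quantization
    # ... the index q is what would be transmitted to the receiver ...
    print(x, q, inverse_quantize(q))  # inverse quantization
```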
2.1 Scalar Quantization
To pose the quantization process outlined above mathematically, let us consider a source random variable $X$ with a probability density function (PDF) of $p(x)$. Suppose that we wish to quantize this source with $M$ decision intervals defined by the following $M + 1$ endpoints
$$b_q, \quad q = 0, 1, \ldots, M, \tag{2.1}$$
referred to as decision boundaries, and with the following $M$ quantized values,
$$\hat{x}_q, \quad q = 1, 2, \ldots, M, \tag{2.2}$$
which are also called output values or representative values. A source sample value $x$ is quantized to the quantization index $q$ if and only if $x$ falls into the $q$th decision interval
$$\delta_q = [b_{q-1}, b_q), \tag{2.3}$$
so the operation of forward quantization is
$$q = Q(x), \quad \text{if and only if } b_{q-1} \le x < b_q. \tag{2.4}$$
The quantized value can be reconstructed from the quantization index by the following inverse quantization
$$\hat{x}_q = Q^{-1}(q), \tag{2.5}$$
which is also referred to as backward quantization. Since $q$ is a function of $x$ as shown in (2.4), $\hat{x}_q$ is also a function of $x$ and can be written as:
$$\hat{x}(x) = \hat{x}_q = Q^{-1}[Q(x)]. \tag{2.6}$$
This quantization scheme is called scalar quantization (SQ) because the source signal is quantized one sample at a time. The function in (2.6) is another approach to describing the input–output map of a quantizer, in addition to the quantization table. Figure 2.2 shows such a function, which describes the quantization map of Table 2.2. The quantization operation in (2.4) obviously causes much loss of information, so the reconstructed quantized value obtained in (2.5) or (2.6) is different from the input to the quantizer. The difference between them is called quantization error
$$q(x) = \hat{x}(x) - x. \tag{2.7}$$
It is also referred to as quantization distortion or quantization noise. Equation (2.7) may be rewritten as
$$\hat{x}(x) = x + q(x), \tag{2.8}$$
so the quantization process is often modeled as an additive noise process as shown in Fig. 2.3.
Fig. 2.2 Input–output map for the quantizer shown in Table 2.2
Fig. 2.3 Additive noise model for quantization
The average loss of information introduced by quantization may be characterized by average quantization error. Among the many norms that may be used to measure this error, the L-2 norm or Euclidean distance is usually used and is called mean squared quantization error (MSQE):
$$\sigma_q^2 = \int_{-\infty}^{\infty} q^2(x)\, p(x)\, dx = \int_{-\infty}^{\infty} \left(\hat{x}(x) - x\right)^2 p(x)\, dx = \sum_{q=1}^{M} \int_{b_{q-1}}^{b_q} \left(\hat{x}(x) - x\right)^2 p(x)\, dx. \tag{2.9}$$
Since $\hat{x}(x) = \hat{x}_q$ is a constant within the decision interval $[b_{q-1}, b_q)$, we have
$$\sigma_q^2 = \sum_{q=1}^{M} \int_{b_{q-1}}^{b_q} \left(\hat{x}_q - x\right)^2 p(x)\, dx. \tag{2.10}$$
The MSQE may be better appreciated when compared with the power of the source signal. This may be achieved using the signal-to-noise ratio (SNR) defined below
$$\text{SNR (dB)} = 10 \log_{10}\left(\frac{\sigma_x^2}{\sigma_q^2}\right), \tag{2.11}$$
where $\sigma_x^2$ is the variance of the source signal. It is obvious that the smaller the decision intervals, the smaller the error term $(\hat{x}_q - x)^2$ in (2.10), and thus the smaller the mean squared quantization error $\sigma_q^2$. This indicates that $\sigma_q^2$ decreases as the number of decision intervals $M$ increases. The placement of each individual decision boundary and quantized value also plays a major role in the final $\sigma_q^2$. The problem of quantizer design may be posed in a variety of ways, including:
Given a fixed $M$:
$$M = \text{Constant}, \tag{2.12}$$
find the optimal placement of decision boundaries and quantized values so that $\sigma_q^2$ is minimized. This is the most widely used approach.
Given a distortion constraint:
$$\sigma_q^2 < \text{Threshold}, \tag{2.13}$$
find the optimal placement of decision boundaries and quantized values so that $M$ is minimized. A minimal $M$ means a minimal number of bits needed to represent the quantized value, hence a minimal bit rate.
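The MSQE in (2.10) and the SNR in (2.11) can be evaluated numerically for any quantizer and PDF. The sketch below is a minimal illustration using midpoint-rule integration; the function names and the choice of test case (the interior intervals of Table 2.2, with the outer intervals closed off at -3.5 and 3.5, and a uniform PDF over that range) are assumptions made for illustration.

```python
import math


def msqe(boundaries, values, pdf, steps=2000):
    """Evaluate (2.10) by midpoint-rule integration over each decision interval.

    boundaries : b_0 ... b_M (finite), values : x_hat_1 ... x_hat_M
    """
    total = 0.0
    for q in range(1, len(boundaries)):
        lo, hi = boundaries[q - 1], boundaries[q]
        h = (hi - lo) / steps
        for k in range(steps):
            x = lo + (k + 0.5) * h
            total += (values[q - 1] - x) ** 2 * pdf(x) * h
    return total


def snr_db(signal_variance, noise_variance):
    """SNR in dB as defined in (2.11)."""
    return 10.0 * math.log10(signal_variance / noise_variance)


boundaries = [-3.5 + k for k in range(8)]   # -3.5, -2.5, ..., 3.5
values = list(range(-3, 4))                 # -3, ..., 3
noise = msqe(boundaries, values, lambda x: 1.0 / 7.0)
print(noise, snr_db(49.0 / 12.0, noise))    # signal variance of U(-3.5, 3.5) is 49/12
```

For this uniform case the result can be checked by hand: each of the seven unit-width intervals contributes $(1/7)(1/12)$, so the MSQE is $1/12$.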
2.2 Re-Quantization
The quantization process was presented above with the assumption that the source random variable or sample values are continuous or analog. The name quantization usually gives the impression that it applies only to analog sample values. When dealing with such analog source sample values, the associated forward quantization is referred to as ADC (analog-to-digital conversion) and the inverse quantization as DAC (digital-to-analog conversion).
Table 2.3 A quantization table for “re-quantizing” a discrete source
Decision interval    Quantization index    Re-quantized value
[0, 10)              0                     5
[10, 20)             1                     15
[20, 30)             2                     25
[30, 40)             3                     35
[40, 50)             4                     45
[50, 60)             5                     55
[60, 70)             6                     65
[70, 80)             7                     75
[80, 90)             8                     85
[90, 100)            9                     95
Discrete source sample values can also be further quantized. For example, consider a source that takes integer sample values from 0 to 100. If it is decided, for some reason, that this resolution is excessive or irrelevant for a particular application and that sample values spaced at an interval of 10 are all that is needed, the quantization table shown in Table 2.3 can be established to re-quantize the integer sample values. With discrete source sample values, the formulation of the quantization process in Sect. 2.1 is still valid, with the probability density function replaced by the probability mass function and integration replaced by summation.
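Because the decision intervals of Table 2.3 are all 10 wide, the table lookup reduces to integer arithmetic. The sketch below is a minimal illustration for integer inputs within the range the table covers, 0 through 99; the names `STEP` and `requantize` are hypothetical.

```python
STEP = 10  # width of every decision interval in Table 2.3


def requantize(sample):
    """Return (quantization index, re-quantized value) for an integer in [0, 100)."""
    index = sample // STEP               # forward quantization
    value = index * STEP + STEP // 2     # inverse quantization: interval midpoint
    return index, value


for s in (0, 37, 99):
    print(s, requantize(s))
```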
2.3 Uniform Quantization
Quantization Tables 2.2 and 2.3 both embody uniform quantization, which is the simplest among all quantization schemes. The decision boundaries of a uniform quantizer are equally spaced, so its decision intervals are all of the same length and can be represented by a constant called the quantization step size. For example, the quantization step size for Table 2.2 is 1 and for Table 2.3 is 10. When an analog signal is uniformly sampled and subsequently quantized using a uniform quantizer, the resulting digital representation is called pulse-code modulation (PCM). It is the default form of representation for many digital signals, such as speech, audio, and video.
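2.3.1 Formulation
Let us consider a uniform quantizer that covers an interval $[X_{\min}, X_{\max}]$ of a random variable $X$ with $M$ decision intervals. Since its quantization step size is
$$\Delta = \frac{X_{\max} - X_{\min}}{M}, \tag{2.14}$$
its decision boundaries can be represented as
$$b_q = X_{\min} + \Delta q, \quad q = 0, 1, \ldots, M. \tag{2.15}$$
The midpoint of a decision interval is often selected as the quantized value for that interval:
$$\hat{x}_q = X_{\min} + \Delta q - 0.5\Delta, \quad q = 1, 2, \ldots, M. \tag{2.16}$$
For such a quantization scheme, the MSQE in (2.10) becomes
$$\sigma_q^2 = \sum_{q=1}^{M} \int_{X_{\min} + \Delta(q-1)}^{X_{\min} + \Delta q} \left(X_{\min} + \Delta q - 0.5\Delta - x\right)^2 p(x)\, dx. \tag{2.17}$$
Let $y = X_{\min} + \Delta q - 0.5\Delta - x$; (2.17) becomes
$$\sigma_q^2 = \sum_{q=1}^{M} \int_{-0.5\Delta}^{0.5\Delta} y^2\, p\left[X_{\min} + \Delta q - (y + 0.5\Delta)\right] dy. \tag{2.18}$$
Plugging in (2.15) and renaming the integration variable to $x$, (2.18) becomes
$$\sigma_q^2 = \sum_{q=1}^{M} \int_{-0.5\Delta}^{0.5\Delta} x^2\, p\left[b_q - (x + 0.5\Delta)\right] dx. \tag{2.19}$$
Plugging in (2.16), (2.18) becomes
$$\sigma_q^2 = \sum_{q=1}^{M} \int_{-0.5\Delta}^{0.5\Delta} x^2\, p(\hat{x}_q - x)\, dx. \tag{2.20}$$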
2.3.2 Midtread and Midrise Quantizers
There are two major types of uniform quantizers. The one shown in Fig. 2.2 is called midtread because it has zero as one of its quantized values. It is useful for situations where it is necessary for the zero value to be represented. One such example is control systems, where a zero value needs to be accurately represented. This is also important for audio signals because the zero value is needed to represent absolute quiet. Due to the midtreading of zero, the number of decision intervals ($M$) is odd if a symmetric sample value range ($X_{\min} = -X_{\max}$) is to be covered. Since both the decision boundaries and the quantized values can be represented by a single step size, the implementation of the midtread uniform quantizer is simple and straightforward. The forward quantizer may be implemented as
$$q = \operatorname{round}\left(\frac{x}{\Delta}\right), \tag{2.21}$$
where $\operatorname{round}(\cdot)$ is the rounding function, which returns the integer that is closest to the input. The corresponding inverse quantizer may be implemented as
$$\hat{x}_q = q\Delta. \tag{2.22}$$
The other uniform quantizer does not have zero as one of its quantized values, so it is called midrise. This is shown in Fig. 2.4. Its number of decision intervals is even if a symmetric sample value range is to be covered. The forward quantizer may be implemented as
$$q = \begin{cases} \operatorname{truncate}\left(\dfrac{x}{\Delta}\right) + 1, & \text{if } x > 0; \\[2mm] \operatorname{truncate}\left(\dfrac{x}{\Delta}\right) - 1, & \text{otherwise}; \end{cases} \tag{2.23}$$
where $\operatorname{truncate}(\cdot)$ is the truncate function, which returns the integer part of the input, without the fractional digits. Note that $q = 0$ is forbidden for a midrise quantizer. The corresponding inverse quantizer is expressed below
$$\hat{x}_q = \begin{cases} (q - 0.5)\Delta, & \text{if } q > 0; \\ (q + 0.5)\Delta, & \text{otherwise}. \end{cases} \tag{2.24}$$
Fig. 2.4 An example of midrise quantizer
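Equations (2.21) through (2.24) translate almost directly into code. The sketch below is a minimal illustration with assumed function names. One deliberate deviation: Python's built-in `round` resolves ties to the nearest even integer, so the midtread forward quantizer uses `floor(x / step + 0.5)` instead, which sends ties upward and matches the half-open decision intervals of (2.3).

```python
import math


def midtread_forward(x, step):
    """(2.21): nearest-integer index; ties go upward to match [b_{q-1}, b_q)."""
    return math.floor(x / step + 0.5)


def midtread_inverse(q, step):
    """(2.22): quantized value is the index times the step size."""
    return q * step


def midrise_forward(x, step):
    """(2.23): truncate toward zero, then step away from zero; q = 0 never occurs."""
    t = math.trunc(x / step)
    return t + 1 if x > 0 else t - 1


def midrise_inverse(q, step):
    """(2.24): quantized values sit at odd multiples of half the step size."""
    return (q - 0.5) * step if q > 0 else (q + 0.5) * step


for x in (-0.3, 0.3, 1.2):
    qt = midtread_forward(x, 1.0)
    qr = midrise_forward(x, 1.0)
    print(x, midtread_inverse(qt, 1.0), midrise_inverse(qr, 1.0))
```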
2.3.3 Uniformly Distributed Signals
As seen in (2.20), the MSQE of a uniform quantizer depends on the probability density function. When this density function is uniformly distributed over $[X_{\min}, X_{\max}]$:
$$p(x) = \frac{1}{X_{\max} - X_{\min}}, \quad x \in [X_{\min}, X_{\max}], \tag{2.25}$$
(2.20) becomes
$$\sigma_q^2 = \frac{1}{X_{\max} - X_{\min}} \sum_{q=1}^{M} \int_{-0.5\Delta}^{0.5\Delta} x^2\, dx = \frac{1}{X_{\max} - X_{\min}} \sum_{q=1}^{M} \frac{\Delta^3}{12} = \frac{M \Delta^3}{12\,(X_{\max} - X_{\min})}.$$
Due to the step size given in (2.14), the above equation becomes
$$\sigma_q^2 = \frac{\Delta^2}{12}. \tag{2.26}$$
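For the uniform distribution in (2.25), assuming a zero-mean signal ($X_{\min} = -X_{\max}$), its variance (signal power) is
$$\sigma_x^2 = \frac{1}{X_{\max} - X_{\min}} \int_{X_{\min}}^{X_{\max}} x^2\, dx = \frac{(X_{\max} - X_{\min})^2}{12}, \tag{2.27}$$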
so the signal-to-noise ratio (SNR) of the uniform quantizer is
$$\text{SNR (dB)} = 10 \log_{10}\left(\frac{\sigma_x^2}{\sigma_q^2}\right) = 10 \log_{10}\left(\frac{(X_{\max} - X_{\min})^2}{12} \cdot \frac{12}{\Delta^2}\right) = 20 \log_{10}\left(\frac{X_{\max} - X_{\min}}{\Delta}\right). \tag{2.28}$$
Due to the step size given in (2.14), the above SNR expression becomes
$$\text{SNR (dB)} = 20 \log_{10}(M) = \frac{20}{\log_2(10)} \log_2(M) \approx 6.02 \log_2(M). \tag{2.29}$$
If the quantization indexes are represented using fixed-length codes, each codeword can be represented using
$$R = \operatorname{ceil}\left[\log_2(M)\right] \text{ bits}, \tag{2.30}$$
which is referred to as bits per sample or bit rate. Consequently, when $M$ is a power of 2, (2.29) becomes
$$\text{SNR (dB)} = \frac{20}{\log_2(10)} R \approx 6.02 R \text{ dB}, \tag{2.31}$$
which indicates that, for each additional bit allocated to the quantizer, the SNR is increased by about 6.02 dB.
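The 6.02 dB per bit rule can be checked empirically. The sketch below is a minimal experiment under stated assumptions: the "source" is a dense, evenly spaced grid of sample values over [-1, 1), standing in for a uniformly distributed signal, and it is quantized with $M = 2^R$ equal intervals whose midpoints are the quantized values, as in (2.16). The function name and the grid size are illustrative.

```python
import math


def uniform_quantizer_snr_db(bits, num_samples=1 << 14):
    """Measured SNR (dB) of an M = 2**bits level uniform quantizer over [-1, 1)."""
    m = 1 << bits
    step = 2.0 / m                        # (2.14) with X_min = -1, X_max = 1
    signal_power = 0.0
    noise_power = 0.0
    for k in range(num_samples):
        x = -1.0 + (k + 0.5) * 2.0 / num_samples   # evenly spaced "uniform" samples
        q = min(int((x + 1.0) / step), m - 1)      # zero-based interval index
        x_hat = -1.0 + (q + 0.5) * step            # interval midpoint, as in (2.16)
        signal_power += x * x
        noise_power += (x_hat - x) ** 2
    return 10.0 * math.log10(signal_power / noise_power)


for bits in range(1, 9):
    print(bits, round(uniform_quantizer_snr_db(bits), 2), round(6.02 * bits, 2))
```

The second and third printed columns should agree closely, with the measured SNR growing by about 6.02 dB for each additional bit.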
2.3.4 Nonuniformly Distributed Signals
Most signals, and audio signals in particular, are rarely uniformly distributed. As indicated by (2.20), the contribution of each quantization error to the MSQE is weighted by the probability density function. A nonuniform distribution means that the weighting is different now, so a different MSQE is expected and is discussed in this section.
2.3.4.1 Granular and Overload Error
A nonuniformly distributed signal, such as Gaussian, is usually not bounded, so the dynamic range $[X_{\min}, X_{\max}]$ of a uniform quantizer cannot cover the whole range of the source signal. This is illustrated in Fig. 2.5. The areas beyond $[X_{\min}, X_{\max}]$ are called overload areas. When a source sample falls into an overload area, the quantizer can only assign either the minimum or the maximum quantized value to it:
$$\hat{x}(x) = \begin{cases} X_{\max} - 0.5\Delta, & \text{if } x > X_{\max}; \\ X_{\min} + 0.5\Delta, & \text{if } x < X_{\min}. \end{cases} \tag{2.32}$$
Fig. 2.5 Overload and granular quantization errors
This introduces additional quantization error, called overload error or overload noise. The mean squared overload error is
$$\sigma_{q(\text{overload})}^2 = \int_{X_{\max}}^{\infty} \left[x - (X_{\max} - 0.5\Delta)\right]^2 p(x)\, dx + \int_{-\infty}^{X_{\min}} \left[x - (X_{\min} + 0.5\Delta)\right]^2 p(x)\, dx. \tag{2.33}$$
The MSQE given in (2.17) only accounts for quantization error within $[X_{\min}, X_{\max}]$, which is referred to as granular error or granular noise. The total quantization error is
$$\sigma_{q(\text{total})}^2 = \sigma_q^2 + \sigma_{q(\text{overload})}^2. \tag{2.34}$$
For a given PDF $p(x)$ and number of decision intervals $M$, (2.17) indicates that the smaller the quantization step size $\Delta$ is, the smaller the granular quantization noise $\sigma_q^2$ becomes. According to (2.14), however, a smaller quantization step size also translates into a narrower range $[X_{\min}, X_{\max}]$ for a fixed $M$. A narrower range obviously leads to larger overload areas, hence a larger overload quantization error $\sigma_{q(\text{overload})}^2$. Therefore, the choice of $\Delta$, or equivalently the range $[X_{\min}, X_{\max}]$ of the uniform quantizer, represents a trade-off between granular and overload quantization errors. This trade-off is, of course, relative to the effective width of the given PDF, which may be characterized by its standard deviation $\sigma$. The ratio of the quantization range $[X_{\min}, X_{\max}]$ over the signal standard deviation
$$F_l = \frac{X_{\max} - X_{\min}}{\sigma}, \tag{2.35}$$
called the loading factor, is a good description of this trade-off. For a Gaussian distribution, a loading factor of 4 means that the probability of input samples going beyond the range is 0.045. For a loading factor of 6, the probability reduces to 0.0027. For most applications, $4\sigma$ loading is sufficient.
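The two probabilities quoted above follow from the Gaussian tail formula. The sketch below is a minimal check, assuming a zero-mean Gaussian source and a symmetric range, so that a loading factor $F_l$ corresponds to the range $[-0.5 F_l \sigma, 0.5 F_l \sigma]$; the function name is illustrative.

```python
import math


def overload_probability(loading_factor):
    """P(|x| > 0.5 * F_l * sigma) for a zero-mean Gaussian source.

    For a standard normal variable, P(|z| > t) = erfc(t / sqrt(2)).
    """
    half_range_in_sigmas = loading_factor / 2.0
    return math.erfc(half_range_in_sigmas / math.sqrt(2.0))


for f in (4, 6):
    print(f, overload_probability(f))
```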
2.3.4.2 Optimal SNR and Step Size
To find the optimal quantization step size that gives the minimum total MSQE $\sigma_{q(\text{total})}^2$, let us substitute (2.17) and (2.33) into (2.34) to obtain
$$\sigma_{q(\text{total})}^2 = \sum_{q=1}^{M} \int_{X_{\min} + \Delta(q-1)}^{X_{\min} + \Delta q} \left[x - (X_{\min} + \Delta q - 0.5\Delta)\right]^2 p(x)\, dx + \int_{X_{\max}}^{\infty} \left[x - (X_{\max} - 0.5\Delta)\right]^2 p(x)\, dx + \int_{-\infty}^{X_{\min}} \left[x - (X_{\min} + 0.5\Delta)\right]^2 p(x)\, dx. \tag{2.36}$$
Usually, a uniform quantizer is symmetrically designed such that
$$X_{\min} = -X_{\max}. \tag{2.37}$$
Then (2.14) becomes
$$\Delta = \frac{2 X_{\max}}{M}. \tag{2.38}$$
Replacing all $X_{\min}$ and $X_{\max}$ with $\Delta$ using the above equations, we have
$$\sigma_{q(\text{total})}^2 = \sum_{q=1}^{M} \int_{(q-1-0.5M)\Delta}^{(q-0.5M)\Delta} \left[(q - 0.5 - 0.5M)\Delta - x\right]^2 p(x)\, dx + \int_{0.5M\Delta}^{\infty} \left[x - 0.5(M-1)\Delta\right]^2 p(x)\, dx + \int_{-\infty}^{-0.5M\Delta} \left[x + 0.5(M-1)\Delta\right]^2 p(x)\, dx. \tag{2.39}$$
Assuming a symmetric PDF:
$$p(x) = p(-x) \tag{2.40}$$
and doing a variable change of $y = -x$ in the last term of (2.39), it turns out that this last term becomes the same as the second term, so (2.39) becomes
$$\sigma_{q(\text{total})}^2 = \sum_{q=1}^{M} \int_{(q-1-0.5M)\Delta}^{(q-0.5M)\Delta} \left[(q - 0.5 - 0.5M)\Delta - x\right]^2 p(x)\, dx + 2 \int_{0.5M\Delta}^{\infty} \left[x - 0.5(M-1)\Delta\right]^2 p(x)\, dx. \tag{2.41}$$
Now that both (2.39) and (2.41) are functions of $\Delta$ only, their minimum can be found by setting their respective first-order derivative with respect to $\Delta$ to zero:
$$\frac{\partial}{\partial \Delta} \sigma_{q(\text{total})}^2 = 0. \tag{2.42}$$
This equation can be solved using a variety of numerical methods; see [76], for example. Figure 2.6 shows the optimal SNR achieved by a uniform quantizer at various bits per sample (see (2.30)) for Gaussian, Laplacian, and Gamma distributions [33]. The SNR given in (2.31) for the uniform distribution, which is the best SNR that a uniform quantizer can achieve, is plotted as the benchmark. It is a straight line in the form of
$$\text{SNR}(R) = a + bR \text{ (dB)}, \tag{2.43}$$
with a slope of
$$b = \frac{20}{\log_2(10)} \approx 6.02 \tag{2.44}$$
and an intercept of
$$a = 0. \tag{2.45}$$
Fig. 2.6 Optimal SNR achieved by a uniform quantizer for uniform, Gaussian, Laplacian, and Gamma distributions (horizontal axis: bits per sample, 1 to 8; vertical axis: optimal SNR in dB, 0 to 50)
Apparently, the curves for other PDFs also seem to fit a straight line with different slopes and intercepts. Notice that both the slope $b$ and the intercept $a$ decrease as the peakedness or kurtosis of the PDF increases in the order of uniform, Gaussian, Laplacian, and Gamma, indicating that the overall performance of a uniform quantizer is inversely related to PDF kurtosis. This degradation in performance is mostly reflected in the intercept $a$. The slope $b$ is only moderately affected. There is, nevertheless, a reduction in slope when compared with the uniform distribution. This reduction indicates that the quantization performance for other distributions relative to the uniform distribution becomes worse at higher bit rates. Figure 2.7 shows the optimal step size normalized by the signal standard deviation, $\Delta_{\text{opt}}/\sigma_x$, for Gaussian, Laplacian, and Gamma distributions as a function of the number of bits per sample [33]. The data for the uniform distribution is used as the benchmark. Due to (2.14), (2.27) and (2.30), the normalized quantization step size for the uniform distribution is
$$\log_{10}\frac{\Delta}{\sigma_x} = \log_{10}\left(2\sqrt{3}\right) - \frac{\log_2(M)}{\log_2 10} = \log_{10}\left(2\sqrt{3}\right) - \frac{R}{\log_2 10}, \tag{2.46}$$
so it is a straight line. Apparently, as the peakedness or kurtosis increases in the order of uniform, Gaussian, Laplacian, and Gamma distributions, the step size also increases. This is necessary for optimal balance between granular and overload quantization errors: an increased kurtosis means that the probability density is spread more toward the tails, resulting in more overload error, so the step size has to be increased to counteract this increased overload error.
Fig. 2.7 Optimal step size used by a uniform quantizer to achieve optimal SNR for uniform, Gaussian, Laplacian, and Gamma distributions (horizontal axis: bits per sample, 1 to 8; vertical axis: optimal step size, logarithmic scale from 10^-2 to 10^0)
The empirical formula (2.43) is very useful for estimating the minimal total MSQE for a particular quantizer, given the signal power and bit rate. In particular, substituting the SNR definition (2.11) into (2.43), we can represent the total MSQE as
$$10 \log_{10} \sigma_q^2 = 10 \log_{10} \sigma_x^2 - a - bR \tag{2.47}$$
or
$$\sigma_q^2 = 10^{-0.1(a + bR)}\, \sigma_x^2. \tag{2.48}$$
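As a quick worked use of (2.48), the sketch below estimates the MSQE from the signal power and bit rate. The default values of `a` and `b` are those of the uniform-distribution benchmark in (2.44) and (2.45); for other distributions they would have to be read off a fitted line such as those in Fig. 2.6, which is not done here. The function name is illustrative.

```python
def estimated_msqe(signal_power, bits, a=0.0, b=6.02):
    """Estimate the total MSQE from (2.48) for given slope b and intercept a (dB)."""
    return 10.0 ** (-0.1 * (a + b * bits)) * signal_power


# Each additional bit divides the estimated noise power by about 4.
for bits in (0, 1, 2, 8):
    print(bits, estimated_msqe(1.0, bits))
```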
2.4 Nonuniform Quantization
Since the MSQE formula (2.10) indicates that the quantization error incurred by a source sample $x$ is weighted by the PDF $p(x)$, one approach to reducing the MSQE is to reduce quantization error in densely distributed areas where the weight is heavy. Formula (2.10) also indicates that the quantization error incurred by a source sample value $x$ is actually the distance between it and the quantized value $\hat{x}$, so large quantization errors are caused by input samples far away from the quantized value, i.e., those that are near the decision boundaries. Therefore, reducing quantization errors in densely distributed areas necessitates using smaller decision intervals. For a given number of decision intervals $M$, this also means that larger decision intervals need to be placed over the rest of the PDF support so that the whole input range is covered. From the perspective of resource allocation, each quantization index is a piece of bit resource that is allocated in the course of quantizer design, and there are only $M$ pieces of resources. A quantization index is associated one-to-one with a quantized value and a decision interval, so a piece of resource is considered as consisting of a quantization index, a quantized value, and a decision interval. The problem of quantizer design may be posed as optimal allocation of these resources to minimize the total MSQE. To achieve this, each piece of resource should be allocated to carry the same share of quantization error contribution to the total MSQE. In other words, the MSQE contribution carried by individual pieces of resources should be “equalized”. For a uniform quantizer, its resources are allocated uniformly, except for the first and last quantized values, which cover the overload areas. As shown at the top of Fig. 2.8, its resources in the tail areas of the PDF are not fully utilized because low probability density or weight causes them to carry too little MSQE contribution.
Similarly, its resources in the head area are over-utilized because high probability density or weight causes them to carry too much MSQE contribution. To reduce the overall MSQE, those misallocated resources need to be redistributed in such a way that the MSQE contributions produced by individual pieces of resource are equalized. This is shown at the bottom of Fig. 2.8.
[Figure 2.8. Top panel: PDF versus x, resources are uniformly allocated for a uniform quantizer. Bottom panel: PDF versus x, resources are nonuniformly allocated for a nonuniform quantizer.]
Fig. 2.8 Quantization resources are under-utilized by the uniform quantizer (top) in the tail areas and over-utilized in the head area of the PDF. These resources are re-distributed in the nonuniform quantizer (bottom) so that individual pieces of resources carry the same amount of MSQE contribution, leading to smaller MSQE
The above two considerations indicate that the MSQE can be reduced by making the size of each decision interval inversely proportional to the probability density. The consequence of this strategy is that the more densely distributed the PDF is, the more densely placed the decision intervals can be, and thus the smaller the MSQE becomes. One approach to nonuniform quantizer design is to pose it as an optimization problem: finding the quantization intervals and quantized values that minimize the MSQE. This leads to the Lloyd-Max algorithm. Another approach is to transform the source signal through a nonlinear function in such a way that the transformed signal has a PDF that is almost uniform, so that a uniform quantizer may be used to deliver improved performance. This leads to companding.
2.4.1 Optimal Quantization and Lloyd-Max Algorithm

Given a PDF $p(x)$ and a number of decision intervals $M$, one approach to the design of a nonuniform quantizer is to find the set of decision boundaries $\{b_q\}_0^M$ and quantized values $\{\hat{x}_q\}_1^M$ such that the MSQE in (2.10) is minimized. Towards the solution of this optimization problem, let us first consider the following partial derivative
$$\frac{\partial \sigma_q^2}{\partial \hat{x}_q} = 2\int_{b_{q-1}}^{b_q} (\hat{x}_q - x)\,p(x)\,dx = 2\hat{x}_q \int_{b_{q-1}}^{b_q} p(x)\,dx - 2\int_{b_{q-1}}^{b_q} x\,p(x)\,dx. \tag{2.49}$$
Setting it to zero, we have
$$\hat{x}_q = \frac{\int_{b_{q-1}}^{b_q} x\,p(x)\,dx}{\int_{b_{q-1}}^{b_q} p(x)\,dx}, \tag{2.50}$$
which indicates that the quantized value for each decision interval is the centroid of the probability mass in the interval. Let us now consider another partial derivative
$$\frac{\partial \sigma_q^2}{\partial b_q} = (\hat{x}_q - b_q)^2\,p(b_q) - (\hat{x}_{q+1} - b_q)^2\,p(b_q). \tag{2.51}$$
Setting it to zero, we have
$$b_q = \frac{1}{2}\,(\hat{x}_q + \hat{x}_{q+1}), \tag{2.52}$$
which indicates that the decision boundary is simply the midpoint of the neighboring quantized values.

Solving (2.50) and (2.52) would give us the optimal set of decision boundaries $\{b_q\}_0^M$ and quantized values $\{\hat{x}_q\}_1^M$ that minimizes $\sigma_q^2$. Unfortunately, to solve (2.50) for $\hat{x}_q$ we need $b_{q-1}$ and $b_q$, but to solve (2.52) for $b_q$ we need $\hat{x}_q$ and $\hat{x}_{q+1}$. The two sets of equations are coupled, so the problem cannot be solved directly.
2.4.1.1 Uniform Quantizer as a Special Case

Let us consider a simple case where the probability distribution is uniform as given in (2.25). For such a distribution, (2.50) becomes
$$\hat{x}_q = \frac{b_{q-1} + b_q}{2}. \tag{2.53}$$
Incrementing $q$ for this equation, we have
$$\hat{x}_{q+1} = \frac{b_q + b_{q+1}}{2}. \tag{2.54}$$
Substituting (2.53) and (2.54) into (2.52), we have
$$4b_q = b_{q-1} + b_q + b_q + b_{q+1}, \tag{2.55}$$
which leads us to
$$b_{q+1} - b_q = b_q - b_{q-1}. \tag{2.56}$$
Let us denote
$$b_q - b_{q-1} = \Delta; \tag{2.57}$$
plugging it into (2.56), we have
$$b_{q+1} - b_q = \Delta. \tag{2.58}$$
Therefore, we can conclude by induction on $q$ that all decision boundaries are uniformly spaced. For quantized values, let us subtract (2.53) from (2.54) to give
$$\hat{x}_{q+1} - \hat{x}_q = \frac{b_{q+1} - b_q + b_q - b_{q-1}}{2}. \tag{2.59}$$
Plugging in (2.57) and (2.58), we have
$$\hat{x}_{q+1} - \hat{x}_q = \Delta, \tag{2.60}$$
which indicates that the quantized values are also uniformly spaced. Therefore, the uniform quantizer is optimal for a uniform distribution.
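The conclusion above can be checked numerically. The sketch below is only an illustration under stated assumptions: the uniform PDF is taken over [-1, 1], the number of intervals is 8, and the centroid integrals of (2.50) are approximated by a midpoint rule. It confirms that a uniform quantizer satisfies both the centroid condition (2.50) and the midpoint condition (2.52).

```python
def centroid(pdf, lo, hi, n=10_000):
    """Centroid of pdf over [lo, hi] as in (2.50), via the midpoint rule."""
    h = (hi - lo) / n
    xs = [lo + (i + 0.5) * h for i in range(n)]
    num = sum(x * pdf(x) for x in xs)
    den = sum(pdf(x) for x in xs)
    return num / den

X_MAX, M = 1.0, 8                        # assumed support [-1, 1] and 8 intervals

def uniform_pdf(x):
    return 1.0 / (2.0 * X_MAX)

step = 2.0 * X_MAX / M
boundaries = [-X_MAX + q * step for q in range(M + 1)]

# Centroid condition (2.50): one quantized value per decision interval.
levels = [centroid(uniform_pdf, boundaries[q], boundaries[q + 1]) for q in range(M)]

# Midpoint condition (2.52): should reproduce the interior boundaries.
midpoints = [0.5 * (levels[q] + levels[q + 1]) for q in range(M - 1)]

print(levels)
print(midpoints)
```

The recomputed midpoints coincide with the interior boundaries, and neighboring quantized values are one step apart, as (2.58) and (2.60) state.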
2.4.1.2 Lloyd-Max Algorithm

The Lloyd-Max algorithm is an iterative procedure for solving (2.50) and (2.52) for an arbitrary distribution, so an optimal quantizer is also referred to as a Lloyd-Max quantizer. Note that its convergence is not proven, but only experimentally found. Before presenting the algorithm, let us first note that we already know the first and last decision boundaries:
$$b_0 = X_{\min} \quad \text{and} \quad b_M = X_{\max}. \tag{2.61}$$
For unbounded inputs, we may set $X_{\min} = -\infty$ and/or $X_{\max} = \infty$. Also, we rearrange (2.52) into
$$\hat{x}_{q+1} = 2b_q - \hat{x}_q. \tag{2.62}$$
The algorithm involves the following iterative steps:
1. Make a guess for $\hat{x}_1$.
2. Let $q = 1$.
3. Plug $\hat{x}_q$ and $b_{q-1}$ into (2.50) to solve for $b_q$. This may be done by integrating the two integrals in (2.50) forward from $b_{q-1}$ until the equation holds.
4. Plug $\hat{x}_q$ and $b_q$ into (2.62) to get a new $\hat{x}_{q+1}$.
5. Let $q = q + 1$.
6. Go back to step 3 unless $q = M$.
7. When $q = M$, calculate
$$\Delta = \hat{x}_M - \frac{\int_{b_{M-1}}^{b_M} x\,p(x)\,dx}{\int_{b_{M-1}}^{b_M} p(x)\,dx}. \tag{2.63}$$
8. Stop if
$$|\Delta| < \text{predetermined threshold}. \tag{2.64}$$
9. Decrease $\hat{x}_1$ if $\Delta > 0$ and increase $\hat{x}_1$ otherwise.
10. Go back to step 2.

A little explanation is in order for (2.63). The iterative procedure provides us with an $\hat{x}_M$ upon entering step 7, which is used as the first term on the right side of (2.63). On the other hand, since we know $b_M$ from (2.61), we can use it with the $b_{M-1}$ provided by the procedure to obtain another estimate of $\hat{x}_M$ using (2.50). This is given as the second term on the right side of (2.63). The two estimates of the same $\hat{x}_M$ should be equal if equations (2.50) and (2.52) are solved. Therefore, we stop the iteration at step 8 when the absolute value of their difference is smaller than some predetermined threshold.

The adjustment procedure for $\hat{x}_1$ at step 9 can also be easily explained. The iterative procedure is started with a guess for $\hat{x}_1$ at step 1. Based on this guess, a whole set of decision boundaries $\{b_q\}_0^M$ and quantized values $\{\hat{x}_q\}_1^M$ is obtained from step 2 through step 8. If the guess is off, the whole set derived from it is off. In particular, if the guess is too large, the resulting $\hat{x}_M$ will be too large. This will cause $\Delta > 0$, so $\hat{x}_1$ needs to be reduced; and vice versa.
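The ten steps above can be sketched in code as follows. This is a minimal illustration, not a reference implementation, and it rests on several assumptions that are not part of the text: the source is a zero-mean, unit-variance Gaussian (so the integrals in (2.50) have closed forms), step 3 is solved by bisection instead of forward integration, the adjustment of step 9 is done by bisection on the first quantized value over an assumed bracket of [-4, -0.001], and boundaries are capped at 6 standard deviations. All function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def mass(a, b):
    """Probability that a unit-variance Gaussian sample falls in [a, b]."""
    return 0.5 * (math.erf(b / SQRT2) - math.erf(a / SQRT2))

def phi(t):
    """Unit-variance Gaussian PDF; evaluates to 0.0 at +/- infinity."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def centroid(a, b):
    """Right side of (2.50) for the Gaussian PDF."""
    m = mass(a, b)
    if m <= 0.0:                 # interval too deep in the tail to resolve
        return 0.5 * (a + b)
    return (phi(a) - phi(b)) / m

def solve_boundary(b_prev, x_hat, b_cap=6.0, iters=80):
    """Step 3: find b_q whose interval [b_prev, b_q] has centroid x_hat."""
    if centroid(b_prev, b_cap) < x_hat:
        return None              # x_hat too large: no such boundary exists
    lo, hi = x_hat, b_cap
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if centroid(b_prev, mid) < x_hat:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sweep(x1, M):
    """Steps 2 to 7: propagate a guess for the first quantized value."""
    b, x_hat = [-math.inf], [x1]
    for _ in range(1, M):
        bq = solve_boundary(b[-1], x_hat[-1])
        if bq is None:
            return math.inf, b, x_hat        # treated as "guess too large"
        b.append(bq)
        x_hat.append(2.0 * bq - x_hat[-1])   # (2.62)
    b.append(math.inf)                       # b_M from (2.61)
    delta = x_hat[-1] - centroid(b[-2], b[-1])   # (2.63)
    return delta, b, x_hat

def lloyd_max_gaussian(M, threshold=1e-9):
    """Steps 1, 8, 9, 10, with bisection on the first quantized value."""
    lo, hi = -4.0, -1e-3         # assumed bracket, adequate for small M
    for _ in range(200):
        x1 = 0.5 * (lo + hi)
        delta, b, x_hat = sweep(x1, M)
        if abs(delta) < threshold:
            break
        if delta > 0.0:
            hi = x1              # guess too large: decrease it
        else:
            lo = x1              # guess too small: increase it
    return b, x_hat

boundaries, levels = lloyd_max_gaussian(4)
print(boundaries)
print(levels)
```

For M = 4 this should land close to the widely tabulated four-level Gaussian quantizer, with quantized values near ±0.453 and ±1.510 and interior boundaries near 0 and ±0.982.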
2.4.1.3 Performance Gain

Figure 2.9 shows the optimal SNR achieved by the Lloyd-Max algorithm for uniform, Gaussian, Laplacian, and Gamma distributions against the number of bits per sample [33]. Since the uniform quantizer is optimal for a uniform distribution, its optimal SNR curve in Fig. 2.9 is the same as in Fig. 2.6 and can thus serve as the reference. Notice that the optimal SNR curves for the other distributions are closer to this curve in Fig. 2.9 than in Fig. 2.6. This indicates that, for a given number of bits per sample, optimal nonuniform quantization achieves better SNR than optimal uniform quantization.
[Figure 2.9. Plot of optimal SNR (dB) versus bits per sample, with one curve each for the uniform, Gaussian, Laplacian, and Gamma distributions.]
Fig. 2.9 Optimal SNR versus bits per sample achieved by Lloyd-Max algorithm for uniform, Gaussian, Laplacian, and Gamma distributions
Apparently, the optimal SNR curves in Fig. 2.9 also fit straight lines well, so they can be approximated by the same equation given in (2.43) with improved slope $b$ and intercept $a$. The improved performance of nonuniform quantization results in a better fit to a straight line than in Fig. 2.6. Similar to uniform quantization in Fig. 2.6, both the slope $b$ and the intercept $a$ decrease as the peakedness or kurtosis of the PDF increases in the order of uniform, Gaussian, Laplacian, and Gamma, indicating that the overall performance of a Lloyd-Max quantizer is inversely related to PDF kurtosis. Compared with the uniform distribution, all other distributions have reduced slopes $b$, indicating that their performance relative to the uniform distribution becomes worse as the bit rate increases. However, the degradations of both $a$ and $b$ are less conspicuous than those in Fig. 2.6.

In order to compare the performance of the Lloyd-Max quantizer with that of the uniform quantizer, Fig. 2.10 shows the optimal SNR gain of the Lloyd-Max quantizer over the uniform quantizer for uniform, Gaussian, Laplacian, and Gamma distributions:
$$\text{Optimal SNR Gain} = \text{SNR}_{\text{Nonuniform}} - \text{SNR}_{\text{Uniform}},$$
where $\text{SNR}_{\text{Nonuniform}}$ is taken from Fig. 2.9 and $\text{SNR}_{\text{Uniform}}$ from Fig. 2.6. Since the Lloyd-Max quantizer for a uniform distribution is a uniform quantizer, the optimal SNR gain is zero for the uniform distribution. It is obvious that the optimal SNR gain is more pronounced when the distribution is more peaked, i.e., of larger kurtosis.
[Figure 2.10. Plot of optimal SNR gain (dB) versus bits per sample, with one curve each for the uniform, Gaussian, Laplacian, and Gamma distributions.]
Fig. 2.10 Optimal SNR gain of Lloyd-Max quantizer over uniform quantizer for uniform, Gaussian, Laplacian, and Gamma distributions
2.4.2 Companding

Finding the whole set of decision boundaries $\{b_q\}_0^M$ and quantized values $\{\hat{x}_q\}_1^M$ for an optimal nonuniform quantizer using the Lloyd-Max algorithm usually involves a large number of iterations, hence may be computationally intensive, especially for a large $M$. The storage requirement for these decision boundaries and quantized values may also become excessive, especially for the decoder. Companding is an alternative.

Companding is motivated by the observation that a uniform quantizer is simple and effective for a matching uniformly distributed source signal. For a nonuniformly distributed source signal, one could use a nonlinear function $f(x)$ to convert it into another one with a PDF similar to a uniform distribution. Then the simple and effective uniform quantizer could be used. After the quantization indexes are transmitted to and subsequently received by the decoder, they are first inversely quantized to reconstruct the uniformly quantized values, and then the inverse function $f^{-1}(x)$ is applied to produce the final quantized values. This process is illustrated in Fig. 2.11.

The nonlinear function in Fig. 2.11 is called a compressor because it usually has a shape similar to that shown in Fig. 2.12, which stretches the source signal when its sample value is small and compresses it otherwise. This shape of compression matches the typical shape of a PDF, such as Gaussian or Laplacian, which has large probability density for small absolute sample values and tails off towards large absolute sample values, so that the converted signal has a PDF similar to a uniform distribution.
Fig. 2.11 The source sample value is first converted by the compressor into another one with a PDF similar to a uniform distribution. It is then quantized by a uniform quantizer and the quantization index is transmitted to the decoder. After inverse quantization at the decoder, the uniformly quantized value is converted by the expander to produce the final quantized value

[Figure 2.12. Two panels, Compressor and Expander, each plotting output versus input over the range −1 to 1.]

Fig. 2.12 μ-law companding deployed in North American and Japanese telecommunication systems
The inverse function is called an expander because the inverse of compression is expansion. After the compression-expansion, hence "companding", the effective decision boundaries when viewed from the expander output are nonuniform, so the overall effect is nonuniform quantization. When companding is actually used in speech and audio applications, additional consideration is given to the perceptual properties of the human ear. Since the perception of loudness by the human ear may be considered as logarithmic, logarithmic companding is widely used.

2.4.2.1 Speech Processing

In speech processing, μ-law companding, deployed in North American and Japanese telecommunication systems, has a compression function given by [33]
$$y = f(x) = \operatorname{sign}(x)\,\frac{\ln(1 + \mu|x|)}{\ln(1 + \mu)}, \quad -1 \le x \le 1; \tag{2.65}$$
where $\mu = 255$ and $x$ is the normalized sample value to be companded, which is limited to 13 magnitude bits. Its corresponding expanding function is
$$x = f^{-1}(y) = \operatorname{sign}(y)\,\frac{(1 + \mu)^{|y|} - 1}{\mu}, \quad -1 \le y \le 1. \tag{2.66}$$
Both functions are plotted in Fig. 2.12. A similar companding, called A-law companding, is deployed in Europe, whose compression function is
$$y = f(x) = \frac{\operatorname{sign}(x)}{1 + \ln(A)} \begin{cases} A|x|, & 0 \le |x| \le A^{-1}; \\ 1 + \ln(A|x|), & A^{-1} < |x| \le 1; \end{cases} \tag{2.67}$$
where $A = 87.7$ and the normalized sample value $x$ is limited to 12 magnitude bits. Its corresponding expanding function is
$$x = f^{-1}(y) = \operatorname{sign}(y) \begin{cases} \dfrac{1 + \ln(A)}{A}\,|y|, & 0 \le |y| \le \dfrac{1}{1 + \ln(A)}; \\[2ex] \dfrac{e^{|y|(1 + \ln(A)) - 1}}{A}, & \dfrac{1}{1 + \ln(A)} < |y| \le 1. \end{cases} \tag{2.68}$$
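The companding functions (2.65) to (2.68) and the compressor, uniform quantizer, expander chain of Fig. 2.11 can be sketched as below. This is an illustration only: it uses floating-point logarithms instead of the piece-wise linear approximations found in telephony equipment, takes μ = 255 (the ITU-T G.711 value) and A = 87.7, and assumes an 8-bit midrise uniform quantizer over [-1, 1]; the function names are hypothetical.

```python
import math

MU = 255.0    # mu-law parameter (G.711 value, assumed here)
A = 87.7      # A-law parameter as given in the text

def mu_compress(x):
    """mu-law compressor, (2.65)."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """mu-law expander, (2.66)."""
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)

def a_compress(x):
    """A-law compressor, (2.67)."""
    ax = abs(x)
    y = A * ax if ax <= 1.0 / A else 1.0 + math.log(A * ax)
    return math.copysign(y / (1.0 + math.log(A)), x)

def a_expand(y):
    """A-law expander, (2.68)."""
    ay, k = abs(y), 1.0 + math.log(A)
    x = ay * k / A if ay <= 1.0 / k else math.exp(ay * k - 1.0) / A
    return math.copysign(x, y)

def companded_quantize(x, bits=8):
    """Fig. 2.11 chain: compress, uniform midrise quantize, then expand."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = math.floor(mu_compress(x) / step)
    index = min(max(index, -(levels // 2)), levels // 2 - 1)   # clamp to range
    return mu_expand((index + 0.5) * step)

for x in (0.001, 0.01, 0.1, 0.9):
    print(x, companded_quantize(x))
```

Because the compressor stretches small sample values before uniform quantization, the absolute reconstruction error shrinks as the input gets smaller, which is the nonuniform behavior described above.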
It is usually very difficult to implement both the logarithmic and exponential functions used in the companding schemes above, especially on embedded microprocessors with limited resources. Many such processors do not even have a floating-point unit. Therefore, the companding functions are usually implemented using piece-wise linear approximation. This is adequate due to the fairly low requirement for speech quality in telephonic systems.
2.4.2.2 Audio Coding

Companding is not as widely used in audio coding as in speech processing, partly due to the higher quality requirement and wider dynamic range, which render implementation more difficult. However, MPEG 1&2 Layer III [55, 56] and MPEG 2&4 AAC [59, 60] use the following power-law compression function to quantize MDCT coefficients:
$$y = f(x) = \operatorname{sign}(x)\,|x|^{3/4}, \tag{2.69}$$
which may be considered as an approximation to the logarithmic function. The allowed compressed dynamic range is $-8191 \le y \le 8191$. The corresponding expanding function is obviously
$$x = f^{-1}(y) = \operatorname{sign}(y)\,|y|^{4/3}. \tag{2.70}$$
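A minimal sketch of quantization with the compressor (2.69) and expander (2.70) is given below. The normalization by a step size, the round-to-nearest rule, and the function names are assumptions made for illustration; they are not taken from the MPEG standards, which define their own scaling and rounding.

```python
import math

MAX_INDEX = 8191   # allowed compressed range quoted in the text

def quantize(x, step):
    """Compress |x / step| with the 3/4 power law (2.69) and round to an integer."""
    magnitude = abs(x / step) ** 0.75
    index = min(int(math.floor(magnitude + 0.5)), MAX_INDEX)   # assumed rounding
    return index if x >= 0 else -index

def dequantize(index, step):
    """Expand with the 4/3 power law (2.70) and undo the step-size scaling."""
    magnitude = abs(index) ** (4.0 / 3.0)
    return step * (magnitude if index >= 0 else -magnitude)

for x in (0.4, 3.0, 16.0, -250.0):
    q = quantize(x, 1.0)
    print(x, q, dequantize(q, 1.0))
```

An input of 16 with unit step size maps to index 8 and expands back to 16, while larger inputs are reconstructed on a progressively coarser grid.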
The implementation cost of the above power-law function is a significant issue in decoder development. Piece-wise linear approximation may lead to degradation in audio quality, hence may be unacceptable for high-fidelity applications. An alternative is to store the expanding function as a quantization table. This amounts to $2^{13} \times 3$ bytes $= 24$ KB if each of the $2^{13}$ entries in the table is stored using 24 bits.

The most widely used companding in audio coding is the companding of quantization step sizes of uniform quantizers. Since quantization step sizes are needed in the inverse quantization process in the decoder, they need to be packed into the bit stream and transmitted to the decoder. Transmitting these step sizes with arbitrary resolution is out of the question, so it is necessary that they be quantized. The perceived loudness of quantization noise is usually considered as logarithmically proportional to the quantization noise power, or linearly proportional to the quantization noise power in decibels. Due to (2.28), this means the perceived loudness is linearly proportional to the quantization step size in decibels. Therefore, almost all audio coding algorithms use logarithmic companding to quantize quantization step sizes:
$$\delta = f(\Delta) = \log_2(\Delta), \tag{2.71}$$
where $\Delta$ is the step size of a uniform quantizer. The corresponding expander is obviously
$$\Delta = f^{-1}(\delta) = 2^{\delta}. \tag{2.72}$$
Another motivation for logarithmic companding is to cope with the wide dynamic range of audio signals, which may amount to more than 24 bits per sample.
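The logarithmic companding of step sizes in (2.71) and (2.72) can be sketched as follows. The `resolution` parameter, which sets how finely the logarithm is quantized, and the function names are hypothetical; the text only specifies the base-2 logarithm and its inverse.

```python
import math

def compress_step(step_size, resolution=0.25):
    """Quantize log2(step size), (2.71), to an integer index."""
    return round(math.log2(step_size) / resolution)

def expand_step(index, resolution=0.25):
    """Reconstruct the step size from its index using (2.72)."""
    return 2.0 ** (index * resolution)

for step_size in (0.3, 1.0, 5.0, 1000.0):
    index = compress_step(step_size)
    print(step_size, index, expand_step(index))
```

With an assumed resolution of 0.25, consecutive indexes correspond to step sizes that differ by a factor of 2^0.25, or about 1.5 dB in amplitude, so the reconstructed step sizes are evenly spaced on a decibel scale regardless of their magnitude.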
Chapter 3
Vector Quantization
The scalar quantization discussed in Chap. 2 quantizes the samples of a source signal one by one in sequence. It is simple because it deals with only one sample each time, but it can only achieve so much for quantization efficiency. We now consider quantizing two or more samples as one block each time and call this approach vector quantization (VQ).
3.1 The VQ Advantage

Let us suppose that we need to quantize the following source sample sequence:
$$\{1.2,\ 1.4,\ 1.7,\ 1.9,\ 2.1,\ 2.4,\ 2.6,\ 2.9\}. \tag{3.1}$$
If we use the scalar midtread quantizer given in Table 2.2 and Fig. 2.2, we get the following SQ indexes:
$$\{1,\ 1,\ 2,\ 2,\ 2,\ 2,\ 3,\ 3\}, \tag{3.2}$$
which is also the sequence of quantized values since the quantization step size is one. Since the range of the quantization indexes is $[1, 3]$, we need 2 bits to convey each index. This amounts to $8 \times 2 = 16$ bits for encoding the whole sequence.

If two samples are quantized as a block, or vector, each time using the VQ codebook given in Table 3.1, we end up with the following sequence of indexes:
$$\{0,\ 1,\ 1,\ 2\}.$$
When this sequence is used by the decoder to look up Table 3.1, we obtain exactly the same reconstructed sequence as in (3.2), so the total quantization error is the same. Now 2 bits are still needed to convey each index, but there are only four indexes, so we need $4 \times 2 = 8$ bits to convey the whole sequence. This is only half the number of bits needed by the SQ, while the total quantization error is the same.

Table 3.1 An example VQ codebook
Index    Representative vector
0        [1, 1]
1        [2, 2]
2        [3, 3]

Fig. 3.1 A source sequence, viewed as a sequence of two-dimensional vectors, is plotted as dots using the first and second elements of each vector as x and y coordinates, respectively. The solid straight lines represent the decision boundaries for SQ and the solid curved lines for VQ. The dashed lines represent the quantized values for SQ and the diamonds represent the quantized or representative vectors for VQ

To explain why VQ can achieve much better performance than SQ, let us view the sequence in (3.1) as a sequence of two-dimensional vectors:
$$\{[1.2,\ 1.4],\ [1.7,\ 1.9],\ [2.1,\ 2.4],\ [2.6,\ 2.9]\}, \tag{3.3}$$
and plot them in Fig. 3.1 as dots using the first and second elements of each vector as x and y coordinates, respectively. For the first element of each vector, we use solid vertical lines to represent its decision boundaries and dashed vertical lines its quantized value. Consequently, its decision intervals are represented by vertical strips defined by two adjacent vertical lines. For the second element of each vector, we do the same with horizontal solid and dashed lines, respectively, so that its decision intervals are represented by horizontal strips defined by two adjacent horizontal lines. When the first element is quantized using SQ, a vertical decision strip is activated to produce the quantized value. Similarly, a horizontal decision strip is activated when the second element is quantized using SQ. Since the first and second elements of each vector are quantized separately, the quantization of the whole sequence can be viewed as a process of alternative activation of vertical and horizontal decision strips. However, if the results of the SQ described above for the two elements of each vector are viewed jointly, the quantization decision is actually represented by the squares where the vertical and horizontal decision strips cross. Each decision square represents the decision intervals for both elements of a vector. The crossing point of dashed lines in the middle of the decision square represents the quantized values for both the elements of the vector. Therefore, the decision square and the associated crossing point inside it are the real decision boundaries and quantized values for each source vector.
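The numbers in the example of (3.1), (3.2), and Table 3.1 can be reproduced with a few lines of code. The sketch assumes that the midtread SQ with unit step size rounds each sample to the nearest integer and that the VQ encoder picks the nearest codebook entry in the squared-Euclidean sense; the variable names are illustrative.

```python
import math

samples = [1.2, 1.4, 1.7, 1.9, 2.1, 2.4, 2.6, 2.9]        # sequence (3.1)

# Midtread SQ with step size 1: round each sample to the nearest integer.
sq_indexes = [math.floor(x + 0.5) for x in samples]

# Table 3.1, with list position serving as the VQ index.
codebook = [(1, 1), (2, 2), (3, 3)]

def nearest(vector):
    """Index of the codebook entry with the smallest squared distance."""
    return min(range(len(codebook)),
               key=lambda q: sum((a - b) ** 2 for a, b in zip(vector, codebook[q])))

vectors = list(zip(samples[0::2], samples[1::2]))          # sequence (3.3)
vq_indexes = [nearest(v) for v in vectors]
decoded = [value for q in vq_indexes for value in codebook[q]]

sq_bits = 2 * len(sq_indexes)    # 2 bits per scalar index
vq_bits = 2 * len(vq_indexes)    # 2 bits per vector index

print(sq_indexes, sq_bits)
print(vq_indexes, decoded, vq_bits)
```

The decoded VQ output matches the SQ output sample by sample, while the bit count drops from 16 to 8.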
It is now easy to realize that many of the decision squares are never used by SQ. Due to their existence, however, we still need a two-dimensional vector to identify and hence represent each of the decision squares. For the current example, each such vector needs $2 \times 2 = 4$ bits to be represented, corresponding to 2 bits per sample. So no bits are saved. What a waste those unused decision squares cause!

To avoid this waste, we need to consider the quantization of each source vector as a joint and simultaneous action and forgo the SQ decision squares. Along this line of thinking, we can arbitrarily place decision boundaries in the two-dimensional space. For example, noting that the data points are scattered almost along a straight line, we can redesign the decision boundaries as those depicted by the curved solid lines in Fig. 3.1. With this design, we designate the three crossing points represented by the diamonds in the figure as the quantized or representative vectors. They obviously represent the same quantized values as those obtained by SQ, thus leaving the quantization error unchanged. However, the two-dimensional decision boundaries carve out only three decision regions, or decision cells, with only three representative vectors that need to be indexed and transmitted to the decoder. Therefore, we need to transmit 2 bits per vector, or 1 bit per sample, to the decoder, amounting to a bit reduction of 2 to 1.

Of course, source sequences in the real world are not as simple as those in (3.1) and Fig. 3.1. To look at realistic sequences, Fig. 3.2 plots a correlated Gaussian sequence (mean = 0, variance = 1) in the same way as Fig. 3.1. Also plotted are the decision boundaries (solid lines) and quantized values (dashed lines) of the same uniform scalar quantizer used above. Apparently, the samples are highly concentrated along a straight line and a lot of SQ decision squares are wasted.

One may argue that a nonuniform SQ would do much better. Recall that it was concluded in Sect. 2.4 that the size of a decision interval of a nonuniform SQ should be inversely proportional to the probability density. The translation of this rule into two dimensions is that the size of the SQ decision squares should be inversely proportional to the probability density. So a nonuniform SQ can improve the coding performance by placing small SQ decision squares in densely distributed areas and large ones in loosely distributed areas. However, there are still a lot of SQ decision regions placed in areas that are extremely sparsely populated by the source samples, causing a waste of bit resources.
Fig. 3.2 A correlated Gaussian sequence (mean = 0, variance = 1) is plotted as two-dimensional vectors over the decision boundaries (solid lines) and quantized values (dashed lines) of a midtread uniform SQ (step size = 1)

Fig. 3.3 An independent Gaussian sequence (mean = 0, variance = 1) is plotted as two-dimensional vectors over the decision boundaries (solid lines) and quantized values (dashed lines) of a midtread uniform SQ (step size = 1)
To avoid this waste, the restriction that decision regions have to be square has to be removed. Once this restriction is removed, we arrive at the world of VQ, where decision regions can be of arbitrary shapes and can be arbitrarily allocated to match the probability density of the source sequence, achieving better quantization performance.

Even with uncorrelated sequences, VQ can still achieve better performance than SQ. Figure 3.3 plots an independent Gaussian sequence (mean = 0, variance = 1) over the decision boundaries (solid lines) and quantized values (dashed lines) of the same midtread uniform SQ. Apparently the samples are still concentrated, even though not as much as the correlated sequence in Fig. 3.2, so a VQ can still achieve better performance than an SQ. At the very least, an SQ has to allocate decision squares to cover the four corners of the figure, but a VQ can use arbitrarily shaped decision regions to cover those areas without wasting bits.
3.2 Formulation

Let us consider an $N$-dimensional random vector
$$\mathbf{x} = [x_0, x_1, \ldots, x_{N-1}]^T \tag{3.4}$$
with a joint PDF $p(\mathbf{x})$ over a vector space or support of $\Omega$. Suppose this vector space is divided by a set of $M$ regions $\{\delta_q\}_0^{M-1}$ in a mutually exclusive and collectively exhaustive way:
$$\Omega = \bigcup_{q=0}^{M-1} \delta_q \tag{3.5}$$
and
$$\delta_p \cap \delta_q = \Phi \quad \text{for all} \quad p \ne q, \tag{3.6}$$
where $\Phi$ is the null set. These regions, referred to as decision regions, play a role similar to decision intervals in SQ. To each decision region, a representative vector is assigned:
$$\mathbf{r}_q \mapsto \delta_q, \quad \text{for } q = 0, 1, \ldots, M-1. \tag{3.7}$$
The source vector $\mathbf{x}$ is vector-quantized to VQ index $q$ as follows:
$$q = Q(\mathbf{x}) \quad \text{if and only if} \quad \mathbf{x} \in \delta_q. \tag{3.8}$$
The corresponding reconstructed vector is the representative vector:
$$\hat{\mathbf{x}} = \mathbf{r}_q = Q^{-1}(q). \tag{3.9}$$
Plugging (3.8) into (3.9), we have
$$\hat{\mathbf{x}}(\mathbf{x}) = \mathbf{r}_{q(\mathbf{x})} = Q^{-1}[Q(\mathbf{x})]. \tag{3.10}$$
The VQ quantization error is now a vector given below
$$\mathbf{q}(\mathbf{x}) = \hat{\mathbf{x}} - \mathbf{x}. \tag{3.11}$$
Since this may be rewritten as
$$\hat{\mathbf{x}} = \mathbf{x} + \mathbf{q}(\mathbf{x}), \tag{3.12}$$
the additive quantization noise model in Fig. 2.3 is still valid.

The quantization noise is best measured by a distance, such as the squared L-2 norm or squared Euclidean distance defined below
$$d(\mathbf{x}, \hat{\mathbf{x}}) = (\mathbf{x} - \hat{\mathbf{x}})^T (\mathbf{x} - \hat{\mathbf{x}}) = \sum_{k=0}^{N-1} (x_k - \hat{x}_k)^2. \tag{3.13}$$
The average quantization error is then
$$\mathrm{Err} = \int_{\Omega} d(\mathbf{x}, \hat{\mathbf{x}})\,p(\mathbf{x})\,d\mathbf{x} \tag{3.14}$$
$$= \sum_{q=0}^{M-1} \int_{\delta_q} d(\mathbf{x}, \mathbf{r}_q)\,p(\mathbf{x})\,d\mathbf{x}. \tag{3.15}$$
If the Euclidean distance is used, the average quantization noise may again be called MSQE. The goal of VQ design is to find a set of decision regions $\{\delta_q\}_0^{M-1}$ and representative vectors $\{\mathbf{r}_q\}_0^{M-1}$, referred to as a VQ codebook, that minimizes this average quantization error.
3.3 Optimality Conditions

A necessary condition for an optimal solution to the VQ design problem stated above is that, for a given set of representative vectors $\{\mathbf{r}_q\}_0^{M-1}$, the corresponding decision regions $\{\delta_q\}_0^{M-1}$ should decompose the input space in such a way that each source vector $\mathbf{x}$ is always clustered to its nearest representative vector [19]:
$$Q(\mathbf{x}) = q \quad \text{if and only if} \quad d(\mathbf{x}, \mathbf{r}_q) \le d(\mathbf{x}, \mathbf{r}_k) \text{ for all } k \ne q. \tag{3.16}$$
Due to this, the decision regions can be defined as
$$\delta_q = \{\mathbf{x} \mid d(\mathbf{x}, \mathbf{r}_q) \le d(\mathbf{x}, \mathbf{r}_k) \text{ for all } k \ne q\}. \tag{3.17}$$
Such disjoint sets are referred to as Voronoi regions, which rids us of the trouble of literally defining and representing the boundary of each decision region. This also indicates that a whole VQ scheme can be fully described by the VQ codebook, or the set of representative vectors $\{\mathbf{r}_q\}_0^{M-1}$.

Another condition for an optimal solution is that, given a decision region $\delta_q$, the best choice for the representative vector is the conditional mean of all vectors within the decision region [19]:
$$\mathbf{x}_q = \int_{\mathbf{x} \in \delta_q} \mathbf{x}\,p(\mathbf{x} \mid \mathbf{x} \in \delta_q)\,d\mathbf{x}. \tag{3.18}$$
3.4 LBG Algorithm

Let us now consider the problem of VQ design, or finding the set of representative vectors, and thus decomposing the input space into a set of Voronoi regions, so that the average quantization error in (3.15) is minimized. An immediate obstacle that needs to be addressed is that, other than the multidimensional Gaussian, there is essentially no suitable theoretical joint PDF $p(\mathbf{x})$ to work with. Instead, we can usually draw a large set of input vectors, $\{\mathbf{x}_k\}_{k=0}^{L-1}$, from the source governed by an unknown PDF. Therefore, a feasible approach is to use such a set of vectors, referred to as the training set, to come up with an optimal VQ codebook.

In the absence of the joint PDF $p(\mathbf{x})$, the average quantization error defined in (3.15) is no longer available. It may be replaced by the following total quantization error:
$$\mathrm{Err} = \sum_{k=0}^{L-1} d\left(\mathbf{x}_k, \hat{\mathbf{x}}(\mathbf{x}_k)\right). \tag{3.19}$$
For the same reason, the best choice of the representative vector in (3.18) needs to be replaced by
$$\mathbf{x}_q = \frac{1}{L_q} \sum_{\mathbf{x}_k \in \delta_q} \mathbf{x}_k, \tag{3.20}$$
where $L_q$ is the number of training vectors in decision region $\delta_q$.
The following Linde-Buzo-Gray algorithm (LBG algorithm) [42], also referred to as the k-means algorithm, has been found to converge to a local minimum of (3.19):
1. $n = 0$.
2. Make a guess for the representative vectors $\{\mathbf{r}_q\}_0^{M-1}$. By (3.17), this implicitly builds an initial set of Voronoi regions $\{\delta_q\}_0^{M-1}$.
3. $n = n + 1$.
4. Quantize each training vector using (3.16). Upon completion, the training set has been partitioned into Voronoi regions $\{\delta_q\}_0^{M-1}$.
5. Build a new set of representative vectors using (3.20). This implicitly builds a new set of Voronoi regions $\{\delta_q\}_0^{M-1}$.
6. Calculate the total quantization error $\mathrm{Err}(n)$ using (3.19).
7. Go back to step 3 if
$$\frac{\mathrm{Err}(n-1) - \mathrm{Err}(n)}{\mathrm{Err}(n)} > \epsilon, \tag{3.21}$$
where $\epsilon$ is a predetermined positive threshold.
8. Stop.

Steps 4 and 5 in the iterative procedure above can only cause the total quantization error to decrease, so the LBG algorithm converges to at least a local minimum of the total quantization error (3.19). This also explains the stopping condition in (3.21).

There is a chance that, upon the completion of step 4, a Voronoi region may be empty in the sense that it contains not a single training vector. Suppose this happens to region $q$; then step 5 is problematic because $L_q = 0$ in (3.20). This indicates that representative vector $\mathbf{r}_q$ is an outlier that is far away from the training set. A simple approach to fixing this problem is to replace it with a training vector from the most populated Voronoi region.

After the completion of the LBG algorithm, the resulting VQ codebook can be tested against a separate set of source data, referred to as the test set, that is drawn from the same source.

The LBG algorithm offers an approach to exploiting the real multidimensional PDF directly from the data without a theoretical multidimensional distribution model. This is an advantage over SQ, which usually relies on a probability model. This also enables VQ to remove nonlinear dependencies in the data, a clear advantage over other technologies, such as transforms and linear prediction, which can only deal with linear dependencies.
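The LBG steps listed above can be sketched in plain Python as follows. The sketch makes a few choices that the text leaves open and that should be read as assumptions: the initial codebook is passed in by the caller, the error before the first iteration is taken as infinite so that at least two passes are made, an empty cell is refilled with the first training vector of the most populated cell, and the total error (3.19) is evaluated with the updated codebook. The function names and the toy training set are illustrative.

```python
def sq_dist(u, v):
    """Squared Euclidean distance (3.13)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(x, codebook):
    """Nearest-neighbor rule (3.16): index of the closest representative."""
    return min(range(len(codebook)), key=lambda q: sq_dist(x, codebook[q]))

def lbg(training, initial_codebook, eps=1e-6):
    codebook = [list(r) for r in initial_codebook]          # step 2
    prev_err = float("inf")
    while True:                                             # step 3
        cells = [[] for _ in codebook]
        for x in training:                                  # step 4
            cells[nearest(x, codebook)].append(x)
        largest = max(cells, key=len)
        for q, cell in enumerate(cells):                    # step 5, (3.20)
            if cell:
                codebook[q] = [sum(c) / len(cell) for c in zip(*cell)]
            else:                                           # empty Voronoi region
                codebook[q] = list(largest[0])
        err = sum(sq_dist(x, codebook[nearest(x, codebook)])
                  for x in training)                        # step 6, (3.19)
        if err == 0.0 or (prev_err - err) / err <= eps:     # step 7, (3.21)
            return codebook, err                            # step 8
        prev_err = err

# Toy training set: three tight clusters along the diagonal.
offsets = (-0.5, 0.0, 0.5)
training_set = [(c + dx, c + dy) for c in (0.0, 5.0, 10.0)
                for dx in offsets for dy in offsets]

trained_codebook, total_error = lbg(training_set, [(1, 1), (4, 4), (8, 8)])
print(trained_codebook, total_error)
```

On this toy set the representative vectors move to the cluster means (0, 0), (5, 5), and (10, 10). With a poorly chosen initial codebook the procedure may instead settle at a different local minimum, in line with the remark on convergence above.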
3.5 Implementation Once the VQ codebook is obtained, it can be used to quantize the source signal that generates the training set. Similar to SQ, this also involves two stages as shown in Fig. 3.4.
Fig. 3.4 VQ involves an encoding or forward VQ stage represented by "VQ" in the figure, which maps a source vector to a VQ index, and a decoding or inverse VQ stage represented by "VQ^{-1}", which maps a VQ index to its representative vector

Fig. 3.5 The vector-quantization of a source vector entails searching through the VQ codebook to find the representative vector that is closest to the source vector
According to the optimality condition (3.16), the vector quantization of a source vector $\mathbf{x}$ entails searching through all representative vectors in the VQ codebook to find the one that is closest to the source vector. This is shown in Fig. 3.5. The decoding is very simple because it only entails using the received VQ index to look up the VQ codebook and retrieve the representative vector.

Since the size of a VQ codebook grows exponentially with the vector dimension, the amount of storage for the codebook may easily become a significant cost, especially for the decoder, which is usually more cost-sensitive. The amount of computation involved in searching through the VQ codebook for each source vector is also a major concern for the encoder. Therefore, VQ with a dimension of more than 20 is rarely deployed in practical applications.
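The exponential growth mentioned above can be made concrete with a rough cost estimate. The sketch assumes a fixed rate of R bits per sample, so that an N-dimensional codebook holds 2^(RN) representative vectors, a full search that costs one multiply-accumulate per component per codebook entry, and 2 bytes of storage per component; these figures and the function name are illustrative assumptions.

```python
def vq_cost(dimension, bits_per_sample, bytes_per_component=2):
    """Return (codebook entries, storage in bytes, multiply-accumulates per vector)."""
    entries = 2 ** (dimension * bits_per_sample)
    storage_bytes = entries * dimension * bytes_per_component
    macs_per_vector = entries * dimension        # full-search encoder
    return entries, storage_bytes, macs_per_vector

for n in (2, 4, 8, 16, 20):
    entries, storage_bytes, macs = vq_cost(n, 1)
    print(n, entries, storage_bytes, macs)
```

Under these assumptions, at 1 bit per sample a dimension of 20 already requires about one million codebook entries and roughly 40 MB of storage, and the full search costs about 21 million multiply-accumulates per source vector.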
Part III
Data Model
Companding discussed in Sect. 2.4.2 illustrates a basic framework for improving quantization performance. If the compressor is replaced by a general signal transformation built upon a data model, expander by the corresponding inverse transformation, and the uniform quantizer by a general quantizer, respectively, companding can be expanded into a general scheme for quantization performance enhancement: data model plus quantization. The steps involved in such a scheme may be summarized as follows: 1. 2. 3. 4.
Transform the source signal into another one using a data model. Quantize the transformed signal. Transmit the quantization indexes to the decoder. Inverse-quantize the received quantized indexes to reconstruct the transformed signal. 5. Inverse-transform the reconstructed signal to reconstruct the original signal. A necessary condition for this scheme to work is that the transformation must be invertible, either exactly or approximately. Otherwise, original signal cannot be reconstructed even if no quantization is applied. The key to the success of such a scheme is that the transformed signal must be compact. Companding achieves this using a nonlinear function to arrive at a PDF that is similar to a uniform distribution. There are other methods that are much more powerful, most prominently among them are linear prediction, linear transform, and subband filter banks. Linear prediction uses a linear combination of historic samples as a prediction for the current sample. As long as the samples are fairly correlated, the predicted value will be a good estimate to the current sample value, resulting a small prediction error signal, which may be characterized by a smaller variance. Since the MSQE of an optimal quantizer is proportional to the the variance of the source signal (see (2.48)), the reduced variance will result in a reduced MSQE. A linear transform takes a block of input samples to generate another block of transform coefficients whose energy is compacted to a minority. Bit resources can then be concentrated to those high-energy coefficients to arrive at a significantly reduced MSQE.
A filter bank may be considered as an extension of a transform: it uses samples from multiple blocks to achieve a higher level of energy compaction without changing the block size, thus delivering an even smaller MSQE.

The primary role of data modeling is to exploit the inner structure or correlation of the source signal. As discussed in Chap. 3, VQ can also achieve this. A major difficulty with VQ, however, is that its complexity grows exponentially with the vector dimension, so a vector dimension of more than 20 is usually considered too complex to be deployed. Correlation in most signals usually extends well beyond 20 samples. Audio signals, in particular, are well known for strong correlations over up to thousands of samples. Therefore, VQ is usually not deployed directly, but rather jointly with a data model.
Chapter 4
Linear Prediction
Let us consider the source signal $x(n)$ shown at the top of Fig. 4.1. A simple approach to linear prediction is to just use the previous sample $x(n-1)$ as the prediction for the current sample:
$$p(n) = x(n-1). \tag{4.1}$$
This prediction is, of course, not perfect, so there is a prediction error or residue
$$r(n) = x(n) - p(n) = x(n) - x(n-1), \tag{4.2}$$
which is shown at the bottom of Fig. 4.1. The dynamic range of the residue is obviously much smaller than that of the source signal. The variance of the residue is 2.0282, which is much smaller than 101.6028, the variance of the source signal. The histograms of the source signal and the residue, both shown in Fig. 4.2, clearly indicate that, if the residue, instead of the source signal itself, is quantized, the quantization error will be much smaller.
4.1 Linear Prediction Coding

More generally, for a source signal $x(n)$, a linear predictor makes an estimate of its sample value at time instance $n$ using a linear combination of its $K$ previous samples:
$$p(n) = \sum_{k=1}^{K} a_k x(n-k), \tag{4.3}$$
where $\{a_k\}_{k=1}^{K}$ are the prediction coefficients and $K$ is the predictor order. The transfer function for the prediction filter is
$$A(z) = \sum_{k=1}^{K} a_k z^{-k}. \tag{4.4}$$
Y. You, Audio Coding: Theory and Applications, DOI 10.1007/978-1-4419-1754-6 4, c Springer Science+Business Media, LLC 2010
Fig. 4.1 A source signal and its prediction residue (top panel: Input Signal; bottom panel: Prediction Residue)

Fig. 4.2 Histograms for the source signal and its prediction residue (panels: Input Signal and Prediction Residue)
The prediction cannot be perfect, so there is always a prediction error
$$r(n) = x(n) - p(n), \tag{4.5}$$
which is also referred to as the prediction residue. With proper design of the prediction coefficients, this prediction error can be significantly reduced. To assess the performance of this prediction error reduction, the prediction gain is defined as
$$PG = \frac{\sigma_x^2}{\sigma_r^2}, \tag{4.6}$$
where $\sigma_x^2$ and $\sigma_r^2$ denote the variances of the source and the residue signals, respectively. For example, the prediction gain for the simple predictor at the beginning of this chapter is
$$PG = 10 \log_{10} \frac{101.6028}{2.0282} \approx 17 \text{ dB}.$$
Since the prediction residue is likely to have a much smaller variance than the source signal, the MSQE could be significantly reduced if the residue signal is quantized in place of the source signal. This amounts to linear prediction coding (LPC).
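The prediction gain of (4.6) can be sketched in a few lines of Python. The helper names `variance` and `prediction_gain_db` and the sinusoidal test signal are illustrative assumptions, not taken from the text; only the two variances 101.6028 and 2.0282 come from the example above.

```python
import math

def variance(x):
    """Population variance of a sequence."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

def prediction_gain_db(x, a):
    """Prediction gain (4.6) in dB for coefficients a = [a_1, ..., a_K]."""
    K = len(a)
    # Residue r(n) = x(n) - sum_k a_k x(n-k), only where K past samples exist.
    r = [x[n] - sum(a[k] * x[n - 1 - k] for k in range(K))
         for n in range(K, len(x))]
    return 10.0 * math.log10(variance(x[K:]) / variance(r))

# Variances quoted in the text for the signal of Fig. 4.1.
book_pg_db = 10.0 * math.log10(101.6028 / 2.0282)

# Hypothetical test signal (an assumption): a slowly varying sinusoid,
# predicted with the simple difference predictor a_1 = 1 of (4.1).
x = [20.0 * math.sin(2.0 * math.pi * n / 200.0) for n in range(8000)]
sine_pg_db = prediction_gain_db(x, [1.0])
```

A strongly correlated signal such as this sinusoid yields a large gain even with the crude predictor of (4.1); a weakly correlated signal would yield a gain near or below 0 dB.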
4.2 Open-Loop DPCM

There are many methods for implementing LPC, which differ mostly in how quantization noise is handled in the encoder. Open-loop DPCM, discussed in this section, is one of the simplest, but it suffers from quantization noise accumulation.
4.2.1 Encoder and Decoder

An encoder for implementing LPC is shown in Fig. 4.3 [66], where the quantizer is modeled by an additive noise source (see (2.8) and Fig. 2.3). Since the quantizer is placed outside the prediction loop, this scheme is often called open-loop DPCM [17]. The name DPCM will be explained in Sect. 4.3. The overall transfer function for the LPC encoder before the quantizer is
$$H(z) = 1 - \sum_{k=1}^{K} a_k z^{-k} = 1 - A(z), \tag{4.7}$$
where $A(z)$ is defined in (4.4).

Fig. 4.3 Encoder for open-loop DPCM. The quantizer is attached to the prediction residue and is modeled by an additive noise source

Fig. 4.4 Decoder for open-loop DPCM

After quantization, the quantization indexes for the residue signal are transmitted to the decoder and are used to reconstruct the quantized residue $\hat{r}(n)$ through inverse quantization. This process may be viewed as if the quantized residue $\hat{r}(n)$ were received directly by the decoder. This is the convention adopted in Figs. 4.3 and 4.4.

Since the original source signal $x(n)$ is not available at the decoder, the prediction scheme in (4.3) cannot be used by the decoder. Instead, the prediction at the decoder has to use the past reconstructed sample values:
$$\hat{p}(n) = \sum_{k=1}^{K} a_k \hat{x}(n-k). \tag{4.8}$$
According to (4.5), the reconstructed sample itself is obtained by
$$\hat{x}(n) = \hat{p}(n) + \hat{r}(n) = \sum_{k=1}^{K} a_k \hat{x}(n-k) + \hat{r}(n). \tag{4.9}$$
This leads us to the decoder shown in Fig. 4.4, where the overall transfer function of the LPC decoder or the LPC reconstruction filter is
$$D(z) = \frac{1}{1 - \sum_{k=1}^{K} a_k z^{-k}} = \frac{1}{1 - A(z)}. \tag{4.10}$$
If there were no quantizer, the overall transfer function of the encoder (4.7) and decoder (4.10) would obviously be one, so the reconstructed signal at the decoder output would be the same as the encoder input. When the quantizer is deployed, the prediction residue used by the decoder is different from that used by the encoder, and the difference is the additive quantization noise. This causes a reconstruction error at the decoder output
$$e(n) = \hat{x}(n) - x(n), \tag{4.11}$$
which is the quantization error of the overall open-loop DPCM.
4.2.2 Quantization Noise Accumulation

As illustrated above, the difference between the prediction residues used by the decoder and the encoder is the additive quantization noise, so this quantization noise is reflected in each reconstructed sample value. Since these sample values are fed back through the prediction filter, it is expected that the quantization noise accumulates at the decoder output.

To illustrate this quantization noise accumulation, let us first use the additive noise model of (2.8) to write the quantized residue as
$$\hat{r}(n) = r(n) + q(n), \tag{4.12}$$
where $q(n)$ is again the quantization error or noise. The reconstructed sample at the decoder output (4.9) can then be rewritten as
$$\hat{x}(n) = \hat{p}(n) + r(n) + q(n). \tag{4.13}$$
Substituting the definition of the residue (4.5), we have
$$\hat{x}(n) = \hat{p}(n) + x(n) - p(n) + q(n). \tag{4.14}$$
Therefore, the reconstruction error at the decoder output (4.11) is
$$e(n) = \hat{x}(n) - x(n) = \hat{p}(n) - p(n) + q(n). \tag{4.15}$$
Plugging in (4.3) and (4.8), this may be further expressed as
$$e(n) = \sum_{k=1}^{K} a_k \left[\hat{x}(n-k) - x(n-k)\right] + q(n) = \sum_{k=1}^{K} a_k e(n-k) + q(n), \tag{4.16}$$
which indicates that the reconstruction error is equal to the current quantization error plus a weighted sum of reconstruction errors in the past.
Moving this relationship backward one step to $n-1$, the reconstruction error at $n-1$ is equal to the quantization error at $n-1$ plus the weighted sum of reconstruction errors before $n-1$. Repeating this procedure all the way back to the start of the prediction, we conclude that the quantization errors of all past prediction steps are accumulated to arrive at the reconstruction error at the current step.

To illustrate this accumulation of quantization noise more clearly, let us consider the simple difference predictor used at the beginning of this chapter:
$$a_1 = 1 \quad \text{and} \quad K = 1. \tag{4.17}$$
Plugging this into (4.16), we have
$$e(n) = e(n-1) + q(n). \tag{4.18}$$
Let us suppose that $\hat{x}(0) = x(0)$, so that $e(0) = 0$. The above equation then produces iteratively the following reconstruction errors:
$$\begin{aligned} e(1) &= e(0) + q(1) = q(1) \\ e(2) &= e(1) + q(2) = q(1) + q(2) \\ e(3) &= e(2) + q(3) = q(1) + q(2) + q(3) \\ &\;\;\vdots \end{aligned} \tag{4.19}$$
For the $n$th step, we have
$$e(n) = \sum_{k=1}^{n} q(k), \tag{4.20}$$
which is the summation of quantization noise in all previous steps, starting from the beginning of prediction.

To obtain a closed-form representation of quantization noise accumulation, let $E(z)$ denote the Z-transform of the reconstruction error $e(n)$ and $Q(z)$ the Z-transform of the quantization noise $q(n)$. The reconstruction error in (4.16) may then be expressed as
$$E(z) = \frac{Q(z)}{1 - A(z)}, \tag{4.21}$$
which indicates that the reconstruction error at the decoder output is the quantization error filtered by the all-pole LPC reconstruction filter (see (4.10)).

An all-pole IIR filter may be unstable and may produce large instances of reconstruction error, which may be perceptually annoying, so open-loop DPCM is avoided in many applications. It is, however, sometimes purposely deployed in other applications to exploit the shaping of the quantization noise spectrum by the all-pole reconstruction filter; see Sects. 4.5 and 11.4 for details.
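The accumulation described by (4.18) and (4.20) can be checked numerically. The following is a minimal sketch, assuming a uniform midtread quantizer, a step size of 0.5, and a made-up input signal; the function names `quantize` and `open_loop_dpcm` are hypothetical.

```python
import math

def quantize(v, step):
    """Uniform midtread quantizer: round to the nearest multiple of step."""
    return step * math.floor(v / step + 0.5)

def open_loop_dpcm(x, step):
    """Open-loop DPCM with p(n) = x(n-1); x(0) is assumed sent unquantized."""
    x_hat = [x[0]]   # x_hat(0) = x(0), as supposed in the text
    q = [0.0]        # quantization noise q(n) = r_hat(n) - r(n)
    for n in range(1, len(x)):
        r = x[n] - x[n - 1]                 # encoder predicts from ORIGINAL samples
        r_hat = quantize(r, step)
        q.append(r_hat - r)
        x_hat.append(x_hat[n - 1] + r_hat)  # decoder predicts from reconstructed samples
    return x_hat, q

# Hypothetical input signal (an assumption for illustration).
x = [20.0 * math.sin(0.05 * n) + 0.3 * n for n in range(200)]
x_hat, q = open_loop_dpcm(x, step=0.5)
e = [xh - xv for xh, xv in zip(x_hat, x)]

# Running sum of the quantization noise: the right-hand side of (4.20).
running, acc = [], 0.0
for qn in q:
    acc += qn
    running.append(acc)
```

Up to floating-point rounding, `e` and `running` coincide, so the reconstruction error is not bounded by half a quantization step and can drift over time.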
4.3 DPCM

This problem of quantization error accumulation can be avoided by forcing the encoder to use the same predictor as the decoder, i.e., the predictor given in (4.8). This entails moving the quantizer in Fig. 4.3 inside the encoding loop, leading to the encoder scheme shown in Fig. 4.5. This LPC scheme is often referred to as differential pulse code modulation (DPCM) [9]. Note that a uniform quantizer is usually deployed.

Fig. 4.5 Differential pulse code modulation (DPCM). A full decoder is embedded as part of the encoder. The quantizer is modeled by an additive noise source

4.3.1 Quantization Error

To illustrate that quantization noise accumulation is no longer an issue with DPCM, let us first note that the prediction residue is now given by (see Fig. 4.5)
$$r(n) = x(n) - \hat{p}(n). \tag{4.22}$$
Plugging this into the additive noise model (4.12) for quantization, we have
$$\hat{r}(n) = x(n) - \hat{p}(n) + q(n), \tag{4.23}$$
which is the quantized residue at the input to the decoder. Substituting the above equation into (4.9), we obtain the reconstructed value at the decoder
$$\hat{x}(n) = \hat{p}(n) + x(n) - \hat{p}(n) + q(n) = x(n) + q(n). \tag{4.24}$$
Rearranging this equation, we have
$$e(n) = \hat{x}(n) - x(n) = q(n), \tag{4.25}$$
or equivalently
$$E(z) = Q(z). \tag{4.26}$$
Therefore, the reconstruction error at the decoder output is exactly the same as the quantization error of the residue in the encoder; there is no quantization error accumulation.
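The result (4.25) can be verified with a small closed-loop simulation. This is only a sketch under assumptions: a first-order predictor with $a_1 = 1$, a uniform midtread quantizer with step 0.5, and a made-up input; the names `quantize` and `dpcm` are hypothetical.

```python
import math

def quantize(v, step):
    """Uniform midtread quantizer: round to the nearest multiple of step."""
    return step * math.floor(v / step + 0.5)

def dpcm(x, step):
    """Closed-loop DPCM with predictor p_hat(n) = x_hat(n-1), see (4.8)."""
    x_hat = [x[0]]                       # assume x_hat(0) = x(0)
    for n in range(1, len(x)):
        p_hat = x_hat[n - 1]             # prediction from RECONSTRUCTED samples
        r = x[n] - p_hat                 # residue (4.22)
        r_hat = quantize(r, step)        # quantized residue (4.23)
        x_hat.append(p_hat + r_hat)      # embedded decoder (4.9)
    return x_hat

# Hypothetical input signal (an assumption for illustration).
step = 0.5
x = [20.0 * math.sin(0.05 * n) + 0.3 * n for n in range(200)]
x_hat = dpcm(x, step)
max_abs_error = max(abs(xh - xv) for xh, xv in zip(x_hat, x))
```

Because the reconstruction error equals the quantization error of the current residue, `max_abs_error` stays within half a quantization step regardless of the signal length.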
4.3.2 Coding Gain

As stated for (4.11), the reconstruction error at the decoder output is considered as the quantization noise of LPC, so its variance, denoted as $\sigma_{q(\mathrm{DPCM})}^2$, is the MSQE for DPCM. For a given bit rate, this MSQE should be smaller than $\sigma_{q(\mathrm{PCM})}^2$, the MSQE for directly quantizing the source signal, to justify the increased complexity of linear prediction. This improvement may be assessed by the coding gain
$$G_{\mathrm{DPCM}} = \frac{\sigma_{q(\mathrm{PCM})}^2}{\sigma_{q(\mathrm{DPCM})}^2}, \tag{4.27}$$
which is usually evaluated in the context of scalar quantization.

Due to (4.25), $\sigma_{q(\mathrm{DPCM})}^2$ is the same as the MSQE of the prediction residue $\sigma_{q(\mathrm{Residue})}^2$:
$$\sigma_{q(\mathrm{DPCM})}^2 = \sigma_{q(\mathrm{Residue})}^2. \tag{4.28}$$
Since $\sigma_{q(\mathrm{Residue})}^2$ is related to the variance of the residue $\sigma_r^2$ by (2.48) for uniform and nonuniform scalar quantization, we have
$$\sigma_{q(\mathrm{DPCM})}^2 = 10^{-0.1(a+bR)} \sigma_r^2 \tag{4.29}$$
for a given bit rate of $R$. Note that the slope $b$ and intercept $a$ in the above equation are determined by the particular quantizer and the PDF of the residue.

On the other hand, if the source signal is quantized directly with the same bit rate $R$ in what constitutes PCM, (2.48) gives the following MSQE:
$$\sigma_{q(\mathrm{PCM})}^2 = 10^{-0.1(a+bR)} \sigma_x^2, \tag{4.30}$$
where $\sigma_x^2$ is the variance of the source signal. Here it is assumed that the source signal and the DPCM residue share the same set of parameters $a$ and $b$. This assumption may not be valid in many practical applications.

Under this assumption, the coding gain for DPCM is
$$G_{\mathrm{DPCM}} = \frac{\sigma_{q(\mathrm{PCM})}^2}{\sigma_{q(\mathrm{DPCM})}^2} = \frac{\sigma_x^2}{\sigma_r^2} = PG, \tag{4.31}$$
which is the same as the prediction gain defined in (4.6). This indicates that the quantization performance of the DPCM system depends on the prediction gain, or how well the predictor predicts the source signal. If the predictor in the DPCM is properly designed so that
$$\sigma_r^2 \ll \sigma_x^2, \tag{4.32}$$
then $G_{\mathrm{DPCM}} \gg 1$, i.e., a significant reduction in quantization error can be achieved.
4.4 Optimal Prediction

Now that it is established that the quantization performance of a DPCM system depends on the performance of linear prediction, the next task is to find an optimal predictor that maximizes the prediction gain.

4.4.1 Optimal Predictor

For a given source signal $x(n)$, the maximization of prediction gain is equivalent to minimizing the variance of the prediction residue (4.22), which can be expressed as
$$r(n) = x(n) - \hat{p}(n) = x(n) - \sum_{k=1}^{K} a_k \hat{x}(n-k) \tag{4.33}$$
using (4.8). Therefore, the design problem is to find the set of prediction coefficients $\{a_k\}_{k=1}^{K}$ that minimizes $\sigma_r^2$:
$$\min_{\{a_k\}_{k=1}^{K}} \sigma_r^2 = E\left[\left(x(n) - \sum_{k=1}^{K} a_k \hat{x}(n-k)\right)^2\right], \tag{4.34}$$
where $E(\cdot)$ is the expectation operator defined below
$$E(y) = \int y(\xi)\, p(\xi)\, d\xi. \tag{4.35}$$
Since $\hat{x}(n)$ in (4.34) is the reconstructed signal and is related to the source signal by the additive quantization noise (see (4.25)), the minimization problem in (4.34) involves optimal selection of the prediction coefficients as well as the minimization of quantization error. As discussed in Chaps. 2 and 3, independent minimization of quantization error, i.e., quantizer design itself, is frequently very difficult, so it is highly desirable that the problem be simplified by taking quantization out of the picture. This essentially implies that the DPCM scheme is given up and the open-loop DPCM encoder in Fig. 4.3 is used when it comes to predictor design.

One way to consider this simplification is the assumption of fine quantization: the quantization step size is so small that the resulting quantization error is negligible,
$$\hat{x}(n) \approx x(n). \tag{4.36}$$
This enables the replacement of $\hat{x}(n)$ by $x(n)$ in (4.34). Due to the arguments above, the prediction residue considered for optimization purposes becomes
$$r(n) = x(n) - p(n) = x(n) - \sum_{k=1}^{K} a_k x(n-k), \tag{4.37}$$
and the predictor design problem (4.34) becomes
$$\min_{\{a_k\}_{k=1}^{K}} \sigma_r^2 = E\left[\left(x(n) - \sum_{k=1}^{K} a_k x(n-k)\right)^2\right]. \tag{4.38}$$
To minimize this error function, we set the derivative of $\sigma_r^2$ with respect to each prediction coefficient $a_j$ to zero:
$$\frac{\partial \sigma_r^2}{\partial a_j} = -2E\left[\left(x(n) - \sum_{k=1}^{K} a_k x(n-k)\right) x(n-j)\right] = 0, \quad j = 1, 2, \ldots, K. \tag{4.39}$$
Due to (4.3), the above equation may be written as
$$E\left[\left(x(n) - p(n)\right) x(n-j)\right] = 0, \quad j = 1, 2, \ldots, K, \tag{4.40}$$
and, using (4.5), further as
$$E\left[r(n)\, x(n-j)\right] = 0, \quad j = 1, 2, \ldots, K. \tag{4.41}$$
This means that the minimal prediction error or residue must be orthogonal to all data used in the prediction. This is called the orthogonality principle.

By moving the expectation inside the summation, (4.39) may be rewritten as
$$\sum_{k=1}^{K} a_k E\left[x(n-k)\, x(n-j)\right] = E\left[x(n)\, x(n-j)\right], \quad j = 1, 2, \ldots, K. \tag{4.42}$$
Now we are ready to make the second assumption: the source signal $x(n)$ is a wide-sense stationary process so that its autocorrelation function can be defined as
$$R(k) = E[x(n)\, x(n-k)], \tag{4.43}$$
and has the following property:
$$R(-k) = R(k). \tag{4.44}$$
Consequently, (4.42) can be written as
$$\sum_{k=1}^{K} a_k R(k-j) = R(j), \quad j = 1, 2, \ldots, K. \tag{4.45}$$
It can be further written in the following matrix form:
$$\mathbf{R}\mathbf{a} = \mathbf{r}, \tag{4.46}$$
where
$$\mathbf{a} = [a_1, a_2, a_3, \ldots, a_K]^T, \tag{4.47}$$
$$\mathbf{r} = [R(1), R(2), R(3), \ldots, R(K)]^T, \tag{4.48}$$
and
$$\mathbf{R} = \begin{bmatrix} R(0) & R(1) & R(2) & \cdots & R(K-1) \\ R(1) & R(0) & R(1) & \cdots & R(K-2) \\ R(2) & R(1) & R(0) & \cdots & R(K-3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ R(K-1) & R(K-2) & R(K-3) & \cdots & R(0) \end{bmatrix}. \tag{4.49}$$
The equations above are known as normal equations, Yule–Walker prediction equations, or Wiener–Hopf equations [63]. The matrix $\mathbf{R}$ and the vector $\mathbf{r}$ are both built from the autocorrelation values $\{R(k)\}_{k=0}^{K}$. The matrix $\mathbf{R}$ is a symmetric Toeplitz matrix, i.e., all elements along a diagonal are equal. Such autocorrelation matrices are positive definite in all but degenerate cases and therefore nonsingular, yielding a unique solution for the linear prediction coefficients:
$$\mathbf{a} = \mathbf{R}^{-1}\mathbf{r}. \tag{4.50}$$
4.4.2 Levinson–Durbin Algorithm

Levinson–Durbin recursion is a procedure in linear algebra to recursively calculate the solution to an equation involving a Toeplitz matrix [12, 41], thus avoiding an explicit inversion of the matrix $\mathbf{R}$. The algorithm iterates on the prediction order, so the order of the prediction filter is denoted for each filter coefficient using superscripts:
$$a_k^n, \tag{4.51}$$
which denotes the $k$th coefficient for the $n$th iteration. To get the prediction filter of order $K$, we need to iterate through the following sets of filter coefficients:
$$\begin{aligned} n = 1 &: \{a_1^1\} \\ n = 2 &: \{a_1^2, a_2^2\} \\ &\;\;\vdots \\ n = K &: \{a_1^K, a_2^K, \ldots, a_K^K\} \end{aligned}$$
The iteration above is possible because both the matrix $\mathbf{R}$ and the vector $\mathbf{r}$ are built from the autocorrelation values $\{R(k)\}_{k=0}^{K}$. The algorithm proceeds as follows:

1. Set $n = 0$.
2. Set $E^0 = R(0)$.
3. Set $n = n + 1$.
4. Calculate
$$\kappa_n = \frac{1}{E^{n-1}} \left( R(n) - \sum_{k=1}^{n-1} a_k^{n-1} R(n-k) \right). \tag{4.52}$$
5. Calculate
$$a_n^n = \kappa_n. \tag{4.53}$$
6. Calculate
$$a_k^n = a_k^{n-1} - \kappa_n a_{n-k}^{n-1}, \quad \text{for } k = 1, 2, \ldots, n-1. \tag{4.54}$$
7. Calculate
$$E^n = (1 - \kappa_n^2) E^{n-1}. \tag{4.55}$$
8. Go to step 3 if $n < K$.

For example, to get the prediction filter of order $K = 2$, two iterations are needed as follows. For $n = 1$, we have
$$E^0 = R(0), \tag{4.56}$$
$$\kappa_1 = \frac{R(1)}{E^0} = \frac{R(1)}{R(0)}, \tag{4.57}$$
$$a_1^1 = \kappa_1 = \frac{R(1)}{R(0)}, \tag{4.58}$$
$$E^1 = (1 - \kappa_1^2) E^0 = \left[ 1 - \left( \frac{R(1)}{R(0)} \right)^2 \right] R(0) = \frac{R^2(0) - R^2(1)}{R(0)}. \tag{4.59}$$
For $n = 2$, we have
$$\kappa_2 = \frac{R(2) - a_1^1 R(1)}{E^1} = \frac{R(2)R(0) - R^2(1)}{R^2(0) - R^2(1)}, \tag{4.60}$$
$$a_2^2 = \kappa_2 = \frac{R(2)R(0) - R^2(1)}{R^2(0) - R^2(1)}, \tag{4.61}$$
$$a_1^2 = a_1^1 (1 - \kappa_2) = R(1) \frac{R(0) - R(2)}{R^2(0) - R^2(1)}. \tag{4.62}$$
Therefore, the final prediction coefficients for $K = 2$ are
$$a_1 = a_1^2 \quad \text{and} \quad a_2 = a_2^2. \tag{4.63}$$
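The recursion in steps 1 to 8 above can be sketched in Python as follows. The function name `levinson_durbin`, the symbol `kappa` for the quantity computed in step 4, and the autocorrelation values in the usage lines are illustrative assumptions.

```python
def levinson_durbin(R, K):
    """Solve the normal equations (4.45) for a predictor of order K.

    R holds the autocorrelation values R(0), ..., R(K).
    Returns (a, E): a = [a_1, ..., a_K] and E = E^K.
    """
    a = []        # coefficients of the order-(n-1) predictor; a[k-1] is a_k
    E = R[0]      # step 2: E^0 = R(0)
    for n in range(1, K + 1):
        # Step 4: quantity defined in (4.52)
        kappa = (R[n] - sum(a[k - 1] * R[n - k] for k in range(1, n))) / E
        # Step 6 (4.54) for k = 1..n-1, then step 5 (4.53) for k = n
        a = [a[k - 1] - kappa * a[n - k - 1] for k in range(1, n)] + [kappa]
        # Step 7 (4.55)
        E = (1.0 - kappa * kappa) * E
    return a, E

# Hypothetical autocorrelation values (assumptions for illustration).
a_ar1, E_ar1 = levinson_durbin([1.0, 0.5, 0.25], 2)
a_gen, E_gen = levinson_durbin([2.0, 1.0, 0.2], 2)
```

For the second set of values, the closed-form expressions (4.61) and (4.62) give $a_2 = (0.2 \cdot 2 - 1)/(4 - 1) = -0.2$ and $a_1 = 1 \cdot (2 - 0.2)/(4 - 1) = 0.6$, which the recursion should reproduce.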
4.4.3 Whitening Filter

From the definition of the prediction residue (4.5), we can establish the following relationship:
$$x(n-j) = r(n-j) + p(n-j). \tag{4.64}$$
Substituting it into the orthogonality principle (4.41), we have
$$E[r(n)\, r(n-j)] + E[r(n)\, p(n-j)] = 0, \quad j = 1, 2, \ldots, K, \tag{4.65}$$
so the autocorrelation function of the prediction residue is
$$R_r(j) = -E[r(n)\, p(n-j)], \quad j = 1, 2, \ldots, K. \tag{4.66}$$
Using (4.3), the right-hand side of the above equation may be further expanded into
$$R_r(j) = -\sum_{k=1}^{K} a_k E[r(n)\, x(n-j-k)], \quad j = 1, 2, \ldots, K. \tag{4.67}$$
4.4.3.1 Infinite Prediction Order

If the predictor order is infinite ($K = \infty$), the orthogonality principle (4.41) ensures that the right-hand side of (4.67) is zero, so the autocorrelation function of the prediction residue becomes
$$R_r(j) = \begin{cases} \sigma_r^2, & j = 0; \\ 0, & \text{otherwise}. \end{cases} \tag{4.68}$$
It indicates that the prediction residue sequence is a white noise process. Note that this conclusion is valid only when the predictor has an infinite number of prediction coefficients.

4.4.3.2 Markov Process

For predictors with a finite number of coefficients, the above condition is generally not true, unless the source signal is a Markov process with an order $N \le K$. Also called an autoregressive process and denoted as AR($N$), such a process $x(n)$ is generated by passing a white-noise process $w(n)$ through an $N$th-order all-pole filter
$$X(z) = \frac{W(z)}{1 - B(z)} = \frac{W(z)}{1 - \sum_{k=1}^{N} b_k z^{-k}}, \tag{4.69}$$
where
$$B(z) = \sum_{k=1}^{N} b_k z^{-k}, \tag{4.70}$$
and $X(z)$ and $W(z)$ are the Z-transforms of $x(n)$ and $w(n)$, respectively. The corresponding difference equation is
$$x(n) = w(n) + \sum_{k=1}^{N} b_k x(n-k). \tag{4.71}$$
The autocorrelation function of the AR process is
$$\begin{aligned} R_x(j) &= E[x(n)\, x(n-j)] \\ &= E[w(n)\, x(n-j)] + \sum_{k=1}^{N} b_k E[x(n-k)\, x(n-j)] \\ &= E[w(n)\, x(n-j)] + \sum_{k=1}^{N} b_k R(j-k). \end{aligned} \tag{4.72}$$
Since $w(n)$ is white,
$$E[w(n)\, x(n-j)] = \begin{cases} \sigma_w^2, & j = 0; \\ 0, & j > 0. \end{cases} \tag{4.73}$$
Consequently,
$$R_x(j) = \begin{cases} \sigma_w^2 + \sum_{k=1}^{N} b_k R(k), & j = 0; \\ \sum_{k=1}^{N} b_k R(j-k), & j > 0. \end{cases} \tag{4.74}$$
A comparison with the Wiener–Hopf equations (4.45) leads to the following set of optimal prediction coefficients:
$$a_k = \begin{cases} b_k, & 0 < k \le N; \\ 0, & N < k \le K; \end{cases} \tag{4.75}$$
which essentially sets
$$A(z) = B(z). \tag{4.76}$$
The above result makes intuitive sense because it sets the LPC encoder filter (4.7) to be the inverse of the filter (4.69) that generates the AR process. In particular, the Z-transform of the prediction residue is given by
$$R(z) = [1 - A(z)]\, X(z) \tag{4.77}$$
according to (4.7). Substituting (4.69) and using (4.76), we obtain
$$R(z) = [1 - A(z)] \frac{W(z)}{1 - B(z)} = W(z), \tag{4.78}$$
which is the unpredictable white noise that drives the AR process. An important implication of the above equation is that the prediction residue process is, once again, white for an AR process whose order is not larger than the predictor order.
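The identity (4.78) can be illustrated numerically. The sketch below assumes a hypothetical stable AR(2) process with $b_1 = 1.2$ and $b_2 = -0.5$, zero initial conditions, and Gaussian driving noise; the function names `ar_process` and `lpc_residue` are made up for this illustration.

```python
import random

def ar_process(b, w):
    """Generate x(n) = w(n) + sum_k b_k x(n-k), see (4.71), with zero initial state."""
    x = []
    for n, wn in enumerate(w):
        past = sum(b[k] * x[n - 1 - k] for k in range(len(b)) if n - 1 - k >= 0)
        x.append(wn + past)
    return x

def lpc_residue(a, x):
    """Apply the LPC encoder filter 1 - A(z) of (4.7) with zero initial state."""
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
            for n in range(len(x))]

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(1000)]   # white driving noise
b = [1.2, -0.5]            # hypothetical AR(2) coefficients (poles inside the unit circle)
x = ar_process(b, w)

# Predictor of order K = 3 >= N = 2, with coefficients chosen according to (4.75).
a = b + [0.0]
r = lpc_residue(a, x)
max_diff = max(abs(rn - wn) for rn, wn in zip(r, w))
```

With the coefficients set as in (4.75), the residue reproduces the driving noise $w(n)$ up to floating-point rounding.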
4.4.3.3 Other Cases

When a predictor with a finite order is applied to a general stochastic process, the prediction residue process is generally not white, but may be considered as approximately white in practical applications. As an example, let us consider the signal at the top of Fig. 4.1, which is not an AR process. Applying the Levinson–Durbin procedure in Sect. 4.4.2, we obtain the optimal first-order filter as
$$a_1 = \frac{R(1)}{R(0)} \approx 0.99. \tag{4.79}$$
The spectrum of the prediction residue using this optimal predictor is shown at the top of Fig. 4.6. It is obviously flat, so the residue may be considered as white.

Therefore, the LPC encoder filter (4.7) that produces the prediction residue signal is sometimes called a whitening filter. Note that it is the inverse of the decoder or the LPC reconstruction filter (4.10):
$$H(z) = \frac{1}{D(z)}. \tag{4.80}$$
Fig. 4.6 Power spectra of the prediction residue (top), source signal (bottom), and the estimate by the reconstruction filter (the envelope in the bottom). Panel titles: Prediction Residue; Input Signal and Spectral Envelope. Axes: Frequency (Hz) versus Magnitude (dB)
4.4.4 Spectrum Estimator

From (4.7), the Z-transform of the source signal may be expressed as
$$X(z) = \frac{R(z)}{H(z)} = \frac{R(z)}{1 - \sum_{k=1}^{K} a_k z^{-k}}, \tag{4.81}$$
which implies that the power spectrum of the source signal is
$$S_{xx}(e^{j\omega}) = \frac{S_{rr}(e^{j\omega})}{\left| 1 - \sum_{k=1}^{K} a_k e^{-jk\omega} \right|^2}, \tag{4.82}$$
where $S_{xx}(e^{j\omega})$ and $S_{rr}(e^{j\omega})$ are the power spectra of $x(n)$ and $r(n)$, respectively. Since $r(n)$ is white or nearly white, the equation above becomes
$$S_{xx}(e^{j\omega}) \approx \frac{\sigma_r^2}{\left| 1 - \sum_{k=1}^{K} a_k e^{-jk\omega} \right|^2}. \tag{4.83}$$
Therefore, the decoder or the LPC reconstruction filter (4.10) provides an estimate of the spectrum of the source signal, and linear prediction is sometimes considered as a time-frequency analysis tool. Note that this is an all-pole model for the source signal, so it can model peaks well, but is incapable of modeling zeros (deep valleys) that may exist in the source signal. For this reason, the linear prediction spectrum is sometimes referred to as a spectral envelope. Furthermore, if the source signal cannot be modeled by poles, linear prediction may fail completely.

The spectrum of the source signal and the estimate by the LPC reconstruction filter (4.83) are shown at the bottom of Fig. 4.6. It can be observed that the spectrum estimated by the LPC reconstruction filter matches the envelope of the signal spectrum very well.
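As a rough illustration of the all-pole estimate in (4.83), the following sketch evaluates it at a given frequency. The function name `lpc_spectrum` and the numerical values are assumptions chosen only so that the result is easy to verify by hand.

```python
import cmath
import math

def lpc_spectrum(a, sigma_r2, omega):
    """All-pole spectrum estimate (4.83) at angular frequency omega (rad/sample).

    a = [a_1, ..., a_K] are prediction coefficients, sigma_r2 the residue variance.
    """
    denom = 1.0 - sum(ak * cmath.exp(-1j * (k + 1) * omega)
                      for k, ak in enumerate(a))
    return sigma_r2 / abs(denom) ** 2

# Hypothetical first-order predictor (an assumption): a_1 = 0.5, residue variance 0.75.
s_low = lpc_spectrum([0.5], 0.75, 0.0)        # 0.75 / |1 - 0.5|^2
s_high = lpc_spectrum([0.5], 0.75, math.pi)   # 0.75 / |1 + 0.5|^2
```

For this positive $a_1$ the estimate is largest at $\omega = 0$ and smallest at $\omega = \pi$, i.e., a low-pass envelope, which is what a positively correlated source would be expected to produce.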
4.5 Noise Shaping

As discussed in Sect. 4.4, the optimal LPC encoder produces a prediction residue sequence that is white or nearly white. When this white residue is quantized, the quantization noise is often white as well, especially when fine quantization is used. This is a mismatch to the sensitivity curve of the human ear, which is well known for its substantial variation with frequency. It is, therefore, desirable to shape the spectrum of quantization noise at the decoder output to match the sensitivity curve of the human ear so that more perceptual irrelevancy can be removed. Linear prediction offers a flexible and simple mechanism for achieving this.
4.5.1 DPCM

Let us revisit the DPCM system in Fig. 4.5. Its output given by (4.23) may be expanded using (4.8) to become
$$\hat{r}(n) = x(n) - \sum_{k=1}^{K} a_k \hat{x}(n-k) + q(n). \tag{4.84}$$
Due to (4.25), the above equation may be written as
$$\begin{aligned} \hat{r}(n) &= x(n) - \sum_{k=1}^{K} a_k \left[ x(n-k) + q(n-k) \right] + q(n) \\ &= x(n) - \sum_{k=1}^{K} a_k x(n-k) + q(n) - \sum_{k=1}^{K} a_k q(n-k). \end{aligned} \tag{4.85}$$
The Z-transform of the above equation is
$$\hat{R}(z) = [1 - A(z)]\, X(z) + [1 - A(z)]\, Q(z), \tag{4.86}$$
where $\hat{R}(z)$ is the Z-transform of $\hat{r}(n)$.

Fig. 4.7 Different schemes of linear prediction for shaping quantization noise at the decoder output: DPCM (top), open-loop DPCM (middle) and general noise feedback coding (bottom)

The above equation indicates that the DPCM system in Figs. 4.5 and 4.4 may be implemented using the structure at the top of Fig. 4.7 [5]. Both the source signal and the quantization noise are shaped by the same LPC encoder filter and the shaping is subsequently reversed by the reconstruction filter in the decoder, so the overall transfer functions for the source signal and the quantization noise are the same and equal to one. Therefore, the spectrum of the quantization noise at the decoder output is the same as that produced by the quantizer (see (4.26)). In other words, the spectrum of the quantization noise as produced by the quantizer is faithfully duplicated at the decoder output; there is no shaping of quantization noise.

Many quantizers, including the uniform quantizer, produce a white quantization noise spectrum when the quantization step size is small (fine quantization). This white spectrum is faithfully duplicated at the decoder output by DPCM.
4.5.2 Open-Loop DPCM

Let us now consider the open-loop DPCM in Figs. 4.3 and 4.4, which are redrawn in the middle of Fig. 4.7. Compared with the DPCM scheme at the top, the processing for the source signal is unchanged: the overall transfer function is still one. However, there is no processing for the quantization noise in the encoder. It is shaped only by the LPC reconstruction filter, so the quantization noise at the decoder output is given by (4.21).

As shown in Sect. 4.4.4, the optimal LPC reconstruction filter traces the spectral envelope of the source signal, so the quantization noise is shaped toward that envelope. If the quantizer produces white quantization noise, the quantization noise at the decoder output is shaped to the spectral envelope of the source signal.
4.5.3 Noise-Feedback Coding

While the LPC filter coefficients are determined by the necessity to maximize prediction gain, the DPCM and open-loop DPCM schemes imply that the filter for processing the quantization noise in the encoder can be altered to shape the quantization noise without any impact on the perfect reconstruction of the source signal. The transfer function for such a filter, called the error-feedback function or noise-feedback function, does not have to be either $1 - A(z)$ or $1$; a different noise-feedback function can be used to shape the quantization noise spectrum to other desirable shapes. This gives rise to the noise-feedback coding shown at the bottom of Fig. 4.7, where the noise-feedback function is denoted as
$$1 - F(z). \tag{4.87}$$
Figure 4.7 implies an important advantage of noise-feedback coding: the shaping of quantization noise is accomplished completely in the encoder, so there is zero implementation impact or cost at the decoder.

When designing the noise-feedback function, it is important to realize that the noise-feedback function is only half of the overall noise-shaping filter
$$S(z) = \frac{1 - F(z)}{1 - A(z)}; \tag{4.88}$$
the other half is the LPC reconstruction filter, which is determined by solving the normal equations (4.46).

The Z-transform of the quantization error at the decoder output is given by
$$E(z) = S(z)\, Q(z), \tag{4.89}$$
so the noise spectrum is
$$S_{ee}(e^{j\omega}) = |S(e^{j\omega})|^2 S_{qq}(e^{j\omega}). \tag{4.90}$$
In audio coding, this spectrum is supposed to be shaped to match the masked threshold for a given source signal. In principle, this matching can be achieved for all masked threshold shapes provided by a perceptual model [38].
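The three schemes of Fig. 4.7 differ only in the choice of $F(z)$ in (4.88), which the following sketch makes concrete by evaluating $|S(e^{j\omega})|^2$. The function names, the first-order coefficient $a_1 = 0.9$, and the weighted choice $f_k = \gamma^k a_k$ with $\gamma = 0.8$ are assumptions for illustration; the weighted form is one conventional option and is not taken from the text.

```python
import cmath

def one_minus_poly(c, omega):
    """Evaluate 1 - sum_k c_k e^{-jk*omega} for c = [c_1, ..., c_K]."""
    return 1.0 - sum(ck * cmath.exp(-1j * (k + 1) * omega)
                     for k, ck in enumerate(c))

def shaping_gain(f, a, omega):
    """|S(e^{j*omega})|^2 for S(z) = (1 - F(z)) / (1 - A(z)), see (4.88)."""
    return abs(one_minus_poly(f, omega)) ** 2 / abs(one_minus_poly(a, omega)) ** 2

a = [0.9]        # hypothetical first-order LPC coefficient
gamma = 0.8      # hypothetical weighting factor (an assumption)
f_weighted = [gamma ** (k + 1) * ak for k, ak in enumerate(a)]

omega = 0.0
gain_dpcm = shaping_gain(a, a, omega)            # F = A: no shaping, as in Sect. 4.5.1
gain_open = shaping_gain([], a, omega)           # F = 0: open-loop DPCM, Sect. 4.5.2
gain_nfc = shaping_gain(f_weighted, a, omega)    # intermediate noise-feedback choice
```

At $\omega = 0$, where this particular reconstruction filter peaks, the DPCM choice leaves the noise untouched, the open-loop choice amplifies it the most, and the weighted choice falls in between.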
Chapter 5
Transform Coding
Transform coding (TC) is a method that transforms a source signal into another one with a more compact representation. The goal is to quantize the transformed signal in such a way that the quantization error in the reconstructed signal is smaller than directly quantizing the source signal.
5.1 Transform Coder

Transform coding is block-based, so the source signal $x(n)$ is first grouped into blocks, each of which consists of $M$ samples and is represented by a vector:
$$\mathbf{x}(n) = [x_0(n), x_1(n), \ldots, x_{M-1}(n)]^T = [x(nM), x(nM-1), \ldots, x(nM-M+1)]^T. \tag{5.1}$$
The dimension $M$ is called the block size or block length. For a linear transform, the transformed block is obtained as
$$\mathbf{y}(n) = \mathbf{T}\mathbf{x}(n), \tag{5.2}$$
where
$$\mathbf{y}(n) = [y_0(n), y_1(n), \ldots, y_{M-1}(n)]^T \tag{5.3}$$
is called the transform of $\mathbf{x}(n)$ or transform coefficients and the $M \times M$ matrix
$$\mathbf{T} = \begin{bmatrix} t_{0,0} & t_{0,1} & \cdots & t_{0,M-1} \\ t_{1,0} & t_{1,1} & \cdots & t_{1,M-1} \\ \vdots & \vdots & \ddots & \vdots \\ t_{M-1,0} & t_{M-1,1} & \cdots & t_{M-1,M-1} \end{bmatrix} \tag{5.4}$$
is called the transformation matrix or simply the transform. This transform operation is shown in the left of Fig. 5.1.
Y. You, Audio Coding: Theory and Applications, DOI 10.1007/978-1-4419-1754-6 5, c Springer Science+Business Media, LLC 2010
Fig. 5.1 Flow chart of a transform coder. The coding delay for transforming and inverse transforming the signals is ignored in this graph
Transform coding is shown in Fig. 5.1. The transform coefficients $\mathbf{y}(n)$ are quantized into quantized coefficients $\hat{\mathbf{y}}(n)$ and the resulting quantization indexes are transmitted to the decoder. The decoder reconstructs from the received indexes the quantized coefficients $\hat{\mathbf{y}}(n)$ through inverse quantization. This process may be viewed as if the quantized coefficients $\hat{\mathbf{y}}(n)$ were received directly by the decoder.

The decoder then reconstructs an estimate $\hat{\mathbf{x}}(n)$ of the source signal vector $\mathbf{x}(n)$ from the quantized coefficients $\hat{\mathbf{y}}(n)$ through an inverse transform
$$\hat{\mathbf{x}}(n) = \mathbf{T}^{-1} \hat{\mathbf{y}}(n), \tag{5.5}$$
where $\mathbf{T}^{-1}$ represents the inverse transform. The reconstructed vector $\hat{\mathbf{x}}(n)$ can then be unblocked to rebuild an estimate $\hat{x}(n)$ of the original source signal $x(n)$.

A basic requirement for a transform in the context of transform coding is that it must be invertible,
$$\mathbf{T}^{-1}\mathbf{T} = \mathbf{I}, \tag{5.6}$$
so that a source block can be recovered from its transform coefficients in the absence of quantization:
$$\mathbf{x}(n) = \mathbf{T}^{-1}\mathbf{y}(n). \tag{5.7}$$
Orthogonal transforms are most frequently used in practical applications. To be orthogonal, a transform must satisfy
$$\mathbf{T}^{-1} = \mathbf{T}^T, \tag{5.8}$$
or
$$\mathbf{T}^T\mathbf{T} = \mathbf{I}. \tag{5.9}$$
Consequently, the inverse transform becomes
$$\mathbf{x}(n) = \mathbf{T}^T\mathbf{y}(n). \tag{5.10}$$
A transform matrix $\mathbf{T}$ can be considered as consisting of $M$ rows of vectors
$$\mathbf{T} = \begin{bmatrix} \mathbf{t}_0^T \\ \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_{M-1}^T \end{bmatrix}, \tag{5.11}$$
where $\mathbf{t}_k^T$ represents the $k$th row of $\mathbf{T}$:
$$\mathbf{t}_k^T = [t_{k,0}, t_{k,1}, \ldots, t_{k,M-1}], \quad k = 0, 1, \ldots, M-1. \tag{5.12}$$
Equation (5.10) becomes
$$\mathbf{x}(n) = [\mathbf{t}_0, \mathbf{t}_1, \ldots, \mathbf{t}_{M-1}]\, \mathbf{y}(n) = \sum_{k=0}^{M-1} y_k \mathbf{t}_k, \tag{5.13}$$
which means that the source can be represented as a linear combination of the vectors $\{\mathbf{t}_k\}_{k=0}^{M-1}$. Therefore, the rows of $\mathbf{T}$ are often referred to as basis vectors or basis functions.

One of the advantages of an orthogonal transform is that its inverse transform $\mathbf{T}^T$ is immediately defined without considering matrix inversion. The fact that the inverse matrix is just the transpose of the transform matrix means that the inverse transform can be implemented by just transposing the transform flow graph or running it backwards.

Another advantage of an orthogonal transform is energy conservation, which means that the transform coefficients have the same total energy as the source signal:
$$\sum_{k=0}^{M-1} y^2(k) = \sum_{k=0}^{M-1} x^2(k). \tag{5.14}$$
This can be easily proved:
$$\sum_{k=0}^{M-1} y^2(k) = \mathbf{y}^T\mathbf{y} = \mathbf{x}^T\mathbf{T}^T\mathbf{T}\mathbf{x} = \mathbf{x}^T\mathbf{x} = \sum_{k=0}^{M-1} x^2(k), \tag{5.15}$$
where the orthogonality condition in (5.9) is used. In the derivation above, the block index $n$ is dropped for convenience. When the context is appropriate, this practice will be followed in the remainder of this book.
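The properties (5.10) and (5.14) can be checked on a concrete orthogonal transform. The sketch below uses an orthonormal DCT-II matrix purely as an example of an orthogonal $\mathbf{T}$; this choice, the block size $M = 4$, the sample values, and the function names are assumptions and are not prescribed by the text at this point.

```python
import math

def dct_matrix(M):
    """Orthonormal DCT-II matrix, used here only as an example of an orthogonal T."""
    T = []
    for k in range(M):
        c = math.sqrt(1.0 / M) if k == 0 else math.sqrt(2.0 / M)
        T.append([c * math.cos(math.pi * (2 * n + 1) * k / (2.0 * M))
                  for n in range(M)])
    return T

def forward(T, x):
    """y = T x, see (5.2)."""
    return [sum(t * v for t, v in zip(row, x)) for row in T]

def inverse(T, y):
    """x = T^T y = sum_k y_k t_k, see (5.10) and (5.13)."""
    M = len(T)
    return [sum(T[k][n] * y[k] for k in range(M)) for n in range(M)]

T = dct_matrix(4)
x = [3.0, -1.0, 4.0, 1.5]          # hypothetical source block
y = forward(T, x)
x_rec = inverse(T, y)

energy_x = sum(v * v for v in x)   # right-hand side of (5.14)
energy_y = sum(v * v for v in y)   # left-hand side of (5.14)
```

No matrix inversion is needed in `inverse`: the same matrix entries are simply read in transposed order.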
5.2 Optimal Bit Allocation and Coding Gain

The operations involved in the transform and inverse transform are obviously sophisticated and entail a significant amount of calculation. This extra burden is carried due to the anticipation that, for a given bit rate, the quantization error in the reconstructed signal will be smaller than that of directly quantizing the source signal. This is explained below.
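5.2.1 Quantization Noise

The quantization error for the transform coder is the reconstruction error at the decoder output:
$$\hat{q}_k = \hat{x}(nM - k) - x(nM - k), \quad k = 0, 1, \ldots, M-1, \tag{5.16}$$
so the MSQE is
$$\sigma^2_{q(\hat{x})} = \frac{1}{M}\sum_{k=0}^{M-1} \sigma^2_{\hat{q}_k} = \frac{1}{M}\sum_{k=0}^{M-1} E\left[\hat{q}_k^2\right], \tag{5.17}$$
where zero mean is assumed without loss of generality. Let
$$\hat{\mathbf{q}} = [\hat{q}_0, \hat{q}_1, \ldots, \hat{q}_{M-1}]^T, \tag{5.18}$$
the above equation becomes
$$\sigma^2_{q(\hat{x})} = \frac{1}{M} E\left[\hat{\mathbf{q}}^T\hat{\mathbf{q}}\right]. \tag{5.19}$$
From Fig. 5.1, we obtain
$$\hat{\mathbf{q}} = \mathbf{T}^T\mathbf{q}, \tag{5.20}$$
so
$$\sigma^2_{q(\hat{x})} = \frac{1}{M} E\left[\mathbf{q}^T\mathbf{T}\mathbf{T}^T\mathbf{q}\right]. \tag{5.21}$$
Due to the orthogonal condition in (5.9), $\mathbf{T}\mathbf{T}^T = \mathbf{I}$, so
$$\sigma^2_{q(\hat{x})} = \frac{1}{M} E\left[\mathbf{q}^T\mathbf{q}\right] = \frac{1}{M}\sum_{k=0}^{M-1} E\left[q_k^2\right] = \frac{1}{M}\sum_{k=0}^{M-1} \sigma^2_{q_k}. \tag{5.22}$$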
Suppose that the bit rate for the transform coder is $R$ bits per source sample. Since there are $M$ samples in a block, the total number of bits available for coding one block of source samples is $MR$. These bits are allocated to the quantizers in Fig. 5.1. If the $k$th quantizer is allocated $r_k$ bits, the total must be $MR$:
$$R = \frac{1}{M}\sum_{k=0}^{M-1} r_k. \tag{5.23}$$
Due to (2.48), the MSQE for the $k$th quantizer is
$$\sigma^2_{q_k} = 10^{-0.1(a + b r_k)}\,\sigma^2_{y_k}, \quad k = 0, 1, \ldots, M-1, \tag{5.24}$$
where the parameters $a$ and $b$ depend on the quantization scheme and the probability distribution of $y_k(n)$. Dropping (5.24) into (5.22), we obtain the following total MSQE:
$$\sigma^2_{q(\hat{x})} = \frac{1}{M}\sum_{k=0}^{M-1} 10^{-0.1(a + b r_k)}\,\sigma^2_{y_k}. \tag{5.25}$$
Apparently, the MSQE is a function of both the signal variance $\sigma^2_{y_k}$ of each transform coefficient and the number of bits allocated to quantize it. While the former is determined by the source signal and the transform $\mathbf{T}$, the latter is determined by how the bits are allocated to the quantizers, i.e., the bit allocation strategy. For a given bit rate $R$, a different bit allocation strategy assigns a different set of $\{r_k\}_{k=0}^{M-1}$, called a bit allocation, which results in a different MSQE. It is, therefore, imperative to find the optimal one that gives the minimum MSQE.
5.2.2 AM–GM Inequality

Before addressing this question, let us first define the arithmetic mean
$$\mathrm{AM} = \frac{1}{M}\sum_{k=0}^{M-1} p_k \tag{5.26}$$
and the geometric mean
$$\mathrm{GM} = \left(\prod_{k=0}^{M-1} p_k\right)^{1/M} \tag{5.27}$$
for a given set of nonnegative numbers $p_k$, $k = 0, 1, \ldots, M-1$. The AM–GM inequality [89] states that
$$\mathrm{AM} \ge \mathrm{GM} \tag{5.28}$$
with equality if and only if
$$p_0 = p_1 = \cdots = p_{M-1}. \tag{5.29}$$
For an interpretation of this inequality, let us consider a $p_0 \times p_1$ rectangle. Its perimeter is $2(p_0 + p_1)$, its area is $p_0 p_1$, and the average length of its sides is $\frac{p_0 + p_1}{2}$. A square with the same area obviously has sides with a length of $\sqrt{p_0 p_1}$. The AM–GM inequality states that $\frac{p_0 + p_1}{2} \ge \sqrt{p_0 p_1}$ with equality if and only if $p_0 = p_1$. In other words, among all rectangles with the same area, the square has the shortest average length of sides.
5.2.3 Optimal Conditions

Applying the AM–GM inequality to (5.22), we have
$$\sigma^2_{q(\hat{x})} = \frac{1}{M}\sum_{k=0}^{M-1} \sigma^2_{q_k} \ge \left(\prod_{k=0}^{M-1} \sigma^2_{q_k}\right)^{1/M} \tag{5.30}$$
with equality if and only if
$$\sigma^2_{q_0} = \sigma^2_{q_1} = \cdots = \sigma^2_{q_{M-1}}. \tag{5.31}$$
Plugging in (5.24), we have
$$\left(\prod_{k=0}^{M-1} \sigma^2_{q_k}\right)^{1/M} = \left(\prod_{k=0}^{M-1} 10^{-0.1(a + b r_k)}\,\sigma^2_{y_k}\right)^{1/M} = 10^{-0.1a}\,10^{-0.1\frac{b}{M}\sum_{k=0}^{M-1} r_k} \left(\prod_{k=0}^{M-1} \sigma^2_{y_k}\right)^{1/M} = 10^{-0.1(a + bR)} \left(\prod_{k=0}^{M-1} \sigma^2_{y_k}\right)^{1/M}, \tag{5.32}$$
where (5.23) is used to arrive at the last equation. Plugging this back into (5.30), we have
$$\sigma^2_{q(\hat{x})} \ge 10^{-0.1(a + bR)} \left(\prod_{k=0}^{M-1} \sigma^2_{y_k}\right)^{1/M}, \tag{5.33}$$
with equality if and only if (5.31) holds.

For a given transform matrix $\mathbf{T}$, the variances of the transform coefficients $\sigma^2_{y_k}$ are determined by the source signal, so the right-hand side of (5.33) is a fixed value for a given bit rate $R$. Therefore, (5.33) and (5.31) state that

Optimal MSQE. The MSQE from any bit allocation $\{r_k\}_{k=0}^{M-1}$ is either equal to or larger than the fixed value on the right-hand side of (5.33), which is thus the minimal MSQE that could be achieved.

Optimal Bit Allocation. This minimal MSQE is achieved, or the equality holds, if and only if (5.31) is satisfied, or the bits are allocated in such a way that the MSQE from all quantizers are equalized.

Note that this result is obtained without dependence on the actual transform matrix used. As long as the transform matrix is orthogonal, the results in this section hold.
5.2.4 Coding Gain

Now it is established that the minimal MSQE of a transform coder is given by the right-hand side of (5.33), which we denote as
$$\sigma^2_{q(\mathrm{TC})} = 10^{-0.1(a + bR)} \left(\prod_{k=0}^{M-1} \sigma^2_{y_k}\right)^{1/M}. \tag{5.34}$$
If the source signal is quantized directly as PCM with the same bit rate of $R$, the MSQE is given by (2.48) and denoted as
$$\sigma^2_{q(\mathrm{PCM})} = 10^{-0.1(a + bR)}\,\sigma_x^2. \tag{5.35}$$
Then the coding gain of a transform coder over PCM is
$$G_{\mathrm{TC}} = \frac{\sigma^2_{q(\mathrm{PCM})}}{\sigma^2_{q(\mathrm{TC})}} = \frac{\sigma_x^2}{\left(\prod_{k=0}^{M-1} \sigma^2_{y_k}\right)^{1/M}}, \tag{5.36}$$
which is the ratio of the source variance $\sigma_x^2$ to the geometric mean of the transform coefficient variances $\sigma^2_{y_k}$. Due to the energy conservation property of $\mathbf{T}$ (5.14), we have
$$\sigma_x^2 = \frac{1}{M}\sum_{k=0}^{M-1} \sigma^2_{x_k} = \frac{1}{M}\sum_{k=0}^{M-1} \sigma^2_{y_k}, \tag{5.37}$$
where $\sigma^2_{x_k}$ is the variance of the $k$th element of the source vector $\mathbf{x}$. Applying this back into (5.36), we have
$$G_{\mathrm{TC}} = \frac{\frac{1}{M}\sum_{k=0}^{M-1} \sigma^2_{y_k}}{\left(\prod_{k=0}^{M-1} \sigma^2_{y_k}\right)^{1/M}}, \tag{5.38}$$
which states that the optimal coding gain of a transform coder is the ratio of the arithmetic to the geometric mean of the transform coefficient variances. Due to the AM–GM inequality (5.28), we always have
$$G_{\mathrm{TC}} \ge 1. \tag{5.39}$$
Note, however, that this is valid if and only if the optimal bit allocation strategy is deployed. Otherwise, the equality in (5.30) will not hold and the transform coder will not be able to achieve the minimal MSQE given by (5.34).
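To make (5.38) concrete, the sketch below evaluates the ratio of arithmetic to geometric mean for two sets of coefficient variances with the same total energy. The function name `coding_gain` and the variance values are hypothetical, chosen only for illustration.

```python
import math

def coding_gain(variances):
    """Ratio of arithmetic to geometric mean of the coefficient variances, as in (5.38).

    All variances must be strictly positive, otherwise the geometric mean breaks down.
    """
    m = len(variances)
    am = sum(variances) / m
    gm = math.exp(sum(math.log(v) for v in variances) / m)
    return am / gm

flat = [1.0, 1.0, 1.0, 1.0]        # energy spread evenly: no compaction
compact = [3.7, 0.1, 0.1, 0.1]     # same total energy, mostly in one coefficient

g_flat = coding_gain(flat)
g_compact = coding_gain(compact)
print(g_flat, g_compact, 10 * math.log10(g_compact))
```

Equal variances give a gain of one, the equality case of the AM–GM inequality, while the uneven set gives a gain above one.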
5.2.5 Optimal Bit Allocation

The optimal bit allocation strategy requires that bits be allocated in such a way that the MSQE from all quantizers are equalized to a constant (see (5.31)). Denoting such a constant as $\sigma_0^2$, we obtain from (5.24) that the number of bits allocated to the $k$th transform coefficient is
$$r_k = \frac{10}{b}\log_{10}\sigma^2_{y_k} - \frac{1}{b}\left(10\log_{10}\sigma_0^2 + a\right). \tag{5.40}$$
Dropping it into (5.23), we obtain
$$\frac{1}{b}\left(10\log_{10}\sigma_0^2 + a\right) = \frac{10}{b}\,\frac{1}{M}\sum_{k=0}^{M-1}\log_{10}\sigma^2_{y_k} - R. \tag{5.41}$$
Dropping this back into (5.40), we obtain the following optimal bit allocation:
$$r_k = R + \frac{10}{b}\left[\log_{10}\sigma^2_{y_k} - \frac{1}{M}\sum_{i=0}^{M-1}\log_{10}\sigma^2_{y_i}\right] = R + \frac{10}{b\log_2 10}\,\log_2\frac{\sigma^2_{y_k}}{\left(\prod_{i=0}^{M-1}\sigma^2_{y_i}\right)^{1/M}}, \quad k = 0, 1, \ldots, M-1. \tag{5.42}$$
If each transform coefficient is considered as satisfying a uniform distribution and is quantized using a uniform quantizer, the parameter $b$ is given by (2.44). Then the above equation becomes
$$r_k = R + 0.5\log_2\frac{\sigma^2_{y_k}}{\left(\prod_{i=0}^{M-1}\sigma^2_{y_i}\right)^{1/M}}, \quad k = 0, 1, \ldots, M-1. \tag{5.43}$$
Both (5.42) and (5.43) state that, aside from a global bias determined by the bit rate R and the geometric mean of the variances of the transform coefficients, bits should be assigned to a transform coefficient in proportion to the logarithm of its variance.
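The allocation rule can be sketched in a few lines. The example below assumes $b = 20\log_{10}2 \approx 6.02$, which is the value that reduces (5.42) to (5.43), sets $a = 0$ arbitrarily, and uses hypothetical variances; none of these numbers come from the text.

```python
import math

B_UNIFORM = 20 * math.log10(2)   # about 6.02; turns (5.42) into (5.43)

def optimal_bits(variances, rate, b=B_UNIFORM):
    """Bits per coefficient from (5.42)."""
    m = len(variances)
    mean_log = sum(math.log10(v) for v in variances) / m   # log10 of the geometric mean
    return [rate + (10 / b) * (math.log10(v) - mean_log) for v in variances]

def msqe(variance, bits, a=0.0, b=B_UNIFORM):
    """Per-quantizer MSQE model from (5.24); a = 0 is an arbitrary choice for this sketch."""
    return 10 ** (-0.1 * (a + b * bits)) * variance

variances = [16.0, 4.0, 1.0, 0.25]   # hypothetical coefficient variances
R = 3.0                              # bits per source sample
bits = optimal_bits(variances, R)
errors = [msqe(v, r) for v, r in zip(variances, bits)]
print(bits)     # approximately [4.5, 3.5, 2.5, 1.5]
print(errors)   # all approximately 1/32, i.e. equalized as required by (5.31)
```

The allocations average to $R$, and each factor of four in variance earns one extra bit, which is the logarithmic proportionality stated above.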
5.2.6 Practical Bit Allocation

It is unlikely that the optimal bit allocation strategy would allocate an integer value $r_k$ to any transform coefficient. When the quantization indexes are directly packed into a bit stream for delivery to the decoder, only an integer number of bits can be packed each time. If entropy coding is subsequently used to code the quantization indexes, an integer number of bits is not necessary, but the number of quantization intervals has to be an integer. In this case, $r_k$ can be a rational number. But this still cannot be guaranteed by the bit allocation strategy.

A simple approach to addressing this problem is to round $r_k$ to its nearest integer or a value that corresponds to an integer number of quantization intervals. There are, however, more elaborate methods, such as a water-filling procedure which iteratively allocates bits to transform coefficients with the largest quantization error [4, 51].

In addition, the optimal bit allocation strategy assumes an ample supply of bits. Otherwise, some of the transform coefficients with a small variance will get a negative number of bits, as can be seen from both (5.42) and (5.43). If the bit rate $R$ is not large enough to ensure a positive number of bits allocated to all transform coefficients, the strategy can be modified to include the following clause:
$$r_k = 0 \quad \text{if } r_k < 0. \tag{5.44}$$
Furthermore, if the variance of any transform coefficient becomes zero, zero bits are allocated to it and the zero variance is subsequently taken out of the geometric mean calculation. With these modifications, however, the equalization condition (5.31) of the AM–GM inequality no longer holds, so the quantizer cannot achieve the minimal MSQE in (5.34) and the coding gain in (5.36).

To illustrate the above method, let us consider an extreme example where
$$\sigma^2_{y_0} = M\sigma_x^2; \quad \sigma^2_{y_k} = 0, \quad k = 1, 2, \ldots, M-1. \tag{5.45}$$
The zero variance causes the geometric mean to completely break down, so both (5.34) and (5.36) are meaningless. The modified strategy above, however, dictates the following bit allocation:
$$r_0 = MR; \quad r_k = 0, \quad k = 1, 2, \ldots, M-1. \tag{5.46}$$
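One possible way to apply the clamping clause (5.44) is sketched below. The re-allocation loop, which re-runs (5.43) on the coefficients that still receive bits so that the total stays at $MR$, is an assumption of this sketch and is not described in the text; the function name `practical_bits` and the variance values are likewise hypothetical.

```python
import math

def practical_bits(variances, rate):
    """Clamp negative allocations to zero and re-run (5.43) on the remaining coefficients.

    The iterative re-allocation is an assumption of this sketch; the text only states
    the clamping rule (5.44) and the removal of zero variances from the geometric mean.
    """
    m = len(variances)
    active = [k for k, v in enumerate(variances) if v > 0]   # zero variance gets zero bits
    bits = [0.0] * m
    while active:
        budget = m * rate / len(active)      # all MR bits are shared by the active set
        mean_log = sum(math.log2(variances[k]) for k in active) / len(active)
        trial = {k: budget + 0.5 * (math.log2(variances[k]) - mean_log) for k in active}
        negative = [k for k in active if trial[k] < 0]
        if not negative:
            for k in active:
                bits[k] = trial[k]
            break
        active = [k for k in active if k not in negative]
    return bits

extreme = practical_bits([8.0, 0.0, 0.0, 0.0], rate=1.0)   # the example (5.45) with M = 4
alloc = practical_bits([64.0, 4.0, 1.0, 0.001], rate=1.0)
rounded = [round(r) for r in alloc]                        # simplest integer approximation
print(extreme, alloc, rounded)
```

For the extreme example, the sketch assigns all $MR = 4$ bits to the first coefficient, in line with (5.46).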
5.2.7 Energy Compaction

Dropping the above two equations for the extreme example into (5.25), we obtain the total MSQE as
$$\sigma^2_{q(\hat{x})} = 10^{-0.1(a + bMR)}\,\sigma_x^2. \tag{5.47}$$
Since the MSQE for direct quantization (PCM) is still given by (5.35), the effective coding gain is
$$G_{\mathrm{TC}} = \frac{\sigma^2_{q(\mathrm{PCM})}}{\sigma^2_{q(\hat{x})}} = 10^{0.1(M-1)bR}. \tag{5.48}$$
To appreciate this improvement, consider a uniform distribution, whose parameter $b$ is given by (2.44). The above coding gain becomes
$$G_{\mathrm{TC}}\ (\mathrm{dB}) \approx 6.02(M-1)R. \tag{5.49}$$
For a scenario of $M = 1{,}024$ and $R = 1$ bit per source sample, which is typical in audio coding, this coding gain is $6.02 \times 1023 \approx 6158$ dB!

An intuitive explanation for this dramatic improvement is energy compaction and the exponential reduction of quantization error with respect to bit rate. With direct quantization, each sample in the source vector has a variance of $\sigma_x^2$ and is allocated $R$ bits, resulting in a total signal variance of $M\sigma_x^2$ and a total number of $MR$ bits for the whole block. With transform coding, however, this total variance of $M\sigma_x^2$ for the whole block is compacted to the first coefficient and all $MR$ bits for the whole block are allocated to it. Since the MSQE is linearly proportional to the variance (see (5.25)), the MSQE for the first coefficient would increase $M$ times due to the $M$ times increase in variance, but this increase is spread out to the $M$ samples in the block, resulting in no change. However, the MSQE decreases exponentially with bit rate, so the $M$ times increase in bit rate causes an $M$ times MSQE decrease in decibels!

In fact, energy compaction is the key to coding gain in transform coding. Since the transform matrix $\mathbf{T}$ is orthogonal, the arithmetic mean is constant for a given signal, no matter how its energy is distributed by the transform to the individual coefficients. This means that the numerator of (5.38) remains the same regardless of the transform matrix $\mathbf{T}$. However, if the transform distributes most of the signal energy (variance) to a minority of transform coefficients and leaves the balance to the rest, the geometric mean in the denominator of (5.38) becomes extremely small. Consequently, the coding gain becomes extremely large.
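As an arithmetic check of (5.48) and (5.49), the sketch below evaluates the gain directly from the MSQE model for a small block and from the closed form for the audio-like case. It assumes $b = 20\log_{10}2$ and $a = 0$; the helper names are hypothetical.

```python
import math

B_UNIFORM = 20 * math.log10(2)   # about 6.02 dB per bit (assumed value of b)

def msqe_model(variance, bits, a=0.0, b=B_UNIFORM):
    """MSQE model of (5.24); a = 0 is an arbitrary choice for this sketch."""
    return 10 ** (-0.1 * (a + b * bits)) * variance

def extreme_gain_db(m, rate, b=B_UNIFORM):
    """Gain of the extreme example in dB, i.e. 10*log10 of (5.48)."""
    return (m - 1) * b * rate

# Direct evaluation for a small block, where the numbers do not underflow.
M, R, var_x = 8, 1.0, 1.0
pcm = msqe_model(var_x, R)                 # (5.35)
tc = msqe_model(M * var_x, M * R) / M      # (5.25) with only the k = 0 term non-zero
direct_db = 10 * math.log10(pcm / tc)
formula_db = extreme_gain_db(M, R)

# Closed form for M = 1024 and R = 1 bit per sample.
gain_db_1024 = extreme_gain_db(1024, 1.0)
print(direct_db, formula_db, gain_db_1024)
```

With the exact value of $20\log_{10}2$ the result for $M = 1024$ is slightly larger than the figure obtained with the rounded constant 6.02.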
5.3 Optimal Transform

Section 5.2 has established that the coding gain is dependent on the degree of energy compaction that the transform matrix $\mathbf{T}$ delivers. Is there a transform that is optimal in terms of having the best energy compaction capability or delivering the best coding gain?
5.3.1 Karhunen–Loeve Transform

To answer this question, let us go back to (5.2) to establish the following equation:
$$\mathbf{y}(n)\mathbf{y}^T(n) = \mathbf{T}\mathbf{x}(n)\mathbf{x}^T(n)\mathbf{T}^T. \tag{5.50}$$
Taking expected values on both sides, we obtain the covariance of the transform coefficients
$$\mathbf{R}_{yy} = \mathbf{T}\mathbf{R}_{xx}\mathbf{T}^T, \tag{5.51}$$
where
$$\mathbf{R}_{yy} = E\left[\mathbf{y}(n)\mathbf{y}^T(n)\right] \tag{5.52}$$
and $\mathbf{R}_{xx}$ is the covariance matrix of the source signal defined in (4.49). As noted there, it is symmetric and Toeplitz. By definition, the $k$th diagonal element of $\mathbf{R}_{yy}$ is the variance of the $k$th transform coefficient:
$$[\mathbf{R}_{yy}]_{kk} = E\left[y_k(n)y_k(n)\right] = \sigma^2_{y_k}, \tag{5.53}$$
so the geometric mean of $\{\sigma^2_{y_k}\}_{k=0}^{M-1}$ is determined by the product of the diagonal elements of $\mathbf{R}_{yy}$:
$$\prod_{k=0}^{M-1} \sigma^2_{y_k} = \prod_{k=0}^{M-1} [\mathbf{R}_{yy}]_{kk}. \tag{5.54}$$
It is well known that a covariance matrix is positive semidefinite, i.e., its eigenvalues are all real and nonnegative [33]. For practical signals, $\mathbf{R}_{xx}$ may be considered as positive definite (no zero eigenvalues). Due to (5.51), $\mathbf{R}_{yy}$ may be considered as positive definite as well, so the following inequality holds [93]:
$$\prod_{k=0}^{M-1} [\mathbf{R}_{yy}]_{kk} \ge \det\mathbf{R}_{yy} \tag{5.55}$$
with equality if and only if $\mathbf{R}_{yy}$ is diagonal. Due to (5.51), we have
$$\det\mathbf{R}_{yy} = \det\mathbf{T}\,\det\mathbf{R}_{xx}\,\det\mathbf{T}^T. \tag{5.56}$$
Taking the determinant of (5.9) gives
$$\det\mathbf{T}^T\,\det\mathbf{T} = \det\mathbf{I} = 1, \tag{5.57}$$
which leads to
$$|\det\mathbf{T}| = 1. \tag{5.58}$$
Dropping this back into (5.56), we obtain
$$\det\mathbf{R}_{yy} = \det\mathbf{R}_{xx}. \tag{5.59}$$
Consequently, the inequality in (5.55) becomes
$$\prod_{k=0}^{M-1} [\mathbf{R}_{yy}]_{kk} \ge \det\mathbf{R}_{xx}, \tag{5.60}$$
again, with equality if and only if $\mathbf{R}_{yy}$ is diagonal.

Since $\mathbf{R}_{xx}$ is completely determined by the statistical properties of the source signal, the right-hand side of (5.60) is a fixed value. Due to (5.51), however, we can adjust the transform matrix $\mathbf{T}$ to alter the value on the left-hand side of (5.60). The best we can achieve by doing this is to find a $\mathbf{T}$ that makes $\mathbf{R}_{yy}$ a diagonal matrix:
$$\mathbf{R}_{yy} = \mathbf{T}\mathbf{R}_{xx}\mathbf{T}^T = \mathrm{diag}\{\sigma^2_{y_0}, \sigma^2_{y_1}, \ldots, \sigma^2_{y_{M-1}}\} \tag{5.61}$$
so that the equality in (5.60) holds.

It is well known in matrix theory [28] that the matrix $\mathbf{T}$ which makes (5.61) hold is an orthonormal matrix whose rows are the orthonormal eigenvectors of the matrix $\mathbf{R}_{xx}$, and the diagonal elements $\{\sigma^2_{y_k}\}_{k=0}^{M-1}$ are the eigenvalues of $\mathbf{R}_{xx}$. Such a transform matrix is called the Karhunen–Loeve Transform (KLT) of the source signal $x(n)$, and the eigenvalues are the variances of its transform coefficients.
5.3.2 Maximal Coding Gain

With a Karhunen–Loeve transform matrix, the equality in (5.60) holds, which gives the minimum value for the geometric mean:
$$\text{Minimum:}\quad \left(\prod_{k=0}^{M-1} \sigma^2_{y_k}\right)^{1/M} = \left(\det\mathbf{R}_{xx}\right)^{1/M}. \tag{5.62}$$
Dropping this back into (5.36), we establish that the maximum coding gain of the optimal transform coder, for a given source signal $x(n)$, is
$$\text{Maximum:}\quad G_{\mathrm{TC}} = \frac{\sigma_x^2}{\left(\det\mathbf{R}_{xx}\right)^{1/M}}. \tag{5.63}$$
Note that this is made possible by deploying the Karhunen–Loeve transform, providing ample bits, and following the optimal bit allocation strategy.
The maximal coding gain is achieved by Karhunen–Loeve transform through diagonalizing the covariance matrix Rxx into Ryy . The diagonalized covariance matrix Ryy means that the transform coefficients that constitute the vector y are uncorrelated, so maximum coding gain or energy compaction is directly linked to decorrelation.
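For $M = 2$ the diagonalization can be carried out by hand, which makes a compact check of (5.51), (5.61), and (5.63). The sketch below assumes a unit-variance source with a hypothetical correlation coefficient of 0.9; for a symmetric 2 x 2 covariance matrix with equal diagonal entries, the orthonormal eigenvectors are the normalized sum and difference vectors.

```python
import math

rho = 0.9                       # hypothetical correlation coefficient R(1)/R(0)
var_x = 1.0                     # assumed source variance
Rxx = [[var_x, rho * var_x],
       [rho * var_x, var_x]]    # 2x2 covariance matrix (symmetric Toeplitz)

s = 1 / math.sqrt(2)
T = [[s,  s],
     [s, -s]]                   # rows are the orthonormal eigenvectors of Rxx

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

Ryy = matmul(matmul(T, Rxx), transpose(T))                  # (5.51)
det_Rxx = Rxx[0][0] * Rxx[1][1] - Rxx[0][1] * Rxx[1][0]
gain = var_x / math.sqrt(det_Rxx)                           # (5.63) with M = 2
print(Ryy)     # close to diag(1 + rho, 1 - rho)
print(gain)    # equals 1 / sqrt(1 - rho**2)
```

The off-diagonal entries of `Ryy` vanish, so the two coefficients are uncorrelated, and the gain grows as the correlation coefficient approaches one.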
5.3.3 Spectrum Flatness

The maximal coding gain discussed above is optimal for a given block size $M$. Typically, the maximal coding gain increases with the block size, approaching an upper limit when the block size goes to infinity. It can be shown [33] that transform coding is asymptotically optimal because this upper limit is equal to the theoretic upper limit predicted by rate-distortion theory [33]:
$$\lim_{M\to\infty} G_{\mathrm{TC}} = \frac{1}{\gamma_x^2}, \tag{5.64}$$
where $\gamma_x^2$ is the spectrum flatness measure
$$\gamma_x^2 = \frac{\exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S_{xx}(e^{j\omega})\,d\omega\right)}{\frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(e^{j\omega})\,d\omega} \tag{5.65}$$
of the source signal $x(n)$ whose power spectrum is $S_{xx}(e^{j\omega})$.
5.4 Suboptimal Transforms

Despite its optimality, the KLT is rarely used in practical applications. The first reason is that the KLT is signal-dependent: it is built from the covariance matrix of the source signal. This has a few ramifications. First, the KLT is only as good as the statistical signal model, but a good model is not always available. Second, signal statistics change with time in most applications. This calls for real-time calculation of the covariance matrix, eigenvectors, and eigenvalues, which is seldom feasible in practical applications, especially in the decoder. Even if the encoder is assigned to do the calculation, transmission of the eigenvectors to the decoder consumes a large number of bits, so it is not feasible for compression coding.

Even if the signal-dependent issues above are resolved and the associated eigenvectors and eigenvalues are available at our convenience, the calculation of the Karhunen–Loeve transform itself is still costly, especially with a large $M$, because both the transform in (5.2) and the inverse transform in (5.10) require on the order of $M^2$ calculations (multiplications and additions). This is unfavorable when compared with structured transforms, such as the DCT, whose structures are amenable to fast implementation algorithms that require calculations on the order of $M\log_2(M)$.

There are many structured and signal-independent transforms which can be considered as suboptimal in the sense that their performances approach that of the KLT when the block size is large. In fact, all sinusoidal orthogonal transforms are found to approach the performance of the KLT when the block size tends to infinity [74], including the discrete Fourier transform, DCTs, and discrete sine transforms.

With such sinusoidal orthogonal transforms, frequency is a key characteristic of the basis functions or vectors, so the transform coefficients are usually indexed by frequency. Consequently, they are often referred to as frequency coefficients and are considered as being in the frequency domain. The transforms may also be referred to as frequency transforms or time–frequency analysis.
5.4.1 Discrete Fourier Transform

The DFT (Discrete Fourier Transform) is the most prominent transform in this category of sinusoidal transforms. Its transform matrix is given by
$$\mathbf{T} = \mathbf{W}_M = \left[W_M^{kn}\right], \tag{5.66}$$
where
$$W_M = e^{-j2\pi/M} \tag{5.67}$$
so that
$$[\mathbf{T}]_{k,n} = W_M^{kn} = e^{-j2\pi kn/M}. \tag{5.68}$$
The matrix is unitary up to a scale factor, meaning
$$\left(\mathbf{W}_M^*\right)^T\mathbf{W}_M = M\mathbf{I}, \tag{5.69}$$
where $^*$ denotes complex conjugation, so its inverse is
$$\mathbf{W}_M^{-1} = \frac{1}{M}\left(\mathbf{W}_M^*\right)^T. \tag{5.70}$$
Note that the scale factor $M$ above is usually disregarded when discussing orthogonal transforms, because it can be adjusted either on the forward or backward transform side within a particular context. The matrix is also symmetric, that is,
$$\mathbf{W}_M^T = \mathbf{W}_M, \tag{5.71}$$
so (5.70) becomes
$$\mathbf{W}_M^{-1} = \frac{1}{M}\mathbf{W}_M^*. \tag{5.72}$$
The DFT is more commonly written as
$$y_k = \sum_{n=0}^{M-1} x(n)W_M^{kn} \tag{5.73}$$
and its inverse (IDFT) as
$$x(n) = \sum_{k=0}^{M-1} y_k W_M^{-kn}. \tag{5.74}$$
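The forward and inverse DFT can be sketched directly from the matrix definition. The code below is an $O(M^2)$ illustration rather than a fast algorithm; it assumes the convention $W_M = e^{-j2\pi/M}$ and applies the $1/M$ scale factor on the inverse side, which is one of the two placements the text allows. The function names and the test block are hypothetical.

```python
import cmath

def dft_matrix(m):
    """M x M DFT matrix with entries W_M^{kn} = exp(-j*2*pi*k*n/M)."""
    return [[cmath.exp(-2j * cmath.pi * k * n / m) for n in range(m)] for k in range(m)]

def dft(x):
    """Forward DFT as a matrix-vector product."""
    W = dft_matrix(len(x))
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def idft(y):
    """Inverse DFT; conjugating W gives W_M^{-kn}, and 1/M is applied here."""
    m = len(y)
    W = dft_matrix(m)
    return [sum(W[k][n].conjugate() * y[k] for k in range(m)) / m for n in range(m)]

x = [1.0, 2.0, 0.0, -1.0]
y = dft(x)
x_rec = idft(y)

# Check that the conjugate transpose of W times W equals M times the identity.
M = len(x)
W = dft_matrix(M)
gram = [[sum(W[k][i].conjugate() * W[k][j] for k in range(M)) for j in range(M)]
        for i in range(M)]
print(y, x_rec)
```

Even for this real input the coefficients `y` are complex, which is the second drawback of the DFT for coding discussed in this section.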
When the DFT is applied to a block of $M$ source samples, this block of $M$ samples is virtually extended periodically on both sides of the block boundary to infinity, as shown at the top of Fig. 5.2. This introduces sharp discontinuities at both boundaries. In order to accommodate these discontinuities, the DFT has to produce many large coefficients, especially at high frequencies. This causes a spread of energy, the opposite of energy compaction, so the DFT is not ideal for signal coding.

Fig. 5.2 Periodic boundaries of DFT (top), DCT-II (middle) and DCT-IV (bottom)
In addition, the DFT is a complex transform that produces complex transform coefficients even for real signals. This implies that the number of transform coefficients that need to be quantized and conveyed to the decoder is doubled. Due to the two drawbacks above, the DFT is seldom directly deployed in practical transform coders. It is, however, frequently deployed in many transform coders as a conduit for fast calculation of other transforms because of its good structure for fast algorithms and the abundance of such algorithms.
5.4.2 DCT

The DCT is a family of Fourier-related transforms that use only real transform coefficients [2, 80]. Since the Fourier transform of a real and even signal is real and even, a DCT operates on real data and is equivalent to a DFT of roughly twice the block length. Depending on how the beginning and ending block boundaries are handled, there are eight types of DCTs, and correspondingly eight types of DSTs, but only two of them are widely used.

5.4.2.1 Type-II DCT

The most prominent member of the DCT family is the type-II DCT given below:
$$\text{DCT-II:}\quad [\mathbf{C}]_{k,n} = c(k)\sqrt{\frac{2}{M}}\cos\left[(n + 0.5)\frac{k\pi}{M}\right], \tag{5.75}$$
where
$$c(k) = \begin{cases} \frac{1}{\sqrt{2}}, & \text{if } k = 0; \\ 1, & \text{otherwise.} \end{cases} \tag{5.76}$$
Its inverse transform is
$$\text{IDCT-II:}\quad [\mathbf{C}^T]_{n,k} = [\mathbf{C}]_{k,n}. \tag{5.77}$$
DCT-II tends to deliver the best energy compaction performance in the DCT and DST family. It achieves this mostly because it uses symmetric boundary conditions on both sides of its period, as shown in the middle of Fig. 5.2. In particular, DCT-II extends its boundaries symmetrically on both sides of a period, so the samples can be considered as periodic with a period of $2M$ and there is essentially no discontinuity at either boundary.

DCT-II for $M = 2$ was shown to be identical to the KLT for a first-order autoregression (AR) source [33]. Furthermore, the coding gain of DCT-II for other $M$ is shown to be very close to that of the KLT for such a source with a high correlation coefficient:
$$\rho = \frac{R(1)}{R(0)} \approx 1. \tag{5.78}$$
Similar results with real speech were also observed in [33].
Since many real-world signals can be modeled as such a source, DCT-II is deployed in many signal coding or processing applications and is sometimes simply called “the DCT” (its inverse is, of course, called “the inverse DCT” or “the IDCT”). Two-dimensional DCT-II, which shares these characteristics, has been deployed by many international image and video coding standards, such as JPEG [37], MPEG1&2&4 [54, 57, 58], and MPEG-4(AVC)/H.264 [61].
5.4.2.2 Type-IV DCT

Type-IV DCT is obtained by shifting the frequencies of the Type-II DCT in (5.75) by $\pi/2M$, so it has the following form:
$$\text{DCT-IV:}\quad [\mathbf{C}]_{k,n} = \sqrt{\frac{2}{M}}\cos\left[(n + 0.5)(k + 0.5)\frac{\pi}{M}\right]. \tag{5.79}$$
It is also its own inverse transform. Due to this frequency shifting, its right boundary is no longer smooth, as shown at the bottom of Fig. 5.2. Such a sharp discontinuity requires a lot of large transform coefficients to compensate, significantly degrading its energy compaction ability. So it is not as useful as DCT-II. However, it serves as a valuable building block for fast algorithms of DCT, MDCT and other cosine-modulated filter banks.
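The two DCT matrices can be built directly from (5.75), (5.76), and (5.79). The following sketch, with hypothetical helper names and an arbitrary block size of 8, checks that the DCT-II matrix is orthogonal, that the DCT-IV matrix is its own inverse, and how much of the energy of a smooth ramp the DCT-II packs into its first two coefficients.

```python
import math

def dct2_matrix(m):
    """Orthonormal DCT-II matrix built from (5.75) and (5.76)."""
    rows = []
    for k in range(m):
        c = 1 / math.sqrt(2) if k == 0 else 1.0
        rows.append([c * math.sqrt(2 / m) * math.cos((n + 0.5) * k * math.pi / m)
                     for n in range(m)])
    return rows

def dct4_matrix(m):
    """DCT-IV matrix built from (5.79)."""
    return [[math.sqrt(2 / m) * math.cos((n + 0.5) * (k + 0.5) * math.pi / m)
             for n in range(m)] for k in range(m)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def max_dev_from_identity(P):
    """Largest absolute difference between P and the identity matrix."""
    return max(abs(P[i][j] - (1.0 if i == j else 0.0))
               for i in range(len(P)) for j in range(len(P)))

M = 8
C2 = dct2_matrix(M)
C4 = dct4_matrix(M)
err2 = max_dev_from_identity(matmul(transpose(C2), C2))   # C^T C = I, so the inverse is C^T
err4 = max_dev_from_identity(matmul(C4, C4))              # DCT-IV applied twice gives I

# Energy compaction of the DCT-II on a smooth ramp (illustrative input).
ramp = [float(n) for n in range(M)]
coeffs = [sum(c * v for c, v in zip(row, ramp)) for row in C2]
share = sum(c * c for c in coeffs[:2]) / sum(c * c for c in coeffs)
print(err2, err4, share)
```

For this ramp, more than 99% of the block energy ends up in the first two DCT-II coefficients.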
Chapter 6
Subband Coding
Transform coders artificially divide a source signal into blocks, then process and code each block independently of the others. This leads to significant variations between blocks, which may become visible or audible as discontinuities at the block boundaries. Referred to as blocking artifacts or blocking effects, these artifacts may appear as “tiles” in decoded images or videos that were coded at low bit rates. In audio, blocking artifacts sound like periodic “clicking”, which many people find annoying. While the human eye can tolerate a large degree of blocking artifacts, the human ear is extremely intolerant of such periodic “clicking”. Therefore, transform coding is rarely deployed in audio coding.

One approach to avoiding blocking artifacts is to introduce overlapping between blocks. For example, a transform coder may be structured in such a way that it advances only 90% of a block, so that there is a 10% overlap between blocks. Reconstructed samples in the overlapped region can be averaged to smooth out the discontinuities between blocks, thus avoiding the blocking artifacts.

The cost for this overlapping is reduced coding performance. Since the number of transform coefficients is equal to the number of source samples in a block, the number of samples overlapping between blocks is the number of extra transform coefficients that need to be quantized and conveyed to the decoder. They consume extra valuable bit resources, thus degrading coding performance.

What is really needed is overlapping in the temporal domain, but no overlapping in the transform or frequency domain. This means the transform matrix $\mathbf{T}$ in (5.2) is no longer $M \times M$, but $M \times N$, where $N > M$. This leads to subband coding (SBC). For example, application of this idea to the DCT leads to the modified discrete cosine transform (MDCT), which has 50% overlapping in the temporal domain, i.e., its transform matrix $\mathbf{T}$ is $M \times 2M$.
Y. You, Audio Coding: Theory and Applications, DOI 10.1007/978-1-4419-1754-6_6, © Springer Science+Business Media, LLC 2010

6.1 Subband Filtering

Subband coding is based on the decomposition of a source signal into subband samples through a filter bank. This decomposition can be considered as an extension of a transform where the filter bank corresponds to the transform matrix and the subband samples to the transform coefficients. This extension is based on a new perspective on transforms, which views the multiplication of a transform matrix with the source vector as filtering of the source vector by a bank of subband filters whose impulse responses are the row vectors of the transform matrix. Allowing these subband filters to be longer than the block size of the transform enables a dramatic improvement in energy compaction capability, in addition to the elimination of blocking artifacts.
6.1.1 Transform Viewed as Filter Bank

To extend from transform coding to subband coding, let us consider the transform expressed in (5.2). The operations involved in such a transform to obtain the $k$th component of the transform coefficient vector $\mathbf{y}(n)$ from the source vector $\mathbf{x}(n)$ may be written as
$$y_k(n) = \mathbf{t}_k^T\mathbf{x}(n) = \sum_{m=0}^{M-1} t_{k,m}\,x_m(n) = \sum_{m=0}^{M-1} t_{k,m}\,x(nM - m), \tag{6.1}$$
where the last step is obtained via (5.1). The above operation obviously can be considered as filtering the source signal $x(n)$ by a filter with an impulse response given by the $k$th basis vector or basis function $\mathbf{t}_k$. Consequently, the whole transform may be considered as filtering by a bank of filters, called analysis filter banks, with impulse responses given by the row vectors of the transform matrix $\mathbf{T}$. This is shown in Fig. 6.1.

Similarly, the inverse transform in (5.10) may be considered as the output from a bank of filters, called synthesis filter banks, with impulse responses given by the row vectors of the inverse matrix $\mathbf{T}^T$. This is also shown in Fig. 6.1.

Fig. 6.1 Analysis and synthesis filter banks

For the suboptimal sinusoidal transforms discussed in Chap. 5, each of the filters in either the analysis or synthesis bank is associated with a basis vector or basis function which corresponds to a specific frequency, so it deals with components of the source signal associated with that frequency. Such filters are usually band-pass and decompose the frequency into small bands, called subbands, so such filters are called subband filters and the decomposed signal components $\{y_k\}_{k=0}^{M-1}$ are called subband samples.
6.1.2 DFT Filter Bank

To illustrate the power of this new perspective on transforms, let us consider a simple analysis filter bank shown in Fig. 6.2, which is built using the inverse DFT matrix given in (5.70). The delay chain in the analysis bank consists of $M-1$ delay units $z^{-1}$ connected together in series. As the source signal $x(n)$ passes through it, a bank of signals is extracted:
$$u_k(n) = x(n - k), \quad k = 0, 1, \ldots, M-1. \tag{6.2}$$
This allows $M$ samples from the source signal to be presented simultaneously to the transform matrix. Due to (5.67), except for the scale factor of $1/M$, the subband sample for the $k$th subband is the output from the $k$th subband filter and is given as
$$y_k(n) = \sum_{m=0}^{M-1} u_m(n)W_M^{-km}, \quad k = 0, 1, \ldots, M-1, \tag{6.3}$$
which is essentially the inverse DFT in (5.74). Due to (6.2), it becomes
$$y_k(n) = \sum_{m=0}^{M-1} x(n - m)W_M^{-km}. \tag{6.4}$$

Fig. 6.2 DFT analysis filter banks
Its Z-transform is
$$Y_k(z) = \sum_{m=0}^{M-1} X(z)z^{-m}W_M^{-km} = X(z)\sum_{m=0}^{M-1}\left(zW_M^k\right)^{-m}, \tag{6.5}$$
so the transfer function for the $k$th subband filter is
$$H_k(z) = \frac{Y_k(z)}{X(z)} = \sum_{m=0}^{M-1}\left(zW_M^k\right)^{-m}. \tag{6.6}$$
Since the transfer function for the zeroth subband is
$$H(z) = \sum_{m=0}^{M-1} z^{-m}, \tag{6.7}$$
the transfer functions for all other subbands may be represented by it:
$$H_k(z) = H\left(zW_M^k\right). \tag{6.8}$$
Its frequency response is
$$H_k(e^{j\omega}) = H\left(e^{j\left(\omega - \frac{2\pi k}{M}\right)}\right), \tag{6.9}$$
which is $H(e^{j\omega})$ uniformly shifted by $\frac{2\pi k}{M}$ in the frequency domain. Therefore, $H(z)$ is called the prototype filter and all other subband filters in the DFT bank are built by uniformly shifting or modulating the prototype filter. A filter bank with such a structure is called a modulated filter bank. It is a prominent category of filter banks which are most notably amenable to fast implementation.

The magnitude response of the prototype filter (6.7) is
$$\left|H(e^{j\omega})\right| = \left|\frac{\sin\frac{M\omega}{2}}{\sin\frac{\omega}{2}}\right|, \tag{6.10}$$
which is shown at the top of Fig. 6.3 for $M = 8$. According to (6.9), all other subband filters in the bank are shifted or modulated versions of it and are shown at the bottom of Fig. 6.3.
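The prototype response can be evaluated numerically to see how large its first sidelobe is. The sketch below computes the magnitude both from the sum in (6.7) and from the closed form (6.10) for $M = 8$, then searches a grid between the first two nulls at $2\pi/M$ and $4\pi/M$; the grid density and the function names are arbitrary choices for this illustration.

```python
import cmath
import math

def prototype_response(omega, m):
    """|H(e^{j*omega})| computed directly from the sum in (6.7)."""
    return abs(sum(cmath.exp(-1j * omega * k) for k in range(m)))

def closed_form(omega, m):
    """Closed-form magnitude (6.10); valid where sin(omega / 2) is non-zero."""
    return abs(math.sin(m * omega / 2) / math.sin(omega / 2))

M = 8
null = 2 * math.pi / M
# Grid strictly between the first null (2*pi/M) and the second null (4*pi/M).
grid = [null + i * null / 1000 for i in range(1, 1000)]

peak = max(prototype_response(w, M) for w in grid)
sidelobe_db = 20 * math.log10(peak / M)     # relative to the main-lobe peak H(1) = M
mismatch = max(abs(prototype_response(w, M) - closed_form(w, M)) for w in grid)
print(sidelobe_db, mismatch)
```

The first sidelobe comes out roughly 13 dB below the main lobe, which illustrates how much energy leaks from one subband into its neighbors.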
6.1.3 General Filter Banks

Viewed from the new perspective of subband filtering, the DFT apparently has rather inferior energy compaction capability: the subband filters have wide transition bands and their stopband attenuation is only about 13 dB. A significant amount of energy in one subband is spilled into other subbands, appearing as ripples in Fig. 6.3.

Fig. 6.3 Magnitude responses of a DFT analysis filter bank with $M = 8$ (amplitude in dB versus $\omega/2\pi$). The top shows the prototype filter $H_0$. All subband filters $H_0$ to $H_7$ in the bank are uniformly shifted or modulated versions of it and are shown at the bottom

Fig. 6.4 Ideal bandpass subband filters for $M = 8$
While other transforms, such as KLT and DCT, may have better energy compacting capability than the DFT, they are eventually limited by M , the number of samples in the block. It is well known in filter design that a sharp magnitude response with less energy leakage requires long filters [67], so the fundamental limiting factor is the block size M of these transforms or subband filters. To obtain even better performance, subband filters longer than M need to be used. To maximize energy compaction, subband filters should have the magnitude response of an ideal bandpass filter shown in Fig. 6.4, which has no energy leakage at all and achieves the maximum coding gain (to be proved later). Unfortunately, such a bandpass filter requires an infinite filter order [68], which is very difficult to implement in a practical system, so the challenge is to design subband filters that can optimize coding gain for a given limited order.
Once the order of the subband filters is extended beyond $M$, overlapping between blocks occurs, providing an additional benefit of mitigating the blocking artifacts discussed at the beginning of this chapter. It is clear that transforms are a special type of filter bank whose subband filters have order less than $M$. Their main characteristic is that there is no overlapping between transform blocks. Therefore, transforms are frequently referred to as filter banks and transform coefficients as subband samples. On the other hand, filter banks with subband filters longer than $M$ are also sometimes referred to as transforms or lapped transforms in the literature [49]. One such example is the modified discrete cosine transform (MDCT), whose subband filters are twice as long as the block size.
6.2 Subband Coder

When the filter bank in Figs. 6.1 and 6.2 is directly used for subband coding, there is an immediate obstacle: an $M$-fold increase in the number of samples to be coded, because the analysis bank generates $M$ subband samples for each source sample. This problem may be resolved, as shown in Fig. 6.5, by $M$-fold decimation in the analysis bank to make the total number of subband samples equal to that of the source block, followed by $M$-fold expansion in the synthesis bank to recover the sample rate of the subband samples back to the original sample rate of the source signal.

Fig. 6.5 Maximally decimated filter bank and subband coder. The $\downarrow M$ denotes $M$-fold decimation and $\uparrow M$ $M$-fold expansion. The additive noise model is used to represent quantization in each subband

An $M$-fold decimator, also referred to as a downsampler or sample rate compressor, discards $M-1$ samples for each block of $M$ input samples and retains only one sample for output:
$$x_D(n) = x(Mn), \tag{6.11}$$
where $x(n)$ is the source sequence and $x_D(n)$ is the decimated sequence. Due to the loss of $M-1$ samples incurred in decimation, it may not be possible to recover $x(n)$ from the decimated $x_D(n)$ because of aliasing [65, 68]. When applied to the analysis bank in Fig. 6.5, the decimator reduces the sample rate of each subband to $1/M$ of that of the source. Since there are $M$ subbands, the total sample rate for all subbands is still the same as that of the source signal.

The $M$-fold expander, also referred to as an upsampler or interpolator, passes each source sample through to the output and, after each, inserts $M-1$ zeros into the output:
$$x_E(n) = \begin{cases} x(n/M), & \text{if } n/M \text{ is an integer;} \\ 0, & \text{otherwise.} \end{cases} \tag{6.12}$$
Since all samples from the input are passed through to the output, there is obviously no loss of information. For example, the source can be recovered from the expanded output by an $M$-fold decimator. As explained in Sect. 6.3, however, expansion causes images in the spectrum, which need to be handled accordingly. When applied to the synthesis bank in Fig. 6.5, the expander for each subband recovers the sample rate of each subband back to the original sample rate of the source signal. It is then possible to output a reconstructed sequence at the same sample rate as the source signal.

Coding in the subband domain, or subband coding, is accomplished by attaching a quantizer to the output of each subband filter in the analysis filter bank and a corresponding inverse quantizer to the input of each subband filter in the synthesis filter bank. The abstraction of this quantization and inverse quantization is the additive noise model in Fig. 2.3, which is deployed in Fig. 6.5.

The filter bank above has a special characteristic: its sample rate for each subband is $1/M$ of that of the source. This happens because the decimation factor is equal to the number of subbands. Such a subband system is called a maximally decimated or critically sampled filter bank.
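The decimator (6.11) and expander (6.12) are simple enough to sketch on finite lists. The function names and the test sequence below are hypothetical; the lists stand in for sequences that start at index zero.

```python
def decimate(x, m):
    """M-fold decimator (6.11): keep every Mth sample, starting with index 0."""
    return x[::m]

def expand(x, m):
    """M-fold expander (6.12): insert M - 1 zeros after each input sample."""
    out = []
    for v in x:
        out.append(v)
        out.extend([0] * (m - 1))
    return out

x = [1, 2, 3, 4, 5, 6, 7, 8]
M = 4
x_d = decimate(x, M)             # only x(0) and x(4) survive
x_e = expand(x, M)               # M times as many samples, non-zero at multiples of M
round_trip = decimate(x_e, M)    # expansion followed by decimation returns x
lossy = expand(x_d, M)           # decimation followed by expansion does not return x
print(x_d, round_trip, lossy)
```

Expansion followed by decimation is lossless, while the reverse order zeroes out the $M-1$ discarded samples of each block, which is why the analysis and synthesis filters have to be designed to compensate.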
6.3 Reconstruction Error
When a signal moves through a maximally decimated subband system, it is altered by analysis filters, decimators, quantizers, inverse quantizers, expanders, and synthesis filters. The combined effect of these alterations may lead to reconstruction error at the output of the synthesis bank:
\[
e(n) = \hat{x}(n) - k\,x(n-d), \tag{6.13}
\]
where k is a scale factor and d is a delay. For subband coding, this reconstruction error needs to be either exactly or approximately zero, in the absence of quantization, so that the reconstructed signal is a delayed and scaled version of the source signal. This section analyzes decimation and expansion effects to arrive at conditions on analysis and synthesis filters that guarantee zero reconstruction error.
6.3.1 Decimation Effects
Before considering the problem of decimation effects, let us first consider the following geometric series:
\[
p_M(n) = \frac{1}{M}\sum_{m=0}^{M-1} e^{j\frac{2\pi mn}{M}}. \tag{6.14}
\]
When $n$ is a multiple of $M$, every term of the sum equals one because $e^{j2\pi m} = 1$, so the above equation becomes
\[
p_M(n) = 1. \tag{6.15}
\]
For other values of $n$, the equation becomes
\[
p_M(n) = \frac{1}{M}\,\frac{e^{j\frac{2\pi n}{M}M} - 1}{e^{j\frac{2\pi n}{M}} - 1} = 0, \tag{6.16}
\]
due to the formula for geometric series and $e^{j2\pi n} = 1$. Therefore, the geometric series (6.14) becomes
\[
p_M(n) = \begin{cases} 1, & \text{if } n \text{ is a multiple of } M; \\ 0, & \text{otherwise.} \end{cases} \tag{6.17}
\]
Now let us consider the Z-transform of the decimated sequence in (6.11):
\[
X_D(z) = \sum_{n=-\infty}^{\infty} x_D(n) z^{-n} = \sum_{n=-\infty}^{\infty} x(Mn) z^{-n}. \tag{6.18}
\]
Due to the upper half of (6.17), we can multiply the right-hand side of the above equation with $p_M(nM)$ to get
\[
X_D(z) = \sum_{n=-\infty}^{\infty} p_M(nM)\, x(Mn)\, z^{-n}. \tag{6.19}
\]
Due to the lower half of (6.17), we can do a variable change of $m = nM$ to get
\[
X_D(z) = \sum_{m=-\infty}^{\infty} p_M(m)\, x(m)\, z^{-m/M}, \tag{6.20}
\]
where $m$ takes on integer values at an increment of one. Dropping in (6.14), we have
\[
X_D(z) = \frac{1}{M}\sum_{m=-\infty}^{\infty} x(m) \sum_{k=0}^{M-1} \left(z^{1/M} e^{-j\frac{2\pi k}{M}}\right)^{-m}. \tag{6.21}
\]
Due to (5.67), the equation above becomes
\[
X_D(z) = \frac{1}{M}\sum_{m=-\infty}^{\infty} x(m) \sum_{k=0}^{M-1} \left(z^{1/M} W_M^{k}\right)^{-m}. \tag{6.22}
\]
Letting $X(z)$ denote the Z-transform of $x(m)$, the equation above can be written as
\[
X_D(z) = \frac{1}{M}\sum_{k=0}^{M-1} X\!\left(z^{1/M} W_M^{k}\right). \tag{6.23}
\]
The Fourier transform of (6.23) is
\[
X_D(e^{j\omega}) = \frac{1}{M}\sum_{k=0}^{M-1} X\!\left(e^{j\frac{\omega - 2\pi k}{M}}\right), \tag{6.24}
\]
which can be interpreted as
1. Stretch $X(e^{j\omega})$ by a factor of $M$.
2. Create $M-1$ aliasing copies and shift them by $2\pi k$, respectively.
3. Add all shifted aliasing copies obtained in step 2 to the stretched copy in step 1.
4. Divide the sum above by $M$.
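The four steps above can be checked numerically on the DFT grid. The sketch below assumes a finite test sequence whose length $N$ is a multiple of $M$, so that sampling (6.24) at $\omega = 2\pi k/(N/M)$ turns the shifted copies into the DFT bins $X[k - lN/M]$ taken modulo $N$. The helper `dft` and the test sequence are illustrative, not part of the text.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT: X[k] = sum_n x[n] exp(-j 2 pi k n / N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


M = 4
x = [float((3 * n * n + n) % 7) for n in range(16)]   # arbitrary test sequence
N = len(x)
L = N // M                                            # length after decimation

X = dft(x)
X_D = dft(x[::M])                                     # spectrum of the decimated signal

# Right-hand side of (6.24) on the DFT grid: average of the M shifted copies of X.
X_alias = [sum(X[(k - l * L) % N] for l in range(M)) / M for k in range(L)]

max_err = max(abs(a - b) for a, b in zip(X_D, X_alias))
```

Here `max_err` is at the level of floating-point rounding, which confirms that the decimated spectrum is the average of the stretched spectrum and its shifted copies.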
As an example, let us consider the prototype filter $H(z)$ in (6.7) as the Z-transform of a regular signal. Its time-domain representation is obviously as follows:
\[
x(n) = \begin{cases} 1, & 0 \le n < M; \\ 0, & \text{otherwise.} \end{cases} \tag{6.25}
\]
Letting $M = 8$, we now examine the effect of eightfold decimation on this signal. Its Fourier transform is given in (6.10) and shown at the top of Fig. 6.3. The stretched $H(e^{j\omega/M})$ and all its shifted aliasing copies are shown at the top of Fig. 6.6. Due to the stretching factor of $M = 8$, their period is no longer $2\pi$, but is stretched to $8 \times 2\pi$, which is the frequency range covered by Fig. 6.6. The Fourier transform of the decimated signal is shown at the bottom of Fig. 6.6, whose period is $2\pi$ as required by the Fourier transform of a sequence. Due to the overlapping of the stretched spectrum with its shifted aliasing copies and the subsequent mutual cancellation, the spectrum of the decimated signal is totally different from that of the original source signal shown in Fig. 6.3, so we cannot recover the original signal from its decimated version.

One approach to avoiding aliasing is to band-limit the source signal to $|\omega| < \pi/M$, in accordance with Nyquist's sampling theorem [65]. Due to the stretching factor of $M$, the stretched spectrum is then band-limited to $|\omega| < \pi$. Since its shifted copies are placed at intervals of $2\pi$, there is no overlapping between the original and the aliasing copies. The aliasing copies can be removed by an ideal low-pass filter, leaving only the original copy.
Fig. 6.6 Stretched spectrum of the source signal and all its shifted aliasing copies $H_0$ to $H_7$ (top panel, "8-Fold Stretched Spectrum and its Shifted Copies"). Due to the stretching factor of $M = 8$, their period is also stretched from $2\pi$ to $8 \times 2\pi$. Spectrum of the decimated signal (bottom panel, "8-Fold Decimated Signal") has a period of $2\pi$, but is totally different from that of the original signal. Both panels plot amplitude (dB) against $\omega/2\pi$
The approach above is not the only one for aliasing-free decimation; see [82] for details. However, aliasing-free decimation is not the goal of filter bank design. A certain amount of aliasing is usually allowed, provided it is introduced in a controlled way. As long as the aliasing components from all decimators in the filter bank cancel each other completely at the output of the synthesis bank, the reconstructed signal is still aliasing free. Even if aliasing cannot be canceled completely, proper filter bank design can still keep it small enough to obtain a reconstruction with tolerable error.
6.3.2 Expansion Effects
To see the consequence of expansion, let us consider the Z-transform of the expanded signal $x_E(n)$:
\[
X_E(z) = \sum_{n=-\infty}^{\infty} x_E(n) z^{-n}. \tag{6.26}
\]
Due to (6.12), $x_E(n)$ is nonzero only when $n$ is a multiple of $M$: $n = kM$, where $k$ is an integer. Replacing $n$ with $kM$ in the above equation, we have
\[
X_E(z) = \sum_{k=-\infty}^{\infty} x_E(kM) z^{-kM}. \tag{6.27}
\]
Due to the upper half of (6.12), $x_E(kM) = x(k)$, so we have
\[
X_E(z) = \sum_{k=-\infty}^{\infty} x(k) z^{-kM} = X\!\left(z^{M}\right). \tag{6.28}
\]
Its Fourier transform is
\[
X_E(e^{j\omega}) = X\!\left(e^{jM\omega}\right), \tag{6.29}
\]
which is an $M$-fold compressed version of $X(e^{j\omega})$. In other words, the effect of sample rate expansion is frequency compression. As an example, the signal in (6.25), whose spectrum is shown at the top of Fig. 6.7, is eightfold expanded to give an output signal whose spectrum is shown at the bottom of Fig. 6.7. Due to the compression of frequency by a factor of 8, seven images are shifted into the $[0, 2\pi]$ region from outside.
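Frequency compression can also be observed numerically. The sketch below assumes a short test sequence of length $N$: by (6.29), the $NM$-point DFT of the expanded sequence should consist of $M$ back-to-back repetitions of the $N$-point DFT of the input, that is, the original spectrum followed by $M-1$ images. The helper `dft` and the sample values are illustrative.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT: X[k] = sum_n x[n] exp(-j 2 pi k n / N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


M = 3
x = [1.0, -2.0, 0.5, 4.0]                 # arbitrary test sequence

x_e = []
for sample in x:                          # M-fold expansion, as in (6.12)
    x_e.extend([sample] + [0.0] * (M - 1))

X = dft(x)
X_E = dft(x_e)

# (6.29) on the DFT grid: bin k of the expanded signal equals bin (k mod N) of the input.
max_err = max(abs(X_E[k] - X[k % len(x)]) for k in range(len(x_e)))
```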
Fig. 6.7 The effect of sample rate expansion is frequency compression. The spectrum of the source signal (top panel, "Input Signal") is compressed by a factor of 8 to produce the spectrum of the expanded signal (bottom panel, "8-Fold Expanded Signal"). Seven images are shifted into the $[0, 2\pi]$ region from outside due to this frequency compression. Both panels plot amplitude (dB) against $\omega/2\pi$
6.3.3 Reconstruction Error
Let us now consider the reconstruction error of the subband system in Fig. 6.5 in the absence of quantization. Due to (6.23), each subband signal after decimation is
\[
Y_k(z) = \frac{1}{M}\sum_{m=0}^{M-1} H_k\!\left(z^{1/M} W_M^{m}\right) X\!\left(z^{1/M} W_M^{m}\right). \tag{6.30}
\]
Due to (6.28), the reconstructed signal is
\[
\begin{aligned}
\hat{X}(z) &= \sum_{k=0}^{M-1} F_k(z)\, Y_k\!\left(z^{M}\right) \\
&= \frac{1}{M}\sum_{k=0}^{M-1}\sum_{m=0}^{M-1} F_k(z)\, H_k\!\left(zW_M^{m}\right) X\!\left(zW_M^{m}\right) \\
&= \frac{1}{M}\sum_{m=0}^{M-1} X\!\left(zW_M^{m}\right) \sum_{k=0}^{M-1} F_k(z)\, H_k\!\left(zW_M^{m}\right) \\
&= \frac{1}{M} X(z) \sum_{k=0}^{M-1} F_k(z) H_k(z) + \frac{1}{M}\sum_{m=1}^{M-1} X\!\left(zW_M^{m}\right) \sum_{k=0}^{M-1} F_k(z)\, H_k\!\left(zW_M^{m}\right).
\end{aligned} \tag{6.31}
\]
Defining the overall transfer function as
\[
T(z) = \frac{1}{M}\sum_{k=0}^{M-1} F_k(z) H_k(z) \tag{6.32}
\]
and the aliasing transfer function as
\[
A_m(z) = \frac{1}{M}\sum_{k=0}^{M-1} F_k(z)\, H_k\!\left(zW_M^{m}\right), \quad m = 1, 2, \ldots, M-1, \tag{6.33}
\]
the reconstructed signal is
\[
\hat{X}(z) = T(z) X(z) + \sum_{m=1}^{M-1} X\!\left(zW_M^{m}\right) A_m(z). \tag{6.34}
\]
Note that $T(z)$ is also the overall transfer function of the filter bank in the absence of both the decimators and expanders. Since $X(zW_M^{m})$ is the shifted version of the source signal, the reconstructed signal may be considered as a linear combination of the source signal $X(z)$ and its shifted aliasing versions.
To set the reconstruction error (6.13) to zero, the overall transfer function should be set to a delay and scale factor:
\[
T(z) = k z^{-d} \tag{6.35}
\]
and the total aliasing effect to zero:
\[
\sum_{m=1}^{M-1} X\!\left(zW_M^{m}\right) A_m(z) = 0. \tag{6.36}
\]
If a subband system produces no reconstruction error, it is called a perfect reconstruction (PR) system. If there is reconstruction error, but it is limited and approximately zero, it is called a near-perfect reconstruction or nonperfect reconstruction (NPR) system. For subband coding, PR is desirable and NPR is the minimal requirement.
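To make the two conditions concrete, the sketch below evaluates $T(z)$ in (6.32) and $A_m(z)$ in (6.33) for FIR filters given as coefficient lists in powers of $z^{-1}$. It is tried on the two-band Haar filter bank, which is an assumed example that does not appear in the text; the function names are illustrative and all filters are assumed to have the same length. The modulation uses $W_M = e^{-j2\pi/M}$, so that $H(zW_M^m)$ has coefficients $h(n)W_M^{-mn}$.

```python
import cmath


def polymul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0j] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def modulate(h, m, M):
    """Coefficients of H(z W_M^m): h(n) W_M^(-m n) with W_M = exp(-j 2 pi / M)."""
    return [hn * cmath.exp(2j * cmath.pi * m * n / M) for n, hn in enumerate(h)]


def transfer_functions(H, F, M):
    """Return T(z) of (6.32) and the list [A_1(z), ..., A_{M-1}(z)] of (6.33)."""
    def combine(m):
        total = None
        for hk, fk in zip(H, F):
            term = polymul(fk, modulate(hk, m, M))
            total = term if total is None else [t + u for t, u in zip(total, term)]
        return [c / M for c in total]
    return combine(0), [combine(m) for m in range(1, M)]


r = 2 ** -0.5
H = [[r, r], [r, -r]]      # Haar analysis filters H_0(z), H_1(z) (assumed example)
F = [[r, r], [-r, r]]      # synthesis filters F_0(z) = H_1(-z), F_1(z) = -H_0(-z)
T, A = transfer_functions(H, F, 2)
```

For this pair, `T` comes out as the coefficients of $z^{-1}$ (a pure one-sample delay with $k = 1$, $d = 1$) and `A[0]` is zero, so both (6.35) and (6.36) hold.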
6.4 Polyphase Implementation
While the total sample rate of all subbands within a maximally decimated filter bank is made equal to that of the source signal through decimation and expansion, waste of computation is still an issue. To illustrate this issue, let us look at the output of the decimators in Fig. 6.5. Each decimator keeps only one out of every $M$ subband samples and discards the other $M-1$, so the subband filtering that generates the discarded $M-1$ subband samples is a total waste of computational resources and thus should be eliminated. This is achieved using the polyphase representation of subband filters and the noble identities.
6.4.1 Polyphase Representation Polyphase representation is an important advancement in the theory of filter banks that greatly simplifies the implementation structures of both analysis and synthesis banks [3, 94].
6.4.1.1 Type-I Polyphase Representation
For any given integer $M$, an FIR or IIR filter given below
\[
H(z) = \sum_{n=-\infty}^{\infty} h(n) z^{-n} \tag{6.37}
\]
can always be written as
\[
\begin{aligned}
H(z) ={}& \sum_{n=-\infty}^{\infty} h(nM)\, z^{-nM} \\
&+ z^{-1} \sum_{n=-\infty}^{\infty} h(nM+1)\, z^{-nM} \\
&\;\;\vdots \\
&+ z^{-(M-1)} \sum_{n=-\infty}^{\infty} h(nM+M-1)\, z^{-nM}.
\end{aligned} \tag{6.38}
\]
Denoting
\[
p_k(n) = h(nM+k), \quad 0 \le k < M, \tag{6.39}
\]
which is called a type-I polyphase component of $h(n)$, and its Z-transform
\[
P_k(z) = \sum_{n=-\infty}^{\infty} p_k(n) z^{-n}, \quad 0 \le k < M, \tag{6.40}
\]
(6.38) may be written as
\[
H(z) = \sum_{k=0}^{M-1} z^{-k} P_k\!\left(z^{M}\right). \tag{6.41}
\]
The equation above is called the type-I polyphase representation of $H(z)$ with respect to $M$ and its implementation is shown in Fig. 6.8.

Fig. 6.8 Type-I polyphase implementation of an arbitrary filter

The type-I polyphase representation in (6.41) may be further written as
\[
H(z) = \mathbf{p}^T\!\left(z^{M}\right) \mathbf{d}(z), \tag{6.42}
\]
where
\[
\mathbf{d}(z) = \left[1, z^{-1}, \ldots, z^{-(M-1)}\right]^T \tag{6.43}
\]
is the delay chain and
\[
\mathbf{p}(z) = \left[P_0(z), P_1(z), \ldots, P_{M-1}(z)\right]^T \tag{6.44}
\]
is the type-I polyphase (component) vector.

The type-I polyphase representation of an arbitrary filter may be used to implement the analysis filter bank in Fig. 6.5. Using (6.42), the $k$th subband filter $H_k(z)$ may be written as
\[
H_k(z) = \mathbf{h}_k^T\!\left(z^{M}\right) \mathbf{d}(z), \quad 0 \le k < M, \tag{6.45}
\]
where
\[
\mathbf{h}_k(z) = \left[h_{k,0}(z), h_{k,1}(z), \ldots, h_{k,M-1}(z)\right]^T \tag{6.46}
\]
are the type-I polyphase components of $H_k(z)$. The analysis bank may then be represented by
\[
\mathbf{h}(z) =
\begin{bmatrix} H_0(z) \\ H_1(z) \\ \vdots \\ H_{M-1}(z) \end{bmatrix}
=
\begin{bmatrix} \mathbf{h}_0^T\!\left(z^{M}\right)\mathbf{d}(z) \\ \mathbf{h}_1^T\!\left(z^{M}\right)\mathbf{d}(z) \\ \vdots \\ \mathbf{h}_{M-1}^T\!\left(z^{M}\right)\mathbf{d}(z) \end{bmatrix}
= \mathbf{H}\!\left(z^{M}\right) \mathbf{d}(z), \tag{6.47}
\]
where
\[
\mathbf{H}(z) =
\begin{bmatrix} \mathbf{h}_0^T(z) \\ \mathbf{h}_1^T(z) \\ \vdots \\ \mathbf{h}_{M-1}^T(z) \end{bmatrix} \tag{6.48}
\]
is called a polyphase (component) matrix. This leads to the type-I polyphase implementation in Fig. 6.9 for the analysis filter bank in Fig. 6.5.

Fig. 6.9 Type-I polyphase implementation of a maximally decimated analysis filter bank
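For a causal FIR filter stored as a coefficient list, the type-I polyphase components in (6.39) are simply strided slices of the impulse response. The sketch below, with illustrative function names and an arbitrary test filter, extracts them and checks (6.41) at one test point $z$ off the unit circle.

```python
import cmath


def polyphase_type1(h, M):
    """Type-I polyphase components p_k(n) = h(nM + k), k = 0, ..., M - 1, as in (6.39)."""
    return [h[k::M] for k in range(M)]


def evaluate(coeffs, z):
    """Evaluate sum_n coeffs[n] z^(-n) at a complex point z."""
    return sum(c * z ** (-n) for n, c in enumerate(coeffs))


h = [0.1, 0.4, -0.2, 0.7, 0.3, -0.5, 0.25]   # arbitrary FIR filter
M = 3
z = 0.9 * cmath.exp(0.7j)                    # arbitrary test point

direct = evaluate(h, z)                      # H(z) from (6.37)
via_polyphase = sum(z ** (-k) * evaluate(p, z ** M)
                    for k, p in enumerate(polyphase_type1(h, M)))   # (6.41)
```

The two values agree to rounding error, and the filter length does not need to be a multiple of $M$.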
6.4.1.2 Type-II Polyphase Representation
The type-II polyphase representation of a general filter $H(z)$ with respect to $M$ may be obtained from (6.41) through a variable change of $k = M-1-n$:
\[
H(z) = \sum_{n=0}^{M-1} z^{-(M-1-n)} P_{M-1-n}\!\left(z^{M}\right) \tag{6.49}
\]
\[
\phantom{H(z)} = \sum_{n=0}^{M-1} z^{-(M-1-n)} Q_n\!\left(z^{M}\right), \tag{6.50}
\]
where
\[
Q_n(z) = P_{M-1-n}(z) \tag{6.51}
\]
is a permutation of $P_n(z)$. Figure 6.10 shows the type-II polyphase implementation of an arbitrary filter.

Fig. 6.10 Type-II polyphase implementation of an arbitrary filter

The type-II polyphase representation in (6.50) may be rewritten as
\[
H(z) = z^{-(M-1)} \sum_{n=0}^{M-1} z^{n} Q_n\!\left(z^{M}\right), \tag{6.52}
\]
so it can be expressed in vector form
\[
H(z) = z^{-(M-1)}\, \mathbf{d}^T\!\left(z^{-1}\right) \mathbf{q}\!\left(z^{M}\right), \tag{6.53}
\]
where
\[
\mathbf{q}(z) = \left[Q_0(z), Q_1(z), \ldots, Q_{M-1}(z)\right]^T \tag{6.54}
\]
represents the type-II polyphase components.

Similar to the type-I polyphase representation of the analysis filter bank, a synthesis filter bank may be implemented using the type-II polyphase representation. If the type-II polyphase components of the $k$th synthesis subband filter are denoted as
\[
\mathbf{f}_k(z) = \left[f_{k,0}(z), f_{k,1}(z), \ldots, f_{k,M-1}(z)\right]^T, \tag{6.55}
\]
then the $k$th synthesis subband filter may be written as
\[
F_k(z) = z^{-(M-1)}\, \mathbf{d}^T\!\left(z^{-1}\right) \mathbf{f}_k\!\left(z^{M}\right), \tag{6.56}
\]
so the synthesis filter bank may be written as
\[
\begin{aligned}
\mathbf{f}^T(z) &= \left[F_0(z), F_1(z), \ldots, F_{M-1}(z)\right] \\
&= z^{-(M-1)}\, \mathbf{d}^T\!\left(z^{-1}\right) \left[\mathbf{f}_0\!\left(z^{M}\right), \mathbf{f}_1\!\left(z^{M}\right), \ldots, \mathbf{f}_{M-1}\!\left(z^{M}\right)\right] \\
&= z^{-(M-1)}\, \mathbf{d}^T\!\left(z^{-1}\right) \mathbf{F}\!\left(z^{M}\right),
\end{aligned} \tag{6.57}
\]
where
\[
\mathbf{F}(z) = \left[\mathbf{f}_0(z), \mathbf{f}_1(z), \ldots, \mathbf{f}_{M-1}(z)\right]. \tag{6.58}
\]
This leads to the type-II polyphase implementation in Fig. 6.11.
6.4.2 Noble Identities
Now that we have polyphase implementations of both the analysis and synthesis filter banks, we can move on to get rid of the $M-1$ wasteful filtering operations in both filter banks. We achieve this using the noble identities.
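Fig. 6.11 Type-II polyphase implementation of a maximally decimated synthesis filter bank

6.4.2.1 Decimation
The noble identity for decimation is shown in Fig. 6.12 and is proven below using (6.23):
\[
\begin{aligned}
Y_1(z) &= \frac{1}{M}\sum_{k=0}^{M-1} U\!\left(z^{1/M} W_M^{k}\right) \\
&= \frac{1}{M}\sum_{k=0}^{M-1} H\!\left(\left(z^{1/M} W_M^{k}\right)^{M}\right) X\!\left(z^{1/M} W_M^{k}\right) \\
&= \left[\frac{1}{M}\sum_{k=0}^{M-1} X\!\left(z^{1/M} W_M^{k}\right)\right] H(z) \\
&= Y_2(z),
\end{aligned} \tag{6.59}
\]
where $\left(z^{1/M} W_M^{k}\right)^{M} = z$ because $W_M^{kM} = 1$.

Fig. 6.12 Noble identity for decimation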
Applying the noble identity given above to the analysis bank in Fig. 6.9, we can move the decimators from the right side of the analysis polyphase matrix to its left side to arrive at the analysis filter bank in Fig. 6.13. With this new structure, the delay chain presents $M$ source samples in correct succession simultaneously to the decimators. The decimators ensure that the subband filters operate only once for each block of $M$ input samples, generating only one block of $M$ subband samples. The sample rate is thus reduced by a factor of $M$, but the data now move in parallel. The combination of the delay chain and the decimators essentially accomplishes a series-to-parallel conversion.
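The noble identity for decimation can be verified in the time domain. The sketch below assumes a causal FIR filter $h$ and compares filtering with $H(z^M)$ followed by decimation against decimation followed by filtering with $H(z)$. The helper names and test data are illustrative.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def stretch(h, M):
    """Impulse response of H(z^M): insert M - 1 zeros between the taps of h."""
    out = [0.0] * ((len(h) - 1) * M + 1)
    out[::M] = h
    return out


M = 3
x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 4.0, 1.5, -0.5, 2.5, 0.0, 1.0]
h = [0.5, -0.25, 0.125]

y1 = convolve(x, stretch(h, M))[::M]   # filter with H(z^M), then decimate
y2 = convolve(x[::M], h)               # decimate, then filter with H(z)
```

Both paths give the same output, but the second performs the multiplications at $1/M$ of the input rate, which is exactly the saving exploited in Fig. 6.13.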
Fig. 6.13 Efficient implementation of a maximally decimated analysis filter bank

6.4.2.2 Interpolation
The noble identity for interpolation is shown in Fig. 6.14 and is easily proven below using (6.28):
\[
Y_1(z) = U\!\left(z^{M}\right) = X\!\left(z^{M}\right) H\!\left(z^{M}\right) = Y_2(z). \tag{6.60}
\]

Fig. 6.14 Noble identity for interpolation

Fig. 6.15 Efficient implementation of a maximally decimated synthesis filter bank
For the synthesis bank in Fig. 6.11, the expanders on the left side of the synthesis polyphase matrix may be moved to its right side to arrive at the filter bank in Fig. 6.15 due to the noble identity just proven. With this structure, the synthesis subband filters operate once for each block of subband samples, whose sample rate was reduced to $1/M$ of the source sample rate by the analysis filter bank. The expanders increase the sample rate by inserting $M-1$ zeros after each of their input samples, making the sample rate the same as that of the source signal. The delay chain then delays their outputs in succession to align and interlace the $M$ nonzero subband samples in the time domain so as to form an output stream which has the same sample rate as the source. The combination of the expanders and the delay chain essentially accomplishes a parallel-to-series conversion.
6.4.3 Efficient Subband Coder Replacing the analysis and synthesis filter banks in the subband coder in Fig. 6.5 with the efficient architectures in Figs. 6.13 and 6.15, respectively, we arrive at the efficient architecture for subband coding shown in Fig. 6.16.
6.4.4 Transform Coder
Compared with the subband coder structure shown in Fig. 6.16, the transform coder in Fig. 5.1 is obviously a special case with
\[
\mathbf{H}(z) = \mathbf{T} \quad \text{and} \quad \mathbf{F}(z) = \mathbf{T}^T. \tag{6.61}
\]
In other words, the transform matrix is a polyphase matrix of order zero. The delay chain and the decimators in the analysis bank simply serve as a series-to-parallel converter that feeds the source samples to the transform matrix in blocks of $M$ samples. The expanders and the delay chain in the synthesis bank serve as a parallel-to-series converter that interlaces the subband samples output from the transform matrix to form an output sequence. The orthogonal condition (5.9) ensures that the filter bank satisfies the PR condition. For this reason, the transform coefficients can be referred to as subband samples.

Fig. 6.16 An efficient polyphase structure for subband coding
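A minimal sketch of this special case is given below. It assumes an orthonormal DCT-II matrix as the orthogonal transform $\mathbf{T}$ (any orthogonal matrix would do), an input length that is a multiple of $M$, and no quantization. For simplicity the samples within a block are kept in natural order, so the sample ordering and delay introduced by the delay chains of Fig. 6.16 are not modeled. All names are illustrative.

```python
import math


def dct_matrix(M):
    """Orthonormal DCT-II matrix, used here only as an example of an orthogonal T."""
    return [[math.sqrt((1 if k == 0 else 2) / M) * math.cos(math.pi * (n + 0.5) * k / M)
             for n in range(M)] for k in range(M)]


def analysis(x, T):
    """Series-to-parallel conversion into blocks of M samples, then y = T x per block."""
    M = len(T)
    blocks = [x[i:i + M] for i in range(0, len(x), M)]
    return [[sum(T[k][n] * b[n] for n in range(M)) for k in range(M)] for b in blocks]


def synthesis(Y, T):
    """x = T^T y per block, then parallel-to-series conversion."""
    M = len(T)
    out = []
    for y in Y:
        out.extend(sum(T[k][n] * y[k] for k in range(M)) for n in range(M))
    return out


M = 4
T = dct_matrix(M)
x = [float(v) for v in [3, 1, 4, 1, 5, 9, 2, 6]]
x_hat = synthesis(analysis(x, T), T)
```

Because $\mathbf{T}^T\mathbf{T} = \mathbf{I}$, `x_hat` equals `x` up to rounding, which is the PR property of the order-zero polyphase matrix.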
6.5 Optimal Bit Allocation and Coding Gain Section 6.4.4 has shown that a transform coder is a special subband coder whose polyphase matrix is of order zero. A subband coder, on the other hand, can have a polyphase matrix with a higher order. Other than this, there is no difference between the two, so it can be expected that the optimal bit allocation strategy and the method for calculating optimal coding gain for subband coding should be similar to transform coding. This is shown in this section with the ideal subband coder.
6.5.1 Ideal Subband Coder
An ideal subband coder uses the ideal bandpass filters in Fig. 6.4 as both the analysis and synthesis subband filters. Since the bandwidth of each filter is limited to $2\pi/M$, there is no overlapping between these subband filters. This offers optimal band separation in the sense that no energy in one subband leaks into another, thus achieving optimal energy compaction.
Since shifting the frequency of any of these ideal bandpass filters by $W_M^{m}$ creates a copy that does not overlap with the original one:
\[
H_k(z)\, H_k\!\left(zW_M^{m}\right) = F_k(z)\, H_k\!\left(zW_M^{m}\right) = 0, \quad \text{for } m = 1, 2, \ldots, M-1, \tag{6.62}
\]
the aliasing transfer function (6.33) is zero:
\[
A_m(z) = \frac{1}{M}\sum_{k=0}^{M-1} F_k(z)\, H_k\!\left(zW_M^{m}\right) = 0, \quad m = 1, 2, \ldots, M-1. \tag{6.63}
\]
Therefore, condition (6.36) for zero total aliasing effect is satisfied. On the other hand, each of the bandpass filters has a uniform transfer function
\[
H_k(z) = F_k(z) = \begin{cases} \sqrt{M}, & \text{passband;} \\ 0, & \text{stopband,} \end{cases} \tag{6.64}
\]
so the overall transfer function defined in (6.32) is
\[
T(z) = 1, \tag{6.65}
\]
which obviously satisfies (6.35). Therefore, the ideal subband system satisfies both conditions for perfect reconstruction and is thus a PR system.
6.5.2 Optimal Bit Allocation and Coding Gain
Let us consider the synthesis bank in Fig. 6.5 and assume that the quantization noise $q_k(n)$ from the $k$th quantizer is zero-mean wide sense stationary with a variance of $\sigma_{q_k}^2$ (fine quantization). After it passes through the expander, it is no longer wide sense stationary because its samples are periodically interlaced with $M-1$ zeros. However, after it passes through the ideal bandpass filter $F_k(z)$, it becomes wide sense stationary again with a variance of $\sigma_{q_k}^2/M$ [82]. Therefore, the total MSQE of $\hat{x}(n)$ at the output of the synthesis filter bank is
\[
\sigma_{q(\hat{x})}^2 = \frac{1}{M}\sum_{k=0}^{M-1} \sigma_{q_k}^2. \tag{6.66}
\]
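Let us now consider the analysis bank in Fig. 6.5. Since the decimator retains one sample out of every $M$ source samples, its output has the same variance as its input, which, for the $k$th subband, is given by
\[
\sigma_{y_k}^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left|H_k(e^{j\omega})\right|^2 S_{xx}(e^{j\omega})\, d\omega. \tag{6.67}
\]
Using (6.64) and denoting its passband as $\Omega_k$, we can write the equation above as
\[
\sigma_{y_k}^2 = \frac{M}{2\pi}\int_{\Omega_k} S_{xx}(e^{j\omega})\, d\omega. \tag{6.68}
\]
Adding both sides of the equation above for all subbands, we obtain
\[
\sum_{k=0}^{M-1} \sigma_{y_k}^2 = \sum_{k=0}^{M-1} \frac{M}{2\pi}\int_{\Omega_k} S_{xx}(e^{j\omega})\, d\omega = \frac{M}{2\pi}\int_{-\pi}^{\pi} S_{xx}(e^{j\omega})\, d\omega = M\sigma_x^2, \tag{6.69}
\]
which leads to
\[
\sigma_x^2 = \frac{1}{M}\sum_{k=0}^{M-1} \sigma_{y_k}^2. \tag{6.70}
\]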
Since (6.66) and (6.70) are the same as (5.22) and (5.37), respectively, all the derivations in Sect. 5.2 related to coding gain and optimal bit allocation apply to the ideal subband coder as well. In particular, we have the following coding gain:
\[
G_{SBC} = \frac{\sigma_x^2}{\left(\prod_{k=0}^{M-1} \sigma_{y_k}^2\right)^{1/M}} = \frac{\frac{1}{M}\sum_{k=0}^{M-1} \sigma_{y_k}^2}{\left(\prod_{k=0}^{M-1} \sigma_{y_k}^2\right)^{1/M}}, \tag{6.71}
\]
and optimal bit allocation strategy
\[
r_k = R + 0.5 \log_2 \frac{\sigma_{y_k}^2}{\left(\prod_{k=0}^{M-1} \sigma_{y_k}^2\right)^{1/M}}, \quad \text{for } k = 0, 1, \ldots, M-1. \tag{6.72}
\]
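As a small worked example, the sketch below evaluates (6.71) and (6.72) for a given list of subband variances. The function names and the two-band variances are illustrative assumptions, and the geometric mean is computed through logarithms to avoid overflow when $M$ is large.

```python
import math


def geometric_mean(variances):
    """Geometric mean of the subband variances, computed in the log domain."""
    return math.exp(sum(math.log(v) for v in variances) / len(variances))


def coding_gain(variances):
    """Coding gain (6.71): arithmetic mean over geometric mean of the subband variances."""
    return (sum(variances) / len(variances)) / geometric_mean(variances)


def bit_allocation(variances, R):
    """Optimal bit allocation (6.72) for an average rate of R bits per sample."""
    gm = geometric_mean(variances)
    return [R + 0.5 * math.log2(v / gm) for v in variances]


variances = [4.0, 1.0]          # assumed subband variances
gain = coding_gain(variances)   # 2.5 / 2.0
bits = bit_allocation(variances, R=2.0)
```

With these variances the gain is 1.25 and the allocation is 2.5 and 1.5 bits, which averages to $R = 2$. Note that (6.72) can produce negative or fractional allocations when the variances are very uneven, which a practical coder has to handle separately.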
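6.5.3 Asymptotic Coding Gain
When the number of subbands $M$ is sufficiently large, each subband becomes sufficiently narrow, so the variance in (6.68) may be approximated by
\[
\sigma_{y_k}^2 \approx \frac{M}{2\pi}\,|\Omega_k|\, S_{xx}(e^{j\omega_k}), \tag{6.73}
\]
where $|\Omega_k|$ denotes the width of $\Omega_k$ and $\omega_k$ is a frequency within $\Omega_k$. Since
\[
|\Omega_k| = \frac{2\pi}{M}, \tag{6.74}
\]
(6.73) becomes
\[
\sigma_{y_k}^2 \approx S_{xx}(e^{j\omega_k}). \tag{6.75}
\]
The geometric mean used by the coding gain formula (6.71) may be rewritten as
\[
\left(\prod_{k=0}^{M-1} \sigma_{y_k}^2\right)^{1/M} = \exp\!\left[\ln\left(\prod_{k=0}^{M-1} \sigma_{y_k}^2\right)^{1/M}\right] = \exp\!\left(\frac{1}{M}\sum_{k=0}^{M-1} \ln \sigma_{y_k}^2\right). \tag{6.76}
\]
Dropping (6.75) into the equation above, we have
\[
\begin{aligned}
\left(\prod_{k=0}^{M-1} \sigma_{y_k}^2\right)^{1/M} &\approx \exp\!\left[\frac{1}{M}\sum_{k=0}^{M-1} \ln S_{xx}(e^{j\omega_k})\right] \\
&= \exp\!\left[\frac{1}{2\pi}\sum_{k=0}^{M-1} \ln S_{xx}(e^{j\omega_k})\, \frac{2\pi}{M}\right] \\
&= \exp\!\left[\frac{1}{2\pi}\sum_{k=0}^{M-1} \ln S_{xx}(e^{j\omega_k})\, |\Omega_k|\right],
\end{aligned} \tag{6.77}
\]
where (6.74) is used to obtain the last equation. As $M \to \infty$, the equation above becomes
\[
\left(\prod_{k=0}^{M-1} \sigma_{y_k}^2\right)^{1/M} = \exp\!\left[\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S_{xx}(e^{j\omega})\, d\omega\right]. \tag{6.78}
\]
Dropping it back into the coding gain (6.71), we obtain
\[
\lim_{M\to\infty} G_{SBC} = \frac{\sigma_x^2}{\exp\!\left[\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S_{xx}(e^{j\omega})\, d\omega\right]} \tag{6.79}
\]
\[
\phantom{\lim_{M\to\infty} G_{SBC}} = \frac{\frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(e^{j\omega})\, d\omega}{\exp\!\left[\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S_{xx}(e^{j\omega})\, d\omega\right]} \tag{6.80}
\]
\[
\phantom{\lim_{M\to\infty} G_{SBC}} = \frac{1}{\gamma_x^2}, \tag{6.81}
\]
where $\gamma_x^2$ is the spectral flatness measure defined in (5.65). Therefore, the ideal subband coder approaches the same asymptotic optimal coding gain as the KLT (see (5.64)).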
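The limit can be illustrated numerically. The sketch below assumes a unit-variance first-order autoregressive source with correlation coefficient $\rho$, which is a test source chosen here and not one used in the text; its power spectrum is $(1-\rho^2)/(1 - 2\rho\cos\omega + \rho^2)$ and its spectral flatness measure is known to be $1-\rho^2$. The subband variances are approximated as in (6.75) by sampling the spectrum at the centers of $M$ equal bands on $[0, \pi)$, relying on the symmetry of the spectrum.

```python
import math


def ar1_psd(omega, rho):
    """Power spectrum of a unit-variance AR(1) process (an assumed test source)."""
    return (1 - rho ** 2) / (1 - 2 * rho * math.cos(omega) + rho ** 2)


def subband_gain(M, rho):
    """Coding gain (6.71) with the variances approximated by band-center samples, as in (6.75)."""
    v = [ar1_psd(math.pi * (k + 0.5) / M, rho) for k in range(M)]
    arithmetic = sum(v) / M
    geometric = math.exp(sum(math.log(s) for s in v) / M)
    return arithmetic / geometric


rho = 0.9
limit = 1 / (1 - rho ** 2)           # 1 / spectral flatness measure of the AR(1) source
gain_small = subband_gain(2, rho)    # coarse two-band approximation
gain_large = subband_gain(256, rho)  # close to the asymptotic value
```

For $\rho = 0.9$ the two-band estimate is far below the limit of about 5.26, while the 256-band estimate matches it closely.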
Chapter 7
Cosine-Modulated Filter Banks
Between the KLT transform coder and the ideal subband coder, there are many subband coders which offer great energy compaction capability with a reasonable implementation cost. Prominent among them are cosine modulated filter banks (CMFB), whose subband filters are derived from a prototype filter through cosine modulation.

The first advantage of CMFB is that the implementation cost of both the analysis and synthesis banks is that of the prototype filter plus the overhead associated with cosine modulation. For a CMFB with $M$ bands and $N$ taps per subband filter, the number of operations for the prototype filter is on the order of $N$ and that for the cosine modulation, when implemented using a fast algorithm, is on the order of $M \log_2 M$, so the total number of operations is merely on the order of $N + M \log_2 M$. For comparison, the number of operations for a regular filter bank is on the order of $MN$.

The second advantage is associated with the design of subband filters. Instead of designing all subband filters in a filter bank independently, which entails optimizing a total of $MN$ coefficients, we only need to optimize the prototype filter with CMFB, which has no more than $N$ coefficients.

Early CMFBs are near-perfect reconstruction systems [6, 8, 50, 64, 81] in which only "adjacent-subband aliasing" is canceled, so the reconstructed signal at the output of the synthesis filter bank is only approximately equal to a delayed and scaled version of the signal input to the analysis filter bank. The same filter bank structure was later found to be capable of delivering perfect reconstruction if two additional constraints are imposed on the prototype filter [39, 40, 46–48, 77–79].
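To give a feeling for the orders of magnitude involved, the sketch below compares the two operation counts. The values $M = 32$ and $N = 512$ are assumed for illustration, and the counts are order-of-magnitude estimates only, with constant factors ignored.

```python
import math


def cmfb_ops(M, N):
    """Order-of-magnitude operation count for a CMFB: N + M log2(M)."""
    return N + M * math.log2(M)


def direct_ops(M, N):
    """Order-of-magnitude operation count for M independent N-tap subband filters: M N."""
    return M * N


M, N = 32, 512                  # assumed example values
fast = cmfb_ops(M, N)           # 512 + 32 * 5
slow = direct_ops(M, N)         # 32 * 512
ratio = slow / fast
```

For these values the direct structure needs roughly 24 times as many operations as the CMFB.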
7.1 Cosine Modulation
The idea of a modulated filter bank was exemplified by the DFT bank discussed in Sect. 6.1.2, whose analysis filters are all derived from the prototype filter in (6.7) using DFT modulation (6.8). This leads to the implementation structure shown in Fig. 6.2, whose implementation cost is that of the delay line plus the DFT, which can be implemented using an FFT.

Y. You, Audio Coding: Theory and Applications, DOI 10.1007/978-1-4419-1754-6_7, © Springer Science+Business Media, LLC 2010
While the subband filters of the DFT bank are limited to just $M$ taps (see (6.7)), the DFT bank illustrates the basic idea of modulated filter banks and can be easily extended to accommodate subband filters with more than $M$ taps through polyphase representation.

Even if the prototype filter of an extended DFT filter bank is real-valued, the modulated subband filters are generally complex-valued because the DFT modulation is complex-valued. Consequently, the subband samples of an extended DFT filter bank are complex-valued and are thus not amenable to subband coding.

To obtain a real-valued modulated filter bank, the idea that extends DFT to DCT is followed: a $2M$ DFT is used to modulate a real-valued prototype filter to produce $2M$ complex subband filters and then subband filters symmetric with respect to the zero frequency are combined to obtain real-valued ones. This leads to CMFB.

There is a small practical issue, as shown in Fig. 6.3, where the magnitude response of the prototype filter in the DFT bank is split at the zero frequency, leaving half of its bandwidth in the real frequency domain and the other half in the imaginary frequency domain. Due to the periodicity of the DFT spectrum, it appears to have two subbands in the real frequency domain, one starting at the zero frequency and the other ending at $2\pi$. This is very different from the other modulated subbands, which are not split. This issue may be easily addressed by shifting the subband filters by $\pi/2M$. Afterwards, the subband filters whose center frequencies are symmetric with respect to the zero frequency are combined to construct a real subband filter.
7.1.1 Extended DFT Bank
The prototype filter $H(z)$ given in (6.7) for the DFT bank in Sect. 6.1.2 is of length $M$. This can be extended by obtaining the type-I polyphase representation of the modulated subband filter (6.8). In particular, using the type-I polyphase representation (6.41), the subband filters of the DFT modulated filter bank (6.8) may be written as
\[
\begin{aligned}
H_k(z) = H\!\left(zW_M^{k}\right) &= \sum_{m=0}^{M-1} \left(zW_M^{k}\right)^{-m} P_m\!\left(\left(zW_M^{k}\right)^{M}\right) \\
&= \sum_{m=0}^{M-1} W_M^{-km} z^{-m} P_m\!\left(z^{M}\right), \quad \text{for } k = 0, 1, \ldots, M-1,
\end{aligned} \tag{7.1}
\]
where $P_m(z^{M})$ is the $m$th type-I polyphase component of $H(z)$ with respect to $M$. Note that a subscript $M$ is attached to $W$ to emphasize that it is for an $M$-fold DFT:
\[
W_M = e^{-j2\pi/M}. \tag{7.2}
\]
Equation (7.1) leads to the implementation structure shown in Fig. 7.1.

Fig. 7.1 Extension of the DFT analysis filter bank to accommodate subband filters with more than $M$ taps. $\{P_m(z^{M})\}_{m=0}^{M-1}$ are the type-I polyphase components of the prototype filter $H(z)$ with respect to $M$. Note that the $M$ subscript for the DFT matrix $\mathbf{W}_M$ is included to emphasize that each of its elements is a power of $W_M$ and it is an $M \times M$ matrix
If the prototype filter used in (7.1) is reduced to (6.7):
\[
P_m\!\left(z^{M}\right) = 1, \quad \text{for } m = 0, 1, \ldots, M-1, \tag{7.3}
\]
then the filter bank in Fig. 7.1 degenerates to the DFT bank in Fig. 6.2.

There is obviously no explicit restriction on the length of the prototype filter in (7.1), so a generic $N$-tap FIR filter can be assumed:
\[
H(z) = \sum_{n=0}^{N-1} h(n) z^{-n}. \tag{7.4}
\]
This extension enables longer prototype filters which can offer much better energy compaction capability than the 13 dB achieved by the DFT bank in Fig. 6.3.
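In the time domain, the modulation $H_k(z) = H(zW_M^{k})$ multiplies the prototype coefficients by $W_M^{-kn}$, which shifts the frequency response to the right by $2\pi k/M$. The sketch below checks this at one frequency for an assumed $N$-tap prototype with $N > M$; the function names, the prototype and the test frequency are illustrative.

```python
import cmath
import math


def freq_response(h, omega):
    """H(e^{j omega}) = sum_n h(n) exp(-j omega n)."""
    return sum(hn * cmath.exp(-1j * omega * n) for n, hn in enumerate(h))


def dft_modulate(h, k, M):
    """Coefficients of H_k(z) = H(z W_M^k): h(n) W_M^(-k n) with W_M = exp(-j 2 pi / M)."""
    return [hn * cmath.exp(2j * cmath.pi * k * n / M) for n, hn in enumerate(h)]


M = 8
N = 24                                                         # more than M taps
h = [math.sin(math.pi * (n + 0.5) / N) / N for n in range(N)]  # assumed real prototype
k = 3
omega = 1.234                                                  # arbitrary test frequency

shifted = freq_response(h, omega - 2 * math.pi * k / M)        # prototype shifted right
modulated = freq_response(dft_modulate(h, k, M), omega)        # response of H_k
is_complex = any(abs(c.imag) > 1e-12 for c in dft_modulate(h, k, M))
```

The two responses coincide, and the flag `is_complex` confirms that the modulated coefficients are complex even though the prototype is real.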
7.1.2 2M-DFT Bank
Even if the coefficients of the prototype filter $H(z)$ in Fig. 7.1 are real-valued, the modulated filters $H_k(z)$ generally do not have real-valued coefficients because the DFT modulation is complex. Consequently, the subband samples output from these filters are complex. Since a complex sample consists of a real part and an imaginary part, there are now $2M$ real values to be quantized and coded for each block of $M$ real-valued input samples, which doubles the amount of data. To avoid this problem, a modulation scheme that leads to real-valued subband samples is called for. This can be achieved using an approach similar to the derivation of DCT from DFT: a real-valued prototype filter is modulated by a $2M$ DFT to produce $2M$ complex subband filters and then each pair of such subband filters symmetric with respect to the zero frequency is combined to form a real-valued subband filter.
To arrive at $2M$ DFT modulation, the polyphase representation (7.1) becomes
\[
H_k(z) = H\!\left(zW_{2M}^{k}\right) = \sum_{m=0}^{2M-1} W_{2M}^{-km} z^{-m} P_m\!\left(z^{2M}\right), \quad \text{for } k = 0, 1, \ldots, 2M-1, \tag{7.5}
\]
where
\[
W_{2M} = e^{-j2\pi/2M} = e^{-j\pi/M}, \tag{7.6}
\]
and $P_m(z^{2M})$ is the $m$th type-I polyphase component of $H(z)$ with respect to $2M$. The implementation structure for such a filter bank is shown in Fig. 7.2.

Fig. 7.2 2M-DFT analysis filter bank. $\{P_m(z^{2M})\}_{m=0}^{2M-1}$ are the type-I polyphase components of the prototype filter $H(z)$ with respect to $2M$ and $\mathbf{W}_{2M}$ is the $2M \times 2M$ DFT matrix

The magnitude responses of the above $2M$ DFT bank are shown in Fig. 7.3 for $M = 8$. The prototype filter is again given in (6.7) with $2M = 16$ taps and its magnitude response is shown at the top of the figure. There are $2M = 16$ subband filters whose magnitude responses are shown at the bottom of the figure.

Since a filter with only real-valued coefficients has a frequency response that is conjugate-symmetric with respect to the zero frequency [67], each pair of the subband filters in the above $2M$ DFT bank satisfying this condition can be combined to form a subband filter with real-valued coefficients. Since the frequency responses of both $H_0(z)$, which is the prototype filter, and $H_8(z)$ are themselves conjugate-symmetric with respect to the zero frequency, their coefficients are already real-valued and they cannot be combined with any other subband filters. The remaining subband filters from $H_1(z)$ to $H_7(z)$ and from $H_9(z)$ to $H_{15}(z)$ can be combined to form a total of $(2M-2)/2 = 7$ real-valued subband filters. These combined subband filters, plus $H_0(z)$ and $H_8(z)$, give us a total of $7 + 2 = 9$ combined subband filters.

Since the frequency response of the prototype filter $H(z)$ is split at the zero frequency and that of $H_8(z)$ is split at $\pi$ and $-\pi$, their bandwidth is only half of that of the other combined subband filters, as shown at the bottom of Fig. 7.3. This results in a situation where two subbands have half bandwidth and the remaining subbands have full bandwidth. While this type of filter bank with unequal subband bandwidth can be made to work (see [77], for example), it is rather awkward for practical subband coding.
Fig. 7.3 Magnitude responses of the prototype filter $H_0$ of a $2M$ DFT filter bank (top) and of all its subband filters $H_0$ to $H_{15}$ (bottom). Both panels plot amplitude (dB) against $\omega/2\pi$ over $[-0.5, 0.5]$
7.1.3 Frequency-Shifted DFT Bank
The problem of unequal subband bandwidth can be addressed by shifting the subband filters to the right by the additional amount of $\pi/2M$ so that (7.5) becomes
\[
\begin{aligned}
H_k(z) = H\!\left(zW_{2M}^{k+0.5}\right) &= \sum_{m=0}^{2M-1} \left(zW_{2M}^{k+0.5}\right)^{-m} P_m\!\left(\left(zW_{2M}^{k+0.5}\right)^{2M}\right) \\
&= \sum_{m=0}^{2M-1} \left(zW_{2M}^{0.5}\right)^{-m} W_{2M}^{-km} P_m\!\left(-z^{2M}\right), \quad k = 0, 1, \ldots, 2M-1,
\end{aligned} \tag{7.7}
\]
where $W_{2M}^{M} = -1$ was used. Equation (7.7) can be implemented using the structure shown in Fig. 7.4.

Figure 7.5 shows the magnitude responses of the filter bank above using the prototype filter given in (6.7) with $2M = 16$ taps. They are the same magnitude responses given in Fig. 7.3 except for a frequency shift of $\pi/2M$. Now all subbands have the same bandwidth.
Fig. 7.4 2M-DFT analysis filter bank with an additional frequency shift of $\pi/2M$ to the right. $\{P_m(z^{2M})\}_{m=0}^{2M-1}$ are the type-I polyphase components of $H(z)$ with respect to $2M$ and $\mathbf{W}_{2M}$ is the $2M \times 2M$ DFT matrix

Fig. 7.5 Magnitude response of a prototype filter shifted by $\pi/2M$ to the right (top). Magnitude responses of all subband filters $H_0$ to $H_{15}$ modulated from such a filter using a $2M$ DFT (bottom). Both panels plot amplitude (dB) against $\omega/2\pi$ over $[-0.5, 0.5]$
7.1.4 CMFB From Fig. 7.5, it is obvious that the frequency response of the k-th subband filter is conjugate-symmetric to that of the .2M 1 k/-th subband filter with respect to zero frequency (they are images of each other), so they are candidate pairs for combination into a real filter. Let us drop (7.4) into first equation of (7.7) to obtain
7.1 Cosine Modulation
121
Hk .z/ D
N 1 X nD0
D
N 1 X nD0
n kC0:5 h.n/ zW2M .kC0:5/n n
h.n/W2M
z
(7.8)
and H2M 1k .z/ D
N 1 X nD0
D
N 1 X
n 2M 1kC0:5 h.n/ zW2M n k0:5 h.n/ zW2M
nD0
D
N 1 X nD0
.kC0:5/n n
h.n/W2M
z
:
(7.9)
Comparing the two equations above, we can see that the coefficients of $H_k(z)$ and $H_{2M-1-k}(z)$ are conjugates of each other (for a real-valued prototype filter), so the combined filter will have real coefficients. When the pair above are actually combined, they are weighted by a unit-magnitude constant $c_k$:

$$A_k(z) = c_k H_k(z) + c_k^* H_{2M-1-k}(z) \tag{7.10}$$

to aid alias cancellation and elimination of phase distortion. In addition, the linear phase condition is imposed on the prototype filter:

$$h(n) = h(N-1-n) \tag{7.11}$$

and the mirror image condition on the synthesis filters:

$$s_k(n) = a_k(N-1-n), \tag{7.12}$$

where $a_k(n)$ and $s_k(n)$ are the impulse responses of the $k$th analysis and synthesis subband filters, respectively. After much derivation [93], we arrive at the following cosine modulated analysis filter:

$$a_k(n) = 2h(n)\cos\left[\frac{\pi}{M}(k+0.5)\left(n - \frac{N-1}{2}\right) + \theta_k\right], \quad k = 0, 1, \ldots, M-1, \tag{7.13}$$

where the phase

$$\theta_k = (-1)^k \frac{\pi}{4}. \tag{7.14}$$
The cosine modulated synthesis filter can be obtained from the analysis filter (7.13) using the mirror image relation (7.12):

$$s_k(n) = 2h(n)\cos\left[\frac{\pi}{M}(k+0.5)\left(n - \frac{N-1}{2}\right) - \theta_k\right], \quad k = 0, 1, \ldots, M-1. \tag{7.15}$$
The system of analysis and synthesis banks above completely eliminates phase distortion, so the overall transfer function $T(z)$ defined in (6.32) is linear phase. But amplitude distortion remains. Therefore, the CMFB is a nonperfect reconstruction system and is sometimes called pseudo QMF (quadrature mirror filter). Part of the amplitude distortion comes from incomplete aliasing cancellation: only aliasing from adjacent subbands is canceled, not aliasing from all subbands. Note that, even though the linear phase condition (7.11) is imposed on the prototype filter, the analysis and synthesis filters generally do not have linear phase.
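As a small numerical illustration of (7.13) to (7.15), the following Python sketch builds the analysis and synthesis impulse responses from a prototype filter. The function name `cmfb_filters` and the half-sine prototype are assumptions made only for this illustration; any prototype satisfying the linear phase condition (7.11) could be substituted.

```python
import math

def cmfb_filters(h, M):
    """Return (analysis, synthesis) impulse responses per (7.13) and (7.15)."""
    N = len(h)
    analysis, synthesis = [], []
    for k in range(M):
        theta = (-1) ** k * math.pi / 4  # phase of (7.14)
        arg = [math.pi / M * (k + 0.5) * (n - (N - 1) / 2) for n in range(N)]
        analysis.append([2 * h[n] * math.cos(arg[n] + theta) for n in range(N)])
        synthesis.append([2 * h[n] * math.cos(arg[n] - theta) for n in range(N)])
    return analysis, synthesis

# Hypothetical prototype: a symmetric half-sine of length N = 2M.
M = 4
h = [math.sin((n + 0.5) * math.pi / (2 * M)) for n in range(2 * M)]
a, s = cmfb_filters(h, M)
```

Because the prototype is symmetric, each synthesis filter produced this way is the time reversal of the corresponding analysis filter, which is the mirror image condition (7.12).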
7.2 Design of NPR Filter Banks

The CMFB given in (7.13) and (7.15) cannot deliver perfect reconstruction due to incomplete aliasing cancellation and the existence of amplitude distortion. But the overall reconstruction error can be reduced to an acceptable level if the amplitude distortion is properly controlled. Amplitude distortion arises if the magnitude of the overall transfer function $|T(z)|$ is not exactly flat, so the problem may be posed as designing the prototype filter in such a way that $|T(z)|$ is flat or close to flat. It turns out that this can be ensured if the following function is sufficiently flat [93]:

$$|H(e^{j\omega})|^2 + |H(e^{j(\omega - \pi/M)})|^2 \approx 1, \quad \text{for } \omega \in [0, \pi/M]. \tag{7.16}$$
This condition can be enforced through minimizing the following cost function:

$$\beta(H(z)) = \int_{0}^{\pi/M} \left[\,|H(e^{j\omega})|^2 + |H(e^{j(\omega - \pi/M)})|^2 - 1\right]^2 d\omega. \tag{7.17}$$
In addition to the concern above over amplitude distortion, energy compaction is also of prominent importance for signal coding and other applications. To ensure this, all subband filters should have good stopband attenuation. Since all subband filters are shifted copies of the prototype filter, they all have the same amplitude shape as the prototype filter. Therefore, the optimization of stopband attenuation for all subband filters can be reduced to that of the prototype filter. The nominal bandwidth of the prototype filter on the positive frequency axis is $\pi/2M$, so stopband attenuation can be optimized by minimizing the following cost function:
$$\gamma(H(z)) = \int_{\frac{\pi}{2M} + \epsilon}^{\pi} |H(e^{j\omega})|^2 \, d\omega, \tag{7.18}$$

where $\epsilon$ controls the transition bandwidth and should be adjusted for a particular application. Now both amplitude distortion and stopband attenuation can be optimized by

$$\min_{h(n)} \; \delta(H(z)) = \alpha\beta(H(z)) + (1-\alpha)\gamma(H(z)), \quad \text{subject to (7.11)}, \tag{7.19}$$

where $\alpha$ controls the trade-off between amplitude distortion and stopband attenuation. See [76] for standard optimization procedures that can be applied.
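The cost functions (7.17) to (7.19) can be evaluated numerically for a candidate prototype filter. The Python sketch below uses a midpoint-rule integration; the function names, the number of integration steps, and the symbols `eps` (transition-bandwidth parameter) and `alpha` (trade-off weight) are choices made for this illustration, and the sketch only evaluates the cost rather than performing the minimization.

```python
import cmath
import math

def freq_resp(h, w):
    """H(e^{jw}) for the FIR filter with coefficients h."""
    return sum(hn * cmath.exp(-1j * w * n) for n, hn in enumerate(h))

def beta(h, M, steps=512):
    """Midpoint-rule estimate of the flatness cost (7.17)."""
    dw = (math.pi / M) / steps
    total = 0.0
    for i in range(steps):
        w = (i + 0.5) * dw
        e = (abs(freq_resp(h, w)) ** 2
             + abs(freq_resp(h, w - math.pi / M)) ** 2 - 1.0)
        total += e * e * dw
    return total

def gamma(h, M, eps, steps=512):
    """Midpoint-rule estimate of the stopband energy (7.18)."""
    lo = math.pi / (2 * M) + eps
    dw = (math.pi - lo) / steps
    return sum(abs(freq_resp(h, lo + (i + 0.5) * dw)) ** 2 * dw
               for i in range(steps))

def delta(h, M, alpha, eps):
    """Combined cost (7.19) for a given trade-off weight alpha."""
    return alpha * beta(h, M) + (1 - alpha) * gamma(h, M, eps)
```

A general-purpose optimizer would then vary the free coefficients of the symmetric prototype filter to reduce `delta`.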
7.3 Perfect Reconstruction

The CMFB in (7.13) and (7.15) becomes a perfect reconstruction system when aliasing is completely canceled and amplitude distortion is eliminated. Toward this end, we first impose the following length constraint on the prototype filter:

$$N = 2mM, \tag{7.20}$$

where $m$ is a positive integer. Then the CMFB is a perfect reconstruction system if and only if the polyphase components of the prototype filter satisfy the following pairwise power-complementary conditions [40, 93]:

$$\tilde{P}_k(z)P_k(z) + \tilde{P}_{M+k}(z)P_{M+k}(z) = \alpha, \quad k = 0, 1, \ldots, M-1, \tag{7.21}$$

where $\alpha$ is a positive number. The "tilde" notation applied to a rational Z-transform function $H(z)$ means taking the complex conjugate of all its coefficients and replacing $z$ with $z^{-1}$. For example, if

$$H(z) = a + bz^{-1} + cz^{-2}, \tag{7.22}$$

then

$$\tilde{H}(z) = a^* + b^*z + c^*z^{2}. \tag{7.23}$$

It is intended to effect complex conjugation applicable to a frequency response function:

$$\tilde{H}(e^{j\omega}) = H^*(e^{j\omega}). \tag{7.24}$$

When applied to a matrix of Z-transform functions $\mathbf{H}(z) = [H_{i,j}(z)]$, a transpose operation is also implied:

$$\tilde{\mathbf{H}}(z) = [\tilde{H}_{i,j}(z)]^T. \tag{7.25}$$
7.4 Design of PR Filter Banks

The method for designing a PR prototype filter is similar to that for the NPR prototype filter discussed in Sect. 7.2. The difference is that the amplitude distortion is now eliminated by the power-complementary conditions (7.21), so the design problem is focused on energy compaction:

$$\min_{h(n)} \; \gamma(H(z)) = \int_{\frac{\pi}{2M} + \epsilon}^{\pi} |H(e^{j\omega})|^2 \, d\omega, \quad \text{subject to (7.11) and (7.21)}. \tag{7.26}$$

While the minimization step above may be straightforward by itself, the difficulty lies in the imposition of the power-complementary constraints (7.21) and the linear phase condition (7.11).
7.4.1 Lattice Structure

One approach to imposing the power-complementary constraints (7.21) during the minimization process is to implement the power-complementary pairs of polyphase components using a cascade of lattice structures.

7.4.1.1 Paraunitary Systems

Toward this end, let us write each power-complementary pair as the following $2 \times 1$ transfer matrix or system:

$$\mathbf{P}_k(z) = \begin{bmatrix} P_k(z) \\ P_{M+k}(z) \end{bmatrix}, \quad k = 0, 1, \ldots, M-1; \tag{7.27}$$

then the power-complementary condition (7.21) may be rewritten as

$$\tilde{\mathbf{P}}_k(z)\mathbf{P}_k(z) = \alpha, \quad k = 0, 1, \ldots, M-1, \tag{7.28}$$

which means that the $2 \times 1$ system $\mathbf{P}_k(z)$ is paraunitary. Therefore, the power-complementary condition (7.21) is equivalent to the condition that $\mathbf{P}_k(z)$ is paraunitary. In general terms, an $m \times n$ rational transfer matrix or system $\mathbf{H}(z)$ is called paraunitary if

$$\tilde{\mathbf{H}}(z)\mathbf{H}(z) = \alpha\mathbf{I}_n, \tag{7.29}$$

where $\mathbf{I}_n$ is the $n \times n$ unit matrix. It is necessary that $m \ge n$. Otherwise, the rank of $\mathbf{H}(z)$ is less than $n$. If $m = n$, that is, if the transfer matrix is square, the transfer system is further referred to as unitary.
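7.4.1.2 Givens Rotation

As an example, consider the Givens rotation described by the following transfer matrix [20, 22]:

$$\mathbf{G}(\theta) = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}, \tag{7.30}$$

where $\theta$ is a real angle. A flowgraph for this transfer matrix is shown in Fig. 7.6. It can be easily verified that

$$\mathbf{G}^T(\theta)\mathbf{G}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} = \mathbf{I}_2, \tag{7.31}$$

so it is unitary. A geometric interpretation of the Givens rotation is that it rotates an input vector clockwise by $\theta$. In particular, if an input $\mathbf{A} = [r\cos\alpha, r\sin\alpha]^T$ with an angle of $\alpha$ is rotated clockwise by an angle of $\theta$, the output vector has an angle of $\alpha - \theta$ and is given by

$$\begin{bmatrix} r\cos(\alpha-\theta) \\ r\sin(\alpha-\theta) \end{bmatrix} = \begin{bmatrix} r\cos\alpha\cos\theta + r\sin\alpha\sin\theta \\ r\sin\alpha\cos\theta - r\cos\alpha\sin\theta \end{bmatrix} = \mathbf{G}(\theta) \begin{bmatrix} r\cos\alpha \\ r\sin\alpha \end{bmatrix}. \tag{7.32}$$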
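The clockwise-rotation interpretation of (7.32) is easy to check numerically. In the Python sketch below, the helper names `givens` and `apply` and the particular values of $r$, $\alpha$ and $\theta$ are arbitrary choices for illustration.

```python
import math

def givens(theta):
    """The 2x2 Givens rotation matrix of (7.30)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def apply(G, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [G[0][0] * v[0] + G[0][1] * v[1],
            G[1][0] * v[0] + G[1][1] * v[1]]

# Rotate a vector of radius r and angle alpha clockwise by theta.
r, alpha, theta = 2.0, 1.0, 0.3
out = apply(givens(theta), [r * math.cos(alpha), r * math.sin(alpha)])
```

The output keeps the radius $r$ and has the angle $\alpha - \theta$, as (7.32) states.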
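7.4.1.3 Delay Matrix

Let us consider another $2 \times 2$ transfer matrix:

$$\mathbf{D}(z) = \begin{bmatrix} 1 & 0 \\ 0 & z^{-1} \end{bmatrix}, \tag{7.33}$$

which is a simple $2 \times 2$ delay system and is shown in Fig. 7.7. It is unitary because

$$\tilde{\mathbf{D}}(z)\mathbf{D}(z) = \begin{bmatrix} 1 & 0 \\ 0 & z \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & z^{-1} \end{bmatrix} = \mathbf{I}_2. \tag{7.34}$$

Fig. 7.6 The Givens rotation

Fig. 7.7 A simple $2 \times 2$ delay system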
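7.4.1.4 Rotation Vector

The following simple $2 \times 1$ transfer matrix:

$$\mathbf{R}(\theta) = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} \tag{7.35}$$

is a simple rotation vector. It is paraunitary because

$$\mathbf{R}^T(\theta)\mathbf{R}(\theta) = \begin{bmatrix} \cos\theta & \sin\theta \end{bmatrix} \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} = \mathbf{I}_1. \tag{7.36}$$

Its flowgraph is shown in Fig. 7.8.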
7.4.1.5 Cascade of Paraunitary Matrices

An important property of paraunitary matrices is that a cascade of paraunitary matrices is also paraunitary. In particular, if $\mathbf{H}_1(z)$ and $\mathbf{H}_2(z)$ are paraunitary, then $\mathbf{H}(z) = \mathbf{H}_1(z)\mathbf{H}_2(z)$ is also paraunitary. This is because

$$\tilde{\mathbf{H}}(z)\mathbf{H}(z) = \tilde{\mathbf{H}}_2(z)\tilde{\mathbf{H}}_1(z)\mathbf{H}_1(z)\mathbf{H}_2(z) = \alpha^2\mathbf{I}. \tag{7.37}$$

The result above can be extended to a cascade of any number of paraunitary systems. Using this property, we can build more complex $2 \times 1$ paraunitary systems of arbitrary order by cascading the elementary paraunitary transfer matrices discussed above. One such example is the lattice structure shown in Fig. 7.9. It has $N-1$ delay subsystems and $N-1$ Givens rotations. Its transfer function may be written as

$$\mathbf{P}(z) = \left( \prod_{n=N-1}^{1} \mathbf{G}(\theta_n)\mathbf{D}(z) \right) \mathbf{R}(\theta_0). \tag{7.38}$$

It has a parameter set of $\{\theta_n\}_{n=0}^{N-1}$ and $N-1$ delay units, so it represents a $2 \times 1$ real-coefficient FIR system of order $N-1$. It was shown that any $2 \times 1$ real-coefficient FIR paraunitary system of order $N-1$ may be factorized by such a lattice structure [93].

Fig. 7.8 A simple $2 \times 1$ rotation system

Fig. 7.9 A cascaded $2 \times 1$ paraunitary system
7.4.1.6 Power-Complementary Condition

Let us apply the result above to the $2 \times 1$ transfer matrices $\mathbf{P}_k(z)$ defined in (7.27) to enforce the power-complementary condition (7.21). Due to (7.20), both polyphase components $P_k(z)$ and $P_{M+k}(z)$ with respect to $2M$ have an order of $m-1$, so $\mathbf{P}_k(z)$ has an order of $m-1$. Due to (7.38), it can be factorized as follows:

$$\mathbf{P}_k(z) = \left( \prod_{n=m-1}^{1} \mathbf{G}\left(\theta_n^k\right)\mathbf{D}(z) \right) \mathbf{R}\left(\theta_0^k\right), \quad k = 0, 1, \ldots, M-1. \tag{7.39}$$

Since this lattice structure is guaranteed to be paraunitary, the parameter set

$$\left\{\theta_n^k\right\}_{n=0}^{m-1} \tag{7.40}$$

can be arbitrarily adjusted to minimize (7.26) without violating the power-complementary condition. Since there are $M$ such systems, the total number of free parameters $\theta_n^k$ that can be varied for optimization is reduced to $mM$ from $2mM$.
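To see that the lattice structure stays power-complementary for any choice of angles, the Python sketch below expands the cascade of rotations and delays in (7.39) into polynomial coefficients and evaluates the left-hand side of (7.21) one lag at a time. The function names `lattice_pair` and `power_sum` are hypothetical, the pair of output polynomials is written $P(z)$ and $Q(z)$ for brevity, and the scale factor $\alpha$ is taken to be 1.

```python
import math

def lattice_pair(thetas):
    """Coefficients (in powers of z^-1) of the two entries of (7.39)."""
    p = [math.cos(thetas[0])]  # upper entry of R(theta_0)
    q = [math.sin(thetas[0])]  # lower entry of R(theta_0)
    for th in thetas[1:]:
        p = p + [0.0]          # D(z): upper branch passes through
        q = [0.0] + q          # D(z): lower branch is delayed by z^-1
        c, s = math.cos(th), math.sin(th)
        # G(theta): both outputs are computed from the pre-rotation p and q.
        p, q = ([c * a + s * b for a, b in zip(p, q)],
                [-s * a + c * b for a, b in zip(p, q)])
    return p, q

def power_sum(p, q, lag):
    """Coefficient of z^lag in P~(z)P(z) + Q~(z)Q(z)."""
    return sum(p[i] * p[i + lag] + q[i] * q[i + lag]
               for i in range(len(p) - lag))
```

With $m$ angles, each of the two polynomials has $m$ coefficients. The lag-0 term equals 1 and all other lags vanish, whatever the angles are, which is why the angles can be handed to an unconstrained optimizer.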
7.4.2 Linear Phase

When the linear phase condition (7.11) is imposed, the number of free parameters above that are allowed to be varied for optimization will be further reduced. To see this, let us first represent the linear phase condition (7.11) in the Z-transform domain:

$$\begin{aligned} H(z) &= \sum_{n=0}^{N-1} h(n)z^{-n} \\ &= z^{-(N-1)} \sum_{m=0}^{N-1} h(N-1-m)z^{m} \\ &= z^{-(N-1)} \sum_{m=0}^{N-1} h(m)z^{m} \\ &= z^{-(N-1)}\tilde{H}(z), \end{aligned} \tag{7.41}$$

where a variable change of $n = N-1-m$ is used to arrive at the second equation, the linear phase condition (7.11) is used for the third equation, and the assumption that $h(n)$ are real-valued is used for the fourth equation. The type-I polyphase representation of the prototype filter with respect to $2M$ is

$$H(z) = \sum_{k=0}^{2M-1} z^{-k}P_k\left(z^{2M}\right). \tag{7.42}$$

Substituting it into (7.41), we obtain

$$\begin{aligned} \tilde{H}(z) &= \sum_{k=0}^{2M-1} z^{N-1-k}P_k\left(z^{2M}\right) \\ &\overset{k \to 2M-1-k}{=} \sum_{k=0}^{2M-1} z^{N+k-2M}P_{2M-1-k}\left(z^{2M}\right) \\ &\overset{N = 2mM}{=} \sum_{k=0}^{2M-1} z^{k}z^{2M(m-1)}P_{2M-1-k}\left(z^{2M}\right). \end{aligned} \tag{7.43}$$

From (7.42) we also have

$$\tilde{H}(z) = \sum_{k=0}^{2M-1} z^{k}\tilde{P}_k\left(z^{2M}\right). \tag{7.44}$$

Comparing the last two equations, we have

$$\tilde{P}_k\left(z^{2M}\right) = z^{2M(m-1)}P_{2M-1-k}\left(z^{2M}\right) \tag{7.45}$$

or

$$\tilde{P}_k(z) = z^{m-1}P_{2M-1-k}(z). \tag{7.46}$$

Therefore, half of the $2M$ polyphase components are completely determined by the other half due to the linear phase condition.
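In the time domain, (7.46) says that the coefficients of $P_{2M-1-k}(z)$ are those of $P_k(z)$ in reverse order. The Python sketch below illustrates this for a made-up symmetric filter; the function name `polyphase` and the filter values are assumptions chosen only for the illustration.

```python
def polyphase(h, M):
    """Type-I polyphase components of h with respect to 2M, as in (7.42)."""
    return [h[k::2 * M] for k in range(2 * M)]

M, m = 3, 2
N = 2 * m * M                                        # length constraint (7.20)
half = [0.1 * (n + 1) ** 2 for n in range(N // 2)]   # arbitrary first half
h = half + half[::-1]                                # enforce h(n) = h(N-1-n)
P = polyphase(h, M)
```

Each `P[k]` holds the $m$ coefficients of $P_k(z)$, and `P[2*M - 1 - k]` is `P[k]` reversed, so only half of the components need to be stored or optimized.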
7.4.3 Free Optimization Parameters

Now there are two sets of constraints on the prototype filter coefficients: the power-complementary condition ties polyphase components $P_k(z)$ and $P_{M+k}(z)$ using the lattice structure, while the linear phase condition ties $P_k(z)$ and $P_{2M-1-k}(z)$ using (7.46). Intuitively, this should leave us with roughly one quarter of the polyphase components that can be freely optimized.

7.4.3.1 Even M

Let us first consider the case that $M$ is even, that is, $M/2$ is an integer. Each pair of the following two sets of polyphase components:

$$P_0(z), P_1(z), \ldots, P_{M/2-1}(z) \tag{7.47}$$

and

$$P_M(z), P_{M+1}(z), \ldots, P_{3M/2-1}(z) \tag{7.48}$$

can be used to form the $2 \times 1$ system in (7.27), which in turn can be represented by the lattice structure in (7.39) with a parameter set given in (7.40). Since each parameter set has $m$ free parameters, the total number of free parameters is $mM/2$. The remaining polyphase components can be derived from the two sets above using the linear phase condition (7.46). In particular, the set of polyphase components in (7.47) determines the following set:

$$P_{2M-1}(z), P_{2M-2}(z), \ldots, P_{3M/2}(z) \tag{7.49}$$

and (7.48) determines

$$P_{M-1}(z), P_{M-2}(z), \ldots, P_{M/2}(z), \tag{7.50}$$

respectively.
7.4.3.2 Odd M

When $M$ is odd, that is, $(M-1)/2$ is an integer, the scheme above is still valid, except for $P_{(M-1)/2}$ and $P_{M+(M-1)/2}$. In particular, each pair of the following two sets of polyphase components:

$$P_0(z), P_1(z), \ldots, P_{(M-1)/2-1}(z) \tag{7.51}$$

and

$$P_M(z), P_{M+1}(z), \ldots, P_{(3M-1)/2-1}(z) \tag{7.52}$$

can be used to form the $2 \times 1$ system in (7.27), which in turn can be represented by the lattice structure in (7.39) with a parameter set given in (7.40). Since each parameter set has $m$ free parameters, the total number of free parameters is $m(M-1)/2$. The linear phase condition (7.46) in turn causes (7.51) to determine

$$P_{2M-1}(z), P_{2M-2}(z), \ldots, P_{(3M+1)/2}(z) \tag{7.53}$$

and (7.52) to determine

$$P_{M-1}(z), P_{M-2}(z), \ldots, P_{(M+1)/2}(z), \tag{7.54}$$

respectively. Both $P_{(M-1)/2}$ and $P_{M+(M-1)/2}$ are missing from the lists above. The underlying reason is that both the power-complementary and linear-phase conditions apply to them simultaneously. In particular, the linear phase condition (7.46) requires

$$\tilde{P}_{(M-1)/2}(z) = z^{m-1}P_{(3M-1)/2}(z). \tag{7.55}$$

This causes the power-complementary condition (7.21) to become

$$2\tilde{P}_{(M-1)/2}(z)P_{(M-1)/2}(z) = \alpha, \tag{7.56}$$

which leads to

$$P_{(M-1)/2}(z) = \sqrt{0.5\alpha}\,z^{-\delta}, \tag{7.57}$$

where $\delta$ is a delay. Since $H(z)$ is low-pass with a cutoff frequency of $\pi/2M$, it can be shown that the only acceptable choice of $\delta$ is [93]

$$\delta = \begin{cases} \dfrac{m-1}{2}, & \text{if } m \text{ is odd}; \\[1ex] \dfrac{m}{2}, & \text{if } m \text{ is even}. \end{cases} \tag{7.58}$$

Since $\alpha$ is a scale factor for the whole system, $P_{(M-1)/2}$ is now completely determined. In addition, $P_{(3M-1)/2}(z)$ can be derived from it using (7.55):

$$P_{(3M-1)/2}(z) = z^{-(m-1)}\tilde{P}_{(M-1)/2}(z) = \sqrt{0.5\alpha}\,z^{-(m-1-\delta)}. \tag{7.59}$$
7.5 Efficient Implementation

Direct implementation of either the analysis filter bank (7.13) or the synthesis filter bank (7.15) requires operations on the order of $MN$. This can be significantly reduced to the order of $N + M\log_2 M$ by utilizing polyphase representation and cosine modulation. The actual implementation structures vary slightly, depending on whether $m$ is even or odd, and both cases are presented in Sects. 7.5.1 and 7.5.2 [93].
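7.5.1 Even m

When $m$ is even, the analysis filter bank, represented by the following vector:

$$\mathbf{a}(z) = \begin{bmatrix} A_0(z) \\ A_1(z) \\ \vdots \\ A_{M-1}(z) \end{bmatrix}, \tag{7.60}$$

may be written as [93]

$$\mathbf{a}(z) = \sqrt{M}\,\mathbf{D}_c\mathbf{C} \begin{bmatrix} \mathbf{I}-\mathbf{J} & -\mathbf{I}-\mathbf{J} \end{bmatrix} \begin{bmatrix} \mathbf{P}_0(z^{2M}) & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_1(z^{2M}) \end{bmatrix} \begin{bmatrix} \mathbf{d}(z) \\ z^{-M}\mathbf{d}(z) \end{bmatrix}, \tag{7.61}$$

where

$$[\mathbf{D}_c]_{kk} = \cos[(k+0.5)m\pi], \quad k = 0, 1, \ldots, M-1, \tag{7.62}$$

is an $M \times M$ diagonal matrix, $\mathbf{C}$ is the DCT-IV matrix given in (5.79),

$$\mathbf{J} = \begin{bmatrix} 0 & \cdots & 0 & 0 & 1 \\ 0 & \cdots & 0 & 1 & 0 \\ \vdots & & \vdots & \vdots & \vdots \\ 0 & 1 & \cdots & 0 & 0 \\ 1 & 0 & \cdots & 0 & 0 \end{bmatrix} \tag{7.63}$$

is the reversal or anti-diagonal matrix,

$$\left[\mathbf{P}_0\left(z^{2M}\right)\right]_{kk} = P_k\left(z^{2M}\right), \quad k = 0, 1, \ldots, M-1, \tag{7.64}$$

is the diagonal matrix consisting of the initial $M$ polyphase components of the prototype filter $H(z)$, and

$$\left[\mathbf{P}_1\left(z^{2M}\right)\right]_{kk} = P_{M+k}\left(z^{2M}\right), \quad k = 0, 1, \ldots, M-1, \tag{7.65}$$

is a diagonal matrix consisting of the last $M$ polyphase components of the prototype filter $H(z)$. For an even $m$, (7.62) becomes

$$[\mathbf{D}_c]_{kk} = \cos[0.5m\pi] = (-1)^{0.5m}, \quad k = 0, 1, \ldots, M-1, \tag{7.66}$$

so

$$\mathbf{D}_c = (-1)^{0.5m}\mathbf{I}. \tag{7.67}$$

Therefore, the analysis bank may be written as

$$\mathbf{a}(z) = (-1)^{0.5m}\sqrt{M}\,\mathbf{C} \begin{bmatrix} \mathbf{I}-\mathbf{J} & -\mathbf{I}-\mathbf{J} \end{bmatrix} \begin{bmatrix} \mathbf{P}_0(z^{2M}) & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_1(z^{2M}) \end{bmatrix} \begin{bmatrix} \mathbf{d}(z) \\ z^{-M}\mathbf{d}(z) \end{bmatrix}. \tag{7.68}$$

The filter bank above may be implemented using the structure shown in Fig. 7.10. The major computational burdens are the polyphase filtering, which entails operations on the order of $N$, and the $M \times M$ DCT-IV, which needs operations on the order of $M\log_2 M$ when implemented using a fast algorithm, so the total number of operations is on the order of $N + M\log_2 M$.

Fig. 7.10 Cosine modulated analysis filter bank implemented using DCT-IV. The prototype filter has a length of $N = 2mM$ with an even $m$. $P_k(z^{2M})$ is the $k$th polyphase component of the prototype filter with respect to $2M$. A scale factor of $(-1)^{0.5m}\sqrt{M}$ is omitted

The synthesis filter bank is obtained from the analysis bank using (7.12), which becomes

$$S_k(z) = z^{-(N-1)}\tilde{A}_k(z) \tag{7.69}$$

due to (7.41). Denoting

$$\mathbf{s}^T(z) = [S_0(z), S_1(z), \ldots, S_{M-1}(z)], \tag{7.70}$$

the equation above becomes

$$\mathbf{s}^T(z) = z^{-(N-1)}\tilde{\mathbf{a}}(z). \tag{7.71}$$

Substituting (7.68) and using (7.20), the equation above becomes

$$\begin{aligned} \mathbf{s}^T(z) &= z^{-(2mM-1)} \begin{bmatrix} \tilde{\mathbf{d}}(z) & z^{M}\tilde{\mathbf{d}}(z) \end{bmatrix} \begin{bmatrix} \tilde{\mathbf{P}}_0\left(z^{2M}\right) & \mathbf{0} \\ \mathbf{0} & \tilde{\mathbf{P}}_1\left(z^{2M}\right) \end{bmatrix} \begin{bmatrix} \mathbf{I}-\mathbf{J} \\ -\mathbf{I}-\mathbf{J} \end{bmatrix} \mathbf{C}\sqrt{M}(-1)^{0.5m} \\ &= z^{-2M+1} \begin{bmatrix} \tilde{\mathbf{d}}(z) & z^{M}\tilde{\mathbf{d}}(z) \end{bmatrix} z^{-2M(m-1)} \begin{bmatrix} \tilde{\mathbf{P}}_0\left(z^{2M}\right) & \mathbf{0} \\ \mathbf{0} & \tilde{\mathbf{P}}_1\left(z^{2M}\right) \end{bmatrix} \begin{bmatrix} \mathbf{I}-\mathbf{J} \\ -\mathbf{I}-\mathbf{J} \end{bmatrix} \mathbf{C}\sqrt{M}(-1)^{0.5m}. \end{aligned} \tag{7.72}$$

Due to (7.46), we have

$$\begin{aligned} z^{-2M(m-1)} \begin{bmatrix} \tilde{\mathbf{P}}_0\left(z^{2M}\right) & \mathbf{0} \\ \mathbf{0} & \tilde{\mathbf{P}}_1\left(z^{2M}\right) \end{bmatrix} &= \operatorname{diag}\left\{P_{2M-1}\left(z^{2M}\right), \ldots, P_{M}\left(z^{2M}\right), P_{M-1}\left(z^{2M}\right), \ldots, P_{0}\left(z^{2M}\right)\right\} \\ &= \begin{bmatrix} \mathbf{J}\mathbf{P}_1\left(z^{2M}\right)\mathbf{J} & \mathbf{0} \\ \mathbf{0} & \mathbf{J}\mathbf{P}_0\left(z^{2M}\right)\mathbf{J} \end{bmatrix}, \end{aligned} \tag{7.73}$$

where the last step uses the following property of the reversal matrix:

$$\mathbf{J}\operatorname{diag}\{x_1, x_2, \ldots, x_{M-1}\}\mathbf{J} = \operatorname{diag}\{x_{M-1}, \ldots, x_2, x_1\}. \tag{7.74}$$

Therefore, the synthesis bank becomes

$$\mathbf{s}^T(z) = \begin{bmatrix} z^{-M}z^{-M+1}\tilde{\mathbf{d}}(z) & z^{-M+1}\tilde{\mathbf{d}}(z) \end{bmatrix} \begin{bmatrix} \mathbf{J}\mathbf{P}_1\left(z^{2M}\right)\mathbf{J} & \mathbf{0} \\ \mathbf{0} & \mathbf{J}\mathbf{P}_0\left(z^{2M}\right)\mathbf{J} \end{bmatrix} \begin{bmatrix} \mathbf{I}-\mathbf{J} \\ -\mathbf{I}-\mathbf{J} \end{bmatrix} \mathbf{C}\sqrt{M}(-1)^{0.5m}, \tag{7.75}$$

which may be implemented by the structure in Fig. 7.11.

Fig. 7.11 Cosine modulated synthesis filter bank implemented using DCT-IV. The prototype filter has a length of $N = 2mM$ with an even $m$. $P_k(z^{2M})$ is the $k$th polyphase component of the prototype filter with respect to $2M$. A scale factor of $(-1)^{0.5m}\sqrt{M}$ is omitted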
7.5.2 Odd m

When $m$ is odd, both the analysis and synthesis banks are essentially the same as when $m$ is even, with only minor differences. In particular, the analysis bank is given by [93]

$$\mathbf{a}(z) = \sqrt{M}\,\mathbf{D}_s\mathbf{C} \begin{bmatrix} \mathbf{I}+\mathbf{J} & \mathbf{I}-\mathbf{J} \end{bmatrix} \begin{bmatrix} \mathbf{P}_0(z^{2M}) & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_1(z^{2M}) \end{bmatrix} \begin{bmatrix} \mathbf{d}(z) \\ z^{-M}\mathbf{d}(z) \end{bmatrix}, \tag{7.76}$$

where

$$[\mathbf{D}_s]_{kk} = \sin[(k+0.5)m\pi] = (-1)^{\frac{m-1}{2}}(-1)^k, \quad k = 0, 1, \ldots, M-1, \tag{7.77}$$

is a diagonal matrix with alternating $1$ and $-1$ on the diagonal. Since this matrix only changes the signs of alternating subband samples and these sign changes are reversed upon input to the synthesis bank, the implementation of this matrix can be omitted in both the analysis and synthesis banks. Similar to the case with even $m$, the corresponding synthesis filter bank may be obtained as

$$\mathbf{s}^T(z) = \begin{bmatrix} z^{-M}z^{-M+1}\tilde{\mathbf{d}}(z) & z^{-M+1}\tilde{\mathbf{d}}(z) \end{bmatrix} \begin{bmatrix} \mathbf{J}\mathbf{P}_1(z^{2M})\mathbf{J} & \mathbf{0} \\ \mathbf{0} & \mathbf{J}\mathbf{P}_0(z^{2M})\mathbf{J} \end{bmatrix} \begin{bmatrix} \mathbf{I}+\mathbf{J} \\ \mathbf{I}-\mathbf{J} \end{bmatrix} \mathbf{C}\mathbf{D}_s\sqrt{M}. \tag{7.78}$$

The analysis and synthesis banks can be implemented by the structures shown in Figs. 7.12 and 7.13, respectively.

Fig. 7.12 Cosine modulated analysis filter bank implemented using DCT-IV. The prototype filter has a length of $N = 2mM$ with an odd $m$. $P_k(z^{2M})$ is the $k$th polyphase component of the prototype filter with respect to $2M$. The diagonal matrix $\mathbf{D}_s$ only negates alternating subbands, hence can be omitted together with that in the synthesis bank. A scale factor of $\sqrt{M}$ is omitted

Fig. 7.13 Cosine modulated synthesis filter bank implemented using DCT-IV. The prototype filter has a length of $N = 2mM$ with an odd $m$. $P_k(z^{2M})$ is the $k$th polyphase component of the prototype filter with respect to $2M$. The diagonal matrix $\mathbf{D}_s$ only negates alternating subbands, hence can be omitted together with that in the analysis bank. A scale factor of $\sqrt{M}$ is omitted
7.6 Modified Discrete Cosine Transform

The modified discrete cosine transform (MDCT) is a special case of the CMFB when $m = 1$. It deserves special discussion here because of its wide application in audio coding. The first PR CMFB is the time-domain aliasing cancellation (TDAC) filter bank, which was obtained from the $2M$-DFT discussed in Sect. 7.1.2 without the $\pi/2M$ frequency shifting (even channel stacking) [77]. Referred to as evenly stacked TDAC, it has $M-1$ full-bandwidth channels and 2 half-bandwidth channels, for a total of $M+1$ channels. This issue was later addressed using the $\pi/2M$ frequency shifting, which is called odd channel stacking, and the resultant filter bank is called oddly stacked TDAC [78].
7.6.1 Window Function

With $m = 1$, the polyphase components of the prototype filter with respect to $2M$ become the coefficients of the filter itself:

$$P_k(z) = h(k), \quad k = 0, 1, \ldots, 2M-1, \tag{7.79}$$

which makes it intuitive to understand the filter bank. For example, the power-complementary condition (7.21) becomes

$$h^2(n) + h^2(M+n) = \alpha, \quad n = 0, 1, \ldots, M-1. \tag{7.80}$$

Also, the polyphase filtering stage in both Figs. 7.12 and 7.13 becomes simply applying the prototype filter coefficients, so the prototype filter is often referred to as the window function or simply the window. Since the block size is $M$ and the window size is $2M$, there is a half-window or one-block overlap between blocks, as shown in Fig. 7.15. The window function can be designed using the procedure discussed in Sect. 7.4. Due to $m = 1$, the lattice structure (7.39) degenerates into the rotation vector (7.35), so the design problem is much simpler. A window widely used in audio coding is the following half-sine window or simply sine window:

$$h(n) = \pm\sin\left[(n+0.5)\frac{\pi}{2M}\right]. \tag{7.81}$$

It satisfies the power-complementary condition (7.80) because

$$\begin{aligned} h^2(n) + h^2(M+n) &= \sin^2\left[(n+0.5)\frac{\pi}{2M}\right] + \sin^2\left[\frac{\pi}{2} + (n+0.5)\frac{\pi}{2M}\right] \\ &= \sin^2\left[(n+0.5)\frac{\pi}{2M}\right] + \cos^2\left[(n+0.5)\frac{\pi}{2M}\right] \\ &= 1. \end{aligned} \tag{7.82}$$

It is a unique window that allows perfect DC reconstruction using only the low-pass subband, i.e., subband zero [45, 49]. This was shown to be a necessary condition for maximum asymptotic coding gain for an AR(1) signal with the correlation coefficient approaching the value of one.
7.6.2 MDCT

The widely used MDCT is actually given by the following synthesis filter:

$$s_k(n) = 2h(n)\cos\left[\frac{\pi}{M}(k+0.5)\left(n + \frac{M+1}{2}\right)\right], \quad k = 0, 1, \ldots, M-1. \tag{7.83}$$

It is obtained from (7.15) using the following phase:

$$\theta_k = -(k+0.5)(2m+1)\frac{\pi}{2}. \tag{7.84}$$

It differs from the $\theta_k$ in (7.14) for some $k$ by $\pi$, as shown in Table 7.1. A phase difference of $\pi$ causes the cosine function to switch its sign, so some of the analysis and synthesis filters will have their signs negated when compared with those given by (7.14). This is equivalent to using a different $\mathbf{D}_s$ in the analysis and synthesis banks in Sect. 7.5.2. As stated there, this kind of sign change is insignificant as long as it is complemented in both analysis and synthesis banks.

Table 7.1 Phase difference between CMFB and MDCT. Its impact on the subband filters is a sign change when the phase difference is $\pi$

k        theta_k in (7.14)   theta_k in (7.84)   Difference
4n + 0   pi/4                pi/4 + pi           pi
4n + 1   -pi/4               -pi/4               0
4n + 2   pi/4                pi/4                0
4n + 3   -pi/4               -pi/4 + pi          pi

Fig. 7.14 Implementation of forward MDCT as application of window function and then calculation of DCT-IV. The block C represents the DCT-IV matrix
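The phase comparison can be reproduced numerically. The Python sketch below assumes the CMFB phase $\theta_k = (-1)^k\pi/4$ of (7.14) and takes the MDCT phase as $\theta_k = -(k+0.5)(2m+1)\pi/2$ with $m = 1$, which is the value obtained by matching (7.83) against (7.15) with $N = 2M$; the function names are hypothetical.

```python
import math

def theta_cmfb(k):
    """CMFB phase, (-1)^k * pi/4."""
    return (-1) ** k * math.pi / 4

def theta_mdct(k, m=1):
    """MDCT phase, -(k + 0.5)(2m + 1) * pi/2 (assumed sign convention)."""
    return -(k + 0.5) * (2 * m + 1) * math.pi / 2

def differs_by_pi(k):
    """True if the two phases differ by an odd multiple of pi."""
    d = (theta_mdct(k) - theta_cmfb(k)) / math.pi
    return round(d) % 2 == 1

pattern = [differs_by_pi(k) for k in range(8)]
```

The result repeats with a period of four in $k$: subbands $4n$ and $4n+3$ differ by $\pi$ (a sign change), while subbands $4n+1$ and $4n+2$ have the same phase modulo $2\pi$.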
7.6.3 Efficient Implementation

An efficient implementation structure for the analysis bank which utilizes the linear phase condition (7.11) to represent the second half of the window function is given in [45]. To prepare for window switching, which is critical for coping with transients in audio signals (to be discussed in Chap. 11), we forgo this use of the linear phase condition and present the structure shown in Fig. 7.14, which uses the second half of the window function directly. In particular, the input to the DCT block may be expressed as

$$u_n = -x(M/2+n)\,h(3M/2+n) - x(M/2-1-n)\,h(3M/2-1-n) \tag{7.85}$$

and

$$u_{n+M/2} = x(n-M)\,h(n) - x(-1-n)\,h(M-1-n) \tag{7.86}$$

for $n = 0, 1, \ldots, M/2-1$, respectively. For both equations above, the current block is considered as consisting of samples $x(0), x(1), \ldots, x(M-1)$ and the past block as $x(-M), x(-M+1), \ldots, x(-1)$, which essentially amounts to a delay line. The first half of the inputs to the DCT-IV block, namely $u_n$ in (7.85), is obtained by applying the second half of the window function to the current block of input data. The second half, namely $u_{n+M/2}$ in (7.86), is calculated by applying the first half of the window function to the previous block of data. This constitutes an overlap with the previous block of data. Therefore, the implementation of MDCT may be considered as application of an overlapping window function and then calculation of DCT-IV, as shown in Fig. 7.15.

Fig. 7.15 Implementation of MDCT as application of an overlapping window function and then calculation of DCT-IV

The following is the Matlab code for implementing MDCT:

```matlab
function [y] = mdct(x, n0, M, h)
%
% [y] = mdct(x, n0, M, h)
%
% x:  Input array. M samples before n0 are considered as the
%     delay line
% n0: Start of a block of new data
% M:  Block size
% h:  Window function
% y:  MDCT coefficients
%
% Here is an example for generating the sine window
%   n = 0:(2*M-1);
%   h = sin((n+0.5)*0.5*pi/M);
%
% Convert to DCT4
u = zeros(1, M);
for n=0:(M/2-1)
    u(n+1) = - x(n0+M/2+n+1)*h(3*M/2+n+1) - x(n0+M/2-1-n+1)*h(3*M/2-1-n+1);
    u(n+M/2+1) = x(n0+n-M+1)*h(n+1) - x(n0-1-n+1)*h(M-1-n+1);
end
%
% DCT4, you can use any DCT4 subroutines
y = dct4(u);
```
The inverse of Fig. 7.14 is shown in Fig. 7.16, which does not utilize the symmetry of the window function imposed by the linear phase condition. The output of the synthesis bank may be expressed as

$$x(n) = xd(n) + u_{M/2+n}\,h(n), \quad xd(n) = -u_{M/2-1-n}\,h(M+n), \quad \text{for } n = 0, 1, \ldots, M/2-1; \tag{7.87}$$

and

$$x(n) = xd(n) - u_{3M/2-1-n}\,h(n), \quad xd(n) = -u_{-M/2+n}\,h(M+n), \quad \text{for } n = M/2, M/2+1, \ldots, M-1; \tag{7.88}$$

where $xd(n)$ is the delay line with a length of $M$ samples.

Fig. 7.16 Implementation of backward MDCT as calculation of DCT-IV and then application of window function. The block C represents the DCT-IV matrix

The following is the Matlab code for implementing inverse MDCT:

```matlab
function [x, xd] = imdct(y, xd, M, h)
%
% [x, xd] = imdct(y, xd, M, h)
%
% y:  MDCT coefficients
% xd: Delay line
% M:  Block size
% h:  Window function
% x:  Reconstructed samples
%
% Here is an example for generating the sine window
%   n = 0:(2*M-1);
%   h = sin((n+0.5)*0.5*pi/M);
%
% DCT4
u = dct4(y);
%
x = zeros(1, M);
for n=0:(M/2-1)
    x(n+1) = xd(n+1) + u(M/2+n+1)*h(n+1);
    xd(n+1) = -u(M/2-1-n+1)*h(M+n+1);
end
%
for n=(M/2):(M-1)
    x(n+1) = xd(n+1) - u(3*M/2-1-n+1)*h(n+1);
    xd(n+1) = -u(-M/2+n+1)*h(M+n+1);
end
```
If Figs. 7.14 and 7.16 were followed strictly, we could end up with half length for the delay lines at the cost of an increased complexity for swapping variables.
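The forward and backward structures can be checked end to end with a direct, unoptimized DCT-IV. The Python sketch below mirrors the windowing of (7.85) to (7.88). It assumes an orthonormal DCT-IV scaled by $\sqrt{2/M}$ (which is its own inverse), an even block size $M$, and the sine window (7.81); the function names, including `roundtrip`, are hypothetical.

```python
import math

def dct4(u):
    """Orthonormal DCT-IV computed directly in O(M^2); it is its own inverse."""
    M = len(u)
    scale = math.sqrt(2.0 / M)
    return [scale * sum(u[n] * math.cos(math.pi / M * (k + 0.5) * (n + 0.5))
                        for n in range(M))
            for k in range(M)]

def sine_window(M):
    """Sine window (7.81) of length 2M."""
    return [math.sin((n + 0.5) * 0.5 * math.pi / M) for n in range(2 * M)]

def mdct(past, cur, h):
    """Forward MDCT of one block, following (7.85) and (7.86)."""
    M = len(cur)
    half = M // 2
    u = [0.0] * M
    for n in range(half):
        u[n] = (-cur[half + n] * h[3 * half + n]
                - cur[half - 1 - n] * h[3 * half - 1 - n])
        u[n + half] = past[n] * h[n] - past[M - 1 - n] * h[M - 1 - n]
    return dct4(u)

def imdct(y, xd, h):
    """Backward MDCT, following (7.87) and (7.88); returns (x, new delay line)."""
    M = len(y)
    half = M // 2
    u = dct4(y)
    x, new_xd = [0.0] * M, [0.0] * M
    for n in range(half):
        x[n] = xd[n] + u[half + n] * h[n]
        new_xd[n] = -u[half - 1 - n] * h[M + n]
    for n in range(half, M):
        x[n] = xd[n] - u[3 * half - 1 - n] * h[n]
        new_xd[n] = -u[n - half] * h[M + n]
    return x, new_xd

def roundtrip(blocks, M):
    """Run analysis and synthesis block by block; both delay lines start at zero."""
    h = sine_window(M)
    past, xd, outputs = [0.0] * M, [0.0] * M, []
    for cur in blocks:
        x, xd = imdct(mdct(past, cur, h), xd, h)
        outputs.append(x)
        past = cur
    return outputs
```

Under these assumptions the output of each call reproduces the previous input block, so the analysis and synthesis pair acts as a pure delay of one block of $M$ samples.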
Part IV
Entropy Coding
After the removal of perceptual irrelevancy using quantization aided by data modeling, we are left with a set of quantization indexes. They can be directly packed into a bit stream for transmission to the decoder. However, they can be further compressed through the removal of statistical redundancy via entropy coding. The basic idea of entropy coding is to represent more probable quantization indexes with shorter codewords and less probable ones with longer codewords so as to achieve a shorter average codeword length. In this way, the set of quantization indexes can be represented with fewer bits. The theoretical minimum of the average codeword length for a particular quantizer is its entropy, which is a function of the probability distribution of the quantization indexes. The practical minimum of the average codeword length achievable by all practical codebooks is usually higher than the entropy. The codebook that delivers such a practical minimum is called the optimal codebook. However, this practical minimum can be made to approach the entropy if quantization indexes are grouped into blocks, each block is coded as one block symbol, and the block size is allowed to be arbitrarily large. Huffman's algorithm is an iterative procedure that always produces an optimal entropy codebook.
Chapter 8
Entropy and Coding
Let us consider a 2-bit quantizer that represents quantized values using the following set of quantization indexes:

$$\{0, 1, 2, 3\}. \tag{8.1}$$

Each quantization index given above is called a source symbol, or simply a symbol, and the set is called a symbol set. When applied to quantize a sequence of input samples, the quantizer produces a sequence of quantization indexes, such as the following:

$$1, 2, 1, 0, 1, 2, 1, 2, 1, 0, 1, 2, 2, 1, 2, 1, 2, 3, 2, 1, 2, 1, 1, 2, 1, 0, 1, 2, 1, 2. \tag{8.2}$$

Called a source sequence, it needs to be represented by or converted to a sequence of codewords or codes that are suitable for transmission over a variety of channels. The primary concern is that the average codeword length is minimized so that the transmission of the source sequence demands a lower bit rate. An intuitive approach to this coding problem is to use a binary numeral system to represent the symbol set. This may lead to the codebook in Table 8.1. Each codeword in this codebook is of fixed length, namely 2 bits, so the codebook is referred to as a fixed-length codebook or fixed-length code. Coding each symbol in the source sequence (8.2) using the fixed-length code in Table 8.1 takes two bits, amounting to a total of $2 \times 30 = 60$ bits to code the entire 30 symbols in source sequence (8.2). The average codeword length is

$$L = \frac{60}{30} = 2 \text{ bits/symbol}, \tag{8.3}$$

which is the codeword length of the fixed-length codebook and is independent of the frequency with which each symbol occurs in the source sequence (8.2). Since the symbols appear in source sequence (8.2) with clearly different frequencies or probabilities, the average codeword length would be reduced if a short codeword is assigned to a symbol with high probability and a long one to a symbol with low probability. This strategy leads to variable-length codebooks or simply variable-length codes.
Y. You, Audio Coding: Theory and Applications, DOI 10.1007/978-1-4419-1754-6_8, © Springer Science+Business Media, LLC 2010
Table 8.1 Fixed-length codebook for the source sequence in (8.2)

Symbol   Codeword   Codeword length (bits)
0        00         2
1        01         2
2        10         2
3        11         2

Table 8.2 Variable-length unary code for the example sequence in (8.2). Shorter codewords are assigned to more frequently occurring symbols and longer codewords to less frequently occurring ones

Symbol   Codeword   Frequency   Codeword length (bits)
0        001        3           3
1        1          14          1
2        01         12          2
3        0001       1           4
As an example, the first two columns of Table 8.2 form such a variable-length codebook, built using the unary numeral system. Assuming that the sequence is iid and that the number of times a symbol appears in the sequence accurately reflects its frequency of occurrence, the probability distribution may be estimated as

$$p(0) = \frac{3}{30}, \quad p(1) = \frac{14}{30}, \quad p(2) = \frac{12}{30}, \quad p(3) = \frac{1}{30}. \tag{8.4}$$

With this unary codebook, the most frequently occurring symbol '1' is coded with one bit and the least frequently occurring symbol '3' is coded with four bits. The average codeword length is

$$L = \frac{3}{30} \times 3 + \frac{14}{30} \times 1 + \frac{12}{30} \times 2 + \frac{1}{30} \times 4 = 1.7 \text{ bits},$$

which is 0.3 bits better than the fixed-length code in Table 8.1.
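The figures above can be reproduced by counting symbols in the source sequence (8.2) and averaging the codeword lengths of Tables 8.1 and 8.2. The Python sketch below is only a bookkeeping aid; the variable and function names are chosen for this illustration.

```python
from collections import Counter

# Source sequence (8.2).
sequence = [1, 2, 1, 0, 1, 2, 1, 2, 1, 0, 1, 2, 2, 1, 2,
            1, 2, 3, 2, 1, 2, 1, 1, 2, 1, 0, 1, 2, 1, 2]

fixed = {0: "00", 1: "01", 2: "10", 3: "11"}      # Table 8.1
unary = {0: "001", 1: "1", 2: "01", 3: "0001"}    # Table 8.2

def average_length(seq, codebook):
    """Average number of bits per symbol when seq is coded with codebook."""
    return sum(len(codebook[s]) for s in seq) / len(seq)

counts = Counter(sequence)   # occurrences of each symbol
```

Here `counts` gives the frequency column of Table 8.2, and `average_length` evaluates to 2 bits for the fixed-length code and 1.7 bits for the unary code.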
8.1 Entropy Coding

To cast the problem above of coding a source sequence to reduce average codeword length in mathematical terms, let us consider an information source that emits a sequence of messages or source symbols

$$X(1), X(2), \ldots \tag{8.5}$$

by drawing from a symbol set or alphabet of

$$S = \{s_0, s_1, \ldots, s_{M-1}\} \tag{8.6}$$
consisting of $M$ source symbols. The symbols in the source sequence (8.5) are often assumed to be independently and identically distributed (iid) random variables with a probability distribution of

$$p(s_m) = p_m, \quad m = 0, 1, \ldots, M-1. \tag{8.7}$$

The task is to convert this sequence into a compact sequence of codewords

$$Y(1), Y(2), \ldots \tag{8.8}$$

drawn from a set of codewords

$$C = \{c_0, c_1, \ldots, c_{M-1}\}, \tag{8.9}$$

called a codebook or simply code. The goal is to find a codebook that minimizes the average codeword length without loss of information. The codewords are often represented using binary numerals, in which case the resultant sequence of codewords, called a codeword sequence or simply a code sequence, is referred to as a bit stream. While other radices of representation, such as hexadecimal, can also be used, the binary radix is assumed in this book without loss of generality. This coding process has to be "lossless" in the sense that the complete source sequence can be recovered or decoded from the received codeword sequence without any error or loss of information, so it is called lossless compression coding or simply lossless coding. In what amounts to the symbol code approach to lossless compression coding, a one-to-one mapping:

$$s_m \longleftrightarrow c_m, \quad \text{for } m = 0, 1, \ldots, M-1, \tag{8.10}$$

is established between each source symbol $s_m$ in the symbol set and a codeword $c_m$ in the codebook and then deployed to encode the source sequence or decode the codeword sequence symbol by symbol. The codebooks in Tables 8.1 and 8.2 are symbol codes. Let $l(c_m)$ denote the codeword length of codeword $c_m$ in codebook (8.9); then the average codeword length per source symbol of the code sequence (8.8) is

$$L = \sum_{m=0}^{M-1} p(s_m)\,l(c_m). \tag{8.11}$$

Due to the symbol code mapping (8.10), the equation above becomes

$$L = \sum_{m=0}^{M-1} p(c_m)\,l(c_m), \tag{8.12}$$

which is the average codeword length of the codebook (8.9).
Apparently, any symbol set, and consequently any source sequence, can be represented or coded using a binary numeral system with

$$L = \operatorname{ceil}[\log_2(M)] \ \text{bits}, \tag{8.13}$$

where the function $\operatorname{ceil}(x)$ returns the smallest integer no less than $x$. This results in a fixed-length codebook or fixed-length code in which each codeword is coded with an $L$-bit binary numeral, or is said to have a codeword length of $L$ bits. This fixed-length binary codebook is considered the baseline code for an information source and is used by PCM in (2.30).

The performance of a variable-length code may be assessed by the compression ratio

$$R = \frac{L_0}{L}, \tag{8.14}$$

where $L_0$ and $L$ are the average codeword lengths of the fixed-length code and the variable-length code, respectively. For the example unary code in Table 8.2, the compression ratio is

$$R = \frac{2}{1.7} \approx 1.176.$$
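The calculation in (8.11), (8.13) and (8.14) can be sketched in a few lines of Python. The symbol counts 3, 14, 12 and 1 are those of the 30-symbol example sequence; the unary codewords below are an assumption (zeros terminated by a single '1'), chosen only so that their lengths reproduce the 1.7-bit average quoted above. The function and variable names are illustrative.

```python
from math import ceil, log2

# Symbol counts of the 30-symbol example sequence (symbols 0, 1, 2, 3).
counts = {0: 3, 1: 14, 2: 12, 3: 1}

# Assumed unary codewords: a run of zeros terminated by one '1'.
# Only the codeword lengths matter for this calculation.
unary = {1: "1", 2: "01", 0: "001", 3: "0001"}


def average_length(counts, codebook):
    """Average codeword length per source symbol, as in (8.11)."""
    total = sum(counts.values())
    return sum(n / total * len(codebook[s]) for s, n in counts.items())


L0 = ceil(log2(len(counts)))       # fixed-length baseline, as in (8.13)
L = average_length(counts, unary)  # variable-length (unary) code
R = L0 / L                         # compression ratio, as in (8.14)

print(f"L0 = {L0} bits, L = {L:.4f} bits, R = {R:.3f}")
```

Under these assumptions the script reports a 2-bit baseline, an average of 1.7 bits and a ratio of about 1.176.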
8.2 Entropy

In pursuit of a codebook that delivers an average codeword length as low as possible, it is critical to know if there exists a minimal average codeword length and, if it exists, what it is. Due to (8.12), the average codeword length is weighted by the probability distribution of the given information source, so it can be expected that the answer depends on this probability model. In fact, it was discovered by Claude E. Shannon, an electrical engineer at Bell Labs, in 1948 that this minimum is the entropy of the information source, which is solely determined by the probability distribution [85, 86].

8.2.1 Entropy

When a message $X$ from an information source is received and turns out to be symbol $s_m$, the associated self-information is

$$I(X = s_m) = -\log p(s_m). \tag{8.15}$$

The average information per symbol over all messages emitted by the information source obviously depends on the probability with which each symbol occurs and is thus given by

$$H(X) = -\sum_{m=0}^{M-1} p(s_m) \log p(s_m). \tag{8.16}$$
This is called entropy and is the minimal average codeword length for the given information source (to be proved later). The unit of entropy is determined by the logarithmic base. The bit, based on the binary logarithm ($\log_2$), is the most commonly used unit. Other units include the nat, based on the natural logarithm ($\log_e$), and the hartley, based on the common logarithm ($\log_{10}$). Due to

$$\log_a x = \frac{\log_2 x}{\log_2 a},$$

conversion between these units is simple and straightforward, so the binary logarithm ($\log_2$) is always assumed in this book unless stated otherwise.

The use of the logarithm as a measure of information makes sense intuitively. Let us first note that the function in (8.15) is a decreasing function of the probability $p(s_m)$ and equals zero when $p(s_m) = 1$. This means that

- A less likely event carries more information because the amount of surprise is larger. When the receiver knows that an event is sure to happen, i.e., $p(X = s_m) = 1$, before receiving the message $X$, the event of receiving $X$ to discover that $X = s_m$ carries no information at all, so the self-information (8.15) is zero.
- More bits need to be allocated to encode less likely events because they carry more information. This is consistent with our strategy for codeword assignment: assigning longer codewords to less frequently occurring symbols.

Ideally, the length of the codeword assigned to encode a symbol should be its self-information. If this were done, the average codeword length would be the same as the entropy. Partly because of this, variable-length coding is often referred to as entropy coding.

For another intuitive perspective on entropy, let us suppose that we received two messages (symbols) from the source: $X_i$ and $X_j$. Since the source is iid, we have $p(X_i, X_j) = p(X_i)\,p(X_j)$. Consequently, the self-information carried by the two messages is

$$I(X_i, X_j) = -\log p(X_i, X_j) = -\log p(X_i) - \log p(X_j) = I(X_i) + I(X_j). \tag{8.17}$$

This is exactly what we expect:

- The information for two messages should be the sum of the information that each message carries.
- The number of bits to code two messages should be the sum of the bits needed to code each message individually.

In addition to the intuitive perspectives outlined above, there are other considerations which ensure that the choice of the logarithm for entropy is not arbitrary. See [85, 86] or [83] for details.
As an example, let us calculate the entropy for the source sequence (8.2):

$$H(X) = -\frac{3}{30}\log_2\frac{3}{30} - \frac{12}{30}\log_2\frac{12}{30} - \frac{14}{30}\log_2\frac{14}{30} - \frac{1}{30}\log_2\frac{1}{30} \approx 1.5376 \ \text{bits}.$$

Compared with this value of entropy, the average codeword length of 1.7 bits achieved by the unary code in Table 8.2 is quite impressive.
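As a quick numerical check, the definitions (8.15) and (8.16) and the additivity property (8.17) can be evaluated directly. This is a minimal sketch; the function names are illustrative and the probabilities are the relative frequencies used in the example above.

```python
from math import log2


def self_information(p):
    """Self-information in bits of an event with probability p, as in (8.15)."""
    return -log2(p)


def entropy(probs):
    """Entropy in bits per symbol of a probability distribution, as in (8.16)."""
    return sum(p * self_information(p) for p in probs if p > 0)


# Relative frequencies from the worked example above.
probs = [3 / 30, 12 / 30, 14 / 30, 1 / 30]
H = entropy(probs)
print(f"H(X) = {H:.4f} bits")

# A certain event carries no information; a rarer event carries more.
print(self_information(1.0), self_information(0.5), self_information(0.25))

# Additivity (8.17) for two independent messages.
joint = self_information(0.1 * 0.4)
separate = self_information(0.1) + self_information(0.4)
print(abs(joint - separate) < 1e-12)
```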
8.2.2 Model Dependency

The definition of entropy in (8.16) assumes that messages emitted from an information source are iid, so the source can be completely characterized by the one-dimensional probability distribution. Since entropy is completely determined by this distribution, its value as the minimal average codeword length is only as good as the probability model, especially the iid assumption. Although simple, the iid assumption usually does not reflect the real probability structure of the source sequence. In fact, most information sources, and audio signals in particular, are strongly correlated. The violation of this iid assumption significantly skews the calculated entropy toward a value larger than the “real entropy”. This is shown by two examples below.

In the first example, we notice that each symbol in the example sequence (8.2) can be predicted from its predecessor using (4.1) to give the following residual sequence:

$$1, 1, -1, -1, 1, 1, -1, 1, -1, -1, 1, 1, 0, -1, 1, -1, 1, 1, -1, -1, 1, -1, 0, 1, -1, -1, 1, 1, -1, 1.$$

Now the alphabet is reduced to $\{-1, 0, 1\}$ with the following probabilities:

$$p(-1) = \frac{13}{30}, \quad p(0) = \frac{2}{30}, \quad p(1) = \frac{15}{30}.$$

Its entropy is

$$H(X) = -\frac{13}{30}\log_2\frac{13}{30} - \frac{2}{30}\log_2\frac{2}{30} - \frac{15}{30}\log_2\frac{15}{30} \approx 1.2833 \ \text{bits},$$
which is $1.5376 - 1.2833 = 0.2543$ bits less than the entropy achieved with the false iid assumption.

Another approach to exploiting the correlation in example sequence (8.2) is to borrow the idea from vector quantization and consider the symbol sequence as a sequence of vectors or block symbols:

(1, 2), (1, 0), (1, 2), (1, 2), (1, 0), (1, 2), (2, 1), (2, 1), (2, 3), (2, 1), (2, 1), (1, 2), (1, 0), (1, 2), (1, 2)

with an alphabet of

{(2, 3), (1, 0), (2, 1), (1, 2)}.

The probabilities of occurrence for all block symbols or vectors are

$$p((2,3)) = \frac{1}{15}, \quad p((1,0)) = \frac{3}{15}, \quad p((2,1)) = \frac{4}{15}, \quad p((1,2)) = \frac{7}{15}.$$

The entropy for the vector symbols is

$$H(X) = -\frac{1}{15}\log_2\frac{1}{15} - \frac{3}{15}\log_2\frac{3}{15} - \frac{4}{15}\log_2\frac{4}{15} - \frac{7}{15}\log_2\frac{7}{15} \approx 1.7465 \ \text{bits per block symbol}.$$

Since there are two symbols per block symbol, the entropy is actually $1.7465/2 \approx 0.8733$ bits per symbol. This is $1.5376 - 0.8733 = 0.6643$ bits less than the entropy achieved with the false iid assumption.

Table 8.3 summarizes the entropies achieved using three data models to code the same source sequence (8.2). It is obvious that the entropy for a source is a moving target, depending on how good the data model is. Since it is generally not possible to completely know a physical source and hence build the perfect model for it, it is generally impossible to know the “real entropy” of the source. The entropy is only as good as the data model used. This is similar to quantization, where data models play a critical role.

Table 8.3 Entropies obtained using three data models for the same source sequence (8.2)

| Data model | Entropy (bits per symbol) |
|---|---|
| iid | 1.5376 |
| Prediction | 1.2833 |
| Block coding | 0.87325 |
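The three figures in Table 8.3 can be reproduced with a short script. Two things here are assumptions rather than quotations: the 30-symbol sequence is rebuilt by flattening the block symbols listed above, and the predictor referred to as (4.1) is taken to be a first-order difference with a zero initial state, which reproduces the residual probabilities given in the text.

```python
from collections import Counter
from math import log2


def empirical_entropy(items):
    """Entropy in bits of the relative-frequency distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(n / total * log2(n / total) for n in counts.values())


# Example sequence, rebuilt from the block symbols listed in the text.
x = [1, 2, 1, 0, 1, 2, 1, 2, 1, 0, 1, 2, 2, 1, 2,
     1, 2, 3, 2, 1, 2, 1, 1, 2, 1, 0, 1, 2, 1, 2]

# Model 1: symbols treated as iid.
h_iid = empirical_entropy(x)

# Model 2: first-order prediction (assumed form), residual r(n) = x(n) - x(n-1).
residual = [cur - prev for prev, cur in zip([0] + x[:-1], x)]
h_pred = empirical_entropy(residual)

# Model 3: pairs of symbols coded as block symbols, normalized per symbol.
blocks = list(zip(x[0::2], x[1::2]))
h_block = empirical_entropy(blocks) / 2

print(f"iid: {h_iid:.4f}  prediction: {h_pred:.4f}  block: {h_block:.4f}")
```

The same data yield three different "entropies", which is the point of the section: each value measures the model as much as the source.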
8.3 Uniquely and Instantaneously Decodable Codes

A necessary requirement for entropy coding is that the source sequence can be reconstructed without any loss of information from the codeword sequence received by the decoder. While the one-to-one mapping (8.10) is the first step toward this end, it is not sufficient because the codeword sequence generated by concatenating the codewords from such a codebook can become undecodable. There is, therefore, an implicit requirement for a codebook that all of its codewords must be uniquely decodable when concatenated together in any order. Among all the uniquely decodable codebooks for a given information source, the one or ones with the least average codeword length are called optimal codebooks.

Uniquely decodable codes vary significantly in terms of computational complexity, especially when decoding is involved. There is a subset, called prefix-free codes, which are instantaneously decodable in the sense that each codeword can be decoded as soon as its last bit is received. It will be shown that, if there is an optimal codebook, at least one of them is a prefix-free code.
8.3.1 Uniquely Decodable Code

Looking at the unary code in Table 8.2, one notices that the ‘1’ does not appear anywhere other than at the end of the codewords. One may wonder why this is needed. To answer this question, let us consider the old unary numeral system shown in Table 8.4, which uses the number of 0’s to represent the corresponding number and thus establishes a one-to-one mapping. It is obvious that it takes the same number of bits to code the sequence in (8.2) as the unary code given in Table 8.2. Therefore, the codebook in Table 8.4 seems to be equally adequate.

The problem is that the codeword sequence generated from the codebook in Table 8.4 cannot be uniquely decoded. For example, the first three symbols in (8.2) are {1, 2, 1}, so they will be coded into {0 00 0}. Once the receiver has received this sequence of {0000}, the decoder cannot uniquely decode it: it cannot determine whether the received codeword sequence represents {3}, {2, 2}, {1, 1, 1, 1}, or yet another combination.

To ensure unique decodability of the unary code, the ‘1’ is used to signal the end of a codeword in Table 8.2. Now, the three symbols {1, 2, 1} will be coded via the codebook in Table 8.2 into {1 01 1}. Once the receiver has received the sequence {1011}, it can uniquely determine that the symbols are {1, 2, 1}.

Table 8.4 A not uniquely decodable codebook

| Symbol | Codeword | Frequency | Codeword length |
|---|---|---|---|
| 0 | 000 | 3 | 3 |
| 1 | 0 | 14 | 1 |
| 2 | 00 | 12 | 2 |
| 3 | 0000 | 1 | 4 |
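The ambiguity can be made concrete by enumerating every way a received bit string splits into codewords. The sketch below is illustrative only: `parses` is a hypothetical helper, and the terminated unary codewords are assumed to be zeros followed by a single ‘1’, consistent with the {1 01 1} example above.

```python
def parses(bits, codebook):
    """Return every way to split `bits` into codewords of `codebook`."""
    if not bits:
        return [[]]
    results = []
    for symbol, word in codebook.items():
        if bits.startswith(word):
            for rest in parses(bits[len(word):], codebook):
                results.append([symbol] + rest)
    return results


# Codebook of Table 8.4 (zeros only, no terminator).
table_8_4 = {0: "000", 1: "0", 2: "00", 3: "0000"}

# Assumed terminated unary codebook (zeros ended by a single '1').
unary = {1: "1", 2: "01", 0: "001", 3: "0001"}

ambiguous = parses("0000", table_8_4)
unique = parses("1011", unary)

print(len(ambiguous), "parses of 0000:", ambiguous)
print(len(unique), "parse of 1011:", unique)
```

With the codebook of Table 8.4 the four received zeros admit eight different decodings, whereas the terminated code leaves exactly one.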
Fixed-length codes such as the one in Table 8.1 are also uniquely decodable because all codewords are of the same length and unique in the codebook. To decode a sequence of symbols coded with a fixed-length code of $n$ bits, one can cut the sequence into blocks of $n$ bits each, extract the bits in each block, and look up the codebook to find the symbol it represents.

Unique decodability imposes a limit on codeword lengths. In particular, McMillan’s inequality states that the codeword lengths of any uniquely decodable binary codebook (8.9) satisfy

$$\sum_{m=0}^{M-1} 2^{-l(c_m)} \le 1, \tag{8.18}$$

where $l(c_m)$ is again the length of codeword $c_m$, and that, conversely, for any set of codeword lengths satisfying (8.18) there is a uniquely decodable codebook with these lengths. See [43] for proof.

Given the source sequence (8.5) and the probability distribution (8.7), there are many uniquely decodable codes satisfying the requirement for decodability (8.18). But these codes are not equal because their average codeword lengths may be different. Among all of them, the one that produces the minimum average codeword length

$$L_{\mathrm{opt}} = \min_{\{l(c_m)\}} \sum_{m=0}^{M-1} p(c_m)\, l(c_m) \tag{8.19}$$

is referred to as the optimal codebook $C_{\mathrm{opt}}$. It is the target of codebook design.
8.3.2 Instantaneous and Prefix-Free Code

The uniquely decodable fixed-length code in Table 8.1 and unary code in Table 8.2 are also instantaneously decodable in the sense that each codeword can be decoded as soon as its last bit is received. For the fixed-length code, decoding is possible as soon as the fixed number of bits has been received, i.e., when the last bit is received. For the unary code, decoding is possible as soon as the ‘1’ is received, which signals the end of a codeword.

There are codes that are uniquely decodable, but cannot be instantaneously decoded. Two such examples are shown in Table 8.5.

Table 8.5 Two examples of not instantaneously decodable codebooks

| Symbol | Codebook A | Codebook B |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 01 | 01 |
| 2 | 011 | 11 |

For Codebook A, the codeword {0} is a prefix of codewords {01} and {011}. When {0} is received, the receiver cannot decide whether it is codeword {0} or the first bit of codewords {01} or {011}, so it has to wait. If {1} is subsequently received, the receiver cannot decide whether the received {01} is codeword {01} or the first two bits of codeword {011} because {01} is a prefix of codeword {011}. So the receiver has to wait again. Except for the reception of {011}, upon which the receiver can decide immediately that it is codeword {011}, the decoder has to wait until the reception of {0}, the start of the next source symbol. Therefore, the decoding delay may be more than one codeword.

For Codebook B, the decoding delay may be as long as the whole sequence. For example, the codeword sequence {01111} may decode as {0, 2, 2}. However, when another {1} is subsequently received to give a codeword sequence of {011111}, the interpretation of the initial five bits ({01111}) becomes totally different because the codeword sequence {011111} now decodes as {1, 2, 2}. The decoder cannot make the decision until it sees the end of the sequence. The primary reason for the delayed decision is that the codeword {0} is a prefix of codeword {01}. When the receiver sees {0}, it cannot decide whether it is codeword {0} or just the first bit of codeword {01}. To resolve this, it has to wait until the end of the sequence and work backward.

Apparently, whether or not a codeword is a prefix of another codeword is critical to whether it is instantaneously decodable. A codebook in which no codeword is a prefix of any other codeword is referred to as a prefix code, or more precisely a prefix-free code. The unary code in Table 8.2 and the fixed-length code in Table 8.1 are both prefix codes and are instantaneously decodable. The two uniquely decodable codes in Table 8.5 are not prefix codes and are not instantaneously decodable. In fact, this association is not a coincidence: a codebook is instantaneously decodable if and only if it is a prefix-free code.

To see this, let us assume that there is a codeword in an instantaneously decodable code that is a prefix of at least one other codeword. Because of this, this codeword is obviously not instantaneously decodable, as shown by Codebook A and Codebook B in the above example. Therefore, an instantaneously decodable code has to be a prefix-free code. On the other hand, all codewords of a prefix-free code can be decoded instantaneously upon reception because there is no ambiguity with any other codeword in the codebook.
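The prefix condition is easy to test mechanically. The sketch below checks the two codebooks of Table 8.5 alongside a fixed-length and a unary codebook; the latter two are assumed stand-ins for Tables 8.1 and 8.2, and `is_prefix_free` is an illustrative name.

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of a different codeword."""
    return not any(
        a != b and b.startswith(a) for a in codewords for b in codewords
    )


fixed_length = ["00", "01", "10", "11"]   # assumed 2-bit fixed-length code
unary = ["1", "01", "001", "0001"]        # assumed terminated unary code
codebook_a = ["0", "01", "011"]           # Table 8.5, Codebook A
codebook_b = ["0", "01", "11"]            # Table 8.5, Codebook B

for name, code in [("fixed-length", fixed_length), ("unary", unary),
                   ("Codebook A", codebook_a), ("Codebook B", codebook_b)]:
    print(f"{name}: prefix-free = {is_prefix_free(code)}")
```

The pairwise comparison is quadratic in the codebook size, which is adequate for small codebooks such as these.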
8.3.3 Prefix-Free Code and Binary Tree

A codebook can be viewed as a binary tree or code tree. This is illustrated in Fig. 8.1 for the unary codebook in Table 8.2. The tree starts from a root node of NULL and can have no more than two branches at each node. Each branch represents either ‘0’ or ‘1’. Each node contains the codeword formed by all the branches on the path from the root node to the current node. If a node does not grow any more branches, it is called an external node or leaf; otherwise, it is called an internal node.

Since an internal node grows at least one branch, the codeword it represents is a prefix of whatever codeword or node grows from it. On the other hand, since a leaf or external node does not grow any more branches, the codeword it represents is not a prefix of any other node or codeword. Therefore, the codewords of a prefix-free code are taken only from the leaves. For example, the code in Fig. 8.1 is a prefix-free code since only its leaves are taken as codewords.
Fig. 8.1 Binary tree for a unary codebook. It is a prefix-free code because only its leaves are taken as codewords

Fig. 8.2 Binary trees for two noninstantaneous codes. Since the left tree (for Codebook A) takes codewords from internal nodes {0} and {01}, and the right tree (for Codebook B) from internal node {0}, neither is a prefix-free code. The branches are not labeled due to the convention that left branches represent ‘0’ and right branches represent ‘1’

Figure 8.2 shows the trees for the two noninstantaneous codes given in Table 8.5. Since Codebook A takes codewords from internal nodes {0} and {01}, and Codebook B from internal node {0}, neither is a prefix-free code. Note that the branches are not labeled because both trees follow the convention that left branches represent ‘0’ and right branches represent ‘1’. This convention is followed throughout this book unless stated otherwise.
8.3.4 Optimal Prefix-Free Code

Instantaneous/prefix-free codes are obviously desirable for easy and efficient decoding. However, prefix-free codes are only a subset of uniquely decodable codes, so there is a legitimate concern that the optimal code may not be a prefix-free code for a given information source. Fortunately, this concern turns out to be unwarranted due to Kraft’s inequality, which states that there is a prefix-free codebook (8.9) with codeword lengths $l(c_m)$ if and only if

$$\sum_{m=0}^{M-1} 2^{-l(c_m)} \le 1. \tag{8.20}$$

See [43] for proof. Since Kraft’s inequality (8.20) is the same as McMillan’s inequality (8.18), we conclude that there is a uniquely decodable code if and only if there is an instantaneous/prefix-free code with the same set of codeword lengths. In terms of the optimal codebooks, there is an optimal codebook if and only if there is a prefix-free code with the same set of codeword lengths. In other words, there is always a prefix-free codebook that is optimal.
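Because (8.18) and (8.20) constrain only the codeword lengths, they can be checked, and a matching prefix-free code can be built, from the lengths alone. The following sketch uses exact fractions to avoid rounding; the canonical assignment in `prefix_code_from_lengths` is one common construction, chosen here for illustration, and is not taken from the text.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Left-hand side of (8.18)/(8.20) for a list of codeword lengths."""
    return sum(Fraction(1, 2 ** l) for l in lengths)


def prefix_code_from_lengths(lengths):
    """Build one prefix-free code with the given lengths (canonical assignment)."""
    if kraft_sum(lengths) > 1:
        raise ValueError("lengths violate Kraft's inequality")
    code, value, prev = [], 0, 0
    for l in sorted(lengths):
        value <<= l - prev                 # extend the running codeword to l bits
        code.append(format(value, f"0{l}b"))
        value += 1                         # next codeword of the same length
        prev = l
    return code


# Lengths shared by the terminated unary code and the codebook of Table 8.4.
lengths = [3, 1, 2, 4]
print(kraft_sum(lengths))                  # 15/16, so a prefix-free code exists
print(prefix_code_from_lengths(lengths))

# Two 1-bit codewords plus a 2-bit codeword cannot be uniquely decodable.
print(kraft_sum([1, 1, 2]))                # 5/4 > 1
```

Note that the codebook of Table 8.4 has lengths satisfying the inequality and is still not uniquely decodable: the inequality guarantees that some code with those lengths is decodable, not that every such code is.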
8.4 Shannon’s Noiseless Coding Theorem

Although prefix-free codes are instantaneously decodable and there is always a prefix-free codebook that is optimal, there is still a question as to how closely the average codeword length of such an optimal prefix-free code can approach the entropy of the information source. Shannon’s noiseless coding theorem states that the entropy is the absolute minimal average codeword length of any uniquely decodable code and that the entropy can be asymptotically approached by a prefix-free code if source symbols are coded as blocks and the block size goes to infinity.
8.4.1 Entropy as the Lower Bound

To reduce the average codeword length, it is desired that only short codewords be used. But McMillan’s inequality (8.18) states that, to ensure unique decodability, the use of some short codewords requires the other codewords to be long. Consequently, the overall or average codeword length cannot be arbitrarily low; there is an absolute lower bound. It turns out that the entropy is this lower bound.

To prove this, let

$$K = \sum_{m=0}^{M-1} 2^{-l(c_m)}.$$

Due to McMillan’s inequality (8.18), we have

$$\log_2(K) \le 0. \tag{8.21}$$

Consequently, we have

$$\begin{aligned}
L &= \sum_{m=0}^{M-1} p(c_m)\, l(c_m) \\
  &\ge \sum_{m=0}^{M-1} p(c_m) \left[ l(c_m) + \log_2(K) \right] \\
  &= \sum_{m=0}^{M-1} p(c_m) \log_2 \left[ 2^{l(c_m)} K \right] \\
  &= \sum_{m=0}^{M-1} p(c_m) \log_2 \left[ \frac{p(c_m)\, 2^{l(c_m)} K}{p(c_m)} \right] \\
  &= \sum_{m=0}^{M-1} p(c_m) \log_2 \frac{1}{p(c_m)} + \sum_{m=0}^{M-1} p(c_m) \log_2 \left[ p(c_m)\, 2^{l(c_m)} K \right] \\
  &= H(X) - \sum_{m=0}^{M-1} p(c_m) \log_2 \frac{1}{p(c_m)\, 2^{l(c_m)} K}.
\end{aligned}$$

Due to

$$\log_2(x) \le \frac{1}{\ln 2}(x - 1), \quad \forall x > 0,$$

the sum in the second term on the right-hand side is never positive because

$$\begin{aligned}
\sum_{m=0}^{M-1} p(c_m) \log_2 \frac{1}{p(c_m)\, 2^{l(c_m)} K}
  &\le \frac{1}{\ln 2} \sum_{m=0}^{M-1} p(c_m) \left[ \frac{1}{p(c_m)\, 2^{l(c_m)} K} - 1 \right] \\
  &= \frac{1}{\ln 2} \sum_{m=0}^{M-1} \left[ \frac{1}{2^{l(c_m)} K} - p(c_m) \right] \\
  &= \frac{1}{\ln 2} \left[ \frac{1}{K} \sum_{m=0}^{M-1} 2^{-l(c_m)} - \sum_{m=0}^{M-1} p(c_m) \right] \\
  &= \frac{1}{\ln 2} \left[ 1 - 1 \right] \\
  &= 0.
\end{aligned}$$

Subtracting a quantity that is never positive cannot decrease the right-hand side. Therefore, we have

$$L \ge H(X). \tag{8.22}$$
8.4.2 Upper Bound

Since the entropy is the absolute lower bound on average codeword length, an intuitive approach to the construction of an optimal codebook is to set the length of the codeword assigned to a source symbol to its self-information (8.15); then the average codeword length would be equal to the entropy. This is, unfortunately, unworkable because the self-information is most likely not an integer. But we can get close to this by setting the codeword length to the smallest integer that is no less than the self-information:

$$l(c_m) = \operatorname{ceil}[-\log_2 p(c_m)]. \tag{8.23}$$

Such a codebook is called a Shannon–Fano code. The codeword lengths of a Shannon–Fano code are compatible with unique decodability because they satisfy the McMillan inequality (8.18):

$$\begin{aligned}
\sum_{m=0}^{M-1} 2^{-l(c_m)} &= \sum_{m=0}^{M-1} 2^{-\operatorname{ceil}[-\log_2 p(c_m)]} \\
  &\le \sum_{m=0}^{M-1} 2^{\log_2 p(c_m)} \\
  &= \sum_{m=0}^{M-1} p(c_m) \\
  &= 1.
\end{aligned} \tag{8.24}$$

The average codeword length of the Shannon–Fano code is

$$\begin{aligned}
L &= \sum_{m=0}^{M-1} p(c_m) \operatorname{ceil}[-\log_2 p(c_m)] \\
  &< \sum_{m=0}^{M-1} p(c_m) \left[ 1 - \log_2 p(c_m) \right] \\
  &= 1 - \sum_{m=0}^{M-1} p(c_m) \log_2 p(c_m) \\
  &= 1 + H(X),
\end{aligned} \tag{8.25}$$

where the inequality in the second line is obtained due to

$$\operatorname{ceil}(x) < 1 + x.$$

A Shannon–Fano code may or may not be optimal. But since the optimal code is at least as short on average, the above inequality constitutes an upper bound on the optimal codeword length.
Combining this and the lower entropy bound (8.22), we obtain the following bound for the optimal codeword length:

$$H(X) \le L_{\mathrm{opt}} < 1 + H(X). \tag{8.26}$$
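The length assignment (8.23) and the two bounds can be checked numerically. The sketch below applies it to the relative frequencies of the earlier 30-symbol example; the function name is illustrative and only the codeword lengths, not the codewords themselves, are computed.

```python
from math import ceil, log2


def shannon_fano_lengths(probs):
    """Codeword lengths ceil(-log2 p), as in (8.23)."""
    return [ceil(-log2(p)) for p in probs]


probs = [3 / 30, 12 / 30, 14 / 30, 1 / 30]
lengths = shannon_fano_lengths(probs)

kraft = sum(2.0 ** -l for l in lengths)               # left-hand side of (8.24)
L_sf = sum(p * l for p, l in zip(probs, lengths))     # average length (8.25)
H = -sum(p * log2(p) for p in probs)                  # entropy (8.16)

print("lengths:", lengths)
print(f"Kraft sum = {kraft:.5f}")
print(f"H = {H:.4f} <= L = {L_sf:.4f} < H + 1 = {H + 1:.4f}")
```

For this distribution the lengths come out as 4, 2, 2 and 5 bits, giving an average of 2.3 bits. That is within one bit of the entropy, as (8.25) promises, but longer than the 1.7 bits of the unary code, which illustrates that a code with these lengths need not be optimal.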
8.4.3 Shannon’s Noiseless Coding Theorem

Let us group $n$ source symbols in source sequence (8.5) as a block, called a block symbol,

$$\mathbf{X}(k) = [X_{kn}, X_{kn+1}, \ldots, X_{kn+n-1}], \tag{8.27}$$

thus converting source sequence (8.5) into a sequence of block symbols:

$$\mathbf{X}(0), \mathbf{X}(1), \ldots \tag{8.28}$$

Since source sequence (8.5) is assumed to be iid, the probability distribution for a block symbol is

$$p(s_{m_0}, s_{m_1}, \ldots, s_{m_{n-1}}) = p(s_{m_0})\, p(s_{m_1}) \cdots p(s_{m_{n-1}}). \tag{8.29}$$

Using this equation, the entropy for the block symbols is

$$\begin{aligned}
H^n(\mathbf{X}) &= -\sum_{m_0=0}^{M-1} \sum_{m_1=0}^{M-1} \cdots \sum_{m_{n-1}=0}^{M-1} p(s_{m_0}, s_{m_1}, \ldots, s_{m_{n-1}}) \log p(s_{m_0}, s_{m_1}, \ldots, s_{m_{n-1}}) \\
  &= -n \sum_{m_0=0}^{M-1} p(s_{m_0}) \log p(s_{m_0}) \\
  &= n H(X).
\end{aligned} \tag{8.30}$$

Applying the bounds in (8.26) to the block symbols, we have

$$n H(X) \le L^n_{\mathrm{opt}} < 1 + n H(X), \tag{8.31}$$

where $L^n_{\mathrm{opt}}$ is the optimal codeword length for the block codebook that is used to code the sequence of block symbols. The average codeword length per source symbol is obviously

$$L_{\mathrm{opt}} = \frac{L^n_{\mathrm{opt}}}{n}. \tag{8.32}$$
Therefore, the optimal codeword length per source symbol is bounded by

$$H(X) \le L_{\mathrm{opt}} < \frac{1}{n} + H(X). \tag{8.33}$$
2, let us assume that Huffman’s algorithm produces an optimal code for a symbol set of size $M-1$ and prove that it then produces an optimal code for a symbol set of size $M$.
9.2.1 Codeword Siblings

From (9.8) we notice that the Huffman codewords for the two least probable symbols have the form of $\{c'0\}$ and $\{c'1\}$, i.e., they have the same length and differ only in the last bit (see Fig. 9.1). Such a pair of codewords are called “siblings”. In fact, any instantaneous codebook can always be rearranged in such a way that the codewords for the two least probable symbols are siblings while keeping the average codeword length the same or less.

To show this, let us first note that, for two symbols with probabilities $p_1 < p_2$, if a longer codeword was assigned to the more probable symbol, i.e., $l_1 < l_2$, the codewords can always be swapped without any topological change to the tree, but with reduced average codeword length. One such example is shown in Fig. 9.2 where codewords for symbols with probabilities 0.3 and 0.6 are swapped. Repetitive application of this topology-preserving procedure to a codebook always ends up with a new one which has the same topology as the original one, but whose two least probable symbols are assigned the two longest codewords and whose average codeword length is no longer than that of the original one.

Second, if an internal node in a codebook tree does not grow two branches, it can always be removed to generate shorter codewords. This is shown in Fig. 9.3 where node ‘1’ in the tree shown on the top is removed to give the tree at the bottom with shorter average codeword length. Application of this procedure to the two least probable symbols in a codebook ensures that they always have the same codeword length. Otherwise, the longest codeword must be from at least one internal node which grows only one branch.

Third, if the codewords for the two least probable symbols do not grow from the same last internal node, the last internal node for the least probable symbol must grow another codeword whose length is the same as that of the second least probable symbol. Due to the same codeword length, these two codewords can be swapped with no impact on the average codeword length. Consequently, the codewords for the two least probable symbols can always grow from the same last internal node. In other words, the two codewords can always be of the form of $c'0$ and $c'1$.

Fig. 9.2 Codewords for symbols with probabilities 0.3 and 0.6 in the codebook tree on the top are swapped to give the codebook tree at the bottom. There is no change to the tree topology, but the average codeword length becomes shorter because the shorter codeword is weighted by the higher probability

Fig. 9.3 Removal of the internal node ‘1’ in the tree on the top produces the tree at the bottom which has shorter average codeword length
9.2.2 Proof of Optimality

Let $L_M$ be the average codeword length for the codebook produced by Huffman’s algorithm for the symbol set of (8.6) with the probability distribution of (8.7). Without loss of generality, the symbol set in (8.6) can always be permuted to give the following symbol set:

$$\{s_0, s_1, \ldots, s_{M-3}, s_{M-2}, s_{M-1}\} \tag{9.9}$$

with a probability distribution of

$$p_m \ge p_{M-2} \ge p_{M-1} \quad \text{for all } 0 \le m \le M-3. \tag{9.10}$$

Then the last two symbols can be merged to give the following symbol set:

$$\{s_0, s_1, \ldots, s_{M-3}, s'\} \tag{9.11}$$

which has $M-1$ symbols and a probability distribution of

$$p_0, p_1, \ldots, p_{M-3}, p' \tag{9.12}$$

where

$$p' = p_{M-2} + p_{M-1}. \tag{9.13}$$

Applying Huffman’s recursive procedure in Sect. 9.1 to symbol set (9.11) produces a Huffman codebook with an average codeword length of $L_{M-1}$. By the induction hypothesis, this Huffman codebook is optimal.

The last step of Huffman’s procedure grows the last codeword in symbol set (9.11) into two codewords by attaching one bit (‘0’ and ‘1’) to its end to produce a codebook of size $M$ for symbol set (9.9). This additional bit is added with a probability of $p_{M-2} + p_{M-1}$, so the average codeword length for the new Huffman codebook is

$$L_M = L_{M-1} + p_{M-2} + p_{M-1}. \tag{9.14}$$

Suppose that there were another instantaneous codebook for the symbol set in (8.6) with an average codeword length of $\hat{L}_M$ that is less than $L_M$:

$$\hat{L}_M < L_M. \tag{9.15}$$
As shown in Sect. 9.2.1, this codebook can be modified so that the codewords for the two least probable symbols $s_{M-2}$ and $s_{M-1}$ have the forms of $c'0$ and $c'1$, while keeping the average codeword length the same or less. This means the symbol set is permuted to have the form given in (9.9) with the corresponding probability distribution given by (9.10). This codebook can be used to produce another codebook of size $M-1$ for symbol set (9.11) by keeping the codewords for $\{s_0, s_1, \ldots, s_{M-3}\}$ the same and encoding the last symbol $s'$ using $c'$. Let us denote its average codeword length as $\hat{L}_{M-1}$. Following the same argument as that which leads to (9.14), we can establish

$$\hat{L}_M = \hat{L}_{M-1} + p_{M-2} + p_{M-1}. \tag{9.16}$$

Subtracting (9.16) from (9.14), we have

$$L_M - \hat{L}_M = L_{M-1} - \hat{L}_{M-1}. \tag{9.17}$$

By the induction hypothesis, $L_{M-1}$ is the average codeword length for an optimal codebook for the symbol set in (9.11), so we have

$$L_{M-1} \le \hat{L}_{M-1}. \tag{9.18}$$

Plugging this into (9.17), we have

$$L_M - \hat{L}_M \le 0 \tag{9.19}$$

or

$$\hat{L}_M \ge L_M, \tag{9.20}$$

which contradicts the supposition in (9.15). Therefore, Huffman’s algorithm produces an optimal codebook for $M$ as well.

To summarize, it was proven above that, if Huffman’s algorithm produces an optimal codebook for a symbol set of size $M-1$, it produces an optimal codebook for a symbol set of size $M$. Since it produces the optimal codebook for $M = 2$, it produces optimal codebooks for any $M$.
9.3 Block Huffman Code

Although the Huffman code is optimal for symbol sets of any size, the optimal average codeword length that it achieves is often much larger than the entropy when the symbol set is small. To see this, let us consider the extreme case of $M = 2$. The only possible codebook, which is also the Huffman code, is

{0, 1}.

It obviously has an average codeword length of one bit, regardless of the underlying probability distribution and entropy. The more skewed the probability distribution is, the smaller the entropy is, and hence the less efficient the Huffman code is. As an example, let us consider the probability distribution of $p_1 = 0.1$ and $p_2 = 0.9$. It results in an entropy of

$$H = -0.1 \log_2(0.1) - 0.9 \log_2(0.9) \approx 0.4690 \ \text{bits},$$

which is obviously much smaller than the one bit delivered by the Huffman code.
9.3.1 Efficiency Improvement

By Shannon’s noiseless coding theorem (8.33), however, the entropy can be approached by grouping more symbols together into block symbols. To illustrate this, let us first block-code two symbols together as one block symbol. This gives the probability distribution in Table 9.1 and the corresponding Huffman code in Fig. 9.4. Its average codeword length is

$$L = \frac{1}{2}(0.01 \times 3 + 0.09 \times 3 + 0.09 \times 2 + 0.81 \times 1) = 0.645 \ \text{bits},$$

which is significantly smaller than the one bit achieved by the $M = 2$ Huffman code.

Table 9.1 Probability distribution when two symbols are coded together as one block symbol

| Symbol | Probability |
|---|---|
| 00 | 0.01 |
| 01 | 0.09 |
| 10 | 0.09 |
| 11 | 0.81 |

Fig. 9.4 Huffman code when two symbols are coded together as one block symbol

This can be further improved by coding three symbols as one block symbol. This gives us the probability distribution in Table 9.2 and the Huffman code in Fig. 9.5. The average codeword length is

$$L = \frac{1}{3}(0.001 \times 7 + 0.009 \times 7 + 0.009 \times 6 + 0.081 \times 4 + 0.009 \times 5 + 0.081 \times 3 + 0.081 \times 2 + 0.729 \times 1) = 0.5423 \ \text{bits},$$

which is more than 0.1 bit better than coding two symbols as a block symbol and closer to the entropy.

Table 9.2 Probability distribution when three symbols are coded together as one block symbol

| Symbol | Probability |
|---|---|
| 000 | 0.001 |
| 001 | 0.009 |
| 010 | 0.009 |
| 011 | 0.081 |
| 100 | 0.009 |
| 101 | 0.081 |
| 110 | 0.081 |
| 111 | 0.729 |

Fig. 9.5 Huffman code when three symbols are coded together as one block symbol
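The trend can be reproduced with a small heap-based Huffman construction applied to blocks of one, two and three symbols. This is a sketch under stated assumptions: `huffman_lengths` and `bits_per_symbol` are illustrative helpers, only codeword lengths are computed, and ties are broken by insertion order, which may assign lengths differently from the trees in the figures.

```python
import heapq
from itertools import count, product
from math import log2, prod


def huffman_lengths(probs):
    """Codeword lengths obtained by repeatedly merging the two least probable nodes."""
    if len(probs) == 1:
        return [1]
    tie = count()  # tie-breaker so the heap never compares the index lists
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        for i in a + b:          # every symbol below the merged node gains one bit
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next(tie), a + b))
    return lengths


def bits_per_symbol(p, n):
    """Average Huffman codeword length per source symbol for blocks of n iid symbols."""
    blocks = [prod(b) for b in product(p, repeat=n)]
    lengths = huffman_lengths(blocks)
    return sum(q * l for q, l in zip(blocks, lengths)) / n


p = [0.1, 0.9]
H = -sum(q * log2(q) for q in p)
rates = {n: bits_per_symbol(p, n) for n in (1, 2, 3)}

print(f"entropy: {H:.4f} bits per symbol")
for n, rate in rates.items():
    print(f"block size {n}: {rate:.4f} bits per symbol")
```

For block sizes one and two this gives 1 and 0.645 bits per symbol, matching the text. For block size three the merge order used here (the two remaining 0.009 nodes are combined before either joins the 0.010 node) yields about 0.5327 bits per symbol, slightly below the 0.5423 obtained from the codeword lengths quoted above; both values remain above the entropy of about 0.4690 bits.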
9.3.2 Block Encoding and Decoding

In the examples above, the block symbols are constructed from the original symbols by stacking the bits of the primary symbols together. For example, a block symbol in Table 9.2 is constructed as

$$B = \{s_2 s_1 s_0\},$$

where $s_0$, $s_1$, and $s_2$ are the bits representing the original symbols. This is possible because the number of symbols in the original symbol set is $M = 2$. In general, for any finite $M$, a block symbol $B$ consisting of $n$ original symbols may be expressed as

$$B = s_{n-1} M^{n-1} + s_{n-2} M^{n-2} + \cdots + s_1 M + s_0, \tag{9.21}$$

where $s_i \in \{0, 1, \ldots, M-1\}$, $i = 0, 1, \ldots, n-1$, is the value representing each original symbol. It is, therefore, obvious that block encoding just consists of a series of multiplication and accumulation operations.

Equation (9.21) also indicates that the original symbols may be decoded from the block symbol through the following iterative procedure:

$$\begin{aligned}
B_1 &= B / M, & s_0 &= B - B_1 M; \\
B_2 &= B_1 / M, & s_1 &= B_1 - B_2 M; \\
B_3 &= B_2 / M, & s_2 &= B_2 - B_3 M; \\
&\ \ \vdots & &\ \ \vdots \\
B_{n-1} &= B_{n-2} / M, & s_{n-2} &= B_{n-2} - B_{n-1} M; \\
B_n &= B_{n-1} / M, & s_{n-1} &= B_{n-1} - B_n M,
\end{aligned} \tag{9.22}$$

where the $/$ operation represents integer division. When $M$ is a power of 2, it may be implemented as a right shift. The step that obtains $s_i$ in each iteration is actually the operation to get the remainder.
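The packing in (9.21) and the unpacking in (9.22) translate almost line for line into code. The following is a minimal sketch; the function names are illustrative, and symbols are represented by their integer indexes in {0, ..., M-1}.

```python
def block_encode(symbols, M):
    """Pack [s_0, s_1, ..., s_{n-1}] into one block symbol B, as in (9.21)."""
    B = 0
    for s in reversed(symbols):   # Horner's rule, starting from s_{n-1}
        B = B * M + s             # one multiplication and one accumulation
    return B


def block_decode(B, M, n):
    """Recover [s_0, s_1, ..., s_{n-1}] from B, following (9.22)."""
    symbols = []
    for _ in range(n):
        B_next = B // M                   # integer division
        symbols.append(B - B_next * M)    # remainder, same as B % M
        B = B_next
    return symbols


# Binary case (M = 2): the block symbol is just the stacked bits s2 s1 s0.
print(block_encode([1, 1, 0], M=2))       # bits 0 1 1 read as s2 s1 s0
print(block_decode(3, M=2, n=3))

# A larger alphabet (M = 5) with three symbols per block.
B = block_encode([4, 0, 3], M=5)
print(B, block_decode(B, M=5, n=3))
```

When M is a power of 2, `B // M` and `B - B_next * M` could be replaced by a right shift and a bit mask, as the text notes.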
9.4 Recursive Coding

Huffman encoding is straightforward and simple because it only involves looking up the codebook. Huffman decoding, however, is rather complex because it entails searching through the tree until a matching leaf is found. If the codebook is too large, consisting of more than 300 codewords, for example, the decoding complexity can be excessive. Recursive indexing is a simple method for representing an excessively large symbol set by a moderate one so that a moderate Huffman codebook can be used to encode the excessively large symbol set.
Without loss of generality, let us represent a symbol set by its indexes starting from zero, so that each symbol in the symbol set corresponds to a nonnegative integer $x$. This $x$ can be represented as

$$x = q \cdot M + r, \tag{9.23}$$

where $M$ is the maximum value of the reduced symbol set $\{0, 1, \ldots, M\}$, $q$ is the quotient, and $r$ is the remainder. Once $M$ is agreed upon by the encoder and decoder, only $q$ and $r$ need to be conveyed to the decoder. Usually $r$ is encoded using a Huffman codebook and $q$ by other means. One simple approach to encoding $q$ is to represent it by repeating the symbol $M$ $q$ times. In this way a single Huffman codebook can be used to encode a large symbol set, no matter how large it is.
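A minimal sketch of this scheme follows, assuming the simple approach just described: the quotient is conveyed by repeating the symbol M, which therefore acts as an escape, and the remainder r in {0, ..., M-1} terminates each value. The function names are illustrative; in a codec the reduced symbols would then be passed to a Huffman coder.

```python
def recursive_index(x, M):
    """Represent x = q*M + r as q copies of the symbol M followed by r."""
    q, r = divmod(x, M)
    return [M] * q + [r]


def recursive_deindex(symbols, M):
    """Recover the original indexes from a stream of reduced symbols in {0, ..., M}."""
    values, acc = [], 0
    for s in symbols:
        if s == M:
            acc += M                 # escape symbol: keep accumulating
        else:
            values.append(acc + s)   # a remainder ends the current value
            acc = 0
    return values


M = 15
stream = []
for x in (40, 15, 3):
    stream += recursive_index(x, M)

print(stream)                        # reduced symbols, all within {0, ..., 15}
print(recursive_deindex(stream, M))  # original indexes
```

The cost of this representation grows with q, so M would normally be chosen so that most values fall below it and the escape symbol is rare.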
9.5 A Fast Decoding Algorithm

Due to the need to search the tree to find the leaf that matches bits from the input stream, Huffman decoding is computationally expensive. While fast decoding algorithms are usually tied to specific hardware (computer) architectures, a generic algorithm is provided here to illustrate the steps involved in Huffman decoding:

1. $n = 1$;
2. Unpack one bit from the bit stream;
3. Concatenate the bit to the previously unpacked bits to form a word with $n$ bits;
4. Search the codewords of $n$ bits in the Huffman codebook;
5. Stop if a codeword is found to be equal to the unpacked word;
6. $n = n + 1$ and go back to step 2.
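The six steps above can be sketched in code as follows. This is only an illustration: the codebook `{"a": "0", "b": "10", "c": "11"}` is a made-up prefix code, the function names are hypothetical, and the "search the codewords of $n$ bits" step is realized here with a dictionary keyed by (length, value) rather than a tree search.

```python
def build_table(codes):
    """Index codewords by (number of bits, integer value)."""
    return {(len(c), int(c, 2)): sym for sym, c in codes.items()}


def decode_symbol(bit_iter, table):
    """Decode one symbol; return None if the stream is already empty."""
    word, n = 0, 0
    for bit in bit_iter:                 # step 2: unpack one bit
        word = (word << 1) | bit         # step 3: concatenate to previous bits
        n += 1                           # word now holds n bits
        symbol = table.get((n, word))    # step 4: search n-bit codewords
        if symbol is not None:
            return symbol                # step 5: stop on a match
    if n == 0:
        return None
    raise ValueError("bit stream ended inside a codeword")


def decode_stream(bitstring, codes):
    """Decode a string of '0'/'1' characters into a list of symbols."""
    table = build_table(codes)
    bit_iter = (int(b) for b in bitstring)
    out = []
    while True:
        symbol = decode_symbol(bit_iter, table)
        if symbol is None:
            return out
        out.append(symbol)
```

For instance, `decode_stream("010110", {"a": "0", "b": "10", "c": "11"})` yields `['a', 'b', 'c', 'a']`.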
Part V
Audio Coding
While presented from the perspective of audio coding, the chapters in the previous parts cover theoretical aspects of the coding technology that can be applied to the coding of signals in general. The chapters in this part are devoted to the coding of audio signals. In particular, Chap. 10 covers perceptual models, which determine which parts of the source signal are inaudible (perceptually irrelevant) and thus can be removed. Chapter 11 addresses the resolution challenge posed by the frequent interruption of otherwise quasi-stationary audio signals by transients. Chapter 12 deals with widely used methods for joint channel coding as well as the coding of low-frequency effect (LFE) channels. Chapter 13 covers a few practical issues frequently encountered in the development of audio coding algorithms. Chapter 14 is devoted to performance assessment of audio coding algorithms, and Chap. 15 presents the dynamic resolution adaptation (DRA) audio coding standard as an example to illustrate how to integrate the technologies described in this book to create a practical audio coding algorithm.
Chapter 10
Perceptual Model
Although data model and quantization have been discussed in detail in the earlier chapters as the tool for effectively removing perceptual irrelevance, a question still remains as to which part of the source signal is perceptually irrelevant. Feasible answers to this question obviously depend on the underlying application. For audio coding, perceptual irrelevance is ultimately determined by the human ear, so perceptual models need to be built that mimic the human auditory system so as to indicate to an audio coder which parts of the source audio signal are perceptually irrelevant, hence can be removed without audible artifacts. When a quantizer removes perceptual irrelevance, it essentially substitutes quantization noise for perceptually irrelevant parts of the source signal, so the quantization process should be properly controlled to ensure that quantization noise is not audible. Quantization noise is not audible if its power is below the sensitivity threshold of the human ear. This threshold is very low in an absolutely quiet environment (threshold in quiet), but becomes significantly elevated in the presence of other sounds due to masking. Masking is a phenomenon where a strong sound makes a weak sound less audible or even completely inaudible when the power of the weak sound is below a certain threshold jointly determined by the characteristics of both sounds. Quantization noise may be masked by signal components that occur simultaneously with the signal component being quantized. This is called simultaneous masking and is exploited most extensively in audio coding. Quantization noise may also be masked by signal components that are ahead of and/or behind it. This is called temporal masking. The task of the perceptual model is to explore the threshold in quiet and the simultaneous/temporal masking to come up with an estimate of global masking threshold which is a function of frequency and time. 
The audio coder can then adjust its quantization process in such a way that all quantization noises are below this threshold to ensure that they are not audible.
10.1 Sound Pressure Level

Sound waves traveling in the air or other transmission media can be described by the time-varying change in atmospheric pressure $p(t)$, called sound pressure. Sound pressure is measured as the sound force per unit area, and its unit is the newton per square meter (N/m$^2$), also known as the pascal (Pa). The human ear can perceive sound pressure as low as $10^{-5}$ Pa, and a sound pressure of 100 Pa is considered the threshold of pain. These two values establish a dramatic dynamic range of roughly $10^7$ to 1. Compared with the atmospheric pressure of 101,325 Pa, the absolute values of sound pressure perceivable by the human ear are obviously very small. To cope with this situation, the sound pressure level (SPL) is introduced:

$$l = 20 \log_{10} \frac{p}{p_0} \ \text{dB}, \tag{10.1}$$
where $p_0$ is a reference level of $2 \times 10^{-5}$ Pa. This reference level corresponds to the best hearing sensitivity of an average listener at around 1,000 Hz. Another description of sound waves is the sound intensity $I$, which is defined as the sound power per unit area. For a spherical or plane progressive wave, sound intensity $I$ is proportional to the square of sound pressure, so sound intensity level is related to the sound pressure level by

$$l = 20 \log_{10} \frac{p}{p_0} = 10 \log_{10} \frac{I}{I_0} \ \text{dB}, \tag{10.2}$$
where $I_0 = 10^{-12}$ W/m$^2$ is the reference level. Due to this relationship, SPL and sound intensity level are identical on the logarithmic scale. When a sound signal is considered as a wide-sense stationary random process, its spectrum or intensity density level is defined as

$$L(f) = \frac{P(f)}{I_0}, \tag{10.3}$$

where $P(f)$ is the power spectral density of the sound wave.
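As a small numerical illustration of (10.1) and (10.2), the two level definitions can be written as below. The function names are hypothetical; the reference values are the ones quoted in the text.

```python
import math

P0 = 2e-5    # reference sound pressure in Pa, from (10.1)
I0 = 1e-12   # reference sound intensity in W/m^2, from (10.2)


def spl_from_pressure(p_pa):
    """Sound pressure level in dB for a sound pressure given in Pa."""
    return 20.0 * math.log10(p_pa / P0)


def level_from_intensity(i_w_m2):
    """Sound intensity level in dB for an intensity given in W/m^2."""
    return 10.0 * math.log10(i_w_m2 / I0)
```

Under these definitions the 100 Pa threshold of pain corresponds to about 134 dB SPL, and the reference pressure itself to 0 dB SPL.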
10.2 Absolute Threshold of Hearing

The absolute threshold of hearing (ATH) or threshold in quiet (THQ) is the minimum sound pressure level of a pure tone that an average listener with normal hearing capability can hear in an absolutely quiet environment. This SPL threshold varies with the frequency of the test tone. An empirical equation that describes this relationship is [91]
Fig. 10.1 Absolute threshold of hearing [threshold in quiet (dB SPL) versus frequency (Hz), logarithmic frequency axis]
$$T_q(f) = 3.64 \left(\frac{f}{1{,}000}\right)^{-0.8} - 6.5\, e^{-0.6\,(f/1{,}000 - 3.3)^2} + 0.001 \left(\frac{f}{1{,}000}\right)^{4} \ \text{dB} \tag{10.4}$$
and is plotted in Fig. 10.1. As can be seen in Fig. 10.1, the human ear is very sensitive to frequencies from 1,000 to 5,000 Hz and is most sensitive around 3,300 Hz. Beyond this region, the sensitivity of hearing degrades rapidly, especially below 100 Hz and above 10,000 Hz. Below 20 Hz and above 18,000 Hz, the human ear can hardly perceive sounds. The formula in (10.4), and hence Fig. 10.1, does not fully reflect the rapid degradation of hearing sensitivity below 20 Hz. As people age, hearing sensitivity degrades mostly at high frequencies, with little change at low frequencies.

It is rather difficult to apply the threshold in quiet to audio coding, mostly because there is no way to know the playback SPL at which an audio signal is presented to a listener. A safe bet is to equate the minimum in Fig. 10.1 around 3,300 Hz to the lowest bit in the audio coder. This ensures that quantization noise is not audible even if the audio signal is played back at the maximum volume, but it is usually too pessimistic because listeners rarely play back sound at the maximum volume. Another difficulty with applying the threshold in quiet to audio coding is that quantization noise is complex and is unlikely to be sinusoidal. The actual threshold in quiet for complex quantization noise is certainly different from that for pure tones, but little research has been reported in this regard.
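The empirical formula (10.4) is easy to evaluate directly, which makes it convenient for locating the region of best sensitivity. The sketch below is only an illustration; the function name is hypothetical and the frequency grid used to search for the minimum is an arbitrary choice.

```python
import math


def threshold_in_quiet_db(f_hz):
    """Threshold in quiet in dB SPL according to (10.4); f_hz must be > 0."""
    k = f_hz / 1000.0
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 0.001 * k ** 4)


# Frequency of best sensitivity on a 10 Hz grid from 20 Hz to 20 kHz.
f_most_sensitive = min(range(20, 20001, 10), key=threshold_in_quiet_db)
```

On this grid the minimum falls slightly above 3,300 Hz with a value of roughly $-5$ dB SPL, while the threshold at 100 Hz is above 20 dB SPL.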
10.3 Auditory Subband Filtering As shown in Fig. 10.2, when sound is perceived by the human ear, it is first preprocessed by the human body, including the head and shoulder, and then the outer ear canal before it reaches the ear drum. The vibration of the ear drum is transferred by the ossicular bones in the middle ear to the oval window which is the entrance to or the start of the cochlea in the inner ear. The cochlea is a spiral structure filled with almost incompressible fluids, whose start at the oval window is known as the base and whose end as the apex. The vibrations at the oval window induce traveling waves in the fluids which in turn transfer the waves to the basilar membrane that lies along the length of cochlea. These traveling waves are converted into electrical signals by neural receptors that are connected along the length of the basilar membrane [53].
10.3.1 Subband Filtering

Different frequency components of an input sound wave are sorted out while traveling along the basilar membrane from the start (the base) towards the end (the apex). This is schematically illustrated in Fig. 10.3 for an example signal consisting of three tones (400, 1,600, and 6,400 Hz) presented to the base of the basilar membrane [102]. For each sinusoidal component in the input sound wave, the amplitude of basilar membrane displacement increases at first, reaches a maximum, and then decreases rather abruptly. The position where the amplitude peak occurs depends on the frequency of the sinusoidal component. In other words, a sinusoidal signal resonates strongly at a position on the basilar membrane that corresponds to its frequency. Equivalently, different frequency components of an input sound wave resonate at different locations on the basilar membrane. This allows different groups of neural receptors connected along the length of the basilar membrane to process different frequency components of the input signal. From a signal processing perspective, this frequency-selective processing of sound signals may be viewed as subband filtering, and the basilar membrane may be considered as a bank of bandpass auditory filters.
Fig. 10.2 Major steps involved in the conversion of sound waves into neural signals in the human ear
Fig. 10.3 Instantaneous displacement of basilar membrane for an input sound wave consisting of three tones (400, 1,600, and 6,400 Hz) that is presented to the oval window of basilar membrane. Note that the three excitations do not appear simultaneously because the wave needs time to travel along the basilar membrane
An observation from Fig. 10.3 is that the auditory filters are continuously placed along the length of the basilar membrane and are activated in response to the frequency components of the input sound wave. If the frequency components of the sound wave are close to each other, these auditory filters overlap significantly. There is, of course, no decimation in the continuous-time world of neurons. As will be discussed later in this chapter, the frequency responses of these auditory filters are asymmetric, nonlinear, and level-dependent, with a nonuniform bandwidth that increases with frequency. Therefore, auditory filters are very different from the discretely placed, almost nonoverlapping, and often maximally decimated subband filters that we are familiar with.
10.3.2 Auditory Filters

A simple model for auditory filters is the gammatone filter, whose impulse response is given by [1, 71]

$$h(t) = A\, t^{\,n-1} e^{-2\pi B t} \cos(2\pi f_c t + \phi), \tag{10.5}$$
where $f_c$ is the center frequency, $\phi$ the phase, $A$ the amplitude, $n$ the filter's order, $t$ the time, and $B$ the filter's bandwidth. Figure 10.4 shows the magnitude response of the gammatone filter with $f_c = 1{,}000$ Hz, $n = 4$, and $B$ determined by (10.7). A widely used model that captures the asymmetric nature of auditory filters is the rounded exponential filter, denoted as roex$(p_l, p_u)$, whose power spectrum is given by [73]

$$
W(f) =
\begin{cases}
\left(1 + p_l \dfrac{f_c - f}{f_c}\right) e^{-p_l \frac{f_c - f}{f_c}}, & f \le f_c,\\[2ex]
\left(1 + p_u \dfrac{f - f_c}{f_c}\right) e^{-p_u \frac{f - f_c}{f_c}}, & f > f_c,
\end{cases}
\tag{10.6}
$$
Fig. 10.4 Magnitude response of a gammatone filter that models the auditory filters [magnitude response (dB) versus frequency (Hz)]

Fig. 10.5 Power spectrum of the roex$(p_l, p_u)$ filter ($f_c = 1{,}000$ Hz, $p_l = 40$, and $p_u = 10$) that models auditory filters [power spectrum (dB) versus frequency (Hz)]
where $f_c$ represents the center frequency, $p_l$ determines the slope of the filter below $f_c$, and $p_u$ determines the slope of the filter above $f_c$. Figure 10.5 shows the power spectrum of this filter for $f_c = 1{,}000$ Hz, $p_l = 40$, and $p_u = 10$.
The roex$(p_l, p_u)$ filter with $p_l = p_u$ was used to estimate the critical bandwidth (ERB) of the auditory filters [53, 70]. When the order of the gammatone filter is in the range 3–5, the shape of its magnitude characteristic is very similar to that of the roex$(p_l, p_u)$ filter with $p_l = p_u$ [72]. In particular, when the order $n = 4$, the suggested bandwidth for the gammatone filter is

$$B = 1.019\, \mathrm{ERB}, \tag{10.7}$$
where ERB is given in (10.15).
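The two auditory filter models can be evaluated with a few lines of code. The sketch below is an illustration only: the function names are hypothetical, the gammatone bandwidth follows (10.7) with the ERB taken from (10.15), and the default amplitude and phase are arbitrary choices.

```python
import math


def erb_hz(fc_hz):
    """Equivalent rectangular bandwidth from (10.15)."""
    return 24.7 * (0.00437 * fc_hz + 1.0)


def gammatone_impulse_response(t, fc_hz, n=4, amplitude=1.0, phase=0.0):
    """Gammatone impulse response h(t) of (10.5) for t >= 0 (seconds)."""
    bandwidth = 1.019 * erb_hz(fc_hz)          # (10.7), suggested for n = 4
    envelope = amplitude * t ** (n - 1) * math.exp(-2.0 * math.pi * bandwidth * t)
    return envelope * math.cos(2.0 * math.pi * fc_hz * t + phase)


def roex_power(f_hz, fc_hz, p_lower, p_upper):
    """Power spectrum W(f) of the roex(p_l, p_u) filter in (10.6)."""
    if f_hz <= fc_hz:
        g, p = (fc_hz - f_hz) / fc_hz, p_lower
    else:
        g, p = (f_hz - fc_hz) / fc_hz, p_upper
    return (1.0 + p * g) * math.exp(-p * g)
```

With the parameters of Fig. 10.5 ($f_c = 1{,}000$ Hz, $p_l = 40$, $p_u = 10$), `roex_power` equals 1 at the center frequency and falls off much faster 100 Hz below the center than 100 Hz above it, which is the asymmetry the model is meant to capture.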
10.3.3 Bark Scale

From a subband filtering perspective, the location on the basilar membrane where the amplitude maximum occurs for a sinusoidal component may be considered as the point that represents the frequency of the sinusoidal component and as the center frequency of the auditory filter that processes this component. Consequently, the distance from the base of the cochlea or the oval window on the basilar membrane represents a new frequency scale that is different from the linear frequency scale (in Hz) that we are familiar with. A frequency scale that seeks to linearly approximate the frequency scale represented by the basilar membrane length is the critical band rate or Bark scale. Its relationship with the linear frequency scale $f$ (Hz) is empirically determined and may be analytically expressed as follows [102]:

$$z = 13 \arctan(0.00076 f) + 3.5 \arctan\!\left[(f/7{,}500)^2\right] \ \text{Bark} \tag{10.8}$$
and is shown in Fig. 10.6. In fact, one Bark corresponds to a distance of about 1.3 mm along the basilar membrane [102]. The Bark scale is evidently neither linear nor logarithmic with respect to the linear frequency scale. The Bark scale was proposed by Eberhard Zwicker in 1961 [101] and named in memory of Heinrich Barkhausen, who introduced the “phon”, a scale for loudness as perceived by the human ear.
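The Hz-to-Bark mapping of (10.8) can be computed as in the following sketch; the function name is hypothetical.

```python
import math


def hz_to_bark(f_hz):
    """Critical band rate in Bark for a frequency in Hz, following (10.8)."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))
```

Evaluating it gives roughly 8.5 Bark at 1,000 Hz and roughly 24 Bark at 15,500 Hz, which illustrates how the scale compresses high frequencies.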
10.3.4 Critical Bands While complex, the auditory filters exhibit strong frequency selectivity: the loudness of a signal remains constant as long as its energy is within the passband, referred to as the critical band (CB), of an auditory filter and decreases dramatically as the energy moves out of the critical band. The critical bandwidth is a parameter that quantifies the bandwidth of the auditory filter passband. There are a variety of methods for estimating the critical bandwidth. A simple approach is to present a set of uniformly spaced tones with equal power to the
Fig. 10.6 Relationships of the Bark scale with respect to the linear frequency scale (top) and the logarithmic frequency scale (bottom)
listeners and measure the threshold in quiet [102]. For example, to estimate the critical bandwidth near 1,000 Hz, the threshold in quiet is measured by placing more and more equally powered tones starting from 920 Hz with a 20 Hz frequency increment. Figure 10.7 shows that the measured threshold in quiet, which is the total power of all tones, remains at about 3 dB when the number of tones increases from one to eight and begins to increase afterwards. This indicates that the critical bandwidth near 1,000 Hz is about eight tones, or 160 Hz, that starts at 920 Hz and ends at 920 + 160 = 1,080 Hz. To see this, let us denote the power of each tone as $\sigma^2$ and the total number of tones as $n$. Before $n$ reaches eight, the power of each tone is

$$\sigma^2 = \frac{10^{0.3}}{n} \tag{10.9}$$
to maintain a total power of 3 dB in the passband. When $n$ reaches eight, the power of each tone is

$$\sigma^2 = \frac{10^{0.3}}{8}. \tag{10.10}$$

When $n$ is more than eight, only the first eight tones fall into the passband and the others are filtered out by the auditory subband filter. Therefore, the power for each tone given in (10.10) must be kept to maintain a total power of 3 dB in the passband. Consequently, the total power of all tones is
Fig. 10.7 The measurement of critical bandwidth near 1,000 Hz by placing more and more uniformly spaced and equally powered tones starting from 920 Hz with a 20 Hz increment [total power (dB) versus total number of tones]. The total power of all tones indicated by the cross remains constant when the number of tones increases from one to eight and begins to increase afterwards. This indicates that the critical bandwidth near 1,000 Hz is about eight tones, or 160 Hz, that starts at 920 Hz and ends at 920 + 160 = 1,080 Hz. This is so because tones after the eighth are filtered out by the auditory filter, so they do not contribute to the power in the passband. The addition of more tones causes an increase in total power, but the power perceived in the passband remains constant
$$P = \frac{10^{0.3}}{8}\, n. \tag{10.11}$$
To summarize, the total power of all tones as a function of the number of tones is

$$
P =
\begin{cases}
10^{0.3}, & n \le 8;\\[1ex]
\dfrac{10^{0.3}}{8}\, n, & \text{otherwise.}
\end{cases}
\tag{10.12}
$$
This is the curve shown in Fig. 10.7. The method above is valid only in the frequency range between 500 and 2,000 Hz where the threshold in quiet is approximately independent of frequency. For other frequency ranges, more sophisticated methods are called for. One such method, called masking in frequency gap, is shown in Fig. 10.8 [102]. It places a test signal, called a maskee, at the center frequency where the critical bandwidth is to be estimated and then places two masking signals, called maskers, of equal power at equal distance in the linear frequency scale from the test signal. If the power of the test signal is weak relative to the total power of the maskers, the test signal is not audible. When this happens, the test signal is said to be masked
Fig. 10.8 Measurement of critical bandwidth with masking in frequency gap. A test signal is placed at the center frequency f0 where the critical bandwidth is to be measured and two masking signals of equal power are placed at equal distance from the test signal. In (a), the test signal is a narrow-band noise and the two maskers are tones. In (b), two narrow-band noises are used as maskers to mask the tone in the center. The masked threshold versus the frequency separation between the maskers is shown in (c). The frequency separation where the masked threshold begins to drop off may be considered as the critical bandwidth
by the maskers. In order for the test signal to become audible, its power has to be raised to above a certain level, denoted as masked threshold or sometimes masking threshold. When the frequency separation between the maskers is within the critical bandwidth, all of their powers fall into the critical band where the test signal resides and the total masking power is the summation of that of the two maskers. In this case, no matter how wide the frequency separation is, the total power in the critical band is constant, so a constant masked threshold is expected. As the separation becomes wider than the critical bandwidth, part of the powers of the maskers begin to fall out of the critical band and are hence filtered out by the auditory filter. This causes the total masking power in the critical band to become less, so their ability to mask the test signal becomes weaker, causing the masked threshold to decrease accordingly. This is summarized in the curve of the masked threshold versus the frequency separation shown in Fig. 10.8c. This curve is flat or constant for small frequency separations and begins to fall off when the separation becomes larger than a certain value, which may be considered as the critical bandwidth. Data from many subjective tests were collected to produce the critical bandwidth listed in Table 10.1 and shown in Fig. 10.9 [101, 102]. Here the lowest critical band is considered to have a frequency range between 0 and 100 Hz, which includes the inaudible frequency range between 0 and 20 Hz. Some authors may choose to assume the lowest critical band to have a frequency range between 20 and 100 Hz.
Table 10.1 Critical bandwidth as proposed by Zwicker

Band number   Upper frequency boundary (Hz)   Critical bandwidth (Hz)
 1                100                            100
 2                200                            100
 3                300                            100
 4                400                            100
 5                510                            110
 6                630                            120
 7                770                            140
 8                920                            150
 9              1,080                            160
10              1,270                            190
11              1,480                            210
12              1,720                            240
13              2,000                            280
14              2,320                            320
15              2,700                            380
16              3,150                            450
17              3,700                            550
18              4,400                            700
19              5,300                            900
20              6,400                          1,100
21              7,700                          1,300
22              9,500                          1,800
23             12,000                          2,500
24             15,500                          3,500

Fig. 10.9 Critical bandwidth as proposed by Zwicker [critical bandwidth (Hz) versus critical band number]
Many audio applications, such as CD and DVD, deploy sample rates that allow for a frequency range higher than the maximum 15,500 Hz given in Table 10.1. For these applications, one more critical band may be added, which starts from 15,500 Hz and ends with half of the sample rate. The critical bandwidth listed in Table 10.1 might give the wrong impression that critical bands are discrete and nonoverlapping. To emphasize that critical bands are continuously placed on the frequency scale, an analytic expression is usually more useful. One such approximation is given below

$$\Delta f_G = 25 + 75 \left[ 1 + 1.4 \left( \frac{f}{1{,}000} \right)^{2} \right]^{0.69} \ \text{Hz}, \tag{10.13}$$
where f is the center frequency in hertz [102]. In terms of the number of critical bands, Table 10.1 may be approximated by (10.8) where the nth Bark represents the nth critical band. Therefore, the Bark scale is also called the critical band rate scale in the sense that one Bark is equal to one critical bandwidth.
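The analytic approximation (10.13) can be checked against the tabulated values with a short sketch such as the following; the function name is hypothetical.

```python
def critical_bandwidth_hz(f_hz):
    """Critical bandwidth in Hz at center frequency f_hz, following (10.13)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69
```

At a center frequency of 1,000 Hz the formula gives about 162 Hz, close to the 160 Hz listed in Table 10.1 for the band from 920 to 1,080 Hz, and it approaches 100 Hz as the center frequency goes to zero.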
10.3.5 Critical Band Level

Similar to sound intensity level, the critical band level of a sound in a critical band $z$ is defined as the total sound intensity level within the critical band:

$$L(z) = \int_{f \in z} L(f)\, df. \tag{10.14}$$
For tones, the critical band level is obviously the same as the tone’s intensity level. For noise, the critical band level is the total of the noise intensity density levels within the critical band.
10.3.6 Equivalent Rectangular Bandwidth An alternative measure of critical bandwidth is the equivalent rectangular bandwidth (ERB). For a given auditory filter, its ERB is the bandwidth of an ideal rectangular filter which has a passband magnitude equal to the maximum passband gain of the auditory filter and passes the same amount of energy as the auditory filter [53]. A notched-noise method is used to estimate the shape of the roex filter in (10.6), from which the ERB of the auditory filter is obtained [53, 70]. A formula that fits many experimental data well is given below [21, 53]
Fig. 10.10 Comparison of ERB with traditional critical bandwidth (CB) [bandwidth (Hz) versus center frequency (Hz)]
$$\mathrm{ERB} = 24.7\,(0.00437 f + 1) \ \text{Hz}, \tag{10.15}$$
where $f$ is the center frequency in hertz. The ERB formula indicates that the ERB is linear with respect to the center frequency. This is significantly different from the critical bandwidth in (10.13), as shown in Fig. 10.10, especially for frequencies below 500 Hz. It was argued that the ERB given in (10.15) is a better approximation than the traditional critical bands discussed in Sect. 10.3.4 because it is based on new data obtained by a few different laboratories through direct measurement of critical bands using the notched-noise method [53]. One ERB obviously represents one frequency unit in the auditory system, so the number of ERBs corresponds to a frequency scale and is conceptually similar to the Bark scale. A formula for calculating this ERB scale, or the number of ERBs for a center frequency $f$ in hertz, is given below [21, 53]

$$\text{Number of ERBs} = 21.4 \log_{10}(0.00437 f + 1). \tag{10.16}$$
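Both ERB formulas are simple enough to evaluate directly, as in this sketch with hypothetical function names.

```python
import math


def erb_bandwidth_hz(f_hz):
    """Equivalent rectangular bandwidth in Hz, following (10.15)."""
    return 24.7 * (0.00437 * f_hz + 1.0)


def erb_number(f_hz):
    """Number of ERBs (ERB scale value) for f_hz, following (10.16)."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)
```

At 1,000 Hz this gives an ERB of about 133 Hz and an ERB-scale value of about 15.6. At 100 Hz the ERB is about 35 Hz, well below the 100 Hz critical bandwidth listed in Table 10.1, which is the low-frequency discrepancy visible in Fig. 10.10.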
10.4 Simultaneous Masking It is very easy to hear quiet conversation when the background is quiet. When there is sound in the background which may be either another conversation, music or noise, the speaker has to raise his/her volume to be heard. This is a simple example of simultaneous masking.
Fig. 10.11 Masking of a weak sound (dashed lines) by a strong one (solid lines) when they are presented simultaneously and their frequencies are close to each other
While the mechanism behind simultaneous masking is very complex, involving at least the nonlinear basilar membrane and the complex auditory neural system, a simple explanation is offered in Fig. 10.11, where two sound waves are presented to a listener simultaneously. Since their frequencies are close to each other, the excitation pattern of the weaker sound may be completely shadowed by the stronger one. If the basilar membrane is assumed to be linear, the weaker sound cannot be perceived by auditory neurons, hence is completely masked by the stronger one. Apparently, the masking effect is, to a large extent, dependent on the power of the masker relative to the maskee. For audio coding, the masker is considered as the signal and the maskee is the quantization noise that is to be masked, so this relative value is expressed by the signal-to-mask ratio (SMR)

$$\mathrm{SMR} = 10 \log_{10} \frac{\sigma^2_{\text{Masker}}}{\sigma^2_{\text{Maskee}}}. \tag{10.17}$$
In order for a masker to completely mask a maskee, the SMR has to pass a certain threshold, called the SMR threshold

$$T_{\mathrm{SMR}} = \min\{\mathrm{SMR} \mid \text{the maskee is inaudible}\}. \tag{10.18}$$

The negative of this threshold in decibels is called the masking index [102]:

$$I = -T_{\mathrm{SMR}} \ \text{dB}. \tag{10.19}$$
10.4.1 Types of Masking

The frequency components of an audio signal may be considered as either noise-like or tone-like. Consequently, both the masker and the maskee may be either tones or noise, so there are four types of masking, as shown in Table 10.2. Note that the noise here is usually considered as narrow-band with a bandwidth no more than the critical bandwidth. Since the removal of perceptual irrelevance is all about the masking of quantization noise, which is complex and is rarely tone-like, the cases of TMN and NMN are most relevant for audio coding.
Table 10.2 Four types of masking

Maskee \ Masker   Tone                       Noise
Tone              Tone masks tone (TMT)      Noise masks tone (NMT)
Noise             Tone masks noise (TMN)     Noise masks noise (NMN)
10.4.1.1 Tone Masking Tone

For the case of a pure tone masking another pure tone (TMT), both the masker and the maskee are simple, but it has turned out to be very difficult to conduct masking experiments and to build good models, mostly due to the beating phenomenon that occurs when two pure tones with frequencies close to each other are presented to a listener. For example, two pure tones of 990 and 1,000 Hz, respectively, produce a beating of 10 Hz, which causes the listeners to hear something different from the steady-state tone (masker). They then believe that the maskee has been heard, but this is, in fact, different from having actually heard another tone (maskee). Fortunately, quantization noise is rarely tone-like, so the lack of good models in this regard is less of a problem for audio coding. A large number of experiments have, nevertheless, indicated an SMR threshold of about 15 dB [102].
10.4.1.2 Tone Masking Noise

In this case, a pure tone masks the narrow-band noise whose spectrum falls within the critical band in which the tone stands. Since quantization noise is more noise-like, this case is very relevant to audio coding. Unfortunately, there exist relatively few studies to provide a good model for this useful case. A couple of studies, however, do indicate an SMR threshold between 21 and 28 dB [24, 84].

10.4.1.3 Noise Masking Noise

This case of a narrow-band noise masking another one is very relevant to audio coding, but is very difficult to study because of the phase correlations between the masker and the maskee. So it is not surprising that there are few experimental results addressing this important issue. The limited data, however, do suggest an SMR threshold of about 26 dB [23, 52].

10.4.1.4 Noise Masking Tone

A narrow-band noise masking a tone (NMT) is the most widely studied case in psychoacoustics. This type of experiment was used to estimate the critical bandwidth and the excitation patterns of the auditory filters. The masking spreading function to be discussed later in this section is largely based on this kind of study. There are a lot of experimental data and models for NMT. The SMR threshold is generally considered to vary from about 2 dB at low frequencies to 6 dB at high frequencies [102].
Table 10.3 Empirical SMR thresholds for the four masking types

Masking type          TMT    TMN      NMN    NMT
SMR threshold (dB)    15     21–28    26     2–6
10.4.1.5 Practical Masking Index

Table 10.3 summarizes SMR thresholds for the four types of masking discussed above. From this table, we observe that tones have much weaker masking capability than noise and that noise is much more difficult to mask than tones. For audio coding, only TMN and NMT are usually considered, with the following masking index formulas:

$$I_{\mathrm{TMN}} = -14.5 - z \ \text{dB} \tag{10.20}$$

and

$$I_{\mathrm{NMT}} = -K \ \text{dB}, \tag{10.21}$$
where K is a parameter between 3 and 5 dB [32].
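A small sketch of the SMR in (10.17) and the two practical masking indexes follows. It rests on two stated assumptions: the masking indexes are taken as negative quantities, $-(14.5 + z)$ dB and $-K$ dB, in line with the definition $I = -T_{\mathrm{SMR}}$ in (10.19), and $z$ is read as a critical band rate in Bark. The function names and the default $K = 4$ dB (a value inside the quoted 3 to 5 dB range) are choices made for this example only.

```python
import math


def smr_db(masker_power, maskee_power):
    """Signal-to-mask ratio of (10.17) from linear powers."""
    return 10.0 * math.log10(masker_power / maskee_power)


def masking_index_tmn_db(z_bark):
    """Tone-masking-noise index, assumed sign convention: -(14.5 + z) dB."""
    return -(14.5 + z_bark)


def masking_index_nmt_db(k_db=4.0):
    """Noise-masking-tone index, assumed sign convention: -K dB."""
    return -k_db


def is_masked(masker_power, maskee_power, masking_index_db):
    """True if the SMR reaches the SMR threshold T_SMR = -I."""
    return smr_db(masker_power, maskee_power) >= -masking_index_db
```

For example, with a tonal masker at 10 Bark the assumed index is $-24.5$ dB, so noise that is 30 dB below the masker is treated as masked while noise only 20 dB below is not.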
10.4.2 Spread of Masking

The discussion above addresses the situation in which a masker masks maskee(s) within the masker's critical band. The masking effect is no doubt the strongest in this case. However, a masking effect also exists when the maskee is not in the same critical band as the masker. An essential contributing factor for this effect is that the auditory filters are not ideal bandpass filters, so they do not attenuate frequency components outside the passband completely. In fact, the roll-off beyond the passband is rather gradual, as shown in Figs. 10.4 and 10.5. This means that a significant chunk of the masker's power is picked up by the auditory filter of the critical band where the maskee resides, making the maskee less audible. This effect is called the spread of masking. This explanation is apparently very simplistic in the light of the nonlinear basilar membrane and the complex auditory neural system. As discussed in Sect. 10.4.1, it is very difficult to study the masking behavior when both masker and maskees are within the same critical band, especially for the important cases of TMN and NMN. It is therefore no surprise that it is even more difficult to deal with the spread of masking effects. For simplification, the masking spread function $\mathrm{SF}(z_r, z_e)$ is introduced to express the masking effect due to a masker at critical band $z_r$ on maskees at critical band $z_e$. If the masker at critical band $z_r$ has a critical band level of $L(z_r)$, the power leaked to critical band $z_e$, or the critical band level that the maskee's auditory filter picks up from the masker, is

$$L(z_r)\, \mathrm{SF}(z_r, z_e). \tag{10.22}$$
Fig. 10.12 Spreading of masking into neighboring critical bands. The left maskee is audible while the right one is completely masked because it is completely below the masked threshold
If the masking index at critical band $z_e$ due to the masker is $I(z_e)$, the masked threshold at critical band $z_e$ is

$$L_T(z_r, z_e) = I(z_e)\, L(z_r)\, \mathrm{SF}(z_r, z_e). \tag{10.23}$$
This relationship is shown in Fig. 10.12. The basic masking spread function is also shown in Fig. 10.12, which is mostly extracted from data obtained from NMT experiments [102]. It is a triangular function with a slope of about 25 dB per Bark below the masker and 10 dB per Bark above the masker. The slope of 25 dB for the lower half remains almost constant for all frequencies. The slope of 10 dB for the upper half is also almost constant for all frequencies higher than 200 Hz. Therefore, the spread function may be considered as shift-invariant across the frequency scale. As shown in Fig. 10.13, the simple spread function in Fig. 10.12 is captured by Schroeder in the following analytic form [84]:

$$\mathrm{SF}(z) = 15.81 + 7.5\,(z + 0.474) - 17.5 \sqrt{1 + (z + 0.474)^2} \ \text{dB}, \tag{10.24}$$
where z D zr ze Bark;
(10.25)
which signifies that the spread function is frequency shift-invariant. A modified version of the spreading function above is given below:
$$SF(z) = 15.8111389 + 7.5(Kz + 0.474) - 17.5\sqrt{1 + (Kz + 0.474)^2} + 8\min\left\{0,\ (Kz - 0.5)^2 - 2(Kz - 0.5)\right\} \ \text{dB}, \tag{10.26}$$
where $K$ is a tunable parameter. This spreading function is essentially the same as the Schroeder spreading function when $K = 1$, except for the last term, which introduces a dip near the top (see Fig. 10.13) that is intended to model additional nonlinear effects in the auditory system as reported in [102]. This function is used in MPEG Psychoacoustic Model 2 [60] with the following parameter:
$$K = \begin{cases} 3, & z < 0; \\ 1.5, & \text{otherwise}. \end{cases} \tag{10.27}$$
Fig. 10.13 Comparison of Schroeder's, modified Schroeder's ("Schroeder With Dip"), and MPEG Psychoacoustic Model 2 spreading functions. The dips in modified Schroeder's and the MPEG model near the top are intended to model additional nonlinear effects in the auditory system. (Horizontal axis: Bark scale; vertical axis: critical band level in dB.)
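As a rough illustration, (10.24), (10.26), and (10.27) can be evaluated directly. The sketch below is only a transcription of those formulas; the function names `schroeder_sf_db` and `modified_sf_db` are hypothetical and not taken from any standard or library.

```python
import math

def schroeder_sf_db(z):
    """Schroeder spreading function (10.24) in dB; z is the Bark distance of (10.25)."""
    u = z + 0.474
    return 15.81 + 7.5 * u - 17.5 * math.sqrt(1.0 + u * u)

def modified_sf_db(z, k=None):
    """Modified spreading function (10.26) in dB.

    If k is None, the MPEG Psychoacoustic Model 2 choice (10.27) is used:
    K = 3 for z < 0 and K = 1.5 otherwise.
    """
    if k is None:
        k = 3.0 if z < 0 else 1.5
    kz = k * z
    u = kz + 0.474
    # Last term of (10.26): a dip of up to 8 dB near the top of the curve.
    dip = 8.0 * min(0.0, (kz - 0.5) ** 2 - 2.0 * (kz - 0.5))
    return 15.8111389 + 7.5 * u - 17.5 * math.sqrt(1.0 + u * u) + dip

if __name__ == "__main__":
    for z in (-3, -1, 0, 1, 3):
        print(z, round(schroeder_sf_db(z), 2), round(modified_sf_db(z), 2))
```

Both curves are close to 0 dB at $z = 0$, and for large $|z|$ the Schroeder curve approaches the two straight-line slopes of 25 and 10 dB per Bark.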
The three models above are independent of the SPL or critical band level of the masker. While simple, this is not a realistic reflection of the auditory system. A model that accounts for level dependency is the spreading function used in MPEG Psychoacoustic Model 1 given below
$$SF(z, L_r) = \begin{cases} 17(z + 1) - (0.4L_r + 6), & -3 \le z < -1 \\ (0.4L_r + 6)z, & -1 \le z < 0 \\ -17z, & 0 \le z < 1 \\ -(z - 1)(17 - 0.15L_r) - 17, & 1 \le z < 8 \\ 0, & \text{otherwise}, \end{cases} \ \text{dB} \tag{10.28}$$
where Lr is the critical band level of the masker [55]. This spreading function is shown in Fig. 10.14 for masker critical band levels at 20, 40, 60, 80, and 100 dB, respectively. It apparently delivers increased masking for higher masker SPL on both sides of the masking curve to match the nonlinear masking properties of the auditory system [102].
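The level dependency in (10.28) is easy to see numerically. The following sketch evaluates only the four sloped segments of (10.28), i.e., $-3 \le z < 8$; the function name `model1_sf_db` is hypothetical, and raising an error outside that range is a choice made for this example rather than part of the model.

```python
def model1_sf_db(z, level_db):
    """Sloped segments of the Model 1 spreading function (10.28), in dB.

    z        : Bark distance, restricted here to -3 <= z < 8
    level_db : critical band level L_r of the masker in dB
    """
    if -3.0 <= z < -1.0:
        return 17.0 * (z + 1.0) - (0.4 * level_db + 6.0)
    if -1.0 <= z < 0.0:
        return (0.4 * level_db + 6.0) * z
    if 0.0 <= z < 1.0:
        return -17.0 * z
    if 1.0 <= z < 8.0:
        return -(z - 1.0) * (17.0 - 0.15 * level_db) - 17.0
    raise ValueError("this sketch only covers -3 <= z < 8")

if __name__ == "__main__":
    for level in (20, 40, 60, 80, 100):
        print(level, [round(model1_sf_db(z, level), 1) for z in (-2, -0.5, 0.5, 4)])
```

The segments join continuously at $z = -1$, $0$, and $1$, and the slope for $z \ge 1$ becomes shallower as $L_r$ grows.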
Fig. 10.14 The spreading function of MPEG Psychoacoustic Model 1 for masker critical band levels at 20, 40, 60, 80, and 100 dB, respectively. Increased masking is provided for higher masker critical band levels on both sides of the masking curve. (Horizontal axis: Bark scale; vertical axis: critical band level in dB.)
The masking characteristics of the auditory system are also frequency dependent: the masking slope decreases as the masker frequency increases. This dependency is captured by Terhardt in the following model [91]:
$$SF(z, L_r, f) = \begin{cases} (0.2L_r + 230/f - 24)z, & z \ge 0 \\ 24z, & \text{otherwise} \end{cases} \ \text{dB}, \tag{10.29}$$
where $f$ is the masker frequency in hertz. Figure 10.15 shows this model at $L_r = 60$ dB and for $f = 100$, $200$, and $1{,}000$ Hz, respectively.
Fig. 10.15 Terhardt's spreading function at masker critical band level of $L_r = 60$ dB and masker frequencies of $f = 100$, $200$, and $1{,}000$ Hz, respectively. It shows reduced masking slope as the masker frequency increases. (Horizontal axis: Bark scale; vertical axis: critical band level in dB.)
10.4.3 Global Masking Threshold
The masking spread function helps to estimate the masked threshold that a masker in one critical band imposes on maskees in the same or a different critical band. From a maskee's perspective, it is masked by all maskers in all critical bands, including the critical band in which it resides. A question arises as to how those masked thresholds add up for maskees in a particular critical band.
To answer this question, let us consider two maskers, one at critical band $z_{r1}$ and the other at critical band $z_{r2}$, and denote their respective masking spread at critical band $z_e$ as $L_T(z_{r1}, z_e)$ and $L_T(z_{r2}, z_e)$. If one of the masking effects is much stronger than the other, the total masking effect would obviously be dominated by the stronger one. When they are equal, how do the two masking effects "add up"? If it were intensity addition, a 3 dB gain would be expected. If it were sound pressure addition, a gain of 6 dB would be expected.
An experiment using a tone masker placed at a low critical band and a critical-band-wide noise masker placed at a high critical band to mask a tone maskee at a critical band in between (so the three are not in the same critical band) indicates a masking effect gain of 12 dB when the two maskers are of equal power. Even when one is much weaker than the other, a gain between 6 and 8 dB is still observed. Therefore, the "addition" of masking effects is stronger than sound pressure addition and much stronger than intensity addition [102]. When the experiment is performed within the same critical band, however, the gain of masking effect is only 3 dB. This correlates well with intensity addition.
In practical audio coding systems, intensity addition is often performed for simplicity, so the total masked threshold is calculated using the following formula:
$$L_T(z_e) = I(z_e) \sum_{\text{all } z_r} L(z_r)\,SF(z_r, z_e), \quad \text{for all } z_e. \tag{10.30}$$
Since the threshold in quiet establishes the absolute minimal masked threshold, the global masked threshold curve is
$$L_G(z_e) = \max\{L_T(z_e),\ L_Q(z_e)\}, \tag{10.31}$$
where $L_Q(z_e)$ is the critical band level that represents the threshold in quiet. A conservative approach to establish this critical band level is to use the minimal threshold in the whole critical band:
$$L_Q(z_e) = \Delta f(z_e) \min_{f \in z_e} T_q(f), \tag{10.32}$$
where $\Delta f(z_e)$ is the critical bandwidth in hertz for critical band $z_e$.
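The intensity addition of (10.30) followed by the floor of (10.31) can be sketched as follows. This is a minimal illustration under several assumptions that are not in the text: critical bands are indexed by integers one Bark apart, all levels are linear powers rather than decibels, the Schroeder function (10.24) with $z = z_r - z_e$ is used as the spreading function, and the function and parameter names are hypothetical.

```python
import math

def schroeder_sf_db(z):
    """Schroeder spreading function (10.24) in dB."""
    u = z + 0.474
    return 15.81 + 7.5 * u - 17.5 * math.sqrt(1.0 + u * u)

def global_masked_threshold(levels, masking_index, quiet, sf_db=schroeder_sf_db):
    """Global masked threshold per critical band (linear power units).

    levels        : critical band levels L(z_r) of the maskers
    masking_index : masking index I(z_e) per band, as a linear factor
    quiet         : threshold-in-quiet level L_Q(z_e) per band
    sf_db         : spreading function in dB as a function of z = z_r - z_e
    """
    n = len(levels)
    thresholds = []
    for ze in range(n):
        # Intensity addition over all maskers, as in (10.30).
        spread = sum(levels[zr] * 10.0 ** (sf_db(zr - ze) / 10.0) for zr in range(n))
        lt = masking_index[ze] * spread
        # The threshold in quiet is the floor, as in (10.31).
        thresholds.append(max(lt, quiet[ze]))
    return thresholds

if __name__ == "__main__":
    levels = [0.0, 0.0, 1000.0, 0.0, 0.0]   # a single masker in band 2
    index = [0.1] * 5                        # masking index of -10 dB
    quiet = [1e-3] * 5
    print(global_masked_threshold(levels, index, quiet))
```

With a single masker the threshold peaks in the masker's own band and falls off on both sides; with no maskers at all the result is simply the threshold in quiet.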
10.5 Temporal Masking
The simultaneous masking discussed in Sect. 10.4 is under the condition of steady state, i.e., both the masker and maskee are long lasting and in steady state. This steady-state assumption is true most of the time because audio signals may be characterized as consisting of quasistationary episodes which are frequently interrupted by strong transients. Transients bring on masking effects that vary with time. This type of masking is called temporal masking.
Temporal masking may be exemplified by postmasking (postmasker masking), which occurs after a loud transient, such as a gunshot. Immediately after such an event, there are a few moments when most people cannot hear much. In addition to postmasking, there are premasking (premasker masking) and simultaneous masking, as illustrated in Fig. 10.16.
The time period over which premasking can be measured is about 20 ms. During this period, the masked threshold gradually increases with time and reaches the level of simultaneous masking when the masker switches on. The period of strong premasking may be considered as long as 5 ms. Although premasking occurs before the masker is switched on, this does not mean that the auditory system can listen to the future. Instead, it is believed to be caused by the build-up time of the auditory system, which is shorter for strong signals and longer for weak signals. The shorter build-up time of the strong masker enables parts of the masker to build up quickly, which then mask parts of the weak maskee that build up slowly.
Fig. 10.16 Schematic drawing illustrating temporal masking. The masked threshold is indicated by the solid line
Postmasking is much stronger than premasking. It kicks in immediately after the masker is switched off and shows almost no decay for the first 5 ms. Afterwards, it decreases gradually with time for about 200 ms, and this decay cannot be considered exponential. The auditory system integrates sound intensity over a period of 200 ms [102], so the simultaneous masking in Fig. 10.16 may be described by the steady-state models described in the last section. However, if the maskee is switched on shortly after the masker is switched on, there is an overshoot effect that raises the masked threshold about 10 dB above the threshold for steady-state simultaneous masking. This effect may last as long as 10 ms.
10.6 Perceptual Bit Allocation
The optimal bit allocation strategy discussed in Chaps. 5 and 6 stipulates that the minimal overall MSQE is achieved when the MSQE for all subbands are equalized. This is achieved based on the assumption that all quantization noises in all frequency bands are "equal" in terms of their contribution to the overall MSQE, as seen in (5.22) and (6.66).
From the perspective of perceptual irrelevancy, quantization noise in each critical band is not "equal" in terms of perceived distortion, and thus its contribution to the total perceived distortion is not equal. Only quantization noise in those critical bands whose power is above the masked threshold is of perceptual importance. Therefore, the MSQE for each critical band should be normalized by the masked threshold of that critical band in order to assess its contribution to perceptual distortion.
Toward this end, let us define the critical band level of quantization noise in the subband context by rewriting the critical band level defined in (10.14) as
$$\sigma_q^2(z) = \sum_{k \in z} \sigma_{e_k}^2, \tag{10.33}$$
where $\sigma_{e_k}^2$ is again the MSQE for subband $k$. This critical band level of quantization noise may be normalized by the masked threshold of the same critical band using the following NMR (noise to mask ratio):
$$\text{NMR}(z) = \frac{\sigma_q^2(z)}{L_G(z)}. \tag{10.34}$$
Quantization noise for each critical band normalized in this way may be considered as "equal" in terms of its contribution to the perceptually meaningful total distortion. For this reason, NMR can be viewed as the variance of perceptual quantization error. Consequently, the total average perceptual quantization error becomes
$$\sigma_p^2 = \frac{1}{Z} \sum_{\text{all } z} \text{NMR}(z), \tag{10.35}$$
where $Z$ is the number of critical bands. Note that only critical bands with $\text{NMR}(z) > 1$ need to be considered because the other ones are completely inaudible and thus make no contribution to $\sigma_p^2$.
Comparing the formula above with (5.22) and (6.66), we know that, if subbands are replaced by critical bands and quantization noise by perceptual quantization noise $\text{NMR}(z)$, the derivation of optimal bit allocation and coding gain in Sect. 5.2 applies. So the optimal bit allocation strategy becomes: allocate bits to individual critical bands so that the NMR for all critical bands are equalized.
Since $\text{NMR}(z) \le 1$ for a critical band $z$ means that quantization noise in the critical band is completely masked, the bit allocation strategy should ensure that $\text{NMR}(z) \le 1$ for all critical bands if the bit resource is abundant enough.
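To make (10.33) to (10.35) concrete, the sketch below computes the NMR of each critical band from per-subband MSQE values. The data layout is an assumption made for this example: `band_edges` is a hypothetical list in which critical band `z` covers subbands `band_edges[z]` up to but excluding `band_edges[z + 1]`.

```python
def nmr_per_band(subband_msqe, band_edges, global_threshold):
    """Noise-to-mask ratio of every critical band.

    subband_msqe     : MSQE of each subband
    band_edges       : subband index boundaries of the critical bands
    global_threshold : global masked threshold L_G(z) of each critical band
    """
    nmr = []
    for z, lg in enumerate(global_threshold):
        # Critical band level of quantization noise, as in (10.33).
        noise = sum(subband_msqe[band_edges[z]:band_edges[z + 1]])
        # Normalization by the masked threshold, as in (10.34).
        nmr.append(noise / lg)
    return nmr

def average_perceptual_error(nmr):
    """Average perceptual quantization error over all bands, as in (10.35)."""
    return sum(nmr) / len(nmr)

if __name__ == "__main__":
    msqe = [1.0, 1.0, 2.0, 2.0, 4.0, 4.0]
    edges = [0, 2, 4, 6]
    threshold = [4.0, 4.0, 4.0]
    nmr = nmr_per_band(msqe, edges, threshold)
    print(nmr, average_perceptual_error(nmr))
    print("bands with audible noise:", [z for z, v in enumerate(nmr) if v > 1.0])
```

In this toy input only the last band has an NMR above one, so it is the band a bit allocator would have to serve first.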
10.7 Masked Threshold in Subband Domain
The psychoacoustic experiments and theory regarding masking in the frequency domain are mostly based on the Fourier transform, so the DFT is most likely the frequency transform in perceptual models built for audio coding. On the other hand, subband filters are the preferred method for data modeling, so there is a need to translate the masked threshold in the DFT domain into the subband domain. While the frequency scale correspondence between DFT and subbands is straightforward, the correspondence between the magnitude scales is not obvious.
One general approach to address this issue is to use the relative value between signal power and masked threshold, which is essentially the SMR threshold defined in (10.17). Since the masked threshold is given by (10.31), the SMR threshold may be written as
$$T_{\text{SMR}}(z) = \frac{\sigma_{\text{Masker}}^2(z)}{L_G(z)}, \tag{10.36}$$
where $\sigma_{\text{Masker}}^2(z)$ is the power of the masker in critical band $z$. This ratio should be the same in either the DFT or the subband domain, so it can be used to obtain the masked threshold in the subband domain
$$L_G'(z) = \frac{\sigma_{\text{Masker}}'^2(z)}{T_{\text{SMR}}(z)}, \tag{10.37}$$
where $\sigma_{\text{Masker}}'^2(z)$ is the signal power in the subband domain within critical band $z$.
10.8 Perceptual Entropy
Suppose that the $\text{NMR}(z)$ for all critical bands are equalized to $\text{NMR}_0$; then the total MSQE for all subbands within critical band $z$ is
$$\sigma_q^2(z) = L_G(z)\,\text{NMR}_0, \tag{10.38}$$
according to (10.34). Since the normalization of quantization error for all subbands within a critical band by the masked threshold also means that all subbands are quantized together with the same quantization step size, the MSQE for each subband is the same:
$$\sigma_{e_k}^2 = \frac{\sigma_q^2(z)}{k(z)} = \frac{L_G(z)\,\text{NMR}_0}{k(z)}, \quad \text{for all } k \in z, \tag{10.39}$$
where $k(z)$ represents the number of subbands within critical band $z$. Therefore, the SNR for a subband in critical band $z$ is
$$\text{SNR}(k) = \frac{\sigma_{y_k}^2}{\sigma_{e_k}^2} = \frac{\sigma_{y_k}^2\,k(z)}{L_G(z)\,\text{NMR}_0}, \quad \text{for all } k \in z. \tag{10.40}$$
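Due to (2.43), the number of bits assigned to subband $k$ is
$$r_k = \frac{10}{b \log_2 10} \log_2\!\left(\frac{\sigma_{y_k}^2\,k(z)}{L_G(z)\,\text{NMR}_0}\right) - \frac{a}{b}, \quad \text{for all } k \in z, \tag{10.41}$$
so the total number of bits assigned to all subbands within critical band $z$ is
$$r(z) = \sum_{k \in z} \left[\frac{10}{b \log_2 10} \log_2\!\left(\frac{\sigma_{y_k}^2\,k(z)}{L_G(z)\,\text{NMR}_0}\right) - \frac{a}{b}\right]. \tag{10.42}$$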
The average bit rate is then
$$R = \frac{1}{M} \sum_{\text{all } z} \sum_{k \in z} \left[\frac{10}{b \log_2 10} \log_2\!\left(\frac{\sigma_{y_k}^2\,k(z)}{L_G(z)\,\text{NMR}_0}\right) - \frac{a}{b}\right]. \tag{10.43}$$
For perceptual transparency, the quantization noise in each critical band must be below the masked threshold, i.e.,
$$\text{NMR}_0 \le 1. \tag{10.44}$$
A smaller $\text{NMR}_0$ means a lower quantization noise level relative to the masked threshold, so it requires more bits to be allocated to critical bands. Therefore, setting $\text{NMR}_0 = 1$ requires the least number of bits and at the same time ensures that quantization noise is just at the masked threshold. This leads to the following minimum average bit rate:
$$R = \frac{1}{M} \sum_{\text{all } z} \sum_{k \in z} \left[\frac{10}{b \log_2 10} \log_2\!\left(\frac{\sigma_{y_k}^2\,k(z)}{L_G(z)}\right) - \frac{a}{b}\right]. \tag{10.45}$$
The derivation above does not assume any quantization scheme because a general set of parameters $a$ and $b$ has been used. If uniform quantization is used and subband samples are assumed to have the matching uniform distribution, $a$ and $b$ are determined by (2.45) and (2.44), respectively, so the above minimum average bit rate becomes
$$R = \frac{1}{M} \sum_{\text{all } z} \sum_{k \in z} 0.5 \log_2\!\left(\frac{\sigma_{y_k}^2\,k(z)}{L_G(z)}\right). \tag{10.46}$$
To avoid negative values that the logarithmic function may produce, we add one to its argument to arrive at
$$R = \frac{1}{M} \sum_{\text{all } z} \sum_{k \in z} 0.5 \log_2\!\left(1 + \frac{\sigma_{y_k}^2\,k(z)}{L_G(z)}\right), \tag{10.47}$$
which is the perceptual entropy proposed by Johnston [34].
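A direct transcription of (10.47) is short. In the sketch below the argument names and the `band_edges` layout (critical band `z` covers subbands `band_edges[z]` up to but excluding `band_edges[z + 1]`) are assumptions made for illustration.

```python
import math

def perceptual_entropy(subband_power, band_edges, global_threshold):
    """Perceptual entropy of (10.47) in bits per subband sample.

    subband_power    : signal variance of each of the M subbands
    band_edges       : subband index boundaries of the critical bands
    global_threshold : global masked threshold L_G(z) of each critical band
    """
    m = len(subband_power)
    total = 0.0
    for z, lg in enumerate(global_threshold):
        lo, hi = band_edges[z], band_edges[z + 1]
        kz = hi - lo                      # k(z), subbands in critical band z
        for k in range(lo, hi):
            total += 0.5 * math.log2(1.0 + subband_power[k] * kz / lg)
    return total / m

if __name__ == "__main__":
    power = [3.0, 3.0, 0.0, 0.0]
    edges = [0, 2, 4]
    threshold = [2.0, 1.0]
    print(perceptual_entropy(power, edges, threshold))
```

For this toy input each subband of the first band contributes $0.5\log_2(1 + 3 \cdot 2/2) = 1$ bit and the empty second band contributes nothing, so the average over four subbands is 0.5 bit.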
10.9 A Simple Perceptual Model
To calculate perceptual entropy, Johnston proposed a simple perceptual model which has influenced many perceptual models deployed in practical audio coding systems [34]. This model is described here.
A Hann window [67] is first applied to a chunk of 2,048 input audio samples and the windowed samples are transformed into frequency coefficients using a 2,048-point DFT. Since a real input signal produces a DFT spectrum that is symmetric with respect to the zero frequency, only the first half of the DFT coefficients need to be considered. Therefore, the number of subbands $M$ is 1,024.
The magnitude square of the DFT coefficients $P(k)$, $k = 0, 1, \ldots, M - 1$, may be considered as the power spectrum $S_{xx}(e^{j\omega})$ of the input signal, so they are used to calculate the spectral flatness measure defined in (5.65) for each critical band, which is denoted as $\gamma_x^2(z)$ for critical band $z$. If the input signal is noise-like, its spectrum is flat and the spectral flatness measure should be close to one. If the input signal is tone-like, its spectrum is full of peaks and the spectral flatness measure should be close to zero. Therefore, the spectral flatness measure is a good inverse measure for the tonal quality of the input signal, and thus can be used to derive a tonality index such as the following:
$$T(z) = \min\left\{\frac{\gamma_x^2(z)\ \text{(dB)}}{-60},\ 1\right\}. \tag{10.48}$$
Note that the $\gamma_x^2(z)$ above is in decibels. Since the spectral flatness measure is always positive and less than one, its decibel value is always negative. Therefore, the tonality index is always positive. It is also limited to no more than one in the equation above.
The tonality index indicates that the spectral components in critical band $z$ are $T(z)$ degree tone-like and $1 - T(z)$ degree noise-like, so it is used to weigh the masking indexes given in (10.20) and (10.21), respectively, to give a total masking index of
$$I(z) = T(z)\,I_{\text{TMN}}(z) + (1 - T(z))\,I_{\text{NMT}}(z) = -(14.5 + z)\,T(z) - K(1 - T(z)) \ \text{dB}, \tag{10.49}$$
where $K$ is set to a value of 5.5 dB.
The magnitude square of the DFT coefficients $P(k)$, $k = 0, 1, \ldots, M - 1$, may be considered as the signal variance $\sigma_{y_k}^2$, $k = 0, 1, \ldots, M - 1$, so the critical band level defined in (10.14) in the current context may be written as
$$\sigma_y^2(z) = \sum_{k \in z} P(k). \tag{10.50}$$
Using (10.30) with a masking spread function, the cumulative masking effects for each critical band can be obtained as
$$L_T(z) = I(z) \sum_{\text{all } z_r} \sigma_y^2(z_r)\,SF(z_r, z), \quad \text{for all } z. \tag{10.51}$$
Denoting
$$E(z) = \sum_{\text{all } z_r} \sigma_y^2(z_r)\,SF(z_r, z), \tag{10.52}$$
we obtain the total masked threshold in decibels
$$L_T(z) = 10 \log_{10} E(z) + I(z) = 10 \log_{10} E(z) - (14.5 + z)\,T(z) - K(1 - T(z)) \ \text{dB} \tag{10.53}$$
for all critical bands $z$. As usual, this threshold is combined with the threshold in quiet using (10.31) to produce a global threshold, which is then substituted into (10.47) to give the perceptual entropy.
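The tonality-weighted threshold of (10.48), (10.49), and (10.53) can be sketched as below. The sketch assumes the usual definition of the spectral flatness measure as the ratio of the geometric mean to the arithmetic mean of the power spectrum, which is taken here as a stand-in for (5.65); all function names are hypothetical, and the input powers must be strictly positive.

```python
import math

def spectral_flatness_db(power):
    """Spectral flatness in dB, assumed to be geometric mean / arithmetic mean."""
    n = len(power)
    geometric = math.exp(sum(math.log(p) for p in power) / n)
    arithmetic = sum(power) / n
    return 10.0 * math.log10(geometric / arithmetic)

def tonality_index(flatness_db):
    """Tonality index of (10.48): 0 for noise-like, 1 for tone-like bands."""
    return min(flatness_db / -60.0, 1.0)

def masking_index_db(z, tonality, k_db=5.5):
    """Tonality-weighted masking index of (10.49) in dB."""
    return -(14.5 + z) * tonality - k_db * (1.0 - tonality)

def masked_threshold_db(spread_energy, z, tonality, k_db=5.5):
    """Masked threshold of (10.53) in dB; spread_energy is E(z) of (10.52)."""
    return 10.0 * math.log10(spread_energy) + masking_index_db(z, tonality, k_db)

if __name__ == "__main__":
    flat = [1.0, 1.0, 1.0, 1.0]       # noise-like band
    peaky = [1e8, 1.0, 1.0, 1.0]      # tone-like band
    for band in (flat, peaky):
        t = tonality_index(spectral_flatness_db(band))
        print(round(t, 3), round(masked_threshold_db(100.0, 10, t), 2))
```

A flat band yields a tonality index of zero and only the 5.5 dB noise offset is applied, whereas a strongly peaked band approaches the full $14.5 + z$ dB tone offset.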
Chapter 11
Transients
An audio signal often consists of quasistationary episodes, each including a number of tonal frequency components, which are frequently interrupted by dramatic transients. To achieve optimal energy compaction and thus coding gain, a filter bank with fine frequency resolution is necessary to resolve the tonal components or fine frequency structures in quasistationary episodes. But this filter bank is an ill fit for transients, which often last for no more than a few samples and hence require fine time resolution for optimal energy compaction. Therefore, filter banks with both good time and frequency resolution are needed to effectively code audio signals.
According to the Fourier uncertainty principle, however, a filter bank cannot have fine frequency resolution and high time resolution simultaneously. One approach to mitigate this problem is to adapt the resolution of a filter bank in time to match the conflicting resolution requirements posed by transients and quasistationary episodes.
This chapter presents a variety of contemporary schemes for switching the time–frequency resolution of MDCT, the preferred filter bank for audio coding. Also presented are practical methods for mitigating pre-echo artifacts that sometimes occur when the time resolution is not good enough to effectively deal with transients. Finally, switching the time–frequency resolution of a filter bank requires knowledge of the occurrence and location of transients. Practical methods for detecting and locating transients are presented at the end of this chapter.
11.1 Resolution Challenge
Audio signals mostly consist of quasistationary episodes, such as the one shown at the top of Fig. 11.1, which often include a number of tonal frequency components. For effective energy compaction to maximize coding gain, these tonal frequency components may be resolved or separated using filter banks, as shown by the MDCT magnitude spectra of 1,024 and 128 subbands in the middle and bottom of Fig. 11.1, respectively.
Apparently, the 1,024-subband MDCT is able to resolve the frequency components much better than the 128-subband MDCT, thus having a clear advantage in energy compaction. It can be expected that the closer together the frequency components are, the more subbands are needed to resolve them.
A filter bank with a large number of subbands, conveniently referred to as a long filter bank in this book, is able to deliver this advantage because using a large number of subbands to represent the full frequency range, as determined by the sample rate, means that each subband is allocated a small frequency range, hence the frequency resolution is finer. Also, such a filter bank has a long prototype filter that covers a large number of time samples, hence is able to resolve minute signal variations with frequency.
Fig. 11.1 An episode of quasistationary audio signal (top) and its MDCT spectra of 1,024 (middle) and 128 (bottom) subbands. (Top panel: amplitude versus time; middle and bottom panels: magnitude in dB versus frequency in kHz.)
Fig. 11.2 A transient that interrupts quasistationary episodes of an audio signal (top) and its MDCT spectra of 1,024 (middle) and 128 (bottom) subbands, respectively. (Top panel: amplitude versus time; middle and bottom panels: magnitude in dB versus frequency in kHz.)
Unfortunately, quasistationary episodes of an audio signal are intermittently interrupted by dramatic transients, as shown at the top of Fig. 11.2. Applying the same 1,024-subband and 128-subband MDCT produces the spectra shown in the middle and bottom of Fig. 11.2, respectively. Now the short MDCT resolves the spectral valleys better than the long MDCT, so it is more effective in energy compaction.
There is another reason for the improved overall energy compaction performance of a short filter bank. Transients are well known for causing flat spectra, hence a large spectral flatness measure and bad coding gain, according to (5.65) and (6.81). As long as the prototype filter covers a transient attack, this spectral flattening effect is reflected in the whole block of subband samples. For overlapping filter banks, this affects multiple blocks of subband samples whose prototype filter covers the transient. For a long filter bank with a long prototype filter, this flattening effect affects a large number of subband samples, resulting in low coding gain for a larger number of subband samples. Using a short filter bank with a short prototype filter, however, helps to isolate this spectral flattening effect to a smaller number of subband samples. Before and after those affected subband samples, the coding gain may go back to normal. Therefore, applying a short filter bank to cover the transients improves the overall coding gain.
11.1.1 Pre-Echo Artifacts
Another reason that favors short filter banks when dealing with transients is pre-echo artifacts. Quantization is the step in an audio coder that compresses the signal most effectively, but it also introduces quantization noise. Under a filter bank scheme, the quantization noise introduced in the subband or frequency domain becomes almost uniformly distributed in the time domain after the audio signal is reconstructed from the quantized subband samples. This quantization noise is shown at the top of Fig. 11.3, which is the difference signal between the reconstructed signal and the original signal. When looking at the reconstructed signal alone (see bottom of Fig. 11.3), however, the quantization noise is not visible after the transient attack because it is visually masked by the signal. For the ear, it is also not audible due to simultaneous masking and postmasking. Before the transient attack, however, it is clearly visible. For the ear, it is also very audible and frequently very annoying because it is supposed to be quiet before the transient attack (see the original signal at the top of Fig. 11.2). This frequently annoying quantization noise that occurs before the transient attack is called pre-echo artifacts.
Fig. 11.3 Pre-echo artifacts produced by a long 1,024-subband MDCT. The top figure shows the error between the reconstructed signal and the original, which is, therefore, the quantization noise. The bottom shows the reconstructed signal alone; the quantization noise before the transient attack is clearly visible and audible. (Both panels: amplitude versus time.)
One approach to mitigate pre-echo artifacts is to use a short filter bank whose fine time localization or resolution would help at least limit the extent over which the artifacts appear. For example, a short 128-subband MDCT is used to process the same transient signal to give the reconstructed signal and the quantization noise in Fig. 11.4. The quantization noise that occurs before the transient attack is still visible, but is much shorter, less than 5 ms in fact. As discussed in Sect. 10.5, the period of strong premasking may be considered as long as 5 ms, so the short pretransient quantization noise is unlikely to be audible.
Fig. 11.4 Pre-echo artifacts produced by a short 128-subband MDCT. The top figure shows the quantization noise and the bottom the reconstructed signal. The quantization noise before the transient attack is still visible, but may be inaudible because it could be shorter than premasking, which may last as long as 5 ms. (Both panels: amplitude versus time.)
For a 128-subband MDCT, the window size is 256 samples, which covers a period of $256/44.1 \approx 5.8$ ms for an audio signal of 44.1 kHz sample rate. In order for significant quantization noise to build up, the MDCT window must cover a significant amount of signal energy, so the number of input samples after the transient attack and still covered by the MDCT window must be significant. This means that the number of input samples before the transient attack is much smaller than 256, so the period for pre-echo artifacts is much shorter than 5.8 ms and thus is very likely to be masked by premasking. Therefore, a 128-subband MDCT is likely to suppress most pre-echo artifacts.
11.1.2 Fourier Uncertainty Principle
Now it is clear that a filter bank needs to have both fine frequency resolution and fine time resolution to effectively encode both transients and quasistationary episodes of audio signals. The time–frequency resolution of a filter bank is largely determined by its number of subband samples. For modulated filter banks, this is often reflected in the length of the prototype filter. A long prototype filter has better frequency resolution but poor time resolution. A short prototype filter (said to be compactly supported) has good time resolution but poor frequency resolution. There is no filter that can provide both good time and frequency resolution at the same time due to the Fourier uncertainty principle, which is related to the Heisenberg uncertainty principle [90].
Without loss of generality, let $h(t)$ denote a signal that is normalized as follows:
$$\int_{-\infty}^{\infty} |h(t)|^2\,dt = 1 \tag{11.1}$$
and $H(f)$ its Fourier transform. The dispersion about zero in the time and frequency domains may be defined by
$$D_t = \int_{-\infty}^{\infty} t^2 |h(t)|^2\,dt \tag{11.2}$$
and
$$D_f = \int_{-\infty}^{\infty} f^2 |H(f)|^2\,df, \tag{11.3}$$
respectively. It is obvious that they represent the energy concentration of $h(t)$ and $H(f)$ toward zero in the time and frequency domains, respectively, hence their respective time and frequency resolutions. The Fourier uncertainty principle states that [75]
$$D_t D_f \ge \frac{1}{16\pi^2}. \tag{11.4}$$
The equality is attained only in the case that $h(t)$ is a Gaussian function.
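The bound (11.4) can be checked numerically. The sketch below is an illustration under stated assumptions rather than a proof: it approximates the integrals by Riemann sums on a uniform grid, divides by the signal energy so that the test signal need not be pre-normalized as in (11.1), and evaluates $D_f$ in the time domain through the standard identity $\int f^2 |H(f)|^2\,df = \frac{1}{4\pi^2}\int |h'(t)|^2\,dt$ (Parseval's theorem applied to the derivative). The function name `dispersions` is hypothetical.

```python
import math

def dispersions(h, t_min, t_max, n=20001):
    """Approximate (D_t, D_f) of (11.2) and (11.3) for a real signal h(t)."""
    dt = (t_max - t_min) / (n - 1)
    ts = [t_min + i * dt for i in range(n)]
    hs = [h(t) for t in ts]
    energy = sum(v * v for v in hs) * dt
    d_t = sum(t * t * v * v for t, v in zip(ts, hs)) * dt / energy
    # Central differences approximate h'(t); then D_f = int |h'|^2 dt / (4 pi^2).
    deriv_energy = sum(((hs[i + 1] - hs[i - 1]) / (2.0 * dt)) ** 2
                       for i in range(1, n - 1)) * dt
    d_f = deriv_energy / (4.0 * math.pi ** 2 * energy)
    return d_t, d_f

if __name__ == "__main__":
    bound = 1.0 / (16.0 * math.pi ** 2)
    gauss = dispersions(lambda t: math.exp(-t * t), -8.0, 8.0)
    tri = dispersions(lambda t: max(0.0, 1.0 - abs(t)), -2.0, 2.0)
    print("Gaussian product / bound:", gauss[0] * gauss[1] / bound)
    print("Triangle product / bound:", tri[0] * tri[1] / bound)
```

The Gaussian comes out at the bound to within numerical error, while a triangular pulse exceeds it by a factor of about 1.2.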
Although the Gaussian function provides the optimal simultaneous time–frequency resolution under the uncertainty principle, it is still within the limit stipulated by the uncertainty principle, and thus does not deliver the level of time–frequency resolution desired for audio coding. In addition, it has infinite support and clearly does not satisfy the power-complementary condition for use in CMFB, so it is not suitable for practical audio coding systems.
Providing simultaneous time–frequency resolution is one of the motivations behind the creation of the wavelet transform, which may be viewed as a nonuniform filter bank. Its high-frequency basis functions are short to give good time resolution for high-frequency components, and its low-frequency basis functions are long to provide good frequency resolution for low-frequency components, so it does not violate the Fourier uncertainty principle. This approach can address the time–frequency resolution problems in many areas, such as images and video, but it is not very suitable for audio. Audio signals often contain tones at high frequencies, which require fine frequency resolution at high frequencies and thus cannot be effectively handled by a wavelet transform.
11.1.3 Adaptation of Resolution with Time
A general approach for mitigating this limitation on time–frequency resolution is to adapt the time–frequency resolution of a filter bank with time: deploy high frequency resolution to code quasistationary episodes and high time resolution to localize transients.
This may be implemented using a hybrid filter bank in which each subband of the first stage filter bank is cascaded with a transform as the second stage, as shown in Fig. 11.5. Around a transient attack, only the first stage is deployed, whose good time resolution helps to isolate the transient attack and limit pre-echo artifacts. For quasistationary episodes, the second stage is deployed to further decompose the subband samples from the first stage so that a much better frequency resolution is delivered. It is desirable that the first stage filter banks have good time resolution while the second stage transforms have good frequency resolution.
Fig. 11.5 A hybrid filter bank for adaptation of time–frequency resolution with time. For transients, only the first stage is deployed to provide limited frequency resolution but better time localization. For quasistationary episodes, the second stage is cascaded to each subband of the first stage to boost frequency resolution
A variation of this scheme is to replace the transform in the second stage with linear prediction, as shown in Fig. 11.6. The linear prediction in each subband is switched on whenever the prediction gain is large enough and off otherwise. This approach is deployed by DTS Coherent Acoustics, where the first stage is a 32-band CMFB with a 512-tap prototype filter [88]. The DTS scheme suffers from the poor time resolution of the first filter bank stage because 512 taps translate into $512/44.1 \approx 11.6$ ms for a sample rate of 44.1 kHz, far longer than the effective premasking period of no more than 5 ms.
Fig. 11.6 Cascading linear predictors with a filter bank to adapt time–frequency resolution with time. Linear prediction is optionally applied to the subband samples in each subband if the resultant prediction gain is sufficiently large
A more involved but computationally inexpensive scheme is to switch the number of subbands of an MDCT in such a way that a small number of subbands is deployed to code transients and a large number of subbands to code quasistationary episodes. It seems to have become the dominant scheme in audio coding for the adaptation of time–frequency resolution with time, as can be seen in Table 11.1.
Table 11.1 Switched-window MDCT used by various audio coding algorithms
Audio coder                    Number of subbands
Dolby AC-2A [10]               128/512
Dolby AC-3 [11]                128/256
Sony ATRAC [92]                32/128 and 32/256
Lucent PAC [36]                128/1,024
MPEG 1&2 Layer 3 [55, 56]      6/18
MPEG 2&4 AAC [59, 60]          128/1,024
Xiph.Org Vorbis [96]           64, 128, 256, 512, 1,024, 2,048, 4,096 or 8,192
Microsoft WMA [95]             64, 128, 256, 512, 1,024 or 2,048
Digirise DRA [98]              128/1,024
This switched MDCT is often cascaded to the output of a CMFB in some audio coding algorithms. For example, it is deployed by MPEG-1&2 Layer III [55, 56], whose first stage is a 32-band CMFB with a prototype filter of 512 taps and whose second stage is an MDCT which switches between 6 and 18 subbands. In this configuration, the resolution adaptation is actually achieved through the switched MDCT. This scheme suffers from poor time resolution because the combined prototype filter length is $512 + 2 \times 6 \times 32 = 896$ even when the MDCT is in short mode. This amounts to $896/44.1 \approx 20.3$ ms for a sample rate of 44.1 kHz, which is far longer than the effective premasking period of no more than 5 ms.
A similar scheme is used by MPEG-2&4 AAC in its gain control tool box [59, 60]. It deploys as the first stage a 4-subband CMFB with a 96-tap prototype filter and as the second stage an MDCT that switches between 32 and 256 subbands. This scheme seems to be able to barely avoid pre-echo artifacts due to its short combined prototype filter of $96 + 2 \times 32 \times 4 = 352$ taps, which amount to $352/44.1 \approx 8.0$ ms for a sample rate of 44.1 kHz.
A more sophisticated scheme is used by Sony ATRAC, deployed in its MiniDisc and Sony Dynamic Digital Sound (SDDS) cinematic sound system, which involves cascading three stages of filter banks [92]. The first stage is a quadrature mirror filter bank (QMF) with two subbands. Its low-frequency subband is connected to another two-subband QMF, the outputs of which are connected to MDCTs which switch between 32 and 128 subbands. The high-frequency subband from the first stage QMF is connected to an MDCT that switches between 32 and 256 subbands. The combined short prototype filter lengths are 2.9 ms for the low-frequency subbands and 1.45 ms for the high-frequency subband, so they are within the safe zone of premasking.
In place of the short MDCT, Lucent EPAC deploys a wavelet transform to handle transients. It still uses a 1,024-subband MDCT to process quasistationary episodes [87].
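The time spans quoted above follow from simple arithmetic, reproduced in the sketch below. The helper names `span_ms` and `combined_taps` are hypothetical; `combined_taps` merely encodes the expression used in the text, namely the first-stage prototype length plus an MDCT window of twice the number of short-mode subbands, each subband sample spanning as many input samples as there are first-stage bands.

```python
def span_ms(taps, sample_rate_hz=44100.0):
    """Duration in milliseconds covered by a filter of the given length."""
    return 1000.0 * taps / sample_rate_hz

def combined_taps(first_stage_taps, mdct_subbands, first_stage_bands):
    """Combined prototype filter length of a CMFB followed by an MDCT."""
    return first_stage_taps + 2 * mdct_subbands * first_stage_bands

if __name__ == "__main__":
    print("128-subband MDCT window:", round(span_ms(256), 1), "ms")
    print("512-tap CMFB prototype:", round(span_ms(512), 1), "ms")
    layer3 = combined_taps(512, 6, 32)
    print("Layer III short mode:", layer3, "taps,", round(span_ms(layer3), 1), "ms")
    aac_gc = combined_taps(96, 32, 4)
    print("AAC gain control short mode:", aac_gc, "taps,", round(span_ms(aac_gc), 1), "ms")
```

Comparing each result against the roughly 5 ms of strong premasking shows at a glance which configurations leave room for audible pre-echo.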
11.2 Switched-Window MDCT
A switched-window MDCT, or switched MDCT, operates in long filter bank mode (with a long window function) to handle quasistationary episodes of the audio signal, switches to short filter bank mode (with a short window function) around a transient attack, and reverts to the long mode afterwards. A widely used scheme is 1,024 subbands for the long mode and 128 subbands for the short mode.
11.2.1 Relaxed PR Conditions and Window Switching
To ensure perfect reconstruction, the linear phase condition (7.11) and the power-complementary condition (7.80) must be satisfied. Since these conditions impose symmetry and power-complementarity constraints on each half of the window function, it seems impossible to change either the window shape or the number of subbands.
[Figure 11.7: two panels, each plotting a left window and a right window (amplitude 0 to 1) against sample index (0 to 3000). The current block is marked, and the second half of the right window is labeled "Any Valid Window".]
Fig. 11.7 The PR conditions only apply to two window halves that operate on the current block of input samples, so the second half of the right window can have other shapes that satisfy the PR conditions on the next block
An inspection of the MDCT operation in Fig. 7.15, however, indicates that each block of input samples is operated on by two window functions only. This is illustrated in Fig. 11.7, where the middle block marked by the dotted lines is considered as the current block. The first half of the left window, denoted as $h_L(n)$, operates on the previous block and is thus not related to the current block. Similarly, the second half of the right window, denoted as $h_R(n)$, operates on the next block and is thus not related to the current block, either. Only the second half of the left window $h_L(n)$ and the first half of the right window $h_R(n)$ operate on the current block.

To ensure perfect reconstruction for the current block of samples, only the window halves covering it need to be constrained by the PR conditions (7.11) and (7.80). Therefore, they can be rewritten as
\[
h_R(n) = h_L(2M - 1 - n) \tag{11.5}
\]
and
\[
h_R^2(n) + h_L^2(M + n) = \alpha, \tag{11.6}
\]
for $n = 0, 1, \ldots, M - 1$, respectively.
Let us do a variable change of $n = M + m$ to the index of the left window so that $m = 0$ corresponds to the middle of the left window, or the start of the second half of the left window:
\[
\bar{h}_L(m) = h_L(M + m), \quad -M \le m \le M - 1. \tag{11.7}
\]
This enables us to rewrite the PR conditions as
\[
h_R(n) = \bar{h}_L(M - 1 - n) \tag{11.8}
\]
and
\[
h_R^2(n) + \bar{h}_L^2(n) = \alpha, \tag{11.9}
\]
for $n = 0, 1, \ldots, M - 1$, respectively. Therefore, the linear phase condition (11.8) states that the second half of the left window and the first half of the right window are symmetric to each other with respect to the center of the current block, and the power-complementary condition (11.9) states that these two window halves are power complementary at each point within the current block.

Since these PR conditions are imposed on window halves that apply only to the current block of samples, the window halves in the next block, or in any other block, are totally independent of those in the current block. Therefore, different blocks can have totally different sets of window halves, and these can change in either shape or length from block to block. In other words, windows can be switched from block to block.
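As a minimal sketch of conditions (11.8) and (11.9), the check below tests two window halves that cover the same block. The helper names and the small block length are assumptions made for illustration, the sine window is used because the text later names it as a valid choice, and $\alpha = 1$ is assumed.

```python
import math


def sine_window(M):
    """2M-tap sine window: sin(pi * (n + 0.5) / (2M)), n = 0 .. 2M-1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * M)) for n in range(2 * M)]


def satisfies_pr(left_second_half, right_first_half, alpha=1.0, tol=1e-12):
    """Check (11.8) and (11.9) for the two window halves covering one block."""
    M = len(left_second_half)
    if len(right_first_half) != M:
        return False
    for n in range(M):
        mirrored = abs(right_first_half[n] - left_second_half[M - 1 - n]) < tol
        power = abs(right_first_half[n] ** 2 + left_second_half[n] ** 2 - alpha) < tol
        if not (mirrored and power):
            return False
    return True


M = 16                       # illustrative block length
w = sine_window(M)
h_bar_L = w[M:]              # second half of the left window
h_R = w[:M]                  # first half of the right window
print(satisfies_pr(h_bar_L, h_R))          # sine halves pass both conditions
print(satisfies_pr([1.0] * M, h_R))        # a flat half paired with a sine half fails
```

Only the two halves passed to `satisfies_pr` are constrained; nothing is said about the second half of the right window, which is what leaves room for switching.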
11.2.2 Window Sequencing
While the selection of window halves is independent from block to block, both halves of a window are applied together to two blocks of samples to generate one block of MDCT coefficients. This imposes a small constraint on how window halves transition from one block to another. To see this, let us consider the current block of input samples marked in Fig. 11.7. It is used together with the previous block by the left window to produce the current block of MDCT coefficients. Once this block of MDCT coefficients is generated and transferred to the decoder, the second half of the left window is determined and cannot be changed. When it is time to produce the next block of MDCT coefficients, the second half of the left window has therefore already been fixed. The PR conditions then dictate that its symmetric and power-complementary window half must be used as the first half of the right window. Therefore, the first half of the right window is also completely determined and there is no flexibility for change.
This, however, does not restrict the selection of the second half of the right window which can be totally different from the first half, thus enabling switching to a different window.
11.3 Double-Resolution Switched MDCT
The simplest form of window switching is obviously the case of using a long window to process quasistationary episodes and a short window to deal with transients. This set of window lengths represents two modes of time–frequency resolution.
11.3.1 Primary and Transitional Windows
The first step toward this simple case of window switching is to build a set of long and short primary windows using a window design procedure. Conveniently denoted as $h^M(n)$ and $h^m(n)$, respectively, where $M$ designates the block length or the number of subbands enabled by the long window and $m$ that by the short window, they must satisfy the original PR conditions (7.11) and (7.80), i.e., they are symmetric and power-complementary with respect to their respective middle points. Without loss of generality, the sine window can always be used for this purpose and is shown as the first and second windows in Fig. 11.8, where $m = 128$ and $M = 1{,}024$.

Since the long and short windows are totally different in length, corresponding to different block lengths, transitional windows are needed to bridge this transition of block sizes. To maintain fairly good frequency response, a smooth transition in window shape is highly desirable and abrupt change should be avoided. An example set of such transitional windows is given in Fig. 11.8 as the last three windows. In particular, window WL_L2S is a long window for transition from a long window to a short window, WL_S2L for transition from a short window to a long window, and WL_S2S for transition from a short window to a short window, respectively.

A moniker in the form of WX_Y2Z has been used to identify the different windows above, where "X" designates the total window length, "Y" the left half, and "Z" the right half of the window. For example, WL_L2S designates a long window for transition from a long window to a short window and WS_S2S designates a short primary window, which may be considered as a short window for "transition" from a short window to a short window.

Mathematically, let us denote the left and right halves of the long window as $h_L^M(n)$ and $h_R^M(n)$ and those of the short window as $h_L^m(n)$ and $h_R^m(n)$, respectively. Then the two primary windows can be rewritten as
[Figure 11.8: five panels titled WS_S2S, WL_L2L, WL_L2S, WL_S2L and WL_S2S, each plotting window amplitude (0 to 1) against tap index (0 to 2000).]
Fig. 11.8 Window functions produced from a set of 2048-tap and 256-tap sine windows. Note that the length of WS_S2S is only 256 taps
WS_S2S:
\[
h_{S\_S2S}(n) =
\begin{cases}
h_L^m(n), & 0 \le n < m; \\
h_R^m(n), & m \le n < 2m
\end{cases}
\tag{11.10}
\]
and WL_L2L:
\[
h_{L\_L2L}(n) =
\begin{cases}
h_L^M(n), & 0 \le n < M; \\
h_R^M(n), & M \le n < 2M.
\end{cases}
\tag{11.11}
\]
The transitional windows can then be expressed in terms of the long and short window halves as follows:

WL_L2S:
\[
h_{L\_L2S}(n) =
\begin{cases}
h_L^M(n), & 0 \le n < M; \\
1, & M \le n < 3M/2 - m/2; \\
h_R^m(n), & 3M/2 - m/2 \le n < 3M/2 + m/2; \\
0, & 3M/2 + m/2 \le n < 2M.
\end{cases}
\tag{11.12}
\]
WL_S2L:
\[
h_{L\_S2L}(n) =
\begin{cases}
0, & 0 \le n < M/2 - m/2; \\
h_L^m(n), & M/2 - m/2 \le n < M/2 + m/2; \\
1, & M/2 + m/2 \le n < M; \\
h_R^M(n), & M \le n < 2M.
\end{cases}
\tag{11.13}
\]
WL_S2S:
\[
h_{L\_S2S}(n) =
\begin{cases}
0, & 0 \le n < M/2 - m/2; \\
h_L^m(n), & M/2 - m/2 \le n < M/2 + m/2; \\
1, & M/2 + m/2 \le n < 3M/2 - m/2; \\
h_R^m(n), & 3M/2 - m/2 \le n < 3M/2 + m/2; \\
0, & 3M/2 + m/2 \le n < 2M.
\end{cases}
\tag{11.14}
\]
Figure 11.9 shows some window switching examples using the primary and transitional windows in Fig. 11.8. Transitional windows are placed back to back in the second and third rows and eight short windows are placed in the last row. As shown in Fig. 11.8, the long-to-short transitional windows WL_L2S and WL_S2S provide for a short window half to be placed in the middle of their second half. In the coordinate of the long transitional window, the short window must be
[Figure 11.9: four rows of window sequences, each plotting window amplitude (0 to 1) against sample index (0 to 4000).]
Fig. 11.9 Some possible window sequence examples
placed within $[3M/2 - m/2,\ 3M/2 + m/2]$ (see (11.12) and (11.14), respectively). After $3M/2 + m/2$, the long transitional window does not impose any constraint because its window values have become zero, so other methods can be used to represent the samples beyond $3M/2 + m/2$. The same argument applies to the first half of the short-to-long transitional windows WL_S2L and WL_S2S as well (see Fig. 11.8), leading to no constraint placed on samples before $M/2 - m/2$, so other signal representation methods can be accommodated for the samples before $M/2 - m/2$.

MDCT with the short window function WS_S2S is a simple method for representing those samples. Due to the overlapping of $m$ samples between the short windows and transitional windows, $M/m$ short windows need to be used to represent the samples between the long-to-short (WL_L2S and WL_S2S) and short-to-long (WL_S2L and WL_S2S) transitional windows. See the last row of Fig. 11.9, for example. These $M/m$ short windows amount to a total of $M$ samples, which is the same as the block length of the long windows, so this window switching scheme is amenable to maintaining a constant frame size, which is highly desirable for convenient real-time processing of signals. For the sake of convenience, a long block may sometimes be referred to as the frame in the remainder of this book.

Under such a scheme, the short window represents the fine time resolution mode and the long windows, including the transitional ones, represent the fine frequency resolution mode, thus amounting to a double-resolution switched MDCT.

If a constant frame size is forgone, other numbers of short windows can be used. This typically involves more sophisticated window sequencing and buffer management.

If the size of the short window is set to zero, i.e., $m = 0$, there is absolutely no need for any form of short window. The unconstrained regions before $M/2$ and after $3M/2$ are open for any kind of representation method.
For example, a DCT-IV or wavelet transform may be deployed to represent samples in those regions.
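The window definitions (11.10) to (11.14) and the placement of $M/m$ short windows between two transitional windows can be sketched as follows. The helper names are hypothetical, the sine window is assumed as the model window, and the short window halves are indexed from the start of their own segment rather than by the long-window index $n$. The check accumulates the squared window values along the time axis and confirms that they sum to one over the two blocks spanned by a WL_L2S window, eight short windows and a WL_S2L window.

```python
import math


def sine_half(N, right=False):
    """Rising (left) or falling (right) half of a 2N-tap sine window."""
    offset = N if right else 0
    return [math.sin(math.pi * (n + offset + 0.5) / (2 * N)) for n in range(N)]


def long_window(M, m, left, right):
    """2M-tap window WL_<left>2<right>, with left and right in {"L", "S"}."""
    pad = (M - m) // 2
    if left == "L":
        first = sine_half(M)
    else:                                   # zeros, short rising half, ones
        first = [0.0] * pad + sine_half(m) + [1.0] * pad
    if right == "L":
        second = sine_half(M, right=True)
    else:                                   # ones, short falling half, zeros
        second = [1.0] * pad + sine_half(m, right=True) + [0.0] * pad
    return first + second


def power_sum(M, m):
    """Sum of squared windows for WL_L2S, M/m short windows, then WL_S2L."""
    total = [0.0] * (4 * M)

    def add(window, start):
        for n, value in enumerate(window):
            total[start + n] += value * value

    add(long_window(M, m, "L", "S"), 0)
    short = sine_half(m) + sine_half(m, right=True)       # WS_S2S
    for k in range(M // m):
        add(short, 3 * M // 2 - m // 2 + k * m)
    add(long_window(M, m, "S", "L"), 2 * M)
    return total


M, m = 1024, 128
total = power_sum(M, m)
print(all(abs(v - 1.0) < 1e-12 for v in total[M:3 * M]))
```

A flat power sum is only the power-complementary half of the PR conditions; the mirror symmetry holds here by construction because every overlap pairs the two halves of the same sine window.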
11.3.2 Look-Ahead and Window Sequencing
It is clear now that the interval for possible short window placement is $[M/2 + m/2,\ 3M/2 + m/2)$ (see the last row of Fig. 11.9). Referred to as the transient detection interval, it obviously straddles two frames: about half of the short windows are placed in the current frame and the other half in the next frame. If there is a transient in the interval, the short windows should be placed to cover it; otherwise, one of the long windows should be used. In preparation for placing the short windows in this interval, a long-to-short transitional window WL_X2S needs to be used in the current frame, where the "X" can be either "L" or "S" and is determined in the previous frame. Therefore, to determine the window for the current frame we need to look ahead to the transient detection interval for the presence or absence of transients. Since this interval ends
Table 11.2 Possible window switching scenarios

Current window half    Next window half
WL_X2S                 WS_S2S
WL_X2L                 WL_L2X
WS_S2S                 WS_S2S
WS_S2S                 WL_S2X
in the middle of the next frame, we need a look-ahead interval of up to half a frame, which causes additional coding delay. This look-ahead interval of half a frame is also necessary for other window switching situations, such as the transition from short windows to long windows (see the last row of Fig. 11.9 again).

If there is a transient in the transient detection interval, the short windows have to be placed to cover it. Since the short windows only cover the second half of the current frame, the first half of the current frame needs to be covered by a WL_X2S long window. The short windows also cover the first half of the next frame, whose second half is covered by a WL_S2X long window. This completes the transition from a long window to short windows and back to a long window. Table 11.2 summarizes the possible window transition scenarios between two frames.

The decoder needs to know exactly what window the encoder used to generate the current block of MDCT coefficients in order to use the same window to perform the inverse MDCT, so information about window sequencing needs to be included in the bit stream. This can be done using three bits to convey a window index that identifies the windows in (11.10) to (11.14) and shown in Fig. 11.8. If window WL_S2S is forbidden, only two bits are needed to convey the window index.
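The sequencing rule can be sketched as a small label selector. The function names and the boolean input are assumptions made for illustration: `short_frame[i]` is taken to be true when the transient detector reports a transient in the detection interval coded by frame `i`, and frames outside the list are treated as long.

```python
def select_windows(short_frame):
    """Return one window label per frame for a double-resolution switched MDCT."""
    labels = []
    for i, is_short in enumerate(short_frame):
        if is_short:
            labels.append("WS_S2S")            # M/m short windows fill this frame
            continue
        prev_short = i > 0 and short_frame[i - 1]
        next_short = i + 1 < len(short_frame) and short_frame[i + 1]   # look-ahead
        left = "S" if prev_short else "L"
        right = "S" if next_short else "L"
        labels.append(f"WL_{left}2{right}")
    return labels


def index_bits(label_count):
    """Bits needed for a fixed-length window index."""
    return (label_count - 1).bit_length()


print(select_windows([False, False, True, False, True, True, False]))
print(index_bits(5), index_bits(4))   # with and without WL_S2S
```

Every adjacent pair of labels produced this way matches one row of Table 11.2, because the right half of each window is chosen from the same flag that fixes the left half of its successor.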
11.3.3 Implementation
Double-resolution switched IMDCT is widely used in a variety of audio coding standards. Its implementation is straightforward and fully illustrated in the third row of Fig. 11.9. In particular, starting from the second window in that row (the WL_L2S window), the IMDCT may be implemented in the following steps:
1. Copy the $(M - m)/2$ samples, where the WL_L2S window is one, from the delay line directly to the output;
2. Do $M/(2m)$ short IMDCTs and put all results to the output because all of them belong to the current frame;
3. Do another short IMDCT, put the first $m/2$ samples of its result to the output and store the remaining $m/2$ samples to the delay line, because the first half belongs to the current frame and the rest to the next frame;
4. Do $M/(2m) - 1$ short IMDCTs and put all results to the delay line because all of them belong to the next frame;
5. Clear the remaining $(M + m)/2$ samples of the delay line to zero because the last WS_S2S has ended.
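As a sanity check on the procedure above, the following sketch only tallies how many samples each stage routes to the output frame and to the delay line. It assumes that each short IMDCT contributes $m$ finished samples after overlap-add, which is how the sample counts in the text appear to be stated; the function name is hypothetical and no transform is computed.

```python
def imdct_sample_budget(M, m):
    """Tally the samples sent to the output frame and to the delay line."""
    copied = (M - m) // 2                  # flat part of WL_L2S, copied from the delay line
    blocks_now = M // (2 * m)              # short IMDCTs lying entirely in the current frame
    blocks_next = M // (2 * m) - 1         # short IMDCTs lying entirely in the next frame
    split_out, split_delay = m // 2, m // 2    # the one block straddling the frame boundary
    cleared = (M + m) // 2                 # tail of the delay line that is zeroed

    output = copied + blocks_now * m + split_out
    delay = split_delay + blocks_next * m + cleared
    short_blocks = blocks_now + 1 + blocks_next
    return output, delay, short_blocks


print(imdct_sample_budget(1024, 128))
```

For $M = 1{,}024$ and $m = 128$ both totals come out to $M$ and the number of short transforms to $M/m$, consistent with the constant frame size discussed in Sect. 11.3.1.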
11.3.4 Window Size Compromise
The shorter the window, the better a transient attack is localized and thus the better the coding gain becomes. This is true for the group of samples around the transient attack, but a shorter window causes poor coding gain for the other samples in the frame, which do not contain transient attacks and are quasistationary. Therefore, there is a trade-off when choosing the size of the short window. Too short a window means good coding gain for the transient attack but poor coding gain for the quasistationary remainder, and vice versa for a window that is too long. The compromise eventually reached is, of course, optimal neither for the transients nor for the quasistationary remainder.

This problem is further compounded by the need for longer long windows to better encode tonal components in quasistationary episodes. If the short window size is fixed, longer long windows mean more short windows in a frame, thus more short blocks of audio samples coded with poor coding gain. Therefore, the long windows cannot be too long.

Apparently, there is a consensus of using 256 taps for the short window, as shown in Table 11.1. This seems to be a good compromise because a window size of 256 taps is equivalent to $256/44.1 \approx 5.8$ ms, which is barely longer than the 5 ms for premasking. In other words, it is the longest acceptable size that is barely short enough, but not unnecessarily short.

With this longest acceptable size for the short window, 2,048 taps for the long window is also a widely accepted option. However, pre-echo artifacts are still frequently audible with such a window size arrangement, especially for audio pieces with significant transient attacks. This calls for techniques to improve the time resolution of the short window for enhanced control of pre-echo artifacts.
11.4 Temporal Noise Shaping
Temporal noise shaping (TNS) [26], used by AAC [60], is one of the pre-echo control methods [69]. It deploys an open-loop DPCM on the block of short MDCT coefficients that covers the transient attack, and leaves other blocks of short MDCT coefficients in the frame untouched. In particular, it deploys the open-loop DPCM encoder shown in Fig. 4.3 on the block of short MDCT coefficients that covers the transient in the following steps:
1. Estimate the autocorrelation matrix of the MDCT coefficients.
2. Solve the normal equations (4.46) using a method such as the Levinson–Durbin algorithm to produce the prediction coefficients.
3. Produce the prediction residue of the MDCT coefficients using the prediction coefficients obtained in the last step.
4. Quantize the prediction residue. Note that the quantizer is placed outside the prediction loop as shown in Fig. 4.3.
On the decoder side, the regular decoder shown in Fig. 4.4 is used to reconstruct the MDCT coefficients for the block with a transient.

The first theoretical justification for TNS is the spectral flattening effect of transients. For the short window that covers a transient attack, the resultant MDCT coefficients are flat or close to flat, and are thus amenable to linear prediction. As discussed in Chap. 4, the resultant prediction gain may be considered the same as the coding gain.

The second theoretical justification is that an open-loop DPCM shapes the spectrum of quantization noise toward the spectral envelope of the input signal (see Sect. 4.5.2). For TNS, the input signal is the MDCT coefficients and their spectrum is the time-domain samples covered by the short window, so the quantization noise of MDCT coefficients in the time domain is shaped toward the envelope of the time-domain samples. This means that more quantization noise is placed after the transient attack and less noise before the transient attack. Therefore, there is less likelihood of pre-echo artifacts.

As an example, let us apply the TNS method to the MDCT block that covers the transient attack in the audio signal in Fig. 11.2 with a predictor order of 2. The prediction coefficients are obtained from the autocorrelation matrix using (4.63). The quantization noise and the reconstructed signal are shown in Fig. 11.10. While the quantization noise before the transient attack is still visible, it is significantly shorter than that of the regular short window shown in Fig. 11.4.
[Figure 11.10: two panels plotting amplitude (-0.5 to 0.5) against time (5 to 45); top: quantization noise, bottom: reconstructed signal.]
Fig. 11.10 Pre-echo artifacts for TNS. The top figure shows the quantization noise and the bottom the reconstructed signal. The quantization noise before the transient attack is still visible, but is significantly shorter than that of the regular short window. However, the concentration of quantization noise in a short period of time (top) elevates the noise intensity significantly and hence may become audible
However, the concentration of quantization noise in a short period of time, as shown at the top of the figure, elevates the noise intensity significantly and hence may become audible. More sophisticated noise shaping methods may be deployed to shape the noise in such a way that it is more uniformly distributed behind the transient attack.

Since linear prediction needs to be performed for each MDCT coefficient, TNS is computationally intensive, even on the decoder side. The overhead for transferring the description of the predictor, including the prediction filter coefficients, is also considerable.
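The four encoder steps and the decoder reconstruction described in this section can be sketched in a few lines. This is only an illustration of open-loop DPCM applied across a block of coefficients, not the AAC bit-stream procedure: the function names, the uniform quantizer with step size `step`, the stand-in coefficient block and the unquantized prediction coefficients are all assumptions.

```python
import math


def autocorrelation(x, order):
    """Autocorrelation lags r[0..order] of the coefficient block."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations; returns a[0..order-1] for x[n] ~ sum a[j]*x[n-1-j]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def predict(a, history, k):
    """Prediction of element k from the elements before it."""
    return sum(a[j] * history[k - 1 - j] for j in range(len(a)) if k - 1 - j >= 0)


def tns_encode(coeffs, order, step):
    a = levinson_durbin(autocorrelation(coeffs, order), order)
    # Open loop: the residue is formed from the unquantized coefficients,
    # so the quantizer sits outside the prediction loop.
    residue = [c - predict(a, coeffs, k) for k, c in enumerate(coeffs)]
    return a, [round(e / step) for e in residue]


def tns_decode(a, indices, step):
    out = []
    for k, q in enumerate(indices):
        out.append(q * step + predict(a, out, k))
    return out


# A sinusoid across frequency stands in for the coefficients of a block
# dominated by a single attack (an illustrative input, not real audio data).
coeffs = [math.cos(0.4 * k + 0.1) for k in range(64)]
a, indices = tns_encode(coeffs, order=2, step=1e-6)
decoded = tns_decode(a, indices, step=1e-6)
residue_energy = sum((q * 1e-6) ** 2 for q in indices)
signal_energy = sum(c * c for c in coeffs)
print(a, residue_energy / signal_energy)
```

The decoder runs the synthesis filter over the quantized residue, which is what colors the quantization noise with the envelope of the signal. A practical codec would also have to quantize and transmit the prediction coefficients, which is the overhead mentioned above.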
11.5 Transient-Localized MDCT
Another approach to improving the window size compromise is to leave the short window size unchanged but use a narrower window shape to cope with transients better. This allows better transient localization with minimal impact on the coding gain for the quasistationary remainder of the frame and on the complexity of both encoder and decoder.
11.5.1 Brief Window and Pre-Echo Artifacts
Let us look at window function WL_S2S, which is the last one in Fig. 11.8. It is a long window, but its window shape is much narrower than that of the regular long window WL_L2L. This is achieved by shifting the short window halves outward and properly padding zeros to make the result as long as a long window.

This same idea may be applied to the short window using a model window whose length, denoted as $2B$, is even shorter than the short window, i.e., $B < m$. Its left and right halves are denoted as $h_L^B(n)$ and $h_R^B(n)$, respectively. This model window is not directly used in the switched MDCT, but only to build other windows, so it may be referred to as the virtual window. As an example, a 64-tap sine window, shown at the top of Fig. 11.11 as WB_B2B, may be used for such a purpose. It is plotted using a dashed line to emphasize that it is a virtual window.

Based on this virtual window, the following narrow short window, called a brief window, may be built:

WS_B2B:
\[
h_{S\_B2B}(n) =
\begin{cases}
0, & 0 \le n < m/2 - B/2; \\
h_L^B(n), & m/2 - B/2 \le n < m/2 + B/2; \\
1, & m/2 + B/2 \le n < 3m/2 - B/2; \\
h_R^B(n), & 3m/2 - B/2 \le n < 3m/2 + B/2; \\
0, & 3m/2 + B/2 \le n < 2m.
\end{cases}
\tag{11.15}
\]
[Figure 11.11: fourteen panels titled WB_B2B, WS_S2S, WS_S2B, WS_B2S, WS_B2B, WL_L2L, WL_L2S, WL_S2L, WL_S2S, WL_L2B, WL_B2L, WL_B2B, WL_S2B and WL_B2S, each plotting window amplitude (0 to 1) against tap index (0 to 2000).]
Fig. 11.11 Window functions for a transient-localized 1024/128-subband MDCT. The first window is plotted using a dashed line to emphasize that it is a virtual window (not directly used). Its length is 64 taps. All windows are built using the sine window as the model window
Its nominal length is the same as the regular short window WS_S2S, but its effective length is only
\[
(3m/2 + B/2) - (m/2 - B/2) = m + B \tag{11.16}
\]
because its leading and trailing $(m - B)/2$ taps are zero, respectively. Its overlap with each of its neighboring windows is only $B$ samples.

As an example, the WB_B2B virtual window at the top of Fig. 11.11 may be used in combination with the short window WS_S2S in the same figure to build the brief window shown as the fifth window in Fig. 11.11. Its effective length of $128 + 32 = 160$ taps is significantly shorter than the 256 taps of the short window, so it should provide better transient localization. For a sample rate of 44.1 kHz, this corresponds to $160/44.1 \approx 3.6$ ms. Compared with the 5.8-ms length of the regular short window, this amounts to an improvement of 2.2 ms, or 37.5%. This improvement is critical for pre-echo control because the 3.6-ms length of the brief window is well within the 5-ms range of premasking, while the 5.8-ms length of the regular short window is not.

Figure 11.12 shows the quantization noise achieved by this brief window for a piece of real audio signal. Its pre-echo artifacts are obviously shorter and weaker than those with the regular short window (see Fig. 11.4) but longer and weaker than those delivered by TNS (see Fig. 11.10).

One may argue that TNS might deliver better pre-echo control due to its significantly shorter but more powerful pre-echo artifacts. Due to the premasking effect that often lasts up to 5 ms, however, pre-echo artifacts significantly shorter than the premasking period are most likely inaudible and thus irrelevant. Therefore, the simple brief window approach to pre-echo control serves the purpose well.
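The brief window of (11.15) and its effective length (11.16) can be sketched as below. The helper names are hypothetical, the sine window is assumed as the virtual model window, and the values $m = 128$ and $B = 32$ follow the example in the text (a 64-tap virtual window).

```python
import math


def sine_half(N, right=False):
    """Rising (left) or falling (right) half of a 2N-tap sine window."""
    offset = N if right else 0
    return [math.sin(math.pi * (n + offset + 0.5) / (2 * N)) for n in range(N)]


def brief_window(m, B):
    """WS_B2B: a 2m-tap window with B-tap ramps, flat middle and zero padding."""
    pad = (m - B) // 2
    left = [0.0] * pad + sine_half(B) + [1.0] * pad
    right = [1.0] * pad + sine_half(B, right=True) + [0.0] * pad
    return left + right


m, B = 128, 32
w = brief_window(m, B)
effective = sum(1 for v in w if v != 0.0)        # taps that are not zero padding
saved = 2 * m - effective                        # taps saved versus WS_S2S

print(len(w), effective)                         # nominal and effective length
print(round(effective / 44.1, 1), round(saved / 44.1, 1), 100 * saved / (2 * m))
```

The nominal length stays at $2m$, so the brief window drops into the same short MDCT as WS_S2S; only the tap values differ.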
[Figure 11.12: two panels plotting amplitude (-0.5 to 0.5) against time (5 to 45); top: quantization noise, bottom: reconstructed signal.]
Fig. 11.12 Pre-echo artifacts for transient-localized MDCT. The top figure shows the quantization noise and the bottom the reconstructed signal. The quantization noise before the transient attack is still visible, but is remarkably shorter than that of the regular short window
11.5.2 Window Sequencing
To switch between this brief window (WS_B2B), the long window (WL_L2L) and the short window (WS_S2S), the PR conditions (11.8) and (11.9) call for the addition of various transitional windows, which are illustrated in Fig. 11.11 along with the primary windows. Since the brief window provides much better transient localization, the new switched-window MDCT scheme may be referred to as transient-localized MDCT (TLM).

Due to the increased number of windows as compared with the conventional approach, the determination of the appropriate window sequence is more involved, but still fairly simple. The addition to the usual placement of long and short windows discussed in Sect. 11.3 is the placement of the brief window within a frame with transients. Within such a frame, the brief window is applied only to the block of samples containing a transient, while the short window and/or the appropriate short transitional windows are applied to the quasistationary samples in the remainder of the frame. Some window sequence examples are shown in Fig. 11.13.
[Figure 11.13: four rows of window sequences, each plotting window amplitude (0 to 1) against sample index (0 to 4000).]
Fig. 11.13 Window sequence examples. The top sequence is for the conventional method which does not use the brief window. The brief window is used to cover blocks with transients for the sequences in the rest of the figure. The second and the third sequences are for a transient occurring in the first and the third blocks, respectively. Two transients occur in the first and sixth blocks in the last sequence
11.5.2.1 Long Windows
If there is no transient within the current frame, a long window should be selected, the specific shape of which depends on the shapes of the immediately previous and subsequent window halves, respectively. This is summarized in Table 11.3.
11.5.2.2 Short Windows
If there is a transient in the current frame, a sequence of eight short windows should be used, the specific shape of each depending on the transient locations. This is summarized as follows:
• WS_B2B is applied to each short block within which there is a transient, to improve the time resolution of the short MDCT.
• The window for the block that is immediately before this transient block has a designation of the form "WS_X2B".
• The window for the block that is immediately after this transient block has a designation of the form "WS_B2X".
The moniker "X" in the above designations can be either "S" or "B". The allowed placement of short windows may then be summarized in Table 11.4.

For the remainder of the frame (away from the transients), short window WS_S2S should be deployed, except for the first and last blocks of the frame, whose window assignments are dependent on the immediate window halves in the previous and subsequent frames, respectively. They are listed in Tables 11.5 and 11.6, respectively.

Table 11.3 Determination of long window shape for a frame without detected transient

Previous window half    Current window    Subsequent window half
WL_X2L                  WL_L2L            WL_L2X
WL_X2L                  WL_L2S            WS_S2X
WL_X2L                  WL_L2B            WS_B2X
WS_X2S                  WL_S2L            WL_L2X
WS_X2S                  WL_S2S            WS_S2X
WS_X2S                  WL_S2B            WS_B2X
WS_X2B                  WL_B2L            WL_L2X
WS_X2B                  WL_B2S            WS_S2X
WS_X2B                  WL_B2B            WS_B2X

Table 11.4 Allowed placement of short windows around a block with a detected transient

Pretransient    Transient    Posttransient
WL_L2B          WS_B2B       WL_B2L
WL_S2B          WS_B2B       WL_B2S
WL_B2B          WS_B2B       WL_B2B
WS_S2B          WS_B2B       WS_B2S
WS_B2B          WS_B2B       WS_B2B

Table 11.5 Determination of the first half of the first window in a frame with detected transients

Last window in previous frame    First window in current frame
WL_X2S                           WS_S2X
WS_X2S                           WS_S2X
WS_B2B                           WS_B2X

Table 11.6 Determination of the second half of the last window in a frame with detected transients

Last window in current frame    First window in subsequent frame
WS_X2S                          WL_S2X
WS_X2S                          WS_S2X
WS_B2B                          WS_B2X
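The placement rules for short windows can be sketched as follows. The function name and its arguments are hypothetical; the sketch assumes that a window half is "B" whenever the block itself or the neighboring block on that side contains a transient, which reproduces the three rules listed above, and that the two boundary flags stand in for the information obtained from the previous frame and from the look-ahead into the subsequent frame.

```python
def short_window_labels(transient, prev_last_transient=False, next_first_transient=False):
    """Label each short block of a transient frame as WS_<left>2<right>.

    transient            -- one flag per short block of the current frame
    prev_last_transient  -- transient in the last block of the previous frame
    next_first_transient -- transient in the first block of the subsequent frame
    """
    flags = [prev_last_transient] + list(transient) + [next_first_transient]
    labels = []
    for k in range(1, len(flags) - 1):
        left = "B" if flags[k] or flags[k - 1] else "S"
        right = "B" if flags[k] or flags[k + 1] else "S"
        labels.append(f"WS_{left}2{right}")
    return labels


# Transients in the third and seventh blocks of an eight-block frame
print(short_window_labels([0, 0, 1, 0, 0, 0, 1, 0]))
```

Because the right half of block k and the left half of block k+1 are derived from the same pair of flags, adjacent labels always agree on the shared window half, as the PR conditions require.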
11.5.3 Indication of Window Sequence to Decoder
The encoder needs to indicate to the decoder the window(s) that it used to encode the current frame so that the decoder can use the same window(s) to decode the frame. This can be accomplished again using a window index.

For a frame without a detected transient, one label from the middle column of Table 11.3 is all that is needed.

For a frame with transients, the window sequencing procedure outlined in Sect. 11.5.2 can be used to determine the sequence of short window shapes based on the knowledge of transient locations in the current frame. This procedure also needs to know whether there is a transient in the first block of the subsequent frame due to the need for look-ahead.

A value of 0 or 1 may be used to indicate the absence or presence of a transient in a block. For example, '00100010' indicates that there is a transient in the third and seventh blocks, respectively. This sequence may be shortened to block counts, each starting from the last transient block. For example, the previous sequence may be coded as '23'.

Note that the particular method above cannot indicate whether there is a transient in the first block of the current frame. This information, together with the absence or presence of a transient in the first block of the subsequent frame, may be conveyed by the nomenclature WS_Curr2Subs, where:
1. Curr (S = no, B = yes) identifies whether there is a transient in the first block of the current frame, and
2. Subs (S = no, B = yes) identifies whether there is a transient in the first block of the subsequent frame.
This is summarized in Table 11.7.

Table 11.7 Encoding of the existence or absence of transient in the first block of the current and subsequent frames

          Transient in the first block of
Label     Current frame    Subsequent frame
WS_B2B    Yes              Yes
WS_B2S    Yes              No
WS_S2B    No               Yes
WS_S2S    No               No

The first column of Table 11.7 is obviously the same set of labels used for the short windows. Combining the labels in Tables 11.7 and 11.3, we arrive at the complete set of window labels shown in Table 11.8. The total number of window labels is now 13, requiring 4 bits to transmit the index to the decoder. A pseudo C++ function is given in Sect. 15.4.5 that illustrates how to obtain short window sequencing from this window index and transient location information.

Table 11.8 Window indexes and their corresponding labels

Window index    Window label
0               WS_S2S
1               WL_L2L
2               WL_L2S
3               WL_S2L
4               WL_S2S
5               WS_B2B
6               WS_S2B
7               WS_B2S
8               WL_L2B
9               WL_B2L
10              WL_B2B
11              WL_S2B
12              WL_B2S
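One possible reading of the transient-location coding is sketched below; it is not the function of Sect. 15.4.5, and the function names are hypothetical. Each transient is assumed to be coded by the number of transient-free blocks since the previous transient (or since the start of the frame), with the first block handled separately through the Curr part of the window label.

```python
def encode_transient_gaps(flags):
    """Counts of transient-free blocks preceding each transient after block 0."""
    start = 1 if flags[0] else 0          # block 0 is signaled by the window label
    gaps, run = [], 0
    for flag in flags[start:]:
        if flag:
            gaps.append(run)
            run = 0
        else:
            run += 1
    return gaps


def decode_transient_gaps(gaps, first_block_transient, n_blocks=8):
    """Rebuild the per-block flags from the gap counts and the Curr flag."""
    flags = [0] * n_blocks
    pos = 0
    if first_block_transient:
        flags[0] = 1
        pos = 1
    for gap in gaps:
        pos += gap
        flags[pos] = 1
        pos += 1
    return flags


def index_bits(label_count):
    """Bits needed for a fixed-length window index."""
    return (label_count - 1).bit_length()


flags = [int(c) for c in "00100010"]
gaps = encode_transient_gaps(flags)
print(gaps, decode_transient_gaps(gaps, first_block_transient=False))
print(index_bits(13))
```

Under this reading, the example sequence '00100010' yields the counts 2 and 3, matching the '23' quoted in the text, and the 13 labels of Table 11.8 fit in a 4-bit index.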
11.5.4 Inverse TLM Implementation
Both TLM and the regular double-resolution switched MDCT discussed in Sect. 11.3 involve switching between long and short MDCT. The difference is that the regular one uses a small set of short and long windows, while TLM deals with a more complex set of short and long windows that involve the brief window WS_B2B. Note that the brief window and all of its related transitional windows are either short or long windows, just like the windows used by the regular switched MDCT. They differ only in the values of the window functions. These differences in values do not change the procedure that calculates the switched MDCT, so the same procedure given in Sect. 11.3.3 can be applied to calculate TLM.
11.6 Triple-Resolution Switched MDCT
Another approach to improving the window size compromise is to introduce a third window size, called the medium window size, between the short and long window sizes. The primary purpose is to provide better frequency resolution to the stationary segments within a frame with detected transients, thus allowing a much shorter window size to be used to deal with transient attacks. There are, therefore, two window sizes within such a frame: short and medium. In addition, the medium window size can also be used to handle a frame with smooth transients. In this case, there are only medium windows and no short windows within such a frame. The three kinds of frames are summarized in Table 11.9.
There are obviously three resolution modes, represented by the three window sizes, under such a switched MDCT architecture. To maintain a constant frame size, the long window size must be a multiple of the medium window size, which in turn must be a multiple of the short window size.
As an example, let us reconsider the 1024/128 switched MDCT discussed in Sect. 11.3. To mitigate the pre-echo artifacts encountered by its short window size of 256 taps, 128 may be selected as the new short window size, which corresponds to 64 MDCT subbands. To achieve better coding gain for the remainder of a transient frame, a medium window size of 512 may be selected, corresponding to 256 MDCT subbands. Keeping the long window size of 2048, we end up with a 1024/256/64 switched MDCT, or triple-resolution switched MDCT, whose window sizes are multiples of 4 and 4, respectively.
Since there are three window sizes that can be switched between one another, three sets of transitional windows are needed. Each set of these windows can be built using the formulas from (11.10) to (11.14). Figure 11.14 shows all these windows built based on the sine window.
The advantage of this new architecture is illustrated in Fig. 11.15, where a few example window sequences are shown.
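The size constraints above can be checked mechanically. The following C++ sketch, whose names (`Resolution`, `isConsistent`, and so on) are assumptions made for illustration, takes the subband counts of a triple-resolution switched MDCT, derives the window sizes as twice the subband counts (as in the 2048/1024, 512/256, and 128/64 pairs above), and counts how many blocks of one size fit into a block of the next larger size.

```cpp
// Subband counts of a triple-resolution switched MDCT.
struct Resolution {
    int longBands;    // e.g., 1024
    int mediumBands;  // e.g., 256
    int shortBands;   // e.g., 64
};

// Window size in taps: twice the number of MDCT subbands.
constexpr int windowSize(int bands) { return 2 * bands; }

// A constant frame size requires the long size to be a multiple of the
// medium size, and the medium size to be a multiple of the short size.
constexpr bool isConsistent(const Resolution& r) {
    return r.shortBands > 0 && r.mediumBands > 0 &&
           r.mediumBands % r.shortBands == 0 &&
           r.longBands % r.mediumBands == 0;
}

// Number of medium blocks in one long frame.
constexpr int mediumBlocksPerFrame(const Resolution& r) {
    return r.longBands / r.mediumBands;
}

// Number of short blocks in one medium block.
constexpr int shortBlocksPerMedium(const Resolution& r) {
    return r.mediumBands / r.shortBands;
}
```

For the 1024/256/64 example, both ratios evaluate to 4, which matches the "multiples of 4 and 4" stated above.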
Now the much shorter short window can be used to better localize the transients, and the medium window to achieve more coding gain for the remainder of the frame. The all-medium-window frame is suitable for handling slow transient frames. Comparing Fig. 11.14 with Fig. 11.11, we notice that they are essentially the same, except for the first window. Window WB B2B in Fig. 11.11 is virtual and not actually used. Instead, the dilated version of it, WS B2B, is used to deal with transients. The window equivalent to WB B2B is labeled as WS S2S in Fig. 11.14 and used to
Table 11.9 Three types of frames in a triple-resolution switched MDCT

Frame type         Windows
Quasistationary    A long window
Smooth transient   A multiple of medium windows
Transient          A multiple of short and medium windows
[Figure 11.14 panels, each with a horizontal axis from 0 to 2000 and a vertical axis from 0 to 1: WS_S2S, WM_M2M, WM_M2S, WM_S2M, WM_S2S, WL_L2L, WL_L2M, WL_M2L, WL_M2M, WL_L2S, WL_S2L, WL_S2S, WL_M2S, WL_S2M]
Fig. 11.14 Window functions produced for a 1024/256/64 triple-resolution switched MDCT using the sine window as the model window
cope with transients. Since it can be much shorter than WS B2B, it can deliver much better time localization. It is also smoother than WS B2B, so it should have better frequency resolution.
[Figure 11.15: four window-sequence panels, each with a horizontal axis from 0 to 4000 and a vertical axis from 0 to 1]
Fig. 11.15 Window sequence examples for a 1024/256/64 triple-resolution switched MDCT. The all medium window frame on the top is suitable for handling slow transient frames. For the rest of the figure, the short window can be used to better localize transients and the medium window to achieve better coding gains for the remainder of a frame with detected transients
Comparing Figs. 11.15 and 11.13, we notice that they are also very similar. The only essential difference is that the WS B2B window in Fig. 11.13 is replaced by four WS S2S windows in Fig. 11.15. The window sequencing procedures are therefore similar and are not discussed here. However, the addition of another resolution mode means that the resultant audio coding algorithm becomes much more complex because each resolution mode usually requires its own sets of critical bands, quantizers, and entropy codes. See Chap. 13 for more explanation.
11.7 Transient Detection
The adaptation of the time–frequency resolution of a filter bank hinges on whether there are transients in a frame as well as their locations, so the proper detection and locating of transients are critical to the success of audio coding.
11.7.1 General Procedure
Since transients consist mostly of high-frequency components, an input audio signal $x(n)$ is usually preprocessed by a high-pass filter $h(n)$ to extract its high-frequency components:
$$y(n) = \sum_{k} h(k)\, x(n-k). \tag{11.17}$$
The Laplacian, whose impulse response function is given below,
$$h(n) = \{1, -2, 1\} \tag{11.18}$$
is an example of such a high-pass filter.
The high-pass filtered samples $y(n)$ within a transient detection interval (see Sect. 11.3.2) are then divided into blocks of equal size, referred to as transient-detection blocks. Note that this transient-detection block is different from the block used by filter banks or MDCT. Let $L$ denote the number of samples in such a transient-detection block; then there are $K = N/L$ transient-detection blocks in each transient detection interval, assuming that it has a size of $N$ samples. The short block size of the filter bank should be a multiple of this transient-detection block size.
Next, some kind of metric or energy is calculated for each transient-detection block. The most straightforward is the following L2 metric:
$$E(k) = \sum_{i=0}^{L-1} |y(kL+i)|^2, \quad \text{for } k = 0, 1, \ldots, K-1. \tag{11.19}$$
For reduced computational load, the following L1 metric is also a good choice:
$$E(k) = \sum_{i=0}^{L-1} |y(kL+i)|, \quad \text{for } k = 0, 1, \ldots, K-1. \tag{11.20}$$
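The filtering of (11.17) and (11.18) and the block metrics of (11.19) and (11.20) can be sketched in C++ as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the Laplacian is applied causally as $y(n) = x(n) - 2x(n-1) + x(n-2)$, and samples before the start of the input are taken as zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// High-pass filter per (11.17) with the Laplacian taps of (11.18), h = {1, -2, 1}.
// Samples before the start of x are taken as zero (an assumption of this sketch).
std::vector<double> laplacianHighPass(const std::vector<double>& x) {
    std::vector<double> y(x.size());
    for (std::size_t n = 0; n < x.size(); ++n) {
        const double x1 = (n >= 1) ? x[n - 1] : 0.0;
        const double x2 = (n >= 2) ? x[n - 2] : 0.0;
        y[n] = x[n] - 2.0 * x1 + x2;
    }
    return y;
}

// Block metric E(k) over K = N / L transient-detection blocks of L samples each.
// useL1 == false gives the L2 metric (11.19); useL1 == true gives the L1 metric (11.20).
std::vector<double> blockMetric(const std::vector<double>& y, std::size_t L, bool useL1) {
    assert(L > 0 && y.size() % L == 0);
    const std::size_t K = y.size() / L;
    std::vector<double> E(K, 0.0);
    for (std::size_t k = 0; k < K; ++k) {
        for (std::size_t i = 0; i < L; ++i) {
            const double v = y[k * L + i];
            E[k] += useL1 ? std::fabs(v) : v * v;
        }
    }
    return E;
}
```

A step in the input, such as four zeros followed by four samples of value 2, produces nonzero filter output only around the step, so only the second block of four samples receives a nonzero metric.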
Other sophisticated “norms”, such as perceptual entropy [34], can also be used.
At this point, the transient detection decision can be made based on the variations of the metric or energy among the blocks. As a simple example, let us first calculate
$$E_{\max} = \max_{0 \le k < K} E(k), \tag{11.21}$$
$$E_{\min} = \min_{0 \le k < K} E(k). \tag{11.22}$$
[…]
$$\ldots = \begin{cases} \ldots, & \ldots > T; \\ 0, & \text{otherwise}, \end{cases} \tag{11.25}$$
where $T$ is a threshold. It may be set as
$$T = k\, \frac{E_{\max} + E_{\min}}{2}, \tag{11.26}$$
where $k$ is an adjustable constant.
Since a short MDCT or subband block may contain multiple transient-detection blocks, the transient locations obtained above need to be converted into MDCT blocks. This is easily done by declaring that an MDCT block contains a transient if any of its transient-detection blocks contains a transient.
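Two pieces of this procedure lend themselves to a short C++ sketch: the threshold of (11.26) and the conversion of transient locations from transient-detection blocks to MDCT blocks. The function names and the parameter `ratio` (the number of transient-detection blocks per short MDCT block) are assumptions made for illustration. The per-block decision is left to the caller, so the flags are taken as an input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Threshold of (11.26): T = k * (Emax + Emin) / 2, with Emax and Emin
// taken over all block metrics as in (11.21) and (11.22).
double detectionThreshold(const std::vector<double>& E, double k) {
    assert(!E.empty());
    const auto mm = std::minmax_element(E.begin(), E.end());
    return k * (*mm.second + *mm.first) / 2.0;
}

// An MDCT block contains a transient if any of its transient-detection
// blocks contains one. detFlags holds one flag per transient-detection block.
std::vector<bool> toMdctBlocks(const std::vector<bool>& detFlags, std::size_t ratio) {
    assert(ratio > 0 && detFlags.size() % ratio == 0);
    std::vector<bool> mdctFlags(detFlags.size() / ratio, false);
    for (std::size_t j = 0; j < detFlags.size(); ++j) {
        if (detFlags[j]) mdctFlags[j / ratio] = true;
    }
    return mdctFlags;
}
```

With eight detection blocks and `ratio` equal to 2, a transient flagged in the third detection block marks the second of the four MDCT blocks.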
11.7.2 A Practical Example
A practical example is provided here to illustrate how transient detection is done in practical audio coding systems. It entails two stages of decision. In the preliminary stage, no transient is declared in the current frame if any of the following conditions is true:
1. $E_{\max} < k_1 E_{\min}$, where $k_1$ is a tunable parameter.
2. $k_2 D_{\max} < E_{\max} - E_{\min}$, where $k_2$ is a tunable parameter.
3. $E_{\max} < T_1$, where $T_1$ is a tunable threshold.
4. $E_{\min} > T_2$, where $T_2$ is a tunable threshold.
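The preliminary stage can be sketched in C++ as a single predicate over the block metrics. The struct `PrelimParams` and the function name are hypothetical, and no particular parameter values are implied. Here $D_{\max}$ is computed as the largest absolute difference between the metrics of adjacent blocks.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Tunable parameters of the preliminary stage (values are left to the implementer).
struct PrelimParams {
    double k1, k2, T1, T2;
};

// Returns true when the preliminary stage rules out a transient, that is,
// when any of the four conditions holds. E holds the block metrics E(0..K-1).
bool noTransientPreliminary(const std::vector<double>& E, const PrelimParams& p) {
    assert(!E.empty());
    const auto mm = std::minmax_element(E.begin(), E.end());
    const double Emin = *mm.first;
    const double Emax = *mm.second;

    // Dmax: largest absolute difference between adjacent block metrics.
    double Dmax = 0.0;
    for (std::size_t k = 1; k < E.size(); ++k) {
        Dmax = std::max(Dmax, std::fabs(E[k] - E[k - 1]));
    }

    return Emax < p.k1 * Emin           // condition 1
        || p.k2 * Dmax < Emax - Emin    // condition 2
        || Emax < p.T1                  // condition 3
        || Emin > p.T2;                 // condition 4
}
```

Under these conditions a flat metric profile is rejected by condition 1, while a profile that rises gradually across the frame is rejected by condition 2 because no single step accounts for much of the total range.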
The $D_{\max}$ used above is the maximum absolute metric difference, defined as
$$D_{\max} = \max_{0 < k < K} |E(k) - E(k-1)|.$$
[…]
    … E[k] ) { break; } }
    PreK = k-1;
The preattack peak is
$$PreE_{\max} = \max_{0 \le k \ldots} E(k).$$
[…]
    … k>=K ) break; } while ( E[k] > EX );
    PostK = k+1;
The postattack peak is
$$PostE_{\max} = \max_{PostK \le k \ldots} E(k).$$
[…]
$$\ldots > k_3 \max\{PreE_{\max}, PostE_{\max}\}, \tag{11.31}$$
where $k_3$ is a tunable parameter.
Chapter 12
Joint Channel Coding
Multichannel audio signals or programs, including the most widely used stereo and 5.1 surround sounds, are considered as consisting of discrete channels. Since a multichannel signal is intended for the reproduction of a coherent sound field, there is strong correlation between its discrete channels. This inter-channel correlation can obviously be exploited to reduce bit rate.
On the receiving end, the human auditory system relies on many cues in the audio signal to achieve sound localization, and the processing involved is very complex. However, many psychoacoustic experiments have consistently indicated that some components of the audio signal are either insignificant or even irrelevant for sound localization and thus can be removed for bit rate reduction.
Surround sounds usually include one or more special channels, called low-frequency effect or LFE channels, which are specifically intended for deep and low-pitched sounds with a frequency range from 3 to 120 Hz. The significantly reduced bandwidth presents a great opportunity for reducing bit rate.
It is obvious that a great deal of bit rate reduction can be achieved by jointly coding all channels of an audio signal through exploitation of inter-channel redundancy and irrelevancy. Unfortunately, joint channel coding has not reached the same level of sophistication and effectiveness as that of intra-channel coding. Therefore, only a few widely used and simple methods are covered in this chapter. See [25] to explore further.
12.1 M/S Stereo Coding
M/S stereo coding, or sum/difference coding, is an old technology which was deployed in stereo FM radio [15] and stereo TV broadcasting [27] to extend from monaural or monophonic sound reproduction (often shortened to mono) to stereophonic sound reproduction (shortened to stereo) while maintaining backward compatibility with old mono receivers. Toward this end, the left (L) and right (R) channels of a stereo program are encoded into sum
$$S = 0.5(L + R) \tag{12.1}$$
and difference
$$D = 0.5(L - R) \tag{12.2}$$
channels. A mono receiver can process the sum signal only, so the listener can hear both left and right channels in a single loudspeaker. A stereo receiver, however, can decode the left channel by
$$L = S + D \tag{12.3}$$
and the right channel by
$$R = S - D, \tag{12.4}$$
respectively, so the listener can enjoy stereo sound reproduction. The sum signal is also referred to as the main signal and the difference signal as the side signal, so this technology is often called main/side stereo coding.
From the perspective of audio coding, this old technology obviously provides a means for exploiting the strong correlation between stereo channels. In particular, if the correlation between the left and right channels is strongly positive, the difference channel becomes very weak and thus needs fewer bits to encode. If the correlation is strongly negative, the sum channel becomes weak and thus needs fewer bits to encode. However, if the correlation between the left and right channels is weak, both the sum and difference signals are strong, so there is not much coding gain. Also, if the left or right channel is much stronger than the other one, sum/difference coding is unlikely to provide any coding gain either. Therefore, the encoder needs to dynamically decide whether or not sum/difference coding is deployed and indicate the decision to the decoder.
Instead of the time domain approach used by stereo FM radio and TV broadcasting, sum/difference coding in audio coding is mostly performed in the subband domain. In addition, the decision as to whether sum/difference coding is deployed for a frame is tailored to each critical band. In other words, there is a sum/difference coding decision for each critical band [35].
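The encoding of (12.1) and (12.2) and the decoding of (12.3) and (12.4) can be sketched in C++ as below. The type and function names are assumptions made for illustration, and the sketch works on one vector of samples per channel, which in an audio coder would typically be the subband samples of one critical band. The rule by which an encoder decides whether to use sum/difference coding for a band is not specified above and is therefore not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SumDiff {
    std::vector<double> sum;   // S = 0.5 (L + R), per (12.1)
    std::vector<double> diff;  // D = 0.5 (L - R), per (12.2)
};

struct Stereo {
    std::vector<double> left;
    std::vector<double> right;
};

// Encode left/right samples into sum/difference samples.
SumDiff msEncode(const Stereo& in) {
    assert(in.left.size() == in.right.size());
    SumDiff out{std::vector<double>(in.left.size()), std::vector<double>(in.left.size())};
    for (std::size_t n = 0; n < in.left.size(); ++n) {
        out.sum[n] = 0.5 * (in.left[n] + in.right[n]);
        out.diff[n] = 0.5 * (in.left[n] - in.right[n]);
    }
    return out;
}

// Decode: L = S + D per (12.3) and R = S - D per (12.4).
Stereo msDecode(const SumDiff& in) {
    assert(in.sum.size() == in.diff.size());
    Stereo out{std::vector<double>(in.sum.size()), std::vector<double>(in.sum.size())};
    for (std::size_t n = 0; n < in.sum.size(); ++n) {
        out.left[n] = in.sum[n] + in.diff[n];
        out.right[n] = in.sum[n] - in.diff[n];
    }
    return out;
}
```

When the two channels are identical, every difference sample is zero, which is the strongly positive correlation case described above.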
12.2 Joint Intensity Coding
The perception of sound localization by the human auditory system is frequency-dependent [16, 18, 102]. At low frequencies, the human ear localizes sound dominantly through inter-aural time differences (ITD). At high frequencies (higher than 4–5 kHz, for example), however, sound localization is dominated by inter-aural amplitude differences (IAD). This latter property offers a great opportunity for significant bit rate reduction.
The basic idea of joint intensity coding (JIC) is to merge subbands at high frequencies into just one channel (thus significantly reducing the number of samples to be coded) and to transmit instead a smaller number of bits that describe the amplitude differences between channels.
Joint intensity coding is, of course, performed on a critical band basis, so only critical band $z$ is considered in the following discussion. Since sound localization is dominated by inter-aural amplitude differences only at high frequencies, higher than 4–5 kHz, the critical band $z$ should be selected in such a way that its lower frequency bound is higher than 4–5 kHz. All critical bands higher than this can be considered for joint intensity coding.
To illustrate the basic idea of and the steps typically involved in joint intensity coding, let us suppose that there are $K$ channels that can be jointly coded and denote the $n$th subband sample from the $k$th channel as $X(k, n)$. The first step of joint intensity coding is to calculate the power or intensity of all subband samples in critical band $z$ for each channel:
$$\sigma_k^2 = \sum_{n \in z} X^2(k, n), \quad 0 \le k < K. \tag{12.5}$$
At the second step, all subband samples in critical band $z$ are joined together to form a joint channel:
$$J(n) = \sum_{0 \le k < K} X(k, n), \quad n \in z. \tag{12.6}$$
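The first two steps, (12.5) and (12.6), can be sketched in C++ as follows. The function names and the data layout are assumptions made for illustration: `X[k][n]` holds the $n$th subband sample of channel $k$ restricted to critical band $z$, so that the sums over $n \in z$ become sums over a whole inner vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Step 1, per (12.5): power of the subband samples of each channel in band z.
std::vector<double> channelPower(const std::vector<std::vector<double>>& X) {
    std::vector<double> power(X.size(), 0.0);
    for (std::size_t k = 0; k < X.size(); ++k) {
        for (double v : X[k]) power[k] += v * v;
    }
    return power;
}

// Step 2, per (12.6): sum the K channels sample by sample into a joint channel.
std::vector<double> jointChannel(const std::vector<std::vector<double>>& X) {
    assert(!X.empty());
    std::vector<double> J(X[0].size(), 0.0);
    for (const std::vector<double>& channel : X) {
        assert(channel.size() == J.size());
        for (std::size_t n = 0; n < J.size(); ++n) J[n] += channel[n];
    }
    return J;
}
```

For two channels with samples {1, 2} and {3, -4} in the band, the powers are 5 and 25 and the joint channel is {4, -2}.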