Audio Watermarking Techniques Hyoung Joong Kim Department of Control and Instrumentation Engineering Kangwon National University Chunchon 200-701, Korea
[email protected]

Abstract: This paper surveys audio watermarking schemes. The state of the art of current watermarking schemes and their implementation techniques is briefly summarized. The schemes are classified into five categories: quantization, spread-spectrum, two-set, replica, and self-marking schemes. The advantages and disadvantages of each scheme are discussed. In addition, synchronization schemes are surveyed.
Figure 1: Typical audio watermarking schemes, grouped into blind and non-blind watermarking: quantization (s[k] = Q(x[k] + d)), spread-spectrum (s[k] = x[k] + w[k]), two-set (s[k] = x[k] + d), replica (s[k] = x[k] + x[k - d]), and self-marking.
1 Introduction
Audio watermarks are special signals embedded into digital audio. These signals are extracted by detection mechanisms and decoded. Audio watermarking schemes rely on the imperfection of the human auditory system. However, the human ear is much more sensitive than the other sensory organs, so good audio watermarking schemes are difficult to design (Kim et al. 2003). Even though current watermarking techniques are far from perfect, audio watermarking schemes have been applied widely during the last decade, and they have become quite sophisticated in terms of robustness and imperceptibility (Bender et al. 1996) (Cox et al. 2002) (Cox and Miller 2002). Robustness and imperceptibility are important requirements of watermarking, even though they conflict with each other.
Non-blind watermarking schemes are theoretically interesting, but not very useful in practice, since they require double the storage capacity and double the communication bandwidth for watermark detection. Of course, non-blind schemes may be useful as a copyright verification mechanism in a copyright dispute (and even necessary; see (Craver et al. 1998) on inversion attacks). On the other hand, a blind watermarking scheme can detect and extract watermarks without using the unwatermarked audio. Therefore, it requires only half the storage capacity and half the bandwidth of a non-blind scheme. Hence, only blind audio watermarking schemes are considered in this chapter. Needless to say, blind watermarking methods need self-detection mechanisms for detecting watermarks without the unwatermarked audio.
This paper presents five basic audio watermarking schemes (see Figure 1). The first is quantization-based watermarking, which quantizes the sample values to separate valid sample values from invalid ones. The second is the spread-spectrum method, based on the similarity between the watermarked audio and a pseudo-random sequence. The third is the two-set method, based on differences between two or more sets, which includes the patchwork scheme. The fourth is the replica method, which uses a close copy of the original audio and includes the replica modulation scheme. The last is the self-marking scheme. Of course, many more schemes and variants are available. For example, time-base modulation (Foote and Adcock 2002) is theoretically interesting; however, it is a non-blind watermarking scheme. An audio watermarking scheme that encodes compressed audio data (Nahrstedt and Qiao 1998) does not embed a real watermarking signal into the raw audio. Furthermore, no psycho-acoustic model is available in the compressed domain to adjust the watermark and ensure inaudibility.

Synchronization is important for detecting watermarks, especially when the audio is attacked. Most audio watermarking schemes are position-based, i.e., watermarks are embedded into specific positions and detected at those positions. Thus, a shift in positions caused by an attack makes such detection schemes fail. The main purpose of synchronization schemes is to find the shifted positions. Several synchronization schemes are surveyed in this article. In audio watermarking, the time-scaling or pitch-scaling attack is one of the most difficult attacks to manage. A brief summary of the approach to these attacks proposed by (Tachibana et al. 2001) is also given.
2 Quantization Method

A scalar quantization scheme quantizes a sample value x and assigns a new value to the sample based on the quantized value. In other words, the watermarked sample value y is given as follows:

    y = q(x, D) + D/4   if b = 1,
    y = q(x, D) - D/4   otherwise,        (1)

where q(.) is a quantization function and D is a quantization step. The quantization function is q(x, D) = [x/D] * D, where [x] rounds x to the nearest integer. The concept of the simplest quantization scheme in Equation (1) is illustrated in Figure 2. A sample value x is quantized to q(x, D), the black circle (•); we call q(x, D) the anchor. If the watermarking bit b is 1, the anchor is moved to the white circle (◦); otherwise it is moved to the cross (×), which stands for the watermarking bit 0. For example, let D be 8 and x be 81. Then q(81, 8) = 80; if b = 1, then y = 82, and otherwise y = 78. As shown in the figure, the distance between anchors is D.

Figure 2: A simple quantization scheme.

Detection is the inverse process of embedding and is summarized as follows:

    b = 1   if 0 < y - q(x, D) < D/4,
    b = 0   if -D/4 < y - q(x, D) < 0.

This scheme is simple to implement and is robust against noise attack so long as the noise remains below D/4. In other words, if the additive noise is larger than D/4, the quantized value is perturbed so much that the detector misinterprets the watermarking bit. The robustness can be enhanced if dither modulation (Chen and Wornell 1999) is used. This scheme is formulated as follows:

    y_m = q(x + d_m, D) - d_m.
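The basic scheme of Equation (1) and its detector can be sketched as follows. The values D = 8 and x = 81 follow the worked example above; NumPy is assumed only for convenience.

```python
import numpy as np

D = 8.0                                    # quantization step (example value)

def q(x, D):
    """Quantization function q(x, D) = [x/D] * D."""
    return np.round(x / D) * D

def embed(x, b, D=D):
    """Equation (1): move the sample to the anchor +/- D/4."""
    return q(x, D) + (D / 4 if b == 1 else -D / 4)

def detect(y, D=D):
    """Recover the bit from the sign of y - q(y, D)."""
    return 1 if y - q(y, D) > 0 else 0

assert embed(81.0, 1) == 82.0              # q(81, 8) = 80, so y = 80 + 2
assert embed(81.0, 0) == 78.0
assert detect(82.0) == 1 and detect(78.0) == 0
```

Note that the detector survives additive noise of magnitude below D/4, matching the robustness margin discussed above.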
Here m is an index and d_m is the m-th dither vector. For example, let d_1 = 2, d_2 = 0, x = 8, and D = 4. Then y_1 = 10 and y_2 = 8. The detection procedure estimates the distances and detects the watermarking index as follows:

    m = 1   if e(y_1, d_1) < e(y_1, d_2),
    m = 2   if e(y_2, d_2) < e(y_2, d_1),        (2)

where e(y_i, d_j) = y_i - q(y_i + d_j, D) + d_j. From Equation (2), it is possible to detect the watermark index. In the above example, e(y_1, d_1) = 0 and e(y_1, d_2) = 2; thus, it is clear that y_1 is close to d_1. Similarly, y_2 is close to d_2. This procedure can be extended to longer dither vectors.

3 Spread-Spectrum Method

The spread-spectrum watermarking scheme is an example of the correlation method, which embeds a pseudo-random sequence and detects the watermark by calculating the correlation between the pseudo-random noise sequence and the watermarked audio signal. The spread-spectrum scheme is the most popular scheme and has been studied well in the literature (Boney et al. 1996) (Cox et al. 1996) (Cvejic et al. 2001) (Kirovski and Malvar 2001) (Kim 2000) (Lee and Ho 2000) (Seok et al. 2002) (Swanson et al. 1998). This method is easy to implement, but it has some serious disadvantages: it requires time-consuming psycho-acoustic shaping to reduce audible noise, and it is susceptible to the time-scale modification attack. (Of course, the use of psycho-acoustic models is not limited to spread-spectrum techniques.) The basic idea of this scheme and its implementation techniques are described below.

Figure 3: A typical embedder of the spread-spectrum watermarking scheme.

3.1 Basic Idea

This scheme spreads a pseudo-random sequence across the audio signal. The wideband noise can be spread into either the time-domain signal or a transform-domain signal, no matter what transform is used. Frequently used transforms include the DCT (Discrete Cosine Transform), DFT (Discrete Fourier Transform), and DWT (Discrete Wavelet Transform). The binary watermark message v = {0, 1}, or its equivalent bipolar variable b = {-1, +1}, is modulated by a pseudo-random sequence r(n) generated by means of a secret key. Then the modulated watermark w(n) = b r(n) is scaled according to the required energy of the audio signal s(n). The scaling factor α controls the trade-off between robustness and inaudibility of the watermark. The modulated watermark w(n) is equal to either r(n) or -r(n) depending on whether v = 1 or v = 0. The modulated signal is then added to the original audio to produce the watermarked audio x(n) such that

    x(n) = s(n) + α w(n).

The detection scheme uses linear correlation. Because the pseudo-random sequence r(n) is known and can be regenerated by means of the secret key, watermarks are detected by using the correlation between x(n) and r(n):

    c = (1/N) sum_{i=1}^{N} x(i) r(i),        (3)

where N denotes the length of the signal.
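The embedding and correlation detection above can be sketched as follows. The host signal, the PRNG seed, the sequence length, and the scaling factor are illustrative assumptions; a real system would derive r(n) from a secret key and choose α perceptually.

```python
import numpy as np

rng = np.random.default_rng(42)            # stands in for a key-derived PRNG
N = 4096
s = rng.standard_normal(N)                 # stand-in for the host audio s(n)
r = rng.choice([-1.0, 1.0], size=N)        # bipolar pseudo-random carrier r(n)

b = 1                                      # message bit v = 1 -> b = +1
alpha = 0.5                                # exaggerated here for clear detection
w = b * r                                  # modulated watermark w(n) = b r(n)
x = s + alpha * w                          # watermarked audio x(n)

# Equation (3): linear correlation; its sign recovers the embedded bit.
c = np.mean(x * r)
assert np.sign(c) == b
```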
Equation (3) yields the correlation sum of two components as follows:

    c = (1/N) sum_{i=1}^{N} s(i) r(i) + (1/N) sum_{i=1}^{N} α b r²(i).        (4)

Assume that the first term in Equation (4) is almost certain to have small magnitude. If the two signals s(n) and r(n) were independent, the first term would vanish; however, this is not the case in practice. Thus, the watermarked audio is preprocessed as shown in Figure 4 in order to make this assumption valid. One possible solution is filtering s(n) out of x(n). Preprocessing methods include high-pass filtering (Hartung and Girod 1998) (Haitsma et al. 2000), linear predictive coding (Seok et al. 2002), and filtering by a whitening filter (Kim 2000). Such preprocessing gives the second term in Equation (4) a much larger magnitude and makes the first term almost vanish. If the first term has a magnitude similar to or larger than the second term, the detection result will be erroneous. Based on a hypothesis test using the correlation value c and a predefined threshold τ, the detector outputs

    m = 1   if c > τ,
    m = 0   if c ≤ τ.

The typical value of τ is 0. The detection threshold has a direct effect on both the false positive and false negative probabilities. A false positive is an error in which the detector incorrectly determines that a watermark is present in unwatermarked audio. On the other hand, a false negative is an error in which the detector fails to detect a watermark in watermarked audio.

Figure 4: A typical preprocessing block for the detector of the spread-spectrum watermarking scheme.

3.2 Pseudo-Random Sequence

A pseudo-random sequence has statistical properties similar to those of a truly random signal, but it can be exactly regenerated with knowledge of privileged information. A good pseudo-random sequence has good correlation properties such that any two different sequences are almost mutually orthogonal: the cross-correlation value between them is very low, while the auto-correlation value is moderately large. The most popular pseudo-random sequence is the maximum length sequence (also known as the M-sequence). This is a binary sequence r(n) = {0, 1} of length N = 2^m - 1, where m is the size of the linear feedback shift register, and it has very nice auto-correlation and cross-correlation properties. If we map the binary sequence r(n) = {0, 1} into the bipolar sequence r(n) = {-1, +1}, the auto-correlation of the M-sequence is given as follows:

    (1/N) sum_{i=0}^{N-1} r(i) r(i - k) = 1        if k = 0,
                                        = -1/N     otherwise.        (5)

The M-sequences have two disadvantages. First, the length of an M-sequence, called the chip rate, is strictly limited to 2^m - 1. Thus, it is impossible to get, for example, nine-chip sequences. The length of typical pseudo-random sequences is 1,023 (Cvejic et al. 2001) or 2,047. There is always a possibility to trade off the length of the pseudo-random sequence against robustness; very short sequences, such as length 7, are also used (Liu et al. 2002). Second, the number of different M-sequences is limited once the size m is determined. It has also been shown that the M-sequence is not secure in terms of cryptography.

Thus, not all pseudo-random sequences are M-sequences. Sometimes a non-binary, and consequently real-valued, pseudo-random sequence r(n) ∈ R with Gaussian distribution (Cox et al. 1996) is used. A non-binary chaotic sequence (Bassia et al. 2001) is also used. As long as the sequences are non-binary, their correlation characteristics are very nice. However, since we have to use integer sequences (processed such as α r(n)) due to finite precision, the correlation properties become less promising.
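An M-sequence and the correlation property of Equation (5) can be illustrated with a small linear feedback shift register. The register size m = 5 and the tap positions (corresponding to a standard primitive trinomial) are illustrative choices, not taken from the paper.

```python
import numpy as np

def m_sequence(taps, m):
    """Generate one period (2^m - 1 chips) of a maximum length sequence
    with a Fibonacci LFSR; `taps` are 1-based tap positions."""
    state = [1] * m                        # any nonzero initial state works
    out = []
    for _ in range(2**m - 1):
        out.append(state[-1])              # output the last register bit
        fb = 0
        for t in taps:
            fb ^= state[t - 1]             # XOR of the tapped bits
        state = [fb] + state[:-1]          # shift in the feedback bit
    return np.array(out)

r = 2.0 * m_sequence(taps=(5, 2), m=5) - 1.0   # map {0,1} -> {-1,+1}
N = len(r)
assert N == 31

# Cyclic autocorrelation: N at zero lag, exactly -1 at every other lag,
# which is Equation (5) scaled by N.
assert np.dot(r, r) == N
for k in range(1, N):
    assert np.dot(r, np.roll(r, k)) == -1.0
```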
3.3 Watermark Shaping

A pseudo-random sequence or noise carelessly added to an audio signal can cause unpleasant audible sound, whatever watermarking scheme is used. Simply reducing the strength α of the pseudo-random sequence cannot be the final solution: human ears are very sensitive, especially when the sound energy is very low, so even a little noise with a small value of α can be heard. Moreover, a small α makes the spread-spectrum scheme less robust. One solution that ensures inaudibility is watermark shaping based on the psycho-acoustic model (Arnold and Schilz 2002) (Bassia et al. 2001) (Boney et al. 1996) (Cvejic et al. 2001) (Cvejic and Seppänen 2002). Interestingly enough, watermark shaping can also enhance robustness, since we can increase the strength α as long as the noise stays below the masking margin.

Psycho-acoustic models for audio compression exploit frequency and temporal masking effects to ensure inaudibility by shaping the quantization noise according to the masking threshold. The psycho-acoustic model depicts the human auditory system as a frequency analyzer with a set of 25 bandpass filters (also known as critical bands). The intensity, expressed in decibels [dB], required for a single sound to be heard in the absence of another sound is known as the quiet curve (Cvejic et al. 2001) or threshold of audibility (Rossing et al. 2002). Figure 5 shows the quiet curve. In this case, the threshold in quiet is equal to the so-called minimum masking threshold. However, masking effects can raise the minimum masking threshold. A sound lying in the frequency or temporal neighborhood of another sound affects the perception of the neighboring sound, a phenomenon known as masking. The sound that does the masking is called the masker, and the sound that is masked is called the maskee.

Figure 5: A typical curve for masking. Noise below the solid line or bold line is inaudible. The bold line is moved upward by taking masking effects into consideration.

The psycho-acoustic model analyzes the input signal s(n) in order to calculate the minimum masking threshold T. Figure 6 shows inaudible and audible watermark signals. The audible watermark signal can be transformed into an inaudible one by applying watermark shaping based on the psycho-acoustic model. The frequency masking procedure is as follows:

1. Calculate the power spectrum.
2. Locate the tonal (sinusoid-like) and non-tonal (noise-like) components.
3. Decimate the maskers to eliminate all irrelevant maskers.
4. Compute the individual masking thresholds.
5. Determine the minimum masking threshold in each subband.

Figure 6: An example of noise shaping. Audible noise (dotted line) is transformed into inaudible noise (broken line).

This minimum masking threshold defines the frequency response of the shaping filter, which shapes the watermark. The filtered watermark signal is scaled so that the watermark noise is embedded below the masking threshold; the shaped signal below the masking threshold is hardly audible. In addition, the noise energy of the pseudo-random sequence can be increased as much as possible in order to maximize robustness, since the noise remains inaudible as long as its power is below the masking threshold T. Temporal masking effects are also utilized for watermark shaping.

Watermark shaping is a time-consuming task, especially when we try to exploit the masking effects frame by frame in real time, because the watermark shaping filter coefficients are computed based on the psycho-acoustic model. In this case, we have to use the Fourier transform and inverse Fourier transform and follow the five steps described above. Needless to say, the detection rate then increases since the robustness of the watermark increases. Because this is too time-consuming, a watermark shaping filter computed from the quiet curve can be used instead. Since this filter exploits the minimum noise level, it is not optimal in terms of the watermark strength α, which results in a strong reduction of robustness. Of course, instead of maximizing the masking threshold, we can increase the length of the pseudo-random sequence for robustness. However, this method reduces the embedding message capacity.
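As a rough illustration only, the sketch below stands in for steps 1-5 with a crude "masking threshold" (the host's magnitude spectrum lowered by a fixed margin) instead of a real psycho-acoustic model. The margin, the toy host signal, and the frame length are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1024
t = np.arange(N)
s = np.sin(2 * np.pi * 440 * t / 44100)              # toy host signal
w = rng.choice([-1.0, 1.0], size=N)                  # white watermark noise

S = np.fft.rfft(s)
W = np.fft.rfft(w)
margin_db = 20.0                                     # assumed safety margin
threshold = np.abs(S) * 10 ** (-margin_db / 20)      # crude "masking" curve

# Shape the watermark: cap each spectral bin at the per-bin threshold.
scale = np.minimum(1.0, threshold / np.maximum(np.abs(W), 1e-12))
w_shaped = np.fft.irfft(W * scale, n=N)

# The shaped watermark never exceeds the threshold in any bin.
assert np.all(np.abs(np.fft.rfft(w_shaped)) <= threshold + 1e-9)
```

A real implementation would replace `threshold` with the minimum masking threshold T computed from tonal/non-tonal analysis per critical band, as the five steps describe.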
Figure 7: Seven example waveforms for sinusoidal modulation watermarking. Courtesy of Dr. Zheng Liu.
3.4 Sinusoidal Modulation

Another solution is sinusoidal modulation, which is based on the orthogonality between sinusoidal signals (Liu et al. 2002). Sinusoidal modulation utilizes the orthogonality between sinusoidal signals with different frequencies:

    (1/N) sum_{i=0}^{N-1} sin(2πim/N) sin(2πin/N) = 1/2   if m = n,
                                                  = 0     otherwise.

Based on this property, the sinusoidally modulated watermark can be generated by adding sinusoids with different frequencies, weighted by a pseudo-random sequence (Liu et al. 2002), as follows:

    w = sum_{i=0}^{N-1} b_i α_i sin(2π f_i).
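This construction can be sketched as follows, where b_i are bipolar chips and α_i are scaling factors; the frame length, the frequencies (integer numbers of cycles per frame, so that orthogonality holds exactly), and the common scaling are illustrative assumptions.

```python
import numpy as np

N = 1024                                       # frame length (assumed)
n = np.arange(N)
freqs = np.array([3, 5, 7, 11, 13, 17, 19])    # cycles per frame (assumed)
chips = np.array([1, -1, 1, 1, -1, -1, 1])     # bipolar chips b_i
alpha = 0.1                                    # common scaling alpha_i

# w(n) = sum_i b_i * alpha_i * sin(2 pi f_i n / N); starts and ends near zero.
w = (chips[:, None] * alpha
     * np.sin(2 * np.pi * freqs[:, None] * n / N)).sum(axis=0)

# Orthogonality lets each chip be demodulated independently by correlation.
for f, bit in zip(freqs, chips):
    c = np.dot(w, np.sin(2 * np.pi * f * n / N))
    assert np.sign(c) == bit
```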
Note that the watermark signal modulated by the elements of the pseudo-random sequence b_i keeps the same correlation characteristics as the pseudo-random sequence in Equation (5). The coefficient b_i is a bipolar pseudo-random sequence, and α_i is a scaling factor for the i-th sinusoidal component with frequency f_i. For example, Figure 7 shows seven waveforms for the sinusoidal modulation watermarking scheme, in which seven sinusoids are linearly combined with different b_i coefficients.

The sinusoidal modulation method has the following advantages. First, watermark embedding and detection can be done directly in the time domain, so its embedding complexity is relatively low. Second, the length of the pseudo-random sequence is very short. Third, the embedded sinusoids always start from zero and end on zero, which minimizes the chance of block noise. Of course, this scheme also needs psycho-acoustic shaping for inaudibility. However, since the sinusoids are quite few in number, the just noticeable difference for them (see Figure 8) can be determined in the frequency domain by audibility experiments.

Figure 8: Just noticeable differences for sinusoidal modulation.

4 Two-Set Method

A blind watermarking scheme can be devised by making two sets different: if the two sets are found to differ, we can conclude that a watermark is present. Such decisions are made by hypothesis tests, typically based on the difference of means between the two sets. Making two sets of audio blocks have different energies can also be a good solution for blind watermarking. Patchwork (Arnold 2000) (Bender et al. 1996) (Yeo and Kim 2003) also belongs to this category. Of course, depending on the application, we can exploit the differences between two sets or more.

4.1 Patchwork Scheme

The original patchwork scheme embeds a special statistic into the original signal (Bender et al. 1996). The two major steps in the scheme are: (i) choose two patches pseudo-randomly, and (ii) add a small constant value d to the samples of one patch A and subtract the same value d from the samples of the other patch B. Mathematically speaking,

    a*_i = a_i + d,    b*_i = b_i - d,

where a_i and b_i are samples of the patchwork sets A and B, respectively. Thus, the original sample values have to be slightly modified. The detection process starts with the subtraction of the sample values between the two patches. Then E[ā* - b̄*], the expected value of the difference of the sample means, is used to decide whether the samples contain watermark information or not, where ā* and b̄* are the sample means of the samples a*_i and b*_i, respectively. Since two patches are used rather than one, the scheme can detect the embedded watermarks without the original signal, which makes it a blind watermarking scheme.

Patchwork has some inherent drawbacks. Note that

    E[ā* - b̄*] = E[(ā + d) - (b̄ - d)] = E[ā - b̄] + 2d,

where ā and b̄ are the sample means of the samples a_i and b_i, respectively. The patchwork scheme assumes that E[ā* - b̄*] = 2d, relying on the prior assumption that random sampling ensures equal expected values, i.e., E[ā - b̄] = 0. However, the actual difference of the sample means, ā - b̄, is not always zero in practice. Although the distribution of the mean difference is shifted to the right as shown in Figure 9, the probability of a wrong detection still remains (see the area below 0 in the watermarked distribution). The performance of the patchwork scheme depends on the distance between the two sample means and on d, which affects inaudibility. Furthermore, the patchwork scheme was originally designed for images: the original patchwork scheme has been applied to spatial-domain image data (Bender et al. 1996) (or, equivalently, time-domain audio data).
Figure 9: A comparison of the unwatermarked and watermarked distributions of the mean difference.

Figure 10: A comparison of the unwatermarked and watermarked distributions of the mean difference under the modified patchwork algorithm.

However, time-domain embedding is vulnerable even to weak attacks and modifications. Thus, the patchwork scheme can also be implemented in the transform domain (Arnold 2000) (Bassia et al. 2001) (Yeo and Kim 2003). These implementations enhance the original patchwork algorithm in three ways. First, the mean and variance of the sample values are computed in order to detect the watermarks. Second, the new algorithms assume that the distribution of the sample values is normal. Third, they try to decide the value d adaptively.

The Modified Patchwork Algorithm (MPA) (Yeo and Kim 2003) is described below:

1. Generate two sets A = {a_i} and B = {b_i} randomly. Calculate the sample means ā = (1/N) sum_{i=1}^{N} a_i and b̄ = (1/N) sum_{i=1}^{N} b_i, respectively, and the pooled sample standard error

    S = sqrt( ( sum_{i=1}^{N} (a_i - ā)² + sum_{i=1}^{N} (b_i - b̄)² ) / (N(N - 1)) ).

2. Apply the embedding function, which introduces an adaptive value change:

    a*_i = a_i + sign(ā - b̄) sqrt(CS)/2,
    b*_i = b_i - sign(ā - b̄) sqrt(CS)/2,        (6)

where C is a constant and "sign" is the sign function. This function makes the set with the larger mean larger and the set with the smaller mean smaller, so that the distance between the two sample means is always bigger than d = sqrt(CS), as shown in Figure 10.

3. Finally, replace the selected elements a_i and b_i by a*_i and b*_i.

Since the embedding function (6) introduces relative distance changes between the two sets, a natural test statistic for deciding whether or not the watermark is embedded should concern the distance between the means of A and B. The detection process is as follows:

1. Calculate the test statistic

    T² = (ā - b̄)² / S².

2. Compare T² with the threshold τ, and decide that a watermark is embedded if T² > τ and that no watermark is embedded otherwise.
The multiplicative patchwork scheme (Yeo and Kim 2003) provides a new way of patchwork embedding. Most embedding schemes are additive, such as x = s + αw, while multiplicative embedding schemes have the form x = s(1 + αw). Additive schemes shift the average, while multiplicative schemes change the variance; the detection scheme exploits this fact.
4.2 Amplitude Modification

This method embeds the watermark by changing the energies of two or three blocks. The energy of each block of length N is defined and calculated as

    E = sum_{i=1}^{N} |s(i)|.

The energy is high when the amplitude of the signal is large. Assume that two consecutive blocks are used to embed the watermark. We can make the two blocks A and B have the same or different energies by modifying the amplitude of each block. Let E_A and E_B denote the energies of blocks A and B, respectively. If E_A ≥ E_B + τ, then, for example, we conclude that the watermark message m = 0 is embedded. If E_A ≤ E_B - τ, then we conclude that the watermark message m = 1 is embedded. Otherwise, no watermark is embedded.

However, this method has a serious problem. Assume that block A has much more energy than block B. If the watermark message to be embedded is 0, there is no problem at all; otherwise, we have to reverse the energy relationship between the two blocks. When the energy gap is wide, the resulting artifact becomes obvious and unnatural enough to be noticed; unfortunately, this scheme can turn a "forte" passage into a "piano" passage, or vice versa. The problem can be moderated by using three blocks (Lie and Chang 2001) or more: with multiple blocks, such artifacts can be reduced slightly by distributing the modification across the other blocks.

Figure 11: Kernels for echo hiding.

5 Replica Method

The original signal itself can be used as an audio watermark. Echo hiding is a good example. Replica modulation also embeds part of the original signal, in the frequency domain, as a watermark. Thus, replica modulation embeds a replica, i.e., a properly modulated version of the original signal, as a watermark. The detector can likewise generate the replica from the watermarked audio and calculate the correlation. The most significant advantage of this method is its high immunity to synchronization attack.

5.1 Echo Hiding

Echo hiding embeds data into an original audio signal by introducing an echo in the time domain such that

    x(n) = s(n) + α s(n - d).        (7)

For simplicity, a single echo is added above (see Figure 11); however, multiple echoes can be added (Bender et al. 1996). Binary messages are embedded by echoing the original signal with one of two delays, either a d_0 sample delay or a d_1 sample delay. Extraction of the embedded message involves the detection of the delay d, which the autocepstrum or cepstrum can find. Cepstrum analysis duplicates the cepstrum impulses every d samples. The magnitudes of the impulses representing the echoes are small relative to the original audio; the solution to this problem is to take the auto-correlation of the cepstrum (Gruhl et al. 1996). A double echo (Oh et al. 2001) such as

    x(n) = s(n) + α s(n - d) - α s(n - d - Δ)

can reduce the perceptual signal distortion and enhance robustness. The typical value of Δ is less than three or four samples.
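Single-echo embedding per Equation (7) can be sketched as below. For brevity the detector uses plain autocorrelation over two candidate delays as a stand-in for the cepstrum-based detection described above; the delays, the echo strength, and the host signal are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.standard_normal(4096)              # stand-in host signal

def embed_echo(s, d, alpha=0.5):
    """Equation (7): x(n) = s(n) + alpha * s(n - d)."""
    x = s.copy()
    x[d:] += alpha * s[:-d]
    return x

def detect_delay(x, candidates):
    """Pick the candidate delay with the strongest autocorrelation peak
    (a simplified stand-in for cepstrum / autocepstrum detection)."""
    scores = [np.dot(x[d:], x[:-d]) for d in candidates]
    return candidates[int(np.argmax(scores))]

d0, d1 = 50, 100                           # delays encoding bit 0 and bit 1
x = embed_echo(s, d1)                      # embed bit 1
assert detect_delay(x, [d0, d1]) == d1
```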
Echo hiding is usually imperceptible and sometimes makes the sound rich. Synchronization methods frequently adopt it for coarse synchronization. One disadvantage of echo hiding is its high complexity, due to the cepstrum or autocepstrum computation during detection. On the other hand, anybody can detect the echo without any prior knowledge; in other words, it provides a clue for malicious attack, which is another disadvantage. Blind echo removal is partially successful (Petitcolas et al. 1998). The time-spread echo (Ko et al. 2002) can reduce the possibility of such attacks. Another way of evading blind attack is auto-correlation modulation (Petrovic et al. 1999), which obtains the watermark signal w(n) from the echoed signal x(n) in Equation (7); this method is more sophisticated and is elaborated in replica modulation. A double echo hiding scheme (Kim and Choi 2003),

    x(n) = s(n) + α s(n - d) + α s(n + d),

is now available. The virtual echo s(n + d) violates causality; however, it is possible to embed virtual echoes by delaying the echo-embedding process by d samples. These twin echoes make the cepstrum peak higher than a single echo with the same echo strength α. Thus, double echoes can enhance the detection rate due to the higher peak, or enhance imperceptibility by reducing α accordingly.

5.2 Replica Modulation

Replica modulation (Petrovic 2001) is a novel watermarking scheme that embeds a replica, i.e., a modified version of the original signal. Three replica modulation methods are the frequency-shift, phase-shift, and amplitude-shift schemes. The frequency-shift method transforms s(n) into the frequency domain, copies a fraction of the low-frequency components in a certain range (for example, from 1 kHz to 4 kHz), modulates them (by moving them 20 Hz, for example, with a proper scaling factor), inserts them back onto the original components (to cover the range from 1,020 Hz to 4,020 Hz), and transforms back to the time domain to generate the watermark signal w(n). Since the frequency components are shifted and added in the frequency domain, we call this a "frequency-domain echo", in contrast to the "time-domain echo", where the replica is obtained by a time shift of the original (or a portion of it). Such a modulated signal w(n) is a replica. This replica can be used as a carrier in much the same manner as the PN sequence in spread-spectrum techniques. Thus, the watermarked signal has the form

    x(n) = s(n) + α w(n).

As long as the components are invariant against modifications, the replica in the frequency domain can be regenerated from the watermarked signal: the watermark signal w̃(n) is generated from the watermarked signal x(n) by processing it according to the embedding process. Then the correlation between x(n) and w̃(n) is computed as follows:

    c = (1/N) sum_{i=1}^{N} s(i) w̃(i) + (1/N) sum_{i=1}^{N} α w(i) w̃(i).        (8)
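The regenerate-and-correlate detection of Equation (8) can be illustrated with a crude frequency-shift replica that moves a block of FFT bins upward. The band edges, the shift, and α are illustrative assumptions, and a real implementation would operate on framed audio with a shift expressed in Hz.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 4096
s = rng.standard_normal(N)                 # stand-in host signal

def make_replica(sig, lo=100, hi=1600, shift=5):
    """Copy spectrum bins [lo, hi) and move them up by `shift` bins."""
    S = np.fft.rfft(sig)
    R = np.zeros_like(S)
    R[lo + shift : hi + shift] = S[lo:hi]
    return np.fft.irfft(R, n=len(sig))

alpha = 0.5
w = make_replica(s)                        # replica generated from the host
x = s + alpha * w                          # watermarked signal

# Detector regenerates the replica from x and correlates (Equation (8)).
w_tilde = make_replica(x)
c = np.mean(x * w_tilde)
c_ref = np.mean(s * make_replica(s))       # unwatermarked baseline
assert c > c_ref                           # watermark term dominates
```

The key point is that the detector needs only x(n) and the embedding rule, not the original s(n) or a stored PN sequence.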
to detect the watermark. As long as we use a frequency band whose lower cut-off is much larger than the frequency shift, and the correlation is computed over an integer number of frequency-shift periods, the correlation between s(n) and w̃(n) in Equation (8) is very small. On the other hand, the spectrum of the product w(n)w̃(n) has a strong dc component, and thus c contains a term equal to the mean value of w(n)w̃(n), i.e., it contains the scaled auxiliary signal in the last term of Equation (8). Note that the frequency shift is just one way to generate a replica. A combination of frequency shift, phase shift, and amplitude shift makes it harder for a malicious attacker to derive a clue from the replica modulation, and makes the correlation between s(n) and w̃(n) even smaller. The main advantage over PN-sequence schemes is that chip synchronization is not needed during detection, which makes replica modulation immune to synchronization attacks. When an attacker mounts a jitter attack against a PN-sequence technique (e.g., cuts out a small portion of the audio and splices the signal back together), resynchronization is a must. On the contrary, replica modulation is free from synchronization, since the replica and the original give the same correlation before and after cutting and splicing. Of course, time-scaling attacks can still affect bit and packet synchronization, but this is a much smaller problem than chip synchronization. Pitch-scaling (Shin et al. 2002) is a variant of replica modulation in which the length of the audio remains unchanged, while the harmonics are either expanded or contracted accordingly.

6 Self-Marking Method

The self-marking method embeds a watermark by leaving self-evident marks in the signal. It embeds a special signal into the audio, or changes signal shapes in the time domain or frequency domain. The time-scale modification method (Mansour and Tewfik 2001) and many schemes based on salient features (Wu et al. 2000) belong to this category. A clumsy self-marking method, for example one that embeds a peak in the frequency domain, is prone to attack since the mark is easily noticeable.

6.1 Time-Scale Modification

Time-scale modification is a challenging attack, and it can also be used for watermarking (Mansour and Tewfik 2001). Time-scale modification refers to the process of either compressing or expanding the time-scale of audio. The basic idea of time-scale modification watermarking is to change the time-scale between two extrema (a successive maximum and minimum pair) of the audio signal (see Figure 12). The interval between two extrema is partitioned into N segments of equal amplitude, and the slope of the signal in certain amplitude interval(s) is changed according to the bits we want to embed, which changes the time-scale. For example, a steep slope stands for bit "0" and a gentle slope for bit "1", or vice versa. An advanced time-scale modification watermarking scheme (Mansour and Tewfik 2001) can survive the time-scale modification attack.

(a) Original signal. (b) Time-scale modified signal: a gentle slope embeds bit "1"; a steep slope embeds bit "0".

Figure 12: The concept of the time-scale modification watermarking scheme. Messages, either bit "0" or bit "1", can be embedded by changing the slopes between two successive extrema.

6.2 Salient Features

Salient features are signals that are special and noticeable to the embedder but ordinary to an attacker. They may be either natural or artificial; in either case, they must be robust against attacks. So far such features have been extracted or constructed empirically. Salient features can be used especially for synchronization or for robust watermarking, for example against the time-scale modification attack.

7 Synchronization

Watermark detection starts by aligning the watermarked block with the detector. Losing synchronization causes false detection. Time-scale or frequency-scale modification makes the detector lose synchronization; thus, the most serious and malicious attack is probably desynchronization. All watermarking algorithms assume that the detector is synchronized before detection. Brute-force search is computationally infeasible, so fast and exact synchronization algorithms are needed. Some watermarking schemes, such as replica modulation or echo hiding, are rather robust against certain types of desynchronization attacks; such schemes can be used as a baseline method for coarse synchronization. A synchronization code can be used to synchronize the onset of the watermarked block.
However, refined synchronization scheme design is not simple. Clever attackers also devise sophisticated desynchronization methods; thus, a synchronization scheme should itself be robust against attacks, and fast.

There are two synchronization problems. The first is to align the starting point of a watermarked block. This applies to attacks such as cropping or inserting redundancy. For example, a garbage clip can be added to the beginning of the audio, intentionally or unintentionally. Some MP3 encoders unintentionally add around 1,000 samples, which makes an innocent decoder fail to detect the exact watermarks. The second problem is time-scale and frequency-scale modification, done intentionally by malicious attackers or unintentionally by audio systems (Petrovic et al. 1999); either way, these are very difficult to cope with. Time-scale modification is a time-domain attack that periodically adds fake samples to the target audio or periodically deletes samples (Petitcolas et al. 1998), or uses sophisticated time-scaling schemes (Arfib 2002) (Dutilleux 2002) to preserve pitch. Thus, the audio length may be increased or decreased. On the other hand, frequency-scale modification (or pitch-scaling) adjusts the frequencies and then applies time-scale modification to keep the length unchanged. This attack can be implemented by sophisticated audio signal processing techniques (Arfib 2002) (Dutilleux 2002). Aperiodic modification is even more difficult to manage.

There are many audio features, such as brightness, zero-crossing rate, pitch, beat, and frequency centroid. Some of them can be used for synchronization as long as they are invariant under attacks. Feature analysis has been studied well in the speech processing literature, while very few studies are available for general audio. Recently, a precise synchronization scheme that is efficient and reliable against time-scaling and pitch-scaling attacks has been presented (Tachibana et al. 2001).
For robustness, this scheme calculates and manipulates the magnitudes of segmented areas in the time-frequency plane using short-term DFTs. The detector correlates the magnitudes with a pseudo-random array that corresponds to two-dimensional areas in the time-frequency plane. The purpose of the 2-D array is to detect the watermark as long as the information in at least one of the two dimensions survives, under the assumption that attacking the watermark in both dimensions at the same time is hardly feasible. Manipulating magnitudes (which is similar to amplitude modification) is useful since magnitudes are less affected than phases under attack. This scheme copes with time-scaling and pitch-scaling attacks and defends quite well against them.
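The magnitude-correlation idea can be sketched in a simplified toy form. This is not Tachibana et al.'s actual algorithm: the functions `pn_array`, `embed`, `correlate`, and the embedding strength of 0.5 are all illustrative assumptions, standing in for the real time-frequency segmentation and psychoacoustically shaped embedding.

```python
import random

def pn_array(rows, cols, seed):
    # two-dimensional pseudo-random +/-1 array over time-frequency segments
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(cols)] for _ in range(rows)]

def embed(mags, pn, strength=0.5):
    # nudge each segment magnitude up or down along the PN sign
    return [[m + strength * p for m, p in zip(mr, pr)]
            for mr, pr in zip(mags, pn)]

def correlate(mags, pn):
    # mean product of magnitudes and PN signs; large when the watermark is present
    n = sum(len(r) for r in mags)
    return sum(m * p for mr, pr in zip(mags, pn)
               for m, p in zip(mr, pr)) / n

mags = [[1.0] * 8 for _ in range(8)]   # stand-in segment magnitudes
pn = pn_array(8, 8, seed=7)
marked = embed(mags, pn)
# embedding raises the correlation by exactly the embedding strength
print(correlate(marked, pn) - correlate(mags, pn))   # 0.5
```

The point of the sketch is only the detection statistic: the correlation of unmarked magnitudes with the PN array hovers near zero, while embedding shifts it by the embedding strength.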
7.1 Coarse Alignment
Fine alignment is the final goal of synchronization. However, such alignment is not simple, so coarse synchronization is needed first to locate candidate positions quickly and effectively. Once such positions are identified, fine synchronization mechanisms are used for exact synchronization. A coarse alignment scheme should therefore be simple and fast.

The combination of energy and zero-crossings is a good example of a coarse alignment scheme. The total energy and the number of zero-crossings of each block are calculated, using a sliding window to delimit a block. If the two measures meet predefined criteria, we can conclude that the block is close to the target block for synchronization. This conclusion relies on the assumption that the energy and the number of zero-crossings are invariant under attacks. For example, a block with low energy and a large number of zero-crossings may be a good clue. The number of zero-crossings is closely related to frequency: a large number of zero-crossings implies that the audio contains high-frequency components. Energy computation is simple to implement: taking the absolute value of each sample and summing over the block gives the energy of the block. Counting the number of sign changes, from positive to negative and vice versa, gives the number of zero-crossings.

Echo hiding can also be used for coarse synchronization. For example, if evidence of an echo is identified, the block is near synchronization. Unfortunately, echo detection is considerably costly in terms of computing complexity. Replica modulation is rather robust against desynchronization attacks.
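The energy/zero-crossing test can be sketched in a few lines. The window size and the acceptance ranges below are illustrative assumptions; in practice they would be tuned to the embedder's block size and the audio material.

```python
def block_energy(block):
    # energy as described above: sum of absolute sample values
    return sum(abs(s) for s in block)

def zero_crossings(block):
    # count sign changes between consecutive samples
    return sum(1 for a, b in zip(block, block[1:]) if (a >= 0) != (b >= 0))

def coarse_align(samples, win, e_range, z_range):
    """Slide a window over the samples and return every offset whose
    energy and zero-crossing count meet the predefined criteria."""
    hits = []
    for off in range(len(samples) - win + 1):
        block = samples[off:off + win]
        if (e_range[0] <= block_energy(block) <= e_range[1]
                and z_range[0] <= zero_crossings(block) <= z_range[1]):
            hits.append(off)
    return hits

# quiet audio with a high-frequency burst starting at offset 8
audio = [0.1] * 8 + [1, -1, 1, -1, 1, -1, 1, -1] + [0.1] * 8
print(coarse_align(audio, 8, e_range=(7.5, 8.5), z_range=(6, 8)))  # [8]
```

Only the block that is both energetic and rich in zero-crossings survives the two tests, so the detector needs to run fine synchronization at one candidate offset instead of all of them.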
7.2 Synchronization Code
A synchronization code in the time domain based on the Barker code (Huang et al. 2002) is a notable idea. The Barker code (given, for bit length 12, as "111110011010") can be used for synchronization since it has a special autocorrelation function: a sharp peak at perfect alignment and small values elsewhere. To embed the code, this method forcibly sets the lowest 13 bits of a sample to "1100000000000" when the embedded message bit is "1", and to "0100000000000" otherwise, regardless of the sample value. For example, the 16-bit sample value "1000000011111111" is changed into "1001100000000000" to embed message bit "1" in the time domain. This method is claimed to achieve the best performance in resisting additive noise while keeping sufficient inaudibility.
(a) Exact match between sequences A and B (15 matches). (b) One chip off (3 matches). (c) One chip off with the chip rate extended by 3 (15 matches).

Figure 13: The concept of redundant-chip coding. The right figure is an extended version of the center figure with a chip rate of 3. Correlation is calculated only at the areas marked with dotted lines.
7.3 Salient Point Extraction

Salient point extraction without changing the original signal (Wu et al. 2000) is also a good scheme. The basic idea is to extract salient points at locations where the audio signal energy climbs quickly to a peak value. This approach works well for simple audio clips played by a few instruments. However, the scheme has two disadvantages with more complex audio clips. First, the overall energy variation becomes ambiguous for complex audio in which many instruments play together, so the stability of the salient points decreases. Second, it is difficult to define a threshold appropriate for all pieces of music: a high threshold value suits audio with sharp energy variation, but the same value applied to complex audio would yield very few salient points. Thus, audio content analysis (Wu et al. 2000) parses complex audio into several simpler pieces so that the stability of the salient points can be improved and the same threshold can be applied to all audio clips.

To avoid such complex operations, special shaping of the audio signal is also useful for coarse synchronization. This approach intentionally modifies the signal shape to create salient points that are sufficiently invariant under malicious modifications. For example, one can choose a fast-climbing portion of the signal and mark it with a special sawtooth shape. Such artificial marking may generate audible high-frequency noise, but careful shaping can reduce the noise to a hardly audible level.

7.4 Redundant-Chip Coding

A pseudo-random sequence is a good tool for watermarking. As mentioned, correlation is effective for detecting the watermark as long as perfect synchronization is achieved. When the pseudo-random sequence is exactly aligned, its correlation approaches Equation (5). Figure 13-(a) depicts perfect synchronization with a 15-chip pseudo-random sequence (not an M-sequence in this example); its normalized autocorrelation is 1. However, if the sequences are misaligned by one chip, as shown in Figure 13-(b), the autocorrelation falls to −3/15. This problem can be solved by redundant-chip coding (Kirovski and Malvar 2001). Figure 13-(c) shows the chip rate expanded by a factor of 3; now a misalignment of one chip does not matter. During the detection phase, only the central sample of each expanded chip is used for computing the correlation; the central chips are marked by broken lines in Figure 13-(c). With redundant-chip encoding expanded by R chips, correct detection is possible up to a misalignment of R/2 chips. Of course, this method enhances robustness at the cost of embedding capacity.
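The central-sample detection trick can be sketched as follows. This is a toy illustration with a 5-chip sequence and an expansion factor of 3; the `pn` values and the one-sample misalignment are arbitrary choices for the demonstration.

```python
def expand(chips, r):
    # redundant-chip coding: repeat every chip r times
    return [c for c in chips for _ in range(r)]

def central_samples(received, n_chips, r):
    # detection uses only the central sample of each expanded chip
    return [received[i * r + r // 2] for i in range(n_chips)]

pn = [1, -1, 1, 1, -1]
tx = expand(pn, 3)
# simulate a one-sample misalignment (e.g., one sample cut at the front)
rx = tx[1:] + [0]
print(central_samples(rx, len(pn), 3))   # [1, -1, 1, 1, -1] -- pn recovered
```

Because each chip spans three samples, a shift of one sample still leaves every central read inside the correct chip, so the recovered sequence matches the original PN sequence and the correlation detector sees a clean peak.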
7.5 Beat-Scaling Transform
The beat, the salient periodicity of a music signal, is one of the fundamental characteristics of audio. A serious beat change can spoil the music; thus, the beat must be almost invariant under attacks. In this context, the beat can be a very important marker for synchronization. The beat-scaling transform (Kirovski and Attias 2002) can be used to enable synchronicity between the watermark detector and the location of the watermark in an audio clip. The beat-scaling transform method calculates the average beat period in the clip and identifies the location of each beat as accurately as possible. Next, the audio clip is scaled (i.e., stretched or shortened) such that the length of each beat period is constant and equal to the average beat period rounded to the nearest multiple of a certain block of samples. The scaled clip is watermarked and then scaled back to its original tempo. As long as the beat remains unchanged, watermarks can be detected from the scaled beat periods. Beat detection algorithms are presented in (Goto and Muraoka 1999) (Scheirer 1998). Of course, in this case the synchronization relies on the accuracy of the beat detection algorithm.
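A minimal sketch of the scaling step follows. The beat locations are supplied by hand here; in the actual scheme they would come from a beat detector, and a production system would use a higher-quality resampler than linear interpolation.

```python
def resample_linear(x, new_len):
    # crude linear-interpolation resampler (stand-in for a proper one)
    if new_len == 1:
        return [x[0]]
    out, step = [], (len(x) - 1) / (new_len - 1)
    for i in range(new_len):
        t = i * step
        j = min(int(t), len(x) - 2)
        frac = t - j
        out.append(x[j] * (1 - frac) + x[j + 1] * frac)
    return out

def beat_scale(samples, beat_starts, target_len):
    """Stretch or shorten every beat period to target_len samples so the
    watermark embedder sees a clip with a constant beat period."""
    scaled = []
    for a, b in zip(beat_starts, beat_starts[1:]):
        scaled.extend(resample_linear(samples[a:b], target_len))
    return scaled

samples = list(range(10))            # stand-in audio: one 4-sample and one 6-sample beat
scaled = beat_scale(samples, beat_starts=[0, 4, 10], target_len=4)
print(len(scaled))                   # 8: two beats, 4 samples each
```

After this normalization every beat period has the same length, so the embedder can place the watermark at fixed positions relative to the beat grid; the inverse mapping restores the original tempo.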
8 Conclusions

The available studies on audio watermarking are far fewer than those on image or video watermarking. However, during the last decade audio watermarking studies have also increased considerably, and they have contributed much to the progress of audio watermarking technology. This paper surveyed those papers and classified them into five categories: quantization scheme, spread-spectrum scheme, two-set scheme, replica scheme, and self-marking scheme. The quantization scheme is not very robust against attacks, but it is easy to implement. The spread-spectrum scheme requires psychoacoustic adaptation for inaudible noise embedding, and this adaptation is rather time-consuming; of course, most audio watermarking schemes need psychoacoustic modelling for inaudibility. Another disadvantage of the spread-spectrum scheme is its difficulty with synchronization. On the other hand, the replica method is effective for synchronization. However, echo hiding is vulnerable to attack; replica modulation (Petrovic 2001) is more secure than echo hiding. Among the two-set schemes, the modified patchwork algorithm (Yeo and Kim 2003) is also very much elaborated. The self-marking method can be used especially for synchronization or for robust watermarking, for example against the time-scale modification attack. These five seminal lines of work have improved watermarking schemes remarkably; however, more sophisticated technologies are required, and are expected to be achieved in the next decade. Synchronization schemes are also very important, and this article briefly surveyed the basic ideas for synchronization.

Acknowledgments

This work was in part supported by the Brain Korea 21 Project, Kangwon National University. The authors appreciate Prof. D. Ghose of the Indian Institute of Science for his comments. The authors also appreciate Dr. Rade Petrovic of Verance Inc., Mr. Michael Arnold of Fraunhofer Gesellschaft, and Dr. Fabien A. P. Petitcolas of Microsoft for their kind personal communications and reviews. The authors also appreciate Taehoon Kim, Kangwon National University, for implementing various schemes and providing useful information.

References

Arfib, D., Keiler, F., and Zölzer, U. (2002), "Time-frequency Processing," in DAFX: Digital Audio Effects, edited by U. Zölzer, John Wiley and Sons, pp. 237-297.

Arnold, M. (2000), "Audio watermarking: features, applications and algorithms," IEEE International Conference on Multimedia and Expo, vol. 2, pp. 1013-1016.

Arnold, M. (2001), "Audio watermarking: Burying information in the data," Dr. Dobb's Journal, vol. 11, pp. 21-28.
Arnold, M., and Schilz, K. (2002), "Quality evaluation of watermarked audio tracks," SPIE Electronic Imaging, vol. 4675, pp. 91-101.

Bassia, P., Pitas, I., and Nikolaidis, N. (2001), "Robust audio watermarking in the time domain," IEEE Transactions on Multimedia, vol. 3, pp. 232-241.

Bender, W., Gruhl, D., Morimoto, N., and Lu, A. (1996), "Techniques for data hiding," IBM Systems Journal, vol. 35, pp. 313-336.

Boeuf, J., and Stern, J.P. (2001), "An analysis of one of the SDMI audio watermarks," Proceedings: Information Hiding, pp. 407-423.

Boney, L., Tewfik, A. H., and Hamdy, K. N. (1996), "Digital watermarks for audio signals," International Conference on Multimedia Computing and Systems, Hiroshima, Japan, pp. 473-480.

Chen, B., and Wornell, G.W. (1999), "Dither modulation: A new approach to digital watermarking and information embedding," Proceedings of the SPIE: Security and Watermarking of Multimedia Contents, vol. 3657, pp. 342-353.

Cox, I.J., Kilian, J., Leighton, F.T., and Shamoon, T. (1996), "Secure spread spectrum watermarking for multimedia," IEEE Transactions on Image Processing, vol. 6, pp. 1673-1687.

Cox, I.J., Miller, M.L., and Bloom, J.A. (2002), Digital Watermarking, Morgan Kaufmann Publishers.

Cox, I.J., and Miller, M.L. (2002), "The first 50 years of electronic watermarking," Journal of Applied Signal Processing, vol. 2, pp. 126-132.

Craver, S. A., Memon, N., Yeo, B.-L., and Yeung, M. M. (1998), "Resolving rightful ownerships with invisible watermarking techniques: Limitations, attacks, and implications," IEEE Journal on Selected Areas in Communications, vol. 16, no. 4, pp. 573-586.

Craver, S. A., Wu, M., Liu, B., Stubblefield, A., Swartzlander, B., Wallach, D. S., Dean, D., and Felten, E. W. (2001), "Reading between the lines: Lessons from the SDMI challenge," USENIX Security Symposium.

Craver, S., Liu, B., and Wolf, W. (2002), "Detectors for echo hiding systems," Information Hiding, Lecture Notes in Computer Science, vol. 2578, pp. 247-257.

Cvejic, N., Keskinarkaus, A., and Seppänen, T. (2001), "Audio watermarking using m-sequences and temporal masking," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, pp. 227-230.

Cvejic, N., and Seppänen, T. (2002), "Improving audio watermarking scheme using psychoacoustic watermark filtering," IEEE International Conference on Signal Processing and Information Technology, Cairo, Egypt, pp. 169-172.

Dutilleux, P., De Poli, G., and Zölzer, U. (2002), "Time-segment Processing," in DAFX: Digital Audio Effects, edited by U. Zölzer, John Wiley and Sons, pp. 201-236.

Foote, J., and Adcock, J. (2002), "Time base modulation: A new approach to watermarking audio and images," e-print.

Goto, M., and Muraoka, Y. (1999), "Real-time beat tracking for drumless audio signals," Speech Communication, vol. 27, nos. 3-4, pp. 331-335.

Gruhl, D., Lu, A., and Bender, W. (1996), "Echo hiding," Pre-Proceedings: Information Hiding, Cambridge, UK, pp. 295-316.

Haitsma, J., van der Veen, M., Kalker, T., and Bruekers, F. (2000), "Audio watermarking for monitoring and copy protection," ACM Multimedia Workshop, Marina del Rey, California, pp. 119-122.

Hartung, F., and Girod, B. (1998), "Watermarking of uncompressed and compressed video," Signal Processing, vol. 66, pp. 283-301.

Hsieh, C.-T., and Tsou, P.-Y. (2002), "Blind cepstrum domain audio watermarking based on time energy features," IEEE International Conference on Digital Signal Processing, vol. 2, pp. 705-708.

Huang, J., Wang, Y., and Shi, Y. Q. (2002), "A blind audio watermarking algorithm with self-synchronization," IEEE International Symposium on Circuits and Systems, vol. 3, pp. 627-630.

Kim, H. (2000), "Stochastic model based audio watermark and whitening filter for improved detection," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 1971-1974.

Kim, H.J., Choi, Y.H., Seok, J., and Hong, J. (2003), "Audio watermarking techniques," Intelligent Watermarking Techniques: Theory and Applications, World Scientific Publishing (to appear).

Kim, H.J., and Choi, Y.H. (2003), "A novel echo hiding algorithm," IEEE Transactions on Circuits and Systems for Video Technology (to appear).

Kirovski, D., and Malvar, H. (2001), "Robust spread-spectrum audio watermarking," IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, pp. 1345-1348.

Kirovski, D., and Attias, H. (2002), "Audio watermark robustness to desynchronization via beat detection," Information Hiding, Lecture Notes in Computer Science, vol. 2578, pp. 160-175.

Ko, B.-S., Nishimura, R., and Suzuki, Y. (2002), "Time-spread echo method for digital audio watermarking using PN sequences," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 2001-2004.

Lee, S.K., and Ho, Y.S. (2000), "Digital audio watermarking in the cepstrum domain," IEEE Transactions on Consumer Electronics, vol. 46, no. 3, pp. 744-750.

Lie, W.-N., and Chang, L.-C. (2001), "Robust and high-quality time-domain audio watermarking subject to psychoacoustic masking," IEEE International Symposium on Circuits and Systems, vol. 2, pp. 45-48.

Liu, Z., Kobayashi, Y., Sawato, S., and Inoue, A. (2002), "A robust audio watermarking method using sine function patterns based on pseudo-random sequences," Proceedings of Pacific Rim Workshop on Digital Steganography 2002, pp. 167-173.

Mansour, M. F., and Tewfik, A. H. (2001), "Time-scale invariant audio data embedding," International Conference on Multimedia and Expo.

Mansour, M. F., and Tewfik, A. H. (2001), "Audio watermarking by time-scale modification," International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1353-1356.

Nahrstedt, K., and Qiao, L. (1998), "Non-invertible watermarking methods for MPEG video and audio," ACM Multimedia and Security Workshop, Bristol, U.K., pp. 93-98.

Oh, H.O., Seok, J.W., Hong, J.W., and Youn, D.H. (2001), "New echo embedding technique for robust and imperceptible audio watermarking," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1341-1344.

Petitcolas, F.A.P., Anderson, R.J., and Kuhn, M.G. (1998), "Attacks on copyright marking systems," Information Hiding, Lecture Notes in Computer Science, vol. 1525, pp. 218-238.

Petrovic, R., Winograd, J.M., Jemili, K., and Metois, E. (1999), "Data hiding within audio signals," International Conference on Telecommunications in Modern Satellite, Cable, and Broadcasting Services, vol. 1, pp. 88-95.

Petrovic, R. (2001), "Audio signal watermarking based on replica modulation," International Conference on Telecommunications in Modern Satellite, Cable, and Broadcasting Services, vol. 1, pp. 227-234.

Rossing, T.D., Moore, F.R., and Wheeler, P.A. (2002), The Science of Sound, 3rd ed., Addison-Wesley, San Francisco.

Scheirer, E. (1998), "Tempo and beat analysis of acoustic musical signals," Journal of the Acoustical Society of America, vol. 103, pp. 588-601.

Seok, J., Hong, J., and Kim, J. (2002), "A novel audio watermarking algorithm for copyright protection of digital audio," ETRI Journal, vol. 24, pp. 181-189.

Shin, S., Kim, O., Kim, J., and Choi, J. (2002), "A robust audio watermarking algorithm using pitch scaling," IEEE International Conference on Digital Signal Processing, pp. 701-704.

Swanson, M., Zhu, B., Tewfik, A., and Boney, L. (1998), "Robust audio watermarking using perceptual masking," Signal Processing, vol. 66, pp. 337-355.

Tachibana, R., Shimizu, S., Kobayashi, S., and Nakamura, T. (2001), "An audio watermarking method robust against time- and frequency-fluctuation," Proceedings of the SPIE: Security and Watermarking of Multimedia Contents, vol. 4314, pp. 104-115.

Wu, C.-P., Su, P.-C., and Kuo, C.-C. J. (2000), "Robust and efficient digital audio watermarking using audio content analysis," Security and Watermarking of Multimedia Contents, SPIE, vol. 3971, pp. 382-392.

Wu, M., Craver, S. A., Felten, E. W., and Liu, B. (2001), "Analysis of attacks on SDMI audio watermarks," IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1369-1372.

Yeo, I.-K., and Kim, H.J. (2003), "Modified patchwork algorithm: A novel audio watermarking scheme," IEEE Transactions on Speech and Audio Processing, vol. 11 (to appear).

Yeo, I.-K., and Kim, H.J. (2003), "Generalized patchwork algorithm for image watermarking scheme," ACM Multimedia Systems (to appear).