Audio Engineering Society
Convention Paper
Presented at the 116th Convention, 2004 May 8–11, Berlin, Germany

This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.
Full-Duplex Systems for Sound Field Recording and Auralization Based on Wave Field Synthesis

Herbert Buchner, Sascha Spors, and Walter Kellermann
University of Erlangen-Nuremberg, Cauerstr. 7, 91058 Erlangen, Germany
Correspondence should be addressed to Herbert Buchner ([email protected])

ABSTRACT
For high-quality multimedia communication systems such as telecollaboration or virtual reality applications, both multichannel sound reproduction and full-duplex capability are highly desirable. Full 3D sound spatialization over a large listening area is offered by wave field synthesis, where arrays of loudspeakers generate a prespecified sound field. However, before this new technique can be utilized for full-duplex systems with microphone arrays and loudspeaker arrays, an efficient solution to the problem of multichannel acoustic echo cancellation (MC AEC) has to be found in order to avoid acoustic feedback. This paper presents a novel approach that extends the current state of the art of MC AEC and transform-domain adaptive filtering by reconciling the flexibility of adaptive filtering and the underlying physics of acoustic waves in a systematic and efficient way. Our new framework of wave-domain adaptive filtering (WDAF) explicitly takes into account the spatial dimensions of loudspeaker arrays and microphone arrays with closely spaced transducers. Experimental results with a 32-channel AEC verify the concept for both simulated and measured room acoustics.
1. INTRODUCTION

Multichannel techniques for reproduction and acquisition of speech and audio signals at the acoustic human-machine interface offer spatial selectivity and diversity as additional degrees of freedom over single-channel systems. Multichannel sound reproduction enhances sound realism in virtual reality and multimedia communication systems, such as teleconferencing or teleteaching (especially of music), and aims at creating a three-dimensional illusion of sound sources positioned in a virtual acoustical environment. However, advanced loudspeaker-based approaches, like the 3/2-surround format, still rely on a restricted listening area ('sweet spot'). A volume solution for a large listening space is offered by the Wave Field Synthesis (WFS) method [1], which is based on wave physics. In WFS, arrays of a large number P of individually driven loudspeakers generate a prespecified sound field; P may lie between 20 and several hundred. On the recording side of advanced acoustic human-machine interfaces, the use of microphone arrays [2], where the number Q of microphones may reach up to 500 [3], is an effective approach to separate desired and undesired sources in the listening environment, and to cope with reverberation in the recorded signal. Figure 1 shows an example of a general multichannel communication setup.

Fig. 1: Exemplary setup for multichannel communication (microphone array and loudspeaker array in a room with reverberation, echoes, and background noise).

In this paper, we consider full-duplex systems based on such massive multichannel techniques for high-quality recording and reproduction. A major challenge to fully exploit the potential of array processing in such applications lies in the development of adaptive MIMO (multiple-input and multiple-output) systems that are suitable for the large number of channels in this environment. The point-to-point optimization in adaptive MIMO systems often suffers from convergence problems and high computational complexity, so that some applications are beyond reach with current techniques. In particular, before full-duplex communication in two-way systems can be deployed, acoustic echo cancellation (AEC) needs to be implemented for the resulting P · Q echo paths, which seems to be out of reach for current multichannel AEC [4, 9] in conjunction with large loudspeaker arrays for spatial audio. Similar problems arise in other building blocks of the acoustic interface, e.g., for acoustic room compensation (ARC) on the reproduction side, where a system of suitable prefilters takes into account the actual room acoustics prior to sound reproduction by WFS, and also for adaptive interference cancellation on the recording side [2].

To address the specific problems of adaptive array processing for acoustic human-machine interfaces, we present in this paper a novel framework for spatio-temporal transform-domain adaptive filtering, called wave-domain adaptive filtering (WDAF). This concept reconciles the flexibility of adaptive filtering and the underlying physics described by the acoustic wave equation. It is suitable for spatial audio reproduction systems like wave field synthesis with an arbitrarily high number of reproduction channels. Although we refer here to two-dimensional wave fields and WFS, the proposed technique can also be applied to Ambisonics and extended to 3D fields. We illustrate the concept by means of a full-duplex acoustic interface consisting of an AEC, beamforming for signal acquisition, and acoustic room compensation for high-quality sound reproduction.

2. WAVE FIELD SYNTHESIS AND ANALYSIS

Sound reproduction by wave field synthesis (WFS) using loudspeaker arrays is based on Huygens' principle. It states that any point of a wave front of a propagating sound pressure wave p(r, t) at any instant of time conforms to the envelope of spherical waves emanating from every point on the wave front at the prior instant. This principle can be used to synthesize acoustical wavefronts of arbitrary shape. Due to
the reciprocity of wave propagation it also applies to wave field analysis (WFA) on the recording side. Its mathematical formulation is given by the Kirchhoff-Helmholtz integrals (e.g., [1, 10]), which can be derived from the acoustic wave equation (given here for lossless media) and Newton's second law,

\nabla^2 p(r, t) - \frac{1}{c^2} \frac{\partial^2 p(r, t)}{\partial t^2} = 0,   (1)

-\nabla p(r, t) = \rho \frac{\partial v(r, t)}{\partial t},   (2)

respectively, where c denotes the velocity of sound, \rho is the density of the medium, and v(r, t) is the particle velocity. Since we assume two-dimensional wave fields, we choose polar coordinates (r, \theta) throughout this paper. Using Green's second theorem, applied to a contour C enclosing a region S, we obtain from (1) and (2) the 2D forward Kirchhoff-Helmholtz integral

\underline{p}^{(2)}(r, \omega) = \frac{-jk}{4} \oint_C \left\{ \underline{p}(r_0, \omega) \cos\varphi \, H_1^{(2)}(k \Delta r) + j \rho c \, \underline{v}_n(r_0, \omega) \, H_0^{(2)}(k \Delta r) \right\} d\ell   (3)

and the 2D inverse Kirchhoff-Helmholtz integral

\underline{p}^{(1)}(r, \omega) = \frac{-jk}{4} \oint_C \left\{ \underline{p}(r_0, \omega) \cos\varphi \, H_1^{(1)}(k \Delta r) + j \rho c \, \underline{v}_n(r_0, \omega) \, H_0^{(1)}(k \Delta r) \right\} d\ell,   (4)

where \Delta r = |r - r_0| and k = \omega/c. H_n^{(1)} and H_n^{(2)} are the Hankel functions of the first and second kind, respectively, which are the fundamental solutions of the wave equation in polar coordinates. All quantities in the temporal frequency domain are underlined. \underline{v}_n denotes the frequency-domain version of the radial component of v. The total wave field is then given by the sum of the incoming and outgoing contributions w.r.t. S:

\underline{p}(r, \omega) = \underline{p}^{(1)}(r, \omega) + \underline{p}^{(2)}(r, \omega).   (5)

The 2D Kirchhoff-Helmholtz integrals (3) and (4) state that at any listening point within the source-free listening area the sound pressure can be calculated if both the sound pressure and its gradient are known on the contour C enclosing this area. For practical implementations in 2D sound fields, the acoustic sources on the closed contour are realized by loudspeakers at discrete positions. Note that (3) and (4) can analogously be applied for wave field analysis using a microphone array consisting of pressure and pressure-gradient microphones. The spatial sampling along the contour C defines the aliasing frequencies. While microphone spacings are usually designed for a wide frequency range, lower aliasing frequencies may be tolerated for reproduction, as the human auditory system seems not to be very sensitive to spatial aliasing artifacts above approximately 1.5 kHz. Thus, without loss of generality, for higher frequencies a practical system could easily be complemented by other existing methods, e.g., 5.1 systems.

3. CONVENTIONAL ADAPTIVE MULTICHANNEL PROCESSING

3.1. Multichannel Acoustic Echo Cancellation

Classical AEC applications are hands-free telephony and teleconference systems, most of which are still based on monaural sound reproduction. Only recently have the first stereophonic prototypes appeared [11], [12], and lately it has become possible to extend the system to the multichannel case (for 5-channel surround sound see, e.g., [13]). In this paper, the concept of this frequency-domain framework will be extended for WFS in Sect. 4.
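To make the echo-cancellation loop of Fig. 2 concrete, the following toy sketch adapts a single-channel FIR filter to a known echo path and subtracts the echo estimate from the microphone signal. It uses a normalized LMS update for simplicity, whereas the paper itself relies on RLS-like frequency-domain algorithms (Sect. 3.3); all signal names and parameters here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nlms_aec(x, y, L=8, mu=0.5, eps=1e-8):
    """Single-channel AEC sketch: adapt h_hat so that y - h_hat * x is minimized."""
    h_hat = np.zeros(L)
    e = np.zeros(len(x))
    for n in range(L - 1, len(x)):
        x_vec = x[n - L + 1:n + 1][::-1]      # L most recent loudspeaker samples
        y_hat = h_hat @ x_vec                 # echo estimate
        e[n] = y[n] - y_hat                   # residual sent to the far end
        h_hat += mu * e[n] * x_vec / (x_vec @ x_vec + eps)  # NLMS update
    return h_hat, e

# Toy experiment: white noise through a short hypothetical 'echo path'
rng = np.random.default_rng(0)
x = rng.standard_normal(20000)
h_true = np.array([0.6, -0.4, 0.25, 0.1, -0.05, 0.02, 0.0, 0.0])
y = np.convolve(x, h_true)[:len(x)]
h_hat, e = nlms_aec(x, y)
```

In this noiseless single-channel setting the filter converges to the true echo path and the residual vanishes; the point of Sects. 3 and 4 is precisely that this simple picture breaks down for many highly correlated channels.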
Fig. 2: Basic MC AEC structure (transmission room with impulse responses g_1(n), ..., g_P(n); receiving room with loudspeaker signals x_1(n), ..., x_P(n), echo paths h_1(n), ..., h_P(n), adaptive filters \hat{h}_1(n), ..., \hat{h}_P(n), microphone signal y(n), and residual error e(n)).

The fundamental idea of any P-channel AEC structure (Fig. 2) is to use adaptive FIR filters of length L with impulse response vectors \hat{h}_i(n), i = 1, ..., P, that continuously identify the truncated (generally time-varying) echo path impulse responses h_i(n) whenever the sources in the receiving room are inactive. The filters \hat{h}_i(n) are stimulated by the loudspeaker signals x_i(n) and, then, the resulting echo
estimates \hat{y}_i(n) are subtracted from the microphone signal y(n) to cancel the echoes. For multiple microphones, each of them is considered separately in this way. The filter length L may be on the order of several thousand.

The specific problems of MC AEC include all those known for mono AEC, but in addition, MC AEC often has to cope with high correlation between the different loudspeaker signals [7, 9]. The correlation results from the fact that the signals are almost always derived from common sound sources in the transmission room, as shown in Fig. 2. The optimization problem therefore often leads to a severely ill-conditioned normal equation to be solved for the P · L filter coefficients. Therefore, sophisticated adaptation algorithms taking the cross-correlation into account are necessary for MC AEC [9] (see Sect. 3.3).

3.2. A Conventional Approach to System Integration with WFS

Figure 3 shows a multichannel loudspeaker-enclosure-microphone (LEM) setup which acts as transmission and receiving room simultaneously. In general, the loudspeaker signals are generated in a two-step procedure: auralization of a transmission room or an arbitrary virtual room, and compensation of the acoustics in the receiving room. Auralization using WFS is performed by convolution of source signals x''(n) with a generally time-varying matrix A(n) of impulse responses, which may be computed according to the WFS theory as shown above [1]. Matrix G(n) stands for an adaptive MIMO system for acoustic room compensation (ARC). Similar to AEC for array processing, ARC is still a challenging research topic, as it requires measurement and control of the wave field in the entire listening area, which is hardly possible with current methods. (The impulse response matrix from the WFS array to the possible listener positions is given by H_L(n), while the corresponding matrix from the WFS array to the microphone array is given by H(n).) However, application of the new concept presented in Sect. 4 to ARC shows promising results [6].

Fig. 3: Building blocks of the conventional structure (source signals x''(n), auralization matrix A(n), ARC prefilter G(n), loudspeaker signals x(n), LEM system H(n) with estimate \hat{H}(n) and listener paths H_L(n), microphone signals y(n), fixed beamformer B(n) with outputs y'(n), and voting stage V(n) with outputs y''(n); channel counts P'', P', P, Q, Q', Q'').

Processing on the recording side using fixed or time-varying (adaptive) beamformers (BF) can generally be described by another MIMO system B(n) in Fig. 3. Using B(n), beams of increased sensitivity can be directed at the active talker(s), so that interfering sources, background noise, and reverberation are attenuated at the output of B(n). To facilitate the integration of AEC into the microphone path, a decomposition of B(n) may be carried out, e.g., as shown in [2, 12]. First, a set of Q' fixed beams is generated from the Q microphone signals. These fixed beams cover all potential sources of interest and correspond to a time-invariant impulse response matrix B(n). The fixed beamformer is followed by a time-variant stage V(n) ('voting'). The advantage of this decomposition is twofold. First, it allows integration of AEC as explained below. Second, automatic beam steering towards sources of interest is possible, whereby external information on the positions via audio, video, or multimodal object localization can be easily incorporated.
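The fixed-beam stage of this decomposition can be sketched as a narrowband weight-and-sum beamformer: one steering vector per fixed beam, applied to the microphone spectra at a given frequency bin. This is only a minimal illustration under assumed names and shapes, not the design used in [2, 12].

```python
import numpy as np

def fixed_beams(mic_spec, steering):
    """Narrowband fixed beamformer bank (one row of steering per beam).

    mic_spec: (Q,) complex microphone spectra at one frequency bin.
    steering: (Qp, Q) unit-modulus steering vectors, one per fixed beam.
    Returns (Qp,) beam outputs with unit gain in each look direction.
    """
    Q = mic_spec.shape[0]
    return (np.conj(steering) @ mic_spec) / Q

# A wavefront matching beam 0's look direction passes with unit gain,
# while mismatched (random) look directions attenuate it.
rng = np.random.default_rng(1)
Q = 16
phases = rng.uniform(0, 2 * np.pi, size=(3, Q))  # 3 hypothetical look directions
steering = np.exp(1j * phases)
wave = steering[0]                               # wavefront aligned with beam 0
out = fixed_beams(wave, steering)
```

The time-variant voting stage V(n) would then simply select or weight among these already computed beam outputs.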
When placing the AEC between the two branches in Fig. 3, it is ideally desirable that the number of impulse responses to be identified is minimized and that the echo paths are time-invariant or only very slowly time-varying. In [4] it has been concluded that the most practical solution is placing the AEC between x'' and y' as shown in Fig. 3, since placing the AEC in parallel to the room echoes H(n) (i.e., between x and y) is prohibitive due to the high number of P · Q impulse responses. On the other hand, positioning the AEC between x'' and y'' (P'' · Q'' impulse responses) would include the time-variant matrix V(n) in the LEM model. However, a major drawback of this system is that the wave field rendering system A(n) is
not allowed to be time-varying, which limits the applicability to rendering only fixed virtual sources. The new approach in Sect. 4 does not exhibit this limitation.

3.3. Multichannel Adaptive Filtering

For various ill-conditioned optimization problems in adaptive signal processing, such as MC AEC, the recursive least-squares (RLS) algorithm is known to be the optimum choice in terms of convergence speed, as, in contrast to other algorithms, it exhibits properties that are independent of the eigenvalue spread, i.e., the condition number, of the input correlation matrix [14]. The update equation of the multichannel RLS (MC RLS) algorithm reads for one output channel

\hat{h}(n) = \hat{h}(n-1) + R_{xx}^{-1}(n) \, x(n) \, e(n),   (6)

where \hat{h}(n) is the multichannel coefficient vector obtained by concatenating the length-L impulse response vectors \hat{h}_i(n) of all input channels, and e(n) = y(n) - \hat{y}(n) is the current residual error between the echoes and the echo replicas. The length-PL vector x(n) is a concatenation of the input signal vectors containing the L most recent input samples of each channel. The correlation matrix R_{xx} takes all auto-correlations within, and, most importantly for multichannel processing, all cross-correlations between the input channels into account (see upper left corner of Fig. 4). However, the major problems of RLS algorithms are the very high computational complexity (mainly due to the large matrix inversion) and potential numerical instabilities, which often limit the actual performance in practice.

An efficient and popular alternative to time-domain algorithms are transform-domain adaptive filtering algorithms [15], and in particular algorithms working in the DFT domain, called frequency-domain adaptive filtering (FDAF) algorithms [16]. In FDAF, the adaptive filters are updated in a block-by-block fashion, using the fast Fourier transform (FFT) as a powerful vehicle. Recently, the FDAF approach has been extended to the multichannel case (MC FDAF) by a mathematically rigorous derivation based on a weighted least-squares criterion [13, 17]. It has been shown that there is a generic wideband frequency-domain algorithm which is equivalent to the RLS algorithm. As a result of this approach, the arithmetic complexity of multichannel algorithms can be significantly reduced compared to time-domain adaptive algorithms, while the desirable RLS-like properties and the basic structure of (6) are maintained by an inherent approximate block-diagonalization of the correlation matrix, as shown in the second column of Fig. 4. This allows the matrix inversion in (6) to be performed in a frequency-bin-selective way using only small and better conditioned P × P matrices S_{xx}^{(\nu)} in the bins \nu = 0, ..., 2L - 1. Note that all cross-correlations between different input channels are still fully taken into account by this approach.

4. THE NOVEL APPROACH: WAVE-DOMAIN ADAPTIVE FILTERING
With the dramatically increased number of highly correlated loudspeaker channels in WFS-based systems, even the matrices S_{xx}^{(\nu)} become large and ill-conditioned, so that current algorithms cannot be used. In this section we extend the conventional concept of MC FDAF by a more detailed consideration of the spatial dimensions and by exploitation of the wave physics foundations presented in Sect. 2.

4.1. Basic Concept

From a physical point of view, the nice properties of FDAF result from the orthogonality of the DFT basis functions, i.e., the complex exponentials. Obviously, these exponentials also separate the temporal dimension of the wave equation (1). Therefore, it is desirable to find a suitable spatio-temporal transform domain based on orthogonal basis functions that allow not only an approximate decomposition among the temporal frequencies as in MC FDAF, but also an approximate spatial decomposition with basis functions fulfilling (1), as illustrated by the third column of Fig. 4. In the next subsection we will introduce a suitable transform domain. Performing the adaptive filtering in a spatio-temporal transform domain requires spatial sampling on both the input and the output of the system. Then, in contrast to conventional MC FDAF, not only all loudspeaker signals but also all microphone signals must simultaneously be taken
into account for the adaptive processing. Moreover, with the given orthogonality between the spatial components in the transform domain, most cross-channels in the transform domain can be completely neglected, so that in practice only the main diagonal (see Sect. 6), and possibly (depending on the application) the first off-diagonals of the filter coefficient matrix need to be adapted. This leads to the general setup of WDAF-based acoustic interface processing incorporating spatial filtering (analogously to Fig. 3) and AEC, as shown in Fig. 5. Due to the decoupling of the channels, not only are the convergence properties improved, but the computational complexity is also reduced dramatically. Let us assume Q = P microphone channels. In the simplest case, instead of P^2 filters in the conventional approach, we only need to adapt P channels in the transform domain. By additionally taking into account the symmetry property of spatial frequency components, this number is further reduced to P/2. Thus, for a typical system with P = 48, the number of channels is reduced from 2304 to 24 (or, e.g., 70 if we also include the first off-diagonals).

Fig. 4: Illustration of the WDAF concept and its relation to conventional algorithms (MC RLS: full correlation matrix R_{xx}; MC FDAF: temporal diagonalization, i.e., decomposition into temporal frequency bins with matrices S_{xx}^{(\nu)}; WDAF: additional spatial diagonalization into spatio-temporal frequency bins with matrices T_{xx}^{(\nu, k_\theta)}).

Fig. 5: Setup for the proposed AEC in the wave domain (representation \tilde{x}(\cdot) from the far end, transformation T_1 feeding the loudspeaker array, room, microphone array with angle \theta, transformation T_2 yielding \tilde{y}(\cdot), adaptive subfilters, and error components \tilde{e}(\cdot) sent via T_3 as representation to the far end).

4.2. Transformations and Adaptive Filtering

In this section we introduce the transformations T_1, T_2, T_3 shown in Fig. 5. Note that in general there are many possible spatial transformations, depending on the choice of the coordinate system. A first approach to obtain the desired decoupling would be simply to perform spatial Fourier transforms analogously to the temporal dimension. This corresponds to a decomposition into plane waves [10], which is known to be a flexible format for auralization purposes [18]. However, in this case we would need loudspeakers and microphones at each point of the listening area, which is not practicable. Therefore, plane wave decompositions taking into account the Kirchhoff-Helmholtz integrals are desirable. These transformations depend on the array geometries and have been derived for various configurations [18]. Circular arrays are known to show particularly good performance in wave field analysis [18] and lead to an efficient WDAF solution. A cylindrical coordinate system is used (see Fig. 5 for the definition of the angle \theta). For the realization, temporal and spatial sampling are implemented according to the desired spatial aliasing frequency. For transform T_1 we obtain [18] the following plane wave decomposition of the wave field to be emitted by the loudspeaker array with radius R:

\underline{\tilde{x}}^{(1)}(k_\theta, \omega) = \frac{j^{1-k_\theta}}{D_R(k_\theta, \omega)} \left\{ H_{k_\theta}^{(2)\prime}(kR) \, \underline{\tilde{p}}_x(k_\theta, \omega) - H_{k_\theta}^{(2)}(kR) \, j \rho c \, \underline{\tilde{v}}_{x,n}(k_\theta, \omega) \right\},   (7)

\underline{\tilde{x}}^{(2)}(k_\theta, \omega) = \frac{-j^{1+k_\theta}}{D_R(k_\theta, \omega)} \left\{ H_{k_\theta}^{(1)\prime}(kR) \, \underline{\tilde{p}}_x(k_\theta, \omega) - H_{k_\theta}^{(1)}(kR) \, j \rho c \, \underline{\tilde{v}}_{x,n}(k_\theta, \omega) \right\},   (8)

D_R(k_\theta, \omega) = H_{k_\theta}^{(2)\prime}(kR) \, H_{k_\theta}^{(1)}(kR) - H_{k_\theta}^{(1)\prime}(kR) \, H_{k_\theta}^{(2)}(kR).   (9)

H_{k_\theta}^{(\cdot)\prime} denotes the derivative of the respective Hankel function with the angular wave number k_\theta, and k = \omega/c as in Sect. 2. Underlined quantities with a tilde denote spatio-temporal frequency components, e.g.,

\underline{\tilde{p}}_x(k_\theta, \omega) = \frac{1}{2\pi} \int_0^{2\pi} \underline{p}_x(\theta, \omega) \, e^{-j k_\theta \theta} \, d\theta.   (10)

Analogously to (7) and (8), the plane wave components \underline{\tilde{y}}^{(1)}(k_\theta, \omega) and \underline{\tilde{y}}^{(2)}(k_\theta, \omega) of the recorded signals in the receiving room are obtained by transform T_2 with

\underline{\tilde{y}}^{(1)}(k_\theta, \omega) = \frac{j^{1-k_\theta}}{D_R(k_\theta, \omega)} \left\{ H_{k_\theta}^{(2)\prime}(kR) \, \underline{\tilde{p}}_y(k_\theta, \omega) - H_{k_\theta}^{(2)}(kR) \, j \rho c \, \underline{\tilde{v}}_{y,n}(k_\theta, \omega) \right\},   (11)

\underline{\tilde{y}}^{(2)}(k_\theta, \omega) = \frac{-j^{1+k_\theta}}{D_R(k_\theta, \omega)} \left\{ H_{k_\theta}^{(1)\prime}(kR) \, \underline{\tilde{p}}_y(k_\theta, \omega) - H_{k_\theta}^{(1)}(kR) \, j \rho c \, \underline{\tilde{v}}_{y,n}(k_\theta, \omega) \right\}   (12)

from \underline{\tilde{p}}_y(k_\theta, \omega) and \underline{\tilde{v}}_{y,n}(k_\theta, \omega) using the pressure and pressure-gradient microphone elements. On the loudspeaker side, an additional spatial extrapolation assuming free-field propagation of each loudspeaker signal to the microphone positions is necessary within T_1 prior to using (7) and (8) in order to obtain \underline{p}_x and \underline{v}_{x,n} of the incident waves at the microphone positions.

Adaptive filtering is then carried out for each spatio-temporal frequency bin. Note that conventional single-channel FDAF algorithms realizing FIR filtering can be applied directly to each subfilter in Fig. 5. These subfilters already contain the temporal part of the transformation into the spatio-temporal frequency domain. In practice, both the spatial and the temporal transformation are realized by DFTs. However, while in the temporal component we have to ensure linear convolutions by certain constraints within FDAF [13, 16], this is not necessary for the spatial (angular) component, as it is inherently circulant. Since the plane wave representation after the AEC is independent of the array geometries, the plane wave components \underline{\tilde{e}}^{(\cdot)}(k_\theta, \omega) = \underline{\tilde{y}}^{(\cdot)}(k_\theta, \omega) - \underline{\hat{\tilde{y}}}^{(\cdot)}(k_\theta, \omega) can either be sent to the far end directly, or they can be used to synthesize the total spatio-temporal wave field using an extrapolation T_3 of the wave field [10]

\underline{p}_e^{(1)}(r, \theta, \omega) = \int_0^{2\pi} \underline{\tilde{e}}^{(1)}(\theta', \omega) \, e^{-j k r \cos(\theta - \theta')} \, d\theta',

\underline{p}_e^{(2)}(r, \theta, \omega) = \int_0^{2\pi} \underline{\tilde{e}}^{(2)}(\theta', \omega) \, e^{j k r \cos(\theta - \theta')} \, d\theta',

which corresponds to inverse spatial Fourier transforms in polar coordinates. Due to the independence from the array geometries, the plane-wave representation is very suitable for direct transmission. Moreover, application of linear prediction techniques to this representation is attractive for source coding of acoustic wave fields.

5. SYSTEM INTEGRATION

As in Sect. 3.2, we now study how to integrate the proposed AEC into a multichannel acoustic human-machine interface. In contrast to the conventional structure in Fig. 3, the WDAF-based AEC can now be applied after auralization and ARC. Moreover, the concept of WDAF can also be applied efficiently to ARC, as shown in [6]. Fig. 6 shows the structure of the WDAF-based ARC. It can easily be verified that the adaptations of ARC and AEC in the integrated solution of Fig. 7 are then mutually fully separable from each other, so that there are no repercussions between them.

Fig. 6: WDAF-based ARC after [6] (transformation T_1 feeding the loudspeaker array, room, microphone array with angle \theta, transformation T_2, adaptive subfilters with error components \tilde{e}(\cdot), free-field transfer matrix, and extrapolation T_3).
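Because the wave-domain components decouple, each subfilter can be adapted with an ordinary single-channel algorithm, independently of all other bins. A toy sketch with one scalar complex NLMS weight per bin follows; the FDAF constraints for linear convolution discussed above are omitted for brevity, and all array shapes and names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def bin_lms(X, Y, mu=0.5, eps=1e-8):
    """Complex NLMS with one scalar weight per (decoupled) transform-domain bin.

    X, Y: (num_blocks, num_bins) loudspeaker and microphone components in the
    transform domain. Returns per-bin weights W and residuals E.
    """
    num_blocks, num_bins = X.shape
    W = np.zeros(num_bins, dtype=complex)
    E = np.zeros_like(X)
    for b in range(num_blocks):
        E[b] = Y[b] - W * X[b]                                    # residual per bin
        W += mu * np.conj(X[b]) * E[b] / (np.abs(X[b]) ** 2 + eps)  # bin-wise update
    return W, E

# Toy experiment: each bin couples through one hypothetical complex factor
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 8)) + 1j * rng.standard_normal((200, 8))
H = rng.standard_normal(8) + 1j * rng.standard_normal(8)  # true bin couplings
Y = H * X
W, E = bin_lms(X, Y)
```

Since every bin is a scalar identification problem, no large correlation matrix has to be inverted, which is the computational point of the decoupling.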
The new WDAF structure offers another interesting aspect: since the plane wave decomposition can be interpreted as a special set of spatial filters, the set of beamformers for acquisition (as in Fig. 3) is inherently integrated in a natural way. Thus, the spatial filter B and the transformation T_2 may simply be merged, and could be implemented as a masking in the \theta-domain. 'Voting', as in the conventional setup in Sect. 3.2, is obtained by additional time-varying weighting of the (already available) spatial components.
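For a uniformly sampled circular array, the angular transform (10) reduces to a DFT, which is also what makes the plane-wave components available as ready-made spatial filters. As a numerical sanity check (not from the paper), a unit plane wave sampled on a circle of radius r yields the Jacobi-Anger coefficients j^n J_n(kr) e^{-j n \theta_0}; the values of M, kr, and \theta_0 below are arbitrary choices.

```python
import numpy as np
from scipy.special import jv  # Bessel function of the first kind J_n

M = 64                       # microphones on the circle
kr = 5.0                     # circle radius in units of the wavenumber
theta0 = 0.7                 # plane-wave arrival direction (radians)
theta = 2 * np.pi * np.arange(M) / M

# Pressure of a unit plane wave e^{j kr cos(theta - theta0)} on the circle
p = np.exp(1j * kr * np.cos(theta - theta0))

# Eq. (10) as a DFT over the array angle: angular components p_tilde(k_theta)
p_tilde = np.fft.fft(p) / M

# Jacobi-Anger expansion predicts p_tilde[n] = j^n J_n(kr) e^{-j n theta0}
n = 3
predicted = (1j ** n) * jv(n, kr) * np.exp(-1j * n * theta0)
```

The spatial aliasing terms (orders n ± M) are negligible here because J_n(kr) decays rapidly for |n| > kr, which is the same mechanism that sets the aliasing frequency of the array.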
Fig. 7: Integrated system in the wave domain (transformations T_1, T_2, T_3, ARC subfilters, AEC filters, free-field transfer matrix, room with loudspeaker array and microphone array, and signals \tilde{x}(\cdot), \tilde{y}(\cdot), \tilde{e}(\cdot)).

6. EVALUATION OF THE AEC

We verify the proposed concept for the AEC application. For the simulations using measured data from a real room, we used two concentric circular arrays of 48 loudspeakers and 48 microphones, respectively (the recording was done by one rotating sound field microphone mounted on a stepper motor), as shown in Fig. 8. The radius of the loudspeaker array is 142 cm (spacing 19 cm), and the radius of the microphone array is 75 cm (spacing 9.8 cm). The reverberation time T_60 of the room is approximately 500 ms. A virtual point source (music signal) was placed by WFS at 3 m distance from the array center. All signals were downsampled to the aliasing frequency of the microphone array of f_al ≈ 1.7 kHz (as discussed in Sect. 2). For the adaptation of the parameters, wavenumber-selective FDAF algorithms (filter length 1024 each) with an overlap factor of 256 after [13] were applied.

Figure 9 shows the so-called echo return loss enhancement (ERLE), i.e., the attenuation of the echoes (note that the usual fluctuations in any ERLE curve result from the source signal statistics, as ERLE is a signal-dependent measure). While conventional AEC techniques cannot be applied in this case (48 × 48 = 2304 filters would have to be adapted, giving a total of 2,359,296 FIR filter taps for this extremely ill-conditioned least-squares problem), the WDAF approach allows stable adaptation and sufficient attenuation levels. The convergence speed is well comparable to that of conventional single-channel AECs. However, a high overlap factor for FDAF is necessary due to the low sampling rate [5] (note that efficient realizations exploiting high overlap factors exist [13]). In [5] it is shown that the performance of WDAF-based AEC is also relatively robust against time-varying scenarios in the transmission room. This robustness is a very important indicator of the quality of the estimated room parameters [9].
Fig. 8: Setup for measurements.
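The array geometry above fixes the spatial aliasing limit. Assuming the usual half-wavelength criterion f_al = c/(2d) for a transducer spacing d, and an assumed speed of sound of c = 343 m/s, the stated value of roughly 1.7 kHz follows directly from the 9.8 cm microphone spacing:

```python
def spatial_aliasing_frequency(spacing_m, c=343.0):
    # Half-wavelength (Nyquist) criterion for a transducer spacing d: f = c / (2 d)
    return c / (2.0 * spacing_m)

f_al = spatial_aliasing_frequency(0.098)  # 9.8 cm spacing -> 1750 Hz
```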
Fig. 9: ERLE convergence of the WDAF-based 48 × 48-channel AEC (ERLE in dB versus time in seconds).
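ERLE as plotted in Fig. 9 is simply the ratio of echo power to residual power after cancellation, expressed in dB. A block-wise estimator can be sketched as follows; the block length and signal names are arbitrary illustrative choices.

```python
import numpy as np

def erle_db(y, e, block=1024):
    """Block-wise echo return loss enhancement in dB.

    y: microphone (echo) signal, e: residual after echo cancellation.
    """
    n = (min(len(y), len(e)) // block) * block
    p_y = np.square(y[:n]).reshape(-1, block).mean(axis=1)  # echo power per block
    p_e = np.square(e[:n]).reshape(-1, block).mean(axis=1)  # residual power per block
    return 10.0 * np.log10(p_y / np.maximum(p_e, 1e-12))

# A residual at one tenth of the echo amplitude corresponds to 20 dB ERLE
y = np.ones(4096)
e = 0.1 * np.ones(4096)
erle = erle_db(y, e)
```

Because the estimate is formed block by block from the signals themselves, the fluctuations mentioned above appear whenever the source signal is nonstationary.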
7. CONCLUSIONS

A novel concept for efficient adaptive MIMO filtering in the wave domain has been proposed in the context of acoustic human-machine interfaces based on wave field analysis and synthesis using loudspeaker arrays and microphone arrays. The illustration by means of acoustic echo cancellation shows promising results.
8. REFERENCES

[1] A. J. Berkhout, D. de Vries, and P. Vogel, "Acoustic control by wave field synthesis," Journal of the Acoustical Society of America, vol. 93, no. 5, pp. 2764-2778, May 1993.
[2] M. S. Brandstein and D. B. Ward (eds.), Microphone Arrays, Springer, 2001.
[3] H. F. Silverman, W. R. Patterson, J. L. Flanagan, and D. Rabinkin, "A digital system for source location and sound capture by large microphone arrays," in Proc. IEEE ICASSP, 1997.
[4] H. Buchner, S. Spors, W. Kellermann, and R. Rabenstein, "Full-duplex communication systems with loudspeaker arrays and microphone arrays," in Proc. IEEE Int. Conference on Multimedia and Expo (ICME), Lausanne, Switzerland, Aug. 2002.
[5] H. Buchner, S. Spors, and W. Kellermann, "Wave-domain adaptive filtering: Acoustic echo cancellation for full-duplex systems based on wave-field synthesis," in Proc. IEEE ICASSP, 2004.
[6] S. Spors, H. Buchner, and R. Rabenstein, "An efficient approach to active listening room compensation for wave field synthesis," in 116th Convention of the Audio Engineering Society (AES), May 2004.
[7] M. M. Sondhi and D. R. Morgan, "Stereophonic acoustic echo cancellation: an overview of the fundamental problem," IEEE Signal Processing Letters, vol. 2, no. 8, pp. 148-151, Aug. 1995.
[8] S. Shimauchi and S. Makino, "Stereo projection echo canceller with true echo path estimation," in Proc. IEEE ICASSP, 1995, pp. 3059-3062.
[9] J. Benesty, D. R. Morgan, and M. M. Sondhi, "A better understanding and an improved solution to the specific problems of stereophonic acoustic echo cancellation," IEEE Trans. on Speech and Audio Processing, vol. 6, no. 2, March 1998.
[10] A. J. Berkhout, Applied Seismic Wave Theory, Elsevier, 1987.
[11] V. Fischer et al., "A software stereo acoustic echo canceler for Microsoft Windows," in Proc. IWAENC, Darmstadt, Germany, pp. 87-90, Sept. 2001.
[12] H. Buchner, W. Herbordt, and W. Kellermann, "An efficient combination of multichannel acoustic echo cancellation with a beamforming microphone array," in Proc. Int. Workshop on Hands-Free Speech Communication, Kyoto, Japan, pp. 55-58, April 2001.
[13] H. Buchner, J. Benesty, and W. Kellermann, "Multichannel frequency-domain adaptive algorithms with application to acoustic echo cancellation," in J. Benesty and Y. Huang (eds.), Adaptive Signal Processing: Application to Real-World Problems, Springer-Verlag, Berlin/Heidelberg, Jan. 2003.
[14] S. Haykin, Adaptive Filter Theory, 3rd ed., Prentice Hall, Englewood Cliffs, NJ, 1996.
[15] S. S. Narayan, A. M. Peterson, and M. J. Narasimha, "Transform domain LMS algorithm," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-31, no. 3, June 1983.
[16] J. J. Shynk, "Frequency-domain and multirate adaptive filtering," IEEE Signal Processing Magazine, pp. 14-37, Jan. 1992.
[17] J. Benesty, A. Gilloire, and Y. Grenier, "A frequency-domain stereophonic acoustic echo canceler exploiting the coherence between the channels," J. Acoust. Soc. Am., vol. 106, pp. L30-L35, Sept. 1999.
[18] E. Hulsebos, D. de Vries, and E. Bourdillat, "Improved microphone array configurations for auralization of sound fields by wave field synthesis," in 110th Convention of the Audio Engineering Society (AES), May 2001.