Journal of VLSI Signal Processing 47, 117–126, 2007
© 2007 Springer Science + Business Media, LLC. Manufactured in The United States. DOI: 10.1007/s11265-006-0036-3

A Systolic Architecture and Implementation of Feedback Network for Blind Source Separation

H. JEONG AND Y. KIM
Department of Electronic and Electrical Engineering, POSTECH, Pohang, Kyungbuk 790-784, South Korea

Received: 21 July 2004; Accepted: 1 December 2006
Abstract. Blind source separation of independent sources from their convolutive mixtures is a problem in many real-world multi-sensor applications. However, the existing BSS architectures are more often than not based upon software and thus not suitable for direct implementation in hardware. Existing software implementations of the feedback network algorithm are also not suitable for real-time operation. In this paper, we present a parallel algorithm and architecture for the hardware implementation of blind source separation. The algorithm is based on the feedback network and is highly suited for parallel processing. The implementation is designed to operate in real time on speech signal sequences. It is systolic and easily scalable by simply adding and connecting chips or modules. To verify the proposed architecture, we have also designed and implemented it as a hardware prototype with Xilinx FPGAs running at 33 MHz.

Keywords: blind source separation (BSS), convolutive mixtures, VLSI, FPGA

1. Introduction
Blind source separation (BSS) is a basic and important problem in signal processing. BSS denotes observing mixtures of independent sources and, using only these mixture signals and nothing else, recovering the original signals. In the simplest form of BSS, the mixtures are assumed to be linear instantaneous mixtures of the sources. The problem was formalized by Jutten and Herault [1] in the 1980s, and many models for this problem have since been proposed [7–13]. In this paper, we adopt K. Torkkola's feedback network algorithm [2, 3], which is capable of coping with convolutive mixtures, and T. Nomura's extended Herault–Jutten method [4] as the learning algorithm. We then provide the design and implementation of a linear systolic architecture for an efficient BSS method using these algorithms. The architecture consists of a forward processor and an update processor. The chip contains processing elements connected in a systolic architecture and thus can easily be scaled with more identical chips. We have designed the BSS chip using the Very High Speed Integrated Circuit Hardware Description Language (VHDL) and fabricated it on a field programmable gate array (FPGA). Although an FPGA does not offer an optimized hardware implementation when compared to an application specific integrated circuit (ASIC), it allows short development time and enables verification of algorithms in hardware at low cost.

The organization of this paper is as follows. Section 2 derives the algorithm for the feedback network, and Section 3 presents the detailed architecture of the feedback network. The simulation results and conclusions are presented in Sections 4 and 5, respectively.
2. Background of the BSS Algorithm

In this section, we assume that the observable signals are convolutively mixed, and we present K. Torkkola's feedback network algorithm and T. Nomura's extended Herault–Jutten method. This method was implemented in software by Choi and Cichocki [5].

2.1. Mixing Model

Real speech signals present one example where the instantaneous mixing assumption does not hold. The acoustic environment imposes a different impulse response between each source and microphone pair. This kind of situation can be modeled as convolved mixtures. Assume $n$ statistically independent speech sources $\mathbf{s}(t) = [s_1(t), s_2(t), \ldots, s_n(t)]^T$. These sources are convolved and mixed in a linear medium, leading to $m$ signals measured at an array of microphones, $\mathbf{x}(t) = [x_1(t), x_2(t), \ldots, x_m(t)]^T$,

$$x_i(t) = \sum_{j=1}^{n} \sum_{p \ge 0} h_{ij,p}(t)\, s_j(t-p), \quad \text{for } i = 1, 2, \ldots, n, \qquad (1)$$

where $\{h_{ij,p}\}$ is the room impulse response between the $j$th source and the $i$th microphone, and $x_i(t)$ is the signal present at the $i$th microphone at time instant $t$. Two microphones provide measured signals $x_1(t)$ and $x_2(t)$, which are unknown convolutive mixtures of two unknown source signals $s_1(t)$ and $s_2(t)$; in the Z-domain:

$$X_1(z) = S_1(z) + H_{12}(z) S_2(z), \qquad X_2(z) = H_{21}(z) S_1(z) + S_2(z). \qquad (2)$$

2.2. Algorithm of Feedback Network

The feedback network algorithm has already been considered in [2, 6]. Here we describe this algorithm. The feedback network's $i$th output $y_i(t)$ is described by

$$y_i(t) = x_i(t) + \sum_{p=0}^{L} \sum_{j \neq i} w_{ij,p}(t)\, y_j(t-p), \quad \text{for } i, j = 1, 2, \ldots, n, \qquad (3)$$

where $w_{ij,p}$ is the weight between $y_i(t)$ and $y_j(t-p)$. In compact form, the output vector $\mathbf{y}(t)$ is

$$\mathbf{y}(t) = \mathbf{x}(t) + \sum_{p=0}^{L} \mathbf{W}_p(t)\, \mathbf{y}(t-p)
= [\mathbf{I} - \mathbf{W}_0(t)]^{-1} \Big\{ \mathbf{x}(t) + \sum_{p=1}^{L} \mathbf{W}_p(t)\, \mathbf{y}(t-p) \Big\}. \qquad (4)$$

This model can be expressed as follows in the Z-domain (see Fig. 1):

$$Y_1(z) = X_1(z) + W_{12}(z) Y_2(z), \qquad Y_2(z) = X_2(z) + W_{21}(z) Y_1(z). \qquad (5)$$

Figure 1. The feedback network for the separation of convolutively mixed sources.

If the output of the feedback network is perfectly separated ($Y(z) = S(z)$), then we can easily derive the weights from Eqs. (2) and (5):

$$W_{12}(z) = -H_{12}(z), \qquad W_{21}(z) = -H_{21}(z). \qquad (6)$$

The existence of this solution requires $H_{12}(z)$ and $H_{21}(z)$ to have stable inverses. If this is not fulfilled, the network is unable to achieve separation. The learning algorithm for the weights $\mathbf{W}$ in the case of instantaneous mixtures was formalized as the Jutten–Herault algorithm [1]. In [4], this learning rule was extended and a model was proposed for blind separation where the observable signals are convolutively mixed. Here we describe this algorithm. The learning rule for updating $\mathbf{W}$ has the form

$$\mathbf{W}_p(t) = \mathbf{W}_p(t-1) - \eta_t\, f(\mathbf{y}(t))\, g(\mathbf{y}^T(t-p)), \qquad (7)$$

where $\eta_t > 0$ is the learning rate. One can see that when the learning algorithm converges, the correlation between $f(y_i(t))$ and $g(y_j(t-p))$ vanishes. $f(\cdot)$ and $g(\cdot)$ are odd symmetric functions, and the weights are updated by the gradient descent method. Here $f(\cdot)$ is taken to be the signum function and $g(\cdot)$ the first-order linear (identity) function, because these are easy to implement in hardware.
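As an illustration of Eqs. (3), (4), and (7), the following minimal Python sketch separates a synthetic two-channel convolutive mixture with the feedback network, using $f(\cdot) = \mathrm{sign}(\cdot)$ and $g(\cdot)$ equal to the identity, as in the text. The filter shapes, learning rate, and random sources are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- synthetic convolutive mixture (illustrative values, not from the paper) ---
T, L = 20000, 8                          # samples and feedback filter length
s = rng.laplace(size=(2, T))             # two independent super-Gaussian sources
h12 = 0.3 * rng.standard_normal(L + 1) / np.arange(1, L + 2)  # cross filter, Eq. (2)
h21 = 0.3 * rng.standard_normal(L + 1) / np.arange(1, L + 2)
x1 = s[0] + np.convolve(s[1], h12)[:T]   # Eq. (1)/(2): x1 = s1 + h12 * s2
x2 = s[1] + np.convolve(s[0], h21)[:T]   #              x2 = s2 + h21 * s1

# --- feedback network, Eqs. (3)-(4), learned with Eq. (7): f = sign, g = identity ---
w12 = np.zeros(L + 1)                    # w_{12,p}, p = 0..L
w21 = np.zeros(L + 1)
y = np.zeros((2, T))
eta = 1e-4                               # learning rate (assumed)

for t in range(T):
    # delayed feedback terms (p = 1..L)
    fb1 = sum(w12[p] * y[1, t - p] for p in range(1, min(L, t) + 1))
    fb2 = sum(w21[p] * y[0, t - p] for p in range(1, min(L, t) + 1))
    a1, a2 = x1[t] + fb1, x2[t] + fb2
    det = 1.0 - w12[0] * w21[0]          # resolve the p = 0 coupling, Eq. (4)
    y[0, t] = (a1 + w12[0] * a2) / det
    y[1, t] = (a2 + w21[0] * a1) / det
    for p in range(0, min(L, t) + 1):    # weight update, element-wise form of Eq. (7)
        w12[p] -= eta * np.sign(y[0, t]) * y[1, t - p]
        w21[p] -= eta * np.sign(y[1, t]) * y[0, t - p]

print("w12 estimate:", np.round(w12[:4], 3))
print("-h12 target :", np.round(-h12[:4], 3))
```

After enough samples the learned cross weights should drift toward the ideal solution of Eq. (6), i.e., $w_{12,p} \approx -h_{12,p}$ and $w_{21,p} \approx -h_{21,p}$, for this toy setup.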
2.3. The Modified Algorithm

We assume that speech signals are quasi-stationary. We can then modify the equations of the preceding sections. The $r$th frame output $\mathbf{y}^{(r)}(t)$ is obtained using the $r$th frame input $\mathbf{x}^{(r)}(t)$ and the $(r-1)$th frame weights $\mathbf{W}^{(r-1)}(t)$. Likewise, the $r$th frame weights $\mathbf{W}^{(r)}(t)$ are obtained using the $r$th frame output $\mathbf{y}^{(r)}(t)$ (see Fig. 2). Using Eqs. (4) and (7), we can write the modified forward process equation and update equation as

$$\mathbf{y}^{(r)}(t) = \mathbf{x}^{(r)}(t) + \sum_{p=0}^{L} \mathbf{W}_p^{(r-1)}(t)\, \mathbf{y}^{(r)}(t-p)
= [\mathbf{I} - \mathbf{W}_0^{(r-1)}(t)]^{-1} \Big\{ \mathbf{x}^{(r)}(t) + \sum_{p=1}^{L} \mathbf{W}_p^{(r-1)}(t)\, \mathbf{y}^{(r)}(t-p) \Big\}, \qquad (8)$$

$$\mathbf{W}_p^{(r)}(t) = \mathbf{W}_p^{(r-1)}(t-1) - \eta_t\, f(\mathbf{y}^{(r)}(t))\, g(\mathbf{y}^{(r)}(t-p))^T, \qquad (9)$$

where $r$ is the frame number.

Figure 2. Timing diagram of the modified algorithm.

3. Systolic Architecture for Feedback Network

The systolic algorithm can easily be transformed into hardware. The overall architecture of the forward and update processors is shown first, followed by the detailed internal structure of each processing element (PE). Finally, the overall system is analyzed in terms of its computational complexity.

3.1. Systolic Architecture for Forward Processor

In this section, we introduce the architecture of the forward processor of the feedback network. The advantage of this architecture is its spatial efficiency, which accommodates more time delays within a given limited space. Let us define the cost of the forward processor $f_{i,p}(t)$ at time index $t$ as

$$f_{i,0}(t) = 0, \qquad f_{i,p}(t) = f_{i,p-1}(t) + \big( w_{ij,p}\, y_j(t-p) + w_{ij,0}\, w_{ji,p}\, y_i(t-p) \big) \quad \text{for } p = 1, 2, \ldots, L. \qquad (10)$$

Combining Eqs. (4) and (10), we can rewrite the outputs $y_1(t)$ and $y_2(t)$ as

$$y_1(t) = (1 - w_{12,0}\, w_{21,0})^{-1} \big\{ (x_1(t) + w_{12,0}\, x_2(t)) + f_{1,L}(t) \big\}, \qquad
y_2(t) = (1 - w_{21,0}\, w_{12,0})^{-1} \big\{ (x_2(t) + w_{21,0}\, x_1(t)) + f_{2,L}(t) \big\}. \qquad (11)$$
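To make the recursion concrete, the following hedged Python sketch evaluates the PE costs of Eq. (10) and the outputs of Eq. (11) for $n = 2$, and checks them against a direct evaluation of Eqs. (3)/(4). The weight values, delay length, and sample history are arbitrary test data, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 4                                     # delay length (arbitrary for the check)
w12 = 0.1 * rng.standard_normal(L + 1)    # w_{12,p}, p = 0..L (test data)
w21 = 0.1 * rng.standard_normal(L + 1)    # w_{21,p}
x = rng.standard_normal(2)                # x1(t), x2(t)
y_hist = rng.standard_normal((2, L + 1))  # column p holds y(t-p), p = 1..L

def forward_systolic(x, y_hist, w12, w21, L):
    """PE cost recursion, Eq. (10), followed by the output stage, Eq. (11)."""
    f1 = f2 = 0.0                         # f_{1,0} = f_{2,0} = 0
    for p in range(1, L + 1):             # one PE per delay p
        f1 += w12[p] * y_hist[1, p] + w12[0] * w21[p] * y_hist[0, p]
        f2 += w21[p] * y_hist[0, p] + w21[0] * w12[p] * y_hist[1, p]
    det = 1.0 - w12[0] * w21[0]
    y1 = ((x[0] + w12[0] * x[1]) + f1) / det
    y2 = ((x[1] + w21[0] * x[0]) + f2) / det
    return y1, y2

def forward_direct(x, y_hist, w12, w21, L):
    """Direct solution of Eq. (3) for n = 2 (reference)."""
    a1 = x[0] + sum(w12[p] * y_hist[1, p] for p in range(1, L + 1))
    a2 = x[1] + sum(w21[p] * y_hist[0, p] for p in range(1, L + 1))
    det = 1.0 - w12[0] * w21[0]
    return (a1 + w12[0] * a2) / det, (a2 + w21[0] * a1) / det

print(np.allclose(forward_systolic(x, y_hist, w12, w21, L),
                  forward_direct(x, y_hist, w12, w21, L)))   # True
```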
Figure 3 shows the systolic array architecture of the forward processor. This architecture consists of $L+1$ processing elements. For $p = 1, 2, \ldots, L$, the $p$th PE receives the inputs $y_1(t-p)$, $y_2(t-p)$, and $f_{i,p-1}(t)$ from the $(p-1)$th PE, together with the weights $w_{12,p}$ and $w_{21,p}$. It then updates the PE cost $f_{i,p}(t)$ according to Eq. (10). The $(L+1)$th PE calculates $y_1(t)$ and $y_2(t)$ according to Eq. (11). In other words, the PE cost is obtained by recursive computation. This results in a very scalable architecture that is easy to implement, and the computational complexity decreases considerably; the computational complexity of this architecture is analyzed in Section 3.3.

Figure 3. Linear systolic array for the forward processor.

The remaining task is to describe the internal structure of the processing element. Figure 4 shows the processing element. It consists of two adders, two registers, and two multipliers. The $p$th PE receives the inputs $y_1(t)$ and $y_2(t)$; after passing through the registers, they are multiplied by the weights, and the PE then updates its cost $f_{i,p}(t)$.

Figure 4. The internal structure of the processing element.
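The register-and-multiply-accumulate datapath just described can be modeled behaviorally as below. This is a hedged sketch of the array of Fig. 3 with the PE of Fig. 4: the field and function names are chosen here for illustration, and the cost path is rippled combinationally for clarity, so the cycle-level pipelining of the real array is not reproduced.

```python
from dataclasses import dataclass

@dataclass
class ForwardPE:
    """Weights and delay registers of one forward-processor stage (sketch of Fig. 4)."""
    w12_p: float                 # w_{12,p}
    w21_p: float                 # w_{21,p}
    y1_reg: float = 0.0          # registered y1(t-p)
    y2_reg: float = 0.0          # registered y2(t-p)

def array_step(pes, x1, x2, w12_0, w21_0):
    """One sample through the forward array: MAC chain (Eq. 10), then output stage (Eq. 11)."""
    f1 = f2 = 0.0
    for pe in pes:                                   # stage p holds y(t-p) in its registers
        f1 += pe.w12_p * pe.y2_reg + w12_0 * pe.w21_p * pe.y1_reg
        f2 += pe.w21_p * pe.y1_reg + w21_0 * pe.w12_p * pe.y2_reg
    det = 1.0 - w12_0 * w21_0
    y1 = ((x1 + w12_0 * x2) + f1) / det              # Eq. (11)
    y2 = ((x2 + w21_0 * x1) + f2) / det
    prev = (y1, y2)                                  # clock edge: shift new outputs down the chain
    for pe in pes:
        held = (pe.y1_reg, pe.y2_reg)
        pe.y1_reg, pe.y2_reg = prev
        prev = held
    return y1, y2
```

Stepping the array with successive input samples reproduces Eq. (4) for $n = 2$ once the delay chain has filled.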
3.2. Systolic Architecture for the Update Processor

This section presents the architecture that updates the weights. This architecture also consists of processing elements, which operate in the same manner; we refer to a processing element used in the update processor as an update element (UE). Assuming that the number of outputs is two ($n = 2$), we can rewrite Eq. (7) as

$$w_{12,p}(t) = w_{12,p}(t-1) - \eta_t\, f(y_1(t))\, y_2(t-p), \qquad
w_{21,p}(t) = w_{21,p}(t-1) - \eta_t\, f(y_2(t))\, y_1(t-p). \qquad (12)$$

If the number of PEs is $L$, the update architecture consists of $2L+1$ UEs, as shown in Fig. 5. We denote the forward process clock by $\mathrm{CLK}_f$; the update clock is $\mathrm{CLK}_u = 2\,\mathrm{CLK}_f$. At even times only the even-numbered UEs are active and the odd-numbered UEs are inactive; at odd times the roles are reversed. The UE equation has the form

$$u_p(t) = u_p(t-1) - \eta_t\, \varphi\big(y_1(\lfloor \tfrac{1}{2}(t-p-L) \rfloor)\big)\, y_2\big(\lfloor \tfrac{1}{2}(t+p-L) \rfloor\big), \quad p \ge 0,$$
$$u_p(t) = u_p(t-1) - \eta_t\, \varphi\big(y_2(\lfloor \tfrac{1}{2}(t-p-L) \rfloor)\big)\, y_1\big(\lfloor \tfrac{1}{2}(t+p-L) \rfloor\big), \quad p < 0, \qquad (13)$$

where $\lfloor x \rfloor$ denotes the largest integer not exceeding $x$.

Figure 5. Linear systolic array for the update processor.
Figure 6 shows the internal structure of the UE of the feedback network. The $p$th UE receives two inputs, $y_a(t)$ and $y_b(t)$; one input is passed through $f(\cdot)$ to give $f(y_a(t))$, and the UE then updates its cost $u_p(t)$. In this architecture, $f(\cdot)$ is the signum function, $f(y_a(t)) = \mathrm{sign}(y_a(t))$.

Figure 6. The internal structure of the update element.
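The even/odd scheduling of Eq. (13) can be sketched as the following hedged Python model, in which UEs indexed $p = -L, \ldots, L$ are updated on alternating update-clock ticks and $\varphi(\cdot)$ is the signum function, as in the text. The frame indexing and boundary handling are assumptions made for illustration; the hardware's exact timing may differ.

```python
import numpy as np

def run_update_processor(y1, y2, L, eta):
    """Sketch of the update-processor schedule of Eq. (13).

    y1, y2 : forward-processor outputs for one frame (1-D arrays of equal length)
    Returns the UE costs u_p, p = -L..L, stored at index p + L.
    """
    T = len(y1)
    u = np.zeros(2 * L + 1)                 # UE costs u_p, p = -L..L
    phi = np.sign                           # phi(.) = signum, as in the text

    for t in range(2 * T):                  # update clock: CLK_u = 2 * CLK_f
        for p in range(-L, L + 1):
            if (t % 2) != (p % 2):          # even t -> even UEs, odd t -> odd UEs
                continue
            i = (t - p - L) // 2            # floor((t - p - L) / 2)
            j = (t + p - L) // 2            # floor((t + p - L) / 2)
            if not (0 <= i < T and 0 <= j < T):
                continue                    # sample not yet available / outside frame
            if p >= 0:
                u[p + L] -= eta * phi(y1[i]) * y2[j]
            else:
                u[p + L] -= eta * phi(y2[i]) * y1[j]
    return u
```

Per the termination step of Table 1, the resulting $u_p$ values are then mapped back onto $w_{12,p}$ and $w_{21,p}$ before the next frame.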
3.3. Overall Architecture and Computational Complexity

The system configuration is shown in Fig. 7. We observe a set of signals from a two-input, two-output nonlinear dynamic system whose input signals are generated from two independent sources. Minimizing the spatial dependence among the input signals eliminates the cross-talk present in convolutively mixed signals.

Figure 7. Overall block diagram of the feedback network architecture.

Table 1 shows the overall feedback network algorithm. The system consists of initialization, forward processing, update processing, and weight update. The initialization step establishes the initial state that must be guaranteed whenever the system starts: all weights and all processing-element costs are initialized to zero. After the forward processing and update processing steps, the weights computed by the update architecture are reloaded into the forward processing architecture during the weight update step.

Table 1. Overall algorithm of the feedback network architecture.

Forward processing
  Initialization: the costs of the processing elements are initialized to zero: $f_{1,p} = 0$, $f_{2,p} = 0$.
  Recursion ($n$ clocks), for $p = 1, 2, \ldots, L$:
    $f_{1,p} = f_{1,p-1} + (w_{12,p}\, y_2(t-p) + w_{12,0}\, w_{21,p}\, y_1(t-p))$
    $f_{2,p} = f_{2,p-1} + (w_{21,p}\, y_1(t-p) + w_{21,0}\, w_{12,p}\, y_2(t-p))$
  For $p = L$:
    $y_1(t) = (1 - w_{12,0} w_{21,0})^{-1} \{(x_1(t) + w_{12,0}\, x_2(t)) + f_{1,L}\}$
    $y_2(t) = (1 - w_{21,0} w_{12,0})^{-1} \{(x_2(t) + w_{21,0}\, x_1(t)) + f_{2,L}\}$

Update processing
  Initialization: all weights are initialized to zero: $W_p = 0$.
  Calculation ($2L$ clocks); if $t$ is even, for even $p$; if $t$ is odd, for odd $p$:
    $u_p(t) = u_p(t-1) - \eta_t\, \varphi(y_1(\lfloor \tfrac{1}{2}(t-p-L) \rfloor))\, y_2(\lfloor \tfrac{1}{2}(t+p-L) \rfloor), \quad p \ge 0$
    $u_p(t) = u_p(t-1) - \eta_t\, \varphi(y_2(\lfloor \tfrac{1}{2}(t-p-L) \rfloor))\, y_1(\lfloor \tfrac{1}{2}(t+p-L) \rfloor), \quad p < 0$
  Termination:
    for $p = L, \ldots, 2, 1$: $w_{12,p} = u_p(t)$
    for $p = 0$: $w_{12,0} = w_{21,0} = u_0(t)$
    for $p = 1, 2, \ldots, L$: $w_{21,p} = u_p$

The forward process of the feedback network algorithm uses two linear arrays of $(L+1)$ PEs. $4L$ buffers are required for the outputs $y(t)$ and $2L$ buffers for the PE costs. Since each output $y(t)$ is $B_y$ bits long, a memory of $O(L B_y)$ bits is needed. Each PE must store a partial cost of $B_p$ bits, so an additional $O(L B_p)$ bits are needed. As a result, a total of $O(L B_y + L B_p)$ bits is sufficient. The update part of the algorithm uses a linear array of $(2L+1)$ UEs. $4L+2$ buffers are required for the outputs $y(t)$ and $2L+1$ buffers for the UE costs. If each UE stores a partial cost of $B_u$ bits, a total of $O(L B_y + L B_u)$ bits is sufficient.
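As a quick sanity check of the storage estimate above, the following snippet totals the buffer bits for example parameter values; the widths $B_y$, $B_p$, and $B_u$ below are illustrative choices, not figures quoted in the paper.

```python
def storage_bits(L, B_y, B_p, B_u):
    """Buffer-bit totals implied by the counts in Section 3.3 (illustrative)."""
    forward_y_bits = 4 * L * B_y            # 4L output buffers of B_y bits each
    forward_cost_bits = 2 * L * B_p         # 2L PE cost buffers of B_p bits each
    update_y_bits = (4 * L + 2) * B_y       # 4L + 2 output buffers
    update_cost_bits = (2 * L + 1) * B_u    # 2L + 1 UE cost buffers
    return forward_y_bits + forward_cost_bits + update_y_bits + update_cost_bits

# Example: L = 100 delays with 16-bit samples and 24-bit partial costs (assumed widths)
print(storage_bits(L=100, B_y=16, B_p=24, B_u=24))  # total bits, i.e., O(L*B_y + L*B_p + L*B_u)
```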
4. System Implementation and Experimental Results

4.1. FPGA Hardware Design

The system is designed for an FPGA (Xilinx Virtex-II XC2V8000) running at 33 MHz. The entire chip is described in VHDL and has been fully tested and verified. The following experimental results are all based upon VHDL simulation; the chip has been simulated extensively using Cadence simulation tools. It is designed to interface with the PLX9656 PCI chip, which has a maximum clock frequency of 66 MHz. The PLX9656 PCI chip was used in a PC with a 3.06 GHz Pentium 4 processor.
4.2. Experimental Results

In the feedback network we used a delay length of $L = 100$ and a learning rate of $\eta_t = 10^{-7}$. As a performance measure, Choi and Cichocki [5] have used the signal-to-noise ratio improvement (SNRI), defined as
$$\mathrm{SNRI}_i = 10 \log_{10} \frac{E\{(x_i(k) - s_i(k))^2\}}{E\{(y_i(k) - s_i(k))^2\}}. \qquad (14)$$
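For reference, Eq. (14) can be computed directly from the source, mixture, and output signals. The following hedged helper assumes the signals are already time-aligned and consistently scaled, which the paper does not discuss.

```python
import numpy as np

def snri_db(s, x, y):
    """Signal-to-noise ratio improvement of Eq. (14), in dB.

    s, x, y : 1-D arrays holding s_i(k), x_i(k), y_i(k) for one channel i.
    Assumes y is aligned with s (no residual delay) and similarly scaled.
    """
    mix_err = np.mean((x - s) ** 2)      # E{(x_i - s_i)^2}
    out_err = np.mean((y - s) ** 2)      # E{(y_i - s_i)^2}
    return 10.0 * np.log10(mix_err / out_err)
```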
Two different digitized speech signals $\mathbf{s}(t)$, shown in Fig. 8, were used in this simulation. The received signals $\mathbf{x}(t)$ collected from the two microphones and the recovered signals $\mathbf{y}(t)$ produced by the feedback network are shown in Figs. 9 and 10. In this case we obtained $\mathrm{SNRI}_1 = 2.7061$ dB and $\mathrm{SNRI}_2 = 4.9678$ dB.

Figure 8. Two original speech signals.

Figure 9. Two mixtures of speech signals.

Figure 10. Two recovered signals using the feedback network.

We have also evaluated the performance of the feedback network on noisy mixtures. Table 2 shows the experimental results of SNRI with noisy mixtures. The system shows good performance only at high SNR (above 0 dB).

Table 2. The experimental results of SNRI with noisy mixtures. SNR (signal-to-noise ratio) = 10 log10(s/n).

            clean     10 dB     5 dB      0 dB      -5 dB     -10 dB
SNRI1 (dB)  2.7061    2.6225    2.4396    2.3694    1.9594    0.1486
SNRI2 (dB)  4.9678    4.9605    4.7541    4.6506    3.9757    3.2470

Figure 11 shows the convergence of the learning for the feedback network. The results converged when the sliding window index $r$ reached about 100.

Figure 11. The convergence of SNRI ("o": SNRI1, "+": SNRI2).
5. Conclusion

In this paper, a systolic algorithm and architecture for the feedback network have been derived and tested with VHDL code simulation. This scheme is fast and reliable since the architecture is highly regular, and the processing can be done in real time. A full-scale system can easily be obtained by increasing the number of PEs and UEs. Our system has two inputs, but we will extend it to N inputs. Because the algorithms used for hardware and software implementations differ significantly, it is difficult, if not impossible, to migrate software implementations directly to hardware; the hardware needs different algorithms for the same application to meet the required performance and quality. We have presented a fast and efficient VLSI architecture and implementation of BSS. The architecture has the form of a linear systolic array using simple PEs that are connected only to neighboring PEs, and it can therefore easily be scaled with more identical chips.

Acknowledgements
The authors would like to thank the MOE for its financial support toward the ECE Division at POSTECH through its BK21 program.
References

1. C. Jutten and J. Herault, "Blind Separation of Sources, Part I: An Adaptive Algorithm Based on Neuromimetic Architecture," Signal Process., vol. 24, 1991, pp. 1–10.
2. K. Torkkola, "Blind Separation of Convolved Sources Based on Information Maximization," Proc. IEEE Workshop on Neural Networks for Signal Processing, 1996, pp. 423–432.
3. H.-L. Nguyen Thi and C. Jutten, "Blind Source Separation for Convolutive Mixtures," Signal Process., vol. 45, 1995, pp. 209–229.
4. T. Nomura, M. Eguchi, H. Niwamoto, H. Kokubo, and M. Miyamoto, "An Extension of the Herault–Jutten Network to Signals Including Delays for Blind Separation," IEEE Neural Networks for Signal Processing, vol. VI, 1996, pp. 443–452.
5. S. Choi and A. Cichocki, "Adaptive Blind Separation of Speech Signals: Cocktail Party Problem," Proc. Int. Conf. Speech Processing (ICSP '97), Aug. 1997, pp. 617–622.
6. N. Charkani and Y. Deville, "Self-adaptive Separation of Convolutively Mixed Signals with a Recursive Structure—Part I: Stability Analysis and Optimization of Asymptotic Behaviour," Signal Process., vol. 73, no. 3, 1999, pp. 255–266.
7. K. Torkkola, "Blind Separation of Delayed Sources Based on Information Maximization," Proc. ICASSP, Atlanta, GA, May 1996, pp. 7–10.
8. K. Torkkola, "Blind Source Separation for Audio Signals—Are We There Yet?," IEEE Workshop on Independent Component Analysis and Blind Signal Separation, Aussois, France, Jan. 1999.
9. K. C. Yen and Y. Zhao, "Adaptive Co-channel Speech Separation and Recognition," IEEE Trans. Speech Audio Process., vol. 7, no. 2, 1999, pp. 138–151.
10. A. Cichocki, S. Amari, and J. Cao, "Blind Separation of Delayed and Convolved Signals with Self-adaptive Learning Rate," in NOLTA, Kochi, Japan, 1996, pp. 229–232.
11. S. Amari, S. C. Douglas, A. Cichocki, and H. H. Yang, "Novel On-line Adaptive Learning Algorithms for Blind Deconvolution Using the Natural Gradient Approach," in IFAC Symposium on System Identification, Fukuoka, Japan, 1997.
12. T. Lee, A. Bell, and R. Lambert, "Blind Separation of Delayed and Convolved Sources," in Advances in Neural Information Processing Systems, MIT Press, 1997.
13. N. Charkani and Y. Deville, "Optimization of the Asymptotic Performance of Time-domain Convolutive Source Separation Algorithms," in ESANN, Bruges, Belgium, 1997, pp. 273–278.
H. Jeong received the BS degree from the EE Department at Seoul National University in 1977 and the MS degree from the EE Department at KAIST in 1979. During 1984–1988 he received the SM, EE, and PhD degrees from the EECS Department at MIT. From 1979 to 1982 he taught at Kyungbuk National University. Since 1988 he has been with POSTECH, where he is now an associate professor. During 1996–1997 he worked at Bell Labs, Murray Hill, and during 2005–2006 he was on leave at USC as a visiting professor. His major research area is multimedia signal processing.
Y. Kim received the BS degree from the MSE and EE Departments at POSTECH in 2000 and the MS degree from the EE Department at POSTECH in 2002. Since 2002 he has been working toward the PhD degree at POSTECH. His current research interests include speech signal processing.