IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 23, NO. 4, APRIL 2015
Separation of Singing Voice Using Nonnegative Matrix Partial Co-Factorization for Singer Identification
Ying Hu and Guizhong Liu, Member, IEEE
Abstract—In order to improve the performance of singer identification, we propose a system that separates the singing voice from the music accompaniment in monaural recordings. Our system consists of two key stages. The first stage exploits nonnegative matrix partial co-factorization (NMPCF), a joint matrix decomposition that integrates prior knowledge of singing voice and pure accompaniment, to separate the mixture signal into a singing-voice portion and an accompaniment portion. In the second stage, the pitches of the singing voice are first estimated from the separated singing voice obtained in the first stage, and the harmonic components of the singing voice are then identified. Within a frame, the identified harmonic components are regarded as reliable and all other frequency components as unreliable, so the spectrum is incomplete. From these harmonic components, the complete spectra of the singing voice are reconstructed by a missing-feature method, spectrum reconstruction, yielding a refined signal containing a cleaner singing voice. Experimental results demonstrate that, from the point of view of source separation, the singing-voice refinement further improves on the singing-voice separation obtained by NMPCF alone, whereas from the point of view of singer identification, the singing voice separated by NMPCF is more appropriate than the refined singing voice.

Index Terms—Nonnegative matrix partial co-factorization (NMPCF), singer identification, singing voice separation, spectrum reconstruction.
I. INTRODUCTION
THE development of singer identification enables the effective management and exploration of large amounts of music data based on singer similarity. With this technology, songs performed by a particular singer can be automatically clustered for easy management or searching. Several studies in the field of singer identification extract features directly from the songs [14], [24]. In popular music, the singing voice is often interwoven with the accompaniment.
Manuscript received December 17, 2013; revised April 09, 2014; accepted January 20, 2015. Date of publication January 26, 2015; date of current version March 06, 2015. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. DeLiang Wang. The authors are with the School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASLP.2015.2396681
Those methods, based on features extracted directly from the accompanied vocal segments, therefore find it difficult to achieve good performance when the accompaniment is strong or the singing voice is weak. Some researchers began to devote more care to reducing or removing the negative influence of the instrumental accompaniment. Fujihara et al. extracted the harmonics of a singing voice based on the estimated fundamental frequency (F0) [8]. The singing voice, with accompaniment eliminated, was then re-synthesized for use in feature extraction. Two problems still exist in their system. First, the estimated F0s of singing are rarely very accurate because of the influence of the accompaniment. Second, even if the F0 is correct, the extracted harmonics of the singing voice are not completely pure, because the harmonic components of the singing voice may be overlapped by pitched instruments. Generally speaking, songs are accompanied by percussive and/or pitched instruments. Percussive instruments occupy the whole frequency spectrum, while pitched instruments often share frequency components with the singing voice, because the accompaniment more often than not harmonizes with the singing voice [27]. Tsai et al. [26] proposed a singer identification system that characterizes a singing voice by transforming the cepstrum of an accompanied singing voice into a solo singing voice cepstrum. It was assumed that when a solo singing voice is superimposed with accompaniment, the voice cepstrum changes in a certain fashion that can be learned from a large, artificially mixed music database. The transformed cepstral coefficients are regarded as features that capture the singer's characteristics. Hu and Liu [10] exploited computational auditory scene analysis (CASA) to segregate singing-voice units for each time frame. Those segregated singing-voice units were regarded as reliable components, and two missing-feature methods were then used together with those reliable components to perform singer identification. The reconstruction method was exploited to obtain a complete singing spectrum, which was further used to extract features for singer identification, and the marginalization method was exploited to perform the identification task based solely on the reliable components. In contrast to the method in [26], their methods show good performance, especially the spectrum reconstruction method. However, when the energy ratio of singing to accompaniment is low, i.e., at low signal-to-accompaniment ratios (SARs),
the system performance dropped by a large margin. With a preprocessing step of singing voice separation, the performance of singer identification is expected to improve.

Singing voice separation attempts to isolate the singing voice (also called the vocal line) from a song. In recent years, this problem has attracted increasing attention with the growing demand for automatic singer identification, automatic lyrics recognition, and lyrics alignment. Although these tasks seem effortless to humans, they turn out to be very difficult for machines, especially when the singing voice is accompanied by musical instruments. However, such tasks can be supported if a successful separation of the singing voice is used as preprocessing. Different tasks place different demands on the quality of the separated signal. As long as the contents of the singing voice are intelligible, the separated signal can satisfy the demands of automatic lyrics recognition. If the contents of the singing voice are intelligible but the singer is unrecognizable, the separated signal is not applicable to automatic singer identification. Generally, a singer identification system needs separated singing voice of higher quality.

In separation systems, harmonic structure modeling and the temporal instability of harmonics have been regarded as the most important characteristics for source separation [1], [2], [15], [23], [25]. However, the pitch of the target source needs to be estimated in advance [15], [23], [25], and when the energy ratio of singing to accompaniment declines, the accuracy of pitch estimation declines accordingly. Although the system in [1] did not require pitch estimation, it is only applicable to the separation of percussive instruments from harmonic sources. Matrix factorization plays an important role in the field of signal separation. Huang et al. [17] proposed using robust principal component analysis (RPCA) for singing-voice separation from music accompaniment. With RPCA, they obtained two output matrices: a sparse matrix containing formant structures, which indicates vocal activity, and a low-rank matrix, which indicates musical notes. This was based on the assumption that repetition is a core principle in music and that the singing voice has more variation and is relatively sparse within a song. Ryynanen et al. [20] separated accompaniment from polyphonic music based on automatic melody transcription. This method used sinusoidal modeling to estimate, synthesize, and remove the lead vocals. In their system, the pitches of the singing also needed to be estimated in advance. The authors in [12] presented a tensor factorization model for musical source separation. The approach is an extension of nonnegative matrix factorization (NMF) in which more than one matrix or tensor object is simultaneously factorized. Their models incorporated spectral information by using isolated note recordings, or incorporated harmonic information. M. Kim and J. Yoo together with their colleagues [7], [9] addressed the problem of separating drum sources from monaural mixtures of polyphonic music containing various pitched instruments as well as drums. They applied nonnegative matrix partial co-factorization (NMPCF) to two or more column blocks of the mixture spectrogram matrices and a drum-only spectrogram matrix, in order to determine common basis vectors that capture the spectral and temporal characteristics of the drum sources. The separated drum signal can then be
obtained by reconstructing the spectrogram of the drum sources and inverse-transforming this spectrogram into the time domain. NMPCF emerged from the concept of joint decomposition, or collective matrix factorization, in which multiple input matrices are decomposed into several factor matrices, some of which are shared; it therefore shows great potential for singing voice separation from monaural recordings. The method of multi-matrix joint decomposition was also used for rhythmic source separation [5]. Analogously, B. Raj et al. [19] applied probabilistic latent component decomposition (PLCD) to separate singing voices from background music in popular songs. A set of basis vectors described by the frequency marginal was learned for each component signal from a separate unmixed training recording (vocal or background music). The spectrograms of the voice-only and music-only components of the mixed recordings were then obtained using PLCD. Shashanka et al. [29] have shown that probabilistic latent variable model (PLVM) decompositions are numerically identical to NMF; PLVM and PLCD are essentially the same. However, NMF is more flexible and tractable. In addition, CASA and NMF have long been two important and effective methods in the field of audio signal separation [11], [13], [18]. Y. Li and D. L. Wang [18] also proposed a system to separate singing voice from music accompaniment for monaural recordings. Their system consists of three stages. The singing voice detection stage partitions an input song into vocal and non-vocal portions. For vocal portions, the predominant pitch detection stage estimates the pitch of the singing voice, and a CASA-based separation stage then uses the estimated pitch to group the time-frequency segments of the singing voice. Finally, the singing voice is re-synthesized from the segments.

In this paper, we propose a singing voice separation system consisting of two key stages. The first stage exploits NMPCF, a joint matrix decomposition integrating prior knowledge of singing voice and accompaniment, to separate the mixture signal into a singing-voice portion and an accompaniment portion. Based on the separated singing voice obtained by the source separation of the first stage, the second stage exploits a missing-feature method to further obtain a refined signal with a cleaner singing voice.

The remainder of this paper is organized as follows. In Section II, we detail the NMPCF method for singing voice separation. Section III presents the method of singing voice refinement exploiting a missing-feature method. Section IV provides a brief description of a singer identification system based on CASA. In Section V, we quantitatively evaluate the separation system from the perspective of signal-to-noise ratio (SNR) gain and from the perspective of completeness of singer information. We draw conclusions in Section VI.

II. NONNEGATIVE MATRIX PARTIAL CO-FACTORIZATION FOR SINGING VOICE SEPARATION

Several algorithms based on non-negative matrix factorization (NMF) have been developed for blind (or semi-blind) source separation (BSS) [30], and those NMF algorithms are efficient and robust for source separation when
sources are statistically dependent, provided that additional constraints are imposed, such as nonnegativity, sparsity, smoothness, lower complexity, or better predictability. However, without any prior knowledge of a source signal, standard NMF cannot separate a specific source signal from the mixed signal [1], [7]. To tackle this problem, nonnegative matrix partial co-factorization (NMPCF) was introduced [5], [7], [9]. The authors of [5] separated rhythmic sources from the mixture without any prior knowledge of drum sources. They segmented the input mixture signal into multiple shorter excerpts and then factorized them into a common part and individual parts, which represent the rhythmic and harmonic sources, respectively, since most drum sources have the temporal property of repeatability. J. Yoo et al. [7] exploited prior knowledge of drum sources to separate drums from commercial music mixtures: a solo recording of various drums was used as an auxiliary input signal along with the mixture signal to be separated. The authors of [9] proposed a unified approach harmonizing the two branches [5], [7] of NMPCF-based BSS systems. Their system sought to extract both spectral and temporal features, since drum and harmonic sources clearly differ in these two types of characteristics.

In this paper, we are dedicated to separating the singing voice from monaural songs, which are composed of a solo singing voice and a pure accompaniment. Let the magnitude spectrogram matrix of the accompanied singing be denoted by $\mathbf{X}$; each element $X_{ij}$ then represents the spectral magnitude of the $i$th frequency bin at the $j$th time frame. By applying standard NMF to this nonnegative mixture matrix as $\mathbf{X} \approx \mathbf{U}\mathbf{V}$, where $\mathbf{U}$ and $\mathbf{V}$ are also constrained to be nonnegative, the columns of the resulting matrix $\mathbf{U}$ represent the frequency bases of the sources contained in the mixture signal, and the corresponding rows of $\mathbf{V}$ represent the activations of these frequency bases across time. As a result, some of the bases in $\mathbf{U}$ represent the singing source and the others represent the remaining accompaniment source. If we collect the frequency basis vectors representing the singing source into $\mathbf{U}_S$ and the corresponding activations into $\mathbf{V}_S$, then we can reconstruct the magnitude spectrogram of the solo singing voice as $\mathbf{U}_S\mathbf{V}_S$ [7]. However, the components representing singing are placed at arbitrary locations of $\mathbf{U}$; in order to identify the components of the target source, NMPCF was introduced. NMPCF conducts a partial co-factorization that simultaneously exploits prior knowledge of singing voice and accompaniment, and it can locate the frequency bases of each source so that the single-source signals can be separated from the mixed signal. By NMPCF, the mixture matrix can be decomposed as follows:

$$\mathbf{X} \approx \mathbf{U}_S\mathbf{V}_S + \mathbf{U}_A\mathbf{V}_A \quad (1)$$

where $\mathbf{U}_S$ and $\mathbf{V}_S$ represent the frequency and time characteristics of the singing voice, and $\mathbf{U}_A$ and $\mathbf{V}_A$ represent the frequency and time characteristics of the accompaniment. All the matrices $\mathbf{U}_S$, $\mathbf{V}_S$, $\mathbf{U}_A$, $\mathbf{V}_A$ are constrained to be nonnegative. The authors of [7], [9] exploited only one additional prior spectrogram of a drum-only signal to automatically discriminate the drum bases from the harmonic bases. Here, expanding their approach, we propose an NMPCF method that shares the frequency basis matrices $\mathbf{U}_S$ and $\mathbf{U}_A$ of the two sources in the mixture by imposing two additional factorizations of prior spectrograms, while not considering the temporal characteristics.
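To make the limitation of plain NMF concrete, the following minimal numpy sketch is our own illustration, not code from the paper: it factorizes a mixture spectrogram and reconstructs a "singing" spectrogram from a chosen subset of bases. The matrix sizes, the rank, and the index set `sing_idx` are all hypothetical; the point is that plain NMF gives no principled way to know which columns of U belong to the singing voice, which is exactly the problem NMPCF addresses.

```python
import numpy as np

def nmf(X, K, n_iter=200, eps=1e-12):
    """Plain multiplicative-update NMF (Euclidean cost): X ~= U @ V."""
    F, T = X.shape
    rng = np.random.default_rng(0)
    U = rng.random((F, K)) + eps
    V = rng.random((K, T)) + eps
    for _ in range(n_iter):
        U *= (X @ V.T) / (U @ V @ V.T + eps)
        V *= (U.T @ X) / (U.T @ U @ V + eps)
    return U, V

# X: nonnegative magnitude spectrogram of the accompanied singing (F x T)
X = np.abs(np.random.randn(257, 100))   # placeholder mixture spectrogram
U, V = nmf(X, K=30)

# If we somehow knew which columns of U model the singing voice (plain NMF
# provides no such labelling), the singing spectrogram would simply be:
sing_idx = np.arange(0, 10)             # hypothetical singing-related bases
X_sing = U[:, sing_idx] @ V[sing_idx, :]
```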
Fig. 1. Illustration of magnitude spectrogram decomposition. (a) The construction of spectrograms used for NMPCF. (b) A pictorial illustration of the NMPCF model. Basis matrices in the factorization of the mixture spectrogram are shared in the factorizations of the prior signal spectrograms. Matrices X, Y, and Z indicate the mixture spectrogram, the pure accompaniment spectrogram, and the clean singing spectrogram, respectively.
A. Factorization Models

Fig. 1 describes the matrix generation process and the framework of the proposed NMPCF. Assume that a mixture signal $x(t)$ is composed of a pure accompaniment $a(t)$ and a clean singing voice $s(t)$, formulated as $x(t) = s(t) + a(t)$. A vocal segment in a song, whose spectrogram is denoted by $\mathbf{X}$, can be separated into two signals, clean singing voice and pure accompaniment, by the matrix factorization (1). Through a singing voice detection process, a song is divided into vocal and non-vocal portions, whose spectrograms are denoted by $\mathbf{X}$ and $\mathbf{Y}$, respectively. The non-vocal portions are generally interludes between two vocal parts or the first and last portions of a song, so they are pure accompaniments of various intensities. This process is shown in Fig. 1(a). Only the vocal portions, in which the singing voice and the accompaniment are mixed, are separated; the non-vocal portions participate in the co-factorization as prior knowledge. The basis vectors of the factor matrix $\mathbf{U}_A$ in (1) are also used to factorize the side-information matrix $\mathbf{Y}$. In popular music, singing voices are generally accompanied by percussive instruments and pitched instruments. The pitched instruments in the accompaniment generate harmonic components that are generally closely related to the harmonic components of the singing voice and partly overlap with them. Therefore, in order to obtain a
separated singing voice of as high a quality as possible, the basis matrix $\mathbf{U}_S$ that captures the frequency characteristics of the singing voice should also be specified during the matrix factorization. As another additional prior spectrogram, the spectrogram of clean singing voice covering all singers, denoted by $\mathbf{Z}$, also participates in the co-factorization. Given the three input matrices $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$, the joint decomposition approximates the target matrix factorization (1) together with the two following models:

$$\mathbf{Y} \approx \mathbf{U}_A\mathbf{V}_Y \quad (2)$$

$$\mathbf{Z} \approx \mathbf{U}_S\mathbf{V}_Z \quad (3)$$

Fig. 1(b) describes the joint decomposition model. Models (1) and (2) share the basis matrix $\mathbf{U}_A$ to factorize the target mixture signal $\mathbf{X}$ and the pure accompaniment $\mathbf{Y}$, while models (1) and (3) share the basis matrix $\mathbf{U}_S$ to factorize the target mixture signal $\mathbf{X}$ and the clean singing voice $\mathbf{Z}$. The matrix $\mathbf{Y}$ provides prior side information about the accompaniment in the target song, while the matrix $\mathbf{Z}$ provides prior side information about clean singing voice covering all singers. We assume that the basis matrices $\mathbf{U}_A$ and $\mathbf{U}_S$ can characterize the accompaniment and the singing voice well, respectively.

B. Objective Function and Update Rules

The objective function of NMPCF is constructed to minimize the residuals of models (1), (2), and (3), which becomes
$$J = \|\mathbf{X} - \mathbf{U}_S\mathbf{V}_S - \mathbf{U}_A\mathbf{V}_A\|_F^2 + \lambda_A\|\mathbf{Y} - \mathbf{U}_A\mathbf{V}_Y\|_F^2 + \lambda_S\|\mathbf{Z} - \mathbf{U}_S\mathbf{V}_Z\|_F^2 \quad (4)$$

where $\lambda_A$ is a parameter reflecting the relative importance of the accompaniment matrix $\mathbf{Y}$ among the matrices $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$, and $\lambda_S$ is a parameter reflecting the relative importance of the singing-voice matrix $\mathbf{Z}$ among those matrices. After calculating the derivative of $J$ with respect to $\mathbf{U}_A$, $\mathbf{V}_A$, $\mathbf{V}_S$, $\mathbf{U}_S$, $\mathbf{V}_Y$, and $\mathbf{V}_Z$, respectively, multiplicative update rules for these matrices can be derived by taking the negative terms of the partial derivative as the numerator of the multiplication factor and the positive terms as the denominator. Writing $\hat{\mathbf{X}} = \mathbf{U}_S\mathbf{V}_S + \mathbf{U}_A\mathbf{V}_A$ and denoting element-wise multiplication by $\odot$ and element-wise division by the fraction bar, the updates are

$$\mathbf{U}_A \leftarrow \mathbf{U}_A \odot \frac{\mathbf{X}\mathbf{V}_A^{\top} + \lambda_A\mathbf{Y}\mathbf{V}_Y^{\top}}{\hat{\mathbf{X}}\mathbf{V}_A^{\top} + \lambda_A\mathbf{U}_A\mathbf{V}_Y\mathbf{V}_Y^{\top}} \quad (5)$$

$$\mathbf{V}_A \leftarrow \mathbf{V}_A \odot \frac{\mathbf{U}_A^{\top}\mathbf{X}}{\mathbf{U}_A^{\top}\hat{\mathbf{X}}} \quad (6)$$

$$\mathbf{V}_S \leftarrow \mathbf{V}_S \odot \frac{\mathbf{U}_S^{\top}\mathbf{X}}{\mathbf{U}_S^{\top}\hat{\mathbf{X}}} \quad (7)$$

$$\mathbf{U}_S \leftarrow \mathbf{U}_S \odot \frac{\mathbf{X}\mathbf{V}_S^{\top} + \lambda_S\mathbf{Z}\mathbf{V}_Z^{\top}}{\hat{\mathbf{X}}\mathbf{V}_S^{\top} + \lambda_S\mathbf{U}_S\mathbf{V}_Z\mathbf{V}_Z^{\top}} \quad (8)$$

$$\mathbf{V}_Y \leftarrow \mathbf{V}_Y \odot \frac{\mathbf{U}_A^{\top}\mathbf{Y}}{\mathbf{U}_A^{\top}\mathbf{U}_A\mathbf{V}_Y} \quad (9)$$

$$\mathbf{V}_Z \leftarrow \mathbf{V}_Z \odot \frac{\mathbf{U}_S^{\top}\mathbf{Z}}{\mathbf{U}_S^{\top}\mathbf{U}_S\mathbf{V}_Z} \quad (10)$$

By iteratively applying the updates (5) to (10), we obtain the basis matrix $\mathbf{U}_S$ of the singing voice and the corresponding time activations $\mathbf{V}_S$. Then the separated signal containing only the singing voice can be reconstructed. First, compute the spectrogram of the singing voice as $\hat{\mathbf{X}}_S = \mathbf{U}_S\mathbf{V}_S$; next, the temporal signal can be re-synthesized by inverse-transforming this spectrogram and using the overlap-add method. Certainly, the accompaniment signal can also be re-synthesized in the same way.
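As a concrete illustration of the update rules (5)-(10), the following numpy sketch is our own and not the authors' implementation; the variable names follow the notation above, all matrix sizes are placeholders, and a stricter implementation would recompute the current approximation after every individual factor update rather than once per sweep.

```python
import numpy as np

def nmpcf(X, Y, Z, Ks=100, Ka=50, lam_a=1.0, lam_s=1.0, n_iter=200, eps=1e-12):
    """Sketch of the partial co-factorization (1)-(3):
       X ~= Us@Vs + Ua@Va,  Y ~= Ua@Vy,  Z ~= Us@Vz,
       minimized with the multiplicative updates (5)-(10)."""
    F = X.shape[0]
    rng = np.random.default_rng(0)
    Us = rng.random((F, Ks)) + eps; Vs = rng.random((Ks, X.shape[1])) + eps
    Ua = rng.random((F, Ka)) + eps; Va = rng.random((Ka, X.shape[1])) + eps
    Vy = rng.random((Ka, Y.shape[1])) + eps
    Vz = rng.random((Ks, Z.shape[1])) + eps
    for _ in range(n_iter):
        Xh = Us @ Vs + Ua @ Va                       # current approximation of X
        Ua *= (X @ Va.T + lam_a * (Y @ Vy.T)) / (Xh @ Va.T + lam_a * (Ua @ Vy @ Vy.T) + eps)  # (5)
        Va *= (Ua.T @ X) / (Ua.T @ Xh + eps)                                                   # (6)
        Vs *= (Us.T @ X) / (Us.T @ Xh + eps)                                                   # (7)
        Us *= (X @ Vs.T + lam_s * (Z @ Vz.T)) / (Xh @ Vs.T + lam_s * (Us @ Vz @ Vz.T) + eps)  # (8)
        Vy *= (Ua.T @ Y) / (Ua.T @ Ua @ Vy + eps)                                              # (9)
        Vz *= (Us.T @ Z) / (Us.T @ Us @ Vz + eps)                                              # (10)
    return Us, Vs, Ua, Va

# The separated singing spectrogram is then Us @ Vs; the waveform is obtained by
# applying the mixture phase and an inverse STFT with overlap-add.
```

In the configuration described later in the paper, $\mathbf{U}_S$ is pre-trained by a standard NMF of $\mathbf{Z}$ and held fixed, so only updates (5), (6), (7), and (9) are iterated.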
Fig. 2. Spectrogram comparison. (a) The mixture spectrogram of the accompanied vocal signal mixed at the SAR of −6 dB. (b) The spectrogram of the vocal signal after NMPCF. (c) Estimated harmonic mask matrix; white pixels indicate 1 and black pixels indicate 0. (d) Masked spectrogram of the singing voice. (e) The spectrogram after singing spectrum reconstruction. (f) The spectrogram of the clean singing voice. In order to facilitate a clear understanding of the spectrograms, only the contents from 0 to 2 kHz are displayed.
Fig. 2(b) gives a pictorial example of the separated signal. Note that all sub-pictures in Fig. 2 except Fig. 2(c) are displayed with the same color scaling. In order to facilitate a clear understanding of the spectrograms, the amplitude values were all compressed by a cubic-root operation, and only the frequency components from 0 to 2 kHz are displayed. Generally, the pitches of the accompanying pitched instruments are lower than those of the singing voice; this can be seen by comparing Fig. 2(a) with Fig. 2(f). Comparing Fig. 2(a) with the spectrogram after the NMPCF process shown in Fig. 2(b), it is evident that most of the frequency components of the percussive and pitched instruments have been removed. This can also be verified by comparing Fig. 3(a) with Fig. 3(b). In particular, after the NMPCF process, some frequency components with distinct harmonic characteristics below 100 Hz in Fig. 2(a), which are produced by a pitched instrument, have vanished in Fig. 2(b). However, some frequency components at about 200 Hz produced by a pitched instrument around 1.2 s in Fig. 2(b) have still not been removed. In addition, we perform the singing voice separation in order to identify the singer, so the prior matrix $\mathbf{Z}$ should be
very large because it should cover all underlying singers, and the decomposition of model (3) would have to be performed for each target mixture matrix $\mathbf{X}$. Therefore, in the actual experiments, to reduce the amount of computation, the basis matrix $\mathbf{U}_S$ was obtained in advance by a standard NMF decomposition of $\mathbf{Z}$, and only the matrices in (5), (6), (7), and (9) were updated iteratively. In any case, during the whole NMPCF process, the basis matrices of the two sources present in the mixture recordings are both determined by pure source signals, which helps to improve the performance of source separation.

A. Lefevre et al. proposed a semi-supervised NMF method [35] to perform single-channel source separation. First, a graphical user interface is exploited to retrieve time-frequency annotations of the singing voice. Then, based on those annotations, the incomplete masked singing-voice spectrogram and accompaniment spectrogram, serving as local information, participate in the NMF of the observed mixture spectrogram. The authors argue that once a certain amount of annotation is reached, the source separation quality approaches that of ideal binary masks. J. Han and C. Chen [34] proposed an approach for extracting the melody from polyphonic audio while retaining the vocal (singing) signal, thus accomplishing the source separation. For each song, the non-vocal segments are first identified and then factorized by PLCA to learn a dictionary for the accompaniment signal. The accompaniment dictionary is then used to separate the vocal segments, where the accompaniment dictionary is fixed while the vocal dictionary is updated. As mentioned previously, PLVM is numerically identical to NMF, and PLVM, PLCD, and PLCA are essentially the same, so supervised PLCA can also separate the target source. However, in their method [34], only the dictionary of one source is determined during the separation process, whereas in our proposed NMPCF method the basis matrices of both the clean singing voice and the pure accompaniment are determined by single-source signals. Here, a basis matrix can also be regarded as a source dictionary in the sense of [34]. Determining the basis matrices of both sources helps to separate the two source signals through the NMPCF process.

III. SINGING VOICE REFINEMENT

The separated signal produced by the source separation process described in the previous section has a good intelligibility score. However, in comparison with the spectrogram of the clean singing voice shown in Fig. 2(f), some extraneous frequency components still exist in the spectrogram of the separated signal. To remove those extraneous frequency components from the separated signal, we use a missing-feature method to refine the singing voice.

A. Harmonic Mask Estimation

First, the separated signals produced by the NMPCF process serve as the input to the singing voice refinement process, and the pitch of the singing voice is estimated from them using Praat [3]. After the source separation process, the negative influence of the accompaniment has largely been reduced or even eliminated, and only a few extraneous frequency components with low energy remain in the separated signal. Consequently, the performance of pitch estimation is greatly improved; this will be tested in Section V.
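The pitch-tracking step can be sketched as follows. The paper uses Praat; the snippet below is only a stand-in of our own using pYIN from librosa, with a hypothetical file name and an assumed plausible pitch range (80-800 Hz) that is not taken from the paper.

```python
import numpy as np
import librosa

# y_sep: separated singing voice from the NMPCF stage, at 16 kHz
y_sep, sr = librosa.load("separated_singing.wav", sr=16000)   # hypothetical file

# Frame-level F0 contour; pYIN here only substitutes for Praat.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y_sep, fmin=80.0, fmax=800.0, sr=sr,
    frame_length=512, hop_length=256)      # 32 ms frames, 16 ms hop at 16 kHz
f0 = np.where(voiced_flag, f0, 0.0)        # zero pitch marks unvoiced frames
```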
Next, the input recordings are transformed into spectrograms using the short-time Fourier transform (STFT), and the estimated pitches are exploited to generate a harmonic mask for the singing voice by identifying the frequency bins associated with each harmonic at each time frame [31]. For a perfectly harmonic sound, the frequency of the $h$th harmonic at time frame $t$ is $h f_0(t)$, where $f_0(t)$ denotes the pitch (fundamental frequency) of the singing voice at time frame $t$. The central position of harmonic $h$ may be among the frequency bins $k$ at time frame $t$ for which

$$\left|k\,\Delta f - h f_0(t)\right| < \theta_1\,\Delta f \quad (11)$$

where $\theta_1$ is a threshold and $\Delta f = f_s/N$ is the frequency resolution of the discrete Fourier transform (DFT), with $f_s$ the sampling frequency and $N$ the length of the DFT. Ultimately, the central position $k_h(t)$ of harmonic $h$ is taken as the frequency bin with the largest energy among the candidate bins satisfying (11). Then, a frequency bin $k$ is associated with harmonic $h$ if it satisfies

$$\left|k - k_h(t)\right| \le \theta_2 \quad (12)$$

where $\theta_2$ is also a threshold. We denote the set of frequency bins associated with harmonic $h$ at frame $t$ by $\Omega_h(t)$. A harmonic mask is simply a binary matrix in which 1 indicates that a frequency bin is associated with the harmonics of the singing voice and 0 indicates otherwise. The harmonic mask matrix is defined as follows:

$$M(k,t) = \begin{cases} 1, & k \in \bigcup_{h \in \mathcal{H}(t)} \Omega_h(t) \\ 0, & \text{otherwise} \end{cases} \quad (13)$$

where $\mathcal{H}(t)$ denotes the collection of harmonic orders at frame $t$. Given the spectrum of the separated signal and the harmonic mask matrix $M$, a masked spectrum of cleaner singing voice can be obtained from the element-wise product of the two. The spectrum vector $\mathbf{x}$ of each frame is thus partitioned into reliable and unreliable parts, $\mathbf{x} = (\mathbf{x}_r, \mathbf{x}_u)$. The components of the reliable part, corresponding to 1 in the mask matrix, are regarded as reliable and available to the classifier, while the components of the unreliable part, corresponding to 0 in the mask matrix, are regarded as missing [6], without true data.
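A small sketch of the harmonic-mask construction of (11)-(13) follows. It is our own illustration; the threshold values passed as defaults are placeholders, since the paper specifies them only in terms of Hamming-window bandwidths.

```python
import numpy as np

def harmonic_mask(S, f0, fs=16000, n_fft=512, n_harm=20, theta1=4, theta2=2):
    """Binary harmonic mask per (11)-(13).
       S: magnitude STFT (F x T); f0: pitch per frame (0 = unvoiced);
       theta1/theta2: thresholds in bins (placeholder values)."""
    F, T = S.shape
    df = fs / n_fft                              # DFT frequency resolution
    bins = np.arange(F) * df                     # centre frequency of each bin
    mask = np.zeros((F, T), dtype=np.uint8)
    for t in range(T):
        if f0[t] <= 0:                           # unvoiced frame: no harmonics
            continue
        for h in range(1, n_harm + 1):
            cand = np.where(np.abs(bins - h * f0[t]) < theta1 * df)[0]   # (11)
            if cand.size == 0:
                continue
            k_h = cand[np.argmax(S[cand, t])]    # strongest candidate bin
            lo, hi = max(0, k_h - theta2), min(F, k_h + theta2 + 1)      # (12)
            mask[lo:hi, t] = 1                   # (13)
    return mask
```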
B. Singing Voice Spectrum Reconstruction

Finally, a missing-feature method, reconstruction [21], is used to reconstruct a complete spectrum of the singing voice. For reconstruction, the aim is to estimate values for the unreliable components $\mathbf{x}_u$ that, together with the reliable components $\mathbf{x}_r$, produce a complete observation vector, i.e., $\mathbf{x} = (\mathbf{x}_r, \mathbf{x}_u)$. We assume that a Gaussian mixture model (GMM) has been trained on clean singing voices covering all the singers. To model all the data more comprehensively and accurately, a universal background model (UBM) [22], a large GMM, is trained to represent the distribution of the spectrum. The density of a spectrum vector $\mathbf{x}$ can be modeled using a mixture of Gaussians:

$$p(\mathbf{x}) = \sum_{k=1}^{K} w_k\, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \quad (14)$$

where $w_k$ denotes the prior probability of the $k$th Gaussian component and $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is the $k$th Gaussian density with mean vector $\boldsymbol{\mu}_k$ and diagonal covariance $\boldsymbol{\Sigma}_k$. The most probable value of the conditional density is imputed in place of $\mathbf{x}_u$. For a unimodal distribution, the expected value is

$$\hat{\mathbf{x}}_u = \int \mathbf{x}_u\, p(\mathbf{x}_u \mid \mathbf{x}_r)\, d\mathbf{x}_u. \quad (15)$$

An application of Bayes' rule results in the following expression for the conditional density:

$$p(\mathbf{x}_u \mid \mathbf{x}_r) = \sum_{k=1}^{K} P(k \mid \mathbf{x}_r)\, p(\mathbf{x}_u \mid k). \quad (16)$$

Substituting (16) into (15) produces

$$\hat{\mathbf{x}}_u = \sum_{k=1}^{K} P(k \mid \mathbf{x}_r) \int \mathbf{x}_u\, p(\mathbf{x}_u \mid k)\, d\mathbf{x}_u. \quad (17)$$

The integral on the right-hand side is the mean of the $k$th component over the unreliable portion of the data vector, $\boldsymbol{\mu}_{u,k}$. Hence, (17) can be rewritten as

$$\hat{\mathbf{x}}_u = \sum_{k=1}^{K} P(k \mid \mathbf{x}_r)\, \boldsymbol{\mu}_{u,k} \quad (18)$$

where $P(k \mid \mathbf{x}_r)$ is the posterior probability of the $k$th Gaussian component given the reliable components $\mathbf{x}_r$:

$$P(k \mid \mathbf{x}_r) = \frac{w_k\, p(\mathbf{x}_r \mid k)}{\sum_{j=1}^{K} w_j\, p(\mathbf{x}_r \mid j)}. \quad (19)$$

Bounds $[\mathbf{x}_u^{\min}, \mathbf{x}_u^{\max}]$ on the unreliable sub-vector are necessary to constrain the choice of imputed values so that they fall within the bounds of the amplitude of the time-frequency (T-F) unit. For each individual Gaussian, we compute the most likely value within the bounds [6]:

$$\bar{\boldsymbol{\mu}}_{u,k} = \arg\max_{\mathbf{x}_u^{\min} \le \mathbf{x}_u \le \mathbf{x}_u^{\max}} p(\mathbf{x}_u \mid k). \quad (20)$$

In our system, the values of the unreliable components are bounded below by the minimum value of the given frame and above by the observed value of the mixture signal. The reconstructed unreliable components can then be written as

$$\hat{\mathbf{x}}_u = \sum_{k=1}^{K} P(k \mid \mathbf{x}_r)\, \bar{\boldsymbol{\mu}}_{u,k}. \quad (21)$$

For each frame of singing voice, the complete spectrum vector is then reconstructed. Note that spectrum reconstruction is not needed for those frames where the pitch of the singing voice is zero. Fig. 2(c) shows the estimated harmonic mask matrix of the singing voice. Fig. 2(d) shows the masked spectrogram of the singing voice, which is the result of the element-wise product of the mask matrix and the mixture spectrogram. Because of mis-detection of the pitch of the singing voice, some harmonic components belonging to the pitched instrument still exist in the area around 1.2 s in the masked spectrogram. Fig. 2(e) shows the complete spectrogram of the singing voice after the process of spectrum reconstruction. For those frames where the pitch of the singing voice is detected as zero, the amplitude values of each frequency bin were set to the average value of the frequency bins in silence frames.
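The per-frame imputation of (18)-(21) can be sketched as below; this is our own illustration under the stated assumption of a GMM with diagonal covariances, and all argument names are ours.

```python
import numpy as np

def reconstruct_frame(x, reliable, gmm_w, gmm_mu, gmm_var, x_min, x_max, eps=1e-300):
    """Bounded cluster-based imputation of unreliable spectral components, (18)-(21).
       x: spectrum vector of one frame; reliable: boolean mask of reliable bins;
       gmm_w, gmm_mu, gmm_var: weights (K,), means (K, D), diagonal variances (K, D)
       of a GMM trained on clean singing; x_min, x_max: per-bin bounds."""
    r, u = reliable, ~reliable
    # Posterior of each Gaussian given only the reliable components, eq. (19)
    log_post = (np.log(gmm_w + eps)
                - 0.5 * np.sum(np.log(2 * np.pi * gmm_var[:, r])
                               + (x[r] - gmm_mu[:, r]) ** 2 / gmm_var[:, r], axis=1))
    log_post -= np.max(log_post)
    post = np.exp(log_post)
    post /= np.sum(post)
    # Per-Gaussian most likely value within the bounds, eq. (20)
    mu_bounded = np.clip(gmm_mu[:, u], x_min[u], x_max[u])
    # Posterior-weighted combination, eq. (21)
    x_hat = x.copy()
    x_hat[u] = post @ mu_bounded
    return x_hat
```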
Fig. 3. Waveform comparison. (a) The mixture signal waveform at the SAR of −6 dB. (b) The separated singing voice after the process of singing voice separation by NMPCF. (c) The refined singing voice after the process of singing voice refinement. (d) Clean singing voice.
In contrast to Fig. 2(b), the noisy components around the harmonic components of the singing voice, especially the first-order harmonics, are reduced or even vanish in the reconstructed spectrogram. Similarly, the refined temporal signal is also obtained by inverse transformation and the overlap-add method. Hereafter, the refined signal denotes the output of the singing voice refinement process, and the separated signal denotes the output of the singing voice separation by NMPCF. Corresponding to the spectrograms shown in Fig. 2, Fig. 3 shows their temporal waveforms. Fig. 3(a) is the waveform of the mixture signal at the SAR of −6 dB. Fig. 3(b) is the separated singing voice re-synthesized from the output of NMPCF shown in Fig. 2(b), and Fig. 3(c) is the refined singing
voice re-synthesized from the reconstructed complete spectrogram shown in Fig. 2(e). As can be seen, the separated singing voice in Fig. 3(b) and the refined singing voice in Fig. 3(c) both match the original clean singing voice shown in Fig. 3(d) well.

IV. SINGER IDENTIFICATION

Since the goal of the previous two sections is to obtain a cleaner singing voice for improving the performance of singer identification, we will test the separated singing voice not only quantitatively with the conventional measure of [18] but also with a singer identification system. In this section, we briefly describe a singer identification system [10]. The singer identification consists of two stages. The first stage performs singing voice segregation based on CASA [4]. More specifically, the vocal recordings are analyzed through a cochlear filter bank to be represented in the T-F domain by an auditory periphery model. Meanwhile, the input vocal signals are also used to estimate the predominant pitches using a pitch estimation algorithm, Praat [3]. The estimated singing pitches are then used to compute pitch-based features together with the output of the cochlear filtering. Those features can capture the harmonic characteristics of the singing voice in a T-F unit and thus can be used to label it through a trained multi-layer perceptron (MLP) classifier. This labeling is called binary mask estimation. The segregated singing components can then be obtained according to the estimated binary mask and all the T-F units of each frame. The second stage is singer identification using missing-feature methods. In this paper, we only choose one missing-feature method, spectrum reconstruction, which was verified to be better than the other for singer identification, to perform the identification task. The segregated singing components form an incomplete spectrum in which the units dominated by more intense instrumental accompaniment are removed. The complete spectra are reconstructed from the incomplete noisy ones and then used to generate cepstral features, Gammatone frequency cepstral coefficients (GFCCs), by using the discrete cosine transform (DCT) [28]. Subsequently, the acquired cepstral features are used in conjunction with trained singer models to derive the underlying singer identity. More details about the singer identification algorithm can be found in [10]. In addition, the reconstruction in the singer identification system [10] is based on cochleagram features, and the reconstructed feature vectors are used directly for singer identification, whereas the reconstruction in the second stage of our proposed system is based on the STFT spectrum and aims to obtain a temporal signal of cleaner singing; they both utilize the same method but have different purposes. Note that the algorithm of [10] was performed directly on the original mixture vocal segments with lengths from 1 to 4 seconds. In this paper, the tasks of singer identification were performed on the vocal segments of separated singing voice, which have the same lengths as those used in [10].

V. EVALUATION AND COMPARISON

In this section we describe a series of experiments conducted to evaluate the performance of the proposed singing voice separation. In addition, we draw comparisons with two related singing voice separation systems [18], [35].
In order to verify the quality of the separated singing voice, our proposed source separation system was also compared with a singer identification system [10] that operates directly on the original mixture recordings.

A. Experimental Setup

The data set used here comprises 31 songs covering 22 singers: 10 female singers and 12 male singers. It contains one English singer, one Russian singer, one Japanese singer, and 19 Chinese singers. Each original song consists of two tracks, an accompanied singing track and a pure accompaniment track; the former is a mixture of the lead vocal and the accompaniment, and the latter contains the background accompaniment only. They were downloaded from a Karaoke website, http://k.kuwo.cn/. All the data are recorded as mono PCM waveforms with a 16 kHz sampling rate and 16-bit quantization. With an accompanied singing track and a pure accompaniment track, we extracted a clean singing voice for each song by using a vocal extractor software, utagoe, which can be obtained from the Internet; both tracks are required by utagoe. Some of the extracted solo singing voices were still accompanied by faint instrumental sounds, and only the relatively clean songs were used for the experiments, so the database has only 31 songs. The extracted clean singing voices serve as the training data set. In the NMPCF process, the training data covering all singers were used to form the magnitude spectrogram matrix $\mathbf{Z}$. To avoid an overly large matrix $\mathbf{Z}$, only 15% of the data were selected to participate in the NMPCF process; it is important that the selected data cover all types of singers. Similarly, in the singing voice refinement process, the clean singing voice covering all types of singers was used to train the GMM of the singing voice spectrum for spectrum reconstruction. Because a large GMM, the universal background model (UBM) [22], was trained to model the spectrum distribution of the clean singing voice, the training set could contain more data, so almost all the data were used for training. In the final singer identification stage, the solo singing voice of each singer was used to train the GMM of GFCCs.

To study the performance of the system under different interference (accompaniment) intensities, the extracted solo singing voices were superimposed with the pure accompaniment track at four singing-to-accompaniment ratios (SARs), from 0 dB downward at fixed intervals. The superimposition was done carefully, so that the resulting accompanied singing voices are in rhythm. The SAR is defined as the energy ratio of clean singing to pure accompaniment in a recording, in dB [26]. Those monophonic mixture songs serve as the testing data set. In this paper, we assume that the segmentation into vocal and non-vocal segments is accurate. The mixture songs were divided into vocal and non-vocal segments according to the pitches of the singing voice, which were detected on the clean singing voice using Praat [3] before the clean singing voice and the accompaniment were mixed. A vocal segment starts at the first frame with non-zero pitch and ends at the last frame with non-zero pitch. However, if an interval of consecutive zero-pitch frames is shorter than 0.2 s, then this interval together with the previous and the next
vocal segments will be merged into one vocal segment. The system discarded vocal segments shorter than 1 second and further divided vocal segments longer than 4 seconds until the length of each segment lay within the range of 1 to 4 seconds. In all, there are 1568 segments in the experiments. The experiments were conducted on segment data, i.e., the singer label of a test segment is assigned based on the sum of the log-likelihoods of its frames, and the singer identification accuracy is computed as the percentage of correctly identified segments over the total number of test segments.

In our implementation, a frame length of 32 ms and a hop length of 16 ms were used with the 16 kHz sampling frequency. The length of the DFT was 512, and no zero-padding was used. In the NMPCF separation process, the number of basis vectors was 100 for the singing-voice basis $\mathbf{U}_S$ and 50 for the accompaniment basis $\mathbf{U}_A$; the two weighting parameters in (4) were set to 12 and 1. In the singing voice refinement process, the threshold $\theta_1$ was chosen to be about half of the 40-dB bandwidth of a Hamming window, and $\theta_2$ approximately the number of frequency bins covering the 6-dB bandwidth of a Hamming window [31]. The number of harmonics was chosen as 20. The clean singing voices covering all singers were used to train a large GMM of 256 Gaussian components for complete spectrum reconstruction.
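The segment-level decision rule described above (assign the singer whose model gives the largest summed frame log-likelihood) can be sketched as follows. This is our own illustration, not the authors' code; the GFCC extraction is assumed to have been done elsewhere, and the 32-component diagonal GMM per singer is the configuration mentioned later in Section V-C.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def identify_singer(gfcc_frames, singer_gmms):
    """Assign a test segment to the singer whose GMM yields the largest sum of
       per-frame log-likelihoods.
       gfcc_frames: (T x D) GFCC features of one segment;
       singer_gmms: dict mapping singer id -> fitted GaussianMixture."""
    scores = {sid: gmm.score_samples(gfcc_frames).sum()   # sum of frame log p(x)
              for sid, gmm in singer_gmms.items()}
    return max(scores, key=scores.get)

# Hypothetical training: one 32-component diagonal GMM of GFCCs per singer
# singer_gmms = {sid: GaussianMixture(n_components=32, covariance_type="diag")
#                        .fit(train_gfcc[sid]) for sid in singer_ids}
```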
TABLE I
SNR GAIN COMPARISON: AVERAGE SNR GAIN (dB) OF THE PROPOSED SYSTEM AND EXISTING SYSTEMS [18], [35]

B. Source Separation Evaluation

In this section, the performance of the singing voice separation of the proposed system is evaluated, and the proposed system is compared with two source separation systems. One is a CASA-based singing voice separation system [18], which first detects the pitch of the singing voice and then, based on the detected pitches, exploits CASA to group the T-F segments of the singing voice and re-synthesize it. Their CASA-based separation method is an extension of the algorithm proposed by Hu and Wang [37], so it also needs an MLP network trained on clean singing voice. In [18], the reference pitch contours were calculated by applying Praat [3] to the clean singing voice, which is the same pitch detection method used in our system. Here, we only discuss the separation performance of the second stage of their system, which groups the T-F segments of the singing voice using CASA. We therefore tested their system with the ideal pitches of the singing voice rather than the pitches obtained by their own pitch detection method; the ideal pitches were also detected by applying Praat to the clean singing voice before the clean singing and the accompaniment were mixed at the various SARs. Thus the CASA-based separation method of their system was actually tested for comparison with ours.

The other source separation system is based on the semi-supervised NMF algorithm [35], which first exploits a graphical user interface to retrieve time-frequency annotations of the singing voice in the observed spectrogram. The incomplete masked singing-voice spectrogram and accompaniment spectrogram, obtained as the element-wise products of the annotations (mask matrices) and the mixture spectrogram, then serve as side information in the NMF. This idea translates into minimizing an objective function, (22) in [35], that combines the Itakura-Saito divergence between the mixture spectrogram and its factorized approximation with divergence terms between each annotated (masked) source spectrogram and its corresponding approximation, weighted by an optimization parameter. In [35], the wrong-annotation rate is set to 0%, 20%, and 50%. In our implementation, the wrong-annotation rate was set to 20%, the areas of wrong annotation were randomly selected, the time-frequency masks of each source were generated from those annotations, and the optimization parameter was set to 10.

Following [18], the signal-to-noise ratio (SNR) gain was selected to quantify the performance of the systems. The SNR is defined as follows [31]:

$$\mathrm{SNR_o} = 10\log_{10}\frac{\sum_t s^2(t)}{\sum_t \left(s(t) - \hat{s}(t)\right)^2} \quad (23)$$

$$\mathrm{SNR_i} = 10\log_{10}\frac{\sum_t s^2(t)}{\sum_t \left(s(t) - x(t)\right)^2} \quad (24)$$

where $s(t)$ is the clean singing voice, $\hat{s}(t)$ is the estimated singing voice, and $x(t)$ is the mixture signal. The SNR gain is then $\mathrm{SNR_o} - \mathrm{SNR_i}$.

Table I shows the average SNR gain over 1568 samples when applying the proposed system and the two comparison systems. The singing voice separation system based on CASA [18] is denoted by "Li & Wang (2007)", and the source separation system based on semi-supervised NMF [35] is denoted by "Lefevre et al. (2012)". The test signals were mixed at four SAR levels. The item "NMPCF" refers to the separated singing voice re-synthesized from the output spectrogram of the NMPCF process, for which $\hat{s}(t)$ in (23) denotes the separated singing voice. The item "Refinement" refers to the refined singing voice re-synthesized from the output spectrogram of the singing voice refinement process, for which $\hat{s}(t)$ in (23) denotes the refined singing voice. The last column shows the average result of each row. The test segments all have lengths from 1 to 4 seconds.

The results show the effectiveness of both novel aspects of the proposed system, the NMPCF separation stage and the singing voice refinement stage. In comparison with Li & Wang (2007), the proposed system provides a clear improvement in all cases. For Li and Wang's system, even with accurate pitches of the singing voice, the performance of singing voice separation declines as the SAR decreases, that is, as the accompaniment becomes stronger. In the field of speech separation, the intelligibility score produced by a CASA-based method with the ideal binary mask (IBM) decreases systematically for mixtures with very low signal-to-noise ratio (SNR) [4]; the same phenomenon is also found in singing voice separation [10]. For Lefevre et al.'s system with the wrong-annotation rate fixed at 20%, as the SAR decreases the SNR gains significantly
increase. At the lowest SAR, the performance of Lefevre et al.'s system is even superior to that of our proposed system, while at the three higher SARs the results are all worse than those of our proposed system, especially at 0 dB. Actually, as the SAR decreases, the wrong-annotation rate of their semi-supervised method should also increase, so the SNR gains at the lower SARs should in practice be much lower. The results in Table I also reflect that the CASA-based separation method is not applicable in the case of low SAR (or SNR), and that Lefevre et al.'s system is not applicable in the case of high SAR (such as 0 dB). In contrast, NMPCF removes the great majority of the accompaniment components and alleviates their negative influence to a great extent, so the average SNR gains of the separated singing voice and of the refined singing voice are better. As one would expect, the refinement of the singing voice improves the average SNR gain at all SARs: compared with the "NMPCF" row of Table I, the average improvement achieved by the refinement of the singing voice is an SNR gain of 0.65 dB.
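For reference, the SNR gain of (23) and (24) can be computed as in the sketch below; this is our own helper, with hypothetical argument names, assuming all three signals are time-aligned waveforms.

```python
import numpy as np

def snr_db(ref, est):
    """SNR in dB of an estimate 'est' of the clean singing voice 'ref', as in (23)/(24)."""
    n = min(len(ref), len(est))
    ref, est = ref[:n], est[:n]
    return 10.0 * np.log10(np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + 1e-12))

def snr_gain(clean, separated, mixture):
    """SNR gain: output SNR of the separated voice (23) minus input SNR of the mixture (24)."""
    return snr_db(clean, separated) - snr_db(clean, mixture)
```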
TABLE II
AVERAGE F-MEASURE (%) OF PITCH ESTIMATION OF SINGING VOICE
C. Applicability for Singer Identification Evaluation

We pursue singing voice separation in order to improve the performance of singer identification. In this subsection, we therefore also test the separated signals with a singer identification system [10], which first detects the pitches of the singing voice directly on the song mixed with the accompaniment, then performs CASA-based segregation, and finally reconstructs a complete spectrum of the singing voice using a missing-feature method. Here, we perform singer identification on the separated singing voice and on the refined singing voice, respectively. Since the first step of the singer identification system is pitch detection, we first test the pitch detection of the singing voice. The reference pitches were detected by Praat [3] on the clean singing voices of the training data set. A pitch estimate was defined to be correct if it deviates less than 3% from the reference, so that it "rounds" to the correct musical note [36]. For pitch detection, a more appropriate measure is the F-measure [32], which considers both the precision and the recall of the test: precision $P$ is the number of correct results divided by the number of all returned results, and recall $R$ is the number of correct results divided by the number of results that should have been returned. The F-measure is the harmonic mean of precision and recall:

$$F = \frac{2PR}{P + R}. \quad (25)$$

Table II shows the F-measure values at the frame level for four different SARs. Since the test samples are all pop music, whose pitch range is relatively small compared to that of operatic song, we restricted the plausible pitch range (in Hz) accordingly for all the tests. The results of pitch estimation performed on the original songs mixed with the accompaniments are also presented in the second row of Table II. The separated singing voice, re-synthesized from the output spectrogram of the NMPCF process, achieves a significant improvement in pitch estimation of the singing voice compared with the original mixture signal. Moreover, for pitch estimation, the refined singing voice performs slightly better than the separated singing voice. It seems that some negative frequency components caused by the accompaniment are further removed or alleviated through the singing voice refinement process. This is consistent with the result of the test in Section V-B.
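The frame-level evaluation of (25) with the 3% tolerance can be sketched as follows; this is our own illustration with hypothetical argument names, assuming both pitch tracks use zero to mark unvoiced frames.

```python
import numpy as np

def pitch_f_measure(ref_f0, est_f0, tol=0.03):
    """Frame-level precision, recall and F-measure of a pitch track; an estimate
       is correct if it deviates less than 3% from the reference, as in (25)."""
    ref_f0, est_f0 = np.asarray(ref_f0, float), np.asarray(est_f0, float)
    ref_v, est_v = ref_f0 > 0, est_f0 > 0                  # voiced frames
    correct = ref_v & est_v & (np.abs(est_f0 - ref_f0) < tol * ref_f0)
    precision = correct.sum() / max(est_v.sum(), 1)        # correct / returned
    recall = correct.sum() / max(ref_v.sum(), 1)           # correct / should return
    f = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f
```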
Fig. 4. Singer Identification Accuracy Comparison. (a) Average singer identification accuracy at different SARs with IBM. (b) Average singer identification accuracy at different SARs with EBM.
Next, we test the singer identification on the separated singing voice and the refined singing voice, in comparison with the original mixture recordings. In the singer identification system [10], the first stage performs singing voice segregation based on CASA, which applies binary T-F masking to extract a target sound. An ideal binary mask (IBM) is a binary matrix defined as follows:

$$\mathrm{IBM}(t,f) = \begin{cases} 1, & S(t,f) > A(t,f) \\ 0, & \text{otherwise} \end{cases} \quad (26)$$

indexed by the time frame $t$ and frequency channel $f$, where 1 indicates that the target energy is stronger than the interference energy within the corresponding T-F unit and 0 indicates otherwise; $S(t,f)$ and $A(t,f)$ are the amplitude values of the singing voice and of the accompaniment in the T-F unit at time frame $t$ and frequency channel $f$. The IBM can be calculated before the superimposition of the clean singing and the pure accompaniment.
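A minimal sketch of (26), assuming the unmixed singing and accompaniment tracks are available as magnitude (or energy) T-F representations, is given below; it is our own illustration.

```python
import numpy as np

def ideal_binary_mask(S_sing, S_acc):
    """Ideal binary mask of (26): 1 where the singing-voice energy exceeds the
       accompaniment energy in a T-F unit, 0 otherwise."""
    return (np.asarray(S_sing) > np.asarray(S_acc)).astype(np.uint8)
```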
The IBM is the main CASA goal [33], whereas in practice the binary mask must be estimated from the input mixture. The estimated binary mask (EBM) was obtained by employing a CASA system [10]. Moreover, the resulting EBM is also influenced by the test samples: for a CASA-based method, the smaller the difference between the results with the IBM and with the EBM, the higher the quality of the test samples. Our comparison uses the same experimental setup and database as in [10]. Similarly, our tests include two types of singer identification, performed with the IBM and with the EBM. In accordance with the conclusion in [10], we selected the more appropriate missing-feature method, the reconstruction method, to identify the singer. In these tests, the GMM used for spectrum reconstruction includes 256 Gaussian components, and the GMM of the GFCCs includes 32 Gaussian components. Fig. 4(a) and (b) show the average singer identification accuracies in the cases "with IBM" and "with EBM", respectively. As can be seen, the separated singing voice and the refined singing voice both perform substantially better than the original mixture recordings at lower SARs. At the SAR of 0 dB, the separated singing voice and the refined singing voice even perform slightly worse than the original mixture when using the IBM. This may indicate that, through the NMPCF process, an accompaniment of very high energy can be well removed or alleviated, whereas an accompaniment of moderate energy, even if largely removed, still partly accompanies the singing voice components. Nevertheless, through the two processes of NMPCF and singing voice refinement, most of the interference in the mixture is removed, so that higher pitch detection performance can be achieved; note that at the first stage of singer identification [10], a pitch estimation algorithm is used to estimate the predominant pitch. For the case "with EBM" in Fig. 4(b), the results show the effectiveness of both stages of the proposed system, the singing voice separation and the singing voice refinement. Although Table I shows that the singing voice refinement improves the average SNR gain over the singing voice separation alone, it does not improve the performance of singer identification in Fig. 4. This means that for several refined singing voices, the contents of the singing voice are intelligible while the identity of the singer is not, i.e., the refined singing voice scores higher in singing intelligibility but lower in singer intelligibility. The refined singing voice is therefore more appropriate for automatic lyrics recognition, while the singing voice separation by NMPCF alone is the suitable preprocessing for singer identification.

VI. CONCLUSIONS

This paper presents a new method to separate the singing voice from monaural recordings. The system consists of two key stages. The singing voice separation stage exploits NMPCF to separate the clean singing voice from the mixture signal. Based on the output
signal of the singing voice separation stage, the singing voice refinement stage exploits a spectrum reconstruction method to obtain the spectrum of a relatively cleaner singing voice and thus the corresponding temporal signal. The results of singing voice separation show that, in comparison with two existing singing voice separation systems, the two stages both yield clear improvements in SNR gain, and the refined singing voice even achieves higher scores than the separated singing voice; the same holds for the results of singing pitch estimation. In the case of singer identification, however, the singing voice separated by NMPCF achieves higher scores than the refined singing voice. This means that the refined singing voices partly lose some singer information although their contents are still intelligible. It also implies that the singing voice refinement may be useful for automatic lyrics recognition, while the singing voice separation using NMPCF is extremely helpful for singer identification.

REFERENCES

[1] T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 3, pp. 1066–1074, Mar. 2007.
[2] T. Virtanen, A. Mesaros, and M. Ryynanen, "Combining pitch-based inference and non-negative spectrogram factorization in separating vocals from polyphonic music," in Proc. ISCA Tutorial Res. Workshop Statist. Percept. Audit. (SAPA), 2008.
[3] P. Boersma and D. Weenink, "Praat: Doing phonetics by computer [Computer program]," Version, vol. 5, p. 21, 2005.
[4] D. L. Wang, "On ideal binary mask as the computational goal of auditory scene analysis," in Speech Separation by Humans and Machines. Norwell, MA, USA: Kluwer, 2005, vol. 60, pp. 181–197.
[5] M. Kim et al., "Blind rhythmic source separation: Nonnegativity and repeatability," in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., 2010, pp. 2006–2009.
[6] M. Cooke, P. Green, L. Josifovski, and A. Vizinho, "Robust automatic speech recognition with missing and unreliable acoustic data," Speech Commun., vol. 34, pp. 267–285, 2001.
[7] J. Yoo et al., "Nonnegative matrix partial co-factorization for drum source separation," in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., 2010, pp. 1942–1945.
[8] H. Fujihara, M. Goto, T. Kitahara, and H. G. Okuno, "A modeling of singing voice robust to accompaniment sounds and its application to singer identification and vocal-timbre-similarity-based music information retrieval," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 638–648, Mar. 2010.
[9] M. Kim et al., "Nonnegative matrix partial co-factorization for spectral and temporal drum source separation," IEEE J. Sel. Topics Signal Process., vol. 5, no. 6, pp. 1192–1204, Dec. 2011.
[10] Y. Hu and G. Z. Liu, "Singer identification based on computational auditory scene analysis and missing feature methods," J. Intell. Inf. Syst., pp. 1–20, 2013.
[11] C. L. Hsu and J. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 2, pp. 310–319, Feb. 2010.
[12] U. Simsekli and A. T. Cemgil, "Score guided musical source separation using generalized coupled tensor factorization," in Proc. 20th Eur. Signal Process. Conf. (EUSIPCO), 2012, pp. 2639–2643.
[13] Z. Jin and D. L. Wang, "A supervised learning approach to monaural segregation of reverberant speech," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 4, pp. 625–638, May 2009.
[14] S. Kalayar Khine, T. Nwe, and H. Li, "Exploring perceptual based timbre feature for singer identification," in Computer Music Modeling and Retrieval: Sense of Sounds. New York, NY, USA: Springer, 2008, pp. 159–171.
[15] A. Klapuri, T. Virtanen, and T. Heittola, "Sound source separation in monaural music signals using excitation-filter model and EM algorithm," in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., 2010, pp. 5510–5513.
[16] V. Rao, S. Ramakrishnan, and P. Rao, "Singing voice detection in polyphonic music using predominant pitch," in Proc. Interspeech, 2009, pp. 1131–1134.
[17] P. S. Huang, S. D. Chen, P. Smaragdis, and M. H. Johnson, "Singing-voice separation from monaural recordings using robust principal component analysis," in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., 2012, pp. 57–60.
[18] Y. Li and D. L. Wang, "Separation of singing voice from music accompaniment for monaural recordings," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1475–1487, May 2007.
[19] B. Raj, P. Smaragdis, M. Shashanka, and R. Singh, "Separating a foreground singer from background music," in Proc. Int. Symp. Frontiers Res. Speech Music, Mysore, India, 2007.
[20] M. Ryynanen, T. Virtanen, J. Paulus, and A. Klapuri, "Accompaniment separation and karaoke application based on automatic melody transcription," in Proc. IEEE Int. Conf. Multimedia Expo, 2008, pp. 1417–1420.
[21] B. Raj, M. L. Seltzer, and R. M. Stern, "Reconstruction of missing features for robust speech recognition," Speech Commun., vol. 43, pp. 275–296, 2004.
[22] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, pp. 19–41, 2000.
[23] Z. Duan, Y. Zhang, C. Zhang, and Z. Shi, "Unsupervised single-channel music source separation by average harmonic structure modeling," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 4, pp. 766–778, May 2008.
[24] J. Shen, J. Shepherd, B. Cui, and K. L. Tan, "A novel framework for efficient automated singer identification in large music databases," ACM Trans. Inf. Syst. (TOIS), vol. 27, p. 18, 2009.
[25] V. Rao and P. Rao, "Vocal melody extraction in the presence of pitched accompaniment in polyphonic music," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 8, pp. 2145–2154, Nov. 2010.
[26] W. H. Tsai and H. P. Lin, "Background music removal based on cepstrum transformation for popular singer identification," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 5, pp. 1196–1205, Jul. 2011.
[27] S. Sofianos, A. Ariyaeeinia, R. Polfreman, and R. Sotudeh, "H-Semantics: A hybrid approach to singing voice separation," J. Audio Eng. Soc., vol. 60, no. 10, pp. 831–841, 2012.
[28] X. Zhao, Y. Shao, and D. Wang, "CASA-based robust speaker identification," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 5, pp. 1608–1616, Jul. 2012.
[29] M. Shashanka, B. Raj, and P. Smaragdis, "Probabilistic latent variable models as nonnegative factorizations," Comput. Intell. Neurosci., vol. 2008, pp. 1–8, 2008.
[30] A. Cichocki, R. Zdunek, and S. I. Amari, "New algorithms for non-negative matrix factorization in applications to blind source separation," in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., 2006, vol. 5, pp. 621–624.
[31] Y. Li, J. Woodruff, and D. L. Wang, "Monaural musical sound separation based on pitch and common amplitude modulation," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 7, pp. 1361–1371, Sep. 2009.
[32] E. Vincent, N. Bertin, and R. Badeau, "Adaptive harmonic spectral decomposition for multiple pitch estimation," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 528–537, Mar. 2010.
[33] D. L. Wang and G. J. Brown, Computational Auditory Scene Analysis: Principles, Algorithms and Applications. Hoboken, NJ, USA: Wiley-IEEE Press, 2006.
[34] J. Han and C. W. Chen, "Improving melody extraction using probabilistic latent component analysis," in Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., 2011, pp. 33–36.
[35] A. Lefevre, F. Bach, and C. Fevotte, "Semi-supervised NMF with time-frequency annotations for single-channel source separation," in Proc. Int. Symp. Music Inf. Retrieval (ISMIR), 2012.
[36] A. Klapuri, "A perceptually motivated multiple-F0 estimation method," in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust., 2005, pp. 291–294.
[37] G. Hu and D. L. Wang, "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Trans. Neural Netw., vol. 15, no. 5, pp. 1135–1150, Sep. 2004.

Ying Hu received the B.S. and M.S. degrees in electronics and information engineering from Xinjiang University, Urumqi, China, in 1997 and 2002, respectively. She is currently pursuing the Ph.D. degree at the School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an, China. She was a Lecturer with Xinjiang University. Her current research interests include content-based audio analysis and music information retrieval.
Guizhong Liu (M'05) received the B.S. and M.S. degrees in computational mathematics from Xi'an Jiaotong University, Xi'an, China, in 1982 and 1985, respectively, and the Ph.D. degree in mathematics and computing science from the Eindhoven University of Technology, Eindhoven, The Netherlands, in 1989. He is currently a Full Professor with the School of Electronics and Information Engineering, Xi'an Jiaotong University. His current research interests include non-stationary signal analysis and processing, image processing, audio and video compression, and inversion problems.