Fast Basis Selection and Instantaneous Frequency Tracking for Audio Signal Analysis and Synthesis

Huimin Chen, Dimitrios Charalampidis

Department of Electrical Engineering, University of New Orleans, New Orleans, LA 70148, USA
Email: {hchen2,dcharala}@uno.edu

Abstract: This paper focuses on finding the best basis for the synthesis of audio signals and for tracking slowly or rapidly changing instantaneous frequencies. We propose a penalty-based basis selection scheme that allows the random sampling strategy of compressed sensing to reduce the sample size for sparse signals. A greedy heuristic scheme based on iterative refinement is used for signal reconstruction. Numerical experiments on synthetic and real sounds show the effectiveness of our scheme, which tracks instantaneous frequencies with high accuracy using a small number of selected basis functions.

I. INTRODUCTION

Instantaneous frequency (IF) estimation is a useful tool in a wide range of signal processing applications. Of particular interest is the analysis and synthesis of speech and musical signals by accurately estimating the IF trajectories of both monocomponent and multicomponent signals. High-resolution IF tracking finds applications in music transcription and computer composition [2], [6], [8], as well as in speech analysis [3], [12] for the purposes of speech coding and speech recognition.

In speech and music analysis, specialized techniques have been developed to accommodate the specific needs of the application at hand. For instance, a speech signal can be characterized by intervals of voiced or mixed speech having some quasi-periodic properties, unvoiced speech having a non-periodic nature, and transitional speech, where finding an accurate segmentation and an appropriate model for each speech type has presented several challenges. Speech classification or an equivalent processing step is often required prior to IF estimation. Assuming that the voiced speech segments have been identified, an IF estimation method or, in other words, a pitch estimation method is often necessary. Accurate estimation of IF trajectories is crucial for speech synthesis and recognition. In general, since the IF trajectories of speech signals vary relatively slowly, IF estimation is commonly performed through some form of autocorrelation-based technique. In the case of multicomponent speech analysis, decomposition using a filter bank, such as the one based on the gammatone function [10], has been applied to obtain a number of channels which are then analyzed separately. Although the fundamental frequency of voiced speech segments is usually found within a relatively limited band, any frequency within that band can be a possible fundamental frequency.

1-4244-0921-7/07 $25.00 © 2007 IEEE.

In contrast, music signals are usually multicomponent, with each IF consisting of a sequence of quasi-periodic or silent segments. One exception is breathy instruments, which may be of a mixed nature. Prior knowledge about the instrument or the type of music under consideration may be helpful in IF estimation. For instance, most Western instruments produce a discrete set of tones and follow a commonly used scale within an octave. Therefore, the IF trajectories can be modeled as a sequence of specific, discrete tones. On the other hand, many non-Western instruments, as well as some Western instruments such as the violin, can produce any IF within their frequency range. Nevertheless, certain types of music, such as Byzantine and other Eastern music, use a larger number of discrete tone levels, including intervals of 3/4 or 1/9 of a tone [9].

Previous works on general IF tracking mainly focus on analysis in the time-frequency space. Along those lines, the Chirplet transform (CT) and its modifications have been used to achieve better resolution in discriminating multiple closely spaced IF trajectories [1]. However, only a few works look into the selection of an appropriate basis for accurate IF tracking and signal reconstruction. In particular, speech and music signals, due to their inherent characteristics, can greatly benefit from appropriate basis selection.

This paper deals with finding the best basis from a large dictionary with possibly redundant atoms. We propose a penalty-based basis selection scheme that allows the random sampling strategy of compressed sensing to reduce the sample size for sparse signals. A greedy heuristic algorithm based on iterative refinement is used for signal reconstruction. Numerical experiments on synthetic and real music signals show the effectiveness of our scheme, which tracks instantaneous frequencies with high accuracy using a small number of selected basis functions.

II. PROBLEM FORMULATION

We assume that the speech or musical signal can be efficiently expanded by a few basis functions of unknown type. The energy of each basis function has to be localized in a small area of the time-frequency plane. This also allows a compressed representation of the signal via random sampling [5]. More specifically, as described in the Introduction and later in Section IV, multicomponent speech and musical signals can be modeled as the weighted sum of a set of warped


sinusoidal functions. Each sinusoidal function corresponds to either the fundamental frequency of a single component or one of its harmonics. Therefore, once a specialized dictionary is appropriately chosen, the IF estimate of a signal component can be obtained from the best basis for audio signal analysis and synthesis. Next, we formulate the basis selection and signal reconstruction problem in a form that is applicable to any kind of signal. The dictionary selection described in Section IV then fits our scheme to the IF estimation problem for speech and music signals.

A. Penalty Based Basis Selection

The following notation is used throughout the paper. A signal $f$ is a discrete-time function with finite $l_p$ norm defined by $\|f\|_{l_p} = \sum_{i=1}^{N} |f[i]|^p$. The $l_0$ norm is defined as the number of nonzero elements in $f$. A dictionary $D_\Lambda = \{B_\lambda\}_{\lambda \in \Lambda}$ is a set of bases $B_\lambda = \{\phi_m(\lambda)\}_{m \in \mathbb{Z}}$, where $\phi_m(\lambda) \in \mathbb{R}^N$. A penalty $p(\lambda)$ is defined as a complexity measure associated with each basis $B_\lambda$. The penalty can be interpreted as the number of bits needed to specify a basis [4]. For a given signal $f$, the best basis minimizes the penalized total of the expansion coefficients, given by

$$\lambda^* = \arg\min_{\lambda \in \Lambda} \left( \sum_{i=1}^{N} \left| \phi_i(\lambda)^T f \right| + C_0\, p(\lambda) \right) \tag{1}$$

where $C_0$ is a design parameter depending on the noise level. Note that the best basis provides the sparsest description of $f$ as measured by the $l_1$ norm of its expansion coefficients. If the dictionary is large enough, then most speech and musical signals can be approximated with a small number of basis functions. The choice of the dictionary $D_\Lambda$ should be appropriate for audio signals, as will be discussed in Section IV.
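The selection rule in (1) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration only: the two-element dictionary (identity and DCT-II bases), the equal penalty values, the value of C0, and the function names are hypothetical and are not the dictionary or penalties used in the paper.

```python
import numpy as np

def dct_basis(n):
    """Orthonormal DCT-II basis; each row is one basis vector phi_i."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    basis[0] /= np.sqrt(2.0)  # rescale the constant row to unit norm
    return basis

def penalized_cost(basis, f, penalty, c0):
    """Cost in (1): l1 norm of the coefficients phi_i^T f plus C0 * p(lambda)."""
    coeffs = basis @ f
    return np.sum(np.abs(coeffs)) + c0 * penalty

def best_basis(dictionary, f, c0):
    """Exhaustively evaluate (1) over a small dictionary and return the minimizer."""
    costs = {name: penalized_cost(b, f, p, c0)
             for name, (b, p) in dictionary.items()}
    return min(costs, key=costs.get), costs

n = 64
# Hypothetical dictionary: name -> (basis matrix, penalty p(lambda)).
dictionary = {
    "identity": (np.eye(n), 1.0),
    "dct": (dct_basis(n), 1.0),
}
f = 2.0 * dct_basis(n)[5]  # test signal: a single DCT atom scaled by 2
best, costs = best_basis(dictionary, f, c0=0.1)
print(best, costs)
```

Because the test signal is a single DCT atom, its coefficient vector in the DCT basis has one nonzero entry, so that basis attains the smaller penalized cost. An exhaustive search like this is only practical for a very small dictionary.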

where the original $l_0$-minimization problem is relaxed to an $l_1$-minimization problem and $C_1$ is a design parameter that controls the sparsity of the solution. It was also shown in [5] that, in the noiseless case, the $l_1$-minimization problem has the same optimal solution as the original $l_0$-minimization problem, which is NP-hard.

C. Adapting to Unknown Sparsity

Consider a sampled vector $y$ which is a compressed version of the original signal $f$ under additive noise. We want to find the best basis, that is, the one in which $f$ has a sparse representation. The problem can be formulated as follows.

$(f^*, \lambda^*) = \arg\min \arg\min_{f \in \mathbb{R}^N} ( \|y - \Psi f\|_{l_2} +$

III. GREEDY HEURISTIC ALGORITHM FOR BEST BASIS SELECTION AND SIGNAL RECOVERY

Searching the whole dictionary for the optimal solution to the problem given by (4) is not feasible for an arbitrary dictionary of large size. Instead, we propose a greedy heuristic algorithm in line with basis pursuit [7] and orthogonal matching pursuit [11], which find the optimal solution in a structured basis set. The algorithm iteratively refines the signal estimate and updates the basis selection by the following steps.

1) Initialization: Set $k = 0$, $f_0 = 0$ and choose $\lambda_0 \in \Lambda$ using some prior knowledge of the signal.
2) Refining the estimate: Use Newton-Raphson update

In practice, a sparse signal can be highly compressed without knowing the best basis. For instance, if the signal $f$ is sketched by a sampling matrix $\Psi = [\psi_1 \ldots \psi_n]^T$ with a sampled vector $y = \Psi f$, then we can have n
