INTERSPEECH 2014, 14-18 September 2014, Singapore

Learning continuous-valued word representations for phrase break prediction

Anandaswarup Vadapalli, Kishore Prahallad
Speech and Vision Lab, IIIT Hyderabad
[email protected], [email protected]

Abstract

Phrase break prediction is the first step in modeling prosody for text-to-speech (TTS) systems. Traditional methods of phrase break prediction have used discrete linguistic representations (such as POS tags, induced POS tags, or word-terminal syllables) for modeling these breaks. However, these discrete representations suffer from a number of issues: the number of discrete classes must be fixed in advance, and such representations do not capture the co-occurrence statistics of words. As a result, the use of continuous-valued word representations has been proposed in the literature. In this paper, we propose a neural network dictionary learning architecture to induce task-specific continuous-valued word representations, and show that these task-specific features perform better at phrase break prediction than continuous features derived using Latent Semantic Analysis (LSA).

Index Terms: phrase break prediction, text-to-speech synthesis, embedded training of neural networks, continuous word representations

1. Introduction

In the context of TTS, the first step in modeling prosody is phrase break prediction. Phrase breaks are manifested in the speech signal through several acoustic cues, such as pauses and relative changes in the intonation and duration of syllables. In this paper, we restrict ourselves to pauses in speech, and limit our phrase break models to predicting the locations of pauses while synthesizing speech. For details, please refer to [1-4].

Typically, phrase break prediction has been achieved using machine learning models such as regression trees or HMMs in conjunction with data labeled with linguistic classes (part-of-speech (POS) tags, phrase structure, etc.) [5-11]. These methods assume the availability of labeled data, and thus cannot be used for languages where such resources are not readily available.
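To make this conventional supervised formulation concrete, the following is a minimal sketch (not from the paper; the toy junctures, POS tags, and feature names are illustrative assumptions) of phrase break prediction as binary classification over POS-tag context features, using a decision tree in the spirit of the regression-tree methods cited above:

# Minimal sketch: phrase break prediction as binary classification.
# Each word juncture is described by the POS tags of its neighboring
# words and labeled break (1) or no-break (0). Toy data, hypothetical tags.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

junctures = [
    ({"pos-1": "NN", "pos+1": "CC"}, 1),   # e.g. "... trees , and ..."
    ({"pos-1": "DT", "pos+1": "NN"}, 0),
    ({"pos-1": "VBD", "pos+1": "IN"}, 1),
    ({"pos-1": "JJ", "pos+1": "NN"}, 0),
]

vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _ in junctures])
y = [label for _, label in junctures]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Query a new juncture: does a break occur between these two words?
print(clf.predict(vec.transform({"pos-1": "NN", "pos+1": "CC"})))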

In view of the above limitations, considerable effort has been directed towards unsupervised methods of inducing word representations, which can be used as surrogates for POS tags or linguistic classes. Parlikar and Black [12] used the Ney-Essen clustering algorithm [13] to automatically induce POS tags; these induced tags are generated from text using frequency analysis of the words. In [14], a set of morpheme tag units was manually identified and used to model phrase breaks in Telugu TTS systems. In [3], the authors showed that word-terminal syllables (the last syllable of a word) in Indian languages discriminate between words based on syntactic function, and can be used for phrase break prediction in Indian language TTS systems.

All the methods mentioned so far use discrete linguistic representations of words (like POS tags or word-terminal syllables). Such a representation requires a hard classification of words into a set of discrete classes. This raises issues when there is ambiguity in the linguistic representation of a word. For example, the English word plant can be categorized as a noun or as a verb depending on the context in which it occurs. Moreover, such a representation does not take into account the distributional behavior of words. To address these issues, there have been efforts towards deriving or inducing continuous-dimensional representations of words, i.e., representing words as points in a continuous space, for various natural language processing (NLP) applications. These continuous representations have several advantages over conventional discrete word representations: they make no assumptions about the granularity of the discrete categories, and their use allows delaying a hard decision about assigning words to a particular category.

Schütze [15] described a method for representing words in a continuous-dimensional space using a technique derived from Latent Semantic Analysis (LSA). In this approach, words are mapped to points in a continuous space defined by the rows of a word co-occurrence matrix. A transformation is then applied to this matrix to project it into a lower-dimensional space (called the latent space). This lower-dimensional latent-space matrix captures the distributional behavior of the words in as few dimensions as possible. In [15], these word representations were then clustered to produce another discrete set (induced POS-like categories). However, this step is not necessary, and the continuous features can be used directly for further processing, as in [2, 4], where the authors omit the final quantization step and instead use the continuous features directly in a regression tree framework to predict phrase breaks.
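As an illustration, here is a minimal sketch of this LSA-style procedure on a toy corpus (the corpus, window size, and dimensionality are assumptions for illustration; the actual method, including the choice of feature words, is detailed in Section 2):

# Minimal sketch of LSA-style continuous word representations:
# build a word co-occurrence matrix, then project it into a
# lower-dimensional latent space with a truncated SVD.
import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = "the plant grows . they plant trees . the tree grows".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a +/-1 word window.
C = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            C[idx[w], idx[corpus[j]]] += 1

# Each row of the projected matrix is a continuous-valued
# representation of the corresponding word type.
svd = TruncatedSVD(n_components=3, random_state=0)
embeddings = svd.fit_transform(C)
print("plant ->", embeddings[idx["plant"]])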


In [16, 17], the authors describe a unified multitask architecture for NLP that learns features relevant to the tasks at hand. Their approach uses a deep convolutional neural network trained in an end-to-end fashion: the input sentence is processed through several layers of feature extraction, and these features are trained by backpropagation to be relevant to the task. They report results on six standard NLP tasks: POS tagging, chunking, named entity recognition, semantic role labeling, language modeling, and semantically related words. Following the work of Collobert et al. [17], we propose a neural network based dictionary learning architecture to induce task-specific word embeddings for phrase break prediction. Our proposed architecture combines feature induction and phrase break prediction in a single framework, and thus avoids the two-stage process required when using LSA-based features for the same task. We evaluate our approach on audiobook data, and compare the results obtained for phrase break prediction with those obtained using LSA-based word features.



2. Unsupervised word features using Latent Semantic Analysis

To derive unsupervised continuous-valued word representations using Latent Semantic Analysis (LSA), we follow the same method as in [4]. For the sake of completeness, we describe the method below. This method is based on the first POS induction method in [15] (POS induction based on word type), where a word type is characterized distributionally, i.e., in terms of the words with which it co-occurs in a body of text. The steps involved in generating this representation are:

1. Given a corpus of m unique word types, out of which a subset n is chosen as feature words (typically n

3. Task specific continuous word features using neural network dictionary learning

The LSA-based features described in Section 2 must subsequently be fed to a separate machine learning framework (like CART or ANN), resulting in a two-stage process. As a solution to the above mentioned issues, we propose a neural network dictionary learning architecture in which we induce task-specific word features and use them for phrase break prediction within the same framework, resulting in a one-stage process. We also do not impose any constraint on the context while inducing the features, resulting in a greater amount of task-specific information about word characteristics being captured. This approach is along similar lines to the work in [17]. A minimal sketch of this idea appears below.
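The following is a rough illustration of the one-stage idea in PyTorch (a modern stand-in; the paper predates it, and the layer sizes, fixed context window, and training details below are illustrative assumptions, not the paper's exact architecture). A word embedding table (the learned dictionary) and a break/no-break classifier are trained jointly by backpropagation, so the embeddings become task-specific:

# Minimal sketch: joint (one-stage) learning of a word embedding
# dictionary and a phrase break classifier. Sizes are illustrative.
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, CONTEXT = 5000, 50, 3   # words per input window

class BreakPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)   # the learned dictionary
        self.mlp = nn.Sequential(
            nn.Linear(CONTEXT * EMB_DIM, 64),
            nn.Tanh(),
            nn.Linear(64, 2),                # break / no-break logits
        )

    def forward(self, word_ids):             # word_ids: (batch, CONTEXT)
        e = self.embed(word_ids)             # (batch, CONTEXT, EMB_DIM)
        return self.mlp(e.flatten(1))        # (batch, 2)

model = BreakPredictor()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One toy training step: gradients flow into both the classifier and
# the embedding table, making the word features task-specific.
x = torch.randint(0, VOCAB_SIZE, (8, CONTEXT))   # random word-id windows
y = torch.randint(0, 2, (8,))                    # random break labels
loss = loss_fn(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()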