A Multi-Topic Meta-Classification Scheme for Analyzing Lobbying Disclosure Data

Xinpeng L. Liao, Chengcui Zhang
Dept. of Computer and Information Sciences
The University of Alabama at Birmingham
Birmingham, Alabama, USA
{xinpeng, czhang02}@uab.edu

Ariel D. Smith, Grant T. Savage
COLLAT School of Business
The University of Alabama at Birmingham
Birmingham, Alabama, USA
{arields, gsavage}@uab.edu
Abstract—For the functioning of American democracy, the Lobbying Disclosure Act (LDA) provides, for the very first time, data with which to empirically study interest groups' behaviors and their influence on congressional policymaking. One of the main research challenges is to automatically find the topic(s), via short and sparse text classification, in a large corpus of unorganized, semi-structured, and poorly connected lobbying filings, so as to reveal the underlying purpose(s) of these lobbying activities. Common techniques for alleviating data sparseness enrich the context of the data with external information. This paper, however, proposes an interdisciplinary yet practical solution to this problem: a Multi-Topic Meta-Classification (MTMC) scheme built upon a set of semantic attributes (i.e., General Issue, Specific Issue, and Bill Info.) and integrated with a domain-specific Policy Agenda (PA) coding/labeling procedure. First, multi-label base classifiers, each transformed into a multi-class classification problem, are learned from the three semantic sources mentioned above; second, to assess the reliability of each classification, one meta-classifier per attribute is trained on a meta-instance dataset labeled in a cross-validation fashion; third, the final prediction is made by fusing the reliable outputs of this ensemble of classifiers. Experiments demonstrated satisfactory classification performance, under various evaluation measures, on a real-world textual dataset that poses many challenges, including noisy data and semantic ambiguity.

Keywords—machine learning applications; multi-class & multi-label classification; meta-classifier; information fusion
I. INTRODUCTION
How the American public perceives the influence of interest groups on the decisions of the U.S. Congress and the Executive Branch is of great concern for the functioning of American democracy. Empirical studies that directly address this question using the lobbying disclosure databases of the Senate and the House of Representatives are curiously limited, owing to the lack of access to clean, organized, structured, and well-connected data. The Lobbying Disclosure Act (LDA) provides data with which to empirically study interest groups' behaviors and their influence on congressional policymaking. However, the LDA data is not a perfect measurement instrument, due both to the complexity of the lobbying process and to the fact that the House of Representatives and the Senate maintain separate lobbying databases; these issues present challenges that require state-of-the-art text mining techniques to develop a comprehensive model of lobbying and influence.
Figure 1. An example part of an LDA filing showing only the General Issue and the Specific Issue fields. Some bill numbers, e.g., HR. 5512, S.2745, HR.2419, and HR. 4238, are embedded in the text body of the Specific Issue.
Toward this end, one of the pivotal problems in this context is to automatically find the topic(s) in a large corpus of unorganized, semi-structured, and poorly connected filings so as to reveal the underlying purpose(s) of these lobbying activities, which can be formulated as a text classification problem. However, unlike a general text classification task, a lobbying disclosure form often consists of short text descriptions or a set of independent, free-form phrases, words, and abbreviations in a semi-structured format. Fig. 1 gives such an example of an LDA filing [1], in which several confined fields with short, poorly structured text descriptions, such as General Issue and Specific Issue, can be identified. Abbreviations (e.g., 'ENV' for 'Environment' and 'FOO' for 'Food') are commonly used in the General Issue field, and bill numbers often appear in the Specific Issue field. Previously, coders in this domain had to manually classify an LDA filing by assigning it one or more Policy Agenda (PA) codes based on the information contained in these fields, with the aid of the coding guidelines defined in the Topics Codebook [2] of the Policy Agendas (PA) Project [3]. This manual classification process is very time-consuming and error-prone, and incurs even longer processing time when cross-validation is required. Our task in this project can thus be formulated as a short and sparse text classification (SSTC) problem that is an integral part of a highly domain-specific taxonomic procedure, rather than one grounded in a general knowledge base. Data sparseness has been the utmost hurdle to measuring semantic similarity in the context of SSTC. Several studies [4-7] have attempted to expand the context of the data either by employing search engines (e.g.,
Google) or by utilizing online data repositories (e.g., Wikipedia) as external knowledge sources [8-9], also coined "universal datasets". Training, test, and future (short and sparse text) datasets can then either be enriched with external text documents crawled from search engines using seed words, or integrated with hidden topic distributions [10-11] after topic inference on Wikipedia. One limitation of these studies, however, is the requirement that the nature of the "universal datasets" (patterns, statistics, and co-occurrences) be consistent with the classification problem at hand. Were our classifier built on such an expanded context, it would risk being biased or overwhelmed by the rich information in the "universal datasets", because LDA filings are coded according to a very specific PA taxonomy comprising 19 major topics and 225 subtopics. To clarify the challenges we face in this project, one of the dominant rules quoted from the Topics Codebook can serve as a motivating example: "Observations are coded according to the single predominant, substantive policy area rather than the targets of particular policies or the policy instrument utilized" [2]. In particular, if a case discusses mental illness programs for returning veterans, it would be coded according to the predominant policy area (mental illness) rather than the target of the programs (veteran affairs). The distinction between the predominant policy area and the targets of the programs is ambiguous, partly because of the data sparseness in the LDA filings, and requires careful examination by domain experts (i.e., the coders in our case). Utilizing "universal datasets" may, unfavorably, bring forth more word occurrences related to veteran affairs than to the supposedly predominant policy area, mental illness, thus deviating the coding procedure from the Codebook guidelines and leading to less discriminative classifiers in this context.

In this paper, to simulate the PA coding procedure, we propose a Multi-Topic Meta-Classification (MTMC) framework based on an ensemble of base classifiers learned from a set of semantic sources, i.e., General Issue (GI), Specific Issue (SI), and Bill Info. (BI), where each multi-label base classifier is transformed into a multi-class classification problem. A direct combination of the labels predicted by the base classifiers often yields a much larger label cardinality than actually exists, which is undesirable in our case because we deal with a multi-label dataset in which single-label instances still dominate. To fuse the semantic sources using as much reliable information as possible, the ability to predict the reliability of each base classifier is crucial; this is where the meta-classifiers come into play. To prepare the training dataset for the meta-classifiers, meta-instances with a probability distribution (PD) representation are generated, from the original training dataset of each attribute, in a cross-validation fashion (a minimal sketch of this procedure is given below). Three meta-classifiers are therefore built, one for each base classifier. Only the labels predicted by base classifiers that have been marked as reliable participate in the subsequent label re-ranking process. Another contribution of ours is that we, for the first time, successfully achieved satisfactory PA code prediction on the LDA dataset, which poses many challenges, including noisy data and semantic ambiguity.
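The sketch below illustrates the per-attribute meta-instance generation just described; it is not the implementation evaluated in this paper. The TF-IDF features, the logistic-regression learners, and the reliability-labeling rule are illustrative assumptions.

```python
# Minimal sketch of per-attribute meta-instance generation; the TF-IDF
# features, logistic-regression learners, and the reliability rule are
# illustrative assumptions, not the exact implementation evaluated here.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def build_attribute_classifiers(texts, y, n_folds=5):
    """Train one base classifier for a semantic attribute (e.g., GI)
    plus a meta-classifier that predicts whether its output is reliable."""
    y = np.asarray(y)
    X = TfidfVectorizer().fit_transform(texts)
    base = LogisticRegression(max_iter=1000)

    # Out-of-fold class-probability distributions (PDs) serve as the
    # meta-instances, produced in a cross-validation fashion so that the
    # base classifier never scores its own training folds.
    pds = cross_val_predict(base, X, y, cv=n_folds, method="predict_proba")

    # Label a meta-instance "reliable" (1) when the base classifier's
    # top-ranked label matches the ground-truth PA code.
    classes = np.unique(y)  # column order of predict_proba
    reliable = (classes[pds.argmax(axis=1)] == y).astype(int)

    meta = LogisticRegression(max_iter=1000).fit(pds, reliable)
    return base.fit(X, y), meta
```

At prediction time, only those attributes whose meta-classifier marks the base output as reliable would contribute to the label re-ranking and fusion step.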
The proposed meta-classifier approach has been thoroughly evaluated using various assessment criteria. The rest of the paper is organized as follows. Section II gives the full specification of the problem for the base classifiers. Inspired by the PA coding procedure, Section III describes our multi-topic meta-classification scheme in detail. Experiments, discussion, and concluding remarks are presented in Sections IV, V, and VI, respectively.

II. PROBLEM SPECIFICATION
A. Multi-Class Classification on a Single-Label Dataset

One of the common supervised learning tasks is multi-class classification (MCC), which involves a set of class labels L, where |L| = N > 2. Each training sample belongs to exactly one of the N classes. The goal is to construct a function which, given a new data sample, correctly predicts the class to which the sample belongs. Suppose we knew the density p_i(x) of each of the N classes. Then we could predict using

$\hat{y} = \arg\max_{i=1,\dots,N} p_i(x)$  (1)
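As a toy illustration of Eq. (1), the snippet below estimates each class density with a Gaussian kernel density estimator and predicts by the arg max; the one-dimensional synthetic classes and the choice of KDE are assumptions made purely for illustration.

```python
# Toy illustration of Eq. (1): predict the class whose estimated density
# is highest at x.  The 1-D synthetic classes and the Gaussian KDE are
# assumptions for illustration only.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
samples = {i: rng.normal(loc=3 * i, scale=1.0, size=200) for i in range(3)}
densities = {i: gaussian_kde(s) for i, s in samples.items()}  # p_i(x)

def predict(x):
    # arg max over i of p_i(x)
    return max(densities, key=lambda i: float(densities[i](x)[0]))

print(predict(0.2), predict(6.1))  # expected: 0 2
```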
However, the density is typically not known a priori, and density estimation is usually difficult, especially for high-dimensional data. For binary classification, it has been observed that directly building a smooth separating function gives better results than density estimation. Inspired by this observation, two approaches are commonly used in this field: One-vs-All (OVA) and All-vs-All (AVA) classification. Given a good technique for building binary classifiers, OVA builds N different binary classifiers under the convention that, for the i-th classifier, the positive examples are all the samples in class i and the negative examples are all the samples not in class i. Letting f_i(x) be the i-th classifier, OVA classification can be formulated as

$\hat{y} = \arg\max_{i=1,\dots,N} f_i(x)$  (2)

In contrast, AVA builds N(N-1) binary classifiers, each distinguishing a pair of classes i and j (i ≠ j). Letting f_{ij} be the classifier whose positive examples come from class i and whose negative examples come from class j, AVA classification can be formulated as

$\hat{y} = \arg\max_{i=1,\dots,N} \sum_{j \ne i} f_{ij}(x)$  (3)
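Both schemes are available off the shelf; the brief sketch below, with a linear SVM base learner and synthetic data as assumed stand-ins, shows how Eqs. (2) and (3) are typically realized in practice.

```python
# Sketch of OVA (Eq. 2) and AVA (Eq. 3) via scikit-learn's wrappers;
# the LinearSVC base learner and the synthetic data are assumptions.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_classes=4, n_informative=8,
                           random_state=0)

ova = OneVsRestClassifier(LinearSVC()).fit(X, y)  # N binary classifiers
ava = OneVsOneClassifier(LinearSVC()).fit(X, y)   # one per pair of classes

print(ova.predict(X[:5]), ava.predict(X[:5]))
```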
The choice between OVA and AVA is largely driven by computational complexity. Intuitively, AVA seems faster and more memory-efficient: although it requires O(N²) classifiers instead of O(N), each classifier is, on average, trained on far fewer samples than an OVA classifier and is thus faster to train. For example, if the time to build a classifier is superlinear in the number of training samples (e.g., Support Vector Machines with a non-linear kernel), AVA is the better choice. In the linear and sublinear cases, however, AVA can be worse than OVA, because the number of classifiers to be trained (O(N²) for AVA vs. O(N) for OVA) then largely determines the overall cost.
In addition, two other advanced approaches extend regularization ideas to multi-class classification, namely the "Single Machine" approaches and the "Error-Correcting Code" approaches. However, empirical evidence has suggested that the simple OVA and AVA schemes are hard to beat [12].

B. Multi-Label Classification on a Multi-Label Dataset

There are many scenarios in which there are multiple categories to which data points may belong, and a given data point can belong to several categories simultaneously. More formally, multi-label classification (MLC) is concerned with learning a model that outputs a bipartition of the set of labels L into relevant and irrelevant labels with respect to a query instance. A related task on a multi-label dataset is label ranking (LR), which is concerned with learning a model that outputs an ordering of the class labels according to their relevance to a query instance. These two tasks are not mutually exclusive, in the sense that a bipartition of the label set can be obtained by applying a relevance threshold to the label-ranking list, thereby yielding both an ordering and a bipartition of the set of labels. This joint task has recently been coined multi-label ranking (MLR). Two groups of approaches, problem transformation and algorithm adaptation, have been proposed in the literature for classifying multi-label datasets [13-14] (a sketch of the former is given at the end of this section). The first group of methods operates only on the label space and trivially decomposes the learning task into one or more unlinked single-label classification tasks, which can be solved naturally with binary or multi-class classification. The second group extends specific learning algorithms to handle multi-label datasets directly. A detailed review of these methods is beyond the scope of this paper.

C. Our Classification Problem

Given a Lobbying Disclosure Act (LDA) filing as a query instance, our goal is to classify it into the 19 major topics, or PA codes in this context, without considering the 225 subtopics under the major topics. Learning to classify into a hierarchical label space is unnecessary because the subtopics only help the coders quickly identify the related policy area and are of little interest to the parties (e.g., political scientists, lobbyists) concerned with the outcome of this project. Strictly speaking, the problem at hand is a multi-label classification task that has a low average label-cardinality (
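As referenced above, the following is a minimal sketch of the problem-transformation route to multi-label ranking: per-label probabilities induce a label ranking, and a relevance threshold converts that ranking into a bipartition. The synthetic data and the 0.5 threshold are illustrative assumptions.

```python
# Minimal sketch of problem transformation (binary relevance) for
# multi-label ranking; synthetic data and the 0.5 threshold are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = (X[:, :3] > 0).astype(int)       # 3 labels; instances may carry several

# One independent binary classifier per label
clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)

# P(label relevant | x) per label -> label ranking (LR)
probs = np.array([e.predict_proba(X[:1])[0, 1] for e in clf.estimators_])
ranking = np.argsort(-probs)         # ordering of the label set
bipartition = probs >= 0.5           # MLC bipartition via the threshold
print(ranking, bipartition)
```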