Automated Coding of Open-ended Surveys: Technical and Ethical Issues

Daniela Giorgetti¹, Irina Prodanof¹, Fabrizio Sebastiani²

¹ Istituto di Linguistica Computazionale, CNR, Pisa, Italia
² Istituto di Scienza e Tecnologie dell'Informazione, CNR, Pisa, Italia
{daniela.giorgetti, irina.prodanof}@ilc.cnr.it, [email protected]
Abstract. This paper presents some technical and ethical issues arising from the use of automated methods to solve a typical social science problem: the coding of surveys that include answers to open-ended questions. Coding an open-ended survey, which may include thousands of interviews, means assigning predefined symbolic labels to its answers according to their meaning. The increasing amount of information available from surveys, also carried out on the Web, makes (semi)automated systems a viable option, both to reduce the time and human resources needed to analyze and manage the data and to produce results that do not depend on coders' subjective impressions; on the other hand, it poses both technical and ethical challenges that must be carefully evaluated before such systems are adopted.
1 Introduction
Surveys are a well-established means in the social sciences for assessing customer satisfaction, patient satisfaction, political and social opinions, lifestyle and health habits, brand and image fidelity, and so forth. Surveys are often used by national institutions across the world to produce statistics about large portions of the population. There are two types of questions that can be asked in surveys, whether written or spoken: open-ended and closed-ended. Closed-ended questions require the surveyed person to select one (or possibly more) answers from a predefined group of answers, while open-ended questions allow surveyed persons to answer freely, without selecting from a predetermined list of responses. The advantages of open-ended questions with respect to closed-ended ones are that they allow more flexibility in the answer (they are less restrictive) and they do not tend to inhibit communication; the drawbacks are that they can collect unneeded or messy information and they require more time to analyze. The choice between the two kinds of question, however, depends on the particular application, its scope and aim, and the funds available. In this paper we deal only with open-ended questionnaires and the issues related to their analysis, leaving out the issues pertaining to their design.

In order to explain the issues linked to the automatic coding of open-ended surveys, we present a case study whose data come from the National Opinion Research Center (NORC) General Social Survey (GSS), which has been administered since 1972 and investigates intergroup relations and cultural pluralism (Mellon Foundation, Carnegie Corporation, and American Jewish Committee), users of the Internet and how they use it (National Science Foundation), assessments of external and internal security threats and the balancing of security and civil liberties (Office of Naval Research), how people assess their physical and mental health (National Institutes of Health), sexual behavior and drug use (Centers
for Disease Control and Prevention), freedom (Smith-Richardson), religious congregations (Lilly Endowment Fund), evaluating the functions of the local church (Andrew Greeley), and religious identification (a consortium of sociologists of religion). Another kind of data we deal with comes from a survey aimed at assessing the competences of job seekers according to a behaviorist method developed in the 1970s at Harvard University by David McClelland. These latter data raise some interesting ethical problems related to their automatic coding, which are discussed in Section 5. We believe that the issues we met in our case studies are largely generalizable to most open-ended surveys, whatever their subject. In Section 2 we explain what the task of automated coding consists of and review some related work, in Section 3 we introduce some basic notions of text categorization, in Section 4 we show an automated solution to the task that relies on machine learning techniques for text categorization, in Section 5 we outline the possible technical and ethical issues of an automated approach to coding, and in Section 6 we give our final remarks.
2 Automated survey coding: definition and related work
Surveys are currently a common means for gathering and analyzing public opinion about business, political, and social issues, and they are usually sponsored by government agencies, academic institutions, and business organizations. The diffusion of the World Wide Web to more and more users has made it possible to administer surveys also via the Web or e-mail, thus reducing their costs and increasing the quantity of survey data available. Surveys consist of a questionnaire whose questions may be open-ended or closed-ended: the former require the respondents to answer in their own words, leading to unstructured data, while the latter require them to choose their answer from a list of possible predefined answers, leading to structured data. The coding of closed-ended surveys follows straightforwardly from the association between the chosen answer and a unique symbolic label (aka code) previously associated with it. The coding of open-ended surveys instead requires an interpretation of the answer, in order to associate it with (usually) one code from a previously defined code set (supervised coding) or in order to group together the most similar answers without a predefined set of codes (unsupervised coding). In both cases coding makes it possible to identify and group common ideas and trends across different respondents, thus finding common semantic patterns in free-form text. We restrict our analysis to the case of written (either originally in written form or transcribed from speech) open-ended surveys analyzed by supervised coding.

Hand coding of open-ended surveys is a well-known problem in the social sciences, but up to now there have been only a few attempts at completely automating the task [16, 10, 7]. In fact, manual coding is hardly standardizable and inherently subjective, as different coders may assign different codes to the same text data; it is expensive, as it requires specialized professionals to analyze the text data; and it is time consuming. These are also the reasons why closed-ended questionnaires are often used instead, which, though less flexible, can be coded unambiguously and automatically. Most of the efforts so far have been geared towards computer-aided content analysis systems (usually not specifically tailored for survey analysis), which can facilitate but not substitute the work of the experts [1].
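As a minimal illustration of what supervised coding produces, consider the following sketch (the code frame and responses are invented for illustration and are not taken from any of the surveys discussed here): each free-text answer ends up paired with exactly one symbolic code from a predefined code frame.

# A hypothetical code frame for an open-ended question such as
# "Who were you angry at?"; codes and responses are invented for illustration.
CODE_FRAME = {
    "FAM": "anger directed at a family member",
    "WRK": "anger directed at someone at work",
    "GOV": "anger directed at government or institutions",
}

# Supervised coding maps each free-text response to exactly one code
# from the predefined frame (single-label coding).
coded_responses = [
    ("My brother kept borrowing money and never paid it back", "FAM"),
    ("My supervisor blamed me for a mistake she made", "WRK"),
]

for text, code in coded_responses:
    print(f"[{code}] {CODE_FRAME[code]}: {text!r}")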
Viechnicki [16] presents two different approaches to automated survey coding. In the first, each code corresponds to a different category represented by user-selected keywords linked by Boolean operators, and an answer is assigned a given code if its Boolean representation matches the Boolean representation of the category associated with that code. The second approach instead represents both the answer and the different coding categories through weighted vectors, and the answer is assigned to the most similar category vector (i.e., the closest vector according to the cosine formula). Macchia and Murgia [10] present an automated approach using software developed by Statistics Canada, which follows some North American Census Bureau specifications: the answer is assigned a unique code according to an exact match with phrases in a previously defined category dictionary associated with the code, or a "best code" according to a partial match based on words appearing both in the answer and in the definition of the category associated with the code as it occurs in the dictionary. The main drawback of these approaches, which have been tested on NORC data and on ISTAT (Italian National Institute of Statistics) data respectively, is that they need a dictionary for defining and representing the categories associated with the different codes, which has to be developed by hand before the coding process actually begins. A different approach, based on data mining techniques, is described by Li and Yamanishi of NEC Corporation [9]; it aims at solving two tasks: extracting common ideas from open answers about brand or company images (e.g., image characteristics for different car brands) and finding the relationships between the different categories (e.g., car brands) and the keywords used to characterize them. The adopted approach employs rule analysis to define classification rules for the first task and correspondence analysis to define association rules for the second task. This approach is similar to ours in that it is dictionary-free, but it employs different techniques coming from a text data mining background, and the authors do not mention coding, although the first task they describe is very close to the coding task.
3 Text categorization basics
Before we explain how we have redefined the task of automated survey coding as a text categorization application, we introduce some basic concepts of text categorization, referring to [14] for a broad and updated overview of the field (the notation we use here is borrowed from [14]). Text categorization (also called text classification) aims at approximating the unknown target function Φ : D × C → {T, F}, which describes how documents should be classified, by means of a function Φ̂ : D × C → {T, F} called the classifier, where C = {c1, ..., c|C|} is a predefined set of categories and D is a domain of documents. If Φ(dj, ci) = T, then dj is called a positive example of ci, while if Φ(dj, ci) = F it is called a negative example of ci. A classifier for category ci is then a function Φ̂i : D → {T, F} that approximates the unknown target function Φi : D → {T, F}. The categorization task usually has to be performed relying just on the content of the documents, without additional knowledge coming for example from metadata (e.g., publication date, document type, publication source), and the categories do not carry any intrinsic semantics: they are just symbolic code labels. Text categorization may be either single-label, if exactly one ci ∈ C must be assigned to each dj ∈ D, or multi-label, if any number 0 ≤ n ≤ |C| of categories may be assigned to a document dj ∈ D. Moreover, text categorization may be binary, when the classifier decides between ci and its complement c̄i, or multi-class, when the classifier decides between ci and ck, where k ∈ {1, ..., |C|} and k ≠ i. A text categorization system usually performs three fundamental steps: document indexing, classifier learning, and classifier evaluation.
3.1 Document indexing
Document indexing is the step needed to represent the contents of a text document so that it can be directly interpreted (i) by a classifier-learning algorithm and (ii) by a classifier, once it has been built. A text dj is typically represented as a vector of term weights, and document indexing techniques differ in the definition of what a term is and in the method used to compute term weights. Terms, also called features, are usually the words appearing in the document, possibly excluding stop words (function words such as prepositions and articles) and sometimes reduced to word stems, i.e., the morphological roots of words. Sometimes phrases, i.e., groups of semantically related words, are also added to the term dictionary. The weighting algorithms provide a measure of the importance of a term in a document, usually relying on statistical or probabilistic techniques. Feature reduction algorithms are often used to lower the computational cost of the learning algorithms and also to reduce overfitting, i.e., the excessive adaptation of the learnt classifier to the examples it is built on.
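The following is a minimal sketch of the kind of indexing just described, using only the Python standard library; the stop-word list and the crude suffix-stripping rule are placeholders for the real linguistic resources an actual indexer would use.

import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "at", "of", "to", "and", "i", "was", "my"}  # toy list

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words, strip a few suffixes."""
    tokens = []
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'")
        if not word or word in STOP_WORDS:
            continue
        for suffix in ("ing", "ed", "s"):  # placeholder for a real stemmer
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens

def tfidf_vectors(documents):
    """Represent each document as a dict of term -> normalized tf*idf weight."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(documents)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0  # cosine normalization
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

answers = ["I was angry at my boss", "Angry at the government and its policies"]
for vector in tfidf_vectors(answers):
    print(vector)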
3.2 Classifier learning
A text classifier for the category ci is built automatically by a learning algorithm based on a set of documents pre-classified under ci or its complement c̄i (binary case), or under ci or ck, where k ∈ {1, ..., |C|} and k ≠ i (multi-class case). In order to build classifiers for C, a corpus Ω of documents such that the value of Φ(dj, ci) is known for every ⟨dj, ci⟩ ∈ Ω × C is needed. The corpus Ω is usually partitioned into three disjoint subsets: Tr, the training set, Va, the validation set, and Te, the test set. The training set is used by the learning algorithm to build the classifier, the validation set is used for tuning the classifier parameters, and the test set is used to evaluate the effectiveness (see Section 3.3) of the built classifier by comparing its output to the pre-assigned categories. Different learners have been applied in the text categorization literature, including probabilistic methods, regression methods, decision tree and decision rule learners, neural networks, batch and incremental learners of linear classifiers, example-based methods, support vector machines, genetic algorithms, hidden Markov models, and classifier committees. We only briefly sketch the two of these methods that we applied in our experiments.
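A small sketch of the corpus partitioning just described; the 80/10/10 proportions and the random shuffling are arbitrary illustrative choices, not the split actually used in the experiments reported below.

import random

def split_corpus(labelled_docs, train_frac=0.8, val_frac=0.1, seed=0):
    """Partition a list of (document, code) pairs into Tr, Va and Te."""
    docs = list(labelled_docs)
    random.Random(seed).shuffle(docs)
    n_train = int(len(docs) * train_frac)
    n_val = int(len(docs) * val_frac)
    training = docs[:n_train]                   # used to build the classifier
    validation = docs[n_train:n_train + n_val]  # used to tune its parameters
    test = docs[n_train + n_val:]               # used only for the final evaluation
    return training, validation, test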
3.3 Classifier evaluation
The effectiveness of a classifier is defined as the average correctness of Φ̂i's classification behaviour. In binary single-label text categorization tasks, effectiveness with respect to a category ci is often measured by a combination of precision πi, i.e., the percentage of documents classified under ci that actually belong to it, and recall ρi, i.e., the percentage of documents belonging to ci that are actually classified under it. Precision and recall are not independent, and very often combinations of the two are used to evaluate effectiveness. Another popular effectiveness measure, also used for multi-class text categorization tasks, is accuracy, defined as the ratio between the number of correct decisions and the total number of decisions.
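The measures just defined can be computed directly from the classifier's decisions; the sketch below assumes a single-label setting in which each document receives exactly one code (the example labels are invented).

def evaluate(gold, predicted, category):
    """Precision and recall for one category, plus overall accuracy,
    computed from parallel lists of gold-standard and predicted codes."""
    tp = sum(1 for g, p in zip(gold, predicted) if p == category and g == category)
    fp = sum(1 for g, p in zip(gold, predicted) if p == category and g != category)
    fn = sum(1 for g, p in zip(gold, predicted) if p != category and g == category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
    return precision, recall, accuracy

gold = ["FAM", "WRK", "FAM", "GOV"]
pred = ["FAM", "FAM", "FAM", "GOV"]
print(evaluate(gold, pred, "FAM"))  # (0.666..., 1.0, 0.75)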
4 Text categorization approach to automated survey coding
As already pointed out, supervised coding of an answer means finding its meaningful part and attaching to it a code label from a predefined set of code labels. Restating the aim of the process in terms of text categorization, we can say that supervised coding looks, for each answer (or part of it), for the category it belongs to within a predefined set of categories. The corpus Ω of documents is thus represented by the collection of responses to an open-ended question, and the domain C of categories is represented by the predefined codes that can be assigned to the responses. Codes may be arranged hierarchically, and a possible approach to hierarchical text categorization of survey data has been proposed in [3]. Here we concisely describe only the experiments run on data from NORC, as the data from the Web competence survey are not yet available in quantities sufficient for significant experiments. For a more technical and detailed description of these experiments we point the interested reader to [4]. We used the same datasets as in [16] in order to compare our results to some known results. Our three datasets comprise responses to three different open-ended questions in the field of mental health:

– the first dataset, called angry at, investigates who the respondent is angry at (e.g., family, work, government), and includes 1370 responses and 7 categories;
– the second dataset, called angry why, investigates the reasons for anger (e.g., because of something the respondent did, because someone demanded too much from the respondent, because someone was critical, insulting or disrespectful), and includes 460 responses and 6 categories;
– the third dataset, called brkdhlp, investigates the source of help in recovering from a nervous breakdown (e.g., family, friends, group therapy, psychiatrist), and includes 367 answers and 7 categories.

The two learning techniques we applied to automatically classify the responses are Naïve Bayes and SVM (Support Vector Machines), as implemented in [11] and [6] respectively. In Naïve Bayes techniques the data are assumed to be generated by a parametric model, whose parameters are induced through the observation of examples. When new examples are presented to the induced parametric model, its output is the probability that a given category generated the example in input. There are many simplifying hypotheses underlying these techniques (hence the adjective naïve), but they have nonetheless been used effectively in many cases [12]. SVM techniques, instead, behave very well in binary text categorization: they learn from examples how to separate, using hyperplanes, the positive examples of a given category from the negative examples, maximizing the margin between the hyperplane and the training examples. The indexing method we used is normalized tf*idf, where the weight of a word in a given document is proportional to its importance to the document itself, measured by its frequency, and inversely proportional to its frequency of appearance in the rest of the documents belonging to the training set. No feature selection was performed, though stop words were removed and words were stemmed. We randomly split each dataset into a training and a test set; comparative results on the test sets obtained with the two methods are reported in Table 1. The two learning methods achieve similar accuracy results, although in the literature SVM methods are recognized as better performers than Bayesian methods.
The reasons for this may be that we did not tune the SVM parameters to the best possible configuration, that the multi-class SVM implementation we used may be suboptimal [5], and finally that the nature of the data is quite different from that of the typical benchmark data used for comparisons in text categorization (e.g., a subset of Reuters agency news).
Dataset     Vector [16]  Boolean [16]  Naïve Bayes  SVM
angry at    0.451        0.464         0.714        0.693
angry why   0.210        0.271         0.389        0.397
brkdhlp     0.646        0.747         0.653        0.643

Table 1. Comparative accuracy results obtained on the angry at, angry why, and brkdhlp datasets using a boolean and a vector-based method (results reported from [16]), and using a naïve Bayes and a support vector machine text categorization method.
Notably, the best results are obtained on the most populated dataset, angry at, confirming that the more examples the classifier is trained on, the better the results that may be achieved. The reasons for the rather poor accuracy obtained on the angry why dataset are already discussed in [16], which points out that this set of answers lacks the homogeneity of language found in the angry at and brkdhlp datasets, where, as a matter of fact, results are better.
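For concreteness, the overall pipeline described in this section can be sketched as follows. The experiments above used the Bow toolkit [11] and the multi-class SVM implementation of [6]; here scikit-learn is used merely as a convenient stand-in, the input file name and its column names are assumptions, and stemming is omitted for brevity.

# Sketch of the pipeline: normalized tf*idf indexing with stop-word removal,
# a random train/test split, and Naive Bayes vs. linear SVM compared on accuracy.
# scikit-learn stands in for the toolkits actually used in the experiments.
import csv

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# "responses.csv" with columns "answer" and "code" is a hypothetical input file.
with open("responses.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
texts = [r["answer"] for r in rows]
codes = [r["code"] for r in rows]

X_train, X_test, y_train, y_test = train_test_split(
    texts, codes, test_size=0.3, random_state=0)

for name, learner in [("Naive Bayes", MultinomialNB()), ("SVM", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(stop_words="english"), learner)
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))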
5 Technical and ethical issues of automated coding
Using automated survey coding systems is convenient for lowering costs and speeding up processing when the survey size ranges from medium to large, and also when the survey is administered with minor changes from year to year. In fact, automated survey coding requires a certain amount of data to be coded manually in order to build the training set, and if the survey is not large enough, preparing the training set could coincide with coding the whole survey data.

Currently, the technical limitations of software systems for the automated coding of surveys are evident. Using only words (regardless of their order and dependence relationships) as indexing terms in a text categorization context may lead to clear errors, such as classifying a sentence like "I was mad at the man who insulted my wife" under the category family, since the induced classifier tends to associate the word "wife" with the category which denotes anger towards family members. Alternative approaches to the simple bag of words have been investigated [8, 13], but it is still debated whether using phrases (either semantically or statistically meaningful) is really advantageous. Yet other, more sophisticated indexing approaches to be investigated are those which take into account the semantic relations between the words in a sentence, such as the one proposed in a text clustering context [2], where the Universal Networking Language [15] is used to represent the meaning of sentences. Other common problems derive from the use of an informal language, close (if not directly transcribed) to the spoken language, and including ill-formed words and sentences. Apart from the problems arising even in the literal interpretation of texts, non-manifest elements of discourse such as nuance, irony, or humour cannot yet be captured by any software.

Census surveys, e.g., about occupation or industry data, or more generally surveys used just for statistical purposes, are difficult but less complex to code automatically than open-ended surveys used to make marketing decisions or even more delicate decisions, such as assessing surveyed people's profiles in order to decide, e.g., whether they should get a job or obtain a loan. In the latter cases not only are the technical limitations severe, due to the greater extension and nuance of the linguistic domain (more likely use of metaphor, irony and so forth), but the ethical implications (already present in non-automated survey coding) are also amplified by the use of a computer algorithm to take decisions that have to be reliable. For example, in one of our
paradigmatic case studies the aim of the survey is to assess job competencies such as leadership, teamwork, analytical attitude, and so forth. Once these competencies have been automatically assessed, the results can be used to discriminate between top job performers and average ones, in order to decide who should be hired. Though this assessment task is inherently subjective, a human coder can give a sensible reason for her choice, while an automated classifier may make errors that are hard to justify. This kind of problem makes interactive approaches to coding probably the best choice at the moment: automated decisions that are controversial (according to some user-set threshold parameters) are referred to a human expert coder, who has the last word on their acceptance or rejection.
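A minimal sketch of such an interactive scheme, assuming a classifier that exposes per-category probabilities (as a Naive Bayes model does); the threshold value and the routing function are illustrative assumptions, not part of any system described here.

def route_predictions(model, answers, threshold=0.8):
    """Accept confident automated codes, defer controversial ones to a human coder.

    `model` is assumed to expose predict_proba() and classes_, as e.g. a fitted
    scikit-learn Naive Bayes classifier does; `threshold` is a user-set parameter.
    """
    accepted, deferred = [], []
    for answer, probs in zip(answers, model.predict_proba(answers)):
        best = probs.argmax()
        code = model.classes_[best]
        if probs[best] >= threshold:
            accepted.append((answer, code))               # taken as final
        else:
            deferred.append((answer, code, probs[best]))  # sent for human review
    return accepted, deferred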
6 Conclusion
We have shown how the task of automated survey coding may be viewed as a text categorization task solved by machine learning techniques, in which an inductive process builds a classifier from hand-classified examples, i.e., answers to open-ended questions that have been coded by hand and are hence considered "correct". The effectiveness of the statistical classifier is evaluated by comparing its classification decisions on new data, not used in the training phase, with the classification decisions of human coders. The first accuracy results obtained in our experiments on social science data encourage further exploration of text categorization techniques for this particular kind of task, and we are confident that there are ample margins for improvement. Before adopting these automated techniques in real-world applications, however, the "augmented" ethical issues related to reliability have to be considered alongside the strictly technical ones. The most viable solution currently seems to be an interactive approach in which texts that are controversial from a coding point of view are identified and undergo some human supervision.
References

1. Melina Alexa and Cornelia Züll. Text analysis software: Commonalities, differences and limitations. The results of a review. Quality and Quantity, 34:299–321, 2000.
2. Bhoopesh Choudhary and P. Bhattacharyya. Text clustering using semantics. In Proceedings of WWW2002, 11th International World Wide Web Conference, Hawaii, US, 2002.
3. Daniela Giorgetti, Irina Prodanof, and Fabrizio Sebastiani. Mapping an automated survey coding task into a probabilistic text categorization framework. In Elisabete Ranchod and Nuno J. Mamede, editors, Proceedings of PorTAL 2002, 3rd International Conference on Advances in Natural Language Processing, pages 115–124, Faro, Portugal, 2002. Springer Verlag, Heidelberg, DE. Published in the "Lecture Notes in Computer Science" series, LNAI number 2389.
4. Daniela Giorgetti and Fabrizio Sebastiani. Multiclass text categorization for automated survey coding. In Proceedings of SAC-03, 18th ACM Symposium on Applied Computing, Melbourne, FL, US, 2003. Forthcoming.
5. Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
6. Chih-Wei Hsu and Chih-Jen Lin. A simple decomposition method for support vector machines. Machine Learning, 46:291–314, 2002.
7. Rodger Knaus. Methods and problems in coding natural language survey data. Journal of Official Statistics, 1(3):45–67, 1987.
8. David D. Lewis. An evaluation of phrasal and clustered representations on a text categorization task. In Nicholas J. Belkin, Peter Ingwersen, and Annelise Mark Pejtersen, editors, Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval, pages 37–50, Copenhagen, Denmark, 1992. ACM Press, New York, US.
9. Hang Li and Kenji Yamanishi. Mining from open answers in questionnaire data. In Proceedings of KDD 2001, Knowledge Discovery and Data Mining, pages 443–449, San Francisco, US, 2001.
10. Stefania Macchia and Manuela Murgia. Coding of textual responses: Various issues on automated coding and computer assisted coding. In Proceedings of JADT 2002, 6es Journées internationales d'Analyse statistique des Données Textuelles, pages 471–482, St-Malo, France, 2002.
11. Andrew K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
12. Andrew K. McCallum and Kamal Nigam. A comparison of event models for naive Bayes text classification. In Proceedings of the 1st AAAI Workshop on Learning for Text Categorization, pages 41–48, Madison, US, 1998.
13. Hinrich Schütze, David A. Hull, and Jan O. Pedersen. A comparison of classifiers and document representations for the routing problem. In Edward A. Fox, Peter Ingwersen, and Raya Fidel, editors, Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval, pages 193–216, Seattle, US, 1995. ACM Press, New York, US.
14. Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
15. Hiroshi Uchida, Meiying Zhu, and Tarcisio Della Senta. The UNL, a Gift for a Millennium. United Nations University, Tokyo, Japan, 1999.
16. Peter Viechnicki. A performance evaluation of automatic survey classifiers. In Vasant Honavar and Giora Slutzki, editors, Proceedings of ICGI-98, 4th International Colloquium on Grammatical Inference, pages 244–256, Ames, US, 1998. Springer Verlag, Heidelberg, DE. Published in the "Lecture Notes in Computer Science" series, number 1433.