AUTOMATED CONTENT ASSESSMENT OF TEXT USING LATENT SEMANTIC ANALYSIS TO SIMULATE HUMAN COGNITION. by ROBERT DARRELL LAHAM B.A., University of Kansas, 1983 M.A., University of Arizona, 1993 M.A., University of Colorado at Boulder, 1997
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department of Psychology and Institute of Cognitive Science 2000
This thesis entitled: Automated Content Assessment of Text Using Latent Semantic Analysis to Simulate Human Cognition. written by Robert Darrell Laham has been approved for the Department of Psychology and the Institute of Cognitive Science
________________________________ Thomas K Landauer
________________________________ James Martin
Date _______________________
The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above-mentioned discipline.
Laham, Robert Darrell (Ph.D., Psychology and Cognitive Science) Automated Content Assessment of Text Using Latent Semantic Analysis to Simulate Human Cognition. Thesis directed by Professor Thomas K Landauer
Latent Semantic Analysis (LSA) is both a theory of human knowledge representation and a method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. The underlying idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. Simulations of psycholinguistic phenomena show that LSA reflects similarities of human meaning effectively. The adequacy of LSA’s reflection of human knowledge has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and subject matter tests; it mimics human word sorting and category judgments; it simulates word–word and passage–word lexical priming data; it accurately estimates learnability of passages by individual students and the quality and quantity of knowledge contained in an essay. To assess essay quality, LSA is first trained on domain-representative text. Then student essays are characterized by LSA representations of the meaning of their contained words and compared with essays of known quality on degree of conceptual relevance and amount of relevant content. Over many diverse topics, LSA scores agreed with human experts as accurately as expert scores agreed with each other.
LSA has also been used to characterize tasks, occupations, and personnel and to measure the overlap in content between instructional courses covering the full range of tasks performed in many different occupations. It extracts semantic information about people, occupations, and task experience contained in natural-text databases. The various kinds of information are all represented in the same way in a common semantic space. As a result, the system can match or compare any of these objects with any one or more of the others. LSA-based agent software can help to identify required job knowledge, determine which members of the workforce have the knowledge, pinpoint needed retraining content, and maximize training and retraining efficiency. Computational models of concept relations using LSA representations demonstrate that categories can be emergent and self-organizing based exclusively on the way language is used in the corpus, without explicit hand-coding of category membership or semantic features. LSA modeling also shows that the categories
which are most often impaired in category-specific semantic disnomias are those that show the most internal coherence in LSA representational structure. If brain structure corresponds to LSA structure, the identification of concepts belonging to strongly clustered categories should suffer more than that of weakly clustered concepts when their representations are partially damaged.
Dedication

For Trish and our future together.
Acknowledgements First and foremost, I would like to thank my mentor and partner, Tom Landauer, for his invaluable assistance and dedication. We have the best arguments. I wish to acknowledge the contributions to this work of the current and former members of the SALSA (Science and Applications of Latent Semantic Analysis) laboratory: Dr. Thomas K Landauer, Dr. Walter Kintsch, Dr. Robert Rehder, Dr. Michael Wolfe, Dr. Michael Jones, Ms. Missy Schreiner, Mr. David Steinhart, and Dr. Eileen Kintsch, and of Dr. Peter Foltz of New Mexico State University. I would like to acknowledge the funding agencies that have supported portions of this work: The McDonnell Foundation, The Army Research Institute, The Air Force Research Laboratory, The Office of Naval Research, The Defense Advanced Research Projects Agency, and US West Advanced Technologies. Special thanks to Dr. Joseph Psotka of ARI and Dr. Winston Bennett of AFRL. I would also like to thank Touchstone Applied Science Associates for graciously providing us with the TASA corpus for our research. The LSA software tools are copyrighted by Bell Communications Research, Inc. (now Telcordia, Inc.). A patent application covering the methods described herein has been filed: A method for analysis and evaluation of the semantic content of writing, by Landauer, Foltz, Laham, Rehder, and Kintsch. Correspondence concerning this dissertation should be addressed to Darrell Laham, Knowledge Analysis Technologies, 4001 Discovery Drive, Suite 392, Boulder, CO 80303. Electronic mail may be sent via the Internet to
[email protected] .
CONTENTS CHAPTER I .................................................................................................................. 1 Review of Research in Automated Assessment of Text ............................................1 General Introduction ..............................................................................................1 Brief LSA Introduction ..........................................................................................2 LSA and Essay Scoring .........................................................................................3 Essay Scoring Experiments....................................................................................6 Auxiliary Findings ...............................................................................................26 On the Limits of LSA Essay Scoring...................................................................35 Appropriate Purposes for Automatic Scoring ......................................................37 Theoretical Implications of LSA Scoring ............................................................40
CHAPTER II............................................................................................................... 43 Introduction to LSA .................................................................................................43 An Introduction to LSA .......................................................................................43 Learning Human-like Knowledge by Singular Value Decomposition ................77
CHAPTER III ............................................................................................................. 90 Occupation, Personnel and Training Analysis.........................................................90 An LSA-based software tool for matching jobs, people, and instruction ............90 Latent Semantic Analysis: A Technique for Enhancing use of Occupational Analyses in Evaluating Occupational Restructuring .........................................111
CHAPTER IV ........................................................................................................... 125 Automated Assessment of Student Writing ...........................................................125 The Intelligent Essay Assessor: Applications to Educational Technology .......125 Using Latent Semantic Analysis to Assess Knowledge: Some Technical Considerations....................................................................................................139 Learning from text: Matching readers and texts by Latent Semantic Analysis.165 How Well Can Passage Meaning be Derived without Using Word Order? A Comparison of Latent Semantic Analysis and Humans ....................................208
CHAPTER V............................................................................................................. 226 Categorization ........................................................................................................226 LSA Modeling of Categorization and Category Specific Semantic Disnomias 226 Introduction ....................................................................................................226 Characteristics of the Disnomia .....................................................................227 Current Explanations......................................................................................228 Problems with Current Models ......................................................................229 LSA Modeling of Concepts and Categories ..................................................230 Evidence from LSA Modeling .......................................................................231 Discussion ......................................................................................................244 LSA and the Brain..........................................................................................245
BIBLIOGRAPHY ..................................................................................................... 250 APPENDIX A: Details of essay scoring analyses for all datasets ............................ 256
TABLES
1.1. Reliability scores by individual data sets for LSA measures ...............................17
1.2. Reliability scores by individual data sets for IEA components ...........................18
1.3. Reliability scores by individual data sets for single readers ................................19
2.1. LSA simulation of Till et al. (1988) priming study ..............................................70
3.1. Similarities for occupational specialties ............................................................104
3.2. Judgments of similarity for AFSCs ....................................................................122
4.1. Correlations Between Pre-Questionnaire Scores and the Three Cosine Measures ......141
4.2. Correlations of Pre-Knowledge Assessment Scores and LSA Measures ..........146
4.3. Results of Multiple Regression Where Pre-Questionnaire Scores are Predicted ......147
4.4. Mean learning scores for questionnaire and essay for the four instructional texts ......181
4.5. The correlations between two measures of learning and two questionnaire predictors and the cosine measure ......182
4.6. Cosines between the four instructional texts .....................................................183
4.7. Correlations of average clustering frequencies of heart terms by undergraduates with expert judgments of similarity and LSA cosine measures of similarity ......199
4.8. Heart essay results ..............................................................................................217
4.9. Psychology essay results ....................................................................................219
4.10. Further analyses of heart essay measures ........................................................221
5.1. Percent correct identification, naming and superordinate scores for 2 patients ......227
5.2. The object ‘rose’ is compared with 14 superordinate category labels ...............232
5.3. Mean cosines between category labels and category member labels ................234
5.4. Mean cosines between category member labels ................................................235
5.5. Correlations of member rank for three sources of typicality judgments ...........236
5.6. Correlations between cosines and typicality judgments from 3 sources ...........238
FIGURES
1.1. The Intelligent Essay Assessor architecture ..........................................................6
1.2. Scored essays represented in 2-dimensional space ................................................9
1.3. Inter-rater reliability for standardized and classroom tests ..................................13
1.4. Inter-rater reliabilities for resolved reader scores ................................................14
1.5. Relative prediction strength of individual IEA components ...............................15
1.6. Relative percent contribution for IEA components .............................................16
1.7. Prompts for the GMAT essay sets .......................................................................24
1.8. Comparison of LSA measures with word count ..................................................28
1.9. Effects of training set score source for GMAT issue essays ...............................30
1.10. Effects of training set score source for heart essays ..........................................30
1.11. Confidence measure 1: The nearest neighbor has too low of a cosine ..............31
1.12. Confidence measure 2: The near neighbor grades are too variable ...................32
1.13. Reliability for GMAT essays with varying size of comparison set ...................35
2.1. A word by context matrix ....................................................................................51
2.2. Complete SVD of matrix in Figure 2.1 ................................................................52
2.3. Two dimensional reconstruction of original matrix shown in Figure 2.1 ............53
2.4. Intercorrelations among vectors representing titles .............................................55
2.5. The effect of number of dimensions in an LSA corpus-based representation .....65
3.1. A 2-D representation of 2 LSA objects ................................................................97
3.2. Two candidates for a job ......................................................................................98
3.3. A job, a candidate, and 2 training courses for the job ........................................100
3.4. The candidate after taking either training course ..............................................101
3.5. Average cosines within & between course test items ........................................103
3.6. Mean cosine of personnel comparisons within and between occupational specialties ......104
4.1. Summary of reliability results............................................................................131 4.2. The proportion of variance accounted for when predicting pre-questionnaire scores..................................................................................................................143 4.3. Distribution of cosines with Text A for the 94 undergraduates and the 12 medical students.................................................................................................152 4.4. Distribution of pre-questionnaire scores for the 94 undergraduates and the 12 medical students.................................................................................................153 4.5. Distribution of dimension scores computed by Method 1 for the 94 undergraduates and the 12 medical students. .....................................................155 4.6. Distribution of dimension scores computed by Method 2 for the 94 undergraduates and the 12 medical students. .....................................................156 4.7. Distribution of dimension scores computed by Method 3 for the 94 undergraduates and the 12 medical students. .....................................................158 4.8. Representation of the LSA vectors for a student essay and instruction text. ......163 4.9. Theoretical relationship between background knowledge and learning. ............171
4.10. Theoretical relationship between the prior knowledge/text match and learning ......172
4.11. Average pre- and post-cosines between the students’ essays and the text they read ......185
4.12. Average learning scores for four groups of college students ...........................187
4.13. Fitted Gaussian curve of the learn-questionnaire and pre-questionnaire relationship for each of the four text conditions ......191
4.14. Fitted Gaussian curves of the learn-questionnaire and cos pre-essay text read relationship for each of the four text conditions ......192
4.15. Fitted Gaussian curve of the learn-questionnaire and cos pre-essay text read relationship over all four conditions ......193
4.16. Correlations with expert ratings for average pre and post sort patterns of medical students and of undergraduates ......197
4.17. Correlations with cosine similarity for average pre and post sort patterns of medical students and of undergraduates ......198
4.18. A word by passage matrix ................................................................................213
4.19. Comparing intercorrelations among vectors representing titles in the original source data ......216
5.1. Percent correct in forced choice task ..................................................................233
5.2. Mean cosines between category labels and category member labels .................234
5.3. Mean cosines between category member labels .................................................235
5.4. Principal Components (PC) factor loadings for first 3 components ...................241
5.5. Hierarchical clustering of categories ..................................................................242
5.6. Connectionist representation of the LSA model .................................................246
5.7. Simplified access to vector representations in categories ..................................248
CHAPTER I REVIEW OF RESEARCH IN AUTOMATED ASSESSMENT OF TEXT General Introduction. This dissertation traverses multiple lines of research with a common theme: intelligent applications of the Latent Semantic Analysis (LSA) theory and method. These applications are important in their own right, as useful tools in education, occupational analysis, and categorization tasks, but also serve to inform the LSA psychological theory of knowledge representation. This first chapter provides detailed results and analyses from experiments based on LSA and other natural language processing methods in scoring and annotating student essay examinations. The Intelligent Essay Assessor (IEA) is a set of software tools for scoring the quality of the conceptual content of essays based on LSA. Student essays are cast as LSA representations of the meaning of their contained words and compared with essays of known quality on degree of conceptual relevance and amount of relevant content. Chapter 2 presents an in-depth introduction to the Latent Semantic Analysis psychological theory, the mathematics of the method, and a summary of related empirical findings (Landauer, Foltz & Laham, 1998; Landauer, Laham & Foltz, 1998). Chapter 3 reviews an LSA based method for occupation, personnel and training analyses which represents diverse types of objects (jobs, people, and instructional materials) in the same semantic space. A variety of analyses can be performed using such a space, including assessments of similarity between training
courses, of the best instruction to prepare a person for a job, or of the selection of teams for optimal knowledge coverage for a new mission (Laham, Bennett & Landauer, 1999, 2000). Chapter 4 covers earlier work in scoring essays, as well as other applications for automated assessment of student writing, including matching readers to the optimal texts for maximal learning outcomes (Foltz, Laham & Landauer, 1999; Rehder, Schreiner, Wolfe, Laham, Landauer & Kintsch, 1998; Wolfe, Schreiner, Rehder, Laham, Foltz, Kintsch & Landauer, 1998; Landauer, Laham, Rehder, & Schreiner, 1997). Chapter 5 presents results of modeling experiments in automated categorization of LSA representations. For some natural categories, analyses of the LSA representations reveal inherent category information, even though the information was not explicitly provided in the input. In several cases, the analyses suggest that for some concepts LSA representations contain information on membership in a category as well as typicality of that concept within the category (Laham, 1997).
Brief LSA Introduction. LSA is fundamentally a statistical analysis of the way words are used in natural text, but one with several critical differences from those considered in the past. First, although it uses a kind of co-occurrence as a starting place, the relations it uses are not those between a word and its successors, but between words and whole passages of sufficient length to express full ideas. Second, LSA does not stop with surface correlations between individual words and their contexts, but uses a powerful form of statistical induction to derive
and represent the complete underlying system of mutual constraints implicit in the direct and indirect relations between every word and every context. It represents the full set of such relations in a high-dimensional “semantic” space in which each word and any passage is a point; the semantic similarity between passages, and certain other properties, can then be estimated from relative locations in the space. Third, the mathematical analytic technique used, along with large-scale computations made possible by recent advances in computers, allows LSA to process and derive its representations of meaning from very large corpora of text, up to millions of running words, thus, in many cases, the same or highly similar sources, both in content and size, as those from which students derive most of the knowledge they use to write an essay. To understand the application of LSA to essay scoring and other educational and industrial applications, it is sufficient to understand that it represents the semantic content of an essay as a vector (which can also be thought of, equivalently, as a point in hyper-space, or a set of factor loadings) that is computed from the set of words that it contains. Each of these points can be compared to every other through a similarity comparison, the cosine measure. Each point in the space also has a length, called the vector length, which is the distance from the origin to the point. Some readers may want more explanation and proof before accepting the plausibility of LSA as a reflection of knowledge content. Chapter 2 provides an in-depth introduction to the model and a summary of related empirical findings. LSA and Essay Scoring. It has long been recognized that American students at all grade levels, including university, do not get enough practice writing, partially
due to the inherent problems with the assessment and scoring of essays, including the subjectivity of rater judgments, low inter-rater reliability, and the associated expenses in both time and money. With the advent of computer-gradable multiple-choice tests, most teachers require very little writing from their students for either learning or assessment purposes. Forced-choice items, such as multiple-choice and matching questions, while acknowledged as efficient, are not viewed as an authentic assessment of student knowledge. For the most part, multiple-choice questions simply require the student to recognize the correct associations between the question prompt and the available answers. Constructed-response items, such as short answers and essays, require that the student recall the appropriate knowledge and think critically about applying that knowledge to fully answer the question. There are a variety of reasons to use written products such as essay questions and term papers for learning and assessment. The ability to convey information verbally is an important educational achievement in its own right, and one that is not sufficiently well assessed by other kinds of tests. In addition, essay-based testing is thought to encourage an approach to learning aimed at conceptual understanding on the part of students and to reflect a deeper, more useful level of knowledge acquisition and application. Thus scoring and criticizing written products is important not only as an assessment method, but also as a feedback device to help students better learn both content and the skills of thinking and writing. However, despite belief in the advantages of essay testing by both educators and the population at large, it poses substantial difficulties. Human scoring and diagnosis of writing requires relatively large amounts of expert labor compared to
other testing methods, and nevertheless often has undesirably low inter- and intra-reader reliability, and is subject to many potential sources of both random and systematic bias. Thus methods that could evaluate written products automatically and accurately would be very useful. The growing ease of obtaining student writing in machine-readable form, either as direct input to computers or networks, or by scanning and optical character recognition, makes computer-based evaluation more feasible. However, the more important and difficult problem is how to get the computer to assess quality and diagnose deficiencies in factual and conceptual content and their expression. What would be the ideal features of a computer-based essay evaluation technique? Here are some, probably not exhaustive, criteria. At an abstract level one can distinguish four properties of a student essay that are desirable to assess: the correctness and completeness of its contained knowledge, the soundness of arguments that it presents in discussion of issues, the presence and quality of original ideas, and the organization, fluency, elegance, and comprehensibility of its writing. One might also want to score for grammatical and stylistic variables or for mechanical features such as spelling and punctuation. Evaluation of superficial mechanical and syntactical features is fairly easy to separate from the other factors, but the rest—content, argument, comprehensibility, creativity and aesthetic style—are likely to be difficult to pull apart because each influences the others, if only because the evidence for each depends on the student's choice of words. In contrast to earlier approaches, the methods to be described here concentrate primarily on the conceptual content, the knowledge conveyed in an essay, rather than
its style or mechanics. As noted above, we would not expect evaluation of knowledge content to be clearly separable from stylistic qualities, or even from sheer length in words, but we believe that making knowledge content primary has much more favorable consequences; it will have greater face validity, be harder to counterfeit, more amenable to use in diagnosis and advice, and more likely to encourage valuable study and thinking activities. The Intelligent Essay Assessor (IEA), while based on LSA for its content analyses, also takes advantage of other style and mechanics measures for scoring, for validation of the student essay as appropriate English prose, and as the basis for some tutorial feedback. The high-level IEA architecture is shown in Figure 1.1.

[Figure 1.1 depicts the IEA architecture: a Content component (similarity to source, variance, vector length, confidence), a Style component (coherence), and a Mechanics component (misspelled words, character count), together with plagiarism and validation checks; a customized reader combines the components into a weighted overall score (% content, % style, % mechanics).]

Figure 1.1. The Intelligent Essay Assessor architecture.

Essay Scoring Experiments. A number of experiments have been done using LSA measures of essay content derived in a variety of ways and calibrating them against several different types of standards to arrive at quality scores. I first present an overall description of the primary method along with summaries of the accuracy of
the method as compared to expert human readers. Then I describe the individual experiments in more detail. LSA has been applied to evaluate the quality and quantity of knowledge conveyed by an essay using three different methods. The methods vary in the source of comparison materials for the assessment of essay semantic content: 1) pre-scored essays of other students; 2) expert model essays and knowledge source materials; 3) internal comparison of an unscored set of essays. These measures provide indicators of the degree to which a student’s essay has content of the same meaning as that of the comparison texts. This may be considered a semantic direction or quality measure. The primary method detailed in this chapter, called the Holistic method, involves comparison of an essay of unknown quality to a set of pre-scored essays which span the range of representative answer quality. The second and third methods are briefly described in this chapter in the Experiment 1 results and in much greater detail in Chapter 4. Description of the Holistic Method. Most raters of essays use a holistic scoring technique rather than a componential technique. In componential scoring methods, key components of the answer space are identified, operationalized, and assigned a point value. Essay scores are simply a sum of the assessed component scores (Millman & Greene, 1989). In this way, major components of the domain can carry more weight than minor components, completeness of answer is rewarded, and irrelevant information is ignored. In holistic scoring, a single essay score is assigned on the basis of the rater’s overall impression of the writing following an established rubric for the question at
hand. The criteria used in most scoring rubrics emphasize the content and organization of the essay, with only secondary importance given to the mechanics involved in writing (Huot, 1990). In general, rubrics include a point scale with each demarcation associated with a textual description of the level of proficiency required to achieve that score. Additionally, example essays which fall into each category might be provided to the raters for calibration. In the LSA version of the holistic method, the semantic content of the student essay as represented by LSA is compared with the content of other essays written by other students and evaluated by human judges. A text corpus covering the knowledge involved in answering the test question is first processed to produce a semantic space in which each word has a vector. Then the set of student essays is processed to produce a vector representation for each as the average of the word vectors it contains. The vectors are used to produce two independent scores, one for the semantic quality of the content, the other for the amount of such content expressed. The quality score is computed by first giving a large sample (e.g., 50 or more) of the student essays to one or more human experts to score. Each of the to-be-scored essays is then compared with all of the humanly scored ones. Some number, typically 10, of the pre-scored essays that are most similar to the one in question are selected, and the target essay is given the cosine-weighted average human score of those in the similar set. Figure 1.2 illustrates the process geometrically.
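The representation just described, an essay vector formed as the average of its word vectors and compared by the cosine measure, can be sketched in a few lines. This is an illustrative sketch, not the actual IEA code; the word-to-vector mapping is assumed to be available from the SVD of the training corpus.

```python
import numpy as np

def essay_vector(words, word_vectors):
    """Represent an essay as the average of the LSA vectors of its words.

    `word_vectors` is a hypothetical dict mapping each corpus word to a
    1-D numpy array; words absent from the semantic space are skipped.
    """
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    """Cosine of the angle between two vectors: the LSA similarity measure."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because every essay vector is an average of word vectors drawn from the same space, any essay can be compared with any other essay, or with a source text, by the same `cosine` call.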
[Figure 1.2 (diagram): scored essays (A-F) and the to-be-scored target essay T plotted on a unit circle in two LSA dimensions, with theta the angle between T and a nearby A essay.]
Figure 1.2. Scored essays represented in 2-dimensional space. Each essay in the space is represented by a letter corresponding to the score for the essay (A, B, C, D, F). This representation shows how essays might be distributed in the semantic space, as seen by the cosine measure, on the surface of a unitized hyper-sphere. The to-be-scored target essay is represented by the circled T. The target in this figure is being compared to the circled A essay. Theta is the angle between these two essays from the origin point. The to-be-scored essay is compared to every essay in the pre-scored representative set of essays. From these comparisons, the ten pre-scored essays with the highest cosine to the target are selected. The scores for these 10 essays are averaged, weighted by their cosine with the target, and this average is assigned as the target's quality score.
The vector representation of an essay has both a direction in high dimensional space, whose angle with a comparison model is the basis of the quality measure just described, and a length. The length summarizes how much domain-relevant content, that is, knowledge represented in the semantic space as derived by LSA from the training corpus, is contained in the essay, independent of its similarity to the quality standard. Because of the transformation and weighting of terms in LSA, and the way in which vector length is computed, for an essay's vector to be long, the essay must tap many of the important underlying dimensions of the knowledge expressed in the corpus from which the semantic space was derived. Algebraically, the vector length is the square root of the sum of the vector's squared values on each of the (typically 100-400) LSA dimensions or axes (in factor-analytic terminology, the square root of the sum of squares of its factor loadings). The Content score is the weighted sum of the two components after normalization and regression analysis. A third application of LSA-derived measures is to produce indices of the coherence of a student essay. Typically, a vector is constructed for each sentence in the student's answer; then an average similarity between, for example, each sentence and the next within every paragraph, or the similarity of each sentence to the vector for the whole of each paragraph, or the whole of the essay, is computed. Such measures reflect the degree to which each sentence follows conceptually from the last, how much the discussion stays focused on the central topic, and the like (Foltz, Kintsch & Landauer, 1998). As assessed by correlation with human expert judgments, it turns out that coherence measures are positively correlated with essay quality in some cases but not in others. Our interpretation is that the correlation is
positive where correctly conveying technical content requires such consistency, but negative when a desired property of the essay is that it discuss a number of disparate examples. The coherence measures are included in the Style index of the Intelligent Essay Assessor. Meta-analysis of Experiments. This chapter reports on application of this method to ten different essay questions written by a variety of students on a variety of topics and scored by a variety of different kinds of expert judges. The topics and students were: Experiment 1: Heart Essays. A question on the anatomy and function of the heart and circulatory system administered to 94 undergraduates at the University of Colorado before and after an instructional session (N = 188) and scored by two professional readers from Educational Testing Service (ETS). Experiment 2: Standardized Essay Tests. Two questions from the GMAT business school aptitude test administered and scored by Educational Testing Service, on the state of tolerance to diversity (N = 403) and on the likely effects of an advertising program (N = 383), and a Narrative Essay question for grade school children (N = 900). Experiment 3: Classroom Essay Tests. Three essay questions answered by students in general psychology classes at the University of Colorado, on operant conditioning (N = 109), attachment in children (N = 55), and aphasia (N = 109); an 11th grade essay question from U.S. history, on the era of the depression (N = 237); and two questions from an undergraduate level clinical psychology course from the University of South Africa, on Freud (N = 239) and on Rogers (N = 96).
N size for all essays examined is 3296, with 2263 in standardized tests and 1033 in classroom tests (Experiment 1 being considered more like a classroom test). For all essays, there were at least two independent readers. In all cases, the human readers were ignorant of each other's scores. In all cases, the LSA system was trained using the resolved score of the readers, which in most cases was a simple average of the two reader scores, but could also include resolution of scores by a third reader when the first two disagreed by more than 1 point (GMAT essays) or adjustment of scores to eliminate calibration bias (CU psychology). Inter-rater Reliability Analyses. The best indicator that the LSA scoring system is accurately predicting the scores is comparison of LSA scores to single reader scores. By obtaining results for a full set of essays for both the automated system and at least two human readers, one can observe the levels of agreement of the assessment through the correlation of scores. Figure 1.3 portrays the levels of agreement between the Intelligent Essay Assessor scores and single readers and between single readers with each other. For all standardized essays, the data were received in distinct training and testing collections. The system was trained on the former, with reliabilities calculated using a modified jack-knife method, wherein each essay was removed from the training set when it was being scored, and left in the training set for all other essays. The test sets did not include any of the essays from the training set. For the classroom tests the same modified jack-knife method was employed, thus allowing for the maximum amount of data for training without skewing the resulting reliability estimates.
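The modified jack-knife can be sketched as follows. The nearest-neighbor scorer and the toy data below are illustrative stand-ins for the full holistic model; only the hold-one-out loop and the Pearson correlation are the point:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def nn_score(target, pool):
    """Stand-in scorer: the human score of the single most similar essay."""
    return max(pool, key=lambda e: cosine(target, e[0]))[1]

def jackknife_reliability(essays):
    """Each essay is scored with itself held out of the comparison pool,
    then the machine scores are correlated with the human scores."""
    machine = [nn_score(v, essays[:i] + essays[i + 1:])
               for i, (v, _) in enumerate(essays)]
    human = [s for _, s in essays]
    return pearson(machine, human)

essays = [([1.0, 0.0], 5), ([0.9, 0.2], 5), ([0.2, 1.0], 2), ([0.1, 0.9], 2)]
print(round(jackknife_reliability(essays), 2))  # → 1.0 on this separable toy set
```

Holding each essay out of its own comparison pool keeps the reliability estimate honest: an essay can never contribute its own human score to its machine score.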
[Figure 1.3 (bar chart): Inter-Rater Reliability. Standardized Tests (N = 2263): Reader 1 to Reader 2 = .86, IEA to Single Readers = .85. Classroom Tests (N = 1033): Reader 1 to Reader 2 = .75, IEA to Single Readers = .73.]
Figure 1.3. Inter-rater reliability for standardized and classroom tests. Across all examinations, the IEA score agreed with single readers as well as single readers agreed with each other. The differences in reliability coefficients are not significant as tested by the Z-test for two correlation coefficients. The LSA system was trained using the resolved scores of the readers, which should be considered the best estimate of the true score of the essay. In Classical Test Theory, the average of several equivalent measures better approximates the true score than does a single measure (Shavelson & Webb, 1991). Figure 1.4 extends the results shown in Figure 1.3 to include the reliability between the IEA and the resolved score. Note that while the IEA to Single Reader correlations are slightly, but not significantly, lower than the Reader 1 to Reader 2 correlations, the IEA to Resolved Score reliabilities are slightly, but not significantly, higher than those for Reader to Reader.
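The Z-test referred to here is presumably the standard comparison of two independent correlations via Fisher's r-to-z transformation. A minimal sketch, applied to the pooled standardized-test figures purely for illustration (the dissertation's reported Z values were computed on the per-set data):

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Z statistic for the difference between two independent correlation
    coefficients, via Fisher's r-to-z transformation."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / se

# Reader-to-reader r = .86 vs. IEA-to-single-reader r = .85, n = 2263 each
z = fisher_z_test(0.86, 2263, 0.85, 2263)
print(round(z, 2), abs(z) < 1.96)  # the difference falls short of significance
```

Even with more than two thousand essays, a difference of .01 between two correlations of this size does not approach the .05 critical value of 1.96.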
[Figure 1.4 (bar chart): All Essays: Reader 1 to Reader 2 = .83, IEA to Single Readers = .81, IEA to Resolved Score = .85. Standardized: .86, .85, .88. Classroom: .75, .73, .78.]
Figure 1.4. Inter-rater reliabilities for resolved reader scores. Relative Prediction Strengths for LSA and other measures. In all of the examination sets, the LSA content measure was found to be the most significant predictor, far surpassing the indices of Style and Mechanics. Figure 1.5 gives the reliability of the individual scoring components with the criterion human-assigned scores.
[Figure 1.5 (bar chart), reliability with resolved human score: Content Score = .83, Style Score = .68, Mechanics Score = .66, IEA Total Score = .85.]
Figure 1.5. Relative prediction strength of individual IEA components. While style and mechanics indices do have strong predictive capacity on their own, as indicated in Figure 1.5, their capacity is overshadowed by the content measure. When combined into a single index, the IEA Total Score, the content measure accounts for the most variance. The relative percent contribution to prediction of essay scores, as determined by an analysis of standardized correlation coefficients, ranges from 70 to 80% for the content measure, from 10 to 20% for the style measure, and is approximately 11% for the mechanics measure (see Figure 1.6).
[Figure 1.6 (stacked bars), proportional contribution of each component: All Essays: Content .75, Style .13, Mechanics .11. Standardized: Content .69, Style .20, Mechanics .11. Classroom: Content .79, Style .10, Mechanics .11.]
Figure 1.6. Relative percent contribution for IEA components
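Percent contributions of this kind can be derived from standardized regression (beta) weights. A simplified two-predictor sketch, using hypothetical correlations chosen for illustration (not taken from the dissertation's data, which involve three predictors):

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized regression (beta) weights for two predictors of y,
    computed from the three pairwise correlations."""
    denom = 1.0 - r_12 ** 2
    b1 = (r_y1 - r_y2 * r_12) / denom
    b2 = (r_y2 - r_y1 * r_12) / denom
    return b1, b2

# Hypothetical correlations: content-criterion .83, style-criterion .68,
# content-style .60
b_content, b_style = standardized_betas(0.83, 0.68, 0.60)
total = abs(b_content) + abs(b_style)
print(round(abs(b_content) / total, 2), round(abs(b_style) / total, 2))  # → 0.7 0.3
```

Expressing each beta as a share of the summed absolute betas yields the kind of relative percent contribution plotted in Figure 1.6.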
The following Tables 1.1, 1.2, and 1.3 provide a synopsis of the overall and component reliabilities for each independent data set. Table 1.1 has the reliabilities between human assigned scores and both of the LSA measures, independently and combined into a total score. Table 1.2 breaks out the reliabilities for the IEA scoring components of content, style, and mechanics. Table 1.3 compares the Reader to Reader reliability with the IEA to Single Reader reliability.
Table 1.1. Reliability scores by individual data sets for LSA measures

                       N    LSA Quality   LSA Quantity   LSA Total Score
Standardized
  gm1.train          403       0.81           0.77            0.88
  gm1.test           292       0.75           0.76            0.85
  gm2.train          383       0.81           0.75            0.87
  gm2.test           285       0.78           0.77            0.86
  narrative.train    500       0.84           0.79            0.86
  narrative.test     400       0.85           0.80            0.88
Classroom
  great depression   237       0.77           0.78            0.84
  heart              188       0.78           0.70            0.80
  aphasia            109       0.36           0.62            0.62
  attachment          55       0.63           0.49            0.64
  operant            109       0.56           0.52            0.66
  freud              239       0.79           0.48            0.79
  rogers              96       0.60           0.56            0.69

  All Essays        3296       0.77           0.73            0.83
  Standardized      2263       0.81           0.78            0.87
  Classroom         1033       0.69           0.62            0.76
Table 1.2. Reliability scores by individual data sets for IEA components

                       N    IEA Content   IEA Style   IEA Mechanics   IEA Score
Standardized
  gm1.train          403       0.88         0.84          0.68          0.90
  gm1.test           292       0.85         0.80          0.59          0.87
  gm2.train          383       0.87         0.70          0.63          0.87
  gm2.test           285       0.86         0.67          0.64          0.87
  narrative.train    500       0.86         0.73          0.79          0.89
  narrative.test     400       0.88         0.74          0.81          0.90
Classroom
  great depression   237       0.84         0.65          0.72          0.84
  heart              188       0.80         0.56          0.57          0.80
  aphasia            109       0.62         0.45          0.62          0.70
  attachment          55       0.64         0.36          0.57          0.70
  operant            109       0.66         0.57          0.49          0.73
  freud              239       0.79         0.55          0.53          0.80
  rogers              96       0.69         0.38          0.38          0.70

  All Essays        3296       0.83         0.68          0.66          0.85
  Standardized      2263       0.87         0.75          0.70          0.88
  Classroom         1033       0.76         0.54          0.57          0.78
Table 1.3. Reliability scores by individual data sets for single readers

                       N    Reader 1 to Reader 2   IEA to Single Reader
Standardized
  gm1.train          403           0.87                   0.88
  gm1.test           292           0.86                   0.84
  gm2.train          383           0.85                   0.83
  gm2.test           285           0.88                   0.85
  narrative.train    500           0.87                   0.86
  narrative.test     400           0.86                   0.87
Classroom
  great depression   237           0.65                   0.77
  heart              188           0.83                   0.77
  aphasia            109           0.75                   0.66
  attachment          55           0.19                   0.54
  operant            109           0.67                   0.69
  freud              239           0.89                   0.78
  rogers              96           0.88                   0.68

  All Essays        3296           0.83                   0.81
  Standardized      2263           0.86                   0.85
  Classroom         1033           0.75                   0.73

Note: Differences for All Essays, Standardized, and Classroom are not significant using the Z-test for differences in reliability coefficients; critical Z at alpha (.05) = 1.96. Z(ALL) = .153; Z(STANDARD) = 1.53; Z(CLASSROOM) = .70.
This meta-analysis has covered the most important results from the research. A review of some additional modeling experiments performed on some of the unique datasets is presented next, with more detailed analyses available in Chapter 4. Appendix A provides bivariate scattergrams for human score and IEA score as well as the individual model component reliability results. Experiment 1: Heart Studies. Ninety-four undergraduates fulfilling introductory psychology course requirements volunteered to write approximately 250-word essays on the structure, function, and biological purpose of the heart. They wrote one such essay at the beginning of the experiment, then read a short article on the same topic chosen from one of four sources: an elementary school biology text, a high school text, a college text, or a professional cardiology journal. They then wrote another essay on the same topic. In addition, both before and after reading, the students were given a short answer test that was scored on a 40 point scale. The essays were scored for content, that is, the quality and quantity of knowledge about the anatomy and function of the heart (without intentional credit for mechanics or style), independently by two professional readers employed by Educational Testing Service. The short answer tests were scored independently by two graduate students who were serving as research assistants in the project. The LSA semantic space was constructed by analysis of all 94 paragraphs in a set of 26 articles on the heart taken from an electronic version of Grolier’s Academic American Encyclopedia. This was a somewhat smaller source text corpus than has usually been used, but it gave good results, and attempts to expand it by the addition of more general text did not improve results.
First, each essay was represented as an LSA vector. The sets of before- and after-reading essays were analyzed separately. Each target essay was compared with all the others, the ten most similar by cosine measure were found, and the essay in question was given the cosine-weighted average of the human assigned scores. Alternative methods of analysis. In another explored method, instead of comparing a student essay with other student essays, the comparison is with one or more texts authored by experts in the subject. For example, the standard might be the text that the students have read to acquire the knowledge needed, or a union of several texts representative of the domain, or one or more model answers to the essay question written by the instructor. In this approach it is assumed that a score reflects how close the student essay is to a putative near-ideal answer, a "gold standard". For this experiment, instead of comparing each essay with other essays, each was compared with the high school level biology text section on the heart. This experiment is described in detail in Chapter 4. An advantage of this method as applied here, of course, is that the score is derived without the necessity of human readers providing the comparison set, but it does require the selection or construction of an appropriate model. In a third method the scoring scale is derived solely from comparisons among the student essays themselves rather than from their relation to human scores or model text. The technique rests on the assumption that in a set of essays intended to tap the amount of knowledge conveyed, the principal dimension along which the essays will vary will be the amount of knowledge conveyed by each essay. That is, because students will try to do what they are asked, the task is difficult, and the
students vary in ability, the principal difference between student products will be in how well they have succeeded. The LSA-based analysis consists of computing a matrix of cosines between all essays in a large collection. These similarities are converted into distances (1 - cosine), then subjected to a single-dimensional scaling, also known as unfolding (Coombs, 1964). Each essay then has a numerical position along the single dimension that best captures the similarities among all of the essays; by assumption, this dimension runs from poor quality to good. The analysis does not tell which end of the dimension is high and which low, but this can be trivially ascertained by examining a few essays. The unfolding method, when tested on the heart essays, yielded an average correlation of .62 with the scores given by ETS readers. This can be compared with correlations of .78 for the holistic quality score, .70 for the holistic quantity score, and .65 when source texts are used in the comparison. All methods gave reliabilities that are close to those achieved by the human readers, and well within the usual range of well-scored essay examinations. Details of the experiments involving these alternative methods are in Chapter 4. Validity Studies using Objective Tests. Because the essay scoring in Experiment 1 was part of a larger study with more analysis, some accessory investigations that throw additional light on the validity of the method were possible. First, we asked whether LSA was doing more than measuring the number of technical terms used by students. To explore this, the words in the essays were classified as either topically relevant content words or topically neutral words, including common function words and general purpose adjectives, that might be found in an essay on any subject. This division was done by one of the research assistants with intimate
knowledge of the materials. As detailed in Chapter 4, the correlation with the average human score was best when both kinds of words were included, but was, remarkably, statistically significant even when only the neutral words were used. However, as is to be expected, relevant content words alone gave much better results than the neutral words alone. The administration of the essay test before and after reading in Experiment 1 provides additional indicators of validity. First, as reported in detail in Chapter 4, the LSA relation between the before-reading essay and the text selection that a student read yielded substantial predictions, in accordance with the zone of optimal learning hypothesis, for how much would be learned; had students been assigned their individually optimal text as predicted by the relation between it and their before-reading essay, they would, on average, have learned a significant 40% more than if all students were given the one overall best text. The same effects were reflected in LSA after-reading essay scores and short-answer tests. These results make it clear that the measure of knowledge obtained with LSA is not superficial; it does a good job of reflecting the influence of learning experiences and predicting the expected effects of variations in instruction. A final result from this experiment is of special interest. This is the relation between the LSA score and the more objective short answer test. The correlation between LSA scores and short answer tests was .76. The correlation between Reader 1’s essay score and the short answer test was .72; for Reader 2 it was .81, for an average of .77. This lack of a difference indicates that the LSA score had an
external criterion validity that was at least as high as that for combined expert human judgments. Again, details of this are provided in Chapter 4. Experiment 2: Standardized tests. This experiment used a large sample of essays taken from the ETS GMAT exam used for selection of candidates for graduate business administration programs. There were two topics, shown in Figure 1.7. The essays on both topics were split by ETS into training and test sets. An interesting feature of these essays is that they have much less consistency either in what students wrote or in what might be considered a good answer. There was opportunity for a wide variety of different good and bad answers, including different discussions of different examples reaching different conclusions and using fairly disjoint vocabularies, and at least an apparent opportunity for novel and creative answers to have received appropriate scores from the human judges. Although it was therefore thought that the holistic approach might be of limited value, the method was nevertheless applied. To our surprise, it worked quite well. As shown in Table 1.1, the reliabilities for the IEA matched the reliabilities for the well-trained readers. GMAT Issue Prompt “Everywhere, it seems, there are clear and positive signs that people are becoming more respectful of one another’s differences.” In your opinion, how accurate is the view expressed above? Use reasons and/or examples from your own experience, observations, or reading to develop your position.
GMAT Argument Prompt “The potential of Big Boards to increase sales of your products can be seen from an experiment we conducted last year. We increased public awareness of the name of the current national women’s marathon champion by publishing her picture and her name on billboards in River City for a period of three months. Before this time, although the champion had just won her title and was receiving extensive national publicity, only five percent of 15,000 randomly surveyed residents of River City could correctly name the champion when shown her picture; after the three-month advertising experiment, 35 percent of respondents from a second survey could supply her name.” Discuss how well reasoned you find this argument. In your discussion be sure to analyze the line of reasoning and the use of evidence in the argument. For example, you may need to consider what questionable assumptions underlie the thinking and what alternative explanations or counterexamples might weaken the conclusion. You can also discuss what sort of evidence would strengthen or refute the argument, what changes in the argument would make it more sound and persuasive, and what, if anything, would help you better evaluate its conclusion.
Figure 1.7. Prompts for the GMAT essay sets
A third set of grade school student essays required narrative writing from an open-ended prompt (“Usually the gate was closed, but on that day it was open…”). This examination question allowed for nearly unlimited variability in writer response. Almost any situation could have followed this prompt, yet the LSA content measure was actually slightly stronger for this case than for any other tested (see Table 1.1). An explanation of this finding could be the following: over the fairly large number of essays scored by LSA, almost all of the possible ways to write a good, bad, or indifferent answer, and almost all kinds of examples that would contribute to a favorable or unfavorable impression on the part of the human readers, were represented in at least a few student essays. Thus, by finding the ten most similar to a particular essay, LSA was still able to establish a comparison that yielded a valid score. The results are still far enough from perfect to allow the presence of a few unusual answers not validly scored, although the human readers apparently did not, on average, agree with each other in such cases any more than they did with LSA. Experiment 3: Classroom Studies. An additional 845 essays from six exams from three educational institutions were also scored using the holistic method. In general, the inter-rater reliability for these exams is lower than for standardized tests, but is still quite respectable. The reliability results for all of these sets are also detailed in Tables 1.1-1.3. General Psychology at the University of Colorado. In a freshman- and sophomore-level general psychology course, hour exams included ten-minute essays. Three of these were selected for study. One was on the determinants of attachment in young children, one on classical and operant conditioning of a pet dog, and another on the
brain areas associated with different kinds of aphasia. The readers assigned holistic scores on a 0-10 scale to indicate the quality of the essay with respect to the level of student knowledge of the topic it reflected. The LSA semantic space was constructed by analysis of the textbook, Psychology 4th Ed. (Myers, 1994), used in the class. These essays were in general shorter than the others, with an average word count of 120 compared to 270 for the ETS essays. High School American History. These 237 essays from 11th graders were supplied by the Center for Research on Evaluation, Standards and Student Testing (CRESST). The topic of the essay was the “New Deal/Great Depression”. The essays came from nine high school classes, seven of which were Advanced Placement US History classes, from San Diego and Palos Verdes, California. Four readers were employed (R1-R4) for assessment of the essays. Only R4 scored all 237 essays. R1 scored 127 essays, R2 scored 119 essays, and R3 scored 121 essays. Clinical Psychology at the University of South Africa. Two questions on the work of Sigmund Freud and Carl Rogers were provided by the Psychology Department, as well as the textbook used by the students. These essays had very high inter-rater reliability (at the level of standardized tests). Auxiliary Findings. In addition to the reliability and validity studies, the research examined a variety of other aspects of scoring the essays. These explorations are detailed in this section. Count Variables and Vector Length. Previous attempts to develop computational techniques for scoring essays have focused primarily on measures of style and mechanics. Indices of content have remained secondary, indirect, and
superficial. For example, in the extensive work of Page and his colleagues (Page, 1966, 1994) over the last thirty years, a growing battery of computer programs for analysis of grammar, syntax, and other non-semantic characteristics has been used. Combined by multiple regression, these variables accurately predict scores assigned by human experts to essays intended primarily to measure writing competence. By far the most important of these measures, accounting for well over half the predicted variance, is the sheer number of words in the student essay. Although this might seem to be a spurious predictor, and certainly one easily counterfeited by a test taker who knew the scoring procedure, it seems likely that it is, in fact, a reasonably good indicator variable under ordinary circumstances. The rationale is that most students, at least when expecting their writing to be judged by a human, will not intentionally write nonsense. Those who know much about the topic at hand, those who have control of large vocabularies and are fluent and skillful in producing discourse, will write, on average, longer essays than those lacking these characteristics. Thus it is not a great surprise that a measure of length, especially when coupled with a battery of measures of syntax and grammar that would penalize gibberish, does a good job of separating students who can write well from those who can’t, and those who know much from those who don’t. The major deficiencies in this approach are that its face validity is extremely low, that it appears easy to fake and coach, and that its reflection of knowledge and good thinking on the part of the student arise, if at all, only out of indirect correlations over individual differences. It is important to note that while vector length in most cases is highly correlated with the sheer number of words used in an essay, or to the number of
content-specific words, it need not be. For example, unlike ordinary word count methods, an essay on the heart consisting solely of the words "the heart" repeated hundreds of times would generate a low LSA quantity measure, that is, a short vector. In many of our experiments the vector length has been highly correlated with the number of words, as used in the Page measures, and collinear with it in predicting human scores, but in others it has been largely independent of length in number of words, but nevertheless strongly predictive of human judgments. The standardized essays resemble more closely those studied by Page and others, in which word count and measures of stylistic and syntactic qualities together were sufficient to produce good predictions of human scores. Analyses of the relative contributions of the quality and quantity measures and their correlations with length in words for two contrasting cases are shown in Figure 1.8. It should be mentioned that count variables have been expressly excluded from all of the IEA component measures.

[Figure 1.8 (bar chart): reliability coefficients for LSA-Quality, LSA-Quantity, and Word Count across All Essays, Standardized, Classroom, Great Depression (word count unrestricted), and Heart (word count restricted).]

Figure 1.8. Comparison of LSA measures with word count
One reader training compared with resolved score training. As stated previously, all essay sets had at least two readers, and the LSA models were trained on the resolved score of the readers. In an interesting set of side experiments, on the GMAT issue prompt and on the heart prompt, new analyses were conducted wherein the LSA training used only one or the other of the independent reader scores, rather than the resolved score. This situation would parallel many cases of practical application where the expense of two readers for the calibration set would be too high. In all three cases, whether the training used the Resolved scores, the Reader 1 scores, or the Reader 2 scores, the LSA Quality measure predicts the Resolved scores at a slightly higher level of reliability than it predicts the individual reader scores. The resolved score is the best estimate of the true score of the essay, a better estimate than either individual reader, all things being equal. This method, even when using single reader scores, better approximates the true score than does the single reader alone. The results are shown in Figure 1.9 for the GMAT issue essays, and in Figure 1.10 for the Heart essays. An implication of this is that a single reader could use LSA scoring after hand scoring a set of essays to act as if she were two readers, and thereby arrive at a more reliable estimate of the true scores for the entire set of essays. This application would tend to alert one to, and smooth out, any glaring inconsistencies in scoring by treating each of the semantically near essays as an alternative form.
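The expected gain from treating the LSA score as a second, parallel reader can be estimated with the Spearman-Brown formula from Classical Test Theory. The input reliability below is the pooled classroom reader-to-reader figure, used only as an illustration:

```python
def spearman_brown(r, k=2):
    """Predicted reliability when k parallel measurements, e.g. one human
    reader plus the LSA score, are averaged (Spearman-Brown formula)."""
    return k * r / (1 + (k - 1) * r)

# A single-reader reliability of .73, doubled by a parallel second "reader":
print(round(spearman_brown(0.73), 2))  # → 0.84
```

Under the formula's assumption of parallel measures, averaging a second reading raises a .73 reliability to roughly .84, which is the logic behind using the resolved score as the best estimate of the true score.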
[Figure 1.9 (bar chart): whether trained on the Resolved score, the Reader 1 scores, or the Reader 2 scores, the LSA Quality measure predicted the Resolved score at about .81 and the individual reader scores at about .78-.79.]
Figure 1.9. Effects of training set score source for GMAT issue essays
[Figure 1.10 (bar chart): under all three training conditions, predictions of the Resolved score (about .78-.79) were slightly higher than predictions of the individual reader scores (about .69-.78).]
Figure 1.10. Effects of training set score source for heart essays
Confidence measures for LSA Quality Score. The LSA technique itself makes possible several ways to measure the degree to which a particular essay has been scored reliably. One such measure is to look at the cosines between the essay being scored and the set of k essays to which it is most similar. If the essays in the comparison set have unusually low cosines with the essay in question (based on the norms of the essays developed in the training stage), or if their assigned grades are unusually variable (also assessed by considering the training norms), it is unlikely that an accurate score can be assigned (see Figures 1.11 and 1.12).
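Both confidence checks, plus the copy check discussed below, can be sketched as simple tests on the nearest-neighbor cosines and grades. The threshold values here are arbitrary placeholders; in practice they would be normed from the training data for each topic:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def confidence_flags(target, pool, k=3, min_cos=0.5, max_spread=1.5, copy_cos=0.98):
    """Flag an essay whose nearest neighbors are too dissimilar, whose
    neighbors' grades are too variable, or whose top cosine is suspiciously
    high; pool holds (vector, grade) pairs."""
    top = sorted(((cosine(target, v), g) for v, g in pool), reverse=True)[:k]
    cosines = [c for c, _ in top]
    grades = [g for _, g in top]
    mean_g = sum(grades) / len(grades)
    spread = math.sqrt(sum((g - mean_g) ** 2 for g in grades) / len(grades))
    return {
        "off_topic": max(cosines) < min_cos,           # Figure 1.11 case
        "variable_neighbors": spread > max_spread,     # Figure 1.12 case
        "possible_copy": max(cosines) > copy_cos,
    }

# Toy pool: this target's three neighbors carry the disparate grades 4, 5, and 1,
# so the variable-neighbors flag is raised
pool = [([1.0, 0.0], 5), ([0.9, 0.3], 4), ([0.0, 1.0], 1)]
print(confidence_flags([0.7, 0.5], pool))
```

An essay raising any flag would be routed to a human reader rather than scored automatically.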
[Figure 1.11 (diagram): the target essay T on the unit circle with no pre-scored essay nearby; its highest cosine is unusually low.]
Figure 1.11. Confidence measure 1: The nearest neighbor has too low a cosine. Such a situation could indicate that the essay is incoherent with the content domain. It could also reflect an essentially good or bad answer phrased in a novel
way, or, even one that is superbly creative and unique. Again, if the essay in question is quite similar to several others, but they are quite different from each other (which can happen in high-dimensional spaces), the essay in question is also likely to be unusual.
[Schematic: essays plotted as points labeled by grade (A-F) within the unit circle of a two-dimensional semantic space; the essay to be scored, T, lies at small angles θ1 and θ2 from neighbors whose grades range from A to F.]
Figure 1.12. Confidence measure 2: the near-neighbor grades are too variable.
On the other hand, if an essay has an unexpectedly high cosine with some other essay, it would be suspected of being a copy. In all these cases, one would want to flag the essay for additional human evaluation. Of course, the application of such measures will usually require that they be re-normed for each new topic, but this is easily accomplished by including the necessary statistical analyses in the IEA software system that computes the LSA measures.
Validation measures for the essay. Computer-based style analyzers offer the possibility of giving the student or teacher information about grammatical and stylistic problems in the student's writing, for example, data on the distribution of sentence lengths, use of passives, and disagreements in number and gender. This kind of facility, like spelling checkers, has become common in text editing and word processing systems. Unfortunately, we know too little about their validity relative to comments and corrections made by human experts, or about their value as instructional devices for students. These methods also offer no route to delivering feedback about the conceptual content of an essay or emulating the criticism or advice that a teacher might give with regard to such content. In addition to the LSA-based measures, the IEA calculates several other sensibility checks. It can compute the number and degree of clustering of word-type repetitions within an essay, the type/token ratio or other parameters of its word frequency distribution, or the distribution of its word entropies as computed in the first step of LSA. Comparing several of these measures across the set of essays allows the detection of essays constructed by means other than normal composition. For example, forgery schemes based on picking rare words specific to a topic and using them repeatedly, which can modestly increase both LSA measures, are caught in this way. Yet another set of validity checks rests on the use of available automatic grammar, syntax, and spelling checkers. These also detect many kinds of deviant essays that would otherwise receive either too high or too low LSA scores for the wrong reasons. Finally, the IEA includes a method that determines the syntactic cohesiveness of an essay by computing the degree to which the order of words in its sentences
reflects the sequential dependencies of the same words as used in printed corpora of the kinds read by students (the primary statistics used in automatic speech recognition "language models"). Gross deviations from normative statistics would indicate abnormally generated text; essays with good grammar and syntax will be near the norms. Other validity checks that might be added in future implementations include comparisons with archives of previous essays on the same and similar topics, collected either locally or over networks. On the one hand, LSA's relative insensitivity to paraphrasing and component re-ordering would provide enhanced plagiarism detection; on the other, comparisons with a larger pool of comparable essays could be used to improve LSA scoring accuracy. The required size for the set of comparison essays. A general rule of thumb in acquiring the comparison sets is that the more pre-scored essays available, all else being equal, the more accurate the scores determined by the LSA Quality measure (the only measure affected by pre-scoring), especially on essay questions that have a variety of good or correct answers. To help understand the increase in reliability as comparison set size grows, the GMAT Issue test set was scored with comparison sets ranging in size from 6 essays (1 randomly selected at each score point) to 403 (the full training set). As can be seen in Figure 1.13, even the 6-essay comparison set did a reasonable job of prediction. The highest levels of reliability began at around 100 essays and continued through 400 essays. When the 6-essay measure was supplemented by the other IEA components, the 6-essay and the 400-essay models had equal reliability coefficients of .87.
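The word-level sensibility checks described above, such as the type/token ratio and the repetition count of individual word types, are simple to compute. The sketch below illustrates how a forged essay built by repeating a few rare topic words stands out on both measures; the example sentences and any flagging thresholds are illustrative only.

```python
from collections import Counter

def type_token_ratio(text):
    """Ratio of distinct word types to total word tokens; unusually low
    values can indicate heavy repetition of a few topic-specific words."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def max_type_repetition(text):
    """Highest count of any single word type in the essay."""
    tokens = text.lower().split()
    return max(Counter(tokens).values()) if tokens else 0

# A forgery built by repeating rare topic words differs sharply from
# normal composition on both measures:
normal = "the heart pumps blood through arteries and veins to the body"
forged = "ventricle ventricle aorta aorta ventricle aorta ventricle aorta"
```

Comparing such statistics across the full set of essays, rather than against fixed cutoffs, is what lets the IEA flag outliers for human review.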
Reliability for GMAT Issue Test Set (N=292) with Varying Numbers of Pre-Scored Essays in the Comparison Set

Number of training essays in comparison set:     6     25     50    100    200    400
Reliability coefficient:                       .53    .69    .72    .75    .74    .75
Figure 1.13. Reliability for GMAT essays with varying size of comparison set.
On the Limits of LSA Essay Scoring. What does the LSA method fail to capture? First of all, it is obvious that LSA does not reflect all discernible and potentially important differences in essay content. In particular, LSA alone does not directly analyze syntax, grammar, literary style, logic, or mechanics (e.g., spelling and punctuation). However, this does not necessarily, or empirically often, cause it to give scores that differ from those assigned by expert readers; as shown in Figure 1.5, these measures add little predictive value to the LSA-only model. The overall reliability statistics shown in Figure 1.3 demonstrate that this must be the case for a wide variety of students and essay topics. Indeed, the correspondence between LSA's word-order-independent measures and human judgments is so close that it forces one to consider the possibility that something closely related to word combination alone is the fundamental basis of meaning representation, with syntax usually serving
primarily to ease the burden of transmitting word-combination information between agents with limited processing capacity (Landauer, Laham and Foltz, 1998). In addition to its lack of syntactic information and perceptual grounding, LSA obviously cannot reflect perfectly the knowledge that any one human or group of humans possesses. LSA training always falls far short of the background knowledge that humans bring to any task: experience based on more and different text and, importantly, on interaction with the world, other people, and spoken language. In addition, on both the sentence and the larger discourse level, it fails to measure directly such qualities as rhyme, sound symbolism, alliteration, cadence, and other aspects of the beauty and elegance of literary expression. It is clear, for example, that the method would be insufficient for evaluating poetry or important separate aspects of creative writing. However, it is possible that stylistic qualities restrict word choice so as to make beautiful essays resemble other beautiful essays to some extent, even for LSA. Nonetheless, some of the aesthetics of good writing undoubtedly go unrecognized. It is thus surprising that the LSA measures correlate so closely with the judgments of humans who might have been expected to be sensitive to these matters, intentionally or unintentionally. However, it bears noting that in the pragmatic business of assessing a large number of content-oriented essays, human readers may also be insensitive to, largely ignore, or unreliably use these aspects. Indeed, studies of text comprehension have found that careful readers often fail to notice even direct contradictions (van Dijk & Kintsch, 1983). And, of course, judgments of aesthetic qualities, as reflected, for example, in the opinions of critics of
both fiction and nonfiction, are notoriously unreliable and subject to variation from time to time and purpose to purpose. Appropriate Purposes for Automatic Scoring. There are some important issues regarding the uses to which LSA scoring is put. These differ depending on whether the method is aimed primarily at assessment or at tutorial evaluation and feedback. We start with assessment. There are several ways of thinking about the object of essay scoring. One is to view assessment as determining whether certain people with special social roles or expertise, for example, teachers, critics, admissions officers, potential employers, parents, politicians, or taxpayers, will find the test taker's writing admirable. Obviously the degree to which an LSA score will predict such criteria will depend in part on how many readers, and of what kind, are used as the calibration criterion. One can also view the goal of an essay exam as accurate measurement of how much knowledge about a subject the student has. In this case, correlation with human experts is simply a matter of expedience; even experts are less than perfectly reliable and valid, so their use can be considered only an approximation. Other criteria, such as other tests and measures, correlations with the amount or kind of previous learning, or long-term accomplishments (for example, course grades, vocational advancement, professional recognition, or earnings) would be superior ways to calibrate LSA scores. There is, of course, a critical issue of face validity and social acceptability for an automatic scoring procedure. It can be expected that many people, ranging from educators to parents to government officials to the public at large, even educational
researchers, will be suspicious of or offended by automatic scoring. Partly, such antipathy will arise from the considerations given above: a legitimate desire for assurance that human opinion about quality is accurately reflected, and doubt that machines can give opinions as valuable. One response may be to use the machine scoring only as an extra scoring, a second opinion, and/or an independent source of diagnosis and advice. That is, one would always have one or more humans score at least a substantial portion of the essays, and always have human fallback judges available. A maximally conservative approach would be for a teacher or testing body to score essays in the traditional human intellectual manner, but to supplement the procedure by submitting the set of essays and their human scores to LSA scoring as a check on consistency. It is well established that human judges are subject to a variety of undesirable sources of variability: fluctuation, drift, contrast, and bias effects. Because LSA scores are based on computational comparisons with the scores given to many other essays, they will tend to average out such variations. Thus an essay that receives different scores from the human judge and the program may have been scored unreliably by the human and should be reassessed. In the case of calibration to human judgments, as in the holistic method, the process will not remove average systematic bias effects, because these will be present in the human judgments of the comparison essays, but it will reduce case-to-case variations. Applied appropriately, LSA scoring is unquestionably less subjective, less subject to individual bias, and more reliable over time than human scoring. For example, LSA combines judgments on many other essays written by many other
students, possibly scored by multiple judges, thereby greatly reducing the consequences of inattention, shifting criteria, clerical errors, and many other irrelevant sources of variability. LSA does not know what the students look like, what grades they have gotten in the past, or how they have behaved in class or in the teacher's office (of course, these things are usually true of nationally administered tests as well). As described earlier, it is also possible, using LSA plus other techniques, to perform any number of validity and sensibility checks, and to set thresholds for human re-assessment. Uniform use of such checks should, with experience, allay fears that spurious scores, either too high or too low, will be given by either human or machine. A continual problem with longitudinal assessments of writing, such as the National Assessment of Educational Progress (NAEP), is the transferability of scores from one year to another. Writing evaluations need to be compared across yearly test administrations to assess group progress, but each year different sets of raters are used, and their reliability is notoriously low (Huot, 1990). This would not be a problem for LSA-based grading. As long as the comparison set is held constant, the basis for grading will not change; over time, however, changes in language, advances in knowledge, and trends in communication may require updating the system. Yet another likely objection to computer essay scoring will be that shifting essay scoring from teacher to machine will reduce the communication bandwidth between students and teachers, making each less aware of the other's qualities and concerns and therefore rendering teachers less able to help students. There is, of course, no reason why the use of computer scores need reduce the number of essays
read by teachers. The computer score can be used only as a second opinion or for additional essays beyond the number feasible or desirable for the teacher to read. This is clearly an issue purely of pedagogical strategy in the application of the method, not a consequence of its use as such. Moreover, many teachers believe that much of the time they currently spend reading and scoring written products would be better spent in other activities, such as individual interaction with students, perhaps in going over written assignments one-on-one or in small groups, in part because the feedback delay of hand scoring is usually undesirably long. Finally, there will probably be some purely Luddite-like reactions: teachers afraid of being put out of work, or, more likely, others who think that teachers will have such fears. However, most teachers do not consider paper scoring an indispensable part of what they are paid to do, but rather believe that they could easily spend any freed time on aspects of teaching that their employers value more. Theoretical Implications of LSA Scoring. Every successful application of the LSA methodology in information processing, whether in strictly applied roles or in psychological simulations, adds evidence in support of the claim that LSA must, in some way, be capturing the performance characteristics of its human counterparts. LSA scores have repeatedly been found to correlate with a human reader's score as well as one human score correlates with another. Given that humans surely have more knowledge and can use aspects of writing, such as syntax, that are unavailable to LSA, how can this be? There are several possibilities. The first is very strong correlation in different writing skills across students. In general, it has long been known that there is a high correlation over students between quality of performance
on different tasks and between different kinds of excellence on the same and similar tasks. It is not necessary to ask the origin of these correlations to recognize that they exist. The issue this raises for automatic testing is that almost anything a machine can detect that is a legitimate quality of an essay is likely to correlate well with any human judgment based on almost any other quality. So, for example, measures of the number of incorrect spellings, missing or incorrect punctuation marks, or rare words are likely to correlate fairly well with human judgments of the quality of arguments or of the goodness and completeness of knowledge. As an example, the LSA scores for the Grade School Narrative essays correlated at .76 with the Handwriting scores of the same essays, even though the LSA system had no access to the handwritten essays themselves. However, no matter how well correlated, we would be uncomfortable using superficial, intrinsically unimportant properties as the only or main basis of student evaluation, and for good reason. The most important is that by doing so we would tend to encourage students to study, and teachers to teach, the wrong things; inevitably the more valuable skills and attitudes would decline and the correlations would disappear. Instead, we want to assess most directly the properties of performance we think most important, so as to reward and shape the best study habits, pedagogical and curricular techniques, and public policies. Where, then, do the LSA methods come in? First, they can greatly reduce the expert human effort involved in using essay exams. Even if used only as a "second opinion" as suggested, they would reduce the effort needed to attain better reliability and validity. Second, they offer a much more objective measure.
But what about the properties of the student that are being measured? Are they the ones we truly want to measure? Does measuring and rewarding their achievement best motivate and guide a society of learners and teachers? This is not yet entirely clear. Surely students who study more "deeply", who understand and express knowledge more accurately and completely, will tend to receive higher LSA-based scores. On the other hand, LSA scores do not capture all and only the performances we wish to encourage. Nonetheless, the availability of accurate machine scoring should shift the balance of testing away from multiple-choice and short-answer formats toward essays, and therefore toward greater concentration on deep comprehension of sources and discursive expression of knowledge. In conclusion, it is clear that determining the value of the LSA automatic scoring technology (its reliability and validity in practice, over the whole spectrum of potential educational settings, and its public acceptability) will require considerably more research and experience.
CHAPTER II INTRODUCTION TO LSA An Introduction to Latent Semantic Analysis Originally Published as: Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
Abstract. Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text (Landauer and Dumais, 1997). The underlying idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. The adequacy of LSA's reflection of human knowledge has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and subject matter tests; it mimics human word sorting and category judgments; it simulates word-word and passage-word lexical priming data; and, as reported in three following articles in this issue, it accurately estimates passage coherence, learnability of passages by individual students, and the quality and quantity of knowledge contained in an essay. An Introduction to Latent Semantic Analysis. Research reported in the three articles that follow—Foltz, Kintsch & Landauer (1998/this issue), Rehder, et al. (1998/this issue), and Wolfe, et al. (1998/this issue)—exploits a new theory of knowledge induction and representation (Landauer and Dumais, 1996, 1997) that provides a method for determining the similarity of meaning of words and passages by analysis of large text corpora. After processing a large sample of machine-readable
language, Latent Semantic Analysis (LSA) represents the words used in it, and any set of these words (such as a sentence, paragraph, or essay), either taken from the original corpus or new, as points in a very high (e.g., 50-1,500) dimensional "semantic space". LSA is closely related to neural net models, but is based on singular value decomposition, a mathematical matrix decomposition technique closely akin to factor analysis that is applicable to text corpora approaching the volume of relevant language experienced by people. Word and passage meaning representations derived by LSA have been found capable of simulating a variety of human cognitive phenomena, ranging from developmental acquisition of recognition vocabulary to word categorization, sentence-word semantic priming, discourse comprehension, and judgments of essay quality. Several of these simulation results will be summarized briefly below, and additional applications will be reported in detail in following articles by Peter Foltz, Walter Kintsch, Thomas Landauer, and their colleagues. We will explain here what LSA is and describe what it does. LSA can be construed in two ways: (1) simply as a practical expedient for obtaining approximate estimates of the contextual usage substitutability of words in larger text segments, and of the kinds of (as yet incompletely specified) meaning similarities among words and text segments that such relations may reflect, or (2) as a model of the computational processes and representations underlying substantial portions of the acquisition and utilization of knowledge. We next sketch both views. As a practical method for the characterization of word meaning, we know that LSA produces measures of word-word, word-passage and passage-passage relations that are well correlated with several human cognitive phenomena involving association or semantic similarity. Empirical evidence of this will be reviewed shortly.
The correlations demonstrate close resemblance between what LSA extracts and the way people's representations of meaning reflect what they have read and
heard, as well as the way human representation of meaning is reflected in the word choice of writers. As one practical consequence of this correspondence, LSA allows us to closely approximate human judgments of meaning similarity between words and to objectively predict the consequences of overall word-based similarity between passages, estimates of which often figure prominently in research on discourse processing. It is important to note from the start that the similarity estimates derived by LSA are not simple contiguity frequencies, co-occurrence counts, or correlations in usage, but depend on a powerful mathematical analysis that is capable of correctly inferring much deeper relations (thus the phrase “Latent Semantic”), and as a consequence are often much better predictors of human meaning-based judgments and performance than are the surface level contingencies that have long been rejected (or, as Burgess and Lund, 1996 and this volume, show, unfairly maligned) by linguists as the basis of language phenomena. LSA, as currently practiced, induces its representations of the meaning of words and passages from analysis of text alone. None of its knowledge comes directly from perceptual information about the physical world, from instinct, or from experiential intercourse with bodily functions, feelings and intentions. Thus its representation of reality is bound to be somewhat sterile and bloodless. However, it does take in descriptions and verbal outcomes of all these juicy processes, and so far as writers have put such things into words, or that their words have reflected such matters unintentionally, LSA has at least potential access to knowledge about them. The representations of passages that LSA forms can be interpreted as abstractions of “episodes”, sometimes of episodes of purely verbal content such as philosophical arguments, and sometimes episodes from real or imagined life coded into verbal descriptions. 
Its representation of words, in turn, is intertwined with and mutually interdependent with its knowledge of episodes. Thus while LSA’s potential
knowledge is surely imperfect, we believe it can offer a close enough approximation to people's knowledge to underwrite theories and tests of theories of cognition. (One might consider LSA's maximal knowledge of the world to be analogous to a well-read nun's knowledge of sex, a level of knowledge often deemed a sufficient basis for advising the young.) However, LSA as currently practiced has some additional limitations. It makes no use of word order, and thus of syntactic relations or logic, or of morphology. Remarkably, it manages to extract correct reflections of passage and word meanings quite well without these aids, but it must still be suspected of occasional incompleteness or error. LSA differs from some statistical approaches discussed in other articles in this issue and elsewhere in two significant respects. First, the input data "associations" from which LSA induces representations are between unitary expressions of meaning—words and complete meaningful utterances in which they occur—rather than between successive words. That is, LSA uses as its initial data not just the summed contiguous pairwise (or tuple-wise) co-occurrences of words but the detailed patterns of occurrences of very many words over very large numbers of local meaning-bearing contexts, such as sentences or paragraphs, treated as unitary wholes. Thus it skips over how the order of words produces the meaning of a sentence to capture only how differences in word choice and differences in passage meanings are related. Another way to think of this is that LSA represents the meaning of a word as a kind of average of the meaning of all the passages in which it appears, and the meaning of a passage as a kind of average of the meaning of all the words it contains. LSA's ability to simultaneously—conjointly—derive representations of these two interrelated kinds of meaning depends on an aspect of its mathematical machinery that is its second important property.
LSA assumes that the choice of dimensionality
in which all of the local word-context relations are simultaneously represented can be of great importance, and that reducing the dimensionality (the number of parameters by which a word or passage is described) of the observed data from the number of initial contexts to a much smaller—but still large—number will often produce much better approximations to human cognitive relations. It is this dimensionality reduction step, the combining of surface information into a deeper abstraction, that captures the mutual implications of words and passages. Thus, an important component of applying the technique is finding the optimal dimensionality for the final representation. A possible interpretation of this step, in terms more familiar to researchers in psycholinguistics, is that the resulting dimensions of description are analogous to the semantic features often postulated as the basis of word meaning, although establishing concrete relations to mentalistically interpretable features poses daunting technical and conceptual problems and has not yet been much attempted. Finally, LSA, unlike many other methods, employs a preprocessing step in which the overall distribution of a word over its usage contexts, independent of its correlations with other words, is first taken into account; pragmatically, this step improves LSA's results considerably. However, as mentioned previously, there is another, quite different way to think about LSA. Landauer and Dumais (1997) have proposed that LSA constitutes a fundamental computational theory of the acquisition and representation of knowledge. They maintain that its underlying mechanism can account for a longstanding and important mystery, the inductive property of learning by which people acquire much more knowledge than appears to be available in experience, the infamous problem of the "insufficiency of evidence" or "poverty of the stimulus."
The LSA mechanism that solves the problem consists simply of accommodating a very large number of local co-occurrence relations (between the right kinds of observational units) simultaneously in a space of the right dimensionality.
Hypothetically, the optimal space for the reconstruction has the same dimensionality as the source that generates discourse, that is, the human speaker or writer's semantic space. Naturally observed surface co-occurrences between words and contexts have as many defining dimensions as there are words or contexts. To approximate a source space with fewer dimensions, the analyst, either human or LSA, must extract information about how objects can be well defined by a smaller set of common dimensions. This can best be accomplished by an analysis that accommodates all of the pairwise observational data in a space of the same lower dimensionality as the source. LSA does this by a matrix decomposition performed by a computer algorithm, an analysis that captures much indirect information contained in the myriad constraints, structural relations and mutual entailments latent in the local observations available to experience. The principal support for these claims has come from using LSA to derive measures of the similarity of meaning of words from text. The results have shown that: (1) the meaning similarities so derived closely match those of humans, (2) LSA's rate of acquisition of such knowledge from text approximates that of humans, and (3) these accomplishments depend strongly on the dimensionality of the representation. In this and other ways, LSA performs a powerful and, by the human-comparison standard, correct induction of knowledge. Using representations so derived, it simulates a variety of other cognitive phenomena that depend on word and passage meaning. The case for or against LSA's psychological reality is certainly still open. However, especially in view of the success to date of LSA and related models, it cannot be settled by theoretical presuppositions about the nature of mental processes (such as the presumption, popular in some quarters, that the statistics of experience are an insufficient source of knowledge).
Thus, we propose to researchers in discourse processing not only that they use LSA to expedite their investigations, but
that they join in the project of testing, developing and exploring its fundamental theoretical implications and limits. What is LSA? LSA is a fully automatic mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse. It is not a traditional natural language processing or artificial intelligence program; it uses no humanly constructed dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, morphologies, or the like, and takes as its input only raw text parsed into words (defined as unique character strings) and separated into meaningful passages or samples such as sentences or paragraphs. The first step is to represent the text as a matrix in which each row stands for a unique word and each column stands for a text passage or other context. Each cell contains the frequency with which the word of its row appears in the passage denoted by its column. Next, the cell entries are subjected to a preliminary transformation, whose details we will describe later, in which each cell frequency is weighted by a function that expresses both the word's importance in the particular passage and the degree to which the word type carries information in the domain of discourse in general. Next, LSA applies singular value decomposition (SVD) to the matrix. This is a form of factor analysis, or more properly the mathematical generalization of which factor analysis is a special case. In SVD, a rectangular matrix is decomposed into the product of three other matrices. One component matrix describes the original row entities as vectors of derived orthogonal factor values, another describes the original column entities in the same way, and the third is a diagonal matrix containing scaling values such that when the three components are matrix-multiplied, the original matrix is reconstructed.
There is a mathematical proof that any matrix can be so decomposed perfectly, using no more factors than the smallest dimension of the original matrix. When fewer than the necessary number of factors are used, the reconstructed matrix
is a least-squares best fit. One can reduce the dimensionality of the solution simply by deleting coefficients in the diagonal matrix, ordinarily starting with the smallest. (In practice, for computational reasons, only a limited number of dimensions, currently a few thousand, can be constructed for very large corpora.) Here is a small example that gives the flavor of the analysis and demonstrates what the technique accomplishes. This example uses as text passages the titles of nine technical memoranda, five about human computer interaction (HCI), and four about mathematical graph theory, topics that are conceptually rather disjoint. Thus the original matrix has nine columns, and we have given it 12 rows, each corresponding to a content word used in at least two of the titles. The titles, with the extracted terms italicized, and the corresponding word-by-document matrix are shown in Figure 2.1.¹ We will discuss the highlighted parts of the tables in due course. The linear decomposition is shown next in Figure 2.2; except for rounding errors, its multiplication perfectly reconstructs the original as illustrated. Next we show a reconstruction based on just two dimensions (Figure 2.3) that approximates the original matrix. This uses vector elements only from the first two, shaded, columns of the three matrices shown in the previous figure (which is equivalent to setting all but the highest two values in S to zero). Each value in this new representation has been computed as a linear combination of values on the two retained dimensions, which in turn were computed as linear combinations of the original cell values. Note, therefore, that if we were to change the entry in any one cell of the original, the values in the reconstruction with reduced dimensions might be changed everywhere; this is the mathematical sense in which LSA performs inference or induction.
¹ This example has been used in several previous publications (e.g., Deerwester et al., 1990; Landauer & Dumais, in press).
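The decomposition and rank-2 reconstruction just described can be carried out directly with a standard SVD routine. The following is a minimal numpy sketch (not the software used in this work); the matrix values follow Figure 2.1:

```python
import numpy as np

# Word-by-document count matrix X from Figure 2.1 (rows: human, interface,
# computer, user, system, response, time, EPS, survey, trees, graph, minors;
# columns: titles c1-c5, m1-m4).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

# Full singular value decomposition: X = W diag(S) P'
W, S, Pt = np.linalg.svd(X, full_matrices=False)

# Keeping only the two largest singular values yields the least-squares
# best rank-2 approximation of X.
k = 2
X_hat = W[:, :k] @ np.diag(S[:k]) @ Pt[:k, :]

# The reduced matrix changes entries everywhere: the human and user rows,
# which never co-occur in X, now correlate strongly.
r_raw = np.corrcoef(X[0], X[3])[0, 1]      # approx -.38, as in Figure 2.1
r_hat = np.corrcoef(X_hat[0], X_hat[3])[0, 1]  # approx .94, as in Figure 2.3
```

Because the SVD is unique up to sign and rotation within equal singular values, individual coordinates may differ in sign from the published tables, but the reconstruction and correlations do not.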
Example of text data: Titles of Some Technical Memos

c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement

m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey

{X} =
            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1

r (human.user) = -.38     r (human.minors) = -.29

Figure 2.1. A word-by-context matrix, X, formed from the titles of five articles about human-computer interaction and four about graph theory. Cell entries are the number of times that a word (rows) appeared in a title (columns), for words that appeared in at least two titles.
{X} = {W}{S}{P}'

{W} =
human      0.22 -0.11  0.29 -0.41 -0.11 -0.34  0.52 -0.06 -0.41
interface  0.20 -0.07  0.14 -0.55  0.28  0.50 -0.07 -0.01 -0.11
computer   0.24  0.04 -0.16 -0.59 -0.11 -0.25 -0.30  0.06  0.49
user       0.40  0.06 -0.34  0.10  0.33  0.38  0.00  0.00  0.01
system     0.64 -0.17  0.36  0.33 -0.16 -0.21 -0.17  0.03  0.27
response   0.27  0.11 -0.43  0.07  0.08 -0.17  0.28 -0.02 -0.05
time       0.27  0.11 -0.43  0.07  0.08 -0.17  0.28 -0.02 -0.05
EPS        0.30 -0.14  0.33  0.19  0.11  0.27  0.03 -0.02 -0.17
survey     0.21  0.27 -0.18 -0.03 -0.54  0.08 -0.47 -0.04 -0.58
trees      0.01  0.49  0.23  0.03  0.59 -0.39 -0.29  0.25 -0.23
graph      0.04  0.62  0.22  0.00 -0.07  0.11  0.16 -0.68  0.23
minors     0.03  0.45  0.14 -0.01 -0.30  0.28  0.34  0.68  0.18

{S} = diag(3.34  2.54  2.35  1.64  1.50  1.31  0.85  0.56  0.36)

{P} =
c1   0.20 -0.06  0.11 -0.95  0.05 -0.08  0.18 -0.01 -0.06
c2   0.61  0.17 -0.50 -0.03 -0.21 -0.26 -0.43  0.05  0.24
c3   0.46 -0.13  0.21  0.04  0.38  0.72 -0.24  0.01  0.02
c4   0.54 -0.23  0.57  0.27 -0.21 -0.37  0.26 -0.02 -0.08
c5   0.28  0.11 -0.51  0.15  0.33  0.03  0.67 -0.06 -0.26
m1   0.00  0.19  0.10  0.02  0.39 -0.30 -0.34  0.45 -0.62
m2   0.01  0.44  0.19  0.02  0.35 -0.21 -0.15 -0.76  0.02
m3   0.02  0.62  0.25  0.01  0.15  0.00  0.25  0.45  0.52
m4   0.08  0.53  0.08 -0.03 -0.60  0.36  0.04 -0.07 -0.45

Figure 2.2. Complete SVD of the matrix in Figure 2.1. Rows of {W} are word vectors and rows of {P} are title vectors. [The first two columns of each matrix, shaded in the original figure, are those retained in the two-dimensional reconstruction.]
{X̂} =
            c1    c2    c3    c4    c5    m1    m2    m3    m4
human      0.16  0.40  0.38  0.47  0.18 -0.05 -0.12 -0.16 -0.09
interface  0.14  0.37  0.33  0.40  0.16 -0.03 -0.07 -0.10 -0.04
computer   0.15  0.51  0.36  0.41  0.24  0.02  0.06  0.09  0.12
user       0.26  0.84  0.61  0.70  0.39  0.03  0.08  0.12  0.19
system     0.45  1.23  1.05  1.27  0.56 -0.07 -0.15 -0.21 -0.05
response   0.16  0.58  0.38  0.42  0.28  0.06  0.13  0.19  0.22
time       0.16  0.58  0.38  0.42  0.28  0.06  0.13  0.19  0.22
EPS        0.22  0.55  0.51  0.63  0.24 -0.07 -0.14 -0.20 -0.11
survey     0.10  0.53  0.23  0.21  0.27  0.14  0.31  0.44  0.42
trees     -0.06  0.23 -0.14 -0.27  0.14  0.24  0.55  0.77  0.66
graph     -0.06  0.34 -0.15 -0.30  0.20  0.31  0.69  0.98  0.85
minors    -0.04  0.25 -0.10 -0.21  0.15  0.22  0.50  0.71  0.62

r (human.user) = .94     r (human.minors) = -.83

Figure 2.3. Two-dimensional reconstruction of the original matrix shown in Fig. 2.1, based on the shaded columns and rows from the SVD shown in Fig. 2.2. Comparing shaded and boxed rows and cells of Figs. 2.1 and 2.3 illustrates how LSA induces similarity relations by changing estimated entries up or down to accommodate mutual constraints in the data.
The dimension reduction step has collapsed the component matrices in such a way that words that occurred in some contexts now appear with greater or lesser estimated frequency, and some that did not appear originally now do appear, at least fractionally. Look at the two shaded cells for survey and trees in column m4. The word trees did not appear in this graph theory title. But because m4 did contain graph and minors, the zero entry for trees has been replaced with 0.66, which can be viewed as an estimate of how many times it would occur in each of an infinite sample of titles containing graph and minors. By contrast, the value 1.00 for survey, which appeared once in m4, has been replaced by 0.42, reflecting the fact that it is unexpected in this context and should be counted as unimportant in characterizing the passage. Very roughly and anthropomorphically, in constructing the reduced-dimensional representation, SVD, with only values along two orthogonal dimensions to go on, has to estimate what words actually appear in each context by using only the information it has extracted. It does that by saying the following: “This text segment is best described as having so much of abstract concept one and so much of abstract concept two, and this word has so much of concept one and so much of concept two; combining those two pieces of information (by vector arithmetic), my best guess is that word X actually appeared 0.6 times in context Y.” Now let us consider what such changes may do to the imputed relations between words or between multi-word textual passages. For two examples of word-word relations, compare the shaded and/or boxed rows for the words human, user and minors (in this context, minor is a technical term from graph theory) in the original and in the two-dimensionally reconstructed matrices (Figures 2.1 and 2.3). In the original, human never appears in the same passage with either user or minors—they have no co-occurrences, contiguities or “associations” as often construed. The correlations (using Pearson r to facilitate familiar interpretation) are -.38 between human and user, and a slightly higher -.29 between human and minors. However, in the reconstructed two-dimensional approximation, because of their indirect relations, both have been greatly altered: the human-user correlation has gone up to .94, the human-minors correlation down to -.83. Thus, because the terms human and user
occur in contexts of similar meaning—even though never in the same passage—the reduced dimension solution represents them as more similar, while the opposite is true of human and minors. To examine what the dimension reduction has done to relations between titles, we computed the intercorrelations between each title and all the others, first based on the raw co-occurrence data, then on the corresponding vectors representing titles in the two-dimensional reconstruction; see Figure 2.4. Correlations between titles in raw data:
       c2     c3     c4     c5     m1     m2     m3     m4
c1   -0.19   0.00   0.00  -0.33  -0.17  -0.26  -0.33  -0.33
c2           0.00   0.00   0.58  -0.30  -0.45  -0.58  -0.19
c3                  0.47   0.00  -0.21  -0.32  -0.41  -0.41
c4                        -0.31  -0.16  -0.24  -0.31  -0.31
c5                               -0.17  -0.26  -0.33  -0.33
m1                                       0.67   0.52  -0.17
m2                                              0.77   0.26
m3                                                     0.56

(Mean rs: among HCI titles = .02; among graph theory titles = .44; between topics = -.30.)

Correlations in two-dimensional space:

       c2     c3     c4     c5     m1     m2     m3     m4
c1    0.91   1.00   1.00   0.85  -0.85  -0.85  -0.85  -0.81
c2           0.91   0.88   0.99  -0.56  -0.56  -0.56  -0.50
c3                  1.00   0.85  -0.85  -0.85  -0.85  -0.81
c4                         0.81  -0.88  -0.88  -0.88  -0.84
c5                               -0.45  -0.44  -0.44  -0.37
m1                                       1.00   1.00   1.00
m2                                              1.00   1.00
m3                                                     1.00

(Mean rs: among HCI titles = .92; among graph theory titles = 1.00; between topics = -.72.)

Figure 2.4. Intercorrelations among vectors representing titles (averages of the vectors of the words they contain) in the original full-dimensional source data of Fig. 2.1 and in
the two-dimensional reconstruction of Fig. 2.3 illustrate how LSA induces passage similarity. In the raw co-occurrence data, correlations among the five human-computer interaction titles were generally low, even though all the papers were ostensibly about quite similar topics; half the rs were zero, three were negative, two were moderately positive, and the average was only .02. The correlations among the four graph theory papers were mixed, with a moderate mean r of .44. Correlations between the HCI and graph theory papers averaged only a modest -.30 despite the minimal conceptual overlap of the two topics. In the two-dimensional reconstruction the topical groupings are much clearer. Most dramatically, the average r between HCI titles increases from .02 to .92. This happened not because the HCI titles were generally similar to each other in the raw data, which they were not, but because they contrasted with the non-HCI titles in the same ways. Similarly, the correlations among the graph theory titles were re-estimated to be all 1.00, and those between the two classes of topic were now strongly negative, mean r = -.72. Thus, SVD has performed a number of reasonable inductions; it has inferred what the true pattern of occurrences and relations must be for the words in titles if all the original data are to be accommodated in two dimensions. In this case, the inferences appear to be intuitively sensible. Note that much of the information that LSA used to infer relations among words and passages is in data about passages in which particular words did not occur. Indeed, Landauer and Dumais (1997) found that in LSA simulations of schoolchild word knowledge acquisition, about three-fourths of the gain in total comprehension vocabulary that results from reading a paragraph is indirectly inferred knowledge about words not in the paragraph at all, a result that offers an explanation of children's otherwise inexplicably rapid growth of
vocabulary. A rough analogy of how this can happen is as follows. Read the following sentence: John is Bob's father and Mary is Ann's mother. Now read this one: Mary is Bob's mother. Now, because of the relations between the words mother, father, son, daughter, brother and sister that you already knew, adding the second sentence probably tended to make you think that Bob and Ann were brother and sister, Ann the daughter of John, John the father of Ann, and Bob the son of Mary, even though none of these relations is explicitly expressed (and none follows necessarily from the presumed formal rules of English kinship naming). The relationships inferred by LSA are also not logically defined, nor are they assumed to be consciously rationalizable as these could be. Instead, they are relations only of similarity—or of context-sensitive similarity—but they nevertheless have mutual entailments of the same general nature, and also give rise to fuzzy indirect inferences that may be weak or strong and logically right or wrong. Why, and under what circumstances, should reducing the dimensionality of representation be beneficial? When, in general, will such inferences be better than the original first-order data? We hypothesize that one such case is when the original data are generated from a source of the same dimensionality and general structure as the reconstruction. Suppose, for example, that speakers or writers generate paragraphs by choosing words from a k-dimensional space in such a way that words in the same paragraph tend to be selected from nearby locations. If listeners or readers try to infer the similarity of meaning from these data, they will do better if they reconstruct the
full set of relations in the same number of dimensions as the source. Among other things, given the right analysis, this will allow the system to infer that two words from nearby locations in semantic space have similar meanings even though they are never used in the same passage, or that they have quite different meanings even though they often occur in the same utterances. The number of dimensions retained in LSA is an empirical issue. Because the underlying principle is that the original data should not be perfectly regenerated but, rather, an optimal dimensionality should be found that will cause correct induction of underlying relations, the customary factor-analytic approach of choosing a dimensionality that most parsimoniously represents the true variance of the original data is not appropriate. Instead some external criterion of validity is sought, such as performance on a synonym test or prediction of the missing words in passages if some portion is deleted in forming the initial matrix. (See Britton & Sorrells, this issue, for another approach to determining the correct dimensions for representing knowledge.) Finally, the measure of similarity computed in the reduced dimensional space is usually, but not always, the cosine between vectors. Empirically, this measure tends to work well, and there are some weak theoretical grounds for preferring it (see Landauer & Dumais, 1997). Sometimes we have found the additional use of the length of LSA vectors, which reflects how much was said about a topic rather than how central the discourse was to the topic, to be useful as well (see Rehder et al., this volume). Additional detail about LSA. As mentioned, one additional part of the analysis, the data preprocessing transformation, needs to be described more fully. Before the SVD is computed, it is customary in LSA to subject the data in the raw word-by-context matrix to a two-part transformation.
First, the word frequency (+1) in each cell is converted to its log. Second, the information-theoretic measure, entropy, of each word is computed as -Σ p log p over all entries in its row, and each cell entry is then divided by the row entropy value. The effect of this transformation is to weight each word-type occurrence directly by an estimate of its importance in the passage and inversely with the degree to which knowing that a word occurs provides information about which passage it appeared in. Transforms of this or similar kinds have long been known to provide marked improvement in information retrieval (Harman, 1986), and have been found important in several applications of LSA. They are probably most important for correctly representing a passage as a combination of the words it contains, because they emphasize specific meaning-bearing words. Readers are referred to more complete treatments for more on the underlying mathematical, computational, software and application aspects of LSA (see Berry, 1992; Berry, Dumais & O'Brien, 1995; Deerwester, Dumais, Furnas, Landauer & Harshman, 1990; Landauer & Dumais, 1997; http://superbook.bellcore.com/~std/LSI.papers.html). http://lsa.colorado.edu/ is a WWW site at which investigators can enter words or passages and obtain LSA-based word or passage vectors, similarities between words and words, words and passages, and passages and passages, and perform a few other related operations. The site offers results based on several different training corpora, including an encyclopedia, a grade- and topic-partitioned collection of schoolchild reading, newspaper text in several languages, introductory psychology textbooks, and a small domain-specific corpus of text about the heart. To carry out LSA research based on their own training corpora, investigators will need to consult the more detailed sources (see Appendix).

LSA's Ability to Model Human Conceptual Knowledge. How well does LSA actually work as a representational model and measure of human verbal concepts?
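The two-part preprocessing transformation described above (log of cell frequency plus one, divided by row entropy) can be sketched as follows. This is a minimal numpy sketch, not the production code used in this work; the small epsilon guard against zero-entropy rows is an added practical assumption:

```python
import numpy as np

def log_entropy_transform(X):
    """Apply LSA's two-part preprocessing to a word-by-context count
    matrix: log(count + 1) in each cell, divided by the entropy of the
    word's distribution over contexts (its row)."""
    logged = np.log(X + 1.0)
    # Probability that an occurrence of each word falls in each context.
    row_sums = X.sum(axis=1, keepdims=True)
    p = np.where(row_sums > 0, X / np.maximum(row_sums, 1e-12), 0.0)
    # Entropy -sum(p log p) per row, with 0 log 0 taken as 0.
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    entropy = -plogp.sum(axis=1, keepdims=True)
    # Words concentrated in few contexts have low entropy and are thereby
    # up-weighted; evenly spread (uninformative) words are down-weighted.
    # The epsilon floor guards rows with zero entropy (a word occurring in
    # a single context), which the formula as stated does not cover.
    return logged / np.maximum(entropy, 1e-12)
```

As a check, a word occurring once in each of two contexts has row entropy log 2, so its transformed entries are log(2)/log(2) = 1.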
Its performance has so far been assessed more or less rigorously in eight ways: (1) as a predictor of query-document topic similarity judgments; (2) as a simulation of agreed
upon word-word relations and of human vocabulary test synonym judgments, (3) as a simulation of human choices on subject-matter multiple-choice tests, (4) as a predictor of text coherence and resulting comprehension, (5) as a simulation of word-word and passage-word relations found in lexical priming experiments, (6) as a predictor of subjective ratings of text properties, i.e. grades assigned to essays, and (7) as a predictor of appropriate matches of instructional text to learners. (8) It has also been used with good results to mimic synonym, antonym, singular-plural and compound-component word relations and aspects of some classical word sorting studies, to simulate aspects of imputed human representation of single digits, and, in pilot studies, to replicate semantic categorical clusterings of words found in certain neuropsychological deficits (Laham, 1997a). Kintsch (in press) has also used LSA-derived meaning representations to demonstrate their possible role in construction-integration-theoretic accounts of sentence comprehension, metaphor and context effects in decision making. We will take space here to review only some of the most systematic and pertinent of these results. LSA and information retrieval. Anderson (1990) has called attention to the analogy between information retrieval and human semantic memory processes. One way of expressing their commonality is to think of a searcher as having in mind a certain meaning, which he or she expresses in words, and the system as trying to find a text with the same meaning. Success, then, depends on the system representing query and text meaning in a manner that correctly reflects their similarity for the human. LSI does this better than systems that depend on literal matches between terms in queries and documents. Its superiority can often be traced to its ability to correctly match queries to (and only to) documents of similar topical meaning when query and document use different words.
In the text-processing problem to which it was first applied, automatic matching of information requests to document abstracts, SVD provides a significant improvement over prior methods. In this application, the text of the document database is first represented as a matrix of terms by documents (documents are usually represented by a surrogate such as a title, abstract and/or keyword list) and subjected to SVD, and each word and document is represented as a reduced-dimensionality vector, usually with 50-400 dimensions. A query is represented as a “pseudo-document,” a weighted average of the vectors of the words it contains. (A document vector in the SVD solution is also a weighted average of the vectors of the words it contains, and a word vector a weighted average of the vectors of the documents in which it appears.) The first tests of LSI (Latent Semantic Indexing, LSA's alias in this application) were against standard collections of documents for which representative queries have been obtained and knowledgeable humans have more or less exhaustively examined the whole database and judged which abstracts are and are not relevant to the topic described in each query statement. In these standard collections LSI's performance ranged from just equivalent to the best prior methods up to about 30% better. In a recent project sponsored by the National Institute of Standards and Technology, LSI was compared with a large number of other research prototypes and commercial retrieval schemes. Direct quantitative comparisons among the many systems were somewhat muddied by the use of varying amounts of preprocessing—things like getting rid of typographical errors, identifying proper nouns as special, differences in stop lists, and the amount of tuning that systems were given before the final test runs. Nevertheless, the results appeared to be quite similar to earlier ones. Compared to the standard vector method (essentially LSI without dimension reduction), ceteris paribus, LSI was a 16% improvement (Dumais, 1994).
LSI has also been used successfully to match reviewers with papers to be reviewed based on samples of the reviewers’ own papers (Dumais & Nielsen, 1992), and to select papers for researchers to read based on other papers they have liked (Foltz and Dumais, 1992).
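The pseudo-document mechanism described above can be sketched in a few lines. The corpus, vocabulary and query below are made up for illustration, and the folding-in convention shown (averaging singular-value-scaled term vectors) is one common choice whose details vary across LSI implementations:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy term-by-document counts (a stand-in corpus, not real LSI data).
terms = ["human", "computer", "trees", "graph"]
vocab = {t: i for i, t in enumerate(terms)}
X = np.array([[1, 1, 0, 0],   # "human" appears in d0, d1
              [1, 0, 1, 0],   # "computer" in d0, d2
              [0, 0, 1, 1],   # "trees" in d2, d3
              [0, 0, 1, 1]],  # "graph" in d2, d3
             dtype=float)

W, S, Pt = np.linalg.svd(X, full_matrices=False)
k = 2
# Document vectors in the reduced space: scaled right singular vectors.
doc_vecs = (np.diag(S[:k]) @ Pt[:k, :]).T

def query_vector(query_terms):
    """Fold a query in as a pseudo-document: the average of its known
    words' scaled term vectors."""
    rows = [vocab[t] for t in query_terms if t in vocab]
    if not rows:
        raise ValueError("no query terms in vocabulary")
    return (W[rows, :k] * S[:k]).mean(axis=0)

# Rank documents by cosine similarity to the query.
q = query_vector(["human", "computer"])
ranking = sorted(range(doc_vecs.shape[0]),
                 key=lambda i: -cosine(q, doc_vecs[i]))
```

With this toy data the query about human-computer topics ranks the HCI-like documents above the graph-theory-only one, even though no literal term matching is performed at retrieval time.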
LSA and synonym tests. It is claimed that LSA, on average, represents words of similar meaning in similar ways. When one compares words with similar vectors as derived from large text corpora, the claim is largely but not entirely fulfilled at an intuitive level. Most very near neighbors (the cosine defining a near neighbor is a relative value that depends on the training database and the number of dimensions) appear closely related in some manner. In one scaling (an LSA/SVD analysis) of an encyclopedia, “physician,” “patient,” and “bedside” were all close to one another, cos > .5. In a sample of triples from a synonym and antonym dictionary, both synonym and antonym pairs had cosines of about .18, more than 12 times as large as between unrelated words from the same set. A sample of singular-plural pairs showed somewhat greater similarity than the synonyms and antonyms, and compound words were similar to their component words to about the same degree, more so if rated analyzable. Nonetheless, the relationship between some close neighbors in LSA space can occasionally be mysterious (e.g., “verbally” and “sadomasochism” with a cosine of .8 from the encyclopedia space), and some pairs that should be close are not. It is impossible to say exactly why these oddities occur, but it is plausible that some words that have more than one contextual meaning receive a sort of average high-dimensional placement that out of context signifies nothing, and that many words are sampled too thinly to be well placed. It must be borne in mind that most of the training corpora used to date correspond in size approximately to the printed word exposure (only) of a single average 9th grade student, and individual humans also have frequent oddities in their understanding of particular words.
(Investigators who use LSA vectors should keep these factors in mind: the similarities should be expected to reflect human similarities only when averaged over many word or passage pairs of a particular type and when compared to averages across a number of people; they will not always give sensible results when applied to the particular words
in a particular sentence.) It is also likely, of course, that LSA's "bag of words" method, which ignores all syntactic, logical and nonlinguistic pragmatic entailments, sometimes misses meaning or gets it scrambled. To measure objectively how well LSA captures synonymy, compared to people, its knowledge of synonyms was assessed with a standard vocabulary test. The 80-item test was taken from retired versions of the Educational Testing Service (ETS) Test of English as a Foreign Language (TOEFL; for which we are indebted to Larry Frase and ETS). To make these comparisons, LSA was trained by running the SVD analysis on a large corpus of representative English. In various studies, both collections of newspaper text from the Associated Press news wire and Grolier's Academic American Encyclopedia, a work intended for students, have been used. In the most successful experiment, an SVD was performed on text segments consisting of 500 characters or less (on average 73 words, about a paragraph's worth) taken from the beginning portions of each of 30,473 articles in the encyclopedia, a total of 4.5 million words of text, roughly equivalent to what a child would have read by the end of eighth grade. This resulted in a vector for each of 60,000 words. The TOEFL vocabulary test consists of items in which the question part is usually a single word, and there are four alternative answers, usually single words, from which the test taker is supposed to choose the one most similar in meaning. To simulate human performance, the cosine between the question word and each alternative was calculated, and the LSA model chose the alternative closest to the stem. For six test items for which the model had never met either the stem word and/or the correct alternative, it guessed with probability .25. Scored this way, LSA got 65% correct, identical to the average score of a large sample of students applying for college entrance in the United States from non-English-speaking countries.
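The multiple-choice procedure just described (choose the alternative with the highest cosine to the stem; guess when the needed words are missing from the trained space) can be sketched as follows. The vectors here are made-up three-dimensional stand-ins, not real LSA output:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_item(stem, alternatives, vectors, rng=None):
    """Pick the alternative whose vector has the highest cosine with the
    stem's vector; guess uniformly when the stem or every alternative is
    missing from the trained space (as was done for six TOEFL items)."""
    rng = rng or np.random.default_rng()
    known = [a for a in alternatives if a in vectors]
    if stem not in vectors or not known:
        return alternatives[rng.integers(len(alternatives))]
    return max(known, key=lambda a: cosine(vectors[stem], vectors[a]))

# Illustrative (made-up) vectors, not real LSA output.
vectors = {
    "physician": np.array([0.9, 0.3, 0.1]),
    "doctor":    np.array([0.8, 0.4, 0.2]),
    "nurse":     np.array([0.7, 0.5, 0.1]),
    "tree":      np.array([0.0, 0.1, 0.9]),
}
choice = answer_item("physician", ["doctor", "tree"], vectors)
```

With these stand-in vectors the model picks "doctor", the semantically nearer alternative; with real LSA vectors, as noted below, near-synonyms and strong associates can compete closely.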
The detailed pattern of errors of LSA was also compared to that of students. For each question a product-moment correlation coefficient was computed between
(i) the cosine of the stem and each alternative and (ii) the proportion of guesses for each alternative for a large sample of students (n > 1,000 for every test item). The average correlation across the 80 items was .70. Excluding the correct alternative, the average correlation was .44. These correlations may be thought of as between one test-taker (LSA) and group norms, which would also be much less than perfect for humans. When LSA chose wrongly and most students chose correctly, it sometimes appeared to be because LSA is more sensitive to contextual or paradigmatic associations and less to contrastive semantic or syntagmatic features. For example, LSA slightly preferred “nurse” (cos = .47) to “doctor” (cos = .41) as an associate for “physician.” To assess the role of dimension reduction, the number of dimensions was varied from 2 to 1,032 (the largest number for which SVD was computationally feasible). On log-linear coordinates, the TOEFL test results showed a very sharp and highly significant peak (Figure 2.5). Corrected for guessing by the standard formula ((correct - chance)/(1 - chance)), LSA got 52.7% correct with 300 and 325 dimensions, but only 13.5% correct with just two or three dimensions.
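The standard guessing correction used above is a one-line formula; the sketch below also inverts it to relate the corrected scores to the raw percentages reported earlier (an arithmetic check, not new data):

```python
def corrected_for_guessing(proportion_correct, chance=0.25):
    """Standard correction for guessing: (correct - chance) / (1 - chance)."""
    return (proportion_correct - chance) / (1 - chance)

# A raw 65% correct on four-alternative items corresponds to roughly 53%
# after correction...
corrected = corrected_for_guessing(0.65)
# ...and the corrected 52.7% at the optimal dimensionality implies a raw
# score of about 64.5% (inverting the formula).
raw_at_optimum = 0.527 * (1 - 0.25) + 0.25
```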
Figure 2.5. The effect of the number of dimensions in an LSA corpus-based representation of meaning on performance on a synonym test (from the ETS Test of English as a Foreign Language). The measure is the proportion of 80 multiple-choice items answered correctly, after standard correction for guessing. The point for the highest dimensionality is equivalent to a first-order co-occurrence correlation.
When there was no dimension reduction at all (equivalent to choosing correct answers by the correlation of transformed co-occurrence frequencies of words over encyclopedia passages), LSA got just 15.8% correct. At optimal dimensionality, LSA chose approximately three times as many right answers as would be obtained by ordinary first-order correlations over the input, even after a transformation that greatly improves the relation. This demonstrates conclusively that the LSA dimension reduction technique captures much more than mere co-occurrence (simply choosing the alternative that co-occurs with the stem in the largest number of corpus paragraphs gets only 11% right when corrected for guessing). More importantly for our argument, it implies that indirect associations or structural relations induced by analysis of the whole corpus are involved in LSA's success with individual words. Thus, correct representation of any one word depends on the simultaneous correct representation of many, perhaps all, other words. As mentioned earlier, Landauer and Dumais (1997) also estimated, by a different method, the relative direct and indirect effects of adding a new paragraph to LSA's “experience”. For example, at a point in LSA's learning roughly corresponding to the amount of text read by late primary school, an imaginary test of all words in the language—the model's imputed total recognition vocabulary—gains about three times as much knowledge about words not in the new paragraph as about words actually contained in the paragraph.
Landauer and Dumais (1997) also found that the rate of gain in vocabulary by LSA was approximately equal to the rate of gain of “known”, as compared to morphologically inferred, words empirically estimated by Anglin (1995) and others for primary school children. Simulating word sorting and relatedness judgments. Recently, Darrell Laham and Tom Landauer explored the relation between LSA and human lexical semantic representations further by simulating a classic word sorting study by Anglin (1970). In Anglin’s experiments third and fourth grade children and adults were given sets of selected words to sort by meaning into as many piles as they wished. The word sets contained subsets of nouns, verbs, prepositions and adjectives, and within each subset there were words taken from common conceptual hierarchies, such as boy, girl, horse, flower, among which clustering could reveal the participant’s tendency to use abstract versus concrete similarity relations. Anglin measured the semantic similarity of every pair of words by the proportion of subjects who put them in the same pile. He found that parts of speech clustered moderately in both child and adult sets, and, confirming the hypothesis behind the study, that adults showed more evidence of use of abstract categories than did children. Laham and Landauer measured the similarity between the same word pairs by cosines based on 5 grade-partitioned scalings of samples of schoolchild reading—3rd, 6th, 9th, 12th grade and college. For each scaling, the cosine between each word pair in the set (20 words for 190 comparisons) was calculated. The overall correlation of the LSA estimates and the grouped human data, for both child and adult, rose as the number of documents included in the scaling rose. Using the third grade scaling, the correlation between the LSA estimates and the child data was .50, with the adult data .35. 
Using the college level scaling the correlation between LSA estimates and the child data was .61, with adults .50. The correlation coefficients between LSA
estimates and human data showed a monotonic linear rise as the grade level (and number of documents known to LSA) increased. LSA exhibited differences in similarities across degrees of abstraction much like those found by Anglin; for the third grade scaling, the average correlations in patterns across means for the comparable levels within each part-of-speech class were r = .80 with children and r = .75 with adults; for the college level scaling, r = .90 with both children and adults. The correlation between the adult and child patterns was .95. The LSA measure did not separate word classes nearly as strongly as did the human data, nor did it as clearly distinguish within part-of-speech from between part-of-speech comparisons. For the third grade scaling, the overall (N = 190) average cosine = .13, the average within part-of-speech cosine (N = 41) = .15, and the average between part-of-speech cosine (N = 149) = .13. The college level scaling showed stronger similarities within class, with an overall average cosine = .19, an average within part-of-speech cosine = .23, and an average between part-of-speech cosine = .17. As in the vocabulary acquisition simulations, it appears that the relations obtained from a corpus of small size relative to a typical adult's cumulative language exposure resemble children's somewhat more than adults'. LSA's weak reflection of word class in this rather small sample of data would appear to confirm the expectation that the lack of word-order information in its input data, along with the use of fairly large passages as the context units, prevents it from inducing grammatical relations among words. (Wolfe et al., this issue, report further word sorting results. Also compare Burgess et al., this issue.) Simulating subject-matter knowledge.
In three investigations by Foltz and by Laham and Landauer to be reported elsewhere, LSA has been trained on the text of introductory psychology textbooks, then tested with multiple choice tests provided by the textbook publishers. LSA performed well above chance in all cases, and in all
cases did significantly better on questions rated “easy” than on ones rated “difficult”, and on items classified as “factual” than on ones classified as “conceptual” by their authors. On questions used in university introductory psychology course exams given at New Mexico State University and the University of Colorado, Boulder, LSA scored significantly worse than class averages, but in every case did well enough to receive a passing grade according to the class grading scheme. In related work, Foltz, Britt and Perfetti (1996) used LSA to model the knowledge structures of both expert and novice subjects who had read a large number of documents on the history of the Panama Canal. After reading the documents, subjects made judgments of the relatedness of 120 pairs of concepts that were mentioned in the documents. Based on an LSA scaling of the documents, the cosines between the concepts were used to estimate the relatedness of the concept pairs. The LSA predictions correlated significantly with the subjects' judgments, with the correlation stronger for the experts in the domain (r = .41) than for the novices (r = .36). (Note again that two human raters would also not correlate perfectly.) An analysis of where LSA's predictions deviated greatly from those of the humans indicated that LSA tended to underpredict more global or situational relationships that were not directly discussed in the text but would be common historical knowledge of any undergraduate. Thus in this case the limitation on LSA's predictions may simply be due to training only on a small set of documents rather than on a larger set that would capture a richer representation of history. Simulating semantic priming. Landauer and Dumais (1997) report an analysis in which LSA was used to simulate a lexical semantic priming study by Till, Mross and Kintsch (1988), in which people were presented visually with one- or two-sentence passages that ended in an obviously polysemous word.
After varying onset delays, participants made lexical decisions about words related to the homographic word or to the overall meaning of the sentence. In paired passages, each homographic
word’s meaning was biased in two different ways judged to be related to two corresponding different target words. There were two additional target words not in the passages or obviously related to the polysemous word but judged to be related to the overall meaning or “situation model” that people would derive from the passage. Here is an example of two passages and their associated target words, along with a representative control word used to establish a baseline.
“The townspeople were amazed to find that all the buildings had collapsed except the mint.” “Thinking of the amount of garlic in his dinner, the guest asked for a mint.”
Target words: money, candy, earthquake, breath
Unrelated control word: ground
In the Till et al. study, target words related to both senses of the homographic words were correctly responded to faster than unrelated control words if presented within 100 ms after the homograph. If delayed by 300 ms, only the context-appropriate associate was primed. At a one-second delay, the so-called inference words were also primed. In the LSA simulation, the cosines between the polysemous word and its two associates were computed to mimic the expected initial priming. The cosines between the two associates of the polysemous word and the sentence up to the last word preceding it were used to mimic contextual disambiguation of the homographs. The cosines between the entire passages and the inference words were computed to emulate the contextual comprehension effect on their priming. Table 2.1 shows the average results over all 27 passage pairs, with one of the above example passages shown again to illustrate the conditions simulated. The values given are the cosines between the word or passage and the target words. The
pattern of LSA similarity relations corresponds almost perfectly with the pattern of priming results; the differences corresponding to differences observed in the priming data are all significant at p < .001, and have effect sizes comparable to those in the priming study.

Table 2.1. LSA simulation of Till et al. (1988) priming study.

  mint:                                        money .21    candy .20    ground .07
  thinking amount garlic dinner guest asked:   money .15    candy .21
  entire passage:                              ground .14   earthquake .21   breath .15
The import of this result is that LSA again emulated a human behavioral relation between words and multi-word passages, and did so while representing passages simply as the vector average of their contained words. (Steinhart, 1995, obtained similar results with different words and passages.) It is surprising and important that such simple representations of whole utterances, ones that ignore word order, sentence structure, and non-linear word-word interactions, can correctly predict human behavior based on passage meaning. However, this is the second example of this property—query-abstract and abstract-abstract similarity results being the first—and there have subsequently been several more. These findings begin to suggest that word choice alone has a much more dominant role in the expression of meaning than has previously been credited (see Landauer, Laham and Foltz, 1997).
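The passage representation just described is simple enough to sketch in a few lines. The vectors below are invented for illustration; real LSA vectors have 100-350 dimensions and come from the SVD of a training corpus.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passage_vector(words, space):
    """Represent a passage as the plain average of its word vectors,
    ignoring word order, as in the simulation described above."""
    return np.mean([space[w] for w in words if w in space], axis=0)

# Toy 3-dimensional "semantic space" (all vectors invented for illustration).
space = {
    "garlic": np.array([0.9, 0.1, 0.0]),
    "dinner": np.array([0.8, 0.2, 0.1]),
    "mint":   np.array([0.5, 0.5, 0.1]),
    "candy":  np.array([0.4, 0.6, 0.0]),
    "ground": np.array([0.0, 0.1, 0.9]),
}

passage = passage_vector(["garlic", "dinner", "mint"], space)
print(cosine(passage, space["candy"]))   # context-appropriate associate
print(cosine(passage, space["ground"]))  # unrelated control word
```

With these made-up vectors the context-appropriate associate scores well above the unrelated control, mirroring the qualitative pattern of Table 2.1.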
Of course, LSA as currently constituted contains no model of the temporal dynamics of discourse comprehension. To fit the temporal findings of the Till et al. experiment one would need to assume that the combining (averaging) of word vectors into a single vector to represent the whole passage takes about a second, and that partial progress of the combining mechanism accounts for the order and times at which the priming changes occur. We hope eventually to develop dynamic LSA-based models of the word combining mechanism by which sentence and passage comprehension is accomplished. Such models will presumably incorporate LSA word representations into processes like those posited in Construction-Integration (Kintsch, 1988) or other spreading activation theories. An example of such a model would be to first compute the vector of each word, then the average vector for the two most similar words, and so forth. It seems likely that such a model would prove too simple. However, the research strategy behind the LSA effort would dictate trying the simplest models first and then complicating them, for example in the direction of the full-blown CI construction and iterative constraint satisfaction mechanisms, or even to models including hierarchical syntactic structure (presumably, automatically induced), only if and as found necessary. Assigning holistic quality scores to essay test answers. In another set of studies to be published elsewhere by Landauer, Laham and Foltz (e.g., 1997), LSA has been used to assign holistic quality scores to written answers to essay questions. Five different methods have been tried, all with good success. In all cases an LSA space was first constructed based either on the instructional text read by students or on similar text from other sources, plus the text of student essays. 
In method one, a sample of essays was first graded by instructors, then the cosine (and/or other LSA-based similarity and quantity measures) between each ungraded essay and each pre-graded essay was computed, and the new essay was assigned the average score of a small set of closely similar essays, weighted by their similarity.
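A minimal sketch of method one, assuming essays have already been mapped to LSA vectors; the vectors, human scores, and the choice of k below are all invented for illustration:

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_essay(new_vec, graded, k=3):
    """Method one (sketch): score an ungraded essay as the similarity-
    weighted average of the human scores of its k most similar
    pre-graded essays."""
    nearest = sorted(((cosine(new_vec, v), s) for v, s in graded),
                     reverse=True)[:k]
    total = sum(sim for sim, _ in nearest)
    return sum(sim * score for sim, score in nearest) / total

# Toy 2-dimensional vectors standing in for LSA essay representations;
# vectors and scores are invented for illustration.
graded = [
    (np.array([0.9, 0.1]), 95.0),
    (np.array([0.7, 0.3]), 80.0),
    (np.array([0.1, 0.9]), 40.0),
]
print(score_essay(np.array([0.85, 0.15]), graded, k=2))
```

Weighting by cosine similarity lets the closest pre-graded essays dominate the estimate, which is the intuition behind the method.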
In the second method, a pre-existing exemplary text on the assigned topic, one written by an instructor or expert author, was used as a standard, and the student essay score was computed as its LSA cosine with the standard. In the third method, the cosine between each sentence of a standard text from which the students had presumably learned the material being tested and each sentence of a student’s answer was first computed. The maximum cosine for each source text component was found among the sentences of the student essay, and these maxima were cumulated to form a total score. In a variant of the third method, a fourth method computed and cumulated the cosines between each sentence in a student's essay and a set of sentences from the original text that the instructor thought were important. In the fifth method, only the essays themselves were used. The matrix of distances (1 − cosine) between all essays was "unfolded" to the single dimension that best reconstructed all the distances, and the point of an essay along this dimension was taken as the measure of its quality. This assumes that the most important dimension of difference among a set of essay exams on a given topic is their global quality. All five methods provided the basis of scores that correlated approximately as well with expert-assigned scores as such scores correlated with each other, sometimes slightly less well, on average somewhat better. In one set of studies (Laham, 1997b), method one was applied to a total of eight exams ranging in topic from heart anatomy and physiology, through psychological concepts, to American history, current social issues and marketing problems. A meta-analysis found that LSA correlated significantly better with individual expert graders (from ETS, other professional organizations, or course instructors) than one expert correlated with another. 
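Method three can be sketched as follows; the sentence vectors are invented stand-ins for LSA representations of source-text and essay sentences:

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def method_three(source_sents, essay_sents):
    """Method three (sketch): for each source-text sentence vector, find
    the best-matching essay sentence, and cumulate those maximum cosines
    into a total score."""
    return sum(max(cosine(s, e) for e in essay_sents) for s in source_sents)

# Toy sentence vectors (invented; real ones come from an LSA space).
source = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
good_essay = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]  # covers both ideas
thin_essay = [np.array([0.9, 0.1])]                        # covers only one
print(method_three(source, good_essay), method_three(source, thin_essay))
```

An essay that covers every source idea accumulates one high maximum per source sentence, so broader coverage yields a higher total than a thin essay of equal local quality.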
Because these results show that human judgments about essay qualities are no more reliable than LSA’s, they again suggest that the holistic semantic representation of a passage relies primarily on word choice and surprisingly little on properties whose transmission necessarily requires the use of syntax. This is good news for the
practical application of LSA to many kinds of discourse processing research, but is counterintuitive and at odds with the usual assumptions of linguistic and psycholinguistic theories of meaning and comprehension, so it should be viewed with caution until further research is done (and, of course, with reservations until the details of the studies have been published). LSA and Text Comprehension. This application of LSA is described in papers in this volume, so we will mention the results only briefly to round out our survey of evidence regarding the quality of LSA’s simulation of human meaning-based performance. Kintsch and his colleagues (e.g. van Dijk & Kintsch, 1983; Kintsch & Vipond, 1979; McNamara, Kintsch, Songer & Kintsch, 1996) have developed methods for representing text in a propositional language and have used them to analyze the coherence of discourse. They have shown that the comprehension of text depends heavily on its coherence, as measured by the overlap between the arguments in propositions. In a typical propositional calculation of coherence, a text must first be propositionalized by hand. This has limited research to small samples of text and has inhibited its practical application to composition and instruction. Foltz, Kintsch, and Landauer (1993, this issue; Foltz, 1996) have applied LSA to the task. LSA can make automatic coherence judgments by computing the cosine between one sentence or passage and the following one. In one case, involving the coherence among a set of sentences about the heart, the LSA measure predicted comprehension scores extremely well, r = .93. As will be discussed in the article in this volume, the general approach of using LSA for computing textual coherence also permits an automatic characterization of places in a text where the coherence breaks down, as well as a measure of how semantic content changes across a text. Predicting learning from text. 
As reported in some detail in two of the succeeding articles in this issue, Kintsch, Landauer and colleagues (Rehder et al.; Wolfe et al.; this issue) have begun to use LSA to match students with text at the
optimal level of conceptual complexity for learning. Earlier work by Kintsch and his collaborators (see Kintsch, 1994; McNamara, Kintsch, Butler-Songer and Kintsch, 1996) has shown that people learn the most when the text on a topic is neither too hard, containing too many concepts with which a student is not yet familiar, nor too easy, requiring too little new knowledge construction (a phenomenon we call “the Goldilocks principle”). LSA has been used to characterize both the knowledge of an individual student before and after reading a particular text and the knowledge conveyed by that text. These studies and their results are described in detail in articles hereafter. It is shown that choosing among instructional texts of differing sophistication on the basis of the LSA relation between a short student essay and each text can significantly increase the amount learned. In addition, analytic methods are developed by which not only the similarity between two or more texts, but also their relative positions along some important underlying conceptual continuum, such as level of sophistication or relevance to a particular topic, can be measured. Summary and some caveats. It is clear enough from the conjunction of all these formal and informal results that LSA is able to capture and represent significant components of the lexical and passage meanings evinced in judgment and behavior by humans. The following papers exploit this ability in interesting and potentially useful ways that simultaneously provide additional demonstrations and tests of the method and its underlying theory. However, as mentioned briefly above, it is obvious that LSA lacks important cognitive abilities that humans use to construct and apply knowledge from experience, in particular the ability to use detailed and complex order information such as that expressed by syntax and used in logic. 
It also lacks, of course, a great deal of the important raw experience, both linguistic and otherwise, on which human knowledge is based. While we are impressed by LSA’s current power to mimic aspects of lexical semantics and psycholinguistic phenomena, we believe that its validity as a model or measure of human cognitive processes or their products
should not be oversold. When applied in detail to individual cases of word pair relations or sentential meaning construal it often goes awry when compared to our intuitions. In general, it performs best when used to simulate average results over many cases, suggesting either that, so far at least, it is capturing statistical regularities that emerge from detailed processes rather than the detailed processes themselves, or that the corpora and, perhaps, the analysis methods, used to date have been imperfect. On the other hand, the success of LSA as a theory of human knowledge acquisition and representation should also not be underestimated. It is hard to imagine that LSA could have simulated the impressive range of meaning-based human cognitive phenomena that it has unless it is doing something analogous to what humans do. No previous theory in linguistics, psychology or artificial intelligence research has ever been able to provide a rigorous computational simulation that takes in the very same data from which humans learn about words and passages and produces a representation that gives veridical simulations of a wide range of human judgments and behavior. While it seems highly doubtful that the human brain uses the same mathematical algorithms as LSA/SVD, it seems almost certain that the brain uses as much analytic power as LSA to transform its temporally local experiences into global knowledge. The present theory clearly does not account for all aspects of knowledge and cognition, but it offers a potential path for development of new accounts of mind that can be stated in mathematical terms rather than imprecise mentalistic primitives and whose empirical implications can be derived analytically or by computations on bodies of representative data rather than by verbal argument. 
In future research we hope to see both improvements in LSA’s experience base from analysis of larger and more representative corpora of both text and spoken language—and perhaps, if a way can be found, by adding representations of experience of other kinds—and the provision of a compatible process model of online discourse comprehension by which both its input of experience and its application of
constructed knowledge will better reflect the complex ways in which humans combine word meanings dynamically. As suggested above, one promising approach to the latter goal is to combine LSA word and episode representation with the Construction-Integration theory’s mechanisms for discourse comprehension, a strategy that Walter Kintsch illustrates in a forthcoming book (Kintsch, in press). Other avenues of potential improvement involve the representation of word order in the input data for LSA, following the example of the work reported in Burgess and Lund (this volume). Meanwhile, it needs keeping in mind that the applications of LSA recounted in the following articles are all based on its current formulation and on varying training corpora that are all smaller and less representative of relevant human experience than one would wish. Part of the problem of non-optimal corpora is due simply to the current unavailability and difficulty of constructing large general or topically relevant text samples that approximate what a variety of individual learners would have met. But another part is due to current computational limitations. LSA became practical only when computational power and algorithm efficiency improved sufficiently to support SVD of thousands-of-words by thousands-of-contexts matrices; it is still impossible to perform SVD on the hundreds-of-thousands by tens-of-millions matrices that would be needed to truly represent the sum of an adult’s language exposure. It also needs noting that it is still early days for LSA and that many details of its implementation, such as the preprocessing data transformation used and the method for choosing dimensionality, even the underlying statistical model, will undoubtedly undergo changes. Thus in reading the following articles, or in considering the application of LSA to other problems, one should not think of LSA as a fixed mechanism or its representations as fixed quantities, but rather as evolving approximations.
Learning Human-like Knowledge by Singular Value Decomposition Originally Published as: Landauer, T. K., Laham, D., & Foltz, P. W., (1998). Learning human-like knowledge by Singular Value Decomposition: A progress report. In M. I. Jordan, M. J. Kearns & S. A. Solla (Eds.). Advances in Neural Information Processing Systems 10, (pp. 45-51). Cambridge: MIT Press.
Abstract. Singular value decomposition (SVD) can be viewed as a method for unsupervised training of a network that associates two classes of events reciprocally by linear connections through a single hidden layer. SVD was used to learn and represent relations among very large numbers of words (20k-60k) and very large numbers of natural text passages (1k-70k) in which they occurred. The result was 100-350 dimensional "semantic spaces" in which any trained or newly added word or passage could be represented as a vector, and similarities were measured by the cosine of the contained angle between vectors. Good accuracy in simulating human judgments and behaviors has been demonstrated by performance on multiple-choice vocabulary and domain knowledge tests, emulation of expert essay evaluations, and in several other ways. Examples are also given of how the kind of knowledge extracted by this method can be applied.

Introduction. Traditionally, imbuing machines with human-like knowledge has relied primarily on explicit coding of symbolic facts into computer data structures and algorithms. A serious limitation of this approach is people's inability to access and express the vast reaches of unconscious knowledge on which they rely, knowledge based on masses of implicit inference and irreversibly melded data. A more important deficiency of this state of affairs is that by coding the knowledge
ourselves (as we also do when we assign subjectively hypothesized rather than objectively identified features to input or output nodes in a neural net), we beg important questions of how humans acquire and represent the coded knowledge in the first place. Thus, from both engineering and scientific perspectives, there are reasons to try to design learning machines that can acquire human-like quantities of human-like knowledge from the same sources as humans. The success of such techniques would not prove that the same mechanisms are used by humans, but because we presently do not know how the problem can be solved in principle, successful simulation may offer theoretical insights as well as practical applications. In the work reported here we have found a way to induce significant amounts of knowledge about the meanings of passages and of their constituent vocabularies of words by training on large bodies of natural text. In general terms, the method simultaneously extracts the similarity between words (the likelihood of being used in passages that convey similar ideas) and the similarity between passages (the likelihood of containing words of similar meaning). The conjoint estimation of similarity is accomplished by a fundamentally simple representational technique that exploits mutual constraints implicit in the occurrences of very many words in very many contexts. We view the resultant system both as a means for automatically learning much of the semantic content of words and passages, and as a potential computational model for the process that underlies the corresponding human ability. While the method starts with data about the natural contextual co-occurrences of words, it uses them in a different manner than has previously been applied. A long-standing objection to co-occurrence statistics as a source of linguistic knowledge (Chomsky, 1957) is that many grammatically acceptable expressions, for example sentences with potentially unlimited embedding structures, cannot be produced by a finite Markov process whose elements are transition probabilities from word to word. If word-word probabilities are insufficient to generate language, then, it is argued, acquiring estimates of such probabilities cannot be a way that language can be learned. However, our approach to statistical knowledge learning differs from those considered in the past in two ways. First, the basic associational data from which knowledge is induced are not transition frequencies between successive individual words or phrases, but rather the frequencies with which particular words appear as components of relatively large natural passages, utterances of the kind that humans use to convey complete ideas. The result of this move is that the statistical regularities reflected are relations among unitary expressions of meaning, rather than syntactic constraints on word order that may serve additional purposes such as output and input processing efficiencies, error protection, or esthetic style. Second, the mutual constraints inherent in a multitude of such local co-occurrence relations are jointly satisfied by being forced into a global representation of lower dimensionality. This constraint satisfaction, a form of induction, was accomplished by singular value decomposition, a linear factorization technique that produces a representational structure equivalent to a three-layer neural network model with linear activation functions.
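The dimension-reduction step can be sketched numerically. The toy word-by-passage counts below are invented, and a real analysis would use many thousands of words and passages:

```python
import numpy as np

# Toy word-by-passage count matrix (5 word types x 4 passages; values invented).
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Singular value decomposition: X = U @ diag(s) @ Vt, with orthonormal
# columns in U and orthonormal rows in Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values: the induction step that forces
# the many local co-occurrence constraints into a global low-dimensional
# representation.
k = 2
words = U[:, :k] * s[:k]   # word vectors in the reduced "semantic space"

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words 0 and 1 share passage contexts; words 0 and 4 do not, so their
# reduced-space cosine is much lower.
print(cosine(words[0], words[1]), cosine(words[0], words[4]))
```

After truncation, words that occurred in similar passages end up with nearly parallel vectors even where their raw co-occurrence counts differ, which is the induction the text describes.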
The text analysis model and method. The text analysis process that we have explored is called Latent Semantic Analysis (LSA) (Deerwester et al., 1990; Landauer and Dumais, 1997). It comprises four steps: (1) A large body of text is represented as a matrix [ij], in which rows stand for individual word types, columns for meaning-bearing passages such as sentences or paragraphs, and cells contain the frequency with which a word occurs in a passage. (2) Cell entries (freq_ij) are transformed to:

$$\frac{\log\left(\mathrm{freq}_{ij}+1\right)}{-\sum_{j}\dfrac{\mathrm{freq}_{ij}}{\sum_{j}\mathrm{freq}_{ij}}\,\log\dfrac{\mathrm{freq}_{ij}}{\sum_{j}\mathrm{freq}_{ij}}}$$

a measure of the first-order association of a word and its context. (3) The matrix is then subjected to singular value decomposition (Berry, 1992): [ij] = [ik] [kk] [jk]' in which [ik] and [jk] have orthonormal columns, [kk] is a diagonal matrix of singular values, and k