LSAC RESEARCH REPORT SERIES


Application of Heuristic-Based Semantic Similarity Between Single-Topic Texts for the LSAT

Dmitry I. Belov
David A. Kary

Law School Admission Council Research Report 12-05
October 2012

A Publication of the Law School Admission Council

The Law School Admission Council (LSAC) is a nonprofit corporation that provides unique, state-of-the-art admission products and services to ease the admission process for law schools and their applicants worldwide. More than 200 law schools in the United States, Canada, and Australia are members of the Council and benefit from LSAC's services.

© 2012 by Law School Admission Council, Inc.

LSAT, The Official LSAT PrepTest, The Official LSAT SuperPrep, ItemWise, and LSAC are registered marks of the Law School Admission Council, Inc. Law School Forums, Credential Assembly Service, CAS, LLM Credential Assembly Service, and LLM CAS are service marks of the Law School Admission Council, Inc. 10 Actual, Official LSAT PrepTests; 10 More Actual, Official LSAT PrepTests; The Next 10 Actual, Official LSAT PrepTests; 10 New Actual, Official LSAT PrepTests with Comparative Reading; The New Whole Law School Package; ABA-LSAC Official Guide to ABA-Approved Law Schools; Whole Test Prep Packages; The Official LSAT Handbook; ACES; ADMIT-LLM; FlexApp; Candidate Referral Service; DiscoverLaw.org; Law School Admission Test; and Law School Admission Council are trademarks of the Law School Admission Council, Inc.

All rights reserved. No part of this work, including information, data, or other portions of the work published in electronic form, may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or by any information storage and retrieval system without permission of the publisher. For information, write: Communications, Law School Admission Council, 662 Penn Street, PO Box 40, Newtown, PA 18940-0040.

LSAC fees, policies, and procedures relating to, but not limited to, test registration, test administration, test score reporting, misconduct and irregularities, Credential Assembly Service (CAS), and other matters may change without notice at any time. Up-to-date LSAC policies and procedures are available at LSAC.org.

Table of Contents

Executive Summary
Introduction
WordNet
Data Type
Computing Semantic Similarity Between Two Single-Topic Texts
Experiments With Real Data
Discussion
References


Executive Summary

Educational measurement practice (item bank development, form assembly, scoring of constructed response answers, etc.) involves the development and processing of an enormous amount of text. This requires large numbers of people to write, read through, evaluate, classify, edit, score, and analyze the text. Not only is this process time consuming and resource intensive, but it is also subjective and prone to error. Subject-matter experts must define the construct of the test through some formalized process. Based on the construct, items are written, reviewed, edited, and classified. Beyond the individual items, item banks must also be evaluated to identify content overlap, cuing, or other content features that will lead to dependence among items and reduce construct representation. Newly written items approved for pretesting must then be administered to a sample of representative test takers before their statistical quality can be determined. If the items involve constructed response answers, they must be scored by trained human raters. Finally, item writing must be conducted on a continuous basis due to security issues, and construct definition must be reevaluated on a regular basis due to changes in practice or standards.

Natural language processing (NLP) can be used to reduce the above-mentioned costs in time, money, and labor. NLP is a collection of methods for indexing, classifying, summarizing, generating, and interpreting texts. Initially, educational measurement made use of these methods in the development of automated essay-scoring engines. Recently, however, NLP methods have been applied to nearly every aspect of test development and psychometrics: item difficulty modeling, using text analysis to improve scoring in computerized adaptive testing and multistage testing, searching for pairs of mutually excluded items (item enemies), item generation, and item bank referencing.

This report introduces a heuristic for computing semantic similarity between two single-topic texts. The heuristic was tested on 10 datasets prepared by a test developer. Each dataset consisted of 10 Logical Reasoning passages from the Law School Admission Test (LSAT), where passages P1 and P2 were judged by the test developer to be similar, and the other 8 passages were judged to be dissimilar from P1 and P2. Given a dataset, the heuristic was used to compute the semantic similarity between P1 and the other passages, and it agreed with the test developer's judgments on 8 of the 10 datasets. The heuristic has several potential applications for the LSAT: (1) semantic-based search for possible enemies in an item pool; (2) Internet search for illegally reproduced cloned items; (3) improvement of estimates of item difficulty through the addition of semantic features (e.g., semantic similarity between a passage and its key, or between a key and its distractors).


Introduction

The measurement of semantic similarity between two texts has numerous applications. It has been used for text classification (Rocchio, 1971), word sense disambiguation (Lesk, 1986; Schutze, 1998), extractive summarization (Salton, Singhal, Mitra, & Buckley, 1997), automatic evaluation of machine translation (Papineni, Roukos, Ward, & Zhu, 2002), and text summarization (Lin & Hovy, 2003). In educational measurement, it has been used for automated essay scoring (Cook & Clauser, 2012; Page, 1966; Shermis & Burstein, 2003), for expanding the feature space for regression-tree-based item difficulty modeling by adding semantic features (Belov & Knezevich, 2008; Sheehan, Kostin, & Persky, 2006, April), for improving Bayesian priors of the latent trait (He & Veldkamp, 2012), and for searching for pairs of mutually excluded (enemy) items (Li & Shen, 2012).

A typical approach to computing text similarity is to use a matching method, which produces a similarity score based on the number of semantically similar pairs of words from the two texts. There are a large number of word-to-word semantic similarity measures using approaches that are either knowledge based (Leacock & Chodorow, 1998; Resnik, 1998; Wu & Palmer, 1994) or corpus based (Turney, 2001). As usual with artificial intelligence, there is no general method that works well for all data. Moreover, even the mathematical formalization of semantics and semantic similarity is still an active area of research (Palmer, 2011). Therefore, the success of applying existing techniques mostly depends on how well the algorithm exploits the structure of the given data. This report presents a heuristic that exploits the structure of a passage from a Law School Admission Test (LSAT) Logical Reasoning (LR) item. The heuristic is a very simple one at this stage; it is intended to be the basis for more sophisticated (and more accurate) heuristics in the future.

The first section of this report briefly describes the WordNet lexical database (Fellbaum, 1998) and its application to the measurement of semantic similarity between two nouns. The second section describes the studied data type and identifies some structural characteristics of the data. The third section describes an algorithm to compute semantic similarity between two texts. The fourth section presents the results of experiments with LSAT items. A discussion is presented in the final section of the report.

WordNet

WordNet (Fellbaum, 1998) is a semantic dictionary designed as a graph. Each vertex of the graph is called a synset, which represents a specific meaning (sense) of a word. It includes the word, its explanation, and its synonyms. Synsets are connected to one another through explicit semantic relations. In WordNet, given a part of speech (noun, verb, adjective, adverb) and a semantic relation (hypernyms for nouns and verbs; synonyms for adjectives and adverbs), a word can have multiple senses, each corresponding to a particular synset (Fellbaum, 1998). Therefore, we will use sense and synset interchangeably. All of the senses of any given word are sorted from the most common to the least common. For a noun, each sense defines a unique path (determined by the "this is a kind of" semantic relation) to the noun entity. Figure 1 shows two senses and their semantic paths for the noun otter. Figure 2 shows the sense along with the corresponding semantic path for two nouns: psychometrics and psychophysics.
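To make this structure concrete, the short sketch below prints the noun senses of a word together with their hypernym ("this is a kind of") paths, which is essentially what Figures 1 and 2 depict. It is a Python/NLTK illustration only, not the report's implementation, and the exact senses returned depend on the WordNet version installed.

```python
# Illustrative only: list the noun senses of a word and their hypernym paths
# using NLTK's WordNet interface (requires nltk and the 'wordnet' corpus).
from nltk.corpus import wordnet as wn

def show_noun_senses(word):
    for sense in wn.synsets(word, pos=wn.NOUN):
        print(sense.name(), '-', sense.definition())
        for path in sense.hypernym_paths():
            # each path runs from the root synset (entity) down to this sense
            print('   ' + ' -> '.join(s.name() for s in path))

show_noun_senses('otter')   # compare with Figure 1
```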

FIGURE 1. Two senses and semantic paths for the noun otter


FIGURE 2. Sense and semantic path for two nouns in WordNet: (a) psychometrics; (b) psychophysics


A semantic similarity between two synsets can be computed from the distance between them in the semantic graph. The shorter the distance from one synset to another, the more similar they are (Resnik, 1998). The distance δ(x, y) is the number of synsets between (and including) synset x and synset y. Figure 3 shows a fragment of a subgraph formed by the two paths from Figure 2. We use the following equation to compute the semantic similarity between two synsets:

$$\mu(x, y) = \frac{1}{\delta(x, y)}, \qquad (1)$$

where μ(x, y) ∈ [0, 1]. The similarity is 0 if the paths have no common synsets or no path exists. The similarity is 1 if the two synsets are the same (e.g., two synonyms). For example, if x = psychometrics and y = psychophysics, then the similarity (1) is 0.25, since four synsets are counted along the path between x and y, beginning with x = psychometrics and terminating with y = psychophysics (see Figure 3). Other methods for computing semantic similarity between two words can be found in Budanitsky and Hirst (2006).

[Figure 3 shows the subgraph connecting the synsets science/scientific discipline and psychology/psychological science, which branches to psychometrics on one side and to experimental psychology/psychonomics and psychophysics on the other.]

FIGURE 3. An example showing the path between psychometrics and psychophysics
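A minimal sketch of Equation (1) is given below, using Python and NLTK's WordNet interface. The report's implementation used WordNet.Net in C#, so this is an illustration under the assumption that "distance" counts the synsets on the shortest connecting path, including both endpoints; path lengths, and hence the similarity values, can differ slightly across WordNet versions.

```python
# Sketch of Equation (1): similarity = 1 / (number of synsets on the shortest
# connecting path, counting both endpoints); 0 if the paths never meet.
from nltk.corpus import wordnet as wn

def path_distance(s1, s2):
    best = None
    for p1 in s1.hypernym_paths():          # root-to-synset lists
        for p2 in s2.hypernym_paths():
            common = set(p1) & set(p2)
            if not common:
                continue
            d1 = max(p1.index(c) for c in common)   # deepest shared ancestor in p1
            d2 = max(p2.index(c) for c in common)   # deepest shared ancestor in p2
            dist = (len(p1) - d1) + (len(p2) - d2) - 1  # count the ancestor once
            best = dist if best is None else min(best, dist)
    return best

def synset_similarity(s1, s2):
    if s1 == s2:
        return 1.0
    d = path_distance(s1, s2)
    return 0.0 if d is None else 1.0 / d

# The psychometrics/psychophysics example from Figure 3
x = wn.synsets('psychometrics', pos=wn.NOUN)[0]
y = wn.synsets('psychophysics', pos=wn.NOUN)[0]
print(synset_similarity(x, y))
```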

Data Type

In this study, we estimate the semantic similarity between short passages from LR items from past administrations of the LSAT. (Note: LR items make up roughly one half of the scored items on the LSAT.) The passages used in the study ranged from about 30 to 100 words in length, with a mean word count of roughly 60. LR passages are written in ordinary English and do not presuppose a specialized vocabulary. Most LR passages are based on short arguments drawn from published sources such as scholarly books and periodicals, newspapers, and general interest magazines. Any LR passages that are not based on published sources are written to resemble real-world argumentation.

Most LR subtypes are required to have passages that take the form of arguments, consisting of an explicitly drawn conclusion, one or more premises offered in support of the conclusion, and (optionally) background information that sets the stage for the premises and conclusion. The passages chosen for this study were selected from only those subtypes; we will refer to such passages as single-topic texts. Here is a sample passage from the datasets:

    Biologists have noted reproductive abnormalities in fish that are immediately downstream of paper mills. One possible cause is dioxin, which paper mills release daily and which can alter the concentration of hormones in fish. However, dioxin is unlikely to be the cause, since the fish recover normal hormone concentrations relatively quickly during occasional mill shutdowns and dioxin decomposes very slowly in the environment.[1]

The first two sentences of this passage are largely devoted to presenting background information. The third sentence states the conclusion of the argument: that dioxin is unlikely to be the cause of the observed reproductive abnormalities in fish. Two premises are then offered in support of this conclusion: (1) the fish in question recover normal hormone concentrations relatively quickly when the mill shuts down, and (2) dioxin decomposes very slowly in the environment. Like many LR passages, this passage uses an "indicator term" to signal the location of premises or conclusions in the text: the word "since" signals the presence of the conclusion, which precedes it, and the premises, which follow it.

The 10 datasets used in the study were compiled by a test developer, drawing on operational experience in reviewing LSAT forms for subject matter overlap. Each dataset consisted of 10 LR passages. Of the 10 passages, 2 (P1 and P2) were judged by the test developer to be semantically similar to each other. For the purposes of this study, two passages were deemed "semantically similar" if and only if they were judged to be too similar in subject matter to allow both to be part of the same LSAT form (i.e., they are considered "enemies"). The other eight passages (P3 through P10) were judged to be semantically dissimilar from both P1 and P2. For the purposes of this study, two passages were deemed "semantically dissimilar" if and only if there would be no reasonable subject-matter-based objection to having both passages on the same LSAT form.

[1] All LR passages ©1997–2007 by Law School Admission Council, Inc.


Computing Semantic Similarity Between Two Single-Topic Texts

This section introduces and explains the heuristic algorithm for computing text similarity. We start by presenting the three stages of the heuristic briefly and informally. Then each stage is described in detail, with examples to improve clarity.

Stage 1 (Form a list of nouns for each text): The list of candidates for nouns (including frequently used noun phrases such as global warming) is formed by a window method. Then each candidate is checked to determine whether it exists in WordNet as a noun (or a noun phrase, which we will henceforth refer to as just "noun"). The initial weight of each noun is set to its frequency in the text. If the noun is contained in the conclusion or premises of the text, then its weight is increased.

Stage 2 (Identify the sense of each noun): Given two nouns, all possible pairs of their senses are enumerated. Each pair of senses corresponds to a pair of paths. If one noun belongs to the path of another, then the corresponding senses are weighted by Equation (1) and added to the nouns' lists of possible senses. Finally, for each noun, the sense with the highest weight is chosen.

Stage 3 (Compute the semantic similarity between two lists): Given two lists of nouns, a semantic similarity matrix is built. Each element of the matrix is the semantic similarity between the corresponding nouns (with senses identified at Stage 2), computed by Equation (1). A graph-matching heuristic (taking into account the weight of each noun) is applied to the matrix to compute the semantic similarity between the two lists.

Stage 1: Form a list of nouns for each text

Step 1: Split the text into sentences (using regular expressions[2]). Add each sentence to the list of sentences G.

Step 2: Split each sentence into words (using regular expressions).

Step 3: Using a predefined list of keywords (since, therefore, thus, etc.), identify sentences that likely express the conclusion and premises of the text. Add such sentences to list C.

Step 4: Set the list of nouns L := ∅.

Step 5: For each sentence g from list G, perform the following:

Step 5.1: For each word v_i, i = m, ..., 2, 1 (m is the number of words in sentence g), perform the following:

[2] A regular expression provides a concise and flexible means to "match" (i.e., specify and recognize) strings of text, such as particular characters, words, or patterns of characters.


Step 5.1.1: If word v_i is in the stop list (a predefined list of frequently occurring or insignificant words: a, about, after, again, against, etc.), then set i := i − 1 and repeat Step 5.1.1.

Step 5.1.2: For each j = i − k, ..., i (k = 5 is used in this study), perform the following:

Step 5.1.2.1: Set a candidate for noun x := v_j + " " + v_{j+1} + ... + " " + v_i.

Step 5.1.2.2: If x is represented in WordNet as a noun, then go to the next step; otherwise increase j and go to the previous step. Note that in the text, the word x can serve as a verb or an adjective. However, if x has at least one WordNet sense for a noun, it will be interpreted as a noun. Strictly speaking this is incorrect, although often a noun and a verb of the same word bear similar meanings (e.g., use, speed, study).

Step 5.1.2.3: If list L does not include x, then add x to L with weight w, where w = 1 if g ∉ C and w = 2 if g ∈ C; otherwise, increase the weight of x in L by w. Set i := j − 1 and go to Step 5.1.1.

Each found noun is transformed by the WordNet morphology function to its base form; for example, the word terms will be transformed to term. Let us consider the following passage:

    Manager: I recommend that our company reconsider the decision to completely abandon our allegedly difficult-to-use computer software and replace it companywide with a new software package advertised as more flexible and easier to use. Several other companies in our region officially replaced the software we currently use with the new package, and while their employees can all use the new software, unofficially many continue to use their former software as much as possible.

The result of Stage 1 is the following list, where each element has a noun and its weight w:

List L1

Noun                Weight
former              1
package             1
region              1
employee            1
software package    2
decision            2
abandon             2
computer software   2
company             3
software            3
use                 5

Note 1: The noun manager is not in the list. The noun corresponding to a person whose monolog forms the passage is removed from the first sentence at Step 2 (using regular expressions).

Note 2: The words abandon and use are interpreted as nouns even though they are used as verbs in this passage.
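The following is a rough Python/NLTK sketch of Stage 1. The report's implementation is in C# with WordNet.Net; the stop list, the indicator-keyword test, and the tokenization here are simplified placeholders, so the exact output will differ from List L1 in detail.

```python
# Simplified sketch of Stage 1: collect nouns/noun phrases with a backward
# sliding window, check candidates against WordNet, and weight them by
# frequency with a bonus for sentences that look like premises/conclusions.
import re
from nltk.corpus import wordnet as wn

STOP_WORDS = {"a", "about", "after", "again", "against", "the", "of", "to", "and", "it"}
INDICATORS = {"since", "therefore", "thus", "hence"}   # abbreviated keyword list
K = 5                                                  # window size, as in the report

def extract_nouns(text):
    weights = {}
    for sentence in re.split(r'(?<=[.!?])\s+', text):                    # Step 1
        words = [w.lower() for w in re.findall(r"[A-Za-z-]+", sentence)]  # Step 2
        bonus = 2 if any(w in INDICATORS for w in words) else 1           # Step 3 (rough)
        i = len(words) - 1
        while i >= 0:                                   # Step 5: scan right to left
            if words[i] in STOP_WORDS:                  # Step 5.1.1
                i -= 1
                continue
            for j in range(max(0, i - K), i + 1):       # j = i - k, ..., i
                candidate = "_".join(words[j:i + 1])
                if wn.synsets(candidate, pos=wn.NOUN):  # Step 5.1.2.2
                    lemma = wn.morphy(candidate, wn.NOUN) or candidate
                    weights[lemma] = weights.get(lemma, 0) + bonus  # Step 5.1.2.3
                    i = j - 1
                    break
            else:
                i -= 1
    return weights
```

Applied to the manager passage above, a sketch like this produces a weighted noun list in the spirit of List L1, although the exact entries depend on the stop list and keyword list used.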

Stage 2: Identify the sense of each noun

Step 1: For each pair of nouns (x_i, x_j), i ≠ j, from list L, perform the following: If there is a sense s(x_i) such that its path p(s(x_i)) contains x_j, then add s(x_i) and s(x_j) [identified from the path p(s(x_i))] to the lists of senses S(x_i) and S(x_j), respectively, with a weight computed by Equation (1).

Step 2: For each noun x from list L, identify the sense with the maximum weight in the list S(x). If several senses have the maximum weight, then select the one with the smallest sense number (the more common sense). If list S(x) is empty, select sense 1 (the most common).

Step 3: Synonymous nouns are substituted by the one with the highest weight. The new weight equals the sum of the weights of all synonyms.

For the above passage, Step 2 results in the following list, where each element has the noun and its sense:

List L2

Noun                Sense
former              1
package             3
region              1
employee            1
software package    1
decision            1
abandon             1
computer software   1
company             1
software            1
use                 1

Figure 4 shows all senses of the noun package. One can see that because of Steps 1 and 2, the selected sense is 3. In particular, pairs (package, software package), (package, computer software), and (package, software) resulted in the highest weight for sense 3 of the noun package.


FIGURE 4. Three senses and their paths for the noun package
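A sketch of Stage 2, Steps 1 and 2, is given below, again in Python/NLTK as an illustration. The `similarity` argument stands for Equation (1), for example the `synset_similarity` sketch above; sense indices are 0-based here, so sense 1 in the report corresponds to index 0.

```python
# Sketch of Stage 2: for every ordered pair of nouns, if some sense of the
# second noun lies on a hypernym path of a sense of the first, credit both
# senses with the Equation (1) weight; then keep the best-scoring sense.
from collections import defaultdict
from nltk.corpus import wordnet as wn

def identify_senses(nouns, similarity):
    senses = {x: wn.synsets(x.replace(' ', '_'), pos=wn.NOUN) for x in nouns}
    scores = defaultdict(lambda: defaultdict(float))   # noun -> sense index -> weight
    for xi in nouns:
        for xj in nouns:
            if xi == xj:
                continue
            for i_idx, s_i in enumerate(senses[xi]):
                for path in s_i.hypernym_paths():
                    for j_idx, s_j in enumerate(senses[xj]):
                        if s_j in path:                  # Step 1
                            w = similarity(s_i, s_j)
                            scores[xi][i_idx] += w
                            scores[xj][j_idx] += w
    chosen = {}
    for x in nouns:
        if scores[x]:
            # Step 2: highest weight; ties go to the more common (lower) sense index
            chosen[x] = min(scores[x], key=lambda k: (-scores[x][k], k))
        else:
            chosen[x] = 0                                # default to the most common sense
    return chosen
```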


For the above passage, Step 3 results in the following list, where each element has the noun and its weight:

List L3

Noun        Weight
region      1
former      1
employee    1
abandon     2
decision    2
company     3
use         5
software    8

One can see that package, software package, computer software, and software are merged into software because they are synonyms (i.e., they belong to the same synset; see Figure 4) and because the noun software had the highest weight, 3 (see List L1 above). This results in a new weight of 8 for software (the sum of the weights of all synonyms). Let us consider another passage:

    In 1988, a significant percentage of seals in the Baltic Sea died from viral diseases; off the coast of Scotland, however, the death rate due to viral diseases was approximately half what it was for the Baltic seals. The Baltic seals had significantly higher levels of pollutants in their blood than did the Scottish seals. Since pollutants are known to impair marine mammals' ability to fight off viral infection, it is likely that the higher death rate among the Baltic seals was due to the higher levels of pollutants in their blood.

Stage 2 will identify that seal has sense 9 (see Figures 5 and 6) due to the pair (seal, mammal).
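Returning to Step 3 of Stage 2, a minimal sketch of the synonym merge follows; the data structures and function name are illustrative, not the report's.

```python
# Sketch of Stage 2, Step 3: nouns whose chosen senses are the same synset are
# collapsed into the most heavily weighted lemma; weights are summed.
def merge_synonyms(nouns):
    """nouns: dict mapping lemma -> (chosen_synset, weight)."""
    merged = {}                      # synset -> (representative lemma, total weight)
    for lemma, (syn, w) in nouns.items():
        rep, total = merged.get(syn, (lemma, 0))
        if w > nouns[rep][1]:        # keep the individually heaviest lemma as representative
            rep = lemma
        merged[syn] = (rep, total + w)
    return {rep: (syn, total) for syn, (rep, total) in merged.items()}
```

With the List L1 weights above, such a merge yields software with weight 8 (1 + 2 + 2 + 3), matching List L3.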


FIGURE 5. All senses for the noun seal

FIGURE 6. The path of sense 9 for the noun seal


Stage 3: Compute the semantic similarity between two lists

Step 1: Given two lists of nouns x = (x_1, x_2, ..., x_m) and y = (y_1, y_2, ..., y_n), where the senses of x_i and y_j were identified at Stage 2, i = 1, 2, ..., m, j = 1, 2, ..., n, compute the m × n matrix R such that r_ij = μ(x_i, y_j) [see Equation (1)]. Use of Equation (1) is appropriate because each noun and its sense uniquely identify the corresponding synset.

Step 2: Given matrix R, compute the semantic similarity as follows:

$$\mu(\mathbf{x}, \mathbf{y}) = \frac{\sum_{i=1}^{m} w_i \max_{j} r_{ij} + \sum_{j=1}^{n} w_j \max_{i} r_{ij}}{\sum_{i=1}^{m} w_i + \sum_{j=1}^{n} w_j}, \qquad (2)$$

where w_i and w_j are the weights of nouns x_i and y_j, respectively. From Equations (1) and (2) it follows that μ(x, y) ∈ [0, 1] and that the semantic similarity between nouns with higher weights influences μ(x, y) more. Consider the following three passages:

Passage 1:

    The asteroid that hit the Yucatan Peninsula 65 million years ago caused both long-term climatic change and a tremendous firestorm that swept across North America. We cannot show that it was this fire that caused the extinction of the triceratops, a North American dinosaur in existence at the time of the impact of the asteroid. Nor can we show that the triceratops became extinct due to the climatic changes resulting from the asteroid's impact. Hence, we cannot attribute the triceratops's extinction to the asteroid's impact.

Passage 2:

    One theory to explain the sudden extinction of all dinosaurs points to "drug overdoses" as the cause. Angiosperms, a certain class of plants, first appeared at the time that dinosaurs became extinct. These plants produce amino-acid-based alkaloids that are psychoactive agents. Most plant-eating mammals avoid these potentially lethal poisons because they taste bitter. Moreover, mammals have livers that help detoxify such drugs. However, dinosaurs could neither taste the bitterness nor detoxify the substance once it was ingested. This theory receives its strongest support from the fact that it helps explain why so many dinosaur fossils are found in unusual and contorted positions.


Passage 3:

    Manager: I recommend that our company reconsider the decision to completely abandon our allegedly difficult-to-use computer software and replace it companywide with a new software package advertised as more flexible and easier to use. Several other companies in our region officially replaced the software we currently use with the new package, and while their employees can all use the new software, unofficially many continue to use their former software as much as possible.

Stages 1 and 2 applied to each passage produce the following three lists, where each element consists of a noun, its sense, and its weight (each list is ordered by weight):

List 1

Noun                Sense   Weight
existence           1       1
time                5       1
firestorm           1       1
million             1       1
north               1       1
yucatan peninsula   1       1
years               1       1
dinosaur            1       1
fire                6       1
hit                 7       1
north america       1       1
due                 1       1
change              1       2
show                1       2
attribute           1       2
extinction          1       3
impact              1       4
triceratops         1       4
asteroid            1       5

List 2

Noun         Sense   Weight
liver        2       1
found        1       1
produce      1       1
poison       1       1
point        1       1
position     1       1
support      2       1
extinction   1       1
agent        1       1
time         4       1
fact         1       1
alkaloid     1       1
substance    1       1
fossil       1       1
angiosperm   1       1
class        4       1
cause        4       1
taste        1       2
help         1       2
theory       1       2
mammal       1       2
bitterness   3       2
plant        2       2
drug         1       2
dinosaur     1       4

List 3

Noun       Sense   Weight
region     1       1
former     1       1
employee   1       1
abandon    1       2
decision   1       2
company    1       3
use        1       5
software   1       8

Stage 3 applied to the first and second lists results in μ = 0.308. At the same time, Stage 3 applied to the first and third lists results in μ = 0.153. One can see that semantically the first passage is closer to the second one than to the third one, because the first and second passages are about dinosaurs.
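A compact sketch of Stage 3, i.e., Equation (2), is shown below. It assumes Stages 1 and 2 have already produced two lists of (synset, weight) pairs and that `sim` implements Equation (1); the function name and data layout are illustrative, not the report's.

```python
# Sketch of Equation (2): weighted best-match similarity between two noun lists.
def text_similarity(x, y, sim):
    """x, y: non-empty lists of (synset, weight) pairs; sim: Equation (1)."""
    r = [[sim(sx, sy) for sy, _ in y] for sx, _ in x]       # r[i][j] = mu(x_i, y_j)
    wx = [w for _, w in x]
    wy = [w for _, w in y]
    num = sum(wi * max(row) for wi, row in zip(wx, r))       # sum_i w_i * max_j r_ij
    num += sum(wj * max(r[i][j] for i in range(len(x)))      # sum_j w_j * max_i r_ij
               for j, wj in enumerate(wy))
    return num / (sum(wx) + sum(wy))
```

The max over rows and columns is what implements the graph-matching idea: each noun is paired with its best counterpart in the other list, and each pairing is weighted by the importance of the noun.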


Experiments With Real Data

The three-stage algorithm was implemented in C# for Windows using WordNet and WordNet.Net (Crowe & Simpson, 2005). Each dataset of 10 passages was analyzed within 10 seconds on a personal computer with an Intel Core i7-860 CPU.

The heuristic was tested on 10 datasets prepared by a test developer. Each dataset consisted of 10 LR passages, where passages P1 and P2 were judged by the test developer to be similar, and the other 8 passages were judged to be dissimilar from P1 and P2. Given a dataset, the heuristic was used to compute the semantic similarity between P1 and passages P2, P3, ..., P10. For 8 datasets it was correct in showing higher similarity between P1 and P2, and lower similarity between P1 and P3, P4, ..., P10. Details of the results are presented in Figure 7.

[Figure 7: grouped bar chart; vertical axis from 0 to 0.6; for each of Data 1 through Data 10, bars show the similarity values for P1 & P2, P1 & P3, ..., P1 & P10.]

FIGURE 7. For each dataset (from 1 to 10) one can see the value of semantic similarity between passage 1 and passages 2 through 10. For 8 datasets, the semantic similarity between P1 and P2 was the largest within the dataset. Interestingly, for dataset 1 the semantic similarity between P1 and P10 is quite large; this is because P1, P2, and P10 were about the legal system, but P1 and P2 were more similar. In both datasets where the heuristic failed (datasets 4 and 5), the assumption that the topic is well represented by the nouns did not hold.


In order to check the stability of the heuristic, we then analyzed the semantic similarities between P2 and P1, P3, P4, ..., P10. For 7 datasets the heuristic demonstrated higher similarity between P2 and P1, and lower similarity between P2 and P3, P4, ..., P10. The three datasets for which the heuristic gave incorrect results for (P2, Pi) comparisons include the two datasets that yielded incorrect results for (P1, Pi) comparisons. Details of the results are presented in Figure 8.

[Figure 8: grouped bar chart; vertical axis from 0 to 0.6; for each of Data 1 through Data 10, bars show the similarity values for P2 & P1, P2 & P3, ..., P2 & P10.]

FIGURE 8. For each dataset (from 1 to 10) one can see the value of semantic similarity between passage 2 and passages 1 and 3 through 10. For 7 datasets, the semantic similarity between P2 and P1 was the largest within the dataset.

Discussion

The objective of this report was to construct a simple and fast heuristic for computing semantic similarity between LR passages. The results of experiments with real data were surprisingly successful: the heuristic performed with 70–80% accuracy and analyzed 10 passages within 10 seconds on a common personal computer. Future work will be directed toward improving accuracy through (a) use of a part-of-speech tagger to more precisely identify nouns, verbs, adjectives, and adverbs; (b) use of information represented by verbs, adjectives, and adverbs; and (c) identification of senses by an adaptation of the Lesk algorithm (Lesk, 1986).

The method used in this study for computing semantic similarity is built on the assumption that the topic of a passage is adequately represented via the nouns from that passage. This assumption appears to be met for most LR passages. For example, the dioxin/fish passage that appeared above in the Data Type section yielded the following noun list (with weights):

Noun            Weight
release         1
abnormality     1
daily           1
biologist       1
mill            2
normal          2
environment     2
shutdown        2
cause           3
hormone         3
concentration   3
fish            4
paper mill      4
dioxin          5

Most of these nouns have some bearing on the passage's topic, which is the issue of whether dioxin is the cause of abnormalities observed in fish downstream of paper mills. The adequacy of this assumption is uncertain for other LR passages, however. Consider passage P1 from Dataset 5, where the heuristic failed (see Figures 7 and 8):

    The kind of thoughts that keep a person from falling asleep can arise in either half of the brain. Therefore, a person being prevented from sleeping solely by such thoughts would be able to fall asleep by closing the eyes and counting sheep, because this activity fully occupies the left half of the brain with counting and the right half of the brain with imagining sheep, thereby excluding the sleep-preventing thoughts.

This passage yields the following list of nouns (with senses and weights):

Noun       Sense   Weight
keep       1       1
kind       1       1
sleeping   3       2
activity   3       2
fall       5       2
eyes       1       2
closing    1       2
person     1       3
counting   1       4
sheep      2       4
thought    4       5
brain      4       5
half       1       5

Most of these nouns have no obvious bearing on the passage's topic, which is how to overcome insomnia by banishing sleep-preventing thoughts. Improving the heuristic to make use of information from adjectives (e.g., asleep) would likely improve its performance with passages such as this one.

If the accuracy of this heuristic could be substantially improved, then the heuristic would have several potential applications:

• Semantic-based searching for potential enemies in an item pool. This is a straightforward application of the three-stage algorithm presented in this report.

• Searching the Internet for illegally reproduced cloned items. This application is possible after Stage 2, when the list of nouns with their senses is formed. At that point, one can use a search engine with the search parameter set to a list of nouns (with highest weight) and their synonyms acquired from WordNet (a rough sketch of this idea follows this list).

• Expanding the feature space for regression-tree-based item difficulty modeling by adding semantic features (e.g., semantic similarity between a passage and its key, or between a key and its distractors). For more details on this potential application, see the report by Belov and Knezevich (2008).

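As a rough illustration of the second application, one could assemble a web search query from the top-weighted nouns of a passage and their WordNet synonyms. The sketch below is purely hypothetical: the function name, parameters, and the choice of five query terms are not from the report.

```python
# Hypothetical sketch: build a search-engine query from the highest-weighted
# nouns of a passage plus the synonyms of their chosen senses.
from nltk.corpus import wordnet as wn

def build_search_query(noun_weights, chosen_sense, top_n=5):
    """noun_weights: lemma -> weight; chosen_sense: lemma -> 0-based sense index."""
    top = sorted(noun_weights, key=noun_weights.get, reverse=True)[:top_n]
    clauses = []
    for lemma in top:
        synsets = wn.synsets(lemma.replace(' ', '_'), pos=wn.NOUN)
        terms = {lemma}
        if synsets:
            s = synsets[min(chosen_sense.get(lemma, 0), len(synsets) - 1)]
            terms |= {l.name().replace('_', ' ') for l in s.lemmas()}
        clauses.append('(' + ' OR '.join(sorted(terms)) + ')')
    return ' '.join(clauses)
```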
References

Belov, D. I., & Knezevich, L. (2008). Automatic prediction of item difficulty based on semantic similarity measures (LSAC Research Report 08-04). Newtown, PA: Law School Admission Council, Inc.

Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 13–47.

Cook, R., & Clauser, B. (2012). An NLP-based approach to automated scoring of the USMLE® Step 2 CSE® Patient Note. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, Canada.

Crowe, M., & Simpson, T. (2005). WordNet.Net library. http://opensource.ebswift.com/WordNet.Net/Default.aspx

Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.

He, Q., & Veldkamp, B. P. (2012). Classifying unstructured textual data using the Product Score Model: An alternative text mining algorithm. In T. J. H. M. Eggen & B. P. Veldkamp (Eds.), Psychometrics in practice at RCEC (pp. 47–62). Enschede, Netherlands: RCEC.

Leacock, C., & Chodorow, M. (1998). Combining local context and WordNet sense similarity for word sense identification. In WordNet: An electronic lexical database. Cambridge, MA: MIT Press.

Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In SIGDOC '86: Proceedings of the 5th annual international conference on systems documentation (pp. 24–26). New York: ACM.

Li, F., & Shen, L. (2012). Can enemy items be automatically identified? Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, Canada.

Lin, C., & Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (pp. 71–78). Stroudsburg, PA: Association for Computational Linguistics.

Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 48, 238–243.

Palmer, M. (2011, June). Going beyond shallow semantics. Presented at the ACL 2011 Workshop on Relational Models of Semantics, Portland, Oregon.

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). Stroudsburg, PA: Association for Computational Linguistics.

Resnik, P. (1998). WordNet and class-based probabilities. In C. Fellbaum (Ed.), WordNet: An electronic lexical database (pp. 239–263). Cambridge, MA: MIT Press.

Rocchio, J. (1971). Relevance feedback in information retrieval. Englewood Cliffs, NJ: Prentice Hall.

Salton, G., Singhal, A., Mitra, M., & Buckley, C. (1997). Automatic text structuring and summarization. Information Processing and Management, 33(2), 193–207.

Schutze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1), 97–124.

Sheehan, K. M., Kostin, I., & Persky, H. (2006, April). Predicting item difficulty as a function of inferential processing requirements: An examination of the reading skills underlying performance on the NAEP grade 8 reading assessment. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, California.

Shermis, M. D., & Burstein, J. C. (2003). Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.


Turney, P. D. (2001). Mining the Web for synonyms: PMI versus LSA on TOEFL. In L. De Raedt & P. Flach (Eds.), Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001) (pp. 491–502). Freiburg, Germany.

Wu, Z., & Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico.
