Data Mining of Missing Persons Data

K. Blackmore¹, T. Bossomaier¹, S. Foy² and D. Thomson²

¹ School of Information Technology, Charles Sturt University, Bathurst, NSW 2795, Australia. kblackmore@csu.edu.au
² School of Social Sciences and Liberal Studies, Charles Sturt University, Bathurst, NSW 2795, Australia. [email protected]

Abstract: This paper presents the results of an analysis evaluating the effectiveness of data mining techniques for predicting the outcome of missing persons cases. A rule-based system is used to derive augmentations to supplement police officer intuition. Results indicate that rule-based systems can effectively identify variables for prediction.

Keywords: Classification; Rules; Missing Values; Uncertainty

1 Introduction

Approximately 7000 people go missing in NSW each year. Fortunately, around 99% of these missing persons are quickly located, with 86% returning home or being found by family or friends within a week. In some cases, however, the person who has been reported missing has become the victim of foul play or has committed suicide. While these high-priority cases occur infrequently, there are to date no adequate guidelines for police to use when judging the likelihood that a missing person is at risk of harm. In particular, inexperienced police officers may find it difficult to disregard stereotypical views of why the missing person is missing [1]. To standardise the process of assigning possible risk to missing person reports, a system is needed that is reliable, efficient and simple to use.

There are several powerful advantages in being able to deepen the inferences that can be drawn about the circumstances of a missing person's disappearance.

· The police response can be tuned to make the most efficient use of resources. While foul play or suicide as the reason for going missing will need an immediate response, a repeat runaway who is at low risk of foul play or suicide might be assigned lower priority.
· The police investigation can be focused: it is more important to ask the right questions and collect the most useful data than to try, and fail, to collect everything that might possibly be useful. Again, this is a resource issue.
· Where foul play is suspected, some inferences about the criminal modus operandi may be possible. Other disappearances, which may be separated over time and space, may have important linking features that help in apprehending the offender or in warning the public of specific dangers.

The NSW Police are therefore interested in a system that can both utilise artificial intelligence and be available across NSW for all police officers to use. Towards this end, case histories of disappearances with known outcomes falling into the categories of runaway, suicide and foul play were made available to the investigators.

We approach the problem in the first instance through a search for a rule-based classification system. Given the largely nominal nature of our first extract from the police data, rule systems are a natural choice. To remove the need for software training for police officers, heuristics that support the computational results are highly desirable: the rule-based approach generates simple augmentations to intuition that can be used by police anywhere, at any time.

The aim of this work is to show that, given the uncertain nature of the data, rules can be derived that predict the outcome of future missing persons cases. Some examples within the dataset escape any rule, exhibiting patterns that do not match their class. What was sought was therefore not a holistic analysis of all patterns, but a subset of patterns, presented as rules, that could be used reliably to supplement police intuition. The potential also exists for these patterns to be used to isolate links and possible repeat or serial "offenders".

2 Collection and parametrisation of data

2.1 Data Source

This research was funded by the Maurice Byers Foundation and was conducted in collaboration with the Missing Persons Unit (MPU) of the New South Wales Police Service. The Missing Persons Unit in Parramatta, Sydney, is the central location for the collation of all missing persons reports made within New South Wales, Australia. This study was conducted as a systematic investigation of archival police files originally compiled when the missing person report was made to the police and during the investigation of the person's whereabouts. The information used in this study was extracted from records held in a centralised database (COPS), from photocopies of records and other relevant information stored within files held at the Missing Persons Unit, and from relevant files archived within the Homicide Library of the NSW Police Service.

The criteria for including a case in this study were the circumstances surrounding the disappearance of the missing person, the theoretical and practical relevance of the case to the research, and the quality of the information contained in the police file. Only persons who were known to have run away, attempted or completed suicide, or fallen victim to foul play were included. Persons who had gone missing because of an accident or through being lost, and persons reported missing because of a misunderstanding, were not included. All files were treated with the strictest confidentiality, in accordance with the Charles Sturt University Animal and Human Ethics Committee, the Australian Psychological Society Code of Ethics and the New South Wales Police Service confidentiality practices.

2.2 Criterion Selection and Data Reduction

All aspects of the information contained within the files were considered for their capacity to predict the type of missing person. Determining which aspects of the data were suitable and obtainable required combing through the files a number of times. The priority in the initial stages of data collection was to maximise both the range of information included and the coverage of that information. Ultimately, deciding which variables were important to preserve, and which could be discarded, was a matter of judgement combined with the frequency with which the information was available.

The procedure adopted for the collection and coding of data comprised (a) listing the variables deemed desirable for the research; (b) listing the variables observed to be available; (c) inspecting case files for the quality of the records and to determine which variables were and were not usable in the study; and (d) reviewing the suitability of the variables in light of theoretical and practical relevance.

2.3 Data Quality

Technically, the dataset contained no missing values. This was because a response option of "not relevant" or "not known" was included for most variables, owing to the difficulty of finding files with all of the information relevant to the study. Missing data in a police file was taken to imply that the information of interest was either (a) not relevant to that particular case; (b) not known by the reporting person; (c) not considered for its possible relevance by the police officer; or (d) explored by the officer but not noted in the COPS narrative. Rather than scoring the response for that variable as missing data, it was more practical to categorise the absence of knowledge as "not relevant or not known" because of the non-specific nature of the lack of information.

Systematic randomisation of cases was not possible due to the nature of the data and issues of availability. However, all of the cases included in this research were from New South Wales. All the runaway cases occurred during 2000, and the suicide and foul play cases occurred between 1980 and 2000. Cases prior to 1980 were not included in the study because of the influence of temporal factors such as differences in the availability of forensic evidence [2], as well as changes to police recording methods. Because of the practical constraints on the data collection, it was not possible to validate the criterion measures by inter-rater reliability. Therefore, care was taken to define variables in a clear, precise and consistent manner. An effort was also made to minimise the impact of the retrospective design: only cases where the person had been missing for three months or less before being reported as a missing person were included. This condition aimed to reduce the influence of errors in the reporting person's memory recall when the report was made to the police [3].

By the completion of the data collection, a total of 26 input variables and 1 output variable described the sample of 357 finalised missing persons cases.
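The "not relevant or not known" coding convention can be illustrated with a short sketch. This is not the authors' actual preprocessing, which was performed manually from COPS records and file narratives; the variable names are borrowed from the rule listings later in the paper and the values are hypothetical.

```python
import pandas as pd
import numpy as np

# Hypothetical extract of coded case records; blanks in the police file
# arrive as NaN rather than as a meaningful category.
cases = pd.DataFrame({
    "PHSUICID": ["yes", np.nan, "no", np.nan],
    "MENTAL_R": [np.nan, "no_mental_health_problems_experienced", np.nan, np.nan],
    "OUTCOME":  ["suicide", "runaway", "foulplay", "runaway"],
})

# Collapse the non-specific absence of information into a single explicit
# category, so that the coded dataset technically contains no missing values.
predictors = [c for c in cases.columns if c != "OUTCOME"]
cases[predictors] = cases[predictors].fillna("not_relevant_or_not_known")

print(cases)
```

Coding absence explicitly, rather than leaving gaps, is what allows the rule learner described in Section 3 to treat "unknown" as an ordinary nominal value.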

Information present in every file, and deemed relevant, included age, gender, nationality by appearance, residential address, occupation, marital status, date last seen, and who the reporting person was. The remaining variables (Table 1) usually depended on information contained in the file's free narrative, which was reported by the officer in charge of the case.

Table 1. Variables based on information in police files describing missing persons

· Does missing person have any dependents
· Residential status
· Time of day when last seen
· Day of week when last seen
· Season of year when last seen
· Last seen in public
· Is this episode out of character for the missing person
· What does the reporting person suspect has happened
· Any known risk factors for foul play
· Is missing person known to be socially deviant or rebellious
· Is there a past history of running away
· Is there a past history of suicide attempt or ideation
· Any known mental health problems
· Any known drug and alcohol issues
· Any known short term stressors
· Any known long term stressors
· Method of suicide
· Was the perpetrator known or a stranger to the victim
· Was missing person alive, deceased or hospitalized when located

3 Developing rule systems

3.1 Participants

A total of 357 case files were used in this research. The sample ranged in age from 9 to 77 years, with a mean age of 28 years (SD = 15 years). There were 184 females (51.5%) and 173 males (48.5%) in the sample. Persons who appeared to be Caucasian were the most frequently reported missing, comprising 85.4% of the entire sample; those falling into 'all other categories' comprised 7%, Asian persons 4.8%, and persons appearing to be Aboriginal, the least frequently reported, 2.8%. The sample comprised 250 (70%) runaways, 54 (15.1%) victims of suicide and 53 (14.8%) persons missing due to foul play.

Table 2. Data summary (R = runaway; S = suicide; F = foul play)

Gender: Female 157 (48%); Male 168 (52%)
Age: under 18 years 107 (33%); 18 years and over 218 (67%)
Outcome: R 188 (58%); S 59 (18%); F 75 (23%)
Total cases: 325
Age range: 9–77 years
Mean age: 27.41 years

3.2 Rule Systems

Structural patterns in data can be discovered and expressed as rules, which are easily interpreted logical expressions. Covering algorithms use a "separate and conquer" strategy: the rule that explains or covers the most examples in the data set is determined, the examples covered by that rule are separated from the data set, and the procedure repeats on the remaining examples [4]. Decision trees are a popular "divide and conquer" method for discovering knowledge in data [5]; they represent variables and variable values as nodes, branches and leaves, from which rules can then be derived. Earlier work [6] using datasets containing missing values focussed on the problems associated with classifying and defining rules for such data, problems that are well recognised and widely discussed in the literature [7], [8], [9], [10].

Given that the occurrence of missing values in this dataset carries no significance, a suitable approach for developing rules was to apply a covering algorithm to decision trees derived from the training data. Algorithms that derive rule sets from decision trees first generate the tree and then transform it into a simplified set of rules [4]. Based on results from previous research [6], the WEKA J48.PART [11] algorithm was selected to derive and evaluate rule sets from the training data. This standard, widely used algorithm builds on C4.5 release 8 [4]: it produces partial decision trees and immediately converts each into the corresponding rule. The methods used were found to be robust against missing values.

Classification algorithms provide statistical measures of fit with which to assess the effectiveness of the derived rules in predicting outcomes. In this study, consistency was used as a measure of the derived rules' predictive ability: a rule appearing in all folds for all samples is considered a consistent and accurate predictor. Statistical measures of effectiveness can then be ascertained by averaging the individual measures.
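The tree-to-rule idea can be sketched as follows. This is an illustrative analogue using scikit-learn rather than the WEKA J48.PART implementation the study actually used; the variable names and values are hypothetical, and a real PART-style learner keeps only the best leaf of each partial tree as a rule before repeating on the uncovered cases.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy nominal data standing in for the coded case files (hypothetical values).
X = pd.DataFrame({
    "PHSUICID": ["yes", "unknown", "no", "unknown", "yes", "no"],
    "PUBLIC":   ["home", "public", "public", "home", "home", "public"],
})
y = ["suicide", "foulplay", "foulplay", "runaway", "suicide", "runaway"]

# scikit-learn trees need numeric inputs, so the nominal values are
# ordinal-encoded before fitting.
enc = OrdinalEncoder()
X_enc = enc.fit_transform(X)

tree = DecisionTreeClassifier(random_state=0).fit(X_enc, y)

# Each root-to-leaf path printed here corresponds to one candidate rule;
# converting paths to rules is the transformation step described above.
print(export_text(tree, feature_names=list(X.columns)))
```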

3.3 Methodology

Evaluation of an algorithm's predictive ability is best carried out by testing on data not used to derive the rules [9], so all training was carried out using tenfold cross-validation. Cross-validation is a standard method for estimating the error rate of a learning scheme on the data [9]. Tenfold cross-validation splits the data into a number of blocks equal to the chosen number of folds (in this case 10), each containing approximately the same number of cases and the same distribution of classes. Each case in the data is used exactly once as a test case, and the error rate of the classifier is estimated as the ratio of the total number of errors on the holdout cases to the total number of cases. The overall classification error is then reported as the average of the errors obtained during the test phase of each fold.

Results vary between iterations of a cross-validation depending on which cases fall in the training and holdout folds, which can lead to differences in overall results. Given the uneven distribution of cases in the dataset and the possible influence of case selection on fold composition, the tenfold cross-validation was repeated 10 times on randomised data. The results section below refers to each individual cross-validation as a "run".
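The resampling scheme can be sketched as follows, assuming placeholder data; the study itself used WEKA, so this only illustrates ten repetitions of a stratified tenfold cross-validation and the per-run averaging of error.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder encoded predictors and outcome labels (hypothetical data with
# roughly the class proportions reported in Table 2).
rng = np.random.RandomState(1)
X_enc = rng.randint(0, 3, size=(357, 24))
y = rng.choice(["runaway", "suicide", "foulplay"], size=357, p=[0.7, 0.15, 0.15])

# Ten repetitions of a stratified tenfold cross-validation: each fold keeps
# roughly the class distribution of the full sample, and each repetition
# reshuffles the cases before splitting.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_enc, y, cv=cv)

# One accuracy per fold; averaging within each repetition gives the
# per-"run" error rate reported in the results.
errors = (1.0 - scores).reshape(10, 10).mean(axis=1)
print("mean error per run:", np.round(errors, 3))
```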

4 Results

Preliminary analysis of variables was used to evaluate the worth of a subset of variables by considering the individual predictive ability of each feature along with the degree of redundancy between them [9]. Although significant correlations between the suspicion variable and the outcome were not identified using statistical analysis (Table 3), the variable was removed from the analysis due to possible conflicts. Reasoning for the removal of the suspicion variable is provided in the discussion.

Table 3. Correlation analysis of suspicion and outcome status

                          Runaway    Suicide    Foulplay
Suspicion - runaway         0.57      -0.31      -0.43
Suspicion - suicide        -0.27       0.57      -0.23
Suspicion - foulplay       -0.30      -0.20       0.59
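The paper does not state exactly how the Table 3 values were computed; one plausible reading is a correlation between 0/1 indicator variables for each suspicion category and each outcome category, sketched below on hypothetical data.

```python
import pandas as pd

# Hypothetical coded data: what the reporting person suspects, and the outcome.
df = pd.DataFrame({
    "SUSPECT": ["runaway", "suicide", "runaway", "foulplay", "suicide", "runaway"],
    "OUTCOME": ["runaway", "suicide", "foulplay", "foulplay", "suicide", "runaway"],
})

# Indicator (0/1) columns for each suspicion and outcome category; the Pearson
# correlation between two indicators is then a phi coefficient, one plausible
# analogue of the values reported in Table 3.
indicators = pd.get_dummies(df, columns=["SUSPECT", "OUTCOME"], dtype=float)
corr = indicators.corr()

# Cross-correlations between suspicion indicators and outcome indicators.
print(corr.loc[
    ["SUSPECT_runaway", "SUSPECT_suicide", "SUSPECT_foulplay"],
    ["OUTCOME_runaway", "OUTCOME_suicide", "OUTCOME_foulplay"],
])
```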

The status variable, which identifies whether the missing person was alive, deceased or hospitalized when located, was removed from the dataset due to its high correlation with the outcome. The remaining 24 variables were used for analysis. A "best first" classifier subset evaluation, using the J48.PART algorithm, was used to estimate the most promising set of attributes. From the entire dataset, the following variables were identified as having the highest predictive ability: age category, gender, marital status, residential status, appearance, person(s) reporting, previous history of suicide, mental health status, and whether the episode was out of character. No variables were identified as redundant.
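A rough analogue of this wrapper-style subset evaluation is sketched below, using scikit-learn's sequential forward selection in place of WEKA's best-first classifier subset search; the attribute names and data are placeholders, and the search strategy differs in detail from the one used in the study.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Placeholder encoded data with hypothetical attribute names.
feature_names = np.array([f"VAR_{i:02d}" for i in range(24)])
rng = np.random.RandomState(0)
X = rng.randint(0, 4, size=(357, 24))
y = rng.choice(["runaway", "suicide", "foulplay"], size=357, p=[0.7, 0.15, 0.15])

# Wrapper selection: greedily grow a subset of attributes, scoring each
# candidate subset by the cross-validated accuracy of the classifier itself.
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=9, direction="forward", cv=10)
selector.fit(X, y)

print("selected attributes:", feature_names[selector.get_support()])
```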

4.1 Rule determination

The WEKA J48.PART classifier derived 22 rules for each of the 10 repetitions of the tenfold cross-validation. Although the number of rules generated was consistent, the accuracy measures were inconsistent (Fig. 1). On average, 71% (253 of the 357 cases) were classified correctly.

[Figure omitted: classification error (%) for each of the ten runs.]

Fig. 1. Error for ten runs of a tenfold cross-validated classification using J48.PART

The confusion matrices for each classification run were combined and are shown in Table 4, which details the number of correctly and incorrectly classified cases in each class. While the overall predictive accuracy was 71%, the confusion matrix shows that the J48.PART algorithm correctly classifies 84% of runaway (R) cases, but only 39% of suicide (S) and 47% of foul play (F) cases.

Table 4. Classification confusion matrix (actual class → classified class)

          R→R   R→S   R→F   S→R   S→S   S→F   F→R   F→S   F→F
Run1      207    25    18    32    20     2    23     7    23
Run2      210    25    15    31    19     4    27     1    25
Run3      216    22    12    34    16     4    25     5    23
Run4      215    16    19    30    21     3    23     4    26
Run5      215    16    19    30    21     3    23     4    26
Run6      202    30    18    34    18     2    21     6    26
Run7      210    23    17    22    28     4    24     3    26
Run8      206    28    16    28    23     3    25     5    23
Run9      201    25    24    27    22     5    23     3    27
Run10     207    28    15    32    21     1    22     5    26
Avg       209    23    17    30    21     3    23    4.3    25
% Error    16                      61                       53
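The per-class figures quoted above follow directly from the averaged confusion matrix; the short sketch below recomputes them (small differences from the quoted 47% arise from rounding the averaged counts).

```python
import numpy as np

# Averaged confusion matrix from Table 4 (rows = actual, columns = classified),
# in the order runaway, suicide, foulplay.
cm = np.array([
    [209, 23,  17],   # actual runaway
    [ 30, 21,   3],   # actual suicide
    [ 23, 4.3, 25],   # actual foulplay
])

# Per-class accuracy (recall): diagonal counts over the row totals.
per_class = np.diag(cm) / cm.sum(axis=1)
overall = np.trace(cm) / cm.sum()

print(np.round(per_class * 100))  # close to the 84%, 39% and 47% quoted above
print(round(overall * 100))       # close to the 71% overall accuracy
```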

An analysis of the rules derived during each training fold of a single run identified a total of 195 distinctly different rules, and only 1 rule appeared consistently in all folds. In general, although the number of rules derived from the repeated cross-validations was consistent, the constitution of the rules was not: variations in the number of variables constituting a rule, and in the structure of the underlying decision trees, produced the inconsistencies. As with the rule consistency issues within the folds of a single classification run, a total of 120 distinctly different rules were identified across all 10 runs. Interestingly though, 10 of the 22 rules from each run appeared consistently across all runs, providing a valid subset, which is listed below.

· PHSUICID = unknown AND REPORTIN = immediate_family AND MENTAL_R = no_mental_health_problems_experienced: >> runaway
· PHSUICID = unknown AND REPORTIN = not_reported AND PUBLIC = public: >> foulplay
· PHSUICID = yes AND DEVIANTR = no_deviancy AND REGION = Sydney_metro AND MARITAL = single AND REPORTIN = immediate_family: >> runaway
· PHSUICID = yes AND DEVIANTR = no_deviancy AND RESIDENT = sharing_married_defacto: >> suicide
· SHORTERM = mental_health AND REGION = regional_NSW AND CHARACTE = out_of_character: >> suicide
· PUBLIC = home AND DEPENDEN = no AND REPORTIN = immediate_family: >> runaway
· DEVIANTR = social_externalised AND REPORTIN = immediate_family AND ALCODRUG = no_drug_or_alcohol_problems_mentioned: >> runaway
· PHSUICID = unknown AND MARITAL = married_defacto AND PASTRUN = first_time: >> suicide
· PHSUICID = no AND ALCODRUG = no_drug_or_alcohol_problems_mentioned AND PUBLIC = public AND AGE_BEE = 18_25: >> foulplay
· ALCODRUG = no_drug_or_alcohol_problems_mentioned AND MENTAL_R = no_mental_health_problems_experienced AND PUBLIC = public AND AGE_BEE = 17_under: >> foulplay
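Rules of this kind form an ordered decision list that could be applied directly to a newly coded report. The sketch below encodes two of the rules above as a simple first-match classifier; the report dictionary is hypothetical, and a deployed system would carry the full rule set with a sensible default class.

```python
# Each rule is a list of (variable, required value) conditions plus a predicted
# outcome; the two rules below are taken from the consistent subset above.
RULES = [
    ([("PHSUICID", "unknown"),
      ("REPORTIN", "immediate_family"),
      ("MENTAL_R", "no_mental_health_problems_experienced")], "runaway"),
    ([("PHSUICID", "unknown"),
      ("REPORTIN", "not_reported"),
      ("PUBLIC", "public")], "foulplay"),
]

def classify(case, rules=RULES, default="unclassified"):
    """Return the outcome of the first rule whose conditions all hold."""
    for conditions, outcome in rules:
        if all(case.get(var) == value for var, value in conditions):
            return outcome
    return default

# Hypothetical new report coded with the same categories as the training data.
report = {"PHSUICID": "unknown", "REPORTIN": "immediate_family",
          "MENTAL_R": "no_mental_health_problems_experienced"}
print(classify(report))   # -> runaway
```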

5 Discussion

Although consistency was not achieved across the derived rules, pruning yielded several valid rules that appeared consistently in all of the 10 cross-validations performed. The consistency of these rules makes them valuable supplements to police intuition. The variables identified in the subset evaluation as having predictive ability, and those appearing consistently in derived rules, were flagged as priorities for missing value reduction and re-evaluation for coding inconsistencies.

The presence of consistent predictor variables supports the need for quality control in data collection. Although data quality is a typical problem, police officers investigating missing persons often deal with conflicting, scarce and emotionally charged information. Knowledge of the key factors affecting the outcome provides a guide for police interviews and data capture. Although the derived rules are generally inconsistent, the consistent variables identified in this study are, in this regard, perhaps more useful than soft-computing approaches.

The data obtained by police officers investigating reported missing persons consist of facts, judgements and model-based attitudes. Factual information such as gender, marital status and location last seen presents few issues for data capture. Judgements are required when collecting information on variables such as mental health problems, deviancy, and drug and alcohol problems, and a more structured approach to the capture of this information may improve data quality. For instance, a field-based tool that prompts officers to explore specific areas of concern, such as the duration or type of drug abuse, may provide richer data for analysis and ultimately more comprehensive and effective prediction of outcomes. The suspicion variable is a model-based attitude, in itself reflecting an individual's prediction of the outcome based on rules derived from the underlying variables. For this reason, suspicion is considered correlated with the outcome category and exists as a stand-alone prediction rather than a factor that may affect the outcome. Mental health problems also appear to be valid indicators, although they may not be as simple as a binary answer; their predictive ability may be improved by a scaled system indicating the severity of illness.

The WEKA J48.PART algorithm produced inconsistent results. Given that the most consistent rules were sought, correct information may have been discarded to achieve consistency. Recent work [12], [13] has therefore considered the use of soft-computing methodologies for the missing persons problem. The predictive accuracy of the rule-based classifier was compared with that of an artificial neural network (ANN). The ANN achieved superior accuracy over the WEKA J48.PART rule-based classifier, correctly predicting outcomes for 99% of cases. This issue is particularly pertinent given the vast array of "off-the-shelf" data mining software applications currently being applied to a diverse range of problems. Clearly, algorithms produce different results given the same training data, and care must be taken to ensure the correct method is "on the job". Although in this case ANNs offer improved predictive accuracy over rule-based classifiers, the nature of the problem domain requires rules or insights to support police officer intuition. To this end, a method of rule extraction from ANNs using a genetic algorithm has been used [12] to extract heuristics from the trained network. The resulting rule set was found to "cover", or explain, 86% of cases in the dataset. Generalisation issues exist with the rule extraction method used; however, combining an ANN and a genetic algorithm appears to be a more appropriate approach to the missing persons problem.

Rules must be reliable in order to supplement police intuition and general "rules of thumb". Interesting results arose which tend to go against general intuition: reliable associations between missing persons reported by a member of their immediate family (implying runaway) and missing persons married or in a de facto relationship (implying suicide) are unexpected patterns, as is the limited predictive value of the time of day of the disappearance. This study has provided insight into variables that have the potential to accurately predict outcomes for missing persons cases and has highlighted issues pertaining to data capture, preprocessing and rule determination. Data capture and quality may be improved through the use of structured systems, such as forms or computer collection of data using palm-tops, to prompt and guide police officers and ensure all pertinent data is consistently collected. There are also indications that the missing persons problem is an instance of a "random problem" [14], a problem of high Kolmogorov complexity [15].
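The ANN comparison itself is reported in [12] and [13]; the sketch below only indicates how such a comparison could be set up with a generic multilayer perceptron under the same cross-validation protocol, on placeholder data rather than the actual case files, and is not the authors' network architecture.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder encoded data (hypothetical); in practice these would be the
# coded missing-persons cases used elsewhere in the paper.
rng = np.random.RandomState(0)
X_enc = rng.randint(0, 4, size=(357, 24)).astype(float)
y = rng.choice(["runaway", "suicide", "foulplay"], size=357, p=[0.7, 0.15, 0.15])

# Same stratified tenfold protocol for both learners, so any accuracy
# difference reflects the algorithm rather than the split.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for name, clf in [("rule/tree learner", DecisionTreeClassifier(random_state=0)),
                  ("MLP (ANN)", MLPClassifier(hidden_layer_sizes=(20,),
                                              max_iter=2000, random_state=0))]:
    acc = cross_val_score(clf, X_enc, y, cv=cv).mean()
    print(f"{name}: {acc:.2f} cross-validated accuracy")
```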

6 Acknowledgements

This research was carried out under a grant from the Sir Maurice Byers Research Fellowship of the NSW Police Service. Foy thanks the NSW Police Service Missing Persons Unit, notably Geoff Emery and Jane Suttcliffe, for their assistance with data capture.

References

1. Newiss G (1999) The police response to missing persons. Police Research Series Paper 114, Home Office
2. Salfati CG (2000) The nature of expressiveness and instrumentality in homicide: implications for offender profiling. Homicide Studies 4(3):265-293
3. Conwell Y, Duberstein PR, Cox C, Herrmann JH (1996) Relationships of age and axis I diagnoses in victims of completed suicide: a psychological autopsy study. American Journal of Psychiatry 153(8):1001-1008
4. Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann, San Mateo, CA
5. Crémilleux B (2000) Decision trees as a data mining tool. Computing and Information Systems 7(3):91-97, University of Paisley
6. Blackmore K, Bossomaier T, Foy S, Thomson D (2002) Data mining of missing persons data. In: Proceedings of the 1st International Conference on Fuzzy Systems and Knowledge Discovery (FSKD02), Orchid Country Club, Singapore, 18-22 November 2002
7. Ragel A, Crémilleux B (1998) Treatment of missing values for association rules. In: Proceedings of the Second Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-98), Melbourne, Australia, pp 258-270
8. Crémilleux B, Ragel A, Bosson JL (1999) An interactive and understandable method to treat missing values: application to a medical data set. In: Torres M, Sanchez B, Wills E (eds) Proceedings of the 5th International Conference on Information Systems Analysis and Synthesis (ISAS/SCI 99), Orlando, FL, pp 137-144
9. Witten IH, Frank E (2000) Data mining: practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Francisco
10. Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press
11. Frank E, Witten IH (1998) Generating accurate rule sets without global optimisation. In: Shavlik J (ed) Proceedings of ICML'98, International Conference on Machine Learning, Morgan Kaufmann, Madison, WI, pp 144-151
12. Blackmore KL, Bossomaier TRJ (2003) Using a neural network and genetic algorithm to extract decision rules. In: Proceedings of the 8th Australian and New Zealand Conference on Intelligent Information Systems, Macquarie University, Sydney, Australia, 10-12 December 2003
13. Blackmore KL, Bossomaier TRJ (2003) Soft computing methodologies for mining missing person data. International Journal of Knowledge-based Intelligent Engineering Systems (Howlett RJ, Jain LC, eds) 7(3)
14. Abu-Mostafa Y (1986) Complexity of random problems. In: Abu-Mostafa Y (ed) Complexity in information theory. Springer-Verlag
15. Li M, Vitanyi P (1997) An introduction to Kolmogorov complexity and its applications, 2nd edn. Springer