ERGONOMICS, 2003, VOL. 46, NO. 1-3, 188-196
A data mining technique for discovering distinct patterns of hand signs: implications in user training and computer interface design

NONG YE*, XIANGYANG LI and TONI FARLEY

Department of Industrial Engineering, Arizona State University, Box 875906, Tempe, Arizona, 85287-5906, USA

*Author for correspondence; e-mail: [email protected]

Keywords: Hand signs; user training; computer interface design; data mining.

Hand signs are considered one of the important ways to enter information into computers for certain tasks. Computers receive sensor data of hand signs for recognition. When using hand signs as computer inputs, we need to (1) train computer users in the sign language so that their hand signs can be easily recognized by computers, and (2) design the computer interface to avoid the use of confusing signs, improving user input performance and user satisfaction. For user training and computer interface design, it is important to know which signs can be easily recognized by computers and which signs computers cannot distinguish. This paper presents a data mining technique to discover distinct patterns of hand signs from sensor data. Based on these patterns, we derive a group of signs that are indistinguishable by computers. Such information can in turn assist in user training and computer interface design.
1. Introduction
Many areas of ergonomics involve the processing of collected data to determine characteristics of a data set such as usability, performance, human error probability, and so on (Sanders and McCormick 1992, Wickens et al. 1997). These data may include human data (height, weight, etc.) as well as human generated data (physical abilities, cognitive abilities, error level, etc.). To analyse the meaning of a certain set of data, it is beneficial to cluster the input data into meaningful categories depending upon the aim of the research. Such categories might include degree of difficulty, comfort level, ease of use, level of human error, etc. After clustering a data set, researchers can discover and observe data patterns by looking at the resulting clusters. These patterns may be desirable or undesirable with respect to the aim of the research. With the discovery of such patterns, a researcher can then identify patterns to strive for, as well as patterns to attempt to suppress. Furthermore, the ability to cluster data into multiple categories allows one to determine the frequency percentages of each data class and to discover which patterns appear more or less often, for use in determining such things as usage and level of importance. In this study, we focus on clustering a set of human generated sign language data to discover existing patterns within the data. By clustering the data, we can observe
the similarities and differences between the input signs. There are two main implications of this study. One implication is for designing a human-computer interface for signing. By knowing which signs are similar enough that they are indistinguishable by a computer, we can eliminate those signs or find a way to otherwise distinguish them. Another implication is its usefulness to a teacher or practitioner of signing. Knowing that two different signs are indistinguishable to a computer tells us that these signs have a greater chance of being indistinguishable to humans. This knowledge allows a teacher or practitioner to use care when teaching or using signs that a student or audience may not interpret correctly.

The algorithm we use to cluster the data is based on an incremental supervised clustering method, called the Clustering and Classification Algorithm - Supervised (CCAS) (Ye and Li 2001, 2002b, Li and Ye 2002b). The algorithm first uses information contained in the data to produce distinct clusters of data. The clusters are then subject to redistribution and hierarchical processing techniques to improve the accuracy of clustering. These enhanced clustering techniques help reduce the impact of noise that may be present in the data. Several other data mining algorithms can also be used for data clustering and classification, including decision trees (Ye et al. 2000, Li and Ye 2002a), association rules (Lee et al. 1999), artificial neural networks and genetic algorithms (Endler 1998, Sinclair et al. 1999), and Bayesian networks (Valdes and Skinner 2000). The difference between CCAS and other data mining algorithms is the ability of CCAS to supervise the clustering and learn data patterns using an incremental, scalable learning approach. This algorithm has proven successful in data mining for computer intrusion detection (Ye and Li 2001, 2002a, Li and Ye 2002b). However, its success is not limited to that domain. CCAS is a general clustering and classification algorithm that we can apply to any area of study that requires such data mining techniques. One relevant area is ergonomics. Many areas of ergonomics involve the processing of human or human generated data (Sanders and McCormick 1992, Wickens et al. 1997), and with these data comes the need for data mining techniques. The technique used throughout this study is the clustering of data for the purpose of discovering patterns in the data set.

This paper presents the application of CCAS to the analysis of ergonomics data. Section 2 provides a review of the CCAS method. Section 3 presents the applicability of CCAS to ergonomics and describes the sign language data set used for supervised clustering. Section 4 presents the results of the clustering process and discusses their implications. Section 5 concludes the paper.

2. Review of CCAS
The CCAS algorithm builds upon a number of concepts from traditional and scalable cluster analysis (Jain and Dubes 1988, Procopiuc 1997, Zhang 1997, Ester et al. 1998, Sheikholeslami et al. 1998, Harsha and Choudhary 1999, Huang et al. 1992) and instance-based classification (Mitchell 1997, Cherkassky and Mulier 1998), along with several innovative concepts that we developed. Li and Ye (2002b) give a detailed comparison of CCAS with existing algorithms of cluster analysis and instance-based classification and a detailed discussion of the scalable, incremental learning ability of CCAS. In this section, we briefly review the processing steps of CCAS.
We represent each data record in the input data set as a (p+1)-tuple: the input variable vector X in p dimensions, {X_1, X_2, . . . , X_p}, and the target variable Y representing the target class of the data record. Then a data
record is a data point in a p-dimensional space. Each input variable is numeric or nominal. Y can be a binary variable with value 0 or 1, or a multi-category nominal variable as needed. In this paper, we describe CCAS for numeric input variables and nominal target variables.

2.1. Cluster representation and distance measures
In CCAS, a cluster L containing a number of data points is represented by the centroid of all the data points in it, with coordinates X_L, the number of data points N_L, and the target class Y_L. We calculate X_L as:

\[ X_L = \frac{1}{N_L} \sum_{j=1}^{N_L} X_j \qquad (1) \]

where X_j is the coordinates of the jth point in this cluster. It is possible to calculate the distance from a data point to a cluster using different distance measures. We use the weighted Euclidean distance in this study:

\[ d(X, L) = \sqrt{ \sum_{i=1}^{p} (X_i - X_{Li})^2 \, r_{iY}^2 } \qquad (2) \]

where X_i and X_{Li} are the coordinates of the data point X and the cluster L's centroid on the ith dimension, and r_iY is the correlation coefficient between the input variable X_i and the target variable Y.
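To make this concrete, the following minimal Python sketch (ours, not the authors' implementation; all names are illustrative) shows one way to hold a cluster record and evaluate the weighted distance of equation (2):

```python
# Sketch of the CCAS cluster representation and the weighted Euclidean
# distance of equation (2).  Illustrative only; not the authors' code.
from dataclasses import dataclass

import numpy as np


@dataclass
class Cluster:
    centroid: np.ndarray  # X_L: centroid coordinates in p dimensions
    n_points: int         # N_L: number of data points in the cluster
    target: int           # Y_L: target class shared by the points


def weighted_distance(x: np.ndarray, cluster: Cluster, r2: np.ndarray) -> float:
    """d(X, L) with per-dimension weights r2[i] = r_iY^2, the squared
    correlation between input variable X_i and the target variable Y."""
    diff = x - cluster.centroid
    return float(np.sqrt(np.sum(diff * diff * r2)))
```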
2.2. Five steps of CCAS
There are five steps in the CCAS algorithm:

1. Scan the input data.
2. Incrementally group data into clusters.
3. Redistribute data points among clusters.
4. Perform a supervised hierarchical clustering of the clusters produced in Step 3.
5. Refine the clusters by removing outliers.
The first two steps represent the incremental clustering of the data. CCAS uses the target variable information to supervise the incremental grouping of the N data points of the data set into clusters. During these steps, CCAS performs a non-hierarchical clustering procedure based on the distance information as well as the target class information of the data points in the input data set.

Step 1. Scan the data and compute necessary training parameters
This step calculates the squared correlation coefficient r_iY^2 between each variable X_i and the target variable Y to determine how relevant each variable is for clustering with respect to the target class. For data points n = 1, . . . , N,

\[ r_{iY}^2(n) = \frac{S_{iY}^2(n)}{S_{ii}(n)\, S_{YY}(n)} \qquad (3) \]

where the variances S_ii(n) and S_YY(n) and the covariance S_iY(n) can be calculated incrementally, as shown in Li and Ye (2002b).
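As an illustration, a single scan over the data can accumulate the sums behind equation (3), as in the following sketch; the standard running-sum identities are used here, and the exact incremental form of Li and Ye (2002b) may differ:

```python
# One-pass sketch of Step 1: squared correlation r_iY^2 between each input
# variable and the (numerically coded) target.  Standard running sums are
# used; the exact incremental form in Li and Ye (2002b) may differ.
import numpy as np


def squared_correlations(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """X: N x p data matrix; y: length-N numeric coding of the target class.
    Returns r2 with r2[i] = r_iY^2 as in equation (3)."""
    n, p = X.shape
    sx, sxx, sxy = np.zeros(p), np.zeros(p), np.zeros(p)
    sy, syy = 0.0, 0.0
    for j in range(n):  # single scan over the data records
        sx += X[j]
        sxx += X[j] * X[j]
        sxy += X[j] * y[j]
        sy += y[j]
        syy += y[j] * y[j]
    s_iy = sxy - sx * sy / n  # unnormalized covariance S_iY
    s_ii = sxx - sx * sx / n  # unnormalized variance S_ii
    s_yy = syy - sy * sy / n  # unnormalized variance S_YY
    return (s_iy ** 2) / (s_ii * s_yy)  # constant variables would divide by 0
```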
Step 2. Incrementally group each data point in the data set into clusters
The supervised aspect of the algorithm is contained in the method we use for this step. We have two choices for this method: the dummy-clusters method and the grid-based method (Li and Ye 2002b). In this study we use the grid-based approach. First, we divide the space of data points into grid cells. Many methods are available for this purpose. In our study, we divide each dimension into a set of equal intervals along the range limited by the minimum and maximum values of the data points in that dimension. Then we separate the entire space into 'cubic' cells using the grids determined by the endpoints of these intervals. Each cluster belongs to a grid cell, and we can assign to it an index referring to its grid cell. Then, in clustering, for each data point we search only the existing clusters in its grid cell to look for the nearest cluster. If there is no cluster in the grid cell of this data point, we create a new cluster. The clustering has the following operations.

1. For a data point X, calculate the grid index.
2. Search the clusters in the indexed grid cell for the nearest cluster L to this data point, using the distance measure d(X, L) given above.
3. If the nearest cluster L has the same target class as X, add X to L and update the centroid coordinates of L and the number of data points N_L in this cluster:

\[ X_{Li} = \frac{N_L X_{Li} + X_i}{N_L + 1} \quad \text{for } i = 1, 2, \ldots, p; \qquad N_L = N_L + 1 \qquad (4) \]

4. Otherwise, create a new cluster with this data point as the centroid; the number of data points in the new cluster is one, and the target class of the new cluster is the class of this data point.
5. Repeat (1) to (4) until no data point is left in the data set.

Step 3. Redistribute data among clusters
The clustering procedure above produces clusters of homogeneous data points. However, a shortcoming of the incremental clustering procedure is that each data point interacts only with the nearest cluster in the cluster structure that exists at the time of processing. The redistribution step reduces the impact of this shortcoming, which is associated with the presentation order of the data points. In redistribution, we re-cluster all data points in the data set using the clusters from the previous step as the initial seeds. In this strategy, we discard the nearest seed cluster after we find it for a data point, and we start a new cluster using that data point. Therefore, only one copy of each point can be incorporated into one cluster after redistribution. We allow new clusters to emerge, since this option gives more freedom to the clustering operation and adjusts the cluster structure more accurately. The clustering procedure is the same as the grid-based clustering above.
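The following sketch illustrates the grouping operations of Step 2 (the redistribution of Step 3 reruns essentially the same procedure, seeded with the Step 2 clusters); it reuses the hypothetical Cluster and weighted_distance helpers from the sketch in section 2.1:

```python
# Sketch of the grid-based incremental grouping of Step 2.  Reuses the
# Cluster and weighted_distance helpers sketched earlier; illustrative only.
from collections import defaultdict

import numpy as np


def grid_cell(x, lo, hi, bins):
    """Index of the grid cell containing point x, with each dimension cut
    into `bins` equal intervals between its minimum and maximum values."""
    idx = np.floor((x - lo) / (hi - lo + 1e-12) * bins).astype(int)
    return tuple(np.minimum(idx, bins - 1))


def incremental_grouping(X, y, r2, bins=5):
    lo, hi = X.min(axis=0), X.max(axis=0)
    cells = defaultdict(list)  # grid cell index -> clusters in that cell
    for x, target in zip(X, y):
        cell = cells[grid_cell(x, lo, hi, bins)]
        nearest = min(cell, key=lambda c: weighted_distance(x, c, r2),
                      default=None)
        if nearest is not None and nearest.target == target:
            # Equation (4): fold the point into the nearest same-class cluster.
            nearest.centroid = ((nearest.n_points * nearest.centroid + x)
                                / (nearest.n_points + 1))
            nearest.n_points += 1
        else:
            # No cluster in this cell, or nearest has a different class:
            # start a new cluster seeded by this data point.
            cell.append(Cluster(centroid=x.astype(float), n_points=1,
                                target=target))
    return [c for cell in cells.values() for c in cell]
```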
Step 4. Perform a supervised hierarchical clustering of the clusters produced in Step 3
In this step, we use a supervised hierarchical clustering algorithm to post-process the clusters produced in Step 3. This algorithm uses the basic concept of hierarchical clustering, but differs from the traditional hierarchical clustering algorithm because it combines only pairs of clusters that are nearest to each other and have the same target class label. Therefore, this hierarchical clustering will produce a cluster structure where neighbouring clusters have different class labels.

Step 5. Refine the clusters by removing outliers
This step further refines the produced cluster structure by removing possible noise (outliers) present in the training data. Clusters that have very few data points may represent noise. We remove those clusters whose number of data points is smaller than or equal to a threshold value. The threshold for determining possible outliers could be chosen based on the average number of points in clusters of each target class, and the thresholds for clusters of different target classes can differ. For this study, we use one as the threshold value.

2.3. CCAS outcome
The procedure steps outlined above map the data points into a cluster structure with clusters of different target classes. For example, in this study a cluster structure from the data contains clusters that represent homogeneous groups of signs separated by class.

3. Application of CCAS to sign sensor data
We test the application of CCAS to sign sensor data by clustering a set of human generated sign language data. The purpose of the clustering is to record how well we can distinguish different signs from each other and to discover which signs are more difficult to differentiate in practice, for computers as well as humans.

3.1. Data representation
The sign language data used for this study consist of a subset of Auslan (Australian Sign Language) signs. The donor of these data (http://www.cse.unsw.edu.au/~waleed/tml/data/) collected the data subset from five different signers of differing backgrounds. The five signers each provided samples of 95 different signs (see table 1).

Table 1. Ninety-five signs used in the study.
alive, all, answer, boy, building, buy, change (mind), cold, come, computer (PC), cost, crazy, danger, deaf, different, draw, drink, eat, exit, flash-light, forget, girl, give, glove, go, god, happy, head, hear, hello, her, hot, how, hurry, hurt, I, innocent, is (true), joke, juice, know, later, lose, love, make, man, maybe, mine, money, more, name, no, Norway, not-my-problem, paper, pen, please, polite, question, read, ready, research, responsible, right, sad, same, science, share, shop, soon, sorry, spend, stubborn, surprise, take, temper, thank, think, tray, us, voluntary, wait (not yet), what, when, where, which, who, why, wild, will, write, wrong, yes, you, zero.
For this paper, we chose three of the signers to be included in the study, covering three different categories of signer: a sign linguist (Adam), a natural signer (Andrew), and a professional Auslan interpreter (John). Each of these signers submitted samples of the 95 signs. For Adam and Andrew we use six samples of each sign, and for John we have 15 samples of each sign. Therefore, the total number of signs in our data set is 2565.

The donor collected the data using an electronic glove to record hand movements. For each sign, the recorded data include the position (x, y, z) of the glove accurate to 8 bits (this measurement is susceptible to noise), the roll of the wrist accurate to just less than 4 bits, and the finger bend on the first four fingers accurate to 2 bits. A more in-depth description of the signs used and the signers (including a list of signs, the reason each was chosen, and the method of sign collection) can be found at the donor's website referenced in the link above and in the acknowledgements at the end of this paper.

The raw data contain the basic (and hopefully complete) information about the studied subject. In our case, the raw data include the position coordinates in the x, y, and z dimensions, the rotation of the wrist in 30° increments, and the finger bend for the first four fingers. These attributes are recorded continuously for each sign, producing a set of frames with a sampling frequency of about 23 Hz.

One very important step in a data mining application after data collection is feature selection. We use the feature selection methods described in Kadous (1995). First, we use a simple median filter to remove the 'glitches' in the data. These glitches are caused by noise introduced by the electronic glove. The function of this median filter is to find implausible data records, i.e. those showing overly abrupt hand movements between frames, and then recalculate the values of such a data record as the average of its previous and next data records. After such filtering, we build histograms of frames over each attribute's value range, which is divided into a number of intervals, for a complete sign sample. In our study, we use the division numbers recommended in Kadous (1995). There are six divisions each for x, y, and z. We divide the wrist rotation representation into palm down, palm up, palm left, and palm right. The finger bend is divided into four intervals owing to the recording accuracy. In addition, we select another set of features based on time segmentation. The frames of each sign sample are segmented into several equal groups according to time order. Here we divide a sign sample into five segments, as recommended in Kadous (1995). Then we calculate the average of each attribute in every segment. Thus, after feature selection, we have in total 78 features used as the input variables for our CCAS algorithm. These features capture the characteristics of each sign sample in spatial density and temporal position (a minimal sketch of this computation is given after the list of observations below).

3.2. Aim of study
The aim of this study is to discover patterns in the sign data. We begin by using the CCAS algorithm to cluster the signs into homogeneous groups. We then examine these clusters to determine whether any signs are missing from the clusters. We also examine the distance between the clusters. There are two key observations that we look for in the data.

1. If a single sign is not represented by a cluster, then we determine that the computer is unable to distinguish this sign from other signs.
2. If two distinct sign clusters are very close together in distance, then we determine that the signs have very similar patterns that may be indistinguishable by computers and humans.
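For context, the feature extraction described in section 3.1 can be sketched as follows; the frame layout and helper names are our assumptions, and the glitch filter is simplified relative to the one in Kadous (1995):

```python
# Sketch of the 78-feature extraction of section 3.1.  `frames` is a T x 8
# array per sign sample: x, y, z, wrist rotation, and four finger bends.
# The frame layout, helper names, and simplified glitch filter are our
# assumptions, not taken from Kadous (1995).
import numpy as np

BINS = [6, 6, 6, 4, 4, 4, 4, 4]  # histogram divisions: 6+6+6+4+4*4 = 38
SEGMENTS = 5                     # time segments: 8 attributes x 5 = 40


def smooth_glitches(frames: np.ndarray) -> np.ndarray:
    """Crude stand-in for the median filter: replace each interior frame by
    the element-wise median of itself and its two neighbours."""
    out = frames.astype(float).copy()
    for t in range(1, len(frames) - 1):
        out[t] = np.median(frames[t - 1:t + 2], axis=0)
    return out


def extract_features(frames: np.ndarray) -> np.ndarray:
    frames = smooth_glitches(frames)
    feats = []
    # Spatial density: per-attribute histogram over its observed value range.
    for a, bins in enumerate(BINS):
        hist, _ = np.histogram(frames[:, a], bins=bins)
        feats.extend(hist / len(frames))
    # Temporal position: mean of each attribute in five equal time segments.
    for segment in np.array_split(frames, SEGMENTS):
        feats.extend(segment.mean(axis=0))
    return np.asarray(feats)  # 38 + 40 = 78 features
```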
With this information, we can advise individuals working with sign language data. We can advise a software engineer who is designing a human-computer interface for signing which signs are most difficult for the computer to recognize. We can advise a teacher of signing which signs are more difficult for a pupil to distinguish, so that they may advise the student to take care when making or reading these signs. We can advise a practitioner of signing which signs are more likely to be misread, so that they might take more care when using these signs. Any other application of sign language learning and use, by humans and computers alike, can benefit from processing sign language data with CCAS.

4. Result analysis
We ran the CCAS algorithm on the data set for each of the three signers in our study. We present the results in terms of the two key observations given above.

4.1. Signs without clusters
There are three signs for which no cluster exists in two of the three signers' final cluster structures. These signs are sad, share, and think. The clusters generated for these three signs could not combine into larger clusters in the hierarchical clustering step, and were then removed in the outlier removal step because each contained only one instance. A possible reason for this problem is that the selected features are not powerful enough to represent the common characteristics of these signs: the instances of one such sign, as represented by these features, are not as close to each other as they are to instances of other signs. This implies that, of the 95 signs processed, these are the three most difficult signs for the computer to recognize for this signer group.

There are also 21 signs for which no cluster exists in one of the three signers' final cluster structures. This implies that these signs are somewhat difficult for the computer to recognize for this signer group, and that more powerful feature selection methods are required to extract the common characteristics of these signs and to distinguish them from other signs. The majority of these 21 signs (11 of them) are from Andrew's set, with the other 10 evenly distributed between Adam and John. This suggests the possibility that natural signers might be more difficult for a computer to understand, a discovery that warrants further investigation.

4.2. Pairs of signs that are close in distance
We chose threshold values to determine the relative distance between clusters. In this study, our results show that a good threshold distance to use, when considering the distance between two clusters as being close, has an order of magnitude of 10^-3.
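One way such close pairs could be read off a final cluster structure is sketched below; the list-of-pairs representation and threshold handling are our assumptions, not the authors' code:

```python
# Sketch of flagging pairs of *different* signs whose cluster centroids lie
# within the distance threshold of section 4.2.  Illustrative only.
import itertools

import numpy as np


def near_sign_pairs(clusters, r2, threshold=1e-3):
    """clusters: list of (sign_label, centroid) tuples from the final CCAS
    cluster structure.  Returns (sign_a, sign_b, distance), closest first."""
    pairs = []
    for (sign_a, ca), (sign_b, cb) in itertools.combinations(clusters, 2):
        if sign_a == sign_b:
            continue  # clusters of the same sign are expected to be close
        diff = ca - cb
        d = float(np.sqrt(np.sum(diff * diff * r2)))
        if d <= threshold:
            pairs.append((sign_a, sign_b, d))
    return sorted(pairs, key=lambda p: p[2])
```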
Table 2. Nearest signs in the clusters.
eat: exit, why
go: is (true), lose
happy: cost, glove, computer (PC)
hot: drink
more: spend
shop: right

Table 2 summarizes our findings for those sign clusters that have neighbouring clusters within the threshold distance. Each row in the table represents a different grouping of similar signs, with the signs listed in order from the closest close neighbour to the furthest close neighbour. The results are the combined results of all three signers; we only show results that held true for at least two of the three signers. The results imply that the signs in these groups appear the most similar to the computer. Each of these groups of two to four signs represents a group of signs that are most confusing to the computer and therefore subject to misinterpretation.

Combining our two key observations, we find that the most difficult signs for the human-computer interface, for these 95 signs and three signers, are sad, share, think, eat, exit, why, go, is (true), lose, happy, cost, glove, computer (PC), hot, drink, more, spend, shop, and right.

In addition to these signs, there are other signs that appear as clusters of only two signs. We could increase our list of confusing signs by increasing the threshold value in Step 5 of the CCAS algorithm; for example, if we increased the value to 2, these additional sign clusters would be removed as outliers. There are also other clusters that are close to each other. We could increase our list by looking at signs with a distance greater than our chosen threshold of order of magnitude 10^-3. Using combinations of these two adjustments, we can extend the study to include different levels and degrees of difficulty in the computer recognition of signs. In addition to increasing our list of signs, we could refine the results by looking at data from the other two signers available on the donor website. We leave these enhancements to further studies.

5. Conclusion
In this paper, we present the scalable, incremental data-mining algorithm CCAS as an analytical tool for analysing data related to ergonomics studies. This algorithm has proven useful in classification problems in the field of computer intrusion detection, and researchers can apply CCAS to other areas of study as a useful data mining technique. It applies a clustering algorithm to training instances, supervised by their class, and its post-processing methods, including redistribution and a special hierarchical clustering, help produce a more accurate cluster structure to represent the patterns in the data.

We test the application of CCAS to ergonomics using a set of human generated sign language data. The test results show that the application of CCAS is not limited to traditional classification problems. The algorithm is also useful in clustering-like problems where we try to analyse the structure of data, even though the original data are in the form of a typical classification application. We can apply CCAS to ergonomics as well as to other data mining problems that require clustering techniques.

Acknowledgements
Dr Nong Ye would like to dedicate this paper to her doctoral advisor at Purdue University, Dr Gavriel Salvendy, to express her deep appreciation of Dr Salvendy's great support, help, guidance, wisdom, kindness and friendship since 1989, when Dr Ye started her PhD study with Dr Salvendy. The sign language data set used in this paper may be used provided that: the data source is acknowledged (http://www.cse.unsw.edu.au/~waleed/tml/data/); the user informs the donor if any work using this data set is published; and the user acknowledges that commercial use requires donor permission.
References
CHERKASSKY, V. and MULIER, F., 1998, Learning from Data (Chichester: John Wiley & Sons).
ENDLER, D., 1998, Intrusion detection: applying machine learning to Solaris audit data, Proceedings of the 14th Annual Computer Security Applications Conference (ACSAC '98), 268-279.
ESTER, M., KRIEGEL, H. P., SANDER, J., WIMMER, M. and XU, X., 1998, Incremental clustering for mining in a data warehousing environment, Proceedings of the 24th VLDB Conference, 323-333.
HARSHA, S. G. and CHOUDHARY, A., 1999, Parallel subspace clustering for very large data sets, Technical report CPDC-TR-9906-010, Northwestern University.
HUANG, C., BI, Q., STILES, R. and HARRIS, R., 1992, Fast search equivalent encoding algorithms for image compression using vector quantization, IEEE Transactions on Image Processing, 1, 413-416.
JAIN, A. K. and DUBES, R. C., 1988, Algorithms for Clustering Data (New Jersey: Prentice Hall).
KADOUS, M. W., 1995, GRASP: recognition of Australian sign language using instrumented gloves, Honours thesis, School of Computer Science and Engineering, University of New South Wales.
LEE, W., STOLFO, S. J. and MOK, K., 1999, A data mining framework for building intrusion detection models, Proceedings of the 1999 IEEE Symposium on Security and Privacy, 120-132 (Berkeley: IEEE).
LI, X. and YE, N., 2002a, Decision tree classifiers for computer intrusion detection, Journal of Parallel and Distributed Computing Practices, in press.
LI, X. and YE, N., 2002b, Grid- and dummy-cluster-based learning of normal and intrusive clusters for computer intrusion detection, Quality and Reliability Engineering International, 18, 231-242.
MITCHELL, T., 1997, Machine Learning (New York: WCB/McGraw-Hill).
PROCOPIUC, C. M., 1997, Clustering problems and their applications (a survey), http://www.cs.duke.edu/~magda/ (Duke University).
SANDERS, M. S. and MCCORMICK, E. J., 1992, Human Factors in Engineering and Design (New York: McGraw Hill).
SHEIKHOLESLAMI, G., CHATTERJEE, S. and ZHANG, A., 1998, WaveCluster: a multi-resolution clustering approach for very large spatial databases, Proceedings of the 24th VLDB Conference, 428-439.
SINCLAIR, C., PIERCE, L. and MATZNER, S., 1999, An application of machine learning to network intrusion detection, Proceedings of the 15th Annual Computer Security Applications Conference (ACSAC '99), 371-377.
VALDES, A. and SKINNER, K., 2000, Adaptive, model-based monitoring for cyber attack detection, Proceedings of Recent Advances in Intrusion Detection (RAID) 2000.
WICKENS, C. D., LIU, Y. and GORDON, S., 1997, An Introduction to Human Factors Engineering (Boston: Addison Wesley Longman).
YE, N., LI, X. and EMRAN, S. M., 2000, Decision trees for signature recognition and state classification, Proceedings of the First IEEE SMC Information Assurance and Security Workshop, 189-194 (New York: IEEE).
YE, N. and LI, X., 2001, A scalable clustering technique for intrusion signature recognition, Proceedings of the Second IEEE SMC Information Assurance Workshop, 1-4 (New York: IEEE).
YE, N. and LI, X., 2002a, A supervised clustering and classification algorithm for computer intrusion detection, IEEE Transactions on Systems, Man, and Cybernetics, in press.
YE, N. and LI, X., 2002b, A supervised, incremental learning algorithm for classification problems, Computers and Industrial Engineering Journal, in press.
ZHANG, T., 1997, Data clustering for very large datasets plus applications, PhD thesis, Department of Computer Science, University of Wisconsin-Madison.