Using Text Mining on Curricula Vitae for Building Yellow Pages

Daniel Lichtnow 1, Stanley Loh 1,2, Luiz Carlos Ribeiro Junior 1, Gustavo Piltcher 1

1 Catholic University of Pelotas/ESIN, Rua Félix da Cunha, 412, CEP 96010-000, Pelotas/RS, Brazil. Tel: +55 (53) 284.8227. Fax: +55 (53) 225.3105. {lichtnow,loh,lcr,gustavop}@ucpel.tche.br

2 Lutheran University of Brazil, Rua Miguel Tostes, 101, CEP 92420-280, Canoas/RS, Brazil. Tel: +55 (51) 477-4000. Fax: +55 (51) 477.1313.

Abstract. This work presents a strategy to identify interest/expertise areas of professionals by analyzing their curricula vitae. A text mining technique is used to analyze the texts in the curricula and to compare keywords to concepts defined in a domain ontology. The paper shows how the structure of the curriculum may be used for that purpose. The main application of the strategy is to support the construction of Yellow Pages, helping to identify people with interest or expertise in certain areas.

1 Introduction

One of the most frequent tasks in software projects is to locate experts in a particular technology or area. This kind of problem is related to Knowledge Management (KM). KM focuses on how to store and disseminate knowledge within organizations, a concern common to all kinds of organizations, including software ones [3]. In some cases, knowledge is explicitly available in documents, in papers or electronically; in these cases, digital libraries can help in the management task. In other cases, knowledge exists only with people (it is not documented anywhere).

An alternative way to locate knowledge is to construct Yellow Pages. Yellow Pages contain information about people and their skills [3]. Yellow Pages do not describe how to fix a problem but indicate who has the knowledge to help in a task. Yellow Pages are related to People-Finder Systems [2], since they are useful to find experts in a domain. Yellow Pages may also be used for recruiting external people, or for finding people inside the organization to compose a team or to fill an internal vacancy. Besides that, Yellow Pages are important instruments to understand the expertise of a group, showing strong areas (where there are many experts) or pointing out deficits (areas where there are no experts). Yellow Pages may be constructed by analyzing knowledge available in different kinds of sources, such as personal homepages, chat rooms, e-mails and electronic documents (read or written). According to [5], "the premise here is that corporate workspaces can be used to glean expertise".

This paper presents a strategy to construct Yellow Pages by analyzing people's curricula vitae. The strategy identifies the expertise of people automatically by analyzing the contents of curricula with text mining techniques (more precisely, classification techniques). When this analysis is made by hand, the process is time-consuming and error-prone. Furthermore, the number of curricula and the amount of information inside each curriculum may be huge, making it difficult to extract expertise. Software with text mining techniques can minimize some of these difficulties. The process considers the structure of the curriculum (stored in an XML format) and the texts used to represent information within each part of the structure. The paper also presents initial experiments with an automated system applied to curricula of Computer Science professionals.

The result of the process is a list of knowledge areas related to each analyzed curriculum, including a numeric degree for each area that represents the interest level of the person in that area. This degree could be used to represent the expertise level of a person in a domain or area, that is, how much a person knows about something. However, this paper is not so ambitious; determining the expertise level demands a deeper study.

Section 2 of this paper discusses works related to Expert-Finding Systems. Section 3 describes the system and the text mining technique used in it. The experiments and evaluations are shown in Section 4. Finally, Section 5 presents concluding remarks and indicates future research directions.

2 Related Works

This section presents works related to a kind of system generally known as Expert Finders, People-Finder Systems or Expert-Finding Systems. These systems try to automatically discover who has the necessary knowledge to fix some problem or to help in a specific task. In general, for identifying experts, these systems explore implicit evidence present in e-mails, personal home pages, chat rooms, newsgroups and other information sources [5].

ContactFinder [10] is an intelligent agent that tries to identify experts based on bulletin board messages. The system looks for questions in the messages, identifies the topic area and searches for people related to this topic area. ContactFinder uses heuristic rules based on relations between areas and people.

There are systems that try to find experts by analyzing organizational and personnel document repositories. One example is the Expert/Expert-Locator (EEL, "Who Knows") [15], where the expertise of research groups is identified by analyzing technical documents produced by the groups. This system uses Latent Semantic Indexing (LSI) [16]. In turn, Expert Seeker [2] uses Term Frequency and Inverse Document Frequency (TF-IDF) of documents produced by members of an organization. The assumption is that the authors of documents in the repository are experts in the subject which the document is about. The final result of a query (a set of keywords) is a ranking of authors according to the documents that they have written.

Despite these benefits, none of the cited works uses curricula vitae for identifying expertise. The only work found in the scientific literature that analyzes curricula vitae is [7]. That work proposes a competency model based on an ontology for annotating curricula vitae. The ontology is used to describe competencies related to parts of a curriculum. However, the identification of the content inside a curriculum is made by hand.

3 Description of the Strategy

This section presents the proposed strategy for constructing Yellow Pages. The strategy was implemented in a software system that identifies interest areas and expertise degrees of people by analyzing their curricula vitae. The strategy applies text mining techniques to the textual content of documents representing curricula vitae. The software does not manage a Yellow Page but only indicates areas related to people. A Yellow Pages system should have complementary information about who each person is and how to find her/him. The scope of this work is only the identification of interest areas for each person.

The documents used by the system are structured in XML files, following the Lattes format [11]. The Lattes format is used by CNPq, an entity of the Brazilian government for scientific and technological development, for gathering the curricula vitae of Brazilian researchers. There is a specific DTD for documents in the Lattes format. The document is structured in sections. There are sections about personal data, academic history, publications, past jobs, work experiences, extracurricular activities, etc. Each section may be divided into sub-sections. For example, publications contain a subsection "papers published in journals", and this subsection may contain many elements.

The first step is to extract keywords related to each part of the document. The vector space model is used to represent text. For each part of the document there is a term vector, containing the words that appear in the textual information of the corresponding part. Stopwords are removed and do not appear in the vectors. A value is associated to each word in the vector, representing the relative frequency of the word in the document part. The relative frequency is calculated by dividing the number of occurrences of the word in the text being considered by the total number of words in the same text. A special vector is used to represent the whole document; in this case, the relative frequency of the words is computed over the whole text, without distinguishing the part where they appear.
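As an illustration, the following Python sketch builds the term vector for one part of a curriculum using relative term frequency, as described above. The stopword list and the sample text are placeholders for illustration only; the actual system works on the text extracted from each part of the Lattes XML.

```python
import re
from collections import Counter

# Minimal stopword list used only for illustration; the real system would use
# a full stopword list for the language of the curricula.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "for", "to", "on", "with"}

def term_vector(text):
    """Return {term: relative frequency} for one part of the curriculum.

    The relative frequency of a term is its number of occurrences divided by
    the total number of words in the text; stopwords do not appear in the vector.
    """
    words = re.findall(r"\w+", text.lower())
    total = len(words)                                   # total words in the text
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {w: c / total for w, c in counts.items()} if total else {}

# Hypothetical text taken from one part of a curriculum vitae.
part_text = "Clustering of web documents using ontologies and text mining"
print(term_vector(part_text))
```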

3.1 Identifying Areas in the Curriculum Vitae

The text mining technique used for identifying areas performs a classification task. The classification consists in comparing the text in the document to a set of concepts previously defined in a domain ontology. A domain ontology is a description of things that exist or can exist in a domain [14] and describes the vocabulary related to this domain [6]. In this work, the ontology was implemented as a set of concepts, each one represented by a vector of terms and weights. The relationship between concepts and terms is many-to-many.

The ontology is created by a semi-automatic process. Humans determine the relevant concepts, and software tools are used to find the terms related to each concept and their weights. Using a supervised learning strategy, human experts select scientific papers about each concept in the ontology and software tools extract a centroid for each concept (a centroid is a vector with the average characteristics of the sub-set of papers). The weight associated to each term in the centroid was determined using the TF-IDF method [13]. The weight of a term in the ontology (in the vector of a concept) represents the probability that the term is present in a text about that concept, or, equivalently, that a text containing that term has a certain probability of belonging to that concept.

Human intervention is important to solve ambiguities and to minimize errors. An important step is to examine samples of false hits (texts assigned to wrong concepts), looking for terms or weights that lead to errors, in order to refine the concept vectors. Some special actions are: (i) to decrease high weights of generic terms (that do not discriminate concepts), especially those appearing in many concepts; (ii) to include word variations (gender and number) and synonyms that were not found by the automatic step; and (iii) to balance the weights of a word that appears both in a father concept (more generic) and in a son concept (more specific). A software tool normalizes the weights of the terms in all concept vectors (the highest weights have to be at the same level). Software tools may bring great benefits, but the final decision about the ontology must be the responsibility of human experts.

The classification process compares the vectors representing the curriculum vitae with the vectors representing the concepts of the domain ontology. Only single words are used in the vectors. Although pairs of terms can give better results, using only pairs brings poor results, while single words alone are relatively successful [1], with the additional benefit of reducing the time for computing classification and learning class definitions. The comparison between the vectors is done through a reasoning process, where the weights of common terms (those present in both the text and the concept) are multiplied. The overall sum of these products, limited to 1, is the degree of relation between the text and the concept, meaning the level of relationship between the curriculum and the domain area. The rationale behind this method is that each word of a concept contributes with a certain strength to the presence of that concept. Strong indicators may receive higher weights in the concept vector. This is similar to the relevancy index proposed in [12], whose definition is "a collection of features that, together, reliably predict a relevant event description". The approach falls under the statistical paradigm according to the classification of [9], since it analyzes word frequencies and probabilities.
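A minimal sketch of this comparison is shown below: degree(text, concept) = min(1, sum over common terms t of freq_text(t) * weight_concept(t)). The concept names and weights are toy values standing in for the TF-IDF centroids learned by the system; the text vector is a hypothetical relative-frequency vector such as the one produced by the earlier sketch.

```python
# Toy concept vectors standing in for the ontology; in the real system the
# weights come from TF-IDF centroids refined by human experts.
ONTOLOGY = {
    "Information Retrieval": {"text": 0.7, "mining": 0.8, "documents": 0.5, "ontologies": 0.4},
    "Computer Networks":     {"protocol": 0.9, "network": 0.8, "routing": 0.6},
}

def degree_of_relation(text_vector, concept_vector):
    """Sum of products of the weights of common terms, limited to 1."""
    common = set(text_vector) & set(concept_vector)
    return min(1.0, sum(text_vector[t] * concept_vector[t] for t in common))

def classify(text_vector, ontology=ONTOLOGY):
    """Return (concept, degree) pairs for one text, best concept first."""
    degrees = [(c, degree_of_relation(text_vector, v)) for c, v in ontology.items()]
    return sorted(degrees, key=lambda pair: pair[1], reverse=True)

# Hypothetical relative-frequency vector for one curriculum part.
cv_vector = {"clustering": 0.111, "web": 0.111, "documents": 0.111,
             "ontologies": 0.111, "text": 0.111, "mining": 0.111}
print(classify(cv_vector))
```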

3.2 The Use of the Text Mining Technique in XML Files

The implemented strategy may or may not consider the structure of the XML files generated in the Lattes format. If the structure is not considered, a single term vector is used to represent the text of the curriculum. In this case, the result of the process is a list of areas and degrees related to the whole curriculum.

On the other hand, the structure of the curriculum vitae may be used to better identify the interest areas. Some parts of the curriculum may have more information or may be more suited to represent the interest areas of a person. For example, publications may indicate the expertise areas better (more precisely) than extracurricular courses (which may be complementary to the main expertise). In this case, each part of the curriculum has a term vector associated to it. In the implemented system, the DTD of the Lattes format has a weight associated to each part. The weights may be set by a human administrator, who indicates which elements or attributes are more important, or may be generated by automatic steps (as explained in the next section). The weight is a numeric value ranging from 0 to 1, where 1 represents the greatest level of importance.
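The paper does not spell out how the part-level results are merged, but one plausible reading is that each part is classified separately and the part degrees are combined using the part weights. The sketch below follows that reading; the XML tag names and weight values are illustrative, not the real Lattes DTD element names, and the term_vector/classify helpers are the hypothetical functions sketched above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Illustrative part weights (0..1); in the system they are attached to the DTD
# and may be set by an administrator or derived automatically.
PART_WEIGHTS = {
    "GENERAL-DATA": 0.9,
    "BIBLIOGRAPHIC-PRODUCTION": 0.9,
    "COMPLEMENTARY-DATA": 0.6,
}

def part_texts(xml_string):
    """Yield (part_tag, concatenated text) for each top-level part of the CV."""
    root = ET.fromstring(xml_string)
    for part in root:
        text = " ".join(t.strip() for t in part.itertext() if t.strip())
        yield part.tag, text

def classify_curriculum(xml_string, term_vector, classify, weights=PART_WEIGHTS):
    """Combine part-level degrees into curriculum-level degrees.

    A weighted sum is used here as one possible interpretation of the weighted
    analysis described in the paper; parts without a weight are ignored.
    """
    combined = defaultdict(float)
    for tag, text in part_texts(xml_string):
        w = weights.get(tag, 0.0)
        for concept, degree in classify(term_vector(text)):
            combined[concept] += w * degree
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)
```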

4 Experiments

Experiments were carried out to evaluate the strategy. All the experiments used a domain ontology for Computer Science. This ontology was created by the semi-automatic process described earlier. Experts in the area constructed the hierarchy of concepts following the ACM classification; however, the final result differs from the ACM classification. After that, a software tool was used to identify the terms and weights representing each concept. Approximately 100 training texts were used for each concept in this supervised learning step.

The first experiment applied the strategy to texts from scientific papers. The goal was to investigate how the strategy works with formal and well-formed Computer Science texts. 15 papers from CiteSeer were selected by hand [4]. The abstracts of the papers, and individual sentences from the abstracts, were used to represent short texts like the ones that appear in a curriculum. Then the implemented system tried to identify the main topic of each text according to the method described earlier. The result is considered correct when the concept identified by the method is equal to the concept related to the document (the concept of a document corresponds to the directory in CiteSeer where the document is located). The results presented in Table 1 show that the method performs better with longer texts.

Table 1. Results with scientific texts

Part of the text          Correct areas (%)
Abstracts                 91.66%
Sentences of abstracts    60.97%

The second experiment evaluates the performance of the method on parts of the curriculum. The goal is to find out which parts best represent the interest areas of people. Five groups of the curriculum in the Lattes format were selected for testing:

1. General Data contains identification data (name, address, e-mail), current and past jobs, academic history (undergraduate and postgraduate courses) and data about research projects;
2. Bibliographic Production contains information about published papers, books and chapters;
3. Technical Production contains information about produced software, equipment, medicine products and intellectual property;
4. Other Production contains data about the supervision of undergraduate and postgraduate works;
5. Complementary Data contains data about extracurricular formation (extra courses, participation in events and examination boards).

Table 2. Results with parts of the curriculum vitae

Group                        Correct areas (%)
1 General Data               88.2%
2 Bibliographic Production   88.2%
3 Technical Production       82.3%
4 Other Production           88.2%
5 Complementary Data         58.8%

In this second experiment, 17 curricula vitae of Computer Science teachers from the Catholic University of Pelotas were used. Each teacher indicated his/her main interest/expertise area. That area was associated to the teacher's curriculum and was used as the correct concept to be identified by the text mining technique. Table 2 shows the results. Note that groups 1, 2 and 4 yielded the best performances, and the worst case is the group representing the Complementary Data. One reason for this worst case is that this part of the curriculum vitae generally does not have information about the main activity of people; therefore, it is not suited for pointing out the expertise of a person. Regarding groups 1 to 4, we can say that it is possible to use one of these parts of the curriculum individually to identify expertise, because the performance was similar to the results with scientific texts (first experiment).

The third experiment evaluates the performance of the expertise identification method considering the curriculum as a whole (using the same 17 curricula of the previous experiment). In this case, 3 different strategies were used:

− Strategy 1 - Analysis of the curriculum vitae without considering the structure of the XML file. The curriculum was analyzed as a regular textual document (only one text, without subdivisions);
− Strategy 2 - Analysis of the curriculum vitae considering only the more important parts of the curriculum (groups 1 to 4). These parts were selected according to the results of the second experiment. Strategy 2 does not distinguish the importance of the parts;
− Strategy 3 - Weighted analysis of the curriculum vitae, considering the degree of importance of each part of the curriculum. Each part of the curriculum received a weight (a numeric value ranging from 0 to 1). This degree of importance is equal to the precision obtained for the part in the second experiment. For example, the value attributed to group 1 in the DTD was 0.882, and group 3 received the value 0.823.

Table 3 shows the results of the third experiment. The worst case was strategy 1 (precision of 58.8%), while strategies 2 and 3 performed similarly (88.2%).

One of the conclusions is that using only the more important parts of the curriculum (strategy 2) may be enough to identify the main interest/expertise area of a person. Another finding is that the weighted analysis (strategy 3) performs better than analyzing all parts with the same weight (strategy 1).

Table 3. Results with the whole curriculum

Strategy    Correct areas (%)
1           58.8%
2           88.2%
3           88.2%

Comparing these results with the ones from the first experiment, it is possible to say that the method achieves as good a performance with the whole curriculum as with texts extracted from scientific papers. Another interesting finding is that curricula vitae may be used for identifying expertise or interest areas of people with the proposed strategy, since the performance was similar on curricula and on scientific texts. A final remark is that using only some parts of the curriculum may be enough, since the performance of some individual parts equals the best performance obtained with the whole curriculum (88.2%). This may be useful to save time when analyzing large volumes of curricula vitae.

5 Concluding Remarks

This work presented a strategy to identify interest or expertise areas of professionals by analyzing their curricula vitae. The text mining technique used performed as well with texts from the curricula vitae as with scientific texts, demonstrating that texts from the curricula can point out interest/expertise areas. The paper showed that the structure of the curriculum vitae may be used in this process. Experiments revealed that some parts are more important than others for this purpose. Experiments also raised the possibility of using only one part of the curricula, since the performance of some parts individually was equal to the best one.

The strategy may be used to help the construction of Yellow Pages. Another interesting application of the strategy is the creation of an initial profile for users of recommender systems. Recommenders that use the collaborative filtering technique may suffer from the cold start problem, which happens when there is no information about the user and the system cannot generate recommendations [8]. The strategy proposed in this paper may be used to identify interest areas of the users so that the system can use this information as an initial profile.

A future work will be to extend the strategy to find not only the main interest area but also other, minor interest areas. In this direction, another necessary work is to define a scheme of points (a kind of reputation system) to identify precisely the degree of expertise (how much a person knows about an area) of each person, based on the information available in the curriculum vitae.

Acknowledgements

This work is partially supported by CNPq, an entity of the Brazilian government for scientific and technological development, and by FAPERGS, the Rio Grande do Sul state entity for scientific support.

References

1. Apté, C., Damerau, F., Weiss, S. M.: Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems, Vol. 12, n. 3 (1994) 233-251
2. Becerra-Fernandez, I.: Facilitating the Online Search of Experts at NASA using Expert Seeker People-Finder. In: PAKM 2000, Vol. 34 (2000)
3. Davenport, T. H., Prusak, L.: Working Knowledge - How Organizations Manage What They Know. Harvard Business School Press, Cambridge, MA (1998)
4. Directory-CiteSeer: Computer Science Directory - CiteSeer. Available at: http://citeseer.ist.psu.edu/directory.html (2005)
5. D'Amore, R.: Expertise Tracking. In: Proceedings of the 2005 International Conference on Intelligence Analysis, McLean, VA (2005)
6. Guarino, N.: Formal Ontology and Information Systems. In: International Conference on Formal Ontologies in Information Systems - FOIS'98, Italy (1998) 3-15
7. Harzallah, M., Leclère, M., Trichet, F.: Knowledge Engineering Tools and Techniques: CommOnCV: Modelling the Competencies Underlying a Curriculum Vitae. In: Proceedings of the 14th International Conference on Software Engineering and Knowledge Engineering - SEKE '02, July (2002) 65-81
8. Herlocker, J., Konstan, J. A., Terveen, L. G., Riedl, J. T.: Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems, Vol. 22, n. 1, January (2004) 5-53
9. Knight, K.: Mining Online Text. Communications of the ACM, Vol. 42, n. 11 (1999) 58-61
10. Krulwich, B., Burkey, C.: The ContactFinder Agent: Answering Bulletin Board Questions with Referrals. In: Proceedings of the 1996 National Conference on Artificial Intelligence (AAAI-96), Vol. 1, Portland, OR (1996) 10-15
11. Lattes-CNPq: Lattes Platform - National Council for Scientific and Technological Development. Available at: http://lattes.cnpq.br (2005)
12. Riloff, E., Lehnert, W.: Information Extraction as a Basis for High-Precision Text Classification. ACM Transactions on Information Systems, Vol. 12, n. 3 (1994) 296-333
13. Salton, G., McGill, M. J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
14. Sowa, J. F.: Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks/Cole Publishing Co., Pacific Grove, CA (2000)
15. Streeter, L. A., Lochbaum, K. E.: Who Knows: A System Based on Automatic Representation of Semantic Structure. In: RIAO'88, Cambridge, MA (1988) 380-388
16. Yiman-Seid, D., Kobsa, A.: Expert Finding Systems for Organizations: Problem and Domain Analysis and the DEMOIR Approach. Journal of Organizational Computing and Electronic Commerce, Vol. 13, n. 1 (2003) 1-24