Textmining, Feature Selection and Datamining for Protein Classification
Faouzi MHAMDI
Mourad ELLOUMI
Ricco RAKOTOMALALA
Faculté des Sciences de Tunis, Tunisie
Faculté des Sciences Economiques et de Gestion de Tunis, Tunisie
Université Lumière Lyon 2, France
Abstract
In this paper, we present an approach to protein classification that follows the knowledge extraction process. The approach is based on Textmining and Datamining techniques. Our work proceeds in three phases: (1) the extraction of N-Grams from a file that groups protein families; (2) the selection of the N-Grams that best discriminate between the families in the file; (3) the application of several Datamining techniques to the classification task, in order to determine the best classifier. We ran the experiments on protein files, each containing two families of proteins. The average length of the protein sequences is 1000 characters.
1. Introduction
The enormous growth of biological data has obliged biologists to look for solutions other than “in vitro” bench work for manipulating and processing information. Computer science has proved highly efficient in the processing of biological data: we speak of “in silico” biology[18,23]. One of the problems addressed by computer tools is the classification of biological sequences, both proteins and nucleic acids (DNA, mRNA, etc.). Many classification works have been carried out. These works are based on different approaches and on the different structures of proteins[3,4,5]. What interests us is the classification of proteins based on their primary structures[4]. This choice is due to the availability of data banks such as Swissprot and TrEMBL[b ], and to the simplicity of manipulating them. This type of classification is based essentially on the comparison of sequences, and the unit of these comparisons is the “pattern”. Patterns are at the centre of the processing of biological data. This is why, alongside Genbank[18], EMBL[18], Swissprot[18], PDB[18], etc., there exist pattern banks such as Prosite[18]. All these banks are published on the Internet.
For biological sequences, processing these patterns makes it easier to discover hidden information. There are two main classes of pattern processing: pattern search[1,6,24] and pattern extraction[6]. This processing raises many problems. There are sub-problems, such as the search for the most frequent pattern or the search for repetitions within a sequence, as well as the two major problems of pattern search and pattern extraction. The pattern search problem consists in locating all the occurrences of a given pattern X, of size N, in a sequence of size M or within a set of sequences. Many algorithms of differing complexity address this problem, such as the naïve algorithm, KMP[1,2], Karp-Rabin[24], etc. Pattern extraction consists in discovering all the possible patterns of size n within a sequence S of size m. Many algorithms address this problem, such as the naïve algorithms[24], KMR, etc.
There also exist many software packages known as bioinformatics tools, for example MEME and N-Gram extraction tools. These packages are published on the Internet and are available to the scientific community for pattern extraction. They generally take as input a sequence, a file containing a set of sequences, or a bank of sequences, presented in a well-defined format (the most widely used is FASTA[18]), and produce as output a certain number of patterns.
In this work we try to define a process of protein classification based on these patterns. Our work is composed of three parts. The first develops the problem of N-Gram extraction[7,8,9,10,11]. The second deals with feature selection[12,13,14]. The last concerns classification based on the selected N-Grams.
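The pattern search problem stated in the introduction (locating every occurrence of a pattern X of size N in a sequence of size M) can be illustrated with the naïve algorithm. The following is only a minimal sketch: the function name `find_occurrences` and the sample sequence are illustrative assumptions, not taken from the cited works, and positions are reported with 0-based indices.

```python
def find_occurrences(pattern: str, sequence: str) -> list[int]:
    """Naive pattern search: try every start position of the sequence.

    Returns the 0-based start index of each occurrence (overlaps included).
    The cost is O(N * M) character comparisons in the worst case.
    """
    n, m = len(pattern), len(sequence)
    if n == 0 or n > m:
        return []
    positions = []
    for start in range(m - n + 1):
        # Compare the window of size n starting at `start` with the pattern.
        if sequence[start:start + n] == pattern:
            positions.append(start)
    return positions


if __name__ == "__main__":
    # Hypothetical sequence used only for illustration.
    print(find_occurrences("LA", "ARNZGILAAVLA"))
```

Algorithms such as KMP or Karp-Rabin solve the same problem with better worst-case or average behaviour; the naïve version is shown here only because it makes the problem statement concrete.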
2. Datamining and classification
2.1. Process of datamining
Datamining[19,20] is also known as Knowledge Discovery and Datamining (KDD). It is a process composed of three phases. The first is the preparation of the data. The second is the application of one or more techniques, such as decision trees[19,20,21], neural networks[19,20,22], genetic algorithms[19,20], association rules[19,20], etc. The third phase consists in validating the knowledge obtained. Note that this process requires the intervention of a domain expert, especially in the first and third phases. The tasks carried out by this process are numerous; they include prediction, estimation, grouping, optimisation and classification[19,20].
2.2. The classification
Classification is an important Datamining task[15,16]. In biology, it is applied mostly to proteins[3,4,5]. It is a process that involves several steps and requires several kinds of data to be combined. First, we must define a set of training individuals and a separate test set, and specify the class (the family) of each individual of the training set; this is supervised learning. Second, we must define the variables of the classification. Third, we must apply one or more classification techniques to the training set and, finally, validate the model on the test set.
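The steps above (labelled training set, separate test set, a classification technique, validation on the test set) can be sketched as follows. This is an illustration under stated assumptions: the 1-nearest-neighbour rule is a stand-in chosen for brevity, not one of the techniques evaluated in this paper, and the names `split_train_test`, `predict_1nn` and `accuracy` are hypothetical.

```python
import random


def split_train_test(individuals, test_ratio=0.3, seed=0):
    """Shuffle (features, label) pairs and split them into train and test sets."""
    rng = random.Random(seed)
    shuffled = individuals[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train, features):
    """Return the label of the training individual closest to `features`."""
    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    nearest = min(train, key=lambda item: squared_distance(item[0], features))
    return nearest[1]


def accuracy(train, test):
    """Fraction of test individuals whose predicted label matches the true one."""
    hits = sum(predict_1nn(train, features) == label for features, label in test)
    return hits / len(test)


if __name__ == "__main__":
    # Hypothetical individuals: two variables each, two families "A" and "B".
    data = [([0, 0], "A"), ([0, 1], "A"), ([1, 0], "A"),
            ([9, 9], "B"), ([9, 8], "B"), ([8, 9], "B")]
    train, test = split_train_test(data)
    print(accuracy(train, test))
```

In the setting of this paper, the variables would be N-Gram statistics and the labels would be protein families.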
3. Textmining and extraction of N-Grams
3.1. Textmining
We can consider Textmining[7,8,9,10] as the “new-born” of Datamining. The two share the same process. Datamining is a process of knowledge discovery from data in general, whereas Textmining specialises in textual data. As in every knowledge discovery process, we must go through the data preparation phase; we can then start the learning phase and apply Datamining techniques. The preparation of textual data consists in defining an approach to text representation. A text has many forms of representation: we can consider it as a set of words, a set of lemmas or a set of N-Grams[7,8,9,10].
The alphabet of the text depends on the type of sequence. For proteins, it is the set of amino acid codes: the 20 standard amino acids, together with the additional codes B, Z and U, that is {A, R, N, D, B, C, Q, E, Z, G, H, I, L, K, M, F, P, U, S, T, W, Y, V}. For DNA, we have four characters {A, C, G, T}, and for mRNA we have four characters {A, C, G, U}[23]. The term N-Gram in Textmining corresponds to “pattern” in biology[18]. Splitting a sequence into N-Grams allows us to establish statistics that reveal new knowledge. Thus, from a set of sequences, we can define one or more patterns that discriminate one family from another. In our work, the following phases are based on two main factors: the N-Grams and the statistical studies devoted to them. These studies take into account the number of occurrences of each N-Gram, the size of the N-Gram, the presence or absence of an N-Gram within a sequence or within a set of sequences, etc. A third factor also comes into play: the biological meaning of the processing and of the results. It is a very sensitive factor because biological data are dynamic; mutations and substitutions occur all the time. At this point, we find it useful to give an example that illustrates the extraction of N-Grams[7,8,9,10,16].
Let S be a sequence of size m and let N be the size of the N-Grams:
S = A R N Z G I L A A V Y W U T S M L K F R S C ...
If N = 3, the possible 3-Grams are: {ARN, RNZ, NZG, ZGI, GIL, ILA, LAA, AAV, AVY, …}
3.2. The N-Grams
An N-Gram is a sequence of n characters [Cavnar and Trenkle, 1994]. For any text, the set of N-Grams that can be generated is obtained by moving a window of n positions over the body of the text. The window moves one step at a time, each step corresponding to one character. At each step, a snapshot of the window is taken; the set of snapshots obtained constitutes the set of all the N-Grams that can be generated [Miller et al., 1999]. N-Grams are an efficient tool for protein classification, the object of our work.
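The sliding-window description above translates directly into a few lines of code. The sketch below is illustrative only; the function name `extract_ngrams` is an assumption, and the sample sequence is the beginning of the example sequence given in Section 3.1.

```python
def extract_ngrams(sequence: str, n: int) -> list[str]:
    """Slide a window of n characters over the sequence, one character per step.

    Each window position yields one N-Gram, so a sequence of size m gives
    m - n + 1 N-Grams (none if the sequence is shorter than n).
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


if __name__ == "__main__":
    print(extract_ngrams("ARNZGILAAVY", 3))
```

Duplicates are kept on purpose: the number of occurrences of each N-Gram is one of the statistics used in the following phases.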
3.3. Textmining and extraction of N-Grams in biology
Let us now see how these techniques can be applied in biology. First, we must prepare our corpus, which corresponds to a file consisting of a text composed of a set of sequences (proteins, DNA, mRNA, etc.). Each sequence is presented on a separate line of the text. These sequences represent the primary structures of these data.
Figure 1. N-Grams extraction
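As an illustration of the corpus described in Section 3.3 (one sequence per line), the sketch below reads such a text and builds an occurrence table with one row per sequence and one column per distinct N-Gram. The names `read_sequences` and `build_occurrence_table`, and the toy corpus, are assumptions made for this example; the actual file format used in the experiments may differ.

```python
from collections import Counter


def read_sequences(corpus_text: str) -> list[str]:
    """Return the non-empty lines of the corpus, one sequence per line."""
    return [line.strip() for line in corpus_text.splitlines() if line.strip()]


def build_occurrence_table(sequences, n):
    """Count the N-Grams of each sequence.

    Returns (vocabulary, table) where vocabulary is the sorted list of
    distinct N-Grams and table[i][j] is the number of occurrences of
    vocabulary[j] in sequences[i].
    """
    counts_per_sequence = [
        Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
        for seq in sequences
    ]
    vocabulary = sorted(set().union(*counts_per_sequence))
    table = [[counts[gram] for gram in vocabulary] for counts in counts_per_sequence]
    return vocabulary, table


if __name__ == "__main__":
    # Toy corpus with two hypothetical sequences.
    vocabulary, table = build_occurrence_table(read_sequences("AAAR\nRNZ\n"), 2)
    print(vocabulary)
    print(table)
```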
4. Process of biological sequence classification
[Figure 2 (flowchart). Box labels: Proteins File; N-Grams Extraction; Distinct N-Grams; Features selection; Learning File; Best N-Grams; Classification. Outputs: Best Size of N; Best Classifier; Best N-Grams.]
Figure 2. Process of proteins classification
In this process, we have tried to follow the knowledge extraction process. We can distinguish its three main phases. The preparation of data is present in the assembling of protein sequences into files, in the extraction of N-Grams and in feature selection. The second phase is the learning phase, which is carried out during classification. Finally, the third phase covers the evaluation of the classifiers and the validation of the classification results, which a biologist will carry out at the final step.

Table 1. Extraction of N-Grams for varying N

N    All       Distinct   Distinct/All   File size (Byte)
1    249       20         0.080          -
2    24 962    400        0.016          70
3    59 197    7 145      0.120          1 243
4    65 367    31 595     0.480          5 524
5    66 125    42 040     0.635          7 391
6    66 189    45 206     0.682          7 992
7    66 135    46 954     0.709          8 347
8    66 136    48 226     0.729          8 667
[Plot: Variance of N-Grams. Vertical axis: N-Grams number (0 to 80000). Horizontal axis: N-Grams size (1 to 8). Series: All N-Grams, Distinct N-Grams.]
Figure 3. Relationship between N-Gram size and number
Table 1 and Figure 3 show the very large number of N-Grams obtained from a single file of 87 protein sequences. From N-Gram size four onwards, the enormous number of distinct N-Grams makes classification difficult, so a feature selection step is necessary. Before beginning this step, however, we must prepare the variables by subjecting them to a filtering phase. Several strategies can be applied to carry out this filtering.
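The "All" and "Distinct" quantities of Table 1 can be illustrated with a short sketch. The exact counting convention behind Table 1 is not stated in the text, so the code below assumes the simplest one (every window position of every sequence counts as one N-Gram) and is not expected to reproduce the table's values; the name `ngram_statistics` and the toy sequences are hypothetical.

```python
def ngram_statistics(sequences, n):
    """Return (all_count, distinct_count, distinct/all ratio) for N-Grams of size n."""
    all_count = 0
    distinct = set()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            all_count += 1              # one N-Gram per window position
            distinct.add(seq[i:i + n])  # the set keeps each N-Gram only once
    ratio = len(distinct) / all_count if all_count else 0.0
    return all_count, len(distinct), ratio


if __name__ == "__main__":
    toy_sequences = ["AAAR", "RNZ"]  # hypothetical data
    for n in range(1, 4):
        print(n, ngram_statistics(toy_sequences, n))
```

A ratio close to 1 means that almost every N-Gram is unique, which is the situation that motivates feature selection for larger N.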
5. Feature selection
The aim of this phase is to select the best variables in order to obtain the best classification. This task also seeks to minimise the computation time. There are two main classes of variable selection methods: filter methods and wrapper methods[12,13,14]. With the first, the selection is done independently of the classification quality of the variables; it is based on statistical calculations (e.g. correlation). With the second, each variable is subjected to a test that measures its classification performance, and the variables with a low classification performance are eliminated.
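To make the notion of a filter method concrete, the sketch below scores each variable by the absolute Pearson correlation between its column and a numeric class label, without training any classifier. Correlation is used only because the text mentions it as an example; the function names, the 0/1 label encoding and the toy table are assumptions, and this is not the specific filter applied in Section 5.1.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(table, labels):
    """Rank variable indices by |correlation with the labels|, best first."""
    n_vars = len(table[0])
    scores = [abs(pearson([row[j] for row in table], labels)) for j in range(n_vars)]
    ranking = sorted(range(n_vars), key=lambda j: scores[j], reverse=True)
    return ranking, scores


if __name__ == "__main__":
    # Hypothetical table: 4 sequences (rows), 3 variables (columns).
    table = [[1, 0, 5], [1, 1, 5], [0, 0, 5], [0, 1, 5]]
    labels = [1, 1, 0, 0]  # family encoded as 0 or 1
    print(rank_by_correlation(table, labels))
```

A wrapper method would replace the correlation score with the performance of a classifier trained on the candidate variables, which is more costly but takes classification quality into account.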
5.1. Feature selection with Filter methods
During the selection of N-Grams, we applied both of these methods. First, we carried out a filtering phase based on the total number of occurrences of each N-Gram in the whole sequence file. This strategy consists in eliminating the variables whose percentage of presence is less than or equal to a threshold x%, where x is a constant that we varied between 5 and 25 (Table 2). The best value of x is retained. This gives the following algorithm.
Algorithm Filtration
    i, j, k : integer
    Ocr[1..M][1..N] : occurrence table (M sequences, N distinct N-Grams)
    x : constant (threshold, in percent)
    for j from 1 to N do
        k = 0
        for i from 1 to M do
            if Ocr[i][j] = 0 then
                k = k + 1
            End if
        Loop
        if k >= (1 - x/100) * M then
            Eliminate ( NGram j )
        End if
    Loop
End Algorithm
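The filtering rule described in the text (eliminate every N-Gram present in at most x% of the sequences) can be sketched in a few lines. This is a minimal illustration, assuming that the occurrence table has one row per sequence and one column per N-Gram; the name `filter_ngrams` and the toy data are hypothetical.

```python
def filter_ngrams(vocabulary, table, x_percent):
    """Keep the N-Grams present in more than x_percent of the sequences.

    table[i][j] is the number of occurrences of vocabulary[j] in sequence i.
    An N-Gram whose percentage of presence is <= x_percent is eliminated.
    """
    n_sequences = len(table)
    if n_sequences == 0:
        return []
    kept = []
    for j, gram in enumerate(vocabulary):
        # Number of sequences in which the N-Gram appears at least once.
        present = sum(1 for row in table if row[j] > 0)
        presence_percent = 100.0 * present / n_sequences
        if presence_percent > x_percent:
            kept.append(gram)
    return kept


if __name__ == "__main__":
    vocabulary = ["AA", "AR", "RN"]                    # hypothetical N-Grams
    table = [[2, 1, 0], [1, 0, 0], [3, 0, 1], [1, 0, 0]]
    for x in (5, 25):
        print(x, filter_ngrams(vocabulary, table, x))
```

Raising x removes more variables, so the retained value of x trades the size of the learning file against the risk of discarding discriminating N-Grams.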
This technique has a major drawback: the elimination of variables is not based on their classification performance. This statistical study is presented in Table 2.
Table 2. N-Gram filtration N
All N-Grams
Distinct N-Grams
Filter