VML: A View Modeling Language for Computational Knowledge Discovery

Hideo Bannai1, Yoshinori Tamada2, Osamu Maruyama3, and Satoru Miyano1

1 Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639 Japan. {bannai,miyano}@ims.u-tokyo.ac.jp
2 Department of Mathematical Sciences, Tokai University, 1117 Kitakaname, Hiratuka-shi, Kanagawa 259-1292, Japan. [email protected]
3 Faculty of Mathematics, Kyushu University, Kyushu University 36, Fukuoka, 812-8581, Japan. [email protected]
Abstract. We design a programming language called VML (View Modeling Language), which provides facilities for speeding up the trial-and-error cycle that frequently appears in knowledge discovery processes. The language is based on the concept of a view, a generalized form of attribute, which has been shown to constitute a fundamental building block of the knowledge discovery process. The trial-and-error cycle can be formulated as the manipulation and redesign of views. Since a view is defined as a named function, we design VML as an extension of the functional language ML, so that views can be modeled and manipulated easily. The two main extensions are the keywords ‘view’ and ‘vmatch’, which are used for defining and decomposing views. We describe, as VML programs, successful knowledge discovery tasks which we have actually carried out, and show that computational knowledge discovery experiments can be developed and conducted efficiently using this language.
1 Introduction
In [5], Fayyad et al. presented the general flow and components which comprise the KDD process. The KDD process can generally be divided into several stages: data preparation (selection, preprocessing, transformation), data mining, hypothesis interpretation/evaluation, and knowledge consolidation. A typical process does not simply pass through these steps in one direction, but involves many feedback loops, due to the trial-and-error nature of knowledge discovery [2]. Most research concerning KDD focuses on only a single stage of the process, such as the development of efficient algorithms for a specific problem (the data mining stage). On the other hand, little work has been done concerning the trial-and-error cycle (or feedback stage) which is inherent in any KDD process. The importance of the ease of human intervention has been stressed [9], and a generic solution for it is still needed.
The purpose of this paper is to design a programming language to speed up this trial-and-error cycle of the KDD process. To clarify the criteria for such a language, we turn to the concept of views. Views have been introduced and recognized as a fundamental concept which frequently appears in knowledge discovery [12, 13, 11, 1]. A view is a generalized form of attribute, a specific way of looking at or understanding given objects, and is essentially a function over a set of objects, returning a value which represents some aspect of each object. Each step in the knowledge discovery process can be described as the design, generation, application, or evaluation of such views [1]. Also, in [11], the function of many existing knowledge discovery systems is depicted as the search for good views and view designs which enable the human expert to explain the data. Hence, the first criterion for our language is that it should provide facilities for the easy manipulation and modeling of views. To accomplish this, we take a functional language as the basis of our language.

Second, although a view is no different from a normal function in its behavior, a view must somehow carry information about itself. Any value returned by a view should know its origin, in other words, how the value was calculated. This is because in the knowledge discovery process, there is a need for understanding the meaning of a given value. Obtaining a black-box function, even if it can describe the data very well, is not of much use: we must understand exactly what the function is doing to the data. For example, the value pair ‘175, 73’ would not tell us much about the data, even if we know it is valuable for detecting risks of a certain disease. We would like to know that the values were obtained by applying a “height” view and a “weight” view to a certain person, returning the height in centimeters and the weight in kilograms of the person. In [2], Cheeseman et al. show three examples of computational knowledge discovery using the AutoClass system. For each of the examples, the interaction between the human expert and the system is stressed as playing an important role in the discovery of meaningful results. They also discuss their troublesome encounters with undocumented preprocessing and/or sampling bias which had already been applied to the data given to them. These data transformations are, in essence, applications of views to the data, and need to be remembered somewhere.

From the points mentioned above, we consider extending the syntax and semantics of ML [15], an excellent functional language with a type inference system. The most important extensions are the two keywords ‘view’ and ‘vmatch’. ‘view’ is used to bind a function to a name, as well as to instruct the program to remember any value resulting from the function: the history of its application, or the view design, is remembered. ‘vmatch’ is a keyword for pattern matching on view designs, enabling the extraction of views which are parts of a compound view. Any value which is “remembered” can be matched and decomposed by the ‘vmatch’ statement. We will show through actual computational knowledge discovery tasks that, using these keywords, a programmer can design and manipulate views with ease.
In Section 2, we briefly describe related work. In Section 3, we describe the basic concepts of views. The syntax and semantics of our extensions are given in Section 4. We describe, using VML, several actual discovery tasks we have conducted in Section 5. Finally, we conclude the paper with Section 6.
2 Related Work
KEPLER [18] is a knowledge discovery system with a pluggable architecture, which focuses on the extensibility of a discovery system. CLEMENTINE [8] is a successful commercial application, which focuses on human intervention in view design. Our work differs in that it tries to give a solution at a more generic level. Concerning the “remembering”, a similar idea may be found in programming languages which offer facilities for reflection, or introspection, which let a program look inside itself. For example, some dialects of LISP provide a function called get-lambda-expression, which returns the actual lambda expression of a given closure. These facilities are of course useful, but can return too much information concerning the function (e.g. the source code for a complicated algorithm). The idea of our work is to limit this information by regarding views as the smallest unit of representation.
3 Background
Here, we outline the definitions given in [1]. An entity set E is a set of objects which may be distinguished from one another. Each object e ∈ E is called an entity. A view v : E → R is a function over E: v takes an entity e, and returns some aspect (i.e. attribute) of e. A view operation is an operation which generates new views from existing views and entities. Below are some examples:

Example 1. Given a view v : E → R, a new view v′ : E → R′ may be created with a function ψ : R → R′ (i.e. v′ ≡ ψ ◦ v : E → R → R′, applying v and then ψ).

We can consider n-ary functions as views. All arguments except for the argument expecting the entity are regarded as parameters of the view. Hypothesis generation via machine learning algorithms can also be considered a form of view operation, and the generated hypothesis can also be considered a view.

Example 2. Given a set of data records (entities) and their attributes (views), the ID3 algorithm (view operator) generates a decision tree d. d is also a view, because it is a function that returns the class into which a given entity is classified.

The generated view d can also be used as an input to a view operation to create new views. This represents knowledge consolidation. Views and view operators are combined to create new views. The structure of such a combination, for a compound view, is called the design of the view. The task of KDD lies in the search for good views which explain the data. Knowledge concerning the data is encapsulated in the design of such a view. Human intervention can be conducted through the hand-crafted design of views by domain experts. For a computational knowledge discovery system to successfully assist an expert, the expert should be able to manipulate and understand the view design with ease.
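As a minimal sketch in plain O’Caml (no VML extension is assumed), the view operation of Example 1 can be written as ordinary function composition. The person record, its field names, and the 170 cm threshold are hypothetical and serve only as an illustration.

```ocaml
(* Hypothetical entity type; the field names are illustrative assumptions. *)
type person = { height_cm : float; weight_kg : float }

(* Two basic views v : E -> R. *)
let height (p : person) : float = p.height_cm
let weight (p : person) : float = p.weight_kg

(* View operation of Example 1: build v' = psi o v from a view v and a function psi. *)
let compose_view (psi : 'r -> 'r2) (v : 'e -> 'r) : 'e -> 'r2 = fun e -> psi (v e)

(* A new view E -> bool derived from the existing view [height]. *)
let is_tall : person -> bool = compose_view (fun h -> h > 170.0) height

let () =
  let p = { height_cm = 175.0; weight_kg = 73.0 } in
  Printf.printf "height=%.0f weight=%.0f tall=%b\n" (height p) (weight p) (is_tall p)
```

Note that the composed function is_tall is a black box in plain O’Caml: nothing records that it was built from height, which is the gap the view design is meant to fill.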
4 VML Syntax and Semantics
In this section, we describe the general idea of the two keywords ‘view’ and ‘vmatch’. As mentioned in the introduction, the aim of the extensions is to provide facilities for manipulating the views, and for retaining the meaning of the values and views, which are operations that reoccur in the KDD process. Although these extensions can be considered for any other functional language, we borrow syntax and semantics from O’Caml [20], an excellent language of the ML family. We do not provide a complete syntax definition of O’Caml here, and the O’Caml reference manual should be consulted for a full description. In the code snippets presented below, lines beginning with ‘#’ represent the input, interpreted up to a double semicolon ‘;;’. Other lines are internal representations (or responses from the interpreter). Comments are written between ‘(*’ and ‘*)’. Functional types are expressed as type1 -> type2, meaning that the function takes an argument of type type1 and returns a value of type type2. Tuple types are expressed by type1 * type2 * ... * typen. Polymorphic types are represented by a quote (') in front of an identifier, for example: 'a.

4.1 The ‘view’ keyword

The ‘view’ keyword can be used in the same syntax as the let statement in O’Caml, which binds a value or function to a name. We first briefly describe the let statement.

    let ident = expr

will bind the value of expr to the name ident. To bind a function to a name, one may write:

    let ident parameter1 ... parameterm = expr

or

    let ident = fun parameter1 ... parameterm -> expr

The right hand side of the second form defines an anonymous function, with the fun keyword, binding it to ident. Recursive definitions of names are introduced by let rec:

    let rec ident parameter1 ... parameterm = expr

Examples (taken from the O’Caml manual):

    # let pi = 4.0 *. atan 1.0;;
    val pi : float = 3.141593
    # let square x = x *. x;;
    val square : float -> float = <fun>
    # square(sin pi) +. square(cos pi);;
    - : float = 1.000000
    # let rec fib n = if n < 2 then 1 else fib(n-1) + fib(n-2);;
    val fib : int -> int = <fun>
The view keyword can be used in essentially the same way as let, assigning names to functions or values. The only difference is that operations conducted on a value or function bound to a name via the view statement are remembered. By convention, we write view names starting with a capital letter. What is remembered is the lambda expression of the value. We call this the representation of the value. A value does not need to have a representation expressing its origin, in which case the value itself is considered its representation. The representation of a given expression is determined recursively, as the β-reduction over the representations in the expression. In this β-reduction, values bound to a name via the view keyword are not expanded. The resulting value has a representation if it results from a function application to a value with a representation. For example:

    # view Plus x y = x + y;;
    view Plus : int -> int -> int = <fun>::(x y . Plus x y)
A function of type: (int->int->int) is assigned to the name “Plus”. The representation of this function is shown to the right of ‘::’ as (x y . Plus x y), where names on the left of the period ‘.’ represent abstracted variables. Views can also be partially applied.

    # let f = Plus 3;;
    val f : int -> int = <fun>::(y . Plus 3 y)
    # let t = f 5;;
    val t : int = 8::(Plus 3 5)
A function of type: (int->int) is bound to f. It was created by applying (x y . Plus x y) to 3, and the β-reduction of this is (y . Plus 3 y). Since (ignoring the abstraction) this is an application of a value defined as a view (Plus), the representation of the expression Plus 3 is remembered as (y . Plus 3 y). Similarly for t, the value 8 is bound, with a representation of (Plus 3 5). Other statements, such as if-then-else, match, and vmatch, will not have a representation, unless they return a value which has a representation. Let us look at a more sophisticated example. The following code can be used to define a binary decision tree for binary classification.

    # type 'a bdt = Leaf of bool | Node of ('a -> bool) * ('a bdt) * ('a bdt);;
    # view rec DTree dtrep elm =
        match dtrep with
        | Leaf(x) -> x
        | Node(v, y, n) -> if (v elm) then (DTree y elm) else (DTree n elm);;
    view DTree : 'a bdt -> 'a -> bool = <fun>::(x y . DTree x y)
This can be used as follows, which creates a tree of type int->bool.

    # let tree = DTree (Node ((fun e -> e > 5),
                              (Node ((fun e -> e < 10), (Leaf true), (Leaf false))),
                              (Leaf false)));;
    val tree : int -> bool = <fun>::(y . DTree (rep1) y)
    # tree 7;;
    - : bool = true::rep2

where rep1 is

    Node (<fun>, (Node (<fun>, (Leaf true), (Leaf false))), (Leaf false))

and rep2 is

    (DTree (Node (<fun>, (Node (<fun>, (Leaf true), (Leaf false))), (Leaf false))) 7)
    ::(DTree (Node (<fun>, (Leaf true), (Leaf false))) 7)
Here, since DTree is recursively defined, there are multiple representations for the resulting value: one for each time DTree is called (i.e. one for each subtree through which the value is classified). An alternate function handling the recursion can be defined to avoid this behavior. Alternatively, we can introduce a new keyword clear to let a value forget its representation: “clear expr” would return the value of expr without a representation.

4.2 The ‘vmatch’ keyword

A view can be freely decomposed by matching against its design using the ‘vmatch’ statement. From a value or function constructed from a view, we may obtain its origins. The syntax is the same as that of O’Caml’s ‘match’ statement:

    vmatch expr with pattern1 -> expr1 | ... | patternn -> exprn

but the representation of expr is matched against the patterns pattern1 to patternn. If the matching against patterni succeeds, the associated expression expri is evaluated, and its value becomes the value of the whole vmatch expression. The evaluation of expri takes place in an environment enriched by the bindings performed during matching. If several patterns match the representation of expr, the one that occurs first in the vmatch expression is selected. If none of the patterns match, the exception Vmatch_failure is raised. (In the code below, the values t and f are those defined in the previous subsection.)

    # vmatch t with (Plus x y) -> (x - y);; (* t: (Plus 3 5) *)
    - : int = -2
    # vmatch f with (Plus x _) -> Plus x;; (* f: (y . Plus 3 y) *)
    - : int -> int = <fun>::(y . Plus 3 y)
An underscore ‘_’ in the pattern represents a wild card, which will match anything. The representation without lambda abstractions is matched. Abstracted, or unbound, variables will not match variables in the pattern, and will only match if the corresponding variable is specified as an underscore ‘_’.

    # vmatch f with (Plus x y) -> "won't match";; (* y is not bound *)
    Uncaught exception: Vmatch_failure
The syntax for patterns is not described here, but since view designs can be regarded as a tree of functional application, there are possibilities for powerful patterns such as regular expressions over tree structures [6].
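The behavior described in this section can be approximated in plain O’Caml by pairing each value with an explicit representation tree. The following is only a sketch under that assumption, not the VML implementation: the types rep and tracked and the functions plus, plus_partial, match_plus_xy, and show are hypothetical names introduced for illustration, and the bookkeeping that VML is meant to perform automatically is written by hand. An abstracted argument is printed as ‘_’ rather than as a named variable.

```ocaml
(* Representation of a value: a tree of view applications.
   [Hole] stands for an abstracted (not yet applied) argument. *)
type rep =
  | Lit of int
  | Hole
  | App of string * rep list

(* A value paired with its representation, mimicking "8::(Plus 3 5)". *)
type 'a tracked = { value : 'a; rep : rep }

(* Hand-written emulation of: view Plus x y = x + y *)
let plus (x : int) (y : int) : int tracked =
  { value = x + y; rep = App ("Plus", [ Lit x; Lit y ]) }

(* Hand-written emulation of the partial application: let f = Plus 3 *)
let plus_partial (x : int) : (int -> int tracked) tracked =
  { value = plus x; rep = App ("Plus", [ Lit x; Hole ]) }

exception Vmatch_failure

(* Emulation of the pattern (Plus x y): both arguments must be bound. *)
let match_plus_xy (r : rep) : int * int =
  match r with
  | App ("Plus", [ Lit x; Lit y ]) -> (x, y)
  | _ -> raise Vmatch_failure

(* Render a representation in a notation close to the one used in this section. *)
let rec show = function
  | Lit n -> string_of_int n
  | Hole -> "_"
  | App (name, args) -> "(" ^ String.concat " " (name :: List.map show args) ^ ")"

let () =
  let f = plus_partial 3 in
  let t = f.value 5 in
  Printf.printf "%d::%s\n" t.value (show t.rep);
  let x, y = match_plus_xy t.rep in
  Printf.printf "%d\n" (x - y)
```

Applying match_plus_xy to the representation of the partial application raises Vmatch_failure, mirroring the rule that an abstracted variable matches only a wild card.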
5 Actual Knowledge Discovery Tasks
In this section, we describe two computational knowledge discovery experiments using VML. Since the aim is to illustrate the usage of VML, details of the experiments which would otherwise be necessary may be left out. VML is not yet fully implemented, and the experiments described here were developed in C++, but are based on the same concepts.

5.1 Characterization of N-terminal Sorting Signals of Proteins

Proteins are composed of amino acids, and can be regarded as strings over an alphabet of 20 characters. Most proteins are first synthesized in the cytosol, and carried to specified locations, called localization sites, such as mitochondria or chloroplasts. In most cases, the information determining the subcellular localization site is represented as a short amino acid sequence segment called a protein sorting signal [16]. Here, we consider signals known to be at the N-terminal of the protein: i.e. mitochondrial targeting peptides (mTP), chloroplast transit peptides (cTP), and signal peptides (SP). TargetP [4], a neural network based predictor, is the state of the art for the prediction of these signals. However, it is difficult to see what aspects of the input sequence TargetP has captured, and the aim of our work is to find simple and interpretable characteristics, or rules, which can still accurately predict the localization site of proteins.

Our approach to the problem was to first discuss with an expert what kind of views we should design. We used data available from the TargetP web-site [21], consisting of 940 sequences containing 368 mTP, 141 cTP, 269 SP, and 162 “Other” sequences. After conducting computational experiments for some view design, we presented the results to the expert as feedback, and repeated the process. We first considered binary classifiers, which distinguish sequences of a certain signal. The entity set is the set of amino acid sequences. The views we look for are of type string->bool: for an amino acid sequence, return a Boolean value, true if the sequence contains a certain signal, and false if it does not. The views we actually designed (in time order) can be written in VML as:

    let h1 pat mm ind pos len str =
      Astrstr mm pat (AlphInd ind (Substring pos len str));;
    let h2 thr ind pos len str =
      GT (Average (AAindex ind (Substring pos len str))) thr;;
    let h3 thr aaind pos1 len1 pat mm alphind pos2 len2 str =
      And (h2 thr aaind pos1 len1 str) (h1 pat mm alphind pos2 len2 str);;
    view Substring pos len str : int -> int -> string -> string
      >> return substring: [pos,pos+len] of str. A negative value for pos means to count from the right end of the string.
    view AlphInd ind str : (char -> char) -> string -> string
      >> convert str according to alphabet indexing ind. ind is a mapping of char->char, called an alphabet indexing [17], and can be considered as a classification of the characters of a given alphabet.
    view Astrstr mm pat str : astr_mismatch -> string -> string -> bool
      >> approximate pattern matching: match pat & str with mismatch mm. The type astr_mismatch consists of (int * bool * bool * bool), where the int value is the maximum number of errors allowed, and the bool values are flags to permit the error types: insertion, deletion, and substitution.
    view AAindex ac str : string -> string -> (float array)
      >> convert str to an array of float according to amino acid index: ac. ac is an accession id of an entry in the AAindex database [7]. Each entry in the database represents some biochemical property of amino acids, such as volume, hydropathy, etc., represented as a mapping of char -> float.
    view Average v : float array -> float
      >> the average of the values in v
    view GT x y : 'a -> 'a -> bool
      >> greater than
    view And x y : bool -> bool -> bool
      >> Boolean ‘and’

Fig. 1. Views used in the view design to distinguish protein sorting signals.
The meaning of each view used in the designs is summarized in Figure 1. For each view design, if we apply all the parameters except str, the resulting function is of the desired type: string->bool. Note that since each view design is composed of views, the representation of such a function will contain information about all the arguments. For example (“BIGC670101” is the accession id for the amino acid index ‘volume’):

    # let f = h2 3.5 "BIGC670101" 5 20;;
    val f : string -> bool = <fun>::(x . (GT (Average (AAindex "BIGC670101" (Substring 5 20 x))) 3.5))
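To make the shape of design h2 concrete, the following plain O’Caml sketch mirrors it without any VML extension, so no representation is recorded. Several points are assumptions made only for illustration: toy_index is a made-up three-letter index rather than an AAindex entry, the index is passed as a function instead of an accession id, and substring uses 0-based positions, truncates at the end of the string, and does not handle negative positions.

```ocaml
(* Hypothetical toy amino acid index; real entries come from the AAindex database. *)
let toy_index (c : char) : float =
  match c with
  | 'A' -> 1.0
  | 'L' -> 2.0
  | 'K' -> -1.0
  | _ -> 0.0

(* Assumed Substring semantics: [len] characters starting at 0-based [pos],
   truncated at the end of the string; negative positions are clamped to 0. *)
let substring (pos : int) (len : int) (s : string) : string =
  let n = String.length s in
  let pos = min (max pos 0) n in
  String.sub s pos (max 0 (min len (n - pos)))

(* Convert a string into the array of its index values. *)
let aaindex (index : char -> float) (s : string) : float array =
  Array.init (String.length s) (fun i -> index s.[i])

(* Average of an array; 0.0 for the empty array (an arbitrary choice). *)
let average (v : float array) : float =
  if Array.length v = 0 then 0.0
  else Array.fold_left ( +. ) 0.0 v /. float_of_int (Array.length v)

(* Design h2: threshold on the average index value of a sequence segment. *)
let h2 thr index pos len str = average (aaindex index (substring pos len str)) > thr

let () =
  (* Applying every parameter except [str] yields a classifier string -> bool. *)
  let f : string -> bool = h2 0.5 toy_index 0 4 in
  Printf.printf "%b %b\n" (f "ALLAKK") (f "KKKKAL")
```

With the toy index, the first four characters of "ALLAKK" average 1.5 and those of "KKKKAL" average -1.0, so the program prints "true false".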
The task is now to find good parameters which define a function that can accurately distinguish the signals. For each view design, a wide range of parameters was tried. Each combination of parameters is applied to the view design shown above, yielding a function of type string->bool. Note again that since VML remembers the representation, the programmer need not worry about keeping track of the meaning of each function, because the representation may be consulted using the vmatch statement when needed. We apply this function to all the protein sequences, and calculate its score as a classifier of a certain signal. Functions with the best scores are selected.

View design h1 looks for a pattern over a sequence converted by a classification of an alphabet [17]. We hoped to find some kind of structural similarities of the signals with this design, but we could not find satisfactory parameters which would let h1 predict the signals accurately. Next, we designed a new view h2 which uses the AAindex database [7], this time looking for characteristics of the amino acid composition of a sequence segment. This turned out to be very effective, especially for the SP set, and was used to distinguish SP from the other signals. For the remaining signals, we tried combining h1 and h2 into h3. This proved to be useful for distinguishing the “Other” set (those which do not have N-terminal signals) from mTP and cTP. We can see that the functional nature of VML enables the easy construction of the view designs.

By combining the views and parameters thus obtained into a single decision list, we were able to create a rule which competes fairly well with TargetP in terms of prediction accuracy. The views and parameters used in the decision list are given in Figure 2. The scores of a 5-fold cross-validation are shown in Table 1. The score is defined by:

$$\mathrm{MCC} = \frac{tp \times tn - fp \times fn}{\sqrt{(tp+fn)(tp+fp)(tn+fp)(tn+fn)}}$$

where tp, tn, fp, fn are the numbers of true positives, true negatives, false positives, and false negatives, respectively (Matthews correlation coefficient (MCC) [14]). The knowledge encapsulated in the view design was consistent with widely believed (but vague) characteristics of each signal, and the expert was surprised that such a simple rule could describe the sorting signals with such accuracy. A system called iPSORT was built based on these rules, and an experimental web service is provided at the iPSORT web-site [22].

vmatch can be useful in the following situation: after obtaining a good view of design h2, we may want to see if we can find a good view of design h1 which uses the same substring as h2. This can be interpreted as first looking for a segment which has a distinct amino acid composition, and then looking closer at this segment, to see if structural characteristics of the segment can be found. This function can be written as:

    # let newh f = vmatch f with
        | GT (Average (AAindex _ (Substring p l _))) _ ->
            fun pat mm ind str -> h1 pat mm ind p l str;;
    val newh : 'a -> string -> astr_mismatch -> (char -> char) -> string -> bool = <fun>
If the representation of a function h was, for example:

    (str . GT (Average (AAindex ind (Substring 3 16 str))) 3.5)

then the representation of (newh h) would become:

    (pat mm ind str . (Astrstr mm pat (AlphInd ind (Substring 3 16 str))))

representing a function of design h1, but using the parameters of h, of view design h2, for Substring. Again, we need not worry about explicitly keeping track of what values were applied to h2 to obtain h, since they are automatically remembered and can be extracted by the vmatch keyword. Thus, we have seen that the design and manipulation of views can be done easily with VML, which would therefore greatly assist the trial-and-error cycle of the experiments.

Table 1. The Prediction Accuracy of the Final Hypothesis (scores of TargetP [4] in parentheses)

                              Predicted category
    True category  # of seqs  cTP          mTP          SP           Other        Sensitivity  MCC
    cTP            141        112 (120)    15 (14)      0 (2)        14 (5)       0.79 (0.85)  0.64 (0.72)
    mTP            368        41 (41)      304 (300)    9 (9)        14 (18)      0.83 (0.82)  0.79 (0.77)
    SP             269        16 (2)       8 (7)        237 (245)    8 (15)       0.88 (0.91)  0.89 (0.90)
    Other          162        13 (10)      6 (13)       2 (2)        141 (137)    0.87 (0.85)  0.80 (0.77)
    Specificity               0.62 (0.69)  0.91 (0.90)  0.96 (0.96)  0.80 (0.78)

Decision list: sequence -> P1 (SP? by h2): yes -> SP; no -> P2 (Other? by h3): yes -> Other; no -> P3 (mTP? by h3): yes -> mTP; no -> cTP.

    Node  Substring  Amino acid index   Threshold  Alphabet indexing  Pattern    Mismatch allowance
    P1    [6, 25]    Hydropathy Index   0.9225     not used           not used   not used
    P2    [1, 30]    Negative Charge    0.083      AI1                112111221  2 ins/del
    P3    [1, 15]    Isoelectric Point  0.621      AI2                211211221  3 ins/del

    Alphabet indexing  1                 2   3
    AI1                ACFLMPQSTVWY      IR  DEHKN
    AI2                ACDEFGHLMNQSTVWY  KR  IP

Fig. 2. The parameters for each view, and the decision list combining the views for each signal, in iPSORT.
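As a worked illustration of the score used in this section, the following plain O’Caml sketch computes the Matthews correlation coefficient from confusion counts. The function name mcc and its labeled arguments are hypothetical, and returning 0.0 for a zero denominator is an assumed convention. The counts for the SP category are read off Table 1; sensitivity is taken as tp/(tp+fn) and the “specificity” of Table 1 as tp/(tp+fp), which is an interpretation that reproduces the tabulated values.

```ocaml
(* Matthews correlation coefficient from confusion counts.
   Returns 0.0 when the denominator vanishes (an assumed convention). *)
let mcc ~tp ~tn ~fp ~fn =
  let f = float_of_int in
  let num = (f tp *. f tn) -. (f fp *. f fn) in
  let den = sqrt (f (tp + fn) *. f (tp + fp) *. f (tn + fp) *. f (tn + fn)) in
  if den = 0.0 then 0.0 else num /. den

let () =
  (* SP counts read off Table 1: 237 of the 269 SP sequences are predicted as SP,
     and 0 + 9 + 2 = 11 sequences of the other categories are predicted as SP. *)
  let tp = 237 and fn = 269 - 237 and fp = 11 in
  let tn = 940 - tp - fn - fp in
  Printf.printf "sensitivity=%.2f specificity=%.2f MCC=%.2f\n"
    (float_of_int tp /. float_of_int (tp + fn))
    (float_of_int tp /. float_of_int (tp + fp))
    (mcc ~tp ~tn ~fp ~fn)
```

For these counts the program prints sensitivity=0.88 specificity=0.96 MCC=0.89, in agreement with the SP entries of Table 1.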
5.2 Detecting gene regulatory sites
It is known that for many genes, whether or not the gene expresses its function depends on specific proteins, called transcription factors, which bind to specific locations on the DNA, called gene regulatory sites, usually in the upstream region of the coding sequence of the gene. Since proteins selectively bind to these sites, it is believed that common motifs exist for genes which are regulated by the same protein. We consider the case where the 2-block motif model is preferred, that is, when the binding site cannot be characterized by a single motif, and 2 motifs should be searched for. We develop a simple and original method, based on views. Tested on the B. subtilis σA-dependent promoter sequences of [3], our method was able to rediscover the same results, as well as other candidates for 2-block motifs.
Previous Work

BioProspector [19] is a system that generates a set of motifs as candidate regulatory sites, from a set of sequences and input parameters. Its main algorithm is based on the Gibbs sampling method [10], although some modifications are made for the sampling of 2-block motifs. The input to the system consists of: 1) sequences containing motifs to be searched for, 2) sequences used to define the background nucleotide distribution, 3) the widths of the two motifs to be searched, w1, w2, and their gaps [gL, gM], 4) the motif score threshold [TL, TH]. BioProspector could be written as a view:

    view BioProspector seqs dist_seqs g_min g_max thr_l thr_h :
      (string array) -> (string array) -> int -> int -> float -> float -> ((int * int array) array)

which returns the position and length of each motif that is a candidate regulatory site. The resulting value would hold the information that it was a value calculated by BioProspector.

Our original method

In a view-oriented approach, we start with some view design, preferably designed by an expert, and various parameters are searched for. The design of a view can be regarded as a working hypothesis, reflecting an expert’s knowledge or intuition. In our case, we modeled the 2-block motif for regulatory sites as consisting of three components: the motif pattern, the gap width of these patterns, and their positions. We construct a function with the following design:

    let orig pos len g_min g_max mm1 mm2 pat1 pat2 str =
      ListDistAnd g_min g_max
        (AstrstrList mm1 pat1 (Substring pos len str))
        (AstrstrList mm2 pat2 (Substring pos len str));;
where

    view ListDistAnd min max l1 l2 : int -> int -> (int list) -> (int list) -> bool
      >> Return true if there exist e1 ∈ l1, e2 ∈ l2 such that min ≤ (e2 − e1) ≤ max.
    view AstrstrList mm pat str : astr_mismatch -> string -> string -> (int list)
      >> Return the match positions of a pattern as a list of int. The type astr_mismatch is explained in Figure 1.
The arguments except str are parameters, and when all the parameters are set, a function of type string->bool is generated, returning true if a certain 2-block motif appears in a given string, and false otherwise. To look for good parameters, we take a supervised learning approach, and randomly selected genes of B. subtilis not included in the original dataset, from the GenBank database [23], as negative data. The parameter search space is enormous, but knowledge concerning the DNA regulatory sites was used to limit the space, keeping the search feasible. The score of each view is based on its accuracy as a classification function that decides whether or not an input sequence has the motifs. We look at the top ranking views in order to evaluate them. Numerous iterations with different search spaces yielded some interesting results. The outputs of the experiments are shown in Figure 3.

    v1: (str . ListDistAnd 20 30
                 (AstrstrList (2,false,false,true) "ttgtca" (Substring -40 35 str))
                 (AstrstrList (2,false,false,true) "tataat" (Substring -40 35 str)))
        true positive  102   false negative   40   =  71.8 %
        false positive   0   true negative   142   = 100.0 %
    v2: (str . ListDistAnd 20 30
                 (AstrstrList (2,false,false,true) "ttgaca" (Substring -40 35 str))
                 (AstrstrList (2,false,false,true) "tataat" (Substring -40 35 str)))
        true positive  100   false negative   42   =  70.4 %
        false positive   0   true negative   142   = 100.0 %
    v3: (str . ListDistAnd 25 35
                 (AstrstrList (3,false,false,true) "atgatc" (Substring -50 65 str))
                 (AstrstrList (2,false,false,true) "gttata" (Substring -50 65 str)))
        true positive  142   false negative    0   = 100.0 %
        false positive   0   true negative   142   = 100.0 %

Fig. 3. Result of our original method to find regulatory sites.

By limiting the search space using knowledge obtained from previous work, we were able to come up with views v1 and v2, where the 2-block motifs were consistent with, or the same as, “TTGACA” and “TATAAT” as detected in [3, 19]. We also ran the experiments with a wider range of parameters, and found a view v3 that could perfectly discriminate the positive and negative examples. Although a biological interpretation must follow for the result to be meaningful, we were successful in finding a candidate for a novel result.

In this kind of experiment, VML can help the expert in the following way. Although the views are sorted by some score, it is difficult to check the validity of a view according to the score alone. For example, a valuable view will probably have a high score, but a view with a high score may not be valuable. In the evaluation stage, the expert needs to look at the many different views with adequately high scores, and see what kind of parameters were used to generate each view. This would be easy in VML, since it only requires displaying the representations of the functions.
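The gap condition at the heart of the design used in this section can be sketched in plain O’Caml as follows. This is an illustration under stated assumptions, not the method as implemented: exact_positions is a hypothetical stand-in for AstrstrList that performs exact matching only (the view in the text allows mismatches), positions are 0-based, and two_block searches the whole string instead of a Substring segment.

```ocaml
(* ListDistAnd: true if some e1 in l1 and e2 in l2 satisfy min <= e2 - e1 <= max. *)
let list_dist_and (min_gap : int) (max_gap : int) (l1 : int list) (l2 : int list) : bool =
  List.exists
    (fun e1 -> List.exists (fun e2 -> min_gap <= e2 - e1 && e2 - e1 <= max_gap) l2)
    l1

(* Stand-in for AstrstrList: exact matching only (an assumption; the view in the
   text allows mismatches). Returns the 0-based start positions of [pat] in [s]. *)
let exact_positions (pat : string) (s : string) : int list =
  let n = String.length s and m = String.length pat in
  let rec go i acc =
    if i + m > n then List.rev acc
    else if String.sub s i m = pat then go (i + 1) (i :: acc)
    else go (i + 1) acc
  in
  go 0 []

(* The 2-block design restricted to exact matches over the whole string. *)
let two_block g_min g_max pat1 pat2 str =
  list_dist_and g_min g_max (exact_positions pat1 str) (exact_positions pat2 str)

let () =
  (* Synthetic sequence: "ttgaca" starts at position 2 and "tataat" at position 26. *)
  let s = "cc" ^ "ttgaca" ^ String.make 18 'g' ^ "tataat" ^ "cc" in
  Printf.printf "%b %b\n" (two_block 20 30 "ttgaca" "tataat" s) (two_block 0 10 "ttgaca" "tataat" s)
```

In the synthetic sequence the two start positions differ by 24, so the gap range [20, 30] is satisfied and the range [0, 10] is not; the program prints "true false".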
6 Conclusion
We presented a language called VML, as an extension of O’Caml, a dialect of ML. The advantages of VML are: 1) Since VML is a functional language, the composition and application of views can be done in a natural way, compared to procedural languages such as C. 2) By defining the unit of knowledge as views, the programmer (or explorer) does not need to explicitly keep track of how each individual view was designed (i.e. manage data structures to remember the set of parameters). 3) The explorer can use “parts” of a good view, which can perhaps only be determined at runtime, and apply them to another view (the example in Subsection 5.1). 4) In an interactive interface (i.e. a VML interactive interpreter), the explorer can compose and decompose views and view designs, and apply them to data. When the explorer “accidentally” stumbles upon an interesting view, he/she can retrieve the design immediately. Using VML, we modeled and described successful knowledge discovery tasks which we have actually carried out, and showed that these points can lighten the burden of the programmer and explorer, and as a result, speed up the trial-and-error cycle of computational knowledge discovery processes.
References

[1] H. Bannai, Y. Tamada, O. Maruyama, K. Nakai, and S. Miyano. Views: Fundamental building blocks in the process of knowledge discovery. In Proceedings of the 14th International FLAIRS Conference, 2001. To appear.
[2] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.
[3] J. D. Helmann. Compilation and analysis of Bacillus subtilis σA-dependent promoter sequences: evidence for extended contact between RNA polymerase and upstream promoter DNA. Nucleic Acids Research, 23(13):2351–2360, 1995.
[4] O. Emanuelsson, H. Nielsen, S. Brunak, and G. von Heijne. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. Journal of Molecular Biology, 300(4):1005–1016, July 2000.
[5] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37–54, 1996.
[6] H. Hosoya and B. C. Pierce. Regular expression pattern matching for XML. In The 25th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2001.
[7] S. Kawashima and M. Kanehisa. AAindex: Amino Acid index database. Nucleic Acids Research, 28(1):374, 2000.
[8] T. Khabaza and C. Shearer. Data mining with Clementine. IEE Colloquium on ‘Knowledge Discovery in Databases’, 1995. IEE Digest No. 1995/021(B), London.
[9] P. Langley. The computer-aided discovery of scientific knowledge. In Lecture Notes in Artificial Intelligence, volume 1532, pages 25–39, 1998.
[10] C. Lawrence, S. Altschul, M. Boguski, J. Liu, F. Neuwald, and J. Wootton. Detecting subtle sequence signals – a Gibbs sampling strategy for multiple alignment. Science, 262:208–214, 1993.
[11] O. Maruyama and S. Miyano. Design aspects of discovery systems. IEICE Transactions on Information and Systems, E83-D:61–70, 2000.
[12] O. Maruyama, T. Uchida, T. Shoudai, and S. Miyano. Toward genomic hypothesis creator: View designer for discovery. In Discovery Science, volume 1532 of Lecture Notes in Artificial Intelligence, pages 105–116, 1998.
[13] O. Maruyama, T. Uchida, K. L. Sim, and S. Miyano. Designing views in HypothesisCreator: System for assisting in discovery. In Discovery Science, volume 1721 of Lecture Notes in Artificial Intelligence, pages 115–127, 1999.
[14] B. W. Matthews. Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405:442–451, 1975.
[15] R. Milner, M. Tofte, R. Harper, and D. MacQueen. The Definition of Standard ML (Revised). MIT Press, 1997.
[16] K. Nakai. Protein sorting signals and prediction of subcellular localization. In Advances in Protein Chemistry, volume 54, pages 277–344. Academic Press, 2000.
[17] S. Shimozono. Alphabet indexing for approximating features of symbols. Theoretical Computer Science, 210:245–260, 1999.
[18] S. Wrobel, D. Wettschereck, E. Sommer, and W. Emde. Extensibility in data mining systems. In Proceedings of the 2nd International Conference On Knowledge Discovery and Data Mining (KDD-96), pages 214–219, 1996.
[19] X. Liu, D. L. Brutlag, and J. S. Liu. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. In Pacific Symposium On Biocomputing 2001, volume 6, pages 127–138, 2001.
[20] http://caml.iniria.fr/ocaml/.
[21] http://www.cbs.dtu.dk/services/TargetP/.
[22] http://www.hypothesiscreator.net/iPSORT/index.html.
[23] http://www.ncbi.nlm.nih.gov/Genbank.