Activist Data Mining for Computational Science: Tools and Applications

Dennis Shasha
Courant Institute of Mathematical Sciences, Department of Computer Science
New York University
[email protected]

Abstract

Classical data mining waits for data to appear and then mines it. Activist data mining proposes experiments based on algorithmic and application-specific considerations, evaluates the results, proposes new experiments, evaluates again, and so on. Activist data mining is thus a fundamentally interventionist and iterative endeavour. It entails close collaboration with application specialists. The techniques required include combinatorial design to support disciplined experimentation, a variety of analog circuit-building techniques, and hypothesis generation. The talk and this paper discuss these tools in the context of a series of case-study collaborations with biologists and physicists. The necessary scientific background will be presented to make the discussion self-contained. The talk is meant to appeal to researchers and practitioners in data mining as well as any visiting natural scientists. The data sizes range from 30,000 items for microarrays to trillions of items in gamma ray experiments. My intent in this paper is to convey the philosophy of my approach. You can find the technicalities on my web page or on the conference site. I will concentrate on biology because that is where I do the most work.
1. Introduction

Having the complete genome of a species is both an opportunity and a burden to biologists. One part of the opportunity comes from the fact that different species often contain genes having similar sequences. This allows two kinds of inferences:

1. If the functionality of gene A is known in species X, then the functionality of gene A', having a similar sequence to A but found in species Y, can be guessed.

2. Even if the functionalities of A and A' are unknown, they can both be inferred if the only species having sequences similar to A are X and Y and those two species share some feature that no other species has [6].

This is only one use of genomic data, but it illustrates a principle well known in data mining: similar cause (in this case, sequence) may result in similar effect (functionality).

Why does genomic data constitute a burden to biologists? There is a lot of it. Manipulating it in a lab notebook is infeasible. Further, any variation in experimental conditions will change the behavior of some gene, and the measurements are often full of noise. The first reaction of biologists to this new reality was simply to generate data and hope that someone in the community would analyze it. Many have followed this paradigm. The cancer cell line data of Golub [1] and the yeast data of Brown [2] have been picked over carefully, for example. The trouble is that hard-and-fast conclusions are difficult to come by. For example, to discover genes that may determine a response to cancer treatment, Golub's study presents data about 72 patients having three conditions. This is far too few to determine which gene or small set
of genes among thousands is responsible. Statistical approaches are possible [3] but remain (so far) inconclusive.

2. Activist Data Mining

Activist data mining follows the philosophy that computational techniques should both precede and follow the generation of data. Others adopt a similar philosophy and call it active learning [4].

2.1. Adaptive Combinatorial Design

One use of activist data mining is to identify the input variables that have the strongest effect on an output of interest. The first step is to test a large given search space with a small set of well-spaced experiments. The experiments are derived from 2-factor combinatorial design [7] and have the property that every pair of values from every pair of inputs is tested in at least one experiment. This concept comes from the work of David Cohen, Sid Dalal, Michael Fredman, and Gardner Patton of Telcordia, where it was used for software testing. The result is to capture at least the two-level interactions in this first set. (Compared to the size of the search space, the 2-factor combinatorial design yields very small samples. For example, exhaustively testing 10 inputs having 4 possible values each would require about one million experiments. A 2-factor combinatorial design requires only about 30; a sketch of how to generate one appears at the end of this section.)

The second step is to guess the important inputs: those that determine the value of the output of interest in the initial 2-factor design. (Remember, this is an iterative approach.) For example, it may be that light has a higher correlation with the production of a certain plant amino acid than any other factor. Those important inputs can then be used as "pivots" for the next set of experiments. These next experimental sets are divided into subsets. Each subset has a single value of the pivot input and a 2-factor combinatorial design of the other inputs. The different subsets are identical on all inputs except the pivots. Thus, one can evaluate the effect of the pivot input under a variety of background conditions, because each experiment in one subset has a counterpart in another subset that is identical except in its value of the pivot. These "minimal pairs" allow us to answer several questions:

1. Which output variables (say, genes) are consistently induced by the pivot in a wide variety of contexts? Formally, for pivot values p1 and p2, is p1 with context C inductive with respect to p2 with C, for all contexts C of non-pivot inputs? (Similarly for repressive.)

2. Is the pivot ever decisively inductive, in the sense that an experiment having p1 and context C1 is inductive compared with p2 and C2, for all C1 and C2? (Similarly for repressive.)

3. Which genes track one another across all these different contexts? This may help find transcription factor binding sites.
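To make the first step concrete, here is a minimal sketch of a greedy generator for a 2-factor (pairwise) covering design. The randomized greedy heuristic is a standard construction from the software-testing literature, not necessarily the one used in [7], and the function names are my own illustration.

import itertools
import random

def pairwise_design(levels, candidates_per_round=200):
    """Return a list of experiments (one tuple per experiment) such that
    every pair of values of every pair of inputs occurs at least once.
    levels[i] is the number of possible values of input i."""
    k = len(levels)
    # All (input-pair, value-pair) combinations still to be covered.
    uncovered = {((i, j), (a, b))
                 for i, j in itertools.combinations(range(k), 2)
                 for a in range(levels[i])
                 for b in range(levels[j])}
    design = []
    while uncovered:
        best, best_gain = None, -1
        # Sample candidate experiments; keep the one covering the most
        # still-uncovered pairs. (Scanning all possible experiments would
        # mean walking the whole exponential search space.)
        for _ in range(candidates_per_round):
            cand = tuple(random.randrange(levels[i]) for i in range(k))
            gain = sum(1 for ((i, j), (a, b)) in uncovered
                       if cand[i] == a and cand[j] == b)
            if gain > best_gain:
                best, best_gain = cand, gain
        design.append(best)
        uncovered = {((i, j), (a, b)) for ((i, j), (a, b)) in uncovered
                     if not (best[i] == a and best[j] == b)}
    return design

# 10 inputs, 4 values each: a few dozen experiments instead of 4**10.
print(len(pairwise_design([4] * 10)))

The same routine serves the second step: fix a pivot value, generate a pairwise design over the remaining inputs, and replicate that design once per pivot value to obtain the minimal pairs.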
2.2. Time Series Analysis

A second thread in activist data mining is the ability to react quickly to the flood of data that arises from sensors. Many people think that faster processor speeds should make efficiency considerations less important. For scientific applications, the opposite holds, because faster processor speeds make sensors much more capable of delivering large streams of data. For example, in mission operations for NASA's Space Shuttle, approximately 20,000 sensors are telemetered once per second to Mission Control at Johnson Space Center, Houston. This data arrives in time order, as does data in fields ranging from physics to finance to medicine to music, just to name a few. Further, the number of sensors increases over time, so fusion between sensors becomes ever more critical in order to distill knowledge from the data. Real-time response is desirable in many applications (e.g., to aim a telescope at a burst detected by another telescope, or to offer feedback to a doctor doing magnetic imaging). These three factors – data size, fusion, and fast response – motivate the need to improve time series analysis primitives. Here are two typical primitives of importance (a brute-force sketch of the second appears below):

1. Correlation. Find windows in different time sequence streams whose values are highly correlated. The windows may be synchronous and of fixed size, or asynchronous and of varying size. The windows need not even be the same length if we allow time warping. This is an essential primitive for data fusion. Nobody knows how to do these in linear time.

2. Burst Detection. Identify periods of burst, where a burst means that the number of events that have arrived over a given window exceeds some threshold for that window size. We would like to be able to do this for many (thousands or even millions of) window sizes and many (thousands of) event types.

If you are interested, please see my web site for papers and software on some special cases of these problems.
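As an illustration of the burst detection primitive, here is a minimal brute-force sketch using prefix sums. All names are my own; this O(n) scan per window size is not the efficient many-window structure described in the papers on my web site, only a statement of the problem in code.

def detect_bursts(counts, thresholds):
    """Yield (start, window_size, total) for every window whose event
    count meets or exceeds the threshold for that window size.
    counts[t] is the number of events in time bin t; thresholds maps
    each window size of interest to its alarm level."""
    # prefix[i] = number of events in bins 0..i-1
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)
    n = len(counts)
    for w, limit in thresholds.items():
        for start in range(n - w + 1):
            total = prefix[start + w] - prefix[start]
            if total >= limit:
                yield (start, w, total)

# Example: flag bursts of >= 8 events in any 3-bin window.
counts = [1, 0, 5, 4, 2, 0, 1, 7, 3, 0]
print(list(detect_bursts(counts, {3: 8})))

The real difficulty, which this sketch sidesteps, is doing the same work for thousands of window sizes over millions of streaming events without rescanning the data.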
3. Collaborating with Physical Scientists

Computer scientists solve puzzles. We are happiest when we have found a provable algorithm that solves a problem with as little regard to the semantics of the data as possible. That is the power of the discipline: a sorting routine, a database system, or a time series analysis package each works in domains whose existence was unknown to the original programmers and algorithm designers. While offering power, this analysis-rich but semantics-oblivious approach to problem solving gives rise to misunderstanding at many levels.

For example, a colleague of mine came up with a very fast string comparison algorithm in the late 1980s. He presented his algorithm to biologists at the U.S. National Institutes of Health. He came back dejected and told me, "You know what they asked me? They asked me if I had a Fortran implementation!" The problem of programming was, to him, a trivial consideration. He was surprised that the biologists weren't overjoyed at seeing the cleverness of his technique and grateful for the speed it would give them. This was not arrogance. Had his algorithm been useful for, say, computer-aided design, the engineers would have given him his desired reaction. The biologists, for their part, were interested not in algorithmic beauty but in practical effectiveness. Biological theories must be tested against real data; why not algorithms? To them, my friend had done little more than speculate.

Many computer scientists would love it if they could design an algorithm, put its description on the web or in a journal, and then have biologists discover it and implement it. The trouble is that algorithms invented this way rarely fit the needs of a practicing natural scientist. I know this from personal experience. Starting in the late 1980s, two students and I designed some of the fastest tree matching algorithms in the world. Linguists, compiler writers, and e-commerce people liked them, but biologists never did. Our algorithms were optimized for ordered
trees (where the order among sister nodes matters). This is almost never useful to biologists, who care mostly about the parent-child relationships of phylogenies. Only after realizing this problem three years ago did we make new tools available (again, on my web site).

Biologists care about data. They stare at it. They fret about outliers. Computer scientists, by contrast, never look at data. In my collaboration with the plant biologists Phil Benfey and Ken Birnbaum, I would write a program, run it against some data, and send the results to Ken. When Ken put together the first draft of the paper, I pointed to part of it and asked, "How did you find that binding site?" "You found it," they replied, laughing. "It came from your program." This difference in interest in data has its good and bad aspects. The good is that it diminishes the temptation to "fix" bad data in the program (e.g., our binding-site-finding algorithm never explicitly removed TATA boxes; they fell out because the algorithm deemed them irrelevant for statistical reasons) [5]. The only bad is that the computer scientist gets laughed at. (In other words, there is nothing bad about it.)

The bottom line is that the two cultures of scientists should compromise in their working styles but not in their ethics. Computer scientists should not hope to work in isolation. To be relevant, we must build useful software that can evolve with a project. Biologists should recognize the value of general tools – provided those tools can specialize to the problems of their labs. My most extensive collaboration so far has been with Gloria Coruzzi's lab on the modeling of the regulation of biochemical pathways. That was where adaptive combinatorial design was developed. In principle, the tools apply to any experimental discipline, but here the tools were used – at first in crude form – from the very beginning of the project. The success of these collaborations suggests several lessons.

1. Meet every week. We meet over lunch or in a small conference room with cookies. Discussions and presentations in comfortable settings spark ideas and suggest problems to solve. For example, our particular use of combinatorial design arose from the evident impracticality of trying all possible input combinations in a large search space.

2. Build prototypes. Tools find use if they are available when a problem manifests itself, not two years later. We write tools within a few days of the identification of a significant problem. They are used immediately. At first, the tool is so crude that only we can run it, but our biologist colleagues understand the output. Refinements (and program refactoring) occur over time. As I discovered in my research for a book, computer graphics pioneer Fred Brooks started a whole thread of computer graphics this way [8].

3. Teach one another. Many computer scientists (and a remarkable number of biologists) found the terminological flood of high school biology to resemble force-feeding more than learning. Fortunately, molecular biology is much simpler and more nearly algorithmic than classical biology, so there is a lot of common ground. Further, when biologists explain their view of the world, computer scientists soon realize that the biologists are using a knowledge representation. As computer scientists well know, one representation may improve on another. For example, many biologists use an arrow notation in which an element X either represses or induces an element Z.
This makes it impossible to express a fact such as: X induces Z only when Y is absent, or Y induces Z only when X is absent. Such "exclusive or" relationships are easily expressed as boolean circuits, as the sketch below shows.
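Here is a minimal sketch of that exclusive-or relationship written as a boolean function; the function name and truth-table printout are my own illustration, not notation from any particular modeling tool.

def z_induced(x: bool, y: bool) -> bool:
    """Z is induced when exactly one of its regulators X and Y is
    present: Z = (X and not Y) or (Y and not X), i.e., X xor Y.
    Arrow notation (X -> Z, Y -> Z) cannot express this."""
    return x != y

# Print the truth table for the circuit.
for x in (False, True):
    for y in (False, True):
        print(f"X={x} Y={y} -> Z={z_induced(x, y)}")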
4. Conclusion: From Ignorance to Tools

People work together for glory, for tenure, or for money. But creativity blossoms in an atmosphere of enjoyment and mutual respect. Biologists, physicists, and medical researchers may not know at first exactly what they want from a computer scientist. Computer scientists may be lost in a discussion of biochemical pathways. But if each feels comfortable enough to ask questions at the risk of appearing ignorant, then fundamental patterns may emerge. From patterns come algorithms and tools. Tools, in turn, lead to more science. Glory, tenure, and money flow of their own accord. At least that's the theory.

Acknowledgments

Work supported in part by U.S. NSF grants IIS-9988345, N2010-0115586, and MCB-0209754.

References

[1] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286: 531-537, 1999.

[2] S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P. O. Brown, and I. Herskowitz. The transcriptional program of sporulation in budding yeast. Science, 282: 699-705, 1998.

[3] F. Chiaromonte and J. A. Martinelli. Dimension-reduction strategies for analyzing global gene expression data with a response. Mathematical Biosciences, 176(1): 123-144, 2002.

[4] S. Tong and D. Koller. Active Learning for Structure in Bayesian Networks. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, Seattle, Washington, 2001, pp. 863-869.

[5] K. Birnbaum, P. N. Benfey, and D. E. Shasha. cis Element/Transcription Factor Analysis (cis/TF): A Method for Discovering Transcription Factor/cis Element Relationships. Genome Research, 11(9): 1567-1573, 2001.

[6] M. Levesque, D. Shasha, W. Kim, M. G. Surette, and P. N. Benfey. Trait-to-Gene: A Computational Method for Predicting the Function of Uncharacterized Genes. Current Biology, 13(2): 129-133, 2003. (Discussed in: http://www.thescientist.com/yr2003/jun/hot_030603.html)

[7] D. Shasha, A. Kouranov, L. Lejay, M. Chou, and G. Coruzzi. Using Combinatorial Design to Study Regulation by Multiple Input Signals: A Tool for Parsimony in the Post-Genomics Era. Plant Physiology, 127(4): 1590-1594, 2001.

[8] D. Shasha and C. Lazere. Out of Their Minds: The Lives and Discoveries of 15 Great Computer Scientists. Springer-Verlag, New York, 1995.