SIGKDD Demo: Sensors and Software to Allow Computational Entomology, an Emerging Application of Data Mining
Gustavo Batista, Eamonn Keogh
University of California, Riverside, Riverside, CA, 92521
Agenor Mafra Neto
ISCA Technologies, Riverside, CA, 92517
Edgar Rowton
Walter Reed Army Institute of Research, Silver Spring, MD, 20910
[email protected]
{gbatista,eamonn}@cs.ucr.edu
[email protected]
ABSTRACT The history of humankind is intimately connected to insects. Insect-borne diseases kill a million people, and insects destroy tens of billions of dollars worth of crops, annually. However, at the same time, beneficial insects pollinate the majority of crop species, and it has been estimated that approximately one third of all food consumed by humans is directly pollinated by bees alone. Given the importance of insects in human affairs, it is somewhat surprising that computer science has not had a larger impact in entomology. We believe that recent advances in sensor technology are beginning to change this, and that a new field of Computational Entomology will emerge. We will demonstrate an inexpensive sensor that allows us to capture data from flying insects, and the software that allows us to analyze the data. Moreover, we will distribute both the sensors and software for free to parties willing to take part in a crowdsourcing project on insect classification.
Categories and Subject Descriptors H.2.8 [Database Application]: Data mining, Scientific database
General Terms
Algorithms, Measurement, Experimentation, Human Factors
Keywords
Agricultural, Spatiotemporal data, Insects, Classification
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD’11, August 21–24, 2011, San Diego, CA, USA. Copyright 2011 ACM 978-1-4503-0813-7/11/08...$10.00.
1. INTRODUCTION Humankind's history and its destiny are intimately connected to insects. Insect-borne diseases kill well over a million people each year, and sicken tens of millions more [16]. Insects also destroy tens of billions of dollars worth of crops and livestock. However, at the same time, beneficial insects pollinate 75% of crop species, and it has been estimated that 30% of all food consumed by humans is directly pollinated by bees alone [5].
Given the importance of insects to humans, researchers have developed an arsenal of mechanical, chemical, biological, economic and educational tools to help mitigate their harmful effects. Since all such interventions are costly, they depend on knowing the best time and place to intervene. As a group of medical entomologists recently noted just in the context of mosquitoes: “Accurately georeferenced data are crucial for understanding mosquito biogeography, ecology, and the impact of environmental changes, as well as for species distribution modeling, planning mosquito surveys, and for determining disease risk” [8]. Such data are currently obtained by having humans inspect mechanical or “sticky” traps. As such, the data are typically sparse, error-prone and out of date.
In this demonstration we show that with simple low-cost sensors we can get accurate counts of flying insects in real time. We believe that these sensors will open up a new area of research, computational entomology. We argue that computational entomology is a branch of data mining (rather than, say, machine learning or biostatistics), because it draws from such traditional data mining techniques as classification, regression, clustering, etc. Furthermore, the classic data mining problems of scalability and dealing with distributed and uncertain data are unavoidable in any task in this domain.
2. OUR INSECT WINGBEAT SENSOR
Incredibly, the idea of classification of insects by sound dates back to the very dawn of computers and commercially available audio recording equipment. In 1945, three researchers at the Cornell University Medical College, Kahn, Celestin and Offenhauser, used equipment donated by Oliver E. Buckley (then President of the Bell Telephone Laboratories) to record and analyze mosquito sounds [12].
The authors later wrote “It is the authors' considered opinion that the intensive application of such apparatus will make possible the precise, rapid, and simple observation of natural phenomena related to the sounds of disease-carrying mosquitoes and should lead to the more effective control of such mosquitoes and of the diseases that they transmit.” [13]. In retrospect it seems astonishing that more progress on this problem has not been made in the intervening years.
The logical design of the sensor is shown in Figure 1. It consists of a low-powered laser source and a phototransistor mounted side by side. The latter is connected to an electronic board. The laser is pointed at a total internal reflector which returns the slightly scattered light back to the source, with some of it hitting the phototransistor.
Despite concerted efforts to control their vectors, most of these mosquito-transmitted diseases are spreading [16].
The first line of defense against vector-borne diseases is vector control. Vector control is an umbrella term for any of the dozens of mechanical, chemical, and biological methods used to limit or (locally) eradicate a pest. Vector control requires constant surveillance of vector populations. Information about a vector’s spatial and temporal distribution is essential in establishing preventive measures. While vector control is an extremely challenging task, control efforts have shown their value in the eradication of other populations of insects of medical importance. Examples of such successful eradications of a harmful insect species from a large region include those of the New World screwworm, Mediterranean fruit flies, and tsetse flies [1].
[Figure 1 diagram labels: Laser, Phototransistor, Circuit Board. Signal plot annotations: “Insect Detected”, “Insect detection threshold”; horizontal axis ticks 0, 500, 1000.]
Figure 1: The general logical design of the insect sensor we are developing. It can count insects by recording high-amplitude “bleeps” in the signal; classification takes additional processing. See also Figure 2.
When a flying insect crosses the laser beam, its wings partially occlude the light, causing small light fluctuations captured by the phototransistor. This signal is filtered and amplified by a custom designed board, and the output signal is recorded as audio data.
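The counting step illustrated in Figure 1, where an insect is registered whenever the signal rises above a detection threshold, can be sketched roughly as follows. This is a minimal illustration rather than the authors' software: the function name `count_insect_events`, the sample rate, the threshold value, the `min_gap` merging rule and the synthetic signal are all assumptions introduced here.

```python
import numpy as np

def count_insect_events(signal, threshold, min_gap):
    """Count bursts whose absolute amplitude exceeds `threshold`.

    Above-threshold samples separated by at most `min_gap` samples are
    treated as one event (hypothetical rule, not from the paper).
    """
    above = np.abs(np.asarray(signal, dtype=float)) > threshold
    idx = np.flatnonzero(above)
    if idx.size == 0:
        return 0
    # A new event starts wherever the gap to the previous
    # above-threshold sample is larger than min_gap.
    gaps = np.diff(idx)
    return int(1 + np.count_nonzero(gaps > min_gap))

# Synthetic one-second recording (assumed 8 kHz sample rate):
# low-level background noise plus two short 200 Hz bursts.
rate = 8000
t = np.arange(rate) / rate
rng = np.random.default_rng(0)
signal = 0.01 * rng.standard_normal(rate)
for start in (0.2, 0.6):
    burst = (t >= start) & (t < start + 0.05)
    signal[burst] += 0.2 * np.sin(2 * np.pi * 200 * t[burst])

n_events = count_insect_events(signal, threshold=0.1,
                               min_gap=int(0.02 * rate))
print(n_events)
```

Merging crossings that fall within `min_gap` samples of each other keeps a single pass through the beam, which oscillates at the wingbeat frequency, from being counted several times.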
A complete discussion of possible interventions is beyond the scope of this paper [16]. However, current techniques to mitigate mosquitoes' deadly effects are myriad, and include:
In Figure 2 we show a physical version of the device. Note that the sensor is built into standard PVC pipe fittings. This allows us to add cheap PVC pipe “portals” to the hundreds of insectaries found in a typical entomological laboratory, and quickly move the sensor between them.
The use of insecticide-treated mosquito nets, spraying of insecticides (including controversial chemicals such as DDT); the introduction of fish/turtles that eat mosquito larvae; the introduction of dragonflies, which eat adult mosquitoes; attracting, then trapping mosquitoes; habitat reduction by draining ponds and pools; use of chemical films to reduce the surface tension of water (drowning the pupae).
The key to vector control is the development of predictors [10][15]. Establishing early and accurate predictors for mosquito population outbreaks and the associated risk of epidemics ideally requires compilation of large data sets from individual agencies, encompassing data on mosquito populations, landscape ecology, vertebrate hosts and hospital admissions, as well as meteorological data including temperature, humidity and wind speed.
Note that most predictors are surrogates¹ for the target insect density. For example, meteorological data indirectly tells us of the relative abundance of mosquito breeding grounds, and hospital admission data indirectly (and belatedly) tells us of insect density.
[Figure 2 photograph labels: Total Internal Reflector, Phototransistor, Laser, Amplification and filtering circuit board. Panels: Front View, Rear View.]
Figure 2: Two views of our insect counting device. The sensor is mounted in standard PVC pipe fittings to allow “plug and play” between different insectaries.
Our work is motivated by the observation that in this domain surrogate variables are always doomed to be inaccurate and/or tardy. If we want to measure insect density, then we should measure insect density. As we shall show in the next section, plunging prices for sensors, combined with data mining classification techniques, now make this possible.
We omit a detailed discussion of the circuitry and logical design of our sensor, other than to say that we are making the circuit design freely available at [14]. The entire system can be made for less than $100, but in bulk, the price can be reduced to less than $10 each.
3. MOTIVATION FOR INSECT CLASSIFICATION There are at least 3,528 species of mosquitoes [11]. The majority of these are harmless to humans, but a few dozen species transmit disease agents that cause illness and death among humans while depressing economic well-being over large swaths of the globe. Anopheline mosquitoes are vectors (living carriers of an infectious agent) of human malaria. Aedine mosquitoes transmit arboviruses such as yellow fever, dengue, and dengue hemorrhagic fever. Culicine mosquitoes are the primary vectors of various diseases, for example human filariasis, western equine encephalomyelitis, St. Louis encephalitis, and West Nile encephalitis. All these diseases have a serious impact on humans, killing one million people, mostly African children, each year.
¹ A surrogate is a variable that can be measured (or is easy to measure) and that is used in place of one that cannot be measured (or is difficult to measure).
4. CLASSIFICATION USING WINGBEAT SENSOR DATA
In Figure 1 we showed how a simple threshold on the output of our sensor can detect and count insects. However, for the task at hand we need to be able to classify the insect species, and in some cases even the insect's sex (since only female mosquitoes feed on human blood). Recall that when deployed in the field, more than 99% of the insects counted may be innocuous or even beneficial. As the reader will have anticipated, we can also use our counting sensor to classify insects.
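Classification starts from the wingbeat frequency carried by the recorded audio. As a rough illustration (not the authors' implementation), the sketch below estimates the dominant frequency of a one-second clip from its single-sided amplitude spectrum. The function name `dominant_frequency`, the 8 kHz sample rate, the 100 to 1000Hz search band, the Hann window and the synthetic test tone with a mains-like 60Hz component are all assumptions made for this example.

```python
import numpy as np

def dominant_frequency(signal, sample_rate, f_min=100.0, f_max=1000.0):
    """Return the frequency (Hz) of the largest spectral peak in [f_min, f_max].

    The search band is a hypothetical choice that excludes low-frequency
    interference such as mains hum.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                      # remove any DC offset
    windowed = x * np.hanning(x.size)     # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(x.size, d=1.0 / sample_rate)
    band = (freqs >= f_min) & (freqs <= f_max)
    return float(freqs[band][np.argmax(spectrum[band])])

# Synthetic one-second clip: a 197 Hz fundamental, one weaker harmonic
# at twice that frequency, and 60 Hz interference.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
synthetic = (np.sin(2 * np.pi * 197 * t)
             + 0.4 * np.sin(2 * np.pi * 394 * t)
             + 0.3 * np.sin(2 * np.pi * 60 * t))

estimate = dominant_frequency(synthetic, sample_rate)
print(estimate)
```

Restricting the search to a band is one simple way to keep interference outside the expected wingbeat range from being mistaken for the fundamental; a deployed system would likely need a more careful peak-picking rule.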
An example of the signal generated by the sensor is presented in Figure 3.top. This signal was collected from a bee of the species Bombus impatiens (the Common Eastern Bumble Bee) in an insectary. A further analysis of the high-amplitude section shows the signal has a peak at the fundamental frequency of 197Hz, and some harmonics at integer multiples of the fundamental frequency, as shown in Figure 3.bottom. The first harmonic represents the frequency of interest, i.e., the insect wing-beat frequency.
Whenever possible we do a sanity check to make sure the values returned by our analysis are reasonable. A recent study which measured the wingbeat frequency of this species with a true acoustic device found the frequency had a mean of 181Hz and a min/max of 155/205Hz respectively [3].
[Figure 3 plots. Top panel title: “One second of audio from the laser sensor. Only Bombus impatiens (Common Eastern Bumble Bee) is in the insectary.” Annotations: “Background noise”, “Bee begins to cross laser”. Bottom panel title: “Single-Sided Amplitude Spectrum of Y(t)”, |Y(f)| versus Frequency (Hz). Annotations: “Peak at 197Hz”, “Harmonics”, “60Hz interference”.]
Figure 3: top) A one second audio clip recorded by our sensor in an insectary containing only Bombus impatiens (Common Eastern Bumble Bee). bottom) An examination of the high-amplitude section reveals the fundamental frequency at 197Hz, the wingbeat frequency of the bee.
As shown in Figure 3.top, the raw digital audio signal can be represented by a sequence $S = (s_1, s_2, \ldots, s_N)$, where $s_i$ is the signal sampled at instant $i$ and $N$ is the total number of samples of the signal. This sequence contains a lot of acoustic information, and features can be extracted from it, including (but not necessarily exclusively) wingbeat frequency. So a feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_m)$ can be generated, where each feature $x_j$ is extracted from $S$ (or some part of it) by an appropriate extraction procedure, defined as a function $X_j : S \rightarrow \mathcal{X}_j$, where $\mathcal{X}_j$ is the feature domain.
From a set of insect species $B$ we must select the most likely class given the data we have observed. More formally, the intuition behind Bayesian classification is to find the most likely species $b \in B$, given the feature vector $\mathbf{x}$, that is:
$$\operatorname*{argmax}_{b \in B} P(b \mid \mathbf{x})$$
where $P(b \mid \mathbf{x})$ is the a posteriori probability that the signal belongs to an insect of the species $b$ given the features represented by the feature vector $\mathbf{x}$. Using Bayes rule the equation can be expressed as:
$$\operatorname*{argmax}_{b \in B} \frac{P(\mathbf{x} \mid b)\,P(b)}{P(\mathbf{x})}$$
where $P(\mathbf{x} \mid b)$ is the probability of observing the feature vector $\mathbf{x}$ in class $b$, $P(b)$ is the a priori probability of the species $b$, which can be estimated from frequencies in the database, and $P(\mathbf{x})$ is the probability of occurrence of the feature vector $\mathbf{x}$. This probability is usually unknown, but if the classifier computes the likelihoods $P(\mathbf{x} \mid b)$ of the entire set of species, then $\sum_{b \in B} P(b \mid \mathbf{x}) = 1$ and we can obtain the desired probabilities for each $b \in B$ with:
$$P(b \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid b)\,P(b)}{N_f}$$
where $N_f = \sum_{b \in B} P(\mathbf{x} \mid b)\,P(b)$, and can be understood as a normalization factor.
[Figure 4 plot: Gaussian curves over Wing Beat Frequency (Hz).]
Figure 4: Gaussian curves representing the mean and standard deviation of the wingbeat frequencies of seven species of insects (females only). From left to right, Lucidota atra, Chauliognathus marginatus, Oulema melanopus, Drosophila melanogaster, Culex quinquefasciatus, Anopheles stephensi, and Aedes aegypti.
Figure 4 shows a wingbeat frequency distribution plot for seven species of insects. We can use these data to predict the insect species, given the wingbeat frequency. For example, suppose we observe an unknown insect with a wingbeat frequency of 500Hz.
We can calculate the insect's class-conditioned probability of being Anopheles stephensi using a Gaussian distribution function, given that we know the mean and standard deviation of the Anopheles stephensi wingbeat frequency are 475Hz and 30Hz respectively:
$$P(500 \mid \textit{Anopheles}) = \frac{1}{\sqrt{2\pi}\,30}\, e^{-\frac{(500 - 475)^2}{2 \cdot 30^2}}$$
We can calculate the probabilities for the other classes in a similar manner, and predict our unknown insect as the most likely class. In this example, the unknown insect is about 3.3 times more likely to be an Anopheles stephensi than an Aedes aegypti (the second most likely class).
Figure 5 is a plot of the wing-beat frequency histogram for three species captured with our sensor.
[Figure 5 plot: wingbeat frequency histograms for three species.]
Figure 5: Histogram of one-second fragments for three species: Bombus impatiens, Culex quinquefasciatus and Aedes aegypti
From Figure 5 we can observe that the distribution for each class resembles a Gaussian distribution with a long left tail. The bumblebees' wingbeat frequencies are linearly separable from the mosquitoes' frequencies, an expected result since the bumblebees are considerably larger than the mosquitoes, and larger insects usually present lower wingbeat frequencies. Considering the wingbeat frequencies of the mosquitoes alone, Aedes aegypti has higher frequency wingbeats than Culex quinquefasciatus; however, there is some overlap between the wingbeat frequency distributions of these two species.
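The Gaussian calculation for the 500Hz observation, together with the normalization factor Nf, can be sketched in a few lines. This is only an illustration under stated assumptions: the mean (475Hz) and standard deviation (30Hz) for Anopheles stephensi come from the text, while the second class, its parameters, the equal priors and the function names `gaussian_pdf` and `posteriors` are hypothetical placeholders, since the fitted values for the other species are not given here.

```python
import math

def gaussian_pdf(x, mean, std):
    """Gaussian density, used as the class-conditional likelihood P(x | b)."""
    return (math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))
            / (math.sqrt(2.0 * math.pi) * std))

def posteriors(x, classes):
    """Return P(b | x) for every class b.

    `classes` maps a class name to (mean, std, prior). The sum of
    P(x | b) * P(b) over all classes plays the role of Nf.
    """
    joint = {b: gaussian_pdf(x, m, s) * p for b, (m, s, p) in classes.items()}
    n_f = sum(joint.values())
    return {b: v / n_f for b, v in joint.items()}

classes = {
    # Mean and standard deviation from the text; the prior is assumed.
    "Anopheles stephensi": (475.0, 30.0, 0.5),
    # Entirely hypothetical second class, for illustration only.
    "hypothetical species": (600.0, 40.0, 0.5),
}

likelihood = gaussian_pdf(500.0, 475.0, 30.0)   # P(500 | Anopheles)
post = posteriors(500.0, classes)
best = max(post, key=post.get)
print(likelihood, best)
```

Dividing by the summed joint probabilities means the classifier never needs the unknown probability of the feature vector itself; only the per-class likelihoods and priors are required.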
We can use a very simple Bayesian classifier, as discussed above, to categorize the examples into the three classes (species). We assume each class distribution is governed by a Gaussian distribution, with its own mean and standard deviation, as well as class prior probabilities, P(b). We applied this simple Bayesian classifier to the data and obtained a respectable 96.04% accuracy. The results for each class are summarized in Table 1.
Table 1: Performance summary for each class
                           Actual
Predicted          Bombus     Culex      Aedes
Bombus             499        0          0
Culex              0          4139       124
Aedes              0          113        1107
Class accuracy     100.00%    97.34%     89.92%
Note that the histogram for Culex quinquefasciatus in Figure 5 does not appear to be well modeled by a single Gaussian. This is not unexpected; Culex quinquefasciatus, like many mosquitoes, is highly sexually dimorphic: the females are larger than the males, and thus have lower wingbeat frequencies. In fact, the data in Figure 5 is suggestive of one Gaussian bump at about 400Hz and another at 550Hz. Ideally we would record training data for each sex separately; however, sexing mosquitoes, either as larvae or adults, is very difficult [7]. We plan to investigate techniques to mitigate this by learning a mixture model for the insects [6].
5. COMPUTATIONAL ENTOMOLOGY IS A DATA MINING PROBLEM
We end with a brief discussion as to why the insect classification problem addressed here, and more general computational entomology problems, should be considered a sub-field of data mining.
We begin by noting the scale of the data considered. A single experiment we did on one species produced more than 200,000 data events. We need to reproduce this experiment at several different times of the year, and at several different latitudes, for at least one hundred different insect species. We conservatively estimate that we will produce a billion data events in the next year alone. Furthermore, our insect classification problem requires contributions from virtually every data mining technique. For example, we use outlier detection to throw out spurious events in our audio data; we can use regression to predict how the wingbeat frequency of an insect will generalize to new temperatures and altitudes; and, in the (current) absence of circadian rhythms for all target insects, we are using transfer learning to fill in missing data [4] by transferring information from an insect in the same genus, etc.
Finally, when we deploy our sensors in the field, the insect density data we obtain can be used as inputs into several higher-level data mining algorithms, including spatiotemporal analyses [1] and association (rule) discovery. For example, while it has long been known that there is an association between precipitation and mosquito density, the exact nature of this association is still not understood [9]. We have long been able to get minute by minute precipitation data for most of the world, but similarly fine-grained data on insect density has not been available until now.
6. ACKNOWLEDGMENTS This work was funded by the Bill and Melinda Gates Foundation, NSF awards 0803410 and 0808770 and FAPESP award 2009/06349-0. The first author is on leave from the Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Brazil.
The opinions or assertions contained herein are the private views of the authors, and are not to be construed as official, or as reflecting true views of the Department of the Army or the Department of Defense.
7. REFERENCES
[1] G. Andrienko, N. Andrienko, J. Dykes, M. Kraak, and H. Schumann. GeoVA(t) - Geospatial Visual Analytics: Focus on Time. International Journal of Geographical Information Science, 24(10): 1453-1457, 2010.
[2] M. Q. Benedict and A. S. Robinson. The first releases of transgenic mosquitoes: an argument for the sterile insect technique. Trends in Parasitology, 19: 349-356, 2003.
[3] R. Buchwald and R. Dudley. Limits to vertical force and power production in bumblebees (Hymenoptera: Bombus impatiens). The Journal of Experimental Biology, 213: 426-432, 2010.
[4] B. Cao, S. J. Pan, Y. Zhang, D. Yeung, and Q. Yang. Adaptive Transfer Learning. AAAI 2010 Conference, 407-412, 2010.
[5] K. W. Dixon. Pollination and Restoration. Science, 825, 5940: 571-573, 2009.
[6] M. A. T. Figueiredo and A. K. Jain. Unsupervised Learning of Finite Mixture Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24: 381-396, 2002.
[7] D. A. Focks. An improved separator for the developmental stages, sexes, and species of mosquitoes (Diptera: Culicidae). Journal of Medical Entomology, 17: 567-568, 1980.
[8] D. H. Foley, R. C. Wilkerson, and R. E. Rueda. Importance of the “What”, “When” and “Where” of Mosquito Collection Events. Journal of Medical Entomology, 46(4): 717-722, 2009.
[9] H. Gong, A. T. Degaetano, and L. C. Harrington. Climate-based models for west nile culex mosquito vectors in the northeastern US. International Journal of Biometeorology, 2010.
[10] M. Grüebler, M. Morand, and B. Naef-Daenzer. A predictive model of the density of airborne insects in agricultural environments. Agric. Ecosyst. Environ., 123(1-3): 75-80, 2008.
[11] R. Harbach. Mosquito Taxonomic Inventory. mosquito-taxonomic-inventory.info/valid-species-list, 2011.
[12] M. C. Kahn, W. Celestin, and W. Offenhauser. Recording of sounds produced by certain disease-carrying mosquitoes. Science, 101: 335-336, 1945.
[13] M. C. Kahn and W. Offenhauser. The identification of certain West African mosquitoes by sound. American Journal of Tropical Medicine, 29(5): 827-836, 1949.
[14] E. Keogh. Computational Entomology website. Online at www.cs.ucr.edu/~eamonn/CE, 2011.
[15] J. L. Patnaik, L. Juliusson, and R. L. Vogt. Environmental predictors of human West Nile virus infections, Colorado. Emerg. Infect Dis., 13: 1788-1790, 2007.
[16] WHO. World malaria report 2010. Geneva, World Health Organization. www.who.int/malaria/publications/en/, 2010.
[17] I. P. Woiwod and C. D. Thomas. Insect Movement: Mechanisms and Consequences. Royal Entomological Society. CAB International, 2001.