An Unsupervised Learning Scheme for DNA Microarray Image Spot Detection Luis Rueda∗ and Li Qin†
Abstract: DNA microarrays are novel and powerful techniques, which are used to analyze the expression level of DNA, and have many applications in pharmacology, medical diagnosis, environmental engineering, and biological sciences. The process of separating the background from the foreground is a crucial stage in DNA microarray data analysis, since it substantially affects the subsequent stages. Quite a few image processing techniques have been proposed in this direction, including circle-based methods, seeded region growing, histogram-based segmentation, and clustering-based techniques. Of these, clustering-based segmentation is an emerging topic in microarray image segmentation. We propose an optimized clustering-based microarray image segmentation approach that includes a noise-removal stage. The experiments show that our method performs microarray image segmentation more accurately than the previous clustering-based microarray image segmentation methods, and captures a larger number of true foreground pixels than the seeded region growing method.
Index Terms: cDNA microarrays, Gene expression, Microarray image segmentation, Clustering algorithms.
∗ Member of the IEEE. School of Computer Science, University of Windsor, 401 Sunset Ave., Windsor, ON N9B 3P4, Canada. Phone +1-519-253-3000 Ext. 3780, Fax +1-519-973-7093. Partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada, CFI, the Canada Foundation for Innovation, and OIT, the Ontario Innovation Trust.
† Platform Computing Inc., 3760 14th Avenue, Markham, ON, Canada.
I. INTRODUCTION
Microarrays are an emerging technology in bioinformatics that helps in dealing with a wide range of problems in medicine, health and environment, and drug development. The use of microarrays in measuring gene expression levels under varying conditions provides biologists with a better understanding of gene functions, and has many applications in life sciences [1], [2].
The analysis of DNA microarray gene expression data involves many steps [1]. The initial steps consist of extracting gene expression data from the microarray image, and include spot localization (or gridding) [1], foreground and background separation (image segmentation), and normalization. These stages are quite important, since the accuracy of the resulting data is essential in the subsequent analysis. The next step is gene expression data analysis, for which different statistical and/or unsupervised learning methods can be applied.
We deal with the problem of DNA microarray¹ image segmentation. In general, segmentation of an image refers to the process of partitioning the image into several regions, each having its own properties [3]. In microarray image processing, segmentation refers to the classification of pixels as either foreground or background. There exist other types of pixels, such as noisy pixels, which are contaminated pixels produced during the microarray production and scanning processes, and should be excluded from both the background and the foreground region during segmentation. Depending on the approach used to classify the pixels, another possible type of pixel is the edge pixel surrounding the foreground region. Since the intensities of these pixels fall in between those of the foreground and the background, including or excluding them leads to different signal-to-noise ratios.
¹ Of the recently discovered technologies, cDNA and oligonucleotide arrays [1], our approach focuses on the former, referred to as “DNA microarrays” or “microarrays” hereafter.
The problem can be stated more formally as follows. Given the microarray images corresponding to two channels², green and red, a composite image is first created by means of an arbitrary function³, $f(\cdot, \cdot)$. Let $C = \{c_{ij}\}$ be an $m \times n$ integer-valued matrix that represents the composite microarray image for an experiment. The aim is to classify each pixel, $c_{ij}$, and assign it to one of the pre-determined (or possibly unknown) classes from the set $\{\omega_1, \omega_2, \ldots, \omega_k\}$, where $k \ge 2$. As in general image processing models, the final goal here is to assign each pixel to one of two classes, “foreground” or “background”, in which case we are dealing with the traditional two-class classification problem [4], [5].
To deal with the microarray image segmentation problem, many approaches have been proposed. Fixed circle segmentation is a traditional technique that was first used in ScanAlyze [6]. GenePix [7] and ScanArray Express [8] also provide the option of using the fixed circle method. Another method, proposed to avoid the drawbacks of fixed circle segmentation, is the adaptive circle segmentation technique. An implementation of this approach can be found in GenePix, ScanAlyze, ScanArray Express, Imagene, and Dapple [9].
Since the two above-mentioned methods are limited to circular spots, other techniques that deal with “free-shape” spot segmentation have been introduced, the most important being seeded region growing (SRG) [1], [10]. Another technique that has been successfully used in microarray image segmentation is the histogram-based approach. Chen et al. introduced a method that uses a circular target mask to cover all the foreground pixels, and computes a threshold using the Mann-Whitney test [11]. Clustering algorithms have also been used in microarray image segmentation [12]; they have the advantage that they are not restricted to a particular shape and size for the spots, and can be seen as a generalization of the histogram-based approach.
² We assume that the underlying experiment is conducted on two channels, green and red, corresponding to the Cy3 and Cy5 dyes respectively. However, our model can be generalized to more than two channels, in which case the composite image is obtained by combining the images from the different channels.
³ A typical function used in this context is f (x, y) = √x + y, where x and y are the green and red channel pixel intensities respectively.
Clustering-based algorithms have various advantages over other methods. They do not depend on the shape of the objects in the image, although they can make use of the “ideal” shape of a typical spot, which tends to be (but is not always) a circle. Unlike region growing methods, they do not require the initial state of any pixels to be known. As a matter of fact, clustering-based algorithms do not need any prior knowledge about the labels of the objects. Common edge detection approaches have major drawbacks, such as sensitivity to noisy pixels, the thickness of the detected edges, the disconnectivity of the edges, and usually the need for post-processing to assign each pixel to one part of the image after edge detection. In this sense, clustering-based methods are more flexible in dealing with the aforementioned problems.
In this paper, we propose an optimized multi-feature clustering-based microarray image segmentation approach that includes a noise-removal stage. The experiments show that our method performs microarray image segmentation more accurately than the previous clustering-based microarray image segmentation methods, and captures a larger number of true foreground pixels than the seeded region growing method.
II. CLUSTERING-BASED MICROARRAY SEGMENTATION
Clustering algorithms are unsupervised learning techniques applied to pattern classification. Many algorithms have been proposed in this regard, including k-means, fuzzy k-means, unsupervised maximum likelihood estimation, and hierarchical clustering. We consider the first of these, k-means, in the implementation of our approach.
Consider a dataset $D = \{x_1, x_2, \ldots, x_n\}$, where $x_i = [x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(d)}]^t$ is a d-dimensional vector.
Thus, to represent an unknown sample, we use a generic feature vector, $x = [x^{(1)}, x^{(2)}, \ldots, x^{(d)}]^t$. The aim is to partition $D$ into $k$ nonempty subsets $D_1, D_2, \ldots, D_k$ containing $n_1, n_2, \ldots, n_k$ samples respectively, where $x_i$ is assigned to class $\omega_j$ if $x_i \in D_j$, maximizing the similarity among the samples in the same subset, while minimizing the similarity among samples in different subsets.
k-means is one of the most widely used algorithms for clustering, and its aim is to find $k$ mean vectors $\mu_1, \mu_2, \ldots, \mu_k$, which determine the membership $x_i \in D_j$ based on a pre-defined similarity measure. Many similarity measures have been proposed, including the Euclidean distance, the Minkowski distance, the Manhattan distance, and the correlation distance. In our model, we use the Euclidean distance as the measure of similarity. k-means starts with arbitrary initial values of $\mu_1, \mu_2, \ldots, \mu_k$, and assigns each $x_i$ to the cluster $j$, $j = 1, \ldots, k$, for which $\|x_i - \mu_j\|^2$ is minimum. Once each $x_i$ is assigned to a cluster $j$, the means are re-calculated as follows:
$$\mu_j = \frac{1}{n_j} \sum_{x_i \in D_j} x_i \qquad (1)$$
This process is repeated until no change in the means is observed. Although k-means always converges, it may become stuck in a local optimum rather than reaching the global one. In general, this occurs because k-means is sensitive to the initial values of $\mu_1, \mu_2, \ldots, \mu_k$. Due to the nature of the problem we deal with, the initial means can be set in an intelligent manner. In our model, we initialize them as follows. The samples are classified into two classes by comparing the distance of each sample to the minimum intensity with its distance to the maximum intensity. The sample means of the two classes obtained are used to initialize $\mu_1$ and $\mu_2$. When $k = 3$, the third mean can be obtained by using the median of $D$.
A. Single-feature Clustering
Wu et al. proposed a k-means clustering algorithm for microarray image segmentation [12], which we refer to as single-feature k-means clustering microarray image segmentation (SKMIS). This method attempts to cluster the pixels into two groups, one for the foreground and the other for the background. Thus, in SKMIS, the feature vector that represents the $i$th sample is reduced to a variable in the one-dimensional Euclidean space, which we refer to as $x_i$, and which contains the intensity of the $i$th pixel. The dataset $D$ is composed of all the pixels contained in a single spot region, which is obtained from the gridding process [1].
The first step of SKMIS consists of initializing the class label for each pixel and calculating the mean for each cluster. Let $x_{min}$ and $x_{max}$ be the minimum and maximum values of the intensities in the spot. Then, each $x_i$ is assigned to
• foreground (or $D_1$) if $|x_i - x_{min}| > |x_i - x_{max}|$
• background (or $D_2$) otherwise.
After this step, the mean, $\mu_j$, of each cluster is calculated as in (1). Although it requires initialization and an iterative process, SKMIS is quite efficient in practice.
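The initialization described above can be illustrated with a short sketch. The Python code below is not the authors' Matlab implementation: the function names are hypothetical, and the iteration shown is the plain nearest-mean k-means update of (1) on one-dimensional intensities, not the specific SKMIS relabeling rule. It assumes the spot region contains at least two distinct intensity values, so that neither cluster is empty.

```python
def skmis_initial_split(intensities):
    """Initial labels: True (foreground) if a pixel is closer to the
    maximum intensity than to the minimum intensity, False otherwise."""
    lo, hi = min(intensities), max(intensities)
    return [abs(x - lo) > abs(x - hi) for x in intensities]


def mean(values):
    return sum(values) / len(values)


def two_means_1d(intensities, max_iter=100):
    """Two-cluster k-means on pixel intensities, started from the
    min/max split. Returns (foreground flags, mu_fg, mu_bg).
    Assumes at least two distinct intensity values."""
    fg = skmis_initial_split(intensities)
    for _ in range(max_iter):
        # Re-calculate the two means as in (1).
        mu_fg = mean([x for x, f in zip(intensities, fg) if f])
        mu_bg = mean([x for x, f in zip(intensities, fg) if not f])
        # Re-assign every pixel to the nearest mean.
        new_fg = [abs(x - mu_fg) < abs(x - mu_bg) for x in intensities]
        if new_fg == fg:  # no label changed, so the means are stable
            break
        fg = new_fg
    return fg, mu_fg, mu_bg


if __name__ == "__main__":
    spot = [100, 120, 110, 5000, 5200, 130, 4900, 90]
    flags, mu_fg, mu_bg = two_means_1d(spot)
    print(flags, mu_fg, mu_bg)
```

Because the initial split already separates the brightest pixels from the darkest ones, the iteration typically needs very few passes on well-formed spots.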
After the initialization, the second step of the algorithm is the re-calculation of the means and the adjustment of the label of each pixel $x_i$ by the following criterion. Assign $x_i$ to $D_2$ if
$$\frac{n_1 |x_i - \mu_1|}{n_1 - 1} > \frac{n_2 |x_i - \mu_2|}{n_2 + 1}, \qquad (2)$$
otherwise assign $x_i$ to $D_1$. This step is repeated until no change in the means, $\mu_1$ and $\mu_2$, is observed.
B. Multiple-feature Clustering
Traditional image processing algorithms have been developed based on information about the intensities of the pixels only. One example is the SKMIS method discussed above. As in [13], we found that in the microarray image segmentation problem the position of the pixel, for example, also influences the result of the clustering, and subsequently that of the segmentation. Based on this observation, we propose an optimized k-means microarray image segmentation (OKMIS) method, which performs two main stages: a multiple-feature clustering process, discussed below, followed by a post-processing noise removal procedure, which is discussed in Section III.
To apply the k-means algorithm to a single spot, we take all the pixels contained in the spot region (one spot region at a time), which are obtained from the gridding process, and create a dataset $D = \{x_1, x_2, \ldots, x_n\}$, where $x_i = [x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(d)}]^t$ is a d-dimensional vector that represents the $i$th pixel in the spot region (from left to right, and from top to bottom). In our model, we use two features, i.e. $d = 2$. The features of $x_i$ are obtained as follows:
• $x_i^{(1)}$: It corresponds to the pixel intensity (an integer value).
• $x_i^{(2)}$: It represents the distance from the center of the spot region to the pixel. Thus, assuming that the coordinates of the $i$th pixel are represented by the two-dimensional vector $p_i = [p_{ix}, p_{iy}]^t$, $x_i^{(2)}$ is computed as follows:
$$x_i^{(2)} = \|p_i - c\|, \qquad (3)$$
where $c = [c_x, c_y]^t$ is a two-dimensional vector that represents the “weighted” center of the spot region, and is obtained as:
$$c = \frac{1}{\sum_{i=1}^{n} x_i^{(1)}} \sum_{i=1}^{n} x_i^{(1)} p_i. \qquad (4)$$
Once the features for each pixel in the spot region are obtained, producing a dataset $D$, the k-means algorithm is applied, where the number of clusters is set to $k = 3$. We obtain three classes, two of them corresponding to foreground and background, and the other representing the edge.
Since the standard format used to represent microarray images is 16-bit grayscale, the intensity values, $x_i^{(1)}$, lie in the range 0 to 65,535. In typical microarray images, the intensities of the foreground pixels vary from 10,000 to 20,000. Considering that a typical spot size is 10 to 12 pixels, producing values of $x_i^{(2)}$ in the range $[0, \sqrt{50}]$, it follows that $x_i^{(1)}$ dominates the clustering results. To avoid this situation, we normalize the features before the clustering algorithm is applied.
Note that we use the two features described above in our model. However, many other features can also be used, such as the mean, median, or variance of the intensity of a certain number of surrounding pixels, or other statistical moments. We have conducted experiments with some of these features and noticed that the results are similar to those of the two-feature model. This is due to two main reasons. First, the nature of the problem leads to high intensities concentrated at the center of the spot region, and hence $x_i^{(2)}$ becomes quite significant in deciding on the class. Second, the other features that consider the surrounding pixels are intended to detect noisy pixels, which are eliminated either by the significance of the second feature, $x_i^{(2)}$, or by the noise removal procedure discussed in the next section. Adding such features can also cause overfitting of the data and, in many cases, degrade the accuracy of the classification.
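The multiple-feature clustering stage can be sketched as follows. This Python code is an illustration under stated assumptions, not the authors' Matlab implementation. The function names are hypothetical. Min-max scaling is assumed for the normalization, since the text does not name a specific scheme. The three initial means are seeded from the brightest, darkest, and median-intensity pixels, which is a simplification of the initialization described in Section II.

```python
import math


def spot_features(region):
    """region: 2-D list of pixel intensities (rows of equal length).
    Returns the per-pixel features (intensity, distance to the weighted
    center), in row-major order, together with the center (cx, cy)."""
    pixels = [(x, y, float(v))
              for y, row in enumerate(region) for x, v in enumerate(row)]
    total = sum(v for _, _, v in pixels)  # assumed to be > 0
    # Intensity-weighted center, as in (4).
    cx = sum(v * x for x, _, v in pixels) / total
    cy = sum(v * y for _, y, v in pixels) / total
    # Second feature: Euclidean distance to the center, as in (3).
    feats = [(v, math.hypot(x - cx, y - cy)) for x, y, v in pixels]
    return feats, (cx, cy)


def min_max_normalize(features):
    """Rescale every feature to [0, 1] (assumed normalization scheme)."""
    cols = list(zip(*features))
    lows = [min(c) for c in cols]
    spans = [(max(c) - lo) or 1.0 for c, lo in zip(cols, lows)]
    return [tuple((f - lo) / s for f, lo, s in zip(row, lows, spans))
            for row in features]


def kmeans(points, means, max_iter=100):
    """Plain k-means from the given initial means; returns the labels."""
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(means)),
                          key=lambda j: math.dist(p, means[j]))
                      for p in points]
        if new_labels == labels:  # assignments are stable
            break
        labels = new_labels
        for j in range(len(means)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous mean if a cluster is empty
                means[j] = tuple(sum(c) / len(members)
                                 for c in zip(*members))
    return labels


def segment_spot(region):
    """Cluster one spot region into k = 3 classes.
    Label 0 is seeded from the brightest pixel, label 1 from the
    darkest, and label 2 from the median-intensity pixel."""
    feats, center = spot_features(region)
    norm = min_max_normalize(feats)
    ranked = sorted(norm, key=lambda f: f[0])
    seeds = [ranked[-1], ranked[0], ranked[len(ranked) // 2]]
    return kmeans(norm, seeds), center
```

Without the normalization step the raw intensities, which span thousands of gray levels, would swamp the distance feature, which spans only a few pixels.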
III. NOISE REMOVAL PROCEDURE
After applying a clustering algorithm, the resulting spots may contain true foreground pixels as well as noisy pixels due to the contamination introduced during the microarray experiments. Although different models and parameters can be chosen in order to minimize the number of noisy pixels in the resulting foreground, some post-processing may still be desirable to eliminate even more noisy pixels. The pixels labeled “foreground” contain not only the high-intensity true-foreground pixels, but also high-intensity noisy pixels scattered in the background region. Without post-processing of noisy pixels, the resulting spots may contain noise that affects the accuracy of the subsequent steps in the analysis. While it is hard to eliminate noisy pixels that reside in the foreground region, discarding noisy pixels in the background region is desirable.
Noisy pixel processing can be achieved by applying many different noise removal algorithms. For this purpose, we introduce a technique that we call the largest continuous region (LCR) method. LCR consists of calculating the largest continuous region using the pixels that compose the foreground, assuming that the foreground is larger than any noisy area. Since the spot foreground is expected to be the largest cluster compared to the noisy clusters, by assuming that the foreground is a continuous region, we can easily identify it by finding the area with the largest number of pixels.
The algorithm first marks each continuous region with a different label by invoking a recursive function. After this first step, the algorithm obtains a mask for the spot, in which each label stands for a different continuous region. Then, the algorithm counts the number of pixels for each label of the mask. The foreground is the region with the largest number of pixels. Finally, the algorithm clears the labels of all the regions except that of the foreground pixels. The details of the algorithm can be found in [14].
IV. EXPERIMENTAL RESULTS
In our evaluation, we combine our subjective judgment and an objective measurement to compare SKMIS, OKMIS and SRG. Since SKMIS generates a foreground with a significant number of noisy pixels, Wu et al. applied a further mathematical morphological process, which they called foreground correction, to eliminate the noise [12]. In our experiments, we apply LCR to perform the foreground correction for both SKMIS and OKMIS. These two methods have been implemented in Matlab⁴. We have also run the SRG method that is implemented in the Spot software package⁵.
In order to obtain a more consistent assessment of our segmentation method, and of its comparison with other approaches, we performed some simulations on real-life microarray images obtained from the ApoA1 data⁶. First of all, we compare the resulting binary images of the two clustering methods with the original microarray image. Two spots from the 1230c1G/R image, No. 136 and No. 137, are shown in Figure 1. These two spots contain high-intensity noise artifacts, which make it difficult to separate the true foreground from the background. While SKMIS is not able to eliminate the noise, OKMIS accurately obtains the two spots. SRG is also able to separate the two spots, but at the expense of producing a smaller foreground region.
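The largest continuous region (LCR) correction applied after clustering can be illustrated with a small sketch. The Python code below is not the implementation detailed in [14]: it replaces the recursive labeling of Section III with an iterative breadth-first flood fill, and it assumes 4-connectivity between pixels, which the text does not specify. The function name `largest_connected_region` is hypothetical.

```python
from collections import deque


def largest_connected_region(mask):
    """mask: 2-D list of truthy/falsy values (truthy = labeled foreground).
    Returns a boolean mask of the same shape that keeps only the largest
    4-connected foreground region (assumed connectivity)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Collect one connected region with a breadth-first search.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Keep the region with the largest pixel count seen so far.
            if len(region) > len(best):
                best = region
    cleaned = [[False] * cols for _ in range(rows)]
    for y, x in best:
        cleaned[y][x] = True
    return cleaned


if __name__ == "__main__":
    noisy = [
        [1, 0, 0, 0, 0],
        [0, 0, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1],
    ]
    for row in largest_connected_region(noisy):
        print("".join("#" if v else "." for v in row))
```

An iterative search is used here only to avoid recursion-depth limits on large regions; on spot-sized regions a recursive labeling behaves the same way.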
⁴ The Matlab source code that implements SKMIS and OKMIS is available at http://www.cs.uwindsor.ca/~lrueda/research/okmis.zip.
⁵ An evaluation version of Spot can be electronically obtained from http://www.cmis.csiro.au/iap/Spot/spotmanual.htm.
⁶ http://www.stat.berkeley.edu/users/terry/zarray/Html/apodata.html.
Fig. 1. The result of applying SKMIS, OKMIS and SRG to spots No. 136 and 137, extracted from the 1230c1G/R microarray images.
A few spots from the 1230c1G/R microarray images are shown in Figure 2. We observe that SKMIS produces poor results. In spot No. 10, it only produces noisy pixels. In the other spots it reveals the true foreground, but it is not able to eliminate noisy pixels. OKMIS, instead, reveals the true spot foreground in all the cases, with the exception of spot No. 10, in which noisy pixels are also included as foreground. SRG also has problems with spot No. 10, labeling the noisy pixels as foreground. In addition, SRG produces smaller foreground regions in most of the cases.
Figure 3 shows a comparison of the methods on low-intensity (“weak”) spots. In order to easily visualize the results, the intensities of the original spots have been increased 20 times. We observe that OKMIS results in foreground regions that are larger, cleaner, and closer to the real spot. As can be seen in the figures, OKMIS automatically removes the noisy pixels, and is more effective than SKMIS. SRG, again, is able to eliminate the noisy pixels, capturing most of the foreground spots, but produces smaller foreground regions and discards true foreground pixels.
After visually demonstrating the effectiveness of OKMIS for a few spots drawn from an ApoA1 image, we now provide an objective measurement for a batch of real-life microarray images. For this analysis, we compare the size of the resulting foreground region for the three methods. The results are shown in Table I. The first column for each method contains the total foreground intensity, $I_{fg}$, and the second column represents the number of pixels in the foreground region, $N_{fg}$. In the first two columns, we note that the foreground region generated by SKMIS contains many noisy pixels. Thus, a post-processing method has to be applied in order to eliminate the noise. However, such a method will not eliminate the noisy pixels that are adjacent to the resulting foreground region (e.g. the two spots shown in Figure 1).
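The two quantities reported in Table I can be computed directly from a spot's intensities and its foreground mask. The short Python sketch below is illustrative only; the function name `foreground_stats` and the list-of-lists data layout are assumptions, not part of the original implementation.

```python
def foreground_stats(intensities, mask):
    """intensities, mask: 2-D lists of the same shape; a truthy mask
    value marks a foreground pixel. Returns (I_fg, N_fg): the total
    foreground intensity and the number of foreground pixels."""
    values = [v
              for row_v, row_m in zip(intensities, mask)
              for v, m in zip(row_v, row_m) if m]
    return sum(values), len(values)


if __name__ == "__main__":
    spot = [[10, 200], [300, 20]]
    fg_mask = [[0, 1], [1, 0]]
    print(foreground_stats(spot, fg_mask))
```

Summing these per-spot values over all the spots of a sub-grid gives the per-image entries of the kind listed in the table.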
In the case of OKMIS, we observe that the total foreground intensity is slightly greater than that of SRG, while the total number of “true-foreground” pixels captured by OKMIS is substantially larger than that of SRG. This demonstrates, numerically, that OKMIS is able to both eliminate noisy pixels and capture the true foreground intensity of each spot.
V. CONCLUSIONS
In this paper, we propose an optimized method for microarray image segmentation, OKMIS, which is based on clustering and a post-processing noise removal algorithm, LCR. The clustering algorithm is based on the well-known k-means method, and it uses both intensity-based and spatial information as the features of the classification system. The optimized method also involves the LCR algorithm for removing the noisy pixels produced by the clustering algorithm.
Fig. 2. Comparison of SKMIS, OKMIS and SRG for some typical spots from the 1230c1G/R microarray images. For spots with high intensity noisy pixels, such as No.10, OKMIS can reveal the true spot foreground instead of the noise produced by SKMIS.
As shown in the experiments, our algorithm performs microarray image segmentation more accurately than the previous clustering-based microarray image segmentation method, SKMIS, being able to successfully eliminate the noisy pixels. OKMIS also performs better than its counterpart, SRG, in the sense that the former is able to capture the true foreground pixels while producing larger foreground regions.
Quite a few open problems arise from the proposed method. One of them deals with increasing the number of features, and with the study of different feature extraction and feature correlation analysis techniques. This aspect is quite important, since some features are correlated due to the nature of the problem. For example, a pixel located near the center of the spot is expected to have a larger intensity than another pixel located, say, near the grid line. Although OKMIS handles this by normalizing the features, more advanced feature analysis techniques, such as principal component analysis, deserve investigation. The analysis of other features is also important, for example, the mean or the variance of the intensities of the surrounding pixels. Finally, finding the appropriate number of clusters automatically is a problem that deserves investigation. This would help detect noisy regions such as the one in spot No. 10 of Figure 2.
Fig. 3. Comparison of SKMIS, OKMIS and SRG for some weak intensity spots from the 1230c1G/R microarray image. The intensities of the original spots have been increased 20 times to enhance their visualization.
TABLE I
COMPARISON OF SKMIS, OKMIS, AND SRG ON A BATCH OF IMAGES FROM THE APOA1 DATASET, WHERE THE FIRST SUB-GRID OF EACH IMAGE IS ANALYZED.
Image         SKMIS I_fg   SKMIS N_fg   OKMIS I_fg   OKMIS N_fg   SRG I_fg     SRG N_fg
1230ko1G/R     1,025,300       11,508    1,049,000       11,234    1,007,931     11,616
1230ko2G/R     1,439,400        7,520    1,366,100        9,701    1,369,696      7,586
1230ko3G/R     1,618,600       10,826    1,541,100       10,903    1,527,652      9,871
1230ko4G/R     1,559,100        8,141    1,518,300        9,046    1,529,136      7,870
1230c1G/R      2,060,900        8,620    1,929,100       10,128    1,923,433      9,020
1230c2G/R      2,530,300        8,179    2,362,900        9,510    2,373,413      8,007
1230c3G/R      2,446,300        8,446    2,367,100        9,436    2,289,604      8,387
1230c4G/R      1,836,700        8,685    1,816,800        9,399    1,804,101      8,042
Total         14,516,600       71,925   13,950,400       79,357   13,824,966     70,399
REFERENCES
[1] S. Drăghici, Data Analysis Tools for DNA Microarrays, Chapman & Hall/CRC, 2003.
[2] M. Schena, Microarray Analysis, John Wiley & Sons, 2002.
[3] P. Soille, Morphological Image Analysis: Principles and Applications, Springer, 1999.
[4] R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley and Sons, Inc., New York, NY, 2nd edition, 2000.
[5] A. Webb, Statistical Pattern Recognition, John Wiley & Sons, New York, 2nd edition, 2002.
[6] M. Eisen, ScanAlyze User’s Manual, M. Eisen, 1999.
[7] Axon Instruments, GenePix 4000A: User’s Manual, Axon Instruments Inc., 1999.
[8] GSI Lumonics, QuantArray Analysis Software: Operator’s Manual, 1999.
[9] J. Buhler, T. Ideker, and D. Haynor, “Dapple: Improved Techniques for Finding Spots on DNA Microarrays,” Tech. Rep. UWTR 2000-08-05, University of Washington, 2000.
[10] R. Adams and L. Bischof, “Seeded Region Growing,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 16, no. 6, pp. 641–647, 1994.
[11] Y. Chen, E. Dougherty, and M. Bittner, “Ratio-based Decision and the Quantitative Analysis of cDNA Microarray Images,” Journal of Biomedical Optics, vol. 2, pp. 364–374, 1997.
[12] H. Wu and H. Yan, “Microarray Image Processing Based on Clustering and Morphological Analysis,” in Proc. of the First Asia Pacific Bioinformatics Conference, Adelaide, Australia, 2003, pp. 111–118.
[13] C. Carson, S. Belongie, H. Greenspan, and J. Malik, “Blobworld: Image Segmentation using Expectation Maximization and its Application to Image Querying,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1026–1038, 2002.
[14] L. Qin, “New Machine-learning-based Techniques for DNA Microarray Image Segmentation,” M.S. thesis, School of Computer Science, University of Windsor, Canada, 2004. Electronically available at http://www.cs.uwindsor.ca/~lrueda/papers/LiThesis.pdf.