IADIS International Conference on Applied Computing 2005
AUTOMATIC ACTOMYOSIN COMPLEX SELECTION USING SVM
Jianfei Yang, Takeshi Ohashi, Takuo Yasunaga
Department of Computer Sciences and Systems Engineering, Kyushu Institute of Technology
680-4 Kawazu, Iizuka-shi, Fukuoka, 820-8502, Japan
ABSTRACT This paper describes the automatic selection of actomyosin complex particles. We propose a new approach that combines the gray level co-occurrence matrix (GLCM), used to extract texture features, with a support vector machine (SVM) classifier to detect actomyosin complex particles automatically. Experimental results show a detection rate of 93.58%, a false positive rate of 3.66%, and an area under the ROC curve (AUC) of 0.9645.
KEYWORDS
Gray level co-occurrence matrix, support vector machine, actomyosin complex
1. INTRODUCTION Myosin exists in muscle and non-muscle tissue. About 50% of the protein in a muscle cell is myosin, and about 30% of myosin is bound to actin (Molecular motors, 2002). Myosin is the best-studied molecular motor. To understand how myosin produces force, it is necessary to visualize its structure. Information on myosin bound to actin can be obtained using Cryo-EM: by comparing 3D reconstructions, investigators can observe large-scale conformational changes in actomyosin. This information contributes to understanding force production and its function. EOS (Extensible and Object-oriented System) is a collection of small tools that includes tools for three-dimensional reconstruction of macromolecules (Takuo Y. et al, 1996). Single particle analysis has been widely used for 3D reconstruction of large molecular complexes from Cryo-EM images. Owing to the low signal-to-noise ratio of Cryo-EM images, hundreds of thousands or even millions of high-resolution particles are required, which makes manual particle picking impractical. The main current methods include template matching, edge detection, intensity comparison, texture analysis, neural networks, and feature-based approaches. Several machine learning algorithms have been used to classify protein particle datasets, including k-nearest neighbor, decision trees, Fisher linear discriminant analysis, Bayesian networks, neural networks, and SVM. Particle selection is critical in particle analysis and has become a bottleneck in high-resolution structure determination of macromolecules using Cryo-EM. It remains an unresolved and challenging problem that demands the development of fast and accurate detection algorithms (Yuanxin Z. et al, 2004). For example, Yongyi Yang et al. proposed an SVM approach for the detection of microcalcifications (Yongyi Yang, et al, 2002); Zeyun Yun et al. proposed feature extraction from the edge map (Zeyun Y, et al, 2004); Roseman, A.M. proposed particle finding using a fast local correlation algorithm (Roseman A. et al, 2003); and Zhu, Y. et al. proposed fast detection of generic biological particles (Zhu Y. et al, 2002). These algorithms can achieve detection rates over 90% (Furey T., et al, 2000 and XU., et al, 2003), with false positive rates ranging from 15% to 30%; the lowest reported false positive rate is 4.5% (Yuanxin Z. et al, 2004). Our approach achieves a detection rate of 93.58% with a false positive rate of 3.66%. SVM can handle a large feature space, can effectively avoid overfitting by controlling the margin, and can automatically identify a small subset of informative examples (the support vectors). SVM is increasingly used for solving biological problems, including detection in CT colonography (Anna K. et al, 2003), protein analysis (Y.F.Sun, et al, 2003), and genetic
ISBN: 972-99353-6-X © 2005 IADIS
expression data analysis (Danh V. et al 2001). Because the shape of the actomyosin complex is complex, extracting its features is very difficult. We propose using the gray level co-occurrence matrix (GLCM) to extract texture features and a support vector machine (SVM) classifier to classify actomyosin complexes automatically. This approach makes feature extraction and classification simple and rapid, and the SVM classifier trains faster than an artificial neural network or other methods. The paper is organized as follows. Section 2 describes texture feature extraction, gives a brief introduction to the SVM classifier, and presents the design of the detection system. Section 3 presents the experimental results. Finally, Section 4 draws conclusions.
2. METHODS 2.1 Gray level co-occurrence matrix A gray level co-occurrence matrix (GLCM) based on the pixel intensities of the eight nearest neighbors is computed for each region of interest (ROI). If the matrix is symmetric, it only needs to be computed over four nearest neighbors, which speeds up computation. The co-occurrence matrix holds the relative frequencies P(i, j) with which two neighboring pixels, one with gray level i and the other with gray level j, occur in a digital image separated by a given distance (here one pixel, i.e. d = 1). It is a function of both the distance d between neighboring pixels and the angular relationship θ (θ = 0º, 45º, 90º, 135º) between them (R. M. Haralick, 1979). A symmetric co-occurrence matrix can be computed using the formula (Matjaz B. et al, 2002):
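$$C^{G}_{\vec{d}}(i,j) = C^{G}_{+\vec{d}}(i,j) + C^{G}_{-\vec{d}}(i,j) \qquad (1)$$
The average $C^{G}_{d}(i,j)$ at a given gray-level pair (i, j) is computed as:
$$C^{G}_{d}(i,j) = \mathrm{avg}\{\, C^{G}_{\vec{d}_k}(i,j) \;;\; \vec{d}_k \in \Delta \,\} \qquad (2)$$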
The set of displacement vectors is $\Delta = \{\vec{d}_1, \vec{d}_2, \ldots, \vec{d}_n\}$; here $\Delta = \{(1,0), (1,1), (0,1), (-1,1)\}$. As an example, consider a digital image quantized to 4 gray levels, shown in Figure 1. Since there are only four gray levels, the GLCM P(i, j) is a 4×4 matrix. The system considers the 0º, 45º, 90º and 135º directions (0º and 90º are the horizontal and vertical directions; 45º and 135º are the two diagonal directions) at a distance of one pixel (d = 1). The resulting co-occurrence matrices for the four directions are shown in Figure 2.
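$$\begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 2 & 2 & 2 \\ 2 & 2 & 3 & 3 \end{bmatrix}$$
Figure 1. Digital image

$$P_{0^\circ} = \begin{bmatrix} 4 & 2 & 1 & 0 \\ 2 & 4 & 0 & 0 \\ 1 & 0 & 6 & 1 \\ 0 & 0 & 1 & 2 \end{bmatrix}
\qquad
P_{45^\circ} = \begin{bmatrix} 4 & 1 & 0 & 0 \\ 1 & 2 & 2 & 0 \\ 0 & 2 & 4 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$
$$P_{90^\circ} = \begin{bmatrix} 6 & 0 & 2 & 0 \\ 0 & 4 & 2 & 0 \\ 2 & 2 & 2 & 2 \\ 0 & 0 & 2 & 0 \end{bmatrix}
\qquad
P_{135^\circ} = \begin{bmatrix} 2 & 1 & 3 & 0 \\ 1 & 2 & 1 & 0 \\ 3 & 1 & 0 & 2 \\ 0 & 0 & 2 & 0 \end{bmatrix}$$
Figure 2. Co-occurrence matrices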
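The symmetric co-occurrence counts of Eq. (1) and the directional average of Eq. (2) can be sketched in a few lines of plain Python. This is only an illustration on the 4×4 image of Figure 1; the function names and the row/column offset convention (rows grow downward, so "up" is a row step of -1) are assumptions made for this sketch, not part of the described system.

```python
def symmetric_glcm(image, offset, levels):
    """Count gray-level pairs at displacement +offset and -offset (Eq. 1)."""
    dr, dc = offset  # (row step, column step)
    rows, cols = len(image), len(image[0])
    glcm = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                glcm[i][j] += 1  # pair seen along +d
                glcm[j][i] += 1  # same pair seen along -d
    return glcm


def average_glcm(glcms):
    """Element-wise average over a set of displacement vectors (Eq. 2)."""
    n = len(glcms[0])
    return [[sum(g[i][j] for g in glcms) / len(glcms) for j in range(n)]
            for i in range(n)]


# The 4-level digital image of Figure 1.
IMAGE = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

# Assumed mapping from angle to (row step, column step) at distance d = 1.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

GLCMS = {angle: symmetric_glcm(IMAGE, off, 4) for angle, off in OFFSETS.items()}
AVERAGE = average_glcm(list(GLCMS.values()))

for angle in sorted(GLCMS):
    print(angle, GLCMS[angle])
```

Because each pair is counted in both directions, every matrix is symmetric, which is why only four of the eight neighbor directions need to be visited.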
2.1.1 Co-occurrence matrix normalization The normalized GLCM $C_d(i,j)$ is defined by:
$$C_d(i,j) = \frac{P(i,j)}{\sum_{i=0}^{N_g-1}\sum_{j=0}^{N_g-1} P(i,j)} \qquad (3)$$
This formula normalizes the co-occurrence values to lie between 0 and 1. Here P(i, j) is the number of occurrences of the gray-level pair (i, j) within the given window for a given (d, θ), and Ng is the number of quantized gray levels. The sum in the denominator is therefore the total number of gray-level pairs counted within the window.
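As a small worked instance of Eq. (3), the sketch below normalizes the horizontal (0º) co-occurrence matrix of Figure 2. The function and variable names are illustrative assumptions; the matrix holds 24 pair counts in total, so each entry is divided by 24.

```python
def normalize_glcm(glcm):
    """Divide every count by the total number of counted pairs (Eq. 3)."""
    total = sum(sum(row) for row in glcm)
    if total == 0:
        raise ValueError("co-occurrence matrix contains no pairs")
    return [[count / total for count in row] for row in glcm]


# Horizontal (0 degree) co-occurrence counts taken from Figure 2.
P_0DEG = [
    [4, 2, 1, 0],
    [2, 4, 0, 0],
    [1, 0, 6, 1],
    [0, 0, 1, 2],
]

C_0DEG = normalize_glcm(P_0DEG)
print(C_0DEG[2][2])  # 6 of the 24 counted pairs are (2, 2)
```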
2.1.2 Texture features We calculated 13 texture features in this research. The original images are quantized to 256 gray levels before the texture features are calculated. Six key texture features are described in Tables 1 and 2.
Table 1. GLCM character
Statistic Character         Description
Texture entropy             Gives a measure of the complexity of the image.
Contrast                    Measures the local variation in the GLCM.
Correlation                 Measures the joint probability occurrence of the specified pixel pairs.
Homogeneity                 Measures the closeness of the distribution of elements in the GLCM to the GLCM diagonal.
Energy                      Provides the sum of squared elements in the GLCM.
Inverse different moment    Measures image homogeneity.
Table 2. Feature formula
Order   Name                            Texture Feature Formula
f1      Texture entropy                 $f_1 = -\sum_{i=0}^{N_g-1}\sum_{j=0}^{N_g-1} c_d(i,j)\,\ln c_d(i,j)$
f2      Texture contrast                $f_2 = \sum_{i=0}^{N_g-1}\sum_{j=0}^{N_g-1} (i-j)^2\, c_d(i,j)$
f3      Texture correlation             $f_3 = \dfrac{\sum_{i=0}^{N_g-1}\sum_{j=0}^{N_g-1} (i-u_x)(j-u_y)\, c_d(i,j)}{\sigma_x \sigma_y}$
f4      Texture homogeneity             $f_4 = \sum_{i=0}^{N_g-1}\sum_{j=0}^{N_g-1} c_d^2(i,j)$
f5      Uniformity of texture energy    $f_5 = \sum_{i=0}^{N_g-1}\sum_{j=0}^{N_g-1} c_d^2(i,j)$
f6      Inverse different moment        $f_6 = \sum_{i=0}^{N_g-1}\sum_{\substack{j=0 \\ i \neq j}}^{N_g-1} \dfrac{c_d(i,j)}{(i-j)^2}$
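The formulas of Table 2 can be evaluated directly on a normalized GLCM. The sketch below is a minimal illustration and rests on some assumptions: $u_x$, $u_y$, $\sigma_x$ and $\sigma_y$ are taken to be the means and standard deviations of the row and column marginals (the usual Haralick convention, not spelled out in the text), the correlation is set to 0 when a marginal has zero variance, and f4 is omitted because its formula in Table 2 coincides with f5. The input is the horizontal matrix of Figure 2.

```python
import math


def texture_features(c):
    """Evaluate f1, f2, f3, f5 and f6 of Table 2 on a normalized GLCM c."""
    n = len(c)
    cells = [(i, j, c[i][j]) for i in range(n) for j in range(n)]

    entropy = -sum(p * math.log(p) for _, _, p in cells if p > 0)
    contrast = sum((i - j) ** 2 * p for i, j, p in cells)

    # Assumed definitions: marginal means and standard deviations.
    ux = sum(i * p for i, _, p in cells)
    uy = sum(j * p for _, j, p in cells)
    sx = math.sqrt(sum((i - ux) ** 2 * p for i, _, p in cells))
    sy = math.sqrt(sum((j - uy) ** 2 * p for _, j, p in cells))
    covariance = sum((i - ux) * (j - uy) * p for i, j, p in cells)
    correlation = covariance / (sx * sy) if sx > 0 and sy > 0 else 0.0

    energy = sum(p * p for _, _, p in cells)
    idm = sum(p / (i - j) ** 2 for i, j, p in cells if i != j)

    return {
        "entropy": entropy,
        "contrast": contrast,
        "correlation": correlation,
        "energy": energy,
        "inverse_difference_moment": idm,
    }


# Horizontal co-occurrence counts from Figure 2, normalized as in Eq. (3).
COUNTS = [[4, 2, 1, 0], [2, 4, 0, 0], [1, 0, 6, 1], [0, 0, 1, 2]]
TOTAL = sum(sum(row) for row in COUNTS)
C = [[v / TOTAL for v in row] for row in COUNTS]

FEATURES = texture_features(C)
for name, value in FEATURES.items():
    print(f"{name}: {value:.4f}")
```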
2.2 Non-linear SVM classifier An SVM constructs a separating hyperplane that maximizes the separation margin while minimizing the number of incorrectly classified examples. In the non-linear case, the SVM maps the training data nonlinearly into a higher-dimensional feature space via a kernel function K(xi, xj) and constructs a separating hyperplane with maximum margin in that space (Fang Qian, et al, 2002). For a two-class classification problem, assume we are given training data (xi, yi), i = 1, 2, …, n, with xi ∈ Rd and yi ∈ {-1, +1}, where -1 and +1 indicate the two classes. Training consists of optimizing the following cost function with respect to the scalars αi:
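$$\max_{\alpha}\; W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \qquad (4)$$
Subject to:
$$\sum_{i=1}^{n} y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \ldots, n \qquad (5)$$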
The xi are called support vectors if the corresponding αi > 0. C is a regularization parameter that controls the trade-off between margin maximization and training error minimization. The decision function is
$$f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i^{*} y_i K(\mathbf{x}_i, \mathbf{x}) + b \qquad (6)$$
where x is the d-dimensional feature vector of an observed example, yi ∈ {-1, +1} is the class label, xi is the vector of the i-th training example, n is the number of training examples, αi* are the optimal values of the scalars αi, and K(xi, x) is a kernel function. The decision function of the SVM classifier is computed from the support vectors (SVs) identified among the examples during training. The SVs carry all the information in the training examples that is relevant to classification, and their number is quite small compared with the total number of training examples.
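To make Eq. (6) concrete, the sketch below evaluates the decision function from a list of support vectors. The two-point training set, the multipliers and the bias are a hand-worked toy case chosen for illustration (for two points at (0, 0) and (2, 2) with a linear kernel, the maximum-margin solution is α = 0.25 for both and b = -1); none of these values come from the experiments, and all names are hypothetical.

```python
def dot(x, z):
    """Plain inner product, used here as the kernel K(x_i, x)."""
    return sum(a * b for a, b in zip(x, z))


def decision_function(x, support_vectors, alphas, labels, bias, kernel=dot):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, summed over support vectors."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + bias


def classify(x, support_vectors, alphas, labels, bias, kernel=dot):
    """Return +1 or -1 according to the sign of the decision function."""
    f = decision_function(x, support_vectors, alphas, labels, bias, kernel)
    return 1 if f >= 0 else -1


# Toy model (illustrative values only): one support vector per class.
SUPPORT_VECTORS = [(0.0, 0.0), (2.0, 2.0)]
LABELS = [-1, 1]
ALPHAS = [0.25, 0.25]  # satisfies sum_i y_i * alpha_i = 0, as required by Eq. (5)
BIAS = -1.0

for point in [(0.0, 0.0), (2.0, 2.0), (0.5, 0.5), (3.0, 3.0)]:
    score = decision_function(point, SUPPORT_VECTORS, ALPHAS, LABELS, BIAS)
    print(point, score, classify(point, SUPPORT_VECTORS, ALPHAS, LABELS, BIAS))
```

Only the support vectors enter the sum, which is why a small number of SVs keeps classification cheap.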
2.3 Design of the detection system Cryo-EM images of the actomyosin complex were acquired by the Yasunaga Lab. The images were digitized at 1.72 Å/pixel; each is a 2048×2048-pixel, 16-bit grayscale image stored in MRC format. We obtained 979 sub-images from 11 image files.
2.3.1 Framework of the automatic detection system The system is composed of three modules: Cryo-EM image pre-processing, texture feature extraction, and the SVM classifier. The framework of the system is shown in Figure 3.
Figure 3. Framework of automatic detecting system
Image pre-processing includes the histogram, the Fourier frequency spectrum, a 3×3 median filter, and brightness and contrast enhancement to obtain a better image. A window of 210×210 pixels (36 nm at 1.72 Å/pixel) is shifted over the regions of interest in the 11 images to obtain 979 sub-images. A Cryo-EM image is shown in Figure 4. Of these sub-images, 158 are selected as positive training examples (Figure 5) and 656 are selected as negative training examples. The actomyosin complex is a three-dimensional (3D) structure of actin and myosin, shown in Figure 6.
Figure 4. Cryo-EM image
Figure 5. The actomyosin particles
Figure 6. 3D structure of actin and myosin
2.3.2 Preparation of data set The vector of pixel features, denoted Xi, is used as the input pattern. Actomyosin complex particles form the positive class (yi = +1), and non-actomyosin particles are collected as the negative class (yi = -1). To generate test examples, a 210×210-pixel sub-window scans each Cryo-EM image, shifting by one pixel at a time until the whole image has been scanned and mapped. For each window position, the fourth statistical moment is computed from the image histogram, and the windows with the highest values, as determined by a set threshold, are extracted as test examples. We used sub-sampling to reduce the window size from 210×210 pixels to 14×14 pixels. It is very important to scale the testing data before applying the SVM. The main advantage is that attributes with larger numeric ranges do not dominate those with smaller ranges; another is that numerical difficulties during the calculation are avoided. We scaled the testing data to [-1, +1].
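The sub-sampling and scaling steps above can be sketched as follows. The text does not say which sub-sampling method was used, so block averaging with a factor of 15 (210 / 14) is an assumption here, as is the per-attribute min-max mapping to [-1, +1] and the choice to map a constant attribute to 0. All names are illustrative.

```python
def subsample(window, factor):
    """Shrink a square window by averaging non-overlapping factor x factor blocks."""
    size = len(window)
    if size % factor != 0:
        raise ValueError("window size must be a multiple of the factor")
    small = []
    for r in range(0, size, factor):
        row = []
        for c in range(0, size, factor):
            block = [window[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def scale_attributes(samples, lo=-1.0, hi=1.0):
    """Linearly map each attribute (column) of the samples to [lo, hi]."""
    columns = list(zip(*samples))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    return [
        [lo + (hi - lo) * (v - mn) / (mx - mn) if mx > mn else 0.0
         for v, mn, mx in zip(sample, mins, maxs)]
        for sample in samples
    ]


# Synthetic 210 x 210 window (a simple gradient) reduced to 14 x 14.
WINDOW = [[r + c for c in range(210)] for r in range(210)]
SMALL = subsample(WINDOW, 15)

# Three toy feature vectors whose attributes have very different ranges.
SCALED = scale_attributes([[0, 10], [5, 20], [10, 30]])
print(len(SMALL), len(SMALL[0]), SCALED)
```

In practice the minimum and maximum of each attribute would typically be taken from the training data and reused for the test data, so that both are scaled consistently.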
2.3.3 Model selection SVM training involves selecting proper kernel function parameters and the regularization parameter C. The selection of the kernel function parameters is very important because they implicitly define the structure of the high-dimensional feature space in which the maximal margin hyperplane is found. The regularization parameter C controls the complexity of the learning machine and influences the training speed. Several typical kernel functions are:
Linear: $K(\mathbf{x}, \mathbf{x}_i) = \mathbf{x}^{T}\mathbf{x}_i \qquad (7)$
Polynomial with degree d: $K(\mathbf{x}, \mathbf{x}_i) = (\mathbf{x}^{T}\mathbf{x}_i + 1)^{d} \qquad (8)$
Gaussian Radial Basis Function (RBF): $K(\mathbf{x}, \mathbf{x}_i) = \exp(-\gamma \|\mathbf{x} - \mathbf{x}_i\|^{2}) \qquad (9)$
For a given dataset, the SVM is specified by the choice of kernel function and the regularization parameter C. We selected RBF as the kernel function of the SVM. To optimize the parameters C and γ, we applied 5-fold cross validation to the training image set; the estimated generalization error varies with the chosen values of γ and C. Classifier performance is described using the ROC (Receiver Operating Characteristic) curve, a two-dimensional plot of the true positive rate against the false positive rate. A common way to compare classifiers is to calculate the area under the ROC curve (AUC).
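The three kernels of Eqs. (7) to (9) and the fold structure of a 5-fold cross validation can be written down compactly. This is a hedged sketch in plain Python: the function names are illustrative, the folds are formed deterministically without shuffling (an assumption; the text does not describe how folds were drawn), and 814 is simply the sum of the 158 positive and 656 negative training examples.

```python
import math


def linear_kernel(x, xi):
    """Eq. (7): inner product of the two vectors."""
    return sum(a * b for a, b in zip(x, xi))


def polynomial_kernel(x, xi, d):
    """Eq. (8): (x^T xi + 1)^d."""
    return (linear_kernel(x, xi) + 1) ** d


def rbf_kernel(x, xi, gamma):
    """Eq. (9): exp(-gamma * ||x - xi||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))


def k_fold_splits(n, k=5):
    """Return k (train_indices, validation_indices) pairs covering 0..n-1."""
    indices = list(range(n))
    folds = [indices[f::k] for f in range(k)]
    return [(sorted(set(indices) - set(fold)), fold) for fold in folds]


print(linear_kernel((1, 2), (3, 4)))
print(polynomial_kernel((1, 2), (3, 4), 2))
print(rbf_kernel((0, 0), (2, 2), 0.125))

SPLITS = k_fold_splits(158 + 656, k=5)
print([len(validation) for _, validation in SPLITS])
```

A grid search over C and γ would train one classifier per fold for each candidate pair and keep the pair with the lowest average validation error; the SVM trainer itself is outside the scope of this sketch.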
3. EXPERIMENT RESULTS We selected a sub-sampling window size of 14×14 pixels according to Table 3. We applied 5-fold cross validation to the training image set, and classifier performance is represented using the ROC curve. With the polynomial kernel function, the number of support vectors (SVs) is approximately 24.44% of the total number of training examples and the area under the ROC curve (AUC) is 0.9589.
Table 3. Selected sub-sampling window size
Window size    AUC (Examples: 158/281)    AUC (Examples: 158/656)
7×7            0.8849                     0.8864
12×12          0.9513                     0.953
14×14          0.9581                     0.9645
15×15          0.9523                     0.9623
18×18          0.9541                     0.9605
21×21          0.9489                     0.9585
24×24          0.9488                     0.9563
27×27          0.9486                     0.9571
30×30          0.9483                     0.9559
With the RBF kernel function, we obtained C = 2.0 and γ = 0.125, with an accuracy of 95.57%. The number of support vectors (SVs) is approximately 16.34% of the total number of training examples, and the area under the ROC curve (AUC) is 0.9645 (Figure 7). The detection rate reaches 93.58%, and the false positive rate is 3.66%. The ROC curves of the GLCM-SVM and FOS (first-order statistics) classifiers are compared in Figure 8, which shows that the GLCM-SVM classifier performs better than the FOS one.
Figure 7. GLCM ROC curve using RBF kernel (axes: False Positive Rate, True Positive Rate)
Figure 8. Comparing both ROC curves (curves: GLCM-SVM, FOS; axes: False Positive Rate, True Positive Rate)
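The quantities reported here (detection rate, false positive rate and AUC) can all be derived from classifier scores and true labels. The sketch below shows one common way to do so, sweeping a threshold over the scores and integrating the ROC curve with the trapezoidal rule; the six scores and labels are made-up illustrative values, not experimental data, and the function names are assumptions.

```python
def rates_at_threshold(scores, labels, threshold=0.0):
    """Return (detection rate, false positive rate) for a score threshold."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y != 1 and s >= threshold)
    return tp / pos, fp / neg


def roc_points(scores, labels):
    """Sweep the threshold from high to low and collect (FPR, TPR) points."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with equal scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (FPR, TPR) points sorted by FPR."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Illustrative scores f(x) and true labels (+1 particle, -1 non-particle).
SCORES = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
LABELS = [1, 1, -1, 1, -1, -1]

POINTS = roc_points(SCORES, LABELS)
AUC = area_under_curve(POINTS)
print(rates_at_threshold(SCORES, LABELS, threshold=0.65), AUC)
```

An AUC of 1 corresponds to a classifier that ranks every particle above every non-particle, while 0.5 corresponds to random ranking.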
4. CONCLUSIONS This paper proposes GLCM-SVM for the detection of actomyosin complexes in Cryo-EM images. Experimental results show a detection rate of 93.58%, with the number of SVs approximately 16.34% of the total number of training examples. Because the shape of the actomyosin complex is complex, extracting its features is very difficult. The GLCM approach addresses the problem of extracting actomyosin complex texture features and can be extended to asymmetrical particles and a variety of irregular particles. Our approach makes feature extraction and classification simple, rapid, and accurate.
REFERENCES
Anna K. et al, 2003. Computer-aided Polyp Detection in CT Colonography Using an Ensemble of Support Vector Machines. International Congress Series, 1256, 1019-1024.
Danh V. et al, 2001. Tumor Classification by Partial Least Squares Using Micro-array Gene Expression Data. Bioinformatics, 18, 30-50.
Furey T. et al, 2000. Support Vector Machine Classification and Validation of Cancer Tissue Samples Using Microarray Data. Bioinformatics, 16, 10, pp. 906-914.
Fang Qian et al, 2002. SVM-Based Feature Weighting Method for Image Retrieval. 4th Intl Workshop on Multimedia Information Retrieval, Juan-les-Pins, France, pp. 142-146.
Melissa S. et al, 2004. Three-dimensional Structure of Complex Spliceosomes by Electron Microscopy. Nature Structural & Molecular Biology, Vol. 11, 265-269.
Molecular motors: myosin, 2002. http://www.proweb.org/myosin/
Matjaz B. et al, 2002. A Statistical Approach to Texture Description of Medical Images: A Preliminary Study. CBMS: 239-240.
R. M. Haralick, 1979. Statistical and Structural Approaches to Texture. Proceedings of the IEEE, Vol. 67, 5: 786-804.
Roseman A. et al, 2003. Particle Finding in Electron Micrographs Using Fast Local Correlation Algorithm. Ultramicroscopy, 94, 225-236.
Takuo Y. et al, 1996. Extensible and Object-oriented System. Journal of Structural Biology, 116, 155-160.
XU. et al, 2003. Discovery and Identification of Rat Liver Cirrhosis Biomarker Candidates Using Protein Profiling. Genome Research, 13 (9): 2112-2117.
Y.F.Sun et al, 2003. Identifying Splicing Sites in Eukaryotic RNA: Support Vector Machine Approach. Computers in Biology and Medicine, Vol. 33, 17-29.
Yongyi Yang et al, 2002. A Support Vector Machine Approach for Detection of Microcalcifications. IEEE Transactions on Medical Imaging, 21, 1552-1563.
Yuanxin Zhu et al, 2004. Automatic Particle Selection: Results of a Comparative Study. J Struct Biol, 145(1-2): 3-14.
Zeyun Yun et al, 2004. Picking Circular and Rectangular Particles Based on Geometric Feature Detection in Electron Micrographs. Journal of Structural Biology, 145, 168-180.
Zhu Y. et al, 2002. Fast Detection of Generic Biological Particles in Cryo-EM Images through Efficient Hough Transforms. IEEE International Symposium on Biomedical Imaging, pp. 205-208.