Neural Information Processing – Letters and Reviews

Vol. 11, Nos. 4-6, April-June 2007

LETTER

Cross-Modal Learning - The Learning Methodology Inspired by Human's Intelligence

Bo Zhang and Dayong Ding
Computer Science & Technology Department, Tsinghua University, Beijing, China 100084
[email protected]; [email protected]

Ling Zhang
Artificial Intelligence Institute, Anhui University, Hefei, China 230039
[email protected]

(Submitted on December 15, 2006)

Abstract — Humans have an amazing cross-modal learning capability. To endow computers with the same ability, we use a model based on the quotient space theory. In the quotient space model, representations at different modalities form a complete semi-order lattice, and translation from one modality to the others becomes easier; the model is therefore well suited as a mathematical model of cross-modal learning. Taking video retrieval as an example, we show how to apply the cross-modal learning strategy to this field. The first problem of cross-modal learning in video retrieval is how to represent video content so that the videos a user expects can be found from a collection both precisely and completely. A video can be represented by different modalities such as image, speech, and text, and each modality can be represented in several forms with different grain-sizes. Research has shown that the grain-size of the image modality trades precision against recall, and that multi-level features may improve both. Using only one modality for video retrieval is not enough, however, so speech and keywords are used as well. One strategy for cross-modal learning is to integrate information from different sense modalities. The second problem is therefore how to integrate the results from different modalities; this is the feature-binding or information-fusion problem, for which multi-classifier techniques are discussed. We may consider each modality as a projection of the same object (video) and integrate information from the projections. Specifically, we propose the Probabilistic Model Supported Rank Aggregation (PMSRA) method to accomplish this integration. Theoretical analysis and experimental results show that cross-modal learning can significantly improve the performance of machine learning and that the quotient space model is a powerful tool for it.

Keywords — Cross-modal learning, quotient space theory, probabilistic model supported rank aggregation, machine learning, video retrieval

This work was supported by the National Natural Science Foundation of China under grant No. 60621062, and by the National Key Foundation R&D Projects under grants No. 2003CB317007 and 2004CB318108.

1. Introduction

Humans have the amazing ability to conceptualize the world from multiple sense modalities. Even in infancy, for example, one is able to turn crudely toward a sound, and as a baby grows up, an increasingly precise auditory-visual spatial map develops. Another typical example is the human ability to read lips: in a noisy environment, one's speech recognition can be substantially improved when the speaker's face is visible [1].

In experiments, discordant pairs of acoustic-visual stimuli produced an interesting phenomenon: similar but discordant lip-movements shown to subjects greatly misled their comprehension of the syllables they heard. This result suggests a strong interaction between the auditory and visual modalities during human listening. Moreover, further analysis showed that this cross-modal learning ability is, at least partially, acquired [2].

Inspired by human intelligence, we address the question: how can the ability of cross-modal learning be given to computers? Specifically, we extend the quotient space theory to model the cross-modal learning process and take video retrieval as an example to show how to apply the cross-modal learning strategy. In modeling cross-modal learning, representation and integration are the two key problems, and both can be clearly seen in the area of video retrieval; video retrieval is therefore used as the running example throughout the paper.

Representation is the first key problem for cross-modal learning. Video is a rich medium containing pictures, sounds, and texts. Obviously, when people watch a video program, they simultaneously process information from all the modalities to grasp the meaning. A machine with cross-modal learning ability for video retrieval should likewise make use of the multiple modalities in video, so representation of the different modalities is the basis of cross-modal learning. Specifically, which modalities or sub-modalities are to be represented, and how they can be represented, are of the most interest.

Integration of the results from multiple modalities is the second key problem. In video retrieval this integration is a kind of classifier combination problem, for which we propose a Probabilistic Model Supported Rank Aggregation (PMSRA) method. Briefly speaking, we may consider each modality as a projection of the same object (video shot). The probabilistic properties of each projection and the relations among the projections are then investigated, and based on these properties and relations, the information from the projections can be integrated by various known statistical reasoning techniques.

Cross-modal learning can be modeled by the quotient space theory [3-5]. In this model, the world is represented by a semi-order lattice composed of a set of quotient spaces, each of which represents the world at a certain modality and granularity. A space is denoted by a triplet $(X, F, f)$, where $X$ is the domain and consists of all the objects we intend to deal with; $F$ is the structure of $X$ and represents the relationships among the objects; and $f$ is the attribute of $X$, a function (or a set of functions) defined on $X$. The same world represented at different modalities or granularities corresponds to quotient spaces of $(X, F, f)$. These quotient spaces can be defined by equivalence, consistency, or fuzzy relations and are denoted $([X], [F], [f])$, where $[X]$ is a quotient space of $X$, $[F]$ is the quotient structure of $X$, and $[f]$ is the attribute of $[X]$. The representations at different modalities then form a complete semi-order lattice, and translation from one modality to the others becomes very easy. Therefore, the quotient space model is suitable for cross-modal learning.

The paper is organized as follows. The representation problem in video retrieval is discussed in Section 2, followed by the integration method for cross-modal learning, namely the PMSRA method, in Section 3. The quotient space model for cross-modal learning is presented in Section 4, with emphasis on the representation and integration schemes. The experimental results of the model are given in Section 5, and the work is concluded in Section 6.
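To make the triplet notation concrete, the following minimal sketch (in Python; the function name and the toy data are hypothetical illustrations, not from the paper) builds two quotient spaces $[X]$ of the same domain from two different equivalence relations, mirroring how different modalities induce different partitions of the same set of video shots.

```python
def quotient_space(X, key):
    """Partition the domain X into the equivalence classes [X] induced by
    `key`: two objects are equivalent iff they map to the same key value,
    and each block of the partition is one element of the quotient space."""
    blocks = {}
    for x in X:
        blocks.setdefault(key(x), []).append(x)
    return list(blocks.values())

# A toy domain of video shots described by two attributes.
shots = [
    {"id": 1, "scene": "water",    "sound": "speech"},
    {"id": 2, "scene": "water",    "sound": "music"},
    {"id": 3, "scene": "mountain", "sound": "music"},
]

# Two quotient spaces of the same domain, induced by different attributes;
# each plays the role of the world as seen through one modality.
by_scene = quotient_space(shots, key=lambda s: s["scene"])
by_sound = quotient_space(shots, key=lambda s: s["sound"])
print([[s["id"] for s in block] for block in by_scene])  # [[1, 2], [3]]
print([[s["id"] for s in block] for block in by_sound])  # [[1], [2, 3]]
```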

2. Representation of Multiple Modalities

Multiple sense modalities and sub-modalities have a great impact on how humans conceptualize the world. On one hand, cooperation between modalities is important for sensing the world efficiently, as the effects of the absence of some modality or sub-modality on human recognition or learning clearly reveal. For example, poor audition greatly impedes a child's learning to speak, and achromatopsia, i.e., color blindness, is the lack of the most important visual sub-modality, color, which brings much inconvenience to patients. On the other hand, inconsistency between modalities can cause confusion. In a well-known psychological experiment, people are required to read out the colors in which some color-name words are shown. When a word denotes a color different from the color in which it is printed, such as the word 'black' in a red font, response times are significantly longer than when the denoted color and the font color are consistent. In summary, many human cognitive phenomena suggest that multiple modalities may be very important for improving machine intelligence.

Video is a typical multi-modal medium. Information in a video may be conveyed through various modalities, such as visual, audio, and/or text. Although text is usually coincident with the visual or audio modality, its distinctive functionality, to some extent, justifies treating it as an independent modality. Each modality in a video has sub-modalities. For example, the visual modality is composed of gray scales, colors, textures, and motions; the audio modality consists of speech, music, and environmental sounds; and the text modality contains sub-modalities such as closed captions, ASR (automatic speech recognition) text, and video OCR (optical character recognition) text.

For a given modality, a suitable granularity of representation is important, and each modality of a video can be represented in several forms with different grain-sizes. For example, an image can be represented by an m × n matrix whose elements are pixels. This is the finest representation of an image: using it for video retrieval, precision is high but robustness (recall) is low, because the representation contains so many details that it is sensitive to noise. The pixel-based representation has therefore seldom been used in video retrieval. The commonly used representation is the coarsest one, the so-called global visual features [6], in which an image is represented by a single visual feature vector such as color moments, color correlograms, wavelet transforms, or Gabor transforms. In the coarsest representations most of the details of an image are lost, so retrieval precision decreases but robustness (recall) increases. Because of this robustness, the coarsest representations are suitable for seeking a class of similar images, and global visual features have been widely used for video retrieval. To overcome the low precision of the coarsest representations, middle-size representations of an image, such as region-based representations [7], were presented recently. There, an image is partitioned into several consistent regions, each region is represented by a visual feature vector extracted from it, and the whole image is represented by the set of these vectors. Since a region-based representation contains more details of an image than the global one, retrieval precision increases while robustness decreases. Nevertheless, the overall quality of video retrieval, in both precision and recall, can be improved by using multi-level features; a minimal sketch contrasting the two granularities is given below.

Previous studies mostly employed only one modality, namely image information, to retrieve video, which is apparently not enough for machine intelligence. Speech and keywords [8] are therefore used as well, and cross-modal learning is consequently the key topic of this study.
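The sketch below (Python with NumPy; the fixed-grid regions, function names, and moment choice are illustrative assumptions, not the segmentation method of [7]) contrasts the coarsest representation, a single global color-moment vector, with a middle-granularity representation that keeps one vector per region.

```python
import numpy as np

def global_color_moments(image):
    """Coarsest representation: one vector of per-channel color moments
    (mean, standard deviation, skewness) for the whole image."""
    pixels = image.reshape(-1, image.shape[-1]).astype(float)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    skew = ((pixels - mean) ** 3).mean(axis=0) / np.maximum(std, 1e-9) ** 3
    return np.concatenate([mean, std, skew])

def region_color_moments(image, grid=(2, 2)):
    """Middle granularity: one moment vector per block of a fixed grid.
    (A real region-based method would segment into consistent regions.)"""
    h, w = image.shape[:2]
    rows, cols = grid
    feats = []
    for r in range(rows):
        for c in range(cols):
            block = image[r * h // rows:(r + 1) * h // rows,
                          c * w // cols:(c + 1) * w // cols]
            feats.append(global_color_moments(block))
    return np.stack(feats)

img = np.random.randint(0, 256, size=(64, 64, 3))  # stand-in for a video frame
print(global_color_moments(img).shape)  # (9,)   - coarsest: one vector
print(region_color_moments(img).shape)  # (4, 9) - finer: a set of vectors
```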

3. Integration of Multiple Modalities: Probabilistic Model Supported Rank Aggregation

3.1 Overview of Integration Methods

Once the representation of the different modalities is settled, it is necessary to integrate the information from all the modalities. The cross-modal integration ability in human intelligence is an extremely sophisticated process. In the literature, multi-classifier techniques can be used to simulate this integration; here, the PMSRA method is proposed to fuse the results from multiple sources.

As a technique for combining multiple rank lists into one, rank aggregation is especially suitable for cross-modal integration, which can be viewed as a decision combination problem. Unless otherwise stated, binary decision problems, which correspond to two-class classification problems in machine learning, are the present concern; the two classes are called the 'positive/relevant' and 'negative/irrelevant' classes in this paper. Generally, three kinds of information can be used for decision combination [9]:

- 0-1 results, i.e., YES or NO judgments for the objects to be judged;
- rankings, i.e., ordered lists of objects, usually in descending order of their likelihood; or
- decision variables, whose values are the basis of the decisions (for example, if we judge the sex of a bird by its weight, then weight is the decision variable).

In cross-modal integration, 0-1 results provide too little information for interesting fusion. On the other hand, decision variables from different modalities often depend on their units, ranges, and physical meanings, so it is hard to develop general integration methods based on them. Ranking lies in the middle: it is neither too crude nor too specific to use, and for a given decision problem, rankings from multiple modalities express the substantial differences between modalities in the same form. Therefore, rank aggregation is a suitable model for cross-modal integration.

Many rank aggregation methods have been studied in voting theory, information retrieval, and machine learning [9-16]. According to the way rank lists are treated, there are two main categories of methods: Condorcet's method and Borda's method. Condorcet's method and its variations [13, 15] treat lists as binary relations: a full order is broken into a set of N(N-1)/2 pairwise comparisons or preferences, where N is the number of objects in the order, and rank aggregation is a union of the binary relations. However, a binary relation does not necessarily form an order, so the key problem for Condorcet's methods is how to find the 'best' order given the aggregated binary relation. Unfortunately, this problem is proven to be NP-hard when the number of aggregated lists is an even integer greater than three [11]. It is therefore impractical to apply these methods to cross-modal integration.


Borda’s method and its variants [10, 14, 16] directly translated order into scores. Rank aggregation is turned to operation of the scores. Although very efficient in calculation, the scoring methods lack sounding interpretations. Therefore, variations of the Borda’s method are mostly heuristic and there is no guarantee for its application to cross-modal integration. Consequently, it is necessary to find a new integration method.

3.2 The PMSRA Method

The PMSRA method treats rank lists as order statistics of sampled decision variables. With this method we seek a simple, Borda-like procedure that has a plausible probabilistic interpretation for rank aggregation and can be extended in many ways.

All the ideas in PMSRA stem from the way rank lists are interpreted. In PMSRA, a rank list is regarded as the result of sorting objects in a particular probability space of a decision variable. Once we know that probability space, we can estimate, for each object in the list, its class likelihood, i.e., the probability that it is correctly labeled YES (or NO). The class likelihoods can then be used in various ways to accomplish the rank aggregation. The main steps of PMSRA are therefore: i) choice of the distribution family for the decision variable; ii) estimation of the distribution parameters; iii) estimation of the class likelihoods for all objects; and iv) aggregation of the likelihoods. Next, we formalize these ideas and then derive the Bayesian rank aggregation rule under the PMSRA scheme.

Suppose $\tau$ is a rank list defined on the set $S = S_1 \cup S_0$, where $S_1$ and $S_0$ are the subsets of positive and negative samples, respectively, and $N = |S|$. $X$ is the decision variable for $\tau$, defined as

$$X = Y X_1 + (1 - Y) X_0 \qquad (1)$$

where $X_1$ and $X_0$ are the decision variables for the positive and negative samples, respectively, and $Y$ is the label variable. In this case $\Pr\{Y = 1\} = \pi \equiv |S_1| / |S|$ and $\Pr\{Y = 0\} = 1 - \pi$.

Four major steps are conducted for the rank aggregation. Firstly, according to the given domain knowledge, a distribution family for the decision variable is chosen:

$$p_1(x) = \phi(x; \theta_1); \quad p_0(x) = \phi(x; \theta_0) \qquad (2)$$

where $p_1$ and $p_0$ are the probability density functions (pdf) of $X_1$ and $X_0$, and $\theta_1$ and $\theta_0$ are distribution parameters to be estimated from the lists on training data. Thus the pdf of $X$ is

$$p(x) = \pi p_1(x) + (1 - \pi) p_0(x) = \pi \phi(x; \theta_1) + (1 - \pi) \phi(x; \theta_0) \qquad (3)$$

Secondly, from the distribution information reflected in the training list, we estimate the parameters $\theta_1$ and $\theta_0$. Note that it is not necessary, or even possible, to estimate their absolute values. For the normal distribution family, for example, we are only interested in the difference between the two means, $\mu_1 - \mu_0$, rather than their absolute values. Let $\hat{\theta}_1$ and $\hat{\theta}_0$ denote the estimates.

Thirdly, we estimate the value of the decision variable for each object in the rank list. Objects in a rank list are in fact order statistics from the population whose pdf is given by Eq. (3). We use the expectations of the order statistics as the estimated values of the decision variable. If $i$ is the $(N - r + 1)$-th object in $\tau$, i.e., $\tau(i) = N - r + 1$, then the value of the decision variable for $i$ is

$$x(i) = E(X_{r:N}) \qquad (4)$$

where $X_{r:N}$ denotes the $r$-th order statistic (the $r$-th smallest value) in a sample of size $N$, so that the top-ranked object corresponds to the largest value. Based on the above analysis, given $\tau$, the likelihood of object $i$ being positive is

$$L(i \in S_1 \mid \tau) = \frac{\pi\, p_1\big(E(X_{N-\tau(i)+1:N})\big)}{\pi\, p_1\big(E(X_{N-\tau(i)+1:N})\big) + (1-\pi)\, p_0\big(E(X_{N-\tau(i)+1:N})\big)} = \left(1 + \frac{1-\pi}{\pi} \cdot \frac{\phi\big(E(X_{N-\tau(i)+1:N}); \hat{\theta}_0\big)}{\phi\big(E(X_{N-\tau(i)+1:N}); \hat{\theta}_1\big)}\right)^{-1} \qquad (5)$$
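As an illustration of steps i)-iv) and Eq. (5), the sketch below (Python with SciPy) computes the class likelihoods for one rank list under a Gaussian assumption. The quantile approximation $E(X_{r:N}) \approx F^{-1}(r/(N+1))$ for the order-statistic expectations, the shared variance, and all parameter values are assumptions made for the example; the paper does not fix them.

```python
from scipy.stats import norm
from scipy.optimize import brentq

def pmsra_likelihoods(tau, pi, mu1, mu0, sigma=1.0):
    """Class likelihoods L(i in S1 | tau) of Eq. (5) for every object in a
    rank list `tau` (tau[i] = rank of object i, with 1 the best rank).

    Assumptions not fixed by the paper: Gaussian decision variables
    X1 ~ N(mu1, sigma^2) and X0 ~ N(mu0, sigma^2), and the quantile
    approximation E(X_{r:N}) ~= F^{-1}(r / (N + 1)) for order statistics
    of the mixture F = pi * F1 + (1 - pi) * F0 of Eq. (3).
    """
    N = len(tau)

    def mix_cdf(x):
        return pi * norm.cdf(x, mu1, sigma) + (1 - pi) * norm.cdf(x, mu0, sigma)

    def mix_quantile(q):  # invert the mixture cdf numerically
        lo = min(mu0, mu1) - 10 * sigma
        hi = max(mu0, mu1) + 10 * sigma
        return brentq(lambda x: mix_cdf(x) - q, lo, hi)

    likelihoods = {}
    for i, rank in tau.items():
        r = N - rank + 1                  # Eq. (4): tau(i) = N - r + 1
        x = mix_quantile(r / (N + 1))     # estimated decision-variable value
        p1 = norm.pdf(x, mu1, sigma)
        p0 = norm.pdf(x, mu0, sigma)
        likelihoods[i] = pi * p1 / (pi * p1 + (1 - pi) * p0)  # Eq. (5)
    return likelihoods

# Four objects ranked by one modality; 'a' is ranked best.
tau = {"a": 1, "b": 2, "c": 3, "d": 4}
print(pmsra_likelihoods(tau, pi=0.25, mu1=1.0, mu0=0.0))
```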

Formula (5) gives a probabilistic interpretation of rank lists. From the viewpoint of calculation, Eq. (5) is a unified scoring method with a sound probabilistic meaning. Based on Eq. (5), we can apply many statistical reasoning principles to obtain various rank aggregation methods. Here we take Bayes's principle as an example and deduce the corresponding Bayesian ranking rule.

Bayesian ranking ranks the objects according to their a posteriori probabilities. Mathematically, given rank lists $\tau_1, \ldots, \tau_j, \ldots, \tau_J$ on the object set $S = \{1, 2, \ldots, i, \ldots, N\}$, Bayesian ranking ranks each object by

$$\Pr\{i \in S_1 \mid \tau_1, \ldots, \tau_J\}. \qquad (6)$$

(Here $\Pr\{i \in S_1 \mid \tau_1, \ldots, \tau_J\}$ is short for $\Pr\{i \in S_1 \mid \tau_1(i) = r_1, \ldots, \tau_J(i) = r_J\}$; accordingly, $\Pr\{\tau_1, \ldots, \tau_J\}$ means $\Pr\{\tau_1(i) = r_1, \ldots, \tau_J(i) = r_J\}$, which in fact depends on $i$.)

Assuming that an object's ranks in the lists of different modalities are independent given its class, Bayes's theorem yields

$$\Pr\{i \in S_1 \mid \tau_1, \ldots, \tau_J\} = \frac{\Pr\{\tau_1, \ldots, \tau_J \mid i \in S_1\} \Pr\{i \in S_1\}}{\Pr\{\tau_1, \ldots, \tau_J\}} = \frac{\left(\prod_{j=1}^{J} \frac{p_{ji}}{\pi N}\right) \pi}{\pi \prod_{j=1}^{J} \frac{p_{ji}}{\pi N} + (1 - \pi) \prod_{j=1}^{J} \frac{1 - p_{ji}}{(1 - \pi) N}} = \left[1 + \left(\frac{\pi}{1 - \pi}\right)^{J-1} \prod_{j=1}^{J} \frac{1 - p_{ji}}{p_{ji}}\right]^{-1} \qquad (7)$$

where $p_{ji} = L(i \in S_1 \mid \tau_j)$, $j = 1, \ldots, J$; $i = 1, \ldots, N$. Obviously, ranking by Eq. (7) is equivalent to ranking by

$$\sum_{j=1}^{J} \ln \frac{p_{ji}}{1 - p_{ji}} \qquad (8)$$

From Eq. (8) we can see that, under the above independence assumption, Bayesian ranking can be achieved by simply sorting the objects by the sum of their logits, i.e., the natural log of the odds of the object being positive, over all the lists.
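Eq. (8) can be implemented directly. The sketch below (Python; the helper name `pmsra_likelihoods` and the toy likelihood values are assumptions carried over from the previous sketch, not from the paper) ranks objects by their summed logits across lists.

```python
import math

def bayesian_rank_aggregate(likelihoods_per_list):
    """Aggregate per-list class likelihoods p_ji = L(i in S1 | tau_j) by the
    sum-of-logits rule of Eq. (8); larger scores rank earlier.

    `likelihoods_per_list` is a list of dicts, one per rank list, each
    mapping object -> likelihood in (0, 1)."""
    eps = 1e-12  # guard against log(0) for degenerate likelihoods
    scores = {}
    for p_j in likelihoods_per_list:
        for obj, p in p_j.items():
            p = min(max(p, eps), 1 - eps)
            scores[obj] = scores.get(obj, 0.0) + math.log(p / (1 - p))
    return sorted(scores, key=scores.get, reverse=True)

# Likelihoods from two modalities (e.g., produced by pmsra_likelihoods above):
visual = {"a": 0.80, "b": 0.55, "c": 0.30}
text   = {"a": 0.60, "b": 0.70, "c": 0.20}
print(bayesian_rank_aggregate([visual, text]))  # ['a', 'b', 'c']
```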

4. The Quotient Space Model for Cross-Modal Learning

Cross-modal learning strategies can be naturally modeled by the quotient space theory [3, 4]. Taking video retrieval as an example, we first briefly introduce the basic ideas of the theory and then discuss its integration technologies. The basic idea of the quotient space theory is to model a practical problem with spaces of multiple granularities arranged in a hierarchy; by properly constructing this hierarchy, the computational complexity of the problem may be reduced. The most important issues in applying the theory are the construction of quotient spaces, the transformation between quotient spaces, and the integration of quotient spaces. The theory can also be extended to model cross-modal learning.

Given a semantic concept to be queried, the video retrieval problem can be theoretically represented in a space of complete information, denoted $(X, F, f)$, where

- $X$ is the domain, i.e., the set of all video shots represented in a form of complete information;
- $F$ is the structure of $X$, i.e., the relations among the video shots; and
- $f$ is the property defined on $X$, often a function $f: X \to \mathbb{R}^n$.

In practice, the ideal space $(X, F, f)$ is unknown. Rather, in the scheme of cross-modal learning, we may only know something about the video shots as represented in the different sense modalities or sub-modalities. In quotient space terms, the representations for the modalities are projections of $(X, F, f)$, i.e., quotient spaces of the ideal one. Suppose there are two projective spaces, corresponding to two sense modalities or sub-modalities, denoted $(X_1, F_1, f_1)$ and $(X_2, F_2, f_2)$. Then cross-modal integration can be formulated as the problem of integrating the two spaces: find a quotient space $(X_3, F_3, f_3)$ of $(X, F, f)$ such that

i) $X_1$ and $X_2$ are quotient spaces of $X_3$;
ii) $F_1$ and $F_2$ are quotient structures of $F_3$ with respect to $X_1$ and $X_2$; and
iii) $f_1$ and $f_2$ are projections of $f_3$ onto $X_1$ and $X_2$, respectively,

and $(X_3, F_3, f_3)$ satisfies certain optimality criteria given by the practical problem. With this formulation, many integration techniques have been studied for different space structures [3, 4]. A minimal sketch of condition i) is given below.
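The domain part of this integration can be illustrated concretely. The sketch below (Python; a simplification that covers only condition i), ignoring the structures $F$ and attributes $f$) builds the coarsest domain $X_3$ of which two modality-induced partitions $X_1$ and $X_2$ are both quotient spaces, namely their common refinement.

```python
def common_refinement(partition1, partition2):
    """Coarsest partition X3 of which both inputs are quotient spaces:
    every block of X3 is a non-empty intersection of one block from each
    input partition."""
    refined = []
    for b1 in partition1:
        for b2 in partition2:
            block = set(b1) & set(b2)
            if block:
                refined.append(sorted(block))
    return refined

# Two modalities partition the same five shots differently:
visual_partition = [[1, 2, 3], [4, 5]]  # e.g., grouped by color similarity
audio_partition  = [[1, 2], [3, 4, 5]]  # e.g., grouped by speech content
print(common_refinement(visual_partition, audio_partition))
# [[1, 2], [3], [4, 5]]
```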


The PMSRA may be regarded as an instance of this quotient space model. In fact, if we view $(X_1, F_1, f_1)$ and $(X_2, F_2, f_2)$ as the probabilistic models of two rank lists from different modalities, then $F_1$ and $F_2$ are the mixture distributions and $f_1$ and $f_2$ are the likelihood functions. Taking maximization of the a posteriori probability as the optimality criterion for the integrated space $(X_3, F_3, f_3)$, we finally obtain the Bayesian rule of Eq. (8); other criteria yield alternative rank aggregation rules. As a generalized model, the quotient space theory is thus well suited to cross-modal learning.

5. Experimental Results

To demonstrate the application of the quotient space model to cross-modal learning, video retrieval experiments were conducted on the TRECVID 2005 dataset, focusing on semantic concept detection, and the results were compared with those of uni-modal strategies.

Semantic concept detection, also called high-level feature extraction (HFE) in TRECVID [17], is the task of automatically determining the related video shots in a video dataset, given semantic concepts intelligible to humans. This technology is very useful for automatic or semi-automatic video indexing and annotation. Usually, a small portion of the data is annotated manually as training data for the machine learning programs. The result of concept detection is measured by the average precision, which summarizes the precision at all recall levels and favors algorithms that detect more relevant shots earlier [18]; a minimal sketch of this measure is given after Table 1.

Our experiments were carried out on the TRECVID 2005 test dataset for the HFE task [17], which consists of 86.6 hours of news videos (45,766 shots in 140 video clips). In TRECVID 2005, 69 teams took part in the HFE task and 106 runs of results were submitted. Three uni-modal methods and four cross-modal methods, the latter being the $C_3^2 + C_3^3 = 4$ possible integrations of the former, are compared. The uni-modal results are chosen from the TRECVID 2005 submissions:

- ASR, the 7th run from Fudan University [19], based on automatic speech recognition text, a sub-modality of text;
- Texture, the 7th run from the National University of Singapore [20], based on visual texture, a visual sub-modality; and
- Region, the 1st run from Tsinghua University [21], based on the color of segmented image regions, also a visual sub-modality.

The four cross-modal results are accordingly denoted 'A+T', 'A+R', 'T+R', and 'A+T+R'. We use the Bayesian rule of PMSRA to integrate the different modalities or sub-modalities, with the Gaussian distribution as the adopted distribution family. Four concepts of different types are chosen for comparison: 'US-flag' is an object concept, 'Water' and 'Mountain' are typical scene concepts, and 'Sports' is an event.

The average precisions (AP) of all the uni-modal and cross-modal methods are given in Table 1. The averaged results (the last row) show that the cross-modal strategies are significantly better than the uni-modal ones, and most cross-modal results are better than those obtained by any of the constituent modalities alone. Nevertheless, some performances differ slightly, possibly because inconsistency between modalities produces a low average value. For example, the combination of 'A' and 'T' in the row 'Water' produces a result poorer than 'Texture' alone. This phenomenon resembles the misleading effect of discordant lip-movements observed in the psychological experiment mentioned above. Given the adaptability of the present PMSRA method, the deterioration in performance is slight.

Table 1. Comparison of Uni-Modal and Cross-Modal Semantic Video Concept Detection Results (average precision)

| Concept  | ASR    | Texture | Region | A+T    | A+R    | T+R    | A+T+R  |
|----------|--------|---------|--------|--------|--------|--------|--------|
| US-flag  | 0.0335 | 0.0155  | 0.0375 | 0.0359 | 0.0506 | 0.0372 | 0.0521 |
| Water    | 0.0034 | 0.1143  | 0.0814 | 0.1022 | 0.0735 | 0.1333 | 0.1211 |
| Mountain | 0.0033 | 0.0693  | 0.1104 | 0.0668 | 0.1066 | 0.1176 | 0.1154 |
| Sports   | 0.0723 | 0.0769  | 0.2156 | 0.1465 | 0.2678 | 0.2802 | 0.3050 |
| Average  | 0.0281 | 0.0690  | 0.1112 | 0.0879 | 0.1246 | 0.1421 | 0.1484 |

(ASR, Texture, and Region are the uni-modal methods; A+T, A+R, T+R, and A+T+R are their cross-modal integrations.)
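As referenced above, the sketch below (Python; the function name and toy data are illustrative) computes the non-interpolated average precision used to score the runs in Table 1: the mean of precision@k over the ranks k at which relevant shots appear, normalized by the total number of relevant shots [18].

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of a ranked list of shot ids against a ground-truth
    set of relevant ids."""
    relevant = set(relevant_ids)
    hits, total = 0, 0.0
    for k, shot in enumerate(ranked_ids, start=1):
        if shot in relevant:
            hits += 1
            total += hits / k  # precision at this recall point
    return total / len(relevant) if relevant else 0.0

ranking = ["s3", "s1", "s7", "s2", "s9"]
print(average_precision(ranking, relevant_ids={"s1", "s2"}))
# (1/2 + 2/4) / 2 = 0.5
```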


6. Conclusion and Discussion

Cross-modal learning is a distinctive feature of human intelligence. In order to endow machines with this ability, a model based on the quotient space theory has been proposed, and its application has been demonstrated using video retrieval as an example. The multiple modalities are represented as quotient spaces of the ideal problem space of complete information, while the integration is performed by quotient space integration, namely the Probabilistic Model Supported Rank Aggregation. Finally, concept detection experiments in video retrieval using cross-modal learning strategies were studied. The work leads to the following conclusions: firstly, cross-modal learning is a powerful methodology for improving the performance of machine learning; secondly, the quotient space theory is suitable for modeling cross-modal learning problems because the model easily accommodates both the representation problem and the integration problem; thirdly, the Probabilistic Model Supported Rank Aggregation is an effective method for implementing cross-modal integration.

The present work is, however, only preliminary, and many questions in cross-modal learning remain open. For instance, since inconsistent modalities may bring confusion during learning, as noted earlier, how to find the modalities suitable for a given learning task has yet to be studied; a related problem is how to construct many modalities or sub-modalities to select from. Cross-modal integration is a process that spans sensing to decision-making, yet to date only integration at the decision stage has been investigated; how to integrate multiple modalities at the representation or training stage is worthy of further investigation. In summary, cross-modal learning, motivated by human intelligence, is a very promising research topic in machine learning.

References

[1] Sumby, W.H. and Pollack, I., "Visual contribution to speech intelligibility in noise," The Journal of the Acoustical Society of America, 1954, 26(2): 212-215.
[2] McGurk, H. and MacDonald, J., "Hearing lips and seeing voices," Nature, 1976, 264: 746-748.
[3] Zhang, B. and Zhang, L., Theory and Applications of Problem Solving, Elsevier Science Publishers B.V., 1992.
[4] Zhang, L. and Zhang, B., "The quotient space theory of problem solving," Fundamenta Informaticae, 2004, 59(2/3): 287-298.
[5] Zhang, L. and Zhang, B., "Fuzzy reasoning model under quotient space structure," Information Sciences, 2005, 173: 353-364.
[6] Huang, J., et al., "Image indexing using color correlograms," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997.
[7] Jing, F., et al., "An efficient and effective region-based image retrieval framework," IEEE Transactions on Image Processing, 2004, 13(5): 699-709.
[8] Jing, F., et al., "A unified framework for image retrieval using keyword and visual features," IEEE Transactions on Image Processing, 2005, 14(7): 979-989.
[9] Jain, A.K., Duin, R.P.W., and Mao, J.C., "Statistical pattern recognition: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(1): 4-37.
[10] Chen, L., et al., "AP-based Borda voting method for feature extraction in TRECVID-2004," in Advances in Information Retrieval, 2005, pp. 568-570.
[11] Dwork, C., et al., "Rank aggregation methods for the Web," in Proceedings of the 10th International WWW Conference, 2001.
[12] van Erp, M., Vuurpijl, L.G., and Schomaker, L.R.B., "An overview and comparison of voting methods for pattern recognition," in Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition, 2002.
[13] Young, H.P., "Condorcet's theory of voting," American Political Science Review, 1988, 82(4).
[14] van Erp, M. and Schomaker, L., "Variants of the Borda count method for combining ranked classifier hypotheses," in Proceedings of the Seventh International Workshop on Frontiers in Handwriting Recognition, Amsterdam, 2000.


[15] Montague, M. and Aslam, J.A., "Condorcet fusion for improved retrieval," in Proceedings of the Eleventh International Conference on Information and Knowledge Management, ACM Press, McLean, Virginia, USA, 2002.
[16] Melnik, O., Vardi, Y., and Zhang, C.-H., "Mixed group ranks: preference and confidence in classifier combination," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(8): 973-981.
[17] Over, P., Kraaij, W., and Smeaton, A.F., "TRECVID 2005 - an introduction," in TRECVID 2005, NIST, 2005.
[18] Zhu, M., "Recall, precision and average precision," Department of Statistics & Actuarial Science, University of Waterloo, 2004.
[19] Xue, X., et al., "Fudan University at TRECVID 2005," in TRECVID 2005, NIST, 2005. http://www-nlpir.nist.gov/projects/tvpubs/tv5.papers/Fudan.pdf
[20] Chua, T.-S., et al., "TRECVID 2005 by NUS PRIS," in TRECVID 2005, NIST, 2005. http://www-nlpir.nist.gov/projects/tvpubs/tv5.papers/nus.pdf
[21] Yuan, J., et al., "Tsinghua University at TRECVID 2005," in TRECVID 2005, NIST, 2005. http://www-nlpir.nist.gov/projects/tvpubs/tv5.papers/tsinghua.pdf

Bo Zhang graduated from the Department of Automatic Control, Tsinghua University, in 1958. He is now a professor in the Computer Science and Technology Department, Tsinghua University, Beijing, China, and a member of the Chinese Academy of Sciences. His main research interests include artificial intelligence, robotics, intelligent control, and pattern recognition. He has published about 150 papers and 3 monographs in these fields.

Dayong Ding is a PhD student in the Department of Computer Science and Technology of Tsinghua University, Beijing, China. His research interests include content-based multimedia retrieval, machine learning, and cognitive science.

Ling Zhang graduated from the Department of Mathematics, Nanjing University, Nanjing, China, in 1960. He is now a professor in the Department of Computer Science, Anhui University, Hefei, China, and the director of the Artificial Intelligence Institute, Anhui University. His main interests are artificial intelligence, machine learning, neural networks, genetic algorithms, and computational intelligence. He has published more than 100 papers and 4 monographs in these fields.
