Relaxed Exponential Kernels for Unsupervised Learning
Karim Abou-Moustafa1, Mohak Shah2, Fernando De La Torre3, and Frank Ferrie1
1 Centre of Intelligent Machines, McGill University, 3480 University Street, Montréal, QC, H3A 2A7, Canada. {karimt,ferrie}@cim.mcgill.ca
2 Accenture Technology Labs, 161 N. Clark Street, Chicago, IL 60601, USA. [email protected]
3 The Robotics Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA. [email protected]

Abstract. Many unsupervised learning algorithms make use of kernels that rely on the Euclidean distance between two samples. However, the Euclidean distance is optimal only for Gaussian-distributed data. In this paper, we relax the global Gaussian assumption made by the Euclidean distance and propose a local Gaussian modelling of the immediate neighbourhood of each sample, resulting in an augmented data space formed by the parameters of the local Gaussians. To this end, we propose a convolution kernel for the augmented data space. The factorisable nature of this kernel allows us to introduce (semi-)metrics for this space, which in turn yield relaxed versions of known kernels for it. We present empirical results to validate the utility of the proposed localized approach in the context of spectral clustering. The key result of this paper is that combining the local Gaussian model with measures that adhere to metric properties yields much better performance in different spectral clustering tasks.
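As a rough illustration of the idea (a sketch only, not the authors' exact construction), the R snippet below summarises the neighbourhood of each sample by a local Gaussian and compares two samples through the closed-form convolution kernel between their Gaussians; the neighbourhood size k and the small regularisation term are illustrative assumptions.

# Sketch only, not the authors' exact kernel: fit a local Gaussian to the k nearest
# neighbours of each sample and compare samples with the closed-form convolution
# kernel between two Gaussians, K(N1, N2) = N(mu1 - mu2; 0, Sigma1 + Sigma2).
local_gaussians <- function(X, k = 10) {
  D <- as.matrix(dist(X))
  lapply(seq_len(nrow(X)), function(i) {
    nb <- order(D[i, ])[1:(k + 1)]                    # the point itself plus k neighbours
    list(mu = colMeans(X[nb, , drop = FALSE]),
         Sigma = cov(X[nb, , drop = FALSE]) + diag(1e-6, ncol(X)))  # regularised
  })
}
conv_kernel <- function(g1, g2) {
  S <- g1$Sigma + g2$Sigma
  d <- g1$mu - g2$mu
  exp(-0.5 * drop(t(d) %*% solve(S, d))) / sqrt((2 * pi)^length(d) * det(S))
}
set.seed(1)
X <- rbind(matrix(rnorm(100, 0), ncol = 2), matrix(rnorm(100, 3), ncol = 2))
G <- local_gaussians(X, k = 10)
K <- outer(seq_along(G), seq_along(G),
           Vectorize(function(i, j) conv_kernel(G[[i]], G[[j]])))
# K can now be used wherever a plain Gaussian kernel would be, e.g. as the affinity
# matrix of a spectral clustering algorithm.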


Classifier Ensemble Diversity in a Repeated Measurements Setup
W. Adler and S. Potapov
University Erlangen-Nuremberg, Germany
Abstract. Bootstrap aggregation (bagging) (Breiman, 1996) improves the result of a single classification tree by varying the training data per tree: for each tree a random sample is drawn with replacement. The predictions of many trees are then aggregated by majority voting. The improved performance of the ensemble results from a reasonable diversity of the single classification trees. Random forests (Breiman, 2001) increase this diversity by randomly selecting only a small subset of variables per tree node, from which the best split is chosen. Adler et al. (accepted) introduced bootstrap strategies to improve the performance of tree ensembles when repeated measurements are available. Again, the better performance is assumed to be due to increased variability between trees when the strategies are applied. To further examine this assumption, we review several distance measures and examine their ability to reflect the diversity in tree ensembles. A simulation model of repeated measurements is introduced and the proposed bootstrap strategies for longitudinal classification are applied to random forests and bagged classification trees. We calculate the diversity of the tree ensembles and discuss the results.

References Adler, W., Potapov, S., and Lausen, B. (accepted): Classification of repeated measurements data using tree-based ensemble methods. Computational Statistics. Breiman, L. (1996): Bagging Predictors. Machine Learning, 24(2), 123-140. Breiman, L. (2001): Random forests. Machine Learning, 45, 5-32.


Machine Learning based Approach for Hyper-parameter Optimization
Osman Akcatepe1, Lucas Drumond1, Tomas Horvat1, Lars Schmidt-Thieme1
Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
[osman, ldrumond, horvath, schmidt-thieme]@ismll.uni-hildesheim.de
Abstract. Hyper-parameter values form an important and crucial input of machine learning algorithms. However, the optimal hyper-parameter values, for which the error is minimal, depend on the data used. Thus, searching for adequate hyper-parameter values can be considered a kind of pre-processing step. The grid search method simply evaluates several combinations of hyper-parameter values, which form a grid, in order to find appropriate values. This method may provide some protection against ending the search in a local minimum of the error function, but it is not very effective. Even though several hyper-parameter optimization algorithms have been developed [1,2,3], it seems that in most cases machine learning researchers use grid search to find suitable hyper-parameter values. In [4] a random search for hyper-parameter optimization was introduced and its advantages over grid search were shown. In this paper we propose a method for optimizing hyper-parameter values by using machine learning algorithms. The idea of our approach lies between the ideas presented in [1] and [2], which exploit steepest ascent and pattern search algorithms, respectively. Pattern search is necessary when derivative-based methods cannot be used because the optimized function is not differentiable. However, pattern search requires predefined patterns which have a high impact on the outcome of the algorithm. The main concept of our approach is to detect regions of optimal hyper-parameter values iteratively. In each step we take randomly generated combinations of hyper-parameter values, determine their accuracy and derive new patterns in the error function. According to these patterns we generate new, more suitable combinations of hyper-parameter values and carry out the next iteration. We stop the search when stopping criteria are met. We report an experimental evaluation showing that our method is considerably more economical than conventional grid search, and we compare our approach to state-of-the-art approaches.
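As a point of comparison for the grid search baseline mentioned above, the following R sketch contrasts grid search and random search on a synthetic error surface; the two hyper-parameters and the error function are invented for illustration and are not part of the proposed method.

# Toy comparison of grid search and random search over two hyper-parameters; a
# synthetic error surface stands in for a cross-validated error.
set.seed(42)
cv_error <- function(logC, logGamma) {   # pretend CV error of a learner
  (logC - 1.3)^2 / 4 + (logGamma + 2.7)^2 / 2 + rnorm(1, sd = 0.05)
}
budget <- 25
grid <- expand.grid(logC = seq(-5, 5, length.out = 5),     # 5 x 5 grid = same budget
                    logGamma = seq(-8, 0, length.out = 5))
grid$err <- mapply(cv_error, grid$logC, grid$logGamma)
rand <- data.frame(logC = runif(budget, -5, 5),            # random search
                   logGamma = runif(budget, -8, 0))
rand$err <- mapply(cv_error, rand$logC, rand$logGamma)
grid[which.min(grid$err), ]   # best grid point
rand[which.min(rand$err), ]   # best random point, often closer to the optimum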

References CZOGIEL, I., LUEBKE, K. and WEIHS, C. (2006): Response Surface Methodology for Optimizing Hyper Parameters. Technical Report 10/2006, SFB 475, Department of Statistics, University of Dortmund, Germany.

Keywords MACHINE LEARNING, HYPER-PARAMETER, GRID SEARCH, OPTIMIZATION


Visualizing Data in Social and Behavioral Sciences: An Application of PARAMAP on Judicial Statistics
Ulas Akkucuk1, J. Douglas Carroll2 and Stephen L. France3
1 Bogazici University [email protected]
2 Rutgers University [email protected]
3 University of Wisconsin, Milwaukee [email protected]

Abstract. In this paper, we describe a technique called PARAMAP for the visualization, scaling, and dimensionality reduction of data in the social and behavioral sciences. PARAMAP uses a criterion of maximizing continuity between higher dimensional data and lower dimensional derived data, rather than the distance-based criterion used by standard multidimensional scaling (MDS). We introduce PARAMAP using an example based upon scaling and visualizing the voting patterns of Justices in the US Supreme Court. We use data on the agreement rates between individual Justices in the US Supreme Court and on the percentage swing votes for Justices over time. We use PARAMAP, metric MDS, and nonmetric MDS approaches to create a 'voting space' representation of the Justices in one, two and three dimensions. We test the results using a metric that measures neighborhood agreement of points between higher and lower dimensional solutions. We find that, given sufficient runs, PARAMAP can produce solutions with superior values of the agreement metric compared to the metric and nonmetric MDS solutions. PARAMAP produces smooth, easily interpretable solutions, with no clumping together of solution points.

Keywords PARAMAP, Visualization, MDS, Dimensionality Reduction, Judicial Data


On Correction of Similarity Indices for Chance Agreement in Cluster Analysis
A. Albatineh
Florida International University, United States of America [email protected]
Abstract. Similarity indices are used in cluster analysis to quantify agreement between two partitions of the same data set. Some of this agreement is due to chance. In this presentation, a family L of similarity indices which are linear in the matching counts is identified. It is shown that after correction for chance agreement, members of the L family become identical, hence the choice of the index to be used becomes irrelevant. We further derive means and variances of indices in the L family under fixed marginal totals of the matching counts matrix and independence of the algorithms. A proposal for correcting non-members of the family L, which greatly improves the performance of the indices, will be discussed. Simulations are performed using homogeneous and clustered data generated from a bivariate normal distribution to assess the proposed method.
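For readers unfamiliar with the correction, the general recipe is corrected = (S - E[S]) / (max S - E[S]); the R sketch below applies it to the Rand index, giving the adjusted Rand index (Hubert and Arabie, 1985), on two simulated partitions.

# Chance correction of a similarity index, illustrated with the Rand index: the
# corrected value is (S - E[S]) / (max(S) - E[S]), which for the Rand index yields
# the adjusted Rand index.
adjusted_rand <- function(part1, part2) {
  tab <- table(part1, part2)
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))
  sum_a <- sum(choose(rowSums(tab), 2))
  sum_b <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(n, 2)
  maximum <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (maximum - expected)
}
set.seed(1)
p1 <- sample(1:3, 100, replace = TRUE)
p2 <- ifelse(runif(100) < 0.8, p1, sample(1:3, 100, replace = TRUE))   # noisy copy of p1
adjusted_rand(p1, p2)   # close to 1 for strong agreement, around 0 under independence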

Keywords Similarity indices, correction for chance, matching counts, comparing partitions, Jaccard index


Some novel upper bounds for the number of modes of mixture densities
G. Alexandrovich
Philipps-Universität Marburg, Fachbereich Mathematik und Informatik [email protected]
Abstract. We present some novel upper bounds for the maximal number of modes in finite mixtures of Gaussian or t distributions. Such mixtures play an important role in model-based cluster analysis, since the number of modes can be associated with the number of clusters in the data. Such an approach to clustering is called the modality-based cluster concept (see Hennig, C.: 2009, Methods for merging Gaussian mixture components. Research Report no. 302, Department of Statistical Science, UCL). The main tool for deriving the bounds is the ridgeline theory introduced by Ray and Lindsay (2005, The topography of multivariate normal mixtures, Ann. Statist.). With that theory we obtain a connection between the zeros of a specific polynomial on the interval [0, 1] and the modes of the mixture density. The number of zeros of that polynomial can be estimated using tools from matrix algebra.
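A numerical sketch of the underlying ridgeline device (Ray and Lindsay, 2005) for a two-component Gaussian mixture is given below; the mixture parameters are invented, and the code merely counts local maxima of the density along the ridgeline rather than deriving the polynomial bounds of the abstract.

# Ridgeline of a two-component Gaussian mixture: all critical points of the mixture
# density lie on this curve, so its local maxima give the modes. Parameters are
# made up; boundary modes (at alpha = 0 or 1) would need an extra check.
library(mvtnorm)
mu1 <- c(0, 0); mu2 <- c(3, 1)
S1 <- diag(2);  S2 <- matrix(c(1, 0.5, 0.5, 1), 2)
w <- 0.45                                            # weight of the first component
ridgeline <- function(a) {
  A <- (1 - a) * solve(S1) + a * solve(S2)
  solve(A, (1 - a) * solve(S1) %*% mu1 + a * solve(S2) %*% mu2)
}
alpha <- seq(0, 1, length.out = 501)
xs <- t(sapply(alpha, ridgeline))
dens <- w * dmvnorm(xs, mu1, S1) + (1 - w) * dmvnorm(xs, mu2, S2)
modes <- which(diff(sign(diff(dens))) == -2) + 1     # interior local maxima
length(modes)                                        # estimated number of modes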

Keywords model based clustering, number of modes, mixture density


Application of Multi-Modal Features for Terrain Classification on a Mobile System
Marc Arends
University of Koblenz-Landau
Abstract. This paper presents an extended terrain classification procedure for an autonomous mobile robot using multi-modal features. Terrain classification is an important task in the field of outdoor robotics as it is essential for negotiability analysis and path planning. In this paper I present a novel approach that combines multi-modal features with a Markov random field to solve the terrain classification problem. The presented model uses features extracted from 3D laser range measurements and images and is adapted from a Markov random field used for image segmentation. Three different labels can be assigned to the terrain, describing the classes road for easy-to-pass flat ground, rough for hard-to-pass ground like grass or a field, and obstacle for terrain which needs to be avoided. Experiments showed that the algorithm is fast enough for real-time applications and that the classes road and street are detected with a rate of about 90%.


Pose-Consistent 3D Shape Segmentation Based on a Quantum Mechanical Feature Descriptor
Aubry1, Cremers2, and Schlickeweis3
TU München
Abstract. We propose a novel method for pose-consistent segmentation of nonrigid 3D shapes into visually meaningful parts. The key idea is to study the shape in the framework of quantum mechanics and to group points on the surface which have similar probability of presence for quantum mechanical particles. For each point on an object's surface these probabilities are encoded by a feature vector, the Wave Kernel Signature (WKS). Mathematically, the WKS is an expression in the eigenfunctions of the Laplace-Beltrami operator of the surface. It characterizes the relation of surface points to the remaining surface at various spatial scales. Gaussian mixture clustering in the feature space spanned by the WKS signature for shapes in several poses leads to a grouping of surface points into different and meaningful segments. This enables us to perform consistent and robust segmentation of new versions of the shape. Experimental results demonstrate that the detected subdivision agrees with the human notion of shape decomposition (separating hands, arms, legs and head from the torso, for example). We show that the method is robust to data perturbed by various kinds of noise. Finally we illustrate the usefulness of a pose-consistent segmentation for the purpose of shape retrieval.


DDC-RVK Concordance - First Findings from the Field of Medicine and Health
Uma Balakrishnan
VZG, Göttingen
Abstract. The growing demand from users for access to international library resources reinforces the need to create concordances between classification systems, desired for at least 15 years, for example between the Regensburger Verbundklassifikation (RVK), which is widely used in Germany, and the internationally widespread universal classification system, the Dewey Decimal Classification (DDC). At the same time, a complete concordance between these two classification systems remains an almost untouched field because of the considerable effort involved. For this reason, the subproject "coli-conc" (DDC concordances to other classification systems) of the VZG project Colibri/DDC was initiated at the end of 2009. At the beginning of 2011, work started on a concordance between the DDC and RVK systems for the field "Medicine and Health". Before and in parallel, a complete EZB-DDC concordance was created for the project "Nationallizenzen". To assess the state of development of existing DDC-RVK concordances, an online survey distributed via mailing lists was evaluated. In the "coli-conc" project, the DDC-RVK concordance is created semi-automatically using an exclusion principle. A set of about 1.5 million title records containing both DDC and RVK notations serves as the data basis for the semi-automatic creation of DDC-RVK concordances. As an example, a bidirectional concordance and its DDC-RVK correlation relations were determined for the DDC class "614 Forensic medicine; incidence of injuries, wounds, diseases; public preventive medicine". The talk presents the topics mentioned above together with first results.


Clustering considering local density of units
Vladimir Batagelj
University of Ljubljana, FMF, Dept. of Mathematics [email protected]
Abstract. The "standard" rule for determining the clusters in hierarchical clustering ("cut the dendrogram at a selected height h; the corresponding subtrees determine the clusters") is not valid when there are big differences in local densities in the space of units. In the paper some approaches and algorithms to obtain better clusterings in such cases are discussed. The first approach is to replace the height in the dendrogram with the clustering level and apply the rule to it. The second approach is to base the clustering procedures on criterion functions that consider the local density. Both approaches are illustrated with clustering results obtained for some artificial and some real-life data sets. The developed algorithms are implemented in R.
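The effect can be reproduced with the standard R machinery alone (hclust/cutree); the snippet below only illustrates the problem on simulated data with unequal local densities, not the density-aware criterion functions proposed in the paper.

# Two dense groups and one sparse group: under the standard rule the dense groups
# are merged before the sparse one has finished agglomerating.
set.seed(7)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),   # dense group
           matrix(rnorm(100, mean = 1, sd = 0.1), ncol = 2),   # dense group
           matrix(rnorm(100, mean = 5, sd = 1.0), ncol = 2))   # sparse group
truth <- rep(1:3, each = 50)
hc <- hclust(dist(x), method = "average")
table(cutree(hc, h = 1.5))        # fixed-height cut: the dense groups are already merged
                                  # while the sparse one is typically still fragmented
table(cutree(hc, k = 3), truth)   # a fixed number of clusters need not help either
                                  # when local densities differ strongly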

References KAUFMAN, L. and ROUSSEEUW, P.J. (1990): Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience.

Keywords CLUSTERING, DENSITY, CRITERION FUNCTION, ALGORITHM, R


Design of Experiments in Signal Analysis
Nadja Bauer, Julia Schiffner, and Claus Weihs
Chair of Computational Statistics, Faculty of Statistics, TU Dortmund {bauer,schiffner,weihs}@statistik.tu-dortmund.de
Abstract. Design of experiments is an important step in the optimization of algorithms or industrial processes. Usually the problem is to find the parameter set which leads to the optimal value of one or more target variables. The major problem here is finding a compromise between model validity and costs, which increase with the number of experiments. There are two types of experimental designs: classical and sequential. In classical designs all trial points are fixed before implementation. In sequential designs the next trial point, or the decision to stop the experiment, depends on the results of previous experiments. The second relevant problem is choosing an appropriate model which describes the relationship between parameters and target values. One of the recent approaches here is model combination, which can be used in sequential designs in order to improve automatic prediction of the next trial point. In this paper a music instrument classification algorithm is optimised using sequential parameter optimization with model combination. It will be shown that parameter optimization via design of experiments is considerably better than the usual parameter optimization via grid search or genetic optimization algorithms. Furthermore, the results of this application study will reveal whether the combination of many models brings improvements in finding the optimal parameter set, or whether it is still better to choose one appropriate model analytically.

References BARTZ-BEIELSTEIN, T., LASARCZYK, C., PREUSS, M. (2005): Sequential Parameter Optimization. In: McKay, B. et al. (Eds.): Proceedings 2005 Congress on Evolutionary Computation (CEC'05), Edinburgh, Scotland, Vol. 1. IEEE Press, Piscataway, NJ, 773-780. EICHHOFF, M. and WEIHS, C. (2010): Musical Instrument Recognition by High-Level Features, Proceedings of the 34th Annual Conference of the German Classification Society (GfKl), July 21-23, Karlsruhe, Germany. VAN DER LAAN, M. J., POLLEY, E. C. and HUBBARD, A. E. (2007): Super Learner. U.C. Berkeley Division of Biostatistics Working Paper Series, University of California, Berkeley, Paper 222.

Keywords SEQUENTIAL DESIGN OF EXPERIMENTS, MODEL COMBINATION, MUSIC INSTRUMENT CLASSIFICATION


Data Transformations in Data Mining
Andreas Baumgart and Ulrich Müller-Funk
European Research Center for Information Systems (ERCIS), University of Münster {baumgart,mueller-funk}@ercis.uni-muenster.de
Abstract. Within data preprocessing, various transformations have been proposed for the purpose of normalization or standardization. Quite often authors recommend specific transformations, e.g. from a robustness point of view. The problem is addressed in most textbooks on data mining in a basic way (cf. the references below), but it has not gained much attention in monographs on multivariate statistics. In statistical terms, a group of mappings on the sample space is specified and a maximal invariant statistic is determined that is insensitive with respect to outliers, blanks or noise. The topic is particularly relevant to clustering procedures. Here, the discussion combines two aspects. On the one hand, features become scale-free and comparable. On the other hand, the call for insensitivity aims at cluster stability. In this paper we deal with the groups of affine and monotone mappings, respectively. We consider several maximal invariants and discuss their relative merits within standard clustering devices.
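A minimal R sketch of the two standardizations most often contrasted in this discussion is given below: both the classical z-score and the median/MAD variant are invariant under the affine maps considered, but only the latter is insensitive to gross outliers.

# Classical vs robust standardization: both remove location and scale, so the result
# is invariant under affine maps x -> a * x + b, but the median/MAD version is far
# less sensitive to outliers.
standardize <- function(x) (x - mean(x)) / sd(x)
robust_standardize <- function(x) (x - median(x)) / mad(x)
set.seed(3)
x <- c(rnorm(98), 50, 60)              # a feature with two gross outliers
y <- 10 * x + 3                        # affine transformation of the same feature
all.equal(robust_standardize(x), robust_standardize(y))  # TRUE: invariance
summary(standardize(x)[1:98])          # classical scores are shrunk by the outliers
summary(robust_standardize(x)[1:98])   # robust scores keep the bulk on the usual scale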

References HAN, J., KAMBER, M. (2006): Data Mining. Concepts and Techniques. Morgan Kaufmann, San Francisco. PYLE, D. (1999): Data Preparation for Data Mining. Morgan Kaufmann, San Francisco. RUNKLER, T. A. (2010): Data Mining: Methoden und Algorithmen intelligenter Datenanalyse. Vieweg+Teubner, Wiesbaden.

Keywords DATA PREPROCESSING, DATA TRANSFORMATION, SENSITIVITY, CLUSTERING


A Fully Implicit Framework for Sobolev Active Contours and Surfaces
Baust1 and Navab2
Technische Universität München
Abstract. We present a convenient framework for Sobolev active contours and surfaces, which deliberately uses an implicit representation, in contrast to related approaches which use an implicit representation only for the computation of Sobolev gradients. Another difference to related approaches is that we use a Sobolev-type inner product, which has a better geometric interpretation, such as the ones proposed for Sobolev active contours. Since the computation of Sobolev gradients for surface evolutions requires the solution of partial differential equations on surfaces, we derive a numerical scheme which allows the user to obtain approximate Sobolev gradients even in linear complexity, if desired. Finally, we perform several experiments to demonstrate that the resulting curve and surface evolutions enjoy the same regularity properties as the original Sobolev active contours, and show the whole potential of our method by tracking the left ventricular cavity acquired with 4D MRI.


Model estimation and clustering through Schoenberg transformations
François Bavaud and Aris Xanthos
University of Lausanne {francois.bavaud,aris.xanthos}@unil.ch
Abstract. Schoenberg transformations (Schoenberg 1938) are componentwise mappings D̃_ij = ϕ(D_ij) of (squared) Euclidean distances into new Euclidean distances, embeddable in a higher-dimensional space. Bavaud (2011) presents their properties with applications to Discriminant Analysis and Robust Estimation. Let the data consist of a matrix of squared Euclidean distances (D_ij) between n observations, weighted by f_i > 0 (with Σ_i f_i = 1), and with coordinates embeddable in R^p. Given a family of Schoenberg transformations, the ML estimation of the location, shape and dispersion parameters (a, s, t) leads to minimizing the functional

G(α, s, t) = ln ∫_{R^p} exp(−t h_s(‖x‖²)) dx + t Σ_i f_i h_s(D_ia),  with D_ia = Σ_j α_j D_ij − (1/2) Σ_jk α_j α_k D_jk,

where D_ia is the distance to the centroid a in the original space, determined by the distribution α; t ≥ 0 is a dispersion parameter and h_s(D) denotes a family of Schoenberg transformations such as D^s (with 0 < s < 1) or ln(1 + sD) (with s > 0). In the same setting, a soft clustering is determined by the n × m membership matrix Z = (z_ig) with Σ_g z_ig = 1, inducing group weights ρ_g = Σ_i f_i z_ig, group distributions f_i^g = f_i z_ig / ρ_g and the transformed within-group inertia Σ_g Σ_i f_i z_ig ϕ(D_ig). Minimizing the latter, for a fixed mutual information between observations and groups, generates an alternating minimization scheme generalizing Gaussian mixture model clustering (e.g. Bavaud 2009 and references therein). Both procedures will be numerically illustrated on a few data sets.
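The defining property can be checked numerically in a few lines of R: applying the transformation D^s componentwise to squared Euclidean distances yields distances whose doubly centred Gram matrix remains positive semi-definite, i.e. they are again Euclidean-embeddable (a sketch of Schoenberg's theorem, not of the estimation or clustering procedures above).

# Numerical check of a Schoenberg transformation phi(D) = D^s with 0 < s < 1:
# the transformed squared distances are again Euclidean, i.e. the doubly centred
# Gram matrix -J phi(D) J / 2 is positive semi-definite.
set.seed(2)
X <- matrix(rnorm(40), ncol = 2)                # 20 points in R^2
D <- as.matrix(dist(X))^2                       # squared Euclidean distances
phi <- function(D, s = 0.5) D^s
gram <- function(D) {
  n <- nrow(D)
  J <- diag(n) - matrix(1 / n, n, n)            # centring matrix
  -0.5 * J %*% D %*% J
}
min(eigen(gram(phi(D)), symmetric = TRUE, only.values = TRUE)$values)
# non-negative up to numerical error: the transformed distances embed in a
# (possibly higher-dimensional) Euclidean space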

References BAVAUD, F. (2009): Aggregation invariance in general clustering approaches. Advances in Data Analysis and Classification, 3, 205–225. BAVAUD, F. (2011): On the Schoenberg Transformations in Data Analysis: Theory and Illustrations. To appear in the Journal of Classification, vol. 28. SCHOENBERG, I.J. (1938): Metric Spaces and Completely Monotone Functions. The Annals of Mathematics, 39, 811–841.

Keywords ROBUST LOCATION AND SHAPE ESTIMATION; SOFT CLUSTERING


Using Regression Trees for Raw Effluents Quality Prediction
Orlando Belo1 and António Sanfins2
1 Department of Informatics, School of Engineering, University of Minho, Portugal [email protected]
2 IDITE-Minho, Braga, Portugal [email protected]

Abstract. Nowadays it is quite relevant for any WasteWater Treatment Plant (WWTP) manager to know how and under what conditions its facilities and respective treatment units work, especially the ones related to the application of biological treatments. Among the different treatment stages that usually occur in a WWTP dealing with domestic wastewater, the stage where the organic load is removed (secondary treatment) assumes an indispensable function, ensuring a minimum quality for the treated wastewater. It is the most sensitive unit in a WWTP, reacting easily to load variations, flow rate or residual concentrations of harmful elements. Factors like these may lead to the death of the microorganisms responsible for the treatment or to significant changes in the kinetics of organic matter degradation, which obviously affects the quality of the final effluent. All this plays a major role in the removal of the pollutant load of wastewaters. In this work we present a study carried out on a specific WWTP, located in the north of Portugal, in order to monitor and control its secondary treatment units, in particular by predicting the impact of changes in its raw effluents so as to provide operational elements that may contribute to ensuring an adequate quality of the final effluents. We used several analysis parameters – pH, Chemical Oxygen Demand (COD), Biochemical Oxygen Demand after 5 days (BOD5), and others – to support the development of a data mining model, using regression trees, especially oriented to make such predictions. This predictive application will give us the possibility to determine such changes and analyse the overall operation of the WWTP for different periods of time, analysing their impact on the operation of the WWTP.
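A minimal regression-tree sketch in the spirit of this application is shown below; the data are synthetic stand-ins and only the variable names (pH, COD, BOD5) follow the abstract.

# Regression tree predicting an effluent quality indicator from raw-effluent
# measurements; the data below are simulated, not plant data.
library(rpart)
set.seed(10)
n <- 400
raw <- data.frame(pH = runif(n, 6, 9),
                  COD = rlnorm(n, 6, 0.4),            # chemical oxygen demand
                  flow = rlnorm(n, 4, 0.3))
raw$BOD5 <- 0.4 * raw$COD * (1 + 0.2 * (raw$pH < 6.5)) + rnorm(n, sd = 30)
fit <- rpart(BOD5 ~ pH + COD + flow, data = raw, method = "anova")
printcp(fit)                                          # complexity table used for pruning
predict(fit, newdata = head(raw))                     # predicted BOD5 for new samples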

Keywords Wastewater Treatment; Wastewater Treatment Plants Operation; Raw Effluents; Data Mining Techniques; Regression Trees; Raw Effluents Quality Prediction.


Bayesian Binary Quantile Regression with the bayesQR R-package
Dries F. Benoit1 and Dirk Van den Poel2
1 Ghent University, Tweekerkenstraat 2, 9000 Gent, Belgium [email protected]
2 Ghent University, Tweekerkenstraat 2, 9000 Gent, Belgium [email protected]

Abstract. This talk discusses a Bayesian method for quantile regression (Koenker and Basset, 1978) for dichotomous response data (Benoit and Van den Poel, 2011). The frequentist approach to this type of regression has proven problematic in both optimizing the objective function and making inference on the parameters. By accepting additional distributional assumptions on the error terms, the proposed Bayesian method sets the problem in a parametric framework in which these problems are avoided. To test the applicability of the method, we introduce the bayesQR R-package that implements the algorithms for binary quantile regression. We show two Monte Carlo experiments and an application to Horowitz's (1993) often-studied work-trip mode choice dataset. Compared to previous estimates for the latter dataset, the proposed method leads to a different economic interpretation.
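A usage sketch on simulated binary choice data is given below; the call follows the bayesQR package as I recall it (formula interface with quantile and ndraw arguments), so the exact argument names should be treated as assumptions and checked against the package documentation.

# Binary quantile regression with the bayesQR package on simulated choice data.
# The argument names below are assumptions recalled from the package documentation;
# verify them against the current version of bayesQR.
library(bayesQR)
set.seed(5)
n <- 500
x <- rnorm(n)
y <- as.numeric(0.5 + 1.2 * x + rlogis(n) > 0)   # dichotomous response
dat <- data.frame(y = y, x = x)
fit <- bayesQR(y ~ x, data = dat, quantile = 0.5, ndraw = 5000)
summary(fit)                                     # posterior summaries of the coefficients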

References BENOIT, D.F. and VAN DEN POEL, D. (2011): Binary quantile regression: a Bayesian approach based on the asymmetric Laplace distribution. Journal of Applied Econometrics, in press. BENOIT, D.F., YU, K. and VAN DEN POEL, D. (2011): bayesQR: Bayesian quantile regression. R-package available at CRAN: http://cran.r-project.org/web/packages/bayesQR/index.html. KOENKER, R. and BASSET, G. (1978): Regression Quantiles. Econometrica, 46, 33–50. HOROWITZ, J.L. (1993): Semiparametric estimation of a work-trip choice model. Journal of Econometrics, 58, 49–70.

Keywords QUANTILE REGRESSION, BINARY REGRESSION, BAYESIAN ESTIMATION, CHOICE MODELS


The Effect of Microarray Normalization in Resampling Approaches
Christoph Bernau1, Ferdinand Jamitzky2 and Anne-Laure Boulesteix1
1 Institute for Medical Information Sciences, Biometry, and Epidemiology, Marchioninistr. 15, 81377 München [email protected]
2 Leibniz Supercomputing Center of the Bavarian Academy of Sciences, Boltzmannstraße 1, 85748 Garching

Abstract. Microarray data have to be normalized before they can be analyzed appropriately. Common normalization techniques like RMA or VSN are multisample approaches which use the information from all microarrays available in the study. After normalization, microarrays are often used to construct prediction rules which for example discriminate between cancer and normal tissues. However, the obtained prediction rule might perform noticeably worse on new data which have been normalized separately using different normalization parameters. This problem has to be taken into account in resampling approaches used for classifier validation. In each resampling step, training and test data have to be normalized separately in order to mimic what happens when a classifier is used on a completely independent data set. This procedure also maintains the clear separation of training and test sets whose violation can in general induce a severe optimistic bias (for instance when variable selection is not performed on the training data set only but on the whole data set). In our study, that was performed at the computer clusters at the Leibniz Supercomputing Center due to its high computational requirements, we quantified the effect of separate normalization of training and test set for several resampling approaches on different real data sets. Additionally, we assessed the performance of normalization methods that try to remedy the problem of transferability to new data such as the add-on normalization proposed by Kostka & Spang (2008).
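The principle can be illustrated with a small R sketch in which a simple per-feature standardization stands in for multi-sample methods such as RMA or VSN: the normalization parameters are estimated on the training fold only and then applied unchanged ("add-on" style) to the test fold.

# Separate normalization inside a resampling loop; plain standardization stands in
# for RMA/VSN, which would be applied analogously (training parameters reused on test).
set.seed(11)
X <- matrix(rnorm(100 * 50), nrow = 100)         # 100 arrays x 50 features
y <- factor(sample(c("cancer", "normal"), 100, replace = TRUE))
folds <- sample(rep(1:5, length.out = 100))
for (k in 1:5) {
  train <- folds != k
  mu <- colMeans(X[train, ])
  s <- apply(X[train, ], 2, sd)
  X_train <- scale(X[train, ], center = mu, scale = s)
  X_test <- scale(X[!train, ], center = mu, scale = s)   # training parameters reused
  # ... fit the classifier on (X_train, y[train]) and evaluate it on (X_test, y[!train])
}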

References Kostka, D. and Spang, R. (2008): Microarray Based Diagnosis Profits from Better Documentation of Gene Expression Signatures. PLoS Computational Biology, 4, e22.

Keywords NORMALIZATION, PREPROCESSING, RESAMPLING, MICROARRAY


Multilevel clustering systems, with or without overlapping clusters, and dissimilarities
Patrice Bertrand
Ceremade, Universite Paris-Dauphine, France [email protected]
Abstract. Johnson and Benzecri set up a well-known one-to-one correspondence between hierarchical clusterings and ultrametric dissimilarities (cf. Johnson 1967, Benzecri 1973). Since the 1980s, several multilevel clustering structures have been proposed in order to allow overlapping clusters and to extend this one-to-one correspondence. In this talk, we investigate a general framework that enables us to compare all of these extensions of the Benzecri-Johnson bijection (cf. Batbedat 1988, Bertrand 2000, Barthelemy and Brucker 2008). By considering different definitions of a cluster, this general framework allows us to recover various types of clustering structures, such as weak hierarchies, pyramidal clusterings whose clusters are intervals on a line, and 2-3 hierarchies, in which each cluster overlaps at most one other cluster.

References Barthelemy J.-P., Brucker F., "Binary clustering", Journal of Discrete Applied Mathematics 156, 2008, p. 1237-1250. Batbedat A., "Les isomorphismes HTS et HTE (après la bijection de Benzecri / Johnson)", Metron 46, 1988, p. 47-59. Benzecri J.P., L'Analyse des Donnees, Dunod, Paris, 1973. Bertrand P., "Set Systems and Dissimilarities", Europ. J. Combinat. 21, 2000, p. 727-743. Johnson S.C., "Hierarchical clustering schemes", Psychometrika 32(3), 1967, p. 241-254.

Keywords Multilevel clustering, overlapping clusters, dissimilarities


A Theoretical and Empirical Analysis of the Black-Litterman Model
Wolfgang Bessler1 and Dominik Wolff2
1 Center for Finance and Banking, University of Giessen, Licher Strasse 74, 35394 Giessen, Germany [email protected]
2 Center for Finance and Banking, University of Giessen, Licher Strasse 74, 35394 Giessen, Germany [email protected]

Abstract. In this paper we analyze the Black-Litterman model analytically and show how it combines equilibrium returns with subjective estimates of asset returns. The Black-Litterman model aims to enhance standard mean-variance portfolio optimization and to overcome its well-known shortcomings such as extreme portfolio weights and their high sensitivity to the respective input parameters. When applying the Black-Litterman framework, a major problem is the estimation of the required subjective return forecasts. Empirically, we demonstrate how an investor might incorporate alternative subjective return forecasts for stocks and stock indices. For this purpose, we utilize information on price-earnings ratios and dividend-price ratios, which have formerly been shown to reflect information regarding future stock returns. Furthermore, we analyze how these subjective return estimates affect optimal portfolio weights. We use regional MSCI indices to calculate optimal portfolio weights for both the Black-Litterman approach and classic mean-variance optimization. The empirical analysis focuses on developed, emerging and frontier markets for the time period between 2002 and 2009. Controlling for different levels of risk aversion, realistic investment constraints and diverging subjective return estimates, we find that the Black-Litterman approach is suitable for overcoming most of the aforementioned shortcomings of mean-variance optimization. In fact, the resulting portfolios are more diversified, portfolio weights are less extreme and therefore economically reasonable, and react less sensitively to changes in the relevant input data. However, we also emphasize some drawbacks that remain even after applying the Black-Litterman approach to asset allocation.
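For reference, the core combination step of the Black-Litterman model can be written down in a few lines of R; the asset universe, the single view and all numbers below are invented for illustration.

# Black-Litterman combination of equilibrium returns with one subjective view:
# mu_BL = [ (tau*Sigma)^-1 + P' Omega^-1 P ]^-1 [ (tau*Sigma)^-1 pi + P' Omega^-1 q ]
Sigma <- matrix(c(0.040, 0.012, 0.010,
                  0.012, 0.030, 0.008,
                  0.010, 0.008, 0.050), 3, 3)     # covariance of three asset classes
w_mkt <- c(0.5, 0.3, 0.2)                         # market-capitalisation weights
delta <- 2.5                                      # risk-aversion coefficient
pi_eq <- delta * Sigma %*% w_mkt                  # implied equilibrium returns
tau <- 0.05
P <- matrix(c(1, -1, 0), nrow = 1)                # view: asset 1 outperforms asset 2 ...
q <- 0.02                                         # ... by 2% per year
Omega <- matrix(tau * P %*% Sigma %*% t(P))       # one common choice of view uncertainty
A <- solve(tau * Sigma) + t(P) %*% solve(Omega) %*% P
mu_BL <- solve(A, solve(tau * Sigma) %*% pi_eq + t(P) %*% solve(Omega) %*% q)
round(cbind(equilibrium = as.numeric(pi_eq), black_litterman = as.numeric(mu_BL)), 4)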

Keywords PORTFOLIO OPTIMIZATION, BLACK-LITTERMAN


Active Learning for Automated Identification of Components in 3D Ultramicroscopy Images
B. Bischl1, L. Schlieker1, U. Leischner2, H.-U. Dodt3, and C. Weihs1
1 Faculty of Statistics, Dortmund University of Technology, Germany
2 Institute of Neuroscience, Munich University of Technology, Germany
3 Department of Bioelectronics, Vienna University of Technology, Austria

Abstract. We recently presented a novel three-dimensional microscopical technique for imaging non-stained, fixed brain samples with cubic size in the mm-range and a sub-micron resolution. This technique makes use of an induced fluorescence signal, caused by the fixative formalin. These pictures contain many different objects (cell bodies, blood vessels, dendrites), and an automated data analysis procedure requires a classification of the objects into the mentioned categories. We apply a watershed segmentation to these 3D image stacks, and analyze each segment individually. We extract features from properties of the gray value distribution and from the three-dimensional geometry of the object. Based on these features, we classify the objects using supervised models. A general problem is that some of the 3D components have to be labeled manually by an expert, which is time-consuming, before supervised techniques can be applied, and it is not clear which of the many potential objects should be selected for labeling. Here we follow an active learning approach which interactively: (a) presents components to the expert which should be most helpful for improving the classifier; (b) gives feedback to the user about the current quality of the model, so the labeling process can be stopped at a reasonable point in time.

References Leischner, U., Schierloh, A., Zieglgänsberger, W. and Dodt, H.-U. (2010): Formalin-Induced Fluorescence Reveals Cell Shape and Morphology in Biological Tissue Samples, PLoS ONE 5. Settles, B. (2009): Active Learning Literature Survey. University of Wisconsin-Madison.

Keywords CLASSIFICATION, ACTIVE LEARNING, 3D IMAGE ANALYIS


Efficient Sampling and Handling of Variance in Tuning Model Chains with Kriging
B. Bischl1, P. Koch2, W. Konen2, and C. Weihs1
1 Faculty of Statistics, Dortmund University of Technology, Germany
2 Faculty for Computer and Engineering Science, Cologne University of Applied Sciences, Germany

Abstract. The complex data in real-world machine learning problems often require a sequence of processing steps, e.g. pre-processing, feature filtering, model training and possibly post-processing. All of these require the setting of parameters and often no generally valid guidelines exist which ensure optimal results. Data-dependent algorithms are needed to allow the practitioner to achieve these results with minimal effort and time. Surrogate-model based tuning - often using a Kriging model - has been proposed to rectify this, and we will demonstrate how the techniques from this field can be efficiently applied to chains of machine learning processing steps. Here, two issues arise: (a) The target function in these types of experiments will be stochastic due to resampled performance values, and it is still an open issue how this aspect can be handled best when using Kriging surrogate models. (b) The sampling strategy for evaluating the model is not fixed but can be adapted to exploit a tradeoff between precision and time complexity. Also information regarding the variance of the target value is available which can be incorporated into the surrogate model.

References Forrester, A. I. J., Keane, A. J., and Bressloff, N. W. (2006): Design and Analysis of “Noisy” Computer Experiments. AIAA Journal, 44(10). Koch, P., Bischl, B., Flasch, O., Bartz-Beielstein, T., and Konen, W. (2011): On the Tuning and Evolution of Support Vector Kernels. Cologne University of Applied Science, Faculty of Computer Science and Engineering Science. Konen, W., Koch, P., Flasch, O., and Bartz-Beielstein, T. (2010): Parameter-Tuned Data Mining: A General Framework. In F. Hoffmann and E. Hüllermeier (Eds.), Proceedings 20. Workshop Computational Intelligence.

Keywords MACHINE LEARNING, PARAMETER TUNING, KRIGING


An efficient algorithm for the detection and classification of horizontal gene transfer events and identification of mosaic genes
Alix Boc, Dunarel Badescu, Abdoulaye Baniré Diallo and Vladimir Makarenkov
Département d'Informatique, Université du Québec à Montréal, C.P. 8888, Succursale Centre-Ville, Montréal (Québec), H3C 3P8, Canada.
Abstract. Bacteria and viruses adapt to varying environmental conditions through the formation of mosaic genes, which are composed of alternating sequence parts belonging either to the original host gene or stemming from the integrated donor sequence (Doolittle 1999, Koonin 2003, Zhaxybayeva et al. 2004). An accurate identification and classification of mosaic genes, as well as the detection of the related partial horizontal gene transfers, are among the most important challenges posed by modern computational biology (Zheng et al. 2004). The partial horizontal gene transfer model assumes that any part of a gene can be transferred among the organisms under study; a traditional (complete) horizontal gene transfer model assumes that only an entire gene, or a group of complete genes, can be transferred (Makarenkov et al. 2006a and b). We describe an efficient algorithm for detecting partial horizontal gene transfer events and the corresponding intragenic recombination giving rise to the formation of mosaic genes. A bootstrap validation procedure incorporated in the algorithm can be used to assess the statistical support of each predicted partial gene transfer (Boc et al. 2010). The proposed technique can also be used to confirm or discard complete horizontal gene transfers detected by any existing horizontal gene transfer inference algorithm, and thus to classify the detected gene transfers as partial or complete. The new algorithm will be applied on a full-genome scale to estimate the level of mosaicism in the genomes of prokaryotes as well as the rates of complete and partial gene transfers triggering their evolution.

Keywords BIOINFORMATICS ALGORITHM, HORIZONTAL GENE TRANSFER (HGT), HGT CLASSIFICATION, MOSAIC GENE, RECOMBINATION


Spectral Clustering of ROIs for Object Discovery
Bodesheim1
University of Jena
Abstract. Object discovery is one of the most important applications of unsupervised learning. This paper investigates several spectral clustering techniques to attain a categorization of objects in images without additional information such as class labels or scene descriptions. Because background textures bias the performance of image categorization methods, a generic object detector based on some general requirements on objects is applied. The object detector provides rectangular regions of interest (ROIs) as object hypotheses, independent of the underlying object class. Feature extraction is constrained to these bounding boxes to decrease the influence of background clutter. Another aspect of this work is the use of a Gaussian mixture model (GMM) instead of the k-means step usually applied after feature transformation in spectral clustering. Several experiments have been carried out, and the combination of spectral clustering techniques with the object detector is compared to the standard approach of computing features on the whole image.
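The spectral step and the GMM alternative can be sketched in R as follows; the input is a simulated feature matrix standing in for the ROI descriptors, and the bandwidth and number of clusters are illustrative choices.

# Spectral embedding followed by a Gaussian mixture model instead of k-means in the
# embedded space; the feature matrix is simulated and stands in for ROI descriptors.
library(mclust)
set.seed(4)
X <- rbind(cbind(rnorm(30, 0, 0.7), rnorm(30, 0, 0.7)),
           cbind(rnorm(30, 4, 0.7), rnorm(30, 0, 0.7)),
           cbind(rnorm(30, 2, 0.7), rnorm(30, 3, 0.7)))
k <- 3
n <- nrow(X)
W <- exp(-as.matrix(dist(X))^2 / (2 * 1^2))        # Gaussian affinity, bandwidth 1
diag(W) <- 0
Dm <- diag(1 / sqrt(rowSums(W)))
L <- diag(n) - Dm %*% W %*% Dm                     # normalized graph Laplacian
U <- eigen(L, symmetric = TRUE)$vectors[, (n - k + 1):n]   # k smallest eigenvectors
U <- U / sqrt(rowSums(U^2))                        # row normalization (Ng-Jordan-Weiss)
table(kmeans(U, centers = k, nstart = 10)$cluster) # usual final step
table(Mclust(U, G = k)$classification)             # GMM alternative in the embedded space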


Finding two-level structure in field recordings of folk music
Ciril Bohak and Matija Marolt
University of Ljubljana, Faculty of Computer and Information Science {ciril.bohak, matija.marolt}@fri.uni-lj.si
Abstract. Ethnomusicological field recordings are audio recordings of folk music that also contain interviews with performers. Such field recordings may contain several hours of recorded audio consisting of different content such as speech, singing, instrumental music, event recordings etc. It is easy for a human to split field recordings into meaningful parts (units) according to their content types, such as songs and interviews, as well as stanzas and verses. Manual segmentation is a very time-consuming task for large archives of field recordings. In this paper we present a two-level segmentation algorithm. The algorithm segments a recording on two levels: on the first (high) level, a recording is split into units representing parts of the same content type (e.g. singing, speech), while on the second (low) level, high-level parts containing singing and instrumental music are split into individual repeating parts (stanzas). We first test and compare different algorithms for high- and low-level segmentation on a collection of field recordings of Slovenian folk music. Next we present a novel approach which combines high- and low-level segmentation. A probabilistic model is used for high-level segmentation, while for low-level segmentation a combination of vocal pause detection and self-similarity of chroma vectors is used. The algorithm makes use of the low-level boundaries between stanzas to improve estimation of high-level boundaries between different units. We evaluate the performance of the proposed algorithm on a collection of field recordings and show that the two-level approach outperforms the high-level one.

References MAROLT M. (2009): Probabilistic segmentation and labeling of ethnomusicological field recordings. In Proceedings of ISMIR 2009, Kobe, Japan, 75-80. MÜLLER M., GROSCHE P., WIERING F. (2009): Robust segmentation and annotation of folk song recordings. In Proceedings of ISMIR 2009, Kobe, Japan, 735-740.

Keywords FOLK MUSIC, SEGMENTATION, MUSIC STRUCTURE EXTRACTION


Complexity selection and cross-validation in lasso and sparse PLS with high-dimensional data
Anne-Laure Boulesteix, Adrian Richter, Christoph Bernau
Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie, Universität München (LMU), [email protected]
Abstract. Sparse regression and classification methods are commonly applied to high-dimensional data to simultaneously build a prediction rule and select relevant predictors. The well-known lasso regression and the more recent sparse partial least squares (SPLS) approach are important examples. In such procedures, the number of identified relevant predictors typically depends on a complexity parameter that has to be adequately tuned. Most often, parameter tuning is performed via cross-validation (CV). In the context of lasso-penalized logistic regression and SPLS classification, this paper addresses three important questions related to complexity selection: 1) Does the number of folds in CV affect the results of the tuning procedure? 2) Should CV be repeated several times to yield less variable tuning results? 3) Are variable selection and complexity selection robust against resampling?
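For the lasso part, the questions can be made concrete with glmnet in a few lines; the simulated data and the choices of 5 and 10 folds are illustrative.

# Lasso-penalized logistic regression: vary the number of CV folds, repeat CV with
# fresh random fold assignments, and record how many predictors the tuned model keeps.
library(glmnet)
set.seed(21)
n <- 100; p <- 1000
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))         # only two truly relevant predictors
n_selected <- function(nfolds) {
  cv <- cv.glmnet(X, y, family = "binomial", nfolds = nfolds, type.measure = "class")
  sum(coef(cv, s = "lambda.min") != 0) - 1         # non-zero coefficients minus intercept
}
replicate(5, n_selected(nfolds = 5))    # repeated 5-fold CV: variability of the selection
replicate(5, n_selected(nfolds = 10))   # the same with 10 folds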

References CHUN, D., KELES, S. (2010): Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society, 72, 3–25. CHUNG, D., KELES, S. (2010): Sparse Partial Least Squares Classification for High Dimensional Data. Statistical Applications in Genetics and Molecular Biology, 9, 17.

Keywords PARAMETER TUNING, RESAMPLING, STABILITY, SPARSE REGRESSION, VARIABLE SELECTION


Model-based Clustering of Time Series in Group-specific Functional Subspaces
Charles Bouveyron and Julien Jacques
Université Paris 1, France [email protected]
Abstract. This work develops a general procedure for clustering functional data which adapts the efficient clustering method High Dimensional Data Clustering (HDDC), originally proposed in the multivariate context. The resulting clustering method, called funHDDC (functional HDDC), is based on a functional latent mixture model which fits the functional data in group-specific functional subspaces. By constraining model parameters within and between groups, a family of parsimonious models is exhibited which allows fitting various situations. An estimation procedure based on the EM algorithm is proposed for estimating both the model parameters and the group-specific functional subspaces. Experiments on real-world datasets show that the proposed approach performs better than or similarly to classical clustering methods, while providing useful interpretations of the groups.


Innovative Indexing and Provision of Music Documents in the PROBADO Project
Katrin Braun
Bayerische Staatsbibliothek München, Digitale Bibliothek / Münchener Digitalisierungszentrum [email protected]
Abstract. In the project PROBADO (www.probado.de), funded by the Deutsche Forschungsgemeinschaft, three university computer science institutes and two libraries develop innovative methods for non-textual documents and integrate them prototypically into library workflows for the areas of 3D architectural models and music documents. Audio recordings and printed music are the focus of the subproject Probado-Musik, which is supervised by the Bayerische Staatsbibliothek and the Institut für Informatik III of the University of Bonn (research group of Professor Clausen). The three most important goals of this subproject are the extensive automation of the indexing and cataloguing of music documents, the application of novel content-based search methods for music, and the construction of user-friendly interfaces for the convenient search and provision of these music documents. For this purpose a music repository has been built that contains music documents in various formats. These include scanned printed scores in TIFF and JPEG format, audio files in WAV and MP3 format, as well as symbolic music formats such as Music-XML and the output formats of OMR (Optical Music Recognition) programs. The music repository now contains about 100,000 pages of scanned printed scores and audio files from several hundred CDs. The associated metadata are obtained from the catalogue of the Bayerische Staatsbibliothek and external data sources and stored in an FRBR-based database. (Katrin Braun, Münchener Digitalisierungszentrum / Digitale Bibliothek, Bayerische Staatsbibliothek)


Parametric analysis of interval data
Paula Brito1, A. Pedro Duarte Silva2 and José G. Dias3
1 Faculdade de Economia & LIAAD-INESC Porto LA, Universidade do Porto, Rua Dr. Roberto Frias, 4200-464 Porto, Portugal [email protected]
2 Faculdade de Economia e Gestão & CEGE, Universidade Católica Portuguesa at Porto, Porto, Portugal [email protected]
3 ISCTE – Instituto Universitário de Lisboa, UNIDE, Av. das Forças Armadas, Lisboa 1649–026, Portugal [email protected]

Abstract. In this work, we focus on the analysis of interval data, i.e., where elements are described by variables whose values are intervals of IR. Parametric probabilistic models for interval-valued variables are proposed and studied in Brito and Duarte Silva (in press) where each observed interval is represented by its midpoint and log-range, for which Normal and Skew-Normal distributions are assumed. The intrinsic nature of the interval variables leads to special structures of the variancecovariance matrix, represented by five different possible configurations. This framework may be applied to different statistical multivariate methodologies, allowing for inference approaches for symbolic data. In Duarte Silva and Brito (2006), we have studied and compared different methods for linear discriminant analysis of interval data, which rely on non-parametric approaches. We now adopt the proposed parametric modelling in linear and quadratic discriminant analysis of data described by interval-valued variables and compare the performance of the new approach with previous proposals. Furthermore, we investigate mixture distributions, exploring the behaviour of model-based clustering using the proposed parametric modelling.
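The representation underlying these models can be written down directly: each observed interval [lower, upper] is mapped to its midpoint and log-range, and the Normal or Skew-Normal models are then fitted to these two quantities (the intervals below are simulated).

# Midpoint / log-range representation of interval data (simulated intervals).
set.seed(8)
n <- 200
centre <- rnorm(n, 10, 2)
halfwd <- rlnorm(n, 0, 0.5)
intervals <- data.frame(lower = centre - halfwd, upper = centre + halfwd)
aug <- with(intervals, data.frame(midpoint = (lower + upper) / 2,
                                  log_range = log(upper - lower)))
colMeans(aug)   # sample moments on which the Gaussian (or Skew-Normal) model is based
cov(aug)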

References AZZALINI, A. and DALLA VALLE, A. (1996): The multivariate Skew-Normal distribution. Biometrika 83 (4), 715–726. BRITO, P. and DUARTE SILVA, A.P. (2011): Modelling interval data with Normal and Skew-Normal distributions. Journal of Applied Statistics (in press). DUARTE SILVA, A.P. and BRITO, P. (2006): Linear discriminant analysis for interval data. Computational Statistics 21, (2), 289–308. FRALEY, C. and RAFTERY A.E. (1998): How many clusters ? Which clustering method ? Answers via model-based cluster analysis. The Computer Journal, 41, 8:578–588. NOIRHOMME-FRAITURE, M. and BRITO, P. (2011): Far Beyond the Classical Data Models: Symbolic Data Analysis. Stat. Anal. Data Mining (in press).

Keywords DISCRIMINANT ANALYSIS, INTERVAL DATA, MODEL-BASED CLUSTERING, SYMBOLIC DATA ANALYSIS


Mortality in EU countries - dependence analysis with the use of log-linear models
Justyna Brzezinska
Department of Statistics, University of Economics in Katowice, 1 Maja 50, 40-287 Katowice [email protected]
Abstract. This paper is concerned with the use of log-linear models for categorical data, which are used to analyze large datasets in contingency tables. A log-linear model is a linear combination of possible effect parameters, ranging from the saturated model to the null model. Hierarchical log-linear models include the lower-order terms implied by any higher-order ones. The fit of a log-linear model can be assessed with the Pearson or likelihood-ratio chi-square statistic, as well as the information criteria AIC (Akaike [1973]) and BIC (Raftery [1986]). With the use of differences between chi-square statistics (partial chi-square) it is possible to test the statistical significance of additional terms in the model. Models generate expected values for cell frequencies. Identifying the simplest model, the model with the fewest parameters that generates expected frequencies not too discrepant from the observed ones, is a major goal of the analysis. Log-linear analysis is available in R with the loglm function in the MASS library. The empirical application of log-linear analysis will be based on a mortality dataset for the European Union countries.
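A minimal example of the R workflow with loglm is given below; the simulated three-way table stands in for the EU mortality data, and the two fitted models are chosen only to show a partial (difference) chi-square test.

# Hierarchical log-linear models with MASS::loglm on a simulated three-way table.
library(MASS)
set.seed(9)
df <- data.frame(country = sample(c("A", "B", "C"), 2000, replace = TRUE),
                 sex = sample(c("m", "f"), 2000, replace = TRUE),
                 cause = sample(c("circulatory", "cancer", "other"), 2000,
                                replace = TRUE, prob = c(0.4, 0.3, 0.3)))
tab <- xtabs(~ country + sex + cause, data = df)
m1 <- loglm(~ country + sex + cause, data = tab)             # mutual independence
m2 <- loglm(~ country + sex + cause + country:cause, data = tab)
anova(m1, m2)   # partial likelihood-ratio test for the added country:cause term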

References AKAIKE, H. (1973): Information theory and an extension of the maximum likelihood principle. In: Petrov, B. N. and Csaki, F. (Eds.), Proceedings of the 2nd International Symposium on Information Theory. Akademiai Kiado, Budapest. AGRESTI, A. (2002): Categorical data analysis. Wiley & Sons, Hoboken, New Jersey. CHRISTENSEN, R. (1997): Log-linear models and logistic regression. Springer-Verlag, New York. RAFTERY, A.E. (1986): Choosing models for cross-classification. Amer. Sociol. Rev. 51, 145-146.

Keywords CONTINGENCY TABLE, LOG-LINEAR MODEL, HIERARCHICAL LOG-LINEAR MODELS


Factor Preselection and Multiple Measures of Dependence
Nina Büchel, Kay F. Hildebrand, and Ulrich Müller-Funk
European Research Center for Information Systems (ERCIS), University of Münster buechel|hildebrand|[email protected]
Abstract. In Data Mining there is a special need for a process by which relevant factors of influence are identified in order to achieve a balance between bias and noise. Data Mining – in contrast to ordinary data analysis – is characterized by 1. a large number of data sets coming from a heterogeneous universe and 2. a comparatively large number of factors measured on different scales. Both features together imply that the data have to be modeled by a general, unidentifiable finite mixture model. Accordingly, procedures which are linear or likelihood-based become obsolete. In that context, and with black-box methods (e.g. EBPN, SVM), regularization leads to rather complex optimization problems. Furthermore, it is difficult to complement them with testing procedures replacing the F- or LQ-test. Trees, on the other hand, can only deal successfully with a limited number of features. Therefore, it makes sense to incorporate factor selection into data preprocessing. With a large number of factors, the selection procedure requires a suitable process model. In this paper, we discuss procedures proposed in the literature (cf. Lee and Verleysen (2007)) and come up with a proposal of our own. Its implementation requires multiple measures of dependence. Few such indices have been introduced (cf. Wolff (1980), Jouini and Clemen (1996), Mari and Kotz (2001)). Thus, we develop some kind of non-linear factor/canonical correlation analysis.

References M. Jouini and R. Clemen (1996): Copula models for aggregating expert opinions. Oper. Research, 44, n 3, 444-457. J. Lee and M. Verleysen (2007): Nonlinear Dimensionality Reduction. Springer. F. Schmid et. al. (2010): Copula-Based Measures of Multivariate Association. Springer. E. Wolff (1980): N-dimensional measures of dependence. Stochastica, 4, 175-188.

Keywords FACTOR SELECTION, DATA PREPROCESSING, DEPENDENCIES


On dynamic Hurdle models for longitudinal zero-inflated count data
J. Bulla1 and A. Maruotti2
1 Université de Caen, France [email protected]
2 Università Roma Tre [email protected]

Abstract. The models presented deal with the analysis of a longitudinal dataset of buying behavior, i.e. we record a set of customer information over several time periods. Such a data structure shows a number of characteristics that need to be described, such as the dependence of the dependent variables on covariates, serial dependence, and heterogeneity among the customers. Several model specifications have been proposed to model correlation in a longitudinal setting (see e.g. Molenberghs et al., 2010). In our empirical study, customers may be subject to several factors affecting their buying behavior, such as advertisements or promotional offers. Over time, the influence of these factors on the customer's buying behavior may vary in an unknown (unobserved) way. In this setting, the hurdle-Poisson Hidden Markov Model (HMM) (Alfò and Maruotti, 2010), a dynamic extension of the classical hurdle model (Mullahy, 1986), can be applied. We analyze the performance of this model and show how to extend the latent process from a Markov to a semi-Markov chain, utilizing a computationally convenient and easily derivable approach inspired by Durbin et al. (1998) and Langrock & Zucchini (2011). As a by-product of the model estimation algorithm, it is possible to classify customers into several relationship states.

Keywords Hidden Markov Model, hidden semi-Markov Model, time series clustering, Hurdle model


Improving Denoising Algorithms via a Multi-Scale Meta-Procedure
Burger1 and Harmeling2
MPI
Abstract. Many state-of-the-art denoising algorithms focus on recovering high-frequency details in noisy images. However, images corrupted by large amounts of noise are also degraded in the lower frequencies. Thus properly handling all frequency bands allows us to better denoise in such regimes. To improve existing denoising algorithms we propose a meta-procedure that applies existing denoising algorithms across different scales and combines the resulting images into a single denoised image. With a comprehensive evaluation we show that the performance of many state-of-the-art denoising algorithms can be improved.


Restricted Unfolding: Preference Analysis with Optimal Transformations of Preferences and Attributes
Frank M.T.A. Busing
Leiden University, Netherlands
Abstract. The fundamentals of preference mapping are revisited in the context of a new restricted unfolding method that has potential for wide application to product optimization for consumers. Since more of an attribute is not necessarily preferred, the unfolding distance model provides estimates of ideal points for consumers and therefore provides a more adequate representation of the preference relationships, compared to conventional internal preference mapping. Compared to other ideal point methods, the new unfolding technique offers advantages in terms of allowing for the ordinal nature of the ratings, rather than implicitly assuming that ratings are linear. The proposed restricted unfolding model incorporates property fitting, both passive, as a separate, second step, and active, as a restriction on the product locations. This is also available as a restriction on the respondents' locations and as such establishes a link between internal and external preference mapping. Different transformation functions for the restricting variables lead to more or less restricted areas for the ideal points, thus showing closer resemblance with the variables' measurement scales and allowing for better fitting configurations.

Keywords unfolding, preferences, attributes, transformations, restrictions

Topology and Classification Gunnar Erik Carlsson Stanford University, USA [email protected] Abstract. Recent years have seen the development of topological techniques whose goal can be informally stated as understanding the shape of data. Methods such as these can construct very compressed representations of large data sets, which retain important geometric features. Such representations can be very useful in classification questions, where the classes are not well separated into distinct components but rather have a continuous aspect which is accurately described by the geometric representation. I will discuss these new developments, with numerous examples.

Keywords compressed representations, large data sets

Statistical inference for the latent block model: a review Gilles Celeux INRIA, France. [email protected] Abstract. Model-based block clustering is concerned with the simultaneous clustering of the rows and the columns of a data matrix. It has received much attention recently since it can be useful in domains where large data matrices are available, such as document classification or genomics data analysis. In this talk I will present the main approaches developed to estimate the parameters of the latent block model. The particular difficulties involved with this model will be described, as well as the solutions to overcome them. The maximum likelihood and the Bayesian perspectives for deriving a block clustering from the latent block model will be discussed, as will the identifiability issues. Special attention will be paid to the model selection problem, which concentrates most of the latent block model difficulties.

Keywords Clustering

Multivariate Outlier Detection and robust clustering
Andrea Cerioli1, Marco Riani1, and Anthony C. Atkinson2
1 Dipartimento di Economia, Università di Parma, Italy {andrea.cerioli}{mriani}@unipr.it
2 Department of Statistics, The London School of Economics, UK [email protected]

Abstract. Multivariate outliers are usually identified by means of robust distances. A statistically principled rule for accurate and powerful outlier detection requires: 1. Availability of a good approximation to the finite-sample distribution of the robust distances under the postulated model for the “good” part of the data; 2. Correction for the multiplicity implied by repeated testing of all the observations for outlyingness. In this talk I will review some key issues arising in the implementation of 1 and 2. I will start from the seminal work of Wilks (1963), then browse through the high-breakdown approach which has prevailed in the last twenty years (Hubert et al., 2008), and finally conclude with some recent results, which come closest to the hoped-for solution. These results are available both in the case of robust distances from high-breakdown estimators (Cerioli, 2010; Cerioli and Farcomeni, 2011) and for the recently established approach based on the Forward Search, which enjoys a flexible data-driven level of trimming (Riani et al., 2009). If time allows, I will also sketch the potential of the Forward Search for the purpose of robust clustering.

References CERIOLI, A. (2010): Multivariate Outlier Detection With High-Breakdown Estimators. Journal of the American Statistical Association, 105, 147–156. CERIOLI, A. and FARCOMENI, A. (2011): Error rates for multivariate outlier detection. Computational Statistics and Data Analysis, 55, 544–553. HUBERT, M., ROUSSEEUW, P.J. and VAN AELST, S. (2008): High-breakdown robust multivariate methods. Statistical Science, 23, 92–119. RIANI, M., ATKINSON, A.C. and CERIOLI, A. (2009): Finding an unknown number of multivariate outliers. Journal of the Royal Statistical Society B, 71, 447– 466. WILKS, S.S. (1963): Multivariate Statistical Outliers. Sankhya A, 25, 407–426.

Keywords FORWARD SEARCH, ROBUST DISTANCES, REWEIGHTED MCD
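As a point of reference for the two requirements listed in the abstract, the following minimal sketch flags outliers using reweighted-MCD robust distances, an asymptotic chi-square approximation and a Bonferroni-type multiplicity correction; the finite-sample calibrations discussed by Cerioli (2010) are more refined than this:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:5] += 6.0                                   # plant a few outliers

n, p = X.shape
mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)                        # squared robust distances

alpha = 0.01
cutoff = chi2.ppf(1 - alpha / n, df=p)         # Bonferroni-corrected threshold
outliers = np.where(d2 > cutoff)[0]
print("flagged observations:", outliers)
```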

Clustering of variables via the PCAMIX method
Marie Chavent1,2, Vanessa Kuentz3, Benoît Liquet4, and Jérôme Saracco1,2
1 IMB, University of Bordeaux, France {marie.chavent,jerome.saracco}@math.u-bordeaux1.fr
2 CQFD team, INRIA Bordeaux Sud-Ouest, France
3 CEMAGREF, UR ADBX, Cestas, France [email protected]
4 ISPED, University of Bordeaux, France [email protected]

Abstract. Clustering of variables is a way to arrange variables into homogeneous clusters, i.e. groups of variables which are strongly related to each other and thus carry the same information. Clustering of variables can then be useful for dimension reduction and variable selection. Several specific methods have been developed for the clustering of numerical variables. However, for qualitative variables or mixtures of quantitative and qualitative variables, far fewer methods have been proposed. The ClustOfVar package has been developed specifically for that purpose. The homogeneity criterion of a cluster is the sum of correlation ratios (for qualitative variables) and squared correlations (for quantitative variables) to a synthetic variable, summarizing "as well as possible" the variables in the cluster. This synthetic variable is the first principal component obtained with the PCAMIX method. Two algorithms for the clustering of variables are proposed: an iterative relocation algorithm and an ascendant hierarchical clustering. We also propose a bootstrap approach to determine suitable numbers of clusters. The proposed methodologies are illustrated on real datasets.

References CHAVENT, M., KUENTZ, V., LIQUET, B., SARACCO, J. (2010): The R package ClustOfVar. CHAVENT, M., KUENTZ, V., LIQUET, B., SARACCO, J. (2011): ClustOfVar: an R package for the clustering of variables. Submitted paper.

Keywords MIXTURE OF QUANTITATIVE AND QUALITATIVE VARIABLES, HIERARCHICAL CLUSTERING OF VARIABLES, K-MEANS CLUSTERING OF VARIABLES, DIMENSION REDUCTION.
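For clusters of purely quantitative variables, the homogeneity criterion described above can be sketched in a few lines (a simplified illustration; the PCAMIX machinery for mixed data and the two clustering algorithms are implemented in the authors' ClustOfVar R package):

```python
import numpy as np

def homogeneity(Xc):
    """Sum of squared correlations between the (standardised) variables of a
    cluster and the cluster's first principal component."""
    Z = (Xc - Xc.mean(0)) / Xc.std(0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    synthetic = Z @ Vt[0]                       # first principal component scores
    r = [np.corrcoef(Z[:, j], synthetic)[0, 1] for j in range(Z.shape[1])]
    return np.sum(np.square(r))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)  # two strongly related variables
print(round(homogeneity(X), 3))
```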

An approach to using ontologies for interpreting text documents
E. Chernyak1, O. Chugunova1, J. Askarova1, S. Nascimento2, and B. Mirkin1
1 Higher School of Economics, 11 Pokrovski Boulevard, Moscow, RF [email protected]
2 Computer Science Department and Centre for Artificial Intelligence (CENTRIA), Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Caparica, Portugal [email protected]

Abstract. This work is motivated by the problem of the (automatic) annotation of a text document with keywords. Usually a taxonomy is used as a source of keywords to manually annotate a text document such as a journal paper. A typical example of such a taxonomy is the Association for Computing Machinery Classification of Computer Subjects (ACM-CCS) [1]. Manual ACM-CCS annotations are available for some journals on the ACM web portal. Our method proceeds in two stages, mapping and lifting. The mapping procedure associates taxonomy topics with a document under consideration by using the machinery of annotated suffix trees. It results in a crisp or fuzzy set of topics that characterises the content of the document and is referred to as a query set. Then the lifting procedure from [2] is invoked. At this stage, leaf query topics are parsimoniously lifted to taxonomy nodes of higher ranks, so that one or two higher-rank nodes cover all or almost all the query topics, thus producing the desired annotation. A collection of documents can be processed as well, with an additional stage of topic clustering. This method has been experimentally applied in two settings: (i) papers published in some of the ACM journals together with the ACM-CCS taxonomy, (ii) syllabuses of courses in pure and applied mathematics taught at the Higher School of Economics Moscow together with a taxonomy of mathematics (in Russian).

References [1] ACM Computing Classification System (1998), http://www.acm.org/about/class/1998 (accessed December 2010). [2] MIRKIN, B., NASCIMENTO, S., FENNER, T. and PEREIRA, L. M. (2010): Fuzzy Thematic Clusters Mapped to Higher Ranks in a Taxonomy. International Journal of Software and Informatics, 4, 257–275.

Keywords TEXT ANALYSIS, ONTOLOGY, VISUALIZATION, INTERPRETATION

Recognising Cello Performers Using Timbre Models Magdalena Chudy and Simon Dixon Centre for Digital Music, Queen Mary University of London Mile End Road, London E1 4NS, UK {magdalena.chudy,simon.dixon}@eecs.qmul.ac.uk Abstract. In this paper, we compare timbre features of various cello performers playing the same instrument in solo cello recordings. Using an automatic feature extraction framework, we investigate the differences in sound quality of the players. The motivation for this study comes from the fact that the performer's influence on acoustical characteristics is rarely considered when analysing audio samples of various instruments. While even a trained musician cannot entirely change the way an instrument sounds, he is still able to modulate its sound properties, obtaining a variety of individual sound "colours" according to his playing skills and musical expressiveness. This phenomenon, known amongst musicians as "player timbre", makes it possible to differentiate one player from the others when they perform an identical piece of music on the same instrument. To address this problem, we analyse sets of acoustical features extracted from cello recordings of five players and model the timbre characteristics of each performer. The extracted features include timbre descriptors from the MPEG-7 standard, harmonic and noise spectra, Mel-frequency spectra and Mel-frequency cepstral coefficients (MFCCs). We validate the performer models with a variety of classification techniques. Amongst others, k-Nearest Neighbours, Naïve Bayes and Bagging Decision Trees are tested for suitability and performance.
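A bare-bones version of such a feature-extraction and classification pipeline might look as follows; file names are hypothetical, librosa and scikit-learn are assumed tools, and only MFCC summary statistics are used, whereas the study also employs MPEG-7 descriptors and harmonic/noise spectra:

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def mfcc_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Summarise each recording by per-coefficient mean and standard deviation.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical (file, performer id) pairs.
train = [("cellist1_take1.wav", 0), ("cellist2_take1.wav", 1)]
test = [("cellist1_take2.wav", 0), ("cellist2_take2.wav", 1)]

X_train = np.array([mfcc_features(f) for f, _ in train])
y_train = np.array([label for _, label in train])
X_test = np.array([mfcc_features(f) for f, _ in test])

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(clf.predict(X_test))
```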

Principal Components for Gaussian Mixtures Carlos Cuevas-Covarrubias Anahuac University Mexico, Mexico [email protected] Abstract. Principal Components and Discriminant Analysis are two important techniques of Multivariate Statistics. Given a random vector X, principal components are usually applied to find its optimal representation in a space of lower dimension. Discriminant analysis assumes that the sample space of X is partitioned into different categories; given x, a particular realization of X, discriminant analysis is used to determine the category this observation comes from. We present a new idea where the area under the ROC curve links both methods. We focus on the interesting case where X follows some Gaussian mixture law. The objective is to represent X with a small number of components, simultaneously independent in both categories, and keeping most of its separability. We illustrate the theoretical concepts with two practical examples. The results are easy to interpret. This article offers an original idea with good potential for Pattern Recognition and Data Mining.

Keywords Principal Components, Gaussian Mixtures, Discriminant Analysis

Robust point matching in HDRI through estimation of illumination distribution Yan Cui, Alain Pagani, and Didier Stricker DFKI, Augmented Vision, Kaiserslautern University, Germany Abstract. High Dynamic Range Images provide more detailed information and their use in Computer Vision tasks is therefore desirable. However, the illumination distribution over the image often makes this kind of image difficult to use with common vision algorithms. In particular, the highlights and shadow parts in an HDR image are difficult to analyze in a standard way. In this paper, we propose a method to solve this problem by applying a preliminary step in which we precisely compute the illumination distribution in the image. Having access to the illumination distribution allows us to subtract the highlights and shadows from the original image, yielding a material color image. This material color image can be used as input for standard computer vision algorithms, like the SIFT point matching algorithm and its variants. We propose a three-step algorithm for solving this problem: First, we estimate the light distribution with a Gaussian Mixture Model (GMM) for different exposure layers. Second, we calculate the shadow and highlight parts in the images and produce 'material color' images, which do not depend on light sources. Finally, we apply a point matching algorithm (e.g., SIFT) to the 'material color' image. We show that our approach drastically increases the number of correctly matched points when using HDR images.

Feature Selection and Clustering of Digital Images Versus Questionnaire Based Grouping of Consumers: A Comparison Ines Daniel and Daniel Baier Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany {ines.daniel, daniel.baier}@tu-cottbus.de Abstract. Clustering algorithms are standard tools for marketing purposes. So, e.g., in market segmentation, they are applied to derive homogeneous customer groups. Recently, however, the available resources for this purpose have expanded. So, e.g., in social networks potential customers provide images which reflect their activities, interests, and opinions. To examine whether images lead to similar results as conventional methods for lifestyle analysis, a comparison study was conducted among 500 people. In this paper we discuss the results of the study. We also analyze possible advantages and disadvantages of using images for lifestyle analysis compared to conventional procedures of grouping customers for market segmentation.

References LAW, M., FIGUEIREDO, M., and JAIN, A.K. (2004): Simultaneous Feature Selection and Clustering Using Mixture Model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9), 1154–1166. PUNJ, G., STEWART, D.W. (1983): Cluster Analysis in Marketing Research: Review and Suggestions for Application. Journal of Marketing Research, 20 (2, May), 134–148. VAN HOUSE, N.A. (2009): Collocated Photo Sharing, Story-Telling, and the Performance of Self. International Journal of Human-Computer Studies, 67(12), 1073–1086. WEDEL, M., KAMAKURA, W.A. (2000): Market Segmentation: Conceptual and Methodological Foundations. Kluwer, Dordrecht. WELLS, W.D., TIGERT, D.J. (1971): Activities, Interests, and Opinions. Journal of Advertising Research, 11(4), 27–35.

Keywords MARKET SEGMENTATION, IMAGE CLUSTERING ALGORITHMS

A dynamic analysis of stock markets using a hidden Markov model
Luca De Angelis1 and Leonard J. Paas2
1 Statistical Sciences Department, University of Bologna, Via delle Belle Arti 41, Bologna, Italy [email protected]
2 Marketing Department, Vrije Universiteit, De Boelelaan 1105, Amsterdam, The Netherlands [email protected]

Abstract. This work proposes an innovative framework to detect financial crises, pinpoint the end of a crisis and predict future developments in stock markets. This proposal is based on a hidden Markov model and allows for a specific focus on conditional mean returns. By analyzing weekly changes in the U.S. stock market indexes over a period of 20 years, this study obtains an accurate detection of stable and turmoil periods and a probabilistic measure of switching between different stock market conditions. The results contribute to the discussion of the capabilities of hidden Markov models and give financial operators some appealing investment strategies.

References BULLA, J. and BULLA, I. (2006): Stylized Facts of Financial Time Series and Hidden Semi-Markov Models. Computational Statistics and Data Analysis, 51, 2192–2209. DIAS, J.G., VERMUNT, J.K., and RAMOS, S. (2010): Mixture Hidden Markov Models in Finance Research. In: A. Fink, B. Lausen, W. Seidel, and A. Ultsch (Eds.): Advances in Data Analysis, Data Handling and Business Intelligence, Studies in Classification, Data Analysis, and Knowledge Organization. Springer-Verlag, Berlin, 451–459. ROSSI, A. and GALLO, G.M. (2006): Volatility Estimation via Hidden Markov Models. Journal of Empirical Finance, 13, 203–230. ZUCCHINI, W. and MACDONALD, I.L. (2009): Hidden Markov Models for Time Series: An Introduction Using R. Chapman & Hall/CRC Monographs on Statistics & Applied Probability, Boca Raton, FL.

Keywords Stock market pattern analysis, Regime-switching, Forecasting, Hidden Markov model, Financial crises
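A compact illustration of regime detection with a Gaussian hidden Markov model is given below; simulated toy returns stand in for the weekly index changes, and the hmmlearn package is an assumed tool, not the software used in the study:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# Toy weekly returns: a calm regime followed by a turbulent one.
returns = np.concatenate([rng.normal(0.002, 0.01, 500),
                          rng.normal(-0.004, 0.04, 200)]).reshape(-1, 1)

model = GaussianHMM(n_components=2, covariance_type="full",
                    n_iter=200, random_state=0).fit(returns)
states = model.predict(returns)              # most likely regime per week
post = model.predict_proba(returns)          # probabilistic switching measure

print("state means:", model.means_.ravel())
print("last 5 posterior probabilities:\n", post[-5:].round(3))
```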

Trend vector models De Rooij Leiden University [email protected] Abstract. Longitudinal data are often collected in various fields of science. In the social and health-related areas the variable of interest is often categorical. We introduce the general class of trend vector models for the statistical analysis of nominal and ordinal longitudinal data. These trend vector models are based on multidimensional unfolding and use the two-mode distance function in a probabilistic multinomial model. The squared Euclidean distance between the position of a subject at a specific time point and the position of a category of the response variable is inversely related to the probability of this category. The positions of the subjects are determined using a linear combination of the time and possibly grouping variables. The trend vector model results in a low-dimensional graphical representation of trends in the data using (non)linear vectors. We distinguish between marginal and subject-specific models (Molenberghs and Verbeke, 2005) and show several applications.

References Molenberghs, G. and Verbeke, G. (2005): Models for Discrete Longitudinal Data. Springer.

Keywords Multinomial data; Multidimensional scaling; Random effects; Marginal models
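The distance-probability link described above can be written down directly (a generic sketch; in the actual model the subject positions are parameterised through time and grouping variables rather than given):

```python
import numpy as np

def category_probabilities(x, category_points):
    """P(category c | position x) proportional to exp(-squared Euclidean
    distance between x and the point representing category c)."""
    d2 = np.sum((category_points - x) ** 2, axis=1)
    w = np.exp(-d2)
    return w / w.sum()

# Toy example: a subject's position at one time point and three category points.
x_t = np.array([0.2, -0.1])
Z = np.array([[0.0, 0.0], [1.0, 1.0], [-1.5, 0.5]])
print(category_probabilities(x_t, Z).round(3))
```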

Switching PCA for modeling changes in the underlying structure of multivariate time series data
K. De Roover1, E. Ceulemans1, M. Timmermann2, and P. Onghena1
1 Katholieke Universiteit Leuven, Belgium [email protected]
2 University of Groningen, The Netherlands

Abstract. Behavioral researchers are often interested in the underlying structure of multivariate time series data. For instance, given multiple measurements of a set of variables for one subject, one may wonder whether some of the variables covary across time or whether they all fluctuate independently of one another, and whether the underlying structure is the same across all measurements or varies over time. For example, when something happens between two measurements, the amount or nature of covariation in the data can change. To explore such structural differences across time, Switching PCA is developed. Using Switching PCA, clusters of sequential measurements (i.e., a time contiguity constraint is imposed on the clustering, implying that each cluster consists of consecutive measurements only) are induced, according to the underlying structure, and the data within each cluster are modeled by a separate PCA. An algorithm for fitting Switching PCA models is presented. The value of the model for empirical research is demonstrated.

Keywords time series data, multivariate data, principal component analysis, time contiguity, clustering

Recommender Systems for Biosurveillance 2.0 Ernesto Diaz-Aviles, Avaré Stewart and Wolfgang Nejdl L3S Research Center, Leibniz Universität Hannover, Germany {diaz, stewart, nejdl}@L3S.de Abstract. Social media production has been enjoying a great deal of success in recent years. There are millions of active users participating in social network sites like Facebook.com, or micro-blogging using Twitter.com. Recently, modern disease surveillance systems have started to also monitor these streams of mass social media data with the objective of improving their timeliness to detect disease outbreaks and produce warnings against potential public health threats [1]. However, to public health officials, these warnings represent an overwhelming amount of information for risk assessment. To reduce this overload we explore to what extent Recommender Systems can help to intelligently filter information items according to the public health users' context (e.g., location) and preferences (e.g., disease, symptoms) [2,3]. In this work, we first discuss the relevant features that characterize a health event within the context of Web 2.0 data. Then, we introduce a multi-phase filtering technique for health event selection on social media streams. Finally, we present our recommender system approach that ultimately offers the user the most relevant and attractive events for risk assessment. Moreover, an extensive experimental evaluation of our methods on a real micro-blog dataset collected from Twitter.com is reported.

References [1] CORLEY, COURTNEY D.; COOK, DIANE J.; MIKLER, ARMIN R. and SINGH, KARAN P. (2010): Text and Structural Data Mining of Influenza Mentions in Web and Social Media. Int. J. Environ. Res. Public Health 7, no. 2: 596–615. [2] KARATZOGLOU, A.; AMATRIAIN, X.; BALTRUNAS, L.; and OLIVER, N. (2010): Multiverse Recommendation: N-Dimensional Tensor Factorization for Context-Aware Collaborative Filtering. In Proceedings of RecSys ’10. [3] RENDLE, S.; MARINHO, L.; NANOPOULOS, A.; and SCHMIDT-THIEME, L. (2009): Learning Optimal Ranking with Tensor Factorization for Tag Recommendation. In Proceedings of KDD ’09.

Keywords: BIOSURVEILLANCE, RECOMMENDER SYSTEMS, SOCIAL MEDIA

Swarm Intelligent Recommender Systems Ernesto Diaz-Aviles, Avaré Stewart, Mihai Georgescu and Wolfgang Nejdl L3S Research Center, Leibniz Universität Hannover, Germany {diaz, stewart, georgescu, nejdl}@L3S.de Abstract. Recommender systems make product suggestions that are tailored to the human user's individual needs and represent powerful means to combat information overload. In this paper, we focus on the item prediction task of Recommender Systems, which is to predict a user-specific ranking for a set of items, and present a method to automatically optimize the quality of ranking functions from a Swarm Intelligence (SI) perspective. Our approach, which is well-founded in a Particle Swarm Optimization (PSO) framework, learns a ranking function by optimizing the combination of various types of important and unique characteristics (i.e., features) of users, items and their interactions. PSO was chosen because it is a global non-linear optimization algorithm that neither requires, nor approximates, cost function gradients. Further, since PSO does not require differentiability of the objective, we are able to directly optimize non-smooth metrics. We build feature vectors from a factorization of the user-item interaction matrix, and directly optimize an error metric in order to learn linear ranking functions that output a real value representing the relevance of a given item for a particular user. Our experimental evaluation on a real-world online radio dataset indicates that our approach is able to find ranking functions that significantly improve the performance of the recommender system.

References KENNEDY, J. AND EBERHART, R. C. (1995): Particle Swarm Optimization. Proc. IEEE Int’l. Conf. on Neural Networks, IV, 1942-1948. LIU, T.(2009): Learning to Rank for Information Retrieval. Found. Trends Inf. Retr. 3, 3 (March 2009), 225-331. KOREN, Y.; BELL, R.; VOLINSKY, C. (2009): Matrix Factorization Techniques for Recommender Systems. Computer, vol. 42, no. 8, pp. 30-37, Aug. 2009, doi:10.1109/MC.2009.263.

Keywords: PARTICLE SWARM OPTIMIZATION, RECOMMENDER SYSTEMS, LEARNING TO RANK, ITEM PREDICTION
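A stripped-down PSO loop for learning a linear scoring function is sketched below on toy data with a simple pairwise-error objective; the paper's feature construction from matrix factorization and its ranking metric are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(100, 5))                       # item feature vectors for one user
true_w = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
relevance = (F @ true_w + 0.1 * rng.normal(size=100) > 0).astype(int)

def pairwise_error(w):
    """Fraction of (relevant, irrelevant) item pairs ranked incorrectly."""
    s = F @ w
    pos, neg = s[relevance == 1], s[relevance == 0]
    return np.mean(pos[:, None] <= neg[None, :])

n_particles, dim, iters = 20, F.shape[1], 100
pos = rng.normal(size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_err = pos.copy(), np.array([pairwise_error(w) for w in pos])
gbest = pbest[pbest_err.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos += vel
    err = np.array([pairwise_error(w) for w in pos])
    improved = err < pbest_err
    pbest[improved], pbest_err[improved] = pos[improved], err[improved]
    gbest = pbest[pbest_err.argmin()].copy()

print("pairwise ranking error of learned weights:", pairwise_error(gbest))
```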

Latent class modeling of time series data José G. Dias ISCTE – Instituto Universitário de Lisboa, UNIDE, Av. das Forças Armadas, Lisboa 1649–026, Portugal [email protected] Abstract. Latent class or finite mixture modeling has proven to be a powerful tool for identifying discrete unobserved heterogeneity in observed data. Here we focus on the latent class analysis of economic and financial time series data (see, e.g., Ramos et al., 2011). This model is strongly related to regime-switching models (Hamilton, 1989) and hidden Markov models (Baum et al., 1970). These complementary approaches are compared and new developments are integrated from different backgrounds. In particular, we focus on the classification uncertainty in latent class modeling of time series data, with emphasis on entropy-based measures (Dias and Vermunt, 2006; Dias and Vermunt, 2008). Results are illustrated with two time series.

References 1. BAUM, L.E., PETRIE, T., SOULES, G., and WEISS, N. (1970): A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41(1), 164–171. 2. DIAS, J.G. and VERMUNT, J.K. (2006): Bootstrap methods for measuring classification uncertainty in latent class analysis. In: Rizzi, A. and Vichi, M. (eds.), COMPSTAT 2006: Proceedings in Computational Statistics, Springer, pp. 31–41. 3. DIAS, J.G. and VERMUNT, J.K. (2008): A bootstrap-based aggregate classifier for model-based clustering. Computational Statistics, 23(4), 643–659. 4. HAMILTON, J.D. (1989): A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57(2), 357–384. 5. RAMOS, S.B., VERMUNT, J.K. and DIAS, J.G. (2011): When markets fall down: Are emerging markets all the same? International Journal of Finance and Economics, in press.

Keywords LATENT CLASS MODEL, TIME SERIES ANALYSIS, CLASSIFICATION UNCERTAINTY, CLUSTERING

Testing the Value-Added of Rebalancing Strategies for Stock-Bond-Portfolios
Hubert Dichtl1, Wolfgang Drobetz2 and Martin Wambach3
1 alpha portfolio advisors GmbH, Wiesbadener Weg 2a, 65812 Bad Soden/Ts., Germany. [email protected]
2 University of Hamburg, Institute of Finance, Von-Melle-Park 5, 20146 Hamburg, Germany. [email protected]
3 University of Hamburg, Institute of Finance, Von-Melle-Park 5, 20146 Hamburg, Germany. [email protected]

Abstract. Most institutional investors rebalance their portfolio at regular time intervals in order to maintain their initial asset allocation. A pure risk management argument underlies this strategy. Rebalancing the portfolio back to the original allocation prevents a drift away from the worse performing asset class toward the better performing one, thereby reducing diversification and increasing risk. In order to investigate the potential risk-return benefits of different rebalancing strategies, we run historical simulations and apply a novel block-bootstrap approach. In contrast to prior studies, our methodology allows us to conduct a statistical test of whether different rebalancing strategies dominate a buy-and-hold strategy based on data from the US, the UK, and Germany. This test statistic is robust to time-series dependencies that are inherent in tests based on rolling time windows. Our simulations deliver several results that have immediate practical implications. First, despite the strong performance of stocks relative to bonds during the sample period, the average returns of a buy-and-hold strategy are not significantly higher than those of different rebalancing strategies. According to Perold and Sharpe's (1988) notion, this result implies that neither the mean reversion nor the momentum effects in the market data are strong enough to produce superior returns of either strategy. Second, we document that rebalancing strategies at all trading frequencies exhibit significantly lower volatility compared to the corresponding buy-and-hold strategy due to their better diversification properties. Third, analyzing the Sharpe ratio as a performance measure that incorporates both the return and the volatility of an investment strategy, our results reveal that rebalancing strategies significantly outperform buy-and-hold strategies. This finding is robust for all trading frequencies. Fourth, comparing different rebalancing intervals, quarterly rebalancing produces significantly higher Sharpe ratios compared to monthly rebalancing. This observation indicates that there may be an optimal rebalancing frequency, with excessive rebalancing and no rebalancing both leading to inferior Sharpe ratios. Our results incorporate realistic transaction costs and trading triggers, and they are qualitatively the same in all countries.
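The basic comparison can be illustrated on simulated returns as follows (an i.i.d. toy simulation without transaction costs, not the block-bootstrap test of the study):

```python
import numpy as np

rng = np.random.default_rng(0)
n_months = 240
stock = rng.normal(0.007, 0.045, n_months)      # toy monthly stock returns
bond = rng.normal(0.003, 0.015, n_months)       # toy monthly bond returns

def strategy_returns(w_stock=0.6, rebalance_every=None):
    ws, wb, out = w_stock, 1 - w_stock, []
    for t in range(n_months):
        r = ws * stock[t] + wb * bond[t]
        out.append(r)
        # Weights drift with relative performance unless reset to the target.
        ws, wb = ws * (1 + stock[t]) / (1 + r), wb * (1 + bond[t]) / (1 + r)
        if rebalance_every and (t + 1) % rebalance_every == 0:
            ws, wb = w_stock, 1 - w_stock
    return np.array(out)

def sharpe(r):
    return r.mean() / r.std() * np.sqrt(12)     # annualised, zero risk-free rate

print("buy-and-hold Sharpe:  ", round(sharpe(strategy_returns()), 3))
print("quarterly rebalanced: ", round(sharpe(strategy_returns(rebalance_every=3)), 3))
print("monthly rebalanced:   ", round(sharpe(strategy_returns(rebalance_every=1)), 3))
```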

k-Means Clustering of Incomplete Data Stephan Dlugosz ZEW Centre for European Economic Research Mannheim [email protected] Abstract. Cluster analysis is a powerful tool for detecting hidden structures in large datasets. Large datasets, however, often suffer from data problems such as mismeasured or missing values. Classical data mining methods—including k-means clustering—are designed for complete datasets. Many methods have been proposed to address this issue: imputation methods try to artificially fill the holes in the data by finding plausible values; distance-metric correction methods treat missing values when calculating the distance metric. Both groups of methods ignore the clustered structure of the data. Fuzzy c-means and model-based clustering techniques include the missingness information in their optimization criterion, but they rely on strong assumptions about the underlying data model. In this paper, we discuss the statistical assumptions of the various incomplete-data methods. Furthermore, we extend the k-means algorithm to handle missing values in a non-parametric way under the missing-at-random assumption. The new algorithm computes cluster assignment probabilities for incomplete vectors by cluster-wise kernel-density estimation. Additionally, we present simulation results and discuss how the computational burden of the method can be reduced to match data mining requirements.

References DIXON, J.K. (1979): Pattern Recognition with Partly Missing Data, IEEE Transactions on Systems, Man, and Cybernetics 9(10), 1979. HARTIGAN, J.A. (1975): Clustering Algorithms, Wiley, 1975. HATHAWAY, R.J. and BEZDEK, J.C. (2001): Fuzzy c-Means Clustering of Incomplete Data, IEEE Transactions on Systems, Man, and Cybernetics B 31(5), 2001. LITTLE, R.J.A. and RUBIN, D.B. (2002): Statistical Analysis with Missing Data, Wiley, 2002.

Keywords CLUSTER ANALYSIS, INCOMPLETE DATA, K-MEANS
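For contrast with the kernel-density approach proposed above, the following sketch implements the simple distance-correction baseline mentioned in the abstract: distances are computed over the jointly observed coordinates and rescaled, and cluster centers are updated from observed entries only. This is not the authors' method:

```python
import numpy as np

def kmeans_missing(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(X)
    n_obs = obs.sum(1)
    centers = np.nan_to_num(X[rng.choice(len(X), k, replace=False)])
    for _ in range(iters):
        # Partial squared distances, rescaled by the fraction of observed coordinates.
        d = np.empty((len(X), k))
        for j, c in enumerate(centers):
            diff = np.where(obs, X - c, 0.0)
            d[:, j] = (diff ** 2).sum(1) * X.shape[1] / np.maximum(n_obs, 1)
        labels = d.argmin(1)
        for j in range(k):                      # mean of observed values per cluster
            m = labels == j
            if m.any():
                num = np.where(obs[m], X[m], 0.0).sum(0)
                cnt = obs[m].sum(0)
                centers[j] = np.where(cnt > 0, num / np.maximum(cnt, 1), centers[j])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
X[rng.random(X.shape) < 0.2] = np.nan           # 20% missing completely at random
labels, _ = kmeans_missing(X, k=2)
print(np.bincount(labels))
```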

Classification and Regression Trees with Covariates Missing at Random Stephan Dlugosz ZEW Centre for European Economic Research Mannheim [email protected] Abstract. Classification and regression trees (CART) are a popular tool for nonparametric data analysis in large datasets. Decision tree induction is quite fast, and trees provide an intuitive access to the data. Large datasets, however, often suffer from data problems such as mismeasured or missing values. Incomplete data are usually 'avoided' with the help of surrogate splits in classical CART. Such a surrogate is based upon a proxy for the missing value; using it instead of the (unobserved) value introduces additional error. Additionally, this procedure ignores other covariates, which might also be highly associated with the unobserved value. This paper proposes and justifies an EM-type algorithm to improve tree induction with incomplete data. We also adjust the prediction function in order to cope with a tree that expects complete data. A small simulation study demonstrates the potential of the new method.

References BREIMAN, L., FRIEDMAN, J.H., OLSHEN, R.A. and STONE, C.J. (1984): Classification and Regression Trees, Chapman & Hall, 1984. DEMPSTER, A.P., LAIRD, N.M. and RUBIN, D.B. (1977): Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39, 1–38. FRIEDMAN, J.H. (1977): A recursive partitioning decision rule for nonparametric classification. IEEE Trans. Computers, C-26, 404–408. IBRAHIM, J.G. (1990): Incomplete Data in Generalized Linear Models. Journal of the American Statistical Association 85, 765–769. LITTLE, R.J.A. and RUBIN, D.B. (2002): Statistical Analysis with Missing Data, Wiley, 2002.

Keywords DECISION TREES, INCOMPLETE DATA, EM

Implications of Axiomatic Consensus Properties Florent Domenach and Ali Tayari Computer Science Department, University of Nicosia, 46 Makedonitissas Ave., PO Box 24005, 1700 Nicosia, Cyprus [email protected] Abstract. Since Arrow's celebrated impossibility theorem, axiomatic consensus theory has been extensively studied (see e.g. Day and McMorris for a comprehensive survey). We are interested here in implications between properties on a profile of hierarchies. Such implications are systematically investigated using Formal Concept Analysis (Ganter and Wille) through attribute exploration. Different properties of consensus functions are considered, among which Pareto optimality, nesting preservation, strong presence, etc. All possible consensus functions are automatically generated on a set of hierarchies derived from the enclosed set of elements with respect to Schröder's fourth problem. The list of implications is presented and discussed.

References ADAMS III, E.N. (1972): Consensus Techniques and the Comparison of Taxonomic Trees. Systematic Zoology, 21, 390–397. ARROW, K.J. (1951): Social Choice and Individual Values. Wiley, New York. DAY, W.H.E. and MCMORRIS, F.R. (2003): Axiomatic Consensus Theory in Group Choice and Biomathematics. SIAM, Philadelphia. GANTER, B. and WILLE, R. (1996): Formal Concept Analysis: Mathematical Foundations. Springer. PARETO, V. (1896): Cours d'économie politique. F. Rouge, Lausanne. SCHRÖDER, E. (1870): Vier combinatorische Probleme. Z. Math. Physik, 15, 361–376.

Keywords Properties of Consensus Functions; Attribute Exploration; Implications

Learning time series dissimilarities
Ahlame Douzal-Chouakria1, Cedric Frambourg1, Eric Gaussier1, Jacques Demongeot2
1 LIG, Université Joseph Fourier, Grenoble, France (Ahlame.Douzal, cedric.Frambourg, Eric.Gaussier)@imag.fr
2 TIMC-IMAG, Université Joseph Fourier, Grenoble, France

Abstract. In many classification approaches, time series comparison requires the time series to be aligned. Numerous strategies for temporal alignment have been proposed in the literature; they intend to fit observations so as to make the compared time series as close as possible (e.g., Rodriguez and Alonso (2004), Shou et al. (2005), Navarro (2001), Nanopoulos et al. (2001)). In the framework of time series discrimination, this work focuses on learning time series alignments by connecting the commonly shared temporal features within clusters (i.e., higher cluster cohesion) and the greatest differences between clusters (i.e., higher cluster isolation). A new time series alignment approach supervised by a variance/covariance criterion is proposed. The core of the alignment strategy is based on strengthening or weakening links according to their contribution to the variability within and between classes. To this end, the classical variance/covariance expression is extended to a set of time series, as well as to a partition of time series. Discriminative distances based on the learned alignments are then induced for time series classification. We show, through the experiments carried out, that the learned distances outperform the standard ones for time series classification. Key words: Time series alignments, distance learning, discriminant analysis, variance-covariance

References Navarro, G. (2001): A guided tour to approximate string matching. ACM Computing Surveys, 33 (1), pp. 31-88. Rodriguez, J.J. and Alonso, C.J. (2004): Interval and dynamic time warping-based decision trees. Proc of the ACM Symposium on applied computing, pp. 548-552. Nanopoulos, A., Alcock, R. and Manolopoulos, Y. (2001): Feature-based classification of time-series Data. International Journal of Computer Research, pp. 49-61, Nona Science Publishers. Shou, Y., Mamoulis, N. and D. Cheung, W. (2005): Fast and Exact Warping of Time Series Using Adaptive Segmental Approximations. Machine Learning Journal, 58(2-3), pp. 231-267.

Fingerprints for Machines - Optical Identification of Grinding Imprints Dragon1, Mörke1, Rosenhahn1, and Ostermann2 Leibniz Universität Hannover Abstract. The profile of a 10 mm wide and 1 µm deep grinding imprint is as unique as a human fingerprint. To utilize this for fingerprinting mechanical components, a robust and strong characterization has to be used. We propose a feature-based approach, in which features of a 1D profile are detected and described in its 2D space-frequency representation. We show that the approach is robust on depth maps as well as intensity images of grinding imprints. To estimate the probability of misclassification, we derive a model and learn its parameters. With this model we demonstrate that our characterization has a false positive rate of approximately 10^-20, which is as strong as a human fingerprint.

Indoor Calibration using Segment Chains Drareni1, Marlet, and Keriven ENPC Abstract. In this paper, we present a new method for line segment matching for indoor reconstruction. Instead of matching individual segments via a descriptor, as most methods do, we match segment chains that have a distinctive topology, using a dynamic programming formulation. Our method relies solely on the geometric layout of the segment chains and not on photometric or color profiles. Our tests showed that the presented method is robust and manages to produce calibration information even under a drastic change of viewpoint.

A Bayesian Approach for Scene Interpretation with Integrated Hierarchical Structure Drauschke1 and Foerstner2 Universität der Bundeswehr Abstract. We propose a concept for scene interpretation with integrated hierarchical structure. We start with segmenting regions at many scales, arranging them in a hierarchy, and classifying them by a common classifier. Then, we use the hierarchy graph of regions to construct a conditional Bayesian network, where the probabilities of class occurrences in the hierarchy are used to improve the classification results. We show that our framework is able to learn models for several objects, such that we can reliably detect instances of them in other images.

Image Comparison on the Base of a Combinatorial Matching Algorithm Drayer1 University Freiburg Abstract. In this paper we compare images based on the constellation of their interest points. The fundamental technique for this comparison is our matching algorithm, which is capable of modeling miss- and multi-matches while enforcing one-to-one matches. We associate an energy function with the possible matchings. In order to find the matching with the lowest energy, we reformulate this energy function as a Markov Random Field and determine the matching with the lowest energy by an efficient minimization strategy. In the experiments, we compare our algorithm against the normalized cross correlation and a naive forth-and-back best neighbor match algorithm.

Pick your Neighborhood – Improving Labels and Neighborhood Structure for Label Propagation Ebert1, Fritz1, and Schiele1 MPI Informatik Abstract. Graph-based methods are very popular in semi-supervised learning due to their well-founded theoretical background, intuitive interpretation of local neighborhood structure, and strong performance on a wide range of challenging learning problems. However, the success of these methods is highly dependent on the pre-existing neighborhood structure in the data used to construct the graph. In this paper, we use metric learning to improve this critical step by increasing the precision of the nearest neighbors and building our graph in this new metric space. We show that learning neighborhood relations before constructing the graph consistently improves the performance of two label propagation schemes on three different datasets – achieving the best performance reported on Caltech 101 to date. Furthermore, we question the predominant random draw of labels and advocate the importance of the choice of labeled examples. Orthogonal to active learning schemes, we investigate how domain knowledge can substantially increase performance in these semi-supervised learning settings.
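The idea of learning the metric before building the propagation graph can be sketched with off-the-shelf components; NCA and LabelSpreading from scikit-learn are stand-ins chosen for illustration, not the metric learner or propagation schemes evaluated in the paper:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.semi_supervised import LabelSpreading

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)
labels = np.full(len(y), -1)                       # -1 marks unlabeled points
labeled_idx = rng.choice(len(y), 100, replace=False)
labels[labeled_idx] = y[labeled_idx]

# Step 1: learn a metric from the few labeled examples.
nca = NeighborhoodComponentsAnalysis(n_components=20, random_state=0)
nca.fit(X[labeled_idx], y[labeled_idx])
X_nca = nca.transform(X)

# Step 2: build the k-NN graph (in the original or learned space) and propagate.
for name, data in [("original space", X), ("learned metric", X_nca)]:
    model = LabelSpreading(kernel="knn", n_neighbors=7).fit(data, labels)
    acc = (model.transduction_ == y)[labels == -1].mean()
    print(f"{name}: accuracy on unlabeled points = {acc:.3f}")
```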

Spatial classification of loss of heterozygosity on tumor chromosomes Paul H.C. Eilers1 Erasmus University Medical Center, Rotterdam, The Netherlands [email protected] Abstract. Normal human cells contain 23 pairs of chromosomes, one inherited from the mother, the other from the father. During cell division each chromosome gets copied. In tumor cells, however, many undesirable scenarios occur. One of them is loss of heterozygosity (LOH): parts of one of the chromosomes get lost, but repair mechanisms in the cell construct perfect copies of the corresponding regions of the other chromosome. The state of chromosomes can be observed at positions where so-called single nucleotide polymorphisms (SNPs) occur. A SNP has two different alleles, one on each chromosome. If we call the alleles A and B, the possible (unordered) pairs are AA, AB and BB, the genotypes. The AB pair is called heterozygous, AA and BB homozygous. Normal cells show an alternation of homozygous and heterozygous genotypes along a (pair of) chromosomes. However, where LOH occurs, only AA and BB are seen. When we observe a series of adjacent homozygous genotypes, LOH is probable, but not certain. It is desirable to have a realistic statistical model that estimates for each SNP the probability that it is in a region with LOH. Such a model should make use of the spatial locations of the SNPs. I propose to estimate a spatial series for the logit of the probability of LOH, given the positions and genotypes of SNPs on a chromosome. Conditional on LOH, two different discrete distributions are possible at each SNP, one with two states (only AA and BB), the other with three. To account for the spatial correlation, a difference penalty is introduced. Different norms (sums of powers of the absolute values of the differences) can be chosen. A small power in the norm leads to segments with sharp boundaries. Several refinements are possible. One is not to use genotypes as determined by established procedures, but to work with the raw signals from the microarrays that are used for SNP genotyping. Then mixtures of two or three normal distributions (with unknown parameters) have to be introduced. The other refinement is to make use of the probabilities with which the alleles of each SNP occur in the population.

Keywords Spatial smoothing, difference penalty, Lp norm
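A small numerical sketch of the penalized-likelihood idea follows, simplified to a two-state homozygous/heterozygous summary per SNP with an L_p difference penalty; all parameter values are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
n, q, eps = 300, 0.3, 0.02          # SNPs, population het. rate, het. rate under LOH
loh_true = np.zeros(n); loh_true[100:180] = 1            # one true LOH segment
het = rng.random(n) < np.where(loh_true == 1, eps, q)    # observed heterozygosity

def objective(z, lam=1.0, p=1.1):
    prob_loh = expit(z)                                  # P(LOH) at each SNP
    p_het = prob_loh * eps + (1 - prob_loh) * q          # mixture P(heterozygous)
    loglik = np.sum(np.where(het, np.log(p_het), np.log(1 - p_het)))
    # Spatial difference penalty; powers closer to 1 give sharper segment boundaries.
    penalty = lam * np.sum(np.abs(np.diff(z)) ** p)
    return -loglik + penalty

res = minimize(objective, np.zeros(n), method="L-BFGS-B")
prob = expit(res.x)
print("mean P(LOH) inside / outside true segment:",
      round(prob[100:180].mean(), 2), "/",
      round(np.r_[prob[:100], prob[180:]].mean(), 2))
```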

Simultaneous Interpolation and Deconvolution Model for the 3-D Reconstruction of Cell Images Elhayek1, Welk2, and Weickert3 MPI Saarbrücken Abstract. Fluorescence microscopy methods are an important imaging technique in cell biology. Due to their depth sensitivity they allow direct 3-D imaging. However, the resulting volume data sets are undersampled in depth, and the 2-D slices are blurred and noisy. Reconstructing the full 3-D information from these data is therefore a challenging task, and of high relevance for biological applications. We address this problem by combining deconvolution of the 3-D data set with interpolation of additional slices in an integrated variational approach. Our novel 3-D reconstruction model, Interpolating Robust and Regularised Richardson-Lucy reconstruction (IRRRL), merges the Robust and Regularised Richardson-Lucy deconvolution (RRRL) from [15] with variational interpolation. In this paper we develop the theoretical approach and its efficient numerical implementation using the Fast Fourier Transform and a coarse-to-fine multiscale strategy. Experiments on confocal fluorescence microscopy data demonstrate the high restoration quality and computational efficiency of our approach.

Solving Clustering Problems by the Hyperbolic Smoothing Approach Adilson Elias Xavier Centro de Tecnologia, Rio de Janeiro [email protected] Abstract. Clustering analysis can be done according to numerous criteria, through different mathematical formulations. The methodology considered here deals with clustering problems that have a common component: the measure of a distance, which can be done following different metrics. The methodology, called hyperbolic smoothing, has a wider scope, and can be applied to clustering according to distances measured in different metrics, such as those known as the Euclidean, Minkowski, Manhattan and Chebyshev norms. By smoothing we fundamentally mean the substitution of an intrinsically non-differentiable two-level problem by a completely differentiable single-level alternative. This is achieved through the solution of a sequence of differentiable sub-problems which gradually approaches the original problem. An additional improvement considers the partition of the set of observations into two groups: "data in the frontier" and "data in gravitational regions". The resulting effect is a desirable substantial reduction of the computational effort necessary to solve the clustering problems. The talk will consider three clustering formulations: 1 - among many criteria, the most natural, intuitive and frequently adopted criterion is the minimum sum-of-squares clustering (MSSC); 2 - the minimum sum of distances clustering problem according to the Euclidean metric, which is analogous to the Fermat-Weber location problem; 3 - the minimum sum of distances clustering problem according to the Manhattan metric. In order to show the distinct performance of the proposed methodologies, a set of computational results obtained by solving traditional instances is presented.

Similarity measures for learning ontological knowledge Floriana Esposito Dipartimento di Informatica - Università Aldo Moro, Bari, Italy [email protected] Abstract. The application of learning methods to automatically acquire or discover ontological knowledge is strictly related to the representations used and requires the availability of distance functions or (dis-)similarity measures that are able to assess the (dis-)similarity of the considered elements. Such measures are strictly related to the representation language; indeed, they need to capture all the expressiveness of the language in order to evaluate similarity in the best possible way. The definition and evaluation of similarity and dissimilarity measures have been largely studied in the literature, and many measures have been defined. They can be classified with respect to the propositional and the relational setting, although most of the work has been dedicated to propositional representations. Hence, this work is of limited use for relational representations, and new measures have been defined for the relational setting. The main models for computing (dis-)similarity measures in the relational setting will be discussed. The paper focuses on the definition of measures applied to knowledge expressed in Description Logics, and hence in the ontological setting. Description Logics (DL) are a family of knowledge representation languages where atomic concepts and atomic roles represent elementary descriptions. Complex descriptions can be built from atomic concepts and roles inductively by the use of concept constructors. Different description languages, or rather different description logics, are distinguished within the DL family by means of the constructors they provide. Most of the similarity measures using relational models are not able to exploit the high expressiveness of DLs, because measure definitions are often strictly tied to the particular representation language to which they are applied and refer to DLs of very low expressiveness; moreover, such works do not consider the problem of assessing the (dis-)similarity between individuals, as only concepts are considered. In the paper, a notion of similarity, coded in a distance measure and complying with the semantics of knowledge bases expressed in DLs, will be proposed.

Keywords Similarity Measures, Ontological Learning, Description Logics

Efficient and Robust Alignment of Unsynchronized Video Sequences Evangelidis1 and Bauckhage2 University of Patras Abstract. This paper addresses the problem of aligning two unsynchronized video sequences. We present a novel approach that allows for temporal and spatial alignment of similar videos captured from independently moving cameras. The goal is to synchronize two videos of a scene such that changes between the videos can be detected automatically. This aims at applications in driver assistance or surveillance systems, but we also envision applications in map building. Our approach is novel in that it adapts an efficient information retrieval framework to a computer vision problem. In addition, we extend the recent ECC image-alignment algorithm to the temporal dimension in order to improve spatial registration and enable synchronization refinement. Experiments with traffic videos recorded by in-vehicle cameras demonstrate the efficiency of the proposed method and verify its effectiveness with respect to spatio-temporal alignment accuracy.

Real Time Head Pose Estimation from Consumer Depth Cameras Fanelli1 , Weise, Gall, and Van Gool ETHZ Abstract. We present a system for estimating location and orientation of a person’s head, from depth data acquired by a low quality device. Our approach is based on discriminative random regression forests: ensembles of random trees trained by splitting each node so as to simultaneously reduce the entropy of the class labels distribution and the variance of the head position and orientation. We evaluate three different approaches to jointly take classification and regression performance into account during training. For evaluation, we acquired a new dataset and propose a method for its automatic annotation.

Local Clique Merging: An Extension of the Maximum Common Subgraph Measure for the Classification of Graph Structures Thomas Fober and Eyke Hüllermeier Department of Mathematics and Computer Science, University of Marburg {thomas, eyke}@mathematik.uni-marburg.de Abstract. The clustering and classification of structured or geometrical objects, as frequently encountered in bioinformatics applications, raises new challenges, especially since mapping such objects to a standard feature representation is often inappropriate. Instead, it is often better to use a similarity (or, equivalently, a distance) measure which is defined on the structured objects directly, and therefore able to capture important characteristics of these objects that may get lost by mapping them to feature vectors. Many algorithms for clustering and classification, including, for example, kernel-based methods such as support vector machines, only require a similarity measure of that kind. In this paper, we develop a novel similarity measure for node-labeled and edge-weighted graphs, which is an extension of the well-known maximum common subgraph (MCS) measure. Despite being commonly used, MCS has a number of disadvantages. First, the computation of an MCS of two graphs, which is typically done by searching for a maximum clique in the product graph, is an NP-hard problem and hence critical from a complexity point of view. Second, as it is based on exact matches, the MCS is quite inflexible and not tolerant toward measurement errors or structural flexibility, which is especially problematic in biological applications. Our measure, which is based on the use of so-called quasi-cliques as a relaxation of a clique, overcomes these problems. We apply this approach to the comparison of protein structures or, more specifically, protein binding sites, which is an important problem in structural bioinformatics. Such binding sites are represented in terms of graphs, in which the node labels correspond to physico-chemical properties and the edge lengths are used to model geometric constraints. We tackle classification problems and compare our approach with alternative methods and existing similarity measures on graphs.
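For reference, the classical MCS similarity that the quasi-clique approach relaxes can be computed via a maximum clique in the modular product graph (worst-case exponential; networkx is used here as an assumed tool):

```python
import networkx as nx

def mcs_similarity(G1, G2):
    """Maximum-common-subgraph similarity for node-labeled graphs via a
    maximum clique in the modular product graph."""
    P = nx.Graph()
    P.add_nodes_from((u, v) for u in G1 for v in G2
                     if G1.nodes[u]["label"] == G2.nodes[v]["label"])
    for (u1, v1) in P:
        for (u2, v2) in P:
            # Edge iff the pair is compatible: both adjacent or both non-adjacent.
            if u1 != u2 and v1 != v2 and \
               G1.has_edge(u1, u2) == G2.has_edge(v1, v2):
                P.add_edge((u1, v1), (u2, v2))
    mcs_size = max((len(c) for c in nx.find_cliques(P)), default=0)
    return mcs_size / max(len(G1), len(G2))   # one of several possible normalisations

G1, G2 = nx.path_graph(4), nx.cycle_graph(4)
for G in (G1, G2):
    nx.set_node_attributes(G, "C", "label")   # all nodes share one label here
print(mcs_similarity(G1, G2))
```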

Properties of a General Measure of Configuration Agreement Stephen L. France Lubar School of Business, University of Wisconsin-Milwaukee, P. O. Box 742, Milwaukee, WI 53201-0742, USA [email protected] Abstract. Variants of the Rand index of clustering agreement have been used to measure agreement between spatial configurations of points (Akkucuk 2004, Chen 2006). For these techniques, the k-nearest neighbors of each point are compared between configurations. France and Carroll (2007) generalize the agreement measure across multiple values of k. The weights for each value of k are defined by a function. The generalized agreement metric is denoted as ψ. An overall framework for rank-based/neighborhood agreement methods is given in Lee and Verleysen (2009). In this paper, we generalize ψ to the case of more than two configurations. We develop a partial-agreement index as a neighborhood agreement version of the partial correlation coefficient. We examine the statistical properties of ψ given a simple constant weighting scheme where f(k) = ck. We demonstrate the use of ψ with several illustrative examples.

References AKKUCUK, U. (2004): Nonlinear mapping: Approaches based on optimizing an index of continuity and applying classical metric MDS on revised distances. Ph.D. dissertation, Rutgers University. CHEN, L. (2006): Local multidimensional scaling for nonlinear dimension reduction, graph layout and proximity analysis. Ph.D. dissertation, University of Pennsylvania. FRANCE, S.L. and CARROLL, J.D. (2007): Development of an agreement metric based upon the Rand index for the evaluation of dimensionality reduction techniques, with applications to mapping customer data. In: P. Perner (Ed.): Proc. MLDM 2007. Springer, Berlin, 499–517. LEE, J.A. and VERLEYSEN, M. (2009): Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72, 365–392.

Keywords AGREEMENT, RANKINGS, RAND INDEX, GINI
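A bare-bones version of the underlying k-neighborhood agreement between two configurations, combined over several values of k, is sketched below; the weighting here is an arbitrary illustration, not the ψ weighting of France and Carroll (2007):

```python
import numpy as np

def knn_sets(X, k):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                   # exclude the point itself
    return [set(row) for row in np.argsort(d, axis=1)[:, :k]]

def agreement(X, Y, k):
    """Average overlap of k-nearest-neighbor sets across two configurations."""
    A, B = knn_sets(X, k), knn_sets(Y, k)
    return np.mean([len(a & b) / k for a, b in zip(A, B)])

def weighted_agreement(X, Y, ks, weights):
    w = np.asarray(weights, float)
    return np.sum(w * [agreement(X, Y, k) for k in ks]) / w.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
Y = X[:, :2] + 0.05 * rng.normal(size=(60, 2))    # a noisy 2-D "embedding" of X
print(round(weighted_agreement(X, Y, ks=[5, 10, 15], weights=[3, 2, 1]), 3))
```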

Color Image Segmentation Based on an Iterative Graph Cut Algorithm Using Time-of-Flight Cameras Franke Abstract. This work describes an approach to color image segmentation in which an iterative graph cut segmentation algorithm is supported by depth data collected with time-of-flight (TOF) cameras. The graph cut algorithm uses an energy minimization approach to segment an image, taking into account both color and contrast information. The foreground and background color distributions of the images subject to segmentation are represented by Gaussian mixture models, which are optimized iteratively by parameter learning. These models are initialized by a preliminary segmentation created from depth data, automating the model initialization step, which otherwise relies on user input.
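A rough sketch of initializing an iterative graph cut from a depth threshold is given below, with OpenCV's grabCut standing in for the segmentation algorithm described above; file names and the depth threshold are placeholders:

```python
import numpy as np
import cv2

# Hypothetical inputs: a registered color image and a ToF depth map (in meters).
color = cv2.imread("scene_color.png")
depth = np.load("scene_depth.npy")

# Initialize the mask from depth: near pixels are marked as probable foreground.
mask = np.where(depth < 1.2, cv2.GC_PR_FGD, cv2.GC_PR_BGD).astype(np.uint8)

bgd_model = np.zeros((1, 65), np.float64)     # internal GMM state for grabCut
fgd_model = np.zeros((1, 65), np.float64)
cv2.grabCut(color, mask, None, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_MASK)

foreground = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))
cv2.imwrite("foreground_only.png", color * foreground[..., None].astype(np.uint8))
```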

Reporting Differentiated Literacy Results in PISA by using Multidimensional Adaptive Testing
Andreas Frey1, Ulf Kröhne2 and Nicki-Nils Seitz3
1 Friedrich Schiller University Jena, Germany [email protected]
2 German Institute for International Educational Research, Frankfurt, Germany [email protected]
3 Friedrich Schiller University Jena, Germany [email protected]

Abstract. A current application of multidimensional item response theory lies in multidimensional adaptive testing (MAT). In MAT several latent constructs are measured simultaneously while the answers to previously administered items are used to optimize the selection of the next item. MAT allows for substantial increases in measurement efficiency. Within a real-data simulation it was examined whether this capability can be used to report reliable results for all 10 subdimensions of students' literacy in reading, mathematics and science considered in the international large-scale assessment of student achievement PISA (Programme for International Student Assessment). The responses of N = 14 624 students who participated in the PISA assessments of the years 2000, 2003 and 2006 in Germany were used to simulate unrestricted MAT, MAT with the multidimensional maximum priority index method (MMPI), and MAT with MMPI taking typical restrictions of the PISA assessments (treatment of link items, treatment of open items, grouping of items into units) into account. In contrast to conventional testing based on the booklet design of PISA 2006, for MAT with MMPI the reliability coefficients for all subdimensions were larger than .80. The incorporation of PISA-typical restrictions reduced these advantages slightly. The findings demonstrate that MAT with MMPI can successfully be used for subdimensional reporting in PISA.

Keywords Item Response Theory, Multidimensional Adaptive Testing, Programme for International Student Assessment


Time-consistent Foreground Segmentation of Dynamic Content from Color and Depth Video Frick, Franke, and Koch, CAU - Kiel Abstract. This paper introduces an approach for automatic foreground extraction from videos utilizing depth information from time-of-flight (ToF) cameras. We give a clear definition of background and foreground based on 3D scene geometry and provide means of foreground extraction based on one-dimensional histograms in 3D space. Further, a refinement step based on hierarchical grab-cut segmentation in a video volume with incorporated time constraints is proposed. Our approach is able to extract time-consistent foreground objects even for a moving camera and for dynamic scene content, but is limited to indoor scenarios.


Designing a Subject Facet for a Library Catalogue, Using the Example of the Universitätsbibliothek Mannheim Julian Frick Abstract. A search function not yet realised in many library catalogues is the targeted search for literature from particular subject areas. Searches using notations of the classification employed in the catalogue, or using subject headings, usually cannot meet the demands of a search that covers an entire subject area. One possible solution is the development of a library-specific subject facet in which each title is assigned to one or more subjects on the basis of its subject indexing data. After an overview of subject faceting options already available in various library catalogues, the talk explains the design of a subject facet for the catalogue of the Universitätsbibliothek Mannheim. Particular attention was paid to the available subject indexing data and to the subject emphases of the library's media holdings. The goal was to define and compile a set of subjects that can be implemented and used in the library catalogue in different variants.


Clustering Images Using Earth Mover's Distance: A Comparison of Traditional and New Varieties Sarah Frost and Daniel Baier Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany, {Sarah.Frost,daniel.baier}@tu-cottbus.de Abstract. There are many different distances for calculating dissimilarities between images. Some of them are fast to compute but yield imprecise results. Others give better results, in particular the Earth Mover's Distance (EMD) (see Rubner (2000) or Wichterich (2008)), but its computational complexity prevents its use in large databases (see Ling (2007)). For marketing purposes, such as clustering consumer images, every image in a database has to be compared with every other image; hence the EMD is impractical. Therefore, several approximations of the EMD have been presented in recent years by Serratosa and Sanroma (2008), Ling and Okada (2007), and Shirdhonkar et al. (2008), which are reviewed here. This paper also presents a new, intuitively intelligible approximation of the EMD for equally sized histograms. The empirical study in this paper aims to show whether the good results of the EMD justify its long computation times, or whether faster methods of calculation provide results as good as the EMD's. We test the distances on images from the Caltech-256 Object Category Dataset (Griffin (2007)) and evaluate their results by means of the Rand index (Hubert (1985)).

Keywords IMAGE CLUSTERING, EARTH MOVER’S DISTANCE
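For one-dimensional histograms of equal total mass, the EMD discussed above has a closed form: the (bin-width-scaled) L1 distance between the cumulative histograms. A minimal sketch under that assumption; it does not reproduce the approximations compared in the paper.

```python
import numpy as np

def emd_1d(h1, h2, bin_width=1.0):
    """Un-normalised EMD between two 1-D histograms of equal total mass:
    the L1 distance between their cumulative histograms times the bin width.
    (Divide by the total mass to obtain the usual normalised EMD.)"""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    assert h1.shape == h2.shape and np.isclose(h1.sum(), h2.sum())
    return float(np.abs(np.cumsum(h1) - np.cumsum(h2)).sum() * bin_width)

# toy usage: shifting 4 units of mass by one bin costs 4 (or 1 after normalisation)
print(emd_1d([0, 4, 0, 0], [0, 0, 4, 0]))
```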


Model-based Clustering of Time Series Sylvia Frühwirth-Schnatter Wirtschaftsuniversität Wien, Austria [email protected] Abstract. Clustering is a widely used statistical tool to determine subsets in a given data set. Frequently used clustering methods are mostly based on distance measures and cannot easily be extended to cluster time series within a panel or a longitudinal data set. The talk reviews recently suggested approaches to model-based clustering of panel or longitudinal data based on finite mixture models. Several approaches are considered that are suitable both for continuous as well as for categorical time series observations. Bayesian estimation through Markov chain Monte Carlo methods is discussed in detail and various criteria to select the number of clusters are reviewed. Applications to a panel of marijuana use among teenagers as well as to clustering labor market outcomes in the Austrian labor market serve as illustrations.


Weiß nicht, was soll es bedeuten. Representing Part-Whole Relationships in Online Catalogues Katja Ganzenmueller Abstract. Part-whole relationships occur in many different forms, e.g. between the volumes of a monographic series or between articles in a collected volume. What they all have in common is that they are difficult to display in online catalogues and that navigating between the related parts is cumbersome. As part of a Bachelor's thesis at the Hochschule der Medien, Stuttgart, German and Austrian online catalogues are examined with respect to how they represent hierarchical structures. The depth of description, the kind of title records shown in the result list, and the way these records are distinguished are taken into account. In addition, navigation between related records from the result list and from the full record display is considered. Based on this study, and after a brief look at AACR catalogues and at non-library search tools, suggestions for improving the hierarchical display are made.


Visual Motion Capturing for Kinematic Model Estimation of a Humanoid Robot Andre Gaschler TUM Abstract. Controlling a tendon-driven robot like the humanoid Ecce is a difficult task, even more so when its kinematics and its pose are not known precisely. In this paper, we present a visual motion capture system to allow both real-time measurements of robot joint angles and model estimation of its kinematics. Unlike other humanoid robots, Ecce (see Fig. 1A) is completely molded by hand and its joints are not equipped with angle sensors. This anthropomimetic robot design calls for both (i) real-time measurement of joint angles and (ii) model estimation of its kinematics. The underlying principle of this work is that all kinematic model parameters can be derived from visual motion data. Joint angle data finally lay the foundation for physics-based simulation and control of this novel musculoskeletal robot.


A New Approach for Graph Clustering Wolfgang Gaul1 and Rebecca Klages2
1 Institute of Decision Theory and Management Science, Karlsruhe Institute of Technology (KIT), Kaiserstr. 12, 76128 Karlsruhe, [email protected]
2 Institute of Decision Theory and Management Science, Karlsruhe Institute of Technology (KIT), Kaiserstr. 12, 76128 Karlsruhe, [email protected]

Abstract. The problem of finding structures in graphs (e.g., Schaeffer (2007), Fortunato (2010)) is an area that has been applied with success to topics as diverse as social networks, the internet and scientific citation, to name a few. A widely used measure to evaluate the quality of a given clustering of an undirected graph is modularity, introduced by Newman and Girvan (2004), which has been used in classification heuristics that aspire to find optimal structures (e.g., Blondel et al. (2008), Zhu et al. (2008)). Since finding a vertex partition with maximum modularity in a graph is NP-hard (Brandes et al. (2008)), we propose a new approach based on standard methods known from cluster analysis, select the solution with the highest modularity value, and apply a vertex exchange algorithm for checking possible improvements. We apply our method to several real-life as well as artificial graphs.

References BLONDEL, V.D., GUILLAUME, J.-L., LAMBIOTTE, R. and LEFEBVRE, E. (2008): Fast unfolding of community hierarchies in large networks. J. Stat. Mech., 10, P10008. BRANDES, U., DELLING, D., GAERTLER, M., GOERKE, R., HOEFER, M., NIKOLOSKI, Z., and WAGNER, D. (2008): On modularity clustering. IEEE Transactions on Knowledge and Data Engineering, 20(2), 172–188. FORTUNATO, S. (2010): Community detection in graphs. Physics Reports, 486, 75–174. NEWMAN, M. and GIRVAN, M. (2004): Finding and evaluating community structure in networks. Physical Review E, 69, 026113. SCHAEFFER, S. (2007): Graph clustering. Computer Science Review, 1, 27–64. ZHU, Z., WANG, C., MA, L., PAN, Y. and DING, Z. (2008): Scalable Community Discovery of Large Networks. WIAM ’08: Proceedings of the Ninth International Conference on Web-Age Information Management, 381–388.

Keywords MODULARITY CLUSTERING, DIRECTED NETWORKS
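A minimal sketch of the modularity Q of Newman and Girvan for a given partition of an undirected graph, i.e. the quantity whose value the approach above selects and then tries to improve by vertex exchanges; the clustering and exchange steps themselves are not shown, and the toy graph is an illustrative assumption.

```python
import numpy as np

def modularity(A, labels):
    """Newman-Girvan modularity Q = (1/2m) * sum_ij (A_ij - k_i*k_j/(2m)) * [c_i = c_j]
    for an undirected graph with adjacency matrix A and a vector of cluster labels."""
    A = np.asarray(A, float)
    k = A.sum(axis=1)            # vertex degrees
    two_m = A.sum()              # 2m: every edge is counted twice
    same = np.equal.outer(labels, labels)
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

# toy usage: two triangles joined by a single bridge edge, clustered into the triangles
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
print(round(modularity(A, [0, 0, 0, 1, 1, 1]), 3))   # approx. 0.357
```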


An Approach for Topic Trend Detection Wolfgang Gaul1 and Dominique Vincent2
1 Institute of Decision Theory and Management Science, Karlsruhe Institute of Technology (KIT), Kaiserstr. 12, 76128 Karlsruhe, [email protected]
2 Institute of Decision Theory and Management Science, Karlsruhe Institute of Technology (KIT), Kaiserstr. 12, 76128 Karlsruhe, [email protected]

Abstract. The detection of topic trends is an important issue in textual data mining (see, e.g., Kontostathis et al. (2004)). For this task, textual articles from newspapers collected over a certain period are analysed by grouping them into homogeneous clusters. We use a vector space model (Salton (1989)) and a straightforward vector cosine measure to evaluate document-document similarities (see, e.g., Eichmann and Srinivasan (2002)) and discuss how cluster-cluster similarities can help to detect alterations of topic trends over time. Our method is demonstrated by using an empirical data set of about 250 preclassified time-stamped documents. The results allow us to assess which method-specific parameters are valuable for further research.

References EICHMANN, D. and SRINIVASAN, P. (2002): A Cluster-Based Approach to Broadcast News. In ALLAN, J. (Ed.): Topic Detection and Tracking: Event-based Information Organization. Kluwer Academic Publishers. KONTOSTATHIS, A., GALITSKY, L., POTTENGER, W.M., ROY, S., PHELPS, D.J. (2004): A Survey of Emerging Trend Detection in Textual Data Mining. In Berry, M.W. (Ed.): A Comprehensive Survey of Text Mining - Clustering, Classification, and Retrieval. Springer. SALTON, G. (1989): Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing Company, Inc.

Keywords TOPIC DETECTION, TREND DETECTION, TEXT MINING, CLUSTERING, FEATURE EXTRACTION
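A minimal sketch of the vector space model with cosine document-document similarities, plus a simple average-linkage style cluster-cluster similarity; scikit-learn, the toy documents, and the averaging rule are illustrative assumptions, and the trend-detection step over time is not reproduced.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["stock markets fall on interest rate fears",
        "central bank raises the interest rate",
        "team wins the championship final",
        "injury worries before the cup final"]

X = TfidfVectorizer().fit_transform(docs)   # documents in the vector space model
S = cosine_similarity(X)                    # document-document cosine similarities

def cluster_similarity(S, cluster_a, cluster_b):
    """Cluster-cluster similarity as the mean of the pairwise document similarities."""
    return float(S[np.ix_(cluster_a, cluster_b)].mean())

finance, sports = [0, 1], [2, 3]
print(round(cluster_similarity(S, finance, sports), 3),
      round(cluster_similarity(S, sports, sports), 3))
```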


Optimal Network Revenue Management Decisions Including Flexible Demand Data and Overbooking Wolfgang Gaul1 and Christoph Winkler2
1 Institute of Decision Theory and Management Science, Karlsruhe Institute of Technology (KIT), Kaiserstraße 12, 76128 Karlsruhe, [email protected]
2 Institute of Decision Theory and Management Science, Karlsruhe Institute of Technology (KIT), Kaiserstraße 12, 76128 Karlsruhe, [email protected]

Abstract. In aviation network revenue management it is helpful to address consumers who are flexible w. r. t. certain flight characteristics, e.g., departure times, number of intermediate stops, booking class assignments. The offering of these so-called flexible products is gaining increasing importance (e.g. Gallego/Phillips (2004), Gallego/Iyengar/Phillips/Dubey (2004), Petrick/Goensch/Steinhardt/Klein (2010)). While overbooking (e.g. Bertsimas/Popescu (2003), Erdelyi/Topaloglu (2009)) has some tradition in network revenue management, the simultaneous handling of both aspects is new. We develop a DLP (deterministic linear programming) model that considers flexible products and overbooking and use an empirical example for the explanation of our findings.

References BERTSIMAS, D.; POPESCU, I. (2003): Revenue Management in a Dynamic Network Environment. Transportation Science 37, No. 3, 257-277. ERDELYI, A.; TOPALOGLU, H. (2009): Separable Approximations for Joint Capacity Control and Overbooking Decisions in Network Revenue Management. Journal of Revenue and Pricing Management 8, No. 1, 3-20. GALLEGO, G.; IYENGAR, G.; PHILLIPS, R. DUBEY, A. (2004): Managing Flexible Products on a Network. CORC Technical Report TR-2004-01. GALLEGO, G.; PHILLIPS, R. (2004): Revenue Management of Flexible Products. Manufacturing & Service Operations Management 6, No. 4, 321-337. PETRICK, A.; GOENSCH, J.; STEINHARDT, C.; KLEIN, R. (2010): Dynamic Control Mechanisms for Revenue Management with Flexible Products. Computers & Operations Research (2010), doi:10.1016/j.cor.2010.02.003.

Keywords Network Revenue Management, Flexible Products, Overbooking, Deterministic Linear Programming


Regularization and Model Selection with Categorical Covariates Jan Gertheiss, Veronika Stelz and Gerhard Tutz Department of Statistics, LMU Munich; correspondence to: [email protected] Abstract. The challenge in regression problems with categorical covariates is the high number of parameters involved. Common regularization methods like the Lasso (Tibshirani, 1996), which allow for selection of predictors, are typically designed for metric predictors. If independent variables are categorical, selection strategies should be based on modified penalties. For categorical predictor variables with many categories, a useful strategy is to search for clusters of categories with similar effects. The objective is to reduce the set of categories to a smaller number of categories which form clusters. The effect of categories within one cluster is supposed to be the same, but the (conditional) expectation of the response will differ across clusters. In the talk, L1-penalty-based methods for factor selection and clustering of categories are presented and investigated. A distinction is made between nominally and ordinally scaled covariates. The proposed regularization techniques combine, adapt and extend ideas from Bondell and Reich (2009) and Gertheiss and Tutz (2010).

References BONDELL, H.D. and REICH, B.J. (2009): Simultaneous Factor Selection and Collapsing Levels in ANOVA. Biometrics, 65, 169–177. GERTHEISS, J. and TUTZ, G. (2010): Sparse Modeling of Categorial Explanatory Variables. The Annals of Applied Statistics, 4, 2150–2180. TIBSHIRANI, R. (1996): Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.

Keywords CATEGORICAL PREDICTORS, CLUSTERING, MODEL SELECTION, REGULARIZATION
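A minimal sketch of the penalty idea for an ordinal covariate: with split (thermometer) coding, each coefficient equals the difference between the effects of adjacent categories, so an ordinary L1 (Lasso) penalty fuses neighbouring categories into clusters. The simulated data, the tuning value, and the use of scikit-learn's Lasso are illustrative assumptions and only a simplified stand-in for the penalties presented in the talk.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
K, n = 6, 400
x = rng.integers(1, K + 1, size=n)                       # ordinal predictor, categories 1..K
true_effect = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 2.0])   # categories 1-3 and 4-5 share effects
y = true_effect[x - 1] + rng.normal(scale=0.3, size=n)

# split coding: the column for threshold j indicates x >= j, so its coefficient equals the
# difference between the effects of categories j and j-1; the L1 penalty shrinks such differences
Z = np.column_stack([(x >= j).astype(float) for j in range(2, K + 1)])

fit = Lasso(alpha=0.05).fit(Z, y)
print(np.round(fit.coef_, 2))   # near-zero entries mark adjacent categories fused into one cluster
```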


Embedded Variable Selection in Classification Trees Servane Gey1 and Tristan Mary-Huard2
1 MAP5, UMR 8145, Université Paris Descartes, Paris, [email protected]
2 UMR AgroParisTech INRA MIA 518, Paris, [email protected]

Abstract. The Classification and Regression Tree (CART, [1]) algorithm is a well-established algorithm to build and prune decision trees that has been successfully applied in various fields. A crucial step of the CART algorithm is the pruning process, in which a tree classifier f̂T is selected by minimizing a penalized criterion of the form

Pn(fT) + αn × |T|,   (1)

where n is the number of observations, Pn(fT) is the empirical risk of the classifier fT built on tree T, αn is a tuning parameter, and |T| is the size of the tree. While many theoretical works have investigated this pruning process (see [2] for a review), the results obtained so far are not satisfactory. It is well known that CART belongs to the family of embedded variable selection methods: since at each node only one variable is selected, f̂T will only include a small subset of the p initial variables. Due to this inner variable selection process, p should play a crucial role in the regularization term. However, criterion (1) does not obviously depend on p, and this has not been studied so far. This paper investigates the exact impact of inner variable selection on tree classifier selection. From a theoretical point of view, we prove that criterion (1) can be validated under some assumption on the data distribution (known as the margin assumption) by providing performance guarantees for f̂T through an upper bound on its risk. This upper bound establishes that αn should depend linearly on log(p). This linear relationship is then observed in practice: a simulation study is performed which shows that the proposed penalization function is the one that is implicitly used in the classical implementation of the CART algorithm.

References [1] BREIMAN, L., FRIEDMAN, J.H., OLSHEN, R.A., and STONE, C.J. (1984): Classification And Regression Trees. Chapman & Hall. [2] GEY, S. (2010): Risk bounds for CART classifiers under a margin condition. Tech. Rep. 0902.3130v4, arXiv.

Keywords CLASSIFICATION TREE, VARIABLE SELECTION, STATISTICAL LEARNING THEORY.
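A minimal sketch of cost-complexity pruning with the penalty constant scaled as log(p)/n, reflecting the linear dependence of αn on log(p) derived above. scikit-learn's ccp_alpha pruning (whose criterion uses total impurity rather than the empirical misclassification risk in (1)) and the constant C are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

p = 50   # total number of variables, only a few of them informative
X, y = make_classification(n_samples=1000, n_features=p, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

C = 0.5                                # illustrative constant
alpha = C * np.log(p) / len(y_tr)      # penalty constant scaled as log(p)/n

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
print(full.get_n_leaves(), pruned.get_n_leaves(), round(pruned.score(X_te, y_te), 3))
```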


Modified Randomized Modularity Clustering: Adapting the Resolution Limit Andreas Geyer-Schulz1 and Michael Ovelgönne2
1 Information Service and Electronic Markets, [email protected]
2 Information Service and Electronic Markets, [email protected]

Abstract. Fortunato and Barthélemy (2007) carefully investigated implied resolution limits of modularity clustering algorithms. Geyer-Schulz and Ovelgönne (2011) studied Fortunato and Barthélemy's scaling bounds together with the stability of graph clusterings on various types of regular graphs. They identified the mismatch of the underlying structure of the graph automorphism group with the implied resolution limit of the modularity clustering algorithm as the main problem in finding proper cluster partitions. In this contribution, a (simple) modification of the objective function of modularity clustering is investigated which adapts the implied resolution limit in a suitable way. Key words: modularity clustering, resolution limit, scaling effects

References Fortunato, S. and Barthélemy, M. (2007): Resolution Limit in Community Detection. Proceedings of the National Academy of Sciences of the United States of America, 104(1), 36–41. Geyer-Schulz, A. and Ovelgönne, M. (2011): On Diagnostics for Modularity Clustering, GPSDAA 2011, 2nd Bilateral German-Polish Symposium on Data Analysis and Its Applications, Cracow, p. 9. Ovelgönne, M., Geyer-Schulz, A. and Stein, M. (2010): Randomized Greedy Modularity Optimization for Group Detection in Huge Social Networks, ACM Workshop on Social Network Mining and Analysis, to appear.


Similarity learning with a collection of matrices and tensors Clement Grimal and Gilles Bisson LIG/AMA - Centre Equation 4 - UFR IM2AG BP 53 - F-38041 Grenoble Cedex 9 {Clement.Grimal,Gilles.Bisson}@imag.fr Abstract. In the data analysis domain, data are often described by a single matrix in which the rows describe objects and the columns describe features of these objects. Nevertheless, in many real-world applications, data are said to be multi-dimensional, meaning that they cannot be fully described by only one matrix, but rather by a collection of matrices, or even by a tensor. This kind of data is typically found in social networks, where several entities (people, documents, etc.) can have multiple relationships. In the framework of the co-clustering problem (or two-way clustering) we already proposed (Bisson et al. 2008; Hussain et al. 2010) an algorithm, named chisim, to compute the similarity measures between rows and columns simultaneously. This approach allows us to discover high-order similarities between objects and features, and we have experimentally shown that our method achieves better results than state-of-the-art co-clustering systems such as LSA, ITCC, etc. In this paper, we propose to go a step further by extending the chisim method in two directions. First, we introduce different architectures in order to combine the similarity matrices obtained by applying chisim to a collection of matrices. We show that some of these architectures improve the quality of the co-clustering. Second, we propose a natural generalization of chisim in order to directly tackle tensors. Here, a major concern is the time and space complexity of this new method, and thus we analyze some trade-offs allowing the problem to become tractable for medium-sized datasets. Key words: multi-dimensional data, similarity learning, social networks

References Bisson, G. and Hussain, S. F. (2008): chisim: A New Similarity Measure for the Co-clustering Task. In proceedings of ICMLA’2008. Hussain, S. F., Grimal, C. and Bisson, G. (2010): An Improved Co-Similarity Measure for Document Clustering. In proceedings of ICMLA’2010.


Automatic Subject Indexing at the ZBW - Status Quo and Outlook Thomas Gross ZBW - Deutsche Zentralbibliothek für Wirtschaftswissenschaften, Leibniz-Informationszentrum Wirtschaft, Kiel/Hamburg [email protected] Abstract. By implementing an automatic subject indexing procedure, the ZBW wants on the one hand to respond to the steady increase in online documents and on the other hand to break new ground in content indexing. Besides relieving intellectual indexing through a semi-automatic or fully automatic procedure, it should moreover become possible to index digital information resources of any kind from outside the ZBW with machine support and to make them findable in a common search space. In the current project, the vocabularies used at the ZBW (verbal subject indexing with the Standard-Thesaurus Wirtschaft, and classificatory indexing with the Standardklassifikation Wirtschaft) are being adapted, trained and evaluated for the automatic procedure. The talk focuses on the ZBW's experience with the organisational implementation of automatic subject indexing and on the possibilities for evaluating these procedures.


Integrated risk management in practice: How reliable is it? Peter Grundke University of Osnabrück, Germany [email protected] Abstract. For a correct aggregation of losses resulting from different risk types and, hence, a correct computation of total economic capital requirements, existing stochastic dependencies between risk-specific losses have to be considered by integrated risk management (IRM) approaches. Banks predominantly compute economic capital requirements for each risk type separately; later, with or without considering diversification effects, these various capital requirements are aggregated to a total economic capital number. Two more sophisticated approaches proposed for computing total economic capital requirements are the so-called top-down approach and the bottom-up approach. Within the top-down approach, the separately determined marginal distributions of aggregate profits and losses resulting from different risk types (e.g., market, credit and operational risk) are linked by copula functions to model their joint distribution function. In contrast, bottom-up approaches model the complex interactions between different risk types already on the level of the individual financial instruments and risk factors. Most of the literature that deals with integrated risk management is of a scientific nature. This means that (partly technically very advanced) models are proposed which, however, are restricted to a few risk types, are applied to very stylized bank portfolios, or solve data problems by working with assumed parameters and distributions. There are only a few papers in which frameworks are presented that aim to be used in practice. As an example of such an IRM practice model, we employ a restricted version of the IRM model of the Deutsche Bank AG [see Brockmann, M., M. Kalkbrener (2010): On the aggregation of risk, Journal of Risk 12(3), p. 45-68]. This model could be characterized as a top-down approach with parametric marginal distributions; however, the authors show that there is also some relationship to bottom-up approaches. The accuracy of this IRM practice model (restricted to the two most important risk types, market and credit risk) is tested within a simulation study. For this, a simple bottom-up approach is used for generating stochastically dependent credit and market risk losses. Afterwards, the IRM practice model is calibrated on the simulated loss data, and the diversification benefits computed by this model are compared with those of the data-generating bottom-up approach. Furthermore, determinants of the accuracy of the IRM practice model are analyzed.


Mixture model clustering with explanatory variables: one-step and three-step approaches D. Gudicha and J. Vermunt Tilburg University, Netherlands Abstract. Mixture modeling is becoming a more and more popular technique for cluster analysis. After determining the number and the shapes of the clusters using statistical criteria, researchers will often wish to link the cluster memberships to explanatory variables or covariates. This profiling of the clusters or latent classes can either be done using a one-step or a three-step approach. In the one-step approach, the mixture model is expanded to include the covariates in a regression model for the prior class membership probabilities. The three-step approach involves estimating the mixture model (step 1), classifying subjects into their most likely latent class (step 2), and regressing the class assignments on covariates (step 3). Bolck, Croon, and Hagenaars (2004) and Vermunt (2010) showed for latent class models with categorical responses that the three-step approach may yield severely downward biased estimates for the covariate effects. These authors also propose two different ways of correcting for this bias, that is, via weighted data analysis or an analysis with known classification errors in step three. In our study we show how to generalize the methods proposed by Bolck, Croon, and Hagenaars (2004) and Vermunt (2010) to the situation in which the response variables used in the mixture model are not categorical but continuous. The main complicating factor is that a complex multidimensional integral needs to be solved in order to obtain the classification error matrix needed for the corrections in step 3, using either Monte Carlo integration or a summation over the observed distribution of the responses. In a simulation study we compare the performance (the ability to recover covariate effects) of the one-step approach, the standard (biased) three-step approach and various types of three-step approaches with corrections. The main conditions that were varied are class separation and sample size.

Keywords latent class analysis, mixture modeling, Monte Carlo integration, cluster analysis


From individual categorisations to consensus ones Alain Gu´enoche Universit´e Paris 1, France [email protected] Abstract. Starting from individual judgments given as categories (i.e., a profile of partitions on an X item set), we attempt to establish a collective partitioning of the items. For that task, we compare two combinatorial approaches. The first one allows to calculate a consensus partition, namely the median partition of the profile, which is the partition of X whose sum of distances to the individual partitions is minimum. Then, the collective classes are the classes of this partition. The second one, due to J.P. Barth´elemy, consists in first calculating a distance D on X based on the profile and then in building an X-tree associated to D. The collective classes are then some of its subtrees. We compare these two approaches and more specifically study in what extent they produce the same decision as a set of collective classes.

Keywords Categorization data, Partitions, Tree representation, Consensus
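A minimal sketch of the objective behind the median (consensus) partition mentioned above: a pair-counting symmetric-difference distance between partitions and the total distance of a candidate partition to the profile, which the median partition minimises. The optimisation itself and the X-tree construction of the second approach are not shown; names and toy data are illustrative.

```python
from itertools import combinations

def partition_distance(p, q):
    """Pair-counting symmetric-difference distance: number of item pairs that are
    clustered together in exactly one of the two partitions (given as label lists)."""
    return sum((p[i] == p[j]) != (q[i] == q[j])
               for i, j in combinations(range(len(p)), 2))

def profile_cost(candidate, profile):
    """Total distance of a candidate partition to all partitions of the profile;
    the median (consensus) partition is a minimiser of this cost."""
    return sum(partition_distance(candidate, q) for q in profile)

# toy profile of three individual categorisations of five items
profile = [[0, 0, 1, 1, 2], [0, 0, 0, 1, 1], [0, 1, 1, 1, 2]]
print(profile_cost([0, 0, 1, 1, 2], profile), profile_cost([0, 0, 0, 0, 0], profile))
```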


Multi-view Active Appearance Models for the X-ray Based Analysis of Avian Bipedal Locomotion Haase, Nyakatura, and Denzler, University of Jena Abstract. Many fields of research in biology, motion science and robotics depend on the understanding of animal locomotion. Therefore, numerous experiments are performed using high-speed biplanar x-ray acquisition systems which record sequences of walking animals. Until now, the evaluation of these sequences has been a very time-consuming task, as human experts have to manually annotate anatomical landmarks in the images. Therefore, an automation of this task with a minimum level of user interaction is worthwhile. However, many difficulties in the data, such as x-ray occlusions or anatomical ambiguities, drastically complicate this problem and require the use of global models. Active Appearance Models (AAMs) are known to be capable of dealing with occlusions, but have problems with ambiguities. We therefore analyze the application of multi-view AAMs in the scenario stated above and show that they can effectively handle uncertainties which cannot be dealt with using single-view models. Furthermore, preliminary studies on the tracking performance of human experts indicate that the errors of multi-view AAMs are of the same order of magnitude as in the case of manual tracking.


Supervised and Unsupervised Classification of Rankings Using a Kemeny Distance Framework Willem J. Heiser1 and Antonio D'Ambrosio2
1 Institute of Psychology, Leiden University, The Netherlands
2 Department of Mathematics and Statistics, University of Naples Federico II, Italy, [email protected]

Abstract. Rankings and partial rankings are ubiquitous in data analysis, yet there is relatively little work on their classification that uses the typical properties of rankings. We propose a common framework for both the prediction of rankings and clustering of rankings, which is also valid for partial rankings. This framework is based on the Kemeny distance, defined as the minimum number of interchanges of two adjacent elements required to transform one ranking into another. The Kemeny distance is equivalent to Kendall's tau for complete rankings, but for partial rankings it is equivalent to Emond and Mason's extension of tau. For clustering (unsupervised classification), we use the probabilistic distance method proposed by Ben-Israel and Iyigun, and define the disparity between a ranking and the center of a cluster as the Kemeny distance. For prediction (supervised classification), we build a classification tree by recursive partitioning, and define the impurity measure of the subgroups formed as the sum of the within-node Kemeny distances. In both cases, the center of a subgroup of (partial) rankings, also called the consensus ranking, is useful to characterize the subgroup. It is well known that finding the consensus ranking is an NP-hard problem. We use a branch-and-bound algorithm to find approximate solutions. Illustrative examples are given for both procedures.

Keywords Kemeny distance, probabilistic distance method, classification trees
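A minimal sketch of the Kemeny distance for complete rankings, counted as the number of object pairs ranked in opposite order (equivalently, the minimum number of adjacent interchanges); the extension to partial rankings and ties following Emond and Mason is not covered, and the toy rankings are illustrative.

```python
from itertools import combinations

def kemeny_distance(r1, r2):
    """Number of object pairs ranked in opposite order by two complete rankings,
    where r1[i], r2[i] are the rank positions of object i (lower = preferred)."""
    return sum((r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
               for i, j in combinations(range(len(r1)), 2))

# toy usage: one adjacent interchange costs 1, a full reversal of 4 objects costs 6
print(kemeny_distance([1, 2, 3, 4], [1, 2, 3, 4]),
      kemeny_distance([1, 2, 3, 4], [2, 1, 3, 4]),
      kemeny_distance([1, 2, 3, 4], [4, 3, 2, 1]))
```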


Subject Indexing in and with Wikipedia - Idea, Prototype, Discussion Lambert Heller Abstract. The systematic subject indexing of literature could and should nowadays better take place within Wikipedia, because Wikipedia is a superior thesaurus; moreover, with support from automatic procedures, a simple collaborative indexing process can be constructed. I put this thesis up for discussion in 2010 in the weblog Biblionik and illustrated it with mock-ups. In 2011, Magnus Manske, one of the inventors of the MediaWiki software, the technical basis of Wikipedia, created with LITurgy the first working prototype of such a subject indexing procedure. Using selected examples, the talk shows Wikipedia's head start over non-collaborative thesauri of various disciplines, briefly presents the LITurgy prototype, and finally discusses the opportunities and risks of collaborative indexing work.


Towards Cross-modal Comparison of Human Motion Data Thomas Helten, Meinard Mueller, Jochen Tautges, Andreas Weber, and Hans-Peter Seidel, MPI Informatik Abstract. Analyzing human motion data has become an important strand of research in many fields such as computer animation, sport sciences, and medicine. In this paper, we discuss various motion representations that originate from different sensor modalities and investigate their discriminative power in the context of motion identification and retrieval scenarios. As one main contribution, we introduce various mid-level motion representations that allow for comparing motion data in a cross-modal fashion. In particular, we show that certain low-dimensional feature representations derived from inertial sensors are suited for specifying high-dimensional motion data. Our evaluation shows that features based on directional information outperform purely acceleration-based features in the context of motion retrieval scenarios.


Some tricky issues in comparative simulations of clustering methods, including the robust improper ML estimator Christian Hennig1 and Pietro Coretto2
1 Department of Statistical Science, University College London, Gower St, London WC1E 6BT, United Kingdom, [email protected]
2 Department of Economics and Statistics, University of Salerno, [email protected]

Abstract. The robust improper ML estimator for Gaussian mixtures was introduced for one-dimensional data by Hennig (2004). This presentation will present some simulation results for the multivariate version of this method (Coretto and Hennig, 2010, presented a one-dimensional simulation study). The main focus of the paper will be on the decisions that need to be made when designing comparative simulation studies in clustering. Apart from the choice of model setups from which to generate data, a key issue is the measurement of quality. Assuming that in clustering we are not mainly interested in parameter estimators but rather in the grouping of the points, it is necessary to define the “true clusters” of the simulated data generating process. This is not trivial, because “outliers” have to be properly distinguished from “clusters”. It will also be discussed whether the clusters that the researcher would be interested in in a real situation correspond to finding the “true” mixture components, or whether quality measurement should be based on within-cluster distances regardless of the true underlying model (which, in reality, may not exist).

References CORETTO, P. and HENNIG, C. (2010): A simulation study to compare robust clustering methods based on mixtures, Advances in Data Analysis and Classification, 4, 111–135. HENNIG, C. (2004): Breakdown points for maximum likelihood-estimators of location-scale mixtures, Annals of Statistics, 32, 1313–1340.

Keywords GAUSSIAN MIXTURE MODEL, MIXTURE OF T-DISTRIBUTIONS, OUTLIER DEFINITION, TCLUST


Some thoughts on the aggregation of variables in dissimilarity design Christian Hennig Department of Statistical Science, University College London, Gower St, London WC1E 6BT, United Kingdom [email protected] Abstract. One way of analysing complex data is to define a dissimilarity measure and to use dissimilarity-based methodology such as dissimilarity-based clustering or k-nearest neighbour methods. The question then arises how to define the dissimilarities. In this talk I will consider how to aggregate variables in order to define dissimilarity measures in high-dimensional data sets or data sets with mixed-type variables. Arising questions concern the standardisation of variables (e.g., they could be standardised by range, variance, or a robust scale statistic such as the MAD), the aggregation method (e.g., Euclidean vs. Manhattan) and variable weighting (this is nontrivial at least for aggregating mixed-type variables, e.g., interval, ordinal and nominal scaled ones). The general philosophy behind the presentation is that there is no objectively optimal solution to these problems and that it is important to understand properly what the different approaches do and imply in order to make a well-informed, application-based decision about them. Furthermore, the interplay between dissimilarity design and the statistical method that is afterwards applied to the dissimilarities will be discussed.

Keywords DISTANCE BASED CLUSTERING, NEAREST NEIGHBOURS, STANDARDISATION, HIGH-DIMENSIONAL DATA
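A minimal Gower-style sketch of the design choices discussed in the talk above: per-variable standardisation, 0/1 mismatches for nominal variables, variable weights, and the aggregation norm are all exposed as parameters. The defaults, names, and toy values are illustrative assumptions rather than a recommended setting.

```python
import numpy as np

def mixed_dissimilarity(x, y, types, scale=None, weights=None, p=1):
    """Dissimilarity between two observations with numeric ('num') and nominal ('nom')
    variables: scaled absolute differences for numeric variables, 0/1 mismatches for
    nominal ones, combined as a weighted L_p mean (p=1 Manhattan-like, p=2 Euclidean-like)."""
    weights = np.ones(len(types)) if weights is None else np.asarray(weights, float)
    parts = []
    for j, t in enumerate(types):
        if t == "num":
            s = 1.0 if scale is None else float(scale[j])   # e.g. range, std, or MAD
            parts.append(abs(float(x[j]) - float(y[j])) / s)
        else:
            parts.append(0.0 if x[j] == y[j] else 1.0)
    parts = np.asarray(parts)
    return float((np.sum(weights * parts ** p) / weights.sum()) ** (1.0 / p))

# toy usage: age and income scaled by their (assumed) ranges, plus a nominal region
a, b = [35, 42000, "north"], [47, 39000, "south"]
print(round(mixed_dissimilarity(a, b, ["num", "num", "nom"], scale=[60, 80000, 1]), 3))
```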


Parallel coordinate plots in archaeology Irmela Herzog1 and Frank Siegmund2
1 LVR-Amt für Bodendenkmalpflege im Rheinland, Germany, [email protected]
2 Basel University, Switzerland, [email protected]

Abstract. Parallel coordinate plots (PCPs) can be applied to explore multivariate data with more than three dimensions. This visualisation method is straightforward and easily intelligible for people without statistical background. However, to our knowledge, PCPs have not yet been applied in archaeology. This paper will present some examples of archaeological classifications which are clearly visible in PCPs. For this purpose a program has been written which offers some additional options which are not supported in standard software for PCP generation. Some of the functionality of Geographic Information Systems (GIS) was introduced for PCPs: This program is able to create a thematic display based on a user-selected variable, optionally multiple plots highlight each thematic colour. Another variable may control the breadth of the PCP lines. Moreover, an info-tool, zoom, and a find-function are supported. The resulting graph can be saved in SVG format and in a GIS format.

References COOK, D. and SWAYNE, D.F. (2007): Interactive and Dynamic Graphics for Data Analysis: With R and GGobi (Use R). Springer, 24-34. SIEGMUND, F. (2000): Ergänzungsbände zum Reallexikon der Germanischen Altertumskunde, Band 23: Alemannen und Franken. De Gruyter.

Keywords PARALLEL COORDINATE PLOT, ARCHAEOLOGY
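A minimal sketch of a thematic parallel coordinate plot using pandas and matplotlib as a stand-in for the purpose-built program described above; the GIS-like info-tool, zoom, find-function, and SVG export are not reproduced, and the column names and data are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# illustrative measurements of finds, with an archaeological type as the thematic variable
df = pd.DataFrame({
    "length": [9.1, 8.7, 12.3, 11.9, 12.8],
    "width":  [2.1, 2.3, 3.4, 3.1, 3.6],
    "weight": [14.0, 13.2, 22.5, 21.1, 23.9],
    "type":   ["A", "A", "B", "B", "B"],
})

# one coloured polyline per object, one vertical axis per variable
parallel_coordinates(df, class_column="type", colormap="viridis")
plt.title("Parallel coordinate plot of finds by type")
plt.tight_layout()
plt.savefig("pcp_finds.png")
```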


Modeling Mortality in the WikiLeaks Afghanistan War Logs: Combining topic models and negative binomial recursive partitioning Paul Hofmarcher1, Reinhold Hatzinger2, Kurt Hornik3 and Thomas Rusch4
1 Institute for Statistics and Mathematics, WU (Vienna University of Economics and Business), Augasse 2-6, 1090 Wien, Austria, [email protected]
2 Institute for Statistics and Mathematics, WU, Augasse 2-6, 1090 Wien, Austria, [email protected]
3 Institute for Statistics and Mathematics, WU, Augasse 2-6, 1090 Wien, Austria, [email protected]
4 Institute for Statistics and Mathematics, WU, Augasse 2-6, 1090 Wien, Austria, [email protected]

Abstract. The WikiLeaks Afghanistan war logs contain more than 79,000 ground-level reports about fatalities and the surrounding situations in the US-led Afghanistan war. They cover the period from January 2004 to December 2009. In this paper we use those reports to build statistical models to understand the mortality rates associated with specific circumstances. Our approach combines Latent Dirichlet Allocation (LDA) with negative binomial based recursive partitioning. LDA is used to process the natural language information contained in the report summaries, i.e., to estimate latent topics and assign each report to one of them. These topic assignments subsequently serve, in addition to variables contained in the data set, as explanatory variables for modeling the number of fatalities of the civilian population and of Anti-Coalition Forces, as well as the combined number of fatalities. Actual modeling is carried out with segmented negative binomial models by means of model-based recursive partitioning, which we call manifest mixture models. For each group of fatalities, we are able to identify segments with different mortality rates that correspond to a number of topics or other explanatory variables and their interactions. Furthermore, we connect those segments to each other and to stories that have been covered in the media. This gives an unprecedented description of the war in Afghanistan as covered by the war logs. Our approach is an example of how modern statistical methods may lead to extra insight if applied to problems of database journalism.

References BLEI, D.M., JORDAN, M.I. and NG, A.Y. (2003). Latent dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022 LAWLESS, J.F. (1987). Negative binomial and mixed poisson regression. The Canadian Journal of Statistics, 15, 209–225. O’LOUGHLIN, J., WITMER, F.D.W., LINKE, A.M., and THORWARDSON, N. (2010): Peering into the fog of war: The geography of the WikiLeaks Afghanistan war logs, 2004–2009. Eurasian Geography and Economics, 51, 1– 24.


Determining the similarity between US cities using a gravity model for search engine query data Paul Hofmarcher1, Bettina Grün2, Kurt Hornik3 and Patrick Mair4
1 Institute for Statistics and Mathematics, WU (Vienna University of Economics and Business), Augasse 2-6, 1090 Wien, Austria, [email protected]
2 Department of Applied Statistics, Johannes Kepler University Linz, Altenbergerstraße 69, 4040 Linz, Austria, [email protected]
3 Institute for Statistics and Mathematics, WU, Augasse 2-6, 1090 Wien, Austria, [email protected]
4 Institute for Statistics and Mathematics, WU, Augasse 2-6, 1090 Wien, Austria, [email protected]

Abstract. Google Trends (Google, 2010) allows one to look up how often a given search term was queried and to determine a relative ranking of the regions and cities where internet users have requested this term. In this paper we use the gravity model (Tobler and Wineberg, 1971) to estimate the similarity of US cities based on data provided by Google Trends (GT). Dictionaries derived from the General Inquirer (see http://www.wjh.harvard.edu/~inquirer/), containing the categories Economy, Politics, and Legal, serve as search terms for GT. In order to visualize our estimated similarities we make use of multidimensional scaling (MDS), which allows us to represent the similarities as distances between points in a low-dimensional space (Borg and Groenen, 2005).

References BORG, I. and GROENEN, P.J. (2005): Modern Multidimensional Scaling. Second Edition. Springer, Berlin-Heidelberg. GOOGLE (2010): About Google Trends. http://www.google.com/intl/en/trends/about.html TOBLER, W. and WINEBERG, S. (1971): Riddle of Cappadocian tablets: Bronze age trade - Statistical evidence for sites of ancient towns. Nature, 231, 39–41.

Keywords GOOGLE TRENDS, CLUSTER ANALYSIS, MULTI-DIMENSIONAL SCALING, GENERAL INQUIRER, GRAVITY MODEL


Detecting person heterogeneity in a large-scale orthographic test by Item Response models Christine Hohensinn, Klaus D. Kubinger and Manuel Reif Department of Psychological Assessment and Applied Psychometrics, Faculty of Psychology, University of Vienna [email protected] Abstract. Achievement tests for students are constructed with the aim of measuring a specific competency uniformly for all examinees. This requires that the underlying population works on the items in a homogeneous way. The unidimensional Rasch model is the model of choice to assess these assumptions in the process of test construction. But it is possible that various subgroups of the population apply either different strategies for solving the items or make specific types of mistakes in the test. The presence of such latent groups would contradict the unidimensional Rasch model. Mixture distribution models such as the mixed Rasch model (Rost, 1990) are methods for detecting such latent groups. The present study examines a large-scale German orthographic test for 8th grade students. In the process of test construction and calibration, the test was administered to 3227 students in Austria. In a first step of data analysis, the items yielded a poor model fit to the unidimensional Rasch model. Therefore, further analyses were conducted to find homogeneous subgroups which are characterized by different orthographic error patterns. Subsequently, the relationship of these latent groups to manifest characteristics of the examinees was assessed.

References ROST, J. (1990): Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.

Keywords ITEM RESPONSE THEORY, RASCH MODEL, LATENT CLASS, TEST CONSTRUCTION


Semiparametric Identification and Estimation in Hidden Markov Models Daniel Hohmann1 and Hajo Holzmann2
1 Fachbereich Mathematik u. Informatik, Philipps-Universität Marburg, [email protected]
2 Fachbereich Mathematik u. Informatik, Philipps-Universität Marburg, [email protected]

Abstract. Hidden Markov Models (HMMs) are applied in many fields as a tool for statistical inference, e.g., speech recognition, machine translation, and gene prediction, where it is common to assume that the state-dependent distributions belong to some parametric family. In contrast, our objective is identification and estimation in semiparametric two-component HMMs, i.e., parametric assumptions only relate to the parameters of the hidden Markov chain. In order to achieve full identifiability we mainly impose tail assumptions on the state-dependent distributions on the one hand and on their Fourier transforms on the other. These assumptions are fulfilled by various location-scale families such as mixtures of normal distributions. Exploiting a regression model representation of our HMM, the identification is further based on the additional information that is provided by the dependence structure within the observations, so we may achieve identifiability of the HMM even if the marginal mixture model is not identifiable. For the estimation we then consider the misclassified binary regressor model which was investigated by Henry et al. (2010). We generalize their estimator inasmuch as we first allow for continuous regressors and second for a φ-mixing dependence structure, which eventually provides a semiparametric estimator for both the non-parametric state-dependent distribution functions and the transition probabilities of the Markov chain. Finally, this estimator is proven to be asymptotically normal.

References HENRY, M., KITAMURA, Y., and SALANIÉ, B. (2010): Identifying finite mixtures in econometric models. Preprint. HOHMANN, D. and HOLZMANN, H. (2011): Semiparametric Identification and Estimation in Hidden Markov Models. Preprint.

Keywords HIDDEN MARKOV MODEL, SEMIPARAMETRIC, IDENTIFIABILITY, ASYMPTOTIC NORMALITY


A Quantification Method for a Data Matrix with Many Missing Values Tadashi Imaizumi School of Management & Information Sciences, Tama University, 4-1-1 Hijirigaoka, Tama-shi, Tokyo, Japan [email protected] Abstract. When we want to apply PCA, CA, or MDS methods to a data set, we presume that almost all cells of the data matrix are given. However, we often must analyze data matrices in which many cells are unobserved or missing, for example in text mining or with data collected by picking m of n items. For instance, when we want to analyze the similarity among N statements in text mining, we rearrange these statements into a two-mode two-way data matrix whose rows correspond to the statements and whose columns correspond to the words occurring in these statements. This rearranged data matrix will be huge, with many unobserved entries. In another case, when we want to analyze a one-mode two-way word co-occurrence matrix whose rows and columns correspond to words, many cells will also be unobserved or missing. Classical non-metric scaling methods cannot be applied to such data matrices successfully, and the results are so-called "degenerate solutions". Hence, a new algorithm for scaling this type of data matrix must be developed. Such data matrices will be treated as matrices with partially ordered data in some sense, and with too many cells being unobserved or missing. We propose a quantification method for analyzing this type of data matrix, in which the unobserved or missing values are not incorporated into the scaling. Some applications to real data sets and a comparison with a classical non-metric scaling method will be shown.

References HAYASHI, C. (1950): On the quantification of qualitative data from the mathematico-statistical point of view. Annals of the Institute of Statistical Mathematics, 35-47. RAVEH, A. and LANDAU, S.F. (1993): Partial order scalogram analysis with base coordinates (POSAC): Its application to crime patterns in all the states in the United States. Journal of Quantitative Criminology, 9(1), 83-99.

Keywords Unobserved Data, Missing Values, Quantification Method, Partially Ordered Data


Probabilistic Neural Networks for the Decision Support of Investment Processes Jan Indorf and Thorsten Poddig University of Bremen, Chair of Finance {indorf,poddig}@uni-bremen.de Abstract. In this contribution we examine the application of classifiers in financial forecasting and asset management. The classifier under consideration is the Probabilistic Neural Network (PNN), which is a non-parametric, non-linear classifier. With this classifier we forecast the conditional probability density of classes, given a set of features. Several studies have shown that PNNs provide a very encouraging performance compared to level forecasts. Our empirical study is divided into two parts. First, we conduct a study based on artificial data. Here we examine the performance of the PNN under laboratory conditions. In particular, the performance of the feature selection procedure is tested, since the forecasting performance depends on the correct selection of the relevant features. To solve this task, we introduce a selection algorithm based on the multifold cross-validation technique. Our multifold cross-validation algorithm is extended by a Monte Carlo simulation to make the algorithm robust against noisy structures, which might be observed when dealing with capital market data. The study with artificial data shows this robustness. The second part is based on real capital market data. Here we divide the return series of several assets into c = 2 as well as c = 5 classes. In a first step we forecast the conditional probabilities of the classes; in a second step we construct a portfolio based on these conditional probability forecasts. To evaluate the performance of the PNN portfolio, we compare the PNN performance with the performance of other classifiers (linear discriminant analysis, multi-layer perceptrons) and of classical regression approaches (linear regression, kernel regression).

References Hildebrandt, J. (2009): Nichtparametrische integrierte Rendite- und Risikoprognosen im Asset Management mit Hilfe von Prädiktorselektionsverfahren. Cuvillier Verlag, Göttingen. Poddig, Th. (1990): Handbuch Kursprognose - Quantitative Methoden im Asset Management. Uhlenbruch Verlag, Bad Soden/Ts. Specht, D.F. (1990): Probabilistic Neural Networks. Neural Networks, 3, 109–118.

Keywords Classification, Feature Selection, Financial Forecasting
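A minimal sketch of Specht's Probabilistic Neural Network as described above: class-conditional Parzen density estimates with Gaussian kernels, weighted by class priors and normalised to conditional class probabilities. The smoothing parameter sigma and the toy data are illustrative assumptions, and the feature-selection and portfolio steps of the study are not shown.

```python
import numpy as np

def pnn_predict_proba(X_train, y_train, X_new, sigma=1.0):
    """Class-conditional Parzen densities with Gaussian kernels, weighted by the class
    priors and normalised to conditional class probabilities (cf. Specht, 1990)."""
    classes = np.unique(y_train)
    scores = np.zeros((len(X_new), len(classes)))
    for c_idx, c in enumerate(classes):
        Xc = X_train[y_train == c]
        # squared distances of each new point to every training point of class c
        d2 = ((X_new[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=2)
        density = np.exp(-d2 / (2.0 * sigma ** 2)).mean(axis=1)
        scores[:, c_idx] = (len(Xc) / len(X_train)) * density
    return scores / scores.sum(axis=1, keepdims=True)

# toy usage: two Gaussian classes in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
queries = np.array([[0.0, 0.0], [3.0, 3.0], [1.5, 1.5]])
print(pnn_predict_proba(X, y, queries).round(2))
```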


Multiple correspondence analysis with missing values: a comparative study Josse Julie1, Chavent Marie2, Liquet Benoît3, and Husson François1
1 Agrocampus, 65 rue de st-Brieuc, 35042 Rennes, France, [email protected]
2 Université V. Ségalen, Bordeaux 2, 146 rue L. Saignat, 33076 Bordeaux, France
3 Equipe Biostatistique de l'U897, INSERM, ISPED

Abstract. Different methods are available to perform a multiple correspondence analysis (MCA) with missing data. The most popular method is missing single; it consists in creating an extra category for the missing values and performing the MCA on the new data set. The missing passive method used in Gifi's Homogeneity analysis framework is also often used. This method is based on the following assumption: if an individual has not answered one variable, one considers that the individual has not chosen any category for the variable. Consequently, in the indicator matrix, the row corresponding to this individual is filled with 0s for this variable. Missing-data passive modified margin is an adaptation of the previous method with fixed margins. In this presentation, we propose a new approach to handle missing values in MCA. This method, named iterative MCA, performs an iterative imputation of the missing values during the estimation of the axes and components and can be seen as an EM-type algorithm. This new algorithm is described and its properties are studied. We point out the overfitting problem and propose a regularized version of the algorithm to overcome this major issue. Compared to the performance of the existing methods, the results of the regularized iterative MCA algorithm are very promising.

References JOSSE, J., CHAVENT, M., LIQUET, B. and HUSSON, F. (2011): Handling missing values with regularized iterative multiple correspondence analysis. Submitted. MEULMAN, J. (1982): Homogeneity Analysis of Incomplete Data. Leiden. VAN DER HEIJDEN, P.G.M. and ESCOFIER, B. (2003): Multiple correspondence analysis with missing data. In: Analyse des correspondances. PUR.

Keywords MULTIPLE CORRESPONDENCE ANALYSIS, MISSING VALUES, IMPUTATION, REGULARIZATION


Will the pedestrian cross? Probabilistic Path Prediction based on Learned Motion Features Christoph Keller1 and Christoph Hermes Univ. Heidelberg Abstract. Future vehicle systems for active pedestrian safety will not only require a high recognition performance, but also an accurate analysis of the developing traffic situation. In this paper, we present a system for pedestrian action classification (walking vs. stopping) and path prediction at short, sub-second time intervals. Apart from the use of positional cues, obtained by a pedestrian detector, we extract motion features from dense optical flow. These augmented features are used in a probabilistic trajectory matching and filtering framework. The vehicle-based system was tested in various traffic scenes. We compare its performance to that of a state-of-the-art IMM Kalman filter (IMM-KF), and for the action classification task, to that of human observers, as well. Results show that human performance is best, followed by that of the proposed system, which outperforms the IMM-KF and the simpler system variants.


The R Package CDM for Cognitive Diagnosis Modeling Thomas Kiefer1, Ann Cathrice George2, Ali Ünlü1, and Alexander Robitzsch3
1 Technische Universität Dortmund, Fakultät Statistik, D-44221 Dortmund, {kiefer, uenlue}@statistik.tu-dortmund.de
2 Technische Universität Dortmund, Research School Education and Capabilities, D-44221 Dortmund, [email protected]
3 Bundesinstitut für Bildungsforschung, Innovation & Entwicklung des österreichischen Schulwesens, A-5020 Salzburg, [email protected]

Abstract. The R (R Development Core Team, 2010) package CDM for cognitive diagnosis modeling (CDM; Rupp, Templin, & Henson, 2010) in psychometrics is introduced. CDM is restricted latent class modeling for inferring from respondents’ item answers fine-grained and individualized diagnostic information about multiple latent attributes or competencies. CDM represents model-based classification approaches, where the latent classes correspond to the possible attribute profiles, and the conditional item parameters model atypical response behavior in the sense of slipping and guessing errors. The confirmatory character of CDM is apparent in the Q-matrix, which allows incorporating qualitative prior knowledge and can be seen as an operationalization of the latent concepts of an underlying theory. The package CDM provides functions and example data for CDM with the deterministic-input, noisy-and-gate model and the deterministic-input, noisy-or-gate model in R. The main function of the package provides statistical inference procedures for CDM models. Plot, print, and summary methods are implemented for graphing and outlining CDM analyses. The features of the package CDM are illustrated with accompanying data.

References R DEVELOPMENT CORE TEAM (2010): R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.r-project.org/. RUPP, A.A., TEMPLIN, J., and HENSON, R.A. (2010): Diagnostic Measurement: Theory, Methods, and Applications. The Guilford Press, New York.

Keywords R, COGNITIVE DIAGNOSIS MODELING, PSYCHOMETRICS


Chance corrected correlation measures for qualitative variables Henk A.L. Kiers Heymans Institute, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands [email protected] Abstract. There are various ways to calculate the association between two qualitative variables. An often-used measure is the Rand index. An older coefficient is in fact an adaptation of the chi-square measure, normalized to the range of [0,1]. This is the so-called Tschuprov's T2. Interestingly, this coefficient can be seen as a special case of the well-known RV-coefficient, applied to so-called quantification matrices set up for each of the qualitative variables. Recently, Smilde et al. (2009) proposed a modified version of the RV coefficient, ensuring that its expected value in cases of purely random data is 0. Thus, in fact they offered a chance-corrected version of the RV coefficient. In the present paper, it will be shown how T2 can be adjusted analogously, as well as similar RV-based coefficients, thus offering a new class of chance-corrected coefficients for the association between qualitative variables. The resulting coefficients will be compared to the modified Rand index, i.e., the chance-corrected version of the Rand index.

References SMILDE, A.K., KIERS, H.A.L., BIJLSMA, S., RUBINGH, C.M., and VAN ERK, M.J. (2009): Matrix correlations for high-dimensional data: the modified RVcoefficient. Bioinformatics, 25, 401-405.

Keywords ASSOCIATION, QUALITATIVE VARIABLES
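
As a small illustration of the chi-square based coefficient mentioned above, the sketch below computes Tschuprov's T2 for a toy contingency table; the chance-corrected variants proposed in the paper are not reproduced here, and the table values are invented.

```python
# Tschuprow's T^2 from a two-way contingency table:
# T^2 = (chi^2 / n) / sqrt((r-1)(c-1)). Table values are made up.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[20, 5, 10],
                  [8, 15, 12]])
chi2, _, _, _ = chi2_contingency(table, correction=False)
n = table.sum()
r, c = table.shape
T2 = (chi2 / n) / np.sqrt((r - 1) * (c - 1))
print(T2)
```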


Predictive validity of tracking decisions: The development of a new validation criterion
Florian Klapproth1, Thomas Hörstermann2, and Sabine Krolak-Schwerdt3

1 University of Luxembourg [email protected]
2 University of Luxembourg [email protected]
3 University of Luxembourg [email protected]

Abstract. Tracking in education means grouping students by academic ability (classes, schools). In educational systems with rigid tracking (like Germany or Luxembourg), students' achievements (e.g., in primary school) determine the type of future schooling to attend, and therefore the type of education to receive. Tracking decisions are usually validated by relating school success at t1 to the tracking decision at t0. Estimates of school success are primarily based either on achievement of certification or on keeping the initial track. However, recent studies suggest that school success does not reflect students' competencies satisfactorily (Scharenberg, Gröhlich, Guill, & Bos, 2010). To prevent misclassifications, we propose a validation criterion for the tracking decision based on standardized achievement test scores rather than mere school success. Based on the assumption that test scores validly assess students' academic competencies and that distributions of test scores, separated by track, partly overlap, we propose that intersections of the distributions mark the boundaries of categories defining different competency levels. Hence, a high correlation (e.g., Cohen's κ) between the categories and the tracking decision indicates a precise prediction of the competency necessary for school success.

References ¨ SCHARENBERG, K., GROHLICH, C., GUILL, K. and BOS, W. (2010): Schulformwechsel und prognostische Validit¨ at der Schullaufbahnempfehlung in der Jahrgangsstufe 4. In: W. Bos and C. Gr¨ ohlich (Eds.): KESS 8. Kompetenzen und Einstellungen von Sch¨ ulerinnen und Sch¨ ulern - Jahrgangsstufe 8. Hamburg, Beh¨ orde f¨ ur Schule und Berufsbildung, 115–123.

Keywords TRACKING DECISION, PREDICTIVE VALIDITY, SCHOOL SUCCESS, VALIDATION CRITERION, MISCLASSIFICATIONS, STANDARDIZED ACHIEVEMENT TEST SCORES


Protein Classification using Amphipathy Maps Anne Sophie Knöller, Hyung-Won Koh, and Eyke Hüllermeier Department of Mathematics and Computer Science, University of Marburg {asknoeller,koh,eyke}@mathematik.uni-marburg.de Abstract. Current state-of-the-art methods for protein classification commonly rely on abstract features of an amino acid (AA) sequence, e.g., AA index, AA composition, and sequence homology, for which an intuitive biological interpretation is often difficult if not impossible. In addition, amphipathic structures and surface characteristics are mainly neglected, although it is well-known that amphipathy of secondary structural elements, such as α-helices and β-sheets/strands, plays a key role in protein stability as well as in determining the functional properties of the respective element. Of course, index-based lag-k autocorrelation functions inherently capture amphipathy-like properties of a sequence. However, due to the choice of a constant, usually integer parameter k, only a small fraction of possible surface structures is considered. To achieve greater flexibility, we propose to depict protein sequences in all possible pairwise angular conformations. That is, within a given window size, angles between neighboring amino acids ranging from 0° to 180° are considered. Based on the idea of the hydrophobic moment, we generated protein profiles including hydrophobicity, charge and bulkiness. Based on this concept, we introduce the amphipathy map as an alternative feature space and propose a feature extraction method to capture the amphipathy of putative sub-regions within amino acid sequences. Using support vector machines, we compare our derived features with a standard k-mer string kernel approach as a baseline method for compositional feature representation of sequences. Finally, we report first results on publicly available protein datasets.
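
To make the amphipathy idea concrete, the following hedged sketch scans a single window of hydrophobicity values over angles from 0° to 180° using the classical hydrophobic-moment formula; the hydrophobicity scale and the window are illustrative assumptions, not the authors' feature extraction.

```python
# Hydrophobic moment of a residue window at angle delta (degrees):
# mu_H(delta) = | sum_n h_n * exp(i * n * delta) |. Values are illustrative.
import numpy as np

def hydrophobic_moment(h, delta_deg):
    n = np.arange(len(h))
    delta = np.deg2rad(delta_deg)
    return np.abs(np.sum(h * np.exp(1j * n * delta)))

window = np.array([1.8, -3.5, 2.5, -0.4, 4.5, -3.2, 1.9, -0.8])  # e.g. Kyte-Doolittle-like values
angles = np.arange(0, 181, 5)
profile = [hydrophobic_moment(window, a) for a in angles]
print(angles[int(np.argmax(profile))])  # angle at which the window looks most amphipathic
```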


Algorithms for incorporating spatial information into clustering of high-spectral data
Jan Hendrik Kobarg1 and Theodore Alexandrov1,2

1 Center for Industrial Mathematics, University of Bremen, 28359 Bremen, Germany [email protected]
2 Steinbeis Innovation Center for Scientific Computing in Life Sciences, 28211 Bremen, Germany [email protected]

Abstract. Segmentation of hyper-spectral imaging data using clustering requires special algorithms which consider spatial relations between the pixels. This strategy can improve clustering of noisy data, since neighboring pixels should usually be clustered into one group. However, in the case of the spectral dimension p being large, clustering algorithms already suffer from the curse of dimensionality and have high memory needs as well as long runtimes. We propose to incorporate neighboring pixels from a window of w × w pixels to define a feature space of size npw², and then to apply a clustering method to the projected points. The effect of improvement is controlled by weights depending on the spatial distance between the pixels to be clustered. We propose a data-adaptive way to define weights based on the similarity of pixels. Any vectorial clustering algorithm like standard k-means can directly be applied to the projected points. In addition, we propose an efficient dimensionality reduction strategy which finds a Euclidean space of dimension nq corresponding to the feature space. The proposed algorithm is well suited for hyper-spectral imaging data as found in imaging mass spectrometry, where the number of pixels is relatively high (on the order of 10⁴).

Keywords SPATIALLY AWARE CLUSTERING, DATA ADAPTION, DIMENSIONALITY REDUCTION, HYPER-SPECTRAL IMAGING
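
A rough sketch of the window-based feature augmentation described above, assuming simple Gaussian spatial weights rather than the paper's data-adaptive weights; the toy data cube and the use of scikit-learn's k-means are illustrative choices.

```python
# Stack each pixel's spectrum with its w x w neighbourhood, weighted by
# spatial distance, then cluster the augmented feature vectors with k-means.
import numpy as np
from sklearn.cluster import KMeans

ny, nx, p, w = 40, 40, 10, 3                      # toy grid, spectral dim, window size
cube = np.random.rand(ny, nx, p)
r = w // 2
offsets = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
weights = [np.exp(-0.5 * (dy ** 2 + dx ** 2)) for dy, dx in offsets]

padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="edge")
features = np.concatenate(
    [wgt * padded[r + dy:r + dy + ny, r + dx:r + dx + nx, :]
     for (dy, dx), wgt in zip(offsets, weights)], axis=2)   # shape (ny, nx, p * w**2)

labels = KMeans(n_clusters=4, n_init=10).fit_predict(features.reshape(ny * nx, -1))
segmentation = labels.reshape(ny, nx)
```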


Dense 3D Reconstruction of Symmetric Scenes from a Single Image Kevin Koeser1 and Christopher Zach ETH Zurich Abstract. A system is presented that takes a single image as an input (e.g. showing the interior of St.Peter’s Basilica) and automatically detects an arbitrarily oriented symmetry plane in 3D space. Given this symmetry plane a second camera is hallucinated that serves as a virtual second image for dense 3D reconstruction, where the point of view for reconstruction can be chosen on the symmetry plane. This naturally creates a symmetry in the matching costs for dense stereo. Alternatively, we also show how to enforce the 3D symmetry in dense depth estimation for the original image. The two representations are qualitatively compared on several real world images, that also validate our fully automatic approach for dense single image reconstruction.


People Tracking Algorithm for Human Height Mounted Cameras Vladimir Kononov1 , Vadim Konushin, and Anton Konushin Moscow State University Abstract. We present a new people tracking method for human-height mounted cameras, e.g. one attached near an information or advertising stand. We use a state-of-the-art particle filter approach and improve it by explicitly modeling object visibility, which makes the method able to cope with difficult object overlapping. We employ our own method based on online-boosting classifiers to resolve occlusions and show that it is well suited for tracking multiple objects. In addition to training an online classifier which is updated each frame, we propose to store object appearance and update it with a certain lag. This helps to correctly handle situations when a person enters the scene while another one leaves it at the same time. We demonstrate the performance of our algorithm and the advantages of our contributions on our own video dataset.


Of Chaos and Quality - Results of the Collaborative Tagging Project Christine Kraetzsch1 Abstract. In the academic domain, extensive collections of user-generated metadata have emerged in social software applications such as Connotea, CiteULike and BibSonomy. Compared to controlled vocabularies, such as the Schlagwortnormdatei, these constitute personalized and, in large parts, 'chaotic' subject indexing. In a DFG project at the Universitätsbibliothek Mannheim it was investigated to what extent the potential of this kind of metadata can be used for a better and more user-oriented presentation of information resources. A core part of the study was the analysis of tag data from the BibSonomy system. It turned out that not only the lack of semantic structure of the tags but also their heterogeneous form constitutes a limiting factor for their use in library subject indexing. Using examples, this contribution gives insight into the qualitative and structural chaos of the examined tags and summarizes the results of the project.


An Estimation Theoretical Approach to Ambrosio-Tortorelli Image Segmentation Kai Krajsek1 , Ines Dedovic, and Hanno Scharr Forschungszentrum Jülich Abstract. This paper presents a novel approach for Ambrosio-Tortorelli (AT) image segmentation, or, more exactly, joint image regularization and edge-map reconstruction. We interpret the AT functional, an approximation of the Mumford-Shah (MS) functional, as the energy of a posterior probability density function (PDF) of the image and a smooth edge indicator. Previous approaches consider AT or MS segmentation as a deterministic optimization problem by minimizing the energy functional, resulting in a single point estimate, i.e. the maximum-a-posteriori (MAP) estimate. We adopt a wider estimation theoretical viewpoint, meaning we consider images to be random variables and investigate their distribution. We derive an effective block Gibbs sampler for this posterior PDF based on the theory of Gaussian Markov random fields (GMRF). The merit of our approach is threefold: First, sampling from the posterior PDF allows us to apply different types of estimators and not only the MAP estimator. Second, sampling allows us to estimate higher-order statistical moments like the variance as a confidence measure. Third, unlike other AT image reconstruction approaches, our approach is not prone to getting trapped in local minima and is asymptotically statistically optimal. Several experiments demonstrate the advantages of our block Gibbs sampling approach.


Optimization of Quadrature Filters based on the Numerical Integration of Improper Integrals Andreas Krebs1 , Johan Wiklund, and Michael Felsberg BTU Cottbus Abstract. Convolution kernels are a commonly used tool in computer vision. These kernels are often specified by an ideal frequency response, and the actual filter coefficients are obtained by minimizing some weighted distance with respect to the ideal filter. State-of-the-art approaches usually replace the continuous frequency response by a discrete Fourier spectrum with a multitude of samples compared to the kernel size, depending on the smoothness of the ideal filter and the weight function. The number of samples in the Fourier domain grows exponentially with the dimensionality and becomes a bottleneck concerning memory requirements. In this paper we propose a method that avoids the discretization of the frequency space and makes filter optimization feasible in higher dimensions than the standard approach. The result no longer depends on the choice of the sampling grid and remains exact even if the weighting function is singular at the origin. The resulting improper integrals are efficiently computed using Gauss-Jacobi quadrature.
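
The quadrature step can be illustrated with SciPy's Gauss-Jacobi rule, which integrates against a weight of the form (1-x)^a (1+x)^b that may be singular at the interval ends; the exponents and the integrand below are illustrative and do not correspond to a particular filter design from the paper.

```python
# Approximate int_{-1}^{1} (1-x)^a (1+x)^b f(x) dx with Gauss-Jacobi nodes,
# even when the weight is singular at an endpoint (here a = -1/2).
import numpy as np
from scipy.special import roots_jacobi

a, b = -0.5, 0.0
nodes, weights = roots_jacobi(20, a, b)
f = np.cos                              # smooth stand-in for an ideal response
approx = np.sum(weights * f(nodes))
print(approx)
```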


Blockmodeling of Co-authorship Networks
Luka Kronegger1, Anuška Ferligoj2, and Patrick Doreian3

1 University of Ljubljana, [email protected]
2 University of Ljubljana, [email protected]
3 University of Pittsburgh and University of Ljubljana, [email protected]

Abstract. The complete longitudinal co-authorship networks for four research disciplines (biotechnology, mathematics, physics and sociology) for 1986-2005 are compared. Complete bibliographies of all researchers registered at the national Slovene Research Agency are studied. Using blockmodeling, it is shown how co-authorship structures change in all disciplines. The most frequent form is a core-periphery structure with multiple simple cores, a periphery and a semi-periphery. The next most frequent form has this structure but with bridging cores. Bridging cores consolidate the center of a discipline by giving it greater coherence. These consolidated structures appear at different times in different disciplines, appearing earliest in physics and latest in biotechnology. In 2005, biotechnology has the most consolidated center followed by physics and sociology. All co-authorship networks expand over time. By far, new recruits go into either the semi-periphery or the periphery in all fields. Two 'lab' fields, biotechnology and physics, have larger semi-peripheries than peripheries. The reverse holds for mathematics and sociology, two 'office' disciplines. The comparison of the four disciplines indicates important differences in the formation of collaborating cores within the networks through time. The network patterns of physicists, for example, are very stable, while structures in the co-authorship networks of mathematicians form and then dissolve through time. The tendencies are somewhere in between among biotechnologists and again completely different in the network of sociologists. The joint graphical representation of blockmodels and intervening variables offers a great opportunity to understand the dynamics of collaborative structures in these disciplines.

Keywords CO-AUTHORSHIP NETWORKS, BLOCKMODELING, SCIENTIFIC COLLABORATION, NETWORK DYNAMICS


The comparison of some feature selection methods in regression Mariusz Kubus Department of Mathematics and Applied Computer Science, Opole University of Technology, 5 Mikolajczyka Street, 45–271 Opole [email protected] Abstract. As data accumulate rapidly with the advance of computer technology, feature selection is one of the key tasks in the analysis of large datasets. Removing irrelevant variables not only improves the interpretative properties of a model but can also improve its predictive accuracy. The methods of feature selection are presently classified into three groups: filters, wrappers and embedded methods (e.g., Guyon et al. [2006]). Filters perform as a pre-processing step, evaluating and excluding some variables before learning a model. Wrappers use the model selection idea to evaluate feature subsets. Embedded methods use feature selection as an integral part of the learning algorithm. The third approach is represented by regularized linear regression (e.g., LASSO) or tree-based models. In practice, LASSO tends to overfit and introduces irrelevant variables. On the other hand, regression trees usually do not introduce irrelevant variables, but they may also omit some relevant ones. In the situation when the number of features is much larger than the number of observations (which is typical in genomic studies of microarray data) feature selection is especially difficult. In this case modifications of classical methods have been proposed, e.g., the relaxed LASSO (Meinshausen [2005]). The goal of the article is to compare some methods of feature selection representing the three approaches mentioned above. In the simulation study, some of the irrelevant variables will be collinear. The special case when the number of variables greatly exceeds the sample size will also be taken into consideration.

References EFRON, B., HASTIE, T., JOHNSTONE, I. and TIBSHIRANI, R. (2004): Least Angle Regression. Annals of Statistics, 32 (2), 407-499. GUYON, I., GUNN, S., NIKRAVESH, M. and ZADEH, L. (2006): Feature Extraction: Foundations and Applications. Springer, New York. MEINSHAUSEN, N. (2005): Lasso with relaxation. Research Report 129, ETH Zürich.

Keywords Feature selection, filters, wrappers, embedded methods
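
For illustration only, the sketch below runs one representative of each family named in the abstract (a univariate filter, a wrapper and the LASSO as an embedded method) on synthetic regression data; the data-generating settings are arbitrary and the paper's actual simulation design is not reproduced.

```python
# Filter vs. wrapper vs. embedded feature selection on toy regression data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression, LassoCV

X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

filter_idx = np.argsort(SelectKBest(f_regression, k=5).fit(X, y).scores_)[-5:]
wrapper_idx = np.where(RFE(LinearRegression(), n_features_to_select=5).fit(X, y).support_)[0]
embedded_idx = np.where(LassoCV(cv=5).fit(X, y).coef_ != 0)[0]

print("filter:  ", sorted(filter_idx))
print("wrapper: ", sorted(wrapper_idx))
print("embedded:", sorted(embedded_idx))
```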


Vulnerability of Copula-VaR to misspecification of margins and dependence structure Katarzyna Kuziak Department of Financial Investments and Risk Management Wroclaw University of Economics, Poland [email protected] Abstract. Copula functions as tools for modeling multivariate distributions are well known in the theory of statistics and over the last decade have been gaining more and more popularity also in the field of finance. A copula-based model of a multivariate distribution includes both the dependence structure and the marginal distributions in such a way that the former may be analyzed separately from the latter. Its main advantage is its flexibility, which allows one to merge margins of one type with a copula function of another type, or even to bind margins of various types by a common copula into a single multivariate distribution. In this article copula functions are used to estimate Value at Risk (VaR). The first aim of this article is to study properties of risk factor or portfolio component return marginal distributions, as well as the dependence structure, under different hypotheses about the data generating process. The second aim is to investigate how misspecification of the marginal distributions may affect the estimation of the dependence in the copula and what the effects of these biases are for Value at Risk. The analysis is based on simulation studies.

References CHERUBINI U, LUCIANO E, VECCHIATO W (2004) Copula Methods in Finance. Wiley, New York. EMBRECHTS P, HOING A, JURI A (2002) Using copulae to bound the ValueatRisk for functions of dependent risks. Report, ETH Zurich GREGORIOU GN, HOPPE CH, WEHN CS (2010) The Risk Modeling Evaluation Handbook: Rethinking Financial Risk Management Methodologies in the Global Capital Markets. McGraw-Hill, New York. JORION P (1997) Value At Risk: The New Benchmark for Controlling Market Risk. McGraw-Hill, New York

Keywords RISK MEASUREMENT, VALUE AT RISK, COPULAE
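
A minimal sketch of a copula-based VaR computation in the spirit of the abstract, assuming a Gaussian copula with Student-t margins and an equally weighted two-asset portfolio; all parameter values are illustrative.

```python
# Simulate dependent uniforms via a Gaussian copula, map them to t margins,
# and read off VaR as a loss quantile of the simulated portfolio returns.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
corr = np.array([[1.0, 0.6], [0.6, 1.0]])        # copula correlation (assumed)
z = rng.multivariate_normal(np.zeros(2), corr, size=100_000)
u = stats.norm.cdf(z)                            # uniforms carrying the dependence
returns = stats.t.ppf(u, df=4) * 0.01            # t margins, scaled to ~1% daily vol
portfolio = returns @ np.array([0.5, 0.5])
var_99 = -np.quantile(portfolio, 0.01)           # 99% Value at Risk
print(var_99)
```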


Dynamic Principal Component Analysis: a banking Customer Satisfaction evaluation
Caterina Liberati1 and Paolo Mariani2

1 University of Milano-Bicocca, Economics Department, [email protected]
2 University of Milano-Bicocca, Statistics Department, [email protected]

Abstract. An empirical study, based on a sample of 27,000 retail customers, has been carried out: the management of a national bank with a widespread network across the Italian regions wanted to analyze the loss in competitiveness of its retail services, probably due to a loss in customer satisfaction. The survey aims to analyze the weaknesses of the retail services, identify possible recovery actions and measure their effectiveness across different waves (3 time lags). Such issues lead our study towards the definition of a new dissimilarity measure which exploits a dimension reduction obtained with Dynamic Principal Component Analysis (DPCA). Before doing that, we focused our attention on some limitations of our approach related to the geometrical properties of the DPCA applied. As is well known, the coordinates of a point in space tell us where the point is located with respect to a particular set of coordinate axes. Projecting the units on the factorial axes of the 'compromise' phase, the movements with respect to the different waves can be analyzed. We know that such a data transformation might generate new patterns, modifying the configuration of customer satisfaction. We are already studying an integrated approach to address this aspect, which will be part of further work on this topic.

References BERRY L. L., PARASURAMAN, A. AND ZEITHAML, V., A. (1988). SERVQUAL: a multiple-item scale for measuring consumer perceptions of service quality. Journal of Retailing, vol. 64, no. 1, pp. 12-37. COPPI, R. and D’URSO, P. (2002): Fuzzy time arrays and dissimilarity measures for fuzzy time trajectories. in Kiers, H.A.L., Rasson, J.P., Groenen, P.J.F., Schader, M. (Eds.): Data Analysis, Classification, and Related Methods. Springer, Heidelberg, pp. 273-278.

Keywords customer satisfaction, dynamic factor analysis, trajectories analysis, dynamic patterns


Simultaneous Reconstruction and Tracking of non-planar Templates Sebastian Lieberknecht1 and Slobodan Ilic Metaio GmbH Abstract. In this paper, we address the problem of simultaneous tracking and reconstruction of non-planar templates in real-time. Classical approaches to template tracking assume planarity and do not attempt to recover the shape of an object. Structure from motion approaches use feature points to recover camera pose and reconstruct the scene from those features, but do not produce dense 3D surface models. Finally, deformable surface tracking approaches assume static camera and impose strong deformation priors to recover dense 3D shapes. The proposed method simultaneously recovers the camera motion and deforms the template such that an approximation of the underlying 3D structure is recovered. Spatial smoothing is not explicitly imposed, thus templates of smooth and non-smooth objects can be equally handled. The problem is formalized as an energy minimization based on image intensity differences. Quantitative and qualitative evaluation on both real and synthetic data is presented, we compare the proposed approach to related methods and demonstrate that the recovered camera pose is close to the ground truth even in presence of strong blur and low texture.


Multi-Person Localization and Track Assignment in Overlapping Camera Views Martijn Liem1 and Dariu Gavrila Univ. Amsterdam Abstract. The assignment of multiple person tracks to a set of candidate person locations in overlapping camera views is potentially computationally intractable, as observables might depend upon visibility order, and thus upon the decision which of the candidate locations represent actual persons and which do not. In this paper, we present an approximate assignment method which consists of two stages: In a hypothesis generation stage, the similarity between track and measurement is based on a subset of observables (appearance, motion) that exist independent of the aforementioned person labeling. This allows the computation of the K-best assignment in low polynomial time by standard graph matching methods. In a subsequent hypothesis verification stage, the known person positions associated with the K-best solutions are used to define the full set of observables, which are used to compute the maximum likelihood assignment. We demonstrate that our method outperforms the state-of-the-art on a complex outdoor dataset.
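
The hypothesis generation stage relies on assignment by graph matching; as a hedged illustration, the sketch below solves a single track-to-measurement assignment with the Hungarian method on an invented cost matrix. Ranking the K best assignments, as used in the paper, would require an additional Murty-type enumeration that is not shown.

```python
# Minimum-cost assignment of tracks (rows) to candidate locations (columns).
import numpy as np
from scipy.optimize import linear_sum_assignment

# entries = negative log-likelihood of track-measurement similarity (toy values)
cost = np.array([[0.2, 1.5, 2.0],
                 [1.8, 0.3, 1.1],
                 [2.2, 1.0, 0.4]])
rows, cols = linear_sum_assignment(cost)
print(list(zip(rows, cols)), cost[rows, cols].sum())
```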


The Classification Landscape in Austria Rudolf Lindpointner1 Abstract. Until recently, classifications were hardly a topic in the Austrian library landscape; where they were used at all, it was for open-shelf arrangement, and even there the so-called in-house classification schemes (Haussystematiken) were, and continue to be, by far in the majority. Only in recent years, in connection with the topic of search engine technology, has the topic of classification moved somewhat more into the foreground again. In Austria, mainly the Regensburger Verbundklassifikation (RVK) and the Basisklassifikation (BK) have been applied so far, and interest in the DDC is also growing. In this context, the question of shelf arrangement also arises anew for many libraries, in the sense that, in the course of building projects, but partly also for fundamental reasons, even larger libraries are considering moving away from their existing in-house classification schemes.


Revisiting Projection Pursuit and Principal Component Analysis B. Lindsay Penn State University, United States of America Abstract. Projection pursuit (PP) dates to 1974 (Friedman and Tukey). Principal component analysis (PCA) dates to 1901 (Pearson). We introduce a new method that has resemblances to both. It could equally well be called conditional projection pursuit (CPP) or most informative component analysis (MICA) in honor of its two ancestors. Like principal component analysis, it is based on a matrix eigenanalysis, with eigenvectors used as linear combinations. Like projection pursuit, it is focused on nonlinear and non-normal features of the data. The method will be illustrated with several examples. Technical issues will be deemphasized.


Association of complex human pain phenotypes with complex pain genotypes using a self-organizing maps approach
J. Lötsch1 and A. Ultsch2

1 Institute of Clinical Pharmacology, Johann Wolfgang Goethe University, Germany
2 Data Bionics Research Group, University of Marburg, Germany

Abstract. BACKGROUND: Pain is a complex trait. While clinical pain syndromes can already be diagnosed by a set of neurological parameters, the complexity of experimental pain is only incompletely accounted for, which often impedes associations of pain data with clinical or genetic parameters. METHODS: Pain phenotype markers (n = 8) and genotype markers (n = 30) were available from previous assessments in 125 healthy volunteers. A U-Matrix on an emergent self-organizing map (ESOM) was used for visualization of the distance structures in the data. Subsequently, the prediction of the clusters by the genetic markers was assessed using a classification and regression tree (CART) approach. RESULTS: On the U-Matrix of the pain phenotypes, eight clusters were identified. This clustering showed advantages over a Ward clustering on the same data. Rules could be derived to describe the cluster contents, which corresponded to three basic types of pain thresholds: low, mean and high sensitivity. In the mean and low sensitivity stoical phenotypes, subgroups could be identified. For a cluster consisting of persons with a high overall pain threshold but selectively low resistance to heat, the predictive accuracy of the classifiers was 84.56%. Among the genetic variants that were used for the CART decision in that cluster were polymorphisms in a gene coding for a heat sensor. CONCLUSIONS: ESOM-based clustering of pain data provides biologically meaningful results and accommodates the complexity of pain. The clusters thus obtained seem to facilitate the otherwise only insufficiently successful genotype-phenotype association in common pain.


Antecedents and outcomes of participation in social networking sites Sandra Loureiro University of Aveiro - Department of Economy, Management and Industrial Engineering - Campus of Santiago - 3810-193 Aveiro [email protected] Abstract. Nowadays, most (80%) of the young internet users (16-24 years) in the European Union post messages to chat sites, blogs or social networking sites (SEYBERT and LÖÖF, 2010). Thus, social networking sites are growing in interest, both for researchers and managers. Therefore, this study seeks to understand the factors that influence participation in online social networks and its outcomes. The proposed model integrates variables such as identification, satisfaction, degree of influence, usefulness and ease of use into a comprehensive framework. The empirical approach was based on an online survey of 336 young adults in Portugal, undertaken during November/December 2010. The model estimation includes structural equation analysis, using the PLS approach. The research findings showed that identification, perceived usefulness, interaction preference, and extroversion are the most important factors influencing participation. Degree of influence and identification exercise an indirect effect on participation through perceived usefulness. Participation in social networking sites, in turn, is linked to higher levels of loyalty, actual use, and word-of-mouth. Satisfaction with the online social network does not have a direct and significant influence on participation; however, it exercises a direct and significant influence on the ease of using the network. The proposed model explains 84 per cent of the variance in participation in social networking sites. The results of this study have implications for researchers and practitioners.

References SEYBERT, H. and LÖÖF, A. (2010): Internet usage in 2010 - Households and Individuals: Eurostat data. Retrieved on 20 February 2011 from http://epp.eurostat.ec.europa.eu/cache/ITY_OFFPUB/KS-QA-10-050/EN/KS-QA-10-050-EN.PDF.

Keywords PARTICIPATION, IDENTIFICATION, USEFULNESS, SOCIAL NETWORKING SITES, EXTROVERSION


Temporally locally adaptive Linear Discriminant Analysis
Karsten Luebke1, Julia Schiffner2, Stefanie Hillebrand2, and Claus Weihs2

1 FOM Hochschule für Oekonomie und Management, c/o B1st software factory, Rheinlanddamm 201, 44139 Dortmund, Germany [email protected]
2 Dortmund University of Technology, Department of Statistics, 44221 Dortmund, Germany

Abstract. In many applications of classification methods the data arrive with a time stamp: for example, shopping data for direct marketing, online monitoring or the classification of business cycles. In order to capture non-stationarity in the data generating process it can be useful to include weights for the observations used for classification. The simple heuristic is: give more weight to more recent observations and give less or even no weight to old observations. For a logistic two-class classifier this was introduced by Anagnostopoulos et al. (2009). In order to classify more than two classes we adopted the ideas of Czogiel et al. (2007) for a temporally weighted version of Linear Discriminant Analysis. Within this framework different weighting schemes can be incorporated. With such an estimator it is possible to achieve an improved classification accuracy even in dynamic settings, which we demonstrate on a real-life data set where the so-called ex-post-ante error rate, i.e. the prediction error rate along the time line, is relevant. An extension to classifying data streams is also possible.

References ANAGNOSTOPOULOS, C., TASOULIS, D.K., ADAMS, N.M., HAND, D.J. (2009): Temporally adaptive estimation of logistic classifiers on data streams. Advances in Data Analysis and Classification, 3, 243–261. CZOGIEL, I., LUEBKE, K., ZENTGRAF, M., WEIHS, C. (2007): Localized linear discriminant analysis. In: R. Decker, H.-J. Lenz (Eds.): Advances in Data Analysis. Springer, Berlin, 133–204.

Keywords LOCAL METHODS, ADAPTIVE PROCEDURE, TIME RELATED DISCRIMINANT ANALYSIS
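
A minimal sketch of one way to realize a temporally weighted LDA, assuming exponentially decaying observation weights; this is only one of the weighting schemes the abstract alludes to, and the helper functions below are illustrative, not the authors' implementation.

```python
# Weighted class means, weighted pooled covariance, and the usual linear
# discriminant scores, with weights decaying exponentially with age.
import numpy as np

def weighted_lda_fit(X, y, t, decay=0.05):
    w = np.exp(-decay * (t.max() - t))                 # newer observations weigh more
    classes = np.unique(y)
    means, priors = {}, {}
    cov = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        m = y == c
        wc = w[m] / w[m].sum()
        means[c] = wc @ X[m]
        priors[c] = w[m].sum() / w.sum()
        d = X[m] - means[c]
        cov += (w[m][:, None] * d).T @ d
    cov /= w.sum()
    return classes, means, priors, np.linalg.inv(cov)

def weighted_lda_predict(X, classes, means, priors, cov_inv):
    scores = np.column_stack([X @ cov_inv @ means[c]
                              - 0.5 * means[c] @ cov_inv @ means[c]
                              + np.log(priors[c]) for c in classes])
    return classes[np.argmax(scores, axis=1)]
```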


Using Landmarks as a Deformation Prior for Hybrid Image Registration Marcel L¨ uthi1 , Christoph Jud, and Thomas Vetter University of Basel Abstract. Hybrid registration schemes are a powerful alternative to fully automatic registration algorithms. Current methods for hybrid registration either include the landmark information as a hard constraint, which is too rigid and leads to difficult optimization problems, or as a soft-constraint, which introduces a difficult to tune parameter for the landmark accuracy. In this paper we model the deformations as a Gaussian process and regard the landmarks as additional information on the admissible deformations. Using Gaussian process regression, we integrate the landmarks directly into the deformation prior. This leads to a new, probabilistic regularization term that penalizes deformations that do not agree with the modeled landmark uncertainty. It thus provides a middle ground between the two aforementioned approaches, without sharing their disadvantages. Our approach works for a large class of different deformation priors and leads to a known optimization problem in a Reproducing Kernel Hilbert Space.
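
As a hedged one-dimensional illustration of conditioning a Gaussian-process deformation prior on landmarks via GP regression, the sketch below computes the posterior mean and covariance given a few landmark displacements; kernel, noise level and landmark values are invented and do not reproduce the paper's model.

```python
# Standard GP regression: condition a squared-exponential prior on noisy
# landmark displacements and evaluate the posterior over a dense grid.
import numpy as np

def k(a, b, scale=1.0, length=0.3):
    return scale * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

x_lm = np.array([0.2, 0.5, 0.8])        # landmark positions (toy)
y_lm = np.array([0.5, -0.2, 0.1])       # observed landmark displacements (toy)
sigma2 = 0.01                           # assumed landmark uncertainty

xs = np.linspace(0, 1, 101)
K = k(x_lm, x_lm) + sigma2 * np.eye(len(x_lm))
post_mean = k(xs, x_lm) @ np.linalg.solve(K, y_lm)
post_cov = k(xs, xs) - k(xs, x_lm) @ np.linalg.solve(K, k(x_lm, xs))
print(post_mean[:5], np.diag(post_cov)[:5])
```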


Robust Classification and Semi-Supervised Object Localization with Gaussian Processes Alexander L¨ utz1 University of Jena Abstract. Traditionally, object recognition systems are trained with images that may contain a large amount of background clutter. One way to train the classifier more robustly is to limit training images to their object regions. For this purpose we present a semi-supervised approach that determines object regions in a completely automatic manner and only requires global labels of training images. We formulate the problem as a kernel hyperparameter optimization task and utilize the Gaussian process framework. To perform the computations efficiently we present techniques reducing the necessary time effort from cubically to quadratically for essential parts of the computations. The presented approach is evaluated and compared on two well-known and publicly available datasets showing the benefit of our approach.


Applying Multiple Instance Learning to Automatic Music Classification
Hanna Lukashevich1, Bernd Bischl2 and Claus Weihs2

1 Fraunhofer IDMT, Ehrenbergstr. 31, 98693 Ilmenau, Germany [email protected]
2 Faculty of Statistics, Dortmund University of Technology, Germany {bischl,weihs}@statistik.tu-dortmund.de

Abstract. Stimulated by the ever-growing availability and size of digital music collections, automatic music classification has been identified as an increasingly important means to aid convenient exploration of large music catalogs. Here we will focus on the task of segment-based automatic instrument recognition in real-world songs. The observations of this classification problem will in general be of the multi-label/multi-instance type, as a number of different instruments might be played during a song but not necessarily at all times. Each song in the training data is formally considered as a bag of short-time frames (the instances). The ground truth is available in the following form: (a) The set of all instruments (multi-label) which occur at least once in a song/bag is known; (b) No information is available regarding the specific segments where certain instruments appear, meaning only the bag is labeled and the label does not have to apply to all instances in this bag. Both constraints imply specific approaches for the classification problem under consideration, especially (b), as it is not known which segments exactly pose positive (or negative) examples for a label. In our contribution we will discuss and compare recent approaches for solving such a task. Evaluation datasets are generated by mixing single stems of multi-track recordings. Here a single stem contains a recording of one music instrument; the times when each specific instrument is played are therefore known. We construct three datasets with rising degrees of difficulty.

References MANDEL, M. (2008): Multiple-instance learning for music information retrieval. In Proc. of the 9th Int. Conf. on Music Information Retrieval (ISMIR).

Keywords MULTIPLE INSTANCE LEARNING, AUTOMATIC MUSIC CLASSIFICATION


A new method for the elimination of systematic error from experimental high-throughput screening data
Vladimir Makarenkov1, Plamen Dragiev1,2 and Robert Nadon2

1 Département d'Informatique, Université du Québec à Montréal, C.P. 8888, Succursale Centre-Ville, Montréal (Québec), H3C 3P8, Canada
2 Department of Human Genetics, McGill University, 1205 Dr. Penfield Ave., Montreal, QC, H3A 1B1; McGill University and Genome Quebec Innovation Centre, 740 Dr. Penfield Ave., Montreal, QC, H3A 1A4, Canada

Abstract. High-throughput screening (HTS) is a critical step of the drug discovery process. It involves measuring the activity levels of thousands of chemical compounds. Several technical or environmental factors can affect an experimental HTS campaign and thus cause systematic deviations from correct results. A number of error correction methods have been designed to address this issue in the context of experimental HTS (Malo et al. 2006; Makarenkov et al. 2007). Despite their power to reduce the impact of systematic error, those methods introduce a bias when applied to the data not containing any systematic error. In our recent study, we showed how to assess the presence of systematic error in a given HTS assay by detecting the exact locations of rows and columns affected by systematic error (Dragiev et al. 2011). We will present a new method for eliminating systematic error from HTS assays using the prior knowledge on its exact location. The proposed method is based on an iterative procedure in which the median of each row (or column) affected by systematic error is subtracted from the row (or column) measurements and the median of the measurements not affected by systematic error is added to them. This is an improvement over the popular B-score method designed by Merck Frosst researchers (Brideau et al. 2003) and widely used in the modern HTS.

Keywords BIOINFORMATICS, B-SCORE, HIGH-THROUGHPUT SCREENING (HTS), SYSTEMATIC ERROR
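
The correction rule described above can be sketched as follows for the rows of a single plate, assuming the affected rows are already known; the plate values and the flagged row are simulated for illustration, and the iterative aspect of the method is omitted.

```python
# For each row flagged as affected by systematic error, subtract that row's
# median and add back the median of the unaffected measurements
# (columns would be handled analogously).
import numpy as np

plate = np.random.normal(loc=100.0, scale=5.0, size=(8, 12))   # toy 96-well plate
plate[2] += 15.0                                               # simulated row bias
affected_rows = [2]                                            # assumed known in advance

clean_mask = np.ones(plate.shape[0], dtype=bool)
clean_mask[affected_rows] = False
clean_median = np.median(plate[clean_mask])

corrected = plate.copy()
for r in affected_rows:
    corrected[r] = plate[r] - np.median(plate[r]) + clean_median
```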


Empirical tests of the CAPM and D-CAPM model on the Warsaw Stock Exchange Leslaw Markowski University of Warmia and Mazury Abstract. The aim of this study is to present the problem of valuing capital assets listed on the Warsaw Stock Exchange on the basis of the CAPM model. In the classical approach, relations between the expected rate of return and the risk are expressed most frequently in terms of the mean-beta coefficient. Due to the fact that investors attach a lower importance to positive than to negative deviations from the mean value of the rate of return, the study proposes an alternative approach to portfolio analysis. It was assumed that the risk should be treated in the context of the downside structure, or the possibility of incurring potential losses. The verification of the equations postulated by the CAPM model is based on downside beta coefficients, examining the sensitivity of the rates of return of companies to changes in unfavourable stock market conditions. Analyses of cross-sectional regressions were carried out both for individual securities and for highly diversified portfolios subjected to simulation. The study involved an analysis of the sensitivity of the results obtained to the capitalization level of stock-listed companies. Additionally, analyses of the distributions of classical and downside beta coefficients were also carried out.

Keywords beta coefficient, downside beta, CAPM.
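
As an illustration of the downside beta underlying D-CAPM tests, the sketch below contrasts the classical beta with a downside beta computed from below-mean co-movements (an Estrada-style definition, used here as one common choice); the return series are simulated.

```python
# Classical beta vs. a downside beta based on below-mean returns.
import numpy as np

rng = np.random.default_rng(1)
r_m = rng.normal(0.0005, 0.01, 1000)            # market returns (toy)
r_i = 0.8 * r_m + rng.normal(0, 0.008, 1000)    # asset returns (toy)

cov = np.cov(r_i, r_m)
beta_classic = cov[0, 1] / cov[1, 1]

dm = np.minimum(r_m - r_m.mean(), 0.0)
di = np.minimum(r_i - r_i.mean(), 0.0)
beta_down = np.mean(di * dm) / np.mean(dm ** 2)
print(beta_classic, beta_down)
```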


An adaptive and flexible way for Searching for a Clustering Pattern in presence of noise Carlos G. Matrán Universidad de Valladolid, Spain [email protected] Abstract. Trimming data is the best-known and oldest way of robustifying statistical procedures, that is, of avoiding large deviations in the conclusions of our analysis as a consequence of a small fraction of contaminating data. At the beginning, its use just covered one-dimensional data and constituted the naive approach to guaranteeing some degree of robustness in the estimation of a centralization measure of a data set. Now, data-driven trimming procedures constitute an obligatory reference for robustness in a wide range of statistical procedures. We will give an overview of the evolution of a class of trimming procedures in the context of Cluster Analysis, arising from impartial "trimming", a data-driven general methodology. This will include the consideration of different types of data as well as different kinds of shapes. Moreover, we will explore some capabilities of the software TCLUST, designed to perform robust clustering analysis for multivariate data arising from populations with different weights and shape patterns.

Keywords trimming, clustering


A Case Study about the Effort to Classify Music Intervals by Chroma and Spectrum Analysis
Verena Mattern1, Igor Vatolkin2, and Günter Rudolph3

1 Chair of Algorithm Engineering, TU Dortmund [email protected]
2 Chair of Algorithm Engineering, TU Dortmund [email protected]
3 Chair of Algorithm Engineering, TU Dortmund [email protected]

Abstract. Recognition of harmonic characteristics from polyphonic music, in particular intervals, can be very hard if different instruments with their specific characteristics (overtones, formants, noisy components) are playing together at the same time. In our study we examined the impact of the Harmonic Pitch Class Profile (Gómez, 2006), Chroma Based Normalized Statistics (M. Müller and S. Ewert, 2010) and the spectrum on the classification of single tone pitches and music intervals played either by the same or different instruments (acoustic and electric guitar, cello, electric bass, flute, piano, sax and trombone). After the analysis of the audio recordings which produced the most errors, we implemented two optimization approaches. The first one is based on the energy envelope and selects the frames using knowledge of attack and release intervals. The second one estimates the overtone distribution for the tone candidates. Both methods were compared in the experimental study. The results show that especially the integration of instrument-specific knowledge can significantly improve the overall performance.

References GÓMEZ, E. (2006): Tonal Description of Music Audio Signals. PhD thesis, Universitat Pompeu Fabra, Music Technology Group, Barcelona. MÜLLER, M. and EWERT, S. (2010): Towards timbre-invariant audio features for harmony-based music. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 3, pp. 649-662.

Keywords TONE PITCH DETECTION, MUSIC INTERVAL RECOGNITION


New advances in robust clustering based on trimming: The TCLUST approach
Agustin Mayo-Iscar1, Heinrich Fritz2, Luis Angel Garcia-Escudero1, Alfonso Gordaliza1, and Carlos Matran-Bea1

1 Universidad de Valladolid
2 Vienna University of Technology

Abstract. The TCLUST methodology falls within robust model based clustering approaches. It achieves robustness by using a data driven trimming. This methodology can be adapted to situations where some information about the structure of data is available. Constraints on the eigenvalues of the scatter matrices have already been considered in the first Tclust release and now we also consider the possibility of imposing constraints on the eigenvectors of the scatter matrices. A faster algorithm addressing these possibilities is given.

Keywords robust model based clustering


The Ever-Increasing Role of Mixture Models in Classification G.J. McLachlan Department of Mathematics and Institute for Molecular Bioscience, University of Queensland, St. Lucia, Queensland 4072, Australia Abstract. We consider the role that mixture models have played in classification, in particular for clustering continuous data via mixtures of normal distributions. A very brief history is given starting with the seminal papers by Nick Day and John Wolfe in the sixties before the appearance of the EM algorithm. It was the publication in 1977 of the latter algorithm by Dempster, Laird, and Rubin that greatly stimulated interest in the use of finite mixture distributions to model heterogeneous data. This is because the fitting of mixture models by maximum likelihood is a classic example of a problem that is simplified considerably by the EM’s conceptual unification of maximum likelihood (ML) estimation from data that can be viewed as being incomplete. In recent times there has been a proliferation of applications in which the number of experimental units n is comparatively small but the underlying dimension p is extremely large as, for example, in microarray-based genomics and other high-throughput experimental approaches. Hence there has been increasing attention given not only in bioinformatics and machine learning, but also in mainstream statistics, to the analysis of complex data in this situation where n is small relative to p. The latter part of the talk shall focus on the clustering of such high-dimensional data using mixture models.
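
For readers who want to experiment, the following hedged sketch fits a two-component normal mixture by maximum likelihood via the EM algorithm, here simply delegated to scikit-learn's GaussianMixture on simulated data; it is a generic illustration, not the analysis discussed in the talk.

```python
# Fit a normal mixture by ML (EM) and use the components as clusters.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (200, 2)),
               rng.normal(4, 1.5, (300, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)                 # model-based clustering of the data
print(gmm.weights_, gmm.means_)
```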


Some of my work with Jean-Pierre Barthélemy - and beyond F. R. McMorris Department of Applied Mathematics, Illinois Institute of Technology, Chicago, USA [email protected] Abstract. Let $(X, d)$ be a finite metric space and $\pi = (x_1, \ldots, x_k) \in X^k$ a profile in $X$. A median for $\pi$ is an element $x$ of $X$ for which $\sum_{i=1}^{k} d(x, x_i)$ is minimum. The median function on $(X, d)$ is the function that returns the set of all medians for any profile $\pi$. Letting $Med$ denote the median function, and $X^* = \bigcup_{k>0} X^k$, we have $Med : X^* \to 2^X \setminus \{\emptyset\}$ defined by $Med(\pi) = \{x : x \text{ is a median for } \pi\}$. Since a median for $\pi$ can be thought of as a "closest" element of $X$ to the profile $\pi$, medians are often used in studies having to do with location theory and consensus. When the metric space is completely arbitrary, not much can be said about the function $Med$, so research usually focuses on situations where the space has additional structure imposed by graph or order theoretic conditions. In this talk I focus on the space of all hierarchies $H$ on a finite set, endowed with the symmetric difference metric. I will recall a result that Jean-Pierre and I published in 1984 that characterizes $Med$ on $H$ and then discuss recent work that was motivated by this and related results.

Keywords finite metric space, location theory and consensus
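
A brute-force sketch of the median function on a small finite metric space is straightforward; the toy space below uses subsets with the symmetric difference metric, loosely echoing the hierarchy setting of the talk, and is purely illustrative.

```python
# Compute Med(profile): all points minimizing the summed distance to a profile.
from itertools import combinations

X = [frozenset(s) for r in range(3) for s in combinations({1, 2, 3}, r)]

def d(a, b):
    return len(a ^ b)                      # symmetric difference metric

def med(profile):
    totals = {x: sum(d(x, p) for p in profile) for x in X}
    best = min(totals.values())
    return [x for x, t in totals.items() if t == best]

print(med([frozenset({1}), frozenset({1, 2}), frozenset()]))
```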


Clustering using latent variable models Damien McParland1 and Claire Gormley1 University College Dublin [email protected] [email protected] Abstract. Item response models (IRM) are a type of latent variable model for ordinal data. In brief, the ordinal data point Yi is assumed to be a discrete version of an underlying latent (Gaussian) variable Zi . In turn, the latent variable Zi is modeled as a function of a ‘latent trait’ associated with individual i. In a similar vein, factor analysis (FA) is a widely used latent variable model for high dimensional continuous data. Continuous (Gaussian) response variables are modeled as a function of underlying ‘latent factors’. Merging the IRM and FA models facilitates the modeling of mixed ordinal and continuous data. Mixture models are a popular clustering tool – each component of the mixture is assumed to correspond to a cluster. Here a mixture of factor analyzers for mixed ordinal and continuous data (MFA-MD) is developed. The model has the capability to appropriately model mixed data and to cluster the observations into homogeneous groups. A Bayesian approach to inference is taken here; parameter estimation requires a Metropolis-within-Gibbs sampler. Real social science data sets provide illustrative examples of the model and estimation procedure.

References FOX, JP. (2010): Bayesian Item Response Modeling. Springer. JOHNSON, V.E. and ALBERT, J.H.. (1999): Ordinal Data Modeling. Springer. QUINN, K.M. (2004): Bayesian Factor Analysis for Mixed Ordinal and Continuous Responses. Political Analysis, 12:338–35.

Keywords LATENT VARIABLES, MIXTURE MODELS, BAYESIAN METHODS.


On Dynamic Weighted Majority algorithm based on Genetic algorithm
Dhouha Mejri1, Mohamed Limam1 and Claus Weihs2

1 ISG TUNIS, University of Tunis mejri [email protected]
2 Technical University of [email protected]@T-Online.de

Abstract. The dynamic weighted majority-Winnow (DWM-WIN) algorithm of Mejri [4] is a powerful classification method that handles nonstationary environments and copes with concept-drifting data streams. Despite having good performance, this method has a serious drawback in choosing the best values for its parameters. Hence, there is a need for a rational automatic selection of parameter values. To deal with this issue, a genetic algorithm (GA) of Feng et al. [2] is used as an optimization method to find the best parameter values. We used DWM-WIN as the fitness function of the GA. To assess this optimized DWM-WIN algorithm, four data sets are simulated from the UCI data set repository to highlight the classification performance of the new version compared to other algorithms.

Keywords Learning and classification, data mining, optimization


Prediction of Sub-cellular Protein Localization for Specialized Compartments using Time Series Kernels Marco Mernberger and Eyke H¨ ullermeier Department of Mathematics and Computer Science, University of Marburg {marco, eyke}@mathematik.uni-marburg.de Abstract. Identifying the sub-cellular localization of proteins is an important problem in systems biology. However, since an experimental verification of the localization of each protein in an organism is infeasible, there is a need for reliable prediction tools. So far, prediction tools for many compartments in eucaryotic and procaryotic cells exist, typically using different machine learning techniques, such as neural networks, support vector machines and others. Trained on large datasets compiled from proteins of different organisms, these approaches typically utilize knowledge of protein trafficking signals, such as the signal peptide, chloroplast transit peptide and others, that are present in many different organisms. However, for the case of specialized compartments that are only present in certain organisms, these tools fail to provide a reliable prediction, simply due to the lack of training data. In such cases, one has to resort to motif searches and other kinds of homology-based analysis to infer the sub-cellular location of a protein. Still, even a motif search may fail if the responsible trafficking signal is not well conserved. In this paper, we present an alternative method based on the use of time series kernels for the comparison of amino acid sequences that allows for the simultaneous incorporation of several different levels of information, thus in principle realizing a more powerful representation than the mere sequence of amino acids. We present a number of experimental studies in which we compare our approach with existing methods.


Ranking and clustering a large number of cereal selection lines from experiments without randomization and replications of the lines
G. Menexes1, K. Bladenopoulos2, and A. Markos3

1 Laboratory of Agronomy, School of Agriculture, Aristotle University of Thessaloniki, Greece [email protected]
2 Cereal Institute, Thermi-Thessaloniki, NAGREF, Greece [email protected]
3 Democritus University of Thrace, Greece, [email protected]

Abstract. In this study a methodological scheme for the ranking and clustering of a large number of cereal selection lines is presented, concerning their general adaptability and examining agronomic traits, criteria and indices. Usually the corresponding data come from experiments combined over locations without randomization (same order of treatments in each location) and without replication of treatments (cereal lines), so the traditional approaches like ANOVA and procedures for multiple comparisons of means cannot be applied. Methods using moving averages on scale data and estimations of some kind of "experimental error" utilizing partial information from the control treatments are already in use. The proposed ranking is achieved by assigning one or more optimal scores to each selection line based on the available (primary and/or secondary) measurements. These measurements correspond either to qualitative categorical variables or to quantitative ones, which in turn are likely to be transformed into categorical ones, based on certain biological (i.e. relative to the mean yield of the control treatments) and/or statistical criteria (i.e. quartiles of the corresponding yield distributions). In this way, intervals of values for each variable are determined, whereas at the same time specific "properties" within each variable are marked out. Thus, a qualitative transition takes place from the continuous exact measurements to the discrete states in which an agronomic trait, criterion or index can be analyzed with biological meaning or significance. The realization of the suggested methodology can be achieved by applying Factorial Correspondence Analysis on an appropriate design matrix (indicator matrix with 0-1 binary coding) and utilizing the optimal scaling properties which characterize this specific data analysis method. Bi-plots can facilitate the visualization of the analysis outcomes. The optimal scores can then either be used in the evaluation of lines or in the development of groups, or line clusters, with common or uniform traits by means of Hierarchical Cluster Analysis. As an example of the application of the suggested methodology, yield data of 113 F6 barley varieties from 8 locations in Greece were used.

Keywords categorical data analysis, experimental designs, selection, adaptability


Minimizing Calibration Time for Brain Reading Jan Hendrik Metzen1 , Su Kyoung Kim, and Elsa Andrea Kirchner University of Bremen Abstract. Machine learning is increasingly used to autonomously adapt brain-machine interfaces to user-specific brain patterns. In order to minimize the preparation time of the system, it is highly desirable to reduce the length of the calibration procedure, during which training data is acquired from the user, to a minimum. One recently proposed approach is to reuse models that have been trained in historic usage sessions of the same or other users by utilizing an ensemble-based approach. In this work, we propose two extensions of this approach which are based on the idea of combining predictions made by the historic ensemble with session-specific predictions that become available once a small amount of training data has been collected. These extensions are particularly useful for Brain Reading Interfaces (BRIs), a specific kind of brain-machine interface. BRIs do not require that user feedback is given and thus, additional training data may be acquired concurrently to the usage session. Accordingly, BRIs should initially perform well when only a small amount of training data acquired in a short calibration procedure is available and allow an increased performance when more training data becomes available during the usage session. An empirical offline study in a testbed for the use of BRIs to support robotic telemanipulation shows that the proposed extensions make it possible to achieve this kind of behavior.


Rapid Adaptation of Brain Reading Interfaces based on Threshold Adjustment
Jan Hendrik Metzen1 and Elsa Andrea Kirchner1,2

1 Robotics Group, University of Bremen [email protected]
2 Robotics Innovation Center (RIC), German Research Center for Artificial Intelligence (DFKI GmbH), Bremen [email protected]

Abstract. Brain Reading Interfaces (BRIs) can be used, e.g., to detect whether a user has recognized or missed an infrequent but task-relevant warning. Machine learning makes it possible to train an electroencephalography-based BRI such that it can distinguish between the two corresponding brain patterns. Unfortunately, acquiring a sufficient number of training examples is time-consuming, since infrequent warnings cannot be displayed often and it is not under the BRI's control how often a user misses a warning. Because of that, we propose to train the BRI instead on data associated with the recognition of an important warning and data associated with the perception of an irrelevant stimulus. Since irrelevant stimuli can be displayed with a higher frequency, large amounts of training data can be acquired more easily. We show that a BRI trained for this different but related task (the "source" task) can surprisingly well distinguish between recognized and missed warnings (the "target" task). This may indicate that similar brain patterns are evoked by missed warnings and irrelevant stimuli. To improve performance further, we propose to adjust the threshold which maps the scalar classifier output onto the two class labels in order to adapt the BRI from the source to the target task. A close-to-optimal threshold can be chosen based on a comparatively small training set from the target task. We show empirically on data acquired in the Labyrinth Oddball testbed (Kirchner et al. (2010)) that the proposed procedure is well suited for rapid adaptation of the BRI to the target task based on a small amount of training data.

References KIRCHNER, E. A., WÖHRLE, H., BERGATT, C., KIM, S. K., METZEN, J. H., FEESS, D. and KIRCHNER, F. (2010): Towards operator monitoring via brain reading - an EEG-based approach for space applications. Proceedings of the 10th International Symposium on Artificial Intelligence, Robotics and Automation in Space (iSAIRAS), pp. 448-455, Sapporo, Japan.

Keywords BRAIN READING INTERFACE, THRESHOLD ADJUSTMENT
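
The threshold adjustment step can be sketched as follows: the source-task classifier is kept fixed and only the decision threshold on its scalar outputs is re-selected on a small labelled target-task sample. Balanced accuracy is used here as one plausible selection criterion, and the scores are simulated; this is not the authors' implementation.

```python
# Pick the threshold on scalar classifier outputs that maximizes balanced
# accuracy on a small labelled sample from the target task.
import numpy as np

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(-0.4, 1.0, 30),    # missed warnings (toy)
                         rng.normal(0.9, 1.0, 30)])    # recognized warnings (toy)
labels = np.concatenate([np.zeros(30), np.ones(30)])

def balanced_accuracy(thr):
    pred = (scores > thr).astype(int)
    tpr = np.mean(pred[labels == 1] == 1)
    tnr = np.mean(pred[labels == 0] == 0)
    return 0.5 * (tpr + tnr)

best_thr = max(np.unique(scores), key=balanced_accuracy)
print(best_thr, balanced_accuracy(best_thr))
```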


Feature-based joint analysis of product perception and preference Michel Meulders HUBrussel, Stormstraat 2, 1000 Brussel [email protected] KUL, Tiensestraat 102, 3000 Leuven Abstract. A key task of strategic marketing is to study the competitive structure of products by deriving a spatial configuration or a categorization of products. Besides information on the similarity of products, an important goal of competitive structure analysis is to investigate to what extent distinct consumer segments prefer a specific group of products, or whether the perception of the products depends on consumer segments (DeSarbo et al. (2008)). Candel and Maris (1997) use a probabilistic feature model to categorize products on the basis of binary latent features. However, in this analysis no consumer differences in product perception or product preference are taken into account. To solve this problem, we propose latent class extensions of the probabilistic feature model which allow capturing both differences in product perception and product preference. A Gibbs sampling algorithm is used to compute a sample of the observed posterior distribution of the model, and posterior predictive simulations are used to evaluate the fit of the model. As an illustration, the model is used to analyze binary judgments of 407 respondents who indicated for 10 types of potato chips and 36 characteristics whether or not a certain type of chips has a certain characteristic. In addition, a latent class vector model using parameters on product perception as input is used to model product rankings.

References CANDEL, M. J. J. M. and MARIS, E. (1997): Perceptual analysis of two-way two-mode frequency data: probability matrix decomposition and two alternatives. International Journal of Research in Marketing, 14, 321-339. DESARBO, W. S., GREWAL, R. and SCOTT, C. J. (2008): A clusterwise bilinear multidimensional scaling methodology for simultaneous segmentation and positioning analyses. Journal of Marketing Research, Vol. XLV, 280-292.

Keywords LATENT CLASS MODEL, BAYESIAN ANALYSIS, PREFERENCE ANALYSIS


A statistical survey on bulk emails with symbolic data analysis Hiroyuki Minami Information Initiative Center, Hokkaido University, JAPAN [email protected] Abstract. Bulk emails (a.k.a. spam) are a well-known nuisance on the Internet. However, we can regard them as a set of computer texts with some specific features. Some powerful commercial spam filters are based on massive reports from the users (e.g., Cloudmark). This is very effective for excluding major spam, but not always useful for relatively minor spam, for example spam written in languages other than English. We previously reported a simple technique for identifying spam based only on e-mail headers (Minami, 2005), which produced binary data according to our original heuristics on the headers. Compared to popular filters, it had poor performance, especially on English spam, but offered some advantages for spam in mixed languages. It would be useful to treat and analyze both headers and contents within the same framework while preserving their respective characteristics. To formalize this idea, we introduce Symbolic Data Analysis (SDA) into spam analysis. All bulk emails are transformed into Symbolic Objects, which consist of the binary data based on e-mail headers and the aggregated text data based on the e-mail body, taking content languages into account. Some classification techniques in SDA (such as SGCA and SFDA in Diday and Noirhomme-Fraiture (2008)) are applied to the original data (over 100,000 spam messages collected by the author), and we try to reveal their characteristics and how to classify and discriminate the spam properly.

References Diday, E. and Noirhomme-Fraiture, M. (Eds.) (2008): Symbolic Data Analysis and the SODAS Software. Wiley. O’Donnell, A. and Prakash, V. V.: Reputation Based Approach for Efficient Filtration of Spam. http://www.cloudmark.com/en/whitepapers/reputation-based-approach-for-efficient-filtration-of-spam Minami, H. (2005): SPAM Filtering based on E-mail Headers and the Comparison with Bayesian Filtering on its Contents. Proceedings of ISI2005, 276.

Keywords Spam, Huge data


Random projections for stopping the process of divisions in k-means bisection clustering E. Kovaleva1 and B. Mirkin1,2 1

2

Higher School of Economics, 11 Pokrovski Boulevard, Moscow, RF [email protected] Birkbeck University of London, Malet Street, London, WC1E 7HX, UK [email protected]

Abstract. The problem of stopping the process of divisions in divisive clustering is of current interest. Recently, [Ta10] proposed a method, dePDDP, which stops splits when the histogram of the projection of the cluster’s entities onto the principal axis [Bo98] has no minima, that is, when it is monotone or convex. We consider cluster histograms on randomly chosen axes to increase the performance and stability of stopping criteria. We develop two tests using the histograms of the projections of a cluster’s entities onto a number of randomly drawn axes. One, deRAND, picks up the idea of [Ta10] and stops splitting when the proportion of random axes at which the histogram has no minima is greater than 70% (derived from the properties of the Gaussian distribution). The other relies on the classical uni-dimensional likelihood ratio statistic 2LR [Co74] for choosing between hypotheses of one against two Gaussian distributions to develop a statistic, 2LRRAND, that stops the divisions when it becomes negative. We have conducted multiple experiments with generated Gaussian cluster sets involving different cluster numbers, sizes, spreads and extents of intermix. Our results indicate that all three criteria work similarly well at modest numbers of well separated clusters (up to 7). When the number of clusters rises to 15, dePDDP falters whereas the other two carry on. When increasing the cluster intermix to 10% of the sample, the only method maintaining high accuracy is 2LRRAND. However, the latter method is applicable only at relatively large sizes of data and clusters.
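The deRAND idea can be sketched roughly as follows (a simplified reading, not the authors' code): project the cluster's entities onto random unit directions, build a histogram per direction, and stop splitting when more than 70% of the directions show no interior minimum. The number of axes, the bin count and the way "no minimum" is operationalised below are assumptions of this sketch.

```python
import numpy as np

def has_interior_minimum(counts):
    """Crude check: is there a bin that is strictly lower than some bin
    on its left and some bin on its right (i.e. the histogram dips)?"""
    for i in range(1, len(counts) - 1):
        if counts[i] < counts[:i].max() and counts[i] < counts[i + 1:].max():
            return True
    return False

def derand_stop(X, n_axes=100, bins=20, share=0.7, seed=0):
    """Return True if the splitting of cluster X should stop."""
    rng = np.random.default_rng(seed)
    no_min = 0
    for _ in range(n_axes):
        a = rng.normal(size=X.shape[1])
        a /= np.linalg.norm(a)                 # random unit direction
        counts, _ = np.histogram(X @ a, bins=bins)
        if not has_interior_minimum(counts):
            no_min += 1
    return no_min / n_axes > share

# a single Gaussian blob should trigger the stopping rule
X = np.random.default_rng(1).normal(size=(500, 5))
print(derand_stop(X))
```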

References [Ta10] TASOULIS S.K., TASOULIS D.K. and PLAGIANAKOS V.P. (2010): Enhancing Principal Direction Divisive Clustering. Pattern Recognition, 43, 3391–3411. [Bo98] BOLEY D. (1998): Principal Direction Divisive Partitioning. Data Mining and Knowledge Discovery, 2, 325–344. [Co74] COX, D. R. and HINKLEY, D.V. (1974): Theoretical Statistics. Chapman & Hall, Boca Raton.

Keywords DIVISIVE CLUSTERING, STOPPING CRITERION, RANDOM PROJECTIONS


Hierarchical clustering for distribution valued dissimilarity data Masahiro Mizuta Information Initiative Center, Hokkaido University, Sapporo 060-0811, JAPAN [email protected] Abstract. In this paper, we deal with a hierarchical clustering method for distribution valued dissimilarity data. Conventional hierarchical clustering methods assume that the input data are a set of nonnegative real values, i.e. dissimilarities between objects. In many situations, however, the dissimilarities between objects cannot be measured by real values, but have to be represented by distributions. We previously proposed a multidimensional scaling (MDS) method for these kinds of data (Mizuta, 2009). Single Linkage (SL) is a well-known method to find clusters. The key issue of SL is to select the minimum value among a set of dissimilarities. We focus on the p-th percentile of the distributions. When the value of p is fixed, the distribution valued dissimilarities are represented by nonnegative real values; regarded as functions of p, the dissimilarities are represented as functional data. A functional clustering method (Mizuta, 2003) can be adopted for the functional dissimilarity data. The results of the clustering method are represented by a functional minimum spanning tree.
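A minimal sketch of the reduction described above, under the assumption that each distribution-valued dissimilarity is available as a sample of draws: fix p, replace every distribution by its p-th percentile, and run single linkage on the resulting crisp dissimilarity matrix (the toy data are made up).

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

def percentile_dissimilarity(dist_samples, p):
    """dist_samples[i][j]: array of draws from the dissimilarity
    distribution between objects i and j (only i < j is used)."""
    n = len(dist_samples)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = np.percentile(dist_samples[i][j], p)
    return D

# toy example: 4 objects, each pairwise dissimilarity given as 100 draws
rng = np.random.default_rng(1)
n = 4
samples = [[rng.gamma(2.0, 1.0 + abs(i - j), 100) if j > i else None
            for j in range(n)] for i in range(n)]
D = percentile_dissimilarity(samples, p=50)            # p fixed at the median
Z = linkage(squareform(D, checks=False), method="single")
print(Z)
```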

References Diday, E. and Noirhomme-Fraiture, M. eds.(2008): Symbolic Data Analysis and the SODAS Software, Wiley. Mizuta, M. (2003): Hierarchical Clustering for Functional Dissimilarity Data, Proceedings of the 7th World Multiconference on Systemics, Cybernetics and Informatics, Volume V, pp.223-227. Mizuta, M. (2009): MDS for Distribution Valued Dissimilarity Data, Proceedings of IFCS 2009 and GfKl, p.234.

Keywords Symbolic Data Analysis, Functional Data Analysis


Geochemical and Statistical Investigation of Clay Deposits in the Troad and its Implication for Provenance of Bronze Age Fine Pottery from Troia Carlos Morales-Merino1 , Cornelia Schubert1 , Hans-Joachim Mucha2 , and Hans-Georg Bartel3 1

2

3

Curt-Engelhorn-Zentrum Archäometrie, D6, 3, 68159 Mannheim, Germany, [email protected], [email protected] Weierstrass Institute for Applied Analysis and Stochastics (WIAS), 10117 Berlin, Germany, [email protected] Institute for Chemistry, Humboldt University Berlin, Brook-Taylor-Straße 2, 12489 Berlin, Germany, [email protected]

Abstract. Troian pottery has been frequently studied, but there are still many unanswered questions about its origin and distribution throughout the Troad. In provenance studies, it is common to compare ceramics with reference material of known origin. Unfortunately, as in the case of Troia, this is not always available in archaeological contexts. An investigation of clay deposits in the region is therefore important to understand the selection and discrimination of clays used by ancient potters. Anatolian Grey Ware (AGW) and Tan Ware (TW) are the characteristic fine wares of Late Bronze Age Troia. They are common not only in Troia itself but also at several sites in the Troad and, in the case of AGW, in the Eastern Mediterranean. In this work the focus lies on samples from Troia and some other sites in the Troad. The question of imported pottery from Troia VI (Early and Middle) is also investigated. A study of 255 ceramic and 324 clay sediment samples was conducted. They were analyzed by neutron activation analysis. The resulting chemical composition data is examined using different multivariate statistical procedures. Several cluster analysis models, validation procedures and different data transformations were applied in order to differentiate between imported and local ceramics. The results show some regional trends. For example, an unusually high arsenic content was observed in the clays and is mirrored in the chemical composition of Troian ceramics. Some of the wares turn out to be imports, originating from Samothrace, some unidentified littoral islands and the Troad’s coastal strip.

Keywords Troia, Ceramics, Cluster Analysis, Discriminant Analysis, Neutron Activation Analysis


Classification of Roman Tiles With Stamp PARDALIUS Hans-Joachim Mucha1 , Jens Dolata2 , and Hans-Georg Bartel3 1

2

3

Weierstrass Institute for Applied Analysis and Stochastics (WIAS), 10117 Berlin, Germany, [email protected] Head Office for Cultural Heritage Rhineland-Palatinate (GDKE), Große Langgasse 29, 55116 Mainz, Germany, [email protected] Department of Chemistry at Humboldt University, Berlin, Brook-Taylor-Straße 2, 12489 Berlin, Germany, [email protected]

Abstract. As a newsworthy development in Roman tile research, we report the archaeometrical investigation of brick stamps of the late Roman PARDALIUS. The location of the findings is Nehren on the Mosel river, near the former imperial residence of Trier. The roof tiles belong to a Roman mausoleum equipped with a grave-chamber. Their chemical composition was measured by X-ray fluorescence analyses at the laboratory of Freie Universität Berlin. First, the new set of 14 tiles is compared with all currently known proveniences of Roman tile making in Roman Southern Germany. Without any doubt, the class of tiles from Nehren is highly significantly different. Therefore, from the statistical point of view, a new brickyard can be confirmed. However, this class does not look homogeneous. Therefore, second, exploratory data analysis including data visualizations is performed. Additionally, we also investigated the tiles of the provenience "Not yet known 3" (see, for instance, BARTEL 2009), because their statistical difference to the PARDALIUS tiles is the lowest among all considered classes of tiles. A serious problem is the small sample size, 14 and 7 observations, respectively. The series of analysed tiles cannot currently be increased by archaeologists, because all known tiles have already been made available. In order to increase the sample size, we propose some combinations of bootstrapping and jittering to generate additional observations.

References BARTEL, H.-G. (2009): Archäometrische Daten römischer Ziegel aus Germania Superior. In: H.-J. Mucha and G. Ritter (Eds.): Classification and Clustering: Models, Software and Applications, Report No. 26, WIAS, Berlin, 50–72.

Keywords Roman tiles, discriminant analysis, clustering, bootstrap, jittering


Illumination-Robust Dense Optical Flow Using Census Signatures Thomas Müller1 , Clemens Rabe, Jens Rannacher, Uwe Franke, and Rudolf Mester Daimler AG Abstract. Vision-based motion perception builds primarily on the concept of optical flow. Modern optical flow approaches suffer from several shortcomings, especially in real, non-ideal scenarios such as traffic scenes; among these is the handling of non-constant illumination conditions in consecutive frames of the input image sequence. We propose and evaluate the application of intrinsically illumination-invariant census transforms within a dense state-of-the-art variational optical flow computation scheme. Our technique improves robustness against illumination changes, caused either by altering physical illumination or by camera parameter adjustments. Since census signatures can be implemented quite efficiently, the resulting optical flow fields can be computed in real-time.
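For illustration, a plain 3x3 census transform (the basic building block behind census signatures, not the paper's variational flow scheme) can be written as follows; it is invariant to any monotonically increasing intensity change, which is what provides the illumination robustness.

```python
import numpy as np

def census_transform_3x3(img):
    """Census signature per pixel: bit k is set if the k-th of the eight
    3x3 neighbours is darker than the centre pixel."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    sig = np.zeros((h - 2, w - 2), dtype=np.int32)
    centre = img[1:-1, 1:-1]
    bit = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
            sig = sig | ((neigh < centre).astype(np.int32) << bit)
            bit += 1
    return sig

# the signature is unchanged under a gain/offset change of the image
img = np.random.default_rng(2).random((64, 64))
assert np.array_equal(census_transform_3x3(img),
                      census_transform_3x3(2.0 * img + 0.3))
```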


A psychological perspective on similarity and distance measures Daniel Müllensiefen Dept. of Psychology, Goldsmiths, University of London, New Cross Road, SE14 6NW, London. [email protected] Abstract. Over the last four decades the psychological processes underlying the human perception of similarity and distance have received a good deal of attention in the cognitive literature. Theories and models range from early geometrical models related to the development of multi-dimensional scaling (Shepard, 1987), and similarity as the overlap of sets of perceptually salient features (Tversky, 1977), via the similarity choice model based on empirical confusion matrices (Townsend & Landon, 1982) and the more general concept of stimulus bias (Nosofsky, 1991), to the idea of transformation distances, such as Edit Distance, Earth Mover’s Distance or, more generally, complexity and compressibility (Hahn & Chater, 1997; Hodgetts, Hahn & Chater, 2009). This paper will review some of the most prominent models of similarity perception and their formal specifications along with the corresponding experimental evidence from cognitive psychology. A subset of these similarity models is then evaluated on a real-world dataset stemming from the area of music psychology. In this experiment human participants indicated the perceptual similarity between pairs of melodies taken from ordinary pop tunes. Aggregates of the participants’ similarity ratings are then compared to similarity measurements derived from musical adaptations of perceptual similarity models. Results are discussed not only in terms of how closely individual similarity models match the experimental data but also in terms of the psychological plausibility of the underlying theories and their formal implementations.

References CHATER, N., and HAHN, U. (1997). Representational distortion, similarity, and the universal law of generalization. In: Proceedings of the interdisciplinary workshop on similarity and categorization, SimCat97. Edinburgh: Department of Artificial Intelligence, University of Edinburgh, 31-36. HODGETTS, C.J., HAHN, U. and CHATER, N. (2009): Transformation and alignment in similarity. Cognition, 113, 62-79. SHEPARD, R.N. (1987): Toward a universal law of generalization for psychological science. Science, 237, 1317-1323. TOWNSEND, J.T. and LANDON, D.E. (1982): An experimental and theoretical investigation of the constant-ratio rule and other models of visual letter confusion. Journal of Mathematical Psychology, 25, 119-162. TVERSKY, A. (1977): Features of similarity. Psychological Review, 84, 327-352.

Keywords SIMILARITY, DISTANCE, PERCEPTION, PSYCHOLOGY, MUSIC


Spurious Dimensions in the Application of Principal Components Analysis with the Oblique Rotation to Binary Data Takashi MURAKAMI1 and Yuri IRIE2 1 2

Chukyo University, Japan [email protected] Nagoya University, Japan [email protected]

Abstract. Principal components analysis (PCA) has been mentioned as an inappropriate procedure for analyzing binary data because it tends to yield spurious dimensions called difficulty factors (e.g., McDonald & Ahlawat, 1974). However, we will demonstrate that the method can be a useful tool when it is carefully applied. A questionnaire consisting of 66 items for self-assessment of foreign language skills ("can-do" statements) was analyzed by PCA with Harris-Kaiser independent cluster rotation (Kiers & ten Berge, 1994). Six-point responses were coded into dichotomous categories in advance. Because the five obtained components were highly correlated with one another and the loadings were strongly related to the response rates, four of them should have been interpreted as spurious components. However, the three components on which items with intermediate response rates had salient loadings could be interpreted on the basis of their contents. In addition, the simultaneous distribution of these three components was roughly elliptical, whereas scatter diagrams of other combinations showed approximately L-shaped distributions. The result suggests that PCA may be used as a congruent and/or complementary method to multiple scalogram analysis such as POSA.

References KIERS, H.A.L. and TEN BERGE, J.M.F. (1994): The Harris-Kaiser independent cluster rotation as a method for rotation to simple component weights. Psychometrika, 59, 81-90. MCDONALD, R.P. and AHLAWAT, K.S. (1974): Difficulty factors in binary data. British Journal of Mathematical and Statistical Psychology, 27, 82-99.

Keywords DIFFICULTY FACTORS, INDEPENDENT CLUSTER ROTATION, “CAN-DO” STATEMENTS


Cepstral Modulation Features for Versatile Audio Classification Tasks Anil Nagathil and Rainer Martin Institute of Communication Acoustics, Ruhr-Universität Bochum, Germany [email protected], [email protected] Abstract. Audio signal classification is of high interest in many applications such as audio scene analysis in hearing aids or music playlist generation. An integral part of the classification scheme is the design of features which facilitate a high discrimination between different audio classes. While there are many approaches towards customized features for specific classification tasks, in this work we strive for a unified and scalable audio feature representation which can be used in a wide range of audio classification tasks. Cepstral coefficients (Bogert et al. (1963)), which describe different levels of detail of the short-time Fourier spectrum, are commonly used as a representation of speech and audio signals. The dynamics of these coefficients can be expressed by means of a short-time modulation spectrum. A temporally averaged cepstral modulation spectrum (TACSM) yields a compact signal representation and a basis for the extraction of audio features. In this work we propose two different feature sets which are both based on a TACSM, either with a low or a high modulation frequency resolution. While in the former case the TACSM can be parameterized using Cepstral Modulation Ratio Regressions (CMRARE) (Nagathil et al. (2011)), in the latter case features are obtained from a regression of the TACSM which is based on a singular value decomposition. We present classification results obtained using the proposed features in a hierarchically structured speech, music and noise classification task where a linear discriminant analysis is chosen as the classifier.
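A rough sketch of a temporally averaged cepstral modulation spectrum, under my own simplifications (frame-wise DCT cepstra of the log magnitude spectrum, block-wise modulation spectra over time, averaged over blocks); the frame, hop and block sizes are arbitrary choices, and neither the CMRARE nor the SVD-based parameterisation is shown.

```python
import numpy as np
from scipy.fftpack import dct

def tacsm(x, frame_len=512, hop=256, n_cep=20, block=64):
    """Temporally averaged cepstral modulation spectrum (rough sketch)."""
    # 1) frame-wise cepstral coefficients from the log magnitude spectrum
    frames = [x[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(x) - frame_len, hop)]
    logmag = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    cep = dct(logmag, type=2, norm="ortho", axis=1)[:, :n_cep]   # (T, n_cep)
    # 2) modulation spectrum: FFT over time of each cepstral trajectory,
    #    computed block-wise and averaged over blocks
    blocks = [cep[t:t + block] for t in range(0, cep.shape[0] - block + 1, block)]
    mods = [np.abs(np.fft.rfft(b, axis=0)) for b in blocks]
    return np.mean(mods, axis=0).T               # (n_cep, modulation frequency)

# toy signal: 3 s of an amplitude-modulated tone at 16 kHz
fs = 16000
t = np.arange(3 * fs) / fs
x = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 440 * t)
print(tacsm(x).shape)
```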

References BOGERT, B.P., HEALY, M.J.R., TUKEY, J.W. (1963): The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo-Autocovariance, Cross-Cepstrum and Saphe Cracking. In Proc. Symposium on Time Series Analysis, 209–243. NAGATHIL, A., GÖTTEL, P., MARTIN, R. (2011): Hierarchical Audio Classification Using Cepstral Modulation Ratio Regressions Based on Legendre Polynomials. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP).

Keywords CEPSTRUM, MODULATION SPECTRUM, AUDIO CLASSIFICATION


Analysis of One-mode Three-way Asymmetric Data by Multidimensional Scaling and Cluster Analysis. Atsuho Nakayama1 , Hiroyuki Tsurumi2 , and Akinori Okada3 1

2 3

Graduate School of Social Sciences, Tokyo Metropolitan University, 1-1 Minami-Ohsawa, Hachioji-shi, Tokyo 192-0397, Japan [email protected] Faculty of Business Administration , Yokohama National University Graduate School of Management and Information Sciences, Tama University

Abstract. Previous studies have proposed models to analyze one-mode three-way data. These models usually assume triadic symmetric relationships among objects. Therefore, Nakayama and Okada (2010) propose a method that extends the reconstruction method of Harshman et al. (1982) to one-mode three-way asymmetric proximity data. Their method makes the overall sum of the rows and the overall sum of the columns equal. Accordingly, our proposed method reconstructs one-mode three-way asymmetric data so that the overall sums of the rows, columns and depths are made equal over all objects. Our proposed method is also effective for analyzing data in which the overall sums of the rows, columns, and depths differ depending on external factors. In the present paper, we apply our reconstruction method to one-mode three-way asymmetric purchase behavior data. The reconstructed one-mode three-way asymmetric data are symmetrized and analyzed by one-mode three-way symmetric MDS and an additive clustering method, because joint use of MDS and cluster analysis is often said to be desirable.

References HARSHMAN, R.A., GREEN, P.E., WIND, Y., & LUNDY, M.E. (1982): A Model for the Analysis of Asymmetric Data in Marketing Research. Marketing Science, 1, 205–242. NAKAYAMA, A., & OKADA, A. (2010): Reconstructing One-mode Three-way Asymmetric Data for Multidimensional Scaling [summary]. Abstracts of the 34th Annual Conference of the German Classification Society, p. 133. (July 22, 2010).

Keywords CLUSTER ANALYSIS, CONSUMER BEHAVIOR, MDS, ONE-MODE THREE-WAY ASYMMETRIC DATA, TRIADIC RELATIONSHIPS


The influence of the size of the scale in a statistical model D. Nappo University of Naples "Federico II", Department of Mathematics and Statistics, Via Cinthia 80126 Naples (Italy) [email protected] Abstract. The size of the scale is an important and critical problem in a statistical modelling approach. In particular, in the Partial Least Squares Path Modeling approach (Wold, 1982) to the estimation of a Structural Equation Model (SEM; Joreskog, 1970), which allows a complex phenomenon to be modelled through the identification of several dimensions (latent variables), the observed manifest variables (MVs) used to measure the latent variables are often expressed on an ordinal scale with 10, 5, 4, 3 or 2 levels. This can affect the significance of the model parameters, also because they are generally treated as numerical variables. This is a strong assumption that cannot be considered valid, especially when the MVs have only three or two levels. In order to quantify these MVs so that they can be treated as numerical, an algorithm called Partial Alternating Least Squares Optimal Scaling-Path Modeling (PALSOS-PM; Nappo, 2009) is proposed; it uses an Alternating Least Squares algorithm (Young, 1981), combined with the Partial Least Squares approach to the estimation of a Structural Equation Model, to obtain an optimal quantification of the ordinal and nominal MVs. In this way, the work addresses the problem of the scale size both under the numerical assumption for these MVs and under a quantification approach, in order to verify the sensitivity of this approach with respect to the size of the scale.

References Joreskog, K.G. (1973): A general method for estimating a Linear Structural Equation System. Goldberger and Duncan, pp. 85-112. Nappo, D. (2009): SEM with ordinal manifest variables. An Alternating Least Squares approach. PhD Thesis, University of Naples "Federico II". Tenenhaus, M., Esposito Vinzi, V., Chatelin, Y. M. and Lauro, C. (2005): PLS path modeling. Computational Statistics and Data Analysis, 48, 1, pp. 159-205. Wold, H. (1982): Soft modeling: The basic design and some extensions. In: Jöreskog, K. G. and Wold, H. (Eds.), Systems Under Indirect Observation, Part II, pp. 1-54. North-Holland, Amsterdam.

Keywords SEM MODELS, PLS-PM, ALSOS, SCALE


Statistical Software for Clustering Images Robert Naundorf and Daniel Baier Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany, {robert.naundorf,daniel.baier}@tu-cottbus.de Abstract. In the digital age, and especially with the Internet, people produce more data than ever before. In particular, social network services (e.g. Facebook, Twitter, Flickr) offer numerous features for sharing personal interests and activities by providing multimedia content, such as photos, music, videos, or location data. Content-based multimedia information retrieval addresses the problem of processing large amounts of multimedia data even when textual descriptions and annotations are nonexistent (Lew et al. (2006)). Although there has been some remarkable progress, e.g. in the field of content-based image retrieval (CBIR), commercial applications remain scarce (Datta et al. (2008)). A recent and promising approach considers the application of content-based methods for clustering user-generated images for market segmentation purposes (Baier and Daniel (2010)). However, as we will show, conventional statistical software packages do not provide adequate capabilities for dealing with large sets of images or do not meet specific requirements of the commercial domain. Finally, we present a software prototype addressing these shortcomings.
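As a sketch of the kind of content-based pipeline the abstract refers to, the following clusters images on simple colour-histogram features with k-means; the feature choice and the random stand-in images are assumptions of this example, not the prototype's actual method.

```python
import numpy as np
from sklearn.cluster import KMeans

def colour_histogram(img, bins=8):
    """Concatenated per-channel histograms of an RGB image in [0, 255]."""
    feats = [np.histogram(img[..., c], bins=bins, range=(0, 255), density=True)[0]
             for c in range(3)]
    return np.concatenate(feats)

# stand-in for user-generated images: random RGB arrays (assumption)
rng = np.random.default_rng(3)
images = [rng.integers(0, 256, size=(64, 64, 3)) for _ in range(40)]
X = np.vstack([colour_histogram(im) for im in images])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))
```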

References BAIER, D. and DANIEL, I. (2010): Image Clustering for Marketing Purposes. 3rd German-Japanese Workshop on Advances in Data Analysis and Related New Techniques and Applications, Karlsruhe, 2010-07-20. DATTA, R., JOSHI, D., LI, J. and WANG, J. Z. (2008): Image Retrieval: Ideas, Influences, and Trends of the New Age. ACM Computing Surveys, 40 (2), article 5. LEW, M. S., SEBE, N., DJERABA, C. and JAIN, R. (2006): Content-Based Multimedia Information Retrieval: State of the Art and Challenges. ACM Transactions on Multimedia Computing, Communications, and Applications, 2 (1), 1–19.

Keywords Statistical Software Packages, Image Clustering, Content-Based Image Retrieval


Regularized Ideal Point Classification C. Ninaber and M. de Rooij Leiden University, Institute of Psychology, Wassenaarseweg 52, 2333 AK, Leiden, The Netherlands [email protected] Abstract. Multinomial classification problems are traditionally approached with the use of multiple logit functions, making the model parameters cumbersome to interpret. In addition, there are increasingly more research fields that have to cope with large sets of variables (e.g. genomics, machine learning, brain imagery). This is causing a shift in emphasis: not only the development of valid analysis techniques, but also model and data visualization are increasingly important. To overcome these challenges we propose Ideal Point Classification (IPC) with lasso soft-threshold variable selection. IPC is a form of discriminant analysis by which subjects are projected into a multidimensional space and membership probability is estimated based on their distance to class points. In maximum dimensionality it is equivalent to a multivariate generalized linear model, but by reducing dimensionality it is possible to visualize a multinomial classification model in a low-dimensional space, making interpretation easier. Furthermore, by incorporating lasso shrinkage, models become more parsimonious and overfitting is prevented.
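A sketch of the core IPC idea as described above: class-membership probabilities decay with the squared Euclidean distance between a subject's point and the class points in the joint low-dimensional space (here written as a softmax over negative squared distances). The exact link function and the lasso-penalised estimation of the projection are not shown, and the numbers are invented.

```python
import numpy as np

def ipc_probabilities(subject_points, class_points):
    """P(class k | subject i) proportional to exp(-||z_i - c_k||^2)."""
    d2 = ((subject_points[:, None, :] - class_points[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2)
    return w / w.sum(axis=1, keepdims=True)

# 5 subjects and 3 classes in a 2-dimensional joint space (toy numbers)
rng = np.random.default_rng(4)
Z = rng.normal(size=(5, 2))                            # subject coordinates
C = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])     # class points
print(ipc_probabilities(Z, C).round(3))
```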

References FRIEDMAN, J. and HASTIE, T. and TIBSHIRANI, R. (2010): Regularization Paths for Generalized linear models via Coordinate Descent. Journal of statistical software, 33, 1–22. DE ROOIJ, M. (2009): Ideal Point Discriminant Analysis Revisited with a Special Emphasis on Visualization. Psychometrika, 74, 317–330. DE ROOIJ, M. (2011): Transitional Ideal Point Models for Longitudinal Multinomial Outcomes. Statistical Modelling, 11, 115–135.

Keywords CLASSIFICATION, HIGH DIMENSIONAL DATA, MULTIDIMENSIONAL UNFOLDING, REGULARIZATION


Estimating and Visualizing Cluster Structure in a Constrained Hypercube as a Proxy for Cognitive Diagnosis Models R. Nugent1 and N. Dean2 1 2

Department of Statistics, Carnegie Mellon University School of Mathematics and Statistics, University of Glasgow

Abstract. Projection pursuit (PP) dates to 1974 (Friedman and Tukey). Principal component analysis (PCA) dates to 1901 (Pearson). We introduce a new method that has resemblances to both. It could equally well be called conditional projection pursuit (CPP) or most informative component analysis (MICA) in honor of its two ancestors. Like principal component analysis, it is based on a matrix eigenanalysis, with eigenvectors used as linear combinations. Like projection pursuit, it is focused on nonlinear and non-normal features of the data. The method will be illustrated with several examples. Technical issues will be de-emphasized.


Cluster Analysis Based on Multi-Layer Structure Akinori Okada1 and Satoru Yokoyama2 1

2

Graduate School of Management and Information Sciences, Tama University [email protected] Department of Business Administration, Faculty of Economics, Teikyo University [email protected]

Abstract. There are two categories of cluster analysis: hierarchical cluster analysis and non-hierarchical cluster analysis. In the case of hierarchical cluster analysis, two clusters are merged into one cluster, or one cluster is divided into two clusters, at each step by optimizing a criterion of that step. In the case of non-hierarchical cluster analysis, each object is classified into one of a predetermined number of clusters, also by optimizing a criterion of the resulting classification. The algorithms of conventional cluster analysis, both hierarchical and non-hierarchical, are stepwise optimal at each step (Gordon, 1999, p. 78). The model of the present cluster analysis assumes beforehand a hierarchical structure consisting of layers, e.g., a species, a genus, a family, and an order, where each layer has a predetermined number of clusters (Okada and Yokoyama, 2010). The present cluster analysis classifies each object into one of the clusters at each layer simultaneously by an iterative algorithm. Each object belongs to one of the clusters at each layer. The algorithm optimizes the criterion of the fitness measure at all layers simultaneously, but does not optimize the criterion of the fitness measure at one layer after another (it is not stepwise optimal at each step). The present cluster analysis is applied to data on whisky brands.

References GORDON, A.G. (1999): Classification (2nd ed.). Chapman & Hall/CRC, Boca Raton, FL. OKADA, A. and YOKOYAMA, S. (2010): Multi-Layer Cluster Analysis. Proceedings of the 28th Annual Meeting of the Japanese Classification Society, pp. 11–12. (In Japanese)

Keywords CLUSTER ANALYSIS, HIERARCHICAL, LAYER, NON-HIERARCHICAL, PARTITION


The Classification of Mutual Funds Based on the Management Style – Quantile Regression Approach Agnieszka Orwat-Acedanska1 and Grazyna Trzpiot2 1 2

University of Economics in Katowice [email protected] University of Economics in Katowice [email protected]

Abstract. We present modelling of conditional quantiles of mutual fund yields as a function of risk factors. The factors are utilized through Sharpe style analysis. Factor style analysis aims at attributing the fund’s rate of return to rates of return from indices representing the fund’s investments in some asset classes. The main aim of the paper is classification of the funds according to estimated style shares for different parts of the conditional quantile distribution of returns. In the paper we extend Sharpe style analysis to quantile style analysis. It employs multiple quantile regression with additional parameter restrictions. Contrary to the classical approach, quantile regression need not assume any distribution for the error terms. Therefore the method is robust to deviations from the classical assumptions necessary for the Sharpe style analysis. This virtue is important when fat tails or asymmetries are present in the data. Quantile style analysis allows the dependence between funds’ returns and risk factors to be investigated along the whole conditional distribution of the funds’ returns. We use hierarchical methods of classification. Classification results show significant heterogeneity of the style exposures for quantiles of different orders.
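A sketch of constrained quantile (pinball-loss) regression of fund returns on benchmark index returns with the usual style-analysis restrictions (nonnegative weights summing to one). The SLSQP optimiser and the simulated returns are choices made for this illustration, not necessarily the estimator used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def pinball(u, tau):
    """Quantile (pinball) loss for residuals u at quantile level tau."""
    return np.where(u >= 0, tau * u, (tau - 1) * u)

def quantile_style_weights(fund, indices, tau):
    """Minimise the pinball loss of fund - indices @ w subject to
    w >= 0 and sum(w) = 1 (Sharpe-style restrictions)."""
    k = indices.shape[1]
    obj = lambda w: pinball(fund - indices @ w, tau).sum()
    cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
    res = minimize(obj, np.full(k, 1.0 / k), method="SLSQP",
                   bounds=[(0.0, 1.0)] * k, constraints=cons)
    return res.x

# toy data: returns of 3 benchmark indices and one fund
rng = np.random.default_rng(5)
R = rng.normal(0, 0.01, size=(250, 3))
fund = R @ np.array([0.6, 0.3, 0.1]) + rng.normal(0, 0.002, 250)
for tau in (0.05, 0.5, 0.95):
    print(tau, quantile_style_weights(fund, R, tau).round(2))
```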

References KOENKER, R. and Ng P., (2005): Inequality Constrained Quantile Regression. The Indian Journal of Statistics, 67, 418–440. ORWAT A., (2011): The classification of Polish mutual balanced funds according to the management style using Andrews estimators. Studia Ekonomiczne, in print. SHARPE W. F., (1992): Asset Allocation Management Style and Performance Measurement. Journal of Portfolio Management, 18(2), 7-19. TRZPIOT G., (2009): Estimation methods for quantile regression. Studia Ekonomiczne, 53, 81-90.

Keywords QUANTILE STYLE ANALYSIS, MUTUAL FUNDS, HIERARCHICAL METHODS


Preserving asymmetry of distance data in the clustering setting Jan W. Owsinski Systems Research Institute, Polish Academy of Sciences, Newelska 6, 01-447 Warsaw, Poland [email protected] Abstract. Asymmetric distances or proximities arise in, e.g., trade and commuter flows, communication or urban traffic. In clustering, which by its nature is symmetric with respect to individual objects, one can hardly preserve information on the original asymmetry, even if asymmetry is unimportant for many clustering approaches. Recent work on clustering for asymmetric distances (Saito and Yadohisa, 2005) refers to k-medoids, as distance/proximity asymmetry often involves asymmetry of positions (towns in the settlement system, people in communication networks). Owsinski (2009) proposed a hierarchical k-medoid-type structure, extended by fuzzification of memberships in clusters, with an appropriate distance definition. The shape of the structure (e.g. number of levels) is determined here with the objective function from Owsinski (1990). The triples of objects and their distances to other objects are considered consecutively, yielding the assignment of candidate membership values to clusters, labelled by individual objects, which are then modified from a global viewpoint. The intended application is to analyse web-based networks of not too high density and dimensions, where the algorithm, despite its low computational efficiency, can be applied. The paper also outlines ways to enhance computational efficiency.

References OWSIŃSKI, J. W. (1990): On a new naturally indexed quick clustering method with a global objective function. Applied Stochastic Models and Data Analysis, 6, 157–171. OWSIŃSKI, J.W. (2009): Asymmetric distances - a natural case for fuzzy clustering? In: D. A. Viattchenin, ed., Developments in Fuzzy Clustering. Vever, Minsk (Belarus), 36–45. SAITO, T. and YADOHISA, H. (2005): Data Analysis of Asymmetric Structures. Recent Development of Computational Statistics. Marcel Dekker, New York.


Intrablocks Correspondence Analysis Campo Elías Pardo1 and Jorge Eduardo Ortiz2 1

2

Departamento de Estad´ıstica. Universidad Nacional de Colombia. Bogot´ a [email protected] Facultad de Estad´ıstica. Universidad Santo Tom´ as. Bogot´ a [email protected]

Abstract. Contingency tables with double partition structures on the columns and on the rows may be analysed by Internal Correspondence Analysis (ICA) (Cazes et al., 1988). Bécue et al. (2005) introduced the superimposed representations in a Multiple Factor Analysis way. By applying the same methodology as Escofier (1984), we propose a new method named Intrablocks Correspondence Analysis (IBCA), defined as the Correspondence Analysis of the contingency table with respect to the intrablocks independence model. Furthermore, we introduce variable dilations of the partial points in the superimposed representations. In the superimposed representations, IBCA has an important advantage over ICA: the partial points corresponding to columns or rows of zeros within a block are located at the origin. The variable dilations are preferable to a constant one because the partial points belonging to low-weight bands are highlighted, and the more heavily weighted partial points are closer to their global points. An application to Spanish mortality data shows the contribution of IBCA.

References BECUE, M., PAGES, J. and PARDO, C.E. (2005): Contingency table with a double partition on rows and columns. Visualisation and comparison of the partial and global structures. In: Proceedings ASMDA 2005 (Jacques Janssen and Philippe Lenca, Eds.): Applied Stochastic Models and Data Analysis. Brest, France, 355–364. CAZES, P., CHESSEL, D. and DOLEDEC, S. (1988): L’analyse des correspondances internes d’un tableau partitionné. Son usage en hydrobiologie. Revue de Statistique Appliquée, 36(1), 39–54. ESCOFIER, B. (1984): Analyse factorielle en référence à un modèle. Application à l’analyse de tableaux d’échanges. Revue de Statistique Appliquée, 32(4), 25–36.

Keywords MULTIWAY CONTINGENCY TABLES, INTERNAL CORRESPONDENCE ANALYSIS, MULTIPLE FACTOR ANALYSIS.


Comparison of Some Chosen Tests of Independence of Value-at-Risk Violations Krzysztof Piontek Department of Financial Investments and Risk Management Wroclaw University of Economics, ul. Komandorska 118/120, Wroclaw, Poland [email protected] Abstract. Backtesting is the necessary statistical procedure to evaluate the performance of Value-at-Risk models. A satisfactory test should be able to detect both deviations from the correct probability of violations and the clustering of violations. Many researchers and practitioners underline the importance of the lack of any dependence in the hit series over time. If the independence condition is not met, it may be a signal that the respective VaR model reacts too slowly to changes in the market. If the violation sequence exhibits a dependence other than first-order Markov dependence, the classical test of Christoffersen would fail to detect it. This article presents two chosen tests having power against more general forms of dependence. Both of them are, however, based on the same set of information as the Christoffersen test, i.e. the hit series. The first approach is based on durations and hazard functions, and the second on durations and a GMM approach. The aim of this article is to analyze the presented backtesting methodologies, focusing on the aspect of limited data sets and the power of the tests. Simulated data representing asset returns are used here. The power analysis is based on the Dufour Monte Carlo technique. The presented results indicate that some tests are not adequate for small samples, even with 1000 observations. Finally, the obtained results are summarized and some hints for optimal backtesting are given. This paper is a continuation of earlier research done by the author.
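For reference, the classical Christoffersen independence test on the hit series, which the abstract takes as its baseline, can be sketched as follows (the duration-based and GMM tests studied in the paper are not shown; edge cases with empty transition counts are only loosely guarded).

```python
import numpy as np
from scipy.stats import chi2

def christoffersen_independence(hits):
    """LR test of first-order Markov dependence in a 0/1 VaR hit series."""
    hits = np.asarray(hits, dtype=int)
    prev, curr = hits[:-1], hits[1:]
    n00 = np.sum((prev == 0) & (curr == 0))
    n01 = np.sum((prev == 0) & (curr == 1))
    n10 = np.sum((prev == 1) & (curr == 0))
    n11 = np.sum((prev == 1) & (curr == 1))
    pi01 = n01 / (n00 + n01)
    pi11 = n11 / (n10 + n11) if (n10 + n11) > 0 else 0.0
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)

    def ll(p, a, b):   # a transitions to "no hit", b transitions to "hit"
        return ((a * np.log(1 - p) if a > 0 else 0.0)
                + (b * np.log(p) if b > 0 else 0.0))

    l0 = ll(pi, n00 + n10, n01 + n11)          # independence (single prob.)
    l1 = ll(pi01, n00, n01) + ll(pi11, n10, n11)   # first-order Markov chain
    lr = -2.0 * (l0 - l1)
    return lr, 1.0 - chi2.cdf(lr, df=1)        # statistic and p-value

# toy hit series with clustered violations
hits = np.r_[np.zeros(200, int), np.ones(5, int), np.zeros(200, int), np.ones(5, int)]
print(christoffersen_independence(hits))
```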

References CAMPBELL S (2005): A Review of Backtesting and Backtesting Procedures. Federal Reserve Board. Washington HAAS M (2005): Improved duration-based backtesting of value-at-risk, Journal of Risk, Vol. 8, No. 2, pp. 17-38 CHRISTOFFERSEN P, PELLETIER D (2004): Backtesting Value-at-Risk: A Duration-Based Approach, Journal of Financial Econometrics, 2 , pp. 84-108

Keywords RISK MEASUREMENT, VaR, BACKTESTING, POWER OF TESTS


Putting MAP back on the map Patrick Pletscher1 , Sebastian Nowozin, Pushmeet Kohli, and Carsten Rother ETH Zurich Abstract. Conditional Random Fields (CRFs) are popular models in computer vision for solving labeling problems such as image denoising. This paper tackles the rarely addressed but important problem of learning the full form of the potential functions of pairwise CRFs. We examine two popular learning techniques, maximum likelihood estimation and maximum-margin training. The main focus of the paper is on models, such as pairwise CRFs, that are simplistic (misspecified) and do not fit the data well. We empirically demonstrate that for misspecified models maximum-margin training with MAP prediction is superior to maximum likelihood estimation with any other prediction method. Additionally we examine the common belief that MLE is better at producing predictions matching image statistics.


Efficient and Robust Shape Matching for Model Based Human Motion Capture Gerard Pons-Moll1 , Laura Leal-Taixe, Tri Truong, and Bodo Rosenhahn Leibniz University Hannover Abstract. In this paper we present a robust and efficient shape matching approach for marker-less motion capture. Extracted features such as the contour, gradient orientations and the turning function of the shape are embedded in a 1-D string. We formulate shape matching as a Linear Assignment Problem and propose to use Dynamic Time Warping on the string representation of shapes to discard unlikely correspondences and thereby to reduce ambiguities and spurious local minima. Furthermore, the proposed cost matrix pruning results in robustness to scaling, rotation and topological changes and allows the computational cost to be greatly reduced. We show that our approach can track fast human motions where standard articulated Iterative Closest Point algorithms fail.


Fusion of Audio- and Visual Cues for Real-Life Emotional Human Robot Interaction Ahmad Rabie1 and Uwe Handmann Hochschule Ruhr-West Abstract. Recognition of emotions from multi-modal cues is of basic interest for the design of many adaptive interfaces in human-machine interaction (HMI) in general and human-robot interaction (HRI) in particular. It provides a means to incorporate non-verbal feedback in the course of interaction. Humans express their emotional and affective state rather unconsciously, exploiting their different natural communication modalities such as body language, facial expression and prosodic intonation. In order to achieve applicability in realistic HRI settings, we develop person-independent affective models. In this paper, we present a study on multi-modal recognition of emotions from such auditive and visual cues for interaction interfaces. We recognise six classes of basic emotions plus the neutral one of talking persons. The focus hereby lies on the simultaneous online visual and acoustic analysis of speaking faces. A probabilistic decision-level fusion scheme based on Bayesian networks is applied to draw benefit from the complementary information of both the acoustic and the visual cues. We compare the performance of our state-of-the-art recognition systems for the separate modalities to the improved results after applying our fusion scheme on both the DaFEx database and real-life data captured directly from the robot. We furthermore discuss the results with regard to the theoretical background and future applications.
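A much simplified sketch of decision-level fusion, assuming conditional independence of the two modalities given the emotion class (a naive-Bayes special case of Bayesian-network fusion); the class posteriors and prior below are invented numbers.

```python
import numpy as np

def fuse_posteriors(p_audio, p_visual, prior):
    """p(c | a, v) proportional to p(c|a) p(c|v) / p(c), assuming the audio
    and visual cues are conditionally independent given the class c."""
    joint = p_audio * p_visual / prior
    return joint / joint.sum()

classes = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]
prior    = np.full(7, 1 / 7)
p_audio  = np.array([0.30, 0.05, 0.05, 0.25, 0.10, 0.15, 0.10])
p_visual = np.array([0.40, 0.05, 0.05, 0.10, 0.10, 0.20, 0.10])
fused = fuse_posteriors(p_audio, p_visual, prior)
print(classes[int(np.argmax(fused))], fused.round(3))
```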


Steerable Deconvolution - Feature Detection as an Inverse Problem Marco Reisert1 and Henrik Skibbe University Medical Center, Freiburg Abstract. Steerable filters are a common tool for feature detection in early vision. Typically, a steerable filter is used as a matched filter by rotating a template to achieve the highest correlation value. We propose to use the steerable filter bank in a different way: it is interpreted as a model of the image formation process. The filter maps a hidden 'orientation' image onto an observed intensity image. The goal is to estimate the hidden image from the given observation. As the problem is highly under-determined, prior knowledge has to be included. A simple and effective regularizer which can be applied to edge, line and surface detection is used. Further, an efficient implementation in terms of Circular Harmonics in conjunction with the iterated use of local neighborhood operators is presented. It is also shown that simultaneous modeling of different low-level features can improve the detection performance. Experiments show that our approach outperforms other existing methods for low-level feature detection.


Computational Prediction of High-Level Descriptors of Music Personal Categories Guenther Roetter1 , Igor Vatolkin2 , and Claus Weihs3 1

2 3

Institute for Music and Music Science, TU Dortmund [email protected] Chair of Algorithm Engineering, TU Dortmund [email protected] Chair of Computational Statistics, TU Dortmund [email protected]

Abstract. Digital music collections are often organized by genre relationships or personal preferences. The goal of automatic classification systems is to support music management while limiting the listener's effort for labeling large numbers of songs (Ahrendt (2006)). Many state-of-the-art methods categorize based on low-level audio features like spectral and time-domain characteristics, chroma etc. However, the impact of these features is very hard to understand; if the listener labels some music pieces as belonging to a certain category, this decision is in fact motivated by instrumentation, harmony, vocals, rhythm and further high-level descriptors from music theory. So it could be more reasonable to understand a classification model created from such intuitively interpretable features. For our study we created a set of personal music categories from different test persons, where each category was defined by only five selected prototype songs. Then music experts were asked to write down the high-level characteristics of the songs (vocal alignment, tempo, key etc.). In the final step we created classification models which predict these characteristics from the large set of low-level audio features available in the AMUSE framework (Vatolkin et al. (2010)). The capability of this set to classify the expert descriptors is investigated in detail.

References AHRENDT, P. (2006): Music Genre Classification Systems - A Computational Approach. PhD thesis, Technical University of Denmark, Informatics and Mathematical Modelling, Lyngby. VATOLKIN, I., THEIMER, W. and BOTTECK, M. (2010): AMUSE (Advanced MUSic Explorer) - A Multitool Framework for Music Data Analysis. In: Proc. of 11th Int’l Society for Mus. Inform. Retr. Conf. (ISMIR), Utrecht, pp. 33-38.

Keywords MUSIC CLASSIFICATION, HIGH-LEVEL FEATURES


Multivariate Analysis of Dividend Payout of German Prime Standard Issuers Joachim Rojahn1 and Karsten Luebke2 1

2

DIPS Deutsches Institut für Portfolio-Strategien gGmbH, Leimkugelstraße 6, 45141 Essen, Germany [email protected] FOM Hochschule für Oekonomie und Management, c/o B1st software factory, Rheinlanddamm 201, 44139 Dortmund, Germany [email protected]

Abstract. Stock investors are generally interested in whether a certain stock will pay a dividend and, if so, what amount of dividend is paid. This is all the more important as up to 30% of a stockholder's return is due to dividends. Therefore, there is a need for research into which factors influence the dividend payout. Redding (1997) analyzed the dividend payouts of U.S. corporations. It turned out that liquid companies and large companies are significantly more likely to pay dividends. Fama and French (2001) also found a positive effect of firm size, together with a positive effect of profitability and a negative effect of investment opportunities, in the U.S. market. Also for this market, Aivazian et al. (2006) showed that firms that regularly access public debt (bond) markets are more likely to pay a dividend. Using Bloomberg Terminal data, we analyzed the dividend payout policy of German Prime Standard issuers during the years 2005-2010. By multivariate modeling of the probability of a dividend payout with the help of a probit regression, we were able to find the influencing factors within the German stock market.
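A sketch of the kind of probit model described, using statsmodels; the covariates and the simulated firm-year data are placeholders standing in for the Bloomberg variables.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# placeholder firm-year data (the real covariates come from Bloomberg)
rng = np.random.default_rng(6)
n = 500
df = pd.DataFrame({
    "log_size":       rng.normal(6, 1.5, n),        # firm size proxy
    "profitability":  rng.normal(0.05, 0.08, n),
    "investment_opp": rng.normal(0.10, 0.05, n),    # e.g. market-to-book proxy
})
eta = -4 + 0.6 * df["log_size"] + 5 * df["profitability"] - 3 * df["investment_opp"]
df["pays_dividend"] = (eta + rng.normal(size=n) > 0).astype(int)

# probit regression of the payout probability on the firm characteristics
X = sm.add_constant(df[["log_size", "profitability", "investment_opp"]])
probit = sm.Probit(df["pays_dividend"], X).fit(disp=0)
print(probit.summary())
print(probit.get_margeff().summary())   # average marginal effects
```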

References AIVAZIAN, V.A., BOOTH, L., CLEARY, S. (2006): Dividend Smoothing and Debt Ratings. Journal of Financial and Quantitative Analysis, 41 (2),439-453. FAMA, E.F., FRENCH, K.R. (2001): Disappearing Dividends: Changing Firm Characteristics or Lower Propensity to Pay? Journal of Financial Economics, 60, 3-43. REDDING, L.S. (1997): Firm Size and Dividend Payouts. Journal of Financial Intermediation, 6(3), 224-248.

Keywords PRIME STANDARD CORPORATIONS, DIVIDEND PAYOUTS, PROBIT REGRESSION


Using some chosen methods of systemic risk analysis in stock portfolio stress testing Pawel Rokita Department of Financial Investments and Risk Management Wroclaw University of Economics, ul. Komandorska 118/120, Wroclaw, Poland [email protected] Abstract. Complementing Value at Risk or related measures with stress testing belongs to the best practices of risk management and, for banks, is moreover an obligation imposed by the Basel accord. The latter emphasizes the need to perform stress tests with respect to all addressed types of risk. This article concentrates on market risk and limits its scope to the risk of stock portfolios. It was observed during the crisis of 2008-2010 that the approaches to stress testing that had been used so far were often insufficient and did not generate appropriate warnings. The main objections to traditional methods are: deficiencies in their forward-looking nature, a tendency to neglect changes of the dependence structure in times of bubbles and crashes, and the failure to incorporate shock transmission mechanisms into the models used. This paper is aimed at discussing, testing and comparing some proposals for applying recent achievements in systemic risk management, system stability analysis and complex systems to generate plausible but reasonably severe scenarios for stress testing.

References BASEL COMMITTEE ON BANKING SUPERVISION (2009): Principles for sound stress testing practices and supervision. (http://www.bis.org/publ/bcbs155.pdf. Cited 30 Mar 2011) EUROPEAN CENTRAL BANK (2010): Financial Stability Review, December. (http://www.ecb.int. Cited 23 Mar 2011) KALI, R. and REYES, J. (2009): Financial Contagion on the International Trade Network. Economic Inquiry, Vol. 48, No 4, 1072–1101. MALEVERGNE, Y. and SORNETTE, D. (2006): Extreme Financial Risks. Springer, Berlin Heidelberg. SORNETTE, D. (2003): Why Stock Markets Crash. Princeton University Press, Princeton, NJ.

Keywords MARKET RISK, STRESS TESTING, FINANCIAL STABILITY, SYSTEMIC RISK


Knowledge creation in research and development entities in Poland and the other European Union Member States K. Romaniuk UNIVERSITY OF WARMIA AND MAZURY IN OLSZTYN, Poland [email protected] Abstract. Economic scientists refer to the modern period of economic development as the era of the knowledge-based economy and the information society. In recent years, knowledge has been treated as a classical resource next to land, capital and labour. Such an approach to knowledge allows a competitive advantage to be built through the generation of new knowledge (innovation), which is reflected in the actions undertaken. Moreover, the speed of knowledge retrieval, creation and processing, as well as the skill of its application, are the key factors of success. The aim of this paper is to outline the situation in knowledge creation in research and development entities in Poland and the other Member States. The studied objects were organised using linear ordering with respect to a model object.

Keywords knowledge-based economy, knowledge development, research and development entities, linear order.


Exploratory analysis of innovation Dominik Antoni Rozkrut12 1 2

Statistical Office in Szczecin, ul. Matejki 22, Szczecin [email protected] University of Szczecin [email protected]

Abstract. This paper presents a short discussion of the possible range of new applications of exploratory techniques in industrial innovation analysis. Since innovation is a multidimensional process, the application of exploratory data analysis may give additional insight into its nature. Innovation plays an important role in shaping the growth and competitiveness of firms. Therefore, appropriate indicators that can capture different aspects of innovation are crucial from the point of view of policy-making and policy evaluation. While international organizations more and more heavily stress the need for improved metrics of innovation, classical indicators based on the results of innovation surveys, constructed using single variables such as the "innovation rate", are of limited information capacity. The methods discussed here may be used to reveal hidden innovation-related patterns across firms, thus leading to better metrics. The discussion is illustrated with two specific examples: an analysis of innovation modes, and the differentiation of innovation strategy across regions. The applications are based on data from the results of a large-scale innovation survey. The analysis is preceded by a short literature review. Besides factor analysis and clustering methods, which are applied as statistical techniques to analyze the data, correspondence analysis is also used to depict associations between variables and to plot a perception map, which allows for a visual inspection of the underlying patterns and enables effective interpretation.

References ARUNDEL, A. et al. (2007): How Europe’s Economies Learn: A Comparison of Work Organization and Innovation Mode for the EU-15. Industrial and Corporate Change, Vol. 16, Number 6. DE JONG, J.P.J and MARSILI O. (2006): The Fruit Flies of Innovation: A Taxonomy of Innovative Small Firms. Research Policy, Vol. 35, Issue 2. HOLLENSTEIN, H. (2003): Innovation Modes in the Swiss Service Sector: A Cluster Analysis based on Firm level data. Research Policy, Vol. 32, Issue 5. OECD/Eurostat (2005): Oslo Manual - Proposed Guidelines for Collecting and Interpreting Innovation Data. 3rd edition, OECD, Paris.

Keywords INNOVATION INDICATORS, INNOVATION MODES, FACTOR ANALYSIS, CLUSTERING, CORRESPONDENCE ANALYSIS


Comparison of spectral clustering and cluster ensembles stability Dorota Rozmus Department of Statistics, Katowice University of Economics, Bogucicka 14, 40-226 Katowice [email protected] Abstract. High accuracy of the results is a very important requirement in any grouping (clustering) problem. It determines the effectiveness of the decisions based on them. Therefore, methods and solutions have been proposed in the literature whose main aim is to give more accurate results than traditional clustering algorithms (e.g. k-means or hierarchical methods). Examples of such solutions are cluster ensembles and spectral clustering algorithms. A desirable quality of any clustering algorithm is also the stability of the method with respect to small perturbations of the data (e.g. data subsampling, small variations in the feature values) or the parameters of the algorithm (e.g. random initialization). Empirical results have shown that cluster ensembles are more stable than traditional clustering algorithms. Here, we carry out an experimental study to compare the stability of spectral clustering and cluster ensembles.
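One generic way to quantify stability, sketched below, is to cluster pairs of random subsamples and compute the adjusted Rand index on the points the subsamples share; here it is applied to spectral clustering and to plain k-means as a simple stand-in (the ensemble construction itself is not shown).

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

def stability(cluster_fn, X, n_pairs=20, frac=0.8, seed=0):
    """Mean ARI between labelings of two random subsamples,
    compared on the points the subsamples have in common."""
    rng = np.random.default_rng(seed)
    n, scores = len(X), []
    for _ in range(n_pairs):
        a = rng.choice(n, int(frac * n), replace=False)
        b = rng.choice(n, int(frac * n), replace=False)
        la = dict(zip(a, cluster_fn(X[a])))
        lb = dict(zip(b, cluster_fn(X[b])))
        common = np.intersect1d(a, b)
        scores.append(adjusted_rand_score([la[i] for i in common],
                                          [lb[i] for i in common]))
    return float(np.mean(scores))

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.5, random_state=0)
km = lambda Z: KMeans(4, n_init=10, random_state=0).fit_predict(Z)
sp = lambda Z: SpectralClustering(4, affinity="nearest_neighbors",
                                  random_state=0).fit_predict(Z)
print("k-means :", stability(km, X))
print("spectral:", stability(sp, X))
```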

References DUDOIT S., FRIDLYAND J. (2003): Bagging to Improve the Accuracy of a Clustering Procedure. Bioinformatics, Vol. 19, No. 9, 1090-1099. FRED A., JAIN A. K. (2002): Data clustering using evidence accumulation. Proceedings of the 16th International Conference on Pattern Recognition, ICPR, Canada, 276-280. KUNCHEVA L.I, VETROV D.P. (2006): Evaluation of Stability of k-Means Cluster Ensembles with Respect to Random Initialization. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, No. 11, 1798–1808. LEISCH F. (1999): Bagged Clustering. Adaptive Information Systems and Modeling in Economics and Management Science, Working Paper 51, SFB. NG A.Y., JORDAN M.I., WEISS Y. (2001): On Spectral Clustering: Analysis and an Algorithm. Advances in Neural Information Processing Systems.

Keywords CLUSTER ANALYSIS, CLUSTER ENSEMBLE, SPECTRAL CLUSTERING, STABILITY.


Correspondence Mining for the Identification of Relationships in Product Reviews Mayra Ruano Backcountry Corp., [email protected] Abstract. This article studies product reviews by customers which are published at the e-commerce website Backcountry.com. The analysis leverages the existing natural language processing framework called "General Architecture for Text Engineering" (GATE) and applies Correspondence Analysis to custom contingency tables. These contingency tables are deduced from the customers' comments. For this purpose, GATE is used as a filtering tool to select appropriate words by means of specific grammatical rules or regular expressions based on parts of speech. Two case studies are presented. The first case study consists of identifying relationships between adjectives from customer reviews and their corresponding products. The second case study analyzes text patterns to look for relationships between products and users' perceptions regarding product size. A visual representation of customers' perceptions is obtained, providing a better understanding of the relationship between the information derived from reviews and specific products.
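A sketch of plain correspondence analysis applied to an adjective-by-product contingency table via the SVD of standardised residuals; the GATE-based extraction of such a table from review text is not shown, and the counts below are made up.

```python
import numpy as np

def correspondence_analysis(N):
    """Row and column principal coordinates of a contingency table N."""
    P = N / N.sum()                              # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)          # row and column masses
    S = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = np.diag(r ** -0.5) @ U * sv           # row principal coordinates
    cols = np.diag(c ** -0.5) @ Vt.T * sv        # column principal coordinates
    return rows, cols, sv ** 2                   # plus principal inertias

# made-up adjective x product counts extracted from reviews
adjectives = ["warm", "light", "durable", "tight"]
products   = ["jacket_A", "jacket_B", "boot_C"]
N = np.array([[40, 22,  5],
              [30, 35,  2],
              [ 8, 10, 25],
              [ 4,  6, 30]], dtype=float)
rows, cols, inertias = correspondence_analysis(N)
print(rows[:, :2].round(2))
print(cols[:, :2].round(2))
```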

References BAEZA-YATES, R.; RIBEIRO-NETO, B. (1999): Modern Information Retrieval. Addison Wesley–ACM Press, New York. COLLOBERT, R.; WESTON, J. (2008): A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In: W. Cohen, A. McCallum and S. Roweis (Eds.): Proceedings of the 25th International Conference on Machine Learning, Helsinki, 160–167. CUNNINGHAM, H. (2000): Software Architecture for Language Engineering. Ph.D. thesis, Department of Computer Science, University of Sheffield. GREENACRE, M. (1984): Theory and Applications of Correspondence Analysis. Academic Press, London. PORTER, M.F. (1980): An Algorithm for Suffix Stripping. Program, 14, 130–137.

Keywords CORRESPONDENCE ANALYSIS, DATA MINING, MARKETING, VISUALIZATION

A Comparison of Latent Class Analysis With and Without the Feature Saliency Concept Susanne Rumstadt and Daniel Baier Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany {susanne.rumstadt,daniel.baier}@tu-cottbus.de Abstract. The selection, the scaling, and the weighting of features play decisive roles in latent class analysis. The inclusion of additional features as well as the distribution of their possible values affects the grouping; some features lead to better groupings, others to worse ones. In order to detect well-suited features, the concept of feature saliency has been proposed in image processing as well as in other application fields. This paper compares latent class analysis with and without the feature saliency concept using real and simulated data. A Monte Carlo simulation is used to investigate the conditions under which the feature saliency concept is helpful and under which it is not. Key words: Latent Class Analysis, EM-Algorithm, Feature Saliency, Comparing Classifications, Clustering
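For reference, the mixture likelihood behind the feature saliency concept can be written (a sketch following Law et al. (2004); the notation is generic and not necessarily the exact parameterisation compared in this paper) as

\[
p(\mathbf{x} \mid \boldsymbol{\theta}) \;=\; \sum_{k=1}^{K} \alpha_k \prod_{l=1}^{L} \Bigl[ \rho_l \, p(x_l \mid \theta_{kl}) + (1-\rho_l)\, q(x_l \mid \lambda_l) \Bigr],
\]

where \(\rho_l \in [0,1]\) is the saliency of feature \(l\), \(p(\cdot \mid \theta_{kl})\) its component-specific density and \(q(\cdot \mid \lambda_l)\) a common density shared by all components for an irrelevant feature; all parameters, including the saliencies, can be estimated with the EM algorithm.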

References LAW, M.H.C., FIGUEIREDO, M.A.T., and JAIN, A.K. (2004): Simultaneous Feature Selection and Clustering Using Mixture Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 1154-1166. HUBERT, L. and ARABIE, P. (1985): Comparing Partitions. Journal of Classification, 2, 193-218. MCLACHLAN, G.J. and PEEL, D. (2000): Finite Mixture Models. Wiley Series in Probability and Statistics, John Wiley & Sons, New York.

Linear logistic models with relaxed assumptions in R: Implementation and application Thomas Rusch, Marco Maier and Reinhold Hatzinger
Institute for Statistics and Mathematics, WU (Vienna University of Economics and Business), Austria
[email protected], [email protected], [email protected]

Abstract. Linear logistic models with relaxed assumptions (LLRA) are a flexible tool for item-based measurement of change or multidimensional Rasch models. Their key feature is to allow for multidimensional items and mutual dependencies of items, as well as to impose no assumptions on the distribution of the latent trait in the population. Inference for such models becomes possible within a framework of conditional maximum likelihood (CML) estimation. The R package eRm provides the computational infrastructure for CML estimation of Rasch-type models. In this talk we will show how the provided functionality can be used to estimate LLRAs with any number of time points, treatment groups and covariates. Furthermore, we will illustrate the use of LLRAs in eRm with a large-scale example (n=781) stemming from an introductory course on accounting at WU.

References FISCHER G. H. (1995). Linear logistic models for change. In: G.H. Fischer and I. W. Molenaar (Eds.), Rasch models: Foundations, Recent developments and Applications, Springer, New York, 157–818. MAIR, P. and HATZINGER, R. (2007). Extended Rasch Modeling: The eRm Package for the Application of IRT Models in R. Journal of Statistical Software, 20, 1–20. HATZINGER, R. and RUSCH, T. (2009). IRT models with relaxed assumptions in eRm: A manual-like instruction. Psychology Science Quarterly, 51, 87–120.

Keywords LLRA, RASCH-MODELS, REPEATED MEASUREMENTS, ERM, LARGE-SCALE EDUCATIONAL TESTING

Targeting Voters with Logistic Regression Trees Thomas Rusch, Kurt Hornik, Wolfgang Jank, Ilro Lee and Achim Zeileis
Institute for Statistics and Mathematics, WU (Vienna University of Economics and Business), Austria, [email protected]
Institute for Statistics and Mathematics, WU, Austria, [email protected]
Department of Decisions, Operations & Information Technologies, The Robert H. Smith School of Business, University of Maryland, USA, [email protected]
School of Organisation and Management, Australian School of Business, University of New South Wales, Australia, [email protected]
Department of Statistics, University of Innsbruck, Austria, [email protected]

Abstract. Voter targeting is done by political campaigns to identify and influence voters. Effort is directed either at mobilizing for a party/candidate or at increasing turnout. Campaigns use procedures like CHAID or logistic regression to decide whom to target. While the data available are usually extremely rich, campaigns have relied on a limited selection of predictors, e.g. previous voting behavior and demographic variables. In this talk we propose a novel approach to voter targeting, “Logistic Regression Trees” (LORET). LORETs are trees (which may just be a single root node) containing logistic regressions (which may just have an intercept) in every leaf. Thus, they contain logistic regression and classification trees as special cases and allow for a synthesis of both techniques. We explore various flavors of LORETs that employ (a) either a reduced or a full set of available variables and (b) use these variables as regressors in the logistic model components and/or as partitioning variables in the tree components. The resulting LORET variations are applied to a data set of 19,634 voters from the 2004 US presidential election. We find that employing an extended set of predictor variables clearly improves predictive accuracy, with the best results for classification trees. While leading to slightly worse predictions, LORET models with the reduced set of variables as regressors in each leaf are more parsimonious and more intelligible. Moreover, we find that voter targeting based on the latter leads to a higher potential increase in turnout.
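A rough sketch of the LORET idea, a tree whose leaves contain logistic regressions, approximated here with scikit-learn on synthetic data (the split into partitioning and regressor variables and all names are illustrative; this is not the model-based recursive partitioning implementation used by the authors):

# Tree with logistic-regression leaves (rough LORET-style approximation; illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
part_vars, reg_vars = [0, 1, 2], [3, 4, 5, 6, 7]    # partitioning vs. regressor variables

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X[:, part_vars], y)
leaf_of = tree.apply(X[:, part_vars])               # leaf id for every observation
leaf_models = {}
for leaf in np.unique(leaf_of):
    idx = leaf_of == leaf
    if len(np.unique(y[idx])) > 1:                  # fit a logistic model inside the leaf
        leaf_models[leaf] = LogisticRegression(max_iter=1000).fit(X[idx][:, reg_vars], y[idx])

def predict_proba(Xnew):
    leaves = tree.apply(Xnew[:, part_vars])
    p = np.empty(len(Xnew))
    for i, leaf in enumerate(leaves):
        m = leaf_models.get(leaf)
        p[i] = m.predict_proba(Xnew[i:i+1][:, reg_vars])[0, 1] if m else y.mean()
    return p

print("predicted targeting probabilities:", predict_proba(X[:5]).round(3))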

Keywords VOTER TARGETING, LORET, MODEL-BASED RECURSIVE PARTITIONING, CLASSIFICATION TREES, LOGISTIC REGRESSION

Fundamental portfolio construction based on Mahalanobis distance Anna Rutkowska-Ziarko University of Warmia and Mazury Abstract. In the classical Markowitz model, the portfolio risk is minimized at an assumed profitability level. The fundamental portfolio introduces an additional condition aimed at ensuring that the portfolio is only composed of companies in good economic condition. A synthetic indicator is constructed for each company, describing its economic and financial situation. There are many methods for constructing synthetic measures; this article applies the standard method of linear ordering. In models of fundamental portfolio construction, companies are most often ordered on the basis of the Euclidean distance. Due to possible correlation between economic variables, the most appropriate measure of distance between enterprises is the Mahalanobis distance. The aim of the article is to compare the composition of fundamental portfolios constructed on the basis of the Euclidean distance with portfolios determined using the Mahalanobis distance.
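A short numpy sketch (with invented indicator data) contrasting the Euclidean and Mahalanobis distances between companies described by correlated indicators:

# Euclidean vs. Mahalanobis distance between companies (synthetic indicator data).
import numpy as np

rng = np.random.default_rng(0)
# rows = companies, columns = correlated economic/financial indicators
X = rng.multivariate_normal([0, 0, 0],
                            [[1.0, 0.8, 0.3],
                             [0.8, 1.0, 0.4],
                             [0.3, 0.4, 1.0]], size=20)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

def euclid(a, b):
    return float(np.linalg.norm(a - b))

def mahalanobis(a, b):
    d = a - b
    return float(np.sqrt(d @ S_inv @ d))

a, b = X[0], X[1]
print("Euclidean   :", round(euclid(a, b), 3))
print("Mahalanobis :", round(mahalanobis(a, b), 3))  # accounts for indicator correlations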

Keywords Markowitz model, fundamental portfolio, Mahalanobis distance

Content-Based Indexing and Search in Multimedia Objects H. Sack and J. Waitelonis Abstract. Our cultural memory stores ever larger amounts of information and data, yet only a vanishingly small part of this content can currently be searched and accessed through digital channels. The projects mediaglobe and yovisto make the growing stock of audiovisual documents findable and usable and accompany media archives into the digital future. mediaglobe aims to index audiovisual documents on German contemporary history by means of automated and semantic methods and to make them available. The vision of mediaglobe is web-based access to comprehensive digital AV content in media archives. To this end, mediaglobe offers numerous automated methods for the analysis of audiovisual data, such as structural analysis, text recognition in video, speech analysis and genre analysis. The use of semantic technologies links the results of the AV analysis and improves the results of multimedia search both qualitatively and quantitatively. A rights-management tool provides information on the availability of the content. Innovative and intuitive user interfaces turn access to cultural heritage into an active experience. mediaglobe unites the project partners Hasso-Plattner-Institut für Softwaresystemtechnik (HPI), Medien-Bildungsgesellschaft Babelsberg, FlowWorks and the archive of defa Spektrum. mediaglobe is funded by the Federal Ministry of Economics and Technology within the research programme “THESEUS - Neue Technologien für das Internet der Dienste”. The video search engine yovisto, in contrast, specialises in recordings of academic lectures and implements explorative and semantic search strategies. yovisto supports a multi-stage “explorative” search process in which searchers can explore the holdings of the underlying media archive along manifold paths according to their current interests, so that at the end of this search process they discover information whose existence they previously did not know about. To make this possible, yovisto combines automated semantic media analysis with user-generated metadata for the content-based indexing of AV data and thereby enables precise content-based search in video archives.

Nonsymmetric Correspondence Analysis of Abbreviated Hard Laddering Interviews Adam Sagan and Eugene Kaciak
Department of Market Analysis and Marketing Research, Cracow University of Economics, Rakowicka 27, 31-510 Cracow, Poland [email protected]
Faculty of Business, Brock University, 500 Glenridge Avenue, St. Catharines, Ontario, Canada [email protected]

Abstract. Hard laddering is a semistructured interview technique in the quantitative means-end approach that provides the summary implication matrix (SIM), the basis for developing hierarchical value maps. The SIM is based on pairwise associations between attributes (A), consequences (C) and values (V). However, this approach is time consuming and often provides low-quality ladders. A new method of evaluating the quality of ladders obtained by the abbreviated hard laddering (AHL) procedure (Kaciak, Cullen and Sagan 2010) is proposed. AHL may radically shorten the laddering interview and provides information about ladders in triads of A-C-V structures that form the so-called summary ladder matrix (SLM). The importance of particular A-C-V's in the SLM is measured by top-of-mind awareness indices (TMA). The structure of the SLM is examined and classified using nonsymmetric correspondence analysis of the SLM (Kroonenberg and Lombardo 1999). This permits obtaining additional information on dominant dimensions while preserving the information about A-C-V conditionality in the means-end structures, and also determining which prominent ladders contribute most to the system's inertia.

References KACIAK, E., CULLEN, C. and SAGAN, A. (2010): The quality of ladders generated by abbreviated hard laddering. Journal of Targeting, Measurement and Analysis for Marketing, 18(3/4), 159–166. KROONENBERG, P. and LOMBARDO, R. (1999): Nonsymmetric Correspondence Analysis: A Tool for Analysing Contingency Tables with a Dependence Structure. Multivariate Behavioral Research, 34, 367–396.

Keywords HARD LADDERING, SUMMARY LADDER MATRIX, NONSYMMETRIC CORRESPONDENCE ANALYSIS

Bias Correction in Sentiment Analysis Michael Salter-Townshend and Thomas Brendan Murphy University College Dublin, Ireland [email protected] Abstract. We present a joint model for observer bias and term classification in the context of sentiment analysis of an Irish media dataset (Brew et al 2010). This dataset comprises user annotations of online news articles as having either negative, positive or irrelevant impact on the Irish economy during the 2009 financial crisis. In the simplest model a majority vote of the annotations is applied to determine the “ground truth” sentiment of each news article. A classifier may then be trained on these labelled articles to determine the sentiment contributed by the terms appearing in the articles. This classifier can then be applied to un-annotated articles. However, some users may be biased towards positive (or negative) sentiment. To model such bias in the analysis, we include an estimated sentiment calculated on an appropriately weighted sum of the user supplied annotations. We estimate user bias matrices using an Expectation-Maximisation algorithm which includes estimating the unobserved sentiment in the articles. A classifier is then trained using these expected sentiments (Dawid and Skene 1979); this algorithm is sequential. Instead, we propose a joint estimation of both the user biases and the classifier parameters within a single EM algorithm. We demonstrate the superiority of this joint model and apply it to the Irish media sentiment data, with results that are markedly different from the sequential estimation of the model.

References Brew, A. and Greene, D. and Cunningham, P. (2010): Using Crowdsourcing and Active Learning to Track Sentiment in Online Media. ECAI 2010 - 19th European Conference on Artificial Intelligence, 1–11. Dawid, A.P. and Skene, A.M. (1979): Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20–28, 1979.

Keywords SENTIMENT ANALYSIS, BIAS CORRECTION.

Optimal weighted nearest neighbour classifiers R. Samworth University of Cambridge, UK [email protected] Abstract. Classifiers based on nearest neighbours are perhaps the simplest and most intuitively appealing of all nonparametric classifiers. Arguably the most obvious defect of the k-nearest neighbour classifier is that it places equal weight on the class labels of each of the k nearest neighbours to the point being classified. Intuitively, one would expect improvements in terms of the misclassification rate to be possible by putting decreasing weights on the class labels of the successively more distant neighbours. In this talk, we determine the optimal weighting scheme and quantify the benefits attainable. Notably, the improvements depend only on the dimension of the data, not on the underlying population densities. We also show how the bagged nearest neighbour classifier can be regarded as a weighted nearest neighbour classifier, and compare its performance with both the unweighted and optimally weighted nearest neighbour classifiers.
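A small numpy sketch of a weighted nearest-neighbour classifier in which more distant neighbours receive smaller weights (the linearly decreasing weights below are an illustrative choice, not the optimal scheme derived in the talk):

# Weighted k-nearest-neighbour classification with decreasing weights (illustrative).
import numpy as np

def weighted_knn_predict(Xtr, ytr, x, k=10):
    dist = np.linalg.norm(Xtr - x, axis=1)
    order = np.argsort(dist)[:k]              # indices of the k nearest neighbours
    w = np.linspace(1.0, 0.1, k)              # decreasing weights (illustrative choice)
    w /= w.sum()
    classes = np.unique(ytr)
    votes = [w[ytr[order] == c].sum() for c in classes]
    return classes[int(np.argmax(votes))]

rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, size=(100, 2))
X1 = rng.normal(1.5, 1.0, size=(100, 2))
Xtr = np.vstack([X0, X1]); ytr = np.array([0] * 100 + [1] * 100)
print(weighted_knn_predict(Xtr, ytr, np.array([1.2, 1.2])))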

“NSW online” - the electronic tool for the “Liste der fachlichen Nachschlagewerke zu den Normdateien (GKD, PND, SWD)” Margit Sandner Universitätsbibliothek Wien, Austria [email protected] Abstract. The “Liste der fachlichen Nachschlagewerke zu den Normdateien” (list of subject reference works for the authority files), with currently more than 1,800 entries, is a binding working instrument for daily practice in the cooperative maintenance of authority data in the German-speaking countries, especially for terminology work in library subject indexing. In every authority record of the subject headings authority file (SWD), reference works from the so-called priority list (the ranking of reference works), and beyond that from the entire NSW list, are recorded in standardised abbreviated form in the field “Quelle” (source) to document and justify the preferred and variant forms of a descriptor. This list, regularly updated by the German National Library (DNB), appears annually in printed form with an update service (changes, new editions, new entries) and has for some years also been available electronically. Nevertheless, it has become somewhat dated, and an improved form of the NSW list has therefore been a long-standing desideratum. The talk reports on the designed functionalities, the technical implementation, the practical use, the current state of development and the potential further development of the new NSW tool. Authority file work is complex and demanding. The practice-oriented presentation in this tool makes it considerably easier to comply with the ranking that is binding for all new headings, which raises the quality of every authority record from the outset. The greatest time saving in daily practice comes from immediate access to linked full texts. In view of the increasing multilateral exchange of data and a dramatic shortage of personnel resources despite a considerable growth in the volume of literature to be indexed, this will probably be the most lasting effect of “NSW online” in the workflow of the recently introduced online editorial process (ONR) for authority data, and it will remain so in the future integrated authority file (Gemeinsame Normdatei, GND).

Dimensions of job characteristics as predictors of job satisfaction and professional satisfaction Silvina Santana, Sandra Loureiro, and José Cerdeira
University of Aveiro - Department of Economy, Management and Industrial Engineering - Campus of Santiago - 3810-193 Aveiro
[email protected], [email protected], [email protected]

Abstract. We present a conceptual model linking job characteristics, satisfaction with the actual job and satisfaction with the profession. The model was estimated using the PLS technique on data from professionals (doctors, nurses, technical specialists, technical assistants and operational assistants) working at ACES Baixo Vouga II, a Primary Care Trust that groups four health centres in the Central Region of Portugal. Perceptions of job characteristics were measured using 72 items, corresponding to nine dimensions found in the literature. Job satisfaction and professional satisfaction were assessed with one item each. Using factor analysis it was possible to find six dimensions of job characteristics: leadership, autonomy, salary, personal development, group and social relations, and work atmosphere. Further analysis using structural equations shows that salary exerts the strongest and most significant impact on job satisfaction and professional satisfaction. Perceived job autonomy (a job that is interesting and contributes to the development of professional knowledge and self-esteem) also has a significant effect on professional satisfaction. The dimensions of job characteristics explain 35 percent of the variance in job satisfaction and 25 percent of the variance in professional satisfaction. The results of this study have implications for researchers and practitioners.

References K.-Y. Lu, et al. (2007): Relationship between professional commitment, job satisfaction, and work stress in public health nurses in Taiwan. Journal of Professional Nursing, 23, 110–116.

Keywords Job characteristics, job satisfaction, satisfaction with the profession, health professional, health center

Identification of Risk Factors in Coronary Bypass Surgery Julia Schiffner, Erhard Godehardt, Stefanie Hillebrand, Alexander Albert, Artur Lichtenberg, and Claus Weihs
Faculty of Statistics, TU Dortmund, 44221 Dortmund, Germany [email protected]
Clinic of Cardiovascular Surgery, Heinrich-Heine University, 40225 Düsseldorf, Germany [email protected]

Abstract. In quality improvement in medical care one important aim is to prevent complications after a surgery and, particularly, to keep the mortality rate as small as possible. Therefore it is of great importance to identify which factors increase the risk of dying in the aftermath of a surgery. Based on data of 1163 patients who underwent an isolated coronary bypass surgery in 2007 or 2008, we selected predictors that affect the in-hospital mortality. A forward search using the wrapper approach in conjunction with simple linear and also more complex classification methods such as gradient boosting and support vector machines is performed. Since the classification problem is highly imbalanced with certainly unequal, but unknown misclassification costs, the area under the ROC curve (AUC) is used as performance criterion for hyperparameter tuning as well as for variable selection. In order to assess the stability of the results and to obtain accurate estimates of the AUC, variable selection is repeated ten times on different subsamples of the data set. It turns out that simple linear classification methods (linear discriminant analysis and logistic regression) are suitable for this problem since the AUC cannot be considerably increased by more complex methods. We identified the three most important predictors to be the severity of cardiac insufficiency, the patient's age as well as pulmonary hypertension. A comparison with full models trained on the same ten subsamples shows that classification performance in terms of AUC is only slightly decreased by variable selection and is actually increased in the case of logistic regression.
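A minimal sketch of a wrapper-style forward selection with AUC as the criterion (scikit-learn on synthetic, imbalanced data; the function forward_select and all settings are illustrative assumptions, not the study's actual protocol):

# Wrapper-style forward variable selection with AUC as criterion (illustrative sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           weights=[0.9, 0.1], random_state=0)   # imbalanced, like mortality data

def forward_select(X, y, max_vars=3):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        aucs = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                   X[:, selected + [j]], y,
                                   scoring="roc_auc", cv=5).mean()
                for j in remaining}
        best = max(aucs, key=aucs.get)
        selected.append(best); remaining.remove(best)
        print("added variable", best, "CV AUC =", round(aucs[best], 3))
    return selected

print("selected predictors:", forward_select(X, y))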

Keywords CORONARY BYPASS SURGERY, MORTALITY, VARIABLE SELECTION, WRAPPER APPROACH

Using User Generated Content for Image Clustering and Market Segmentation Diana Schindler Department of Business Administration and Economics, Bielefeld University, Postbox 100131, 33501 Bielefeld, Germany [email protected] Abstract. The analysis of images for different purposes - particularly image clustering - has been the subject of several research streams in the past. Since the 1990s, query by image content and, somewhat later, content-based image retrieval have been topics of growing scientific interest. By combining information about the images of interest with textual information about similar images from the World Wide Web, Yeh et al. (2004) were among the first to create a hybrid image-and-keyword searching technique. Since the advent of Flickr and other media sharing sites, textual information about images can also be captured from user-generated tags (see, e.g., Kennedy et al. 2007). An important concern in this context is the adequate closing of the corresponding semantic gap (Sigurbjoersson and van Zwol 2008). A literature review shows that research on image analysis has, so far, primarily been related to computer science. Against this background, the present paper investigates options for clustering consumers on the basis of personal image preferences, e.g. for market segmentation purposes.

References KENNEDY, L.; NAAMAN, M.; AHERN, S.; NAIR, R. and RATTENBURY, T. (2007). How Flickr Helps us Make Sense of the World: Context and Content in Community-Contributed Media Collections. In: R. Lienhart, A. R. Prasad, A. Hanjalic, S. Choi, B. P. Bailey, N. Sebe (Eds.): ACM International Conference on Multimedia., Association for Computing Machinery, New York, 631–640. SIGURBJOERSSON B. and VAN ZWOL R. (2008). Flickr Tag Recommendation based on Collective Knowledge. In: J. Huai, R. Chen, H.-W. Hon, Y. Liu, W.Y. Ma, A. Tomkins and X. Zhang (Eds.): International Conference on World Wide Web., Association for Computing Machinery, New York, 327–336. YEH, T.; TOLLMAR, K. and DARREL, T. (2004). Searching the Web with Mobile Images for Location Recognition. In: L. S. Davis, R. Chellappa, A. Bobick, G. Hager, D. Jacobs and Y. Yacoob (Eds.): IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Part Vol.2, IEEE Comput. Soc., Los Alamitos, 76–81.

High Performance Hardware Architectures for Automated Music Classification Ingo Schmädecke, Christian Banz and Holger Blume Institute of Microelectronic Systems, Appelstr. 4, 30167 Hannover {schmaedecke,banz,blume}@ims.uni-hannover.de Abstract. Automated music classification is a very attractive feature for multimedia devices. Today, even portable music playback devices provide storage capacities for huge music collections of several thousand files. Therefore, an enhancement of user comfort by dynamic automated music classification into predefined categories (e.g. music genres) or user-defined categories is an important goal for both portable and static music devices. However, automated music classification based on audio feature extraction is, firstly, extremely computation intensive and, secondly, has to be applied to enormous amounts of data. Thus, energy-efficient high-performance implementations for feature extraction and classification are required. This contribution presents a dedicated hardware architecture for music classification applying typical audio features for discrimination (e.g. spectral centroid, zero crossing rate) and a support vector machine (SVM) with polynomial kernel for classification. For evaluation purposes, the architecture is realized on a Field Programmable Gate Array (FPGA). Further, the same application is also implemented on an off-the-shelf Graphics Processing Unit (GPU). Both implementations are evaluated in terms of processing time and energy efficiency.
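For reference, the two audio features named above can be computed in software as follows (a plain numpy sketch on a synthetic signal; the dedicated hardware architecture itself is of course not reproduced here):

# Reference (software) computation of spectral centroid and zero crossing rate (illustrative).
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive samples with a sign change."""
    return float(np.mean(np.abs(np.diff(np.signbit(frame).astype(int)))))

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency of the frame's spectrum (in Hz)."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return float((freqs * mag).sum() / (mag.sum() + 1e-12))

sr = 22050
t = np.arange(sr) / sr
frame = 0.5 * np.sin(2 * np.pi * 440 * t) + 0.1 * np.sin(2 * np.pi * 3000 * t)
print("ZCR:", round(zero_crossing_rate(frame), 4))
print("spectral centroid [Hz]:", round(spectral_centroid(frame, sr), 1))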

References I. Schmädecke, J. Dürre and H. Blume, Exploration of Audio Features for Music Genre Classification, Proc. of the ProRISC, 2009, 26.-27.11.2009, Veldhoven, Netherlands. H. Blume, M. Haller, M. Botteck and W. Theimer, Perceptual Feature based Music Classification - A DSP Perspective for a New Type of Application, Proc. of the Samos Workshop, 2008, pp. 92-99. W. Theimer, I. Vatolkin and A. Eronen, Definitions of Audio Features for Music Content Description, Algorithm Engineering Report TR08-2-001, Technische Universität Dortmund, 2008.

Keywords FPGA, GPU, Automated Music Classification, Support Vector Machine

Multilinear Model Estimation with L2-Regularization Frank Schmidt, Hanno Ackermann, and Bodo Rosenhahn University of Western Ontario Abstract. Many challenging computer vision problems can be formulated as a multilinear model. Classical methods like principal component analysis use singular value decomposition to infer model parameters. Although this solves the problem easily if all measurements are known, this prerequisite is usually violated in computer vision applications. In this work, a standard tool to estimate singular vectors under incomplete data is reformulated as an energy minimization problem. This admits a simple and fast gradient descent optimization with guaranteed convergence. Furthermore, the energy function is generalized by introducing an L2-regularization on the parameter space. We show a quantitative and qualitative evaluation of the proposed approach with synthetic and real image data, and compare it with prior work.
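A minimal numpy sketch of the underlying idea: gradient descent on a low-rank factorization energy with missing entries and an L2 penalty (synthetic data; the step size and penalty weight are illustrative and this is not the authors' formulation in detail):

# Gradient descent for low-rank factorization with missing entries and L2 regularization (sketch).
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 30, 20, 3
W_true = rng.normal(size=(n, r)); H_true = rng.normal(size=(r, m))
M = W_true @ H_true
mask = rng.random((n, m)) < 0.6          # only part of the measurements is observed

W = rng.normal(scale=0.1, size=(n, r))
H = rng.normal(scale=0.1, size=(r, m))
lam, step = 0.1, 0.01
for it in range(2000):
    R = mask * (W @ H - M)               # residual on observed entries only
    gW = R @ H.T + lam * W               # gradient of 0.5*||R||^2 + 0.5*lam*(||W||^2+||H||^2)
    gH = W.T @ R + lam * H
    W -= step * gW
    H -= step * gH
err = np.linalg.norm(mask * (W @ H - M)) / np.linalg.norm(mask * M)
print("relative reconstruction error on observed entries:", round(float(err), 4))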

Logic Based Conjoint Analysis Using the Commuting Quantum Query Language Ingo Schmitt and Daniel Baier
Institute of Computer Science, Information and Media Technology, BTU Cottbus, Postbox 101344, D-03013 Cottbus, [email protected]
Institute of Business Administration and Economics, BTU Cottbus, Postbox 101344, D-03013 Cottbus, [email protected]

Abstract. Recently, in computer science, the quantum query language (QQL) has been introduced for ranking search results in a database (Schmitt 2008). A user is asked to express search conditions and to rank presented samples; the QQL system matches select-conditions with attribute values. Schmitt's (2008) theory combines concepts from database query processing with concepts from quantum mechanics and quantum logic (von Neumann 1932) and allows retrieval-style search to be incorporated seamlessly into database query processing. The approach can be adapted to the usual conjoint analysis setting (see, e.g., Green and Rao 1971, Baier and Brusch 2009), where respondents are asked to evaluate selected attribute level combinations in order to rank all possible combinations. The approach is presented and applied. The results show that the new approach competes well with traditional alternatives.

References BAIER, D., and BRUSCH, M. (2009): Conjointanalyse: Methoden – Anwendungen – Praxisbeispiele. Springer, Berlin Heidelberg. GREEN, P.E., and RAO, V.R. (1971): Conjoint Measurement for Quantifying Judgemental Data. Journal of Marketing Research, 8, 355-363. VON NEUMANN J. (1932): Grundlagen der Quantenmechanik. Springer, Berlin. SCHMITT, I. (2008): QQL: A DB & IR Query Language. The Very Large Database Journal, 17(1):39–56.

Keywords Conjoint Analysis, Quantum Mechanics, Quantum Logic

ECO-power: A novel method to reveal common mechanisms underlying coupled data Martijn Schouteden, Katrijn Van Deun, and Iven Van Mechelen Quantitative Psychology and Individual Differences, Katholieke Universiteit Leuven, Tiensestraat 102 - bus 3713, 3000 Leuven, Belgium [email protected] Abstract. Often data are collected consisting of different blocks that all contain information about the same entities (e.g., items, persons, biological samples). In order to unveil the mechanisms underlying such data, an integrated analysis of the whole of all data blocks may be most useful. An interesting class of methods for such an approach is the family of methods of simultaneous component analysis. Recently, these methods have been extended so that the mechanisms that are common to all data blocks and the mechanisms that are distinctive for one or a few of them, can be revealed. However, the primary focus of these extended methods is on revealing distinctive mechanisms, with common mechanisms having a more residual-like status. Yet, sometimes, the retrieval of common mechanisms may be of utmost importance. To fulfill this need, I will present a novel method, using ideas underlying power regression. After describing this method, I will illustrate it with data stemming from psychology and systems biology.

Keywords Multiset data, Multiblock data, Component analysis, Data fusion

Combined Head Localization and Head Pose Estimation for Video-based Advanced Driver Assistance Systems Andreas Schulz, Naser Damer, Mika Fischer, and Rainer Stiefelhagen Robert Bosch GmbH Abstract. This work presents a novel approach for pedestrian head localization and head pose estimation in single images. The presented method addresses an environment of low resolution gray-value images taken from a moving camera with large variations in illumination and object appearance. The proposed algorithms are based on normalized detection confidence values of separate, pose-associated classifiers. Those classifiers are trained using a modified one-vs-all framework that tolerates outliers appearing in continuous head pose classes. Experiments on a large set of real-world data show very good head localization and head pose estimation results even on the smallest considered head size of 7x7 pixels. These results can be obtained in a probabilistic form, which makes them of great value for pedestrian path prediction and risk assessment systems within video-based driver assistance systems and many other applications.

Testing for the number of regimes in Markov dependent mixtures (HMMs) Florian Schwaiger, Hajo Holzmann and Joern Dannemann
Fakultät für Mathematik und Informatik, Philipps-Universität Marburg [email protected], [email protected]
RWE AG, Essen, Germany

Abstract. We consider hidden Markov models with a one-dimensional regime-dependent parameter and a d ∈ N dimensional structural parameter. Based on the EM-test (Li and Chen, 2010) for i.i.d. mixtures with a one-dimensional switching parameter and without structural parameters, we propose a likelihood ratio test for the hypothesis k = k0 versus k > k0, where k is the number of regimes of the Markov chain, and derive its asymptotic distribution. As in Holzmann and Dannemann (2008), the test is based on inferences from the stationary mixture model of the HMM. Further, we examine the finite-sample behaviour of the testing procedure in a simulation study. We conclude our analysis with an application to financial log-returns by modeling the time series with an HMM with skew normal distributed regimes. Here, we test the number of regimes and estimate the most likely sequence of regimes with the Viterbi algorithm.
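The Viterbi decoding used in the final step can be sketched as follows (log-domain, generic two-regime example with Gaussian emissions of different volatility; all parameter values are illustrative):

# Log-domain Viterbi algorithm for the most likely regime sequence (illustrative sketch).
import numpy as np
from scipy.stats import norm

def viterbi(log_pi, log_A, log_B):
    """log_pi: (K,) initial log-probs, log_A: (K,K) transition log-probs,
       log_B: (T,K) emission log-likelihoods of the observations under each regime."""
    T, K = log_B.shape
    delta = np.empty((T, K)); psi = np.zeros((T, K), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A          # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(0, 3, 100)])   # toy log-returns
log_B = np.column_stack([norm.logpdf(x, 0, 1), norm.logpdf(x, 0, 3)])
log_A = np.log(np.array([[0.95, 0.05], [0.05, 0.95]]))
path = viterbi(np.log([0.5, 0.5]), log_A, log_B)
print("estimated number of regime switches:", int(np.abs(np.diff(path)).sum()))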

References LI, P. and CHEN, J. (2010): Testing the Order of a Finite Mixture. Journal of the American Statistical Association, 105, 1084–1092. DANNEMANN, J. and HOLZMANN, H. (2008): Testing for two states in a hidden Markov model. Canadian Journal of Statistics, 36: 505–520.

Keywords HIDDEN MARKOV MODELS, LIKELIHOOD RATIO TEST, ASYMPTOTIC DISTRIBUTION, SWITCHING VOLATILITY, SKEW NORMAL DISTRIBUTION, FINANCIAL TIME SERIES

Applying Location Planning Algorithms to Schools: The Case of Special Education in Hesse (Germany) Alexandra Schwarz German Institute for International Educational Research, Center for Research on Educational Governance, Schlossstrasse 29, D-60486 Frankfurt am Main, [email protected] Abstract. Although the education acts of all German federal states give precedence to common, inclusive schooling, children with special educational needs are still predominantly taught in special schools. The United Nations Convention on the Rights of Persons with Disabilities has reinforced the critical discussion of this situation and again raises the question of how inclusive schooling concepts can be implemented in practice. The focus of this paper is not on pedagogical motives, but on quantitative aspects of inclusive concepts, which are analyzed by means of administrative data from the school statistics of Hesse (Germany). Location planning algorithms - well-known instruments for finding optimal locations of distribution centers or stores in logistic networks - are used to analyze the current schooling situation of students with special educational needs and to simulate which regular school they may attend under (more) inclusive conditions. A cost function, which includes varying components (e.g. opportunity costs in terms of ways to school, other variable costs and fixed costs of schooling), is implemented to estimate the economic effects of different supply models for students with special needs. Using administrative data turns out to be a particular challenge: for example, for reasons of data protection the concrete educational need is only provided at the regional level of Hessian municipalities, which made it necessary to estimate the individual educational needs. The results suggest a recipient-oriented model and clearly indicate that especially in rural regions - where schools are closed due to demographic change - inclusive schooling is not (only) a matter of concept, but a question of demand and supply, of the number of all students and of (still) available schools. Location planning methods are powerful instruments to analyze such questions and should therefore become an integral part of educational planning procedures.

Keywords LOCATION PLANNING, SCHOOL LOCATION, SPECIAL SCHOOLS, ECONOMICS OF EDUCATION

Non-Linear Curvature Mapping - A novel approach to morphological classification of neolithic pottery Ilya Shabanov, Klaus-Robert Mueller, and Wolfram Schier
Technische Universität Berlin [email protected]
Technische Universität Berlin, Machine Learning [email protected]
Freie Universität Berlin, Prehistoric Archaeology [email protected]

Abstract. During the last decades many approaches to machine-based analysis of ceramics were proposed, which are still actively developed as computers become cheaper and faster. However, their performance is still not even competitive with human perception. Another obstacle is the lack of standardised data sets for evaluation and comparison of different algorithms. In this work we present NLCM (Non-Linear Curvature Mapping), a novel approach for similarity measurements between ceramic vessels and sherds, which uses the curvature and orientation of the profile curve. Our approach uses the dynamic time warping (DTW, Sakoe and Chiba (1978)) algorithm, which provides a non-linear mapping of the profiles and is able to deal with non-linear scale and rotation differences in a very flexible manner. Fully preserved vessels as well as sherds of different sizes and types can be compared. To compare the similarity of otherwise not comparable sherds (i.e. bottom and rim sherds) we propose GWENN (Global Weighting by Evaluation of Nearest Neighbors), an indirect comparison method which relies on the similarity of the nearest neighbors. In a unique comparative study we show that NLCM has significant benefits over current state-of-the-art methods for similarity measurement like the GHT (Duda and Hart (1995) and Durham, Lewis and Shennan (1995)), Fourier Descriptors (Zahn and Roskies (1972) and Zhang and Lu (2001)), Linear Curvature Mapping (Femiani, Razdan and Farin (2004)) and others (Smilansky et al. (2004)). The study comprises database-like querying experiments as well as clustering experiments, for which we use a semi-supervised agglomerative clustering method. Furthermore we explore the benefits of introducing cannot-link and must-link constraints into the clustering scheme to achieve a clustering solution similar to that of a human expert. Our data set contains 1615 objects with 152 fully preserved vessels from the neolithic Vinča culture (as obtained by Schier (1995)). The data are given as binary images from which the profile and its discrete curvature are extracted in a fully unsupervised manner.
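The dynamic time warping step at the core of NLCM can be sketched generically as follows (a plain quadratic-time DTW between two toy curvature profiles; not the authors' implementation):

# Dynamic time warping distance between two 1-D curvature profiles (generic sketch).
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

t = np.linspace(0, 1, 80)
profile_a = np.sin(2 * np.pi * t)                  # curvature profile of one vessel (toy)
profile_b = np.sin(2 * np.pi * t[:60] ** 1.2)      # shorter, non-linearly rescaled profile
print("DTW distance:", round(dtw_distance(profile_a, profile_b), 3))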

Keywords ARCHAEOLOGY, CERAMICS, SHAPE RECOGNITION, CLUSTERING, COMPARATIVE STUDY

On The Stress Function of Asymmetric Triangulation Scaling Kojiro Shojima The National Center for University Entrance Examinations Komaba 2-19-23, Meguro-ku, Tokyo 153-8501, Japan [email protected] Abstract. Asymmetric triangulation scaling (ATRISCAL), proposed by Shojima (2009, 2010), is a kind of asymmetric multidimensional scaling for analyzing test data. The object of analysis in ATRISCAL is the conditional correct response rate (CCRR) matrix formed by the test items. ATRISCAL extracts the interitem dependency structure underlying the CCRR matrix and visualizes this structure in a 3D model space. In this study, a penalty term was added to the conventional stress function of ATRISCAL. The penalty term makes it more effective at locating the coordinates of item pairs with a strong dependency relationship close to each other in the model space, while plotting the coordinates of item pairs with a weak dependency relationship in different directions from the origin of the space.

References SHOJIMA, K. (2009): Asymmetric triangulation scaling: A multidimensional scaling for visualizing inter-item dependency structure. Proceedings of The 7th annual meeting of the Japan Association for Research on Testing, pp. 88-91. SHOJIMA, K. (2010): Exametrika 4.4 (www.rd.dnc.ac.jp/˜shojima/exmk/index.htm).

Keywords ASYMMETRIC MULTIDIMENSIONAL SCALING, STRESS FUNCTION, CONDITIONAL CORRECT RESPONSE RATE MATRIX, TEST DATA

SHOG - Spherical HOG Descriptors for Rotation Invariant 3D Object Detection Henrik Skibbe1 , Marco Reisert, and Hans Burkhardt University of Freiburg Abstract. We present a method for densely computing local spherical histograms of oriented gradients (SHOG) in volumetric images. The descriptors are based on the continuous representation of the orientation histograms in the harmonic domain, which we compute very efficiently via spherical tensor products and the fast Fourier transformation. Building upon these local spherical histogram representations, we utilize the Harmonic Filter to create a generic rotation invariant object detection system that benefits from both the highly discriminative representation of local image patches in terms of histograms of oriented gradients and an adaptable trainable voting scheme that forms the filter. We exemplarily demonstrate the effectiveness of such dense spherical 3D descriptors in a detection task on biological 3D images. In a direct comparison to existing approaches, our new filter reveals superior performance.

Assessment of Visibility Quality in Adverse Weather and Illumination Conditions Andrzej Sluzek1 and Mariusz Paradowski Khalifa University Abstract. A framework for the automatic detection of dangerously deteriorating visibility (e.g. due to bad weather and/or poor illumination conditions) is presented. The method employs image matching techniques for tracking similar fragments in video-frames captured by a forward-looking camera. The visibility is considered low when performances of visual tracking deteriorate and/or its continuity is lost either temporarily (i.e. a sudden burst of light, a splash of water) or more permanently. Two variants of the tracking algorithm are considered, i.e. the topological approach (more important) and the geometric one. Using the most difficult examples of DAGM2011 Challenge dataset (e.g. Snow, Rain and Light-sabre clips) it is demonstrated that the visibility quality can be numerically estimated, and the most severe cases (when even the human eye can hardly recognize the scene components) are represented by zero (or near-zero) values. The paper also briefly discusses the implementation issues (based on a previously developed similar real-time application) and directions of future works.

Under what circumstances do regular Computerized Adaptive Tests allow for sound clinical classifications? Niels Smits Vrije Universiteit, Faculty of Psychology and Education, Department of Clinical Psychology, Amsterdam, The Netherlands [email protected] Abstract. In the last few years Computerized Adaptive Testing (CAT) has become a very popular method for efficient self-report assessment in mental health research and clinical practice. The CAT algorithms used are based on Item Response Theory (IRT), and during a CAT each item is dynamically selected from a pool until a pre-specified measurement precision is reached for the patient. By contrast, in some educational settings adaptive tests are mainly used for classification (masters versus non-masters) purposes. Such algorithms are similar to standard CAT, but instead of optimizing measurement precision they provide as much information as possible at a cut-point relevant for decision making. In a clinical setting such classification-based adaptive testing is called Clinical Decision (CD) CAT (Waller and Reise, 1989). Here, it is studied under what circumstances it does not hurt to use regular CAT instead of CD-CAT for clinical decision making. In this paper the results of a simulation study are presented on the effects of (i) the prevalence of disease, (ii) the size of the score difference between healthy and diseased subjects, and (iii) the number of items.
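A minimal sketch of the contrast between the two item-selection rules, using the 2PL item information function (synthetic item parameters; the cut-off and ability values are illustrative):

# 2PL item information and the two item-selection rules contrasted above (illustrative).
import numpy as np

def p_correct(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    p = p_correct(theta, a, b)
    return a**2 * p * (1.0 - p)             # Fisher information of a 2PL item at theta

rng = np.random.default_rng(0)
a = rng.uniform(0.8, 2.0, size=50)          # discrimination parameters of the item pool
b = rng.uniform(-2.0, 2.0, size=50)         # difficulty parameters

theta_hat, cutoff = 0.6, 1.0                # current ability estimate vs. clinical cut-point
next_item_cat    = int(np.argmax(item_information(theta_hat, a, b)))   # regular CAT
next_item_cd_cat = int(np.argmax(item_information(cutoff, a, b)))      # classification-based CAT
print("regular CAT picks item", next_item_cat, "| CD-CAT picks item", next_item_cd_cat)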

References Waller, N.G., and Reise, S.P. (1989): Computerized adaptive personality assessment: an illustration with the absorption scale. Journal of Personality and Social Psychology 57, 1051–1058.

Keywords Clinical Psychology, Classification, Measurement, Short assessment.

MEASURES FOR COMPARING PARTITIONS - EVALUATION, SELECTION, DISTRIBUTIONS Andrzej Sokolowski, Sabina Denkowska, Kamil Fijorek, and Marcin Salamaga Cracow University of Economics [email protected] Abstract. More than 30 measures for comparing partitions of a finite set of objects have been proposed in the literature. Generally, they can be divided into four groups: measures based on counting pairs, measures calculated from the fourfold membership table, measures using identical parts of both partitions, and measures based on information theory. Some criteria for evaluating these measures are proposed in the paper, together with simple transformations for measures not satisfying those criteria. Measures which cannot be transformed to the [0,1] interval or do not satisfy the symmetry condition have been excluded from the study. For the finally chosen measures, a simulation experiment based on 10000 runs has been conducted. A special random partition generator has been used - a nonparametric one, assuming that each partition is equally probable. Most of the studied distributions are symmetric, but some of them have unexpectedly high location parameters. Empirical critical values (from the simulation study) have been calculated for testing the similarity of partitions.
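A small sketch of the pair-counting family of measures (Rand and adjusted Rand index computed from the contingency table of two partitions; numpy/scipy, illustrative only):

# Pair-counting comparison of two partitions: Rand and adjusted Rand indices (sketch).
import numpy as np
from scipy.special import comb

def rand_indices(u, v):
    u, v = np.asarray(u), np.asarray(v)
    n = len(u)
    classes_u, classes_v = np.unique(u), np.unique(v)
    ct = np.array([[np.sum((u == i) & (v == j)) for j in classes_v] for i in classes_u])
    sum_ij = comb(ct, 2).sum()               # pairs placed together in both partitions
    sum_i = comb(ct.sum(axis=1), 2).sum()
    sum_j = comb(ct.sum(axis=0), 2).sum()
    total = comb(n, 2)
    rand = (total + 2 * sum_ij - sum_i - sum_j) / total
    expected = sum_i * sum_j / total
    ari = (sum_ij - expected) / (0.5 * (sum_i + sum_j) - expected)
    return float(rand), float(ari)

u = [0, 0, 0, 1, 1, 1, 2, 2, 2]
v = [0, 0, 1, 1, 1, 2, 2, 2, 2]
print("Rand = %.3f, adjusted Rand = %.3f" % rand_indices(u, v))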

Finding clusters in high-dimensional data via multiple projections of variable subsets Douglas L. Steinley Department of Psychological Sciences University of Missouri [email protected] Abstract. Often times a data set is treated as a complete ”unit” where all variables are included in the cluster analysis, or if not, there is a simple variable selection procedure that eliminates some of the variables. Similarly, when variables are projected into lower dimensional space, commonly, all variables are included in the projection. Based on recent advances, this presentation will demonstrate finding multiple subsets of variables that can define different cluster structures in reduced space.

Cluster it! Semiautomatic splitting and naming of classification concepts. Dominik Stork, Kai Eckert, and Heiner Stuckenschmidt
University of Mannheim, [email protected]
Mannheim University Library, [email protected]
University of Mannheim, [email protected]

Abstract. The maintenance of Knowledge Organization Systems (KOS) like thesauri and classifications is still an expensive and time-consuming process. There has been a lot of research towards (semi-) automatic KOS construction and enhancement, but the maintenance by and large is still a manual task as many decisions and changes still require human interaction. We present a semiautomatic approach to split overpopulated concepts into subconcepts and propose suitable names for the new concepts. The problem of splitting a concept into useful subconcepts is akin to the problem of clustering a set of documents into useful clusters. Our approach consists of three steps: In a first step, meaningful term clusters are created and presented to the user for a further curation and selection of possible new subconcepts. A graph representation and simple TF-IDF weighting is used to create the cluster suggestions. The term clusters are used as seeds for the subsequent content-based clustering of the documents using k-Means. At last, the resulting clusters are evaluated based on their correlation with the preselected term-clusters and a proper term for the naming of the clusters is proposed. We show that this approach efficiently supports the KOS maintainer while avoiding the usual quality problems of fully automatic clustering approaches, especially with respect to the handling of outliers and determining the number of target clusters. The documents of the parent concept are directly assigned to the new subconcepts favoring high precision.
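A minimal sketch of the seeding idea (TF-IDF vectors, seed centroids built from curated term clusters, then k-means; the toy documents and term clusters are invented for illustration, and plain TF-IDF with scikit-learn stands in for the full graph-based procedure):

# TF-IDF document clustering seeded by curated term clusters (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["neural network training deep learning",
        "deep learning image recognition network",
        "library cataloguing subject headings thesaurus",
        "thesaurus classification subject indexing"]
seed_term_clusters = [["network", "learning"], ["thesaurus", "subject"]]   # curated by the user

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
vocab = vec.vocabulary_

# build one seed centroid per term cluster from the documents containing its terms
seeds = []
for terms in seed_term_clusters:
    cols = [vocab[t] for t in terms if t in vocab]
    members = X[:, cols].sum(axis=1) > 0
    seeds.append(X[members].mean(axis=0))
km = KMeans(n_clusters=len(seeds), init=np.vstack(seeds), n_init=1).fit(X)
print("document assignments to new subconcepts:", km.labels_)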

Keywords CLASSIFICATION, KOS, CLUSTERING, SPLITTING

Multi-Group Confirmatory Factor Analysis Model in mixed-cultural populations Piotr Tarka Poznan University of Economics, Poland [email protected] Abstract. Common factor analysis is often tied to specific properties of a population and its cultural characteristics. If a measurement instrument is transferred from one population to another, the extracted factors can hardly be compared on the reflective level unless all conditions of measurement invariance are met. Hence, market research and any inter-cultural study require a multicultural (that is, multi-group) model describing statistical differences between the cultures, with invariance as the underlying assumption. In this article we apply a Multi-Group Confirmatory Factor Analysis Model (MGCFA) to the marketing analysis of customers' personal values pertaining to hedonic lifestyle and consumption aspects in two culturally opposite populations. We conducted a survey in two countries, Poland and The Netherlands, with randomly drawn samples of young respondents on both sides. This model permitted testing measurement invariance under cross-group constraints and thus hypotheses examining the structural equivalence of the latent variables (values). Assuming equivalence in the structure of the data, one can further test for scalar and metric equivalence as well as for equivalence of measurement errors.

References JORESKOG, K. (1976): Simultaneous Factor Analysis in Several Populations. Psychometrika, 36, 409–426. MEREDITH, W. (1993): Measurement Invariance, Factor Analysis and Factorial Invariance. Psychometrika, 58, 525–543. BOLLEN, K.A. (1989): Structural Equations With Latent Variables. New York, Wiley. VAN de VIJVER, F.J.R., LEUNG, K. (1997): Methods and Data Analysis for CrossCultural Research. Newbury Park, Sage.

Keywords MGCFA, CUSTOMERS, HUMAN VALUES, FACTOR ANALYSIS

Probabilistic Object Models for Pose Estimation in 2D Images Damien Teney1 and Justus Piater University of Liege Abstract. We present a novel way of performing pose estimation of known objects in 2D images. We follow a probabilistic approach for modeling objects and representing the observations. These object models are suited to various types of observable visual features, and are demonstrated here with edge segments. Even imperfect models, learned from single stereo views of objects, can be used to infer the maximum-likelihood pose of the object in a novel scene, using a Metropolis-Hastings MCMC algorithm, given a single, calibrated 2D view of the scene. The probabilistic approach does not require explicit model-to-scene correspondences, allowing the system to handle objects without individually-identifiable features. We demonstrate the suitability of these object models to pose estimation in 2D images through qualitative and quantitative evaluations, as we show that the pose of textureless objects can be recovered in scenes with clutter and occlusion.

On the Efficiency of German Regions Nguyen Xuan Thinh1 , Martin Behnisch2 , and Alfred Ultsch3 1

2

3

Spatial Information Management and Modelling, August Schmidt Straße 10, TU Dortmund University, D-44221 Dortmund. [email protected] Leibniz Institute of Ecological and Regional Development, Weberplatz 1, 01217 Dresden. [email protected] Datenbionic Research Group, Hans-Meerwein-Strasse, Philipps-University Marburg, D-35032 Marburg. [email protected]

Abstract. Resource efficiency is a key part of the Europe 2020 strategy. The need for more scientific and objective knowledge leads to a whole range of research on the efficiency of spatial structures. However, knowledge discovery and data mining are little noticed when quantifying multidimensional efficiency. Regarding 16 selected efficiency indicators, an empirical concept is suggested for this multidimensional purpose. The concept is applied in a canonical way to compare a set of 298 German regions. All indicators are pre-processed and analyzed in view of the degree of efficiency. The comparison of regions is realized in a quantitative (efficiency as a percentage) and qualitative (efficiency as a dichotomic characteristic: low/high) understanding of efficiency. The Expectation Maximization algorithm is used for the parameter computation of Gaussian mixture models. Bayes' theorem offers advantages through its ability to formally incorporate prior knowledge into model specification via prior distributions and allows the variability to be considered. Techniques of knowledge discovery are applied to observe important efficiency indicators describing a subset of highly efficient and non-efficient regions. The subset of regions and the related machine-generated explanations are validated from the perspective of the spatial analyst. Results are presented in a symbolic and a spatial representation. The exploration of efficiency indicators and German regions aims to trigger discussions in the application domain: urban planning, regional policy and knowledge acquisition systems. Key words: Spatial Planning, Spatial Classification, Efficiency, Mixture Models
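A minimal sketch of the quantitative/qualitative step (an EM-fitted two-component Gaussian mixture and the Bayes posterior for a low/high efficiency split; scikit-learn on synthetic efficiency scores, not the actual indicator data):

# Two-component Gaussian mixture (EM) and Bayes posterior for a low/high efficiency split (sketch).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# synthetic one-dimensional efficiency scores (percentages) for 298 hypothetical regions
eff = np.concatenate([rng.normal(35, 8, 200), rng.normal(70, 6, 98)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(eff)   # EM parameter estimation
post = gmm.predict_proba(eff)                                    # Bayes posterior per component
high = np.argmax(gmm.means_.ravel())                             # component with the higher mean
labels = (post[:, high] > 0.5).astype(int)                       # dichotomic low/high efficiency
print("regions classified as highly efficient:", int(labels.sum()))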

References M. Behnisch. 2009. Urban Data Mining. Karlsruhe: KIT Scientific Press. D. Hand, H. Mannila and P. Smyth. 2001. Principles of Data Mining. Cambridge: MIT Press. A. Ultsch. 2003. Pareto Density Estimation: A Density Estimation for Knowledge Discovery. In: Baier D., Wernecke K.D. (Eds): Innovations in Classification, Data Science, and Information Systems. Berlin, Heidelberg, Springer, pp. 91-100.

Multivariate Modelling of Cross-Commodity Price Relations Along the Petrochemical Value Chain Myriam Thömmes and Peter Winker
Center for Finance and Banking, Justus-Liebig-University Giessen, Licher Str. 74, 35394 Giessen [email protected]
Department of Statistics and Econometrics, Justus-Liebig-University Giessen, Licher Str. 64, 35394 Giessen [email protected]

Abstract. We aim to shed light on the relationship between the prices of crude oil and oil-based products along the petrochemical value chain. The analyzed commodities are tied in an integrated production process. This characteristic motivates the existence of long-run equilibrium price relationships. The economic equilibrium mechanism can be captured by econometric models, namely error correction models. An understanding of the complex price relations between input and output products is important for petrochemical companies, which are exposed to price risk on both sides of their business. Their profitability is linked to the spread between input and output prices. Therefore, information about price relations along the value chain is valuable for risk management decisions. Using vector error correction models (VEC), we explore cross-commodity price relationships. We find that all prices downstream the value chain are cointegrated with the crude oil price, which is the driving price in the system. Furthermore, we assess whether the information about long-run cross-commodity relations, which is incorporated in the VEC models, can be utilized for forecasting prices of oil-based products. Rolling out-of-sample forecasts are computed and the forecasting performance of the VEC models is compared to the performance of naive forecasting models. Subsequently, we evaluate a trading strategy which is based on forecasts of the probability of a directional change. Our study offers new insights into how economic relations between commodities linked in a production process can be used for price forecasts and offers implications for risk management in the petrochemical industry.
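In generic notation, the vector error correction model used for this purpose can be written as

\[
\Delta y_t \;=\; \alpha \beta' y_{t-1} \;+\; \sum_{i=1}^{p-1} \Gamma_i \, \Delta y_{t-i} \;+\; \mu \;+\; \varepsilon_t ,
\]

where \(y_t\) collects the (log) prices of crude oil and the oil-based products, the columns of \(\beta\) span the long-run cointegrating relations and \(\alpha\) contains the loadings that measure how quickly each price corrects deviations from the long-run equilibrium.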

Keywords COMMODITY PRICES, CROSS-COMMODITY FEEDBACK EFFECTS, COINTEGRATION, FORECASTING

Training of Sparsely Connected MLPs Markus Thom, Roland Schweiger, and Günther Palm Daimler AG Abstract. Sparsely connected Multi-Layer Perceptrons (MLPs) differ from conventional MLPs in that only a small fraction of entries in their weight matrices are nonzero. Using sparse matrix-vector multiplication algorithms reduces the computational complexity of classification. Training of sparsely connected MLPs is achieved in two consecutive stages. In the first stage, initial values for the network's parameters are given by the solution to an unsupervised matrix factorization problem, minimizing the reconstruction error. In the second stage, a modified version of the supervised backpropagation algorithm optimizes the MLP's parameters with respect to the classification error. Experiments on the MNIST database of handwritten digits show that the proposed approach achieves equal classification performance compared to a densely connected MLP while speeding-up classification by a factor of seven.
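A minimal sketch of why sparse connectivity reduces classification cost: a forward pass with sparse weight matrices (scipy.sparse on random weights; the layer sizes and density are illustrative, and the two-stage training itself is not reproduced here):

# Forward pass of a sparsely connected MLP using sparse matrix products (illustrative sketch).
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

def sparse_layer(n_in, n_out, density=0.05):
    """Weight matrix with only a small fraction of non-zero entries."""
    return sp.random(n_out, n_in, density=density, random_state=0,
                     data_rvs=lambda k: rng.normal(scale=0.1, size=k)).tocsr()

W1, W2 = sparse_layer(784, 500), sparse_layer(500, 10)
b1, b2 = np.zeros(500), np.zeros(10)

def forward(x):
    h = np.maximum(W1 @ x + b1, 0.0)          # sparse matrix-vector product + ReLU
    z = W2 @ h + b2
    return np.exp(z) / np.exp(z).sum()        # softmax class probabilities

x = rng.random(784)                           # e.g. one flattened handwritten digit
print("class probabilities:", forward(x).round(3))
print("non-zero weights:", W1.nnz + W2.nnz, "of", 784 * 500 + 500 * 10)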

Bayesian mixture modeling with variable selection Tomoki Tokuda, Iven Van Mechelen, and Francis Tuerlinckx
Department of Psychology, University of Leuven, Tiensestraat 102, 3000 Leuven, BELGIUM
[email protected], [email protected], [email protected]

Abstract. A general problem in clustering high-dimensional data is that the presence of irrelevant variables can mask the 'true' group structure; for an effective clustering of observations, some form of variable selection is then essential. As a solution to this problem, Tadesse, Sha and Vannucci (2005) proposed a fully Bayesian method (based on a multivariate normal mixture model) that includes a procedure for variable selection. This method, however, appears to suffer from two drawbacks: firstly, it is not scale-invariant (i.e., transforming the unit of one or more variables may influence the results); secondly, the results of the method are sensitive to the number of irrelevant variables. These drawbacks may considerably hamper the use of the method proposed by Tadesse et al. in practice. In this talk, we propose some modifications of the method to deal with these drawbacks. The main idea is to make the method hierarchical by introducing hyperpriors for some parameters and to apply it to a suitably preprocessed form of the data. In a large-scale simulation study, our modified version will be shown to outperform the original method. Finally, the performance of the modified method will also be compared with that of a benchmark method, namely the clustering method with variable selection of Steinley and Brusco (2008), which performed best in a comparative study by the same authors.

References Tadesse, M.G., Sha, N., and Vannucci, M. (2005): Bayesian variable selection in clustering high-dimensional data. Journal of the American Statistical Association, 100, 602–617. Steinley, D., and Brusco, M.J. (2008): Selection of variables in cluster analysis: An empirical comparison of eight procedures. Psychometrika, 73, 125–144

Keywords VARIABLE SELECTION, CLUSTERING, MULTIVARIATE NORMAL MIXTURES, MARKOV CHAIN MONTE CARLO

Tokuda

202

Factorial PD-Clustering Cristina Tortora1,2, Mireille Gettler Summa2, and Francesco Palumbo1
1 Università Federico II di Napoli, C.so Umberto I 40, 80138, Napoli, Italia [email protected], [email protected]
2 Université Paris Dauphine, CEREMADE, CNRS, Place Du Maréchal De Lattre De Tassigny, 75775, Paris, Francia [email protected]

Abstract. This paper proposes a new method, based on Probabilistic Distance (PD) Clustering, to find homogeneous groups in the data. Given a set of p continuous variables registered on n statistical units, and a set of K centers in R^p, PD-clustering assigns units to clusters according to their probability of belonging to each cluster, under the constraint that the product of the probability and the distance of each point to any cluster center is a constant (Ben-Israel and Iyigun 2008). As p becomes large, the solution tends to become unstable. The aim of this paper is to extend PD-Clustering to the context of factorial clustering. We prove that the optimal factorial solution is based on the Tucker3 decomposition, which is equivalent to projecting the original data onto a subspace defined according to the same PD-Clustering criterion. The whole method consists of a two-step iterative procedure: a linear transformation of the initial data, followed by PD-clustering on the transformed data. The algorithm can be summarized as follows: i) Probabilistic D-Clustering on the original data; ii) Three-way analysis of the dissimilarity matrix; iii) Probabilistic D-clustering on the factors of the three-way analysis. The first step is an initialization; steps 2 and 3 are iterated and the K centers are updated. The procedure stops when the solution converges: the K centers do not change anymore. The factorial step makes the method more stable and allows us to work with data sets with large p and to find clusters of arbitrary shape. An example illustrates the method's capabilities.
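For concreteness, the following sketch shows one basic (non-factorial) PD-clustering iteration as we read Ben-Israel and Iyigun (2008): membership probabilities follow from the constancy of probability times distance, and centers are updated with Weiszfeld-type weights. The function names and the exact weighting are our assumptions, not code from the paper.

```python
# One basic PD-clustering iteration: membership probabilities and center update.
import numpy as np

def pd_clustering_step(X, centers, eps=1e-12):
    """X: (n, p) data, centers: (K, p). Returns probabilities and updated centers."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps   # (n, K) distances
    # p_k(x) * d_k(x) = const  =>  p_k(x) proportional to prod_{j != k} d_j(x)
    prod_others = np.prod(d, axis=1, keepdims=True) / d
    p = prod_others / prod_others.sum(axis=1, keepdims=True)
    w = p**2 / d                                                            # Weiszfeld-type weights
    new_centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return p, new_centers
```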

References BEN-ISRAEL, A. and IYIGUN, C. (2008): Probabilistic D-clustering. Journal of Classification, 25, 5–26. KROONENBERG, P. M. (2008): Applied Multiway Data Analysis. Wiley Series in Probability and Statistics.

Keywords NON HIERARCHICAL CLUSTERING, FACTOR ANALYSIS

203

Tortora

Object Recognition System Guided by Gaze of the User with a Wearable Eye Tracker Takumi Toyama, German Research Center for Artificial Intelligence Abstract. Existing approaches for object recognition typically rely on images captured with an ordinary digital camera, and the recognition task therefore becomes difficult when the image is cluttered with other objects and the object of interest is not clearly indicated. In this work, we integrate a wearable eye tracker into the object recognition system in order to recognize which object the user is paying attention to in the scene camera. To demonstrate the usability of such a gaze-based object recognition interface, we developed a prototypical application named Museum Guide 2.0 which can be used in a museum as a mechanical guide for visitors.

Toyama

204

Fuzzy Clustering by the Hyperbolic Smoothing Approach Javier Trejos1, Eduardo Piza2, Luiz Carlos F. Souza3, Alex Murillo4, Vinicius L. Xavier5, and Adilson E. Xavier6
1 CIMPA, University of Costa Rica, Costa Rica. [email protected]
2 CIMPA, University of Costa Rica, Costa Rica. [email protected]
3 Petrobras, Brazil. [email protected]
4 CIMPA, University of Costa Rica, Costa Rica. [email protected]
5 Instituto Brasileiro de Geografia e Estatística, Brazil. [email protected]
6 COPPE, Federal University of Rio de Janeiro, Brazil. [email protected]

Abstract. The hyperbolic smoothing clustering method is a new general strategy for solving problems within the scope of cluster analysis; indeed, it corresponds to a fuzzy approach to clustering. We analyze these features and present a new fuzzy clustering algorithm. The approach has three main stages: relaxation of the allocation to the nearest center's class, smoothing of the maximum function, and smoothing of the Euclidean norm. This leads to a continuous optimization problem which can be solved by Newton-Raphson iterations, whose solutions are the centroids of the classes. Then, allocation to the classes is made for each value of the relaxation step according to a simple rule, which is essentially a fuzzy clustering. Computational results obtained for solving a set of test problems from the literature show the efficiency and potential of the proposal. We show the possibility of obtaining a hard solution of the particular sum-of-squares clustering problem by a fuzzy strategy. The same methodology can be used for solving similar clustering problems. Moreover, we believe that the application of a sequence of fuzzy formulations that gradually approach the original one can be successfully used for solving a broad class of mathematical problems.
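As an illustration of the smoothing idea, the sketch below gives the two standard hyperbolic smoothing functions (for max(·, 0) and for the Euclidean distance) in the spirit of Xavier and Xavier (2011); parameter names and values are illustrative.

```python
# Hyperbolic smoothing functions: tau and gamma are the smoothing parameters that
# are driven towards zero along the relaxation sequence. Illustrative sketch only.
import numpy as np

def smooth_max(y, tau):
    """Smooth approximation of max(y, 0): (y + sqrt(y^2 + tau^2)) / 2."""
    return (y + np.sqrt(y**2 + tau**2)) / 2.0

def smooth_dist(x, c, gamma):
    """Everywhere-differentiable approximation of the Euclidean distance ||x - c||."""
    return np.sqrt(np.sum((x - c)**2) + gamma**2)

# As tau, gamma -> 0 both approximations converge to the original non-smooth functions.
print(smooth_max(-0.3, 0.1), smooth_dist(np.array([1.0, 2.0]), np.array([0.0, 0.0]), 0.05))
```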

References XAVIER, A.E. and XAVIER, V.L. (2011): Solving the Minimum Sum-of-Squares Clustering Problem by Hyperbolic Smoothing and Partition into Boundary and Gravitational Regions. Pattern Recognition, 44, 70–77. TREJOS, J. and VILLALOBOS, M. (2007): Partitioning by particle Swarm Optimization. In: P. Brito et al. (Eds.): Selected Contributions in Data Analysis and Classification. Springer, Berlin, 235–244.

Keywords CLUSTERING, RELAXATION, SMOOTHING, OPTIMIZATION

205

Trejos

Convex Optimization as a Tool for Correcting Dissimilarity Matrices for Regular Minimality Matthias Trendtel and Ali Ünlü Faculty of Statistics, Dortmund Technical University, Germany {trendtel,uenlue}@statistik.tu-dortmund.de Abstract. Fechnerian scaling as developed by Dzhafarov and Colonius (e.g., [2]) aims at imposing a metric on a set of objects based on their pairwise dissimilarities, e.g., discrimination probabilities. A necessary condition for this theory is the law of Regular Minimality (RM), which is a fundamental property of discrimination (e.g., [1]). A dissimilarity matrix of discrimination measures satisfies RM if every row and every column of the matrix contains a single minimal entry, and an entry minimal in its row is minimal in its column. In [3] and [5] tests have been proposed for RM. These tests, however, do not allow correcting a dissimilarity matrix for RM, if violations of this property can be deemed negligible. In this paper, we solve that problem by phrasing it as a convex optimization problem. An 'optimal' correction of a given data matrix violating RM (of a specified form) is defined as the RM-compliant matrix (of that form) with minimal Euclidean distance to the data matrix. This can be seen as a classical optimization problem, since the set of matrices satisfying RM (of a specified form) is a convex set. Hence such algorithms as [4] can be used to solve it. In simulations, we demonstrate the usefulness of this correction procedure.
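A minimal sketch of such a correction, assuming the canonical form of RM in which the row and column minima lie on the diagonal; it uses cvxpy as a generic convex solver for illustration rather than the dual quadratic programming method of [4], and the margin parameter is an assumption.

```python
# Closest RM-compliant matrix (diagonal minima) to a given dissimilarity matrix D.
import cvxpy as cp
import numpy as np

def correct_for_rm(D, margin=0.0):
    n = D.shape[0]
    M = cp.Variable((n, n))
    constraints = []
    for i in range(n):
        for j in range(n):
            if i != j:
                constraints += [M[i, i] + margin <= M[i, j],   # minimal in its row
                                M[j, j] + margin <= M[i, j]]   # minimal in its column
    prob = cp.Problem(cp.Minimize(cp.sum_squares(M - D)), constraints)
    prob.solve()
    return M.value

D = np.array([[0.2, 0.1, 0.5],     # RM violated: the minimum of row 1 is off-diagonal
              [0.3, 0.4, 0.6],
              [0.7, 0.8, 0.3]])
print(np.round(correct_for_rm(D), 3))
```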

References 1. E.N. Dzhafarov and H. Colonius. Regular Minimality: A fundamental law of discrimination. In H. Colonius and E.N. Dzhafarov, editors, Measurement and Representation of Sensations, pages 1–46. Erlbaum, Mahwah, NJ, 2006. 2. E.N. Dzhafarov and H. Colonius. Dissimilarity cumulation theory and subjective metrics. Journal of Mathematical Psychology, 51:290–304, 2007. 3. E.N. Dzhafarov, A. Ünlü, M. Trendtel, and H. Colonius. Matrices with a given number of violations of Regular Minimality. Journal of Mathematical Psychology, in press, 2011. 4. D. Goldfarb and A. Idnani. A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming, 27:1–33, 1983. 5. A. Ünlü and M. Trendtel. Testing for Regular Minimality. In A. Bastianelli and G. Vidotto, editors, Fechner Day 2010, pages 51–56. The International Society for Psychophysics, Padua, Italy, 2010.

Keywords Fechnerian scaling, Regular Minimality, Convex optimization

Trendtel

206

Shape- and Pose-Invariant Correspondences using Probabilistic Geodesic Surface Embedding Aggeliki Tsoli and Michael Black, Brown University Abstract. Correspondence between non-rigid deformable 3D objects provides a foundation for object matching and retrieval, recognition, and 3D alignment. Establishing 3D correspondence is challenging when there are non-rigid deformations or articulations between instances of a class. We present a method for automatically finding such correspondences that deals with significant variations in pose, shape and resolution between pairs of objects. We represent objects as triangular meshes and consider normalized geodesic distances as representing their intrinsic characteristics. Geodesic distances are invariant to pose variations and nearly invariant to shape variations when properly normalized. The proposed method registers two objects by optimizing a joint probabilistic model over a subset of vertex pairs between the objects. The model enforces preservation of geodesic distances between corresponding vertex pairs, and inference is performed using loopy belief propagation in a hierarchical scheme. Additionally, our method prefers solutions in which local shape information is consistent at matching vertices. We quantitatively evaluate our method and show that it is more accurate than a state-of-the-art method.

207

Tsoli

Large Displacement Optical Flow for Volumetric Image Sequences Benjamin Ummenhofer, University of Freiburg Abstract. In this paper we present a variational optical flow algorithm for volumetric image sequences (3D + time). The algorithm uses descriptor correspondences that allow us to capture large motions. Further, we describe a symmetry constraint that considers the forward and the backward flow of an image sequence to improve the accuracy of the flow field. We have tested our algorithm on real and synthetic data. Our experiments include a quantitative evaluation that shows the impact of the algorithm's components. We compare a single-core implementation to two parallel implementations, one on a multi-core CPU and one on the GPU.

Ummenhofer

208

Efficient Stereo and Optical Flow with Robust Similarity Measures Christian Unger, Eric Wahl, and Slobodan Ilic, BMW AG Abstract. In this paper we address the problem of dense stereo matching and computation of optical flow. We propose a generalized dense correspondence computation algorithm, so that stereo matching and optical flow can be performed robustly and efficiently at the same time. We particularly target automotive applications and tested our method on real sequences from cameras mounted on vehicles. We performed an extensive evaluation of our method using different similarity measures and focused mainly on difficult real-world sequences with abrupt exposure changes. We also performed evaluations on Middlebury data sets and provide many qualitative results on real images, some of which are provided by the adverse vision conditions challenge of the conference.

209

Unger

Individual differences scaling (INDSCAL) revisited Steffen Unkel, John C. Gower and Nickolay T. Trendafilov Department of Mathematics and Statistics, The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom

e-mail:[email protected] Abstract. Individual differences scaling (INDSCAL) is a model for the simultaneous analysis of a number of proximity matrices for a set of objects. In this talk, the INDSCAL model is viewed as being embedded within a hierarchy of models, each layer with possible formulations based on dissimilarity and configuration matrices. In its configuration version, INDSCAL is considered as a specific form of a generalized Procrustes problem. Algorithms are introduced for fitting the reformulated INDSCAL model. We also propose new simple methods for solving the INDSCAL problem, which take as input slices either inner-product matrices or squared dissimilarity matrices. Applications to real data illustrate the performance of the methods and their fitting solutions.

References Gower, J. C. and Dijksterhuis, G. B. (2004): Procrustes Problems. Oxford University Press: Oxford.

Keywords INDSCAL, Multidimensional scaling, Procrustes problems, Proximity matrices, Three-way data.

Unkel

210

Clustering Covariates Regression Eva Vande Gaer1,2, Eva Ceulemans1, and Iven Van Mechelen2
1 Methodology of Educational Sciences Research Group, K.U.Leuven, Belgium
2 Research Group Quantitative Psychology and Individual Differences, K.U.Leuven, Belgium [email protected]

Abstract. Linear regression is a widely applied technique in many research fields. Its aim is to predict one or more dependent variables on the basis of a number of independent variables. However, when analyzing data sets with very many independent variables, some of which are highly correlated, one may face the bouncing beta problem: Regression weights obtained for such data sets tend to be unstable, in that small changes in the data can lead to completely different regression weights. To solve the bouncing beta problem, many solutions have already been suggested. Roughly, two types of solutions can be distinguished: variable selection methods (e.g. OSCAR and the Lasso; Bondell & Reich, 2008; Tibshirani, 1996) and dimension reduction methods (e.g. principal component regression and principal covariates regression; Kiers & Smilde, 2007). However, the interpretation of the solutions obtained by these methods is not always straightforward. As a possible alternative, we therefore propose the Clustering Covariates Regression method (CCovR). This method simultaneously partitions the independent variables into a few predictor types and regresses the dependent variable(s) on these types. In this talk, we first introduce the CCovR method. Next, we compare CCovR and some variable selection and dimension reduction methods by applying them to the same data set.

References BONDELL, H.D. and REICH, B.J. (2008): Simultaneous regression shrinkage, variable selection and supervised clustering of predictors with OSCAR. Biometrics, 64, 115–123. KIERS, H.A.L. and SMILDE, A.K. (2007): A comparison of various methods for multivariate regression with highly collinear variables. Statistical Methods and Applications, 16, 193–228. TIBSHIRANI, R. (1996): Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B, 58, 1, 267–288.

Keywords clustering, regression, bouncing betas, variable selection, dimension reduction

211

Vande Gaer

Sparse multidimensional unfolding Katrijn Van Deun Research Group of Quantitative Psychology and Individual Differences, Katholieke Universiteit Leuven, Tiensestraat 102, box 3713, 3000 Leuven, Belgium. Email: [email protected] Abstract. Gene expression data profile the expression of thousands of genes in a number of conditions (environmental conditions, knockout experiments, patients). Such data are not only complex to handle due to their size, but also due to the heterogeneity of the genes (and possibly also the conditions). Previously, we have shown multidimensional unfolding to be a very fruitful dimension reduction method for an initial exploration of such data. However, typically interest is only in a limited number of grouped genes and, moreover, unfolding representations become less insightful in case of a very large number of genes. Selection of relevant groups of genes is therefore of utmost importance. In this paper we will present a sparse multidimensional unfolding method that relies on a model with regularized gene weights using an L1 and L2 penalty. Estimation of this model is based on iterative majorization, also known as the Majorize Minimize (MM) optimization technique.

Keywords MULTIDIMENSIONAL UNFOLDING, REGULARIZATION, GENE EXPRESSION DATA, MAJORIZE MINIMIZE

Van Deun

212

Multiple Nested Reductions of Single Data Modes as a Tool to Deal with Large Data Sets Iven Van Mechelen and Katrijn Van Deun University of Leuven, Tiensestraat 102 - box 3713, 3000 Leuven, Belgium [email protected] Abstract. The increased accessibility and concerted use of novel measurement technologies give rise to high-dimensional data with matrices that comprise both a high number of variables and a high number of objects. As an example, one may think of transcriptomics data pertaining to the expression of a large number of genes in a large number of samples or tissues (as included in various compendia). The analysis of such data typically implies major challenges on the level of estimation, computation, and interpretation. Van Mechelen and Schepers (2007) proposed a generic method to deal with these problems. This method implies that single data modes (i.e., the set of objects, or the set of variables under study) are subjected to multiple (discrete and/or dimensional) nested reductions. In this talk, we will briefly recapitulate the generic multiple nested reductions method, and we will show how a few recently proposed modeling approaches are subsumed by it. Next, we will introduce two novel instantiations of the generic method, which simultaneously include a two-mode partitioning of the objects and variables under study (Van Mechelen et al. (2004)) and a low-dimensional, principal component-type dimensional reduction of the two-mode cluster centroids. We will illustrate these novel instantiations with an application to transcriptomics data for normal and tumourous colon tissues.

References VAN MECHELEN, I., BOCK, H.-H. and DE BOECK, P. (2004): Two-Mode Clustering Methods: A Structural Overview. Statistical Methods in Medical Research, 13, 363–394. VAN MECHELEN, I. and SCHEPERS, J. (2007): A Unifying Model Involving a Categorical and/or Dimensional Reduction for Multimode Data. Computational Statistics and Data Analysis, 52, 537–549.

Keywords HIGH-DIMENSIONAL DATA, TWO-MODE CLUSTERING, DIMENSION REDUCTION

213

Van Mechelen

Recognition of Harmonic Characteristics for Audio Intervals and Chords Igor Vatolkin1, Markus Eichhoff2, and Claus Weihs3
1 Chair of Algorithm Engineering, TU Dortmund [email protected]
2 Chair of Computational Statistics, TU Dortmund [email protected]
3 Chair of Computational Statistics, TU Dortmund [email protected]

Abstract. Recognition of high-level harmonic characteristics may be helpful for classification of large music data sets. These features describe the relationships between the simultaneously played notes and are motivated by music theory. The calculation is often straightforward if the score is available, but can be very hard if only audio data is present. The latter case is typical of personal mp3 collections. We compare the algorithm performance for two data sets of intervals and chords. Chroma-based enhanced characteristics take as input a chroma or chroma energy normalized statistics vector (Müller and Ewert (2010)). The relationships between the played tones can be measured (balance of consonant and dissonant components, interval estimation). Another approach is directed by the training of classification models based on labeled low-level audio feature vectors. The identification of the most relevant features is performed using feature selection, see e.g. (Bischl et al. (2010)). Overtone analysis (Mattern (2010)) can further improve the harmonic analysis performance: here the spectral peaks are sorted according to their strengths, and several tone candidates with the corresponding overtones are matched to the spectral distribution.

References BISCHL, B., VATOLKIN, I. and PREUSS, M. (2010): Selecting Small Audio Feature Sets in Music Classification by Means of Asymmetric Mutation. In: Proc. of the 11th Int'l Conf. on Par. Probl. Solv. fr. Nature (PPSN), Krakow, pp. 314-323. MATTERN, V. (2010): Ableitung Partitur-basierter Merkmale aus Audiosignaldaten. Diploma thesis, TU Dortmund, Chair of Algorithm Engineering. MÜLLER, M. and EWERT, S. (2010): Towards timbre-invariant audio features for harmony-based music. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 3, pp. 649-662.

Keywords MUSIC CLASSIFICATION, HARMONY FEATURES

Vatolkin

214

Principal Covariates Regression: How to Weight and Rotate? Marlies Vervloet1, Eva Ceulemans1, Katrijn Van Deun2 and Wim Van den Noortgate1
1 Methodology of Educational Sciences Research Group, K.U.Leuven
2 Research Group Quantitative Psychology and Individual Differences, K.U.Leuven

Abstract. As is commonly known, ordinary linear regression falls short when the predictor variables are highly correlated with each other, because in that case the estimates for the regression weights tend to be unstable. Principal Covariates Regression (PCovR) was developed by De Jong & Kiers (1991) as a solution to this problem. PCovR combines the main ideas behind Principal Component Analysis (PCA) and regression. Like PCA, PCovR reduces the variables to a few components and, like regression, it predicts the criterion variables, but using the components as predictor variables. Specifically, PCovR minimizes the following criterion: $\alpha \, \|X - T P_X\|^2 + (1 - \alpha) \, \|Y - T P_Y\|^2$, where $X$ and $Y$ are the scores on, respectively, the predictor and the criterion variables, $\alpha$ is the weighting parameter, which indicates the extent to which the reconstruction of the predictor scores versus the criterion scores is emphasized, $T$ contains the scores of the observations on the components, $P_X$ holds the loadings of the predictor variables on the components, and $P_Y$ contains the regression weights of the components when predicting the criterion variables. Although PCovR is potentially a very interesting method (e.g., there are strong relations with exploratory SEM; Asparouhov & Muthen, 2009), it is rarely used. This might be because the estimates for the regression weights $P_Y$ display rotational freedom. Another issue is the optimal value of the weighting parameter $\alpha$. In this paper, based on extensive simulations, we make some recommendations on how to deal with the rotational freedom and how to select the value of $\alpha$.
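A hedged sketch of PCovR based on the eigendecomposition solution attributed to De Jong & Kiers (1991); the normalisation of the two loss parts and the variable names are assumptions and may differ from the authors' exact setup.

```python
# PCovR via eigendecomposition: components span the column space of X and trade off
# reconstruction of X against prediction of Y through the weighting parameter alpha.
import numpy as np

def pcovr(X, Y, n_components, alpha=0.5):
    """Return orthonormal component scores T, loadings P_X and regression weights P_Y."""
    Hx = X @ np.linalg.pinv(X)              # projector onto the column space of X
    G = (alpha * X @ X.T / np.sum(X**2)
         + (1 - alpha) * Hx @ Y @ Y.T @ Hx / np.sum(Y**2))
    eigval, eigvec = np.linalg.eigh(G)      # symmetric matrix, eigenvalues ascending
    T = eigvec[:, ::-1][:, :n_components]   # leading eigenvectors as component scores
    P_X, P_Y = T.T @ X, T.T @ Y             # conditional least-squares loadings / weights
    return T, P_X, P_Y

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 8))
Y = X[:, :2] @ rng.standard_normal((2, 1)) + 0.1 * rng.standard_normal((100, 1))
T, P_X, P_Y = pcovr(X, Y, n_components=2, alpha=0.5)
print(np.round(T.T @ T, 3))                 # identity: the components are orthonormal
```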

References DE JONG, S. and KIERS, H.A.L. (1991): Principal covariates regression. Part I. Theory. Chemometrics and Intelligent Laboratory Systems, 14, 155–164. ASPAROUHOV, T. and MUTHEN, B. (2009): Exploratory Structural Equation Modeling. Structural Equation Modeling, 16, 397–438.

Keywords MULTICOLLINEARITY, PRINCIPAL COVARIATES REGRESSION

215

Vervloet

Agnostic Domain Adaptation Alexander Vezhnevets and Joachim M. Buhmann ETH Zurich, Switzerland {alexander.vezhnevets,jbuhmann}@inf.ethz.ch Abstract. The supervised learning paradigm assumes in general that both training and test data are sampled from the same distribution. When this assumption is violated, we are in the setting of transfer learning or domain adaptation: Here, given training data from a source domain, we aim to learn a classifier which performs well on a target domain governed by a different distribution. We pursue an agnostic approach, assuming no information about the shift between source and target distributions but relying exclusively on unlabeled data from the target domain. Previous works suggest that feature representations which are invariant to domain change increase generalization. Extending these ideas, we prove a generalization bound for domain adaptation that identifies the transfer mechanism: what matters is how invariant the learnt classifier itself is, while feature representations may vary. Our bound is much tighter for rich hypothesis classes, which may contain an invariant classifier but cannot be invariant altogether. This concept is exemplified by the computer vision tasks of semantic segmentation and image categorization. Domain shift is simulated by introducing some common imaging distortions, such as gamma transform and color temperature shift. Our experiments on a public benchmark dataset confirm that using a domain-adapted classifier significantly improves accuracy when distribution changes are present.

Vezhnevets

216

Clustering by Moving Centroids using Simulated Annealing Mario Villalobos-Arias1, Eduardo Piza-Volio2, and Javier Trejos3
1 CIMPA, University of Costa Rica, Costa Rica. [email protected]
2 CIMPA, University of Costa Rica, Costa Rica. [email protected]
3 CIMPA, University of Costa Rica, Costa Rica. [email protected]

Abstract. In this work we present a new approach for the classification of numerical data in which the centroids are moved using the simulated annealing algorithm, instead of transferring individuals between classes, as has traditionally been done by algorithms that use similar heuristics for optimization (see the references). We compare the results with those of classical algorithms.
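A minimal sketch of the centroid-moving idea: one randomly chosen centroid is perturbed and the move is accepted with the Metropolis rule. The cooling schedule, step size and the within-class sum-of-squares criterion are illustrative assumptions, not the authors' settings.

```python
# Simulated annealing over centroid positions rather than over class memberships.
import numpy as np

def wss(X, centers):
    """Within-class sum of squares with each point assigned to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(axis=2)
    return d2.min(axis=1).sum()

def sa_centroid_clustering(X, k, n_iter=5000, t0=1.0, cooling=0.999, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    cost, temp = wss(X, centers), t0
    for _ in range(n_iter):
        cand = centers.copy()
        j = rng.integers(k)
        cand[j] += step * rng.standard_normal(X.shape[1])    # move one centroid
        new_cost = wss(X, cand)
        if new_cost < cost or rng.random() < np.exp(-(new_cost - cost) / temp):
            centers, cost = cand, new_cost                   # Metropolis acceptance
        temp *= cooling
    return centers, cost

X = np.vstack([np.random.default_rng(1).standard_normal((50, 2)) + c for c in ([0, 0], [5, 5])])
centers, cost = sa_centroid_clustering(X, k=2)
```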

References TREJOS, J., MURILLO, A. and PIZA, E. (1998): Global stochastic optimization for partitioning. In: A. Rizzi, M. Vichi and H.H. Bock (Eds.): Advances in Data Science and Classification. Springer, Berlin, 185-190. TREJOS, J. and VILLALOBOS, M. (2007): Partitioning by particle Swarm Optimization. In: P. Brito et al. (Eds.): Selected Contributions in Data Analysis and Classification. Springer, Berlin, 235–244.

Keywords CLUSTERING, SIMULATED ANNEALING, OPTIMIZATION

217

Villalobos-Arias

Interactive Principal Components Analysis: a new technological resource in the classroom Carmen Villar-Patiño, Miguel Angel Mendez-Mendez, Carlos Cuevas-Covarrubias Anahuac University Abstract. Principal Components Analysis (PCA) is a mathematical technique widely used in multivariate statistics and pattern recognition. From a statistical point of view, PCA is an optimal linear transformation that eliminates the covariance structure of the data. From a geometrical point of view, it is simply a convenient axes rotation. A successful PCA application depends, to a certain extent, on the comprehension of this geometrical concept; however, visualizing these axes rotations can be an important challenge for many students. At the present time, undergraduate students are immersed in a social environment with an increasing amount of collaborative and interactive elements. This situation gives us the opportunity to incorporate new and creative alternatives of knowledge transmission. We present interactive educational software that helps students understand the geometrical foundations of Principal Components Analysis. Based on Nintendo's Wiimote (a new generation device), students manipulate axes rotations interactively in order to obtain a diagonal covariance matrix. The graphical environment shows different projections of the data, as well as several statistics such as the percentage of variance explained by each component. Previous applications of this new pedagogical tool suggest that it constitutes an important didactic support in the classroom.

Villar-Patiño

218

Model-based clustering for three-way data Cinzia Viroli Department of Statistics, University of Bologna, Italy [email protected] Abstract. The technological progress of the last decades has made a huge amount of information available, often expressed in unconventional formats. Among these, three-way data occur in different application domains, arising from the simultaneous observation of various attributes on a set of units in different situations or locations. These include data coming from longitudinal studies of multiple responses, spatiotemporal data, or multivariate repeated measures. In this work we propose model-based clustering for the wide class of continuous three-way data by a general mixture model which can be adapted to the different kinds of three-way data. This purpose is achieved by modeling the distribution of the observed matrices according to a matrix-variate normal distribution (Nel 1977; Dutilleul 1999). In so doing we also provide a tool for simultaneously performing model estimation and model selection. The effectiveness of the proposed method is illustrated on a simulation study and on real examples.
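For reference, a hedged sketch of the matrix-variate normal log-density that underlies such a mixture model; U and V denote row and column covariance matrices, and the parametrisation follows the usual convention (vec(X) ~ N(vec(M), V ⊗ U)) rather than the authors' code.

```python
# Matrix-variate normal log-density for an n x p observation X with mean matrix M,
# row covariance U (n x n) and column covariance V (p x p).
import numpy as np

def matrix_normal_logpdf(X, M, U, V):
    n, p = X.shape
    R = X - M
    quad = np.trace(np.linalg.inv(V) @ R.T @ np.linalg.inv(U) @ R)
    _, logdet_U = np.linalg.slogdet(U)
    _, logdet_V = np.linalg.slogdet(V)
    return -0.5 * (n * p * np.log(2 * np.pi) + p * logdet_U + n * logdet_V + quad)

# With identity covariances this reduces to a product of independent standard normals.
print(matrix_normal_logpdf(np.zeros((3, 2)), np.zeros((3, 2)), np.eye(3), np.eye(2)))
```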

References DUTILLEUL, P. (1999): The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation, 64, 105–123. NEL, H. M. (1977): On distributions and moments associated with matrix normal distributions. Mathematical Statistics Department, University of the Orange Free State, Bloemfontein, South Africa (Technical report 24).

Keywords BIRTH AND DEATH PROCESS, MATRIX-VARIATE NORMAL DISTRIBUTION, MIXTURE MODELS, THREE-WAY DATA.

219

Viroli

Product Design Optimization Using Ant Colony and Bee Algorithms: A Comparison Sascha Voekler and Daniel Baier Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany [email protected] [email protected] Abstract. In recent years, heuristic algorithms, especially swarm intelligence algorithms, have become popular for product design, where problem formulations often are NP-hard [1]. Swarm intelligence algorithms offer an alternative for large-scale problems to reach near-optimal solutions without constraining the problem formulations excessively [3]. In this paper, ant colony [3] and bee colony algorithms [2] are compared. Simulated conjoint data for different product design settings are used for this comparison; their generation uses a Monte Carlo design similar to the one applied in [3]. The purpose of the comparison is to provide guidance on which algorithm should be applied in which product design setting. Key words: Product Design, Swarm Intelligence, Ant Colony Optimization, Bee Colony Optimization, Conjoint Analysis

References 1.K. Socha, M. Dorigo. Ant colony optimization for continuous domains. European Journal of Operational Research, 185:1155–1173, 2008. 2.D. Karaboga, B. Basturk. A powerful and efficient algorithm for numerical function optimization: artificial bee colony (ABC) algorithm. Journal of Global Optimization, 39:459–471, 2007. 3.M. D. Albritton, P. R. McMullen. Optimal product design using a colony of virtual ants. European Journal of Operational Research, 176:498–520, 2007.

Voekler

220

How well does a phylogenetic tree represent the underlying data? Arndt von Haeseler Center for Integrative Bioinformatics Vienna (CIBIV), Wien, Austria [email protected] Abstract. As models of sequence evolution become more and more complicated, many criteria for model selection have been proposed, and tools are available to select the best model for an alignment under a particular criterion. However, in many instances the selected model fails to explain the data adequately, as reflected by large deviations between observed pattern frequencies and the corresponding expectation. We present an approach to evaluate the goodness of fit. We introduce a minimum number of "extra substitutions" on the inferred tree to provide a biologically motivated explanation of why the alignment may deviate from expectation. These extra substitutions plus the evolutionary model then fully explain the alignment. We illustrate the method on several examples.

221

Von Haeseler

Implicit scene context for object segmentation and classification Jan Wegner, Bodo Rosenhahn, and Uwe Soergel Leibniz Universität Hannover Abstract. In this paper, we propose a generic integration of context-knowledge within the unary potentials of Conditional Random Fields (CRF) for object segmentation and classification. Our aim is to learn object-context from the background class of partially labeled images which we call implicit scene context (ISC). A CRF is set up on image super-pixels that are clustered into multiple classes. We then derive context histograms capturing neighborhood relations and integrate them as features into the CRF. Classification experiments with simulated data, eTRIMS building facades, Graz-02 cars, and samples downloaded from Google show significant performance improvements.

Wegner

222

Channel Coding for Joint Colour and Depth Segmentation Marcus Wallenberg, Michael Felsberg, Per-Erik Forssén, and Babette Dellen Linköping University Abstract. Segmentation is an important preprocessing step in many applications. Compared to colour segmentation, fusion of colour and depth greatly improves the segmentation result. Such a fusion is easy to do by stacking measurements in different value dimensions, but there are better ways. In this paper we perform fusion using the channel representation, and demonstrate how a state-of-the-art segmentation algorithm can be modified to use channel values as inputs. We evaluate segmentation results on data collected using the Microsoft Kinect peripheral for Xbox 360, using the superparamagnetic clustering algorithm. Our experiments show that depth gradients are more useful than depth values for segmentation, and that channel coding both colour and depth gradients makes tuned parameter settings generalise better to novel images.

223

Wallenberg

Non-parametric item response models for scale construction and adaptive testing Otto B. Walter University of Bielefeld, Department of Psychology, Universitätsstraße 25, D-33615 Bielefeld otto [email protected] Abstract. Item response models express the probability that a test person selects a certain response option as a function of person and item properties. Parametric item response models define this relation by a mathematical expression containing person and item properties as parameters. In contrast, non-parametric item response models do not rely on a specific parametrization. In these models, the relation between item and person properties and the probability of choosing a certain response option is estimated directly from the data. Non-parametric models have relatively weak assumptions and are an interesting alternative to parametric item response models that are still mainly used in applications. Using two real data sets from achievement and personality testing, the talk discusses how non-parametric item response models can be used for scale construction and adaptive testing.

Walter

224

Sensitivity of divergence measures as structure similarity measurements Ewa Wędrowska University of Warmia and Mazury, Poland Abstract. The analyses of social and economic phenomena often involve the issue of similarity between business objects characterized by structure indicators. Usually, measures used for quantifying similarity or the lack of similarity between structures are a function of the distance metrics of their partial indicators. An examination of the similarity between structures can also apply divergence measures. This article indicates the possibility of using Csiszár's class of divergence measures (f-divergences), in particular: Hellinger discrimination, triangular discrimination, symmetric Chi-square divergence, arithmetic-geometric mean divergence, Kullback-Leibler divergence and Jensen-Shannon divergence, to evaluate the degree of discrepancy between structures. The aim of the article is to examine the sensitivity of the indicated measures to changes in the degree of discrepancy between structures.
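A minimal sketch of several of the listed f-divergences for two discrete structures with strictly positive shares; the normalising constants follow one common convention and may differ from those used in the article.

```python
# A few f-divergences between two structures p and q (nonnegative shares summing to one,
# assumed strictly positive so that the logarithms and ratios are well defined).
import numpy as np

def hellinger(p, q):        return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q))**2)
def triangular(p, q):       return np.sum((p - q)**2 / (p + q))
def sym_chi_square(p, q):   return np.sum((p - q)**2 * (p + q) / (p * q))
def kullback_leibler(p, q): return np.sum(p * np.log(p / q))
def jensen_shannon(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kullback_leibler(p, m) + 0.5 * kullback_leibler(q, m)

p = np.array([0.2, 0.3, 0.5])      # two example structures
q = np.array([0.25, 0.25, 0.5])
for f in (hellinger, triangular, sym_chi_square, kullback_leibler, jensen_shannon):
    print(f.__name__, round(float(f(p, q)), 5))
```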

Keywords Csiszár's divergence, similarity of structure.

225

Wędrowska

Making better use of subject indexing data - geographic search with SWD country codes Heidrun Wiesenmueller Hochschule der Medien Stuttgart [email protected] Abstract. Only a fraction of the information stored in subject authority records is made usable for user searches by today's OPACs. Using the ISO country codes as an example, it is shown how the input-output ratio of libraries' subject indexing effort can be improved. These codes are recorded not only in records for geographic entities, but also, for example, for persons and corporate bodies. If they are made searchable in the OPAC, they can serve as a basis for restricting a search by geographic region. This increases the recall for queries of the type "tourism in Baden-Württemberg" or "climate in Africa", sometimes dramatically, without degrading precision, because the country codes also retrieve literature on smaller geographic units (e.g. districts, cities, landscapes) that remains hidden in a simple subject search. In the HEIDI catalogue of Heidelberg University Library (UB Heidelberg) and the Primo catalogue of Mannheim University Library (UB Mannheim), country-code search was recently implemented prototypically in the form of a drill-down menu.

Wiesenmueller

226

Solving complex optimization problems with many parameters by means of optimally designed block-relaxation algorithms Tom F. Wilderjans, Iven Van Mechelen, and Dirk Depril Research Group of Quantitative Psychology and Individual Differences, Katholieke Universiteit Leuven, Tiensestraat 102, box 3713, 3000 Leuven, Belgium. Email: [email protected] Abstract. Many data analysis problems involve the optimization of a criterion that is a function of many parameters, with these parameters being continuous, discrete, or a combination of both. To deal with such optimization problems, the class of block-relaxation algorithms (with alternating least-squares algorithms being a specific instance of this class) may be most useful. The rationale behind the algorithms in this class is that the complex optimization problem is solved by alternatingly tackling a series of subproblems, each of which, considered separately, is easy to handle. For this purpose, the set of parameters is divided into a number of subsets, in such a way that the optimal values for the parameters in each subset (conditional on the current estimates of the parameters in all other subsets) can be determined easily (e.g., in terms of closed-form expressions). The algorithm then cycles through the different subsets and updates the parameters in each subset until there is no further improvement in the loss function value. When designing a block-relaxation algorithm, two choices need to be made, which may influence the performance of the algorithm: (1) the way in which the parameters are divided in subsets, and (2) the order in which the parameter subsets are updated. In this presentation, based on theoretical and empirical arguments, guidelines for optimally designing block-relaxation algorithms will be derived. As an illustration, these guidelines will be applied to the estimation of the INDCLUS model (i.e., a mixed discrete-continuous optimization problem). Different alternating least-squares algorithms for fitting the INDCLUS model will be proposed and compared to each other in an extensive simulation study. The simulation results support the derived guidelines. In particular, it is advisable to group together parameters that highly depend on each other.
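A generic, hedged sketch of such a block-relaxation scheme, with the blocks, their update rules and the loss supplied by the user; the rank-one alternating least-squares example at the end is purely illustrative.

```python
# Generic block-relaxation loop: cycle through parameter subsets, each updated
# conditionally optimally given the others, until the loss stops improving.
from typing import Callable, Dict, List
import numpy as np

def block_relaxation(params: Dict[str, np.ndarray],
                     block_order: List[str],
                     update_block: Dict[str, Callable],
                     loss: Callable,
                     tol: float = 1e-8,
                     max_iter: int = 500) -> Dict[str, np.ndarray]:
    current = loss(params)
    for _ in range(max_iter):
        for name in block_order:                 # order of the subsets is a design choice
            params[name] = update_block[name](params)
        new = loss(params)
        if current - new < tol:                  # monotone decrease; stop when it stalls
            break
        current = new
    return params

# Illustrative instance: rank-one alternating least squares for X ~ outer(a, b).
X = np.random.default_rng(2).standard_normal((20, 5))
params = {"a": np.ones(20), "b": np.ones(5)}
updates = {"a": lambda p: X @ p["b"] / (p["b"] @ p["b"]),
           "b": lambda p: X.T @ p["a"] / (p["a"] @ p["a"])}
fit = block_relaxation(params, ["a", "b"], updates,
                       loss=lambda p: np.sum((X - np.outer(p["a"], p["b"]))**2))
```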

Keywords INDCLUS, THREE-WAY SIMILARITY DATA, OVERLAPPING CLUSTERING, BLOCK-RELAXATION, ALTERNATING LEAST-SQUARES

227

Wilderjans

Multiple Instance Boosting for Face Recognition in Videos Paul Wohlhart, Martin Köstinger, Peter Roth, and Horst Bischof Graz University of Technology Abstract. For face recognition from video streams, cues such as transcripts, subtitles or on-screen text are often available. This information could be very valuable for improving the recognition performance. However, frequently this data cannot be associated directly with just one of the visible faces. To overcome these limitations and to exploit valuable information, we define the task as a multiple instance learning (MIL) problem. We formulate a robust loss function that describes our problem and incorporates ambiguous and unreliable information sources, and optimize it using Gradient Boosting. A new definition of the posterior probability of a bag, based on the Lp-norm, improves the ability to deal with varying bag sizes over existing formulations. The benefits of the approach are demonstrated for face recognition in videos on a publicly available benchmark dataset. In fact, we show that exploring new information sources can drastically improve the classification results. Additionally, we show its competitive performance on standard machine learning datasets.
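A small sketch of an Lp-norm (generalised mean) aggregation of instance posteriors into a bag posterior; the exponent and normalisation are illustrative and may differ from the paper's exact definition.

```python
# Lp-norm bag posterior: p = 1 gives the mean rule, large p approaches the max rule.
import numpy as np

def bag_posterior(instance_probs, p=4.0):
    """Aggregate per-face probabilities in a bag into a single bag probability."""
    q = np.asarray(instance_probs, dtype=float)
    return (np.mean(q**p))**(1.0 / p)

print(bag_posterior([0.1, 0.2, 0.9], p=1.0))    # mean rule
print(bag_posterior([0.1, 0.2, 0.9], p=16.0))   # close to the max rule
```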

Wohlhart

228

Bivariate binary classification using the ideal point classification model H.M. Worku, M. De Rooij, W.J. Heiser, and P. Spinhoven Leiden University, Psychological Institute, Netherlands Abstract. In social as well as medical studies the interest is often in simultaneous classification of participants/subjects on two or more variables. For example, in the Netherlands Study of Depression and Anxiety (NESDA) the scientific question of interest is how personality factors are related to the prevalence and association of depression and anxiety disorders. We approach this problem using the ideal point classification (IPC) model, a probabilistic multidimensional unfolding model. We are especially interested in the representation of marginal and conditional relationships between personality and depression/anxiety, and also in jointly modeling the relationship between personality and the comorbidity. Here identifiability issues are also re-investigated. In the IPC model the Euclidean space of the dependent variables is three-dimensional, in which the first dimension corresponds to the prevalence of depression, the second to the prevalence of anxiety, and the last to the comorbidity between depression and anxiety. Comparisons of IPC models with logistic regression models for clustered data will be provided.

Keywords Categorical data; Logistic Regression; Multidimensional Unfolding.

229

Worku

One-mode three-way analysis based on the result of one-mode two-way analysis Satoru Yokoyama1 and Akinori Okada2
1 Department of Business Administration, Faculty of Economics, Teikyo University. [email protected]
2 Graduate School of Management and Information Sciences, Tama University.

Abstract. Several analysis models for proximities have been introduced. While most of them were for one-mode two-way proximities, some analysis models which are able to analyze one-mode three-way proximities have been suggested in recent years. Yokoyama et al. (2009) suggested a one-mode three-way overlapping cluster analysis model based on Arabie and Carroll (1980). Furthermore, several studies comparing the results of analyses of one-mode two-way and of one-mode three-way proximities generated from the same source of data have been done, e.g. Yokoyama and Okada (2010). In the present study, the authors suggest an analysis of one-mode three-way proximities based on one-mode two-way analysis for overlapping clusters. To evaluate the necessity of one-mode three-way analysis, firstly one-mode three-way proximities are reconstructed from the clusters and weights obtained by one-mode two-way overlapping cluster analysis. Secondly, the reconstructed one-mode three-way proximities are subtracted from the original one-mode three-way proximities. The resulting differences, or subtracted proximities, are analyzed by the one-mode three-way overlapping cluster analysis model. The analysis discloses the components of the proximities which can be expressed by one-mode three-way analysis but not by one-mode two-way analysis.

References ARABIE, P. and CARROLL, J. D. (1980): MAPCLUS: A mathematical programming approach to fitting the ADCLUS model. Psychometrika, 45, 211–235. YOKOYAMA, S., NAKAYAMA, A., and OKADA, A. (2009): One-mode three-way overlapping cluster analysis. Computational Statistics, 24, 165–179. YOKOYAMA, S. and OKADA, A. (2010): External analysis of overlapping cluster analysis. The 28th Annual Meeting of the Japanese Classification Society, 13–14. (in Japanese)

Keywords ONE-MODE THREE-WAY DATA, ONE-MODE TWO-WAY DATA, OVERLAPPING CLUSTER ANALYSIS, PROXIMITY

Yokoyama

230

Multi-target Tracking in Crowded Scenes Jie Yu, Dirk Farin, and Bernt Schiele Robert Bosch GmbH Abstract. In this paper, we propose a two-phase tracking algorithm for multi-target tracking in crowded scenes. The first phase extracts an overcomplete set of tracklets as potential fragments of true object tracks by considering the local temporal context of dense detection scores. The second phase employs a Bayesian formulation to find the most probable set of tracks in a range of frames. A major difference to previous algorithms is that tracklet confidences are not directly used during track generation in the second phase. This decreases the influence of those effects which are difficult to model during detection (e.g. occlusions, bad illumination) on the track generation. Instead, the algorithm starts with a detection-confidence model derived from a trained detector. Then, tracking-by-detection (TBD) is applied on the confidence volume over several frames to generate tracklets, which are considered as enhanced detections. As our experiments show, the detection performance of the tracklets significantly outperforms that of the raw detections. The second phase of the algorithm employs a new multi-frame Bayesian formulation that estimates the number of tracks as well as their locations with an MCMC process. Experimental results indicate that our approach outperforms the state-of-the-art in crowded scenes.

231

Yu

Index

¨ u, 206 Unl¨ Diaz-Aviles, 48 Abou-Moustafa, 1 Ackermann, 183 Adler, 2 Akcatepe, 3 Akkucuk, 4 Albatineh, 5 Albert, 180 Alexandrov, 106 Alexandrovich, 6 Arends, 7 Askarova, 38 Atkinson, 36 Aubry, 8 B¨ uchel, 30 Badescu, 22 Baier, 43, 72, 151, 170, 184, 220 Balakrishnan, 9 Banz, 182 Bartel, 143, 144 Batagelj, 10 Bauckhage, 64 Bauer, 11 Baumgart, 12 Baust, 13 Bavaud, 14 Behnisch, 199 Belo, 15 Benhimane, 116

Benoit, 16 Bernau, 17, 25 Bertrand, 18 Bessler, 19 Bischl, 20, 21, 125 Bisson, 82 Black, 207 Bladenopoulos, 136 Blume, 182 Boc, 22 Bodesheim, 23 Bohak, 24 Boulesteix, 17, 25 Bouveyron, 26 Braun, 27 Brito, 28 Brzezinska, 29 Buhmann, 216 Bulla, 31 Burger, 32 Burkhardt, 191 Busing, 33 Carlsson, 34 Carroll, 4 Celeux, 35 Cerdeira, 179 Cerioli, 36 Ceulemans, 46, 211, 215 Chavent, 37, 100 Chernyak, 38 Chudy, 39

Chugunova, 38 Coretto, 91 Cremers, 8 Cuevas-Covarrubias, 40, 218 Cui, 41, 42 D’Ambrosio, 88 Damer, 186 Daniel, 43 Dannemann, 187 De Angelis, 44 De La Torre, 1 De Rooij, 45, 152, 229 De Roover, 46 Dedovic, 110 Demongeot, 54 Denkowska, 194 Denzler, 87 Depril, 227 Diallo, 22 Dias, 28, 49 Diaz-Aviles, 47 Dichtl, 50 Dixon, 39 Dlugosz, 51, 52 Dodt, 20 Dolata, 144 Domenach, 53 Doreian, 112 Douzal-Chouakria, 54 Dragiev, 126 Dragon, 55 Drareni, 56 Drauschke, 57 Drayer, 58 Drobetz, 50 Ebert, 59 Eckert, 196 Eichhoff, 214 Eilers, 60 Elhayek, 61 Elias Xavier, 62 Esposito, 63 Evangelidis, 64 Fanelli, 65 Farin, 231

Felsberg, 111, 223 Ferligoj, 112 Ferrie, 1 Fijorek, 194 Fischer, 186 Fober, 66 Foerstner, 57 Forssen, 223 Fr¨ uhwirth-Schnatter, 73 Frambourg, 54 France, 67 Franke, 68, 70 Frey, 69 Frick, 70, 71 Fritz, 59 Fritz,, 130 Frost, 72 Gall, 65 Ganzenmueller, 74 Garc´ıa-Escudero, 130 Gaschler, 75 Gaul, 76–78 Gaussier, 54 Gavrila, 101, 117 George, 102 Georgescu, 48 Gertheiss, 79 Gettler Summa, 203 Gey, 80 Geyer-Schulz, 81 Godehardt, 180 Gordaliza, 130 Gormley, 133 Gower, 210 Gr¨ un, 95 Grimal, 82 Gross, 83 Grundke, 84 Gu´enoche, 86 Gudicha, 85 H¨ orstermann, 104 H¨ ullermeier, 66, 105 Haase, 87 Handmann, 161 Harmeling, 32 Hatzinger, 94, 171

234

Heiser, 88, 229 Heller, 89 Helten, 90 Hennig, 91, 92 Hermes, 101 Herzog, 93 Hildebrand, 30 Hillebrand, 122, 180 Hofmarcher, 94, 95 Hohensinn, 96 Hohmann, 97 Holzmann, 97, 187 Hornik, 94, 95, 172 Horvat, 3 Husson, 100 H¨ ullermeier, 135 Ilic, 116, 209 Imaizumi, 98 Indorf, 99 Irie, 147 Jacques, 26 Jamitzky, 17 Jank, 172 Josse, 100 Jud, 123 Kaciak, 175 Keller, 101 Keriven, 56 Kiefer, 102 Kiers, 103 Kim, 137 Kirchner, 137, 138 Klages, 76 Klapproth, 104 Kn¨ oller, 105 Kobarg, 106 Koch, 21, 70, 153 Koeser, 107 Koestinger, 228 Koh, 105 Kohli, 159 Konen, 21, 153 Kononov, 108 Konushin, 108 Kovaleva, 141

Kr¨ atzsch, 109 Kr¨ ohne, 69 Krajsek, 110 Krebs, 111 Krolak-Schwerdt, 104 Kronegger, 112 Kubinger, 96 Kubus, 113 Kuentz, 37 Kuziak, 114 Leal-Taixe, 160 Lee, 172 Leischner, 20 Liberati, 115 Lichtenberg, 180 Lieberknecht, 116 Liem, 117 Limam, 134 Lindpointner, 118 Lindsay, 119 Liquet, 37, 100 Loetsch, 120 Loureiro, 121, 179 Lucas Drumond, 3 Luebke, 122, 164 Luethi, 123 Luetz, 124 Lukashevich, 125 M¨ ullensiefen, 146 M¨ uller-Funk, 12, 30 Maier, 171 Mair, 95 Makarenkov, 22, 126 Mariani, 115 Markos, 136 Markowski, 127 Marlet, 56 Marolt, 24 Martin, 148 Maruotti, 31 Mary-Huard, 80 Matr´ an, 128 Matr´ an-Bea, 130 Mattern, 129 Mayo-Iscar, 130 McLachlan, 131

235

McMorris, 132 McParland, 133 Mejri, 134 Mendez-Mendez, 218 Menexes, 136 Mernberger, 135 Metzen, 137, 138 Meulders, 139 Minami, 140 Mirkin, 38, 141 Mizuta, 142 Moerke, 55 Morales-Merino, 143 Mucha, 143, 144 Mueller, 90, 145, 189 Murakami, 147 Murillo, 205 Murphy, 176 Nadon, 126 Nagathil, 148 Nakayama, 149 Nappo, 150 Nascimento, 38 Naundorf, 151 Navab, 13 Nejdl, 47, 48 Ninaber, 152 Nowozin, 159 Nugent, 153 Nyakatura, 87 Okada, 149, 154, 230 Onghena, 46 Ortiz, 157 Orwat-Acedanska, 155 Ovelg¨ onne, 81 Owsinski, 156 Paas, 44 Pagani, 41, 42 Palm, 201 Palumbo, 203 Paradowski, 192 Pardo, 157 Piater, 198 Piontek, 158 Piza, 205

Piza-Volio, 217 Pletscher, 159 Poddig, 99 Pollefeys, 107 Pons-Moll, 160 Potapov, 2 R¨ oetter, 163 Rabe, 145 Rabie, 161 Rannacher, 145 Reif, 96 Reisert, 162, 191 Riani, 36 Richter, 25 Robitzsch, 102 Rojahn, 164 Rokita, 165 Romaniuk, 166 Rosenhahn, 55, 183, 222 Roth, 228 Rozkrut, 167 Rozmus, 168 Ruano, 169 Rudolph, 129 Rumstadt, 170 Rusch, 94, 171, 172 Rutkowska-Ziarko, 173 Sack, 174 Sagan, 175 Salamaga, 194 Salter-Townshend, 176 Samworth, 177 Sandner, 178 Sanfins, 15 Santana, 179 Saracco, 37 Scharr, 110 Schiele, 59, 231 Schier, 189 Schiffner, 11, 122, 180 Schindler, 181 Schlickewei, 8 Schlieker, 20 Schm¨ adecke, 182 Schmidt, 183 Schmidt-Thieme, 3

236

Schmitt, 184 Schouteden, 185 Schubert, 143 Schulz, 186 Schwaiger, 187 Schwarz, 188 Schweiger, 201 Seitz, 69 Shabanov, 189 Shah, 1 Shojima, 190 Siegmund, 93 Silva, 28 Skibbe, 162, 191 Sluzek, 192 Smits, 193 Soergel, 222 Sokolowski, 194 Souza, 205 Spinhoven, 229 Steinley, 195 Stelz, 79 Stewart, 47, 48 Stork, 196 Stricker, 41, 42 Stuckenschmidt, 196 Tarka, 197 Tautges, 90 Tayari, 53 Teney, 198 Th¨ ommes, 200 Thinh, 199 Thom, 201 Timmerman, 46 Tokuda, 202 Tortora, 203 Toyama, 204 Trejos, 205, 217 Trendafilov, 210 Trendtel, 206 Truong, 160 Trzpiot, 155 Tsoli, 207 Tsurumi, 149 Tuerlinckx, 202 Ultsch, 120, 199 Ummenhofer, 208

Unger, 209 Unkel, 210 Unl¨ u, 102 Van den Noortgate, 215 Van den Poel, 16 Van Deun, 185, 212, 213, 215 Van Mechelen, 185, 202, 211, 213, 227 Vande Gaer, 211 Vatolkin, 129, 163, 214 Vermunt, 85 Vervloet, 215 Vetter, 123 Vezhnevets, 216 Villalobos-Arias, 217 Villar-Pati˜ no, 218 Vincent, 77 Viroli, 219 Voekler, 220 Von Haeseler, 221 W¸edrowska, 225 Wahl, 209 Waitelonis, 174 Wallenberg, 223 Walter, 224 Wambach, 50 Wegner, 222 Weickert, 61 Weihs, 11, 20, 21, 122, 125, 134, 153, 163, 180, 214 Weise, 65 Welk, 61 Wiesenmueller, 226 Wiklund, 111 Wilderjans, 227 Winker, 200 Winkler, 78 Wohlhart, 228 Wolff, 19 Worku, 229 Xanthos, 14 Xavier, 205 Yokoyama, 154, 230 Yu, 231 Zach, 107 Zeileis, 172

237

238