Computer Exercise
August 31, 2010
In this computer exercise, we will study some examples of visualization methods and see how they can be used for exploratory analysis of observed data sets. The exercises are performed in the statistical package R [1] and the commercial software Qlucore Omics Explorer [2].

Basic R commands

R is freely available software for computational statistics which also provides nice graphical output. It can be downloaded from http://www.r-project.org/ . There is also a large collection of contributed packages implementing a wide variety of statistical methodology, which can be accessed and downloaded from http://cran.r-project.org/web/packages/ . To get information about a command or a function in R, use the help() command.
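For example, the documentation for the plot function is opened with help(plot) (or, equivalently, ?plot), and a contributed package, here the kernlab package used later in this exercise, is installed and loaded with

install.packages("kernlab")   # download and install a package from CRAN
library(kernlab)              # load it into the current session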
Part I - the usefulness of visualization

1. Start R. To appreciate the importance of visualization, we will first study a small collection of data sets, commonly known as the Anscombe quartet [3]. These are four data sets, each containing 11 observations of two variables, denoted xi and yi for i = 1, 2, 3, 4. It is available in R under the name anscombe. Have a first look at the data and add it to the current namespace by writing

anscombe
attach(anscombe)

and calculate the mean and variance of each variable by typing

mean(anscombe)
diag(var(anscombe))
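Note that in recent versions of R, mean() can no longer be applied to a whole data frame; the column-wise means (and, equivalently, the variances) can instead be obtained with, for example,

colMeans(anscombe)      # column-wise means
sapply(anscombe, var)   # column-wise variances, the same values as diag(var(anscombe))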
[1] R Development Core Team
[2] www.qlucore.se
[3] Anscombe (1973): Graphs in statistical analysis. The American Statistician 27(1):17-21
What can you note from these results? Furthermore, study the correlation between the variables in each data set and the best-fitting regression line in each case with the following commands (repeat for all four data sets):

cor(x1,y1)
lm(y1~x1)

What do these results tell you? Finally, plot each data set together with its regression line with the following commands:

par(mfrow=c(2,2))
for(i in 1:4){
  ff=y~x
  ff[2:3]=lapply(paste(c("y","x"),i,sep=""),as.name)
  plot(ff,data=anscombe,pch=16,col="red")
  abline(lm(ff),col="blue")
}

What can you conclude from these figures?
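As a compact alternative to typing the commands four times, the correlations and fitted coefficients for all four data sets can be collected in a single loop; this is just a convenience sketch, not part of the original exercise:

for(i in 1:4){
  x=anscombe[[paste("x",i,sep="")]]   # the i-th x variable
  y=anscombe[[paste("y",i,sep="")]]   # the i-th y variable
  fit=lm(y~x)                         # least-squares regression line
  cat("Data set",i,": cor =",round(cor(x,y),3),
      " intercept =",round(coef(fit)[1],3),
      " slope =",round(coef(fit)[2],3),"\n")
}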
Part II - simple visualization methods for small data sets

2. We will study a small data set consisting of 50 samples of iris flowers from each of three different species [4,5]. This data set is available in R under the name iris. For each of the flowers, four features have been measured: the petal length, the petal width, the sepal length and the sepal width. To study the distribution of each of the variables separately, type

par(mfrow=c(2,2))
hist(iris$Petal.Length)
hist(iris$Sepal.Length)
hist(iris$Petal.Width)
hist(iris$Sepal.Width)

What type of information can we obtain from these figures? Is it possible to conclude e.g. whether any of the variables are correlated with each other? Compare the histograms to the density estimates, obtained by
[4] Anderson (1935): The irises of the Gaspé Peninsula. Bulletin of the American Iris Society 59:2-5
[5] Fisher (1936): The use of multiple measurements in taxonomic problems. Annals of Eugenics 7:179-188
par(mfrow=c(2,2))
plot.density(density(iris$Petal.Length))
plot.density(density(iris$Sepal.Length))
plot.density(density(iris$Petal.Width))
plot.density(density(iris$Sepal.Width))

How are these visualizations related? Another useful way to visualize each variable separately is through a box-plot, which can be obtained with

par(mfrow=c(1,1))
boxplot(iris[,1:4])

Finally, create scatter plots of each variable, by

par(mfrow=c(2,2))
plot(iris$Petal.Length,col=iris$Species)
plot(iris$Sepal.Length,col=iris$Species)
plot(iris$Petal.Width,col=iris$Species)
plot(iris$Sepal.Width,col=iris$Species)

3. Next we study pairwise relationships between the four variables. Write

par(mfrow=c(1,1))
plot(iris[,1:4],col=iris$Species,bg=iris$Species,pch=20,
  main="Iris data (black=setosa,red=versicolor,green=virginica)")

What do the resulting figures show? What do they add to the information we could obtain from the one-dimensional visualizations?

4. A commonly used method for visualizing e.g. gene expression data sets is through a heatmap, which can be generated for the iris data with the command

par(mfrow=c(1,1))
heatmap(t(data.matrix(iris[,1:4])))

The heatmap command also produces hierarchical clusterings of both the samples and the variables in the data set (if we want to plot the heatmap without the dendrograms, add the arguments Rowv=NA,Colv=NA to the heatmap command). What can we conclude from this visualization? Is this what we would expect from looking at the relationships between the variables above? How is the clustering generated? We can use another dissimilarity measure, for example the correlation dissimilarity, by the following commands:
par(mfrow=c(1,1))
cordist=function(x) as.dist((1-cor(t(x)))/2)
heatmap(data.matrix(t(iris[,1:4])),distfun=cordist)

How does this change the results?
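A direct look at the correlation matrix of the four variables, and an explicit hierarchical clustering with the same correlation dissimilarity, can help answer the questions above. A minimal sketch, using the cordist function defined above and R's default complete linkage:

cor(iris[,1:4])                               # pairwise correlations between the four measurements
vardist=cordist(data.matrix(t(iris[,1:4])))   # the correlation dissimilarity used by the heatmap above
varclust=hclust(vardist)                      # hierarchical clustering, complete linkage by default
plot(varclust)                                # essentially the variable dendrogram of the heatmap (up to reordering)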
Part III - visualization and dimension reduction for large data sets

For data sets with many variables, the visualization methods used above become practically infeasible. We are also interested not only in relationships between pairs of variables, but in larger subsets. However, the number of dimensions that can be visualized at the same time is limited to three (or perhaps even two). Therefore, it is often desirable to approximate the data set as well as possible in low dimension, i.e. with just a few features, where we can explore it visually. If the number of variables is large, taking these features to be two of the original variables is not likely to approximate the entire data set very well. Therefore, many methods have been developed for creating aggregate features, i.e. combinations of the original variables, to use for visualization. Using aggregate features like this allows us to incorporate more of the information from the data set into just a few dimensions.

5. Principal Component Analysis (PCA) creates aggregate features (called principal components) accounting for as much as possible of the variance in the original data, while being uncorrelated with each other. Moreover, projecting the original samples onto the principal components creates a configuration which approximates the original, high-dimensional configuration as well as possible in a least-squares sense. We extract the principal components from the iris data and produce a graphical representation with the following code:

pciris=prcomp(iris[,1:4],center=TRUE,scale.=TRUE)
biplot(pciris)

The argument scale.=TRUE indicates that the original variables will be standardized before the principal components are extracted, by subtracting the mean and dividing by the standard deviation of each variable. In this way, each variable has the same influence on the analysis, independent of the individual variances. Can you think of any potential problems arising if such a standardization is applied?

The biplot is a convenient way to visualize both the approximate sample configuration and the principal components simultaneously. The samples are shown as numbers, and the variable contributions to the principal components are shown as arrows. Which variables have the strongest influence on each of the first two principal components? How can we extract this information from the biplot? Which variables appear to be strongly related to the apparent
separation of the samples into two groups? Have we seen this effect in any of the previous visualizations?

Under point 4 above, we noted that the petal length and the petal width were highly correlated. Now, we can see that they also appear to make similar contributions to each of the principal components. The handling of correlated variables in PCA is different from e.g. regression, where multicollinearity should be avoided. In PCA, groups of highly correlated variables often tend to be collected, with high weights, in the same principal component. Why is this reasonable?

The contribution of each variable to each of the principal components, as well as the standard deviations and the fraction of the variance explained by each of the principal components, can be extracted with the commands

pciris$rotation
pciris$sdev
summary(pciris)

With the command

screeplot(pciris)

we obtain the so-called scree plot, showing the variance accounted for by each of the principal components. Given that the variance can be seen as a measure of the information content of a principal component, the scree plot is often used to determine how many of the principal components are "important", i.e. how many are needed to obtain a good enough approximation of the data. How could this be done?

We can also extract principal components from the original (unstandardized) data matrix, by

pciris.unstandardized=prcomp(iris[,1:4],center=TRUE,scale.=FALSE)
biplot(pciris.unstandardized)

What is the difference compared to the visualization for the standardized matrix? The command

var(iris[,1:4])

gives the covariance matrix for the four variables. Can you explain the differences between the extracted components using this information? Compare also the scree plot to the one obtained for the standardized data.
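The proportion of variance explained, as reported by summary(pciris), can also be computed directly from the standard deviations; for example:

variances=pciris$sdev^2                      # variance of each principal component
round(variances/sum(variances),3)            # proportion of the total variance explained
round(cumsum(variances)/sum(variances),3)    # cumulative proportion, useful for choosing the number of components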
6. We will now study a larger example, given in the library ElemStatLearn. The data set (called nci) comes from the NCI60 cancer microarray project, and contains 64 samples of cell lines from different tumor and tissue types, for which the expression of 6,830 genes was measured [6]. In this case, it is clearly infeasible to get an overview of the data by plotting the variables pairwise. Instead, we extract the principal components and use the first two of these to visualize the data set. First, install the ElemStatLearn package, and load it by typing

library(ElemStatLearn)

Then do PCA with the following commands:

pcnci=prcomp(t(nci),center=TRUE,scale.=TRUE)
par(mfrow=c(1,2))
Types=c("1","1","1","2","3","1","1","3","4","4","2","2","2","2",
  "2","2","2","3","4","2","10","5","6","7","5","5","5","5","5","7",
  "4","4","4","8","10","10","8","8","8","8","8","9","9","9","9","9",
  "9","9","10","3","10","3","4","4","4","6","3","3","6","6","6","6",
  "6","6")
cha=rep(16,64)
cha[Types=="9"]=17
cha[Types=="10"]=17
plot(pcnci$x[,1],pcnci$x[,2],pch=cha,col=Types)
plot(pcnci$rotation[,1],pcnci$rotation[,2])

Is it possible to detect any interesting patterns in any of these figures (e.g. clusters of samples and variables responsible for the clustering)? Try also the biplot command. Is the resulting plot interpretable? Why/why not? What does the scree plot look like? How many principal components are necessary to account for a large enough part of the total variance?
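One way to approach the last question is to plot the cumulative proportion of variance explained; this is only a convenience sketch, not part of the original commands:

screeplot(pcnci)                                   # scree plot for the NCI60 data
propvar=pcnci$sdev^2/sum(pcnci$sdev^2)             # proportion of variance per principal component
plot(cumsum(propvar),type="b",
     xlab="Number of principal components",
     ylab="Cumulative proportion of variance explained")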
Part IV - kernel PCA

7. We will use a simple two-dimensional example data set to study the effect of applying kernel PCA instead of regular, linear PCA. We need the package kernlab, so install it and load it with library(kernlab). To create and plot the data set containing 300 samples, type
[6] For more information and additional visualizations, see http://genome-www.stanford.edu/nci60/index.shtml
n1=150
n2=100
n3=50
theta1=2*pi*runif(n1)
radius1=0.3*rnorm(n1)
theta2=2*pi*runif(n2)
radius2=3+0.3*rnorm(n2)
theta3=2*pi*runif(n3)
radius3=6+0.3*rnorm(n3)
class=c(rep("1",n1),rep("2",n2),rep("3",n3))
thetas=c(theta1,theta2,theta3)
radii=c(radius1,radius2,radius3)
X=radii*cos(thetas)
Y=radii*sin(thetas)
Data=cbind(X,Y)
par(mfrow=c(2,2))
plot(Data[,1],Data[,2],pch=16,col=class,main="Original data")

Now do regular (linear) PCA and plot the resulting two-dimensional projection of the samples:

LinPC=prcomp(Data,center=TRUE,scale.=FALSE)
plot(LinPC$x,pch=16,col=class,main="Regular (linear) PCA")

Also try kernel PCA with a polynomial kernel, K(x, y) = (x^T y)^2:

kpcPoly=kpca(Data,kernel="polydot",kpar=list(degree=2,scale=1,offset=0))
plot(rotated(kpcPoly),pch=16,col=class,main="Kernel PCA (polynomial kernel)")
and with a Gaussian kernel, K(x, y) = exp(-sigma ||x - y||^2), here with sigma = 0.1 (the parameterization used by kernlab's rbfdot kernel):
kpcGauss=kpca(Data,kernel="rbfdot",kpar=list(sigma=0.1)) plot(rotated(kpcGauss),pch=16,col=class,main="Kernel PCA (Gaussian kernel)")
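If you want to see how much each kernel principal component contributes, the eigenvalues of the centered kernel matrix can be inspected with the eig() accessor from kernlab; for example:

eig(kpcPoly)    # eigenvalues for the polynomial kernel
eig(kpcGauss)   # eigenvalues for the Gaussian kernel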
Part V - multidimensional scaling (MDS)

Multidimensional scaling is a method for reducing the dimensionality, and creating a visualization, of a data set for which we know only the pairwise distances between all objects. We search for a low-dimensional configuration of the objects such that the (often Euclidean) distances in this configuration match, as closely as possible in some well-defined sense, those given in the distance matrix.
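As a small illustration of the idea (not part of the original exercise), classical MDS applied to the Euclidean distances of a known planar point set recovers a configuration with the same pairwise distances, up to rotation, reflection and translation:

set.seed(1)
P=matrix(rnorm(20),ncol=2)    # 10 points in the plane
D0=dist(P)                    # their pairwise Euclidean distances
P2=cmdscale(D0,k=2)           # two-dimensional configuration reconstructed from the distances only
max(abs(dist(P2)-D0))         # essentially zero: the distances are reproduced exactly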
8. As a basic example of the application of MDS, we consider a matrix containing distances between some European cities. This matrix is created as follows:

D=matrix(c(0,36,104,99,164,146,57,107,124,183,79,83,90,142,31,105,
  36,0,69,67,126,123,25,74,94,148,45,55,105,107,62,81,
  104,69,0,21,58,114,54,38,61,79,27,61,161,46,130,83,
  99,67,21,0,72,133,59,56,82,89,22,75,167,65,128,100,
  164,126,58,72,0,120,107,64,71,26,84,99,206,26,185,108,
  146,123,114,133,120,0,101,77,54,143,121,69,130,96,151,42,
  57,25,54,59,107,101,0,50,68,131,37,32,108,86,79,59,
  107,74,38,56,64,77,50,0,26,89,50,36,142,39,127,49,
  124,94,61,82,71,54,68,26,0,95,75,42,143,45,140,39,
  183,148,79,89,26,143,131,89,95,0,105,125,232,52,209,134,
  79,45,27,22,84,121,37,50,75,105,0,57,145,71,106,84,
  83,55,61,75,99,69,32,36,42,125,57,0,107,75,97,28,
  90,105,161,167,206,130,108,142,143,232,145,107,0,181,66,106,
  142,107,46,65,26,96,86,39,45,52,71,75,181,0,165,82,
  31,62,130,128,185,151,79,127,140,209,106,97,66,165,0,113,
  105,81,83,100,108,42,59,49,39,134,84,28,106,82,113,0),nrow=16)
Cities=c("Stockholm","Lund","Paris","London","Madrid","Athens","Berlin",
  "Milan","Rome","Lisbon","Amsterdam","Vienna","Moscow","Barcelona",
  "Helsinki","Belgrade")
DF2=data.frame(D,row.names=Cities)
DM=as.dist(DF2)

Create a two-dimensional point configuration approximating the given distance matrix with the following code:

angle=39
loc=-cmdscale(DM)
g=matrix(c(cos(angle/180*pi),-sin(angle/180*pi),sin(angle/180*pi),
  cos(angle/180*pi)),nrow=2)
rotloc=matrix(loc,ncol=2)%*%g
x=rotloc[,1]
y=rotloc[,2]
plot(x,y,type="n",xlab="",ylab="",main="Classical MDS, European cities")
text(x,y,rownames(loc),cex=0.8)

Does the result seem reasonable?
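To check numerically how well the two-dimensional configuration reproduces the given distances (the rotation by g does not change them), the fitted distances can be compared to the original matrix; a quick sketch:

Dfit=dist(rotloc)                   # pairwise distances in the MDS configuration
plot(as.vector(DM),as.vector(Dfit),
     xlab="Given distance",ylab="Fitted distance")
abline(0,1,col="blue")              # points close to this line indicate a good approximation
cor(as.vector(DM),as.vector(Dfit))  # a simple numerical summary of the agreement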