Biostatistics Workshop - CiteSeerX

2 downloads 0 Views 460KB Size Report
12 60. Lemur_cat AAGCTTCATA GGAGCAACCA TTCTAATAAT CGCACATGGC CTTACATCAT CCATATTATT. Tarsius_s AAGTTTCATT GGAGCCACCA ...
Confidence Regions and Averaging for Trees Examples

Susan Holmes Statistics Department, Stanford and INRA- Biom´etrie, Montpellier,France [email protected] http://www-stat.stanford.edu/~susan/

1

Confidence Regions and Averaging for Trees Examples

Susan Holmes Statistics Department, Stanford and INRA- Biom´etrie, Montpellier,France [email protected] http://www-stat.stanford.edu/~susan/

Joint Work with Louis Billera (Mathematics, Cornell) Persi Diaconis (Mathematics and Statistics, Stanford) Karen Vogtmann (Mathematics, Cornell).

1

Phylogenetic Trees

2

Confidence Statements for trees

3

Confidence Regions

4

5

6

Phylogenetic Tree Example

Family trees or phylogenetic trees whose leaves are different evolutionary entities (species, genes, populations).

DNA Data for 12 species of primates Mitochondria, 898 characters on 12 species. Hayasaka, K., T. Gojobori, and S. Horai. 1988. Trees are built from DNA data such as the following: 12 60 Lemur_cat Tarsius_s Saimiri_s Macaca_sy Macaca_fa Macaca_mu Macaca_fu Hylobate Pongo Gorilla Pan Homosapie

AAGCTTCATA AAGTTTCATT AAGCTTCACC AAGCTTCTCC AAGCTTCTCC AAGCTTTTCT AAGCTTTTCC AAGCTTTACA AAGCTTCACC AAGCTTCACC AAGCTTCACC AAGCTTCACC

GGAGCAACCA GGAGCCACCA GGCGCAATGA GGTGCAACTA GGCGCAACCA GGCGCAACCA GGCGCAACCA GGTGCAACCG GGCGCAACCA GGCGCAGTTG GGCGCAATTA GGCGCAGTCA

TTCTAATAAT CTCTTATAAT TCCTAATAAT TCCTTATAGT CCCTTATAAT TCCTCATGAT TCCTTATGAT TCCTCATAAT CCCTCATGAT TTCTTATAAT TCCTCATAAT TTCTCATAAT

CGCACATGGC TGCCCATGGC CGCTCACGGG TGCCCATGGA CGCCCACGGG TGCTCACGGA CGCTCACGGA CGCCCACGGA TGCCCATGGA TGCCCACGGA CGCCCACGGA CGCCCACGGG

CTTACATCAT CTCACCTCCT TTTACTTCGT CTCACCTCTT CTCACCTCTT CTCACCTCTT CTCACCTCTT CTAACCTCTT CTCACATCCT CTTACATCAT CTTACATCCT CTTACATCCT

CCATATTATT CCCTATTATT CTATGCTATT CCATATACTT CCATGTATTT CCATATATTT CCATATATTT CCCTGCTATT CCCTACTGTT CATTATTATT CATTATTATT CATTACTATT

Color Coded version of the data, after alignment

Homosapie Pan Gorilla Pongo Hylobate Macacafu Macacamu Macacafa Macacasy Saimiris Tarsiuss Lemurcat

7

TwoTrees Built with this data by parsimony +---Homosapien +--Homosapien +-11 +-11 ! ! +--Pan +-10 +--Pan +--9 +-10 ! ! ! ! +--Gorilla +--9 +-----Gorilla +--------8 ! ! ! ! ! +--------Pongo +------8 +--------Pongo ! ! ! ! ! +----------Hylobates ! +----------Hylobates +--7 ! ! ! +--Macaca_fus +--7 +--Macaca_fus ! ! +--6 ! ! +--6 ! ! +-5 +--Macaca_mul ! ! +--5 +--Macaca_mul +--3 ! ! ! ! ! ! ! ! ! +----------4 +-----Macaca_fas +--3 +---------4 +-----Macaca_fas ! ! ! ! ! ! +--2 ! +-------Macaca_syl ! ! +--------Macaca_syl ! ! +---------------------Saimiri_sc +--2 +---------------------Saimiri_sc -1 ! --1 ! ! +------------------------Tarsius_sy ! +------------------------Tarsius_sy +---------------------------Lemur_catt +---------------------------Lemur_catt

8

With the Bootstrap “Confidence” Values +----Macaca mul +-100.0 +-99.5 +----Macaca fus ! ! +------100.0 +---------Macaca fas ! ! ! +--------------Macaca syl +-79.4 ! ! +-------------------Hylobates ! ! ! ! ! ! +----Homosapien ! +-99.0 +-50.2 +-100.0 ! +-100.0 +----Pan ! ! ! ! ! ! ! +-89.0 +---------Gorilla ! ! ! +-100.0 ! +--------------Pongo ! ! ! ! ! +-----------------------------Saimiri sc ! +----------------------------------Tarsius sy +---------------------------------------Lemur catt

9

10

Using the geometry

4

3

2

1

4

3

2

1

10

Using the geometry

4

3

2

14

3

2

14

3

2

1

10

Using the geometry

4

3

2

14

3

2

14

Already used by biolgists to indicate ’unresolved’ branchings.

3

2

1

The Wine Data

11

wine 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

D28 1.09 0.61 0.90 1.11 0.85 0.98 1.24 0.92 1.18 1.13 1.04 1.27 1.23 1.34 1.28 1.17 1.31 0.91 0.93 1.08 1.11 0.90 1.09 0.80 0.76 0.86 0.63 0.65 0.69 0.70 0.69 0.78 0.59 0.58 0.66 0.79

TAN 2.09 1.27 1.68 2.00 1.50 2.00 2.41 1.63 2.00 2.32 2.36 2.69 3.09 3.00 3.00 2.23 2.81 1.63 1.63 2.01 2.27 1.76 2.06 1.76 1.62 1.71 1.41 2.11 1.34 1.25 1.27 1.68 1.47 1.24 1.45 1.41

IGEL 45 35 38 28 37 41 58 50 59 62 61 69 62 64 75 68 65 39 64 44 30 54 55 35 31 39 33 48 22 33 33 39 29 33 19 72

IHCL 13 20 29 26 40 26 27 18 30 25 30 38 22 28 33 20 58 21 12 30 9 23 26 19 24 20 26 26 27 16 19 17 20 34 10 28

COL 1.06 0.65 0.84 0.74 0.66 0.99 0.83 0.58 0.84 0.84 0.69 0.91 0.88 0.89 1.00 1.06 1.00 0.59 0.62 0.79 0.69 0.80 1.21 0.62 0.55 0.67 0.59 0.66 0.60 0.59 0.44 0.65 0.45 0.41 0.31 0.54

JAU 64 40 36 37 39 36 38 40 39 40 40 40 42 42 42 42 42 47 46 44 46 46 43 47 47 47 39 37 38 43 43 41 42 41 44 47

RGE 52 45 49 51 49 51 50 47 48 48 47 47 46 46 46 45 45 42 42 43 42 41 42 41 42 41 49 50 51 45 44 46 44 47 46 42

BLE 14 16 15 12 12 13 12 12 13 12 13 13 12 12 12 13 13 11 12 13 12 13 15 12 11 12 12 13 11 13 13 13 14 12 10 10

DFLV 54 38 48 51 48 51 49 45 46 46 44 43 41 42 41 39 39 32 32 35 30 29 31 28 31 28 47 49 51 38 37 40 38 43 41 32

TNT 0.64 0.89 0.74 0.74 0.80 0.72 0.77 0.85 0.80 0.82 0.85 0.85 0.92 0.91 0.91 0.92 0.94 1.10 1.08 1.02 1.09 1.11 1.03 1.15 1.13 1.14 0.80 0.75 0.74 0.96 0.97 0.91 0.94 0.89 0.96 1.11

ANT 413 220 422 527 369 316 316 229 316 281 299 352 255 141 202 105 123 132 149 141 229 88 158 62 202 44 290 387 308 185 246 229 123 141 202 44

PVP 40 48 46 27 38 67 44 15 39 41 53 68 62 44 83 83 71 73 76 38 46 70 94 86 78 40 45 55 40 19 39 46 71 63 70 80

ION 26 22 22 11 17 16 16 15 15 9 7 11 10 11 18 57 43 25 19 21 13 37 27 36 12 16 6 3 7 14 9 15 16 13 7 17

PH 3.80 3.80 3.84 3.83 3.78 3.75 3.69 3.70 3.80 3.82 3.84 3.85 3.82 3.76 3.82 3.76 3.71 3.76 3.88 3.90 4.08 4.08 4.28 3.94 4.04 4.13 3.66 3.63 3.65 3.85 3.78 3.87 3.72 3.69 3.66 3.91

12 cepag Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Syrah Carign Carign Carign Carign Carign Carign Carign Carign Carign Carign

13

Syrah

Classification And Regression Trees [?]

52/78

INTCOL0.505

Cinsault

Syrah

8/34

18/44 D281.06

INTCOL0.405

Cinsault

Carignan

Carignan

Syrah

1/21 ANTCY197.5

6/13 TEINTE1.095

12/30 ION37.5

0/14

Cinsault

Carignan

Cinsault

Carignan

Syrah

Carignan

0/20

0/1

3/9

0/4

10/22 PH3.68

0/8

PH3.79

Carignan

Cinsault

Carignan

0/3

0/6

0/4

Syrah

6/18 JAUNE40.5 Syrah

Syrah

0/6

6/12 PH3.93 Carignan

Syrah

2/8 D280.905

0/4

Carignan

Syrah

0/4

2/4 D280.975 Syrah

Carignan

0/2

0/2

14

Syrah 52/78 INTCOL0.505 Cinsault 8/34

Syrah 18/44

INTCOL1.06

Carignan 6/13

Carignan 12/30

D280.675 Carignan 1/6

Cinsault 2/7

ION37.5 Syrah 10/22

Carignan 0/8

TANINS2.01 Syrah 5/16

Carignan 1/6

D280.795 Carignan 2/7

Syrah 0/14

Syrah 0/9

Clustering Trees - For micro-arrays [?]

15

16

16

16

16

16

16



How can abstract mathematics help?

• Decompositions that can be generalisable. • Geometric Picture of Tree Space ? A space for comparisons. ? Ways of projecting. ? Follow trees as they change, (paths of trees) ? Centroids of trees ? Neighborhoods (convex hulls of trees).... ? Averages of trees

17

How can abstract mathematics help?

• Decompositions that can be generalisable. • Geometric Picture of Tree Space ? A space for comparisons. ? Ways of projecting. ? Follow trees as they change, (paths of trees) ? Centroids of trees ? Neighborhoods (convex hulls of trees).... ? Averages of trees • Justification of commonsense, ground for generalizations.

17

References [1] Y. Amit and D. Geman, Quantization and recognition with randomized trees, Neural Computation, 9 (1997), pp. 1545–1588. [2] L. Billera, S. Holmes, and K. Vogtmann, Geometry of tree space, to appear, Statistics, Stanford, 1999. [3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984. [4] M. Charleston, ??, http://taxonomy.zoology.gla.ac.uk/~mac/landscape /trees.html., ?? (1996), pp. –. [5] B. Charnomordic and S. Holmes, Dnaview, an interactive viewer for alignment and tree building, unpublished software, (1997). [6] P. Diaconis, Group Representations in Probability and Statistics, Institute of Mathematical Statistics, 1988. [7]

, A generalization of spectral analysis with application to ranked data, The Annals of Statistics, 17 (1989), pp. 949–979.

[8] P. W. Diaconis and S. P. Holmes, Matchings and phylogenetic trees, Proc. Natl. Acad. Sci. USA, 95 (1998), pp. 14600–14602 (electronic). [9] B. Efron, E. Halloran, and S. P. Holmes, Bootstrap confidence levels for phylogenetic trees, Proc. Natl. Acad. Sci. USA, 93 (1996), pp. 13429–34. [10] B. Efron and R. Tibshirani, The problem of regions, Ann. Statist., 26 (1998), pp. 1687– 1718.

18

[11] M. Eisen, P. Spellman, P. Brown, and D. Botstein, Cluster analysis and display of genome-wide expression patterns., Proc Natl Acad Sci USA, (1998). [12] J. Friedman, Greedy function approximation: A gradient boosting machine, tech. rep., Stanford Statistics Dept., 1999. http://www-stat.stanford.edu/˜jhf/trebst.ps. [13] M. Gromov, Hyperbolic groups, in Essays in group theory, Springer, New York, 1987, pp. 75–263. [14] K. Hayasaka, T. Gojobori, and S. Horai, Molecular phylogeny and evolution of primate mitochondrial dna., Mol. Biol. Evol., 5/6 (1988), pp. 626–644. [15] S. Holmes, Phylogenies: An overview, in Statistics and Genetics, E. Halloran and S. Geisser, eds., no. 81 in IMA, Springer Verlag, NY, 1999. [16] S. Holmes and P. Diaconis, Computing with Trees, Interface Foundation of North America, 1999. [17] S. Li, D. K. Pearl, and H. Doss, Phylogenetic tree construction using mcmc, Journ. American Statistical Association (to appear)., (2000). Ohio Statistics Dept., (). [18] B. MacFadden and R. Hulbert Jr, Explosive speciation at the base of the adaptive radiation of miocene grazing horses, Nature, 336 (1988), pp. 466–468. [19] B. Mau, M. A. Newton, and L. B., Bayesian phylogenetic inference via markov chain monte carlo methods, Biometrics, (1999). ¨ der, Zeit. f¨ur. Math. Phys., 15 (1870), pp. 361–376. [20] E. Schro [21] G. L. Thompson, Generalized permutation polytopes and exploratory graphical methods for ranked data, The Annals of Statistics, 21 (1993), pp. 1401–1430. [22] G. M. Ziegler, Lectures on polytopes, Springer-Verlag, New York, 1995.

19