Project 3: Dealing with SNP data and other practical

Tanja Pyhäjärvi Basics of population genetics

Project 3: Dealing with SNP data and other practical population genetics computing skills

Objective
Learn to deal with large datasets and to prepare input files, for example for Arlequin. Introduction to Tassel and Arlequin. Interpreting and dealing with outputs. AMOVA, LD, basic diversity indices, FIS, selection.

Schedule

Date    Task

29.1.   Task 1. Prepare an Arlequin input file using our mtDNA frequency data.
        Task 2. Use the input file to analyze population structure using Arlequin (FST, pairwise FST, AMOVA etc.).

4.2.    Task 3. Open the large teosinte SNP dataset and understand how it differs from the mtDNA dataset.
        Task 4. Calculate basic population genetics statistics for the teosinte dataset.

5.2.    Task 5. Use Tassel to calculate LD for teosinte data. Find out how and why LD varies across the genome and among populations.
        Task 6. Take meaningful subsets of the data and understand how that alters the results.

11.2.   Task 7. Inferring selection from population differentiation using Arlequin. FST outliers.
        Task 8. Using databases to acquire more information about individual SNPs.

Task 1
Get familiar with the Arlequin file format and prepare a working input file from your mtDNA data. Download a simple example file from ArlequinExample.arp. First try to open the example file in Arlequin. Check parts 2.4, 3.2, 4 and 5 of the Arlequin manual for advice about preparing input files. Figure out what data type your data is and follow the instructions.

Arlequin tips
- Arlequin manual: http://cmpg.unibe.ch/software/arlequin35/man/arlequin35.pdf
- Use a text editor to prepare the input file. (Notepad++ is recommended; you need to see all the invisible characters.)
- Start with e.g. 2 populations and, if that works, add the rest.
- Save the file with the .arp extension.
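Typing the file by hand is the point of Task 1, but generating it from a table of counts can help with spotting stray characters. The following is a minimal Python sketch, not part of the course material. It assumes haplotype frequency data given as absolute counts; the population names, haplotype labels and counts are made up for illustration, and the keyword layout follows the frequency-data examples in the Arlequin manual, which should be checked against your own data type.

```python
def build_arp(title, samples):
    """Return the text of an Arlequin project for haplotype frequency data.

    `samples` maps a population name to a dict {haplotype_label: count}.
    Counts are absolute, so each sample size is the sum of its counts.
    """
    lines = [
        "[Profile]",
        f'  Title="{title}"',
        f"  NbSamples={len(samples)}",
        "  DataType=FREQUENCY",
        "  GenotypicData=0",
        "  Frequency=ABS",
        "",
        "[Data]",
        "  [[Samples]]",
    ]
    for name, counts in samples.items():
        lines.append(f'    SampleName="{name}"')
        lines.append(f"    SampleSize={sum(counts.values())}")
        lines.append("    SampleData= {")
        for haplotype, count in counts.items():
            # One haplotype per line: label, a tab, then its count.
            lines.append(f"      {haplotype}\t{count}")
        lines.append("    }")
    return "\n".join(lines) + "\n"


# Hypothetical data: two populations, three haplotypes.
example_samples = {
    "PopA": {"H1": 12, "H2": 8},
    "PopB": {"H1": 5, "H3": 10},
}
arp_text = build_arp("mtDNA sketch", example_samples)
print(arp_text)
```

Writing `arp_text` to a file whose name ends in .arp gives something to try opening in Arlequin; starting with two populations, as suggested above, keeps any error easy to locate.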

Task 2
Use the input file to analyze population structure using Arlequin.
1. Open Arlequin 3.5.
2. Open project -> choose your data.
3. Check the project information.
4. Check the population structure: is the number of populations correct?
5. Go to settings to choose which analyses you want to do. Start with FST, pairwise FST and AMOVA.
6. Hit start.
7. Wait for the analyses to run; the results should open in a browser.
8. Interpret the output file. Do the Arlequin results differ from those calculated by hand?
Extra task: change the Structure part of the input file, add hierarchical levels (e.g. divide into Southern, Central and Northern Europe) and redo steps 2-8. How do your results differ from the first round?

Task 3
1. Download the teosinte SNP dataset from Arlequin20popsParvMex.arp. Its size is about 30 MB.
2. Open the input file in Notepad++ or equivalent.
   a. How many loci are in the dataset?
   b. How many individuals, populations and genotypes?
   c. How is the population structure predefined? (Figure 1)

Figure 1. Sampled teosinte populations
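Scrolling through a 30 MB file in an editor is slow, so it can help to cross-check what you see against a quick summary. The sketch below is a Python illustration under assumptions, not part of the course material: it only looks for the NbSamples, DataType, SampleName and SampleSize keywords and does not count loci, because the layout of the data lines depends on the data type. The small project text used here is invented.

```python
import re


def summarize_arp(text):
    """Summarize the samples declared in the text of an Arlequin project."""
    declared = re.search(r"NbSamples\s*=\s*(\d+)", text, flags=re.I)
    data_type = re.search(r"DataType\s*=\s*(\w+)", text, flags=re.I)
    names = re.findall(r'SampleName\s*=\s*"([^"]*)"', text, flags=re.I)
    sizes = [int(s) for s in re.findall(r"SampleSize\s*=\s*(\d+)", text, flags=re.I)]
    return {
        "declared_samples": int(declared.group(1)) if declared else None,
        "data_type": data_type.group(1) if data_type else None,
        "sample_sizes": dict(zip(names, sizes)),
        "total_size": sum(sizes),
    }


# Invented miniature project, only to show the output shape.
example_text = """[Profile]
  Title="toy"
  NbSamples=2
  DataType=STANDARD
  GenotypicData=1
[Data]
  [[Samples]]
    SampleName="pop1"
    SampleSize=3
    SampleData= {
    }
    SampleName="pop2"
    SampleSize=2
    SampleData= {
    }
"""
summary = summarize_arp(example_text)
print(summary)
```

For the real dataset the text would come from reading the downloaded .arp file. If the number of SampleName entries differs from NbSamples, the file is probably malformed.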

Task 4
Find answers to these questions using Arlequin.
1. What is the amount of diversity in each population?
2. What is the average inbreeding coefficient in each population?
3. Do some populations deviate from the major patterns of diversity?
4. Calculate pairwise FST between all pairs of populations. Given the geographical locations of the populations, what can you conclude?
5. Is there population structure in general?
6. What is the overall FST?
7. Do AMOVA and interpret the results: What proportion of variance is observed among populations? Among subspecies? What is the difference between FST and FCT? Is there significant structure at all levels?
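To interpret the numbers Arlequin reports, it helps to have computed the same kinds of quantities once for a single biallelic SNP. The following Python sketch is an illustration with made-up genotype counts, not part of the course material. It uses the textbook definitions (expected heterozygosity 2p(1-p), FIS = 1 - Hobs/Hexp, FST = (HT - HS)/HT with equally weighted populations) without sample-size corrections; Arlequin's estimates come from an AMOVA framework, so its values need not match these exactly.

```python
def allele_freq(n_AA, n_Aa, n_aa):
    """Frequency of allele A from genotype counts at one biallelic SNP."""
    n = n_AA + n_Aa + n_aa
    return (2 * n_AA + n_Aa) / (2 * n)


def expected_het(p):
    """Expected heterozygosity under Hardy-Weinberg proportions."""
    return 2 * p * (1 - p)


def fis(n_AA, n_Aa, n_aa):
    """Inbreeding coefficient 1 - Hobs/Hexp; NaN if the SNP is monomorphic."""
    n = n_AA + n_Aa + n_aa
    h_exp = expected_het(allele_freq(n_AA, n_Aa, n_aa))
    if h_exp == 0:
        return float("nan")
    return 1 - (n_Aa / n) / h_exp


def fst(freqs):
    """FST = (HT - HS) / HT for a list of per-population allele frequencies."""
    p_bar = sum(freqs) / len(freqs)
    h_t = expected_het(p_bar)  # heterozygosity of the pooled population
    h_s = sum(expected_het(p) for p in freqs) / len(freqs)  # mean within populations
    if h_t == 0:
        return float("nan")
    return (h_t - h_s) / h_t


# Made-up genotype counts (AA, Aa, aa) for two populations.
fis_pop1 = fis(25, 50, 25)  # genotypes in Hardy-Weinberg proportions
fis_pop2 = fis(40, 20, 40)  # deficit of heterozygotes
fst_pair = fst([0.2, 0.8])
print(fis_pop1, fis_pop2, fst_pair)
```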

Arlequin tips
- Try connecting Arlequin to Rcmd.exe and a text editor through the Arlequin configuration.
- Click View log file to see what the program is doing.
- For initial runs, use a low number of permutations (e.g. 100) to avoid waiting a very long time for results.
- Open the results in Firefox.

Task 5. Use Tassel to calculate LD for teosinte data. Find out how and why LD varies across the genome and among populations.
1. To get started, download and unzip the teosinte data from HapMapFiles.zip.
2. Launch TASSEL 3.0 (950Mb Heap Size).
3. Take a look at the TASSEL User guide: http://www.maizegenetics.net/tassel/docs/Tassel_User_Guide_3.0.pdf
4. Upload data to TASSEL: Data -> Load -> Load Hapmap. It is practical to analyze one chromosome at a time.
5. To analyze LD, go to Analysis -> link. diseq.
6. To view the results, choose the result, then Result -> LD Plot.
7. To export results, choose Results -> Table -> Export (Tab).
8. Open the exported result file in Notepad. What does each line indicate?
9. Try to open it in Excel. What happens, and why?
10. Try R. Here are some useful commands:
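Each line of the exported table describes one pair of sites, and the r2 value for a pair can be reproduced by hand for a small example. The following Python sketch is an illustration under assumptions, not part of the course material: it takes phased haplotypes coded 0/1 at two biallelic sites and uses the standard definitions D = pAB - pA*pB and r2 = D^2 / (pA(1-pA)pB(1-pB)). How TASSEL treats heterozygous or missing genotypes is not reproduced here, and the haplotypes are invented.

```python
def r_squared(haplotypes):
    """r2 between two biallelic sites.

    `haplotypes` is a list of (a, b) pairs, where a and b are 0 or 1 and
    give the allele carried at site A and site B on one chromosome.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r2 is undefined when a site is monomorphic")
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / denom


# Invented haplotype samples.
r2_complete = r_squared([(1, 1), (1, 1), (0, 0), (0, 0)])     # alleles always together
r2_independent = r_squared([(1, 1), (1, 0), (0, 1), (0, 0)])  # all combinations equally common
r2_partial = r_squared([(1, 1), (1, 1), (1, 0), (0, 0)])
print(r2_complete, r2_independent, r2_partial)
```

Because the denominator depends on the allele frequencies at both sites, the same function can be used to explore how rare alleles and mixed samples change r2.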

LDdata [...]

Task 6. Take meaningful subsets of the data and understand how that alters the results.
1. [...] Sites, and Data -> Taxa
2. Try choosing only sites with a given minimum allele frequency. How does that affect the estimates of r2? Why?
3. Choose individuals from a single population or a single subspecies. What is the level of LD compared to data with several populations or both subspecies? Why?

Task 7. Inferring selection from population differentiation using Arlequin. FST outliers, Ewens-Watterson test.
1. To identify FST outliers, go to Settings -> Detecting loci under selection. Choose a meaningful number of demes and groups. Start with a low number of simulations, e.g. 100-1000. (To get good estimates of the neutral distribution of FST, you will need more simulations.) Maximum expected heterozygosity in biallelic SNP data is...?
2. If Arlequin does not co-operate, download sample results from Arlequin20popsParvMex.res.zip.
3. Make sure you understand what is in the main results file Arlequin20popsParvMex_main.htm.
4. Plot the results and identify interesting SNPs using R; follow the code in ReadArlequinResults.R. You will need the files fdist2_ObsOut.txt and fdist2_simOut.txt.

Task 8. Using databases to acquire more information about individual SNPs.
Each SNP's name can be found in BestSNPs.txt (the order of the SNPs is the same as in the Arlequin input file). You can use the name of the SNP to find information about it in 55KAnnotations.txt. This file contains information about the location and function of the SNP. Use e.g. http://www.panzea.org/db/searches/webform/marker_search and http://www.maizesequence.org to find information about SNPs that seem to have been affected by natural selection. Pick an interesting SNP and find answers to at least these questions:
1. On which chromosome and at which nucleotide position is the SNP?
2. In which gene is the SNP, or which is the closest gene?
3. How far away is the closest gene?
4. What is the gene's predicted function?
5. What is the function of the homologous Arabidopsis gene? (http://www.arabidopsis.org/)
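Since the SNP names in BestSNPs.txt follow the locus order of the Arlequin input file, going from an outlier's locus number to its annotation is a two-step lookup. The Python sketch below illustrates this with invented data and is not part of the course material. The file layouts are assumptions: one SNP name per line with no header for the name list, and a tab-delimited annotation table with a header row and a hypothetical column called `name`. Locus numbers are assumed to start at 1. Check the real files before relying on any of this.

```python
import csv
import io


def lookup_snps(locus_numbers, names_text, annotations_text, name_column="name"):
    """Map 1-based locus numbers to SNP names and their annotation rows."""
    names = [line.strip() for line in names_text.splitlines() if line.strip()]
    reader = csv.DictReader(io.StringIO(annotations_text), delimiter="\t")
    by_name = {row[name_column]: row for row in reader}
    found = []
    for k in locus_numbers:
        name = names[k - 1]  # locus k is on line k of the name list
        found.append((k, name, by_name.get(name)))  # None if not annotated
    return found


# Invented stand-ins for the name list and the annotation table.
names_text = "snp_a\nsnp_b\nsnp_c\n"
annotations_text = (
    "name\tchromosome\tposition\n"
    "snp_a\t1\t1500\n"
    "snp_b\t3\t98000\n"
)
hits = lookup_snps([2, 3], names_text, annotations_text)
for locus, name, row in hits:
    print(locus, name, row)
```

A result of `None` in the third position means the SNP name was not found in the annotation table, which is worth checking before searching the online databases.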
