Select the Options tab. ⢠The Sort Genotype Columns and Map Rows checkbox is selected by default. The default value of 10000 under Number of Rows to Scan ...
_______________________________ Step by Step Guide to Importing Genetic Data into JMP Genomics
Page 1
Introduction Data for genetic analyses can exist in a variety of formats. Before this data can be analyzed it must imported into one or more properly-formatted SAS data sets. JMP Genomics contains numerous processes for importing and formatting data from a variety of sources. In this document, we will explore several of these processes.
Objectives In this document, we will explore the following import processes1:
Importing data from Affymetrix CHP files Importing Illumina data Importing data from PLINK files Importing data from text-format files
Importing Affymetrix SNP CHP files When importing Affymetrix data for genotype analysis, a design file must be created. It is advantageous to include all available phenotype information in the design file at this time so that it is available for use in downstream analysis. The SNP data must be in the CHP file format, and we recommend using the Affymetrix Genotyping Console to convert SNP CEL files to CHP files.
The Experimental Design File An experimental design file (EDF) is required when importing Affymetrix CHP files. This design table identifies the names of the files you will be working with, and links sample information to each file name. There are three variables that are required in a design table when importing Affymetrix SNP data. These are: 1. File or FileName 2. Array or Chip 3. ColumnName
Creating a Design Table from a Text or Excel File Often, as samples are tracked through processing, most of the information needed to create a design table is collected in a text or Excel file. At a minimum, a text design
1
We will not cover import of VCF files in this document. That format will be covered in a separate document entitled Importing Next Generation Sequence File Formats.
Page 2
table template should contain the CHP file name, sample phenotype information, and any sample attributes that relate to the experimental conditions. We strongly recommend that you provide a file with the array name, the subject IDs and phenotypes associated with the samples at the time of import. Otherwise, you will have to create a SAS dataset with this information and merge it to the data file after the import is completed. We will use the JMP Genomics Wizard to create a design data set from a text template (see figure below for example). The files used in this example (including the data sample description.txt file, shown below) are from the Gene Expression Omnibus (GEO) data set GSE26853.
Tab delimited text or csv file formats are recommended. Excel formats often have table formatting that is incompatible with our import tools.
Creating a Design Data Set from a Text File Template 1. Select Import > Getting Started > Getting Started Wizard from the Genomics Starter window. 2. Select Affymetrix (circled in the figure below).
Page 3
3. Click the Next button at the bottom of the window. 4. Complete the Experiment Information window, as shown below:
Page 4
5. Click the Next button. 6. Choose the data sample description.txt file, as shown below:
7. Click the Next button. The information in the text file is saved as a SAS data set, and opened as a new JMP table. Note that a column called Chip has been created.
Page 5
8. Select the directory for the data files. Note: If the data files are in the same directory as the design file, this window will not appear.
9. Click the Next button. A dialog prompting you to check the design file for accuracy appears. 10. Click the Next button to create a ColumnName. 11. Add Sample as a Selected Column, as shown below:
Page 6
The ColumnName must be unique for each sample. Adding the Chip variable below the first value can be used to create ColumnNames based on sample attributes, e.g., Control_1, Control_2, etc. 12. Click OK, and then Next on the Wizard dialog. 13. On the next window, if desired, enter a name for the design file. For this example, type DemoEDF in the text box. 14. Choose the Output Folder.
15. Click Next. 16. In the following window, select SNP for What type of Experiment is this?
Page 7
17. Click Next. The Affymetrix SNP CHP Input AP opens and is populated with the design file, the path to the data files and the output directory.
The Annotation SAS Data Set appropriate for the array must be added. This can be downloaded using the tool, Import > Affymetrix > Download NetAffx Files from the Genomics starter menu. The CDF file is not required when the SNP CHP files were generated from CEL files using the Affymetrix Genotyping Console.
Page 8
18. Select the Options tab at the top of the AP window.
19. Set the Number of Rows to Scan to 10000. This ensures that an X or Y value is encountered in the chromosome variable column, thus preventing an import error. 20. Keep the other default settings. This ensures that the ordering is correct between the SNPs in the data set and the output annotation file. 21. Click Run to start the process. The process is complete when the results window pops up.
Because the data set is large, only a small subset will be created when View Subset is clicked. This is sufficient to check if the import completed correctly.
Page 9
Importing Illumina SNP Data When importing Illumina SNP data, it is recommended, though not required, that you have both a Sample file and a Map file. Importing these files together will make downstream processing simpler. We also recommend that the Final Report format be used, if possible. This example will utilize a Full Data Table format, but we will review the Final Report options as well. 1. Select Import > Illumina > SNP from the Genomics Starter window. 2. Click the Load button. Select the FullDataTable setting.
The Final Report radio button under Source of Files indicates the type of file source. It is important to know what format of file is provided. 3. Select Final Report radio button, under Source of Files, to activate the Final Report Options. Multiple files can be imported and combined into a single SAS data set. The Available Files are selected by highlighting the appropriate file for each type of file (Genotype, Sample and Map) and using the arrows ( ) to specify the highlighted file in the appropriate field.
Page 10
The Column Delimiter can be set to TAB or COMMA. It is important to know the exact formats of the files provided. You also have the option to use the JMP G Custom Report. This is a plug-in to Genome Studio and can be found in the path: C:\Program Files\SASHome\JMP\10\Genomics\ThirdPartyAnnotation\Illumina. 4. Select the Options tab The Sort Genotype Columns and Map Rows checkbox is selected by default. The default value of 10000 under Number of Rows to Scan should be left unchanged. 5. Select the Final Report Options tab.
The Length of Sample ID Column field should be set with a value greater than the length of the longest sample identifier in the data set. The GC score is a quality score output in the Final Report files exported from Genome Studio. Optionally, this score may be used to filter out poor quality SNP calls. Final Report files also may capture allele calls in various formats (e.g., AB or AGCT), so you should select the allele format that is present in your Final Report file. All SNPs in the input data set can be imported, or selected subsets of SNPs may be specified for import by selecting one or more SNP prefixes. If SNPs with more than one SNP prefix type are selected for import, you may want to add the additional prefix “SNP_” in the Prefix to Append to SNP Column Names text box. This will make it simpler to specify all columns containing all types of imported SNPs in the data set in downstream analysis processes. Do not check the Create Final Report Data Set box unless there is a compelling reason to create a SAS data table from the initial Final Report table. If you do select this option, please keep in mind that a SAS data set version of a large Final Report file can require substantial disk space to create and save. In addition, a Final Report-formatted SAS data set cannot be used directly in downstream SNP analysis processes in JMP Genomics.
Page 11
6. Click Run to start the process.
Importing PLINK Files Genotype data files in PLINK format can easily be imported into JMP Genomics. PLINK files in text or binary format can be imported using separate APs.
Importing PLINK files in Text Format 1. Select Import > Other Genetics > Plink from the Genomics Starter menu. 2. Complete the General tab as shown below:
It is important to know how each file is delimited. If the wrong option is selected you will see a SAS error at the end of the process, which indicates the wrong option was selected for one or both of the files. 3. Select the Other Files tab. 4. Complete the Other Files tab as shown below:
Page 12
The files on this tab are optional. If used, be sure to provide the proper value for the missing phenotype and/or covariate. If these files have header rows, check the appropriate box. 5. Select the Options tab. 6. Complete the Options tab as shown below:
You can optionally name the phenotype variable in the output SAS dataset. The Type of Phenotype Variable option refers to the phenotype present in the .ped or .fam file, and not the optional .phe files. 7. Select the Map Options tab. Select the appropriate choice for the Unit for Genetic Distance (M, cM or Column Excluded). The physical position is a separate column in the map file and will be imported regardless of the choice made here. 8. Click Run to start the process.
Importing PLINK Files in Binary Format 1. Select Import > Other Genetics > PLINK Binary from the Genomics starter menu. 2. Complete the General tab as shown below:
Page 13
These 3 files are required to run the process. 3. Select the Other Files tab. Alternate Phenotype and Covariate files can be added to the output SAS data table. You will need to specify the delimiter, and for each file, if there is a header row and the value to denote missing data for each file. 4. Select the Options tab. This is similar to the Options tab for the PLINK text-format import AP. Select the Type of Phenotype Variable, Affection Status Coding and if needed, the Missing Quantitative Trait Value. 5. Click Run to start the process.
Importing Text Format Genotype Data The JMP Genomics text file format import engine can be used to import large genotype data sets. For small data sets (fewer than 5,000 markers or so), the JMP text file import (File > Open) may be the best option, followed by saving the JMP table as a sas7bdat file. In this section, we will cover importing large data sets.
Page 14
Note that if any text file is imported and a map file separately imported, you must run the Subset and Reorder genetic utility AP prior to any other process. 1. Select Import > Text > Import Individual Text, CSV, or Excel Files from the Genomics starter menu. We recommend that you import text or csv formatted files, as Excel formats often have formatting that is incompatible with the import process. 2. Complete the General tab as shown below:
The File Filter Expression box is optional. Multiple files can be imported, but all the files must be in the same format. File Type does not need to be specified if the file suffix is .txt or .csv. If the file has a different suffix but is in one of these formats, specify the type with this pull-down menu. The Row Number of Variable Names should have the appropriate value entered. The Data Start Row should have the appropriate value entered. 3. Select the Options tab. 4. Make no changes to the default settings on the Options tab. 5. Select the Wide Data tab.
Page 15
Complete the Minimum Number of Columns to Scan box. Set this value to the column number in the data set past which the type of data (categorical vs. numeric) is constant from that column to the last column of the data table. When JMP Genomics imports a text or csv file, the software checks each column for the data type. Setting the minimum number of columns to scan will turn off this process for all columns beyond the selected value and significantly improve the import speed. All other fields can be left as the default. 6. Click Run to start the process. This concludes the import of genetic data. Following import, you should next refer to the Basic Genetic Analysis document which reviews data cleaning and assessment as well as basic association and other statistical tests.
Page 16