THE GENETIC STRUCTURE, FUNCTION AND RELEVANCE TO DISEASE OF THE SALIVARY AGGLUTININ GENE (DMBT1)
Thesis submitted for the degree of Doctor of Philosophy at the University of Leicester
by Shamik Polley MVSc Department of Genetics University of Leicester
2014
ABSTRACT
The Genetic Structure, Function and Relevance to Disease of the Salivary Agglutinin Gene (DMBT1)
Shamik Polley
Salivary agglutinin, encoded by the gene DMBT1, is a multifunctional, high molecular mass glycoprotein (340 kDa) that acts as a pattern recognition receptor (PRR) in innate immunity and mediates epithelial differentiation. The central region of the protein contains 13 tandemly repeated scavenger receptor cysteine-rich (SRCR) domains that are copy number variable and bind to bacteria and viruses. The paralogue ratio test (PRT) was used to estimate the exact copy number of two distinct copy number variable regions (CNV1 and CNV2) of the DMBT1 gene, and the results were compared with those of other CNV estimation assays. Both CNV1 and CNV2 were multiallelic CNVs, and diploid copy number varied between populations. The de novo mutation rate at CNV1 and CNV2 was estimated using a segregation study of 520 samples from 40 multigenerational CEPH families; a high mutation rate was found at both loci (CNV1: 1.4% and CNV2: 3.3% per generation). The evolutionary basis of CNV at DMBT1 was examined using 971 samples from 52 populations from the Human Genome Diversity Panel (HGDP-CEPH). The study found that the subsistence history of human populations affected the frequency distribution of both CNVs at DMBT1. Given the increase in dental caries following the development of agriculture, and the likely causative role played by an increase in Streptococcus mutans following the transition to a starch-rich diet, the present study suggests that this transition has favoured CNV1 and CNV2 alleles with more S. mutans-binding SRCR domains in agricultural populations. Because of the functional importance of DMBT1, the study also analysed the association of DMBT1 copy number with disease in several cohorts. The study found no evidence of association between DMBT1 copy number and Crohn’s disease (n=2900), urinary tract infection (UTI; n=405), vesicoureteral reflux (VUR; n=625), chronic obstructive pulmonary disease (COPD; n=241) or asthma (n=850). A significant association was found between CNV2 copy number and baseline HIV viral load (n=987) measured just before anti-retroviral therapy.
Acknowledgements
There are a number of people to whom I would like to extend my sincere gratitude and thanks. I would not have been able to complete this project without their assistance and support. First of all, I would like to express my deepest gratitude to my wonderful supervisor, Dr. Ed Hollox, to whom I am forever indebted. My thesis would not have been possible without his constant guidance, encouragement, support, patience and enthusiasm. I wish to thank all collaborators: Dr. David Hains, University of Tennessee, USA; Dr. Jack Satsangi, University of Edinburgh, UK; Dr. Christopher Mathew, King’s College London, UK; Dr. Paal Skyt Andersen, Statens Serum Institute, Copenhagen, Denmark; Prof. John Yates, University of Cambridge, UK; Dr. Valentina Cipriani, Institute of Ophthalmology, University College London, UK; Dr. Robert Bals, Saarland University, Germany; Dr. Eleni Aklillu and Prof. Lars Lindquist, Karolinska Institute, Sweden; Prof. Martin Tobin, Dr. Louise Wain and Dr. Ioanna Ntalla, University of Leicester, UK, for providing DNA samples and helping with data analysis. I wish to thank Prof. Jan Mollenhauer, University of Southern Denmark, Prof. Mark Jobling, Prof. Sir Alec Jeffreys and Prof. Yuri Dubrova for control DNA samples and for the use of machines and instruments. I wish to thank the members of my first-year thesis committee, Dr. Flaviano Giorgini and Dr. Celia May, for their helpful advice. Many thanks to Dr. Rob Hardwick and Dr. Lee Machado for their help in the early stages of my research. Many thanks also to my fellow PhD students, Angelica Vittori, Barbara Ottolini, Razan Abujaber and Ezgi Kucukkilic, for making the lab a pleasant and inspiring place to work and learn. I would also like to thank all past and present members of the Hollox group and the Jobling group at the University of Leicester for all their help, encouragement and friendship. I wish to thank the Ministry of Social Justice & Empowerment, Government of India, for financial support during my PhD. Finally, heartfelt thanks to my fiancée Somdatta, my family and my friends for their love and support throughout this four-year endeavour abroad. I would like to thank my parents for their continuous encouragement and passion for my education.
TABLE OF CONTENTS
1
INTRODUCTION ............................................................................................................................ 1
1.1
Copy number Variation ................................................................................................. 1
1.1.1
Classes of CNVs ..................................................................................................... 3
1.1.2
Functional consequences of CNVs ........................................................................ 3
1.1.3
Mechanisms of structural change ......................................................................... 4
1.2
CNV detection methods ................................................................................................ 5
1.2.1
Southern blotting and Pulsed Field Gel Electrophoresis ......................................... 5
1.2.2
Fibre-FISH (Fluorescence in situ hybridization) ..................................................... 7
1.2.3
Array comparative genomic hybridization ............................................................ 8
1.2.4
Representational Oligonucleotide Microarray Analysis (ROMA) .......................... 9
1.2.5
Quantitative PCR (qPCR) ..................................................................................... 10
1.2.6
Multiplex ligation-dependent probe amplification (MLPA) ............................. 11
1.2.7
Multiplex amplifiable probe hybridization (MAPH) ............................................ 12
1.2.8
Fosmid Paired-End Sequencing ........................................................................... 13
1.2.9
Paralogue ratio test (PRT) ................................................................................... 14
1.3
Deleted in Malignant Brain Tumours 1 (DMBT1) ........................................................ 15
1.3.1
Genomic Structure of the DMBT1 gene .............................................................. 16
1.3.2
Expression and localisation of DMBT1 ................................................................ 17
1.3.3
The domain organization of DMBT1 ................................................................... 17
1.3.4
The role of DMBT1 in epithelial and stem cell differentiation............................ 22
1.3.5
Role of DMBT1 in Innate immunity ..................................................................... 23
1.3.6
Bacteria-binding domain on DMBT1 ................................................................... 24
1.3.7
Hydroxyapatite-binding domain on DMBT1 ....................................................... 25
1.3.8
Interaction of DMBT1 with viruses ..................................................................... 26
1.3.9
Interaction of DMBT1 with endogenous protein ligands .................................... 27
1.3.10
The glycosylation pattern of DMBT1................................................................... 28
1.3.11
Role of DMBT1 in the complement pathway ...................................................... 28
1.3.12
Involvement of DMBT1 in mechanism of fertilization ........................................ 29
1.3.13
DMBT1 binding region to Streptococcus mutans Ag I/II ..................................... 30
1.3.14
Evidence of copy number variation at DMBT1 ................................................... 31
1.4
AIMS OF THE STUDY .................................................................................................... 33
2
MATERIALS AND METHODS ............................................................................................ 34
2.1
DNA samples used....................................................................................................... 34
2.1.1
HapMap samples................................................................................................. 34
2.1.2
Cell lines samples ................................................................................................ 34
2.1.3
ECACC Human Random Control (HRC) samples .................................................. 34
2.1.4
CEPH family samples ........................................................................................... 35
2.1.5
HGDP-CEPH panel ............................................................................................... 35
2.1.6
Leicester local volunteers.................................................................................... 35
2.1.7
Crohn’s samples .................................................................................................. 35
2.1.8
African HIV cohort ............................................................................................... 37
2.1.9
Lung disease cohort ............................................................................................ 37
2.1.10
UTI and VUR cohort............................................................................................. 38
2.2
DMBT1 Sequence processing and bioinformatics ...................................................... 38
2.3
Sequence analysis of SRCR repeats of DMBT1............................................................ 39
2.4
Evidence of copy number variation on DMBT1 .......................................................... 39
2.5
Growing of lymphoblastoid cell lines .......................................................................... 39
2.6
Genomic DNA extraction from lymphoblastoid cell lines ........................................... 40
2.7
Characterization of CNV1 region of DMBT1 ............................................................... 40
2.7.1
Long-range PCR ................................................................................................... 40
2.7.2
PCR for the long allele ......................................................................................... 41
2.7.3
Block-specific long PCR........................................................................................ 41
2.7.4
Analysis of aCGH data for CNV1 .......................................................................... 42
2.7.5
Designing of primers for CNV1 region ................................................................ 42
2.7.6
PRT assays for CNV1 of DMBT1 ........................................................................... 43
2.8
Characterization of CNV2 region of DMBT1 ............................................................... 49
2.8.1
Long PCR spanning CNV2 region ......................................................................... 49
2.8.2
Analysis of aCGH data for CNV2 .......................................................................... 50
2.8.3
PRT assays for CNV2 of DMBT1 ........................................................................... 50
2.9
Designing of probes for physical mapping of DMBT1 ................................................. 57
2.9.1
Synthesis of DMBT1 probe .................................................................................. 58
2.10
Fibre-FISH molecular combing methods ..................................................................... 59
2.11
Analysis of fibre-FISH .................................................................................................. 60
2.12
Sample Preparation for PFGE ...................................................................................... 60
2.12.1
Genomic DNA extraction from lymphoblastoid cell lines ................................... 60
2.12.2
Digestion of liquid DNA Samples......................................................................... 61
2.13
Pulsed field Gel Electrophoresis conditions ................................................................ 61
2.14
Southern blot analysis ................................................................................................. 62
2.14.1
Gel depurination, denaturation and neutralization............................................ 62
2.14.2
Transfer of DNA to membrane ............................................................................ 62
2.14.3
Synthesis of DMBT1 probe .................................................................................. 62
2.14.4
Probe labeling and recovery ............................................................................... 63
2.14.5
Hybridization ....................................................................................................... 63
2.14.6
Washing blot ....................................................................................................... 63
2.14.7
Preparing blot for exposure ................................................................................ 63
2.14.8
Autoradiography ................................................................................................. 63
2.15
Analysis of DNA fragment size after PFGE .................................................................. 63
2.16
Estimation of allele and genotype frequency for mutation estimation ..................... 64
2.17
Simple tandem repeats (STR) analysis ........................................................................ 64
2.17.1
DMBT1-m1 .......................................................................................................... 64
2.17.2
DMBT1-m2 .......................................................................................................... 65
2.18
Detection of de novo mutation ................................................................................... 65
2.19
Estimation of mutation rate ........................................................................................ 66
2.20
Worldwide distribution of CNV1 and CNV2 copy number of DMBT1 gene ................ 66
2.21
Relationship of copy number variation in HGDP populations .................................... 66
2.22
Analysis of Pathogen Richness .................................................................................... 67
2.23
Analysis of DMBT1 copy number variation due to human life style adaptations....... 68
2.24
Isolation of Genomic DNA from buccal cells ............................................................... 68
2.25
Designing of PCR primers for C-terminal region of Ag I/II of S. mutans .................... 69
2.26
PCR amplification of C-terminal region of Ag I/II of S. mutans ................................... 69
2.27
Extraction of PCR product from Agarose gel .............................................................. 70
2.28
Sequencing of PCR product using internal sequencing primers ................................. 70
2.29
Sequence read and alignment .................................................................................... 70
2.30
Sequence diversity ...................................................................................................... 71
2.31
Phylogenetic analysis .................................................................................................. 71
2.32
McDonald-Kreitman test ............................................................................................. 71
2.32.1
Input sequences for McDonald-Kreitman test .................................................... 72
2.32.2
Sequence analysis for McDonald-Kreitman test ................................................. 72
2.32.3
Allele frequency spectrum .................................................................................. 73
2.33
Secretor status assay................................................................................................... 74
2.33.1
Primer design for secretor status assay .............................................................. 74
2.33.2
PCR amplification and restriction digestion ........................................................ 75
2.34
Regression Analysis ..................................................................................................... 77
2.35
Comparison between aCGH and PRT for Crohn’s samples ......................................... 77
2.36
Statistical analysis for case-control study ................................................................... 78
2.36.1
Analysis of Crohn’s disease samples ................................................................... 78
2.36.2
Analysis of African HIV cohorts ........................................................................... 78
2.36.3
Analysis of lung disease cohorts ......................................................................... 78
2.36.4
Analysis of Vesicoureteral Reflux (VUR) and Urinary Tract Infections (UTI) cohorts ............................................................................................................................. 79
3
CHARACTERIZATION OF COPY NUMBER VARIATION OF THE HUMAN DMBT1 GENE ............................................................................................................................................. 80
3.1
DMBT1 Sequence processing and bioinformatics ...................................................... 80
3.2
Sequence relationship of SRCR repeats ...................................................................... 81
3.3
Evidence of copy number variable region of DMBT1 ................................................. 82
3.4
PRT results for HapMap samples ................................................................................ 83
3.4.1
Analysis of CNV1 region in HapMap samples ..................................................... 83
3.4.2
Analysis of CNV2 region in HapMap samples ..................................................... 94
3.5
Discussion.................................................................................................................. 103
4
ANALYSIS OF COPY NUMBER VARIATION OF DMBT1 USING PHYSICAL MAPPING APPROACHES ........................................................................................................................... 105
4.1
Introduction .............................................................................................................. 105
4.2
Copy number variation of DMBT1 using Fibre-FISH ................................................. 105
4.2.1
Analysis of DNA fibres ....................................................................................... 105
4.2.2
Measurements of DNA fibre lengths ................................................................ 106
4.2.3
Analysis of the DMBT1 region in YRI family Y045 ............................................. 107
4.2.4
Analysis of DMBT1 region in 1447 family ......................................................... 110
4.3
Copy number variation of DMBT1 using PFGE.......................................................... 112
4.3.1
Selection of restriction enzyme for genomic DNA digestion ............................ 112
4.3.2
Selection of DNA samples ................................................................................. 112
4.3.3
Southern blotting analysis of liquid genomic DNA ........................................... 113
4.3.4
DNA size analysis using Southern blotting ........................................................ 114
4.4
Discussion.................................................................................................................. 115
5
ANALYSIS OF SEGREGATION PATTERNS AND DE NOVO MUTATION RATES AT THE DMBT1 GENE ........................................................................................................................... 117
5.1
Aim of the study ........................................................................................................ 117
5.2
Estimation of DMBT1 copy number in CEPH pedigree samples ............................... 117
5.2.1
Analysis of CNV1 copy number in CEPH pedigree samples .............................. 117
5.2.2
Distribution of integer CNV1 copy number....................................................... 119
5.2.3
Analysis of CNV2 copy number ......................................................................... 120
5.2.4
Distribution of integer CNV2 copy number....................................................... 123
5.2.5
Allelic architecture and copy number genotype ............................................... 124
5.3
Detection of de novo mutation ................................................................................. 127
5.3.1
Estimation of mutation rate .............................................................................. 129
5.4
Discussion.................................................................................................................. 130
6
DETERMINATION OF EXTENT OF DIVERSITY AND EVOLUTIONARY BASIS OF DMBT1 COPY NUMBER IN GLOBAL POPULATIONS .......................................................................... 132
6.1
Introduction .............................................................................................................. 132
6.2
Estimation of DMBT1 copy number in HGDP samples ............................................. 132
6.2.1
Analysis of CNV1 copy number in HGDP samples ............................................. 132
6.2.2
Distribution of CNV1 diploid copy number in HGDP ......................................... 135
6.2.3
Distribution of CNV1 diploid copy number in different geographical regions.. 136
6.2.4
Distribution of CNV1 diploid copy number in different populations ................ 138
6.2.5
Analysis of CNV2 copy number ......................................................................... 140
6.2.6
Distribution of CNV2 diploid copy number in HGDP ......................................... 144
6.2.7
Distribution of CNV2 diploid copy number in different geographical regions.. 145
6.2.8
Distribution of CNV2 diploid copy number in different populations ................ 147
6.3
Estimation and distribution of total SRCR copy number in different geographical regions................................................................................................................................... 150
6.3.1
Analysis of CNV1 and CNV2 copy number association ..................................... 152
6.4
Analysis of pathogen-driven selection on DMBT1 Copy number variation .............. 156
6.4.1
Kendall rank correlation for pathogen richness................................................ 156
6.4.2
Partial Mantel tests for pathogen richness ....................................................... 157
6.5
DMBT1 copy number variation due to human life style adaptations ....................... 158
6.5.1
Analysis of human life style adaptations using agriculture data as dichotomous variables ........................................................................................................................... 159
6.5.2
Analysis of human life style adaptations using agriculture data as relative amount of activity ............................................................................................................. 159
6.6
Discussion.................................................................................................................. 160
7
ANALYSIS OF DIVERSITY OF THE SALIVARY AGGLUTININ-BINDING PROTEIN OF STREPTOCOCCUS MUTANS...................................................................................................... 162
7.1
Aim of the study ........................................................................................................ 162
7.2
Analysis of S. mutans sequence ................................................................................ 162
7.2.1
Sequence diversity of SpaP gene of S. mutans ................................................. 162
7.2.2
Phylogenetic analysis of SpaP gene of S. mutans ............................................. 163
7.2.3
Analysis of DMBT1 binding regions of S. mutans.............................................. 168
7.2.4
McDonald-Kreitman test for SpaP gene of S. mutans ...................................... 170
7.2.5
Allele frequency spectrum of SpaP of S. mutans .............................................. 171
7.3
Estimation of DMBT1 Copy number in Leicester local volunteers............................ 174
7.3.1
Estimation of CNV1 copy number in Leicester local volunteers ....................... 174
7.3.2
Distribution of CNV1 copy number in Leicester local volunteers ..................... 177
7.3.3
Estimation of CNV2 copy number in Leicester local volunteers ........................... 178
7.3.4
Distribution of CNV2 copy number in Leicester local volunteers ..................... 181
7.3.5
Secretor status of Leicester local volunteers .................................................... 182
7.3.6
Analysis of SpaP genotype and CNV1 and CNV2 copy number of DMBT1 ....... 182
7.4
Discussion.................................................................................................................. 183
8
ASSOCIATION OF COPY NUMBER VARIATION OF DMBT1 IN CROHN’S DISEASE PATIENTS ................................................................................................................................. 185
8.1
Introduction .............................................................................................................. 185
8.2
The rationale for study .............................................................................................. 185
8.3
Estimation of DMBT1 copy number in Crohn’s samples........................................... 186
8.3.1
Copy number estimation of English Crohn’s samples ....................................... 186
8.3.2
Copy number estimation of Scottish Crohn’s samples ..................................... 192
8.3.3
Copy number estimation of Danish Crohn’s samples ....................................... 198
8.4
Comparison of aCGH and PRT for copy number estimation ......................................... 203
8.5
Distribution of diploid copy number in the Crohn’s samples ................................... 209
8.5.1
Distribution of CNV1 copy number in Crohn’s samples .................................... 209
8.5.2
Distribution of CNV2 copy number in Crohn’s samples ........................................ 209
8.5.3
Distribution of SRCR copy number in Crohn’s samples .................................... 210
8.6
Association of DMBT1 copy number with Crohn’s disease ...................................... 211
8.7
Discussion.................................................................................................................. 213
9
ASSOCIATION OF COPY NUMBER VARIATION OF DMBT1 IN AFRICAN HIV COHORTS ........................................................................................................................................... 215
9.1
Introduction .............................................................................................................. 215
9.2
Estimation of DMBT1 copy number in African HIV cohorts ...................................... 217
9.2.1
Estimation of CNV1 copy number in African HIV cohorts ................................. 217
9.2.2
Estimation of CNV2 copy number in African HIV cohorts ................................. 219
9.2.3
Distribution of CNV1 and CNV2 copy number in African HIV cohorts .............. 221
9.3
Association of copy number with clinical parameters in African HIV cohorts .......... 225
9.4
Discussion.................................................................................................................. 226
10
ASSOCIATION OF COPY NUMBER VARIATION OF DMBT1 IN LUNG DISEASE COHORTS.................................................................................................................................. 229
10.1
Introduction .............................................................................................................. 229
10.2
Estimation of DMBT1 copy number in lung disease cohorts .................................... 231
10.2.1
Estimation of DMBT1 copy number in Gedling cohort ..................................... 231
10.2.2
Estimation of DMBT1 Copy number in Leicester Respiratory cohort (LRC) ...... 233
10.3
Distribution of CNV1 and CNV2 copy number in lung disease cohorts .................... 237
10.4
Association study in the Gedling and LRC cohorts .................................................... 238
10.5
Discussion.................................................................................................................. 243
11
ASSOCIATION OF COPY NUMBER VARIATION OF DMBT1 IN VUR AND UTI SAMPLES .................................................................................................................................. 244
11.1
Introduction .............................................................................................................. 244
11.2
Estimation of DMBT1 copy number in VUR and UTI samples................................... 245
11.2.1
Estimation of CNV1 copy number in VUR and UTI samples .............................. 245
11.2.2
Estimation of CNV2 copy number in VUR and UTI samples .............................. 247
11.3
Distribution of CNV1 and CNV2 copy number in VUR and UTI samples ................... 250
11.4
Secretor status in VUR and UTI samples ................................................................... 251
11.5
Association study in VUR and UTI samples ............................................................... 252
11.6
Discussion.................................................................................................................. 253
12
DISCUSSION ............................................................................................................................ 255
BIBLIOGRAPHY .......................................................................................................................... 261
LIST OF FIGURES
Figure 1: Diallelic and multiallelic copy number variation. ...... 2
Figure 2: Schematic picture of Southern blotting methods. ...... 7
Figure 3: AMY1 copy number estimation using high-resolution Fibre-FISH. ...... 8
Figure 4: Schematic picture of array-based comparative genome hybridization (array-CGH). ...... 9
Figure 5: Representational oligonucleotide microarray analysis (ROMA) for copy number detection. ...... 10
Figure 6: Multiplex PCR-based methods for the identification of copy-number variants. ...... 13
Figure 7: The paired-end sequencing methodology for detection of structural variation. ...... 14
Figure 8: Schematic picture describing different steps of PRT. ...... 15
Figure 9: Domain organization of DMBT1 and DMBT1 orthologs. ...... 19
Figure 10: Schematic of gp340 structures, indicating conserved cysteines of SRCR domains. ...... 21
Figure 11: Schematic picture of bacteria-binding on SRCR domains of DMBT1. ...... 24
Figure 12: Schematic presentation of SRCR domain, highlighting the hydroxyapatite (HA)-binding domain and bacteria-binding domain on DMBT1. ...... 25
Figure 13: Schematic picture of virus-binding regions on SRCR domains of DMBT1. ...... 26
Figure 14: Cartoon showing the primary sequence of S. mutans UA159 AgI/II. ...... 30
Figure 15: UCSC genome browser screen shot showing the exon-intron structure with three DMBT1 gene annotations from different transcripts. ...... 32
Figure 16: Schematic illustration of PRT1 assay. ...... 44
Figure 17: Schematic illustration of PRT2 assay. ...... 45
Figure 18: An example of a GeneMapper® electropherogram showing the test locus and the reference locus in the PRT1 assay. ...... 46
Figure 19: An example of a GeneMapper® electropherogram showing the test locus and the reference locus in the PRT2 assay. ...... 47
Figure 20: The calibration standard on reference DNA samples for PRT1 from single PCR reaction. ...... 48
Figure 21: The calibration standard on reference DNA samples for PRT2 from single PCR reaction. ...... 48
Figure 22: Schematic illustration of PRT3 assay. ...... 51
Figure 23: Schematic illustration of PRT4 assay. ...... 52
Figure 24: Schematic illustration of PRT5 assay. ...... 53
Figure 25: An example of a GeneMapper® electropherogram showing the test locus and the reference locus in the PRT3 assay. ...... 54
Figure 26: An example of a GeneMapper® electropherogram showing the test locus and the reference locus in the PRT4 assay. ...... 55
Figure 27: An example of a GeneMapper® electropherogram showing the test locus and the reference locus in the PRT5 assay. ...... 55
Figure 28: The calibration standard on reference DNA samples for PRT3 from single PCR reaction. ...... 56
Figure 29: The calibration standard on reference DNA samples for PRT4 from single PCR reaction. ...... 56
Figure 30: Strategic picture of primer design for synthesis of Fibre-FISH probes. ...... 58
Figure 31: Schematic picture showing primer locations to amplify the C-terminal region of Ag I/II of S. mutans. ...... 69
Figure 32: Schematic picture showing PCR-RFLP strategy of rs601338 of FUT2 gene. ...... 75
Figure 33: Electropherogram of secretor PCR products showing three possible genotypes of rs601338 of FUT2 gene using Leicester local samples. ...... 77
Figure 34: Dot plot analysis of DMBT1 gene (exons and introns) against itself. ...... 80
Figure 35: Nucleotide sequence relationship of SRCR repeats. ...... 81
Figure 36: Copy number variable regions in DMBT1 gene. ...... 83
Figure 37: Histogram of raw PRT ratio of PRT1 and PRT2 in the HapMap Phase I DNA samples. ...... 84
Figure 38: Assessment of CNV1 copy number assay quality in HapMap samples. ...... 85
Figure 39: Scree plot for PCA of aCGH data generated using Agilent 210 K CNV chip for CNV1 in HapMap samples. The X-axis shows the number of principal components. ...... 86
Figure 40: Scatter plot of mean unrounded copy number value of CNV1 and first PC of Agilent aCGH data for CNV1 region of HapMap samples. ...... 87
Figure 41: Histogram of mean normalized PRT ratio of CNV1 for HapMap samples. X-axis shows mean PRT ratio of CNV1. ...... 88
Figure 42: Output of the clustering procedure using the PRT transformed data of CNV1 for HapMap samples. ...... 88
Figure 43: Analysis of integer copy number calling for CNV1 in HapMap samples. ...... 89
Figure 44: Frequencies of integer copy number of CNV1 per diploid genome in different HapMap populations. ...... 90
Figure 45: Schematic illustration of long range PCR for genotyping CNV1 region of DMBT1. ...... 91
Figure 46: Genotyping of CNV1 allele of good quality DNA samples using long range PCR. ...... 91
Figure 47: Genotyping of CNV1 allele of freeze-thawed DNA using long range PCR. ...... 92
Figure 48: Top panel showed location of long allele specific PCR. ...... 92
Figure 49: (A) Schematic presentation of block specific long PCR. ...... 93
Figure 50: Genotype frequency of CNV1 region in different HapMap populations. ...... 94
Figure 51: Histogram of raw PRT ratio of PRT3 and PRT4 in the HapMap Phase I DNA samples. ...... 94
Figure 52: Histogram of raw PRT ratio of PRT5 in the HapMap Phase I DNA samples. ...... 95
Figure 53: Assessment of CNV2 copy number assay quality. ...... 95
Figure 54: Assessment of CNV2 copy number assay quality. ...... 96
Figure 55: Assessment of CNV2 copy number assay quality. ...... 97
Figure 56: Scree plot for PCA of aCGH data generated using Agilent 210 K CNV chip for CNV1 in HapMap samples. ...... 98
Figure 57: Scatter plot produced by mean unrounded copy number value and first PC of Agilent aCGH signal of CNV2 of HapMap samples. ...... 99
Figure 58: Histogram of normalized PRT ratio of CNV2 for HapMap samples. ...... 100
Figure 59: Output of the clustering procedure using the PRT transformed data of CNV2 for HapMap samples. ...... 100
Figure 60: Analysis of integer copy number calling of CNV2 in HapMap samples. ...... 101
Figure 61: Frequencies of integer copy number of CNV2 per diploid genome in different HapMap populations. ...... 102
Figure 62: Schematic presentation of CNV2 copy number estimation using long PCR. ...... 103
Figure 63: Schematic picture of DMBT1 fibre-FISH. ...... 106
Figure 64: Analysis of DNA fibre of DMBT1 region. ...... 107
Figure 65: Fiber-FISH image on DNA from cell lines derived from YRI HapMap trio Y045. ...... 108
Figure 66: Individual measurements of DMBT1 probe fibre length of Y045 family. ...... 109
Figure 67: Individual measurements of DMBT1 probe fibre length of 1447 family. ...... 111
Figure 68: Selection of restriction enzyme for DMBT1 gene. ...... 112
Figure 69: Agarose gel showing smearing of gDNA after Sca I digestion. ...... 114
Figure 70: Southern blot analysis of genomic DNA using DMBT1 SRCR probe. ...... 114
Figure 71: Standard curve using known size standard of PFGE ladder. ...... 115
Figure 72: Histogram of mean normalized PRT ratio of CNV1 in CEPH pedigree samples. ...... 118
Figure 73: Output of the clustering procedure using the PRT transformed data of CNV1 in CEPH pedigree samples. ...... 118
Figure 74: Analysis of integer copy number calling. ...... 119
Figure 75: Histogram of raw PRT ratio of PRT3 and PRT4 in the CEPH pedigree samples. ...... 120
Figure 76: Scatter plot produced by PRT3 and PRT4 assays of CNV2 estimation in CEPH pedigree samples. ...... 121
Figure 77: Histogram of mean normalized PRT ratio of CNV2 in CEPH pedigree samples. ...... 121
Figure 78: Output of the clustering procedure using the PRT transformed data of CNV2 in CEPH pedigree samples. ...... 122
Figure 79: Analysis of integer copy number calling of CNV2 in CEPH family. ...... 123
Figure 80: CoNVEM analysis for CNVs of DMBT1 using unrelated parents data from 40 CEPH families. ...... 125
Figure 81: Comparison of allele frequencies for CNVs of DMBT1 using unrelated parents data from 40 CEPH pedigrees. ...... 125
Figure 82: Analysis of CEPH/FRENCH pedigree 12 for detection of de novo mutation. ...... 128
Figure 83: Analysis of CEPH/FRENCH pedigree 1424 for detection of de novo mutation. ...... 129
Figure 84: Analysis of CEPH/FRENCH pedigree 1362 for detection of de novo mutation. ...... 130
Figure 85: Histogram of raw PRT ratio of PRT1 and PRT2 in the HGDP-CEPH samples. ...... 133
Figure 86: Scatter plot produced by PRT1 and PRT2 assays of CNV1 estimation in HGDP samples. ...... 133
Figure 87: Histogram of mean normalized PRT ratio of CNV1 for HGDP samples. ...... 134
Figure 88: Output of the clustering procedure using the PRT transformed data of CNV1 for HGDP samples. ...... 134
Figure 89: Analysis of CNV1 integer copy number calling using CNVtools. ...... 135
Figure 90: Population distribution of diploid CNV1 copy number. ...... 136
Figure 91: Frequency distribution of worldwide CNV1 copy number in HGDP continental regions. ...... 137
Figure 92: Histogram of raw PRT ratio of PRT3 and PRT4 in the HGDP-CEPH samples. ...... 141
Figure 93: Scatter plot produced by PRT3 and PRT4 assays of CNV2 estimation in HGDP samples. ...... 141
Figure 94: Histogram of mean normalized PRT ratio of CNV2 for HGDP samples. ...... 142
Figure 95: Output of the clustering procedure using the PRT transformed data of CNV2 for HGDP samples. ...... 143
Figure 96: Analysis of CNV1 integer copy number calling using CNVtools for the HGDP samples. ...... 144
Figure 97: Population distribution of CNV2 copy number. ...... 145
Figure 98: Frequency distribution of worldwide CNV2 copy number in HGDP continental regions. ...... 146
Figure 99: Schematic picture of DMBT1 region to estimate total SRCR copy number. ...... 151
Figure 100: Frequency distribution of total diploid SRCR copy number in HGDP samples. ...... 151
Figure 101: The pattern of CNV1 and CNV2 copy number variation in different HGDP individuals. ...... 152
Figure 102: The pattern of copy number variation at CNV1 and CNV2 in different HGDP populations. ...... 153
Figure 103: The variation pattern of copy number for CNV1 and CNV2 in different HGDP regions. ...... 155
Figure 104: Molecular phylogenetic analysis for all samples by Maximum Likelihood method in MEGA6 using nucleotide sequences. ...... 164
Figure 105: Molecular phylogenetic analysis for European samples by Maximum Likelihood method in MEGA6 using nucleotide sequences. ...... 166
Figure 106: Molecular phylogenetic analysis for all samples by Maximum Likelihood method in MEGA6 using amino acid sequences. ...... 167
Figure 107: Molecular phylogenetic analysis for European samples by Maximum Likelihood method in MEGA6 using amino acid sequences. ...... 168
Figure 108: Sequence logos showing pattern of aligned nucleotide sequences of Ad1 region of SpaP gene of S. mutans. ...... 169
Figure 109: Sequence logos showing pattern of aligned nucleotide sequences of Ad2 region of SpaP gene of S. mutans. The height of each nucleotide is proportional to its frequency and the most common nucleotide is on top. The nucleotide numbering corresponds to the main DNA sequence used in the study. ...... 169
Figure 110: Sequence logos showing pattern of aligned amino acid sequences of Ad1 region of Ag I/II of S. mutans. ...... 169
Figure 111: Sequence logos showing pattern of aligned amino acid sequences of Ad2 region of Ag I/II of S. mutans. ...... 170
Figure 112: Analysis of allele frequency spectrum in S. mutans for all samples. ...... 172
Figure 113: Analysis of allele frequency spectrum in S. mutans for EU samples. ...... 173
Figure 114: Histogram of raw PRT ratio of PRT1, PRT2 and mean CNV1 PRT ratio in the Leicester local samples. ...... 174
Figure 115: Scatter plot produced by raw ratio from PRT1 and PRT2 assays of CNV1 estimation in the Leicester local samples. ...... 175
Figure 116: Output of the clustering procedure using the PRT transformed data of CNV1 in Leicester local samples. ...... 176
Figure 117: Analysis of integer copy number calling of CNV1 in local samples. ...... 177
Figure 118: Histogram of raw PRT ratio of PRT3, PRT4 and mean CNV2 PRT ratio in the Leicester samples. ...... 178
Figure 119: Scatter plot produced by raw ratio from PRT3 and PRT4 assays of CNV2 estimation in the Leicester samples. ...... 179
Figure 120: Output of the clustering procedure using the PRT transformed data of CNV2 in local samples. ...... 180
Figure 121: Analysis of CNV2 integer copy number calling in Leicester local samples. ...... 181
Figure 122: Scatter plot produced by PRT1 and PRT2 assays, used to estimate diploid copy number of CNV1 in English Crohn’s and control samples. ...... 187
Figure 123: Histogram of mean unrounded normalized PRT ratio of CNV1 for English Crohn’s and control samples. ...... 188
Figure 124: Output of the clustering procedure using the PRT transformed data of CNV1 for English Crohn’s and control samples. ...... 188
Figure 125: Analysis of integer CNV1 copy number calling for English Crohn’s disease and control samples. ...... 189
Figure 126: Scatter plot produced by PRT ratio of PRT3 and PRT4 assays of CNV2 estimation in English Crohn’s and control samples. ...... 190
Figure 127: Histogram of mean normalized PRT ratio of CNV2 in English CD samples. ...... 190
Figure 128: Output of the clustering procedure using the PRT transformed data of CNV2 for English CD samples. ...... 191
Figure 129: Analysis of integer CNV2 copy number calling for English Crohn’s disease and control samples. ...... 192
Figure 130: Scatter plot produced by PRT1 and PRT2 assays, used to estimate diploid copy number of CNV1 in Scottish Crohn’s samples. ...... 193
Figure 131: Output of the clustering procedure using the PRT transformed data of CNV1 for Scottish CD samples. ...... 194
Figure 132: Analysis of CNV1 integer copy number calling using CNVtools for Scottish Crohn’s samples. ...... 195
Figure 133: Scatter plot produced by PRT ratio of PRT3 and PRT4 assays of CNV2 estimation in Scottish Crohn’s samples. ...... 196
Figure 134: Output of the clustering procedure using the PRT transformed data of CNV2 for Scottish Crohn’s samples. ...... 197
Figure 135: Analysis of CNV2 integer copy number calling using CNVtools for Scottish Crohn’s samples. ...... 198
Figure 136: Histogram of mean normalized PRT ratio of CNV1 for Danish Crohn’s samples. ...... 199
Figure 137: Output of the clustering procedure using the PRT transformed data of CNV1 for Danish CD samples. ...... 199
Figure 138: Analysis of CNV1 integer copy number calling using CNVtools for Danish IBD samples. ...... 200
Figure 139: Scatter plot using raw PRT ratio of PRT3 and PRT4 assays, used to estimate diploid copy number of CNV2 in Danish CD samples. ...... 201
Figure 140: Output of the clustering procedure using the PRT transformed data of CNV2 for Danish samples. ...... 202
Figure 141: Analysis of CNV2 integer copy number calling using CNVtools for Danish CD samples. ...... 203
Figure 142: Scree plot for PCA of first normalized aCGH data generated using Agilent 210 K CNV chip for CNV1 in Crohn’s disease cohort. ...... 205
Figure 143: Scree plot for PCA of second normalized aCGH data generated using Agilent 210 K CNV chip for CNV1 in Crohn’s disease cohort. ...... 206
Figure 144: Scree plot for PCA of first normalized aCGH data generated using Agilent 210 K CNV chip for CNV2 in Crohn’s disease cohort. ...... 207
Figure 145: Scree plot for PCA of second normalized aCGH data generated using Agilent 210 K CNV chip for CNV2 in the Crohn’s disease cohort. ...... 208
Figure 146: Distribution of diploid copy number for CNV1 of DMBT1 in CD cohorts. ...... 209
Figure 147: Distribution of diploid copy number for CNV2 of DMBT1 in CD cohorts. ...... 210
Figure 148: Distribution of total SRCR domain of DMBT1 in the English Crohn’s cohorts. ...... 210
Figure 149: Distribution of total SRCR domain of DMBT1 in the Scottish Crohn’s cohorts. ...... 211
Figure 150: Distribution of total SRCR domain of DMBT1 in the Danish Crohn’s cohorts. ...... 211
Figure 151: Worldwide HIV prevalence among adults (adopted from WHO-HIV department) ...... 215
Figure 152: Regional HIV and AIDS statistics according to WHO-HIV departments ...... 216
Figure 153: Output of the clustering procedure using the PRT transformed data of CNV1 for HIV samples. ...... 218
Figure 154: Analysis of integer CNV1 copy number calling for HIV samples. ...... 219
Figure 155: Output of the clustering procedure using the PRT transformed data of CNV2 for HIV samples. ...... 220
Figure 156: Analysis of integer CNV2 copy number calling for HIV cohort. ...... 221
Figure 157: Distribution of diploid copy number of CNV1 and CNV2 at DMBT1. ...... 222
Figure 158: Distribution of diploid copy number of SRCR at DMBT1. ...... 224
Figure 159: Analysis of integer CNV1 copy number calling for Gedling COPD samples. ...... 231
Figure 160: Scatter plot produced by raw ratio of PRT3 and PRT4 assays, used to estimate diploid copy number of CNV2 in the Gedling samples. ...... 232
Figure 161: Analysis of integer CNV2 copy number calling for Gedling COPD samples. ...... 233
Figure 162: Scatter plot produced by raw ratio of PRT1 and PRT2 assays, used to estimate diploid copy number of CNV1 in LRC samples. ...... 234
Figure 163: Analysis of integer copy number calling for LRC samples. ...... 235
Figure 164: Scatter plot produced by raw ratio of PRT3 and PRT4 assays, used to estimate diploid copy number of CNV2 in LRC samples. ...... 236
Figure 165: Analysis of integer CNV2 copy number calling for LRC samples. ...... 237
Figure 166: Scatter plots of unadjusted raw (A and B) and adjusted (for age, age², sex, height) inverse normally transformed (C and D) FEV1/FVC against average raw (A and C) and integer CNV2 copy number (B and D) of DMBT1 in LRC. ...... 240
Figure 167: Cumulative frequency distribution of average raw PRT ratio of DMBT1 CNV1 (left) and DMBT1 CNV2 (right) in Gedling COPD cases and controls. ...... 241
Figure 168: Cumulative frequency distribution of average raw PRT ratio of DMBT1 CNV1 and DMBT1 CNV2 in doctor diagnosed asthma cases and controls from Gedling cohort. ...... 241
Figure 169: Cumulative frequency distribution of average raw PRT ratio of DMBT1 CNV1 and DMBT1 CNV2 in doctor diagnosed asthma cases and controls from LRC. ...... 242
Figure 170: Histogram of mean unrounded normalized PRT ratio of CNV1 for VUR, UTI and control samples. ...... 245
Figure 171: Output of the clustering procedure using the PRT transformed data of CNV2 for VUR & UTI cohorts. ...... 246
Figure 172: Analysis of integer copy number calling for VUR and UTI cohort. ...... 247
Figure 173: Histogram of mean normalized PRT ratio of CNV2 for VUR, UTI and control samples. ...... 248
Figure 174: Output of the clustering procedure using the PRT transformed data of CNV2 for VUR and UTI samples. ...... 249
Figure 175: Analysis of integer CNV2 copy number calling for UTI and VUR samples. ...... 250
Figure 176: Agarose gel electrophoresis to determine secretor status using PCR-RFLP in VUR and UTI cohort. ...... 251
LIST OF TABLES
Table 1: DMBT1 synonyms and orthologs in different organisms. ...... 18
Table 2: Summary of samples analysed for Crohn's study. ...... 36
Table 3: PCR primers used to amplify the deletion allele of CNV1. ...... 41
Table 4: Primer sequences used to amplify the CNV1 long allele. ...... 41
Table 5: Primer sequences used for block-specific long PCR. ...... 41
Table 6: Primer sequences used in PRT1 assay for CNV1. ...... 43
Table 7: Primer sequences used in PRT2 assay for CNV1. ...... 44
Table 8: Long PCR primers used to amplify CNV2 region. ...... 49
Table 9: Primer sequences used in PRT3 assay for CNV2. ...... 50
Table 10: Primer sequences used in PRT4 assay for CNV2. ...... 51
Table 11: Primer sequences used in PRT5 assay for CNV2. ...... 53
Table 12: Primer sequences for amplification of DMBT1 probes. ...... 58
Table 13: The conditions used to resolve DMBT1 restriction fragments after Sca I digestion ...... 62
Table 14: Primer sequences used to amplify DMBT1-m1 for STR analysis. ...... 64
Table 15: Primer sequences used to amplify DMBT1-m2 for STR analysis. ...... 65
Table 16: Primer sequences used to amplify C-terminal regions of SpaP gene of S. mutans. ...... 69
Table 17: Internal sequencing primer sequences used to sequence the full-length PCR product of C-terminal region of SpaP gene of S. mutans. ...... 70
Table 18: Contingency table for McDonald-Kreitman test of S. mutans from all samples. ...... 73
Table 19: Contingency table for McDonald-Kreitman test of S. mutans from European samples. ...... 73
Table 20: Primers used to amplify rs601338 of FUT2 gene for secretor status assay. ...... 75
Table 21: The fragment sizes, genotypes and secretor status of rs601338 of FUT2 gene based on PCR-RFLP. ...... 76
Table 22: CNV1 copy number frequencies in HapMap samples. ...... 90
Table 23: PCR fragments for validation of CNV1 copy number using long-range PCR. ...... 91
Table 24: Combination of different PCR assays for validation of CNV1 copy number. ...... 93
Table 25: CNV2 copy number frequencies in HapMap samples. ...... 102
Table 26: Family trio with integer copy number of CNV1 and CNV2 at DMBT1 for family Y045. ...... 108
Table 27: Estimated size of one unit CNV1 allele of DMBT1 using samples of HapMap YRI family Y045 by Fiber-FISH. ...... 110
Table 28: Family trio with integer copy number of CNV1 and CNV2 at DMBT1 for CEU family 1447. ...... 110
Table 29: Estimated size of one unit of CNV2 of DMBT1 using samples of HapMap CEU family 1447 by Fiber-FISH. ...... 112
Table 30: The samples with integer copy number used for PFGE. Integer copy numbers of the samples were measured using PRT assays. ...... 113
Table 31: Estimated size of DMBT1 region from different control DNA samples using PFGE combined with Southern blotting. ...... 115
Table 32: Frequency of CNV1 copy number of DMBT1 in CEPH pedigrees. ...... 120
Table 33: Frequency distribution of CNV2 copy number of CEPH family. ...... 124
Table 34: Estimated genotype frequencies for CNV1 after CoNVEM analysis. ...... 126
Table 35: Estimated genotype frequencies for CNV2 based on result1 after CoNVEM analysis. ...... 126
Table 36: Estimated genotype frequencies for CNV2 based on result2 after CoNVEM analysis. ...... 127
Table 37: CNV1 diploid copy number frequencies in HGDP samples. ...... 136
Table 38: Diploid CNV1 copy number frequencies in HGDP continental regions. ...... 137
Table 39: CNV1 copy number frequencies in HGDP American populations. ...... 138
Table 40: CNV1 copy number frequencies in HGDP South Asia populations. ...... 138
Table 41: CNV1 copy number frequencies in HGDP East Asia populations. ...... 139
Table 42: CNV1 copy number frequencies in HGDP European populations. ...... 139
Table 43: CNV1 copy number frequencies in HGDP Middle East and Oceania populations. ...... 140
Table 44: CNV1 copy number frequencies in HGDP Sub-Saharan Africa populations. ...... 140
Table 45: CNV2 copy number frequencies in HGDP samples. ...... 145
Table 46: CNV2 copy number frequencies in HGDP continental regions. ...... 146
Table 47: CNV2 copy number frequencies in HGDP American populations. ...... 147
Table 48: CNV2 copy number frequencies in HGDP South Asia populations. ...... 147
Table 49: CNV2 copy number frequencies in HGDP East Asia populations. ...... 148
Table 50: CNV2 copy number frequencies in HGDP European populations. ...... 149
Table 51: CNV2 copy number frequencies in HGDP Middle East and Oceania populations. ...... 149
Table 52: CNV2 copy number frequencies in HGDP Sub-Saharan African populations. ...... 150
Table 53: Mean unrounded copy number for CNV1 and CNV2 in different HGDP populations. ...... 154
Table 54: Mean CNV1 and CNV2 copy number for different geographical regions in HGDP samples. ...... 155
Table 55: The Kendall correlations with the richness of viruses, helminths, bacteria and protozoa. ...... 156
Table 56: Partial Mantel correlations (using distance from Africa). ...... 157
Table 57: Correlations with copy number variable of DMBT1 and a human life style variable (agriculture variable) as dichotomous variables. ...... 159
Table 58: Correlations with copy number variable of DMBT1 and human life style as relative amount of human activity using partial Mantel tests. ...... 160
Table 59: Regression analysis with copy number variable of DMBT1 and human life style as relative amount of human activity spent. ...... 160
Table 60: Sequence diversity of C-terminal region of SpaP gene of S. mutans. ...... 163
Table 61: Summary results of McDonald-Kreitman test of Antigen I/II of S. mutans using sequences from all samples. ...... 170
Table 62: Summary results of McDonald-Kreitman test of Antigen I/II of S. mutans using sequences from European samples. ...... 171
Table 63: Frequency of synonymous and non-synonymous polymorphisms of S. mutans from all samples.
Table 64: Frequency of synonymous and non-synonymous polymorphism of S. mutans from European samples.
Table 65: CNV1 copy number frequencies in the Leicester samples.
Table 66: CNV2 copy number frequencies in the Leicester samples.
Table 67: Genotype frequency, allele frequency and secretor status of the Leicester samples.
Table 68: Summary table relating regression analysis of polymorphic alleles of S. mutans with CNVs and secretor status of all Leicester samples.
Table 69: Summary table relating regression analysis of polymorphic alleles of S. mutans with CNVs and secretor status of European samples.
Table 70: Comparison of CNV1, CNV2 and SRCR copy number frequency in Crohn’s patients and controls of three different Crohn’s cohorts.
Table 71: Comparison of DMBT1 deletion allele frequency in Crohn’s patients and controls of three different Crohn’s cohorts.
Table 72: Copy number frequencies of CNV1 at DMBT1 in African HIV cohorts.
Table 73: Copy number frequencies of CNV2 at DMBT1 in HIV samples.
Table 74: Copy number frequencies of total SRCR of DMBT1 in HIV cohorts.
Table 75: Tests of association of copy number with HIV load pre-HAART.
Table 76: Tests of association of copy number with CD4 count during HAART.
Table 77: Copy number frequencies of CNV1 at DMBT1 in respiratory disease cohorts.
Table 78: Copy number frequencies of CNV2 at DMBT1 in respiratory disease cohorts.
Table 79: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with lung function in Gedling cohort.
Table 80: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with lung function in LRC. Significant results are shown in asterisk.
Table 81: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with lung COPD in the Gedling cohort.
Table 82: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with asthma (doctor diagnosed) in Gedling cohort.
Table 83: Association of CNV1, CNV2 and total SRCR copy number of DMBT1 with asthma-ICS in LRC.
Table 84: Copy number frequencies of CNV1 at DMBT1 in VUR cohort.
Table 85: Copy number frequencies of CNV2 at DMBT1 in VUR and UTI cohort.
Table 86: Genotype frequencies of FUT2 gene for secretor status in VUR and controls.
Table 87: Summary of samples and tests of association of copy number at DMBT1 with VUR samples.
Table 88: Summary of samples and tests of association of copy number at DMBT1 with UTI samples.
Table 89: Summary of samples and tests of association of copy number at DMBT1 with VUR and UTI samples.
LIST OF ABBREVIATIONS
aCGH: Array-Comparative Genomic Hybridisation
bp: Base pairs
CD: Crohn's Disease
CEPH: Centre d'Etude du Polymorphisme Humain
CNV: Copy Number Variation
CUB: C1r/C1s Uegf Bmp1
ddNTPs: Dideoxynucleotide triphosphates
DGV: Database of Genomic Variants
DMBT1: Deleted in Malignant Brain Tumours 1
DMBT1pbs1: DMBT1 pathogen binding site 1
DNA: Deoxyribonucleic acid
dNTPs: Deoxynucleotide triphosphates
ECM: Extracellular Matrix
FISH: Fluorescence In Situ Hybridisation
gDNA: Genomic DNA
Gp120: HIV Envelope surface glycoprotein 120
Gp340: Cell surface glycoprotein 340
GWAS: Genome-Wide Association Study
HA: Hydroxyapatite
HAART: Highly active antiretroviral therapy
HIV-1: Human Immunodeficiency Virus type 1
IgG: Immunoglobulin G
Indels: Insertions and Deletions
kDa: Kilodalton
LCRs: Low Copy Repeats
NAHR: Non-Allelic Homologous Recombination
NaOH: Sodium Hydroxide
ng: Nanogram
PAMP: Pathogen-associated molecular pattern
PCR: Polymerase Chain Reaction
PRR: Pattern recognition receptor
PRT: Paralogue Ratio Test
RFLP: Restriction Fragment Length Polymorphism
SAG: Salivary agglutinin
SID: SRCR interspersed domain
SNP: Single Nucleotide Polymorphism
SRCR: Scavenger receptor cysteine-rich
STRs: Short Tandem Repeats
UCSC: University of California Santa Cruz
VNTR: Variable Number Tandem Repeat
ZP: Zona pellucida
μg: Microgram
LIST OF WEB RESOURCES
The URLs for data presented in the thesis are as follows:
Human Genome Diversity Panel, http://www.cephb.fr/en/hgdp_panel.php
International HapMap Project, http://hapmap.ncbi.nlm.nih.gov/
UCSC Human Genome Browser, http://genome.ucsc.edu/
UCSC In-Silico PCR, http://genome.ucsc.edu/cgi-bin/hgPcr
Database of Genomic Variants, http://dgv.tcag.ca/dgv/app/home
R Project, http://www.r-project.org/
Coriell Institute for Medical Research, http://www.coriell.org/
Human Random Control DNA Panels, http://www.pheculturecollections.org.uk/products/dna/hrcdna/hrcdna.jsp
Repeat Masker, http://www.repeatmasker.org/
Basic Local Alignment Search Tool (BLAST), http://blast.ncbi.nlm.nih.gov/Blast.cgi
Clustal Omega, http://www.ebi.ac.uk/Tools/msa/clustalo/
Primer3, http://primer3.ut.ee/
Gideon database, http://www.gideononline.com/
Sequence Manipulation Suite, http://www.bioinformatics.org/sms2/
WebLogo 3, http://weblogo.threeplusone.com/
McDonald and Kreitman test, http://mkt.uab.es/mkt/
GraphPad, http://graphpad.com/quickcalcs/
ImageJ, http://imagej.nih.gov/ij/
HIV/AIDS, http://www.who.int/hiv/data/en/
1 INTRODUCTION
1.1 Copy number variation
The human genome shows extensive variation in different forms. Copy number variants (CNVs) account for a major proportion of human genetic polymorphism (Craddock et al., 2010) and contribute to the differences between individual humans (Hastings et al., 2009). CNVs also play an important role in genetic susceptibility to common disease. Human genetic variation is the diversity of alleles among humans and represents the total amount of genetic diversity within the human genome at both the individual and the population level (Conrad et al., 2010; Sudmant et al., 2010; Zhang et al., 2009). Recent studies have reported that variation exists in the human genome at different levels: large, microscopically visible chromosome anomalies (several kilobase to megabase pairs), submicroscopic copy number variation of DNA segments (tens to thousands of kilobase pairs) and variation at the single base pair level. CNVs are widespread in human genomes and are a major source of genetic variation in humans (Iafrate et al., 2004; Sebat et al., 2004). Deletions, insertions and duplications of DNA segments ranging from several kilobases (kb) to megabases (Mb) in size, present at variable number in comparison with a reference genome, are collectively referred to as copy number variants (CNVs) (Conrad et al., 2010). A CNV can be a simple tandem duplication, or may involve complex gains or losses of homologous sequences at multiple sites in the genome (Figure 1). Recent studies show that up to 12% of the genome is subject to CNV (Conrad et al., 2010; Iafrate et al., 2004; Kidd et al., 2010; Korbel et al., 2007; Redon et al., 2006). It has been reported that copy number varies between different organs and tissues in the same individual and that CNVs can arise both meiotically and somatically (Piotrowski et al., 2008). Most CNVs are benign variants and do not directly cause disease.
Genes involved in the development and activity of both the immune system and brain tend to be enriched in CNVs (Feuk et al., 2006; Zhang et al., 2009). The simplest type of copy number variation in the human genome arises from deletion or duplication of a gene. A diploid genome normally contains two copies of a particular gene, one on each homologous chromosome. CNVs can be categorised as diallelic or multiallelic. Diallelic CNVs have two alleles and can produce three different genotypes for both deletion and duplication events (Figure 1). A simple deletion can change the diploid copy number of a particular gene, resulting in a diploid copy number of two, one or zero (Figure 1). Similarly, after a simple duplication a diploid genome can contain two, three or four copies of a gene (Figure 1). However, the pattern of deletion or duplication events in the genome is not always simple and can result in complex copy number variation, known as multiallelic copy number variation (Wain et al., 2009). Successive rounds of duplication can produce multiallelic copy number variants in a diploid genome. Multiallelic CNVs have more than two alleles and can produce more than three genotypes (Figure 1). Generally, the size of the deleted or duplicated genomic segment can vary from a few hundred to several million bp, and the segment can contain an entire gene, part of a gene, a region outside of a gene or, in the case of larger variants, several genes.
Figure 1: Diallelic and multiallelic copy number variation (Wain et al., 2009). Diallelic locus (grey) and flanking loci (green and blue) with copy number variation caused by (A) deletion and (B) duplication, each showing the locus with (i) normal diploid copy number, (ii) heterozygous state, and (iii) homozygous state. (C) Multiallelic locus showing (i) normal copy number, (ii) multiple rounds of duplication on one chromosome and a deletion on the homologous chromosome, (iii) duplication on one chromosome and no deletion on the homologous chromosome, (iv) multiple rounds of duplication on one chromosome and no deletion on the homologous chromosome, (v) one round of duplication on each chromosome, (vi) one round of duplication on one chromosome and multiple rounds of duplication on the homologous chromosome, and (vii) multiple rounds of duplication on both chromosomes. Multiallelic assays measure total diploid copy number but cannot distinguish between the genotypes of (ii) and (iii), or of (iv) and (v).
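The ambiguity of total diploid copy number at a multiallelic locus can be illustrated with a short sketch. The Python below is an illustration only and is not part of the methods used in this thesis: it enumerates unordered pairs of alleles, each defined by an assumed per-chromosome copy number, and groups them by the diploid total. The allele values (0 to 3 copies) and the function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def genotypes_by_diploid_copy_number(allele_copy_numbers):
    """Group unordered allele pairs by the diploid copy number they produce."""
    groups = defaultdict(list)
    alleles = sorted(allele_copy_numbers)
    for a, b in combinations_with_replacement(alleles, 2):
        groups[a + b].append((a, b))
    return dict(groups)


# Diallelic deletion: each chromosome carries either 0 or 1 copy.
diallelic = genotypes_by_diploid_copy_number([0, 1])

# Multiallelic locus: each chromosome carries 0, 1, 2 or 3 copies
# (illustrative values, not measured alleles).
multiallelic = genotypes_by_diploid_copy_number([0, 1, 2, 3])

for total, pairs in sorted(multiallelic.items()):
    print(total, pairs)
```

For the diallelic case every diploid total corresponds to a single genotype. With alleles of 0 to 3 copies, a total of 3 arises from both (0, 3) and (1, 2), and a total of 4 from both (1, 3) and (2, 2), which mirrors the pairs of genotypes in panel C that an assay of total diploid copy number cannot separate.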
1.1.1 Classes of CNVs
Based on their mutational origin and the molecular mechanism of their formation, CNVs can be divided into two classes, frequently termed “recurrent” and “non-recurrent” CNVs. The mutation rates are thought to differ between recurrent and non-recurrent CNVs (Hollox & Hoh, 2014).
1.1.1.1 Recurrent CNVs
Recurrent CNVs exist in regions containing large segmental duplications and are mainly generated by non-allelic homologous recombination. Between 20% and 40% of normal polymorphic CNVs can be classified as recurrent CNVs (Conrad et al., 2010). Recurrent CNVs can occur anywhere in the genome, but hotspots for these CNVs mainly exist in subtelomeric and pericentromeric regions (Conrad & Hurles, 2009; Redon et al., 2006).
1.1.1.2 Non-recurrent CNVs
Non-recurrent CNVs involve large genomic regions, and break-point analysis shows that minimal or no homology is required for their formation (Conrad et al., 2010). Non-recurrent CNVs can be generated by non-homologous end joining (NHEJ) or fork-stalling and template switching (FoSTeS) mechanisms (Zhang et al., 2009), and many non-recurrent CNVs are unique (Hollox & Hoh, 2014). The majority of benign CNVs and a large percentage of pathogenic CNVs fall into this class, and some show extremely deleterious phenotypes (Arlt et al., 2009; Arlt et al., 2011).
1.1.2 Functional consequences of CNVs
The functional importance of many CNVs is relatively clear: reduced copy number of a gene can be correlated with reduced expression level, while duplicated copies of a gene can lead to increased expression level (McCarroll & Altshuler, 2007; Stranger et al., 2007). Between 85% and 95% of CNVs in humans and mice were reported to be associated with a change in expression of the affected genes (Stranger et al., 2007). CNVs are thought to be a major driving force in evolution and adaptation (Hancock et al., 2010; Iskow et al., 2012; Zhang et al., 2009). Additional copies of genes provide redundancy in sequence, so that some copies maintain the original function while extra copies are free to evolve new or modified functions (Inoue & Lupski, 2002). Copy number variation of specific genes can offer a selective advantage in human adaptation and evolution. For example, the amount of salivary amylase is directly proportional to the copy number of the AMY1 gene. The higher copy number of AMY1 in starch-consuming populations suggests that a high copy number of AMY1 may be advantageous in starch-eating individuals (Perry et al., 2008). However, much variation in the copy number of specific genes is disadvantageous and leads to a group of pathological conditions known as genomic disorders (Lupski & Ph, 2007). Copy number change in human somatic cells can lead to cancer formation and progression (Volik et al., 2006) and contributes to cancer proneness (Frank et al., 2007). CNVs have been reported to confer risk of complex diseases, including autism (Kumar et al., 2008; Marshall et al., 2008; Sebat et al., 2004, 2007), schizophrenia (Stefansson et al., 2009; Walsh et al., 2008; Xu et al., 2008), Crohn’s disease (McCarroll et al., 2009), psoriasis (Hollox et al., 2008) and systemic lupus erythematosus (Aitman et al., 2006). Several copy number variable genes, including those encoding known drug-metabolizing enzymes such as CYP2D6 and GSTM1 and potential drug targets such as CCL3L1, may also make significant contributions to pharmacogenomic studies (Ouahchi et al., 2006).
1.1.3 Mechanisms of structural change
Heritable CNVs are produced by germline genomic rearrangements that result in gains or losses of DNA segments. The different mechanisms of chromosomal structural change have been studied in model organisms, mainly yeast, Escherichia coli and Drosophila. Recent findings in these model organisms have suggested different mechanisms for copy number variation in the human genome (Hastings et al., 2009). The mechanisms of copy number variation in the human genome can be broadly categorised into two groups: homologous recombination (HR) and nonhomologous recombination (Alkan et al., 2011; Sudmant et al., 2010). Homologous sequences at least 300 bp long are required for HR in eukaryotic cells, whereas nonhomologous recombination mechanisms typically utilize short homologous DNA sequences called microhomologies to guide repair. The microhomologies are often present in single-stranded overhangs on the ends of double-strand breaks (Hastings et al., 2009). HR is the basis of several mechanisms of accurate DNA repair that use another highly identical sequence in the genome, such as a segmental duplication or related interspersed repetitive elements, to repair a damaged sequence. Chromosomal structural change and copy number variation can occur by HR when the repair mechanisms utilize homologous sequences in different chromosomal positions. This is called non-allelic or ectopic homologous recombination (NAHR). In contrast, nonhomologous recombination mechanisms change the copy number of genes because they use sequence from a non-homologous template (Lupski & Stankiewicz, 2005; Stankiewicz & Lupski, 2002). NAHR events are more frequent than NHEJ events, with estimates of up to 10⁻⁴ per locus per generation (Shaffer & Lupski, 2000), and tend to be associated with larger CNVs (Redon et al., 2006). Environmental factors and localized DNA conformations are likely to influence the rate of NHEJ events, which in general has been estimated at less than 10⁻⁷ per generation, similar to the estimated mutation rate of single nucleotide polymorphisms (SNPs) of 10⁻⁸ per locus per generation (Conrad & Hurles, 2009).
1.2 CNV detection methods
Recent studies have provided increasing evidence that CNVs have an important role in conferring differences in infectious disease susceptibility. Accurately determining the copy number of a particular gene is more challenging than SNP genotyping because CNV typing measures a quantitative difference rather than a qualitative difference. The methods first commonly applied to CNV typing were Southern blotting, fluorescence in situ hybridization (FISH) and quantitative PCR (qPCR). In recent years different methods have been developed to study CNVs with greater accuracy and precision (Fiegler et al., 2006; McCarroll & Altshuler, 2007). The accuracy and precision of a CNV typing method also depend on the availability of well-characterized copy number reference controls to allow comparison of results. Many recent methods also facilitate the study of many different genomic regions in parallel, sometimes on a genome-wide basis. At present, however, no single methodology can accurately genotype all CNV classes for large case-control studies, and power comes from combining methods and repeat typing (Hollox et al., 2008). Diverse approaches have been developed over the years, with good improvements in detection accuracy and precision.
1.2.1 Southern blotting and Pulsed Field Gel Electrophoresis
Southern blotting was first devised by E. M. Southern (1975). It was used as the standard method to detect gene deletion or duplication in the genome and was routinely used for genetic fingerprinting and paternity testing (Figure 2). In Southern blotting, restriction enzyme digested fragments of DNA are transferred from an electrophoresis gel to a nitrocellulose or nylon membrane. The immobilized DNA is hybridised with probes that specifically target individual sequences in the blotted DNA. Alterations in restriction fragment size appear as novel bands on the blot (Mellars & Gomez, 2011). The resolution of conventional agarose gel electrophoresis depends on the migration of DNA molecules through relatively small gel pores. Large random coils of DNA cannot be resolved through the much smaller pores of the gel matrix, leading to size independent mobility and loss of resolution. The resolution and accuracy of copy number measurement can be improved using pulsed field gel electrophoresis (PFGE) in combination with Southern blotting. The periodic alteration of the electric field in PFGE produces continuous re-orientation of DNA molecules and allows the resolution of large DNA fragments. Southern blotting has intrinsic limitations: it is laborious and time consuming, and requires large amounts of high-quality DNA. It is not suited to automation because it involves many steps, such as DNA digestion, electrophoresis, blotting and hybridization, and it can analyze only a limited number of loci per blot (a maximum of 10 to 15 when good probes are available). PFGE allows more scope for detection of deletions and duplications in larger genomic regions, with the ability to resolve DNA fragments up to 2 Mb. In a semi-quantitative approach, the intensity of probe hybridization to a specific target is compared with that of a control locus and a control sample, but uneven transfer of DNA to the nylon membrane or incomplete washing of probes can result in misinterpretation of band intensities.
Figure 2: Schematic picture of the Southern blotting method (adapted from Essential Cell Biology, Second Edition, Garland Science). (A) Separation of DNA by electrophoresis. (B) Transfer of DNA to a nitrocellulose or nylon membrane. (C) The nitrocellulose sheet is carefully peeled off the gel. (D) Hybridization of the membrane in buffer containing a radioactively labeled DNA probe specific for the required DNA sequence. (E) Specific DNA fragments appear as bands on the autoradiograph.
1.2.2 Fibre-FISH (Fluorescence in situ hybridization)
Fibre-FISH is a modified FISH technique developed for high resolution mapping of genes and chromosomal regions on fibres of chromatin or DNA. Fibre-FISH permits physical ordering of DNA probes to a resolution of 1000 bp. This high resolution allows assessment of gaps and overlaps in contigs and analysis of segmental duplications and copy number variations. In Fibre-FISH, the chromatin/DNA fibres are released from interphase nuclei and stretched on a glass slide by means of salt or solvent extraction. After stretching, the DNA fibres are fixed on a microscope slide before hybridization. The uniformity and reproducibility of DNA stretching improved significantly after implementation of the molecular combing protocol (Bensimon et al., 1994), in which the action of a receding air/water meniscus is used to extend and align DNA molecules attached at one end to a glass surface. Fibre-FISH allows the determination of copy number per allele, which is important for studies of inheritance and disease. The copy number of the salivary amylase gene (AMY1) was successfully genotyped using Fibre-FISH (Perry et al., 2008) (Figure 3). However, Fibre-FISH is labour intensive and low throughput, requires high quality samples and, because of overlapping signals, gives results that are difficult to interpret in highly variable regions (Cantsilieris et al., 2012).
Figure 3: AMY1 copy number estimation using high-resolution Fibre-FISH (Perry et al., 2008). Red (∼10 kb) and green (∼8 kb) probes encompass the entire AMY1 gene and a retrotransposon directly upstream of (and unique to) AMY1, respectively. (a) Individual with 14 diploid AMY1 gene copies showing one allele with 10 copies and the other with four copies. (b) Individual with 6 diploid AMY1 gene copies, consistent with Fibre-FISH results.
1.2.3 Array comparative genomic hybridization
Array comparative genomic hybridization (aCGH) was developed for high resolution, genome-wide screening of segmental genomic copy number variations (CNVs). This technique identifies unbalanced structural and numerical chromosomal abnormalities (Baris et al., 2007; Chin et al., 2007; Jaillard et al., 2010). aCGH allows comprehensive interrogation of thousands of discrete genomic loci for DNA copy number gains and losses. Routine karyotype analysis is not sensitive enough to detect subtle chromosome rearrangements (less than 4 Mb). The higher resolution and throughput of aCGH, together with possibilities for automation, robustness, simplicity, high reproducibility and precise mapping of aberrations, are therefore its most significant advantages over cytogenetic methods (Miller et al., 2010). aCGH is gradually replacing cytogenetic methods in an increasing number of genetics laboratories (Ahn et al., 2013). In aCGH, equal amounts of labelled genomic DNA from a test and a reference sample are cohybridized to an array containing the DNA targets. Genomic DNA of the patient and control are differentially labelled with Cyanine 3 (Cy3) and Cyanine 5 (Cy5) (Figure 4). Hybridization of the repetitive sequences can be blocked by the addition of Cot-1 DNA. The slides are scanned into image files using a microarray scanner, and the spot intensities are measured and analyzed to determine copy number (Ahn et al., 2013; Baris et al., 2007; Feuk et al., 2006; Jaillard et al., 2010). The resulting ratio of the fluorescence intensities is proportional to the ratio of the copy numbers of DNA sequences in the test and reference genomes. If the intensities of the fluorescent dyes are equal on one probe, this region of the patient’s genome is interpreted as having an equal quantity of DNA in the test and reference samples; an altered Cy3:Cy5 ratio indicates a loss or a gain of the patient DNA at that specific genomic region.
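The relationship between the Cy3:Cy5 intensity ratio and copy number can be sketched as follows. This is a minimal illustration, assuming the test sample is labelled with Cy3 and the reference with Cy5, that the reference carries two copies, and that intensities have already been normalised. The log2 transform is a common convention for array data rather than something stated above, and the function names and the calling threshold of 0.3 are hypothetical.

```python
import math


def expected_log2_ratio(test_copies, reference_copies=2):
    """Ideal log2(test/reference) ratio for a given diploid copy number."""
    return math.log2(test_copies / reference_copies)


def call_probe(cy3_test, cy5_reference, threshold=0.3):
    """Classify one probe from normalised intensities.

    threshold is an arbitrary illustrative cut-off, not a recommended value.
    """
    log_ratio = math.log2(cy3_test / cy5_reference)
    if log_ratio > threshold:
        return "gain"
    if log_ratio < -threshold:
        return "loss"
    return "no change"


# Ideal ratios for a test genome with 1, 2 or 3 copies against a 2-copy reference.
for copies in (1, 2, 3):
    print(copies, round(expected_log2_ratio(copies), 3))

print(call_probe(1500.0, 1000.0))  # test signal higher than reference
print(call_probe(1000.0, 1000.0))  # equal signal
```

Under these assumptions a single-copy gain (three copies against two) gives a smaller absolute log ratio than a single-copy loss (one copy against two), which is one reason why real data are usually smoothed across neighbouring probes before a call is made.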
Figure 4: Schematic picture of array-based comparative genome hybridization (array-CGH), adapted from Feuk et al. (2006). The reference and test DNA samples are differentially labelled with fluorescent tags (Cy5 and Cy3, respectively) and hybridized to genomic arrays after repetitive elements are blocked using COT-1 DNA. After hybridization, the fluorescence ratio (Cy3:Cy5) reveals copy-number differences between the two DNA samples. Typically, in array-CGH, the initial labelling of the reference and test DNA samples is reversed for a second hybridization (‘dye-swap’) (left and right sides of the panel) to detect spurious signals. The red line represents the original hybridization and the blue line represents the reciprocal hybridization.
1.2.4 Representational Oligonucleotide Microarray Analysis (ROMA)
Representational oligonucleotide microarray analysis (ROMA) is a variant of array-CGH designed to search for CNVs (Figure 5). The reference and test DNA samples are made into ‘representations’ to reduce the sample complexity before hybridization. DNA is digested with a common restriction enzyme that has uniformly distributed cleavage sites (BglII is shown in Figure 5) and ligated with common adaptors containing PCR primer sites. Ligated fragments are amplified by PCR under controlled conditions so that only DNA of less than 1.2 kb is amplified, thereby reducing the complexity of the DNA. The PCR amplified fragments are hybridized to the array for copy number detection (Lucito et al., 2003; Sebat et al., 2004). As in the common microarray method, the reference and test DNA samples are tagged with different fluorescent dyes, usually green and red. Computationally designed oligonucleotides (around 50-100 base pairs) are either spotted onto glass or synthesized on silica by laser photochemistry, with many copies of a single probe comprising each dot. If the gene is present equally in both samples, the dot glows yellow. A mostly-red or a mostly-green dot indicates a deletion or a duplication of the gene, respectively. The number of copies of the gene can be estimated from the colour intensity of each dot (Lucito et al., 2003). One limitation of the ROMA technique is that PCR can only amplify ~200,000 fragments of DNA, comprising approximately 2.5% of the human genome (Lucito et al., 2003).
Figure 5: Representational oligonucleotide microarray analysis (ROMA) for copy number detection (Feuk et al., 2006). DNA is digested with a common restriction enzyme with uniformly distributed cleavage sites (BglII). The adaptors with PCR primer sites are ligated to each fragment and are amplified by PCR. Only DNA of less than 1.2 kb (yellow) is amplified and hybridized to the array.
1.2.5 Quantitative PCR (qPCR)
Quantitative PCR (qPCR) is a high throughput technique to measure and validate copy number variation of a gene. qPCR measures PCR amplicons in real time, and the fractional cycle number at which amplification reaches a defined threshold during the exponential phase of the reaction (Ct) indicates the amount of starting template. The absolute or relative quantity of an unknown sample is measured using a standard curve, which is constructed by amplifying known amounts of target DNA (usually a serial dilution), plotting the resultant Ct values against log concentration and fitting a linear trend line to the data. The accumulation of PCR amplicon is measured by fluorescence-based chemistry, using either DNA intercalating dyes such as SYBR green or probe-based methodologies such as TaqMan, Scorpion and molecular beacons (Cantsilieris & White, 2013; Chen et al., 2006; Fellermann et al., 2006; Linzmeier & Ganz, 2005). Quantitative PCR is quick and powerful, requires very small amounts of DNA (5–10 ng) and is able to detect very small deletions or duplications in the genome. However, the number of targets is limited by the number of fluorophores available and the detection capabilities of the instrument (Cantsilieris et al., 2012). Quantitative real-time PCR is mainly used to confirm or validate copy number variation (Fellermann et al., 2006; Linzmeier & Ganz, 2005), but the resolution of this approach is not sufficient to distinguish between high copy numbers (more than 4 copies) of a gene (Hollox et al., 2008).
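The standard curve calculation described above can be sketched in a few lines. This is a hedged illustration only: the dilution series and Ct values are invented for the example, the function names are hypothetical, and the final conversion to copy number assumes a reference locus present at two copies per diploid genome.

```python
import math


def fit_standard_curve(quantities, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def quantity_from_ct(ct, slope, intercept):
    """Read the starting quantity of an unknown sample off the standard curve."""
    return 10 ** ((ct - intercept) / slope)


def relative_copy_number(target_quantity, reference_quantity, reference_copies=2):
    """Scale the target:reference ratio by the assumed reference copy number."""
    return reference_copies * target_quantity / reference_quantity


# Invented ten-fold dilution series (ng of template) and matching Ct values.
standards_ng = [100.0, 10.0, 1.0, 0.1]
standards_ct = [23.36, 26.68, 30.00, 33.32]

slope, intercept = fit_standard_curve(standards_ng, standards_ct)
unknown_ng = quantity_from_ct(26.68, slope, intercept)
print(round(slope, 2), round(intercept, 2), round(unknown_ng, 2))
```

Because each additional gene copy changes Ct by a progressively smaller amount on the log scale, the difference between, for example, five and six copies is much harder to resolve than the difference between one and two, which is consistent with the limitation noted above for higher copy numbers.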
1.2.6 Multiplex ligation-dependent probe amplification (MLPA)
Multiplex ligation-dependent probe amplification (MLPA) was first introduced in 2002 by Schouten and co-workers and has been widely applied in a variety of clinical and research situations (Schouten et al., 2002). It has proven to be an efficient and reliable technique for detection and validation of copy number variation (Hills et al., 2010; Janssen et al., 2005; Pedersen et al., 2010). In MLPA, two sequence-tagged half probes are annealed to adjacent sites on the genomic target sequence and ligated using a thermostable DNA ligase. The ligated probes are subsequently amplified with universal PCR primers (one of which is fluorescently labelled) and quantified using electrophoresis (Figure 6b). Each PCR product has a distinct size, which allows identification of specific DNA fragments. The amount of ligated probe is proportional to the copy number of the target gene and can be quantified after fractionating the PCR products by capillary electrophoresis. A typical capillary-based MLPA assay allows quantification of up to 45 distinct sequences of unknown copy number. In each MLPA experiment, reference probes are also included in the probe mix to allow the unknown copy number to be calculated. The reference probes are designed from non-variable chromosomal regions and are assumed to have a normal copy number (n=2) in both test samples and control samples. Several reference samples are recommended to estimate experimental variability. Groth and co-workers estimated beta-defensin copy number in 44 different samples using the MLPA technique, and a noticeable correlation was observed with other techniques, such as PRT and quantitative PCR (Groth et al., 2008). MLPA detects copy number variation of up to 45 distinct genomic sequences in a single reaction using small amounts of starting DNA (20 ng) and does not require cells for chromosome spreads.
MLPA can be used to target any genomic sequence for copy number analysis, irrespective of size or proximity to other targets. MLPA allows more accurate determination of the size of deletions or duplications in comparison with FISH or qPCR (Janssen et al., 2005). MLPA is high throughput and results can be obtained within 20 hours. There are significant challenges in designing custom probes for regions not yet commercially available as kits. A list of criteria (probe length, Tm, secondary structure, GC content, nucleotide composition at the ligation site, sequence uniqueness, avoidance of known SNPs, etc.) needs to be satisfied to improve the likelihood of a successful MLPA assay. Unknown SNPs in the probe binding regions may affect MLPA results and appear as exon deletions.
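The way reference probes and reference samples enter the copy number calculation can be sketched as below. This is a simplified illustration under stated assumptions, not the analysis pipeline used in this thesis: peak heights are invented, probe names such as "ref1" and "target" are hypothetical, and the control samples are assumed to carry two copies of the target.

```python
def normalise_to_reference(peaks, reference_probes):
    """Divide each target peak by the mean reference-probe peak of the same sample."""
    ref_mean = sum(peaks[p] for p in reference_probes) / len(reference_probes)
    return {
        probe: height / ref_mean
        for probe, height in peaks.items()
        if probe not in reference_probes
    }


def mlpa_copy_number(test_peaks, control_samples, reference_probes, control_copies=2):
    """Estimate copy number per target probe relative to two-copy control samples."""
    test_norm = normalise_to_reference(test_peaks, reference_probes)
    control_norms = [
        normalise_to_reference(sample, reference_probes) for sample in control_samples
    ]
    estimates = {}
    for probe, value in test_norm.items():
        control_mean = sum(c[probe] for c in control_norms) / len(control_norms)
        estimates[probe] = control_copies * value / control_mean
    return estimates


# Invented peak heights (arbitrary fluorescence units).
reference_probes = ["ref1", "ref2"]
controls = [
    {"ref1": 1000.0, "ref2": 1200.0, "target": 1100.0},
    {"ref1": 900.0, "ref2": 1100.0, "target": 1000.0},
]
test_sample = {"ref1": 1000.0, "ref2": 1000.0, "target": 1500.0}

estimate = mlpa_copy_number(test_sample, controls, reference_probes)
print(estimate)
```

Normalising within each sample first removes differences in DNA input and amplification yield, and the comparison with several control samples then gives a sense of experimental variability, in line with the recommendation to include more than one reference sample.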
1.2.7 Multiplex amplifiable probe hybridization (MAPH)
Multiplex amplifiable probe hybridization (MAPH) is a PCR-based method of quantifying multiple genomic loci in a single reaction (Armour et al., 2000; Hollox et al., 2002). The technique is based on the quantitative amplification of multiple probes that have been hybridized to immobilized genomic DNA (Figure 6a). MAPH probes are generated by cloning the target sequences into a plasmid vector, followed by PCR amplification of the cloned sequence using primers directed to the vector, so that all PCR products have the same flanking sequences. The probes therefore differ in length but share identical tails, which allows them all to be amplified with a single universal primer pair. Approximately 0.5–1 µg of denatured genomic DNA is spotted onto a nylon filter and hybridized with a set of probes corresponding to the target sequences. The membranes are washed rigorously to remove unbound probe, and the amount of each specifically bound probe that remains is proportional to its target copy number. The probes are stripped from the membrane, amplified simultaneously with the universal primer pair and separated by electrophoresis. A relative comparison is made between the test and control probes based on band intensities or peak areas/peak heights, depending on the detection method. Reduced band intensities or peak areas/peak heights compared to internal control probes indicate deletion, and increased band intensities or peak areas/peak heights indicate duplication. Armour et al. multiplexed up to 40 probes in a single reaction and resolved them simultaneously by gel electrophoresis (Armour et al., 2000). MAPH has been used to measure beta-defensin copy number (Armour et al., 2000; Hollox et al., 2002). The design of probes for MAPH is far simpler than MLPA probe generation. MAPH works with double-stranded DNA probes that are obtained from cloning or PCR.
SNPs in the probe binding regions are unlikely to affect MAPH but if part of a region targeted by a MAPH probe is deleted, the probe may still hybridize and the target will be scored as being present. The washing steps in the MAPH technique, necessary to remove unbound probe, may also introduce a contamination risk. MAPH requires 1µg of DNA for reliable and reproducible results.
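The relative comparison of test and control probes described above can be sketched as follows. This is a minimal illustration and not the published MAPH analysis: the function name, the probe names, the peak heights and the thresholds of 0.75 and 1.25 are all assumptions chosen for the example.

```python
from statistics import mean

def call_probe_dosage(test_peaks, control_peaks, reference_ratios,
                      loss_threshold=0.75, gain_threshold=1.25):
    """Classify each test probe as deletion, normal or duplication.

    test_peaks / control_peaks: probe name -> peak height (or area).
    reference_ratios: test probe name -> test/control ratio observed in
    normal reference DNA. The thresholds are illustrative assumptions.
    """
    control_mean = mean(control_peaks.values())
    calls = {}
    for probe, height in test_peaks.items():
        ratio = height / control_mean              # normalise to internal controls
        dosage = ratio / reference_ratios[probe]   # relative to normal DNA
        if dosage < loss_threshold:
            calls[probe] = "deletion"
        elif dosage > gain_threshold:
            calls[probe] = "duplication"
        else:
            calls[probe] = "normal"
    return calls

# Hypothetical peak heights for three test probes and two control probes
controls = {"ctrl1": 1000.0, "ctrl2": 1200.0}
tests = {"probeA": 550.0, "probeB": 1100.0, "probeC": 1650.0}
reference = {"probeA": 1.0, "probeB": 1.0, "probeC": 1.0}
calls = call_probe_dosage(tests, controls, reference)
print(calls)
```

In practice the thresholds would be set from the spread of ratios seen in control DNA rather than fixed in advance.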
Figure 6: Multiplex PCR-based methods for the identification of copy-number variants (adapted from Feuk et al., 2006). a | In multiplex amplifiable probe hybridization (MAPH), probes of different sizes (red) are normally cloned into vectors and amplified by PCR so that each end is flanked by the same sequence (blue). The genomic DNA is fixed to a membrane and the probes are hybridized to it. Unbound probes are removed by rigorous washing and the bound probes are then stripped from the membrane. The amount of probe present at this stage is proportional to its copy number in the target genomic DNA. The probes are amplified with a universal primer pair and size-separated by gel electrophoresis. Changes in peak height relative to control (non-CNV) DNA indicate the copy number. b | In multiplex ligation-dependent probe amplification (MLPA), two probes that hybridize adjacent to each other are designed for each target region (probes for two regions are shown in red and yellow). As in MAPH, all probe pairs are flanked by universal primer sites (blue). The probes are hybridized to genomic DNA and adjacent probes are ligated, joining the two primer sites together. The number of ligated probes is proportional to the target copy number. After denaturation, the ligated probes are amplified by PCR. A ‘stuffer’ sequence is sometimes added to one probe of each pair, next to its universal primer site, so that each probe set produces a fragment of a different size. Size separation by gel electrophoresis is carried out as for MAPH to detect deletions and duplications.
1.2.8
Fosmid Paired-End Sequencing
Fosmid paired-end sequencing was used to characterize structural variation in the human genome (Tuzun et al., 2005). The assay sequences both ends of the clones in a fosmid genome library (representing a single individual) and maps the paired end-sequences to the human genome reference sequence assembly (Figure 7). The method can identify deletions, insertions and even inversions by finding discordant regions where multiple fosmids differ from the reference in length and/or orientation. When the end-sequence pairs span a much shorter distance on the reference chromosome than expected for a fosmid insert, an insertion in the sampled genome is inferred; when they span a much longer distance (>48 kb), a deletion is inferred. In the case of a sequence inversion, the fosmid end-sequences have inconsistent orientation.
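The classification logic can be illustrated with a short sketch. It is not the pipeline of Tuzun et al.: the function names are hypothetical, the upper bound of 48 kb is the value quoted for deletions, and the lower bound of 32 kb is an assumed placeholder because the insertion threshold is not legible in this text.

```python
from collections import Counter

MAX_SPAN_BP = 48_000  # spans above this suggest a deletion
MIN_SPAN_BP = 32_000  # assumed lower bound for insertions (placeholder)

def classify_pair(span_bp, orientation_ok):
    """Classify one fosmid end-sequence pair mapped to the reference."""
    if not orientation_ok:
        return "inversion"
    if span_bp > MAX_SPAN_BP:
        return "deletion"
    if span_bp < MIN_SPAN_BP:
        return "insertion"
    return "concordant"

def supported_calls(pairs, min_support=2):
    """Return variant types backed by at least min_support fosmids.

    pairs: iterable of (span_bp, orientation_ok) for fosmids at one locus.
    """
    counts = Counter(classify_pair(span, ok) for span, ok in pairs)
    return sorted(kind for kind, n in counts.items()
                  if kind != "concordant" and n >= min_support)

# Hypothetical locus: two fosmids span too far on the reference, one is concordant
locus = [(55_000, True), (60_000, True), (40_000, True)]
print(supported_calls(locus))
```

Requiring two or more discordant fosmids, as the method does, guards against chimeric clones or mapping errors being reported as structural variants.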
Figure 7: The paired-end sequencing methodology for detection of structural variation (Tuzun et al., 2005). Fosmid end-sequence pairs spanning >48 kb define a deletion, and insertions are defined when two or more end-sequence pairs span [...]
[...] and reverse primers ( [...] 0.9, indicating very high quality of copy number calling for CNV2. Where this probability was below 0.8, the mean of a duplicate test was used to call the correct integer copy number.
Figure 59: Output of the clustering procedure using the PRT transformed data of CNV2 for HapMap samples. The coloured lines show the posterior probability for each of the eight copy number classes (copy number = 1; 2; 3; 4; 5; 6; 7; 8).
Figure 60: Analysis of integer copy number calling of CNV2 in HapMap samples. The scatter plot and associated histograms show the mean unrounded copy number values generated by PRT3 and PRT4, plotted against the posterior probabilities of the integer copy number call for the HapMap samples.
3.4.2.7 Distribution of CNV2 integer copy number
The CNV2 copy number per diploid genome was measured using the PRT3 and PRT4 assays in four HapMap populations (CEU, JPT, CHB and YRI). CNV2 copy number was estimated in a total of 269 HapMap samples. The study found eleven diploid copy number classes for the CNV2 region in the HapMap populations, ranging from 1 to 11, with a mean copy number of 4.41 per diploid genome (Figure 61). Most of the HapMap CEU samples fell in the higher diploid copy number classes (>4) for CNV2, but most individuals from the HapMap Asian (JPT and CHB) and African (YRI) populations fell in the lower diploid copy number classes ( [...] 0.99, indicating very high quality of copy number measurement for CNV1.
Figure 89: Analysis of CNV1 integer copy number calling using CNVtools. Clustering of the data followed by fitting of a Gaussian mixture model allowed integer copy number calling from the normalized mean unrounded values of CNV1 generated by PRT1 and PRT2. The Bayesian posterior probabilities of each CNV1 copy number call are shown for the HGDP samples.
6.2.2
Distribution of CNV1 diploid copy number in HGDP
A total of 971 individuals from 52 populations in 7 geographical regions were genotyped for CNV1 copy number. CNV1 diploid copy number ranged from 2 to 7. Of the 971 samples, 630 (65%) had a diploid copy number of 4, 174 (18%) had a diploid copy number of 3 and 104 (11%) had a diploid copy number of 5. The copy number counts and frequencies for CNV1 are presented in Table 37.
Table 37: CNV1 diploid copy number frequencies in HGDP samples.
Diploid copy number    Copy number count    Copy number frequency
2                      14                   0.01
3                      174                  0.18
4                      630                  0.65
5                      104                  0.11
6                      45                   0.05
7                      4                    <0.01
Total                  971
Mean                   4.00
[...] 4.25) the CNV2 data was re-analyzed using a mixture model of eight components, and CNVtools produced very good data clusters without any overlap (Figure 95). The resulting clustering quality score (Q) for the CNV2 data was 7.59. Based on long range PCR of the CNV2 region, the actual copy number was assigned to each copy number cluster. The first cluster was assigned a diploid CNV2 copy number of 1 and the remaining clusters were assigned diploid copy numbers of 2, 3, 4, 5, 6, 7 and 8, respectively.
Figure 95: Output of the clustering procedure using the PRT transformed data of CNV2 for HGDP samples. The coloured lines show the posterior probability for each of the eight copy number classes (copy number = 1, 2, 3, 4, 5, 6, 7 or 8).
Samples with a lower PRT ratio were manually assigned a diploid copy number of 0 (total deletion of CNV2) and revalidated using long range PCR. Samples with extreme mean PRT values were manually assigned diploid copy numbers of 9, 10 and 11 on the basis of the mean PRT value of CNV2. The posterior probabilities of the integer copy number call for each sample are plotted in Figure 96. The posterior probabilities for most of the samples were above 0.95, and posterior probabilities above 0.80 indicated that copy number calling for CNV2 was very good for the HGDP samples. Samples with posterior probabilities < 0.75 were retyped and revalidated using long PCR.
Figure 96: Analysis of CNV2 integer copy number calling using CNVtools for the HGDP samples. Clustering of the data followed by fitting of a Gaussian mixture model allows integer copy number calling from the normalized mean unrounded values of CNV2 generated by PRT3 and PRT4. The Bayesian posterior probabilities of each CNV2 copy number call are shown for the HGDP samples.
6.2.6
Distribution of CNV2 diploid copy number in HGDP
A wide range of diploid copy number (from 0-11) distributions was found for CNV2 in HGDP populations (Figure 97). The mean CNV2 copy number for HGDP populations was 4.37 and most of the HGDP samples (88%) showed diploid copy number from 2 to 6 for CNV2. The detailed copy number count and frequencies of CNV2 diploid copy number are presented in Table 45.
Table 45: CNV2 copy number frequencies in HGDP samples.
Diploid copy number    Copy number count    Copy number frequency
0                      4                    0.004
1                      21                   0.02
2                      149                  0.15
3                      108                  0.11
4                      252                  0.26
5                      167                  0.17
6                      186                  0.19
7                      48                   0.05
8                      27                   0.03
9                      5                    0.005
10                     3                    0.003
11                     1                    0.001
Total                  971
Mean                   4.37
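The summary statistics in Table 45 can be recomputed directly from the copy number counts. The following sketch uses only the counts given in the table; the function and variable names are illustrative and are not part of the original analysis.

```python
# Copy number counts for CNV2 in the HGDP samples, taken from Table 45
cnv2_counts = {0: 4, 1: 21, 2: 149, 3: 108, 4: 252, 5: 167,
               6: 186, 7: 48, 8: 27, 9: 5, 10: 3, 11: 1}

def summarise(counts):
    """Return sample total, mean diploid copy number and class frequencies."""
    total = sum(counts.values())
    mean_cn = sum(cn * n for cn, n in counts.items()) / total
    freqs = {cn: n / total for cn, n in counts.items()}
    return total, mean_cn, freqs

total, mean_cn, freqs = summarise(cnv2_counts)
share_2_to_6 = sum(freqs[cn] for cn in range(2, 7))  # fraction with 2 to 6 copies
print(total, round(mean_cn, 2), share_2_to_6)
```

The same function applies unchanged to the CNV1 counts in Table 37.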
Figure 97: Population distribution of CNV2 copy number in the HGDP samples. Distributions of CNV2 integer copy number in the HGDP populations, pie sizes are drawn in proportion to sample size.
6.2.7
Distribution of CNV2 diploid copy number in different geographical regions
The distribution of diploid CNV2 copy number was analysed to describe global variation in CNV2 copy number. The frequency of each copy number class is presented in Figure 98. Across all populations, CNV2 copy number ranged from 0 (complete deletion) to 11 copies (Table 46). A copy number of 4 was the most common class in all regions except Europe, Oceania and Africa. The 5-copy class was the most common in European populations (frequency 0.31), the 2-copy class was the most common in African populations (0.51) and the 6-copy class was the most common in Oceanian populations, where it was found in 47% of individuals.
Figure 98: Frequency distribution of worldwide CNV2 copy number in HGDP continental regions.
Table 46: CNV2 copy number frequencies in HGDP continental regions.
Region: America, South Asia, East Asia, Europe, Middle East, Oceania, Africa
Diploid copy number, with copy number count (frequency) for each region: 0: 0 (0), 1 ( [...]
[...] 5) were excluded from CNVtools analysis and called manually based on the mean PRT ratio. A mixture model of eight components was used based on the histogram of the mean PRT data of CNV2 (Figure 135). The clustering quality score (Q) for the CNV2 data was 3.86. Based on long range PCR of the CNV2 region, the actual copy number of the first cluster was assigned. The first cluster was assigned a diploid CNV2 copy number of 2 and the remaining clusters were assigned diploid copy numbers of 3, 4, 5, 6, 7, 8 and 9, respectively.
Figure 134: Output of the clustering procedure using the PRT transformed data of CNV2 for Scottish Crohn’s and control samples. The coloured lines show the posterior probability for each of eight copy number classes (copy number = 2, 3, 4, 5, 6, 7, 8 and 9).
The quality of copy number calling for CNV2 was assessed based on posterior probabilities of the integer copy number call for each sample in Figure 135. The posterior probabilities for most of the samples were more than 0.95 and CNV calling with posterior probabilities of more than 0.80 was considered good. A total of 80 samples that showed posterior probabilities < 0.75 (11%) were either retyped or called manually using duplicate PRT ratio. Finally posterior probabilities of CNV2 for Scottish samples indicated that the quality of copy number calling was similar to English CD cohorts.
Figure 135: Analysis of CNV2 integer copy number calling using CNVtools. Clustering of the data followed by fitting of a Gaussian mixture model allows integer copy number calling from the normalized mean unrounded values of CNV2 generated by PRT3 and PRT4. The Bayesian posterior probabilities of each CNV2 copy number call are shown for the Scottish Crohn’s and control samples.
8.3.3
Copy number estimation in Danish Crohn’s samples
8.3.3.1 CNV1 copy number estimation in Danish Crohn’s samples
A histogram of the mean PRT ratio of CNV1 in the Danish Crohn’s samples was examined to determine the number of clusters before CNVtools analysis. The histogram indicated four clusters for the CNVtools analysis (Figure 136).
Figure 136: Histogram of mean normalized PRT ratio of CNV1 for Danish Crohn’s and control samples.
The average raw PRT ratios were transformed to have a standard deviation of 1 before CNVtools analysis and integer copy numbers were called using transformed PRT data by a Gaussian mixture model in CNVtools. A mixture model of four components was fitted nicely (Figure 137) and the quality score was measured to check quality of the clusters. The resulting clustering quality score (Q) was 10.57. The first cluster indicated diploid copy number 2 and the remaining clusters were assigned diploid copy number 3, 4 and 5, respectively.
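The calling procedure described here (scale the ratios to a standard deviation of 1, fit a Gaussian mixture, then read off posterior probabilities) can be sketched in a few lines. This is not the CNVtools implementation, which is an R package with its own model; it is a simplified expectation-maximisation fit with a single shared standard deviation, written only to illustrate the idea. The PRT ratios are invented, and all function names are hypothetical.

```python
import math
from statistics import pstdev

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and SD sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posteriors(values, means, sigma, weights):
    """E-step: posterior probability of each cluster for each value."""
    out = []
    for x in values:
        dens = [w * normal_pdf(x, m, sigma) for w, m in zip(weights, means)]
        total = sum(dens)
        out.append([d / total for d in dens])
    return out

def fit_mixture(values, n_clusters, n_iter=100):
    """EM fit of a 1-D Gaussian mixture with one shared SD (a simplification)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_clusters - 1)
    means = [lo + j * step for j in range(n_clusters)]  # evenly spaced start
    weights = [1.0 / n_clusters] * n_clusters
    sigma = step / 4.0
    for _ in range(n_iter):
        post = posteriors(values, means, sigma, weights)
        for j in range(n_clusters):                     # M-step
            nj = sum(p[j] for p in post)
            weights[j] = nj / len(values)
            means[j] = sum(p[j] * x for p, x in zip(post, values)) / nj
        var = sum(p[j] * (x - means[j]) ** 2
                  for p, x in zip(post, values)
                  for j in range(n_clusters)) / len(values)
        sigma = max(math.sqrt(var), 1e-3)               # guard against collapse
    return posteriors(values, means, sigma, weights)

# Invented mean PRT ratios forming four clusters
ratios = [0.97, 1.02, 1.05, 1.46, 1.50, 1.53, 1.55, 1.95,
          1.98, 2.00, 2.02, 2.04, 2.07, 2.45, 2.52, 2.56]
sd = pstdev(ratios)
scaled = [r / sd for r in ratios]          # transform to a standard deviation of 1
post = fit_mixture(scaled, n_clusters=4)
lowest_copy_number = 2                     # copy number assigned to the first cluster
calls = [lowest_copy_number + p.index(max(p)) for p in post]
confidence = [max(p) for p in post]        # posterior probability of each call
print(calls)
```

The number of components is supplied by the user, mirroring the use of the histogram to choose the cluster count before fitting.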
Figure 137: Output of the clustering procedure using the PRT transformed data of CNV1 for Danish CD and control samples. The coloured lines show the posterior probability for each of the four copy number classes (copy number = 2, 3, 4 and 5).
The posterior probabilities of the integer CNV1 copy number call for each sample were plotted for the Danish CD samples (Figure 138). The posterior probabilities for most of the samples were greater than 0.99, indicating that CNVtools called them cleanly without any overlap between clusters. The posterior probability for one sample was poor (about 0.61), so the sample was retyped and called as 3 copies on the basis of its raw PRT value. Overall, copy number calling of CNV1 was very good for the Danish IBD samples and should be suitable for a case-control association study.
Figure 138: Analysis of CNV1 integer copy number calling using CNVtools. Clustering of the data followed by fitting of a Gaussian mixture model allows integer copy number calling from the normalized mean unrounded values of CNV1 generated by PRT1 and PRT2. The Bayesian posterior probabilities of each CNV1 copy number call are shown for the Danish CD and control samples.
8.3.3.2 CNV2 copy number estimation in Danish Crohn’s samples
The sensitivity and specificity of the PRT assays PRT3 and PRT4 were compared in the Danish CD samples using a scatter plot (Figure 139). The scatter plot showed a high correlation (r² = 0.94) between the raw PRT ratios of PRT3 and PRT4 and produced good quality clusters without any overlap for the first five clusters. Samples with higher PRT ratios clustered poorly around integer values, and their copy number was called by eye from the averaged PRT ratio.
Figure 139: Scatter plot of the raw PRT ratios of the PRT3 and PRT4 assays, used to estimate diploid copy number of CNV2 in Danish CD and control samples.
The histogram of the averaged PRT ratio of CNV2 data in Danish CD samples indicated seven clusters (Figure 140) and the samples showing raw PRT ratios greater than 4.5 were treated as outliers. The number of clusters (7) was used to measure the integer copy number of CNV2 using CNVtools. The outlier samples were called by eye using mean PRT ratio. A mixture model of seven components was used, based on a histogram of the mean PRT data of CNV2. The clustering quality score (Q) for CNV2 data was 4.15, greater than the English or Scottish CD cohorts. Based on long range PCR and prior knowledge of CNV2 ratio, actual copy number was assigned for each copy number cluster. The first cluster was assigned as a diploid CNV2 copy number of 2, and actual diploid copy numbers of the remaining clusters were assigned 3, 4, 5, 6, 7 and 8, respectively.
Figure 140: Output of the clustering procedure using the PRT transformed data of CNV2 for Danish CD and control samples. The coloured lines show the posterior probability for each of the seven copy number classes (copy number = 2, 3, 4, 5, 6, 7 and 8).
The quality of copy number calling for CNV2 was evaluated using the posterior probabilities of the integer copy number call for each sample (Figure 141). The posterior probabilities for most of the samples were above 0.95, and calls with posterior probabilities above 0.80 were also considered good. A total of 49 samples (7%) with posterior probabilities < 0.75 were either retyped or called by eye using the retyped PRT ratio.
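The quality thresholds used here can be expressed as a simple triage step. The sketch below takes the two cut-offs from the text (calls above 0.80 considered good, calls below 0.75 retyped); the "borderline" label for values between the two cut-offs, the sample identifiers and the function name are assumptions made for the example.

```python
def triage_calls(best_posterior, retype_below=0.75, good_above=0.80):
    """Group samples by the posterior probability of their copy number call.

    best_posterior: sample id -> highest posterior probability for that sample.
    """
    groups = {"good": [], "borderline": [], "retype": []}
    for sample, p in best_posterior.items():
        if p < retype_below:
            groups["retype"].append(sample)      # retype or call manually
        elif p > good_above:
            groups["good"].append(sample)
        else:
            groups["borderline"].append(sample)  # not covered by the text
    return groups

# Hypothetical posterior probabilities for five samples
example = {"S1": 0.999, "S2": 0.97, "S3": 0.82, "S4": 0.78, "S5": 0.61}
groups = triage_calls(example)
retype_fraction = len(groups["retype"]) / len(example)
print(groups, retype_fraction)
```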
Figure 141: Analysis of CNV2 integer copy number calling using CNVtools. Clustering of the data followed by fitting of a Gaussian mixture model allows integer copy number calling from the normalized mean unrounded values of CNV2 generated by PRT3 and PRT4. The Bayesian posterior probabilities of each CNV2 copy number call are shown for the Danish CD and control samples.
8.4 Comparison of aCGH and PRT for copy number estimation
To validate CNV1 and CNV2 copy number calls for the Crohn’s disease samples, the mean unrounded PRT ratio was compared with arrayCGH data previously generated using an Agilent 210k aCGH chip as part of the WTCCC genome-wide CNV association study. A total of 785 of our Crohn’s disease samples (97 Scottish and 688 English) had been analysed as part of the WTCCC CNV association study. Principal component analysis was used to compare the aCGH data and the raw PRT ratio for both the CNV1 and CNV2 regions. The aCGH data for the Crohn’s disease cohort were normalised in two different ways: in the first normalisation (normalised1), the log of the ratio of the red and green channel data (log(R/G)) was used, whereas in the second normalisation (normalised2), the log of the ratio of the quantile-normalised red and green channel data (log(QNorm(R)/QNorm(G))) was calculated. A scree plot was drawn for both the first and second normalised values to determine the proportion of variation described by each principal component and to decide which principal components might be useful for validating copy number calling of CNV1. The scree plot for the first normalised aCGH data shows that the first principal component describes 65% of the variation (Figure 142 A), reflecting the underlying copy number variation of the samples. The scatter plot of the normalised PRT ratios obtained from PRT1 and PRT2 indicated good correlation without any overlapping clusters for the English (r² = 0.96) and Scottish (r² = 0.90) Crohn’s samples. The average PRT ratio of the PRT1 and PRT2 assays was used for integer copy number estimation of CNV1. The average PRT ratios were compared with the first principal component (1PC) of the first normalised arrayCGH data. There was a clear positive correlation between the two methods for all samples (r² = 0.75), and the correlation was better for the English (r² = 0.77) than for the Scottish (r² = 0.62) Crohn’s samples. The data generated by both the PRT and aCGH assays clustered effectively, although the aCGH data showed a limited overlap of copy number values (Figure 142 B-D). The clusters produced by the average PRT ratio were better separated than the clusters of the 1PC generated from the aCGH data of CNV1.
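The comparison between the first principal component of the aCGH log ratios and the PRT ratio can be sketched as follows. This is an illustration only and not the analysis code used in the study: the intensities and PRT ratios are invented, the natural logarithm is an assumption (the base is not stated), only the first normalisation (log(R/G)) is shown, and the first principal component is obtained by simple power iteration.

```python
import math

def log_ratio(red, green):
    """First normalisation: log of red/green intensity for each probe."""
    return [math.log(r / g) for r, g in zip(red, green)]

def first_pc(rows, n_iter=200):
    """Scores on the first principal component and its share of the variance.

    rows: one list of probe log ratios per sample.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
           for a in range(p)]
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):                      # power iteration
        nxt = [sum(cov[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(vec[a] * cov[a][b] * vec[b]
                     for a in range(p) for b in range(p))
    explained = eigenvalue / sum(cov[a][a] for a in range(p))
    scores = [sum(c[j] * vec[j] for j in range(p)) for c in centred]
    return scores, explained

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Invented red-channel intensities for six samples at three probes
red = [[500, 515, 490], [740, 760, 735], [765, 730, 755],
       [1010, 985, 1005], [990, 1020, 1015], [1230, 1270, 1240]]
green = [1000, 1000, 1000]
prt = [1.02, 1.49, 1.53, 1.98, 2.03, 2.51]   # invented mean PRT ratios

rows = [log_ratio(sample, green) for sample in red]
scores, explained = first_pc(rows)
r2 = r_squared(scores, prt)
print(explained, r2)
```

The `explained` value corresponds to the height of the first bar in a scree plot, and `r2` to the correlation reported between the 1PC and the mean PRT ratio.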
Figure 142: (A) Scree plot for PCA of the first normalised aCGH data generated using the Agilent 210k CNV chip for CNV1 in the Crohn’s disease samples. The x-axis shows the number of principal components. (B, C, D) Scatter plots showing the correlation between the mean unrounded copy number value of CNV1 and the 1PC of the first normalised Agilent aCGH data in all (B), English (C) and Scottish (D) Crohn’s disease samples.
The scree plot for the second normalised aCGH data from the Crohn’s cohort showed that the first principal component explained the most variation (almost 65%), as seen previously for the first normalised data of CNV1 (Figure 143 A). The first principal component of the second normalised aCGH data was therefore used to validate the quality of integer copy number calling of CNV1 from the PRT ratio. The data generated by the PRT and aCGH assays clustered effectively with a moderate correlation (r² = 0.65), and the correlation was higher for the English CD cohort (r² = 0.66) than for the Scottish CD cohort (r² = 0.61) (Figure 143). The second normalised aCGH data produced more overlap between copy number values than the first normalised data. The clusters produced by the average PRT ratio were more distinct than the clusters of the 1PC of the second normalised aCGH data of CNV1 in the Crohn’s disease cohort.
Figure 143: (A) Scree plot for PCA of the second normalised aCGH data generated using the Agilent 210k CNV chip for CNV1 in the Crohn’s disease samples. The x-axis shows the number of principal components. (B, C, D) Scatter plots showing the correlation between the mean unrounded copy number value of CNV1 and the 1PC of the second normalised Agilent aCGH data in all (B), English (C) and Scottish (D) Crohn’s disease samples.
To validate CNV2 copy number calling, principal component analysis was performed on the aCGH data of the CNV2 region for all CD samples, and the aCGH data were compared with the mean unrounded raw PRT ratio. As for CNV1, both the first and second normalised aCGH data were used for the CNV2 region and all analyses were performed as for the CNV1 region. The scree plot showed that the first principal component of the first normalised aCGH data of CNV2 explained the most variation (almost 58%) in the Crohn’s disease cohort (Figure 144 A), although this was less than for CNV1. Two independent PRT assays (PRT3 and PRT4) were used to estimate the integer copy number of CNV2 in the English and Scottish Crohn’s disease cohorts. The normalised PRT ratios of PRT3 and PRT4 were compared, and the scatter plot indicated good correlation for the English (r² = 0.89) and Scottish (r² = 0.93) Crohn’s samples, with some overlap between clusters at higher PRT ratios. The mean unrounded PRT ratio generated by the PRT3 and PRT4 assays was used for integer copy number estimation of CNV2. The mean unrounded PRT ratios were compared with the first PC of the first normalised arrayCGH data. The scatter plot of the PRT ratios against the 1PC of the aCGH data showed a moderate positive correlation (r² = 0.55), lower than for the CNV1 region (Figure 144 B, C, D). The correlation was highest for the Scottish Crohn’s samples (r² = 0.66), followed by the English Crohn’s samples (r² = 0.54). The histogram of the average PRT ratio produced good clusters for CNV2 in the Crohn’s disease cohort, but the clusters were not as good as for the CNV1 region and overlapped at higher PRT values. For the CNV2 region, the first PC of the first normalised arrayCGH data did not show any evidence of clustering.
Figure 144: (A) Scree plot for PCA of the first normalised aCGH data generated using the Agilent 210k CNV chip for CNV2 in the Crohn’s disease samples. The x-axis shows the number of principal components. (B, C, D) Scatter plots showing the correlation between the mean unrounded copy number value of CNV2 and the 1PC of the first normalised Agilent aCGH data in all (B), English (C) and Scottish (D) Crohn’s disease samples.
The scree plot for the second normalised aCGH data of the Crohn’s disease cohort showed that the first principal component explained the most variation (almost 65%), more than the first principal component of the first normalised data of the CNV2 region (Figure 145). The scatter plot of the mean unrounded PRT ratios against the 1PC of the second normalised arrayCGH data showed a lower correlation (r² = 0.42) than any other data set used in our analysis. The correlation was higher for the Scottish Crohn’s samples (r² = 0.47) than for the English Crohn’s cohort (r² = 0.42) (Figure 145). For the CNV2 region, neither the first nor the second normalised 1PC of the arrayCGH data showed evidence of clustering around integer copy numbers.
Figure 145: (A) Scree plot for PCA of the second normalised aCGH data generated using the Agilent 210k CNV chip for CNV2 in the Crohn’s disease samples. (B, C, D) Scatter plots showing the correlation between the mean unrounded copy number value of CNV2 and the 1PC of the second normalised Agilent aCGH data in all (B), English (C) and Scottish (D) Crohn’s disease samples.
In previous work on the HapMap samples, scatter plots showed good clustering around integer copy numbers, and CNV1 was called well using both aCGH (Agilent 210k aCGH chip) and PRT assays. For CNV2, the PRT assays showed satisfactory clustering, although it was poor at higher copy numbers, and there was no evidence of clustering for the aCGH data. Comparison of aCGH and PRT ratios in the Crohn’s patients and controls produced results similar to those for the HapMap samples for both CNV1 and CNV2, although for CNV2 both aCGH and PRT showed poorer clustering. Clustering was nevertheless still evident for the PRT data, but not at all for the array CGH data.
8.5 Distribution of diploid copy number in the Crohn’s samples
8.5.1
Distribution of CNV1 copy number in Crohn’s samples
The distribution of diploid copy number for CNV1 is shown in Figure 146. CNV1 was clearly a multiallelic CNV, with copy number varying between 2 and 5 per diploid genome. The modal copy number was 4 in both case and control samples for all three Crohn’s cohorts. The mean copy number for cases and controls was almost the same in the English (CD = 3.82 and controls = 3.85) and Scottish (CD and controls = 3.84) cohorts. The mean CNV1 copy number for the Danish Crohn’s samples (3.83) was higher than for the control samples (3.73).
Figure 146: Distribution of diploid copy number for CNV1 of DMBT1 in CD samples. Bar graphs illustrating distributions of copy number determined using CNVtools by Gaussian mixture distributions in the English, Scottish and Danish Crohn’s samples (from left to right), from paralog ratio test data.
8.5.2
Distribution of CNV2 copy number in Crohn’s samples
The distribution of diploid copy number for CNV2 is shown in Figure 147. CNV2 was clearly a multiallelic CNV, with copy number varying between 2 and 10 per diploid genome in the English and Scottish cohorts and between 2 and 9 in the Danish cohort. The mean copy number for cases was lower than for controls in the English cohort (CD = 4.93, controls = 5.15), but the opposite trend was observed in the Scottish (CD = 5.27, controls = 5.15) and Danish (CD = 5.34, controls = 4.98) cohorts.
Figure 147: Distribution of diploid copy number for CNV2 of DMBT1 in CD samples. Bar graphs illustrating distributions of copy number determined using CNVtools by Gaussian mixture distributions in the English, Scottish and Danish Crohn’s samples (from left to right), using the paralog ratio test data.
8.5.3
Distribution of SRCR copy number in Crohn’s samples
The total copy number of SRCR domains was estimated from the diploid copy numbers of the CNV1 and CNV2 regions of DMBT1 together with the non-CNV region. The distributions of total SRCR copy number for the English, Scottish and Danish cohorts are shown in Figure 148, Figure 149 and Figure 150, respectively. The total number of SRCR domains ranged from 16 to 31 per diploid genome in the English and Scottish cohorts and from 16 to 32 in the Danish cohort. The three major classes of total SRCR domain number (24, 25 and 26) were found in all three cohorts.
Figure 148: Distribution of total SRCR domain of DMBT1 in the English Crohn’s samples. Bar graph illustrating distribution of total SRCR domain in the English Crohn’s samples.
Figure 149: Distribution of total SRCR domain of DMBT1 in the Scottish Crohn’s samples. Bar graph illustrating distribution of total SRCR domain in the Scottish Crohn’s samples.
Figure 150: Distribution of total SRCR domain of DMBT1 in the Danish Crohn’s samples. Bar graph illustrating distribution of total SRCR domain in the Danish Crohn’s samples.
8.6 Association of DMBT1 copy number with Crohn’s disease
There were no significant differences between cases and controls for CNV1 in any of the three CD populations. For CNV2, two of the cohorts (English and Danish) showed a slight difference that achieved modest statistical significance, but the trend was in the opposite direction in each cohort (Table 70). The mean CNV2 copy number was lower in CD cases (4.93) than in controls (5.15) in the English Crohn’s samples, whereas in the Danish Crohn’s samples the mean CNV2 copy number was higher in CD cases (5.34) than in controls (4.98). The same trend as in the Danish samples was seen in the Scottish Crohn’s samples (CD = 5.27 and controls = 5.15), but the effect was not significant. There was a significant difference (p = 0.005) between the mean SRCR copy number of cases and controls in the English Crohn’s samples, although this did not replicate in either the Scottish (p = 0.355) or Danish (p = 0.863) Crohn’s samples.
Table 70: Comparison of CNV1, CNV2 and SRCR copy number in Crohn’s cases and controls of three different Crohn’s cohorts.
CD cohort      Disease status    CNV1 copy number    CNV2 copy number    SRCR copy number
English CD     Crohn’s           3.82                4.93                24.18
               Control           3.85                5.15                24.56
               P value           0.259               0.006               0.005
Scottish CD    Crohn’s           3.84                5.27                24.64
               Control           3.84                5.15                24.46
               P value           0.891               0.328               0.355
Danish CD      Crohn’s           3.73                5.34                24.24
               Control           3.83                4.98                24.29
               P value           0.0671              0.0376              0.863
A previous study of people of similar ethnic background (Caucasian), including 367 Italian Crohn’s patients and 346 controls without a history of IBD, showed that a deletion allele of DMBT1 was associated with an increased risk of CD (P = 0.00056; odds ratio, 1.75) (Renner et al. 2007). The deletion allele, described previously as a diallelic copy number variation, corresponds to the CNV1 region in our study. Our study showed that CNV1 is a multiallelic CNV with copy number ranging from 2 to 5 per diploid genome in Caucasians. One CNV1 copy number variable unit represents a block of 4 SRCR domains, and diploid copy numbers of 2, 3 and ≥4 for CNV1 represent the homozygous deletion, the heterozygous deletion and the normal genotype of the longest allelic version, respectively. To determine whether the deletion allele increases the risk of CD in our cohorts, the numbers of deleted and non-deleted alleles were counted in CD cases and controls for the three cohorts; no statistically significant association was found between the presence or absence of the deletion allele and CD. The deletion allele frequency (10-14%) was almost equal in cases and controls in the three CD cohorts (Table 71).
Table 71: Comparison of DMBT1 deletion allele frequency in Crohn’s patients and controls of three different Crohn’s cohorts.
CD cohort      Allele         Cases         Controls     Fisher’s exact test
English CD     Deleted        172 (10%)     83 (9%)      p = 0.17
               Non-deleted    1490 (90%)    877 (91%)
Scottish CD    Deleted        67 (10%)      67 (10%)     p = 0.93
               Non-deleted    629 (90%)     613 (90%)
Danish CD      Deleted        44 (14%)      34 (10%)     p = 0.09
               Non-deleted    266 (86%)     314 (90%)
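The allele counting rule and the test applied to Table 71 can be sketched as follows. The counting rule (2 copies = two deleted alleles, 3 copies = one, 4 or more = none) is taken from the text; the two-sided Fisher's exact test below is a generic textbook implementation written for this illustration, not the software used in the study, so its p-values need not match the table to the last digit. All function names are hypothetical.

```python
import math

def deletion_allele_counts(diploid_copy_numbers):
    """Count deleted and non-deleted CNV1 alleles from diploid copy numbers."""
    deleted = sum(2 if cn == 2 else 1 if cn == 3 else 0
                  for cn in diploid_copy_numbers)
    return deleted, 2 * len(diploid_copy_numbers) - deleted

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more probable than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def log_p(x):
        return (log_comb(col1, x) + log_comb(n - col1, row1 - x)
                - log_comb(n, row1))

    observed = log_p(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    total = sum(math.exp(log_p(x)) for x in range(lo, hi + 1)
                if log_p(x) <= observed + 1e-9)
    return min(1.0, total)

# Allele counts from Table 71: (deleted cases, non-deleted cases,
#                               deleted controls, non-deleted controls)
cohorts = {
    "English": (172, 1490, 83, 877),
    "Scottish": (67, 629, 67, 613),
    "Danish": (44, 266, 34, 314),
}
p_values = {name: fisher_exact_two_sided(*t) for name, t in cohorts.items()}
print(p_values)
```

Working with log binomial coefficients keeps the calculation stable for tables with more than a thousand alleles per row.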
8.7 Discussion
A previous study reported that a deletion allele of DMBT1 resulting in a reduced number of SRCR domains was associated with an increased risk of CD (Renner et al. 2007). The previously reported deletion allele represents a low copy number at CNV1; 2 and 3 copies of CNV1 represent the homozygous and heterozygous deletion genotypes, respectively. The present study did not find any significant association between CD and mean CNV1 copy number. An allelic model of CNV1 also found no evidence of the association reported previously in the Italian CD cohort (Renner et al. 2007). A significant association between CNV2 copy number and CD was found in two cohorts, although it was not replicated in the third. The trend in mean copy number was in opposite directions in the two cohorts that showed association, which suggests that this is unlikely to be a true genetic association. A significant association between total SRCR copy number and CD was found in the English cohort, but it was not replicated in the Scottish or Danish CD cohorts. The modest statistically significant difference in CNV2 may be due to a very subtle differential bias between cases and controls. For both the English and Danish cohorts, the case and control samples were aliquoted on different plates without random distribution. In addition, in the English cohort the case and control DNA was collected from different sources. For the English CD study, random human (HRC) DNA isolated from lymphoblastoid cell lines was used as disease-free control samples. Healthy blood donors from a Danish blood bank were included as controls in the Danish Crohn’s cohort. For the Scottish cohort, however, the case and control DNA samples were distributed randomly across plates; 92% of the samples were collected from blood and the rest from saliva.
The present results suggest that copy number variation at DMBT1 is not a risk factor for CD pathogenesis; similar results were reported previously by the WTCCC genome-wide association study of CNVs for CD (Craddock et al. 2010). Our study showed that the WTCCC would have been able to call CNV1, but not CNV2. The previously reported association between the DMBT1 deletion allele and an increased risk of CD in an Italian cohort may not have been a true genetic association and might be a chance association due to subtle shifts in allele frequency. The frequency of the deletion allele was lower in all three of our cohorts (10-14%) than in the Italian CD cohort (22%) (Renner et al. 2007). Previous studies considered CNV1 to be diallelic, whereas the present study shows that both CNVs are multiallelic. The loss of SRCR copies due to a deletion allele at CNV1 might sometimes be compensated for by CNV2 copy number, resulting in no change in the total SRCR copy number in the diploid genome. DMBT1 is a glycoprotein and the nature and patterns of DMBT1 glycosylation remain to be clarified. Copy number together with glycosylation of DMBT1 might help to explain CD pathogenesis.
9 ASSOCIATION OF COPY NUMBER VARIATION OF DMBT1 IN AFRICAN HIV COHORTS
9.1 Introduction
Human immunodeficiency virus infection/acquired immunodeficiency syndrome (HIV/AIDS) is caused by HIV (HIV-1 and HIV-2), of the family Retroviridae, and is prevalent worldwide. A total of 35.3 million adults and children are infected with HIV (WHO-HIV department report, Figure 151 and Figure 152), and in 2006 it was predicted that it would be the largest cause of morbidity by 2030, as measured by disability-adjusted life-years (Mathers & Loncar, 2006). Disease prevalence differs between countries and continents. The highest disease burden of HIV is in African countries, with 9.2% prevalence in Addis Ababa in Ethiopia and over 10% in Dar-es-Salaam in Tanzania (Aklillu et al., 2013).
Figure 151: Worldwide HIV prevalence among adults (adopted from WHO-HIV department) (http://www.who.int/hiv/data/en/).
Various genetic studies have been conducted to establish the genetic contribution to HIV susceptibility in African populations, mainly from Western countries. A genome-wide association study has not yet found any significant signal of HIV susceptibility (Petrovski et al., 2011). Copy number variation often shows complex patterns of linkage disequilibrium with surrounding SNPs, and previous studies may have missed associations with structurally complex regions, mainly copy number variation (Locke et al., 2006).
Figure 152: Regional HIV and AIDS statistics according to WHO-HIV department (http://www.who.int/hiv/data/en/).
The retrovirus HIV-1 infects host cells through its viral envelope, which consists of a host cell-derived lipid bilayer displaying the virus-encoded envelope glycoprotein (gp120). HIV-1 infects host cells by binding to the CD4 molecule through gp120. Binding results in conformational changes in gp120 and forms a discontinuous region for high-affinity interaction with the chemokine receptor (Levy, 1993). Chemokine receptor binding triggers additional conformational changes in gp120 that eventually lead to the fusion of viral and cellular membranes. For effective binding and efficient entry to the host cell, HIV-1 needs CD4 and a co-receptor, such as CCR5 or CXCR4 (Farzan et al., 1997; Pleskoff et al., 1997; Sattentau & Moore, 1991; Stein & Engleman, 1991; Trkola et al., 1996a; Wu et al., 1996), and the chemokine receptor-binding sites are induced by CD4 interaction (Wu et al. 1996). DMBT1 behaves as a secreted or membrane-linked protein (Sasaki et al., 2002; Sasaki et al., 2003) and binds to HIV-1 through gp120. The DMBT1-binding site on gp120 appears to be a distinct inhibitory binding site, different from the CD4-binding site on gp120 (Wu et al., 2004, 2006). However, previous studies on the role of DMBT1 in HIV-1 infectivity have been contradictory. DMBT1 is expressed as a soluble protein in human saliva, where it binds to HIV-1 gp120 through protein-protein interactions (Wu et al. 2004; Wu et al. 2006; Chu et al. 2013), and as a membrane-associated protein in cervical and vaginal epithelial cells. DMBT1 facilitates HIV trans-infection and plays a role in sexual transmission (Stoddard et al., 2007). In vitro, it has been reported that DMBT1 binds to HIV-1 and inhibits human immunodeficiency virus type 1 infection (Nagashunmugam et al., 1998). Monocyte-derived macrophages also express DMBT1 and enhance the efficiency of HIV-1 infection by increasing the local concentration of infectious virus (Cannon et al., 2008). Previous studies identified an HIV-1 gp120 binding site on the SRCR1 domain of DMBT1 (Chu et al. 2013; Wu et al. 2006) and three different regions of SRCR domains directly interact with gp120 (Chu et al. 2013). The present study has characterized the SRCR regions by considering two CNVs (CNV1 and CNV2) and showed inter-population copy number variation. The hypothesis is that extra SRCR domains may facilitate improved binding to HIV-1 directly. At present, no study of DMBT1 copy number in relation to HIV status is available. In this study, two cohorts of HIV patients from Ethiopia and Tanzania were analyzed for association of copy number at DMBT1 with viral load immediately prior to highly active antiretroviral therapy (HAART). We also investigated whether DMBT1 copy number affects response to HAART.
9.2 Estimation of DMBT1 copy number in African HIV cohorts
9.2.1 Estimation of CNV1 copy number in African HIV cohorts
Diploid copy number was estimated using the mean unrounded PRT ratio of CNV1 in HIV samples. The histogram of the average PRT ratio showed four well-separated clusters, with a difference of almost 0.5 between the means of adjacent clusters. Four clusters were therefore specified in the CNVtools analysis. Samples showing PRT ratios greater than 2.5 were considered outliers. The mean PRT ratios of CNV1 for all samples were transformed to have a standard deviation of 1 to improve the quality of clustering, and integer copy numbers were inferred from the transformed PRT data using a Gaussian mixture model in CNVtools. A mixture model of four components was fitted, based on clustering of the normalized PRT data (Figure 153). The clustering quality score (Q) was 11.75. The first cluster corresponded to a diploid copy number of 3 and the remaining clusters were assigned diploid copy numbers of 4, 5 and 6, respectively.
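The clustering step described above can be illustrated with a minimal Python sketch. It is not the CNVtools implementation (CNVtools is an R package); it is a plain one-dimensional Gaussian mixture fitted by expectation-maximisation, written only to show the idea of scaling the ratios to unit standard deviation, fitting a fixed number of components, and labelling the clusters with consecutive integer copy numbers. The function names `fit_mixture` and `call_copy_number`, the cluster centres and the simulated ratios are all hypothetical.

```python
import math
import random
import statistics


def fit_mixture(values, k, n_iter=100):
    """Fit a k-component (k >= 2) 1-D Gaussian mixture by EM on unit-SD data.

    Returns (means, posteriors): means are on the scaled axis and
    posteriors[i][j] is the probability that sample i belongs to component j.
    """
    scale = statistics.pstdev(values)           # transform to standard deviation 1
    x = [v / scale for v in values]
    lo, hi = min(x), max(x)
    means = [lo + (hi - lo) * j / (k - 1) for j in range(k)]  # evenly spaced start
    sds = [(hi - lo) / (4 * k)] * k
    weights = [1.0 / k] * k
    post = []
    for _ in range(n_iter):
        # E-step: posterior membership, computed in log space for stability.
        post = []
        for xi in x:
            logd = [math.log(w) - math.log(s) - 0.5 * ((xi - m) / s) ** 2
                    for w, m, s in zip(weights, means, sds)]
            top = max(logd)
            dens = [math.exp(v - top) for v in logd]
            total = sum(dens)
            post.append([d / total for d in dens])
        # M-step: update weight, mean and SD of each component.
        for j in range(k):
            nj = sum(p[j] for p in post)
            if nj < 1e-9:
                continue                        # leave an empty component unchanged
            means[j] = sum(p[j] * xi for p, xi in zip(post, x)) / nj
            var = sum(p[j] * (xi - means[j]) ** 2 for p, xi in zip(post, x)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)
            weights[j] = max(nj / len(x), 1e-12)
    return means, post


def call_copy_number(values, k, first_copy_number):
    """Return one (copy_number, posterior) pair per sample.

    Clusters are ordered by mean; the lowest is given first_copy_number.
    """
    means, post = fit_mixture(values, k)
    order = sorted(range(k), key=lambda j: means[j])
    label = {comp: first_copy_number + rank for rank, comp in enumerate(order)}
    calls = []
    for p in post:
        best = max(range(k), key=lambda j: p[j])
        calls.append((label[best], p[best]))
    return calls


# Simulated example: cluster centres and sizes are made up for illustration.
random.seed(1)
centres = {3: 0.75, 4: 1.25, 5: 1.75, 6: 2.25}
sizes = {3: 30, 4: 100, 5: 40, 6: 20}
truth, ratios = [], []
for cn, centre in centres.items():
    for _ in range(sizes[cn]):
        truth.append(cn)
        ratios.append(random.gauss(centre, 0.05))
calls = call_copy_number(ratios, k=4, first_copy_number=3)
matches = sum(c == t for (c, _), t in zip(calls, truth))
print(matches, "of", len(truth), "simulated calls match the simulated copy number")
```

The evenly spaced starting means rely on the clusters being roughly equidistant, as the histogram suggested here; a dedicated package would normally be preferred for real data.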
Figure 153: Output of the clustering procedure using the PRT transformed data of CNV1 for HIV samples. The coloured lines show the posterior probability for each of the four copy number classes (copy number = 3, 4, 5 or 6). X-axis indicates transformed PRT ratio of CNV1.
Posterior probabilities of the integer copy number call for each sample are plotted in Figure 154. The posterior probabilities for most samples were greater than 0.95, and those of two samples were greater than 0.80, indicating that copy number calling for CNV1 was very good for the HIV samples.
Figure 154: Analysis of integer copy number calling. Scatter plot and associated histograms show mean unrounded copy number values of CNV1, generated by PRT1 and PRT2, plotted against posterior probabilities of integer copy number call for HIV samples.
9.2.2 Estimation of CNV2 copy number in African HIV cohorts
The mean PRT ratio for CNV2 ranged from 0 to 5.0, but most samples showed PRT ratios from 0.5 to 3.0. Samples with a mean PRT ratio around 0 or greater than 3.2 were treated as outliers and excluded from the CNVtools analysis. The histogram of the average PRT ratio of CNV2 in HIV samples indicated six clusters (Figure 155), so six components were used to infer integer copy number of CNV2 with CNVtools. The mean PRT ratios were transformed to have a standard deviation of 1, as recommended by CNVtools, and integer copy numbers were inferred from the transformed PRT data using a Gaussian mixture model. A mixture model of six components was fitted, based on clustering of the normalized PRT data (Figure 155). The clustering quality score (Q) was 7.08. The first cluster corresponded to a diploid copy number of 1 and the remaining clusters were assigned diploid copy numbers of 2, 3, 4, 5 and 6, respectively.
Figure 155: Output of the clustering procedure using the PRT transformed data of CNV2 for HIV samples. The coloured lines show the posterior probability for each of the six copy number classes (copy number = 1, 2, 3, 4, 5 or 6). X-axis indicates transformed PRT ratio of CNV2.
Posterior probabilities of the integer copy number call of CNV2 for each sample are plotted in Figure 156. The posterior probabilities for most samples were greater than 0.99; together with the samples with posterior probabilities greater than 0.80, this indicated that copy number calling for CNV2 was very good for the HIV samples. Samples with posterior probabilities of less than 0.8 (almost 2%) were retyped before final copy number calling. Outlier samples were genotyped by direct inspection of the mean PRT ratio, on the basis that each one-unit change in CNV2 copy number shifts the mean PRT ratio by about 0.5 relative to the previous cluster.
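The two quality-control rules in this paragraph can be sketched in Python as follows. This is an illustration only, not the procedure actually used: the function names are hypothetical, and the anchor (a mean PRT ratio of 0.5 corresponding to a diploid copy number of 1) is an assumption chosen to be consistent with the ratio range and cluster labels reported above.

```python
def triage_by_posterior(calls, threshold=0.80):
    """Split (sample_id, copy_number, posterior) records into two lists.

    Records at or above the threshold are accepted; the rest are
    flagged for retyping before final copy number calling.
    """
    accepted = [c for c in calls if c[2] >= threshold]
    retype = [c for c in calls if c[2] < threshold]
    return accepted, retype


def call_from_ratio(ratio, step=0.5, anchor_ratio=0.5, anchor_copy_number=1):
    """Assign a copy number to an outlier sample from its mean PRT ratio.

    Assumes cluster means are evenly spaced `step` apart and that the
    cluster centred on `anchor_ratio` has `anchor_copy_number` copies
    (the anchor values are illustrative assumptions, not measured values).
    """
    steps = round((ratio - anchor_ratio) / step)
    return max(anchor_copy_number + steps, 0)


# Hypothetical records: (sample identifier, called copy number, posterior).
records = [("s1", 3, 0.99), ("s2", 4, 0.75), ("s3", 2, 0.85)]
accepted, retype = triage_by_posterior(records)
print(len(accepted), "accepted;", len(retype), "flagged for retyping")

# Hypothetical outlier with a mean PRT ratio above the 3.2 cut-off.
print("outlier call:", call_from_ratio(3.5))
```

In practice an outlier call of this kind would still be checked against the raw data, since the even spacing between clusters is only approximate.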
Figure 156: Analysis of integer copy number calling. Scatter plot and associated histograms show mean unrounded copy number values of CNV2, generated by PRT3 and PRT4, plotted against posterior probabilities of integer copy number call for the HIV cohort.
9.2.3 Distribution of CNV1 and CNV2 copy number in African HIV cohorts
A total of 1002 individuals from two populations (Ethiopian and Tanzanian) were genotyped for CNV1 and CNV2 copy number estimation. Both CNVs at DMBT1 were multiallelic, as in other African populations (HapMap YRI and Zambian samples). The distribution of CNV1 diploid copy number (found in more than 1% of samples) varied from 3 to 7, with a modal CNV1 copy number of 4 (67%) (Figure 157). The mean CNV1 copy number for all HIV samples was 4.25, although at the population level mean CNV1 copy number was higher in Tanzanian samples (4.50) than in Ethiopian samples (4.10). The details of CNV1 copy number distributions in HIV and HIV+TB co-infected samples in the two populations are shown in Table 72. The CNV1 copy number range for the two African HIV populations is wider than that of European populations (HapMap CEU and HGDP populations of European origin). The Ethiopian population showed the greatest range (2-9 copies), although three copy number classes (2, 8 and 9) were each found in only one Ethiopian sample. The common copy number range for CNV1 was between 3 and 6 copies per diploid genome, and one Tanzanian sample showed a copy number of 7.
Table 72: Copy number frequencies of CNV1 at DMBT1 in African HIV cohort.
Diploid copy number 2 3 4 5 6 7 8 9 Total Mean
Ethiopian HIV count (frequency) 1 (