STEM CELLS AND DEVELOPMENT 16:381–392 (2007) © Mary Ann Liebert, Inc. DOI: 10.1089/scd.2007.0015
Technical Report

Multiarm High-Throughput Integration Site Detection: Limitations of LAM-PCR Technology and Optimization for Clonal Analysis

MICHAEL A. HARKEY,1 RAJINDER KAUL,2 MICHAEL A. JACOBS,2 PETER KURRE,3 DON BOVEE,2 RUTH LEVY,2 and C. ANTHONY BLAU4
ABSTRACT

Retroviral integration provides a unique and heritable genomic tag for a target cell and its progeny, enabling studies of clonal composition and repopulation kinetics after gene transfer into hematopoietic stem cells. The clonal tracking method, linear amplification-mediated polymerase chain reaction (LAM-PCR), is widely employed to follow the hematopoietic output of retrovirally marked stem cells. Here we examine the capabilities and limitations of conventional LAM-PCR to track individual clones in a complex multiclonal mix. Using artificial mixtures of retrovirally marked, single-cell–derived clones, we demonstrate that LAM-PCR fails to detect 30–40% of the clones, even after exhaustive analysis. Furthermore, the relative abundance of specific clones within a mix is not accurately represented, deviating by as much as 60-fold from their true abundance. We describe an optimized, multiarm, high-throughput modification of LAM-PCR that improves the global detection capacity to greater than 90% with exhaustive sampling, facilitates accurate estimates of the total pool size from smaller samplings, and provides a rapid, cost-effective approach to the generation of the large insertion-site databases required for evaluation of vector integration preferences. The inability to estimate the abundance of individual clones within mixtures remains a serious limitation. Thus, although LAM-PCR is a powerful tool for identification of integration sites and for estimations of clonal complexity, it fails to provide the semiquantitative information necessary for direct, reliable tracking of individual clones in a chimeric background.

INTRODUCTION
Linear amplification-mediated polymerase chain reaction (LAM-PCR), described by Schmidt et al. (1,2), is widely used for tracking clones of donor cells following stem cell transplants with retrovirally marked cells. This method has become an important tool in transplant studies that use retroviral gene transfer either as a means of tracking donor cells or as a vehicle for gene therapy.
1Department of Transplantation Biology, Fred Hutchinson Cancer Research Center, Seattle, WA 98109-1024.
2Genome Center, University of Washington, Seattle, WA 98195-2145.
3Department of Pediatrics, Oregon Health and Science University, Portland, OR 97239-3098.
4Department of Hematology, University of Washington, Seattle, WA 98195-7710.

LAM-PCR exploits the semirandom insertion of the retroviral cDNA (provirus) into the host genome of an infected cell, using the integration site as a unique signature by which to track that cell and its progeny. A variety of basic biological and clinical issues are addressed with this technology, including: (1) multiplicity of stem cell engraftment; (2) clonal fluctuation of stem cell contributions to hematopoiesis; (3) regulatory decisions of individual stem cells regarding self-renewal, differentiation, lineage choice, apoptosis, or quiescence; and (4) leukemic outgrowth of hematopoietic clones.

In the field of gene therapy, LAM-PCR technology has been employed to characterize possible integration site preferences of retroviral vectors. The integration site can impact the effectiveness of the therapeutic gene by suppressing or enhancing its expression (3–5). It can also disrupt the normal function of local host genes (6). The potential for initiating leukemia through insertional activation of a proto-oncogene is well documented in mice (7–9), and has recently been observed in clinical gene therapy trials (10). As evidence has emerged that retroviral insertion patterns are not totally random and may vary with the type of vector and with host cell activity (11–18), the need to characterize these patterns has grown in importance. The goal here would be to identify vectors and transduction conditions that facilitate the least risky insertional patterns and perhaps to manipulate these patterns in the future.

Due to the prevalent use of this technology in gene therapy research, it is important that its limitations be delineated, and that its full potential be developed. A number of studies have used the electrophoretic patterns generated from LAM-PCR reactions to estimate the number of stem cell clones contributing to hematopoiesis in a transplant recipient, the relative contributions of those clones, or the temporal history of those contributions (10,19–23). Although these questions are inherently quantitative in nature, the strong competitive bias that can be introduced using a multiple-template PCR format may seriously limit its quantitative value. Another possible limitation of this technology relates to the choice of restriction site used to anchor the second primer position of the PCR.
Insertion sites that occur either too close to or too far from a restriction site may be too small to resolve, or alternatively, may not be amplified, thus limiting the analysis to a subset of clones in a mix. A third limitation is the labor-intensive nature of the method. Analyzing preferred sites for vector insertion, or globally interrogating complex mixtures of clones, requires large sequence databases, which are difficult to generate using the present approach.

Here, we examine the limitations of conventional LAM-PCR, describe an optimized, multiarm, high-throughput modification that addresses these limitations, and compare the capabilities of these approaches with respect to quantitative and qualitative representation of a complex clonal mix.

MATERIALS AND METHODS

Reagents for LAM-PCR

Escherichia coli polymerase I large fragment and all restriction endonucleases (New England Biolabs, Beverly, MA), Advantage 2 polymerase (BD Biosciences Clontech, Palo Alto, CA), Fast-Link T4 DNA ligase (Epicentre Technologies, Madison, WI), streptavidin-coated magnetic M280 Dynabeads and magnetic separation stands (Dynal Biotech, Oslo, Norway), Novex TBE 4–20% polyacrylamide gels and TOPO-TA cloning kits (Invitrogen, Carlsbad, CA), as well as QIAamp genomic DNA Mini Kits, Qiaquick gel extraction kits, and QIAprep plasmid miniprep kits (Qiagen, Valencia, CA), were used according to the manufacturers' recommendations.

Cell lines and DNA isolation

NIH-3T3 cells were transduced at multiplicities of infection (MOI) of either 0.01 or 100, using a lentiviral vector (RRL-CFPG) that encodes green fluorescent protein (GFP), and sorted for GFP-positive cells as described (24). Individual clonal lines were developed by outgrowth of single GFP-positive cells. Genomic DNA was purified on QIAamp DNA Mini Kit columns.

Primers

Primers were obtained from Operon Biotechnologies (Huntsville, AL) and stored at 10 µM in 10 mM Tris, pH 7.9, at −20°C. All primers used for insertion site analysis are listed in Table 1. The long terminal repeat (LTR) primers are all designed to amplify the 3′ LTR–host junction from an inserted Lenti-based vector.

TABLE 1. PRIMERS USED FOR LAM-PCR

Primer        Sequence (5′–3′)
LTR-biotin    5′-Biotin-AAGCCTCAATAAAGCTTGCC
LTR-1         TGAGTGCTTCAAGTAGTGTGTGC
LTR-2         ACTCTGGTAACTAGAGATCC
AP-For        GACCCGGGAGATCTGAATTCAGTGGCACAGCAGTTAGG
AP-Rev-1      AATTCCTAACTGCTGTGCCACTGAATT
AP-Rev-2      CCTAACTGCTGTGCCACTGAATT
AP-1          GACCCGGGAGATCTGAATTC
AP-2          GATCTGAATTCAGTGGCACAG
Real-time PCR assay

DNA (100 ng) was amplified in duplicate with enhanced GFP (EGFP)-specific primers (5′-CTG CAC CAC CGG CAA-3′ and 5′-GTA GCG GCT GAA GCA CTG-3′; Integrated DNA Technologies Inc., Coralville, IA) and a fluorescence-tagged probe (5′-FAM-CCA CCC TGA CCT ACG GCG TG-TAMRA-3′; Perkin-Elmer Applied Biosystems, Foster City, CA). DNAs were normalized in duplicate by mouse GAPDH PCR with a commercially available assay (Perkin-Elmer). Primers and probes were designed using Primer Express software v.1.5 (Perkin-Elmer). Standard DNAs were from NIH-3T3 cells harboring a single copy of EGFP. Reactions were run using the ABI universal PCR master mix (Applied Biosystems, Branchburg, NJ) on the ABI Prism 7300 sequence detection system (Applied Biosystems) using the following thermal cycling conditions: 50°C for 2 min and 95°C for 10 min, followed by 40 cycles of 95°C for 15 sec and 60°C for 1 min. The spectrum was then analyzed using sequence detector v.1.3.0. Determination of copy numbers was made in two repeat PCR runs to generate average values and standard deviations.
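The copy-number arithmetic is not spelled out in the text. As a rough sketch only, one common way to convert GAPDH-normalized real-time PCR readings into EGFP copies per genome is the comparative Ct calculation below, calibrated against a single-copy standard. It assumes that both assays double the product in every cycle, which may not hold in practice. The function name and all Ct values are hypothetical and are not data from this study.

```python
from statistics import mean, stdev

def copies_per_genome(ct_gfp, ct_gapdh, ct_gfp_std, ct_gapdh_std, std_copies=1.0):
    """Comparative Ct estimate of EGFP copies relative to a single-copy standard.

    Assumption: both assays amplify with ~100% efficiency (doubling per cycle).
    """
    delta_sample = ct_gfp - ct_gapdh            # EGFP normalized to GAPDH, sample
    delta_standard = ct_gfp_std - ct_gapdh_std  # same normalization, standard
    return std_copies * 2.0 ** (delta_standard - delta_sample)

# Hypothetical Ct values from two repeat runs of one sample.
runs = [
    copies_per_genome(24.0, 22.0, 27.0, 22.0),
    copies_per_genome(24.2, 22.0, 27.0, 22.0),
]
average, spread = mean(runs), stdev(runs)
```

Averaging the repeat runs and reporting their standard deviation mirrors the two-run design described above; a standard-curve calculation would be an equally plausible alternative.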
Standard single-arm LAM-PCR

This method, illustrated in Fig. 1A, is similar to that of Schmidt et al. (2). All reactions and washes were conducted in 200-µl thin-walled PCR tubes or in the wells of 96-well PCR plates.
FIG. 1. Schematic diagram of LAM-PCR protocols. (A) Standard single arm protocol as described by Schmidt et al. (1). (B) Three-arm high throughput protocol. Gray boxes indicate significantly modified or added steps in the three-arm protocol.
Step 1 (linear amplification): Reactions of 50 µl containing 50–100 ng genomic DNA, 2 nM dNTPs, 5 nM 5′-biotinylated LTR primer (LTR-biotin), 1× Advantage 2 buffer, and 1 µl of Advantage 2 Enzyme mix were run in a GeneAmp 9700 thermal cycler (Applied Biosystems) for 50 cycles: 20 sec at 95°C, 45 sec at 60°C, and 90 sec at 68°C. Enzyme and dNTPs were replenished before continuing for an additional 50 cycles.

Step 2 (binding to matrix): Streptavidin-coated magnetic beads (200 µg) were washed twice in 100 µl of binding buffer (1 M NaCl, 5 mM Tris, pH 7.5, 0.5 mM EDTA) using a magnetic separation stand, resuspended in 50 µl of 2× binding buffer, and mixed with the linear PCR reaction. The suspension was incubated for 60 min at 22–25°C under constant agitation, and then washed three times in 200 µl of wash buffer (10 mM NaCl, 5 mM Tris, pH 7.5, 0.5 mM EDTA, 0.01% vol/vol Triton X-100) with the aid of a 96-well magnetic separation unit. All subsequent washes were done the same way.

Step 3 (second-strand synthesis): Matrix-bound DNA was resuspended in 20 µl of Klenow reaction containing 500 nM dNTPs, 100 ng/µl random hexamers, 2–5 units of Klenow enzyme, and manufacturer-supplied buffer at 1×, incubated at 37°C for 60 min, and washed.

Step 4 (digestion): Beads were resuspended in 20 µl of digest reaction containing 10 units of Tsp509I and 1× supplied buffer, incubated at 55°C for 60 min, and washed.

Step 5 (ligation of exposed Tsp509I sites to anchor-primer): Beads were resuspended in 10 µl of ligation reaction containing 4 units of Fast-Link T4 ligase, 2 mM ATP, 1× buffer (supplied with enzyme), and 10 µM double-stranded, Tsp509I-compatible Anchor-Primer (made by annealing primers AP-For and AP-Rev-1). Reactions were incubated for 30 min at room temperature and washed.

Step 6 (elution): Beads were resuspended in 5 µl of 0.1 N NaOH and incubated at room temperature for 10 min. The eluate was separated from the matrix using the magnetic separator.

Step 7 (nested PCR): Two microliters of the eluate was used as template in a 50-µl PCR reaction containing 1 µl of Advantage enzyme mix, 1× Advantage 2 buffer, 200 µM dNTPs, and 500 nM each of primers LTR-1 and AP-1. Amplification was for 30 cycles, using the same conditions as in step 1. Reactions were diluted 100-fold in sterile distilled water. Two microliters of the diluted product was used as template in a 20-µl PCR reaction containing primers LTR-2 and AP-2. All components were at the same concentrations as in the first nested PCR, and cycling conditions were identical.

Step 8 (electrophoresis): Five microliters of PCR product was separated on a TBE-buffered 4–20% polyacrylamide gel at 300 volts for 31 min, and then stained with ethidium bromide and recorded in transmitted UV light on a ChemiDoc Imager (BioRad).

Step 9 (isolation and sequencing of bands): Fifteen microliters of PCR product was separated on a 3% agarose gel. Bands were detected by ethidium bromide staining and excised, and the DNA was purified using a Qiagen Gel Purification kit. DNA was cloned into the PCR4-TOPO vector, using a TOPO-TA cloning kit, and transferred into electrocompetent TOP10 cells by electroporation. Plasmid was isolated from ampicillin-resistant clones and sequenced using Big Dye fluorescent dideoxynucleotide chemistry (Amersham) and a Prism 310 Genetic Analyzer (ABI). Sequences were compared to the mouse genome using the UC Santa Cruz BLAT service (http://genome.ucsc.edu/cgi-bin/hgBlat).
Three-arm high-throughput LAM-PCR

This method is illustrated in Fig. 1B. Steps 1–3 were as described for the single-arm method. The beads were then divided into three aliquots and processed as three separate arms.

Step 4 (digestions): Beads were resuspended in 20-µl digest reactions containing 10 units of Tsp509I (Arm T), HaeIII (Arm H), or RsaI (Arm R) and 1× supplied buffer. Reactions were incubated at 55°C (Arm T) or 37°C (Arms H and R) for 60 min, and washed.

Step 5 (ligation): Ligation was done as in the standard method. Anchor-Primer for Arm T was as described for the single-arm method. Anchor-Primer for Arms H and R, which had blunt ends, was generated by annealing primers AP-For and AP-Rev-2.

Step 5.5 (secondary digest to eliminate internal control): Beads were resuspended in a 20-µl digest reaction containing 10 units of BssHII and 1× supplied buffer, incubated at 37°C for 60 min, and washed. The retroviral vector used here contains a BssHII site 72 bp downstream of the 5′ long terminal repeat (LTR), which lies between the priming positions that generate the internal control product in the nested PCR steps. Other vectors require other specific secondary digests, depending on the sequence of the internal control region.

Steps 6 and 7 (elution and nested PCR): These steps were performed for each of the three arms as described for the single-arm method.

Step 8 (shotgun cloning and sequencing): The PCR products were separated on 3% agarose, and fragments between 100 and 900 bp were excised as a block. The DNA was purified using a Qiaquick gel extraction kit. One microliter of DNA was ligated into the PCR4-TOPO vector using a TOPO-TA cloning kit, transferred into TOP10 electrocompetent E. coli by electroporation, and plated on LB agar with carbenicillin in 530-cm² Bio-Assay plates (Genetix Inc., Boston, MA). All subsequent procedures were automated, including robotic picking and growth of clones, amplification of plasmid with TempliPhi technology (Amersham), and fluorescent dideoxy-termination-based sequencing (Amersham).

A Perl program was written (available at http://www.genome.washington.edu) to assess the presence of LTR and anchor sequences, and to determine the position and orientation of the insertion(s) within transduced cells. FASTA files from sequence traces were searched first for the LTR sequence in the forward or reverse direction, relative to the sequence trace, followed by a search for the anchor sequence in either direction and, similarly, for the T3 and T7 sequences flanking the cloning site of the vector. Sequences in which the LTR could be found were used to isolate the "junction sequence" (the sequence adjacent to the LTR), which was then subjected to BLAST analysis against the mouse genome. The output results included the length of the junction sequence (i.e., the distance between the LTR and anchor sequences), its position in the mouse genome, and its orientation within the genome.
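The junction-isolation logic described above can be sketched in a few lines. The following is not the authors' Perl program; it is a minimal Python illustration that searches a read in both orientations for the LTR end, then trims at the anchor if one is present. The function names and the short LTR, anchor, and host sequences are hypothetical placeholders chosen only for the example, and the T3/T7 search and genome alignment steps are omitted.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase or lowercase DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))

def extract_junction(read, ltr_end, anchor_start):
    """Return (junction_sequence, strand), or None if no LTR is found.

    The read is tried as given ("+") and as its reverse complement ("-").
    The junction is the sequence between the LTR end and the anchor; if the
    anchor is absent, everything after the LTR is returned.
    """
    for strand, candidate in (("+", read.upper()), ("-", reverse_complement(read))):
        ltr_pos = candidate.find(ltr_end)
        if ltr_pos == -1:
            continue
        start = ltr_pos + len(ltr_end)
        anchor_pos = candidate.find(anchor_start, start)
        end = anchor_pos if anchor_pos != -1 else len(candidate)
        return candidate[start:end], strand
    return None

# Hypothetical placeholder sequences, not the vector or primers used here.
LTR_END = "GATCTCTAGCA"
ANCHOR = "AATTCCTAACTG"
HOST = "TTGACCATGGCATTACG"
read_forward = "ACGT" + LTR_END + HOST + ANCHOR + "ACGT"
read_reverse = reverse_complement(read_forward)

forward_hit = extract_junction(read_forward, LTR_END, ANCHOR)
reverse_hit = extract_junction(read_reverse, LTR_END, ANCHOR)
```

The length of the returned junction corresponds to the LTR-to-anchor distance reported by the program, and the junction string is what would be passed on to a genome alignment tool.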
RESULTS

Test pools of retroviral insertion sites

Our first objective was to test the efficiency and accuracy with which conventional LAM-PCR can be used to interrogate a complex pool of integration sites, such as would occur following transplantation of retrovirally marked hematopoietic stem cells. To address these issues, we generated a series of pools of varying complexity (1–110 integration sites) by mixing clones of NIH-3T3 cells containing lentiviral insertions. Six low-copy clones (lines 1–6; Table 2) were generated by transducing cells at a very low MOI (0.01) with a lentivirus vector encoding GFP, then expanding lines from single GFP-expressing cells. A pool was generated by mixing DNA from lines 1–6 in equal proportions (Mix 1–6). Three multiple-copy clones (lines A, B, and C; Table 2) were similarly generated after transducing at high MOI (100). The largest pool was generated by mixing DNAs from lines A, B, and C in equal proportions (Mix A–C).

Estimates of the numbers of insertions in these various pools (Table 2) were made by quantitative real-time PCR (QPCR) and by exhaustive multiarm LAM-PCR (see below). The copy number values were comparable by the two methods, with somewhat higher values detected by LAM-PCR. Because these latter values were based on repeated identification of specific viral–host sequence junctions, they were considered more reliable. Three low-MOI clones were confirmed to contain a single transgene copy, whereas a fourth clone contained two detectable insertions. Two of the single-copy clones contained the identical insertion, indicating they were probably derived from the same clone. The two remaining clones, one of which was positive by GFP PCR, yielded no identifiable insertions by LAM-PCR. The GFP-positive clone produced a short LAM-PCR product whose "host sequence" could not be identified. Thus, a total of four unique, verifiable insertions were found among the six low-MOI lines. The high-MOI clones yielded 40, 31, and 38 insertions, for a total of 109 unique genomic insertions in the high-MOI mix.

TABLE 2. NUMBERS OF INSERTION SITES DETECTED IN CLONES AND MIXES

Number of insertion sites detected

                                                                    Calculated from "efficiency of detection"ᶜ
Pool                     Quantitative PCRᵃ    Three-arm LAM-PCRᵇ    10% threshold    50% threshold
Low MOI line 1           2 ± 1                1                     1                1
Low MOI line 2           1 ± 1                0                     0                0
Low MOI line 3           1 ± 1                1                     1                1
Low MOI line 4           0 ± 1                0                     0                0
Low MOI line 5           3 ± 1                2                     2                2
Low MOI line 6           2 ± 1                1                     1                1
Low MOI mix (1–6)        9 ± 5ᵈ               4ᵉ                    3.5              3.5
High MOI line A          34 ± 17              40                    42               57
High MOI line B          24 ± 12              31                    34               32
High MOI line C          25 ± 12              38                    40               36
High MOI mix (A–C)       83 ± 41ᵈ             103                   120              114
Combined A, B, C data    83 ± 41ᵈ             109
Combined (A–C)           83 ± 41ᵈ             114

ᵃQPCR values with 50% variance assumed.
ᵇExhaustive three-arm LAM-PCR to 10-fold sampling excess over pool size.
ᶜCalculated from three-arm LAM-PCR data using the "efficiency of detection" function from Fig. 5B. For the 10% efficiency threshold (sampling size at which only 10% of reactions yield a novel site), p = n/0.86, where p is the number of total sites in the pool and n is the number of detected sites. For the 50% threshold, p = n/0.5.
ᵈValues calculated as sum of components.
ᵉInsertion sites in lines 3 and 6 were identical and were counted as a single site.
Quantitative accuracy of standard LAM-PCR

The standard method, as described by Schmidt et al. (2), is illustrated in Fig. 1A (see Materials and Methods). The resulting junction fragments are amplified by nested PCR and separated by electrophoresis. Typically, each inserted provirus yields two bands: one unique to that insertion, whose size depends on the distance from the 3′ LTR to the nearest host Tsp509I site, and an internal control band of constant size, generated within the vector from the 5′ LTR (e.g., lane 1 of Fig. 2A).

We first tested the accuracy with which the electrophoretic banding profile of LAM-PCR represents a small number of insertion sites. Each of the six low-copy lines and the pool composed of equal contributions from each of these lines (mix 1–6) were assayed by standard LAM-PCR, and the products were separated by electrophoresis (Fig. 2A). Five of the six lines yielded a simple banding pattern, composed of a common internal control band and a unique band representing the viral–host junction. Many of the less intense bands (e.g., lane 2) represent various PCR artifacts, often not related to insertion sites, or in the case of line 5, a second insertion site (Table 2). The banding pattern of the pool (lane M), however, failed to reveal at least three of the major bands observed in individual clones.

To test the quantitative representation of this mix by standard LAM-PCR more accurately, we generated a library of junction site clones from the low-MOI pool (mix 1–6) by shotgun cloning the unfractionated product and sequencing 384 random clones. Although the four detectable insertion sites were mixed in similar abundance in the low-MOI pool (1:1:1:2), their relative representation in the library (0:1:33:66) was highly skewed (Fig. 2B). Similar analysis of the insertion sites in the 40-site pool (Fig. 2C) and the 109-site pool (Fig. 2D) yielded similarly unequal representation of the equimolar insertion sites. In each analysis, the number of sites detected was well below the actual number of sites, and the quantitative representation of detected sites varied by as much as 60-fold. These data indicate that neither the gel patterns nor the libraries generated by conventional LAM-PCR yield a quantitatively or qualitatively accurate representation of the pool.
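To make the skew in the low-MOI library concrete, the sketch below compares each site's share of the library with its share of the input mix. It assumes that the two ratios quoted above (1:1:1:2 in the pool and 0:1:33:66 in the library) list the four sites in the same order; the function names are illustrative only.

```python
def representation_ratios(expected_parts, observed_counts):
    """Observed share divided by expected share for each insertion site."""
    expected_total = sum(expected_parts)
    observed_total = sum(observed_counts)
    return [
        (obs / observed_total) / (exp / expected_total)
        for exp, obs in zip(expected_parts, observed_counts)
    ]

def fold_range(ratios):
    """Largest fold difference among sites that were detected at all."""
    detected = [r for r in ratios if r > 0]
    return max(detected) / min(detected)

expected = [1, 1, 1, 2]    # input ratio of the four sites in mix 1-6
observed = [0, 1, 33, 66]  # their relative representation in the library
ratios = representation_ratios(expected, observed)
skew = fold_range(ratios)
```

Under that ordering assumption, one site is absent altogether and the detected sites differ by roughly 33-fold relative to expectation, which is the kind of distortion the larger pools show to an even greater degree.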
Because the efficiency of PCR amplification can vary sharply with the size of the template, some of this unequal representation could be related to the relative sizes of the PCR products. There is no simple correlation between these parameters (Fig. 2E), but there is a strong bias toward fragments of 100–200 bp (Fig. 2F). This result is consistent with the predicted average spacing of 123 bp for AATT motifs (Tsp509I sites) in a genome of approximately 40% GC content and, as such, does not demonstrate a strong size-related amplification bias within the 100 to 800 bp range. Part of the bias in LAM-PCR probably results from the nonuniform distribution of Tsp509I sites. Because the assay is based on proximity to these sites and requires a distance of 50–800 bp for robust amplification and homology analysis, regions of very high or very low Tsp509I site density would be expected to suppress detection of some retroviral insertion sites.
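The 123-bp figure follows from a simple random-sequence model, sketched below. The model assumes independent bases with G and C each at half the GC fraction and A and T each at half the remainder, so it ignores the real clustering of sites that the paragraph above goes on to discuss. The function name is illustrative.

```python
def expected_site_spacing(site, gc_fraction):
    """Mean distance in bp between occurrences of `site` in a random sequence.

    Assumes independent bases: P(G) = P(C) = gc_fraction / 2 and
    P(A) = P(T) = (1 - gc_fraction) / 2.
    """
    base_probability = {
        "G": gc_fraction / 2,
        "C": gc_fraction / 2,
        "A": (1 - gc_fraction) / 2,
        "T": (1 - gc_fraction) / 2,
    }
    site_probability = 1.0
    for base in site:
        site_probability *= base_probability[base]
    return 1.0 / site_probability

# AATT (Tsp509I) at 40% GC: 1 / 0.3**4, about 123 bp.
tsp509i_spacing = expected_site_spacing("AATT", 0.40)
# The same motif at 50% GC would be expected every 256 bp.
balanced_spacing = expected_site_spacing("AATT", 0.50)
```

Because the expectation depends on the fourth power of the base frequency, modest regional shifts in GC content change the expected fragment size considerably.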
Optimized method: three-arm high-throughput LAM-PCR

To optimize LAM-PCR for identifying clones within complex mixtures, we developed several modifications (Fig. 1B). The objectives were: (1) to decrease the detection bias resulting from the choice of restriction sites used to anchor the PCR, and (2) to streamline the protocol for high throughput. To reduce bias, we split reactions into three separate digests, using Tsp509I (AATT), HaeIII (GGCC), and RsaI (GTAC), to account for random clustering of any one recognition sequence and for systematic biases derived from AT-rich and GC-rich regions of the genome. HaeIII is insensitive to CpG methylation.

To bypass the labor-intensive process of gel-purifying bands for sequence analysis, we adopted a direct shotgun cloning approach. Four modifications were introduced to facilitate this approach. First, an additional digest step was introduced to eliminate the internal control band, and thereby a major source of noninformative clones in the junction library. Second, all PCR products smaller than 100 bp (35 bp of host sequence) were eliminated to remove noninformative molecules, such as primer–dimer and host sequences too short for unambiguous identification. Third, the size-selected PCR product was directly shotgun cloned in a TOPO-TA vector. Fourth, all subsequent steps, including picking and growth of clones from the junction library, template preparation, sequencing, and identification of insertion sites, were automated.

FIG. 2. Nonuniform representation of individual insertion sites by single-arm LAM-PCR in a pool of equally abundant insertion sites. Tsp509I was used in all cases as the single restriction enzyme. (A) Electrophoretic profile from an equimolar mix of six low-copy clones. Lanes 1–6 show the products from 100 ng of genomic DNA from each of six individual low-copy cell lines with an average of one inserted provirus per line. Lane M shows the product generated from an equimolar mix (17 ng of DNA from each cell line) of these lines. LAM-PCR products were separated by polyacrylamide gel electrophoresis and stained with ethidium bromide. Lane L contains a 100-bp ladder. The arrow indicates the "internal control" band at about 200 bp. Bands α, β, γ, and δ yielded the integration sites in (B) with the same designations. One additional labeled band could not be identified due to small insert size. (B–D) Relative abundance of individual insertion sites in shotgun cloned libraries of LAM-PCR products. In each case, a BssHII digest was included after the ligation step to eliminate internal control fragments. The final product was shotgun cloned into PCR4-TOPO, and the resultant library was analyzed for insertion sites by random sequencing. Individual insertion sites are arranged along the x-axis according to increasing abundance. (B) The mix of six low-copy lines. Four identifiable insertion sites (α, β, γ, δ) were detected at the frequencies indicated (filled bars). The insertion site designated as α was not detected in this assay, but was known to exist from three-arm LAM-PCR assays. The cell line from which each insertion site was recovered is indicated above the bar. (C) High MOI line A, with a predicted pool of 40 insertion sites. (D) An equimolar mix of high MOI lines A, B, and C, with a predicted pool of 109 insertion sites. (E,F) Relation between size and abundance of LAM-PCR products in cloned libraries. Accumulated data from the mixture of high-MOI lines are shown. (E) Each insertion site is plotted by both size (base pairs of cloned insert) and abundance (number of times detected in library out of 1,152 sequencing reactions). (F) The same data are clustered in 100-bp size ranges to show a general trend.

FIG. 3. Electrophoretic banding profiles generated by three-arm LAM-PCR. Products are shown from all cell lines and mixes used in this work. (A) Standard LAM-PCR, using a Tsp509I digest. Note the constant band at 200 bp representing the internal control. (B–D) Individual arms of three-arm LAM-PCR. Note the absence of internal control bands.

Figure 3 shows the three-arm LAM-PCR products of each of the cell lines and equimolar mixes described above. Comparison of gels A and B illustrates the effect of the secondary digest, which eliminates the noninformative internal control band. As seen in gels B–D in Fig. 3, the three arms vary in the number, size, and relative intensity of bands. In particular, note that clones 2 and 4 yielded strong bands in the Tsp509I and RsaI arms (gels B and C), but no major band in the HaeIII arm (gel D). In these cases, the choice of restriction enzyme in a standard single-arm LAM-PCR assay would clearly bias any conclusion drawn from the banding pattern. Shotgun cloning and sequencing of these PCR products was highly productive, yielding 65–70% informative viral–host junction sites.
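Why a given insertion can appear in one arm and not in another can be illustrated with a small in-silico digest. The sketch below takes the host sequence flanking the LTR, finds the first recognition site for each of the three enzymes (Tsp509I, AATT; HaeIII, GGCC; RsaI, GTAC), and keeps the arms whose site falls inside a usable window. The 50–800 bp default window is taken from the Results text; measuring distance from the start of the flank to the start of the site is a simplification, and the flank sequence and function name are hypothetical.

```python
ENZYME_SITES = {"Tsp509I": "AATT", "HaeIII": "GGCC", "RsaI": "GTAC"}

def detectable_arms(host_flank, min_bp=50, max_bp=800):
    """Return {enzyme: distance} for arms whose nearest site lies in the window.

    `distance` is the offset of the first recognition site within the host
    flank, counted from the LTR-host junction (a simplifying assumption).
    """
    flank = host_flank.upper()
    arms = {}
    for enzyme, site in ENZYME_SITES.items():
        distance = flank.find(site)
        if distance != -1 and min_bp <= distance <= max_bp:
            arms[enzyme] = distance
    return arms

# Hypothetical flank: a Tsp509I site only 20 bp from the junction (too close),
# an RsaI site at 104 bp, and no HaeIII site at all.
flank = "CATG" * 5 + "AATT" + "CATG" * 20 + "GTAC" + "CATG" * 10

default_window = detectable_arms(flank)
relaxed_window = detectable_arms(flank, min_bp=10)
```

In this toy case only the RsaI arm would report the insertion under the default window, which is the kind of arm-specific detection seen for individual clones on the gels.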
Evaluation of one-arm versus three-arm approaches

To test the relative capacities of these methods to interrogate fully a mixture of clones, one-arm and three-arm LAM-PCR assays were performed on small and large pools of insertion sites and analyzed by high-throughput sequencing. The objective was to use high sampling numbers to saturate the available information from the insertion site libraries generated by each approach. First, the pool of six low-copy lines was interrogated at the level of 384 sequences per arm (Fig. 4A). Four insertion sites were detected, consistent with the number found in individual lines (see Table 2). However, the relative detection frequencies for each site varied dramatically from arm to arm, consistent with the variable band intensities observed on gels.

Similar three-arm analysis of the multiple-copy cell lines (A, B, and C), with a sampling size of 384 sequences per arm, gave similar arm-dependent results. For example, the individual arms of the line A analysis yielded from 58% to 82% of the total detected sites, and only 40% were observed in all three arms (Fig. 4B). In a similar analysis of the A–C pool, at a sampling level of 768 sequences per arm (Fig. 4C), only 32% of the sites were detectable in all three arms. Figure 4D shows the rate and extent of acquisition of novel insertion sites from the large pool (A–C) as a function of the sampling size. It is clear that multiarm LAM-PCR yields data at a higher rate than do the individual arms, regardless of the sampling size. Data acquisition begins to saturate at 6–10 times the pool size, making subsequent data retrieval increasingly expensive. Of the pool of 103 total detected insertion sites, 50%, 75%, and 90% were acquired with sampling sizes of only 110, 340, and 1,040, respectively. The quantitative representation of individual clones in this data set was not improved over the single-arm method (Fig. 4E).

To test the reproducibility of this technology, we interrogated the large (A–C) pool twice by the three-arm method. A total of 1,152 clones were sequenced from each LAM-PCR (384 from each arm), corresponding to about 10 times the pool size. Together, the two experiments detected 103 total insertion sites in the mix, and the separate LAM-PCRs yielded 90% and 95% of the total detections, with an overlap of 85%. A similar analysis was done at twice this sampling level by comparing the combined data from the A–C mix with the combined data from the component cell lines A, B, and C. This comparison detected a total of 114 insertion sites, with 90% and 92% representation by the individual data sets, and an overlap of 82% (data not shown).
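A saturation curve of the kind described for Fig. 4D can be computed by shuffling the list of site identifications and counting how many distinct sites have been seen after each clone. The sketch below shows that bookkeeping on synthetic data with deliberately unequal site abundance; the data, function names, and seed are hypothetical and do not reproduce the study's libraries.

```python
import random

def saturation_curve(identifications, seed=0):
    """Cumulative number of unique sites after each randomly ordered clone."""
    shuffled = list(identifications)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable order
    seen, curve = set(), []
    for site in shuffled:
        seen.add(site)
        curve.append(len(seen))
    return curve

def samplings_needed(curve, fraction, pool_size):
    """Smallest number of clones at which `fraction` of the pool is detected."""
    target = fraction * pool_size
    for n_clones, unique_sites in enumerate(curve, start=1):
        if unique_sites >= target:
            return n_clones
    return None

# Synthetic library: 20 sites, site i recovered (i + 1) times (skewed abundance).
identifications = [site for site in range(20) for _ in range(site + 1)]
curve = saturation_curve(identifications)
half = samplings_needed(curve, 0.5, 20)
ninety = samplings_needed(curve, 0.9, 20)
```

Running the same function on each arm's identifications separately, and on the three arms combined, gives the per-arm and multiarm curves that can then be compared.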
FIG. 4. Three arms detect partially overlapping sets of insertion sites. (A) Arm-to-arm variability of representation of individual insertion sites from the pool of six low-copy cell lines. A total of 384 clones were sequenced from each arm. (B) Venn diagram depicting the overlap in detection of 40 insertion sites in cell line A among the three arms. (C) Venn diagram depicting the overlap in detection of 103 insertion sites in the A,B,C mix among the three arms. (D) Saturation curves, showing the total number of detected insertion sites in a pool as a function of the number of samplings. The A–C mix, containing about 109 proviral insertions, was interrogated by three-arm LAM-PCR, 768 clones were sequenced from each arm, and insertion sites were identified. The 2,304 identifications were randomized, and the number of unique insertion sites detected was plotted as a function of the number of random clones analyzed. The data from each arm were analyzed separately by the same method. H, HaeIII arm; R, RsaI arm; T, Tsp509I arm. (E) Relative frequency of detection of the 103 individual insertion sites detected by three-arm LAM-PCR of the A–C mix (same data set as in D). Individual insertion sites are arranged along the x-axis according to increasing abundance.

Estimation of pool size from saturation kinetics

As the sampling level of the A–C pool increases, the number of insertion sites detected rises and the efficiency of detection of novel sites decreases. These curves show reciprocal asymptotic behaviors, with the detected fraction of the pool approaching 1.0 (100%, ~120 sites) and the efficiency of detection approaching 0 (Fig. 5A). The relationship between these two functions, shown in Fig. 5B, is described over most of its range by a linear function, Y = 1 − X, where Y is the fraction of the pool detected (saturation), and X is the fraction of samplings that yield novel insertion sites (efficiency of detection). This simple relationship facilitates estimation of the pool size (P) at relatively low sampling levels, knowing only X and the number of novel sites detected (N), by the equation: P = N/Y = N/(1 − X). For example, at a sampling level of 10 times the pool size, 1,200 samplings in this case, Y = 0.9, X = 0.1, and the pool size can be calculated as P = 1.11N. Similarly, at a sampling level of only twice the pool size, X = 0.5 and P = 2N. Such calculations are expected to be more accurate for larger pool sizes and for higher sampling levels. As a test of this strategy, the pool sizes of each of the cell lines and mixes were calculated using the X = 0.1 and X = 0.5 efficiency thresholds (Table 2). The calculated values at both thresholds closely followed the values obtained from exhaustive sampling and the values determined by QPCR. This result indicates that, under conditions of equimolar abundance of constituent insertion sites, a sampling size of just twice the pool size can yield a reasonable estimate of the pool size.
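The pool-size estimate described above, P = N/(1 − X), can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the authors' analysis code: the function name `estimate_pool_size` and the example counts are hypothetical, and the efficiency X is taken as a given input rather than computed from sequence data.

```python
def estimate_pool_size(novel_sites: int, efficiency: float) -> float:
    """Estimate the pool size P = N / (1 - X).

    novel_sites: N, the number of distinct insertion sites detected so far.
    efficiency:  X, the fraction of samplings that yield novel sites (0 <= X < 1).
    """
    if novel_sites < 0:
        raise ValueError("novel_sites must be non-negative")
    if not 0.0 <= efficiency < 1.0:
        raise ValueError("efficiency must be in the range [0, 1)")
    saturation = 1.0 - efficiency  # Y = 1 - X, the linear approximation
    return novel_sites / saturation


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the two thresholds.
    print(estimate_pool_size(108, 0.1))  # X = 0.1, so P = 1.11 N
    print(estimate_pool_size(60, 0.5))   # X = 0.5, so P = 2 N
```

Because the estimate divides by 1 − X, it becomes increasingly sensitive to error in X as X approaches 1, which is consistent with the expectation that higher sampling levels give more accurate estimates.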
FIG. 5. Relationship between pool saturation and efficiency of detection of novel insertion sites in the pool. As in Fig. 4D, the A–C mix was interrogated by three-arm LAM-PCR, and 768 clones were sequenced from each arm. The 2,304 identifications were randomized. (A) The level of saturation of the pool (percent of total insertion sites in pool detected) and efficiency of detection (percent of clones that yield novel insertion sites) are both plotted as functions of the number of random clones analyzed. The pool size was assumed to be 120 total insertion sites, based on the saturation curve in Fig. 4D. (B) The relationship between detection efficiency and pool saturation. Using the data from plot A, detection efficiency is plotted as a function of pool saturation. The curve is roughly described by the linear function, Y = 1 − X, and more accurately by the polynomial function, Y = 100%(0.0001X³ 0.017X² 1.5374X 0.99004).
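The randomization procedure behind these saturation curves (shuffle the identifications, then count unique insertion sites as clones are added) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `saturation_curve`, the site labels, and the seed are hypothetical, and the efficiency is computed cumulatively as unique sites divided by clones analyzed, which is one possible reading of "percent of clones that yield novel insertion sites".

```python
import random
from typing import Hashable, List, Sequence, Tuple


def saturation_curve(
    identifications: Sequence[Hashable], seed: int = 0
) -> List[Tuple[int, int, float]]:
    """Return (clones_analyzed, unique_sites, cumulative_efficiency) per clone."""
    order = list(identifications)
    random.Random(seed).shuffle(order)  # randomize the order of the identifications

    seen = set()
    curve = []
    for n, site in enumerate(order, start=1):
        seen.add(site)
        # Assumption: efficiency is cumulative (unique sites / clones analyzed).
        curve.append((n, len(seen), len(seen) / n))
    return curve


if __name__ == "__main__":
    # Toy data: six identifications covering three hypothetical insertion sites.
    reads = ["site_1"] * 3 + ["site_2"] * 2 + ["site_3"]
    for clones, unique, efficiency in saturation_curve(reads):
        print(clones, unique, round(efficiency, 2))
```

Dividing the unique-site count by an assumed pool size (120 in Fig. 5A) would convert the second column into percent saturation.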
DISCUSSION

We show here that conventional single-arm LAM-PCR strategies provide a very poor representation of the relative abundance of retroviral insertion sites in a complex mix. We routinely observed 30- to 60-fold differences in the quantitative representation of insertion sites that were equally abundant in the mix. The multiarm strategy described here does not significantly alter this quantitative discrepancy. We conclude from these observations that LAM-PCR cannot be used to estimate, with any confidence, either the relative contributions of donor stem cell clones in a transplant recipient, or changes in those contributions over time. Given the magnitude of the quantitative discrepancies observed here, even apparent qualitative differences in electrophoretic banding patterns or insertion site libraries are unreliable. The appearance or disappearance of a band may reflect changes in competing clones rather than the clone of interest. Thus, comparison of LAM-PCR banding patterns from different tissues, sorted cell populations or time points is probably not informative for tracking clonal contributions in complex chimeric systems, such as stem cell transplant recipients.
Quantitative or semiquantitative tracking of clonal abundance requires additional strategies that can focus on specific clones. For example, LAM-PCR analysis of hematopoietic colonies, as seen using in vitro colony-forming assays or by direct harvesting of mouse spleen colonies, can be used to track clones in the progenitor population (24). Because each colony represents a single clone, the possibility of template competition between clones is eliminated. The frequency of a given LAM-PCR banding pattern among the colonies is directly proportional to the frequency of a specific clone among the progenitor cells capable of forming colonies. Alternatively, the viral–host junction sequence of a clone can be exploited to develop a clone-specific quantitative PCR assay. This approach has been used successfully to track individual donor stem cell clones in transplant recipients (2).

A major value of LAM-PCR technology is in qualitative mining of insertion site data from a complex pool. These data can then be used to design clonal tracking assays as described above. LAM-PCR may also facilitate estimations of the number of clones in a mix (e.g., the number of stem cells contributing to hematopoiesis). However, this latter objective is better served by a multiarm approach. Our results indicate that any single-arm method, even at saturating levels of sampling, misses 30–40% of the insertion sites detected by a three-arm protocol. On the basis of the saturation behavior of data acquisition and on independent quantitative analysis of the viral copy numbers in our cell lines, the three-arm protocol is able to detect most of the insertion sites with a 10-fold excess of samplings. Furthermore, reasonable estimates of the size of an insertion site pool can be generated with only a two-fold sampling excess. Thus, a high-throughput multiarm strategy, such as we have described, may facilitate estimation of the size of a clonal mix with reasonable accuracy and cost. It is important to note that this conclusion applies to a pool of equally abundant clones, whereas most biological systems would be expected to contain clones of varied abundance. Because LAM-PCR misrepresents clones of equal abundance by up to 60-fold, the actual clonal abundance may not be important in pool size determination except in cases of very rare clones. However, it should be kept in mind that the approach described here will probably underestimate pool size if very rare clones exist.

Multiarm analysis may also prove useful for examination of the integration site preferences of retroviral and other vectors (15–17,25). To analyze integration site preferences for new vectors or transduction protocols, it will be necessary to develop large databases of such sites. This effort, although important in characterizing and minimizing the risks of insertional mutagenesis in gene therapy, will be expensive and time consuming. Multiarm high-throughput LAM-PCR offers some advantages in this respect over the single-arm approach. First, the cost and time commitment are minimized by replacement of the labor-intensive steps of electrophoresis, band isolation, and individual processing of each band with an automated shotgun cloning approach. Second, the multiarm approach accesses a larger proportion of the insertion sites in a complex mix than does the single-arm approach. As a result, it yields a higher initial rate of data acquisition (unique insertion sites per unit of sequencing; see Fig. 4D) and, therefore, a lower cost for those data. In practical terms, a very large pool of vector-containing clones, sampled at a very low saturation level (Y = 0.1), should yield unique integration sites at a very high efficiency (X = 0.9, or 90 sites per 100 sequencing reactions). Thus, a cost-effective strategy for analysis of integration site preferences would be to transduce a large number of cells, relative to the desired number of sites, subject the cells to multiarm high-throughput LAM-PCR, and limit the sequencing to 10% of the cell population size. Such a study should also take into consideration recent reports that retroviral insertions in certain genomic loci may lead to preferential expansion of specific clonal lines of transduced cells (26,27). Harvest of the cells shortly after transduction would minimize the risk of skewing the results due to a post-integration event.
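The yield figure quoted above (about 90 unique sites per 100 sequencing reactions at 10% saturation) follows directly from the linear approximation X = 1 − Y. The short Python sketch below works through that arithmetic; the function name `expected_unique_sites` is hypothetical, and the calculation assumes that the linear relationship holds at the chosen saturation level.

```python
def expected_unique_sites(reads: int, saturation: float) -> float:
    """Expected number of unique insertion sites from `reads` sequencing reactions.

    Uses the linear approximation X = 1 - Y, where Y is the pool saturation
    and X is the fraction of reactions that yield a novel site.
    """
    if reads < 0:
        raise ValueError("reads must be non-negative")
    if not 0.0 <= saturation <= 1.0:
        raise ValueError("saturation must be in the range [0, 1]")
    efficiency = 1.0 - saturation  # X = 1 - Y
    return reads * efficiency


if __name__ == "__main__":
    # At 10% saturation, roughly 90 of every 100 reactions yield a novel site.
    print(expected_unique_sites(100, 0.1))
    # Near full saturation, most reactions re-detect known sites.
    print(expected_unique_sites(100, 0.9))
```

The same relationship shows why sequencing well beyond low saturation is costly for database building: each additional reaction is progressively less likely to contribute a new site.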
ACKNOWLEDGMENTS

This work was supported by National Institutes of Health (NIH, Bethesda, MD) grant 5U54 HG00243 (Maynard V. Olson, PI), NIH grant HL077231 (Peter Kurre, PI), NIH grants R01 DK52997-05, R01 DK61844-01, and P01 HL53750-09 (C. Anthony Blau, PI), and NIH grant DK56456-02 (Beverly Torok-Storb, PI). This work was facilitated by the Clonal Analysis Core of the Core of Excellence in Molecular Hematology funded by NIH grant DK56456-02 (Beverly Torok-Storb, PI).
REFERENCES

1. Schmidt M, G Hoffmann, M Wissler, N Lemke, A Mussig, H Glimm, DA Williams, S Ragg, CU Hesemann and C von Kalle. (2001). Detection and direct genomic sequencing of multiple rare unknown flanking DNA in highly complex samples. Hum Gene Ther 12:743–749.
2. Schmidt M, P Zickler, G Hoffmann, S Haas, M Wissler, A Muessig, JF Tisdale, K Kuramoto, RG Andrews, T Wu, HP Kiem, CE Dunbar and C von Kalle. (2002). Polyclonal long-term repopulating stem cell clones in a primate model. Blood 100:2737–2743.
3. Lund AH, M Duch and FS Pedersen. (1996). Transcriptional silencing of retroviral vectors. J Biomed Sci 3:365–378.
4. Kurre P, J Morris, B Thomasson, DB Kohn and H-P Kiem. (2003). Scaffold attachment region-containing retrovirus vectors improve long-term proviral expression after transplantation of GFP-modified CD34+ baboon repopulating cells. Blood 102:3117–3119.
5. Ellis J. (2005). Silencing and variegation of gammaretrovirus and lentivirus vectors. Hum Gene Ther 16:1241–1246.
6. Nagayama J, M Iino, Y Tada, H Kusaba, A Kiue, K Ohshima, M Kuwano and M Wada. (2001). Retrovirus insertion and transcriptional activation of the multidrug-resistance gene in leukemias treated by a chemotherapeutic agent in vivo. Blood 97:759–766.
7. Bedigian HG, DA Johnson, NA Jenkins, NG Copeland and R Evans. (1984). Spontaneous and induced leukemias of myeloid origin in recombinant inbred BXH mice. J Virol 51:586–594.
8. Gilbert DJ, PE Neumann, BA Taylor, NA Jenkins and NG Copeland. (1993). Susceptibility of AKXD recombinant inbred mouse strains to lymphomas. J Virol 67:2083–2090.
9. Li Z, J Dullmann, B Schiedlmeier, M Schmidt, C von Kalle, J Meyer, M Forster, C Stocking, A Wahlers, O Frank, W Ostertag, K Kuhlcke, HG Eckert, B Fehse and C Baum. (2002). Murine leukemia induced by retroviral gene marking. Science 296:497.
10. Hacein-Bey-Abina S, C von Kalle, M Schmidt, MP McCormack, N Wulffraat, P Leboulch, A Lim, CS Osborne, R Pawliuk, E Morillon, R Sorensen, A Forster, P Fraser, JI Cohen, G de Saint Basile, I Alexander, U Wintergerst, T Frebourg, A Aurias, D Stoppa-Lyonnet, S Romana, I Radford-Weiss, F Gross, F Valensi, E Delabesse, E Macintyre, F Sigaux, J Soulier, LE Leiva, M Wissler, C Prinz, TH Rabbitts, F Le Deist, A Fischer and M Cavazzana-Calvo. (2003). LMO2-associated clonal T cell proliferation in two patients after gene therapy for SCID-X1. Science 302:415–419.
11. Taruscio D and L Manuelidis. (1991). Integration site preferences of endogenous retroviruses. Chromosoma 101:141–156.
12. Holmes-Son ML, RS Appa and SA Chow. (2001). Molecular genetics and target site specificity of retroviral integration (review). Adv Gen 43:33–69.
13. Bushman FD. (2002). Integration site selection by lentiviruses: biology and possible control (review). Curr Topics Microbiol Immunol 261:165–177.
14. Schroder AR, P Shinn, H Chen, C Berry, JR Ecker and F Bushman. (2002). HIV-1 integration in the human genome favors active genes and local hotspots. Cell 110:521–529.
15. Wu X, Y Li, B Crise and SM Burgess. (2003). Transcription start regions in the human genome are favored targets for MLV integration. Science 300:1749–1751.
16. Laufs S, B Gentner, KZ Nagy, A Jauch, A Benner, S Naundorf, K Kuehlcke, B Schiedlmeier, AD Ho, WJ Zeller and S Fruehauf. (2003). Retroviral vector integration occurs in preferred genomic targets of human bone marrow-repopulating cells. Blood 101:2191–2198.
17. Imren S, ME Fabry, KA Westerman, R Pawliuk, P Tang, PM Rosten, RL Nagel, P Leboulch, CJ Eaves and RK Humphries. (2004). High-level beta-globin expression and preferred intragenic integration after lentiviral transduction of human cord blood stem cells. J Clin Invest 114:953–962.
18. Johnson CN and LS Levy. (2005). Matrix attachment regions as targets for retroviral integration. Virol J 2:68.
19. Kuramoto K, D Follman, P Hematti, S Sellers, MO Laukkanen, R Seggewiss, ME Metzger, A Krouse, RE Donahue, C von Kalle and CE Dunbar. (2004). The impact of low-dose busulfan on clonal dynamics in nonhuman primates. Blood 104:1273–1280.
20. Kiem H-P, S Sellers, B Thomasson, JC Morris, JF Tisdale, PA Horn, P Hematti, R Adler, K Kuramoto, B Calmels, A Bonifacino, J Hu, C von Kalle, M Schmidt, B Sorrentino, A Nienhuis, CA Blau, RG Andrews, RE Donahue and CE Dunbar. (2004). Long-term clinical and molecular follow-up of large animals receiving retrovirally transduced stem and progenitor cells: no progression to clonal hematopoiesis or leukemia. Mol Ther 9:389–395.
21. Schmidt M, DA Carbonaro, C Speckmann, M Wissler, J Bohnsack, M Elder, BJ Aronow, JA Nolta, DB Kohn and C von Kalle. (2003). Clonality analysis after retroviral-mediated gene transfer to CD34(+) cells from the cord blood of ADA-deficient SCID neonates. Nature Med 9:463–468.
22. Podsakoff GM, BC Engel, DA Carbonaro, C Choi, EM Smogorzewska, G Bauer, D Selander, S Csik, K Wilson, MR Betts, RA Koup, GJ Nabel, K Bishop, S King, M Schmidt, C von Kalle, JA Church and DB Kohn. (2005). Selective survival of peripheral blood lymphocytes in children with HIV-1 following delivery of an anti-HIV gene to bone marrow CD34(+) cells. Mol Ther 12:77–86.
23. Neff T, BC Beard, LJ Peterson, P Anandakumar, J Thompson and H-P Kiem. (2005). Polyclonal chemoprotection against temozolomide in a large-animal model of drug resistance gene therapy. Blood 105:997–1002.
24. Kurre P, P Anandakumar, MA Harkey, B Thomasson and HP Kiem. (2004). Efficient marking of murine long-term repopulating stem cells targeting unseparated marrow cells at low lentiviral vector particle concentration. Mol Ther 9:914–922.
25. Miller DG, GD Trobridge, LM Petek, MA Jacobs, R Kaul and DW Russell. (2005). Large-scale analysis of adeno-associated virus vector integration sites in normal human cells. J Virol 79:11434–11442.
26. Modlich U, OS Kustikova, M Schmidt, C Rudolph, J Meyer, Z Li, K Kamino, N von Neuhoff, B Schlegelberger, K Kuehlcke, KD Bunting, S Schmidt, A Deichmann, C von Kalle, B Fehse and C Baum. (2005). Leukemias following retroviral transfer of multidrug resistance 1 (MDR1) are driven by combinatorial insertional mutagenesis. Blood 105:4235–4246.
27. Kustikova O, B Fehse, U Modlich, M Yang, J Dullmann, K Kamino, N von Neuhoff, B Schlegelberger, Z Li and C Baum. (2005). Clonal dominance of hematopoietic stem cells triggered by retroviral gene marking. Science 308:1171–1174.
Address reprint requests to:
Dr. Michael A. Harkey
Mail Stop D1-100
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue, North
P.O. Box 19024
Seattle, WA, 98109-1024
E-mail: [email protected]

Received December 11, 2006; accepted January 18, 2007.