
GenoMap: A Distributed System for Unifying Genotyping and Genetic Linkage Analysis

Todd E. Scheetz, Terry A. Braun, Thomas L. Casavant, Kyle J. Munn, Edwin M. Stone, and Val C. Sheffield

Dept. of Electrical and Computer Engineering and the Dept. of Genetics
University of Iowa
[email protected]

Abstract

This paper describes GenoMap, an integrated, parallel/distributed computing software system that aids in large-scale gene identification studies. This problem requires data from heterogeneous, distributed sources. In addition, the users of the system are researchers spanning a wide range of cultures, and represent a broad range of computing needs. A distributed, heterogeneous solution addresses most of the primary challenges, but presents its own difficulties. The key elements of GenoMap are provided by way of a web-based set of interactive Java applets and applications, and native applications, with automated load-balancing support for sharing computationally intensive tasks among a collection of distributed UNIX workstations. While computationally intensive tasks are supported on the most appropriate systems, users are able to conduct most of their work in familiar MacOS and Windows/PC environments. We also describe GenoScape, a C and X Windows native application for robust genotyping of a wide range of electrophoresis gel formats.

1 Introduction

Modern genome sequencing, mapping, and discovery research efforts are distributed processing systems. The processing involved is computationally demanding, and the entire process of gene discovery and understanding is highly parallel, heterogeneous, and distributed. To date, the dominant component of the human genome project (HGP) has been the discovery of the entire sequence of the human genome. This has been a worldwide parallel/cooperative effort in which the dimensions of parallelism have been drawn at the boundaries of organisms (e.g., human, mouse, C. elegans, etc.) [DOE95, BlR97, Ber95], and chromosomes of these organisms. However, close examination of this phase of the effort reveals that the parallelism is mostly job-level [HoJ88]. Partitioning of the sequencing effort along organism and chromosomal lines requires only very coarse-grained Inter-Process Communication (IPC) and synchronization (e.g., often via human examination of WWW sites displaying similar regions of differing organisms).

The future of the area of genome-wide study will involve two distinctly different components:

1. closer integration of distributed sources of data in single computations, and
2. instances of problems in which the demand for computation will easily exceed locally available computational resources.

This paper will address these two aspects of informatics support for the HGP and describe ways to attack the inherent, but more challenging, forms of parallelism they will require.

For the past three years at the University of Iowa, we have been developing increasingly automated systems for gathering, storing, organizing, retrieving, and processing large amounts of genetic information. This information goes beyond the type of sequence data that has dominated the early phases of the HGP. The data to be managed include: informative sets of polymorphic markers; databases of patient demographic, pedigree, and phenotypic information; and large numbers of genotypes gathered using a variety of methods. The challenge of such a system is to span the physical, cultural, and psychological boundaries of various sub-components (clinical, laboratory, computational) to bring together the components necessary to extract useful meaning from the human genome.

GenoMap™, the system described in this paper, is a heterogeneous parallel/distributed application that can execute on many hardware/software platforms simultaneously and transparently. From the user's point of view, the system consists of the functional components shown in Figure 1. The user activates (i.e., by clicking on) the part of the system needed, and the application is either run within the user's web browser, or else a server application is started remotely and interaction is provided by routing the application's input and output to and from the user's client web browser on a Mac, PC, or UNIX workstation.
Much of the system has been developed in Java [CoH97, Fla97] using sockets [Tan96] for IPC in a client-server manner. The database itself is a commercial product from Sybase™, and the server applications are run as UNIX processes that were originally developed in a variety of languages (e.g., C, FORTRAN, Pascal).


2 Principal Technical Challenges

The goal of GenoMap is to manage, from geographically dispersed sites, all the information and computation resources necessary to support disease gene discovery/identification. The process being supported is illustrated in Figure 2. The typical target user of the system is the genetics researcher with a hypothesis regarding the linkage of a particular disease (phenotype) to a locus of the genome. The process shown in Figure 2 is quite general and can be used with a wide variety of hypotheses, from general screening that locates genes at a chromosome level of granularity to fine-grained screening that locates the locus of a gene to within a few kilobases [LaW96] of resolution.

Figure 1: The user's view of GenoMap as seen from the WWW-based main interface.

The use of the system begins by creating a linkage experiment with the Linkage Experiment Editor. A linkage experiment states a hypothesis regarding a suspected linkage between a disease (phenotype) and a set of genome loci. In many cases, it is useful to manage a linkage study, which consists of a set of linkage experiments. The term linkage experiment is used throughout the rest of this paper to describe the principal object that GenoMap is managing. In Figure 2, the square boxes represent accumulated data from the entire GenoMap process. Details of the process shown in Figure 2 are left to later sections, but the primary components of such a study involve:

1. Experiment Creation/Editing (Linkage Experiment Editor; see Figure 1)
   (a) Selection of a set of polymorphic markers [GuW83] as candidate linkage sites.
   (b) Selection of a set of subjects (usually families) in which genotypes at the chosen marker sites, as well as the affected and unaffected status of each member, can be determined.
2. Data Gathering
   (a) Gathering the pedigree structure of the set of subjects being used; clinical data. (Subject Log)
   (b) Gathering the affected status of each subject; clinical data. (Subject Log)
   (c) Gathering the genotypes of the subjects for each of the selected markers; experimental/laboratory data. (Genotyping Assistant, GenoScape, Genotype Verification)
3. Statistical analysis of all gathered information related to a linkage experiment to determine the likelihood of linkage (lod score). (Linkage Analysis)

Figure 2: The GenoMap process

In most studies conducted to date, the size of the subject set and the marker set have been on the order of several hundred. In addition, the linkage hypotheses are quite often only attempting to link a disease to a single locus. Our typical experience to date has been dominated by cases in which the pedigree and phenotypes of a subset of a subject set are known, but most (and usually all) of the genotypes need to be gathered. GenoMap contains a component to support this activity. Unlike most other relevant LIMS¹, GenoMap is web-based and requires little in the way of custom software installation.

¹ Laboratory Information Management Systems

3 Approach

The primary goals of GenoMap are to allow closer integration of distributed sources of genomic data (clinical and laboratory) for input to single computations, and to make the considerable high-performance computing resources available on the Internet today accessible to genome-scale computations. In order to achieve these goals, the system must be able to interoperate among multiple hardware platforms (PC, Macintosh, workstations), different software environments (Windows, MacOS, UNIX), and different user communities (geneticists, computer scientists/engineers, clinicians, and statisticians). Most existing systems do not address the interoperability issue, nor the need for high-performance (parallel) computing systems, which was not as apparent as it is now that the HGP has entered a phase in which genome-scale problems are being attacked (e.g., functional genomics [MyM97]). The standalone programs (distributed either as binaries for a single platform, or as source code for several similar platforms) that have been the standard means for sharing software solutions in the past present a challenge to the typical user. This user is increasingly a person not familiar with the idiosyncrasies of various languages, file formats, operating systems, or windowing environments. Yet the state of the art in networked computing today is capable of supporting this type of user. The primary interface targeted by GenoMap is one of several WWW browsers (Internet Explorer, Netscape Navigator, HotJava, Mosaic, etc.). From within such a web browser environment, tools can be built that address the primary objectives stated above. The principal elements of our approach are the following:

1. A fully networked database running on a UNIX platform such as a SUN, SGI, or HP compute server. This assures uniform access to all remote access and computation sites. It also represents a robust, high-performance database server on a server/workstation class computer.
2. Java implementations of most of the client interface applications. This allows full accessibility to most features of the system from the wide variety of computers employed by the variety of users of GenoMap (Mac, PC, UNIX workstation, etc.).
3. C, Fortran, and Pascal implementations for compute-intensive applications such as multi-point linkage analysis [Ott91]. While these applications may not be interoperable on all platforms, common interfaces (written in Java) can be developed to hide the machine details of the systems on which they will run.

The design decisions above have been employed in the development of the first full internal release of GenoMap. The remainder of the paper describes this implementation in more detail.

4 GenoMap Tools

GenoMap is a suite of independent, yet inter-related tools. These tools manipulate data from a centralized networked database, allowing sharing of information among multiple distributed sites without replication and coherency problems. The main goal of GenoMap is to provide a portable, intuitive interface for managing the information associated with the gene location/discovery process.

For clarity, we provide the following definitions. A linkage experiment is a grouping of subjects, markers and trait to be analyzed through linkage analysis. A linkage study is a set of linkage experiments, all pertaining to the same trait, but with possibly different subjects and/or markers. For example, the first linkage experiment of a linkage study might screen for the position of a trait's locus across the entire genome, such that three candidate loci are identified. Then a more focused linkage experiment could be designed to probe the suspected loci more efficiently. A genotyping experiment is a set of worksheets associated with a given linkage experiment, where a worksheet is an ordered set of subjects and markers to be genotyped under a given genotyping system and technology. A worksheet has a one-to-one correspondence to a physical object within a genotyping experiment (e.g., a genotyping gel). A genotyping system's technology consists of parameters pertaining to the number of subjects and markers that can fit on a worksheet. While GenoMap has been designed to operate with a variety of genotyping technologies, to date our experience has been primarily with gel-based apparatus, many of which support multiplexing of markers on a gel in several dimensions.
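
The relationships among these definitions can be made concrete with a small data model. The following Python sketch is illustrative only: GenoMap itself stores these objects in a Sybase database and manipulates them from Java, and every class and field name below is an assumption introduced for this example, not part of the actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LinkageExperiment:
    """A grouping of subjects, markers, and one trait (hypothetical fields)."""
    name: str
    trait: str
    subjects: List[str] = field(default_factory=list)
    markers: List[str] = field(default_factory=list)


@dataclass
class LinkageStudy:
    """A set of linkage experiments that all pertain to the same trait."""
    trait: str
    experiments: List[LinkageExperiment] = field(default_factory=list)

    def add(self, experiment: LinkageExperiment) -> None:
        # The text requires every experiment in a study to share the trait.
        if experiment.trait != self.trait:
            raise ValueError("experiment trait differs from study trait")
        self.experiments.append(experiment)


@dataclass
class Worksheet:
    """An ordered set of subjects and markers; one worksheet per physical gel."""
    subjects: List[str]
    markers: List[str]


@dataclass
class GenotypingExperiment:
    """The worksheets associated with one linkage experiment."""
    linkage_experiment: LinkageExperiment
    technology: str
    worksheets: List[Worksheet] = field(default_factory=list)
```

Keeping the trait on both the study and the experiment lets the study reject an experiment that targets a different trait, which mirrors the definition of a linkage study given above.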


Name of Software Component    Language                            Description
----------------------------  ----------------------------------  -----------
Subject Log                   Custom Java Applet                  Recording and searching for subject-specific information.
Marker Log                    Custom Java Applet                  Recording and searching for marker-specific information.
Trait Log                     Custom Java Applet                  Recording and searching for trait-specific information.
Linkage Experiment Editor     Custom Java Applet                  Creating, viewing, and updating linkage experiments
Genotyping Assistant          Custom Java Applet                  Creating, viewing, and updating genotyping experiments
GenoScape                     Custom C and X Windows Application  Viewing and genotyping of genotyping gel images
GenoScape Launcher (Client)   Custom Java Applet                  Front-end interface to genotyping tool
GenoScape Launcher (Server)   Custom Java Application             Process to open files and start GenoScape
Verification (Client)         Custom Java Applet                  Interface to display all recorded genotypes together
Verification (Server)         Custom Java Application             Process to open genotype files on the GenoScape Server
Linkage Analysis (Client)     Custom Java Applet                  Interface to multiple linkage analysis packages
Linkage Analysis (Server)     Custom Java Application             Process to format data for linkage packages
Sybase Server 11.0            Commercial Application              Sybase Database server
JDK 1.1.3                     Free Java Application               Java Developer Kit, used in creating Java programs
Socket Server                 Custom Java Application             Manages communication between clients and servers, plus handles load-balancing

Table 1: GenoMap Components

4.1 Linkage Experiment Editor

The Linkage Experiment Editor assists in the specification, creation and maintenance of linkage experiments, where a linkage experiment is a grouping of subjects, markers, and phenotype necessary to perform a linkage analysis computation. The editor manages each of the three categories separately, providing tailored searches for each category to assist in selection and specification of units to be contained within the linkage experiment. Recall that a particular disease/phenotype-oriented linkage study may consist of a number of linkage experiments.

Figure 3: Linkage Experiment Editor

The update portions of the Linkage Experiment Editor allow the database to be queried to find elements that match specified criteria, such as a partially specified marker name (e.g., GATA22F). In addition to "direct searching," "indirect searching" is also supported. In "indirect searching," items of a given type (e.g., markers) with a specific property are sought. For example, one of the search options allows for marker selection based upon its inclusion in an existing linkage experiment.

An additional feature of the Linkage Experiment Editor is its ability to load the subjects, markers and phenotypes from other previously defined linkage experiments. This simplifies experiment creation when one or more of the categories is quite similar to a previous experiment. This is expected to be most useful in specifying the set of markers to be used in a linkage experiment. For example, when performing a genome-wide search for a gene it is common to use the same (or nearly the same) set of markers, known as a screening set, that may be relatively evenly spread across an organism's genome.

Figure 4: Marker Based Specification

Another important feature of the Linkage Experiment Editor is its ability to estimate the cost (and duration) of a genotyping experiment derived from the current parameters. This is done by first searching the database for all available genotypes of involved subjects at the specified markers. This feature requires the specification of an estimated cost per worksheet and the specification of a gel technology.

The information managed by the Linkage Experiment Editor is used by two other GenoMap tools. First, the Genotyping Assistant uses the specification of the linkage experiment to determine the set of subjects and markers required for the creation of a genotyping experiment, and also to aid in specifying the set of worksheets to view in the viewing mode of the Genotyping Assistant. Second, the Linkage Analysis tool uses a specified linkage experiment as an input to determine the subjects, markers and phenotype to be analyzed.
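
The cost estimate described above can be illustrated with a short Python sketch. It is not the editor's actual algorithm, which the text does not spell out. It assumes that a gel technology is summarized by two hypothetical parameters (`lanes`, the number of subjects per worksheet, and `markers_per_sheet`), that markers are grouped in the order given, and that a subject is rerun on a worksheet whenever any marker of the group is still missing for that subject.

```python
import math


def estimate_cost(subjects, markers, known, lanes, markers_per_sheet, cost_per_sheet):
    """Estimate worksheets and cost for the genotypes still missing.

    known is a set of (subject, marker) pairs already in the database.
    All parameter names are assumptions made for this illustration.
    """
    # Markers for which at least one subject still lacks a genotype.
    pending = [m for m in markers
               if any((s, m) not in known for s in subjects)]
    sheets = 0
    for i in range(0, len(pending), markers_per_sheet):
        group = pending[i:i + markers_per_sheet]
        # Subjects missing a genotype for any marker in this group.
        todo = [s for s in subjects
                if any((s, m) not in known for m in group)]
        sheets += math.ceil(len(todo) / lanes)
    return sheets, sheets * cost_per_sheet
```

For example, with three subjects, two markers of which one is fully genotyped, two lanes per worksheet, and one marker per worksheet, the sketch reports two worksheets.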


4.2 Genotyping Assistant

The Genotyping Assistant is used to assist in the instantiation and viewing of genotyping experiments. It is the central component of the LIMS role of GenoMap. The Genotyping Assistant has been broken down into two components, corresponding to the separate functions that each performs.

The first component is responsible for the creation of new genotyping experiments. Creation requires two primary inputs: a linkage experiment, and the technology parameters pertaining to the system under which the genotyping will be performed. Given these pieces of information, the user is then allowed to select whether to partition the subjects and markers into worksheets manually or automatically. The technology parameters are used for three purposes: (1) they allow the entire worksheet structure, including the exact layout, to be maintained within the database, (2) by knowing certain key parameters (such as the degree of multiplexing a given genotyping system supports) the partitioning of subjects and markers can be maximally efficient, and (3) they help minimize the amount of information that must be supplied interactively by the user.

Using the automatic subject partitioning methodology, one subject is placed per position, as long as the position is empty. A position is the location within the physical analogue of a worksheet (e.g., a lane of a genotyping electrophoresis gel). A position may be occupied prior to automatic layout due to constraints from the underlying technology (e.g., by a size standard). Similarly, the manual layout methodology also places one subject per position and notifies the user when the position is already occupied. The difference is that the user is allowed to place a subject in an already occupied position, meaning two samples may be run at the same position.
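
The automatic subject layout just described can be sketched as follows. This Python fragment is only an illustration under stated assumptions: positions are numbered from 0, the same positions are reserved on every worksheet, and the names `layout_subjects` and `reserved` are hypothetical rather than taken from GenoMap.

```python
def layout_subjects(subjects, lanes, reserved):
    """Place one subject per empty position, worksheet by worksheet.

    reserved maps a position to its occupant (e.g. a size standard);
    these positions are skipped by the automatic layout.
    """
    free = [p for p in range(lanes) if p not in reserved]
    if not free:
        raise ValueError("technology leaves no free positions")
    worksheets = []
    for start in range(0, len(subjects), len(free)):
        # Start from the technology's fixed occupants; None marks an empty lane.
        sheet = [reserved.get(p) for p in range(lanes)]
        for pos, subject in zip(free, subjects[start:start + len(free)]):
            sheet[pos] = subject
        worksheets.append(sheet)
    return worksheets
```

A manual layout would differ only in letting the user override the occupancy check after a warning.
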
The automatic creation of multiplex sets uses the specified technology parameters to determine the number of markers that can/should be run on a worksheet. It first orders the list of markers by size, and then attempts to select groups of maximal size to be run as a unit on a set of worksheets. Such a group is referred to as a marker multiplex set, because it supports two dimensions of multiplexing. The first is multiplexing of base pair size: multiple markers may be in the same multiplex set as long as their alleles do not overlap (e.g., sets A, B, C, and D). The second dimension of multiplexing is that of channels (e.g., dyes): several size-multiplexed sets of markers may be combined provided the genotyping system supports multiple channels. For example, sets A, B, C, and D may be merged into a set M in a system that supports four channels, where each set is in a different channel (e.g., tagged with a different dye).

Figure 5: GenoScape Launch Utility

The second component provides an interface to check on the progress of an existing genotyping experiment. Specifically, the status and annotation of each worksheet can be checked (and modified). In addition, links (similar to hypertext links) are provided to allow viewing of individual worksheets at a higher level of detail. This includes the exact placement of every subject and what markers were run at that location.
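
The two dimensions of multiplexing used by the first component can be illustrated in Python. The sketch below uses a simple greedy first-fit grouping, which is an assumption: the text says only that markers are ordered by size and that groups of maximal size are sought. Marker names and base pair ranges in any usage are hypothetical.

```python
def size_multiplex(markers):
    """Group markers whose allele size ranges do not overlap.

    markers is a list of (name, min_bp, max_bp) tuples. Greedy first-fit
    over markers sorted by smallest allele; an illustrative heuristic only.
    """
    sets = []
    for marker in sorted(markers, key=lambda m: m[1]):
        for group in sets:
            # Fits if the group's largest allele is below this marker's smallest.
            if group[-1][2] < marker[1]:
                group.append(marker)
                break
        else:
            sets.append([marker])
    return sets


def channel_multiplex(size_sets, channels):
    """Merge size-multiplexed sets, one per channel (e.g. one per dye)."""
    return [size_sets[i:i + channels]
            for i in range(0, len(size_sets), channels)]
```

Because the markers are processed in order of smallest allele, comparing against only the last marker of each group is enough to guarantee that no two ranges in a group overlap.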

4.3 GenoScape Launcher

The GenoScape Launcher provides convenient access to our custom genotyping system, GenoScape™. GenoScape is a software package that is capable of automated, semi-automated, and manual genotyping of electrophoresis gel images. A wide variety of physical gel formats (horizontal, vertical, radiolabeled, fluorescent, and silver-stained) can be analyzed by GenoScape. The GenoScape gel-analysis system is capable of analyzing digitized electrophoresis gel images with an arbitrary number of markers and an arbitrary number of dyes. Information regarding the number of dyes, allele sizes and names for each marker, and the base pair sizes of the standard "ladder" must be provided in a header file. A header file is associated with each gel image, which allows customization to a wide variety of genotyping needs. All of the data needed to create a header file for a gel image can be obtained through GenoMap by user interaction and from the database.

The client component of the GenoScape Launcher (see Figure 5) allows the user to identify a predefined genotyping experiment and worksheet from those stored in the database. Recall that there is a one-to-one correspondence between a worksheet and a gel image. Once a valid genotyping experiment and worksheet have been identified, the user must identify the date that the gel image was created and the lab that ran the gel associated with the worksheet. Using these parameters, the location of the gel image is determined. If a header file is already available for the specified gel image, the user can choose to launch the GenoScape tool. Otherwise, a header file must be created via further interaction.

Figure 6: Editing the Standard Ladder

Most of the information for the header file comes directly from the database using the identified study and worksheet. The technology specified for the worksheet determines the width (number of lanes) of a specific gel as well as the standard to be used for image normalization. Information about the standard, including the base pair sizes and the lanes containing standard, is retrieved from the database. The information for the markers within a given gel is also retrieved from the database. The user can choose to edit the number and location of base pair sizes within a standard (see Figure 6), but these changes only affect the header file for a particular gel and are not automatically updated in the database. The number of lanes containing standard is also editable, but limited to the range between 2 and the number of lanes in the gel, inclusive. Standard is assumed to be present in the outside lanes of a gel, with the remaining locations calculated so that the standard is evenly spaced.

Individual markers can be removed or edited. If a marker failed to express clearly in a gel, the user may remove the information about that marker from the header file to be used. The base pair sizes and marker names are also editable within the marker editor of the GenoMap GenoScape utility.
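
The placement rule for lanes containing standard (outside lanes always, the rest evenly spaced) can be sketched in Python. The rounding rule and the 0-based lane numbering are assumptions made for this illustration; the text does not say how fractional positions are resolved.

```python
def standard_lanes(lanes, n_standard):
    """Return 0-based lane indices that hold the size standard.

    The first and last lanes always contain standard; the remaining
    standard lanes are spread evenly, rounding half up (an assumption).
    """
    if not 2 <= n_standard <= lanes:
        raise ValueError("n_standard must be between 2 and lanes inclusive")
    gaps = n_standard - 1
    # Integer form of round_half_up(i * (lanes - 1) / gaps).
    return [(2 * i * (lanes - 1) + gaps) // (2 * gaps)
            for i in range(n_standard)]
```

For a 10-lane gel with 4 standard lanes this yields lanes 0, 3, 6, and 9.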

After verification (and possible modification) of the information for the header file, a client-server connection is established. (The client is the Java GenoScape™ launching utility within GenoMap.) The server uses the path to the gel image to determine additional parameters regarding the gel image. These include the height, width, and number of dyes in the gel image. Once these parameters have been determined, the data from the GenoScape utility is passed from the client to the server. The server uses the data to create a valid header file for the specified gel image. After the header file is successfully written, the GenoScape tool can be launched to view and genotype the specified gel image.

4.4 Verification

Typically, genotyping is a manual activity of lab technicians who score gels by writing directly on the glass of a gel. The glass plates are usually imaged, and paper copies of the images are archived. If the data is to be stored electronically, a separate data entry step transfers the data to a computer. Aside from being laborious, the repetition of hand scoring and data entry is a major source of errors. A check for Mendelian inconsistencies will detect some of the errors, but it is not unusual for errors to propagate to the analysis phase. Typically, the error rate is reduced by redundant genotyping. This is accomplished by having at least two technicians "check" the calls made by the other technician. We refer to the process of comparing multiple calls of genotypes as "verification."

With reduction of the error rate as motivation, we have implemented a Verification tool to compare genotypes generated by the automatic genotyper application GenoScape. The scale of genotyping that is now possible with microsatellite markers creates the need for automation. The large number of genotypes that GenoScape rapidly assembles makes it essential that the verification phase have a similar capacity for automation in verifying genotypes.

The automation provided by GenoScape provides some interesting options for the verification phase. With the first scoring pass, the pertinent files are created as needed by GenoScape. Thus, the second technician can use GenoScape to review the calls already made, and make changes as appropriate. An alternative mode is for the second technician to duplicate the genotyping output files and replicate the entire scoring process. A third variation is to have multiple people perform independent scoring sessions. The Verification tool can accommodate these variations by allowing the user to specify the source(s) of genotyping information. Figure 7 shows the WWW interface to the Verification tool.
The user specifies the experiment, worksheet, laboratory name, and date (defining the archive location) in a fashion consistent with the other tools. Verification is also implemented as a client/server application. The client first checks that the experiment and worksheet exist in the database, and then checks for the specified genotype file(s). The server reads the genotype file(s) and sends the information to the client to be displayed to the user. The client process then queries the database for any genotypes that may already exist for the worksheet being verified. The client displays all genotypes from the selected genotype files together with any genotypes for the worksheet that are already in the database.

Figure 7: Verification Interface

Figure 8 shows the genotypes assembled by GenoScape from two independent genotyping sessions, in addition to the genotypes already stored in the database. The Verification tool compares each genotype (by base pair size) for each subject across all of the genotype sources (single or multiple genotyping sessions and the database) specified by the user. Any discrepancies are highlighted by the tool and must be resolved before the user can store or overwrite genotypes in the database. For example, Figure 8 shows an inconsistency in the genotypes from lane 5, between the first set of genotypes, the second set of genotypes, and the genotypes stored in the database. The first set of genotypes has assigned base pair sizes of 187 and 181, versus the calls of 183 and 181. This difference must be resolved, with all sets of genotypes in agreement, before Genotype Verification can update the database. Both the Genotype Verification tool and GenoScape can be used concurrently to examine the original gel image(s) and the calls assigned during a specified session in the context of a genotyping discrepancy between genotype files and/or the database.
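
The comparison step of the Verification tool can be sketched in Python as follows. This is an illustration, not the tool's code: the source names and the dictionary layout are hypothetical, and the sketch assumes a genotype is an unordered pair of base pair sizes, so the two alleles are sorted before comparison.

```python
def find_discrepancies(sources):
    """Return the lanes on which the genotype sources disagree.

    sources maps a source name (a scoring session or the database) to a
    dict of lane -> (allele1_bp, allele2_bp). Names are illustrative.
    """
    lanes = sorted(set().union(*(calls.keys() for calls in sources.values())))
    conflicts = {}
    for lane in lanes:
        # Sort each pair so that (187, 181) and (181, 187) compare equal.
        calls = {name: tuple(sorted(c[lane]))
                 for name, c in sources.items() if lane in c}
        if len(set(calls.values())) > 1:
            conflicts[lane] = calls
    return conflicts
```

Applied to the lane 5 example above (187/181 from one source against 183/181 from the others), the function reports lane 5 and lists the call made by each source, which is the information a user needs to resolve the discrepancy before the database is updated.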

Figure 8: Verification of Genotypes

4.5 Linkage Analysis

The purpose of the Linkage Analysis tool is to provide access to linkage analysis software packages through GenoMap. The three issues of interest to the Linkage Analysis tool are:

1. automatic data file generation and formatting,
2. complexity of the mathematics underlying linkage analysis, and
3. potential need to perform load balancing of linkage analysis processes across several workstations.

The first issue requires that access to linkage analysis software packages through GenoMap provide the user with a graphical interface to several tools. Currently, several linkage analysis packages are available, each with its own format and interface peculiarities. Such diversity complicates the analysis of data, due to the overhead of (1) the data manipulations necessary to generate appropriately formatted files, and (2) interacting with specific interfaces. Secondly, the complexity of the mathematics underlying linkage analysis makes it impractical to expect fully automated linkage analysis for all available tools, but given the amount of genotypic data being generated, it is clear that the analysis represents a future bottleneck in the mapping of genes [CoI93]. Therefore, the Linkage Analysis tool has been implemented as a network-distributed application. Such an implementation makes it practical for the Socket Server to perform load-balancing.

Figure 9: Linkage Analysis Interface

We loosely define a Linkage Experiment as containing a set of individuals who are or will be genotyped at each marker of a given set of markers, in addition to the relevant clinical data necessary for analysis, including pedigree information and phenotypes. All necessary data is stored in a centralized database. The centralized database simplifies tasks that otherwise contribute to the growing analysis bottleneck. For example, with the marker information readily available, allele frequencies can be taken from known values, or can just as easily be calculated from the set of subjects in the linkage experiment, or from a subset such as unaffected founders. As defined here, selecting a linkage experiment with the Linkage Analysis interface (Figure 9) completely defines the set of subjects and markers (genotypes).

Mlink, a 2-point analysis tool of the FASTLINK [CoI93] package, has been implemented for the GenoMap interface. The tool obtains the set of genotypes (as specified by the Linkage Experiment) as actual marker sizes, along with the pedigree relationships and phenotypes, from the database. Storage of standardized absolute allele sizes enables the incorporation of genotypic data from almost any source.

Figure 10: Mlink Parameters Interface

The Linkage Analysis tool is implemented in a client-server format. The client obtains the information (genotypes, pedigree and affected status) from the database, and performs some pre-processing on it. For example, the client normalizes the genotypes (mlink cannot use actual marker sizes for analysis) and passes the results to the server application, which performs the analysis. Once it has received the necessary information, the server writes a pedigree file (in a format consistent with makeped [Roc97]), and generates a pedigree.dat file using the makeped application. The datain.dat file is also created, with allele frequencies taken from the unaffected founders. These files can be created on the server's local machine because the server is an application, and therefore not restricted by the applet security model [CoH97]. Figure 10 shows the interface to the parameters, which allows the user to control how the server will perform the analysis. The application named unknown [Roc97] performs a check for Mendelian inconsistencies, and finally the server application executes mlink. The resulting lod score table (Figure 11) is accessible through a web browser as an HTML page.

Figure 11: Lod Score Table

The current implementation has been used on a 190-person study with 6 markers, and is being used in an extension of this study to include 289 markers.
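
Two of the pre-processing steps described in this section, normalizing allele sizes and estimating allele frequencies from unaffected founders, can be sketched in Python. This is an illustration under stated assumptions: the GenoMap client is written in Java, the function names and data layout are hypothetical, alleles are renumbered consecutively by increasing size, and a founder is taken to be a subject with no recorded parents.

```python
from collections import Counter


def normalize(genotypes):
    """Recode base pair sizes at one marker as allele numbers 1..n.

    genotypes maps subject -> (size1_bp, size2_bp).
    Returns the recoded genotypes and the size-to-number mapping.
    """
    sizes = sorted({size for pair in genotypes.values() for size in pair})
    index = {size: i + 1 for i, size in enumerate(sizes)}
    recoded = {subj: (index[a], index[b]) for subj, (a, b) in genotypes.items()}
    return recoded, index


def founder_frequencies(recoded, pedigree, affected):
    """Estimate allele frequencies from unaffected founders only.

    pedigree maps subject -> (father, mother); founders have (None, None).
    affected maps subject -> bool.
    """
    counts = Counter()
    for subj, pair in recoded.items():
        if pedigree[subj] == (None, None) and not affected[subj]:
            counts.update(pair)
    total = sum(counts.values())
    return {allele: n / total for allele, n in sorted(counts.items())}
```

Keeping the size-to-number mapping alongside the recoded genotypes makes it possible to translate results back to the absolute allele sizes stored in the database.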

4.6 Subject Log The Subject Log provides an interface for entering subjects into the database, as well as for searching, based upon a partially speci ed subject, for all information pertaining to that subject. This information has been partitioned among four dierent interfaces, where the primary interface is used for subject identi cation, and the other sub-interfaces can be reached from the primary interface, providing access to information on a subject's traits, samples and personal information. The primary interface of the Subject Log (see Figure 12) allows the speci cation of the identity and pedigree of a subject. This includes elds for entering a study number (by which a subject is uniquely identi ed), as well as the study numbers of both parents. The rst sub-interface allows entering and querying of a subject's phenotypes and clinical observations. These entries allow a speci c pre-existing trait and severity to be associated with a subject. In the case where the trait does not exist in the database (and has therefore not been speci ed), the Trait Log can be used to add the trait into the database. The second sub-interface provides access to information pertaining to laboratory samples supplied by the subject. This interface keeps track of where each sample is, as well as what type of sample it is (e.g., blood, DNA, etc) and what amount is in each location. 18

Figure 12: Subject Log Primary Interface

Finally, the third sub-interface provides access to potentially sensitive information, such as a subject's name, address and phone number. Apart from this interface, all information available is completely anonymous. While the anonymous information may bear a resemblance to other, known subjects, only the information contained within this interface provides an absolute match to a specific person.

4.7 Marker Log

The Marker Log allows marker information to be interactively entered into and queried from the database. While this tool has only a single interface, it still deals with a large amount of information. The database structure used for storing markers allows any number of alleles to be stored for a given marker, and this tool makes use of that by maintaining a list of known allele sizes for the specified marker. In addition, other marker characteristics such as chromosome, locus, allele frequency and heterozygosity are also maintained within the database.

The information managed via the Marker Log is used by almost every other GenoMap component. The Linkage Experiment Editor (which guides the creation of linkage experiments) uses the information to aid in searching for markers with specific characteristics. The Genotyping Assistant uses marker information in the automatic creation of multiplex sets, in addition to keeping track of the names.

4.8 Trait Log

The Trait Log provides an intuitive interface for the creation of traits to be used within the database. For example, the Linkage Experiment Editor requires that a phenotype be specified. This means that the subjects involved in the linkage experiment will have their affected status for the specified trait checked against their genotypes for linkage. To accommodate the broadest possible range of traits, GenoMap allows the specification of traits with a continuous range of affectedness. In addition, each trait has an annotation describing the clinical qualities of the trait. In this way, different possible varieties of a phenotype can be distinguished. For example, one clinical criterion for such a subdivision of a phenotype could be "age of onset."

4.9 Socket Server

The Socket Server manages the connections between client applets and server applications. The most important benefit of using the Socket Server is that it allows completely dynamic specification of the client-server structure. Without the Socket Server, every server application (e.g., the Verification Server) would have to be running on a "well known" set of hosts, at specific ports. With the Socket Server, however, the clients (and servers) only need to know the host and port of the Socket Server. As instances of servers are started on various hosts, they register themselves with the Socket Server. The Socket Server therefore has a table specifying the servers that are currently running and the machines those servers are running on. Given this information, the Socket Server distributes the work among the available servers.
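The registration table kept by the Socket Server can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and round-robin assignment is an assumed policy, since the text says only that work is distributed among the available servers.

```python
from collections import defaultdict


class ServerRegistry:
    """In-memory table of running servers, keyed by service name."""

    def __init__(self):
        self._servers = defaultdict(list)  # service -> [(host, port), ...]
        self._next = defaultdict(int)      # service -> round-robin counter

    def register(self, service, host, port):
        """Called by a server instance when it starts on some host."""
        entry = (host, port)
        if entry not in self._servers[service]:
            self._servers[service].append(entry)

    def unregister(self, service, host, port):
        """Called when a server instance shuts down."""
        entry = (host, port)
        if entry in self._servers[service]:
            self._servers[service].remove(entry)

    def assign(self, service):
        """Return the (host, port) a client should contact next."""
        servers = self._servers.get(service)
        if not servers:
            raise LookupError(f"no server registered for {service!r}")
        index = self._next[service] % len(servers)
        self._next[service] = index + 1
        return servers[index]
```

A client would contact only the registry's own host and port, ask for a service by name, and then connect to whichever server instance is returned. Successive requests for the same service are spread across the registered instances.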

5 GenoScape

GenoScape (see Figure 13) is an X Windows (Motif 1.2) based software package written in C [KeR98] for the semi-automated genotyping of electrophoresis gel images. A UNIX-based computer system is required to run GenoScape, but any computer with an X Windows server can display the GenoScape graphical interface. The UNIX platform was chosen as the base operating system due to the demanding computational and memory requirements of GenoScape's numerous image processing filters. In particular, at present we do not feel that Java would be able to deliver adequate performance for such applications.

Through the use of a user-defined header file, GenoScape can be customized to suit a wide variety of applications. Any digitized gel image with a known standard can be automatically genotyped for an arbitrary number of markers in an arbitrary number of dyes, provided that the image is in a known file format. By standard, we mean synthetic alleles of known sizes superimposed on subsets of the lanes of a gel to be genotyped. For example, in Figure 13, the "red" alleles are a standard "ladder" of approximately 50 base pair resolution poured in every 15 lanes.

Figure 13: Original GenoScape View

Currently, GenoScape handles both a proprietary raw data format and the generic tagged image file format (TIFF) [Ald92]. Filters are available for converting most common image file formats into TIFF. When images are stored in the raw format, multiple acquisition channels can be contained within one image file. If the TIFF format is used, however, a separate file is created to represent each channel.

One of the most important steps in the processing of channel-multiplexed electrophoresis gels is the clean separation of a set of sampled data into the original dye channels. GenoScape accepts the coefficients of an inverted, normalized dye separation matrix, and can automatically apply the separation matrix to the image data. GenoScape can also interact with the user to customize a dye separation matrix based on the current gel image being analyzed.

An electrophoresis gel image may have a number of anomalies, arising from a wide variety of sources, that can inhibit automatic genotyping. A number of automatic image smoothing and noise reduction filters are therefore integrated into GenoScape. Median, low-pass, and threshold filters are currently available. In addition, a number of manual image editing capabilities are supported to remove additional anomalies that cannot be automatically removed. For example, if a standard allele is too small to be automatically recognized, a standard allele with "ideal" dimensions can be drawn in that location. Correspondingly, diffusion from one area of a gel to another can be removed by drawing the "background color" over such anomalies.

After removing as many anomalies as possible, the image is "normalized" or "straightened" based on the locations of the standard ladder within the gel image. Location of standard information can be done in a number of different ways. The most powerful mechanism utilizes a separated dye for a "standard" ladder with known lanes and base pair sizes. If such a dye is cleanly separable from channels containing markers, a high degree of fully automated genotyping will generally be possible. Even in cases of single-dye gels, or gels containing unseparable dyes, automatic genotyping is often possible through manual identification of known standard locations in the gel image. This can be supported in a number of ways, including the insertion of standard ladder components at regular intervals across the gel or simply using allele size information from known individuals.

Gel normalization (straightening) uses the known "coordinates" (most often the standard alleles) located in the previous step to relocate each pixel in the original gel. The new "normalized" or "straightened" image created by this process horizontally and vertically aligns all of the information from the original image based on interpolation of the known coordinates within the original image.

After normalization has been accomplished, the task of automated genotyping can be accomplished quite simply (see Figure 14). This allows sophisticated heuristics for identifying background bands, homozygotes, missing alleles, etc., to be implemented quite simply and efficiently. GenoScape uses the marker specifications (provided in the header file) to automatically call the genotypes in all lanes and dye channels of the normalized gel. A number of tuning parameters are available to the user to ensure that GenoScape calls as many genotypes as possible automatically and, more importantly, accurately. A human observer skilled in genotyping is generally used to verify the calls made by GenoScape.

After the automated genotyping portion of GenoScape has attempted to call most genotypes, the user has a very powerful and intuitive "point-and-click" interface to edit the genotypic information. This is normally only required to call genotypes that were expressed very poorly in the gel relative to the alleles appearing elsewhere within a specific marker. Cases in which the user needs to "correct" an incorrectly called genotype should be rare if all genotyping parameters are correctly adjusted.
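The idea of sizing a band by interpolating between known standard positions can be sketched in one dimension. This is a deliberate simplification and not GenoScape's algorithm: GenoScape is written in C and relocates every pixel of a two-dimensional image, whereas the Python sketch below only converts a band's row position within one lane into a base pair size by piecewise-linear interpolation, then matches it against a marker's known allele sizes. The function names and the tolerance parameter are hypothetical.

```python
from bisect import bisect_right


def size_from_position(row, ladder):
    """Interpolate a base pair size for a band at the given row.

    ladder is a list of (row, base_pairs) points for the standard,
    sorted by row. Returns None for positions outside the ladder.
    """
    if len(ladder) < 2:
        return None
    rows = [r for r, _ in ladder]
    if row < rows[0] or row > rows[-1]:
        return None
    i = min(bisect_right(rows, row), len(rows) - 1)
    (r0, s0), (r1, s1) = ladder[i - 1], ladder[i]
    return s0 + (s1 - s0) * (row - r0) / (r1 - r0)


def call_allele(size, known_sizes, tolerance=1.0):
    """Return the nearest known allele size, or None if none is close enough."""
    if size is None or not known_sizes:
        return None
    best = min(known_sizes, key=lambda known: abs(known - size))
    return best if abs(best - size) <= tolerance else None
```

Returning None when no known allele lies within the tolerance reflects the design preference for leaving a genotype uncalled over reporting a doubtful one. The tolerance plays the role of one of the user-adjustable tuning parameters.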

Figure 14: A Genotyped Gel Image

After genotypes have been created and verified, the genotypes are output in a text format. The output file naming convention associates the genotypes with the genotyping technician and allows the status (complete/verified or incomplete) to be easily discernible. The genotypes can then be input to nearly any commercial or public-domain database. The GenoMap Verification tool is then used to verify the correctness of all called genotypes and to enter them into the database. Additionally, GenoScape stores all parameters that were edited and/or created during a session. This allows a given user to resume a session and other users to review an identical session.

Preliminary results using a collection of di-, tri- and tetranucleotide repeat markers indicate very high quality genotyping results. Using the tunable parameters in GenoScape, 60-lane, 4-dye gels have been called with up to 97% correct calls, with the remaining 3% of calls indicated as "not callable"; i.e., GenoScape can be tuned so as to not report a genotype rather than report an inaccurate result. If gel image quality is low, or the parameters describing the marker set are errant, then GenoScape simply indicates that the lanes that are not analyzable to a very high degree of confidence are "not callable." The degree of tunability of GenoScape, however, permits the software to automatically call a very high number of genotypes accurately in many circumstances.

GenoScape has also been used extensively with 65-lane, single-dye gels (proprietary silver-stained format) captured using a CCD camera. User estimates place the accuracy of the automatic genotyping around 90% for these gels, depending on the degree of image distortion within the gel. The person-hours required to produce, verify, and record accurate genotypes have been estimated to be reduced by as much as 25%, depending on the experience of the GenoScape technician.

6 Implementation Issues

During the implementation of GenoMap, several key issues had to be resolved. The most important of these were: (1) the stability and availability of a rapidly developing language such as Java, (2) how to best access a networked database, (3) how to maintain security in an Internet-based application such as this, and (4) database selection/specification.

6.1 Java Stability

One concern throughout the development of GenoMap was the stability of the Java programming language. Early in the project, the decision of whether to immediately start developing with the (then) newly released 1.1 API had to be addressed. Because Java is relatively new as a programming language, and because of its prevalence in the mainstream, Java is continuously undergoing development. In addition, while the Java programming language supports numerous classes, even those are subject to frequent refinement.

Another concern over Java, linked to the stability issue, is the support for current/developing APIs within browsers. For example, the Java 1.1 JDK/JRE has been available for several months now, yet the only mainstream browser that fully supports the 1.1 API is HotJava(TM), Sun's browser written entirely in Java with the 1.1 API. While both Netscape and Microsoft have partial support for the 1.1 API presently available (as of September 1997), and should soon have full support, neither currently has a product that supports the entire 1.1 API (even though the 1.1 API specification has been available since Dec., 1996 [Sun96]). In addition, Microsoft's Internet Explorer will not have a UNIX version until sometime early in 1998 [Mic97].

6.2 Database Access - JDBC

Along with the difficulties associated with migrating existing code to the Java 1.1 API, the interface between Java and the database had to be worked out. Fortunately, Java has the JDBC (Java Database Connectivity) API, which specifies the programming interface between Java and a database. Thus, while any JDBC implementation can have vendor-specific classes, it must also present the same "look-and-feel" as the standard.

One important factor in our product selection was the ability to run "thin" clients, i.e., clients that do not have the JDBC software installed locally. Another was the ability to run 100% JDBC applications, where the Java application/applet connects directly to the database without passing through any intervening processing (such as a JDBC/ODBC gateway). In addition, we want to provide a high-quality database interface with fast access and a consistent view of the data.

The issue of access speed has been addressed through the use of database indices and through the use of appropriate JDBC classes and methods, rather than through a generic "do-anything" database interface. Consistency of the information retrieved from the database can be addressed in multiple ways as well. First, transactions can be used to group a set of queries into a single statement (also improving access speed), forcing a consistent view among the queries in the transaction. Another way to address the consistency issue is to modify the isolation level (this is the terminology that Sybase uses in its discussion of database consistency). At higher isolation levels, the consistency of the view provided by the database is required to be more strict. For example, at an isolation level of 3, strict sequentiality of data among all transactions is required, implying that more locks must be set, and therefore potentially degrading the performance of database accesses in favor of a higher degree of consistency among concurrent processes.

The Java-Sybase interface we have chosen still leaves open the matter of specifying the database structure. Several issues need to be considered when designing the structure of the database, including flexibility of specification, extensibility, and data association. Flexibility is the ability to support 1:N, N:1 and N:M type relations when necessary. Extensibility means making design choices that will allow for future extensions to the database structure without impacting other database tables. Data association means keeping related pieces of information together, for example, keeping a subject's name closely associated with any phenotypes and/or observations taken for the subject. In addition, duplication of data within the database should be avoided, meaning that items of the same type within the database are not allowed to have the same name. This avoids the issue of ambiguity, where specifying the name of an item is insufficient for a query.
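Three of the design points above (N:M relations, unique names within a type, and grouping statements into a transaction) can be illustrated with a small schema. The sketch uses Python's built-in sqlite3 module as a stand-in for the JDBC/Sybase combination described in the text. The table names, column names and helper functions are hypothetical and do not reproduce the GenoMap schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE subject (study_number TEXT PRIMARY KEY);
CREATE TABLE trait (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE          -- no two traits may share a name
);
CREATE TABLE subject_trait (           -- N:M relation between subjects and traits
    study_number TEXT NOT NULL REFERENCES subject(study_number),
    trait_id     INTEGER NOT NULL REFERENCES trait(id),
    severity     REAL,
    PRIMARY KEY (study_number, trait_id)
);
"""


def open_database():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_trait(conn, name):
    """Add a trait; a duplicate name raises sqlite3.IntegrityError."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute("INSERT INTO trait (name) VALUES (?)", (name,))


def record_observation(conn, study_number, trait_name, severity):
    """Associate an existing trait with a subject in a single transaction."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO subject (study_number) VALUES (?)",
            (study_number,),
        )
        row = conn.execute(
            "SELECT id FROM trait WHERE name = ?", (trait_name,)
        ).fetchone()
        if row is None:
            # Raising here rolls back the subject insert as well.
            raise LookupError(f"unknown trait {trait_name!r}")
        conn.execute(
            "INSERT INTO subject_trait (study_number, trait_id, severity) "
            "VALUES (?, ?, ?)",
            (study_number, row[0], severity),
        )
```

Because the statements in record_observation share one transaction, a failure partway through leaves no half-entered subject behind, which is the consistency property the text attributes to transactions.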

6.3 Security

From the earliest beginnings of GenoMap development, security has been one of the primary design goals. This is due to the types of information to be accessed by the GenoMap suite. Not only would demographic information (such as name and address) be available, but also clinical data, such as disease affection status. Therefore, multiple layers of security have been used to reduce the risk of unauthorized access to sensitive data.

The first level of security is the partitioning of the GenoMap website into separate areas that are highly similar, but with significantly different uses. These areas are the production, demonstration, testing, and unprivileged demonstration areas. Each serves a purpose, while not affecting the other areas (and often not other tools within the same area). For example, while a new version of the Genotyping Assistant is being developed, it would only be placed into the testing area to verify correct functioning when loaded via the Internet.

The second level of security is managed by the web server itself, through the use of password protection and SSL/HTTP. The use of passwords to allow entry into the "activated" GenoMap area helps ensure that only authorized users access the GenoMap components. In addition, the use of SSL (Secure Sockets Layer) [FrK96] on top of HTTP (Hypertext Transfer Protocol) provides automatic encryption of the data sent over the Internet.

A final layer of security is the login/password combination that must be sent to acquire a valid connection to the database server. This allows different users to be granted specific access rights to portions of the database, thereby avoiding the problem of granting every user complete power over the database. This layer of security will assist in protecting against accidental, as well as malevolent, destruction and/or modification of data.

It should be noted that this configuration is a compromise. While it is possible to enforce stronger security measures (e.g., hardware keys), such measures significantly degrade the accessibility and portability of the GenoMap system. The current security structure is presently judged to be sufficient for the current and future environment in which GenoMap is used. Future experience is required to draw a more conclusive judgment.

6.4 Database Selection

The issue of distribution and use of GenoMap at other sites also poses a problem. Specifically, we are concerned with the problem of specifying an alternate database (including specifying a different database server location and/or a different database within the server). This can be solved through the use of configurable JAR files [Fla97] containing site-specific information for GenoMap configuration. This may include information about the Socket Server configuration or local file structures (e.g., allowing different naming conventions), in addition to the aforementioned database configuration.
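One way to picture site-specific configuration is as a set of defaults that a site file overrides. The Python sketch below is illustrative only: it uses an INI-style text rather than a JAR file, and every section name, key and default value in it is a hypothetical placeholder rather than an actual GenoMap setting.

```python
import configparser

# Hypothetical defaults; a real site would supply its own values.
DEFAULT_CONFIG = {
    "database": {"host": "localhost", "port": "4100", "name": "genomap"},
    "socket_server": {"host": "localhost", "port": "9000"},
    "files": {"genotype_dir": "genotypes"},
}


def load_site_config(site_text):
    """Overlay a site's settings on the defaults and return plain dicts."""
    parser = configparser.ConfigParser()
    parser.read_dict(DEFAULT_CONFIG)   # start from the defaults
    parser.read_string(site_text)      # site values override them
    return {section: dict(parser[section]) for section in parser.sections()}
```

A site that only needs a different database server would supply just a short `[database]` section, and all other settings would fall back to the defaults.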

7 Conclusion

This paper has described GenoMap, an integrated, parallel/distributed computing system that aids in large-scale gene identification studies. This problem requires data from heterogeneous, distributed sources. In addition, the individuals conducting such research span a wide range of cultures, and represent a broad range of computing needs. A distributed, heterogeneous solution addresses most of the primary challenges, but presents its own difficulties. The key elements of GenoMap are a web-based set of interactive Java applets and applications, and native applications, with automated load-balancing support for sharing computationally intensive tasks among a collection of distributed UNIX workstations. While computationally-intensive tasks are supported on the most appropriate systems, users are able to conduct most of their work in familiar MacOS and Windows/PC environments.

References

[Ald92] Aldus Corporation, "TIFF Revision 6.0," June 3, 1992. Available via the WWW at http://sgi.com/graphics/ti/TIFF.ps (September 1997).

[Ber95] M. Berks, "The C. elegans genome sequencing project," Genome Research, Volume 5, 1995, pp. 99-104.

[BlR97] J. A. Blake, J. E. Richardson, M. T. Davisson, J. T. Eppig and the Mouse Genome Informatics Group, "The Mouse Genome Database (MGD). A comprehensive public resource of genetic, phenotypic and genomic data," Nucleic Acids Res, Volume 25, Number 1, 1997, pp. 85-91.

[CoH97] G. Cornell and C. S. Horstmann, Core Java, Prentice Hall, Upper Saddle River, New Jersey, 1997.

[CoI93] R. W. Cottingham and R. M. Idury, "Faster Sequential Genetic Linkage Computations," American Journal of Human Genetics, 53:252-263, 1993.

[DOE95] Department of Energy, "Five Years of Progress in the Human Genome Project," Human Genome News, Volume 7, Numbers 3-4, September-December 1995. Available via the WWW from www.ornl.gov in TechResources/Human Genome/publicat/hgn/v7n3/04progre.html (September 1997).

[Fla97] D. Flanagan, Java in a Nutshell, Second Edition, O'Reilly & Associates Inc., Sebastopol, CA, 1997.

[FrK96] A. O. Freier, P. Karlton, P. C. Kocher, "The SSL Protocol, Version 3.0," IETF Internet Draft, March 1996. Available via the WWW from http://www.netscape.com/eng/ssl3/ (September 1997).

[GuW83] J. F. Gusella and N. S. Wexler, "A polymorphic DNA marker genetically linked to Huntington's disease," Nature, Volume 306, November 17, 1983.

[HoJ88] R. W. Hockney and C. R. Jesshope, Parallel Computers 2: Architecture, Programming, and Algorithms, IOP Publishing, 1988.

[KeR98] B. Kernighan and D. Ritchie, The C Programming Language, 2nd Edition, ISBN 0131103628, Prentice Hall, 1989.

[LaW96] J. Lalouel, R. White, "Analysis of Genetic Linkage," Emery & Rimoin's Principles and Practice of Medical Genetics, pp. 111-125, 1996.

[Mic97] Available via the WWW at http://www.microsoft.com/ie/press/xplatform.htm (September 1997).

[MyM97] R. L. Mynatt, R. J. Miltenberger, M. L. Klebig, L. L. Keifer, J-H Kim, M. B. Zemel, J. E. Wilkinson, W. O. Wilkison, and R. P. Woychik, "Analysis of the function of the agouti gene in obesity and diabetes," Proceedings: International Business Communications 2nd Annual International Symposium: Obesity, Advances in Understanding and Treatment, In press.

[Ott91] J. Ott, Analysis of Human Genetic Linkage, Johns Hopkins University Press, Baltimore, 1991, pp. 108-141.

[Roc97] Available via the WWW at http://linkage.rockefeller.edu/soft/list.html (September 1997).

[Sun96] Available via the WWW at http://java.sun.com/pr/1996/dec/pr961203-01.html (September 1997).

[Tan96] A. Tanenbaum, Computer Networks, 3rd Edition, Prentice-Hall, 1996.
