An illustration of a Parallel/Distributed Architecture for Hierarchically Heterogeneous Web-Based Cooperative Applications

Thomas L. Casavant, Todd E. Scheetz, Terry A. Braun, Kyle J. Munn, and Sureshkumar Kaliannan

Parallel Processing Laboratory, Dept. of Electrical and Computer Engineering, and the Dept. of Genetics
University of Iowa, Iowa City, IA 52242, U.S.A.

Abstract

A new class of applications is described that requires cooperation among diverse users in multiple data and problem-instance domains. The hierarchy of parallelism includes heterogeneity within a single instance of the problem, homogeneity of subsets of users within a problem domain, and multiple problem domains that share computational resources, both software and hardware. The core of the architecture is a socket server that registers clients and servers (both statically and dynamically) and assures isolation of users in separate problem domains. The users all see the system as a set of functions accessible via the WWW. The particular problem of genetic linkage analysis is used as a case study to illustrate and implement the architecture. GenoMap, the first implementation of this system, has been used by several groups of cooperating users at multiple institutions in a study to isolate the genomic locus of the controlling gene(s) in several diseases, including autism. More than 400 genetic markers are being analyzed from more than 300 individuals in this study. The users include geneticists, clinical physicians, statisticians, disease specialists, laboratory technicians, and computer scientists/engineers.

1 Introduction

This paper describes a general software architecture for a class of applications made possible by the WWW. The motivation for our study is a medical research application in the Human Genome Project (HGP): genetic linkage analysis. Genetic linkage analysis is the primary method used to determine the association between a genetically linked trait (of interest to us are disease traits such as cancer, autism, etc.) and a locus of the human genome (a search space of some three billion base pairs). This particular problem requires cooperation among a diverse collection of researchers including geneticists, clinical physicians, statisticians, disease specialists, laboratory technicians, and computer scientists/engineers. The scale of this problem clearly indicates the need for large-scale informatics support.

Historically, however, relatively rare, simple traits have been studied, and ad hoc methods for data storage and analysis have been employed successfully to isolate only hundreds of genes (from a set of approximately 100,000 candidates) to date. The experience gained in these studies, coupled with the application of distributed/parallel computing methods, is making possible a systematic approach to the isolation of all the disease genes in the human genome. Below, we characterize the primary parallel/distributed characteristics of this class of applications. Section 2 provides some basic definitions from molecular biology and genetics used throughout the remainder of the paper. Sections 3 through 5 describe GenoMap and its various components. To date, the first internal release of GenoMap has been in use for approximately two years and, most notably, is presently being employed in the analysis of more than 400 genetic loci with data describing disease traits from more than 300 individuals with respect to the disease autism.

The class of applications of interest has parallelism present at three distinct levels. We describe them in a hierarchy, and also distinguish between multiplicity of users (sets of clinicians) and multiplicity of problems (diseases).

Level 1. Heterogeneous parallelism within a single problem instance. These components represent functional parallelism.

Level 2. Multiple instances of each of the functional components referred to in level 1.

Level 3. Multiple instances of the entire problem class, potentially with multiple instances of each functional component in level 1.

Level 1 illustrates a key attribute of an application of this class. In the genetic linkage setting, this refers to the need to support the computational demands of each of the diverse members of the team of users: physicians, statisticians, laboratory technicians, etc. Each user needs to see the system through the "window" of their part of the collaboration. Level 2 describes the oft-present attribute of scale. As a problem instance becomes large, not only do computational demands grow, but so do the number of project personnel and their geographical dispersion. In the case of our autism study, the scale of the project dictated that distinct groups (one at the University of Iowa and one at Tufts University) cooperate in the labor-intensive gathering of phenotypes (patient status with respect to autism) and genotypes (the states of the 400 different genome loci). At the third level of the hierarchy, it is desirable to manage the computational resources, in addition to the data resources, in a cooperative way. This involves the sharing of software and hardware. Users in many problem domains (such as medicine and biology) do not have easy access to the kind of support for cooperative computing that this class of applications demands. Thus, the solution to such a problem requires the ability to support sharing of data and software, while assuring secure isolation of data from multiple distinct instances of the problem. In handling information regarding genetic traits and diseases, the need for such security is clear.

We summarize the requirements of our solution as follows:

• The solution must support seamless sharing and cooperation of users at levels 1 and 2.
• The solution must assure that parallelism at level 3 is supported with secure isolation of distributed components.

GenoMap, a distributed, Web-based system for the support of genetic linkage analysis, illustrates the class of applications and the key elements of our approach to solving the problem. The next section defines key molecular biology terms needed to further describe our solution. The reader with a background in genetics/molecular biology may easily skip section 2.

2 Background: Genetic Linkage Analysis

This section serves as a brief introduction to some of the integral concepts of the genetics and molecular biology underlying the application of linkage analysis. First, some definitions. A genome is the entire genetic material of an organism, meaning all of the DNA that makes up the organism (which is replicated in every cell). The human genome is divided into 22 distinct autosomes (non-sex chromosomes) as well as the X and Y chromosomes. Within the genome are genes: elements that serve as templates for the production of proteins. Thus genes code for proteins. A dysfunctional gene and/or gene product can have a harmful effect on the organism (e.g., Duchenne Muscular Dystrophy). However, the entire genome of higher organisms is not dedicated only to encoding proteins. In fact, only a small portion is used as a template for protein production. Much of the intervening area is still of use for linkage analysis through the use of genetic markers. A genetic marker is a unique site within the genome that can be used to determine the "state" of a region of the genome, known as the genotype for that marker. Now, given a set of polymorphic (informative) genetic markers, we can collect genotypes for an individual across all of his/her chromosomes. Another piece of information required by linkage analysis is the phenotype of an individual: the status of that subject with respect to the disease (or trait) being studied.

Formally, linkage analysis is a statistical analysis that attempts to correlate a suspected genetic disease with a region within the genome. Linkage analysis requires a set of related subjects, the phenotypes of the subjects with respect to a disease, and the genotypes of the subjects for a genetic marker. A linkage analysis calculates the likelihood that the disease-causing gene is located near the genetic marker. In the context of GenoMap, we provide the following definitions.
A linkage experiment is a grouping of subjects, genetic markers, and a trait to be analyzed through linkage analysis. A linkage study is a set of linkage experiments, all pertaining to the same trait, but with possibly different subjects and/or markers. For example, the first linkage experiment of a linkage study might screen for the position of a trait's locus across the entire genome, such that three candidate loci are identified. Then a more focused linkage experiment could be designed to probe the suspected loci more efficiently. A genotyping experiment is a set of worksheets associated with a given linkage experiment, where a worksheet is an ordered set of subjects and markers to be genotyped under a given genotyping system and technology. A worksheet has a one-to-one correspondence to a physical object within a genotyping experiment (e.g., a genotyping gel). A genotyping system's technology consists of parameters describing the number of subjects and markers that can fit on a worksheet. While GenoMap has been designed to operate with a variety of genotyping technologies, to date our experience has been primarily with gel-based apparatus, many of which support multiplexing of markers on a gel in several dimensions.
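The definitions above can be read as a small data model. The following is a minimal sketch of that reading, written in Python for brevity rather than in GenoMap's Java; the class and field names are illustrative assumptions and do not come from the GenoMap database schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class LinkageExperiment:
    """Subjects, markers, and one trait to be analyzed together."""
    name: str
    trait: str
    subjects: tuple[str, ...]
    markers: tuple[str, ...]


@dataclass(frozen=True)
class Worksheet:
    """Ordered subjects and markers run on one physical object (e.g. a gel)."""
    subjects: tuple[str, ...]
    markers: tuple[str, ...]


@dataclass
class GenotypingExperiment:
    """The worksheets associated with one linkage experiment."""
    experiment: LinkageExperiment
    worksheets: list[Worksheet] = field(default_factory=list)


@dataclass
class LinkageStudy:
    """Linkage experiments that all pertain to the same trait."""
    trait: str
    experiments: list[LinkageExperiment] = field(default_factory=list)

    def add(self, experiment: LinkageExperiment) -> None:
        # A study may vary subjects and markers, but never the trait.
        if experiment.trait != self.trait:
            raise ValueError("all experiments in a study must share its trait")
        self.experiments.append(experiment)
```

The only rule enforced here is the one stated in the text: every experiment in a study pertains to the same trait, while subjects and markers may differ between experiments.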

3 The GenoMap System from the User View

GenoMap is a suite of independent yet inter-related tools developed primarily in Java. These tools manipulate data from a domain-specific networked database, allowing sharing of information among multiple distributed clients without replication and coherency problems. The main goal of GenoMap is to provide a portable, intuitive interface for managing the information associated with the gene location/discovery process. This section briefly tours the user interfaces and explains the primary function modules of GenoMap. Section 4 describes the internal architecture.

Table 1: GenoMap Components

Name of Software Component            Description
Subject Log                           Recording and searching for subject-specific information
Marker Log                            Recording and searching for marker-specific information
Trait Log                             Recording and searching for trait-specific information
Linkage Experiment Editor             Creating, viewing, and updating linkage experiments
Genotyping Assistant                  Creating, viewing, and updating genotyping experiments
GenoScape                             Viewing and genotyping of genotyping gel images
GenoScape Launcher (Client/Server)    Front-end interface to genotyping tool
Verification (Client/Server)          Interface to display all recorded genotypes together
Linkage Analysis (Client/Server)      Interface to multiple linkage analysis packages
Socket Server                         Manages communication between clients and servers, plus handles load-balancing

3.1 Linkage Experiment Editor

The Linkage Experiment Editor assists in the specification, creation, and maintenance of linkage experiments, where a linkage experiment is a grouping of subjects, markers, and traits necessary to perform a linkage analysis computation. The editor manages each of the three categories separately, to assist in category-specific specification of information within the linkage experiment. The update portions of the Linkage Experiment Editor provide an intuitive interface for specifying the set of elements to be used in a given category. This interface uses JDBC (Java Database Connectivity) to retrieve records that meet the specified criteria from a database. Such searches can be performed using various properties of the given categories (e.g., having been used in a previous linkage experiment).

Figure 1: Linkage Experiment Editor

An additional feature of the Linkage Experiment Editor is its ability to load the complete sets of subjects, genetic markers, and trait from a previously defined linkage experiment. This allows for simpler experiment creation given that one or more of the categories may be quite similar to a previous experiment. This is expected to be most useful in specifying the set of markers to be used in a linkage experiment. For example, when performing a genome-wide search for a gene it is common to use the same (or nearly the same) set of genetic markers, referred to as a screening set, that is relatively evenly spread across an organism's genome.

3.2 Genotyping Assistant

The Genotyping Assistant is used to assist in the management of genotyping experiments. It is the central component of the LIMS (Laboratory Information Management System) role of GenoMap. The Genotyping Assistant has been broken down into two components, corresponding to the separate functions that each component performs. The first component is responsible for the creation of new genotyping experiments. Creation requires two primary inputs: (1) a linkage experiment, which specifies the subjects, markers, and trait, and (2) the technology parameters, which describe the system under which the genotyping will be performed. Given these pieces of information, the user is then allowed to select whether to partition the subjects and markers onto worksheets manually or automatically.
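As a rough illustration of the automatic option, the sketch below partitions subjects into lane groups and pairs each group with each set of markers to be run together. It is written in Python for brevity, and it is only one plausible policy: the parameter `lanes_per_worksheet`, the dictionary layout, and the function names are assumptions, not GenoMap's actual partitioning algorithm.

```python
def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_worksheets(subjects, marker_sets, lanes_per_worksheet):
    """Create one worksheet per (marker set, lane group) pair.

    `lanes_per_worksheet` stands in for the technology parameters; each
    worksheet records which subject occupies which lane (numbered from 1).
    """
    worksheets = []
    for marker_set in marker_sets:
        for lane_group in chunk(subjects, lanes_per_worksheet):
            worksheets.append({
                "markers": list(marker_set),
                "layout": dict(enumerate(lane_group, start=1)),
            })
    return worksheets
```

With five subjects, two marker sets, and two lanes per worksheet, this yields six worksheets: three lane groups for each marker set.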

The information specified by the gel technology may be used to automatically map subjects onto worksheet positions (e.g., lanes of genotyping gels) and markers onto worksheets, although manual specification is also possible. The mapping of subjects onto worksheet positions is referred to as the layout. The set of genetic markers used on a worksheet is referred to as a multiplex set, because two dimensions of multiplexing are supported. The first is multiplexing of base pair size: multiple markers may be in the same multiplex set as long as their alleles do not overlap (e.g., sets A, B, C, and D). The second dimension of multiplexing is that of channels (e.g., dyes): several size-multiplexed sets of markers may be combined provided the genotyping system supports multiple channels. For example, sets A, B, C, and D may be merged into another set, M, in a system that supports four channels, where each original set is in a different channel (e.g., tagged with a different dye).

The second component provides an interface to check on the progress of an existing genotyping experiment. Specifically, the status and annotation of each worksheet can be checked (and modified). In addition, links are provided to allow viewing of individual worksheets at a higher level of detail. This includes the exact placement of every subject, the genetic marker run at that location, and the genotypes.
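The two multiplexing rules described above can be sketched as follows, in Python for brevity. Representing each marker by the smallest and largest of its known allele sizes is an assumption made for illustration, as are the names `Marker`, `is_size_multiplexable`, and `merge_into_channels`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Marker:
    name: str
    min_bp: int  # smallest known allele size, in base pairs
    max_bp: int  # largest known allele size, in base pairs


def is_size_multiplexable(markers):
    """True if no two markers have overlapping allele size ranges."""
    ordered = sorted(markers, key=lambda m: m.min_bp)
    # After sorting by range start, checking neighbours is sufficient.
    return all(prev.max_bp < nxt.min_bp for prev, nxt in zip(ordered, ordered[1:]))


def merge_into_channels(size_sets, channels):
    """Assign each size-multiplexed set to its own channel (e.g. a dye)."""
    if len(size_sets) > len(channels):
        raise ValueError("more marker sets than available channels")
    for markers in size_sets:
        if not is_size_multiplexable(markers):
            raise ValueError("alleles overlap within a size-multiplexed set")
    return {channel: list(markers) for channel, markers in zip(channels, size_sets)}
```

Markers in different channels may overlap in size, which is why the overlap check is applied within each set and not across the merged set.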

3.3 GenoScape Launcher & GenoScape

The GenoScape Launcher provides convenient access to our custom genotyping system, GenoScape(TM). GenoScape is a software package capable of automated, semi-automated, and manual genotyping of electrophoresis gel images. A wide variety of physical gel formats (horizontal, vertical, radiolabeled, fluorescent, and silver-stained) can be analyzed by GenoScape. The GenoScape gel-analysis system can analyze digitized electrophoresis gel images with an arbitrary number of markers and an arbitrary number of dyes. Information regarding the number of dyes, the allele sizes and names for each marker, and the base pair sizes of the standard "ladder" must be provided in a header file. A header file is associated with each gel image, allowing customization to a wide variety of genotyping needs. All of the data needed to create a header file for a gel image can be obtained through GenoMap by user interaction and from the database.

The GenoScape Launcher is implemented in a client-server structure, where the client component allows the user to identify a predefined genotyping experiment and worksheet from those stored in the database. Recall that there is a one-to-one correspondence between a worksheet and a gel image. Once the worksheet has been specified, a header file must be created for GenoScape (through further interaction) if one does not exist for the specified worksheet. Otherwise, if a header file is already available for the specified gel image, the user can choose to launch the GenoScape tool. The header file contains worksheet-specific information as well as GenoScape-specific information. This includes descriptions of the markers (and standard) present in a worksheet, as well as the dimensions of the image to be read in. While most of this information is available automatically from the database, manual editing of most parameters is supported through graphical interfaces provided by the GenoScape Launcher.
After verification (and possible modification) of information for the header file, a client-server connection is established and the data from the GenoScape utility is passed from the client to the server. The server uses the data to create a valid header file for the specified gel image. After the header file is successfully written, the GenoScape tool can be launched to view and genotype the specified gel image.

GenoScape (see Figure 2) is an X Windows based application written in C [KeR98] for the semi-automated genotyping of electrophoresis gel images. A UNIX-based computer system is required to run GenoScape, but any computer with an X Windows server can display the GenoScape graphical interface. The UNIX platform was chosen as the base operating system due to the demanding computational and memory requirements of GenoScape's numerous image processing filters. In particular, at present we do not feel that Java would be able to deliver adequate performance for such applications.

One of the most important steps in the processing of channel-multiplexed electrophoresis gels is the clean separation of a set of sampled data into the original dye channels. GenoScape accepts the coefficients of an inverted, normalized dye separation matrix, and can automatically apply the separation matrix to the image data. GenoScape can also interact with the user to customize a dye separation matrix based on the current gel image being analyzed.

Location of standard information can be done in a number of different ways. The most powerful mechanism utilizes a separate dye for a "standard ladder" with known lanes and base pair sizes. If such a dye is cleanly separable from channels containing markers, a high degree of fully automated genotyping will generally be possible. Even in cases of single-dye gels, or gels containing inseparable dyes, automatic genotyping is often possible through manual identification of known standard locations in the gel image. This can be supported in a number of ways, including the insertion of standard ladder components at regular intervals across the gel or simply using allele size information from known individuals.
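To make the dye separation step concrete, the sketch below applies an already-inverted separation matrix to sampled intensities, in Python for brevity. It assumes a simple linear mixing model (each detector reading is a weighted sum of the true dye signals); the function names and the numbers in the usage note are illustrative and are not taken from GenoScape.

```python
def apply_separation(inverse_matrix, sample):
    """Recover per-dye intensities from one vector of sampled intensities.

    `inverse_matrix` is the inverse of the assumed mixing matrix, given as a
    list of rows; `sample` holds one reading per detector channel.
    """
    return [sum(c * x for c, x in zip(row, sample)) for row in inverse_matrix]


def separate_image(inverse_matrix, samples):
    """Apply the same separation matrix to every sampled point of an image."""
    return [apply_separation(inverse_matrix, sample) for sample in samples]
```

For example, if a second dye bleeds 25% of its signal into the first channel, the mixing matrix is [[1, 0.25], [0, 1]] and its inverse is [[1, -0.25], [0, 1]]. A true signal of (4, 8) is then observed as (6, 8), and applying the inverse recovers (4, 8).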
Gel normalization (straightening) uses the known "coordinates" (most often the standard alleles) located in the previous step to "straighten" the image. After normalization, the task of automated genotyping becomes quite simple (see Figure 2). This allows sophisticated heuristics for identifying background bands, homozygotes, missing alleles, etc., to be implemented simply and efficiently. GenoScape uses the marker specifications (provided in the header file) to automatically call the genotypes in all lanes and dye channels of the normalized gel. A number of tuning parameters are available to the user to assure that GenoScape calls as many genotypes as possible automatically and, more importantly, accurately. A human observer skilled in genotyping is generally used to verify the calls made by GenoScape.

After the automated genotyping portion of GenoScape has attempted to call most genotypes, the user has a very powerful and intuitive "point-and-click" interface to edit the genotypic information. This is normally only required to call genotypes that were expressed very poorly in the gel relative to the alleles appearing elsewhere within a specific marker. After genotypes have been created and verified, they are output in a text format. The GenoMap Verification tool is then used to verify the correctness of all called genotypes and to enter them into the database. Additionally, GenoScape stores all parameters that were edited and/or created during a session. This allows a given user to resume a session and allows other users to review an identical session.
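The automated calling step described above can be pictured with a deliberately simple heuristic, sketched here in Python. It assumes that normalization has already converted band positions to base pair sizes, that the known allele sizes for the marker are available and non-empty, and that a band within a fixed `tolerance` of a known allele is that allele. None of these details are GenoScape's actual heuristics, which the text describes only in general terms.

```python
def call_genotype(band_sizes, known_alleles, tolerance=0.5):
    """Return a sorted pair of allele sizes, or None if no confident call.

    band_sizes:    measured band sizes (base pairs) in one lane and channel
    known_alleles: non-empty list of allele sizes known for the marker
    tolerance:     largest accepted distance between a band and an allele
    """
    called = set()
    for size in band_sizes:
        nearest = min(known_alleles, key=lambda allele: abs(allele - size))
        if abs(nearest - size) <= tolerance:
            called.add(nearest)
    if len(called) == 1:
        allele = next(iter(called))
        return (allele, allele)       # one band: treated as a homozygote
    if len(called) == 2:
        return tuple(sorted(called))  # two bands: heterozygote
    return None                       # nothing usable, or too many bands
```

Lanes for which the function returns None correspond to the cases the text leaves to the human observer, such as poorly expressed alleles or extra background bands.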


Figure 2: A Genotyped Gel Image

3.4 Verification

Typically, genotyping is a manual activity of lab technicians who score gels by writing directly on the glass of a gel. Aside from being laborious, the repetition of hand scoring and data entry is a substantial source of errors. A check for Mendelian inconsistencies will detect some of the errors, but it is not unusual for errors to propagate to the analysis phase. Typically, the error rate is reduced by redundant genotyping (i.e., generating multiple genotype files using GenoScape). This can be done via a second genotyping of the gel in which either (1) the original genotype file is double-checked by a second technician, or (2) the gel is completely re-genotyped without knowledge of the results from the first genotyping session. We refer to the process of comparing multiple calls of genotypes as "verification."

With reduction of the error rate as motivation, we have implemented a Verification tool to compare genotypes generated by the semi-automated genotyper application GenoScape. The large number of genotypes that GenoScape rapidly assembles makes it essential that the verification phase have a similar capacity for automation. The automation provided by GenoScape offers some interesting options for the verification phase. With the first scoring pass, the pertinent files are created as needed by GenoScape. Thus, the second technician can use GenoScape to review the calls already made, and make changes as appropriate. An alternative mode is for the second technician to duplicate the genotyping output files and replicate the entire scoring process. A third variation is to have multiple people perform independent scoring sessions. The Verification tool can accommodate these variations by allowing the user to specify the source(s) of genotyping information.

Figure 3: Verification of Genotypes

The Verification client interface requires the user to specify the experiment, worksheet, laboratory name, and date (defining the archive location) in a fashion consistent with the GenoScape Launcher. Verification is also implemented as a client/server application. The client checks that the experiment and worksheet exist in the database, and checks for the specified genotype file(s). The server reads the genotype file(s) and sends the information to the client to be displayed to the user. The client process then queries the database for any genotypes that may already exist for the worksheet being verified. The client displays all genotypes from the selected genotype files and the worksheet's genotypes from the database. The genotypes assembled by GenoScape from two independent genotyping sessions, in addition to the genotypes already stored in the database, are shown in Figure 3. Any discrepancies in genotypes are highlighted by the tool and must be resolved before the user can store or overwrite genotypes in the database. For example, Figure 3 shows an inconsistency in the genotypes from lane 5, between the first set of genotypes, the second set of genotypes, and the genotypes stored in the database. This inconsistency must be resolved, with all sets of genotypes in agreement, before Genotype Verification can update the database. The Genotype Verification tool and GenoScape can be used concurrently to examine the original gel image and the calls assigned during a specified session in the context of a genotyping discrepancy between genotype files and/or the database.
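The comparison at the heart of verification can be sketched as below, in Python for brevity. The data layout (a dictionary per source, keyed by lane and marker) and the function names are assumptions made for illustration; the sketch also assumes that a genotype missing from one source is not, by itself, a discrepancy.

```python
def find_discrepancies(genotype_sets):
    """Return the positions where the sources disagree.

    genotype_sets maps a source name (a genotyping session or the database)
    to a dictionary {(lane, marker): genotype}.
    """
    keys = set()
    for calls in genotype_sets.values():
        keys.update(calls)
    discrepancies = {}
    for key in sorted(keys):
        seen = {source: calls[key]
                for source, calls in genotype_sets.items() if key in calls}
        if len(set(seen.values())) > 1:
            discrepancies[key] = seen
    return discrepancies


def can_store(genotype_sets):
    """The database may be updated only when every source agrees."""
    return not find_discrepancies(genotype_sets)
```

Returning the conflicting calls per source, and not just a flag, mirrors the tool's behavior of highlighting each discrepancy so that a user can resolve it.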

3.5 Linkage Analysis

The purpose of the Linkage Analysis tool is to provide access to linkage analysis software packages through GenoMap. The three issues of interest to the Linkage Analysis tool are:

1. Automatic data file generation and formatting,
2. Complexity of the mathematics underlying linkage analysis, and
3. Potential need to perform load balancing of linkage analysis processes across several workstations.

The first issue requires that access to linkage analysis software packages through GenoMap provide the user with a graphical interface to several tools. Currently, several linkage analysis packages are available, each with its own format and interface peculiarities. Such diversity complicates the analysis of data, due to the overhead of (1) the data manipulations needed to generate appropriately formatted files, and (2) interacting with specific interfaces. Secondly, the complexity of the mathematics underlying linkage analysis makes it impractical to expect fully automated linkage analysis for all available tools, but given the amount of genotypic data being generated, it is clear that the analysis represents a future bottleneck in the mapping of genes [CoI93]. Therefore, the Linkage Analysis tool has been implemented as a network-distributed application. Such an implementation makes it practical for the Socket Server to perform load-balancing.

Recall that a linkage experiment is the specification of a set of subjects (with attributes including relationships, affection status, and genotypes), markers, and phenotype. This information is all maintained within the database, simplifying the process of gathering the necessary information. The Linkage Analysis tool is implemented in a client-server format. Currently, support for mlink, a 2-point analysis tool of the FASTLINK package, has been implemented in Linkage Analysis. The client obtains the information (genotypes, pedigree, and affected status) from the database, and performs some pre-processing on it. For example, the client must normalize the genotypes, because mlink cannot use actual marker sizes for analysis; it then passes the results to the server application, which is capable of performing the analysis. Once it has received the necessary information, the server creates appropriately formatted data files, checks for Mendelian inconsistencies and, finally, executes mlink. The resulting lod score table (Figure 5) is accessible through a web browser as an HTML page. The current implementation has been used on a 190-person study with 6 markers, and is being used in an extension of this study to include 289 markers.
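Two of the steps named above, genotype normalization on the client and the Mendelian consistency check on the server, can be sketched as follows in Python. The sketch assumes an autosomal marker, fully typed individuals, and a normalization that simply renumbers allele sizes 1..n in ascending order; the function names and data layout are illustrative, and the actual file formats expected by mlink are not shown.

```python
def normalize_alleles(genotypes):
    """Replace allele sizes with consecutive integers 1..n, by ascending size.

    genotypes maps a subject id to a pair of allele sizes for one marker.
    """
    sizes = sorted({allele for pair in genotypes.values() for allele in pair})
    index = {size: number for number, size in enumerate(sizes, start=1)}
    return {subject: (index[a], index[b]) for subject, (a, b) in genotypes.items()}


def mendelian_consistent(child, father, mother):
    """True if the child could have received one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def find_inconsistencies(pedigree, genotypes):
    """Return the children whose genotypes conflict with their parents'.

    pedigree maps a child id to (father id, mother id); trios with any
    untyped member are skipped.
    """
    bad = []
    for child, (father, mother) in pedigree.items():
        if all(person in genotypes for person in (child, father, mother)):
            if not mendelian_consistent(genotypes[child], genotypes[father],
                                        genotypes[mother]):
                bad.append(child)
    return bad
```

Because renumbering preserves which alleles are equal, the consistency check gives the same answer before and after normalization.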

3.6 Subject Log

The Subject Log provides an interface for entering subjects into the database, as well as for searching for all information pertaining to a specified subject.

Figure 4: Linkage Analysis Interface

The primary interface of the Subject Log allows the specification of the identity, sex, and pedigree of a subject. This includes fields for entering a study number (by which a subject is uniquely identified), as well as the study numbers of both parents. The other Subject Log interfaces are reachable only from the primary Subject Log interface in which a valid subject has been specified. These interfaces allow the management of a subject's:

• phenotypes: clinical observations
• laboratory samples: blood, DNA
• personal information: name, address

3.7 Marker Log

The Marker Log allows marker information to be entered into, and queried from, the database interactively. While this tool has only a single interface, it still deals with a large amount of information. The database structure used for storing genetic markers allows any number of alleles to be stored for a given marker, and this tool makes use of that by maintaining a list of known allele sizes for the specified marker. In addition, other marker characteristics such as chromosome, locus, allele frequency, and heterozygosity are also maintained within the database.
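The text does not say how the stored heterozygosity is obtained. One standard way to derive it from allele frequencies is the expected heterozygosity, H = 1 - sum of p_i squared, sketched below in Python; treating this as the stored quantity is an assumption, and the function name is illustrative.

```python
def expected_heterozygosity(allele_frequencies):
    """Expected heterozygosity H = 1 - sum(p_i ** 2) for one marker.

    allele_frequencies maps an allele size to its population frequency;
    the frequencies are expected to sum to 1.
    """
    total = sum(allele_frequencies.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_frequencies.values())
```

A marker with three alleles at frequencies 0.5, 0.25, and 0.25 has an expected heterozygosity of 0.625; markers with values closer to 1 are more informative for linkage analysis.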

Figure 5: Lod Score Table

3.8 Trait Log

The Trait Log provides an intuitive interface for the creation of traits to be used within the database. For example, the Linkage Experiment Editor requires that a phenotype be specified. This implies that the subjects involved in the linkage experiment will have their affection status determined with respect to the specified linkage experiment's phenotype. To accommodate the broadest possible range of traits, GenoMap allows the specification of traits with a continuous range of affectedness. In addition, each trait has an annotation describing the clinical qualities of the trait. In this way, different possible varieties of a phenotype can be distinguished. For example, one clinical criterion for such a subdivision of a phenotype could be "age of onset."

4 GenoMap Architecture

GenoMap is a large-scale, distributed, heterogeneous, client-server application to support the systematic exploration of the genome to narrow, and ultimately identify, the locus of a particular gene (or set of genes) involved in a disease or trait. In contrast to many applications developed in support of the Human Genome Project (HGP) [DOE95] to date, GenoMap does not involve gathering or analysis of any DNA sequence data. Rather, the fundamental informational components of this functional genomics [MyM97] application are (1) familial relationships (pedigree information), (2) clinical observations of disease or trait (phenotype), (3) sets of known polymorphic genome loci (genetic markers), and (4) information about the state of candidate loci for the individuals being studied (genotypes). There are two primary ramifications of these characteristics that distinguish this problem from most existing network-based genome analysis applications:

1. The need for privacy regarding pedigrees, genotypes, and phenotypes, and
2. The need to support a diverse collection of cooperating individuals in the gene identification process.

The first implication requires protection of the data being used to perform analyses, and the second naturally suggests a heterogeneous, distributed solution. However, if these are the primary requirements of the solution, then they present an immediate conflict. Distribution of data to heterogeneous computation sites inherently involves taking risks with the security of data. The approach taken by GenoMap provides for security by:

• Verifying identities of individuals in the granting of access to sensitive data through the network:
  - Passwords protect access to the tools themselves.
  - Database contents can only be retrieved in the context of a single functional tool of the system. Thus users are prevented from making "custom" queries to extract information that would compromise individual privacy.
  - The database structure itself hinders the association of individual identities with their clinical or genetic information.
• Keeping databases containing sensitive data localized with the computations accessing them, thus allowing local administrative control and allowing physical restrictions to be placed on the set of IP addresses allowed to access a particular database.

Most of GenoMap has been implemented in Java (currently v1.2) [Fla97] with a socket-oriented, client-server design employing recent applet security features [CoH97]. The gene identification application supported by GenoMap is characterized by a two-stage process of data gathering and verification, followed by an analysis phase known as genetic linkage analysis [Ott91, LaW96]. Java applets provide interfaces for specifying linkage experiments, supporting management of the data collection and verification process, and interacting with statistical linkage analysis packages [CoI93].
One of the data collection tools is a large C/X Windows application, GenoScape [CaM97], for scoring gels with repeat markers [GuW83], and many of the analysis packages are pre-existing and run in a UNIX environment (originally written in various languages: Fortran, Pascal, etc.). However, to support security, interoperability, and sharing of the software among multiple laboratories and projects, a novel component of our system is a socket server process (SSP) that provides a naming service to the applets and applications, controls access to applications and data, and provides a load balancing function. Figure 6 shows the main interface window, which not only provides the common user view of the GenoMap system but also serves as a central starting point for launching all GenoMap functions and validating the user's identity. GenoMap is currently being used in the collection of data for, and analysis of, a number of relatively small gene identification studies. It is also being used in one large genome-wide screening for the locus of the gene(s) involved in autism. The former shows its usefulness in supporting users who want to employ the analysis features of GenoMap without needing the large-scale data collection and management facilities, while the latter shows its usefulness in managing gene identification studies that would have been unmanageable without such a system.
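The three roles attributed to the SSP (naming service, access control, and load balancing) can be illustrated with a small in-memory registry, sketched here in Python. This is not the SSP's actual protocol or data structure: the class and method names are hypothetical, no sockets are involved, and "least active jobs" is just one possible load-balancing policy.

```python
class SocketServerRegistry:
    """In-memory stand-in for the SSP's naming, access, and balancing roles."""

    def __init__(self):
        self._servers = {}      # (domain, service) -> {address: active jobs}
        self._user_domain = {}  # user -> the single domain the user belongs to

    def register_user(self, user, domain):
        self._user_domain[user] = domain

    def register_server(self, domain, service, address):
        self._servers.setdefault((domain, service), {}).setdefault(address, 0)

    def lookup(self, user, service):
        """Return the least-loaded server for `service` in the user's domain."""
        domain = self._user_domain.get(user)
        if domain is None:
            raise PermissionError(f"unknown user: {user}")
        candidates = self._servers.get((domain, service))
        if not candidates:
            raise LookupError(f"no {service!r} server in domain {domain!r}")
        address = min(candidates, key=candidates.get)
        candidates[address] += 1  # count the job now being handed out
        return address

    def release(self, user, service, address):
        """Record that a job previously handed out has finished."""
        domain = self._user_domain[user]
        self._servers[(domain, service)][address] -= 1
```

Because every lookup is keyed by the caller's own domain, a user can never be handed the address of a server registered in another domain, which is the isolation property the architecture requires.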

Figure 6: User's WWW View of GenoMap

4.1 Requirements

The GenoMap system is designed to support a diverse collection of users in a wide-area network environment. The data objects being managed are of the most sensitive nature, usually identifying family relationships among members, some of whom may carry stigmatizing genetic diseases.

An additional requirement is that GenoMap be portable and sharable among multiple research laboratories. Due to the difficulties and costs associated with updating and distributing copies of software, we have decided that GenoMap must be a web-based system; i.e., whenever possible, the only static copies of GenoMap applets and applications will be stored on the web server at the University of Iowa. While appearing to be a potential bottleneck in several ways, this in fact allows our group to make full use of the system locally without having to expend effort supporting users outside of our local site with updates, releases, and copies of documentation. These tasks represent the primary reason that much university-based software is rarely used outside the domain in which it was created.

In order to support the use of GenoMap at multiple, geographically dispersed sites, we define the notion of a database domain, or simply domain. Users register with the GenoMap system and are assigned to a domain. The domain corresponds to a database that is most likely stored and served on a machine within the administrative domain of the user. This greatly reduces the chances that security violations with respect to privacy of data will occur. However, it complicates the overall design of GenoMap, requiring that applets and applications be restricted to access only the domain in which they were originally instantiated.

Finally, GenoMap must interact with extant software. The software ranges from data collection packages on PC/Mac-based systems to legacy codes in FORTRAN for genetic linkage analysis. For example, Mlink, a 2-point analysis tool of the FASTLINK package [CoI93], is commonly used to conduct analyses and would require at least one person-year to rewrite in Java. For the immediate future, it is a requirement that GenoMap be able to directly "wrap" the interface details of such packages into a Java client/server interface that makes Mlink appear to be a Java applet.
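The "wrapping" requirement can be illustrated with a minimal sketch, assuming the legacy package is an executable that a Java backend server launches and whose output it captures. The class names `LegacyToolWrapper` and `LegacyToolDemo` are hypothetical and are not part of GenoMap; the demo runs the JVM's own launcher as a stand-in because the real Mlink command line is not given here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper: runs a legacy executable and captures what it prints. */
class LegacyToolWrapper {
    /** Exit code and merged stdout/stderr lines of one run. */
    static final class Result {
        final int exitCode;
        final List<String> outputLines;

        Result(int exitCode, List<String> outputLines) {
            this.exitCode = exitCode;
            this.outputLines = outputLines;
        }
    }

    private final List<String> command;

    LegacyToolWrapper(List<String> command) {
        this.command = new ArrayList<>(command);
    }

    /** Starts the external program, reads all of its output, and waits for it to exit. */
    Result run() {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        try {
            Process process = builder.start();
            List<String> lines = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            return new Result(process.waitFor(), lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the tool", e);
        }
    }
}

class LegacyToolDemo {
    public static void main(String[] args) {
        // Stand-in for a legacy binary such as Mlink: the JVM's own launcher.
        String launcher = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        LegacyToolWrapper.Result result =
                new LegacyToolWrapper(Arrays.asList(launcher, "-version")).run();
        System.out.println("exit code: " + result.exitCode);
        for (String line : result.outputLines) {
            System.out.println(line);
        }
    }
}
```

A backend server built this way would translate an applet's request into the tool's input files and command line, and translate the captured output back into a reply, so the applet never sees the legacy interface.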

4.2 Component Organization

Figure 7: Domain Components

A typical domain consists of the following components.

Web Server: It hosts all the user interface applets for the services offered by GenoMap.

Services: A service in GenoMap consists of two or more components/resources: an applet which implements a user interface, a backend server which does all the computations and implements all file accesses, and a database which stores all the information required for the service. Not all services require a backend server.

Clients: Clients are the users who access the services through WWW browsers. They do not interact directly with the server or the database. Each user has privileges which determine which services he/she can access.

Database: An important characteristic of any genome-related experiment is the need for a large volume of sensitive data. A database is an essential component and is used to form an administrative domain; i.e., only users belonging to that domain can access the data.

4.2.1 Socket Server

The socket server performs primarily the following four functions.

Storage of Information

The socket server keeps track of where the servers are running and to which port each is listening for client requests. It stores the database location and maintains the list of registered domains and the users in each domain and their privileges.

Load Balancing

The goal of load balancing in GenoMap is to evenly distribute the requests for services among the multiple server instances. This is achieved by simply providing the appropriate server location information to client applets. The socket server selects the appropriate server depending on the following parameters.

• Number of client requests currently being serviced by each server.
• Individual server parameters such as the maximum number of requests the server can handle.
• Idle time since last serviced request.

A LoadValue is calculated using the following equation:

$$LV = \sum_{i=1}^{n} W_i X_i$$

where
n = number of factors used to calculate LoadValue (LV),
$W_i$ = weight (importance) assigned to factor i,
$X_i$ = normalized value of the parameter.

Our initial implementation considers the following simple factors:

$$X_1 = \frac{\mathit{MaxNoOfRequests} - \mathit{NoOfRequests}}{\mathit{MaxNoOfRequests}}$$

$$X_2 = \frac{\mathit{MaxIdleTime} - \mathit{IdleTime}}{\mathit{MaxIdleTime}}$$

The $W_i$ are values ranging from 0 to 1. For example, if $W_1 > W_2$, the socket server favours selecting the server whose LoadValue is the highest.

Management of resources

The socket server manages the addition and deletion of domains. It allows users to be added to and removed from a domain and their privileges to be changed. The database and the servers can be dynamically registered and unregistered. The socket server accounts for failures during a transaction between client applets and servers.
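The LoadValue calculation described under Load Balancing can be sketched as follows. This is a minimal illustration, assuming the two factors are X1 = (MaxNoOfRequests - NoOfRequests) / MaxNoOfRequests and X2 = (MaxIdleTime - IdleTime) / MaxIdleTime, and that the server with the highest LoadValue is selected. The class names `ServerLoad` and `LoadBalancerSketch` and the weights in the demo are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Per-server figures the socket server is assumed to track (hypothetical class). */
class ServerLoad {
    final String name;
    final int noOfRequests;
    final int maxNoOfRequests;
    final double idleTime;
    final double maxIdleTime;

    ServerLoad(String name, int noOfRequests, int maxNoOfRequests,
               double idleTime, double maxIdleTime) {
        this.name = name;
        this.noOfRequests = noOfRequests;
        this.maxNoOfRequests = maxNoOfRequests;
        this.idleTime = idleTime;
        this.maxIdleTime = maxIdleTime;
    }

    /** X1: fraction of the request capacity that is still free. */
    double x1() {
        return (double) (maxNoOfRequests - noOfRequests) / maxNoOfRequests;
    }

    /** X2: normalized idle-time factor, in the form assumed above. */
    double x2() {
        return (maxIdleTime - idleTime) / maxIdleTime;
    }
}

class LoadBalancerSketch {
    /** LV = sum over i of W_i * X_i. */
    static double loadValue(double[] weights, double[] factors) {
        if (weights.length != factors.length) {
            throw new IllegalArgumentException("one weight per factor is required");
        }
        double lv = 0.0;
        for (int i = 0; i < weights.length; i++) {
            lv += weights[i] * factors[i];
        }
        return lv;
    }

    static double loadValue(ServerLoad server, double w1, double w2) {
        return loadValue(new double[] {w1, w2}, new double[] {server.x1(), server.x2()});
    }

    /** Returns the server with the highest LoadValue, or null for an empty list. */
    static ServerLoad pick(List<ServerLoad> servers, double w1, double w2) {
        ServerLoad best = null;
        double bestLv = Double.NEGATIVE_INFINITY;
        for (ServerLoad server : servers) {
            double lv = loadValue(server, w1, w2);
            if (lv > bestLv) {
                best = server;
                bestLv = lv;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<ServerLoad> servers = Arrays.asList(
                new ServerLoad("A", 2, 10, 0.0, 60.0),
                new ServerLoad("B", 8, 10, 30.0, 60.0));
        System.out.println("selected: " + pick(servers, 0.7, 0.3).name);
    }
}
```

Because the selection depends only on the weighted sum, changing the weights changes the policy without touching the client applets, which only ever receive a server location.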

Figure 8: Authentication Hierarchy

Authentication

GenoMap uses an authentication scheme based on naming (see Figure 8). Each domain is divided into logical sub-domains corresponding to the services offered in that domain. Each service is further divided into sub-domains, giving finer control of the service. Each user is placed in one or more nodes. The user's access rights are assigned according to their position in the hierarchy: the closer the user is to the root, the more rights he/she has.
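One way to read the naming scheme is that a user placed at a node may access that node and everything beneath it. The following is a minimal sketch under that assumption; the "/"-separated path names, the class `NamingAuthSketch`, and the particular tree in the demo are hypothetical and only loosely follow the node names of Figure 8.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical name-based access check: rights flow from a node to its descendants. */
class NamingAuthSketch {
    private final Map<String, Set<String>> nodesByUser = new HashMap<>();

    /** Places a user at a node of the hierarchy, e.g. "Root/Genotyping". */
    void place(String user, String nodePath) {
        nodesByUser.computeIfAbsent(user, u -> new HashSet<>()).add(nodePath);
    }

    /** True when the user sits at the requested node or at one of its ancestors. */
    boolean mayAccess(String user, String requestedPath) {
        Set<String> nodes = nodesByUser.getOrDefault(user, Collections.<String>emptySet());
        for (String node : nodes) {
            if (requestedPath.equals(node) || requestedPath.startsWith(node + "/")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        NamingAuthSketch auth = new NamingAuthSketch();
        auth.place("domainAdmin", "Root");
        auth.place("technician", "Root/Genotyping");
        System.out.println(auth.mayAccess("domainAdmin", "Root/Linkage"));
        System.out.println(auth.mayAccess("technician", "Root/Genotyping/Genotype Entering"));
        System.out.println(auth.mayAccess("technician", "Root/Linkage"));
    }
}
```

With this reading, a user placed at the domain root reaches every service, while a user placed at a leaf reaches only that one function.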

5 Implementation and Usage to Date

Apart from specific implementation details, GenoMap is a networked application consisting of a set of interacting 1) Java applets and applications, 2) Java applets/applications and C/Xwindows user interfaces, 3) Java and C applications that interact with local file systems, and 4) Java applets that interact with a database server and Java and C applications. Most of the clients are Java applets which provide users with various functions, such as verification of genotypes. There are two different design approaches that meet the basic requirements outlined in the previous section.

1. Clients contact their paired servers individually, and registration is done on an applet/server pair basis. Servers then use "well-known" socket addresses to be contacted by applet clients. This approach is the basis for our first implementation of GenoMap.

2. A separate socket server process is implemented and registered with the web server. The socket server authenticates valid users of GenoMap and passes location information to the applet clients that need to contact the various backend servers and the database. The socket server also performs load balancing among multiple instances of backend servers.

Our first implementation of GenoMap is capable of supporting (and in fact is supporting) a large autism gene identification study within the University of Iowa Department of Genetics. We are presently expanding the set of users to include individuals and laboratories outside of our direct administrative control (e.g., a cooperating group at Tufts University). Thus, while the first approach can be made to work in such an environment, its management requirements and its limited ability to load-balance application activity make it less attractive and less scalable. The following sections describe our current implementation in detail.

5.1 Installation and Usage

Initialization of GenoMap

GenoMap is served from one web server. Once this web server is enabled and all the applet code is installed, bootstrapping the rest of the system involves installing a set of server processes: the database server, the socket server, and the application server(s). The system administrator starts the socket server and places its location information in a file where client applets and servers can read it. A web page with all the client applets is created. The system administrator starts at least one instance of the server for each service offered. These servers are used as application servers during the initial stages of domain creation.

Setting up Domains

The system administrator registers the domain with the socket server and the authentication server. The system administrator creates an account for the domain administrator, who also becomes the first user of that domain. The database for the domain is next registered with the socket server. The system administrator is responsible for distributing, to the local domain, the binaries for each server that is to be run there.

Adding Users

The domain administrator adds users to the domain and assigns privileges to indicate which services each user is allowed to access. The domain administrator is also responsible for administering the database. Since the information in the database is very sensitive, separate levels of authentication are required to access the database.

Starting Service Providers

The domain administrator starts at least one server for each service. Depending on the number of registered users and the need for a particular service, multiple instances of servers may be created. This completes the domain setup, and the users are now ready to use GenoMap.

Using GenoMap

The client applet first reads the location of the socket server from the JAR file. It authenticates the user with the socket server and accesses the requested service. The information regarding the port and host name of the server is provided by the socket server after authentication. To access the database, the client applet must obtain authentication in a separate step.

5.2 Socket Server Architecture

Figure 9: Interface provided by SS

Our main philosophy was to keep the implementation open and adaptable to changes in requirements or technology. We identified a standard set of APIs required to satisfy the socket server responsibilities outlined in the previous section. These APIs are then used by the clients (applets) and backend servers to access the services of the socket server (see Figure 9). This allows us to change our implementation in the future without making costly changes to our existing client or server code. Below we describe the various APIs and their functionality.

5.2.1 Socket Server APIs

The socket server provides four classes of APIs to be used by client applets, the system administrator, and servers. The APIs are described informally to highlight their functionality and the parameters required for each.

Service APIs

The Service APIs allow the client applets to request location information about the servers. The APIs are:

• Location GetMeThisService(Key, ServiceRequired)
• Location GetMeTheDatabaseLocation(Key)
• CompletedSession(FeedBack)

Location is a tuple consisting of Hostname and Port, where the server listens for client applet requests. FeedBack is an optional parameter which client applets can use to comment on the server and its quality of service, such as the time taken to complete the service.

Authentication APIs

The Authentication APIs are used by client applets and the system administrator to obtain authentication before doing any processing.

• Key AuthenticateMe(UserName, DomainName, Password)

This returns the Key which is used in other APIs for verification.

Administrative APIs

The administrative APIs are used for registering and unregistering domains, and users of a particular domain. Correspondingly, there are two sets of APIs, one used by the system administrator and the other used by the domain administrator.

1. APIs for the System Administrator
   • AddADomain(Key, DomainName, DatabaseLocation, DomainAdminName)
   • RemoveADomain(Key, DomainName)

2. APIs for the Domain Administrator
   • AddAUser(Key, UserName, Priveleges)
   • RemoveAUser(Key, UserName)
   • ChangePrivleges(Key, UserName, Priveleges)

Server APIs

Backend servers use these APIs for registering themselves with the socket server.

• RegisterMe(EncryptedMessage)
• UnRegisterMe()
• RequestServiced(FeedBack)
• PingResponse(Parameters)

FeedBack is used by the servers to inform the socket server of the completion of a session and any comments about the session.

Key

Key is an encrypted message which includes DomainName, Priveleges, and a MagicNumber. The authentication server passes this Key to the socket server for every call to AuthenticateMe. The socket server uses the Key to verify a user's privileges before giving any information to client applets.
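The interplay of AuthenticateMe, the Key, and GetMeThisService can be sketched in a few lines of Java. This is an illustrative in-memory model, not the GenoMap implementation: the class `SocketServerSketch` and its helper methods `addUser` and `registerServer` are hypothetical, the Key is left unencrypted, passwords are held in plain text, and the socket server and authentication server are collapsed into one object.

```java
import java.security.SecureRandom;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hostname and port where a backend server listens for client applet requests. */
final class Location {
    final String hostname;
    final int port;

    Location(String hostname, int port) {
        this.hostname = hostname;
        this.port = port;
    }
}

/** Unencrypted stand-in for the Key (assumption: the real Key is an encrypted message). */
final class Key {
    final String domainName;
    final Set<String> privileges;
    final long magicNumber;

    Key(String domainName, Set<String> privileges, long magicNumber) {
        this.domainName = domainName;
        this.privileges = privileges;
        this.magicNumber = magicNumber;
    }
}

class SocketServerSketch {
    private final Map<String, String> passwords = new HashMap<>();
    private final Map<String, Set<String>> privileges = new HashMap<>();
    private final Map<String, Location> servers = new HashMap<>();
    private final Map<Long, Key> issuedKeys = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    void addUser(String user, String domain, String password, Set<String> services) {
        passwords.put(user + "@" + domain, password); // plain text only for this sketch
        privileges.put(user + "@" + domain, new HashSet<>(services));
    }

    void registerServer(String domain, String service, Location location) {
        servers.put(domain + "/" + service, location);
    }

    /** Returns a Key on success, or null when the credentials do not match. */
    Key authenticateMe(String userName, String domainName, String password) {
        String id = userName + "@" + domainName;
        if (password == null || !password.equals(passwords.get(id))) {
            return null;
        }
        Key key = new Key(domainName,
                Collections.unmodifiableSet(privileges.get(id)), random.nextLong());
        issuedKeys.put(key.magicNumber, key);
        return key;
    }

    /** Verifies the Key, then returns the server location inside the Key's own domain. */
    Location getMeThisService(Key key, String serviceRequired) {
        if (key == null || issuedKeys.get(key.magicNumber) != key) {
            throw new SecurityException("Key was not issued by this socket server");
        }
        if (!key.privileges.contains(serviceRequired)) {
            throw new SecurityException("No privilege for service " + serviceRequired);
        }
        return servers.get(key.domainName + "/" + serviceRequired);
    }
}
```

Because the lookup in `getMeThisService` is always prefixed with the domain name carried by the Key, a client authenticated in one domain cannot obtain server locations belonging to another, which mirrors the isolation requirement stated for domains.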

5.2.2 Socket Server Internal Structure

Figure 10: Architecture

Each class of APIs is implemented by a different module (see Figure 10). The Internal Database serves as the information repository for all modules of the socket server.

The Authentication Server is responsible for maintaining all administrative information and for authenticating clients and the system administrator. It also generates Keys to be used for verification and passes the Keys to the modules which require them. It stores all information in the Internal Database. It is designed in such a way that it can run on a separate host as an independent entity.

The Client Information Provider provides the required information to client applets after verifying their keys against the keys from the Authentication Server. It gets the appropriate backend server from the Load Balancer and the database location from the Internal Database.

Server Registration decrypts the registration message from servers and is responsible for keeping track of the currently registered servers.

State Of Server maintains the current state of the servers using information from the following sources.

Client feedback: This feedback is useful in knowing the state of the servers and the successful completion of transactions.

Server feedback: This feedback is used by servers to ask for special requirements such as temporary suspension. It also indicates the completion of a transaction.

Client requests: Each request increases the load on the scheduled server.

Ping: The socket server periodically "pings" the registered servers to make sure they are in a "live" state. Service providers can also use the PingResponse method to inform the socket server of any change in their server parameters.

The Load Balancer calculates the best server for a particular service using the information from State Of Server and the Internal Database.

The Internal Administrator periodically sends Activity Reports to the system administrator and domain administrators. Also, when the requests for a particular service go beyond a specified limit and the available servers cannot handle the requests, it sends an Alert Message to the domain administrator requesting the start of more servers for that particular service.
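The bookkeeping performed by State Of Server can be sketched as follows. This is a minimal illustration under stated assumptions: the class `ServerStateSketch`, its method names, and the rule that a server counts as "live" when its last ping response is no older than a fixed timeout are all hypothetical. Timestamps are passed in explicitly so that the behavior does not depend on the wall clock.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the State Of Server module. */
class ServerStateSketch {
    private static final class State {
        int activeRequests;
        long lastPingResponseMillis;
    }

    private final Map<String, State> states = new HashMap<>();
    private final long pingTimeoutMillis;

    ServerStateSketch(long pingTimeoutMillis) {
        this.pingTimeoutMillis = pingTimeoutMillis;
    }

    /** Called when Server Registration accepts a new backend server. */
    void serverRegistered(String server, long nowMillis) {
        State state = new State();
        state.lastPingResponseMillis = nowMillis;
        states.put(server, state);
    }

    /** A client request scheduled on a server increases that server's load. */
    void clientRequestScheduled(String server) {
        require(server).activeRequests++;
    }

    /** Feedback reporting a completed session decreases the load again. */
    void sessionCompleted(String server) {
        State state = require(server);
        if (state.activeRequests > 0) {
            state.activeRequests--;
        }
    }

    /** Records the answer to a periodic ping. */
    void pingResponse(String server, long nowMillis) {
        require(server).lastPingResponseMillis = nowMillis;
    }

    /** A server is treated as live while its last ping response is recent enough. */
    boolean isLive(String server, long nowMillis) {
        State state = states.get(server);
        return state != null && nowMillis - state.lastPingResponseMillis <= pingTimeoutMillis;
    }

    int activeRequests(String server) {
        return require(server).activeRequests;
    }

    private State require(String server) {
        State state = states.get(server);
        if (state == null) {
            throw new IllegalArgumentException("unregistered server: " + server);
        }
        return state;
    }
}
```

A fuller version would reconcile client and server feedback for the same session so that one completed transaction is counted only once; the sketch simply assumes a single completion report per session.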

5.3 Registration of a Backend Server

This is a crucial phase in the GenoMap security framework. The client applets give very important and sensitive data to servers for processing, so client applets need to be sure they are communicating with a genuine server. Spurious or malicious servers must be prevented from connecting with the socket server. This is done using public key cryptography. There is a predefined Public Key and a Startup Message for each service being offered. During the startup phase, the server forms a message containing the following information.

• Its Location
• Parameters such as the number of simultaneous client requests it can handle
• StartupMessage

The server then encrypts the message using its PrivateKey and uses the RegisterMe API to register with the socket server. The socket server decrypts the message using the Public Key and, if it is able to recover the message, it can be sure that a genuine server is requesting registration. Since the Private Key is known only to genuine servers, no spurious server can encrypt a message and expect the socket server to recover it.
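The registration check can be sketched with the standard Java security classes. Note the substitution: instead of encrypting the whole message with the private key, the sketch signs it with `java.security.Signature` (SHA256withRSA), which is the usual way to prove possession of a private key and is not necessarily what GenoMap does. The class `RegistrationSketch`, the "|"-separated message layout, and the freshly generated key pairs are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.security.SignatureException;

/** Hypothetical sketch of backend-server registration using an RSA signature. */
class RegistrationSketch {
    static final String ALGORITHM = "SHA256withRSA";

    /** Stands in for the predefined key pair of one service. */
    static KeyPair newServiceKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Location, server parameters, and StartupMessage in one assumed text layout. */
    static String registrationMessage(String host, int port, int maxRequests,
                                      String startupMessage) {
        return host + ":" + port + "|" + maxRequests + "|" + startupMessage;
    }

    /** Server side: sign the registration message with the service's private key. */
    static byte[] sign(String message, PrivateKey privateKey) {
        try {
            Signature signer = Signature.getInstance(ALGORITHM);
            signer.initSign(privateKey);
            signer.update(message.getBytes(StandardCharsets.UTF_8));
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Socket server side: accept only a correctly signed message with the right StartupMessage. */
    static boolean verify(String message, byte[] signature, PublicKey publicKey,
                          String expectedStartupMessage) {
        if (!message.endsWith("|" + expectedStartupMessage)) {
            return false;
        }
        try {
            Signature verifier = Signature.getInstance(ALGORITHM);
            verifier.initVerify(publicKey);
            verifier.update(message.getBytes(StandardCharsets.UTF_8));
            return verifier.verify(signature);
        } catch (SignatureException e) {
            return false; // malformed signature bytes are treated as a failed check
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair service = newServiceKeyPair();
        String message = registrationMessage("host1.example.org", 4001, 8, "MLINK-STARTUP");
        byte[] signature = sign(message, service.getPrivate());
        System.out.println("genuine server accepted: "
                + verify(message, signature, service.getPublic(), "MLINK-STARTUP"));
    }
}
```

As in the scheme described above, a server that does not hold the private key cannot produce a message that the socket server accepts, and a message altered in transit no longer matches its signature.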

5.4 Virtual Domain

Figure 11: Illustration of Virtual Domains

A virtual domain (see Figure 11) is formed by combining two or more existing domains to offer a new service which utilizes the resources of the participating domains. A new User Group is formed that has rights to access the services. The backend servers/clients can access the resources of the other domain only after proper authentication from the socket server.

6 Conclusion and Future Directions

The GenoMap system is presently being used in a production environment in its first version. The second version involves incorporating the socket server and authentication server as described previously. In this section we briefly describe two extensions to be completed in the future to address the need for higher volume and performance.

6.1 Multiple socket servers

When the number of domains increases, the performance of a single socket server will degrade as it handles all the domains. Our approach is to have multiple socket servers arranged in a hierarchical fashion (see Figure 12). The domains will be distributed among the socket servers, and each socket server will handle a set of domains. A Master socket server will act as a router of requests to the appropriate socket server. The APIs will remain intact, and this change will not affect the client applets, servers, or system administrator. Multiple socket servers also help provide fault tolerance by replicating the Master socket server functionality in one of the leaf socket servers. Thus, if the Master socket server fails, the predefined leaf socket server acts as a backup.


Figure 12: Hierarchical Socket Servers
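The routing role of the Master socket server and its backup can be sketched as follows. This is only an illustration of the idea, assuming the routing table is a simple map from domain name to leaf socket server and that the predefined backup leaf holds a replica of that map; the class `SocketServerHierarchySketch` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of a Master socket server with a replicated routing table. */
class SocketServerHierarchySketch {
    private final Map<String, String> masterTable = new HashMap<>();
    private final Map<String, String> backupTable = new HashMap<>(); // replica on the backup leaf
    private final String backupLeaf;
    private boolean masterAlive = true;

    SocketServerHierarchySketch(String backupLeaf) {
        this.backupLeaf = backupLeaf;
    }

    /** Assigns a domain to a leaf socket server and replicates the entry. */
    void assignDomain(String domain, String leaf) {
        masterTable.put(domain, leaf);
        backupTable.put(domain, leaf);
    }

    /** Simulates a failure of the Master socket server. */
    void masterFailed() {
        masterAlive = false;
    }

    /** Name of the socket server that currently routes requests. */
    String currentRouter() {
        return masterAlive ? "master" : backupLeaf;
    }

    /** Returns the leaf socket server responsible for the given domain. */
    String route(String domain) {
        Map<String, String> table = masterAlive ? masterTable : backupTable;
        String leaf = table.get(domain);
        if (leaf == null) {
            throw new IllegalArgumentException("unregistered domain: " + domain);
        }
        return leaf;
    }

    public static void main(String[] args) {
        SocketServerHierarchySketch hierarchy = new SocketServerHierarchySketch("S1");
        hierarchy.assignDomain("iowa", "S1");
        hierarchy.assignDomain("tufts", "S2");
        System.out.println(hierarchy.currentRouter() + " routes tufts to " + hierarchy.route("tufts"));
        hierarchy.masterFailed();
        System.out.println(hierarchy.currentRouter() + " routes tufts to " + hierarchy.route("tufts"));
    }
}
```

Since clients only ever ask for a route by domain, the switch from the Master to the backup leaf is invisible to them, which is consistent with the statement that the APIs remain intact.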

6.2 Automated Service provider Startup

In the current design, domain administrators are responsible for starting servers. This means that if the administrator does not respond quickly to a GeneralAlert message, the client requests will become queued up and efficiency suffers. To avoid this manual dependence, during initialization of the domains the administrator runs startup daemons, which can instantiate Application Servers on potential hosts. These startup daemons register themselves with the socket server and are responsible for starting the servers. The socket server sends alert messages to these startup daemons, which then start the required server.
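The alert path from the socket server to a startup daemon can be sketched as follows. This is a minimal illustration under stated assumptions: the names `ServerLauncher`, `StartupDaemon`, and `AlertingSocketServer`, the fixed queue limit, and the round-robin choice of daemon are all hypothetical, and the actual launching of a server process is abstracted behind an interface.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hook that actually starts a backend server for a service on a host. */
interface ServerLauncher {
    void start(String service, String host);
}

/** Daemon running on a potential host; it starts servers when alerted. */
class StartupDaemon {
    private final String host;
    private final ServerLauncher launcher;

    StartupDaemon(String host, ServerLauncher launcher) {
        this.host = host;
        this.launcher = launcher;
    }

    void onAlert(String service) {
        launcher.start(service, host);
    }
}

/** Socket-server side: alerts a registered daemon when a service is overloaded. */
class AlertingSocketServer {
    private final List<StartupDaemon> daemons = new ArrayList<>();
    private final int queueLimit;
    private int nextDaemon;

    AlertingSocketServer(int queueLimit) {
        this.queueLimit = queueLimit;
    }

    void registerDaemon(StartupDaemon daemon) {
        daemons.add(daemon);
    }

    /** Returns true when an alert was sent because the queue exceeded the limit. */
    boolean reportQueueLength(String service, int queuedRequests) {
        if (queuedRequests <= queueLimit || daemons.isEmpty()) {
            return false;
        }
        StartupDaemon daemon = daemons.get(nextDaemon);
        nextDaemon = (nextDaemon + 1) % daemons.size(); // spread new servers over hosts
        daemon.onAlert(service);
        return true;
    }

    public static void main(String[] args) {
        AlertingSocketServer socketServer = new AlertingSocketServer(5);
        socketServer.registerDaemon(new StartupDaemon("hostA",
                (service, host) -> System.out.println("starting " + service + " on " + host)));
        socketServer.reportQueueLength("Mlink", 9);
    }
}
```

In a deployment, the launcher would start the server binary on the daemon's host, and the new server would then register itself with the socket server in the usual way.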

6.3 Conclusion

In this paper we have described GenoMap, a large-scale, distributed, heterogeneous, client-server application to support gene identification. This functional genomics application manages pedigree information, phenotypes, genetic markers, and genotypes. The personal nature of this information requires attention to confidentiality, which presents challenges in a network environment. In addition, the interdisciplinary nature of the users presents the need to support a diverse collection of cooperating individuals in the gene identification process. The conflicting requirements of security and heterogeneity are at the heart of the approach taken in GenoMap. Our approach provides for security by verifying identities of individuals in the granting of access to sensitive data through the network. The database itself is physically partitionable across lab boundaries, and access to the database is encapsulated within well-defined APIs. GenoMap has been implemented primarily in Java with a socket-oriented, client-server design employing recent applet security features.

GenoMap is currently being used in the collection of data for, and analysis of, a number of relatively small gene identification studies. It is also being used in one large genome-wide screening for the locus of the gene(s) involved in autism. The former shows its usefulness in supporting users who want to employ the analysis features of GenoMap without needing the large-scale data collection and management facilities, while the latter shows its usefulness in managing gene identification studies that would have been unmanageable without such a system.

References

[Ald92] Aldus Corporation, "TIFF Revision 6.0," June 3, 1992. Available via the WWW at http://sgi.com/graphics/tiff/TIFF.ps (September 1997).

[Ber95] M. Berks, "The C. elegans genome sequencing project," Genome Research, Volume 5, 1995, pp. 99-104.

[BlR97] J. A. Blake, J. E. Richardson, M. T. Davisson, J. T. Eppig and the Mouse Genome Informatics Group, "The Mouse Genome Database (MGD). A comprehensive public resource of genetic, phenotypic and genomic data," Nucleic Acids Res, Volume 25, Number 1, 1997, pp. 85-91.

[CaM97] T. L. Casavant, K. J. Munn, T. A. Braun, T. E. Scheetz, V. Sheffield, E. M. Stone, "GenoMap: A portable, Network-based Gene-mapping System," 1997 Human Genome and Sequencing Meeting, Abstract and Computer Demonstration, Cold Spring Harbor, NY, May 1997, p. 35.

[CoH97] G. Cornell and C. S. Horstmann, Core Java, Prentice Hall, Upper Saddle River, New Jersey, 1997.

[CoI93] R. W. Cottingham and R. M. Idury, "Faster Sequential Genetic Linkage Computations," American Journal of Human Genetics, 53:252-263, 1993.

[DOE95] Department of Energy, "Five Years of Progress in the Human Genome Project," Human Genome News, Volume 7, Numbers 3-4, September-December 1995. Available via the WWW from www.ornl.gov in TechResources/Human Genome/publicat/hgn/v7n3/04progre.html (September, 1997).

[Fla97] D. Flanagan, Java in a Nutshell, Second Edition, O'Reilly & Associates Inc., Sebastopol, CA, 1997.

[GuW83] J. F. Gusella and N. S. Wexler, "A polymorphic DNA marker genetically linked to Huntington's disease," Nature, Volume 306, November 17, 1983.

[HoJ88] R. W. Hockney and C. R. Jesshope, Parallel Computers 2: Architecture, Programming, and Algorithms, IOP Publishing, 1988.

[KeR98] B. Kernighan and D. Ritchie, "The C Programming Language, 2nd Edition," ISBN 0131103628, Prentice Hall, 1989.

[LaW96] J. Lalouel and R. White, "Analysis of Genetic Linkage," Emery & Rimoin's Principles and Practice of Medical Genetics, pp. 111-125, 1996.

[MyM97] R. L. Mynatt, R. J. Miltenberger, M. L. Klebig, L. L. Keifer, J-H Kim, M. B. Zemel, J. E. Wilkinson, W. O. Wilkison, and R. P. Woychik, "Analysis of the function of the agouti gene in obesity and diabetes," Proceedings: International Business Communications 2nd Annual International Symposium: Obesity, Advances in Understanding and Treatment, in press.

[Ott91] J. Ott, Analysis of Human Genetic Linkage, Johns Hopkins University Press, Baltimore, 1991, pp. 108-141.