CHROMINFO: A Database for Viewing and Editing Top-Level Chromosome Data
Prakash M. Nadkarni, Stephen T. Reeders¹, Mark A. Shifman and Perry L. Miller
Center for Medical Informatics, Dept. of Anesthesiology and ¹Department of Genetics, Yale University School of Medicine, New Haven, CT 06510
ABSTRACT
BACKGROUND
Each chromosome has two arms, a short (p) arm and a long (q) arm, which are joined at a point called the centromere. The end of each arm is called the terminus or telomere. With Giemsa or quinacrine staining, these arms are seen under light microscopy to have several alternating light and dark bands, which are given numbers using a standardized decimal nomenclature. The band location, or cytogenetic location, of a region of interest, termed a locus, forms a localizing coordinate for that locus. The process of ascertaining the position of a locus with respect to the bands (and, more important, the position with respect to other loci and the distance from these loci) is termed mapping. Cytogenetic location alone is so broad that it is currently of limited utility for extensively studied chromosomal regions; for example, in the band 13.3 of the p arm of chromosome 16, there are already 40 named loci. Hence it is no longer sufficient merely to map a locus to a band. More detail is required about the order of loci within bands.
All loci consist of DNA, and a particular locus may exist as several variations of the same molecule; each form of the locus is called an allele. It is sometimes possible to synthesize DNA that binds to a variant of a particular locus specifically and with high affinity. If labelled with a radioactive or fluorescent material, this synthetic DNA is called a probe.
The most informative measure of distance between one locus and another is the physical distance, i.e., the number of nucleotide base pairs separating the two. Some studies do not measure physical distance, but estimate distance and order by the statistical analysis of inheritance patterns of loci in extended families (pedigrees). This technique, termed linkage mapping, measures genetic distance between loci, expressed in units called centiMorgans, which refer to the probability of two loci being separated during meiotic
CHROMINFO is a prototype database that is intended to serve as a liaison tool for researchers working in different centers on mapping of the same mammalian chromosome. It provides a bird's-eye view of top-level entities on a chromosome (such as gene loci, chromosome breakpoints and contigs) and relates them to one another in one dimension, the axis of the chromosome. Consensus data can be entered, edited, queried and displayed in a variety of ways. Summary evidence for consensus data can also be stored and retrieved. Information may be downloaded from the Genome Data Base periodically, and order and distance information is then incorporated. The prototype of CHROMINFO was built for human chromosome 16. Versions have been created for several other chromosomes.
INTRODUCTION
The national Human Genome Project (HGP) has as its ultimate goal the determination of the complete DNA constitution of a human being, the human genome. Informatics activities will comprise a critical component of the HGP in order to support the storage, analysis and display of the large amounts of data that are being generated. Currently, no single informatics approach has proven itself to the extent of becoming a de facto standard, and the HGP currently encourages diverse approaches to data management and representation. CHROMINFO represents one such approach. It is designed to complement public databases such as the Genome Data Base (GDB) [1] by letting researchers (e.g., the members of laboratories that contribute to consensus chromosome maps) enter and retrieve order and distance information. CHROMINFO is currently available for the following chromosomes: 5, 6, 10, 11, 12, 16, 17, 21, 22, X. The chromosome 16 data is being maintained by one of the authors (S. T. Reeders), who is one of the GDB editors for chromosome 16.
0195-4210/92/$5.00 © 1993 AMIA, Inc.
crossover (and thus being separately inherited in the offspring). Physical distance can be summed whereas genetic distance cannot, because the relationship between genetic distance and physical distance varies depending on factors such as chromosomal location and sex of the individual. Crude measures of distance can also be obtained by techniques such as fluorescence in situ hybridization (FISH), where probes tagged with differently colored fluorescent dyes are used to directly visualize (under light microscopy) loci on the chromosomes of either resting or actively dividing cells. Linkage mapping and FISH mapping cannot generally resolve loci much less than 1 million bases apart. Below this level, there are a variety of mapping techniques, most of which rely on the digestion of DNA with one or more restriction endonucleases, enzymes that recognize and cleave particular short stretches (e.g., 4-8 bases) of DNA.
As already mentioned, CHROMINFO is intended to serve as a liaison tool for researchers working in different centers on the same chromosome. It deals primarily with order and distance, as well as basic locus data (names, descriptions, cytogenetic locations) and textual references to data supporting particular position/order/distance data items.
The word "map" has multiple meanings in genetics, depending on the level of resolution one is referring to. At the level of the chromosome, a map is an ordered (or partially ordered) list of loci with intervening distances which may or may not be currently determined. This is the level at which CHROMINFO currently operates; maps are an important part of its output.
There are relatively few programs available for top-level map display and editing. Acedb [2] is a publicly available database engine originally created for the C. elegans project. SIGMA [3] is a map editing and display tool under development at Los Alamos National Labs. Both of these programs run under UNIX.
CHROMINFO appears to be the first microcomputer-based map management program, and the first to explicitly handle uncertainty in data.
TERMS AND CONCEPTS USED IN CHROMINFO
In order to describe CHROMINFO's capabilities fully, the paper first defines some terms that embody concepts that are central to CHROMINFO. Some terms have been coined by us; by coincidence, two terms ('Landmarks' and 'Groups') have also been used by Green and Green [4] to embody identical concepts.
Entities And Landmarks
A named region of interest in a chromosome is an Entity. Entities may be Atomic or Compound. Atomic Entities are those which are treated as single points on a ruler representing the chromosome. While loci have a finite size, measured in base pairs, we ignore this aspect for the sake of recording relative position and distance, in much the same way as the distance between two cities on a low-resolution geographical map ignores the size of the cities themselves. Atomic Entities thus have zero dimension. Examples of Atomic Entities are functional genes, radiation breakpoints, fragile sites and DNA segments. An Atomic Entity is often a locus, but it need not be: a meiotic crossover point that has been mapped in a publicly available family can also be entered as an Atomic Entity.
Compound Entities span a particular length of the chromosome, and consist of several Atomic Entities grouped together for a particular reason. Examples of Compound Entities are areas containing overlapping DNA fragments ("contigs"), anomalies spanning a considerable length of the chromosome (inversions, deletions and translocations), or any group of loci that have been researched together in order to obtain relative position and distance information.
Each Entity is assigned a particular Entity Type so that the user can rapidly search and retrieve Entities of a particular Type. Entity Types currently include: functional genes, single-copy sequences, chromosome breakpoints and fragile sites. The user can add types of his or her own.
The positions of some Entities have been ascertained more reliably and reproducibly than others, and these serve as reference points in research. CHROMINFO calls such Entities 'Landmarks'. The ordered chain of Landmarks constitutes a roadmap of the chromosome. One goal of the Gene Mapping Initiative will have been achieved when every Entity worth knowing about reaches Landmark status.
The physical distance between two Landmarks, if known, can be entered explicitly as two numbers: the minimum and maximum distance bounds (in kilobases). If the distance is known precisely, the two will be the same. These bounds allow for experimental error in determination of distance. When asking questions
about distance information between two nonadjacent Landmarks, these numbers will be summed to give minimum and maximum distances for the interval of interest. If physical distance is not known but genetic distance is, the maximum and minimum genetic distances can be entered in centiMorgans.
For some Entities, it is only known that they lie in a certain cytogenetic region. For others, we know a little more, for example, that while not Landmarks themselves, they lie between two Landmarks which are not (necessarily) adjacent. Such Entities are termed Bounded Entities; these are candidates for further research and possible future promotion to Landmark status.
Chromosomal Anomalies
An inversion is a situation in which an internal piece of a chromosome has been broken off, and then reinserted in the same place reversed in orientation. In a deletion, the broken-off piece is lost. If a chromosome breaks and a piece becomes joined to another chromosome, a translocation has occurred. In CHROMINFO, the broken-off piece is represented as a kind of Bounded Entity, except that the bounding Landmarks represent the span of the piece concerned. A user may then ask what other Landmarks or Entities are encompassed within the anomaly.
Groups
Certain Entities are partially mapped with respect to each other. For example: A, B and C lie between chromosome breakpoints 1 and 2, but the order of A, B and C with respect to each other is not known. A, B and C can be considered as a Group. The Group is a Compound Entity, and really represents a partial map. Sometime later the order of A, B and C may be determined, but the orientation of the Group with respect to the chromosome (i.e., which end of the Group is closest to the p terminus) may not be known. The Group is then ordered but not orientated. Subsequently, the Group may be orientated. At this point it may be possible to dissolve the Group by promoting all its members to Landmark status.
Members of Groups are called Sub-Entities. Sub-Entities may also be Groups, and so on. In this way a hierarchy is created. To handle the situation of partial ordering, one can create one or more Groups which are unordered, and place them within a larger Group which can be ordered. In computer science parlance, a Group is represented internally in CHROMINFO as a tree of Entities (some of which may themselves be Groups) nested to an arbitrary level of depth.
In essence, Groups are "quasi-entropy" in the database: first approximations which enable researchers to get partial, if not fully satisfactory, position information until higher-resolution experiments can be performed to localize individual members of the Group more precisely. The structure of a Group will change over time, as its members are localized, or as subgroups are fully ordered and moved to a higher level. Sometimes a Group may have to be broken into two Groups, as one of its members achieves Landmark status (a Group cannot have a Landmark as one of its members). CHROMINFO lets the user edit Group structure using the mouse.
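The tree structure of Groups can be illustrated with a minimal Python sketch. It is not CHROMINFO's implementation (which is built on 4th Dimension); the class names `Atomic` and `Group`, the flags `ordered` and `oriented`, and the helper methods are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Atomic:
    """A point-like Entity, e.g. a locus or a breakpoint (hypothetical class)."""
    name: str


@dataclass
class Group:
    """A Compound Entity whose Sub-Entities are Atomic Entities or Groups."""
    name: str
    members: List[Union[Atomic, "Group"]] = field(default_factory=list)
    ordered: bool = False   # is the relative order of the members known?
    oriented: bool = False  # is it known which end lies nearest the p terminus?

    def atoms(self) -> List[str]:
        """Flatten the tree into the names of all Atomic members."""
        names: List[str] = []
        for m in self.members:
            names.extend(m.atoms() if isinstance(m, Group) else [m.name])
        return names

    def depth(self) -> int:
        """Nesting depth: 1 for a Group holding only Atomic Entities."""
        nested = (m.depth() for m in self.members if isinstance(m, Group))
        return 1 + max(nested, default=0)

    def can_dissolve(self) -> bool:
        """True once this Group and every nested Group is ordered and oriented."""
        return self.ordered and self.oriented and all(
            m.can_dissolve() for m in self.members if isinstance(m, Group)
        )


# Partial ordering: A and B are unordered with respect to each other,
# but the pair as a whole is known to be ordered relative to C.
inner = Group("A/B", [Atomic("A"), Atomic("B")])
outer = Group("(A/B)-C", [inner, Atomic("C")], ordered=True)
print(outer.atoms(), outer.depth(), outer.can_dissolve())
```

Nesting an unordered Group inside an ordered one is how this sketch expresses the partial ordering described above; `can_dissolve` stays false until every level is both ordered and oriented.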
Allelic Variants
Although most of the differences between different copies of chromosome 16 in the entire human population are not significant on the scale of the whole chromosome, it is now clear that there is large-scale allelic variation in some regions. For example, at the p-terminus of chromosome 16, the distance between the locus NFG400 and the telomere is either 40, 220 or 300 kb. To handle this situation, CHROMINFO lets the user create an 'allelic variant' Group, which is a Group composed of the alleles. The members of the Group are mutually exclusive: only one member can appear in any one linear map. Obviously it is meaningless to talk of order or orientation for an allelic Group.
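As a rough illustration of mutual exclusivity, the sketch below models an allelic variant Group in Python and produces one linear map per allele. The distances are those quoted above; the class names and the allele labels ("variant 1" and so on) are hypothetical and do not come from CHROMINFO.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Allele:
    label: str        # hypothetical label for one variant
    distance_kb: int  # locus-to-telomere distance for this variant


@dataclass(frozen=True)
class AllelicVariantGroup:
    locus: str
    alleles: Tuple[Allele, ...]

    def linear_maps(self) -> List[str]:
        """One linear map per allele; two alleles never share a map."""
        return [
            f"p-telomere <-{a.distance_kb} kb-> {self.locus} [{a.label}]"
            for a in self.alleles
        ]


group = AllelicVariantGroup(
    locus="NFG400",
    alleles=(
        Allele("variant 1", 40),
        Allele("variant 2", 220),
        Allele("variant 3", 300),
    ),
)
for line in group.linear_maps():
    print(line)
```

The Group deliberately has no order or orientation attribute, since neither is meaningful for alternatives that cannot coexist in one map.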
Conflicting Data Sets
A common problem in map building is conflict, where two or more incompatible orders of sets of loci are inferred from different sets of data. When first discovered, conflict may be due to error in one of the data sets, or it may be due to allelic variation. Until we know which is the case, it must be flagged as suspect and requiring further research. CHROMINFO lets the user create a special kind of Group called a 'conflict set'. The Sub-Entities in a 'conflict set' are also mutually exclusive; each Sub-Entity consists of one of the possible orders. Thus, if we had two possible orders for the same four loci, A-B-C-D and B-A-D-C, we would create a Group containing A, B, C and D and flag it as a conflict set, recording the two possible orders as Sub-Entities. Each Sub-Entity would have references to the supporting evidence for that order. If conflict is later found to be due to error, the true order of the Group can be entered and all
Sub-Entities purged. If allelic variation is found to be the cause of conflict, the alternatives are retained, but the Group itself now becomes an allelic variant. In either case, the Group ceases to be a conflict set.
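The life cycle of a conflict set can be sketched in Python as follows. This is an illustration under assumed names (`CandidateOrder`, `ConflictGroup`, the `kind` strings and the placeholder reference text are all hypothetical), not a description of CHROMINFO's internal tables.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CandidateOrder:
    """One possible order of the loci, with references to its evidence."""
    order: List[str]
    references: List[str] = field(default_factory=list)


@dataclass
class ConflictGroup:
    members: List[str]
    kind: str = "conflict set"  # later "ordered" or "allelic variant"
    alternatives: List[CandidateOrder] = field(default_factory=list)
    order: Optional[List[str]] = None

    def _check(self, order: List[str]) -> None:
        # Every candidate must be a permutation of the Group's members.
        if sorted(order) != sorted(self.members):
            raise ValueError("order must contain exactly the Group's members")

    def add_alternative(self, candidate: CandidateOrder) -> None:
        self._check(candidate.order)
        self.alternatives.append(candidate)

    def resolve_as_error(self, true_order: List[str]) -> None:
        """Conflict was due to error: record the true order, purge alternatives."""
        self._check(true_order)
        self.order = list(true_order)
        self.alternatives = []
        self.kind = "ordered"

    def resolve_as_allelic(self) -> None:
        """Conflict was due to allelic variation: keep the alternatives."""
        self.kind = "allelic variant"


group = ConflictGroup(members=["A", "B", "C", "D"])
group.add_alternative(CandidateOrder(["A", "B", "C", "D"], ["placeholder reference 1"]))
group.add_alternative(CandidateOrder(["B", "A", "D", "C"], ["placeholder reference 2"]))
group.resolve_as_error(["A", "B", "C", "D"])
print(group.kind, group.order, len(group.alternatives))
```

Either resolution path changes `kind`, so the Group stops being a conflict set in both cases, as described above.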
To create a CHROMINFO data file for a new chromosome, we download probe and locus data from GDB text files into CHROMINFO. We then add order and distance information from published sources such as HGM11, and homologous mouse data from the Jackson Laboratory Encyclopedia of the Mouse Genome [5]. Probes and homologous mouse data can be displayed and searched on multiple criteria.
Distance Information
Distance information between loci is represented as maximum and minimum bounds on distance. Other measures of distance (e.g., linkage) can also be stored and retrieved through boolean query. CHROMINFO will also compute the relative position and physical distance (if possible) between any two loci on the chromosome and display all intervening Landmarks.
Display of Landmarks
CHROMINFO displays the master roadmap of Landmarks, with distances between Landmarks wherever they are known. Related functions are display of all Bounded Entities between any two Landmarks, and of all Landmarks near an Entity (and, in the case of a Group, a Bounded Entity or a Chromosomal Anomaly, all Landmarks encompassed within that Entity's span). In addition, CHROMINFO allows promotion of Entities to Landmarks and demotion of Landmarks.
Group Information
Groups can be created and retrieved in several ways, as well as restructured using the mouse. One can display all members of a Group, as well as all Entities which are members of Groups.
Graphic Output
CHROMINFO generates a composite map of the chromosome (or a user-selected portion of it) as a graphic similar to that currently published for chromosome 16 in the annual proceedings of the Human Gene Mapping Workshops. It also generates a map of selected Entities superimposed on the cytogenetic map (ideogram) of the chromosome.
DESIGN CONSIDERATIONS
When developing CHROMINFO, we decided that it would run on a platform that non-computer professionals would feel comfortable with, and where rapid learning would be facilitated through a graphical user interface.
Since we were venturing into unknown territory with regard to data representation, manipulation and display, we realized that design would require multiple iterations of refinement, and that in each cycle a fully functional prototype (rather than a mock-up of screens) would be needed. We therefore decided to use the Apple Macintosh as
Consequences of Acquisition, or Loss, of Landmark Status
When evidence for its position reaches critical mass, an Entity can be promoted to become a Landmark. A newly promoted Landmark will create opportunities for placing Bounded Entities more accurately (i.e., narrowing their bounds through experiment). For example, if there are Bounded Entities between Landmarks A and B, and a new Landmark C appears between A and B, there is an opportunity to refine the position of the Bounded Entities, placing some between A and C, and others between C and B.
If the evidence that led to promotion of an Entity was bad, it may lead to a propagation of error in the database. In such a case, one would wish to strip Landmark status from the Entity that caused the problem by demoting the Landmark.
CHROMINFO does not enforce any arbitrary criteria for Landmarks. It does, however, cascade the changes made when an Entity is promoted or demoted, letting the user create a report of all Entities that are affected as a result of the change. In the case of promotion, it tells the user which Entities can be refined further. In the case of demotion, it tells the user which Entities are now out of date and require immediate intervention. Since Landmarks have elite status, Landmark editing should be done only after careful thought.
CAPABILITIES OF CHROMINFO
This section describes several of CHROMINFO's capabilities.
Addition, Change and Retrieval of Locus Data
Data may be entered from the keyboard or imported from a public database such as the Genome Data Base and then edited to add value. Some fields in the data are shared; others are of interest only to the user and not necessarily to others. In this way, a user can add value to a personal copy of the database. Loci can be classified by user-defined criteria and can be retrieved through boolean query in a variety of ways. Supporting references can also be added and retrieved.
A reference is anything that points to the underlying data supporting a conclusion (e.g., published material, lab notes, personal communications) and can be classified by the nature of the data (e.g., linkage studies, FISH).
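Two of the Landmark operations described in this paper are summing distance bounds across nonadjacent Landmarks and reporting the Entities affected by a promotion or demotion. Both can be sketched in Python over a simple ordered list. The `Roadmap` class, its field names and the example distances are hypothetical, and dropping the old interval on promotion is a simplification made for this sketch rather than documented CHROMINFO behavior.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Bounds = Tuple[float, float]  # (minimum kb, maximum kb)


@dataclass
class Roadmap:
    landmarks: List[str]                 # ordered from p terminus to q terminus
    gaps: Dict[Tuple[str, str], Bounds]  # bounds between adjacent Landmarks, if known
    bounded: Dict[str, Tuple[str, str]]  # Bounded Entity -> (left, right) Landmarks

    def distance_bounds(self, a: str, b: str) -> Optional[Bounds]:
        """Sum min and max bounds over every interval between a and b."""
        i, j = sorted((self.landmarks.index(a), self.landmarks.index(b)))
        lo = hi = 0.0
        for pair in zip(self.landmarks[i:j], self.landmarks[i + 1:j + 1]):
            gap = self.gaps.get(pair)
            if gap is None:
                return None  # some intervening distance is not yet known
            lo += gap[0]
            hi += gap[1]
        return lo, hi

    def promote(self, name: str, left: str, right: str) -> List[str]:
        """Insert a new Landmark between adjacent Landmarks left and right.

        Returns the Bounded Entities whose bounds now straddle the new
        Landmark, i.e. those whose position could be refined further.
        """
        i = self.landmarks.index(left)
        if self.landmarks[i + 1] != right:
            raise ValueError("left and right must be adjacent Landmarks")
        self.landmarks.insert(i + 1, name)
        self.bounded.pop(name, None)        # it is no longer a Bounded Entity
        self.gaps.pop((left, right), None)  # simplification: old interval dropped
        pos = {lm: k for k, lm in enumerate(self.landmarks)}
        return sorted(e for e, (l, r) in self.bounded.items()
                      if pos[l] < pos[name] < pos[r])

    def demote(self, name: str) -> List[str]:
        """Remove a Landmark; return Bounded Entities that relied on it."""
        stale = sorted(e for e, lr in self.bounded.items() if name in lr)
        self.landmarks.remove(name)
        self.gaps = {k: v for k, v in self.gaps.items() if name not in k}
        return stale


roadmap = Roadmap(
    landmarks=["A", "B", "D"],
    gaps={("A", "B"): (100, 150), ("B", "D"): (200, 200)},
    bounded={"x": ("A", "B"), "y": ("B", "D"), "C": ("A", "B")},
)
print(roadmap.distance_bounds("A", "D"))
print(roadmap.promote("C", "A", "B"))
print(roadmap.demote("B"))
```

A fuller model would keep the original A-B bounds as a constraint on the two new intervals instead of discarding them.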
the platform, and used an existing commercial relational database engine, 4th Dimension (4D), by Acius. The major benefits of 4D include a robust programming language, a powerful graphically oriented screen designer that can be used to rapidly design extremely sophisticated screens and reports, and a database design tool that displays (and lets the user manipulate) an entity-relationship diagram, where relationships between relational data tables are specified through mouse-click-and-drag interaction. While 4D can run in multi-user mode on a Macintosh local area network, it can also act as a client to a remote server running Sybase on UNIX or VMS (in the former case, by using the standard TCP/IP communications protocol). In this fashion, an intuitive graphical interface can be used to access a database engine lacking such abilities.
Currently, we have not designed CHROMINFO for simultaneous multi-user access. The intention has been to duplicate existing practice, which is to let a designated individual (e.g., the chairperson for a Human Gene Mapping Workshop chromosome committee) be responsible for editorial decisions regarding Landmark status. Further, we have felt that, in the initial phase at least, letting individuals have their own copy of the data, to which they can add value through entry in designated tables and fields, was more important than having a single shared database to which personal value could not be added. We have a mechanism by which information from key individual users can be sent to the chairperson for electronic incorporation, as well as a mechanism by which upgrades to the database can be created so that the incremental information can be incorporated into an individual's copy of the database without erasing personal data. As in GDB, all entries are stamped with the date of creation and date of last change.
FUTURE ENHANCEMENTS
Future extensions to CHROMINFO are planned in the following directions.
1. Database support for the lower level data on which the consensus information currently presented in CHROMINFO is based will be provided.
2. The graphic output capabilities of CHROMINFO will be enhanced. Currently, true graphics are only needed for displaying the top-level consensus map of the chromosome. As CHROMINFO starts dealing with highly visual, lower-level data such as overlapping cosmid contigs, more graphics will be incorporated.
In summary, with suitable extensions, we believe that CHROMINFO will act as a useful complement to public databases such as GDB and Geninfo.
ACKNOWLEDGMENTS
The costs involved in purchase of the development software used to build CHROMINFO, and the costs of distribution, were partly met by NIH Grant RO1 HG00175 from the National Center for Human Genome Research. We thank the many users who have given us feedback, in particular Ed Hildebrand of the Los Alamos National Labs.
REFERENCES
1. Genome Data Base Manual Version 6.0. Laboratory for Applied Research in Academic Information, William H. Welch Medical Library, Johns Hopkins University, Baltimore, Maryland USA.
2. Thierry-Mieg, J. and Durbin, R. Acedb: A Database for Genomic Information. Proceedings of the 1992 Conference on Genome Mapping and Sequencing, Cold Spring Harbor Laboratory, New York. p. 218.
3. Cinkosky, Michael J., et al. SIGMA: System for Integrated Map Assembly. Proceedings of the 1992 Conference on Genome Mapping and Sequencing, Cold Spring Harbor Laboratory, New York. p. 165.
4. Green, Eric D. and Philip Green. Sequence-tagged Site (STS) Content Mapping of Human Chromosomes: Theoretical Considerations and Early Experiences. PCR Methods and Applications 1991 (1) 77-90.
5. The Jackson Laboratory, Bar Harbor, Maine 04609. Encyclopedia of the Mouse Genome. Version 2.0.