
© 2003 Wiley-Liss, Inc.

Cytometry Part A 54A:56 – 65 (2003)

CytometryML, an XML Format Based on DICOM and FCS for Analytical Cytology Data

Robert C. Leif,* Suzanne B. Leif and Stephanie H. Leif

XML_Med, a Division of Newport Instruments, San Diego, California

Received 10 June 2002; Revision Received 12 January 2003; Accepted 10 February 2003

Background: Flow Cytometry Standard (FCS) was initially created to standardize the software researchers use to analyze, transmit, and store data produced by flow cytometers and sorters. Because of the clinical utility of flow cytometry, it is necessary to have a standard consistent with the requirements of medical regulatory agencies.

Methods: We extended the existing mapping of FCS to the Digital Imaging and Communications in Medicine (DICOM) standard to include list-mode data produced by flow cytometry, laser scanning cytometry, and microscopic image cytometry. FCS list-mode was mapped to the DICOM Waveform Information Object. We created a collection of Extensible Markup Language (XML) schemas to express the DICOM analytical cytologic text-based data types except for large binary objects. We also developed a cytometry markup language, CytometryML, in an open environment subject to continuous peer review.

Results: The feasibility of expressing the data contained in FCS, including list-mode in DICOM, was demonstrated; and a preliminary mapping for list-mode data in the form of XML schemas and documents was completed. DICOM permitted the creation of indices that can be used to rapidly locate in a list-mode file the cells that are members of a subset. DICOM and its coding schemes for other medical standards can be represented by XML schemas, which can be combined with other relevant XML applications, such as Mathematical Markup Language (MathML).

Conclusions: The use of XML format based on DICOM for analytical cytology met most of the previously specified requirements and appears capable of meeting the others; therefore, the present FCS should be retired and replaced by an open, XML-based, standard CytometryML. Cytometry Part A 54A:56–65, 2003. © 2003 Wiley-Liss, Inc.

Key terms: Flow Cytometry Standard (FCS); Digital Imaging and Communications in Medicine (DICOM); Extensible Markup Language (XML); standards; waveform; Cytometry Markup Language (CytometryML); schema

Analytical cytology produces clinically useful information, which should be accessible to physicians, scientists, and other laboratory personnel. A standard for analytical cytologic data has three major functions: data transmission, data storage, and data sharing.

DATA TRANSMISSION

Because the Internet is the major medium for data transmission, a standard should be compatible and use the facilities presently available for the Internet. The use of Internet standards minimizes the content of a data standard and provides reliability and capabilities well beyond anything that can be produced by a small group, such as the International Society for Analytical Cytology (ISAC) Data File Standards Committee. For instance, Extensible Markup Language (XML) standards are being developed for encryption (1) and signatures (2).

DATA STORAGE

Data can be stored in the form of an archive or a database. Flow Cytometry Standard (FCS) (3), has been a useful means for transferring and archiving flow cytometric data. However, when data are archived, the necessary information for its retrieval must be stored separately; otherwise, the contents of each file must be searched sequentially. This sequential search is a very slow, inefficient process. If the data are stored in a database (4), the retrieval of data can be optimized. The use of the data can be facilitated by employing a database that is interfaced with or is part of the clinical laboratory information system. The use of the data can be maximized by storing the data in a standard format that permits their use by other applications.

Presented at the 21st meeting of the International Society for Analytical Cytology. Contract grant sponsor: Newport Instruments. *Correspondence to: Robert C. Leif, XML_Med, a Division of Newport Instruments, 5648 Toyon Road, San Diego, CA 92115. E-mail: [email protected]; www.newportinstruments.com Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cyto.a.10043

CYTOMETRYML

DATA SHARING

One measure of the utility of an open standard is the ease with which commercial and research applications can access the data. The format described by the ISAC FCS (3,5) produces files that are suitable only for archiving. FCS data are not directly available to clinical information systems, other commercial off-the-shelf programs, or to users of analytical cytologic instrumentation who wish to create software that manipulates their data. FCS files must be parsed by special software to be readable by any program or translated into the programs’ format. A solution to this problem is to replace the FCS format with one that is easily accessible. In the best of circumstances, if the cytometric data were to be stored in the same format as the one used by the clinical information system, then the system would be extended to include the functionality of FCS. Specialized software to format, transmit, and store clinical data must be produced in conformance with medical regulatory requirements. According to the U.S. Food and Drug Administration (FDA) (6), “Unless specifically exempted in a classification regulation, any medical device software product developed after June 1, 1997, regardless of its device class, is subject to applicable design control provisions.” The FDA requires (7) a “device hazard analysis” and a “software requirements specification.” Both documents are required to validate and verify software. Because neither a hazard analysis nor a software requirements specification for FCS had ever been published or was known to exist, Leif and Leif attempted to remedy this omission in the FCS design process by creating and publishing (8) both documents. To maximize input to these documents, they were posted on the ISAC Web page. Except as described below for the removal of the recursive structure of FCS 3.0, these two required documents were neither amended nor replaced by the ISAC Data Standards Committee. The absence of an agreed-upon version of these two critical documents has compromised the software design process for FCS. Because the FCS design process has been compromised, continued use of FCS eventually will require that programs that primarily use this standard be labeled, “This software is for research use only and is not intended to be used for diagnostic procedures.” The cost of producing a new, complete cytometric standard that meets FDA specifications likely is beyond the resources of ISAC; in any event, there is no need to do so. A major purpose for a standard is to meet the FDA requirements for a description of the “data structures and data flow diagrams” and “definitions of variables (control and data) and description of where they are used.” We previously reported (8–10) flaws in the FCS design, in particular those that could lead to hazardous conditions or would require increased testing of a software product. We are gratified that our critiques have provided an impetus to the proposed removal (11) of one of the worst of these flaws, which is the recursive structure of FCS 3.0 (8). There presently is no limit on the number of TEXT


sections, each of which (except for the last) points to a HEADER. In turn, each HEADER starts a subsequent data set and points to a TEXT section. This combination of the list-mode data from multiple experiments can produce extremely large files, which require long, uninterrupted periods for transmission. Thus, these files would be unsuited for transmission in many parts of the world. FCS has other problems. The first is the serious defect in the use of delimiters. FCS section 3.2.1 states, “The TEXT segments (primary and supplemental) contain a series of ASCII encoded keyword-value pairs that describe various aspects of the data set. For example, $TOT/5000/ is a keyword-value pair indicating that the total number of events in the file is 5,000. $TOT is the keyword and 5000 is the value. The ‘$’ character flags this keyword as a standard FCS keyword. In this example, the ‘/’ is the delimiter character.” There is no specific limitation on the choice of the delimiter by the user. The delimiter of keyword-value pairs and other special characters should be specified by the standard. The delimiters for the start and end of constructs should be different from each other. If there is a faulty data transmission, the use of the same character as a delimiter for the beginning and end of a value can cause errors by concatenating keyword-value pairs to produce a construct that loses one or more delimiters. XML greatly reduces or possibly eliminates this problem. XML employs the less-than character, ‘<’, to begin the start tag of an element (object), followed by the name of the element; an empty element is closed with ‘/>’. Elements with content also begin with a ‘<’ followed by a name; however, they end with an end tag in the form of </name>. Second, if the data are corrupted by a transmission, memory, or other error, a string or other similar construct can have extra characters added. XML includes the capacity to specify the minimum and maximum lengths of a string in the data.
If there are more or fewer characters than specified, the XML page will not validate against a schema. This serves as a very effective means to warn the user and the system that something wrong has occurred (requirement 2) (8). (All requirements are listed in the Conclusions.) The existence of a maximum length for string types permits their inclusion in the tables of relational databases and minimizes the probability of undetected data corruption. Third, Digital Imaging and Communications in Medicine (DICOM) and XML can specify the range of their numeric objects. This facility, which is absent in FCS, can catch typographic and other errors. Fourth, the B in PnB has two meanings, bit and byte. In Section 3.2.20, B is described this way: “The additional keywords $PnB (bits per parameter) and $PnR (range per parameter) are needed to completely describe an event in the DATA segment.”; Section 3.2.19 states: “$DATATYPE/A/ means that the data are written as ASCII-encoded integer values. In this case, the keyword $PnB specifies the number of bytes allocated per value (one byte per character).” This can result in ambiguity, which increases the complexity of software design, code, and testing.
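The delimiter problem raised in the first point above can be illustrated with a minimal Python sketch. The function parse_text_segment is hypothetical and is not part of FCS or of any FCS library; it assumes, as a simplification, that the first character of the TEXT segment is the delimiter and that values never contain an escaped delimiter.

```python
def parse_text_segment(segment: str) -> dict:
    """Split a simplified FCS TEXT segment into keyword-value pairs.

    Assumption (illustration only): the first character is the delimiter,
    and the same character separates and terminates every keyword and value.
    """
    delimiter = segment[0]
    tokens = segment[1:].split(delimiter)
    if tokens and tokens[-1] == "":
        tokens.pop()  # drop the empty token after the closing delimiter
    if len(tokens) % 2 != 0:
        # An odd token count means a keyword has no value (or vice versa).
        raise ValueError("unbalanced keyword-value pairs")
    return dict(zip(tokens[0::2], tokens[1::2]))


intact = "/$TOT/5000/$PAR/10/"
one_lost = "/$TOT5000/$PAR/10/"   # one delimiter lost in transmission
two_lost = "/$TOT5000/$PAR10/"    # two delimiters lost in transmission

pairs = parse_text_segment(intact)
silently_wrong = parse_text_segment(two_lost)

try:
    parse_text_segment(one_lost)
    one_lost_detected = False
except ValueError:
    one_lost_detected = True
```

Under these assumptions, a single lost delimiter leaves an odd number of tokens and can be detected, whereas two lost delimiters still parse, but into the wrong pair ($TOT5000 mapped to $PAR10). Because the opening and closing delimiters are the same character, nothing in the text itself marks where a keyword ends and a value begins.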


R. LEIF ET AL.

Fifth, in FCS, BYTEORD, byte order for data acquisition, should be specified and not be an option. The description of BYTEORD includes the example of the obsolete PDP11, where the order is 3,4,1,2. The least significant byte is numbered 1. Presently, for 32-bit data, there are 24 possible (four factorial) byte order formats. Support of all of these, including testing (requirement 3), is not warranted. There are two common byte orders, big- and little-endian. For big-endian, the most significant byte is first, with the remaining bytes encoded in decreasing order of significance. Conversely, for little-endian, the least significant byte is first, with the remaining bytes encoded in increasing order of significance. DICOM employs little-endian as the default. Unicode (12) and, consequently, much of the data transmitted by the Internet employ big-endian as the default. Both employ only little- and big-endian. The Intel Pentium processor (13) “can convert between big-endian and little-endian formats,” and the PowerPC (14) can operate in big-endian and little-endian modes. It should be emphasized that the standardization of byte order for transmission is totally irrelevant to the internal disk storage or analysis of the data. Sixth, the use in FCS of short Fortran-style keywords is an anachronism (requirement 10), which decreases readability. What is needed is the standard software engineering practice of separating syllables with underscores or by capitalization of the first letter of a term or abbreviation (camel case). However, case sensitivity can impair programmer productivity; and the convention of starting succeeding syllables with a capital letter has the difficulty that programming editors that reformat software (prettyprinters) in case-insensitive languages can change the case of individual letters. Seventh, FCS provides an inadequate collection of types to describe analytical cytometry data.
The example given, “$GnF/string/ $G2F/520LP/”, does, at least, define a longpass filter. However, if filters are defined as the objects that are put in filter holders, then at least the following terms could be used in the first element in a record or equivalent with the enumerated types: Long_Pass, Short_Pass, Transmission_Band, Blocking_Band, Polarizer, Half_Wave_Plate, Quarter_Wave_Plate, Dichroic_Mirror, Grating, and Prism. There is nothing in FCS to describe a grating or prism. The maximum or other specific wavelength, bandpass, and efficiency could also be included as fields. Part numbers and manufacturers’ names should be included for all relevant items. The meaning of “$GnP Percent of emitted light collected by gating parameter n.” is not clear. Is it the product of the transmissions? Eighth, FCS is a flow cytometry rather than an analytical cytology standard. Laser scanning cytometry (15,16) is a hybrid modality, which employs a microscope to produce data similar to those obtained with a flow cytometer. Flow cytometry, laser scanning cytometry, and digital slide microscopy are sufficiently similar to permit the use of one comprehensive standard. Ninth, FCS is incapable of direct data interchange with other standards. In a plenary lecture at ISAC XXI, Buetow (17) described the problems resulting from: “[D]ifferent

groups that are all spread out over the entire scientific landscape generating little stovepipes of pieces of information that are very difficult to bring together.” This was mirrored in our published requirement 8 that: “[T]he capacity to interact and interface with other clinical data management systems should be maximized.” The standards development equivalent to Occam’s razor is, the fewer, the better. The simplest approach to meeting this requirement is to use and extend one or more existing standards. Tenth, the use of FCS 3.0 evidently required its authors to create an application interface for specific programming languages. The proliferation of the C family of languages into C, C++, C#, and Java, and the existence of many other languages, such as Ada, Cobol, Fortran, and Pascal, and data exchange formats, such as XML and the CORBA Interface Definition Language, make creating such an interface a daunting task. A starting point for a cytometry standard was to use the DICOM standard (18). DICOM is an extensive, rich, and well-designed clinical information system standard, which already includes the functionality required for digital slide microscopy and can easily be extended (8–10) to include flow cytometry and laser scanning cytometry. DICOM has been sponsored for microscopy by the College of American Pathologists. Because the functionality of FCS requires a small subset of DICOM, it is feasible to translate FCS into DICOM. Because DICOM was unacceptable to the ISAC community, its use as a direct replacement for FCS had to be abandoned. Fortunately, a simple compromise exists. The DICOM data types can be expressed as XML schemas (19–21), and all the actual data, except for large binary objects, can be transmitted and stored as XML data (10). Presently, the major commercial relational databases (22–24) have been extended to work with XML, and specialized XML databases are commercially available (25–27).
It must be emphasized that this is a minimalist approach. Because only the areas of DICOM relevant to cytometry are used, the other parts of DICOM are presently ignored; and all data transmission will be based on the World Wide Web Consortium (www.w3.org) and Internet Engineering Task Force (www.ietf.org) standards. The use of preexisting data types and a standardized, well-documented, functioning software environment has the very significant cost benefit of minimizing the scope of the project. The cost of documenting software design decisions for a medical device often can exceed the cost of writing the code. The use of DICOM data types and XML syntax would also follow the present approach of DICOM (28) and the future FDA new drug application for the collection of biological data. The FDA is creating a Web-based review for drug applications (29). The first step is the creation of an “XML backbone,” which includes an “electronic table of contents” that describes the status, location, and whether a file has been replaced or amended. Another FDA initiative is a Proposed standard for Exchange of Electrocardiographic and Other Time-series Data (30).


FIG. 1. Schema organization shows different parts of the cytometric schema import data types from DICOM, including types from DICOM foreign coding schemes. Sources of data types and other elements are expressed as an Extensible Markup Language schema. DICOM (18), Digital Imaging and Communications in Medicine; FCS (3,5), Flow Cytometry Standard; LOINC (35,36), Logical Observation Identifier Names and Codes; MathML (37), Mathematical Markup Language; UCUM (38), Unified Code for Units of Measure; SNOMED (39), Systematized Nomenclature of Medicine.

Draft XML data format requirements (31) and design (32) specifications have been created. Electrocardiographic and list-mode data are similar, and a list-mode data set is a time series. Of considerable interest to clinical cytometrists, the Centers for Medicare and Medicaid Services (CMS), formerly the Health Care Financing Administration, already has a pilot project (33) for reporting on the treatment of end-stage renal disease, which is based on an XML application, Health Level 7 (HL7), version 3.0. CMS in its campaign for standardization has already stated (34) that: “If states wish to collaborate on the development of a common XML data tag standard, CMS will work with the states to coordinate and facilitate the effort.”

The objective of this paper is to demonstrate the feasibility of mapping analytical cytologic data to DICOM standard data types and the feasibility of expressing DICOM including its coding schemes in XML. Because the major item in FCS that does not exist in DICOM is list-mode, this mapping is performed first.

METHODS

Previously (8–10), the existing DICOM standard documents were used to create a mapping of the existing FCS data types to DICOM data types, in particular those from Slide-Coordinates Microscopic Image Information Object. This mapping follows the present design of DICOM. As shown in Figure 1, Cytometry Markup Language (CytometryML) is composed of data types from multiple standards. The data types from each standard can be contained in one or more XML schema documents (schemas). XML schema is a World Wide Web Consortium recommendation (19,20). XML schemas provide a precise means for defining the structure, content, and semantics of XML documents. Schemas permit groups or organizations to create classes of documents that include shared vocabularies and a common thesaurus. Schemas include the definitions of elements, attributes, and data types. Elements are objects that can have simple and complex data types.
Attributes can have only simple data types. Examples of simple data types are strings, integers, and enumerated data types. Complex data types include records and data types with attributes. The schemas and XML documents were developed with XMLSpy version 5.2 (www.xmlspy.com). The mapping of the DICOM data types to XML and their correspondence

with FCS equivalents have been documented in these schemas. These schemas, wherever reasonable, limit the acceptable ranges of data types. These limits increase the safety of the system by forcing the software to indicate an error whenever an element or attribute has a value that is out of the specified range. Before the publication of this paper, the schemas used to produce this publication and any other supplemental materials were made available in color at www.newportinstruments.com. It should be cautioned that these documents are relevant only to the status of CytometryML at the time of submission of this paper. The future versions will be posted at the Newport Instruments Web site and, we hope, at some official Web site. To completely identify the data types and facilitate interoperability between standards, the DICOM Data Element and Value Representation (VR) tags and the FCS keywords have been and will be provided as XML fixed attributes. Because these fixed attributes are constants, their presence will be limited to the schemas, and they will not be included in the XML documents. Reusability and readability of the schemas were maximized by declaring only data types and one element based on a complex type for schemas that are referred to by XML documents. All the XML documents were based on an automatic translation by XMLSpy of an individual schema into an XML document. These XML documents were subsequently edited to remove the constant attributes and other extraneous material.

Mapping of FCS List-Mode to DICOM Waveform

This use of data types from existing standards is demonstrated by a mapping of the FCS list-mode to the DICOM Waveform Information Object (40) or class. The form of data presentation is the XML document, which can be transmitted. If the source of the fields is in a database, or can serve as structured information, it can be used in a form or publication, such as a Cytometry paper.
The waveform class can be instantiated and extended to include the closely related flow, laser scanning cytometry, and microscopic image list-mode objects. The differences between list-mode data for flow cytometry and for slide microscopy are: (a) for flow, time is relevant; (b) for image, the slide x and y coordinates are relevant; and (c) individual cell images can be elements in an image list-mode file. The new DICOM Waveform Information Object (40) provides a means to describe list-mode data with functionality that is superior to FCS. Figure 2 shows how to map FCS list-mode data to the elements of the DICOM Waveform Information Object. Analytical cytology requires one Multiplex Group (list-mode file) for each data set. One analytical cytologic parameter is equivalent to a Waveform Channel. The Sample is the multidimensional vector or record that describes the data obtained from a cell.

FIG. 2. Description of a DICOM waveform specific for cytometric data. Adapted from Figure J.5-1 in Leif and Leif (10).

Waveform

A very abbreviated version of the waveform XML document is shown below. The complete versions of all XML documents have been posted at www.newportinstruments.com and are described in detail in a separate publication (41).

1 <Modality>Flow</Modality>
2 <Waveform_Originality>Original</Waveform_Originality>
3 <Acquisition_Date_Time>2002-05-04T13:30:47-05:00</Acquisition_Date_Time>
4 <Acquisition_Context>Described in text below.</Acquisition_Context>

All XML statements begin with the ‘<’ character and end with the ‘>’ character. XML syntax specifies beginnings, <element>, and endings, </element>. The end of an XML statement includes the ‘/’ character. Statement numbers have been added at the left. Statement 1 specifies the Modality (Flow, Sort, Slide_image, or Plate_image). The Originality of the data (Original or Processed) is specified in statement 2. Processed data are produced by some analytical process. Although FCS 3.0 employs an analysis section for this type of data, it does not provide an audit trail suitable for

today’s medicolegal environment. The third element, statement 3, is the start Date-Time of acquisition of the flow cytometric list-mode data or image data from which list-mode can be derived. The Acquisition_Context, which includes multiple elements, starts at statement 4 and begins with an optional description of up to 1,024 characters. This is followed by a description of the Triggers, each of which includes a parameter short name of up to 16 characters, a parameter long name of up to 64 characters, and the range of values that can be used to trigger the instrument for that parameter. The addition of cell subsets to FCS (42) has been an important improvement and can be enhanced by employing the functionality provided by the DICOM Waveform to express subsets as index files. The locations of these index files are specified as XML URLs, which can be on a local hard drive or externally located on the Internet. The indices are based on DICOM Referenced Sample Positions, which are positions in one or more Channels (parameters) in the Multiplex Group (list-mode data). These positions are numbered starting with 1 and are equivalent to the indices of an array. Specific ranges or SEGMENTs of the array can be addressed. This capacity to specify a collection of individual events permits the identification of these events as members of a subset. The use of an index in DICOM, as opposed to the addition of a parameter in the FCS list-mode data, simplifies the software and increases its execution speed. Because the software can index through all of the data that apply to a specific cell subset, the subsets can be analyzed or rendered sequentially rather than simultaneously. These Referenced Sample Positions also can be applied to single channels and employed to gate the list-mode data. The elements of the FCS Fluorescence Compensation Matrix and the DICOM Frame of Reference Transformation Matrix are listed in row-major order.
However, there is no reason to be limited to DICOM for mathematical formulae, because this is the domain of the XML Mathematical Markup Language (MathML), Version 2.0. The description of the compensation matrix can be based on Section 3.5.1 Table or Matrix (mtable) (37, p. 91). For image data specific to the Acquisition Context, parts of the Visible Light Slide-Coordinates Microscopic Image information object are used.

Multiplex Group

A Multiplex Group is a collection of channels that are acquired synchronously. Although flow parameters can be acquired in sequence, all parameters for an individual particle (cell) are grouped together in one list-mode file. Thus, they can be treated as one Multiplex Group. Similarly, in the case of Slide Microscopy, the data describing individual cells, which were derived from a large number of sequentially collected images, are stored as one list-mode file or one Multiplex Group. A selection from the XML document that describes the data in a Multiplex Group is given below.


FIG. 3. Redrawn from a spreadsheet created by opening a Multiplex_Group.XML file in Microsoft Excel®. The column titles have been shortened to the last item in the element name.

1 <Num_Waveform_Channels>10</Num_Waveform_Channels>
2 <Num_Samples>50000</Num_Samples>

Statements 1 and 2 specify that data from 10 parameters and from 50,000 cells, respectively, have been acquired. The schema for the Multiplex Group XML document is included in a separate paper (41). The example below demonstrates the linkage of the DICOM data types and FCS keywords to the XML schema.

1 <complexType name="Num_Samples_Type">
2   <simpleContent>
3     <extension base="multi:Num_Samples_Simple_Type">
4       <attribute name="Tag" type="dicom:Tag_Type" fixed="50xx,2006"/>
5       <attribute name="VR" type="dicom:VR_Type" fixed="UL"/>
6       <attribute name="FCS_Keyword" type="fcs:FCS_Keyword_Type" fixed="$TOT"/>
7     </extension>
8   </simpleContent>
9 </complexType>

Statement 1 indicates that this is a complex type. The name of this complex type is an attribute, in this case, called name. The value of name is given on the right side of the equals sign. Because this is a DICOM type, its tag (statement 4), which is unique, and its VR (statement 5), which is type or class (Unsigned Long), are given as fixed (constant) attributes. This provides a 1:1 correspondence with DICOM. Similarly, a 1:1 correspondence with FCS 3.0 is accomplished by including an attribute (statement 6) that specifies the FCS keyword, $TOT. The inclusion of these three fixed attributes will facilitate interconversion with DICOM and FCS. This approach has the additional benefit that these attributes need not be included in the XML documents. As shown above, XML documents can be defined by schemas (19,20) that employ XML syntax. The data in these XML documents can be directly imported into commercial applications such as the spreadsheet shown in Figure 3.

Parameters (Channels)

Each parameter (channel) includes the equivalent information used in FCS to describe a parameter. An XML document that describes a parameter is included in a companion paper (41).
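How the fixed attributes tie a CytometryML type back to DICOM and FCS can be sketched in Python with the standard-library xml.etree.ElementTree module. The schema fragment repeats the nine schema statements shown above; the enclosing schema element, the Multiplex_Group root element of the document, and the two function names are assumptions made only for this illustration.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# The complexType from the text, wrapped in an assumed <schema> root element.
SCHEMA_FRAGMENT = """
<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <complexType name="Num_Samples_Type">
    <simpleContent>
      <extension base="multi:Num_Samples_Simple_Type">
        <attribute name="Tag" type="dicom:Tag_Type" fixed="50xx,2006"/>
        <attribute name="VR" type="dicom:VR_Type" fixed="UL"/>
        <attribute name="FCS_Keyword" type="fcs:FCS_Keyword_Type" fixed="$TOT"/>
      </extension>
    </simpleContent>
  </complexType>
</schema>
"""

# The two document statements from the text, in an assumed root element.
DOCUMENT = """
<Multiplex_Group>
  <Num_Waveform_Channels>10</Num_Waveform_Channels>
  <Num_Samples>50000</Num_Samples>
</Multiplex_Group>
"""


def fixed_attributes(schema_text: str, type_name: str) -> dict:
    """Return the fixed (constant) attributes declared for one complex type."""
    root = ET.fromstring(schema_text)
    for complex_type in root.iter(f"{{{XSD_NS}}}complexType"):
        if complex_type.get("name") == type_name:
            return {
                attr.get("name"): attr.get("fixed")
                for attr in complex_type.iter(f"{{{XSD_NS}}}attribute")
                if attr.get("fixed") is not None
            }
    raise KeyError(type_name)


def read_counts(document_text: str) -> tuple:
    """Read the channel and sample counts from the XML document."""
    root = ET.fromstring(document_text)
    channels = int(root.findtext("Num_Waveform_Channels"))
    samples = int(root.findtext("Num_Samples"))
    return channels, samples


mapping = fixed_attributes(SCHEMA_FRAGMENT, "Num_Samples_Type")
counts = read_counts(DOCUMENT)
```

In this sketch the DICOM tag, the VR, and the FCS keyword are recovered from the schema alone, while the XML document carries only the values, which is the division of labor described for the fixed attributes.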


The description of the Parameter (Channel) sequence starts with the Waveform Channel Number, which is equal to the FCS parameter number, n. Because this XML sequence contains the elements that describe the parameter, the value of n needs to be given once rather than, as in FCS, being included with each data element. The complete description of the specimen is separate from that of the Waveform (List-Mode) data. The detector information includes the type of detector and its units. The detector types are coded as an enumerated type, which presently includes PMT, multi-anode PMT, diode, avalanche diode, diode array, CCD camera, DC impedance, AC impedance, software, and other. Software has been included because a parameter can be based on a calculation, which often can involve more than one detector. The possible measurements include: fluorescence; low angle, 45- or 90-degree light scatter; extinction; dc or rf impedance; and other. Both beam splitters and emission filters can have up to three wavelengths. The amplifier mode (linear or log) and gain are also specified. The data format describes the numeric class, integer or float, the size of the data unit, and the precision of the measurement. A character type could be added, but this does not appear to be necessary. The excitation information includes the light source and, if there is one, an excitation filter.

DISCUSSION

Any standard must describe the objects and provide methods for them including their storage and transmission. The simplest way to have two standards interoperate is to base them on the same objects and data types. The schema is a software construct that maximizes readability and control. Schemas can import other schemas including their namespaces, data types, and objects. XML schemas include the object-oriented capacity for type extensibility and type restriction.
Schemas also include range checking, which is one type of assertion; and a pattern capability, which permits string formats that describe the order of characters and their membership in character sets. Because an XML page that employs types requires the public specification of the location of its schema(s), any additions by a software vendor have to be public. This should permit the investigators to have complete ownership and use of their data. DICOM encompasses other standards by treating them as external “Coding Schemes” (43). The simple approach (Fig. 1) is to represent the data types of these other standards as schemas. As was shown in the schema for the Multiplex Group, the multitude of coding schemes for the other standards can be mapped to fixed attributes. These fixed attributes are included only in the schema(s) for each standard and thus do not have to be referenced in schemas that import their data types or in XML documents that are based on them. The use of schemas also permits

62

R. LEIF ET AL.

FIG. 4. Interaction of Extensible Markup Language (XML) schemas with other applications to produce screens, print documents, store and retrieve data, and distribute information via the Web.

the inclusion of extremely useful XML standards, such as MathML, which are unrelated to DICOM. As shown in Figure 4, XML schemas can be the basis of XML documents, which can interoperate efficiently and productively with local and distributed computer systems. XML syntax acts as a common technology that glues systems together. Because many commercial programs can exchange XML data, it is possible for them to directly import the cytometric data into XML documents. In fact, the three XML documents described in Figure 2 were imported into Microsoft Excel威. The mapping of the large binary DICOM data types, such as visual light images and large waveforms, to a textural format would result in an unacceptable increase in data size. Therefore, the existing DICOM image data types should remain in DICOM format. JPEG 2000 is already a part of the DICOM standard (44, 45). The DICOM formats of these binary data types can be completely described in and accessible from XML. Because DICOM includes commercial off-the-shelf formats, the image files stored in these formats can be accessed and manipulated with commercial software products, including Adobe Photoshop威. A list-mode file and its associated index files are the only new binary formats contained in CytometryML. Both file types are stored as single-dimensional arrays. The index files correspond to an array of integers. The list-mode files correspond to an array of records or structs. FCS required the daunting task of creating an application programming interface to the entire standard.
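A minimal sketch of the two binary structures just described, assuming a hypothetical record of three unsigned 16-bit parameters per event and little-endian byte order; neither the layout nor the function names come from CytometryML. It shows why fixed-size records need very little interface code: event i starts at byte i times the record size, and an index is simply an array of such positions.

```python
import struct
from array import array

# Hypothetical record layout: three unsigned 16-bit parameters per event
# (for example forward scatter, side scatter, and one fluorescence channel).
RECORD = struct.Struct("<3H")  # little-endian, fixed size, no padding


def pack_events(events):
    """Serialize (p1, p2, p3) tuples as a one-dimensional array of records."""
    return b"".join(RECORD.pack(*e) for e in events)


def read_event(buf, i):
    """Random access to record i; the byte offset is i times the record size."""
    return RECORD.unpack_from(buf, i * RECORD.size)


def build_index(buf, predicate):
    """Return an array of integers holding the positions of events in one subset."""
    n = len(buf) // RECORD.size
    return array("I", (i for i in range(n) if predicate(read_event(buf, i))))


events = [(100, 200, 10), (900, 850, 700), (120, 210, 15), (950, 800, 650)]
list_mode = pack_events(events)

# Hypothetical gate: events whose third parameter exceeds 500.
bright = build_index(list_mode, lambda e: e[2] > 500)
subset = [read_event(list_mode, i) for i in bright]
```

Once an index exists, a cell subset can be displayed by reading only the listed records, without rescanning or reclassifying the whole list-mode array.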

The only application programming interface required for CytometryML is for an array of records and one or more arrays of integers. The capacity for Structured Reporting (43,46) in DICOM provides a very powerful means to connect the data and the pathologist or other individual who makes a clinical decision based on the data. XML technology includes the capacity to generate these reports.

CONCLUSIONS

The requirements for a cytometry standard that we originally proposed (8) have been met by using DICOM semantics and XML syntax. The original requirements and an assessment of CytometryML meeting these requirements are given below.

1. FCS should be defined as a manufacturer-independent data transfer format. XML and the World Wide Web are international standards.
2. Undetectable errors in writing, transmitting, reading, and receiving FCS data should be minimized. Errors are minimized by having typed data with specified ranges that are validated against a schema. Transmission is not part of CytometryML; transmission is handled directly by Web protocols, such as TLS (HTTPS), which are standard and thus required to be exceedingly reliable (47).
3. Conformance to the FCS should be testable. Because CytometryML is a collection of XML schemas, these core schemas can be hosted on the ISAC Web site. Additional schemas supplied by the vendors will be human readable and will be extensions of these core schemas.
4. The developer (vendor) of software who produces FCS software should, at the very least, be required to submit a statement that the product conforms to FCS. The vendor’s XML data will be validated against the CytometryML core schemas and the vendor’s public, human-readable schemas. The vendor’s schemas can be validated by commercially available programs, such as XMLSpy.
5. The FCS data format should be optimized for reliable transmission. Although file formats are independent of transmission protocol, having the data within XML files allows standard Internet transmission protocols to be used. This includes TCP-based protocols such as HTTP and FTP and other ways for moving data between machines. The basics of the low layer of the FCS specification made it difficult to use standard protocols with the data because those files had to be encapsulated as an untyped string of bytes. No standard Internet tools could parse them because they were specific to FCS, a very narrow standard within a niche field. By moving to an XML-based file format, any commercial off-the-shelf system can parse (and thus packetize and transmit) the file. These are the same systems that are used to support electronic commerce and other secure operations.
6. International standards should be used for the design of and referenced by the FCS. Most of the data types in the XML Schema Part 2: Data Types reference the appropriate ISO and IEEE standards. DICOM, which was used for the design, is an international medical informatics standard.
7. FCS should be independent of programming language. Virtually all modern programming languages can interface with XML.
The Microsoft .NET initiative provides interfaces to XML.
8. The capacity to interact and interface with other clinical data management systems should be maximized. The combination of the references to DICOM data types in the schema and the XML initiative by HL7 ensures this.
9. FCS should be compatible with or based on existing, relevant standards. CytometryML is based on DICOM and XML and includes references to the FCS keywords in its schemas.
10. Human reading of the contents of a FCS data transmission or file should not result in errors or breaches of patient privacy. Transmitting XML files with the use of the Secure Sockets Layer (SSL, succeeded by TLS and used by HTTPS) protects the contents by using RSA encryption. This does not protect the contents of the file on the machine, only its contents during transmission. Vendors will need to provide their own security methods to protect XML data files stored on the machine. These can be operating system-dependent methods of protecting file access. By moving security from inside the file to an outer system layer, the flow cytometry instrument vendors can take advantage of all the commercial off-the-shelf advances in security. This makes more sense, from an engineering perspective and a patient privacy perspective, than trying to re-invent the security features inside the file format. If additional privacy measures are needed, XML provides a standard way of encrypting the contents of an XML file (1,48).
11. The data acquisition configuration information included in the standard should be sufficient to permit an experimenter to set up his/her instrument to repeat the acquisition of the data. This is an iterative process. The manufacturers can ensure its success by extending CytometryML to include the specifics of their instruments. Two ways to maximize the probability of successfully meeting this requirement are to create schemas in a manner as open as possible and to adhere to existing well-tested standards, e.g., XML and DICOM.
12. The FCS ANALYSIS section should permit an experimenter to understand how the data were analyzed.
The combined use of MathML, detailed descriptions of the analyses (already started), and the steps taken for requirement 11 should be sufficient.
13. The data included in the standard should be sufficient to permit an experimenter to prepare the sample(s). DICOM (18), SNOMED (39), and LOINC (35,36) contain most of the data types needed to develop a complete test protocol. The Analyte_Info section of the Parameters XML document is an example of this.

The feasibility of expressing the data contained in a FCS file in XML employing DICOM data types has been demonstrated, and a preliminary set of schemas has been created. DICOM permits the creation of indices to individual cell subsets. Parts of DICOM coding schemes for other standards, such as UCUM, have been represented by XML schemas. The combination of the domain knowledge of the cytometry community with the well thought out, tested design of DICOM and the universality of XML will have the benefits of: (a) creating one analytical cytology standard for flow, laser scanning cytometry, and image data; (b) being able to express all FCS keywords that are used to describe a flow cytometric experiment in DICOM; (c) employing the DICOM Waveform design to provide a simpler and easier-to-maintain structure than the monolithic FCS (the DICOM Waveform places the list-mode array into a separate file and incorporates the cell-type data into indices); (d) retiring the present FCS 3.0 (5); (e) basing a cytometry standard on one that is open, well designed, reliable, internationally accepted, and backed by the medical profession; (f) being operating system independent; (g) interfacing with multiple programming languages; (h) interoperating with the existing medical informatics infrastructure; (i) increasing the amount of information, including clinically relevant material, in the data; (j) increasing the speed of analysis and display of cell subsets; and (k) using a World Wide Web standard, XML, as an intermediate form, which will greatly facilitate data exchange via the Internet, provide an open, reliable standard, and be interoperable with the rest of medical informatics.

THE FUTURE

The XML documents, schemas, and other ancillary information on CytometryML have been posted at www.newportinstruments.com. We welcome assistance in further refining and extending these documents.
We hope to do this in a manner as open as possible. Continuous peer review is a tested way to create a reliable software construct, such as an informatics standard. If ISAC accepts these documents for its Web site, we will post them there.

ACKNOWLEDGMENTS

We thank Prof. Ulysses J. Balis for the very helpful suggestion of mapping the Flow Cytometry list-mode to a DICOM Waveform. Margie Becker, Liza Leif, and Gary Ferguson made many helpful suggestions to increase the readability of this paper.

LITERATURE CITED

1. Imamura T, Dillaway B, Simon E. XML encryption syntax and processing. In: Eastlake D, Reagle J, editors. W3C recommendation, 10 December 2002. Available from: http://www.w3.org/TR/2002/REC-xmlenc-core-20021210/
2. Boyer J, Hughes M, Reagle J. XML-Signature XPath filter 2.0. W3C recommendation, 8 November 2002. Available from: http://www.w3.org/TR/2002/REC-xmldsig-filter2-20021108/

3. Dean PN, Bagwell CB, Lindmo T, Murphy RF, Salzman GC (Data File Standards Committee). Data file standard for flow cytometry. Cytometry 1990;11:323–332.
4. Leif RC, Rios R, Becker MC, Becker CK, Self JT, Leif SB. The creation of a laboratory instrument quality monitoring system with AdaSAGE. In: Asakura T, Farkas DL, Leif RC, Priezzhev AV, Tromberg BJ, Katzir A, editors. Advanced techniques in analytical cytology, optical diagnosis of living cells and biofluids. Progress in biomedical optics. Volume 2678. Bellingham, WA: SPIE Proceedings; 1996. p 232–239.
5. Seamer LC, Bagwell CB, Barden L, Redelman D, Salzman GC, Wood JC, Murphy RF. Proposed new data file standard for flow cytometry, version FCS 3.0. Cytometry 1997;28:118–122.
6. General principles of software validation; final guidance for industry and FDA staff. Document issued 11 January 2002. Washington, DC: US Department of Health and Human Services, Food and Drug Administration, Center for Devices and Radiological Health, Center for Biologics Evaluation and Research; 2002. Available from: http://www.fda.gov/cdrh/comp/guidance/938.pdf
7. Guidance for FDA reviewers and industry guidance for the content of premarket submissions for software contained in medical devices. Document issued 29 May 1998. Washington, DC: US Department of Health and Human Services, Food and Drug Administration, Center for Devices and Radiological Health, Office of Device Evaluation; 1998. Available from: http://www.fda.gov/cdrh/ode/57.html
8. Leif RC, Leif SB. The evolution of flow cytometry standard, FCS3.0, into a DICOM-compatible format. In: Priezzhev AV, Asakura T, Leif RC, editors. Optical diagnostics of biological fluids and advanced techniques in analytical cytology. Volume 2982. Bellingham, WA: SPIE Proceedings; 1997. p 354–366.
9. Leif RC, Leif SB. A DICOM compatible format for analytical cytology data. In: Farkas DL, Leif RC, Tromberg BJ, editors. Optical investigations of cells in vitro and in vivo. Volume 3260. Bellingham, WA: SPIE Proceedings; 1998. p 282–289.
10. Leif RC, Leif SB. A DICOM compatible format for analytical cytology data, that can be expressed in XML. In: Farkas DL, Leif RC, editors. Optical diagnostics of living cells IV. Volume 4260. Bellingham, WA: SPIE Proceedings; 2001. p 238–248.
11. Murphy RF. International Society for Analytical Cytology, data standards committee meeting minutes, 4–5 January 2002. Available from: http://www.isac-net.org/
12. Online Edition of the Unicode Standard. Version 3.0. Available from: http://www.unicode.org/unicode/uni2book/ch03.pdf. p 37.
13. Programming with the general-purpose instructions. In: IA-32 Intel architecture software developer’s manual. Volume 1. Basic architecture; 2002. Available from: www.intel.com/design/pentium4/manuals/24547011.pdf
14. MPC823 support team. Endian modes. In: PowerPC MPC823 reference manual. Accessed November 2002. Available from: http://e-www.motorola.com/brdata/PDFDB/docs/MPC823UM.pdf
15. Kamentsky LA. Methods and apparatus for measuring multiple optical properties of biological specimens. US patent 5,072,382; 1991.
16. Kamentsky LA, Kamentsky LD, Fletcher JA, Kurose A, Sasaki K. Methods for automatic multiparameter analysis of fluorescence in situ hybridized specimens with a laser scanning cytometer. Cytometry 1997;27:117–125.
17. Buetow K. Combining bioinformatics and genomics to gain insight into the etiology of cancer. ISAC XXI Lectures Online. Special session on Cytomics; 2002. Available from: http://isac.digiscript.com/presentation/index.cfm?media_id=8743
18. The DICOM standard. Rosslyn, VA: National Electrical Manufacturers Association; Official DICOM printed standard volumes. Available from: http://www.nema.org/index_nema.cfm/563/?type=standard&pageid=A314508D%2DA289%2D4D56%2D953C71DC9F7C0D4E&node=694 [purchase] or http://medical.nema.org/dicom/2003.html [download]
19. Thompson HS, Beech D, Maloney M, Mendelsohn N. XML schema part 1: structures, W3C; 2001. Available from: www.w3.org/TR/2001/REC-xmlschema-1-20010502/
20. Biron PV, Malhotra A. XML schema part 2: datatypes, W3C; 2001. Available from: www.w3.org/TR/2001/REC-xmlschema-2-20010502/
21. Walmsley P. Definitive XML schema. Saddle Brook, NJ: Prentice-Hall; 2002.
22. IBM Corp. Leverage the power of XML in e-business applications, DB2 Universal Database XML Extender: Web-enabling your data with XML. Available from: http://www-3.ibm.com/software/data/db2/extenders/xmlext/xmlextfctsht.pdf
23. Microsoft Corp. Microsoft SQL server 2000 Web services toolkit. Available from: http://www.microsoft.com/sql/techinfo/xml/default.asp
24. Oracle Corp. Oracle XML developer’s kits. Available from: http://otn.oracle.com/tech/xml/xdkhome.html


25. Software AG. Tamino XML server number one in XML management, white paper. Available from: http://www.softwareag.com/tamino/download/tamino.pdf
26. eXcelon Corp. eXtensible information server (XIS), native XML data management system. Available from: http://www.exln.com/products/xis/
27. Ipedo Corp. Products, Ipedo XML database. Available from: www.ipedo.com/html/
28. Work item proposal: DICOM Standard publication and maintenance in XML. Available from: http://medical.nema.org/dicom/minutes/wg-10/ahg-publish-xml/2003-02-27/wg_10_workitemreq_dicominxml.doc
29. Brolund G. How to build an electronic common technical document. Washington, DC: US Food and Drug Administration, CDER/OIT. Available from: http://www.fda.gov/cder/present/dia72001/cderectdreview/tsld001.htm
30. FDA/Center for Drug Evaluation and Research (CDER) proposed standard for exchange of electrocardiographic and other time-series data. Available from: http://www.fda.gov/cder/regulatory/ersr/ECGdata.htm
31. Brown B, Kohls M, Stockbridge N. Draft, FDA XML data format requirements specification, revision B. Available from: http://www.cdisc.org/discussions/EGC/FDA_XML_Data_Format_Requirements_Specification_DRAFT_B.pdf
32. Brown B, Kohls M, Stockbridge N. Draft, FDA XML data format design specification, revision C. Available from: http://www.cdisc.org/discussions/EGC/FDA%20_XML_Data_Format_Design_Specification_DRAFT_C.pdf
33. CMS ESRD. HL7 V3 XML specifications & design, end stage renal disease (ESRD). Washington, DC: US Department of Health and Human Services and Metaintegration Technology. Available from: http://www.metaintegration.net/CMS/
34. Future Eligibility Determination Systems. How standardization can help solve the challenges. Washington, DC: US Department of Health and Human Services. Available from: http://cms.hhs.gov/medicaid/eligibility/tanfdelink.pdf
35. Regenstrief Institute. LOINC and RELMA. Available from: http://www.loinc.org/download
36. Regenstrief Institute. LOINC database version LOINC 2.08. Released September 13, 2002. Available from: http://www.loinc.org/download
37. Carlisle D, Ion P, Miner R, Poppelier N, editors. Mathematical markup language (MathML). Version 2.0. W3C recommendation; 21 February 2001. Available from: http://www.w3.org/TR/2001/REC-MathML2-20010221/


38. Schadow G, McDonald CJ. The unified code for units of measure (UCUM). Version 1.4. 27 April 2000. Regenstrief Institute for Health Care. Available from: http://aurora.rg.iupui.edu/UCUM/UCUM.pdf
39. SNOMED International, College of American Pathologists. Systematized nomenclature of medicine. SNOMED. Accessed 27 May 2002. Available from: http://www.snomed.org/
40. Digital Imaging and Communications in Medicine. PS 3.3-2003, A.34 waveform information object definitions and C.10.8 waveform identification module. Rosslyn, VA: National Electrical Manufacturers Association; 2003. Available from: http://medical.nema.org/dicom/2003/03_03PU.PDF
41. Leif RC, Leif SB, Leif SH. CytometryML, a markup language for analytical cytology. In: Enderlein J, Farkas DL, Leif RC, Nicolau DV, editors. Manipulation and analysis of biomolecules, cells, and tissues. Volume 4962. New York: SPIE Proceedings. Forthcoming.
42. Redelman D, Coder DM. Cell subset (CS) parameter to record the identities of individual cells in flow cytometric data. Cytometry 1994;18:95–102.
43. Digital Imaging and Communications in Medicine. Part 16: Content mapping resource, PS 3.16-2003, Table 8-1 coding schemes. Rosslyn, VA: National Electrical Manufacturers Association. Available from: http://medical.nema.org/dicom/2003/01_16PU.PDF
44. Digital Imaging and Communications in Medicine. Part 3: Information object definitions, PS 3.3-2003, Section C.7.6.3.1.2 Photometric Interpretation. Rosslyn, VA: National Electrical Manufacturers Association. Available from: http://medical.nema.org/dicom/2003/03_03PU.PDF
45. Digital Imaging and Communications in Medicine. Part 5: Data structures and encoding, PS 3.5-2003, Section 8.2 Native or encapsulated format encoding. Available from: http://medical.nema.org/dicom/2003/03_05PU.PDF
46. Dierks T, Allen C. The TLS protocol. Version 1.0. Response to network working group request for comments: 2246 category: standards track; 1999. Available from: http://www.rfc-editor.org/rfc/rfc2246.txt
47. Dierks T, Allen C. The TLS protocol. Version 1.0. Response to network working group request for comments: 2246 category: standards track; 1999. Available from: ftp://ftp.rfc-editor.org/in-notes/rfc2246.txt
48. W3C®, The technology & society domain, Technical building blocks that help address critical public policy issues on the Web. Available from: http://www.w3.org/TandS/
