Validating a Large Geophysical Data Set: Experiences with Satellite-Derived Cloud Parameters

Ralph Kahn, Robert D. Haskins, James E. Knighton, Andrew Pursch, and Stephanie Granger-Gallegos
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109

Abstract

We are validating the global cloud parameters derived from the satellite-borne HIRS2 and MSU atmospheric sounding instrument measurements, and are using the analysis of these data as one prototype for studying large geophysical data sets in general. The HIRS2/MSU data set contains a total of 40 physical parameters, filling 25 MB/day; raw HIRS2/MSU data are available for a period exceeding 10 years. Validation involves developing a quantitative sense for the physical meaning of the derived parameters over the range of environmental conditions sampled. This is accomplished by comparing the distributions of the derived quantities with similar measurements made using other techniques, and with model results.

The data handling needed for this work is possible only with the help of a suite of interactive graphical and numerical analysis tools. Level 3 (gridded) data is the common form in which large data sets of this type are distributed for scientific analysis. We find that Level 3 data is inadequate for the data comparisons required for validation. Level 2 data (individual measurements in geophysical units) is needed. A sampling problem arises when individual measurements, which are not uniformly distributed in space or time, are used for the comparisons. Standard 'interpolation' methods involve fitting the measurements for each data set to surfaces, which are then compared. We are experimenting with formal criteria for selecting geographical regions, based upon the spatial frequency and variability of measurements, that allow us to quantify the uncertainty due to sampling. As part of this project, we are also dealing with ways to keep track of constraints placed on the output by assumptions made in the computer code. The need to work with Level 2 data introduces a number of other data handling issues, such as accessing data files across machine types, meeting large data storage requirements, accessing other validated data sets, processing speed and throughput for interactive graphical work, and problems relating to graphical interfaces.

KEY WORDS: large data sets, validation, satellite data analysis

1. Introduction

NASA's Earth Observing System (EOS) will generate vast quantities of data. Hundreds of terabytes of data will be acquired from orbit to characterize the Earth's environment with the kind of spatial and temporal detail needed to study climate change. Such high resolution is required to properly sample the non-linear impact of small-scale phenomena, which can make significant contributions to the global-scale budgets of heat and momentum. It is also expected that the data will be analyzed not just in the traditional manner, concentrating on a single data set at a time, but in new ways that involve routinely comparing data sets from multiple sources. Part of the need to study multiple data sets comes from a growing appreciation for the importance to global conditions of transports across boundaries such as the air-ocean interface (e.g., Earth System Science Committee, 1988).

We are undertaking the validation of cloud parameters derived from the High Resolution Infrared Radiation Sounder 2 (HIRS2) and the Microwave Sounding Unit (MSU) instruments aboard the NOAA polar orbiting meteorological satellites. The instruments provide one of the few global measures of cloud properties extending over many years. They are also capable of obtaining near-simultaneous constraints on the physical characteristics of the atmosphere and surface needed to derive cloud properties. One goal of this work is to learn about analyzing large geophysical data sets in general.

Radiances from the HIRS2 and MSU instruments have been analyzed by Susskind and co-workers using an algorithm that accounts self-consistently for the first-order physical quantities affecting the emergent radiation (Susskind et al., 1984; 1987). The standard data products are (1) monthly mean values for forty meteorological parameters, including effective cloud amount and effective cloud top height, on a grid of boxes 2 degrees in latitude by 2.5 degrees in longitude, and (2) 'daily data' with twice-daily temporal sampling, a spatial resolution of about 125 km, and spacing between points of about 250 km. The monthly mean data are referred to as a 'Level 3' (gridded) product, and the daily
data is called a 'Level 2' product (individual measurements reduced to geophysical units) (Space Science Board, 1982; EOS Data Panel, 1986). The size of the uncompressed Level 3 data is about 4 MB/month, whereas the Level 2 product fills about 25 MB/day (750 MB/month).

By validation we mean 'developing a quantitative sense for the physical meaning of the measured parameters,' for the range of conditions under which they are acquired. Our approach involves: (1) identifying the assumptions made in deriving parameters from the measured radiances, (2) testing the input data and derived parameters for statistical error, sensitivity, and internal consistency, and (3) comparing with similar parameters obtained from other sources using other techniques. A study of this type was performed for sea surface temperature (Njoku, 1985), and our project is one of several parallel efforts currently underway to validate different cloud climatologies (e.g., Rossow et al., 1985; 1990). The validation effort we are undertaking introduces a number of problems that may be of interest to specialists in computational statistics, such as the INTERFACE community, as well as to those involved in research directly related to interpreting large geophysical data sets. This article summarizes the key data handling issues we have encountered.

2. The Need for 'Level 2' Data

Large geophysical data sets, such as cloud climatologies, are often distributed to researchers in gridded (Level 3) form. This can reduce the data volume by orders of magnitude relative to the parameter values for each individual sounding (Level 2), and provides the user with a 'spatially uniform' data product. For example, Figure 1A is the global, monthly-mean cloud amount map for July 1979 from the HIRS2/MSU data, in the original 2 degree by 2.5 degree averaging bins. All accepted cloud amount data from the individual atmospheric soundings that fell within each geographic box were summed, and mean and variance values for each box were calculated.

Several problems occur when using Level 3 products for validation. First, if only the Level 3 parameter values and associated variances are available, there is no way to assess how much of the reported variance is due to inherent non-uniformity of the parameter over the averaging region. Essentially, the instrument resolution is degraded to a scale comparable to the box size, and information originally acquired to measure smaller-scale phenomena in both the spatial and temporal domains is lost. For example, in a 2 by 2.5 degree box, the surface temperature may exhibit random fluctuations of half a degree and may change systematically by several degrees, whereas the box average variance will assign all the variability to random error.

We encountered a second problem when making comparisons among Level 3 products with different gridding schemes. The best concurrent cloud climatology available for comparison with the data in Figure 1A was derived from the Temperature Humidity Infrared Radiometer/Total Ozone Mapping Spectrometer (THIR/TOMS) on the NASA Nimbus 7 satellite (Stowe et al., 1988; 1989). The standard THIR/TOMS Level 3 data product was binned according to a global 500 by 500 km grid that is also used for Earth radiation budget studies. The July 1979 HIRS2/MSU Level 3 data, degraded using area-weighted averaging to the THIR/TOMS spatial grid, is shown in Figure 1B. We then resampled the degraded HIRS2/MSU data back to the 2 by 2.5 degree grid, and subtracted it from the original HIRS2/MSU data (Figure 1C). Note that the differences are nearly as large as the range of the signal, with both positive and negative values. The pattern of differences varies with the location of edges in the original data, and is modulated by the relative position of grid boundaries. Differences are especially large at high latitudes, where the spatial resolution of the THIR/TOMS grid is much lower than that of the HIRS2/MSU grid, and wherever there are sharp edges generated by cloud patterns, such as in the intertropical convergence zone and monsoon regions.

With the Level 2 products, we have access to physical quantities at the full resolution acquired by the instruments, and avoid introducing additional artifacts into the comparison between data sets. Level 2 data are not uniformly distributed over the surface. At low latitudes there are gores in the HIRS2 sampling between orbits, whereas at high latitudes, the surface is heavily oversampled. Data dropouts and calibration lines occur at all latitudes. The sampling [...] by more than a factor of [...] from nadir to the limits of each scan.

As a first step toward making comparisons among Level 2 data sets, surfaces that take account of non-uniform clustering of data points may be fit to the data. We have begun experimenting with locally adaptive surface fitting techniques (e.g., Renka, 1988), and are exploring the use of methods that generate variance surfaces together with each fitted surface (Cressie, 1989, and references therein).

Gridding, which is traditionally used to make comparisons among global data sets, is performed as an automatic procedure. In using Level 2 data for validating data sets, geographic sub-regions of the globe must be selected for surface fitting, based upon some criterion that evaluates the density of points relative to the size of local gradients of the parameter field, possibly in several directions. Figure 2 illustrates the role of interactive geographic subset selection, a part of the software we are assembling to perform the HIRS2/MSU validation. 'HDF' in this figure refers to Hierarchical Data Format, a transportable file format that eliminates all but an initial file conversion for exchanging data among DEC, Sun, Macintosh, and other machines used in the validation (NCSA Software Tools Group, 1990).
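The first problem with Level 3 data described in this section, that a box-average variance cannot separate a systematic change across the box from random scatter, can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration: the soundings are synthetic, the field is a straight-line trend of 3 K across a 2.5 degree box with 0.5 K Gaussian noise, and the function name `box_variances` is hypothetical. It is not part of the HIRS2/MSU processing code.

```python
import random


def box_variances(positions, values):
    """Return (variance about the box mean, variance about a fitted line).

    The first number is what a box-average (Level 3 style) product would
    report. The second is the scatter left after removing a least-squares
    straight-line trend across the box.
    """
    n = len(values)
    mean_x = sum(positions) / n
    mean_v = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in positions)
    sxv = sum((x - mean_x) * (v - mean_v) for x, v in zip(positions, values))
    slope = sxv / sxx
    about_mean = sum((v - mean_v) ** 2 for v in values) / (n - 1)
    about_line = sum(
        (v - mean_v - slope * (x - mean_x)) ** 2
        for x, v in zip(positions, values)
    ) / (n - 2)
    return about_mean, about_line


# Hypothetical soundings scattered across a 2.5 degree wide box.
rng = random.Random(0)
positions = [rng.uniform(0.0, 2.5) for _ in range(200)]
# Assumed field: 3 K systematic change across the box plus 0.5 K random noise.
values = [290.0 + 3.0 * x / 2.5 + rng.gauss(0.0, 0.5) for x in positions]

about_mean, about_line = box_variances(positions, values)
print(f"variance about box mean:    {about_mean:.2f} K^2")
print(f"variance about fitted line: {about_line:.2f} K^2")
```

Under these assumptions the variance about the box mean is several times larger than the variance about the fitted line, because the former folds the systematic trend into what looks like random error. Recovering the second number requires the individual (Level 2) values; it cannot be derived from the box mean and box variance alone.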
HDF allows us to store single copies of data files on centrally located disks that are accessible across the network to machines with differing architectures.

We are currently investigating the criteria for accepting subsets, the choice of method for surface fitting, and methods for making formal comparisons among surfaces fitted to data from different sources. We set aside for the present the important question of interpolation in the temporal domain.
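As a rough illustration of what fitting a surface with an associated variance and then comparing two sources can involve, the sketch below estimates a local value from scattered points with a Gaussian-weighted mean, attaches a variance to that mean, and expresses the difference between two data sets in units of their combined standard error. This is a deliberately simple stand-in chosen for brevity; it is neither the locally adaptive method of Renka (1988) nor the geostatistical approach reviewed by Cressie (1989). The function names, the length scale, and the sample values are all hypothetical.

```python
import math


def local_estimate(points, lon0, lat0, length_scale):
    """Gaussian-weighted mean of scattered values near (lon0, lat0).

    points: list of (lon, lat, value) tuples. Distances are treated as
    planar, in degrees, which is only reasonable for a small region away
    from the poles. Returns (mean, variance_of_mean, effective_count).
    """
    weights = [
        math.exp(-0.5 * ((lon - lon0) ** 2 + (lat - lat0) ** 2) / length_scale ** 2)
        for lon, lat, _ in points
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no data within reach of the query point")
    mean = sum(w * p[2] for w, p in zip(weights, points)) / total
    # Weighted scatter of the values about the local mean.
    scatter = sum(w * (p[2] - mean) ** 2 for w, p in zip(weights, points)) / total
    # Effective number of independent points contributing to the mean.
    effective_count = total ** 2 / sum(w * w for w in weights)
    if effective_count <= 1.0:
        return mean, math.inf, effective_count
    return mean, scatter / (effective_count - 1.0), effective_count


def standardized_difference(estimate_a, estimate_b):
    """Difference of two local means in units of their combined standard error."""
    combined = estimate_a[1] + estimate_b[1]
    if combined == 0.0:
        raise ValueError("both variances are zero; cannot standardize")
    return (estimate_a[0] - estimate_b[0]) / math.sqrt(combined)


# Hypothetical cloud-amount samples from two sources with different sampling.
set_a = [(10.0, 5.0, 0.42), (10.5, 5.2, 0.47), (11.0, 4.8, 0.51), (10.2, 5.6, 0.44)]
set_b = [(10.1, 5.1, 0.50), (10.8, 5.3, 0.58), (10.4, 4.7, 0.55)]

est_a = local_estimate(set_a, 10.5, 5.0, 0.5)
est_b = local_estimate(set_b, 10.5, 5.0, 0.5)
print("source A:", est_a)
print("source B:", est_b)
print("standardized difference:", standardized_difference(est_a, est_b))
```

The design choice worth noting is that each estimate carries its own variance, so a comparison can distinguish a difference between sources from a difference that sparse sampling alone could produce. A fixed length scale is the main simplification; a locally adaptive scheme would vary it with the density of points.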
To summarize: in spite of the much larger volume of the Level 2 data, relative to Level 3, and the collection of issues related to the spatial and temporal sampling of Level 2 data, we need the ability to access, store, and process Level 2 data for (1) studies of the internal consistency and precision of the data set and (2) comparisons with other cloud climatologies, that are involved in the validation of the HIRS2/MSU cloud parameters. We anticipate that similar needs will arise for interdisciplinary process studies, and in work directed toward using observations to better understand mesoscale climatological phenomena.

3. Tracking Assumptions in the Code

Another issue that bears upon the degree to which we may perform validation, and other scientific analysis on large data sets, is our ability to grasp the collection of constraints imposed on parameter values by the code that generates them. An assumption embedded in a large data handling code may produce results that hide important information in the data, or may produce patterns in the data that could be incorrectly interpreted as scientifically meaningful.

We are experimenting with methods of charting the collection of assumptions, as a way of calling the attention of the user to areas where the code may influence the output parameters. We are using standard charting symbols as much as possible (e.g., Yourdon and Constantine, 1979). An example of this type of chart is Figure 3. This shows the flow of control and the flow of assumptions made in a relatively small part of the HIRS2/MSU analysis code that produces Level 3 data from Level 2 products. This chart made clear the number and complexity of the assumptions involved in generating Level 3 products, and it played a role in our assessment of the value of Level 3 data for the validation exercise. Charting the flow of control provides a needed context for the constraints placed on the data. These charts take a step in the direction of making it possible to keep track of assumptions, but they do not eliminate the work involved in carefully assessing the meaning of derived parameters.

4. Conclusions

The HIRS2/MSU cloud parameter validation effort raises a number of data handling issues that are likely to arise frequently when scientific analysis is attempted on large geophysical data sets. We need Level 2 data (individual measurements in geophysical units) (A) to perform comparisons among data sets with different sampling, and (B) to understand the effects of spatial and temporal sampling on the 'average' values obtained from a single data set. The need for Level 2 data severely complicates data handling. Among the areas where advances would be most helpful are:

1. Surface fitting software for data distributed non-uniformly in 2-dimensional space, and ways to obtain some measure of the associated variances.
2. Software for making formal comparisons among fitted surfaces from several sources, and their associated variance surfaces.
3. Ways of documenting software and data files so they may be exchanged and used by others easily.
4. Ways of documenting the assumptions embedded in retrieval and processing algorithms, so a researcher studying the data products can grasp the collection of constraints placed on the output data by the code.
5. Additional ways of storing data. For a given Level 2 data product, we need readily accessible data storage capacity of between one and two orders of magnitude the size of the basic data set, for intermediate and derived products that are created as part of the validation.

Several longer-term needs include:

6. The development of validation procedures that are easy enough to apply so that it will be feasible to generate and access a large number of validated geophysical data sets for interdisciplinary studies of all types.
7. Ways of fitting surfaces to data values distributed non-uniformly in 2-dimensional space and in time, and obtaining a measure of the associated variances.
8. Better ways of discovering patterns and surprises in high-dimensional data sets.
9. Ways of fitting hyper-surfaces to higher dimensional data sets, and techniques for studying them.

We have described our data, the collection of problems we are facing in the validation work, and our approaches to some of these issues. Solutions or partial solutions may exist to some of the problems that are not widely known outside specialized data handling and computational statistics communities. We hope to stimulate experts in these fields to participate in the effort to improve our understanding of Earth through the study of large geophysical data sets.
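To give a sense of scale for the storage need listed in item 5, the data rates quoted in this paper can be combined in a short calculation. The two data rates come from the text; the 30-day month, the 365-day year, the 10-year record length (the text says only that raw data span a period "exceeding 10 years", and this assumes Level 2 products exist for that full span), and the convention of 1000 MB per GB are assumptions made here for illustration.

```python
# Figures quoted in the text.
LEVEL2_MB_PER_DAY = 25.0
LEVEL3_MB_PER_MONTH = 4.0

# Assumptions made here for illustration.
DAYS_PER_MONTH = 30    # reproduces the quoted 750 MB/month
DAYS_PER_YEAR = 365
RECORD_YEARS = 10      # lower bound; the text says "exceeding 10 years"
MB_PER_GB = 1000.0

level2_mb_per_month = LEVEL2_MB_PER_DAY * DAYS_PER_MONTH
volume_ratio = level2_mb_per_month / LEVEL3_MB_PER_MONTH
level2_record_gb = LEVEL2_MB_PER_DAY * DAYS_PER_YEAR * RECORD_YEARS / MB_PER_GB

# Item 5 asks for one to two orders of magnitude more than the basic data set.
working_storage_gb = (10 * level2_record_gb, 100 * level2_record_gb)

print(f"Level 2 volume per month: {level2_mb_per_month:.0f} MB")
print(f"Level 2 / Level 3 volume ratio: {volume_ratio:.1f}")
print(f"Level 2 record over {RECORD_YEARS} years: {level2_record_gb:.2f} GB")
print(f"Working storage range: {working_storage_gb[0]:.1f} to {working_storage_gb[1]:.1f} GB")
```

Under these assumptions the Level 2 record is roughly 190 times the size of the Level 3 record, and the working storage implied by item 5 runs from just under one terabyte to several terabytes.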
Acknowledgments

We thank Paul Tukey for inviting us to participate in the INTERFACE 91 conference, and Daniel Carr, Jeff Dozier, Mike Freilich, Wes Nicholson, Bill Rossow, Victor Zlotnicki, and Richard Zurek for stimulating discussions on many aspects of this work. This project is supported in part by the NASA Earth Sciences Interdisciplinary Program in the Earth Science and Applications Division, and by the Jet Propulsion Laboratory Director's Discretionary Fund. The work was performed at the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration.
References

Cressie, N. (1989), Geostatistics, The Amer. Statistician, 43, 197-202.

Earth System Science Committee (1988), 'Earth System Science: A Closer View', Report of the Earth System Science Committee, NASA Advisory Council, NASA, Washington, D.C.

EOS Data Panel (1986), The Earth observing system: Report of the EOS data panel, Vol 2a, NASA Tech. Memo. 8777, Washington, D.C.

NCSA Software Tools Group (1990), Hierarchical Data Format, National Center for Supercomputing Applications, Champaign, IL.

Njoku, E. (1985), Satellite-derived sea surface temperature: Workshop comparisons, Bull. Am. Meteorol. Soc., 66, 274-281.

Renka, R.J. (1988), Multivariate interpolation of large sets of scattered data, ACM Transact. Math. Software, 14, 139-148.

Rossow, W.B., Mosher, F., Kinsella, E., Arking, A., Desbois, E., Harrison, E., Minnis, P., Ruprecht, E., Seze, G., Simmer, C., and Smith, E. (1985), ISCCP cloud algorithm intercomparison, J. Climate Appl. Meteor., 24, 877-903.

Rossow, W.B. (1990), Report of the Workshop on Comparison of Cloud Climatology Datasets, NASA Goddard Institute for Space Studies, New York.

Space Science Board (1982), Data management and computation, Vol 1: Issues and recommendations, National Academy of Sciences/National Academy Press, Washington, D.C.

Stowe, L.L., Wellemeyer, C.G., Eck, T.F., Yeh, H.Y.M., and the NIMBUS 7 Cloud Data Processing Team (1988), NIMBUS 7 global cloud climatology. Part I: Algorithms and validation, J. Climate, 1, 445-470.

Stowe, L.L., Yeh, H.Y.M., Eck, T.F., Wellemeyer, C.G., H.L. Kyle, and the NIMBUS 7 Cloud Data Processing Team (1989), NIMBUS 7 global cloud climatology. Part II: First year results, J. Climate, 2, 671-709.

Susskind, J., Rosenfield, J., Reuter, D., Chahine, M.T. (1984), Remote sensing of weather and climate parameters from HIRS2/MSU on TIROS-N, J. Geophys. Res., 89, 4677-4697.

Susskind, J., Reuter, D., Chahine, M.T. (1987), Cloud fields retrieved from analysis of HIRS2/MSU sounding data, J. Geophys. Res., 92, 4035-4050.

Yourdon, E., and Constantine, L.L. (1979), Structured Design: Fundamentals of a Discipline of Computer Program and System Design, Yourdon Press, NY, pp 473.
Figure 3. HIRS2 Level 2 to 3 Software Overview (Continued)