A Grid based triangulation and collaboration infrastructure for e-Social Science Research
Ben Anderson, Chimera, The University of Essex, Adastral Park, Ipswich, IP5 3RE, +44 (0)1473 632215
Michael Gardner, Chimera, The University of Essex, Adastral Park, Ipswich, IP5 3RE, +44 (0)1473 632274
Martin Hicks, Chimera, The University of Essex, Adastral Park, Ipswich, IP5 3RE, +44 (0)1473 632247
[email protected]
[email protected]
[email protected]
ABSTRACT In this paper, we describe some of the opportunities for using the emerging GRID infrastructure for collaborative analysis of large federated social science data sets.
Categories and Subject Descriptors I.6.7 [Simulation and Modelling]: Simulation Support Systems
General Terms Management, Performance, Design, Experimentation, Human Factors
Keywords Grid, e-Science, social-science, visualization tools, large data sets
1. CURRENT TRENDS A number of different trends are converging in the domain of quantitative behavioural and social sciences. These include the growth of ‘online’ large-scale but traditional social science and behavioural datasets; the increasing emphasis on longitudinal and mixed methods approaches; the capability to link data records from different datasets directly or to match records statistically by data fusion; the increasing mediation and observation of everyday life by pervasive information and communication technologies, which have the ability to capture linkable behavioural data; and finally the increasing focus on the distributed data production, warehousing, processing and analysis which constitute e-Science and the GRID.
2. ONLINE SOCIAL SCIENCE DATA ACCESS Thanks to the efforts of the ONS, the UK Data Archive(s), the ESRC and others there is an increasingly large amount of high
quality social and behavioural data that can be accessed ‘online’. Whilst in some cases this amounts to digital storage, access control and delivery, it increasingly involves real-time interaction with datasets so that users can select the subsets they require and either conduct online analysis or retrieve them for more complex local analysis. As yet, however, it is not possible to link or fuse these datasets ‘on the fly’, nor to conduct multiple comparative analyses without downloading every dataset and analysing each individually. It is also not possible to conduct case-level analyses on complete datasets (such as the census) for ethical and confidentiality reasons. In addition, there are datasets that could be accessed online if suitable cultural, social, legal and commercial processes can be put in place to ensure confidentiality, protection and, where appropriate, revenue streams. These include Government datasets (ONS, DWP, DVLA etc), data maintained by the private sector for market research and data synthesis, and data captured as part of normal business processes, such as by retailers and credit card companies, that are currently not available for research or other use.
3. MIXED METHODS AND LONGITUDINAL STUDIES Social and behavioural scientists are increasingly using mixed methods to gain better insights into the processes and behaviours they are researching. In some cases this means conducting surveys as well as qualitative studies; in others it involves ‘instrumenting’ people and organisations to collect temporal behavioural data which can be linked to ethnographic or survey data. In essence these are attempts to ‘triangulate’ the object of study in order to build a rich contextual picture by collecting different data on the same entity using different instruments and methods. This practice enables grounded theorising that can combine analysis of patterns (what is going on?) with explanations (why is it going on?). Such studies have the potential to create extremely large linked heterogeneous datasets.
4. DATA FUSION AND RECORD LINKAGE Whilst mixed methods provide different data on the same entities through data collection, record linkage does this by matching records from different datasets through common identifiers – such as UK National Insurance, National Health Service numbers or the tuple . This approach is used to derive new datasets that consist of linked
records such as the combination of survey with life event and medical history data, or telephony usage data with market research or geo-coded data. In contrast, data fusion involves the statistical matching of records along specific common dimensions (such as income, housing tenure, household type) using a range of algorithms. In general this approach is used to derive new synthetic datasets that support analysis not possible on the original datasets in isolation, because none of them contains the necessary combination of variables.
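The difference between record linkage and data fusion can be made concrete with a minimal sketch. Everything in it is an illustrative assumption rather than part of the infrastructure described here: the field names (`person_id`, `income`, `tenure`, `gp_visits`, `weekly_calls`), the toy records, and the choice of a simple nearest-neighbour rule as the matching algorithm. Linkage joins records that share an identifier; fusion donates a variable from the statistically closest record when no identifier is shared.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def link_records(left: List[Record], right: List[Record], key: str) -> List[Record]:
    """Record linkage: join records that share an exact common identifier."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def match_distance(a: Record, b: Record,
                   numeric_scales: Dict[str, float],
                   categorical: List[str]) -> float:
    """Distance over the common dimensions: scaled absolute difference for
    numeric variables, plus 1 for each categorical mismatch."""
    distance = sum(abs(a[v] - b[v]) / scale for v, scale in numeric_scales.items())
    distance += sum(a[v] != b[v] for v in categorical)
    return distance


def fuse_records(recipients: List[Record], donors: List[Record],
                 numeric_scales: Dict[str, float], categorical: List[str],
                 donate_vars: List[str]) -> List[Record]:
    """Data fusion (nearest-neighbour variant, an assumption): each recipient
    receives the donated variables of its statistically closest donor."""
    fused = []
    for recipient in recipients:
        best = min(donors, key=lambda d: match_distance(
            recipient, d, numeric_scales, categorical))
        fused.append({**recipient, **{v: best[v] for v in donate_vars}})
    return fused


# Hypothetical toy data: a survey, a medical extract sharing an identifier,
# and anonymous telephony records sharing only income and housing tenure.
survey = [
    {"person_id": "P1", "income": 21000, "tenure": "rent"},
    {"person_id": "P2", "income": 48000, "tenure": "own"},
]
medical = [
    {"person_id": "P2", "gp_visits": 1},
    {"person_id": "P3", "gp_visits": 4},
]
telephony = [
    {"income": 20000, "tenure": "rent", "weekly_calls": 35},
    {"income": 50000, "tenure": "own", "weekly_calls": 12},
]

linked = link_records(survey, medical, key="person_id")
fused = fuse_records(survey, telephony,
                     numeric_scales={"income": 10000.0},
                     categorical=["tenure"],
                     donate_vars=["weekly_calls"])
```

Here `linked` contains only the one person present in both sources, whereas `fused` is a synthetic dataset in which every survey respondent carries a `weekly_calls` value borrowed from a similar, but not identical, individual.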
5. THE OBSERVATION AND MEDIATION OF EVERYDAY LIFE BY ICT Everyday life in the UK is becoming increasingly monitored by pervasive information and communications technologies (ICTs), whether active (CCTV) or passive (call records, internet usage logs, GPS, etc). In addition, everyday activities such as social communication, leisure, entertainment, purchasing, selling, and financial/domestic management are becoming increasingly mediated by ICTs. With the advent of fixed and mobile broadband data access in the home and on the move, this trend can only continue. As a result, massive behavioural and transactional micro-social datasets or data-streams of significant academic, public and commercial research value are already being collected and owned by the ICT and service sectors.
6. DIGITAL UK DATA – A TRIANGULATION AND ANALYSIS INFRASTRUCTURE Implementing a mixed method data resource comprising heterogeneous data is not a trivial task even when there are few datasets and small sample sizes. As we have already noted, we see significant opportunities for the commercial, research and policy sectors in the integration and analysis of massive behavioural data with extant social datasets in order to triangulate the UK population. These datasets, collected for specific purposes (billing, customer service, policy research, service personalisation, public service administration), would have significant and unique value if differentially integrated to provide a multiscale resource that links macroscopic social attributes with microscopic, geographically situated and temporal behavioural data. This is illustrated in figure 1.
Figure 1: The Digital UK Data Grid: A triangulation infrastructure (diagram labels: financial transaction data; medical, govt, tax records; census data; PSTN call records; ONS/ESRC/EU sample and cohort surveys; Internet application usage logs (email, web, ecommerce, IM, a/v streaming); qualitative data (ESRC, commercial, private); CCTV, tolls, traffic camera data streams; mobile usage (voice and data) records, with location data; research/policy questions; real time commercial applications)
The requirements generated by this triangulation infrastructure lead us towards the UK e-Science GRID as a possible platform that can support the integration of large heterogeneous datasets in a real-time analysis, modelling and simulation environment. Usage of these integrated data views will require distributed computational resources in order to scale, and the nature of some of the service/usage scenarios implies the need for distributed ad hoc team working and collaboration. A number of different grid tools would be required to enable:
• Real time interaction with analysis/models/data visualisation/representations. This implies innovations in user interfaces, graphics, displays and in shared analysis and collaboration tools for human behaviour analysis (see [1] and [2]).
• Analytic/modelling techniques for the identification and analysis of geographical and temporal patterns, sequences, differences and clusters triangulated by aggregated personal and social attributes.
• Statistical tools that use GRID middleware to retrieve data from different sources, link variables, conduct analysis and report results (not the raw data) to the user. Exemplar analysis might use crosstabs and simple regression techniques, and might extend to co-analysis or remote presentation of results using Access GRID nodes.
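As a minimal sketch of the kind of statistical tool envisaged for the Digital UK Data infrastructure, the following shows a service that retrieves records from several sources, links them on a shared key, and returns only a cross-tabulation, never the underlying rows. The source functions (`fetch_survey`, `fetch_call_records`), the field names and the in-memory data are hypothetical stand-ins for remote data access nodes; no real GRID middleware API is used or implied.

```python
from collections import Counter
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]
Source = Callable[[], Iterable[Record]]


def fetch_survey() -> List[Record]:
    """Hypothetical stand-in for a remote survey data access node."""
    return [
        {"person_id": 1, "household_type": "single"},
        {"person_id": 2, "household_type": "family"},
        {"person_id": 3, "household_type": "family"},
        {"person_id": 4, "household_type": "single"},
    ]


def fetch_call_records() -> List[Record]:
    """Hypothetical stand-in for a remote call-record data access node."""
    return [
        {"person_id": 1, "usage_band": "high"},
        {"person_id": 2, "usage_band": "low"},
        {"person_id": 3, "usage_band": "low"},
        {"person_id": 5, "usage_band": "high"},
    ]


def crosstab_service(sources: List[Source], key: str,
                     row_var: str, col_var: str) -> Dict[Tuple[Any, Any], int]:
    """Link records from all sources on `key`, then return only cell counts.

    The linked case-level records stay inside this function; the caller
    receives aggregated results, not the raw data.
    """
    linked: Dict[Any, Record] = {}
    for fetch in sources:
        for record in fetch():
            linked.setdefault(record[key], {}).update(record)
    counts = Counter(
        (record[row_var], record[col_var])
        for record in linked.values()
        if row_var in record and col_var in record  # keep fully linked cases only
    )
    return dict(counts)


table = crosstab_service([fetch_survey, fetch_call_records],
                         key="person_id",
                         row_var="household_type",
                         col_var="usage_band")
```

In this toy case only three people appear in both sources, so `table` holds three linked cases spread over two cells. A real deployment would replace the two fetch functions with calls to whatever data access services the testbed exposes.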
7. A GRID INFRASTRUCTURE TO SUPPORT COLLABORATIVE ANALYSIS OF LARGE DATA SETS Figure 2 illustrates some of the grid components that would need to be developed, and figure 3 shows an example of a visualisation technique, based around a helix metaphor, that has been developed by Chimera.
Figure 2: Grid infrastructure (diagram labels: grid testbed; new call records; survey data; other data sets; data access nodes (OGSA-DAI, GRID-FTP); analysis nodes 1 … n, e.g. call-record categorisation; visualisation nodes 1 … n, e.g. helix charting; user workstations 1 … n)
Figure 3. Helix visualisation tool
Chimera already has a number of large (telecommunications) datasets based on ongoing collaboration with BT. These include BT call records, Internet usage logs, survey data and time-use diaries. Our intention is to build a grid testbed using OGSI toolsets (e.g. Globus GT3 and OGSA-DAI) and Chimera datasets, which could then be scaled to represent the UK population (based on demographic weightings) and used with ‘real’ large-scale datasets (see [3]). Once this infrastructure is in place we envisage developing a number of tools to facilitate collaborative analysis and team working. This would build on the work already done by Hicks et al [1] in comparing the roles of 3D representations in audio and audio-visual collaborations. In this research it is evident that the presence or absence of shared visual information directly influences collaborative problem solving. It is hoped that this research will generate a number of requirements for high-speed and distributed, large-scale computations, and radically new ways of sharing large data repositories. For example, there are clear indications that the presence of shared visual information enhances collaborative problem solving, while the inclusion of mixed representations does not necessarily contribute to the communication process.
8. SUMMARY In summary, we see significant opportunities for the commercial, research and policy sectors in the integration and analysis of massive behavioural data with extant social datasets in order to triangulate the UK population. Once this GRID platform is in place there is then an opportunity to develop collaborative data-analysis and visualisation tools to support team working, particularly using mixed methods and longitudinal studies.
9. REFERENCES
[1] Hicks, M., O’Malley, C., Nichols, S., and Anderson, B. (2003). Comparison of 2D and 3D representations for visualising telecommunications usage. Behaviour and Information Technology, May-June 2003, Vol. 22, No. 3, 185-201.
[2] Thomas, B. (2002). The Visualisation of Flow Data: From UK Telephone Calls to a General Method. Unpublished PhD Thesis, Dept Geography, University of Leeds.
[3] Paton, N., Atkinson, M.P., Dialani, V., Pearson, D., Storey, T., and Watson, P. (2002). Database Access and Integration Services on the Grid. UKeS-2002-03, http://umbriel.dcs.gla.ac.uk/Nesc/general/technical_papers/dbtf.pdf