ANNEX 4: Building Big Data into Program Evaluation
1. Initial considerations

Important differences in how big data is applied in commercial and in development research
Much of the experience with big data analytics is drawn from commercial applications, where it is often sufficient to determine that, for example, certain kinds of messages or marketing strategies affect consumers' search or buying behavior. In development research, by contrast, it is essential to understand why a certain relationship exists and whether, and how, it contributes to the broad development goals of a program. Two examples illustrate these differences of approach. First, many predictive models (see below) accept lower standards of validity and causal analysis because the data are continually being updated and the model revised. Most social science research, in contrast, requires higher standards of inference because the decisions based on the findings often involve major investments or operational commitments that are expensive and difficult to change, whereas many marketing strategies can easily be adjusted. Second, a major concern of development evaluation is to identify unintended outcomes, many of which can have serious consequences (for example, increases in domestic violence resulting from programs, such as microcredit, that are designed to promote women's economic empowerment). These concerns require social research to take a broader focus and to dig deeper.

Theoretical challenges to the application of big data in development evaluation
Letouzé et al. (2016) identify five theoretical challenges that must be addressed when applying big data in development evaluation:
• Comparability over time:
  o Algorithm dynamics: organizations such as Google are continuously updating the algorithms used to analyze third-party data. As these algorithms are usually not transparent, it is difficult to assess how comparable data collected at different points in time really are. Google Flu Trends is often cited as an example of this problem.
  o Interaction with users: the algorithms used in predictive modelling can be manipulated so that results respond to the interests of marketing or political campaigns. Given the lack of transparency in the algorithms, it is difficult for readers to assess the validity of the model. This issue is of course not unique to predictive modelling; it is recognized that the outcomes of techniques such as cost-benefit analysis can also be manipulated by adjusting underlying assumptions that are only described in an inaccessible annex.
• Nonhuman internet traffic (bots): bots are computer programs designed to post autonomously while masquerading as humans. Given that bots may generate as much as 60% of internet traffic, this can seriously affect results.
• Representativeness and the fallacy of large numbers: BD generates data from very large samples which often appear to cover a total population (e.g. all mobile phone users). Consequently it is often assumed that these samples must be broadly representative of the total population of interest. However, there are usually issues of selection bias, as not everyone has equal access to phones or other devices (for example, in many cultures women have more limited access than men), so responses will frequently provide biased estimates for the total population.
• Spatial autocorrelation: in many cases there are differences in response rates for different geographical sectors of the population. For example, response rates may be higher in better-off urban areas than in poorer rural areas.
• Attribution and spurious correlation: Letouzé et al. (2016) cite evidence that the larger the number of variables, the higher the risk of spurious correlations.

Predictive v retrospective models and the issue of transferability (external validity)
A limitation of most conventional evaluation designs is that they are retrospective: they measure changes that occurred between project launch (baseline) and the time when the evaluation was completed. This has several important consequences for the practical application of the findings. First, most projects are implemented in a dynamic and changing environment, so conventional evaluations assess how the program operated in an earlier context that no longer exists and may have significantly changed. Second, the retrospective focus has limited ability to predict how the program will operate in the future, when the context may again have changed. Third, retrospective evaluations are even more limited in their ability to predict how programs will operate in different contexts. BD analysts call this the problem of transferability, while conventional evaluators distinguish between internal validity (determining attribution in the context where the evaluation was conducted, which experimental designs do well) and external validity (generalizing findings to other contexts, which is a major weakness of experimental designs). This limitation is particularly serious for complex programs operating in rapidly changing environments.

The role of theory
Most conventional evaluation designs are based on a theoretical framework that helps identify indicators and hypotheses and guides the interpretation of findings. The theory may take the form of a theory of change or other theory-based model, or a set of hypotheses derived deductively and tested through an experimental or quasi-experimental design. Evaluators warn against the dangers of data mining, where statistically significant but meaningless relationships are produced by the laws of probability.¹ In contrast, Big Data implies a different approach to evaluation design and analysis and to the use of theory. In a provocative and much-cited article in Wired, Chris Anderson (2008) argued that the "data deluge makes the scientific method obsolete" and presages the "end of theory". Giving the example of biology, which he claims is relevant for many other areas where petabytes of data are available: "Petabytes allow us to say: 'Correlation is enough.' We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot." (p. 4) Subsequent discussion argues that this is an oversimplification: all data analytics is based on some kind of model, and in social, as opposed to natural, science contexts theory is required to define what kinds of variables will be included in the analysis and to give meaning to the findings. Nevertheless, operating with petabytes of data, large numbers of variables that interact in complex ways, and analytical power that could hardly have been imagined even a few years ago radically changes the approach to evaluation design and analysis. "In practice, the theory and the data reinforce each other. It's not a question of data correlations versus theory. The use of data for correlations allows one to test theories and refine them." (Bollier, D., 2010, cited in Letouzé 2016: 236)
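The data-mining risk flagged above (and in footnote 1) is easy to demonstrate. The following minimal Python sketch uses purely random, made-up data: it counts how many "statistically significant" correlations appear between an outcome and a growing number of completely unrelated variables. None of the relationships is real, yet the number of nominally significant correlations grows roughly in proportion to the number of variables screened.

```python
# Illustrative only: purely random data still yields "significant"
# correlations once many variables are screened (cf. footnote 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_obs = 200                                # hypothetical number of observations

for n_vars in (10, 100, 1000):
    outcome = rng.normal(size=n_obs)              # random "outcome"
    predictors = rng.normal(size=(n_obs, n_vars))  # unrelated "predictors"
    p_values = [stats.pearsonr(outcome, predictors[:, j])[1]
                for j in range(n_vars)]
    spurious = sum(p < 0.05 for p in p_values)
    print(f"{n_vars:5d} unrelated variables -> "
          f"{spurious} correlations significant at p < 0.05")
```

With a 5% significance threshold, roughly 1 in 20 unrelated variables will appear "significant" by chance alone, which is why theory-led variable selection or corrections for multiple comparisons matter when mining very large datasets.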
Distinguishing research, monitoring and evaluation
Many of the examples of the power of BD are taken from research (for example, biological and genetic research, or analysis of data on population movements or traffic flows). It is sometimes implied that the same principles and benefits can be directly transferred to program evaluation without thinking through the unique characteristics of program evaluation.
¹ Taleb 2013 (cited in Letouzé et al 2016) shows that as datasets grow larger, falsity grows faster than information.
Similarly, the distinction between monitoring and evaluation is often not clarified. The fact that large volumes of data can be collected cheaply and quickly on the achievement of program goals is quite distinct from assessing the extent to which changes can be attributed to the project intervention.

Clarifying the types of programs for which big data may be applicable
When discussing potential applications of big data and sophisticated data analytics it is important to delimit the kinds of programs for which big data may and may not be applicable. Table A 6.1 identifies eight dimensions in terms of which the applicability of big data can be assessed. For example, with respect to dimension 1, big data are more likely to be applicable for large, complex interventions than for small, simple projects. In this respect the well-known distinction between simple, complicated and complex programs (see Box A 6.1) may be helpful. While there is not a perfect match, in general complex and complicated programs are more likely to provide opportunities for the use of big data. Similarly, with respect to dimension 6, big data predictive models are more applicable for programs that will continue to operate after the initial proof of concept.
Table A 6.1 Examples of factors relating to potentially high and low applicability of Big Data

High applicability of BD
1. Large, "complex" interventions
2. Programs where conventional evaluation designs are considered to be methodologically weak
3. Programs that use easily measurable (and readily available) physical measurements such as climate change, urban growth or traffic patterns
4. Availability of BD indicators with high construct validity [indicators were collected for a purpose relevant to the evaluation]
5. Programs with a relatively long duration and where (real-time) time series data can be generated
6. Programs that will continue to operate after the initial proof of concept, so that prediction is possible
7. Programs where there are large numbers of potential variables that might affect outcomes and where there is no articulated theory of how outcomes are expected to be achieved
8. Where there are no political concerns about ownership, control, access or privacy
Low applicability of BD
1. Small, "simple" projects
2. Programs where conventional evaluation designs are considered perfectly adequate and there is no obvious need for a new approach
3. Programs that rely on social and behavioral indicators, such as domestic violence or community organization, where data is not readily available and would require special data collection
4. BD indicators with low construct validity [proxy indicators generated for a different purpose and where relevance is not clear]
5. Programs with a relatively short duration and where time series data cannot be generated or are not relevant
6. Proof-of-concept programs that will end after the initial program assumptions have been tested and prediction is not possible
7. Programs where a carefully articulated theory of change is required to understand the behavioral, socio-cultural and organizational processes through which outcomes are expected to be achieved
8. Where there are political concerns about ownership, control, access or privacy
In this respect it may be useful to introduce the widely used classification of programs into simple, complicated and complex (see Box A 6.1). Simple programs tend to be smaller and to have fewer components and simpler causal relations between inputs and outcomes; in general most of these will have relatively limited need for, or potential to use, big data. The same is true to some extent for complicated programs, which are larger but still have relatively simple causal patterns; there may be more, but still limited, need for BD. In contrast, complex programs tend to be large, with more moving parts and less clearly defined causal relations between inputs and outcomes, and it is here that the opportunities for big data are greatest.
Box A 6.1 Defining simple projects, complicated programs and complex interventions

Simple projects:
• include relatively simple "blue-print" designs that produce a standardized product
• follow a causal path that is relatively linear
• have defined start and end dates, making them time-bound
• have only a few objectives, but they are clearly defined
• define a target population that is usually relatively small
• have a well defined budget and resources.
Complicated programs:
• include a number of different projects, each with its own "blueprint"
• follow causal paths for different components and different objectives, but these are still relatively linear
• often do not document the process of project implementation well
• target a larger and more diverse population
• involve several different donors and national agencies
• may be implemented by different donors in slightly different ways
• set objectives in broader and less clearly defined terms
• are set without start and end dates, and so are less time-bound
• focus on the importance of program context
• merge funds into ministry budgets, making program expenditures difficult to estimate.
Complex interventions:
• merge into national or sector development policy, making specific program interventions difficult to identify
• follow non-linear causal paths, as there may be multiple paths to achieve an outcome, or the same set of inputs may produce different outcomes in different settings
• are delivered by multiple agencies, and components and services are not delivered in a uniform manner
• have emergent designs that evolve over time
• have program objectives that are difficult to define or are not specified
• have non-proportional relationships between inputs and outcomes.
(Note: the listed items are examples of how these programs are often set up; not all projects will contain all of these elements.) Source: Adapted from Bamberger, Rugh and Mabry (2012).
The need to integrate big, large and small data
Most evaluations that use big data will combine this with large and small data. Consequently, the use of BD should be considered as part of an integrated evaluation strategy and not as an isolated, stand-alone approach.
2. Ensuring BD-responsive evaluations follow evaluation methodological good practice

When planning the integration of BD it is important to ensure that the evaluation design addresses evaluation best-practice guidelines. Although some of the principles may need to be adapted to the unique characteristics of BD, they should serve as a checklist: where it is believed that BD can deviate from them in particular circumstances, the deviation should be recognized and justified, and otherwise a BD-responsive evaluation should follow the same methodological principles as any other evaluation. We discuss best practice and challenges with respect to four categories: data collection; sample selection; evaluation design; and data analysis, dissemination and use.

2.1 Data collection best practice and challenges
Some of the challenges that must be addressed include:

a. Mono-method bias: many evaluations rely on a single approach to data collection (for example, paper questionnaires, online surveys, focus groups or case studies). Every method has strengths and weaknesses, so reliance on a single method increases the risk of misleading or one-dimensional measures of multi-dimensional constructs.

b. Preference for numerical indicators: related to the previous point is the preference for quantitative, numerical indicators over qualitative indicators. Numerical data is easier to collect and analyze, but exclusive reliance on it runs the danger of presenting a one-dimensional picture of phenomena that must be assessed in terms of both quantity and quality. For example, many educational assessments measure the number of schools, teachers and so on but fail to assess the quality of education; in some cases the schools are not even operating much of the time, or the new teaching resources never arrived or are not being used. This is again a challenge for big data, as much of the data is numerical. A related challenge is that many programs seek to produce behavioral change (as well as numerical outcomes), and behavior is difficult, though not impossible, to measure numerically.

c. Construct validity: data are used to build indicators that are intended to measure constructs (inputs, processes, outputs, outcomes and impacts). Constructs are abstract concepts (poverty, well-being, vulnerability, domestic violence, school performance, health of ecosystems), and the validity of the analysis and of the evaluation findings depends to a significant degree on how well the individual indicators, and the way they are treated, conform to the underlying construct. While conventional evaluations must often rely on proxies that do not adequately capture the underlying construct, the risks are potentially greater for BD because many of the indicators used were generated for a completely different purpose. A simple validation check is sketched below (after this list).

d. Collecting data on difficult-to-reach groups and those unwilling to be interviewed: a weakness of many evaluation designs is that, due to cost and time constraints, some of the more remote or difficult-to-reach groups (drug users, people who are HIV-positive, sex workers, gang members, illegal immigrants) may be left out of the study. In other cases emergency or security situations make it difficult to reach important sectors of the sample population. In some cases BD may have an advantage in reaching these groups, as satellite images, tweets, analysis of social media or GPS tracking may make it easier to cover them, and a number of ICT techniques also show promise for reaching these groups. However, BD may face other challenges, as remote data collection does not provide the same opportunities to track difficult-to-reach groups as are available to researchers on the ground.
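As a minimal, hypothetical illustration of the construct-validity check mentioned in point c: where both a big data proxy and a direct survey-based measure of the same construct are available for a validation subsample, their agreement can be assessed before the proxy is relied on. The file and column names below (validation_subsample.csv, nightlights_index, survey_poverty_rate) are invented for the sketch.

```python
# Hypothetical construct-validity check: how well does a big data proxy
# track a survey-based measure of the same construct on a validation set?
import pandas as pd
from scipy import stats

# Assumed columns: 'nightlights_index' (big data proxy) and
# 'survey_poverty_rate' (direct survey measure), e.g. for ~100 districts.
validation = pd.read_csv("validation_subsample.csv")  # hypothetical file

r, p = stats.pearsonr(validation["nightlights_index"],
                      validation["survey_poverty_rate"])
rho, _ = stats.spearmanr(validation["nightlights_index"],
                         validation["survey_poverty_rate"])

print(f"Pearson r = {r:.2f} (p = {p:.3f}); Spearman rho = {rho:.2f}")
# A weak correlation signals low construct validity: the proxy was
# generated for another purpose and may not measure what is needed.
```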
2.2 Sample selection challenges and best practice

a. One of the biggest challenges for evaluations when it is not possible to use experimental designs (which is the vast majority of evaluations) concerns the different sources of selection bias. There are two main causes of selection bias: (a) how participants are selected and (b) how the sample is selected for the evaluation. With respect to the first, the two most common causes of participant selection bias are self-selection and administrative selection of beneficiaries. In the first case, subjects who self-select tend to have attributes that make them more likely to succeed, while in the second case planners or implementing agencies tend to select individuals, communities or institutions that are most likely to be successful (a simple simulation of this is sketched after this list). With respect to bias in the sample selection process there are a number of factors:
• The sample frame (list/directory) that is used may not include all units in the population. For example, illegal squatters may not be included in the list of addresses used to select the sample.
• The sample selection procedure may introduce bias. For example, if the intended respondent is not at home the interviewer may interview a different household member or may find a replacement from a different household. In both cases this may introduce a systematic sample selection bias against people who are less likely to be at home (e.g. long-distance truck drivers or fishermen).

b. A second bias may result from how respondents are defined. Many surveys interview the person defined as the "household head", which will often mean that wives or other household members are under-represented.
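The effect of self-selection described above can be illustrated with simulated data. In this minimal sketch (all numbers invented), individuals with higher unobserved "motivation" are both more likely to enrol in a program and more likely to improve anyway, so a naive comparison of participants with non-participants overstates the program effect.

```python
# Illustrative simulation of self-selection bias (all values invented).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_effect = 2.0                      # the program's real impact

motivation = rng.normal(size=n)        # unobserved characteristic
# More motivated people are more likely to join the program...
enrolled = rng.random(n) < 1 / (1 + np.exp(-motivation))
# ...and would improve more even without it.
outcome = 5.0 + 1.5 * motivation + true_effect * enrolled + rng.normal(size=n)

naive_estimate = outcome[enrolled].mean() - outcome[~enrolled].mean()
print(f"True effect: {true_effect:.2f}")
print(f"Naive participant vs non-participant difference: {naive_estimate:.2f}")
# The naive estimate exceeds the true effect because enrolment and
# outcomes are both driven by the same unobserved motivation.
```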
2.3 Evaluation design

a. Ensure that all potential evaluation designs are considered
Many evaluators have a preference for a particular evaluation design (e.g. RCTs, regression discontinuity, focus groups, case studies) which they try to apply to all evaluations. However, there is no one-size-fits-all evaluation design that is appropriate in all situations. In fact, the choice of evaluation design is determined partly by the context and the characteristics of what is being evaluated, and partly by the questions that the evaluation is asked to address. Stern et al. (2012) identify four key sets of evaluation questions:
• To what extent can a specific (net) impact be attributed to the intervention?
• Did the intervention make a difference?
• How has the intervention made a difference?
• Will the intervention work elsewhere?
They argue that each of these questions requires the use of a different evaluation design. Table A 6.2 lists the six most commonly used impact evaluation designs and adds two additional designs that must be included for the evaluation of complex development programs. Evaluators should review all of the options and decide which is the most appropriate given the characteristics of the program being evaluated, the context in which it operates and the principal evaluation questions that must be addressed.
Table A 6.2 Most common approaches to impact evaluation in conventional evaluations

Conventional evaluation designs

1. Experimental
Specific variants: RCTs; quasi-experiments; natural experiments
Basis for causal inference: counterfactuals; the co-presence of cause and effects

2. Statistical
Specific variants: statistical modelling; longitudinal studies; econometrics
Basis for causal inference: correlation between cause and effect or between variables, influence of (usually) isolatable multiple causes on a single effect; control for 'confounders'

3. Theory-based
Specific variants: causal process designs (theory of change, process tracing, contribution analysis, impact pathways); causal mechanism designs (realist evaluation, congruence analysis)
Basis for causal inference: identification/confirmation of causal processes or 'chains'; supporting factors and mechanisms at work in context

4. Case-based
Specific variants: interpretative (naturalistic, grounded theory, ethnography); structured (configurations, process tracing, congruence analysis, QCA, within-case analysis, simulations and network analysis)
Basis for causal inference: comparison across and within cases of combinations of causal factors; analytic generalisation based on theory

5. Participatory
Specific variants: normative designs (participatory or democratic evaluation, empowerment evaluation); agency designs (learning by doing, policy dialogue, collaborative action research)
Basis for causal inference: adoption, customisation and commitment to a goal; validation by participants that their actions and experienced effects are 'caused' by the programme

6. Review and synthesis
Specific variants: meta-analysis; narrative synthesis; realist synthesis
Basis for causal inference: accumulation and aggregation within a number of perspectives (statistical, theory-based, ethnographic etc.)

Complexity-responsive evaluation designs

7. Holistic approaches: systems analysis

8. Unpacking complex programs: programs are unpacked into a set of components, each of which can be evaluated using conventional evaluation designs (designs 1-6). The findings of the individual evaluations are then reassembled to assess the overall program impact.

Source: Vaessen, Raimondo and Bamberger (2016), adapted from Stern et al. (2012)
b. Trajectory analysis
Program effects can occur over different periods of time and evolve according to different trajectories (see Figure A 6.1). While some projects produce steadily increasing outcomes over the project lifetime (Scenario 2), in other cases effects may reach a maximum and then gradually decline (Scenario 3). This often happens when projects require a high level of maintenance (for example irrigation canals and pumps): when funding for maintenance is no longer available, or maintenance ceases to be a priority (for example after donor involvement ends), it is common for maintenance to deteriorate and for the level and quality of services to decline. In other cases most effects are produced at a particular point in time (Scenario 1), for example when a road is completed. Understanding the expected trajectory of outcomes is critical for determining when the evaluation should be conducted.
Figure A 6.1 Trajectory analysis: different scenarios for how program effects evolve over time
[Figure: effect size (small to large) plotted against years, from project start to after project completion, for three scenarios. Scenario 1: immediate effect with no decrease. Scenario 2: effect increases gradually over time. Scenario 3: effect increases up to a certain point in time and then steadily decreases.]
Source: adapted from Bamberger, Rugh and Mabry 2012: 204
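For readers who want to reproduce the three trajectories sketched in Figure A 6.1, the following minimal matplotlib sketch draws the three scenarios over a project timeline. The curve shapes and the project-completion marker are invented purely for illustration.

```python
# Illustrative re-creation of the three trajectory scenarios in Figure A 6.1
# (curve shapes are invented for illustration, not drawn from data).
import numpy as np
import matplotlib.pyplot as plt

years = np.linspace(0, 10, 200)          # project start to 10 years after
scenarios = {
    "Scenario 1: immediate effect, no decrease": np.where(years < 1, years, 1.0),
    "Scenario 2: effect increases gradually": years / 10,
    "Scenario 3: effect rises then declines": np.where(
        years < 4, years / 4, np.clip(1 - (years - 4) / 8, 0, None)),
}

fig, ax = plt.subplots()
for label, effect in scenarios.items():
    ax.plot(years, effect, label=label)
ax.axvline(2, linestyle="--", color="grey")   # e.g. project completion (illustrative)
ax.set_xlabel("Years since project start")
ax.set_ylabel("Effect (small to large)")
ax.legend()
plt.show()
```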
c. Complexity-responsive evaluations
When projects are considered "complex" (in terms of the five dimensions of complexity discussed earlier) it is no longer possible to evaluate outcomes using conventional evaluation designs (see Chapter 2 Section 2.1). In these cases a complexity-responsive evaluation design will normally be required. Designs 7 and 8 in Table A 6.2 describe two of the most common complexity-responsive designs.

d. Sustainability analysis
For operational reasons, many impact evaluations are conducted around the time that project implementation is complete and the program moves into the operational phase. The reason is that many development agencies only fund the implementation phase, and for accountability purposes they require an evaluation of the project phases they funded. The implementation phase often ends soon after the schools have been constructed or the road or irrigation system has become operational. An evaluation conducted at this point is therefore too early to assess whether the program can continue to deliver services and whether the financial, institutional, organizational and political mechanisms are in place to ensure that it will continue to do so. Consequently a "sustainability-responsive" design should be put in place.

e. Equity-focused evaluation
One of the central objectives of most international development agencies is to promote equity: to ensure that program benefits reach the poorest and most vulnerable groups and that the programs they support contribute towards the achievement of broader equity goals. However, many evaluations only measure aggregate outcomes (for example, that on average a higher proportion of children attend school or that the proportion of the population below the poverty line has been reduced). There is an extensive body of research showing that it is possible, and in fact common, to achieve aggregate improvements while the gap between the poorest (for example, the bottom 20%) and the rest of the population has not been reduced or has even increased (Bamberger and Segone pp XX). Consequently many evaluations should incorporate an equity focus (Bamberger and Segone 2011).

f. Recognizing the strengths and limitations of RCTs
Experimental designs (usually randomized control trials) are considered to be among the strongest evaluation designs, due among other things to their ability to control for selection bias. However, RCTs have a number of important limitations (summarized in Table A 6.3), classified here into design, data collection, and analysis and utilization issues. One of the key issues that BD proponents emphasize is the problem of transferability. RCTs test for statistically significant differences in outcome variables between the project and comparison groups between the time of project launch (baseline) and some point late in the project cycle. If statistically significant differences are found, they relate only to a specific time period in the past, which was influenced by a particular set of conditions applying only to that period and that specific population. It is argued that the findings cannot be generalized to predict how a similar project would work in the future or in different contexts. BD uses techniques such as systems analysis to increase the transferability of findings to other contexts.
Table A 6.3 Methodological issues affecting experimental impact evaluation designs (RCTs)

Evaluation design issues
1. Limited construct validity: many strong evaluations use secondary data sources and must rely on proxy variables that may not adequately capture what is being studied, so findings can be misleading.
2. Decontextualizing the evaluation: conventional IE designs ignore the effect of the local political, economic, institutional, socio-cultural, historical and natural environmental context. These factors will often mean that the same project has different outcomes in different communities or local settings.
3. Ignoring the process of project implementation (the "black box" problem): most IEs use a pre-test/post-test comparison and do not study how the project is actually implemented. If a project does not achieve its objectives it is not possible to determine whether this is due to design failure or implementation failure.
4. Designs are inflexible and cannot capture or adapt to changes in project design and implementation or in the local context: IEs repeat the application of the same data collection instrument, asking the same questions and using the same definitions of inputs, outputs, outcomes and impacts. It is very difficult for these designs to adapt to the changes which frequently occur in the project setting or implementation policy.
5. Hard to assess the adequacy of the sampling frame: evaluations frequently use the client list of a government agency as the sampling frame. This is easy and cheap to use, but the evaluation frequently ignores the fact that significant numbers of eligible families or communities are left out, and these are usually the poorest or most inaccessible.
6. No clear definition of the time frame over which outcomes and impacts can be measured: the post-test measurement is frequently administered at a time defined by administrative rather than theoretical considerations. Very often the measurement is made when it is too early for impacts to have been achieved, and it may wrongly be concluded that the project did not have an impact.
7. Difficult to identify and measure unexpected outcomes: structured surveys can only measure the expected outcomes and effects and are not able to detect unanticipated outcomes and impacts (positive or negative).
Data collection issues
8. Reliability and validity of indicators: many strong designs only use a limited number of indicators of outcomes and impacts, almost all of which are quantitative.
9. Inability to identify and interview difficult-to-reach groups: most QUANT data collection methods are not well suited to identifying and gaining the confidence of sex workers, drug users, illegal immigrants and other difficult-to-reach groups.
10. Difficult to obtain valid information on sensitive topics: structured surveys are not well suited to collecting information on sensitive topics such as domestic violence, control of household resources, and corruption.
11. Lack of attention to contextual clues: survey enumerators are trained to record what the respondent says and not to look for clues such as household possessions, evidence of wealth, interaction among household members or evidence of power relations to validate what is said.
12. Often difficult to obtain a good comparison group match: adequate secondary data for propensity score matching is only infrequently available, and control groups must often be selected on the basis of judgment and usually very rapid visits to possible control areas.
13. The vanishing control group: control groups may become integrated into the project or may disappear through migration, flooding or urban renewal.
14. Lack of adequate baseline data: a high proportion of evaluations are commissioned late in the project and do not have access to baseline data. Many IEs collect baseline data but usually only collect QUANT information.
Analysis and utilization issues
15. Long delay in producing findings and recommendations that can be used by policy makers and other stakeholders: conventional IEs do not produce a report or recommendations until the post-test survey has been completed late in the project cycle or when the project has ended. By the time the report is produced it is often too late for the information to have any practical utility.
16. Difficult to generalize to other settings and populations: this is a particular challenge for RCTs and similar designs that estimate average effects by controlling for individual and local variations.
17. Identifying and estimating the influence of unobservables: participants who are self-selected, or who are selected by an agency interested in ensuring success, are likely to have unique characteristics that affect, and usually increase, the likelihood of success. Many of these are not captured in structured surveys, and consequently positive outcomes may be due to these pre-existing characteristics rather than to the success of the project.
3. How big data responsive evaluation designs relate to conventional evaluation designs

There are a number of different ways in which BD-responsive designs can relate to conventional impact evaluation designs:

a. Big data can incorporate additional data into a conventional evaluation design. For example, electronic transfer data from ATMs, or satellite images of the number of lorries travelling to and from local markets, can be used to complement survey and key informant data on poverty trends in a particular region.

b. Big data can strengthen a conventional evaluation design. For example, satellite images can be used to select comparison groups for an evaluation of the effect of interventions in maintaining forest coverage in protected environmental areas (a simplified sketch of this approach follows this list).

c. Big data can replace a conventional evaluation design with a design that would not be possible without Big Data collection and analytics. For example, BD can collect and analyze billions of tweets, vehicle movements, or population movements via GPS tracking. It is also possible to analyze much larger numbers of variables than would be possible with conventional data analytics.
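As an illustration of option (b), a comparison group can be strengthened by matching treated and untreated areas on covariates derived from satellite imagery. The sketch below is a simplified, hypothetical example rather than the method used in any particular evaluation: the data file and column names (areas.csv, protected, baseline_forest_cover, slope, distance_to_road, forest_change) are invented. It fits a propensity score with scikit-learn and pairs each protected area with its nearest-score unprotected neighbour.

```python
# Hypothetical propensity score matching sketch: pairing protected areas
# with similar unprotected areas using satellite-derived covariates.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Assumed columns: 'protected' (0/1), satellite-derived covariates, and a
# later outcome 'forest_change' (change in forest cover).
areas = pd.read_csv("areas.csv")                     # hypothetical file
covariates = ["baseline_forest_cover", "slope", "distance_to_road"]

# 1. Estimate the propensity score (probability of being a protected area).
model = LogisticRegression(max_iter=1000).fit(areas[covariates], areas["protected"])
areas["pscore"] = model.predict_proba(areas[covariates])[:, 1]

# 2. Match each protected area to the unprotected area with the closest score.
treated = areas[areas["protected"] == 1]
control = areas[areas["protected"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])
matched_control = control.iloc[idx.ravel()]

# 3. Compare later forest-cover change between the matched groups.
diff = treated["forest_change"].mean() - matched_control["forest_change"].mean()
print(f"Matched difference in forest-cover change: {diff:.3f}")
```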
4. Potential applications of big data and ICTs in program evaluation
Table A 6.4 Potential ways that Big Data and ICTs can strengthen program evaluation

1. Initial diagnostic studies
Big data: a. Satellite images; b. Social media and internet queries

2. Designing evaluation of complex programs
Big data: a. Predictive designs to replace or complement regression-based "backward looking" designs; b. Systems modelling; c. Sociometric analysis
ICT: a. Modelling complex systems and causal pathways; b. On-line theory of change

3. Data collection
Big data: a. Satellite images; b. Twitter; c. Analysis of other forms of social media; d. Remote sensors; e. Integrated data platforms
ICT: a. Smart phones; b. Regular phones; c. Drones; d. Audio/video recording; e. Biometric data; f. Wearables; g. GPS mapping; h. SMS and self-reporting tools that can help reach groups in dangerous areas; i. Incident reporting via phone and internet

• Process analysis
Big data: a. Real-time feedback on project implementation (dynamic data platforms); b. Satellite tracking of population movements and growth of human settlements
ICT: a. Smart phone video and audio recording during meetings, work groups etc.; b. Web-based M&E platforms allow for better documentation of processes

• Collecting qualitative data
Big data: a. Analysis of text-based data
ICT: a. Audio and video recordings

• Collecting contextual data
Big data: a. Satellite images can track physical changes over large areas; b. Crowdsourcing provides feedback on natural disasters, political protests and the spread of disease

• Quality control of data collection
ICT: a. GPS-enabled phones/tablets can check the location of interviewers; b. Randomly activated audio recorders can listen in to interviews

• Monitoring behavioral change
ICT: a. Video and audio recordings at project locations, in the community or in households improve the capacity to monitor behavior directly; b. Socio-metric analysis through smart phones

4. Sample selection
Big data: a. Twitter and social media; b. Analysis of phone and financial transaction records; c. Large-scale surveys of household purchases (using smart phones to record food labels etc.); d. Using satellite images for area sampling; e. Combining satellite images with ground data for better calibration
ICT: a. Random routes; b. Rigorously selected automatically dialed samples (combined with human follow-up)

5. Data analysis and interpretation
Big data: a. Big data analytics; b. Management of n-dimensional data sets; c. Developing ontologies for the collection of multiple sources of data on a common theme; d. Analysis of complex QCA case studies
ICT: a. Rapid data analysis and feedback with smart phones
Table A 6.5 Case studies illustrating the different evaluation design options (designs refer to Table A 6.2)

1A. Experimental: randomized control trial using high-frequency metering data for high-quality information about energy consumption and demand. Example: rural solar micro-grids in India [Poverty Action Lab]

1A. Experimental: randomized control trial using savings and transaction data combined with survey and telemetric tablet data. Example: tablet-based financial education in Colombia [Poverty Action Lab]

1B. Enhanced quasi-experimental design: pretest-posttest comparison group design using propensity score matching to strengthen the comparison group. Example: GEF protected areas evaluation [GEF]

1C. Natural experiment: using changes in search query volume to assess the effects of a major increase in cigarette taxes on smoking in the US; Canada, which did not have a similar increase, was used as the comparison group. Example: effects of a government tax increase on smoking (Ayers 2011, cited in Letouzé et al 2016: 237-8)

2. Statistical modelling: evaluating causal interactions between labor market shocks and internal mobility. Example: understanding labor market shocks using mobile phone data [World Bank, Latin American region]

3. Theory-based evaluation

4. Case-based evaluation: QCA country-level data assessing factors determining impacts of women's economic empowerment programs at the national level. Example: An empowered future: corporate evaluation of UN Women's contribution to women's economic empowerment [UN Women Independent Evaluation Office 2014]

5. Participatory evaluation

6. Review and synthesis approaches

Complexity-responsive evaluation designs
7. Holistic approaches (systems analysis)
8. Unpacking complex programs