Impact Evaluation: How far have we come?

Wednesday, April 10, 2024


The impact evaluation challenge of the evaluation community.

When I embarked on my dissertation on impact evaluation 15 years ago, I observed the new trend of rigorous impact evaluation in international development. Criticisms of aid effectiveness have been raised for decades; the Paris Declaration on Aid Effectiveness (2005), for example, called for evidence-based delivery of development interventions. Impact evaluation has become a valuable tool to provide greater accountability and determine the true effectiveness of development interventions. However, the international evaluation community has not yet reached consensus on the ideal tools and methodologies to be applied in these impact evaluations.

Startling the evaluation community.

In 2006, the Center for Global Development (CGD) published the report “When Will We Ever Learn?” to make things seemingly clear, calling for more, and more rigorous, impact evaluations. The CGD referred to the low quality of evaluations to date, which lacked the rigour needed to make causal statements. The new impact evaluations would need to be able to directly attribute the “net effect” to a development programme. The report referred to the medical model of clinical trials as the standard: “no responsible physician would consider prescribing medications without properly evaluating their impact or potential side effects” (p.3). This experimental approach was to be applied to evaluations of development interventions. The so-called randomized controlled trials would randomly allocate the subjects to treatment and comparison groups – thus creating a counterfactual: “Impact evaluation asks about the difference between what happened with the program and what would have happened without it.” (p.12). This “net effect” typically focuses on a single number (going up or down) and its confidence interval – which opens the door for new criticism and suggestions for alternative methods. The evaluation community responded swiftly: the bi- and multilateral Network of Networks for Impact Evaluation (NONIE) was formed to provide guidance on impact evaluation, not just on the net effect of experimental trials, but on an integrated methodological approach. The International Initiative for Impact Evaluation (3ie) was launched to develop evidence and promote a repository of impact evaluations.
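To make the “net effect” of a randomized controlled trial more concrete, here is a minimal sketch in Python. It is not taken from the CGD report: the outcomes are simulated rather than real programme data, and the function names (net_effect, simulate_trial), the sample size, the assumed true effect of 2.0 and the normal-approximation confidence interval are all illustrative assumptions.

```python
import random
import statistics


def net_effect(treated, comparison, z=1.96):
    """Difference in mean outcomes with an approximate 95% confidence interval."""
    diff = statistics.mean(treated) - statistics.mean(comparison)
    # Standard error of a difference between two independent sample means.
    se = (statistics.variance(treated) / len(treated)
          + statistics.variance(comparison) / len(comparison)) ** 0.5
    return diff, (diff - z * se, diff + z * se)


def simulate_trial(n=2000, true_effect=2.0, seed=0):
    """Hypothetical trial with simulated outcomes (an assumption, not real data)."""
    rng = random.Random(seed)
    treated, comparison = [], []
    for _ in range(n):
        baseline = rng.gauss(10.0, 3.0)  # outcome without the programme
        # Random allocation: a coin flip decides who receives the programme.
        if rng.random() < 0.5:
            treated.append(baseline + true_effect)
        else:
            comparison.append(baseline)
    return treated, comparison


treated, comparison = simulate_trial()
estimate, (low, high) = net_effect(treated, comparison)
print(f"net effect: {estimate:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The result is exactly the kind of single number with a confidence interval described above. It says whether the outcome went up or down on average, but nothing about why or how the change came about.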

The medical model revisited

Experiments in medicine also have shortcomings. For example, the personalized medicine movement and the focus on rare drugs seek ways to generate evidence beyond experiments. Therefore, a critical use of experiments is equally important in impact evaluation.

The continuance of rigorous impact evaluation.

More recently, several multilateral organizations have launched impact evaluation strategies. For example, UNICEF created a Strategy and Action Framework on Evaluation of Impact (2022–2025), UNHCR is about to publish its strategy, and WFP launched the WFP Impact Evaluation Strategy (2019–2026). The terminology of the WFP strategy is similar to that of the CGD. The strategy calls for “robust evidence” (p.4) and “rigorous impact evaluation” (p.15), measuring the “net effect” of an intervention to understand whether something works (p.6). It calls on the “academic research organizations” (and not evaluation shops) to provide expertise in impact evaluation (p.24). Among others, the strategy refers to:

the 3ie (see above);
the Abdul Latif Jameel Poverty Action Lab (J-PAL). J-PAL’s founders Esther Duflo and Abhijit Banerjee were recipients of the Nobel Prize in Economic Sciences for using experimental methodology in international development; and
the World Bank’s Development Impact Evaluation (DIME) unit, which has excelled in conducting experimental evaluations across the world.

WFP’s approach is a bit more nuanced, as it also refers to the WFP Evaluation Policy (2016–2021), which aligns the definition of impact evaluation with the OECD DAC definition of impact as “assessments of the positive and negative, direct or indirect, intended or unintended changes in the lives of affected populations in receipt of WFP interventions” (p.8). It also aims to harness the “best possible tools for capturing and analysing data” about “what works best, how, and for whom” (p.15).

However, the document then focuses on “counterfactual” and “net effect” – terms that resemble the experimental definition of impact, as used by the CGD a decade earlier and promoted by DIME and J-PAL. In short, there seem to be different impact evaluation strands intertwined in a somewhat unresolved way. It would be important to clarify the exact stance in the following iteration of the WFP strategy.

OECD DAC Glossary: Impact

The higher-level effects of an intervention’s outcomes. The ultimate effects or longer-term changes resulting from the intervention. Such impacts can include intended and unintended, positive or negative higher-level effects.

How far have we come?

Over the past 15 years, the impact evaluation debate has not been fully settled, but perspectives have converged on one point: any (experimental) impact evaluation benefits from complementary methodologies that increase its relevance and use. Based on my analysis of the impact evaluation debates, I suggest the following good practices:

Weigh different possible impact evaluation tools: Ideally, each impact evaluation begins by asking which tools are suitable. Consider not just the experimental approach, but the full set of social science methodologies that contribute to causal analysis.
Evaluate early: Make sure that evaluations start early, ideally in the planning phase of interventions, and not during or even after implementation. This also ensures that the evaluators can collect baselines and adjust evaluation processes, if needed.
Understand impact (qualitatively): Every experimental or quantitative evaluation includes qualitative components. For example, the evaluator needs to (qualitatively) clarify evaluation questions, assess prior knowledge, interpret findings, etc. Evaluators need to have interpretive-qualitative skills when understanding and assessing the impact of an intervention.
Expand the evidence base: An experimental impact evaluation primarily answers the “what” question, but not the “why” and “how”. Therefore, any quantitative impact evaluation benefits from integrating qualitative tools to provide a richer picture of impact. Using multiple tools from the methodological toolbox in tandem ensures more robust evaluation findings than solely relying on a single method.
Ensure relevance: By combining impact evaluation tools, evaluators make their evaluations more relevant to stakeholders. Applying impact evaluation findings to other contexts or moving from a pilot to a larger implementation requires a strong theory of change – a task that is qualitative in nature.
Create multi-methods evaluation teams: Combining skill sets on an impact evaluation team makes it possible to navigate the various methods.

By following these good practices, impact evaluations will more likely make an impact.

Evaluation Gap Working Group (2006). When will we ever learn? Improving lives through impact evaluation. Washington, DC: Center for Global Development.

Leeuw, Frans; Vaessen, Jos (2009). Impact evaluations and development: NONIE guidance on impact evaluation. Washington, DC: World Bank Group.

OECD Development Assistance Committee (2002). Glossary of key terms in evaluation and results based management.

Kahlert, Rahel (2013). Randomized controlled trials to evaluate impact: Their challenges and policy implications in medicine, education and international development. Texas Digital Library.

United Nations Children’s Fund (UNICEF). UNICEF Evaluation of Impact: Strategy and Action Framework 2022–2025. New York: UNICEF.

World Food Programme. WFP Impact Evaluation Strategy (2019–2026).


Rahel Kahlert
Senior Evaluation Officer IAEA
Austria

Dear Daniel!
I enjoyed reading your thoughtful comments. We agree that the CGD's report is based on a particular scientific paradigm, which Vedung (2010) called the first scientific wave.
What surprised me was that the CGD report, although based on an old paradigm, startled the evaluation community so widely. Evaluation societies felt an obligation to respond. In this sense, the CGD report positively influenced the further development of more robust qualitative (causal) methods in evaluation. Evaluators developed and refined theory-based approaches, process tracing and contribution analysis, and thus made a positive contribution to the evaluation field. Without the CGD report, we (myself included) probably would not have been part of such a flourishing process of tackling causality in evaluation.
Daniel Ticehurst
Monitoring > Evaluation Specialist freelance
United Kingdom
Dear Rahel,

Thanks for posting this blog about how far "we" have come on impact evaluation. Let me be terse with my answer: not much, if at all. And for the following three reasons:
1. CGD's "When Will We Ever Learn" (WWWEL) is a throwback to Vedung's first scientific wave of evaluation - Vedung, E. (2010) Four Waves of Evaluation Diffusion, Evaluation, Sage Publications, 16, pp. 263-277. During the 1960s and even earlier, advanced evaluative thinking and practice was driven by a notion of scientification of public policy and public administration. It was argued this would make government more rational, scientific and grounded in facts. Its technocratic thrust sought to isolate public policy decisions from the messy, complex world we live in. Evaluation was to be performed by professional academic researchers (often masquerading as evaluators). Spitting roast for the labs and units you list, and many others. Towards the mid-1970s, however, confidence in experimental evaluation faded. Voices started communicating how Evaluation should be more diverse and inclusive. Those other than academic researchers should be involved. Ring any bells for today's debates on de-colonisation, localisation and Indigenous Evaluation?
2. CGD's self-serving basic thesis:
- "persistent shortcomings in our knowledge of the effects of social policies and programs reflect a gap in both the quantity and quality of impact evaluations."
- the authors argued: "An 'evaluation gap' has emerged because governments, official donors, and other funders do not demand or produce enough impact evaluations and because those that are conducted are often methodologically flawed." They ascribe the evaluation gap to the public good nature of impact measurement; and
- "that governments and development agencies are better at monitoring and process evaluations than at accountability or measuring impact". This may be so, but monitoring, long neglected by the evaluation community, is far from well done as practiced by most governments and development agencies, and is deliberately held down as a routine reporting process (pers. comm., Michael Quinn Patton, April 2024).
James Morton, in his 2009 paper "Why We Will Never Learn", provides a wonderfully lettered critique of the above: the Public Good concept is a favourite resort of academics making the case for public funding of their research. It has the politically useful characteristic of avoiding blame. No one is at fault for the ‘evaluation gap’ if evaluation is, by its very nature, something that will be underfunded. Comfortable as this is, there are immediate problems. For example, it is difficult to argue that accountability is a public good. Why does the funding agency concerned not have a direct, private-good interest in accountability?
Having effectively sidelined Monitoring and Processes, WWWEL goes on to focus, almost entirely, on measuring outcomes and impact. This left the "monitoring gap" conveniently alone. While WWWEL itself avoided any discussion of methodologies (randomised control trials, quasi-experimental double-difference, etc.), many of the discussions it encouraged were of the abstruse, even semantic, kind that dominates technical debates about impact measurement.
3. Pawson and Tilley's exposé, in their masterful 1997 publication "Realistic Evaluation", of the intrinsic limits of experimentalists and RCTs: a narrow range of use that stems from deficient external validity. They challenge the orthodox view of experimentation: the construction of equivalent experimental and control groups, the application of interventions to the experimental group only, and comparisons of the changes that have taken place in the two groups as a method of finding out what effect the intervention has had. Their position throws into doubt experimental methods of finding out which programmes do and which do not produce intended and unintended consequences. They maintain that this is not a sound way of deriving sensible lessons for policy and practice.
In sum then, CGD's proposition of RCTs, to cite Paul Krugman, is like a cockroach policy: it was flushed away in the 1970s but returned forty years later with its significant limits intact; and CGD missed the most significant gap. From the above, one could get the impression that development aid has lost the capacity to learn: it suppresses, rather than takes heed of, lessons.
I hope the above is seen as a constructive contribution to the debate your blog provokes; and my seeming pessimism simply qualifies my optimism - a book was launched yesterday on monitoring systems in Africa.
Best wishes and good luck,
Daniel
Rahel Kahlert
Senior Evaluation Officer IAEA
Austria

Dear Binod! I very much agree with your critical comments regarding 'net effect' impact evaluations, especially in relation to long-term effects. I am currently exploring qualitative-investigative methodologies that could be useful for capturing those longer-term effects.
Binod Chapagain
Manager - Monitoring, Evaluation and Learning ADRA International
Yemen

Thanks for recommending good practices - they are very helpful.
I find the concept of 'net effect' (medical model!) challenging in areas where effects are not visible in the short term. How would we measure the net effect if we have short-term empowerment programs or climate change interventions? I also find it difficult to generalize randomized control trials to areas that have wide impacts, such as air or water pollution. We can have qualitative data justifying the effects. However, the evaluation may need to challenge the programs that look for a quick fix for a long-term problem!
