Using Artificial Intelligence to search for evidence in evaluations: the WFP experience



Since 2022, the Office of Evaluation of the World Food Programme (WFP) has invested in promoting the use of evidence generated by its evaluations, developing its ability to respond to evidence needs as they arise. 

One aspect of this investment has included developing the capacity to more efficiently and effectively mine the evidence contained in its portfolio of evaluations. This has naturally entailed looking to artificial intelligence (AI)-based options to automate text extraction.

Given the high interest in AI, I am sharing some information on what we are aiming to do, along with reflections from our experience, as our AI project gets started.

Why is AI of interest? 

Evidence-based decision-making is central to many multilateral organizations such as WFP. Evaluation is a key provider of credible evidence, generated by independent teams and backed by solid quality-assurance mechanisms.

While evaluation functions have progressed, fully leveraging evaluation evidence remains problematic. Programme implementers interested in learning from the experience of others find it hard to locate the specific evidence they need and lack the time to review long and dense reports. Our evaluation function itself is challenged to address requests for synthesizing or summarizing existing evidence, because of the time and resources it takes to manually retrieve evidence on given topics of interest.

To address this challenge and increase the usability of knowledge generated by evaluation, we have been working to develop a solution that generates insights from existing evaluation evidence by enhancing search capacities using AI.

AI solution to better use evidence 

Our project intends to use natural language processing (NLP) technologies to build a tailored solution that enables the evaluation function (and in turn, any WFP staff member) to retrieve evidence from evaluation reports using an automated search, a bit like a Google search.
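To make the idea of automated search concrete, here is a minimal sketch of ranking report passages against a query. It is an illustration only, not the WFP solution: it assumes a simple TF-IDF cosine similarity rather than the NLP models the project may use, and the `Passage` fields, the `search` function and the sample passages are all invented for the example. Each result carries its report identifier and page, so evidence stays traceable to its source.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    report_id: str  # hypothetical identifier of the source evaluation report
    page: int       # page of the passage, kept so results can be traced to source
    text: str


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, passages: list[Passage], top_k: int = 3) -> list[tuple[float, Passage]]:
    """Return up to top_k (score, passage) pairs, best match first."""
    docs = [Counter(tokenize(p.text)) for p in passages]
    n = len(docs)
    # Document frequency: in how many passages each term appears.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed inverse document frequency (rarer terms weigh more).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vectorize(counts: Counter) -> dict[str, float]:
        # Terms never seen in the corpus are dropped.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    q = vectorize(Counter(tokenize(query)))
    scored = [(cosine(q, vectorize(d)), p) for d, p in zip(docs, passages)]
    scored = [s for s in scored if s[0] > 0]  # keep only passages sharing a term
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]


# Invented sample passages, for illustration only.
SAMPLE_PASSAGES = [
    Passage("EVAL-A", 12, "School feeding improved attendance among girls in rural districts."),
    Passage("EVAL-B", 30, "Cash transfers reduced negative coping strategies during the lean season."),
    Passage("EVAL-C", 7, "Supply chain delays affected food distribution in remote areas."),
]

if __name__ == "__main__":
    for score, passage in search("school feeding", SAMPLE_PASSAGES):
        print(f"{passage.report_id} p.{passage.page} ({score:.2f}): {passage.text}")
```

A production system would more likely rely on semantic embeddings than on exact word matches, but the shape of the output (a ranked list of passages, each tied to a report and page) would stay much the same.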

Making the most of recent technological leaps in the field of generative AI, such as ChatGPT, the solution will also generate new text, such as summaries of insights.

Automating text retrieval may have additional benefits, such as the automatic tagging of documents to support analytics and reporting.

Lastly, with AI, it is also possible to direct evidence to people based on their anticipated interests, much as Netflix recommends films suited to one’s taste.

Reflections on risks and opportunities 

In such a fast-evolving field, better-performing and cheaper solutions continue to emerge, so an important choice we made was to opt for a modular structure, in which independent elements can be replaced with more relevant, up-to-date ones. Modularity is thus key to future-proofing the system. Having the option to use the latest and best-performing models is also the best way to tackle the central challenge: achieving highly accurate results when searching for the evidence that is just right. An added benefit is a reduced risk of being locked in to specific providers.
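As a rough illustration of what a modular structure can look like in code, the sketch below defines two small interfaces and a pipeline that depends only on them. Everything here is hypothetical (the `Retriever` and `Summarizer` interfaces, the toy components and the sample corpus are invented for the example and do not describe the actual WFP architecture), but it shows how one element can be swapped without touching the rest.

```python
from dataclasses import dataclass
from typing import Protocol


class Retriever(Protocol):
    """Any component that can return passages relevant to a query."""
    def retrieve(self, query: str) -> list[str]: ...


class Summarizer(Protocol):
    """Any component that can condense passages into a short text."""
    def summarize(self, passages: list[str]) -> str: ...


class KeywordRetriever:
    """Toy retriever: keeps passages that contain any query term."""
    def __init__(self, corpus: list[str]) -> None:
        self.corpus = corpus

    def retrieve(self, query: str) -> list[str]:
        terms = query.lower().split()
        return [p for p in self.corpus if any(t in p.lower() for t in terms)]


class FirstSentenceSummarizer:
    """Toy summarizer: keeps the first sentence of each passage."""
    def summarize(self, passages: list[str]) -> str:
        return " ".join(p.split(". ")[0].rstrip(".") + "." for p in passages)


class CountSummarizer:
    """Alternative summarizer, standing in for a newer or cheaper model."""
    def summarize(self, passages: list[str]) -> str:
        return f"{len(passages)} passage(s) found."


@dataclass
class EvidencePipeline:
    # The pipeline only knows the interfaces, not the concrete components.
    retriever: Retriever
    summarizer: Summarizer

    def answer(self, query: str) -> str:
        return self.summarizer.summarize(self.retriever.retrieve(query))


if __name__ == "__main__":
    corpus = [
        "Cash transfers helped households. More detail follows.",
        "School meals raised attendance.",
    ]
    pipeline = EvidencePipeline(KeywordRetriever(corpus), FirstSentenceSummarizer())
    print(pipeline.answer("cash"))
    pipeline.summarizer = CountSummarizer()  # swap one module, keep the rest
    print(pipeline.answer("cash"))
```

Because the pipeline is written against interfaces, replacing a summarization model or a provider means changing one assignment, which is the property that limits lock-in.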

Interoperability with other systems and solutions is another important feature, to reduce knowledge silos and support the ultimate ambition of facilitating access to evidence through a single user entry point. This is a challenge, though, as everyone starts developing their own solution…

And as searching for evidence does not stop at evaluation, we keep in mind that an evidence-mining solution could be replicated for purposes other than the use of evaluation evidence. This has influenced our decisions in certain areas, such as favoring the creation of renewable assets and the use of open-source models.

Our interest in ultimately offering a solution that may, over time, integrate into a single search point for WFP users also led us to anticipate issues related to data protection, even though the evaluation evidence we feed in is public. Indeed, one of the risks associated with the use of generative AI tools is the uncertainty surrounding how the data fed into these systems may be used. As pilots are developed, gatekeeping services and “sandbox” environments are needed to offer isolated, “safe” spaces.

Generative AI models are developing apace, but the reliability of what they deliver remains uneven. Our focus will therefore remain on ensuring high-performing extraction capacity, in which evidence is traced to its source.

Reflecting on process

As we embark on this new journey, I realize the immense learning this entails. Rarely to this extent have two technical fields such as Evaluation and TEC needed to understand each other to ensure progress. This is the beginning of what will surely occupy a significant part of our attention in the years to come: learning to navigate new systems, new terms, and new ways to apprehend the world and our work. Embracing this transformation will be key, so we need to get ready.