Extracting Experimental Evidence at Scale: Building the World's Largest Open RCT Database

How we built a multi-stage, LLM-powered pipeline to extract, standardize, and organize causal evidence from thousands of randomized controlled trials in development economics.

ImpactAI Team (impactai@worldbank.org) · Satvik Garg | Abelardo Carlos Martinez Lorenzo | Linxi Wang | Riccardo Orlando | Piriyakorn Piriyatamwong | Nihaa Sajid | Saqib Hussain | Clémence Gall | Ejigayehu Diriba | Rida Ali Khan | Aarushi Aggarwal | Giuliano Martinelli | Sharif Kazemi | Arianna Legovini | Samuel Fraiberger

Last updated: March 20th, 2026

The Evidence Accessibility Problem

Randomized controlled trials are widely regarded as the gold standard for causal inference in the social sciences. Over the past two decades, their growing adoption in development economics has produced thousands of rigorous evaluations spanning education, health, agriculture, social protection, governance, labor markets, and financial inclusion across more than 150 countries. Repositories such as J-PAL, 3ie, and IPA, along with major economics journals, have amassed a vast and growing body of experimental evidence on what works in global development.

Yet this wealth of knowledge remains remarkably difficult to access and use. The quantitative findings of each study are not stored in a database; they are buried inside lengthy PDF documents, scattered across tables with inconsistent formatting, and reported using a wide variety of statistical conventions. A single RCT may contain dozens of treatment arms, hundreds of outcome measurements, and multiple econometric specifications, with key information distributed across narrative text, dense results tables, and supplementary figures. Reconstructing the full experimental design of even one study (which interventions were tested, which arms were compared, what outcomes were measured, and what effect sizes were observed) is a labor-intensive process that domain experts currently perform by hand.

The consequences are significant. Policymakers seeking to understand the average impact of a class of interventions must rely on systematic reviews that are expensive, slow to produce, and quickly outdated as new trials are published. Researchers attempting meta-analyses face months of extraction work before analysis can begin. And billions of dollars in development aid continue to be allocated based on incomplete evidence, not because rigorous evidence does not exist, but because it is locked inside documents that no one has the time to read.

We set out to solve this problem by building an automated, multi-stage information extraction pipeline that can read full-text RCT papers, reconstruct their hierarchical experimental designs, extract and standardize all reported effect sizes, and organize the results into a structured, queryable database, making decades of causal evidence accessible in seconds.

Diagram showing the extraction pipeline: Documents on the left contain text with highlighted entities (unconditional transfers, farm productivity training, informational flyers). These map to Interventions (Cash Transfer Programs, Agricultural Training, Information Dissemination, Nutrient Supplementation), which combine into Arms (Cash Transfer + Agricultural Training Arm, Information Dissemination Arm, Nutrient Supplementation Comparator Arm), which link to Outcomes & Effects (Crop Production, Agricultural Knowledge, Child Health) with effect sizes.

Figure 1. The extraction challenge. A single RCT document contains interventions described in narrative text, treatment arms that bundle multiple interventions, and outcomes with effect sizes scattered across results tables. Our pipeline identifies these entities across modalities and reconstructs the full hierarchical structure linking them.

Our Approach: A Multi-Stage Extraction Pipeline

Each RCT encodes a deep dependency chain: interventions map to treatment arms, arms form counterfactual contrasts, and effect sizes link each contrast to specific outcomes and populations. A single linking error cascades through the entire structure. Our pipeline addresses this by decomposing extraction into sequential stages, each building on verified outputs of the previous one.

Phase 1: Document Processing and Table Filtering

The pipeline begins by ingesting full-text RCT papers and converting them into a structured representation. We use OCR-based parsing to extract text and tables, with specialized handling for rotated tables and complex layouts. Figures are converted to text descriptions. The result is a unified document representation that preserves all the information a human reader would have access to.

Not all tables in a paper contain treatment effects. Most contain baseline statistics, balance checks, descriptive summaries, or robustness analyses. To identify the tables that matter, we apply a multi-stage filtering pipeline that progressively narrows the set of candidate tables to only those reporting main treatment effect estimates on the full experimental sample, using the best available econometric specification.
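The progressive narrowing described above can be pictured as a chain of filters. The sketch below is illustrative only: the `Table` fields, the three filter stages, and the example tables are assumptions chosen to mirror the criteria named in the text (treatment effects, full sample, best specification), not the pipeline's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Table:
    # All fields are hypothetical labels; in practice they would come from
    # an upstream classifier rather than being hard-coded.
    name: str
    reports_treatment_effects: bool
    full_sample: bool
    spec_rank: int  # 1 = best available specification


def filter_tables(tables):
    """Apply filters in sequence, recording how many tables survive each stage."""
    stages = [
        ("treatment effects", lambda t: t.reports_treatment_effects),
        ("full sample", lambda t: t.full_sample),
        ("best specification", lambda t: t.spec_rank == 1),
    ]
    survivors = list(tables)
    trace = [("input", len(survivors))]
    for label, keep in stages:
        survivors = [t for t in survivors if keep(t)]
        trace.append((label, len(survivors)))
    return survivors, trace


tables = [
    Table("Table 1: Baseline balance", False, True, 1),
    Table("Table 2: Main effects on production", True, True, 1),
    Table("Table 3: Effects by gender subgroup", True, False, 1),
    Table("Table 4: Main effects, no controls", True, True, 2),
]
kept, trace = filter_tables(tables)
```

Keeping the per-stage counts in `trace` makes it easy to see where candidate tables are dropped when a paper ends up with no main effect tables at all.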

In parallel, we score all text sections in each paper for topical relevance across multiple dimensions: measures and outcomes, sampling and study design, intervention details, econometric methods, and implementation quality. This relevance scoring ensures that each downstream extraction stage receives the most informative context from the paper: not only the tables, but also the surrounding narrative that describes how the study was designed and conducted.
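One way to use such scores is to hand each extraction stage the top-ranked sections for the dimension it cares about. The following is a minimal sketch under that assumption; the section titles, score values, dimension keys, and the `top_sections` helper are all hypothetical.

```python
import heapq


def top_sections(scores, dimension, k=2):
    """Return the k section titles with the highest score on one dimension."""
    ranked = heapq.nlargest(
        k, scores.items(), key=lambda item: item[1].get(dimension, 0.0)
    )
    return [title for title, _ in ranked]


# Hypothetical relevance scores in [0, 1], one dict per section.
scores = {
    "Introduction": {"outcomes": 0.2, "design": 0.3, "intervention": 0.4},
    "Experimental Design": {"outcomes": 0.3, "design": 0.9, "intervention": 0.7},
    "Data and Measurement": {"outcomes": 0.9, "design": 0.5, "intervention": 0.1},
    "Results": {"outcomes": 0.8, "design": 0.2, "intervention": 0.2},
}

outcome_context = top_sections(scores, "outcomes")
design_context = top_sections(scores, "design", k=1)
```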

Phase 2: Structured Entity Extraction

With the relevant tables and text identified, the pipeline reconstructs the hierarchical experimental design through a series of coordinated extraction stages. Each stage uses large language models prompted with domain-specific instructions developed in close collaboration with development economics experts.

Stage 1: Study Design & Treatment Arms

Identifies the country, all treatment and control arms, their descriptions, and the specific interventions applied in each arm. Handles parallel, factorial, and mixed experimental designs, correctly distinguishing multi-component interventions from multiple separate interventions.
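The distinction between a multi-component intervention and separate interventions is easiest to see in a factorial design, where each factor is crossed with every other. The sketch below enumerates the arms of a full factorial design over binary factors; the `Arm` class and `factorial_arms` helper are hypothetical names used for illustration, not part of the pipeline.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Arm:
    name: str
    interventions: list = field(default_factory=list)
    is_control: bool = False


def factorial_arms(factors):
    """Enumerate the arms of a full factorial design over on/off factors."""
    arms = []
    for assignment in product([False, True], repeat=len(factors)):
        active = [f for f, on in zip(factors, assignment) if on]
        name = " + ".join(active) if active else "Control"
        arms.append(Arm(name, active, is_control=not active))
    return arms


arms = factorial_arms(["Cash transfer", "Agricultural training"])
```

In this 2×2 example the combined arm bundles two interventions, while the two single-factor arms each test one intervention on its own, so all four arms must be kept distinct.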

Stage 2: Outcome Identification

Extracts all measured outcomes from the filtered tables, including their descriptions, whether they are binary or continuous, whether they are index (composite) measures, their connotation (whether higher values are better or worse), and the population for whom each outcome was measured.

Stage 3: Effect Size Selection

For each outcome and treatment arm, selects the best available effect size estimate using a hierarchical set of econometric best practices: prioritizing ITT estimands, author-preferred specifications, standardized coefficients, and parsimonious models with appropriate fixed effects and controls.
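A hierarchical preference like this can be expressed as a lexicographic sort key, where earlier criteria dominate later ones. The sketch below assumes the priorities apply in the order listed above and uses the number of controls as a crude stand-in for parsimony; both are simplifying assumptions, and the `Candidate` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str
    estimand: str           # "ITT", "TOT", or "LATE"
    author_preferred: bool
    standardized: bool
    n_controls: int         # fewer controls treated as more parsimonious


def priority(c):
    """Lexicographic key: False sorts before True, so preferred traits win."""
    return (
        c.estimand != "ITT",
        not c.author_preferred,
        not c.standardized,
        c.n_controls,
    )


def select_best(candidates):
    return min(candidates, key=priority)


candidates = [
    Candidate("col 1: TOT, preferred", "TOT", True, True, 2),
    Candidate("col 2: ITT, many controls", "ITT", False, False, 12),
    Candidate("col 3: ITT, preferred", "ITT", True, False, 4),
]
best = select_best(candidates)
```

Here column 3 is selected: it shares the ITT estimand with column 2 but is the author-preferred specification, and the TOT column is ruled out at the first criterion despite being standardized.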

Stage 4: Precision & Statistical Metadata

Extracts all available precision measures for each effect size: standard errors, confidence intervals, p-values, t-statistics, q-values, and sample sizes for both treatment and control groups. Also classifies the effect type (mean difference, risk difference, odds ratio, etc.) and estimand (ITT, TOT, LATE).
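Collecting every available precision measure matters because a paper that omits the standard error often reports something from which it can be recovered. The sketch below shows the textbook normal-approximation conversions; the function names are hypothetical, and the source does not state which formulas the pipeline itself applies.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def se_from_ci(lower, upper, level=0.95):
    """Standard error implied by a symmetric confidence interval."""
    z = Z.inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)


def se_from_t(estimate, t_stat):
    """Standard error implied by a reported t-statistic."""
    return abs(estimate / t_stat)


def se_from_p(estimate, p_value):
    """Standard error implied by a two-sided p-value (normal approximation)."""
    z = Z.inv_cdf(1 - p_value / 2)
    return abs(estimate) / z
```

An inequality such as "p < 0.01" only bounds the standard error from above, so an exact p-value, t-statistic, or confidence interval is the more useful input when one is available.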

Stage 5: Intervention & Population Enrichment

Generates detailed descriptions of each intervention (implementer, delivery method, duration, frequency, geographic scale) and each population (demographics, geography, selection criteria), along with hierarchical taxonomy tags that enable cross-study comparison.

To make this concrete, consider what the pipeline produces from a single study:

Worked Example: Extracting a Multi-Arm Agricultural RCT

Input: A 35-page RCT paper on cash transfers and agricultural training in rural Kenya, with 12 tables, 3 treatment arms, and 8 outcomes.

↓ Table Filtering: 12 tables → 3 main effect tables

Arms extracted: Cash transfer only, Cash transfer + training, Information only, and Control.

Outcomes extracted: Crop production, agricultural knowledge, consumption, food security, child health, school enrollment, savings, and labor supply.

↓ 8 outcomes × 3 contrasts = 24 estimates extracted

Sample estimate:

Intervention: $500 mobile money transfer + 6-week farm training program (NGO-delivered, weekly group sessions, Western Kenya).

Outcome: Crop production (kg/hectare) among smallholder farming households.

Effect: Hedges' g = 0.31, SE = 0.09, 95% CI [0.13, 0.49], p < 0.01, ITT estimand
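The arithmetic in this worked example can be checked in a few lines. The sketch below enumerates one estimate per outcome and treatment-versus-control contrast, then confirms that the reported confidence interval and p-value follow from the effect size and standard error under a normal approximation. The variable names are illustrative.

```python
from itertools import product
from statistics import NormalDist

outcomes = [
    "Crop production", "Agricultural knowledge", "Consumption", "Food security",
    "Child health", "School enrollment", "Savings", "Labor supply",
]
treatment_arms = ["Cash transfer only", "Cash transfer + training", "Information only"]

# One estimate per (outcome, treatment-vs-control contrast) pair.
estimates = [(o, (arm, "Control")) for o, arm in product(outcomes, treatment_arms)]

# Consistency check on the sample estimate: CI = g ± z * SE.
g, se = 0.31, 0.09
z_crit = NormalDist().inv_cdf(0.975)  # about 1.96
ci = (round(g - z_crit * se, 2), round(g + z_crit * se, 2))
p_value = 2 * (1 - NormalDist().cdf(g / se))
```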

Phase 3: Effect Size Standardization

A central challenge in building a cross-study database is that different RCTs report their results on different scales: raw mean differences, regression coefficients, odds ratios, risk differences, or pre-standardized effect sizes. To enable meaningful comparison and aggregation, every extracted effect size is converted to a common metric: Hedges' g, a bias-corrected standardized mean difference.

The standardization process follows a decision tree that handles continuous outcomes (converting from raw means, standard errors, p-values, t-statistics, or confidence intervals), binary outcomes (converting from risk differences, odds ratios, or linear probability model coefficients), and clustered designs (applying design effect corrections when authors have not already adjusted for clustering). Each conversion is documented with its formula, inputs, and quality checks, and flagged if the resulting effect size falls outside plausible ranges.
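To give a sense of what the branches of such a decision tree compute, here is a minimal sketch of three standard conversions from the meta-analysis literature: Hedges' g from group means and standard deviations, the logit-based conversion of an odds ratio, and a design-effect inflation of the standard error for clustered designs. These are textbook formulas offered as an illustration; the source does not specify the exact formulas the pipeline uses, and the function names are hypothetical.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Bias-corrected standardized mean difference and its standard error."""
    df = n_t + n_c - 2
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_t - mean_c) / sd_pooled
    j = 1 - 3 / (4 * df - 1)  # small-sample correction factor
    var_d = (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
    return j * d, j * math.sqrt(var_d)


def d_from_odds_ratio(odds_ratio):
    """Logit-based conversion of an odds ratio to a standardized mean difference."""
    return math.log(odds_ratio) * math.sqrt(3) / math.pi


def design_effect(avg_cluster_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1 + (avg_cluster_size - 1) * icc


def cluster_adjusted_se(se, avg_cluster_size, icc):
    """Inflate a naive standard error by the square root of the design effect."""
    return se * math.sqrt(design_effect(avg_cluster_size, icc))


g, se_g = hedges_g(mean_t=10, mean_c=8, sd_t=4, sd_c=4, n_t=50, n_c=50)
```

With 50 observations per group, the correction factor is close to 1, so g is only slightly smaller than the uncorrected difference of 0.5 standard deviations; the correction matters more in small trials.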

Phase 4: Automated Validation

As a final quality assurance step, we run every extracted estimate back through an independent validation stage. Each paper's full text is re-read alongside its extracted entries, and a separate LLM evaluates whether each estimate, including the effect size, confidence interval, treatment arm linkage, outcome description, and causal interpretation, is consistent with what the source document actually reports. Each entry receives a consistency score and targeted feedback identifying any discrepancies. Estimates that fall below quality thresholds are flagged for manual review, ensuring that the database maintains high accuracy even as it scales to thousands of papers.
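The flagging step at the end of validation amounts to a threshold on the consistency score. A minimal sketch follows, assuming scores lie in [0, 1]; the 0.8 threshold, the field names, and the example entries are all hypothetical, since the source does not give the actual values.

```python
def triage(entries, threshold=0.8):
    """Split validated entries into accepted and flagged-for-review lists."""
    accepted, flagged = [], []
    for entry in entries:
        (accepted if entry["score"] >= threshold else flagged).append(entry)
    return accepted, flagged


# Hypothetical validator output: one score and feedback string per estimate.
entries = [
    {"id": "est-001", "score": 0.95, "feedback": ""},
    {"id": "est-002", "score": 0.40, "feedback": "CI does not match source table"},
    {"id": "est-003", "score": 0.80, "feedback": ""},
]
accepted, flagged = triage(entries)
```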

Expert-Validated Reliability

The pipeline's reliability has been validated against expert human annotations. Seven trained annotators, each holding at least a Master's degree in economics with experience in RCT research, produced gold-standard extractions following IDEAL consortium annotation guidelines. Against these benchmarks, the pipeline achieves precision levels that closely approach expert performance for identifying the correct intervention, outcome, and associated effect size.

The system is most reliable on papers with standard reporting conventions and well-structured tables, which account for the majority of published RCTs. Extraction is harder for complex factorial designs with many treatment arms, for papers that report only regression coefficients without sufficient descriptive statistics, and for very long tables that span many pages.

An important external validation signal: the distribution of effect sizes in our database closely mirrors those reported in other large-scale RCT repositories. The mean effect in our database, 0.07 standard deviations, aligns with established benchmarks, providing extrinsic evidence that the pipeline does not introduce systematic extraction or standardization bias.

The ImpactAI RCT Database

We have applied this pipeline across a wide range of sources, including papers from four peer-reviewed economics journals (QJE, AEJ: Applied, JDE, and World Bank Economic Review), three major impact evaluation repositories (J-PAL, IPA, 3ie), and additional papers indexed in NBER, SSRN, CEPR, Scopus, Semantic Scholar, EconLit, and other academic databases. The result is one of the largest structured databases of experimental evidence in development economics.

~7,300 RCT papers · 74,826 individual estimates · 150+ countries covered

Each estimate in the database is a richly structured record that captures the full experimental context of a finding. Far beyond a simple effect size number, every estimate encodes the complete chain from intervention to outcome, with detailed metadata at each level.

What Each Estimate Contains

💊 Intervention: Name, description, implementer, delivery method, duration, frequency, geographic scale, and hierarchical taxonomy tags

📊 Outcome: Name, description, domain classification, binary/continuous type, index flag, connotation (direction of improvement), and measurement unit

📐 Standardized Effect: Hedges' g, SE(g), 95% CI, p-value, original effect size, effect type (mean difference, odds ratio, etc.), and estimand (ITT, TOT, LATE)

👥 Population: Detailed characteristics (demographics, geography, selection criteria), analysis unit type, and population taxonomy tags

⚗️ Treatment Arms: All treatment and control arms, their descriptions, which interventions each arm contains, and sample sizes per arm

⚖️ Counterfactual Contrast: Which treatment arm is compared to which control arm for this specific estimate, enabling precise interpretation of the causal claim

🌍 Context: Country, sub-national locations, study setting, randomization unit, clustering level, ICC, and cluster adjustment status

📑 Source & Metadata: Paper title, authors, journal, year, DOI, source repository, and publication status
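A record of this shape could be modeled as nested data classes. The sketch below is a simplified illustration covering only a handful of the fields listed above; the class and field names are hypothetical and do not reflect the database's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Intervention:
    name: str
    description: str = ""
    taxonomy_tags: list = field(default_factory=list)


@dataclass
class Outcome:
    name: str
    is_binary: bool = False
    is_index: bool = False
    higher_is_better: bool = True  # the "connotation" flag


@dataclass
class StandardizedEffect:
    hedges_g: float
    se: float
    estimand: str = "ITT"
    effect_type: str = "mean difference"


@dataclass
class Estimate:
    intervention: Intervention
    outcome: Outcome
    effect: StandardizedEffect
    treatment_arm: str  # the two arms together form the counterfactual contrast
    control_arm: str
    country: Optional[str] = None
    doi: Optional[str] = None


# Populated with values from the worked example earlier in this post.
record = Estimate(
    intervention=Intervention("Cash transfer + farm training"),
    outcome=Outcome("Crop production"),
    effect=StandardizedEffect(hedges_g=0.31, se=0.09),
    treatment_arm="Cash transfer + training",
    control_arm="Control",
    country="Kenya",
)
```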

This structured representation powers ImpactAI's ability to answer complex evidence questions in seconds: comparing the effectiveness of different interventions on a given outcome, identifying evidence gaps across sectors and geographies, and performing real-time meta-analyses that would traditionally take weeks or months of researcher time.
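Because every record carries a standardized effect and its standard error, pooling across studies reduces to a short computation. The sketch below uses fixed-effect inverse-variance weighting, one of the simplest meta-analytic estimators; it is an illustration only, and the source does not say which meta-analytic model ImpactAI uses. The input values are made up.

```python
import math


def pooled_effect(estimates):
    """Fixed-effect inverse-variance pooling of (g, se) pairs."""
    weights = [1 / se**2 for _, se in estimates]
    total = sum(weights)
    g_bar = sum(w * g for w, (g, _) in zip(weights, estimates)) / total
    se_bar = math.sqrt(1 / total)
    return g_bar, se_bar


# Two hypothetical estimates with equal precision.
g_bar, se_bar = pooled_effect([(0.2, 0.1), (0.4, 0.1)])
```

With equal standard errors the pooled estimate is the simple average of the two effects, and its standard error shrinks by a factor of the square root of two. A random-effects model would usually be preferred when studies differ in context, as they do across countries and populations.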

The extraction pipeline and the underlying schema have also been formalized in our research on Hierarchical Information Extraction (HIE), the task of reconstructing the complete multi-level experimental design of an RCT from its full-text document. The database is continuously growing as new studies are published and processed.

If you would like to cite our content, use this citation:

Satvik Garg, Abelardo Carlos Martinez Lorenzo, Linxi Wang, Riccardo Orlando, Piriyakorn Piriyatamwong, Nihaa Sajid, Saqib Hussain, Clémence Gall, Ejigayehu Diriba, Rida Ali Khan, Aarushi Aggarwal, Giuliano Martinelli, Sharif Kazemi, Arianna Legovini, and Samuel Fraiberger. 2026. Extracting Experimental Evidence at Scale: Building the World's Largest Open RCT Database. https://impactai.worldbank.org/blog/extracting-experimental-data-from-studies.

Explore the Evidence

Interested in accessing the RCT database, exploring the data, or collaborating with our team?
We'd love to hear from you.