Field Note · Open Access

Structuring Baseline Surveys For Donor Compliance

Published 15 April 2026 · CC BY-NC 4.0
Impact Evaluation · Kenya · Donor Compliance

Abstract

A practical view of how theories of change, sampling plans, and CAPI instruments become submission-ready evidence aligned with institutional donor requirements.

Institutional donors have never been more serious about evidence. As development finance has tightened globally — with nine of the top twenty humanitarian donors announcing cuts to official development assistance in 2025 alone — the organisations that continue to attract and retain funding are those that can demonstrate impact with data that withstands scrutiny. USAID now requires implementing partners to upload all approved project reports, baselines, and evaluations to the Development Experience Clearinghouse. The World Bank's project portfolio is subject to independent evaluation against pre-registered indicator frameworks. Major foundations demand that baseline surveys be designed before intervention begins, not reconstructed after results are in.

Yet in the field, the gap between survey design theory and donor-ready output remains surprisingly wide. Teams spend months crafting sophisticated Theories of Change, only to produce baselines that cannot answer the questions the ToC was built around. Sampling plans are developed in headquarters without accounting for the enumeration realities of the target geography. Survey instruments are built in paper form and digitised late, producing datasets riddled with skip-pattern errors, missing values, and implausible timestamps that undermine the credibility of the data. By the time a donor's mid-term review arrives, the evidentiary architecture is already too fragile to carry the weight of the project's claims.

The question is not whether to conduct a rigorous baseline. Every donor requires one. The question is how to design it, from day one, so that the data it generates is precisely the data the donor will need at reporting, at mid-term, and at endline — with no reconstruction, no retroactive indicator mapping, and no gaps in the audit trail.

The answer lies in treating the Theory of Change not as a narrative document but as a measurement blueprint, and building every element of the baseline — sampling plan, sample size, stratification, and CAPI instrument — directly from that blueprint before a single enumerator enters the field.


A Theory of Change Without an Instrument Map Is a Narrative, Not a Measurement System

The Theory of Change is the foundation of every credible development project. Done well, it traces the causal logic from inputs through activities to outputs, outcomes, and impact — specifying the assumptions that must hold for each step to occur. Done badly, it is a diagram that looks rigorous but cannot support actual measurement.

The test of a functional ToC is simple and unforgiving: for every outcome in the diagram, can you name the specific survey instrument that will generate baseline data against that outcome? If the answer is "we will figure that out in the data collection phase," the ToC is not a measurement system — it is a programme narrative dressed in logical framework language.

Donors — particularly USAID, FCDO, and the major foundations — have become sharply attuned to this distinction. A Performance Indicator Reference Sheet (PIRS), which USAID requires to be completed within three months of data collection commencing, demands that each indicator be mapped to a named data source, a specific collection method, and a defined baseline timeframe. This is not bureaucratic form-filling. It is the contractual assertion that the baseline you are about to collect will actually measure what your ToC claims you are trying to change.

The practical implication for any project team is that ToC development and survey instrument design must happen simultaneously, not sequentially. When the ToC workshop produces an outcome node — say, "women micro-entrepreneurs demonstrate improved access to formal credit" — the immediate next question is not "what activities lead to this outcome?" but "what question, asked of what respondent, using what scale, will give us a baseline value for this outcome?" If that question cannot be answered at the design stage, the outcome node should not be in the ToC.
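
To make the discipline concrete, the sketch below shows one way an outcome node can be forced through this test at the design stage. The structure loosely mirrors the fields a PIRS asks for; all names, question IDs, and values are illustrative, not drawn from any actual project.

    from dataclasses import dataclass

    @dataclass
    class OutcomeIndicator:
        toc_node: str     # outcome node as worded in the Theory of Change
        indicator: str    # the indicator that operationalises the node
        instrument: str   # named survey instrument or module
        question_id: str  # specific question generating the baseline value
        respondent: str   # unit of observation
        scale: str        # response scale, identical at baseline and endline

    CREDIT_ACCESS = OutcomeIndicator(
        toc_node="Women micro-entrepreneurs demonstrate improved access to formal credit",
        indicator="% of sampled women micro-entrepreneurs holding an active formal loan",
        instrument="Household enterprise module, baseline CAPI form v1.0",
        question_id="ent_q14",
        respondent="Female enterprise owner",
        scale="Binary (yes/no)",
    )

    # Design rule: if any field cannot be filled at the ToC workshop stage,
    # the outcome node is not yet measurable and should not enter the ToC.
    def is_measurable(ind: OutcomeIndicator) -> bool:
        return all(getattr(ind, f) for f in ind.__dataclass_fields__)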

Every outcome stage requires its own baseline collection unless there is pre-existing population-level data of sufficient quality and recency. And critically, the baseline and endline instruments must share the same structure — the same questions, the same response scales, the same unit of observation. A pre-programme self-report cannot be compared against a post-programme structured assessment. The comparability of the measurement is the foundation of the attribution claim.


Power Calculations Made Before Sampling Begins Protect the Project's Statistical Credibility at Endline

The sample size is not a logistical decision. It is a statistical commitment made at baseline that determines whether the project will be able to detect the change it is trying to achieve — if that change occurs.

Power calculations formalise this commitment. They take the project's expected effect size, the baseline variance of that indicator in the target population, the acceptable probability of a Type I error (the significance level, conventionally 5%), and the acceptable probability of a Type II error (one minus statistical power), and translate them into a minimum sample size requirement. A study designed for the conventional 80% power gives the evaluation an 80% chance of detecting a real effect of the anticipated magnitude.
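
As an illustration, the standard two-sample formula can be run in a few lines. The effect size and standard deviation below are placeholders; a real calculation would draw them from pilot data or comparable studies in the same population.

    import math
    from scipy.stats import norm

    def min_sample_per_arm(effect: float, sd: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
        # Minimum respondents per arm to detect `effect` (in outcome units)
        # given outcome standard deviation `sd`, in a two-sided test.
        z_alpha = norm.ppf(1 - alpha / 2)  # Type I error threshold
        z_beta = norm.ppf(power)           # Type II error threshold
        n = 2 * ((z_alpha + z_beta) * sd / effect) ** 2
        return math.ceil(n)

    # Illustrative: detect a 0.25 s.d. improvement at 80% power, 5% significance.
    print(min_sample_per_arm(effect=0.25, sd=1.0))  # -> 252 per arm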

Why does this matter for donor compliance? Because at endline, a donor's evaluator will test whether the change observed between baseline and endline is statistically significant. If the baseline sample was too small to power that test, a genuine programme effect may fail to reach statistical significance — and the project will appear to have underperformed regardless of what actually happened on the ground. Retroactively increasing the sample does not solve this problem; the baseline values for the additional respondents no longer exist.

The power calculation is therefore a form of contractual protection. Completing it before data collection begins and documenting it in the project's M&E plan demonstrates to donors that the statistical architecture of the evaluation was designed to detect the effects the project claims to produce. In practice, power calculations in field settings must also account for design effects — the statistical cost of sampling from clustered units like villages, schools, or health facilities, rather than from a purely random individual-level sample.
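
A minimal sketch of that adjustment, continuing the calculation above and using the standard design-effect formula for equal-sized clusters; the cluster size and intra-cluster correlation figures are illustrative assumptions, not recommendations.

    import math

    def deff(cluster_size: float, icc: float) -> float:
        # Design effect for equal-sized clusters: DEFF = 1 + (m - 1) * ICC.
        return 1 + (cluster_size - 1) * icc

    # Illustrative: 20 respondents per village, intra-cluster correlation 0.05.
    base_n = 252  # from the unclustered calculation above
    adjusted_n = math.ceil(base_n * deff(cluster_size=20, icc=0.05))
    print(adjusted_n)  # -> 492 per arm, before any attrition buffer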


Stratified Sampling Is the Only Design That Generates the Disaggregated Evidence Donors Now Require

Institutional donors have moved, firmly and without reversal, toward requiring evidence disaggregated by gender, geography, age cohort, disability status, and other equity dimensions. USAID's custom indicators guidance explicitly requires gender-disaggregated data. PEPFAR mandates biennial collection of equity-sensitive outcome indicators. The Sustainable Development Goals' commitment to "leave no one behind" is operationalised, in practice, as a requirement to report outcomes separately for the most marginalised subgroups.

A simple random sample will not reliably support this requirement. In a population where women represent 40% of beneficiaries and the programme targets remote rural clusters, a simple random sample may return too few female or rural respondents in specific sub-strata to produce statistically reliable estimates. The donor's disaggregated reporting requirement will either be unfulfillable or will be answered with such wide confidence intervals as to be meaningless.

Stratified random sampling solves this problem by design. Rather than sampling from the full population in a single draw, the population is first divided into non-overlapping strata and a minimum sample is drawn independently from each. This guarantees that every reporting sub-group will be represented at a statistically meaningful level, regardless of its share of the total population.
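
A minimal sketch of such a draw, assuming a pandas sampling frame with illustrative stratum columns; the per-stratum floor of 150 is a placeholder, not a recommendation.

    import pandas as pd

    def stratified_sample(frame: pd.DataFrame, strata: list[str],
                          n_per_stratum: int, seed: int = 42) -> pd.DataFrame:
        # Draw n_per_stratum respondents independently from each stratum,
        # capped at the stratum's size where the frame is small.
        return (
            frame.groupby(strata, group_keys=False)
                 .apply(lambda g: g.sample(n=min(n_per_stratum, len(g)),
                                           random_state=seed))
        )

    # e.g. guarantee 150 respondents in every gender x locality cell,
    # regardless of each cell's share of the sampling frame:
    # sample = stratified_sample(listing, ["gender", "locality"], 150)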

The deeper strategic point is this: the stratification decisions made during baseline design are the decisions that will determine what the project can claim at endline. If gender-disaggregated outcomes are not in the sampling frame from day one, they cannot be inserted at mid-term review without compromising comparability.


Digitising the Instrument Is Not a Technical Upgrade — It Is a Data Quality and Compliance Decision

Computer-Assisted Personal Interviewing — CAPI, delivered through platforms such as SurveyCTO, ODK, or KoboToolbox — has become the standard for high-quality donor-funded surveys. The move away from paper-based instruments is not merely a question of convenience. It is a structural difference in the data quality architecture of the entire baseline.

CAPI eliminates the most consequential class of errors by design. Skip logic is automated, range validations prevent out-of-scale values, and consistency checks flag implausible responses before the interview is submitted. The result is a dataset that arrives from the field already structured, already validated, and already free of the formatting inconsistencies that dominate paper-to-digital transfer.
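
CAPI platforms express these rules declaratively (in XLSForm terms, the constraint and relevance columns). The sketch below mirrors the three classes of check in plain Python, with invented field names, to show what each rule is actually asserting about a submitted record.

    def validate_record(rec: dict) -> list[str]:
        # Mirrors three CAPI check classes: skip logic, range, consistency.
        # All field names and thresholds are illustrative.
        errors = []
        # Skip logic: loan questions apply only if a loan is reported.
        if rec.get("has_loan") == "no" and rec.get("loan_amount") is not None:
            errors.append("loan_amount answered despite has_loan == 'no'")
        # Range validation: age must fall inside a plausible interval.
        if not (15 <= rec.get("age", -1) <= 99):
            errors.append("age outside 15-99")
        # Consistency: years in business cannot exceed age.
        if rec.get("years_in_business", 0) > rec.get("age", 0):
            errors.append("years_in_business exceeds age")
        return errors

    print(validate_record({"has_loan": "no", "loan_amount": 500, "age": 12}))
    # -> flags both the skip-logic violation and the out-of-range age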

For donor compliance, two CAPI features carry particular weight. First, metadata — GPS coordinates of each interview location, timestamps of each session, device IDs of each enumerator's tablet — is automatically recorded and embedded in the dataset, creating a verifiable audit trail. Second, real-time monitoring dashboards allow supervisors to review submission rates and correct emerging data quality issues during data collection rather than after it has ended.
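
A hypothetical example of the kind of back-check this metadata enables: flagging interviews whose recorded duration is implausibly short. The column names and the fifteen-minute threshold are assumptions, not platform defaults.

    import pandas as pd

    def flag_short_interviews(df: pd.DataFrame,
                              min_minutes: float = 15.0) -> pd.DataFrame:
        # Expects CAPI-exported 'start' and 'end' timestamp columns
        # (names vary by platform; these are assumptions).
        duration = pd.to_datetime(df["end"]) - pd.to_datetime(df["start"])
        df = df.assign(duration_min=duration.dt.total_seconds() / 60)
        return df[df["duration_min"] < min_minutes]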


The Baseline Dataset Itself Must Be Documented as Though It Will Be Independently Audited — Because It Will

Collecting high-quality baseline data is a necessary condition for donor compliance. It is not sufficient. Donors increasingly require that the dataset be accompanied by documentation that allows an independent evaluator to reproduce the sampling, verify the instrument, and trace every data point from field collection to cleaned analytical file.

At minimum, this documentation set includes: a sampling methodology note; the final CAPI instrument in its native digital format; an enumerator training record; a field implementation log noting deviations from the original sampling plan; and a data cleaning log recording every transformation applied to the raw dataset before it was declared final.
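
The cleaning log, in particular, is easiest to produce as a by-product of the cleaning itself rather than as an after-the-fact reconstruction. One possible pattern, with illustrative names, is to wrap each transformation so it records itself as it runs:

    import datetime as dt
    import pandas as pd

    CLEANING_LOG: list[dict] = []

    def logged_step(description: str):
        # Decorator: record each cleaning transformation as it is applied.
        def wrap(fn):
            def inner(df: pd.DataFrame) -> pd.DataFrame:
                rows_before = len(df)
                out = fn(df)
                CLEANING_LOG.append({
                    "timestamp": dt.datetime.now().isoformat(),
                    "step": description,
                    "rows_before": rows_before,
                    "rows_after": len(out),
                })
                return out
            return inner
        return wrap

    @logged_step("Drop duplicate submissions by respondent ID")
    def drop_duplicate_submissions(df: pd.DataFrame) -> pd.DataFrame:
        return df.drop_duplicates(subset="respondent_id")

    # pd.DataFrame(CLEANING_LOG) is then exported alongside the final dataset.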

US Government implementing partners are required by regulation to upload baseline datasets and associated documentation to the Development Data Library (DDL). This is not hypothetical — evaluators do request these materials, and projects that cannot produce them face findings of incomplete compliance.


The Baseline Is Not the Beginning of a Project — It Is the Architecture of Its Accountability

A baseline survey conducted well is one of the highest-leverage investments a development programme can make. The institutional environment in which implementing organisations now operate makes that investment more consequential than it has ever been: donors are reducing portfolios, and competition for the remaining funding is intensifying.

Holding one's ground in that environment requires the integration of measurement architecture into programme design from the first day: a Theory of Change that doubles as an instrument map, a sampling plan designed to answer the disaggregated questions donors will ask at endline, a CAPI instrument that encodes data quality as a structural feature rather than a training outcome, and a documentation discipline that treats the baseline dataset as a public accountability record from the moment collection begins.

Donors will ask, eventually, whether the evidence holds up. The answer to that question is written in the baseline.


This article draws on primary field research experience and ongoing applied M&E advisory work at NM Research and Advisory.
