The Kirkpatrick Model’s four levels (Reaction, Learning, Behavior, and Results) provide a complete framework for training evaluation, but most organizations never get past Level 1 satisfaction surveys because each subsequent level demands more organizational cooperation and data infrastructure.
Donald Kirkpatrick introduced his four-level evaluation model in 1959. More than six decades later, it remains the most referenced framework in training evaluation. It is also one of the most poorly implemented.
Most training operations measure Level 1. Many measure Level 2. Very few consistently measure Level 3. Almost none measure Level 4 with any rigor. The result is an industry that can tell you whether learners liked the training, sometimes whether they learned something, and almost never whether the training changed anything that matters.
That gap is not because the model is flawed. It is because measuring the higher levels requires effort, infrastructure, and organizational cooperation that most training teams do not have.
Here is how the model actually works and how to implement it without turning your training operation into a research department.
Level 1: Reaction
The question: Did learners find the training valuable, engaging, and relevant to their jobs?
How to measure it: Post-training surveys, typically delivered immediately after the learning experience. The standard questions cover satisfaction with the content, the instructor or platform, the pace, and the perceived relevance to the learner’s role.
What it tells you: Level 1 data tells you about the training experience, not the training effectiveness. High satisfaction scores mean the content was well-designed, the delivery was smooth, and learners felt their time was respected. Low scores identify problems with the learning experience that need to be fixed.
What it does not tell you: Whether learners actually learned anything. Satisfaction and learning are weakly correlated at best. Workers can enjoy a training session without retaining the material. Workers can dislike a challenging training session while learning a great deal from it.
Practical tips:
Keep surveys short. Five to seven questions maximum. Workers who just completed training do not want to spend 15 minutes evaluating it. Long surveys produce low response rates or rushed, unreliable responses.
Ask specific questions rather than generic ones. “How relevant was this content to your daily work?” is more useful than “How satisfied were you with the training?” Specificity produces actionable data.
Include one open-ended question. “What one thing would make this training more useful for your job?” generates improvement ideas that rating scales cannot.
Do not over-index on Level 1 data. A training module with a 4.8 out of 5 satisfaction rating is not necessarily effective. A module with a 3.2 rating is not necessarily ineffective. Use Level 1 data to improve the experience. Use other levels to evaluate the impact.
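To make those tips concrete, here is a minimal sketch in Python of how a short, specific survey can produce actionable data. The question wording, ratings, and review threshold are all illustrative, not a prescription.

```python
from statistics import mean

# An illustrative five-question Level 1 survey (1-5 rating scale).
QUESTIONS = {
    "relevance": "How relevant was this content to your daily work?",
    "clarity": "How clear were the explanations and examples?",
    "pace": "Was the pace appropriate for the material?",
    "delivery": "How smoothly was the content delivered?",
    "time": "Was this training a good use of your time?",
}

# Hypothetical responses: one dict of ratings per learner.
responses = [
    {"relevance": 4, "clarity": 4, "pace": 5, "delivery": 5, "time": 4},
    {"relevance": 2, "clarity": 3, "pace": 4, "delivery": 5, "time": 3},
    {"relevance": 3, "clarity": 4, "pace": 5, "delivery": 4, "time": 3},
]

# Per-question averages show which part of the experience needs work,
# exactly the detail a single overall satisfaction score hides.
for key, text in QUESTIONS.items():
    avg = mean(r[key] for r in responses)
    flag = "  <- review" if avg < 3.5 else ""
    print(f"{avg:.1f}  {text}{flag}")
```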
Level 2: Learning
The question: Did learners acquire the intended knowledge, skills, or attitudes?
How to measure it: Pre-training and post-training assessments, practical skills demonstrations, and scenario-based evaluations. The comparison between pre and post scores shows the learning gain attributable to the training.
What it tells you: Whether the training successfully transferred knowledge or skills from the content to the learner. This is the first level that actually measures whether the training worked as instruction.
What it does not tell you: Whether learners will apply what they learned on the job. Knowledge and behavior are different things. A worker can score perfectly on a de-escalation assessment and still fail to apply those techniques during an actual confrontation.
Practical tips:
Pre-assessments are essential. Without a baseline, you cannot measure learning gain. A worker who scores 90% on a post-training assessment might have known 85% of the material before the training and only learned 5%. Without the pre-training assessment, you would credit the training with the full 90%.
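The arithmetic from that example is worth writing down. Here is a minimal Python sketch, assuming simple percentage scores; the normalized gain (what fraction of the learner's available headroom the training closed) is a common refinement:

```python
def learning_gain(pre: float, post: float) -> dict:
    """Raw and normalized learning gain from pre/post scores (0-100)."""
    raw = post - pre
    # Normalized gain: a learner who moves 85 -> 90 closed 5 of the
    # 15 points available to them, a gain of roughly 33%.
    headroom = 100 - pre
    normalized = raw / headroom if headroom > 0 else 0.0
    return {"raw_gain": raw, "normalized_gain": round(normalized, 2)}

# The example from the tip above: 85% before training, 90% after.
print(learning_gain(pre=85, post=90))  # {'raw_gain': 5, 'normalized_gain': 0.33}
```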
Design assessments that test application, not recall. “What is the first step of the lockout/tagout procedure?” tests recall. “You arrive at a machine that needs maintenance. The power switch is on the far side. An operator is standing nearby. Walk me through what you do.” tests application. Application-level assessment is a better predictor of on-the-job behavior.
Assess at the right time. Immediate post-training assessment measures short-term retention, which is useful but not sufficient. A follow-up assessment at 30 or 90 days measures longer-term knowledge retention and gives a more accurate picture of whether the learning stuck.
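A small sketch of that follow-up comparison, assuming the same assessment (or an equivalent form) is administered immediately and again at 90 days; the scores are invented for illustration:

```python
# Immediate vs. 90-day scores on the same assessment (illustrative).
scores = {"w01": (92, 81), "w02": (88, 86), "w03": (95, 70)}

for worker, (immediate, day90) in scores.items():
    # Retention ratio: how much of the immediate result survived 90 days.
    print(f"{worker}: retained {day90 / immediate:.0%} of the immediate score")
```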
For skills-based training, observation-based assessment is more valid than written tests. Watching a worker perform a procedure tells you more about their competence than asking them to describe the procedure in writing.
Level 3: Behavior
The question: Are learners applying what they learned in their actual work?
How to measure it: Supervisor observations, on-the-job performance data, safety metrics, quality metrics, customer interaction data, and incident reports.
What it tells you: Whether the training translated into changed behavior on the job. This is the level where training proves its value, or fails to. A training program that produces learning (Level 2) but not behavior change (Level 3) has a transfer problem. The knowledge exists but is not being applied.
What it does not tell you: Whether the behavior change will persist, or whether it will produce the organizational outcomes you need.
Why most organizations stop here or earlier: Fewer than 15% of organizations consistently evaluate training at Level 3 or above. Level 3 measurement requires collaboration between training and operations. Training teams cannot observe on-the-job behavior. They need supervisors, quality teams, and safety personnel to collect the data. That requires organizational alignment that many training teams lack.
Practical tips:
Define the target behaviors before the training launches. What specific, observable actions should change as a result of this training? Write them down. If you cannot articulate the target behaviors, you cannot measure them, and you probably cannot train them effectively either.
Work with supervisors to build simple observation checklists. The checklist should include the three to five most important behavioral changes the training was designed to produce. Supervisors check whether they observe the behavior during normal work. This is not a formal assessment. It is a structured observation that takes two to three minutes per worker.
Measure at 30, 60, and 90 days post-training. Behavior change is not immediate. Workers need time to integrate new practices into their routines. A single observation one week after training tells you very little. Repeated observations over months show whether the change is sticking.
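Structured as data, those observations are easy to aggregate. Here is a sketch assuming a three-behavior checklist and observations at fixed intervals; the behaviors, worker IDs, and results are invented for illustration:

```python
from collections import defaultdict

# Target behaviors, defined before the training launched (illustrative).
BEHAVIORS = ["verifies_lockout", "wears_ppe", "reports_near_misses"]

# Each record: (worker_id, days since training, {behavior: observed?}).
observations = [
    ("w01", 30, {"verifies_lockout": True, "wears_ppe": True, "reports_near_misses": False}),
    ("w01", 60, {"verifies_lockout": True, "wears_ppe": True, "reports_near_misses": True}),
    ("w02", 30, {"verifies_lockout": False, "wears_ppe": True, "reports_near_misses": False}),
    ("w02", 60, {"verifies_lockout": True, "wears_ppe": True, "reports_near_misses": False}),
]

# Adoption rate per behavior at each interval: is the change sticking?
by_interval = defaultdict(list)
for _, day, checklist in observations:
    by_interval[day].append(checklist)

for day in sorted(by_interval):
    checklists = by_interval[day]
    rates = {b: sum(c[b] for c in checklists) / len(checklists) for b in BEHAVIORS}
    print(f"day {day}: " + ", ".join(f"{b} {r:.0%}" for b, r in rates.items()))
```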
Look at existing data sources before creating new ones. Incident rates, customer complaint data, quality audit results, and safety inspection scores may already capture the behaviors you are trying to change. Connecting these existing data streams to training events is easier and more sustainable than building new data collection systems.
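As one illustration of connecting an existing data stream to training events, this sketch counts incidents in the 90 days before and after each worker's training date; the export format and the 90-day window are assumptions, not a standard:

```python
from datetime import date

# Hypothetical exports: training completion dates and incident log entries.
training_completed = {"w01": date(2024, 3, 1), "w02": date(2024, 3, 15)}
incidents = [
    ("w01", date(2024, 1, 10)),
    ("w01", date(2024, 5, 2)),
    ("w02", date(2024, 2, 20)),
]

# Count incidents in the 90 days before vs. after each worker's training.
before = after = 0
for worker, when in incidents:
    trained = training_completed.get(worker)
    if trained is None:
        continue  # worker not (yet) trained; out of scope here
    delta = (when - trained).days
    if -90 <= delta < 0:
        before += 1
    elif 0 <= delta <= 90:
        after += 1

print(f"incidents 90 days before training: {before}, 90 days after: {after}")
```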
Account for non-training factors. If behavior does not change after training, the training may not be the problem. Inadequate tools, conflicting procedures, unsupportive supervisors, or misaligned incentives can all prevent workers from applying what they learned. Level 3 data helps identify these barriers, but only if you investigate the root cause rather than blaming the training.
Level 4: Results
The question: Did the training produce measurable business or organizational outcomes?
How to measure it: Connect training data to organizational performance metrics, including safety incident rates, compliance audit results, employee retention, operational efficiency, customer satisfaction, and cost reduction.
What it tells you: Whether the training program delivered a return on investment. This is the level that matters most to executives and budget decision-makers.
Why it is hard: The payoff is real; organizations with mature Level 4 evaluation practices report significantly higher confidence in their training investment decisions. But attribution is the core challenge. Organizational outcomes are influenced by many factors beyond training. If incident rates dropped after a safety training program, was it because of the training? Or because of a new safety policy, a staffing change, a seasonal pattern, or a coincidence? Isolating the training’s contribution from all other variables is analytically challenging.
Practical tips:
Do not attempt Level 4 evaluation for every training program. Reserve it for programs with significant budget investment, strategic importance, or executive visibility. The analytical effort required is substantial and should be directed where the return on that effort is highest.
Use comparison groups when possible. If you can compare outcomes between a trained group and a similar untrained group, the attribution problem becomes more manageable. This is not always possible, but when it is, it provides the strongest evidence.
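A back-of-the-envelope sketch of such a comparison, with invented numbers; a real analysis would also need comparable groups and a significance test before claiming an effect:

```python
def incident_rate(incidents: int, workers: int, months: int) -> float:
    """Incidents per 100 workers per month."""
    return incidents / workers / months * 100

# Hypothetical six-month counts for two comparable sites.
trained = incident_rate(incidents=12, workers=150, months=6)   # trained site
control = incident_rate(incidents=21, workers=160, months=6)   # untrained site

print(f"trained: {trained:.2f}, control: {control:.2f} per 100 workers/month")
print(f"relative difference: {(control - trained) / control:.0%}")
```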
Focus on leading indicators rather than lagging outcomes. It takes months or years for some organizational outcomes (like reduced incident rates) to show a statistically meaningful change. Leading indicators like supervisor-observed behavior change, near-miss reporting rates, and assessment scores provide earlier signals of impact. Our Training ROI Calculator can help you model expected returns while waiting for long-term data.
Build the measurement plan before the training launches. Decide what data you will collect, from what sources, and at what intervals before the training is delivered. Retrofitting a Level 4 measurement after the fact is unreliable because you will not have the baseline data you need.
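One lightweight way to force those decisions before launch is to write the plan down as structured data the team can review. A sketch with illustrative fields, sources, and intervals:

```python
# An illustrative measurement plan, drafted before the training ships.
measurement_plan = {
    "program": "Lockout/tagout refresher",
    "baseline": {
        "metric": "recordable incidents per 100 workers per month",
        "source": "safety incident log",
        "window": "6 months pre-launch",
    },
    "level_2": {"instrument": "pre/post scenario assessment", "when": ["day 0", "day 90"]},
    "level_3": {"instrument": "supervisor checklist", "when": ["day 30", "day 60", "day 90"]},
    "level_4": {"metric": "incident rate vs. baseline", "review": "quarterly"},
}

for section, details in measurement_plan.items():
    print(f"{section}: {details}")
```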
Present results honestly. If you cannot cleanly isolate the training’s contribution, say so. “Following the safety training program, incident rates dropped 15% compared to the same period last year. The training is one of several factors that may have contributed, including the new pre-shift safety briefing protocol.” That is a defensible statement. “The training reduced incidents by 15%” is not, unless you have controlled data that supports it.
A practical evaluation strategy
Measuring all four levels for every training program is neither practical nor necessary. Here is a tiered approach:
All training programs: Level 1 (short satisfaction survey) and Level 2 (post-training assessment with pre-assessment for new programs). This is the baseline. It should be automated by your learning management system.
Compliance and safety-critical programs: Add Level 3 (supervisor observations at 30 and 90 days, plus relevant operational data). This is where you verify that compliance training is actually changing behavior, not just producing completion records.
High-investment strategic programs: Add Level 4 (connect training data to organizational outcomes with appropriate caveats about attribution). This is where you justify the budget to leadership.
This tiered approach allocates evaluation effort where it matters most. You are not spending supervisor time observing behavior change from a 10-minute policy update module. You are investing that effort in safety training where the behavior change has life-or-death implications.
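The tiering rule is simple enough to encode directly. Here is a sketch under the assumptions above; the flag names and budget threshold are illustrative, not a recommendation:

```python
def evaluation_levels(program: dict) -> list[int]:
    """Which Kirkpatrick levels to measure, given a program's profile."""
    levels = [1, 2]  # baseline for every program: short survey + assessment
    if program.get("safety_critical") or program.get("compliance"):
        levels.append(3)  # verify behavior change, not just completions
    if program.get("strategic") and program.get("budget", 0) >= 100_000:
        levels += [lvl for lvl in (3, 4) if lvl not in levels]
    return levels

print(evaluation_levels({"compliance": True}))                    # [1, 2, 3]
print(evaluation_levels({"strategic": True, "budget": 250_000}))  # [1, 2, 3, 4]
```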
The bottom line
The Kirkpatrick Model is not complicated. It is just difficult to execute beyond Level 1 because each subsequent level requires more organizational cooperation, more data infrastructure, and more analytical sophistication.
The organizations that evaluate well do not treat evaluation as a separate activity bolted onto the end of training. They design the training and the evaluation together from the start. They define target behaviors before building content. They identify data sources before launching programs. They build measurement into the workflow rather than adding it after the fact.
That integration is the real challenge. The model provides the framework. Your organization has to provide the commitment to use it. For practical guidance on translating evaluation results into business language, see our guide to measuring training ROI.
Frequently Asked Questions
- What are the four levels of the Kirkpatrick Model?
- Level 1: Reaction (did learners find the training valuable and engaging?). Level 2: Learning (did learners acquire the intended knowledge and skills?). Level 3: Behavior (are learners applying what they learned on the job?). Level 4: Results (did the training produce measurable business or organizational outcomes?). Each level builds on the previous one.
- Why do most organizations only measure Level 1?
- Level 1 is the easiest to measure. You hand out a survey after training and ask workers if they found it useful. Levels 2 through 4 require progressively more effort: assessments, performance observation, data collection, and analysis over weeks or months. Most training teams lack the time, tools, or organizational support to measure beyond Level 1.
- Is Level 1 evaluation worthless?
- Not worthless, but insufficient on its own. Learner satisfaction tells you whether the training experience was well-designed and respectful of the learner's time. That matters for adoption and morale. But satisfaction does not predict learning, and learning does not guarantee behavior change. Level 1 data is useful context, not a measure of effectiveness.
- How do you measure Level 3 behavior change?
- Through direct observation, supervisor assessments, performance data, and incident tracking. The key is identifying specific, observable behaviors that the training was designed to change, then measuring whether those behaviors actually changed in the weeks and months following the training. This requires collaboration between training and operations teams.
- Do you need to measure all four levels for every training program?
- No. The level of evaluation should match the stakes of the training. Low-risk awareness training may only warrant Level 1 and Level 2 evaluation. Safety-critical compliance training should be evaluated through Level 3 at minimum. High-investment strategic training programs justify the effort of Level 4 evaluation. Evaluate where the return on evaluation effort is highest.
See how Vekuri handles compliance training
Audit-ready records, automated tracking, and training that reaches every worker on their phone.