Clinical reasoning lies at the center of effective medical practice. It informs how clinicians gather information, generate hypotheses, interpret data, and make diagnostic and management decisions. Yet despite its centrality, clinical reasoning has long been one of the most difficult competencies to teach, observe, and reliably assess.

The Challenge for Educators

In everyday educational practice, faculty face persistent barriers to evaluating reasoning. Much of a learner’s cognitive process is invisible: educators see only the actions a learner takes or the final decisions they present, not the internal logic that produced them. As a result, they often cannot determine whether a correct answer reflects sound reasoning or whether a wrong answer stems from flawed thinking versus an incomplete data set.

Observation at the bedside is also difficult to scale. Meaningful assessment of reasoning requires real-time monitoring, probing questions, and debriefing—activities that depend on faculty time that is rarely available in sufficient quantity. Many learners receive feedback on their reasoning only intermittently, with substantial gaps between observed encounters.

A further challenge is variability among assessors. Even when educators use the same rubric, their interpretations of performance often differ. The literature has repeatedly documented this issue; for example, Govaerts et al. demonstrated that assessor cognition and individual frames of reference contribute significantly to inconsistency in workplace-based assessment, even under standardized conditions. These differences undermine trust in the fairness and reproducibility of clinical assessments.

Making Reasoning Visible Through DDx

To address these challenges, DDx by Sketchy has now launched an AI-powered Clinical Reasoning Assessment to evaluate how learners think and to provide rich, actionable feedback to both learners and educators. This assessment tool was developed and rigorously iterated on by a team of clinician-educators and draws on existing evidence-based frameworks. 

Because the platform tracks a learner’s working differential diagnosis at each point in the encounter, it can identify whether choices align with or diverge from their stated hypotheses. For example, if a learner lists pneumonia and pulmonary embolism as leading considerations but orders numerous laboratory panels unrelated to either condition, the system can recognize this as scattershot testing. Conversely, if a test is ordered specifically to strengthen or refute a hypothesis, the system can acknowledge this as intentional, hypothesis-driven behavior. This level of visibility makes feedback more specific, meaningful, and educationally useful.
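
As a rough illustration of the kind of check described above, the sketch below labels each order as hypothesis-driven or unlinked to the learner’s stated differential. The diagnosis-to-test mapping, function name, and output labels here are hypothetical examples for this post, not the DDx implementation.

```python
# Illustrative sketch only: a hypothetical mapping from diagnoses to the tests
# that meaningfully support or refute them, plus a simple alignment check.
RELEVANT_TESTS = {
    "pneumonia": {"chest x-ray", "sputum culture", "cbc"},
    "pulmonary embolism": {"ct pulmonary angiogram", "d-dimer"},
}

def classify_orders(working_differential: list[str], orders: list[str]) -> dict[str, str]:
    """Label each order as hypothesis-driven or unlinked to the stated differential."""
    informative = set()
    for dx in working_differential:
        informative |= RELEVANT_TESTS.get(dx.lower(), set())
    return {
        order: "hypothesis-driven" if order.lower() in informative else "unlinked"
        for order in orders
    }

print(classify_orders(
    ["Pneumonia", "Pulmonary embolism"],
    ["Chest x-ray", "D-dimer", "Thyroid panel", "Lipid panel"],
))
# {'Chest x-ray': 'hypothesis-driven', 'D-dimer': 'hypothesis-driven',
#  'Thyroid panel': 'unlinked', 'Lipid panel': 'unlinked'}
```

A pattern of many “unlinked” orders relative to the stated differential is what the post describes as scattershot testing; orders that map back to a stated hypothesis are the intentional, hypothesis-driven behavior the system can acknowledge.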

Understandably, the term “AI-powered” elicits both excitement and skepticism. How do we ensure that every learner receives fair, consistent feedback that is aligned with human judgment?

A Rubric That Serves as the Foundation for Trust

At the heart of this assessment is a structured clinical reasoning rubric developed by clinician-educators and grounded in validated models such as the Assessment of Reasoning Tool (ART) and the IDEA framework. These models provide shared, evidence-based definitions of diagnostic and clinical reasoning behaviors, allowing educators to reliably distinguish between hypothesis generation, data interpretation, refinement of differentials, and decision-making. The rubric has been refined through iterative review by experts across multiple specialties to ensure that it reflects authentic reasoning patterns expected in real clinical settings.

The rubric also functions as essential scaffolding for the AI system. Rather than allowing a language model to freely generate evaluative judgments, the rubric defines the criteria, anchors model interpretations to expert expectations, and constrains the assessment to recognizable, defensible standards. This approach limits variability and reduces the kinds of drift or improvisation that can occur in unconstrained generative systems.

Designing the AI System for Reliable Evaluation

The reliability of the system is driven by a combination of clear, well-defined scoring criteria, structured prompts that guide the model’s interpretation of learner actions, and a scoring process that ensures consistent application of the rubric. Together, these architectural features promote stable, reproducible evaluations. The system is designed so that when the same learner performance and reasoning inputs are provided, the assessment process produces consistent scoring and feedback, without fluctuations caused by model randomness or uncontrolled variation in prompt interpretation.
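
To make the idea concrete, here is a minimal sketch of what rubric-scaffolded, deterministic scoring can look like. The rubric items, anchors, prompt wording, and `call_model` function are hypothetical placeholders rather than the DDx system’s actual rubric, prompts, or API; the point is simply that the criteria, the prompt, and the validation step are all fixed, so the model’s output is constrained to the rubric’s anchors.

```python
# Hypothetical rubric: each item is scored only against a fixed set of anchors.
RUBRIC = {
    "hypothesis_generation": ["not observed", "emerging", "developing", "proficient"],
    "data_interpretation": ["not observed", "emerging", "developing", "proficient"],
}

# Fixed prompt template: the same transcript always produces the same prompt.
PROMPT_TEMPLATE = (
    "Score the learner transcript below on each of these rubric items: {items}.\n"
    "Use ONLY these anchors: {anchors}.\n"
    "Return one line per item in the form 'item: anchor'.\n\n"
    "{transcript}"
)

def score_encounter(transcript: str, call_model) -> dict[str, str]:
    """Build a fixed prompt, call the model deterministically, and validate the output
    against the rubric so out-of-rubric judgments are rejected rather than passed on."""
    prompt = PROMPT_TEMPLATE.format(
        items=", ".join(RUBRIC),
        anchors=", ".join(RUBRIC["hypothesis_generation"]),
        transcript=transcript,
    )
    raw = call_model(prompt, temperature=0)  # greedy decoding: same input -> same output
    scores = dict(line.split(": ", 1) for line in raw.strip().splitlines())
    for item, anchor in scores.items():
        if item not in RUBRIC or anchor not in RUBRIC[item]:
            raise ValueError(f"Out-of-rubric output: {item}: {anchor}")
    return scores
```

Because the criteria, prompt, and decoding settings are held constant, rescoring the same learner inputs is intended to reproduce the same result, which is what makes the test–retest checks described below meaningful.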

Early Evidence of Reliability

Initial internal testing demonstrates encouraging levels of consistency. When comparing AI-generated ratings to those of trained clinician-educators in preliminary testing, inter-rater reliability was strong, with Cohen’s kappa values greater than 0.8. Test–retest reliability was also high: when the same learner data was evaluated multiple times, the resulting scores showed a reliability approaching 1, demonstrating almost no variability across repeated assessments.
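
For readers curious how such figures are typically computed, here is a brief sketch using scikit-learn’s `cohen_kappa_score`; the ratings shown are illustrative placeholders, not the study data.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative ordinal rubric scores (1-4) assigned to the same ten learner
# encounters by a clinician-educator and by the AI system.
human_scores = [3, 4, 2, 4, 3, 1, 2, 4, 3, 2]
ai_scores    = [3, 4, 2, 4, 3, 2, 2, 4, 3, 2]

# Inter-rater reliability between human and AI ratings.
kappa = cohen_kappa_score(human_scores, ai_scores)
print(f"Cohen's kappa (human vs. AI): {kappa:.2f}")

# Test-retest: rescoring identical inputs should yield (near-)identical output.
run_1 = [3, 4, 2, 4, 3, 2, 2, 4, 3, 2]
run_2 = [3, 4, 2, 4, 3, 2, 2, 4, 3, 2]
agreement = sum(a == b for a, b in zip(run_1, run_2)) / len(run_1)
print(f"Test-retest exact agreement: {agreement:.2f}")
```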

These findings suggest that the system behaves reliably across encounters and aligns meaningfully with human expert judgment, both of which are essential for building trust in AI-supported evaluation.

Why Reliability Matters

For learners, consistent assessment means feedback they can rely on. Rather than being influenced by which faculty member happens to observe them, they receive evaluations tied directly to their reasoning behaviors. This clarity supports deliberate practice and growth.

For educators, a reliable system reduces the burden of repeated direct observation, provides insight into cognitive processes that are otherwise inaccessible, and frees faculty to focus on coaching rather than scoring. It also offers a fairer, more standardized approach to assessing a complex construct that has traditionally been difficult to evaluate.

Looking Ahead

We’re excited about what this means for medical education. Clinical reasoning can now be assessed with a breadth of encounters and a depth of evaluation that were simply impossible only a few years ago.

By using AI intentionally and responsibly to enhance trainees’ reasoning abilities rather than impair them, we’re working toward an assessment system that truly reflects how learners reason through a case. We believe this is a system that educators can trust, and one that helps learners develop the skills that actually matter in real patient care.

And by grounding AI-driven assessment in expert frameworks, transparent criteria, and rigorous reliability testing, we aim to support an assessment ecosystem that enhances—not replaces—the human work of teaching clinical reasoning.

Explore how AI-enabled clinical simulation can benefit your institution. Schedule a demo of DDx today.