AI Can Spot the Pattern, But Only Humans Can Teach the Wisdom
Earlier coverage of AI oversight and its implications for CME providers.
Clinician discussion of a JAMA review highlighted that only 5% of LLM studies use real patient data; AI and faculty-development signals converge on the need for observable evaluation practice rather than passive orientation.
This week's clinician conversation centered on a stark statistic: only 5% of LLM studies used real patient data, sharpening the question of how clinicians should judge AI before using it. AI and faculty-development conversations pointed to the same CME problem: learning has to include observable evaluation, not passive orientation.
The clearest signal came from clinician discussion of a JAMA systematic review of LLM testing in health care. The phrase that matters for CME teams was blunt: “Only 5% utilized real patient data, highlighting a significant gap in evaluations.” The same source summary noted that accuracy dominates evaluation while fairness, bias, and toxicity are less often assessed.
That changes the job of AI education. A session that explains what LLMs are is no longer enough if clinicians are being asked to decide whether an output is safe, biased, incomplete, or workflow-ready. We saw a related pattern in an earlier brief on AI ethics training moving from optional to non-negotiable; this week’s signal is more operational. The learner task is not just “understand AI risk.” It is “show how you would evaluate this output before acting on it.”
The examples are oncology- and cardiology-led, but the principle is portable. In an oncology diagnostics discussion, AI was framed around pathologist supply, objective scoring, cloud-based workflows, and rapid literature filtering. Those are exactly the settings where clinicians need to know when to trust, challenge, or route an AI-generated answer.
For CME providers, the implication is concrete: build AI modules around case decisions. Give learners an LLM output, the patient context, and a checklist that includes data source, bias risk, toxicity or safety concern, and workflow consequence. Then measure whether they accept, edit, reject, or escalate the output differently after the activity.
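For providers that track these case decisions in an assessment platform, the measurement design is straightforward to encode. Below is a minimal sketch in Python of one way to record each learner's decision alongside its checklist and compare the pre/post distribution; every name in it (CaseDecision, decision_shift, the checklist keys) is a hypothetical illustration, not drawn from any source in this brief.

```python
# Minimal sketch of capturing case-based AI-evaluation decisions pre/post activity.
# All names and fields here are hypothetical illustrations, not a standard schema.
from collections import Counter
from dataclasses import dataclass

DECISIONS = ("accept", "edit", "reject", "escalate")

@dataclass
class CaseDecision:
    learner_id: str
    case_id: str     # the LLM output plus patient context shown to the learner
    phase: str       # "pre" or "post" the educational activity
    decision: str    # one of DECISIONS
    checklist: dict  # e.g. {"data_source": ..., "bias_risk": ..., "safety": ..., "workflow": ...}

def decision_shift(records: list) -> dict:
    """Net change in each decision type from the pre phase to the post phase."""
    pre = Counter(r.decision for r in records if r.phase == "pre")
    post = Counter(r.decision for r in records if r.phase == "post")
    return {d: post[d] - pre[d] for d in DECISIONS}

# Example: one learner who accepted a flawed output before training escalates it after.
records = [
    CaseDecision("L1", "case-01", "pre", "accept", {"bias_risk": "unexamined"}),
    CaseDecision("L1", "case-01", "post", "escalate", {"bias_risk": "flagged"}),
]
print(decision_shift(records))  # {'accept': -1, 'edit': 0, 'reject': 0, 'escalate': 1}
```

The point of the structure is the comparison, not the storage: if the post-activity distribution shows more edits and escalations on flawed outputs, the activity changed behavior; if it only records completion, it measured attendance.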
A second, smaller signal came from educator podcasts rather than broad independent clinician conversation, so it should be treated as emerging. Still, the pattern was specific: faculty want better ways to obtain honest feedback, support stressed learners, and improve teaching behavior in real settings.
In a musculoskeletal oncology educator discussion, faculty described why routine evaluations often fail: power differentials make honest trainee feedback hard to obtain. The useful response was not another lecture on feedback theory. It was more structured observation, including outside observers, objective rubrics, and specific questions about teaching techniques.
A Faculty Factory episode added the wellness layer: clear expectations, psychological safety, reflection, and staged support for stressed learners. It also described a faculty development fellowship using inter-specialty small groups, CME credit, protected time, and reflection to build teaching and coaching skills.
For CME teams, this argues for faculty-development products that institutions can actually implement: short observation rubrics, scripted feedback prompts, reflection guides, and follow-up checks at three to six months. The question is not whether faculty liked the session; it is whether someone can observe a better feedback conversation afterward.
This week’s useful lesson is not only about AI. It is about trust. Clinicians are asking for ways to evaluate new tools, and educators are asking for ways to see whether teaching and feedback actually improve. CME teams should reconsider any activity that stops at orientation. If the desired behavior involves judgment under pressure, the activity should include rehearsal, a rubric, and a visible decision point. Otherwise, the provider may be educating around the hardest part of the job instead of training for it.
Clinicians highlight the lack of real-world data and the absence of bias/toxicity assessment in current LLM studies.
"Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. @JAMA_current @JAMANetwork #ArtificialIntelligence #MedEd #MEDTECH #DigitalHealth @StanfordMed"
Researchers emphasize the need for phased deployment (silent mode to monitored use) and broader specialty coverage.
Educators describe difficulty obtaining honest feedback due to power differentials and the value of structured observation programs.
A new inter-specialty faculty development fellowship builds teaching and coaching skills through clear expectations and reflection.