Insights/Clinician Learning Brief

LLM Tools Reach Clinics Before Clinicians Have Evaluation Frameworks

Topics: AI oversight, Learning design, Outcomes planning
Coverage: Oct. 14–20, 2024. Public signals came from one clinician X thread and three podcast discussions.

Abstract

Clinician discussion of a JAMA review showed only 5% of LLM studies use real patient data; AI and faculty-development signals converge on the need for observable evaluation practice rather than passive orientation.

Key Takeaways

  • AI education is moving past awareness. Clinicians need structured practice judging LLM outputs for real-world fit, bias, fairness, and toxicity before clinical use.
  • Faculty development carried the same design lesson: feedback and wellness training need observation, reflection, and usable rubrics, not one-way instruction.
  • For CME teams, the common thread is measurable rehearsal: can learners accept, edit, reject, observe, or give feedback differently after the activity?

Only 5% of the LLM studies covered in a JAMA systematic review that a clinician highlighted this week used real patient data, which sharpens the question of how clinicians should judge AI before using it. AI and faculty-development conversations pointed to the same CME problem: learning has to include observable evaluation, not passive orientation.

AI literacy now needs an evaluation framework

The clearest signal came from clinician discussion of a JAMA systematic review of LLM testing in health care. The phrase that matters for CME teams was blunt: “Only 5% utilized real patient data, highlighting a significant gap in evaluations.” The same source summary noted that accuracy dominates evaluation while fairness, bias, and toxicity are less often assessed.

That changes the job of AI education. A session that explains what LLMs are is no longer enough if clinicians are being asked to decide whether an output is safe, biased, incomplete, or workflow-ready. We saw a related pattern in an earlier brief on AI ethics training moving from optional to non-negotiable; this week’s signal is more operational. The learner task is not just “understand AI risk.” It is “show how you would evaluate this output before acting on it.”

The examples are oncology- and cardiology-led, but the principle is portable. In an oncology diagnostics discussion, AI was framed around pathologist supply, objective scoring, cloud-based workflows, and rapid literature filtering. Those are exactly the settings where clinicians need to know when to trust, challenge, or route an AI-generated answer.

For CME providers, the implication is concrete: build AI modules around case decisions. Give learners an LLM output, the patient context, and a checklist that includes data source, bias risk, toxicity or safety concern, and workflow consequence. Then measure whether they accept, edit, reject, or escalate the output differently after the activity.
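One way an outcomes team might tally that pre/post shift is sketched below. This is a minimal illustration, not a validated outcomes instrument: the `CaseResponse` record, the `summarize_shift` helper, and the field names are hypothetical, and the sketch assumes each learner records one decision per case before the activity and one after. The checklist items themselves (data source, bias risk, safety concern, workflow consequence) are not modeled here.

```python
from collections import Counter
from dataclasses import dataclass

# The four decisions named in the brief; the tuple itself is an assumption
# about how a team might encode them.
DECISIONS = ("accept", "edit", "reject", "escalate")


@dataclass(frozen=True)
class CaseResponse:
    """One learner's decision on one LLM-output case (hypothetical schema)."""
    learner_id: str
    case_id: str
    pre: str   # decision recorded before the activity
    post: str  # decision recorded after the activity

    def __post_init__(self):
        # Reject anything outside the four allowed decisions.
        for decision in (self.pre, self.post):
            if decision not in DECISIONS:
                raise ValueError(f"unknown decision: {decision!r}")


def summarize_shift(responses):
    """Count pre -> post transitions and the share of responses that changed."""
    transitions = Counter((r.pre, r.post) for r in responses)
    changed = sum(n for (pre, post), n in transitions.items() if pre != post)
    total = len(responses)
    return {
        "total": total,
        "changed": changed,
        "change_rate": changed / total if total else 0.0,
        "transitions": dict(transitions),
    }


if __name__ == "__main__":
    sample = [
        CaseResponse("learner-1", "case-1", "accept", "edit"),
        CaseResponse("learner-2", "case-1", "accept", "accept"),
        CaseResponse("learner-3", "case-1", "reject", "escalate"),
    ]
    print(summarize_shift(sample))
```

A change rate alone does not say whether learners moved toward better decisions; pairing each case with an expert reference decision would be a natural next step, and that is a design choice outside what the source discussions describe.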

Faculty development needs observation, not just advice

A second, smaller signal came from educator podcasts rather than broad independent clinician conversation, so it should be treated as emerging. Still, the pattern was specific: faculty want better ways to obtain honest feedback, support stressed learners, and improve teaching behavior in real settings.

In a musculoskeletal oncology educator discussion, faculty described why routine evaluations often fail: power differentials make honest trainee feedback hard to obtain. The useful response was not another lecture on feedback theory. It was more structured observation, including outside observers, objective rubrics, and specific questions about teaching techniques.

A Faculty Factory episode added the wellness layer: clear expectations, psychological safety, reflection, and staged support for stressed learners. It also described a faculty development fellowship using inter-specialty small groups, CME credit, protected time, and reflection to build teaching and coaching skills.

For CME teams, this argues for faculty-development products that institutions can actually implement: short observation rubrics, scripted feedback prompts, reflection guides, and follow-up checks at three to six months. The question is not whether faculty liked the session; it is whether someone can observe a better feedback conversation afterward.

What CME Providers Should Do Now

  • Convert AI sessions into case-based accept/edit/reject exercises with explicit checks for real-world data, bias, fairness, toxicity, and workflow fit.
  • Add an outcomes item that tests whether learners change their judgment of an AI output, not just their confidence using AI.
  • For faculty-development programs, include an observation tool and a follow-up prompt that institutions can use after the CME activity.

What to reconsider

This week’s useful lesson is not only about AI. It is about trust. Clinicians are asking for ways to evaluate new tools, and educators are asking for ways to see whether teaching and feedback actually improve. CME teams should reconsider any activity that stops at orientation. If the desired behavior involves judgment under pressure, the activity should include rehearsal, a rubric, and a visible decision point. Otherwise, the provider may be educating around the hardest part of the job instead of training for it.

Sources

  1. X post by Ryan Nipp, MD, MPH (@RyanNipp)
    Clinicians highlight lack of real-world data and missing bias/toxicity assessment in current LLM studies.
    "Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. @JAMA_current @JAMANetwork #ArtificialIntelligence #MedEd #MEDTECH #DigitalHealth @StanfordMed"
  2. Podcast: Revolutionizing Cancer Diagnostics Through AI-Powered Analysis: CorePath
    Oncology Data Advisor, cited segment 12:48-14:58
    Researchers emphasize need for phased deployment (silent mode to monitored use) and broader specialty coverage.
  3. Podcast: Episode 33: Becoming a Musculoskeletal Oncologist | "Path 1": The Educator
    Sarcoma Insight Podcast, cited segment 20:18-23:22
    Educators describe difficulty obtaining honest feedback due to power differentials and value of structured observation programs.
  4. Podcast: Best Supporting Practices and Strategies for Stressed-Out Learners and Faculty with Jessica Seaman, EdD
    Faculty Factory, cited segment 62:24-64:28
    New inter-specialty faculty development fellowship successfully builds teaching and coaching skills through clear expectations and reflection.
