OSCRSJ
LLMs · 4 min read

ChatGPT in medicine: a narrative review of applications, failure modes, and the ethical boundaries clinicians need to hold

Source: International Journal of General Medicine · Published: March 2024

Authors: Mu Y, He D · DOI: 10.2147/IJGM.S456659 · Open Access

Bottom line: ChatGPT has demonstrated useful capabilities in triage support, patient-education drafting, and study assistance, with reported overall accuracy around 72 percent in one diagnostic study and 95.7 percent sensitivity for admission triage in another. It also hallucinates, fabricates citations, and encodes training biases. It is not a clinical decision-making tool.

What the study did

The authors conducted a literature search of PubMed, Web of Science, and Google Scholar for studies discussing ChatGPT’s applications in the medical field. The review organizes findings across four domains: clinical practice (diagnosis, triage, documentation), healthcare delivery (chatbots, patient education), medical education (study aid, simulated cases), and medical research (literature summarization, writing assistance). Limitations, ethical considerations, and future research directions are consolidated into a single reference piece.

What they found

Reported performance of ChatGPT was variable across use cases. In one diagnostic evaluation the model achieved an overall accuracy of 71.7 percent (95% CI 69.3 to 74.1 percent), with notably lower performance on differential diagnosis and clinical management than on general medical knowledge. In a triage study the model had 95.7 percent sensitivity for identifying patients suitable for admission but only 18.2 percent specificity for identifying patients who could be safely discharged. In patient-facing correspondence, clinical letters generated by ChatGPT scored a median accuracy of 7 out of 9 with a weighted kappa of 0.80 (p < 0.0001). The review repeatedly emphasizes hallucination, fabricated references, and algorithmic bias as core failure modes.

Why it matters for orthopedic practice

For orthopedic trainees, the practical takeaway is scope discipline. ChatGPT is appropriate for drafting, explaining, summarizing, and studying, provided every clinical fact is re-verified against the primary source. It is not appropriate as a source of differential diagnoses, clinical management recommendations, or verified citations. The triage numbers illustrate why: high sensitivity with very low specificity means the model will over-refer (a back-of-envelope illustration follows below), which is acceptable for safety netting but is not decision-making. Residency programs integrating LLM tools should set explicit expectations about what counts as appropriate use and what does not.
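To make the over-referral point concrete, the sketch below runs the reported sensitivity (95.7 percent) and specificity (18.2 percent) through a hypothetical emergency-department cohort. The cohort size and admission prevalence are illustrative assumptions chosen for this example, not figures from the review.

```python
# Back-of-envelope illustration of the triage trade-off described in the review.
# Sensitivity and specificity are the values reported in the cited triage study;
# the cohort size and admission prevalence below are hypothetical assumptions.

sensitivity = 0.957   # fraction of patients who truly need admission that the model flags
specificity = 0.182   # fraction of safely dischargeable patients the model correctly clears

cohort = 1000                 # hypothetical ED cohort (assumption)
admission_prevalence = 0.30   # hypothetical fraction who truly need admission (assumption)

needs_admission = cohort * admission_prevalence
dischargeable = cohort - needs_admission

true_positives = sensitivity * needs_admission          # correctly flagged admissions
false_positives = (1 - specificity) * dischargeable     # dischargeable patients over-referred
false_negatives = (1 - sensitivity) * needs_admission   # admissions the model would miss

print(f"Flagged for admission: {true_positives + false_positives:.0f} of {cohort}")
print(f"  truly need admission: {true_positives:.0f}")
print(f"  could have been discharged (over-referrals): {false_positives:.0f}")
print(f"Missed admissions: {false_negatives:.0f}")
```

Under these assumed inputs, roughly 860 of 1,000 patients would be flagged for admission, of whom about 570 could have been safely discharged: the safety-netting behavior the review describes, not a decision-making tool.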

Limitations

This is a narrative review, not a systematic review, and the authors did not quantify study quality or pool effect sizes. The cited studies span heterogeneous clinical domains, so generalizing to orthopedic practice specifically requires caution. Many of the cited performance numbers come from early iterations of ChatGPT, and current model versions may perform better or worse. The review also predates several major model releases and the broader integration of retrieval-augmented generation, which may reduce the hallucination rate described here.

Mu Y, He D. The potential applications and challenges of ChatGPT in the medical field. Int J Gen Med. 2024;17:817-826. doi:10.2147/IJGM.S456659

Publishing AI research in orthopedics?

OSCRSJ accepts case reports and series on novel AI-assisted diagnoses and surgical planning. Free to publish in 2026.

Submit a manuscript

OSCRSJ News items are editorial summaries for educational purposes. They are not clinical recommendations, endorsements, or substitutes for the primary literature. Always consult the source paper and applicable specialty-society guidelines before changing practice.