
ChatGPT vs JBJS systematic reviews: median 91% of target abstracts captured in search, 75% after screening, 100% after manual review of model-identified papers

Source: Archives of Bone and Joint Surgery · Published: 2025

Authors: Yao JJ, Lopez RD, Rizk AA, Aggarwal M, Namdari S · DOI: 10.22038/ABJS.2025.84896.3874 · Open Access

Key table: Tables 2 and 3 · Head-to-head comparison of ChatGPT search and screening performance against five JBJS systematic reviews, reporting captured article counts, sensitivity, and precision for each review.

Bottom line: ChatGPT-4 is useful as a supervised assistant for the search and screening phases of an orthopedic systematic review, not as an autonomous replacement for a research team. Prompt engineering materially changed the results. The LLM is not deterministic: identical prompts can return different outputs on different days.

What the study did

The authors identified five systematic reviews published in the Journal of Bone and Joint Surgery in 2021 and 2022 covering spine, arthroplasty, hip arthroscopy, and meniscal surgery topics. They used ChatGPT-4 to perform three distinct PRISMA-aligned tasks for each review: designing a database search strategy from the clinical question, screening the resulting abstracts against the original inclusion and exclusion criteria, and (for one review) reviewing individual manuscript texts for inclusion. ChatGPT memory was cleared between tasks and between reviews. Outputs were compared against the final list of articles actually included in each published review. The primary outcome was the percentage of target articles captured at each step; secondary outcomes were precision and F-score.
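For reference, the outcome metrics are the standard retrieval measures. Writing TP for target articles the model captured, FN for targets it missed, and FP for non-target articles it returned:

    sensitivity (capture rate) = TP / (TP + FN)
    precision                  = TP / (TP + FP)
    F-score                    = 2 × precision × sensitivity / (precision + sensitivity)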

What they found

For search strategy design, ChatGPT captured a median of 91% of target abstracts (IQR 84 to 94%) while returning a roughly eightfold larger total set than the original manual searches (median 8,388 abstracts vs 1,307), consistent with the broad-search prompt used. For abstract screening against the original inclusion and exclusion criteria, ChatGPT captured a median of 75% of target articles (IQR 70 to 79%). For manuscript-level screening on a single review, ChatGPT initially captured only 55% of target articles, but manual review of the 28 papers it flagged recovered all 11 (100%) of that review's included papers. Prompt engineering was required to elicit reliable performance; early iterations produced fabricated abstracts with invented titles and DOIs.
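To see the sensitivity-precision trade-off behind these search numbers, take a deliberately invented round-number example (these are not per-review counts from the paper): if a review's final inclusion list holds 50 articles and a broad ChatGPT search returns 8,388 abstracts that capture 45 of them, then

    sensitivity = 45 / 50 = 90%
    precision   = 45 / 8,388 ≈ 0.5%

A broad prompt buys high capture at the cost of a large downstream screening burden, which is exactly the stage the authors hand back to reviewers.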

Why it matters for orthopedic practice

For residents and early-career faculty weighing whether to use LLMs in literature work, this study offers the most rigorous benchmark available to date in orthopedics. The headline is that a well-prompted LLM can replicate meaningful portions of PRISMA phases 1 and 2 (identification and screening), but manuscript-level inclusion screening still requires human judgment. The workflow most supported by these data is sequential: use ChatGPT to draft broad search strategies and pre-screen abstracts, then rely on trained reviewers for manuscript-level inclusion and quality assessment. Unsupervised delegation of a systematic review to an LLM remains unsafe, and fabricated references are a recurring failure mode even on recent model versions.
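As a concrete illustration of what the supervised abstract pre-screening step could look like, here is a minimal Python sketch assuming the OpenAI chat-completions client. The model name, prompt wording, and inclusion criteria are invented for illustration; they are not the authors' actual prompts or criteria.

    import json
    from openai import OpenAI  # assumes the openai Python package (v1+) is installed

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Illustrative criteria only -- not the criteria from any of the five JBJS reviews.
    CRITERIA = (
        "Include: comparative clinical studies of rotator cuff repair in adults "
        "reporting retear rate or a validated outcome score. "
        "Exclude: case reports, animal or cadaver studies, review articles."
    )

    def screen_abstract(title: str, abstract: str) -> dict:
        """Screen one abstract; returns {'include': bool, 'reason': str}."""
        response = client.chat.completions.create(
            model="gpt-4o",  # assumed model name; the study itself used ChatGPT-4
            temperature=0,   # reduces, but does not eliminate, run-to-run variability
            response_format={"type": "json_object"},
            messages=[
                {"role": "system",
                 "content": "You screen abstracts for a systematic review. Reply with JSON: "
                            '{"include": true or false, "reason": "one sentence"}.'},
                {"role": "user",
                 "content": f"Criteria:\n{CRITERIA}\n\nTitle: {title}\n\nAbstract:\n{abstract}"},
            ],
        )
        return json.loads(response.choices[0].message.content)

The output is meant to feed a human pass, not replace it: everything the model excludes should still be spot-checked, mirroring the manual-review step that recovered all 11 target papers in the manuscript-level test.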

Limitations

The study was not prospective, and the authors engineered their approach to retrieve articles already known to be in the gold-standard reviews, which does not fully answer whether ChatGPT can conduct a systematic review de novo. Only five JBJS reviews were included. ChatGPT is a general-purpose LLM trained on general internet text, not tuned for medical literature; a domain-specific model may perform differently in either direction. Corresponding author Dr. Namdari reports a large set of industry disclosures, though none are directly tied to the study outcome. Variable PDF formats were a technical barrier to manuscript-level review. Finally, the study used the ChatGPT-4 release of March 2023, and newer model versions may perform differently.

Yao JJ, Lopez RD, Rizk AA, Aggarwal M, Namdari S. Evaluation of a popular large language model in orthopedic literature review: comparison to previously published reviews. Arch Bone Jt Surg. 2025;13(7):460-469. doi:10.22038/ABJS.2025.84896.3874


OSCRSJ News items are editorial summaries for educational purposes. They are not clinical recommendations, endorsements, or substitutes for the primary literature. Always consult the source paper and applicable specialty-society guidelines before changing practice.