Robustness Gap of Large Language Models in Nephrology

E-Poster

https://storage.unitedwebnetwork.com/files/1099/e1e17463e29fe35363ac3a7a36f840c8.pdf

Co-author 1

Ayaka Soejima, ayaka.suumo@gmail.com, St. Marianna University School of Medicine, Division of Nephrology and Hypertension, Department of Internal Medicine, Kawasaki, Kanagawa, Japan *

Co-author 2

Fumiya Kitano, fumiya.kitano@marianna-u.ac.jp, St. Marianna University School of Medicine, Division of Nephrology and Hypertension, Department of Internal Medicine, Kawasaki, Kanagawa, Japan -

Co-author 3

Mamoru Masaki, mamoru.masaki@marianna-u.ac.jp, St. Marianna University School of Medicine, Division of Nephrology and Hypertension, Department of Internal Medicine, Kawasaki, Kanagawa, Japan -

Co-author 4

Daisuke Ichikawa, ichikawa6008@gmail.com, St. Marianna University School of Medicine, Division of Nephrology and Hypertension, Department of Internal Medicine, Kawasaki, Kanagawa, Japan -

Co-author 5

Yugo Shibagaki, yugoshibagaki@gmail.com, St. Marianna University School of Medicine, Division of Nephrology and Hypertension, Department of Internal Medicine, Kawasaki, Kanagawa, Japan -

Co-author 6

Ryunosuke Noda, nodaryu00@gmail.com, St. Marianna University School of Medicine, Division of Nephrology and Hypertension, Department of Internal Medicine, Kawasaki, Kanagawa, Japan -

Introduction

Large language models (LLMs) achieve high accuracy on medical benchmarks, raising interest in their clinical application. However, whether this performance reflects genuine reasoning or pattern recognition remains unclear. To evaluate reasoning robustness, we replaced the correct answer in nephrology multiple-choice questions with “None of the other answers” (NOTA) and assessed changes in accuracy. We hypothesized that causal and pathophysiological reasoning would preserve accuracy, whereas reliance on memorized patterns would cause a marked decline. Major LLMs from OpenAI and Google were examined under this framework.
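
The NOTA substitution described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `MCQ` class, the `to_nota` function, and the placeholder question are hypothetical names introduced here. Only the replacement text "None of the other answers" comes from the abstract.

```python
from dataclasses import dataclass, replace

# Replacement text taken from the abstract.
NOTA_TEXT = "None of the other answers"


@dataclass(frozen=True)
class MCQ:
    """Hypothetical container for one multiple-choice question."""
    stem: str
    options: dict  # option label -> option text
    answer: str    # label of the correct option


def to_nota(question: MCQ) -> MCQ:
    """Return a copy in which the correct option's text is replaced by NOTA.

    The answer label is unchanged: the NOTA option is now the correct one,
    so a model must reject every remaining option to answer correctly.
    """
    new_options = dict(question.options)  # copy, so the original is untouched
    new_options[question.answer] = NOTA_TEXT
    return replace(question, options=new_options)


# Placeholder item (not taken from the question set used in the study).
example = MCQ(
    stem="Placeholder stem",
    options={"A": "Option A text", "B": "Option B text", "C": "Option C text"},
    answer="B",
)
modified = to_nota(example)
print(modified.options["B"])  # None of the other answers
```

The substitution is only valid when none of the remaining options is also defensible, which is why the study relied on manual review by nephrologists rather than an automated filter.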

Methods

Two nephrologists independently reviewed all 210 self-assessment questions for nephrology board renewal (2014–2023, Japanese Society of Nephrology). Only questions in which NOTA would be the sole correct answer after replacement were included; discrepancies were resolved by consensus, yielding a final set of 145 validated questions. Four LLMs (GPT-5, GPT-4o, Gemini 2.5 Pro, and Gemini 2.0 Flash) were evaluated via their application programming interfaces under default settings. Each question pair (original and NOTA-modified) was presented independently, except for sequential questions from the same case, which were grouped to maintain clinical context. The primary endpoint was the proportion of correct answers under each condition. Differences in accuracy between the two conditions were tested with the exact two-sided McNemar test, and 95% confidence intervals (CIs) for the drop in accuracy were calculated using a bootstrap method with 1,000 iterations. A two-sided P value <.05 was considered significant. Analyses were performed in Python (pandas, NumPy, SciPy).
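
As a rough illustration of the statistical procedure, the sketch below implements an exact two-sided McNemar test and a paired percentile bootstrap for the accuracy drop. Several points are assumptions rather than details from the abstract: the function names are hypothetical, the percentile method and resampling at the question level are one plausible reading of "a bootstrap method", and the ten-question data set is synthetic. The sketch uses only the standard library to stay self-contained, although the authors report using pandas, NumPy, and SciPy.

```python
import math
import random


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar P value from the discordant counts.

    b: questions answered correctly in the original version only
    c: questions answered correctly in the NOTA version only
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def bootstrap_drop_ci(orig, nota, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy drop in percentage points.

    orig and nota are paired lists of 0/1 correctness per question;
    questions are resampled with replacement, keeping pairs together.
    """
    rng = random.Random(seed)
    n = len(orig)
    drops = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        drops.append(sum(orig[i] - nota[i] for i in idx) / n * 100)
    drops.sort()
    lower = drops[int(alpha / 2 * n_boot)]
    upper = drops[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic paired results for 10 questions (illustration only).
orig = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
nota = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

b = sum(1 for o, m in zip(orig, nota) if o == 1 and m == 0)
c = sum(1 for o, m in zip(orig, nota) if o == 0 and m == 1)
p_value = mcnemar_exact(b, c)
ci_low, ci_high = bootstrap_drop_ci(orig, nota)
print(b, c, p_value)
```

McNemar's test uses only the discordant pairs, which suits this design because every question is answered twice by the same model. The abstract reports marginal counts only, so the per-question pairing needed to reproduce the published P values and intervals is not available here.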

Results

Across all 145 questions, accuracy on the NOTA-modified version was significantly lower than on the original version for all four models (Table 1). GPT-4o accuracy dropped from 66.21% (96/145) to 19.31% (28/145), a decrease of 46.90 percentage points (pp) [95% CI, 37.93–56.55; P < .001]. GPT-5 declined from 87.59% (127/145) to 73.10% (106/145), a drop of 14.48 pp [95% CI, 8.28–21.38; P < .001]. Gemini 2.0 Flash decreased from 58.62% (85/145) to 31.03% (45/145), a drop of 27.59 pp [95% CI, 17.93–36.55; P < .001]. Gemini 2.5 Pro fell from 86.90% (126/145) to 55.86% (81/145), a decline of 31.03 pp [95% CI, 22.76–39.31; P < .001]. Among the four models, GPT-5 showed the smallest decrease and achieved the highest accuracy in both versions, though its decline remained statistically significant.
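
The reported percentages and drops can be rechecked from the counts given above. The short script below is an arithmetic check only; the `summarize` helper is a hypothetical name. It also shows why the GPT-5 drop is 14.48 pp rather than 14.49 pp: the drop is computed from the counts (21/145), not from the two rounded percentages.

```python
N_QUESTIONS = 145

# (correct on original, correct on NOTA-modified), as reported in the Results.
counts = {
    "GPT-4o": (96, 28),
    "GPT-5": (127, 106),
    "Gemini 2.0 Flash": (85, 45),
    "Gemini 2.5 Pro": (126, 81),
}


def summarize(orig_correct, nota_correct, n=N_QUESTIONS):
    """Return (original accuracy %, NOTA accuracy %, drop in pp), rounded to 2 places."""
    return (
        round(100 * orig_correct / n, 2),
        round(100 * nota_correct / n, 2),
        round(100 * (orig_correct - nota_correct) / n, 2),
    )


summary = {model: summarize(*pair) for model, pair in counts.items()}
for model, (acc_orig, acc_nota, drop) in summary.items():
    print(f"{model}: {acc_orig:.2f}% -> {acc_nota:.2f}% (drop {drop:.2f} pp)")
```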

Conclusion

Our findings reveal a robustness gap in medical reasoning across major LLMs. When confronted with nephrology questions requiring reasoning beyond familiar answer patterns, all four models showed significant declines in accuracy. These results suggest that high benchmark scores may overestimate the reliability of LLMs and that accuracy alone fails to capture true clinical reasoning ability. Even GPT-5, which showed the smallest decline, exhibited a statistically significant drop, underscoring the need for caution in autonomous clinical use. For now, the clinical role of LLMs should remain limited to nonautonomous, clinician-supervised decision support.

Keywords