During the congress, E-Posters will be accessible to all participants 24/7 on the congress website, as well as at the E-Poster stations in the congress center.
Preparing your E-Poster
Please review the E-Poster format requirements carefully when preparing your E-Poster. Should your E-Poster not meet the mentioned requirements, it may not be displayed as described above.
E-Poster Submission Deadline
Please prepare and upload your E-Poster no later than March 14, 2026, 11:59 PM CET. After this date, you will no longer be able to prepare or upload your E-Poster, and it will not be displayed or accessible on the congress website.
Large language models (LLMs) achieve high accuracy on medical benchmarks, raising interest in their clinical application. However, whether this performance reflects genuine reasoning or pattern recognition remains unclear. To evaluate reasoning robustness, we replaced the correct answer in nephrology multiple-choice questions with “None of the other answers” (NOTA) and assessed changes in accuracy. We hypothesized that causal and pathophysiological reasoning would preserve accuracy, whereas reliance on memorized patterns would cause a marked decline. Major LLMs from OpenAI and Google were examined under this framework.
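The NOTA substitution described above can be illustrated with a minimal sketch. The function name `to_nota_variant` and the list-based question representation are hypothetical and are not taken from the study; the sketch covers only the mechanical replacement, not the expert review that decided which questions were eligible.

```python
NOTA = "None of the other answers"

def to_nota_variant(choices, correct_index):
    """Return a copy of `choices` with the correct option replaced by NOTA.

    Hypothetical helper for illustration only. The position of the correct
    answer is unchanged, so the returned index still points at the option
    a model must select (now the NOTA option).
    """
    if not 0 <= correct_index < len(choices):
        raise ValueError("correct_index out of range")
    modified = list(choices)          # copy, so the original item is untouched
    modified[correct_index] = NOTA
    return modified, correct_index
```

A model that derives the answer from the clinical scenario should notice that its answer is missing and choose the NOTA option; a model matching on familiar answer text has nothing left to match.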
Two nephrologists independently reviewed all 210 self-assessment questions for nephrology board renewal (2014–2023, Japanese Society of Nephrology). Only questions in which NOTA would be the sole correct answer after replacement were included, with discrepancies resolved by consensus, resulting in a final set of 145 validated questions. Four LLMs (GPT-5, GPT-4o, Gemini 2.5 Pro, and Gemini 2.0 Flash) were evaluated via their application programming interfaces under default settings. Each question pair (original and NOTA-modified) was presented independently, except for sequential questions from the same case, which were grouped to maintain clinical context. The primary endpoint was the proportion of correct answers under each condition. Statistical significance was assessed using the McNemar test (exact, two-sided), and 95% confidence intervals (CI) for the drop in accuracy were calculated using a bootstrap method with 1,000 iterations. A two-sided P value < .05 was considered significant. The analysis was performed using Python (pandas, NumPy, SciPy).
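The paired analysis described above could be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the exact McNemar test is expressed as a two-sided binomial test on the discordant pairs, and the bootstrap is assumed to resample questions as pairs and use the percentile method, since the abstract does not specify the variant.

```python
import numpy as np
from scipy.stats import binomtest

def mcnemar_exact(orig_correct, nota_correct):
    """Exact two-sided McNemar test for paired binary outcomes."""
    orig = np.asarray(orig_correct, dtype=bool)
    nota = np.asarray(nota_correct, dtype=bool)
    b = int(np.sum(orig & ~nota))   # correct on original only
    c = int(np.sum(~orig & nota))   # correct on NOTA version only
    if b + c == 0:                  # no discordant pairs: no evidence of change
        return 1.0
    # Under H0, each discordant pair falls on either side with probability 0.5.
    return float(binomtest(b, b + c, 0.5, alternative="two-sided").pvalue)

def bootstrap_drop_ci(orig_correct, nota_correct, n_boot=1000, seed=0):
    """Percentile bootstrap 95% CI for the accuracy drop, in percentage points.

    Assumption: questions are resampled as pairs, with replacement.
    """
    orig = np.asarray(orig_correct, dtype=float)
    nota = np.asarray(nota_correct, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(orig)
    idx = rng.integers(0, n, size=(n_boot, n))   # one row of indices per replicate
    drops = (orig[idx].mean(axis=1) - nota[idx].mean(axis=1)) * 100
    lo, hi = np.percentile(drops, [2.5, 97.5])
    return float(lo), float(hi)
```

Both functions take per-question correctness vectors (1 = correct, 0 = incorrect) in the same question order. Resampling pairs rather than each condition separately preserves the within-question correlation that the McNemar test also relies on.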
Across all 145 questions, accuracy on the NOTA-modified version was significantly lower than on the original version for all four models (Table 1). GPT-4o accuracy dropped from 66.21% (96/145) to 19.31% (28/145), a decrease of 46.90 percentage points (pp) [95% CI, 37.93–56.55; P < .001]. GPT-5 declined from 87.59% (127/145) to 73.10% (106/145), a drop of 14.48 pp [95% CI, 8.28–21.38; P < .001]. Gemini 2.0 Flash decreased from 58.62% (85/145) to 31.03% (45/145), a drop of 27.59 pp [95% CI, 17.93–36.55; P < .001]. Gemini 2.5 Pro fell from 86.90% (126/145) to 55.86% (81/145), a decline of 31.03 pp [95% CI, 22.76–39.31; P < .001]. Among the four models, GPT-5 showed the smallest decrease and achieved the highest accuracy in both versions, though its decline remained statistically significant.
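The reported percentages can be reproduced from the correct-answer counts given above. The sketch below uses only those counts; the function name `accuracy_drop` is introduced here for illustration.

```python
N_QUESTIONS = 145

# (correct on original, correct on NOTA-modified), as reported in the text
COUNTS = {
    "GPT-4o": (96, 28),
    "GPT-5": (127, 106),
    "Gemini 2.0 Flash": (85, 45),
    "Gemini 2.5 Pro": (126, 81),
}

def accuracy_drop(orig, nota, n=N_QUESTIONS):
    """Return (original %, NOTA %, drop in percentage points), rounded to 2 dp."""
    return (
        round(100 * orig / n, 2),
        round(100 * nota / n, 2),
        round(100 * (orig - nota) / n, 2),   # drop computed from counts
    )

for model, (orig, nota) in COUNTS.items():
    print(model, accuracy_drop(orig, nota))
```

Because the drop is computed from the counts rather than from the rounded percentages, GPT-5 shows 14.48 pp (21/145) even though 87.59 minus 73.10 is 14.49.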
Our findings reveal a robustness gap in medical reasoning across major LLMs. When confronted with nephrology questions requiring reasoning beyond familiar answer patterns, all four models showed significant declines in accuracy. These results indicate that high benchmark scores overestimate the reliability of these models and that accuracy alone fails to capture true clinical reasoning ability. Even GPT-5, which demonstrated the mildest decline, exhibited a statistically significant drop, underscoring the need for caution in autonomous clinical use. The clinical role of LLMs should prudently remain limited to nonautonomous, clinician-supervised decision support.