During the congress, E-Posters will be accessible to all participants 24/7 on the congress website, as well as at the E-Poster stations in the congress center.
Preparing your E-Poster
Please review the E-Poster format requirements carefully when preparing your E-Poster. Should your E-Poster not meet the mentioned requirements, it may not be displayed as described above.
E-Poster Submission Deadline
Please prepare and upload your E-Poster no later than March 14, 2026, 11:59 PM CET. After this date, you will no longer be able to prepare or upload your E-Poster, and it will not be displayed or accessible on the congress website.
Large language models (LLMs) increasingly support medical education and clinical decision tasks, yet whether reasoning models confer consistent advantages in nephrology across various task types remains unclear. We compared leading reasoning and baseline models on board-level multiple-choice questions and examined effect modification by question characteristics.
Overall accuracy was 87.6% (183/209; 95% CI 82.4–91.4) for GPT-5 and 83.7% (175/209; 95% CI 78.1–88.1) for Gemini 2.5 Pro, versus 69.9% (146/209; 95% CI 63.3–75.7) for GPT-4o and 62.7% (131/209; 95% CI 55.9–69.0) for Gemini 2.0 Flash (Figure 1). Paired analyses favored the reasoning models (OpenAI odds ratio [OR] 6.29, 95% CI 3.25–16.00; Google OR 7.29, 95% CI 3.83–18.33; both p<0.001). In generalized linear mixed models (GLMMs), adjusted ORs (aORs) for reasoning vs baseline were 5.00 (95% CI 3.00–8.35; p<0.001) for OpenAI and 7.28 (95% CI 4.60–11.52; p<0.001) for Google. Interaction analyses showed a larger effect for clinical questions in the OpenAI family (aOR 13.94; 95% CI 5.68–34.25) and taxonomy-dependent effects in the Google family (recall aOR 7.28; 95% CI 4.60–11.52; interpretation aOR 2.56; 95% CI 1.02–6.45); no significant effect modification by image inclusion was detected. Of the 209 questions, 105 (50.2%) were answered correctly by all four models, 25 (12.0%) were answered correctly by both reasoning models but by neither baseline model, and 15 (7.2%) were missed by all four models; the latter often required an understanding of Japan-specific guidelines, research, and medical culture.
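The per-model accuracies and 95% confidence intervals above can be recomputed from the reported counts. The sketch below assumes a Wilson score interval with z = 1.96; the text does not state which interval method was used, so this is an assumption, although it appears to agree with the reported intervals to within rounding. The function name `wilson_interval` and the dictionary `reported_counts` are illustrative names, not part of the original analysis.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: the abstract does not name its CI method; Wilson is used
    here only as a plausible choice for checking the reported numbers.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")

    p_hat = successes / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


# Correct-answer counts out of 209 questions, taken from the text above.
N_QUESTIONS = 209
reported_counts = {
    "GPT-5": 183,
    "Gemini 2.5 Pro": 175,
    "GPT-4o": 146,
    "Gemini 2.0 Flash": 131,
}

for model, correct in reported_counts.items():
    lower, upper = wilson_interval(correct, N_QUESTIONS)
    print(f"{model}: {correct / N_QUESTIONS:.1%} (95% CI {lower:.1%} to {upper:.1%})")
```

This only checks the unpaired, per-model intervals. The paired odds ratios and the GLMM estimates depend on question-level data that the abstract does not include, so they cannot be recomputed this way.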
Reasoning models outperformed baseline models on nephrology questions, with advantages that varied by task demands: they were most pronounced for clinical questions in the OpenAI family and for recall-dominant tasks in the Google family. These results suggest the potential utility of selectively applying reasoning models for education and decision support, while also revealing their current difficulty with the specifics of local medical practice.