PERFORMANCE OF REASONING MODELS ON NEPHROLOGY MULTIPLE-CHOICE QUESTIONS

E-Poster

https://storage.unitedwebnetwork.com/files/1099/7c27065b81ec92e32090d8ba871d7c4b.pdf

Authors

Mamoru Masaki* (mamoru.masaki@marianna-u.ac.jp)
Fumiya Kitano (fumiya.kitano@marianna-u.ac.jp)
Ayaka Soejima (ayaka.suumo@gmail.com)
Daisuke Ichikawa (ichikawa6008@gmail.com)
Yugo Shibagaki (yugoshibagaki@gmail.com)
Ryunosuke Noda (nodaryu00@gmail.com)

Affiliation (all authors): Division of Nephrology and Hypertension, Department of Internal Medicine, St. Marianna University School of Medicine, Kawasaki, Japan

Introduction

Large language models (LLMs) increasingly support medical education and clinical decision tasks, yet whether reasoning models confer consistent advantages in nephrology across various task types remains unclear. We compared leading reasoning and baseline models on board-level multiple-choice questions and examined effect modification by question characteristics.

Methods

Results

Overall accuracy was 87.6% (183/209; 95% CI 82.4–91.4) for GPT-5 and 83.7% (175/209; 95% CI 78.1–88.1) for Gemini 2.5 Pro, versus 69.9% (146/209; 95% CI 63.3–75.7) for GPT-4o and 62.7% (131/209; 95% CI 55.9–69.0) for Gemini 2.0 Flash (Figure 1). Paired analyses favored the reasoning models (OpenAI odds ratio [OR] 6.29, 95% CI 3.25–16.00; Google OR 7.29, 95% CI 3.83–18.33; both p<0.001). In generalized linear mixed models (GLMMs), adjusted ORs (aORs) for reasoning vs baseline were 5.00 (95% CI 3.00–8.35; p<0.001) for OpenAI and 7.28 (95% CI 4.60–11.52; p<0.001) for Google. Interaction analyses showed larger effects for clinical questions in the OpenAI family (aOR 13.94; 95% CI 5.68–34.25) and taxonomy-dependent effects in the Google family (recall aOR 7.28; 95% CI 4.60–11.52; interpretation aOR 2.56; 95% CI 1.02–6.45); no significant modification by image inclusion was detected. Of 209 questions, 105 (50.2%) were answered correctly by all four models; 25 (12.0%) were answered correctly by both reasoning models but by neither baseline model; and 15 (7.2%) were missed by all models. The questions missed by all models often required an understanding of Japan-specific guidelines, research, and medical culture.
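The arithmetic behind the accuracy figures can be checked with a short sketch. The abstract does not state how the confidence intervals or the paired ORs were computed, so both helpers below rest on assumptions: `wilson_ci` assumes a Wilson score interval, and `discordant_pairs` assumes the paired OR is the ratio of discordant pairs (questions only the reasoning model answered correctly, divided by questions only the baseline model answered correctly). The function names are hypothetical, and the discordant counts are back-calculated from the reported totals, not reported by the authors.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ci(successes, n, level=0.95):
    """Wilson score interval for a binomial proportion (assumed method)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def discordant_pairs(reasoning_correct, baseline_correct, reported_or, n=209):
    """Find discordant counts (b, c) consistent with the reported totals.

    b = questions only the reasoning model got right,
    c = questions only the baseline model got right.
    Uses b - c = difference in correct counts, and assumes OR = b / c
    rounded to two decimals.
    """
    diff = reasoning_correct - baseline_correct
    matches = []
    c = 1
    while (c + diff) + c <= n:
        b = c + diff
        if abs(b / c - reported_or) < 0.005:
            matches.append((b, c))
        c += 1
    return matches


# Correct-answer counts out of 209, taken from the Results text.
counts = {
    "GPT-5": 183,
    "Gemini 2.5 Pro": 175,
    "GPT-4o": 146,
    "Gemini 2.0 Flash": 131,
}

for model, k in counts.items():
    lo, hi = wilson_ci(k, 209)
    print(f"{model}: {100 * k / 209:.1f}% ({100 * lo:.1f}-{100 * hi:.1f})")

print("OpenAI discordant pairs:", discordant_pairs(183, 146, 6.29))
print("Google discordant pairs:", discordant_pairs(175, 131, 7.29))
```

Under these assumptions, the Wilson intervals agree with the reported 95% CIs to one decimal place, and the only discordant splits compatible with the reported ORs are 44 vs 7 (OpenAI) and 51 vs 7 (Google). The sketch does not attempt to reproduce the confidence intervals of the paired ORs, since the method used for them is not given.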

Conclusion

Reasoning models outperformed baseline models on nephrology questions, with advantages that varied by task demands—pronounced for clinical reasoning in the OpenAI family and for recall-dominant tasks in the Google family. These results suggest the potential utility of selectively applying reasoning models for education and decision support, while also revealing their current difficulty in understanding the specific details of local medical practices.

Keywords