COMPARATIVE ANALYSIS OF CHATGPT-4 AND CLAUDE 3 OPUS IN ANSWERING ACUTE KIDNEY INJURY AND CRITICAL CARE NEPHROLOGY QUESTIONS

7 Feb 2025
WCN25-AB-359, Poster Board: FRI-611

Introduction:

Large language models (LLMs) such as ChatGPT-4 and Claude 3 Opus are transforming natural language processing (NLP). While Claude 3 Opus has demonstrated superior performance in coding and mathematical reasoning, its ability to answer nephrology-related questions remains untested. This study evaluates the performance of these LLMs in responding to acute kidney injury (AKI) and critical care nephrology questions.

Methods:

We compared ChatGPT-4 and Claude 3 Opus on 101 AKI and critical care nephrology questions drawn from the American Society of Nephrology's Nephrology Self-Assessment Program (NephSAP) and Kidney Self-Assessment Program (KSAP). Data tables were converted to plain text before input into the models. Each model's responses were scored against the correct answers; McNemar's test was used to compare accuracy between the models, and Cohen's kappa was used to assess agreement between them.
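
As an illustrative sketch only (not the authors' code), the scoring and statistical comparison could be implemented as follows in Python, assuming each model's answer choices and the answer key are stored as equal-length lists; all variable and function names here are hypothetical:

# Hypothetical analysis sketch: scores two models' multiple-choice answers against
# an answer key, then applies McNemar's test (paired accuracy) and Cohen's kappa (agreement).
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

def compare_models(key, answers_a, answers_b):
    # key, answers_a, answers_b: equal-length lists of answer choices, e.g. 'A'-'E'.
    correct_a = [a == k for a, k in zip(answers_a, key)]
    correct_b = [b == k for b, k in zip(answers_b, key)]

    # 2x2 paired-correctness table: rows = model A correct/incorrect,
    # columns = model B correct/incorrect.
    table = [
        [sum(a and b for a, b in zip(correct_a, correct_b)),
         sum(a and not b for a, b in zip(correct_a, correct_b))],
        [sum((not a) and b for a, b in zip(correct_a, correct_b)),
         sum((not a) and (not b) for a, b in zip(correct_a, correct_b))],
    ]
    mcnemar_result = mcnemar(table, exact=False, correction=True)

    # Cohen's kappa on the raw answer choices quantifies inter-model agreement.
    kappa = cohen_kappa_score(answers_a, answers_b)
    return table, mcnemar_result.pvalue, kappa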

Results:

ChatGPT-4 correctly answered 81.2% of the questions, while Claude 3 Opus answered 72.3% correctly (P=0.066). The two models gave the same answer on 78.2% of the questions, and 86.1% of these concordant answers were correct. Among the discordant questions, ChatGPT-4 was correct and Claude 3 Opus incorrect on 14, whereas Claude 3 Opus was correct and ChatGPT-4 incorrect on 5; on three questions, both models were incorrect with different responses. The kappa statistic was 0.71, indicating substantial agreement.
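
As an illustrative check only (not part of the reported analysis), the P value is consistent with a continuity-corrected McNemar's test computed from the discordant counts alone:

# Hypothetical re-computation from the reported discordant counts:
# 14 questions where only ChatGPT-4 was correct, 5 where only Claude 3 Opus was correct.
from scipy.stats import chi2

b, c = 14, 5
statistic = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected McNemar statistic ~ 3.37
p_value = chi2.sf(statistic, df=1)           # ~ 0.066, matching the reported P value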

Conclusions:

Both ChatGPT-4 and Claude 3 Opus showed good accuracy on questions pertaining to AKI and critical care nephrology. ChatGPT-4 achieved numerically higher accuracy than Claude 3 Opus, although the difference was not statistically significant. The substantial agreement between the models suggests a shared grasp of the subject matter. Nevertheless, the incorrect responses, particularly on questions where the models diverged in their answers, underscore the need for continued refinement to improve the reliability of these tools in clinical settings.

I have no potential conflict of interest to disclose.

I did not use generative AI or AI-assisted technologies in the writing process.