Objective: There is increasing interest in applying artificial intelligence chatbots like generative pretrained transformer 4 (GPT-4) in the medical field. This study aimed to explore the universality of GPT-4 responses to simulated clinical scenarios of developmental dysplasia of the hip (DDH) across diverse global settings. Methods: Seventeen international experts with more than 15 years of experience in pediatric orthopaedics were selected for the evaluation panel. Eight simulated DDH clinical scenarios were created, covering 4 key areas: (1) initial evaluation and diagnosis, (2) initial examination and treatment, (3) nursing care and follow-up, and (4) prognosis and rehabilitation planning. Each scenario was completed independently in a new GPT-4 session. Interrater reliability was assessed using Fleiss kappa, and the quality, relevance, and applicability of GPT-4 responses were analyzed using median scores and interquartile ranges. Following scoring, experts met in ZOOM sessions to generate Regional Consensus Assessment Scores, which were intended to represent a consistent regional assessment of the use of the GPT-4 in pediatric orthopaedic care. Results: GPT-4’s responses to the 8 clinical DDH scenarios received performance scores ranging from 44.3% to 98.9% of the 88-point maximum. The Fleiss kappa statistic of 0.113 (P = 0.001) indicated low agreement among experts in their ratings. When assessing the responses’ quality, relevance, and applicability, the median scores were 3, with interquartile ranges of 3 to 4, 3 to 4, and 2 to 3, respectively. Significant differences were noted in the prognosis and rehabilitation domain scores (P < 0.05 for all). Regional consensus scores were 75 for Africa, 74 for Asia, 73 for India, 80 for Europe, and 65 for North America, with the Kruskal-Wallis test highlighting significant disparities between these regions (P = 0.034). Conclusions: This study demonstrates the promise of GPT-4 in pediatric orthopaedic care, particularly in supporting preliminary DDH assessments and guiding treatment strategies for specialist care. However, effective integr care contexts, highlighting the importance of a nuanced approach to health technology adaptation. Level of Evidence: Level IV.
Luo S, Canavese F, Aroojis A, Andreacchio A, Anticevic D, Bouchard M, Castaneda P, De Rosa V, Fiogbe MA, Frick SL, Hui JH, Johari AN, Loro A, Lyu X, Matsushita M, Omeroglu H, Roye DP, Shah MM, Yong B, Li L. Are Generative Pretrained Transformer 4 Responses to Developmental Dysplasia of the Hip Clinical Scenarios Universal? An International Review. J Pediatr Orthop. 2024 Jul 1;44(6):e504-e511. doi: 10.1097/BPO.0000000000002682. Epub 2024 Apr 9. PMID: 38597198.
Canavese F;De Rosa V;
2024-01-01
Abstract
Objective: There is increasing interest in applying artificial intelligence chatbots like generative pretrained transformer 4 (GPT-4) in the medical field. This study aimed to explore the universality of GPT-4 responses to simulated clinical scenarios of developmental dysplasia of the hip (DDH) across diverse global settings. Methods: Seventeen international experts with more than 15 years of experience in pediatric orthopaedics were selected for the evaluation panel. Eight simulated DDH clinical scenarios were created, covering 4 key areas: (1) initial evaluation and diagnosis, (2) initial examination and treatment, (3) nursing care and follow-up, and (4) prognosis and rehabilitation planning. Each scenario was completed independently in a new GPT-4 session. Interrater reliability was assessed using Fleiss kappa, and the quality, relevance, and applicability of GPT-4 responses were analyzed using median scores and interquartile ranges. Following scoring, experts met in ZOOM sessions to generate Regional Consensus Assessment Scores, which were intended to represent a consistent regional assessment of the use of the GPT-4 in pediatric orthopaedic care. Results: GPT-4’s responses to the 8 clinical DDH scenarios received performance scores ranging from 44.3% to 98.9% of the 88-point maximum. The Fleiss kappa statistic of 0.113 (P = 0.001) indicated low agreement among experts in their ratings. When assessing the responses’ quality, relevance, and applicability, the median scores were 3, with interquartile ranges of 3 to 4, 3 to 4, and 2 to 3, respectively. Significant differences were noted in the prognosis and rehabilitation domain scores (P < 0.05 for all). Regional consensus scores were 75 for Africa, 74 for Asia, 73 for India, 80 for Europe, and 65 for North America, with the Kruskal-Wallis test highlighting significant disparities between these regions (P = 0.034). Conclusions: This study demonstrates the promise of GPT-4 in pediatric orthopaedic care, particularly in supporting preliminary DDH assessments and guiding treatment strategies for specialist care. However, effective integr care contexts, highlighting the importance of a nuanced approach to health technology adaptation. Level of Evidence: Level IV.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.