Abstract:
This article presents a comparative evaluation of three large language models (LLMs), namely GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, examining their ability to automate key healthcare workflows while adhering to algorithmic constraints and supporting interpretability and fairness. The models were evaluated in Python, JavaScript, and Go under varying levels of prompt completeness across four healthcare tasks of increasing complexity: bed allocation, dynamic patient bed reallocation, ambulance dispatch, and patient triage.
We introduce a multidimensional evaluation framework that captures model performance across task complexity, prompt completeness, and programming language, with an emphasis on generating functionally correct, transparent, and reliable code. This framework enables a systematic analysis of how effectively LLMs translate natural language specifications into executable logic under realistic, constraint-rich healthcare scenarios.
Experimental results show that all three models generate constraint-compliant solutions for simpler tasks such as bed management. However, as task complexity increases and multiple constraints must be balanced, clear performance differences emerge. Claude 3.5 Sonnet consistently outperforms GPT-4o and Gemini 2.0 Flash by producing more robust, interpretable, and reliable code. These findings highlight Claude 3.5 Sonnet's stronger potential for transparent and dependable automation of critical healthcare services using LLM-based code generation. The code is publicly available at: https://github.com/gauriivaidya/alter-automated-healthcare-tasks.