Large Language Models in Medicine: Clinical Applications, Technical Challenges, and Ethical Considerations
Abstract
Objectives
This study presents a comprehensive review of the clinical applications, technical challenges, and ethical considerations associated with using large language models (LLMs) in medicine.
Methods
A literature survey of peer-reviewed articles, technical reports, and expert commentary from relevant medical and artificial intelligence journals was conducted. Key clinical application areas, technical limitations (e.g., accuracy, validation, transparency), and ethical issues (e.g., bias, safety, accountability, privacy) were identified and analyzed.
Results
LLMs have potential in clinical documentation assistance, decision support, patient communication, and workflow optimization. The level of supporting evidence varies; documentation support applications are relatively mature, whereas autonomous diagnostics continue to face notable limitations regarding accuracy and validation. Key technical challenges include model hallucination, lack of robust clinical validation, integration issues, and limited transparency. Ethical concerns involve algorithmic bias risking health inequities, threats to patient safety from inaccuracies, unclear accountability, data privacy, and impacts on clinician-patient interactions.
Conclusions
LLMs possess transformative potential for clinical medicine, particularly by augmenting clinician capabilities. However, substantial technical and ethical hurdles necessitate rigorous research, validation, clearly defined guidelines, and human oversight. Existing evidence supports an assistive rather than autonomous role, mandating careful, evidence-based integration that prioritizes patient safety and equity.
I. Introduction
The landscape of artificial intelligence (AI) in medicine is undergoing a profound transformation driven by large language models (LLMs). These advanced AI systems, grounded in deep learning methodologies and commonly based on transformer neural network architectures [1], exhibit remarkable proficiency in processing, understanding, and generating human language. Notable examples at the forefront of this technology include OpenAI’s Generative Pre-trained Transformer (GPT) series, Google’s Gemini, and Meta’s LLaMA [2]. The capabilities of these models derive from extensive pre-training on massive, diverse textual datasets comprising trillions of words sourced from the internet, books, and other digitized materials. This comprehensive pre-training equips them with broad knowledge of grammar, semantics, general world information, and complex linguistic patterns [2,3].
Adapting these general-purpose models to specialized fields such as medicine requires a process known as fine-tuning [4]. Fine-tuning involves additional training of a pre-trained model using a smaller, domain-specific dataset, such as clinical notes, biomedical literature, or patient communication transcripts. This procedure refines the model’s general linguistic skills, aligning them with the specialized vocabulary, subtleties, and reasoning processes unique to medicine. Examples of such tailored models include BioBERT [5], ClinicalBERT [6], GatorTron [7], Med-PaLM [8], and Med-Gemini [9]. These models, fine-tuned or specifically designed using clinical or biomedical texts, typically demonstrate superior performance in medical tasks compared to their general-purpose counterparts [3,10,11].
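To make the fine-tuning procedure concrete, the sketch below shows a masked-language-modeling adaptation pass in the style of BioBERT/ClinicalBERT, using the Hugging Face Transformers library. The base model, the clinical_notes.txt corpus, and all hyperparameters are illustrative assumptions, not the recipes used for any of the models cited above.

```python
# Minimal sketch of domain-adaptive fine-tuning with Hugging Face Transformers.
# The base model and corpus file are placeholders; real clinical fine-tuning
# requires de-identified data and institutional governance approval.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Load a plain-text corpus of (de-identified) clinical notes, one note per line.
dataset = load_dataset("text", data_files={"train": "clinical_notes.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Masked-language-modeling objective, as used for BioBERT/ClinicalBERT-style adaptation.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clinical-bert", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

In practice, such continued pre-training on large de-identified corpora is often followed by supervised instruction tuning for specific clinical tasks.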
When introduced into clinical environments, LLMs enter a setting already influenced by earlier AI applications that primarily analyze structured data, such as laboratory results and billing codes, or medical images used in radiology and pathology. These earlier applications focused on risk prediction, diagnostic support, and operational optimization [12,13]. LLMs, however, represent a paradigm shift owing to their unique capacity to analyze unstructured text, a significant advance given how much clinically relevant information resides in unstructured formats. Until recently, these data sources were largely inaccessible to computational analysis because of the inherent complexity of natural language. LLMs therefore mark a new technological frontier, capable of unlocking valuable insights from previously underutilized textual information [2,10].
In this article, we survey the current landscape of LLM applications within clinical practice, categorizing these applications by their primary functional roles. We discuss technical issues pertinent to building and effectively implementing these models within clinical settings. Substantial emphasis is also placed on ethical considerations, which are particularly critical in healthcare AI applications. Finally, the review concludes by examining future directions and the potential trajectory of LLM integration into medicine.
II. Clinical Applications of Large Language Models
LLMs are currently being explored across a spectrum of clinical activities to enhance efficiency, support decision-making, improve communication, and optimize workflows. The evidence supporting these applications varies, with some areas demonstrating greater maturity than others (Table 1).
1. Clinical Documentation Assistance
A significant portion of clinician time is dedicated to documentation tasks, contributing notably to clinician burnout. LLMs offer potential solutions by automating or assisting with various documentation-related activities.
1) Summarization
LLMs have demonstrated effectiveness in summarizing lengthy clinical texts. Specific applications include condensing progress notes into problem lists, abstracting findings from radiology reports into concise impressions, generating after-visit summaries from detailed history and physical examination notes, summarizing physician-patient dialogues, and shortening extensive patient-generated health questionnaires [14,15]. Research using models such as GPT-4 indicates that LLM-generated summaries can match or even surpass human-produced summaries in completeness and accuracy for particular tasks, while accomplishing this significantly faster [15].
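As a hedged illustration of how such summarization is typically invoked, the sketch below sends a synthetic progress note to a chat-style API and instructs the model to stay within the source text. The model name is a placeholder, and any output would still require clinician review.

```python
# Illustrative sketch of LLM-based clinical summarization via a chat API.
# The model name is a placeholder and the note is synthetic; outputs would
# require clinician verification before any clinical use.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

progress_note = (
    "72F with CHF (EF 35%), admitted with dyspnea. BNP 1450. "
    "Diuresed with IV furosemide; weight down 2 kg. Cr stable at 1.1."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You are a clinical documentation assistant. "
                    "Summarize the note into a concise problem list. "
                    "Do not add information that is not in the note."},
        {"role": "user", "content": progress_note},
    ],
    temperature=0,  # deterministic output is preferable for documentation tasks
)
print(response.choices[0].message.content)
```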
2) Note generation/drafting
LLMs can assist in drafting various clinical documents, including discharge summaries [14], referral letters, and inter-unit handoff notes [16]. Additionally, LLMs can identify goals of care conversations within electronic health record (EHR) notes and rapidly generate clinically useful summaries. Importantly, the reported incidence of hallucination (incorrectly generated content) for this specific task is notably low [17].
3) Information extraction
These models can extract structured data elements (e.g., diagnoses, medications, symptoms, patient-reported outcomes) from unstructured narrative text within EHRs or other clinical documents [18]. This capability facilitates data aggregation for quality improvement initiatives, research purposes, or populating structured fields in EHRs.
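A minimal sketch of this extraction pattern follows, assuming a chat-style API with JSON-constrained output. The schema, model name, and note are illustrative, and extracted fields would need validation before populating an EHR.

```python
# Hedged sketch of structured information extraction from narrative text.
# Schema and model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()
note = "Pt reports worsening cough x3 weeks. Hx: T2DM on metformin 500 mg BID."

schema_instructions = (
    "Extract from the note a JSON object with keys: "
    '"symptoms" (list of strings), "diagnoses" (list of strings), '
    '"medications" (list of {name, dose, frequency}). '
    "Use only information stated in the note."
)

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # constrain output to valid JSON
    messages=[{"role": "system", "content": schema_instructions},
              {"role": "user", "content": note}],
    temperature=0,
)
extracted = json.loads(response.choices[0].message.content)
print(extracted["medications"])  # e.g., [{"name": "metformin", "dose": "500 mg", ...}]
```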
4) Documentation improvement
LLMs may also analyze clinical documentation to identify missing information, inconsistencies (e.g., discrepancies between diagnosis and treatment plan), or areas requiring clarification, thereby improving the overall quality and accuracy of medical records [19].
2. Diagnostic and Clinical Decision Support
LLMs can analyze extensive medical information and perform inferential functions to support clinical decision-making and reasoning processes.
1) Generation of differential diagnoses
LLMs can evaluate patient symptoms, medical history, and preliminary test results to suggest potential differential diagnoses to clinicians. Studies have shown that LLMs can outperform clinicians in generating differential diagnoses and in the quality of diagnostic and management reasoning, especially when employing a chain-of-thought (CoT) process that mirrors human clinical reasoning [20–22]. Thus, LLMs hold potential to enhance diagnostic accuracy and efficiency, particularly in complex clinical scenarios.
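For illustration, a CoT-style diagnostic prompt might look like the sketch below; the vignette is synthetic, and the exact prompt wording used in the cited studies differs.

```python
# Sketch of a chain-of-thought (CoT) prompt for differential diagnosis.
# The explicit instruction to reason step by step mirrors the CoT strategy
# described above; the case vignette is synthetic.
cot_prompt = """You are assisting a physician with diagnostic reasoning.

Case: 58-year-old man with acute chest pain radiating to the back,
blood pressure 178/96 mmHg on the right arm and 142/80 mmHg on the left.

First, reason step by step: list the key findings, state which diagnoses
each finding supports or argues against, and weigh their pretest likelihoods.
Then output a ranked differential diagnosis with a one-line justification
for each entry, and flag any cannot-miss diagnoses."""
```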
2) Answering clinical questions
LLMs function effectively as advanced information retrieval systems capable of answering specific clinical queries by integrating information from medical knowledge bases, scientific literature, or clinical guidelines [8,23,24]. Systems like Med-PaLM 2 have demonstrated strong performance on clinical knowledge assessments analogous to medical licensing examinations (e.g., the MedQA benchmark), achieving accuracy levels comparable to or exceeding those of human experts in multiple studies [9].
3) Interpretation of complex clinical data
LLMs show emerging capabilities for interpreting complex clinical data, potentially integrating information across multiple modalities (e.g., textual reports and imaging findings). Current research involving multimodal models such as Med-PaLM M and Med-Gemini explores these promising opportunities [23,25]. Such interpretation capabilities could also assist in treatment planning decisions [17,26].
Nevertheless, translating these capabilities into reliable clinical decision support (CDS) remains challenging. Although strong performance on standardized assessments is reassuring, studies simulating real clinical scenarios have identified significant deficiencies [27]. A notable gap persists between performance on knowledge-based tests and the nuanced, context-dependent judgment essential for clinical practice. Current LLMs excel at information retrieval yet lack the sophisticated clinical judgment and integrative capacities of human clinicians [28]. Moreover, seamless integration with clinical workflows and EHR systems, which would be essential for successful CDS, remains difficult to achieve [29].
3. Patient Communication and Engagement
LLMs’ natural language processing abilities offer potential improvements in clinician-patient communication and in patients’ access to, understanding of, and dissemination of health information.
1) Simplifying medical information
LLMs can translate complex medical terminology (e.g., informed consent documents) [30,31], research findings (e.g., clinical trial summaries) [32], or clinical reports (e.g., discharge summaries) [6,33] into straightforward language that patients can easily comprehend. This application has significant potential to enhance patient understanding and promote effective shared decision-making.
2) Patient question answering/chatbots
Virtual assistants or chatbots powered by LLMs can respond to patient inquiries regarding health conditions or treatments, offering preliminary information and guidance [34]. Notably, LLM-generated responses can convey a substantial degree of empathy and, in some evaluations of written communications, have been rated as more empathetic than responses from human clinicians [22,35].
3) Increasing health literacy
By generating clear, understandable, and personalized health information, LLMs have the potential to enhance patients' overall health literacy [36].
Despite this promise, caution is warranted. LLMs can produce inaccurate or misleading information (misinformation), lack nuanced understanding of individual patient circumstances, and fail to convey critical emotional and psychological dimensions inherent in patient care. The impersonal nature of chatbot-mediated communication cannot replicate the therapeutic benefits of direct human clinician-patient interactions. Thus, employing LLMs for patient communication necessitates careful oversight and rigorous validation of information for accuracy and appropriateness.
III. Technical Challenges and Ethical Considerations
Despite enthusiasm for using LLMs in medicine, their translation into safe and effective clinical applications is impeded by significant technical hurdles and profound ethical concerns (Table 2). These challenges often intersect and require integrated solutions.
1. Technical Challenges
1) Accuracy and reliability
Ensuring the accuracy and reliability of LLM outputs in clinical contexts—where errors carry severe consequences—is arguably the most critical technical barrier. LLMs are known to “hallucinate,” generating fluent, plausible-sounding content that is factually incorrect, unsupported by source data, or entirely fabricated [37–40]. If relied upon for clinical decisions, these inaccuracies directly threaten patient safety. Mitigation strategies currently under exploration include retrieval-augmented generation (RAG) to ground responses in external knowledge sources, improved uncertainty estimation and calibration, prompting strategies such as chain-of-thought, domain-specific fine-tuning, domain-informed safety guardrails, and robust hallucination detection methods [41]. A minimal sketch of the RAG approach follows.
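The sketch below illustrates the RAG pattern under stated assumptions: retrieval uses simple TF-IDF rather than the dense retrievers typical in production, the guideline passages are placeholders, and the model name is hypothetical.

```python
# Minimal sketch of retrieval-augmented generation (RAG): ground the model's
# answer in retrieved passages from a trusted corpus and require citations,
# which reduces (but does not eliminate) hallucination risk.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from openai import OpenAI

guideline_passages = [
    "Guideline A: first-line therapy for condition X is drug Y ...",
    "Guideline B: drug Y is contraindicated in renal impairment ...",
]
question = "What is first-line therapy for condition X in renal impairment?"

# Retrieve the passages most relevant to the question (toy TF-IDF retriever).
vectorizer = TfidfVectorizer().fit(guideline_passages + [question])
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(guideline_passages))[0]
top = [guideline_passages[i] for i in scores.argsort()[::-1][:2]]

client = OpenAI()
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "system",
               "content": "Answer ONLY from the provided sources; cite them. "
                          "If the sources are insufficient, say so."},
              {"role": "user",
               "content": "Sources:\n" + "\n".join(top) + f"\n\nQuestion: {question}"}],
    temperature=0,
)
print(answer.choices[0].message.content)
```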
Beyond complete fabrication, LLMs can produce outputs that contradict source data or established medical facts or omit critical details from summaries and analyses [16]. These errors also present substantial patient safety risks.
2) Validation and evaluation
Assessing the clinical utility and safety of LLMs is exceptionally challenging. Current validation methodologies often fall short.
Standard natural language processing (NLP) evaluation metrics—such as ROUGE and BLEU for summarization tasks, or accuracy metrics for question-answering—frequently correlate poorly with clinical relevance, factual correctness, or patient safety [42]. High benchmark scores do not necessarily equate to competence in real-world clinical reasoning. Therefore, there is an urgent need for standardized, clinically meaningful benchmarks and evaluation frameworks specifically designed for medical LLMs.
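A small worked example of this problem, using the rouge_score package: two synthetic summaries with opposite clinical meanings receive identical ROUGE-1 scores against the same reference.

```python
# Demonstrates why surface-overlap metrics can mislead in clinical settings.
from rouge_score import rouge_scorer

reference = "Start aspirin 81 mg daily."
candidate_safe = "Begin daily aspirin 81 mg."
candidate_unsafe = "Stop aspirin 81 mg daily."  # clinically opposite meaning

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
print(scorer.score(reference, candidate_safe)["rouge1"].fmeasure)    # 0.8
print(scorer.score(reference, candidate_unsafe)["rouge1"].fmeasure)  # 0.8
# Both candidates receive the same ROUGE-1 F1, even though the second
# reverses the instruction -- high overlap does not imply clinical safety.
```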
An additional challenge is the rapid evolution of models and reporting standards. The swift pace of LLM development complicates timely, rigorous evaluations, as models are continually updated, potentially altering their performance characteristics. Transparent reporting standards, such as the proposed TRIPOD-LLM and MI-CLEAR-LLM guidelines [43,44], are thus essential for proper appraisal and reproducibility of studies.
3) Data quality and privacy
Training and fine-tuning LLMs require extensive, high-quality data, the availability of which is often limited [2,3]. Protecting patient privacy is equally critical. Using patient data for model training can result in data memorization, leakage, or risks of re-identification [18]. Mitigation strategies include employing synthetic data, data anonymization, pseudonymization, differential privacy techniques, robust cybersecurity measures, model penetration testing, and implementing transparent patient consent processes [45,46].
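As a deliberately simplified sketch of the anonymization step, the rules below scrub a few identifier patterns from a synthetic note. Validated de-identification tools and human review are required in practice; this toy version would, for example, miss names entirely.

```python
# Simplistic rule-based de-identification before text is used for training.
# These regexes illustrate the idea only and would miss many identifiers.
import re

note = "John Smith (MRN 1234567) seen on 03/14/2024. Phone 555-867-5309."

patterns = {
    r"\bMRN\s*\d+\b": "MRN [ID]",              # record numbers
    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b": "[DATE]",  # dates
    r"\b\d{3}-\d{3}-\d{4}\b": "[PHONE]",       # phone numbers
}
scrubbed = note
for pattern, token in patterns.items():
    scrubbed = re.sub(pattern, token, scrubbed)
print(scrubbed)  # the patient name remains -- one reason regex alone is insufficient
```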
4) Integration and interoperability
Successful implementation of LLMs depends heavily on seamless integration with existing clinical workflows and healthcare IT infrastructures, particularly EHRs [11]. This presents a considerable technical challenge, requiring extensive customization and interface development. User-friendly interfaces are essential for widespread clinical adoption.
5) Transparency and explainability
LLMs generally function as “black boxes,” making it challenging to understand how specific outputs or suggestions are derived [47]. This lack of transparency discourages clinician trust, complicates error analysis and validation, and hinders informed acceptance. Developing and applying explainable AI (XAI) methods—such as SHAP [48], attention visualization [48], and integration with knowledge graphs [49]—tailored for medical LLMs is an active area of research, along with promoting transparent reporting practices.
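As one rough illustration, the sketch below inspects last-layer attention from the [CLS] token of a general-purpose BERT model. Attention weights are at best a weak proxy for explanation; this is an inspection aid rather than a validated XAI method, and the example sentence is synthetic.

```python
# Sketch: which tokens does the model attend to most for a clinical sentence?
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

text = "Patient denies chest pain but reports new shortness of breath."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Average attention from the [CLS] token over heads in the last layer,
# as a rough indication of which tokens the model weighted most.
attn = out.attentions[-1].mean(dim=1)[0, 0]  # shape: (seq_len,)
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
for t, a in sorted(zip(tokens, attn.tolist()), key=lambda x: -x[1])[:5]:
    print(f"{t:15s} {a:.3f}")
```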
2. Ethical Considerations
Deploying LLMs in clinical medicine raises profound ethical issues that require careful consideration and proactive management (Table 3).
1) Algorithmic bias and equity
A major ethical concern is that LLMs trained on datasets reflecting existing societal inequities may amplify biases in healthcare delivery [36]. This could result in poorer performance or unfair clinical recommendations for underrepresented racial or ethnic groups, gender groups, socioeconomic populations, or linguistic minorities. Linguistic bias is especially relevant, as most models are primarily trained on standard English, resulting in diminished performance for non-English speakers or users of non-standard dialects [50]. Mitigation requires intentional curation of diverse and representative training data, the development of bias detection and mitigation algorithms, validation across diverse demographic groups, and inclusive design and evaluation processes. Addressing bias is both ethically imperative and essential for clinical validity, as a biased tool cannot reliably serve all patients.
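A minimal sketch of the subgroup validation step follows: disaggregating a model's accuracy by demographic group to surface potential bias. The records are synthetic; real audits require adequate subgroup sample sizes and formal fairness metrics.

```python
# Subgroup performance audit on (group, model_correct) pairs -- in practice,
# these would be joined from model predictions and gold-standard labels.
from collections import defaultdict

results = [("group_a", True), ("group_a", True), ("group_a", False),
           ("group_b", False), ("group_b", False), ("group_b", True)]

by_group = defaultdict(list)
for group, correct in results:
    by_group[group].append(correct)

for group, outcomes in sorted(by_group.items()):
    accuracy = sum(outcomes) / len(outcomes)
    print(f"{group}: accuracy = {accuracy:.2f} (n = {len(outcomes)})")
# Large gaps between groups should trigger investigation before deployment.
```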
2) Patient safety and liability
LLMs’ potential to produce inaccurate information—including hallucinations, factual errors, or critical omissions—constitutes a direct risk to patient safety, potentially causing erroneous diagnoses or treatment delays. Determining responsibility or liability when an LLM contributes to clinical errors is complicated [51,52]. Questions arise about whether liability rests with the clinician using the tool, the developers, the implementing institution, or a combination thereof. The absence of definitive regulatory guidelines and established legal precedents for AI-related medical errors creates uncertainty and may impede implementation. Thus, explicit guidelines, clear accountability mechanisms, and possibly new legal frameworks are essential.
3) Data governance and consent
Using large volumes of patient data for LLM training and operation raises significant governance challenges [29,53]. Ethical data sourcing, patient privacy, data security, and obtaining valid informed consent for data use are critical concerns. Issues around data ownership and patient rights concerning data used by AI systems must be clearly addressed.
4) Transparency and trust
The inherently opaque “black box” nature of most LLMs undermines trust among clinicians and patients. Without the ability to understand or explain an LLM’s reasoning behind recommendations, clinicians are less likely to trust or confidently utilize these tools. This mistrust could result in delayed or incorrect clinical decisions, negatively impacting patient care [54]. Advancing explainability and promoting open reporting are crucial steps toward establishing trust.
5) Impact on clinician-patient relationship
Integrating LLMs into clinical interactions carries the risk of depersonalizing care, reducing clinician empathy, or negatively affecting direct communication if poorly implemented. Over-reliance on LLMs could potentially deskill clinicians or diminish their critical-thinking abilities. Conversely, by reducing administrative burdens, LLMs might enable clinicians to spend more meaningful time interacting directly with patients [55]. Careful consideration of workflow integration and human factors is required to ensure technological support enhances rather than detracts from therapeutic relationships.
6) Equity of access
Ensuring equitable access to the benefits of LLM technology across diverse patient populations, socioeconomic groups, and healthcare settings—including low-resource environments—is a significant ethical consideration [29,54]. The digital divide and disparities in technology access could exacerbate existing health inequities unless proactively addressed.
IV. Discussion
1. Risks and Benefits of LLMs in Medicine
LLMs represent a technology with significant potential to profoundly impact clinical medicine, primarily due to their unparalleled ability to process and generate language, granting access to the vast amounts of unstructured textual data available in healthcare [2,11]. Current evidence highlights promising applications, particularly in reducing the burden of clinical documentation through summarization, note generation, and information extraction. Additionally, LLMs show considerable potential for improving patient communication by simplifying complex information, powering informative chatbots, and optimizing various aspects of clinical workflows.
Nevertheless, initial optimism must be tempered by a realistic assessment of the technical and ethical barriers to widespread, secure, and effective clinical deployment. Accuracy concerns, particularly the tendency of these models toward hallucination, remain critical. Rigorous clinical validation is still lacking, and a notable gap persists between model performance on standardized benchmarks and real-world clinical decision-making. Furthermore, the “black box” nature of LLMs poses significant transparency and trust challenges, while ethical considerations surrounding bias, equity, privacy, and accountability require careful attention and proactive mitigation.
Therefore, integrating LLMs into clinical practice necessitates carefully balancing their potential benefits against inherent risks. Current technological capabilities and available evidence strongly indicate that LLMs should primarily serve as assistive tools augmenting clinician abilities rather than as autonomous entities intended to replace clinicians. Adopting a cautious, evidence-based approach is crucial, requiring robust human oversight across virtually all clinical applications and embracing a “physician-in-the-loop” paradigm, wherein clinicians retain ultimate responsibility for clinical decisions informed by LLM outputs.
Moving forward requires a collaborative, multi-stakeholder approach involving AI developers, clinicians, ethicists, regulators, healthcare institutions, and patients. Prioritizing patient safety, ensuring equity, maintaining transparency, and building trust should be guiding principles in the development and integration of these powerful technological tools.
2. Future Directions and Research Needs
Fully realizing the potential of LLMs in clinical medicine requires focused efforts in several key areas.
1) Rigorous validation
There is an urgent need for developing and adopting standardized, clinically meaningful validation methodologies and benchmarks. Such frameworks should accurately assess performance, safety, and real-world clinical utility, extending beyond traditional NLP evaluation metrics. Conducting prospective, randomized controlled trials comparing LLM-assisted workflows with conventional care is essential to establishing genuine clinical efficacy.
2) Technical advancement
Continued research is necessary to enhance model accuracy, minimize the frequency and consequences of hallucinations, improve robustness across diverse datasets and scenarios, and develop effective explainability techniques specifically tailored to clinical users. Further investigation into emerging ecosystems—in which multiple specialized AI agents (e.g., diagnostic agents, documentation agents, patient communication agents) collaborate by exchanging information and coordinating tasks within complex clinical workflows—is also warranted [56,57]. Active research areas additionally include integrating multimodal data [58,59] and extending LLM capabilities into physical systems [60].
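To give a sense of the agent-ecosystem idea, the toy sketch below shows an orchestrator routing tasks to specialized agents that could each wrap a different LLM. Agent names and routing logic are entirely illustrative; production systems would involve far richer coordination.

```python
# Toy sketch of multi-agent orchestration in a clinical workflow.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    handle: Callable[[str], str]

def documentation_agent(task: str) -> str:
    return f"[draft note for: {task}]"    # would call a summarization LLM

def diagnostic_agent(task: str) -> str:
    return f"[differential for: {task}]"  # would call a reasoning LLM

AGENTS = {
    "document": Agent("documentation", documentation_agent),
    "diagnose": Agent("diagnostic", diagnostic_agent),
}

def orchestrate(intent: str, task: str) -> str:
    # A production router might itself be an LLM classifying the request.
    agent = AGENTS[intent]
    return f"{agent.name}: {agent.handle(task)}"

print(orchestrate("diagnose", "65M with fever and new murmur"))
```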
3) Bias mitigation
Proactive strategies for identifying, measuring, and mitigating biases in training datasets and model outputs are crucial to ensure equitable performance across patient populations and prevent exacerbating existing health disparities.
4) Ethical and regulatory frameworks
Establishing clear ethical guidelines, robust regulatory oversight, and well-defined accountability structures is essential to govern the responsible development, deployment, and clinical use of LLM technologies in healthcare.
5) Human-AI collaboration
Research should prioritize optimizing human-AI interaction models, designing intuitive interfaces, and understanding how to best integrate LLMs into clinical workflows to effectively support clinicians rather than hinder their performance.
Notes
Conflict of Interest
No potential conflict of interest relevant to this article was reported.