Seoul National University Hospital and Harvard Medical School Unveil the World’s First “Virtual Hospital Simulator” for Real-World Evaluation of Medical AI
- Presents 'dynamic evaluation' that goes beyond existing static, diagnosis-centric AI evaluations to reflect the impact of AI decisions in actual clinical settings
- By simultaneously measuring both “patient outcomes” and “hospital operational efficiency,” the dual-metric approach presents a blueprint for the commercialization of medical AI

[Figure 1] Paradigm of a clinical environment simulator (CES): simultaneously evaluating the ripple effects of medical AI decisions within a virtual hospital on ‘patient prognosis’ and ‘hospital operational efficiency’.
Moving beyond fragmented evaluations of medical AI that rely solely on written-exam-style testing with historical data, researchers have, for the first time, introduced a model that validates AI performance in a virtual hospital replicating real-world clinical environments. By verifying in advance the cascading consequences of AI-driven medical decisions, such as patient deterioration or depletion of hospital resources, the framework establishes a preclinical gateway for rigorously testing AI safety without risking actual patients.
A joint research team of Research Professor Seong-Eun Kim of the Specialized Research Center at Seoul National University Hospital and researchers from Harvard Medical School has announced the ‘Clinical Environment Simulator (CES),’ which dynamically evaluates medical AI based on large language models (LLMs). The research on this digital virtual hospital evaluation system was published in the latest online edition of the international academic journal ‘Nature Medicine’ (IF 50).
Conventional medical AI evaluations have relied on static historical datasets, limiting their ability to capture the cascading effects of clinical decision-making in real healthcare settings. In practice, patient conditions continuously evolve, while medical orders directly affect the consumption of finite hospital resources. Existing evaluation methods were unable to assess these temporal and systemic interdependencies. The researchers therefore reasoned that, just as pilots train in flight simulators, medical AI should also be evaluated under dynamic conditions involving the passage of time and resource constraints.
To implement this, the research team synchronized two core engines. First, the ‘Patient Engine’ simulates changes in the patient’s condition by having the LLM dynamically generate various virtual paths of symptoms and treatment responses based on disease trajectory templates defined by specialists and initial patient data from actual electronic medical records.
Working in parallel, the ‘Hospital Engine’ replicates the hospital's step-by-step workflow based on actual hospital time data, tracking the status of beds, medical staff, and equipment in real time. When a blood test order is issued, the necessary medical personnel are assigned to each step in sequence according to the actual time required, and a priority system allocates resources to critically ill patients first.
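The priority rule described for the Hospital Engine can be illustrated with a minimal sketch. This is not the CES implementation: the names `Order` and `schedule`, the acuity scale (lower number means more critical), and the single-step orders are all assumptions made for illustration, and the real engine is described as handling multi-step workflows across beds, staff, and equipment.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Order:
    """A hypothetical test order; sorts by acuity, then by arrival time."""
    acuity: int        # assumption: 1 = critical, larger = more stable
    arrival_min: int   # minute at which the order is issued
    patient: str = field(compare=False)
    duration_min: int = field(compare=False)


def schedule(orders, staff_count):
    """Assign each order to the next free staff member, most critical first.

    Returns a list of (patient, start_minute, end_minute) tuples.
    """
    pending = sorted(orders, key=lambda o: o.arrival_min)
    waiting = []                   # heap of orders that have already arrived
    free_at = [0] * staff_count    # heap of times at which each staff member is free
    heapq.heapify(free_at)
    timeline = []
    i = 0
    while i < len(pending) or waiting:
        now = heapq.heappop(free_at)
        # If nobody is waiting, the staff member idles until the next arrival.
        if not waiting and pending[i].arrival_min > now:
            now = pending[i].arrival_min
        # Everything that has arrived by now joins the waiting queue.
        while i < len(pending) and pending[i].arrival_min <= now:
            heapq.heappush(waiting, pending[i])
            i += 1
        order = heapq.heappop(waiting)   # most critical waiting order
        end = now + order.duration_min
        timeline.append((order.patient, now, end))
        heapq.heappush(free_at, end)
    return timeline


if __name__ == "__main__":
    demo = [
        Order(acuity=3, arrival_min=0, patient="A", duration_min=10),
        Order(acuity=3, arrival_min=2, patient="B", duration_min=10),
        Order(acuity=1, arrival_min=5, patient="C", duration_min=10),
    ]
    # With one staff member, critical patient C overtakes B, who arrived earlier.
    print(schedule(demo, staff_count=1))
    # [('A', 0, 10), ('C', 10, 20), ('B', 20, 30)]
```

In this toy run, patient B waits longer because patient C is more critical, which is the kind of knock-on delay the simulator is said to track.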

[Figure 2] Dynamic changes in patient status over time: Simulation of the ripple effect showing how the prognosis of a critically ill patient branches into three paths (green: stabilization, blue and orange: deterioration) depending on the timing of medical AI intervention.
Within this virtual hospital, crisis scenarios are vividly simulated depending on the timing of AI intervention. For example, if the AI delays ordering diagnostic tests, a patient with initially stable chest pain may deteriorate into an acute myocardial infarction. Likewise, when the AI prioritizes scarce resources such as CT scanners for a critically ill emergency patient, realistic bottlenecks emerge as waiting times increase for other patients. In other words, the simulator recreates a real hospital environment in which a single AI-driven decision can determine not only the survival of an individual patient, but also trigger cascading consequences by depleting remaining hospital resources and sequentially limiting treatment opportunities for subsequent patients.
Every decision made by the AI is evaluated using a “dual-metric composite score” that integrates two dimensions: ▲Patient prognosis, including survival, treatment timeliness, and guideline adherence; and ▲Hospital operational efficiency, including total length of stay, emergency department throughput, and utilization of beds and medical equipment. The framework rewards decisions that improve patient care without undermining hospital operations, while penalizing approaches that excessively concentrate resources on a single patient at the expense of others’ access to care, thereby enforcing a strict balance between individual optimization and system-wide efficiency. Furthermore, the simulator conducts adversarial stress tests under extreme conditions, including system-wide network failures and simultaneous emergency cases.
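The release does not give the formula behind the dual-metric composite score, so the following is only a sketch of how two such dimensions might be combined. The class names, the assumption that every component is already normalized to the range 0 to 1 (with 1 being best), the simple averaging, and the equal default weighting are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PatientMetrics:
    """Patient prognosis components, each assumed normalized to [0, 1]."""
    survival: float
    timeliness: float
    guideline_adherence: float


@dataclass
class OperationsMetrics:
    """Operational efficiency components, each assumed normalized to [0, 1]."""
    length_of_stay: float        # 1.0 = shortest acceptable stay
    ed_throughput: float
    resource_utilization: float


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def composite_score(patient, ops, patient_weight=0.5):
    """Weighted blend of the two dimensions (hypothetical formula)."""
    if not 0.0 <= patient_weight <= 1.0:
        raise ValueError("patient_weight must be between 0 and 1")
    patient_score = _mean(
        [patient.survival, patient.timeliness, patient.guideline_adherence]
    )
    ops_score = _mean(
        [ops.length_of_stay, ops.ed_throughput, ops.resource_utilization]
    )
    return patient_weight * patient_score + (1.0 - patient_weight) * ops_score


if __name__ == "__main__":
    p = PatientMetrics(survival=1.0, timeliness=0.5, guideline_adherence=0.75)
    o = OperationsMetrics(length_of_stay=0.5, ed_throughput=0.5,
                          resource_utilization=0.5)
    print(composite_score(p, o))  # 0.625
```

A plain weighted average lets a strong score on one dimension offset a weak one on the other; a scoring rule meant to penalize over-concentration of resources, as the framework is described as doing, would likely need a stricter combination than this sketch uses.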
The central significance of this study lies in providing a “risk-free preclinical testing environment” that validates the safety of medical AI systems without exposing real patients to harm. When AI that has undergone such thorough verification becomes the digital agent for medical staff and takes charge of complex system operations, it is expected that doctors will finally be able to take their eyes off the monitor, return to the patient's side, and fully concentrate on their fundamental roles of empathy and judgment.
Research Professor Seong-Eun Kim (co-first author) emphasized, “Virtual hospitals cannot perfectly predict the complex physiological responses of the human body,” but added, “This study will be the most valuable next step in verifying that medical AI goes beyond being a tool for solving fragmentary problems and is fully integrated into a dynamic medical system to provide practical assistance.”

[Table] Performance comparison between existing medical AI evaluation models and the virtual hospital simulator (CES): CES is the only evaluation framework that extends conventional approaches by simultaneously incorporating three critical dimensions: ▲time-dependent patient state evolution, ▲real-time hospital resource tracking, and ▲crisis scenario stress testing.

[Photo] Research Professor Seong-Eun Kim, Specialized Research Center, Biomedical Research Institute, Seoul National University Hospital.