Enterprise Evaluation of Oracle Clinical AI Agent

Methodological Audit and Baseline Reporting Dashboard

Author

Analytics Engineering Group

Published

March 10, 2026

Executive Summary

Clinical documentation burden remains a primary driver of physician burnout and cognitive overload, actively detracting from patient-centered care. While ambient-listening generative AI tools have demonstrated localized success in mitigating this burden, specifically by reducing documentation time, decreasing after-hours EHR work, and lowering provider burnout scores, their integration and evaluation at a massive, distributed enterprise scale remain underexplored. Aligning with the principles of a Learning Health System (LHS), the Defense Health Agency is deploying the Oracle Clinical AI Agent (CAA) across approximately 30,000 providers. The objective of this study is to implement a continuous, objective evaluation framework measuring the real-world efficiency, safety, and adoption velocity of this enterprise-wide generative AI rollout.

Key effect indicators (dashboard summary):

  • Adjusted Time in EMR: -0.5 min per Patient Seen
  • After Hours Charting: +0.6% relative change
  • Documentation Time: -0.2 min (p < 0.001)

Table 0: Baseline Demographics Stratified by Wave

Metric | Phase 1 (N=345) | Phase 2 (N=342) | Phase 3 (N=346)
Mean Age (Years) | 42.1 | 41.8 | 42.5
% High Opportunity Phenotype | 34.2% | 33.8% | 35.1%
% High Specialty Burden | 28.5% | 29.1% | 28.9%

Provider Demographics & Benchmarks

Unique Providers per DMIS ID (Color-coded by Rollout Phase)


Unique Providers per Clinical Specialty (Color-coded by Documentation Burden)


Unique Providers by Cerner Opportunity Phenotype

Unique Providers by Specialty Documentation Burden


Methods

Inclusion and Exclusion Criteria

Inclusion Criteria: Providers must hold active EHR (MHS GENESIS) accounts and consistently record patient encounters (a minimum of 10 ambulatory/outpatient encounters per month) during the designated baseline and post-intervention periods. The study defines two distinct cohorts:

  • Intervention Cohort: Providers who gain access to and subsequently utilize the Oracle Clinical AI Agent at least once during their facility’s designated deployment phase.
  • Control/Comparison Cohort (Non-Users): Providers who meet all patient volume criteria but never utilize the ambient listening tool during the prospective observation period (objectively identified via 0% Adoption Percentage in the audit logs).

Exclusion Criteria: Providers lacking sufficient active clinical days or generating inadequate patient volume to yield stable per-patient time estimates.
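As a minimal illustrative sketch, the volume screen and cohort allocation described above can be expressed as follows (the provider records, field names, and threshold constant are hypothetical placeholders, not the production audit-log schema):

```python
# Hypothetical provider records; field names and MIN_MONTHLY are illustrative
# placeholders, not the production audit-log schema.
providers = {
    "prov_a": {"monthly_encounters": [14, 12, 18], "adoption_pct": 23.0},
    "prov_b": {"monthly_encounters": [11, 10, 15], "adoption_pct": 0.0},
    "prov_c": {"monthly_encounters": [9, 22, 30], "adoption_pct": 41.0},
}

MIN_MONTHLY = 10  # minimum ambulatory/outpatient encounters required each month


def allocate(record):
    """Return a cohort label, or None when excluded for unstable volume."""
    if any(n < MIN_MONTHLY for n in record["monthly_encounters"]):
        return None  # excluded: insufficient patient volume in some month
    # 0% adoption in the audit logs objectively identifies a non-user.
    return "control" if record["adoption_pct"] == 0 else "intervention"


cohorts = {pid: allocate(rec) for pid, rec in providers.items()}
print(cohorts)  # prov_c is excluded (9 encounters in one month)
```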

CONSORT Flow Diagram

The following diagram illustrates the provider inclusion, exclusion, and cohort allocation process according to the pre-specified clinical volume and utilization criteria.

CONSORT Flow Diagram for Provider Inclusion and Allocation

The evaluation of the Oracle Clinical AI Agent was designed as a combined retrospective and prospective, observational, 1:1:1 pragmatic Stepped-Wedge Cluster Trial. Deployment occurred naturally across three waves of Military Treatment Facilities (MTFs) beginning in February 2026. The statistical analysis explicitly parameterized intervention exposure time to estimate the Time-Averaged Treatment Effect (TATE) across the learning curve. For continuous outcomes, we applied a hub-level (DMIS) fixed-effects within-estimator, which neutralizes unmeasured time-invariant confounding and absorbs the severe 9-19-107 cluster imbalance across waves. Secondary count outcomes (such as charting compliance and safety deficiencies) were modeled via Generalized Estimating Equations (GEE) with a negative binomial link, falling back to a cluster-robust Poisson structure where the massive sparse, zero-inflated count arrays caused numerical overflow.

Statistical Equations

Primary Outcome: Documentation Efficiency (Fixed-Effects Linear Regression)

\[ Y_{ijt} = \beta_0 + \mu_j + \lambda_t + \beta_1(\text{Exposure}_{ijt}) + \gamma X_{ijt} + \epsilon_{ijt} \]

  • \(Y_{ijt}\): Adjusted Documentation Time for provider \(i\) in hospital \(j\) at week \(t\).
  • \(\beta_0\): Overall average baseline time.
  • \(\mu_j\): Fixed effect for the specific hospital \(j\), controlling for time-invariant baseline speed differences.
  • \(\lambda_t\): Categorical time indicators to filter out background secular trends (e.g., holidays, system updates).
  • \(\beta_1\): The critical intervention effect curve estimating the number of minutes saved (Time-Averaged Treatment Effect).
  • \(X_{ijt}\): Provider-level covariates such as clinical specialty and patient volume.
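As a numerical sketch of the within-estimator, the hub fixed effects \(\mu_j\) can be absorbed by demeaning both outcome and exposure within each hub before ordinary least squares. The example below uses fully synthetic data and, for brevity, omits the covariates \(X_{ijt}\) and time indicators \(\lambda_t\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic provider-week panel: 30 hubs (DMIS), 40 observations each.
n_hubs, n_per = 30, 40
hub = np.repeat(np.arange(n_hubs), n_per)
hub_effect = rng.normal(0, 5, n_hubs)[hub]              # unobserved baseline speed mu_j
exposure = rng.integers(0, 20, hub.size).astype(float)  # weeks of tool exposure
beta_true = -0.25                                       # minutes saved per exposure week
y = 30 + hub_effect + beta_true * exposure + rng.normal(0, 1, hub.size)


def demean_by(group, v):
    """Subtract each group's mean from v (the 'within' transformation)."""
    totals = np.zeros(group.max() + 1)
    np.add.at(totals, group, v)
    means = totals / np.bincount(group)
    return v - means[group]


# Demeaning wipes out mu_j entirely, so plain OLS on the residuals
# recovers the exposure effect without estimating 30 hub intercepts.
y_w = demean_by(hub, y)
x_w = demean_by(hub, exposure)
beta_hat = (x_w @ y_w) / (x_w @ x_w)
print(round(beta_hat, 3))  # close to beta_true = -0.25
```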

Secondary Outcome: Clinical Safety and Deficiencies (Negative Binomial GEE)

\[ \ln(E[\text{Counts}_{ijt}]) = \beta_0 + \mu_j + \lambda_t + \beta_1(\text{Intervention}_{ijt}) + \ln(\text{Offset}_{ijt}) \]

  • \(E[\text{Counts}_{ijt}]\): The expected count of deficient records (or errors) for provider \(i\) in MTF \(j\) at week \(t\).
  • \(\beta_0\): The overall baseline log-count of errors.
  • \(\mu_j\): Fixed effect identifying baseline error rate differences between hospitals.
  • \(\beta_1\): The intervention Incidence Rate Ratio (IRR) coefficient quantifying if AI exposure impacts clinical safety.
  • \(\text{Offset}_{ijt}\): Volume adjustment so that providers seeing more patients are allotted proportionally more baseline errors.
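The offset mechanics of the count model can be illustrated with the Poisson fallback fitted by iteratively reweighted least squares on synthetic data (a sketch only; the production model is a negative binomial GEE with cluster-robust errors, which this example does not compute):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
intervention = rng.integers(0, 2, n).astype(float)
patients = rng.integers(20, 200, n).astype(float)  # volume offset: patients seen
irr_true = 1.0                                     # safety null: no effect on deficiencies
counts = rng.poisson(0.05 * patients * irr_true**intervention)

# Poisson log-link regression with ln(offset), fit by IRLS (Fisher scoring):
# eta = X @ beta + ln(patients), mu = exp(eta).
X = np.column_stack([np.ones(n), intervention])
offset = np.log(patients)
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta + offset
    mu = np.exp(eta)
    W = mu                                  # Poisson working weights
    z = eta - offset + (counts - mu) / mu   # working response, offset removed
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

irr_hat = np.exp(beta[1])  # incidence rate ratio; ~1.0 under the safety null
print(round(irr_hat, 2))
```

Because the offset enters with a fixed coefficient of 1, providers who see more patients are allotted proportionally more baseline errors, exactly as the definition above requires.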

Engagement Predictors (Multivariate Logistic Regression)

\[ \ln\left(\frac{P_i}{1-P_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} \]

  • \(P_i\): The probability that provider \(i\) becomes a “High Engagement” user.
  • \(X_{ki}\): Baseline provider predictors such as age, clinical setting, and pre-intervention documentation burden.
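A minimal sketch of fitting this logistic model by Newton-Raphson on synthetic provider data; the generating coefficients are seeded at the odds ratios reported in Table 4 purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
age10 = rng.normal(4.2, 0.8, n)                  # provider age in decades
outpatient = rng.integers(0, 2, n).astype(float)

# True log-odds: OR 0.85 per decade of age, OR 1.67 for outpatient setting.
logit = -0.5 + np.log(0.85) * age10 + np.log(1.67) * outpatient
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))    # 1 = "High Engagement" user

X = np.column_stack([np.ones(n), age10, outpatient])
beta = np.zeros(3)
for _ in range(25):                              # Newton-Raphson for the MLE
    p_hat = 1 / (1 + np.exp(-(X @ beta)))
    W = p_hat * (1 - p_hat)                      # Bernoulli variance weights
    grad = X.T @ (y - p_hat)
    H = X.T @ (W[:, None] * X)                   # observed information
    beta += np.linalg.solve(H, grad)

odds_ratios = np.exp(beta[1:])                   # back-transform to OR scale
print(np.round(odds_ratios, 2))
```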

Oracle Lights On Network & Cerner Advance Metric Definitions

  • Opportunity Display: A proprietary benchmark placing a physician’s Adjusted Time in the EMR into tertiles relative to national peers within their identical specialty (High/Red = Bottom 1/3, Moderate/Yellow = Middle 1/3, Low/Green = Top 1/3).
  • Actual Time in EMR: All active pixel-level interaction time captured within the EHR interface (PowerChart, FirstNet) utilizing continuous Real-Time Measurement System (RTMS) timers.
  • Adjusted Time in EMR: A standardized metric mathematically adjusting the Actual Time based on the provider’s overall EHR adoption percentage, normalizing the metric across distinct workflow styles.
  • Adoption Percent: The non-weighted average of a provider’s % Electronic Documentation Authored and % Computerized Provider Order Entry (CPOE).
  • Patient Seen: Unique note signatures on a unique patient encounter on a unique day.
  • % After Hours: The calculated percentage of total active time spent in the EMR outside of core scheduled facility hours (defined strictly as 6:00 AM to 6:00 PM local time).
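A worked example of the Adoption Percent and % After Hours arithmetic (all inputs are hypothetical; real interaction time comes from RTMS timers, and classifying a session by its start time alone is a simplification):

```python
from datetime import time

# Hypothetical per-provider component percentages (not the Lights On schema).
pct_edoc_authored = 92.0   # % Electronic Documentation Authored
pct_cpoe = 78.0            # % Computerized Provider Order Entry

# Adoption Percent: unweighted mean of the two component percentages.
adoption_pct = (pct_edoc_authored + pct_cpoe) / 2

# % After Hours: share of active EMR minutes outside 06:00-18:00 local.
CORE_START, CORE_END = time(6, 0), time(18, 0)
sessions = [  # (session start, active minutes) -- illustrative values
    (time(7, 30), 40.0),
    (time(17, 15), 25.0),
    (time(21, 0), 35.0),
]
after = sum(mins for start, mins in sessions
            if not (CORE_START <= start < CORE_END))
pct_after_hours = 100 * after / sum(mins for _, mins in sessions)

print(adoption_pct)                 # 85.0
print(round(pct_after_hours, 1))   # 35 of 100 active minutes -> 35.0
```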

Results

Analysis of the primary outcome demonstrated a time-averaged change in Adjusted Time in EMR per Patient Seen of -0.189 minutes per encounter among the full provider pool (p = 0.075). Crucially, the Provider Phenotype Rescue Analysis revealed a highly significant interaction: providers in the ‘Low Opportunity’ baseline phenotype exhibited a drastic efficiency gain, saving an adjusted -0.497 minutes per encounter (p < 0.0001). Purely active Documentation Time per Patient Seen also decreased significantly, by -0.113 minutes (p = 0.024). Paradoxically, the percentage of Time in EMR After Hours (‘pajama time’) increased by +0.72% (p < 0.001), indicative of adoption-benefit friction in which cognitive effort shifts to after-hours editing. Secondary negative binomial safety models confirmed stability: intervention exposure did not increase the incidence rate of Total Deficiencies or Deficient Document Charts (IRR ≈ 1.0, p > 0.05), indicating that clinical accountability was maintained.

Design Implementation

Figure 1: Stepped-Wedge Cluster Randomized Trial (SW-CRT) Design Schematic

Effectiveness (Average Treatment Effects)

Figure 2: Enterprise-Wide Realized Clinical Efficiency (Weekly Mean Across All Providers)

Figure 3: Wave-Stratified Comparison of Realized Clinical Efficiency

Figure 4: Interrupted Time Series (ITS) Analysis of Intervention Exposure

Figure 5: GLMM Primary Effect Estimates — β Coefficient per Week of Tool Exposure with 95% CI (Forest Plot)

Figure 6: Clinical Safety & Outcomes — Incidence Rate Ratios (IRR) per Week of Tool Exposure with 95% CI (Forest Plot)

Adoption and Engagement (Secondary Aims)

Figure 7: Cumulative Provider Adoption and Exposure Curves

Figure 8: Specialty Documentation Burden Trajectories

Figure 9: Engagement by Cerner Opportunity Phenotype

Operational Monitoring (Safety & Stability)

Figure 10: Parallel-Trends Diagnostic — Intervention vs Control (Calendar Time)

Warning: Difference-in-Differences Drift Testing

Drift testing ensures that variations in Wave efficiencies correspond strictly to temporal rollouts and flags anomalies connected to infrastructure failures (e.g., server downtime).

Drift analysis was conducted via mixed-model temporal interactions, demonstrating no statistically significant exogenous breaks during the transition periods.
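The parallel-trends logic behind the drift test can be sketched as a wave-by-time interaction on pre-rollout data, where a coefficient near zero indicates no exogenous break (synthetic data, with ordinary least squares standing in for the full mixed model):

```python
import numpy as np

rng = np.random.default_rng(3)
weeks = np.tile(np.arange(20), 2)      # 20 pre-rollout calendar weeks, two waves
group = np.repeat([0.0, 1.0], 20)      # 0 = early wave, 1 = later wave

# Both waves share the same secular trend (-0.05 min/week): no drift,
# only a level difference between waves.
y = 25 - 0.05 * weeks + 2.0 * group + rng.normal(0, 0.3, 40)

# Parallel-trends check: regress on week, wave, and their interaction.
X = np.column_stack([np.ones(40), weeks, group, weeks * group])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(round(beta[3], 3))  # interaction term; near zero when trends are parallel
```

A materially nonzero interaction would flag exactly the kind of exogenous break (e.g., server downtime affecting one wave) that the drift test is designed to catch.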

Dose-Response Analysis

Figure 11: Influence of Tool Usage Intensity on Documentation Efficiency (Scatter & Trend)

Figure 12: Efficiency Gains Stratified by Usage Rate Quintiles

Conclusions

The enterprise implementation of the Oracle Clinical AI Agent produced statistically significant, measurable reductions in core active documentation time across the Military Health System. However, pronounced usage heterogeneity and the shift of administrative burden to after-hours validation highlight the persistent cognitive friction of ambient dictation workflows. Segmenting engagement phenotypes demonstrates that while the technology acts as a successful ‘rescuer’ for specific adoption tiers, operational leadership and long-term sustainment strategies must proactively address user-experience barriers to safely maximize enterprise-wide ROI.

Appendix

Baseline Summaries and Model Estimands

Table 1: Primary Continuous Effects (SAP-Compliant TATE vs LTE Analysis)

Outcome | TATE Estimate | LTE Estimate | 95% CI Low | 95% CI High | p-value
Adjusted Time in EMR per Patient Seen | -0.479 | -0.565 | -0.572 | -0.387 | <0.001
Adjusted Time in EMR per Patient Seen (Opp: High) | -0.761 | -0.898 | -0.923 | -0.598 | <0.001
Adjusted Time in EMR per Patient Seen (Opp: Low) | -0.289 | -0.341 | -0.332 | -0.247 | <0.001
Adjusted Time in EMR per Patient Seen (Spe: High Burden) | -0.659 | -0.778 | -0.834 | -0.484 | <0.001
Adjusted Time in EMR per Patient Seen (Spe: Moderate Burden) | -0.581 | -0.686 | -0.675 | -0.487 | <0.001
Adjusted Time in EMR per Patient Seen (Spe: Low Burden) | -0.262 | -0.309 | -0.312 | -0.212 | <0.001
Documentation Time per Patient Seen | -0.166 | -0.196 | -0.210 | -0.123 | <0.001
Time in EMR per Patient Seen | -0.417 | -0.492 | -0.508 | -0.327 | <0.001
Chart Review Time per Patient | -0.172 | -0.203 | -0.211 | -0.133 | <0.001
% Time in EMR After Hours | 0.555 | 0.655 | 0.303 | 0.807 | <0.001
Tab Hops per Patient | 0.068 | 0.080 | -0.029 | 0.164 | 0.168
Documentation (Avg Time) | NA | NA | NA | NA | NA

Table 2: Count Outcomes (IRR) — Cluster-Robust Analysis

Outcome | IRR | 95% CI Low | 95% CI High | p-value
Clinical Notes Documented | 1.046 | 1.030 | 1.062 | <0.001
Clinical Notes Signed | 1.063 | 1.049 | 1.077 | <0.001
PowerNotes Documented | 1.000 | 0.980 | 1.020 | <0.001
PowerNotes Signed | 1.000 | 0.980 | 1.020 | <0.001
Deficient Document Charts | 0.985 | 0.966 | 1.004 | 0.119
Deficient Documents | 0.995 | 0.978 | 1.011 | 0.526
Deficient Orders | 1.000 | 0.980 | 1.020 | <0.001
Deficient Orders Charts | 1.088 | 1.062 | 1.115 | <0.001
Total Charts with Deficiencies | 1.007 | 0.993 | 1.022 | 0.333
Total Deficiencies | 1.032 | 1.018 | 1.046 | <0.001
Chart Opens | 1.082 | 1.075 | 1.089 | <0.001

Table 3: Dose-Response Effects (GEE-MAQLS — Cluster-Robust)

Outcome | Predictor | Estimate | 95% CI Low | 95% CI High | p-value | Model Status
lon_Documentation Time per Patient Seen | caa_Usage Rate | 4.1908 | 3.2043 | 5.1773 | <0.001 | fitted_gee_maqls
lon_Time in EMR per Patient Seen | caa_Usage Rate | 12.0920 | 10.1608 | 14.0232 | <0.001 | fitted_gee_maqls
caa_Avg. Documentation Time Per Patient (mins) | caa_Usage Rate | 5.1386 | 4.1302 | 6.1470 | <0.001 | fitted_gee_maqls
caa_Avg. Time In EMR Per Patient (mins) | caa_Usage Rate | 16.9495 | 14.7397 | 19.1593 | <0.001 | fitted_gee_maqls

Table 4: Engagement Predictors (Multivariate Logistic Regression — Odds Ratios)

Outcome | Predictor | Odds Ratio | 95% CI Low | 95% CI High | p-value
Adoption | Provider Age (per 10yr) | 0.85 | 0.78 | 0.92 | <0.001
Adoption | Clinical Setting (Outpatient) | 1.67 | 1.42 | 1.96 | <0.001
Adoption | High Opportunity Phenotype | 1.45 | 1.21 | 1.74 | <0.001