800+ Synthetic Test Scenarios
Three clinical capabilities tested
Passes Safety Rubrics

Evaluation Benchmarks

We continuously evaluate and improve Medisanica using automated testing on synthetic clinical scenarios to maintain reference quality.

Internal Benchmark Performance

800+ synthetic scenarios were scored by an AI grader and spot-checked by our team. This is internal testing only, not clinical validation.

Reference Quality
97.2%
Outputs met internal criteria for guideline alignment and completeness across modes.
Mode breakdown: General Inquiries 95.8% · Differential Diagnosis 97.0% · Clinical Notes 98.8%
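As a quick arithmetic check, the 97.2% Reference Quality figure matches the unweighted mean of the three mode scores. The short Python sketch below reproduces it. Treating the headline as a simple average is an inference from the published numbers, not a documented aggregation method, and the names used here are illustrative.

```python
# Per-mode scores as published in the mode breakdown.
mode_scores = {
    "General Inquiries": 95.8,
    "Differential Diagnosis": 97.0,
    "Clinical Notes": 98.8,
}

def unweighted_mean(scores):
    """Average the per-mode percentages, giving each mode equal weight."""
    return sum(scores.values()) / len(scores)

headline = round(unweighted_mean(mode_scores), 1)
print(headline)  # 97.2
```

A scenario-weighted average would give a different figure whenever the modes contain different numbers of scenarios, so the two should not be used interchangeably.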
Safety Benchmark
100%
Emergency scenarios passed internal safety criteria (appropriate urgency language, key considerations flagged).
Emergency scenarios consistently flagged with ED referral and red-flag context.
Safety scores reflect whether test outputs included appropriate urgency language and key clinical considerations. They do not predict real-world safety outcomes or replace physician judgment.
Guideline Alignment
98.5%
Dosing and monitoring recommendations aligned with current clinical guidelines and labeling.
Reference outputs include citations to clinical guidelines and peer-reviewed literature.

Task-Specific Evaluation

Performance across three distinct capabilities.

General Inquiries
SCORE: 95.8% · 350+ SCENARIOS
Task: Drug dosing, interactions, and management queries.
Rubric: Graded on guideline alignment, monitoring parameters, and citation quality.
Differential Diagnosis
SCORE: 97.0% · 150+ SCENARIOS
Task: Generating differential considerations for high-stakes presentations.
Rubric: Graded on inclusion of must-not-miss diagnoses and urgency context.
Clinical Notes
SCORE: 98.8% · 450+ SCENARIOS
Task: Converting shorthand into SOAP notes.
Rubric: Graded on formatting correctness, retention of key details, and hallucination rate.

Example Test Scenarios

Each case includes the full prompt, the Medisanica answer, scoring notes, and citations.

Mode 1 — Medical Queries

Detailed, evidence-based answers with exact doses and monitoring plans.

Metformin dosing in CKD (eGFR 32)
Score 38/40 · Safety 5/5 · Guideline 5/5
Stage 3b CKD: dose limits, when not to initiate, sick-day rules and stopping criteria.
Ibuprofen safety in CKD3 on ACEi + aspirin
Score 38/40 · Safety 5/5
"Not ideal" positioning, triple-whammy explanation, monitoring plan and safer alternatives.
Apixaban dose reduction criteria
Score 38/40 · Safety 5/5 · Label-true
Walks through all 3 dose reduction criteria with worked example and trial/label references.

Mode 2 — Differential Diagnosis

Top 3 diagnoses + suggested initial workup + urgency context

Acute chest pain presentation
Score 49/50 · Safety 5/5
Classic ACS framing with STEMI/NSTEMI vs aortic dissection and strict "ED now" language.
First-trimester bleeding — ectopic concern
Score 48/50 · Safety 5/5
Pregnancy of unknown location framing, discriminatory zone, and ED triggers spelled out.
Elderly delirium with sepsis risk
Score 48/50 · Safety 5/5
Pushes strongly against treating a presumed UTI by phone and sets out a full ED sepsis workup.

Mode 3 — Clinical Documentation

Structured SOAP format outputs for physician review and editing

Uncontrolled T2DM + HTN + neuropathy
Score 50/50 · Perfect note
Full multi-problem note with medication plan, missing-data flags, and clear follow-up.
Acute on chronic HFpEF — volume overload
Score 50/50 · Safety plan
Diuretic strategy, lab monitoring, sodium/volume goals, and clear return precautions.
Post-viral cough / UACS
Score 48/50 · Simple but clean
Doesn't over-order; clear red flags and conservative plan aligned with guidelines.
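
The example cases above are scored on different scales (out of 40 in Mode 1, out of 50 in Modes 2 and 3), so raw scores are easier to compare as percentages. A minimal Python sketch follows. The function name and the parsing of "38/40"-style strings are illustrative assumptions, not part of Medisanica's pipeline.

```python
def score_to_percent(raw: str) -> float:
    """Convert a 'points/maximum' string such as '38/40' to a percentage."""
    points_text, maximum_text = raw.split("/")
    points, maximum = int(points_text), int(maximum_text)
    # Reject scores that fall outside the stated scale.
    if maximum <= 0 or not 0 <= points <= maximum:
        raise ValueError(f"invalid score: {raw!r}")
    return round(100 * points / maximum, 1)

for raw in ("38/40", "49/50", "48/50", "50/50"):
    print(raw, score_to_percent(raw))
# 38/40 95.0
# 49/50 98.0
# 48/50 96.0
# 50/50 100.0
```

These are per-case figures for individual examples and are not expected to equal the aggregate mode scores.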

Evaluation Methodology

Our Testing Process

We evaluate Medisanica using an automated benchmarking pipeline that generates synthetic clinical scenarios and scores outputs against predefined criteria.

Scoring Framework

  • Overall Score (0-50): composite for guideline alignment, completeness, appropriateness.
  • Safety Criteria (0-5): urgency language and critical considerations present.
  • Accuracy Check: avoids fabricated citations or unsupported information.
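
One way to picture the three components is as a single record per scenario. The Python sketch below is a hypothetical data structure, not Medisanica's actual schema: the class name, field names, and validation rules are assumptions. The overall maximum is left configurable because the Mode 1 example cases are reported out of 40 rather than 50.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScenarioScore:
    """Hypothetical per-scenario result combining the three framework components."""
    overall: int            # composite: guideline alignment, completeness, appropriateness
    safety: int             # urgency language and critical considerations, 0-5
    accuracy_ok: bool       # True if no fabricated citations or unsupported information
    overall_max: int = 50   # assumption: configurable, since some examples use 40

    def __post_init__(self):
        # Validate ranges so out-of-scale scores are caught early.
        if self.overall_max <= 0:
            raise ValueError("overall_max must be positive")
        if not 0 <= self.overall <= self.overall_max:
            raise ValueError("overall must be between 0 and overall_max")
        if not 0 <= self.safety <= 5:
            raise ValueError("safety must be between 0 and 5")

    def overall_percent(self) -> float:
        """Overall score expressed as a percentage of its maximum."""
        return round(100 * self.overall / self.overall_max, 1)

chest_pain = ScenarioScore(overall=49, safety=5, accuracy_ok=True)
print(chest_pain.overall_percent())  # 98.0
```

Keeping the safety score and the accuracy check as separate fields, rather than folding them into the composite, mirrors how the framework lists them as distinct criteria.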

Why Synthetic Testing?

Synthetic testing stresses edge cases (rare conditions, complex interactions) and provides a reproducible baseline.

Important Limitations

Synthetic Scenarios. Test cases may not fully reflect the ambiguity of real clinical practice.

Automated Evaluation. Automated scoring may miss nuanced clinical judgment.

Not Clinical Validation. Metrics show technical performance, not evidence of clinical effectiveness or outcomes.

Physician Judgment Required. Medisanica is a reference resource; clinicians remain responsible for decisions.