Reliability

MoodSpan reliability.

Live metrics from the held-out stress evaluation, with corpus state shown plainly.

| Metric | Value | Sample | Definition |
| --- | --- | --- | --- |
| Strict unsupported | 0.0% | 0 / 100 | Clinical claims that the strict verifier marked unsupported after the answer contract. |
| Broad relevance flags | 0.0% | 0 / 100 | Rows where the broader judge flagged off-target, incomplete, or relevance-noisy behavior. |
| Refusal | 1.0% | 1 / 100 | Questions where Kira refused or gave a narrow answer because retrieved support was too thin. |
| Completeness | 3.24 / 5 | Mean score | Average coverage of requested facets on a 1 to 5 response-quality scale. |
| Grounding | 4.78 / 5 | Mean score | Average source-support score from the held-out stress evaluation. |
| Safety | 4.60 / 5 | Mean score | Average compliance with safety boundaries and crisis-routing expectations. |
| Public article count | 0 | absent | Tracked public article files in the current corpus manifest. The library rebuild is still gated. |
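As a rough illustration of how the rate and mean-score rows above relate to their samples, the sketch below recomputes a percentage from a flagged count and a sample size, and averages per-row scores on the 1 to 5 scale. The names `RateMetric` and `mean_score` are hypothetical and are not taken from the MoodSpan codebase; the per-row scores behind the published means are not shown on this page, so none are reproduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RateMetric:
    """Hypothetical container for a count-based metric row."""
    name: str
    flagged: int  # rows that met the metric's condition
    total: int    # rows in the evaluation sample

    def percent(self) -> float:
        # Guard against an empty sample rather than dividing by zero.
        if self.total <= 0:
            raise ValueError("total must be positive")
        return 100.0 * self.flagged / self.total

    def display(self) -> str:
        # Matches the "value (sample)" shape used in the table.
        return f"{self.percent():.1f}% ({self.flagged} / {self.total})"


def mean_score(scores: Sequence[float]) -> float:
    """Average of per-row scores on a 1 to 5 scale, rounded to two decimals."""
    if not scores:
        raise ValueError("scores must not be empty")
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("scores must lie between 1 and 5")
    return round(sum(scores) / len(scores), 2)
```

With the counts from the table, `RateMetric("Refusal", 1, 100).display()` gives `1.0% (1 / 100)`.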

Evaluation pipeline

The answer is checked before it is shown.

The stress evaluation follows the same basic path as Ask Kira. The goal is not to sound confident. The goal is to avoid unsupported clinical claims.

01 Rewrite: The user question is normalized into a clinical search intent while preserving the original wording.

02 Retrieve: Kira searches the source library and trusted patches before attempting an answer.

03 Grade: Retrieved evidence is scored for topic fit, facet coverage, and citation coverage.

04 Draft: The answer is kept short, educational, and tied to cited retrieved facts.

05 Verify: Clinical sentences are checked against source support. Thin evidence leads to refusal or a narrower answer.
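The five stages above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not Kira's implementation: every name (`Passage`, `rewrite`, `retrieve`, `grade`, `draft`, `verify`, `ask`), the keyword-overlap scoring, the stopword list, and the `min_coverage` threshold are hypothetical stand-ins chosen only to show the control flow from question to checked answer or refusal.

```python
import re
from dataclasses import dataclass

# Hypothetical stopword list used only to keep the toy intent small.
STOPWORDS = {"what", "is", "a", "an", "the", "of", "for", "how", "does", "do", "to", "and"}


@dataclass(frozen=True)
class Passage:
    source_id: str
    text: str


def tokens(text):
    """Lowercased content words, with punctuation and stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def rewrite(question):
    # 01 Rewrite: derive a search intent but keep the original wording.
    return {"original": question, "intent": tokens(question)}


def retrieve(intent, library):
    # 02 Retrieve: keep passages that share at least one intent term.
    return [p for p in library if intent & tokens(p.text)]


def grade(intent, passages):
    # 03 Grade: fraction of intent terms covered by retrieved passages
    # (a stand-in for topic fit, facet coverage, and citation coverage).
    if not intent:
        return 0.0
    covered = set()
    for p in passages:
        covered |= intent & tokens(p.text)
    return len(covered) / len(intent)


def draft(passages, max_sentences=2):
    # 04 Draft: a short answer where each sentence carries its citation.
    return [(p.text, p.source_id) for p in passages[:max_sentences]]


def verify(drafted, passages):
    # 05 Verify: keep only sentences found in the source they cite.
    by_id = {p.source_id: p.text for p in passages}
    return [(s, sid) for s, sid in drafted if s in by_id.get(sid, "")]


def ask(question, library, min_coverage=0.5):
    query = rewrite(question)
    passages = retrieve(query["intent"], library)
    if grade(query["intent"], passages) < min_coverage:
        return {"status": "refused", "answer": []}  # evidence too thin
    checked = verify(draft(passages), passages)
    if not checked:
        return {"status": "refused", "answer": []}
    return {"status": "answered", "answer": checked}
```

In this sketch, refusal happens when graded coverage falls below the threshold, and a narrower answer results when the verify step drops sentences that their cited source does not contain.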

What is coming

Library v2 is in active rebuild.

The public article count is currently 0 while Library v2 is rebuilt following the Q2 source-quality audit. The live count on this endpoint will resume increasing only when the new corpus passes the grounding gate.

Boundary

Educational reference, not care advice.

These metrics describe an evaluation artifact for a mental-health education prototype. They do not show clinical validation, user outcome evidence, diagnosis quality, or treatment suitability.