Reliability
MoodSpan reliability.
Live metrics from the held-out stress evaluation, with corpus state shown plainly.
| Metric | Value | Sample | Definition |
| --- | --- | --- | --- |
| Strict unsupported | 0.0% | 0 / 100 | Clinical claims that the strict verifier marked unsupported after the answer contract. |
| Broad relevance flags | 0.0% | 0 / 100 | Rows where the broader judge flagged off-target, incomplete, or relevance-noisy behavior. |
| Refusal | 1.0% | 1 / 100 | Questions where Kira refused or gave a narrow answer because retrieved support was too thin. |
| Completeness | 3.24 / 5 | Mean score | Average coverage of requested facets on a 1 to 5 response-quality scale. |
| Grounding | 4.78 / 5 | Mean score | Average source-support score from the held-out stress evaluation. |
| Safety | 4.60 / 5 | Mean score | Average compliance with safety boundaries and crisis-routing expectations. |
| Public article count | 0 | absent | Tracked public article files in the current corpus manifest. The library rebuild is still gated. |
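As a rough illustration of how the rate and mean-score figures above could be derived from per-question evaluation rows, here is a minimal Python sketch. The `EvalRow` record, its field names, and the `summarize` helper are all assumptions made for this example; the page does not describe the actual evaluation schema. The demo rows are synthetic placeholders, not the real evaluation data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalRow:
    # Hypothetical per-question record; field names are assumptions.
    strict_unsupported: bool  # strict verifier marked a claim unsupported
    broad_flag: bool          # broader judge raised a relevance flag
    refused: bool             # refusal or deliberately narrow answer
    completeness: float       # 1 to 5 scale
    grounding: float          # 1 to 5 scale
    safety: float             # 1 to 5 scale


def rate(rows, field):
    """Return (hits, sample size, percentage) for a boolean field."""
    hits = sum(1 for row in rows if getattr(row, field))
    return hits, len(rows), 100.0 * hits / len(rows)


def summarize(rows):
    """Aggregate rates and mean scores over a non-empty list of rows."""
    if not rows:
        raise ValueError("cannot summarize an empty evaluation set")
    return {
        "strict_unsupported": rate(rows, "strict_unsupported"),
        "broad_flags": rate(rows, "broad_flag"),
        "refusal": rate(rows, "refused"),
        "completeness": mean(row.completeness for row in rows),
        "grounding": mean(row.grounding for row in rows),
        "safety": mean(row.safety for row in rows),
    }


def format_rate(hits, total, pct):
    """Render a rate the way the table shows it, e.g. '1.0% (1 / 100)'."""
    return f"{pct:.1f}% ({hits} / {total})"


# Synthetic demo rows: 100 questions, exactly one refusal, flat scores.
demo_rows = [EvalRow(False, False, i == 0, 4.0, 4.0, 4.0) for i in range(100)]
summary = summarize(demo_rows)
print(format_rate(*summary["refusal"]))
```

Keeping the hit count and sample size next to each percentage mirrors the Sample column, so a reader can see that a 0.0% rate rests on 100 rows rather than on a larger set.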
Evaluation pipeline
The answer is checked before it is shown.
The stress evaluation follows the same basic path as Ask Kira. The goal is not to sound confident. The goal is to avoid unsupported clinical claims.
01 Rewrite: The user question is normalized into a clinical search intent while preserving the original wording.
02 Retrieve: Kira searches the source library and trusted patches before attempting an answer.
03 Grade: Retrieved evidence is scored for topic fit, facet coverage, and citation coverage.
04 Draft: The answer is kept short, educational, and tied to cited retrieved facts.
05 Verify: Clinical sentences are checked against source support. Thin evidence leads to refusal or a narrower answer.
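The five steps above can be sketched as a small Python pipeline. This is a toy under stated assumptions, not Kira's implementation: the in-memory `LIBRARY`, the keyword matching, the `min_coverage` threshold, and every function name are hypothetical stand-ins chosen only to show the control flow, in particular that thin evidence ends in a refusal rather than an unsupported answer.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source_id: str
    text: str


@dataclass
class Answer:
    text: str
    citations: list
    refused: bool = False


# Hypothetical stand-in for the source library; the text is placeholder content.
LIBRARY = [
    Passage("src-1", "placeholder note on sleep hygiene"),
    Passage("src-2", "placeholder note on stress routing"),
]


def rewrite(question):
    """01 Rewrite: derive search terms while keeping the original wording."""
    words = [w.strip("?.,!").lower() for w in question.split()]
    return {"original": question, "terms": [w for w in words if len(w) > 3]}


def retrieve(terms):
    """02 Retrieve: naive keyword lookup over the toy library."""
    return [p for p in LIBRARY if any(t in p.text for t in terms)]


def grade(passages, terms):
    """03 Grade: fraction of query terms covered by retrieved passages."""
    if not terms:
        return 0.0
    covered = sum(1 for t in terms if any(t in p.text for p in passages))
    return covered / len(terms)


def draft(passages):
    """04 Draft: one short claim per passage, each tied to its source."""
    return [(p.text, p.source_id) for p in passages]


def verify(claims, passages):
    """05 Verify: keep only claims whose text appears in the cited source."""
    by_id = {p.source_id: p.text for p in passages}
    return [(s, sid) for s, sid in claims if s in by_id.get(sid, "")]


def answer(question, min_coverage=0.5):
    """Run all five steps; refuse when evidence is too thin."""
    intent = rewrite(question)
    passages = retrieve(intent["terms"])
    if grade(passages, intent["terms"]) < min_coverage:
        return Answer(text="", citations=[], refused=True)
    supported = verify(draft(passages), passages)
    if not supported:
        return Answer(text="", citations=[], refused=True)
    return Answer(
        text=" ".join(s for s, _ in supported),
        citations=[sid for _, sid in supported],
    )


supported_answer = answer("sleep hygiene?")
thin_answer = answer("What about quantum gardening?")
print(supported_answer.citations, thin_answer.refused)
```

In this sketch the refusal path is checked twice, once after grading and once after verification, which reflects the idea that an answer is checked before it is shown. A real system would replace the substring checks with proper retrieval and claim-support scoring.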
What is coming
Library v2 is in active rebuild.
The public article count is currently 0 while Library v2 is rebuilt following the Q2 source-quality audit. The count on this endpoint will start rising again only once the new corpus passes the grounding gate.
Boundary
Educational reference, not care advice.
These metrics describe an evaluation artifact for a mental-health education prototype. They do not show clinical validation, user outcome evidence, diagnosis quality, or treatment suitability.