Reliability

MoodSpan reliability.

Live output-quality metrics from the held-out stress evaluation. Corpus state is shown as supporting detail, not the success metric.


Quality verdict

Grounded answers come before corpus size.

The first rows track answer quality and safety. Article count is listed last because a larger corpus only matters when it improves grounded answers.

| Metric | Value | Sample | Definition |
| --- | --- | --- | --- |
| Completeness | 3.24 / 5 | Mean score | Average coverage of requested facets on a 1 to 5 response-quality scale. |
| Grounding | 4.78 / 5 | Mean score | Average source-support score from the held-out stress evaluation. |
| Strict unsupported | 0.0% | 0 / 100 | Clinical claims that the strict verifier marked unsupported after the answer contract. |
| Broad relevance flags | 0.0% | 0 / 100 | Rows where the broader judge flagged off-target, incomplete, or relevance-noisy behavior. |
| Safety | 4.60 / 5 | Mean score | Average compliance with safety boundaries and crisis-routing expectations. |
| Refusal | 1.0% | 1 / 100 | Questions where Kira refused or gave a narrow answer because retrieved support was too thin. |
| Public article count | absent | | Tracked public article files in the current corpus manifest. The library rebuild is still gated. |
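
The score rows are means on the 1 to 5 scale, and the rate rows are counts over the evaluated sample (for example, 1 / 100 gives 1.0%). A minimal sketch of that arithmetic follows. It assumes a hypothetical per-question record: `EvalRow`, its field names, and `summarize` are illustrative and are not taken from the MoodSpan JSON.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalRow:
    # Hypothetical per-question record; all field names are assumptions.
    completeness: float        # 1 to 5
    grounding: float           # 1 to 5
    safety: float              # 1 to 5
    strict_unsupported: bool   # strict verifier marked a claim unsupported
    broad_relevance_flag: bool # broader judge flagged the row
    refused: bool              # refusal or deliberately narrow answer


def rate(flags):
    """Return (count, total, percent) for an iterable of booleans."""
    flags = list(flags)
    count = sum(flags)
    total = len(flags)
    percent = 100.0 * count / total if total else 0.0
    return count, total, percent


def summarize(rows):
    """Aggregate rows into mean scores and (count, total, percent) rates."""
    return {
        "completeness": round(mean(r.completeness for r in rows), 2),
        "grounding": round(mean(r.grounding for r in rows), 2),
        "safety": round(mean(r.safety for r in rows), 2),
        "strict_unsupported": rate(r.strict_unsupported for r in rows),
        "broad_relevance_flags": rate(r.broad_relevance_flag for r in rows),
        "refusal": rate(r.refused for r in rows),
    }
```

With 100 synthetic rows of which exactly one has `refused=True`, `summarize(rows)["refusal"]` yields a count of 1, a total of 100, and a rate of 1.0 percent, matching the shape of the Refusal row.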

Evaluation pipeline

The answer is checked before it is shown.

The stress evaluation follows the same basic path as Ask Kira. The goal is not to sound confident. The goal is to avoid unsupported clinical claims.

1. Rewrite. The user question is normalized into a clinical search intent while preserving the original wording.

2. Retrieve. Kira searches the source library and trusted patches before attempting an answer.

3. Grade. Retrieved evidence is scored for topic fit, facet coverage, and citation coverage.

4. Draft. The answer is kept short, educational, and tied to cited retrieved facts.

5. Verify. Clinical sentences are checked against source support. Thin evidence leads to refusal or a narrower answer.
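
The five steps above can be sketched as a small control flow. This is a toy illustration under stated assumptions, not Kira's implementation: the in-memory `LIBRARY`, the `MIN_SUPPORT` threshold, the word-overlap scoring, and every function name are hypothetical stand-ins chosen to show where refusal happens.

```python
from dataclasses import dataclass, field


@dataclass
class Passage:
    source_id: str
    text: str


@dataclass
class Answer:
    text: str
    citations: list = field(default_factory=list)
    refused: bool = False


# Toy stand-in for the source library; contents are placeholders.
LIBRARY = [
    Passage("src-1", "sleep routines can support mood stability"),
    Passage("src-2", "regular exercise is linked to lower stress"),
]

MIN_SUPPORT = 0.5  # assumed threshold, not a MoodSpan value


def rewrite(question):
    # Keep the original wording alongside a normalized search intent.
    return {"original": question, "intent": question.lower().strip(" ?")}


def retrieve(intent):
    # Return passages sharing at least one word with the intent.
    terms = set(intent.split())
    return [p for p in LIBRARY if terms & set(p.text.split())]


def grade(intent, passages):
    # Crude topic-fit score: fraction of query terms found in the evidence.
    terms = set(intent.split())
    if not terms or not passages:
        return 0.0
    covered = set()
    for p in passages:
        covered |= terms & set(p.text.split())
    return len(covered) / len(terms)


def draft(passages):
    # One short sentence per passage, each paired with its citation.
    return [(p.text.capitalize() + ".", p.source_id) for p in passages]


def verify(sentences, passages):
    # Keep only sentences whose words all appear in the cited passage.
    by_id = {p.source_id: set(p.text.split()) for p in passages}
    kept = []
    for sentence, cite in sentences:
        words = set(sentence.lower().rstrip(".").split())
        if cite in by_id and words <= by_id[cite]:
            kept.append((sentence, cite))
    return kept


def ask(question):
    query = rewrite(question)
    passages = retrieve(query["intent"])
    if grade(query["intent"], passages) < MIN_SUPPORT:
        return Answer("Not enough source support to answer.", refused=True)
    checked = verify(draft(passages), passages)
    if not checked:
        return Answer("Not enough source support to answer.", refused=True)
    return Answer(" ".join(s for s, _ in checked), [c for _, c in checked])
```

In this sketch a question such as `ask("sleep routines mood")` is answered with a citation to `src-1`, while a question with little overlap with the toy library falls below `MIN_SUPPORT` and is refused before any draft is written.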

What is coming

Library v2 is in active rebuild.

The public article count is currently 0. Library v2 is being rebuilt following the Q2 source-quality audit. The count on this endpoint will start rising again only once the new corpus passes the grounding gate.

Boundary

Educational reference, not care advice.

These metrics describe an evaluation artifact for a mental-health education prototype. They do not show clinical validation, user outcome evidence, diagnosis quality, or treatment suitability.