-
On-device PII / PHI redaction
✓ shipped Strips names, emails, phone numbers, addresses, URLs, and card numbers from user input before it leaves the device. Each match swaps to a category token like [NAME] or [EMAIL] so the model still understands the prompt structure but never sees the values. A stricter mode adds organization names for employer-sensitive verticals.
Shipped May 2026. Three detection layers (Apple's NaturalLanguage NER + structured-data detectors + regex) combined with overlap resolution. Same shield-icon footer as image redaction — one consistent privacy commitment surface across text and images. Vertical conformer hooks (health, finance) are designed-in for follow-ups.
-
Prompt-injection scrubber
✓ shipped Sanitises user-supplied context (clipboard, email body, OCR'd image, web page) before it joins the prompt. Detects override patterns ("ignore previous"), role confusion, system tag injection, prompt-leak probes, suspicious markdown and zero-width characters. Two-layer mitigation: a defensive prefix plus an "untrusted" wrapper the model is asked to treat as data.
Shipped May 2026. 14 detection rules across 5 categories plus a zero-width-character sweep. Detection surfaces an orange warning banner so the user knows the model saw flagged content; the wrap is best-effort prompt engineering whose effectiveness scales with the model's boundary-tag training. Hard problem — no clean fix exists; defence-in-depth, not silver bullet. Verified end-to-end: an injection-style prompt split the 6-model Council three ways (refused / acknowledged-and-complied-anyway / just-complied), and the trust signal correctly flagged the divergence.
-
Output filter hooks
✓ shipped Pluggable filters between model output and UI render: secret patterns (API keys, tokens) today; toxic content, leaked system prompts, and hallucinated PII as follow-ups. Fails open by default; app opts in to each filter.
Shipped May 2026. Eleven high-confidence credential patterns covered today — AWS, GitHub, OpenAI, Anthropic, Google, Stripe, Slack, JWTs, and PEM private keys. Filter runs at the end of generation; mid-stream is skipped to avoid partial-match false-positives. Same shield-icon footer as input redaction; consistent privacy commitment across input AND output.
-
Multi-model agreement-as-trust
✓ shipped Re-purpose Council fan-out as a safety check: when N models disagree on a factual answer, flag as low-trust before display. Reuses existing harness — agreement is a security signal.
Shipped May 2026. A trust badge appears whenever the model fan-out diverges or hedges, with a one-line advice line so the user knows what to verify. Extended later in May with vision-specific signals for image fan-outs.