Labs

The R&D queue.

What the lab is actively poking at. We bias toward layers the frontier model SDKs structurally can't ship — device context, cross-provider routing, mobile-side eval — and let them absorb the rest. Some pieces are already in production inside the Cuppa app or Cuppa Harness; some are under evaluation; some are on the list and not yet started. The freeform sandbox lives at mycuppa.io; this page is the curated index.

✓ shipped ⚠ evaluating ◐ wishlist


Inference

On-device inference

Running models on the phone is now real. Battery, thermal, and memory budget are the new latency.

  • On-device LLM

    ✓ shipped

    Shipping in iOS 26+. ~7s/post on iPhone 15 Pro, 100% schema-valid with permissive decoding.

    Powers the Cuppa app today.

  • MLX-Swift

    ⚠ evaluating

    Apple Silicon ML runtime with a Swift surface. Path to running Gemma 2/3 and similar locally without Python.

    Concurrent-model memory budget on iPhone 15/16 is the open question.

  • WhisperKit

    ◐ wishlist

    Whisper inference on Apple Silicon via Core ML. Voice-in for Posts and Studio narration.

  • Quantised CPU/GPU inference. Useful as a fallback path when MLX or Foundation Models aren't available.

Repair

Output repair & structured generation

Frontier models are absorbing this layer fast. We keep the shipped repair primitives as a safety net for on-device and older models, but have stopped deepening the cloud-side parsing work.

  • JSON candidate ladders

    ✓ shipped

    Try raw → fence-stripped → brace-extracted → reasoning-tag-stripped → truncation-repaired. Took us from ~60% to 100% schema-valid output.

    Lifted into Cuppa Harness as OutputRepair. Maintained as a fallback for Foundation Models; not deepening for cloud providers.
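    The ladder is mechanical enough to sketch in a few lines. A Python illustration of the same idea (the shipped version is the Swift OutputRepair type; all names here are hypothetical):

    ```python
    import json
    import re

    def repair_candidates(raw: str):
        """Yield progressively more aggressive repairs of a model reply."""
        yield raw                                                    # 1. raw
        yield re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())   # 2. fence-stripped
        start, end = raw.find("{"), raw.rfind("}")
        if start != -1 and end > start:
            yield raw[start:end + 1]                                 # 3. brace-extracted
        yield re.sub(r"<think>.*?</think>", "", raw, flags=re.S)     # 4. reasoning tags stripped
        # 5. truncation-repaired: close an open string, then unbalanced braces
        candidate = raw[raw.find("{"):] if "{" in raw else raw
        if candidate.count('"') % 2:
            candidate += '"'
        candidate += "}" * max(0, candidate.count("{") - candidate.count("}"))
        yield candidate

    def parse_first_valid(raw: str):
        """Walk the ladder; return the first candidate that parses."""
        for candidate in repair_candidates(raw):
            try:
                return json.loads(candidate)
            except json.JSONDecodeError:
                continue
        return None
    ```

    Each rung only fires when the cheaper ones fail, so well-behaved output pays nothing.
    
    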

  • Partial-JSON streaming parsers

    ◐ wishlist

    Parse incomplete JSON as it streams in so the UI can render fields as they arrive. Client-side concern, durable even as models improve.

    Stays on the list because it's a UI problem, not a model problem.
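    A minimal sketch of the idea in Python (illustrative only; a production parser would work incrementally instead of re-closing the buffer on every delta):

    ```python
    import json
    import re

    def parse_partial(buffer: str):
        """Best-effort parse of in-flight JSON so the UI can render fields
        as they arrive: close an open string, drop a dangling key or
        trailing comma, close unbalanced containers, then try loads."""
        text, stack, in_string, escape = buffer.strip(), [], False, False
        for ch in text:
            if escape:
                escape = False
            elif ch == "\\" and in_string:
                escape = True
            elif ch == '"':
                in_string = not in_string
            elif not in_string:
                if ch in "{[":
                    stack.append("}" if ch == "{" else "]")
                elif ch in "}]" and stack:
                    stack.pop()
        if in_string:
            text += '"'                                       # close the string mid-token
        text = re.sub(r',\s*("[^"]*"\s*(:\s*)?)?$', "", text)  # trailing comma / dangling key
        text += "".join(reversed(stack))                      # close containers innermost-first
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            return None
    ```

    Called on every streamed delta, this lets the UI show `title` while `body` is still arriving.
    
    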

Memory

Memory & context engineering

Models forget. The harness layer is what makes them appear to remember.

  • USearch

    ⚠ evaluating

    Tiny, embeddable vector index. Runs on iOS. Candidate for on-device retrieval.

  • ObjectBox

    ◐ wishlist

    On-device vector + object DB with Swift bindings. Heavier than USearch; useful when you also want structured storage.

  • Conversation buffer + summarisation rolloff

    ✓ shipped

    Keep last N exchanges verbatim, summarise the rest into a rolling system note. Simple, durable, ships in Cuppa Harness.

    See the memory-patterns surface in the Harness extraction.
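    The pattern fits in one function. A Python sketch (the shipped version is Swift in Cuppa Harness; names here are hypothetical, and `summarise` stands in for whatever model call produces the summary):

    ```python
    def roll_context(history, summarise, keep_last=6):
        """Keep the last `keep_last` exchanges verbatim; fold everything
        older into a single rolling system note."""
        if len(history) <= keep_last:
            return history
        older, recent = history[:-keep_last], history[-keep_last:]
        note = {
            "role": "system",
            "content": "Earlier conversation: " + summarise(older),
        }
        return [note] + recent
    ```

    Run before every model call, the prompt stays bounded no matter how long the thread gets.
    
    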

  • Semantic compression

    ◐ wishlist

    Summarise long context into structured slots (entities, decisions, open questions) instead of free-text summary. Less lossy on retrieval.

Orchestration

Agent orchestration

Where the harness layer keeps earning its keep: cross-provider fan-out, judging, routing, normalisation. The model itself can't see what other models said — coordination is structurally outside it.

  • Fan-out + moderator-as-judge

    ✓ shipped

    N providers reply in parallel; a moderator model summarises agreement / disagreement / factual flags. Powers the Cuppa app's Council mode.

    Lifted into Cuppa Harness.
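    The shape of the fan-out, sketched in Python with asyncio (the shipped version is Swift; `providers` and `moderator` are hypothetical stand-ins for the harness's agent calls):

    ```python
    import asyncio

    async def council(post, providers, moderator):
        """Fan a Post out to N providers in parallel, then have a moderator
        model compare the replies. Provider failures come back as
        exceptions and are simply dropped from the moderator's view."""
        names = list(providers)
        replies = await asyncio.gather(
            *(providers[n](post) for n in names), return_exceptions=True
        )
        answers = {
            n: r for n, r in zip(names, replies) if not isinstance(r, Exception)
        }
        verdict = await moderator(post, answers)
        return answers, verdict
    ```

    `return_exceptions=True` is the load-bearing detail: one provider timing out must not sink the whole Council.
    
    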

  • Smart query routing

    ⚠ evaluating

    Lightweight classifier routes a Post to the right model (or set of models) before fan-out. Cheaper, faster, and reduces moderator load.
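    A toy version of the routing step, in Python (purely illustrative: a real router would be a small classifier model, and the provider names below are invented):

    ```python
    def route_post(post: str) -> list[str]:
        """Pick which providers a Post fans out to before the Council runs."""
        text = post.lower()
        if any(k in text for k in ("code", "function", "bug", "stack trace")):
            return ["code-specialist"]
        if any(k in text for k in ("today", "news", "price", "weather")):
            return ["search-grounded"]
        return ["generalist-a", "generalist-b"]  # ambiguous: keep the fan-out
    ```

    The win is in the default branch: only ambiguous Posts pay for the full fan-out and moderator pass.
    
    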

  • Tool-call protocols across providers

    ⚠ evaluating

    Each provider has its own tool-call shape. Normalising them without leaking provider quirks. The cross-provider gap widens with each new dialect — durable because there's no single SDK to absorb it.

    Promoted from wishlist 2026-05-07 — this is the orchestration bet.
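    The flavour of the normalisation problem, sketched in Python against the two best-known dialects (shapes as documented in the OpenAI and Anthropic chat APIs at time of writing; the internal target shape is an assumption):

    ```python
    import json

    def normalise_tool_call(provider: str, call: dict) -> dict:
        """Map provider-specific tool-call shapes onto one internal form:
        {"id", "name", "arguments": dict}. New dialects get a new branch."""
        if provider == "openai":
            # OpenAI delivers arguments as a JSON *string*
            return {
                "id": call["id"],
                "name": call["function"]["name"],
                "arguments": json.loads(call["function"]["arguments"]),
            }
        if provider == "anthropic":
            # Anthropic's tool_use block delivers input already parsed
            return {
                "id": call["id"],
                "name": call["name"],
                "arguments": call["input"],
            }
        raise ValueError(f"unknown tool-call dialect: {provider}")
    ```

    The string-vs-parsed arguments split is exactly the kind of quirk that must not leak past this function.
    
    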

Device

Device-context-aware harness

The data the model can't see: battery, thermal, network quality, foreground/background lifecycle. The piece of the harness that frontier SDKs structurally can't ship from the cloud — and the layer we're betting on most.

  • Battery- and thermal-aware routing

    ✓ shipped

    Drop to local Foundation Models when the phone is hot or low. Defer expensive Council fan-outs until charging. Cloud SDKs literally can't see this signal.

    Shipped 2026-05-08 as Phase 4+ B1 — RoutePolicy + DeviceState (ProcessInfo thermal + UIDevice battery + Low Power Mode); BatteryAndThermalRoutePolicy collapses fan-out to on-device when constrained, with a one-line banner on the post explaining why.
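    The policy itself is a few conditions over a device snapshot. A Python sketch of the BatteryAndThermalRoutePolicy idea (illustrative; production is Swift over ProcessInfo/UIDevice, and the thresholds here are assumptions):

    ```python
    from dataclasses import dataclass

    @dataclass
    class DeviceState:
        thermal: str          # "nominal" | "fair" | "serious" | "critical"
        battery_level: float  # 0.0 - 1.0
        low_power_mode: bool
        charging: bool

    def route(state: DeviceState, requested: list[str]) -> list[str]:
        """Collapse a multi-provider fan-out to the on-device model
        when the phone is hot or low."""
        constrained = (
            state.thermal in ("serious", "critical")
            or (state.battery_level < 0.2 and not state.charging)
            or state.low_power_mode
        )
        return ["on-device"] if constrained else requested
    ```

    The one-line banner on the post falls straight out of `constrained`: if it fired, tell the user why the Council shrank.
    
    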

  • Network-quality-aware streaming

    ✓ shipped

    Switch streaming on/off, pre-warm on Wi-Fi, queue on cellular. This often matters more for user-perceived latency than model choice does.

    Shipped 2026-05-09 as Phase 4+ B2 (split into B2a native SSE for Claude + B2b NetworkState/StreamPolicy gating). DefaultStreamPolicy streams on Wi-Fi/wired, suppresses on cellular/Low Data Mode/offline; ReplyStatus.streaming carries live deltas with a caret.
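    The decision table is tiny. A Python sketch of the DefaultStreamPolicy idea (illustrative; production is Swift, and the interface labels are assumptions):

    ```python
    def stream_decision(interface: str, low_data_mode: bool) -> str:
        """Decide how a reply is delivered: live token stream, one-shot
        buffered response, or queued for later."""
        if interface == "offline":
            return "queue"                 # persist the Post, send on reconnect
        if low_data_mode or interface == "cellular":
            return "buffered"              # suppress the token stream
        return "stream"                    # Wi-Fi / wired: live deltas
    ```

    Keeping it a pure function makes the policy trivially testable and swappable per app.
    
    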

  • Foreground/background lifecycle hooks

    ✓ shipped

    Pause cloud calls, persist partial results, resume on return. iOS-specific; the gap between "demo works" and "ships."

    Shipped 2026-05-07 as Phase 4+ A2 — LifecyclePolicy protocol with PauseDecision/ResumeDecision, scene-phase wiring, lifecycleState snapshot persisted via the existing ConversationStore.
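    The pause/persist/resume loop, sketched in Python (illustrative; the shipped LifecyclePolicy is Swift wired to scene phases, and `store` here is a hypothetical stand-in for ConversationStore):

    ```python
    class LifecycleGate:
        """Accumulate streamed deltas; persist partials on background,
        reload and report what needs resuming on foreground."""

        def __init__(self, store):
            self.store = store      # anything with save(dict) / load()
            self.partial = {}       # post id -> partial reply text

        def on_delta(self, post_id, delta):
            self.partial[post_id] = self.partial.get(post_id, "") + delta

        def on_background(self):
            self.store.save(self.partial)   # persist before iOS suspends us

        def on_foreground(self):
            self.partial = self.store.load() or {}
            return list(self.partial)       # post ids to re-issue or resume
    ```
    
    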

Safety

Safety & policy primitives

Enterprise AI security ships server-side. Apps shipping LLM features on mobile need on-device primitives — redact before data leaves the phone, scrub injection on untrusted context, filter output before render. Long-tail durable layer; the security stance is structurally outside the model.

  • On-device PII / PHI redaction

    ◐ wishlist

    Strip names, emails, phone numbers, IDs, and health terms from user input before it leaves the device. NaturalLanguage framework + tunable regex layers per regulated vertical.

    The headline primitive — the one frontier chatbots structurally can't offer because the data is already in their cloud.
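    The regex layer alone is easy to picture. A Python sketch with a deliberately small pattern table (illustrative; a real redactor layers NER on top, and per-vertical tables would be far larger):

    ```python
    import re

    # Order matters: more specific shapes first, so an SSN isn't
    # swallowed by the looser phone pattern.
    PATTERNS = {
        "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
        "[SSN]":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "[PHONE]": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    }

    def redact(text: str) -> str:
        """Replace matches with typed placeholders before the text
        leaves the device."""
        for placeholder, pattern in PATTERNS.items():
            text = pattern.sub(placeholder, text)
        return text
    ```

    Typed placeholders (rather than blanks) keep the redacted text usable as model input.
    
    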

  • Prompt-injection scrubber

    ◐ wishlist

    Sanitise user-supplied context (clipboard, email body, OCR'd image, web page) before it joins the prompt. Detect override patterns ("ignore previous"), role-confusion, hidden instructions in markdown links.

    Hard problem with no clean fix — defence-in-depth, not a silver bullet.
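    One defence-in-depth layer, sketched in Python: fence the untrusted text and flag known override patterns (illustrative; the pattern list is a deliberately incomplete sample, and the fence tag is an assumption):

    ```python
    import re

    SUSPECT = [
        re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
        re.compile(r"\byou are now\b", re.I),
        re.compile(r"\bsystem prompt\b", re.I),
        re.compile(r"\[.*?\]\(javascript:", re.I),   # script-scheme markdown link
    ]

    def scrub(untrusted: str) -> tuple[str, bool]:
        """Wrap untrusted context in a data fence and flag suspect patterns.
        Returns (fenced_text, flagged)."""
        flagged = any(p.search(untrusted) for p in SUSPECT)
        fenced = "<untrusted-context>\n" + untrusted + "\n</untrusted-context>"
        return fenced, flagged
    ```

    The flag doesn't block the request; it lets the harness downgrade trust or warn the user, which matches the defence-in-depth stance.
    
    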

  • Output filter hooks

    ◐ wishlist

    Pluggable filters between model output and UI render: toxic content, leaked system prompts, hallucinated PII, secret patterns (API keys, tokens). Fails open by default; app opts in to each filter.
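    The hook shape, sketched in Python (illustrative; production would be Swift, and `secret_filter` is a hypothetical example filter):

    ```python
    import re
    from typing import Callable, Optional

    # A filter returns replacement text to intervene, or None to pass.
    Filter = Callable[[str], Optional[str]]

    def apply_filters(output: str, filters: list[Filter]) -> str:
        """Run model output through opt-in filters before render.
        An empty filter list means fail-open: render as-is."""
        for f in filters:
            replacement = f(output)
            if replacement is not None:
                return replacement
        return output

    def secret_filter(text: str) -> Optional[str]:
        """Example filter: block obvious API-key-shaped strings."""
        if re.search(r"sk-[A-Za-z0-9]{20,}", text):
            return "[redacted: possible secret in model output]"
        return None
    ```

    Fail-open as the default keeps the hook zero-cost for apps that never opt in.
    
    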

  • Multi-model agreement-as-trust

    ✓ shipped

    Re-purpose Council fan-out as a safety check: when N models disagree on a factual answer, flag as low-trust before display. Reuses existing harness — agreement is a security signal.

    Shipped 2026-05-07 as Phase 4+ A1 — TrustSignal struct (level + concerns + advice) on ModeratorVerdict, populated by the moderator JSON pass; UI renders a TrustBadge when level isn't .high.
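    The core of the signal, sketched in Python (illustrative; the shipped TrustSignal is Swift and populated by a moderator JSON pass rather than exact string matching, and the thresholds here are assumptions):

    ```python
    from collections import Counter

    def trust_signal(answers: dict[str, str]) -> dict:
        """Turn Council disagreement into a trust level before display."""
        counts = Counter(a.strip().lower() for a in answers.values())
        top = counts.most_common(1)[0][1] if counts else 0
        share = top / max(1, len(answers))        # how many models agree
        level = "high" if share == 1.0 else "medium" if share >= 0.5 else "low"
        concerns = [] if level == "high" else [
            f"{len(counts)} distinct answers from {len(answers)} models"
        ]
        return {"level": level, "concerns": concerns}
    ```

    The UI rule follows directly: render a badge whenever `level` isn't high.
    
    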

Eval

Eval & observability

You can't improve what you don't measure. Schema validity, latency, cost-per-task, failure mode taxonomy.

  • ValidityHarness

    ✓ shipped

    Per-model schema-validity dashboard built into the Cuppa app. Powered the 60% → 100% jump on Foundation Models.

  • Failure mode taxonomy

    ✓ shipped

    Classify each model failure (truncation, fence leak, hallucinated field, refusal) so the repair ladder can target what actually breaks.

    Shipped 2026-05-08 as Phase 4+ A3 — FailureKind enum + classifier dispatching via FailureKindClassifiable conformances on each agent's error type. Reply.failureKind persists per call; ValidityHarness dashboard renders a Failure breakdown pill row.
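    A rough text-level version of the classifier, in Python (illustrative; the shipped FailureKind classifier dispatches on typed Swift errors, not raw text):

    ```python
    import json
    import re
    from enum import Enum
    from typing import Optional

    class FailureKind(Enum):
        TRUNCATION = "truncation"
        FENCE_LEAK = "fence_leak"
        REFUSAL = "refusal"
        INVALID_JSON = "invalid_json"

    def classify(raw: str) -> Optional[FailureKind]:
        """Name what actually broke, so the repair ladder can target it."""
        try:
            json.loads(raw)
            return None                               # parsed fine: not a failure
        except json.JSONDecodeError:
            pass
        if re.search(r"\bI (can't|cannot|won't)\b", raw):
            return FailureKind.REFUSAL
        if raw.lstrip().startswith("```"):
            return FailureKind.FENCE_LEAK
        if raw.count("{") > raw.count("}"):
            return FailureKind.TRUNCATION
        return FailureKind.INVALID_JSON
    ```

    Counting these per model is what turns "it sometimes breaks" into "model X truncates, model Y leaks fences".
    
    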

  • Cost-per-task tracking

    ◐ wishlist

    Token + wall-clock + battery accounting per Post, grouped by model. The metric that decides which provider stays in the Council.
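    The accounting row is simple; the work is in collecting it. A Python sketch of a per-Post cost record (shape and field names are assumptions):

    ```python
    from dataclasses import dataclass

    @dataclass
    class TaskCost:
        """One Post's accounting row, grouped later by model."""
        model: str
        tokens_in: int
        tokens_out: int
        wall_ms: int

    def cost_usd(c: TaskCost, per_mtok_in: float, per_mtok_out: float) -> float:
        """Dollar cost from per-million-token prices."""
        return (c.tokens_in / 1e6) * per_mtok_in + (c.tokens_out / 1e6) * per_mtok_out
    ```

    Battery accounting would join as another column once the device-context layer exposes it.
    
    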


Got a lib worth a look? Email hello@cuppatech.com or open an issue on GitHub.