Labs

The R&D queue.

What the lab is actively poking at. We bias toward layers the frontier model SDKs structurally can't ship — device context, cross-provider routing, mobile-side eval — and let them absorb the rest. Some pieces are already in production inside the Cuppa app or Cuppa Harness; some are under evaluation; some are on the list and not yet started. The freeform sandbox lives at mycuppa.io; this page is the curated index.

✓ shipped ⚠ evaluating ◐ wishlist

Inference

On-device inference

Running models on the phone is now real. Battery, thermal, and memory budget are the new latency.

Apple Foundation Models
✓ shipped

On-device LLM shipping in iOS 26+. Roughly 7 s per post on iPhone 15 Pro, with 100% schema-valid output under permissive decoding.

Powers the Cuppa app today.
MLX-Swift
⚠ evaluating

Apple Silicon ML runtime with a Swift surface. Path to running Gemma 2/3 and similar locally without Python.

Concurrent-model memory budget on iPhone 15/16 is the open question.
WhisperKit
◐ wishlist

Whisper inference on Apple Silicon via Core ML. Voice-in for Posts and Studio narration.
llama.cpp Swift bindings
◐ wishlist

Quantised CPU/GPU inference. Useful as a fallback path when MLX or Foundation Models aren't available.

Repair

Output repair & structured generation

Frontier models are absorbing this layer fast. We keep the shipped repair primitives as a safety net for on-device and older models, but have stopped deepening the cloud-side parsing work.

JSON candidate ladders
✓ shipped

Try raw → fence-stripped → brace-extracted → reasoning-tag-stripped → truncation-repaired. Took us from ~60% to 100% schema-valid output.

Lifted into Cuppa Harness as OutputRepair. Maintained as a fallback for Foundation Models; not deepening for cloud providers.
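
A minimal sketch of what such a ladder could look like, assuming each rung is tried independently against the raw model output and the first candidate that decodes into the target schema wins. This is not the OutputRepair implementation: `PostDraft`, `decodeWithLadder`, the helper names, and the `<think>` tag convention are all illustrative assumptions.

```swift
import Foundation

// Hypothetical schema, used only for this illustration.
struct PostDraft: Decodable, Equatable { let title: String }

// Built at runtime so the literal fence never appears inside this listing.
let fence = String(repeating: "`", count: 3)

/// Text between the first and last fence, minus a leading language tag such as "json".
func stripFences(_ s: String) -> String? {
    guard let open = s.range(of: fence),
          let close = s.range(of: fence, options: .backwards),
          open.upperBound <= close.lowerBound else { return nil }
    var inner = String(s[open.upperBound..<close.lowerBound])
    if let newline = inner.firstIndex(of: "\n"), !inner[..<newline].contains("{") {
        inner = String(inner[inner.index(after: newline)...])
    }
    return inner
}

/// Everything from the first "{" to the last "}".
func extractBraces(_ s: String) -> String? {
    guard let start = s.firstIndex(of: "{"), let end = s.lastIndex(of: "}"), start < end else { return nil }
    return String(s[start...end])
}

/// Removes <think>...</think> blocks (assumed tag name).
func stripReasoningTags(_ s: String) -> String {
    s.replacingOccurrences(of: "<think>[\\s\\S]*?</think>", with: "", options: .regularExpression)
}

/// Closes an unterminated string and any open objects or arrays.
func repairTruncation(_ s: String) -> String? {
    var closers: [Character] = []
    var inString = false
    var escaped = false
    for ch in s {
        if inString {
            if escaped { escaped = false }
            else if ch == "\\" { escaped = true }
            else if ch == "\"" { inString = false }
            continue
        }
        switch ch {
        case "\"": inString = true
        case "{": closers.append("}")
        case "[": closers.append("]")
        case "}", "]": if closers.popLast() != ch { return nil }
        default: break
        }
    }
    return s + (inString ? "\"" : "") + String(closers.reversed())
}

/// Tries each rung in order and returns the first one that decodes.
func decodeWithLadder<T: Decodable>(_ type: T.Type, from raw: String) -> (rung: String, value: T)? {
    let untagged = stripReasoningTags(raw)
    let rungs: [(name: String, candidate: String?)] = [
        ("raw", raw),
        ("fence-stripped", stripFences(raw)),
        ("brace-extracted", extractBraces(raw)),
        ("reasoning-tag-stripped", extractBraces(untagged) ?? untagged),
        ("truncation-repaired", untagged.firstIndex(of: "{").flatMap { repairTruncation(String(untagged[$0...])) }),
    ]
    for rung in rungs {
        guard let text = rung.candidate, let data = text.data(using: .utf8),
              let value = try? JSONDecoder().decode(T.self, from: data) else { continue }
        return (rung.name, value)
    }
    return nil
}
```

Returning the rung name alongside the value is a deliberate choice in this sketch: it lets a dashboard count which repair actually rescued each reply.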
Partial-JSON streaming parsers
◐ wishlist

Parse incomplete JSON as it streams in so the UI can render fields as they arrive. Client-side concern, durable even as models improve.

Stays on the list because it's a UI problem, not a model problem.

Memory

Memory & context engineering

Models forget. The harness layer is what makes them appear to remember.

USearch
⚠ evaluating

Tiny, embeddable vector index. Runs on iOS. Candidate for on-device retrieval.
ObjectBox
◐ wishlist

On-device vector + object DB with Swift bindings. Heavier than USearch; useful when you also want structured storage.
Conversation buffer + summarisation rolloff
✓ shipped

Keep last N exchanges verbatim, summarise the rest into a rolling system note. Simple, durable, ships in Cuppa Harness.

See the memory-patterns surface in the Harness extraction.
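
The rolloff can be sketched as a small value type, assuming the summariser is injected as a closure (in production it would presumably be a model call). `RollingMemory`, `Exchange`, and `naiveSummary` are hypothetical names, not the Harness API.

```swift
struct Exchange {
    let user: String
    let assistant: String
}

struct RollingMemory {
    let keepLast: Int
    // Folds evicted exchanges into the existing note; stands in for a model call.
    let summarise: (_ note: String, _ evicted: [Exchange]) -> String
    private(set) var recent: [Exchange] = []
    private(set) var systemNote = ""

    init(keepLast: Int, summarise: @escaping (String, [Exchange]) -> String) {
        self.keepLast = keepLast
        self.summarise = summarise
    }

    mutating func append(_ exchange: Exchange) {
        recent.append(exchange)
        let overflow = recent.count - keepLast
        guard overflow > 0 else { return }
        // Oldest exchanges leave the verbatim window and roll into the note.
        let evicted = Array(recent.prefix(overflow))
        recent.removeFirst(overflow)
        systemNote = summarise(systemNote, evicted)
    }
}

// Placeholder summariser for the example: records what the user asked about.
let naiveSummary: (String, [Exchange]) -> String = { note, evicted in
    let lines = evicted.map { "User asked about: \($0.user)" }
    return ([note] + lines).filter { !$0.isEmpty }.joined(separator: " | ")
}

var memory = RollingMemory(keepLast: 2, summarise: naiveSummary)
for topic in ["espresso ratio", "milk temperature", "grind size"] {
    memory.append(Exchange(user: topic, assistant: "(reply)"))
}
// memory.recent now holds the last two exchanges verbatim;
// memory.systemNote holds the summary of the first.
```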
Semantic compression
◐ wishlist

Summarise long context into structured slots (entities, decisions, open questions) instead of free-text summary. Less lossy on retrieval.

Orchestration

Agent orchestration

Where the harness layer keeps earning its keep: cross-provider fan-out, judging, routing, normalisation. The model itself can't see what other models said — coordination is structurally outside it.

Fan-out + moderator-as-judge
✓ shipped

N providers reply in parallel; a moderator model summarises agreement / disagreement / factual flags. Powers the Cuppa app's Council mode.

Lifted into Cuppa Harness.
Smart query routing
⚠ evaluating

Lightweight classifier routes a Post to the right model (or set of models) before fan-out. Cheaper, faster, and reduces moderator load.
Tool-call protocols across providers
⚠ evaluating

Each provider has its own tool-call shape. The goal is to normalise them without leaking provider quirks. The cross-provider gap widens with each new dialect, and the problem is durable because there's no single SDK to absorb it.

Promoted from wishlist 2026-05-07 — this is the orchestration bet.
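
To make the problem concrete, here is a hedged sketch that folds two dialects into one neutral type. The two input shapes follow our reading of the public docs (OpenAI-style calls nest a JSON-encoded `arguments` string under `function`; Anthropic-style `tool_use` blocks carry a parsed `input` object) and should be checked against current provider documentation. `ToolCall`, `Dialect`, and `normalise` are hypothetical names.

```swift
import Foundation

/// Provider-neutral shape (hypothetical). Arguments are kept as key-sorted JSON text
/// so two calls can be compared regardless of which dialect produced them.
struct ToolCall: Equatable {
    let id: String
    let name: String
    let argumentsJSON: String
}

enum Dialect { case openAIStyle, anthropicStyle }

func canonicalJSON(_ object: Any) -> String? {
    guard JSONSerialization.isValidJSONObject(object),
          let data = try? JSONSerialization.data(withJSONObject: object, options: [.sortedKeys])
    else { return nil }
    return String(data: data, encoding: .utf8)
}

func normalise(_ raw: [String: Any], dialect: Dialect) -> ToolCall? {
    switch dialect {
    case .openAIStyle:
        // Arguments arrive as a JSON-encoded string nested under "function".
        guard let id = raw["id"] as? String,
              let function = raw["function"] as? [String: Any],
              let name = function["name"] as? String,
              let text = function["arguments"] as? String,
              let parsed = try? JSONSerialization.jsonObject(with: Data(text.utf8)),
              let args = canonicalJSON(parsed) else { return nil }
        return ToolCall(id: id, name: name, argumentsJSON: args)
    case .anthropicStyle:
        // Arguments arrive as an already-parsed object under "input".
        guard (raw["type"] as? String) == "tool_use",
              let id = raw["id"] as? String,
              let name = raw["name"] as? String,
              let input = raw["input"],
              let args = canonicalJSON(input) else { return nil }
        return ToolCall(id: id, name: name, argumentsJSON: args)
    }
}

let fromOpenAIStyle = normalise([
    "id": "call_1",
    "type": "function",
    "function": ["name": "get_weather", "arguments": "{\"days\":3,\"city\":\"Lisbon\"}"] as [String: Any],
], dialect: .openAIStyle)

let fromAnthropicStyle = normalise([
    "type": "tool_use",
    "id": "toolu_1",
    "name": "get_weather",
    "input": ["city": "Lisbon", "days": 3] as [String: Any],
], dialect: .anthropicStyle)
```

Both values end up with the same `name` and `argumentsJSON`; only the provider-issued `id` differs, which is the part downstream code should treat as opaque.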

Device

Device-context-aware harness

The data the model can't see: battery, thermal, network quality, foreground/background lifecycle. The piece of the harness that frontier SDKs structurally can't ship from the cloud — and the layer we're betting on most.

Battery- and thermal-aware routing
✓ shipped

Drop to local Foundation Models when the phone is hot or low on battery. Defer expensive Council fan-outs until charging. Cloud SDKs literally can't see this signal.

Shipped 2026-05-08 as Phase 4+ B1 — RoutePolicy + DeviceState (ProcessInfo thermal + UIDevice battery + Low Power Mode); BatteryAndThermalRoutePolicy collapses fan-out to on-device when constrained, with a one-line banner on the post explaining why.
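
A rough sketch of how such a policy might be shaped. The type names `RoutePolicy`, `DeviceState`, and `BatteryAndThermalRoutePolicy` come from the note above; every field, threshold, and the `Route` enum are assumptions made for illustration. On device, the state would presumably be filled from `ProcessInfo.processInfo.thermalState`, `ProcessInfo.processInfo.isLowPowerModeEnabled`, and `UIDevice.current.batteryLevel` / `batteryState`; a plain enum stands in here so the example runs anywhere.

```swift
// Stand-in for ProcessInfo.ThermalState; Comparable is synthesised in declaration order.
enum ThermalLevel: Comparable { case nominal, fair, serious, critical }

struct DeviceState {
    var thermal: ThermalLevel
    var batteryLevel: Double      // 0.0 ... 1.0
    var isCharging: Bool
    var lowPowerMode: Bool
}

enum Route: Equatable {
    case councilFanOut
    case onDevice(reason: String) // reason feeds the one-line banner
}

protocol RoutePolicy {
    func route(for state: DeviceState) -> Route
}

struct BatteryAndThermalRoutePolicy: RoutePolicy {
    var minimumBattery = 0.2                      // assumed threshold
    var maximumThermal = ThermalLevel.fair        // assumed threshold

    func route(for state: DeviceState) -> Route {
        if state.thermal > maximumThermal {
            return .onDevice(reason: "Device is warm; answered on-device.")
        }
        if state.lowPowerMode {
            return .onDevice(reason: "Low Power Mode is on; answered on-device.")
        }
        if state.batteryLevel < minimumBattery && !state.isCharging {
            return .onDevice(reason: "Battery is low; answered on-device.")
        }
        return .councilFanOut
    }
}

let policy = BatteryAndThermalRoutePolicy()
let hot = DeviceState(thermal: .serious, batteryLevel: 0.9, isCharging: false, lowPowerMode: false)
let healthy = DeviceState(thermal: .nominal, batteryLevel: 0.8, isCharging: false, lowPowerMode: false)
let lowButCharging = DeviceState(thermal: .nominal, batteryLevel: 0.1, isCharging: true, lowPowerMode: false)
```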
Network-quality-aware streaming
✓ shipped

Switch streaming on/off, pre-warm on Wi-Fi, queue on cellular. Often dominates user-perceived latency more than model choice does.

Shipped 2026-05-09 as Phase 4+ B2 (split into B2a native SSE for Claude + B2b NetworkState/StreamPolicy gating). DefaultStreamPolicy streams on Wi-Fi/wired, suppresses on cellular/Low Data Mode/offline; ReplyStatus.streaming carries live deltas with a caret.
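
The gating rule described above is small enough to sketch directly. `NetworkState`, `StreamPolicy`, and `DefaultStreamPolicy` are named in the note; the fields and the `Link` enum are assumptions. On device the inputs would likely come from an `NWPathMonitor` path (interface type and `isConstrained` for Low Data Mode).

```swift
enum Link { case wifi, wired, cellular, offline }

struct NetworkState {
    var link: Link
    var lowDataMode: Bool
}

protocol StreamPolicy {
    func shouldStream(on network: NetworkState) -> Bool
}

struct DefaultStreamPolicy: StreamPolicy {
    func shouldStream(on network: NetworkState) -> Bool {
        // Low Data Mode suppresses streaming regardless of the interface.
        if network.lowDataMode { return false }
        switch network.link {
        case .wifi, .wired: return true
        case .cellular, .offline: return false
        }
    }
}

let streamPolicy = DefaultStreamPolicy()
let onWiFi = streamPolicy.shouldStream(on: NetworkState(link: .wifi, lowDataMode: false))
let onCellular = streamPolicy.shouldStream(on: NetworkState(link: .cellular, lowDataMode: false))
```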
Foreground/background lifecycle hooks
✓ shipped

Pause cloud calls, persist partial results, resume on return. iOS-specific; the gap between "demo works" and "ships."

Shipped 2026-05-07 as Phase 4+ A2 — LifecyclePolicy protocol with PauseDecision/ResumeDecision, scene-phase wiring, lifecycleState snapshot persisted via the existing ConversationStore.

Safety

Safety & policy primitives

Enterprise AI security ships server-side. Apps shipping LLM features on mobile need on-device primitives — redact before data leaves the phone, scrub injection on untrusted context, filter output before render. Long-tail durable layer; the security stance is structurally outside the model.

On-device PII / PHI redaction
◐ wishlist

Strip names, emails, phone numbers, IDs, and health terms from user input before it leaves the device. NaturalLanguage framework + tunable regex layers per regulated vertical.

The headline primitive — the one frontier chatbots structurally can't offer because the data is already in their cloud.
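
As an illustration of the regex layer only (the NaturalLanguage pass for names is left out), a minimal sketch might look like the following. The rule set, the patterns, and the `[LABEL]` placeholder format are assumptions; real patterns would need tuning per locale and vertical, and these two will miss plenty of formats.

```swift
import Foundation

struct RedactionRule {
    let label: String
    let pattern: String
}

// Deliberately simple patterns for the example, not production-grade detectors.
let defaultRules = [
    RedactionRule(label: "EMAIL", pattern: "[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}"),
    RedactionRule(label: "PHONE", pattern: "\\+?\\d[\\d\\s().-]{7,}\\d"),
]

/// Replaces every match of every rule with a bracketed label, in rule order.
func redact(_ text: String, rules: [RedactionRule] = defaultRules) -> String {
    var out = text
    for rule in rules {
        guard let regex = try? NSRegularExpression(pattern: rule.pattern, options: [.caseInsensitive]) else {
            continue // skip a rule whose pattern fails to compile
        }
        let whole = NSRange(out.startIndex..., in: out)
        out = regex.stringByReplacingMatches(in: out, options: [], range: whole,
                                             withTemplate: "[\(rule.label)]")
    }
    return out
}

let redacted = redact("Mail ana@example.com or call +44 20 7946 0958.")
// redacted == "Mail [EMAIL] or call [PHONE]."
```

Keeping rules as data rather than code is what would let a regulated vertical swap in its own layer without touching the redactor.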
Prompt-injection scrubber
◐ wishlist

Sanitise user-supplied context (clipboard, email body, OCR'd image, web page) before it joins the prompt. Detect override patterns ("ignore previous"), role-confusion, hidden instructions in markdown links.

A hard problem with no clean fix: this is defence-in-depth, not a silver bullet.
Output filter hooks
◐ wishlist

Pluggable filters between model output and UI render: toxic content, leaked system prompts, hallucinated PII, secret patterns (API keys, tokens). Fails open by default; app opts in to each filter.
Multi-model agreement-as-trust
✓ shipped

Re-purpose Council fan-out as a safety check: when N models disagree on a factual answer, flag as low-trust before display. Reuses existing harness — agreement is a security signal.

Shipped 2026-05-07 as Phase 4+ A1 — TrustSignal struct (level + concerns + advice) on ModeratorVerdict, populated by the moderator JSON pass; UI renders a TrustBadge when level isn't .high.
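
For a sense of the data involved, here is a guess at how the moderator's JSON could map onto these types. Only `TrustSignal` (level, concerns, advice), `ModeratorVerdict`, and the `.high` level are named above; the other levels, the `summary` field, the JSON layout, and the helper functions are assumptions.

```swift
import Foundation

struct TrustSignal: Decodable, Equatable {
    enum Level: String, Decodable { case high, medium, low } // medium/low are assumed
    let level: Level
    let concerns: [String]
    let advice: String
}

struct ModeratorVerdict: Decodable {
    let summary: String      // assumed field
    let trust: TrustSignal
}

func decodeVerdict(_ json: String) -> ModeratorVerdict? {
    try? JSONDecoder().decode(ModeratorVerdict.self, from: Data(json.utf8))
}

/// The badge is shown whenever the level isn't .high.
func needsTrustBadge(_ verdict: ModeratorVerdict) -> Bool {
    verdict.trust.level != .high
}

// Hypothetical moderator output for a Council run where replies disagree.
let moderatorJSON = """
{
  "summary": "Two of three models disagree on the release year.",
  "trust": {
    "level": "low",
    "concerns": ["release year differs across replies"],
    "advice": "Verify against a primary source."
  }
}
"""
let verdict = decodeVerdict(moderatorJSON)
```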

Eval

Eval & observability

You can't improve what you don't measure. Schema validity, latency, cost-per-task, failure mode taxonomy.

ValidityHarness
✓ shipped

Per-model schema-validity dashboard built into the Cuppa app. Powered the 60% → 100% jump on Foundation Models.
Failure mode taxonomy
✓ shipped

Classify each model failure (truncation, fence leak, hallucinated field, refusal) so the repair ladder can target what actually breaks.

Shipped 2026-05-08 as Phase 4+ A3 — FailureKind enum + classifier dispatching via FailureKindClassifiable conformances on each agent's error type. Reply.failureKind persists per call; ValidityHarness dashboard renders a Failure breakdown pill row.
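
The dispatch pattern can be sketched as below. `FailureKind` and `FailureKindClassifiable` are named above and the four cases mirror the taxonomy listed in the card; the `.unknown` fallback, `OnDeviceAgentError`, and the `breakdown` helper are hypothetical additions for the example.

```swift
enum FailureKind: String {
    case truncation, fenceLeak, hallucinatedField, refusal
    case unknown // assumed fallback for errors that don't classify themselves
}

/// Each agent's error type reports its own kind.
protocol FailureKindClassifiable: Error {
    var failureKind: FailureKind { get }
}

// Hypothetical error type for one agent.
enum OnDeviceAgentError: FailureKindClassifiable {
    case hitTokenLimit
    case markdownFenceInOutput
    case unexpectedKey(String)
    case declined

    var failureKind: FailureKind {
        switch self {
        case .hitTokenLimit: return .truncation
        case .markdownFenceInOutput: return .fenceLeak
        case .unexpectedKey: return .hallucinatedField
        case .declined: return .refusal
        }
    }
}

/// Central classifier: defers to the conformance when there is one.
func classify(_ error: Error) -> FailureKind {
    (error as? FailureKindClassifiable)?.failureKind ?? .unknown
}

/// Counts per kind, the sort of data a breakdown row could render.
func breakdown(_ errors: [Error]) -> [FailureKind: Int] {
    errors.reduce(into: [:]) { counts, error in
        counts[classify(error), default: 0] += 1
    }
}

struct TransportError: Error {} // does not conform, so it classifies as .unknown

let counts = breakdown([
    OnDeviceAgentError.hitTokenLimit,
    OnDeviceAgentError.hitTokenLimit,
    OnDeviceAgentError.unexpectedKey("mood"),
    TransportError(),
])
```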
Cost-per-task tracking
◐ wishlist

Token + wall-clock + battery accounting per Post, grouped by model. The metric that decides which provider stays in the Council.

Got a lib worth a look? Email hello@cuppatech.com or open an issue on GitHub.
