Labs

The R&D queue.

What the lab is actively poking at. We bias toward layers the frontier model SDKs structurally can't ship — device context, cross-provider routing, mobile-side eval — and let them absorb the rest. Some pieces are already in production inside the Cuppa app or Cuppa Harness; some are under evaluation; some are on the list and not yet started. The freeform sandbox lives at mycuppa.io; this page is the curated index.

Cuppa Harness — visual tour of the primitives below. Embedded from cuppa.studio.

✓ shipped ⚠ evaluating ◐ wishlist


Inference

On-device inference

Running models on the phone is now real. Battery, thermal, and memory budget are the new latency.

  • On-device LLM shipping in iOS 26+. ~7s/post on iPhone 15 Pro, 100% schema-valid with permissive decoding.

    Powers the Cuppa app today.

  • MLX-Swift

    ⚠ evaluating

    Apple Silicon ML runtime with a Swift surface. Path to running Gemma 2/3 and similar locally without Python.

    Concurrent-model memory budget on iPhone 15/16 is the open question.

  • WhisperKit

    ◐ wishlist

    Whisper inference on Apple Silicon via Core ML. Voice-in for Posts and Studio narration.

  • Quantised CPU/GPU inference. Useful as a fallback path when MLX or Foundation Models aren't available.

Repair

Output repair & structured generation

Frontier models are absorbing this layer fast. We keep the repair primitives shipped as a safety net for on-device / older models — but stopped deepening the cloud-side parsing work.

  • JSON candidate ladders

    ✓ shipped

    Try raw → fence-stripped → brace-extracted → reasoning-tag-stripped → truncation-repaired. Took us from ~60% to 100% schema-valid output.

    Lifted into Cuppa Harness as OutputRepair. Maintained as a fallback for Foundation Models; not deepening for cloud providers.

  • Partial-JSON streaming parsers

    ◐ wishlist

    Parse incomplete JSON as it streams in so the UI can render fields as they arrive. Client-side concern, durable even as models improve.

    Stays on the list because it's a UI problem, not a model problem.

Memory

Memory & context engineering

Models forget. The harness layer is what makes them appear to remember.

  • USearch

    ⚠ evaluating

    Tiny, embeddable vector index. Runs on iOS. Candidate for on-device retrieval.

  • ObjectBox

    ◐ wishlist

    On-device vector + object DB with Swift bindings. Heavier than USearch; useful when you also want structured storage.

  • Conversation buffer + summarisation rolloff

    ✓ shipped

    Keep last N exchanges verbatim, summarise the rest into a rolling system note. Simple, durable, ships in Cuppa Harness.

    See the memory-patterns surface in the Harness extraction.

  • Semantic compression

    ◐ wishlist

    Summarise long context into structured slots (entities, decisions, open questions) instead of free-text summary. Less lossy on retrieval.

Orchestration

Agent orchestration

Where the harness layer keeps earning its keep: cross-provider fan-out, judging, routing, normalisation. The model itself can't see what other models said — coordination is structurally outside it.

  • Fan-out + moderator-as-judge

    ✓ shipped

    N providers reply in parallel; a moderator model summarises agreement, disagreement, and factual flags. Powers the Cuppa app's Council mode for both text and image posts.

    Lifted into Cuppa Harness. Extended in May 2026 with vision-specific signals so image fan-outs surface a different failure mode than text disagreement.

  • Smart query routing

    ✓ shipped

    A lightweight classifier routes each post to the right model — or set of models — before fan-out, then a cost-aware policy picks the cheapest viable choice from the available lineup. Cheaper, faster, and reduces moderator load.

    Text classifier shipped earlier in the year (simple / factual / contested). Image-side shipped May 2026 — heavy OCR or high-confidence single subjects route to one fast vision model; ambiguous scenes fan out to Council. Cost-aware extension shipped a day later: each model carries a quality tier, the router filters by tier per class then scores by live cost data. The harness stays honest as rate sheets shift instead of relying on hard-coded preferences. Demoed: a simple factoid query routed to Apple Foundation Models (on-device, $0, zero data left the device); on a pure-cloud loadout the same factual query landed on a cheaper small vision model — about 10× cheaper than the previous hard-coded pick at the same quality tier.

  • Tool-call protocols across providers

    ⚠ evaluating

    Each provider has its own tool-call shape. Normalising them without leaking provider quirks. The cross-provider gap widens with each new dialect — durable because there's no single SDK to absorb it.

    Promoted from wishlist 2026-05-07 — this is the orchestration bet.

Device

Device-context-aware harness

The data the model can't see: battery, thermal, network quality, foreground/background lifecycle. The piece of the harness that frontier SDKs structurally can't ship from the cloud — and the layer we're betting on most.

  • Battery- and thermal-aware routing

    ✓ shipped

    Drop to local Foundation Models when the phone is hot or low. Defer expensive Council fan-outs until charging. Cloud SDKs literally can't see this signal.

    Shipped May 2026. When the phone is hot, low on battery, or in Low Power Mode, the harness collapses cloud fan-outs to on-device models and surfaces a one-line banner on the post explaining why.

  • Network-quality-aware streaming

    ✓ shipped

    Switch streaming on/off, pre-warm on Wi-Fi, queue on cellular. Often dominates user-perceived latency more than model choice does.

    Shipped May 2026. Live token-by-token streaming on Wi-Fi and wired connections; single-shot replies on cellular, Low Data Mode, and offline.

  • Foreground/background lifecycle hooks

    ✓ shipped

    Pause cloud calls, persist partial results, resume on return. iOS-specific; the gap between "demo works" and "ships."

    Shipped May 2026. Pauses in-flight calls when the app backgrounds, resumes on return without losing partial state.

Imaging

Cuppa Imaging — first-class vision lane

Phones are cameras. Image-as-first-class harness input — every vision capability built outside the model, so apps swap providers, redact before send, route by on-device CV signal, and reason about hallucinations across N vision models the same way Council reasons across text models.

  • Multimodal request types

    ✓ shipped

    Two request shapes — text-with-image and pure-image — kept separate so use-case routing stays distinct. Forward-compatible with audio and video.

    Shipped May 2026 as the foundation the rest of the imaging lane builds on.

  • Curated vision provider set

    ✓ shipped

    Claude Haiku, Gemini Flash, Qwen3-VL (via Ollama Cloud), and Llama 4 Scout (via Groq). Non-vision providers are silently auto-skipped from image fan-outs.

    Shipped May 2026. Four cloud vision providers wired with a single image-encoding contract per provider.

  • On-device CV preprocessor

    ✓ shipped

    A wrapper around Apple's Vision framework: OCR, face detection, scene classification, and PII heuristics. Runs before any LLM call; drives both redaction and routing decisions downstream.

    Shipped May 2026. Per-request fallback so one signal failing doesn't take down the others; graceful degradation on devices without Neural Engine acceleration.

  • Default-on image redaction

    ✓ shipped

    Strips photo metadata and blurs faces + visible PII regions (addresses, phones, emails, card numbers) before send. Smart default with a visible composer toggle for opt-out.

    Shipped May 2026. A small footer under each image shows what was stripped — honest UX commitment, the user can always see exactly what left the device.

  • Vision routing policy

    ✓ shipped

    Collapse image fan-out to one fast model when the on-device preprocessor is confident the image is unambiguous (clear scene, document-heavy OCR); full Council on ambiguity, low confidence, or missing signal. Pessimistic by design — only optimize when confident.

    Shipped May 2026. A small banner on the post distinguishes vision-content-driven collapse from device-state-driven collapse so the user knows which constraint applied.

  • Council-for-vision + visual-hallucination flagging

    ✓ shipped

    New trust-signal categories for image fan-outs: "visual disagreement" when models describe different subjects, and "text reading differs" for transcription divergence. Council's divergence-as-signal applied to vision.

    Shipped May 2026. The moderator is instructed to prefer these signals over generic factual-disagreement for image posts; text-only posts keep the original signal taxonomy.

Safety

Safety & policy primitives

Enterprise AI security ships server-side. Apps shipping LLM features on mobile need on-device primitives — redact before data leaves the phone, scrub injection on untrusted context, filter output before render. Long-tail durable layer; the security stance is structurally outside the model.

  • On-device PII / PHI redaction

    ✓ shipped

    Strips names, emails, phone numbers, addresses, URLs, and card numbers from user input before it leaves the device. Each match swaps to a category token like [NAME] or [EMAIL] so the model still understands the prompt structure but never sees the values. A stricter mode adds organization names for employer-sensitive verticals.

    Shipped May 2026. Three detection layers (Apple's NaturalLanguage NER + structured-data detectors + regex) combined with overlap resolution. Same shield-icon footer as image redaction — one consistent privacy commitment surface across text and images. Vertical conformer hooks (health, finance) are designed-in for follow-ups.

  • Prompt-injection scrubber

    ✓ shipped

    Sanitises user-supplied context (clipboard, email body, OCR'd image, web page) before it joins the prompt. Detects override patterns ("ignore previous"), role confusion, system tag injection, prompt-leak probes, suspicious markdown and zero-width characters. Two-layer mitigation: a defensive prefix plus an "untrusted" wrapper the model is asked to treat as data.

    Shipped May 2026. 14 detection rules across 5 categories plus a zero-width-character sweep. Detection surfaces an orange warning banner so the user knows the model saw flagged content; the wrap is best-effort prompt engineering whose effectiveness scales with the model's boundary-tag training. Hard problem — no clean fix exists; defence-in-depth, not silver bullet. Verified end-to-end: an injection-style prompt split the 6-model Council three ways (refused / acknowledged-and-complied-anyway / just-complied), and the trust signal correctly flagged the divergence.

  • Output filter hooks

    ✓ shipped

    Pluggable filters between model output and UI render: secret patterns (API keys, tokens) today; toxic content, leaked system prompts, and hallucinated PII as follow-ups. Fails open by default; app opts in to each filter.

    Shipped May 2026. Eleven high-confidence credential patterns covered today — AWS, GitHub, OpenAI, Anthropic, Google, Stripe, Slack, JWTs, and PEM private keys. Filter runs at the end of generation; mid-stream is skipped to avoid partial-match false-positives. Same shield-icon footer as input redaction; consistent privacy commitment across input AND output.

  • Multi-model agreement-as-trust

    ✓ shipped

    Re-purpose Council fan-out as a safety check: when N models disagree on a factual answer, flag as low-trust before display. Reuses existing harness — agreement is a security signal.

    Shipped May 2026. A trust badge appears whenever the model fan-out diverges or hedges, with a one-line advice line so the user knows what to verify. Extended later in May with vision-specific signals for image fan-outs.

Eval

Eval & observability

You can't improve what you don't measure. Schema validity, latency, cost-per-task, failure mode taxonomy.

  • ValidityHarness

    ✓ shipped

    Per-model schema-validity dashboard built into the Cuppa app. Powered the 60% → 100% jump on Foundation Models.

  • Failure mode taxonomy

    ✓ shipped

    Classify each model failure (truncation, fence leak, hallucinated field, refusal) so the repair ladder can target what actually breaks.

    Shipped May 2026. Every failure carries a category — network, rate-limited, safety refusal, malformed output, and so on — instead of just a stack trace. Powers the validity dashboard's per-category breakdown.

  • Cost-per-task tracking

    ✓ shipped

    Token usage and $ cost estimate per Reply, with a small label in the card header so users see what each model costs at a glance. The metric that decides which provider stays in the Council, and the data smart routing keys off.

    Shipped May 2026. Reads each provider's native token-usage block, multiplies through a rate-sheet snapshot, and labels the card. Demoed: the same image critique cost about 0.04¢ on a small vision model versus 0.32¢ on a frontier one — an 8× spread, which is the signal cost-aware routing keys off.


Got a lib worth a look? Email hello@cuppatech.com or open an issue on GitHub.