
On-Device AI Assistants: What’s Possible Today

Astro AI Team
October 6, 2025
On-Device AI · Edge Inference · Mobile AI · Privacy · Performance

On-device AI is no longer a novelty—it’s shipping in everyday workflows. Phones, laptops, and even wearables now run capable language and vision models locally. In this post, we’ll cover what’s practical today, where the limits are, and how to design hybrid systems that balance privacy, performance, and cost.

Why run AI on device?

  • Privacy: Sensitive data (messages, photos, voice) can stay local.
  • Latency: No round-trip to the cloud for common tasks; sub-100ms is achievable.
  • Offline reliability: Works in low-connectivity environments and reduces failure modes.
  • Cost control: Fewer cloud tokens and API calls, especially for frequent micro-tasks.

What can modern devices actually run?

  • Small LLMs (1–3B params): Great for rewriting, formatting, extracting entities, short summaries.
  • Mid LLMs (4–8B params, quantized): Good for autocomplete, short reasoning, form filling, intent classification.
  • Multimodal encoders: On-device image captioning, OCR, face/object landmarks, simple VQA.
  • ASR/TTS: High-quality speech recognition and natural voices, often with hardware acceleration.

Performance depends on hardware (Apple Neural Engine, Snapdragon NPU, RTX laptop GPUs), quantization level, and context length.

The role of quantization and acceleration

  • Quantization: Reduces memory/compute (e.g., Q4_K, INT8) with minor quality tradeoffs.
  • Hardware backends: Core ML (Apple), NNAPI (Android), DirectML (Windows), Metal/Vulkan/CUDA.
  • Runtime choices: MLC LLM, llama.cpp (GGUF), ONNX Runtime, TensorRT-LLM.

Tip: Benchmark with realistic prompts and your target context sizes. Differences show up at longer sequences and in multi-turn workloads.
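
For example, here is a minimal benchmark sketch using llama-cpp-python; the model path, quant level, context size, and prompts are placeholders you would swap for your own workload.

```python
# Minimal latency benchmark for a local GGUF model via llama-cpp-python.
# Assumes `pip install llama-cpp-python` and a quantized model file on disk;
# the path, context size, and prompts below are placeholders.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/assistant-7b-Q4_K_M.gguf",  # placeholder model/quant level
    n_ctx=8192,        # benchmark at the context size you actually plan to ship
    n_gpu_layers=-1,   # offload all layers when a Metal/CUDA/Vulkan backend is available
    verbose=False,
)

prompts = [
    "Rewrite this email to be more concise: ...",       # use realistic task prompts,
    "Extract the action items from these notes: ...",   # not toy one-liners
]

for prompt in prompts:
    start = time.perf_counter()
    out = llm(prompt, max_tokens=128)
    elapsed = time.perf_counter() - start
    n_tokens = out["usage"]["completion_tokens"]
    print(f"{n_tokens} tokens in {elapsed:.2f}s ({n_tokens / elapsed:.1f} tok/s)")
```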

What belongs on device vs in the cloud?

  • On device

    • Personalization state (recent email topics, preferences, local embeddings)
    • Fast transforms (rewrite, summarize, classify)
    • Low-risk tools (calendar lookup, clipboard transforms)
  • In the cloud

    • Heavy reasoning or long-context tasks (research, coding, multi-doc RAG)
    • Expensive tools (web search, proprietary systems)
    • Team-shared knowledge bases and analytics
  • Hybrid routing (sketched after this list)

    • Try local first → escalate to cloud on confidence/size thresholds
    • Keep PII local; send only minimal context or hashed features
    • Log telemetry/events, not raw content, for evals and quality tracking
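
Here is a minimal sketch of that local-first routing policy. The thresholds, the confidence heuristic, and the run_local / run_cloud helpers are illustrative placeholders, not a specific API.

```python
# Local-first routing sketch: try the on-device model first, escalate to the cloud
# only when the task is too large or the local answer looks low-confidence.
# run_local, run_cloud, and both thresholds are hypothetical placeholders.
from dataclasses import dataclass

LOCAL_CONTEXT_LIMIT = 8_000      # rough token budget the on-device model handles well
CONFIDENCE_THRESHOLD = 0.7       # below this, escalate

@dataclass
class Answer:
    text: str
    confidence: float            # e.g. from average token logprobs or a small verifier

def run_local(prompt: str) -> Answer:
    # placeholder: call your on-device runtime (llama.cpp, MLC LLM, Core ML, ...)
    return Answer(text="(local draft)", confidence=0.9)

def run_cloud(prompt: str) -> Answer:
    # placeholder: call the hosted model; redact or hash PII before this point
    return Answer(text="(cloud answer)", confidence=0.99)

def route(prompt: str, token_count: int) -> Answer:
    if token_count <= LOCAL_CONTEXT_LIMIT:
        local = run_local(prompt)
        if local.confidence >= CONFIDENCE_THRESHOLD:
            return local                 # fast path: the request never leaves the device
    # escalation path: log the event and token count, not the raw content
    print(f"escalating to cloud ({token_count} tokens)")
    return run_cloud(prompt)
```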

Design patterns for hybrid assistants

  1. Local-first transforms: Run rewriting, summarization, and classification locally; add a confidence score to decide escalation.
  2. Chunked cloud handoff: When a task is too big, send only necessary chunks with sparse context windows.
  3. RAG with local cache: Maintain device-side embeddings for personal docs; query the cloud KB only on a local miss (see the sketch after this list).
  4. Tool gating: Allow device tools by default; require policy checks for cloud tools and external actions.
  5. Transparent UX: Show when and why a request goes to the cloud; allow user controls for privacy and cost.
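
A minimal sketch of the device-side embedding cache from pattern 3, using SQLite plus NumPy for similarity. The embed() stub, table layout, and miss threshold are illustrative assumptions; in practice you would plug in a small on-device embedding model.

```python
# Device-side embedding cache for RAG: answer from local vectors when possible,
# fall back to a cloud knowledge base only on a miss.
import sqlite3
import zlib
import numpy as np

MISS_THRESHOLD = 0.35   # below this cosine similarity, treat the query as a local miss

def embed(text: str) -> np.ndarray:
    # Placeholder: swap in a small on-device embedding model. The deterministic
    # random projection here only keeps the sketch self-contained and runnable.
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.standard_normal(384).astype(np.float32)
    return v / np.linalg.norm(v)

db = sqlite3.connect("personal_docs.db")
db.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, text TEXT, vec BLOB)")

def add_doc(text: str) -> None:
    db.execute("INSERT INTO docs (text, vec) VALUES (?, ?)", (text, embed(text).tobytes()))
    db.commit()

def local_lookup(query: str) -> str | None:
    q = embed(query)
    best_text, best_score = None, -1.0
    for text, blob in db.execute("SELECT text, vec FROM docs"):
        score = float(np.dot(q, np.frombuffer(blob, dtype=np.float32)))
        if score > best_score:
            best_text, best_score = text, score
    if best_score >= MISS_THRESHOLD:
        return best_text      # hit: the answer context never leaves the device
    return None               # miss: caller escalates to the cloud knowledge base
```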

Evaluating quality without shipping regressions

  • Build a golden set of frequent tasks (rewrite emails, summarize notes, extract todos).
  • Track latency, local-token savings, and escalation rates (a minimal harness sketch follows this list).
  • Use rubric-based scoring for hallucinations and instruction-following.
  • Canary new local models and quant levels before rolling out broadly.
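
A golden-set harness can be very small. In the sketch below, the task list and the run_assistant callable are placeholders for your own pipeline; the point is to replay the same frequent tasks before and after a model or quant change and compare latency and escalation rate.

```python
# Golden-set evaluation sketch: replay frequent tasks against the assistant and
# track latency and escalation rate before rolling out a new local model or quant.
import time
import statistics

GOLDEN_TASKS = [
    "Rewrite this email to be more concise: ...",
    "Summarize these meeting notes: ...",
    "Extract the todos from this message: ...",
]

def run_assistant(prompt: str) -> dict:
    # placeholder: call your local-first pipeline; report whether it escalated
    return {"text": "(answer)", "escalated": False}

latencies, escalations = [], 0
for task in GOLDEN_TASKS:
    start = time.perf_counter()
    result = run_assistant(task)
    latencies.append(time.perf_counter() - start)
    escalations += result["escalated"]

print(f"p50 latency: {statistics.median(latencies) * 1000:.0f} ms")
print(f"escalation rate: {escalations / len(GOLDEN_TASKS):.0%}")
```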

Example capability map (today’s realistic targets)

  • iPhone 15+/M-series Macs: 4–8B quantized LLMs at interactive speeds; 16–32k context is feasible with tradeoffs.
  • Modern Android flagships: Similar class with NNAPI acceleration; vendor variance applies.
  • RTX 40-series laptops: 7–13B quantized models run comfortably; strong multimodal performance.

Output quality roughly tracks parameter count and pretraining quality more than anything else, so routing edge cases to a bigger cloud model remains a best practice.

Developer stack starters

  • Local LLM runtimes: llama.cpp (GGUF), MLC LLM, Ollama for rapid prototyping (quick example after this list)
  • Mobile acceleration: Core ML Tools (iOS), NNAPI + GPU delegates (Android)
  • Multimodal: ONNX Runtime, TFLite, OpenCV for pre/post-processing
  • RAG: SQLite + device vectors, LiteLLM proxy for routing, lightweight rerankers
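
For quick prototyping with Ollama, something like the following is enough to start iterating. It assumes the Ollama server is running locally and the Python client is installed (pip install ollama); the "llama3.2" tag is just a placeholder for whatever small model you have pulled.

```python
# Quick local prototype via the Ollama Python client.
import ollama

response = ollama.chat(
    model="llama3.2",  # placeholder tag; use any locally pulled model
    messages=[
        {"role": "system", "content": "You rewrite text to be concise and friendly."},
        {"role": "user", "content": "Please rewrite: The meeting has been moved to 3pm tomorrow."},
    ],
)
print(response["message"]["content"])
```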

UX considerations that build trust

  • Provide a “Local-only” mode with clear limitations.
  • Explain escalations to cloud and allow opt-out per request.
  • Make privacy, battery impact, and data retention visible and adjustable.
  • Offer quick actions (rewrite, tone shift, extract tasks) that feel instant.

When to move up the maturity curve

Start with a local-first assistant focused on fast transforms and memory of recent interactions. Add cloud escalation for harder tasks. Introduce tool use and richer planning only when telemetry shows clear ROI.


At Astro AI, we’re building assistants that feel instant, private, and capable—by default. If you care about speed and privacy on your device, join the waitlist and help shape the roadmap.