
What would an open voice copilot stack look like on Mac and Android?

This started with a very specific wish: I wanted something in the neighborhood of a Claude-style voice coworker, but open, configurable, and local-first. Not just speech-to-text. I wanted push-to-talk dictation into any field, an LLM in the middle when that helped, and enough command-and-control to make the machine feel operable by voice instead of merely transcribable by voice. The more I looked, the more it seemed like the honest answer was not "here's the app." It was "here's the stack."

Drafted January 2026 - AI and agents - voice interfaces, macOS automation, Android automation

I re-checked the current project pages before writing this because voice tooling churns quickly. My read now is that open voice control is no longer blocked mainly by speech recognition. That part is surprisingly good. The harder problem is reliable action: deciding whether spoken input should become plain dictation, a deterministic command, a browser task, or a risky open-ended agent action, and then executing that choice without the whole thing feeling brittle.

That matters because a lot of "voice AI" conversations still mash together four different needs: speech-to-text, text injection, planning, and control. Once I separated those, the landscape got clearer. There still is not one great fully open app that does all of it cleanly across macOS and Android. But there are enough strong components now to build something real.

The problem is really four layers

Speech
  What it has to do: Turn voice into text quickly, privately, and with acceptable accuracy.
  Mac options that look real: whisper.cpp
  Android options that look real: FUTO Voice Input, Home Assistant Assist local pipelines

Local reasoning
  What it has to do: Route the transcript somewhere useful without forcing everything through the cloud.
  Mac options that look real: Ollama
  Android options that look real: PokeClaw for on-device phone-agent experiments, or a local Assist pipeline

Deterministic control
  What it has to do: Map known commands to reliable system actions.
  Mac options that look real: Hammerspoon, Karabiner-Elements, AppleScript
  Android options that look real: Home Assistant Assist on Android; if you relax the open-source requirement, Tasker is still hard to ignore

Open-ended agent work
  What it has to do: Handle browser flows, shell tasks, or ambiguous multi-step goals.
  Mac options that look real: OpenHands for browser and shell-style workflows
  Android options that look real: PokeClaw is the most interesting current open prototype I found

That breakdown is why I no longer think the right question is "what is the best open-source voice app?" The more useful question is "which layer is mature enough to trust, and which layer is still experimental?"

On Mac, dictation looks better than full voice control

whisper.cpp is the easiest part of this conversation now. Its README currently lists Mac OS on both Intel and Arm as supported platforms, describes the implementation as lightweight, and explicitly points toward offline voice-assistant-style uses. That is a much better starting point than the older "maybe local speech recognition is possible" era. Local speech on a Mac is not the fantasy layer anymore.
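
To make that layer concrete, here is a minimal sketch of driving whisper.cpp from a push-to-talk script. It assumes the project has been built locally and a ggml model downloaded; the binary name, model path, and flags are placeholders that vary between versions, not anything the project prescribes.

    # Minimal push-to-talk transcription via the whisper.cpp CLI.
    # Assumptions: whisper.cpp built locally, a ggml model downloaded, and the
    # recorder has already written a 16 kHz mono WAV file to disk.
    import subprocess

    WHISPER_BIN = "./build/bin/whisper-cli"   # older builds ship this as ./main
    MODEL_PATH = "models/ggml-base.en.bin"    # any Whisper ggml model works

    def transcribe(wav_path: str) -> str:
        """Run whisper.cpp on a recorded clip and return plain text."""
        result = subprocess.run(
            [WHISPER_BIN, "-m", MODEL_PATH, "-f", wav_path, "--no-timestamps"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

From there, everything interesting happens to the returned string, not to the audio.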

The action layer is more fragmented. Hammerspoon is still one of the cleanest open automation bridges on macOS: Lua in the middle, system APIs underneath, and enough hooks to control apps, windows, clipboard, and keyboard flows. Karabiner-Elements is not the planner, but it is a stable way to create the keyboard behavior you want around push-to-talk and command-mode switching. Put those together and you can build something useful: hold a key, transcribe locally, decide whether the result is dictation or command text, then either inject text or hand off to a macro/script layer.
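
As a sketch of the action half of that loop, assuming macOS Accessibility permission for the calling process and a Hammerspoon URL-event handler you have written yourself (the "voiceCommand" event name below is made up), the plumbing can stay small:

    # Dictation vs. handoff, once a transcript exists.
    # Assumptions: the script has Accessibility permission (needed for System
    # Events keystrokes) and Hammerspoon is running with an hs.urlevent handler
    # bound; "voiceCommand" is a hypothetical event name, not something it ships.
    import subprocess
    from urllib.parse import quote

    def inject_text(text: str) -> None:
        """Type the transcript into whatever field currently has focus."""
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        script = f'tell application "System Events" to keystroke "{escaped}"'
        subprocess.run(["osascript", "-e", script], check=True)

    def hand_off_to_macros(text: str) -> None:
        """Forward command-mode text to a Hammerspoon URL-event handler."""
        subprocess.run(
            ["open", "-g", f"hammerspoon://voiceCommand?text={quote(text)}"],
            check=True,
        )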

What I still do not think exists in a mature open form is a dependable Mac voice operator that feels like a native coworker instead of a stitched demo. The gap is not just "can an LLM write commands?" It is whether the stack knows when not to improvise, how to recover from UI drift, and how to keep a browser-shopping request from becoming a chaos engine. OpenHands helps if the target is shell work, browser automation, or coding-style tool use. But that is not the same thing as a polished desktop voice-control product.

On Android, the open-source frontier is getting stranger in a good way

The Android side used to look weaker to me because of system sandboxing. It still has those limits, but the current open tools are more interesting than I expected.

FUTO Voice Input makes the privacy case cleanly: on-device voice input, no data stored, source code linked directly from the project page. That solves the "I want usable speech input in normal apps without phoning home" part better than most people realize.

Home Assistant Assist is also more relevant here than its branding first suggests. The docs now make three important points explicit: Assist can work locally or with LLMs, its local pipeline can use Whisper for open-ended speech recognition and Piper for local text-to-speech, and on Android it can be installed as the default assistant app with wake-word support processed locally on-device. That does not turn Home Assistant into a general phone agent, but it does give you a real open voice shell that can live on the phone rather than in a browser tab.
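
If you want to script against that shell rather than only talk to it, a sketch like the following hands a transcript to Assist over Home Assistant's conversation API. The instance URL and long-lived token are placeholders for your own setup, and what comes back depends entirely on which agent the pipeline is using.

    # Send a transcript to Home Assistant Assist via the conversation API.
    # Assumptions: a reachable Home Assistant instance and a long-lived access
    # token created in your user profile; the URL and token are placeholders.
    import json
    import urllib.request

    HA_URL = "http://homeassistant.local:8123"
    HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"

    def send_to_assist(text: str) -> str:
        """POST a transcript to /api/conversation/process and return the spoken reply."""
        req = urllib.request.Request(
            f"{HA_URL}/api/conversation/process",
            data=json.dumps({"text": text, "language": "en"}).encode("utf-8"),
            headers={
                "Authorization": f"Bearer {HA_TOKEN}",
                "Content-Type": "application/json",
            },
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        # The reply text sits under response -> speech -> plain in the API's JSON.
        return body["response"]["speech"]["plain"]["speech"]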

The most surprising current project I found was PokeClaw. Its README describes it as an open-source Android app for AI phone automation, local-first by default, with on-device execution and optional cloud models when needed. I would still call it a prototype, because the project itself calls it a local-first prototype. But it is the first open Android project I found that seems to be treating "AI controls the phone itself" as the primary product lane rather than as a side experiment.

Android still imposes some annoying honesty on the dream. The Home Assistant docs are very clear that third-party wake-word detection on Android costs more battery than Google's own assistant because Google does not expose the same low-power hardware path to outside developers. That is exactly the kind of platform constraint people forget when they imagine a smooth open replacement for the system assistant. The software stack might be good enough. The operating system still has opinions.

The architecture I would actually trust

If I were building this for myself instead of just sketching the landscape, I would stop trying to make every spoken request equally agentic. I would route commands into three lanes:

Push-to-talk
  -> local speech layer
  -> command router
      -> plain dictation into the focused field
      -> deterministic macro or automation
      -> explicit agent task for browser or multi-step work

On Mac, that probably means whisper.cpp -> Ollama -> Hammerspoon/AppleScript, with Karabiner-Elements managing hotkeys and mode changes. On Android, it probably means FUTO Voice Input or a local Assist pipeline for speech, then either Home Assistant for structured device actions or PokeClaw for truly agentic phone control.

The key design choice is not the model name. It is the router. I would want the system to be boring by default and agentic on purpose. "Type this sentence into the current field" should not go through the same control path as "compare three products and take the cheapest one all the way to checkout." That split is what keeps voice control from becoming a demo-only party trick.
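
Here is roughly what I mean, as a sketch rather than a design anyone ships: deterministic matches win, an explicit trigger word opens the agent lane, and the local model only ever sees the ambiguous middle. The command names, the "agent" trigger, and the Ollama model name are all assumptions of mine.

    # A three-lane router: boring by default, agentic on purpose.
    # Assumptions: Ollama running locally with a small model pulled; the macro
    # names and the "agent" trigger word are invented for this sketch.
    import json
    import urllib.request

    KNOWN_COMMANDS = {
        "open terminal": "macro:open_terminal",
        "lock screen": "macro:lock_screen",
    }
    OLLAMA_URL = "http://localhost:11434/api/generate"

    def route(transcript: str) -> tuple[str, str]:
        """Return (lane, payload) where lane is 'macro', 'agent', or 'dictate'."""
        text = transcript.strip().lower()
        if text in KNOWN_COMMANDS:
            return "macro", KNOWN_COMMANDS[text]                 # deterministic, no model involved
        if text.startswith("agent "):
            return "agent", transcript.strip()[len("agent "):]   # explicit opt-in to agent work
        if looks_like_a_command(text):
            return second_opinion(transcript)
        return "dictate", transcript                             # default: just type it

    def looks_like_a_command(text: str) -> bool:
        # Crude heuristic: short, unpunctuated phrases get a second opinion.
        return len(text.split()) <= 6 and not text.endswith((".", "?", "!"))

    def second_opinion(transcript: str) -> tuple[str, str]:
        """Ask a local Ollama model whether this is dictation or a command."""
        prompt = (
            "Answer with exactly one word, DICTATE or COMMAND.\n"
            f"Input: {transcript}"
        )
        payload = {"model": "llama3.2", "prompt": prompt, "stream": False}
        req = urllib.request.Request(
            OLLAMA_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            answer = json.load(resp)["response"].strip().upper()
        # The macro layer downstream still has to match the text against
        # something it knows; the model only nominates the lane.
        return ("macro", transcript) if answer.startswith("COMMAND") else ("dictate", transcript)

The point is not this exact heuristic. It is that the model only ever widens the command lane, never the agent lane, which stays behind an explicit spoken trigger.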

Where I landed

My current read is that open voice control is real enough to build around now, but not real enough to pretend the whole stack has converged into one clean product. Speech is ahead. Local models are usable. Deterministic automation is solid in pockets. The shaky layer is still general-purpose action planning across messy real interfaces.

That is not a reason to dismiss the category. It is a reason to be more precise about what has actually gotten good. If what you want is private dictation, local routing, and a growing amount of structured control, the ingredients are here. If what you want is a fully open, deeply reliable, Claude-style voice operator that can run both your Mac and your phone without a lot of custom glue, I still think that is frontier territory.

And maybe that is the most useful framing change. The real question is no longer "can open voice AI exist?" It is "which parts of the voice stack deserve hardcoded reliability, and which parts are mature enough to hand to an agent loop?" That feels like the difference between a toy and a system I would actually trust.

Sources I found useful: whisper.cpp, Hammerspoon, Karabiner-Elements, Ollama, OpenHands, FUTO Voice Input, Home Assistant Assist overview, Home Assistant local voice pipeline, Assist on Android, PokeClaw, and Tasker.