🎙️ Phoneme

Local-first voice transcription for power users.

Hit a hotkey. Speak. Get text anywhere.

Phoneme runs 100% offline by default. No cloud required, no subscriptions, no telemetry.

🧠 Philosophy

Principle	What It Means
🔒 Privacy First	Voice never leaves your machine. No forced updates, no tracking.
⚡ Flexible	Transcription and the LLM steps are chosen independently. Transcribe locally (whisper.cpp) or on a cloud STT API (OpenAI, Groq, Deepgram, AssemblyAI, ElevenLabs); run cleanup/summary/title on local Ollama or a cloud LLM (OpenAI, Anthropic, Groq, Gemini, …). Each step picks its own provider.
🔌 Extensible	JSON output → your scripts. Obsidian, Notion, Jira, Discord, Python—wherever you want.

🎯 Why Voice?

Most people speak faster than they type, so voice is a good fit for a few kinds of work:

Capturing ideas hands-free — while walking, driving, or away from the keyboard. Hit a hotkey and talk.
Dictating prompts to an LLM — speaking a prompt often gives more context (pauses, asides, clarifications) than you'd bother typing.
Accessibility — for RSI, vision strain, dyslexia, or tremors, voice is an alternative to the keyboard.
Rough drafts — you say it once and let the cleanup step fix punctuation and formatting, instead of editing as you go.
Meetings — record the call and keep participating instead of stopping to take notes.

⚙️ How It Works

Phoneme uses a decoupled, pipeline-driven architecture.

%%{init: {'flowchart': {'curve': 'basis', 'useMaxWidth': false}, 'theme': 'dark', 'themeVariables': { 'fontSize': '12px' }}}%%
flowchart TD
    Input[🎤 Voice] -->|Hotkey| Daemon[Daemon]
    
    subgraph T [Transcribe]
        Daemon --> Whisper{Whisper}
        Whisper -->|Local/Cloud| Raw[Raw Text]
    end
    
    subgraph E [Enrich]
        Raw --> Diarize{Diarize}
        Diarize -->|Opt| Tagged[Tagged]
    end
    
    subgraph P [Process]
        Tagged --> LLM{LLM}
        LLM -->|Opt| Final[Final]
    end
    
    Final --> Catalog[(SQLite)]
    Final --> Hooks[[Hooks]]
    Hooks --> Dest[Obsidian/Webhooks/Type]

✨ Core Features

🎙️ Local transcription by default: A bundled whisper.cpp server runs on your machine — audio never leaves your PC. The First Run Wizard detects your RAM/VRAM and picks the right model.
🔌 Bring-your-own provider: Transcription, live preview, cleanup, summary, auto-title, and auto-tags each pick their own provider+model independently. Local whisper.cpp/Ollama for privacy, or cloud APIs (OpenAI, Anthropic, Groq, Gemini, Deepgram, AssemblyAI, ElevenLabs, and many more) for speed. One-click presets, live model lists.
👥 Meeting Mode (Dual-Track Capture): Capture both your microphone and your computer's audio as two linked tracks sharing a wall-clock timeline, merged into one chronological transcript. Optional speaker diarization (offline ONNX, or cloud) labels who spoke on any Zoom, Teams, or Meet call.
⌨️ Transcribe-in-Place (Ctrl+Alt+I): Speak with a global hotkey and Phoneme types (or pastes) your dictated words into the focused application (Word, Slack, Chrome, VS Code). A zero-latency fast lane skips the queue entirely so text lands the moment you stop talking.
✨ Smart Cleanup, AI Summary & Auto-Titles: Pipe raw transcripts through an LLM to fix stutters, reformat, or translate — and optionally generate a per-recording summary, on demand or automatically. Every recording gets a readable auto-title (a free heuristic, or an optional LLM title). Three transcript layers (raw → cleaned → edited) are kept so nothing is lost.
🔍 Keyword + Semantic Search: Manage thousands of recordings with SQLite FTS5 full-text search, or search by meaning with an offline, chunked hybrid index — per-passage ONNX embeddings fused with keyword ranking (RRF), cached in memory for fast recall, so a query finds the recording whether you remember the gist or the one distinctive word. More-like-this finds a recording's neighbours from its stored vectors (no re-embedding). Bring your own embedding model.
🏷️ Organize at scale: Tags with a full manager, ⭐ favorites, 📌 pinned recordings that sort to the top, saved searches that snapshot every filter, AI auto-tag suggestions you approve before they apply, and a side-by-side view for any two transcripts.
💬 Ask your archive (local RAG): Ask a plain-language question and get an answer drawn only from your own recordings, with a citation per claim. It reuses the same hybrid retriever as search; nothing leaves your machine beyond the LLM call you already use for cleanup.
🔎 Entities & ✅ tasks from voice: Optional LLM enrichment pulls structured entities (people / orgs / topics / terms) and task-shaped action items out of a transcript. Browse the library by entity (the entity counterpart of the tag facet), and check off, edit, or hand-add tasks — with a cross-recording "all tasks" view.
🕮 Topic timelines & digests: Split a long recording into time-ranged chapters, generate a whole-meeting digest across both meeting tracks, or roll up every recording in a date window into a daily/weekly digest.
⌨️ Dictation that fits the app: Text snippets (trigger → expansion), editable voice commands, and per-app tone — dictation picks its cleanup recipe by the focused app, so an email client gets a formal register and an editor gets a terse one.
📤 Import & export anything: Import .wav / .mp3 / .m4a / .flac straight into the pipeline; export the whole library as a portable zip, or any recording as SRT / WebVTT captions.
⌨️ Keyboard everything: Opt-in vim-style navigation drives all three panes (and the queue) — the detail pane is a 2D grid you walk with h/j/k/l, the waveform has an Enter-to-scrub mode (h/l ±1s, Space to play), g-chords jump anywhere, the list zooms with Ctrl+=/-, and ? shows the full cheat sheet.
🩺 Provider-aware Self-healing: A header health pill + Doctor watch the local servers and follow the effective connection each feature uses (cloud keys included); one click (or phoneme doctor --fix) sweeps a hung/orphaned whisper-server and respawns it from config.
♻️ Clean lifecycle: The daemon owns the work and outlives any window. Quit stops it cleanly (finalizing an in-flight take, killing its whisper/Ollama children) — or leave it running headless. A Phoneme-launched Ollama is started on demand and never touches an Ollama you already had running.
💻 CLI is a Peer: Every GUI action is a CLI command (phoneme record start). Bind it to AutoHotkey, Stream Deck, or Kanata.

🆚 Alternatives & Similar Projects

Phoneme isn't for everyone, and that's fine. If one of these fits your needs better, use it:

Wispr Flow — Highly polished, commercial, cloud-based. Types directly into your focused app.
MacWhisper & Superwhisper — Excellent local dictation for macOS.
AudioPen — Cloud web app that beautifully summarizes rambling thoughts.

Reach for Phoneme when you want it local-first, open-source, Windows-native, and endlessly scriptable.

📚 Documentation

Full documentation index →

Users

Guide	Topic
Getting Started	Install, wizard, first recording
Providers & Models	Pick STT/LLM providers, keys, local vs cloud
Meeting Mode	Dual-track capture + wall-clock sync
Hotkeys & Recording Modes	Hold, toggle, CLI bindings
Settings Overview	Every settings screen (with screenshots)
Smart Cleanup & Summary	LLM post-processing + AI summary
Semantic Search & Ask	Meaning-based recall + Ask your archive (cited answers)
FAQ	Common questions
Troubleshooting	Fixes and diagnostics

Developers

Guide	Topic
CONTRIBUTING.md	Dev setup, IPC workflow, PR checklist
Architecture	The full journey: three processes, a recording's life, recall path
Internals	Subsystem deep dives: async topology, audio, catalog/search, alignment math
Backend Guide	Rust workspace map, actors, supervision, SQLx/WAL
Config Reference	Full `config.toml` schema
IPC Integration	NDJSON named pipe
CLI Reference	All commands
Testing & CI	Local checks matching GitHub Actions
Roadmap	Planned features & direction
Changelog	Shipped releases

🚀 Quick Start

Download the latest MSI from the Releases page. The included First Run Wizard will detect your hardware and configure the optimal Whisper model automatically!

# Power users can bypass the UI entirely and use the CLI:
phoneme record start
phoneme record stop
phoneme list

📄 License

MIT OR Apache-2.0.

Phoneme is built by @namefailed. It is not a commercial product, has no telemetry, and never will.

💖 Support

If you find Phoneme useful, please consider supporting my work:

Name		Name	Last commit message	Last commit date
Latest commit History 1,582 Commits
.github		.github
archive_internal		archive_internal
bin		bin
crates		crates
docs		docs
frontend		frontend
hooks		hooks
installer		installer
scripts		scripts
src-tauri		src-tauri
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
README.md		README.md
ROADMAP.md		ROADMAP.md
config.example.toml		config.example.toml
rust-toolchain.toml		rust-toolchain.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🎙️ Phoneme

🧠 Philosophy

🎯 Why Voice?

⚙️ How It Works

✨ Core Features

🆚 Alternatives & Similar Projects

📚 Documentation

Users

Developers

🚀 Quick Start

📄 License

💖 Support

About

Licenses found

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🎙️ Phoneme

🧠 Philosophy

🎯 Why Voice?

⚙️ How It Works

✨ Core Features

🆚 Alternatives & Similar Projects

📚 Documentation

Users

Developers

🚀 Quick Start

📄 License

💖 Support

About

Topics

Resources

License

Licenses found

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages