# Dictation Service - Complete Guide

Voice dictation with system tray control and on-demand text-to-speech for Linux.

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Troubleshooting](#troubleshooting)
- [Architecture](#architecture)

## Overview

This service provides two main features:

1. **Voice Dictation**: Real-time speech-to-text that types into any application
2. **Read-Aloud**: On-demand text-to-speech for highlighted text

Both features work seamlessly together without interference.

## Features

### Dictation Mode

- ✅ Real-time voice recognition using Vosk (offline)
- ✅ System tray icon for status (no notification spam)
- ✅ Toggle via Alt+D or tray icon click
- ✅ Automatic spurious word filtering
- ✅ Works with all applications

### Read-Aloud

- ✅ Middle-click to read selected text
- ✅ High-quality neural voice (Microsoft Edge TTS)
- ✅ Works in any application
- ✅ On-demand only (no automatic reading)
- ✅ Prevents feedback loops with dictation

## Installation

See [INSTALL.md](INSTALL.md) for detailed installation instructions.

Quick install:

```bash
uv sync
./scripts/setup-keybindings.sh
./scripts/setup-middle-click-reader.sh
systemctl --user enable --now dictation.service
```

## Usage

### Dictation

**Starting:**

1. Press `Alt+D` (or click tray icon)
2. Microphone icon turns "on" in system tray
3. Speak normally
4. Words are typed into focused application

**Stopping:**

- Press `Alt+D` again (or click tray icon)
- Icon returns to "muted" state

**Tips:**

- Speak clearly and at normal pace
- Avoid filler words like "um", "uh" (automatically filtered)
- Pause briefly between thoughts for better accuracy

### Read-Aloud

**Using:**

1. Highlight any text (in browser, PDF, editor, etc.)
2. Middle-click (press scroll wheel)
3. Text is read aloud

**Tips:**

- Works on any highlighted text
- No need to enable/disable - always ready
- Only reads when you middle-click

## Configuration

### Speech Recognition Models

Switch models for different speed/accuracy trade-offs:

```bash
./scripts/switch-model.sh
```

**Available models:**

- `vosk-model-small-en-us-0.15` - Fast, basic accuracy
- `vosk-model-en-us-0.22-lgraph` - Balanced (default)
- `vosk-model-en-us-0.22` - Best accuracy (~5.69% WER)

### TTS Voice

Edit `src/dictation_service/middle_click_reader.py`:

```python
EDGE_TTS_VOICE = "en-US-ChristopherNeural"
```

List available voices:

```bash
edge-tts --list-voices
```

Popular options:

- `en-US-JennyNeural` (female, friendly)
- `en-US-GuyNeural` (male, professional)
- `en-GB-RyanNeural` (British male)

### Audio Settings

Edit `src/dictation_service/ai_dictation_simple.py`:

```python
SAMPLE_RATE = 16000  # Higher = better quality, more CPU
BLOCK_SIZE = 4000    # Lower = less latency, less accurate
```

## Troubleshooting

### System Tray Icon Missing

```bash
# Install AppIndicator
sudo apt-get install gir1.2-appindicator3-0.1

# For GNOME Shell
sudo apt-get install gnome-shell-extension-appindicator

# Restart
systemctl --user restart dictation.service
```

### Dictation Not Typing

```bash
# Check ydotool status
systemctl status ydotool

# Start if needed
sudo systemctl enable --now ydotool

# Add user to input group
sudo usermod -aG input $USER
# Log out and back in
```

### Middle-Click Not Working

```bash
# Check service
systemctl --user status middle-click-reader

# View logs
journalctl --user -u middle-click-reader -f

# Test selection
echo "test" | xclip -selection primary
xclip -o -selection primary
```

### Poor Recognition Accuracy

1. **Check microphone:**
   ```bash
   arecord -d 3 test.wav
   aplay test.wav
   ```

2. **Try better model:**
   ```bash
   ./scripts/switch-model.sh
   # Select vosk-model-en-us-0.22
   ```

3. **Reduce background noise**

4. **Speak more clearly and slowly**

### Service Won't Start

```bash
# View detailed logs
journalctl --user -u dictation.service -n 50

# Check for errors
tail -f ~/.cache/dictation_service.log

# Verify model exists
ls ~/.shared/models/vosk-models/
```

## Architecture

### Components

```
┌─────────────────────────────────┐
│  System Tray Icon (GTK)         │
│  - Visual status indicator      │
│  - Click to toggle dictation    │
└─────────────────────────────────┘
                ↓
┌─────────────────────────────────┐
│  Dictation Service (Main)       │
│  - Audio capture                │
│  - Speech recognition (Vosk)    │
│  - Text typing (ydotool)        │
│  - Lock file management         │
└─────────────────────────────────┘
                ↓
           Focused App

┌─────────────────────────────────┐
│  Middle-Click Reader Service    │
│  - Mouse event monitoring       │
│  - Selection capture (xclip)    │
│  - Text-to-speech (edge-tts)    │
│  - Audio playback (mpv)         │
└─────────────────────────────────┘
```

### Lock Files

- `listening.lock` - Dictation active
- `/tmp/dictation_speaking.lock` - TTS playing (prevents feedback)

### Logs

- Dictation: `~/.cache/dictation_service.log`
- Read-aloud: `~/.cache/middle_click_reader.log`
- Systemd: `journalctl --user -u `

## Managing Services

### Dictation Service

```bash
# Status
systemctl --user status dictation.service

# Start/stop
systemctl --user start dictation.service
systemctl --user stop dictation.service

# Enable/disable auto-start
systemctl --user enable dictation.service
systemctl --user disable dictation.service

# View logs
journalctl --user -u dictation.service -f

# Restart after changes
systemctl --user restart dictation.service
```

### Read-Aloud Service

```bash
# Status
systemctl --user status middle-click-reader

# Start/stop
systemctl --user start middle-click-reader
systemctl --user stop middle-click-reader

# Enable/disable
systemctl --user enable middle-click-reader
systemctl --user disable middle-click-reader

# Logs
journalctl --user -u middle-click-reader -f
```

## Performance

### Resource Usage

- Dictation (idle): ~50MB RAM
- Dictation (active): ~200-500MB RAM (model dependent)
- Read-aloud: ~30MB RAM
- CPU: Minimal idle, moderate during recognition

### Latency

- Voice to text: ~250ms
- Text typing: <50ms
- Read-aloud start: ~500ms

## Privacy & Security

- ✅ All speech recognition is local (no cloud)
- ✅ Only text sent to Edge TTS (no voice data)
- ✅ Services run as user (not system-wide)
- ✅ No telemetry or external connections (except TTS)
- ✅ Conversation data stays on your machine

## Advanced

### Custom Filtering

Edit spurious word list in `ai_dictation_simple.py`:

```python
spurious_words = {"the", "a", "an"}
```

### Custom Keybinding

Edit `scripts/setup-keybindings.sh` to change from Alt+D.

### Debugging

Enable debug logging:

```python
logging.basicConfig(
    level=logging.DEBUG  # Change from INFO
)
```

## See Also

- [INSTALL.md](INSTALL.md) - Installation guide
- [MIGRATION_GUIDE.md](MIGRATION_GUIDE.md) - Upgrading from old version
- [TESTING_SUMMARY.md](TESTING_SUMMARY.md) - Test coverage
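
To make the spurious word filtering and the `/tmp/dictation_speaking.lock` feedback guard described in this guide more concrete, here is a minimal Python sketch of how the two checks could combine before text is typed. The helper names `is_spurious` and `should_type` are hypothetical and not taken from `ai_dictation_simple.py`. The rule that a result is dropped only when every word in it is spurious is also an assumption; the actual service may filter differently.

```python
from pathlib import Path

# Values quoted from this guide.
SPEAKING_LOCK = Path("/tmp/dictation_speaking.lock")
spurious_words = {"the", "a", "an"}


def is_spurious(text: str) -> bool:
    """Return True if the result is empty or made up only of spurious words.

    Assumption: filtering applies to whole recognition results, so a lone
    "the" is dropped while "the cat" is kept.
    """
    words = text.lower().split()
    return all(word in spurious_words for word in words)


def should_type(text: str, speaking_lock: Path = SPEAKING_LOCK) -> bool:
    """Decide whether a recognized phrase should be typed (hypothetical helper)."""
    if speaking_lock.exists():
        # TTS is playing: skip the result so read-aloud audio picked up by
        # the microphone is not typed back into the focused application.
        return False
    return not is_spurious(text)
```

With this sketch, `is_spurious("the cat")` returns `False`, while `should_type` rejects every phrase for as long as the speaking lock file exists. Passing the lock path as a parameter keeps the check easy to exercise against a temporary file.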