Kade Heyborne 71c305a201
Major refactoring: v0.2.0 - Simplify to core dictation & read-aloud features
This is a comprehensive refactoring that transforms the dictation service from a
complex multi-mode application into two clean, focused features:
1. Voice dictation with system tray icon
2. On-demand read-aloud via Ctrl+middle-click

## Key Changes

### Dictation Service Enhancements
- Add GTK/AppIndicator3 system tray icon for visual status
- Remove all notification spam (dictation start/stop/status)
- Icon states: microphone-muted (OFF) → microphone-high (ON)
- Click tray icon to toggle dictation (same as Alt+D)
- Simplify ai_dictation_simple.py by removing conversation mode

### Read-Aloud Service Redesign
- Replace automatic clipboard reader with on-demand Ctrl+middle-click
- New middle_click_reader.py service
- Works anywhere: highlight text, Ctrl+middle-click to read
- Uses Edge-TTS (Christopher voice) with mpv playback
- Lock file prevents feedback with dictation service

### Conversation Mode Removed
- Delete all VLLM/conversation code (VLLMClient, ConversationManager, TTS)
- Archive 5 old implementations to archive/old_implementations/
- Remove conversation-related scripts and services
- Clean separation of concerns for future reintegration if needed

### Dependencies Cleanup
- Remove: openai, aiohttp, pyttsx3, requests (conversation deps)
- Keep: PyGObject, pynput, sounddevice, vosk, numpy, edge-tts
- Net reduction: 4 packages removed, 6 core packages retained

### Testing Improvements
- Add test_dictation_service.py (8 tests) 
- Add test_middle_click.py (11 tests) 
- Fix test_run.py to use correct model path
- Total: 19 unit tests passing
- Delete obsolete test files (test_suite, test_vllm_integration, etc.)

### Documentation
- Add CHANGES.md with complete changelog
- Add docs/MIGRATION_GUIDE.md for upgrading
- Add README.md with quick start guide
- Update docs/README.md with current features only
- Add justfile for common tasks

### New Services & Scripts
- Add middle-click-reader.service (systemd)
- Add scripts/setup-middle-click-reader.sh
- Add desktop files for autostart
- Remove toggle-conversation.sh (obsolete)

## Impact

**Code Quality**
- Net change: -6,007 lines (596 added, 6,603 deleted)
- Simpler architecture, easier maintenance
- Better test coverage (19 tests vs mixed before)
- Cleaner separation of concerns

**User Experience**
- No notification spam during dictation
- Clean visual status via tray icon
- Full control over read-aloud (no unwanted readings)
- Better performance (fewer background processes)

**Privacy**
- No conversation data stored
- No VLLM connection needed
- All processing local except Edge-TTS text

## Migration Notes

Users upgrading should:
1. Run `uv sync` to update dependencies
2. Restart dictation.service to get tray icon
3. Run scripts/setup-middle-click-reader.sh for new read-aloud
4. Remove old read-aloud.service if present

See docs/MIGRATION_GUIDE.md for details.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-10 19:11:06 -07:00


# Dictation Service - Complete Guide
Voice dictation with system tray control and on-demand text-to-speech for Linux.
## Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Troubleshooting](#troubleshooting)
- [Architecture](#architecture)
## Overview
This service provides two main features:
1. **Voice Dictation**: Real-time speech-to-text that types into any application
2. **Read-Aloud**: On-demand text-to-speech for highlighted text
The two features are designed not to interfere with each other: a lock file keeps dictation from transcribing text while read-aloud is speaking.
## Features
### Dictation Mode
- ✅ Real-time voice recognition using Vosk (offline)
- ✅ System tray icon for status (no notification spam)
- ✅ Toggle via Alt+D or tray icon click
- ✅ Automatic spurious word filtering
- ✅ Works with all applications
### Read-Aloud
- ✅ Ctrl+middle-click to read selected text
- ✅ High-quality neural voice (Microsoft Edge TTS)
- ✅ Works in any application
- ✅ On-demand only (no automatic reading)
- ✅ Prevents feedback loops with dictation
## Installation
See [INSTALL.md](INSTALL.md) for detailed installation instructions.
Quick install:
```bash
uv sync
./scripts/setup-keybindings.sh
./scripts/setup-middle-click-reader.sh
systemctl --user enable --now dictation.service
```
## Usage
### Dictation
**Starting:**
1. Press `Alt+D` (or click tray icon)
2. Microphone icon turns "on" in system tray
3. Speak normally
4. Words are typed into focused application
**Stopping:**
- Press `Alt+D` again (or click tray icon)
- Icon returns to "muted" state
**Tips:**
- Speak clearly and at normal pace
- Filler words such as "um" and "uh" are filtered out automatically
- Pause briefly between thoughts for better accuracy
### Read-Aloud
**Using:**
1. Highlight any text (in browser, PDF, editor, etc.)
2. Hold `Ctrl` and middle-click (press the scroll wheel)
3. Text is read aloud
**Tips:**
- Works on any highlighted text
- No need to enable/disable - always ready
- Only reads when you Ctrl+middle-click
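The selection-capture step can be sketched in a few lines. This is a minimal sketch, assuming the reader obtains the highlighted text from the X11 primary selection via `xclip` (the same command used under Troubleshooting). The function names `normalize_for_speech` and `read_primary_selection` are hypothetical and are not taken from `middle_click_reader.py`.

```python
import subprocess


def normalize_for_speech(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def read_primary_selection(timeout: float = 2.0) -> str:
    """Return the current X11 primary selection, or "" if it cannot be read.

    Assumes xclip is installed; mirrors `xclip -o -selection primary`.
    """
    try:
        result = subprocess.run(
            ["xclip", "-o", "-selection", "primary"],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return ""
    if result.returncode != 0:
        return ""
    return normalize_for_speech(result.stdout)
```

Normalizing whitespace first means text copied from PDFs or narrow columns is passed on as a single line rather than with its hard line breaks.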
## Configuration
### Speech Recognition Models
Switch models for different speed/accuracy trade-offs:
```bash
./scripts/switch-model.sh
```
**Available models:**
- `vosk-model-small-en-us-0.15` - Fast, basic accuracy
- `vosk-model-en-us-0.22-lgraph` - Balanced (default)
- `vosk-model-en-us-0.22` - Best accuracy (~5.69% WER)
### TTS Voice
Edit `src/dictation_service/middle_click_reader.py`:
```python
EDGE_TTS_VOICE = "en-US-ChristopherNeural"
```
List available voices:
```bash
edge-tts --list-voices
```
Popular options:
- `en-US-JennyNeural` (female, friendly)
- `en-US-GuyNeural` (male, professional)
- `en-GB-RyanNeural` (British male)
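To try a voice outside the service, the synthesis and playback steps can be driven from a short script. This is a sketch under stated assumptions: it shells out to the `edge-tts` and `mpv` command-line tools rather than reproducing what `middle_click_reader.py` does internally, and the helper names `build_tts_command`, `build_playback_command`, and `speak` are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path

EDGE_TTS_VOICE = "en-US-ChristopherNeural"


def build_tts_command(text: str, out_path: Path, voice: str = EDGE_TTS_VOICE) -> list:
    """Command line that asks edge-tts to write synthesized speech to out_path."""
    return ["edge-tts", "--voice", voice, "--text", text, "--write-media", str(out_path)]


def build_playback_command(media_path: Path) -> list:
    """Command line that plays an audio file with mpv, without opening a window."""
    return ["mpv", "--no-video", "--really-quiet", str(media_path)]


def speak(text: str, voice: str = EDGE_TTS_VOICE) -> None:
    """Synthesize text to a temporary file, then play it (requires network access)."""
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "speech.mp3"
        subprocess.run(build_tts_command(text, out_path, voice), check=True)
        subprocess.run(build_playback_command(out_path), check=True)
```

For example, `speak("Testing a different voice", voice="en-GB-RyanNeural")` should read the sentence aloud with the British voice listed above.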
### Audio Settings
Edit `src/dictation_service/ai_dictation_simple.py`:
```python
SAMPLE_RATE = 16000 # Higher = better quality, more CPU
BLOCK_SIZE = 4000 # Lower = less latency, less accurate
```
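The latency effect of these two settings is simple arithmetic: each audio block holds `BLOCK_SIZE / SAMPLE_RATE` seconds of sound, and recognition cannot start on a block until it has been fully captured. The sketch below assumes `BLOCK_SIZE` is measured in frames (samples per channel); the helper name `block_duration_ms` is illustrative only.

```python
SAMPLE_RATE = 16000  # samples per second
BLOCK_SIZE = 4000    # assumed to be frames per audio block


def block_duration_ms(block_size: int, sample_rate: int) -> float:
    """Milliseconds of audio contained in one block."""
    return 1000.0 * block_size / sample_rate


print(block_duration_ms(BLOCK_SIZE, SAMPLE_RATE))  # 250.0
print(block_duration_ms(2000, SAMPLE_RATE))        # 125.0
```

With the defaults, one block is 250 ms of audio, which lines up with the ~250ms voice-to-text figure listed under Performance. Halving `BLOCK_SIZE` halves that buffering delay.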
## Troubleshooting
### System Tray Icon Missing
```bash
# Install AppIndicator
sudo apt-get install gir1.2-appindicator3-0.1
# For GNOME Shell
sudo apt-get install gnome-shell-extension-appindicator
# Restart
systemctl --user restart dictation.service
```
### Dictation Not Typing
```bash
# Check ydotool status
systemctl status ydotool
# Start if needed
sudo systemctl enable --now ydotool
# Add user to input group
sudo usermod -aG input $USER
# Log out and back in
```
### Middle-Click Not Working
```bash
# Check service
systemctl --user status middle-click-reader
# View logs
journalctl --user -u middle-click-reader -f
# Test selection
echo "test" | xclip -selection primary
xclip -o -selection primary
```
### Poor Recognition Accuracy
1. **Check microphone:**
```bash
arecord -d 3 test.wav
aplay test.wav
```
2. **Try better model:**
```bash
./scripts/switch-model.sh
# Select vosk-model-en-us-0.22
```
3. **Reduce background noise**
4. **Speak more clearly and slowly**
### Service Won't Start
```bash
# View detailed logs
journalctl --user -u dictation.service -n 50
# Check for errors
tail -f ~/.cache/dictation_service.log
# Verify model exists
ls ~/.shared/models/vosk-models/
```
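The model check can also be scripted. This is a minimal sketch, assuming models live in subdirectories of `~/.shared/models/vosk-models/` as in the `ls` command above; the function name `find_models` is hypothetical, and the check only confirms that a directory exists and is non-empty, not that it is a valid Vosk model.

```python
from pathlib import Path

MODELS_DIR = Path.home() / ".shared" / "models" / "vosk-models"


def find_models(models_dir: Path = MODELS_DIR) -> list:
    """Return the names of non-empty subdirectories, sorted alphabetically."""
    if not models_dir.is_dir():
        return []
    return sorted(
        entry.name
        for entry in models_dir.iterdir()
        if entry.is_dir() and any(entry.iterdir())
    )


if __name__ == "__main__":
    models = find_models()
    if models:
        print("Found models:", ", ".join(models))
    else:
        print(f"No models found under {MODELS_DIR}")
```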
## Architecture
### Components
```
┌─────────────────────────────────┐
│ System Tray Icon (GTK)          │
│ - Visual status indicator       │
│ - Click to toggle dictation     │
└─────────────────────────────────┘
┌─────────────────────────────────┐
│ Dictation Service (Main)        │
│ - Audio capture                 │
│ - Speech recognition (Vosk)     │
│ - Text typing (ydotool)         │
│ - Lock file management          │
└─────────────────────────────────┘
            Focused App
┌─────────────────────────────────┐
│ Middle-Click Reader Service     │
│ - Mouse event monitoring        │
│ - Selection capture (xclip)     │
│ - Text-to-speech (edge-tts)     │
│ - Audio playback (mpv)          │
└─────────────────────────────────┘
```
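The dictation path in the diagram (audio capture, Vosk recognition, typing via ydotool) can be sketched as follows. This is an illustrative outline based on the common Vosk microphone pattern, not the actual contents of `ai_dictation_simple.py`. The names `extract_text`, `type_text`, and `run_dictation` are hypothetical.

```python
import json
import queue
import subprocess


def extract_text(result_json: str) -> str:
    """Pull the recognized phrase out of a Vosk result JSON string."""
    return json.loads(result_json).get("text", "").strip()


def type_text(text: str) -> None:
    """Type text into the focused window (needs the ydotool daemon running)."""
    subprocess.run(["ydotool", "type", text + " "], check=True)


def run_dictation(model_path: str, sample_rate: int = 16000, block_size: int = 4000) -> None:
    """Capture microphone audio, recognize it with Vosk, and type the result."""
    # Imported here so the helpers above work without the audio libraries.
    import sounddevice as sd
    from vosk import KaldiRecognizer, Model

    audio_blocks = queue.Queue()

    def on_audio(indata, frames, time_info, status) -> None:
        # Runs on the audio thread: hand the raw bytes to the main loop.
        audio_blocks.put(bytes(indata))

    recognizer = KaldiRecognizer(Model(model_path), sample_rate)
    with sd.RawInputStream(samplerate=sample_rate, blocksize=block_size,
                           dtype="int16", channels=1, callback=on_audio):
        while True:
            # AcceptWaveform returns True once a complete utterance is ready.
            if recognizer.AcceptWaveform(audio_blocks.get()):
                text = extract_text(recognizer.Result())
                if text:
                    type_text(text)
```

The queue decouples the audio callback from recognition, so a slow recognition step delays typing instead of blocking the audio thread.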
### Lock Files
- `listening.lock` - Dictation active
- `/tmp/dictation_speaking.lock` - TTS playing (prevents feedback)
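The feedback-prevention idea can be sketched as a pair of helpers: the reader holds the lock file while audio is playing, and the dictation side drops incoming audio while the file exists. This is a minimal sketch of that technique. The names `speaking_lock` and `should_ignore_audio` are hypothetical, and the real services may create and check the file differently.

```python
import contextlib
from pathlib import Path

SPEAKING_LOCK = Path("/tmp/dictation_speaking.lock")


@contextlib.contextmanager
def speaking_lock(lock_path: Path = SPEAKING_LOCK):
    """Create the lock file for the duration of TTS playback, then remove it."""
    lock_path.touch()
    try:
        yield lock_path
    finally:
        # Remove the lock even if playback raised, so dictation is not left muted.
        lock_path.unlink(missing_ok=True)


def should_ignore_audio(lock_path: Path = SPEAKING_LOCK) -> bool:
    """Dictation side: skip incoming audio while the reader is speaking."""
    return lock_path.exists()
```

On the reader side, playback would be wrapped as `with speaking_lock(): ...`, and the dictation loop would call `should_ignore_audio()` before feeding each block to the recognizer.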
### Logs
- Dictation: `~/.cache/dictation_service.log`
- Read-aloud: `~/.cache/middle_click_reader.log`
- Systemd: `journalctl --user -u <service-name>`
## Managing Services
### Dictation Service
```bash
# Status
systemctl --user status dictation.service
# Start/stop
systemctl --user start dictation.service
systemctl --user stop dictation.service
# Enable/disable auto-start
systemctl --user enable dictation.service
systemctl --user disable dictation.service
# View logs
journalctl --user -u dictation.service -f
# Restart after changes
systemctl --user restart dictation.service
```
### Read-Aloud Service
```bash
# Status
systemctl --user status middle-click-reader
# Start/stop
systemctl --user start middle-click-reader
systemctl --user stop middle-click-reader
# Enable/disable
systemctl --user enable middle-click-reader
systemctl --user disable middle-click-reader
# Logs
journalctl --user -u middle-click-reader -f
```
## Performance
### Resource Usage
- Dictation (idle): ~50MB RAM
- Dictation (active): ~200-500MB RAM (model dependent)
- Read-aloud: ~30MB RAM
- CPU: Minimal idle, moderate during recognition
### Latency
- Voice to text: ~250ms
- Text typing: <50ms
- Read-aloud start: ~500ms
## Privacy & Security
- ✅ All speech recognition is local (no cloud)
- ✅ Only the selected text is sent to Edge TTS (no voice data)
- ✅ Services run as user (not system-wide)
- ✅ No telemetry or external connections (except TTS)
- ✅ No conversation data is stored (conversation mode has been removed)
## Advanced
### Custom Filtering
Edit the spurious word list in `ai_dictation_simple.py`:
```python
spurious_words = {"the", "a", "an"}
```
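How the set is applied is not shown here, so the following is only one plausible reading: a recognized phrase is discarded when it consists entirely of spurious words (for example a lone "the" produced by background noise), while phrases containing other words pass through unchanged. The function name `filter_spurious` is hypothetical.

```python
SPURIOUS_WORDS = {"the", "a", "an"}


def filter_spurious(text: str, spurious=SPURIOUS_WORDS) -> str:
    """Return "" when the phrase consists only of spurious words, else the phrase."""
    words = text.lower().split()
    if not words or all(word in spurious for word in words):
        return ""
    return text.strip()


print(repr(filter_spurious("the")))          # ''
print(repr(filter_spurious("the cat sat")))  # 'the cat sat'
```

Filtering whole phrases rather than individual words keeps legitimate uses of "the" or "a" inside a sentence intact.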
### Custom Keybinding
Edit `scripts/setup-keybindings.sh` to change from Alt+D.
### Debugging
Enable debug logging:
```python
import logging

logging.basicConfig(
    level=logging.DEBUG,  # Change from INFO
)
```
## See Also
- [INSTALL.md](INSTALL.md) - Installation guide
- [MIGRATION_GUIDE.md](MIGRATION_GUIDE.md) - Upgrading from old version
- [TESTING_SUMMARY.md](TESTING_SUMMARY.md) - Test coverage