# Dictation Service - Complete Guide

Voice dictation with system tray control and on-demand text-to-speech for Linux.

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Troubleshooting](#troubleshooting)
- [Architecture](#architecture)
- [Managing Services](#managing-services)
- [Performance](#performance)
- [Privacy & Security](#privacy--security)
- [Advanced](#advanced)
- [See Also](#see-also)

## Overview

This service provides two main features:
1. **Voice Dictation**: Real-time speech-to-text that types into any application
2. **Read-Aloud**: On-demand text-to-speech for highlighted text

Both features can run at the same time: a lock file marks when text is being read aloud, which prevents dictation from transcribing the TTS output.

## Features

### Dictation Mode
- ✅ Real-time voice recognition using Vosk (offline)
- ✅ System tray icon for status (no notification spam)
- ✅ Toggle via Alt+D or tray icon click
- ✅ Automatic spurious word filtering
- ✅ Works with all applications

### Read-Aloud
- ✅ Middle-click to read selected text
- ✅ High-quality neural voice (Microsoft Edge TTS)
- ✅ Works in any application
- ✅ On-demand only (no automatic reading)
- ✅ Prevents feedback loops with dictation

## Installation

See [INSTALL.md](INSTALL.md) for detailed installation instructions.

Quick install:
```bash
uv sync
./scripts/setup-keybindings.sh
./scripts/setup-middle-click-reader.sh
systemctl --user enable --now dictation.service
```

## Usage

### Dictation

**Starting:**
1. Press `Alt+D` (or click tray icon)
2. Microphone icon turns "on" in system tray
3. Speak normally
4. Words are typed into focused application

**Stopping:**
- Press `Alt+D` again (or click tray icon)
- Icon returns to "muted" state

**Tips:**
- Speak clearly and at normal pace
- Avoid filler words like "um" and "uh" where you can (they are filtered automatically)
- Pause briefly between thoughts for better accuracy

### Read-Aloud

**Using:**
1. Highlight any text (in browser, PDF, editor, etc.)
2. Middle-click (press scroll wheel)
3. Text is read aloud

**Tips:**
- Works on any highlighted text
- No need to enable/disable - always ready
- Only reads when you middle-click

## Configuration

### Speech Recognition Models

Switch models for different speed/accuracy trade-offs:

```bash
./scripts/switch-model.sh
```

**Available models:**
- `vosk-model-small-en-us-0.15` - Fast, basic accuracy
- `vosk-model-en-us-0.22-lgraph` - Balanced (default)
- `vosk-model-en-us-0.22` - Best accuracy (~5.69% WER on LibriSpeech test-clean)

### TTS Voice

Edit `src/dictation_service/middle_click_reader.py`:

```python
EDGE_TTS_VOICE = "en-US-ChristopherNeural"
```

List available voices:
```bash
edge-tts --list-voices
```

Popular options:
- `en-US-JennyNeural` (female, friendly)
- `en-US-GuyNeural` (male, professional)
- `en-GB-RyanNeural` (British male)
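
To audition a voice before editing `EDGE_TTS_VOICE`, one option is to call the `edge-tts` CLI directly. The sketch below only builds the command; `build_tts_command` and the output path are hypothetical and not part of this project, while `--voice`, `--text`, and `--write-media` are options of the `edge-tts` command-line tool.

```python
import shlex


def build_tts_command(text: str, voice: str, out_path: str) -> list[str]:
    """Return an edge-tts invocation that writes speech for `text` to `out_path`."""
    return ["edge-tts", "--voice", voice, "--text", text, "--write-media", out_path]


# Hypothetical sample text and output file, chosen only for illustration.
cmd = build_tts_command("Testing the new voice.", "en-US-JennyNeural", "/tmp/voice_sample.mp3")
print(shlex.join(cmd))

# To synthesize the sample for real (needs network access):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Play the resulting file with any audio player (for example `mpv`) and, if the voice suits you, set it in `middle_click_reader.py`.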

### Audio Settings

Edit `src/dictation_service/ai_dictation_simple.py`:

```python
SAMPLE_RATE = 16000   # Hz; higher = better quality, more CPU
BLOCK_SIZE = 4000     # Lower = less latency, less accurate
```
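
To get a feel for the latency side of this trade-off, it can help to convert the block size into a duration. The sketch below assumes `BLOCK_SIZE` is measured in audio frames (samples per channel), which this guide does not state explicitly; `block_duration_ms` is a hypothetical helper.

```python
SAMPLE_RATE = 16000  # Hz
BLOCK_SIZE = 4000    # assumed to be frames per block


def block_duration_ms(block_size: int, sample_rate: int) -> float:
    """Length of one audio block in milliseconds."""
    return 1000.0 * block_size / sample_rate


for size in (2000, 4000, 8000):
    print(f"BLOCK_SIZE={size}: {block_duration_ms(size, SAMPLE_RATE):.0f} ms per block")
```

Under that assumption the defaults give 250 ms of audio per block, so the recognizer cannot respond faster than that; halving `BLOCK_SIZE` halves the wait but gives the recognizer less audio to work with at each step.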

## Troubleshooting

### System Tray Icon Missing

```bash
# Install AppIndicator
sudo apt-get install gir1.2-appindicator3-0.1

# For GNOME Shell
sudo apt-get install gnome-shell-extension-appindicator

# Restart
systemctl --user restart dictation.service
```

### Dictation Not Typing

```bash
# Check ydotool status
systemctl status ydotool

# Start if needed
sudo systemctl enable --now ydotool

# Add user to input group
sudo usermod -aG input "$USER"
# Log out and back in
```

### Middle-Click Not Working

```bash
# Check service
systemctl --user status middle-click-reader

# View logs
journalctl --user -u middle-click-reader -f

# Test selection
echo "test" | xclip -selection primary
xclip -o -selection primary
```

### Poor Recognition Accuracy

1. **Check microphone:**
   ```bash
   arecord -d 3 test.wav
   aplay test.wav
   ```

2. **Try better model:**
   ```bash
   ./scripts/switch-model.sh
   # Select vosk-model-en-us-0.22
   ```

3. **Reduce background noise**
4. **Speak more clearly and slowly**

### Service Won't Start

```bash
# View detailed logs
journalctl --user -u dictation.service -n 50

# Check for errors
tail -f ~/.cache/dictation_service.log

# Verify model exists
ls ~/.shared/models/vosk-models/
```
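
The same model check can be scripted, for example as a pre-flight step. This is a minimal sketch: `list_models` is a hypothetical helper, and it assumes model directories keep their upstream `vosk-model...` names under the path shown above.

```python
from pathlib import Path

MODEL_ROOT = Path.home() / ".shared" / "models" / "vosk-models"


def list_models(root: Path) -> list[str]:
    """Return names of subdirectories of `root` that look like Vosk models."""
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and p.name.startswith("vosk-model")
    )


models = list_models(MODEL_ROOT)
print(models if models else f"No models found under {MODEL_ROOT}")
```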

## Architecture

### Components

```
┌─────────────────────────────────┐
│     System Tray Icon (GTK)      │
│   - Visual status indicator     │
│   - Click to toggle dictation   │
└─────────────────────────────────┘
              ↓
┌─────────────────────────────────┐
│   Dictation Service (Main)      │
│   - Audio capture               │
│   - Speech recognition (Vosk)   │
│   - Text typing (ydotool)       │
│   - Lock file management        │
└─────────────────────────────────┘
              ↓
         Focused App


┌─────────────────────────────────┐
│  Middle-Click Reader Service    │
│   - Mouse event monitoring      │
│   - Selection capture (xclip)   │
│   - Text-to-speech (edge-tts)   │
│   - Audio playback (mpv)        │
└─────────────────────────────────┘
```
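
The last two steps of the dictation path (recognition result to typed text) can be sketched roughly as follows. This is not the service's actual code: `extract_text` and `build_type_command` are hypothetical helpers, and the trailing space after each utterance is an assumption. The sketch relies only on Vosk returning results as JSON strings with a `text` field and on `ydotool type` typing its argument.

```python
import json


def extract_text(result_json: str) -> str:
    """Pull the recognized text out of a Vosk result (a JSON string)."""
    return json.loads(result_json).get("text", "").strip()


def build_type_command(text: str) -> list[str]:
    """Return a ydotool invocation that types `text` followed by a space."""
    return ["ydotool", "type", text + " "]


# Example payload in the shape returned by KaldiRecognizer.Result().
result = '{"text": "hello world"}'
text = extract_text(result)
if text:
    # The real service would execute this command; here it is only printed.
    print(build_type_command(text))
```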

### Lock Files

- `listening.lock` - Dictation active
- `/tmp/dictation_speaking.lock` - TTS playing (prevents feedback)
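
A minimal sketch of how a lock file like this can prevent feedback: the reader holds the file while audio plays, and the dictation loop skips audio while it exists. The names `speaking` and `should_transcribe` are hypothetical, and the actual services may coordinate differently.

```python
import contextlib
import tempfile
from pathlib import Path

SPEAKING_LOCK = Path("/tmp/dictation_speaking.lock")


@contextlib.contextmanager
def speaking(lock: Path = SPEAKING_LOCK):
    """Create the lock file for the duration of TTS playback, then remove it."""
    lock.touch()
    try:
        yield
    finally:
        lock.unlink(missing_ok=True)


def should_transcribe(lock: Path = SPEAKING_LOCK) -> bool:
    """Dictation ignores microphone audio while the speaking lock exists."""
    return not lock.exists()


# Demonstration against a throwaway directory rather than the real lock path.
with tempfile.TemporaryDirectory() as tmp:
    demo_lock = Path(tmp) / "dictation_speaking.lock"
    with speaking(demo_lock):
        print(should_transcribe(demo_lock))  # False while "speaking"
    print(should_transcribe(demo_lock))      # True once playback ends
```

Removing the file in a `finally` block means a crash during playback is less likely to leave a stale lock that silences dictation.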

### Logs

- Dictation: `~/.cache/dictation_service.log`
- Read-aloud: `~/.cache/middle_click_reader.log`
- Systemd: `journalctl --user -u <service-name>`

## Managing Services

### Dictation Service

```bash
# Status
systemctl --user status dictation.service

# Start/stop
systemctl --user start dictation.service
systemctl --user stop dictation.service

# Enable/disable auto-start
systemctl --user enable dictation.service
systemctl --user disable dictation.service

# View logs
journalctl --user -u dictation.service -f

# Restart after changes
systemctl --user restart dictation.service
```

### Read-Aloud Service

```bash
# Status
systemctl --user status middle-click-reader

# Start/stop
systemctl --user start middle-click-reader
systemctl --user stop middle-click-reader

# Enable/disable
systemctl --user enable middle-click-reader
systemctl --user disable middle-click-reader

# Logs
journalctl --user -u middle-click-reader -f
```

## Performance

### Resource Usage
- Dictation (idle): ~50MB RAM
- Dictation (active): ~200-500MB RAM (model dependent)
- Read-aloud: ~30MB RAM
- CPU: Minimal idle, moderate during recognition

### Latency
- Voice to text: ~250ms
- Text typing: <50ms
- Read-aloud start: ~500ms

## Privacy & Security

- ✅ All speech recognition is local (no cloud)
- ✅ Read-aloud sends only the selected text to Edge TTS (no voice data)
- ✅ Services run as user (not system-wide)
- ✅ No telemetry or external connections (except TTS)
- ✅ Dictation audio and transcripts stay on your machine

## Advanced

### Custom Filtering

Edit the spurious word list in `src/dictation_service/ai_dictation_simple.py`:

```python
spurious_words = {"the", "a", "an"}
```
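
How the list is applied is not documented here. One plausible reading, sketched below as an assumption, is that an utterance is dropped only when it consists entirely of spurious words, since removing every "the" or "a" would mangle normal sentences. `is_spurious` is a hypothetical helper, not a function from the service.

```python
spurious_words = {"the", "a", "an"}


def is_spurious(utterance: str, words: set[str] = spurious_words) -> bool:
    """True when the utterance is empty or contains only spurious words."""
    tokens = utterance.lower().split()
    return all(token in words for token in tokens)


for utterance in ("the", "the quick brown fox", ""):
    verdict = "dropped" if is_spurious(utterance) else "typed"
    print(repr(utterance), "->", verdict)
```

Under this reading, adding a word to the set suppresses it only when it is recognized on its own, which is the usual symptom of background noise being misheard.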

### Custom Keybinding

Edit `scripts/setup-keybindings.sh` to change the shortcut from `Alt+D`.

### Debugging

Enable debug logging:

```python
import logging

logging.basicConfig(
    level=logging.DEBUG  # Change from INFO
)
```

## See Also

- [INSTALL.md](INSTALL.md) - Installation guide
- [MIGRATION_GUIDE.md](MIGRATION_GUIDE.md) - Upgrading from old version
- [TESTING_SUMMARY.md](TESTING_SUMMARY.md) - Test coverage