# AI Dictation Service - Conversational AI Phone Call System
## Overview

This enhanced dictation service transforms your existing voice-to-text system into a full conversational AI assistant that maintains conversation context across phone calls. It supports two modes:

- **Dictation Mode (Alt+D)**: Traditional voice-to-text transcription
- **Conversation Mode (Super+Alt+D)**: Interactive AI conversation with persistent context

## Key Features

### 🎤 Dictation Mode (Alt+D)
- Real-time voice transcription with immediate typing
- Visual feedback through system notifications
- High accuracy with multiple Vosk models available

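Speech recognizers sometimes emit a stray article ("the", "a", "an") at the start or end of an utterance. A minimal sketch of a cleanup filter the transcription path could apply before typing (function name hypothetical):

```python
SPURIOUS = {"the", "a", "an"}

def strip_spurious(text: str) -> str:
    """Drop stray articles from the start and end of a transcription."""
    words = text.split()
    while words and words[0].lower() in SPURIOUS:
        words.pop(0)
    while words and words[-1].lower() in SPURIOUS:
        words.pop()
    return " ".join(words)
```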
### 🤖 Conversation Mode (Super+Alt+D)

- **Persistent Context**: Maintains conversation history across calls
- **VLLM Integration**: Connects to your local VLLM endpoint (127.0.0.1:8000)
- **Text-to-Speech**: AI responses are spoken naturally
- **Turn-taking**: Intelligent voice activity detection
- **Visual GUI**: Conversation interface with typing support
- **Context Preservation**: Each call maintains its own conversation context

## System Architecture
### Core Components

1. **State Management**: Dual-mode system with seamless switching
2. **Audio Processing**: Real-time streaming with voice activity detection
3. **VLLM Client**: OpenAI-compatible API integration
4. **TTS Engine**: Natural speech synthesis for AI responses
5. **Conversation Manager**: Persistent context and history management
6. **GUI Interface**: Optional GTK-based conversation window

### File Structure

```
src/dictation_service/
├── enhanced_dictation.py        # Original dictation (preserved)
├── ai_dictation.py              # Full version with GTK GUI
├── ai_dictation_simple.py       # Core version (currently active)
├── vosk_dictation.py            # Basic dictation
└── main.py                      # Entry point

Configuration/
├── dictation.service            # Updated systemd service
├── toggle-dictation.sh          # Dictation control
├── toggle-conversation.sh       # Conversation control
└── setup-dual-keybindings.sh    # Keybinding setup

Data/
├── conversation_history.json    # Persistent conversation context
├── listening.lock               # Dictation mode lock file
└── conversation.lock            # Conversation mode lock file
```

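The two lock files above act as the mode switches: dictation takes precedence when both exist, and starting one mode cleans up the other's stale lock. A minimal sketch of that logic (paths relative to the working directory and function names are hypothetical):

```python
from pathlib import Path

DICTATION_LOCK = Path("listening.lock")
CONVERSATION_LOCK = Path("conversation.lock")

def current_mode() -> str:
    """Dictation takes precedence over conversation when both locks exist."""
    if DICTATION_LOCK.exists():
        return "dictation"
    if CONVERSATION_LOCK.exists():
        return "conversation"
    return "idle"

def toggle(lock: Path, other: Path) -> bool:
    """Flip one mode on or off, clearing the other mode's stale lock.

    Returns True if the mode is now active, False if it was turned off.
    """
    if lock.exists():
        lock.unlink()
        return False
    other.unlink(missing_ok=True)  # e.g. clean up conversation.lock when dictation starts
    lock.touch()
    return True
```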
## Setup Instructions

### 1. Install Dependencies

```bash
# Install Python dependencies
uv sync

# Install system dependencies for GUI (if needed)
sudo apt-get install libgirepository1.0-dev gcc libcairo2-dev pkg-config python3-dev gir1.2-gtk-3.0
```

### 2. Setup Keybindings

```bash
# Setup both dictation and conversation keybindings
./setup-dual-keybindings.sh

# Or setup individually:
# ./setup-keybindings.sh   # Original dictation only
```

**Keybindings:**

- **Alt+D**: Toggle dictation mode
- **Super+Alt+D**: Toggle conversation mode (Windows+Alt+D)

### 3. Start the Service

```bash
# Enable and start the systemd service
systemctl --user daemon-reload
systemctl --user enable dictation.service
systemctl --user start dictation.service

# Check status
systemctl --user status dictation.service

# View logs
journalctl --user -u dictation.service -f
```

### 4. Verify VLLM Connection

Ensure your VLLM service is running:

```bash
# Test endpoint
curl -H "Authorization: Bearer vllm-api-key" http://127.0.0.1:8000/v1/models
```

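From Python, the same endpoint can be exercised with an OpenAI-style chat completion request. This is a sketch only: the endpoint and key come from the curl example above, the payload fields follow the OpenAI-compatible chat API, and the `ask` helper is hypothetical:

```python
import json
import urllib.request

VLLM_ENDPOINT = "http://127.0.0.1:8000/v1"
API_KEY = "vllm-api-key"

def build_chat_request(messages, model="default"):
    """Assemble the URL, headers, and JSON body for a chat completion call."""
    url = f"{VLLM_ENDPOINT}/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages}).encode()
    return url, headers, body

def ask(messages):
    """Send one chat turn to VLLM and return the assistant's reply text."""
    url, headers, body = build_chat_request(messages)
    req = urllib.request.Request(url, data=body, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```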
## Usage Guide

### Starting Dictation Mode

1. Press **Alt+D** or run `./toggle-dictation.sh`
2. System notification: "🎤 Dictation Active"
3. Speak normally - your words will be typed into the active application
4. Press **Alt+D** again to stop

### Starting Conversation Mode

1. Press **Super+Alt+D** (Windows+Alt+D) or run `./toggle-conversation.sh`
2. System notification: "🤖 Conversation Started" with context count
3. Speak naturally with the AI assistant
4. AI responses will be spoken via TTS
5. Press **Super+Alt+D** again to end the call

### Conversation Context Management

The system maintains persistent conversation context across calls:

- **Within a call**: Full conversation history is maintained
- **Between calls**: Context is preserved for continuity
- **History storage**: Saved in `conversation_history.json`
- **Auto-cleanup**: Limits history to prevent memory issues

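The persistence and auto-cleanup described above can be sketched as follows. This is a minimal illustration, not the service's actual implementation; the class and method names mirror those mentioned elsewhere in this document, and the cap matches the `MAX_CONVERSATION_HISTORY` setting:

```python
import json
from pathlib import Path

MAX_CONVERSATION_HISTORY = 10  # matches the setting in Configuration Options

class ConversationManager:
    """Persist conversation turns to JSON and cap the retained history."""

    def __init__(self, path=Path("conversation_history.json"),
                 limit=MAX_CONVERSATION_HISTORY):
        self.path = path
        self.limit = limit
        # Reload any context saved by a previous call
        self.history = json.loads(path.read_text()) if path.exists() else []

    def add_turn(self, role: str, content: str) -> None:
        self.history.append({"role": role, "content": content})
        # Auto-cleanup: keep only the most recent turns
        self.history = self.history[-self.limit:]
        self.path.write_text(json.dumps(self.history, indent=2))
```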
### Example Conversation Flow

```
User: "Hey, what's the weather like today?"
AI: "I don't have access to real-time weather data, but I recommend checking a weather app or website for current conditions in your area."

User: "That's fair. Can you help me plan my day instead?"
AI: "I'd be happy to help you plan your day! What are the main tasks or activities you need to accomplish?"

[Call ends with Super+Alt+D]

[Next call starts with Super+Alt+D]
User: "Continuing with the day planning..."
AI: "Great! We were talking about planning your day. What specific tasks or activities were you considering?"
```

## Configuration Options

### Environment Variables

```bash
# VLLM Configuration
export VLLM_ENDPOINT="http://127.0.0.1:8000/v1"
export VLLM_MODEL="default"

# Audio Settings
export SAMPLE_RATE=16000
export BLOCK_SIZE=8000

# Conversation Settings
export MAX_CONVERSATION_HISTORY=10
export TTS_ENABLED=true
```

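The service can pick these variables up at startup. A sketch of env-driven configuration using the names and defaults above (the `env_config` helper itself is hypothetical):

```python
import os

def env_config():
    """Read service settings from the environment, falling back to documented defaults."""
    return {
        "endpoint": os.environ.get("VLLM_ENDPOINT", "http://127.0.0.1:8000/v1"),
        "model": os.environ.get("VLLM_MODEL", "default"),
        "sample_rate": int(os.environ.get("SAMPLE_RATE", 16000)),
        "block_size": int(os.environ.get("BLOCK_SIZE", 8000)),
        "max_history": int(os.environ.get("MAX_CONVERSATION_HISTORY", 10)),
        "tts_enabled": os.environ.get("TTS_ENABLED", "true").lower() == "true",
    }
```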
### Model Selection

```bash
# Switch between Vosk models
./switch-model.sh

# Available models:
# - vosk-model-small-en-us-0.15 (Fast, basic accuracy)
# - vosk-model-en-us-0.22-lgraph (Good balance)
# - vosk-model-en-us-0.22 (Best accuracy, WER ~5.69)
```

## Troubleshooting

### Common Issues

1. **Service won't start**:

   ```bash
   # Check logs
   journalctl --user -u dictation.service -n 50

   # Check permissions
   groups $USER  # Should include 'audio' group
   ```

2. **VLLM connection fails**:

   ```bash
   # Test endpoint manually
   curl -H "Authorization: Bearer vllm-api-key" http://127.0.0.1:8000/v1/models

   # Check if VLLM is running
   ps aux | grep vllm
   ```

3. **Audio issues**:

   ```bash
   # Test audio input
   arecord -d 3 -f cd test.wav
   aplay test.wav

   # Check audio devices
   pacmd list-sources
   ```

4. **TTS not working**:

   ```bash
   # Test TTS engine
   python3 -c "import pyttsx3; engine = pyttsx3.init(); engine.say('test'); engine.runAndWait()"
   ```

### Log Files

- **Service logs**: `journalctl --user -u dictation.service`
- **Application logs**: `/home/universal/.gemini/tmp/debug.log`
- **Conversation history**: `conversation_history.json`

### Resetting Conversation History

```python
# Clear all conversation context
# Add this to ai_dictation.py if needed
conversation_manager.clear_all_history()
```

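If `clear_all_history()` is not yet defined in your copy, one possible implementation is simply to overwrite the history file with an empty list (a sketch; the standalone-function form here is an assumption):

```python
import json
from pathlib import Path

def clear_all_history(path=Path("conversation_history.json")):
    """Wipe the persisted conversation context by writing an empty list."""
    path.write_text(json.dumps([]))
```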
## Advanced Features

### Custom System Prompts

Edit the system prompt in `ConversationManager.get_messages_for_api()`:

```python
messages.append({
    "role": "system",
    "content": "You are a helpful AI assistant in a voice conversation. Be concise and natural in your responses."
})
```

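In context, that method would assemble the full message list sent to VLLM by prepending the system prompt to the stored turns. A minimal standalone sketch of that assembly (the exact signature in `ai_dictation.py` may differ):

```python
DEFAULT_PROMPT = ("You are a helpful AI assistant in a voice conversation. "
                  "Be concise and natural in your responses.")

def get_messages_for_api(history, system_prompt=DEFAULT_PROMPT):
    """Prepend the system prompt to the stored conversation turns."""
    return [{"role": "system", "content": system_prompt}] + list(history)
```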
### Voice Activity Detection

The system includes basic VAD that can be customized:

```python
# In audio_callback()
audio_level = abs(indata).mean()
if audio_level > 0.01:  # Adjust threshold as needed
    last_audio_time = time.time()
```

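Turn-taking builds on that threshold check: speech resets a silence timer, and once the speaker has been quiet long enough the AI takes its turn. A self-contained sketch (the class, the `feed` interface, and the 1.5 s timeout are illustrative assumptions, not the service's actual code):

```python
import time
import numpy as np

SILENCE_TIMEOUT = 1.5  # seconds of quiet before the turn passes (assumed value)

class TurnDetector:
    """Energy-based turn-taking: speech resets the silence timer."""

    def __init__(self, threshold=0.01):
        self.threshold = threshold
        self.last_audio_time = time.time()

    def feed(self, block, now=None):
        """Feed one audio block; return True once the speaker has gone quiet."""
        now = time.time() if now is None else now
        if float(np.abs(block).mean()) > self.threshold:
            self.last_audio_time = now  # speech detected: reset the timer
        return (now - self.last_audio_time) > SILENCE_TIMEOUT
```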
### GUI Enhancement (Full Version)

The full `ai_dictation.py` includes a GTK-based GUI with:

- Conversation history display
- Text input field
- Call control buttons
- Real-time status indicators

To use the GUI version:

1. Install PyGObject dependencies
2. Update `pyproject.toml` to include `PyGObject>=3.42.0`
3. Update `dictation.service` to use `ai_dictation.py`

## Performance Considerations

### Optimizations

- **Model selection**: Use smaller models for faster response
- **Audio settings**: Adjust `BLOCK_SIZE` for latency/accuracy balance
- **History management**: Limit conversation history for memory efficiency
- **API calls**: Implement request batching for efficiency

### Resource Usage

- **Memory**: ~100-500 MB depending on Vosk model size
- **CPU**: Minimal during idle, moderate during active conversation
- **Network**: Only when calling the VLLM endpoint

## Security Considerations

- The service runs as a user service with restricted permissions
- Conversation history is stored locally in JSON format
- The API key is embedded in the client code
- Audio data is processed locally; only text is sent to VLLM

## Future Enhancements

Potential additions:

- **Multi-user support**: Separate conversation histories
- **Voice authentication**: Speaker identification
- **Advanced VAD**: More sophisticated voice activity detection
- **Cloud TTS**: Optional cloud-based text-to-speech
- **Conversation export**: Save/export conversation history
- **Integration plugins**: Connect to other applications

## Support

For issues or questions:

1. Check the log files mentioned above
2. Verify VLLM service status
3. Test audio input/output
4. Review configuration settings

The system builds upon the solid foundation of the existing dictation service while adding comprehensive AI conversation capabilities with persistent context management.