AI Dictation Service - Conversational AI Phone Call System

Overview

This enhanced dictation service transforms your existing voice-to-text system into a full conversational AI assistant that maintains conversation context across phone calls. It supports two modes:

  • Dictation Mode (Alt+D): Traditional voice-to-text transcription
  • Conversation Mode (Super+Alt+D): Interactive AI conversation with persistent context

Key Features

🎤 Dictation Mode (Alt+D)

  • Real-time voice transcription with immediate typing (sketched below)
  • Visual feedback through system notifications
  • High accuracy with multiple Vosk models available
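
Under the hood, dictation follows the standard Vosk streaming pattern: microphone blocks are queued by a sounddevice callback, fed to a recognizer, and finalized text is typed out. A minimal sketch (the model path and the final print are illustrative; the real service types the text into the focused window):

import json
import queue

import sounddevice as sd
from vosk import Model, KaldiRecognizer

SAMPLE_RATE = 16000
audio_q = queue.Queue()

def audio_callback(indata, frames, time_info, status):
    audio_q.put(bytes(indata))  # hand raw PCM blocks to the recognizer loop

model = Model("vosk-model-en-us-0.22-lgraph")  # path is illustrative
rec = KaldiRecognizer(model, SAMPLE_RATE)

with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=4000,
                       dtype="int16", channels=1, callback=audio_callback):
    while True:
        data = audio_q.get()
        if rec.AcceptWaveform(data):  # True when an utterance is finalized
            text = json.loads(rec.Result()).get("text", "")
            if text:
                print(text)  # stand-in for typing into the active application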

🤖 Conversation Mode (Super+Alt+D)

  • Persistent Context: Maintains conversation history across calls
  • VLLM Integration: Connects to your local VLLM endpoint (127.0.0.1:8000)
  • Text-to-Speech: AI responses are spoken naturally
  • Turn-taking: Intelligent voice activity detection
  • Visual GUI: Conversation interface with typing support
  • Context Preservation: Each call maintains its own conversation context

System Architecture

Core Components

  1. State Management: Dual-mode system with seamless switching
  2. Audio Processing: Real-time streaming with voice activity detection
  3. VLLM Client: OpenAI-compatible API integration (see the sketch after this list)
  4. TTS Engine: Natural speech synthesis for AI responses
  5. Conversation Manager: Persistent context and history management
  6. GUI Interface: Optional GTK-based conversation window
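
How components 3-5 fit together can be sketched as one conversation turn: the transcribed user text is sent to the local VLLM endpoint over its OpenAI-compatible API, and the reply is spoken with pyttsx3 (the engine exercised under Troubleshooting). Function and variable names below are assumptions, not the service's actual identifiers:

import requests
import pyttsx3

VLLM_ENDPOINT = "http://127.0.0.1:8000/v1"

def conversation_turn(user_text, messages):
    """Append the user's turn, get the AI reply, speak it, and return it."""
    messages.append({"role": "user", "content": user_text})
    resp = requests.post(
        f"{VLLM_ENDPOINT}/chat/completions",
        headers={"Authorization": "Bearer vllm-api-key"},
        json={"model": "default", "messages": messages},
        timeout=60,
    )
    resp.raise_for_status()
    reply = resp.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})

    engine = pyttsx3.init()  # TTS: speak the reply out loud
    engine.say(reply)
    engine.runAndWait()
    return reply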

File Structure

src/dictation_service/
├── enhanced_dictation.py     # Original dictation (preserved)
├── ai_dictation.py           # Full version with GTK GUI
├── ai_dictation_simple.py    # Core version (currently active)
├── vosk_dictation.py         # Basic dictation
└── main.py                   # Entry point

Configuration/
├── dictation.service         # Updated systemd service
├── toggle-dictation.sh       # Dictation control
├── toggle-conversation.sh    # Conversation control
└── setup-dual-keybindings.sh # Keybinding setup

Data/
├── conversation_history.json # Persistent conversation context
├── listening.lock           # Dictation mode lock file
└── conversation.lock        # Conversation mode lock file

Setup Instructions

1. Install Dependencies

# Install Python dependencies
uv sync

# Install system dependencies for GUI (if needed)
sudo apt-get install libgirepository1.0-dev gcc libcairo2-dev pkg-config python3-dev gir1.2-gtk-3.0

2. Setup Keybindings

# Setup both dictation and conversation keybindings
./setup-dual-keybindings.sh

# Or setup individually:
# ./setup-keybindings.sh  # Original dictation only

Keybindings:

  • Alt+D: Toggle dictation mode
  • Super+Alt+D: Toggle conversation mode (Super is the Windows key)

3. Start the Service

# Enable and start the systemd service
systemctl --user daemon-reload
systemctl --user enable dictation.service
systemctl --user start dictation.service

# Check status
systemctl --user status dictation.service

# View logs
journalctl --user -u dictation.service -f

4. Verify VLLM Connection

Ensure your VLLM service is running:

# Test endpoint
curl -H "Authorization: Bearer vllm-api-key" http://127.0.0.1:8000/v1/models

Usage Guide

Starting Dictation Mode

  1. Press Alt+D or run ./toggle-dictation.sh
  2. System notification: "🎤 Dictation Active"
  3. Speak normally - your words will be typed into the active application
  4. Press Alt+D again to stop

Starting Conversation Mode

  1. Press Super+Alt+D (Windows+Alt+D) or run ./toggle-conversation.sh
  2. System notification: "🤖 Conversation Started" with context count
  3. Speak naturally with the AI assistant
  4. AI responses will be spoken via TTS
  5. Press Super+Alt+D again to end the call

Conversation Context Management

The system maintains persistent conversation context across calls:

  • Within a call: Full conversation history is maintained
  • Between calls: Context is preserved for continuity
  • History storage: Saved in conversation_history.json
  • Auto-cleanup: Limits history to prevent memory issues (see the sketch below)
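
A minimal sketch of how this persistence and auto-cleanup can work, assuming the history is a flat JSON list of chat messages (the real logic lives in the Conversation Manager; the helper names are illustrative):

import json
from pathlib import Path

HISTORY_FILE = Path("conversation_history.json")
MAX_CONVERSATION_HISTORY = 10  # matches the environment variable below

def load_history():
    """Restore prior context so a new call continues where the last ended."""
    if HISTORY_FILE.exists():
        return json.loads(HISTORY_FILE.read_text())
    return []

def save_history(history):
    """Persist context, keeping only the most recent messages."""
    trimmed = history[-MAX_CONVERSATION_HISTORY:]
    HISTORY_FILE.write_text(json.dumps(trimmed, indent=2))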

Example Conversation Flow

User: "Hey, what's the weather like today?"
AI: "I don't have access to real-time weather data, but I recommend checking a weather app or website for current conditions in your area."

User: "That's fair. Can you help me plan my day instead?"
AI: "I'd be happy to help you plan your day! What are the main tasks or activities you need to accomplish?"

[Call ends with Super+Alt+D]

[Next call starts with Super+Alt+D]
User: "Continuing with the day planning..."
AI: "Great! We were talking about planning your day. What specific tasks or activities were you considering?"

Configuration Options

Environment Variables

# VLLM Configuration
export VLLM_ENDPOINT="http://127.0.0.1:8000/v1"
export VLLM_MODEL="default"

# Audio Settings
export SAMPLE_RATE=16000
export BLOCK_SIZE=4000

# Conversation Settings
export MAX_CONVERSATION_HISTORY=10
export TTS_ENABLED=true
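
These exports only take effect if the service reads them; a sketch of how it might do so, with defaults matching the values above (the parsing itself is an assumption about the implementation):

import os

VLLM_ENDPOINT = os.environ.get("VLLM_ENDPOINT", "http://127.0.0.1:8000/v1")
VLLM_MODEL = os.environ.get("VLLM_MODEL", "default")
SAMPLE_RATE = int(os.environ.get("SAMPLE_RATE", "16000"))
BLOCK_SIZE = int(os.environ.get("BLOCK_SIZE", "4000"))
MAX_CONVERSATION_HISTORY = int(os.environ.get("MAX_CONVERSATION_HISTORY", "10"))
TTS_ENABLED = os.environ.get("TTS_ENABLED", "true").lower() == "true"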

Model Selection

# Switch between Vosk models
./switch-model.sh

# Available models:
# - vosk-model-small-en-us-0.15 (Fast, basic accuracy)
# - vosk-model-en-us-0.22-lgraph (Good balance)
# - vosk-model-en-us-0.22 (Best accuracy, WER ~5.69%)

Troubleshooting

Common Issues

  1. Service won't start:

    # Check logs
    journalctl --user -u dictation.service -n 50
    
    # Check permissions
    groups $USER  # Should include 'audio' group
    
  2. VLLM connection fails:

    # Test endpoint manually
    curl -H "Authorization: Bearer vllm-api-key" http://127.0.0.1:8000/v1/models
    
    # Check if VLLM is running
    ps aux | grep vllm
    
  3. Audio issues:

    # Test audio input
    arecord -d 3 -f cd test.wav
    aplay test.wav
    
    # Check audio devices
    pacmd list-sources
    
  4. TTS not working:

    # Test TTS engine
    python3 -c "import pyttsx3; engine = pyttsx3.init(); engine.say('test'); engine.runAndWait()"
    

Log Files

  • Service logs: journalctl --user -u dictation.service
  • Application logs: /home/universal/.gemini/tmp/debug.log
  • Conversation history: conversation_history.json

Resetting Conversation History

# Clear all conversation context from within ai_dictation.py
conversation_manager.clear_all_history()

Alternatively, stop the service and delete conversation_history.json to start with a fresh context.

Advanced Features

Custom System Prompts

Edit the system prompt in ConversationManager.get_messages_for_api():

messages.append({
    "role": "system",
    "content": "You are a helpful AI assistant in a voice conversation. Be concise and natural in your responses."
})

Voice Activity Detection

The system includes basic VAD that can be customized:

# In audio_callback()
audio_level = abs(indata).mean()  # mean absolute amplitude of this block
if audio_level > 0.01:  # speech threshold; adjust for your microphone
    last_audio_time = time.time()  # record the most recent voice activity
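
Turn-taking builds on the same signal: once no block has crossed the threshold for a stretch of time, the user's turn is treated as finished and the transcript is handed to the AI. A sketch, where SILENCE_TIMEOUT is an assumed tunable rather than a documented setting:

import time

SILENCE_TIMEOUT = 1.5  # seconds of quiet that end a turn; tune to taste

def turn_finished(last_audio_time):
    """True once the user has been silent long enough to hand over the turn."""
    return time.time() - last_audio_time > SILENCE_TIMEOUT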

GUI Enhancement (Full Version)

The full ai_dictation.py includes a GTK-based GUI with:

  • Conversation history display
  • Text input field
  • Call control buttons
  • Real-time status indicators

To use the GUI version:

  1. Install PyGObject dependencies
  2. Update pyproject.toml to include PyGObject>=3.42.0
  3. Update dictation.service to use ai_dictation.py

Performance Considerations

Optimizations

  • Model selection: Use smaller models for faster response
  • Audio settings: Adjust BLOCK_SIZE for latency/accuracy balance
  • History management: Limit conversation history for memory efficiency
  • API calls: Implement request batching for efficiency

Resource Usage

  • Memory: ~100-500MB depending on Vosk model size
  • CPU: Minimal during idle, moderate during active conversation
  • Network: Only when calling VLLM endpoint

Security Considerations

  • The service runs as a user service with restricted permissions
  • Conversation history is stored locally in JSON format
  • API key is embedded in the client code
  • Audio data is processed locally, only text sent to VLLM

Future Enhancements

Potential additions:

  • Multi-user support: Separate conversation histories
  • Voice authentication: Speaker identification
  • Advanced VAD: More sophisticated voice activity detection
  • Cloud TTS: Optional cloud-based text-to-speech
  • Conversation export: Save/export conversation history
  • Integration plugins: Connect to other applications

Support

For issues or questions:

  1. Check the log files mentioned above
  2. Verify VLLM service status
  3. Test audio input/output
  4. Review configuration settings

The system builds on the existing dictation service, adding AI conversation capabilities with persistent context management.