AI Dictation Service - Conversational AI Phone Call System

Overview

This enhanced dictation service transforms your existing voice-to-text system into a full conversational AI assistant that maintains conversation context across phone calls. It supports two modes:

  • Dictation Mode (Alt+D): Traditional voice-to-text transcription
  • Conversation Mode (Super+Alt+D): Interactive AI conversation with persistent context

Key Features

🎤 Dictation Mode (Alt+D)

  • Real-time voice transcription with immediate typing (sketched below)
  • Visual feedback through system notifications
  • High accuracy with multiple Vosk models available
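
Under the hood, dictation follows the standard Vosk streaming pattern: microphone blocks are queued by a sounddevice callback, fed to a recognizer, and finalized text is typed out. A minimal sketch (the model path and the final print are illustrative; the real service types the text into the focused window):

import json
import queue

import sounddevice as sd
from vosk import Model, KaldiRecognizer

SAMPLE_RATE = 16000
audio_q = queue.Queue()

def audio_callback(indata, frames, time_info, status):
    audio_q.put(bytes(indata))  # hand raw PCM blocks to the recognizer loop

model = Model("vosk-model-en-us-0.22-lgraph")  # path is illustrative
rec = KaldiRecognizer(model, SAMPLE_RATE)

with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=4000,
                       dtype="int16", channels=1, callback=audio_callback):
    while True:
        data = audio_q.get()
        if rec.AcceptWaveform(data):  # True when an utterance is finalized
            text = json.loads(rec.Result()).get("text", "")
            if text:
                print(text)  # stand-in for typing into the active application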

🤖 Conversation Mode (Super+Alt+D)

  • Persistent Context: Maintains conversation history across calls
  • VLLM Integration: Connects to your local VLLM endpoint (127.0.0.1:8000)
  • Text-to-Speech: AI responses are spoken naturally
  • Turn-taking: Intelligent voice activity detection
  • Visual GUI: Conversation interface with typing support
  • Context Preservation: Each call maintains its own conversation context

System Architecture

Core Components

  1. State Management: Dual-mode system with seamless switching
  2. Audio Processing: Real-time streaming with voice activity detection
  3. VLLM Client: OpenAI-compatible API integration (see the sketch after this list)
  4. TTS Engine: Natural speech synthesis for AI responses
  5. Conversation Manager: Persistent context and history management
  6. GUI Interface: Optional GTK-based conversation window
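
How components 3-5 fit together can be sketched as one conversation turn: the transcribed user text is sent to the local VLLM endpoint over its OpenAI-compatible API, and the reply is spoken with pyttsx3 (the engine exercised under Troubleshooting). Function and variable names below are assumptions, not the service's actual identifiers:

import requests
import pyttsx3

VLLM_ENDPOINT = "http://127.0.0.1:8000/v1"

def conversation_turn(user_text, messages):
    """Append the user's turn, get the AI reply, speak it, and return it."""
    messages.append({"role": "user", "content": user_text})
    resp = requests.post(
        f"{VLLM_ENDPOINT}/chat/completions",
        headers={"Authorization": "Bearer vllm-api-key"},
        json={"model": "default", "messages": messages},
        timeout=60,
    )
    resp.raise_for_status()
    reply = resp.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})

    engine = pyttsx3.init()  # TTS: speak the reply out loud
    engine.say(reply)
    engine.runAndWait()
    return reply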

File Structure

src/dictation_service/
├── enhanced_dictation.py     # Original dictation (preserved)
├── ai_dictation.py           # Full version with GTK GUI
├── ai_dictation_simple.py    # Core version (currently active)
├── vosk_dictation.py         # Basic dictation
└── main.py                   # Entry point

Configuration/
├── dictation.service         # Updated systemd service
├── toggle-dictation.sh       # Dictation control
├── toggle-conversation.sh    # Conversation control
└── setup-dual-keybindings.sh # Keybinding setup

Data/
├── conversation_history.json # Persistent conversation context
├── listening.lock           # Dictation mode lock file
└── conversation.lock        # Conversation mode lock file

Setup Instructions

1. Install Dependencies

# Install Python dependencies
uv sync

# Install system dependencies for GUI (if needed)
sudo apt-get install libgirepository1.0-dev gcc libcairo2-dev pkg-config python3-dev gir1.2-gtk-3.0

2. Setup Keybindings

# Setup both dictation and conversation keybindings
./setup-dual-keybindings.sh

# Or setup individually:
# ./setup-keybindings.sh  # Original dictation only

Keybindings:

  • Alt+D: Toggle dictation mode
  • Super+Alt+D: Toggle conversation mode (Super is the Windows key)

3. Start the Service

# Enable and start the systemd service
systemctl --user daemon-reload
systemctl --user enable dictation.service
systemctl --user start dictation.service

# Check status
systemctl --user status dictation.service

# View logs
journalctl --user -u dictation.service -f

4. Verify VLLM Connection

Ensure your VLLM service is running:

# Test endpoint
curl -H "Authorization: Bearer vllm-api-key" http://127.0.0.1:8000/v1/models

Usage Guide

Starting Dictation Mode

  1. Press Alt+D or run ./toggle-dictation.sh
  2. System notification: "🎤 Dictation Active"
  3. Speak normally - your words will be typed into the active application
  4. Press Alt+D again to stop

Starting Conversation Mode

  1. Press Super+Alt+D (Windows+Alt+D) or run ./toggle-conversation.sh
  2. System notification: "🤖 Conversation Started" with context count
  3. Speak naturally with the AI assistant
  4. AI responses will be spoken via TTS
  5. Press Super+Alt+D again to end the call

Conversation Context Management

The system maintains persistent conversation context across calls:

  • Within a call: Full conversation history is maintained
  • Between calls: Context is preserved for continuity
  • History storage: Saved in conversation_history.json
  • Auto-cleanup: Limits history to prevent memory issues (see the sketch below)
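
A minimal sketch of how this persistence and auto-cleanup can work, assuming the history is a flat JSON list of chat messages (the real logic lives in the Conversation Manager; the helper names are illustrative):

import json
from pathlib import Path

HISTORY_FILE = Path("conversation_history.json")
MAX_CONVERSATION_HISTORY = 10  # matches the environment variable below

def load_history():
    """Restore prior context so a new call continues where the last ended."""
    if HISTORY_FILE.exists():
        return json.loads(HISTORY_FILE.read_text())
    return []

def save_history(history):
    """Persist context, keeping only the most recent messages."""
    trimmed = history[-MAX_CONVERSATION_HISTORY:]
    HISTORY_FILE.write_text(json.dumps(trimmed, indent=2))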

Example Conversation Flow

User: "Hey, what's the weather like today?"
AI: "I don't have access to real-time weather data, but I recommend checking a weather app or website for current conditions in your area."

User: "That's fair. Can you help me plan my day instead?"
AI: "I'd be happy to help you plan your day! What are the main tasks or activities you need to accomplish?"

[Call ends with Super+Alt+D]

[Next call starts with Super+Alt+D]
User: "Continuing with the day planning..."
AI: "Great! We were talking about planning your day. What specific tasks or activities were you considering?"

Configuration Options

Environment Variables

# VLLM Configuration
export VLLM_ENDPOINT="http://127.0.0.1:8000/v1"
export VLLM_MODEL="default"

# Audio Settings
export SAMPLE_RATE=16000
export BLOCK_SIZE=4000

# Conversation Settings
export MAX_CONVERSATION_HISTORY=10
export TTS_ENABLED=true
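
These exports only take effect if the service reads them; a sketch of how it might do so, with defaults matching the values above (the parsing itself is an assumption about the implementation):

import os

VLLM_ENDPOINT = os.environ.get("VLLM_ENDPOINT", "http://127.0.0.1:8000/v1")
VLLM_MODEL = os.environ.get("VLLM_MODEL", "default")
SAMPLE_RATE = int(os.environ.get("SAMPLE_RATE", "16000"))
BLOCK_SIZE = int(os.environ.get("BLOCK_SIZE", "4000"))
MAX_CONVERSATION_HISTORY = int(os.environ.get("MAX_CONVERSATION_HISTORY", "10"))
TTS_ENABLED = os.environ.get("TTS_ENABLED", "true").lower() == "true"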

Model Selection

# Switch between Vosk models
./switch-model.sh

# Available models:
# - vosk-model-small-en-us-0.15 (Fast, basic accuracy)
# - vosk-model-en-us-0.22-lgraph (Good balance)
# - vosk-model-en-us-0.22 (Best accuracy, WER ~5.69%)

Troubleshooting

Common Issues

  1. Service won't start:

    # Check logs
    journalctl --user -u dictation.service -n 50
    
    # Check permissions
    groups $USER  # Should include 'audio' group
    
  2. VLLM connection fails:

    # Test endpoint manually
    curl -H "Authorization: Bearer vllm-api-key" http://127.0.0.1:8000/v1/models
    
    # Check if VLLM is running
    ps aux | grep vllm
    
  3. Audio issues:

    # Test audio input
    arecord -d 3 -f cd test.wav
    aplay test.wav
    
    # Check audio devices
    pacmd list-sources
    
  4. TTS not working:

    # Test TTS engine
    python3 -c "import pyttsx3; engine = pyttsx3.init(); engine.say('test'); engine.runAndWait()"
    

Log Files

  • Service logs: journalctl --user -u dictation.service
  • Application logs: /home/universal/.gemini/tmp/debug.log
  • Conversation history: conversation_history.json

Resetting Conversation History

# Clear all conversation context from within ai_dictation.py
conversation_manager.clear_all_history()

Alternatively, stop the service and delete conversation_history.json to start with a fresh context.

Advanced Features

Custom System Prompts

Edit the system prompt in ConversationManager.get_messages_for_api():

messages.append({
    "role": "system",
    "content": "You are a helpful AI assistant in a voice conversation. Be concise and natural in your responses."
})

Voice Activity Detection

The system includes basic VAD that can be customized:

# In audio_callback()
audio_level = abs(indata).mean()  # mean absolute amplitude of this block
if audio_level > 0.01:  # speech threshold; adjust for your microphone
    last_audio_time = time.time()  # record the most recent voice activity
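
Turn-taking builds on the same signal: once no block has crossed the threshold for a stretch of time, the user's turn is treated as finished and the transcript is handed to the AI. A sketch, where SILENCE_TIMEOUT is an assumed tunable rather than a documented setting:

import time

SILENCE_TIMEOUT = 1.5  # seconds of quiet that end a turn; tune to taste

def turn_finished(last_audio_time):
    """True once the user has been silent long enough to hand over the turn."""
    return time.time() - last_audio_time > SILENCE_TIMEOUT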

GUI Enhancement (Full Version)

The full ai_dictation.py includes a GTK-based GUI with:

  • Conversation history display
  • Text input field
  • Call control buttons
  • Real-time status indicators

To use the GUI version:

  1. Install PyGObject dependencies
  2. Update pyproject.toml to include PyGObject>=3.42.0
  3. Update dictation.service to use ai_dictation.py

Performance Considerations

Optimizations

  • Model selection: Use smaller models for faster response
  • Audio settings: Adjust BLOCK_SIZE for latency/accuracy balance
  • History management: Limit conversation history for memory efficiency
  • API calls: Implement request batching for efficiency

Resource Usage

  • Memory: ~100-500MB depending on Vosk model size
  • CPU: Minimal during idle, moderate during active conversation
  • Network: Only when calling VLLM endpoint

Security Considerations

  • The service runs as a user service with restricted permissions
  • Conversation history is stored locally in JSON format
  • API key is embedded in the client code
  • Audio data is processed locally, only text sent to VLLM

Future Enhancements

Potential additions:

  • Multi-user support: Separate conversation histories
  • Voice authentication: Speaker identification
  • Advanced VAD: More sophisticated voice activity detection
  • Cloud TTS: Optional cloud-based text-to-speech
  • Conversation export: Save/export conversation history
  • Integration plugins: Connect to other applications

Support

For issues or questions:

  1. Check the log files mentioned above
  2. Verify VLLM service status
  3. Test audio input/output
  4. Review configuration settings

The system builds on the existing dictation service, adding AI conversation capabilities with persistent context management.