AI Dictation Service - Conversational AI Phone Call System
Overview
This enhanced dictation service transforms your existing voice-to-text system into a full conversational AI assistant that maintains conversation context across phone calls. It supports two modes:
- Dictation Mode (Alt+D): Traditional voice-to-text transcription
- Conversation Mode (Super+Alt+D): Interactive AI conversation with persistent context
Key Features
🎤 Dictation Mode (Alt+D)
- Real-time voice transcription with immediate typing
- Visual feedback through system notifications
- High accuracy with multiple Vosk models available
🤖 Conversation Mode (Super+Alt+D)
- Persistent Context: Maintains conversation history across calls
- VLLM Integration: Connects to your local VLLM endpoint (127.0.0.1:8000)
- Text-to-Speech: AI responses are spoken naturally
- Turn-taking: Intelligent voice activity detection
- Visual GUI: Conversation interface with typing support
- Context Preservation: Each call maintains its own conversation context
System Architecture
Core Components
- State Management: Dual-mode system with seamless switching
- Audio Processing: Real-time streaming with voice activity detection
- VLLM Client: OpenAI-compatible API integration
- TTS Engine: Natural speech synthesis for AI responses
- Conversation Manager: Persistent context and history management
- GUI Interface: Optional GTK-based conversation window
File Structure
src/dictation_service/
├── enhanced_dictation.py # Original dictation (preserved)
├── ai_dictation.py # Full version with GTK GUI
├── ai_dictation_simple.py # Core version (currently active)
├── vosk_dictation.py # Basic dictation
└── main.py # Entry point
Configuration/
├── dictation.service # Updated systemd service
├── toggle-dictation.sh # Dictation control
├── toggle-conversation.sh # Conversation control
└── setup-dual-keybindings.sh # Keybinding setup
Data/
├── conversation_history.json # Persistent conversation context
├── listening.lock # Dictation mode lock file
└── conversation.lock # Conversation mode lock file
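The two lock files double as the mode switch. A minimal sketch of how the service can resolve the active mode (the function name is illustrative; the precedence rule, with dictation winning when both locks exist, is an assumption based on this layout):
from pathlib import Path

DICTATION_LOCK = Path("listening.lock")
CONVERSATION_LOCK = Path("conversation.lock")

def current_mode():
    # Dictation takes precedence if both lock files are present
    if DICTATION_LOCK.exists():
        return "dictation"
    if CONVERSATION_LOCK.exists():
        return "conversation"
    return "idle"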
Setup Instructions
1. Install Dependencies
# Install Python dependencies
uv sync
# Install system dependencies for GUI (if needed)
sudo apt-get install libgirepository1.0-dev gcc libcairo2-dev pkg-config python3-dev gir1.2-gtk-3.0
2. Setup Keybindings
# Setup both dictation and conversation keybindings
./setup-dual-keybindings.sh
# Or setup individually:
# ./setup-keybindings.sh # Original dictation only
Keybindings:
- Alt+D: Toggle dictation mode
- Super+Alt+D: Toggle conversation mode (Windows+Alt+D)
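On a GNOME desktop, setup-dual-keybindings.sh presumably registers these shortcuts through the media-keys custom-keybindings schema. A sketch of the equivalent gsettings calls (the script paths are assumptions):
# Register both shortcuts under GNOME's custom-keybindings schema
KEYS=/org/gnome/settings-daemon/plugins/media-keys/custom-keybindings
SCHEMA=org.gnome.settings-daemon.plugins.media-keys.custom-keybinding
gsettings set org.gnome.settings-daemon.plugins.media-keys custom-keybindings "['$KEYS/custom0/', '$KEYS/custom1/']"
gsettings set $SCHEMA:$KEYS/custom0/ name 'Toggle Dictation'
gsettings set $SCHEMA:$KEYS/custom0/ binding '<Alt>d'
gsettings set $SCHEMA:$KEYS/custom0/ command "$PWD/toggle-dictation.sh"
gsettings set $SCHEMA:$KEYS/custom1/ name 'Toggle Conversation'
gsettings set $SCHEMA:$KEYS/custom1/ binding '<Super><Alt>d'
gsettings set $SCHEMA:$KEYS/custom1/ command "$PWD/toggle-conversation.sh"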
3. Start the Service
# Enable and start the systemd service
systemctl --user daemon-reload
systemctl --user enable dictation.service
systemctl --user start dictation.service
# Check status
systemctl --user status dictation.service
# View logs
journalctl --user -u dictation.service -f
4. Verify VLLM Connection
Ensure your VLLM service is running:
# Test endpoint
curl -H "Authorization: Bearer vllm-api-key" http://127.0.0.1:8000/v1/models
Usage Guide
Starting Dictation Mode
- Press Alt+D or run ./toggle-dictation.sh
- System notification: "🎤 Dictation Active"
- Speak normally - your words will be typed into the active application
- Press Alt+D again to stop
Starting Conversation Mode
- Press Super+Alt+D (Windows+Alt+D) or run ./toggle-conversation.sh
- System notification: "🤖 Conversation Started" with context count
- Speak naturally with the AI assistant
- AI responses will be spoken via TTS
- Press Super+Alt+D again to end the call
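The toggle scripts can be as simple as flipping a lock file and posting a notification. A minimal sketch of toggle-conversation.sh under those assumptions (the lock paths come from the Data/ layout above; clearing the dictation lock first mirrors the mode-precedence rule):
#!/usr/bin/env bash
# Sketch only: create/remove the conversation lock and notify the user
LOCK="conversation.lock"
if [ -f "$LOCK" ]; then
    rm -f "$LOCK"
    notify-send "🤖 Conversation Ended"
else
    rm -f "listening.lock"   # make sure dictation mode is off first
    touch "$LOCK"
    notify-send "🤖 Conversation Started"
fi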
Conversation Context Management
The system maintains persistent conversation context across calls:
- Within a call: Full conversation history is maintained
- Between calls: Context is preserved for continuity
- History storage: Saved in conversation_history.json
- Auto-cleanup: Limits history to prevent memory issues
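A minimal sketch of the persistence side of the ConversationManager, assuming it simply round-trips conversation_history.json (method names other than the file are illustrative):
import json
from pathlib import Path

HISTORY_FILE = Path("conversation_history.json")
MAX_CONVERSATION_HISTORY = 10  # turns kept between calls

class ConversationManager:
    def __init__(self):
        # Load prior context so a new call continues where the last one ended
        self.history = json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else []

    def add_turn(self, role, content):
        self.history.append({"role": role, "content": content})
        # Auto-cleanup: keep only the most recent turns
        self.history = self.history[-MAX_CONVERSATION_HISTORY:]
        HISTORY_FILE.write_text(json.dumps(self.history, indent=2))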
Example Conversation Flow
User: "Hey, what's the weather like today?"
AI: "I don't have access to real-time weather data, but I recommend checking a weather app or website for current conditions in your area."
User: "That's fair. Can you help me plan my day instead?"
AI: "I'd be happy to help you plan your day! What are the main tasks or activities you need to accomplish?"
[Call ends with Super+Alt+D]
[Next call starts with Super+Alt+D]
User: "Continuing with the day planning..."
AI: "Great! We were talking about planning your day. What specific tasks or activities were you considering?"
Configuration Options
Environment Variables
# VLLM Configuration
export VLLM_ENDPOINT="http://127.0.0.1:8000/v1"
export VLLM_MODEL="default"
# Audio Settings
export SAMPLE_RATE=16000
export BLOCK_SIZE=4000
# Conversation Settings
export MAX_CONVERSATION_HISTORY=10
export TTS_ENABLED=true
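A sketch of how the service can pick these up, assuming plain os.environ lookups with the defaults shown above:
import os

VLLM_ENDPOINT = os.environ.get("VLLM_ENDPOINT", "http://127.0.0.1:8000/v1")
VLLM_MODEL = os.environ.get("VLLM_MODEL", "default")
SAMPLE_RATE = int(os.environ.get("SAMPLE_RATE", "16000"))
BLOCK_SIZE = int(os.environ.get("BLOCK_SIZE", "4000"))
MAX_CONVERSATION_HISTORY = int(os.environ.get("MAX_CONVERSATION_HISTORY", "10"))
TTS_ENABLED = os.environ.get("TTS_ENABLED", "true").lower() == "true"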
Model Selection
# Switch between Vosk models
./switch-model.sh
# Available models:
# - vosk-model-small-en-us-0.15 (Fast, basic accuracy)
# - vosk-model-en-us-0.22-lgraph (Good balance)
# - vosk-model-en-us-0.22 (Best accuracy, ~5.69% WER)
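Loading the active model in code is a two-liner with the Vosk API (the path is an assumption; point it at whatever switch-model.sh leaves active):
from vosk import Model, KaldiRecognizer

model = Model("models/vosk-model-en-us-0.22-lgraph")  # hypothetical path
recognizer = KaldiRecognizer(model, 16000)  # must match SAMPLE_RATE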
Troubleshooting
Common Issues
- Service won't start:
# Check logs
journalctl --user -u dictation.service -n 50
# Check permissions
groups $USER # Should include 'audio' group
- VLLM connection fails:
# Test endpoint manually
curl -H "Authorization: Bearer vllm-api-key" http://127.0.0.1:8000/v1/models
# Check if VLLM is running
ps aux | grep vllm
- Audio issues:
# Test audio input
arecord -d 3 -f cd test.wav
aplay test.wav
# Check audio devices
pacmd list-sources
- TTS not working:
# Test TTS engine
python3 -c "import pyttsx3; engine = pyttsx3.init(); engine.say('test'); engine.runAndWait()"
Log Files
- Service logs: journalctl --user -u dictation.service
- Application logs: /home/universal/.gemini/tmp/debug.log
- Conversation history: conversation_history.json
Resetting Conversation History
# Clear all conversation context
# Add this to ai_dictation.py if needed
conversation_manager.clear_all_history()
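If the manager follows the persistence sketch above, the method can be as small as this (illustrative; deleting conversation_history.json by hand has the same effect):
def clear_all_history(self):
    # Drop in-memory context and remove the persisted file
    self.history = []
    HISTORY_FILE.unlink(missing_ok=True)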
Advanced Features
Custom System Prompts
Edit the system prompt in ConversationManager.get_messages_for_api():
messages.append({
"role": "system",
"content": "You are a helpful AI assistant in a voice conversation. Be concise and natural in your responses."
})
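In context, get_messages_for_api() then just prepends that prompt to the stored turns. A sketch, assuming the ConversationManager keeps its history as a list of role/content dicts:
def get_messages_for_api(self):
    messages = [{
        "role": "system",
        "content": "You are a helpful AI assistant in a voice conversation. Be concise and natural in your responses.",
    }]
    messages.extend(self.history)  # replay stored turns after the system prompt
    return messages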
Voice Activity Detection
The system includes basic VAD that can be customized:
# In audio_callback()
audio_level = abs(indata).mean()
if audio_level > 0.01:  # Adjust threshold as needed
    last_audio_time = time.time()
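Turn-taking also needs an end-of-utterance rule on top of this threshold. A simple companion sketch, where the timeout value is an assumption to tune:
import time

SILENCE_TIMEOUT = 1.5  # seconds of quiet treated as the end of the user's turn

def user_finished_speaking(last_audio_time):
    return time.time() - last_audio_time > SILENCE_TIMEOUT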
GUI Enhancement (Full Version)
The full ai_dictation.py includes a GTK-based GUI with:
- Conversation history display
- Text input field
- Call control buttons
- Real-time status indicators
To use the GUI version:
- Install PyGObject dependencies
- Update pyproject.toml to include PyGObject>=3.42.0
- Update dictation.service to use ai_dictation.py
Performance Considerations
Optimizations
- Model selection: Use smaller models for faster response
- Audio settings: Adjust BLOCK_SIZE for the latency/accuracy balance (see the worked example below)
- History management: Limit conversation history for memory efficiency
- API calls: Implement request batching for efficiency
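As a worked example of the latency trade-off: at SAMPLE_RATE=16000, one block of BLOCK_SIZE=4000 frames covers 4000 / 16000 = 0.25 s of audio, so halving the block size from 8000 (500 ms) to 4000 (250 ms) halves the minimum delay before the recognizer sees new speech, at the cost of more frequent processing calls.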
Resource Usage
- Memory: ~100-500MB depending on Vosk model size
- CPU: Minimal during idle, moderate during active conversation
- Network: Only when calling VLLM endpoint
Security Considerations
- The service runs as a user service with restricted permissions
- Conversation history is stored locally in JSON format
- API key is currently embedded in the client code; consider moving it to an environment variable
- Audio data is processed locally, only text sent to VLLM
Future Enhancements
Potential additions:
- Multi-user support: Separate conversation histories
- Voice authentication: Speaker identification
- Advanced VAD: More sophisticated voice activity detection
- Cloud TTS: Optional cloud-based text-to-speech
- Conversation export: Save/export conversation history
- Integration plugins: Connect to other applications
Support
For issues or questions:
- Check the log files mentioned above
- Verify VLLM service status
- Test audio input/output
- Review configuration settings
The system builds upon the solid foundation of the existing dictation service while adding comprehensive AI conversation capabilities with persistent context management.