# AI Dictation Service - Conversational AI Phone Call System
## Overview

This enhanced dictation service transforms your existing voice-to-text system into a full conversational AI assistant that maintains conversation context across phone calls. It supports two modes:

- **Dictation Mode (Alt+D)**: Traditional voice-to-text transcription
- **Conversation Mode (Super+Alt+D)**: Interactive AI conversation with persistent context

## Key Features

### 🎤 Dictation Mode (Alt+D)
- Real-time voice transcription with immediate typing
- Visual feedback through system notifications
- High accuracy with multiple Vosk models available

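Speech recognizers sometimes emit a stray article ("the", "a", "an") at the start or end of an utterance. A minimal sketch of a cleanup filter the transcription path could apply before typing (function name hypothetical):

```python
SPURIOUS = {"the", "a", "an"}

def strip_spurious(text: str) -> str:
    """Drop stray articles from the start and end of a transcription."""
    words = text.split()
    while words and words[0].lower() in SPURIOUS:
        words.pop(0)
    while words and words[-1].lower() in SPURIOUS:
        words.pop()
    return " ".join(words)
```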
### 🤖 Conversation Mode (Super+Alt+D)

- **Persistent Context**: Maintains conversation history across calls
- **VLLM Integration**: Connects to your local VLLM endpoint (127.0.0.1:8000)
- **Text-to-Speech**: AI responses are spoken naturally
- **Turn-taking**: Intelligent voice activity detection
- **Visual GUI**: Conversation interface with typing support
- **Context Preservation**: Each call maintains its own conversation context

## System Architecture
### Core Components

1. **State Management**: Dual-mode system with seamless switching
2. **Audio Processing**: Real-time streaming with voice activity detection
3. **VLLM Client**: OpenAI-compatible API integration
4. **TTS Engine**: Natural speech synthesis for AI responses
5. **Conversation Manager**: Persistent context and history management
6. **GUI Interface**: Optional GTK-based conversation window

### File Structure

```
src/dictation_service/
├── enhanced_dictation.py        # Original dictation (preserved)
├── ai_dictation.py              # Full version with GTK GUI
├── ai_dictation_simple.py       # Core version (currently active)
├── vosk_dictation.py            # Basic dictation
└── main.py                      # Entry point

Configuration/
├── dictation.service            # Updated systemd service
├── toggle-dictation.sh          # Dictation control
├── toggle-conversation.sh       # Conversation control
└── setup-dual-keybindings.sh    # Keybinding setup

Data/
├── conversation_history.json    # Persistent conversation context
├── listening.lock               # Dictation mode lock file
└── conversation.lock            # Conversation mode lock file
```

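The two lock files above act as the mode switches: dictation takes precedence when both exist, and starting one mode cleans up the other's stale lock. A minimal sketch of that logic (paths relative to the working directory and function names are hypothetical):

```python
from pathlib import Path

DICTATION_LOCK = Path("listening.lock")
CONVERSATION_LOCK = Path("conversation.lock")

def current_mode() -> str:
    """Dictation takes precedence over conversation when both locks exist."""
    if DICTATION_LOCK.exists():
        return "dictation"
    if CONVERSATION_LOCK.exists():
        return "conversation"
    return "idle"

def toggle(lock: Path, other: Path) -> bool:
    """Flip one mode on or off, clearing the other mode's stale lock.

    Returns True if the mode is now active, False if it was turned off.
    """
    if lock.exists():
        lock.unlink()
        return False
    other.unlink(missing_ok=True)  # e.g. clean up conversation.lock when dictation starts
    lock.touch()
    return True
```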
## Setup Instructions

### 1. Install Dependencies

```bash
# Install Python dependencies
uv sync

# Install system dependencies for GUI (if needed)
sudo apt-get install libgirepository1.0-dev gcc libcairo2-dev pkg-config python3-dev gir1.2-gtk-3.0
```

### 2. Setup Keybindings

```bash
# Setup both dictation and conversation keybindings
./setup-dual-keybindings.sh

# Or setup individually:
# ./setup-keybindings.sh   # Original dictation only
```

**Keybindings:**

- **Alt+D**: Toggle dictation mode
- **Super+Alt+D**: Toggle conversation mode (Windows+Alt+D)

### 3. Start the Service

```bash
# Enable and start the systemd service
systemctl --user daemon-reload
systemctl --user enable dictation.service
systemctl --user start dictation.service

# Check status
systemctl --user status dictation.service

# View logs
journalctl --user -u dictation.service -f
```

### 4. Verify VLLM Connection

Ensure your VLLM service is running:

```bash
# Test endpoint
curl -H "Authorization: Bearer vllm-api-key" http://127.0.0.1:8000/v1/models
```

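From Python, the same endpoint can be exercised with an OpenAI-style chat completion request. This is a sketch only: the endpoint and key come from the curl example above, the payload fields follow the OpenAI-compatible chat API, and the `ask` helper is hypothetical:

```python
import json
import urllib.request

VLLM_ENDPOINT = "http://127.0.0.1:8000/v1"
API_KEY = "vllm-api-key"

def build_chat_request(messages, model="default"):
    """Assemble the URL, headers, and JSON body for a chat completion call."""
    url = f"{VLLM_ENDPOINT}/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages}).encode()
    return url, headers, body

def ask(messages):
    """Send one chat turn to VLLM and return the assistant's reply text."""
    url, headers, body = build_chat_request(messages)
    req = urllib.request.Request(url, data=body, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```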
## Usage Guide

### Starting Dictation Mode

1. Press **Alt+D** or run `./toggle-dictation.sh`
2. System notification: "🎤 Dictation Active"
3. Speak normally - your words will be typed into the active application
4. Press **Alt+D** again to stop

### Starting Conversation Mode

1. Press **Super+Alt+D** (Windows+Alt+D) or run `./toggle-conversation.sh`
2. System notification: "🤖 Conversation Started" with context count
3. Speak naturally with the AI assistant
4. AI responses will be spoken via TTS
5. Press **Super+Alt+D** again to end the call

### Conversation Context Management

The system maintains persistent conversation context across calls:

- **Within a call**: Full conversation history is maintained
- **Between calls**: Context is preserved for continuity
- **History storage**: Saved in `conversation_history.json`
- **Auto-cleanup**: Limits history to prevent memory issues

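The persistence and auto-cleanup described above can be sketched as follows. This is a minimal illustration, not the service's actual implementation; the class and method names mirror those mentioned elsewhere in this document, and the cap matches the `MAX_CONVERSATION_HISTORY` setting:

```python
import json
from pathlib import Path

MAX_CONVERSATION_HISTORY = 10  # matches the setting in Configuration Options

class ConversationManager:
    """Persist conversation turns to JSON and cap the retained history."""

    def __init__(self, path=Path("conversation_history.json"),
                 limit=MAX_CONVERSATION_HISTORY):
        self.path = path
        self.limit = limit
        # Reload any context saved by a previous call
        self.history = json.loads(path.read_text()) if path.exists() else []

    def add_turn(self, role: str, content: str) -> None:
        self.history.append({"role": role, "content": content})
        # Auto-cleanup: keep only the most recent turns
        self.history = self.history[-self.limit:]
        self.path.write_text(json.dumps(self.history, indent=2))
```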
### Example Conversation Flow

```
User: "Hey, what's the weather like today?"
AI: "I don't have access to real-time weather data, but I recommend checking a weather app or website for current conditions in your area."

User: "That's fair. Can you help me plan my day instead?"
AI: "I'd be happy to help you plan your day! What are the main tasks or activities you need to accomplish?"

[Call ends with Super+Alt+D]

[Next call starts with Super+Alt+D]
User: "Continuing with the day planning..."
AI: "Great! We were talking about planning your day. What specific tasks or activities were you considering?"
```

## Configuration Options

### Environment Variables

```bash
# VLLM Configuration
export VLLM_ENDPOINT="http://127.0.0.1:8000/v1"
export VLLM_MODEL="default"

# Audio Settings
export SAMPLE_RATE=16000
export BLOCK_SIZE=8000

# Conversation Settings
export MAX_CONVERSATION_HISTORY=10
export TTS_ENABLED=true
```

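The service can pick these variables up at startup. A sketch of env-driven configuration using the names and defaults above (the `env_config` helper itself is hypothetical):

```python
import os

def env_config():
    """Read service settings from the environment, falling back to documented defaults."""
    return {
        "endpoint": os.environ.get("VLLM_ENDPOINT", "http://127.0.0.1:8000/v1"),
        "model": os.environ.get("VLLM_MODEL", "default"),
        "sample_rate": int(os.environ.get("SAMPLE_RATE", 16000)),
        "block_size": int(os.environ.get("BLOCK_SIZE", 8000)),
        "max_history": int(os.environ.get("MAX_CONVERSATION_HISTORY", 10)),
        "tts_enabled": os.environ.get("TTS_ENABLED", "true").lower() == "true",
    }
```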
### Model Selection

```bash
# Switch between Vosk models
./switch-model.sh

# Available models:
# - vosk-model-small-en-us-0.15 (Fast, basic accuracy)
# - vosk-model-en-us-0.22-lgraph (Good balance)
# - vosk-model-en-us-0.22 (Best accuracy, WER ~5.69)
```

## Troubleshooting

### Common Issues

1. **Service won't start**:

   ```bash
   # Check logs
   journalctl --user -u dictation.service -n 50

   # Check permissions
   groups $USER  # Should include 'audio' group
   ```

2. **VLLM connection fails**:

   ```bash
   # Test endpoint manually
   curl -H "Authorization: Bearer vllm-api-key" http://127.0.0.1:8000/v1/models

   # Check if VLLM is running
   ps aux | grep vllm
   ```

3. **Audio issues**:

   ```bash
   # Test audio input
   arecord -d 3 -f cd test.wav
   aplay test.wav

   # Check audio devices
   pacmd list-sources
   ```

4. **TTS not working**:

   ```bash
   # Test TTS engine
   python3 -c "import pyttsx3; engine = pyttsx3.init(); engine.say('test'); engine.runAndWait()"
   ```

### Log Files

- **Service logs**: `journalctl --user -u dictation.service`
- **Application logs**: `/home/universal/.gemini/tmp/debug.log`
- **Conversation history**: `conversation_history.json`

### Resetting Conversation History

```python
# Clear all conversation context
# Add this to ai_dictation.py if needed
conversation_manager.clear_all_history()
```

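If `clear_all_history()` is not yet defined in your copy, one possible implementation is simply to overwrite the history file with an empty list (a sketch; the standalone-function form here is an assumption):

```python
import json
from pathlib import Path

def clear_all_history(path=Path("conversation_history.json")):
    """Wipe the persisted conversation context by writing an empty list."""
    path.write_text(json.dumps([]))
```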
## Advanced Features

### Custom System Prompts

Edit the system prompt in `ConversationManager.get_messages_for_api()`:

```python
messages.append({
    "role": "system",
    "content": "You are a helpful AI assistant in a voice conversation. Be concise and natural in your responses."
})
```

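In context, that method would assemble the full message list sent to VLLM by prepending the system prompt to the stored turns. A minimal standalone sketch of that assembly (the exact signature in `ai_dictation.py` may differ):

```python
DEFAULT_PROMPT = ("You are a helpful AI assistant in a voice conversation. "
                  "Be concise and natural in your responses.")

def get_messages_for_api(history, system_prompt=DEFAULT_PROMPT):
    """Prepend the system prompt to the stored conversation turns."""
    return [{"role": "system", "content": system_prompt}] + list(history)
```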
### Voice Activity Detection

The system includes basic VAD that can be customized:

```python
# In audio_callback()
audio_level = abs(indata).mean()
if audio_level > 0.01:  # Adjust threshold as needed
    last_audio_time = time.time()
```

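Turn-taking builds on that threshold check: speech resets a silence timer, and once the speaker has been quiet long enough the AI takes its turn. A self-contained sketch (the class, the `feed` interface, and the 1.5 s timeout are illustrative assumptions, not the service's actual code):

```python
import time
import numpy as np

SILENCE_TIMEOUT = 1.5  # seconds of quiet before the turn passes (assumed value)

class TurnDetector:
    """Energy-based turn-taking: speech resets the silence timer."""

    def __init__(self, threshold=0.01):
        self.threshold = threshold
        self.last_audio_time = time.time()

    def feed(self, block, now=None):
        """Feed one audio block; return True once the speaker has gone quiet."""
        now = time.time() if now is None else now
        if float(np.abs(block).mean()) > self.threshold:
            self.last_audio_time = now  # speech detected: reset the timer
        return (now - self.last_audio_time) > SILENCE_TIMEOUT
```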
### GUI Enhancement (Full Version)

The full `ai_dictation.py` includes a GTK-based GUI with:

- Conversation history display
- Text input field
- Call control buttons
- Real-time status indicators

To use the GUI version:

1. Install PyGObject dependencies
2. Update `pyproject.toml` to include `PyGObject>=3.42.0`
3. Update `dictation.service` to use `ai_dictation.py`

## Performance Considerations

### Optimizations

- **Model selection**: Use smaller models for faster response
- **Audio settings**: Adjust `BLOCK_SIZE` for latency/accuracy balance
- **History management**: Limit conversation history for memory efficiency
- **API calls**: Implement request batching for efficiency

### Resource Usage

- **Memory**: ~100-500 MB depending on Vosk model size
- **CPU**: Minimal during idle, moderate during active conversation
- **Network**: Only when calling the VLLM endpoint

## Security Considerations

- The service runs as a user service with restricted permissions
- Conversation history is stored locally in JSON format
- The API key is embedded in the client code
- Audio data is processed locally; only text is sent to VLLM

## Future Enhancements

Potential additions:

- **Multi-user support**: Separate conversation histories
- **Voice authentication**: Speaker identification
- **Advanced VAD**: More sophisticated voice activity detection
- **Cloud TTS**: Optional cloud-based text-to-speech
- **Conversation export**: Save/export conversation history
- **Integration plugins**: Connect to other applications

## Support

For issues or questions:

1. Check the log files mentioned above
2. Verify VLLM service status
3. Test audio input/output
4. Review configuration settings

The system builds upon the solid foundation of the existing dictation service while adding comprehensive AI conversation capabilities with persistent context management.