# AI Dictation Service - Conversational AI Phone Call System

## Overview

This enhanced dictation service transforms your existing voice-to-text system into a full conversational AI assistant that maintains conversation context across phone calls. It supports two modes:

- **Dictation Mode (Alt+D)**: Traditional voice-to-text transcription
- **Conversation Mode (Super+Alt+D)**: Interactive AI conversation with persistent context

## Key Features

### 🎤 Dictation Mode (Alt+D)

- Real-time voice transcription with immediate typing
- Visual feedback through system notifications
- High accuracy with multiple Vosk models available

### 🤖 Conversation Mode (Super+Alt+D)

- **Persistent Context**: Maintains conversation history across calls
- **VLLM Integration**: Connects to your local VLLM endpoint (127.0.0.1:8000)
- **Text-to-Speech**: AI responses are spoken naturally
- **Turn-taking**: Intelligent voice activity detection
- **Visual GUI**: Conversation interface with typing support
- **Context Preservation**: Each call maintains its own conversation context

## System Architecture

### Core Components

1. **State Management**: Dual-mode system with seamless switching
2. **Audio Processing**: Real-time streaming with voice activity detection
3. **VLLM Client**: OpenAI-compatible API integration
4. **TTS Engine**: Natural speech synthesis for AI responses
5. **Conversation Manager**: Persistent context and history management
6.
**GUI Interface**: Optional GTK-based conversation window

### File Structure

```
src/dictation_service/
├── enhanced_dictation.py      # Original dictation (preserved)
├── ai_dictation.py            # Full version with GTK GUI
├── ai_dictation_simple.py     # Core version (currently active)
├── vosk_dictation.py          # Basic dictation
└── main.py                    # Entry point

Configuration/
├── dictation.service          # Updated systemd service
├── toggle-dictation.sh        # Dictation control
├── toggle-conversation.sh     # Conversation control
└── setup-dual-keybindings.sh  # Keybinding setup

Data/
├── conversation_history.json  # Persistent conversation context
├── listening.lock             # Dictation mode lock file
└── conversation.lock          # Conversation mode lock file
```

## Setup Instructions

### 1. Install Dependencies

```bash
# Install Python dependencies
uv sync

# Install system dependencies for the GUI (if needed)
sudo apt-get install libgirepository1.0-dev gcc libcairo2-dev pkg-config python3-dev gir1.2-gtk-3.0
```

### 2. Set Up Keybindings

```bash
# Set up both dictation and conversation keybindings
./setup-dual-keybindings.sh

# Or set up individually:
# ./setup-keybindings.sh  # Original dictation only
```

**Keybindings:**

- **Alt+D**: Toggle dictation mode
- **Super+Alt+D**: Toggle conversation mode (Windows+Alt+D)

### 3. Start the Service

```bash
# Enable and start the systemd service
systemctl --user daemon-reload
systemctl --user enable dictation.service
systemctl --user start dictation.service

# Check status
systemctl --user status dictation.service

# View logs
journalctl --user -u dictation.service -f
```

### 4. Verify the VLLM Connection

Ensure your VLLM service is running:

```bash
# Test the endpoint
curl -H "Authorization: Bearer vllm-api-key" http://127.0.0.1:8000/v1/models
```

## Usage Guide

### Starting Dictation Mode

1. Press **Alt+D** or run `./toggle-dictation.sh`
2. System notification: "🎤 Dictation Active"
3. Speak normally - your words will be typed into the active application
4.
Press **Alt+D** again to stop

### Starting Conversation Mode

1. Press **Super+Alt+D** (Windows+Alt+D) or run `./toggle-conversation.sh`
2. System notification: "🤖 Conversation Started" with context count
3. Speak naturally with the AI assistant
4. AI responses are spoken via TTS
5. Press **Super+Alt+D** again to end the call

### Conversation Context Management

The system maintains persistent conversation context across calls:

- **Within a call**: Full conversation history is maintained
- **Between calls**: Context is preserved for continuity
- **History storage**: Saved in `conversation_history.json`
- **Auto-cleanup**: History is capped to prevent unbounded memory growth

### Example Conversation Flow

```
User: "Hey, what's the weather like today?"
AI: "I don't have access to real-time weather data, but I recommend checking a weather app or website for current conditions in your area."
User: "That's fair. Can you help me plan my day instead?"
AI: "I'd be happy to help you plan your day! What are the main tasks or activities you need to accomplish?"

[Call ends with Super+Alt+D]
[Next call starts with Super+Alt+D]

User: "Continuing with the day planning..."
AI: "Great! We were talking about planning your day. What specific tasks or activities were you considering?"
```

## Configuration Options

### Environment Variables

```bash
# VLLM configuration
export VLLM_ENDPOINT="http://127.0.0.1:8000/v1"
export VLLM_MODEL="default"

# Audio settings
export SAMPLE_RATE=16000
export BLOCK_SIZE=8000

# Conversation settings
export MAX_CONVERSATION_HISTORY=10
export TTS_ENABLED=true
```

### Model Selection

```bash
# Switch between Vosk models
./switch-model.sh

# Available models:
# - vosk-model-small-en-us-0.15  (fast, basic accuracy)
# - vosk-model-en-us-0.22-lgraph (good balance)
# - vosk-model-en-us-0.22        (best accuracy, WER ~5.69)
```

## Troubleshooting

### Common Issues

1.
**Service won't start**:

   ```bash
   # Check logs
   journalctl --user -u dictation.service -n 50

   # Check permissions
   groups $USER  # Should include the 'audio' group
   ```

2. **VLLM connection fails**:

   ```bash
   # Test the endpoint manually
   curl -H "Authorization: Bearer vllm-api-key" http://127.0.0.1:8000/v1/models

   # Check whether VLLM is running
   ps aux | grep vllm
   ```

3. **Audio issues**:

   ```bash
   # Test audio input
   arecord -d 3 -f cd test.wav
   aplay test.wav

   # Check audio devices
   pacmd list-sources
   ```

4. **TTS not working**:

   ```bash
   # Test the TTS engine
   python3 -c "import pyttsx3; engine = pyttsx3.init(); engine.say('test'); engine.runAndWait()"
   ```

### Log Files

- **Service logs**: `journalctl --user -u dictation.service`
- **Application logs**: `/home/universal/.gemini/tmp/debug.log`
- **Conversation history**: `conversation_history.json`

### Resetting Conversation History

```python
# Clear all conversation context
# Add this to ai_dictation.py if needed
conversation_manager.clear_all_history()
```

## Advanced Features

### Custom System Prompts

Edit the system prompt in `ConversationManager.get_messages_for_api()`:

```python
messages.append({
    "role": "system",
    "content": "You are a helpful AI assistant in a voice conversation. Be concise and natural in your responses."
})
```

### Voice Activity Detection

The system includes basic VAD that can be customized:

```python
# In audio_callback()
audio_level = abs(indata).mean()
if audio_level > 0.01:  # Adjust threshold as needed
    last_audio_time = time.time()
```

### GUI Enhancement (Full Version)

The full `ai_dictation.py` includes a GTK-based GUI with:

- Conversation history display
- Text input field
- Call control buttons
- Real-time status indicators

To use the GUI version:

1. Install PyGObject dependencies
2. Update `pyproject.toml` to include `PyGObject>=3.42.0`
3.
Update `dictation.service` to use `ai_dictation.py`

## Performance Considerations

### Optimizations

- **Model selection**: Use smaller models for faster response
- **Audio settings**: Adjust `BLOCK_SIZE` to balance latency and accuracy
- **History management**: Limit conversation history for memory efficiency
- **API calls**: Batch requests where possible for efficiency

### Resource Usage

- **Memory**: ~100-500 MB, depending on Vosk model size
- **CPU**: Minimal while idle, moderate during active conversation
- **Network**: Only when calling the VLLM endpoint

## Security Considerations

- The service runs as a user service with restricted permissions
- Conversation history is stored locally in JSON format
- The API key is embedded in the client code
- Audio data is processed locally; only text is sent to VLLM

## Future Enhancements

Potential additions:

- **Multi-user support**: Separate conversation histories
- **Voice authentication**: Speaker identification
- **Advanced VAD**: More sophisticated voice activity detection
- **Cloud TTS**: Optional cloud-based text-to-speech
- **Conversation export**: Save/export conversation history
- **Integration plugins**: Connect to other applications

## Support

For issues or questions:

1. Check the log files mentioned above
2. Verify VLLM service status
3. Test audio input/output
4. Review configuration settings

The system builds upon the solid foundation of the existing dictation service while adding comprehensive AI conversation capabilities with persistent context management.
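## Appendix: Persistence Sketch

The persistent-context behavior described above (history saved to `conversation_history.json`, auto-cleanup via `MAX_CONVERSATION_HISTORY`, and the `get_messages_for_api()` / `clear_all_history()` methods) can be sketched as a small standalone class. This is a minimal illustration under assumed signatures, not the actual `ConversationManager` implementation shipped in `ai_dictation.py`:

```python
import json
import os


class ConversationManager:
    """Minimal sketch: a rolling window of conversation turns persisted
    to a JSON file, so a new call can pick up where the last one ended.
    Constructor arguments are assumptions, not the real API."""

    def __init__(self, path="conversation_history.json", max_history=10):
        self.path = path
        self.max_history = max_history
        self.history = self._load()

    def _load(self):
        # Start with an empty history if no file exists yet
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return json.load(f)

    def _save(self):
        with open(self.path, "w") as f:
            json.dump(self.history, f, indent=2)

    def add_turn(self, role, content):
        self.history.append({"role": role, "content": content})
        # Auto-cleanup: keep only the most recent turns
        self.history = self.history[-self.max_history:]
        self._save()

    def get_messages_for_api(self):
        # Prepend the system prompt, then the saved turns
        messages = [{
            "role": "system",
            "content": "You are a helpful AI assistant in a voice "
                       "conversation. Be concise and natural in your "
                       "responses.",
        }]
        messages.extend(self.history)
        return messages

    def clear_all_history(self):
        self.history = []
        self._save()
```

Because every `add_turn()` call rewrites the JSON file, a second `ConversationManager` created after a call ends (i.e., the next "phone call") reloads the same context automatically, which is what produces the cross-call continuity shown in the example conversation flow.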