SIP Gemini Proxy
A lightweight server bridging traditional telephony systems with Google's Gemini Live API, enabling real-time voice AI conversations over standard phone infrastructure.
This proxy provides straightforward protocol translation between SIP/RTP telephony and the Gemini Live WebSocket API. It handles codec conversion (G.711 PCMU/PCMA), audio resampling (8 kHz telephony audio up to Gemini's 16 kHz input rate, and Gemini's 24 kHz output back down to 8 kHz), and real-time bidirectional audio streaming.
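G.711 decoding is a fixed bit-level expansion rather than a compression algorithm. The sketch below shows PCMU (μ-law) decoding to 16-bit linear PCM using the classic ITU-T G.711 reference logic. The function names `ulawToLinear` and `decodePCMU` are hypothetical and are not taken from this codebase.

```go
package main

import "fmt"

// ulawToLinear expands one G.711 μ-law (PCMU) byte into a 16-bit linear PCM sample.
func ulawToLinear(u byte) int16 {
	u = ^u // μ-law bytes are transmitted bit-inverted
	sign := u & 0x80
	exponent := (u >> 4) & 0x07
	mantissa := int(u & 0x0F)
	// Rebuild the magnitude with the 0x84 (132) bias applied, then remove the bias.
	magnitude := ((mantissa << 3) + 0x84) << exponent
	magnitude -= 0x84
	if sign != 0 {
		return int16(-magnitude)
	}
	return int16(magnitude)
}

// decodePCMU converts a whole RTP payload of μ-law bytes into linear PCM samples.
func decodePCMU(payload []byte) []int16 {
	pcm := make([]int16, len(payload))
	for i, b := range payload {
		pcm[i] = ulawToLinear(b)
	}
	return pcm
}

func main() {
	// 0xFF is μ-law silence; 0x80 and 0x00 are the largest positive and negative codes.
	fmt.Println(decodePCMU([]byte{0xFF, 0x80, 0x00})) // [0 32124 -32124]
}
```

One 20 ms PCMU packet carries 160 bytes, so each RTP payload decodes to 160 samples at 8 kHz before resampling.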
Perfect for integrating AI voice agents with existing phone systems, Twilio numbers, or any SIP-based telephony infrastructure. Supports custom system instructions, voice selection, and automatic transcription.
Telephony Integration
Bridges traditional SIP telephony systems with Google Gemini Live API for real-time voice AI conversations.
Protocol Translation
Seamlessly converts between SIP/RTP telephony protocols and the WebSocket-based Gemini API, handling G.711 codec conversion and audio resampling.
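PCMA (A-law), the other G.711 variant, uses a different bit layout: alternate bits are inverted on the wire and a set sign bit means a positive sample. A minimal decoding sketch following the ITU-T G.711 reference logic; `alawToLinear` and `decodePCMA` are hypothetical names.

```go
package main

import "fmt"

// alawToLinear expands one G.711 A-law (PCMA) byte into a 16-bit linear PCM sample.
func alawToLinear(a byte) int16 {
	a ^= 0x55 // undo the alternate-bit inversion applied on the wire
	exponent := (a >> 4) & 0x07
	mantissa := int(a&0x0F) << 4
	var magnitude int
	if exponent == 0 {
		magnitude = mantissa + 8
	} else {
		magnitude = (mantissa + 0x108) << (exponent - 1)
	}
	// In A-law, a set sign bit means the sample is positive.
	if a&0x80 != 0 {
		return int16(magnitude)
	}
	return int16(-magnitude)
}

// decodePCMA converts an RTP payload of A-law bytes into linear PCM samples.
func decodePCMA(payload []byte) []int16 {
	pcm := make([]int16, len(payload))
	for i, b := range payload {
		pcm[i] = alawToLinear(b)
	}
	return pcm
}

func main() {
	// 0xD5 is the smallest positive code; 0xAA and 0x2A are the largest magnitudes.
	fmt.Println(decodePCMA([]byte{0xD5, 0xAA, 0x2A})) // [8 32256 -32256]
}
```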
Twilio Ready
Built-in webhook server for easy integration with Twilio phone numbers and other SIP providers.
Flexible Configuration
Customize system instructions, voice selection, and language settings per call through a simple callback API.
Real-time Transcription
Automatic transcription of input and output audio for monitoring and logging purposes.
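In the Gemini Live API, transcription, voice, and system instructions are requested in the initial `setup` message sent over the WebSocket. The sketch below builds such a message, assuming the field names of the public `BidiGenerateContentSetup` schema as I understand it (`inputAudioTranscription` and `outputAudioTranscription` enable transcription when present as empty objects). Verify the fields against the current API reference; the model name is a placeholder, and this is not necessarily how the proxy builds its setup message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type part struct {
	Text string `json:"text"`
}

type content struct {
	Parts []part `json:"parts"`
}

type prebuiltVoiceConfig struct {
	VoiceName string `json:"voiceName"`
}

type voiceConfig struct {
	PrebuiltVoiceConfig prebuiltVoiceConfig `json:"prebuiltVoiceConfig"`
}

type speechConfig struct {
	VoiceConfig  voiceConfig `json:"voiceConfig"`
	LanguageCode string      `json:"languageCode,omitempty"`
}

type generationConfig struct {
	ResponseModalities []string     `json:"responseModalities"`
	SpeechConfig       speechConfig `json:"speechConfig"`
}

// setup mirrors the subset of setup fields used here; nil pointers are omitted.
type setup struct {
	Model                    string           `json:"model"`
	GenerationConfig         generationConfig `json:"generationConfig"`
	SystemInstruction        *content         `json:"systemInstruction,omitempty"`
	InputAudioTranscription  *struct{}        `json:"inputAudioTranscription,omitempty"`
	OutputAudioTranscription *struct{}        `json:"outputAudioTranscription,omitempty"`
}

type setupMessage struct {
	Setup setup `json:"setup"`
}

// buildSetup is a hypothetical helper producing the first WebSocket message of a session.
func buildSetup(model, instructions, voice, language string, transcribe bool) ([]byte, error) {
	s := setup{
		Model: model,
		GenerationConfig: generationConfig{
			ResponseModalities: []string{"AUDIO"},
			SpeechConfig: speechConfig{
				VoiceConfig:  voiceConfig{PrebuiltVoiceConfig: prebuiltVoiceConfig{VoiceName: voice}},
				LanguageCode: language,
			},
		},
	}
	if instructions != "" {
		s.SystemInstruction = &content{Parts: []part{{Text: instructions}}}
	}
	if transcribe {
		// An empty object is enough to switch transcription on for each direction.
		s.InputAudioTranscription = &struct{}{}
		s.OutputAudioTranscription = &struct{}{}
	}
	return json.Marshal(setupMessage{Setup: s})
}

func main() {
	msg, err := buildSetup("models/example-live-model", "You are a helpful assistant.", "Puck", "en-US", true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(msg))
}
```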
Production Ready
Handles the audio path end to end: G.711 codec conversion, resampling between 8 kHz and 16/24 kHz, and an N-way media bridging architecture.
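Both resampling steps are integer ratios: 8 kHz to 16 kHz is ×2 and 24 kHz to 8 kHz is ÷3. The sketch below uses the simplest possible methods, linear interpolation for upsampling and 3-sample averaging (a crude low-pass filter) before decimation. It is illustrative only; a production resampler would normally use a proper polyphase or windowed-sinc filter, and `upsample2x`/`downsample3x` are hypothetical names.

```go
package main

import "fmt"

// upsample2x doubles the sample rate (e.g. 8 kHz -> 16 kHz) by inserting the
// midpoint between each pair of neighbouring samples.
func upsample2x(in []int16) []int16 {
	out := make([]int16, 0, len(in)*2)
	for i, s := range in {
		next := s // repeat the final sample, since there is no right neighbour
		if i+1 < len(in) {
			next = in[i+1]
		}
		out = append(out, s, int16((int32(s)+int32(next))/2))
	}
	return out
}

// downsample3x divides the sample rate by 3 (e.g. 24 kHz -> 8 kHz), averaging each
// group of three samples. Leftover samples (fewer than 3) are dropped here; a
// streaming implementation would carry them over to the next frame.
func downsample3x(in []int16) []int16 {
	out := make([]int16, 0, len(in)/3)
	for i := 0; i+2 < len(in); i += 3 {
		sum := int32(in[i]) + int32(in[i+1]) + int32(in[i+2])
		out = append(out, int16(sum/3))
	}
	return out
}

func main() {
	fmt.Println(upsample2x([]int16{0, 100, 200}))          // [0 50 100 150 200 200]
	fmt.Println(downsample3x([]int16{3, 6, 9, 30, 30, 30})) // [6 30]
}
```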
Quick Start
git clone https://github.com/livetok-ai/sip-proxy
cd sip-proxy
go mod download
export GOOGLE_API_KEY=your_api_key
go build -o sip-proxy
./sip-proxy --public-ip YOUR_PUBLIC_IP --twilio-port 8080
Architecture
The SIP Gemini Proxy consists of five core components working together:
- SIP Server - Handles INVITE/BYE/ACK messages and SIP protocol negotiation
- RTP Handler - Processes audio packets and performs G.711 codec conversion
- Media Bridge - Routes audio between participants with N-way broadcasting support
- Gemini Handler - Manages WebSocket connection to Gemini Live API
- Twilio Server - Webhook endpoint for phone number integration
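Before the RTP Handler can run G.711 conversion, it has to strip the RTP header. The sketch below parses the RFC 3550 fixed header, skips any CSRC list and header extension, removes trailing padding, and maps the static payload types 0 and 8 (RFC 3551) to PCMU and PCMA. The `rtpPacket` type and `parseRTP` function are hypothetical, not this project's API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errTruncated = errors.New("rtp: packet truncated")

// rtpPacket holds the fixed-header fields used here (RFC 3550, section 5.1).
type rtpPacket struct {
	Marker      bool
	PayloadType uint8
	Seq         uint16
	Timestamp   uint32
	SSRC        uint32
	Payload     []byte
}

// parseRTP decodes the fixed header and returns the media payload.
func parseRTP(b []byte) (rtpPacket, error) {
	var p rtpPacket
	if len(b) < 12 {
		return p, errTruncated
	}
	if b[0]>>6 != 2 {
		return p, errors.New("rtp: unsupported version")
	}
	hasPadding := b[0]&0x20 != 0
	hasExtension := b[0]&0x10 != 0
	csrcCount := int(b[0] & 0x0F)
	p.Marker = b[1]&0x80 != 0
	p.PayloadType = b[1] & 0x7F
	p.Seq = binary.BigEndian.Uint16(b[2:4])
	p.Timestamp = binary.BigEndian.Uint32(b[4:8])
	p.SSRC = binary.BigEndian.Uint32(b[8:12])

	offset := 12 + 4*csrcCount // each CSRC identifier is 4 bytes
	if hasExtension {
		if len(b) < offset+4 {
			return p, errTruncated
		}
		// Extension header: 16-bit profile, then a 16-bit length in 32-bit words.
		extWords := int(binary.BigEndian.Uint16(b[offset+2 : offset+4]))
		offset += 4 + 4*extWords
	}
	end := len(b)
	if hasPadding {
		end -= int(b[len(b)-1]) // the last byte counts the padding bytes, itself included
	}
	if offset > end {
		return p, errTruncated
	}
	p.Payload = b[offset:end]
	return p, nil
}

// codecName maps the static G.711 payload types.
func codecName(pt uint8) string {
	switch pt {
	case 0:
		return "PCMU"
	case 8:
		return "PCMA"
	default:
		return "unknown"
	}
}

func main() {
	pkt := []byte{
		0x80, 0x00, // V=2, no padding/extension/CSRC; M=0, PT=0 (PCMU)
		0x00, 0x01, // sequence number 1
		0x00, 0x00, 0x00, 0xA0, // timestamp 160 (one 20 ms frame at 8 kHz)
		0x12, 0x34, 0x56, 0x78, // SSRC
		0xFF, 0xFF, // two μ-law samples of silence
	}
	p, err := parseRTP(pkt)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("pt=%d (%s) seq=%d ts=%d ssrc=%#x payload=%d bytes\n",
		p.PayloadType, codecName(p.PayloadType), p.Seq, p.Timestamp, p.SSRC, len(p.Payload))
}
```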
When a call arrives, the flow is: Phone → Twilio → Webhook → SIP/RTP → Media Bridge → Gemini Live API, with bidirectional audio streaming throughout the conversation.
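The Media Bridge's N-way broadcasting can be pictured as a fan-out: every audio frame from one participant (the caller, or Gemini) goes to all the others. The sketch below assumes frames of decoded PCM samples and one buffered channel per participant; dropping frames for a slow receiver instead of blocking the sender is a design choice made here for real-time audio, not something the project description specifies. `Bridge`, `Join`, `Leave`, and `Broadcast` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// Bridge fans audio frames out to every participant except the sender.
type Bridge struct {
	mu           sync.Mutex
	participants map[string]chan []int16
}

func NewBridge() *Bridge {
	return &Bridge{participants: make(map[string]chan []int16)}
}

// Join registers a participant and returns the channel its audio arrives on.
func (b *Bridge) Join(id string) <-chan []int16 {
	ch := make(chan []int16, 50) // about 1 s of 20 ms frames of headroom
	b.mu.Lock()
	b.participants[id] = ch
	b.mu.Unlock()
	return ch
}

// Leave removes a participant and closes its channel.
func (b *Bridge) Leave(id string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if ch, ok := b.participants[id]; ok {
		close(ch)
		delete(b.participants, id)
	}
}

// Broadcast delivers frame to all participants other than from and reports how
// many received it. Receivers share the slice and must not modify it.
func (b *Bridge) Broadcast(from string, frame []int16) int {
	b.mu.Lock()
	defer b.mu.Unlock()
	delivered := 0
	for id, ch := range b.participants {
		if id == from {
			continue
		}
		select {
		case ch <- frame:
			delivered++
		default: // receiver is falling behind; drop the frame rather than block
		}
	}
	return delivered
}

func main() {
	b := NewBridge()
	caller := b.Join("caller")
	gemini := b.Join("gemini")

	fmt.Println("delivered to", b.Broadcast("caller", []int16{0, 50, 100}), "participant(s)")
	fmt.Println("frame for gemini:", <-gemini)
	fmt.Println("frames queued for caller:", len(caller))
	b.Leave("gemini")
}
```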
Configuration
Customize each call through a callback URL that returns a JSON configuration object:
{
  "system_instructions": "You are a helpful assistant...",
  "voice": "Puck",
  "language": "en-US"
}

Available voices: Puck, Charon, Kore, Fenrir, Aoede. Configure per-call instructions, language, and voice characteristics dynamically based on caller ID, time of day, or any custom logic.
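On the proxy side, the callback response can be decoded into a small struct whose JSON tags match the example above. The sketch below also rejects voices outside the five listed; how the real server handles an unknown or missing voice is not described, so that check (and the `CallConfig`/`parseCallConfig` names) is a hypothetical illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CallConfig mirrors the callback JSON shown above.
type CallConfig struct {
	SystemInstructions string `json:"system_instructions"`
	Voice              string `json:"voice"`
	Language           string `json:"language"`
}

// validVoices lists the voices named in this document.
var validVoices = map[string]bool{
	"Puck": true, "Charon": true, "Kore": true, "Fenrir": true, "Aoede": true,
}

// parseCallConfig decodes a callback response and validates the voice, if one is set.
func parseCallConfig(data []byte) (CallConfig, error) {
	var cfg CallConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return CallConfig{}, fmt.Errorf("decoding call config: %w", err)
	}
	if cfg.Voice != "" && !validVoices[cfg.Voice] {
		return CallConfig{}, fmt.Errorf("unsupported voice %q", cfg.Voice)
	}
	return cfg, nil
}

func main() {
	body := []byte(`{"system_instructions": "You are a helpful assistant.", "voice": "Kore", "language": "en-US"}`)
	cfg, err := parseCallConfig(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("voice=%s language=%s\n", cfg.Voice, cfg.Language)
}
```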