SIP Gemini Proxy

A lightweight server bridging traditional telephony systems with Google's Gemini Live API, enabling real-time voice AI conversations over standard phone infrastructure.

This proxy translates between SIP/RTP telephony and the Gemini Live WebSocket API, handling G.711 codec conversion (PCMU/PCMA), audio resampling (8kHz ↔ 16-24kHz), and real-time bidirectional audio streaming.

It is suited to integrating AI voice agents with existing phone systems, Twilio numbers, or any SIP-based telephony infrastructure, and it supports custom system instructions, voice selection, and automatic transcription.

  • Telephony Integration

    Bridges traditional SIP telephony systems with the Google Gemini Live API for real-time voice AI conversations.

  • Protocol Translation

    Converts between SIP/RTP telephony protocols and the WebSocket-based Gemini API, handling G.711 codec conversion and audio resampling.

  • Twilio Ready

    Built-in webhook server for easy integration with Twilio phone numbers and other SIP providers.

  • Flexible Configuration

    Customize system instructions, voice selection, and language settings per call through a simple callback API.

  • Real-time Transcription

    Automatic transcription of input and output audio for monitoring and logging purposes.

  • Production Ready

    Handles audio processing with G.711 codec conversion, resampling between 8kHz and 16-24kHz, and an N-way media bridging architecture.

Quick Start

git clone https://github.com/livetok-ai/sip-proxy
cd sip-proxy
go mod download
export GOOGLE_API_KEY=your_api_key
go build -o sip-proxy
./sip-proxy --public-ip YOUR_PUBLIC_IP --twilio-port 8080

Architecture

The SIP Gemini Proxy consists of five core components working together:

  • SIP Server - Handles INVITE/BYE/ACK messages and SIP protocol negotiation
  • RTP Handler - Processes audio packets and performs G.711 codec conversion
  • Media Bridge - Routes audio between participants with N-way broadcasting support
  • Gemini Handler - Manages the WebSocket connection to the Gemini Live API
  • Twilio Server - Webhook endpoint for phone number integration
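
To make the Media Bridge's N-way broadcasting concrete, here is a minimal sketch of one way such a bridge could be structured. The `Bridge` type, its methods, and the drop-on-full policy are all assumptions made for illustration; they are not taken from the project's source.

```go
package main

import (
	"fmt"
	"sync"
)

// Bridge is a hypothetical N-way media bridge: every frame sent by one
// participant is delivered to all other participants.
type Bridge struct {
	mu    sync.Mutex
	parts map[string]chan []int16 // participant ID -> inbound audio queue
}

// NewBridge returns an empty bridge.
func NewBridge() *Bridge {
	return &Bridge{parts: make(map[string]chan []int16)}
}

// Join registers a participant and returns the channel on which it
// receives audio from everyone else.
func (b *Bridge) Join(id string) <-chan []int16 {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan []int16, 16)
	b.parts[id] = ch
	return ch
}

// Broadcast delivers frame to every participant except the sender.
// If a receiver's queue is full, the frame is dropped instead of
// blocking the real-time audio path.
func (b *Bridge) Broadcast(from string, frame []int16) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for id, ch := range b.parts {
		if id == from {
			continue // never echo audio back to its source
		}
		select {
		case ch <- frame:
		default: // receiver is behind; drop this frame
		}
	}
}

func main() {
	b := NewBridge()
	caller := b.Join("caller")
	gemini := b.Join("gemini")

	b.Broadcast("caller", []int16{1, 2, 3})

	// The Gemini side receives the frame; the caller's own queue stays empty.
	fmt.Println(len(<-gemini), len(caller))
}
```

With two participants (the caller and Gemini) this reduces to a plain bidirectional relay; the same loop serves additional listeners, such as a recorder, without changes.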

When a call arrives, the flow is: Phone → Twilio → Webhook → SIP/RTP → Media Bridge → Gemini Live API, with bidirectional audio streaming throughout the conversation.
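
The inbound half of that flow (G.711 from the phone, linear PCM toward Gemini) can be sketched as below. The µ-law expansion follows the standard G.711 decoding procedure. The 2x linear-interpolation upsampler is a deliberately simple stand-in chosen for this example and may differ from what the proxy actually uses; a production resampler would normally apply a low-pass filter. The function names are hypothetical.

```go
package main

import "fmt"

// ulawToLinear expands one G.711 µ-law (PCMU) byte to a 16-bit
// linear PCM sample.
func ulawToLinear(u byte) int16 {
	u = ^u // µ-law bytes are transmitted bit-inverted
	sign := u & 0x80
	exponent := (u >> 4) & 0x07
	mantissa := u & 0x0F

	// Rebuild the magnitude, then remove the encoder bias (0x84).
	sample := ((int16(mantissa) << 3) + 0x84) << exponent
	sample -= 0x84
	if sign != 0 {
		return -sample
	}
	return sample
}

// upsample2x doubles the sample rate (for example 8 kHz to 16 kHz) by
// inserting the midpoint between each pair of neighbouring samples.
// The final sample is repeated because it has no successor.
func upsample2x(in []int16) []int16 {
	out := make([]int16, 0, len(in)*2)
	for i, s := range in {
		next := s
		if i+1 < len(in) {
			next = in[i+1]
		}
		mid := int16((int32(s) + int32(next)) / 2)
		out = append(out, s, mid)
	}
	return out
}

func main() {
	// A few PCMU payload bytes as they might arrive in an RTP packet.
	payload := []byte{0xFF, 0x80, 0x00}

	pcm8k := make([]int16, len(payload))
	for i, b := range payload {
		pcm8k[i] = ulawToLinear(b)
	}

	pcm16k := upsample2x(pcm8k)
	fmt.Println(pcm8k, len(pcm16k))
}
```

The return path works in reverse: audio from Gemini is downsampled to 8kHz and compressed back to G.711 before being packetized as RTP.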

Configuration

Customize each call by providing a callback URL that returns a JSON configuration:

{
  "system_instructions": "You are a helpful assistant...",
  "voice": "Puck",
  "language": "en-US"
}

Available voices: Puck, Charon, Kore, Fenrir, Aoede. Configure per-call instructions, language, and voice characteristics dynamically based on caller ID, time of day, or any custom logic.
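
As an illustration of per-call logic, the sketch below shows a callback handler that chooses settings from the caller's number and returns the JSON shape shown above. The request format the proxy sends to the callback is not described here, so the `from` query parameter, the `/config` path, the number-prefix rule, and the `es-ES` language value are all assumptions for the example; only the three response fields come from the configuration above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// CallConfig mirrors the JSON fields of the per-call configuration.
type CallConfig struct {
	SystemInstructions string `json:"system_instructions"`
	Voice              string `json:"voice"`
	Language           string `json:"language"`
}

// configFor picks settings based on the caller's number.
// The prefix rule is purely illustrative.
func configFor(callerID string) CallConfig {
	cfg := CallConfig{
		SystemInstructions: "You are a helpful assistant.",
		Voice:              "Puck",
		Language:           "en-US",
	}
	if strings.HasPrefix(callerID, "+34") {
		cfg.SystemInstructions = "You are a helpful assistant. Reply in Spanish."
		cfg.Voice = "Kore"
		cfg.Language = "es-ES"
	}
	return cfg
}

// handleConfig answers a callback request with the per-call JSON.
// ASSUMPTION: the caller ID arrives as a "from" query parameter.
func handleConfig(w http.ResponseWriter, r *http.Request) {
	cfg := configFor(r.URL.Query().Get("from"))
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(cfg); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func main() {
	// Exercise the handler in-process instead of opening a port.
	req := httptest.NewRequest(http.MethodGet, "/config?from=%2B34600000000", nil)
	rec := httptest.NewRecorder()
	handleConfig(rec, req)
	fmt.Print(rec.Body.String())
}
```

To serve this for real, register the handler with `http.HandleFunc("/config", handleConfig)` and start `http.ListenAndServe` on a port of your choice, then point the proxy's callback URL at it.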