bitHuman Avatar Runtime

Real-time avatar engine for visual AI agents, digital humans, and creative characters.
bitHuman powers visual AI agents and conversational AI with photorealistic avatars and real-time lip-sync. Build voice agents with faces, video chatbots, AI assistants, and interactive digital humans — all running on edge devices with just 1-2 CPU cores and <200ms latency. Raw generation speed is 100+ FPS on CPU alone, enabling real-time streaming applications.
Installation
pip install bithuman --upgrade
Pre-built wheels for all major platforms — no compilation required:
|  | Linux | macOS | Windows |
|---|---|---|---|
| x86_64 | yes | yes | yes |
| ARM64 | yes | yes (Apple Silicon) | — |
| Python | 3.9 — 3.14 | 3.9 — 3.14 | 3.9 — 3.14 |
For LiveKit agent integration:
pip install bithuman[agent]
Quick Start
Generate a lip-synced video
bithuman generate avatar.imx --audio speech.wav --key YOUR_API_KEY
Stream a live avatar to your browser
# Terminal 1: Start the streaming server
bithuman stream avatar.imx --key YOUR_API_KEY
# Terminal 2: Send audio to trigger lip-sync
bithuman speak speech.wav
Open http://localhost:3001 to see the avatar streaming live.
Python API (async)
import asyncio

from bithuman import AsyncBithuman
from bithuman.audio import load_audio, float32_to_int16

async def main():
    runtime = await AsyncBithuman.create(
        model_path="avatar.imx",
        api_secret="YOUR_API_KEY",
    )
    await runtime.start()

    # Load and stream audio
    audio, sr = load_audio("speech.wav")
    audio_int16 = float32_to_int16(audio)

    async def stream_audio():
        chunk_size = sr // 25  # match video FPS
        for i in range(0, len(audio_int16), chunk_size):
            await runtime.push_audio(
                audio_int16[i:i + chunk_size].tobytes(), sr
            )
        await runtime.flush()

    asyncio.create_task(stream_audio())

    # Receive lip-synced video frames
    async for frame in runtime.run():
        if frame.has_image:
            image = frame.bgr_image   # numpy (H, W, 3), uint8
            audio = frame.audio_chunk # synchronized audio
        if frame.end_of_speech:
            break

    await runtime.stop()

asyncio.run(main())
Python API (sync)
from bithuman import Bithuman
from bithuman.audio import load_audio, float32_to_int16
runtime = Bithuman.create(model_path="avatar.imx", api_secret="YOUR_API_KEY")
audio, sr = load_audio("speech.wav")
audio_int16 = float32_to_int16(audio)
chunk_size = sr // 100
for i in range(0, len(audio_int16), chunk_size):
runtime.push_audio(audio_int16[i:i+chunk_size].tobytes(), sr)
runtime.flush()
for frame in runtime.run():
if frame.has_image:
image = frame.bgr_image
if frame.end_of_speech:
break
How It Works
- Load model — the .imx file contains the avatar's appearance, animations, and lip-sync data
- Push audio — stream audio bytes in real time via push_audio(), call flush() when done
- Get frames — iterate runtime.run() to receive lip-synced video frames with synchronized audio
The runtime handles the full motion graph internally: idle animations, talking with lip-sync, head movements, blinking, and smooth transitions between states.
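The chunking used in the examples above follows directly from the frame rate: at 25 FPS each video frame covers 40 ms of audio, so a chunk is sample_rate // 25 samples. A minimal, library-free sketch of that math (chunk_audio is a hypothetical helper for illustration, not part of the bithuman API):

```python
import numpy as np

def chunk_audio(audio_int16: np.ndarray, sample_rate: int, fps: int = 25):
    """Split int16 PCM into per-video-frame chunks of sample_rate // fps samples."""
    chunk_size = sample_rate // fps
    return [audio_int16[i:i + chunk_size]
            for i in range(0, len(audio_int16), chunk_size)]

# One second of 16 kHz audio -> 25 chunks of 640 samples (40 ms each)
chunks = chunk_audio(np.zeros(16000, dtype=np.int16), 16000)
```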
Performance
| Metric | Value |
|---|---|
| Raw FPS | 100+ on CPU (Intel i5-12400, Apple M2) |
| CPU cores | 1-2 cores at 25 FPS |
| End-to-end latency | <200ms |
| Memory (IMX v2) | ~200 MB per session |
| Model load time | <10ms (IMX v2) |
| Audio formats | WAV, MP3, FLAC, OGG, M4A |
Features
- Real-time lip-sync — Audio-driven mouth animation at 25 FPS with synchronized audio output
- Cross-platform — Linux, macOS, Windows; x86_64 and ARM64; Python 3.9-3.14
- Edge-ready — 1-2 CPU cores, no GPU required for inference
- Sync + Async — Bithuman for threads, AsyncBithuman for async/await
- Streaming-first — Push audio chunks in real-time, receive frames as they're generated
- Actions & emotions — Trigger avatar gestures (wave, nod) and emotion states (joy, surprise)
- Interrupt support — Cancel mid-speech for natural conversation flow
- LiveKit integration — Built-in support for LiveKit Agents (WebRTC streaming)
- CLI tools — Generate videos, stream live, convert models, validate setups
- IMX v2 format — Optimized binary container with O(1) random access and WebP patches
- Zero scipy dependency — Pure numpy audio pipeline, minimal install footprint
API Reference
AsyncBithuman / Bithuman
The main runtime for avatar animation.
# Create and initialize
runtime = await AsyncBithuman.create(
    model_path="avatar.imx",  # Path to .imx model
    api_secret="API_KEY",     # API secret (recommended)
    # token="JWT_TOKEN",      # Or JWT token directly
)
await runtime.start()

# Push audio (int16 PCM, any sample rate — auto-resampled to 16kHz)
await runtime.push_audio(audio_bytes, sample_rate)
await runtime.flush()  # Signal end of speech
runtime.interrupt()    # Cancel current playback

# Receive frames
async for frame in runtime.run():
    frame.bgr_image          # np.ndarray (H, W, 3) uint8 BGR
    frame.rgb_image          # np.ndarray (H, W, 3) uint8 RGB
    frame.audio_chunk        # AudioChunk — synchronized audio
    frame.end_of_speech      # True when all audio processed
    frame.has_image          # True if image available
    frame.frame_index        # Frame number
    frame.source_message_id  # Correlates to input
# Controls
await runtime.push(VideoControl(action="wave")) # Trigger action
await runtime.push(VideoControl(target_video="idle")) # Switch state
runtime.set_muted(True) # Mute processing
# Info
runtime.get_frame_size() # (width, height)
runtime.get_first_frame() # First idle frame as np.ndarray
runtime.get_expiration_time() # Token expiry (unix timestamp)
runtime.is_token_validated() # Auth status
await runtime.stop()
Data Classes
from bithuman import AudioChunk, VideoControl, VideoFrame, Emotion, EmotionPrediction
# AudioChunk — container for audio data
chunk = AudioChunk(data=np.array([...], dtype=np.int16), sample_rate=16000)
chunk.duration # float — length in seconds
chunk.bytes # bytes — raw PCM bytes
# VideoControl — input to the runtime
ctrl = VideoControl(
    audio=chunk,             # Audio to lip-sync
    action="wave",           # Trigger action (wave, nod, etc.)
    target_video="talking",  # Switch video state
    end_of_speech=True,      # Mark end of speech
    force_action=False,      # Override action deduplication
    emotion_preds=[          # Set emotion state
        EmotionPrediction(emotion=Emotion.JOY, score=0.9),
    ],
)
# VideoFrame — output from runtime.run()
frame.bgr_image # np.ndarray (H, W, 3) uint8 — BGR
frame.rgb_image # np.ndarray (H, W, 3) uint8 — RGB
frame.audio_chunk # AudioChunk — synchronized audio
frame.end_of_speech # bool — True when done
frame.has_image # bool — True if image available
frame.frame_index # int — frame number
frame.source_message_id # Hashable — correlates to VideoControl
# Emotion enum
Emotion.ANGER | Emotion.DISGUST | Emotion.FEAR | Emotion.JOY
Emotion.NEUTRAL | Emotion.SADNESS | Emotion.SURPRISE
Audio Utilities
from bithuman.audio import (
    load_audio,              # Load WAV/MP3/FLAC/OGG/M4A -> (float32, sr)
    float32_to_int16,        # float32 -> int16
    int16_to_float32,        # int16 -> float32
    resample,                # Resample to target rate
    write_video_with_audio,  # Save MP4 with audio track
    AudioStreamBatcher,      # Real-time audio buffer
)
audio, sr = load_audio("speech.mp3") # Any format
audio_int16 = float32_to_int16(audio) # Ready for push_audio
audio_16k = resample(audio, sr, 16000) # Resample
write_video_with_audio("out.mp4", frames, audio, sr, fps=25)
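For reference, the float32/int16 helpers follow the standard PCM scaling between [-1.0, 1.0] floats and the int16 range. A pure-numpy sketch of the equivalent math (not the library's actual implementation, which may differ in clipping or rounding details):

```python
import numpy as np

def f32_to_i16(x: np.ndarray) -> np.ndarray:
    # Clip to [-1.0, 1.0], then scale to the int16 range
    return (np.clip(x, -1.0, 1.0) * 32767.0).astype(np.int16)

def i16_to_f32(x: np.ndarray) -> np.ndarray:
    # Scale back to [-1.0, 1.0]
    return x.astype(np.float32) / 32767.0

samples = f32_to_i16(np.array([0.0, 1.0, -1.0], dtype=np.float32))
# -> [0, 32767, -32767]
```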
Exceptions
All exceptions inherit from BithumanError:
| Exception | When |
|---|---|
| TokenExpiredError | JWT has expired |
| TokenValidationError | Invalid signature or claims |
| TokenRequestError | Auth server unreachable |
| AccountStatusError | Billing or access issue (HTTP 402/403) |
| ModelNotFoundError | Model file doesn't exist |
| ModelLoadError | Corrupt or incompatible model |
| ModelSecurityError | Security restriction triggered |
| RuntimeNotReadyError | Operation called before initialization |
LiveKit Agent Integration
Build conversational AI agents with avatar faces using LiveKit Agents:
from bithuman import AsyncBithuman
from bithuman.utils.agent import LocalAvatarRunner, LocalVideoPlayer, LocalAudioIO
# Initialize bitHuman runtime
runtime = await AsyncBithuman.create(
    model_path="avatar.imx",
    api_secret="YOUR_API_KEY",
)

# Connect to LiveKit agent session
avatar = LocalAvatarRunner(
    bithuman_runtime=runtime,
    audio_input=session.audio,
    audio_output=LocalAudioIO(session, agent_output),
    video_output=LocalVideoPlayer(window_size=(1280, 720)),
)
await avatar.start()
See examples/livekit_agent/ for a complete working example with OpenAI Realtime voice.
Optimize Your Models
Convert existing .imx models to IMX v2 for dramatically better performance:
bithuman convert avatar.imx
| Metric | Legacy (TAR) | IMX v2 | Improvement |
|---|---|---|---|
| Model size | 100 MB | 50-70 MB | 30-50% smaller |
| Load time | ~10s | <10ms | 1000x faster |
| Runtime speed | ~30 FPS | 100+ FPS | 3-10x faster |
| Peak memory | ~10 GB | ~200 MB | 98% less |
Conversion is automatic on first load, but pre-converting saves startup time.
CLI Reference
| Command | Description |
|---|---|
| bithuman generate <model> --audio <file> | Generate lip-synced MP4 from model + audio |
| bithuman stream <model> | Start live streaming server at localhost:3001 |
| bithuman speak <audio> | Send audio to running stream server |
| bithuman action <name> | Trigger avatar action (wave, nod, etc.) |
| bithuman info <model> | Show model metadata |
| bithuman list-videos <model> | List all videos in a model |
| bithuman convert <model> | Convert legacy to optimized IMX v2 |
| bithuman validate <path> | Validate model files load correctly |
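Putting a few of these together (assuming a model at avatar.imx and an API key exported as BITHUMAN_API_SECRET), a typical offline workflow might look like:

```shell
# Inspect the model, pre-convert it to IMX v2, then render a test clip
bithuman info avatar.imx
bithuman convert avatar.imx
bithuman generate avatar.imx --audio speech.wav --key "$BITHUMAN_API_SECRET"
```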
Configuration
Environment Variables
| Variable | Description |
|---|---|
| BITHUMAN_API_SECRET | API secret for authentication |
| BITHUMAN_RUNTIME_TOKEN | JWT token (alternative to API secret) |
| BITHUMAN_VERBOSE | Enable debug logging |
| CONVERT_THREADS | Number of threads for model conversion (0 or unset = auto-detect) |
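For example, credentials can be exported once in the shell so that subsequent CLI commands pick them up (assuming the variables are read at startup, as the table describes):

```shell
export BITHUMAN_API_SECRET="your-api-key"
export BITHUMAN_VERBOSE=1
bithuman stream avatar.imx  # presumably no --key needed once the secret is exported
```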
Runtime Settings
| Setting | Default | Description |
|---|---|---|
| FPS | 25 | Target frames per second |
| OUTPUT_WIDTH | 1280 | Output frame width (0 = native resolution) |
| PRELOAD_TO_MEMORY | False | Cache model in RAM for faster decode |
| PROCESS_IDLE_VIDEO | True | Run inference during silence (natural idle) |
Use Cases
- Visual AI Agents — Give your voice agents a face with real-time lip-sync
- Conversational AI — Build video chatbots and AI assistants with human-like presence
- Live Streaming — Stream avatars to browsers via WebSocket, LiveKit, or WebRTC
- Video Generation — Generate lip-synced content from audio at 100+ FPS
- Edge AI — Run locally on Raspberry Pi, Mac Mini, Chromebook, or any edge device
- Digital Twins — Photorealistic replicas for customer service, education, or entertainment
Examples
| Example | Description |
|---|---|
| example.py | Async runtime with live video + audio playback |
| example_sync.py | Synchronous runtime with threading |
| livekit_agent/ | LiveKit Agent with OpenAI Realtime voice |
| livekit_webrtc/ | WebRTC streaming server |
Troubleshooting
macOS: Duplicate FFmpeg library warnings
objc: Class AVFFrameReceiver is implemented in both .../cv2/.dylibs/libavdevice...
and .../av/.dylibs/libavdevice...
This happens when opencv-python (full) is installed alongside av (PyAV) — both bundle FFmpeg dylibs. Fix by switching to the headless variant:
pip install opencv-python-headless
This replaces opencv-python and removes the duplicate dylibs. The bithuman package already depends on opencv-python-headless, so this only occurs when another package has pulled in the full opencv-python.
Model conversion fails with TypeError
If you see TypeError: an integer is required during conversion, upgrade to the latest version:
pip install bithuman --upgrade
This was fixed in v1.6.2. The issue affected models in legacy TAR format during auto-conversion.
Getting a bitHuman Model
To create your own avatar model (.imx file):
- Visit bithuman.ai
- Register and subscribe
- Upload a photo or video to create your avatar
- Download your .imx model file
License
Commercial license required. See bithuman.ai for pricing.