Development Status
- 3 - Alpha
License
- OSI Approved :: Apache Software License
Programming Language
- Python :: 3
- Python :: 3.9
- Python :: 3.10
- Python :: 3.11
- Python :: 3.12
- Python :: 3.13
Topic
- Scientific/Engineering :: Artificial Intelligence
fastsafetensor-3fs-reader
3FS USRBIO file reader for fastsafetensors.
This package provides a high-performance reader for 3FS USRBIO files with two backend implementations (C++ and pure-Python) and a mock for testing.
Backends
| Backend | Module | Requirements | Performance |
|---|---|---|---|
| C++ | reader_cpp.py |
libhf3fs_api_shared.so + libtorch + CUDA |
Best (GIL-free, native USRBIO async I/O) |
| Python | reader_py.py |
hf3fs_py_usrbio (+ optional PyTorch for GPU) |
Good (USRBIO via Client API or OS pread) |
| Mock | mock.py |
None | For testing only |
The package auto-selects the best available backend at import time:
C++ → Python → Mock. Use get_backend() to check which one is active.
Note: The C++ backend supports pipelined mode (double-buffered async H2D copy via
cudaMemcpyAsync) which overlaps network I/O with GPU memory transfer for significantly better throughput. Passpipelined=Truetoread_chunked()to enable it. The Python backend does not support pipelining and will silently fall back to non-pipelined mode.
Installation
Pure-Python mode (no C++ compilation)
FST3FS_NO_EXT=1 pip install .
With C++ extension
Requires libhf3fs_api_shared.so (from a 3FS build) and CUDA Runtime.
The hf3fs_usrbio.h header is bundled in the package, so no external
header dependency is needed:
export HF3FS_LIB_DIR=/path/to/3FS/build/lib # directory with libhf3fs_api_shared.so
pip install .
Automatic libhf3fs_api_shared.so discovery
At import time, the package automatically searches for
libhf3fs_api_shared.so using the following priority:
HF3FS_LIB_DIRenvironment variable (user-explicit, highest priority).LD_LIBRARY_PATHdirectories (user already configured).hf3fs_py_usrbiopip install path — ifhf3fs_py_usrbiois installed via pip, the library is typically located in a sibling.libs/directory (e.g.site-packages/hf3fs_py_usrbio.libs/). This is discovered automatically so you don't need to setLD_LIBRARY_PATHmanually.
The library is pre-loaded with RTLD_GLOBAL so that both the C++ and
Python backends can resolve its symbols. Use get_hf3fs_lib_path() to
check which path was loaded:
from fastsafetensor_3fs_reader import get_hf3fs_lib_path
print(get_hf3fs_lib_path()) # e.g. "/path/to/site-packages/hf3fs_py_usrbio.libs/libhf3fs_api_shared.so"
Installing hf3fs_py_usrbio (for the Python backend)
hf3fs_py_usrbio is not available on PyPI. It must be built from the
DeepSeek 3FS source tree:
git clone https://github.com/deepseek-ai/3FS
cd 3FS
git submodule update --init --recursive
# Follow 3FS build instructions (cmake, etc.)
# After build, install the Python package:
cd build && pip install ..
Important: The default pip-installed
hf3fs_py_usrbiopackage is suitable for testing and validation but is not recommended for production use. For production deployments, build 3FS from source with optimized compiler flags tailored to your hardware. Refer to projects like SGLang for examples of production-grade 3FS compilation workflows.
Usage
from fastsafetensor_3fs_reader import (
ThreeFSFileReader,
MockFileReader,
is_available,
get_backend,
)
# Check which backend is active
print(f"Backend: {get_backend()}") # "cpp", "python", or "mock"
# Use mock reader for testing (always available)
reader = MockFileReader()
headers = reader.read_headers_batch(["/path/to/file.safetensors"])
reader.close()
# Use 3FS reader when available
if is_available():
reader = ThreeFSFileReader(mount_point="/mnt/3fs")
headers = reader.read_headers_batch([
"/mnt/3fs/model-00001.safetensors",
"/mnt/3fs/model-00002.safetensors",
])
# Read tensor data into GPU memory
import torch
buf = torch.empty(1024 * 1024, dtype=torch.uint8, device="cuda")
bytes_read = reader.read_chunked(
path="/mnt/3fs/model-00001.safetensors",
dev_ptr=buf.data_ptr(),
file_offset=0,
total_length=1024 * 1024,
)
reader.close()
Benchmark
The hack/benchmark/ directory contains a comprehensive benchmarking suite.
Use benchmark_runner.py to measure read throughput across different backends,
buffer sizes, chunk sizes, and process counts.
Full benchmark (read + GPU copy)
python hack/benchmark/benchmark_runner.py \
--mount-point /mnt/3fs \
--backends cpp,python \
--buffer-sizes 8,16,32,64,128,256,512 \
--chunk-sizes 8,16,32,64,128,256,512 \
--num-processes 1,2,4,8 \
--iterations 3
Download-only benchmark (host memory only, no GPU copy)
python hack/benchmark/benchmark_runner.py \
--mount-point /mnt/3fs \
--backends cpp,python \
--buffer-sizes 8,16,32,64,128,256,512 \
--chunk-sizes 8,16,32,64,128,256,512 \
--num-processes 1,2,4,8 \
--download-only \
--iterations 3
Key parameters
| Parameter | Description | Default |
|---|---|---|
--mount-point |
3FS FUSE mount-point path | (required) |
--backends |
Comma-separated backend names | mock,python,cpp |
--buffer-sizes |
Buffer sizes in MB | 8,16,32,64,128,256,512,1024 |
--chunk-sizes |
Chunk sizes in MB | 8,16,32,64,128,256,512,1024 |
--num-processes |
Process counts | 1,2,4,8 |
--download-only |
Read into host memory only (skip GPU copy) | false |
--iterations |
Iterations per combination | 3 |
--mode |
grid (sweep all combos) or single |
grid |
--output-dir |
Directory for CSV and chart output | ./benchmark_results |
Performance Results
Test environment: Single 400 Gbps RDMA NIC. These numbers represent a loading baseline under specific storage and network hardware conditions — they do not represent the performance ceiling of the system.
Model: DeepSeek-V3 (total ~640 GB safetensors)
| Configuration | Avg Throughput (GB/s) | Peak Throughput with fastsafetensors (GB/s) | Load Time (s) | Backend |
|---|---|---|---|---|
| 8 processes, buffer=8 MB | 35.0 | 32.0 | 30.34 | C++ (non-pipelined) |
| 8 processes, buffer=16 MB | 37.6 | 36.6 | 25.73 | C++ (pipelined) |
Benchmark: RDMA throughput across buffer sizes (8M / 16M / 32M)

Production: model weight loading with fastsafetensors (pipelined, peak 36.6 GB/s)

License
Apache-2.0
Wheel compatibility matrix
| Platform | CPython 3.9 | CPython 3.10 | CPython 3.11 | CPython 3.12 | CPython 3.13 | Python 3 |
|---|---|---|---|---|---|---|
| any | ||||||
| manylinux_2_34_x86_64 | ||||||
| manylinux_2_35_x86_64 |