fastsafetensor-3fs-reader 0.3.3


pip install fastsafetensor-3fs-reader


Released: Apr 03, 2026


Meta
Requires Python: >=3.9

Classifiers

Development Status
  • 3 - Alpha

License
  • OSI Approved :: Apache Software License

Programming Language
  • Python :: 3
  • Python :: 3.9
  • Python :: 3.10
  • Python :: 3.11
  • Python :: 3.12
  • Python :: 3.13

Topic
  • Scientific/Engineering :: Artificial Intelligence

fastsafetensor-3fs-reader

3FS USRBIO file reader for fastsafetensors.

This package provides a high-performance reader for 3FS USRBIO files with two backend implementations (C++ and pure-Python) and a mock for testing.

Backends

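| Backend | Module        | Requirements                                 | Performance                              |
|---------|---------------|----------------------------------------------|------------------------------------------|
| C++     | reader_cpp.py | libhf3fs_api_shared.so + libtorch + CUDA     | Best (GIL-free, native USRBIO async I/O) |
| Python  | reader_py.py  | hf3fs_py_usrbio (+ optional PyTorch for GPU) | Good (USRBIO via Client API or OS pread) |
| Mock    | mock.py       | None                                         | For testing only                         |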

The package auto-selects the best available backend at import time: C++ → Python → Mock. Use get_backend() to check which one is active.
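
The fallback order can be pictured as a simple probe loop. The sketch below is only an illustration, not the package's actual import logic: `select_backend` and the probe module name `_hypothetical_cpp_ext` are assumptions made for this example, and only `hf3fs_py_usrbio` is a name taken from this README.

```python
import importlib.util


def select_backend(candidates):
    """Return the first backend name whose probe module can be found.

    `candidates` is an ordered list of (backend_name, probe_module) pairs.
    A probe of None means "always available" (for example, a mock).
    """
    for name, probe in candidates:
        if probe is None or importlib.util.find_spec(probe) is not None:
            return name
    raise RuntimeError("no backend available")


# Hypothetical probe modules, ordered C++ -> Python -> Mock.
PRIORITY = [
    ("cpp", "_hypothetical_cpp_ext"),  # assumed name, not the real extension
    ("python", "hf3fs_py_usrbio"),
    ("mock", None),
]

print(select_backend(PRIORITY))
```

In the real package the outcome of this selection is what get_backend() reports.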

Note: The C++ backend supports pipelined mode (double-buffered async H2D copy via cudaMemcpyAsync) which overlaps network I/O with GPU memory transfer for significantly better throughput. Pass pipelined=True to read_chunked() to enable it. The Python backend does not support pipelining and will silently fall back to non-pipelined mode.
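
To illustrate the double-buffering idea behind pipelined mode, here is a minimal pure-Python sketch. It is not the C++ implementation: a worker thread stands in for the asynchronous read, a plain callback stands in for the cudaMemcpyAsync copy, and `pipelined_copy`, `read_chunk`, and `consume` are hypothetical names introduced for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def pipelined_copy(read_chunk, consume, total_length, chunk_size):
    """Overlap the read of chunk i+1 with the consumption of chunk i.

    Two staging buffers alternate: while one is being consumed,
    the other is being filled by the background reader.
    """
    buffers = [bytearray(chunk_size), bytearray(chunk_size)]
    offsets = list(range(0, total_length, chunk_size))
    if not offsets:
        return 0
    done = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        first_len = min(chunk_size, total_length - offsets[0])
        pending = pool.submit(read_chunk, buffers[0], offsets[0], first_len)
        for i, off in enumerate(offsets):
            n = pending.result()  # wait until chunk i is fully read
            if i + 1 < len(offsets):
                nxt = offsets[i + 1]
                nxt_len = min(chunk_size, total_length - nxt)
                # Start reading chunk i+1 into the *other* buffer.
                pending = pool.submit(read_chunk, buffers[(i + 1) % 2], nxt, nxt_len)
            consume(memoryview(buffers[i % 2])[:n], off)  # overlaps with the read
            done += n
    return done


# Demo with in-memory data standing in for the file and the destination.
source = bytes(range(256)) * 10
dest = bytearray(len(source))


def read_chunk(buf, offset, length):
    buf[:length] = source[offset:offset + length]
    return length


def consume(view, offset):
    dest[offset:offset + len(view)] = view


copied = pipelined_copy(read_chunk, consume, len(source), chunk_size=1000)
print(copied, bytes(dest) == source)
```

A buffer is reused only after its previous contents have been consumed, which is why two buffers are enough to keep one read and one copy in flight at the same time.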

Installation

Pure-Python mode (no C++ compilation)

FST3FS_NO_EXT=1 pip install .

With C++ extension

Requires libhf3fs_api_shared.so (from a 3FS build) and CUDA Runtime. The hf3fs_usrbio.h header is bundled in the package, so no external header dependency is needed:

export HF3FS_LIB_DIR=/path/to/3FS/build/lib         # directory with libhf3fs_api_shared.so
pip install .

Automatic libhf3fs_api_shared.so discovery

At import time, the package automatically searches for libhf3fs_api_shared.so using the following priority:

  1. HF3FS_LIB_DIR environment variable (user-explicit, highest priority).
  2. LD_LIBRARY_PATH directories (user already configured).
  3. hf3fs_py_usrbio pip install path — if hf3fs_py_usrbio is installed via pip, the library is typically located in a sibling .libs/ directory (e.g. site-packages/hf3fs_py_usrbio.libs/). This is discovered automatically so you don't need to set LD_LIBRARY_PATH manually.
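
The search order above can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's own code: `find_hf3fs_lib` is a hypothetical helper, and the environment and site-packages directories are passed in explicitly so the logic can be exercised without a real 3FS installation.

```python
import os
import tempfile
from pathlib import Path

LIB_NAME = "libhf3fs_api_shared.so"


def find_hf3fs_lib(environ, site_packages_dirs):
    """Return the first existing library path, following the priority list."""
    candidates = []
    # 1. Explicit HF3FS_LIB_DIR wins.
    explicit = environ.get("HF3FS_LIB_DIR")
    if explicit:
        candidates.append(Path(explicit))
    # 2. Then every LD_LIBRARY_PATH entry, in order.
    for entry in environ.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        if entry:
            candidates.append(Path(entry))
    # 3. Finally the sibling .libs/ directory of a pip-installed hf3fs_py_usrbio.
    for sp in site_packages_dirs:
        candidates.append(Path(sp) / "hf3fs_py_usrbio.libs")
    for directory in candidates:
        lib = directory / LIB_NAME
        if lib.is_file():
            return lib
    return None


# Demo: only the pip-style .libs/ directory contains the library.
with tempfile.TemporaryDirectory() as tmp:
    libs = Path(tmp) / "site-packages" / "hf3fs_py_usrbio.libs"
    libs.mkdir(parents=True)
    (libs / LIB_NAME).touch()
    found = find_hf3fs_lib({}, [Path(tmp) / "site-packages"])
    print(found)
```

Once a path is found, a loader could pre-load it with `ctypes.CDLL(str(path), mode=ctypes.RTLD_GLOBAL)`; that step is left out here so the sketch runs without the real library.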

The library is pre-loaded with RTLD_GLOBAL so that both the C++ and Python backends can resolve its symbols. Use get_hf3fs_lib_path() to check which path was loaded:

from fastsafetensor_3fs_reader import get_hf3fs_lib_path
print(get_hf3fs_lib_path())  # e.g. "/path/to/site-packages/hf3fs_py_usrbio.libs/libhf3fs_api_shared.so"

Installing hf3fs_py_usrbio (for the Python backend)

hf3fs_py_usrbio is not available on PyPI. It must be built from the DeepSeek 3FS source tree:

git clone https://github.com/deepseek-ai/3FS
cd 3FS
git submodule update --init --recursive
# Follow 3FS build instructions (cmake, etc.)
# After build, install the Python package:
cd build && pip install ..

Important: The default pip-installed hf3fs_py_usrbio package is suitable for testing and validation but is not recommended for production use. For production deployments, build 3FS from source with optimized compiler flags tailored to your hardware. Refer to projects like SGLang for examples of production-grade 3FS compilation workflows.

Usage

from fastsafetensor_3fs_reader import (
    ThreeFSFileReader,
    MockFileReader,
    is_available,
    get_backend,
)

# Check which backend is active
print(f"Backend: {get_backend()}")  # "cpp", "python", or "mock"

# Use mock reader for testing (always available)
reader = MockFileReader()
headers = reader.read_headers_batch(["/path/to/file.safetensors"])
reader.close()

# Use 3FS reader when available
if is_available():
    reader = ThreeFSFileReader(mount_point="/mnt/3fs")
    headers = reader.read_headers_batch([
        "/mnt/3fs/model-00001.safetensors",
        "/mnt/3fs/model-00002.safetensors",
    ])

    # Read tensor data into GPU memory
    import torch
    buf = torch.empty(1024 * 1024, dtype=torch.uint8, device="cuda")
    bytes_read = reader.read_chunked(
        path="/mnt/3fs/model-00001.safetensors",
        dev_ptr=buf.data_ptr(),
        file_offset=0,
        total_length=1024 * 1024,
    )
    reader.close()
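
The usage above calls close() explicitly, so an exception raised during a read would skip it. One way to guarantee cleanup is `contextlib.closing`, which works with any object that has a close() method. The README does not say whether the reader classes implement the context-manager protocol themselves, so this sketch does not assume it; `StubReader` is a hypothetical stand-in used only so the example runs without 3FS.

```python
from contextlib import closing


class StubReader:
    """Hypothetical stand-in for a reader; it only tracks whether close() ran."""

    def __init__(self):
        self.closed = False

    def read_headers_batch(self, paths):
        raise OSError("simulated read failure")

    def close(self):
        self.closed = True


reader = StubReader()
try:
    with closing(reader) as r:
        r.read_headers_batch(["/path/to/file.safetensors"])
except OSError as exc:
    print(f"read failed: {exc}")

# close() has run even though the read raised.
print(reader.closed)
```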

Benchmark

The hack/benchmark/ directory contains a comprehensive benchmarking suite. Use benchmark_runner.py to measure read throughput across different backends, buffer sizes, chunk sizes, and process counts.

Full benchmark (read + GPU copy)

python hack/benchmark/benchmark_runner.py \
    --mount-point /mnt/3fs \
    --backends cpp,python \
    --buffer-sizes 8,16,32,64,128,256,512 \
    --chunk-sizes 8,16,32,64,128,256,512 \
    --num-processes 1,2,4,8 \
    --iterations 3

Download-only benchmark (host memory only, no GPU copy)

python hack/benchmark/benchmark_runner.py \
    --mount-point /mnt/3fs \
    --backends cpp,python \
    --buffer-sizes 8,16,32,64,128,256,512 \
    --chunk-sizes 8,16,32,64,128,256,512 \
    --num-processes 1,2,4,8 \
    --download-only \
    --iterations 3

Key parameters

| Parameter       | Description                                | Default                      |
|-----------------|--------------------------------------------|------------------------------|
| --mount-point   | 3FS FUSE mount-point path                  | (required)                   |
| --backends      | Comma-separated backend names              | mock,python,cpp              |
| --buffer-sizes  | Buffer sizes in MB                         | 8,16,32,64,128,256,512,1024  |
| --chunk-sizes   | Chunk sizes in MB                          | 8,16,32,64,128,256,512,1024  |
| --num-processes | Process counts                             | 1,2,4,8                      |
| --download-only | Read into host memory only (skip GPU copy) | false                        |
| --iterations    | Iterations per combination                 | 3                            |
| --mode          | grid (sweep all combos) or single          | grid                         |
| --output-dir    | Directory for CSV and chart output         | ./benchmark_results          |
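
Because grid mode sweeps every combination, the default values imply a large number of runs. The short calculation below estimates that number from the defaults listed above. It assumes the runner takes a plain Cartesian product and does not skip any combination (for instance a chunk size larger than the buffer); the README does not confirm this.

```python
from itertools import product

# Default values taken from the parameter table.
backends = ["mock", "python", "cpp"]
buffer_sizes_mb = [8, 16, 32, 64, 128, 256, 512, 1024]
chunk_sizes_mb = [8, 16, 32, 64, 128, 256, 512, 1024]
num_processes = [1, 2, 4, 8]
iterations = 3

# Assumed: a full Cartesian product with no filtering.
grid = list(product(backends, buffer_sizes_mb, chunk_sizes_mb, num_processes))
total_runs = len(grid) * iterations

print(f"{len(grid)} combinations, {total_runs} runs in total")
```

Narrowing --backends, --buffer-sizes, or --chunk-sizes, as the example commands do, shortens a sweep considerably.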

Performance Results

Test environment: Single 400 Gbps RDMA NIC. These numbers represent a loading baseline under specific storage and network hardware conditions — they do not represent the performance ceiling of the system.

Model: DeepSeek-V3 (total ~640 GB safetensors)

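| Configuration             | Avg Throughput (GB/s) | Peak Throughput with fastsafetensors (GB/s) | Load Time (s) | Backend             |
|---------------------------|-----------------------|---------------------------------------------|---------------|---------------------|
| 8 processes, buffer=8 MB  | 35.0                  | 32.0                                        | 30.34         | C++ (non-pipelined) |
| 8 processes, buffer=16 MB | 37.6                  | 36.6                                        | 25.73         | C++ (pipelined)     |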
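
For context, the reported averages can be compared with the nominal line rate of the NIC. This is a back-of-the-envelope calculation: it assumes decimal units (1 GB/s = 8 Gbps), ignores protocol overhead, and uses the "Avg Throughput" column as reported.

```python
NIC_GBPS = 400                  # single 400 Gbps RDMA NIC, from the test environment
line_rate_gb_s = NIC_GBPS / 8   # nominal line rate in GB/s (decimal units)

# Average throughput values from the results table above.
reported = {
    "C++ (non-pipelined)": 35.0,
    "C++ (pipelined)": 37.6,
}

utilization = {name: gb_s / line_rate_gb_s for name, gb_s in reported.items()}
for name, fraction in utilization.items():
    print(f"{name}: {fraction:.1%} of nominal line rate")
```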

Benchmark: RDMA throughput across buffer sizes (8 MB / 16 MB / 32 MB)

[Figure: RDMA throughput across buffer sizes]

Production: model weight loading with fastsafetensors (pipelined, peak 36.6 GB/s)

[Figure: Model weight loading throughput]

License

Apache-2.0
