sagemaker-feature-store-pyspark 2.0.0


Released: Apr 29, 2026

Meta
Author: Amazon Web Services
Requires Python: >=3.8

SageMaker FeatureStore PySpark

SageMaker FeatureStore Spark is an open source Spark library for Amazon SageMaker FeatureStore. With this connector, you can ingest data from a Spark DataFrame into a FeatureGroup's online and offline stores. This package provides the Python (PySpark) interface.

For full documentation including Scala usage, cross-account Lake Formation access, and troubleshooting, see the GitHub repository.

Supported Versions

Component   Supported Versions
Spark       3.1, 3.2, 3.3, 3.4, 3.5
Python      3.8, 3.9, 3.10, 3.11, 3.12
EMR         emr-7.x and above

Note: Not all Python/PySpark combinations are supported. See the compatibility matrix below.

Python / PySpark Compatibility Matrix

Python \ PySpark   3.1   3.2   3.3   3.4   3.5
3.8                Yes   Yes   Yes   Yes   Yes
3.9                Yes   Yes   Yes   Yes   Yes
3.10               No    Yes   Yes   Yes   Yes
3.11               No    No    No    Yes   Yes
3.12               No    No    No    Yes   Yes

Note: PySpark versions older than 3.5 are in maintenance mode and will not receive new features. New functionality is only added for PySpark 3.5+.
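The matrix above can be checked programmatically before installing. The following is a minimal sketch, not part of the library: the names SUPPORTED, minor_version, and is_supported are hypothetical, and the data is transcribed by hand from the table.

```python
# Compatibility matrix transcribed from the table above:
# Python minor version -> PySpark minor versions marked "Yes".
ALL_SPARK = {"3.1", "3.2", "3.3", "3.4", "3.5"}
SUPPORTED = {
    "3.8": set(ALL_SPARK),
    "3.9": set(ALL_SPARK),
    "3.10": {"3.2", "3.3", "3.4", "3.5"},
    "3.11": {"3.4", "3.5"},
    "3.12": {"3.4", "3.5"},
}


def minor_version(version: str) -> str:
    """Reduce a version string such as '3.5.1' to 'major.minor' ('3.5')."""
    return ".".join(version.split(".")[:2])


def is_supported(python_version: str, pyspark_version: str) -> bool:
    """Return True if the Python/PySpark pair is marked 'Yes' in the matrix."""
    allowed = SUPPORTED.get(minor_version(python_version), set())
    return minor_version(pyspark_version) in allowed
```

In a real environment the two arguments would typically come from platform.python_version() and pyspark.__version__.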

Installation

Prerequisites: PySpark and NumPy must be installed in your environment.

The package is available on PyPI. It bundles pre-built JARs for each supported Spark version (3.1-3.5). At runtime, the correct JAR is automatically selected based on your installed PySpark version.

If SPARK_HOME is set, the installer copies the matching JAR into $SPARK_HOME/jars. For EMR, the path is handled automatically.

pip3 install sagemaker-feature-store-pyspark --no-binary :all:

EMR

Create a custom jar step to install the library:

  • Jar Location: command-runner.jar
  • Arguments: sudo -E pip3 install sagemaker-feature-store-pyspark --no-binary :all:

This installs the library on the Driver node only. To distribute to all executor nodes, create an installation script and add a custom bootstrap action when creating the EMR cluster.

Since bootstrap actions run before EMR applications are installed, dependent JARs cannot be automatically loaded to SPARK_HOME. When submitting your application, specify dependent JARs using:

--jars `feature-store-pyspark-dependency-jars`

SageMaker Notebook

SageMaker Notebook instances may use an older version of Spark. Install a compatible version first:

# Install a version of PySpark compatible with the library (3.1 - 3.5)
!pip3 install pyspark==3.5.1

Getting Started

FeatureStoreManager is the main interface for all library operations, including data ingestion and loading feature definitions.

Ingest Data

from feature_store_pyspark.FeatureStoreManager import FeatureStoreManager

# df is an existing Spark DataFrame holding the records to ingest.
feature_group_arn = "arn:aws:sagemaker:...:feature-group/your-feature-group"
feature_store_manager = FeatureStoreManager()
feature_store_manager.ingest_data(
    input_data_frame=df,
    feature_group_arn=feature_group_arn,
    target_stores=["OfflineStore"]  # ingest to the offline store only
)

If target_stores is set to ["OfflineStore"], data is ingested directly into the offline store without going through the FeatureStore Runtime API, which reduces write capacity unit (WCU) costs. The default is None, which ingests to both the online and offline stores.

Load Feature Definitions

feature_definitions = feature_store_manager.load_feature_definitions_from_schema(df)

Returns feature definitions that can be used with the CreateFeatureGroup API.
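As an illustration of how the returned definitions might feed CreateFeatureGroup, here is a sketch that assembles the request parameters. It assumes the definitions are a list of dicts with FeatureName and FeatureType keys (the shape the CreateFeatureGroup API expects); the exact return shape is not shown here, so verify it first. The helper name and every argument value are hypothetical.

```python
from typing import Any, Dict, List


def build_create_feature_group_request(
    feature_group_name: str,
    feature_definitions: List[Dict[str, str]],
    record_identifier_name: str,
    event_time_name: str,
    offline_store_s3_uri: str,
    role_arn: str,
) -> Dict[str, Any]:
    """Assemble keyword arguments for the SageMaker CreateFeatureGroup API."""
    # Assumption: each definition is {"FeatureName": ..., "FeatureType": ...}.
    names = {definition["FeatureName"] for definition in feature_definitions}
    for required in (record_identifier_name, event_time_name):
        if required not in names:
            raise ValueError(f"{required!r} is not among the feature definitions")
    return {
        "FeatureGroupName": feature_group_name,
        "RecordIdentifierFeatureName": record_identifier_name,
        "EventTimeFeatureName": event_time_name,
        "FeatureDefinitions": feature_definitions,
        "OnlineStoreConfig": {"EnableOnlineStore": True},
        "OfflineStoreConfig": {"S3StorageConfig": {"S3Uri": offline_store_s3_uri}},
        "RoleArn": role_arn,
    }
```

The resulting dict could then be passed on as boto3.client("sagemaker").create_feature_group(**request), given suitable credentials and an IAM role.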

Retrieve Failed Ingestion Records

failed_df = feature_store_manager.get_failed_stream_ingestion_data_frame()

Returns a DataFrame containing records that failed during ingest_data().
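One way to act on that DataFrame is to count the failed records and pull a small sample for logging. The sketch below uses only the standard DataFrame methods count() and take(); the helper name summarize_failures and the default sample size are hypothetical.

```python
from typing import Any, Dict


def summarize_failures(failed_df: Any, sample_size: int = 5) -> Dict[str, Any]:
    """Summarize a DataFrame of failed records without collecting all of it."""
    failed_count = failed_df.count()
    # Only fetch rows when there is something to show; take() bounds the
    # number of rows brought back to the driver.
    sample = failed_df.take(sample_size) if failed_count else []
    return {"failed_count": failed_count, "sample": sample}
```

A caller might log the summary and retry or alert when failed_count is greater than zero.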

Lake Formation Support

When your offline store's S3 location is registered with AWS Lake Formation, enable the use_lake_formation_credentials parameter (requires PySpark 3.5+):

feature_store_manager.ingest_data(
    input_data_frame=df,
    feature_group_arn=feature_group_arn,
    target_stores=["OfflineStore"],
    use_lake_formation_credentials=True
)

For prerequisites, cross-account access, and troubleshooting, see the main repository README.
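Because use_lake_formation_credentials requires PySpark 3.5+, it can help to guard the flag with a version check. The following is a sketch under that assumption only; supports_lake_formation and build_ingest_kwargs are hypothetical helpers, not library functions.

```python
from typing import Any, Dict


def supports_lake_formation(pyspark_version: str) -> bool:
    """Return True for PySpark 3.5 or newer, based on 'major.minor'."""
    major, minor = (int(part) for part in pyspark_version.split(".")[:2])
    return (major, minor) >= (3, 5)


def build_ingest_kwargs(
    df: Any,
    feature_group_arn: str,
    pyspark_version: str,
    use_lake_formation: bool = False,
) -> Dict[str, Any]:
    """Build keyword arguments for ingest_data() targeting the offline store."""
    kwargs: Dict[str, Any] = {
        "input_data_frame": df,
        "feature_group_arn": feature_group_arn,
        "target_stores": ["OfflineStore"],
    }
    if use_lake_formation:
        if not supports_lake_formation(pyspark_version):
            raise RuntimeError(
                "use_lake_formation_credentials requires PySpark 3.5+, "
                f"found {pyspark_version}"
            )
        kwargs["use_lake_formation_credentials"] = True
    return kwargs
```

Usage would look like feature_store_manager.ingest_data(**build_ingest_kwargs(df, feature_group_arn, pyspark.__version__, use_lake_formation=True)).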

License

This project is licensed under the Apache-2.0 License.

Dependencies: setuptools (<82)