Materializers
Understanding and creating materializers to handle custom data types in ZenML pipelines
Materializers are a core concept in ZenML that enable the serialization, storage, and retrieval of artifacts in your ML pipelines. This guide explains how materializers work and how to create custom materializers for your specific data types.
What Are Materializers?
A materializer is a class that defines how a particular data type is:

- **Serialized**: Converted from Python objects to a storable format
- **Saved**: Written to the artifact store
- **Loaded**: Read from the artifact store
- **Deserialized**: Converted back to Python objects
- **Visualized**: Displayed in the ZenML dashboard
- **Analyzed**: Inspected to extract metadata for tracking and search
Materializers act as the bridge between your Python code and the underlying storage system, ensuring that any artifact can be saved, loaded, and visualized correctly, regardless of the data type.
Built-In Materializers
ZenML includes built-in materializers for many common data types:
Core Materializers
| Handled data types | Storage format |
| --- | --- |
| `bool`, `float`, `int`, `str`, `None` | `.json` |
| `bytes` | `.txt` |
| `dict`, `list`, `set`, `tuple` | Directory |
| `np.ndarray` | `.npy` |
| `pd.DataFrame`, `pd.Series` | `.csv` (or `.gzip` if `parquet` is installed) |
| `pydantic.BaseModel` | `.json` |
| `zenml.services.service.BaseService` | `.json` |
| `zenml.types.CSVString`, `zenml.types.HTMLString`, `zenml.types.MarkdownString` | `.csv` / `.html` / `.md` (depending on type) |
Integration-Specific Materializers
When you install ZenML integrations, additional materializers become available:
| Integration | Handled data types | Storage format |
| --- | --- | --- |
| `bentoml` | `bentoml.Bento` | `.bento` |
| `deepchecks` | `deepchecks.CheckResult`, `deepchecks.SuiteResult` | `.json` |
| `evidently` | `evidently.Profile` | `.json` |
| `great_expectations` | `great_expectations.ExpectationSuite`, `great_expectations.CheckpointResult` | `.json` |
| `huggingface` | `datasets.Dataset`, `datasets.DatasetDict` | Directory |
| `huggingface` | `transformers.PreTrainedModel` | Directory |
| `huggingface` | `transformers.TFPreTrainedModel` | Directory |
| `huggingface` | `transformers.PreTrainedTokenizerBase` | Directory |
| `lightgbm` | `lgbm.Booster` | `.txt` |
| `lightgbm` | `lgbm.Dataset` | `.binary` |
| `neural_prophet` | `NeuralProphet` | `.pt` |
| `pillow` | `Pillow.Image` | `.PNG` |
| `polars` | `pl.DataFrame`, `pl.Series` | `.parquet` |
| `pycaret` | Any `sklearn`, `xgboost`, `lightgbm`, or `catboost` model | `.pkl` |
| `pytorch` | `torch.Dataset`, `torch.DataLoader` | `.pt` |
| `pytorch` | `torch.Module` | `.pt` |
| `scipy` | `scipy.spmatrix` | `.npz` |
| `spark` | `pyspark.DataFrame` | `.parquet` |
| `spark` | `pyspark.Transformer`, `pyspark.Estimator` | |
| `tensorflow` | `tf.keras.Model` | Directory |
| `tensorflow` | `tf.Dataset` | Directory |
| `whylogs` | `whylogs.DatasetProfileView` | `.pb` |
| `xgboost` | `xgb.Booster` | `.json` |
| `xgboost` | `xgb.DMatrix` | `.binary` |
Note: When using Docker-based orchestrators, you must specify the appropriate integrations in your `DockerSettings` to ensure the materializers are available inside the container.
Creating Custom Materializers
When working with custom data types, you'll need to create materializers to handle them. Here's how:
1. Define Your Materializer Class
Create a new class that inherits from `BaseMaterializer`:
2. Using Your Custom Materializer
Once you've defined the materializer, you can use it in your pipeline:
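Continuing with the hypothetical `MyObj` and `MyMaterializer` from above, one way to attach the materializer is via the step decorator:

```python
from zenml import pipeline, step


@step(output_materializers=MyMaterializer)
def my_first_step() -> MyObj:
    return MyObj("my_object")


@step
def my_second_step(my_obj: MyObj) -> None:
    print(f"The step received: `{my_obj.name}`")


@pipeline
def my_pipeline():
    my_second_step(my_first_step())
```

Alternatively, you can configure the step after defining it with `my_first_step.configure(output_materializers=MyMaterializer)`.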
3. Multiple Outputs with Different Materializers
When a step has multiple outputs that need different materializers:
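A sketch of the pattern, assuming two hypothetical custom types (`MyObj`, `MyOther`) with their own materializers; the dictionary keys must match the output names given in `Annotated`:

```python
from typing import Tuple

from typing_extensions import Annotated
from zenml import step


# Map each named output to its materializer. `MyObj`/`MyMaterializer` are
# the hypothetical examples from above; `MyOther`/`MyOtherMaterializer`
# stand in for a second custom type.
@step(
    output_materializers={
        "obj": MyMaterializer,
        "other": MyOtherMaterializer,
    }
)
def my_step() -> Tuple[Annotated[MyObj, "obj"], Annotated[MyOther, "other"]]:
    return MyObj("my_object"), MyOther()
```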
4. Registering a Materializer Globally
You can register a materializer globally to override the default materializer for a specific type:
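For example, to replace the default pandas materializer with a hypothetical `FastPandasMaterializer`, a sketch using the materializer registry:

```python
import pandas as pd

from zenml.materializers.materializer_registry import materializer_registry

# Registering a materializer for `pd.DataFrame` overrides the default
# used for that type across all pipelines in this Python process.
materializer_registry.register_and_overwrite_type(
    key=pd.DataFrame, type_=FastPandasMaterializer
)
```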
Materializer Implementation Details
When implementing a custom materializer, consider these aspects:
Handling Storage
The `self.uri` property contains the path to the directory where your artifact should be stored. Use this path to create files or subdirectories for your data.
When reading or writing files, always use `self.artifact_store.open()` rather than direct file I/O to ensure compatibility with different artifact stores (local filesystem, cloud storage, etc.).
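For illustration, a `save()` fragment that goes through the artifact store rather than calling the built-in `open()` directly (the file name is illustrative):

```python
import os


# Inside a materializer: `self.artifact_store.open()` works against any
# configured artifact store backend, while a bare `open()` would only
# work on the local filesystem.
def save(self, data: str) -> None:
    path = os.path.join(self.uri, "data.txt")
    with self.artifact_store.open(path, "w") as f:
        f.write(data)
```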
Visualization Support
The `save_visualizations()` method allows you to create visualizations that will be shown in the ZenML dashboard. You can return multiple visualizations of different types:

- `VisualizationType.HTML`: Embedded HTML content
- `VisualizationType.MARKDOWN`: Markdown content
- `VisualizationType.IMAGE`: Image files
- `VisualizationType.CSV`: CSV tables
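A sketch of such a method for the hypothetical `MyObj` type, writing a single HTML visualization (file name and markup are illustrative):

```python
import os
from typing import Dict

from zenml.enums import VisualizationType


# Write a visualization file next to the artifact and return a mapping
# of file paths to their visualization types.
def save_visualizations(self, my_obj: MyObj) -> Dict[str, VisualizationType]:
    visualization_path = os.path.join(self.uri, "visualization.html")
    with self.artifact_store.open(visualization_path, "w") as f:
        f.write(f"<html><body><h1>{my_obj.name}</h1></body></html>")
    return {visualization_path: VisualizationType.HTML}
```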
Configuring Visualizations
Some materializers support configuration via environment variables to customize their visualization behavior. For example:
- `ZENML_PANDAS_SAMPLE_ROWS`: Controls the number of rows shown in sample visualizations created by the `PandasMaterializer`. Defaults to 10 rows.
Metadata Extraction
The `extract_metadata()` method allows you to extract key information about your artifact for indexing and searching. This metadata will be displayed alongside the artifact in the dashboard.
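A sketch for the hypothetical `MyObj` type; the returned dictionary maps metadata keys to values of ZenML's supported metadata types:

```python
from typing import Dict

from zenml.metadata.metadata_types import MetadataType


# Return lightweight, queryable facts about the artifact. The keys and
# the `name_length` metric are illustrative.
def extract_metadata(self, my_obj: MyObj) -> Dict[str, MetadataType]:
    return {
        "name": my_obj.name,
        "name_length": len(my_obj.name),
    }
```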
Temporary Files
If you need a temporary directory while processing artifacts, use the `get_temporary_directory()` helper:
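A sketch of a `save()` method that stages data locally before copying it into the artifact store, assuming a recent ZenML release where `BaseMaterializer` provides this helper; `model.save()` is a hypothetical method on the object being stored:

```python
import os


# Stage the artifact in a local scratch directory, then copy it into the
# artifact store. The helper cleans the directory up afterwards.
def save(self, model) -> None:
    with self.get_temporary_directory(delete_at_exit=True) as temp_dir:
        local_path = os.path.join(temp_dir, "model.bin")
        model.save(local_path)  # hypothetical serialization call
        with open(local_path, "rb") as src:
            with self.artifact_store.open(
                os.path.join(self.uri, "model.bin"), "wb"
            ) as dst:
                dst.write(src.read())
```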
Example: A Complete Materializer
Here's a complete example of a custom materializer for a simple class:
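The sketch below ties together the patterns shown earlier; the `MyObj` class and file names remain illustrative:

```python
import os
from typing import Type

from zenml import pipeline, step
from zenml.enums import ArtifactType
from zenml.materializers.base_materializer import BaseMaterializer


class MyObj:
    def __init__(self, name: str):
        self.name = name


class MyMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (MyObj,)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.DATA

    def load(self, data_type: Type[MyObj]) -> MyObj:
        """Read MyObj back from the artifact store."""
        with self.artifact_store.open(os.path.join(self.uri, "data.txt"), "r") as f:
            return MyObj(name=f.read())

    def save(self, my_obj: MyObj) -> None:
        """Write MyObj to the artifact store."""
        with self.artifact_store.open(os.path.join(self.uri, "data.txt"), "w") as f:
            f.write(my_obj.name)


@step(output_materializers=MyMaterializer)
def my_first_step() -> MyObj:
    return MyObj("my_object")


@step
def my_second_step(my_obj: MyObj) -> None:
    print(f"The following object was passed to this step: `{my_obj.name}`")


@pipeline
def first_pipeline():
    my_second_step(my_first_step())


if __name__ == "__main__":
    first_pipeline()
```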
Unmaterialized artifacts
Whenever you pass artifacts as outputs from one pipeline step to other steps as inputs, the materializer for the respective data type defines how the artifact is first serialized and written to the artifact store, and then deserialized and read in the next step. However, there are instances where you might not want to materialize an artifact in a step, but rather use a reference to it instead. This is where skipping materialization comes in.
Skipping materialization might have unintended consequences for downstream tasks that rely on materialized artifacts. Only skip materialization if there is no other way to do what you want to do.
How to skip materialization
While materializers should in most cases be used to control how artifacts are returned and consumed from pipeline steps, you might sometimes need to have a completely unmaterialized artifact in a step, e.g., if you need to know the exact path to where your artifact is stored.
The following shows an example of how unmaterialized artifacts can be used in the steps of a pipeline: `s1` and `s2` produce identical artifacts; however, `s3` consumes materialized artifacts while `s4` consumes unmaterialized artifacts. `s4` can therefore use the `dict_.uri` and `list_.uri` paths directly rather than their materialized counterparts.
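A sketch of this pipeline, assuming current ZenML where unmaterialized inputs are requested by annotating them as `UnmaterializedArtifact`:

```python
from typing import Dict, List, Tuple

from typing_extensions import Annotated
from zenml import pipeline, step
from zenml.artifacts.unmaterialized_artifact import UnmaterializedArtifact


@step
def s1() -> Tuple[Annotated[Dict[str, str], "dict_"], Annotated[List[str], "list_"]]:
    return {"some": "data"}, []


@step
def s2() -> Tuple[Annotated[Dict[str, str], "dict_"], Annotated[List[str], "list_"]]:
    return {"some": "data"}, []


@step
def s3(dict_: Dict[str, str], list_: List[str]) -> None:
    # Regular inputs arrive as fully materialized Python objects.
    assert isinstance(dict_, dict)
    assert isinstance(list_, list)


@step
def s4(dict_: UnmaterializedArtifact, list_: UnmaterializedArtifact) -> None:
    # Unmaterialized inputs skip loading and expose the storage path.
    print(dict_.uri)
    print(list_.uri)


@pipeline
def example_pipeline():
    s3(*s1())
    s4(*s2())


if __name__ == "__main__":
    example_pipeline()
```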
Best Practices
When working with materializers:
- Prefer structured formats over pickle or other binary formats for better cross-environment compatibility.
- Test your materializer with different artifact stores (local, S3, etc.) to ensure it works consistently.
- Consider versioning if your data structure might change over time.
- Create visualizations to help users understand your artifacts in the dashboard.
- Extract useful metadata to make artifacts easier to find and understand.
- Be explicit about materializer assignments for clarity, even if ZenML can detect them automatically.
- Avoid using the `CloudpickleMaterializer` in production, as it is not reliable across different Python versions.
Conclusion
Materializers are a powerful part of ZenML's artifact system, enabling proper storage and handling of any data type. By creating custom materializers for your specific data structures, you ensure that your ML pipelines are robust, efficient, and can handle any data type required by your workflows.