Materializers
Understanding and creating materializers to handle custom data types in ZenML pipelines
Materializers are a core concept in ZenML that enable the serialization, storage, and retrieval of artifacts in your ML pipelines. This guide explains how materializers work and how to create custom materializers for your specific data types.
What Are Materializers?
A materializer is a class that defines how a particular data type is:

- **Serialized**: Converted from Python objects to a storable format
- **Saved**: Written to the artifact store
- **Loaded**: Read from the artifact store
- **Deserialized**: Converted back to Python objects
- **Visualized**: Displayed in the ZenML dashboard
- **Analyzed**: Inspected to extract metadata for tracking and search
Materializers act as the bridge between your Python code and the underlying storage system, ensuring that any artifact can be saved, loaded, and visualized correctly, regardless of the data type.
Built-In Materializers
ZenML includes built-in materializers for many common data types:
Core Materializers
| Handled data types | Storage format |
| --- | --- |
| `bool`, `float`, `int`, `str`, `None` | `.json` |
| `bytes` | `.txt` |
| `dict`, `list`, `set`, `tuple` | Directory |
| `np.ndarray` | `.npy` |
| `pd.DataFrame`, `pd.Series` | `.csv` (or `.gzip` if `parquet` is installed) |
| `pydantic.BaseModel` | `.json` |
| `zenml.services.service.BaseService` | `.json` |
| `zenml.types.CSVString`, `zenml.types.HTMLString`, `zenml.types.MarkdownString` | `.csv` / `.html` / `.md` (depending on type) |
Integration-Specific Materializers
When you install ZenML integrations, additional materializers become available:
| Integration | Handled data types | Storage format |
| --- | --- | --- |
| `bentoml` | `bentoml.Bento` | `.bento` |
| `deepchecks` | `deepchecks.CheckResult`, `deepchecks.SuiteResult` | `.json` |
| `evidently` | `evidently.Profile` | `.json` |
| `great_expectations` | `great_expectations.ExpectationSuite`, `great_expectations.CheckpointResult` | `.json` |
| `huggingface` | `datasets.Dataset`, `datasets.DatasetDict` | Directory |
| `huggingface` | `transformers.PreTrainedModel` | Directory |
| `huggingface` | `transformers.TFPreTrainedModel` | Directory |
| `huggingface` | `transformers.PreTrainedTokenizerBase` | Directory |
| `lightgbm` | `lgbm.Booster` | `.txt` |
| `lightgbm` | `lgbm.Dataset` | `.binary` |
| `neural_prophet` | `NeuralProphet` | `.pt` |
| `pillow` | `Pillow.Image` | `.PNG` |
| `polars` | `pl.DataFrame`, `pl.Series` | `.parquet` |
| `pycaret` | Any `sklearn`, `xgboost`, `lightgbm`, or `catboost` model | `.pkl` |
| `pytorch` | `torch.Dataset`, `torch.DataLoader` | `.pt` |
| `pytorch` | `torch.Module` | `.pt` |
| `scipy` | `scipy.spmatrix` | `.npz` |
| `spark` | `pyspark.DataFrame` | `.parquet` |
| `spark` | `pyspark.Transformer`, `pyspark.Estimator` | |
| `tensorflow` | `tf.keras.Model` | Directory |
| `tensorflow` | `tf.Dataset` | Directory |
| `whylogs` | `whylogs.DatasetProfileView` | `.pb` |
| `xgboost` | `xgb.Booster` | `.json` |
| `xgboost` | `xgb.DMatrix` | `.binary` |
Note: When using Docker-based orchestrators, you must specify the appropriate integrations in your `DockerSettings` to ensure the materializers are available inside the container.
Creating Custom Materializers
When working with custom data types, you'll need to create materializers to handle them. Here's how:
1. Define Your Materializer Class
Create a new class that inherits from `BaseMaterializer`:
2. Using Your Custom Materializer
Once you've defined the materializer, you can use it in your pipeline:
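Continuing with the hypothetical `MyObj` and `MyMaterializer` from above, one way to attach the materializer is via the step decorator:

```python
from zenml import pipeline, step


@step(output_materializers=MyMaterializer)
def my_first_step() -> MyObj:
    return MyObj("my_object")


@step
def my_second_step(my_obj: MyObj) -> None:
    print(f"The step received: `{my_obj.name}`")


@pipeline
def my_pipeline():
    my_second_step(my_first_step())
```

Alternatively, you can configure the step after defining it with `my_first_step.configure(output_materializers=MyMaterializer)`.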
3. Multiple Outputs with Different Materializers
When a step has multiple outputs that need different materializers:
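A sketch of the pattern, assuming two hypothetical custom types (`MyObj`, `MyOther`) with their own materializers; the dictionary keys must match the output names given in `Annotated`:

```python
from typing import Tuple

from typing_extensions import Annotated
from zenml import step


# Map each named output to its materializer. `MyObj`/`MyMaterializer` are
# the hypothetical examples from above; `MyOther`/`MyOtherMaterializer`
# stand in for a second custom type.
@step(
    output_materializers={
        "obj": MyMaterializer,
        "other": MyOtherMaterializer,
    }
)
def my_step() -> Tuple[Annotated[MyObj, "obj"], Annotated[MyOther, "other"]]:
    return MyObj("my_object"), MyOther()
```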
4. Registering a Materializer Globally
You can register a materializer globally to override the default materializer for a specific type:
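For example, to replace the default pandas materializer with a hypothetical `FastPandasMaterializer`, a sketch using the materializer registry:

```python
import pandas as pd

from zenml.materializers.materializer_registry import materializer_registry

# Registering a materializer for `pd.DataFrame` overrides the default
# used for that type across all pipelines in this Python process.
materializer_registry.register_and_overwrite_type(
    key=pd.DataFrame, type_=FastPandasMaterializer
)
```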
Materializer Implementation Details
When implementing a custom materializer, consider these aspects:
Handling Storage
The `self.uri` property contains the path to the directory where your artifact should be stored. Use this path to create files or subdirectories for your data.
When reading or writing files, always use `self.artifact_store.open()` rather than direct file I/O to ensure compatibility with different artifact stores (local filesystem, cloud storage, etc.).
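For illustration, a `save()` fragment that goes through the artifact store rather than calling the built-in `open()` directly (the file name is illustrative):

```python
import os


# Inside a materializer: `self.artifact_store.open()` works against any
# configured artifact store backend, while a bare `open()` would only
# work on the local filesystem.
def save(self, data: str) -> None:
    path = os.path.join(self.uri, "data.txt")
    with self.artifact_store.open(path, "w") as f:
        f.write(data)
```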
Visualization Support
The `save_visualizations()` method allows you to create visualizations that will be shown in the ZenML dashboard. You can return multiple visualizations of different types:

- `VisualizationType.HTML`: Embedded HTML content
- `VisualizationType.MARKDOWN`: Markdown content
- `VisualizationType.IMAGE`: Image files
- `VisualizationType.CSV`: CSV tables
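A sketch of such a method for the hypothetical `MyObj` type, writing a single HTML visualization (file name and markup are illustrative):

```python
import os
from typing import Dict

from zenml.enums import VisualizationType


# Write a visualization file next to the artifact and return a mapping
# of file paths to their visualization types.
def save_visualizations(self, my_obj: MyObj) -> Dict[str, VisualizationType]:
    visualization_path = os.path.join(self.uri, "visualization.html")
    with self.artifact_store.open(visualization_path, "w") as f:
        f.write(f"<html><body><h1>{my_obj.name}</h1></body></html>")
    return {visualization_path: VisualizationType.HTML}
```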
Configuring Visualizations
Some materializers support configuration via environment variables to customize their visualization behavior. For example:
- `ZENML_PANDAS_SAMPLE_ROWS`: Controls the number of rows shown in sample visualizations created by the `PandasMaterializer`. Defaults to 10 rows.
Metadata Extraction
The `extract_metadata()` method allows you to extract key information about your artifact for indexing and searching. This metadata will be displayed alongside the artifact in the dashboard.
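A sketch for the hypothetical `MyObj` type; the returned dictionary maps metadata keys to values of ZenML's supported metadata types:

```python
from typing import Dict

from zenml.metadata.metadata_types import MetadataType


# Return lightweight, queryable facts about the artifact. The keys and
# the `name_length` metric are illustrative.
def extract_metadata(self, my_obj: MyObj) -> Dict[str, MetadataType]:
    return {
        "name": my_obj.name,
        "name_length": len(my_obj.name),
    }
```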
Temporary Files
If you need a temporary directory while processing artifacts, use the `get_temporary_directory()` helper:
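A sketch of a `save()` method that stages data locally before copying it into the artifact store, assuming a recent ZenML release where `BaseMaterializer` provides this helper; `model.save()` is a hypothetical method on the object being stored:

```python
import os


# Stage the artifact in a local scratch directory, then copy it into the
# artifact store. The helper cleans the directory up afterwards.
def save(self, model) -> None:
    with self.get_temporary_directory(delete_at_exit=True) as temp_dir:
        local_path = os.path.join(temp_dir, "model.bin")
        model.save(local_path)  # hypothetical serialization call
        with open(local_path, "rb") as src:
            with self.artifact_store.open(
                os.path.join(self.uri, "model.bin"), "wb"
            ) as dst:
                dst.write(src.read())
```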
Example: A Complete Materializer
Here's a complete example of a custom materializer for a simple class:
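The sketch below ties together the patterns shown earlier; the `MyObj` class and file names remain illustrative:

```python
import os
from typing import Type

from zenml import pipeline, step
from zenml.enums import ArtifactType
from zenml.materializers.base_materializer import BaseMaterializer


class MyObj:
    def __init__(self, name: str):
        self.name = name


class MyMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (MyObj,)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.DATA

    def load(self, data_type: Type[MyObj]) -> MyObj:
        """Read MyObj back from the artifact store."""
        with self.artifact_store.open(os.path.join(self.uri, "data.txt"), "r") as f:
            return MyObj(name=f.read())

    def save(self, my_obj: MyObj) -> None:
        """Write MyObj to the artifact store."""
        with self.artifact_store.open(os.path.join(self.uri, "data.txt"), "w") as f:
            f.write(my_obj.name)


@step(output_materializers=MyMaterializer)
def my_first_step() -> MyObj:
    return MyObj("my_object")


@step
def my_second_step(my_obj: MyObj) -> None:
    print(f"The following object was passed to this step: `{my_obj.name}`")


@pipeline
def first_pipeline():
    my_second_step(my_first_step())


if __name__ == "__main__":
    first_pipeline()
```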
Unmaterialized artifacts
Whenever you pass artifacts as outputs from one pipeline step to other steps as inputs, the materializer for the respective data type defines how the artifact is first serialized and written to the artifact store, and then deserialized and read in the next step. However, there are instances where you might not want to materialize an artifact in a step, but rather use a reference to it instead. This is where skipping materialization comes in.
Skipping materialization might have unintended consequences for downstream tasks that rely on materialized artifacts. Only skip materialization if there is no other way to do what you want to do.
How to skip materialization
While materializers should in most cases be used to control how artifacts are returned and consumed from pipeline steps, you might sometimes need to have a completely unmaterialized artifact in a step, e.g., if you need to know the exact path to where your artifact is stored.
The following shows an example of how unmaterialized artifacts can be used in the steps of a pipeline: `s1` and `s2` produce identical artifacts; however, `s3` consumes materialized artifacts while `s4` consumes unmaterialized artifacts. `s4` can therefore use the `dict_.uri` and `list_.uri` paths directly rather than their materialized counterparts.
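A sketch of this pipeline, assuming current ZenML where unmaterialized inputs are requested by annotating them as `UnmaterializedArtifact`:

```python
from typing import Dict, List, Tuple

from typing_extensions import Annotated
from zenml import pipeline, step
from zenml.artifacts.unmaterialized_artifact import UnmaterializedArtifact


@step
def s1() -> Tuple[Annotated[Dict[str, str], "dict_"], Annotated[List[str], "list_"]]:
    return {"some": "data"}, []


@step
def s2() -> Tuple[Annotated[Dict[str, str], "dict_"], Annotated[List[str], "list_"]]:
    return {"some": "data"}, []


@step
def s3(dict_: Dict[str, str], list_: List[str]) -> None:
    # Regular inputs arrive as fully materialized Python objects.
    assert isinstance(dict_, dict)
    assert isinstance(list_, list)


@step
def s4(dict_: UnmaterializedArtifact, list_: UnmaterializedArtifact) -> None:
    # Unmaterialized inputs skip loading and expose the storage path.
    print(dict_.uri)
    print(list_.uri)


@pipeline
def example_pipeline():
    s3(*s1())
    s4(*s2())


if __name__ == "__main__":
    example_pipeline()
```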
Best Practices
When working with materializers:
- Prefer structured formats over pickle or other binary formats for better cross-environment compatibility.
- Test your materializer with different artifact stores (local, S3, etc.) to ensure it works consistently.
- Consider versioning if your data structure might change over time.
- Create visualizations to help users understand your artifacts in the dashboard.
- Extract useful metadata to make artifacts easier to find and understand.
- Be explicit about materializer assignments for clarity, even if ZenML can detect them automatically.
- Avoid using the `CloudpickleMaterializer` in production, as it is not reliable across different Python versions.
Conclusion
Materializers are a powerful part of ZenML's artifact system, enabling proper storage and handling of any data type. By creating custom materializers for your specific data structures, you ensure that your ML pipelines are robust, efficient, and can handle any data type required by your workflows.