Packaging a dltHub Project tutorial
This page covers a dltHub feature that requires a license. Join our early access program for a trial license.
Packaging a dltHub project simplifies distribution to other teams and stakeholders, such as data analysts or data science teams, without requiring direct access to the project’s internal code. Once installed, the package can be used to run pipelines and access production data through a standardized Python interface.
In this tutorial, you will learn how to package your dltHub project for reuse and distribution and make it pip-installable.
Prerequisites
Before you begin, ensure the following requirements are met:
- dltHub is installed and set up according to the installation guide
- You are familiar with the core concepts of dlt
- You have completed the basic project tutorial
Additionally, install the required Python packages:
pip install pandas numpy pyarrow marimo "dlt[duckdb]" uv
Packaging a project
dltHub provides tools to help you package a project for distribution. This makes your project installable via pip and easier to share across your organization.
To create the project structure required for a package, add the --package option when initializing:
dlt project init arrow duckdb --package my_dlt_project
This creates the same basic project as in the basic tutorial, but places it inside a module named my_dlt_project, and includes a basic pyproject.toml file following PEP standards.
You’ll also get a default __init__.py file to make the package usable after installation:
.
├── my_dlt_project/ # Your project module
│ ├── __init__.py # Package entry point
│ ├── dlt.yml # dltHub project manifest
│ └── ... # Other project files
├── .gitignore
└── pyproject.toml # the main project manifest
Your dlt.yml works exactly the same as in non-packaged projects.
The key difference is the module structure and the presence of the pyproject.toml file.
The file includes a special entry point setting to let dltHub discover your project:
[project.entry-points.dlt_package]
dlt-project = "my_project"
You can still run the pipeline as usual with the CLI commands from the root folder:
dlt pipeline my_pipeline run
If you open the __init__.py file inside your project module, you'll see the full interface that users of your package will interact with.
This interface is very similar to the one used in flat (non-packaged) projects. The main difference is that it automatically uses the access profile by default.
You can customize the __init__.py file to your project's needs.
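For example, a hypothetical customization could add a convenience helper on top of the generated interface. The sketch below assumes that the generated __init__.py defines runner() and catalog(), as used in the usage example later in this tutorial; the helper itself and its name are illustrative:

# my_dlt_project/__init__.py (illustrative addition, not part of the generated file)

def refresh_dataset(pipeline_name: str = "my_pipeline", dataset_name: str = "my_pipeline_dataset"):
    """Run a pipeline and return its dataset from the catalog in one call."""
    # runner() and catalog() are assumed to be defined in the generated __init__.py
    runner().run_pipeline(pipeline_name)
    return catalog().dataset(dataset_name)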
Using the packaged project
To demonstrate how your packaged project can be used, let's simulate a real-world scenario where a data scientist installs and runs your project in a separate Python environment.
In this example, we'll use the uv package manager, but the same steps apply when using poetry or pip. You can find installation instructions here.
Assume your packaged dltHub project is located at: /Volumes/my_drive/my_folder/pyproject.toml.
Navigate to a new directory and initialize your project:
uv init
Install your packaged project directly from the local path:
uv pip install /Volumes/my_drive/my_folder
Your dltHub project is now available for use in this environment.
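Optionally, confirm the installation by listing the packages in the environment; your project should appear under the name defined in its pyproject.toml:

uv pip list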
As an example, create a new Python file named test_project.py, use your packaged project, and define the environment variables it needs:
# import the packaged project
import my_dlt_project
import os
import pandas as pd

os.environ["MY_PIPELINE__SOURCES__ARROW__ARROW__ROW_COUNT"] = "0"
os.environ["MY_PIPELINE__SOURCES__ARROW__ARROW__SOME_SECRET"] = "0"

if __name__ == "__main__":
    # should print "access" as defined in your dlt package
    print(my_dlt_project.config().current_profile)

    # run the pipeline from the packaged project
    my_dlt_project.runner().run_pipeline("my_pipeline")

    # should list the defined destinations
    print(my_dlt_project.config().destinations)

    # get a dataset from the catalog
    dataset = my_dlt_project.catalog().dataset("my_pipeline_dataset")

    # write a DataFrame to the "my_table" table in the dataset
    dataset.save(pd.DataFrame({"name": ["John", "Jane", "Jim"], "age": [30, 25, 35]}), table_name="my_table")

    # get the row counts of all tables in the dataset as a dataframe
    print(dataset.row_counts().df())
Run the script inside the uv virtual environment:
uv run python test_project.py
Once your pipeline has run, you can explore and share the loaded data using the various access methods provided by dltHub. Learn more in the Secure data access and sharing guide.
In a real-world setup, a data scientist wouldn't install the package from a local path. Instead, it would typically come from a private PyPI repository or a Git URL.
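For example, installing from a Git repository could look like this; the repository URL below is a placeholder, so substitute your own:

uv pip install "git+https://github.com/your-org/my-dlt-project.git"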