Stream datasets from storage

This guide walks through streaming datasets from disk or cloud storage.

# replace with your username and S3 bucket
!lamin login testuser1
!lamin init --storage s3://lamindb-ci/test-arrays
! updating cloud SQLite 's3://lamindb-ci/test-arrays/.lamindb/lamin.db' of instance 'testuser1/test-arrays'
! locked instance (to unlock and push changes to the cloud SQLite file, call: lamin disconnect)
 initialized lamindb: testuser1/test-arrays

Import lamindb and track this notebook.

import lamindb as ln
import numpy as np

ln.track()
 connected lamindb: testuser1/test-arrays
 created Transform('tRk5L04axDTb0000', key='arrays.ipynb'), started new Run('fQ1xjkAWPrr8wxwo') at 2026-02-05 16:08:08 UTC
 notebook imports: lamindb==2.1.1 numpy==2.4.2
 recommendation: to identify the notebook across renames, pass the uid: ln.track("tRk5L04axDTb")

DataFrame

A dataframe stored as sharded parquet.

artifact = ln.Artifact.connect("laminlabs/lamindata").get(key="sharded_parquet")
artifact.path.view_tree()
11 sub-directories & 11 files with suffixes '.parquet'
hf://datasets/Koncopd/lamindb-test/sharded_parquet
├── louvain=0/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=1/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=10/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=2/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=3/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=4/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=5/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=6/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=7/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
├── louvain=8/
│   └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
└── louvain=9/
    └── 947eee0b064440c9b9910ca2eb89e608-0.parquet
backed = artifact.open()
 transferred: Artifact(uid='78XWb8yD09SCgVfl0000'), Storage(uid='5EYyeftHljIs')

This returns a pyarrow dataset.

backed
<pyarrow._dataset.FileSystemDataset at 0x7f5f0a0d3700>
backed.head(5).to_pandas()
                  cell_type                     n_genes  percent_mito
index
CGTTATACAGTACC-8  CD4+/CD45RO+ Memory           1034     0.010163
AGATATTGACCACA-1  CD4+/CD45RO+ Memory           1078     0.012831
GCAGGGCTGTATGC-8  CD8+/CD45RA+ Naive Cytotoxic  1055     0.012287
TTATGGCTGGCAAG-2  CD4+/CD25 T Reg               1236     0.023963
CACGACCTGGGAGT-7  CD4+/CD25 T Reg               1010     0.016620

It is also possible to open a collection of cloud artifacts.

collection = ln.Collection.connect("laminlabs/lamindata").get(
    key="sharded_parquet_collection"
)
backed = collection.open()
 transferred: Artifact(uid='yBp5v9RRptoIrIMQ0000')
 transferred: Artifact(uid='fB33zDQDFb0i3Yxw0000')
 transferred: Collection(uid='6aWTZ7J2ej1Rj22q0000')
backed
<pyarrow._dataset.FileSystemDataset at 0x7f5eba6bd0c0>
backed.to_table().to_pandas()
                  cell_type                     n_genes  percent_mito
index
CGTTATACAGTACC-8  CD4+/CD45RO+ Memory           1034     0.010163
AGATATTGACCACA-1  CD4+/CD45RO+ Memory           1078     0.012831
GCAGGGCTGTATGC-8  CD8+/CD45RA+ Naive Cytotoxic  1055     0.012287
TTATGGCTGGCAAG-2  CD4+/CD25 T Reg               1236     0.023963
CACGACCTGGGAGT-7  CD4+/CD25 T Reg               1010     0.016620
AATCTCACTCAGTG-3  CD4+/CD45RO+ Memory           1183     0.016056
CTAGTTTGGCTTAG-4  CD4+/CD45RO+ Memory           1002     0.018922
ACGCCGGAAGCCTA-6  CD8+/CD45RA+ Naive Cytotoxic  1292     0.018315
CTGACCACCATGGT-4  CD8+/CD45RA+ Naive Cytotoxic  1559     0.024427
AGTTAAACAAACAG-1  CD19+ B                       1005     0.019806
CTACGCACAGGGTG-3  CD4+/CD45RO+ Memory           1053     0.012073
CAGACAACAAAACG-7  CD4+/CD25 T Reg               1109     0.012702
GAGGGTGACCTATT-1  CD4+/CD25 T Reg               1003     0.012971
TGACTGGAACCATG-7  Dendritic cells               1277     0.012961
ACGACCCTGTCTGA-3  Dendritic cells               1074     0.017466
GTTATGCTACCTCC-3  CD14+ Monocytes               1201     0.016839
GTGTCAGATCTACT-6  CD14+ Monocytes               1014     0.025417
AAGAACGAACTCTT-6  CD14+ Monocytes               1067     0.019530
TACTCTGACGTAGT-1  Dendritic cells               1118     0.012069
TAAGCTCTTCTGGA-4  CD14+ Monocytes               1059     0.021497

By default, Artifact.open() and Collection.open() use pyarrow to lazily open dataframes. To use polars instead, pass engine="polars". In that case, .open() returns a context manager that yields a polars LazyFrame.

with collection.open(engine="polars", use_fsspec=True) as lazy_df:
    display(lazy_df.collect().to_pandas())
    cell_type                     n_genes  percent_mito  index
0   CD4+/CD45RO+ Memory           1034     0.010163      CGTTATACAGTACC-8
1   CD4+/CD45RO+ Memory           1078     0.012831      AGATATTGACCACA-1
2   CD8+/CD45RA+ Naive Cytotoxic  1055     0.012287      GCAGGGCTGTATGC-8
3   CD4+/CD25 T Reg               1236     0.023963      TTATGGCTGGCAAG-2
4   CD4+/CD25 T Reg               1010     0.016620      CACGACCTGGGAGT-7
5   CD4+/CD45RO+ Memory           1183     0.016056      AATCTCACTCAGTG-3
6   CD4+/CD45RO+ Memory           1002     0.018922      CTAGTTTGGCTTAG-4
7   CD8+/CD45RA+ Naive Cytotoxic  1292     0.018315      ACGCCGGAAGCCTA-6
8   CD8+/CD45RA+ Naive Cytotoxic  1559     0.024427      CTGACCACCATGGT-4
9   CD19+ B                       1005     0.019806      AGTTAAACAAACAG-1
10  CD4+/CD45RO+ Memory           1053     0.012073      CTACGCACAGGGTG-3
11  CD4+/CD25 T Reg               1109     0.012702      CAGACAACAAAACG-7
12  CD4+/CD25 T Reg               1003     0.012971      GAGGGTGACCTATT-1
13  Dendritic cells               1277     0.012961      TGACTGGAACCATG-7
14  Dendritic cells               1074     0.017466      ACGACCCTGTCTGA-3
15  CD14+ Monocytes               1201     0.016839      GTTATGCTACCTCC-3
16  CD14+ Monocytes               1014     0.025417      GTGTCAGATCTACT-6
17  CD14+ Monocytes               1067     0.019530      AAGAACGAACTCTT-6
18  Dendritic cells               1118     0.012069      TACTCTGACGTAGT-1
19  CD14+ Monocytes               1059     0.021497      TAAGCTCTTCTGGA-4

Another way to open several parquet files as a single dataset is to call .open() directly on a query set.

backed = ln.Artifact.filter(suffix=".parquet").open()
! this query set is unordered, consider using `.order_by()` first to avoid opening the artifacts in an arbitrary order
backed
<pyarrow._dataset.FileSystemDataset at 0x7f5eb0574ac0>

AnnData

We’ll need some test data:

ln.Artifact("s3://lamindb-ci/test-arrays/pbmc68k.h5ad").save()
ln.Artifact("s3://lamindb-ci/test-arrays/testfile.hdf5").save()
Artifact(uid='KLHFjHDq5v43xnEV0000', version_tag=None, is_latest=True, key='testfile.hdf5', description=None, suffix='.hdf5', kind=None, otype=None, size=1400, hash='UCWPjJkhzBjO97rtuo_8Yg', n_files=None, n_observations=None, branch_id=1, space_id=1, storage_id=3, run_id=1, schema_id=None, created_by_id=3, created_at=2026-02-05 16:08:13 UTC, is_locked=False)

An h5ad artifact stored on S3:

artifact = ln.Artifact.get(key="pbmc68k.h5ad")
artifact.path
S3QueryPath('s3://lamindb-ci/test-arrays/pbmc68k.h5ad')
access = artifact.open()

The returned object is an AnnDataAccessor, an AnnData-like object backed by cloud storage:

access
AnnDataAccessor object with n_obs × n_vars = 70 × 765
  constructed for the AnnData object pbmc68k.h5ad
    obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
    obsm: ['X_pca', 'X_umap']
    obsp: ['connectivities', 'distances']
    uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
    var: ['highly_variable', 'index', 'n_counts']
    varm: ['PCs']

Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:

access.X
<HDF5 dataset "X": shape (70, 765), type "<f4">

You can subset it like a normal AnnData object:

obs_idx = access.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    access.obs.percent_mito <= 0.05
)
access_subset = access[obs_idx]
access_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
  obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
  obsm: ['X_pca', 'X_umap']
  obsp: ['connectivities', 'distances']
  uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
  var: ['highly_variable', 'index', 'n_counts']
  varm: ['PCs']
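
The boolean index used for subsetting is an ordinary pandas mask. The sketch below rebuilds the same kind of mask on a small stand-in obs table to show how the two conditions combine. The four rows and the 0.02 threshold are illustrative assumptions, not the values used above.

```python
import pandas as pd

# Hypothetical stand-in for access.obs
obs = pd.DataFrame(
    {
        "cell_type": ["CD19+ B", "Dendritic cells", "CD14+ Monocytes", "CD14+ Monocytes"],
        "percent_mito": [0.019806, 0.012961, 0.025417, 0.016839],
    },
    index=["AGTTAAACAAACAG-1", "TGACTGGAACCATG-7", "GTGTCAGATCTACT-6", "GTTATGCTACCTCC-3"],
)

# & binds more tightly than <=, so the comparison must be parenthesized
obs_idx = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.02
)

selected = obs[obs_idx]
print(selected)
```

The resulting obs_idx has one boolean per observation, which is the form the accessor is indexed with in the cell above.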

Subsets load arrays into memory upon direct access:

access_subset.X
array([[-0.326, -0.191,  0.499, ..., -0.21 , -0.636, -0.49 ],
       [ 0.811, -0.191, -0.728, ..., -0.21 ,  0.604, -0.49 ],
       [-0.326, -0.191,  0.643, ..., -0.21 ,  2.303, -0.49 ],
       ...,
       [-0.326, -0.191, -0.728, ..., -0.21 ,  0.626, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
      shape=(35, 765), dtype=float32)
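
The difference between the lazy array and the in-memory subset can be reproduced with plain h5py. In the sketch below, a dataset handle exposes shape and dtype without reading values, and indexing it returns a NumPy array. The local file, the 4 × 3 matrix, and the selected rows are assumptions for illustration and are unrelated to pbmc68k.h5ad.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.h5")

    # Write a small hypothetical matrix to disk
    with h5py.File(path, "w") as f:
        f.create_dataset("X", data=np.arange(12, dtype="float32").reshape(4, 3))

    with h5py.File(path, "r") as f:
        lazy_x = f["X"]  # a handle only: no values are read here
        is_lazy = isinstance(lazy_x, h5py.Dataset)
        lazy_shape = lazy_x.shape

        # Indexing reads just the requested rows into a NumPy array
        rows = [0, 2]
        in_memory = lazy_x[rows, :]

print(is_lazy, lazy_shape, type(in_memory).__name__)
```

h5py expects such index lists to be in increasing order, which is one reason a boolean mask is convenient for selecting observations.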

To load the entire subset into memory as an actual AnnData object, use to_memory():

adata_subset = access_subset.to_memory()

adata_subset
AnnData object with n_obs × n_vars = 35 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

It is also possible to add columns to .obs and .var of cloud AnnData objects without downloading them.

Create a new AnnData zarr artifact.

adata_subset.write_zarr("adata_subset.zarr")
artifact = ln.Artifact(
    "adata_subset.zarr", description="test add column to adata"
).save()
artifact
Artifact(uid='ItVqt5H02FvogTjO0000', version_tag=None, is_latest=True, key=None, description='test add column to adata', suffix='.zarr', kind=None, otype='AnnData', size=215211, hash='aSHN77yMrOMiMzo6jh1xEA', n_files=120, n_observations=None, branch_id=1, space_id=1, storage_id=3, run_id=1, schema_id=None, created_by_id=3, created_at=2026-02-05 16:08:15 UTC, is_locked=False)
with artifact.open(mode="r+") as access:
    access.add_column(where="obs", col_name="ones", col=np.ones(access.shape[0]))
    display(access)
AnnDataAccessor object with n_obs × n_vars = 35 × 765
  constructed for the AnnData object ItVqt5H02FvogTjO.zarr
    obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito', 'ones']
    obsm: ['X_pca', 'X_umap']
    obsp: ['connectivities', 'distances']
    uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
    var: ['highly_variable', 'index', 'n_counts']
    varm: ['PCs']

After the modification, the artifact is saved as a new version (note the uid suffix changing from 0000 to 0001):

artifact
Artifact(uid='ItVqt5H02FvogTjO0001', version_tag=None, is_latest=True, key=None, description='test add column to adata', suffix='.zarr', kind=None, otype=None, size=215962, hash='3Gf4tPzfnj06zeqiigcFOg', n_files=123, n_observations=None, branch_id=1, space_id=1, storage_id=3, run_id=1, schema_id=None, created_by_id=3, created_at=2026-02-05 16:08:23 UTC, is_locked=False)
artifact.delete(permanent=True)
 deleting all versions of this artifact because they all share the same store

SpatialData

It is also possible to access AnnData objects inside SpatialData tables:

artifact = ln.Artifact.connect("laminlabs/lamindata").get(
    key="visium_aligned_guide_min.zarr"
)

access = artifact.open()
 transferred: Artifact(uid='bjH534dxVi1drmLZ0001'), Storage(uid='D9BilDV2')
access
SpatialDataAccessor object
  constructed for the SpatialData object bjH534dxVi1drmLZ.zarr
    with tables: ['table']
access.tables
Accessor for the SpatialData attribute tables
  with keys: ['table']

This gives you the same AnnDataAccessor object as for a normal AnnData.

table = access.tables["table"]

table
AnnDataAccessor object with n_obs × n_vars = 37 × 18085
  constructed for the AnnData object table
    obs: ['_index', 'array_col', 'array_row', 'clone', 'dataset', 'in_tissue', 'region', 'spot_id']
    obsm: ['spatial']
    uns: ['spatial', 'spatialdata_attrs']
    var: ['feature_types', 'gene_ids', 'genome', 'symbols']

You can subset it and read the subset into memory as an actual AnnData object:

table_subset = table[table.obs["clone"] == "diploid"]

table_subset
AnnDataAccessorSubset object with n_obs × n_vars = 31 × 18085
  obs: ['_index', 'array_col', 'array_row', 'clone', 'dataset', 'in_tissue', 'region', 'spot_id']
  obsm: ['spatial']
  uns: ['spatial', 'spatialdata_attrs']
  var: ['feature_types', 'gene_ids', 'genome', 'symbols']
adata = table_subset.to_memory()

Generic HDF5

Let us query a generic HDF5 artifact:

artifact = ln.Artifact.get(key="testfile.hdf5")

And get a backed accessor:

backed = artifact.open()

The returned object holds the file connection in .connection and an h5py.File or zarr.Group in .storage:

backed
BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/test-arrays/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5>" (mode r)>)
backed.storage
<HDF5 file "testfile.hdf5>" (mode r)>
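
Since .storage is a regular h5py.File for HDF5 artifacts, the usual h5py traversal tools should apply to it. The sketch below lists every dataset in a file with visititems. The local file and its group layout are hypothetical, because the contents of testfile.hdf5 are not shown in this guide.

```python
import os
import tempfile

import h5py
import numpy as np

found = {}


def record_dataset(name, obj):
    """Collect the shape of every dataset; groups are skipped."""
    if isinstance(obj, h5py.Dataset):
        found[name] = obj.shape


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.hdf5")

    # Build a small hypothetical file with one nested group
    with h5py.File(path, "w") as f:
        f.create_dataset("counts", data=np.zeros((3, 2)))
        f.create_group("meta").create_dataset("labels", data=np.arange(3))

    # In the guide, backed.storage would take the place of this handle
    with h5py.File(path, "r") as storage:
        storage.visititems(record_dataset)

print(found)
```

Only metadata is touched during the traversal, so this is a cheap way to inspect an unfamiliar file before reading any arrays.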
# clean up test instance
ln.setup.delete("test-arrays", force=True)
 deleted storage record on hub 76e5f3b018085f52bcd5ca9b4d7e0ce5 | s3://lamindb-ci/test-arrays
 deleted instance record on hub 587a82023ecb5ea28b3a448cb8240f7f