LaminDB

LaminDB is an open-source data framework for biology to query, trace, and validate datasets and models at scale. You get context & memory through a lineage-native lakehouse that understands bio-formats, registries & ontologies.

Why?

(1) Reproducing, tracing & understanding how datasets, models & results are created is critical to quality R&D. Without context, humans & agents make mistakes and cannot close feedback loops across data generation & analysis. Without memory, compute & intelligence are wasted on fragmented, non-compounding tasks.

(2) Training & fine-tuning models with thousands of datasets — across LIMS, ELNs, orthogonal assays — is now a primary path to scaling R&D. But without queryable & validated data, or with data locked in organizational & infrastructure silos, it leads to garbage in, garbage out, or is quite simply impossible.

Imagine building software without git or pull requests: an agent’s quality would be impossible to verify. While code has git and tables have dbt/warehouses, biological data has lacked a framework for managing its unique complexity.

LaminDB fills the gap. It is a lineage-native lakehouse that understands bio-registries and formats (AnnData, .zarr, …) based on the established open data stack: Postgres/SQLite for metadata and cross-platform storage for datasets. By offering queries, tracing & validation in a single API, LaminDB provides the context & memory to turn messy, agentic biological R&D into a scalable process.

How?

  • lineage → track inputs & outputs of notebooks, scripts, functions & pipelines with a single line of code

  • lakehouse → manage, monitor & validate schemas for standard and bio formats; query across many datasets

  • FAIR datasets → validate & annotate DataFrame, AnnData, SpatialData, parquet, zarr, …

  • LIMS & ELN → programmatic experimental design with bio-registries, ontologies & markdown notes

  • unified access → storage locations (local, S3, GCP, …), SQL databases (Postgres, SQLite) & ontologies

  • reproducible → auto-track source code & compute environments with data & code versioning

  • change management → branching & merging similar to git

  • zero lock-in → runs anywhere on open standards (Postgres, SQLite, parquet, zarr, etc.)

  • scalable → you hit storage & database directly through your pydata or R stack, no REST API involved

  • simple → just pip install from PyPI or install.packages('laminr') from CRAN

  • distributed → zero-copy & lineage-aware data sharing across infrastructure (databases & storage locations)

  • integrations → git, nextflow, vitessce, redun, and more

  • extensible → create custom plug-ins based on the Django ORM, the basis for LaminDB’s registries

GUI, permissions, audit logs? LaminHub is a collaboration hub built on LaminDB, similar to how GitHub is built on git.

Who?

Scientists and engineers at leading research institutions and biotech companies, including:

  • Industry → Pfizer, Altos Labs, Ensocell Therapeutics, …

  • Academia & Research → scverse, DZNE (German Center for Neurodegenerative Diseases), Helmholtz Munich (German Research Center for Environmental Health), …

  • Research Hospitals → Global Immunological Swarm Learning Network: Harvard, MIT, Stanford, ETH Zürich, Charité, U Bonn, Mount Sinai, …

From personal research projects to pharma-scale deployments managing petabytes of data across:

entities                   OOMs
observations & datasets    10¹² & 10⁶
runs & transforms          10⁹ & 10⁵
proteins & genes           10⁹ & 10⁶
biosamples & species       10⁵ & 10²

Docs

Copy llms.txt into an LLM chat and let AI explain, or read the docs.

Quickstart

Install the Python package:

pip install lamindb

Query databases

You can browse public databases at lamin.ai/explore. To query laminlabs/cellxgene, run:

import lamindb as ln

db = ln.DB("laminlabs/cellxgene")  # a database object for queries
df = db.Artifact.to_dataframe()    # a dataframe listing datasets & models
 connected lamindb: anonymous/test-transfer

To get a specific dataset, run:

artifact = db.Artifact.get("BnMwC3KZz0BuKftR")  # a metadata object for a dataset
artifact.describe()                             # describe the context of the dataset
Artifact: cell-census/2025-01-30/h5ads/82346769-8733-485e-ab49-f14923d2b5bc.h5ad (2025-01-30)
|   description: OPCs
├── uid: BnMwC3KZz0BuKftR0000            run: o9WY9Nh (annotate_2025_30_01_LTS.py)
kind: None                           otype: AnnData                           
hash: htHNjYEGEzDadT7QcGrkAQ         size: 64.5 MB                            
branch: main                         space: all                               
created_at: 2025-07-30 09:51:10 UTC  created_by: zethson                      
n_observations: 3324                                                          
├── storage/path: s3://cellxgene-data-public/cell-census/2025-01-30/h5ads/82346769-8733-485e-ab49-f14923d2b5bc.h5ad
├── Dataset features
├── obs (20)                                                                                                   
│   assay                          bionty.ExperimentalFactor[source__…                                         
│   assay_ontology_term_id         bionty.ExperimentalFactor.ontology…  EFO:0009922                            
│   cell_type                      bionty.CellType[source__uid='3Uw2V…                                         
│   cell_type_ontology_term_id     bionty.CellType.ontology_id[source…  CL:0002453                             
│   development_stage              bionty.DevelopmentalStage[source__…                                         
│   development_stage_ontology_t…  bionty.DevelopmentalStage.ontology…  HsapDv:0000147, HsapDv:0000162, HsapDv…
│   disease                        bionty.Disease[source__uid='4a3ejK…                                         
│   disease_ontology_term_id       bionty.Disease.ontology_id[source_…  MONDO:0004975, MONDO:0800027, PATO:000…
│   donor_id                       str                                                                         
│   is_primary_data                ULabel                                                                      
│   organism                       bionty.Organism.scientific_name[so…                                         
│   organism_ontology_term_id      bionty.Organism.ontology_id[source…  NCBITaxon:9606                         
│   self_reported_ethnicity        bionty.Ethnicity[source__uid='MJRq…                                         
│   self_reported_ethnicity_onto…  bionty.Ethnicity.ontology_id[sourc…  HANCESTRO:0005, HANCESTRO:0016, unknown
│   sex                            bionty.Phenotype[source__uid='3ox8…                                         
│   sex_ontology_term_id           bionty.Phenotype.ontology_id[sourc…  PATO:0000383, PATO:0000384             
│   suspension_type                ULabel                               nucleus                                
│   tissue                         bionty.Tissue[source__uid='MUtAGdL…                                         
│   tissue_ontology_term_id        bionty.Tissue.ontology_id[source__…  UBERON:0000451, UBERON:0016528, UBERON…
│   tissue_type                    ULabel                               tissue                                 
└── var (2)                                                                                                    
    feature_is_filtered            bool                                                                        
    var_index                      bionty.Gene.ensembl_gene_id[source…                                         
├── External features
└── n_of_donors                    int                                  8                                      
└── Labels
    └── .ulabels                       ULabel                               nucleus, tissue                        
        .references                    Reference                            Deciphering glial contributions to CSF…
        .organisms                     bionty.Organism                      human                                  
        .tissues                       bionty.Tissue                        prefrontal cortex, white matter of fro…
        .cell_types                    bionty.CellType                      oligodendrocyte precursor cell         
        .diseases                      bionty.Disease                       Alzheimer disease, normal, leukoenceph…
        .phenotypes                    bionty.Phenotype                     female, male                           
        .experimental_factors          bionty.ExperimentalFactor            10x 3' v3                              
        .developmental_stages          bionty.DevelopmentalStage            81-year-old stage, 53-year-old stage, …
        .ethnicities                   bionty.Ethnicity                     European, African American or Afro-Car…

Access the content of the dataset via:

local_path = artifact.cache()  # return a local path from a cache
adata = artifact.load()        # load object into memory
! run input wasn't tracked, call `ln.track()` and re-run
! run input wasn't tracked, call `ln.track()` and re-run
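
For large arrays, you can also stream from storage instead of loading the full object. A minimal sketch, assuming Artifact.open() returns a backed accessor for this .h5ad (the accessor API depends on the format):

backed = artifact.open()  # backed accessor; streams from storage
backed.shape              # inspect dimensions without loading X into memory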

You can query by biological entities like Disease through plug-in bionty:

alzheimers = db.bionty.Disease.get(name="Alzheimer disease")
df = db.Artifact.filter(diseases=alzheimers).to_dataframe()

Configure your database

You can create a LaminDB instance at lamin.ai and invite collaborators. To connect to a remote instance, run:

lamin login
lamin connect account/name

If you prefer to work with a local SQLite database (no login required), run this instead:

lamin init --storage ./quickstart-data --modules bionty

In the terminal and in Python sessions, LaminDB will now auto-connect.

CLI

To save a file or folder from the command line, run:

lamin save myfile.txt --key examples/myfile.txt

To sync a file into a local cache (artifacts) or development directory (transforms), run:

lamin load --key examples/myfile.txt
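
The CLI mirrors the Python API; a rough Python equivalent of the two commands above (assuming a file myfile.txt in your working directory):

import lamindb as ln

ln.Artifact("myfile.txt", key="examples/myfile.txt").save()  # like `lamin save`
ln.Artifact.get(key="examples/myfile.txt").cache()           # like `lamin load`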

Read more: docs.lamin.ai/cli.

Lineage: scripts & notebooks

To create a dataset while tracking source code, inputs, outputs, logs, and environment:

import lamindb as ln
# → connected lamindb: account/instance

ln.track()                                              # track code execution
open("sample.fasta", "w").write(">seq1\nACGT\n")        # create dataset
ln.Artifact("sample.fasta", key="sample.fasta").save()  # save dataset
ln.finish()                                             # mark run as finished
 created Transform('o80U861BeEDG0000', key='README.ipynb'), started new Run('jyHL084o09Z0ROUq') at 2026-02-05 16:07:40 UTC
 notebook imports: anndata==0.12.2 bionty==2.1.0 lamindb==2.1.1 numpy==2.4.2 pandas==2.3.3
 recommendation: to identify the notebook across renames, pass the uid: ln.track("o80U861BeEDG")
! calling anonymously, will miss private instances
! cells [(4, 6), (20, 22)] were not run consecutively
 finished Run('jyHL084o09Z0ROUq') after 2s at 2026-02-05 16:07:43 UTC

Running this snippet as a script (python create-fasta.py) produces the following data lineage:

artifact = ln.Artifact.get(key="sample.fasta")  # get artifact by key
artifact.describe()      # context of the artifact
artifact.view_lineage()  # fine-grained lineage
Artifact: sample.fasta (0000)
├── uid: DmKeBE0gXR6JKWJL0000            run: jyHL084 (README.ipynb)
hash: 83rEPcAoBHmYiIuyBYrFKg         size: 11 B                 
branch: main                         space: all                 
created_at: 2026-02-05 16:07:41 UTC  created_by: anonymous      
└── storage/path: /home/runner/work/lamindb/lamindb/docs/test-transfer/.lamindb/DmKeBE0gXR6JKWJL0000.fasta
(data lineage graph of sample.fasta)

Access the run & transform:
run = artifact.run              # get the run object
transform = artifact.transform  # get the transform object
run.describe()                  # context of the run
Run: jyHL084 (README.ipynb)
├── uid: jyHL084o09Z…  transform: README.ipynb (0000)                                                              
                   |   description: LaminDB [![docs](https://img.shields.io/badge/docs-yellow)](https://docs.l…
started_at: 2026…  finished_at: 2026-02-05 16:07:43 UTC                                                        
status: completed                                                                                              
branch: main       space: all                                                                                  
created_at: 2026…  created_by: anonymous                                                                       
└── environment: PsDOBsG
    aiobotocore==2.26.0
    aiohappyeyeballs==2.6.1
    aiohttp==3.13.3
    aioitertools==0.13.0
    │ …
transform.describe()  # context of the transform
Transform: README.ipynb (0000)
|   description: LaminDB [![docs](https://img.shields.io/badge/docs-yellow)](https://docs.lamin.ai) 
[![llms.txt](https://img.shields.io/badge/llms.txt-orange)](https://docs.lamin.ai/llms.txt) 
[![codecov](https://codecov.io/gh/laminlabs/lamindb/branch/main/graph/badge.svg?token=VKMRJ7OWR3)](https://codecov.
io/gh/laminlabs/lamindb) 
[![pypi](https://img.shields.io/pypi/v/lamindb?color=blue&label=PyPI)](https://pypi.org/project/lamindb) 
[![cran](https://www.r-pkg.org/badges/version/laminr?color=green)](https://cran.r-project.org/package=laminr) 
[![stars](https://img.shields.io/github/stars/laminlabs/lamindb?style=flat&logo=GitHub&label=&color=gray)](https://
github.com/laminlabs/lamindb) 
[![downloads](https://static.pepy.tech/personalized-badge/lamindb?period=total&units=INTERNATIONAL_SYSTEM&left_colo
r=GRAY&right_color=GRAY&left_text=%E2%AC%87%EF%B8%8F)](https://pepy.tech/project/lamindb)
├── uid: o80U861BeEDG0000                                     
hash: LPEWGk_HPJjXjXe6A0V0wg         type: notebook       
branch: main                         space: all           
created_at: 2026-02-05 16:07:40 UTC  created_by: anonymous
└── source_code: 
    # %% [markdown]
    #
    #
    # LaminDB is an open-source data framework for biology to query, trace, and vali …
    # You get context & memory through a lineage-native lakehouse that understands b …
    #
    # <details>
    # <summary>Why?</summary>
    #
    # (1) Reproducing, tracing & understanding how datasets, models & results are cr …
    # Without context, humans & agents make mistakes and cannot close feedback loops …
    # Without memory, compute & intelligence are wasted on fragmented, non-compoundi …
    #
    # (2) Training & fine-tuning models with thousands of datasets — across LIMS, EL …
    # But without queryable & validated data or with data locked in organizational & …
    #
    # Imagine building software without git or pull requests: an agent's quality wou …
    # While code has git and tables have dbt/warehouses, biological data has lacked  …
    #
    # LaminDB fills the gap.
    # It is a lineage-native lakehouse that understands bio-registries and formats ( …
    # Postgres/SQLite for metadata and cross-platform storage for datasets.
    # By offering queries, tracing & validation in a single API, LaminDB provides th …
    #
    # </details>
    #
    # <img width="800px" src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/Bu …
    #
    # How?
    #
    │ …

Lineage: functions & workflows

You can achieve the same traceability for functions & workflows:

import lamindb as ln

@ln.flow()
def create_fasta(fasta_file: str = "sample.fasta"):
    open(fasta_file, "w").write(">seq1\nACGT\n")    # create dataset
    ln.Artifact(fasta_file, key=fasta_file).save()  # save dataset

if __name__ == "__main__":
    create_fasta()  # execute the flow

Beyond what you get for scripts & notebooks, this automatically tracks function & CLI params and integrates well with established Python workflow managers: docs.lamin.ai/track. To integrate advanced bioinformatics pipeline managers like Nextflow, see docs.lamin.ai/pipelines.
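
For instance, a parameter of a flow-decorated function is captured with the run. A minimal sketch (count_bases & fasta_key are illustrative names):

import lamindb as ln

@ln.flow()
def count_bases(fasta_key: str = "sample.fasta"):
    path = ln.Artifact.get(key=fasta_key).cache()  # tracked as a run input
    n = sum(len(l.strip()) for l in open(path) if not l.startswith(">"))
    print(f"{fasta_key}: {n} bases")  # `fasta_key` is recorded with the run

if __name__ == "__main__":
    count_bases()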

A richer example

Here is an automatically generated reconstruction of the project of Schmidt et al. (Science, 2022), in which a phenotypic CRISPRa screening result is integrated with scRNA-seq data:

(data lineage graph of the screen result)

You can explore it on LaminHub or on GitHub.

Labeling & queries by fields

You can label an artifact by running:

my_label = ln.ULabel(name="My label").save()   # a universal label
project = ln.Project(name="My project").save() # a project label
artifact.ulabels.add(my_label)
artifact.projects.add(project)

Query for it:

ln.Artifact.filter(ulabels=my_label, projects=project).to_dataframe()
uid key description suffix kind otype size hash n_files n_observations version_tag is_latest is_locked created_at branch_id space_id storage_id run_id schema_id created_by_id
id
2 DmKeBE0gXR6JKWJL0000 sample.fasta None .fasta None None 11 83rEPcAoBHmYiIuyBYrFKg None None None True False 2026-02-05 16:07:41.740000+00:00 1 1 3 3 None 3

You can also query by the metadata that lamindb automatically collects:

ln.Artifact.filter(run=run).to_dataframe()              # by creating run
ln.Artifact.filter(transform=transform).to_dataframe()  # by creating transform
ln.Artifact.filter(size__gt=1e6).to_dataframe()         # size greater than 1MB
uid id key description suffix kind otype size hash n_files n_observations version_tag is_latest is_locked created_at branch_id space_id storage_id run_id schema_id created_by_id

If you want to include more information in the resulting dataframe, pass include.

ln.Artifact.to_dataframe(include=["created_by__name", "storage__root"])  # include fields from related registries
uid key created_by__name storage__root
id
2 DmKeBE0gXR6JKWJL0000 sample.fasta None /home/runner/work/lamindb/lamindb/docs/test-tr...
1 9K1dteZ6Qx0EXK8g0000 example_datasets/mini_immuno/dataset1.h5ad None s3://lamindata

Note: The query syntax for DB objects and for your default database is the same.
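
For example, the same filter runs unchanged against the remote db object from above and against your own instance:

db.Artifact.filter(suffix=".h5ad").to_dataframe()  # remote laminlabs/cellxgene
ln.Artifact.filter(suffix=".h5ad").to_dataframe()  # your default database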

Queries by features

You can annotate datasets and samples with features. Let’s define some:

from datetime import date

ln.Feature(name="gc_content", dtype=float).save()
ln.Feature(name="experiment_note", dtype=str).save()
ln.Feature(name="experiment_date", dtype=date, coerce=True).save()  # accept date strings
Feature(uid='GYIKpkHUlalv', is_type=False, name='experiment_date', _dtype_str='date', unit=None, description=None, array_rank=0, array_size=0, array_shape=None, synonyms=None, default_value=None, nullable=True, coerce=True, branch_id=1, space_id=1, created_by_id=3, run_id=None, type_id=None, created_at=2026-02-05 16:07:43 UTC, is_locked=False)

During annotation, feature names and data types are validated against these definitions:

artifact.features.add_values({
    "gc_content": 0.55,
    "experiment_note": "Looks great",
    "experiment_date": "2025-10-24",
})
 "columns" is validated against Feature.name

Query for it:

ln.Artifact.filter(experiment_date="2025-10-24").to_dataframe()  # query all artifacts annotated with `experiment_date`
uid key description suffix kind otype size hash n_files n_observations version_tag is_latest is_locked created_at branch_id space_id storage_id run_id schema_id created_by_id
id
2 DmKeBE0gXR6JKWJL0000 sample.fasta None .fasta None None 11 83rEPcAoBHmYiIuyBYrFKg None None None True False 2026-02-05 16:07:41.740000+00:00 1 1 3 3 None 3

If you want to include the feature values in the dataframe, pass include.

ln.Artifact.to_dataframe(include="features")  # include the feature annotations
 queried for all categorical features of dtypes Record or ULabel and non-categorical features: (9) ['concentration', 'treatment_time_h', 'sample_note', 'donor', 'perturbation', 'experiment', 'gc_content', 'experiment_note', 'experiment_date']
uid key perturbation experiment gc_content experiment_note experiment_date
id
2 DmKeBE0gXR6JKWJL0000 sample.fasta NaN NaN 0.55 Looks great 2025-10-24
1 9K1dteZ6Qx0EXK8g0000 example_datasets/mini_immuno/dataset1.h5ad {DMSO, IFNG} Experiment 1 NaN NaN NaT

Lake ♾️ LIMS ♾️ Sheets

You can create records for the entities underlying your experiments: samples, perturbations, instruments, etc., for example:

sample = ln.Record(name="Sample", is_type=True).save()  # create entity type: Sample
ln.Record(name="P53mutant1", type=sample).save()        # sample 1
ln.Record(name="P53mutant2", type=sample).save()        # sample 2
! you are trying to create a record with name='P53mutant2' but a record with similar name exists: 'P53mutant1'. Did you mean to load it?
Record(uid='w1jwlsysA74gB4zv', is_type=False, name='P53mutant2', description=None, reference=None, reference_type=None, extra_data=None, branch_id=1, space_id=1, created_by_id=3, type_id=4, schema_id=None, run_id=None, created_at=2026-02-05 16:07:44 UTC, is_locked=False)

Define features and annotate an artifact with a sample:

ln.Feature(name="design_sample", dtype=sample).save()
artifact.features.add_values({"design_sample": "P53mutant1"})
 "columns" is validated against Feature.name
 "design_sample" is validated against Record.name

You can query & search the Record registry in the same way as Artifact or Run.

ln.Record.search("p53").to_dataframe()
uid name description reference reference_type extra_data is_locked is_type created_at branch_id space_id created_by_id type_id schema_id run_id
id
5 OWIsdLY81pvDFTkx P53mutant1 None None None None False False 2026-02-05 16:07:44.123000+00:00 1 1 3 4 None None
6 w1jwlsysA74gB4zv P53mutant2 None None None None False False 2026-02-05 16:07:44.136000+00:00 1 1 3 4 None None

You can also create relationships between entities and edit them like Excel sheets in a GUI via LaminHub.

Data versioning

If you change source code or datasets, LaminDB manages versioning for you. Assume you run a new version of the create-fasta.py script to create a new version of sample.fasta:

import lamindb as ln

ln.track()
open("sample.fasta", "w").write(">seq1\nTGCA\n")  # a new sequence
ln.Artifact("sample.fasta", key="sample.fasta", features={"design_sample": "P53mutant1"}).save()  # annotate with the new sample
ln.finish()
 found notebook README.ipynb, making new version -- anticipating changes
 created Transform('o80U861BeEDG0001', key='README.ipynb'), started new Run('0ntHgFe9OFtKx3jp') at 2026-02-05 16:07:44 UTC
 notebook imports: anndata==0.12.2 bionty==2.1.0 lamindb==2.1.1 numpy==2.4.2 pandas==2.3.3
 recommendation: to identify the notebook across renames, pass the uid: ln.track("o80U861BeEDG")
 creating new artifact version for key 'sample.fasta' in storage '/home/runner/work/lamindb/lamindb/docs/test-transfer'
! cells [(4, 6), (20, 22)] were not run consecutively
 returning artifact with same hash: Artifact(uid='swB1BLFyzT5ypBZ40000', version_tag=None, is_latest=True, key=None, description='Report of run jyHL084o09Z0ROUq', suffix='.html', kind='__lamindb_run__', otype=None, size=336730, hash='fSm_GsKkT9zWE9XJ_uFFDA', n_files=None, n_observations=None, branch_id=1, space_id=1, storage_id=3, run_id=None, schema_id=None, created_by_id=3, created_at=2026-02-05 16:07:43 UTC, is_locked=False); to track this artifact as an input, use: ln.Artifact.get()
! run was not set on Artifact(uid='swB1BLFyzT5ypBZ40000', version_tag=None, is_latest=True, key=None, description='Report of run jyHL084o09Z0ROUq', suffix='.html', kind='__lamindb_run__', otype=None, size=336730, hash='fSm_GsKkT9zWE9XJ_uFFDA', n_files=None, n_observations=None, branch_id=1, space_id=1, storage_id=3, run_id=None, schema_id=None, created_by_id=3, created_at=2026-02-05 16:07:43 UTC, is_locked=False), setting to current run
! updated description from Report of run jyHL084o09Z0ROUq to Report of run 0ntHgFe9OFtKx3jp
! returning transform  with same hash & key: Transform(uid='o80U861BeEDG0000', version_tag=None, is_latest=False, key='README.ipynb', description='LaminDB [![docs](https://img.shields.io/badge/docs-yellow)](https://docs.lamin.ai) [![llms.txt](https://img.shields.io/badge/llms.txt-orange)](https://docs.lamin.ai/llms.txt) [![codecov](https://codecov.io/gh/laminlabs/lamindb/branch/main/graph/badge.svg?token=VKMRJ7OWR3)](https://codecov.io/gh/laminlabs/lamindb) [![pypi](https://img.shields.io/pypi/v/lamindb?color=blue&label=PyPI)](https://pypi.org/project/lamindb) [![cran](https://www.r-pkg.org/badges/version/laminr?color=green)](https://cran.r-project.org/package=laminr) [![stars](https://img.shields.io/github/stars/laminlabs/lamindb?style=flat&logo=GitHub&label=&color=gray)](https://github.com/laminlabs/lamindb) [![downloads](https://static.pepy.tech/personalized-badge/lamindb?period=total&units=INTERNATIONAL_SYSTEM&left_color=GRAY&right_color=GRAY&left_text=%E2%AC%87%EF%B8%8F)](https://pepy.tech/project/lamindb)', kind='notebook', hash='LPEWGk_HPJjXjXe6A0V0wg', reference=None, reference_type=None, environment=None, branch_id=1, space_id=1, created_by_id=3, created_at=2026-02-05 16:07:40 UTC, is_locked=False)
 new latest Transform version is: o80U861BeEDG0000
 finished Run('0ntHgFe9OFtKx3jp') after 1s at 2026-02-05 16:07:45 UTC

If you now query by key, you get the latest version of the artifact linked to the latest version of the source code; previous versions of both remain easily queryable:

artifact = ln.Artifact.get(key="sample.fasta")  # get artifact by key
artifact.versions.to_dataframe()                # see all versions of that artifact
uid key description suffix kind otype size hash n_files n_observations version_tag is_latest is_locked created_at branch_id space_id storage_id run_id schema_id created_by_id
id
5 DmKeBE0gXR6JKWJL0001 sample.fasta None .fasta None None 11 aqvq4CskQu3Nnr3hl5r3ug None None None True False 2026-02-05 16:07:45.008000+00:00 1 1 3 4 None 3
2 DmKeBE0gXR6JKWJL0000 sample.fasta None .fasta None None 11 83rEPcAoBHmYiIuyBYrFKg None None None False False 2026-02-05 16:07:41.740000+00:00 1 1 3 3 None 3

Lakehouse ♾️ feature store

Here is how you ingest a DataFrame:

from datetime import date

import pandas as pd

df = pd.DataFrame({
    "sequence_str": ["ACGT", "TGCA"],
    "gc_content": [0.55, 0.54],
    "experiment_note": ["Looks great", "Ok"],
    "experiment_date": [date(2025, 10, 24), date(2025, 10, 25)],
})
ln.Artifact.from_dataframe(df, key="my_datasets/sequences.parquet").save()  # no validation
 writing the in-memory object into cache
Artifact(uid='h78D3RJg7A4I03iJ0000', version_tag=None, is_latest=True, key='my_datasets/sequences.parquet', description=None, suffix='.parquet', kind='dataset', otype='DataFrame', size=3405, hash='XHWWD_cePb1MV2pgSS0Ecg', n_files=None, n_observations=2, branch_id=1, space_id=1, storage_id=3, run_id=None, schema_id=None, created_by_id=3, created_at=2026-02-05 16:07:46 UTC, is_locked=False)

To validate & annotate the content of the dataframe, use the built-in schema valid_features:

ln.Feature(name="sequence_str", dtype=str).save()  # define a remaining feature
artifact = ln.Artifact.from_dataframe(
    df,
    key="my_datasets/sequences.parquet",
    schema="valid_features"  # validate columns against features
).save()
artifact.describe()
! you are trying to create a record with name='valid_features' but a record with similar name exists: 'anndata_ensembl_gene_ids_and_valid_features_in_obs'. Did you mean to load it?
 writing the in-memory object into cache
 returning artifact with same hash: Artifact(uid='h78D3RJg7A4I03iJ0000', version_tag=None, is_latest=True, key='my_datasets/sequences.parquet', description=None, suffix='.parquet', kind='dataset', otype='DataFrame', size=3405, hash='XHWWD_cePb1MV2pgSS0Ecg', n_files=None, n_observations=2, branch_id=1, space_id=1, storage_id=3, run_id=None, schema_id=None, created_by_id=3, created_at=2026-02-05 16:07:46 UTC, is_locked=False); to track this artifact as an input, use: ln.Artifact.get()
 loading artifact into memory for validation
 "columns" is validated against Feature.name
Artifact: my_datasets/sequences.parquet (0000)
├── uid: h78D3RJg7A4I03iJ0000            run:                 
kind: dataset                        otype: DataFrame     
hash: XHWWD_cePb1MV2pgSS0Ecg         size: 3.3 KB         
branch: main                         space: all           
created_at: 2026-02-05 16:07:46 UTC  created_by: anonymous
n_observations: 2                                         
├── storage/path: /home/runner/work/lamindb/lamindb/docs/test-transfer/.lamindb/h78D3RJg7A4I03iJ0000.parquet
└── Dataset features
    └── columns (4)                                                                                                
        experiment_date                date                                                                        
        experiment_note                str                                                                         
        gc_content                     float                                                                       
        sequence_str                   str                                                                         

You can filter for datasets by schema and then launch distributed queries and batch loading.
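
A minimal sketch of such a schema-based query, assuming the filter field on Artifact is schema (it surfaces as schema_id in the dataframes above):

schema = ln.Schema.get(name="valid_features")       # the built-in schema used above
for artifact in ln.Artifact.filter(schema=schema):  # all datasets validated against it
    df = artifact.load()                            # batch-load each dataset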

Lakehouse beyond tables

To validate an AnnData with built-in schema ensembl_gene_ids_and_valid_features_in_obs, call:

import anndata as ad
import numpy as np
import pandas as pd

adata = ad.AnnData(
    X=np.ones((21, 10), dtype=int),  # 21 observations × 10 genes
    obs=pd.DataFrame({"cell_type_by_model": ["T cell", "B cell", "NK cell"] * 7}),
    var=pd.DataFrame(index=[f"ENSG{i:011d}" for i in range(10)]),
)
artifact = ln.Artifact.from_anndata(
    adata,
    key="my_datasets/scrna.h5ad",
    schema="ensembl_gene_ids_and_valid_features_in_obs"
)
artifact.describe()
 writing the in-memory object into cache
 loading artifact into memory for validation
/opt/hostedtoolcache/Python/3.13.11/x64/lib/python3.13/functools.py:934: ImplicitModificationWarning: Transforming to str index.
  return dispatch(args[0].__class__)(*args, **kw)
Artifact: my_datasets/scrna.h5ad (0000)
├── uid: q5gwoAvPzdaCq82J0000                                                            run:                 
kind: dataset                                                                        otype: AnnData       
hash: Sgbj2aSf8AKFs12oJKUxXQ                                                         size: 20.9 KB        
branch: main                                                                         space: all           
created_at: <django.db.models.expressions.DatabaseDefault object at 0x7f1acf398150>  created_by: anonymous
n_observations: 21                                                                                        
└── storage/path: /home/runner/work/lamindb/lamindb/docs/test-transfer/.lamindb/q5gwoAvPzdaCq82J0000.h5ad

To validate a SpatialData object or any other array-like dataset, construct a Schema by composing simple pandera-style schemas: docs.lamin.ai/curate.
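
A minimal sketch of composing such a schema; the slot-based composition below is an assumption loosely following docs.lamin.ai/curate, so consult the docs for the exact API:

obs_schema = ln.Schema(
    features=[ln.Feature(name="cell_type_by_model", dtype=str).save()],
).save()                                                                 # schema for the obs slot
composed = ln.Schema(otype="AnnData", slots={"obs": obs_schema}).save()  # compose per slot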

Ontologies

The bionty plugin gives you more than 20 public ontologies as SQLRecord registries. It was used to validate the Ensembl gene IDs in the AnnData above.

import bionty as bt

bt.CellType.import_source()  # import the default ontology
bt.CellType.to_dataframe()   # your extendable cell type ontology in a simple registry
 import is completed!
uid name ontology_id abbr synonyms description is_locked created_at branch_id space_id created_by_id run_id source_id
id
3453 1ChUsEzDZXWW4B beam B cell, human CL:7770006 None None A Trabecular Meshwork Cell Within The Eye'S Tr... False 2026-02-05 16:07:47.928000+00:00 1 1 3 None 50
3452 5xoxfxIf7WrLdU beam cell CL:7770005 None None A Trabecular Meshwork Cell That Is Part Of The... False 2026-02-05 16:07:47.928000+00:00 1 1 3 None 50
3451 2j5mhhFoV2vBDV suprabasal cell CL:7770004 None None An Epithelial Cell That Resides In The Layer(S... False 2026-02-05 16:07:47.928000+00:00 1 1 3 None 50
3450 RBCFqAmkM1oaaZ beam A cell CL:7770003 None None A Beam Cell Within The Eye'S Trabecular Meshwo... False 2026-02-05 16:07:47.928000+00:00 1 1 3 None 50
3449 79Ow7BGPRP018I juxtacanalicular tissue cell CL:7770002 None None A Trabecular Meshwork Cell Of The Juxtacanalic... False 2026-02-05 16:07:47.928000+00:00 1 1 3 None 50
... ... ... ... ... ... ... ... ... ... ... ... ... ...
3358 gDJgUmTBv5AHYt Astro-OLF NN_2 Alk astrocyte (Mmus) CL:4307054 None 5234 Astro-OLF NN_2 A Astrocyte Of The Mus Musculus Brain. It Is D... False 2026-02-05 16:07:47.913000+00:00 1 1 3 None 50
3357 51U0BVtjFHQGxE Astro-OLF NN_2 Slc25a34 astrocyte (Mmus) CL:4307053 None 5233 Astro-OLF NN_2 A Astrocyte Of The Mus Musculus Brain. It Is D... False 2026-02-05 16:07:47.913000+00:00 1 1 3 None 50
3356 FjYN3z6zMFQ3JV Astro-OLF NN_1 Stk32a astrocyte (Mmus) CL:4307052 None 5232 Astro-OLF NN_1 A Astrocyte Of The Mus Musculus Brain. It Is D... False 2026-02-05 16:07:47.913000+00:00 1 1 3 None 50
3355 8tHAMMeaiLxdaP Astro-OLF NN_1 Greb1 astrocyte (Mmus) CL:4307051 None 5231 Astro-OLF NN_1 A Astrocyte Of The Mus Musculus Brain. It Is D... False 2026-02-05 16:07:47.913000+00:00 1 1 3 None 50
3354 5SkKyhULGbfXWC Astro-TE NN_5 Adamts18 astrocyte (Mmus) CL:4307050 None 5230 Astro-TE NN_5 A Astrocyte Of The Mus Musculus Brain. It Is D... False 2026-02-05 16:07:47.913000+00:00 1 1 3 None 50

100 rows × 13 columns
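
bionty registries also let you validate & standardize terms against the ontology:

bt.CellType.validate(["T cell", "t cell"])  # boolean array flagging validated terms
bt.CellType.standardize(["t cell"])         # attempt to map terms to validated names via synonyms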

Read more: docs.lamin.ai/manage-ontologies.