Storage-agnostic, lazy-loading interface for ASE Atoms objects. Pluggable backends (LMDB, Zarr, HDF5/H5MD, HuggingFace Datasets, ASE file formats) behind a single list-like API with pandas-style column views.
pip install asebytes[lmdb] # LMDB backend (recommended)
pip install asebytes[zarr] # Zarr backend (fast compression)
pip install asebytes[h5md] # HDF5/H5MD backend
pip install asebytes[hf] # HuggingFace Datasets backend
from asebytes import ASEIO
# Write
db = ASEIO("data.lmdb")
db.extend(atoms_list) # bulk append
db[0] = new_atoms # replace row
db.update(0, calc={"energy": -10.5}) # partial update
# Read
atoms = db[0] # ase.Atoms
atoms = db[-1] # negative indexing

Backend is auto-detected from the file extension:
| Extension | Backend | Install extra |
|---|---|---|
| `*.lmdb` | `LMDBBackend` | `asebytes[lmdb]` |
| `*.zarr` | `ZarrBackend` | `asebytes[zarr]` |
| `*.h5` / `*.h5md` | `H5MDBackend` | `asebytes[h5md]` |
| `*.xyz` / `*.extxyz` / `*.traj` | `ASEReadOnlyBackend` | (none) |
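
The extension lookup in the table above can be pictured as a simple suffix-to-name mapping. The sketch below is illustrative only: `detect_backend` and `EXTENSION_TO_BACKEND` are hypothetical names, not part of asebytes, and the library's actual dispatch logic may differ.

```python
from pathlib import PurePath

# Hypothetical mapping that mirrors the table; not asebytes' real code.
EXTENSION_TO_BACKEND = {
    ".lmdb": "LMDBBackend",
    ".zarr": "ZarrBackend",
    ".h5": "H5MDBackend",
    ".h5md": "H5MDBackend",
    ".xyz": "ASEReadOnlyBackend",
    ".extxyz": "ASEReadOnlyBackend",
    ".traj": "ASEReadOnlyBackend",
}


def detect_backend(path: str) -> str:
    """Return the backend name for a path, based only on its file extension."""
    suffix = PurePath(path).suffix.lower()
    try:
        return EXTENSION_TO_BACKEND[suffix]
    except KeyError:
        raise ValueError(f"no backend registered for extension {suffix!r}") from None
```

Under these assumptions, `detect_backend("data.lmdb")` returns `"LMDBBackend"`, and an unknown extension raises `ValueError` instead of silently picking a default.
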
Indexing with slices, lists, or strings returns lazy views that load data on demand.
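
To make "lazy" concrete, here is a minimal sketch of a row view that stores only indices and fetches rows when iterated. `LazyRowView` and its `loader` callback are hypothetical stand-ins, not asebytes classes; the real `RowView` may be implemented quite differently.

```python
from typing import Callable, Iterator, Sequence


class LazyRowView:
    """Hypothetical view: holds row indices, loads rows only on iteration."""

    def __init__(self, loader: Callable[[int], object], indices: Sequence[int]):
        self._loader = loader          # called once per row, on demand
        self._indices = list(indices)  # cheap: no row data is read here

    def __len__(self) -> int:
        return len(self._indices)

    def __iter__(self) -> Iterator[object]:
        for i in self._indices:
            yield self._loader(i)

    def chunked(self, size: int) -> Iterator[object]:
        """Yield rows one by one, but fetch them `size` at a time."""
        for start in range(0, len(self._indices), size):
            batch = [self._loader(i) for i in self._indices[start:start + size]]
            yield from batch


loaded = []  # records which rows were actually read


def load_row(i: int) -> str:
    loaded.append(i)
    return f"frame-{i}"


view = LazyRowView(load_row, range(5, 10))  # nothing loaded yet
first = next(iter(view))                    # loads row 5 only
```

Creating the view costs nothing beyond storing indices; work happens only when a row is requested, which is the property the views below rely on.
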
# Row views — lazy, stream one frame at a time
view = db[5:100] # slice → RowView (nothing loaded yet)
view = db[[0, 42, 99]] # list of indices → RowView
for atoms in view:
    process(atoms)
# Chunked iteration — loads N rows per batch for throughput
for atoms in db[:].chunked(1000):
    process(atoms)
# Column views — avoid constructing full Atoms objects
energies = db["calc.energy"].to_list()
cols = db[["calc.energy", "calc.forces"]].to_dict()
# → {"calc.energy": [...], "calc.forces": [...]}
# Chaining — slice rows, then select columns
db[0:500]["calc.energy"].to_list()

For slow or remote sources, `cache_to` creates a persistent local cache.
The first pass reads from the source and fills the cache; all subsequent reads are served from the cache.
db = ASEIO("colabfit://dataset", split="train", cache_to="cache.lmdb")
for atoms in db: # epoch 1: reads source, populates cache
    train(atoms)
for atoms in db: # epoch 2+: all reads from local cache
    train(atoms)

`cache_to` accepts a file path (the backend is created automatically) or any `WritableBackend` instance.
There is no cache invalidation; delete the cache file to reset.
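
The caching behaviour described above follows the usual read-through pattern. The sketch below uses a plain dict as the cache and a hypothetical `ReadThroughCache` class; it is not how asebytes implements `cache_to`, only an illustration of the first-pass and later-pass behaviour.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Hypothetical read-through cache: a miss reads the source and stores the row."""

    def __init__(self, read_source: Callable[[int], object]):
        self._read_source = read_source
        self._cache: Dict[int, object] = {}  # stands in for the local cache file
        self.source_reads = 0                # counts trips to the slow source

    def __getitem__(self, index: int) -> object:
        if index not in self._cache:
            self.source_reads += 1
            self._cache[index] = self._read_source(index)
        return self._cache[index]


# Made-up source: each "row" is a tiny dict keyed like a flat asebytes column.
db = ReadThroughCache(lambda i: {"calc.energy": -1.0 * i})

epoch1 = [db[i] for i in range(3)]  # reads the source, fills the cache
epoch2 = [db[i] for i in range(3)]  # served entirely from the cache
```

Because nothing in this sketch ever evicts or refreshes entries, stale rows persist until the cache is cleared, which is the same trade-off as deleting the cache file to reset.
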
Stream or download datasets from the HuggingFace Hub via URI schemes.
# ColabFit (auto-selects column mapping, streams by default)
db = ASEIO("colabfit://mlearn_Cu_train", split="train")
# OPTIMADE (e.g. LeMaterial)
db = ASEIO("optimade://LeMaterial/LeMat-Bulk", split="train", name="compatible_pbe")
# Generic HuggingFace (requires explicit column mapping)
from asebytes import ColumnMapping
mapping = ColumnMapping(
    positions="pos", numbers="nums",
    calc={"energy": "total_energy"},
)
db = ASEIO("hf://user/dataset", mapping=mapping, split="train")
# Downloaded mode for faster access
db = ASEIO("colabfit://dataset", split="train", streaming=False)

Zarr backend with flat layout and Blosc/LZ4 compression. Offers compact file sizes and fast read performance. Supports variable particle counts (via NaN padding) and append-only writes.
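
NaN padding lets frames with different atom counts share one rectangular array. A minimal sketch of the idea in plain Python follows; the helper names `pad_frames` and `unpad_frame` are hypothetical, and the Zarr backend's actual on-disk layout is not described here and may differ.

```python
import math
from typing import List

Row = List[float]


def pad_frames(frames: List[List[Row]], width: int = 3) -> List[List[Row]]:
    """Pad each frame's per-atom rows with NaN rows up to the largest atom count."""
    max_atoms = max(len(frame) for frame in frames)
    return [
        frame + [[math.nan] * width for _ in range(max_atoms - len(frame))]
        for frame in frames
    ]


def unpad_frame(frame: List[Row]) -> List[Row]:
    """Drop all-NaN padding rows to recover the original per-atom data."""
    return [row for row in frame if not all(math.isnan(x) for x in row)]


# Made-up coordinates for illustration.
water = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]  # 3 atoms
h2 = [[0.0, 0.0, 0.0], [0.74, 0.0, 0.0]]                          # 2 atoms

padded = pad_frames([water, h2])  # both frames now have 3 rows
```

One caveat of this simple scheme: a genuine row consisting entirely of NaN would be indistinguishable from padding, so a real implementation may track atom counts separately.
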
db = ASEIO("trajectory.zarr")
db.extend(atoms_list)
# Custom compression
from asebytes import ZarrBackend
db = ASEIO(ZarrBackend("data.zarr", compressor="zstd", clevel=9))

H5MD-standard files with support for variable particle counts, per-frame PBC, and bond connectivity.
db = ASEIO("trajectory.h5", author_name="Jane Doe", compression="gzip")
db.extend(atoms_list)
# Multi-group files
from asebytes import H5MDBackend
groups = H5MDBackend.list_groups("multi.h5")
db = ASEIO("multi.h5", particles_group="solvent")

All data follows a flat namespace:
| Prefix | Content | Examples |
|---|---|---|
| `arrays.*` | Per-atom arrays | `arrays.positions`, `arrays.numbers`, `arrays.forces` |
| `calc.*` | Calculator results | `calc.energy`, `calc.stress` |
| `info.*` | Frame metadata | `info.smiles`, `info.label` |
| | (top-level) | `cell`, `pbc`, `constraints` |
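
To illustrate the naming scheme, the sketch below groups a flat, dotted-key dict by prefix. The example values and the `group_by_prefix` helper are hypothetical; real rows would come from a backend or from the conversion helpers.

```python
from collections import defaultdict
from typing import Any, Dict


def group_by_prefix(row: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Split 'prefix.name' keys into {prefix: {name: value}}; unprefixed keys go under ''."""
    grouped: Dict[str, Dict[str, Any]] = defaultdict(dict)
    for key, value in row.items():
        prefix, sep, name = key.partition(".")
        if sep:
            grouped[prefix][name] = value
        else:
            grouped[""][key] = value  # top-level fields such as cell, pbc
    return dict(grouped)


# Made-up row for illustration only.
row = {
    "arrays.numbers": [1, 1],
    "arrays.positions": [[0.0, 0.0, 0.0], [0.74, 0.0, 0.0]],
    "calc.energy": -31.7,
    "info.label": "h2",
    "pbc": [False, False, False],
}
grouped = group_by_prefix(row)
```

Keeping every field under a single dotted key is what allows a column such as `calc.energy` to be read without touching the rest of the row.
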
from asebytes import atoms_to_dict, dict_to_atoms
d = atoms_to_dict(atoms) # Atoms → flat dict (~5x faster than encode/decode)
atoms = dict_to_atoms(d) # flat dict → Atoms

Implement ReadableBackend for read-only or WritableBackend for read-write:
from asebytes import ASEIO, ReadableBackend
class MyBackend(ReadableBackend):
    def __len__(self): ...
    def columns(self, index=0): ...
    def read_row(self, index, keys=None): ...

db = ASEIO(MyBackend())

Benchmarks use 1000 frames from each of two datasets: ethanol conformers (small molecules, fixed size) and LeMat-Traj (periodic structures, variable atom counts). All frames include energy, forces, and stress. Results are compared against aselmdb, znh5md, extxyz, and SQLite.
# LeMat-Traj benchmark data
lemat = list(ASEIO("optimade://LeMaterial/LeMat-Traj", split="train", name="compatible_pbe")[:1000])

Note: HDF5 performance is heavily influenced by compression and chunking settings. Both asebytes H5MD and znh5md use gzip compression by default, which reduces file size at the cost of read/write speed. The Zarr backend uses Blosc/LZ4 compression, which achieves compact file sizes with faster decompression than gzip.




