I/O optimization for climate data: why there's no universal answer // Aymeric Delefosse

Training a deep learning model on climate data sounds straightforward until you hit the data pipeline. ERA5 alone (the most widely used reanalysis dataset) spans 40+ years at 6-hour resolution on a 1440×721 global grid, with tens of atmospheric variables across 13 pressure levels. A single training batch that loads previous, current, and next states for a handful of examples can pull hundreds of megabytes from disk. Do this 50,000 times and your GPU sits idle while your dataloader struggles.

This is a write-up of the choices I made building the data pipelines for ArchesWeatherSR, and more broadly, why I/O optimization in climate science has no universal answer.

The obvious starting point: xarray

xarray is the lingua franca of climate science. It wraps NumPy arrays with labeled dimensions and coordinates, understands CF conventions, handles time axes, and plays nicely with the rest of the scientific Python stack. Any .nc file opens with three lines:

import xarray as xr
ds = xr.open_dataset("era5_2020.nc")
t2m = ds["2m_temperature"]

So most people start there. And then the pain begins.

xarray’s lazy loading is great for exploration but brutal for training. When you call isel(time=i) inside a PyTorch __getitem__, xarray reads the file, decodes coordinates, applies masks, checks chunking, every single time. With multiple DataLoader workers each opening their own file handles, you end up with a thundering herd of redundant I/O and metadata operations. Caching helps somewhat, but xarray’s cache is per-Dataset object, not shared across processes.

There’s also the coordinate overhead. xarray stores rich metadata (units, calendar, variable attributes) which is lovely for analysis but wasteful when all you need is a NumPy array at index i. The Python overhead alone can dominate a batch load at high worker counts.

This isn’t xarray’s fault: it wasn’t designed for ML training loops. But it’s easy to miss until you’re staring at GPU utilization stuck at 30%.

Storage formats, compared

Once you start looking for alternatives, the options multiply quickly:

Raw NumPy / PyTorch is the floor. Memory-map a .npy file with np.memmap, or save tensors as .pt files and load them with torch.load(..., mmap=True, weights_only=True). This gets you near-zero Python overhead with random access that’s essentially free. PyTorch’s mmap support has the advantage that your data is already in tensor form. But either way you have to handle your own indexing, compression, and metadata. For anything beyond a toy dataset, it becomes painful to maintain. Itamar Turner-Trauring has a thorough comparison of mmap vs. Zarr/HDF5 that’s worth reading if you’re evaluating this tradeoff.

Zarr is the cloud-native answer and is increasingly the default in the community. Google’s WeatherBench (the de facto benchmark for data-driven weather forecasting) and ECMWF’s Anemoi (their framework for training data-driven weather models) use Zarr as their native dataset format. It supports chunked, compressed arrays, arbitrary backends (local filesystem, S3, GCS), and plays well with xarray.open_zarr(). If you’re working with data that lives in object storage, Zarr is hard to beat.

But “cloud-native” is a double-edged sword on HPC clusters. Most HPC storage is Lustre or GPFS, parallel filesystems tuned for large sequential reads rather than the many small random reads that Zarr’s chunk-per-file layout generates. Zarr stores can consist of thousands of small files, which creates metadata pressure on the filesystem and can get your storage quota revoked or your jobs throttled. ZipStore seems like an obvious escape hatch here, but it trades the small-file problem for poor random-access performance: ZIP was designed for sequential archive reads, not hyperslab indexing, and a 10 TB zip on a shared filesystem is arguably worse. My sysadmins would not appreciate either option.

HDF5 / NetCDF4 is a single-file format with hierarchical datasets, chunked storage, and built-in compression. Both have decades of use in climate science, are natively supported on virtually every HPC cluster, and don’t require any special filesystem support. NetCDF4 is HDF5 under the hood, but accessing HDF5 directly via h5py gives you more reliable access to modern compression filters like Blosc and LZ4 through hdf5plugin. On HPC clusters, you’re often stuck with the system-installed NetCDF-C, where zlib is the safe default.

Even then, the details matter. NVIDIA’s Makani, a research framework for massively parallel training of weather and climate prediction models (and the codebase behind FourCastNet 3), uses HDF5 Virtual Datasets (VDS) to stitch together datasets from multiple source files into a single logical view. Elegant in theory, but VDS adds an indirection layer that can introduce overhead on Lustre when the underlying files live on different storage nodes.

Why the right answer depends on where you are

The choice isn’t just technical. It’s institutional.

On a cluster with Lustre, one .h5 file per year is fine. The same workload with Zarr might generate millions of chunk files that get your jobs queued behind everyone else’s.

On a cloud VM pulling from GCS, Zarr is the natural fit: your data is already there, your reads are parallel, and you’re paying for egress anyway.

If your data is generated by a pipeline that writes NetCDF, converting to Zarr adds friction. If your collaborators all use xarray, diverging from xarray-readable formats adds friction too.

For my work (ERA5 on a university HPC cluster, preprocessing pipelines that produce NetCDF, colleagues who use xarray), HDF5 with h5py is the pragmatic choice.

The custom dataloader

Reading HDF5 directly with h5py bypasses xarray entirely. You get a dict-like interface to datasets, array slicing that maps directly to HDF5 hyperslab reads, and full control over how file handles are managed.

The core of the HDF5Dataset I wrote handles two different layouts that came up in practice:

Per-variable mode: one HDF5 dataset per variable name, e.g. /2m_temperature[time, lat, lon], /geopotential[time, level, lat, lon]. This is how most datasets come out of the box (it’s the natural layout of WeatherBench 2 data), but it requires stacking variables at read time.
Pre-stacked mode: variables already stacked into e.g. /surface[time, n_vars, lat, lon]. One read per group, zero Python-level stacking. Faster if you can afford the preprocessing step to produce it.

if isinstance(self.variables, list):
    # Pre-stacked: one read per group
    tdict = TensorDict({
        var: torch.from_numpy(f[var][time_idx])
        for var in self.variables
    })
else:
    # Per-variable: stack on the fly
    tdict = TensorDict({
        key: torch.from_numpy(
            np.stack([f[var][time_idx] for var in var_list], axis=0)
        )
        for key, var_list in self.variables.items()
    })

A few things that matter in practice:

File handle caching. Opening an HDF5 file has non-trivial overhead: the library reads superblock metadata, validates checksums, sets up internal state. Opening and closing on every __getitem__ call is a disaster at high worker counts. There’s also a subtler issue: PyTorch’s multiprocess DataLoader forks worker processes, so any file handle opened before the fork gets shared and corrupted across workers. Opening handles lazily after fork (on the first __getitem__ call per worker) avoids this entirely. Yuxin Wu’s deep dive on DataLoader RAM usage explains the forking semantics in detail. Each worker keeps a dict of open handles keyed by file index:

def _get_handle(self, file_idx: int) -> h5py.File:
    if file_idx not in self._file_handles:
        self._file_handles[file_idx] = h5py.File(
            self.files[file_idx].decode("UTF-8"),
            mode="r",
            libver="latest",  # enables SWMR for concurrent readers
        )
    return self._file_handles[file_idx]

libver="latest" enables SWMR, which allows concurrent read-only access from multiple processes without explicit locking.

Storing file paths as bytes. A small trick from the same Yuxin Wu post: self.files = np.array(self.files, dtype=np.bytes_). Storing paths as a NumPy array of byte strings rather than a Python list reduces memory usage when indexing across thousands of files, since the array doesn’t hold Python object overhead per entry.

Compression via hdf5plugin. Simply importing hdf5plugin at the top of the file registers Blosc, LZ4, and other fast compressors with h5py. If your HDF5 files were written with these filters, the import is all you need to read them.

Timestamp-based index building. Rather than assuming files map 1:1 to training examples, _build_index reads the /time dataset from each file once at init and builds a flat (file_idx, time_idx, unix_timestamp) array. This handles:

Multiple files per year (e.g. one per month)
Single files spanning the full dataset
CF-convention time axes ("hours since 1900-01-01") via cftime
Filtering by train/val/test split at dataset init time

None of these are clever tricks, mostly just avoiding the obvious mistakes. In practice, moving away from xarray solved outright OOM errors first, then the remaining optimizations brought GPU utilization from ~60% up to ~95% on H100s. The result is a dataloader that keeps the GPU fed without drama.

TL;DR

xarray is great for exploration, slow for training loops, and if you’re lucky it won’t OOM either
Zarr is the modern answer but generates many small files, bad on Lustre/GPFS HPC clusters
HDF5 + h5py is boring, reliable, and fast if you manage file handles correctly
The “right” format depends on your storage backend, preprocessing pipeline, and institutional constraints as much as it does on benchmarks
Cache file handles, build a timestamp index at init, store file paths as numpy bytes, and let hdf5plugin handle compression transparently

The obvious starting point: xarray

Storage formats, compared

Why the right answer depends on where you are

The custom dataloader

Further reading

TL;DR