master's thesis

Deep Learning for Data Imputation in Oceanography

Remote sensing provides essential data for monitoring ocean color and phytoplankton, which are important indicators of marine ecosystem health. However, missing data is a common issue in these observations, and addressing it is necessary to gain a complete understanding of ocean dynamics.

I explored how a transformer-based model can impute missing values in variables such as sea surface temperature, chlorophyll-a, and phytoplankton size classes. The work focused on a high-resolution (1/24°) dataset over the Gulf Stream region in which, on average, 80% of the data are missing.

Overview of the training procedure. During training, a subset of the original spatiotemporal sequence is masked, both temporally, by removing entire days, and spatially, by masking regions that contain observations. Missing regions are shown in gray. Each sequence is processed in two views, patches and visual tokens, which are then fed into the Transformer model. The model learns to impute missing regions by reconstructing the original sequence from its masked version.
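The masking step in the caption above can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the function name `mask_sequence`, the square patch size, the masking ratio, and the use of NaN to mark missing pixels are all assumptions made for illustration.

```python
import numpy as np

def mask_sequence(seq, observed, n_days_masked=1, patch=4, patch_ratio=0.25, seed=0):
    """Hide part of a (T, H, W) sequence (hypothetical helper, illustration only).

    observed:   boolean array, True where a real observation exists.
    Returns (masked_seq, train_mask), where train_mask is True on the
    observed pixels that were hidden and can serve as training targets.
    """
    rng = np.random.default_rng(seed)
    T, H, W = seq.shape
    hide = np.zeros((T, H, W), dtype=bool)

    # Temporal masking: remove entire days.
    days = rng.choice(T, size=n_days_masked, replace=False)
    hide[days] = True

    # Spatial masking: hide square patches that contain at least one observation.
    for t in range(T):
        for i in range(0, H, patch):
            for j in range(0, W, patch):
                block = observed[t, i:i + patch, j:j + patch]
                if block.any() and rng.random() < patch_ratio:
                    hide[t, i:i + patch, j:j + patch] = True

    # Only pixels that were actually observed have a known target value.
    train_mask = hide & observed
    masked_seq = np.where(train_mask, np.nan, seq)
    return masked_seq, train_mask

# Toy sequence: 5 days of 8x8 fields with roughly 80% of pixels missing.
rng = np.random.default_rng(1)
seq = rng.normal(size=(5, 8, 8))
observed = rng.random((5, 8, 8)) > 0.8
seq[~observed] = np.nan
masked, train_mask = mask_sequence(seq, observed)
```

Restricting `train_mask` to observed pixels reflects the fact that a reconstruction loss can only be computed where the original value is known.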

Because no ground truth exists for most missing regions, the imputation task is inherently ill-posed, unlike data assimilation, which integrates observations with a known dynamical model. To evaluate the method, I simulated missing-data patterns on sea surface temperature fields for which ground truth is available, which allowed controlled experiments to assess reconstruction quality. The model uses self-attention to capture spatial, temporal, and multivariate correlations in 3D oceanographic data, which makes the approach a promising tool for oceanographic research.

(Delefosse, 2024)
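The controlled evaluation described above amounts to scoring a reconstruction only on pixels that were hidden artificially and whose true value is known. The sketch below uses root-mean-square error for this. The metric and the function name `masked_rmse` are assumptions for illustration; the text above does not state which metrics the thesis reports.

```python
import numpy as np

def masked_rmse(truth, reconstruction, eval_mask):
    """RMSE restricted to the artificially hidden pixels (eval_mask is True)."""
    diff = reconstruction[eval_mask] - truth[eval_mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy sea surface temperature field (degrees Celsius) with two hidden pixels.
truth = np.array([[20.0, 21.0], [22.0, 23.0]])
eval_mask = np.array([[True, False], [False, True]])
# Values outside eval_mask are ignored, so they can be arbitrary here.
reconstruction = np.array([[20.0, 0.0], [0.0, 25.0]])

score = masked_rmse(truth, reconstruction, eval_mask)
```

In this toy case the errors on the two hidden pixels are 0 and 2, so the score is the square root of 2.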

References

2024

  1. Master’s thesis
    Generative Deep Learning Models for Data Imputation in Oceanography
    Aymeric Delefosse
    2024