CEDA Technical Blog

Technical blog by the Centre for Environmental Data Analysis (CEDA). These posts are written by members of the development and data management teams about work being undertaken in CEDA. Projects described here may be experimental and unfinished.

Posts

Experiments with Smolagents

Smolagents is a simple Python library developed by Hugging Face that enables large language models (LLMs) to take control of workflows through agents. Agents offer a flexible way to handle tasks where traditional programming logic might not suffice, such as responding to complex queries or dynamically managing user interactions.

Tags:AI

Feb 4, 2025 | William Cross

Migrating CEDA Documents Repository to the Zenodo

For many years CEDA has operated the CEDA Document Repository to capture supporting material that aids the long-term usability and understandability of data in the CEDA archives. The service ran on a local deployment of EPrints, with the associated overheads of running and maintaining it. Over the repository's lifetime, however, other generic services have become available, so in 2022 CEDA undertook to migrate over 1300 items from the in-house EPrints service to a CEDA Document Repository Community within CERN’s “Zenodo” service, to enable greater usability and sustainability of a grey literature service supporting CEDA’s community.

Tags:migration CEDA Docs Zenodo

May 25, 2023 | Adrian Dębski, Graham Parton

Experiments with Kerchunk

TL;DR: Kerchunk stores NetCDF chunk information that is used to create virtual xarray datasets. If there are many chunks, there will be many references in the Kerchunk file. The number of references can be reduced by formulating a description of the chunks rather than listing each one. The existing formula syntax is limited, so a new custom syntax is explored here.

Tags:Kerchunk fsspec

May 4, 2023 | Daniel Westwood
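
To illustrate the idea in the Kerchunk summary above, here is a minimal sketch of describing chunks by a rule instead of listing each reference. Kerchunk references map a chunk key to a `[url, offset, length]` triple; everything else here is an assumption made for illustration: the function `expand_regular_chunks`, the variable name `tas`, the URL, and the premise that chunks are contiguous and equally sized (compressed chunks usually are not). It is not the custom syntax explored in the post.

```python
from typing import Dict, List, Union

# A Kerchunk-style reference: [url, byte offset, byte length].
Reference = List[Union[str, int]]


def expand_regular_chunks(
    variable: str, url: str, first_offset: int, chunk_bytes: int, n_chunks: int
) -> Dict[str, Reference]:
    """Expand a compact description of contiguous, equally sized chunks
    into one explicit reference per chunk (hypothetical helper)."""
    return {
        f"{variable}/{i}": [url, first_offset + i * chunk_bytes, chunk_bytes]
        for i in range(n_chunks)
    }


# Five values describe what would otherwise be 1000 listed references.
refs = expand_regular_chunks("tas", "https://example.org/data.nc", 4096, 1024, 1000)
```

The compact form only needs to be stored once; the explicit references can be regenerated on demand when the virtual dataset is opened.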

The Near-line Data Store

Managing data on cloud platforms can be an extremely difficult task. The need to store data efficiently, both in terms of energy and monetary cost, dictates the existence of tiered storage for data written and read at different frequencies. Avoiding a complicated user interface is therefore a great challenge, as in most cases the user is tasked with interacting with the different storage media directly, having to learn APIs and complex tape semantics to be able to effectively manage their data.

Tags:NLDS storage rabbit fastapi

Mar 9, 2022 | Jack Leland

Asset Specification and Asset Search

The STAC specification nests Assets (downloadable objects) within an Item (group of related Assets). In the CEDA archive, an Item could have thousands of Assets. How do we represent and retrieve Items with large Asset counts? This post looks at enabling paginated Asset lists, Item sub-setting and cross-Item Asset search.

Tags:stac search

Feb 4, 2022 | Richard Smith
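
As a rough illustration of the paginated Asset lists mentioned in the summary above, the sketch below slices the `assets` mapping of a STAC-like Item into pages. The function `paginate_assets`, the response fields, and the example Item are hypothetical and are not taken from the post or from any STAC API extension.

```python
from typing import Any, Dict


def paginate_assets(item: Dict[str, Any], page: int, page_size: int) -> Dict[str, Any]:
    """Return one page of an Item's assets (pages are numbered from 1)."""
    keys = sorted(item["assets"])  # stable ordering so pages do not overlap
    start = (page - 1) * page_size
    selected = keys[start:start + page_size]
    return {
        "item_id": item["id"],
        "page": page,
        "total_assets": len(keys),
        "assets": {key: item["assets"][key] for key in selected},
    }


# A made-up Item holding 2500 assets.
item = {
    "id": "example-item",
    "assets": {
        f"file_{i:04d}": {"href": f"https://example.org/file_{i:04d}.nc"}
        for i in range(2500)
    },
}

first_page = paginate_assets(item, page=1, page_size=100)
last_page = paginate_assets(item, page=25, page_size=100)
```

A client can keep requesting pages until it has seen `total_assets` entries, which avoids returning thousands of Assets in a single Item response.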

Climate Forecast Aggregation (CFA) Conventions

CEDA and JASMIN are facing two interrelated challenges regarding both Archive and user data: the data itself is growing rapidly year on year, and the introduction of new storage technology may require a change in workflows for users. Several software packages already exist to exploit this new storage technology, some of which involve re-writing the data into a new file format, which we believe is not suitable for archiving data. This article will present a data format, developed in conjunction with CEDA and NCAS CMS, that we believe can exploit the new storage technology and remain suitable as an archival data format.

Tags:archive data netcdf CFA

Dec 8, 2021 | Neil Massey

Search Futures - Update

In July, we posted our progress so far and gave some intentions for the future. This post looks at the progress since then and showcases a minimum viable product looking at content generation, a web server and client tools.

Tags:search indexing stac

Dec 7, 2021 | Richard Smith

What is a user? Removing anomalous behaviour from Anonymous access logs.

The Climate Change Initiative (CCI) project’s goal is to provide open, registration-free access to essential climate variables (ECVs). CEDA runs the open data portal, a suite of services providing access to the CCI datasets held in the CEDA Archive, including download and metadata services. Dataset usage is an important metric for understanding uptake of the different datasets; however, without requiring users to register, it is difficult to determine distinct users. Recent changes in access patterns have led to spurious user counts when assuming 1 IP = 1 user. This article looks at methods to determine “normal” thresholds to reduce the impact of the different access patterns on our usage statistics.

Tags:CCI download stats

Aug 20, 2021 | Mahir Rahman
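
One simple way to picture a “normal” threshold for the anonymous access logs described above is to compare each IP address's request count with a multiple of the median count. This is only a sketch under stated assumptions: the function `flag_anomalous_ips`, the multiplier of 10, and the sample data are invented for illustration and are not the methods evaluated in the article.

```python
from collections import Counter
from statistics import median
from typing import Iterable, Set, Tuple


def flag_anomalous_ips(ips: Iterable[str], multiplier: float = 10.0) -> Tuple[float, Set[str]]:
    """Flag IPs whose request count exceeds multiplier * median count.

    The rule and the default multiplier are illustrative assumptions.
    """
    counts = Counter(ips)
    threshold = multiplier * median(counts.values())
    flagged = {ip for ip, n in counts.items() if n > threshold}
    return threshold, flagged


# Made-up log: four low-volume IPs and one very active one.
log = ["a"] * 2 + ["b"] * 3 + ["c"] * 2 + ["d"] * 4 + ["e"] * 500
threshold, flagged = flag_anomalous_ips(log)
```

Using the median rather than the mean keeps the threshold from being dragged upwards by the very IPs it is meant to catch.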

Search Futures

We have been looking for a flexible, scalable standard that would allow us to expose the bulk of the CEDA archive via faceted search. This could then be used to build user interfaces and enhance search services at CEDA. Here, we consider the feasibility and suitability of STAC and discuss progress on an Elasticsearch-based implementation.

Tags:search indexing stac

Jul 5, 2021 | Richard Smith