Technical blog by the Centre for Environmental Data Analysis (CEDA). These posts are written by members of the development and data management teams about work being undertaken in CEDA. Projects described here may be experimental and unfinished.
Managing data on cloud platforms can be an extremely difficult task. The need to store data efficiently, both in terms of energy and monetary cost, dictates the existence of tiered storage for data written and read at different frequencies. Avoiding a complicated user interface is therefore a great challenge, as in most cases the user is tasked with interacting with the different storage media directly, having to learn APIs and complex tape semantics to be able to effectively manage their data.
Tags: NLDS, storage, rabbit, fastapi
The STAC specification nests Assets (downloadable objects) within an Item (a group of related Assets). In the CEDA archive, an Item could have thousands of Assets. How do we represent and retrieve Items with large Asset counts?
Enabling paginated Asset lists, Item sub-setting and cross-Item Asset search.
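As a rough illustration of what a paginated Asset list could look like, the sketch below slices the `assets` mapping of a STAC Item into fixed-size pages. It is not the implementation described in the post: the function name `paginate_assets`, the shape of the returned page, and the example Item with its `example.org` URLs are all assumptions made for illustration.

```python
from typing import Any


def paginate_assets(item: dict[str, Any], page: int = 1, page_size: int = 100) -> dict[str, Any]:
    """Return one page of an Item's assets plus simple paging metadata (hypothetical helper)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    assets = item.get("assets", {})
    # Sort the asset keys so that page boundaries are stable between requests.
    keys = sorted(assets)
    start = (page - 1) * page_size
    page_keys = keys[start:start + page_size]
    return {
        "item_id": item["id"],
        "page": page,
        "total_assets": len(keys),
        "assets": {key: assets[key] for key in page_keys},
        "has_next": start + page_size < len(keys),
    }


# Hypothetical Item holding 2500 assets; identifiers and URLs are placeholders.
item = {
    "type": "Feature",
    "id": "example-item",
    "assets": {
        f"file_{i:04d}.nc": {"href": f"https://example.org/data/file_{i:04d}.nc"}
        for i in range(2500)
    },
}

first = paginate_assets(item, page=1, page_size=1000)
last = paginate_assets(item, page=3, page_size=1000)
```

Here `first` holds 1000 assets and reports a further page, while `last` holds the remaining 500. Sorting the keys is one simple way to keep pages deterministic; a real service would more likely page inside the search index than in memory.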
CEDA and JASMIN are facing two interrelated challenges regarding both Archive and user data: the data itself is growing rapidly year on year, and the introduction of new storage technology may require a change in workflows for users. Several software packages already exist to exploit this new storage technology, some of which involve re-writing the data into a new file format, which we believe is not suitable for archiving data. This article will present a data format, developed in conjunction with CEDA and NCAS CMS, that we believe can exploit the new storage technology and remain suitable as an archival data format.
Tags: archive, data, netcdf, CFA
The Climate Change Initiative (CCI) project’s goal is to provide open, registration-free access to essential climate variables (ECVs). CEDA runs the open data portal, a suite of services that provide access to the CCI datasets held in the CEDA Archive, including download and metadata services. Dataset usage is an important metric for understanding uptake of the different datasets. However, because users are not required to register, it is difficult to identify distinct users. Recent changes in access patterns have led to spurious user counts under the assumption that one IP address equals one user. This article looks at methods to determine “normal” thresholds to reduce the impact of the different access patterns on our usage statistics.
Tags: CCI, download stats
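To make the idea of a "normal" threshold concrete, here is a minimal sketch that counts requests per IP address and flags addresses whose count sits far above the median. The summary does not say which method the article uses, so the median-plus-MAD rule, the function name `flag_heavy_ips`, the multiplier `k` and the sample addresses are all assumptions.

```python
from collections import Counter
from statistics import median


def flag_heavy_ips(ips: list[str], k: float = 5.0) -> tuple[float, set[str]]:
    """Return (threshold, flagged IPs) using a median + k * MAD rule (illustrative only)."""
    counts = Counter(ips)
    values = list(counts.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that a few huge counts cannot inflate.
    mad = median([abs(v - med) for v in values])
    # Floor the MAD at 1 so the threshold never collapses onto the median.
    threshold = med + k * max(mad, 1.0)
    heavy = {ip for ip, n in counts.items() if n > threshold}
    return threshold, heavy


# Hypothetical access log: three light users and one bulk downloader.
log = (
    ["192.0.2.1"] * 3
    + ["192.0.2.2"] * 4
    + ["192.0.2.3"] * 2
    + ["192.0.2.4"] * 500
)

threshold, heavy = flag_heavy_ips(log)
print(threshold, heavy)  # 8.5 {'192.0.2.4'}
```

Flagged addresses could then be capped or reported separately, rather than each being counted as one ordinary user.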
We have been looking for a flexible, scalable standard that would allow us to expose the bulk of the CEDA archive via faceted search. This could then be used to build user interfaces and enhance search services at CEDA. Here, we consider the feasibility and suitability of STAC and discuss progress on an Elasticsearch-based implementation.
Tags: search, indexing, stac