Automated ingestion of structured climate data via REST APIs.
Scheduled jobs to fetch updated datasets daily or hourly
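A minimal sketch of one scheduled fetch job. The endpoint, base URL, and parameter names here are hypothetical placeholders, not the real climate API:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint -- the real API base URL and parameters will differ.
BASE_URL = "https://api.example.org/v1/climate"

def build_request_url(dataset: str, since: str) -> str:
    """Build the query URL for one incremental fetch (since = ISO 8601 date)."""
    params = urllib.parse.urlencode({"dataset": dataset, "updated_since": since})
    return f"{BASE_URL}?{params}"

def fetch_updates(dataset: str, since: str) -> list:
    """Fetch records updated since the given date; invoked by the scheduler."""
    with urllib.request.urlopen(build_request_url(dataset, since)) as resp:
        return json.load(resp)
```

A scheduler (cron, Airflow, or similar) would call `fetch_updates` on the daily or hourly cadence, passing the watermark of the last successful run.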
NASA Worldview:
Satellite imagery data ingested via public APIs and SDKs.
High-resolution images fetched directly into object storage (e.g., AWS S3)
NOAA NCEI:
Direct database access and API endpoints for structured and semi-structured data (note the API was under maintenance during this assessment).
Web scraping for additional unstructured data if APIs lack specific endpoints.
2. Data Processing & Cleaning
Workflow for validating, cleaning, and transforming the data:
Validation:
Schema validation:
JSON Schema validation for API payloads
XML Schema (XSD) validation for NOAA datasets
Check for missing values, duplicate records, and inconsistencies.
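The validation checks above can be sketched as small record-level helpers. The field names and types are illustrative assumptions; for full JSON Schema validation, a library such as `jsonschema` would be used instead of this hand-rolled check:

```python
# Illustrative required fields -- the real schema comes from the data sources.
REQUIRED = {"station_id": str, "timestamp": str, "temperature": float}

def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, ftype in REQUIRED.items():
        if field not in record or record[field] is None:
            problems.append(f"missing {field}")
        elif not isinstance(record[field], ftype):
            problems.append(f"bad type for {field}")
    return problems

def deduplicate(records: list) -> list:
    """Drop duplicate records, keyed on (station_id, timestamp)."""
    seen, unique = set(), []
    for r in records:
        key = (r.get("station_id"), r.get("timestamp"))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique
```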
Filtering:
Removal of unneeded data (e.g., older data, records outside the target range or location)
Alternatively, this filtering can be pushed down to the storage step by excluding data on write
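The filtering step can be sketched as a simple predicate over each record. The date cut-off and bounding box below are illustrative placeholders; the real values depend on product requirements:

```python
from datetime import date

# Illustrative cut-offs -- the real retention window and region of interest
# depend on requirements.
MIN_DATE = date(2000, 1, 1)
BBOX = (-180.0, -90.0, 180.0, 90.0)  # (min_lon, min_lat, max_lon, max_lat)

def keep(record: dict) -> bool:
    """True if the record is recent enough and inside the target bounding box."""
    min_lon, min_lat, max_lon, max_lat = BBOX
    return (
        record["date"] >= MIN_DATE
        and min_lon <= record["lon"] <= max_lon
        and min_lat <= record["lat"] <= max_lat
    )

def filter_records(records: list) -> list:
    return [r for r in records if keep(r)]
```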
Cleaning:
Removal of outliers using statistical methods (when outliers are not of interest)
Data interpolation for missing fields.
Image processing to remove noise (where feasible)
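The interpolation bullet can be sketched as a linear gap-filler for a time series of readings; this is a minimal illustration, not the only interpolation strategy:

```python
def interpolate_gaps(values):
    """Linearly fill interior None gaps; edge gaps copy the nearest known value."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        raise ValueError("no known values to interpolate from")
    for i in range(len(filled)):
        if filled[i] is None:
            prev = max((k for k in known if k < i), default=None)
            nxt = min((k for k in known if k > i), default=None)
            if prev is None:
                filled[i] = filled[nxt]       # leading gap: copy forward
            elif nxt is None:
                filled[i] = filled[prev]      # trailing gap: copy backward
            else:
                frac = (i - prev) / (nxt - prev)
                filled[i] = filled[prev] + frac * (filled[nxt] - filled[prev])
    return filled
```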
Transformation:
Convert all date/timestamps into ISO 8601 for uniformity.
Spatial data transformations into standard CRS (Coordinate Reference System) like WGS 84.
Pre-aggregate metrics (e.g., daily averages from hourly data).
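The timestamp normalization and pre-aggregation steps above can be sketched as follows (the hourly input format is an assumption for illustration):

```python
from collections import defaultdict
from datetime import datetime, timezone

def to_iso8601(epoch_seconds: float) -> str:
    """Normalize a Unix epoch timestamp to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()

def daily_averages(hourly: list) -> dict:
    """Pre-aggregate hourly (iso_timestamp, value) pairs into daily means."""
    buckets = defaultdict(list)
    for ts, value in hourly:
        buckets[ts[:10]].append(value)  # "YYYY-MM-DD" prefix of ISO 8601
    return {day: sum(v) / len(v) for day, v in buckets.items()}
```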
3. Data Storage
3.a. Structured data
Most structured data can be stored in Postgres:
Postgres is a general-purpose database with ACID guarantees
The PostGIS extension allows efficient storage and querying of geospatial data
Usage of spatial indexes (e.g., GiST) for geo data
Using geometry, geography, or raster types where appropriate
JSON (and JSONB) columns allow storing semi-structured, schema-less data
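As a sketch of the PostGIS setup above, the DDL and a parameterized spatial query might look like this. The table and column names are illustrative assumptions, not a prescribed schema; the strings are held in Python for execution with a driver such as psycopg:

```python
# Hypothetical schema -- table and column names are illustrative only.
CREATE_OBSERVATIONS = """
CREATE TABLE observations (
    id          bigserial PRIMARY KEY,
    station_id  text NOT NULL,
    observed_at timestamptz NOT NULL,      -- normalized to UTC upstream
    temperature double precision,
    geom        geometry(Point, 4326)      -- WGS 84 point, via PostGIS
);
CREATE INDEX observations_geom_idx ON observations USING GIST (geom);
"""

# Parameterized bounding-box query: pass (min_lon, min_lat, max_lon, max_lat)
# to a driver, e.g. cur.execute(BBOX_QUERY, bbox).
BBOX_QUERY = """
SELECT station_id, observed_at, temperature
FROM observations
WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326);
"""
```

The GiST index lets the `&&` bounding-box filter use the spatial index instead of scanning the whole table.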
If, for performance or scale reasons, Postgres cannot sustain the load, alternatives can be considered in addition (a hybrid approach):
Using a cache like Redis to cache expensive query results or pre-building responses ahead of time
For aggregations, a columnar database like Clickhouse
NoSQL databases like ScyllaDB scale further than Postgres, especially when distributed across multiple nodes
3.b. Unstructured data
Uploaded to object storage (e.g., AWS S3) using a hierarchical key structure (e.g., region/year/month/day). This covers satellite imagery and other unstructured assets.
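The hierarchical key structure can be sketched as a small key-builder; the exact layout (region first, zero-padded dates) is one reasonable convention, not a requirement:

```python
from datetime import date

def image_key(region: str, day: date, granule: str) -> str:
    """Build a hierarchical object key: region/year/month/day/granule."""
    return f"{region}/{day.year:04d}/{day.month:02d}/{day.day:02d}/{granule}"
```

Zero-padding months and days keeps keys lexicographically sortable, which makes prefix listing by date range straightforward.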
4. Data Retrieval / API Gateway
This layer queries and retrieves data so that it can be exposed to end-user applications.
4.a. API implementation
Use a REST or GraphQL API depending on team experience and expectations (both are appropriate for the use cases but present different trade-offs).
4.b. Security
Role-based access control (RBAC)
Rate limiting to prevent abuse
Only serve authenticated users if possible
Encryption in transit (HTTPS, which should be a given in 2024)
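The rate-limiting bullet can be sketched as a per-client token bucket. Capacity and refill rate are illustrative; in practice this usually lives at the gateway level (e.g., nginx or a managed API gateway) rather than in application code:

```python
import time

class TokenBucket:
    """Per-client token bucket: allow() returns False once the budget is spent."""

    def __init__(self, capacity: int, refill_per_sec: float, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.now = now            # injectable clock, handy for testing
        self.last = now()

    def allow(self) -> bool:
        t = self.now()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.refill_per_sec)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

One bucket would be kept per API key or client IP; requests that fail `allow()` get an HTTP 429.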
4.c. Performance
Only serve the data needed (avoid over-fetching)
Caching when possible and needed
Expose aggregated data instead of aggregating in the frontend
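The caching bullet can be sketched as a small TTL decorator for expensive query results; this stands in for what a dedicated cache like Redis would do at scale, and the 60-second TTL is an arbitrary example:

```python
import time

def ttl_cache(ttl_seconds: float, clock=time.monotonic):
    """Cache a function's results for ttl_seconds (sketch of response caching)."""
    def decorator(fn):
        store = {}  # args -> (value, cached_at)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and now - hit[1] < ttl_seconds:
                return hit[0]          # fresh cache hit
            value = fn(*args)          # miss or expired: recompute
            store[args] = (value, now)
            return value
        return wrapper
    return decorator
```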
4.d. Scalability and Availability
Autoscaling or similar (for example in AWS or Kubernetes)
If possible, always deploy across multiple locations (e.g., multi-AZ on AWS)
5. End-User Frontend
React frontend, deployed statically (no server-side rendering)
Can be deployed on Cloudflare Pages, Netlify, AWS S3 + CloudFront, etc.
Mapbox or similar to show geo-data on a map
Data export can be implemented by creating a dump on S3 and providing a pre-signed URL to the user.