Climate Data System Architecture

Overview

[Architecture diagram. Components and their labels:
  Data Sources: APIs, Web Scraping, Batch Downloads
  Data Ingestion Workers: Validate, filter, clean, transform
  Data Storage: PostgreSQL + PostGIS (Relational + Spatial Database), Object Storage (Imagery and assets)
  Data Retrieval Layer: API Gateway (Geospatial Query, Raw Data Export)
  End User Frontend: Map Visualization, Data Export]

Components

1. Data Ingestion

Data is collected from the following sources:

2. Data Processing & Cleaning

Workflow for validating, filtering, cleaning, and transforming data:

  1. Validation:

    • Schema validation:
      • JSON schema for APIs
      • XML schema for NOAA
    • Check for missing values, duplicate records, and inconsistencies.
  2. Filtering:

    • Removal of unneeded data (e.g., older data, or data outside the required value range or location).
    • Alternatively, this can be done by excluding the data at the storage step.
  3. Cleaning:

    • Removal of outliers using statistical models (if outliers should be excluded).
    • Data interpolation for missing fields.
    • Image processing to remove noise (where feasible).
  4. Transformation:

    • Convert all dates and timestamps to ISO 8601 for uniformity.
    • Spatial data transformations into standard CRS (Coordinate Reference System) like WGS 84.
    • Pre-aggregate metrics (e.g., daily averages from hourly data).
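The workflow above could be sketched roughly as follows. This is a minimal illustration rather than the actual worker code: the record fields (station_id, timestamp, temperature_c), the assumption that timestamps arrive as Unix epoch seconds, and the function names are all hypothetical. It covers only the missing-value and duplicate checks, the ISO 8601 conversion, and the daily pre-aggregation.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

# Hypothetical schema: field names are assumptions, not taken from the design.
REQUIRED_FIELDS = ("station_id", "timestamp", "temperature_c")


def validate(records):
    """Drop records with missing required fields and duplicate readings."""
    seen = set()
    valid = []
    for rec in records:
        if any(rec.get(field) is None for field in REQUIRED_FIELDS):
            continue  # missing value
        key = (rec["station_id"], rec["timestamp"])
        if key in seen:
            continue  # duplicate (same station, same timestamp)
        seen.add(key)
        valid.append(rec)
    return valid


def to_iso8601(epoch_seconds):
    """Convert Unix epoch seconds (assumed input format) to ISO 8601 in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def daily_averages(records):
    """Pre-aggregate hourly readings into per-station daily averages."""
    buckets = defaultdict(list)
    for rec in records:
        day = to_iso8601(rec["timestamp"])[:10]  # YYYY-MM-DD prefix
        buckets[(rec["station_id"], day)].append(rec["temperature_c"])
    return {key: mean(values) for key, values in buckets.items()}
```

A worker would call validate() on each incoming batch and pass the result to daily_averages(). Outlier removal, interpolation, and CRS conversion are left out here because they depend on choices the design has not fixed yet.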

3. Data Storage

3.a. Structured data

Most structured data can be stored in Postgres:

If Postgres cannot handle the load for performance or scale reasons, alternatives can be added alongside it (thus adopting a hybrid approach):

3.b. Unstructured data

Satellite imagery and other unstructured assets are uploaded to an object storage service (e.g., AWS S3) using a hierarchical key structure (e.g., region/year/month/day).
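As an illustration of the region/year/month/day layout, a key could be built as in the sketch below. The function name, the zero-padded month and day, and the example region and filename are assumptions made for the example. The upload call itself is omitted because it depends on the storage client that gets chosen.

```python
from datetime import date


def object_key(region: str, captured_on: date, filename: str) -> str:
    """Build a hierarchical object key: region/year/month/day/filename."""
    # Zero-padded month and day keep keys sortable lexicographically.
    return f"{region}/{captured_on:%Y/%m/%d}/{filename}"
```

For example, object_key("europe-west", date(2024, 3, 7), "tile_001.tif") yields "europe-west/2024/03/07/tile_001.tif". Putting the region first means that all assets for one area share a common key prefix.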

4. Data Retrieval / API Gateway

This layer queries and retrieves data so that it can be exposed to end-user applications.

4.a. API implementation

Use either a REST API or a GraphQL API, depending on team experience and expectations. Both are appropriate for the use cases but present different trade-offs.
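To make the REST option concrete, here is a minimal sketch of how a geospatial query endpoint might interpret a bounding-box parameter. The /observations path, the bbox=min_lon,min_lat,max_lon,max_lat convention, and the lon/lat record fields are assumptions for illustration. The in-memory filter stands in for what would more plausibly be a spatial query against PostGIS.

```python
from urllib.parse import parse_qs, urlparse


def parse_bbox(url):
    """Parse ?bbox=min_lon,min_lat,max_lon,max_lat from a request URL."""
    query = parse_qs(urlparse(url).query)
    parts = query.get("bbox", [""])[0].split(",")
    if len(parts) != 4:
        raise ValueError("bbox must have exactly four comma-separated values")
    min_lon, min_lat, max_lon, max_lat = map(float, parts)
    if min_lon > max_lon or min_lat > max_lat:
        raise ValueError("bbox minimums must not exceed maximums")
    return min_lon, min_lat, max_lon, max_lat


def query_observations(url, observations):
    """Return the observations whose coordinates fall inside the bbox."""
    min_lon, min_lat, max_lon, max_lat = parse_bbox(url)
    return [
        obs for obs in observations
        if min_lon <= obs["lon"] <= max_lon and min_lat <= obs["lat"] <= max_lat
    ]
```

A GraphQL variant would take the same four bounds as field arguments instead of a query string, and the filtering logic would stay the same.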

4.b. Security

4.c. Performance

4.d. Scalability and Availability

5. End-User Frontend