Observation Management System

Introduction

The purpose of this project is to provide a management system for observations generated by any of the CEH internal sensor networks, or by other observation-generating processes such as sampling, chemical and biological analysis, or model output. The goals that fall under this main purpose include:

storing observations with the semantic data necessary to support the OGC O&M standard
real-time quality-control (QC) checks to generate qualitative and quantitative meta-data regarding the quality and uncertainty for every observation
real-time model execution to create derived data and forecasts as the observations necessary for input arrive
real-time alerts and warnings driven by observations, model output, and forecasts, using pre-defined and adaptive criteria

This system, while designed with the O&M standard in mind, will not provide the functionality necessary to support SOS calls for observation data. A catalogue and higher-level software will handle that side of things and wrap access to this system.

More information on the different areas of this project can be found in their respective documentation, listed here:

Technologies

There are three main technologies this project builds upon:

Apache Kafka
Apache Flink
Apache Cassandra

Kafka is the message-queue software used to logically store data between processing bolts and before entry into the database. Cassandra is the persistent storage for both the observation data and the processed data. Flink is the processing framework, chosen over Apache Storm and Apache Spark because of the capabilities needed (best summed up here). While at present Spark appears to have better distributed ML libraries, there are many third-party Scala libraries that can make up this deficit. For the aggregation from multiple networks and potential two-way communication, Apache NiFi appears to be the best choice.

Related Software Not Used

Prometheus
Graphite (and the more relevant Cyanite) + Grafana
InfluxDB
openTSDB
KairosDB
ElasticSearch + Kibana
OpenMCT

Types of Data

Raw Observation Data

Raw observation data, in the context of this system, includes sensor data, data generated by abstract procedures such as chemical analysis of a sample, manual measurements and samples, and data of a similar nature. It also includes derived observations generated outside of this system. For example, the HOBO temperature and relative humidity sensor on the Lake Observation Platforms generates observations for dew point temperature, which is derived from the sensed temperature and relative humidity observations. As this is not generated within the management system, it is classed as raw observation data and not derived data.
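To illustrate what "derived outside of this system" means in the dew point example, the following is a minimal sketch of one common way to derive dew point from temperature and relative humidity. It uses the Magnus approximation, which is an assumption made here for illustration; the formula the HOBO sensor actually applies is not stated in this document. The function name and constants are likewise illustrative.

```python
import math

# Magnus approximation coefficients over water (assumed for illustration;
# not taken from the sensor's documentation).
MAGNUS_B = 17.62
MAGNUS_C = 243.12  # degrees Celsius


def dew_point_celsius(temperature_c, relative_humidity_pct):
    """Estimate dew point (deg C) from air temperature (deg C) and RH (%)."""
    if not 0.0 < relative_humidity_pct <= 100.0:
        raise ValueError("relative humidity must be in the range (0, 100]")
    gamma = (math.log(relative_humidity_pct / 100.0)
             + (MAGNUS_B * temperature_c) / (MAGNUS_C + temperature_c))
    return (MAGNUS_C * gamma) / (MAGNUS_B - gamma)


if __name__ == "__main__":
    print(round(dew_point_celsius(20.0, 50.0), 2))
```

Because a calculation of this kind happens on the sensor, the management system only ever sees the resulting dew point value and stores it alongside the other raw observations.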

Derived Data

Derived data in this context is any observation or data generated by the management system. This can take the form of derived observation data, such as the thermocline depth observation, which is generated from observations sensed by the stratified PRT chain. It can also take the form of process output such as QC checks, forecasts, and the aggregation of observation data to hourly and daily mean observations.
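As a rough illustration of a derived observation, the sketch below estimates a thermocline depth from one temperature profile. The method (midpoint of the adjacent sensor pair with the steepest temperature drop per metre) is an assumption chosen for simplicity, not the algorithm used by this project, and the function name and sample values are hypothetical.

```python
def thermocline_depth(depths_m, temps_c):
    """Return the midpoint depth (m) of the steepest temperature decrease.

    depths_m must be strictly increasing; temps_c holds the temperature
    sensed at each depth. Returns None if temperature never decreases
    with depth (for example, a fully mixed water column).
    """
    if len(depths_m) != len(temps_c) or len(depths_m) < 2:
        raise ValueError("need two or more matching depth/temperature pairs")

    best_gradient = 0.0
    best_depth = None
    for i in range(len(depths_m) - 1):
        dz = depths_m[i + 1] - depths_m[i]
        if dz <= 0:
            raise ValueError("depths must be strictly increasing")
        # Temperature drop per metre between adjacent sensors.
        gradient = (temps_c[i] - temps_c[i + 1]) / dz
        if gradient > best_gradient:
            best_gradient = gradient
            best_depth = (depths_m[i] + depths_m[i + 1]) / 2.0
    return best_depth


if __name__ == "__main__":
    depths = [1.0, 2.0, 4.0, 7.0, 10.0, 12.0]
    temps = [18.0, 17.8, 17.5, 12.0, 9.0, 8.5]
    print(thermocline_depth(depths, temps))
```

The output is a new observation that exists only because the management system computed it, which is what places it in the derived category.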

The distinction between observation data and derived data is important in the rationale behind the persistence and backup choices on different Kafka queues. A distinction is also made between short-lived and long-lived derived data, where short-lived data has a TTL (time-to-live) value set and long-lived data is held indefinitely.

Long-Lived

Derived data products such as the hourly and daily observation aggregates, and their extended interpolated representations, are examples of long-lived derived data. They are of use to users who wish to work at a higher temporal aggregation than the raw observations allow, or who need a full (interpolated) series of observations rather than the original, which may have missing values. The QC check observations are another example of long-lived derived data. These observations are of interest for analysing potential issues with a sensor, and allow users of the data to better understand the context of an observation.
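The hourly aggregate and its interpolated counterpart can be sketched as follows. This is a simplified, in-memory illustration only: the function names are hypothetical, linear interpolation is an assumed gap-filling method, and in the real system this work would be done within the stream processing framework rather than on Python lists.

```python
from collections import defaultdict
from datetime import datetime, timedelta

ONE_HOUR = timedelta(hours=1)


def hourly_means(observations):
    """Average (timestamp, value) pairs into buckets keyed by hour start."""
    buckets = defaultdict(list)
    for timestamp, value in observations:
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(value)
    return {hour: sum(values) / len(values)
            for hour, values in sorted(buckets.items())}


def interpolate_gaps(hourly):
    """Fill hours missing between known hours by linear interpolation."""
    hours = sorted(hourly)
    filled = {}
    for start, end in zip(hours, hours[1:]):
        filled[start] = hourly[start]
        steps = int((end - start) / ONE_HOUR)
        for k in range(1, steps):
            fraction = k / steps
            filled[start + k * ONE_HOUR] = (
                hourly[start] + fraction * (hourly[end] - hourly[start]))
    if hours:
        filled[hours[-1]] = hourly[hours[-1]]
    return filled


if __name__ == "__main__":
    raw = [
        (datetime(2024, 1, 1, 10, 5), 10.0),
        (datetime(2024, 1, 1, 10, 35), 12.0),
        (datetime(2024, 1, 1, 12, 10), 15.0),  # no observations at 11:xx
    ]
    for hour, mean in interpolate_gaps(hourly_means(raw)).items():
        print(hour.isoformat(), mean)
```

Keeping the aggregate and the interpolated series as separate products lets users choose between values backed by real observations and a gap-free series.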

Short-Lived

Short-lived data refers to derived data which has a short time-frame of interest, such as forecasts or certain model outputs. For example, a forecast generated on a Monday for the following Tuesday to Friday becomes less interesting by the Saturday, and the need to keep the output past the period of interest becomes questionable when it can be reconstructed at will. If there are any criteria or checks applied to the forecast, it is conceivable that their results may be worth keeping. For short-lived data a TTL value is set within Cassandra.
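The Monday-to-Friday forecast example can be turned into a TTL calculation along these lines. This is a sketch under stated assumptions: the table and column names (forecast, feature, valid_time, value) are hypothetical and not taken from the project's schema, and the only Cassandra feature relied on is the CQL `USING TTL <seconds>` clause on `INSERT`.

```python
from datetime import datetime, timezone


def ttl_seconds(now, interest_ends):
    """Seconds from now until the end of the period of interest."""
    remaining = int((interest_ends - now).total_seconds())
    if remaining <= 0:
        raise ValueError("the period of interest has already ended")
    return remaining


def forecast_insert_cql(ttl):
    """Build a CQL INSERT that expires the row after ttl seconds.

    The table and column names are hypothetical placeholders.
    """
    return ("INSERT INTO forecast (feature, valid_time, value) "
            "VALUES (?, ?, ?) USING TTL {:d}".format(ttl))


if __name__ == "__main__":
    # Forecast generated on Monday, of interest until the end of Friday.
    generated = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)     # Monday
    interest_ends = datetime(2024, 1, 6, 0, 0, tzinfo=timezone.utc)  # Saturday 00:00
    print(forecast_insert_cql(ttl_seconds(generated, interest_ends)))
```

Deriving the TTL from the end of the period of interest, rather than using a fixed value, means each forecast row expires when it stops being useful regardless of when it was written.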

Data Flow

TBC.

Semantic Annotation, Data Persistence

TBC.

QC

TBC.

Aggregation

TBC.
