Summary
Redundant data for device reports and validation results can be reduced by storing only the values that have changed since the previously recorded entry. This can be achieved by building and storing an accumulated state of device report or validation result data as each device report or validation result is received. Each additional report or validation result will be compared against the accumulated state, and only the changes between the current state and the new entry will be stored. The state construct also provides quick retrieval of all accumulated report data and validation status for a device.
Implementation details are intentionally omitted, and a high-level example is provided as a walkthrough.
Problem
Each JSON device report is stored in its entirety, and for each report, the results of every validation run on that report are stored. A device sends a report about once a minute while it is running through preflight. Most of this data is redundant and does not change between reports. While each report and its validation results are relatively small, they accumulate to a significant size over time. There is an average of 43 validation results per device report.
To illustrate, the sizes of the 10 largest tables are shown below (queried 2018-02-05). device_report is the table where the reports are stored, and device_validate contains all validation results run for each report.
            relation            | total_size
--------------------------------+------------
 public.device_validate         | 43 GB
 public.device_report           | 7764 MB
 public.device_settings         | 55 MB
 public.device_disk             | 27 MB
 public.device_nic              | 3992 kB
 public.device_neighbor         | 3120 kB
 public.device_nic_state        | 2240 kB
 public.datacenter_rack_layout  | 1296 kB
 public.device                  | 1248 kB
 public.device_relay_connection | 648 kB
This is a significant amount of data and will grow faster as we increase the rate of DC builds.
It is also desirable to quickly retrieve the latest device report data and validation results for an overview of the device status. The current retrievals are slow, partly because of the size of their source tables.
Proposed solution
For both device reports and validation results, two constructs will be used: a 'state' data structure to store all previous, unique data received, and an append-only log to record only the changes between entries. The state data structure and the data written to the log will be different for device reports and validation results.
Each device report is a nested hash structure reported as a JSON object. Likewise, the device report state structure will also be a nested hash. As device reports are received, new and changed values will be added to or replace older values in the nested hash. This will be done depth-first, so a new, deeply nested value in a device report will be correctly identified and added to the state. All unchanged values in the device report will be stripped out.
To think of it another way, if you could "merge" all device reports received over time and newer values take precedence over older values in the merge, the result would be the device report state.
The device report log will store the timestamp at which the report was received and the minimal hash containing only the changed values. If there is no change between a device report and the current state, only the timestamp will be recorded.
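The depth-first comparison described above can be sketched in Python as follows. This is a minimal illustration, not the proposed implementation: the function name diff_and_merge and the nested "disks" field are hypothetical, and the sketch assumes that non-hash values (including lists) are compared and replaced as a whole.

```python
import copy
from typing import Optional


def diff_and_merge(state: dict, report: dict) -> Optional[dict]:
    """Merge `report` into `state` in place and return only what changed.

    Returns None when the report adds nothing, so the caller can write
    a timestamp-only entry to the log. (Hypothetical helper.)
    """
    changes: dict = {}
    for key, new_value in report.items():
        old_value = state.get(key)
        if isinstance(new_value, dict) and isinstance(old_value, dict):
            # Both sides are hashes: recurse so nested changes are found.
            nested = diff_and_merge(old_value, new_value)
            if nested is not None:
                changes[key] = nested
        elif key not in state or old_value != new_value:
            # New key or changed value: newer data takes precedence.
            state[key] = copy.deepcopy(new_value)
            changes[key] = copy.deepcopy(new_value)
    return changes or None


# "disks" is an assumed nested field, used only for illustration.
state: dict = {}
first = diff_and_merge(
    state, {"serial_number": "deadbeef", "disks": {"sda": {"size": 100}}}
)
second = diff_and_merge(
    state,
    {"serial_number": "deadbeef", "disks": {"sda": {"size": 100, "health": "OK"}}},
)
third = diff_and_merge(
    state,
    {"serial_number": "deadbeef", "disks": {"sda": {"size": 100, "health": "OK"}}},
)
```

Here first holds the whole report, second holds only the nested "health" value, and third is None, which corresponds to a timestamp-only log entry.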
The state for validation results will be an associative array of validation results. As validation results are collected, they will be compared to the results in the state. If a result is from a validation not yet in the state, it is added to the state. If a result is from the same validation as a result already stored in the state (identifiable by the name of the validation and device component details, for example), it will replace the older result if its variable values have changed (such as the pass/fail status). If a result is identical to a previous result in the state for the same validation, the older result is kept and the newer result is not stored.
Validation results and a timestamp will be written to a log only if the validation result is unique (from a new validation or changed from a previous result from the same validation).
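The rules for updating the validation state can be sketched as below. This is a sketch under stated assumptions: the state is modeled as a Python dict keyed by the identifying values, the function names are hypothetical, and the choice of validation_name, component_id, and component_type as the identity follows the example later in this document. The component values are placeholders.

```python
from typing import Optional

# Assumed identifying fields; everything else is treated as variable.
IDENTITY_KEYS = ("validation_name", "component_id", "component_type")


def identity(result: dict) -> tuple:
    """Build the key that says which validation a result belongs to."""
    return tuple(result.get(k) for k in IDENTITY_KEYS)


def update_validation_state(state: dict, result: dict) -> Optional[dict]:
    """Apply one result to the state; return it if it must be logged.

    Returns None for a result identical to the one already stored, in
    which case the older result is kept and nothing is written.
    """
    key = identity(result)
    if state.get(key) == result:
        return None
    # Either a new validation or changed variable values: store it.
    state[key] = dict(result)
    return result


def make(name: str, status: str, has_value: Optional[str]) -> dict:
    # Placeholder component details, for illustration only.
    return {"validation_name": name, "status": status, "has_value": has_value,
            "component_id": "c0", "component_type": "BIOS"}


validation_state: dict = {}
logged = [
    update_validation_state(validation_state, make("product_name_check", "pass", "x")),
    update_validation_state(validation_state, make("bios_version_check", "fail", None)),
    update_validation_state(validation_state, make("product_name_check", "pass", "x")),
    update_validation_state(validation_state, make("bios_version_check", "pass", "3.0.0")),
]
```

Only the first, second, and fourth results are returned for logging; the third is a duplicate and leaves the state untouched.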
Example
A walkthrough is presented below for building the state and recording changes for device reports and validation results. JSON structures and pseudo-code are used. Function invocations represent side-effecting sub-routines, such as writing to the database.
Device Report and Device State
The device state for a new device begins empty (or possibly with a predetermined set of values).
A device sends its first device report:
device_report_1 =
{
"serial_number" : "deadbeef",
"product_name" : "Joyent-Compute-Platform-3301"
}
No device report state exists for the device yet. It is created and contains all values in the first report.
device_state =
{
"serial_number" : "deadbeef",
"product_name" : "Joyent-Compute-Platform-3301"
}
write_device_state(device_state, timestamp)
The device report will then be written to the device report log with a timestamp corresponding to when it was stored.
write_device_report_log({
"serial_number" : "deadbeef",
"product_name" : "Joyent-Compute-Platform-3301"
}, timestamp)
The device sends another device report:
device_report_2 =
{
"serial_number" : "deadbeef",
"product_name" : "Joyent-Compute-Platform-3301",
"bios_version" : "2.4.3"
}
The second device report is compared to the device state. The change between the device state and the device report is the addition of the bios_version attribute. The device state is updated:
device_state =
{
"serial_number" : "deadbeef",
"product_name" : "Joyent-Compute-Platform-3301",
"bios_version" : "2.4.3"
}
write_device_state(device_state, timestamp)
and only the new change between the device report and the device state is written to the device report log.
write_device_report_log({
"bios_version" : "2.4.3"
}, timestamp)
Another device report is received.
device_report_3 =
{
"serial_number" : "deadbeef",
"product_name" : "Joyent-Compute-Platform-3301",
"bios_version" : "3.0.0"
}
Compared to the current device state, the keys remain the same, but the value of bios_version has changed. The device state is updated, and the changed value is written to the device report log.
device_state =
{
"serial_number" : "deadbeef",
"product_name" : "Joyent-Compute-Platform-3301",
"bios_version" : "3.0.0"
}
write_device_state(device_state, timestamp)
write_device_report_log({
"bios_version" : "3.0.0"
}, timestamp)
One final report is received:
device_report_4 =
{
"serial_number" : "deadbeef",
"product_name" : "Joyent-Compute-Platform-3301",
"bios_version" : "3.0.0"
}
There is no difference between the device state and the latest device report. The device state is not updated, as there are no new or changed values. We still write to the device report log to record that a device report has been received, but no data other than the timestamp is written.
write_device_report_log( null, timestamp)
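Because each log entry holds only the changes, replaying the log in order should reproduce the device state, and replaying a prefix of it gives the state as of an earlier report. The sketch below illustrates this with the four log entries from the walkthrough. The function names deep_merge and replay_log are hypothetical, the timestamps are placeholders, and None stands in for the timestamp-only entry.

```python
import copy


def deep_merge(state: dict, changes: dict) -> dict:
    """Apply one log entry to `state` in place, descending into nested hashes."""
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(state.get(key), dict):
            deep_merge(state[key], value)
        else:
            state[key] = copy.deepcopy(value)
    return state


def replay_log(log: list) -> dict:
    """Rebuild a device state from (timestamp, changes) log entries."""
    state: dict = {}
    for _timestamp, changes in log:
        if changes is not None:  # timestamp-only entries carry no data
            deep_merge(state, changes)
    return state


# The log produced by the four device reports above ("t1".."t4" are
# placeholder timestamps).
device_report_log = [
    ("t1", {"serial_number": "deadbeef",
            "product_name": "Joyent-Compute-Platform-3301"}),
    ("t2", {"bios_version": "2.4.3"}),
    ("t3", {"bios_version": "3.0.0"}),
    ("t4", None),
]

current_state = replay_log(device_report_log)
state_after_second_report = replay_log(device_report_log[:2])
```

With this log, current_state matches the final device_state shown above, while state_after_second_report still carries bios_version 2.4.3.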
Validation Results
A similar process will exist for recording validation results. To begin, the validation state for a device is empty.
When validations are run and the results collected, they will be compared to the current state.
validation_1 = {
"status" : "pass",
"validation_name" : "product_name_check",
"want_value" : "Joyent-Compute-Platform-3301",
"has_value" : "Joyent-Compute-Platform-3301",
"component_id" : "...",
"component_type" : "..."
}
If no validation result with the same values exists in the validation state, the result is added to the state
validation_state = [ validation_1 ]
write_validation_state(validation_state, timestamp)
and written to the validation result log
write_validation_result_log( validation_1, timestamp)
New validation results are added to the validation state and stored.
validation_2 = {
"status" : "fail",
"validation_name" : "bios_version_check",
"want_value" : "3.0.0",
"has_value" : null,
"component_id" : "...",
"component_type" : "..."
}
validation_state = [ validation_1, validation_2 ]
write_validation_result_log(validation_2, timestamp)
write_validation_state(validation_state, timestamp)
Validation results whose values match a result already stored in the state are not stored. Unlike device reports, nothing is recorded for such a duplicate result, not even a timestamp for when it was received.
# Matches values in 'validation_1'
validation_3 = {
"status" : "pass",
"validation_name" : "product_name_check",
"want_value" : "Joyent-Compute-Platform-3301",
"has_value" : "Joyent-Compute-Platform-3301",
"component_id" : "...",
"component_type" : "..."
}
# The validation state remains unchanged.
validation_state = [ validation_1, validation_2 ]
Validation results in the state that have matching identifying values (in this example, validation_name, component_id, component_type) will be replaced by newer validation results with different variable values (status, has_value).
validation_4 = {
"status" : "pass",
"validation_name" : "bios_version_check",
"want_value" : "3.0.0",
"has_value" : "3.0.0",
"component_id" : "...",
"component_type" : "..."
}
# validation_4 replaces validation_2
validation_state = [ validation_1, validation_4 ]
write_validation_state(validation_state, timestamp)
write_validation_result_log(validation_4, timestamp)
To retrieve the current validation status of a device and the details of its results, the state is retrieved and then all validation results it references are looked up.
Previous data
A script can be written to process the existing device reports and validation results to construct a device state and create the device report and validation result logs. Afterwards, the old data will be dropped. This will be tested extensively using copies of production data to guard against data loss.
Discussion
Implementation details were purposely omitted to keep the discussion at a high level. If this is agreed to be a reasonable proposal, implementation details such as database schema will be detailed in an OPS-RFD (either OPS-RFD 22 or another).
It is possible that this system can still store a large amount of data per device if changes in device reports or validation results are frequent. Device report data is mostly static. Validation results are derived from device report data, so they too will be mostly static and unchanged. The one exception is temperature data, which is reported in the device report and is volatile. For each temperature reading, there are validations to verify the temperature is within an acceptable operating range. As the system is described, every change in temperature, even one of 0.1 degree, will be stored in the device report log, and the results of the temperature validations will be stored as well. This may be irrelevant information. Instead, we might store temperature data only when it changes by a certain delta threshold (say, by 5 degrees from the current state). This is an open question for discussion.
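As one possible reading of the threshold idea, the sketch below records a temperature reading only when it differs from the last recorded value by at least the threshold. The function name, the 5-degree default, and the sample readings are all hypothetical, and whether the comparison should be made against the last stored value is itself part of the open question.

```python
def readings_to_record(readings: list, threshold: float = 5.0) -> list:
    """Return the readings that would be stored under a delta threshold.

    A reading is kept when it is the first one seen, or when it differs
    from the most recently kept reading by at least `threshold` degrees.
    """
    recorded: list = []
    for reading in readings:
        if not recorded or abs(reading - recorded[-1]) >= threshold:
            recorded.append(reading)
    return recorded


# Made-up sample readings, one per device report.
samples = [40.0, 40.1, 41.5, 45.2, 44.9, 39.0]
kept = readings_to_record(samples)
```

With these samples, only 40.0, 45.2, and 39.0 would be stored. Small fluctuations are dropped, at the cost of the state lagging the true temperature by up to the threshold.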