joyent / conch-api Goto Github PK

View Code? Open in Web Editor NEW

22.0 22.0 11.0 31.28 MB

Datacenter build and management service

License: Mozilla Public License 2.0

Shell 0.23% Makefile 0.14% Perl 95.30% PLpgSQL 4.27% Dockerfile 0.06%

conch-api's Introduction

NOTICE

The Conch API has reached its end of life. So long and thanks for all the fish.

Conch API Server

Conch helps you build and manage datacenters.

Conch's goal is to provide an end-to-end solution for full datacenter resource lifecycle: from design to initial power-on to end-of-life for all components of all devices.

Conch is open source, licensed under MPL2.

Caveat Emptor

The API is in a constant state of flux. Contact the development team before attempting to use it directly. The conch shell and the Web UI are our current stable interfaces.

Installation

Operating System Support

We currently support Docker/Ubuntu. Being a Perl app, the API should run nearly anywhere but the code is only actively tested on macOS and Docker/Ubuntu.

Perl Support

The API is only certified to run against Perl 5.26.

Setup

Below is a list of useful Make commands that can be used to build and run the project. All of these should be run in the top level directory.

make run -- Build the project and run it
make test -- Run tests
make migrate-db -- Run database migrations

Needed Packages

PostgreSQL 10.14
Git
Perl, 5.26 or above (e.g. via perlbrew)
Carton

Configuration

Copy conch.conf.dist to conch.conf, modifying for any local parameters, including database connectivity information.

Starting Conch

make run

Creating Local Credentials

First, you need to get a login token into the local database. We can do this by leveraging the knowledge that an encrypted password entry of '' will match against all supplied inputs:

$ psql -U conch conch --command="insert into user_account (name, password, email) values ('me', '', '[email protected]')"

Now, we use this email and password to generate a login token:

make run curl -i -H'Content-Type: application/json' --url http://127.0.0.1:5001/login -d '{"email":"[email protected]","password":"anything"}'

You will see output like this:

{"jwt_token":"eyJInR5cCI6Iwhargarbl.eyJl9pZCI6ImM1MGYwhargarbl.WV3uJEvg0bqInI9pEtl04ZZ8ECN4yQOSmehello"}

Save that token somewhere, such as in an environment variable or a file, for use in future API calls. You will include it in the "Authorization" header, for example:

curl -i --url https://staging.conch.joyent.us/user/me --header "Authorization: Bearer eyJInR5cCI6Iwhargarbl.eyJl9pZCI6ImM1MGYwhargarbl.WV3uJEvg0bqInI9pEtl04ZZ8ECN4yQOSmehello"

Docker

Compose

The most simple way to get going with the Conch API is to use Docker Compose.

Build

First, build the image locally using docker/builder.sh

First Run

Copy conch.conf.dist to conch.conf, modifying for any local parameters. Specifically search for 'docker' in the comments. Ignore the database parameters.

# Edit compose file for desired release
docker-compose up -d postgres # initialize the postgres database
docker-compose run --rm web bin/conch-db all --username conch --email [email protected] --password kaewee3hipheem8BaiHoo6waed7pha
docker-compose up -d

Upgrading

docker-compose down
# Edit compose file for desired release
docker-compose pull
docker-compose up -d postgres
docker-compose run --rm web bin/conch-db migrate
docker-compose up -d

Licensing

This Source Code Form is subject to the terms of the Mozilla Public License, v.2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://www.mozilla.org/en-US/MPL/2.0/.

conch-api's People

Contributors

Stargazers

Watchers

Forkers

lseppala spearheadsys potatogim dev-sites stdigitaldatacenter karenetheridge zammalhabe bunnypirate isabella232 hernan604

conch-api's Issues

Proposal: Device report and validation result data reduction

Summary

Redundant data for device reports and validation results can be reduced by storing only values changed from recorded entry. This can be achieved by building and storing a collected state of device report or validation result data as each device report or validation result is received. Each additional report or validation result will be compared against the accumulated state, and only changes between the current state and the new entry will be stored. The state construct also provides quick retrieval of the all accumulated report data and validation status for a device.

Implementation details are omitted intentionally, and an high level example is provided for walk-through.

Problem

Each JSON device report is stored in its entirety, and for each report, the results of every validation run on that report are stored. A report is sent about once a minute while it’s running through preflight. Most of this data is redundant and does not change between reports. While each report and validation results are relatively small, they accumulate to significant size over time. There is an average of 43 validation results per device report.

To illustrate, the size of the 10 largest tables is shown below (queried 2018-02-05). device_report is the table the reports are stored, and device_validate contains all validation results run for each report.

            relation            | total_size
--------------------------------+------------
 public.device_validate         | 43 GB
 public.device_report           | 7764 MB
 public.device_settings         | 55 MB
 public.device_disk             | 27 MB
 public.device_nic              | 3992 kB
 public.device_neighbor         | 3120 kB
 public.device_nic_state        | 2240 kB
 public.datacenter_rack_layout  | 1296 kB
 public.device                  | 1248 kB
 public.device_relay_connection | 648 kB

This is a significant amount of data and will grow faster as we increase the rate of DC builds.

It is also desirable to quickly retrieve the latest device report data and validation results for an overview of the device status. The current retrievals are slow, partially related to the size of their source tables.

Proposed solution

For both device reports and validation results, two constructs will be used: a 'state' data structure to store all previous, unique data received, and append-only log to write only the changes between entries. The state data structure and the data written to the log will be different for device reports and validation results.

Each device report is a nested hash structure reported as a JSON object. Likewise, the device report state structure will also be a nested hash. As device reports are received, new and changes values will be added to or replace older values in the nested hash. This will be done depth-first, so a new, deeply-nested value in a device report will be correctly identified and added to the state. All unchanged values in the device report will be stripped out.

To think of it another way, if you could "merge" all device reports received over time and newer values take precedence over older values in the merge, the result would be the device report state.

The device report log will store the timestamp the report was received, and the minimum hash of the changed values. If there's no change between a device report and the current state, only the timestamp will be recorded.

The state for validation results will be an associative array of validation results. As validation results are collected, they will be compared to the list of results in the state. If the result from a unique validation, it is added to the state. If a result is identified as from the same validation (identifiable by the name of the validation and device component details, for example) as a previous result stored in the state, it will replace the older result if if variable values in the result have changed (such as the pass/fail status of the result). If a result is identical to a previous result in the state for the same validation, the older result is kept and the newer result is not stored.

Validation results and a timestamp will be written to a log only if the validation result is unique (from a new validation or changed from a previous result from the same validation).

Example

A walkthrough is presented below for building the state and recording changes for device reports and validation results. JSON structures and pseudo-code are used. Function invocations represent side-effecting sub-routines, such as writing to the database.

Device Report and Device State

The device state for a new device begins empty (or possibly with a predetermined set of values).

device_state = {}

A device sends a sends its first device report:

device_report_1 =
{
	“serial_number” : “deadbeef”,
	"product_name"  : "Joyent-Compute-Platform-3301"
}

No device report states exists for the device. It is created and contains all values in the first report.

device_state =
{
	“serial_number” : “deadbeef”,
	"product_name"  : "Joyent-Compute-Platform-3301"
}

write_device_state(device_state, timestamp)

The device report will then be written to the device report log with timestamp corresponding to when it was stored.

write_device_report_log({
	“serial_number” : “deadbeef”,
	"product_name"  : "Joyent-Compute-Platform-3301"
	}, timestamp)

The device sends another device report:

device_report_2 =
{
	“serial_number” : “deadbeef”,
	"product_name"  : "Joyent-Compute-Platform-3301”,
	“bios_version”  : “2.4.3”
}

The second device report is compared to the device state. The change between the device state and the device report is the addition of the bios_version attribute. The device state is updated:

device_state =
{
	“serial_number” : “deadbeef”,
	"product_name"  : "Joyent-Compute-Platform-3301”,
	“bios_version”  : “2.4.3”
}

write_device_state(device_state, timestamp)

and only new change between the device report and device state is written to the device report log.

write_device_report_log({
	“bios_version”  : “2.4.3”
	}, timestamp
)

Another device report is received.

device_report_3 =
{
	“serial_number” : “deadbeef”,
	"product_name"  : "Joyent-Compute-Platform-3301”,
	“bios_version”  : “3.0.0”
}

Compared to the current device state, the keys remain the same, but the value of bios_version has changed. The device state is updated, and the changed value is written to the device report log.

device_state =
{
	“serial_number” : “deadbeef”,
	"product_name"  : "Joyent-Compute-Platform-3301”,
	“bios_version”  : “3.0.0”
}

write_device_state(device_state, timestamp)

write_device_report_log({
	“bios_version”  : “2.4.3”
	}, timestamp
)

One final report is received

device_report_4 =
{
	“serial_number” : “deadbeef”,
	"product_name"  : "Joyent-Compute-Platform-3301”,
	“bios_version”  : “3.0.0”
}

There is no difference between the device state and the latest device report. The device state is not updated, as there’s no new or changed values. We still write to the device report log to record that a device report has been received, but no data other than the timestamp is written.

write_device_report_log( null, timestamp)

Validation Results

A similar process will exist for recording validation results. To begin, the validation state for as device is empty.

validation_state = []

When validations are run and the results collected, they will be compared to the current state.

validation_1 = { 
	"status"    : “pass”,
	"validation_name" : “product_name_check”,
	"want_value" : "Joyent-Compute-Platform-3301",
	"has_value" : "Joyent-Compute-Platform-3301”,
	"component_id" : “...",
	"component_type" : “...",

}

If there is no validation having the same values in the validation state, it is added to the state

validation_state = [ validation_1 ]

write_validation_state(validation_state, timestamp)

and written to the validation result log

write_validation_result_log( validation_1, timestamp)

New validation results are added to the validation state and stored.

validation_2 = { 
	"status"    : “fail”,
	"validation_name" : “bios_version_check”,
	"want_value" : "3.0.0",
	"has_value" : null,
	"component_id" : “...",
	"component_type" : “...",
} 

validation_state = [ validation_1, validation_2]

write_validation_result_log(validation_2, timestamp)

write_validation_state(validation_state, timestamp)

Validation results with values matching previously stored in the state are not stored. Unlike device reports, no information about validation results is stored, including timestamps for when the validation result was received.

# Matches values in 'validation_1'
validation_3 = { 
	"status"    : “pass”,
	"validation_name" : “product_name_check”,
	"want_value" : "Joyent-Compute-Platform-3301",
	"has_value" : "Joyent-Compute-Platform-3301”,
	"component_id" : “...",
	"component_type" : “...",

} 

# The validation state remains unchanged.
validation_state = [ validation_1, validation_2 ]

Validation results in the state that have matching identifying values (in this example, validation_name, component_id, component_type) will be replaced by newer validation results with different variable values (status, has_value)

validation_4 = { 
	"status"    : “pass”,
	"validation_name" : “bios_version_check”,
	"want_value" : "3.0.0",
	"has_value" : "3.0.0",
	"component_id" : “...",
	"component_type" : “...",
} 

# validation_4 replaces validation_2
validation_state = [ validation_1, validation_4 ]

write_validation_state(validation_state, timestamp)

write_validation_result_log(validation_4, timestamp)

To retrieve the current validation status of a device and details of the results , the state is retrieved and then all referenced validation results are found.

Previous data

A script can be written to process device reports and validation results to construct a device state and create the device report and validation result log. Afterwards, the old data will be dropped. This will be tested extensively using copies of production to verify against data loss.

Discussion

Implementation details were purposely omitted to keep the discussion at a high level. If this is agreed to be a reasonable proposal, implementation details such as database schema will be detailed in an OPS-RFD (either OPS-RFD 22 or another).

It is possible that this system can still store a large amount of data per device if changes in device reports or validation results are frequent. Device report data is mostly static. Validation results are derived from device report data, so they too will be mostly static and unchanged. The one exception is temperature data, which is reported in the device report and volatile. For each temperature reading, there are validations to verify the temperature is within an acceptable operating range. As the system was described, every change in temperature, even if it is 0.1 degree, will be stored on the device report log and the validation result for temperature validations will be stored. This may be irrelevant information. Instead, we might only store temperature data reaches a certain delta threshold (say, if it changes by 5 degrees from the current state). This is an open question for discussion.

Remove `device.validated` setting dispatch

Temporary code was added in #159 to provide backwards compatibility for Conch-Relay and Conch-Rebooter. Once these products have been updated and deployed to use the endpoint POST /device/:id/validated (or the orchestration system & validation system subsume this task), the temporary code marked with a TODO should be removed.

Urgent fixes needed for conch-reporter-smartos support

We need to get the new reporter deployed asap. A couple things I've run into so far:

# cat /tmp/conch-report.json | json conch | ./bin/conch -c /var/tmp/.conch.json api post /device/5T21TD2 -
{"error":true,"message":"HTTP Error: Status: 400 Bad Request\nBody: {\"error\":\"\\/temp\\/exhaust: Missing property.\\n\\/temp\\/inlet: Missing property.\"}\n"}

Inlet/exhaust requires a firmware upgrade, which seemingly a bunch of machines will be missing. json-schema/input.yml needs to be updated so these fields are not required.

# cat /tmp/conch-report.json | json conch | ./bin/conch -c /var/tmp/.conch.json api post /device/5T21TD2 -
{"error":true,"message":"HTTP Error: Status: 409 Conflict\nBody: {\"error\":\"Hardware product SKU 'NotProvided' does not exist\"}\n"}

We need to provide the legacy_product_name lookup function, if a device hw profile cannot be found using product_name or sku.

So far that's what I've run into.

There may be more.

Remove old validation system

Remove old validation system code and rely completely on the new validation
system.

Expose time-to-complete in API

This would allow us to show how long until as system has completed its validation cycle.

I think we have all pieces we need in the DB:

User settings: max_burnin_time (minutes)
Device table: uptime_since
Device settings: build.reboot_count (total reboot is hardcoded at 3)

In the rack view we could show this as a pie chart with three sections that fills as time goes on. This icon might change shade/color based on connectivity. Grey = not heartbeating, colored in = heartbeating.

This was my naive attempt at this: https://gist.github.com/bdha/82b28aa6a8d2183228def1022bcd20f1

Want to search for devices by MAC and IPaddr

It would be very helpful to be able to query the device_nic table with any MAC or IP address and get a device ID back.

This request is somewhat related to #355 in that the shell needs to support this lookup functionality.

Rewrite device report ingest

Rewrite the device report ingest function (record_device_report) to persist without DBIC.

Need CRUD operations for hardware products and profiles

Adding new hardware products and profiles to conch has historically been done by manually editing the database. We need to move those operations in the API and UIs.

Want to store and search for device hostnames

We currently only provide the ability to search for devices by serial number.

There is existing ops tooling that relies on the system hostname for lookups. We need to add another device field for the hostname, and make it queryable.

The new conch-reporter-smartos report (version:2) already includes the hostname parameter:

[root@east1a-admin01 (us-east-1a) /var/tmp/conch-reporter-smartos]# ./bin/conch -c /var/tmp/.conch.json api get /device/$( sysinfo | json 'Serial Number' ) | json id latest_report.os.hostname
8SCLRD2
east1a-admin01

Further, the shell should have a flag for conch device :id --type or similar that defines what :id is.

User emails are case sensitive and shouldn't be

We are currently dealing with email address in a case sensitive fashion. This allows us to create users with address like [email protected] and [email protected]. There's already a production instance of this. We need to make email addresses case insensitive.

Want ability to create and edit racks and rack layouts

Currently, racks must be created using prepared SQL statements which create the rack database object and then populates the appropriate hardware profiles at their relevant rack elevations.

What is needed are API, Web UI, and Shell components which allow one to create racks, either from prepared structured data (such as json, yaml, or the like) or as fields in a web UI, as well as the ability to edit existing racks. An example for the latter is if a hardware profile is wrong or changes for a particular position in a given rack, then it may easily be corrected.

Move Conch source to repo top-level

Now that the Conch UI has been extracted, the Conch/ subdirectory is particularly useless. Let's move up all of the Conch source to the top-level of the repository.

Need a sql file containing test data

While working on tests for joyent/go-conch, I ran into an issue caused by @lseppala and I having different data in our dev databases. It'd be super useful for testing if we had a sql file available that contained officially supported conch-api test data.

Update Travis integration after `String::CamelCase` is fixed

Once String::CamelCase fixes it's META file (or is removed as a dependency), change carton install to carton install --deployment in the .travis.yml file

Remove `v1` references

We've decided a while ago that we do not want to version the API, and instead using client version checks to check for compatibility. There are many references to v1 in the code base currently. Before cutting to the v3 release version, let's cleans these out by renaming and removing any v1 terminology.

Instances:

json-schema/v1.json
as_v1_json, as_v1 in Perl code

Raise flag on unparseable reports

If a host is submitting reports we can't parse, we want to flip a bit somewhere and display that in the UI.

This helps with diagnosing service problems.

A device setting would work for this.

Reduce Javascript asset size and/or optimize loading

Webpack warns that the rendered asset size is large (~600 Kb) and can impact performance (warnings disabled with this configuration line. Comment to enable: https://github.com/joyent/conch/pull/186/files#diff-57ace3d02945bd25f99fbc92d0d61ae7R2).

There may be 3rd-party dependencies that are barely used (Ramda might be a candidate) that can be removed and replaced with less code.

Alternatively, Webpack recommends code splitting, lazy loading, and caching to reduce asset size and overall performance regardless of size.

It may be worth investigating.

Side note: this does not impact performance once loaded. As a single-page app, it's loaded when the browser first visits the site and does not change as the user navigates the site.

Net::MAC::Vendor introduces an XS dep unnecessarily

We use this module to read the OUI database from IEEE to map MAC addresses to the device vendor.

In our deployment, we want to distribute the database plaintext directly, and so do not need to uncompress the file (using Compress:Bzip2 or Compress::Zlib). We want to limit new XS deps in the API -- and using the same module in conch-reporter-smartos, where XS deps are contraindicated.

It would be helpful if Net::MAC::Vendor only optionally required the compression modules if a local source is not specified.

Attached PRD is no longer reflected in rack view

When a PRD is attached to a rack, that ID of the PRD used to be noted under the rack view's Details section. That is no-longer the case, and results in a laborious search for the PRD which is actually connected.

Remove GET /problem endpoint

Remove the legacy GET /problem endpoint and associated code.

Endpoint to provide the version number of the API

It'd be handy to be able to determine the server's version number so we can enable features in the UIs if the server can support them.

Need api endpoint for creating new hardware vendors

There doesn't seem to be an api endpoint to create a new hardware vendor.

I believe we only need these routes:

Method	URI
GET	/hardware_vendor
GET, POST	/hardware_vendor/:name
POST	/hardware_vendor/:name/deactivate

Add OUI lookup support to switch peers validation

Something similar to https://github.com/joyent/conch-reporter-smartos/blob/master/lib/Conch/Reporter/Collect/OUI.pm so we don't have to rely on the client to submit correct data.

test issue

ignore

Failed validation criteria does not present a server as "FAIL" in the device and rack views

We are seeing cases where a server will fail validation due to something, but not be marked as such in the rack view for the device, or on the device's own page as it used to be.

Currently, https://conch.joyent.us/#!/device/S309573X8224784 is one such example. Validation is failing on the system's SMBIOS name, yet it is presented as as passing system.

incorrect results returned when creating a new resource

When creating a new resource, we are returning HTTP 303 with the location of the new resource as the redirect target. Instead, we should return HTTP 201 or 202 with the location in the 'Location' header.

(it should be fixed everywhere at once, in conjunction with changes to conch-cli and conch-ui, to preserve consistency.)

Want device notes

We currently use spreadsheets to interact with our integrators, and this is .. sub-optimal.

Ideally we would have a device_notes table that, in addition to simply putting comments on a device, would allow us to:

Track human actions we cannot automatically identify, like any maint work, replacing components, rebooting the system, etc. There should be a an ability to add new comment types (and we may later want to search for specific actions on a given device, group of devices, datacenter, etc.)
Assign a comment to a user in the workspace (like Google Docs/Sheets)
Allow user to mark a comment resolved (also like Google Docs/Sheets)

Comment types:

Maint
Reboot
Ticket
General
..?

Eventually when adding a comment we may want to be able to reference a specific component in the device, allowing us to track actions against that component -- or class of components when the Inventory System exists.

Some discussion of this happened here./show/42442029

Authorization system

As requested in joyent/conch-shell#17, we need a way for either Administrator users to invite themselves to workspaces or to just automatically have access to them all, by virtue of being admins. I'd prefer the later, particularly as a gateway to special-casing admins for all actions, like a real 'root' user.

IPMI validation

See: https://github.com/joyent/conch-livesys/issues/74, joyent/conch-reporter-smartos#32

Add dtrace probes

Production is running on SmartOS now and pkgsrc perl is built with dtrace support. We get some probes by default ( https://perldoc.perl.org/perldtrace.html ) but it'd be pretty shiny to support custom probes.

Devel::DTrace::Provider is probably our best/only bet https://metacpan.org/pod/Devel::DTrace::Provider

Handling registered but unlocated Relays

@daleghent has been rightfully frustrated on multiple occasions by Relays that have registered (POST /relay/:id/register) but haven't yet reported a located device don't show up in the UI.

3:14 PM [vendor] actually had the new PRDs connected up, but they were befuddled in that they didn't see the PRDs listed on the Relay tab of the web ui
3:14 PM so they thought they had a network issue

The Relay UI calls GET /workspace/:id/relay. The reason registered but unreported relays aren't included in the response is because we don't have any association between a relay and a workspace until at least once device located in that workspace sends a report.

We do have data to associate a user with a relay as soon as it registers. We should expose this information in the UI. Going off the 'API tree' discussion in issue #126, I suggest we create a new endpoint, /user/me/relay to expose relays that a user has registered but not yet reported.

Open questions:

The response for GET /workspace/:id/relay includes the devices each Relay has reported. Should the response should include any devices IDs that are unlocated (not in any workspace) but have reported through the relay be included in the response?
Will it confusing that the Relay will ONLY show up for the user's account who has registered registered the Relay? If integrator A registers the Relay, the Relay won't show up in integrator B's account even if they are working in the same workspace.

Fix Validation::SwitchPeer to be SmartOS-friendly

Currently makes assumptions about Linuxisms. This needs to be made more generic and multi-platform. Without this, conch-reporter-smartos reports will fail validation.

Disk ingest doesn't deactivate disks it doesn't find

If a disk is removed, it is not automatically marked deactivated. Someone needs to go toggle the deactivated field in the database.

Nicer if the ingest code compared inventory to database and disabled what is no longer in the system.

Permisions, as defined on the user_workspace_role table, should cascade to sub-workspaces unless overriden on the workspace

Want to store rack SN and asset information

An integrator has requested that we add rack SN and asset tag information, to reduce the amount of effort required to scan in a rack.

This requires new database fields in datacenter_rack, API and UI support. This is similar to fields we have in the device table, but for the rack itself.

Firmware update is not reflected as a status in rack view

A device that is undergoing firmware update is reflected as such in the Status section of the device's page; however the fact that it is undergoing this phase is not reflected next to the device's entry in the rack view. This may lead to an operator prematurely rebooting or turning off a server in the middle of this phase, which can lead to a inoperable server if done at the wrong moment.

New endpoint to provide data about all relays in the system, irrespective of workspace

This should be pinned to Administrators in the GLOBAL workspace.

v2 Performance

A place to discuss any performance concerns and possible solutions

Add Arista DCS-7160-48YC6-R support

We require hardware_* table support for a new vendor and a switch from that vendor:

Vendor: Arista
Switch: DCS-7160-48YC6-R

validate input on `layout` and `asset_tag` POST endpoints

When a user inputs a serial number or asset tag, we need to validate its correctness (e.g. no spaces)

Want device timeline / history

A report that shows:

first seen
major validation changes
location changes
etc

How much of this is built into Conch and into Locker, I'm not sure yet. A discussion and RFD will be needed.

Want the ability to export rack/workspace contents to a CSV file

An integrator has requested we add the ability to export the contents of a rack and workspace to a CSV file. They could easily build this information now we with the conch-shell but that may be more than they can chew.

The format might be something like:

workspace_name,datacenter_name,room_name,rack,rack_unit,device_Id

Remove all Legacy::* submodules

Want an API option to get device reports with a given state or time range

Like "all failing reports" or "most recent failing report", not just the current "last report."

Submission failure for HA class system

[2018-08-24T22:48:59.681340Z] LVLerror: conch-api/891957 on conch-api1 (DefaultHelpers.pm:102 in Mojolicious::Plugin::DefaultHelpers):
    DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  null value in column "iface_type" violates not-null constraint
    DETAIL:  Failing row contains (a0:36:9f:c0:fb:ba, 8TYJRD2, ixgbe1, null, Intel Corporate, , null, 2018-06-07 16:29:05.539205+00, 2018-08-24 22:48:59.468975+00). [for Statement "UPDATE device_nic SET iface_name = ?, iface_type = ?, iface_vendor = ?, mac = ?, updated = NOW() WHERE ( mac = ? )" with ParamValues: 1='ixgbe1', 2=undef, 3='Intel Corporate', 4='A0:36:9F:C0:FB:BA', 5='a0:36:9f:c0:fb:ba'] at /opt/conch/bin/../lib/Conch/Legacy/Control/DeviceReport.pm line 215
    
[2018-08-24T22:48:59.682570Z] LVLdebug: conch-api/891957 on conch-api1 (Controller.pm:219 in Mojolicious::Controller): 500 Internal Server Error (0.379691s, 2.634/s)
[2018-08-24T22:48:59.682720Z] LVLinfo: conch-api/891957 on conch-api1: dispatch (req_id=9f8df31f, req.params={}, req.remoteAdress=72.2.118.26, req.remotePort=42473)
    POST /device/8TYJRD2 HTTP/1.1
    Accept-Encoding: gzip
    Connection: upgrade
    Content-Length: 19405
    Content-Type: application/json
    Host: conch.joyent.us
    User-Agent: conch shell v1.3.1-v1.3.1-0-gdc1807d
    X-Forwarded-for: 148.78.3.229
    X-Real-IP: 148.78.3.229
    --
    HTTP/1.1 500 Internal Server Error
    Content-Type: application/json;charset=UTF-8
    Server: Mojolicious (Perl)
    
    {"error":"An exception occurred"}
    --
    err: {
      "fileName": null,
      "frames": [
        {}
      ],
      "lineNumber": null,
      "msg": "DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR:  null value in column \"iface_type\" violates not-null constraint\nDETAIL:  Failing row contains (a0:36:9f:c0:fb:ba, 8TYJRD2, ixgbe1, null, Intel Corporate, , null, 2018-06-07 16:29:05.539205+00, 2018-08-24 22:48:59.468975+00). [for Statement \"UPDATE device_nic SET iface_name = ?, iface_type = ?, iface_vendor = ?, mac = ?, updated = NOW() WHERE ( mac = ? )\" with ParamValues: 1='ixgbe1', 2=undef, 3='Intel Corporate', 4='A0:36:9F:C0:FB:BA', 5='a0:36:9f:c0:fb:ba'] at /opt/conch/bin/../lib/Conch/Legacy/Control/DeviceReport.pm line 215\n210: \t\t\t\t\t\tmac          => $mac,\n211: \t\t\t\t\t\tdevice_id    => $device->id,\n212: \t\t\t\t\t\tiface_name   => $nic,\n213: \t\t\t\t\t\tiface_type   => $dr->{interfaces}->{$nic}->{product},\n214: \t\t\t\t\t\tiface_vendor => $dr->{interfaces}->{$nic}->{vendor},\n215: \t\t\t\t\t\tiface_driver => '',\n216: \t\t\t\t\t\tupdated      => \\'NOW()',\n217: \t\t\t\t\t\tdeactivated  => undef\n218: \t\t\t\t\t}\n219: \t\t\t\t);\n220: \n"
    }

Missing the iface_type field from the report?

[root@headnode (us-east-1) /var/tmp/dc-standup/src/conch-reporter-smartos]# cat /tmp/conch-report.json | json conch.interfaces.ixgbe1
{
  "peer_switch": "va1-3-d05-2",
  "peer_mac": "f4:8e:38:46:23:22",
  "peer_mgmt_ip": null,
  "peer_capabilities": null,
  "peer_port": "TenGigabitEthernet 1/20",
  "peer_port_descr": null,
  "state": "up",
  "duplex": "full",
  "peer_descr": null,
  "speed": "10000",
  "class": "phys",
  "peer_vendor": "Dell Inc.",
  "mac": "a0:36:9f:c0:fb:ba",
  "mtu": "1500",
  "vendor": "Intel Corporate"
}

From a working report:

  "ixgbe0": {
    "peer_switch": "va1-5-d08-2",
    "peer_mgmt_ip": null,
    "state": "up",
    "peer_descr": null,
    "speed": "10000",
    "mac": "a0:36:9f:c1:a8:fc",
    "mtu": "1500",
    "peer_mac": "f4:8e:38:31:f2:46",
    "peer_capabilities": null,
    "peer_port_descr": null,
    "peer_port": "TenGigabitEthernet 1/2",
    "duplex": "full",
    "product": "unknown",
    "class": "phys",
    "peer_vendor": "Dell Inc.",
    "vendor": "Intel Corporate"
  }

DiskMap validation

Ensure all disks match their hardware spec, similar to DimmMap.

Add support for PDU devices

Initially, we want to be able to assign PDUs to racks.

(Later we will want full device-style reporting.)

There are potentially a few changes that need to be made here:

datacenter_rack_layout needs to be made aware of 0U, side-mounted PDUs. Or we can just lie about in device_location.

A new hardware_product needs to be added for each PDU type. (Preferably. We could fudge this, too, until we're ready to actually manage the PDUs from the Relay.)

Orchestration System

Conch needs a centralized system for controlling various processes inside the Conch environment, ranging from centralized management of production datacenter inventory validation to localized management of Conch relays and livesys images inside an integration facility.

In the most general terms, the orchestration system should be a purpose-built workflow engine. It must provide a way to specify policies, focused on build operations, that determine the order of operations for the Conch relay and livesys applications. The validation system must be used to determine if those policies have completed successfully.

The user interface, whether HTML or CLI, is not described in this RFD and will be specified at a later point.

Concepts

Automation

Automation is the execution of individual tasks, with the goal of simplifying or standardizing tasks that were often run manually previously.

Workflow

A workflow is a set of pre-defined automations launched by a trigger condition. In the most general case, steps in a workflow may be optional or required and it is usually possible to nest workflows.

Workflow Engine

A workflow engine is a system that evaluates trigger conditions, trigger individual workflows, and processes the results. Usually workflow engines are purpose-built for the needs of a specific application.

Orchestration

Orchestration is concerned with bringing workflows together into processes or policies, with the goal of streamlining and reusing those processes. Automation provides the building blocks upon which orchestration processes are built.

Design

Workflows

Orchestration operations are tied to a 'workflow', which itself is tied to a hardware product. Every device is keyed to a hardware product, as well. This combination allows for workflow steps and validations to be linked to specific hardware revisions.

This is particularly useful when a newer version of a server design varies wildly from an older one. Both systems can be built and validated because their workflow, and the validations necessary to green-light the device, are tied to the hardware product specification.

Workflow Steps

Workflow steps are an ordered list of string names and validations. The string names are opaque to the orchestration system but signal different operations to the downstream clients.

Status

Two types of status exist, one for the execution of an entire workflow, and another for the execution of an individual step.

Workflow Status

Workflow status records the state of the execution of an entire workflow for a specific device. Most of the time, a device's workflow will either be "ongoing" or "completed". However, special circumstances may arise with a need to interrupt that flow.

If an external entity (probably a human) determines that a workflow must be stopped, the workflow status for that device will be set to 'abort'. The engine cannot reach out to client devices and forcibly halt execution so 'abort' indicates that workflow must cease when the current step finishes. When the
device completes its current step, the workflow status for the device will be set to 'stopped'.

Similarly, an external entity may determine that the extraordinary circumstances have passed and a workflow may continue from a previously aborted state. In this case, the workflow status for that device will be set to 'resume'. When the device resumes work, the workflow status for that device will be set to 'ongoing'.

Workflow Step Status

Workflow step status records the state of a particular workflow step for a particular device. From a client perspective, they are write-once. However, a status record must also contain the results of the appropriate validation. As such, the backend must be able to update an existing status record.

When a workflow status is received with a status of 'complete' and 'data' is present, the validation subsystem will be called, using the validation plan id. The validation system call will be passed the relevant device id, hardware product id, validation plan id, and the data from the workflow status. The
result will be written back into workflow status record, indicating pass/fail status and a link to the full validation result.

By default, if validation fails for a workflow step, the client will receive a failure indicator when requesting its next step. If 'retry' is set on the workflow step, the client will be told to run that step again. Any retry-able step must also set a maximum amount of retries. Clients will not be allowed to
proceed further once that maximum has been reached.

It is possible for the validation system to fail internally, providing neither a fail nor success indicator. When this occurs, step retries must not happen and the step must be marked as failed. It should be possible to re-execute the validations of a step in this state, as long as the failed step is the most
recent one. When a revalidation occurs, a new status record must be written containing the new validation result.

Schema

workflow

column	type	modifiers
id	uuid	not null, pk
name	string	not null, unique
version	int	not null, default 1, unique(name,version)
created	ts	not null, default now
hardware_id	uuid	not null, fk hardware_product(id)

workflow_status

enum workflow_status_enum:

ongoing
stopped
abort
resume
completed

column	type	modifiers
id	uuid	pk
workflow_id	uuid	fk workflow(id)
device_id	uuid	fk device(id)
timestamp	tz	not null, default now
status	workflow_status_enum	not null, default ongoing

workflow_step

column	type	modifiers
id	uuid	not null, pk
workflow_id	uuid	not null, fk workflow(id)
name	string	not null, unique(name, workflow_id)
order	int	not null, unique(workflow_id, order)
retry	bool	default false
max_retries	int	not null, default 0
validation_plan_id	uuid	not null, fk validation_plan(id)

workflow_step_status

enum workflow_step_state_enum:

started
processing
complete

enum validation_status_enum:

pass
fail
error
noop (not run)

column	type	modifiers
id	uuid	not null, pk
device_id	string	not null, fk device(id)
workflow_step_id	uuid	not null, fk workflow_step(id)
timestamp	ts	not null, default now
state	workflow_step_state_enum	not null, default(started)
retry_count	int	not null, default 1
data	jsonb
validation_status	validation_status_enum	default(noop)
validation_result_id	uuid	fk validation_result(id)

Implementation

The orchestration API will exist within the existing Conch Mojo API codebase and feature an independant user interface. A standalone CLI will be developed in Go.

Auth

Authentication

The API will be divided into two segments, isolating authentication concerns. For endpoints used by automation, authentication will occur via HTTP Signatures, utilizing RSA keys generated and managed by a CLI tool. Users must be allowed multiple RSA keys and it should be possible to bake an RSA key into the orchestration CLI.

API endpoints used by user interfaces will use the same authentication as the conch API server.

Authorization

Authorization will be managed by the existing concept of roles within the Conch database. For automated clients, permissions will be based on roles within the GLOBAL workspace. For human clients, permissions will be split. Administrators in the GLOBAL workspace will be able to see all workflows, workflow statuses, and devices. Otherwise, the user will see items based on the device list for their particular workspace.

Only Administrators in the GLOBAL zone can created or modify workflows.

UI needs to indicate when a failed login attempt occurs

Currently, if a user inputs bad login credentials, there is no UI element that indicates the login failed. It appears that nothing happened at all. We need to indicate in the UI that the credentials were submitted and login failed.

when reporting racks by workspace datacenter is empty

When reporting racks by workspaces, the datacenter field is "" I tried against all SPC build workspaces with the same results

 ./bin/conch -j ws $i  racks|json -ag

{
  "id": "47caeaf6-9393-4d29-8cde-a14a35762077",
  "name": "3.1-0905",
  "role": "MANTA",
  "unit": 0,
  "size": 45,
  "datacenter": "",
  "slots": null
}