comcast / fishymetrics

Redfish API Prometheus Exporter for monitoring large scale server deployments

License: Apache License 2.0


fishymetrics's Introduction

fishymetrics exporter for Prometheus

This is a simple server that scrapes a bare-metal chassis manager's stats using the Redfish API and exports them via HTTP for Prometheus consumption.

This app can support any chassis that has the Redfish API available. If you need to make non-Redfish API calls, the app can be extended to support them. Please see the plugins documentation for more information.

Getting Started

To run it:

$ ./fishymetrics --help
usage: fishymetrics [<flags>]

redfish api exporter with all the bells and whistles

Flags:
  -h, --help                    Show context-sensitive help (also try --help-long and --help-man).
      --user=""                 BMC static username
      --password=""             BMC static password
      --timeout=15s             BMC scrape timeout
      --scheme="https"          BMC Scheme to use
      --log.level=[debug|info|warn|error]
                                log level verbosity
      --log.method=[file|vector]
                                alternative method for logging in addition to stdout
      --log.file-path="/var/log/fishymetrics"
                                directory path where log files are written if log-method is file
      --log.file-max-size="256"   max file size in megabytes if log-method is file
      --log.file-max-backups="1"  max file backups before they are rotated if log-method is file
      --log.file-max-age="1"      max file age in days before they are rotated if log-method is file
      --vector.endpoint="http://0.0.0.0:4444"
                                vector endpoint to send structured json logs to
      --port="9533"             exporter port
      --vault.addr="https://vault.com"
                                Vault instance address to get chassis credentials from
      --vault.role-id=""        Vault Role ID for AppRole
      --vault.secret-id=""      Vault Secret ID for AppRole
      --collector.drives.modules-exclude=""
                                regex of drive module(s) to exclude from the scrape
      --collector.firmware.modules-exclude=""
                                regex of firmware module to exclude from the scrape
      --credentials.profiles=CREDENTIALS.PROFILES
                                profile(s) with all necessary parameters to obtain BMC credential from secrets backend, i.e.

                                  --credentials.profiles="
                                    profiles:
                                      - name: profile1
                                        mountPath: "kv2"
                                        path: "path/to/secret"
                                        userField: "user"
                                        passwordField: "password"
                                      ...
                                  "

                                --credentials.profiles='{"profiles":[{"name":"profile1","mountPath":"kv2","path":"path/to/secret","userField":"user","passwordField":"password"},...]}'

Or set the following ENV Variables:

BMC_USERNAME=<string>
BMC_PASSWORD=<string>
BMC_TIMEOUT=<duration> (Default: 15s)
BMC_SCHEME=<string> (Default: https)
EXPORTER_PORT=<int> (Default: 9533)
LOG_PATH=<string> (Default: /var/log/fishymetrics)
VAULT_ADDRESS=<string>
VAULT_ROLE_ID=<string>
VAULT_SECRET_ID=<string>
./fishymetrics
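
As a rough illustration of how a default such as `BMC_TIMEOUT` (15s) could be resolved, the sketch below parses the variable as a Go duration and falls back to the documented default. The names `lookupFunc` and `timeoutFrom` are hypothetical and are not taken from the fishymetrics source; the fallback on an unparsable value is also an assumption.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// lookupFunc has the same shape as os.LookupEnv so values can be injected.
type lookupFunc func(key string) (string, bool)

// defaultTimeout mirrors the documented default for BMC_TIMEOUT.
const defaultTimeout = 15 * time.Second

// timeoutFrom is a hypothetical helper: it parses BMC_TIMEOUT as a Go
// duration and returns the documented default when the variable is unset,
// empty, non-positive, or cannot be parsed.
func timeoutFrom(lookup lookupFunc) time.Duration {
	raw, ok := lookup("BMC_TIMEOUT")
	if !ok || raw == "" {
		return defaultTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultTimeout
	}
	return d
}

func main() {
	fmt.Println("scrape timeout:", timeoutFrom(os.LookupEnv))
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the helper easy to exercise without touching the process environment.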

Collectors

Exclude flags

Some hosts contain dozens of drives, so scraping every one of them can take a very long time and may not be necessary. For this reason there is an exclude flag specifically for the drives.module and firmware.module scopes.

Example:

--collector.drives.modules-exclude="(?i)(FlexUtil|(SBMezz|IOEMezz)[0-9]+)"

Collector  Scope   Include Flag  Exclude Flag
drives     module  N/A           module-exclude
firmware   module  N/A           module-exclude
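
To see what the example pattern above would exclude, here is a minimal Go sketch that applies it to a few module names. The module names and the `filterModules` helper are made up for illustration, and the sketch assumes an unanchored match; how fishymetrics applies the regex internally is not shown in this README.

```go
package main

import (
	"fmt"
	"regexp"
)

// filterModules returns the names that do NOT match the exclude pattern.
// It uses an unanchored match, which is an assumption of this sketch.
func filterModules(pattern string, names []string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var kept []string
	for _, n := range names {
		if !re.MatchString(n) {
			kept = append(kept, n)
		}
	}
	return kept, nil
}

func main() {
	// Hypothetical module names, chosen only to exercise the pattern.
	names := []string{"FlexUtil", "SBMezz1", "ioemezz2", "RAID"}
	kept, err := filterModules(`(?i)(FlexUtil|(SBMezz|IOEMezz)[0-9]+)`, names)
	if err != nil {
		panic(err)
	}
	fmt.Println(kept)
}
```

The `(?i)` prefix makes the match case-insensitive, and the `[0-9]+` suffix means a bare `SBMezz` with no trailing digits would not be excluded.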

Usage

build info URL

Responds with the application's version, build_date, go_version, etc

If deployed on your localhost:
curl http://localhost:9533/info

metrics URL

Responds with the application's runtime metrics

If deployed on your localhost:
curl http://localhost:9533/metrics

redfish API /scrape

To test a scrape of a host's redfish API, you can curl fishymetrics

curl 'http://localhost:9533/scrape?model=<model-name>&target=1.2.3.4'

If you have a credential profile configured you can add the extra URL query parameter

curl 'http://localhost:9533/scrape?model=<model-name>&target=1.2.3.4&credential_profile=<profile-name>'

Plugins are supported through the plugins query parameter, which takes a comma-separated list of plugin names

curl 'http://localhost:9533/scrape?model=<model-name>&target=1.2.3.4&plugins=example1,example2'
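
The same requests can be assembled programmatically. The sketch below builds a /scrape URL from the query parameters shown above (model, target, credential_profile, plugins); the `buildScrapeURL` helper and the http scheme are assumptions of this example, not part of fishymetrics.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// buildScrapeURL is a hypothetical helper that assembles a /scrape request
// URL. Optional parameters are only added when they are non-empty.
func buildScrapeURL(exporter, model, target, profile string, plugins []string) string {
	q := url.Values{}
	q.Set("model", model)
	q.Set("target", target)
	if profile != "" {
		q.Set("credential_profile", profile)
	}
	if len(plugins) > 0 {
		// Plugins are passed as one comma-separated value.
		q.Set("plugins", strings.Join(plugins, ","))
	}
	u := url.URL{Scheme: "http", Host: exporter, Path: "/scrape", RawQuery: q.Encode()}
	return u.String()
}

func main() {
	fmt.Println(buildScrapeURL("localhost:9533", "dl360", "1.2.3.4", "", []string{"example1", "example2"}))
}
```

`url.Values.Encode` sorts parameters by key and percent-encodes the comma as `%2C`; a standard query parser decodes that back to the same comma-separated list.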

Docker

To run the fishymetrics exporter as a Docker container using static credentials, run:

Using ENV variables

docker run --name fishymetrics -d -p <EXPORTER_PORT>:<EXPORTER_PORT> \
-e BMC_USERNAME='<user>' \
-e BMC_PASSWORD='<password>' \
-e BMC_TIMEOUT=15s \
-e EXPORTER_PORT=1234 \
comcast/fishymetrics:latest

Using command line args

docker run --name fishymetrics -d -p <EXPORTER_PORT>:<EXPORTER_PORT> \
--user '<user>' \
--password '<password>' \
--timeout 15s \
--port 1234 \
comcast/fishymetrics:latest

Prometheus Configuration

The fishymetrics exporter needs to be passed the BMC address as a URL parameter. This can be done with relabelling.

Example config:

scrape_configs:
  - job_name: 'fishymetrics'
    static_configs:
      - targets:
        - bmc-fdqn-p1.example.com
        labels:
          foo: bar
      - targets:
        - bmc-fdqn-p2.example.com
        labels:
          foo: bar
    metrics_path: /scrape
    scrape_interval: 5m
    params:
      model: ["dl360"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: fishymetrics.example.com  # Kubernetes cluster nginx-ingress FQDN or any host IP/FQDN you deployed with
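
To make the effect of the three relabel rules concrete, the following Go sketch models the request Prometheus would end up sending for one target: the original address becomes the target query parameter and the instance label, and the exporter host replaces the scrape address. The `relabel` function and `scrapeRequest` type are illustrative only, and the http scheme is assumed because the config above sets no scheme.

```go
package main

import (
	"fmt"
	"net/url"
)

// scrapeRequest models the outcome of the relabel rules for one target.
type scrapeRequest struct {
	URL      string // request sent to the exporter
	Instance string // value of the instance label on the resulting series
}

// relabel is a simplified, hypothetical model of the config above:
//   __address__      -> __param_target (the BMC becomes ?target=...)
//   __param_target   -> instance
//   __address__      <- exporter host (the request goes to fishymetrics)
func relabel(address, exporterHost, model string) scrapeRequest {
	q := url.Values{}
	q.Set("model", model)    // from the params block
	q.Set("target", address) // from __param_target
	u := url.URL{Scheme: "http", Host: exporterHost, Path: "/scrape", RawQuery: q.Encode()}
	return scrapeRequest{URL: u.String(), Instance: address}
}

func main() {
	r := relabel("bmc-fdqn-p1.example.com", "fishymetrics.example.com", "dl360")
	fmt.Println(r.URL)
	fmt.Println(r.Instance)
}
```

Without the relabel rules, Prometheus would request /scrape on the BMC address itself rather than on the exporter.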

Development

Building

linux binary

make build

docker image

make docker

Testing

make test

License

Dependency source code can be found in the sources.tgz file in the root of the comcast/fishymetrics-src docker image. To extract the source code, run the following commands:

export TMP_CONTAINER="$(docker create comcast/fishymetrics-src:latest)"
docker export $TMP_CONTAINER | tar -x sources.tgz
docker rm $TMP_CONTAINER
tar -xzf sources.tgz

Dependency licenses for each package:

package                                      license
github.com/hashicorp/go-hclog                MIT license
github.com/hashicorp/go-retryablehttp        Mozilla Public License v2.0
github.com/hashicorp/vault/api               Mozilla Public License v2.0
github.com/hashicorp/vault/api/auth/approle  Mozilla Public License v2.0
github.com/hashicorp/vault/sdk               Mozilla Public License v2.0
github.com/nrednav/cuid2                     MIT license
github.com/prometheus/client_golang          Apache-2.0 license
github.com/stretchr/testify                  MIT license
go.uber.org/zap                              MIT license
gopkg.in/alecthomas/kingpin.v2               MIT license
gopkg.in/natefinch/lumberjack.v2             MIT license
gopkg.in/yaml.v3                             Apache-2.0 license

fishymetrics's People

Contributors

dependabot[bot], derrick-dacosta, ibrahimkk-moideen, ikkhan, jenniferkaiser21

fishymetrics's Issues

Fix metrics collection and labels post consolidation

The following items need to be addressed:

Add "Thermal summary status" for all Cisco models
Fix PowerTotalConsumed url label on all Cisco models
Update Chassis struct to include Gen9 models
Update memory metrics to include Gen9 models

Chassis ComputerSystems field is handled improperly

It appears as though Chassis.Links.ComputerSystems is not being parsed/unmarshalled correctly. When running a query against the dmtf/redfish-mockup-server, we receive an error:

Error Unmarshalling Chassis struct - json: cannot unmarshal object into Go struct field ChassisLinks.Links.ComputerSystems of type string

To reproduce this, run the Redfish Mockup Server Docker container:
docker run --rm -d -p 8000:8000 dmtf/redfish-mockup-server:latest

And then run a Fishymetrics container:
docker run -p 9533:9533 -e BMC_USERNAME=admin -e BMC_PASSWORD=password -e EXPORTER_PORT=9533 -e BMC_TIMEOUT=15s -e BMC_SCHEME=http comcast/fishymetrics

You can now curl the Fishymetrics endpoint and you will see the error:
curl 'localhost:9533/scrape?target=http://localhost:8000'

The DMTF Redfish Mockup Server uses the "Simple Rack-mounted Server" mockup for its data, which is also found here:
https://github.com/DMTF/Redfish-Mockup-Server/blob/main/public-rackmount1/Chassis/1U/index.json. The full index.json is posted at the bottom of this issue.

The DMTF specification shows the Chassis.Links.ComputerSystems as follows:

    "Links": {
        "ComputerSystems": [
            {
                "@odata.id": "/redfish/v1/Systems/437XR1138R2"
            }
        ],

That is what fails.

If ComputerSystems is (incorrectly) changed to the following, it works.

    "Links": {
        "ComputerSystems": [
                "/redfish/v1/Systems/437XR1138R2",
                "/redfish/v1/Systems/437XR1138R2"
        ],

Full index.json from the mockup server:

{
    "@odata.type": "#Chassis.v1_25_0.Chassis",
    "Id": "1U",
    "Name": "Computer System Chassis",
    "ChassisType": "RackMount",
    "AssetTag": "Chicago-45Z-2381",
    "Manufacturer": "Contoso",
    "Model": "3500RX",
    "SKU": "8675309",
    "SerialNumber": "437XR1138R2",
    "PartNumber": "224071-J23",
    "PowerState": "On",
    "IndicatorLED": "Lit",
    "HeightMm": 44.45,
    "WidthMm": 431.8,
    "DepthMm": 711,
    "WeightKg": 15.31,
    "Location": {
        "PostalAddress": {
            "Country": "US",
            "Territory": "OR",
            "City": "Portland",
            "Street": "1001 SW 5th Avenue",
            "HouseNumber": 1100,
            "Name": "DMTF",
            "PostalCode": "97204"
        },
        "Placement": {
            "Row": "North",
            "Rack": "WEB43",
            "RackOffsetUnits": "EIA_310",
            "RackOffset": 12
        }
    },
    "Status": {
        "State": "Enabled",
        "Health": "OK"
    },
    "ThermalSubsystem": {
        "@odata.id": "/redfish/v1/Chassis/1U/ThermalSubsystem"
    },
    "PowerSubsystem": {
        "@odata.id": "/redfish/v1/Chassis/1U/PowerSubsystem"
    },
    "EnvironmentMetrics": {
        "@odata.id": "/redfish/v1/Chassis/1U/EnvironmentMetrics"
    },
    "Sensors": {
        "@odata.id": "/redfish/v1/Chassis/1U/Sensors"
    },
    "Controls": {
        "@odata.id": "/redfish/v1/Chassis/1U/Controls"
    },
    "TrustedComponents": {
        "@odata.id": "/redfish/v1/Chassis/1U/TrustedComponents"
    },
    "Thermal@Redfish.Deprecated": "Please migrate to use /redfish/v1/Chassis/1U/ThermalSubsystem",
    "Thermal": {
        "@odata.id": "/redfish/v1/Chassis/1U/Thermal"
    },
    "Power@Redfish.Deprecated": "Please migrate to use /redfish/v1/Chassis/1U/PowerSubsystem",
    "Power": {
        "@odata.id": "/redfish/v1/Chassis/1U/Power"
    },
    "Links": {
        "ComputerSystems": [
            {
                "@odata.id": "/redfish/v1/Systems/437XR1138R2"
            }
        ],
        "ManagedBy": [
            {
                "@odata.id": "/redfish/v1/Managers/BMC"
            }
        ],
        "ManagersInChassis": [
            {
                "@odata.id": "/redfish/v1/Managers/BMC"
            }
        ]
    },
    "@odata.id": "/redfish/v1/Chassis/1U",
    "@Redfish.Copyright": "Copyright 2014-2023 DMTF. For the full DMTF copyright policy, see http://www.dmtf.org/about/policies/copyright."
}
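
One possible way to tolerate both shapes discussed in this issue is a small type with a custom UnmarshalJSON that accepts either a plain string or an object carrying "@odata.id". The sketch below is only an illustration: the type names (`odataLink`, `chassisLinks`, `chassis`) are hypothetical and do not reflect the actual structs in fishymetrics.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// odataLink is a hypothetical type that holds a Redfish resource path.
// It accepts both "/redfish/v1/..." and {"@odata.id": "/redfish/v1/..."}.
type odataLink string

// UnmarshalJSON tries the plain string form first, then the object form.
func (l *odataLink) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err == nil {
		*l = odataLink(s)
		return nil
	}
	var obj struct {
		ID string `json:"@odata.id"`
	}
	if err := json.Unmarshal(b, &obj); err != nil {
		return err
	}
	*l = odataLink(obj.ID)
	return nil
}

// chassisLinks and chassis model only the fields relevant to this issue.
type chassisLinks struct {
	ComputerSystems []odataLink `json:"ComputerSystems"`
}

type chassis struct {
	Links chassisLinks `json:"Links"`
}

// parseChassis decodes a Chassis payload into the trimmed-down struct.
func parseChassis(data []byte) (chassis, error) {
	var c chassis
	err := json.Unmarshal(data, &c)
	return c, err
}

func main() {
	spec := []byte(`{"Links":{"ComputerSystems":[{"@odata.id":"/redfish/v1/Systems/437XR1138R2"}]}}`)
	c, err := parseChassis(spec)
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Links.ComputerSystems[0])
}
```

With a type like this, the DMTF mockup payload and the string-array variant both decode to the same list of paths.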

Update response collection in PSU and Raid metrics

HP DL360:
Include label "Hpe" in PowerSupply metrics response.

Cisco S3260 M4 & M5:
Loop through members of storage instead of hardcoded endpoint to avoid errors when the raid controllers are not available.

Fix & update metrics on Cisco and HP servers

  1. Add CPU status metric to HP DL380.
  2. Add "bayNumber" label on psu status and output metrics for all HP models.
  3. Fix C220 drive metric on servers with cimc fw < 4.1.
  4. Fix PSU supply total metric on S3260M4 servers.
  5. Fix SN and FW ver label in PSU metrics on S3260M5 servers.
  6. Update MemberId key in PSU and Thermal to support string and int.
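
For item 6, a field that may arrive as either a JSON string or a JSON integer can be handled with a custom unmarshaller. This is a minimal sketch under that assumption; the `memberID` type and `psuEntry` struct are hypothetical and not the project's actual definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// memberID is a hypothetical type that normalises MemberId to a string,
// whether the BMC reports it as "0" or as 0.
type memberID string

// UnmarshalJSON accepts a JSON string or a JSON integer.
func (m *memberID) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err == nil {
		*m = memberID(s)
		return nil
	}
	var n int64
	if err := json.Unmarshal(b, &n); err != nil {
		return err
	}
	*m = memberID(strconv.FormatInt(n, 10))
	return nil
}

// psuEntry models only the field relevant to this example.
type psuEntry struct {
	MemberID memberID `json:"MemberId"`
}

// parsePSU decodes a single power supply entry.
func parsePSU(data []byte) (psuEntry, error) {
	var p psuEntry
	err := json.Unmarshal(data, &p)
	return p, err
}

func main() {
	for _, raw := range []string{`{"MemberId":"0"}`, `{"MemberId":1}`} {
		p, err := parsePSU([]byte(raw))
		if err != nil {
			panic(err)
		}
		fmt.Println(p.MemberID)
	}
}
```

Normalising to a string keeps the value directly usable as a Prometheus label.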

PowerSupply duplicate metrics

The PowerSupply metric includes PSUs that are not enabled, and reports them with incorrect bay numbers.

# HELP redfish_power_supply_status Current power supply status 1 = OK, 0 = BAD
# TYPE redfish_power_supply_status gauge
redfish_power_supply_status{bayNumber="1",chassisModel="xxxx",chassisSerialNumber="xxxx",firmwareVersion="2.00",manufacturer="",model="P38997-B21",name="HpeServerPowerSupply",powerSupplyType="AC",serialNumber="xxxx"} 1

## duplicate/incorrect metric
redfish_power_supply_status{bayNumber="2",chassisModel="xxxx",chassisSerialNumber="xxxx",firmwareVersion="",manufacturer="",model="",name="",powerSupplyType="",serialNumber=""} 0

redfish_power_supply_status{bayNumber="2",chassisModel="xxxx",chassisSerialNumber="xxxx",firmwareVersion="2.00",manufacturer="",model="P38997-B21",name="HpeServerPowerSupply",powerSupplyType="AC",serialNumber="xxxx"} 1
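
One way such entries could be filtered out is to skip power supplies whose Redfish Status.State is not "Enabled" before emitting metrics. The sketch below assumes the unwanted entry is reported with a non-Enabled state (for example "Absent"), which this issue does not confirm; the struct and function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// powerSupply models only the fields this sketch needs.
type powerSupply struct {
	Name   string `json:"Name"`
	Status struct {
		State  string `json:"State"`
		Health string `json:"Health"`
	} `json:"Status"`
}

// powerPayload models the PowerSupplies array of a Redfish Power resource.
type powerPayload struct {
	PowerSupplies []powerSupply `json:"PowerSupplies"`
}

// enabledSupplies returns only the entries whose Status.State is "Enabled".
func enabledSupplies(data []byte) ([]powerSupply, error) {
	var p powerPayload
	if err := json.Unmarshal(data, &p); err != nil {
		return nil, err
	}
	var out []powerSupply
	for _, ps := range p.PowerSupplies {
		if ps.Status.State != "Enabled" {
			continue // skip absent or disabled bays
		}
		out = append(out, ps)
	}
	return out, nil
}

func main() {
	// Invented payload for illustration only.
	data := []byte(`{"PowerSupplies":[
		{"Name":"HpeServerPowerSupply","Status":{"State":"Enabled","Health":"OK"}},
		{"Name":"","Status":{"State":"Absent"}}
	]}`)
	supplies, err := enabledSupplies(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(supplies), supplies[0].Name)
}
```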

Exporter scrape config error (or unclear docs?)

@derrick-dacosta thanks for offering to take a look.

I have a pretty simple Docker-Compose sandbox environment used to test various Prometheus exporters. While I have a number of exporters configured, I'm not quite sure what I am doing wrong with Fishymetrics' exporter. My hunch is that there's a lack of clarity in the README for the scrape_config, and/or that I've configured it incorrectly.

Here are the relevant files:


docker-compose.yaml

services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    ...


  redfish:
    image: dmtf/redfish-mockup-server:latest
    ports:
      - "8000:8000"


  fishymetrics_exporter:
    image: comcast/fishymetrics
    ports:
      - "9533:9533"
    environment:
      - EXPORTER_PORT=9533
      - BMC_USERNAME=admin
      - BMC_PASSWORD=password
      - BMC_TIMEOUT=15s
      - BMC_SCHEME=http
      - LOG_LEVEL=debug

prometheus.yml

...
scrape_configs:
  - job_name: 'fishymetrics'
    static_configs:
      - targets: ['redfish:8000']
    metrics_path: /scrape

I can run the following example commands successfully:

  • cURL the Redfish container's /redfish API from my development machine:
    curl localhost:8000/redfish/v1/Chassis
  • cURL the Redfish container's /redfish API from a container on the same Docker network:
    docker run --rm --network sandbox_default curlimages/curl http://redfish:8000/redfish/v1/Chassis
  • cURL the Fishymetrics container's /scrape API from a container on the same Docker network:
    docker run --rm --network sandbox_default curlimages/curl fishymetrics_exporter:9533/scrape?target=http://redfish:8000

However, the Fishymetrics container itself is not successfully scraping the Redfish Mockup Server container. Its logs show the following info and error messages (I'm not copy/pasting them in full, but this should suffice):

"level":"info"..., "caller":"fishymetrics/main.go:134", "msg":"started scrape", "app":"fishymetrics", "host":"<fishymetrics_container_id>", "module":"", "model":"", "target":"http://redfish:8000/redfish/v1/Chassis", "credential_profile":"",...
"level":"error",..., "caller":"exporter/exporter.go:180", "msg":"error when getting chassis url from ", "app":"fishymetrics", "host":"<fishymetrics_container_id>", "error":"HTTP status 404"...
"level":"error",..., "caller":"fishymetrics/main.go:170", "msg":"failed to create chassis exporter", "app":"fishymetrics", "host":"<fishymetrics_container_id>", "error":"HTTP status 404"...
"level":"info",...,  "caller":"fishymetrics/main.go:80",  "msg":"finished handling", "app":"fishymetrics", "host":"<fishymetrics_container_id>", "module":"", "target":"http://redfish:8000/redfish/v1/Chassis", "sourceAddr":"172.x.x.5:12345", "method":"GET", "url":"/scrape?target=http://redfish:8000/redfish/v1/Chassis", "proto":"HTTP/1.1", "status":500...
"level":"error",..., "caller":"fishymetrics/main.go:134", "msg":"started scrape", "app":"fishymetrics", "host":"<fishymetrics_container_id>", "module":"", "model":"", "target":"http://redfish:8000/redfish/v1/Chassis/1U", "credential_profile":"",...
"level":"error",..., "caller":"fishymetrics/main.go:134", "msg":"started scrape","app":"fishymetrics", "host":"<fishymetrics_container_id>", "module":"", "model":"", "target":"http://redfish:8000/redfish/v1/Chassis/1U", "credential_profile":"",...

I noticed a whitespace character in the first error message: "error when getting chassis url from ", implying no URL is set? Hopefully that is a helpful clue.

DL360 Power & Drive metrics - Include support for iLO4

On DL360 servers with iLO4, the Redfish API response uses different names for certain fields. As a result, the exporter is unable to collect Drive, Power Supply and Thermal metrics.

Power Supply metrics:
Unable to get all the PSUs
Unable to get MemberId

# HELP dl360_power_supply_output Power supply output in watts
# TYPE dl360_power_supply_output gauge
dl360_power_supply_output{memberId="",sparePartNumber="xxxxxxxxx"} 0
# HELP dl360_power_supply_status Current power supply status 1 = OK, 0 = BAD
# TYPE dl360_power_supply_status gauge
dl360_power_supply_status{memberId="",sparePartNumber="xxxxxxxxx"} 1
# HELP dl360_power_supply_total_capacity Total output capacity of all the power supplies
# TYPE dl360_power_supply_total_capacity gauge
dl360_power_supply_total_capacity{memberId=""} xxx
# HELP dl360_power_supply_total_consumed Total output of all power supplies in watts
# TYPE dl360_power_supply_total_consumed gauge
dl360_power_supply_total_consumed{memberId=""} xxx

Thermal Metrics:
Unable to get all fans
Unable to get fan name

# HELP dl360_thermal_fan_speed Current fan speed in the unit of percentage, possible values are 0 - 100
# TYPE dl360_thermal_fan_speed gauge
dl360_thermal_fan_speed{name=""} 0
# HELP dl360_thermal_fan_status Current fan status 1 = OK, 0 = BAD
# TYPE dl360_thermal_fan_status gauge
dl360_thermal_fan_status{name=""} 1

Drive Metrics:
Unable to get Physical drive metrics

consolidate exporters into a single generic one

This will eliminate the need to have specific exporters for each hardware make and model. We will keep the HPE Moonshot exporter separate because its Redfish API differs the most from the other versions.

DL360 & XL420 - Add new metrics

Include metrics related to below components for both HP DL360 and XL420 server models.

Processor
iLO Self Test
Smart Storage Battery

Enhance Drive Metrics for HPE DL360 Gen 10 Servers

In its current state, FishyMetrics cannot gather drive metrics for physical disk drives or NVMe drives; it only collects logical drive information.

This issue has been opened to enhance drive metrics collection for HPE DL360 Gen 10 servers, aligning the metrics gathered with those of the HPE DL380 Gen 10 servers.

Update Memory and Storage controller Metrics

  1. Add "redfish_storage_controller_status" metrics to all HP servers.
  2. Update code to include responses from iLO6
  3. Ignore CPU metric if state is "Absent"
  4. Ignore "redfish_memory_status" metrics if "Health or Status" is not included in response. (seen in Cisco C225 models).
