
googlecloudplatform / gcpdiag


gcpdiag is a command-line diagnostics tool for GCP customers.

Home Page: https://gcpdiag.dev/

License: Apache License 2.0

Languages: Python 66.69%, Makefile 3.09%, Shell 0.47%, HCL 5.36%, Dockerfile 0.22%, SCSS 0.05%, HTML 20.49%, Smarty 0.28%, Jinja 3.36%
Topics: google-cloud-platform, gcp, devops-tools, linter, diagnostics, google-cloud

gcpdiag's Introduction

gcpdiag - Diagnostics for Google Cloud Platform


gcpdiag is a command-line diagnostics tool for GCP customers. It finds, and helps fix, common issues in Google Cloud Platform projects. It tests projects against a wide range of best practices and frequent mistakes, based on the troubleshooting experience of the Google Cloud Support team.

gcpdiag is open-source and contributions are welcome! Note that this is not an officially supported Google product, but a community effort. The Google Cloud Support team maintains this code and we do our best to avoid causing any problems in your projects, but we give no guarantees to that end.

[gcpdiag demo animation]

Installation

You can run gcpdiag using a shell wrapper that starts gcpdiag in a Docker container. This should work on any machine with Docker or Podman installed, including Cloud Shell.

curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag
chmod +x gcpdiag
./gcpdiag lint --project=MYPROJECT

Usage

Currently gcpdiag mainly supports one subcommand: lint, which is used to run diagnostics on one or more GCP projects.

usage:

gcpdiag lint --project P [OPTIONS]
gcpdiag lint --project P [--name faulty-vm --location us-central1-a --label key:value]

Run diagnostics in GCP projects.

optional arguments:
  -h, --help            show this help message and exit
  --auth-adc            Authenticate using Application Default Credentials (default)
  --auth-key FILE       Authenticate using a service account private key file
  --project P           Project ID of project to inspect
  --name n [n ...]      Resource Name(s) to inspect (e.g.: bastion-host,prod-*)
  --location R [R ...]  Valid GCP region/zone to scope inspection (e.g.: us-central1-a,us-central1)
  --label key:value     One or more resource labels as key-value pair(s) to scope inspection
                        (e.g.:  env:prod, type:frontend or env=prod type=frontend)
  --billing-project P   Project used for billing/quota of API calls done by gcpdiag (default is the inspected project, requires
                        'serviceusage.services.use' permission)
  --show-skipped        Show skipped rules
  --hide-ok             Hide rules with result OK
  --enable-gce-serial-buffer
                        Fetch serial port one output directly from the Compute API. Use this flag when not exporting
                        serial port output to cloud logging.
  --include INCLUDE     Include rule pattern (e.g.: `gke`, `gke/*/2021*`). Multiple patterns can be specified (comma-separated, or with multiple
                        arguments)
  --exclude EXCLUDE     Exclude rule pattern (e.g.: `BP`, `*/*/2022*`)
  --include-extended    Include extended rules. Additional rules might generate false positives (default: False)
  -v, --verbose         Increase log verbosity
  --within-days D       How far back to search logs and metrics (default: 3 days)
  --config FILE         Read configuration from FILE
  --logging-ratelimit-requests R
                        Configure rate limit for logging queries (default: 60)
  --logging-ratelimit-period-seconds S
                        Configure rate limit period for logging queries (default: 60 seconds)
  --logging-page-size P
                        Configure page size for logging queries (default: 500)
  --logging-fetch-max-entries E
                        Configure max entries to fetch by logging queries (default: 10000)
  --logging-fetch-max-time-seconds S
                        Configure timeout for logging queries (default: 120 seconds)
  --output FORMATTER    Format output as one of [terminal, json, csv] (default: terminal)

Authentication

gcpdiag supports authentication using multiple mechanisms:

  1. Application default credentials

    gcpdiag can use Cloud SDK's Application Default Credentials. This might require that you first run gcloud auth login --update-adc to update the cached credentials. This is the default in Cloud Shell because in that environment, ADC credentials are automatically provisioned.

  2. Service account key

    You can also use the --auth-key parameter to specify the private key of a service account.
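Under the hood, this corresponds to the standard service-account flow in google-auth; a minimal sketch, with an illustrative key path and scope (not gcpdiag's actual internals):

from google.oauth2 import service_account

# Load credentials from a service account key file, as --auth-key does.
credentials = service_account.Credentials.from_service_account_file(
    '/path/to/key.json',  # illustrative path
    scopes=['https://www.googleapis.com/auth/cloud-platform'])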

The authenticated principal needs, at a minimum, the following roles granted (both of them):

  • Viewer on the inspected project
  • Service Usage Consumer on the project used for billing/quota enforcement, which by default is the project being inspected, but can be set explicitly using the --billing-project option

The Editor and Owner roles include all the required permissions, but if you use service account authentication (--auth-key), we recommend granting that service account only Viewer and Service Usage Consumer.

Test Products, Classes, and IDs

Tests are organized by product, class, and ID.

The product is the GCP service that is being tested. Examples: GKE or GCE.

The class indicates what kind of test it is; currently we have:

Class name  Description
BP          Best practice, opinionated recommendations
WARN        Warnings: things that are possibly wrong
ERR         Errors: things that are very likely to be wrong
SEC         Potential security issues

The ID is currently formatted as YYYY_NNN, where YYYY is the year the test was written, and NNN is a counter. The ID must be unique per product/class combination.
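For illustration, the naming scheme product/CLASS/YYYY_NNN can be captured with a small regex; this pattern is inferred from the examples on this page, not taken from the gcpdiag source:

import re

# product is lowercase (gke, gcs, ...), class is one of the four above,
# and the ID is YYYY_NNN.
RULE_ID = re.compile(r'^[a-z]+/(BP|WARN|ERR|SEC)/\d{4}_\d{3}$')
assert RULE_ID.match('gke/ERR/2021_007')   # a rule ID seen later on this page
assert not RULE_ID.match('gke/ERR/21_7')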

Each test also has a short_description and a long_description. The short description is a statement about the good state that is being verified to be true (i.e. we don't test for errors, we test for compliance: that a problem is not present). For example: "Buckets are using uniform access".

Further Information

See http://gcpdiag.dev for more information.

gcpdiag's People

Contributors

aatalyk, abhigupta1207, acsediment, benkorlems, c202c, chakresh84, clsacramento, danelias, dawidmalina, ebenezergraham, eugenenuke, faripple, flacode, gbrayut, jacklandau, jmcculloch-google, junggil, kbhagi, kiwi-ru, kramarz, meatlink, miiiguel, ropeck, schweikert, songrx1997, surmaybhavsar, taylorjstacey, tomerlf1, vinay-vgs, vishnu-trace


gcpdiag's Issues

CloudSQL json dump not having maintenanceWindow

gcpdiag version: 0.60-test

I am trying to create a new rule for CloudSQL, where the snapshot test is failing due to an outdated JSON dump of the API response:

test-data/cloudsql1/json-dumps/cloudsql-instances.json

{
  "items": [
    {
      "kind": "sql#instance",
      "state": "RUNNABLE",
      "databaseVersion": "MYSQL_8_0",
      "settings": {
        "authorizedGaeApplications": [],
        "tier": "db-f1-micro",
        "kind": "sql#settings",
        "availabilityType": "ZONAL",
        "pricingPlan": "PER_USE",
        "replicationType": "SYNCHRONOUS",
        "activationPolicy": "ALWAYS",
        "ipConfiguration": {
          "privateNetwork": "projects/gcpdiag-cloudsql1-aaaa/global/networks/private-network",
          "authorizedNetworks": [],
          "ipv4Enabled": false,
          "requireSsl": false
        },
        "locationPreference": {
          "zone": "us-central1-a",
          "kind": "sql#locationPreference"
        },

This dump does not contain the maintenanceWindow key, which is why I get the following error in the snapshot test:
(Error: could not access 'maintenanceWindow' from path ('settings', 'maintenanceWindow', 'day'), got error: KeyError('maintenanceWindow'))

The actual API response does contain maintenanceWindow:

"settings": {
"activationPolicy": "ALWAYS",
"availabilityType": "ZONAL",
"backupConfiguration": {
"backupRetentionSettings": {
"retainedBackups": 7,
"retentionUnit": "COUNT"
},
"enabled": true,
"kind": "sql#backupConfiguration",
"location": "us",
"pointInTimeRecoveryEnabled": true,
"replicationLogArchivingEnabled": true,
"startTime": "23:00",
"transactionLogRetentionDays": 7
},
"connectorEnforcement": "NOT_REQUIRED",
"dataDiskSizeGb": "37",
"dataDiskType": "PD_SSD",
"databaseFlags": [
{
"name": "cloudsql.iam_authentication",
"value": "on"
}
],
"insightsConfig": {
"queryInsightsEnabled": true,
"queryPlansPerMinute": 5,
"queryStringLength": 1024
},
"ipConfiguration": {
"ipv4Enabled": false,
"privateNetwork": "projects/xxxxxxxx/global/networks/xxxxxxx"
},
"kind": "sql#settings",
"locationPreference": {
"kind": "sql#locationPreference",
"zone": "europe-west1-d"
},
"maintenanceWindow": {
"day": 0,
"hour": 0,
"kind": "sql#maintenanceWindow"
},

Can you please fix this so that I can create the PR for the new rule?
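Until the dump is regenerated, a rule could tolerate the missing key by walking the path with a default. A minimal sketch; the get_path helper below is hypothetical, named after the error message above, and not necessarily gcpdiag's actual API:

def get_path(resource_data, path, default=None):
    """Walk nested dict keys; return `default` if any key is missing."""
    current = resource_data
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# A dump without maintenanceWindow, as in the failing snapshot above:
instance = {'settings': {'tier': 'db-f1-micro'}}
print(get_path(instance, ('settings', 'maintenanceWindow', 'day')))  # None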

Make it possible to filter resources by resource name.

Currently it is only possible to filter by project ID, labels, and regions, but it would be useful, for example, to analyze only a specific GKE cluster. This is more complicated than it seems, because we would need to somehow include resources related to it (on the other hand, maybe we can just do substring matching).
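For what it's worth, glob-style matching over resource names, in the shape of the --name examples in the README above (bastion-host, prod-*), is nearly a one-liner with the standard library; a minimal sketch:

import fnmatch

resources = ['bastion-host', 'prod-web-1', 'prod-db-1', 'staging-vm']
patterns = ['bastion-host', 'prod-*']  # same shapes as the --name examples

# Keep any resource whose name matches at least one pattern.
selected = [r for r in resources
            if any(fnmatch.fnmatch(r, p) for p in patterns)]
print(selected)  # ['bastion-host', 'prod-web-1', 'prod-db-1']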

Optimize for automated use

Hi everyone,

I'm really liking gcpdiag, and it has uncovered a few issues with our infrastructure already that we were able to fix, so a big thanks for making it available :)

As it proved to be really useful to us, I wanted to go one step further and automate running it, meaning:

  1. run it automatically in a given interval (e.g. every day or every week)
  2. run it on all projects matching a given filter (e.g. all projects in a specific folder)
  3. get notified if any issue is found

While I was somewhat successful in that (running a cronjob in a GKE cluster with a custom python script that fetches all projects, runs gcpdiag on all of them and then sends an alert via a webhook in case there's any errors) I uncovered a few challenges along the way, so I wanted to share some thoughts and get your opinions on whether these are things you would consider implementing or be open to contributions:

  • Even with --hide-ok, the output is still quite verbose in a non-interactive terminal, as the "in progress" logs that are meant to disappear stay visible. It would be great to have a way to turn off any logging that is not an error or a warning.
  • A logging format option that logs JSON might make sense, which is easier to parse for e.g. Cloud Logging. I see that some parts of the code use print for logging, so that would probably need to change to the logging module as a first step, and then different format options could be introduced (see the sketch after this list).
  • An additional command, e.g. lint-all could be useful to lint multiple projects at once (would need additional permissions on the SA)
  • Add a way to send alerts on failure, though this might get quite complicated as people use different systems. Maybe this could be achieved by logging in JSON, and then providing a template for e.g. a log based metric + alert.
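For illustration, a minimal JSON formatter on top of the standard logging module, along the lines suggested above; the field names here are illustrative, not a proposal for gcpdiag's schema:

import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Emit one JSON object per log record.
        return json.dumps({
            'severity': record.levelname,
            'message': record.getMessage(),
            'logger': record.name,
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.WARNING, handlers=[handler])
logging.warning('rule failed: %s', 'gcs/BP/2022_001')
# {"severity": "WARNING", "message": "rule failed: gcs/BP/2022_001", "logger": "root"}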

Happy to hear your thoughts! :)

https://gcpdiag.dev/gcpdiag.sh not found

The installation method mentioned in the docs does not work currently:

$ curl https://gcpdiag.dev/gcpdiag.sh
<?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message></Error>%

interactively/automatically enable required APIs

Thanks for the tool, quite handy!

It'd be nice if, instead of failing, it'd just ask for permission and enable the required APIs by itself. I got prompted for three:

gcloud services enable serviceusage.googleapis.com --project=
gcloud services enable cloudresourcemanager.googleapis.com --project=
gcloud services enable iam.googleapis.com --project=

It's also missing from the README, I think.

gke/ERR/2023_005: Match strings too broad

The match strings used in the rule don't apply only to the Pod IP leakage error mentioned in the remediation recommendation, but also to other issues, such as a putEndpointIdTooManyRequests error, which seems to be negligible.

An example message for this kind of error:

E0828 10:59:06.457564    2003 kuberuntime_manager.go:782] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to setup network for sandbox \"685a6e6e5b2b4564b3fdb98598110bb62feffdfb50c4743f45a02d277122227d\": plugin type=\"cilium-cni\" failed (add): unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests " pod="test-namespace/test-pod"

How to `--output=json` in order to `| jq -r .`?

With my forehead slightly dented...

I'm unable to get gcpdiag --output=json to swallow stderr (which I assume is the non-JSON content) so that I can pipe the JSON into jq.

gcpdiag lint \
--project=${PROJECT} \
--output=json

This yields information (stderr?) combined with JSON (stdout?):

gcpdiag 0.59

Starting lint inspection (project: {PROJECT})...

[
{
  "rule": "bigquery/ERR/2022_001",
  ...
}
]

I've tried:

gcpdiag lint \
--project=${PROJECT} \
--output=json 2>/dev/null \
| jq -r .

and I tried fudging the script's USE__TTY (to drop --tty) but without success.
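Until the streams are separated, one workaround is to scan the combined output for the first parseable JSON array. A sketch, assuming the report is the only such array in the output:

import json
import sys

def extract_json_report(text):
    """Return the first parseable JSON array found in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find('[')
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text[start:])
            return obj
        except json.JSONDecodeError:
            start = text.find('[', start + 1)
    return None

report = extract_json_report(sys.stdin.read())
print(json.dumps(report, indent=2))

Saved as, say, extract_report.py (hypothetical name), it could be used as: ./gcpdiag lint --project=$PROJECT --output=json 2>&1 | python3 extract_report.py | jq -r .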

[cloudsql/WARN/2023_003]: MQL query wrong, leading to false positives

The cloudsql/WARN/2023_003 rule is using a wrong MQL query:

      fetch cloudsql_database
       | metric 'cloudsql.googleapis.com/database/memory/components'
       | group_by 6h, [value_components_aggregate: aggregate(value.components)]
       | filter metric.component = 'Usage'
       | every 6h
       | filter val() >= {MEM_USAGE_THRESHOLD}
       | {within_str}

which results in a result like this:
[screenshot: MQL query results, with values ranging into the thousands]

As you can see, by aggregating the values, the resulting value is always >90 (the value of {MEM_USAGE_THRESHOLD}), as it ranges into the thousands.

gcs/BP/2022_001 recommends uniform bucket-level access even when it is already enabled

The gcpdiag lint execution results are not as intended: the rule recommends using uniform bucket-level access even though the bucket configuration already uses uniform bucket-level access.

Execution result of gcpdiag lint

🔎  gcs/BP/2022_001: Buckets are using uniform access
   - BUCKET_NAME                               [FAIL]
     it is recommend to use uniform access on your bucket

   Google recommends using uniform access for a Cloud Storage bucket IAM policy
   https://cloud.google.com/storage/docs/access-
   control#choose_between_uniform_and_fine-grained_access

   https://gcpdiag.dev/rules/gcs/BP/2022_001

Commands to check settings and output results

gsutil uniformbucketlevelaccess get gs://BUCKET_NAME

Uniform bucket-level access setting for gs://BUCKET_NAME:
  Enabled: True
  LockedTime: 2023-01-01 03:04:30.427000+00:00

Avoid Service Usage Consumer role requirement

gcpdiag sets the billing project ID for API calls using the X-Goog-User-Project header, either to what was passed as --billing-project, or by default to the project that is being inspected. The reason for this is that otherwise the OAuth client project would be used for billing/quota enforcement.

Unfortunately, this creates a requirement to have the following permission in the inspected project: serviceusage.services.use. That permission is included in the following roles:

  • Owner
  • Editor
  • Service Usage Consumer

Note that Viewer is not enough, so we need to tell users that Viewer+Service Usage Consumer is required.

gcs/BP/2022_001: Google recommendation hyperlink breaks as split across lines

The recommendation for this rule breaks the hyperlink across lines:

   Google recommends using uniform access for a Cloud Storage bucket IAM policy
   https://cloud.google.com/storage/docs/access-
   control#choose_between_uniform_and_fine-grained_access

In (most) shells, hyperlinks are detected and become clickable, but the above behavior results in a 404 (https://cloud.google.com/storage/docs/access-).

Recommendation: do not split hyperlinks across lines.
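A minimal sketch of URL-aware wrapping: wrap the prose normally, but emit each URL on its own line, unbroken. This is purely illustrative, not gcpdiag's actual formatter:

import textwrap

def wrap_keep_urls(text, width=79, indent='   '):
    lines, buf = [], []
    for word in text.split():
        if word.startswith('http://') or word.startswith('https://'):
            if buf:  # flush the accumulated prose, wrapped normally
                lines.extend(textwrap.wrap(' '.join(buf), width=width,
                                           initial_indent=indent,
                                           subsequent_indent=indent))
                buf = []
            lines.append(indent + word)  # URL kept whole, even if over `width`
        else:
            buf.append(word)
    if buf:
        lines.extend(textwrap.wrap(' '.join(buf), width=width,
                                   initial_indent=indent,
                                   subsequent_indent=indent))
    return '\n'.join(lines)

print(wrap_keep_urls('Google recommends using uniform access for a Cloud '
                     'Storage bucket IAM policy '
                     'https://cloud.google.com/storage/docs/access-control'
                     '#choose_between_uniform_and_fine-grained_access'))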

iam.py: add support for groups

Currently iam.py can't resolve IAM groups, so, for example, if a service account is given certain permissions via a group, that won't be detected properly.
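Group resolution could build on the Cloud Identity API; a rough sketch assuming google-api-python-client and Application Default Credentials, not gcpdiag's actual implementation:

from googleapiclient.discovery import build

def get_group_members(group_email):
    service = build('cloudidentity', 'v1')
    # Resolve the group email to a group resource name (groups/{group_id}).
    group_name = service.groups().lookup(
        groupKey_id=group_email).execute()['name']
    members = []
    request = service.groups().memberships().list(parent=group_name)
    while request is not None:
        response = request.execute()
        members += [m['preferredMemberKey']['id']
                    for m in response.get('memberships', [])]
        request = service.groups().memberships().list_next(request, response)
    return members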

Crash on 0.60/0.61 if Compute Engine API not enabled

This was working fine on 0.59. When running gcpdiag on a project that hasn't enabled the Compute Engine API, an exception is raised and cannot be bypassed, even if I try to exclude all GCE rules.

$ gcpdiag lint --project=myproject --exclude=gce

gcpdiag 🩺  0.61

Starting lint inspection (project: myproject)...

   ... fetching metadata of project myproject

Traceback (most recent call last):
  File "/opt/gcpdiag/gcpdiag/queries/gce.py", line 763, in get_project_metadata
    response = query.execute(num_retries=config.API_RETRIES)
  File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/googleapiclient/http.py", line 938, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 403 when requesting https://compute.googleapis.com/compute/v1/projects/myproject?alt=json returned "Compute Engine API has not been used in project 1234567890 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=1234567890 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.". Details: "[{'message': 'Compute Engine API has not been used in project 1234567890 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=1234567890 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.', 'domain': 'usageLimits', 'reason': 'accessNotConfigured', 'extendedHelp': 'https://console.developers.google.com'}]">

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/gcpdiag/bin/gcpdiag", line 70, in <module>
    main(sys.argv)
  File "/opt/gcpdiag/bin/gcpdiag", line 43, in main
    lint_command.run(argv)
  File "/opt/gcpdiag/gcpdiag/lint/command.py", line 342, in run
    if not gce.is_project_serial_port_logging_enabled(context.project_id) and \
  File "/opt/gcpdiag/gcpdiag/queries/gce.py", line 959, in is_project_serial_port_logging_enabled
    value = get_project_metadata(
  File "/opt/gcpdiag/gcpdiag/caching.py", line 155, in _cached_api_call_wrapper
    result = func(*args, **kwargs)
  File "/opt/gcpdiag/gcpdiag/queries/gce.py", line 765, in get_project_metadata
    raise utils.GcpApiError(err) from err
gcpdiag.utils.GcpApiError: Compute Engine API has not been used in project 1234567890 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=1234567890 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.

[WARNING] Encountered 403 Forbidden with reason "PERMISSION_DENIED"
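One way to avoid the crash might be a guard built on a helper that appears elsewhere on this page (apis.is_enabled shows up in the sqlite3 traceback further down); a rough sketch, with call sites and return semantics as assumptions:

from gcpdiag.queries import apis, gce

def serial_logging_enabled(project_id):
    # Skip the Compute metadata fetch entirely when the API is disabled,
    # instead of letting get_project_metadata raise GcpApiError.
    if not apis.is_enabled(project_id, 'compute'):
        return False
    return gce.is_project_serial_port_logging_enabled(project_id)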

Crash when Identity and Access Management (IAM) API is not enabled.

 OS Config service account has the required permissions.
[WARNING] Encountered 403 Forbidden with reason "PERMISSION_DENIED"
   ... fetching IAM roles: projects/my-project
   ... executing monitoring query (project: my-project)
Traceback (most recent call last):
  File "/opt/gcpdiag/gcpdiag/queries/iam.py", line 221, in get_project_policy
    return ProjectPolicy(project_id)
  File "/opt/gcpdiag/gcpdiag/queries/iam.py", line 212, in __init__
    self._custom_roles = _fetch_roles('projects/' + self._project_id,
  File "/opt/gcpdiag/gcpdiag/queries/iam.py", line 58, in _fetch_roles
    response = request.execute(num_retries=config.API_RETRIES)
  File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/googleapiclient/_helpers.py", line 131, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/googleapiclient/http.py", line 937, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 403 when requesting https://iam.googleapis.com/v1/projects/my-project/roles?view=FULL&alt=json returned "Identity and Access Management (IAM) API has not been used in project 1234567890 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/iam.googleapis.com/overview?project=1234567890 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.". Details: "[{'@type': 'type.googleapis.com/google.rpc.Help', 'links': [{'description': 'Google developers console API activation', 'url': 'https://console.developers.google.com/apis/api/iam.googleapis.com/overview?project=1234567890'}]}, {'@type': 'type.googleapis.com/google.rpc.ErrorInfo', 'reason': 'SERVICE_DISABLED', 'domain': 'googleapis.com', 'metadata': {'consumer': 'projects/1234567890', 'service': 'iam.googleapis.com'}}]">

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/gcpdiag/bin/gcpdiag", line 65, in <module>
    main(sys.argv)
  File "/opt/gcpdiag/bin/gcpdiag", line 43, in main
    lint_command.run(argv)
  File "/opt/gcpdiag/gcpdiag/lint/command.py", line 194, in run
    exit_code = repo.run_rules(context, report, include_patterns,
  File "/opt/gcpdiag/gcpdiag/lint/__init__.py", line 351, in run_rules
    rule.prefetch_rule_future.result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/gcpdiag/gcpdiag/lint/gce/err_2021_002_osconfig_perm.py", line 34, in prefetch_rule
    iam.get_project_policy(pid)
  File "/opt/gcpdiag/gcpdiag/caching.py", line 145, in _cached_api_call_wrapper
    return lru_cached_func(*args, **kwargs)
  File "/opt/gcpdiag/gcpdiag/queries/iam.py", line 223, in get_project_policy
    raise utils.GcpApiError(err) from err
gcpdiag.utils.GcpApiError: can't fetch data, reason: Identity and Access Management (IAM) API has not been used in project 1234567890 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/iam.googleapis.com/overview?project=1234567890 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
[WARNING] Encountered 403 Forbidden with reason "PERMISSION_DENIED"
[WARNING] Encountered 403 Forbidden with reason "PERMISSION_DENIED"
[WARNING] Encountered 403 Forbidden with reason "PERMISSION_DENIED"
   ... still fetching logs (project: my-project, resource type: k8s_node, max wait: 112s)%

Running gcpdiag - recommendations for resource limitations

It would be good to have some info about the expected resource usage of running gcpdiag (how much memory the process may consume, or network bandwidth, for example). This might scale with the number of resources in the project being analyzed, so a few examples may be necessary.

Also, please include best practices for choosing an environment to run gcpdiag, considering:

  • Cloud Shell would work for most cases
  • if the user needs to limit the resource usage (run within a container for example)
  • whether or not to run on a production machine vs a throw-away VM.

Error: sqlite3.OperationalError: unable to open database file

Hi, I tried using this for the first time on Ubuntu 20.04 and got this error.

$ curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag
$ chmod +x gcpdiag
$ ./gcpdiag lint --project=MYPROJECT
Unable to find image 'us-docker.pkg.dev/gcpdiag-dist/release/gcpdiag:0.55' locally
0.55: Pulling from gcpdiag-dist/release/gcpdiag
<DOCKER PULL STUFF>
Digest: sha256:0b5fcc0fd3e2f1b822cec492b0128f7e1df5173c19990570ee072c80cf6164c4
Status: Downloaded newer image for us-docker.pkg.dev/gcpdiag-dist/release/gcpdiag:0.55
gcpdiag 🩺 0.55

Starting lint inspection (project: MYPROJECT)...

Traceback (most recent call last):
  File "/opt/gcpdiag/bin/gcpdiag", line 64, in <module>
    main(sys.argv)
  File "/opt/gcpdiag/bin/gcpdiag", line 42, in main
    lint_command.run(argv)
  File "/opt/gcpdiag/gcpdiag/lint/command.py", line 237, in run
    apis.verify_access(context.project_id)
  File "/opt/gcpdiag/gcpdiag/queries/apis.py", line 253, in verify_access
    if not is_enabled(project_id, 'cloudresourcemanager'):
  File "/opt/gcpdiag/gcpdiag/queries/apis.py", line 246, in is_enabled
    return f'{service_name}.googleapis.com' in _list_apis(project_id)
  File "/opt/gcpdiag/gcpdiag/caching.py", line 145, in _cached_api_call_wrapper
    return lru_cached_func(*args, **kwargs)
  File "/opt/gcpdiag/gcpdiag/queries/apis.py", line 230, in _list_apis
    serviceusage = get_api('serviceusage', 'v1', project_id)
  File "/opt/gcpdiag/gcpdiag/caching.py", line 145, in _cached_api_call_wrapper
    return lru_cached_func(*args, **kwargs)
  File "/opt/gcpdiag/gcpdiag/queries/apis.py", line 197, in get_api
    credentials = _get_credentials()
  File "/opt/gcpdiag/gcpdiag/queries/apis.py", line 160, in _get_credentials
    return _get_credentials_oauth()
  File "/opt/gcpdiag/gcpdiag/queries/apis.py", line 119, in _get_credentials_oauth
    with caching.get_cache() as diskcache:
  File "/opt/gcpdiag/gcpdiag/caching.py", line 60, in get_cache
    _cache = diskcache.Cache(config.CACHE_DIR, tag_index=True)
  File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/diskcache/core.py", line 456, in __init__
    sql = self._sql_retry
  File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/diskcache/core.py", line 652, in _sql_retry
    sql = self._sql
  File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/diskcache/core.py", line 648, in _sql
    return self._con.execute
  File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/diskcache/core.py", line 623, in _con
    con = self._local.con = sqlite3.connect(
sqlite3.OperationalError: unable to open database file

Error for rule dataproc/warn_2022_002_job_throttling_rate_limit

Traceback (most recent call last):
  File "/opt/gcpdiag/bin/gcpdiag", line 70, in <module>
    main(sys.argv)
  File "/opt/gcpdiag/bin/gcpdiag", line 43, in main
    lint_command.run(argv)
  File "/opt/gcpdiag/gcpdiag/lint/command.py", line 290, in run
    repo.run_rules(context)
  File "/opt/gcpdiag/gcpdiag/lint/__init__.py", line 495, in run_rules
    self.execution_strategy.run_rules(context, self.result, rules_to_run)
  File "/opt/gcpdiag/gcpdiag/lint/__init__.py", line 575, in run_rules
    rule.run_rule_f(context, rule_report)
  File "/opt/gcpdiag/gcpdiag/lint/dataproc/warn_2022_002_job_throttling_rate_limit.py", line 73, in run_rule
    clusters_with_throttling = get_clusters_having_relevant_log_entries(context)
  File "/opt/gcpdiag/gcpdiag/lint/dataproc/warn_2022_002_job_throttling_rate_limit.py", line 61, in get_clusters_having_relevant_log_entries
    return {
  File "/opt/gcpdiag/gcpdiag/lint/dataproc/warn_2022_002_job_throttling_rate_limit.py", line 62, in <setcomp>
    e.cluster_name
AttributeError: 'dict' object has no attribute 'cluster_name'

"docker: invalid reference format." during Github Actions usage

I tried to use it with GitHub Actions. Here is a minimal setup to reproduce (I know configs like credentials are missing, but they're not required for the reproduction):

name: GCP Diag

on:
  push:

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
    - name: Run GCP diag
      run: |
        curl https://gcpdiag.dev/gcpdiag.sh > gcpdiag
        chmod +x gcpdiag
        ./gcpdiag lint --project=dummy-non-existing-project

Output:

Run
  curl https://gcpdiag.dev/gcpdiag.sh > gcpdiag
  chmod +x gcpdiag
  ./gcpdiag lint --project=dummy-non-existing-project
  shell: /usr/bin/bash -e {0}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  4382  100  4382    0     0  20381      0 --:--:-- --:--:-- --:--:-- 20381
docker: invalid reference format.
See 'docker run --help'.
Error: Process completed with exit code 125.

I think "docker: invalid reference format." isn't an intended error message.

Running gcpdiag

I am trying to run gcpdiag on my local Mac, and I used the Docker option to download the shell script. I have the Docker daemon running on my local machine, but I am not able to execute any of the commands from the Mac terminal.

./gcpdiag lint --project
ERROR: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})

Wrapper doesn't work on macOS

Here's what made it work on macOS for me:

--- gcpdiag.orig        2021-10-06 13:52:31.000000000 +0200
+++ gcpdiag     2021-10-06 13:54:14.000000000 +0200
@@ -1,7 +1,7 @@
 #!/bin/bash
 set -e
 THIS_WRAPPER_VERSION=0.6
-source <(curl -sf https://storage.googleapis.com/gcpdiag/release-version|grep -Ei '^\w*=[0-9a-z/\._-]*$')
+eval $(curl -sf https://storage.googleapis.com/gcpdiag/release-version|grep -Ei '^\w*=[0-9a-z/\._-]*$')
 if [[ $THIS_WRAPPER_VERSION != $WRAPPER_VERSION ]]; then
   echo
   echo "## ERROR:"

[gcs/BP/2022_001] KeyError: 'iamConfiguration'

gcpdiag version: 0.55

It seems that some buckets do not have an iamConfiguration field, which makes gcpdiag crash with an exception (this seems to happen only on old buckets).

Ideally, gcpdiag should test whether the iamConfiguration key is defined, and return a default value when it is not.

Python exception:

Traceback (most recent call last):
  File "/opt/gcpdiag/bin/gcpdiag", line 64, in <module>
    main(sys.argv)
  File "/opt/gcpdiag/bin/gcpdiag", line 42, in main
    lint_command.run(argv)
  File "/opt/gcpdiag/gcpdiag/lint/command.py", line 240, in run
    exit_code = repo.run_rules(context, report, include_patterns,
  File "/opt/gcpdiag/gcpdiag/lint/__init__.py", line 367, in run_rules
    rule.run_rule_f(context, rule_report)
  File "/opt/gcpdiag/gcpdiag/lint/gcs/bp_2022_001_bucket_access_uniform.py", line 37, in run_rule
    elif b.is_uniform_access():
  File "/opt/gcpdiag/gcpdiag/queries/gcs.py", line 48, in is_uniform_access
    return self._resource_data['iamConfiguration']['uniformBucketLevelAccess'][
KeyError: 'iamConfiguration'

Bucket JSON returned by the API for the failing bucket:

{
  "kind": "storage#bucket",
  "selfLink": "https://www.googleapis.com/storage/v1/b/mybucket",
  "id": "mybucket",
  "name": "mybucket",
  "projectNumber": "00000000000",
  "metageneration": "9",
  "location": "US",
  "storageClass": "STANDARD",
  "etag": "CAk=",
  "timeCreated": "2016-07-12T15:05:45.473Z",
  "updated": "2022-06-22T10:25:28.219Z",
  "locationType": "multi-region",
  "rpo": "DEFAULT"
}

Bucket JSON returned by the API for a working bucket:

{
  "kind": "storage#bucket",
  "selfLink": "https://www.googleapis.com/storage/v1/b/mybucket",
  "id": "mybucket",
  "name": "mybucket",
  "projectNumber": "00000000000",
  "metageneration": "3",
  "location": "EU",
  "storageClass": "STANDARD",
  "etag": "CAM=",
  "defaultEventBasedHold": false,
  "timeCreated": "2021-04-06T10:19:45.615Z",
  "updated": "2021-04-06T13:10:55.598Z",
  "iamConfiguration": {
    "bucketPolicyOnly": {
      "enabled": false
    },
    "uniformBucketLevelAccess": {
      "enabled": false
    },
    "publicAccessPrevention": "inherited"
  },
  "locationType": "multi-region",
  "satisfiesPZS": false,
  "rpo": "DEFAULT"
}
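A possible fix, sketched against the method shown in the traceback above: fall back to False whenever 'iamConfiguration' is absent.

def is_uniform_access(self) -> bool:
    # Old buckets may lack 'iamConfiguration' entirely; default to False.
    return self._resource_data \
        .get('iamConfiguration', {}) \
        .get('uniformBucketLevelAccess', {}) \
        .get('enabled', False)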

Intermittent warning/skips on Composer rules

This was working fine on 0.59; now there seems to be some lock issue on Composer API queries. Here's the command used:

gcpdiag lint --project=my-project --include=composer

At the start of the command, the following log will be temporarily visible on the console, before the results are printed.

... still fetching logs (project: my-project, resource type: cloud_composer_environment, max wait: 117s)

After two minutes, some of the checks are skipped because of (I guess) a timeout:

🔎  composer/BP/2023_001: Cloud Composer logging level is set to INFO
   - my-project/europe-west1/my-composer               [ OK ]

[WARNING] RuntimeError: Couldn't acquire lock for get_environments. while processing rule: composer/BP/2023_002 
[WARNING] RuntimeError: Couldn't acquire lock for get_environments. while processing rule: composer/BP/2023_003 
🔎  composer/ERR/2022_001: Composer Service Agent permissions
   - my-project                                        [ OK ]

[WARNING] RuntimeError: Couldn't acquire lock for get_environments. while processing rule: composer/ERR/2022_002 
[WARNING] RuntimeError: Couldn't acquire lock for get_environments. while processing rule: composer/ERR/2023_001 
🔎  composer/WARN/2022_001: Composer Service Agent permissions for Composer 2.x
   - my-project                                        [ OK ]

[WARNING] RuntimeError: Couldn't acquire lock for get_environments. while processing rule: composer/WARN/2022_002 
[WARNING] RuntimeError: Couldn't acquire lock for get_environments. while processing rule: composer/WARN/2022_003 
[WARNING] RuntimeError: Couldn't acquire lock for get_environments. while processing rule: composer/WARN/2023_001 
[WARNING] RuntimeError: Couldn't acquire lock for get_environments. while processing rule: composer/WARN/2023_002 
[WARNING] RuntimeError: Couldn't acquire lock for get_environments. while processing rule: composer/WARN/2023_003 
🔎  composer/WARN/2023_004: Cloud Composer database CPU usage does not exceed 80%
   - my-project/europe-west1/my-composer               [ OK ]

🔎  composer/WARN/2023_005: Cloud Composer is consistently in healthy state
   - my-project/europe-west1/my-composer               [ OK ]

🔎  composer/WARN/2023_006: Airflow schedulers are healthy for the last hour
   - my-project/europe-west1/my-composer               [ OK ]

🔎  composer/WARN/2023_007: Cloud Composer Scheduler CPU limit exceeded.
   - my-project/europe-west1/my-composer               [ OK ]

🔎  composer/WARN/2023_008: Cloud Composer Airflow database is in healthy state
   - my-project/europe-west1/my-composer               [ OK ]

Rules summary: 9 skipped, 8 ok, 0 failed

If I launch the command with a more restrictive parameter, like --include=composer/WARN, this still happens although less frequently.

Publish to PyPI

Please publish to PyPI so gcpdiag can be installed with pip (or pipx), and so that the community can build distro packages out of it (Homebrew, AUR, other Linux distros).

Reasoning:
Current installation instructions suggest installing this tool with curl. A more sophisticated distribution method would be to let distros build their own packaging, but a base requirement for that is publishing this tool to PyPI (so that it follows common Python packaging standards).

Packaging tools (like Homebrew) make it easier to package up tools written in Python if they are installable through PyPI.

`--billing-project` argument not working anymore since 0.59

I run gcpdiag lint by setting the --billing-project:

$ gcpdiag lint --billing-project <PROJECT_WITH_BILLING_ENABLED> --project <SERVICE_PROJECT>

This worked well up until v0.58, but is currently broken with v0.59:

$ ./gcpdiag lint --billing-project <PROJECT_WITH_BILLING_ENABLED> --project <SERVICE_PROJECT>
WARNING:googleapiclient.http:Encountered 403 Forbidden with reason "PERMISSION_DENIED"
Traceback (most recent call last):
  File "/opt/gcpdiag/bin/gcpdiag", line 70, in <module>
    main(sys.argv)
  File "/opt/gcpdiag/bin/gcpdiag", line 43, in main
    lint_command.run(argv)
  File "/opt/gcpdiag/gcpdiag/lint/command.py", line 232, in run
    project = crm.get_project(args.project)
  File "/opt/gcpdiag/gcpdiag/caching.py", line 155, in _cached_api_call_wrapper
    result = func(*args, **kwargs)
  File "/opt/gcpdiag/gcpdiag/queries/crm.py", line 77, in get_project
    response = request.execute(num_retries=config.API_RETRIES)
  File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/gcpdiag/.venv/lib/python3.9/site-packages/googleapiclient/http.py", line 938, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 403 when requesting https://cloudresourcemanager.googleapis.com/v3/projects/<SERVICE_PROJECT>?alt=json returned "Caller does not have required permission to use project <SERVICE_PROJECT>. Grant the caller the roles/serviceusage.serviceUsageConsumer role, or a custom role with the serviceusage.services.use permission, by visiting https://console.developers.google.com/iam-admin/iam/project?project=<SERVICE_PROJECT> and then retry. Propagation of the new permission may take a few minutes.". Details: "[{'@type': 'type.googleapis.com/google.rpc.Help', 'links': [{'description': 'Google developer console IAM admin', 'url': 'https://console.developers.google.com/iam-admin/iam/project?project=<SERVICE_PROJECT>'}]}, {'@type': 'type.googleapis.com/google.rpc.ErrorInfo', 'reason': 'USER_PROJECT_DENIED', 'domain': 'googleapis.com', 'metadata': {'service': 'cloudresourcemanager.googleapis.com', 'consumer': 'projects/<SERVICE_PROJECT>'}}]">

As you can see, it tries to use <SERVICE_PROJECT> as the billing project even though I explicitly set it to another one.
For now I'm using 0.58 again and it works as expected.
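For background, the X-Goog-User-Project mechanism (see "Avoid Service Usage Consumer role requirement" above) is what google-auth calls a quota project; a minimal sketch of the mechanism, not gcpdiag's actual wiring:

import google.auth

credentials, _ = google.auth.default()
# API requests made with these credentials will carry the header
# X-Goog-User-Project: my-billing-project
credentials = credentials.with_quota_project('my-billing-project')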

don't warn about GKE for projects not using it

I get

🔎  gke/ERR/2021_007: GKE service account permissions.
   - xyz                                                        [FAIL]
     service account: [email protected]
     missing role: roles/container.serviceAgent

   Verify that the Google Kubernetes Engine service account exists and has the
   Kubernetes Engine Service Agent role on the project.

   https://gcpdiag.dev/rules/gke/ERR/2021_007

even for projects that don't use GKE. It'd be nice if the tool checked whether the corresponding API is enabled and adjusted the applied rules accordingly.

Output format issue

Hello!

I have an issue with adjusting the required output format. In the documentation I've noticed that gcpdiag has support for multiple output formats (json, csv, terminal). But whenever I try to issue gcpdiag lint --output json --project PROJECT, I receive the following error: gcpdiag lint: error: unrecognized arguments: --output json. Could you please explain what the issue might be here?

Regards

dataproc/WARN/2023_001


title: "dataproc/WARN/2023_001"
linkTitle: "WARN/2023_001"
weight: 1
type: docs
description: >
Concurrent Job limit was not exceeded

Product: Cloud Dataproc
Rule class: WARN - Something that is possibly wrong

Description

If the Dataproc agent reaches the concurrent job submission limit, Dataproc job scheduling delays can be observed.

Remediation

The maximum number of concurrent jobs, based on master VM memory, is exceeded (the job driver runs on the Dataproc cluster master VM). By default, Dataproc reserves 3.5 GB of memory for applications and allows 1 job per GB. Set the dataproc:dataproc.scheduler.max-concurrent-jobs cluster property to a value suited to your job requirements.

Further information

Troubleshoot job delays

gcs/BP/2022_001: Propose "Buckets are **not** using uniform access"

Very interesting project, thank you!

This rule would be better phrased as "Buckets not using uniform access".

The rules check for the incorrect behavior.

In this case, linting produced the following output, which appears inconsistent (if the buckets are using uniform access, why the [FAIL]s?). The reason is that the buckets are **not** using uniform access (and should be):

🔎  gcs/BP/2022_001: Buckets are using uniform access
   - {bucket}                                                                         [FAIL]
     it is recommend to use uniform access on your bucket
   - {bucket}                                                                         [FAIL]
     it is recommend to use uniform access on your bucket
