Giter Site home page Giter Site logo

csp-billing-adapter's People

Contributors

brett060102 avatar guangyee avatar jesusbv avatar keithmnemonic avatar rjschwei avatar rtamalin avatar smarlowucf avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

csp-billing-adapter's Issues

More detailed error/exception info

The adapter has an errors list and a single boolean value to denote if there is an issue with metered billing. However, there are a number of issues that could come up and the csp config could be more verbose on what error is occurring. This could possible be with more boolean values to cover other cases.

Consider exposing the whole config less

Currently the config file and thus structure is exposed to almost every function and layer of the app. It may be worthwhile to limit the exposure of the config where possible by passing values from config instead of the whole config file. This would help keep different plugins and/functions decoupled.

Finish config file module

A config module is started. However, given the changes in the config file structure it may or may not be as useful as planned. The current implementation takes a yaml file and turns it into a class instance with dot notation. It also accepts default values from plugins. With more complex billing data in config it may not be as useful to use dot notation as it may be more intuitive to have a dictionary. Such as to find a specific billing type using a key rather than iteration or getattr.

  • Decide on final architecture for class
  • Cleanup code and add fault tolerance
  • Add tests
  • Integration testing

https://github.com/SUSE-Enceladus/csp-billing-adapter/blob/devel/csp_billing_adapter/config.py

Exception with retry or botocore

root@ubuntu1604-JFrog:~/aws# kk logs neuvector-csp-pod-56ccccfb7b-trbm6
2023-05-19T20:59:47.117|INFO|CSPBillingAdapter|CSPBillingAdapter logging setup complete
2023-05-19T20:59:47.123|INFO|CSPBillingAdapter|Config loaded: {'version': '0.0.1', 'billing_interval': 'test', 'query_interval': 60, 'product_code': '8yq550xpepyhn6uq8f4pmwv6e', 'usage_metrics': {'managed_node_count': {'usage_aggregation': 'average', 'consumption_reporting': 'volume', 'minimum_consumption': 5, 'dimensions': [{'dimension': 'tier_1', 'min': 0, 'max': 15}, {'dimension': 'tier_2', 'min': 16, 'max': 50}, {'dimension': 'tier_3', 'min': 51, 'max': 100}, {'dimension': 'tier_4', 'min': 101, 'max': 250}, {'dimension': 'tier_5', 'min': 251, 'max': 1000}, {'dimension': 'tier_6', 'min': 1001}]}}, 'reporting_api_is_cumulative': False, 'reporting_interval': 60}
2023-05-19T20:59:47.123|INFO|CSPBillingAdapter|Loaded in cluster config.
2023-05-19T20:59:47.328|INFO|CSPBillingAdapter|Adapter setup complete
2023-05-19T20:59:47.685|ERROR|CSPBillingAdapter|Failed to meter bill: An error occurred (TimestampOutOfBoundsException) when calling the MeterUsage operation: Timestamp 2023-05-19T01:02:36Z is out of allowed range
2023-05-19T20:59:47.685|ERROR|CSPBillingAdapter|CSP Billing Adapter error: Fatal error while validating metering API access: catching classes that do not inherit from BaseException is not allowed
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/csp_billing_adapter/utils.py", line 233, in retry_on_exception
    return func_call()
  File "/usr/lib/python3.6/site-packages/pluggy/hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/lib/python3.6/site-packages/pluggy/manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/lib/python3.6/site-packages/pluggy/manager.py", line 87, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "/usr/lib/python3.6/site-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/usr/lib/python3.6/site-packages/pluggy/callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/lib/python3.6/site-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/lib/python3.6/site-packages/csp_billing_adapter_amazon/plugin.py", line 86, in meter_billing
    raise exc  # Re-raise exception to calling scope
  File "/usr/lib/python3.6/site-packages/csp_billing_adapter_amazon/plugin.py", line 74, in meter_billing
    DryRun=dry_run
  File "/usr/lib/python3.6/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/lib/python3.6/site-packages/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.TimestampOutOfBoundsException: An error occurred (TimestampOutOfBoundsException) when calling the MeterUsage operation: Timestamp 2023-05-19T01:02:36Z is out of allowed range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/csp_billing_adapter/adapter.py", line 299, in main
    func_name="hook.meter_billing"
  File "/usr/lib/python3.6/site-packages/csp_billing_adapter/utils.py", line 234, in retry_on_exception
    except exceptions as error:
TypeError: catching classes that do not inherit from BaseException is not allowed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/csp_billing_adapter/adapter.py", line 311, in main
    f'Fatal error while validating metering API access: {error}'
csp_billing_adapter.exceptions.CSPBillingAdapterException: Fatal error while validating metering API access: catching classes that do not inherit from BaseException is not allowed

Standardize errors in csp-config

Currently the errors list in csp-config is expected to be a list of strings. However, some error strings contain a mix of text and marshalled json. This format is not ingestible in a way that it could be easily parsed.

Errors should be either a marshalled json string with a standard set of keys such as title, description. Or they should all be flat strings with no structured data.

Implement Microsoft Plugin

Start/Create Microsoft meter billing plugin. Expectation is to follow the pluggy plugin style and implement the csp billing hooks that are defined in hookspec. Naming should reference the provider not the platform. E.g. microsoft not Azure.

Also, the naming would match other plugins and exist in SUSE-Enceladus. And we want to follow the same project setup as other repos. Such as the amazon plugin. This will simplify packaging workflow.

https://github.com/SUSE-Enceladus/csp-billing-adapter-microsoft (To be created)

hook specs and implementations shouldn't use default argument values

When implementing tests for the in-memory cache implemented in #29 I discovered that if we specify default values for the hook specs and implementation, those values seem to get used even when an explicit value us provided for the associated argumenmt, e.g.

When the update_cache hookspec is:

@hookspec(firstresult=True)
def update_cache(config: Config, cache: dict, replace: bool = False):
    ...

and the update_cache hookimpl is:

@csp_billing_adapter.hookimpl(trylast=True)
def update_cache(config: Config, cache: dict, replace: bool = False):
    ...

and the hook call is:

cba_pm.hook.update_cache(config=cba_config, cache=test_data2, replace=True)

when I instrumented the hookimpl to print out the arguments it is called with I see:

update_cache(config={'version': 1, 'billing_interval': 30, 'metering_interval': 30}, cache={'c': 3, 'd': 4}, replace=False)

I.e. the default value for replace in the spec & implementation (which have to match to avoid complaints about a mismatch) is being used rather than the supplied replace value in the hook call.

On initial startup adapter no error

If the adapter is starting up for the first time and no usage data exists yet the status should not be affected.

Adapter would not log an error in this case. However, this would only cover initial deploy. If adapter is started up again in an environment where config map already exists then the error case is valid.

Provide a flow diagram for error cases

More context:

I think we are making the live for support and consumers of our information difficult:

"""
However, support will be given more information about the cluster in support documents. Thus if adapter last billed a min amount yet the cluster under support has more nodes this will be visible. Couple that with the errors, one of which stating an issue with usage data retrieval there's enough information to deduce something is wrong and to some degree what caused it.
"""

This has the potential to trigger lots of questions to us.

"""
There's already a number of mechanisms to cover this:

Adapter is expected to always run. It updates the timestamp and expiry date on every run. This ensures the adapter is running. If date is beyond expiry date the adapter is in failed state.
The last bill details are provided. It's expected that the adapter bills usage every cycle. If the last bill is beyond the end of cycle then something is wrong with billing.
The billing details also provides a way to validate what's getting billed.
Then the state and errors list provide more fine grained details if there is a mixed state. Such as usage not accessible but last metering was okay.
"""

this sounds a lot like an if-then-else condition and we try to avoid those in our own code, why would we inflict this on someone else?

Can we make a state diagram of the failure modes, there should be relatively few

usage data retrieval
usage data storage
billing API access
...?
And then layer in how we expect NV to behave and how we want to communicate those conditions out to NV? Also to consider in the state diagram, could probably even be a simple table, is the temporal aspect, i.e. do we want to behave differently if any of the failure mode happens 10 times (or some other arbitrary number) in a row?

Itemized billing details

Consider itemizing the billing into allocations such as splitting out each cluster and usage for multi-cluster setups. This can provide more granular details about how the usage is being aggregated.

Implement Google Plugin

Start/Create Google meter billing plugin. Expectation is to follow the pluggy plugin style and implement the csp billing hooks that are defined in hookspec. Naming should reference the provider not the platform. E.g. google not GCP.

Also, the naming would match other plugins and exist in SUSE-Enceladus. And we want to follow the same project setup as other repos. Such as the amazon plugin. This will simplify packaging workflow.

SUSE-Enceladus/csp-billing-adapter-google

Add logging and error handling

  • Ensure any actions are logged such as metering and billing.
  • Handle any expected exception cases with try/except. Add error message to configmap and set status accordingly and continue in event loop if necessary.

Queue datastore updates and save once

Currently the adapter saves changes to data multiple times in each event loop run. There are advantages and disadvantages to this. With multiple data updates it increases the chance of failure. Every call could hit an API or file system issue. With one data update at the end it increases the risk of data loss. One call holds all the data updates and thus if it fails everything is lost.

Marked as an enhancement for now as I think this needs more discussion on how to handle this and/or if we should go a certain path.

Cleanup billing functions

There are functions in progress for calculating usage for billing. These are in progress and need to be finished, cleaned up and organized. Ideally the core algorithms could be organized in one module. If possible leverage pluggy to make things pluggable were it makes sense.

Note this will be affected by changes to the config file structure.

In question:

  • get_max_usage
  • get_average_usage
  • get_billable_usage
  • get_bulk_dimensions
  • get_billing_dimensions

https://github.com/SUSE-Enceladus/csp-billing-adapter/blob/devel/csp_billing_adapter/csp_cache.py

Handle no matching dimension in volume pricing

In the volume pricing model if a matching dimension is not found an exception needs to be raised and bubbled up to the config map with details on the issue and failed billing state. Since the adapter cannot bill usage if there's not a matching dimension and it shouldn't try to guess at what dimension to bill.

Allow reporting interval config setting to be optional

The reporting interval is required for some CSPs (e.g. Amazon/AWS) that require the deployed service to generate a regular "heartbeat" update to the CSP.

However for other CSPs (e.g. Microsoft/Azure) there is no need for such a heartbeat, so this config setting should be made optional.

Split up hooks into modules

The hookspec.py file in devel branch has all the hooks at the moment. These should be broken up by plugin type. For example CSP related hooks into csp_hookspec.py and storage into storage_hookspec.py etc. This will make it easier to understand what functions a plugin should implement. And we should probably have a generic hookspec file for anything that's relevant everywhere such as setup and config handling. Thus something like:

  • csp
  • base
  • storage
  • Other?

state when errors clear

The readme should also contain a section about how and when errors clear. Could be as simple as "If/when the adapter completes a run with no errors previous error states are cleared."

Handle failure to save cache/csp-config

It's possible that the adapter can take actions that require updating the cache or csp-config. If the adapter is unable to save the changes to either of these the data is lost. Thus the adapter should have retries and fault tolerance.

  • For any save/get call that will hit a data store or external API the adapter should use retries with backoff in the event of failure
  • If the retries are exhausted the adapter should have some other way to backup the state for the next run

Add file based logging handler for vm products

Logging is deployment environment specific. There should be a hookspec for retrieving a logger and we could probably log to console by default. This means the local storage plugin implements the hook and sets up some sort of file log.

Handle missed billed records

If the adapter is down or inoperable for some reason and misses billing records the records should be accumulated by billing cycle and the adapter update state in csp config accordingly.

The adapter could walk through each missed cycle and calculate what would've been billed and logged that while cleaning up the records in cache. This will be provided to the product and a notification shown to the user to resolve the issue.

How will the adapter state be reset if the resolution happens offline?

Only process records in last cycle

The billing calculation takes all records in cache into account when processing usage. Instead it should only take records that were created since the last bill time and up to the current bill time.

Add timestamp to logs

Add a timestamp at the beginning of each log output:

2023-05-15T16:47:25.767|INFO|CSPBillingAdapter|Updating CSP config with: {'billing_api_access_ok': True, 'errors': [], 'timestamp': '2023-05-15T17:05:46.968960+00:00', 'expire': '2023-05-15T18:05:46.968960+00:00'}

Decide how to handle the test plugins in core package

There are 4 plugins that exist in core code to help assist with development and testing:

  • local_csp
  • product_api
  • memory_cache
  • memory_csp_config

These are not intended for production environment. However, they are very useful for testing and development. Thus they should remain in code base but probably should not be loaded by default such as https://github.com/SUSE-Enceladus/csp-billing-adapter/blob/main/csp_billing_adapter/adapter.py#L66.

From a testing perspective they can be loaded such as https://github.com/SUSE-Enceladus/csp-billing-adapter/blob/main/tests/unit/conftest.py#L86.

For v1 release we can probably just drop the local_csp from core code and move it to the conftest file. Anything else we can probably push to v2 discussion.

Python versions

We should consider fixing the Python version by defining pythons in all spec files. In the not too distant future Python 3.11 will be available in SLE and we probably want to build against that interpreter.

Event loop should sleep for the remainder of the querying interval

At the end of the event loop body there is a time.sleep() call that sleeps for the specified query interval. This doesn't factor in how long the event loop processing took, which means that over a period of time there will be a drift factor that builds up. If the query interval is large, and the event loop processing time is very brief, it may take a very long time for this drift factor to have any appreciable impact, e.g. if the query interval is 3600 seconds and the event loop processing time is approx 0.25 seconds, it would take approx 1.6 years to "miss" a query interval to drift, and if it only takes 0.1 seconds it would take over 4 years to "miss" a query interval due to drift.

However if we are now adding retries with backoff logic, per #81, we run the risk of the impact of drift being more appreciable.

Addressing this is simply a matter to reducing the sleep time by the time taken to run the event loop processing.

Collect CSP data

Any implementation needs to collect CSP information that is sufficient to meet the requirements that allows SCC to automatically create support cases.

Suggest to use the commands we use in AMI's we publish to collect metadata as this infrastructure already exists.

store past billing data

The goal is to allow us to provide information about each billing period, let's say a history of 6 (configurable) months rolling, to the user when requested. Th e idea is to answer the following question:

Why did I get charged for 25 nodes managed in the month of February?

The assumption for the example question is that the day the question is less than 6 months into the future from February.

Update usage retrieval to dict return

Currently the k8s usage function returns an instance of Usage named tuple. Given the data may be different across products it should just return a dictionary.

base_product in csp_config

With #85 an entry named base_product was added to the csp_config structure. This information is obtained from the application about which the adapter is reporting. But the csp_config structure is used by the adapter to communicate to the application. As such we are round-tripping the information about the name/code of the base product. This appears superfluous.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.