Centres like ECMWF or Eumetsat will provide very large amount of core data that nevert

Prevent caching of large datasets in Global Caches (wmo-im/wis2-guide)

Comments (24)

golfvert commented on June 12, 2024 1

To complete the picture, we should also consider the impact on users.

from wis2-guide.

golfvert commented on June 12, 2024

Some possible options to consider:

1. Adding cache: false to the properties of the notification message.
2. Having this "no cache" as part of the metadata description.
3. Having those datasets published directly in the cache/a/wis2/... topic tree, and having the handful of Global Caches use a list of topic hierarchies that are not to be cached. Typically those published by ECMWF, Eumetsat and a few (?) others.
4. Something else?

kaiwirt commented on June 12, 2024

Caches might want to use the length field in the message to decide whether they will be able to handle that data volume. If that field is missing they could still decide if they are going to store the data to the cache after the download finished.
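
A minimal sketch of the two-stage check described above, assuming the notification message carries link objects with an optional integer length (looked up under a top-level links or under properties.links, since both spellings appear in this thread). The limit MAX_CACHEABLE_BYTES and the function names are hypothetical; each Global Cache would pick its own policy.

```python
from typing import Optional

# Hypothetical per-cache limit; nothing in WIS2 defines this value.
MAX_CACHEABLE_BYTES = 100 * 1024 * 1024


def declared_length(message: dict) -> Optional[int]:
    """Return the largest integer 'length' declared on the links, or None if absent."""
    links = message.get("links") or message.get("properties", {}).get("links", [])
    lengths = [link["length"] for link in links if isinstance(link.get("length"), int)]
    return max(lengths) if lengths else None


def should_download(message: dict, limit: int = MAX_CACHEABLE_BYTES) -> bool:
    """Stage 1: skip the download only when a declared length exceeds the limit."""
    length = declared_length(message)
    # Unknown size: download anyway and decide afterwards.
    return length is None or length <= limit


def should_store(downloaded_bytes: int, limit: int = MAX_CACHEABLE_BYTES) -> bool:
    """Stage 2: after the download, decide whether the data goes into the cache."""
    return downloaded_bytes <= limit
```

When the length is missing, should_download returns True and should_store makes the final call once the real size is known.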

efucile commented on June 12, 2024

I vote for option 3

6a6d74 commented on June 12, 2024

Before looking for the right mechanism, I think it makes sense to agree who decides if things are cached.

Is it the data publisher? In which case putting a flag in the discovery metadata or notification message would work.

Is it the cache operator? In which case "message length" could be a useful criterion.
... Aside: given that all caches would produce their own notification messages, would WIS2 work if caches made different decisions about what they were willing to cache? Example: if only Germany cached some things, would consumers only receive notifications pointing to the German cache?

The data publisher probably has a better idea of whether the data is real-time or near-real-time.
Aside: maybe the update frequency (specified in the discovery metadata) is a good metric for identifying real-time or near-real-time data?

The cache owner is the one impacted by the choice in terms of data down/upload and storage cost.

tomkralidis commented on June 12, 2024

It also depends on whether limits will be mandated across all global services, or vary between them. In either case:

putting cache: bool (default: true) in discovery metadata would mean other global services supporting a lookup between discovery metadata and data notifications. We recently added an (optional) properties.metadata_id element to the notification message which could facilitate this, but again it is optional
putting cache: bool (default: true) in the notification message would allow the data publisher to specify caching at the data granule level ("from this collection, cache this file, but not that one")
putting cache: bool (default: true) may be valuable for articulating that a data granule should not be cached for a reason other than size
using the message properties.content.size value, or a link object's length value, can give Global Services an indication so that they can decide accordingly. If limits are Global Service specific, then the Global Service should communicate this constraint somehow (an AsyncAPI definition for GB, or a landing page for GC).
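
To illustrate the per-granule option, here is a small sketch of how a consumer of the message might read a cache flag with a default of true. It assumes the flag sits at properties.cache; the helper name and the two sample messages are hypothetical and only show the shape of the idea.

```python
def publisher_allows_caching(message: dict) -> bool:
    """Return True unless the publisher explicitly set properties.cache to false."""
    return message.get("properties", {}).get("cache", True) is not False


# Two granules from the same (hypothetical) collection: only the second opts out.
granule_a = {"properties": {"metadata_id": "example-collection"}}
granule_b = {"properties": {"metadata_id": "example-collection", "cache": False}}

for granule in (granule_a, granule_b):
    print(granule["properties"]["metadata_id"], publisher_allows_caching(granule))
```

Because a missing key means "cache me", publishers only need to add the flag for the granules they do not want cached.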

6a6d74 commented on June 12, 2024

Hi Tom. Useful points.

Thinking out loud ...

I think it makes sense for the data publisher to declare whether something should be cached - via discovery metadata and/or notification message (see below).
"should be cached". I think it's for the Global Cache to make the decision about whether or not it will cache something - perhaps looking at size of an individual 'data granule' or the aggregate size of an entire dataset or all the data from a particular WIS2node. If a Global Cache decides not to cache something that a WIS2node has asked to be cached, it should raise an alert that's captured by the Global Monitor and also propagates back to the original WIS2node so they are aware of the issue. The alert should include the reason why it wasn't cached (e.g., file size, storage quota exceeded ...).
Global Cache instances may have different criteria for refusing to cache something. Which would lead to data being cached only by some Global Caches. I think this is OK - the system would still work (albeit a bit less robustly) as data consumers would still be able to access the data, just from a smaller number of caches. Aside: it would also be useful to continually check consistency between Global Cache instances. Easiest way to do this would be to listen to the data notification messages and record which caches sent them for each data granule.
The Global Discovery Catalogue is required to add actionable links to the discovery metadata record for cached datasets; e.g., additional subscription endpoint(s) in the /.../cache/.../ topic, and (possibly?) global cache locations from where the data can be downloaded. So - how does the Global Discovery Catalogue know when to do this? Easiest to have a cache: bool directive in the discovery metadata record. Otherwise, the Global Discovery Catalogue would have to listen to notification messages to see which datasets were actually being cached - which seems complex :). Easiest to have the cache: bool default to true for core data, so that only the edge case of not caching core data needs to be declared. Aside: global cache download locations may not be needed because it's the URL in the notification message(s) that provides that information, and the global cache isn't intended to be a browsable data access point.
I think it's best to avoid the need for a real-time lookup between the Global Cache and the Global Discovery Catalogue to determine what should be cached - the Global Discovery Catalogue isn't designed to be a highly available component. So - ways to mitigate? (A) On start-up, a Global Cache instance could build a configuration of what to cache by scanning the discovery metadata in the catalogue. (B) Each notification message includes a cache: bool directive, so no lookup is needed. To avoid bloating the message, probably best to assume a default true for core data. The WIS2nodes only need to declare the edge case of cache = false. ... I prefer option (B).
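
The consistency check mentioned in the aside above (listening to notifications and recording which caches sent them for each data granule) could be sketched roughly as follows. This assumes each message identifies its granule via properties.data_id and that a cache can be recognised by the hostname in its link href; both are assumptions for this sketch, and CacheCoverage is a hypothetical name.

```python
from collections import defaultdict
from urllib.parse import urlparse


class CacheCoverage:
    """Record which hosts have announced each data granule."""

    def __init__(self) -> None:
        self._hosts = defaultdict(set)

    def record(self, message: dict) -> None:
        """Note the download hosts advertised in one cache notification."""
        data_id = message["properties"]["data_id"]  # assumed granule identifier
        for link in message.get("links", []):
            host = urlparse(link.get("href", "")).hostname
            if host:
                self._hosts[data_id].add(host)

    def missing(self, data_id: str, expected_hosts: set) -> set:
        """Return the expected hosts that have not announced this granule."""
        return set(expected_hosts) - self._hosts.get(data_id, set())
```

A monitor could call record() for every message seen on the cache topics and periodically report missing() per granule.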

golfvert commented on June 12, 2024

The cache: bool in the notification message is probably the easiest to manage for WIS2 Nodes and Global Cache operations.
This can also be added to the metadata record so that the information is known when the data is discovered.

From a user perspective, however, this may imply that they would have to subscribe to origin/a/wis2/... to receive the non-cached data.
Which is not very convenient IMHO.

An alternative to my option 3. above is for the WIS2 Node to:

add cache: false in the message
post this on origin/a/wis2/... as usual
and for the Global Cache:
not to download/cache the data with cache: false
nevertheless republish the message onto cache/a/wis2/... without updating the download link(s).

This way, as a user, I don't really need to care: I subscribe to cache/a/wis2/... and I will receive the correct links in the message.

Like that, the WIS2 Node is in control, the Global Cache follows a simple rule, and the user doesn't need to know about this subtlety.

6a6d74 commented on June 12, 2024

That's a neat solution. It means that a data consumer wanting core data doesn't have to worry about whether it's on cache or origin topics at the Global Broker. Notifications about core data will always be available via the cache/a/wis2 topics.

We just need to clearly document the (counter intuitive) situation of when a cache/a/wis2 topic message doesn't point to a URL at a Global Cache. From inspecting a notification message it will be obvious - because it must include a "don't cache me" cache: false directive.

It also means that the Global Discovery Catalogue can treat all core data the same - adding an additional actionable link pointing to subscription via the cache topic. So there's no need for the WIS2node to include a special cache me / don't cache me directive in the metadata.

BTW - I'm assuming that the Global Discovery Catalogue adds an actionable link for the associated cache/a/wis2/... topic at every Global Broker? This is the way that data consumers find out where they can subscribe - I'm expecting this to be a list of places, one or more of which is relevant for them.

golfvert commented on June 12, 2024

"We just need to clearly document the (counter intuitive) situation..." does it really matter?
As a user, I receive a message, I follow the link(s) and done.
We can mention that in the guide, as the download links may not be limited to dwd.de or... for core and recommended that's all.

golfvert commented on June 12, 2024

@kaiwirt, is this potential solution (don't cache but re-publish me) a good option for you as a Global Cache centre?

6a6d74 commented on June 12, 2024

"We just need to clearly document the (counter intuitive) situation..." does it really matter? As a user, I receive a message, I follow the link(s) and done. We can mention that in the guide, as the download links may not be limited to dwd.de or... for core and recommended that's all.

@golfvert - by documentation yes, I meant mention it in the guide - I'm sure that some Global Cache implementers might not get the point unless we're explicit with the reason.

kaiwirt commented on June 12, 2024

I think it is OK if caches republish messages without actually downloading and storing the data, leaving the message unmodified except for the topic.

Caches could also indicate this in the message, for example with a flag like on_local_cache: true/false

antje-s commented on June 12, 2024

Sounds good, I just have one addition...
From my point of view, it would be important for the automatic download of core data that access to the origin download URLs is open (without login, for all WIS2 Nodes); otherwise the list of access credentials can become very long.

golfvert commented on June 12, 2024

As ECMWF will have its WIS2 Node soonish, and its data shouldn't be cached, shall we tentatively endorse:

WIS2 Node can add cache: false in the Notification message. Default value being cache: true (if cache key is missing)
Global Cache honours cache: false by NOT downloading the data and, nevertheless, publishes on the corresponding cache/a/wis2/...

If that is acceptable, then @tomkralidis can amend the WIS2-notification-message repo accordingly.

kaiwirt commented on June 12, 2024

We can add this feature. No objections. If we agree on this procedure we should, along the same lines, agree what to do with recommended messages. In that case I would prefer having the same logic, such that a Global Cache does not download recommended data but republishes the message at cache/a/wis2.

tomkralidis commented on June 12, 2024

TT-WISMD 2023-04-12:

potentially distributes responsibility?
global services can/should decide
can have both
data producer driven (in WNM)
decision by global services
- can notify data producer that data was not cached
  - traffic/load? Can send single message (report) to global service
- data producer can also subscribe to global broker to validate their data publication
global service should be able to govern if a data granule gets cached or not

Recommendation:

can have both, global cache can override
global cache can realize as an implementation detail
global service SHALL have a WCMP2 record (properties.type="service")
- rules can be communicated in themes/concepts?
- constraints

tomkralidis commented on June 12, 2024

ET-W2AT 2023-05-15:

recommended data will NOT be cached by GC
for core data, add properties.cache (true|false, default=true) to WNM as decided by data producer
message is republished anyway
no issue in making "more" core data available (over and above resolution 1)
- ACTION: Secretariat to verify

golfvert commented on June 12, 2024

A small complement to the summary above:

for core data, add properties.cache (true|false, default=true) to the notification message as decided by data producer
if properties.cache: false the Global Cache SHALL:

1. Not download the data made available using this message
2. Publish the Notification in cache/a/wis2/... (as if the data had been cached) with the properties.links not modified (the link will still point to the data producer endpoint)

Item 2 is to make users' lives easier. They will keep subscribing to the topic cache/a/wis2/... only. "No cache core data" being a technicality, there is no need to expose this to users.
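
The rule above can be sketched as a small handler on the Global Cache side. This is only an illustration under stated assumptions: download_and_store and publish are hypothetical callables injected by the caller (no real Global Cache API is implied), and the origin-to-cache topic rewrite simply swaps the first topic level.

```python
import copy
from typing import Callable


def to_cache_topic(origin_topic: str) -> str:
    """Swap the leading 'origin' level for 'cache', leaving the rest untouched."""
    head, sep, rest = origin_topic.partition("/")
    if head != "origin" or not sep:
        raise ValueError(f"not an origin topic: {origin_topic!r}")
    return "cache/" + rest


def handle_core_notification(
    topic: str,
    message: dict,
    download_and_store: Callable[[dict], dict],  # hypothetical: returns message with cache links
    publish: Callable[[str, dict], None],        # hypothetical: sends message to the broker
) -> dict:
    """Honour properties.cache (default true) and always republish on the cache topic."""
    cache_requested = message.get("properties", {}).get("cache", True) is not False
    if cache_requested:
        outgoing = download_and_store(message)
    else:
        # cache: false -> no download; links keep pointing at the data producer.
        outgoing = copy.deepcopy(message)
    publish(to_cache_topic(topic), outgoing)
    return outgoing
```

Either way the subscriber on cache/a/wis2/... receives a message; only the target of the links differs.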

tomkralidis commented on June 12, 2024

Associated PR in wmo-im/wis2-notification-message#46

kaiwirt commented on June 12, 2024

Just for my clarification: do we use the same mechanism for recommended data? That is, the GC receives a message on origin/#, does not download the data, but republishes the (unmodified) message on cache/#?

golfvert commented on June 12, 2024

Not necessarily.
Access to recommended data may require authentication, signing a paper, accepting T&C,... almost whatever the data originator wants. Access to core data MUST be easy for users. Access to recommended data MAY require a particular action. So, having to subscribe specifically to origin/a/.... for this kind of dataset is, I think, acceptable.

SimonElliottEUM commented on June 12, 2024

Before looking for the right mechanism, I think it makes sense to agree who decides if things are cached.

Is it the data publisher? In which case putting a flag in the discovery metadata or notification message would work.

Is it the cache operator? In which case "message length" could be a useful criterion. ... Aside: given that all caches would produce their own notification messages, would WIS2 work if caches made different decisions about what they were willing to cache? Example: if only Germany cached some things, would consumers only receive notifications pointing to the German cache?

The data publisher probably has a better idea of whether the data is real-time or near-real-time. Aside: maybe the update frequency (specified in the discovery metadata) is a good metric for identifying real-time or near-real-time data?

The cache owner is the one impacted by the choice in terms of data down/upload and storage cost.

A producer of core data needs to know in advance whether they will be cached or not. If core data are not cached, then the producer will have to accommodate data access from an unknown number of consumers, as opposed to one download from each Global Cache. The caching of the core data at the Global Caches is a key advantage of the WIS2 architecture from the point of view of a producer of large volumes of such data.

efucile commented on June 12, 2024

Decision

An increase in the volume of data has to be announced by the provider in advance to allow the GCs to take required measures.

Decisions:

for core data, add properties.cache (true|false, default=true) to the notification message as decided by data producer
if properties.cache: false the Global Cache SHALL:
1. Not download the data made available using this message
2. Publish the Notification in cache/a/wis2/... (as if the data had been cached) with the properties.links not modified (the link will still point to the data producer endpoint)
