Comments (24)
To get the complete picture, we should also consider the impact on users.
from wis2-guide.
Some possible options to consider:
- Adding cache: false in the properties of the notification message
- Having this "no cache" as part of the metadata description
- Having those datasets directly published in the cache/a/wis2/... topic tree, with the handful of Global Caches using a list of topic hierarchies that are not to be cached. Typically those published by ECMWF, Eumetsat and a few (?) others.
- Else?
from wis2-guide.
Caches might want to use the length field in the message to decide whether they will be able to handle that data volume. If that field is missing they could still decide if they are going to store the data to the cache after the download finished.
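A minimal sketch of the two-stage check described above, assuming each link object is a dict with an optional integer length key (in bytes). The names decide_before_download, decide_after_download and MAX_BYTES are hypothetical and not part of any WIS2 specification; each cache operator would pick its own limit.

```python
from typing import Optional

# Hypothetical per-cache limit in bytes; an operator would choose its own value.
MAX_BYTES = 100 * 1024 * 1024


def decide_before_download(link: dict, max_bytes: int = MAX_BYTES) -> Optional[bool]:
    """Return True or False when the advertised length settles the question,
    or None when the length is missing and the decision has to wait until
    the download has finished."""
    length = link.get("length")
    if length is None:
        return None
    return length <= max_bytes


def decide_after_download(actual_size: int, max_bytes: int = MAX_BYTES) -> bool:
    """Fallback check on the real size once the data has been fetched."""
    return actual_size <= max_bytes
```

A None result is the "length missing" case: the cache downloads first and then calls decide_after_download with the size it actually received.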
from wis2-guide.
I vote for option 3
from wis2-guide.
Before looking for the right mechanism, I think it makes sense to agree who decides if things are cached.
Is it the data publisher? In which case putting a flag in the discovery metadata or notification message would work.
Is it the cache operator? In which case "message length" could be a useful criterion.
... Aside: given that all caches would produce their own notification messages, would WIS2 work if caches made different decisions about what they were willing to cache? Example: if only Germany cached some things, consumers would only receive notifications pointing to the German cache?
The data publisher probably has a better idea of whether the data is real-time or near-real-time.
Aside: maybe the update frequency (specified in the discovery metadata) is a good metric for identifying real-time or near-real-time data?
The cache owner is the one impacted by the choice in terms of data down/upload and storage cost.
from wis2-guide.
It also depends on whether limits will be mandated across all global services, or vary between them. In either case:
- putting cache: bool (default: true) in discovery metadata would mean other global services supporting a lookup between discovery metadata and data notifications. We recently added an (optional) properties.metadata_id element to the notification message which could facilitate this, but again it is optional
- putting cache: bool (default: true) in the notification message would allow the data publisher to specify caching at the data granule level ("from this collection, cache this file, but not that one")
- putting cache: bool (default: true) may be valuable for articulating that a data granule should not be cached for a reason other than size
- using the message properties.content.size value, or a link object's length value, can help provide an indication to Global Services so they can decide accordingly. If limits are Global Service specific, then the Global Service should communicate this constraint somehow (an AsyncAPI definition for GB, or a landing page for GC).
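The lookup mentioned in the first bullet could look roughly like the sketch below. It assumes the Global Service has already built a table mapping discovery metadata identifiers to their cache flag; that table and the function name cache_flag_from_metadata are illustrative only. It also shows why the optional nature of properties.metadata_id matters: without it, the service can only fall back to the default.

```python
def cache_flag_from_metadata(message: dict, flags_by_metadata_id: dict) -> bool:
    """Resolve the cache flag for a notification via its optional
    properties.metadata_id. Falls back to the default (True) when the
    identifier is absent from the message or unknown to the table."""
    metadata_id = message.get("properties", {}).get("metadata_id")
    if metadata_id is None:
        # No identifier in the message: no lookup is possible.
        return True
    return flags_by_metadata_id.get(metadata_id, True)
```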
from wis2-guide.
Hi Tom. Useful points.
Thinking out loud ...
- I think it makes sense for the data publisher to declare whether something should be cached - via discovery metadata and/or notification message (see below).
- "should be cached". I think it's for the Global Cache to make the decision about whether or not it will cache something - perhaps looking at size of an individual 'data granule' or the aggregate size of an entire dataset or all the data from a particular WIS2node. If a Global Cache decides not to cache something that a WIS2node has asked to be cached, it should raise an alert that's captured by the Global Monitor and also propagates back to the original WIS2node so they are aware of the issue. The alert should include the reason why it wasn't cached (e.g., file size, storage quota exceeded ...).
- Global Cache instances may have different criteria for refusing to cache something. Which would lead to data being cached only by some Global Caches. I think this is OK - the system would still work (albeit a bit less robustly) as data consumers would still be able to access the data, just from a smaller number of caches. Aside: it would also be useful to continually check consistency between Global Cache instances. Easiest way to do this would be to listen to the data notification messages and record which caches sent them for each data granule.
- The Global Discovery Catalogue is required to add actionable links to the discovery metadata record for cached datasets; e.g., additional subscription endpoint(s) in the /.../cache/.../ topic, and (possibly?) global cache locations from where the data can be downloaded. So - how does the Global Discovery Catalogue know when to do this? Easiest to have a cache: bool directive in the discovery metadata record. Otherwise, the Global Discovery Catalogue would have to listen to notification messages to see which datasets were actually being cached - which seems complex :). Easiest to have the cache: bool default to true for core data, so that only the edge case of not caching core data needs to be declared. Aside: global cache download locations may not be needed because it's the URL in the notification message(s) that provides that information, and the global cache isn't intended to be a browsable data access point.
- I think it's best to avoid the need for a real-time lookup between the Global Cache and the Global Discovery Catalogue to determine what should be cached - the Global Discovery Catalogue isn't designed to be a highly available component. So - ways to mitigate? (A) On start-up, a Global Cache instance could build a configuration of what to cache by scanning the discovery metadata in the catalogue. (B) Each notification message includes a cache: bool directive, so no lookup is needed. To avoid bloating the message, probably best to assume a default true for core data. The WIS2nodes only need to declare the edge case of cache = false.
... I prefer option (B).
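The consistency check suggested in the aside above (listen to data notification messages and record which caches sent them for each data granule) might be sketched as follows. The class name, the data_id and cache_id arguments, and the way a listener would extract them from a message are all assumptions made for illustration.

```python
from collections import defaultdict


class CacheConsistencyTracker:
    """Records which Global Cache announced each data granule."""

    def __init__(self, expected_caches):
        # Identifiers of the caches we expect to hear from (hypothetical values).
        self.expected = set(expected_caches)
        self.seen = defaultdict(set)

    def record(self, data_id: str, cache_id: str) -> None:
        """Call once per notification received on a cache topic."""
        self.seen[data_id].add(cache_id)

    def missing(self, data_id: str) -> list:
        """Caches that have not (yet) announced this granule."""
        return sorted(self.expected - self.seen.get(data_id, set()))
```

A monitor could call missing() some time after the first notification for a granule and report any cache that is still absent.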
from wis2-guide.
The cache: bool in the notification message is probably the easiest to manage for WIS2 Nodes and Global Cache operations.
This can also be added in the metadata record so that this information is known upon discovery of the data.
From a user perspective, however, this may imply that they would have to subscribe to origin/a/wis2/... to receive the non-cached data.
Which is not very convenient IMHO.
An alternative to my option 3 above is for the WIS2 Node to:
- add cache: false in the message
- post this on origin/a/wis2/... as usual
and for the Global Cache:
- not to download/cache the data with cache: false
- nevertheless republish the message onto cache/a/wis2/... without updating the download link(s).
This way, as a user, I don't really care: I subscribe to cache/a/wis2/... and I will receive the correct links in the message.
Like that, the WIS2 Node is in control, the Global Cache follows a simple rule, and the user doesn't need to know about this subtlety.
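The Global Cache side of this proposal could be sketched as below. This is only an illustration under stated assumptions: the flag is read from the message properties with a missing key treated as true, and download_and_relink is a hypothetical callable standing in for whatever a cache does to fetch the data and point the links at its own copy.

```python
import copy
from typing import Callable, Tuple


def handle_origin_notification(
    topic: str,
    message: dict,
    download_and_relink: Callable[[dict], dict],
) -> Tuple[str, dict]:
    """Return the cache topic and the message to republish on it."""
    if not topic.startswith("origin/"):
        raise ValueError(f"not an origin topic: {topic}")
    # Same topic hierarchy, but under cache/ instead of origin/.
    cache_topic = "cache/" + topic[len("origin/"):]

    outgoing = copy.deepcopy(message)
    # A missing key is treated as cache: true.
    if outgoing.get("properties", {}).get("cache", True):
        # Normal case: fetch the data and rewrite the download link(s).
        outgoing = download_and_relink(outgoing)
    # With cache: false nothing is downloaded and the links stay untouched.
    return cache_topic, outgoing
```

Either way the consumer only ever needs the cache/a/wis2/... subscription; the only difference is where the links in the republished message point.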
from wis2-guide.
That's a neat solution. It means that a data consumer wanting core data doesn't have to worry about whether it's on cache or origin topics at the Global Broker. Notifications about core data will always be available via the cache/a/wis2 topics.
We just need to clearly document the (counter-intuitive) situation of when a cache/a/wis2 topic message doesn't point to a URL at a Global Cache. From inspecting a notification message it will be obvious - because it must include a "don't cache me" cache: false directive.
It also means that the Global Discovery Catalogue can treat all core data the same - adding an additional actionable link pointing to subscription via the cache topic. So there's no need for the WIS2node to include a special cache me / don't cache me directive in the metadata.
BTW - I'm assuming that the Global Discovery Catalogue adds an actionable link for the associated cache/a/wis2/... topic at every Global Broker? This is the way that data consumers find out where they can subscribe - I'm expecting this to be a list of places, one or more of which is relevant for them.
from wis2-guide.
"We just need to clearly document the (counter intuitive) situation..." does it really matter?
As a user, I receive a message, I follow the link(s) and done.
We can mention that in the guide, as the download links may not be limited to dwd.de or... for core and recommended, that's all.
from wis2-guide.
@kaiwirt, is this potential solution (don't cache but re-publish me) a good option for you as a Global Cache centre?
from wis2-guide.
"We just need to clearly document the (counter intuitive) situation..." does it really matter? As a user, I receive a message, I follow the link(s) and done. We can mention that in the guide, as the download links may not be limited to
dwd.de or... for core and recommended, that's all.
@golfvert - by documentation yes, I meant mention it in the guide - I'm sure that some Global Cache implementers might not get the point unless we're explicit with the reason.
from wis2-guide.
I think it is OK if caches republish messages without actually downloading and storing the data, leaving the message unmodified except for the topic.
Caches could also indicate this in the message, for example with a flag like on_local_cache: true/false.
from wis2-guide.
Sounds good, I just have one addition...
From my point of view, it is important for the automatic download of core data that access to the origin download URLs is open (no login, for all WIS2 Nodes); otherwise the list of access credentials can become very long.
from wis2-guide.
As ECMWF will have its WIS2 Node soonish, and its data shouldn't be cached, shall we tentatively endorse:
- WIS2 Node can add cache: false in the Notification message. Default value being cache: true (if the cache key is missing)
- Global Cache honours cache: false by NOT downloading the data and, nevertheless, publishes on the corresponding cache/a/wis2/...
If that is acceptable, then @tomkralidis can amend the WIS2-notification-message repo accordingly.
from wis2-guide.
TT-WISMD 2023-04-12:
- potentially distributes responsibility?
- global services can/should decide
- can have both
- data producer driven (in WNM)
- decision by global services
- can notify data producer that data was not cached
- traffic/load? Can send single message (report) to global service
- data producer can also subscribe to global broker to validate their data publication
- can notify data producer that data was not cached
- global service should be able to govern if a data granule gets cached or not
Recommendation:
- can have both, global cache can override
- global cache can realize as an implementation detail
- global service SHALL have a WCMP2 record (properties.type="service")
- rules can be communicated in themes/concepts?
- constraints
from wis2-guide.
ET-W2AT 2023-05-15:
- recommended data will NOT be cached by GC
- for core data, add properties.cache (true|false, default=true) to WNM as decided by data producer
- message is republished anyway
- no issue in making "more" core data available (over and above resolution 1)
- ACTION: Secretariat to verify
from wis2-guide.
A small complement to the summary above:
- for core data, add properties.cache (true|false, default=true) to the notification message as decided by data producer
- if properties.cache: false the Global Cache SHALL:
  1. Not download the data made available using this message
  2. Publish the Notification in cache/a/wis2/... (similarly to the data being cached) with the properties.links not modified (the link will still point to the data producer endpoint)
Item 2. is to make users' life easier. They will keep subscribing to the topic cache/a/wis2/... only. The "no cache core data" being a technicality, there is no need to expose this to users.
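On the data producer side, setting the directive could be as small as the sketch below. It assumes the notification message is handled as a dict with a properties object; the helper name set_cache_directive is hypothetical. Because the default is true, the key is only written for the edge case of properties.cache: false, which keeps ordinary messages unchanged.

```python
import copy


def set_cache_directive(message: dict, cache: bool) -> dict:
    """Return a copy of the message carrying the requested cache directive.

    cache=True relies on the default, so any existing key is dropped;
    cache=False writes properties.cache explicitly.
    """
    out = copy.deepcopy(message)
    props = out.setdefault("properties", {})
    if cache:
        props.pop("cache", None)
    else:
        props["cache"] = False
    return out
```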
from wis2-guide.
Associated PR in wmo-im/wis2-notification-message#46
from wis2-guide.
Just for my clarification: do we use the same mechanism for recommended data? That is, the GC receives a message on origin/#, does not download the data, but republishes the (unmodified) message as cache/#.
from wis2-guide.
Not necessarily.
Access to recommended data may require authentication, signing a paper, accepting T&C,... almost whatever the data originator wants. Access to core data MUST be easy for users. Access to recommended data MAY require particular action. So, having to subscribe specifically to origin/a/.... for this kind of dataset is, I think, acceptable.
from wis2-guide.
Before looking for the right mechanism, I think it makes sense to agree who decides if things are cached.
Is it the data publisher? In which case putting a flag in the discovery metadata or notification message would work.
Is it the cache operator? In which case "message length" could be a useful criterion. ... Aside: given that all caches would produce their own notification messages, would WIS2 work if caches made different decisions about what they were willing to cache? Example: if only Germany cached some things, consumers would only receive notifications pointing to the German cache?
The data publisher probably has a better idea of whether the data is real-time or near-real-time. Aside: maybe the update frequency (specified in the discovery metadata) is a good metric for identifying real-time or near-real-time data?
The cache owner is the one impacted by the choice in terms of data down/upload and storage cost.
A producer of core data needs to know in advance whether their data will be cached or not. If core data are not cached, then the producer will have to accommodate data access from an unknown number of consumers, as opposed to one download from each Global Cache. The caching of core data at the Global Caches is a key advantage of the WIS2 architecture from the point of view of a producer of large volumes of such data.
from wis2-guide.
Decision
An increase in the volume of data has to be announced by the provider in advance to allow the GCs to take required measures.
===DECISIONS
- for core data, add properties.cache (true|false, default=true) to the notification message as decided by data producer
- if properties.cache: false the Global Cache SHALL:
  1. Not download the data made available using this message
  2. Publish the Notification in cache/a/wis2/... (similarly to the data being cached) with the properties.links not modified (the link will still point to the data producer endpoint)
from wis2-guide.