Comments (43)
Just a quick chime-in here - this should be helpful:
https://docs.unidata.ucar.edu/netcdf-java/current/userguide/dataset_urls.html#object-stores
as well as https://github.com/lesserwhirls/tds-s3-jpl-test (I had planned on using this to bootstrap TDS specific docs).
from tds.
I'm going to leave it open for the time being, just as a reference until we revisit the S3 documentation :)
I just quickly tried this out, based on what I read here, and it logs a whole bunch of stuff from awssdk in a new file called awssdk.log. You can edit the WEB-INF/classes/log4j2.xml file before starting up TDS to change these log settings.
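For reference, a tweak along these lines in WEB-INF/classes/log4j2.xml is one way to control that output. The logger namespace software.amazon.awssdk is the AWS SDK v2 package; the appender name below is a placeholder, not necessarily what the TDS config actually uses:

```xml
<!-- Sketch only: adjust the AWS SDK v2 logger level.
     "awssdkFile" is a hypothetical appender name; use whatever appender
     is already defined in the TDS log4j2.xml. -->
<Logger name="software.amazon.awssdk" level="warn" additivity="false">
  <AppenderRef ref="awssdkFile"/>
</Logger>
```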
I hope that helps! If you still can't figure out the errors you are getting, let us know.
Oh! Thank you so much! I am such a complete noob with Java, and the logging documentation for AWS SDK and TDS read like ancient Babylonian, and looked like they were completely incompatible with each other. This is a huge leg up!
@WeatherGod Let's focus on what it takes to get something working rather than delving off into design critiques for code that's already been merged. 😉
> cdms3: implies something different from S3. And doing a search for "cdms3" yields not much. The closest I get is "Climate Data Management System, v3", which does not make me think "S3".

Yes, because it is different (very similar, but different). cdms3 is a URI scheme specific to netCDF-Java (a.k.a. the CDM) and was created to provide a more generic and flexible way to interface with object storage systems. s3: was used in the first implementation of object store support for netCDF-Java, which focused on AWS S3. Once we started to support aggregations, non-AWS object stores (including on-premises object stores), etc., we needed more flexibility, and thus cdms3: (support for the s3: scheme is deprecated).
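For concreteness, these are the cdms3: URI shapes that show up later in this thread (the bucket, profile, and object-key names are just the examples used here, not required values):

```
cdms3:noaa-goes16                                      # bucket only, default credentials
cdms3:nhgf-development?thredds/prism_v2/prism_2020.nc  # bucket + object key
cdms3://default@aws/nhgf-development                   # credentials profile + bucket
cdms3:noaa-goes16?ABI-L1b-RadC/2020/307/#delimiter=/   # key prefix + delimiter fragment
```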
I <3 this explanation soo much, thank you! This sort of information in the documentation would go so far in helping people understand. From a perspective of discoverability, in light of the deprecation of "s3:", the documentation should make it clear that "cdms3:" is for object stores like AWS S3 (and maybe include some others). Then I would include examples that expressly describe an example object store and then translate that into a dataset entry. Linking to the netcdf-java page would be helpful, but looking through that, I start to wonder if I still have the right mental model or not. Clear examples would be helpful in reducing ambiguity.
Yes, exactly. It's an id that you're setting that the TDS maps to a storage location, but it can be anything.
Alright, fixing our config, as well as fixing our docker mirror (somehow, our mirror had mixed up the 4.6 and 5.3 tags). We'll know how well this turns out in about an hour.
I think the aspect of path="foobar" being an identifier is another potential stumbling block that would be made clear by having an example that clearly defines the layout of the object store. The reader can see that the name doesn't appear in the store layout and can then (hopefully!) not try to treat it as a subpath name or something.
I might also have a bug report to file next week. Need to confirm, but it looks like thredds will report "something" rather than return an error when an invalid config is used.
This has been very helpful, and we can continue this next week. Have a good weekend!
IMHO, this issue could be closed. The dataset root pattern or using a cdms3 string as I show above is pretty much answering the question.
The best way to test if this is enough would be to include STS as a runtimeOnly dependency for the TDS. I can make a PR for that here in a bit. If that does the trick, then we would want to add it as a runtimeOnly dependency for netCDF-Java's cdm-s3 subproject.
@WeatherGod, the newest docker image (unidata/thredds-docker:5.5-SNAPSHOT) should contain the STS dependency. Let us know if you still have problems!
Agreed that documentation is needed, but generally speaking, you should be able to use datasets on S3 as drop-in replacements for local datasets by using the "cdms3" prefix. In the tds-s3 test catalog that you've linked, I'm not sure what you're referring to as the two dataset roots; I only see the one: "cdms3:noaa-goes16". Am I missing something?
We certainly need to and will add some object store examples to our documentation, but that's not an immediate answer and there's no guarantee that the provided documentation will address your particular issue, so if you would like help sorting your configuration out, I'd be happy to look at it.
@haileyajohnson It took me a bit to find it too, but this tag:
<dataset name="Test GOES-16 S3 Aggregation (November 2nd, 2020)"
ID="2020-11-02_OR-ABI-L1b-RadC-M6C01-G16"
urlPath="s3-agg/ABI-L1b-RadC/2020/11/02/OR_ABI-L1b-RadC-M6C01_G16.nc">
seems to be using an "s3-agg" dataset root that's not defined in the file. It's probably not used, though, because the <scan> tag explicitly gives the cdms3 prefix:
<scan location="cdms3:noaa-goes16?ABI-L1b-RadC/2020/307/#delimiter=/"
dateFormatMark="OR_ABI-L1b-RadC-M3C01_G16_s#yyyyDDDHHmmss"
regExp=".*OR_ABI-L1b-RadC-M6C01_G16_s.*" />
Yes, that's what I meant by two dataset roots.
I think an important view point to consider when drafting the documentation is what is the minimum amount of information the user needs to know. Initially, the user is going to think, "I have a dataset located at s3://myexamplebucket/foo/bar", how should a user specify that in catalog.xml, and what other information does the user need to provide? A narrative walking through that would go a long way.
Ah ok, I see. Yes, @dopplershift is correct, that second dataset doesn't need a <datasetRoot> mapping because the location is provided via the <scan> element.
@WeatherGod this is the current documentation for catalog configuration. It definitely does need a lot of work - but as far as serving S3 datasets goes, the only thing users need to do differently is provide locations that start with the cdms3: prefix, followed by the bucket name and object path, as opposed to a local path.
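As a sketch of that difference (the bucket, directory, and root names here are hypothetical, following the pattern used elsewhere in this thread):

```xml
<!-- Local disk: location is a filesystem path -->
<datasetRoot path="local-data" location="/data/netcdf/" />

<!-- Object store: same idea, but the location uses the cdms3: prefix
     followed by the bucket name (and optionally an object-key prefix) -->
<datasetRoot path="s3-data" location="cdms3:my-example-bucket" />
```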
Should it end with a / or not? Digging through the code, I see that there is a check for that, but it doesn't explain what it is for. Also, the code checks for both "cdms3" and "s3". Is there a difference?
https://github.com/Unidata/tds/blob/f0a6297f6fd8b9e4cb3ee2b29312ce64eda94458/tdcommon/src/main/java/thredds/server/catalog/DataRoot.java
https://github.com/Unidata/tds/blob/15d02c350d3c411f899c4befae694b492243911b/tds/src/main/java/thredds/core/DatasetManager.java
Thanks, @lesserwhirls !
I'm not sure what you're trying to configure, but for most situations you shouldn't need the location to end with a /; it looks like the code you're referencing is just adding a trailing '/' to the root if it's not already there. Either cdms3: or s3: will work as a prefix, but cdms3: is recommended. I'm guessing the fact that both exist and work is an artifact of evolving decisions mid-development...
cdms3: implies something different from S3. And doing a search for "cdms3" yields not much. The closest I get is "Climate Data Management System, v3", which does not make me think "S3".
I can see how that read as a design critique, and sorry for not being clearer. This was more me raising the point that any such documentation should make clear what difference, if any, there is between "s3:" and "cdms3:"; and if there isn't a difference, it should state that they are equivalent, because an end-user who isn't familiar with one or the other will get confused on this point.
And to note, I have currently taken some of these notes and applied them back to our system to see if that resolves the problems in our bigger, more complicated system. There are so many moving parts to this, and so much difficulty in nailing down root causes, that I am just trying to eliminate sources of ambiguity/confusion.
Once our CI/CD cycle completes, I'll let you know how that goes.
@WeatherGod Pull Requests welcome...
Just kidding. We definitely need to capture a lot of this discussion (esp. the context from @lesserwhirls who wrote the stuff) in the docs. I sense the object store support is going to become increasingly important to a large swath of the community, and we definitely want it to be easy for them to set up.
Alright, some progress, I guess. So, here is a snippet of our catalog.xml:
<dataset name="ERA5 Daily Precipitation and Temperatures"
ID="era5-daily-summary"
urlPath="cdms3:aer-awi-era5/temp_precip/era5_pnt_daily_2000_2020">
"aer-awi-era5" is a private S3 bucket of ours. The output from the relevant log:
# cat /usr/local/tomcat/content/thredds/logs/catalogInit.log
You are currently running TDS version 4.6.19 - 2021-12-20T10:32:09-0500
Latest Available TDS Version Info:
latest stable version = 5.1
latest maintenance version = 4.6.17
initCatalogs(): initializing 1 root catalogs.
**************************************
Catalog init catalog.xml
[2022-01-21T21:17:13.123Z]
initCatalog catalog.xml -> /usr/local/tomcat/content/thredds/catalog.xml
-------readCatalog(): full path=/usr/local/tomcat/content/thredds/catalog.xml; path=catalog.xml
----Catalog Validation
*** ERROR DataRootConfig path =temp_precip directory= <cdms3:aer-awi-era5/> does not exist
add static catalog to hash=catalog.xml
And to confirm that the bucket is accessible from the server running thredds, I installed awscli and did:
# aws s3 ls s3://aer-awi-era5/
PRE temp_precip/
So, any ideas what is wrong? Could it be subtle differences between the Python and Java AWS SDKs? Authentication is being implicitly done via the server instance having assumed the appropriate IAM role.
waitaminute... that says TDS version 4.6. I need to go back and double-check which docker image we are using
I think we're crossing wires a bit on what is a urlPath and what is a location. The cdms3: prefix should be included in the location field of the <datasetRoot> element to let the TDS know that the dataset maps to your s3 bucket. It tells the TDS where to look for the stored data.
The urlPath is a url path for users to get the data via the TDS. For example, this:
urlPath="s3-test/ABI-L1b-RadC/2019/363/21/OR_ABI-L1b-RadC-M6C16_G16_s20193632101189_e20193632103574_c20193632104070.nc"
means the dataset can be accessed using ncss with this url: <hostname>/thredds/ncss/grid/s3-test/ABI-L1b-RadC/2019/363/21/OR_ABI-L1b-RadC-M6C16_G16_s20193632101189_e20193632103574_c20193632104070.nc/dataset.html
So I think the TDS is trying to parse what you've put in the urlPath as a url, and it doesn't know what to do with the ":". It should be something like: urlPath="aer-awi-era5/temp_precip/era5_pnt_daily_2000_2020"
...or all of that will be true if you have a 5.x TDS running :)
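Putting that advice together with the earlier snippet, the catalog fragment would presumably end up looking something like this (the datasetRoot path value "era5" is an arbitrary choice; only the bucket name and dataset names come from this thread):

```xml
<!-- Map the client-facing root "era5" to the private bucket -->
<datasetRoot path="era5" location="cdms3:aer-awi-era5" />

<!-- urlPath is now a plain path under that root, with no cdms3: prefix -->
<dataset name="ERA5 Daily Precipitation and Temperatures"
         ID="era5-daily-summary"
         urlPath="era5/temp_precip/era5_pnt_daily_2000_2020" />
```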
That's actually a question I had, and maybe this is just general knowledge for thredds catalog.xml configs (I am a complete noob to thredds configs). In your example, you have urlPath starting with "aer-awi-era5". Would my datasetRoot then be <datasetRoot path="aer-awi-era5" location="cdms3:aer-awi-era5" />? In other words, the "path" is just an id that doesn't appear in the final URL? Could I have a datasetRoot of <datasetRoot path="foobar" location="cdms3:aer-awi-era5" />, and then a urlPath of "foobar/temp_precip/era5_pnt_daily_2000_2020"?
The <scan location="..."> is relative to urlPath? So, if my urlPath already pointed to the directory with my files, then the scan location should just be "/", or something?
(the tds-s3.xml example had a different schema, which confused me)
Your scan location should be absolute, e.g. <scan location="cdms3:bucketname/path/to/objects_shared_location" ...>. urlPath is used by a client/user interfacing with the TDS; location is used by the TDS to find the data, whether on disk or in a cloud store.
I'll follow this up with some annotations on the tds-s3.xml file that might clarify things, but it's taking me a bit to type it up.
"urlPath is used by a client/user interfacing with the TDS". So, would siphon count as a client interfacing with TDS? Because siphon, as far as I can tell, doesn't support s3 or cdms3 stuff, so maybe this whole time, I shouldn't have been setting urlPath to a cdms3: address? An error in another part of our system that uses siphon is what triggered this entire investigation of wondering how we were supposed to configure catalog.xml.
There are two datasets being served from the tds-s3 catalog, the first is a single dataset, i.e. one .nc file stored in an AWS bucket:
<datasetRoot path="s3-test" location="cdms3:noaa-goes16" />
<dataset name="Test Single GOES-16 S3 (December 29th, 2019 21:01) " ID="2020-11-02_2101-OR-ABI-L1b-RadC-M6C16-G16"
urlPath="s3-test/ABI-L1b-RadC/2019/363/21/OR_ABI-L1b-RadC-M6C16_G16_s20193632101189_e20193632103574_c20193632104070.nc"
dataType="Grid"
serviceName="obstoreGrid"/>
The <datasetRoot> element here is telling the TDS that anytime a client provides a URL with "s3-test", it should go look in the "cdms3:noaa-goes16" location. That's used in the following <dataset> element by the urlPath field: we can't put "cdms3:noaa-goes16" in directly, because that wouldn't be a valid, parsable url, but the TDS knows what "s3-test" is referring to because we mapped it in the <datasetRoot> above. Everything after "s3-test/" is the path to the dataset object within the s3 bucket (i.e. a path relative to the dataRoot).
The second dataset is an aggregation:
<dataset name="Test GOES-16 S3 Aggregation (November 2nd, 2020)"
ID="2020-11-02_OR-ABI-L1b-RadC-M6C01-G16"
urlPath="s3-agg/ABI-L1b-RadC/2020/11/02/OR_ABI-L1b-RadC-M6C01_G16.nc">
...
<scan location="cdms3:noaa-goes16?ABI-L1b-RadC/2020/307/#delimiter=/"
...
</dataset>
Here our urlPath is, again, the URL that will be used by a client to access the data through the TDS, but we didn't need to define a <datasetRoot> because we tell the TDS where the data lives in the <scan> element. That's why the <scan> needs the full uri, cdms3 prefix and path.
> So, would siphon count as a client interfacing with TDS? Because siphon, as far as I can tell, doesn't support s3 or cdms3 stuff, so maybe this whole time, I shouldn't have been setting urlPath to a cdms3: address?

Yep! Siphon counts as a client, and urlPath should not have cdms3 in it. The TDS should be handling the mapping from a url like <hostname>/thredds/ncss/grid/s3-test/ABI-L1b-RadC/2019/363/21/OR_ABI-L1b-RadC-M6C16_G16_s20193632101189_e20193632103574_c20193632104070.nc/ to a storage location like cdms3:noaa-goes16/ABI-L1b-RadC/2019/363/21/OR_ABI-L1b-RadC-M6C16_G16_s20193632101189_e20193632103574_c20193632104070.nc
Siphon shouldn't even need to know whether the dataset lives on s3 or on local disk.
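The mapping being described can be illustrated with a toy sketch (this is not TDS code; the dictionary simply stands in for the catalog's <datasetRoot> entries, and the names are illustrative):

```python
def resolve_location(url_path, dataset_roots):
    """Toy illustration: map a client-facing urlPath to a storage location
    using {path: location} pairs, like <datasetRoot> entries in catalog.xml."""
    for root_path, location in dataset_roots.items():
        if url_path == root_path or url_path.startswith(root_path + "/"):
            # Everything after the matching root is relative to the location,
            # which may be a local directory or a cdms3: object-store URI.
            return location + url_path[len(root_path):]
    raise KeyError("no dataset root matches " + url_path)

roots = {"s3-test": "cdms3:noaa-goes16"}
print(resolve_location("s3-test/ABI-L1b-RadC/2019/363/file.nc", roots))
# -> cdms3:noaa-goes16/ABI-L1b-RadC/2019/363/file.nc
```

The point is just that the client never sees the cdms3: side of the mapping, which is why Siphon doesn't need to know about it.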
Oh, that really clears things up. And that scan location is utilizing the url schema defined in the netcdf-java docs that was linked earlier today. Just curious, could the scan location utilize datasetRoot if it was defined, or does it always have to have a full url?
I'm fairly certain a scan location overrides a datasetRoot and can't rely on one. Since scan location isn't user-facing and is only used internally by the TDS, it doesn't go through the datasetRoot map.
Glad this was helpful!
Just a bit of an update on my end. Due to problems with other parts of the system unrelated to thredds and s3, we had to table working on using this feature. I do have my eye on utilizing TDS for another project that would utilize data off of an S3 bucket, but I haven't been allocated time to pursue that. It may be some time before I can contribute more to this discussion. If you all do move ahead and develop some documentation, I would be interested in reviewing it.
I'm looking into this now for some basic examples and failing to get anything working. Various errors make it clear I have the right TDS and have at least gotten TDS to try, so I know I'm making some progress. Can someone point me to a dead simple working example?
I've never used the dataset root pattern, and I don't think I have enough brain cells to hold all the indirection in my head at once. Are there examples of using a netcdf element within a dataset element to point to a single netcdf file on an object store? I'd like to inch my way into this and get to the point that I know how to work with our requester pays bucket (s3://nhgf-development) -- I put our copy of prism there and have been using the path cdms3://default@aws/nhgf-development?thredds/prism_v2/prism_2020.nc to try and get one file working.
Thanks for any examples you can point me to.
UPDATE: I have succeeded in getting an unauthenticated goes16 example to work and an authenticated (with a default profile) example against our requester pays bucket to work.
I am unable to get credentials to work by specifying them in the URL. Only a format like cdms3:noaa-goes16 is working. E.g. if I create a:
[region-only-profile]
region=us-east-1
and set my dataset root to:
<datasetRoot path="s3-test" location="cdms3://region-only-profile@aws/noaa-goes16" />
I get a 500, NULL from THREDDS, with Service: S3, Status Code: 403 ... in the body. These public datasets don't seem to work at all with authenticated requests though, so maybe this is expected?
However, for my requester pays bucket, I can set up credentials like...
[default]
aws_access_key_id=enter_your_key
aws_secret_access_key=enter_your_secret
region=us-west-2
I can get this datasetRoot location to work: cdms3:nhgf-development or cdms3://default@aws/nhgf-development
minimal catalog that works:
<?xml version="1.0" encoding="UTF-8"?>
<catalog name="S3 Test Catalog" xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" xmlns:ncml="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2" version="1.2">
<service name="all" serviceType="OpenDAP" base="/thredds/dodsC/"/>
<datasetRoot path="nhgf" location="cdms3:nhgf-development" />
<dataset name="test prism" ID="prism" urlPath="nhgf/thredds/prism_v2/prism_2020.nc" dataType="Grid" serviceName="all"/>
<datasetRoot path="nhgf-test" location="cdms3://default@aws/nhgf-development" />
<dataset name="test prism 2" ID="prism-test" urlPath="nhgf-test/thredds/prism_v2/prism_2020.nc" dataType="Grid" serviceName="all"/>
</catalog>
I think I'm off and running with this pattern. I'll do some more testing and we'll see where we get.
Picking this issue back up at work. I can get thredds to attempt to access my S3 bucket, but I get an access denied, even though my default credentials and configs stored in /usr/local/tomcat/.aws/ are all good. I am trying to debug this, but I can't figure out how to turn on logging for the "com.amazonaws" logger (looks like they are still using log4j?).
Might be a good idea to update the log4j2.xml file to automatically include WARN and higher entries in the threddsServlet.log file, maybe? I tried doing it myself, but I can't seem to make it work and I don't have much experience with java logging.
Any guidance on how to activate the logs from the Amazon SDK? I'm still having difficulty getting this to work and I'm shooting in the dark, trying to figure out what the code is choking on (either I get "null" errors, or "Access Denied" errors).
Alright, thanks to the logging info, I have been able to determine one of the reasons why my group has been having such difficulty figuring out the configs. If I am understanding this correctly, it turns out that TDS is missing a dependency from the awssdk package (I don't know the exact terminology). Since we were trying to run TDS as a docker container image on an EKS cluster, authorization is provided by a Web Identity Token rather than some other usual method such as an AWS Role or ~/.aws/credentials. Doing that requires the sts module to be included in the classpath.
Looks like the following line needs to be added to the gradle files somewhere?
implementation 'software.amazon.awssdk:sts'
Not sure if something like that should go into netcdf-java directly, or into TDS?
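For what it's worth, a sketch of how that might look in a Gradle build file; the exact file and dependency configuration are for the maintainers to decide (runtimeOnly rather than implementation follows the earlier suggestion in this thread):

```groovy
// Hypothetical build.gradle fragment -- the version is assumed to come from
// an AWS SDK BOM the build already imports for its other awssdk modules.
dependencies {
    runtimeOnly 'software.amazon.awssdk:sts'
}
```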
I will have a look!
Thanks Sean, that sounds good! 😄