snow-fox-data / dss-thread
Dataiku Thread™ Data Catalog Plugin by Snow Fox Data
Home Page: https://www.snowfoxdata.com/thread-plugin
License: Other
Not showing up in project it's shared with
Could you indicate the date of the last scan for each project?
First off, dynamite work! This is incredibly well thought out. I will be giving a shout out to Snow Fox and Excelion in the Dataiku Sales Engineering Global Call.
Describe the bug
Thread works perfectly in version 10; however, in version 11.1, Thread returns zero results. This appears to be caused by a deprecated API call.
2022-11-03 04:19:56,711 INFO 127.0.0.1 - - [03/Nov/2022 04:19:56] "GET /dss-stats HTTP/1.1" 200
-THREAD datasets do not exist yet
/opt/dataiku/dss_install/dataiku-dss-11.1.0/python/dataikuapi/dss/dataset.py:132: DeprecationWarning: Dataset.get_definition is deprecated, please use get_settings
warnings.warn("Dataset.get_definition is deprecated, please use get_settings", DeprecationWarning)
/opt/dataiku/dss_install/dataiku-dss-11.1.0/python/dataikuapi/dss/dataset.py:144: DeprecationWarning: Dataset.set_definition is deprecated, please use get_settings
warnings.warn("Dataset.set_definition is deprecated, please use get_settings", DeprecationWarning)
2022-11-03 04:20:00,312 INFO Initializing dataset writer for dataset THREAD.--Thread-Datasets--
2022-11-03 04:20:00,312 INFO Initializing write session
2022-11-03 04:20:00,336 INFO Starting RemoteStreamWriter
2022-11-03 04:20:00,338 INFO Initializing write data stream (sZ7gv1aINY)
2022-11-03 04:20:00,339 INFO Remote Stream Writer closed
2022-11-03 04:20:00,341 INFO Remote Stream Writer: start generate
2022-11-03 04:20:00,341 INFO Waiting for data to send ...
2022-11-03 04:20:00,341 INFO Got end mark, ending send
0 rows successfully written (sZ7gv1aINY)
2022-11-03 04:20:00,552 INFO Initializing dataset writer for dataset THREAD.--Thread-Index--
2022-11-03 04:20:00,552 INFO Initializing write session
2022-11-03 04:20:00,577 INFO Starting RemoteStreamWriter
2022-11-03 04:20:00,580 INFO Initializing write data stream (B4Qq39Og86)
2022-11-03 04:20:00,581 INFO Remote Stream Writer closed
2022-11-03 04:20:00,583 INFO Remote Stream Writer: start generate
2022-11-03 04:20:00,583 INFO Waiting for data to send ...
2022-11-03 04:20:00,583 INFO Got end mark, ending send
0 rows successfully written (B4Qq39Og86)
2022-11-03 04:20:00,849 INFO Initializing dataset writer for dataset THREAD.--Thread-Column-Mapping--
2022-11-03 04:20:00,850 INFO Initializing write session
2022-11-03 04:20:00,884 INFO Starting RemoteStreamWriter
2022-11-03 04:20:00,888 INFO Initializing write data stream (nFw6UApJBf)
2022-11-03 04:20:00,891 INFO Remote Stream Writer: start generate
2022-11-03 04:20:00,891 INFO Waiting for data to send ...
2022-11-03 04:20:00,891 INFO Remote Stream Writer closed
2022-11-03 04:20:00,892 INFO Got end mark, ending send
0 rows successfully written (nFw6UApJBf)
To Reproduce
Expected behavior
Thread would perform as normal.
Screenshots
At some point after starting the DSS scan.
Additional context
Happy to do any testing as needed.
Search text doesn't appear on catalog screen when I type
Describe the bug
Hashtag (#) in field name causes error when attempting to edit definition.
To Reproduce
Steps to reproduce the behavior:
Name.
Expected behavior
Creating or editing a definition for a field containing a hashtag should not cause an error.
Additional context
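One possible cause, not confirmed by this report, is that the field name is placed in a URL without encoding, in which case everything after the # is treated as a fragment and never reaches the backend. A minimal sketch of percent-encoding a field name with Python's standard urllib.parse follows; the /definition path and the field parameter name are hypothetical.

```python
from urllib.parse import parse_qs, quote, urlsplit


def definition_url(field_name):
    # safe="" forces every reserved character, including '#', to be encoded.
    return "/definition?field=" + quote(field_name, safe="")


url = definition_url("order#id")

# Without encoding, the part after '#' is parsed as a URL fragment.
unencoded = urlsplit("/definition?field=order#id")

# With encoding, the full field name survives a round trip.
decoded = parse_qs(urlsplit(url).query)["field"][0]
```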
Describe the bug
Our entire catalog somehow got wiped out.
Now when trying to re-scan the following message pops up on the Thread Public web interface with only an "OK" button:
dss.my-corp.com says:
'to'
The logs show the following over and over again.
[2022/10/12-21:19:27.347] [Thread-350] [WARN] [dip.http.utils] - Got response code 502
[2022/10/12-21:19:27.347] [Thread-350] [WARN] [dku.auditmechanism.eventserver] - Failed to post events to event server
java.io.IOException: Unknown error on command (HTTP code:502):<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>
at com.dataiku.dip.util.HTTPClientUtils.handleJSONResp(HTTPClientUtils.java:226)
at com.dataiku.common.rpc.InternalAPIClient.handleJSONResp(InternalAPIClient.java:866)
at com.dataiku.common.rpc.APIKeyAuthAPIClient.handleJSONResp(APIKeyAuthAPIClient.java:96)
at com.dataiku.common.rpc.InternalAPIClient.postObject(InternalAPIClient.java:310)
at com.dataiku.dip.security.audit.targets.EventServerTarget$QueueSender.run(EventServerTarget.java:167)
[2022/10/12-21:19:27.457] [qtp722417467-376348] [DEBUG] [dku.tracing] - [ct: 0] Start call: /api/futures/get-update [GET] user=user.one [futureId=VOvlVVLf]
[2022/10/12-21:19:27.457] [qtp722417467-376348] [DEBUG] [dku.tracing] - [ct: 0] Done call: /api/futures/get-update [GET] time=0ms user=user.one [futureId=VOvlVVLf]
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The catalog gets rebuilt
Additional context
If I am unable to edit / add a definition, the button should be shown but disabled
Add ability to rescan DSS on an automated schedule
Is your feature request related to a problem? Please describe.
I can't explicitly exclude datasets from getting scanned.
Describe the solution you'd like
I see we can use limit_to_tags to limit the scan to only datasets with a particular tag.
I would like an exclude_tags option added to the optional Project Variables to explicitly exclude datasets from being scanned.
Describe alternatives you've considered
Tag all datasets I want scanned with a tag and put that tag in the limit_to_tags Project Variable.
This is unnecessarily tedious when all I want to do is exclude a small number of datasets from a very large number of datasets.
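To make the request concrete, here is a minimal sketch of how the two variables could combine. limit_to_tags is the existing variable mentioned above and exclude_tags is the proposed one; the function name should_scan and the exact semantics (a dataset matches if it carries any listed tag, and exclusion wins over inclusion) are assumptions, not the plugin's behavior.

```python
def should_scan(dataset_tags, limit_to_tags=None, exclude_tags=None):
    """Decide whether a dataset should be scanned, based on its tags."""
    tags = set(dataset_tags)
    # Assumed rule: any excluded tag removes the dataset from the scan.
    if exclude_tags and tags & set(exclude_tags):
        return False
    # If a limit list is given, the dataset needs at least one of those tags.
    if limit_to_tags:
        return bool(tags & set(limit_to_tags))
    # No limit list: scan everything that was not excluded.
    return True
```

With this rule, excluding a handful of datasets only requires tagging those few, for example should_scan(["tmp"], exclude_tags=["tmp"]) returns False while untagged datasets are still scanned.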
No longer works
When scanning our DSS instance, the indexing always stops too early and does not complete the entire scan. Some projects are not in the Thread_Index.
In the screenshot below, which shows a subset of a dataframe created directly from Thread_Index, you can see that in the last row the "key" and "last_modified" values appear to have been shifted one column to the left, leaving a "NaN" (null) value in the actual "last_modified" column.
This also appears to break Thread's search functionality, so the only way to reach nearly any project is to put the Project Key directly in the URL. See the error message at the bottom of this post.
[2022-05-18 13:48:19,339] [27/MainThread] [ERROR] [dataiku.webapps.backend] Exception on /search [GET]
Traceback (most recent call last):
  File "/opt/dataiku/code-env/lib/python3.7/site-packages/flask/app.py", line 2077, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/dataiku/code-env/lib/python3.7/site-packages/flask/app.py", line 1525, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/dataiku/code-env/lib/python3.7/site-packages/flask/app.py", line 1523, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/dataiku/code-env/lib/python3.7/site-packages/flask/app.py", line 1509, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "<string>", line 155, in search
  File "/opt/dataiku/code-env/lib/python3.7/site-packages/pandas/core/frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
  File "/opt/dataiku/code-env/lib/python3.7/site-packages/pandas/core/frame.py", line 2709, in _getitem_array
    if com.is_bool_indexer(key):
  File "/opt/dataiku/code-env/lib/python3.7/site-packages/pandas/core/common.py", line 107, in is_bool_indexer
    raise ValueError('cannot index with vector containing '
ValueError: cannot index with vector containing NA / NaN values
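The ValueError above is what pandas raises when a dataframe is filtered with a mask that contains missing values, which would fit an index with null cells. The plugin's search code is not shown here, so the following is only a sketch of one common way to avoid that error, assuming the mask is built with Series.str.contains; the function name search_index and the column names are hypothetical.

```python
import pandas as pd


def search_index(index_df, term, column="name"):
    # na=False makes rows with a missing value count as "no match",
    # so the mask is purely boolean; regex=False matches the term literally.
    mask = index_df[column].str.contains(term, case=False, na=False, regex=False)
    return index_df[mask]


# Small stand-in for an index with one row whose name is missing.
index_df = pd.DataFrame({
    "name": ["sales_orders", None, "Sales_EU"],
    "key": ["A", "B", "C"],
})
result = search_index(index_df, "sales")
```

This only keeps the search from failing; it does not explain why the shifted row was written to the index in the first place.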
I am unable to add/edit definitions in the Thread app.
I am an owner of the Thread project within DSS. All I see is a blank screen in place of the definition window, although I am able to view the DSS and Lineage tabs.
Earlier, a project member was unable to add/edit definitions; now the project owner cannot add/edit them either.
Unclear what happens to definitions that were applied to columns that no longer exist
Self-explanatory
Parentheses in search term causing error
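This report has no traceback, but a common cause of this kind of failure is a search term being interpreted as a regular expression, where an unbalanced parenthesis is a syntax error. A minimal sketch of treating the term as literal text with Python's standard re.escape; the function name literal_search and the sample names are hypothetical.

```python
import re


def literal_search(term, names):
    # re.escape neutralises regex metacharacters such as '(' and ')',
    # so the term is matched as plain text.
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return [name for name in names if pattern.search(name)]


names = ["revenue (usd)", "revenue_eur", "Revenue (USD) monthly"]
matches = literal_search("(usd", names)
```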
Would be great to have a mass functionality to ingest all column comments
Would be great to have a recent activity feed
same as project functionality
Is your feature request related to a problem? Please describe.
Our organization has multiple Dataiku design nodes or instances. Having a catalog that is consistent and up to date across the instances is important.
Describe the solution you'd like
We would like to be able to share one catalog between the different instances.
Describe alternatives you've considered
An alternative, discussed with @rymoore99, is to export the catalog as a Dataiku project and import it into the other instances. I wasn't aware of this feature, and it could be useful for migrating instances, but it only partially solves the problem of keeping a single consistent catalog: it would take a lot of manual work and discipline to keep the instances in sync on a daily basis.
Want the ability to create / maintain categories and add these dynamically to definitions for selection
When applying an existing definition to new columns, the existing columns are not keeping the definition application
More advanced Tag Management would be nice