databrickslabs / splunk-integration
Databricks Add-on for Splunk
Home Page: https://splunkbase.splunk.com/app/5416/
License: Other
Hello team.
Since Splunk DB Connect doesn't support SQL warehouse endpoints, the next option is the Databricks Add-on for Splunk: https://splunkbase.splunk.com/app/5416. However, with the add-on the SQL warehouse must be running every time I need to fetch data. Is there a way to trigger a start action, the way Splunk DB Connect does (https://splunkbase.splunk.com/app/2686)?
The customer is not willing to use the add-on if they have to start the SQL warehouse manually every single time. Is there any way to mimic how Splunk DB Connect works with all-purpose clusters, so that SQL warehouse endpoints can be used the same way?
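One option while waiting for built-in support: the SQL Warehouses REST API exposes a start endpoint, so an automation step could start the warehouse before querying. A minimal sketch using only the standard library (the host, warehouse ID, and PAT are placeholders you must supply; error handling is omitted):

```python
import urllib.request

def warehouse_start_url(host: str, warehouse_id: str) -> str:
    # Build the SQL Warehouses API "start" endpoint URL.
    return f"{host.rstrip('/')}/api/2.0/sql/warehouses/{warehouse_id}/start"

def start_warehouse(host: str, warehouse_id: str, token: str) -> int:
    # POST with an empty body; a personal access token authenticates the call.
    req = urllib.request.Request(
        warehouse_start_url(host, warehouse_id),
        data=b"",
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 200 on success; starting is asynchronous
```

Note that starting is asynchronous, so a caller would still need to poll the warehouse state (via `GET /api/2.0/sql/warehouses/{id}`) until it reports RUNNING before issuing queries.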
According to the known issues section of the documentation, the add-on logs to var/log/splunk/ta_databricks.log and var/log/TA-Databricks/<command_name>command.log. This is inconsistent with standard Splunk apps/add-ons, which should log under /var/log/splunk with a filename that indicates the source (i.e., ta_databricks) and any subcomponent as required (as an example, ta_databricks_.log).
The logging format should also match that of standard Splunk logs so that entries are automatically ingested and parsed correctly. The documentation also states that indistinct/unclear error messages may be displayed in the UI, which is not helpful to the analysts who encounter them. A useful error message should always be shown in the UI to aid troubleshooting, rather than forcing users to inspect the logs after every failure.
We updated our add-on and the databricksquery command no longer works (with either PAT or Azure service principal authentication). I can't find any clues in _internal related to the Python error; the search log only shows:
07-28-2023 10:24:26.355 INFO ServerConfig [1401872 searchOrchestrator] - Will add app jailing prefix /opt/splunk/bin/nsjail-wrapper for TA-Databricks
07-28-2023 10:24:26.355 INFO ChunkedExternProcessor [1401872 searchOrchestrator] - Running process: /opt/splunk/bin/nsjail-wrapper /opt/splunk/bin/python3.7 /opt/splunk/etc/apps/TA-Databricks/bin/databricksquery.py
07-28-2023 10:24:27.238 INFO ChunkedExternProcessor [1401872 searchOrchestrator] - Custom search command is a generating command.
07-28-2023 10:24:27.238 WARN ChunkedExternProcessor [1401872 searchOrchestrator] - Error adding inspector message: invalid level or message already exists
.......
07-28-2023 10:24:27.329 ERROR ChunkedExternProcessor [1401944 phase_1] - Error in 'databricksquery' command: External search command exited unexpectedly with non-zero error code 1.
Also, version 1.2 does not appear to have been committed to this repo.
https://github.com/databrickslabs/splunk-integration/blob/8389c72498825c9bb9306e2b20fe33bfee209e35/app/app.manifest#L8C21-L8C21
I see this integration addon allows users to query Databricks from Splunk. Does it support Databricks SQL endpoints as well as interactive clusters?
Python 2 has reached end of life and no longer receives security updates. Removing it will make the code more readable and maintainable by allowing the use of Python 3 features.
Following the Splunk DB Connect Guide for Databricks, when creating a connection to Databricks and testing it in Data Lab >> SQL Explorer, I get the error "Cannot get schemas".
This is the combination used (I tried others as well, but the error thrown is always the same):
DB connect: 3.12.2
JDBC driver: 2.6
JAVA: JRE 11
The connection fails with the following errors in splunk/var/log/splunk/splunk_app_db_connect_server.log:
ERROR c.s.d.s.a.s.d.impl.DatabaseMetadataServiceImpl - Unable to get schemas metadata
[...]
.ErrorPropagationThriftHandler:runSafely:ErrorPropagationThriftHandler.scala:119], errorCode:0, errorMessage:Configuration CONNECTION_TYPE is not available.).
At present, a PAT token must be used for authentication, which raises security concerns about its usage (and potential misuse). Being able to use service principals (SPNs) would improve security and traceability while simplifying the situation from a compliance perspective.
According to the setup documentation, it reads as if only a single cluster can be defined within the platform at any given time. Since multiple clusters can exist, and there may be a desire to search across them by specifying the 'cluster=' parameter shown in the screenshots, this functionality should be added. If it already exists, the documentation should be updated to clearly state that multiple clusters can be added and used at the same time, with screenshots for reference (to avoid any confusion).
The instructions are only valid for JDBC driver version 2.6.22 and earlier. In later versions the driver class paths have changed, so db_connection_types.conf needs to reflect those changes.
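For illustration, a db_connection_types.conf stanza for the newer drivers might look like the sketch below. This is an assumption-laden example, not the official fix: the stanza name, `serviceClass`, and URL format should be checked against the DB Connect guide; the key change is that recent Databricks JDBC drivers use the `com.databricks.client.jdbc.Driver` class (and `jdbc:databricks://` URLs) in place of the older Simba `com.simba.spark.jdbc.Driver` class.

```
[databricks]
displayName = Databricks
serviceClass = com.splunk.dbx2.DefaultDBX2JDBC
# Older drivers (<= 2.6.22) used com.simba.spark.jdbc.Driver
jdbcDriverClass = com.databricks.client.jdbc.Driver
# Hypothetical URL template; fill in host/port/httpPath per your workspace
jdbcUrlFormat = jdbc:databricks://<host>:<port>/default
port = 443
```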
According to the custom commands section of the documentation, a user requires either the 'admin_all_objects' or the 'list_storage_passwords' capability to use the add-on. From a security perspective neither is viable: the first grants the user full admin privileges on the platform, while the second lets the user see all stored passwords for the apps/add-ons they can access.
This requirement prevents the app from being used in the majority of environments, and it really needs to be rewritten to use proper access control that doesn't reveal credentials to non-admins. While an admin should be able to see (and change) the configuration of any defined cluster, a normal user should only have access to clusters that share the same role (i.e., databricks_cluster_xxxxxx), similar to the functionality DB Connect provides.
Databricks now supports:
There doesn't appear to be a documented way to retrieve, via any of the custom commands, the allowed list of notebooks (and their parameters) or any job IDs from the Databricks platform. From an integration perspective, being able to query these would allow concise dashboards (and field validation, for example) to be provided to the end user.
Without them, significant manual effort is required to provide a usable front end within Splunk, especially if there are frequent changes on the Databricks side. If they don't already exist, new custom commands to retrieve details of notebooks/jobs would be beneficial to this end.
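Until such commands exist, the same information can be pulled directly from the Databricks REST API (the Workspace API lists notebooks under a path, and the Jobs 2.1 API lists jobs). A minimal standard-library sketch, with host and token as placeholders:

```python
import json
import urllib.parse
import urllib.request

def api_url(host: str, endpoint: str, **params) -> str:
    # Build a GET URL for a Databricks REST endpoint with query parameters.
    query = f"?{urllib.parse.urlencode(params)}" if params else ""
    return f"{host.rstrip('/')}{endpoint}{query}"

def get_json(host: str, endpoint: str, token: str, **params) -> dict:
    req = urllib.request.Request(
        api_url(host, endpoint, **params),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def list_notebooks(host: str, token: str, path: str = "/") -> list:
    # Workspace API: objects under `path`; NOTEBOOK entries carry their path.
    objs = get_json(host, "/api/2.0/workspace/list", token, path=path)
    return [o["path"] for o in objs.get("objects", [])
            if o.get("object_type") == "NOTEBOOK"]

def list_jobs(host: str, token: str) -> list:
    # Jobs API 2.1: each entry has a job_id and a settings.name.
    jobs = get_json(host, "/api/2.1/jobs/list", token).get("jobs", [])
    return [(j["job_id"], j["settings"]["name"]) for j in jobs]
```

A custom command wrapping these two calls could feed dropdowns in a dashboard, so end users never type notebook paths or job IDs by hand.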
An Azure Databricks customer requested the following correction in line 75 of https://github.com/databrickslabs/splunk-integration/blob/master/notebooks/source/pull_from_splunk.py:
Instead of "pushing data to splunk" it should read "pulling data from splunk".
If you confirm the issue, please proceed with the requested correction.
hey @metrocavich @alexott
Looking at the guide Configuring Splunk DB Connect App For Databricks, it appears the integration only supports configuring an identity with a username/password.
From within Splunk DB Connect, navigate to the Configuration > Databases > Identities tab and click New Identity.
Fill in the appropriate details:
Identity Name: A unique name for the identity.
Username: Enter the Databricks email/username you use for the Databricks instance that you want to connect to.
Note: Ensure that the database user has sufficient access to the data you want to search. For example, you might create a database user account whose access is limited to the data you want Splunk Enterprise to consume.
Password: Enter the password for the user you entered in the Username field.
Is there an option to connect using a "Token Identity" instead of a "Basic Identity" (username/password)?
I have encountered environments that only allow token-based authentication, as opposed to traditional username and password credentials.
As shown in the integration screenshot (databricksquery.png), a user can specify the command_timeout parameter to override how long a search can run. Since this can negatively impact performance on both platforms, a maximum allowable value should be a configuration option to prevent users from setting it too high. If a user tries to exceed this limit, the maximum value should be used instead.
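The requested behavior is a simple server-side clamp. A sketch of the logic, with hypothetical default and ceiling values standing in for whatever the admin would configure:

```python
DEFAULT_TIMEOUT = 300       # seconds; hypothetical default when unset/invalid
MAX_COMMAND_TIMEOUT = 900   # seconds; hypothetical admin-configured ceiling

def effective_timeout(requested, max_allowed=MAX_COMMAND_TIMEOUT):
    # Clamp a user-supplied command_timeout to the configured maximum,
    # falling back to the default for missing or non-positive values.
    if requested is None or requested <= 0:
        return DEFAULT_TIMEOUT
    return min(requested, max_allowed)
```

With this in place a user passing `command_timeout=100000` would silently get the 900-second ceiling instead of an unbounded search.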
According to the limitations section of the documentation, the databricksquery custom command has a limit on the number of results returned (though the limit doesn't appear to be defined). This limitation (stated as being part of the API) inhibits adoption, because results cannot be relied upon: queries that return a larger number of results may be truncated.
If this data is used for security purposes, truncated results could create blind spots in detections. Any query should return either the full result set or a number limited by a defined configuration parameter (to prevent, for example, billions of results being returned).
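One way the add-on could avoid truncation is to page through results rather than taking a single response. Databricks' SQL Statement Execution API returns results in chunks that link to the next chunk. The sketch below assumes a statement has already been submitted and its first result chunk (a dict parsed from the API's JSON) is in hand; endpoint paths follow the public API but should be verified against current docs:

```python
import json
import urllib.request

def chunk_url(host: str, statement_id: str, chunk_index: int) -> str:
    # Result-chunk endpoint of the SQL Statement Execution API.
    return (f"{host.rstrip('/')}/api/2.0/sql/statements/"
            f"{statement_id}/result/chunks/{chunk_index}")

def fetch_all_rows(host, token, statement_id, first_chunk):
    # Follow next_chunk_index links until the result set is exhausted.
    rows, chunk = [], first_chunk
    while chunk is not None:
        rows.extend(chunk.get("data_array", []))
        nxt = chunk.get("next_chunk_index")
        if nxt is None:
            return rows
        req = urllib.request.Request(
            chunk_url(host, statement_id, nxt),
            headers={"Authorization": f"Bearer {token}"},
        )
        with urllib.request.urlopen(req) as resp:
            chunk = json.load(resp)
    return rows
```

A configurable row cap could still be applied on top of this loop, so large result sets are bounded deliberately rather than truncated silently.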
Currently, we can run jobs with the Databricks Add-on for Splunk using notebooks, and pass parameters using the databricksjob command.
It doesn't support SQL analytics features such as queries, alerts, and dashboards that can be used in Databricks workflows. It would be great to add those to the add-on.
As of mid-October (I don't have the exact date), the databricksquery command stopped working in Splunk Cloud.
Errors from the search job inspector are below.
Could this be related to #9 and Splunk's removal of Python 2?
11-02-2021 16:56:27.691 ERROR ChunkedExternProcessor [35699 searchOrchestrator] - Error in 'databricksquery' command: External search command exited unexpectedly with non-zero error code 1.
11-02-2021 16:56:27.691 INFO ScopedTimer [35699 searchOrchestrator] - search.optimize 0.656035955
11-02-2021 16:56:27.691 INFO SearchPhaseGenerator [35699 searchOrchestrator] - Failed to create phases using AST:Error in 'databricksquery' command: External search command exited unexpectedly with non-zero error code 1.. Falling back to 2 phase mode.
11-02-2021 16:56:27.691 INFO SearchPhaseGenerator [35699 searchOrchestrator] - Executing two phase fallback for the search=| databricksquery query="SELECT * FROM silver.ProcessRollup2 LIMIT 1"
11-02-2021 16:56:27.691 INFO SearchParser [35699 searchOrchestrator] - PARSING: | databricksquery query="SELECT * FROM silver.ProcessRollup2 LIMIT 1"
11-02-2021 16:56:27.691 INFO ServerConfig [35699 searchOrchestrator] - Will add app jailing prefix /opt/splunk/bin/nsjail-wrapper for TA-Databricks
11-02-2021 16:56:27.691 INFO ChunkedExternProcessor [35699 searchOrchestrator] - Running process: /opt/splunk/bin/nsjail-wrapper /opt/splunk/bin/python3.7 /opt/splunk/etc/apps/TA-Databricks/bin/databricksquery.py
11-02-2021 16:56:28.268 ERROR ChunkedExternProcessor [29051 ChunkedExternProcessorStderrLogger] - stderr: Traceback (most recent call last):
11-02-2021 16:56:28.268 ERROR ChunkedExternProcessor [29051 ChunkedExternProcessorStderrLogger] - stderr: File "/opt/splunk/etc/apps/TA-Databricks/bin/databricksquery.py", line 6, in <module>
11-02-2021 16:56:28.268 ERROR ChunkedExternProcessor [29051 ChunkedExternProcessorStderrLogger] - stderr: import databricks_com as com
11-02-2021 16:56:28.268 ERROR ChunkedExternProcessor [29051 ChunkedExternProcessorStderrLogger] - stderr: File "/opt/splunk/etc/apps/TA-Databricks/bin/databricks_com.py", line 7, in <module>
11-02-2021 16:56:28.268 ERROR ChunkedExternProcessor [29051 ChunkedExternProcessorStderrLogger] - stderr: import databricks_common_utils as utils
11-02-2021 16:56:28.268 ERROR ChunkedExternProcessor [29051 ChunkedExternProcessorStderrLogger] - stderr: File "/opt/splunk/etc/apps/TA-Databricks/bin/databricks_common_utils.py", line 13, in <module>