metabase / macaw
A Clojure wrapper for JSqlParser
License: Eclipse Public License 1.0
If you have a query like

select public.towns.id from towns

Macaw won't catch that towns must have a schema of public in :tables.
SELECT
count(id)
FROM
orders
GROUP BY user_id
finds id, but not user_id
When analyzing Metabase queries with references to logical views (aka models), Macaw is passed a query with "sentinel" names in place of actual tables. In order for find-replace to work correctly, we treat such qualifiers as if they were missing, so that we can cascade down to the entry qualified by the actual table.
This trick currently relies on Metabase's naming convention, and could have false positives, or break if this convention changes.
A more robust solution would be to take a set of placeholder names to be treated this way. Checking membership in this set should also be faster than searching for substrings.
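As a rough sketch of that idea (all names here are hypothetical, not Macaw's API): a set-membership check replaces the substring search, and sentinel qualifiers are simply treated as absent so lookup cascades to the real entry.

```python
# Hypothetical sketch: the placeholder set would be supplied by the caller
# (e.g. Metabase), instead of Macaw matching a naming convention by substring.
PLACEHOLDER_TABLES = {"metabase_sentinel_table_1"}  # assumed example name

def effective_qualifier(qualifier):
    """Return None for sentinel qualifiers so find-replace cascades down
    to the entry qualified by the actual table. O(1) set lookup, no scan."""
    return None if qualifier in PLACEHOLDER_TABLES else qualifier
```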
From the stats server:
SELECT date_trunc('month', instance_started)::date as month_started, count(*) AS "total_instances",
avg(ts-instance_started) as avg_runtime,
avg((stats->'user'->'users'->>'active')::int) as avg_active_users,
percentile_disc(0.9) within group (order by (stats->'user'->'users'->>'active')::int) as p90_active_users,
avg((stats->'question'->'questions'->>'total')::int) as avg_questions,
percentile_disc(0.9) within group (order by (stats->'question'->'questions'->>'total')::int) as p90_questions,
max((stats->'question'->'questions'->>'total')::int) as max_questions
FROM "public"."usage_stats"
WHERE (("public"."usage_stats"."instance_started" BETWEEN timestamp
with time zone '2019-01-01 00:00:00.000-08:00' AND NOW()) AND ts-instance_started < INTERVAL '30 days')
GROUP BY 1
ORDER BY 1 ASC;
"net.sf.jsqlparser.parser.ParseException: Encountered unexpected token: \"BETWEEN\" \"BETWEEN\"
at line 9, column 51.
Was expecting one of:
\")\"
\".\"
\"::\"
\"[\"
\"^\"
Currently Macaw is unable to correctly parse queries featuring square-bracket-quoted identifiers, e.g.
SELECT [my_field] FROM my_table
(columns query)
;; Execution error (ParseException) at net.sf.jsqlparser.parser.CCJSqlParser/generateParseException (CCJSqlParser.java:39603).
;; Encountered unexpected token: "SELECT" <K_SELECT>
;; at line 1, column 1.
;;
;; Was expecting one of:
;;
;; "WITH"
The most obvious solution would be to update the JSQLParser grammar, but that would involve either a large upstream contribution or forking the library, neither of which we want to get into at this stage.
Since both Macaw and MSSQL support the use of double quotes for escaping, another solution could be to pre-process the query to make this substitution. Since double quotes require the QUOTED_IDENTIFIER setting to be enabled, though, it would be prudent to convert them back when rewriting strings. Since the semantics of the two quoting methods are the same, and square brackets do not require any settings, we could implicitly turn all quotes into square brackets as a post-processing step, regardless of whether the corresponding identifier was rewritten - but we would prefer to preserve existing formatting as much as possible.
Since we can do this pre/post processing hack around our calls from Metabase to Macaw, we have no plans to fix this for now, although we might change our mind if there is 3rd party demand.
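For illustration, the pre/post-processing hack could be a pair of substitutions like the following. This is a sketch, not Macaw's implementation, and it assumes identifiers contain no escaped quote or bracket characters:

```python
import re

def brackets_to_quotes(sql):
    # Pre-process: [my_field] -> "my_field" so JSqlParser can handle it.
    # Assumption: no escaped ] inside identifiers.
    return re.sub(r"\[([^\]]+)\]", r'"\1"', sql)

def quotes_to_brackets(sql):
    # Post-process: restore square brackets after rewriting, since double
    # quotes would require QUOTED_IDENTIFIER to be enabled on MSSQL.
    return re.sub(r'"([^"]+)"', r"[\1]", sql)
```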
CASE
WHEN [[fromble ILIKE '%' || {{fromble_name}} || '%' AND]] TRUE
THEN [[{{fromble_name}}
ELSE]] 'Others' END AS fromble
net.sf.jsqlparser.JSQLParserException: net.sf.jsqlparser.parser.ParseException:
Encountered unexpected token: "||" <OP_CONCAT>
Hopefully this is as simple as turning on a dialect-related config flag, like for MSSQL square-brace quotes.
Queries affected on Stats: 128
Currently Macaw uses a global counter to assign ids.
Ideally we would have a deterministic assignment for a given query, with some local stability, so that small edits to a query don't change the ids of unrelated parts.
We don't need to preserve global uniqueness across queries, as we can always pair these ids with a card id in Metabase.
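One possible scheme (an illustrative assumption, not something proposed in this issue) is to derive each id by hashing the query text together with the node's position, which is deterministic per query and needs no global state:

```python
import hashlib

def node_id(query, path):
    """Deterministic id from the query text plus the node's position
    (here 'path' is a hypothetical tree-path string). Same query and
    position always yield the same id; no global counter required."""
    digest = hashlib.sha256(f"{query}\x00{path}".encode()).hexdigest()
    return digest[:8]
```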
Similar to the work done in metabase/metabase#41781.
When we call replace-names, we should handle case and quoting correctly. Currently it depends on an exact string match.
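A sketch of what case- and quoting-aware matching could look like (the double-quote convention and function names here are illustrative assumptions, not Macaw's code):

```python
def matches(query_ident, target):
    """Quoted identifiers must match exactly; unquoted ones fold case.
    Assumes double quotes as the quoting convention, for illustration."""
    if query_ident.startswith('"') and query_ident.endswith('"'):
        return query_ident[1:-1] == target          # quoted: exact match
    return query_ident.lower() == target.lower()    # unquoted: case-insensitive
```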
(deftest tables-with-complex-aliases-test
(testing "With an alias that is also a table name"
(is (= #{"user" "user2_final"}
(tables
"SELECT legacy_user.id AS old_id,
user.id AS new_id
FROM user AS legacy_user
OUTER JOIN user2_final AS user
ON legacy_user.email = user2_final.email;")))))
Instead of just taking a column/table name, replace-names should take {:table "foo" :column "bar" :schema "baz"} maps.
Bryan made the cogent suggestion to add Malli schemas to function metadata as a means of documentation and we've been planning for a while to Malli-fy everything. All public functions and constants should have schemas.
In SQL you can include the quote character within an identifier name by repeating it, e.g.
`schemer`.`the-table`.`doug``s mug`
{:schema "schemer"
:table "the-table"
:column "doug`s mug"}
Metabase solves this with a slightly more complex quote stripper which we should be able to simply pilfer.
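The unescaping itself is simple; here is a sketch of the approach (not Metabase's actual code): strip the surrounding quote characters, then collapse each doubled quote back to a single one.

```python
def strip_quotes(identifier, quote="`"):
    """Strip surrounding quote chars and un-double embedded ones,
    e.g. `doug``s mug` -> doug`s mug. Sketch only."""
    if len(identifier) >= 2 and identifier[0] == quote and identifier[-1] == quote:
        return identifier[1:-1].replace(quote * 2, quote)
    return identifier  # unquoted identifiers pass through unchanged
```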
Currently Macaw makes use of various maps whose keys are qualified entities that are themselves represented by maps.
For example, the parameters passed to replace-names
:
{:columns {{:table "orders" :column "total"} "subtotal"}
:tables {{:table "orders"} "purchases"}}
Ignoring what we want the library API to look like, there are some practical downsides to using this representation internally: the relaxed lookup (find-relaxed) method requires a linear scan over the keys of a persistent map - slow!
An idea for something better would be to build a trie-like structure which is traversed from the "outside in", i.e. right-to-left as identifiers are written. For example:
{"total" {"orders" {"schema_1" "primary_orders_total"
"schema_2" "secondary_order_total"}
"payments" {"schema_1" "total_payments"}}}
In the simple case of a fully-qualified identifier, we'd now need to do 3 look-ups instead of a single one, so there's no free lunch. On the other hand, we do save hashing the whole map - and Clojure does not cache map hashes.
The advantage for partially qualified identifiers is that we can replace a linear seek which needs to check equality with modified keys with a sequence of look-ups. When we reach the end of our crumbs we can then easily enumerate all the children nodes. In the case where there is no match, we can also fail on a look-up step instead of doing a full seek.
These maps are also not great for pretty printing, given their density. That said, a trie will probably be even worse - so we'd want a utility method for printing tries nicely when debugging too.
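To make the lookup behavior concrete, here is a small sketch of the outside-in traversal over the example trie above (Python purely for illustration; the real structure would be Clojure maps):

```python
# Keys are consumed right-to-left as identifiers are written:
# column, then table, then schema.
trie = {"total": {"orders":   {"schema_1": "primary_orders_total",
                               "schema_2": "secondary_order_total"},
                  "payments": {"schema_1": "total_payments"}}}

def lookup(trie, *crumbs):
    """crumbs are given innermost-first, e.g. (column, table, schema).
    Returns the replacement, a list of candidates if under-qualified,
    or None on a failed step - no full linear seek needed."""
    node = trie
    for crumb in crumbs:
        if not isinstance(node, dict):
            return None               # over-qualified
        node = node.get(crumb)
        if node is None:
            return None               # fail fast on a single missing step
    if isinstance(node, str):
        return node                   # fully qualified: direct hit
    leaves, stack = [], [node]        # under-qualified: enumerate children
    while stack:
        for value in stack.pop().values():
            if isinstance(value, str):
                leaves.append(value)
            else:
                stack.append(value)
    return sorted(leaves)
```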
I have a dream that, instead of continually adding to an ad hoc mix of what we test for different queries, we use a structured format of fixtures and expectation files to generate an "acceptance test suite".
e.g. we could have the following files:
test/resources/acceptance/foo.sql # if this is here, we test that parsing does not explode.
test/resources/acceptance/foo.analysis.edn # if this is here then we test analysis. keys can be sparse.
test/resources/acceptance/foo.rewrite.edn # if the next two files are here we use them to test rewrite
test/resources/acceptance/foo.rewritten.sql
We would probably need somewhere to put the opts for a given query too, perhaps in a new or existing EDN file, or in comments in the source SQL.
We probably also want to allow some exceptions for cases we're working our way through. I think defining exceptions in code in the test files is fine, as this is more visible to developers than "stop files" or something like that. For the exceptions we can define the actual result values. We can have a strict mode (e.g. a CLI option) which fails if there are any exceptions, but I think this should be off for CI.
The idea is to shunt over our existing tests as fast as possible, hence allowing missing files or fields is advantageous.
Implementation-wise, I imagine macros which take descriptions of the suite and a resource directory path, and generate separate -test vars with nicely broken-down testing scopes, so we get great test runner output in CI.
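The fixture-discovery step could be sketched like this (Python purely for illustration; the real suite would be Clojure, and the function name is an assumption). It pairs each foo.sql with whichever optional companion files exist, matching the layout proposed above:

```python
from pathlib import Path

def discover_cases(root):
    """Map each fixture name to its query file and optional expectation
    files; missing companions are None, so sparse fixtures are fine."""
    root = Path(root)

    def companion(name, suffix):
        path = root / (name + suffix)
        return path if path.exists() else None

    cases = {}
    for sql in sorted(root.glob("*.sql")):
        if sql.name.endswith(".rewritten.sql"):
            continue  # expectation output, not a standalone fixture
        name = sql.name[:-len(".sql")]
        cases[name] = {
            "query":     sql,
            "analysis":  companion(name, ".analysis.edn"),
            "rewrite":   companion(name, ".rewrite.edn"),
            "rewritten": companion(name, ".rewritten.sql"),
        }
    return cases
```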
Use =? over the entire components map.
(in the same way that you can rename columns and tables)
GITHUB_TOKEN
CLOJARS_USERNAME
CLOJARS_PASSWORD
deploy.yml, edit if needed
Here is a minimally reproducing query:
select
e.instance_id
-- this next one is fine
, percentile_cont(0.75) within group (order by e.running_time) / 1000.0 as p75_time
-- this is also fine
, avg(e.running_time) filter (where e.error = '') / 1000.0 as avg_success_time
-- this one causes a parse error
, percentile_cont(0.75) within group (order by e.running_time) filter (where e.error = '') / 1000.0 as p75_success_time
from execution e
group by 1
Example query:
select *
from dblink('hosting_insights',$REDSHIFT$
SELECT distinct hm_user.company
FROM metabase_database
LEFT JOIN hm_hosted_instance
ON metabase_database.etl_source_instance_id = hm_hosted_instance.id
LEFT JOIN hm_user
ON hm_hosted_instance.owner_id = hm_user.id
WHERE metabase_database.is_sample = FALSE
$REDSHIFT$) as t1(customer text)
Interestingly enough, this is a :postgres query on Stats...
Queries affected on Stats: 93
Currently, Macaw's version is set to 0.1.z, where z is the number of commits in the release since the beginning of time. This makes working with Metabase releases difficult: a backport for an old version of Metabase may depend on some new Macaw behavior, but there could also be breaking changes in the Macaw API. We need a more sophisticated versioning strategy. The proposal is semver-adjacent:
- Versions use an x.y.z format.
- x and y will be set in a new VERSION file that gets slurped by build.clj.
- z will be the number of commits since x.y.0.
- We will tag the x.y.0 release to help compute z.
- x will be incremented when there are serious changes to the public API ("serious" is deliberately vague, but is meant to allow us to keep the same major version number if the API-breaking changes are trivially addressable from within Metabase).
- y will be incremented when we feel we've written a substantial feature (this is again deliberately vague). This includes both very different internal behavior (e.g., improving our column-finding logic) and additions to the API.

The efficiency gains we're getting from using an array-backed deque are more than erased by the need to copy the entire structure for each element, and then build a persistent vector from it as well.
Using a cons list would save making both copies, while still giving O(1) pop and push, at a small cost to memory locality.
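To make the trade-off concrete, here is a minimal cons-cell sketch (illustrative only, in Python rather than Clojure) showing that push and pop share structure instead of copying:

```python
# A cons list as nested 2-tuples: (head, rest). EMPTY is the empty list.
EMPTY = None

def push(stack, x):
    return (x, stack)        # O(1): new cell points at the old list, no copy

def pop(stack):
    head, tail = stack       # O(1): the tail is the existing shared list
    return head, tail

def to_vector(stack):
    """Single O(n) pass when we finally need an ordered vector."""
    out = []
    while stack is not None:
        head, stack = stack
        out.append(head)
    out.reverse()
    return out
```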
But the following test fails:
(deftest complicated-mutations-test
(is (= #{"delete" "insert"}
(mutations "WITH outdated_orders AS (
DELETE FROM orders
WHERE
date <= '2018-01-01'
RETURNING *
)
INSERT INTO order_log
SELECT * from outdated_orders;"))))
with:
macaw.core-test
1 non-passing tests:
Error in complicated-mutations-test
expected: (=
#{"delete" "insert"}
(mutations
"WITH outdated_orders AS (
DELETE FROM orders
WHERE
date <= '2018-01-01'
RETURNING *
)
INSERT INTO order_log
SELECT * from outdated_orders;"))
error: net.sf.jsqlparser.JSQLParserException: net.sf.jsqlparser.parser.ParseException: Encountered unexpected token: "DELETE" "DELETE"
at line 2, column 24.
Was expecting:
"WITH"