metabase / macaw Goto Github PK

A Clojure wrapper for JSqlParser 🦜

License: Eclipse Public License 1.0

Clojure 60.11% Emacs Lisp 1.03% Shell 0.74% Java 38.11%

macaw's Issues

Use schema information from SELECT clause

If you have something like

select public.towns.id from towns

Macaw won't catch that towns must have a schema of public in :tables

find columns in GROUP BY

SELECT
    count(id)
FROM
    orders
GROUP BY user_id

finds id but not user_id

Use an explicit list of "placeholder" tables instead of a magical substring

When analyzing Metabase queries with references to logical views (aka models), Macaw is passed a query with "sentinel" names in place of actual tables. In order for find-replace to work correctly, we treat such qualifiers as if they were missing, so that we can cascade down to the entry qualified by the actual table.

This trick currently relies on Metabase's naming convention, and could have false positives, or break if this convention changes.

A more robust solution would be to take a set of placeholder names to be treated like this. Checking membership in this set should also be faster than searching for substrings too.

Date operations like BETWEEN sometimes cause issues

From the stats server:

SELECT date_trunc('month', instance_started)::date as month_started, count(*) AS "total_instances",
    avg(ts-instance_started) as avg_runtime,
    avg((stats->'user'->'users'->>'active')::int) as avg_active_users,
    percentile_disc(0.9) within group (order by (stats->'user'->'users'->>'active')::int) as p90_active_users,
    avg((stats->'question'->'questions'->>'total')::int) as avg_questions,
    percentile_disc(0.9) within group (order by (stats->'question'->'questions'->>'total')::int) as p90_questions,
    max((stats->'question'->'questions'->>'total')::int) as max_questions
  FROM "public"."usage_stats"
  WHERE (("public"."usage_stats"."instance_started" BETWEEN timestamp
    with time zone '2019-01-01 00:00:00.000-08:00' AND NOW()) AND ts-instance_started < INTERVAL '30 days')
  GROUP BY 1
  ORDER BY 1 ASC;

Error message

"net.sf.jsqlparser.parser.ParseException: Encountered unexpected token: \"BETWEEN\" \"BETWEEN\"
      at line 9, column 51.
  
  Was expecting one of:
  
      \")\"
      \".\"
      \"::\"
      \"[\"
      \"^\"

Support renaming aliases

Square brace quotes (MSSQL) are not supported

Currently Macaw is unable to correctly parse queries featuring squared bracket quoted identifiers, e.g.

SELECT [my_field] FROM my_table

(columns query)

;; Execution error (ParseException) at net.sf.jsqlparser.parser.CCJSqlParser/generateParseException (CCJSqlParser.java:39603).
;; Encountered unexpected token: "SELECT" <K_SELECT>
;;     at line 1, column 1.
;;
;; Was expecting one of:
;;
;;     "WITH"

The most obvious solution would be to update the JSQLParser grammar, but that would involve either a large upstream contribution, or forking the library, neither of which we are wanting to get into at this stage.

Since both Macaw and MSSQL can both support the use of double quotes for escaping, another solution could be to pre-process the query to make this substitution. Since double quotes require the QUOTED_IDENTIFIER setting to be enable though, it would be prudent to convert them back however when rewriting strings. Since the semantics of the two quoting methods are the same, and square braces do not require any settings, we could implicitly turn all quotes into square brackets as a post-processing step, regardless of whether the corresponding identifier was rewritten - but we would prefer to preserve existing formatting as much as possible.

Since we can do this pre/post processing hack around our calls from Metabase to Macaw, we have no plans to fix this for now, although we might change our mind if there is 3rd party demand.

String concattenation using || is not supported

Query

CASE 
  WHEN [[fromble ILIKE '%' || {{fromble_name}} || '%' AND]] TRUE 
  THEN [[{{fromble_name}} 
  ELSE]] 'Others' END AS fromble

Exception

net.sf.jsqlparser.JSQLParserException: net.sf.jsqlparser.parser.ParseException: 
  Encountered unexpected token: "||" <OP_CONCAT>

Hopefully this is as simple as turning on a dialect related config flag, like for MSSQL square brace quotes.

Queries affected on Stats: 128

Generate stable scope IDs

Currently Macaw uses a global counter to assign

Ideally we would have a deterministic assignment for a given query, and has some local stability where things don't change in a query.

A very simple idea that satisfies determinism would be to simply reset the counter every time we do analysis.
We could get a bit more stability through per-query counters based on hashing the context, added to that hash.
If we see value in the "invalidation" behavior, we could add a hash of the string contents of each scope to this number.

We don't need to preserve global uniqueness across queries, as we can always pair these ids with a card id in Metabase.

Handle case/quoting when replacing names

Similar to the work done in metabase/metabase#41781

When we call replace-names, we should handle case and quoting correctly. Currently it depends on an exact string match.

Alias removal is over-eager; removes tables that are shadowed

(deftest tables-with-complex-aliases-test
  (testing "With an alias that is also a table name"
    (is (= #{"user" "user2_final"}
           (tables
            "SELECT legacy_user.id AS old_id,
                    user.id AS new_id
             FROM user AS legacy_user
             OUTER JOIN user2_final AS user
             ON legacy_user.email = user2_final.email;")))))

Accept fully-qualified columns and tables for renames

Instead of just taking a column/table name, replace-names should take {:table "foo" :column "bar" :schema "baz"} maps

Add Malli schemas

Bryan made the cogent suggestion to add Malli schemas to function metadata as a means of documentation and we've been planning for a while to Malli-fy everything. All public functions and constants should have schemas.

Support quoted quote symbols within identifier names

In SQL you can include the quote character within an identifier name by repeating it, e.g.

`schemer`.`the-table`.`doug``s mug`

{:schema "schemer"
 :table  "the-table"
 :column "doug`s mug"}

Metabase solves this with a slightly more complex quote stripper which we should be able to simply pilfer.

Consider replacing "map-key mapping" with trie structure

Currently Macaw makes use of various maps whose keys are qualified entities that are themselves represented by maps.

For example, the parameters passed to replace-names:

{:columns {{:table "orders" :column "total"} "subtotal"}
 :tables  {{:table "orders"}                 "purchases"}}

Ignoring what we want the library API to look like, there are some practical downsides to using this representation internally:

The relaxed look-up (find-relaxed) method requires a linear scan over they keys of a persistent map - slow!
Since the ordering of the keys is non-deterministic and we return the first match we find, this may also be non-deterministic!
We may want to do something different when there are multiple matches - e.g. return all of them, throw an error, etc.
We could still support this with an exhaustive scan that accumulates all the results, but that's even more expensive.

An idea for something better would be to build a trie-like structure which us traversed from the "outside in", i.e. right-to-left as identifiers are written. For example:

{"total" {"orders"   {"schema_1" "primary_orders_total"
                      "schema_2" "secondary_order_total"}
          "payments" {"schema_1" "total_payments"}}}

In the simple case of a fully-qualified identifier, we'd now need to do 3 look-ups instead of a single one, so there's no free lunch. On the other hand, we do save hashing the whole map - and clojure does not cache map hashes.

The advantage for partially qualified identifiers is that we can replace a linear seek which needs to check equality with modified keys with a sequence of look-ups. When we reach the end of our crumbs we can then easily enumerate all the children nodes. In the case where there is no match, we can also fail on a look-up step instead of doing a full seek.

They're also not great for pretty printing, given their density. That said, a trie will probably be even worse - so we'd probably want a utility method for printing them nicely when debugging too.

Use data-driven acceptance tests

I have a dream, that instead of continually adding ad hoc mix of what we test for different queries, that we use a structured format of fixtures and expectation files to generate an "acceptance test suite".

e.g. we could have the following files:

test/resources/acceptance/foo.sql              # if this is here, we test that parsing does not explode.
test/resources/acceptance/foo.analysis.edn     # if this is here then we test analysis. keys can be sparse.
test/resources/acceptance/foo.rewrite.edn      # if the next two files are here we use them to test rewrite
test/resources/acceptance/foo.rewritten.sql

We would probably need somewhere to put the opts for a given query too, perhaps in a new or existing EDN file, or in comments in the source SQL.

We probably also want to have some allowed exceptions for cases we're working our way through. I think defining exceptions in code in the tests file is fine, as this is more visible to developers than "stop files" or something like that. For the exceptions we can define that actual results values. We can have a strict mode (e.g. CLI option) which fails if there are any exceptions, but I think this should be off for CI.

The idea is to shunt over our existing tests as fast as possible, hence allowing missing files or fields is advantageous.

Implementation wise I imagine macros which take descriptions of the suite, and a resource directory path, and they generate separate -test vars with nicely broken down testing scopes, so we get great test runner output in CI.

Advantages

Less friction to adding and maintaining tests.
- Can just drop in customer reported ones.
Scales to more test cases.
Less git conflicts due to separate files.
DRY up a lot of boilerplate.
Better error messages than one big =? over the entire components map.
Clear indication of where there are testing gaps.
Gamification of filling in the gaps.
Input and output closer together in PR diff when using fixtures.

Disadvantages

More annoying to iterate on and debug specific cases
- Can mitigate with scoped runner in a comment or taking a CLI option.

Make `should=` fail on CI when we have the correct answer

Support renaming DB table schemas

(in the same way that you can rename columns and tables)

Finish repository setup

Add secrets:
- GITHUB_TOKEN
- CLOJARS_USERNAME
- CLOJARS_PASSWORD
Set up Clojars
- (secrets as above)
- Uncomment deploy.yml, edit if needed

Using filter + within group is not supported

Here is a minimally reproducing query:

  select
    e.instance_id
   -- this next one is fine
    , percentile_cont(0.75) within group (order by e.running_time) / 1000.0 as p75_time
   -- this is also fine
    , avg(e.running_time) filter (where e.error = '') / 1000.0 as avg_success_time
   -- this one causes a parse error
    , percentile_cont(0.75) within group (order by e.running_time) filter (where e.error = '') / 1000.0 as p75_success_time
  from execution e
  group by 1

Multiline string using $REDSHIFT$ token is not supported

Example query:

select *
from dblink('hosting_insights',$REDSHIFT$
    SELECT distinct hm_user.company
    FROM metabase_database
    LEFT JOIN hm_hosted_instance
      ON metabase_database.etl_source_instance_id = hm_hosted_instance.id
    LEFT JOIN hm_user
      ON hm_hosted_instance.owner_id = hm_user.id
    WHERE metabase_database.is_sample = FALSE
$REDSHIFT$) as t1(customer text)

Interestingly enough, this is a :postgres query on Stats...

Queries affected on Stas: 93

Implement better version numbering

Currently, Macaw's version is set to 0.1.z, where z is the number of commits in the release since the beginning of time. This makes working with Metabase releases difficult: a backport for an old version of Metabase may depend on some new Macaw behavior, but there could also be breaking changes in the Macaw API. We need a more sophisticated versioning strategy. The proposal is semver-adjacent:

New scheme

[Continue to] use the x.y.z format.
x and y will be set in a new VERSION file that gets slurped by build.clj.
z will be the number of commits since x.y.0.
We will create a git tag for every x.y.0 release to help compute #3.
x will be incremented when there are serious changes to the public API ("serious" is deliberately vague, but is meant to allow us to keep the same major version number if the API-breaking changes are trivially addressable from within Metabase)
y will be incremented when we feel we've written a substantial feature (this is again deliberately vague). This includes both very different internal behavior (e.g., improving our column-finding logic) or additions to the API.

(deftest complicated-mutations-test
  (is (= #{"delete" "insert"}
         (mutations "WITH outdated_orders AS (
                       DELETE FROM orders
                       WHERE
                         date <= '2018-01-01'
                       RETURNING *
                     )
                     INSERT INTO order_log
                     SELECT * from outdated_orders;"))))

with


macaw.core-test
1 non-passing tests:

Error in complicated-mutations-test

expected: (=
	   #{"delete" "insert"}
	   (mutations
	    "WITH outdated_orders AS (
                       DELETE FROM orders
                       WHERE
                         date <= '2018-01-01'
                       RETURNING *
                     )
                     INSERT INTO order_log
                     SELECT * from outdated_orders;"))

   error: net.sf.jsqlparser.JSQLParserException: net.sf.jsqlparser.parser.ParseException: Encountered unexpected token: "DELETE" "DELETE"
    at line 2, column 24.

Was expecting:

    "WITH"