
dataman's Introduction

DataMan

A data service-- which has:

- schema enforcement
- replication
- geo-distribution / load-balancing
- caching (MUCH later ;) )
- archiving / deleting data
- security
- backups

The intention is to have a stack of "backend stores" that this unified API can talk with to store the actual data. As such, a lot of the features (schema, sharding, etc.) are done independently of the underlying store.
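As a rough illustration of that split, the query layer would program against a store interface that each backend implements. The sketch below is hypothetical Go; the Store interface and its method names are assumptions for illustration, not dataman's actual code:

package store

import "context"

// Record is a generic document/row, keyed by field name.
type Record map[string]interface{}

// Store is a hypothetical interface the unified API could program against;
// concrete backends (postgres, etc.) would implement it, while schema
// enforcement, sharding, etc. live above this boundary.
type Store interface {
	Get(ctx context.Context, db, collection string, pkey interface{}) (Record, error)
	Set(ctx context.Context, db, collection string, record Record) error
	Delete(ctx context.Context, db, collection string, pkey interface{}) error
	Filter(ctx context.Context, db, collection string, filter map[string]interface{}) ([]Record, error)
}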

dataman's People

Contributors

jacksontj, svrana, tvirgl-wish

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

dataman's Issues

Create CLI for running integration tests

To run tests today you need to run go test in the integration tests directory. The structure of the test suites actually only requires some config files and a directory of tests. There is no reason to require the user to put all their tests into the upstream dataman repo -- ideally there'd be a CLI with flags for the various config files and the directory of tests to run.
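A minimal sketch of what such a CLI could look like; the flag names (-schema, -storage-config, -test-dir) are assumptions for illustration, not an existing tool:

package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// Hypothetical flags; the real CLI would decide the exact names.
	schema := flag.String("schema", "schema.json", "path to the schema config file")
	storage := flag.String("storage-config", "storagenode.json", "path to the storage node config")
	testDir := flag.String("test-dir", "./tests", "directory containing the test suites")
	flag.Parse()

	if _, err := os.Stat(*testDir); err != nil {
		fmt.Fprintf(os.Stderr, "test dir %q not found: %v\n", *testDir, err)
		os.Exit(1)
	}
	fmt.Printf("would run suites in %s with schema=%s storage=%s\n", *testDir, *schema, *storage)
}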

Allow new keys for _document type

I have a column in my postgres table that is of datatype json. However, I do not know what the keys in this column would be, as the user will specify this. I want this column to act as a counter dictionary. Therefore, the operations I need would be:

  1. Insert a new key/value pair if the key is not in the dict
  2. Update the value if the key is in the dict
  3. Increment the value if the key is in the dict

I have tried to do this with the _document datatype:

"counter_json": {
  "name": "counter_json",
  "field_type": "_document",
  "provision_state": 3
},

with an update operation:

{
  "Type": "update",
  "Args": {
    "db": "event_instance_period",
    "collection": "event_base",
    "filter": {
      "event_instance_id": evt.EventInstanceId,
      "start_time":        evt.StartTime,
      "updated":           evt.Updated,
      "end_time":          evt.EndTime
    },
    "record": {
      "count":    1
    },
    "record_op": {
      "counter_json.a": ["+", 1]
    }
  }
}

However, dataman returns an error: record_op field counter_json.a doesn't exist in collection.
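For reference, all three of the operations above can be collapsed into a single statement on the Postgres side. This is only a sketch of the kind of SQL a backend would need to emit, assuming the column is (or can be cast to) jsonb; it is not how dataman handles record_op today, and the WHERE clause is simplified to a made-up _id:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "dbname=event_instance_period sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Insert the key if missing, otherwise increment it, in one statement.
	// The json column is cast to jsonb so jsonb_set/to_jsonb can be used.
	const q = `
UPDATE event_base
SET counter_json = jsonb_set(
        COALESCE(counter_json::jsonb, '{}'::jsonb),
        '{a}',
        to_jsonb(COALESCE((counter_json->>'a')::int, 0) + 1)
    )::json
WHERE _id = $1`
	if _, err := db.Exec(q, 42); err != nil {
		log.Fatal(err)
	}
}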

Allow user to specify constraint for Set operation

Currently when performing a Set operation on a collection, the constraint automatically defaults to the primary key. For example, the query generated would be something like this:

INSERT INTO <collection> (<columns>) VALUES (<values>) ON CONFLICT (_id) DO UPDATE SET ...

It would be great to have the ability to specify which constraint to use for the set operation. If the collection has a unique constraint, the user should be able to do set operations against it:

INSERT INTO <collection> (<columns>) VALUES (<values>) ON CONFLICT (<constraint>) DO UPDATE SET ...

This can be implemented by accepting an additional argument, constraint, when calling the operation. It can be either a list of column names corresponding to the constraint, or just the constraint name itself.

{
  "Type": "set",
  "Args": {
    "db": "event_sum",
    "collection": "event_base",
    "record": {
      "service_id":          evt.ServiceId,
      "event_type":          evt.EventType,
      "event_name":          evt.EventName,
      "processed_data":      evt.ProcessedData,
      "processed_data_hash": evt.ProcessedDataHash
    },
    "constraint": ["service_id", "event_type", "processed_data_hash"]
  }
}

Allow user to do 'LIKE' operation

Hello, could you add a 'LIKE' operation so that dataman can support queries similar to the following:
select name from event_group where name LIKE "%Group1%"

Batch delete API

This is kind of related to #47. If we decide to support multi deletes, then we don't need a batch API

Otherwise, it'd be a nice QoL improvement to allow you to send a batch of primary keys to delete

Create Makefile

with targets for:

  • release: build a release set of binaries
  • test: run all tests in the repo
  • fmt: format all code and tests (including the JSON in there)

Have "set" check for pkey

TL;DR: need to add a pkey check before validating that the record is valid for an update / insert

Right now the set operation checks whether it is valid as an insert or an update. It's possible to create a set which is missing the pkey, which is invalid as an insert but passes the update validation. Since set is supposed to be an upsert for a single record-- this should error with that message.
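A minimal sketch of the missing check, assuming a hypothetical Record type and that the collection's pkey field name is known:

package storagenode

import "fmt"

// Record is a generic row keyed by field name.
type Record map[string]interface{}

// checkPkey is a hypothetical pre-validation step for "set": an upsert of a
// single record only makes sense if the record actually carries its pkey.
func checkPkey(record Record, pkeyField string) error {
	if _, ok := record[pkeyField]; !ok {
		return fmt.Errorf("set: record is missing primary key field %q; "+
			"set must address a single record", pkeyField)
	}
	return nil
}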

Need bytea type for binary/bytes data

FATA[0002] Unknown postgres data_type bytea in file map[column_name:data data_type:bytea character_maximum_length:<nil> is_nullable:YES column_default:<nil>]
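The fix is presumably to teach the pg schema introspection that bytea maps to a bytes/binary field type. A minimal sketch of that mapping; normalizeType and the field type names are assumptions rather than dataman's actual code:

package pgstore

import "fmt"

// normalizeType is a hypothetical mapping from postgres data_type values to
// dataman-style field types; "bytea" would map to a binary/bytes type.
func normalizeType(pgType string) (string, error) {
	switch pgType {
	case "text", "character varying":
		return "string", nil
	case "integer", "bigint":
		return "int", nil
	case "json", "jsonb":
		return "_document", nil
	case "bytea":
		return "bytes", nil // new: binary data
	default:
		return "", fmt.Errorf("unknown postgres data_type %s", pgType)
	}
}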

Option for multi deletes

MongoDB allows you to delete() multiple documents based on fields that don't include the primary key.

If we don't want to support this, we can get around it on the client side by querying the docs, then deleting them. If you'd rather go that route, feel free to close this.
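The client-side workaround described above (filter the docs, then delete each by pkey) would look roughly like this; the client type and its Filter/Delete helpers are made up for illustration, not dataman's actual client API:

package main

import "log"

// Record is a generic row keyed by field name.
type Record map[string]interface{}

// client is a stand-in for a real dataman client.
type client struct{}

func (c *client) Filter(db, collection string, filter map[string]interface{}) ([]Record, error) {
	return nil, nil // placeholder
}

func (c *client) Delete(db, collection string, pkey interface{}) error {
	return nil // placeholder
}

func main() {
	c := &client{}
	// 1) query the matching docs by a non-pkey field
	docs, err := c.Filter("event_sum", "event_base", map[string]interface{}{"service_id": 7})
	if err != nil {
		log.Fatal(err)
	}
	// 2) delete them one at a time by primary key
	for _, d := range docs {
		if err := c.Delete("event_sum", "event_base", d["_id"]); err != nil {
			log.Fatal(err)
		}
	}
}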

Support pgstore-side projections for subfields

In pg, if you select something like foo->'bar'->>'baz' the column name is not a "normal" name, so the current util method doesn't unpack it into the record properly. Need to see how hard this will be; if it's enough work, this might be the time to redo the sql converter to use $1 etc. all over.

The idea I have right off the top of my head is that pg results are always in the order of the select (assuming you gave one) -- if so, we could provide a slice (in the same order) of "addresses" inside the record for the results to go into, then the util.go stuff could just use <record>.Set()
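A rough sketch of that "addresses" idea, where Set and unpackRow are hypothetical helpers rather than the existing util.go code:

package pgstore

// Record is a generic row keyed by field name.
type Record map[string]interface{}

// Set walks a field path (e.g. []string{"foo", "bar", "baz"} for
// foo->'bar'->>'baz') and stores the value at that address, creating
// intermediate maps as needed.
func (r Record) Set(path []string, value interface{}) {
	cur := map[string]interface{}(r)
	for _, k := range path[:len(path)-1] {
		next, ok := cur[k].(map[string]interface{})
		if !ok {
			next = map[string]interface{}{}
			cur[k] = next
		}
		cur = next
	}
	cur[path[len(path)-1]] = value
}

// unpackRow pairs the raw column values (always in select order) with the
// caller-provided addresses for each projected subfield.
func unpackRow(values []interface{}, addresses [][]string) Record {
	r := Record{}
	for i, v := range values {
		r.Set(addresses[i], v)
	}
	return r
}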

Group by function, sum()

  • GROUP BY function and SUM() function from SQL (group aggregation with sum)
  • Would be very useful, especially with Grafana

Make integration tests sharding agnostic

The long-term goal is to have dataman abstract away the specifics of the sharding from the client. As such we should be able to run the same set of tests against N sharding configurations.

Not escaping quotes on `text` fields

Not sure if this is fixed in the latest version, but:

ERROR:

"
Dataman SET: ERROR: Error running query: Err=pq: syntax error at or near "Helvetica" query=INSERT INTO "public".XXXX ("_id","...) VALUES (111,null,null,82,'2018-12-27 03:49:11',null,7,8,51,'XXXXX: ....',29,69,'
"

Content has a single quote in this part of the data:

"
body itemscope itemtype="http://schema.org/EmailMessage" style="font-family: 'Helvetica Neue',Helvetica,Arial,sans-serif; box-sizing: border-box; font-size: 14px; -webkit-font-smoothing: antialiased; -webkit-text-size-adjust: none; width: 100% !important; height: 100%; line-height: 1.6em; background-color: #f6f6f6; margin: 0;"
"

Will work around by sanitizing it myself for now.
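The underlying fix for this class of bug is to bind values as query parameters instead of splicing them into the SQL string. A minimal sketch with lib/pq placeholders (the messages table and its columns are made up):

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "dbname=example sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The body can contain single quotes ('Helvetica Neue', etc.) safely,
	// because it is passed as a bind parameter, not spliced into the query.
	body := `font-family: 'Helvetica Neue',Helvetica,Arial,sans-serif;`
	if _, err := db.Exec(`INSERT INTO messages (_id, body) VALUES ($1, $2)`, 111, body); err != nil {
		log.Fatal(err)
	}
}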

Unique constraint violations return as a general query error; should be a ValidationError map so we know which key was in violation

Error running query: Err=pq: duplicate key value violates unique constraint "constraint_name"

Currently this is returned as Error; I think it should be ValidationError, so that I get the map of fields with this specific field in it. Otherwise I can't know which field it may have been without doing a ton of queries and DB schema inspections (constraint fields, each field's value, test my field values, see which are in violation).
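One way this could be surfaced, sketched with lib/pq: unique violations come back as *pq.Error with SQLSTATE 23505 and the violated constraint name, which could be translated into a ValidationError map. The ValidationError type and classifyError are hypothetical here, and mapping the constraint name back to its column list would be a schema lookup that is elided:

package storagenode

import (
	"github.com/lib/pq"
)

// ValidationError maps field (or constraint) names to human-readable errors.
type ValidationError map[string]string

// classifyError turns a unique-violation from postgres into a ValidationError
// keyed by the violated constraint, instead of an opaque query error.
func classifyError(err error) (ValidationError, bool) {
	pqErr, ok := err.(*pq.Error)
	if !ok {
		return nil, false
	}
	if pqErr.Code.Name() == "unique_violation" { // SQLSTATE 23505
		return ValidationError{pqErr.Constraint: pqErr.Detail}, true
	}
	return nil, false
}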

Filter by columns from joined tables

Hello, I was wondering if it is possible to filter by columns from tables joined with the "join" argument? For example, something like this:

{
  "Type": "filter",
  "Args": {
    "db": "event_sum",
    "collection": "event_instance_period",
    "join": ["event_instance_id", "event_instance_id.event_base_id"],
    "filter": {
      "start_time":        [">", evt.StartTime],
      "end_time":          ["<", evt.EndTime],
      "event_instance_id.event_base_id.service_id": ["=", evt.ServiceId]
    }
  }
}

event_instance_id joins a foreign table event_instance, and event_instance.event_base_id joins another foreign table event_base. I need to filter by event_base.service_id.

Right now, the error that occurs when I perform the operation above is: SubField "event_instance_id"->>'event_base_id' doesn't exist in event_instance_period: map[]

Let me know if you need clarification on anything 👍

Query limiting system

As a centralized data system, it's likely that we'll want to enforce some rules about the queries that are sent. An example would be: no filter queries without a limit. Ideally this would be a rule-driven system (to avoid the need for code) that can be reloaded (to avoid restarts and downtime).
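A hedged sketch of what such a reloadable, rule-driven check could look like; the Rule struct and its fields are invented for illustration:

package querylimit

import "fmt"

// Rule is a hypothetical, config-driven constraint on incoming queries.
// Rules would be loaded from config and hot-reloaded so changes don't
// require a restart.
type Rule struct {
	QueryType    string // e.g. "filter"
	RequireLimit bool   // reject filter queries that have no limit
	MaxLimit     int    // 0 means unlimited
}

// Check applies a rule to a query's args before it is executed.
func (r Rule) Check(queryType string, args map[string]interface{}) error {
	if queryType != r.QueryType {
		return nil
	}
	limit, hasLimit := args["limit"].(int)
	if r.RequireLimit && !hasLimit {
		return fmt.Errorf("%s queries must set a limit", queryType)
	}
	if r.MaxLimit > 0 && hasLimit && limit > r.MaxLimit {
		return fmt.Errorf("limit %d exceeds max %d", limit, r.MaxLimit)
	}
	return nil
}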
