
cloudstorage's Introduction

Lytics Command Line Tool & Developer's Aid

The goal of this tool is to provide CLI access to the Lytics API. It also functions as a developer's aid to make writing and testing LQL (Lytics Query Language) as easy as possible.

We would love any feature requests or ideas that would make this useful to you.

Installation

Download a binary from the releases page and rename to lytics:

# linux/amd64
curl -Lo lytics https://github.com/lytics/lytics/releases/download/latest/lytics_linux \
  && chmod +x lytics \
  && sudo mv lytics /usr/local/bin/

# darwin/amd64
curl -Lo lytics https://github.com/lytics/lytics/releases/download/latest/lytics_mac \
  && chmod +x lytics \
  && sudo mv lytics /usr/local/bin/

Or install from source:

git clone https://github.com/lytics/lytics.git
go build
go install

Or install from the repository via go:

go get -u github.com/lytics/lytics

Usage

All examples use jq to prettify the JSON output.

export LIOKEY="your_api_key"
lytics --help

Segment Scan Usage

Scan (export) the members of a segment; the examples below include turning the output into a CSV file.

Example

# Scan a segment by id
lytics segment scan ab93a9801a72871d689342556b0de2e9 | jq '.'

# Scan a segment by slug
lytics segment scan last_2_hours | jq '.'

# write out this segment to a temp file so we can play with JQ
lytics segment scan last_2_hours > /tmp/users.json

# same thing but with an "ad hoc query"
lytics segment scan '
FILTER AND (
    lastvisit_ts > "now-2d"
    EXISTS email
)
FROM user
' > /tmp/users.json

# use JQ to output a few fields
cat /tmp/users.json | \
 jq -c ' {country: .country, city: .city, org: .org, uid: ._uid, visitct: .visitct} '

# create a CSV file from these users
echo "country,city,org,uid,visitct\n" > /tmp/users.csv
cat /tmp/users.json | \
 jq -r ' [ .country, .city, .org,  ._uid, .visitct ] | @csv ' >> /tmp/users.csv

Lytics Watch Usage

  1. Create a NAME.lql file (any name) in a folder.
  2. Assuming you already have data collected, the tool will use our API to show recent examples evaluated against that LQL.

You can open and edit the file in an editor. Every time you edit it, the resulting users interpreted from recent data sent to our API are printed.

Example

# get your API key from the web app account settings screen
export LIOKEY="your_api_key"

cd /path/to/your/project

# create an LQL file
# - utilize the Lytics app "Data -> Data Streams" section to see
#   data fields you are sending to Lytics.

# you can create this in an editor as well
echo '
SELECT
   user_id,
   name,
   todate(ts),
   match("user.") AS user_attributes,
   map(event, todate(ts))   as event_times   KIND map[string]time  MERGEOP LATEST

FROM default
INTO USER
BY user_id
ALIAS my_query
' > default.lql


# start watching
lytics schema queries watch .

# now edit the .lql file; on each change, JSON results showing how the data is interpreted are output

Lytics Watch With Custom Data

  1. Create a NAME.lql file (any name) in a folder.
  2. Create a NAME.json file (the name must match the LQL file name) in the same folder.
  3. Run the lytics watch command from the folder containing the files.
  4. Edit the .lql or .json files; on each change, the evaluated result of the .lql against the JSON is output.

Example

# get your API key from web app account settings
export LIOKEY="your_api_key"

cd /tmp

# start watching in background
lytics schema queries watch &

# create an LQL file
echo '
SELECT
   user_id,
   name,
   todate(ts),
   match("user.") AS user_attributes,
   map(event, todate(ts))   as event_times   KIND map[string]time  MERGEOP LATEST

FROM data
INTO USER
BY user_id
ALIAS hello
' > hello.lql

# Create an array of JSON events to feed into the LQL query
echo '[
    {"user_id":"dump123","name":"Down With","company":"Trump", "event":"project.create", "ts":"2016-11-09"},
    {"user_id":"another234","name":"No More","company":"Trump", "event":"project.signup","user.city":"Portland","user.state":"Or", "ts":"2016-11-09"}
]' > hello.json

SegmentML example

# replace {your model name here} with target_audience::source_audience

# generates tables
lytics segmentml --output all {your model name here}
lytics segmentml --output features {your model name here}
lytics segmentml --output predictions {your model name here}
lytics segmentml --output overview {your model name here}

# for CSV output
lytics --format csv segmentml --output all {your model name here}

# for JSON
lytics --format json segmentml --output all {your model name here}

cloudstorage's Issues

localfs NewWriter, should use O_TRUNC ?

Hi,

when I read the doc for NewWriter, it says:

		// NewWriter returns a io.Writer that writes to a Cloud object
		// associated with this backing Store object.
		//
		// A new object will be created if an object with this name already exists.
		// Otherwise any previous object with the same name will be replaced.
		// The object will not be available (and any previous object will remain)
		// until Close has been called
		NewWriter(o string, metadata map[string]string) (io.WriteCloser, error)

However, when using the localfs type, it appears the file is not truncated.

What do you think?
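
A minimal sketch of the change being asked about, assuming the localfs writer opens its backing file with os.OpenFile; the helper name and flag set here are illustrative, not the library's actual code:

import "os"

// openWriterFile opens (or creates) the local file backing a NewWriter call.
// Including os.O_TRUNC discards any previous contents, so a shorter new
// object cannot leave stale bytes from the old one at the end of the file.
func openWriterFile(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0664)
}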

Add support for SFTP

Add support for SFTP as a backend

  • SFTP uses traditional file-based folder structures, which means you can have a file a/b/file.csv, implying a folder a that contains a folder b, which contains file.csv. The a folder is empty and would NOT show up in traditional cloud-based systems, where a is ignored; only a/b would be a folder in Google, Azure, etc.
    • Update tests to validate this behavior across the existing stores; this is probably a weakness in the existing test suite.

awss3 List method ignores NextMarker

Hello

I had some trouble listing the entire contents of an S3 bucket: it only returns up to the PageSize value and does not give me the next marker when the response is truncated.

q := cloudstorage.NewQuery(path)
or, err := s.StoreReader.List(context.Background(), q)
...
spew.Dump(or.NextMarker) // always empty

When I investigated, I found two problems here:

https://github.com/lytics/cloudstorage/blob/master/awss3/store.go#L309

	if resp.IsTruncated != nil && *resp.IsTruncated {
		q.Marker = *resp.Contents[len(resp.Contents)-1].Key
	}

First, the query q is not a reference/pointer to a Query; it is passed by copy, so setting the marker on q is useless.

Second, it does not fill objResp.NextMarker (as Azure does: https://github.com/lytics/cloudstorage/blob/master/azure/store.go#L265).

This may affect other backends. I have a workaround for this, but a fix would be great -- I will try to submit a pull request.
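
A sketch of the kind of fix described above, inside the awss3 List implementation: surface the marker on the response (objResp) instead of writing it into the by-value query. Variable names follow the snippets above; this is not the repository's actual patch:

if resp.IsTruncated != nil && *resp.IsTruncated && len(resp.Contents) > 0 {
	// Expose the continuation point to the caller so truncated listings
	// can be paged, instead of mutating the local copy of the query.
	objResp.NextMarker = *resp.Contents[len(resp.Contents)-1].Key
}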

Using the Google auth for JWT files without a scope can lead to confusion.

I got a report of a user trying to use a JWT file and getting confused about why it wasn't working, with an error about lack of scopes. The reason was that they weren't setting the Scope, which the construction code allowed them to do. But then, on the first attempt to use the store, an error was thrown.

We should add better feedback for this during config/Auth construction, so it's more clear what went wrong.
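
A minimal sketch of the earlier feedback being suggested, validating the scope at config construction time instead of failing on first use; the function, parameters, and error text are assumptions for illustration, not the library's API:

import "fmt"

// validateJWTScopes returns a descriptive error at construction time when a
// JWT-based config has no OAuth scopes, instead of surfacing a confusing
// error on the first store operation.
func validateJWTScopes(jwtFile string, scopes []string) error {
	if jwtFile != "" && len(scopes) == 0 {
		return fmt.Errorf("google auth: JWT file %q given but no OAuth scope set; "+
			"configure at least one storage scope", jwtFile)
	}
	return nil
}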

google storage type BucketLifecycleRuleCondition field Age change type to *int64

With the latest versions of google.golang.org/api/storage/v1, the field Age on the struct BucketLifecycleRuleCondition changed type from int64 to *int64, and this prevents updating that specific package when using cloudstorage.

$ ./go.test.sh 
?   	github.com/lytics/cloudstorage/testutils	[no test files]
ok  	github.com/lytics/cloudstorage/sftp	0.106s	coverage: 8.4% of statements
ok  	github.com/lytics/cloudstorage/localfs	0.625s	coverage: 82.7% of statements
# github.com/lytics/cloudstorage/google
google/apistore.go:84:58: cannot use days (variable of type int64) as type *int64 in struct literal
FAIL	github.com/lytics/cloudstorage/google/storeutils [build failed]
FAIL
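
A sketch of the adaptation the compile error points at (google/apistore.go:84): with the newer API the Age condition wants a *int64, so take the address of the existing days value. The surrounding struct literal is abbreviated for illustration:

// days is an int64 computed elsewhere; newer google.golang.org/api/storage/v1
// versions declare BucketLifecycleRuleCondition.Age as *int64.
cond := &storage.BucketLifecycleRuleCondition{
	Age: &days, // was: Age: days
}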

awss3 List does not call query ApplyFilters

The documentation of Query.ApplyFilters says:

// ApplyFilters is called as the last step in store.List() to filter out the
// results before they are returned.

but this is not true for awss3 (and maybe for google).
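
A sketch of the missing step at the end of the awss3 List implementation, assuming ApplyFilters takes and returns the accumulated object list as the doc comment implies; objResp is illustrative:

// Run the query's filters over the results as the last step of List,
// as Query.ApplyFilters documents, before returning them to the caller.
objResp.Objects = q.ApplyFilters(objResp.Objects)
return objResp, nil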

Read File if already in LocalCache and Cleaner

It would be nice to read the file from the local file cache if it already exists there, checking the md5 to make sure it is the same content. Use the md5 as the filename? This would require a couple of cleaner strategies, time-based as well as size-based.
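
A small sketch of the md5-as-filename idea; these helpers are purely illustrative and not part of the library:

import (
	"crypto/md5"
	"encoding/hex"
	"path/filepath"
)

// cachePathFor derives a stable local cache filename from an object's md5,
// so an existing cached copy can be reused when the remote content matches.
func cachePathFor(cacheDir, md5sum string) string {
	return filepath.Join(cacheDir, md5sum+".cache")
}

// md5Hex computes the hex md5 of local bytes for comparison against the
// object's stored md5 before deciding whether to re-download.
func md5Hex(b []byte) string {
	sum := md5.Sum(b)
	return hex.EncodeToString(sum[:])
}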

File name gets too long when local storage is used with TmpDir=LocalFS

When local file storage is used and TmpDir is set to the same path as the LocalFS store, it appears to cause a loop of cached copies of cache files being created every time List() is called (this happens in LIO right now, with the event store archive pointing at /tmp):

localfile: error occurred opening cachedcopy file. cachepath=/tmp/aid-12/v1/stream/.5dee81effdba11e5adfe0862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.5dee81effdba11e5adfe0862664d56cf.cache.5dee81effdba11e5adfe0862664d56cf.cachefrom_s3.5dee81effdba11e5adfe0862664d56cf.cache err=open /tmp/aid-12/v1/stream/.5dee81effdba11e5adfe0862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.5dee81effdba11e5adfe0862664d56cf.cache.5dee81effdba11e5adfe0862664d56cf.cachefrom_s3.5dee81effdba11e5adfe0862664d56cf.cache: file name too long

This causes reads to crash after a few iterations, since the filename gets too long.

Tests failing after I modified Move/Copy test cases to use variable len payloads

In a previous PR I discovered that sftp/localstore and Azure test cases were failing after I switched test cases to using variable length payloads.

SFTP/localstore failed because the files weren't being correctly truncated before being written to. So, if the new data was shorter than the previous data, the file would contain extra data and be corrupted.

Azure turned out to be a race with how we wrote to the backing store.

improve sftp support

Hi!

Thanks again for the lib, super useful!

About the sftp part: I want to report that some options should be configurable; they are hardcoded, and this does not work very well IMHO.

In those two methods,

// ConfigUserPass creates ssh config with user/password
// HostKeyCallback was added here
// https://github.com/golang/crypto/commit/e4e2799dd7aab89f583e1d898300d96367750991
// currently we don't check hostkey, but in the future (todo) we could store the hostkey
// and check on future logins if there is a match.
func ConfigUserPass(user, password string) *ssh.ClientConfig {
	return &ssh.ClientConfig{
		User: user,
		Auth: []ssh.AuthMethod{
			ssh.Password(password),
		},
		// Config: ssh.Config{
		// 	Ciphers: []string{
		// 		"aes128-ctr", "aes192-ctr", "aes256-ctr", "[email protected]",
		// 		"arcfour256", "arcfour128", "aes128-cbc",
		// 	},
		// },
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
		Timeout:         timeout,
	}
}

// ConfigUserKey creates ssh config with ssh/private rsa key
func ConfigUserKey(user, keyString string) (*ssh.ClientConfig, error) {
	// Decode the RSA private key

	key, err := ssh.ParsePrivateKey([]byte(keyString))
	if err != nil {
		return nil, fmt.Errorf("bad private key: %s", err)
	}

	return &ssh.ClientConfig{
		User: user,
		Auth: []ssh.AuthMethod{
			ssh.PublicKeys(key),
		},
		Config: ssh.Config{
			Ciphers: []string{
				"aes128-ctr", "aes192-ctr", "aes256-ctr", "[email protected]",
				"arcfour256", "arcfour128", "aes128-cbc",
			},
		},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
		Timeout:         timeout,
	}, nil
}

As you can see, I had to comment out the cipher list, otherwise I would end up with an SSH error such as "SSH_FX_PERMISSION_DENIED".

Also, I think the host key callback (currently insecure) and the timeout should be configurable.

For those three things, I'd like to suggest adding new configuration keys (like ConfKeyFolder = "folder") so the end user can modify them via the gou.JsonHelper map.

What do you think?

thanks
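
A sketch of one way the hardcoded pieces could be made configurable: accept the cipher list, timeout, and host key callback as options and fall back to the current defaults. The function and parameter names are illustrative, not a proposal for the library's actual API:

import (
	"time"

	"golang.org/x/crypto/ssh"
)

// clientConfig builds an ssh.ClientConfig from caller-supplied options,
// keeping today's behavior when an option is left at its zero value.
func clientConfig(user string, auth []ssh.AuthMethod, ciphers []string,
	timeout time.Duration, hostKey ssh.HostKeyCallback) *ssh.ClientConfig {

	cfg := &ssh.ClientConfig{User: user, Auth: auth}
	if len(ciphers) > 0 {
		cfg.Config = ssh.Config{Ciphers: ciphers}
	}
	if timeout <= 0 {
		timeout = 30 * time.Second // illustrative default
	}
	cfg.Timeout = timeout
	if hostKey == nil {
		// Current behavior; callers could instead pass ssh.FixedHostKey(...)
		// to verify the server's host key.
		hostKey = ssh.InsecureIgnoreHostKey()
	}
	cfg.HostKeyCallback = hostKey
	return cfg
}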

LocalFS Prefix Query Does Not Match Filename Prefixes

The LocalFS store will not match a filename prefix; it only matches folder prefixes. For example, if a file like /tests/users-2021-10-12.csv exists in the base of a LocalFS store:

  • a query for /tests will return the file
  • a query for /tests/users- will not return the file

This is different behavior from the other store types, which would return the file in both cases. It can cause issues when creating an application that operates with multiple different cloudstorage types.

SFTP (which operates on a similar file system) has been built to behave the same as the other store providers.
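
A sketch of the behavior being asked for: match the query prefix against the object's full relative path rather than only its directory, the way the cloud-backed stores do (helper name is illustrative):

import "strings"

// matchesPrefix reports whether an object path such as
// "tests/users-2021-10-12.csv" should be returned for a query prefix like
// "tests/users-", i.e. filename prefixes match as well as folder prefixes.
func matchesPrefix(objectPath, queryPrefix string) bool {
	return strings.HasPrefix(
		strings.TrimPrefix(objectPath, "/"),
		strings.TrimPrefix(queryPrefix, "/"),
	)
}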

S3 Copy/Move performance optimizations

There are Copy/Move interfaces that are not currently implemented for S3. These are performance optimizations: when copying/moving from S3 to S3 they avoid a network transfer and instead let S3 perform the copy/move through its native API.

cloudstorage/store.go, lines 73 to 85 at commit 2731960:

// StoreCopy Optional interface to fast path copy. Many of the cloud providers
// don't actually copy bytes. Rather they allow a "pointer" that is a fast copy.
StoreCopy interface {
	// Copy from object, to object
	Copy(ctx context.Context, src, dst Object) error
}

// StoreMove Optional interface to fast path move. Many of the cloud providers
// don't actually copy bytes.
StoreMove interface {
	// Move from object location, to object location.
	Move(ctx context.Context, src, dst Object) error
}

Google implementation

// Copy from src to destination
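
A sketch of what the S3 fast path could look like using the AWS SDK's server-side CopyObject call, so no object bytes flow through the client. Function and parameter names are illustrative, not the store's actual interface:

import (
	"context"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/s3"
)

// copyWithinS3 asks S3 to copy an object server-side within the same bucket.
func copyWithinS3(ctx context.Context, svc *s3.S3, bucket, srcKey, dstKey string) error {
	_, err := svc.CopyObjectWithContext(ctx, &s3.CopyObjectInput{
		Bucket: aws.String(bucket),
		// CopySource is "source-bucket/source-key"; keys with special
		// characters may need URL-encoding.
		CopySource: aws.String(bucket + "/" + srcKey),
		Key:        aws.String(dstKey),
	})
	return err
}

// A Move fast path would be the same copy followed by a DeleteObject on srcKey.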

Overwriting behavioral difference between stores types, when using store.NewWriter...

Some cloud providers overwrite a file as an atomic operation that takes place on the call to writer.Close(). But for localfs and sftp, we currently remove the file when the writer is opened and then overwrite the object as we stream bytes to it. This creates a window of time during which the file is in an inconsistent state for the stores that don't support atomic replacement on Close().
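
One common way to close that gap for localfs, sketched below: stream into a temporary file in the same directory and rename it over the destination in Close(), which is atomic on POSIX filesystems. This is an illustration of the idea, not the library's implementation:

import (
	"os"
	"path/filepath"
)

// atomicWriter streams bytes into a temp file and only replaces the
// destination on Close, so readers never observe a partially written object.
type atomicWriter struct {
	tmp  *os.File
	dest string
}

func newAtomicWriter(dest string) (*atomicWriter, error) {
	f, err := os.CreateTemp(filepath.Dir(dest), ".partial-*")
	if err != nil {
		return nil, err
	}
	return &atomicWriter{tmp: f, dest: dest}, nil
}

func (w *atomicWriter) Write(p []byte) (int, error) { return w.tmp.Write(p) }

func (w *atomicWriter) Close() error {
	if err := w.tmp.Close(); err != nil {
		return err
	}
	// Rename within the same directory atomically replaces the destination.
	return os.Rename(w.tmp.Name(), w.dest)
}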

Handle Files vs Directories

The current Object implementation assumes it is a file, and doesn't have any affordance for directory only types. https://golang.org/pkg/os/#FileMode

Google Storage has a Delimiter for filtering to certain types

https://developers.google.com/apis-explorer/#p/storage/v1/storage.objects.list?bucket=lytics-dataux-tests&delimiter=%252F&maxResults=1000&prefix=tables%252F&_h=8&

Limiting to directories with Delimiter = "/"

{
 "kind": "storage#objects",
 "prefixes": [
  "tables/article/",
  "tables/user/"
 ]
}

Regular objects

https://developers.google.com/apis-explorer/#p/storage/v1/storage.objects.list?bucket=lytics-dataux-tests&maxResults=1000&prefix=tables%252F&_h=7&

{
 "kind": "storage#objects",
 "items": [
  {
   "kind": "storage#object",
   "id": "lytics-dataux-tests/tables/article/article1.csv/1457896488161000",
   "selfLink": "https://www.googleapis.com/storage/v1/b/lytics-dataux-tests/o/tables%2Farticle%2Farticle1.csv",
   "name": "tables/article/article1.csv",
   "bucket": "lytics-dataux-tests",
   "generation": "1457896488161000",
   "metageneration": "1",
   "contentType": "text/csv; charset=utf-8",
   "timeCreated": "2016-03-13T19:14:48.119Z",
   "updated": "2016-03-13T19:14:48.119Z",
   "storageClass": "STANDARD",
   "size": "398",
   "md5Hash": "+RTyIckctKnUmha0OaBBHA==",
   "mediaLink": "https://www.googleapis.com/download/storage/v1/b/lytics-dataux-tests/o/tables%2Farticle%2Farticle1.csv?generation=1457896488161000&alt=media",
   "metadata": {
    "content_type": "text/csv; charset=utf-8"
   },
   "owner": {
    "entity": "user-00b4903a9730f58d42103a7fd40a4d0f92e371ae57112475d3993f88ccf7e8d6",
    "entityId": "00b4903a9730f58d42103a7fd40a4d0f92e371ae57112475d3993f88ccf7e8d6"
   },
   "crc32c": "8oZBFA==",
   "etag": "COidr9KvvssCEAE="
  },
  {
   "kind": "storage#object",
   "id": "lytics-dataux-tests/tables/user/user1.csv/1457896494340000",
   "selfLink": "https://www.googleapis.com/storage/v1/b/lytics-dataux-tests/o/tables%2Fuser%2Fuser1.csv",
   "name": "tables/user/user1.csv",
   "bucket": "lytics-dataux-tests",
   "generation": "1457896494340000",
   "metageneration": "1",
   "contentType": "text/csv; charset=utf-8",
   "timeCreated": "2016-03-13T19:14:54.328Z",
   "updated": "2016-03-13T19:14:54.328Z",
   "storageClass": "STANDARD",
   "size": "299",
   "md5Hash": "p6GxtAFU3xu3q8ty852yxw==",
   "mediaLink": "https://www.googleapis.com/download/storage/v1/b/lytics-dataux-tests/o/tables%2Fuser%2Fuser1.csv?generation=1457896494340000&alt=media",
   "metadata": {
    "content_type": "text/csv; charset=utf-8"
   },
   "owner": {
    "entity": "user-00b4903a9730f58d42103a7fd40a4d0f92e371ae57112475d3993f88ccf7e8d6",
    "entityId": "00b4903a9730f58d42103a7fd40a4d0f92e371ae57112475d3993f88ccf7e8d6"
   },
   "crc32c": "1fpnNw==",
   "etag": "CKCvqNWvvssCEAE="
  }
 ]
}
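
A sketch of a delimiter-aware listing using the cloud.google.com/go/storage client, which separates directory-like prefixes from regular objects just as the API explorer output above does (bucket and prefix values are whatever the caller supplies):

import (
	"context"
	"fmt"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

// listWithDelimiter prints synthetic "directory" entries (ObjectAttrs.Prefix
// is set) separately from regular objects.
func listWithDelimiter(ctx context.Context, client *storage.Client, bucket, prefix string) error {
	it := client.Bucket(bucket).Objects(ctx, &storage.Query{Prefix: prefix, Delimiter: "/"})
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			return nil
		}
		if err != nil {
			return err
		}
		if attrs.Prefix != "" {
			fmt.Println("dir:", attrs.Prefix) // e.g. tables/article/
		} else {
			fmt.Println("obj:", attrs.Name) // e.g. tables/article/article1.csv
		}
	}
}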

update Google Cloud API client import paths and more

The Google Cloud API client libraries for Go are making some breaking changes:

  • The import paths are changing from google.golang.org/cloud/... to
    cloud.google.com/go/.... For example, if your code imports the BigQuery client
    it currently reads
    import "google.golang.org/cloud/bigquery"
    It should be changed to
    import "cloud.google.com/go/bigquery"
  • Client options are also moving, from google.golang.org/cloud to
    google.golang.org/api/option. Two have also been renamed:
    • WithBaseGRPC is now WithGRPCConn
    • WithBaseHTTP is now WithHTTPClient
  • The cloud.WithContext and cloud.NewContext methods are gone, as are the
    deprecated pubsub and container functions that required them. Use the Client
    methods of these packages instead.

You should make these changes before September 12, 2016, when the packages at
google.golang.org/cloud will go away.
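
For code that uses the storage client, the change is mostly a mechanical import update, along these lines:

// Before
import "google.golang.org/cloud/storage"

// After
import "cloud.google.com/go/storage"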
