lytics / cloudstorage
Cloud & local storage unified api (s3, google, azure, sftp, local)
License: MIT License
When the local file storage is used and TmpDir is set to the same directory as the LocalFS, it looks like this causes a loop of cached files being created every time List() is called (this happens in LIO right now with the event store archive pointing at /tmp):
localfile: error occurred opening cachedcopy file. cachepath=/tmp/aid-12/v1/stream/.5dee81effdba11e5adfe0862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.5dee81effdba11e5adfe0862664d56cf.cache.5dee81effdba11e5adfe0862664d56cf.cachefrom_s3.5dee81effdba11e5adfe0862664d56cf.cache err=open /tmp/aid-12/v1/stream/.5dee81effdba11e5adfe0862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.72580f9cfdbf11e598d00862664d56cf.cache.5dee81effdba11e5adfe0862664d56cf.cache.5dee81effdba11e5adfe0862664d56cf.cachefrom_s3.5dee81effdba11e5adfe0862664d56cf.cache: file name too long
This causes the reads to crash after a few iterations, since the filename gets too long.
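A minimal sketch of a guard against this configuration, assuming the problem is that the cache directory sits inside the directory being listed. The helper name isWithin is hypothetical and is not part of cloudstorage; it only shows how the overlap could be detected before a store is created.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isWithin reports whether child is the same directory as parent or is
// nested somewhere below it. (Hypothetical helper, not a cloudstorage API.)
func isWithin(parent, child string) (bool, error) {
	p, err := filepath.Abs(parent)
	if err != nil {
		return false, err
	}
	c, err := filepath.Abs(child)
	if err != nil {
		return false, err
	}
	rel, err := filepath.Rel(p, c)
	if err != nil {
		return false, err
	}
	// A relative path that climbs out of parent starts with "..".
	outside := rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator))
	return !outside, nil
}

func main() {
	// The store directory and the cache directory are the same here, which is
	// the situation described in the report.
	overlap, err := isWithin("/tmp", "/tmp")
	fmt.Println(overlap, err)
}
```

A caller could refuse to start, or pick a different TmpDir, when the check reports an overlap.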
On awss3, object.Close() removes the cachepath:
https://github.com/lytics/cloudstorage/blob/master/awss3/store.go#L677
But localfs only removes it on Release:
https://github.com/lytics/cloudstorage/blob/master/localfs/store.go#L463
Not sure if it happens in other backends.
Some cloud providers overwrite a file as an atomic operation that takes place on a call to writer.Close(). But for localfs and sftp, we're currently removing the file when the writer is opened and then we overwrite the object as we stream bytes to it. This creates a gap of time when the file is in an inconsistent state for those stores that don't support atomic replacement on Close().
From this comment #53 (comment)
It would be nice to read the file from the local-file cache if it already exists there, checking the md5 to make sure it is the same file. Use the md5 as the filename? This would require a couple of cleanup strategies, time-based as well as size-based.
see in the comment line.
Line 316 in bab6fd6
I have noticed use of os.FileInfo inside quite a few projects for reading/iterating files. Would be nice to be able to cast cloudstorage.Object as os.FileInfo for easy alignment with existing code.
The LocalFS store will not match a filename prefix; it only matches folder prefixes.
If a file like /tests/users-2021-10-12.csv exists in the base of a LocalFS store:
/tests will return the file
/tests/users- will not return the file
This is different behavior from other store types, which would return the file in both cases. This can cause issues when creating an application that operates with multiple different cloudstorage types.
SFTP (which operates on a similar file system) has been built to behave the same as other store providers.
With the latest versions of google.golang.org/api/storage/v1, the field Age on struct BucketLifecycleRuleCondition changed type from int64 to *int64, and this prevents updating that package when using cloudstorage:
$ ./go.test.sh
? github.com/lytics/cloudstorage/testutils [no test files]
ok github.com/lytics/cloudstorage/sftp 0.106s coverage: 8.4% of statements
ok github.com/lytics/cloudstorage/localfs 0.625s coverage: 82.7% of statements
# github.com/lytics/cloudstorage/google
google/apistore.go:84:58: cannot use days (variable of type int64) as type *int64 in struct literal
FAIL github.com/lytics/cloudstorage/google/storeutils [build failed]
FAIL
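The compile error above is the usual value-versus-pointer mismatch, and the fix is normally to pass the address of the value. The sketch below uses a local stand-in struct (lifecycleCondition is an assumption made so the example runs without the Google API package); only the Age field type mirrors what the report describes.

```go
package main

import "fmt"

// lifecycleCondition is a local stand-in that mirrors only the Age field of
// BucketLifecycleRuleCondition as described in the report (assumption).
type lifecycleCondition struct {
	Age *int64
}

// newCondition builds a condition from a plain int64 number of days.
func newCondition(days int64) lifecycleCondition {
	// Taking the address of the parameter satisfies the *int64 field.
	return lifecycleCondition{Age: &days}
}

func main() {
	cond := newCondition(30)
	fmt.Println(*cond.Age)
}
```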
Hello
I had some trouble listing the entire content of an S3 bucket: it only returns up to the PageSize value and does not give me the next marker when the response is truncated.
q := cloudstorage.NewQuery(path)
or, err := s.StoreReader.List(context.Background(), q)
...
spew.Dump(or.NextMarker) // always empty
When I investigated, I found two problems here:
https://github.com/lytics/cloudstorage/blob/master/awss3/store.go#L309
if resp.IsTruncated != nil && *resp.IsTruncated {
	q.Marker = *resp.Contents[len(resp.Contents)-1].Key
}
First, the query q is not a reference/pointer to a Query; it is passed by copy, so setting the marker on q is useless.
Second, it does not fill objResp.NextMarker (like Azure does: https://github.com/lytics/cloudstorage/blob/master/azure/store.go#L265 ).
This may affect other backends. I have a workaround for this, but a fix would be great. I will try to submit a pull request.
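To illustrate the second point, here is a small self-contained sketch of marker-based paging where the marker travels back to the caller in the response rather than through the by-value query. The query and listResponse types are simplified stand-ins written for this example, not the cloudstorage or AWS SDK types.

```go
package main

import "fmt"

// query is a simplified stand-in for cloudstorage.Query (assumption).
type query struct {
	Marker   string
	PageSize int
}

// listResponse is a simplified stand-in for the list response (assumption).
type listResponse struct {
	Keys       []string
	NextMarker string
}

// listPage returns one page of keys that sort after q.Marker. The query is
// passed by value, so the next marker must be returned in the response.
func listPage(sorted []string, q query) listResponse {
	start := 0
	for start < len(sorted) && sorted[start] <= q.Marker {
		start++
	}
	end := start + q.PageSize
	if end > len(sorted) {
		end = len(sorted)
	}
	resp := listResponse{Keys: sorted[start:end]}
	// Truncated: report the last returned key so the caller can resume.
	if end < len(sorted) && end > start {
		resp.NextMarker = sorted[end-1]
	}
	return resp
}

// listAll follows NextMarker until the listing is no longer truncated.
func listAll(sorted []string, pageSize int) []string {
	var all []string
	q := query{PageSize: pageSize}
	for {
		resp := listPage(sorted, q)
		all = append(all, resp.Keys...)
		if resp.NextMarker == "" {
			return all
		}
		q.Marker = resp.NextMarker
	}
}

func main() {
	keys := []string{"a", "b", "c", "d", "e"}
	fmt.Println(listAll(keys, 2))
}
```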
Hi!
Thanks again for the lib, super useful!
About the sftp part, I want to report that some options should be configurable. They are hardcoded, and this does not work very well, in my opinion.
In these two functions:
// ConfigUserPass creates ssh config with user/password
// HostKeyCallback was added here
// https://github.com/golang/crypto/commit/e4e2799dd7aab89f583e1d898300d96367750991
// currently we don't check hostkey, but in the future (todo) we could store the hostkey
// and check on future logins if there is a match.
func ConfigUserPass(user, password string) *ssh.ClientConfig {
	return &ssh.ClientConfig{
		User: user,
		Auth: []ssh.AuthMethod{
			ssh.Password(password),
		},
		// Config: ssh.Config{
		// 	Ciphers: []string{
		// 		"aes128-ctr", "aes192-ctr", "aes256-ctr", "aes128-gcm@openssh.com",
		// 		"arcfour256", "arcfour128", "aes128-cbc",
		// 	},
		// },
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
		Timeout:         timeout,
	}
}
// ConfigUserKey creates ssh config with ssh/private rsa key
func ConfigUserKey(user, keyString string) (*ssh.ClientConfig, error) {
	// Decode the RSA private key
	key, err := ssh.ParsePrivateKey([]byte(keyString))
	if err != nil {
		return nil, fmt.Errorf("bad private key: %s", err)
	}
	return &ssh.ClientConfig{
		User: user,
		Auth: []ssh.AuthMethod{
			ssh.PublicKeys(key),
		},
		Config: ssh.Config{
			Ciphers: []string{
				"aes128-ctr", "aes192-ctr", "aes256-ctr", "aes128-gcm@openssh.com",
				"arcfour256", "arcfour128", "aes128-cbc",
			},
		},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
		Timeout:         timeout,
	}, nil
}
As you can see, I had to comment out the cipher part; otherwise I'd end up with an SSH error such as "SSH_FX_PERMISSION_DENIED".
Also, I think the host key callback (which currently ignores host keys) and the timeout should be configurable.
For those three things, I'd like to suggest adding new configuration keys (like ConfKeyFolder = "folder") so the end user can modify them via the gou.JsonHelper map.
What do you think?
Thanks
Hi,
When I read the doc about NewWriter, it says:
// NewWriter returns a io.Writer that writes to a Cloud object
// associated with this backing Store object.
//
// A new object will be created if an object with this name already exists.
// Otherwise any previous object with the same name will be replaced.
// The object will not be available (and any previous object will remain)
// until Close has been called
NewWriter(o string, metadata map[string]string) (io.WriteCloser, error)
However, when using the localfs type, it appears the file is not truncated:
Line 247 in a14e59c
cloudstorage/csbufio/writer.go
Line 16 in b738f5f
What do you think?
Integrate with Box.com
Hello
When using the awss3 store, we noticed that our application does not upload the remaining data during shutdown.
Looking into the reason, we discovered that it works if we add a sleep of a few seconds.
This happens because this goroutine:
Line 455 in 4b211b8
does not have any kind of synchronization (waitgroup etc.), so during a shutdown the write closer's Close method returns before the actual upload finishes in the background.
I see that the azure package has a wrapper that uses an errgroup, which ensures that azure objects close properly. The same should be done in awss3.
The Google Cloud API client libraries for Go are making some breaking changes:
The import paths are changing from google.golang.org/cloud/... to cloud.google.com/go/... . For example, if your code imports the BigQuery client, change
import "google.golang.org/cloud/bigquery"
to
import "cloud.google.com/go/bigquery"
Client options are moving from google.golang.org/cloud to google.golang.org/api/option. Two have also been renamed:
WithBaseGRPC is now WithGRPCConn
WithBaseHTTP is now WithHTTPClient
The cloud.WithContext and cloud.NewContext methods are gone, as are the … Client …
You should make these changes before September 12, 2016, when the packages at google.golang.org/cloud will go away.
I got a report of a user trying to use a JWT file and getting confused about why it wasn't working, with an error about lack of scopes. The reason was that they weren't setting the Scope, and the construction code allowed them to leave it out. The error was only thrown later, on the first attempt to use the store.
We should add better feedback for this during config/Auth construction, so it's clearer what went wrong.
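One possible shape for that feedback is a validation step that runs when the config is constructed. The authConfig type and its field names below are hypothetical and do not match the real cloudstorage config; the sketch only shows failing early with a message that names the missing setting.

```go
package main

import (
	"errors"
	"fmt"
)

// authConfig is a hypothetical, trimmed-down auth config for illustration.
type authConfig struct {
	JWTFile string
	Scopes  []string
}

// validate reports configuration mistakes at construction time instead of on
// the first request against the store.
func (c authConfig) validate() error {
	if c.JWTFile == "" {
		return errors.New("auth config: no JWT file set")
	}
	if len(c.Scopes) == 0 {
		return errors.New("auth config: JWT file set but no scope given; set at least one scope")
	}
	return nil
}

func main() {
	cfg := authConfig{JWTFile: "service-account.json"}
	fmt.Println(cfg.validate())
}
```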
Many use cases require getting a notification immediately after a file is added to a store for import.
Support interfaces for event-based file-change models.
The current Object implementation assumes it is a file, and doesn't have any affordance for directory-only types. https://golang.org/pkg/os/#FileMode
Google Storage has a Delimiter for filtering to certain types.
Limiting to directories with Delimiter = "/"
{
  "kind": "storage#objects",
  "prefixes": [
    "tables/article/",
    "tables/user/"
  ]
}
Regular objects
{
  "kind": "storage#objects",
  "items": [
    {
      "kind": "storage#object",
      "id": "lytics-dataux-tests/tables/article/article1.csv/1457896488161000",
      "selfLink": "https://www.googleapis.com/storage/v1/b/lytics-dataux-tests/o/tables%2Farticle%2Farticle1.csv",
      "name": "tables/article/article1.csv",
      "bucket": "lytics-dataux-tests",
      "generation": "1457896488161000",
      "metageneration": "1",
      "contentType": "text/csv; charset=utf-8",
      "timeCreated": "2016-03-13T19:14:48.119Z",
      "updated": "2016-03-13T19:14:48.119Z",
      "storageClass": "STANDARD",
      "size": "398",
      "md5Hash": "+RTyIckctKnUmha0OaBBHA==",
      "mediaLink": "https://www.googleapis.com/download/storage/v1/b/lytics-dataux-tests/o/tables%2Farticle%2Farticle1.csv?generation=1457896488161000&alt=media",
      "metadata": {
        "content_type": "text/csv; charset=utf-8"
      },
      "owner": {
        "entity": "user-00b4903a9730f58d42103a7fd40a4d0f92e371ae57112475d3993f88ccf7e8d6",
        "entityId": "00b4903a9730f58d42103a7fd40a4d0f92e371ae57112475d3993f88ccf7e8d6"
      },
      "crc32c": "8oZBFA==",
      "etag": "COidr9KvvssCEAE="
    },
    {
      "kind": "storage#object",
      "id": "lytics-dataux-tests/tables/user/user1.csv/1457896494340000",
      "selfLink": "https://www.googleapis.com/storage/v1/b/lytics-dataux-tests/o/tables%2Fuser%2Fuser1.csv",
      "name": "tables/user/user1.csv",
      "bucket": "lytics-dataux-tests",
      "generation": "1457896494340000",
      "metageneration": "1",
      "contentType": "text/csv; charset=utf-8",
      "timeCreated": "2016-03-13T19:14:54.328Z",
      "updated": "2016-03-13T19:14:54.328Z",
      "storageClass": "STANDARD",
      "size": "299",
      "md5Hash": "p6GxtAFU3xu3q8ty852yxw==",
      "mediaLink": "https://www.googleapis.com/download/storage/v1/b/lytics-dataux-tests/o/tables%2Fuser%2Fuser1.csv?generation=1457896494340000&alt=media",
      "metadata": {
        "content_type": "text/csv; charset=utf-8"
      },
      "owner": {
        "entity": "user-00b4903a9730f58d42103a7fd40a4d0f92e371ae57112475d3993f88ccf7e8d6",
        "entityId": "00b4903a9730f58d42103a7fd40a4d0f92e371ae57112475d3993f88ccf7e8d6"
      },
      "crc32c": "1fpnNw==",
      "etag": "CKCvqNWvvssCEAE="
    }
  ]
}
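The two responses above differ only in whether a delimiter was applied. As a rough sketch of that grouping logic for a store that has nothing but flat keys, the function below splits keys under a prefix into direct items and "folder" prefixes. listWithDelimiter is a hypothetical helper written for this example, not a cloudstorage or Google Storage API.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// listWithDelimiter splits the keys under prefix into direct items and
// collapsed folder prefixes, in the way a delimiter listing groups them.
func listWithDelimiter(keys []string, prefix, delim string) (items, prefixes []string) {
	seen := map[string]bool{}
	for _, k := range keys {
		if !strings.HasPrefix(k, prefix) {
			continue
		}
		rest := k[len(prefix):]
		if delim != "" {
			if i := strings.Index(rest, delim); i >= 0 {
				// Everything up to and including the first delimiter
				// after the prefix becomes one "folder" entry.
				p := prefix + rest[:i+len(delim)]
				if !seen[p] {
					seen[p] = true
					prefixes = append(prefixes, p)
				}
				continue
			}
		}
		items = append(items, k)
	}
	sort.Strings(prefixes)
	return items, prefixes
}

func main() {
	keys := []string{
		"tables/article/article1.csv",
		"tables/user/user1.csv",
	}
	items, prefixes := listWithDelimiter(keys, "tables/", "/")
	fmt.Println(items, prefixes)
}
```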
In a previous PR I discovered that the sftp/localstore and Azure test cases were failing after I switched the test cases to use variable-length payloads.
SFTP/Localstore failed because the files weren't being truncated correctly before being written to. So, if the new data was shorter than the previous data, the file would contain extra data and be corrupted.
Azure turned out to be a race in how we wrote to the backing store.
Quite a few changes in the upstream cloud.google.com/go libs.
Hi
To make it possible to use Yandex S3 (which mimics the S3 API in everything except the endpoint), I'd like to be able to set the endpoint in the cloudstorage Conf, to be used only in awss3 and only if it is not empty.
The documentation of Query ApplyFilters says:
// ApplyFilters is called as the last step in store.List() to filter out the
// results before they are returned.
But this is not true for awss3 (and maybe not for google either).
Because most cloud storage uses full paths as the "key", we had different folder path conventions in local-storage vs cloud, i.e. "local-test/a/b/" vs "b".
Add support for SFTP as a backend
a/b/file.csv means folder a, which contains folder b, which contains file.csv. The a folder is empty, and would NOT show up in traditional cloud-based systems, where a is ignored. Only a/b would be a folder in google, azure, etc.