sage-bionetworks / synapsepythonclient

Programmatic interface to Synapse services for Python

Home Page: https://www.synapse.org

License: Apache License 2.0



Python Synapse Client


A Python client for Sage Bionetworks' Synapse, a collaborative, open-source research platform that allows teams to share data, track analyses, and collaborate. The Python client can be used as a library for development of software that communicates with Synapse or as a command-line utility.

There is also a Synapse client for R.

Documentation

For more information about the Python client, interacting with Synapse, and release notes, see the Synapse documentation site.

Installation

The Python Synapse client has been tested on Python 3.8, 3.9, 3.10, and 3.11 on macOS, Ubuntu Linux, and Windows.

Starting with version 3.0, the Synapse Python client requires Python >= 3.8.

Install using pip

The Python Synapse Client is on PyPI and can be installed with pip:

(sudo) pip install "synapseclient[pandas,pysftp]"

...or to upgrade an existing installation of the Synapse client:

(sudo) pip install --upgrade synapseclient

The dependencies on pandas and pysftp are optional: Synapse Tables integrate with pandas, and pysftp is required only for SFTP file storage. Both libraries require native code to be compiled or installed separately from prebuilt binaries.
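Because the extras are optional, code built on the client may want to probe for them at runtime and degrade gracefully. A minimal sketch (not part of the client itself):

```python
# Probe for the optional extras so dependent code can degrade gracefully
# when they were not installed alongside synapseclient.
try:
    import pandas  # enables Synapse Table <-> DataFrame integration
    HAS_PANDAS = True
except ImportError:
    HAS_PANDAS = False

try:
    import pysftp  # required only for SFTP file storage
    HAS_PYSFTP = True
except ImportError:
    HAS_PYSFTP = False
```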

Install from source

Clone the source code repository.

git clone git://github.com/Sage-Bionetworks/synapsePythonClient.git
cd synapsePythonClient
pip install .

Command line usage

The Synapse client can be used from the shell command prompt. Valid commands include: query, get, cat, add, update, delete, and onweb. A few examples are shown.

downloading test data from Synapse

synapse -p auth_token get syn1528299

getting help

synapse -h

Note that a Synapse account is required.

Usage as a library

The Synapse client can be used to write software that interacts with the Sage Bionetworks Synapse repository. More examples can be found in the Tutorial section of the documentation.

Examples

Log-in and create a Synapse object

import synapseclient

syn = synapseclient.Synapse()
## You may optionally specify the debug flag to True to print out debug level messages.
## A debug level may help point to issues in your own code, or uncover a bug within ours.
# syn = synapseclient.Synapse(debug=True)

## log in using auth token
syn.login(authToken='auth_token')

Sync a local directory to Synapse

This is the recommended way to synchronize more than one file or directory with a Synapse project, using synapseutils. The library schedules everything required to sync an entire directory tree. Read more about the manifest file format in synapseutils.syncToSynapse.

import synapseclient
import synapseutils
import os

syn = synapseclient.Synapse()

## log in using auth token
syn.login(authToken='auth_token')

path = os.path.expanduser("~/synapse_project")
manifest_path = f"{path}/my_project_manifest.tsv"
project_id = "syn1234"

# Create an empty manifest file on disk; generate_sync_manifest fills it in
with open(manifest_path, "w", encoding="utf-8"):
    pass

# Walk the specified directory tree and create a TSV manifest file
synapseutils.generate_sync_manifest(
    syn,
    directory_path=path,
    parent_id=project_id,
    manifest_path=manifest_path,
)

# Using the generated manifest file, sync the files to Synapse
synapseutils.syncToSynapse(
    syn,
    manifestFile=manifest_path,
    sendMessages=False,
)

Store a Project to Synapse

import synapseclient
from synapseclient.entity import Project

syn = synapseclient.Synapse()

## log in using auth token
syn.login(authToken='auth_token')

project = Project('My uniquely named project')
project = syn.store(project)

print(project.id)
print(project)

Store a Folder to Synapse (Does not upload files within the folder)

import synapseclient
from synapseclient import Folder

syn = synapseclient.Synapse()

## log in using auth token
syn.login(authToken='auth_token')

folder = Folder(name='my_folder', parent="syn123")
folder = syn.store(folder)

print(folder.id)
print(folder)

Store a File to Synapse

import synapseclient
from synapseclient import File

syn = synapseclient.Synapse()

## log in using auth token
syn.login(authToken='auth_token')

filepath = "/path/to/raw_data.csv"  # local path of the file to upload
file = File(
    path=filepath,
    parent="syn123",
)
file = syn.store(file)

print(file.id)
print(file)

Get a data matrix

import synapseclient

syn = synapseclient.Synapse()

## log in using auth token
syn.login(authToken='auth_token')

## retrieve a 100 by 4 matrix
matrix = syn.get('syn1901033')

## inspect its properties
print(matrix.name)
print(matrix.description)
print(matrix.path)

## load the data matrix into a dictionary with an entry for each column
with open(matrix.path, 'r') as f:
    labels = f.readline().strip().split('\t')
    data = {label: [] for label in labels}
    for line in f:
        values = [float(x) for x in line.strip().split('\t')]
        for i in range(len(labels)):
            data[labels[i]].append(values[i])

## load the data matrix into a numpy array
import numpy as np
data_matrix = np.loadtxt(fname=matrix.path, skiprows=1)
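If the optional pandas dependency is installed, the same tab-separated file can be loaded in a single call. A sketch using an in-memory stand-in for `matrix.path`:

```python
import io

import pandas as pd

# In-memory stand-in for the downloaded TSV at matrix.path:
# a header row followed by tab-separated float values.
tsv = "colA\tcolB\n1.0\t2.0\n3.0\t4.0\n"

# One call replaces the manual header/split/convert loop above.
df = pd.read_csv(io.StringIO(tsv), sep="\t")
```

Each column of `df` is then a float Series, mirroring the dict-of-lists built by hand above.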

Authentication

Authentication to Synapse is accomplished using personal access tokens. Learn more about Synapse personal access tokens, and about the multiple ways to log in to Synapse, in the Synapse documentation.
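As a sketch, a personal access token can be supplied from the environment rather than hard-coded in scripts. The variable name below, SYNAPSE_AUTH_TOKEN, is the one recent client versions recognize themselves; the login call is commented out so the snippet runs without credentials:

```python
import os

# Read a personal access token from the environment instead of
# embedding it in source. SYNAPSE_AUTH_TOKEN is also checked by the
# client itself in recent versions.
token = os.environ.get("SYNAPSE_AUTH_TOKEN")

# With the token set in a real session:
# import synapseclient
# syn = synapseclient.Synapse()
# syn.login(authToken=token)
```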

Synapse Utilities (synapseutils)

The purpose of synapseutils is to provide convenience functions for common operations such as traversing large projects, copying entities, and recursively downloading files.

Example

import synapseutils
import synapseclient
syn = synapseclient.login()

# copies all Synapse entities to a destination location
synapseutils.copy(syn, "syn1234", destinationId = "syn2345")

# copies the wiki from the entity to a destination entity. Only a project can have sub wiki pages.
synapseutils.copyWiki(syn, "syn1234", destinationId = "syn2345")


# Traverses through Synapse directories, behaving like os.walk()
walkedPath = synapseutils.walk(syn, "syn1234")

for dirpath, dirnames, filenames in walkedPath:
    print(dirpath)
    print(dirnames)
    print(filenames)

OpenTelemetry (OTEL)

OpenTelemetry helps support the analysis of traces and spans, which can provide insights into latency, errors, and other performance metrics. The synapseclient is ready to provide traces should you want them. The Synapse Python client supports OTLP exports and can be configured via the standard OpenTelemetry environment variables.

Read more about OpenTelemetry in Python in the OpenTelemetry documentation.

Quick-start

The following shows an example of setting up jaegertracing via docker and executing a simple Python script that uses the Synapse Python client.

Running the jaeger docker container

Start a docker container with the following options:

docker run --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

Explanation of ports:

  • 4318 HTTP
  • 16686 Jaeger UI

Once the docker container is running you can access the Jaeger UI via: http://localhost:16686

Example

By default the OTEL exporter sends trace data to http://localhost:4318/v1/traces, however you may override this by setting the OTEL_EXPORTER_OTLP_TRACES_ENDPOINT environment variable.

import synapseclient
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

trace.set_tracer_provider(
    TracerProvider(
        resource=Resource(attributes={SERVICE_NAME: "my_own_code_above_synapse_client"})
    )
)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
tracer = trace.get_tracer("my_tracer")

@tracer.start_as_current_span("my_span_name")
def main():
    syn = synapseclient.Synapse()
    syn.login()
    my_entity = syn.get("syn52569429")
    print(my_entity)

main()

License and Copyright

© Copyright 2013-2023 Sage Bionetworks

This software is licensed under the Apache License, Version 2.0.

synapsepythonclient's People

Contributors

adrinjalali, allaway, apratap, brucehoff, bryanfauble, burrch3s, bwmac, cbare, cokelaer, danlu1, jay-hodgson, jaymedina, jkiang13, kaysoky, kdaily, kellrott, kimyen, kkdang, larssono, linchiahui, linchiahuisage, linglp, mfazza, minneker, rxu17, thomasyu888, vpchung, xschildw, ychae, zimingd


synapsepythonclient's Issues

Connecting to Synapse documentation

Client version

2.1.1

Description of the problem

Use of the API key is buried in the reference documentation. Recommend covering API key use in the Connecting to Synapse guide, since users will arrive there first before searching the reference documentation.

Make CACHE_ROOT_DIR customizable

On AWS Lambda you can only write to /tmp. We need a way to change the CACHE_ROOT_DIR.

My current workaround for this is:

synapseclient.cache.CACHE_ROOT_DIR = os.path.join(tempfile.gettempdir(), 'synapseCache')
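The workaround above can be sketched in full, creating the directory first; the module-level assignment must happen before the client opens its cache (the commented lines mirror the reporter's approach):

```python
import os
import tempfile

# The only writable location on AWS Lambda is under /tmp, so point the
# cache at a directory there.
cache_root = os.path.join(tempfile.gettempdir(), "synapseCache")
os.makedirs(cache_root, exist_ok=True)

# Mirroring the reporter's workaround (done before any client calls):
# import synapseclient
# synapseclient.cache.CACHE_ROOT_DIR = cache_root
```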

cache.py:retrieve_local_file_info:'file' variable not defined.

In cache.py, the function retrieve_local_file_info has `if file is not None` as part of a condition. `file` is a builtin function in Python 2 (and therefore never None), but it was removed in Python 3, and I guess the coder didn't mean to check whether the builtin `file` exists here. Should it be removed from the if clause?
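This is easy to confirm: in Python 3 the `file` builtin is gone, so the condition would raise NameError rather than act as a guard:

```python
import builtins

# `file` existed as a builtin type in Python 2 but was removed in Python 3.
assert not hasattr(builtins, "file")

# Consequently, evaluating the original condition in Python 3 raises
# NameError instead of quietly being true.
try:
    eval("file is not None")
    raised_name_error = False
except NameError:
    raised_name_error = True
```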

SSLError when trying to create a Synapse instance

Dear all,

I've had the following issue for the past couple of days when trying to create a Synapse client with:

import synapseclient
s = synapseclient.Synapse()

I've tried with the newest released package of synapseclient from Pypi.

Here is the error message.
Thanks a lot
Thomas


SSLError Traceback (most recent call last)
in ()
----> 1 s = synapseclient.Synapse(debug=True)

/home/cokelaer/Work/virtualenv/lib/python2.7/site-packages/synapseclient/client.pyc in init(self, repoEndpoint, authEndpoint, fileHandleEndpoint, portalEndpoint, debug, skip_checks)
149 raise
150
--> 151 self.setEndpoints(repoEndpoint, authEndpoint, fileHandleEndpoint, portalEndpoint, skip_checks)
152
153 ## TODO: rename to defaultHeaders ?

/home/cokelaer/Work/virtualenv/lib/python2.7/site-packages/synapseclient/client.pyc in setEndpoints(self, repoEndpoint, authEndpoint, fileHandleEndpoint, portalEndpoint, skip_checks)
206 # Update endpoints if we get redirected
207 if not skip_checks:
--> 208 response = requests.get(endpoints[point], allow_redirects=False, headers=synapseclient.USER_AGENT)
209 if response.status_code == 301:
210 endpoints[point] = response.headers['location']

/home/cokelaer/Work/virtualenv/lib/python2.7/site-packages/requests/api.pyc in get(url, **kwargs)
53
54 kwargs.setdefault('allow_redirects', True)
---> 55 return request('get', url, **kwargs)
56
57

/home/cokelaer/Work/virtualenv/lib/python2.7/site-packages/requests/api.pyc in request(method, url, **kwargs)
42
43 session = sessions.Session()
---> 44 return session.request(method=method, url=url, **kwargs)
45
46

/home/cokelaer/Work/virtualenv/lib/python2.7/site-packages/requests/sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert)
359 'allow_redirects': allow_redirects,
360 }
--> 361 resp = self.send(prep, **send_kwargs)
362
363 return resp

/home/cokelaer/Work/virtualenv/lib/python2.7/site-packages/requests/sessions.pyc in send(self, request, **kwargs)
462 start = datetime.utcnow()
463 # Send the request
--> 464 r = adapter.send(request, **kwargs)
465 # Total elapsed time of the request (approximately)
466 r.elapsed = datetime.utcnow() - start

/home/cokelaer/Work/virtualenv/lib/python2.7/site-packages/requests/adapters.pyc in send(self, request, stream, timeout, verify, cert, proxies)
361 except (_SSLError, _HTTPError) as e:
362 if isinstance(e, _SSLError):
--> 363 raise SSLError(e)
364 elif isinstance(e, TimeoutError):
365 raise Timeout(e)

SSLError: [Errno 1] _ssl.c:504: error:100AE081:elliptic curve routines:EC_GROUP_new_by_curve_name:unknown group

Developer interested in helping our project

Hello, I am a college student interested in helping with the project. I found this project on the Mozilla website; where do I start to help improve it?

Synapse login not the same with `authToken` and `apiKey`

Bug Report

Operating system

MacOS Big Sur

Client version

Versions 2.2.2 and 2.3.1.

Description of the problem

When using the .synapseConfig file (with the apiKey attribute) in, for example, synapseclient==2.2.2, the synapseclient.Synapse.login() method works perfectly. However, when using the .synapseConfig file (with the authToken attribute) in, for example, synapseclient==2.3.1, the login method doesn't work as expected.

A minimal reproducible example:

  • Install version 2.2.2 of the synapseclient
$ pip install synapseclient==2.2.2

$ python
>>> import synapseclient
>>> syn = synapseclient.Synapse(configPath='/Users/spatil/Desktop/schematic/.synapseConfig')
>>> syn.login(silent=True)
  • Repeat the above with version 2.3.1
  • Observe the differences in behaviour

Note: Make sure to use the right version of the .synapseConfig file too.

Expected behavior

User should be logged in successfully.

Actual behavior

No output to console when testing with synapseclient==2.3.1 and using .synapseConfig file with authToken.

No matrix.path

When I type print(matrix.path) it returns an error. When I search for the .path attribute there is none in the package. Is the documentation incorrect?

downloadTableFile behaviour change in 1.7.1

Hi, I've been using the above method to download mPower files for some time. I just upgraded to synapseclient 1.7.1 (from 1.6.2) and things broke. The method no longer returns a dict; it returns a string (the path) instead. Also, if you specify a downloadLocation of "." as per the docs, it fails with 'cannot find path ""'. If you leave downloadLocation out, it defaults to the cache, as you'd expect. Both are fairly minor; perhaps just a doc update is required?
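Until the docs settle, calling code can tolerate either return shape with a small shim. The dict key below is hypothetical, purely to illustrate the shape difference the reporter describes:

```python
def extract_path(result):
    """Accept either a bare path string (1.7.x behavior) or a dict
    (pre-1.7 behavior). The 'path' key is a hypothetical example."""
    if isinstance(result, str):
        return result
    return result.get("path")

# Works with either shape:
assert extract_path("/data/file.csv") == "/data/file.csv"
assert extract_path({"path": "/data/file.csv"}) == "/data/file.csv"
```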

Implement a verbose mode

I'm trying to download a large file and I can't tell if it's going successfully or not. It would be great to get more diagnostic information from the Synapse client to confirm that the download has begun and, ideally, progress information as well.

SynapseFileCacheError on Ubuntu Server edition

Bug Report

Operating system

Ubuntu Server 18.04

Client version

1.9.1

Description of the problem

Something related to MIME types when attempting to store data on Synapse. Dependency issue on Ubuntu Server edition? On the desktop editions of Ubuntu 18.04 or Debian 9 the issue is absent using the same synapseclient version and the same file to upload.

In [1]: import synapseclient
In [2]: syn = synapseclient.login()
In [3]: f = "/path/to/file"
In [4]: syn.store(synapseclient.File(f, parent="syn17931318"))

##################################################
 Uploading file to Synapse storage
##################################################

---------------------------------------------------------------------------
SynapseFileCacheError                     Traceback (most recent call last)
<ipython-input-4-cbd9cbaa63f3> in <module>
----> 1 syn.store(synapseclient.File(f, parent = "syn17931318"))

~/.local/lib/python3.6/site-packages/synapseclient/client.py in store(self, obj, **kwargs)
    969                                                 md5=local_state_fh.get('contentMd5'),
    970                                                 file_size=local_state_fh.get('contentSize'),
--> 971                                                 mimetype=local_state_fh.get('contentType'))
    972                 properties['dataFileHandleId'] = fileHandle['id']
    973                 local_state['_file_handle'] = fileHandle

~/.local/lib/python3.6/site-packages/synapseclient/upload_functions.py in upload_file_handle(syn, parent_entity, path, synapseStore, md5, file_size, mimetype)
     65         syn.logger.info('\n' + '#' * 50 + '\n Uploading file to ' + storageString + ' storage \n' + '#' * 50 + '\n')
     66
---> 67         return upload_synapse_s3(syn, expanded_upload_path, location['storageLocationId'], mimetype=mimetype)
     68     # external file handle (sftp)
     69     elif upload_destination_type == concrete_types.EXTERNAL_UPLOAD_DESTINATION:

~/.local/lib/python3.6/site-packages/synapseclient/upload_functions.py in upload_synapse_s3(syn, file_path, storageLocationId, mimetype)
    125 def upload_synapse_s3(syn, file_path, storageLocationId=None, mimetype=None):
    126     file_handle_id = multipart_upload(syn, file_path, contentType=mimetype, storageLocationId=storageLocationId)
--> 127     syn.cache.add(file_handle_id, file_path)
    128
    129     return syn._getFileHandle(file_handle_id)

~/.local/lib/python3.6/site-packages/synapseclient/cache.py in add(self, file_handle_id, path)
    218
    219         cache_dir = self.get_cache_dir(file_handle_id)
--> 220         with Lock(self.cache_map_file_name, dir=cache_dir):
    221             cache_map = self._read_cache_map(cache_dir)
    222

~/.local/lib/python3.6/site-packages/synapseclient/lock.py in __enter__(self)
     97     # Make the lock object a Context Manager
     98     def __enter__(self):
---> 99         self.blocking_acquire()
    100
    101     def __exit__(self, exc_type, exc_value, traceback):

~/.local/lib/python3.6/site-packages/synapseclient/lock.py in blocking_acquire(self, timeout, break_old_locks)
     83         if not lock_acquired:
     84             raise SynapseFileCacheError("Could not obtain a lock on the file cache within timeout: %s  "
---> 85                                         "Please try again later" % str(timeout))
     86
     87     def release(self):

SynapseFileCacheError: Could not obtain a lock on the file cache within timeout: 0:01:10  Please try again later

Certificate has no subjectAltName, falling back to check for a commonName for now

Upgrading to synapseclient 1.6.1, I am now getting the following warning multiple times:

/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for file-prod.prod.sagebase.org has no `subjectAltName` , falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)

When downloading using the Python 2.7 backport, the csv writer gives a UnicodeEncodeError

When trying to download syn3163039 with files = synapseutils.syncFromSynapse(syn, "syn3163039", path='syn3163039/') I get the below error. It works correctly when using Python 3.

Traceback (most recent call last):
File "download_data.py", line 12, in
files = synapseutils.syncFromSynapse(syn, "syn3163039", path='syn3163039/')
File "/apps/software/Python/2.7.11-foss-2015b/lib/python2.7/site-packages/synapseutils/sync.py", line 85, in syncFromSynapse
syncFromSynapse(syn, result['id'], new_path, ifcollision, allFiles, followLink=followLink)
File "/apps/software/Python/2.7.11-foss-2015b/lib/python2.7/site-packages/synapseutils/sync.py", line 85, in syncFromSynapse
syncFromSynapse(syn, result['id'], new_path, ifcollision, allFiles, followLink=followLink)
File "/apps/software/Python/2.7.11-foss-2015b/lib/python2.7/site-packages/synapseutils/sync.py", line 85, in syncFromSynapse
syncFromSynapse(syn, result['id'], new_path, ifcollision, allFiles, followLink=followLink)
File "/apps/software/Python/2.7.11-foss-2015b/lib/python2.7/site-packages/synapseutils/sync.py", line 105, in syncFromSynapse
generateManifest(syn, allFiles, filename)
File "/apps/software/Python/2.7.11-foss-2015b/lib/python2.7/site-packages/synapseutils/sync.py", line 145, in generateManifest
csvWriter.writerow(row)
File "/apps/software/Python/2.7.11-foss-2015b/lib/python2.7/site-packages/backports/csv.py", line 685, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
File "/apps/software/Python/2.7.11-foss-2015b/lib/python2.7/site-packages/backports/csv.py", line 204, in writerow
return self.fileobj.write(line)

Downloading files/folders from synapse with additional jamboree credentials fails

Bug Report

Operating system

MacOS Catalina version 10.15.7

Client version

Output of:

import synapseclient
synapseclient.__version__

'2.2.0'

Description of the problem

Downloading files/folders from synapse with additional jamboree credentials fails.

I am trying to download a folder from synapse, where if I were to do it manually I would click to download each file and then supply my jamboree access key & secret key. I was hoping to do this with the python client because there are a lot of files, but the python client never prompts me for the jamboree keys. Instead each file download silently fails, resulting in an empty list of files.

import synapseclient
import synapseutils 
 
syn = synapseclient.Synapse() 
syn.login('synapse_username','password') 
files = synapseutils.syncFromSynapse(syn, 'synID')

After running this I don't get any errors, but files is empty

Expected behavior

I expected the files in the folder associated with the synapse ID to be downloaded

Actual behavior

No error, but also no successful downloads.

>>> files
[]

deepcopy() of a synapseclient.Synapse object broken after upgrade to 2.2.2

Bug Report

Operating system

Ubuntu 20.04

Client version

2.2.2

Description of the problem

I was running client version 2.0.0. After upgrading to 2.2.2, I'm not able to deepcopy a Synapse object:

import copy
import synapseclient
syn = synapseclient.Synapse()
syn.login()
syn_copy = copy.deepcopy(syn)

Result:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.8/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.8/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.8/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.8/copy.py", line 270, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.8/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.8/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.8/copy.py", line 230, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.8/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.8/copy.py", line 264, in _reconstruct
    y = func(*args)
TypeError: __init__() missing 1 required positional argument: 'max_size'

Should I wipe/move my cache and try again? Figured I'd check before re-pulling 270GB.

Error when getting EntityViewSchema that does not exist

Synapse Client 1.8.2

This request fails when the table doesn't exist.

syn.get(EntityViewSchema(name='my_view', parent=my_project), downloadFile=False)

Error:

File "synapseclient/client.py", line 626, in get
    self._check_entity_restrictions(bundle['restrictionInformation'], entity, kwargs.get('downloadFile', True))
TypeError: 'NoneType' object has no attribute '__getitem__'

from backports import csv

csv is available natively in both Python 2 and 3.
Can you change from backports import csv
to: import csv

[GCC 5.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import synapseclient
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/app/easybuild/software/Python/3.6.4-foss-2016b-fh1/lib/python3.6/site-packages/synapseclient-1.7.3-py3.6.egg/synapseclient/__init__.py", line 308, in <module>
    from .client import Synapse, login
  File "/app/easybuild/software/Python/3.6.4-foss-2016b-fh1/lib/python3.6/site-packages/synapseclient-1.7.3-py3.6.egg/synapseclient/client.py", line 86, in <module>
    from .table import Schema, Column, TableQueryResult, CsvFileTable
  File "/app/easybuild/software/Python/3.6.4-foss-2016b-fh1/lib/python3.6/site-packages/synapseclient-1.7.3-py3.6.egg/synapseclient/table.py", line 276, in <module>
    from backports import csv
ImportError: cannot import name 'csv'
>>> import csv
>>> from backports import csv
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'csv'
>>>
[GCC 5.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from backpots import csv
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named backpots
>>> import csv
>>>
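If Python 2 support still matters, one sketch of a compromise is a version-guarded import: use the stdlib csv on Python 3 (where it is unicode-aware) and fall back to the backport only on Python 2:

```python
import sys

if sys.version_info[0] >= 3:
    # Python 3's stdlib csv handles unicode natively.
    import csv
else:
    # Python 2's stdlib csv is not unicode-aware; use the backport there.
    from backports import csv  # noqa: F401

# Either branch exposes the same API.
assert hasattr(csv, "writer") and hasattr(csv, "reader")
```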

KeyError when trying to download using Synapse Client 1.8.1

I am trying to download syn3157325 using files = synapseutils.syncFromSynapse(syn, 'syn3157325', path = 'ROSMAP/'). I get the error

Traceback (most recent call last):
  File "download_data.py", line 32, in <module>
    files = synapseutils.syncFromSynapse(syn, 'syn3157325', path = 'ROSMAP/')
  File "/apps/software/Python/3.6.3-foss-2015b/lib/python3.6/site-packages/synapseutils/sync.py", line 104, in syncFromSynapse
    generateManifest(syn, allFiles, filename)
  File "/apps/software/Python/3.6.3-foss-2015b/lib/python3.6/site-packages/synapseutils/sync.py", line 116, in generateManifest
    keys, data = _extract_file_entity_metadata(syn, allFiles)
  File "/apps/software/Python/3.6.3-foss-2015b/lib/python3.6/site-packages/synapseutils/sync.py", line 135, in _extract_file_entity_metadata
    row.update(_get_file_entity_provenance_dict(syn, entity))
  File "/apps/software/Python/3.6.3-foss-2015b/lib/python3.6/site-packages/synapseutils/sync.py", line 152, in _get_file_entity_provenance_dict
    'executed' : ';'.join(prov._getExecutedStringList()),
  File "/apps/software/Python/3.6.3-foss-2015b/lib/python3.6/site-packages/synapseclient/activity.py", line 339, in _getExecutedStringList
    return self._getStringList(wasExecuted=True)
  File "/apps/software/Python/3.6.3-foss-2015b/lib/python3.6/site-packages/synapseclient/activity.py", line 329, in _getStringList
    usedList.append(source['name'])

printing usedList:

{'wasExecuted': True, 'concreteType': 'org.sagebionetworks.repo.model.provenance.UsedURL', 'url': 'https://github.com/Sage-Bionetworks/ampAdScripts/blob/master/Broad-Rush/migrateROSMAPGenotypesFeb2015.R'}

I put a try/except around usedList.append(source['name']); as far as I can tell, it allowed me to download all the data correctly.
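The failing record quoted above is a UsedURL entry that has no 'name' key, so a tolerant lookup (in the spirit of the reporter's try/except) could fall back to the URL instead:

```python
# A provenance record of the kind quoted above: a UsedURL with no 'name'.
source = {
    "wasExecuted": True,
    "concreteType": "org.sagebionetworks.repo.model.provenance.UsedURL",
    "url": "https://github.com/Sage-Bionetworks/ampAdScripts/blob/master/Broad-Rush/migrateROSMAPGenotypesFeb2015.R",
}

used_list = []
# Prefer 'name', then fall back to 'url', then to the concrete type,
# so a missing key no longer raises KeyError.
used_list.append(source.get("name") or source.get("url") or source["concreteType"])
```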

Re-uploads a file when there are no changes

Bug Report

Operating system

Ubuntu Linux 18.04

Client version

1.9.2

Description of the problem

If a file already exists in a Project, it will be uploaded again even when the file has not changed. If you upload a second time, it works as expected (it doesn't upload the file again).

Repro. Steps:

  • Create a new Project.
  • Upload a file to the Project through the Synapse website.
  • Do not make any changes to the local or remote file...
  • Re-upload the file: synapse add --parentid syn123456 test_file.txt The file will be uploaded.

Expected behavior

  • The file will NOT be uploaded since it has not changed.

Actual behavior

  • The file is uploaded even though no change was made to it.

UnicodeDecodeError on special characters when storing file

Bug Report

Operating system

Ubuntu 18.04

Client version

1.9.2

Description of the problem

Throws exception when uploading a file where the file path contains special characters.

This is a blocking issue for us.

Repro Script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import synapseclient

filename = "TestûTest.txt"

with open(filename, mode='w') as f:
    f.write('test text')

syn = synapseclient.Synapse()
syn.login()
syn.store(synapseclient.File(path=filename, parent="syn18521874"))

Expected behavior

Does not error. Uploads file.

Actual behavior

Throws exception. Does not upload file.

Traceback (most recent call last):
  File "./bug.py", line 14, in <module>
    syn.store(synapseclient.File(path=filename, parent="syn18521874"))
  File "/home/user/source/.venv/local/lib/python2.7/site-packages/synapseclient/entity.py", line 578, in __init__
    kwargs['name'] = utils.guess_file_name(path)
  File "/home/user/source/.venv/local/lib/python2.7/site-packages/synapseclient/utils.py", line 243, in guess_file_name
    tokens = [x for x in path.split('/') if x != '']
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)
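For context, the failing expression from utils.guess_file_name is harmless under Python 3, where str is Unicode throughout; only Python 2's implicit ASCII decoding of byte strings triggers the error:

```python
# The same split that raises UnicodeDecodeError under Python 2 when the
# path is a byte string containing non-ASCII characters.
path = "/home/user/TestûTest.txt"
tokens = [x for x in path.split("/") if x != ""]
# Under Python 3, the non-ASCII filename survives intact.
```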

support custom Session objects (feature requests)

I have a feature request related to my specific use case. I have a large Synapse project where the files themselves are hosted on Google Drive; the files on Synapse are direct link-outs. Unfortunately, Google Drive caps direct downloads for files over 50MB, instead redirecting the user to a download link with a random confirmation code in the URL's query string.

Therefore, the basic URL request doesn't quite work for me. I need to stream the file, extract the confirmation code, and make a second request while retaining the cookie from the original request.

I can do all of this in a custom get method (see my gist for a derivative requests.Session class), but I need a way of getting this object into a Synapse client object. See PR #713 for a simple example.

Example usage:

import synapseclient
import synapseutils
from gdrivesession import GDriveSession

session = GDriveSession()
syn = synapseclient.Synapse(session=session)
syn.login()
files = synapseutils.syncFromSynapse(syn, "syn20844101")

I understand that overwriting get methods for requests might expose the user to security issues or simple user error, but perhaps this is of general interest since there are many urls that are not simply open on the web. At the very least you might want some type/integrity checking on the session object.

Thank you for the consideration.

Passing a pandas dataframe with a column called "read" breaks the type parsing in as_table_columns()

Bug Report

Operating system

MacOSX

Client version

2.3.1

Description of the problem

Symptom:
Passing a Pandas Dataframe with the column labeled "read" to as_table_columns() throws a 'TypeError' when calling _csv_to_pandas_df().

Bug:
The code tries to parse the value as a string instead of a pandas DataFrame in this code here:

    # filename of a csv file
    # in Python 3, we can check that the values is instanceof io.IOBase
    # for now, check if values has attr `read`
    if isinstance(values, str) or hasattr(values, "read"):   <----- hasattr(values, "read") is True!
        df = _csv_to_pandas_df(values)               <----- _csv_to_pandas_df() returns a TypeError
    # pandas DataFrame
    if isinstance(values, pd.DataFrame):
        df = values                                        <----- Should assign df here instead

Catching this in the debugger, I see that the input parameter values has the attr read and so the code tries to parse it as a string in _csv_to_pandas_df:

>>>values["read"]
0    
Name: read, dtype: object
>>>isinstance(values, str)
False
>>>hasattr(values, "read")
True

To Reproduce

Note: Table(schema, df) calls as_table_columns() internally:

import pandas as pd
from synapseclient import Schema, Column, Table, Row, RowSet, as_table_columns, build_table, table

project = 'synXXXXXXXX'
df = pd.DataFrame([{'read': '0'}])
columns = []
for column in df.columns:
     columns.append(Column(name=column, columnType='STRING'))
schema = Schema('TEST_TABLE', columns, parent=project)
table = Table(schema, df)

Expected behavior

Users should be able to pass a pandas DataFrame with a column called "read" to the function.

Actual behavior

If you care to see the error:

  File "/Users/esurface/opt/miniconda2/envs/py3/lib/python3.9/site-packages/pandas/io/common.py", line 554, in get_handle
    if _is_binary_mode(path_or_buf, mode) and "b" not in mode:
  File "/Users/esurface/opt/miniconda2/envs/py3/lib/python3.9/site-packages/pandas/io/common.py", line 859, in _is_binary_mode
    return isinstance(handle, binary_classes) or "b" in getattr(handle, "mode", mode)
TypeError: argument of type 'method' is not iterable

pip install dev does not work

sudo pip install git+https://github.com/Sage-Bionetworks/synapsePythonClient.git@develop
Downloading/unpacking git+https://github.com/Sage-Bionetworks/synapsePythonClient.git@develop
  Cloning https://github.com/Sage-Bionetworks/synapsePythonClient.git (to develop) to /tmp/pip-ouux3d-build
  Running setup.py (path:/tmp/pip-ouux3d-build/setup.py) egg_info for package from git+https://github.com/Sage-Bionetworks/synapsePythonClient.git@develop

Requirement already satisfied (use --upgrade to upgrade): requests>=1.2 in /usr/lib/python2.7/dist-packages (from synapseclient==1.5.2.dev1)
Requirement already satisfied (use --upgrade to upgrade): six in /usr/lib/python2.7/dist-packages (from synapseclient==1.5.2.dev1)
Downloading/unpacking future (from synapseclient==1.5.2.dev1)
  Downloading future-0.15.2.tar.gz (1.6MB): 1.6MB downloaded
  Running setup.py (path:/tmp/pip_build_root/future/setup.py) egg_info for package future

    warning: no files found matching '*.au' under directory 'tests'
    warning: no files found matching '*.gif' under directory 'tests'
    warning: no files found matching '*.txt' under directory 'tests'
Downloading/unpacking backports.csv (from synapseclient==1.5.2.dev1)
  Downloading backports.csv-1.0.1-py2.py3-none-any.whl
Installing collected packages: future, backports.csv, synapseclient
  Running setup.py install for future

    warning: no files found matching '*.au' under directory 'tests'
    warning: no files found matching '*.gif' under directory 'tests'
    warning: no files found matching '*.txt' under directory 'tests'
    Installing pasteurize script to /usr/local/bin
    Installing futurize script to /usr/local/bin
  Running setup.py install for synapseclient

    Installing synapse script to /usr/local/bin
Successfully installed future backports.csv synapseclient
Cleaning up...

I suspect the cause is mainly this line: Running setup.py install for synapseclient. Even after cloning the develop branch, python setup.py install will not install the dev branch; you must run python setup.py develop to install it.

synapseutils.sync.syncFromSynapse throws error when syncing a Table object

There appears to be a bug in synapseutils.sync.syncFromSynapse. I am attempting to sync the 'Wondrous Research Example' (syn1901847) to my local filesystem. The syncFromSynapse function is throwing this error:

Traceback (most recent call last):
  ...
  File "import_synapse.py", line 55, in import_synapse_files
    synapseutils.sync.syncFromSynapse(synapse_client, syn_id, output_path)
  File "~/anaconda/envs/py27/lib/python2.7/site-packages/synapseutils/sync.py", line 82, in syncFromSynapse
    syncFromSynapse(syn, result['entity.id'], new_path, ifcollision, allFiles)
  File "~/anaconda/envs/py27/lib/python2.7/site-packages/synapseutils/sync.py", line 90, in syncFromSynapse
    generateManifest(syn, allFiles, filename)
  File "~/anaconda/envs/py27/lib/python2.7/site-packages/synapseutils/sync.py", line 107, in generateManifest
    row = {'parent': entity['parentId'], 'path': entity.path, 'name': entity.name,
  File "~/anaconda/envs/py27/lib/python2.7/site-packages/synapseclient/entity.py", line 362, in __getattr__
    raise AttributeError(key)
AttributeError: path

I added a print statement to identify which resource was causing this error, and this appears to be the culprit:

Schema: Synapse Table Demo (syn3079449)
  columns_to_store=None
properties:
  accessControlList=/repo/v1/entity/syn3079449/acl
  annotations=/repo/v1/entity/syn3079449/annotations
  columnIds=[u'36450', u'36451', u'36452', u'36453']
  concreteType=org.sagebionetworks.repo.model.table.TableEntity
  createdBy=273979
  createdOn=2015-01-09T20:49:19.646Z
  entityType=org.sagebionetworks.repo.model.table.TableEntity
  etag=b6d017c7-18a8-47e7-8bd7-497bf8b1a512
  id=syn3079449
  modifiedBy=273979
  modifiedOn=2015-01-09T20:49:19.646Z
  name=Synapse Table Demo
  parentId=syn1901847
  uri=/repo/v1/entity/syn3079449
  versionLabel=1
  versionNumber=1
  versionUrl=/repo/v1/entity/syn3079449/version/1
  versions=/repo/v1/entity/syn3079449/version
annotations:
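A workaround sketch (the helper name is hypothetical): before generating the manifest, drop entities that carry no local file path, such as the Table schema above.

```python
def files_only(all_entities):
    """Keep only entities that have a local file path.

    Table/Schema entities raise AttributeError on .path (as in the
    traceback above); getattr with a default absorbs that.
    """
    return [ent for ent in all_entities if getattr(ent, "path", None) is not None]
```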

Command line download of tables

The ability to download tables as TSVs from the command-line client would be very helpful.
Right now:

synapse get syn3156503

returns


WARNING: No files associated with entity syn3156503

Schema: RNA-Seq Metadata (syn3156503)
  columns_to_store=[]
properties:
  accessControlList=/repo/v1/entity/syn3156503/acl
  annotations=/repo/v1/entity/syn3156503/annotations
  columnIds=[u'4071', u'4192', u'4099', u'5449', u'4152', u'4077', u'4078', u'4079', u'35396', u'4073', u'35397', u'35399', u'35400', u'4124', u'4344', u'4225', u'4226', u'4158', u'4159', u'4227', u'4234', u'4228', u'4166', u'4229', u'4042', u'35398', u'4021', u'4023', u'4242', u'4243', u'4026', u'4233', u'4028', u'4043', u'4044', u'4030', u'4031', u'4032', u'4244', u'4034', u'4035', u'4036', u'4037', u'4045', u'4038', u'4046', u'4047', u'4039', u'4048', u'4245', u'4519', u'4521', u'4162', u'5518', u'4155', u'5515', u'5519', u'5520', u'4528', u'7673', u'7674', u'7705', u'7707']
  concreteType=org.sagebionetworks.repo.model.table.TableEntity
  createdBy=3323072
  createdOn=2015-01-28T18:46:07.159Z
  entityType=org.sagebionetworks.repo.model.table.TableEntity
  etag=aaa03a73-1847-4ec8-b8f0-80305c1adc7a
  id=syn3156503
  modifiedBy=3323072
  modifiedOn=2015-06-05T23:40:51.119Z
  name=RNA-Seq Metadata
  parentId=syn1773109
  uri=/repo/v1/entity/syn3156503
  versionLabel=10
  versionNumber=10
  versionUrl=/repo/v1/entity/syn3156503/version/10
  versions=/repo/v1/entity/syn3156503/version
annotations:


AttributeError: path

There is a download button on the page, but nothing for the command line tool.
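Until a CLI command exists, a Python-level workaround can be sketched using the client's tableQuery/asDataFrame APIs (asDataFrame requires pandas); the function name here is illustrative:

```python
def table_to_tsv(syn, table_id, out_path):
    """Query a whole Synapse table and write the result to a TSV file.

    `syn` is assumed to be a logged-in synapseclient.Synapse instance.
    """
    results = syn.tableQuery("SELECT * FROM {}".format(table_id))
    results.asDataFrame().to_csv(out_path, sep="\t", index=False)
```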

How to list a folder?

I'm not sure how to list a folder. Am I missing something obvious? Along with get/store, list seems like one of the most important file-system actions.

I dug into the code and found _list(), but it's too complicated for me to understand.
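A sketch of a listing helper built on the client's getChildren API (available in newer client releases; the helper name is hypothetical):

```python
def list_children(syn, container_id):
    """Return (id, name, type) for each direct child of a project/folder.

    Assumes syn.getChildren yields dicts with 'id', 'name', and 'type'
    keys, as in newer synapseclient releases.
    """
    return [(c["id"], c["name"], c["type"]) for c in syn.getChildren(container_id)]
```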

Have an exclusion/inclusion list for syncFromSynapse (feature request)

Apologies if this is already possible, but I could not find it in the documentation.

When using syncFromSynapse() you cannot exclude files from download. For example, I do not want to download the *.bam files. It would be great if syncFromSynapse() had a parameter taking an exclude (or include) list of files.
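A workaround sketch for the requested behavior, walking the container with getChildren and skipping files that match an exclude pattern (the function name and the use of fnmatch-style globs are assumptions):

```python
import fnmatch


def sync_excluding(syn, folder_id, path, exclude=("*.bam",)):
    """Recursively download files under folder_id, skipping excluded names.

    Assumes syn.getChildren yields dicts with 'id', 'name', and 'type'
    (a concreteType string) keys, as in newer synapseclient releases.
    """
    for child in syn.getChildren(folder_id):
        if child["type"].endswith("Folder"):
            sync_excluding(syn, child["id"], path, exclude)
        elif not any(fnmatch.fnmatch(child["name"], pat) for pat in exclude):
            syn.get(child["id"], downloadLocation=path)
```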

Slow uploads of data with single records

As discussed in the RTI/Synapse call:

We are using Synapse tables for the storage and curation of data for a multi-site study. Our data lives in a document data store as JSON files. We process the data and flatten it into a data-table structure for upload to Synapse. Most of the documents have many entries, which creates more than 152 columns, so we wrote a Python module that splits the data into 152-column sections and uploads it to Synapse as STRING columns 50 characters in length.

We are processing documents one at a time as they are received in the document store database. Even when only one row is being uploaded, we see long delays in the API call (multiple seconds in most cases). With more than 120,000 documents to process, our upload strategy became untenable as the processing time approached a month.

To reproduce the issue, run python3 test.py in our synapse-span-table module.

Is there any improvement to our use of the API you can suggest that would speed up the process?

We understand that Synapse is mainly used and optimized for uploading batched records, but we have run into issues with that strategy as well (see Issue 867).

Data are uploaded in duplicate if rows are added and the schema changes simultaneously

As discussed in the RTI/Synapse call, we are seeing duplicated rows in data uploaded to the server while uploading batched data to Synapse. For each table, we load flattened JSON data into a pandas DataFrame, and after every 100 records are processed the data gets saved to Synapse by calling store(). The issue arises after the initial upload of the table, when during the second upload rows are added and the schema changes at the same time.

To reproduce the issue, run the dup_test.py file in our github repo: synapse-span-table

Operating system

  • Docker image (Ubuntu Linux) running on AWS or OSX

Client version

2.3.1

Let the user specify the number of allowed threads

Operating system

Any

Client version

2.4.0

Description of the problem

synapseclient spawns too many computational threads.

Relevant lines of the code
synapseclient/client.py:from synapseclient.core.pool_provider import DEFAULT_NUM_THREADS
synapseclient/client.py: 'max_threads': DEFAULT_NUM_THREADS,
synapseclient/core/upload/multipart_upload.py: max_threads = pool_provider.DEFAULT_NUM_THREADS
synapseclient/core/pool_provider.py:DEFAULT_NUM_THREADS = multiprocessing.cpu_count() + 4

cpu_count() + 4 can lead to time slicing with hundreds of threads on a cluster compute node even if the code is running in an environment with a single CPU core available to it. As a result most threads are blocked or run on a fraction of a percent of a CPU core.

Expected behavior

A synapseclient.Synapse attribute to set the number of threads, together with letting the pool_provider read an environment variable, would help with this issue.
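A sketch of how the requested default could consult an environment variable before falling back to cpu_count() + 4 (the variable name SYNAPSE_MAX_THREADS is hypothetical):

```python
import multiprocessing
import os


def default_num_threads():
    """Return the thread count for transfers.

    Honors a SYNAPSE_MAX_THREADS environment variable (hypothetical name)
    so cluster jobs can cap threads; otherwise uses cpu_count() + 4 as the
    client currently does.
    """
    env = os.environ.get("SYNAPSE_MAX_THREADS")
    if env is not None:
        return max(1, int(env))
    return multiprocessing.cpu_count() + 4
```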

Actual behavior

As described above: the thread count is derived from cpu_count() rather than from the cores actually available to the process, so most threads are blocked or run on a fraction of a percent of a CPU core.

Getting an empty Provenance

Does it make sense to throw an error when calling syn.getProvenance("syn1234567") if syn1234567 has no associated provenance?

i.e., the above call returns an error like:

SynapseHTTPError: 404 Client Error: Not Found
No activity

Whereas calling syn.getProvenance("syn7654321"), where syn7654321 does have associated provenance gives:

{u'createdBy': u'3342492',
 u'createdOn': u'2016-08-17T00:23:09.498Z',
 u'etag': u'3425b097-1016-4a67-934d-31258a42be2a',
 u'id': u'7123748',
 u'modifiedBy': u'3342492',
 u'modifiedOn': u'2016-08-17T00:23:09.498Z',
 u'used': [{u'concreteType': u'org.sagebionetworks.repo.model.provenance.UsedEntity',
   u'reference': {u'targetId': u'syn5406913', u'targetVersionNumber': 2},
   u'wasExecuted': False},
  {u'concreteType': u'org.sagebionetworks.repo.model.provenance.UsedURL',
   u'name': u'https://github.com/taoliu/MACS/',
   u'url': u'https://github.com/taoliu/MACS/',
   u'wasExecuted': True}]}

Which makes me expect a result more like this when calling syn.getProvenance("syn1234567"):

{u'createdBy': u'3342492',
 u'createdOn': u'2016-08-17T00:23:09.498Z',
 u'etag': u'3425b097-1016-4a67-934d-31258a42be2a',
 u'id': u'7123748',
 u'modifiedBy': u'3342492',
 u'modifiedOn': u'2016-08-17T00:23:09.498Z',
 u'used': []}

Though I'm guessing files uploaded without Provenance currently have no Provenance attached, rather than an empty provenance like I've tried to represent here.
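In the meantime, callers can wrap the 404 themselves; a sketch (helper name is hypothetical, and it matches the error loosely on "404" rather than importing the client's SynapseHTTPError):

```python
def get_provenance_or_empty(syn, entity_id):
    """Return the entity's provenance, or an empty 'used' record on 404.

    Assumes syn.getProvenance raises an exception whose message contains
    '404' when no activity is attached, as in the report above.
    """
    try:
        return syn.getProvenance(entity_id)
    except Exception as err:  # SynapseHTTPError in the real client
        if "404" in str(err):
            return {"used": []}
        raise
```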

import error

Bug Report

Operating system

Ubuntu 14.04/18.04

Client version

Python 3.7.4
synapseclient 1.9.3

Description of the problem

>>> import synapseclient
ImportError: cannot import name 'csv' from 'backports' (/app/easybuild/software/Python/3.7.4-foss-2016b-fh1/lib/python3.7/site-packages/backports/__init__.py)

Why use backports with Python 3.x?

FileEntity 'path' property has wrong separator in Windows.

Bug Report

Operating system

Windows 10 Pro

Client version

1.9.2

Description of the problem

On Windows (win32) the path separator in the FileEntity is wrong.

Repro. Steps:

  • Upload a file and look at the returned object's path property.

Expected behavior

  • Path separator is \ and the character casing is correct.
    Correct path: C:\\Users\\John\\AppData\\Local\\Temp\\tmpi7kpbq0s\\data\\core\\core_file_ace2.csv

Actual behavior

  • Path separator is / and character casing is incorrect.
    Incorrect path: c:/users/john/appdata/local/temp/tmpi7kpbq0s/data/core/core_file_ace2.csv

Allow overriding the cache location via Synapse() constructor

Bug Report

Operating system

macOS

Client version

2.1.0

Description of the problem


I would like to download one dataset to one directory, and a second dataset to a different directory, but these need to be downloaded using the syn.downloadTableColumns function.

This function automatically downloads to the cache location, so this is not possible without updating the config file in between.

It would be great if the cache location could be set in the Synapse() constructor directly.
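A workaround sketch until the constructor supports this: temporarily repoint the client's cache directory between downloads (assumes the cache root lives at syn.cache.cache_root_dir, as in recent synapseclient releases; the context-manager name is hypothetical):

```python
import contextlib


@contextlib.contextmanager
def cache_location(syn, new_dir):
    """Temporarily redirect the client's download cache to new_dir."""
    old_dir = syn.cache.cache_root_dir
    syn.cache.cache_root_dir = new_dir
    try:
        yield syn
    finally:
        syn.cache.cache_root_dir = old_dir
```

Usage would look like `with cache_location(syn, "/data/dataset1"): syn.downloadTableColumns(...)`, once per dataset.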

Add unit test for synapseclient.core.utils#printTransferProgress

Issue for use with weekly Code Review:

def printTransferProgress(transferred, toBeTransferred, prefix='', postfix='', isBytes=True, dt=None,
                          previouslyTransferred=0):
    """Prints a progress bar
    :param transferred:             a number of items/bytes completed
    :param toBeTransferred:         total number of items/bytes when completed
    :param prefix:                  String printed before progress bar
    :param postfix:                 String printed after progress bar
    :param isBytes:                 A boolean indicating whether to convert bytes to kB, MB, GB etc.
    :param dt:                      The time in seconds that has passed since transfer started is used to calculate rate
    :param previouslyTransferred:   the number of bytes that were already transferred before this transfer began
                                    (e.g. someone ctrl+c'd out of an upload and restarted it later)
    """
    if not sys.stdout.isatty():
        return
    barLength = 20  # Modify this to change the length of the progress bar
    status = ''
    rate = ''
    if dt is not None and dt != 0:
        rate = (transferred - previouslyTransferred)/float(dt)
        rate = '(%s/s)' % humanizeBytes(rate) if isBytes else rate
    if toBeTransferred < 0:
        defaultToBeTransferred = (barLength*1*MB)
        if transferred > defaultToBeTransferred:
            progress = float(transferred % defaultToBeTransferred) / defaultToBeTransferred
        else:
            progress = float(transferred) / defaultToBeTransferred
    elif toBeTransferred == 0:  # There is nothing to be transferred
        progress = 1
        status = "Done...\n"
    else:
        progress = float(transferred) / toBeTransferred
        if progress >= 1:
            progress = 1
            status = "Done...\n"
    block = int(round(barLength*progress))
    nbytes = humanizeBytes(transferred) if isBytes else transferred
    if toBeTransferred > 0:
        outOf = "/%s" % (humanizeBytes(toBeTransferred) if isBytes else toBeTransferred)
        percentage = "%4.2f%%" % (progress*100)
    else:
        outOf = ""
        percentage = ""
    text = "\r%s [%s]%s   %s%s %s %s %s    " % (prefix,
                                                "#"*block + "-"*(barLength-block),
                                                percentage,
                                                nbytes, outOf, rate,
                                                postfix, status)
    sys.stdout.write(text)
    sys.stdout.flush()

https://github.com/Sage-Bionetworks/synapsePythonClient/blob/develop/synapseclient/core/utils.py#L596
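A sketch of a test helper for this function, assuming the test patches sys.stdout with a TTY-like buffer (helper names are hypothetical; printTransferProgress writes nothing when stdout is not a TTY):

```python
import io
import sys


class TtyStringIO(io.StringIO):
    """Stand-in for a terminal: makes isatty() return True so the
    progress bar is actually written."""

    def isatty(self):
        return True


def capture_progress(progress_fn, *args, **kwargs):
    """Run progress_fn with sys.stdout swapped for a fake TTY and return
    everything it wrote."""
    buffer = TtyStringIO()
    original = sys.stdout
    sys.stdout = buffer
    try:
        progress_fn(*args, **kwargs)
    finally:
        sys.stdout = original
    return buffer.getvalue()
```

A unit test could then assert, for example, that calling printTransferProgress with transferred == toBeTransferred produces output containing "Done" and "100.00%", per the code above.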

No path found for syn.get

Using Python 2.7, I am able to get the example matrix syn1901033, but when I use an actual Synapse ID (syn5511449 in this case), I receive the error below:

## retrieve a 100 by 4 matrix
matrix = syn.get('syn5511449')

## inspect its properties
print(matrix.name)
print(matrix.description)
print(matrix.path)

## load the data matrix into a dictionary with an entry for each column
with open(matrix.path, 'r') as f:
    labels = f.readline().strip().split('\t')
    data = {label: [] for label in labels}
    for line in f:
        values = [float(x) for x in line.strip().split('\t')]
        for i in range(len(labels)):
            data[labels[i]].append(values[i])

Walking Activity
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-43-88b0cdf2c38e> in <module>()
      4 ## inspect its properties
      5 print(matrix.name)
----> 6 print(matrix.description)
      7 print(matrix.path)
      8 

/Users/ajenkins/anaconda/lib/python2.7/site-packages/synapseclient/entity.pyc in __getattr__(self, key)
    366             ## about what exceptions it catches. In Python3, hasattr catches
    367             ## only AttributeError
--> 368             raise AttributeError(key)
    369 
    370 

AttributeError: description

Is there a reason why, when I use an actual Synapse ID, I am not able to get a path?

Slow synapse get on large projects

On large projects with many files, the synapse get command is very slow.
One suggestion for speeding it up is to use multi-threading / multi-processing to make the calls to get concurrent.

A simple patch that suggests one way to do this is pasted below (sorry, for some reason GitHub wouldn't let me upload it; just save it to a txt file and apply):

From d6ae4c2 Mon Sep 17 00:00:00 2001
From: fidlr [email protected]
Date: Tue, 18 Oct 2016 10:35:51 +0300
Subject: [PATCH] multi-threaded get

---
 synapseutils/sync.py | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/synapseutils/sync.py b/synapseutils/sync.py
index dfbfab8..de399a8 100644
--- a/synapseutils/sync.py
+++ b/synapseutils/sync.py
@@ -2,6 +2,13 @@ import errno
 from synapseclient.entity import is_container
 from synapseclient.utils import id_of
 import os
+from concurrent.futures import ThreadPoolExecutor
+
+pool = ThreadPoolExecutor(max_workers=3)  # Synapse allows up to 3 concurrent requests
+
+def getOneEntity(syn, entity_id, downloadLocation, ifcollision, allFilesList):
+    ent = syn.get(entity_id, downloadLocation=downloadLocation, ifcollision=ifcollision)
+    allFilesList.append(ent)  # lists are thread-safe
 
 def syncFromSynapse(syn, entity, path=None, ifcollision='overwrite.local', allFiles = None):
@@ -36,7 +43,11 @@ def syncFromSynapse(syn, entity, path=None, ifcollision='overwrite.local', allFi
     for f in entities:
         print(f.path)
     """
-    if allFiles is None: allFiles = list()
+    global pool
+    wait_at_finish = False
+    if allFiles is None:  # initial call
+        allFiles = list()
+        wait_at_finish = True
     id = id_of(entity)
     results = syn.chunkedQuery("select id, name, nodeType from entity where entity.parentId=='%s'" %id)
     for result in results:
@@ -53,6 +64,12 @@ def syncFromSynapse(syn, entity, path=None, ifcollision='overwrite.local', allFi
             new_path = None
             syncFromSynapse(syn, result['entity.id'], new_path, ifcollision, allFiles)
         else:
-            ent = syn.get(result['entity.id'], downloadLocation = path, ifcollision = ifcollision)
-            allFiles.append(ent)
+            # use multi-threaded get function
+            pool.submit(getOneEntity, syn, result['entity.id'], path, ifcollision, allFiles)
+            # ent = syn.get(result['entity.id'], downloadLocation = path, ifcollision = ifcollision)
+            # allFiles.append(ent)
+
+    if wait_at_finish:
+        pool.shutdown(wait=True)  # wait till all objects were downloaded before returning
 
     return allFiles
-- 
2.7.4

download speed, unnecessary REST calls

Bug Report

Operating system

ubuntu 18.04
4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Client version

Output of:

import synapseclient
synapseclient.__version__

'1.9.3'

Description of the problem

I am trying to download a Synapse project via synapseutils.syncFromSynapse, but progress is very slow. The project contains many subfolders (~10k) with 3 files per folder. The download speed itself is not the problem; rather, a REST API request seems to be made per file in Synapse::getProvenance.
This function is called in every recursive invocation of synapseutils.syncFromSynapse on all members of the allFiles array, where allFiles contains all previously processed files.
One REST call takes t ≈ 100-200 ms, so for n files the total time grows as t · n(n+1)/2, i.e. quadratically in n.

Expected behavior

Faster download; do not repeat the REST request for all files.

Actual behavior


Is there some fast workaround?

synapseutils.syncFromSynapse fails on empty folder

syncFromSynapse throws ValueError: The provided id: synMyFolderId is was neither a container nor a File when it hits an empty folder.

Folder Structure:

-Folder-1
  -Folder-2
  -some-file.txt

synapseutils.syncFromSynapse(syn, 'Folder-1-id')
