scrapinghub / shub
Scrapinghub Command Line Client
Home Page: https://shub.readthedocs.io/
License: BSD 3-Clause "New" or "Revised" License
Job IDs are hard to remember. If we saved the last scheduled job ID (say, in .scrapinghub.yml), instead of this:
shub schedule prod/myspider
shub items prod/2/204
users could do this:
shub schedule prod/myspider
shub items
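A rough sketch of how the last job key could be cached (the cache file location and the last_job key name are assumptions, not existing shub behaviour):

import os
import yaml

CACHE_FILE = '.scrapinghub.yml'  # assumption: reuse the project config file as the cache

def remember_job(job_key):
    # Called after `shub schedule`: store the last scheduled job key
    conf = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            conf = yaml.safe_load(f) or {}
    conf['last_job'] = job_key
    with open(CACHE_FILE, 'w') as f:
        yaml.safe_dump(conf, f, default_flow_style=False)

def last_job():
    # Used by `shub items` & co. when no job ID is given on the command line
    if not os.path.exists(CACHE_FILE):
        return None
    with open(CACHE_FILE) as f:
        return (yaml.safe_load(f) or {}).get('last_job')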
We should be able to drop freeze/hooks/runtime-hooks.py (the requests part) as soon as pyinstaller/pyinstaller#1777 is merged and we pin PyInstaller to 3.2.
shub opens new Python sessions through the subprocess module. In particular, shub deploy depends on an available python executable (to build an egg through python setup.py), and shub deploy-egg and shub deploy-reqs additionally depend on pip.
While PyInstaller does bundle the Python interpreter, it is not an executable file (for Windows builds it bundles python27.dll and some additional libraries, which it then loads in the bootloader, I guess). I have sent a question (yet to be released by a moderator) to the PyInstaller mailing list regarding how to call a new interpreter instance.
For pip, for now we have settled in a Slack discussion on deactivating the commands when pip is not available (for deploy-egg we should look into only deactivating the --from-pypi switch). There is a mailing list thread on bundling it here.
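A minimal sketch of the deactivation, assuming we simply probe the PATH before running the affected commands (the helper and its wiring into the commands are made up):

from distutils.spawn import find_executable
import click

def require_pip():
    # Called at the start of deploy-egg/deploy-reqs: abort early with a
    # readable message instead of failing later in a subprocess call.
    if find_executable('pip') is None:
        raise click.ClickException(
            'pip is required for this command but was not found on PATH')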
Let's add tests for #72.
It would be nice to add a small wizard when people use shub deploy for the first time, to ease the onboarding process. Then, all we need to say in Dash is: use shub deploy and answer the questions.
$ shub deploy
Please login first with: shub login
$ shub login
...
$ shub deploy
Project ID you would like to deploy to: NNNNN
(...maybe validate access to the project? like shub login validates api key...)
Save project ID into scrapinghub.yml [Y/n]?
The questions will be asked if no scrapinghub.yml is found in the same dir as scrapy.cfg.
If you enter Y to the last question, the scrapinghub.yml file will be created with its default project set to the ID specified (NNNNN), and you won't be asked again.
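A minimal sketch of the wizard using click prompts (function name and exact file layout are assumptions; shub's real config handling may differ):

import click
import yaml

def first_deploy_wizard(config_path='scrapinghub.yml'):
    # Only called when no scrapinghub.yml was found next to scrapy.cfg
    project_id = click.prompt('Project ID you would like to deploy to', type=int)
    if click.confirm('Save project ID into scrapinghub.yml?', default=True):
        with open(config_path, 'w') as f:
            yaml.safe_dump({'projects': {'default': project_id}}, f,
                           default_flow_style=False)
    return project_id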
It is possible to provide multiple API keys without having to touch the endpoints setting, e.g.:
# scrapinghub.yml
projects:
  default: 123
  otheruser: someoneelse/123
apikeys:
  default: abc
  someoneelse: def
This won't see much use but nevertheless should be documented in the advanced section of the readme.
deploy-reqs isn't mentioned in the shub docs.
Output:
tests/test_deploy_egg.py::TestDeployEgg::test_parses_project_information_correctly FAILED
================================== FAILURES ===================================
___________ TestDeployEgg.test_parses_project_information_correctly ___________
self = <tests.test_deploy_egg.TestDeployEgg testMethod=test_parses_project_information_correctly>
def tearDown(self):
os.chdir(self.curdir)
if os.path.exists(self.tmp_dir):
> shutil.rmtree(self.tmp_dir)
tests\test_deploy_egg.py:36:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
c:\python27\Lib\shutil.py:247: in rmtree
rmtree(fullname, ignore_errors, onerror)
c:\python27\Lib\shutil.py:252: in rmtree
onerror(os.remove, fullname, sys.exc_info())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
path = 'c:\\users\\appveyor\\appdata\\local\\temp\\1\\shub-test-deploy-eggscayihm\\dist'
ignore_errors = False, onerror = <function onerror at 0x035D2770>
def rmtree(path, ignore_errors=False, onerror=None):
...
try:
> os.remove(fullname)
E WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\users\\appveyor\\appdata\\local\\temp\\1\\shub-test-deploy-eggscayihm\\dist\\test_project-1.2.0-py2.7.egg'
c:\python27\Lib\shutil.py:250: WindowsError
---------------------------- Captured stdout call -----------------------------
Building egg in: c:\users\appveyor\appdata\local\temp\1\shub-test-deploy-eggscayihm
Deploying dependency to Scrapy Cloud project "0"
Deployed eggs list at: https://dash.scrapinghub.com/p/0/eggs
===================== 1 failed, 59 passed in 7.22 seconds =====================
The test creates the egg file in a temporary directory that is deleted in tearDown; the problem is that the file is never closed.
The problem exists only on Windows because attempting to remove a file that is open raises an exception; on Linux the OS just unlinks the file and it is removed once all file handles are closed. [1]
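A possible fix is to make sure the egg file handle is closed before the temporary directory is removed, e.g. by opening it in a with block (sketch; the real upload code in shub may be organised differently):

import requests

def _upload_egg(url, egg_path, auth):
    # The with block guarantees the handle is closed again after the POST,
    # so shutil.rmtree() on Windows no longer hits Error 32.
    with open(egg_path, 'rb') as egg_file:
        return requests.post(url, files={'egg': egg_file}, auth=auth)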
We have some eggs deployed with unknown version.
Try:
-e git+https://github.com/scrapinghub/frontera@settings_override#egg=frontera
Pre-release:
Post-release:
The -p option was dropped and scripts may have to be updated.
Currently, shub deploy-egg (without --from-pypi or --from-url specified) will always try to find a setup.py in the current directory. This makes it hard to maintain libraries that are used across multiple projects. I think we should add a --from-directory option to shub deploy-egg.
Note: Changing to the library directory first is inconvenient because it requires explicitly specifying the numeric project ID (as the correct scrapinghub.yml isn't guaranteed to be found from the library directory). I.e., while cd ~/dev/my-cool-lib/ && shub deploy-egg PROJECTID && cd - works, cd ~/dev/my-cool-lib/ && shub deploy-egg && cd - (no project ID specified) does not.
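A sketch of what the option could look like (hypothetical; the real deploy-egg command is structured differently, and the project resolution step is only indicated in comments):

import os
import click

@click.command()
@click.argument('target', required=False, default='default')
@click.option('--from-directory', 'from_directory', type=click.Path(exists=True),
              help='Build the egg from this directory instead of the cwd')
def deploy_egg(target, from_directory):
    # Resolve the project/endpoint/apikey from scrapinghub.yml while still in
    # the original working directory (that is the whole point: the correct
    # config is not guaranteed to be reachable from the library directory).
    # ... resolution omitted ...
    if from_directory:
        os.chdir(from_directory)
    # ... build the egg via `python setup.py bdist_egg` and upload it ...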
This is all the output I get:
$ shub deploy-egg --from-pypi pytz 1462X
Fetching pytz from pypi
Collecting pytz
Using cached pytz-2015.4.tar.bz2
Saved /tmp/shub-deploy-egg-from-pypiYfDnBe/pytz-2015.4.tar.bz2
Successfully downloaded pytz
Package fetched successfully
Other packages work just fine:
$ shub deploy-egg --from-pypi scrapy-inline-requests 1462X
Fetching scrapy-inline-requests from pypi
Collecting scrapy-inline-requests
Using cached scrapy-inline-requests-0.1.2.tar.gz
Saved /tmp/shub-deploy-egg-from-pypig6fcOt/scrapy-inline-requests-0.1.2.tar.gz
Successfully downloaded scrapy-inline-requests
Package fetched successfully
Uncompressing: scrapy-inline-requests-0.1.2.tar.gz
Building egg in: /tmp/shub-deploy-egg-from-pypig6fcOt/scrapy-inline-requests-0.1.2
zip_safe flag not set; analyzing archive contents...
Deploying dependency to Scrapy Cloud project "1462X"
{"status": "ok", "egg": {"version": "scrapy-inline-requests-0.1.2", "name": "scrapy-inline-requests"}}
Deployed eggs list at: https://dash.scrapinghub.com/p/1462X/eggs
If one is already logged in, shub login throws an error, while it could say something less "dramatic".
Current behavior:
$ shub login
Usage: shub login [OPTIONS]
Error: Already logged in. To logout use: shub logout
It could say "you are already logged in, use shub logout first", or something along those lines.
shub schedule works for this:
[deploy]
url = https://dash.scrapinghub.com/api/scrapyd/
project = 1234
shub schedule fails for this:
[deploy:default]
url = https://dash.scrapinghub.com/api/scrapyd/
project = 1234
Whereas shub deploy works for both.
How about replacing the generic exit code 1 we use everywhere right now with more meaningful exit codes (e.g. following the sysexits.h convention)?
This would be super easy after #97 and would make it easier for shell scripts to do error diagnostics when automating shub usage. The downside is that it's slightly backwards incompatible with scripts that check exit_code == 1 instead of the more sane exit_code != 0.
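For illustration, a sketch of what this could look like: the exception classes are made up, but the numeric values are the standard sysexits.h ones:

import sys

# Standard sysexits.h values (also available as os.EX_* on Unix)
EX_USAGE = 64        # bad command line, e.g. unknown target
EX_DATAERR = 65      # malformed scrapinghub.yml
EX_UNAVAILABLE = 69  # API unreachable
EX_NOPERM = 77       # authentication failed

class ShubException(Exception):
    exit_code = 1  # fallback: keeps the old behaviour for unclassified errors

class AuthException(ShubException):
    exit_code = EX_NOPERM

def run(cli):
    # Single place that converts exceptions into process exit codes
    try:
        cli()
    except ShubException as e:
        sys.stderr.write('%s\n' % e)
        sys.exit(e.exit_code)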
The example below shows how shub fetch-eggs creates a zip file without the actual zip content due to a missing auth key.
$ shub fetch-eggs 1261X
Downloading eggs to eggs-1261X.zip
$ file eggs-1261X.zip
eggs-1261X.zip: ASCII text, with no line terminators
$ export SHUB_APIKEY=xxxxxxxb31
$ shub fetch-eggs 1261X
Downloading eggs to eggs-1261X.zip
$ file eggs-1261X.zip
eggs-12616.zip: Zip archive data, at least v2.0 to extract
shub deploy only looks for credentials in the project's scrapy.cfg and ~/.scrapy.cfg files.
Does it make sense to make it work with .netrc (shub login) and $SHUB_APIKEY credentials?
When pyinstaller/pyinstaller#1772 gets fixed, we should remove the setuptools==19.2 dependency in tox.ini's freeze section.
At this very moment shub items outputs results as newline-separated stringified Python dicts:
{u'name': u'item_0', u'foo': u'bar'}
{u'name': u'item_1', u'foo': u'bar'}
Personally I think it's better to make it JSON-lines:
{"name": "item_0", "foo": "bar"}
{"name": "item_1", "foo": "bar"}
Or, if we need to keep backward compatibility, I'd suggest at least introducing something like shub items --format=jsonl.
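The change itself would be tiny; a sketch of the jsonl path (assuming the command already has the items as Python dicts):

import json

def dump_items(items, fmt='jsonl'):
    for item in items:
        if fmt == 'jsonl':
            print(json.dumps(item))   # one JSON object per line
        else:
            print(item)               # legacy behaviour: stringified dict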
The issue was already discussed in another thread.
http://support.scrapinghub.com/topic/808955-windowserror-32-the-process-cannot-access-the-file/
The suggested fix was to "comment out the temporary file removal in the finally block in deploy.py"; this needs to be tested on the Windows platform.
I've noticed a couple of times that people often just try to deploy from their requirements.txt file.
It's sort of a reasonable expectation; we even named the feature deploy-reqs, after all.
So I thought we could handle those cases more sanely, because they will show up again, and it seems a bit silly to try to educate users to maintain separate requirements files.
Maybe we could maintain a list of runtime dependencies that should be skipped by default (while informing the user) and offer an option to force deploying them.
What do you think?
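A sketch of the skip-list idea; the set of packages assumed to be preinstalled in the Scrapy Cloud runtime and the force flag are both made up and would need to be confirmed:

# Assumption: these are already available in the Scrapy Cloud runtime
RUNTIME_DEPS = {'scrapy', 'lxml', 'twisted', 'pyopenssl'}

def filter_requirements(lines, force=False):
    # Drop runtime dependencies unless the (hypothetical) force flag was
    # given, and tell the user what was skipped.
    kept = []
    for line in lines:
        name = line.strip().split('==')[0].lower()
        if not force and name in RUNTIME_DEPS:
            print('Skipping runtime dependency: %s' % line.strip())
            continue
        kept.append(line)
    return kept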
deploy-reqs CLI arguments are not consistent with other commands like deploy.
project_id is required but should be an optional argument, and the command should call the same _get_project function that is defined in deploy.py.
It would be nice if shub log supported a -f option to follow logs for running jobs.
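A very rough sketch of what -f could do; the method names used on the hubstorage job object (logs.list(), metadata['state']) are assumptions about the client API:

import time

def follow_log(job, interval=5):
    # Poll the job log until the job stops running, printing only entries
    # we have not seen yet.
    seen = 0
    while True:
        entries = list(job.logs.list())
        for entry in entries[seen:]:
            print(entry.get('message', entry))
        seen = len(entries)
        if job.metadata.get('state') != 'running':
            break
        time.sleep(interval)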
I have some directories listed in .gitignore, e.g. a /dev directory where I keep some private dirty dev scripts. I see that the content of /dev is deployed; I learned this because I have an import in my dev scripts that is not supported in my deploy target, and the deploy is failing because of the missing import. Is there any way to deploy only the actual code and not the contents of .gitignore?
They're currently failing.
$ pytest
[...]
Ran 21 test cases in 8.23s (0.29s CPU), 5 errors, 3 failures, 5 skipped
When the API key validation request fails, it throws an unhandled exception and prints a traceback to the command line.
We should catch failed requests and either
Where I would opt for the first option. However, that request failing means that either the user has no working internet connection or there is a server outage on our side. In both cases, shub is pretty much useless (at that moment), so I guess option 2 is viable as well.
On a broader scale, we should introduce a new exception for failed connections with helpful error messages.
https://travis-ci.org/scrapinghub/shub/jobs/104864865#L331
Works fine in my identical fork repository. I think it's the lack of skip_cleanup: true in the pypi deploy provider section (e.g. the tar command tries to compress a file that was already deleted).
There's a repeatable failure if requirements.txt has lxml in it. When I remove lxml, it builds and deploys the rest of the eggs (which is awesome). MacBook Pro, OS X Yosemite.
Minimum reproducer:
$ virtualenv --version
13.1.0
$ virtualenv --python=python2.7 venv
Running virtualenv with interpreter ...
Using real prefix '...'
New python executable in venv/bin/python2.7
Also creating executable in venv/bin/python
Installing setuptools, pip, wheel...done.
$ source venv/bin/activate
$ pip install shub
Collecting shub
...
Successfully installed click-4.1 requests-2.7.0 shub-1.3.1 six-1.9.0
$ pip list
click (4.1)
pip (7.1.0)
requests (2.7.0)
setuptools (18.0.1)
shub (1.3.1)
six (1.9.0)
wheel (0.24.0)
$ echo 'lxml==3.4.4' > requirements.txt
$ shub deploy-reqs requirements.txt
Downloading eggs...
.../venv/lib/python2.7/site-packages/pip/vendor/requests/packages/urllib3/util/ssl.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
.../venv/lib/python2.7/site-packages/pip/vendor/requests/packages/urllib3/util/ssl.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Collecting lxml==3.4.4 (from -r .../requirements.txt (line 1))
Using cached lxml-3.4.4.tar.gz
Saved /var/folders/zw/v2b8vtsj6cn58j55371phlf80000gp/T/eggshv9fpJ/eggs/lxml-3.4.4.tar.gz
Successfully downloaded lxml
Uncompressing: lxml-3.4.4.tar.gz
Building egg in: /private/var/folders/zw/v2b8vtsj6cn58j55371phlf80000gp/T/eggshv9fpJ/eggs/lxml-3.4.4
.../2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
.../2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
.../2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
Traceback (most recent call last):
File ".../venv/bin/shub", line 11, in
sys.exit(cli())
File ".../venv/lib/python2.7/site-packages/click/core.py", line 664, in call
return self.main(_args, *_kwargs)
File ".../venv/lib/python2.7/site-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File ".../venv/lib/python2.7/site-packages/click/core.py", line 991, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File ".../venv/lib/python2.7/site-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File ".../venv/lib/python2.7/site-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File ".../venv//lib/python2.7/site-packages/shub/deploy_reqs.py", line 18, in cli
main(project_id, requirements_file)
File ".../venv/lib/python2.7/site-packages/shub/deploy_reqs.py", line 27, in main
utils.build_and_deploy_eggs(project_id, apikey)
File ".../venv/lib/python2.7/site-packages/shub/utils.py", line 105, in build_and_deploy_eggs
build_and_deploy_egg(project_id, apikey)
File ".../venv/lib/python2.7/site-packages/shub/utils.py", line 132, in build_and_deploy_egg
_deploy_dependency_egg(apikey, project_id)
File ".../venv/lib/python2.7/site-packages/shub/utils.py", line 138, in _deploy_dependency_egg
egg_name, egg_path = _get_egg_info(name)
File ".../venv/lib/python2.7/site-packages/shub/utils.py", line 169, in _get_egg_info
egg_path = glob(egg_path_glob)[0]
IndexError: list index out of range
deploy should be "run and forget": shub should make sure that all requirements the project has (specified in requirements.txt or setup.py's install_requires) are available on Scrapy Cloud and upload them if necessary.
See also #56.
I just pulled recent master - looks totally great and I really enjoy all those cool new features.
There is one thing that doesn't work for me though. When I try to deploy to some specific target I get either
Error: Could not find API key for endpoint ad-hoc.
or
Error: "realtime-test" is not a valid Scrapinghub project ID. Please check your scrapinghub.yml
Am I missing something here, or is it a potential bug?
EDIT:
I see this is changed behavior. Looking at the line here: https://github.com/scrapinghub/shub/blob/master/shub/config.py#L134 I must have API keys defined for all endpoints. In the past, if I had a default apikey it was used for all endpoints. Is this change expected and intended?
The command would take a job_id as an argument and restart the job with that ID using the same combination of arguments.
For example
> shub restart 1887/496/3724
would restart the job with this ID in the default project.
It would be very useful for all of us who have jobs with complex lists of arguments and need to enter them manually, either on the command line or in Dash.
The syntax could also include the spider name, project and job ID, e.g.
shub restart amazon.com_products 1886 latest
but I would prefer the job ID as it is much shorter.
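A sketch using the legacy python-scrapinghub Connection API; whether the job metadata really exposes the original spider and its arguments under these keys is an assumption:

from scrapinghub import Connection

def restart(apikey, job_key):
    # job_key looks like "1887/496/3724": project/spider/job
    project_id = int(job_key.split('/')[0])
    project = Connection(apikey)[project_id]
    old_job = project.job(job_key)
    spider = old_job.info['spider']             # assumed metadata key
    args = old_job.info.get('spider_args', {})  # assumed metadata key
    return project.schedule(spider, **args)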
The job resource commands (log, items, requests) currently always use the production Hubstorage endpoint (http://storage.scrapinghub.com). This is due to the fact that the storage endpoint cannot be easily derived from the scrapyd endpoint.
Should we add support for supplying different storage endpoints? That would allow using the job resource commands while working on a devbox.
I think if any API call fails (e.g. an auth error) then the command should exit with a non-zero status. The example below shows how shub exits with zero status regardless of the auth failure.
$ shub deploy-egg 1462X
Building egg in: /home/rolando/foo
zip_safe flag not set; analyzing archive contents...
Deploying dependency to Scrapy Cloud project "1462X"
Deploy failed (403):
{"status": "error", "message": "Authentication failed"}
Deployed eggs list at: https://dash.scrapinghub.com/p/1462X/eggs
$ echo $?
0
Just tried to deploy-reqs using shub master and got:
Traceback (most recent call last):
File "/Users/zehzinho/.virtualenv/ds/bin/shub", line 9, in <module>
load_entry_point('shub==1.5.0', 'console_scripts', 'shub')()
File "/Users/zehzinho/.virtualenv/ds/lib/python2.7/site-packages/click/core.py", line 700, in __call__
return self.main(*args, **kwargs)
File "/Users/zehzinho/.virtualenv/ds/lib/python2.7/site-packages/click/core.py", line 680, in main
rv = self.invoke(ctx)
File "/Users/zehzinho/.virtualenv/ds/lib/python2.7/site-packages/click/core.py", line 1027, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/zehzinho/.virtualenv/ds/lib/python2.7/site-packages/click/core.py", line 873, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/zehzinho/.virtualenv/ds/lib/python2.7/site-packages/click/core.py", line 508, in invoke
return callback(*args, **kwargs)
File "/Users/zehzinho/Sources/scraping-hub/shub/shub/deploy_reqs.py", line 34, in cli
main(target, requirements_file)
File "/Users/zehzinho/Sources/scraping-hub/shub/shub/deploy_reqs.py", line 43, in main
utils.build_and_deploy_eggs(project, endpoint, apikey)
File "/Users/zehzinho/Sources/scraping-hub/shub/shub/utils.py", line 141, in build_and_deploy_eggs
build_and_deploy_egg(project, endpoint, apikey)
File "/Users/zehzinho/Sources/scraping-hub/shub/shub/utils.py", line 168, in build_and_deploy_egg
_deploy_dependency_egg(project, endpoint, apikey)
File "/Users/zehzinho/Sources/scraping-hub/shub/shub/utils.py", line 183, in _deploy_dependency_egg
make_deploy_request(url, data, files, auth)
TypeError: make_deploy_request() takes exactly 6 arguments (4 given)
pip 8.0.0 dropped pip install --download in favour of pip download, so we should update our code.
Forcing pip>=8 might be too strong though; maybe we can check the pip version in our code, or just stick to install --download and live with the deprecation warning for a while (or swallow it if possible).
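Checking the pip version at runtime could look roughly like this (sketch):

import pip
from pkg_resources import parse_version

def pip_download_command(requirements_file, dest_dir):
    # pip >= 8 has a dedicated `download` command; older versions only
    # support the (now deprecated) `install --download` flag.
    if parse_version(pip.__version__) >= parse_version('8.0.0'):
        return ['pip', 'download', '-r', requirements_file, '-d', dest_dir]
    return ['pip', 'install', '--download', dest_dir, '-r', requirements_file]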
It would be nice to be able to do:
shub deploy-spider -p $PROJ_ID spiderfile.py
This would wrap the spider file into a temporary project and deploy that project to Scrapy Cloud.
I think it would be a good complement to the scrapy runspider spiderfile.py command; we could even show it off on http://scrapy.org :)
As we're moving our configuration format to scrapinghub.yml, should we make the switch easy by adding a command to generate a config in the new format?
Some jobs have huge logs / many items / requests. We could limit the number of entries downloaded from hubstorage by grabbing the number of entries first, e.g.
total_lines = job.logs.stats()['totals']['input_values']
# then construct correct `start_after` argument for iter.json() from this
Maybe download only the last 250 items if -f is set, with a possibility to override, and download all items if -f is not set.
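A sketch of the last-N limiting, following the stats() call above; the exact pagination parameter accepted by iter_json() is an assumption:

def tail_log(job, last_n=250):
    total = job.logs.stats()['totals']['input_values']
    # Start iterating just before the last `last_n` entries
    start = '%s/%d' % (job.key, max(0, total - last_n))
    return job.logs.iter_json(start=start)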
The "get version control commit" commands throw an unhandled exception when their version control tool is not available.
This breaks deploying (shub exits with a traceback) when
The exceptions should be either caught in the pwd_git_version()
(and similar) util functions or in ShubConfig.get_version()
.
See: https://stackoverflow.com/questions/377017/test-if-executable-exists-in-python
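A minimal sketch of the guard, using the PATH-probing approach from the linked question (the exact git invocation shub uses may differ):

from distutils.spawn import find_executable
import subprocess

def pwd_git_version():
    # Return the current commit as a version string, or None when git is
    # missing or this is not a git checkout, instead of raising.
    if find_executable('git') is None:
        return None
    try:
        output = subprocess.check_output(['git', 'describe', '--always'])
    except (OSError, subprocess.CalledProcessError):
        return None
    return output.decode('ascii', 'ignore').strip()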
This is outdated: http://doc.scrapinghub.com/scrapy-cloud.html#deploying-a-scrapy-spider
We will soon start distributing shub in binary form, outside standard channels (like PyPI or apt). So shub needs a way to check if there's a new update available for it.
I propose using the GitHub releases API, and making sure we keep doing releases here. The check should be cached (maybe in ~/.scrapinghub.yml) so as not to check on every invocation, but once a day (or something like that). It should also never fail (if there's no internet connection, for example) and cache failures as well.
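A sketch of the check with a one-day cache; the cache key names are assumptions, the URL is the standard GitHub releases API:

import time
import requests

RELEASES_URL = 'https://api.github.com/repos/scrapinghub/shub/releases/latest'

def latest_release(cache, max_age=24 * 3600):
    # `cache` is a dict persisted e.g. in ~/.scrapinghub.yml. The check never
    # raises: failures just leave the cached value untouched and bump the
    # timestamp so we don't retry on every invocation.
    if time.time() - cache.get('last_update_check', 0) < max_age:
        return cache.get('latest_version')
    cache['last_update_check'] = time.time()
    try:
        resp = requests.get(RELEASES_URL, timeout=5)
        resp.raise_for_status()
        cache['latest_version'] = resp.json().get('tag_name')
    except Exception:
        pass
    return cache.get('latest_version')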
#61 needs a test
Scrapy allows storing the name of the project in the SCRAPY_PROJECT environment variable, and the path of a project settings module in SCRAPY_CONFIG_MODULE. These take precedence over reading the settings module path from scrapy.cfg.
A scrapy.cfg file that has no [settings] section, or one that looks like this:
[settings]
custom_project_name = path.to.settings
therefore works perfectly fine with Scrapy as long as the appropriate environment variables are set.
shub, however, does not read these environment variables, and instead requires a scrapy.cfg with a [settings] section and default as the project name.
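A sketch of honouring SCRAPY_PROJECT when looking up the settings entry in scrapy.cfg (helper name made up; shub's real config lookup is more involved):

import os
try:
    from ConfigParser import SafeConfigParser                    # Python 2
except ImportError:
    from configparser import ConfigParser as SafeConfigParser   # Python 3

def find_settings_module(scrapy_cfg='scrapy.cfg'):
    # Honour SCRAPY_PROJECT the way Scrapy does instead of hardcoding 'default'
    project = os.environ.get('SCRAPY_PROJECT', 'default')
    cfg = SafeConfigParser()
    cfg.read(scrapy_cfg)
    if cfg.has_option('settings', project):
        return cfg.get('settings', project)
    return None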
shub schedule pet-supermarket.co.uk_keywords -a keywords="id1=dog"
fails with
Traceback (most recent call last):
File "/usr/local/bin/shub", line 9, in <module>
load_entry_point('shub==1.4.0', 'console_scripts', 'shub')()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 664, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 991, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File "/home/pawel/scrapinghub/shub_clone/shub/schedule.py", line 18, in cli
job_key = schedule_spider(apikey, project_id, spider, argument)
File "/home/pawel/scrapinghub/shub_clone/shub/schedule.py", line 28, in schedule_spider
return conn[project_id].schedule(spider, **dict(x.split('=') for x in arguments))
ValueError: dictionary update sequence element #0 has length 3; 2 is required
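The traceback points at the argument parsing in schedule_spider(): splitting on every = breaks values that themselves contain =. Limiting the split fixes it (sketch):

def parse_spider_args(arguments):
    # 'keywords=id1=dog' -> {'keywords': 'id1=dog'}: only split on the first '='
    return dict(x.split('=', 1) for x in arguments)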
Deploying fails with a quite cryptic (for users) message when setup.py does not contain a scrapy entry point pointing to the settings module:
https://paste.scrapinghub.com/show/8940/
If there's no setup.py, shub writes one of its own that has all the necessary information. But if the user wrote one, we should probably check whether the entry point exists and tell them to add it if necessary (or add it ourselves).
Today we only display "Authentication failed".
The docs are also outdated:
scrapy startproject myprojectname puts the following section into scrapy.cfg:
[deploy]
project = myprojectname
When users first run shub deploy, this will be transferred to scrapinghub.yml and an error message will be printed.
We should not import non-integer project names from scrapy.cfg and instead guide users through the deploy wizard.
It looks like shub got several useful features in the last two months; what about publishing them?
Let's add a -s / --setting argument that works in the same fashion as -a, but for passing settings to the job.