scrapinghub / shub
Scrapinghub Command Line Client
Home Page: https://shub.readthedocs.io/
License: BSD 3-Clause "New" or "Revised" License
Job IDs are hard to remember. If we saved the last scheduled job ID (say, in .scrapinghub.yml), instead of this:
shub schedule prod/myspider
shub items prod/2/204
users could do this:
shub schedule prod/myspider
shub items
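A rough sketch of how the last job key could be cached (the cache file location and the last_job key name are assumptions, not existing shub behaviour):

import os
import yaml

CACHE_FILE = '.scrapinghub.yml'  # assumption: reuse the project config file as the cache

def remember_job(job_key):
    # Called after `shub schedule`: store the last scheduled job key
    conf = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            conf = yaml.safe_load(f) or {}
    conf['last_job'] = job_key
    with open(CACHE_FILE, 'w') as f:
        yaml.safe_dump(conf, f, default_flow_style=False)

def last_job():
    # Used by `shub items` & co. when no job ID is given on the command line
    if not os.path.exists(CACHE_FILE):
        return None
    with open(CACHE_FILE) as f:
        return (yaml.safe_load(f) or {}).get('last_job')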
We should be able to drop freeze/hooks/runtime-hooks.py (the requests part) as soon as pyinstaller/pyinstaller#1777 is merged and we pin PyInstaller to 3.2.
shub opens new Python sessions through the subprocess module. In particular, shub deploy depends on an available python executable (to build an egg through python setup.py), and shub deploy-egg and shub deploy-reqs additionally depend on pip.
While PyInstaller does bundle the Python interpreter, it is not an executable file (for Windows builds it bundles python27.dll and some additional libraries, which it then loads in the bootloader, I guess). I have sent a question (yet to be released by a moderator) to the PyInstaller mailing list regarding how to call a new interpreter instance.
For pip, for now we have settled in a Slack discussion on deactivating the commands when pip is not available (for deploy-egg we should look into only deactivating the --from-pypi switch). There is a mailing list thread on bundling it here.
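A minimal sketch of the deactivation, assuming we simply probe the PATH before running the affected commands (the helper and its wiring into the commands are made up):

from distutils.spawn import find_executable
import click

def require_pip():
    # Called at the start of deploy-egg/deploy-reqs: abort early with a
    # readable message instead of failing later in a subprocess call.
    if find_executable('pip') is None:
        raise click.ClickException(
            'pip is required for this command but was not found on PATH')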
Let's add tests for #72.
It would be nice to add a small wizard when people use shub deploy for the first time, to ease the onboarding process. Then, all we need to say in Dash is: use shub deploy and answer the questions.
$ shub deploy
Please login first with: shub login
$ shub login
...
$ shub deploy
Project ID you would like to deploy to: NNNNN
(...maybe validate access to the project? like shub login validates api key...)
Save project ID into scrapinghub.yml [Y/n]?
The questions will be asked if no scrapinghub.yml is found in the same dir as scrapy.cfg.
If you enter Y to the last question, the scrapinghub.yml file will be created with its default project set to the ID specified (NNNNN), and you won't be asked again.
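A minimal sketch of the wizard using click prompts (function name and exact file layout are assumptions; shub's real config handling may differ):

import click
import yaml

def first_deploy_wizard(config_path='scrapinghub.yml'):
    # Only called when no scrapinghub.yml was found next to scrapy.cfg
    project_id = click.prompt('Project ID you would like to deploy to', type=int)
    if click.confirm('Save project ID into scrapinghub.yml?', default=True):
        with open(config_path, 'w') as f:
            yaml.safe_dump({'projects': {'default': project_id}}, f,
                           default_flow_style=False)
    return project_id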
It is possible to provide multiple API keys without having to touch the endpoints setting, e.g.:
# scrapinghub.yml
projects:
  default: 123
  otheruser: someoneelse/123
apikeys:
  default: abc
  someoneelse: def
This won't see much use but nevertheless should be documented in the advanced section of the readme.
deploy-reqs isn't mentioned in the shub docs.
Output:
tests/test_deploy_egg.py::TestDeployEgg::test_parses_project_information_correctly FAILED
================================== FAILURES ===================================
___________ TestDeployEgg.test_parses_project_information_correctly ___________
self = <tests.test_deploy_egg.TestDeployEgg testMethod=test_parses_project_information_correctly>
def tearDown(self):
os.chdir(self.curdir)
if os.path.exists(self.tmp_dir):
> shutil.rmtree(self.tmp_dir)
tests\test_deploy_egg.py:36:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
c:\python27\Lib\shutil.py:247: in rmtree
rmtree(fullname, ignore_errors, onerror)
c:\python27\Lib\shutil.py:252: in rmtree
onerror(os.remove, fullname, sys.exc_info())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
path = 'c:\\users\\appveyor\\appdata\\local\\temp\\1\\shub-test-deploy-eggscayihm\\dist'
ignore_errors = False, onerror = <function onerror at 0x035D2770>
def rmtree(path, ignore_errors=False, onerror=None):
...
try:
> os.remove(fullname)
E WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\users\\appveyor\\appdata\\local\\temp\\1\\shub-test-deploy-eggscayihm\\dist\\test_project-1.2.0-py2.7.egg'
c:\python27\Lib\shutil.py:250: WindowsError
---------------------------- Captured stdout call -----------------------------
Building egg in: c:\users\appveyor\appdata\local\temp\1\shub-test-deploy-eggscayihm
Deploying dependency to Scrapy Cloud project "0"
Deployed eggs list at: https://dash.scrapinghub.com/p/0/eggs
===================== 1 failed, 59 passed in 7.22 seconds =====================
The test creates the egg file in a temporary directory that is deleted in tearDown; the problem is that the file is never closed.
The problem exists only on Windows because attempting to remove a file that is open raises an exception; on Linux the OS just unlinks the file and it is removed once all file handles are closed. [1]
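A possible fix is to make sure the egg file handle is closed before the temporary directory is removed, e.g. by opening it in a with block (sketch; the real upload code in shub may be organised differently):

import requests

def _upload_egg(url, egg_path, auth):
    # The with block guarantees the handle is closed again after the POST,
    # so shutil.rmtree() on Windows no longer hits Error 32.
    with open(egg_path, 'rb') as egg_file:
        return requests.post(url, files={'egg': egg_file}, auth=auth)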
We have some eggs deployed with unknown version.
Try:
-e git+https://github.com/scrapinghub/frontera@settings_override#egg=frontera
Pre-release:
Post-release:
The -p option was dropped and scripts may have to be updated.
Currently, shub deploy-egg (without --from-pypi or --from-url specified) will always try to find a setup.py in the current directory. This makes it hard to maintain libraries that are used across multiple projects. I think we should add a --from-directory option to shub deploy-egg.
Note: Changing to the library directory first is inconvenient because it requires explicitly specifying the numeric project ID (as the correct scrapinghub.yml isn't guaranteed to be found from the library directory). I.e., while cd ~/dev/my-cool-lib/ && shub deploy-egg PROJECTID && cd - works, cd ~/dev/my-cool-lib/ && shub deploy-egg && cd - (no project ID specified) does not.
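A sketch of what the option could look like (hypothetical; the real deploy-egg command is structured differently, and the project resolution step is only indicated in comments):

import os
import click

@click.command()
@click.argument('target', required=False, default='default')
@click.option('--from-directory', 'from_directory', type=click.Path(exists=True),
              help='Build the egg from this directory instead of the cwd')
def deploy_egg(target, from_directory):
    # Resolve the project/endpoint/apikey from scrapinghub.yml while still in
    # the original working directory (that is the whole point: the correct
    # config is not guaranteed to be reachable from the library directory).
    # ... resolution omitted ...
    if from_directory:
        os.chdir(from_directory)
    # ... build the egg via `python setup.py bdist_egg` and upload it ...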
This is all the output I get:
$ shub deploy-egg --from-pypi pytz 1462X
Fetching pytz from pypi
Collecting pytz
Using cached pytz-2015.4.tar.bz2
Saved /tmp/shub-deploy-egg-from-pypiYfDnBe/pytz-2015.4.tar.bz2
Successfully downloaded pytz
Package fetched successfully
Other packages work just fine:
$ shub deploy-egg --from-pypi scrapy-inline-requests 1462X
Fetching scrapy-inline-requests from pypi
Collecting scrapy-inline-requests
Using cached scrapy-inline-requests-0.1.2.tar.gz
Saved /tmp/shub-deploy-egg-from-pypig6fcOt/scrapy-inline-requests-0.1.2.tar.gz
Successfully downloaded scrapy-inline-requests
Package fetched successfully
Uncompressing: scrapy-inline-requests-0.1.2.tar.gz
Building egg in: /tmp/shub-deploy-egg-from-pypig6fcOt/scrapy-inline-requests-0.1.2
zip_safe flag not set; analyzing archive contents...
Deploying dependency to Scrapy Cloud project "1462X"
{"status": "ok", "egg": {"version": "scrapy-inline-requests-0.1.2", "name": "scrapy-inline-requests"}}
Deployed eggs list at: https://dash.scrapinghub.com/p/1462X/eggs
If one is already logged in, shub login throws an error, while it could say something less "dramatic".
Current behavior:
$ shub login
Usage: shub login [OPTIONS]
Error: Already logged in. To logout use: shub logout
It could say "you are already logged in, use shub logout first", or something along those lines.
shub schedule works for this:
[deploy]
url = https://dash.scrapinghub.com/api/scrapyd/
project = 1234
shub schedule fails for this:
[deploy:default]
url = https://dash.scrapinghub.com/api/scrapyd/
project = 1234
Whereas shub deploy works for both.
How about replacing the generic exit code 1 we use everywhere right now with more meaningful exit codes (e.g. following the sysexits.h convention)?
This would be super easy after #97 and would make it easier for shell scripts to do error diagnostics when automating shub usage. The downside is that it's slightly backwards incompatible with scripts that check exit_code == 1 instead of the more sane exit_code != 0.
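For illustration, a sketch of what this could look like: the exception classes are made up, but the numeric values are the standard sysexits.h ones:

import sys

# Standard sysexits.h values (also available as os.EX_* on Unix)
EX_USAGE = 64        # bad command line, e.g. unknown target
EX_DATAERR = 65      # malformed scrapinghub.yml
EX_UNAVAILABLE = 69  # API unreachable
EX_NOPERM = 77       # authentication failed

class ShubException(Exception):
    exit_code = 1  # fallback: keeps the old behaviour for unclassified errors

class AuthException(ShubException):
    exit_code = EX_NOPERM

def run(cli):
    # Single place that converts exceptions into process exit codes
    try:
        cli()
    except ShubException as e:
        sys.stderr.write('%s\n' % e)
        sys.exit(e.exit_code)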
The example below shows how shub fetch-eggs creates a zip file without the actual zip content due to a missing auth key.
$ shub fetch-eggs 1261X
Downloading eggs to eggs-1261X.zip
$ file eggs-1261X.zip
eggs-1261X.zip: ASCII text, with no line terminators
$ export SHUB_APIKEY=xxxxxxxb31
$ shub fetch-eggs 1261X
Downloading eggs to eggs-1261X.zip
$ file eggs-1261X.zip
eggs-12616.zip: Zip archive data, at least v2.0 to extract
shub deploy only looks for credentials in the project's scrapy.cfg and ~/.scrapy.cfg files.
Does it make sense to make it work with .netrc (shub login) and $SHUB_APIKEY credentials?
When pyinstaller/pyinstaller#1772 gets fixed, we should remove the setuptools==19.2 dependency in tox.ini's freeze section.
At this very moment shub items outputs results as newline-separated stringified Python dicts:
{u'name': u'item_0', u'foo': u'bar'}
{u'name': u'item_1', u'foo': u'bar'}
Personally I think it's better to make it JSON-lines:
{"name": "item_0", "foo": "bar"}
{"name": "item_1", "foo": "bar"}
Or, if we need to keep backward compatibility, I'd suggest at least introducing something like shub items --format=jsonl.
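The change itself would be tiny; a sketch of the jsonl path (assuming the command already has the items as Python dicts):

import json

def dump_items(items, fmt='jsonl'):
    for item in items:
        if fmt == 'jsonl':
            print(json.dumps(item))   # one JSON object per line
        else:
            print(item)               # legacy behaviour: stringified dict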
The issue was already discussed in another thread.
http://support.scrapinghub.com/topic/808955-windowserror-32-the-process-cannot-access-the-file/
The suggested fix was to "comment out the temporary file removal in the finally block in deploy.py"; this needs to be tested on the Windows platform.
I've noticed a couple of times that people often just try to deploy from their requirements.txt file.
It's sort of a reasonable expectation; we even named the feature deploy-reqs, after all.
So I thought we could handle those cases more sanely, because they will show up again, and it seems a bit silly to try to educate users to maintain separate requirements files.
Maybe we could maintain a list of runtime dependencies that should be skipped by default (while informing the user) and offer an option to force deploying them.
What do you think?
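A sketch of the skip-list idea; the set of packages assumed to be preinstalled in the Scrapy Cloud runtime and the force flag are both made up and would need to be confirmed:

# Assumption: these are already available in the Scrapy Cloud runtime
RUNTIME_DEPS = {'scrapy', 'lxml', 'twisted', 'pyopenssl'}

def filter_requirements(lines, force=False):
    # Drop runtime dependencies unless the (hypothetical) force flag was
    # given, and tell the user what was skipped.
    kept = []
    for line in lines:
        name = line.strip().split('==')[0].lower()
        if not force and name in RUNTIME_DEPS:
            print('Skipping runtime dependency: %s' % line.strip())
            continue
        kept.append(line)
    return kept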
deploy-reqs CLI arguments are not consistent with other commands like deploy.
project_id is required but should be an optional argument, and the command should call the same _get_project function that is defined in deploy.py.
It would be nice if shub log supported a -f option to follow logs for running jobs.
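A very rough sketch of what -f could do; the method names used on the hubstorage job object (logs.list(), metadata['state']) are assumptions about the client API:

import time

def follow_log(job, interval=5):
    # Poll the job log until the job stops running, printing only entries
    # we have not seen yet.
    seen = 0
    while True:
        entries = list(job.logs.list())
        for entry in entries[seen:]:
            print(entry.get('message', entry))
        seen = len(entries)
        if job.metadata.get('state') != 'running':
            break
        time.sleep(interval)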
I have some directories listed in .gitignore, e.g. a /dev directory where I keep some private dirty dev scripts. I see that the content of /dev is deployed; I learned this because I have an import in my dev scripts that is not supported in my deploy target, and the deploy is failing because of the missing import. Is there any way to deploy only the actual code and not the contents of .gitignore?
They're currently failing.
$ pytest
[...]
Ran 21 test cases in 8.23s (0.29s CPU), 5 errors, 3 failures, 5 skipped
When the API key validation request fails, it throws an unhandled exception and prints a traceback to the command line.
We should catch failed requests and either
Where I would opt for the first option. However, that request failing means that either the user has no working internet connection or there is a server outage on our side. In both cases, shub is pretty much useless (at that moment), so I guess option 2 is viable as well.
On a broader scale, we should introduce a new exception for failed connections with helpful error messages.
https://travis-ci.org/scrapinghub/shub/jobs/104864865#L331
Works fine in my identical fork repository. I think it's the lack of skip_cleanup: true in the pypi deploy provider section (e.g. the tar command tries to compress a file that was already deleted).
There's a repeatable failure if requirements.txt has lxml in it. When I remove lxml, it builds and deploys the rest of the eggs (which is awesome). MacBook Pro, OS X Yosemite.
Minimum reproducer:
$ virtualenv --version
13.1.0
$ virtualenv --python=python2.7 venv
Running virtualenv with interpreter ...
Using real prefix '...'
New python executable in venv/bin/python2.7
Also creating executable in venv/bin/python
Installing setuptools, pip, wheel...done.
$ source venv/bin/activate
$ pip install shub
Collecting shub
...
Successfully installed click-4.1 requests-2.7.0 shub-1.3.1 six-1.9.0
$ pip list
click (4.1)
pip (7.1.0)
requests (2.7.0)
setuptools (18.0.1)
shub (1.3.1)
six (1.9.0)
wheel (0.24.0)
$ echo 'lxml==3.4.4' > requirements.txt
$ shub deploy-reqs requirements.txt
Downloading eggs...
.../venv/lib/python2.7/site-packages/pip/vendor/requests/packages/urllib3/util/ssl.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
.../venv/lib/python2.7/site-packages/pip/vendor/requests/packages/urllib3/util/ssl.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Collecting lxml==3.4.4 (from -r .../requirements.txt (line 1))
Using cached lxml-3.4.4.tar.gz
Saved /var/folders/zw/v2b8vtsj6cn58j55371phlf80000gp/T/eggshv9fpJ/eggs/lxml-3.4.4.tar.gz
Successfully downloaded lxml
Uncompressing: lxml-3.4.4.tar.gz
Building egg in: /private/var/folders/zw/v2b8vtsj6cn58j55371phlf80000gp/T/eggshv9fpJ/eggs/lxml-3.4.4
.../2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
.../2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
.../2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
Traceback (most recent call last):
File ".../venv/bin/shub", line 11, in
sys.exit(cli())
File ".../venv/lib/python2.7/site-packages/click/core.py", line 664, in call
return self.main(_args, *_kwargs)
File ".../venv/lib/python2.7/site-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File ".../venv/lib/python2.7/site-packages/click/core.py", line 991, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File ".../venv/lib/python2.7/site-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File ".../venv/lib/python2.7/site-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File ".../venv//lib/python2.7/site-packages/shub/deploy_reqs.py", line 18, in cli
main(project_id, requirements_file)
File ".../venv/lib/python2.7/site-packages/shub/deploy_reqs.py", line 27, in main
utils.build_and_deploy_eggs(project_id, apikey)
File ".../venv/lib/python2.7/site-packages/shub/utils.py", line 105, in build_and_deploy_eggs
build_and_deploy_egg(project_id, apikey)
File ".../venv/lib/python2.7/site-packages/shub/utils.py", line 132, in build_and_deploy_egg
_deploy_dependency_egg(apikey, project_id)
File ".../venv/lib/python2.7/site-packages/shub/utils.py", line 138, in _deploy_dependency_egg
egg_name, egg_path = _get_egg_info(name)
File ".../venv/lib/python2.7/site-packages/shub/utils.py", line 169, in _get_egg_info
egg_path = glob(egg_path_glob)[0]
IndexError: list index out of range
deploy should be "run and forget": shub should make sure that all requirements the project has (specified in requirements.txt or setup.py's install_requires) are available on Scrapy Cloud and upload them if necessary.
See also #56.
I just pulled recent master - looks totally great and I really enjoy all those cool new features.
There is one thing that doesn't work for me though. When I try to deploy to some specific target I get either
Error: Could not find API key for endpoint ad-hoc.
or
Error: "realtime-test" is not a valid Scrapinghub project ID. Please check your scrapinghub.yml
Am I missing something here, or is it a potential bug?
EDIT:
I see this is changed behavior. Looking at the line here: https://github.com/scrapinghub/shub/blob/master/shub/config.py#L134 I must have API keys defined for all endpoints. In the past, if I had a default apikey it was used for all endpoints. Is this change expected and intended?
The command would take a job_id as an argument and restart the job with that ID using the same combination of arguments.
For example
> shub restart 1887/496/3724
would restart the job with this ID in the default project.
It would be very useful for all of us who have jobs with complex lists of arguments and need to enter them manually, either on the command line or in Dash.
The syntax could also include the spider name, project and job ID, e.g.
shub restart amazon.com_products 1886 latest
but I would prefer the job ID as it is much shorter.
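A sketch using the legacy python-scrapinghub Connection API; whether the job metadata really exposes the original spider and its arguments under these keys is an assumption:

from scrapinghub import Connection

def restart(apikey, job_key):
    # job_key looks like "1887/496/3724": project/spider/job
    project_id = int(job_key.split('/')[0])
    project = Connection(apikey)[project_id]
    old_job = project.job(job_key)
    spider = old_job.info['spider']             # assumed metadata key
    args = old_job.info.get('spider_args', {})  # assumed metadata key
    return project.schedule(spider, **args)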
The job resource commands (log, items, requests) currently always use the production Hubstorage endpoint (http://storage.scrapinghub.com). This is due to the fact that the storage endpoint cannot be easily derived from the scrapyd endpoint.
Should we add support for supplying different storage endpoints? That would allow using the job resource commands while working on a devbox.
I think if any API call fails (e.g. an auth error) then the command should exit with a non-zero status. The example below shows how shub exits with zero status regardless of the auth failure.
$ shub deploy-egg 1462X
Building egg in: /home/rolando/foo
zip_safe flag not set; analyzing archive contents...
Deploying dependency to Scrapy Cloud project "1462X"
Deploy failed (403):
{"status": "error", "message": "Authentication failed"}
Deployed eggs list at: https://dash.scrapinghub.com/p/1462X/eggs
$ echo $?
0
Just tried to deploy-reqs using shub master and got:
Traceback (most recent call last):
File "/Users/zehzinho/.virtualenv/ds/bin/shub", line 9, in <module>
load_entry_point('shub==1.5.0', 'console_scripts', 'shub')()
File "/Users/zehzinho/.virtualenv/ds/lib/python2.7/site-packages/click/core.py", line 700, in __call__
return self.main(*args, **kwargs)
File "/Users/zehzinho/.virtualenv/ds/lib/python2.7/site-packages/click/core.py", line 680, in main
rv = self.invoke(ctx)
File "/Users/zehzinho/.virtualenv/ds/lib/python2.7/site-packages/click/core.py", line 1027, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/zehzinho/.virtualenv/ds/lib/python2.7/site-packages/click/core.py", line 873, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/zehzinho/.virtualenv/ds/lib/python2.7/site-packages/click/core.py", line 508, in invoke
return callback(*args, **kwargs)
File "/Users/zehzinho/Sources/scraping-hub/shub/shub/deploy_reqs.py", line 34, in cli
main(target, requirements_file)
File "/Users/zehzinho/Sources/scraping-hub/shub/shub/deploy_reqs.py", line 43, in main
utils.build_and_deploy_eggs(project, endpoint, apikey)
File "/Users/zehzinho/Sources/scraping-hub/shub/shub/utils.py", line 141, in build_and_deploy_eggs
build_and_deploy_egg(project, endpoint, apikey)
File "/Users/zehzinho/Sources/scraping-hub/shub/shub/utils.py", line 168, in build_and_deploy_egg
_deploy_dependency_egg(project, endpoint, apikey)
File "/Users/zehzinho/Sources/scraping-hub/shub/shub/utils.py", line 183, in _deploy_dependency_egg
make_deploy_request(url, data, files, auth)
TypeError: make_deploy_request() takes exactly 6 arguments (4 given)
pip 8.0.0 dropped pip install --download in favour of pip download, so we should update our code.
Forcing pip>=8 might be too strong though; maybe we can check the pip version in our code, or just stick to install --download and live with the deprecation warning for a while (or swallow it if possible).
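Checking the pip version at runtime could look roughly like this (sketch):

import pip
from pkg_resources import parse_version

def pip_download_command(requirements_file, dest_dir):
    # pip >= 8 has a dedicated `download` command; older versions only
    # support the (now deprecated) `install --download` flag.
    if parse_version(pip.__version__) >= parse_version('8.0.0'):
        return ['pip', 'download', '-r', requirements_file, '-d', dest_dir]
    return ['pip', 'install', '--download', dest_dir, '-r', requirements_file]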
It would be nice to be able to do:
shub deploy-spider -p $PROJ_ID spiderfile.py
This would wrap the spider file into a temporary project and deploy that project to Scrapy Cloud.
I think it would be a good complement to the scrapy runspider spiderfile.py command; we could even show it off on http://scrapy.org :)
As we're moving our configuration format to scrapinghub.yml, should we make the switch easy by adding a command to generate a config in the new format?
Some jobs have huge logs / many items / requests. We could limit the number of entries downloaded from hubstorage by grabbing the number of entries first, e.g.
total_lines = job.logs.stats()['totals']['input_values']
# then construct correct `start_after` argument for iter.json() from this
Maybe download only the last 250 items if -f is set, with a possibility to override, and download all items if -f is not set.
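A sketch of the last-N limiting, following the stats() call above; the exact pagination parameter accepted by iter_json() is an assumption:

def tail_log(job, last_n=250):
    total = job.logs.stats()['totals']['input_values']
    # Start iterating just before the last `last_n` entries
    start = '%s/%d' % (job.key, max(0, total - last_n))
    return job.logs.iter_json(start=start)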
The "get version control commit" commands throw an unhandled exception when their version control tool is not available.
This breaks deploying (shub exits with a traceback) when
The exceptions should be either caught in the pwd_git_version()
(and similar) util functions or in ShubConfig.get_version()
.
See: https://stackoverflow.com/questions/377017/test-if-executable-exists-in-python
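A minimal sketch of the guard, using the PATH-probing approach from the linked question (the exact git invocation shub uses may differ):

from distutils.spawn import find_executable
import subprocess

def pwd_git_version():
    # Return the current commit as a version string, or None when git is
    # missing or this is not a git checkout, instead of raising.
    if find_executable('git') is None:
        return None
    try:
        output = subprocess.check_output(['git', 'describe', '--always'])
    except (OSError, subprocess.CalledProcessError):
        return None
    return output.decode('ascii', 'ignore').strip()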
This is outdated: http://doc.scrapinghub.com/scrapy-cloud.html#deploying-a-scrapy-spider
We will soon start distributing shub in binary form, outside standard channels (like PyPI or apt). So shub needs a way to check if there's a new update available for it.
I propose using the GitHub releases API, and making sure we keep doing releases here. The check should be cached (maybe in ~/.scrapinghub.yml) so as not to check on every invocation, but once a day (or something like that). It should also never fail (if there's no internet connection, for example) and cache failures as well.
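A sketch of the check with a one-day cache; the cache key names are assumptions, the URL is the standard GitHub releases API:

import time
import requests

RELEASES_URL = 'https://api.github.com/repos/scrapinghub/shub/releases/latest'

def latest_release(cache, max_age=24 * 3600):
    # `cache` is a dict persisted e.g. in ~/.scrapinghub.yml. The check never
    # raises: failures just leave the cached value untouched and bump the
    # timestamp so we don't retry on every invocation.
    if time.time() - cache.get('last_update_check', 0) < max_age:
        return cache.get('latest_version')
    cache['last_update_check'] = time.time()
    try:
        resp = requests.get(RELEASES_URL, timeout=5)
        resp.raise_for_status()
        cache['latest_version'] = resp.json().get('tag_name')
    except Exception:
        pass
    return cache.get('latest_version')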
#61 needs a test
Scrapy allows storing the name of the project in the SCRAPY_PROJECT environment variable, and the path of a project settings module in SCRAPY_CONFIG_MODULE. These take precedence over reading the settings module path from scrapy.cfg.
A scrapy.cfg file that has no [settings] section, or one that looks like this:
[settings]
custom_project_name = path.to.settings
therefore works perfectly fine with Scrapy as long as the appropriate environment variables are set.
shub, however, does not read these environment variables, and instead requires a scrapy.cfg with a [settings] section and default as the project name.
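A sketch of honouring SCRAPY_PROJECT when looking up the settings entry in scrapy.cfg (helper name made up; shub's real config lookup is more involved):

import os
try:
    from ConfigParser import SafeConfigParser                    # Python 2
except ImportError:
    from configparser import ConfigParser as SafeConfigParser   # Python 3

def find_settings_module(scrapy_cfg='scrapy.cfg'):
    # Honour SCRAPY_PROJECT the way Scrapy does instead of hardcoding 'default'
    project = os.environ.get('SCRAPY_PROJECT', 'default')
    cfg = SafeConfigParser()
    cfg.read(scrapy_cfg)
    if cfg.has_option('settings', project):
        return cfg.get('settings', project)
    return None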
shub schedule pet-supermarket.co.uk_keywords -a keywords="id1=dog"
fails with
Traceback (most recent call last):
File "/usr/local/bin/shub", line 9, in <module>
load_entry_point('shub==1.4.0', 'console_scripts', 'shub')()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 664, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 991, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File "/home/pawel/scrapinghub/shub_clone/shub/schedule.py", line 18, in cli
job_key = schedule_spider(apikey, project_id, spider, argument)
File "/home/pawel/scrapinghub/shub_clone/shub/schedule.py", line 28, in schedule_spider
return conn[project_id].schedule(spider, **dict(x.split('=') for x in arguments))
ValueError: dictionary update sequence element #0 has length 3; 2 is required
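The traceback points at the argument parsing in schedule_spider(): splitting on every = breaks values that themselves contain =. Limiting the split fixes it (sketch):

def parse_spider_args(arguments):
    # 'keywords=id1=dog' -> {'keywords': 'id1=dog'}: only split on the first '='
    return dict(x.split('=', 1) for x in arguments)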
Deploying fails with a quite cryptic (for users) message when setup.py does not contain a scrapy entry point pointing to the settings module:
https://paste.scrapinghub.com/show/8940/
If there's no setup.py, shub writes one of its own that has all the necessary information. But if the user wrote one, we should probably check whether the entry point exists and tell them to add it if necessary (or add it ourselves).
Today we only display "Authentication failed".
The docs are also outdated:
scrapy startproject myprojectname puts the following section into scrapy.cfg:
[deploy]
project = myprojectname
When users first run shub deploy, this will be transferred to scrapinghub.yml and an error message will be printed.
We should not import non-integer project names from scrapy.cfg and instead guide users through the deploy wizard.
It looks like shub got several useful features in the last two months; what about publishing them?
Let's add a -s / --setting argument that works in the same fashion as -a, but for passing settings to the job.