mechanicalsoup / mechanicalsoup Goto Github PK

View Code? Open in Web Editor NEW

4.6K 108.0 375.0 715 KB

A Python library for automating interaction with websites.

Home Page: http://mechanicalsoup.readthedocs.io/en/stable/

License: MIT License

Python 100.00%

python beautifulsoup mechanicalsoup python-library pypi requests web

mechanicalsoup's People

Stargazers

Watchers

Forkers

hamada2029 forrestlin marioacarvalho gwecho mocyuto tchen0123 bossjones wavelets jomtung eiriklv abensrhir chadvavra nvaller gtrdblt prodigeni handoutz bogdan-pr jbradish bag-of-projects mkouhia strogo furico kingjo002 584251395 halhx peta345 maximusk10 jeremiahmarks xuefeng-zhu yurashkurenko sanfam successtest9 makerbot icesunx priestd09 gsora rlugojr irekrybark mohan3d wernight windhaunting xyzalzhang ranamihir yoyocash teror4uks habdulkafi collinsctk agrawal-mohit cutecode fbormann vince06fr doshmajhan stoneyu3 nkhuyu orangain imclab smachraoui moy why-not-sky andyljones michaellgraves jstuff36 firearasi securextools tpof314 showliu architechsocial jbluvs2tri ghisvail r0fls dbrobins mengqhui dmvmrk python3pkg huangkbaaron j450h1 kopei reza5858 eoghanmurray boarnasia javierlopeza yujiaao juno249 hemberger filippoquaranta yfkwon prabz coresoft2 ezhangle bhageena optionalg crowd42 woolfgang devclev iriberri vault-the antoniotrento subc0ol sanyambansal76 jasonpy99

mechanicalsoup's Issues

Select button to submit with

Is there a way to select the button with which the browser submits the form?

Very minor issue with method placement

Hey, nice work.
Tiny non-issue really, in mechanicalsoup/browser.py
Line 88

Method: submit

That's a public method? Should it be placed with the other public methods above the private methods, _build_request and _prepare_request? I'm assuming those are 'private' methods anyway.

Anyway, this looks pretty great. Thanks!

how to handle javascript

It is amazing to write elegant code with MechanicalSoup. But I wonder how to handle javascipt with MechanicalSoup? If I use selenium, it will become annoying complicated. Hope for your help.

Ability to customize user agent

First thing you do if you are getting different output from a Python script (vs. your Browser) is change the script UA to match your browser to see if the server is sniffing it. I can submit a patch if you like unless there is an existing way of modifying request headers?

Compare with RoboBrowser

What kind of comparison with RoboBrowser?

TypeError: init() got multiple values for keyword argument 'features'

The following line of code seems to be causing problems:

self.browser = mechanicalsoup.Browser(soup_config={'features':'html.parser'})

This will consistently result in the following error:

  File "/usr/local/lib/python2.7/dist-packages/mechanicalsoup/browser.py", line 32, in get
    Browser.add_soup(response, self.soup_config)
  File "/usr/local/lib/python2.7/dist-packages/mechanicalsoup/browser.py", line 23, in add_soup
    response.content, "html.parser", **soup_config)
TypeError: __init__() got multiple values for keyword argument 'features'

Here are the versions I'm using in pip:

beautifulsoup4==4.5.1
mechanicalsoup==0.5.0

I'm actually trying to debug a different issue (I think the html.parser is having some issues with some funky content so I was trying to change to lxml when this happened).

Thanks for any ideas!

Submit button cannot be chosen when there are more than 2 buttons

From https://github.com/hickford/MechanicalSoup/blob/master/mechanicalsoup/form.py#L131:

        for inp in self.form.select("input"):
            if inp.get('type') != 'submit':
                continue
            if inp == el:
                continue

            del inp['name']
            return True

When there are 3 or more submits and one is chosen, only one is deleted and then the method early returns.

File objects are not allowed in <input type=file>

elif input.get('type') == 'file':
    ...
    files[name] = open(value, 'rb')

This code assumes the value is a file name. It is a sound assumption, when talking about normal HTML.

However, the requirement for a real file is a big limitation. The requests library allows passing a file object and even custom headers using a tuple.

In the context of this library, it would be nice to also expect such a tuple or just a file object in place of this value. It works well, even though semantically they have no place inside a HTML tree.

form.textarea adds text, but does not replace old content

If a textarea is pre-filled-in in a form, then form.py does

    def textarea(self, data):
        for (name, value) in data.items():
            self.form.find("textarea", {"name": name}).insert(0, value)

The insert adds stuff at the beginning, but does not remove the old content. Shouldn't this be

            self.form.find("textarea", {"name": name}).string = value

Thanks,

Wrong value sent for enabled checkbox or radio input with no value

MechanicalSoup always sends empty value for <input> elements with no value defined. That doesn't work well for radio and checkbox inputs – real browser would use "on" in such case. Empty value used by MechanicalSoup does not work for Django forms.

No attribute 'StatefulBrowser'

Hello. I have the latest version of MechanicalSoup installed (0.6.0) and python 3.5
I try to use StatefulBrowser like this:

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()

And I get the exception:

Traceback (most recent call last):
  File "main.py", line 12, in <module>
    browser = mechanicalsoup.StatefulBrowser()
AttributeError: module 'mechanicalsoup' has no attribute 'StatefulBrowser'

Sessions should be clean for new Browser objects

Let's say I start a browser:

b1 = mechanicalsoup.Browser(soup_config={'features': 'html'})

Do several things and start later a second one:

b2 = mechanicalsoup.Browser(soup_config={'features': 'html'})

The session for the second browser is taken from:

requests.Session()

And cookies there (in requests library) are stored as a mutable object in Session class. This causes the second browser to start with the cookies (and all other session attributes) from the first one, which is an unexpected and quite weird behavior.

In my use case, I simply had to clear cookies just after creating the second browser instance:

b2.session.cookies.clear()

I think it should be easy to assure a clean session is used for each new browser instance (a natural way could be using context managers).

browser = mechanicalsoup.StatefulBrowser(), then can browser access the original soup class?

Hi guys, MechanicalSoup is a wonderful wrappers of BeautifulSoup and Mechane, and it's support python3, here I've an issue, can MechanicalSoup access the original BeautifulSoup's functions? Cause I don't see a soup object in the output of dir(browser):

browser = mechanicalsoup.StatefulBrowser()
print(dir(browser))

['_StatefulBrowser__current_form', '_StatefulBrowser__current_page', '_StatefulBrowser__current_url', '_StatefulBrowser__debug', '_StatefulBrowser__verbose', 'class', 'delattr', 'dict', 'dir', 'doc', 'eq', 'format', 'ge', 'getattribute', 'gt', 'hash', 'init', 'le', 'lt', 'module', 'ne', 'new', 'reduce', 'reduce_ex', 'repr', 'setattr', 'setitem', 'sizeof', 'str', 'subclasshook', 'weakref', '_build_request', '_prepare_request', 'absolute_url', 'add_soup', 'find_link', 'follow_link', 'get', 'get_current_form', 'get_current_page', 'get_debug', 'get_url', 'launch_browser', 'links', 'list_links', 'new_control', 'open', 'open_relative', 'post', 'request', 'select_custom', 'select_form', 'session', 'set_debug', 'set_verbose', 'soup_config', 'submit', 'submit_selected']

cookiejar file

Is it possible to set a cookiejar file so that the sessions could be saved and used several times?

Can not login to a site

I am trying to login to https://peerfly.com/login.php but after the submit call I am stuck in the same login page. The mechanize version works fine but I want to make it work with MechanicalSoup is there any problem or am I missing something?

This is the code that I am using:

import mechanicalsoup

browser = mechanicalsoup.Browser(soup_config={'features':'html.parser'})
loginPage = browser.get('https://peerfly.com/login.php')
form = loginPage.soup.find_all('form')[0]
form.find('input', {'name':'email'})['value'] = 'myuser'
form.find('input', {'name':'password'})['value'] = 'mypassword'
page = browser.submit(form, loginPage.url)
print page.text

Thank you

Enable codecov.io webhooks

Currently, codecov.io can only add comments to PR's. However, there's a lot more it can do when integrated directly into GitHub using the official app. Can this be enabled in the repository? I'm not sure if it requires an admin or the repository owner.

An example of the additional context elements enabled by this app:

(The screenshot is from my fork where I have the app installed.)

Make a release with statefulBrowser.py

We're having a lot of users troubled by the fact that statefulBrowser is documented but not released. It's been there for a while now. If anyone has trouble with it, please report bugs ASAP.

We should definitely make a release.

@hickford: can you either do the release, or give me permission on pypi? I'm moy there.

Form submit when action is a relative path

http://example.com/test/form.html

<form action="some_relative_path"

browser.submit(my_form)

requests.exceptions.MissingSchema: Invalid URL u'some_relative_path': No schema supplied. Perhaps you meant http://some_relative_path?

Please allow submitting forms fully automatically, resolving the relative action path. This should be resolved to http://example.com/test/some_relative_path

Update PyPI version

There have been a number of useful changes since v0.7.0. I was wondering what would need to be done to have a new tagged release so that the latest version in PyPI can be updated. Thanks!

Uh, can it be as mechanize as use cookiejar?

Like this:

browser = mechanize.Browser()
browser.set_cookiejar(cookiejar)
browser.open('')

how to change the headers?

I have tried a lot of ways...

like as
browser = mechanicalsoup.Browser()
browser.addheaders(...)

but actually, when I send a packet, there isn't header that I added
it's the same result when I look for packets with wireshark there isn't extra header

and also when I check with browser.session.headeres, it's the same result
it bring the basic headers.

Finally, I look up the all the member function in the browser class, I couldn't find it with your source
so is there a way to add a extra header??

it's impossible to get connection major portal where it require some cookies or extra headers.
is there a solution?

AttributeError: 'module' object has no attribute 'StatefulBrowser'

Using the example from the readme I get the above error.

KeyError: 'action' when form has no action

https://github.com/hickford/MechanicalSoup/blob/0b5b7c3fcab73507b625be04152497e73805474d/mechanicalsoup/stateful_browser.py#L109-L112

Elements without closing tags not appearing in the soup

None of the self closing tag elements appear in the soup. I saw a post on SO saying that using 'lxml' as my parser would fix it, but that's not working. I get the error message saying no parser was specified so it's defaulting to lxml on my system, but I still don't see any span, a or meta tags.

multi-value select input not supported for .submit()

.submit() won't work for mult-value <select> input – always only the last selected value is used.

proposed feature: stateful browser

Hi,

I love MechanicalSoup, but I find the code using it too verbose, as the Browser class doesn't remember much hence the caller needs to store a lot in local variables.

I wrote a small piece of code to add some convenience functions to Browser:

https://gitlab.com/chamilotools/chamilotools/blob/master/chamilolib/chamiloBrowser.py

I think most of the code in this class could be included in MechanicalSoup, for example as a StatefulBrowser class, that would manage the fields current_page, current_url and current_form, and let the user write things like

br.follow_link(url_regex="...")
br.follow_link(url_regex="...")
br.select_form("form")
br["foo"] = "bar"
br.submit_selected()

The original browser.py and form.py should not need to be modified.

Would such feature be of interest to MechanicalSoup? If so, I'll clean-up my code and submit a PR.

Thanks,

StatefulBrowser not found

I get the following error AttributeError: 'module' object has no attribute 'StatefulBrowser' when I simply type browser = mechanicalsoup.StatefulBrowser()

allow customization of BeautifulSoup instance

BeautifulSoup is initialized inside add_soup, so there's no way for clients to, for example, specify a different parser.

Some ways this could be done:

accept a dict of kwargs to provide to the constructor. It looks like their init doesn't take *args, so this should cover everything.
accept a factory to delegate instance construction to client code

post form can handle input but can't handle select

if post form has these line:
<select id="file_type" class="select" name="file_type"> <option value="testA">test</option> <option value="testB">test</option> </select>\n <input id="file" class="text" type="file" name="file">

I can use update_form.find("input", {"name" : "file" })['value'] = testfile_path
but can't make
update_form.find("select", {"name" : "file_type" })['value']= "test"
work

example.py seems broken

$ python example.py login password
Traceback (most recent call last):
  File "example.py", line 18, in <module>
    login_form = login_page.soup.select_one('#login form')
TypeError: 'NoneType' object is not callable

I tried adding print(login_page.soup.select_one) before the guilty line and it is indeed None.

Development workflow inconsistency

If we run python setup.py test, then the tests are run in the working tree of the repository. However, if we simply run pytest, then the tests are run against the installed version of the module, which may be different than the one that is currently in the working tree.

Advantages of python setup.py test:

You are guaranteed to be testing the code as it currently exists. It is all too easy to make a breaking change to the working tree, run pytest (forgetting to python setup.py install first), have the tests pass and assume that your change is correct.
The files reported by pytest-cov are much more readable. For example, you'll simply see mechanicalsoup/browser.py instead of .virtual-py3/lib/python3.5/site-packages/MechanicalSoup-0.8.0-py3.5.egg/mechanicalsoup/browser.py.
Dependencies (except for flake8) do not need to be installed manually, vastly simplifying the development instructions... so long as you are okay with a bunch of egg-related cruft that is output before the tests, e.g. for each dependency you get:

Using /repos/MechanicalSoup/.eggs/requests-2.18.4-py3.5.egg
Searching for certifi>=2017.4.17
Best match: certifi 2017.7.27.1
Processing certifi-2017.7.27.1-py3.5.egg

Disadvantages of python setup.py test:

Since test is aliased to pytest, it is not possible to pass arguments to pytest without modifying setup.cfg (that I'm aware of).

I don't know what the standard way of doing this is (I suspect this is why a lot of people use tools like tox to manage their build/dev environments). All I know is that the two methods are currently inconsistent, and that's dangerous. In my opinion, the benefits of python setup.py test make it preferable, but I don't think either option is ideal (or if we could even enforce that one is used if we wanted to).

Submit a multipart encripted form

This library doesn't seem to support forms with enctype='multipart/form-data'. Are there any plans for supporting these?

Nested forms

Error: mechanicalsoup has no attribute StatefulBrowser

Any pointers?

How to scrape in parallel

I have used mechanize before, one issue there was it was totally a serial operation, how do I scrape a list of urls in parallel and then use beautiful soup to process it?

Thanks

Impose "six" version requirement

When I tried using the library in a Django production environment, it did not work with the (presumably) already-present version 1.3.0 of "six". It gave me some "could not import module urllib" error. When I set a version restriction on six in my own application (1.9.0), I could use mechanicalsoup perfectly fine. So perhaps the standard library needs to specify a minimal version of six that works, in order to prevent these situations (e.g. >= 1.9.0)

Missing input field when submitting a form

When trying to submit a form that looks like:

<form action="/index.php" class="form-class" method="post">
 <input name="__csrf_magic" type="hidden" value="sid:foo"/>
<input id="usernamefld" name="usernamefld" type="text"/>
<input id="passwordfld" name="passwordfld" type="password"/>
<button name="login" type="submit"> Login </button>
</form>

My browser posts the following:
__csrf_magic=sis:foo&usernamefld=user&passwordfld=secret&login=
MechanicalSoup does not send the "&login=" part. Because of this I cannot log into the site with MechanicalSoup.

form action without schema'ed ACTION fails with MissingSchema

Used requests (and mechanicalsoup) to .get() a web page that has a form on it. The form tag looks like

<form action="/submit_uri" method="post">

I filled a part of the form, and then .submit()ed the form.

  File mechanicalsoup/browser.py", line 114, in submit
    request = self._prepare_request(form, url, **kwargs)
  File "mechanicalsoup/browser.py", line 109, in _prepare_request
    return self.session.prepare_request(request)
  File "requests/sessions.py", line 394, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "requests/models.py", line 294, in prepare
    self.prepare_url(url, params)
  File "requests/models.py", line 354, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/submit_uri': No schema supplied. Perhaps you meant http:///submit_uri?

Automated test coverage reporting

Using pytest-cov, we can create a .coverage file that can be sent to Coveralls by TravisCI to report on the test coverage. Is this something the maintainers are interested in setting up?

Quick example of what this would entail:
https://ilovesymposia.com/2014/10/15/continuous-integration-in-python-5-report-test-coverage-using-coveralls/

Tag release versions

It would be nice if release versions would be tagged in git, so that people could download/use specific versions from GitHub directly.

Use case: I'm having MechanicalSoup as git submodule in my project and would like to specify to checkout tag v0.2.0 instead of master branch. This would make sure that the package contents are similar to what can be downloaded from PyPI.

Clarify Python Version Support

Problem

It appears as if this project is only supported on Python 3 but there is no clear sign of that in your documentation. Even if it is supported/developed specifically for Python 3, there is no mention of whether or not it will work with Python 2 or can be made to work with Python 2 (by, perhaps, installing certain modules/dependencies).

Steps Taken to Reproduce

Checked README.md and it makes mention of another library being incompatible with Python 3 but doesn't actually confirm/deny that this was built with Python 3 in mind:

I was a fond user of the Mechanize library, but unfortunately it's incompatible with Python 3 and development is inactive.

Can not find any other reference or mention of Python version compatibility in README.md

Response object has no attribute response

Got the above error when I tried to run your example. Any idea why?

Migrate tests to pyunit

Pyunit is a very simple unit tests framework, but could save us a bit of work, e.g.

    try:
        resp = browser.get("http://httpbin.org/nosuchpage")
    except mechanicalsoup.LinkNotFoundError:
        pass
    else:
        assert False

Could be written as stg like

with self.assertRaises(mechanicalsoup.LinkNotFoundError) as context:
    resp = browser.get("http://httpbin.org/nosuchpage")

Extract cookies from session?

Hey,

I was just wondering if there was a way to extract the session cookies from the page/browser to be used in other request calls outside of mechanical soup?

I'm trying to do page.cookies.get_dict() but its returning empty.

Thanks

something wrong

My function(use MechanicalSoup) does not work recently,below is my code:[it works before]

from bs4 import BeautifulSoup
def get_request(url):
    try:
        browser = mechanicalsoup.Browser(soup_config={"features": "lxml"})
        result = browser.get(url)
        code = result.status_code
        content = result.content
        content = content.decode("utf-8")
        title = BeautifulSoup(content, "lxml").title
        if title is not None:
            title_value = title.string
        else:
            title_value = None
    except:
        # 请求次数过多时被目标服务器禁止访问时
        code = 0
        title_value = "页面载入出错,但是这个页面有可能是存在的,只是因为访问过多被暂时拒绝访问"
        content = 'can not get html content this time,may be blocked by the server to request'

        print("可能由于访问过多或url错误(eg.scheme error:http|https)导致当前访问被重置,这个访问将不会得到任何内容")

    return_value = {
        'code': code,
        'title': title_value,
        'content': content}
    # print("访问当前url为:\n\t"+url+"\ntitle如下:")
    # print("\t"+str(return_value['title']))
    return return_value

a=get_request("https://www.baidu.com")
print(a)

Can you help me to find out why MechanicalSoup not work now?

Proxy Support

Are proxies currently supported?

Should we enable gitter.im?

I was thinking about how to communicate with the other developers of this repository, and it seems that something like gitter.im would be suitable. I'd be happy to set it up if you'd like -- it would just take a few clicks. If not, please let me know your preferred method of communication (@moy in particular, since you are currently the most active dev).

What does `StatefulBrowser.submit_selected`'s `btnName` argument do?

There is no documentation of the btnName argument of the StatefulBrowser.submit_selected method, and it simply gets forwarded to the requests.Request constructor in a data dict (the documentation for this argument in the requests module is not very illuminating either).

I have an example form that looks like this:

<form class="standard" method="POST" action="?sn=2e1b94b5">
    <input class="submit" name="action" type="submit" value="View"/>
    <input class="submit" name="action" type="submit" value="Save"/>
    <input type="checkbox" value="on" name="id[788441]"/>
    <input type="checkbox" value="on" name="id[788558]"/>
</form>

Thinking btnName meant "button name" (as in, the value of the submit element to be clicked), I tried the following code:

def SelectCheckboxes(br, names, submit_value):
  br.select_form('form')
  br.get_current_form().check({name: "on" for name in names})
  br.submit_selected(btnName=submit_value)

SelectCheckboxes(br, ['id[788441]'], "View")

The above code submits using the "Save" element instead of "View". Unsurprisingly, the code works correctly with the following change to SelectCheckboxes:

-  br.submit_selected(btnName=submit_value)
+  br.get_current_form().choose_submit(submit_value)
+  br.submit_selected()

But the existence of the choose_submit method further disguises what btnName is actually supposed to be. Any clarification would be greatly appreciated!

Tutorials or documentation?

Are there any tutorials or documentation available for this? Something a bit more robust than the example of logging into GitHub?

No parser was explicitly specified

/usr/local/lib/python3.4/dist-packages/bs4/init.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

BeautifulSoup([your markup])

to this:

BeautifulSoup([your markup], "lxml")

markup_type=markup_type))

Need to use add_soup method or what?