geduldig avatar geduldig commented on August 23, 2024

I've used the following approach when I need to know the stream will never die:

while True:
    try:
        r = api.request('statuses/filter', {'track':words})
        for item in r:
            if 'text' in item:
                process_tweet(item)
    except Exception as e:
        print('Must reconnect: %s' % e)

Is this solution practical for you?

If your requirement is to not miss any tweets then things get trickier. In this situation I have used another thread to back fill with the REST API the tweets I missed.
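One caveat with the bare loop above is that it retries immediately, so a persistent failure turns into a tight loop. Below is a minimal sketch of the same idea with exponential backoff. The names `run_forever`, `connect`, `handle` and `max_attempts` are made up for illustration and are not part of TwitterAPI; `connect` stands in for the `api.request('statuses/filter', ...)` call.

```python
import time

def run_forever(connect, handle, max_delay=60.0, sleep=time.sleep, max_attempts=None):
    """Consume a stream, reconnecting with exponential backoff after any failure.

    connect:      callable returning an iterable of items (hypothetical stand-in
                  for the streaming request).
    handle:       callable invoked once per item.
    max_attempts: optional cap on connection attempts, mainly useful in tests.
    """
    delay = 1.0
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            for item in connect():
                handle(item)
                delay = 1.0  # data is flowing again, so reset the backoff
        except Exception as e:
            print('Must reconnect: %s' % e)
        sleep(delay)                       # wait before the next attempt
        delay = min(delay * 2, max_delay)  # double the wait, up to a ceiling
    return attempts
```

Leaving `max_attempts` as `None` gives the "never die" behaviour; the starting delay and the ceiling are arbitrary choices here.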

from twitterapi.

reidpr avatar reidpr commented on August 23, 2024

That general solution is practical, yes; missing some tweets is fine for us.

I think my one concern is that ValueError is very general. While I don't mind restarting on bad data in the tweet stream, there are a lot of other ways my and others' code can generate ValueError for which I'm less comfortable restarting (though maybe this is a paranoid attitude). How would you feel about wrapping that ValueError in e.g. a TweetDecodingError?


geduldig avatar geduldig commented on August 23, 2024

That sounds like a good idea. I'll look into adding an exception class to TwitterAPI. In addition to ValueError, I think the new class may also need to wrap requests.exceptions.* errors.


geduldig avatar geduldig commented on August 23, 2024

Looking at this some more, I see that in this line

yield json.loads(item.decode('utf-8'))

decode can raise a UnicodeDecodeError (a subclass of ValueError), and loads can raise a ValueError. As you said, you probably experienced the latter from a malformed JSON string. With either type of exception I think it is best to silence the error and keep processing the stream. I am considering changing the __iter__ method of _StreamingIterable as follows:

def __iter__(self):
    """Return a tweet status as a JSON object."""
    for item in self.results:
        if item:
            try:
                yield json.loads(item.decode('utf-8'))
            except ValueError:
                continue

I don't see the need (yet) for a custom exception class. Any thoughts?
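For anyone wondering why a single except ValueError is enough for both failure modes mentioned above, here is a small standalone check. `try_parse` is a hypothetical helper written for this illustration, not a TwitterAPI function.

```python
import json

def try_parse(raw):
    """Return the decoded JSON object, or None if the line is unusable."""
    try:
        return json.loads(raw.decode('utf-8'))
    except ValueError as e:
        # Catches UnicodeDecodeError (invalid bytes) and, on Python 3.5+,
        # json.JSONDecodeError (malformed JSON); both subclass ValueError.
        print('skipped %s: %s' % (type(e).__name__, e))
        return None

print(try_parse(b'{"text": "hello"}'))      # well-formed line
print(try_parse(b'{"text": "cut off'))      # truncated JSON
print(try_parse(b'\xff\xfe{"text": "x"}'))  # bytes that are not valid UTF-8
```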


reidpr avatar reidpr commented on August 23, 2024

I'd support that.

One thing to consider: Suppose that the stream contains an excessive number of errors. As implemented above, this fact would be hidden. For example, if 20% of lines were corrupt, that seems like a problem but would be hard to notice. I think the question then would be: is this situation likely enough to care about?

One could count the errors and raise an exception after N ValueErrors. However, that makes the logic and API (if N is configurable) more complex, particularly since N should probably be related to the total number of lines received.

Currently the problem appears rare — I have not encountered it again since reporting the issue, and we try to keep listening continuously to the 1% sample.

Bottom line for me, I'm comfortable simply suppressing the errors, but IMO it's worth understanding the consequences of that.
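To make the "count the errors" idea above concrete, here is a rough sketch that ties the threshold to the number of lines received instead of a fixed N. Everything here (`parse_lines`, `TooManyDecodeErrors`, `max_error_rate`, `min_lines`) is hypothetical and only meant to show the shape of the logic, not something TwitterAPI provides.

```python
import json

class TooManyDecodeErrors(Exception):
    """Hypothetical exception for this sketch; not part of TwitterAPI."""

def parse_lines(lines, max_error_rate=0.2, min_lines=100):
    """Yield decoded JSON objects, skipping bad lines until they become too common."""
    total = 0
    errors = 0
    for raw in lines:
        if not raw:
            continue  # blank keep-alive lines are not counted
        total += 1
        try:
            item = json.loads(raw.decode('utf-8'))
        except ValueError:
            errors += 1
            # Only judge the rate once enough lines have been seen.
            if total >= min_lines and errors / float(total) > max_error_rate:
                raise TooManyDecodeErrors(
                    '%d of %d lines were corrupt' % (errors, total))
            continue
        yield item
```

This counts over the lifetime of the connection; a sliding window would react faster to a sudden burst, at the cost of more state.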


geduldig avatar geduldig commented on August 23, 2024

These errors are rare but do happen. And it is conceivable that one day Twitter will malfunction (which it does from time to time) in a way that produces an inordinate amount of improperly formatted JSON. I myself am curious what this invalid JSON looks like (simply cut-off text, unescaped quotes, etc.). But I think for these rare cases I prefer the user to modify the code rather than me adding clutter.


geduldig avatar geduldig commented on August 23, 2024

In the latest version, 2.2.7, I have silenced ValueError. This version also processes streaming endpoints more efficiently, so you shouldn't see incomplete read errors. But, if you do start seeing these errors let me know and I will handle these exceptions (ProtocolError and ChunkedEncodingError). These exceptions, if you get them, would require the streaming connection to be re-created.


reidpr avatar reidpr commented on August 23, 2024

Thank you!

With 2.2.7, I'm running into some odd behavior. Occasionally — the last one happened this morning after collecting around 25 million tweets on the 1% sample, and I've seen it ~3 times since installing 2.2.7 — my collector program will lock up hard and require kill -9 to dispose of.

I now have my collector running under strace, which may give some clues if/when it happens again.

Is this something for a new issue, or would you like me to continue posting here? Or do something else?


geduldig avatar geduldig commented on August 23, 2024

You can keep posting here. I'll start my own collector and see what I can find.

You might want to upgrade to 2.2.8. It fixes OAuth2, so it will have no effect on what you are doing, but at least we will be looking at the same code.

Thanks, Jonas


geduldig avatar geduldig commented on August 23, 2024

I pushed a new branch, fault_tolerant_stream, which I am currently testing. It has logging and a new error class. You are welcome to test it. I also added examples/sample_freq.py, which prints tweet frequency from the sample stream every 5 seconds. It shows how to use logging and the new error class TwitterConnectError. I'm still trying to work out which exceptions to ignore and which exceptions would require the client to reconnect. These errors happen so infrequently that it will take some time to figure out.


reidpr avatar reidpr commented on August 23, 2024

Thanks Jonas. I will take a look.

So far no more issues with 2.2.7; it's been 6 days or so. You are correct that the low frequency of errors makes this more difficult.

I wonder if there is some way to inject problems? There must be an error-injecting HTTP proxy for this type of thing.


reidpr avatar reidpr commented on August 23, 2024

Well ask and ye shall receive, I guess. My collector froze again about an hour ago; same symptoms, needed kill -9 to shut it down. Here are the last 20 lines of strace -t -x -s 8:

11:35:55 poll([{fd=5, events=POLLIN}], 1, 90000) = 1 ([{fd=5, revents=POLLIN}])
11:35:55 read(5, "\x17\x03\x03\x00\x1a", 5) = 5
11:35:55 read(5, "\xcb\x17\xe6\xc4\xe1\x25\x31\x01"..., 26) = 26
11:35:55 poll([{fd=5, events=POLLIN}], 1, 90000) = 1 ([{fd=5, revents=POLLIN}])
11:35:55 read(5, "\x17\x03\x03\x00\x1c", 5) = 5
11:35:55 read(5, "\xcb\x17\xe6\xc4\xe1\x25\x31\x02"..., 28) = 28
11:35:55 poll([{fd=5, events=POLLIN}], 1, 90000) = 1 ([{fd=5, revents=POLLIN}])
11:35:55 read(5, "\x17\x03\x03\x00\xbb", 5) = 5
11:35:55 read(5, "\xcb\x17\xe6\xc4\xe1\x25\x31\x03"..., 187) = 187
11:35:55 poll([{fd=5, events=POLLIN}], 1, 90000) = 1 ([{fd=5, revents=POLLIN}])
11:35:55 read(5, "\x17\x03\x03\x00\x1a", 5) = 5
11:35:55 read(5, "\xcb\x17\xe6\xc4\xe1\x25\x31\x04"..., 26) = 26
11:35:55 poll([{fd=5, events=POLLIN}], 1, 90000) = 1 ([{fd=5, revents=POLLIN}])
11:35:55 read(5, "\x17\x03\x03\x00\x1b", 5) = 5
11:35:55 read(5, "\xcb\x17\xe6\xc4\xe1\x25\x31\x05"..., 27) = 27
11:35:55 poll([{fd=5, events=POLLIN}], 1, 90000) = 1 ([{fd=5, revents=POLLIN}])
11:35:55 read(5, "\x17\x03\x03\x00\x1a", 5) = 5
11:35:55 read(5, "\xcb\x17\xe6\xc4\xe1\x25\x31\x06"..., 26) = 26
11:35:56 close(5)                       = 0

Calling close() seems wrong, but it's unclear to me whether that's a cause or symptom.

The log also seems to be peppered with EAGAIN throughout, e.g.:

11:35:55 poll([{fd=5, events=POLLIN}], 1, 90000) = 1 ([{fd=5, revents=POLLIN}])
11:35:55 read(5, "\x17\x03\x03\x07\xe2", 5) = 5
11:35:55 read(5, "\xcb\x17\xe6\xc4\xe1\x25\x30\xdf"..., 2018) = 1443
11:35:55 read(5, 0x229125b, 575)        = -1 EAGAIN (Resource temporarily unavailable)
11:35:55 poll([{fd=5, events=POLLIN}], 1, 90000) = 1 ([{fd=5, revents=POLLIN}])
11:35:55 read(5, "\x07\xd9\x8f\x43\x37\x28\xd2\xac"..., 575) = 575

I don't think that's a problem, though, since it appears everywhere and not just before a crash.

I have 5.7GB more of that, if that would be helpful.

This is with 2.2.7 and requests 2.4.3. I'll now try 2.2.8 and requests 2.5.0.


geduldig avatar geduldig commented on August 23, 2024

My collector also froze around the same time as yours. It occurred after a ValueError was caught during json.loads(). I am wondering if this is another case where the user should reconnect.


reidpr avatar reidpr commented on August 23, 2024

I think you're right; a reconnect is appropriate. Just need to get an exception to the Python level to trigger that.

Here's another strace, with 2.2.8 and requests 2.5.0. The times are MST today. Note you can see the three ^C that I typed into the terminal. I don't recall if I detached strace before kill, but I'll be sure to leave strace running until the bitter end next time.

12:15:44 read(5, "\xa7\x6f\x17\xbf\x26\x7f\x5b\xb4"..., 26) = 26
12:15:44 poll([{fd=5, events=POLLIN}], 1, 90000) = 1 ([{fd=5, revents=POLLIN}])
12:15:45 read(5, "\x17\x03\x03\x00\x1b", 5) = 5
12:15:45 read(5, "\xa7\x6f\x17\xbf\x26\x7f\x5b\xb5"..., 27) = 27
12:15:45 poll([{fd=5, events=POLLIN}], 1, 90000) = 1 ([{fd=5, revents=POLLIN}])
12:15:45 read(5, "\x17\x03\x03\x00\x1a", 5) = 5
12:15:45 read(5, "\xa7\x6f\x17\xbf\x26\x7f\x5b\xb6"..., 26) = 26
12:15:45 close(5)                       = 0
14:04:00 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
14:04:00 rt_sigreturn()                 = 4
14:04:00 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
14:04:00 rt_sigreturn()                 = 0
14:04:00 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
14:04:00 rt_sigreturn()                 = 66
Process 32476 detached


reidpr avatar reidpr commented on August 23, 2024

One last data point. Here's the strace of the kill and kill -9.

10:54:30 poll([{fd=4, events=POLLIN}], 1, 90000) = 1 ([{fd=4, revents=POLLIN}])
10:54:30 read(4, "\x17\x03\x03\x00\x1b", 5) = 5
10:54:30 read(4, "\x3e\xe6\xca\x14\x81\x0d\xb3\x5b"..., 27) = 27
10:54:30 poll([{fd=4, events=POLLIN}], 1, 90000) = 1 ([{fd=4, revents=POLLIN}])
10:54:30 read(4, "\x17\x03\x03\x00\x1a", 5) = 5
10:54:30 read(4, "\x3e\xe6\xca\x14\x81\x0d\xb3\x5c"..., 26) = 26
10:54:30 close(4)                       = 0
16:01:34 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=2105, si_uid=1001} ---
16:01:34 rt_sigreturn()                 = 0
16:02:06 +++ killed by SIGKILL +++


geduldig avatar geduldig commented on August 23, 2024

I'm not sure what to do with your strace outputs. What do they mean?

Do you know what line of code last executed before the crash? Or if it was an infinite loop rather than a crash?
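One possible way to answer that question, assuming Python 3.3+ on a POSIX system, is the standard library's faulthandler module, which can dump the Python traceback of every thread when a chosen signal arrives. This is only a sketch: `install_traceback_dump` is a made-up helper name, and whether the dump pinpoints this particular hang is untested.

```python
import faulthandler
import signal
import sys

def install_traceback_dump(signum=signal.SIGUSR1, out=sys.stderr):
    """On `signum`, write every thread's Python traceback to `out`.

    faulthandler installs its handler at the C level, so it should still
    fire when Python-level signal handlers are no longer being run.
    `out` must stay open and must have a real file descriptor.
    """
    faulthandler.register(signum, file=out, all_threads=True)

# Usage (assumed workflow): call install_traceback_dump() once at startup,
# then, when the collector appears stuck, run from a shell:
#     kill -USR1 <pid>
```

faulthandler.register is not available on Windows, so this only applies to Unix-like hosts.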


reidpr avatar reidpr commented on August 23, 2024

Well, it's a little fuzzy because strace works at a much lower level than the Python code. I don't know where in the Python code it stops. I'm not sure whether it's a hang or an infinite loop.

I think in the strace output we can see the following:

  1. The poll() and read() system calls are the standard read loop. Somewhere in Python, it asks, "is there any data", the reply is "yes", and then it reads some data.
  2. At some point, this loop stops and the socket file descriptor is closed.
  3. After that, signals don't reach the Python level any more. I think they might be blocked entirely, because SIGTERM handlers are unusual.
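As a side note on reading those traces: each 5-byte read looks like a TLS record header (1 byte content type, 2 bytes version, 2 bytes length), and the length it announces matches the size of the read that follows. A small sketch to decode one; `parse_tls_record_header` is a hypothetical helper written for this illustration.

```python
import struct

# Content types defined for the TLS record layer.
CONTENT_TYPES = {
    20: 'change_cipher_spec',
    21: 'alert',
    22: 'handshake',
    23: 'application_data',
}

def parse_tls_record_header(header):
    """Split a 5-byte TLS record header into (content type, version, length)."""
    content_type, version, length = struct.unpack('!BHH', header)
    return CONTENT_TYPES.get(content_type, 'unknown'), version, length

# First header from the trace above; 771 is 0x0303, the TLS 1.2 record version.
print(parse_tls_record_header(b'\x17\x03\x03\x00\x1a'))
# ('application_data', 771, 26)
```

So the traces show ordinary encrypted application data arriving right up to the close(), with no TLS alert record among the headers visible in these excerpts.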

One guess is that it's a race condition: The socket is being closed (maybe in a thread or because the connection was broken?), but something else doesn't realize this, makes a blocking read call, and then waits forever.

Another guess is that the ValueError is caught, and then whatever deals with that (at a brief glance, the continue on line 229 of TwitterAPI.py?; this is as of commit 2720ae3, currently the most recent on master) causes both the close() call and the hang. Assuming my guess about where in the code this occurs is correct, then I think this would be further evidence for propagating the exception back to the caller so they can reconnect. I'd be interested in knowing whether it was decode() or loads() that failed.

I can make this modification and try it, if you like.

The key difference is, 2.2.6 occasionally throws ValueError, but 2.2.7 locks up with the above behavior, I suspect in the same circumstances.

HTH,
Reid


geduldig avatar geduldig commented on August 23, 2024

From what you describe and from what I have also experienced, it does point to ValueError. I'm pretty sure ValueError is thrown by loads(). The docs say this would happen when receiving malformed JSON. If this is the case, I was hoping that that portion of the stream could be skipped or, if there was a connection issue, that an exception would be caught by _iter_stream(). A likely scenario is that a stream disturbance corrupts the JSON, which then causes loads() to throw a ValueError. And if that's the case, as you have concluded, the code may as well throw an exception (TwitterConnectionError) when there is a ValueError. I will add this to the fault_tolerant_stream branch tomorrow and continue testing.
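Roughly, the change described above would look something like the sketch below. It is not the code on the branch: `StreamDecodeError` is a hypothetical stand-in for the library's connection error class, and `iter_items` stands in for the iterator method.

```python
import json

class StreamDecodeError(Exception):
    """Hypothetical stand-in for the library's connection error class."""

def iter_items(lines):
    """Yield decoded items; turn undecodable data into a reconnect signal."""
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        try:
            item = json.loads(raw.decode('utf-8'))
        except ValueError as e:
            # Treat corrupt data as a sign the connection is unhealthy, so
            # the caller can drop this stream and open a new one.
            raise StreamDecodeError('undecodable stream data: %s' % e)
        yield item
```

The caller then needs only one except clause for "reconnect now", whether the trigger was a network fault or corrupt data.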

Onward,
Jonas


geduldig avatar geduldig commented on August 23, 2024

With the latest version (2.3) I have been streaming the sample stream for over a week without having to manually intervene. In addition to TwitterConnectionError, I added TwitterRequestError. By catching both these exceptions you can handle timeouts, stalls, and interrupted connections. I included my test code for a fault tolerant stream in examples/sample_freq.py. It demonstrates how to handle these exceptions, when to retry a request and when to give up.

There is also this doc which contains a code example that is a little more concise.

Note: whenever a stream disconnection occurs you are likely to miss tweets. During stalls you will miss 90 seconds worth of tweets. In cases where missing a small percentage of tweets is not acceptable, you should start a new thread with a search/tweets request to backfill a gap in the stream.
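The retry-or-give-up decision mentioned above can be sketched as follows. To keep the snippet runnable without the library, `RequestError` and `ConnectionLost` are stand-ins for TwitterRequestError and TwitterConnectionError; the `status_code` attribute and the 500 cutoff are assumptions for this illustration, so check examples/sample_freq.py for the real handling.

```python
import time

class RequestError(Exception):
    """Stand-in for TwitterRequestError; carries an HTTP status code."""
    def __init__(self, status_code):
        Exception.__init__(self, 'HTTP %d' % status_code)
        self.status_code = status_code

class ConnectionLost(Exception):
    """Stand-in for TwitterConnectionError."""

def stream_with_retry(open_stream, handle, pause=time.sleep):
    """Consume streams from open_stream() until an unrecoverable error occurs."""
    while True:
        try:
            for item in open_stream():
                handle(item)
        except RequestError as e:
            if e.status_code < 500:
                raise  # client-side problem (bad auth, bad parameters): give up
            # 5xx means server-side trouble, so another attempt is worthwhile
        except ConnectionLost:
            pass  # timeout, stall or dropped connection: reconnect
        pause(1)  # brief wait before reconnecting
```

Any backfill thread would be started from the two retry branches, since those are the points where a gap in the stream begins.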


reidpr avatar reidpr commented on August 23, 2024

Jonas, thank you very much for your hard work on this issue and others. I will try it out and let you know if we encounter any issues.

At a higher level, we are using TwitterAPI to support some research papers. Thus, we would like to mention you in the acknowledgements sections. Do you want this, and if so, what name would you like to be acknowledged under? Feel free to e-mail me privately at [email protected] if you prefer.
