Comments (14)

KimiNewt commented on August 17, 2024

Unsure, but this might be due to a tshark buffer. It should be tested whether it still happens when you use tshark's "-l" flag.
I am rather busy at the moment, so I will only be able to start looking at these issues at the end of next week. Anyone is of course welcome to create a pull request to solve them.

goretkin commented on August 17, 2024

In the tshark_stub, I just added a sys.stdout.flush() with the same effect, so I don't think the issue is with Wireshark. Thank you for bringing up the "-l" flag. I didn't know about it, and it's definitely something I should use in my case.

My guess is that the problem is in the caller of _get_packet_from_stream. You would want to call _get_packet_from_stream multiple times until there are no full packets left in existing_data. I don't see where this goes wrong, because both call sites are wrapped in while loops. But the behavior strongly points to at most one packet being emitted per attempted read of DEFAULT_BATCH_SIZE bytes.

Thank you for the amazing library. I hope to contribute through some pull requests soon if I can figure any of the issues out.

goretkin commented on August 17, 2024

This commit: goretkin@f655e04 fixes the issue.

The problem is that the function _get_packet_from_stream could emit at most one packet per read. Now it checks whether it already has a packet to emit before it does a blocking read.
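
For illustration, here is a simplified sketch of the two orderings in plain synchronous Python rather than the actual trollius coroutine; _extract_tag_from_data below is a minimal stand-in for pyshark's real tag parser, and read stands for the stream's read call:

```python
DEFAULT_BATCH_SIZE = 4096  # stand-in value; pyshark's real constant is larger

def _extract_tag_from_data(data, tag=b"packet"):
    """Minimal stand-in for pyshark's parser: return the first complete
    <packet>...</packet> element and the remaining bytes, or (None, data)."""
    end_tag = b"</" + tag + b">"
    end = data.find(end_tag)
    if end == -1:
        return None, data
    end += len(end_tag)
    return data[:end], data[end:]

def get_packet_before_fix(read, existing_data):
    # Old order: always read first (this blocks in the real code), then
    # extract at most one packet. If one read delivered N packets,
    # N - 1 of them stay stuck in existing_data until the next read returns.
    existing_data += read(DEFAULT_BATCH_SIZE)
    return _extract_tag_from_data(existing_data)

def get_packet_after_fix(read, existing_data):
    # New order: first emit a packet that is already buffered; only
    # block on a read when there is nothing left to emit.
    packet, existing_data = _extract_tag_from_data(existing_data)
    if packet is not None:
        return packet, existing_data
    existing_data += read(DEFAULT_BATCH_SIZE)
    return None, existing_data  # the caller loops and calls again
```

With the old order, two packets arriving in one chunk means the second is only extracted after another read completes, i.e. after the next burst of traffic; with the new order the caller can drain existing_data without any further reads.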

I could submit a pull request now, but if anyone has ideas on how to test this behavior, I'd like to hear them. I would like a test that exercises the whole pipeline, including reading from a process's stdout, so it seems like making a fake tshark (which I called tshark_stub) is necessary.

EDIT: I implemented an ugly test in this commit: goretkin@5297835 It won't work on Windows because it relies on signals and because it invokes the Python script as if it were an executable (using the #! shebang). We could probably make tshark_stub accept its trigger through its stdin and invoke the script on Windows by explicitly calling python tshark_stub <parameters>.

ecbftw commented on August 17, 2024

Is there any reason not to use "-l"? I am very frustrated with all of the other Python pcap wrappers out there because they are so slow and seem to drop packets. I'm interested in trying out pyshark in my application, but I do need to see packets "shortly" after they arrive (and I also need to avoid dropping any). I noticed the lack of "-l" in the ps output and immediately saw a potential problem there, but wasn't sure how much it would affect me...

Would it make sense to add an "extra_options" parameter to the capture constructor so we can experiment with such things? Obviously adding the wrong options could cause all kinds of problems, but a "use with care" warning might be sufficient to avoid people shooting themselves in the foot.

KimiNewt commented on August 17, 2024

@Adian0 I'm not sure adding an "extra_options" parameter is a good idea, since a lot of parameters change how tshark outputs data and would make pyshark stop working properly. It also exposes the internals too much; there aren't that many options, so I'd rather just add them specifically (as was done with the encryption stuff).

However, I don't currently see any problem with adding "-l" by default.

@goretkin Are you sure reading from an actual process is necessary in the tests? Since the bug isn't really related to that, we could mock its "stdout" with a simple reader.
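
Something like this, perhaps (a hypothetical sketch, reusing the get_packet_after_fix stand-in from the earlier comment; FakeStream and the test name are placeholders, not pyshark's actual test API):

```python
import io

class FakeStream:
    """Stands in for the tshark process's stdout."""
    def __init__(self, payload):
        self._buffer = io.BytesIO(payload)

    def read(self, n):
        return self._buffer.read(n)

def test_two_packets_in_one_read():
    stream = FakeStream(b"<packet>a</packet><packet>b</packet>")
    existing_data = b""
    packets = []
    for _ in range(4):  # a few iterations of the consuming loop
        packet, existing_data = get_packet_after_fix(stream.read, existing_data)
        if packet is not None:
            packets.append(packet)
    # With the pre-fix ordering, only one packet comes out per read,
    # so the second one would still be stuck in existing_data here.
    assert len(packets) == 2
```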

goretkin commented on August 17, 2024

@KimiNewt Reading from the actual process would be necessary to test "end-to-end", but I agree that the test you describe is perfectly useful for this issue.

It just occurred to me that you could go all out with the "end-to-end" idea by running the actual tshark process and sending IP packets from Python through a loopback interface. I'm not sure there would be a way to reliably send two packets close enough in time that they get read together in a single read. But then this test would require a lot of privileges, because it would actually be sniffing network traffic.

Maybe something worth considering for later, not this issue.
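
For what it's worth, the sending side of such a test could be as simple as this sketch (plain UDP over loopback; the port is arbitrary, and the capturing side, tshark on the loopback interface, is not shown and is where the privileges come in):

```python
import socket

def send_two_packets(port=34567):
    # Send two datagrams back to back, to maximize the chance that
    # tshark's output for both lands in a single read on our side.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(b"packet-one", ("127.0.0.1", port))
        sock.sendto(b"packet-two", ("127.0.0.1", port))
    finally:
        sock.close()

if __name__ == "__main__":
    send_two_packets()
```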

ecbftw commented on August 17, 2024

@KimiNewt Ok, makes sense. Yeah, it would be easy to mess up the options. I can also just subclass the capture classes for now to experiment with new options if needed.

ecbftw commented on August 17, 2024

I believe I'm running into the same problem now as @goretkin. Here's my very simple setup:
I run tshark on the command line with very restrictive BPF and display filters. These filters cause tshark to emit only TCP packets with content when talking to a specific HTTP service. I run tshark manually with these filters in one terminal. In another terminal, I run my own pyshark-based script that uses the same filters as well as the "-l" option (roughly as sketched below).
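
Roughly, the pyshark side looks like this (the interface and filters below are placeholders for my real, more restrictive ones, and the "-l" flag went in through a small subclass tweak that isn't shown):

```python
import pyshark

capture = pyshark.LiveCapture(
    interface="eth0",                # placeholder interface
    bpf_filter="tcp port 8080",      # placeholder for my restrictive BPF filter
    display_filter="http",           # placeholder for my display filter
)
# Print a line per packet as soon as pyshark hands it to us.
for packet in capture.sniff_continuously():
    print(packet.sniff_time, packet.highest_layer)
```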

When I hit the HTTP service in my browser, tshark shows 6 packets: 2 HTTP request packets and 4 HTTP response packets. There are two HTTP requests because my browser is sending my desired request as well as a request to /favicon.ico. (Each individual HTTP response consists of two packets because the HTTP response headers are sent separately from the body.) The two requests are sent at virtually the same exact time. The two response packets for a given response are also sent at the same exact time (same TCP timestamp).

However, in my pyshark-based script, I see only 3 packets: one request and two response packets. This is alarming.

Looking into it further, I ran both sniffers continuously and hit refresh in my browser twice. The first time, my pyshark script printed only my desired HTTP request (as just described). However, upon hitting refresh again, pyshark printed the /favicon.ico request and its associated response. So this tells me the data is clearly being captured, but it is getting stuck in a queue somewhere and is only fetched when my second browser refresh happens. Repeated refreshing causes the printed packets to alternate between the two request types. This overall behavior seems consistent with the described bug. HTH.

goretkin commented on August 17, 2024

@Adian0 Will you try making these small changes: goretkin@f655e04#diff-ccc1cf6c7cb2915638d3448781cc61e9

You could pull that branch, but they're also small enough to make by hand. It just moves two lines of code.

ecbftw commented on August 17, 2024

@goretkin Before I saw your latest note, I started experimenting. You're totally correct that DEFAULT_BATCH_SIZE can allow multiple packets to be read at once while the loop doesn't attempt to process all of them right away. In my capture subclass I set DEFAULT_BATCH_SIZE to an absurdly low value (16) and the problem went away, since there's no way for more than one packet's XML to fit in 16 bytes (see the sketch below). Of course that's just a test hack, and we wouldn't want to do I/O in such small chunks.
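
Concretely, the hack was roughly this (subclassing works because DEFAULT_BATCH_SIZE is a class attribute on the capture classes; for diagnosis only):

```python
import pyshark

class TinyBatchCapture(pyshark.LiveCapture):
    # Absurdly small read size: at most one packet's XML fits per read,
    # which masks the one-packet-per-read bug. Not for real use.
    DEFAULT_BATCH_SIZE = 16
```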

Looking at your patch, it seems like it may just shift the problem to later in the cycle. If you keep receiving more than one packet in each read, won't they still queue up? I'm thinking we'd need an inner while loop that doesn't do any additional reads, but instead calls packet_from_xml_packet and _extract_tag_from_data repeatedly until the latter no longer returns a packet. That is, convert "if packet" to a while loop. I'll experiment with that.

goretkin commented on August 17, 2024

The function _get_packet_from_stream is called in a loop that tries to consume its output. If _get_packet_from_stream read a lot of data last time and has multiple packets in existing_data, it will emit one packet each time it is called. Once it no longer has a full packet in existing_data, it will call new_data = yield From(stream.read(self.DEFAULT_BATCH_SIZE)), which "blocks" until some new data arrives; it then appends that data to existing_data and emits a None. The loop doesn't do anything with that None and calls _get_packet_from_stream again, at which point it will emit any packets it has available.
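
Schematically, the consuming loop looks like this (simplified from the trollius coroutines to plain Python; get_packet stands for _get_packet_from_stream):

```python
def consume(get_packet, stream, handle):
    existing_data = b""
    while True:  # in the real code this runs until the capture stops
        packet, existing_data = get_packet(stream, existing_data)
        if packet is None:
            continue  # a read happened but no full packet yet; call again
        handle(packet)
```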

You could also structure the code so that a None never gets emitted, but as far as I can tell, my patch fixes our issue. There is no need for another loop that avoids additional reads, because a read isn't done anyway when there's already a full packet in existing_data.

I think you could have an additional loop, though, if instead of raising packets you yielded them, and if _get_packet_from_stream were implemented as a standard Python generator. In that case, when control passes to _get_packet_from_stream, it picks up where it left off (it's a co-routine). In the current architecture, that's not what happens, I don't think. But I don't understand the trollius way of implementing asyncio.
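
A sketch of what that generator version might look like (not the architecture pyshark actually uses; _extract_tag_from_data is the same stand-in as in the earlier sketch):

```python
def packets_from_stream(read, batch_size=4096):
    existing_data = b""
    while True:
        packet, existing_data = _extract_tag_from_data(existing_data)
        if packet is not None:
            yield packet  # resumes right here on the consumer's next next()
            continue
        new_data = read(batch_size)
        if not new_data:
            return  # stream closed
        existing_data += new_data
```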

ecbftw commented on August 17, 2024

Hmm, you could be right. I'm finding I'm getting quite confused by asyncio, as I'm not familiar with it.

Ultimately, I decided to give up on Python packet capture libraries, since none of them is simultaneously a fit for my needs, stable, and maintained. I spent two hours writing a sniffer in C that prints JSON to stdout; I'll just run it as a subprocess from Python to get the job done. That's a lot less time than I've already spent trying out several Python libraries. Best of luck getting this library stable; I may try it again sometime.

KimiNewt commented on August 17, 2024

@goretkin Please create a pull request with your commits.
Note that this may cause a slight slowdown in packet capture (the get_packet_from_tag function is slow when working with huge inputs), but I was previously working on an optimization, which I'll now continue anyway.

goretkin commented on August 17, 2024

@KimiNewt I think you possibly mean the function _extract_tag_from_data. The only extra calls, compared to before, happen when there isn't a full packet in existing_data; otherwise it is called the same total number of times, since the function has to emit the same number of packets.

I've submitted the pull request.
