Comments (30)

essen avatar essen commented on May 23, 2024 1

Merged. I think we're good now. Please test! A follow-up, #215, is about adding support for SO_REUSEPORT (multiple listening sockets per listener), and that will be the next addition before 2.0.

Closing, thanks everyone!

essen avatar essen commented on May 23, 2024

This would require going the async accept route.

rlipscombe avatar rlipscombe commented on May 23, 2024

"we end up with more than one supervisor for all connections"

Why is this? Were you planning on having NbAcceptors acceptor+supervisor processes? This isn't clear.

rlipscombe avatar rlipscombe commented on May 23, 2024

Also, how does this change impact the ability to implement
#83?

essen avatar essen commented on May 23, 2024

Yes, merge acceptor+supervisor into one process, but keep the ability to have many acceptors; therefore we end up with many acceptor+supervisor processes instead of many acceptors and a single supervisor.

This doesn't make #83 any different. For that we need to close the listening socket, and to close it we need a special process holding it. There's no problem keeping the acceptor+supervisor processes around even if they don't accept anymore. With a bit more thought we could also add commands such as restarting the listening socket.
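
For a concrete picture, here is a rough sketch of such a merged acceptor+supervisor process (purely illustrative, not the actual Ranch code; module and function names are invented):

-module(acceptor_sup_sketch).
-export([start_link/2, init/2]).

%% One merged acceptor+supervisor process: it owns the accept loop and
%% also links to and keeps track of the connection processes it spawns.
start_link(LSocket, ConnFun) ->
    {ok, spawn_link(?MODULE, init, [LSocket, ConnFun])}.

init(LSocket, ConnFun) ->
    process_flag(trap_exit, true),
    loop(LSocket, ConnFun, #{}).

loop(LSocket, ConnFun, Conns0) ->
    %% Forget any connection processes that exited since the last pass.
    Conns = drain_exits(Conns0),
    case gen_tcp:accept(LSocket, 100) of
        {ok, Socket} ->
            Self = self(),
            Pid = spawn_link(fun() ->
                %% Wait until socket ownership has been handed over.
                receive {go, Self} -> ConnFun(Socket) end
            end),
            ok = gen_tcp:controlling_process(Socket, Pid),
            Pid ! {go, Self},
            loop(LSocket, ConnFun, Conns#{Pid => Socket});
        {error, timeout} ->
            loop(LSocket, ConnFun, Conns)
    end.

drain_exits(Conns) ->
    receive
        {'EXIT', Pid, _Reason} -> drain_exits(maps:remove(Pid, Conns))
    after 0 -> Conns
    end.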

rlipscombe avatar rlipscombe commented on May 23, 2024

OK. That's clearer, thanks. Sounds like a good plan.

josevalim avatar josevalim commented on May 23, 2024

@essen I am interested in working on this. Are pull requests accepted? Or should we wait until Ranch 2.0?

This however means that limits need to be per supervisor (or at least to divide limits by number of supervisors or something).

The second is backwards compatible. I can only see it becoming a complication, though, if you can resize the number of acceptors by either starting new ones or stopping existing ones. Is it possible to do so? If so, what would happen then? Would we need to recalculate the connection limit for each acceptor?

The other question is regarding active_connections. If we have multiple supervisors, the simplest way to handle it would be to loop over all supervisors and concatenate their connections. Is there a scenario where this approach would not be enough? Otherwise, Ranch could consider storing the connections in an ETS table, but that adds complexity and maybe some contention in overload scenarios. I guess, though, that if the user really cares about getting the number of connections in an efficient way, they could implement their own connection tracking.

essen avatar essen commented on May 23, 2024

Pull requests are accepted, but don't work too much on it before the details are ironed out. Also note that @juhlig started looking into this and experimenting, so it might be best to synchronize. The difficulty in this task, I think, is in proving that the change is good and that we end up with better latency when accepting connections under various scenarios; he has started experimenting with setups to measure that.

I think regardless of potential backward compatibility of the interface this should be done as part of 2.0 as this is a pretty big change in behavior. That being said, this is pretty much the only change for 2.0 anyway, minus the removal of the socket option. Everything else that was planned for 2.0 is either already done, not necessary anymore, or still a way off (having a test suite for release upgrades).

I would definitely favor a "limit per acceptor/supervisor" strategy as that's easier to reason with and it doesn't prevent anyone from using global limits and dividing them before passing them to Ranch. I don't really want to go down the road of recalculating everything. You define what you want, how many of them you want, and Ranch should just do that.

Active connection counts should just be per acceptor/supervisor and aggregated when it is requested. If it needs to be made more efficient we can always query all acceptor/supervisor processes in parallel and gather the results asynchronously.
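
For a rough illustration of both points (hypothetical helpers, not part of the Ranch API; the aggregation assumes each connection supervisor can answer supervisor:count_children/1):

-module(limits_sketch).
-export([per_sup_limit/2, active_connections/1]).

%% Turn a global connection limit into a per-supervisor limit,
%% rounding up so the sum never falls below the requested global limit.
per_sup_limit(GlobalMax, NumSups) ->
    (GlobalMax + NumSups - 1) div NumSups.

%% Aggregate active connection counts across all supervisors.
active_connections(SupPids) ->
    lists:sum([proplists:get_value(active, supervisor:count_children(Pid))
               || Pid <- SupPids]).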

josevalim avatar josevalim commented on May 23, 2024

Thanks @essen! I will wait then for further news. A benchmark suite to measure the impact of these changes will definitely be helpful.

I am not sure if this can be helpful, but 2 or 3 years ago someone was working on a C++ TCP client to benchmark a Ranch pool written in Elixir. The C++ client is available publicly: http://dbeck.github.io/Passing-Millions-Of-Small-TCP-Messages-in-Elixir/

petrohi avatar petrohi commented on May 23, 2024

I want to point out that the key issue this will solve for us is that, when there is a sudden disconnection of a large number of clients (~100K), Cowboy becomes unresponsive for a significant period of time.

 avatar commented on May 23, 2024

There should be a performance benefit as well.

essen avatar essen commented on May 23, 2024

Good point about the 100Ks, this is a good scenario to experiment with.

rlipscombe avatar rlipscombe commented on May 23, 2024

when there is a sudden disconnection of a large number of clients (~100K) ...

We have a similar problem (except we're using ranch directly, rather than cowboy).

juhlig avatar juhlig commented on May 23, 2024

@petrohi @rlipscombe

... becomes unresponsive for a significant period of time ...

Can you please elaborate? How much time is that, approximately?

essen avatar essen commented on May 23, 2024

When 100K connection processes exit at once, the supervisor has to handle them all before it can continue accepting connections. How much time depends on what environment you run it in.

juhlig avatar juhlig commented on May 23, 2024

Yes, but all the supervisor does when it receives an {'EXIT', ...} message from a connection process is erase that pid from the process dictionary, and send a message to a sleeping acceptor (if there are any). I wouldn't expect that to amount to anything significant, even with 100K messages, even in slow environments. There is no selective receive in place, so the number of messages in the queue does not influence the receive time of any message. Thus my surprise and question ;)
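
For reference, the per-'EXIT' work I mean is roughly this (a simplified sketch, not the actual ranch_conns_sup code, which keeps more state):

-module(conns_sup_sketch).
-export([handle_exit/2]).

%% Forget the connection process that exited and, if an acceptor is
%% sleeping because the connection limit was reached, wake one up.
handle_exit(Pid, State = #{sleepers := Sleepers, count := Count}) ->
    erase(Pid),
    case Sleepers of
        [Acceptor | Rest] ->
            Acceptor ! self(),
            State#{sleepers := Rest, count := Count - 1};
        [] ->
            State#{count := Count - 1}
    end.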

I'll do some experiments on this later today. If it is caused by that single supervisor bottleneck, that should solve itself with this issue since the solution requires multiple supervisors.

essen avatar essen commented on May 23, 2024

100K exits can take a while to process when the system is busy, and even if it takes only 1 second, that's still 1 second too much in many scenarios. Remember that connections get stuck until the supervisor processes all these messages, and that clients will queue up trying to reconnect, leading to more wait time. I'm sure you'll reproduce it easily.

petrohi avatar petrohi commented on May 23, 2024

@juhlig in our setup we have one server handling about 170K connections. When we took a server out of rotation without slowly draining connections, it would not accept new connections for several minutes while visibly using 100% of one core.

josevalim avatar josevalim commented on May 23, 2024

When we implemented the Elixir Registry and tested it on 40 cores we saw similar issues even after partitioning the data. We would register hundreds of thousands of processes and kill all of them suddenly. All of the cores would spike and no new registrations would be accepted for a while.

We solved this by doing the cleanup on the side. All entries were put in an ETS table, and a separate process was responsible for receiving the DOWN messages and cleaning up the table.
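
A minimal sketch of that pattern (invented names, not the actual Registry code):

-module(registry_sketch).
-export([start_link/0, add/2, init/1]).

%% Callers write entries straight into a public ETS table; a separate
%% janitor process owns the monitors and erases entries when 'DOWN'
%% messages arrive, so a burst of exits never blocks registration.
start_link() ->
    Pid = spawn_link(?MODULE, init, [self()]),
    receive {ready, Pid} -> {ok, Pid} end.

init(Parent) ->
    ets:new(registry_sketch, [named_table, public, set]),
    erlang:register(registry_sketch_janitor, self()),
    Parent ! {ready, self()},
    loop().

add(Key, Value) ->
    true = ets:insert(registry_sketch, {Key, Value, self()}),
    registry_sketch_janitor ! {monitor, Key, self()},
    ok.

loop() ->
    receive
        {monitor, Key, Pid} ->
            erlang:monitor(process, Pid),
            put(Pid, Key),    %% remember which key this pid owns
            loop();
        {'DOWN', _Ref, process, Pid, _Reason} ->
            case erase(Pid) of
                undefined -> ok;
                Key -> ets:delete(registry_sketch, Key)
            end,
            loop()
    end.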

Of course, this may not be necessary at all here, but I want to say that the issue with dropping connections may exist even after merging acceptors and supervisors. Having a single supervisor surely makes it worse.

petrohi avatar petrohi commented on May 23, 2024

@juhlig we will be doing more stress tests in the near future. I can do a run where all of our client/server application specifics will be removed--so we'll test pure connection setup/teardown at scale.

essen avatar essen commented on May 23, 2024

My hope is not to find a perfect solution (difficult without dropping supervision and I don't want that) but rather to push the limit further so that the behavior is not as bad as it currently is when things go wrong. Not having a single point of failure is a good step in this direction.

I think the next step after that will be to have some kind of watchdog that ensures the supervisors don't get a sudden flood of messages and, if they do, simply kills and restarts the supervisor being flooded. People can then configure the threshold for when that happens based on how many connections they typically handle, and if everything dies at once the supervisor can recover much quicker. And with many supervisors, chances are that not all of them would get killed by this mechanism, because resources would get freed and the remaining supervisors would have a better chance to catch up.
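
Something along these lines, perhaps (names, the check interval and the kill policy are all made up for illustration):

-module(flood_watchdog_sketch).
-export([start_link/2]).

%% Every second, check the message queue length of each connection
%% supervisor and kill any that is flooded past the threshold, letting
%% its parent supervisor restart it. A real implementation would look
%% the pids up each round rather than fixing them at start.
start_link(SupPids, Threshold) ->
    {ok, spawn_link(fun() -> loop(SupPids, Threshold) end)}.

loop(SupPids, Threshold) ->
    [maybe_kill(Pid, Threshold) || Pid <- SupPids],
    timer:sleep(1000),
    loop(SupPids, Threshold).

maybe_kill(Pid, Threshold) ->
    case erlang:process_info(Pid, message_queue_len) of
        {message_queue_len, Len} when Len > Threshold ->
            exit(Pid, kill);
        _ ->
            ok
    end.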

I'm sure there are other solutions that don't require dropping supervision.

juhlig avatar juhlig commented on May 23, 2024

@petrohi I'm trying to separate the things that are unrelated to Ranch from those that are related and can possibly be addressed by us.

So, can you please tell me something more about your system (hardware, OS, ...), and the VM options of the node running ranch/cowboy? Also, could you tell me how busy your system is doing other things than running ranch/cowboy?

Also, you might want to enable kernel_poll (+K true) for the ranch/cowboy node (unless your application is one of the supposedly rare cases that could be negatively affected by it, or unless you are on Windows, where it is not available). The symptom of 1 CPU being fully used rings a bell; I saw that when I was loading up a Ranch instance with a lot of connections. Each added connection would take a little longer than the ones before, while htop showed that the OS kernel maxed out a single CPU. Enabling kernel_poll solved that issue. I didn't see the same when closing a lot of connections, though, but it might be worth a try for you.

petrohi avatar petrohi commented on May 23, 2024

The test cluster had 6 r4.8xlarge nodes. The test workload was 1M concurrent users, which saturated CPUs to about 40% with application-specific load.

We used OTP 19 with the following vm.args:

# maximum number of processes
+P 16777216
# maximum number of ports
+Q 16777216
# enable kernel poll
+K true
# size of the async thread pool
+A 64

And Debian 8 with the following kernel tweaks:

# maximum TCP receive window
net.core.rmem_max = 33554432
# maximum TCP send window
net.core.wmem_max = 33554432
# others
net.ipv4.tcp_rmem = 4096 16384 33554432
net.ipv4.tcp_wmem = 4096 16384 33554432
net.ipv4.tcp_syncookies = 1
# disable slow-start restart
net.ipv4.tcp_slow_start_after_idle=0
# this gives the kernel more memory for tcp which you need with many (100k+) open socket connections
net.ipv4.tcp_mem = 786432 1048576 26777216
net.ipv4.tcp_max_tw_buckets = 360000
net.core.netdev_max_backlog = 2500
vm.min_free_kbytes = 65536
vm.swappiness = 0
net.ipv4.ip_local_port_range = 1024 65535
net.core.somaxconn = 65535
# make sure the file max is high enough
fs.file-max=1048576

juhlig avatar juhlig commented on May 23, 2024

Ok, thanks. Sounds all quite reasonable to me.

essen avatar essen commented on May 23, 2024

I've started setting things up for testing this with Prometheus+Grafana output. One issue that arises is that there's no Grafana dashboard for netdata for that setup, but it's a problem we have when testing RabbitMQ too so it should be fixed in the next few weeks one way or another.

@juhlig has some code to test things already, which is good, but I would rather use more established code for the calculations, like https://github.com/HdrHistogram/hdr_histogram_erl, and then we'll need to write a metrics exporter on top of that in the test clients. Then run tests and confirm that the change helps in a few different scenarios.
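
For reference, computing percentiles with hdr_histogram_erl looks roughly like this (a sketch based on that library's README; LatenciesUs stands in for a list of measured latencies in microseconds):

%% Record latencies and read back the percentiles of interest.
{ok, H} = hdr_histogram:open(1000000, 3),
[hdr_histogram:record(H, T) || T <- LatenciesUs],
io:format("p75: ~p us, p90: ~p us~n",
          [hdr_histogram:percentile(H, 75.0),
           hdr_histogram:percentile(H, 90.0)]),
hdr_histogram:close(H).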

It should take just a few more weeks before we get definite answers. This is the last outstanding issue remaining before Ranch 2.0.

essen avatar essen commented on May 23, 2024

Since I always get sidetracked, if anyone wants to try #198 against the current version and publish some preliminary results, that could help. I think @juhlig also had a branch with asynchronous accept in his repository.

I'm not sure when I will get to testing, and there's also the big question mark around the new NIF API coming in OTP 22, so I would rather not rush things. But some data would be good to have anyway.

petrohi avatar petrohi commented on May 23, 2024

I was running a test to see the difference between Ranch 1.7 and @juhlig's #198. The workload is 100k "devices", each cycling through opening a connection, keeping it open for 1 second, and then closing it. The workload ramps up from 0 to 100k over 15 minutes and then ramps down over another 15 minutes.

[Figure: connection rate over the test run]

There is a clear breaking point at about 70K connections per second. It turned out that the breaking point is caused by SYN buffer overflow. This causes SYNs to be dropped, which results in a spike in latency when clients back off and resend. The buffer overflows because accept is not picking up connections fast enough. I played with the number of acceptors, from 1x the number of CPUs to 16x, without much difference. My conclusion at this point is that accept has contention in the kernel and gets saturated at about 70K calls per second. There is vague confirmation of this in discussions of the SO_REUSEPORT socket option.

[Figure: 90th percentile connection latency]

Looking at 90th percentile latencies we can see that #198 has a meaningful effect for a short interval before accept contention kicks in.

[Figure: 75th percentile connection latency]

For 75th percentile latencies, there are no visible effects.

My conclusion is that we are unable to significantly exceed the scalability of a single supervisor before hitting accept contention in the kernel.

essen avatar essen commented on May 23, 2024

https://stressgrid.com/blog/100k_cps_with_elixir/ shows good results for this.

The plan is to review/merge the existing PR once I am done with the TLS over TLS work in Gun and then build incrementally from there. Also see #215 for a related task resulting from this benchmark session. Cheers to @petrohi

essen avatar essen commented on May 23, 2024

1 conns_sup per acceptor was merged. Let's add a new option, num_conns_sups, to allow configuring the number of conns_sup processes rather than forcing it to be 1:1. It should default to the same value as num_acceptors. Acceptors will choose their conns_sup using a formula similar to AcceptorId rem NumConnsSups. It's OK if some conns_sup have one more acceptor than others. It's OK if some conns_sup have no acceptor because of misconfiguration (if this causes problems we can restrict it later on).
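
The mapping would look something like this (sketch only):

%% Sketch of the acceptor -> conns_sup mapping described above. With
%% 10 acceptors and 4 conns_sups this yields 3, 3, 2 and 2 acceptors
%% per conns_sup, which is acceptable per the above.
conns_sup_id(AcceptorId, NumConnsSups) ->
    AcceptorId rem NumConnsSups.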

essen avatar essen commented on May 23, 2024

There are now 3 different options to configure the concurrency of listeners: the old num_acceptors, the new num_conns_sups to configure how many connection supervisors will be created, and the experimental num_listen_sockets to configure the number of listening sockets sharing the same port (requires SO_REUSEPORT to be available and enabled).
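
Putting the three options together, a listener could be configured roughly like this (a sketch; echo_protocol and the numbers are placeholders, and num_listen_sockets only works where SO_REUSEPORT is supported):

%% Sketch of a Ranch 2.0 listener using all three concurrency options.
{ok, _} = ranch:start_listener(example_listener,
    ranch_tcp,
    #{socket_opts        => [{port, 5555}],
      num_acceptors      => 100,   %% acceptor processes
      num_conns_sups     => 100,   %% connection supervisors (defaults to num_acceptors)
      num_listen_sockets => 4},    %% listening sockets sharing the port
    echo_protocol, []).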
