
Comments (11)

bisgaard-itis commented on July 27, 2024

@wvangeit @sanderegg @pcrespov here is how I will approach this. Let me know if you disagree.

After thinking a bit about this, I think the best way to approach this is to limit the clone_study endpoint in the api-server instead of doing it in storage. Here are some thoughts on this:

  • With this approach, other services that communicate with storage (e.g. the dynamic sidecar) will not be affected. This makes the change easier to implement. Of course, the price to pay is that the limit is not hard: people can circumvent it by using the frontend.
  • I would like to do this directly in traefik. However, as far as I can tell from the traefik docs, the only way to limit the number of in-flight requests to a service via traefik is the inFlightReq middleware (https://doc.traefik.io/traefik/middlewares/http/inflightreq/), which does not allow targeting a specific endpoint in the api-server: it limits the total number of in-flight requests to the api-server, which is probably not what we want. That leaves doing it directly in the api-server. The downside is, of course, that the limit will be per replica, i.e. increasing the number of api-server replicas will increase the effective limit. But I feel we can live with that limitation.
  • The best package for rate limiting in a fastapi app (as far as I can tell) is https://github.com/laurents/slowapi, but it also does not seem to limit the number of in-flight requests to a given endpoint; it only rate-limits (e.g. 10 requests/sec).
  • Hence, I am left with handling this manually in the api-server. This should be very easy to do using an asyncio.Semaphore; see the sketch below.
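
Roughly, a minimal sketch of what I have in mind, guarding the endpoint with a module-level semaphore (the route, helper, and limit below are illustrative, not the actual api-server code):

```python
import asyncio

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Illustrative limit: at most 5 concurrent clone operations per replica
MAX_INFLIGHT_CLONES = 5
_clone_slots = asyncio.Semaphore(MAX_INFLIGHT_CLONES)


@app.post("/v0/studies/{study_id}:clone")  # hypothetical route
async def clone_study(study_id: str):
    if _clone_slots.locked():
        # All slots are taken: reject immediately instead of queueing
        raise HTTPException(status_code=429, detail="Too many clone operations in progress")
    async with _clone_slots:
        return await _do_clone(study_id)


async def _do_clone(study_id: str) -> dict:
    # hypothetical stand-in for forwarding the clone to the webserver
    await asyncio.sleep(1)
    return {"cloned": study_id}
```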


sanderegg commented on July 27, 2024

@bisgaard-itis

  • I also think the limiting should be done on the api-server side, for the same reason you gave here (storage is too low level).
  • in traefik you might be able to add rate limits for specific routes. There is also a rate-limit middleware that we already use, I think (hope). You might have one basic one, and then maybe it is possible to add different ones on top for the routes you wish to limit more.
  • what I am wondering is: you wish to rate-limit only by the number of requests per second, but what about limiting by the number of jobs the api-server(s) launch? Because once the storage issue is solved, this will cascade to director-v2, I think.


pcrespov commented on July 27, 2024

@bisgaard-itis

  • I understand that the goal is to put an upper cap on the number of clones/s that the system can handle.
  • We do not know what the right number for the limit is, so I would be conservative and then increase it slowly as we gain more control over the capacity of the backend.
  • So the solutions you propose to limit are:
    • traefik: limit req/s to given entrypoints
    • middleware: same as above but done in the api-server
    • asyncio.Semaphore: same as previous but more granular (e.g. can limit calls to webserver.clone only)
  • Nonetheless, none of the above is distributed, i.e. the limits apply per replica and not per service (a distributed variant is sketched below).
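
For illustration, a distributed variant could keep the in-flight counter in Redis, so the limit applies across all replicas. This is only a sketch assuming a shared Redis is available; the helper names and error handling are made up:

```python
import contextlib

import redis.asyncio as aioredis


class TooManyInFlight(Exception):
    """Raised when all slots for an operation are taken."""


@contextlib.asynccontextmanager
async def inflight_slot(redis: aioredis.Redis, key: str, limit: int):
    # INCR is atomic, so two replicas cannot both grab the last slot
    current = await redis.incr(key)
    try:
        if current > limit:
            raise TooManyInFlight(f"{key}: {current} in flight, limit is {limit}")
        yield
    finally:
        # Always release the slot, even if the operation failed
        await redis.decr(key)
```

An endpoint would wrap its body in `async with inflight_slot(...)` and map `TooManyInFlight` to a 429 response.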


bisgaard-itis commented on July 27, 2024

(quoting @sanderegg's reply above)

What I want to limit is the number of requests to a given endpoint in the api-server being processed at any given time. E.g. I would like to set a limit such that a maximum of 5 clone-study operations can be in progress at any given time; if a request hits that endpoint while 5 requests to it are already in progress, I would return a 429. Concerning the director-v2 point, the idea would be to have a general way of doing this so we can easily limit other endpoints if needed. I will also go ahead and limit the start-study endpoint in the api-server straight away.
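
To keep it general across endpoints, the per-endpoint semaphore could live behind a small dependency factory, roughly like this sketch (the routes and limits are illustrative):

```python
import asyncio

from fastapi import Depends, FastAPI, HTTPException

app = FastAPI()


def inflight_limit(max_inflight: int):
    """Build a dependency that answers 429 once max_inflight
    requests to the endpoint are already being processed."""
    semaphore = asyncio.Semaphore(max_inflight)

    async def _guard():
        if semaphore.locked():
            raise HTTPException(status_code=429, detail="Too many requests in progress")
        async with semaphore:
            yield  # the slot is held while the request is processed

    return _guard


# Illustrative routes: each endpoint gets its own independent limit
@app.post("/v0/studies/{study_id}:clone", dependencies=[Depends(inflight_limit(5))])
async def clone_study(study_id: str): ...


@app.post("/v0/studies/{study_id}:start", dependencies=[Depends(inflight_limit(10))])
async def start_study(study_id: str): ...
```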


bisgaard-itis commented on July 27, 2024
(replying to the "middleware: same as above but done in the api-server" option above)

My understanding is that the only truly distributed solution is to use traefik (https://doc.traefik.io/traefik/middlewares/http/inflightreq/). There we already limit the number of requests per second that can hit the api-server, and we could also limit the number of in-flight requests to the api-server there; that would be distributed. But it doesn't really hit the target of this issue: having control over the number of in-progress requests to a given endpoint.
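
For reference, the traefik side would only be a few lines of dynamic configuration, something like this sketch (the middleware name, amount, and router rule are illustrative):

```yaml
# Sketch of traefik's inFlightReq middleware (file-provider syntax);
# requests beyond the limit are answered by traefik with a 429.
http:
  middlewares:
    api-server-inflight:
      inFlightReq:
        amount: 50  # max simultaneous in-flight requests (illustrative)
  routers:
    api-server:
      rule: "PathPrefix(`/v0`)"  # illustrative routing rule
      service: api-server
      middlewares:
        - api-server-inflight
```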


wvangeit commented on July 27, 2024

I understand the intention, and am not against it per se. But to be honest I kind of doubt it will really solve the problem we have, and possibly make it worse.
As mentioned in our meeting last week, what probably brought the system to its knees was me working on studies (not using any metamodelling/cloning at all) that contained interactive studies with massive amounts of data, which were basically taking up storage slots. The solution here would have been the system limiting these transfers, not the cloning by another user.
So limiting cloning would only artificially solve the problem, probably with limits that are too low for the number of clones that need to happen. Also, not all clones are equal: cloning a study that requires 5 MB to transfer will have a completely different impact than one that clones 20 GB.


bisgaard-itis commented on July 27, 2024

(quoting @wvangeit's comment above)

OK, I see your point. Of course it would be good to target a solution which prevents us from seeing the same issue in the future. So, if not limiting in the api-server, the question is what we should do instead to avoid this happening again. From the analysis we did with @sanderegg last week, "taking storage slots" basically meant holding db connections while uploading/copying data via storage. Sylvain is working on a solution for not holding the db connections while doing the upload/copying, so that should probably fix the issue for now, meaning we might see the issue/bottleneck somewhere else next time.

My understanding (correct me if I am wrong @sanderegg) is that once the "holding of db connections" issue is resolved, uploading a 20GB file or a 5MB file should put the same load on storage, because the actual uploading of data goes directly to S3. What really matters from the point of view of storage is the number of upload/copy operations performed concurrently. Do you agree @sanderegg?

If this is correct, I would say a sound approach to reducing backpressure in storage would be to limit the total number of upload/copy operations that can be in progress at any given time. If we want to make sure these limits also apply to users going through the dynamic sidecar, I guess we will have to perform this limiting directly in storage. I see your point @wvangeit that we could set these limits too low, which would impact user experience, but in a sense I understand that the whole point is that we want control over those limits so we can be conservative. Otherwise I simply wouldn't know how to do it. But my ears are open for other solutions. Let me know.


sanderegg commented on July 27, 2024

Once I get the DB locking in storage under control:

  • 5GB or 20GB should make no difference in terms of database locking
  • it will still make a difference in terms of storage load (although orders of magnitude smaller), as S3 does not really allow in-place copying, or maybe only up to 5GB, which I still need to find out.

But as we discussed:

  • moving from the REST API to a message system for these copies would allow storage to take only as many jobs as it can handle; the api-server would then have to wait, which means it might be overwhelmed with REST calls??
  • we could also design some higher-priority queues, prioritizing the webserver over the api-server (a toy sketch follows below)
  • there is no silver bullet and we will have to experiment.
  • Nevertheless @wvangeit, I prefer a slightly less nice user experience to having the whole deployment broken. And everyone out there sets rate limits to prevent their servers from burning ;). The idea of the limit is that we can guarantee the service keeps running until we fix things and deploy higher limits, versus running around firefighting with no time for long-term fixes.
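
To illustrate the priority idea, a toy, stdlib-only sketch; the priority values, job shapes, and names are all assumptions:

```python
# Storage pulls copy jobs off a priority queue, serving webserver-originated
# jobs before api-server ones, and runs only as many concurrent jobs as
# its semaphore allows.
import asyncio
import itertools

WEBSERVER, API_SERVER = 0, 1  # lower value = served first (assumption)
_order = itertools.count()  # tie-breaker keeps same-priority jobs FIFO


async def submit(queue: asyncio.PriorityQueue, priority: int, job) -> None:
    # jobs are zero-argument coroutine functions, e.g. a bound copy operation
    await queue.put((priority, next(_order), job))


async def storage_worker(queue: asyncio.PriorityQueue, slots: asyncio.Semaphore) -> None:
    while True:
        _priority, _n, job = await queue.get()
        async with slots:  # storage takes only as many jobs as it can handle
            await job()
        queue.task_done()
```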


bisgaard-itis commented on July 27, 2024

(quoting @sanderegg's comment above)

Concerning the api-server getting overwhelmed with REST requests, I would say a simple solution is to use the traefik in-flight middleware (https://doc.traefik.io/traefik/middlewares/http/inflightreq/). That guarantees that the number of requests the api-server is processing at any given time will not exceed a limit. It of course means the user will receive a lot of 429s, but we have retry strategies for that in the client; a sketch follows below.
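
Such a client-side retry could look roughly like this sketch with exponential backoff (the endpoint path and the use of httpx here are illustrative assumptions, and the client is assumed to be created with the api-server's base URL):

```python
import asyncio

import httpx


async def clone_with_retry(client: httpx.AsyncClient, study_id: str, max_attempts: int = 5) -> dict:
    delay = 1.0
    for _attempt in range(max_attempts):
        response = await client.post(f"/v0/studies/{study_id}:clone")
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Honor Retry-After if the server sends it (assumed to be in seconds)
        wait = float(response.headers.get("Retry-After", delay))
        await asyncio.sleep(wait)
        delay *= 2  # exponential backoff between attempts
    raise RuntimeError(f"cloning {study_id} still rate-limited after {max_attempts} attempts")
```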


wvangeit commented on July 27, 2024

Yes. I just mean to make things a bit more dynamic instead of us having to artificially tune limits all over the place. Imo hard limiting is best done at the lowest level, which here is probably storage. Then, when higher-level services are hit with limit-reached responses from lower-level services, they start backing off. (Ideally, higher-level services could query some status before they reach the limit, but that could come at a later stage.)

I'm just saying this from experience with my parallelworker's artificial limit. The limit of 10 workers worked for my studies, and it didn't put any real stress on the system. Then it failed for one user because some completely different studies were running in the background, and for another user it started failing at 3 parallel tasks. Even with things now being file-size independent, I'm afraid we'll end up in an endless cycle of tuning if we don't automate this.


bisgaard-itis commented on July 27, 2024

(quoting @wvangeit's comment above)

Very good point! OK, then I will try to add these absolute limits for individual endpoints directly in storage.

