A lot of people logged on to pangeo.pydata.org after my talk at JupyterCon. The cluste

pangeo.pydata.org did not scale down about helm-chart HOT 17 CLOSED

pangeo-data commented on August 12, 2024

pangeo.pydata.org did not scale down

from helm-chart.

Comments (17)

mrocklin commented on August 12, 2024 2

We should consider setting a lower maximum size until we have things settled.

On Sat, Aug 25, 2018 at 10:42 PM, Ryan Abernathey wrote:

> A lot of people logged on to pangeo.pydata.org after Jupyterhub. The cluster size went up to ~450. But it never scaled down:
>
> [image: image] <https://user-images.githubusercontent.com/1197350/44624280-08ce8080-a8b8-11e8-984c-1bb891f6458c.png>
>
> I have brought it back down manually now. But I just wanted to document this so we can figure out how to avoid it in the future. This cost a lot of credits!

minrk commented on August 12, 2024 2

I can't overstate how cool the new scheduler is. In our first 24 hours with it, it successfully scaled down from 5 to 2 nodes. We shipped 0.7 last week, and I'm already more excited about getting our 0.8 release ready!

The manual solution that's been the only feasible scale-down is to cordon nodes that you don't need anymore and wait for them to drain and be reclaimed by the cluster autoscaler. They may still need manual draining to get pods other than users off of them that prevent scale-down (e.g. a miscellaneous kube-dns pod can show up).

We have work to do on autoscaling, especially documenting all the strategies and caveats and helpers.

consideRatio commented on August 12, 2024 1

@jhamman we are soon releasing version 0.8 of the z2jh helm chart, and I'm currently writing documentation about it.

guillaumeeb commented on August 12, 2024 1

I propose closing this issue, since the scale-down problem seems to have been identified within the z2jh or GKE community, and opening an "Update to z2jh 0.8 helm chart" issue instead. It seems, as @consideRatio informed us, that the release is imminent (jupyterhub/zero-to-jupyterhub-k8s#1054).

rabernat commented on August 12, 2024

For those who are interested, here is our daily cloud bill, split into compute and storage:

mrocklin commented on August 12, 2024

Thank you for putting this together @rabernat

On Mon, Aug 27, 2018 at 11:26 AM, Ryan Abernathey wrote:

> For those who are interested, here is our daily cloud bill, split into compute and storage:
>
> [image: image] <https://user-images.githubusercontent.com/1197350/44668655-ff463500-a9eb-11e8-8adc-c222fa235afc.png>

rabernat commented on August 12, 2024

I didn't do much other than click a few buttons on the Google Cloud console. For some reason, I think only I am authorized to see the full details of our billing.

jhamman commented on August 12, 2024

@yuvipanda was recently mentioning some ongoing work within JupyterHub to rework the Kubernetes scaling protocols to make scale-down more efficient. Perhaps he can point us to that work so we can follow along?

yuvipanda commented on August 12, 2024

@consideRatio is the person doing most of that work, @jhamman. I think it got merged very recently; he should be able to help.

consideRatio commented on August 12, 2024

@jhamman I just wrote a bit about the deployment @minrk did today on mybinder.org. He enabled the freshly merged foundational building block in a series of improvements to the scheduling and autoscaling of the zero-to-jupyterhub-k8s chart.

See pangeo-data/pangeo#322 (comment)

jhamman commented on August 12, 2024

Thanks @yuvipanda / @consideRatio / @minrk. This all seems quite promising. If we wanted to try this out in the near term, what would be the best way for us to do that? @minrk - can you point us to the config you are using for mybinder.org?

minrk commented on August 12, 2024

mybinder.org's deployment config is at https://github.com/jupyterhub/mybinder.org-deploy, the relevant bit here:

scheduling:
  userScheduler:
    enabled: true
    replicas: 2

This feature is only in the 0.8 dev versions of the chart at the moment. If you are upgrading from 0.6, make sure to test out a deploy/upgrade. As it stands right now, 0.6->0.7 chart upgrades seem to require relaunching users, so performing the upgrade would be a significant disruption.

jhamman commented on August 12, 2024

We now depend on version 0.8 of the z2jh chart. Do we want to add the scheduling optimization bit to the pangeo chart so it's on by default?
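
One way this could look in a parent chart's values.yaml is sketched below. It assumes the z2jh chart is pulled in as a Helm dependency named `jupyterhub`, so that its values are nested under that key. The nesting is an assumption about the chart layout; the `scheduling.userScheduler` keys are the ones from the mybinder.org snippet above.

```yaml
# values.yaml of the parent chart (sketch; assumes a dependency named "jupyterhub")
jupyterhub:
  scheduling:
    userScheduler:
      # Packs user pods tightly onto nodes so that other nodes can empty out
      enabled: true
      replicas: 2
```

A deployment with a fixed number of nodes could still override `enabled` to `false` in its own values file.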

guillaumeeb commented on August 12, 2024

@jhamman I guess so. @minrk, why is it not included in the jupyterhub chart by default?

minrk commented on August 12, 2024

We made it opt-in as a new, somewhat experimental feature that increases the resource requirements a bit. That said, we've been using it on mybinder.org for several months to great effect. I think we will probably switch it to on-by-default in the next major chart release.

consideRatio commented on August 12, 2024

@jhamman I think it is quite complex to decide on good default values. It depends on whether you are using an autoscaling cluster, and on whether you are using GKE or another cluster with particular cluster autoscaler settings.

Some of the complexities

About not having the user scheduler on by default:
It packs the user pods tightly onto nodes. That makes sense if you use autoscaling, since emptied nodes can then be removed, but not if you have a fixed number of nodes. So the default is not critical, but it is also not obvious.

About podPriority / userPlaceholder pods:
These changes also only make sense for autoscaling clusters, and whether autoscaling is configured is out of scope for the helm chart. The cluster autoscaler's "pod priority cutoff" setting, which sets the minimum priority a pending pod needs in order to trigger a scale-up, is not fixed. Some clusters may need to adjust their podPriority settings to match it.
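
A minimal sketch of how these options might be set for an autoscaling cluster follows. The key names are assumed to match the z2jh 0.8 chart's `scheduling` section; check them against the chart's values.yaml before use. The replica count and priority value are illustrative assumptions, not recommendations.

```yaml
scheduling:
  podPriority:
    # Lets real user pods preempt the lower-priority placeholder pods
    enabled: true
    # Assumed value: placeholders can only trigger a scale-up if this
    # priority reaches the cluster autoscaler's pod priority cutoff
    userPlaceholderPriority: -10
  userPlaceholder:
    enabled: true
    # Illustrative: number of placeholder pods, i.e. spare user slots
    replicas: 4
```

On a cluster with a different cutoff, `userPlaceholderPriority` would be the value to adjust.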

jhamman commented on August 12, 2024

I've opened #87 with what I see as sensible defaults for typical pangeo applications. I think we can assume basically all pangeo applications will require autoscaling clusters.
