Comments (15)

asyedham commented on September 7, 2024

After fixing the boot order for the director style as in #161, I was able to deploy OpenStack successfully.

from badfish.

bengland2 commented on September 7, 2024

Just FYI, there appear to be other machine types that have the same problem. I have no objection to merging PR160, but it would be nice if the team fixed the other machine types as well. @QuantumPosix ?

QuantumPosix commented on September 7, 2024

Just because another tool expects something different based on how it was built doesn't mean it works the best for all use cases. All of the interface order is correct as it is in the repo and current for all the machine types listed. This same boot order has been in use since the machines were first installed. The boot order that is dropped in is a template for all the devices listed. You are able to change this locally once badfish is installed to reflect what it is you need for your environment. You can always view other ways to get the results you are requesting @ https://github.com/redhat-performance/badfish#host-type-overrides

bengland2 commented on September 7, 2024

@QuantumPosix, I disagree. IMHO this ought to be standardized and the same across all nodes, not something that each user has to discover for themselves. The purpose of all this automation is to prevent people from spending time on this and have them spend their time on performance work, not deployment work. Debugging boot orders is very time-consuming and hence expensive for the team. IMHO, as Alias lab manager you have a unique insight into how the hardware is hooked up (e.g. network interfaces), and you could work with others such as the jetpack and jetski teams to ensure that the group's tools, especially badfish, are configured to work smoothly in the Alias environment by default. I would appreciate any assistance you could provide.

sadsfae commented on September 7, 2024

@bengland2

The key/value pairs in config/idrac_interfaces.yml have defaults which apply to the majority of use cases for people using the tool for that specific model; for anything else you can provide your own overrides, as @QuantumPosix points out.

Look at L35 and L39 here for examples of per-rack and per-host overrides.

https://github.com/redhat-performance/badfish/blob/master/config/idrac_interfaces.yml#L35

Make sure you're pulling the latest container image also via podman pull quay.io/quads/badfish (we have moved to quay)

https://github.com/redhat-performance/badfish#usage-via-podman

You can use the -v option to map your own entries or overrides as needed to address differences in your environment or needs outside of the default entries.

The default is a default, it's suitable for the majority of people using the tool and there can only be one.
If the majority usage changes, this will be reflected in the default changing - my understanding is what you're doing is not the standard order most people use who use badfish against r740.

If you're using any of the other methods (setuptools, pip, running straight from the repo via Python3) you can just edit the file as you like - you might add config/idrac_interfaces.yml to .gitignore in that case.

You have had everything you need for quite a long time now to use your own overrides or an entirely different idrac_interfaces.yml as you need.
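
A minimal sketch of how a default with per-rack and per-host overrides could be resolved. The key layout `<host_type>_<rack>_<host>_<model>_interfaces` and the `resolve_boot_order` helper are assumptions for illustration only; the real key names in config/idrac_interfaces.yml may differ. Only the director_740xd_interfaces value is taken from this thread, and the e23 rack override is hypothetical.

```python
# Default entry as quoted later in this thread; everything else is hypothetical.
DEFAULTS = {
    "director_740xd_interfaces": (
        "NIC.Integrated.1-1-1,NIC.Integrated.1-2-1,NIC.Slot.7-2-1,"
        "NIC.Slot.7-1-1,NIC.Integrated.1-3-1,HardDisk.List.1-1"
    ),
}


def resolve_boot_order(config, host_type, model, rack=None, host=None):
    """Return the most specific boot order: per-host, then per-rack, then model default."""
    candidates = []
    if rack and host:
        candidates.append(f"{host_type}_{rack}_{host}_{model}_interfaces")
    if rack:
        candidates.append(f"{host_type}_{rack}_{model}_interfaces")
    candidates.append(f"{host_type}_{model}_interfaces")
    for key in candidates:
        if key in config:
            return config[key].split(",")
    raise KeyError(f"no boot order defined for {host_type}/{model}")


if __name__ == "__main__":
    # Hypothetical per-rack override that puts the second integrated NIC first.
    config = dict(DEFAULTS)
    config["director_e23_740xd_interfaces"] = "NIC.Integrated.1-2-1,NIC.Integrated.1-1-1"
    print(resolve_boot_order(config, "director", "740xd", rack="e23", host="h03"))
    print(resolve_boot_order(config, "director", "740xd"))
```

The point of the lookup order is that there is still only one default per model, while a rack or host that differs can shadow it without touching the shared entry.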

bengland2 commented on September 7, 2024

That's not the point. I know we can customize it, what I'm saying is that perhaps we shouldn't be forcing people to debug the boot order every time. Anyone who runs openstack deploys in Alias is going to hit this problem with badfish, yes?

sadsfae commented on September 7, 2024

Edit: I see from asyedham's response afterwards that the problem is with how instackenv is generated; see the QUADS RFE redhat-performance/quads#357. There's nothing relevant here to badfish or the idrac_interfaces.yml file. The issue subject is misleading and I misread it. Further follow-up should go to the above QUADS issue.

asyedham commented on September 7, 2024

As mentioned in the description of the issue, for an OpenStack deployment the host machine has to PXE boot on the MAC address specified in http://quads11.example.com/cloud/cloud06_instackenv.json, provided by the lab team. So when we set the boot order to director, ideally it should put the interface with the MAC address specified in instackenv.json first, followed by the others. But this isn't the case with the Dell 740xd on Alias: in director mode it PXE boots on a different interface than the one specified in instackenv.json, which is why OSP deployments are failing. Attaching a screenshot for reference:
e23-h03

sadsfae commented on September 7, 2024

Hey @asyedham I see your issue here.

instackenv generation from QUADS is based on the standard that people use NIC2 as their provisioning NIC (Scale Lab). This is our standard here, while the derivative ocpinventory that's generated lists all interfaces, but I think it's used as a reference, not fed into the installer.

The reason ALIAS ocpinventory is generating NIC2 is because of this and the code here:

https://github.com/redhat-performance/quads/blob/master/quads/tools/make_instackenv_json.py#L68

The interface index 1 corresponds to NIC2 in Scale Lab, so if this is generated in ALIAS and that's not the standard it will not work.

ALIAS doesn't have to use this standard, but it has to use some standard. Is there another NIC you prefer instead?

Looking at one of Ben's systems here are the interfaces (generically we refer to them as EM1,EM2,EM3,EM4)

# quads-cli --ls-interface --host e23-h01-740xd.example.com
interface: em1, mac address: 24:6E:96:BD:D4:68, switch IP: 10.19.97.232, port: xe-0/0/28
interface: em2, mac address: 24:6E:96:BD:D4:69, switch IP: 10.19.97.232, port: xe-0/0/29
interface: em3, mac address: 3C:FD:FE:B9:C8:20, switch IP: 10.19.97.233, port: et-0/0/9:0
interface: em4, mac address: 3C:FD:FE:B9:C8:21, switch IP: 10.19.97.233, port: et-0/0/9:1

I think a fix for this would be to conditionalize a value like osp_pxe_nic: em1, for example in quads.yml, so that the tooling that generates instackenv.json for you can be adjusted appropriately. However, there needs to be a standard, so if you always PXE on, say, EM1, then stick to doing that. Basically there can only be one entry; it doesn't matter what the standard is so long as it's consistent. Then we can set it to whatever matches what people are doing in other environments.
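
A rough sketch of what a configurable PXE NIC could look like when building an instackenv.json node entry. This is not the QUADS implementation: `pxe_mac`, `instackenv_node`, the dictionary field names and the pm_addr value are hypothetical, and osp_pxe_nic is only the example setting name floated above. The interface names and MAC addresses are the ones listed for e23-h01-740xd.

```python
import json

# Interfaces as reported by quads-cli for e23-h01-740xd in this thread.
INTERFACES = [
    {"name": "em1", "mac": "24:6E:96:BD:D4:68"},
    {"name": "em2", "mac": "24:6E:96:BD:D4:69"},
    {"name": "em3", "mac": "3C:FD:FE:B9:C8:20"},
    {"name": "em4", "mac": "3C:FD:FE:B9:C8:21"},
]


def pxe_mac(interfaces, osp_pxe_nic="em2"):
    """Return the MAC of the interface chosen for PXE (em2 mirrors the Scale Lab default)."""
    for iface in interfaces:
        if iface["name"] == osp_pxe_nic:
            return iface["mac"]
    raise ValueError(f"interface {osp_pxe_nic} not found")


def instackenv_node(pm_addr, interfaces, osp_pxe_nic="em2"):
    """Build a minimal node entry; real instackenv.json entries carry more fields."""
    return {"pm_addr": pm_addr, "mac": [pxe_mac(interfaces, osp_pxe_nic)]}


if __name__ == "__main__":
    # Hypothetical management address; a lab preferring em1 would set osp_pxe_nic="em1".
    node = instackenv_node("mgmt-e23-h01-740xd.example.com", INTERFACES, osp_pxe_nic="em1")
    print(json.dumps({"nodes": [node]}, indent=2))
```

Selecting by interface name rather than by list index keeps the lab-wide choice in one setting instead of a hard-coded position.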

This is something we can fix with an RFE fairly quickly.

This has nothing to do with badfish overrides or badfish however, so we'd pursue that on the QUADS github.

@QuantumPosix, if there's a "preferred" network device that folks always use in ALIAS other than EM2 / NIC2, then so long as we identify it I think this will solve your issue, @asyedham, and also make it more flexible in different environments.

sadsfae commented on September 7, 2024

We've got an RFE open here and will work on this shortly: redhat-performance/quads#357

I think this will set you straight @asyedham

In the meantime, if there's a consistent interface across all the systems (e.g. EM1), @QuantumPosix can just adjust the instackenv generation code manually to match the interface index that aligns with the one you're using.

asyedham commented on September 7, 2024

@sadsfae But I see that the host machines on Alias already PXE boot on the interface specified in instackenv.json (i.e. NIC 2) rather than NIC 1, which is the expected behavior for an OpenStack deployment:
http://quads11.alias.bos.scalelab.redhat.com/cloud/cloud06_instackenv.json
But since we use badfish to set the boot order to director, and it puts NIC1 ahead of NIC2, the nodes are getting stuck and we see overcloud deployment failures.

director_740xd_interfaces: NIC.Integrated.1-1-1,NIC.Integrated.1-2-1,NIC.Slot.7-2-1,NIC.Slot.7-1-1,NIC.Integrated.1-3-1,HardDisk.List.1-1
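
For illustration, a small sketch of moving one device to the front of a comma-separated boot order like the one above. It assumes that NIC.Integrated.1-2-1 is the port meant by NIC2, which this thread does not confirm, and it is not necessarily the exact change made in #161; `promote` is a hypothetical helper.

```python
# Boot order string as quoted above for the 740xd director entry.
DIRECTOR_740XD = (
    "NIC.Integrated.1-1-1,NIC.Integrated.1-2-1,NIC.Slot.7-2-1,"
    "NIC.Slot.7-1-1,NIC.Integrated.1-3-1,HardDisk.List.1-1"
)


def promote(boot_order, device):
    """Move `device` to the front, keeping the relative order of the other devices."""
    devices = boot_order.split(",")
    if device not in devices:
        raise ValueError(f"{device} is not in the boot order")
    return ",".join([device] + [d for d in devices if d != device])


if __name__ == "__main__":
    # Assumption: NIC.Integrated.1-2-1 is the interface listed in instackenv.json.
    print(promote(DIRECTOR_740XD, "NIC.Integrated.1-2-1"))
```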

asyedham commented on September 7, 2024

Hi @sadsfae, thanks, but I'm not sure this would help. Since you mentioned that these interface defaults cannot be changed for Alias, I discussed it with my team, and we would prefer passing a custom config/idrac_interfaces.yml to set the boot order as per #161 on the Dell 740xd, as we get stable OSP deployments on Alias, without the nodes getting stuck, when we apply these changes.

sadsfae commented on September 7, 2024

Hi @asyedham

This will help. It has nothing at all to do with idrac_interfaces.yml; I think there's some misunderstanding there.

Partly this is my fault because I didn't read the full description of the issue and went by the title instead.

The problem as I understand it is that the generated instackenv.json does not reflect the PXE interface that folks in ALIAS want to use for OSP. This is managed solely by QUADS and has nothing at all to do with badfish (hence closing this issue).

If you need the ALIAS instackenv.json to point to a different PXE NIC, all @QuantumPosix needs to do as a temporary measure is modify the interface list index here, e.g. changing it from 1 to 2 (EM2 to EM3) or to whatever interface you are using:

https://github.com/redhat-performance/quads/blob/master/quads/tools/make_instackenv_json.py#L68

Please tell us or Chris which interface you want to appear in instackenv.json for ALIAS instead of EM2.

This behavior has never been brought up to us before, so this is the first time we're looking at it. When the helper instackenv.json generation was brought in, it was with the assumption that people are using EM2 everywhere (which they are in Scale Lab at least).

You don't need us to generate an instackenv.json for you to deploy OSP, and no other tool or product does this for anyone; it's just one of many small helper features we've employed to assist people's deployments. We want to ensure it's flexible enough to work in environments that do not use the Scale Lab standard of EM2.

The RFE here will give us a configurable pxe_boot setting that we can assign in the interfaces collection in MongoDB, so that the PXE interface is configurable per host (or across all hosts) and the subsequently generated instackenv.json reflects how you want it in ALIAS.
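
A minimal sketch of the idea, assuming each interface record may carry a boolean pxe_boot field and that a lab-wide default name is used when no interface is flagged. The record layout and the `select_pxe_mac` helper are hypothetical and are not taken from the QUADS codebase or from redhat-performance/quads#357.

```python
def select_pxe_mac(interfaces, lab_default="em2"):
    """Prefer an interface flagged pxe_boot; otherwise fall back to the lab default name."""
    for iface in interfaces:
        if iface.get("pxe_boot"):
            return iface["mac"]
    for iface in interfaces:
        if iface["name"] == lab_default:
            return iface["mac"]
    raise ValueError("no PXE interface could be determined")


if __name__ == "__main__":
    # MAC addresses from the e23-h01-740xd listing in this thread; the flag is illustrative.
    host_interfaces = [
        {"name": "em1", "mac": "24:6E:96:BD:D4:68", "pxe_boot": True},
        {"name": "em2", "mac": "24:6E:96:BD:D4:69"},
    ]
    print(select_pxe_mac(host_interfaces))
```

With no flag set anywhere, the function behaves like the current Scale Lab convention of EM2, so existing environments would be unaffected.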

Further follow-up should go to the RFE here, as this is not in any way related to badfish or idrac_interfaces.yml at all and is a mechanic of QUADS.

redhat-performance/quads#357

smalleni commented on September 7, 2024

Just catching up on the discussion here. Some good insights by people on this thread and redhat-performance/quads#357. As long as @grafuls's suggestion to include a pxe flag per model type is implemented, and that is in turn consumed appropriately by the code in quads that generates the instackenv.json, that should help solve the issue. But to provide some background:

The reason for the code in quads to consume the second mac address as the pxe interface is based on the assumption that we have a convention that the second interface is being used for the PXE network (the scale lab wiki explicitly mentions this to be the director network). However, is it not the case in ALIAS? In any case, following a common convention across all shared labs will help any other higher level automation using it. People outside of our group are using jetpack/jetski as they have been advertised to do the job for them. Is there any other technical reason for the boot device for director pxe to be something other than network 2 in ALIAS apart from the fact that it has always been that? Is there any product/higher level automation that depends on the boot orders requiring it?

Agree with Will's comment that ALIAS has to use some standard, and it should be explicitly mentioned what that standard is (preferably the same standard across all devices). If we are going to the extent of calling the boot order director_* in idrac_interfaces.yml in badfish, perhaps we should really care about what director is using, or explicitly make that info available for something like jetpack to use?

We really need to care about boot orders in badfish because the whole point of changing the boot order is to enable other higher level automation/product testing, right? Otherwise, we could just have one foreman related boot order that serves most of the use cases and be done with it. I believe all of us should work together and be in sync on any changes/conventions related to default boot orders and the way ocpinv.json and instackenv.json are generated, because we have unique lab/hardware/product related knowledge, and by collaborating effectively we can boost the productivity not only of our team but of the several teams that are using jetpack/jetski.

sadsfae commented on September 7, 2024

Thanks for the insight here, Sai.

ALIAS has not seen many OSP workloads compared to Scale Lab, and it's also architecturally different (it has two different types of switches that comprise the backend internal connections: the 4550 and the normal QFX5200 and they are not cross-connected in some places).

This is the first time we've heard feedback that the generated instackenv.json for OSP, with the codebase being set up for the Scale Lab, requires people to do extra work in ALIAS. This is easily changed in 1 second in the running ALIAS QUADS codebase until redhat-performance/quads#357 is merged/active, which lets us choose a pxe_boot interface either per system (not ideal) or by setting a lab-wide default (what we really want).

(This is actually merged, just not packaged/updated yet as it'll land in 1.1.5 otherwise)
https://github.com/redhat-performance/quads/blob/master/quads/tools/make_instackenv_json.py#L68

(ALIAS running codebase - current)
Here is the 1-character change we can apply right away in ALIAS so the next subsequent instackenv.json matches a different "Director" interface in ALIAS:

https://github.com/redhat-performance/quads/blob/v1.1.4.1/quads/tools/make_instackenv_json.py#L68

We need to know what the preferred default is from your team. We believe folks are using what QUADS calls em1 instead of em2 in ALIAS, and as a result folks are having to re-adjust other things to match it. This is not needed, but we need to know the standard folks want to use in ALIAS.

Subsequent discussion/feedback needs to go to redhat-performance/quads#357, where we have posed some questions/assumptions and need confirmation of what the preferred ALIAS default should be for the new pxe_boot MongoDB host interface collection setting, which lets us set this in a more flexible fashion.

TL;DR

  • There is/was no standard we're aware of until now with ALIAS because it's never been brought up, hence it inherits Scale Lab default of EM2 (and really not a lot of OSP deployments there in general).
  • We've added a patch to support the notion of pxe_boot at the host/interface level to set how instackenv.json is generated
  • This allows us to set a different "lab standard" or in a pinch a host-level preference of what pxe_boot interface is used but really though, we need a consistent default for all of the lab.
  • Nobody should ever have to re-align a different "Director" interface layout just to match instackenv.json, now that we're aware it's different we should instead make setting a lab-specific default easy and seamless so it just works as it does in Scale Lab, which is what redhat-performance/quads#357 should accomplish
  • If you can confirm what QUADS-managed interface is instead preferred for PXE/Director traffic (EM1 we believe, instead of EM2) then we can adjust the codebase right now in ALIAS so subsequent instackenv.json is generated as such, not having to wait until we apply the patch or upgrade ALIAS to 1.1.5 which will contain this functionality.

Separately, override functionality already exists for modifying the orders set by the badfish tool, which has nothing at all to do with instackenv.json generation; that is handled by QUADS.

If the "Director" entries in badfish for the r740xd systems (we have none of these in Scale Lab) need to be updated again to reflect a more common default, we're happy to do that too, or you can just open a PR against the development branch for it.

https://github.com/redhat-performance/badfish#contributing
