I have begun building a lab for my team of HPC consultants, and I'm trying to make some plans. I would like this to be as flexible as I can make it. I live 3½ hours from the site, so the fewer trips down there to recable or move stuff around, the better! Most of this hardware has various older InfiniBand connectivity, along with multiport LOM & OCP cards at either 1Gb or 10Gb. Most also support both dedicated and shared BMC ports. We have two dedicated IPs (so far), which I'm currently using for the head node's BMC & SSH access. This will be all Linux, though we will be accessing web interfaces when testing various products. My initial thoughts:
- Identify what we want to keep and what we want to excess. There's some _very_ old hardware in there! There's also some old Omni-Path hardware. We don't see much OPA, but some team members seem to think that may change. Still, this stuff is old.
- Carve out a management/provisioning network. Ideally, this will allow us to switch between dedicated and shared BMC ports at will. We use this for customer knowledge transfer when we demo our cluster management software. The shared port is usually onboard port 1, which is usually 1Gb, so this is easy enough. We can probably cable all of that up to one switch.
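Before cabling, it helps to write down the address plan for the management side so the BMC, provisioning, and management ranges don't collide later. A minimal sketch using Python's standard `ipaddress` module; the `10.42.0.0/16` supernet and the per-network /24 sizing are assumptions, so substitute whatever RFC 1918 space doesn't overlap the campus or VPN ranges:

```python
import ipaddress

# Hypothetical lab supernet -- pick RFC 1918 space that doesn't
# collide with the campus LAN or the company VPN.
lab = ipaddress.ip_network("10.42.0.0/16")

# Carve the first three /24s out for the core lab networks.
mgmt, bmc, prov = list(lab.subnets(new_prefix=24))[:3]

print("management: ", mgmt)   # management:  10.42.0.0/24
print("BMC:        ", bmc)    # BMC:         10.42.1.0/24
print("provisioning:", prov)  # provisioning: 10.42.2.0/24
```

Keeping the plan in a script (or later in NetBox) makes it easy to hand new VLAN-sized chunks to team projects without re-deriving anything.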
- Identify a subset of nodes to cable up with access to the campus network. These systems are behind the company VPN, and we will be controlling login access ourselves. While I'm not worried about someone on the team doing something nefarious on the company network, I don't want everything to have this capability. Still, having the option on some nodes will give us flexibility, and we have a handful of systems with more Ethernet ports than we would otherwise need (campus LAN access is 1Gb).
- The head node will run Proxmox to give us the flexibility to spin up temporary test heads for team member projects. The idea here is that we can partition the network using VLANs to isolate what one group is doing with some systems from what anybody else is doing. The current head node has sufficient space to host shared home directories. We will also have a small IBM ESS that will be added to these racks next time I'm there.
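For the VLAN partitioning, Proxmox supports a VLAN-aware bridge, so each temporary test head just gets a VLAN tag on its virtual NIC rather than needing a separate bridge per project. A sketch of the relevant stanza in `/etc/network/interfaces` on the Proxmox host; the physical port name `eno1` and the permitted VID range are assumptions for your hardware:

```
auto vmbr0
iface vmbr0 inet manual
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
```

With this in place, a guest NIC attached to `vmbr0` with tag 20 only sees VLAN 20 traffic, which is exactly the per-group isolation described above, provided the physical switch port is trunked with the same VLANs.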
- I had thought about running some containers, either in a VM on the head node or as LXCs. Right now the only thing I'm thinking about on that front is NetBox.
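If NetBox ends up in a VM running Docker, the community netbox-docker project (github.com/netbox-community/netbox-docker) bundles NetBox with its PostgreSQL and Redis dependencies, so you mostly just clone it and add a small override file to publish the web port. A sketch; the host port 8000 is an arbitrary choice, and 8080 is the port the NetBox container listens on per that project's quickstart:

```
# docker-compose.override.yml, placed in the netbox-docker checkout
services:
  netbox:
    ports:
      - "8000:8080"
```

Running it as an LXC with a native install works too; the Docker route just keeps upgrades to a `docker compose pull`.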
This is what I have off the top of my head. If there's any useful software or procedures, or if I'm on the wrong path entirely, I'd appreciate your help. We have a modest budget, but we did convince our management to at least buy us a used 1Gb switch that is at least similar to hardware we would see "in the wild." We're hoping we can use the lab to show value there and get them to approve some other, still modest, requests in the future!
In addition to NetBox, a wiki or other knowledge base would be nice. You can document setup procedures as you go, and then other people can use that to figure stuff out.