A collection of practical build notes, setup guides, and performance experiments for local AI/ML infrastructure.
This repository documents real-world home-lab cluster setups for training, fine-tuning, inference, distributed systems testing, high-speed networking, storage fabrics, and heterogeneous hardware experimentation.
The current focus is on:
- Dell Pro Max GB10 / NVIDIA GB10 cluster setups
- Apple Mac Studio Ultra cluster setups
- RoCE / RDMA networking
- MikroTik 200G / 400G switching
- NCCL, MPI, and multi-node communication
- Shared NVMe over RDMA
- Cross-vendor AI workload placement
- KV-cache prefill, caching, and serving experiments
These guides are written from working lab configurations, including the mistakes, caveats, and performance traps discovered along the way.
The goal of this project is to build a self-contained local AI infrastructure lab for training, inference, networking, storage, and distributed systems research.
This lab combines different hardware platforms, including Apple Mac Studio Ultra systems and Dell Pro Max GB10 / NVIDIA GB10 systems, to evaluate how heterogeneous compute can be used effectively for real AI workloads.
The aim is not only to benchmark hardware but also to understand how mixed-vendor systems behave under practical workloads:
- Distributed training
- Fine-tuning
- Inference serving
- KV-cache prefill
- KV-cache reuse and caching
- Shared NVMe over RDMA
- RoCE networking
- Mixed precision and quantization strategies
- Cross-vendor workload placement
- Local reproducible AI infrastructure
Longer term, this work is about building automation that understands hardware bottlenecks and can optimize workload placement across different systems. That includes deciding where to prefill, where to cache, where to serve, and how to mix precision across weights, activations, and KV cache.
A setup guide for building a GB10 cluster using ConnectX-7 networking, RoCE, NCCL, OpenMPI, and MikroTik switching.
Covered topics include:
- 2-node direct-connect setup
- 2x2 cluster layout
- 4-node switched fabric setup
- ConnectX-7 static IP configuration
- Duplicate MAC workaround on GB10 nodes
- RoCE device mapping
- NCCL rail pinning
- OpenMPI launch configuration
- MikroTik CRS switch configuration
- QSFP-DD 400G to 2x200G breakout behavior
- MTU 9000 and L2MTU configuration
- VLAN/PVID access-port fabric isolation
- Hardware offload validation using the MikroTik `H` flag
- iperf3 validation
- NCCL all-reduce validation
Guide:
guides/gb10-cluster-setup.md
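The duplicate MAC workaround relies on locally administered addresses. As a minimal sketch of how such addresses could be generated (not necessarily how the guide does it), the snippet below derives a stable MAC per node and port from a seed string. The node and port names are hypothetical placeholders. The only fixed rule is the first octet: the locally administered bit (0x02) is set and the multicast bit (0x01) is cleared.

```python
import hashlib


def locally_administered_mac(seed: str) -> str:
    """Derive a stable, locally administered unicast MAC from a seed string."""
    digest = hashlib.sha256(seed.encode("utf-8")).digest()
    octets = bytearray(digest[:6])
    # Set the locally administered bit (0x02) and clear the multicast bit (0x01).
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{octet:02x}" for octet in octets)


if __name__ == "__main__":
    # Hypothetical node and port labels; substitute your own inventory.
    for node in ("gb10-a", "gb10-b"):
        for port in ("port0", "port1"):
            print(node, port, locally_administered_mac(f"{node}/{port}"))
```

Hashing a seed keeps the address the same across reinstalls, so switch MAC tables and static host mappings do not drift.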
A lot of AI infrastructure documentation assumes cloud clusters, vendor reference architectures, or fully homogeneous systems.
This repo is focused on the messy but useful middle ground:
- Local hardware
- Mixed vendors
- Real switches
- Real cabling
- Real firmware quirks
- Real performance debugging
- Repeatable experiments
- Configurations that are small enough to own but complex enough to teach useful lessons
Some findings are simple but important. For example:
- Link up does not mean the switch is forwarding in hardware.
- On MikroTik CRS switches, `hw=yes` is not enough; you want the `H` hardware-offload flag.
- Separate software bridges can silently cap performance.
- A single hardware-offloaded bridge with untagged VLAN/PVID access ports can preserve L2 isolation while keeping traffic in the ASIC.
- 400G QSFP-DD breakout cables may expose multiple logical 200G interfaces.
- NCCL launch host IPs do not automatically guarantee NCCL data path selection. `NCCL_SOCKET_IFNAME`, `NCCL_IB_HCA`, GID index, and OpenMPI interface pinning all matter.
- Some GB10 interfaces may expose duplicate MAC addresses and need explicit locally administered MAC overrides.
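To make the NCCL pinning point concrete, here is a rough sketch that assembles an `mpirun` command with the data path pinned explicitly. The interface name, HCA name, GID index, host IPs, and binary path are all hypothetical placeholders, and the MCA parameter names follow Open MPI 4.x, so check them against your installed version. The script only prints the command; it does not launch anything.

```python
import shlex


def nccl_env(socket_ifname: str, ib_hca: str, gid_index: int) -> dict:
    """Environment variables that pin NCCL to one interface and one RDMA device."""
    return {
        "NCCL_SOCKET_IFNAME": socket_ifname,  # bootstrap / socket traffic
        "NCCL_IB_HCA": ib_hca,                # RDMA device used for data
        "NCCL_IB_GID_INDEX": str(gid_index),  # RoCE GID index on that device
        "NCCL_DEBUG": "INFO",                 # log which transport NCCL selects
    }


def mpirun_command(hosts, env, tcp_ifname, program):
    """Build an mpirun argv list with Open MPI TCP traffic pinned to tcp_ifname."""
    cmd = ["mpirun", "-np", str(len(hosts)), "--host", ",".join(hosts)]
    cmd += ["--mca", "btl_tcp_if_include", tcp_ifname]
    cmd += ["--mca", "oob_tcp_if_include", tcp_ifname]
    for key, value in env.items():
        cmd += ["-x", f"{key}={value}"]  # export each variable to every rank
    return cmd + list(program)


if __name__ == "__main__":
    # All values below are assumptions for illustration only.
    env = nccl_env("enp1s0f0np0", "mlx5_0", 3)
    cmd = mpirun_command(
        ["10.0.0.1", "10.0.0.2"], env, "enp1s0f0np0",
        ["./all_reduce_perf", "-b", "8", "-e", "1G", "-f", "2", "-g", "1"],
    )
    print(shlex.join(cmd))
```

With `NCCL_DEBUG=INFO` set, the NCCL log should show whether the run used the RDMA path or fell back to sockets.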
Suggested layout:
.
├── README.md
├── guides/
│ ├── gb10-cluster-setup.md
│ ├── mac-studio-ultra-cluster-setup.md
│ └── shared-nvme-rdma-setup.md
├── scripts/
│ ├── nccl/
│ ├── iperf/
│ └── diagnostics/
├── configs/
│ ├── netplan/
│ ├── mikrotik/
│ └── hosts/
├── results/
│ ├── iperf/
│ ├── nccl/
│ └── notes/
└── diagrams/
Current tested and/or in-progress hardware includes:
- Dell Pro Max GB10 / NVIDIA GB10 systems
- NVIDIA ConnectX-7 networking
- MikroTik CRS812-8DS-2DQ-2DDQ switch
- QSFP-DD 400G to 2x200G QSFP56 DAC breakout cables
- Dell 400G DAC QSFP56 cables
- Apple Mac Studio Ultra systems
This list will expand as more configurations are validated.
Each setup should be validated in layers:
- Physical link state
- Interface naming and MAC mapping
- MTU and jumbo frame validation
- Switch forwarding and hardware offload
- Raw TCP throughput with `iperf3`
- RDMA / RoCE visibility
- GID index validation
- NCCL transport selection
- NCCL correctness with `#wrong 0`
- Application-level workload testing
The intent is to avoid guessing. Each layer should prove the next layer is worth debugging.
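One way to express the layered approach is a small runner that executes checks in order and skips everything after the first failure. This is a sketch only: the check functions here are stand-in lambdas, and real checks would wrap commands such as a don't-fragment ping or an `iperf3` run. The helper `jumbo_ping_payload` shows the arithmetic for the jumbo frame test: an IPv4 ICMP echo carries 28 bytes of headers (20 IP + 8 ICMP), so MTU 9000 corresponds to a payload of 8972 bytes (for example `ping -M do -s 8972 <peer>` on Linux).

```python
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[], bool]]


def run_layers(checks: List[Check]) -> List[Tuple[str, str]]:
    """Run checks in order; once one fails, mark the remaining layers as skipped."""
    results = []
    blocked = False
    for name, check in checks:
        if blocked:
            results.append((name, "skipped"))
            continue
        passed = check()
        results.append((name, "pass" if passed else "fail"))
        if not passed:
            blocked = True
    return results


def jumbo_ping_payload(mtu: int) -> int:
    """ICMP payload size that exactly fills an IPv4 packet of the given MTU."""
    return mtu - 20 - 8  # IPv4 header (20 bytes) + ICMP header (8 bytes)


if __name__ == "__main__":
    # Stand-in checks; replace each lambda with a real probe.
    layers = [
        ("physical link state", lambda: True),
        ("MTU and jumbo frames", lambda: False),
        ("raw TCP throughput", lambda: True),
    ]
    for name, status in run_layers(layers):
        print(f"{status:8s} {name}")
    print("payload for MTU 9000:", jumbo_ping_payload(9000))
```

Stopping at the first failure keeps attention on the lowest broken layer instead of on symptoms further up the stack.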
Example milestones from the GB10 work:
- 200G links detected on ConnectX-7 ports
- MTU 9000 jumbo ping passing end-to-end
- MikroTik switch ports running with MTU 9000 and L2MTU 9570
- Switch fabric forwarding in hardware with the `H` flag
- Switched fabric traffic reaching approximately 190+ Gbps per active 200G logical port
- NCCL using RoCE instead of socket fallback
- NCCL rings connecting successfully
- NCCL tests completing with `#wrong 0`
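A throughput milestone like the one above can be checked mechanically from `iperf3 --json` output. The sketch below reads the receiver-side rate from the `end.sum_received.bits_per_second` field of a TCP report and compares it to a target. The 190 Gbps default mirrors the milestone, and the sample JSON in the demo is a trimmed, made-up placeholder, not a measurement from this lab.

```python
import json


def received_gbps(iperf3_json: str) -> float:
    """Receiver-side throughput in Gbps from an iperf3 --json TCP report."""
    report = json.loads(iperf3_json)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


def meets_target(gbps: float, target_gbps: float = 190.0) -> bool:
    """True when the measured rate reaches the target rate."""
    return gbps >= target_gbps


if __name__ == "__main__":
    # Placeholder report for illustration; real reports contain many more fields.
    sample = '{"end": {"sum_received": {"bits_per_second": 1.0e11}}}'
    rate = received_gbps(sample)
    print(f"{rate:.1f} Gbps, meets target: {meets_target(rate)}")
```

Storing the raw JSON next to the parsed number under `results/iperf/` makes later comparisons easier to reproduce.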
Planned or in-progress guides:
- Apple Mac Studio Ultra cluster setup
- GB10 four-node NCCL and RoCE tuning
- Shared NVMe over RDMA for AI cache experiments
- KV-cache prefill and serving across different hardware vendors
- Cross-vendor inference pipeline experiments
- Mixed precision strategy testing
- Workload placement automation
- Local benchmark reproducibility framework
These guides are not vendor-certified reference architectures.
They are practical working notes from real lab setups. Hardware, firmware, drivers, operating systems, and switch software versions can change behavior.
Before copying any configuration into your own environment:
- Export your current switch configuration.
- Back up your node network configuration.
- Use console access when changing switch VLAN filtering.
- Validate one link or rail at a time.
- Confirm hardware offload before trusting throughput numbers.
- Do not assume interface names are identical across systems.
- Do not assume GID indexes are identical across rails or nodes.
Issues, corrections, test results, and additional hardware notes are welcome.
Useful contributions include:
- Confirmed working configurations
- Failure cases and fixes
- Switch configuration examples
- RoCE / NCCL tuning notes
- Performance results with hardware details
- Diagrams and topology maps
- Reproducible benchmark scripts
When sharing results, include:
- Hardware model
- NIC model and firmware
- Switch model and RouterOS / firmware version
- Cable type
- OS version
- Kernel version
- NCCL version
- CUDA version
- Test command
- Relevant environment variables
- Throughput and correctness output
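The checklist above maps naturally onto a small record type. This is only a suggested shape: the class name, field names, and JSON layout are assumptions, not an established format for this repo. It serializes a report to JSON and lists any fields that were left empty.

```python
import json
from dataclasses import asdict, dataclass, field, fields


@dataclass
class ResultReport:
    """One shared result, with one field per checklist item."""
    hardware_model: str = ""
    nic_model_and_firmware: str = ""
    switch_model_and_version: str = ""
    cable_type: str = ""
    os_version: str = ""
    kernel_version: str = ""
    nccl_version: str = ""
    cuda_version: str = ""
    test_command: str = ""
    environment: dict = field(default_factory=dict)
    output: str = ""

    def missing_fields(self) -> list:
        """Names of fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    # Partially filled example; the values are placeholders.
    report = ResultReport(
        hardware_model="Dell Pro Max GB10",
        test_command="./all_reduce_perf -b 8 -e 1G -f 2 -g 1",
        environment={"NCCL_DEBUG": "INFO"},
    )
    print(report.to_json())
    print("still missing:", report.missing_fields())
```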
Use these configurations at your own risk.
High-speed networking, RDMA, switch VLAN filtering, firmware changes, and storage fabric experiments can break connectivity or cause data loss if applied incorrectly.
These guides are intended for experienced operators, builders, and researchers working in controlled lab environments.