Multi-AZ (Stretched Cluster) Prep — extra layer

Run this only if intake A13 = Yes (Multiple Availability Zones / stretched). Everything here is additive to the single-AZ flow: you still work through prerequisites.md, 01-network-dns-plan.md, and 02-intake.md. This page captures what a stretched (multi-AZ) management domain adds on top.

A stretched cluster spans two data sites (AZ1, AZ2) plus a third witness site. vSAN mirrors every object across both AZs and uses the witness to break split-brain. That means three things the single-AZ plan never asks for: a witness at a third location, a fabric that meets latency/bandwidth limits between all three, and roughly double the raw capacity. The stretch operation itself is driven by SDDC Manager — TechDocs: Stretching vSAN Clusters.

Convention on this page: sfo01 = AZ1 / preferred fault domain, sfo02 = AZ2 / secondary fault domain, sfo-wit = witness site. Replace consistently. VLAN IDs and CIDRs are placeholders.


A. Decision gate — is stretched actually required?

| # | Question | Notes |
|---|---|---|
| M1 | Two independent data sites available (AZ1, AZ2)? | Separate power/cooling/fire zone, not two racks in one room |
| M2 | A third location for the witness? | Can be small; only runs the witness appliance |
| M3 | Inter-AZ link meets <5 ms RTT and bandwidth (see C)? | Hard vSAN requirement; confirm with network team in writing |
| M4 | AZ↔witness link within the witness RTT budget? | ≤200 ms RTT up to 10 hosts/site; ≤100 ms for 11–15; ≤500 ms for a single host/site. Witness tolerates far more latency than the data sites |
| M5 | Which AZ is preferred (forms quorum with the witness if the inter-AZ link fails)? | Default sfo01 |
| M6 | Even, matched host count per AZ? | Same host count + hardware both AZs |

If any of M1–M4 is No/unknown, stop and resolve it before sizing — a stretched build on a fabric that misses the latency budget will pass bring-up and then fail under load.
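
The M3 and M4 budgets can be written down as a small check to run against the RTT figures the network team reports. This is a minimal sketch: the function names are hypothetical, and the thresholds are simply the ones quoted in the table above.

```python
def witness_rtt_budget_ms(hosts_per_site: int) -> int:
    """Return the maximum AZ-to-witness RTT (ms) for a per-site host count (row M4)."""
    if hosts_per_site < 1:
        raise ValueError("hosts_per_site must be at least 1")
    if hosts_per_site == 1:
        return 500  # single host per site
    if hosts_per_site <= 10:
        return 200
    if hosts_per_site <= 15:
        return 100
    raise ValueError("more than 15 hosts per site is outside the M4 table")


def gate_passes(inter_az_rtt_ms: float, witness_rtt_ms: float, hosts_per_site: int) -> bool:
    """True when the measured RTTs fit both the M3 (<5 ms) and M4 budgets."""
    within_m3 = inter_az_rtt_ms < 5
    within_m4 = witness_rtt_ms <= witness_rtt_budget_ms(hosts_per_site)
    return within_m3 and within_m4


# Example with made-up measurements: 8 hosts per AZ, 3.2 ms inter-AZ, 150 ms to witness.
print(gate_passes(3.2, 150, 8))
```

Feed it the worst-case measured RTT rather than the average, since the budget is a ceiling.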


B. Witness / third site

| Item | Value / requirement |
|---|---|
| Witness appliance | VMware-VirtualSAN-Witness-*.ova (see prerequisites.md) |
| Witness site | Third location, routable from both AZ1 and AZ2 |
| Witness runs on | Any supported host at the third site (nested ESXi appliance) |
| Witness size | Match to component count (Tiny / Medium / Large per OVA prompt) |
| Witness network (VCF) | One VMkernel/subnet on the witness appliance carries both management and witness traffic; the 2nd adapter is unused. Route it to the management networks in both AZs |
| Data-host witness traffic (WTS) | On ESXi hosts in both AZs, witness traffic rides the ESX Management VMkernel (WTS-tagged), so no dedicated witness VLAN is needed. Only witness traffic is routed to the 3rd site; vSAN data traffic stays between AZ1 and AZ2 on the per-AZ vSAN subnets (see D) |
| AZ↔witness RTT | ≤200 ms RTT (up to 10 hosts/site); ≤100 ms for 11–15 hosts/site; ≤500 ms for a single host/site (2-node) |
| Witness bandwidth (rule of thumb) | ~2 Mbps per 1000 vSAN components; size from expected object count |
| Witness FQDN | A + PTR record, e.g. sfo-wit01.sfo.example.io |
| Witness NTP/DNS | Reachable from the witness site (see F) |

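The witness holds only metadata (witness components), never data. Losing the witness alone does not stop I/O: AZ1 and AZ2 still see each other and together hold a majority of each object's components. The preferred AZ (M5) decides the other failure case: if the inter-AZ link fails, the witness forms quorum with the preferred AZ and that AZ keeps running.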
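
The witness bandwidth rule of thumb from the table is a straight proportion, sketched below. The function name and the 25,000-component figure are hypothetical; the 2 Mbps per 1000 components ratio is the one quoted above.

```python
def witness_bandwidth_mbps(component_count: int, mbps_per_1000: float = 2.0) -> float:
    """Estimate the witness link bandwidth (Mbps) from the expected vSAN component count."""
    if component_count < 0:
        raise ValueError("component_count cannot be negative")
    return component_count / 1000 * mbps_per_1000


# Example: a cluster expected to hold 25,000 components.
print(witness_bandwidth_mbps(25_000))  # 50.0
```

Treat the result as a floor for steady state; it says nothing about headroom for other traffic sharing the same routed path.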

Witness VLANs/subnets — what you actually need (VCF design). Per the Broadcom vSAN Design for VCF and Deploying a Witness Appliance, VCF puts witness traffic on the management network at both ends: the data hosts tag their ESX Management VMkernel for witness traffic, and the witness appliance uses one VMkernel for management + witness. So you do not provision a dedicated witness VLAN/subnet — you need the witness appliance’s management subnet (3rd site) routed to the ESX-management networks in both AZs, within the witness RTT budget above. (The generic vSAN guide describes an optional dedicated per-site witness VLAN; VCF’s design does not use it.)

Routing for witness traffic. Because witness traffic rides the ESX Management VMkernel — which is on the default TCP/IP stack and uses the management default gateway — it follows normal routed paths, so you do not add per-host static routes (the classic dedicated-witness-VMK design does; the VCF management-network design doesn’t). What you need instead:

  • Bidirectional L3 routing between each AZ’s ESX-Management subnet and the witness appliance’s network (3rd site): AZ1 and AZ2 hosts must reach the witness, and the witness must route back to both AZ management subnets.
  • Each AZ’s management default gateway needs a route to the witness subnet, and the witness site’s gateway routes back to both AZ management subnets — all within the witness RTT budget (≤200 ms; see the table above).
  • vSAN witness traffic is unicast on modern vSAN — no multicast on the routed path. Ensure the required vSAN ports are permitted end-to-end.
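
The routing requirement in the bullets above can be enumerated as a checklist to hand to the network team. This is a sketch only: `required_routes` is a hypothetical helper, the CIDRs are placeholders, and it checks nothing on the live fabric.

```python
from ipaddress import ip_network


def required_routes(az_mgmt: dict[str, str], witness_subnet: str) -> list[tuple[str, str]]:
    """List (site, destination subnet) pairs that each site's gateway must route."""
    witness = ip_network(witness_subnet)
    routes = []
    for site, cidr in az_mgmt.items():
        mgmt = ip_network(cidr)
        # Routed traffic needs distinct subnets at each end.
        if mgmt.overlaps(witness):
            raise ValueError(f"{site} management subnet {mgmt} overlaps witness subnet {witness}")
        routes.append((site, str(witness)))    # AZ management gateway -> witness subnet
        routes.append(("sfo-wit", str(mgmt)))  # witness gateway -> AZ management subnet
    return routes


# Placeholder subnets following the page convention (sfo01, sfo02, sfo-wit).
for site, destination in required_routes(
    {"sfo01": "10.11.10.0/24", "sfo02": "10.12.10.0/24"}, "10.13.10.0/24"
):
    print(f"{site} must route to {destination}")
```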

C. AZ1 ↔ AZ2 fabric

| Item | Requirement |
|---|---|
| RTT AZ1↔AZ2 | <5 ms RTT (hard limit for vSAN data) |
| Bandwidth AZ1↔AZ2 | ≥10 Gbps (the VCF design-library figure; see the note under D). The actual need is driven by the write bandwidth being mirrored (VMs replicated between sites). Size against the vSAN Stretched Cluster Bandwidth Sizing guide (witness leg: ~2 Mbps per 1000 components; see also TechDocs Bandwidth and Latency Requirements) and plan for resync bursts |
| L2 + HA L3 gateway | Stretched L2 segments (see D) plus a highly-available Layer 3 gateway between AZs, provided by the physical fabric |
| MTU across the inter-AZ link | 9000 end-to-end for vSAN / vMotion / overlay |
| Fault domains | sfo01 = preferred, sfo02 = secondary, sfo-wit = witness (3rd) |
| Link redundancy | No single-path between AZs (dark fibre pair / diverse DWDM) |
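
To get a first estimate of the mirrored write bandwidth before opening the sizing guide, the sketch below multiplies write throughput by a data multiplier and a resync multiplier. The defaults of 1.4 and 1.25 are the values commonly quoted from the vSAN Stretched Cluster Bandwidth Sizing guide; treat them, the 4 KB I/O size, and the function name as assumptions and confirm against the guide for your release.

```python
def inter_az_bandwidth_mbps(
    write_iops: float,
    io_size_kb: float = 4.0,          # assumed average write size
    data_multiplier: float = 1.4,     # assumed metadata/overhead factor
    resync_multiplier: float = 1.25,  # assumed headroom for resync bursts
) -> float:
    """Estimate the inter-AZ bandwidth (Mbps) needed to mirror a write workload."""
    write_mbps = write_iops * io_size_kb * 8 / 1000  # KB/s to Mbps, decimal units
    return write_mbps * data_multiplier * resync_multiplier


# Example: 10,000 write IOPS at 4 KB is 320 Mbps of writes, about 560 Mbps on the link.
print(round(inter_az_bandwidth_mbps(10_000)))
```

The estimate only covers vSAN write mirroring; vMotion, overlay, and management traffic crossing the same link need their own allowance, which is one reason the design-library floor is 10 Gbps.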

D. Networking — what stretches vs. what stays per-AZ

This extends the VLAN/subnet table in 01-network-dns-plan.md. In a stretched build each traffic type is either stretched L2 (same subnet visible in both AZs, for anything that must fail over) or per-AZ (a distinct subnet in AZ2, routed).

The table below is written for the management domain, but the per-AZ rows apply to any stretched cluster — a workload-domain cluster can also be stretched (once the management domain is). For each stretched WLD, repeat the per-AZ analysis for its own vMotion / vSAN / host-TEP networks. The VM Management (stretched) row is management-domain-specific: a WLD’s tenant workloads ride NSX overlay segments, and a WLD that runs its own edges repeats the Edge Overlay / Uplink rows.

| Traffic | Stretched L2? | AZ1 subnet | AZ2 subnet | Notes |
|---|---|---|---|---|
| ESX Management | Per-AZ | /24 | /24 | Own gateway per AZ |
| VM Management | Stretched | /24 | (same) | Mgmt VMs fail over between AZs → must be L2 |
| vMotion | Per-AZ | /24 | /24 | Jumbo; routed between AZs |
| vSAN | Per-AZ | /24 | /24 | Jumbo; routed AZ1↔AZ2. Witness traffic does not use this network (see the Witness traffic row) |
| ESX Host Overlay (TEP) | Per-AZ | /24 | /24 | Jumbo; per-AZ TEP subnets (common gotcha) |
| NSX Edge Overlay (TEP) | Stretched* | /24 | (same) | Edges fail over → stretched (Centralized only) |
| NSX Edge Uplink-01 | Stretched* | /29–/30 | (same) | BGP peer; stretched (Centralized only) |
| NSX Edge Uplink-02 | Stretched* | /29–/30 | (same) | BGP peer; stretched (Centralized only) |
| Witness traffic | Routed to 3rd | | | Rides the ESX Management VMK (WTS) → witness site (≤200 ms); no dedicated witness VLAN |

* Edge Overlay + Uplinks are stretched only with NSX Centralized connectivity (intake A10). With Distributed connectivity each AZ has its own local transit gateway / edges, so Edge Overlay + Uplinks are per-AZ. Consistent with prerequisites.md.
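
The stretched versus per-AZ rule, including the Centralized/Distributed footnote, can be encoded as a quick sanity check on a filled-in subnet plan. A minimal sketch, assuming the traffic names used in the table; the helper names and CIDRs are hypothetical.

```python
from ipaddress import ip_network

EDGE_NETWORKS = {"NSX Edge Overlay (TEP)", "NSX Edge Uplink-01", "NSX Edge Uplink-02"}


def is_stretched(traffic: str, connectivity: str = "Centralized") -> bool:
    """Return True when the traffic type must use one L2 subnet across both AZs."""
    if traffic == "VM Management":
        return True
    if traffic in EDGE_NETWORKS:
        return connectivity == "Centralized"  # per-AZ under Distributed connectivity
    return False  # ESX Management, vMotion, vSAN, ESX Host Overlay (TEP)


def check_row(traffic: str, az1_cidr: str, az2_cidr: str, connectivity: str = "Centralized") -> list[str]:
    """Return a list of problems for one row of the subnet plan (empty when the row is fine)."""
    az1, az2 = ip_network(az1_cidr), ip_network(az2_cidr)
    if is_stretched(traffic, connectivity):
        return [] if az1 == az2 else [f"{traffic}: stretched, so AZ2 must reuse {az1}"]
    if az1.overlaps(az2):
        return [f"{traffic}: per-AZ, but {az1} and {az2} overlap"]
    return []


print(check_row("vSAN", "10.11.30.0/24", "10.12.30.0/24"))           # []
print(check_row("VM Management", "10.11.20.0/24", "10.12.20.0/24"))  # one problem reported
```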

Confirmed against the Broadcom VCF 9 design library — vSphere Stretched Cluster Model (techdocs.broadcom.com): VM Management is “shared across availability zones” (stretched); ESX Management, vMotion, vSAN, and Host TEP are each “unique per availability zone” (per-AZ, own gateway). There is no option to stretch ESX Management. The AZ1↔AZ2 link is specified as <5 ms RTT and ≥10 Gbps — the vSAN stretched-cluster limit, not the looser 10 ms generic-AZ figure.

North-south / public peering: the NSX Edge Uplink BGP sessions (the two uplink rows above, captured in the 01 BGP plan and intake B10–B16) are the north-south / public peering; there is no separate “public peering” item unless you run a distinct public / DMZ transit (intake B22). Decide which AZ owns ingress in steady state and how routes withdraw on an AZ failure (BGP local-pref / AS-path prepend toward the non-preferred AZ). Capture this alongside section B of the BGP plan.
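
To reason about which AZ owns ingress, it can help to model the upstream router's choice with just the two attributes named above. The sketch below is a toy, not a BGP implementation: real best-path selection has more steps, and the `Advert` type, the ASN 65001, and the values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Advert:
    az: str
    local_pref: int
    as_path: tuple[int, ...]


def best_path(adverts: list[Advert]) -> Optional[Advert]:
    """Pick the advert with the highest local-pref, then the shortest AS path."""
    if not adverts:
        return None
    return max(adverts, key=lambda a: (a.local_pref, -len(a.as_path)))


# Steady state: sfo02 prepends its AS twice, so ingress lands on sfo01.
sfo01 = Advert("sfo01", 100, (65001,))
sfo02 = Advert("sfo02", 100, (65001, 65001, 65001))
print(best_path([sfo01, sfo02]).az)  # sfo01

# AZ1 failure: sfo01 withdraws its route, leaving sfo02 as the only path.
print(best_path([sfo02]).az)  # sfo02
```

Writing the intended winner for each failure case this way gives the network team something concrete to compare against the routing policy they configure.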

Public peering is normally a workload-domain concern, not management. The management domain’s Edge uplinks peer with the internal ToR fabric for management routing — not a public network. Public / upstream / DMZ peering (internet-facing or published routes) normally lives on the workload-domain edges, where tenant workloads need external reachability. It applies to the management domain only if your design deliberately routes a published service through the mgmt edges. In multi-AZ, whichever domain hosts the public peering must survive an AZ loss the same way: stretched under Centralized or per-AZ under Distributed, with the surviving AZ advertising the public prefixes and the failed AZ withdrawing them.


E. Storage policy & capacity

| Item | Setting / implication |
|---|---|
| Site disaster tolerance | Dual site mirroring (data copy in each AZ) = PFTT 1 |
| Local protection (per site) | SFTT: RAID-1 FTT=1, or RAID-5/6 if host count allows |
| Raw capacity | ~2× usable (full copy per AZ) + local FTT overhead |
| Host count | Even, matched per AZ; enough per AZ to satisfy the local RAID rule |
| Witness capacity | Metadata only; no usable capacity contribution |

Worked example: 20 TB usable with dual-site mirror + local RAID-1 (FTT=1) needs ~20 TB in each AZ before local mirroring. Local RAID-1 then roughly doubles that to ~40 TB raw per AZ, or ~80 TB raw across both AZs. Size hosts for this up front; it is the most common stretched surprise. Confirm against the Management Domain Sizing sheet.
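
The worked example reduces to two multiplications, sketched here so other usable targets can be tried quickly. The function name is hypothetical, the default overhead of 2.0 models local RAID-1 (FTT=1), and the result ignores slack space and any other reserve the sizing sheet adds.

```python
def raw_capacity_tb(usable_tb: float, local_overhead: float = 2.0, sites: int = 2) -> dict[str, float]:
    """Estimate raw capacity for dual-site mirroring plus a local protection overhead."""
    if usable_tb < 0 or local_overhead < 1:
        raise ValueError("usable_tb must be >= 0 and local_overhead must be >= 1")
    per_az = usable_tb * local_overhead  # full data copy in each AZ, protected locally
    return {"per_az_tb": per_az, "total_tb": per_az * sites}


# The example above: 20 TB usable with local RAID-1.
print(raw_capacity_tb(20))  # {'per_az_tb': 40.0, 'total_tb': 80.0}
```

Pass a lower `local_overhead` to model local RAID-5/6, using the overhead factor that applies to your vSAN architecture and host count.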


F. DNS / NTP additions

On top of the records in 01-network-dns-plan.md:

  • A + PTR for the witness appliance (sfo-wit01.sfo.example.io)
  • A + PTR for every AZ2 host, with addresses in the AZ2 subnets
  • NTP reachable from all three sites (AZ1, AZ2, witness)
  • Prefer independent time sources per site (different fault domains — see prerequisites.md)
  • DNS resolvers reachable from the witness site
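
A forward-and-reverse lookup is a quick way to confirm the A + PTR pairs above before bring-up. The sketch below uses Python's standard `socket` resolver calls; the function name is hypothetical, and the lookups are injectable so the check can be exercised without a live DNS server. Run it from a machine that uses the same resolvers as the site being checked.

```python
import socket
from typing import Callable


def check_forward_reverse(
    fqdn: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], str] = lambda ip: socket.gethostbyaddr(ip)[0],
) -> list[str]:
    """Return a list of problems with the A and PTR records for one FQDN."""
    try:
        ip = forward(fqdn)
    except OSError as exc:  # covers socket.gaierror
        return [f"{fqdn}: A record lookup failed ({exc})"]
    try:
        name = reverse(ip)
    except OSError as exc:  # covers socket.herror
        return [f"{fqdn}: PTR lookup for {ip} failed ({exc})"]
    if name.rstrip(".").lower() != fqdn.rstrip(".").lower():
        return [f"{fqdn}: PTR for {ip} points to {name}"]
    return []


# Example with stand-in resolvers; drop the two extra arguments to query real DNS.
fake_a = {"sfo-wit01.sfo.example.io": "10.13.10.5"}
fake_ptr = {"10.13.10.5": "sfo-wit01.sfo.example.io"}
print(check_forward_reverse("sfo-wit01.sfo.example.io", fake_a.__getitem__, fake_ptr.__getitem__))
```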

G. Ownership matrix

| Area | Owner | Sign-off |
|---|---|---|
| Inter-AZ RTT/bandwidth (C) | Network | |
| AZ↔witness RTT + witness placement (B) | Network + Architect | |
| Stretched vs per-AZ subnets (D) | Network | |
| North-south egress / BGP failover (D) | Network | |
| Storage policy + per-AZ capacity (E) | Storage + Architect | |
| Witness OVA download + deploy (B) | Platform | |
| DNS/NTP for AZ2 + witness (F) | AD/DNS/NTP | |

Sign-off

Once A–G are filled and signed by the owners above, feed the results back into the single-AZ planning docs:

  • AZ2 + witness subnets → the VLAN/subnet table in 01-network-dns-plan.md
  • Witness + AZ2 host FQDNs → the DNS section of 01-network-dns-plan.md
  • Stretched answer + host counts → intake A13/A14 in 02-intake.md

Then continue with the normal workbook fill. A stretched build that clears this page will not surprise you at bring-up.