andrew@lab:~$

Switch project, part 1

2025-05-08 18:00

One of my longest-running projects has been an open hardware Ethernet switch. This has been one of the key driving forces behind many of my other projects, such as ngscopeclient and the high speed probes. It was also the project that got me into high speed digital design.

So I figured it’s time to kick off a series with a short writeup of where things are now, how we got there, and what’s coming next. If you follow me on Mastodon you’ve probably seen most of this in bits and pieces but I wanted to collect it all in one place.

Ancient history: The first switch

The first generation never had a name; it was just called “open-gig-switch” or something in my Subversion repository (this was circa 2012, before I was primarily running on git).

I couldn’t find any many-port gigabit switch ASICs that were suitable for an OSH project (i.e. purchasable in qty 1, no NDA needed for datasheet, etc). So an FPGA-based from-scratch design seemed the only option.

Dark purple PCB with a mini USB port, quad RJ45, and Spartan-6 FPGA on it

This board pushed a lot of limits for me… my first use of switching power supplies rather than LDOs, my biggest and most complex board to date, and I think maybe even my first use of an RGMII PHY (I had previously used the Microchip ENC424J600, a full 10/100 Ethernet MAC + PHY, including buffer memory, attached via a parallel bus or SPI interface).

And at the time my fanciest piece of test equipment was a 100 MHz Rigol DS1102D. So I had no way to do signal integrity measurements on either the Ethernet differential pairs or even the 250 MT/s RGMII lines.

This one never really got anywhere. I got three of the four PHYs up and running; PHY #4 never worked (I can’t remember if it wouldn’t link up, wouldn’t pass data, or what). I tried resoldering a bunch of stuff but never got it functional.

I tried to bring up a basic switch with only 3 ports, but very quickly realized that 15K LUTs, <1 Mbit of BRAM, and no external CPU would make that very challenging. The XC6SLX25 was not a large FPGA, and adding the Xilinx DDR controller, a softcore CPU, and three MACs didn’t leave much space at all for fabric. Plus I was still a relative novice at FPGA development and wasn’t quite ready to tackle a project of this scope.

So this board ended up a dead end, but it set the stage for what was coming.

The interim years

The first switch fiasco showed me I needed more skills, better test equipment, better debug tools, and more.

I was still in grad school with little budget for equipment, but over the course of my Ph.D I became a much more experienced RTL engineer. I built a couple of boards with Ethernet that mostly worked.

But since I couldn’t afford a better scope and had no way to validate SI on faster stuff, I decided to table the switch project for a bit.

In 2015 I graduated and got a job that paid much more than I was making as a graduate teaching assistant. One of my first purchases was my first “real” oscilloscope, a 350 MHz Teledyne LeCroy WaveSurfer 3034. This was actually fast enough to do SI work on RGMII and protocol-decode 100baseTX, but I dreamed of much faster.

I knew I wanted to put 10GbE on the switch whenever I finally got back to it, and didn’t expect to be able to afford a many-GHz scope any time soon, so I started looking at alternative approaches to validate SI at these speeds. This led me down another path of yak shaving, in which I started designing a 10 GHz sampling oscilloscope named FREESAMPLE (which never got finished, but I do want to revisit the project one day). That, in turn, got put on ice when I realized a high speed scope would be useless without an equally high bandwidth probe to feed it.

You probably already know where this story goes. My open hardware 16 GHz probe (which probably deserves a post or series of its own) is now essentially done and ramping up an initial PVT run; hopefully I can actually make them in quantity and offer them to the public in the coming months.

Over the course of shaving these yaks I also bought a house, took a year or two off major projects to refurbish it, got a Sonnet EM solver seat, found a shockingly affordable 16 GHz oscilloscope on eBay (which made FREESAMPLE much less of a priority), rewrote “scopeclient” with OpenGL acceleration as “glscopeclient”, then rewrote it again in Vulkan as “ngscopeclient”. I also created protocol decodes for 10/100baseTX, 1000baseX, SGMII, QSGMII, 10Gbase-R, and a few other potentially relevant protocols.

I also started dreaming up a roadmap for a whole family of networking equipment under the randomly generated umbrella name LATENTx, using colors for sub-projects (vaguely inspired by the Lockheed HAVE BLUE). The original roadmap called for LATENTRED to be a gigabit edge switch and LATENTORANGE to be a 10G core switch, but I decided that a prototyping/technology demonstrator platform was called for first. LATENTINFRARED was too long so I went with LATENTPINK as the name.

LATENTPINK

The original LATENTRED concept had called for 24 edge ports across three 8-port line cards (because multiples of 3 were convenient for using OSHPark for fabrication). RGMII would have been a nightmare to route, since 24 lanes of RGMII need 288 pins (clock, control, 4-bit data bus, times 2 for TX/RX lanes) plus the MDIO, reset, etc. So I started looking at lower pin count options (GMII, needing 576 pins, wasn’t even on the table, as no FPGA supported by the free Vivado edition had over 520 GPIOs and anything bigger would be vastly outside my price range anyway).

The obvious “easy” option was SGMII which needs one differential pair each for TX and RX (4 pins). This would only require 96 pins for 24 PHYs, and no parallel bus timing constraints to worry about (just matching P/N of each diff pair). So I started looking at the TI DP83867.

But the project had dragged on long enough that by the time I was ready to make hardware for LATENTPINK it was early 2023 and a new option was on the table. Vitesse, who had previously been on my “naughty list” of companies who will never get a design win from me due to developer-hostile practices like locking datasheets for their most boring products behind NDAs, had been bought by Microsemi in 2015, who had then been bought by Microchip in 2018. As part of this a lot of parts got opened up and I decided they deserved early release for good behavior… which meant that their 12-port QSGMII PHY, the VSC8512, was now on the table. This would allow 24 ports with only six transceiver links (12 diff pairs / 24 pins), a massive reduction from SGMII.

The only problem was, I had never worked with QSGMII before and the VSC8512 needed a fair bit of register configuration to work properly, so I wanted to hedge my bets a bit. This resulted in LATENTPINK, which had one VSC8512 and two DP83867s for a total of 14x 1G edge ports, plus a single 10G SFP+ uplink and a dedicated RGMII management port (using my tried-and-true preferred PHY, the KSZ9031RNX) for the SSH interface.

Blue PCB with a row of RJ45 connectors at the front, a SFP+ cage out the back, and several large BGAs in the middle

This was my second 8-layer board (the first for a fully personal rather than work-related project), one of my first uses of the Murata MYMGK modules I now use everywhere, and one of my first designs using FPGA transceivers. It was also my first attempt at pairing an STM32H7 with an FPGA (over quad SPI, because I hadn’t yet learned how much of a disaster the H7 OCTOSPI peripheral was, but it was fine because I was just doing manual register reads/writes and not trying to memory map it).

I got the VSC8512 working fine, got the MCU talking to the FPGA (slowly), and used the platform to build out my SSH server and a bunch of the other building blocks the full switch would need.

The LATENTPINK board contained an external QDR-II+ SRAM which was used as a packet buffer for a shared-memory switching fabric. All incoming packets were written to small per-port CDC FIFOs, then popped round-robin from these FIFOs and written into a region of the QDR serving as a large data FIFO for that port.

On the far side, the forwarding engine would pick a source port round-robin, then read that port’s QDR FIFO to pop one packet. The output data stream from the QDR was then routed to all of the small per-port exit queues, with write enables gated according to which port(s) the packet was to be forwarded to.
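
To make the round-robin part a bit more concrete, here is a minimal SystemVerilog sketch of a source port selector in this style. The names and parameters are hypothetical, not the actual LATENTPINK RTL; it just scans for the next port with a packet waiting, starting from the port after the one most recently serviced.

    // Hypothetical round-robin source port selector (illustration only, not the
    // actual LATENTPINK forwarding engine RTL).
    module RoundRobinSelector #(
        parameter NUM_PORTS = 15
    )(
        input  logic                         clk,
        input  logic [NUM_PORTS-1:0]         port_ready,     // packet waiting in this port's QDR FIFO
        input  logic                         pop,            // forwarding engine consumed a packet this cycle
        output logic [$clog2(NUM_PORTS)-1:0] selected_port
    );

        logic [$clog2(NUM_PORTS)-1:0] last_port = 0;

        // Scan from the farthest candidate down to the nearest, so the last
        // assignment (the closest ready port after last_port) wins. A real
        // implementation would avoid the modulo and use a rotate plus priority
        // encoder instead.
        always_comb begin
            selected_port = last_port;
            for (int i = NUM_PORTS; i >= 1; i--) begin
                if (port_ready[(last_port + i) % NUM_PORTS])
                    selected_port = (last_port + i) % NUM_PORTS;
            end
        end

        always_ff @(posedge clk) begin
            if (pop)
                last_port <= selected_port;
        end

    endmodule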

This design used a home-grown register structure for the control plane and another custom bus for the data plane, both of which had issues. The control plane bus didn’t support any kind of distributed address decoding, so I had to put all of the register logic in one place and route the decoded SFRs all over the chip. The data plane bus was 32 bits wide for RX and the 10G TX path, but only 8 bits wide for 1G TX. This added some annoying complications to the design, and the nonstandard nature meant I couldn’t easily reuse FIFOs and other blocks written for other projects.

There were also a couple of PCB bugs, most notably a bad pinout resulting in the upper row of ports only linking up in 10/100 mode until I reworked them (a massive pain given that the swap had to be performed on an inner signal layer, either 3 or 6 depending on the port, in a fairly confined space). I validated the fix on one or two ports but didn’t rework the entire board.

Overall, I got it to the point that it was functional: it could pass packets, had port-based VLANs, and could decode inbound 802.1Q tags (but not synthesize outbound tags on trunk ports). At that point I decided I had proven out the tech stack enough that I was ready to build the real thing.

What next?

The other major finding from LATENTPINK was that, at least the way I had architected the fabric, the XC7K160T was a bit cramped for my plans. While the 14+1 port design was comfortable, I didn’t think it would be easy to fit a 24+2 port switch into it. The next largest 7-series part (the XC7K325T) was not available in the FBG484 package, so I’d have to go up to FFG676; more annoyingly, it wasn’t supported by the free Vivado edition (requiring a $3K software license) and was hugely more expensive ($2260 vs $435 as of this writing).

Between these two cost adders, I’d be looking at a $5K increase in project cost to jump to the bigger FPGA and build one prototype. My long term goal was 96 ports, so to replace all of my legacy Cisco switches I’d be looking at $3K of software + 4x $2K = $11K more to build the entire batch of switches with the 325T vs the 160T. And this would be on top of the already high costs of doing 8-10 layer PCB fab in low volume, the PHYs and RJ45s, custom sheet metal work for the chassis, etc.

By the time I was ready to start thinking about building LATENTRED seriously, though, it was 2024. 7 series was getting pretty long in the tooth and UltraScale and UltraScale+ had been out for a while.

I considered building a switch around the Artix UltraScale+, specifically the largest one, the XCAU25P. It sells for less than the XC7K160T, $380 for the -1 speed grade in FFVB676 as of this writing. It’s two process nodes newer (16 nm vs 28 nm) so the fabric is much faster. And at 141K LUTs it’s a fair bit larger than the 101K of the 7K160T, although well below the 203K of the 7K325T. Other specs were also mostly in between: 12 transceivers vs 8 or 16, etc. But it was light on block RAM, 10.5 Mb vs 11.7 or 16, and I expected to need a lot of FIFOs in the design, so that would get used up quickly.

While I did buy a pair of AU25Ps, before I could design a board I was tipped off by a friend to a batch of Kintex UltraScale+’s, specifically the XCKU5P, on AliExpress for a mere $55 each. He had tested one from the seller and they appeared to be legitimate, although likely salvaged/reballed from some scrapped equipment.

This was a major game-changer for the project’s direction. The XCKU5P is the largest FPGA supported by the free Vivado license, at 216K LUTs (just larger than the 7K325T), 16 transceivers (matching the 325T), 16.9 Mb of block RAM (slightly larger than the 325T), plus another 18 Mb of UltraRAM, a new kind of large SRAM block optimized for large buffers. The transceivers were also 28 Gbps capable, enabling 25/100G Ethernet rather than only 10/40G. They retail for $2972 in the commercial temperature grade or $3350 in the industrial (which the ones I got were), so I was quite happy with scoring them for less than 2% of list price!

The only problem was that with the new capabilities came scope creep. The 16 transceivers on the KU5P were enough to support twelve lanes of QSGMII (48 baseT ports) plus either a single 40/100G uplink or up to four 10/25G uplinks. And it seemed a shame not to use the full capabilities of such an expensive FPGA that had fallen into my lap for cheap.

LATENTRED

Planned hardware architecture

The new concept for LATENTRED is a much more powerful switch than originally planned. It will be a 1U switch with two 24-port line cards (two VSC8512s per line card) and dual 10/25G SFP28 uplinks.

I had initially thought about doing four 12-port cards, but found that 6- and 12-port magjacks with AC-isolated center taps (required by the VSC8512’s voltage mode MDI drivers) were hard to find, while 8-port ones were available from LINK-PP. The smallest port count evenly divisible by both 8 and 12 is 24, and a 2x12-port line card would just barely fit in my reflow oven.

The overall switch will consist of five, possibly six, PCBs:

  • The 48V -> 12V intermediate bus converter (IBC)
  • Power distribution / switching board
  • Two 24-port dual VSC8512 line cards
  • Switch engine board with XCKU5P, STM32H735, management PHY, serial port, etc.
  • Possibly a separate board with the SFP28 uplinks connected to the switch engine by cables, depending on whether the chassis mechanical layout makes this easier than a monolithic design

The original concept had also called for the line cards to be connected by rigid interconnects in a daisy-chain style (i.e. all SGMII/QSGMII lanes would enter one side of each line card, which would tap several lanes off to its local PHYs and route the rest out the other side). This would result in long multi-gigabit signals running across the full 19” chassis width through several connectors, likely requiring the use of a higher cost, low-loss substrate.

While browsing the Samtec catalog, though, I discovered the AcceleRate series, more specifically the AcceleRate Slim ARC6 (cable) / ARF6 (board) series. These are twinax differential interconnects with 8 to 24 differential pairs per cable, rated for 32 Gbps NRZ or 64 Gbps PAM4, so the insertion loss performance is… only slightly overkill for 5 Gbps QSGMII links. One eight-pair cable can handle the six pairs required for the three QSGMII lanes on a VSC8512, with two pairs to spare.

The big advantage of a “flyover” style interconnect like this is that there’s no need for a low loss PCB material or for routing lots of diff pairs long distances where they would conflict with other routing. Just put a couple of connectors close to the BGA and leave the rest of the board area free for other stuff.

ARC6 is also good enough that I could use it as a flyover to a remote SFP28 connector if that makes the chassis layout easier.

Current hardware state

The IBC has already been designed and used in other projects, so that’s done. We can forget about it, other than possibly doing a small respin to reduce ripple from the 3.3V buck. But with the current tariff situation I’m not in a hurry (it’s a 4L 2oz board made at Multech in China). I have plenty of the boards and several populated units ready to go.

The PDU board is done. There’s not much to it: 12V in on the left 8-pin connector, 12V out on the right 8-pin, and 12V out on the bottom two 4-pins. It also passes the I2C bus and 3.3V standby rail from the IBC through, tapping them to run programmable soft load switching and voltage/current monitoring. Pretty straightforward. I have three boards (one populated), so to get my 96 ports I only need to stuff a second one with parts I have on the shelf.

Purple PCB with four Molex Mini-Fit Jr power connectors and some passives

The line card is also done. I’ve built one and it seems to work fine aside from, if memory serves me right, port 22 or 23 not linking up. Probably a solder defect, since the left/right PHY layout was basically copy-pasted, but I haven’t had time to troubleshoot. I have ten PCBs and need four to make two switches; I will have to order two more of the 8-port RJ45s from LINK-PP to stuff all of them, but I’m a ways off from that being an issue.

Blue PCB with three 8-port RJ45s and two large BGA PHYs

It’s a six-layer design on cheap Shengyi S1000-2M substrate, since the QSGMII links are short and everything else is slow. The stackup is SGPPGS, with all of the Ethernet signals on the back layer and some of the LED GPIOs routed in between the layer 4 power pours. Connection to the rest of the system is via four connectors on the back side: 12V power on a 4-pin Mini-Fit Jr, six QSGMII lanes split across two ARF6 connectors, and a 10-pin Molex PicoBlade providing access to the PHY MDIO bus, I2C to a bunch of system health sensors, and an SPI bus to the onboard STM32L431 management microcontroller (not used in current firmware) in case I want to provide remote power cycling or something like that.

The big missing piece as of now is the switch engine board itself. For the near term I’ve cobbled together a sort of sub-scale test setup containing one line card, the power system, a KU5P dev board I built last year to validate that the $55 AliExpress FPGAs weren’t bricks, and a separate board with an STM32H735 as the management processor.

Bench setup with the line card, power supply, and FPGA dev board cabled together

It wasn’t possible to route an STM32H735 with FMC directly to the FPGA to enable the fast memory-mapped interface I described in a previous post, since the dev board was built on OSHPark’s 4-layer service (I may be the first person to have put a Kintex UltraScale+, with all transceivers broken out, on a 4-layer PCB… post about that coming at some point). So instead I built a small expansion board with the MCU and a small bridge FPGA (XC7A35T), connected to the KU+ board via a 5 Gbps serial link. The MCU memory bus is bridged to APB on the Artix, which serializes the APB transactions via a simple PCIe-esque protocol (the Serial Chip to Chip Bus, which will probably get its own blog post once I’ve fine-tuned the implementation a bit), then converts back to APB on the Kintex. The end result is basically the same as if the MCU were directly wired to the Kintex, but with a bit more latency in the path.
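
To make the bridging idea a bit more concrete, here is a purely illustrative sketch of what tunneling an APB transaction over a serial link can look like. The field widths and names below are made up for illustration; they are not the actual Serial Chip to Chip Bus wire format, which will get described properly in its own post.

    // Illustrative only: capture an APB request on one side, send it across the
    // link as a small packet, replay it on a remote APB master, and return a reply.
    // Not the real SCCB framing.
    typedef struct packed {
        logic        is_write;  // 1 = APB write, 0 = APB read
        logic [23:0] addr;      // address within the remote register space
        logic [31:0] wdata;     // write data (ignored for reads)
    } sccb_request_t;

    typedef struct packed {
        logic        ok;        // transaction completed without PSLVERR
        logic [31:0] rdata;     // read data returned for reads
    } sccb_reply_t;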

The glue board also contains a passive adapter circuit to bridge from three sets of SMPM connectors on the Kintex board to an ARF6 connector so I can break out that GTY quad to talk to the line card. I have a second glue board that will bridge a QSFP28 out to an ARF6 (enabling me to use the second PHY on the line card) but haven’t had time to assemble it yet.

Planned switch engine architecture

I’m still working out some of the fine points of the implementation, but the current overall plan is a 4x4 64-bit crossbar at 400 MHz. This will give 25.6 Gbps per lane of throughput, or 102.4 Gbps total. I could theoretically downclock to 390.625 MHz if needed due to difficult timing, but the increase in margin will be small and 400 is a more convenient number for me to synthesize from 25 MHz without needing fractional-N.

The 25G uplinks will each get a dedicated crossbar port, while each line card will also get one (combined bandwidth 24 Gbps).

The 4x4 crossbar should be very efficient and performant timing-wise, since a 4:1 mux fits in a single LUT6. So a 64-bit 4:1 mux is 64 LUTs and the full crossbar 256 LUTs, with only a single level of logic in the critical path. The decision-making logic will be more complex but can run in a slower clock domain if needed, since forwarding a packet will take multiple cycles.
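
As a quick illustration of why the resource math works out this way, here is a minimal sketch of the crossbar datapath (hypothetical names, not the final LATENTRED RTL). Each bit of each output is a 4:1 mux with a 2-bit select, i.e. six inputs, so it packs into exactly one LUT6; 64 bits per output and four outputs gives the 256 LUTs.

    // Minimal 4x4 64-bit crossbar datapath sketch (illustrative, not the final RTL).
    module Crossbar4x4 #(
        parameter WIDTH = 64
    )(
        input  logic             clk,
        input  logic [WIDTH-1:0] in_data [4],
        input  logic [1:0]       out_sel [4],   // driven by the (potentially slower) decision logic
        output logic [WIDTH-1:0] out_data[4]
    );

        // One 4:1 mux per output bit, registered at the crossbar output
        always_ff @(posedge clk) begin
            for (int i = 0; i < 4; i++)
                out_data[i] <= in_data[out_sel[i]];
        end

    endmodule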

On the input side, there will need to be some kind of arbiter and some small FIFOs to take the 24x 1 Gbps streams and mux them down to a single 24 Gbps stream, as well as some clock domain crossing blocks that can probably be handled by the same FIFOs. UltraRAM is available if needed for deeper FIFOs.

On the output side, there will just be a small block RAM exit queue per port. The 25G ports will drive the crossbar exit stream straight into the FIFO, while the 1G line cards will have 24 separate exit queues (one per port) with enable gated by the destination port set (a bitmask, to allow for broadcast/multicast).
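
The gating itself is about as simple as logic gets; a sketch with hypothetical names is below. Every exit queue sees the same crossbar exit stream, and only the queues whose bit is set in the destination bitmask capture it, so broadcast and multicast just mean setting more than one bit.

    // Sketch of per-port exit queue write enable gating (hypothetical names).
    module ExitQueueGate #(
        parameter NUM_PORTS = 24
    )(
        input  logic                 crossbar_out_valid,  // crossbar exit stream has data this cycle
        input  logic [NUM_PORTS-1:0] dst_port_mask,       // bit N set = forward to port N
        output logic [NUM_PORTS-1:0] exit_fifo_wr_en      // write enables to the per-port exit FIFOs
    );

        always_comb
            exit_fifo_wr_en = dst_port_mask & {NUM_PORTS{crossbar_out_valid}};

    endmodule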

Block diagram of proposed switch architecture

Current gateware state

I successfully ported my existing 10G MAC/PCS IP over to AXI4-Stream and have it working well in tests. The 1G is almost done: the AXI conversion is finished for the RX side, but I haven’t done the TX yet.

I still have to write a 25G MAC/PCS, but that’s a ways out; I can run the uplinks at 10G and test the whole rest of the switch just fine that way (I won’t actually be lighting up the uplinks at 25G until I get a 25/100G core switch anyway, and that’s even further out).

The MAC address table from LATENTPINK will work just fine in this design with no modifications; it has more than enough capacity to keep up with the maximum theoretical packet rate I could push through the fabric.

I still have to build AXI4-Stream VLAN tag insertion/removal blocks, figure out at what point I want to drop frames with bad FCSes (probably at the point they’re written to the URAM ingress FIFO, but I’m not certain yet), and actually do all of the integration work.
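
For reference, the tag those insertion/removal blocks will be dealing with is the standard 4-byte 802.1Q header, which sits between the source MAC address and the original EtherType. In SystemVerilog terms it’s something like this (the typedef is just for illustration, not an existing block):

    // Standard 802.1Q VLAN tag: 4 bytes inserted after the source MAC
    // (frame offset 12), ahead of the original EtherType.
    typedef struct packed {
        logic [15:0] tpid;  // Tag Protocol Identifier, 0x8100 for 802.1Q
        logic [2:0]  pcp;   // Priority Code Point
        logic        dei;   // Drop Eligible Indicator
        logic [11:0] vid;   // VLAN ID (values 1-4094 usable)
    } vlan_tag_t;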

Conclusions

There’s still a lot to do but it’s been an exciting project so far and I look forward to seeing where it goes. I don’t expect to have a polished switch on the final PCB/mechanical design until probably some time in 2026 (subject to manufacturing/supply chain delays and the political situation), but I hope to be pushing packets by the summer some time.

Like this post? Drop me a comment on Mastodon