Andrew's Lab

Switch project, part 1

Thu, 08 May 2025 18:00:00 -0700

One of my longest-running projects has been an open hardware Ethernet switch. This has been one of the key driving forces behind many of my other projects, such as ngscopeclient and the high speed probes. It was also the project that got me into high speed digital design.

So I figured it’s time to kick off a series with a short writeup of where things are now, how we got there, and what’s coming next. If you follow me on Mastodon you’ve probably seen most of this in bits and pieces but I wanted to collect it all in one place.

Ancient history: The first switch

The first generation never had a name, it was just called “open-gig-switch” or something in my subversion repository (this was circa 2012 before I was primarily running on git).

I couldn’t find any many-port gigabit switch ASICs that were suitable for an OSH project (i.e. purchasable in qty 1, no NDA needed for datasheet, etc). So an FPGA-based from-scratch design seemed the only option.

This board pushed a lot of limits for me… my first use of switching power supplies rather than LDOs, my biggest and most complex board to date, and I think maybe even my first use of an RGMII PHY (I had previously used the Microchip ENC424J600 which is a full 10/100 Ethernet MAC + PHY including buffer memory attached to a parallel bus or SPI interface).

And at the time my fanciest piece of test equipment was a 100 MHz Rigol DS1102D. So I had no way to do signal integrity measurements on either the Ethernet differential pairs or even the 250 MT/s RGMII lines.

This one never really got anywhere. I got three of the four PHYs up and running, PHY #4 never worked (I can’t remember if it wouldn’t link up, wouldn’t pass data, or what). I tried resoldering a bunch of stuff but never got it functional.

I tried to bring up a basic switch with only 3 ports, but very quickly realized that 15K LUTs, <1 Mbit of BRAM, and no external CPU would make that very challenging. The XC6SLX25 was not a large FPGA and adding the Xilinx DDR controller and a softcore CPU and three MACs didn’t leave much space at all for fabric. Plus I was still a relative novice to FPGA development and wasn’t quite at the point of being ready to tackle the project.

So this board ended up a dead-end, but set the stage for what was coming.

The interim years

The first switch fiasco showed me I needed more skills, better test equipment, better debug tools, and more.

I was still in grad school with little budget for equipment, but over the course of my Ph.D I became a much more experienced RTL engineer. I built a couple of boards with Ethernet that mostly worked.

But since I couldn’t afford a better scope and had no way to validate SI on faster stuff, I decided to table the switch project for a bit.

In 2015 I graduated and got a job that paid much more than I was making as a graduate teaching assistant. One of my first purchases was my first “real” oscilloscope, a 350 MHz Teledyne LeCroy WaveSurfer 3034. This was actually fast enough to do SI work on RGMII and protocol-decode 100baseTX, but I dreamed of much faster.

I knew I wanted to put 10GbE on the switch whenever I finally got back to it, and didn’t expect to be able to afford a many-GHz scope any time soon, so I started looking at alternative approaches to validate SI at these speeds. This led me down another path of yak shaving in which I started designing a 10 GHz sampling oscilloscope named FREESAMPLE (which never got finished, but I do want to revisit the project one day). This, then, got put on ice when I realized a high speed scope would be useless without an equally high bandwidth probe to feed it with.

You probably already know where this story goes. My open hardware 16 GHz probe (which probably deserves a post or series of its own) is now essentially done and ramping up an initial PVT run. Hopefully I can actually make them in quantity and offer them to the public in the coming months.

Over the course of shaving these yaks I also bought a house, took a year or two off major projects to refurbish it, got a Sonnet EM solver seat, found a shockingly affordable 16 GHz oscilloscope on eBay (which made FREESAMPLE much less of a priority), rewrote “scopeclient” with OpenGL acceleration as “glscopeclient”, then rewrote it again in Vulkan as “ngscopeclient”. I also created protocol decodes for 10/100 baseTX, 1000baseX, SGMII, QSGMII, 10Gbase-R, and a few other potentially relevant protocols.

I also started dreaming up a roadmap for a whole family of networking equipment under the randomly generated umbrella name LATENTx, using colors for sub-projects (vaguely inspired by the Lockheed HAVE BLUE). The original roadmap called for LATENTRED to be a gigabit edge switch and LATENTORANGE to be a 10G core switch, but I decided that a prototyping/technology demonstrator platform was called for first. LATENTINFRARED was too long so I went with LATENTPINK as the name.

LATENTPINK

The original LATENTRED concept had called for 24 edge ports across three 8-port line cards (because multiples of 3 were convenient for using OSHPark for fabrication). RGMII would have been a nightmare to route, since 24 lanes of RGMII need 288 pins (clock, control, 4-bit data bus, times 2 for TX/RX lanes) plus the MDIO, reset, etc. So I started looking at lower pin count options (GMII, needing 576 pins, wasn’t even on the table, as no FPGA supported by the free Vivado edition had over 520 GPIOs and anything bigger would be vastly outside my price range anyway).

The obvious “easy” option was SGMII which needs one differential pair each for TX and RX (4 pins). This would only require 96 pins for 24 PHYs, and no parallel bus timing constraints to worry about (just matching P/N of each diff pair). So I started looking at the TI DP83867.

But the project had dragged on long enough that by the time I was ready to make hardware for LATENTPINK it was early 2023 and a new option was on the table. Vitesse, who had previously been on my “naughty list” of companies who will never get a design win from me due to developer-hostile practices like locking datasheets for their most boring products behind NDAs, had been bought by Microsemi in 2015, who had then been bought by Microchip in 2018. As part of this a lot of parts got opened up and I decided they deserved early release for good behavior… which meant that their 12-port QSGMII PHY, the VSC8512, was now on the table. This would allow 24 ports with only six transceiver links (12 diff pairs / 24 pins), a massive reduction from SGMII.
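The pin counts quoted above can be reproduced with a few lines of Python. This is only a sanity-check sketch: the per-link pin counts are the ones implied by the text (GMII at 24 pins per port follows from the 576 figure), sideband pins such as MDIO and reset are ignored, and the names `INTERFACES` and `fpga_pins` are made up for illustration.

```python
import math

# Signal pins per link and ports carried per link, as counted in the text.
# MDIO, reset, and other sideband pins are deliberately excluded.
INTERFACES = {
    "GMII":   {"pins_per_link": 24, "ports_per_link": 1},  # implied by 576 pins / 24 ports
    "RGMII":  {"pins_per_link": 12, "ports_per_link": 1},  # (clock + control + 4 data) x 2 directions
    "SGMII":  {"pins_per_link": 4,  "ports_per_link": 1},  # one diff pair each for TX and RX
    "QSGMII": {"pins_per_link": 4,  "ports_per_link": 4},  # four ports share one TX/RX pair set
}

def fpga_pins(interface: str, ports: int) -> int:
    """Return the FPGA signal pins needed to attach `ports` PHY ports."""
    spec = INTERFACES[interface]
    links = math.ceil(ports / spec["ports_per_link"])
    return links * spec["pins_per_link"]

if __name__ == "__main__":
    for name in INTERFACES:
        print(f"{name:7s} {fpga_pins(name, 24):4d} pins for 24 ports")
```

For 24 ports this gives 576, 288, 96, and 24 pins respectively, matching the numbers in the text.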

The only problem was, I had never worked with QSGMII before and the VSC8512 needed a fair bit of register configuration to work properly, so I wanted to hedge my bets a bit. This resulted in LATENTPINK, which had one VSC8512 and two DP83867s for a total of 14x 1G edge ports, plus a single 10G SFP+ uplink and a dedicated RGMII management port (using my tried-and-true preferred PHY, the KSZ9031RNX) for the SSH interface.

This was my second 8-layer board (first for a fully personal rather than work-related project), one of my first uses of the Murata MYMGK modules I now use everywhere, and one of my first designs using FPGA transceivers. It was also my first attempt at pairing a STM32H7 with an FPGA (over quad SPI because I hadn’t yet learned how much of a disaster the H7 OCTOSPI peripheral was, but it was fine because I was just doing manual register reads/writes and not trying to memory map it).

I got the VSC8512 working fine, the MCU talking to the FPGA (slowly) and used the platform to build out my SSH server and a bunch of other building blocks the full switch would need.

The LATENTPINK board contained an external QDR-II+ SRAM which was used as a packet buffer for a shared-memory based switching fabric. All incoming packets were written to small per-port CDC FIFOs then popped round robin from these FIFOs and written into a region of the QDR serving as a large data FIFO for the port.

On the far side, the forwarding engine would pick a source port round robin then read the QDR FIFO to pop one packet from the port. The output data stream from the QDR would then be written to small per-port exit queues, routed to all of them and with write enables gated according to which port the packet was to be forwarded to.
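The data flow described above can be modeled in software to make the ordering easier to follow. What follows is a behavioral toy, not the actual gateware: the class and method names are invented, the `lookup` callback is a hypothetical stand-in for the MAC table and VLAN logic, and skipping empty FIFOs during round robin is an assumption.

```python
from collections import deque

class SharedMemoryFabric:
    """Toy model: per-port CDC FIFOs -> per-port QDR FIFO regions -> exit queues."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.cdc_fifos = [deque() for _ in range(num_ports)]    # small ingress FIFOs
        self.qdr_fifos = [deque() for _ in range(num_ports)]    # large per-port regions in QDR
        self.exit_queues = [deque() for _ in range(num_ports)]  # small egress FIFOs
        self.ingress_ptr = 0   # next port the ingress side will visit
        self.forward_ptr = 0   # next port the forwarding engine will visit

    def receive(self, port, packet):
        """A packet arrives from the MAC on `port`."""
        self.cdc_fifos[port].append(packet)

    def _next_nonempty(self, fifos, start):
        """Round robin: first non-empty FIFO at or after `start`, else None."""
        for offset in range(self.num_ports):
            port = (start + offset) % self.num_ports
            if fifos[port]:
                return port
        return None

    def ingress_step(self):
        """Move one packet from a CDC FIFO into that port's QDR region."""
        port = self._next_nonempty(self.cdc_fifos, self.ingress_ptr)
        if port is None:
            return False
        self.qdr_fifos[port].append(self.cdc_fifos[port].popleft())
        self.ingress_ptr = (port + 1) % self.num_ports
        return True

    def forward_step(self, lookup):
        """Pop one packet from the QDR and write it to the selected exit queues.

        `lookup(src_port, packet)` (hypothetical) returns the set of
        destination ports for the packet.
        """
        src = self._next_nonempty(self.qdr_fifos, self.forward_ptr)
        if src is None:
            return False
        packet = self.qdr_fifos[src].popleft()
        dests = lookup(src, packet)
        # The packet is presented to every exit queue; only enabled ones store it.
        for port in range(self.num_ports):
            write_enable = port in dests
            if write_enable:
                self.exit_queues[port].append(packet)
        self.forward_ptr = (src + 1) % self.num_ports
        return True
```

A quick way to exercise it is to call `receive()` a few times, then loop on `ingress_step()` and `forward_step()` with a flooding `lookup` until both return `False`, and inspect `exit_queues`.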

This design used a home-grown register structure for the control plane and another bus for the data plane, both of which had issues. The control plane bus didn’t support any kind of distributed decoding so I had to put all the register logic in one place which required routing decoded SFRs all over the place, and the data plane bus was 32 bits wide for RX and 10G TX, while it was 8-bit for 1G TX. This added some annoying complications to the design and the nonstandard nature meant I couldn’t easily reuse FIFOs and other blocks written for other projects.

There were also a couple of PCB bugs, most notably a bad pinout resulting in the upper row of ports only linking up in 10/100 mode until I reworked them (a massive pain given that the swap had to be performed on an inner signal layer, either 3 or 6 depending on the port, in a fairly confined space). I validated the fix on one or two ports but didn’t rework the entire board.

Overall I got it to the point that it was functional: it could pass packets, had port based VLANs, and could decode inbound 802.1q tags (but not synthesize outbound tags on trunk ports). At that point I decided I had proved out the tech stack enough that I was ready to build the real thing.

What next?

The other major finding from LATENTPINK was that, at least the way I had architected the fabric, the XC7K160T was a bit cramped for my plans. While the 14+1 port design was comfortable, I didn’t think it would be easy to fit a 24+2 port switch into it. The next largest 7-series part (the XC7K325T) was not available in the FBG484 package so I’d have to go up to FFG676, but more annoyingly it wasn’t supported by the free Vivado edition (requiring a $3K software license) and was hugely more expensive ($2260 vs $435 as of this writing).

Between these two cost adders, I’d be looking at a $5K increase in project cost to jump to the bigger FPGA and build one prototype. My long term goal was 96 ports, so to replace all of my legacy Cisco switches I’d be looking at $3K of software + 4x $2K = $11K more to build the entire batch of switches with the 325T vs the 160T. And this would be on top of the already high costs of doing 8-10 layer PCB fab in low volume, the PHYs and RJ45s, custom sheet metal work for the chassis, etc.
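For reference, here is the same cost estimate worked with the exact prices quoted earlier instead of the rounded $2K per FPGA. It assumes one FPGA per switch and four 24-port switches to reach 96 ports; the function name is illustrative only.

```python
# Prices quoted in the text, in US dollars
PRICE_7K325T = 2260
PRICE_7K160T = 435
VIVADO_LICENSE = 3000  # one-time cost of the paid Vivado edition

def extra_cost(num_switches: int) -> int:
    """Extra spend for building `num_switches` with the 325T instead of the 160T."""
    per_fpga_delta = PRICE_7K325T - PRICE_7K160T
    return VIVADO_LICENSE + num_switches * per_fpga_delta

print(extra_cost(1))  # 4825: one prototype
print(extra_cost(4))  # 10300: four 24-port switches for 96 ports
```

The exact figures come out to roughly $4.8K and $10.3K; rounding the per-FPGA difference up to $2K gives the $5K and $11K used above.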

By the time I was ready to start thinking about building LATENTRED seriously, though, it was 2024. 7 series was getting pretty long in the tooth and UltraScale and UltraScale+ had been out for a while.

I considered building a switch around the Artix UltraScale+, specifically the largest one - the XCAU25P. It sells for less than the XC7K160T, $380 for the -1 speed grade in FFVB676 as of this writing. It’s two process nodes newer (16 nm vs 28) so the fabric is much faster. And at 141K LUTs it’s a fair bit larger than the 101K of the 7K160T, although well below the 203K of the 7K325T. Other specs were also mostly in between: 12 transceivers vs 8 or 16, etc. But it was light on block RAM, 10.5 Mb vs 11.7 or 16. And I expected to need a lot of FIFOs in the design, so it would go quickly.

While I did buy a pair of AU25Ps, before I could design a board I was tipped off by a friend to a batch of Kintex UltraScale+’s, specifically the XCKU5P, on AliExpress for a mere $55 each. He had tested one from the seller and they appeared to be legitimate, although likely salvaged/reballed from some scrapped equipment.

This was a major game-changer for the project’s direction. The XCKU5P is the largest FPGA supported by the free Vivado license, at 216K LUTs (just larger than the 7K325T), 16 transceivers (matching the 325T), 16.9 Mb of block RAM (slightly larger than the 325T), plus another 18 Mb of UltraRAM, a new kind of large SRAM block optimized for large buffers. The transceivers were also 28 Gbps capable, enabling 25/100G Ethernet rather than only 10/40G. They retail for $2972 in the commercial temperature grade or $3350 in the industrial (which the ones I got were), so I was quite happy with scoring them for less than 2% of list price!

The only problem is, with these new capabilities came scope creep. The 16 transceivers on the KU5P were enough to support twelve lanes of QSGMII (48 baseT ports) plus either a single 40/100G uplink or up to four 10/25G uplinks. And it seemed a shame to not use the full capabilities of such an expensive FPGA falling into my lap for cheap.

LATENTRED

Planned hardware architecture

The new concept for LATENTRED is a much more powerful switch than originally planned. It will be a 1U switch with two 24-port line cards (two VSC8512s per line card) and dual 10/25G SFP28 uplinks.

I had initially thought about doing four 12-port cards but found that 6 and 12 port magjacks with AC-isolated center taps (required by the VSC8512’s voltage mode MDI drivers) were hard to find, while 8-port ones were available from LINK-PP. The smallest port count evenly divisible by both 8 and 12 is 24, and a 2x12 port line card would just barely fit in my reflow oven.

The overall switch will consist of five, possibly six, PCBs:

The 48 -> 12V intermediate bus converter
Power distribution / switching board
Two 24-port dual VSC8512 line cards
Switch engine board with XCKU5P, STM32H735, management PHY, serial port, etc.
Possibly a separate board with the SFP28 uplinks connected to the switch engine by cables, depending on whether the chassis mechanical layout makes this easier than a monolithic design

The original concept also had called for the line cards to be connected by rigid interconnects in a daisy-chain style (i.e. all SGMII/QSGMII lanes into one side, each line card would tap several lanes off to local PHYs and route the others out the other side). This would result in long multi-gigabit signals running across the full 19” chassis width through several connectors, likely requiring the use of a higher cost low-loss substrate.

While browsing the Samtec catalog, though, I discovered the AcceleRate series, more specifically the AcceleRate Slim ARC6 (cable) / ARF6 (board) series. These are twinax differential interconnects with 8 to 24 differential pairs per cable and are rated for 32 Gbps NRZ or 64 Gbps PAM4, so the insertion loss performance is… only slightly overkill for 5 Gbps QSGMII links. One eight pair cable can handle the six pairs required for the three QSGMII lanes on a VSC8512 with two extra pairs to spare.

The big advantage of a “flyover” style interconnect like this is that there’s no need for a low loss PCB material or routing lots of diffpairs long distances that will conflict with other routing. Just put a couple of connectors close to the BGA and leave the rest of the board area free for other stuff.

ARC6 is also good enough I could use them as a flyover for a remote SFP28 connector if that makes the chassis layout easier.

Current hardware state

The IBC has already been designed and used in other projects, so that’s done. We can forget about it, other than possibly doing that small respin with reduced ripple from the 3.3V buck. But with the current tariff situation I’m not in a hurry (it’s a 4L 2oz board made at Multech in China). I have plenty of the boards and several populated units ready to go.

The PDU board is done. There’s not much to it: 12V in the left 8-pin connector, 12V out the right 8-pin, 12V out the bottom two 4-pins. It also passes the I2C and 3.3V standby rails from the IBC through, while tapping off them to run programmable soft load switching and voltage/current monitoring. Pretty straightforward. I have three boards (one populated) so to get my 96 ports I only need to stuff a second one with some parts I have on the shelf.

The line card is also done. I’ve built one and it seems to work fine aside from, if memory serves me right, port 22 or 23 not linking up. Probably a solder defect since the left/right PHY layout was basically copy pasted, but I haven’t had time to troubleshoot. I have ten PCBs and need four to make two switches; I will have to order two more of the 8-port RJ45s from LINK-PP to stuff all of them but I’m a ways off from that being an issue.

It’s a six-layer design on cheap Shengyi S1000-2M substrate since the QSGMII links are short and everything else is slow. Stackup is SGPPGS with all of the Ethernet signals on the back layer and some of the LED GPIOs routed in between layer 4 power pours. Connection to the rest of the system is via four connectors on the back side: 12V power on a 4-pin Mini-Fit Jr, six QSGMII lanes split across two ARF6 connectors, and a 10-pin Molex PicoBlade providing access to the PHY MDIO bus, I2C to a bunch of system health sensors, and a SPI bus to the onboard STM32L431 management microcontroller (not used in current firmware) in case I want to provide remote power cycling or something like that.

The big missing piece as of now is the switch engine board itself. For the near term I’ve cobbled together a sort of sub-scale test setup containing one line card, the power system, a KU5P dev board I built last year to validate that the $55 AliExpress FPGAs weren’t bricks, and a separate board with a STM32H735 as the management processor.

It wasn’t possible to route a STM32H735 with FMC to the FPGA directly to enable the fast memory mapped interface I described in a previous post, since the dev board was built on OSHPark 4 layer (I may be the first person to have put a Kintex UltraScale+, with all transceivers broken out, on a 4-layer PCB… post about that coming at some point). So instead I built a small expansion board with the MCU and a small bridge FPGA (XC7A35T), and connected it to the KU+ board via a 5 Gbps serial link. The MCU memory bus is bridged to APB on the Artix, which then serializes the APB transactions via a simple PCIe-esque protocol (the Serial Chip to Chip Bus, which will probably get its own blog post once I’ve fine tuned the implementation a bit) and back to APB on the Kintex. The end result is basically the same as if the MCU were directly wired to the Kintex but with a bit more latency in the path.

The glue board also contains a passive adapter circuit to bridge from three sets of SMPM connectors on the Kintex board to an ARF6 connector so I can break out that GTY quad to talk to the line card. I have a second glue board that will bridge a QSFP28 out to an ARF6 (enabling me to use the second PHY on the line card) but haven’t had time to assemble it yet.

Planned switch engine architecture

I’m still working out some of the fine points of the implementation, but the current overall plan is a 4x4 64-bit crossbar at 400 MHz. This will give 25.6 Gbps per lane of throughput, or 102.4 Gbps total. I could theoretically downclock to 390.625 MHz if needed due to difficult timing, but the increase in margin will be small and 400 is a more convenient number for me to synthesize from 25 MHz without needing fractional-N.
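The throughput and clock numbers above follow from simple arithmetic, sketched here in Python. Exact fractions are used to avoid floating-point noise; the constant and function names are illustrative, and nothing here models an actual PLL.

```python
from fractions import Fraction

BUS_WIDTH_BITS = 64
CROSSBAR_PORTS = 4
REF_CLOCK_MHZ = 25

def lane_gbps(clock_mhz: str) -> Fraction:
    """Throughput of one crossbar port in Gbps at the given clock (MHz)."""
    return Fraction(clock_mhz) * BUS_WIDTH_BITS / 1000

def ref_ratio(clock_mhz: str) -> Fraction:
    """Ratio between the target clock and the 25 MHz reference."""
    return Fraction(clock_mhz) / REF_CLOCK_MHZ

for clock in ("400", "390.625"):
    per_lane = lane_gbps(clock)
    total = per_lane * CROSSBAR_PORTS
    print(f"{clock} MHz: {float(per_lane)} Gbps per lane, "
          f"{float(total)} Gbps total, {float(ref_ratio(clock))}x reference")
```

At 400 MHz this gives 25.6 Gbps per lane and 102.4 Gbps total with a clean 16x ratio to the reference; 390.625 MHz gives exactly 25 Gbps per lane but a ratio of 15.625.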

The 25G uplinks will each get a dedicated crossbar port, while each line card will also get one (combined bandwidth 24 Gbps).

The 4x4 crossbar should be very efficient and performant timing-wise, since a 4:1 mux fits in a single LUT6. So a 64-bit 4:1 mux is 64 LUTs and the full crossbar 256 LUTs, with only a single level of logic in the critical path. The decision-making logic will be more complex but can run in a slower clock domain if needed, since forwarding a packet will take multiple cycles.
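To illustrate why a 4:1 mux fits in one LUT6 (four data inputs plus two select bits is exactly six inputs), the sketch below builds the 64-entry truth table and checks it exhaustively. The assignment of data and select signals to LUT address bits is an arbitrary choice made for this example, not taken from any tool output.

```python
def mux4_lut6_init() -> int:
    """Build the 64-bit truth table of a 4:1 mux mapped onto one 6-input LUT.

    Hypothetical input assignment: address bits 0-3 are data inputs d0-d3,
    address bits 4-5 are the 2-bit select.
    """
    init = 0
    for addr in range(64):
        select = (addr >> 4) & 0b11
        data_bit = (addr >> select) & 1
        init |= data_bit << addr
    return init

def lut6(init: int, addr: int) -> int:
    """Evaluate a LUT6: the output is bit `addr` of the truth table."""
    return (init >> addr) & 1

def crossbar_mux_luts(ports: int = 4, width: int = 64) -> int:
    """One LUT6 per output bit per output port (data path only)."""
    return ports * width

if __name__ == "__main__":
    init = mux4_lut6_init()
    # Exhaustively confirm the table behaves as a 4:1 mux.
    for data in range(16):
        for select in range(4):
            assert lut6(init, data | (select << 4)) == (data >> select) & 1
    print(f"INIT = 0x{init:016X}, crossbar data path = {crossbar_mux_luts()} LUTs")
```

The LUT count covers only the data muxes; the arbitration and decision-making logic mentioned above is extra.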

On the input side, there will need to be some kind of arbiter and some small FIFOs to take the 24x 1 Gbps streams and mux them down to a single 24 Gbps stream, as well as some clock domain crossing blocks that can probably be handled by the same FIFOs. UltraRAM is available if needed for deeper FIFOs.

On the output side, there will just be a small block RAM exit queue per port. The 25G ports will drive the crossbar exit stream straight into the FIFO, while the 1G line cards will have 24 separate exit queues (one per port) with enable gated by the destination port set (a bitmask, to allow for broadcast/multicast).
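A destination bitmask used as per-queue write enables might look roughly like this in a software model. All names are hypothetical, and masking out the source port is an assumption based on normal switch behavior rather than something stated above.

```python
from collections import deque

NUM_PORTS = 24  # 1G ports on one line card

def port_mask(*ports: int) -> int:
    """Build a destination bitmask with one bit set per listed port."""
    mask = 0
    for port in ports:
        mask |= 1 << port
    return mask

BROADCAST = (1 << NUM_PORTS) - 1  # every port's bit set

def write_exit_queues(queues, frame, dest_mask: int, src_port=None) -> int:
    """Offer `frame` to every exit queue; bit N of `dest_mask` is queue N's write enable.

    Returns the number of queues that stored the frame.
    """
    if src_port is not None:
        # Assumption: a frame is never sent back out the port it arrived on.
        dest_mask &= ~(1 << src_port)
    written = 0
    for port, queue in enumerate(queues):
        if (dest_mask >> port) & 1:
            queue.append(frame)
            written += 1
    return written

if __name__ == "__main__":
    queues = [deque() for _ in range(NUM_PORTS)]
    print(write_exit_queues(queues, "unicast", port_mask(5)))
    print(write_exit_queues(queues, "broadcast", BROADCAST, src_port=3))
```

Unicast, multicast, and broadcast all take the same path through this model; only the mask changes.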

Current gateware state

I successfully ported my existing 10G MAC/PCS IP over to AXI4-Stream and have it working well in tests. The 1G is almost done, the AXI conversion is finished for the RX side but I haven’t done the TX yet.

I still have to write a 25G MAC/PCS but that’s a ways out, I can run the uplinks at 10G and test the whole rest of the switch just fine that way (I won’t actually be lighting up the uplinks at 25G until I get a 25/100G core switch anyway, which is a ways out).

The MAC address table from LATENTPINK will work just fine in this design with no modifications, it has more than enough capacity for the max theoretical number of packets I could push through the fabric.

I still have to build AXI4-Stream VLAN tag insertion/removal blocks, figure out at what point I want to drop frames with bad FCSes (probably at the point they’re written to the URAM ingress FIFO but I’m not certain yet), and actually do all the integrations.
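As a byte-level reference for what the tag insertion/removal blocks need to do, here is a small Python sketch of 802.1Q tagging. It operates on whole frames without the FCS (which would have to be recomputed after any change) and says nothing about how the streaming gateware version would be structured; the function names are illustrative.

```python
import struct

TPID_8021Q = 0x8100   # EtherType value identifying an 802.1Q tag
MAC_HEADER_LEN = 12   # destination MAC + source MAC
TAG_LEN = 4           # TPID (16 bits) + TCI (16 bits)

def insert_vlan_tag(frame: bytes, vid: int, pcp: int = 0, dei: int = 0) -> bytes:
    """Insert an 802.1Q tag after the source MAC of an untagged frame (no FCS)."""
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    # TCI layout: 3-bit priority, 1-bit drop eligible, 12-bit VLAN ID
    tci = (pcp & 0x7) << 13 | (dei & 0x1) << 12 | vid
    tag = struct.pack("!HH", TPID_8021Q, tci)
    return frame[:MAC_HEADER_LEN] + tag + frame[MAC_HEADER_LEN:]

def remove_vlan_tag(frame: bytes):
    """Strip an 802.1Q tag if present; return (untagged_frame, vid or None)."""
    if len(frame) < MAC_HEADER_LEN + TAG_LEN:
        return frame, None
    tpid, tci = struct.unpack_from("!HH", frame, MAC_HEADER_LEN)
    if tpid != TPID_8021Q:
        return frame, None
    vid = tci & 0xFFF
    return frame[:MAC_HEADER_LEN] + frame[MAC_HEADER_LEN + TAG_LEN:], vid

if __name__ == "__main__":
    untagged = bytes(range(12)) + b"\x08\x00" + b"payload"
    tagged = insert_vlan_tag(untagged, vid=42, pcp=5)
    print(tagged[12:16].hex())
    print(remove_vlan_tag(tagged) == (untagged, 42))
```

Removing a tag on an access port and inserting one on a trunk port are inverses of each other, so a round trip through both functions should return the original frame.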

Conclusions

There’s still a lot to do but it’s been an exciting project so far and I look forward to seeing where it goes. I don’t expect to have a polished switch on the final PCB/mechanical design until probably some time in 2026 (subject to manufacturing/supply chain delays and the political situation), but I hope to be pushing packets by the summer some time.


Holding the Bag

Thu, 17 Apr 2025 12:00:00 -0700

This is a bit of a change from my usual content, but I’ve been sitting on this idea for a short SF story for a long time and I had to try actually writing it. I’m mostly a technical writer not a creative one so hopefully this turned out decently :)

Arrival

BEEP BEEP BEEP BEEP

It took Dr. Lynn Clarke a minute to wake up enough to register that the noise was her alarm clock. She was not a morning person, preferring late nights in the lab to coffee-fueled 8AM PHYS 1100 lectures.

“This is the new normal, better get used to it,” she thought as she ate breakfast. Her dreams of spending her career publishing IEEE papers on improving fabrication yields of THz photonic crystals were over, as with untold thousands of her peers. The collapse of government funding in the mid 2020s had been catastrophic for pure physics researchers, and really anyone working on science that wasn’t something an oil company thought looked profitable.

As if that wasn’t enough, the history of Y2K had been forgotten or ignored by Wall Street management and when 2038 came (who would have thought banks and stock exchanges would still be running their databases on 32-bit operating systems for that long?) the resulting financial apocalypse devastated what little budget had been coming from industry. The database recovery folks were raking in cash from companies desperate to salvage their records, of course, but physics faculty? Good luck.

MIT had been hit harder than some other institutions, with the endowment losing half its value in a week and most of the department being laid off at the end of the fiscal year as corporate research sponsors went bankrupt, stopped funding non-essential work, or automatic payments simply ended because the accounts payable system thought it was 1901 and nobody could be reached to fix the problem because the email server was just as corrupted.

The job at Rensselaer Polytechnic was a pay cut and meant starting over as an associate professor teaching freshmen about Newton’s laws rather than working in the cleanroom on cutting-edge research, but it would put food on the table and a roof over her head. And it beat stocking shelves overnight at the Cambridge Wal-Mart like her former department head.

After breakfast she grabbed her helmet and tire pump and went down to the garage to get her bike ready for the morning commute, checking the map for what felt like the 20th time to make sure she had the route memorized. Being delayed by a flat tire or making a wrong turn on the way to the first day of classes would be an embarrassing start to the new job. And she wanted to allow plenty of time for the trip until she got back in shape. Years of desk work and car commuting had let her once-athletic legs weaken, but she had sold the car over the summer to pay for groceries and there was no bus service to her new neighborhood (a good thing - apartments with transit access were more expensive) so it was the only option left.

As she rode down the hill towards campus, she tried to enjoy the view of the distant Hudson River as cars zipped past her every few seconds. Biking to work was going to take some getting used to.

Lost in thought, she didn’t even notice her view of the river vanish in a wall of blackness until the squealing of brakes and crunching of metal in front of her shattered her daydream. A fraction of a second later, the cacophony was drowned out by pain and confusion as she flew over the handlebars into something hard.

The Cube

“Dr. Clarke?”

Lynn woke up in a daze and looked around the room. Her left leg was in a cast and every breath hurt her chest despite the IV delivering what she assumed was some sort of pain medication. “W… What happened? I feel like I got hit by a bus!”

“Close!” the nurse responded, “You hit the bus. The Cube popped up right in front of the school bus in the next lane and you smashed into the back door. You’ve got a hairline fracture of your left fibula, three broken ribs, and some road rash but the doctor says you should make a full recovery in a few months.”

“Months? I have a class to teach… and wait a minute, what cube?” she asked.

“Oh, you must not remember what happened. That’s a normal mental reaction to traumatic events. Don’t worry, it’s not brain damage or anything - your MRI came back fine and there’s no sign of a concussion.”

The nurse reached over for the bedside TV remote and turned it on. A local news anchor stood in front of a video wall showing a helicopter view of Hoosick Street. But where there should have been a traffic light and four-way intersection a few blocks from the school, there was nothing but blackness. It appeared perfectly square and featureless, not even reflecting the lights of the tow trucks and police cars blocking the road as they removed the debris of what was clearly a massive multi-car pileup.

“Thankfully there were no fatalities, but over a dozen drivers and a cyclist were hospitalized after the Cube appeared in the middle of the road during the morning rush hour. The area has been blocked off by the FBI who refused to comment when our reporter asked them if they had any idea what the object was or where it had come from.”

The helicopter camera panned over to the Price Chopper parking lot across the street, which was lined with people pressing up against hastily erected barriers trying to get a glimpse of the object, then cut to a reporter on the ground in the parking lot.

“Hi there it’s Bob from Channel 10 News. We’re here with some of the spectators looking at the Cube.”

Bob shoved the microphone in the face of a bearded man in his 60s.

“What do you think about the Cube?”

“The army should have already blown it up. The eggs will be hatching any minute and then it’s game over!”

Bob turned to a pair of college girls behind him. “What about you?”

“This is the most exciting moment in our history. We’ve finally made contact with another species! We need to put aside our differences and show them our best side.”

Bob continued interviewing the locals. Theories ranged from a secret government project gone wrong to an unexploded alien antimatter bomb to a real-life version of the Monolith from “2001”. Members of a local religious cult had already erected a cross at one side of the parking lot and were singing and praying in hopes of being spared by whatever deity had delivered it.

As the TV droned on and on, failing to reveal any useful information other than “it’s a big black box”, she fell back asleep.

Going Home

Lynn rolled down the hallway of the hospital with her broken leg on a scooter. It had been a difficult couple of days, but she had been cleared for discharge.

“Professor?”

She looked up and saw a skinny man wearing a T-shirt featuring an anthropomorphic fox, holding a set of car keys.

“Abe Jackson. I’m one of Dr. Chan’s Ph.D students. He said you called asking for a ride? I live a few blocks from you so he asked me to help you out. I can give you a lift to campus whenever you need.”

“That would be great! I was planning on biking to school but that’s not going to be happening for a while,” she responded with a small laugh.

Abe walked her to an old blue Toyota Tercel that looked like it would fall apart if it made a sudden stop. The foam in the seats had disintegrated decades ago and had been replaced by cloth seat covers over some kind of improvised padding. The hatchback brake light dangled precariously from two strips of duct tape, and several spots on the fenders were rusted through. “Meet Old Betsy. She doesn’t look like much, but she runs. With the stipends they pay us these days…”

He started the car and threw it into gear. There was a disconcerting rattling noise from the brake light bouncing against the rear window but it otherwise didn’t sound like too much of a death trap.

“So how much have you been following the Cube situation?” he asked.

“Not much. I was too loopy from the meds to be paying much attention. What do we know about it?”

“You mean them?”

“Wait, there’s more than one now???” she asked.

“Yeah, another one popped up a day later right in the middle of a house in Costa Rica. Looked like it got hit by a tornado. Luckily nobody was home. And some guy on the internet browsing aerial photos found another in the Mojave. Who knows how long it’s been there. So that’s three that we know about so far. There might well be more in uninhabited areas that nobody’s found yet.”

“Wow. OK, what do we know about them?”

“Not a whole lot. They’re very heavy, nobody’s managed to move one of them yet. They completely absorb every wavelength of light or RF we’ve thrown at them so far and radiate weakly around 30ish GHz. The spectrum looks just like you’d expect from an ideal black-body… with a temperature of around 2.7K rather than ambient. No idea if they’re solid or hollow or what they’re made of. The governor set up a task force to focus on this one and several of us have been asked to participate. They want you involved too, once you’re feeling up to it.”

“Wait, 2.7K? 30 GHz? That sounds like the cosmic microwave background.”

“Definitely similar,” Abe replied. “We have no idea if that’s a coincidence or if there’s some kind of connection somehow.”

He pulled Betsy to the side of the road. “This is your place, right?”

“Yep, that’s it. Thanks a lot!”

“Any time. Dr. Chan said you had an 8AM lecture tomorrow so I’ll see you 7:30ish?”

“Sounds good. See you tomorrow.”

The Lab

HONK HONK

Lynn opened her front door and saw Abe waving at her from Old Betsy.

“Morning! Need a hand?”

“No, I gotta get used to moving on my own if I want to be getting anything done for the… what’s it called again?”

“Well, it was the NY Cube Task Force. But overnight one of them dropped in a village in Siberia, another in some Greek farm, and then one flattened a factory in Shenzhen squishing a bunch of workers. Now it’s the UN Cube Task Force. No changes to field operations, just means we’re sharing data more widely. Class is canceled, we’re going straight to the lab. You’re taking over as Director of THz Studies, leading all of the characterization work between 100 GHz and 20 micron far infrared.”

“That’s quite the promotion from associate professor,” she replied. “How did I get picked and not somebody else?”

“You’re new here, but you’ve done more work with high-sensitivity THz imaging sensors than anybody else. This thing is such a good absorber that we might have to fabricate custom detectors to see anything.”

A block later they pulled into the Price Chopper parking lot. Traffic had been rerouted around the back of the store to bypass the Cube, which was now hidden from view under a large tent.

Approaching the tent, they were greeted by an elderly Asian man. “Paul Chan. I’m Abe’s advisor. You must be Lynn.”

“Nice to meet you. You’ve got quite the setup here, mind giving us the overview?”

“The Cube is such a good absorber that we’re doing everything we can to improve SNR,” he replied. “There’s a full Faraday cage around it with >100 dB of attenuation from a few hundred kHz to ultraviolet - building that in 48 hours wasn’t cheap but we pulled it off. We have cryo-cooled panels we can move around it to reduce black-body radiation from the cage itself, although they do reflect in other frequency bands so we only use them when necessary. The foam is pretty decent at acoustic shielding too but we have sound absorbing panels we can use to supplement. All of the analysis and data processing is done from outside the cage so we don’t disturb the measurements.”

“So what do we know so far? Anything on surface characterization yet?”

“The surface is perfectly flat, within the limits of our measurement capability. There’s no tunneling current whatsoever in a STM, and the AFM showed no deviation at all, not even on the atomic scale. HOPG looks like sandpaper compared to this. We’re not even sure it’s a physical surface, it might just be some kind of energy field. Somebody volunteered to touch it with a bare hand and they said it felt like nothing, their hand just stopped but it didn’t feel hot or cold or rough or… like anything besides just sitting there in open air.”

“Ultrasound?”

“Zilch. No reflection or transmission at all, it’s as if we’re broadcasting into a vacuum.”

“Gravitation?”

“Hard to measure on earth, obviously. But between the fact that we haven’t been able to move it, and it’s not sucking things up like a black hole or sinking into the ground, we’re guesstimating a mass of somewhere in the 20K to 500K ton range. Our geologists say that the bedrock is pretty deep under this intersection and none of the buried sewer lines seem to have been crushed, so probably towards the lower end of that.”

“Particle radiation?”

“Nothing. Alpha, beta, and neutron detectors show nothing but noise, not even normal Earth-surface background levels. We’ve tried irradiating it in a few spots and got nothing detectable transmitted or reflected.”

“And what about the EM side?”

“So far, not a whole lot but we’ve got further there than anywhere else. It emits what appears to be uniform black-body radiation, so that’s something. We’re trying to get the most sensitive detectors we can across the entire EM band in hopes of getting some level of modulation back that we can detect. Some of them have long lead times that are hard to accelerate, so we’re trying to use cooled detectors and strong transmitters to improve SNR as much as practical in the meantime. We were hoping you might be able to continue the THz focal plane array work from your IEEE paper last year, it looks like it will outperform anything we’ve got north of 100 GHz and CNSE thinks they’ll be able to fab prototypes pretty easily… And this is your new desk. Have at it, let me know if you need anything equipment or staff wise and we’ll make it happen.”

Lynn sat down with her laptop and pulled up her notes from the nearly-forgotten project to refresh her memory.

Progress

“We got something on the last sweep!”

Lynn looked up as Abe ran excitedly up to her desk. It had been four weeks and two rushed wafer lots, but the new THz detector prototypes were in operation, part of an ongoing global campaign to try and see inside the Cube or determine what it was made of and where it came from by any means possible.

“What?”

“Yeah. Extremely narrow transmission peak right around 120 micron wavelength - 2.5 THz. They’re setting the detector to do a 360 degree scan and see if there’s any spatial pattern but I thought you’d want to know.”

She got up and walked into the RF chamber surrounding the Cube. Between the featureless black surface and the thousands of sharp pyramidal RF absorber cones lay a narrow circular track with two wheeled carts, 180 degrees apart. One held a high-power transmitter with heavy power cables dangling from the back. The other held her experimental detector, wired to a complicated ensemble of electronics. The entire setup was clearly thrown together in a hurry - boards screwed to frames made from 2x4 lumber, hand soldered jumper wires to fix missing connections, and cables secured to the frame with duct tape.

“Lynn! You’re just in time!” exclaimed Dr. Chan. “We’re all set up, let’s clear the chamber and see what we get.”

Everyone walked out of the chamber and over to the control desk on the far side of the wall. Abe closed the door and sat down at the bench as the rest of the team crowded around.

“I’ve been playing around with some open source CT scanning software and I think I’ve got a 2D slice processing flow working. Let’s see if we have enough power to see anything…”

He clicked the “start” button. A slight humming noise came from the power transformer outside and the lights dimmed briefly, then a graph began to slowly trace along the screen. As the scan finished, there was a short pause, then a grainy black and white image appeared below the graph.

“Well, it’s not solid,” Lynn said. “Definitely looks artificially constructed, too.”

A regular grid pattern of small squares was visible in the image. In between the grid points, smaller rectangular and circular objects, as well as some more irregular blobs, could be seen.

“But what’s special about this frequency? And what is this structure?” Abe wondered out loud.

“Abe, work with the lab techs to get us an elevation axis so we can do full 3D reconstructions. That will tell us a lot. Dr. Chan, take the S21 sweep over to the theory folks and let them stew on it for a while. This is the only spot we’ve seen any transmission at all, even if it’s attenuated by 90 dB. I want to know what’s special about it.”

Lynn sat down at her desk and stared at the grid image. It reminded her of something familiar but she couldn’t place it.

Answers

“We think we know what the walls are!”

Dr. Chan and several other grad students approached Lynn’s desk with a whiteboard in tow.

“So, the blackbody spectrum was the big clue. It is cosmic microwave background radiation.”

“But how could that be?” she asked. “It’s sitting right here, not in deep space.”

“Our side of the discontinuity is, yes. But after that…”

“Discontinuity? What are you talking about?”

“Our working theory is that the ‘wall’ isn’t a wall at all. It’s a jump discontinuity in space-time. Matter can’t pass through it because there’s an undefined slope rather than a smooth curve like you get around a normal point mass. You’d need an infinite force to push over the edge. EM fields get diffracted out into deep space, so anything you send in vanishes and all you see coming out is the CMB.”

“But then why are we seeing transmitted signal?”

“There’s a second discontinuity about 60 microns away from the first one, acting like a liner. The Cube is hollow. If your incident signal has a wavelength exactly twice the spacing of the discontinuities, it acts like a very high Q cavity resonator, almost like a laser. When you’ve pumped the cavity hard enough, the field strength gets to the point some of the energy can jump the discontinuity and enter the interior of the Cube. Presumably something similar happens on the exit side but we’re still working on how that bit works.”

“So if it’s hollow, what’s inside?”

“You’re… not going to believe this,” said Abe, walking in with a laptop. “The 3D reconstruction is done.”

Everyone stared in amazement as Abe tilted the point cloud slightly and the structure became clear: open aisles, separated by rows of rectangular shelving with fuzzy objects of various sizes resting on them.

“The grid of squares we saw on the 2D slice were the support pillars of these shelves. And it’s not static, either. I went back to the 1-meter elevation slice we did last week and several new objects are here that weren’t there before.”

“But… That would mean someone or something is going in and out of the Cube! We’ve had it completely surrounded the whole time,” Dr. Chan replied.

“Yes, on our side of the discontinuity,” Abe said. “We don’t know how static it is. It’s very possible that it can slide around somehow, maybe opening up some kind of portal if you’re in the right spot.”

Lynn’s face turned pale as the implications sank in. “Get me the President.”

The Bag

“White House switchboard.”

“Lynn Clarke, NY Cube Task Force division. We have a problem.”

A few minutes of holding later, the phone clicked.

“Situation Room duty officer here. I’m with the President, VP, and Secretary of Defense.”

“Mr. President, have you ever played Dungeons and Dragons?” she asked.

After a second of incredulous laughter, he replied. “This better not get out before Election Day. But yes, I was a bit of a nerd in my Harvard days. What does this have to do with the Cube, though?”

The Secretary of Defense chimed in, “Never been into that stuff. What are you getting at?”

“What about Dr. Who? The TARDIS? Just like the Bag of Holding from D&D, it’s bigger on the inside than the outside.”

“So? This is a national emergency, not an RPG convention,” the President responded angrily.

“Did you ever stop and think about where all that stuff goes? The bag is bigger inside than outside, but that means there’s a big storage room somewhere.”

“Are you saying what I think you’re saying?”

“Yes, precisely. We checked historical satellite photos, the one in the Mojave has been there for years and nobody ever got close enough to know it was there. It must have been the prototype, and now they’ve started mass production. It’s not going to stop until we find a way to get a message across to whatever parallel universe is building these things and hope they’re willing to shut down their Bag of Holding factory.”

The President sighed. “And we’re the ones left holding the bag.”

Like this post? Drop me a comment on Mastodon

STM32L431 teardown

Thu, 02 Jan 2025 14:00:00 -0800

I realized a while back that I haven’t put any silicon reverse engineering content on the new blog yet. It’s time to change that!

Today, we’ll be doing a teardown of the ST STM32L431. Why? Because it’s a part I use a lot, it’s my go-to for “I want enough flash for A/B firmware images and a bootloader and a nontrivial amount of code, but a H735 is overkill”. And, most importantly, I had one on a scrap board in my “microscope food” bin (yes, that is the actual label on it).

Quick disclaimer before we begin (no, work didn’t make me say this): if you’re a prospective customer of my day job, be advised that all of this analysis was done for fun in my garage lab with optical microscopy and basic wet etch for deprocessing. The quality of results shown here is not representative of what I can do in a real lab with proper gear for CMP, plasma etching, SEM/FIB, etc.

Some quick stats

In total, the analysis in this post took about two days (much of that waiting for imaging, not actively doing stuff). I acquired 49.2 GB of optical imagery (90436 individual image files, including focus stacked/stitched intermediates) at a range of magnifications. All of the stitched datasets (at least the ones that turned out good) can be viewed on siliconpr0n, I’ll be linking some of the more interesting ones here. If you want to see more, definitely click around all of the delayered scans though.

Delayering was done with Whink rust remover (1-3% HF per the SDS, I really should do a titration at some point to figure out the actual concentration) followed by mechanical cleaning to remove delaminated copper interconnect and vias.

Imaging used a Labsmore LIP-X1 CNC microscope with a Mitutoyo VMU optical column. Objectives used were Mitutoyo plan apo 20x/0.42 for overviews and in-process inspection and Olympus Neo SPlan 100x/0.90 for high magnification closeups.

Device Overview

The STM32L431 comes in a bunch of different packages ranging from a 49-ball WLCSP up to 100-pin LQFP. The sample seen here came from a 48-pin QFN.

It contains a Cortex-M4 with FPU capable of clocking up to 80 MHz, 256 kB of flash memory, 64 kB of SRAM, a 12-bit ADC, dual 12-bit DACs, an opamp, two comparators, and a bunch of other goodies.

Top metal

The overall die size, including scribe line, is approximately 3.132 x 3.127 mm = 9.793 mm^2.

Several features are immediately apparent:

The device is made on a fairly modern, high layer count process (Wikipedia claims 90 nm, we’ll verify that). Power routing covers most of the surface, preventing us from getting a good view of a lot of the chip.
A regular region in the northwest corner looks like it’s probably some kind of memory
The south and southwest region looks analog

Looking at the northeast corner we can see internal part number “T435A”, a 2015 die copyright, the ST logo, and a little doodle of a dolphin. It always brings me joy to see silicon artwork, which has become less common in recent years.

This is the second dolphin I’ve seen on a recent ST chip (the STM32H735 has one too). Anybody know why? Internal project codename? Design team mascot? Local sports team at one of the offices?

EDIT: According to folks at ST, it’s actually an orca, not a dolphin. The part was codenamed “Orca” during development.

Unlike some of the other STM32s I’ve looked at, this one isn’t made in house at ST’s foundry - it’s made by TSMC. The fiducial in the corner is a dead giveaway. The overall appearance is consistent with the 90nm node but we can do some more digging to be sure.

Substrate floorplan

I deprocessed the sample to bare silicon substrate (going slowly and grabbing a lot of photos on the way, we’ll get to those in due time).

While it’s a bit tricky to tell with only wet etch deprocessing and sub-optical feature sizes, the STM32L431 appears to be seven copper and one aluminum metal layers for a total of eight metal layers.

The 20x scan of substrate with no annotations is on siliconpr0n. I took a 100x scan but had significant stitch artifacts in the memories so I didn’t upload it (but you’ll see crops of the nicer regions in this post). At some point I’m going to try to re-stitch and will add a link here if it turns out well.

At the substrate layer, we can see the pad ring around the perimeter, analog in the south, large memories in the northwest, and standard cell logic in the northeast. Several small memories (marked A through D in the floorplan) are present along the edges of the analog region. SRAMs A and B use the same bitcell as SRAM1, while C and D use the same bitcell as SRAM2.

Three smaller regions of standard cells can be seen outside the main logic area, sandwiched between the flash memory and the analog region. Two of them are rectangular and use a different cell library than the rest of the chip (with a much larger row height), while the third is L-shaped and uses the same cell library as the main digital region.

Memories

SRAM1

At the north end of the die is SRAM1. This is a 48 kB single-port SRAM consisting of three identical 16 kB memory IP instances.

Each SRAM IP measures 796 μm x 201 μm (0.16 mm^2) and consists of two blocks of cells on either side of a central addressing spine, plus two spare columns for array repair between the central spine and the main east bitcell array.

Each block is 381 μm (256 columns at 1486 nm pitch) wide, and 179 μm (4 strings of 64 rows at 680 nm pitch) high for a total capacity of 64 Kbits (8 KB) at a density of 1.04 μm^2 per bit or 939 Kbits/mm^2 (not counting periphery). The overall array density including periphery is 1.22 μm^2 per bit or 800.5 Kbits/mm^2.
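As a sanity check on the arithmetic above, here is a small Python sketch that recomputes the density figures from the quoted dimensions. It assumes 1 Kbit = 1024 bits and ignores the spare columns; the variable names are mine, and the results only agree with the numbers in the text to within the rounding of the measurements.

```python
# Dimensions quoted in the text (micrometers).
BLOCK_W_UM, BLOCK_H_UM = 381.0, 179.0   # one bitcell block
IP_W_UM, IP_H_UM = 796.0, 201.0         # one full 16 kB SRAM IP
COLS = 256
ROWS = 4 * 64                           # 4 strings of 64 rows
BLOCKS_PER_IP = 2


def density(width_um, height_um, bits):
    """Return (um^2 per bit, Kbit per mm^2), taking 1 Kbit = 1024 bits."""
    area_um2 = width_um * height_um
    um2_per_bit = area_um2 / bits
    kbit_per_mm2 = (bits / 1024) / (area_um2 / 1e6)
    return um2_per_bit, kbit_per_mm2


block_bits = COLS * ROWS
ip_bits = block_bits * BLOCKS_PER_IP

array_um2_per_bit, array_kbit_mm2 = density(BLOCK_W_UM, BLOCK_H_UM, block_bits)
ip_um2_per_bit, ip_kbit_mm2 = density(IP_W_UM, IP_H_UM, ip_bits)

print(f"block capacity: {block_bits // 1024} Kbit")
print(f"array only:     {array_um2_per_bit:.2f} um^2/bit, {array_kbit_mm2:.0f} Kbit/mm^2")
print(f"with periphery: {ip_um2_per_bit:.2f} um^2/bit, {ip_kbit_mm2:.0f} Kbit/mm^2")
```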

The 1486 x 680 nm (1.01 μm^2) bitcell uses a lithography-optimized (all poly running horizontal) 6T bitcell design typical of modern planar CMOS technologies.

Dummy features appear to be present around the perimeter of the array.

SRAM2

SRAM2 is a single 16 kB block in a separate power domain from the rest of the device, which can optionally be preserved across device resets and kept active in an intermediate low-power state in which SRAM1 is not preserved (in the deepest sleep states only the 128 byte backup SRAM is preserved and contents of both SRAM1 and SRAM2 are lost). It also has optional parity (error detection only, not full SEC-DED ECC as present on some higher end STM32s).

SRAM2 is 883 x 202 μm (0.18 mm^2), roughly 11% larger than SRAM1 for the same usable capacity which aligns well with the overhead of the parity bits. Each block is 426 μm x 178 μm and, as with SRAM1, there are two spare columns for error correction.
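The parity overhead claim can be checked with a few lines of Python. This is only a sketch: it assumes one parity bit per data byte and that each SRAM2 block has the same 256 data columns as SRAM1, neither of which I have confirmed from the die itself. The 1482 nm column pitch is the SRAM2 bitcell width measured below.

```python
# Measured IP outlines from the text (micrometers).
SRAM1_W_UM, SRAM1_H_UM = 796.0, 201.0
SRAM2_W_UM, SRAM2_H_UM = 883.0, 202.0
SRAM2_BLOCK_W_UM = 426.0
SRAM2_COL_PITCH_UM = 1.482

DATA_COLS = 256            # assumption: same data organization as SRAM1
PARITY_BITS_PER_BYTE = 1   # assumption: one parity bit per byte

# How much bigger is SRAM2 than one SRAM1 instance?
area_ratio = (SRAM2_W_UM * SRAM2_H_UM) / (SRAM1_W_UM * SRAM1_H_UM)

# How much bigger should the bitcell array be with parity added?
parity_ratio = (8 + PARITY_BITS_PER_BYTE) / 8

# Predicted block width if every 8 data columns gain one parity column.
expected_cols = DATA_COLS * (8 + PARITY_BITS_PER_BYTE) // 8
expected_width_um = expected_cols * SRAM2_COL_PITCH_UM

print(f"measured area ratio:   {area_ratio:.3f}")
print(f"parity array overhead: {parity_ratio:.3f}")
print(f"predicted block width: {expected_width_um:.1f} um "
      f"({expected_cols} columns) vs {SRAM2_BLOCK_W_UM:.0f} um measured")
```

Under those assumptions the predicted block width lands very close to the measured one, which fits the parity explanation, though it does not prove it.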

The 1482 x 676 nm (1.00 μm^2) bitcell is identical in overall size to the SRAM1 bitcell within the bounds of measurement error, but has a different appearance (most notably, a different color as seen at lower magnification). This is likely due to SRAM2 being optimized for low leakage, perhaps using HVT transistors in the bitcell (although interestingly, the datasheet makes no mention of it having worse performance: perhaps it is placed closer to critical paths in the interconnect to compensate for higher clock-to-out delay? I haven’t actually run any microbenchmarks to see if there’s a pipeline register or anything in the path).

Dummy features appear to be present around the perimeter of the array.

SRAM A (tentatively flash instruction cache)

SRAM A uses the same bitcell as SRAM1, and consists of 32 rows x 256 columns (plus two spare), for a total capacity of 8 Kbits or 1 kB. Unlike the larger SRAMs, this one only has a single tile array (with the addressing logic on the west and north sides) rather than a double array with row addressing down the centerline.

The 1 kB size and 32 x 256 array structure is consistent with the I-side flash cache (32 cache lines of 4 x 64-bit words).

SRAM B (tentatively flash data cache)

SRAM B is identical to SRAM A, but only 8 rows high. Its placement very close to SRAM A, as well as its size of 2 Kbits or 256 bytes, is consistent with the D-side flash cache (8 cache lines of 4 x 64-bit words).

Neither SRAM A nor SRAM B have obvious extra bits which could be used for cache line validity state or tag. The flash is 256 kB (logically 32K 64-bit words) so assuming a fully associative cache (plausible for one this small), 15 tag bits per line would be required, plus a validity bit. This would require an additional 512 bits (for I-cache) and 128 bits (for D-cache) of tag memory, likely implemented as discrete flipflops to enable parallel tag matching.
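The tag budget above can be reproduced with a short Python sketch. It follows the same assumptions as the text (fully associative, tag wide enough to address any 64-bit flash word, one valid bit per line); the function name is mine. A design that tagged whole 32-byte lines instead would need fewer tag bits, so treat this as an upper estimate.

```python
FLASH_BYTES = 256 * 1024
WORD_BYTES = 8          # flash is read as 64-bit words
LINE_BITS = 4 * 64      # one cache line = 4 x 64-bit words
VALID_BITS = 1


def cache_budget(lines):
    """Return (data bytes, tag bits per line, total tag + valid bits)."""
    words = FLASH_BYTES // WORD_BYTES
    tag_bits = (words - 1).bit_length()   # bits needed to address any word
    data_bytes = lines * LINE_BITS // 8
    tag_store_bits = lines * (tag_bits + VALID_BITS)
    return data_bytes, tag_bits, tag_store_bits


icache = cache_budget(32)
dcache = cache_budget(8)
print("I-cache (data bytes, tag bits/line, tag store bits):", icache)
print("D-cache (data bytes, tag bits/line, tag store bits):", dcache)
```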

SRAM C

SRAM C uses the same low-power bitcell as SRAM2, and is 128 columns x 32 rows (4096 bits, 512 bytes) with two spare columns.

Its functionality is unknown, perhaps the RTC backup SRAM, CPU register file, or SDMMC TX/RX FIFO?

SRAM D

SRAM D uses the same low-power bitcell as SRAM2, and is 128 columns x 64 rows (8192 bits, 1K bytes) with two spare columns.

Its functionality is unknown.

Flash

The datasheet capacity for the flash is 256 kB with ECC (so a physical capacity of 288 kB), organized as 72 bits physical / 64 bits logical by 32768 words.

The flash memory is located in the northwest corner of the device. It consists of two blocks, one 25% longer than the other. This is consistent with 128 kB for the small half and 128 kB + 32 kB system memory (28 kB boot “ROM” and 4 kB trim/OTP region) for the large half, plus ECC.

Overall IP size is 1027 x 965 μm for the main memory array (0.99 mm^2) plus 955 x 300 μm (0.29 mm^2) for the high voltage generation block.

The lower bitcell array (128 kB logical, 144 kB / 1152 Kbit physical) measures 684 x 334 μm (0.23 mm^2). This gives a density of 0.194 μm^2 per bit or 5034 Kbits/mm^2 - 5.36x higher areal density than the 6T SRAM bitcell.
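For anyone who wants to rerun the flash density numbers, a minimal Python sketch follows. It assumes 1 Kbit = 1024 bits and uses the 939 Kbits/mm^2 SRAM1 array figure from earlier as the comparison point; small differences from the figures in the text come down to rounding of the measured dimensions.

```python
ARRAY_W_UM, ARRAY_H_UM = 684.0, 334.0   # lower flash bitcell array
PHYS_KBIT = 1152                        # 144 kB physical, including ECC
SRAM_KBIT_PER_MM2 = 939.0               # SRAM1 array density from above

area_um2 = ARRAY_W_UM * ARRAY_H_UM
bits = PHYS_KBIT * 1024

um2_per_bit = area_um2 / bits
kbit_per_mm2 = PHYS_KBIT / (area_um2 / 1e6)
ratio_vs_sram = kbit_per_mm2 / SRAM_KBIT_PER_MM2

print(f"array area:    {area_um2 / 1e6:.2f} mm^2")
print(f"flash density: {um2_per_bit:.3f} um^2/bit, {kbit_per_mm2:.0f} Kbit/mm^2")
print(f"vs 6T SRAM:    {ratio_vs_sram:.2f}x")
```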

The overall flash structure is 18 copies of a “super-column” tile, subdivided into four copies of the basic column circuit, giving a total width of 72 columns (as expected from the datasheet).

The visible wordline structures have a pitch of 1.28 μm, and the entire array appears to be 260 wordlines high (likely 256 + dummy features).

Each column appears to have 4-way muxing, giving a physical array width of 288 bits. Bitline logic within each column has a pitch of 2.24 μm, while column logic has a pitch of 9.50 μm. This gives a tile size of 1.28 x 2.24 = 2.86 μm^2.

The 288 x 256 cell array which can be inferred from this image analysis, however, only gives a capacity of 72K tiles - 16 times less than the known 1152 Kbit physical array capacity. This suggests that the actual bitcell structure is significantly smaller than what we can see here.

Backing up to the intermediate deprocessed stages, we can see some more details: the column mux logic on metal… 2, I think, contains a diagonal structure which appears to consist of 8 wires at roughly 466 nm pitch, fanning out to unseen logic further down the stack. This suggests a 466 nm bitline pitch and a physical array width of 1152 bits, not 288.

This still requires additional row logic: 1152 Kbits with 1152 bitline array width would require a 1024 wordline height, not the 256 tiles visible here. There appears to be a 4-way symmetry in the wordline logic as well, which is consistent with this.

This suggests that the actual wordline pitch must be closer to 320 nm (and the green horizontal structures seen here are not wordlines, but strings of bit cells too small to resolve optically), giving an approximate bitcell size of 320 nm x 466 nm or 0.15 μm^2, consistent with the overall array density of one bit per 0.194 μm^2.
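The chain of inferences in this section is easier to follow as arithmetic. The sketch below just replays the reasoning from the text in Python; all inputs are the measured or datasheet values quoted above, and the variable names are mine.

```python
PHYS_BITS = 1152 * 1024          # lower array, including ECC
VISIBLE_COLS = 18 * 4 * 4        # 18 super-columns x 4 columns x 4-way mux
VISIBLE_ROWS = 256               # visible horizontal structures (minus dummies)
VISIBLE_ROW_PITCH_NM = 1280
BITLINE_PITCH_NM = 466           # from the diagonal fan-out on metal 2

# Step 1: the optically visible tiles fall far short of the known capacity.
visible_tiles = VISIBLE_COLS * VISIBLE_ROWS
shortfall = PHYS_BITS // visible_tiles

# Step 2: with 1152 bitlines, how many wordlines are needed?
bitlines = 1152
wordlines = PHYS_BITS // bitlines

# Step 3: spread those wordlines over the same array height.
wordline_pitch_nm = VISIBLE_ROW_PITCH_NM * VISIBLE_ROWS / wordlines
bitcell_um2 = (wordline_pitch_nm / 1000) * (BITLINE_PITCH_NM / 1000)

print(f"visible tiles: {visible_tiles} ({shortfall}x too few)")
print(f"wordlines needed: {wordlines}, pitch {wordline_pitch_nm:.0f} nm")
print(f"inferred bitcell: {bitcell_um2:.2f} um^2")
```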

Main logic area

The digital core of the device measures 1.652 x 1.54 mm, with a small cutout (824 x 226 μm) in the northwest corner for SRAM1. This gives an overall logic area of 2.36 mm^2. At a 100% packing density of 413K gates/mm^2 from literature for an unspecified TSMC 90nm cell library this gives an upper bound of 974K gates; actual packing density will be lower (maybe around 750K NAND2 equivalents).
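The area and gate-count bound work out as follows in a quick Python sketch. The 413K gates/mm^2 figure is the literature value quoted above, not something I measured, and the result differs slightly from the number in the text depending on where the area is rounded.

```python
CORE_W_MM, CORE_H_MM = 1.652, 1.54
CUTOUT_W_MM, CUTOUT_H_MM = 0.824, 0.226   # SRAM1 cutout in the NW corner
GATES_PER_MM2 = 413e3                     # literature value at 100% packing

logic_area_mm2 = CORE_W_MM * CORE_H_MM - CUTOUT_W_MM * CUTOUT_H_MM
max_gates = logic_area_mm2 * GATES_PER_MM2

print(f"logic area:  {logic_area_mm2:.2f} mm^2")
print(f"upper bound: {max_gates / 1e3:.0f}K gates")
```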

The library height is approximately 1.95 μm.

At metal 1, we can barely resolve the ~300nm wide power tracks and almost no details within the cells are visible.

Low density logic area

A small region at the center west part of the die, approximately 792 x 323 μm (0.26 mm^2), uses a different cell library with a much larger height (approximately 3.42 μm). Without much in the way of supporting evidence, I suspect this is the backup domain logic with the RTC, backup registers, tamper logic, etc, using an extra low leakage cell library.

At metal 1, we can make out substantially more detail in the cells than with the high density library used elsewhere in the device.

Conclusions

I’m not going to do a full netlist extraction or anything, this is just a high level teardown. It focuses on memories and logic because those are the portions I’m most familiar with - if anybody knows their way around PLLs, ADCs, DACs, etc. and wants to do some analysis on the mixed signal stuff I’ll gladly throw raw data your way.

There wasn’t really any specific goal to this analysis, just spending some time getting extra familiar with a part I use all the time. Hope you enjoyed!

Intermediate Bus Converter

Tue, 15 Oct 2024 00:00:00 -0700

(Sorry for the slow updates… I’ve been busy with a lot of stuff, like getting ngscopeclient ready for the full v0.1 release at the end of the year. I haven’t stopped working on projects, just been too busy to do long-form writeups.)

When I started out with digital electronics, most of my designs ran on 5V from a barrel jack. This was fine for simple stuff, but I rapidly ran into two problems: 5V isn’t high enough for designs that use a lot of power (at least, if you want to avoid a ton of losses in cables), and barrel jacks aren’t super reliable. They rely on just a few points of contact so it doesn’t take much of a bump of the cable to cause a momentary loss of power.

Moving to 12V was the obvious way to solve this (and I did make a few boards that took 12V on a barrel jack), but I also wanted to move away from barrel jacks. Even 12V starts to become questionable for longer range power distribution at higher power levels, so a lot of datacenter-type DC buses use even higher voltages, such as 48V.

I also wanted to use a DC bus that was non-isolated (i.e. negative supply rail is at earth ground potential) since a lot of the projects I have in mind are test equipment or rackmount networking hardware that will have grounded shields on connectors. It’s important to note that I’m using +48V here, rather than the -48V power (positive supply rail at earth potential) that is commonly used in telco applications.

After a bit of digging, I found that Mean Well makes a power brick (GST280A48-C6P) that puts out ground-referenced +48V on a locking 6-pin Molex Mini-Fit Jr connector, with up to 280W output. This is enough to run quite a few of my planned gizmos (most with <25W power budget) from a single DC bus if I made some kind of DC PDU or distribution panel.

There’s just one problem: while 48V is great for long range distribution, it’s difficult to step directly from 48V to typical digital core voltages (often <1V for modern devices). You normally need to step it down to an intermediate voltage, usually something in the neighborhood of 12V, and then go from there to whatever your actual loads require.

Enter the intermediate bus converter.

Defining the requirements

At a high level, the job of an IBC is pretty simple: take in a high voltage (48V DC in my case) and step it down to a lower voltage (12V DC). But since this was going to be a system-level power supply, I wanted this to be a bit more than just a naked buck converter, so I drew up a few initial requirements:

48V DC input on a 6-pin Mini-Fit Jr compatible with the previously mentioned Mean Well brick
12V DC output on an 8-pin Mini-Fit Jr, maybe PCIe 8-pin compatible (this ultimately didn’t happen)
Remote on/off via a GPIO to support soft power on/off
Soft start to avoid excessive inrush when driving loads with a lot of input capacitance
3.3V DC auxiliary output to power rail/reset supervisors, soft power, and other standby logic
Temperature, voltage, and current sensors plus an I2C interface for querying them
Some additional EMI filtering and bulk capacitance

Version 0.1

The first iteration of the IBC was 87 x 80 mm in size, targeting the OSHPark 2-layer 2oz copper stackup. Back side was almost completely solid ground with a handful of signal net crossovers, while the front side contained an S-shaped power path with monitor/control signals around it.

This design set the general stage for all of the subsequent designs, and all future versions have retained electrical (though not always mechanical) interface compatibility: 5 pin PicoBlade connector containing the 12V enable, I2C, and 3.3V standby rail. The I2C bus contained a temperature sensor at 0x90 and the management microcontroller at 0x42.
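A note on those addresses: 0x90 is too large to be a 7-bit I2C address, so they read as 8-bit "write" addresses (7-bit address shifted left with the R/W bit cleared). Many host-side tools expect the 7-bit form instead. Assuming both values use the 8-bit convention, a tiny Python helper (hypothetical, not part of the firmware) converts between the two:

```python
def to_7bit(addr8):
    """Convert an 8-bit I2C address (R/W bit in bit 0) to its 7-bit form."""
    if not 0 <= addr8 <= 0xFF:
        raise ValueError("8-bit I2C address must be in the range 0x00..0xFF")
    return addr8 >> 1


def to_8bit_write(addr7):
    """Convert a 7-bit I2C address to its 8-bit write form."""
    if not 0 <= addr7 <= 0x7F:
        raise ValueError("7-bit I2C address must be in the range 0x00..0x7F")
    return addr7 << 1


temp_sensor_7bit = to_7bit(0x90)
mcu_7bit = to_7bit(0x42)
print(f"temperature sensor: 0x{temp_sensor_7bit:02x}")
print(f"management MCU:     0x{mcu_7bit:02x}")
```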

The input protection and power path was pretty straightforward: a socketed fuse at the input, common mode choke and ferrite bead to avoid radiating switching noise out the input, a current shunt and some bulk capacitors, then a TDK-Lambda i3a series buck module. No explicit reverse voltage or overvoltage protection, although something would probably blow the input fuse if it were reversed.

On the output side of the buck module, there’s a bunch more capacitance, a current shunt, a ferrite, and an On Semi NCP455620 controlled-slew load switch.

In parallel with the main power path I put a 3.3V LDO to run the management logic and some voltage dividers to monitor the 48 and 12V voltage levels, plus a pair of AD8218 current shunt amplifiers to convert the shunt readings to voltages I could feed to the MCU (a STM32L031).

Version 0.2

The v0.1 IBC had one major problem: one of the tracks from the output of the buck module to the first capacitor was a 0.125 mm track that was supposed to be a marker for an eventual zone fill, but I never added the copper pour! It functioned fine once I bodged a piece of copper wire across this path.

It worked well enough from a power perspective, but gave very noisy current measurements. After a bit of digging I realized the problem: the ADC bandwidth was high enough that it was picking up switching ripple through the current shunts.

So I made version 0.2 which fixed the missing zone fill and added a low-pass filter between the current shunt amplifiers and the MCU ADC.
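To give a feel for how much a simple filter helps here, the sketch below models a first-order RC low-pass in Python. The resistor, capacitor, and switching frequency are placeholder values picked purely for illustration; they are not the values on the actual board, and the real filter may not even be a single-pole RC.

```python
import math

# Hypothetical values for illustration only, NOT taken from the real design.
R_OHMS = 1_000.0
C_FARADS = 1e-6
F_SWITCH_HZ = 300e3   # assumed converter switching frequency


def rc_cutoff_hz(r_ohms, c_farads):
    """-3 dB corner frequency of a first-order RC low-pass."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)


def rc_attenuation_db(f_hz, r_ohms, c_farads):
    """Magnitude response in dB of a first-order RC low-pass at f_hz."""
    fc = rc_cutoff_hz(r_ohms, c_farads)
    return -10.0 * math.log10(1.0 + (f_hz / fc) ** 2)


fc_hz = rc_cutoff_hz(R_OHMS, C_FARADS)
ripple_db = rc_attenuation_db(F_SWITCH_HZ, R_OHMS, C_FARADS)
print(f"corner frequency: {fc_hz:.0f} Hz")
print(f"attenuation at {F_SWITCH_HZ / 1e3:.0f} kHz: {ripple_db:.1f} dB")
```

The tradeoff is response time: a lower corner frequency rejects more ripple but makes the current reading slower to follow real load changes.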

Version 0.3

After this fix, everything worked great except that I realized I had derped and put the 48V current shunt in a high dI/dt path, causing it to pick up switching noise.

So I made version 0.3, the final iteration of the first generation IBC.

Version 0.4

The i3a module had one major problem: efficiency. It ran hot (necessitating forced air cooling even at fairly light loads), and its ~3W idle power unloaded resulted in awful efficiency at the ~10W output levels required by the trigger crossbar.

While exploring alternatives, I came across the Murata MYC0409. This is a rather unique switching DC-DC architecture in that it doesn’t use an inductor like a typical buck converter. Instead, it uses a charge pump and switches a series of capacitors around.

This has one significant downside: it produces an unregulated, ratiometric output that is a fixed integer division of the input. Essentially it consists of a series of switching transistors and four internal capacitors; they are connected in series and allowed to charge off the input supply then connected in parallel and allowed to discharge into the load.

But this isn’t a huge deal for a converter intended to primarily drive fans and other DC-DC converters, and the ~800 mW idle power consumption was a huge draw compared to the i3a.
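A rough Python sketch of why idle power matters so much at light load, and what the ratiometric output means in practice. It treats idle power as the only loss, so the efficiency numbers are optimistic upper bounds rather than measurements, and the divide-by-four ratio is inferred from the four-capacitor description and the 48V to 12V conversion.

```python
def ideal_output_v(v_in, ratio=4):
    """Unregulated charge pump output: a fixed integer fraction of the input."""
    return v_in / ratio


def idle_limited_efficiency(p_out_w, p_idle_w):
    """Best-case efficiency if idle power were the only loss."""
    return p_out_w / (p_out_w + p_idle_w)


P_OUT_W = 10.0   # roughly what the trigger crossbar needs
eff_i3a = idle_limited_efficiency(P_OUT_W, 3.0)   # ~3 W idle
eff_myc = idle_limited_efficiency(P_OUT_W, 0.8)   # ~800 mW idle

print(f"48 V in -> {ideal_output_v(48.0):.1f} V out (tracks the input)")
print(f"i3a at 10 W:     at most {eff_i3a:.1%}")
print(f"MYC0409 at 10 W: at most {eff_myc:.1%}")
```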

As part of the revamp, I switched from the STM32L031 to the L431 in order to get more flash so I could support a bootloader and A/B firmware slots, enabling field updates if that ever became necessary. The 32 kB of the L031 was a little small to fit two copies of the firmware plus a bootloader.

I also took this opportunity to move version 0.4 to its own repository since it’s a common component, not part of the trigger crossbar project.

The legacy v0.1-0.3 line still lives in the trigger crossbar repo history, but I have no plans to continue development of it at this time.

Version 0.5

v0.4 had a few teething troubles. For starters, as soon as I applied power it exploded.

More precisely, the LTC4367 did. It looks like when power was first applied I started getting a bit of current going down the supply leads, through the input side common mode choke that I had put there to suppress potential common mode EMI, hit the LTC4367 input, then it had nowhere to go since there wasn’t much input capacitance upstream. The end result was inductive spikes peaking at close to 100V amplitude which was enough to cause the LTC4367 to pop.

I tried a few fixes without success while troubleshooting (blowing a second LTC4367 in the process), then simply removed the entire input protection subsystem to test, at which point it worked like a charm.

Weeks later, I discovered that these spikes had also damaged the 1M ohm frontend on channel 3 of that scope - presumably exceeding the V/F derating for my 10x R-C probe and overloading the input. I’ll get it repaired eventually, but the 50 ohm frontend still works fine and that’s the one I use more often so I’ll probably just red-tag the channel in the meantime.

I saved the blown LTC4367s and will try to decap them at some point and see if there’s obvious damage to the dies. If it’s not too expensive I’ll try to get one X-rayed or even CT scanned, it’ll be cool to see what happened to the pin 1 bond wire and how much carnage is inside the package.

So I made one final version 0.5 which removed the CMC and LTC4367 in favor of a ferrite and TPS16630. This version also adds a few more TVS diodes and other protections against overvoltage and ESD.

This version works great and is on one of my prototype boards now, and I will probably be building a few more for testing soon.

Version 0.6 coming?

As of now, v0.5 is current. There’s one minor annoyance: the 3.3V standby rail switcher is fed by the 12V output after the ferrite bead, and could perhaps do with a small input capacitor to reduce high frequency transients. The end result is that there are high frequency switching spikes injected into the 12V rail. Measuring with a current probe shows no corresponding spikes in current drawn by the load (unsurprising) so I don’t think this will have a huge impact on EMC and I’m just doing prototypes at this stage anyway.

My current plan is to use up all ten of the v0.5 boards I’ve built making protos of various equipment, then when I run out do a v0.6 spin with this fix and any other EMC or performance related tweaks I might want to make after having used the thing for a while in the lab.

Characterization

The only thing left to do was more extensive performance characterization.

This provided a good opportunity to throw together a large filter graph to demonstrate some of the multi-instrument capabilities of ngscopeclient.

The test is fully automated, with a scalar-stairstep filter ramping load from 0 to 6A in 100 mA steps, waiting 30 seconds between steps for thermals to stabilize, and the resulting data is plotted as values vs load current.
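To put numbers on the sweep: this is not the ngscopeclient filter itself, just a minimal sketch of the arithmetic behind a 0 to 6A stairstep in 100 mA steps. The function names are made up for illustration, and it assumes every setpoint (including the first and last) dwells for the full 30 seconds.

```cpp
#include <cstdint>

// Number of setpoints in an inclusive stairstep sweep (all values in milliamps).
// Integer milliamps avoid accumulating floating point error from step to step.
constexpr uint32_t StepCount(uint32_t startMa, uint32_t stopMa, uint32_t stepMa)
{
    return ((stopMa - startMa) / stepMa) + 1;
}

// Load setpoint for a given step index, in milliamps
constexpr uint32_t SetpointMa(uint32_t startMa, uint32_t stepMa, uint32_t index)
{
    return startMa + (index * stepMa);
}

// Total run time if every setpoint dwells for dwellSec seconds (assumption)
constexpr uint32_t SweepSeconds(uint32_t steps, uint32_t dwellSec)
{
    return steps * dwellSec;
}

static_assert(StepCount(0, 6000, 100) == 61, "0 to 6 A in 100 mA steps is 61 setpoints");
```

Under that assumption, SweepSeconds(61, 30) works out to 1830 seconds, so one full characterization pass takes roughly half an hour.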

The final characterization setup used 16 channels of data from four physical instruments (R&S power supply, Siglent load, R&S multimeter, Teledyne LeCroy oscilloscope), plus on-board sensor data streaming via SWO to ngscopeclient (I’ll probably do a more comprehensive post about this flow once I’ve fine tuned it a bit), being processed by 54 different filter graph blocks to produce the final curves.

Total loss (power in minus power out) starts at about 800 mW with no load, increasing to just over 6W at max load. This is a massive improvement over the 3W idle power of the old design.

Efficiency is over 80% at very low load levels, reaching 90% at under 1A. Max efficiency (after correcting for losses in the wiring harnesses) is over 95% from 2-3A before falling to around 93% at 6A.

Output voltage does sag a fair bit at high loads, from 12V at no load down to about 11.1V at 6A. This is measured at the load and includes losses in the wiring harness, so typical chassis deployments with shorter wires would avoid a bit of this drop, but the droop is inherent to a non-regulating ratiometric converter like this (since any ESR in the output path is not being compensated by any kind of feedback network). So this isn’t something you’d want to use for a precision 12.00V rail, but as an intermediate rail that’s just feeding a bunch of buck converters it’s totally fine.
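As a back-of-the-envelope check, the droop can be modeled as a single lumped series resistance. This is a simplification (it assumes the sag is linear in load current, and lumps the harness in with the converter), and the function names are just for illustration.

```cpp
// Linear droop model: Vout(I) = Vnoload - I * R

// Effective series resistance implied by two measured operating points
constexpr double EffectiveResistanceOhms(double vNoLoad, double vLoaded, double iLoadAmps)
{
    return (vNoLoad - vLoaded) / iLoadAmps;
}

// Output voltage the same model predicts at some other load current
constexpr double PredictedVout(double vNoLoad, double rOhms, double iLoadAmps)
{
    return vNoLoad - (iLoadAmps * rOhms);
}

// 12 V at no load sagging to 11.1 V at 6 A implies about 150 milliohms total
constexpr double kDroopOhms = EffectiveResistanceOhms(12.0, 11.1, 6.0);
static_assert((kDroopOhms > 0.149) && (kDroopOhms < 0.151), "about 0.15 ohm");
```

With that figure, the model predicts about 11.55V at 3A, measured at the load through the same harness.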

Realistically, I’m a long ways from needing 6A on any of my designs anyway. And the MYC0409 supports paralleling up to 4 modules for better performance under high load, so if I’m ever going to design something that power hungry I’d probably scale up the IBC to match.

Output ripple is only a few mV RMS, but around 330 mV p-p because of the spikes from the 3V3_SB switcher I mentioned previously.

Thermals look excellent. With no heatsink in still air the module did reach close to 100C at full load, but a small heatsink and a small amount of airflow was sufficient to keep temperatures below 40C over the entire test program.

Conclusions

I’m pretty happy with how the IBC turned out. I will eventually be doing one final board spin to fix the 3V3_SB switching spikes but I’m in no rush to do so, it’s good enough for my in-house use despite the spikes.

It’s a lot more efficient and runs cooler than the older design, has plenty of headroom for my designs to get bigger, and can scale to multiple converters if I need even more power handling capacity.

But for now, it’s going to be powering most of my large prototypes moving forward. Look forward to seeing it appearing in lots of projects coming up!

Like this post? Drop me a comment on Mastodon

ITCM and veneer adventures

Wed, 31 Jul 2024 23:30:00 -0700

In a previous post I discussed how I reached >500 Mbps of iperf3 UDP performance on an embedded STM32 + FPGA platform.

Unfortunately, the improvements were more brittle than I thought: the loop unrolling seems to have perturbed away the problem, rather than being the cause of the performance gain. Tiny, unrelated changes like adding a print statement to boot logic would result in the performance dropping by about 50%.

After some exploration I became convinced that the problem had something to do with the exact layout of the hot path in the TCP/IP stack and iperf code in memory. Loops aligning (or failing to align) to cache lines, thrashing between multiple hot functions competing for the same cache line, etc.

Cache investigations

I spent quite a bit of time reading up on the STM32H735 and Cortex-M7 memory hierarchy, noting such interesting tidbits as the data cache being 4-way set associative while the instruction cache is only 2-way associative, while both have a fixed 32-byte cache line (matching the flash write block size of the STM32H735).

I wrote some scripts that parsed objdump output and output a spreadsheet that showed (to the best of my knowledge) which cache lines each function in the firmware could be allocated to, but didn’t find any obvious hot-path conflicts in the slow firmware that weren’t also in the fast version. BSP_MainLoop and APBEthernetInterface::GetRxFrame might have conflicted, but that’s on the RX path and the UDP transmit path would be unaffected (and with 2-way cache associativity both could be in cache at once).
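The core of that kind of analysis is mapping an address to a cache set. The sketch below is not the actual script, just the address arithmetic, and the geometry is an assumption: a 32 kB instruction cache with 2 ways and 32-byte lines (check the cache size for your specific part), indexed by plain modulo on the address.

```cpp
#include <cstdint>

// Assumed instruction cache geometry (cache size is an assumption, verify for your part)
constexpr uint32_t kCacheBytes = 32 * 1024;
constexpr uint32_t kWays       = 2;
constexpr uint32_t kLineBytes  = 32;
constexpr uint32_t kSets       = kCacheBytes / (kWays * kLineBytes);    // 512 sets

// Cache set that the line containing addr maps to
constexpr uint32_t CacheSet(uint32_t addr)
{
    return (addr / kLineBytes) % kSets;
}

// Number of cache lines touched by a function occupying [start, start + len), len > 0
constexpr uint32_t LinesSpanned(uint32_t start, uint32_t len)
{
    return ((start + len - 1) / kLineBytes) - (start / kLineBytes) + 1;
}

// Addresses 16 kB apart (one "way" worth of memory) land in the same set
static_assert(CacheSet(0x08000000) == CacheSet(0x08004000), "same set, different tag");
```

With only two ways, three hot functions whose lines all map to the same set would evict each other, which is the kind of conflict the spreadsheet was looking for.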

Ultimately, this ended up being a waste of time. I didn’t find any obvious thrashing I could avoid by relocating specific code.

So I decided to pull out the big hammer and move all of the hot functions over to Instruction Tightly Coupled Memory (ITCM). Ethernet frame data, stack, and a couple of other critical pieces of state were already in DTCM.

ITCM setup

If you’re not already familiar with the fine points of the Cortex-M7 bus architecture, you might be surprised to learn that it’s actually a sort of Harvard architecture (separate instruction and data buses), although not strict Harvard since crossovers (execution from D-side bus and data accesses to I-side bus) are permitted with a performance penalty.

The Cortex-M7 in the STM32H735 has a total of five separate memory buses. Slightly simplifying, these are:

AXI requester interface for interfacing to flash, bulk SRAM, and external memory buses (FMC and OCTOSPI)
AHB requester interface for interfacing to most peripherals
AHB completer interface allowing an external DMA IP to access the ITCM/DTCM SRAMs
Dual channel 32 bit SRAM interface to DTCM (two separate buses with independent control signals)
64 bit SRAM interface to ITCM

The Cortex-M7 architecture allows arbitrarily high latency for TCMs, however the STM32H735 datasheet states that these are zero-wait-state memories. I’m not clear on if this means no latency beyond that of an L1 cache miss, or if it’s truly single cycle access (i.e. TCM access is as fast as an L1 hit) but either way, it’s the fastest you can get deterministically.

The STM32H735 has 64-256 kB of ITCM. No, this isn’t a range from model to model within the family, it’s dynamically configurable via the TCM_AXI_SHARED register. Essentially the physical topology is four 64 kB SRAM blocks plus some muxes allowing three of the blocks to be switched onto the AXI or ITCM buses (the first block is always ITCM).

Actually configuring my firmware to use the ITCM was straightforward. I’m sure there’s a different process if you’re using the ST toolchain but I’m using my stm32-cpp library which did not yet support ITCM.

First, on the linker script side, I added a memory region for the 64 kB ITCM assuming TCM_AXI_SHARED = 2’b00 (as was the default, which I hadn’t changed).

ITCM(RWX):          ORIGIN = 0x00000000, LENGTH = 64K

Yes, the ITCM is mapped at the all-zeroes address. Meaning a null pointer actually points to the beginning of ITCM, not an unmapped address!

So when I defined the actual ITCM section in the linker script, I left a blank space at the beginning unused (to make null pointers point at empty memory rather than executable code). This isn’t perfect but filling this space with 0xCC or similar in the future would enable use of a null pointer to be easily detected.

.tcmtext : ALIGN(32)
{
    __itcm_romstart = LOADADDR(.tcmtext);
    __itcm_start = .;

    /* DEBUG: block off first 256 bytes of ITCM since it's mapped at 0 and we want to catch null derefs */
    . += 256;

    *(.tcmtext)
    __itcm_end = .;
} > ITCM AT> FLASH

This is pretty similar to how you’d specify something like the .data section, in that it lives both in flash and SRAM and has to take up space in both memories.

The next step was to add an initialization hook to make sure that all of these initialized SRAMs actually got properly set up at run time. In the case of newlib on ARM, this is done by a global function “hardware_init_hook” that is called by _start prior to main() or - importantly - __libc_init_array (which calls constructors on global variables before main() is invoked).

extern "C" void hardware_init_hook()
{
    //Copy .data from flash to SRAM (for some reason the default newlib startup won't do this??)
    memcpy(&__data_start, &__data_romstart, &__data_end - &__data_start + 1);

    #ifdef HAVE_ITCM
        //Copy ITCM code from flash to SRAM
        memcpy(&__itcm_start, &__itcm_romstart, &__itcm_end - &__itcm_start + 1);
        asm("dsb");
        asm("isb");
    #endif

    //Initialize the floating point unit
    #ifdef STM32H7
        SCB.CPACR |= ((3UL << 20U)|(3UL << 22U));
    #endif
}

Nothing particularly out of the ordinary here, the only tricky bit is making sure to do this init in the hook rather than in main(), which would be too late if any of the TCM functions were called by constructors of global objects.

At this point, the only remaining step was to actually put the hot functions in ITCM. Straightforward enough:

#ifdef HAVE_ITCM
__attribute__((section(".tcmtext")))
#endif
void APBEthernetInterface::SendTxFrame(EthernetFrame* frame, bool markFree)
{

I tested with one or two functions, and after fixing a few typos in the linker script everything was happy and it worked.

So I started walking my way through the call graph of the hot path in the iperf test, pushing about 4 kB of the most speed-critical functions into ITCM to see if I could get the consistent high performance I was aiming for. Recompiled the firmware, flashed it to the board and…

BOOM SEGFAULT

localadmin@fmctest# [2024-07-31T23:20:15.9219] Ready
Hard fault
    HFSR  = 40000000
    MMFAR = 00000000
    BFAR  = 00000000
    CFSR  = 00010000
    UFSR  = 00000001
    DFSR  = 00000002
    MSP   = 2001ff20
    (register dump continues)

Investigating the crash

I rebuilt the firmware in debug mode and, of course, the firmware didn’t crash with either -Og or -O0. So we’re looking at some kind of heisenbug, great.

I attached gdb to an -O3 binary at the crash and was somewhat confused. It was segfaulting while trying to pop a FIFO. This was code I had been using for years and, weirder still, the crashing function wasn’t even in ITCM (so it shouldn’t have been affected by any of these changes).

HardFault_Handler () at /ceph/fast/home/azonenberg/code/misc-devboards/fpga-stm32-ifaces/firmware/main/vectors.cpp:309
309             while(1)
(gdb) bt
#0  HardFault_Handler () at /ceph/fast/home/azonenberg/code/misc-devboards/fpga-stm32-ifaces/firmware/main/vectors.cpp:309
#1  <signal handler called>
#2  FIFO<EthernetFrame*, 8ul>::Pop (this=0x20006a50 <g_ethIface+24400>) at /ceph/fast/home/azonenberg/code/misc-devboards/fpga-stm32-ifaces/firmware/main/../../../embedded-utils/FIFO.h:106
#3  APBEthernetInterface::GetTxFrame (this=0x20000b00 <g_ethIface>) at /ceph/fast/home/azonenberg/code/misc-devboards/fpga-stm32-ifaces/firmware/main/../../..//staticnet/drivers/apb/APBEthernetInterface.cpp:133
#4  0x08001fc8 in EthernetProtocol::GetTxFrame (this=0x240006f4 <InitIP()::eth>, type=type@entry=ETHERTYPE_ARP, dest=...)
    at /ceph/fast/home/azonenberg/code/misc-devboards/fpga-stm32-ifaces/firmware/main/../../..//staticnet/net/ethernet/EthernetProtocol.cpp:142
#5  0x08000bbe in ARPProtocol::SendQuery (this=0x24000710 <InitIP()::arp>, ip=...) at /ceph/fast/home/azonenberg/code/misc-devboards/fpga-stm32-ifaces/firmware/main/../../..//staticnet/net/arp/ARPProtocol.cpp:53
#6  0x080026e2 in IPv4Protocol::OnAgingTick (this=<optimized out>) at /ceph/fast/home/azonenberg/code/misc-devboards/fpga-stm32-ifaces/firmware/main/../../..//staticnet/net/ipv4/IPv4Protocol.cpp:279
#7  0x08001ffe in EthernetProtocol::OnAgingTick (this=<optimized out>) at /ceph/fast/home/azonenberg/code/misc-devboards/fpga-stm32-ifaces/firmware/main/../../..//staticnet/net/ethernet/EthernetProtocol.cpp:169
#8  0x08007b64 in BSP_MainLoopIteration () at /ceph/fast/home/azonenberg/code/misc-devboards/fpga-stm32-ifaces/firmware/main/mainloop.cpp:111
#9  0x08007902 in BSP_MainLoop () at /ceph/fast/home/azonenberg/code/misc-devboards/fpga-stm32-ifaces/firmware/main/../../..//common-embedded-platform/core/main.cpp:118
#10 0x08007936 in main () at /ceph/fast/home/azonenberg/code/misc-devboards/fpga-stm32-ifaces/firmware/main/../../..//common-embedded-platform/core/main.cpp:87

The disassembly didn’t show anything obviously wrong at a glance.

(gdb) frame 2
#2  FIFO<EthernetFrame*, 8ul>::Pop (this=0x20006a50 <g_ethIface+24400>) at /ceph/fast/home/azonenberg/code/misc-devboards/fpga-stm32-ifaces/firmware/main/../../../embedded-utils/FIFO.h:106
106             objtype Pop()
(gdb) disas
Dump of assembler code for function _ZN20APBEthernetInterface10GetTxFrameEv:
   0x000001fc <+0>:     push    {r3, r4, r5, lr}
   0x000001fe <+2>:     add.w   r4, r0, #20480  @ 0x5000
   0x00000202 <+6>:     ldrb.w  r3, [r4, #3960] @ 0xf78
   0x00000206 <+10>:    cbnz    r3, 0x24c <APBEthernetInterface::GetTxFrame()+80>
=> 0x00000208 <+12>:    blx     0x1000 <__EnterCriticalSection_veneer>

But wait, what’s that “veneer” function?

After some googling I determined this was a thunk added by the linker. Thumb branch instructions encode the destination as a PC-relative offset of limited range (about ±16 MB for a 32-bit BL), and reaching anything further away needs a different instruction sequence for a far jump. There’s a compiler option you can specify to generate far jumps at the call site, but by default the toolchain tries to save a few bytes of code size by putting the far jump in a thunk and doing a near call to the thunk (thus allowing the far call’s instruction bytes to be reused across many call sites).

This made sense, since GetTxFrame was in ITCM while EnterCriticalSection was an assembly helper in .text. (It might make sense to move to ITCM since it’s small and called frequently, but that’s an optimization question and quite orthogonal to why my firmware is segfaulting.)
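To make the range problem concrete, here is a minimal sketch of the check a linker effectively has to do. The ±16 MB limits are the architectural range of a Thumb-2 BL; the flash address used in the example is hypothetical (I don't know exactly where EnterCriticalSection landed), while 0x208 and 0x1000 are the call site and veneer addresses from the disassembly above.

```cpp
#include <cstdint>

// A Thumb-2 BL reaches from -16777216 to +16777214 bytes relative to PC,
// where PC reads as the address of the BL instruction plus 4
constexpr int64_t kBlMinOffset = -16777216;
constexpr int64_t kBlMaxOffset = 16777214;

constexpr int64_t BranchOffset(uint32_t from, uint32_t to)
{
    return static_cast<int64_t>(to) - (static_cast<int64_t>(from) + 4);
}

// True if a direct BL from "from" cannot reach "to" and a veneer is required
constexpr bool NeedsVeneer(uint32_t from, uint32_t to)
{
    return (BranchOffset(from, to) < kBlMinOffset) || (BranchOffset(from, to) > kBlMaxOffset);
}

// ITCM call site at 0x208 calling into flash (0x08000400 is a made-up target address)
static_assert(NeedsVeneer(0x00000208, 0x08000400), "ITCM to flash is about 128 MB away");
```

ITCM starts at 0x00000000 and flash at 0x08000000, so any call between the two is far outside BL range, while the call from 0x208 to the veneer at 0x1000 is comfortably within it.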

So the obvious next step was to look at the veneer.

(gdb) disas __EnterCriticalSection_veneer
Dump of assembler code for function __EnterCriticalSection_veneer:
   0x00001000 <+0>:     bfcsel  0, 0x1a40, 2, ne
   0x00001004 <+4>:     vsub.i32        d16, d12, d0
End of assembler dump.

I had never heard of bfcsel so I took a look in the ARMv7-M architecture spec… and was rather confused and shocked to not find it. A bit more research showed that this was an ARMv8.1-M instruction. So why was I getting one in my ARMv7-M binary?

I wasn’t sure if I was looking at a compiler code generation bug, a gdb/binutils disassembler bug, or something else so I tried opening the binary in IDA (perks of working in security, always good disassemblers on hand).

Everything looked fine in the outer function, so I moved on to the veneer.

Bad code generation

This was definitely not right. IDA didn’t detect it as a function, and IDA’s disassembly didn’t match gdb’s (neither made any sense). So some kind of compiler or linker code generation issue.

For comparison, at -O0, I got “ldr.w pc, [pc]” followed by the 32-bit jump destination, which made complete sense. But for some reason at higher optimization levels we get this garbage instruction.

Looking back at the registers in the crash dump, the CPU agrees with this: the hard fault was actually a usage fault, I just wasn’t getting the usage fault handler called since I never set one up and it got promoted to a hard fault.

UFSR is 0x0000_0001 “undefined instruction executed”, so what I thought was a segfault was actually the embedded equivalent of a SIGILL.
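For reference, decoding these two registers is just a few bit tests. The sketch below uses the ARMv7-M bit positions (UFSR is the top 16 bits of CFSR, FORCED is bit 30 of HFSR) and the values from the crash dump above; the helper names are my own.

```cpp
#include <cstdint>

// HFSR bits
constexpr uint32_t kHfsrVectTbl = 1u << 1;     // fault on vector table read
constexpr uint32_t kHfsrForced  = 1u << 30;    // configurable fault escalated to hard fault

// UFSR bits (UFSR is CFSR[31:16])
constexpr uint32_t kUfsrUndefInstr = 1u << 0;  // undefined instruction executed
constexpr uint32_t kUfsrInvState   = 1u << 1;  // invalid EPSR state, e.g. Thumb bit clear
constexpr uint32_t kUfsrNoCp       = 1u << 3;  // coprocessor (FPU) access while disabled

constexpr uint32_t UfsrFromCfsr(uint32_t cfsr)
{
    return cfsr >> 16;
}

constexpr bool IsEscalatedFault(uint32_t hfsr)
{
    return (hfsr & kHfsrForced) != 0;
}

constexpr bool IsUndefinedInstruction(uint32_t cfsr)
{
    return (UfsrFromCfsr(cfsr) & kUfsrUndefInstr) != 0;
}

// Values from the crash dump: HFSR = 40000000, CFSR = 00010000
static_assert(IsEscalatedFault(0x40000000), "usage fault escalated to hard fault");
static_assert(IsUndefinedInstruction(0x00010000), "UNDEFINSTR is set");
```

Something like this in the hard fault handler's print path would have pointed at an illegal instruction rather than a bad memory access right away.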

The function I was calling seemed normal enough. All it does is turn off interrupts and return the old PRIMASK value so you can restore it later on. But why was I getting invalid code generation for far calls to it, and not to anything else?

.globl EnterCriticalSection
EnterCriticalSection:
    mrs     r0, primask
    cpsid   i
    bx      lr

After several hours of bashing my head at search results in confusion, I came across this stackoverflow post.

I still don’t understand what’s going on, but adding the .type declaration to my function fixed it. This seems like the kind of thing that should have been caught at link time (the linker knows I’m compiling with -mcpu=cortex-m7 so it shouldn’t generate an instruction that doesn’t make sense for that, and if I try to call a symbol that doesn’t have a valid/odd address for thumb code, it should error out rather than generating invalid machine instructions).

Another night gone, now back to other stuff…

So what is a Long Thing?

Wed, 31 Jul 2024 19:30:00 -0700

If you’ve read some of my previous articles or seen any of my fediverse posts about my PCB projects, you’ve probably seen this critter hiding somewhere on a board. I use the doodle as a signature on any design that has enough free space for it.

But what is this thing exactly? I’ve had people ask about it thinking it was all manner of things, “slug with a party hat” being one of my favorite (wrong) guesses. “Unicorn slug” was another common one.

So, let’s start with the short answer before digging into the details: It’s a 7-dimensional alien monster called a Long Thing. And the appendage on top of its head is an octopus-like tentacle, not a party hat.

Where did it come from?

I did a thread about this on the bird site years ago but that’s ancient history so I thought it was time to write it down in a more permanent location.

When I was in like 6th or 7th grade, I shared a triple bunk bed (bought at a garage sale) with my two younger brothers. The ladder had been lost or broken years before we got it, so to get up to the top bunk I had to kind of shimmy up the frame.

At the time I was in the middle of a teen growth spurt and my weight hadn’t caught up to my height, so I was super skinny. As I was “oozing” my way up the railing to get into bed, one of my brothers saw me from a shallow angle on the bottom bunk which made me look even more tall and wiry. He commented that I was “so long it made him sick”.

This quote rapidly escalated into complex lore.

So what are Long Things like?

Physically, they’re 7-dimensional, soft-bodied, alien monsters - somewhat like slugs but much larger (maybe not quite as big as Jabba the Hutt from Star Wars, but easily as big as a person).

They have octopus-like tentacles on the sides of their body which they use like arms, plus a smaller tentacle on the top of the head reminiscent of a unicorn’s horn.

Long Things exude a mildly radioactive slime called “long juice” from their skin, which is used much like slug slime to lubricate surfaces the Long Thing is oozing over. It glows green in the darkness and has an appearance very much like stereotypical movie toxic waste. If a surface is contaminated by long juice, it takes a “long time” to clean up.

In addition to the usual four (three spatial + one temporal) dimensions familiar to humans, Long Things have two other dimensions whose properties are not well understood. Finally, their 7th dimension is measured in “gallons” (not to be confused with common 3-dimensional gallons) and determines how long the Long Thing is.

They have “amazing eyelids” that, despite the name, are more like a slug’s eye stalks. The stalks are very flexible and can stretch for significant distances, so that the Long Thing can see around corners or into confined spaces without needing to move its body. When raising the eyelid, it makes a squeaking noise like a rusty door hinge. (Nobody quite knows why, one would assume the long juice would lubricate it and result in no sound, but that’s how the lore goes!)

Since they have few if any solid bones and their body is predominantly soft tissue, they can squeeze through very tiny holes. Some sources claim that a Long Thing can ooze through a hole the size of an American quarter.

I asked my wife (an artist) to draw one based on descriptions given by me and my brothers, and this was the result.

What do Long Things do all day?

The lore changed a bit over time.

Originally, Long Things were more like domestic animals than sentient beings in their own right. In this version, Long Things would be stored in garden hoses when not in use (because there’s no way that a dog crate or similar cage would be sufficient for keeping such a slippery critter contained). If the Long Thing wasn’t fully hose-trained and tried to escape, a sumo wrestler would be called to force them back into the hose. Long Things were available for rent from garden centers or home improvement stores, just like how in some parts of the world you can rent a hungry goat to eat unwanted vegetation.

Any time we visited one of my aunts who lived across the street from a garden center, my brothers would caution me to be on the lookout for sumo wrestlers trying to hunt down the escaped Long Thing.

In later versions, Long Things have human-level intelligence and are often employed as mechanics. They can send their eyelids through tiny holes into complex machines without needing to fully dismantle them or use a borescope camera. Once a stuck part in the guts of the equipment has been identified, the Long Thing can lubricate it with long juice and hopefully bring the machine back into working order much faster than a human mechanic (who would need to take the whole thing apart and put it back together to perform the same repair).

For similar reasons, many Long Things work as plumbers. They can ooze slowly down a pipe until a clog is reached, clear it using a combination of long juice and poking it with their tentacles, then ooze their way out. No snake, no tools.

What if you don’t want Long Things leaving slime trails around?

Easy, just buy a can of Anti-Long Juice, better known by the trade name Al’s Juice.

Many years ago, an Arabian prince named Sheikh Al-Zaiyah (whose name came from our best interpretation of some gibberish one of my brothers mumbled in his sleep as we tried to wake him up) was sitting in a tent in the desert when a Long Thing oozed in.

He started throwing anything within reach at the Long Thing trying to chase it away but nothing worked. As his tent got more and more covered in long juice he finally became so frustrated he tossed his drink (a mixture of date juice and some secret flavors) at the Long Thing, immediately causing it to ooze away in terror.

Al-Zaiyah quickly recognized the commercial potential of his discovery and began to bottle the beverage as a Long Thing repellent. It was originally known as Al-Zaiyah’s Juice, but as it became more popular with western customers it was rebranded to Al’s Juice.

Contrary to popular belief, neither long juice nor Al’s Juice contains antimatter, nor is there any risk of a violent annihilation reaction if the two come into contact.

Memory mapping improvements

Sun, 28 Jul 2024 01:30:00 -0700

In my previous post I demonstrated initial testing of a memory mapped FPGA-MCU interface bridging from the STM32 FMC peripheral to an APB bus on a Xilinx FPGA.

To stress this interconnect a bit, I pushed 284 Mbps of iperf3 traffic over it. But that wasn’t enough, I knew I could go further.

Clock boost

I started by tweaking some timing constraints, re-synthesizing the FPGA bitstream, and up-clocking the FMC bus by 10%, from 125 to 137.5 MHz. I tried 150 originally but couldn’t get reliable read capture on the STM32 side; I think I need to use a more complex timing setup with separate PLL phases for launch and capture and this might need a faster FPGA to account for the reduced setup timing window when moving data from the PCLK domain to the launch clock.

But the faster bus had almost no impact on performance (throughput went from 284 to 286 Mbps). So I figured the bus was probably not the bottleneck and put no additional effort into closing timing at higher frequencies.

Initial performance measurements

I started out by adding some instrumentation to the FPGA to see how busy things actually were. (Tip for folks mostly doing networking/control type stuff on Xilinx FPGAs that don’t need a lot of multipliers: DSP48 blocks are great for 48-bit performance counters and come with basically zero area cost if your design doesn’t use them for anything else - I sprinkle them around like candy while optimizing! You just need the logic to implement readout of results but this is comparatively small, you can fit a 16:1 mux in a single 7-series slice.)

The APB bus load was quite low (around 27% of cycles occupied by transactions even with the iperf running), which makes sense given that the APB is 32 bits wide with a parallel address while the FMC is 16 with multiplexed address. Thus, unless there are other sources of APB traffic in the system, the internal APB will never be the bottleneck.

The FMC bus, however, was also not maxed out (only about 34% of cycles with CS# asserted, 48% busy after accounting for the two-cycle minimum latency with CS# high between transactions). This included about 9.3M register writes and 48K reads, so we can conclude that the majority of bus traffic is from pushing traffic to the Ethernet transmit buffer and not e.g. polling for inbound frames or doing other things the firmware does.

But since the bus is idle so much of the time, the bottleneck must be on the MCU.

Checksum optimization

I ran some profiling on the MCU (using the OpenOCD “profile” command) and determined that the UDP checksum calculation (no hardware offload in the current configuration, I may eventually try pushing to FPGA but that’s a separate issue) was using a fair bit of CPU time.

Here’s what it looked like starting out.

uint16_t IPv4Protocol::InternetChecksum(uint8_t* data, uint16_t len, uint16_t initial)
{
    //Sum in 16-bit blocks until we run out
    uint16_t* data16 = reinterpret_cast<uint16_t*>(data);
    uint32_t checksum = initial;
    while(len >= 2)
    {
        //Add with carry
        checksum += __builtin_bswap16(*data16);
        checksum = (checksum >> 16) + (checksum & 0xffff);

        data16 ++;
        len -= 2;
    }

    //Add the last byte if needed
    if(len & 1)
    {
        checksum += __builtin_bswap16(*reinterpret_cast<uint8_t*>(data16));
        checksum = (checksum >> 16) + (checksum & 0xffff);
    }
    return checksum;
}

A lot of the common optimized software implementations depend on 2x 32-bit SIMD operations or similar, which aren’t available on Cortex-M, so I think I’m stuck using a 16 bit datapath. There’s probably some room for loop unrolling if needed.

But the most obvious optimization was to move the carry reduction to the end of the loop rather than reducing every iteration.

uint16_t IPv4Protocol::InternetChecksum(uint8_t* data, uint16_t len, uint16_t initial)
{
    //Sum in 16-bit blocks until we run out
    uint16_t* data16 = reinterpret_cast<uint16_t*>(data);
    uint32_t checksum = initial;
    while(len >= 2)
    {
        checksum += __builtin_bswap16(*data16);

        data16 ++;
        len -= 2;
    }

    //Add the last byte if needed
    if(len & 1)
        checksum += __builtin_bswap16(*reinterpret_cast<uint8_t*>(data16));

    //Handle carry-out
    while(checksum > 0xffff)
        checksum = (checksum >> 16) + (checksum & 0xffff);
    return checksum;
}

This improved things slightly: up to 311 Mbps. FMC bus load was now 52% busy, but there was a lot more room.
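Deferring the fold is safe here because len is a uint16_t: at most 32768 sixteen-bit words get summed, and 32768 × 0xffff still fits comfortably in the 32-bit accumulator. As a sanity check, here is a portable sketch of the same deferred-carry algorithm (not the firmware version: it assembles big-endian words byte by byte instead of using __builtin_bswap16 on a cast pointer, and the function name is mine), verified against the worked example from RFC 1071.

```cpp
#include <cstdint>

// Ones' complement sum of big-endian 16-bit words, folding carries once at the end.
// Like the firmware version, this returns the folded sum without inverting it.
constexpr uint16_t InternetChecksumPortable(const uint8_t* data, uint16_t len, uint16_t initial)
{
    uint32_t checksum = initial;

    //Sum whole 16-bit words, high byte first
    uint32_t i = 0;
    for(; i + 1 < len; i += 2)
        checksum += (static_cast<uint32_t>(data[i]) << 8) | data[i + 1];

    //Odd trailing byte is treated as the high half of a zero-padded word
    if(i < len)
        checksum += static_cast<uint32_t>(data[i]) << 8;

    //Fold carries back into the low 16 bits
    while(checksum > 0xffff)
        checksum = (checksum >> 16) + (checksum & 0xffff);
    return static_cast<uint16_t>(checksum);
}

// Example bytes from RFC 1071: 0001 + f203 + f4f5 + f6f7 = 2ddf0, folds to ddf2
constexpr uint8_t kRfc1071Example[] = {0x00, 0x01, 0xf2, 0x03, 0xf4, 0xf5, 0xf6, 0xf7};
static_assert(InternetChecksumPortable(kRfc1071Example, 8, 0) == 0xddf2, "RFC 1071 example");
```

Since the function is constexpr, the static_assert runs at compile time and costs nothing in the firmware image.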

DMA

The other thing the initial profiling pointed me at was the transmit logic, so I decided to implement DMA using the amusingly named MDMA (“master DMA”) peripheral on the STM32. I’m still refactoring some of my setup and abstraction code to make it a bit cleaner, but it’s functional now using raw register writes.

The DMA setup took some effort to get right because the frame data lives in DTCM (which is accessible to MDMA but not the regular DMA cores) and is intentionally misaligned by 16 bits (buffer pointer ends in 0x2, 0x6, 0xa, or 0xe). This is so that after accounting for the length of the 14 byte Ethernet frame header (6 bytes each src/dst MAC address and 2-byte ethertype), all of the upper layer protocol fields are aligned to 32 bit boundaries for easy processing by the TCP/IP stack.
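The alignment trick is easy to verify with a couple of lines of arithmetic. The addresses below are arbitrary examples in the DTCM range, not the actual buffer locations.

```cpp
#include <cstdint>

// 6 bytes destination MAC + 6 bytes source MAC + 2 bytes ethertype
constexpr uint32_t kEthernetHeaderBytes = 14;

// True if the upper-layer payload following the Ethernet header starts on a 32-bit boundary
constexpr bool PayloadIsWordAligned(uint32_t frameAddr)
{
    return ((frameAddr + kEthernetHeaderBytes) % 4) == 0;
}

// A frame buffer starting 2 bytes past a word boundary puts the payload on a word boundary...
static_assert(PayloadIsWordAligned(0x20000002), "pointer ending in 0x2");
static_assert(PayloadIsWordAligned(0x2000000e), "pointer ending in 0xe");

// ...while a "properly" aligned frame buffer leaves every IP header field misaligned
static_assert(!PayloadIsWordAligned(0x20000000), "word aligned frame start");
```

The cost is that the frame itself starts on a halfword boundary, which is why the DMA source side ends up doing 16-bit reads.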

Also, the current register map for my Ethernet TX FIFO requires one register write to a length field at the start, then writing the frame data, then writing to a “commit” register to actually send the frame. The explicit length allows full-width 32 bit bus transactions to be used even if the transport at some point in the bridge doesn’t have byte write enables (the extra 0-3 bytes at the end of the frame will be discarded by the FIFO and not sent to the MAC) and the commit ensures that frames being pushed at low rate and popped at gigabit by the MAC won’t underrun the buffer.
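The write ordering looks roughly like the sketch below. The struct layout and names are hypothetical (this is not the real register map), and it is written as a plain CPU loop purely to show the length / data / commit sequence; it also assumes the caller hands in the frame as 32-bit words.

```cpp
#include <cstdint>

// Number of 32-bit bus writes needed for a frame of lenBytes
constexpr uint32_t WordsForFrame(uint32_t lenBytes)
{
    return (lenBytes + 3) / 4;
}

// Extra bytes in the final word that the FIFO discards (always 0-3)
constexpr uint32_t PadBytes(uint32_t lenBytes)
{
    return (WordsForFrame(lenBytes) * 4) - lenBytes;
}

// Hypothetical register block, for illustration only
struct TxFifoRegs
{
    volatile uint32_t length;   // frame length in bytes, written first
    volatile uint32_t data;     // frame content, one 32-bit word per write
    volatile uint32_t commit;   // any write hands the completed frame to the MAC
};

inline void PushFrame(TxFifoRegs& regs, const uint32_t* words, uint32_t lenBytes)
{
    regs.length = lenBytes;
    for(uint32_t i = 0; i < WordsForFrame(lenBytes); i++)
        regs.data = words[i];
    regs.commit = 1;
}

static_assert(WordsForFrame(61) == 16, "61 bytes rounds up to 16 words");
static_assert(PadBytes(61) == 3, "last word carries 3 discarded bytes");
```

Because the length is sent up front, the FIFO knows which bytes of the final word are padding without needing byte enables anywhere in the bridge.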

And since there are no serialization guarantees for memory transactions from the CPU and the DMA executing in parallel, all of these writes have to come from the MDMA.

The configuration I ended up settling on for this test used the “linked list” mode of the MDMA to perform the three separate transfers sequentially. The frame transfer uses a pair of 16-bit reads on the AHBS port of the Cortex-M7 to read low and high halves of a 32-bit frame word from DTCM on two consecutive AHB clock cycles, which then gets turned into a single 32-bit AXI write going to the FMC, which ultimately ends up as a 32-bit APB write on the FPGA.

As of now the software side only supports a single outstanding frame (if you try to send another frame before the DMA finishes, it’ll block until the DMA channel is available) and the receive side is still just using a blocking loop. This will definitely get improved later on.

While this did improve the available parallelism, it resulted in a comparatively small improvement (327 Mbps and 62% bus load) in iperf performance.

The final push

I scratched my head a bit and then asked gprof to give me a line-by-line, rather than function level, dump of the hot spots from the profiling dump and it gave me weird errors rather than sensible output.

This was when I realized that gprof was lying to me thanks to optimizations, and I wasn’t actually tuning the hottest spot. I recompiled with -Og which produced a significantly smaller and slightly slower binary, but one that was far more amenable to instrumentation.

And this profiler report showed me that the real hot spot was in the code that filled the UDP application-layer content.

void Iperf3Server::FillPacket(int id, uint32_t* payload, uint32_t len)
{
    uint32_t wordlen = len/4;
    if(len % 4)
        wordlen ++;

    //Fill seconds and microseconds using our timer (10 kHz tick, so 100 us per count)
    auto countval = g_logTimer.GetCount();
    auto sec = countval / 10000;
    auto ticks = (countval % 10000);
    auto us = ticks * 100;
    payload[0] = __builtin_bswap32(sec);
    payload[1] = __builtin_bswap32(us);

    //Sequence number (for now only 32 bit)
    //Increment first so sequence numbers in packet can be one-based
    uint32_t seq = ++m_state[id].m_sequence;
    payload[2] = __builtin_bswap32(seq);
    payload[3] = 0;

    //fill rest of packet with garbage
    for(uint32_t i=4; i<wordlen; i++)
        payload[i] = i;
}

More specifically, the final “fill” loop, seen here in disassembly. I’m not entirely sure why it was so slow as I don’t know the details of the Cortex-M7 pipeline that well. Maybe something to do with coalescing dual 32-bit operations into 64-bit TCM accesses (or failing to do so due to lack of unrolling) or branch misprediction?

  5a:   d906            bls.n   6a <Iperf3Server::FillPacket(int, unsigned long*, unsigned long)+0x6a>
  5c:   320c            adds    r2, #12
  5e:   2304            movs    r3, #4
  60:   f842 3f04       str.w   r3, [r2, #4]!
  64:   3301            adds    r3, #1
  66:   459c            cmp     ip, r3
  68:   d1fa            bne.n   60 <Iperf3Server::FillPacket(int, unsigned long*, unsigned long)+0x60>

One “#pragma GCC unroll 4” later, I had this. It seems to do a couple of branches early on using a code structure similar to Duff’s device that I guess is faster for the common case of all four iterations executing concurrently. You could probably make this even faster for typical large-ish packet sizes by moving the comparison out of the loop and doing the final iteration separately.

  5a:   d922            bls.n   a2 <Iperf3Server::FillPacket(int, unsigned long*, unsigned long)+0xa2>
  5c:   f01e 0003       ands.w  r0, lr, #3
  60:   f102 010c       add.w   r1, r2, #12
  64:   f04f 0304       mov.w   r3, #4
  68:   d00f            beq.n   8a <Iperf3Server::FillPacket(int, unsigned long*, unsigned long)+0x8a>
  6a:   2801            cmp     r0, #1
  6c:   d008            beq.n   80 <Iperf3Server::FillPacket(int, unsigned long*, unsigned long)+0x80>
  6e:   2802            cmp     r0, #2
  70:   d003            beq.n   7a <Iperf3Server::FillPacket(int, unsigned long*, unsigned long)+0x7a>
  72:   4611            mov     r1, r2
  74:   f841 3f10       str.w   r3, [r1, #16]!
  78:   2305            movs    r3, #5
  7a:   f841 3f04       str.w   r3, [r1, #4]!
  7e:   3301            adds    r3, #1
  80:   f841 3f04       str.w   r3, [r1, #4]!
  84:   3301            adds    r3, #1
  86:   459e            cmp     lr, r3
  88:   d00b            beq.n   a2 <Iperf3Server::FillPacket(int, unsigned long*, unsigned long)+0xa2>
  8a:   1c5a            adds    r2, r3, #1
  8c:   604b            str     r3, [r1, #4]
  8e:   1d08            adds    r0, r1, #4
  90:   3302            adds    r3, #2
  92:   608a            str     r2, [r1, #8]
  94:   3110            adds    r1, #16
  96:   6083            str     r3, [r0, #8]
  98:   1c93            adds    r3, r2, #2
  9a:   60c3            str     r3, [r0, #12]
  9c:   1cd3            adds    r3, r2, #3
  9e:   459e            cmp     lr, r3
  a0:   d1f3            bne.n   8a <Iperf3Server::FillPacket(int, unsigned long*, unsigned long)+0x8a>
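
For reference, the "comparison out of the loop, final iteration separately" idea could look something like this in source form. It is only a sketch: FillSimple and FillUnrolled are made-up names, and nothing here has been measured on a Cortex-M7, so whether it beats the pragma is an open question.

```cpp
#include <cassert>
#include <cstdint>

// Reference: the original one-word-per-iteration fill
inline void FillSimple(uint32_t* payload, uint32_t wordlen)
{
    assert(payload != nullptr);
    for(uint32_t i = 4; i < wordlen; i++)
        payload[i] = i;
}

// Hand-unrolled variant: the main loop does one compare per four stores,
// and a short tail loop handles whatever is left over
inline void FillUnrolled(uint32_t* payload, uint32_t wordlen)
{
    assert(payload != nullptr);
    uint32_t i = 4;

    // Main loop: four stores per iteration
    for(; i + 4 <= wordlen; i += 4)
    {
        payload[i]     = i;
        payload[i + 1] = i + 1;
        payload[i + 2] = i + 2;
        payload[i + 3] = i + 3;
    }

    // Tail: at most three leftover words
    for(; i < wordlen; i++)
        payload[i] = i;
}
```

Both functions leave the first four words (the header) untouched and produce identical payloads for any length.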

And the results were nothing short of astounding. 528 Mbps of UDP traffic and a saturated FMC bus, quite good for a 500 MHz single-core Cortex-M7 (now pushing just over one bit of Ethernet data per clock cycle)!

Conclusions

I’ve finally found the limit of the system and saturated the bus. Over half a Gbps of network traffic coming from a STM32H735 is more than I expect to ever need for any of my embedded management applications, and certainly far more CPU-FPGA bandwidth than I’ll need for anything I have in the pipeline.

I also fixed a few bugs in my APB and bridging code while working on this, including one causing the APB PSEL signal to be asserted for several clocks after PREADY when using pipeline stages, and another causing PADDR to be corrupted if two writes were issued back to back and the first one stalled for exactly the right number of clocks (the second write’s PADDR would be pushed to the bus controller before PREADY was asserted, causing the first write to go to the wrong address).

Now it’s time to move on to the next part of the project queue, assembling the new 48V IBC board and doing a bunch of decidedly not-fast power management firmware on the IBC as well as the supervisor MCU on the test board.

Like this post? Drop me a comment on Mastodon

Memory mapping an FPGA from an STM32

Wed, 24 Jul 2024 20:45:00 -0700

I teased at this a bit in my previous posts and finally have a setup I’m happy with, so I thought I’d do a more in-depth writeup.

To recap, the planned architecture for most of my future large-scale embedded projects is a fairly large (AMD Xilinx Kintex-7 or Artix / Kintex UltraScale+) FPGA for the high speed data plane paired with a STM32H735 for the control plane with a memory mapped interface between them.

Why a two-chip solution?

This is somewhat reminiscent of SoC FPGAs like Xilinx’s Zynq / Versal platforms, but with a few important differences that make it suit my needs and preferences better:

Using a MCU-class Cortex-M CPU instead of an applications processor is simpler to program in a bare-metal no-OS or minimal RTOS environment
The large (564 kB) on chip SRAM and 1 MB on chip flash eliminate the need for time-consuming DDR SDRAM layout for my typical firmwares (<200 kB each of RAM/flash used)
The disaggregated pinout (two smaller BGAs rather than one larger one) is simpler to fan out on fewer PCB layers, and allows placing the FPGA and MCU with some distance between them if this is more convenient for layout reasons
Decentralizing allows the FPGA and MCU to enforce security boundaries between each other. For example, the FPGA can refuse to accept a bitstream from the MCU that isn’t signed by a key stored in the FPGA, and on-MCU memory and peripherals can’t be touched by the FPGA.
I can mix and match MCUs and FPGAs to achieve the mix of features, IO, BOM cost, etc. that fits my application
The STM32 has hardware AES and random number generator IPs while Zynq (shockingly) does not

The memory interface

After several false starts using quad SPI, I’ve settled on using the Flexible Memory Controller (FMC) as the preferred MCU-side bridge between the AXI on the STM32 and the FPGA’s internal interconnect. This is a highly configurable module which can be used to interface to old school (PC133 etc) SDRAM, asynchronous or synchronous SRAM/PSRAM, parallel NOR/NAND flash, etc.

Most importantly, unlike the OCTOSPI peripheral on the STM32H735, there is no hardware caching or prefetch in the FMC IP itself - only the normal L1 I/D caches provided by the Cortex-M7. And there’s not even any need to mess with these, since the first FMC bank has a mapping at 0xc000_0000 which is configured as strongly ordered, uncached, device memory in the MPU right out of the box - no need to mess around with MPU registers to turn off the cache for this range.
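
From software, a region like this is just volatile loads and stores. A minimal sketch, assuming a hypothetical RegisterWindow helper (not part of any existing driver) that takes the base pointer as a parameter so it can also be exercised against an ordinary array on a PC:

```cpp
#include <cassert>
#include <cstdint>

// Thin wrapper around a memory-mapped window of 32-bit registers
class RegisterWindow
{
public:
    explicit RegisterWindow(volatile uint32_t* base) : m_base(base) {}

    // byteOffset must be 32-bit aligned
    void Write(uint32_t byteOffset, uint32_t value) const
    {
        assert((byteOffset % 4) == 0);
        m_base[byteOffset / 4] = value;
    }

    uint32_t Read(uint32_t byteOffset) const
    {
        assert((byteOffset % 4) == 0);
        return m_base[byteOffset / 4];
    }

private:
    volatile uint32_t* m_base;
};

// On the target, the window would be created at the FMC base address from the text:
//   RegisterWindow fpga(reinterpret_cast<volatile uint32_t*>(0xc0000000));
```

The volatile qualifier stops the compiler from merging or dropping accesses; the uncached device mapping described above is what stops the hardware from doing the same.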

The FMC operating mode most amenable to FPGA bridging is synchronous PSRAM since this provides a clock (which can be made free-running between memory activity bursts, allowing you to run internal FPGA logic, PLLs, etc. off it) and supports a hardware wait signal, allowing the FPGA to stall the bus in case pipeline delays or a slow peripheral require more latency than the fixed number of wait states provided in the initial FMC register configuration.

Hooking up 26 LVCMOS33 pins including clock, 16 bit multiplexed address/data, control signals, byte write enables, one chip select, and three high address bits occupies about half of a 7 series HR or UltraScale+ HD I/O bank. This configuration gives me 1 MB of address space (2^(16+3) = 512K 16-bit words addressable) on the FPGA and I can break out a few more address pins if I need more space for a more complex FPGA design.
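
The address space arithmetic, written out as a compile-time check. AddressableBytes is a made-up helper name used only for this example.

```cpp
#include <cstdint>

// Bytes addressable over a multiplexed address/data bus:
// 2^(address bits) words, each dataWidthBits wide
constexpr uint64_t AddressableBytes(unsigned muxedAddrBits, unsigned highAddrBits, unsigned dataWidthBits)
{
    return (uint64_t{1} << (muxedAddrBits + highAddrBits)) * (dataWidthBits / 8);
}

// 16 multiplexed bits + 3 high address bits on a 16-bit bus: 512K words = 1 MB
static_assert(AddressableBytes(16, 3, 16) == 1024u * 1024u, "expected 1 MB");

// Each extra high address pin doubles the space
static_assert(AddressableBytes(16, 4, 16) == 2u * 1024u * 1024u, "expected 2 MB");
```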

Hardware design

I didn’t have any boards with a suitable combination of MCU, FPGA, and interconnect routing so I threw together a quick test board in KiCAD. It’s a six layer design on Shengyi S1000-2M (cost optimized Asian FR-4 class material) since there’s nothing particularly fast on the board and I wanted it to be cheap.

The board is intended to pair with my second generation 48 -> 12V intermediate bus converter and also be used for bringup/validation testing of it, so it includes the PicoBlade control and Mini-Fit Jr 12V input connectors for that. I have the new IBC boards but haven’t had time to populate any yet, so for now I’m powering it off a first-iteration reworked IBC prototype that I had lying around.

It contains a STM32L431 in QFN-48 as supervisor / rail sequencer (so I can validate that and develop firmware before using it in a more expensive, complex design), a STM32H735 in 201-BGA as the main processor, and a Xilinx XC7S25 Spartan-7 FPGA in FTGB196 for the other side of the bridge. I could have got away with less FPGA but wanted this board to be a more general FPGA+MCU dev board, and also needed sufficient logic capacity and RAM for logic analyzer cores during bringup, so decided against using the XC7S6 or 15 that I had on the shelf.

The FPGA and MCU are wired together by several interfaces: the FMC discussed above, an OCTOSPI channel, and 10/100 RMII. The OCTOSPI and RMII are not used in the current firmware due to the caching issues discussed in my previous posts and the fact that the FMC is significantly faster than the RMII interface (more on that later).

The second OCTOSPI channel on the MCU is connected to a quad SPI flash that is currently unused, but I want to play with in the future. I think this will work fine; the OCTOSPI is actually designed for interfacing with external flash and most of the quirks I’ve encountered were the result of trying to shoehorn it into something it was not meant to do.

In addition to the interfaces to the MCU, the FPGA has an RGMII connection to a KSZ9031RNX gigabit Ethernet PHY, a PMOD for GPIO expansion, and four LEDs for status indications.

The MCU also has a PMOD of its own, another four LEDs, and a 3.3V UART broken out to pin headers for debug console.

Integrated platform

The STM32H735 is a very complex chip (the reference manual weighs in at 3357 pages) so we’ll only show the parts relevant to the discussion in figures here.

From the MCU’s perspective, the FPGA shows up as a 64 MB region (of which only 1 MB is wired up on this board) of APB SFR address space mapped starting at 0xc000_0000. In my linker script the first two 64 kB segments of this region are referred to as FMC_APB1 and FMC_APB2 to avoid ambiguity with the on-MCU APB1 and APB2 bus segments located in the 0x4000_0000 peripheral address range.

64-bit accesses are not currently supported since the FPGA-side bus is 32 bits and I haven’t implemented logic to break up a 64-bit burst into two 32-bit transactions. 32-bit read and write accesses are fully supported including wait state propagation; 16 and 8 bit accesses are mostly implemented but thorough testing has been low priority since most of my peripherals have native 32-bit registers anyway.

The FPGA design (implemented in SystemVerilog) contains:

A tri speed 10/100/1000 RGMII MAC paired with memory mapped RX FIFO and TX FIFO
An MDIO controller
A 32-bit GPIO port connected to the PHY reset pin, some other miscellaneous control/status signals around the board and internal to the FPGA, and the FPGA PMOD pins
A few support blocks for things like querying the FPGA device ID, temperature, and other system health sensors

FPGA-side AMBA implementation

I shied away from using AMBA interconnects in my FPGA design for a long time because of Xilinx’s choices of using AXI (large and heavy weight) for everything, and having individual ports for every control signal (whyyyy?). But Vivado now has good support for SystemVerilog interfaces (when I last looked at this circa 2017, while interfaces were supported it couldn’t handle arrays of them).

Rather than using AXI for everything, I’ve decided to standardize on 32-bit APB as my internal control-plane interconnect. It’s much smaller and simpler, and for poking config registers it’s more than fast enough. And, as you’ll see later on, you can actually push quite a bit of data over it if necessary.

The top level of the design is mostly IO declarations and the APB interconnect.

FMC bridge

The FMC bridge is a bidirectional converter between the STM32 FMC bus and AMBA APB. You can go look at the source if you want to see all the gory details, but here’s how it’s instantiated.

The bridge contains an internal PLL (as of this writing only 7 series is supported but UltraScale+ will be added soon) locked to the FMC clock (which must be free running) and generating two equal-frequency output clocks. The phase of these clocks can be adjusted as needed to improve setup/hold margin depending on IO timing requirements for your specific FPGA, board trace delays, etc.

The first clock phase is used for capturing inbound FMC control/write data signals and driving the APB PCLK out to internal loads within the FPGA, while the second clock is used for launching read data back to the MCU. At higher clock speeds it may be necessary to move the launch clock back relative to the capture clock in order to buy a bit more timing margin for the system-synchronous bus. (If anyone at ST is listening, could you maybe add some kind of DQS or other source-synchronous read capture clock to the FMC in your next gen parts?)

This bridge converts inbound FMC transactions directly to APB read/write transactions, setting the APB PSTRB signal as needed to match the byte write enables on the FMC. APB latency is properly propagated back to the NWAIT signal on the FMC, so peripherals can take arbitrarily long to service requests (although this will stall the AXI bus on the STM32, so beware).

As of now, the PSLVERR signal is not used for anything, but in the future I plan to break it out to a latching interrupt line of some sort that will trigger a “you done segfaulted” ISR on the MCU to handle the error.

APB #(.DATA_WIDTH(32), .ADDR_WIDTH(20), .USER_WIDTH(0)) fmc_apb();

FMC_APBBridge #(
    .CLOCK_PERIOD(7.27),    //137.5 MHz
    .VCO_MULT(8),           //1.1 GHz VCO
    .CAPTURE_CLOCK_PHASE(-30),
    .LAUNCH_CLOCK_PHASE(-30)
) fmcbridge(
    .apb(fmc_apb),

    .clk_mgmt(clk_125mhz),

    .fmc_clk(fmc_clk),
    .fmc_nwait(fmc_nwait),
    .fmc_noe(fmc_noe),
    .fmc_ad(fmc_ad),
    .fmc_nwe(fmc_nwe),
    .fmc_nbl(fmc_nbl),
    .fmc_nl_nadv(fmc_nl_nadv),
    .fmc_a_hi(fmc_a_hi),
    .fmc_cs_n(fmc_ne1)
);

APB bridges

My APB bridge module takes a single APB requester port and bridges it out to arbitrarily many completers, each mapped at consecutive, equally sized regions of address space configured as an array of SystemVerilog interfaces. No fancy GUI address space editors, no automatic code generation, just a parameterizable module.

Most nontrivial designs will include a mix of peripherals with simple, tiny register maps (just a few control bits) and larger, more complex ones with memory mapped buffers etc. My architecture implements this by using a tree of bridges; the test system has a root bridge with two 64 kB bus segments. One of these is then subdivided into 1 kB segments for general peripherals while the other is subdivided into 4 kB segments for the Ethernet FIFOs.
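
To make the tree concrete, here is the same decode done in software for an address relative to the FMC base. The block sizes come from the description above; the Decoded struct and the function names are made up for the example, and which peripheral sits at which segment index is deliberately left out.

```cpp
#include <cstdint>

struct Decoded
{
    uint32_t rootPort;  // 0 = first 64 kB segment (APB1), 1 = second (APB2)
    uint32_t leafPort;  // index of the peripheral segment under that root
    uint32_t offset;    // byte offset within that peripheral
};

constexpr uint32_t kRootBlock = 0x10000;  // 64 kB root segments
constexpr uint32_t kApb1Block = 0x400;    // 1 kB general peripheral segments
constexpr uint32_t kApb2Block = 0x1000;   // 4 kB Ethernet FIFO segments

// Leaf segment size under a given root port
constexpr uint32_t LeafBlock(uint32_t rootPort)
{
    return (rootPort == 0) ? kApb1Block : kApb2Block;
}

// Decode an address relative to the FMC base (valid for the two root ports above)
constexpr Decoded Decode(uint32_t addr)
{
    return Decoded{
        addr / kRootBlock,
        (addr % kRootBlock) / LeafBlock(addr / kRootBlock),
        addr % LeafBlock(addr / kRootBlock)};
}

// 0x0_0804 lands in APB1, third 1 kB segment, register offset 4
static_assert(Decode(0x00804).rootPort == 0, "APB1");
static_assert(Decode(0x00804).leafPort == 2, "third 1 kB segment");
static_assert(Decode(0x00804).offset == 4, "offset 4");
```

Since every block size is a power of two, the divides and modulos reduce to slicing address bits, which is why a decode like this can be purely combinatorial.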

The bridge is completely combinatorial to provide maximum flexibility for timing-latency tradeoffs; it is expected that real-world designs will add register slices throughout the design as required to make timing at the desired PCLK frequency.

//Two bus segments with 16-bit addresses (64 kB each) at 0xc000_0000 (APB1) and 0xc001_0000 (APB2)
APB #(.DATA_WIDTH(32), .ADDR_WIDTH(16), .USER_WIDTH(0)) rootAPB[1:0]();

//Root bridge
APBBridge #(
	.BASE_ADDR(32'h0000_0000),	//MSBs are not sent over FMC
	.BLOCK_SIZE(32'h1_0000),
	.NUM_PORTS(2)
) root_bridge (
	.upstream(fmc_apb_pipe),
	.downstream(rootAPB)
);

//Pipeline stages at top side of each root in case we need to improve timing
APB #(.DATA_WIDTH(32), .ADDR_WIDTH(16), .USER_WIDTH(0)) apb1_root();
APBRegisterSlice #(.DOWN_REG(0), .UP_REG(0)) regslice_apb1_root(
	.upstream(rootAPB[0]),
	.downstream(apb1_root));

APB #(.DATA_WIDTH(32), .ADDR_WIDTH(16), .USER_WIDTH(0)) apb2_root();
APBRegisterSlice #(.DOWN_REG(0), .UP_REG(0)) regslice_apb2_root(
	.upstream(rootAPB[1]),
	.downstream(apb2_root));

Performance

In order to check how fast the interface actually is, I wrote a minimalistic iperf3 compatible server application as a benchmark. Not that I actually expect to be trying to firehose packets as fast as I can (I didn’t implement rate limiting) from a STM32 hanging off an FPGA, but it’s a decent stress test of the interconnect bandwidth.

I chose reverse UDP mode (STM32 sending, PC receiving) for the benchmark to minimize the amount of CPU used on the benchmark with the goal of primarily stressing the bus - in other words, this should not be interpreted as a realistic performance figure that can be achieved by actual application code doing nontrivial things, merely a figure of merit for comparison to future implementations.

The current APBEthernetInterface driver doesn’t use any DMA, just a busy loop that effectively memcpy’s the data over (with a few slight tweaks to ensure alignment etc). Given that all of the memory accesses are made by the CPU, I put the packet buffers and all of the internal data structures used by the TCP/IP stack in DTCM to maximize performance.

With the FMC clocked at 125 MHz and both PLL clock phases set to -30 degrees (after BUFG insertion delay), my current test firmware can sustain 284 Mbps over a ten-second test.
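
For a sense of scale, a phase setting in degrees converts to time as period × phase / 360. A small sketch in integer picoseconds; PhaseToPicoseconds is a made-up helper and the result truncates toward zero:

```cpp
#include <cstdint>

// Convert a PLL phase shift in degrees to picoseconds for a given clock period
constexpr int64_t PhaseToPicoseconds(int64_t periodPs, int64_t phaseDegrees)
{
    return (periodPs * phaseDegrees) / 360;
}

// 125 MHz is an 8000 ps period, so -30 degrees is roughly -0.67 ns
static_assert(PhaseToPicoseconds(8000, -30) == -666, "about two thirds of a nanosecond");
```

So the phase knob here is moving clock edges by well under a nanosecond.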

There’s probably potential to go faster: the FMC can clock to double the current rate (250 MHz), but I had trouble getting reliable performance. On my next “real” design I’ll have a faster FPGA (although potentially slower IOs, UltraScale+ HDIO are actually somewhat slow compared to 7 series HR) and may spend some time playing with constraints and PLL phases to see if I can push it any further. But realistically, this is already more than enough for my needs.

Conclusions

Overall this was surprisingly painless. The interface just works, with almost no fuss. Pushing to higher clock rates (past 125 MHz) is likely to be a bit challenging due to the system-synchronous nature of the bus. I played around a bit with dynamic PLL reconfiguration and some ideas for link training of sorts, but honestly I don’t think it’s necessary.

I expect the code will evolve slightly over time, perhaps eventually adding 64-bit transfer support on the FMC side and bridging to AHB rather than APB for reduced overhead of sequential transfers, but this is likely to be the backbone of my large FPGA+MCU projects for the foreseeable future.

Embedded infrastructure, part 5: FPGA-MCU Bridging

Wed, 17 Jul 2024 21:00:00 -0700

The problem

The main MCU needs some way to talk to the datapath FPGA.

Early versions of the platform just used SPI, but this is slow and a bit awkward since you need to use SFR commands to read/write all of the registers on the FPGA. Over time various prototypes have shifted a bit in order to simplify this interoperability.

The plan

The next step was to convert the internal management bus from an ugly pile of nested case statements to a standard bus protocol. I decided to go with AMBA APB since this is bridgeable to AXI if needed, but is significantly simpler to implement and uses much less FPGA resources, while still being more than performant enough for the intended use case. Ultimately, the goal is a direct memory mapped interface from the MCU to the APB bus that I can use from software just like an APB peripheral internal to the MCU itself.

The next iteration, as used on the trigger crossbar, was to switch to the STM32H7 OCTOSPI (in quad SPI mode for initial testing). This ended up being a huge mistake and a waste of a lot of time.

The OCTOSPI allows memory mapping, but a combination of silicon errata and annoying design decisions rendered it difficult to use: memory mapped reads have a 32-byte prefetch buffer that cannot be disabled, which means any APB read you issue may result in up to 31 adjacent addresses being read. Subsequent reads to the prefetched address range are cached in the OCTOSPI peripheral itself (i.e. ignoring any caching properties set in the Cortex-M7 MPU). This means that reads with side effects will potentially trigger randomly when adjacent registers are read, and polling loops are very awkward to implement since the slightest misstep can lead to you infinite-looping with the poll hitting in the cache and not actually querying the peripheral.
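
The polling hazard is easy to reproduce in a toy model. The sketch below is not a model of the actual OCTOSPI internals: PrefetchingBus is a made-up class, and aligned 32-byte lines with a single-line buffer are assumptions chosen to keep the example small.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy model of a bus bridge with a 32-byte read buffer the CPU cannot bypass
class PrefetchingBus
{
public:
    static constexpr uint32_t kLineBytes = 32;

    PrefetchingBus(const uint8_t* device, uint32_t deviceBytes)
        : m_device(device), m_deviceBytes(deviceBytes) {}

    uint8_t Read(uint32_t addr)
    {
        const uint32_t lineBase = addr & ~(kLineBytes - 1);
        assert(lineBase + kLineBytes <= m_deviceBytes);

        if(!m_valid || lineBase != m_lineBase)
        {
            // Miss: the whole line is fetched, so 31 neighboring bytes
            // get read from the device along with the one requested
            std::memcpy(m_line, m_device + lineBase, kLineBytes);
            m_lineBase = lineBase;
            m_valid = true;
            m_deviceReads += kLineBytes;
        }

        // Hit: served from the buffer without touching the device
        return m_line[addr - lineBase];
    }

    uint32_t DeviceReads() const { return m_deviceReads; }

private:
    const uint8_t* m_device;
    uint32_t       m_deviceBytes;
    uint8_t        m_line[kLineBytes] = {};
    uint32_t       m_lineBase = 0;
    bool           m_valid = false;
    uint32_t       m_deviceReads = 0;
};
```

If the device sets a status byte after the first read, every later poll of that address keeps returning the stale buffered value until a read somewhere else evicts the line, which is exactly the infinite-loop trap described above.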

On top of these wrinkles, it doesn’t support backpressure so there’s no way to stall if a transaction takes more time to execute. You need to have a fixed number of dummy clocks on the QSPI and the APB has a hard realtime constraint based on that turnaround time.

So I spun a test board to try out the other memory mapped external bus, the FMC (flexible memory controller).

Current state

Other than needing a lot of pins (~27 depending on exactly which modes are active and how much address space you want to map), the FMC is everything I could hope for: it supports device-mode memory mapping (completely uncached strongly ordered access), there’s no caching in the peripheral, and it Just Works(tm). It supports backpressure and you can even set it to generate a free-running clock that you can lock a PLL to, etc. The only APB feature of interest that it doesn’t natively support is the ability to propagate a bus error back from APB PSLVERR; if necessary I can hook PSLVERR to a GPIO that triggers an interrupt that I treat as equivalent to a bus error.

There’s a small silicon erratum causing two dummy clocks to be added to the end of a read burst, but that’s simple enough to ignore on an FPGA, although it does cause a small performance degradation. It can clock up to 275 MHz (max AXI/AHB bus frequency of the STM32H735) although right now I’m testing at 125 on a slow Spartan-7; actual deployments with an UltraScale+ FPGA will have no problem running at the 250 MHz I intend to use.

I’ll do a more detailed writeup on the setup once I’ve got it fully packaged up and more extensively tested, but at this point it’s looking like this will be my interface of choice moving forward. I have it working well, bidirectionally bridging from AXI on the STM32 to APB in the FPGA.

It would be nice if the FMC supported arbitrary-sized burst accesses (or even just 64-bit bursts) to reduce overhead for bulk data transfer since the AXI on the STM32 side is 64 bits wide, but so far in my tests all attempts to do 64-bit AXI transfers have resulted in two consecutive 32-bit FMC accesses. I’ll need to play around a bit and see what kind of opcodes the compiler actually generated.

Embedded infrastructure, part 4: Management

Wed, 17 Jul 2024 14:00:00 -0700

The problem

All of these things need network management and/or data transfer capability. The normal solution would be to drop down a big Cortex-A SoC next to the FPGA running Linux, or use an integrated FPGA+SoC platform like a Zynq.

But embedded Linux comes with a lot of baggage and complexity, so I’m exploring lighter weight alternatives.

The plan

For the most part, I like to build systems with a clean control/data plane separation with all of the control plane in a MCU and the data plane living in the FPGA.

On the MCU side I’m very much a fan of simple event loops on a Cortex-M with a few ISRs for handling critical real time things that I can’t offload to the FPGA. No RTOS, no dynamic memory allocation, no unnecessary fluff.
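
As a flavor of what that style can look like, here is a minimal sketch of a fixed-capacity event queue and one pass of a main loop, with no heap anywhere. Everything in it (StaticQueue, the Event values, DrainEvents) is invented for illustration and is not taken from the actual firmware; a real ISR-to-main-loop handoff would also need attention to atomicity and memory ordering, which this host-side sketch ignores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed-capacity ring buffer, statically sized, holds up to N-1 items
template<typename T, std::size_t N>
class StaticQueue
{
    static_assert(N >= 2, "queue needs at least two slots");

public:
    // Returns false (and drops the item) if the queue is full
    bool Push(const T& item)
    {
        const std::size_t next = (m_head + 1) % N;
        if(next == m_tail)
            return false;
        m_items[m_head] = item;
        m_head = next;
        return true;
    }

    // Returns false if the queue is empty
    bool Pop(T& item)
    {
        if(m_tail == m_head)
            return false;
        item = m_items[m_tail];
        m_tail = (m_tail + 1) % N;
        return true;
    }

private:
    T           m_items[N] = {};
    std::size_t m_head = 0;
    std::size_t m_tail = 0;
};

// Hypothetical event types
enum class Event : uint8_t { LinkChange = 0, FrameReceived = 1, TimerTick = 2 };
constexpr std::size_t kNumEvents = 3;

// One pass of the main loop: drain pending events and dispatch each one.
// The per-type counter stands in for a real handler. Returns events handled.
template<std::size_t N>
inline std::size_t DrainEvents(StaticQueue<Event, N>& queue, uint32_t (&counts)[kNumEvents])
{
    std::size_t handled = 0;
    Event e = Event::LinkChange;
    while(queue.Pop(e))
    {
        const std::size_t idx = static_cast<std::size_t>(e);
        assert(idx < kNumEvents);
        counts[idx]++;
        handled++;
    }
    return handled;
}
```

The worst-case memory footprint is fixed at link time, and a full queue fails visibly at Push instead of failing somewhere inside an allocator.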

The challenge is that most stuff these days is being optimized for fast development cycles with the lowest paid developers they can hire, not for this sort of minimalistic design. After a bit of digging I couldn’t find a TCP/IP stack, SSH server, etc. implementation that met my needs.

So I had to start building those from scratch.

Current state

I have a working TCP/IP stack (IPv4 only at the moment, v6 pending when I have some time) using fully static memory allocations that runs on a STM32H735, along with a SSH server implementation supporting AES-GCM and (optionally FPGA accelerated) curve25519 as the only cipher suite. The stack is server only (it cannot initiate outbound connections) due to its intended use case and omits support for a bunch of features like IP fragmentation that are rarely used.

It’s quite lightweight: the trigger crossbar recovery image (SSH server and SFTP firmware updater plus all of the drivers and utilities required for them to work) clocks in at 82 kB of flash space with -Os, using 114 kB of SRAM (mostly for socket buffers) plus stack. I could cut the SRAM down even further if I didn’t support several concurrent connections, but to simplify code reuse the recovery image uses the same TCP/IP stack configuration as the application image. It’s not like anything else is going to be using all that RAM during recovery.

The main application firmware is slightly heavier, coming in at 161 kB of flash with -O3 and 124 kB of SRAM - but in that space I fit all of the hardware drivers and IP stack, SSH management console, serial console, SFTP firmware updater, SCPI server, DHCP client, SNTP client, and a bunch of other miscellaneous glue.

I have some partial TCP/IP offload capability for high bandwidth data plane stuff in the FPGA, but that’s going to need a bunch more work to get where I want it (it’s a lot bulkier than it should be).