math error in our favor
The pi benchmark readme is not yet updated, so you can still see (verify?) my mistake: https://gitlab.cba.mit.edu/zfredin/stm32f412_core/tree/master/nucleo-f412zg/pi
STM32F412, no FPU: 12.91 seconds for 1,000,000 loops of pi. At 5 FLOPs per loop, that gives us (5E6)/(12.91) = 0.387 MFLOPS.
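For anyone wondering where the 5 FLOPs per loop come from, here is a minimal Python sketch of the kind of series loop this benchmark is built around. The loop body is an assumption on my part (it is not quoted in this post, so check the linked repo for the actual C source), and `pi_series` is a hypothetical name. Each iteration does two subtractions, one multiply, one divide, and one add, which is where a count of 5 would come from.

```python
import math


def pi_series(npts):
    """Sum a series that converges to pi (assumed benchmark loop body)."""
    pi = 0.0
    for i in range(1, npts + 1):
        # 2 subtractions + 1 multiply + 1 divide + 1 add = 5 FLOPs per pass
        pi += 0.5 / ((i - 0.75) * (i - 0.25))
    return pi


if __name__ == "__main__":
    estimate = pi_series(1_000_000)
    print(f"estimate = {estimate:.6f}, error = {abs(estimate - math.pi):.1e}")
```

The divide is the expensive operation here, so it is the one to compare against the core's divide latency.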
STM32F412, FPU on: 0.391 seconds for 1,000,000 loops of pi. I originally reported this as 1.95 MFLOPS because I multiplied when I should have divided. The correct figure is (5E6)/(0.391) = 12.79 MFLOPS; at 84 MHz, that is ~6.6 clock cycles per FLOP (about 33 cycles per loop). Now we're on the happier side of what I would expect given the Cortex-M4's 14-cycle divide spec.
I'm getting up to speed on the SAMD51; I've got it running sans external crystal at ~160 MHz and saw an NPTS=1,000,000 pi loop time of 0.297 s, or (5E6)/(0.297) = 16.84 MFLOPS. I haven't measured the PLL speed directly, so the 160 MHz number could be off. If it is right, that works out to roughly 9.5 clock cycles per FLOP, noticeably more than the STM32 needed at 84 MHz, so this result seems low and more investigation is clearly needed.
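Since the original mistake was a multiply that should have been a divide, here is a small sketch of the arithmetic for all three runs. The function names are mine, the clock for the no-FPU run is assumed to be the same 84 MHz, and the SAMD51 clock is the unverified ~160 MHz figure.

```python
def mflops(loops, seconds, flops_per_loop=5):
    """Throughput in MFLOPS: total FLOPs divided by elapsed time."""
    return loops * flops_per_loop / seconds / 1e6


def cycles_per_flop(clock_hz, loops, seconds, flops_per_loop=5):
    """Clock cycles spent per FLOP: total cycles divided by total FLOPs."""
    return clock_hz * seconds / (loops * flops_per_loop)


# (label, assumed clock in Hz, measured seconds for 1,000,000 loops)
results = [
    ("STM32F412, no FPU (84 MHz assumed)", 84e6, 12.91),
    ("STM32F412, FPU on", 84e6, 0.391),
    ("SAMD51 (clock unverified)", 160e6, 0.297),
]

if __name__ == "__main__":
    for label, clock_hz, seconds in results:
        print(f"{label}: {mflops(1_000_000, seconds):.2f} MFLOPS, "
              f"{cycles_per_flop(clock_hz, 1_000_000, seconds):.1f} cycles/FLOP")
```

Comparing cycles per FLOP rather than raw MFLOPS takes the clock difference out of the picture, which is what makes the SAMD51 number look suspicious.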
If you want to tinker with the SAMD51 using OpenOCD, arm-none-eabi-gdb, Microchip's libraries (handily shared via Adafruit), and a Makefile, I posted some instructions here: https://gitlab.cba.mit.edu/pub/hello-world/atsamd51