I translated the benchmark into C. The randomness values varied from run to run, and the benchmark finished too quickly to time accurately, so I wrapped it in a loop and ran it 1000 times:
$ time ./ahl
[ . . . 997 lines elided . . . ]
0.011523 21.901428
0.011523 11.257996
0.011523 18.668213
randomness averaged over 1000 runs: 10.608711
real 0m0.056s
user 0m0.030s
sys 0m0.026s
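The loop itself is nothing special. A minimal sketch of that kind of harness, not the actual benchmark code: `average_over_runs`, `bench_fn`, and the `constant_score` stand-in are hypothetical names made up for illustration, and the real benchmark body would go where `constant_score` is.

```c
#include <stdio.h>

/* Hypothetical signature: one benchmark run returning its randomness score. */
typedef double (*bench_fn)(void);

/* Placeholder standing in for the real benchmark; always returns 10.0. */
double constant_score(void)
{
    return 10.0;
}

/*
 * Call run_once() `runs` times and return the mean score.
 * If `verbose` is nonzero, print each run's score on its own line.
 */
double average_over_runs(bench_fn run_once, int runs, int verbose)
{
    double sum = 0.0;

    for (int i = 0; i < runs; i++) {
        double score = run_once();
        if (verbose)
            printf("%f\n", score);
        sum += score;
    }
    return runs > 0 ? sum / runs : 0.0;
}
```

Calling `average_over_runs(constant_score, 1000, 0)` from `main()` and printing the result gives one summary line; passing a nonzero `verbose` brings the per-run lines back.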
For grins, I changed all the float arithmetic to double and took out the 1000 calls to printf():
$ time ./ahl
0.000000 11.349048
randomness averaged over 1000 runs: 10.266838
real 0m0.051s
user 0m0.027s
sys 0m0.023s
Interestingly, the time didn’t change much, and the randomness test is no better. The accuracy is now perfect, at least to the six decimal places printed.
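As a rough illustration of why float versus double can matter for accuracy, here is a generic sketch. It is not the benchmark's arithmetic, and `sum_float`, `sum_double`, and `abs_error` are hypothetical helpers invented for this example. It accumulates the same step many times in each precision so the rounding error can be compared.

```c
/* Add `step` to a running total `n` times, in single precision. */
float sum_float(int n, float step)
{
    float total = 0.0f;

    for (int i = 0; i < n; i++)
        total += step;
    return total;
}

/* Same accumulation, in double precision. */
double sum_double(int n, double step)
{
    double total = 0.0;

    for (int i = 0; i < n; i++)
        total += step;
    return total;
}

/* Absolute difference between a computed value and the expected one. */
double abs_error(double got, double want)
{
    double diff = got - want;

    return diff < 0.0 ? -diff : diff;
}
```

With a million steps of 0.1, the expected total is 100000. Comparing `abs_error(sum_float(1000000, 0.1f), 100000.0)` against `abs_error(sum_double(1000000, 0.1), 100000.0)` should show the float total noticeably further off, since a float carries only about 7 significant decimal digits versus roughly 15 to 16 for a double.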