
NAME
TinyFloats - Convert to/from tiny float formats
SYNOPSIS
use TinyFloats;
my $tf32 = tf32-from-num(1e0); # 0x1FC00
my $num1 = num-from-tf32(0x7FBFF); # -3.4011621342146535e+38
my $bf16 = bf16-from-num(1e0); # 0x3F80
my $num2 = num-from-bf16(0xFF7F); # -3.3895313892515355e+38
my $bin16 = bin16-from-num(1e0); # 0x3C00
my $num3 = num-from-bin16(0xFBFF); # -65504e0
my $e5m2 = e5m2-from-num(1e0); # 0x3C
my $num4 = num-from-e5m2(0xFB); # -57344e0
DESCRIPTION
TinyFloats is a collection of simple conversion routines to help with storing floating point data in tiny float formats. Raku cannot compute with these shorter formats directly; they must first be converted back to native floating point using one of the num-from-* routines.
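As an illustration of that store-then-expand workflow, here is a minimal sketch that uses only the bf16 routines shown in the SYNOPSIS. The sample values and variable names are made up for the example, and it assumes the *-from-num routines return plain integers, as the SYNOPSIS comments suggest.

```raku
use TinyFloats;

# Compact storage: keep only the 16-bit bf16 patterns.
my @samples = 1e0, 0.5e0, 3.14159e0;
my @stored  = @samples.map: { bf16-from-num($_) };

# Arithmetic happens on native nums, after expanding each pattern again.
my @restored = @stored.map: { num-from-bf16($_) };
my $sum      = [+] @restored;

say @stored.map: { .fmt('0x%04X') };
say $sum;
```

The restored values will generally differ slightly from the originals, because bf16 keeps only 7 mantissa bits.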
This version supports only bidirectional conversion between Raku native floating point numbers (num/num32/num64) and the shorter floating point storage formats listed in this table:
Formats and Bit Widths

Name | Total bits | Exponent bits | Mantissa bits | Max Val | ±Inf? | NaN? | Notes |
---|---|---|---|---|---|---|---|
num64 | 64 | 11 | 52 | ~2e+308 | Y | Y | Raku native (= num) |
num32 | 32 | 8 | 23 | ~3e+38 | Y | Y | Raku native |
tf32 | 19 | 8 | 10 | ~3e+38 | Y | Y | Nvidia TPU internal |
bf16 | 16 | 8 | 7 | ~3e+38 | Y | Y | bfloat16, truncated num32 |
bin16 | 16 | 5 | 10 | 65504 | Y | Y | IEEE 754 binary16/half |
e5m2 | 8 | 5 | 2 | 57344 | Y | Y | FP8, truncated bin16 |
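The Max Val column follows from the exponent and mantissa widths. The sketch below assumes the usual IEEE 754 style layout, in which the all-ones exponent is reserved for Inf/NaN; the helper name max-finite is hypothetical and not part of TinyFloats.

```raku
# Largest finite value of an IEEE 754 style format with the given field widths:
# a full mantissa (just under 2) times the largest non-reserved power of two.
sub max-finite(Int $exp-bits, Int $man-bits) {
    my $max-exp = 2 ** ($exp-bits - 1) - 1;   # e.g. 15 for a 5-bit exponent
    (2e0 - 2e0 ** (-$man-bits)) * 2e0 ** $max-exp
}

say max-finite(5, 10);   # bin16: 65504
say max-finite(5, 2);    # e5m2:  57344
say max-finite(8, 7);    # bf16:  roughly 3.39e+38
```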
More details on the supported formats:
IEEE 754 binary64 ("double precision") format, AKA double in C, float64 in CDDL, and 7.27 in CBOR. Natively handled in Raku; unless specified otherwise, all floating point computation in Raku is done in this format.
IEEE 754 binary32 ("single precision") format, AKA float in C, float32 in CDDL, and 7.26 in CBOR. Natively handled in Raku, but used in Raku computations only if specifically requested. This is also used as an intermediate format when expanding the shorter formats using one of the num-from-* routines.
Nvidia TensorFloat-32 format, used internally by Nvidia tensor processing hardware, with the 8 bit exponent width of num32 and bf16 and the 10 bit mantissa width of bin16. This works well for improving performance without as much loss of range or precision as the 16-bit formats. Unfortunately it is a 19-bit format and thus inconvenient to use for storage or interchange; Nvidia TPUs convert from num32 and back on the fly. This format is provided here for completeness, but is generally just a curiosity if not performing computations on actual Nvidia TPU hardware.
Google Brain bfloat16 format, essentially IEEE 754 binary32 ("single precision") with the least significant 16 bits truncated from the mantissa. bf16 reduces only the mantissa bits, is relatively quick to convert, is used most often in machine learning systems, and is usually supported directly by the ML hardware.
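The truncation described above can be sketched with core Raku buffer methods alone. This is an illustration of the bit layout, not necessarily how TinyFloats implements the conversion; the helper name num32-bits is hypothetical.

```raku
# Return the raw IEEE 754 binary32 bit pattern of a number as an unsigned integer.
sub num32-bits($value) {
    my $buf = buf8.allocate(4);
    $buf.write-num32(0, $value.Num, BigEndian);
    $buf.read-uint32(0, BigEndian)
}

my $bits32 = num32-bits(1e0);
my $bf16   = $bits32 +> 16;      # drop the low 16 mantissa bits

say $bits32.fmt('0x%08X');       # 0x3F800000
say $bf16.fmt('0x%04X');         # 0x3F80
```

Because the sign and exponent fields sit in the top bits of both formats, a single shift is all that truncation needs, which is why bf16 is quick to convert.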
IEEE 754 binary16 ("half precision") format, AKA _Float16 in C, float16 in CDDL, and 7.25 in CBOR. bin16 attempts to balance the reduction in exponent and mantissa bits, is fairly slow to convert, is used most often in graphics formats, and is commonly converted to native binary32 internally by modern graphics hardware.
Open Compute Project (OCP) OFP8 format, E5M2 variant. Another truncated format, e5m2 is bin16 with the least significant 8 bits truncated from the mantissa. Like bf16, e5m2 reduces only the mantissa bits from its parent format, and thus maintains most of its available useful range. Unlike the other OFP8 variant (E4M3), e5m2 still maintains full Inf/NaN support.
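To make the e5m2 layout concrete, the following sketch decodes an 8-bit pattern by hand. It assumes the IEEE 754 style conventions inherited from bin16 (exponent bias 15, subnormals when the exponent field is zero, Inf/NaN when it is all ones). The name decode-e5m2 is hypothetical; in practice num-from-e5m2 does this job.

```raku
# Decode a 1-5-2 (sign, exponent, mantissa) bit pattern into a native Num.
sub decode-e5m2(Int $bits) {
    my $sign = $bits +& 0x80 ?? -1e0 !! 1e0;
    my $exp  = ($bits +> 2) +& 0x1F;
    my $man  = $bits +& 0x03;

    # All-ones exponent: Inf when the mantissa is zero, NaN otherwise.
    return $sign * ($man ?? NaN !! Inf) if $exp == 0x1F;

    # Zero exponent: subnormal values (and signed zero), no implicit leading 1.
    return $sign * $man / 4 * 2e0 ** -14 if $exp == 0;

    # Normal values: implicit leading 1, exponent bias of 15.
    $sign * (1 + $man / 4) * 2e0 ** ($exp - 15)
}

say decode-e5m2(0x3C);   # 1
say decode-e5m2(0xFB);   # -57344
say decode-e5m2(0x7C);   # Inf
```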
BUGS AND LIMITATIONS
There are no routines provided to directly convert between various tiny float formats. To do this, you will need to convert from the source tiny format to a native Raku num using the appropriate num-from-* routine before converting to the destination tiny format using the appropriate *-from-num routine.
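The two-step conversion described above can be wrapped in a small helper. The wrapper name bf16-from-bin16 is hypothetical and not provided by the module; only num-from-bin16 and bf16-from-num come from TinyFloats.

```raku
use TinyFloats;

# Hypothetical helper: bin16 pattern -> native num -> bf16 pattern.
sub bf16-from-bin16($bin16) {
    bf16-from-num(num-from-bin16($bin16))
}

# 1e0 is 0x3C00 in bin16 and 0x3F80 in bf16 (see the SYNOPSIS).
say bf16-from-bin16(0x3C00).fmt('0x%04X');
```

Going through a native num costs an extra step, but it means any pair of supported formats can be bridged with the existing routines.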
Currently only mantissa truncation (AKA RTZ, Round To Zero) is supported when converting to narrower formats. Recent standardization work requires that format conversions support and default to using IEEE 754 RTNE (Round To Nearest with ties to Even) instead, though truncation is allowed as an option. For now the routines in this module comply with the older and faster truncation-focused rounding.
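To show how the two rounding modes differ, the sketch below narrows raw binary32 bit patterns to bf16 both ways. The RTNE version uses a common bias-and-shift trick and is an assumption for illustration, not something TinyFloats provides; both function names are hypothetical, and NaN inputs would need separate handling.

```raku
# Truncation (RTZ): drop the low 16 bits of the binary32 pattern.
sub bf16-truncate(Int $bits32) { $bits32 +> 16 }

# RTNE: add 0x7FFF plus the lowest kept bit, then drop the low 16 bits.
# Exact ties round toward the pattern whose lowest kept bit is 0.
sub bf16-rtne(Int $bits32) {
    my $lsb = ($bits32 +> 16) +& 1;
    ($bits32 + 0x7FFF + $lsb) +> 16
}

# Above half, exact tie (even), exact tie (odd).
for 0x3F80C000, 0x3F808000, 0x3F818000 -> $bits32 {
    say bf16-truncate($bits32).fmt('0x%04X'), ' ', bf16-rtne($bits32).fmt('0x%04X');
}
# 0x3F80 0x3F81
# 0x3F80 0x3F80
# 0x3F81 0x3F82
```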
There are no routines provided to pack sub-byte formats into a byte buffer (buf8); callers will need to do this themselves for now.
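One way a caller might do that packing is sketched below for fixed-width bit patterns such as 19-bit tf32 values. The helper name pack-bits, the MSB-first bit order, and the zero padding of the final byte are all assumptions made for this example; choose whatever layout your storage format requires.

```raku
# Pack fixed-width bit patterns MSB-first into a buf8, zero-padding the last byte.
sub pack-bits(@values, Int $width) {
    my $acc   = 0;          # bits waiting to be written
    my $nbits = 0;          # how many bits $acc currently holds
    my $out   = buf8.new;

    for @values -> $v {
        $acc = ($acc +< $width) +| ($v +& ((1 +< $width) - 1));
        $nbits += $width;
        while $nbits >= 8 {
            $nbits -= 8;
            $out.push: ($acc +> $nbits) +& 0xFF;
        }
        $acc +&= (1 +< $nbits) - 1;   # keep only the unwritten bits
    }

    # Flush any leftover bits, padded with zeros on the right.
    $out.push: ($acc +< (8 - $nbits)) +& 0xFF if $nbits;
    $out
}

my $packed = pack-bits([0x1FC00, 0x7FBFF], 19);
say $packed.list.map({ .fmt('%02X') }).join(' ');   # 3F 80 1F EF FC
```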
AUTHOR
Geoffrey Broadwell gjb@sonic.net
COPYRIGHT AND LICENSE
Copyright 2021,2025 Geoffrey Broadwell
This library is free software; you can redistribute it and/or modify it under the Artistic License 2.0.