TinyFloats

NAME

TinyFloats - Convert to/from tiny float formats

SYNOPSIS

use TinyFloats;

my $tf32  = tf32-from-num(1e0);      # 0x1FC00
my $num1  = num-from-tf32(0x7FBFF);  # -3.4011621342146535e+38

my $bf16  = bf16-from-num(1e0);      # 0x3F80
my $num2  = num-from-bf16(0xFF7F);   # -3.3895313892515355e+38

my $bin16 = bin16-from-num(1e0);     # 0x3C00
my $num3  = num-from-bin16(0xFBFF);  # -65504e0

my $e5m2  = e5m2-from-num(1e0);      # 0x3C
my $num4  = num-from-e5m2(0xFB);     # -57344e0

DESCRIPTION

TinyFloats is a collection of simple conversion routines to help with storing floating point data in tiny float formats. Raku cannot compute with these shorter formats directly; they must first be converted back to native floating point using one of the num-from-* routines.
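
For example, a value can be packed into bf16 for storage and expanded back to a native num before computing with it again. A minimal sketch (the exact low bits assume the simple truncation behavior described below):

use TinyFloats;

my $stored = bf16-from-num(pi);       # 0x4049
my $approx = num-from-bf16($stored);  # 3.140625
say abs(pi - $approx);                # ~0.00097, the truncated mantissa bits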

This version supports only bidirectional conversion between Raku native floating point numbers (num/num32/num64) and the shorter floating point storage formats summarized in the following table:

Formats and Bit Widths

Name   Total  Exponent  Mantissa  Max Val  ±Inf?  NaN?  Notes
num64     64        11        52  ~2e+308      Y     Y  Raku native (= num)
num32     32         8        23  ~3e+38       Y     Y  Raku native
tf32      19         8        10  ~3e+38       Y     Y  Nvidia Tensor Core internal
bf16      16         8         7  ~3e+38       Y     Y  bfloat16, truncated num32
bin16     16         5        10  65504        Y     Y  IEEE 754 binary16/half
e5m2       8         5         2  57344        Y     Y  FP8, truncated bin16

More details on the supported formats:

num64: IEEE 754 binary64 ("double precision") format, AKA double in C, float64 in CDDL, and 7.27 in CBOR. Natively handled in Raku; unless specified otherwise, all floating point computation in Raku is done in this format.
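
Because num64 is the Raku default, ordinary Num arithmetic shows standard binary64 rounding with no conversion needed:

say 0.1e0 + 0.2e0;   # 0.30000000000000004, classic binary64 rounding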

num32: IEEE 754 binary32 ("single precision") format, AKA float in C, float32 in CDDL, and 7.26 in CBOR. Natively handled in Raku, but used in Raku computations only if specifically requested. This is also used as an intermediate format when expanding the shorter formats using one of the num-from-* routines.
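
Requesting num32 computation is done with a native type declaration; a small sketch of the resulting precision loss:

my num32 $single = 1e0 / 3e0;  # rounded to binary32 on assignment
say $single == 1e0 / 3e0;      # False: the stored value lost mantissa bits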

tf32: Nvidia TensorFloat-32 format, used internally by Nvidia Tensor Core hardware, with the 8-bit exponent width of num32 and bf16 and the 10-bit mantissa width of bin16. This improves compute performance without as much loss of range or precision as the 16-bit formats. Unfortunately it is a 19-bit format, and thus inconvenient for storage or interchange; Nvidia hardware converts from num32 and back on the fly. It is provided here for completeness, but is generally just a curiosity unless computing on actual Nvidia Tensor Core hardware.
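
As a quick illustration of the range/precision tradeoff (a sketch; the exact low bits depend on how the conversion rounds), tf32 retains the full num32 exponent range, so magnitudes far beyond bin16's 65504 maximum survive a round-trip:

my $huge = 1e38;
say num-from-tf32(tf32-from-num($huge));   # ~1e38, only low mantissa bits lost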

bf16: Google Brain bfloat16 format, essentially IEEE 754 binary32 ("single precision") with the least significant 16 bits truncated from the mantissa. bf16 reduces only the mantissa bits, is relatively quick to convert, is used most often in machine learning systems, and is usually supported directly by ML hardware.
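
The truncation relationship is easy to check against the raw binary32 bytes. A sketch, assuming a Rakudo recent enough to provide buf8.write-num32:

my $raw = buf8.new;
$raw.write-num32(0, 1e0, BigEndian);      # binary32 bytes: 3F 80 00 00
say ($raw[0] +< 8 +| $raw[1]).base(16);   # 3F80, same as bf16-from-num(1e0)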

bin16: IEEE 754 binary16 ("half precision") format, AKA _Float16 in C, float16 in CDDL, and 7.25 in CBOR. bin16 attempts to balance the reduction in exponent and mantissa bits, is fairly slow to convert, is used most often in graphics formats, and is commonly converted to native binary32 internally by modern graphics hardware.
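
For normal (finite, non-subnormal) values the 1/5/10 bit split can be decoded by hand; here is a minimal sketch using the -65504e0 value from the SYNOPSIS:

my $bits = 0xFBFF;
my $sign = $bits +> 15;            # 1
my $exp  = ($bits +> 10) +& 0x1F;  # 30 (bias 15)
my $mant = $bits +& 0x3FF;         # 1023
say (-1) ** $sign * 2 ** ($exp - 15) * (1 + $mant / 1024);   # -65504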

e5m2: Open Compute Project (OCP) OFP8 format, E5M2 variant. Another truncated format, e5m2 is bin16 with the least significant 8 bits truncated from the mantissa. Like bf16, e5m2 reduces only the mantissa bits of its parent format, and thus retains most of its parent's useful range. Unlike the other OFP8 variant (E4M3), e5m2 maintains full Inf/NaN support.
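
Since e5m2 is just the top 8 bits of bin16, the relationship can be checked directly (assuming the straight truncation described above):

say (bin16-from-num(1e0) +> 8).base(16);   # 3C, same as e5m2-from-num(1e0)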

BUGS AND LIMITATIONS

AUTHOR

Geoffrey Broadwell gjb@sonic.net

COPYRIGHT AND LICENSE

Copyright 2021, 2025 Geoffrey Broadwell

This library is free software; you can redistribute it and/or modify it under the Artistic License 2.0.