Rand Stats

P5pack

zef:lizmat

Actions Status

NAME

Raku port of Perl's pack() / unpack() built-ins

SYNOPSIS

use P5pack; # exports pack(), unpack()

DESCRIPTION

This module tries to mimic the behaviour of Perl's pack and unpack built-ins as closely as possible in the Raku Programming Language.

Currently supported directives are: a A c C h H i I l L n N q Q s S U v V w x Z

ORIGINAL PERL 5 DOCUMENTATION

pack TEMPLATE,LIST
        Takes a LIST of values and converts it into a string using the
        rules given by the TEMPLATE. The resulting string is the
        concatenation of the converted values. Typically, each converted
        value looks like its machine-level representation. For example, on
        32-bit machines an integer may be represented by a sequence of 4
        bytes, which will in Perl be presented as a string that's 4
        characters long.

        See perlpacktut for an introduction to this function.

        The TEMPLATE is a sequence of characters that give the order and
        type of values, as follows:

            a  A string with arbitrary binary data, will be null padded.
            A  A text (ASCII) string, will be space padded.
            Z  A null-terminated (ASCIZ) string, will be null padded.

            b  A bit string (ascending bit order inside each byte,
               like vec()).
            B  A bit string (descending bit order inside each byte).
            h  A hex string (low nybble first).
            H  A hex string (high nybble first).

            c  A signed char (8-bit) value.
            C  An unsigned char (octet) value.
            W  An unsigned char value (can be greater than 255).

            s  A signed short (16-bit) value.
            S  An unsigned short value.

            l  A signed long (32-bit) value.
            L  An unsigned long value.

            q  A signed quad (64-bit) value.
            Q  An unsigned quad value.
                 (Quads are available only if your system supports 64-bit
                  integer values _and_ if Perl has been compiled to support
                  those.  Raises an exception otherwise.)

            i  A signed integer value.
            I  A unsigned integer value.
                 (This 'integer' is _at_least_ 32 bits wide.  Its exact
                  size depends on what a local C compiler calls 'int'.)

            n  An unsigned short (16-bit) in "network" (big-endian) order.
            N  An unsigned long (32-bit) in "network" (big-endian) order.
            v  An unsigned short (16-bit) in "VAX" (little-endian) order.
            V  An unsigned long (32-bit) in "VAX" (little-endian) order.

            j  A Perl internal signed integer value (IV).
            J  A Perl internal unsigned integer value (UV).

            f  A single-precision float in native format.
            d  A double-precision float in native format.

            F  A Perl internal floating-point value (NV) in native format
            D  A float of long-double precision in native format.
                 (Long doubles are available only if your system supports
                  long double values _and_ if Perl has been compiled to
                  support those.  Raises an exception otherwise.)

            p  A pointer to a null-terminated string.
            P  A pointer to a structure (fixed-length string).

            u  A uuencoded string.
            U  A Unicode character number.  Encodes to a character in char-
               acter mode and UTF-8 (or UTF-EBCDIC in EBCDIC platforms) in
               byte mode.

            w  A BER compressed integer (not an ASN.1 BER, see perlpacktut
               for details).  Its bytes represent an unsigned integer in
               base 128, most significant digit first, with as few digits
               as possible.  Bit eight (the high bit) is set on each byte
               except the last.

            x  A null byte (a.k.a ASCII NUL, "\000", chr(0))
            X  Back up a byte.
            @  Null-fill or truncate to absolute position, counted from the
               start of the innermost ()-group.
            .  Null-fill or truncate to absolute position specified by
               the value.
            (  Start of a ()-group.

        One or more modifiers below may optionally follow certain letters
        in the TEMPLATE (the second column lists letters for which the
        modifier is valid):

            !   sSlLiI     Forces native (short, long, int) sizes instead
                           of fixed (16-/32-bit) sizes.

            !   xX         Make x and X act as alignment commands.

            !   nNvV       Treat integers as signed instead of unsigned.

            !   @.         Specify position as byte offset in the internal
                           representation of the packed string.  Efficient
                           but dangerous.

            >   sSiIlLqQ   Force big-endian byte-order on the type.
                jJfFdDpP   (The "big end" touches the construct.)

            <   sSiIlLqQ   Force little-endian byte-order on the type.
                jJfFdDpP   (The "little end" touches the construct.)

        The ">" and "<" modifiers can also be used on "()" groups to force
        a particular byte-order on all components in that group, including
        all its subgroups.

        The following rules apply:

        *   Each letter may optionally be followed by a number indicating
            the repeat count. A numeric repeat count may optionally be
            enclosed in brackets, as in "pack("C[80]", @arr)". The repeat
            count gobbles that many values from the LIST when used with
            all format types other than "a", "A", "Z", "b", "B", "h", "H",
            "@", ".", "x", "X", and "P", where it means something else,
            described below. Supplying a "*" for the repeat count instead
            of a number means to use however many items are left, except
            for:

            *   "@", "x", and "X", where it is equivalent to 0.

            *   <.>, where it means relative to the start of the string.

            *   "u", where it is equivalent to 1 (or 45, which here is
                equivalent).

            One can replace a numeric repeat count with a template letter
            enclosed in brackets to use the packed byte length of the
            bracketed template for the repeat count.

            For example, the template "x[L]" skips as many bytes as in a
            packed long, and the template "$t X[$t] $t" unpacks twice
            whatever $t (when variable-expanded) unpacks. If the template
            in brackets contains alignment commands (such as "x![d]"), its
            packed length is calculated as if the start of the template
            had the maximal possible alignment.

            When used with "Z", a "*" as the repeat count is guaranteed to
            add a trailing null byte, so the resulting string is always
            one byte longer than the byte length of the item itself.

            When used with "@", the repeat count represents an offset from
            the start of the innermost "()" group.

            When used with ".", the repeat count determines the starting
            position to calculate the value offset as follows:

            *   If the repeat count is 0, it's relative to the current
                position.

            *   If the repeat count is "*", the offset is relative to the
                start of the packed string.

            *   And if it's an integer n, the offset is relative to the
                start of the nth innermost "( )" group, or to the start of
                the string if n is bigger then the group level.

            The repeat count for "u" is interpreted as the maximal number
            of bytes to encode per line of output, with 0, 1 and 2
            replaced by 45. The repeat count should not be more than 65.

        *   The "a", "A", and "Z" types gobble just one value, but pack it
            as a string of length count, padding with nulls or spaces as
            needed. When unpacking, "A" strips trailing whitespace and
            nulls, "Z" strips everything after the first null, and "a"
            returns data with no stripping at all.

            If the value to pack is too long, the result is truncated. If
            it's too long and an explicit count is provided, "Z" packs
            only "$count-1" bytes, followed by a null byte. Thus "Z"
            always packs a trailing null, except when the count is 0.

        *   Likewise, the "b" and "B" formats pack a string that's that
            many bits long. Each such format generates 1 bit of the
            result. These are typically followed by a repeat count like
            "B8" or "B64".

            Each result bit is based on the least-significant bit of the
            corresponding input character, i.e., on "ord($char)%2". In
            particular, characters "0" and "1" generate bits 0 and 1, as
            do characters "\000" and "\001".

            Starting from the beginning of the input string, each 8-tuple
            of characters is converted to 1 character of output. With
            format "b", the first character of the 8-tuple determines the
            least-significant bit of a character; with format "B", it
            determines the most-significant bit of a character.

            If the length of the input string is not evenly divisible by
            8, the remainder is packed as if the input string were padded
            by null characters at the end. Similarly during unpacking,
            "extra" bits are ignored.

            If the input string is longer than needed, remaining
            characters are ignored.

            A "*" for the repeat count uses all characters of the input
            field. On unpacking, bits are converted to a string of 0s and
            1s.

        *   The "h" and "H" formats pack a string that many nybbles (4-bit
            groups, representable as hexadecimal digits, "0".."9"
            "a".."f") long.

            For each such format, pack() generates 4 bits of result. With
            non-alphabetical characters, the result is based on the 4
            least-significant bits of the input character, i.e., on
            "ord($char)%16". In particular, characters "0" and "1"
            generate nybbles 0 and 1, as do bytes "\000" and "\001". For
            characters "a".."f" and "A".."F", the result is compatible
            with the usual hexadecimal digits, so that "a" and "A" both
            generate the nybble "0xA==10". Use only these specific hex
            characters with this format.

            Starting from the beginning of the template to pack(), each
            pair of characters is converted to 1 character of output. With
            format "h", the first character of the pair determines the
            least-significant nybble of the output character; with format
            "H", it determines the most-significant nybble.

            If the length of the input string is not even, it behaves as
            if padded by a null character at the end. Similarly, "extra"
            nybbles are ignored during unpacking.

            If the input string is longer than needed, extra characters
            are ignored.

            A "*" for the repeat count uses all characters of the input
            field. For unpack(), nybbles are converted to a string of
            hexadecimal digits.

        *   The "p" format packs a pointer to a null-terminated string.
            You are responsible for ensuring that the string is not a
            temporary value, as that could potentially get deallocated
            before you got around to using the packed result. The "P"
            format packs a pointer to a structure of the size indicated by
            the length. A null pointer is created if the corresponding
            value for "p" or "P" is "undef"; similarly with unpack(),
            where a null pointer unpacks into "undef".

            If your system has a strange pointer size--meaning a pointer
            is neither as big as an int nor as big as a long--it may not
            be possible to pack or unpack pointers in big- or
            little-endian byte order. Attempting to do so raises an
            exception.

        *   The "/" template character allows packing and unpacking of a
            sequence of items where the packed structure contains a packed
            item count followed by the packed items themselves. This is
            useful when the structure you're unpacking has encoded the
            sizes or repeat counts for some of its fields within the
            structure itself as separate fields.

            For "pack", you write length-item"/"sequence-item, and the
            length-item describes how the length value is packed. Formats
            likely to be of most use are integer-packing ones like "n" for
            Java strings, "w" for ASN.1 or SNMP, and "N" for Sun XDR.

            For "pack", sequence-item may have a repeat count, in which
            case the minimum of that and the number of available items is
            used as the argument for length-item. If it has no repeat
            count or uses a '*', the number of available items is used.

            For "unpack", an internal stack of integer arguments unpacked
            so far is used. You write "/"sequence-item and the repeat
            count is obtained by popping off the last element from the
            stack. The sequence-item must not have a repeat count.

            If sequence-item refers to a string type ("A", "a", or "Z"),
            the length-item is the string length, not the number of
            strings. With an explicit repeat count for pack, the packed
            string is adjusted to that length. For example:

             This code:                             gives this result:

             unpack("W/a", "\004Gurusamy")          ("Guru")
             unpack("a3/A A*", "007 Bond  J ")      (" Bond", "J")
             unpack("a3 x2 /A A*", "007: Bond, J.") ("Bond, J", ".")

             pack("n/a* w/a","hello,","world")     "\000\006hello,\005world"
             pack("a/W2", ord("a") .. ord("z"))    "2ab"

            The length-item is not returned explicitly from "unpack".

            Supplying a count to the length-item format letter is only
            useful with "A", "a", or "Z". Packing with a length-item of
            "a" or "Z" may introduce "\000" characters, which Perl does
            not regard as legal in numeric strings.

        *   The integer types "s", "S", "l", and "L" may be followed by a
            "!" modifier to specify native shorts or longs. As shown in
            the example above, a bare "l" means exactly 32 bits, although
            the native "long" as seen by the local C compiler may be
            larger. This is mainly an issue on 64-bit platforms. You can
            see whether using "!" makes any difference this way:

                printf "format s is %d, s! is %d\n",
                    length pack("s"), length pack("s!");

                printf "format l is %d, l! is %d\n",
                    length pack("l"), length pack("l!");

            "i!" and "I!" are also allowed, but only for completeness'
            sake: they are identical to "i" and "I".

            The actual sizes (in bytes) of native shorts, ints, longs, and
            long longs on the platform where Perl was built are also
            available from the command line:

                $ perl -V:{short,int,long{,long}}size
                shortsize='2';
                intsize='4';
                longsize='4';
                longlongsize='8';

            or programmatically via the "Config" module:

                   use Config;
                   print $Config{shortsize},    "\n";
                   print $Config{intsize},      "\n";
                   print $Config{longsize},     "\n";
                   print $Config{longlongsize}, "\n";

            $Config{longlongsize} is undefined on systems without long
            long support.

        *   The integer formats "s", "S", "i", "I", "l", "L", "j", and "J"
            are inherently non-portable between processors and operating
            systems because they obey native byteorder and endianness. For
            example, a 4-byte integer 0x12345678 (305419896 decimal) would
            be ordered natively (arranged in and handled by the CPU
            registers) into bytes as

                0x12 0x34 0x56 0x78  # big-endian
                0x78 0x56 0x34 0x12  # little-endian

            Basically, Intel and VAX CPUs are little-endian, while
            everybody else, including Motorola m68k/88k, PPC, Sparc, HP
            PA, Power, and Cray, are big-endian. Alpha and MIPS can be
            either: Digital/Compaq uses (well, used) them in little-endian
            mode, but SGI/Cray uses them in big-endian mode.

            The names big-endian and little-endian are comic references to
            the egg-eating habits of the little-endian Lilliputians and
            the big-endian Blefuscudians from the classic Jonathan Swift
            satire, Gulliver's Travels. This entered computer lingo via
            the paper "On Holy Wars and a Plea for Peace" by Danny Cohen,
            USC/ISI IEN 137, April 1, 1980.

            Some systems may have even weirder byte orders such as

               0x56 0x78 0x12 0x34
               0x34 0x12 0x78 0x56

            You can determine your system endianness with this
            incantation:

               printf("%#02x ", $_) for unpack("W*", pack L=>0x12345678);

            The byteorder on the platform where Perl was built is also
            available via Config:

                use Config;
                print "$Config{byteorder}\n";

            or from the command line:

                $ perl -V:byteorder

            Byteorders "1234" and "12345678" are little-endian; "4321" and
            "87654321" are big-endian.

            For portably packed integers, either use the formats "n", "N",
            "v", and "V" or else use the ">" and "<" modifiers described
            immediately below. See also perlport.

        *   Starting with Perl 5.10.0, integer and floating-point formats,
            along with the "p" and "P" formats and "()" groups, may all be
            followed by the ">" or "<" endianness modifiers to
            respectively enforce big- or little-endian byte-order. These
            modifiers are especially useful given how "n", "N", "v", and
            "V" don't cover signed integers, 64-bit integers, or
            floating-point values.

            Here are some concerns to keep in mind when using an
            endianness modifier:

            *   Exchanging signed integers between different platforms
                works only when all platforms store them in the same
                format. Most platforms store signed integers in
                two's-complement notation, so usually this is not an
                issue.

            *   The ">" or "<" modifiers can only be used on
                floating-point formats on big- or little-endian machines.
                Otherwise, attempting to use them raises an exception.

            *   Forcing big- or little-endian byte-order on floating-point
                values for data exchange can work only if all platforms
                use the same binary representation such as IEEE
                floating-point. Even if all platforms are using IEEE,
                there may still be subtle differences. Being able to use
                ">" or "<" on floating-point values can be useful, but
                also dangerous if you don't know exactly what you're
                doing. It is not a general way to portably store
                floating-point values.

            *   When using ">" or "<" on a "()" group, this affects all
                types inside the group that accept byte-order modifiers,
                including all subgroups. It is silently ignored for all
                other types. You are not allowed to override the
                byte-order within a group that already has a byte-order
                modifier suffix.

        *   Real numbers (floats and doubles) are in native machine format
            only. Due to the multiplicity of floating-point formats and
            the lack of a standard "network" representation for them, no
            facility for interchange has been made. This means that packed
            floating-point data written on one machine may not be readable
            on another, even if both use IEEE floating-point arithmetic
            (because the endianness of the memory representation is not
            part of the IEEE spec). See also perlport.

            If you know exactly what you're doing, you can use the ">" or
            "<" modifiers to force big- or little-endian byte-order on
            floating-point values.

            Because Perl uses doubles (or long doubles, if configured)
            internally for all numeric calculation, converting from double
            into float and thence to double again loses precision, so
            "unpack("f", pack("f", $foo)") will not in general equal $foo.

        *   Pack and unpack can operate in two modes: character mode ("C0"
            mode) where the packed string is processed per character, and
            UTF-8 mode ("U0" mode) where the packed string is processed in
            its UTF-8-encoded Unicode form on a byte-by-byte basis.
            Character mode is the default unless the format string starts
            with "U". You can always switch mode mid-format with an
            explicit "C0" or "U0" in the format. This mode remains in
            effect until the next mode change, or until the end of the
            "()" group it (directly) applies to.

            Using "C0" to get Unicode characters while using "U0" to get
            non-Unicode bytes is not necessarily obvious. Probably only
            the first of these is what you want:

                $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
                  perl -CS -ne 'printf "%v04X\n", $_ for unpack("C0A*", $_)'
                03B1.03C9
                $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
                  perl -CS -ne 'printf "%v02X\n", $_ for unpack("U0A*", $_)'
                CE.B1.CF.89
                $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
                  perl -C0 -ne 'printf "%v02X\n", $_ for unpack("C0A*", $_)'
                CE.B1.CF.89
                $ perl -CS -E 'say "\x{3B1}\x{3C9}"' |
                  perl -C0 -ne 'printf "%v02X\n", $_ for unpack("U0A*", $_)'
                C3.8E.C2.B1.C3.8F.C2.89

            Those examples also illustrate that you should not try to use
            "pack"/"unpack" as a substitute for the Encode module.

        *   You must yourself do any alignment or padding by inserting,
            for example, enough "x"es while packing. There is no way for
            pack() and unpack() to know where characters are going to or
            coming from, so they handle their output and input as flat
            sequences of characters.

        *   A "()" group is a sub-TEMPLATE enclosed in parentheses. A
            group may take a repeat count either as postfix, or for
            unpack(), also via the "/" template character. Within each
            repetition of a group, positioning with "@" starts over at 0.
            Therefore, the result of

                pack("@1A((@2A)@3A)", qw[X Y Z])

            is the string "\0X\0\0YZ".

        *   "x" and "X" accept the "!" modifier to act as alignment
            commands: they jump forward or back to the closest position
            aligned at a multiple of "count" characters. For example, to
            pack() or unpack() a C structure like

                struct {
                    char   c;    /* one signed, 8-bit character */
                    double d;
                    char   cc[2];
                }

            one may need to use the template "c x![d] d c[2]". This
            assumes that doubles must be aligned to the size of double.

            For alignment commands, a "count" of 0 is equivalent to a
            "count" of 1; both are no-ops.

        *   "n", "N", "v" and "V" accept the "!" modifier to represent
            signed 16-/32-bit integers in big-/little-endian order. This
            is portable only when all platforms sharing packed data use
            the same binary representation for signed integers; for
            example, when all platforms use two's-complement
            representation.

        *   Comments can be embedded in a TEMPLATE using "#" through the
            end of line. White space can separate pack codes from each
            other, but modifiers and repeat counts must follow
            immediately. Breaking complex templates into individual
            line-by-line components, suitably annotated, can do as much to
            improve legibility and maintainability of pack/unpack formats
            as "/x" can for complicated pattern matches.

        *   If TEMPLATE requires more arguments than pack() is given,
            pack() assumes additional "" arguments. If TEMPLATE requires
            fewer arguments than given, extra arguments are ignored.

        Examples:

            $foo = pack("WWWW",65,66,67,68);
            # foo eq "ABCD"
            $foo = pack("W4",65,66,67,68);
            # same thing
            $foo = pack("W4",0x24b6,0x24b7,0x24b8,0x24b9);
            # same thing with Unicode circled letters.
            $foo = pack("U4",0x24b6,0x24b7,0x24b8,0x24b9);
            # same thing with Unicode circled letters.  You don't get the
            # UTF-8 bytes because the U at the start of the format caused
            # a switch to U0-mode, so the UTF-8 bytes get joined into
            # characters
            $foo = pack("C0U4",0x24b6,0x24b7,0x24b8,0x24b9);
            # foo eq "\xe2\x92\xb6\xe2\x92\xb7\xe2\x92\xb8\xe2\x92\xb9"
            # This is the UTF-8 encoding of the string in the
            # previous example

            $foo = pack("ccxxcc",65,66,67,68);
            # foo eq "AB\0\0CD"

            # NOTE: The examples above featuring "W" and "c" are true
            # only on ASCII and ASCII-derived systems such as ISO Latin 1
            # and UTF-8.  On EBCDIC systems, the first example would be
            #      $foo = pack("WWWW",193,194,195,196);

            $foo = pack("s2",1,2);
            # "\001\000\002\000" on little-endian
            # "\000\001\000\002" on big-endian

            $foo = pack("a4","abcd","x","y","z");
            # "abcd"

            $foo = pack("aaaa","abcd","x","y","z");
            # "axyz"

            $foo = pack("a14","abcdefg");
            # "abcdefg\0\0\0\0\0\0\0"

            $foo = pack("i9pl", gmtime);
            # a real struct tm (on my system anyway)

            $utmp_template = "Z8 Z8 Z16 L";
            $utmp = pack($utmp_template, @utmp1);
            # a struct utmp (BSDish)

            @utmp2 = unpack($utmp_template, $utmp);
            # "@utmp1" eq "@utmp2"

            sub bintodec {
                unpack("N", pack("B32", substr("0" x 32 . shift, -32)));
            }

            $foo = pack('sx2l', 12, 34);
            # short 12, two zero bytes padding, long 34
            $bar = pack('s@4l', 12, 34);
            # short 12, zero fill to position 4, long 34
            # $foo eq $bar
            $baz = pack('s.l', 12, 4, 34);
            # short 12, zero fill to position 4, long 34

            $foo = pack('nN', 42, 4711);
            # pack big-endian 16- and 32-bit unsigned integers
            $foo = pack('S>L>', 42, 4711);
            # exactly the same
            $foo = pack('s<l<', -42, 4711);
            # pack little-endian 16- and 32-bit signed integers
            $foo = pack('(sl)<', -42, 4711);
            # exactly the same

        The same template may generally also be used in unpack().

unpack TEMPLATE,EXPR
unpack TEMPLATE
        "unpack" does the reverse of "pack": it takes a string and expands
        it out into a list of values. (In scalar context, it returns
        merely the first value produced.)

        If EXPR is omitted, unpacks the $_ string. See perlpacktut for an
        introduction to this function.

        The string is broken into chunks described by the TEMPLATE. Each
        chunk is converted separately to a value. Typically, either the
        string is a result of "pack", or the characters of the string
        represent a C structure of some kind.

        The TEMPLATE has the same format as in the "pack" function. Here's
        a subroutine that does substring:

            sub substr {
                my($what,$where,$howmuch) = @_;
                unpack("x$where a$howmuch", $what);
            }

        and then there's

            sub ordinal { unpack("W",$_[0]); } # same as ord()

        In addition to fields allowed in pack(), you may prefix a field
        with a %<number> to indicate that you want a <number>-bit checksum
        of the items instead of the items themselves. Default is a 16-bit
        checksum. Checksum is calculated by summing numeric values of
        expanded values (for string fields the sum of "ord($char)" is
        taken; for bit fields the sum of zeroes and ones).

        For example, the following computes the same number as the System
        V sum program:

            $checksum = do {
                local $/;  # slurp!
                unpack("%32W*",<>) % 65535;
            };

        The following efficiently counts the number of set bits in a bit
        vector:

            $setbits = unpack("%32b*", $selectmask);

        The "p" and "P" formats should be used with care. Since Perl has
        no way of checking whether the value passed to "unpack()"
        corresponds to a valid memory location, passing a pointer value
        that's not known to be valid is likely to have disastrous
        consequences.

        If there are more pack codes or if the repeat count of a field or
        a group is larger than what the remainder of the input string
        allows, the result is not well defined: the repeat count may be
        decreased, or "unpack()" may produce empty strings or zeros, or it
        may raise an exception. If the input string is longer than one
        described by the TEMPLATE, the remainder of that input string is
        ignored.

AUTHOR

Elizabeth Mattijsen liz@raku.rocks

Source can be located at: https://github.com/lizmat/P5pack . Comments and Pull Requests are welcome.

COPYRIGHT AND LICENSE

Copyright 2018, 2019, 2020, 2021 Elizabeth Mattijsen

Re-imagined from Perl as part of the CPAN Butterfly Plan and an earlier version that only lived in the Raku Ecosystem.

This library is free software; you can redistribute it and/or modify it under the Artistic License 2.0.