NAME

DOM::Tiny - A lightweight, self-contained DOM parser/manipulator

SYNOPSIS

use DOM::Tiny;

# Parse
my $dom = DOM::Tiny.parse('<div><p id="a">Test</p><p id="b">123</p></div>');

# Find
say $dom.at('#b').text;
say $dom.find('p').map(*.text).join("\n");
say $dom.find('[id]').map(*.attr('id')).join("\n");

# Iterate
$dom.find('p[id]').reverse.map({ .<id>.say });

# Loop
for $dom.find('p[id]') -> $e {
    say $e<id>, ':', $e.text;
}

# Modify
$dom.find('div p')[*-1].append('<p id="c">456</p>');
$dom.find(':not(p)').map(*.strip);

# Render
say "$dom";

DESCRIPTION

DOM::Tiny is a smallish, relaxed pure-Perl HTML/XML DOM parser. It is relatively robust owing mostly to the enormous test suite inherited from its progenitor. The HTML/XML parsing is very forgiving and the CSS parser supports a reasonable subset of CSS3 for selecting elements in the DOM tree.

This module started as a port of Mojo::DOM58 from Perl 5, but maintaining compatibility with that library is not a major aim of this project. In fact, features of Perl 6 render certain aspects of Mojo::DOM58 completely redundant. For example, the collection system that provides custom features such as map, each, and reduce is completely unnecessary in Perl 6. The built-in syntax is as simple or simpler to use and safer in every case.

NODES AND ELEMENTS

When we parse an HTML/XML fragment, it gets turned into a tree of nodes.

<!DOCTYPE html>
<html>
<head><title>Hello</title></head>
<body>World!</body>
</html>

There are currently the following different kinds of nodes: Root, Text, Tag, Raw, PI, Doctype, Comment, and CDATA. These can also be grouped into the following roles: DocumentNode (anything but Root), Node (all kinds), HasChildren (Root and Tag), and TextNode (includes Text, CDATA, and Raw).

Root
|- Doctype (html)
+- Tag (html)
   |- Tag (head)
   |  +- Tag (title)
   |     +- Text (Hello)
   +- Tag (body)
      +- Text (World!)

While all node types are represented as DOM::Tiny objects, some methods like attr and namespace only apply to elements.

Under normal circumstances you will probably never need to use these objects directly, but they are available in case you have some special need.

These objects are all defined in the DOM::Tiny::HTML namespace. If you want to import the short names, they are exported by default by that compilation unit:

{
    use DOM::Tiny;
    my $t = DOM::Tiny::HTML::Text.new(:text<Hello>);
}

{
    use DOM::Tiny;
    use DOM::Tiny::HTML;
    my $t = Text.new(:text<Hello>);
}

CASE SENSITIVITY

DOM::Tiny defaults to HTML semantics. That means all tags and attribute names are automatically lowercased at parse time. Selectors will, therefore, need to be lowercase to match anything as matching is still case-sensitive.

# HTML semantics
my $dom = DOM::Tiny.parse('<P ID="greeting">Hi!</P>');
say $dom.at('p[id]').text;

If an XML declaration is found at the start of the snippet to parse, the parser will automatically switch into XML mode and everything becomes case-sensitive.

# XML semantics
my $dom = DOM::Tiny.parse('<?xml version="1.0"?><P ID="greeting">Hi!</P>');
say $dom.at('P[ID]').text;

XML detection can also be disabled or forced by explicitly setting the :xml flag as needed.

# Force XML semantics
my $dom = DOM::Tiny.parse('<P ID="greeting">Hi!</P>', :xml);
say $dom.at('P[ID]').text;

# Force HTML semantics
$dom = DOM::Tiny.parse('<P ID="greeting">Hi!</P>', :!xml);
say $dom.at('p[id]').text;

SELECTORS

DOM::Tiny uses a CSS selector engine found in DOM::Tiny::CSS. We try to support all CSS selectors that make sense for a standalone parser.

*

Any element.

my $all = $dom.find('*');

E

An element of type E.

my $title = $dom.at('title');

E[foo]

An E element with a foo attribute.

my $links = $dom.find('a[href]');

E[foo="bar"]

An E element whose foo attribute value is exactly equal to bar.

my $case_sensitive = $dom.find('input[type="hidden"]');
my $case_sensitive = $dom.find('input[type=hidden]');

E[foo="bar" i]

An E element whose foo attribute value is exactly equal to any case-permutation of bar.

my $case_insensitive = $dom.find('input[type="hidden" i]');
my $case_insensitive = $dom.find('input[type=hidden i]');
my $case_insensitive = $dom.find('input[class~="foo" i]');

This selector is part of Selectors Level 4. The "i" modifier may be added to any attribute selector to turn what is normally an exact match into one that matches any case-permutation.

E[foo~="bar"]

An E element whose foo attribute value is a list of whitespace-separated values, one of which is exactly equal to bar.

my $foo = $dom.find('input[class~="foo"]');
my $foo = $dom.find('input[class~=foo]');

E[foo^="bar"]

An E element whose foo attribute value begins exactly with the string bar.

my $begins_with = $dom.find('input[name^="f"]');
my $begins_with = $dom.find('input[name^=f]');

E[foo$="bar"]

An E element whose foo attribute value ends exactly with the string bar.

my $ends_with = $dom.find('input[name$="o"]');
my $ends_with = $dom.find('input[name$=o]');

E[foo*="bar"]

An E element whose foo attribute value contains the substring bar.

my $contains = $dom.find('input[name*="fo"]');
my $contains = $dom.find('input[name*=fo]');

E:root

An E element, root of the document.

my $root = $dom.at(':root');

E:nth-child(n)

An E element, the n-th child of its parent.

my $third = $dom.find('div:nth-child(3)');
my $odd   = $dom.find('div:nth-child(odd)');
my $even  = $dom.find('div:nth-child(even)');
my $top3  = $dom.find('div:nth-child(-n+3)');

E:nth-last-child(n)

An E element, the n-th child of its parent, but counting backwards from the end.

my $third    = $dom.find('div:nth-last-child(3)');
my $odd      = $dom.find('div:nth-last-child(odd)');
my $even     = $dom.find('div:nth-last-child(even)');
my $bottom3  = $dom.find('div:nth-last-child(-n+3)');

E:nth-of-type(n)

An E element, the n-th sibling of its type.

my $third = $dom.find('div:nth-of-type(3)');
my $odd   = $dom.find('div:nth-of-type(odd)');
my $even  = $dom.find('div:nth-of-type(even)');
my $top3  = $dom.find('div:nth-of-type(-n+3)');

E:nth-last-of-type(n)

An E element, the n-th sibling of its type, counting backwards from the end.

my $third    = $dom.find('div:nth-last-of-type(3)');
my $odd      = $dom.find('div:nth-last-of-type(odd)');
my $even     = $dom.find('div:nth-last-of-type(even)');
my $bottom3  = $dom.find('div:nth-last-of-type(-n+3)');

E:first-child

An E element, first child of its parent.

my $first = $dom.find('div p:first-child');

E:last-child

An E element, last child of its parent.

my $last = $dom.find('div p:last-child');

E:first-of-type

An E element, first sibling of its type.

my $first = $dom.find('div p:first-of-type');

E:last-of-type

An E element, last sibling of its type.

my $last = $dom.find('div p:last-of-type');

E:only-child

An E element, only child of its parent.

my $lonely = $dom.find('div p:only-child');

E:only-of-type

An E element, only sibling of its type.

my $lonely = $dom.find('div p:only-of-type');

E:empty

An E element that has no children (including text nodes, meaning the element does not even contain whitespace).

my $empty = $dom.find(':empty');

E:checked

A user interface element E which is checked (for instance a radio-button or checkbox).

my $input = $dom.find(':checked');

E.warning

An E element whose class is "warning".

my $warning = $dom.find('div.warning');

E#myid

An E element with an "id" attribute equal to "myid". Basically, a shorthand for E[id=myid].

my $foo = $dom.at('div#foo');

E:not(s)

An E element that does not match simple selector s.

my $others = $dom.find('div p:not(:first-child)');

E F

An F element descendant of an E element.

my $headlines = $dom.find('div h1');

E > F

An F element child of an E element.

my $headlines = $dom.find('html > body > div > h1');

E + F

An F element immediately preceded by an E element.

my $second = $dom.find('h1 + h2');

E ~ F

An F element preceded by an E element.

my $second = $dom.find('h1 ~ h2');

E, F, G

Elements of type E, F and G.

my $headlines = $dom.find('h1, h2, h3');

E[foo=bar][bar=baz]

An E element whose attributes match all following attribute selectors.

my $links = $dom.find('a[foo^=b][foo$=ar]');

OPERATORS AND COERCIONS

You can use array subscripts and hash subscripts with DOM::Tiny. Using this class as an array or hash, though, is not recommended as several of the standard methods for these do not work as expected.

Array

You may use array subscripts as a shortcut for calling children:

my $third-child = $dom[2];

Hash

You may use hash subscripts as a shortcut for calling attr:

my $id = $dom<id>;

Str

If you convert the DOM::Tiny object to a string using Str, ~, or by interpolating it in a string, it will render the markup.

my $html = "$dom";

METHODS

Construction, Parsing, and Rendering

method new

method new(DOM::Tiny:U: Bool :$xml) returns DOM::Tiny:D

Constructs a DOM::Tiny object with an empty DOM tree. Setting the optional $xml flag guarantees XML mode. Setting it to false guarantees HTML mode. If it is unset, DOM::Tiny will select a mode based upon the parsed text, defaulting to HTML.

method deep-clone

method deep-clone(DOM::Tiny:D:) returns DOM::Tiny:D

Returns a deep-cloned copy of the current DOM::Tiny object and its children. Any change to the origin will not impact the copy and vice versa.

method parse

multi method parse(DOM::Tiny:U: Str $ml, Bool :$xml) returns DOM::Tiny:D
multi method parse(DOM::Tiny:D: Str $ml, Bool :$xml) returns DOM::Tiny:D

Parses the given string, $ml, as HTML or XML based upon the $xml flag or autodetection if the flag is not given. If called on an existing DOM::Tiny object, the newly parsed tree will replace the previous tree.

method render

method render(DOM::Tiny:D:) returns Str:D

This renders the current node and all its content back to a string and returns it. The format of the markup is determined by the current xml setting.

method Str

method Str(DOM::Tiny:D:) returns Str:D

This is a synonym for render.

method xml

method xml(DOM::Tiny:D:) is rw returns Bool:D

This is the boolean flag determining how the node was parsed and how it will be rendered.

Finding and Filtering Nodes

method at

method at(DOM::Tiny:D: Str:D $selector) returns DOM::Tiny

Given a CSS selector, this will return the first node matching that selector or Nil.

method find

method find(DOM::Tiny:D: Str:D $selector)

Returns all nodes matching the given CSS $selector within the current node.

method matches

method matches(DOM::Tiny:D: Str:D $selector) returns Bool:D

Returns True if the current node matches the given $selector or False otherwise.
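As a minimal sketch of how at, find, and matches fit together (assuming DOM::Tiny is installed and behaves as documented here; the markup and variable names are purely illustrative), note that at may return Nil, so it is worth guarding the result before calling methods on it:

```raku
use DOM::Tiny;

my $dom = DOM::Tiny.parse('<div class="note"><p>Hi</p></div>');

# at returns the first match, or Nil when nothing matches,
# so guard with `with` before calling methods on the result.
with $dom.at('p') -> $p {
    say $p.text;
}

# No <table> exists in this markup, so nothing defined comes back.
my $missing = $dom.at('table');
say $missing.defined;    # False

# find returns every match within the current node.
my @paras = $dom.find('p');
say @paras.elems;

# matches tests the current node itself against a selector.
say $dom.at('div').matches('.note');
say $dom.at('div').matches('p');
```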

Tag Details

method postcircumfix:<{}>

method postcircumfix:<{}>(DOM::Tiny:D: Str:D $k) is rw

You may use the .{} operator as a shortcut for calling the attr method and getting attributes on a tag. You may also use the :exists and :delete adverbs.

method hash

method hash(DOM::Tiny:D:) returns Hash

This is a synonym for attr, when it is called with no arguments.

method all-text

method all-text(DOM::Tiny:D: Bool :$trim = False) returns Str

Pulls the text from all nodes under the current item in the DOM tree and returns it as a string. This is identical to calling text with the :recurse flag set to True. The :trim flag may be set to true, which will cause all trimmable space to be clipped from the returned text (i.e., text not in an RCDATA tag like title or textarea and not in a pre tag).

method attr

multi method attr(DOM::Tiny:D:) returns Hash:D
multi method attr(DOM::Tiny:D: Str:D $name) returns Str
multi method attr(DOM::Tiny:D: Str:D $name, Str() $value) returns DOM::Tiny:D
multi method attr(DOM::Tiny:D: Str:D $name, Nil) returns DOM::Tiny:D
multi method attr(DOM::Tiny:D: *%values) returns DOM::Tiny:D

The attr multi-method provides a getter/setter for attributes on the current tag. If the current node is not a tag, this is basically a no-op and will silently do nothing.

With no arguments, the method returns the attributes of the tag as a Hash.

With a single string argument, it returns the value of the named attribute or Nil.

With two string arguments, it will set the value of the named attribute and return the current node.

With a string argument and a Nil, it will delete the attribute and return the current node.

Given one or more named arguments, the named values will be set to the given values and the current node will be returned.
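The different call forms of attr can be sketched as follows. This is only an illustration under the assumption that the multi candidates dispatch exactly as listed above; the markup and attribute names are made up for the example.

```raku
use DOM::Tiny;

my $dom  = DOM::Tiny.parse('<a href="/home" class="nav">Home</a>');
my $link = $dom.at('a');

# One string argument: read a single attribute.
say $link.attr('href');

# Two arguments: set an attribute; the node itself is returned.
$link.attr('href', '/index');

# Named arguments: set several attributes at once.
$link.attr(:title<Start>, :rel<home>);

# A name plus Nil: delete the attribute.
$link.attr('class', Nil);

# No arguments: all attributes as a Hash.
say $link.attr.keys.sort;
```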

method content

multi method content(DOM::Tiny:D:) returns Str:D
multi method content(DOM::Tiny:D: DOM::Tiny:D $tree) returns DOM::Tiny:D
multi method content(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D

This multi-method works with the content of the node, something like innerHTML in the standard DOM.

Given no arguments, it returns the markup within the element rendered to a string. If the node is empty or has no markup, it will return an empty string.

Given a DOM::Tiny, the tree within that object will replace the content of the current node. If the current node cannot have children, then this is a no-op and will silently do nothing. This returns the current node.

Given a string, the string will be parsed into HTML or XML (based upon the value of the :xml named argument, which defaults to the setting for the current node), and the generated node tree will be used to replace the content of the current node. The current node is returned.
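A short sketch of reading and replacing content, assuming the behavior described above (the markup is illustrative only):

```raku
use DOM::Tiny;

my $dom = DOM::Tiny.parse('<div><p>old</p></div>');
my $div = $dom.at('div');

# No arguments: the markup inside the element, as a string.
say $div.content;

# A string argument: parse it and replace the element's children.
$div.content('<span>new</span>');

say $div.content;
say $dom.render;
```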

method namespace

method namespace(DOM::Tiny:D:) returns Str

Returns the namespace URI of the current tag or Str if it has no namespace.

Returns Nil in all other cases (i.e., the current node is not a tag).

method tag

multi method tag(DOM::Tiny:D:) returns Str
multi method tag(DOM::Tiny:D: Str:D $tag) returns DOM::Tiny:D

If the current node is not a tag, both versions of this multi-method are no-ops that silently do nothing.

If no arguments are passed, the name of the tag is returned.

If a single string is passed, the name of the tag is changed to the given string and the current node is returned.

method text

method text(DOM::Tiny:D: Bool :$trim = False, Bool :$recurse = False) returns Str

This returns the text content of the current node. For a text node, this returns the text of the node itself. For a tag or the root, this will return the text of all of the immediate text node children of the current node concatenated together.

If the argument named :recurse is passed, this method will return the text of all descendants rather than just the immediate children. This is the same as calling all-text.

If the argument named :trim is passed, this method will compress all breaking space into single spaces while concatenating all the text together.
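The difference between the flags is easiest to see side by side. The following is a sketch that assumes the semantics described above; exact whitespace in the output depends on the parser, so no output is shown.

```raku
use DOM::Tiny;

my $dom = DOM::Tiny.parse('<div>Hello   <b>big</b>   world</div>');
my $div = $dom.at('div');

# Only the text nodes that are direct children of <div>.
say $div.text;

# Text of all descendants, including the content of <b>.
say $div.text(:recurse);

# Documented as equivalent to text(:recurse).
say $div.all-text;

# Whitespace runs compressed while concatenating.
say $div.text(:trim);
```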

method type

method type(DOM::Tiny:D:) returns Node:U

This method returns the type of node that is wrapped within the current DOM::Tiny object. This will be one of the following types:

Root The root node of the tree.

Tag Markup tag nodes within the tree.

Text A regular text node.

CDATA A CDATA text node.

Comment A comment node.

Doctype A DOCTYPE tag element.

PI An XML processing instruction. This is also used to represent the XML declaration even though it is technically not a PI.

Raw A special raw text node, used to represent the text inside of script and style tags.

In addition to these types, you may also want to make use of the following roles, which help group the node types together:

Node All nodes, including the root implement this role.

DocumentNode All nodes that have a parent have this role, i.e., all but the root.

HasChildren Only the nodes that have children have this role, so just Tag and Root.

TextNode All nodes that contain text have this role. This includes Text, CDATA, and Raw.

Each of these classes and roles is exported by DOM::Tiny by default. If you prevent these from being exported, you will need to use their full names, each of which is prefixed with DOM::Tiny::HTML::. For example, Tag has the full name DOM::Tiny::HTML::Tag and TextNode has the full name DOM::Tiny::HTML::TextNode.

method val

method val(DOM::Tiny:D:) returns Str

Returns the value of the tag. Returns Nil if the current tag has no notion of value or if the current node is not a tag.

Value is computed as follows, based on the tag name:

- option: If the option tag has a value attribute, that is the option's value. Otherwise, the option's text is used.
- input: The value attribute is used.
- button: The value attribute is used.
- textarea: The text content of the tag is used as the value.
- select: The value of the currently selected option is used. If no option is marked as selected, the select tag has no value. If the select tag has the multiple attribute set, then this returns all the selected values.
- Anything else will return Nil for the value.
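The rules above can be sketched with a small form. This assumes val follows the list exactly as written; the form fields and values are invented for illustration.

```raku
use DOM::Tiny;

my $dom = DOM::Tiny.parse(
    '<form>'
  ~ '<input name="user" value="alice">'
  ~ '<textarea name="note">hi</textarea>'
  ~ '<select name="color">'
  ~ '<option value="r">Red</option>'
  ~ '<option value="g" selected>Green</option>'
  ~ '</select>'
  ~ '</form>'
);

# input: the value attribute is used.
say $dom.at('input').val;

# textarea: the text content is used.
say $dom.at('textarea').val;

# select: the value of the selected option is used.
say $dom.at('select').val;

# A tag with no notion of value yields nothing defined.
say $dom.at('form').val.defined;
```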

method postcircumfix:<[]>

method postcircumfix:<[]>(DOM::Tiny:D: Int:D $i) is rw

The .[] can be used in place of child-nodes to retrieve children of the current root or tag from the DOM. The :exists and :delete adverbs also work.

method list

method list(DOM::Tiny:D:) returns List

This is a synonym for child-nodes.

method ancestors

method ancestors(DOM::Tiny:D: Str $selector?) returns Seq

Returns a sequence of the ancestors of the current node as DOM::Tiny objects. This will return an empty sequence for the root or any node that no longer has a parent (such as may be the case for a recently removed node).

method child-nodes

method child-nodes(DOM::Tiny:D: Bool :$tags-only = False)

If the current node has children (i.e., a tag or root), this method returns all of the children. If the :tags-only flag is set, it returns only the children that are tags.

If the current node has no children or is not able to have children, an empty list will be returned.

method children

method children(DOM::Tiny:D: Str $selector?)

If the current node has children, this method returns only the tags that are children of the current node. The $selector may be set to a CSS selector to filter the children returned. Only those matching the selector will be returned.

If the current node has no children or is not able to have children, an empty list will be returned.
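To contrast child-nodes with children, here is a small sketch, assuming both behave as described above (the markup is illustrative and mixes text, tags, and a comment on purpose):

```raku
use DOM::Tiny;

my $dom = DOM::Tiny.parse('<ul>text<li class="a">1</li><!-- c --><li>2</li></ul>');
my $ul  = $dom.at('ul');

# Every kind of child node: text, tags, and the comment.
say $ul.child-nodes.elems;

# Restrict child-nodes to tags.
say $ul.child-nodes(:tags-only).elems;

# children only ever returns tags.
say $ul.children.elems;

# children can additionally filter by a CSS selector.
say $ul.children('li.a').elems;
```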

method descendant-nodes

method descendant-nodes(DOM::Tiny:D:)

Returns all the descendants of the current node, or an empty list if there are none or the node cannot have descendants. They are returned in depth-first order.

method following

method following(DOM::Tiny:D: Str $selector?)

Returns all sibling tags of the current node that come after the current node. A $selector may be given to filter the returned tags.

method following-nodes

method following-nodes(DOM::Tiny:D:)

Returns all sibling nodes of the current node that come after the current node.

method next

method next(DOM::Tiny:D:) returns DOM::Tiny

Returns the next sibling tag of the current node. If there is no such sibling, it returns Nil.

method next-node

method next-node(DOM::Tiny:D:) returns DOM::Tiny

Returns the next sibling node of the current node. If there is no such sibling, it returns Nil.

method parent

method parent(DOM::Tiny:D:) returns DOM::Tiny

Returns the parent of the current node. If the current node is the root, this method returns Nil instead.

method preceding

method preceding(DOM::Tiny:D: Str $selector?)

Returns all siblings of the current node that are tags that come before the current node. A $selector may be given to filter the returned tags.

method preceding-nodes

method preceding-nodes(DOM::Tiny:D:)

Returns all sibling nodes of the current node that precede the current node.

method previous

method previous(DOM::Tiny:D:) returns DOM::Tiny

Returns the previous sibling tag of the current node. If there is no such sibling, it returns Nil.

method previous-node

method previous-node(DOM::Tiny:D:) returns DOM::Tiny

Returns the previous sibling node of the current node. If there is no such sibling, it returns Nil.

method root

method root(DOM::Tiny:D:) returns DOM::Tiny:D

Returns the root node of the tree.
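The navigation methods can be combined as in the following sketch. It assumes the methods behave as documented in this section; the list markup is illustrative, and the whitespace between items is there to show the difference between the tag and node variants.

```raku
use DOM::Tiny;

my $dom  = DOM::Tiny.parse('<ul><li id="a">A</li> <li id="b">B</li> <li id="c">C</li></ul>');
my $item = $dom.at('#b');

# Sibling tags, skipping over text nodes.
say $item.next.attr('id');
say $item.previous.attr('id');

# Sibling nodes of any kind; here the neighbor is whitespace text.
say $item.next-node.type;

# Upwards through the tree.
say $item.parent.tag;
say $item.ancestors.map(*.tag);

# All later and earlier sibling tags.
say $item.following.map(*.attr('id'));
say $item.preceding.map(*.attr('id'));
```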

Tree Modification

method append

method append(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D

Appends the given markup content immediately after the current node. The :xml flag may be set to determine whether the given markup should be parsed as XML or HTML (with the default being whatever the current document is being treated as).

If the current node is the root (i.e., $dom.type ~~ Root), this operation is a no-op. It will silently do nothing.

Returns the current node.

method append-content

method append-content(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D

If this is the root or a tag (i.e., $dom.type ~~ Root|Tag), the given markup will be parsed and appended to the end of the root's or tag's children. If this is a text node (i.e., $dom.type ~~ TextNode), then the markup will be appended to the text node parent's children. Otherwise this is a no-op and will silently do nothing.

The :xml flag may be used to specify the format for the markup being parsed, defaulting to the setting for the current document.

Returns the node whose children have been modified.

method prepend

method prepend(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D

Inserts the given markup content immediately before the current node. The :xml flag may be set to tell the parser to parse in XML mode or not (with the default being whatever is set for the current node).

If the current node is the root, this operation is a no-op and will silently do nothing.

This method will return the current node.

method prepend-content

method prepend-content(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D

Inserts the given markup content at the beginning of the current node's children, if it is the root or a tag. (This is a no-op that silently does nothing unless the current node is the root or a tag.) The :xml flag sets whether to parse the $ml as XML or not, with the default being the xml mode flag set on the current node.

This method returns the current node.
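The four insertion methods differ only in where the new markup lands relative to the current node. A sketch, assuming the behavior documented above (markup and ids are illustrative):

```raku
use DOM::Tiny;

my $dom  = DOM::Tiny.parse('<div><p id="a">A</p></div>');
my $para = $dom.at('#a');

# Siblings: placed outside the current node.
$para.append('<p id="after">after</p>');
$para.prepend('<p id="before">before</p>');

# Children: placed inside the current node.
$para.append-content('<i>last</i>');
$para.prepend-content('<b>first</b>');

say $dom.render;
```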

method remove

method remove(DOM::Tiny:D:) returns DOM::Tiny:D

Removes the current node from the tree and returns the parent node. If this node is the root, then the tree is emptied and the current node (i.e., the root) is returned.

method replace

multi method replace(DOM::Tiny:D: DOM::Tiny:D $tree) returns DOM::Tiny:D
multi method replace(DOM::Tiny:D: Str() $ml) returns DOM::Tiny:D

The current node is replaced with the tree or markup given.

If the current node is the root, the current node is returned. Otherwise, the parent of the node that was replaced is returned.

method strip

method strip(DOM::Tiny:D:) returns DOM::Tiny:D

If the current node is a tag, the tag is removed from the tree and its content moved up into the current node's original parent. This will then return the original node's parent.

If the current node is anything else, this is a no-op that will silently do nothing and return the current node.

method wrap

method wrap(DOM::Tiny:D: Str:D $ml, Bool :$xml = $!xml) returns DOM::Tiny:D

The given markup in $ml is parsed according to the format given by the :xml flag (defaulting to whatever the xml setting is for the current node). The current node is put within the innermost tag of the given markup. The current node is returned.

This is a no-op and will silently do nothing if the current node is the root.

method wrap-content

method wrap-content(DOM::Tiny:D: Str:D $ml, Bool :$xml = $!xml) returns DOM::Tiny:D

This is a no-op and will silently do nothing unless the current node is the root or a tag.

The given markup in $ml is parsed. The parsing proceeds as XML if the :xml flag is set or HTML otherwise (with the default being whatever the xml flag is set to on the current node). The content of the current node is then placed within the innermost tag of the parsed markup and that parsed markup replaces the content of the current node.
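As a sketch of the difference between wrap and wrap-content, assuming both work as described above (the markup is illustrative):

```raku
use DOM::Tiny;

my $dom = DOM::Tiny.parse('<p>Hello <b>World</b></p>');

# wrap: the <b> tag itself is placed inside the new <em>.
$dom.at('b').wrap('<em class="x"></em>');

# wrap-content: the children of <p> are placed inside the new <span>.
$dom.at('p').wrap-content('<span></span>');

say $dom.render;
```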

CAVEATS

This software is beta quality. It has been ported from a mature code base and survived many uses, but it has still only had a small number of bugs reported and fixed. It has a large test suite, but much of that has been ported from the Perl 5 module and is not necessarily specific to the kinds of bugs this port has. There has also been very little done to optimize the code or even to check to make sure it performs well in how it utilizes CPU and memory.

As of v0.5.0, this project is committed to the following signals regarding changes to this software in the future:

- The major version number ("1" in "1.2.3") will be incremented whenever a documented feature changes in a way that is not backwards compatible.
- The minor version number ("2" in "1.2.3") will be incremented whenever new features are added, any backwards compatible change is made to an undocumented feature, or some other significant change is made to the project.
- The patch number ("3" in "1.2.3") will be incremented whenever any other change is made (e.g., documentation, testing, minor bug fixes, etc.).

Semantic versioning is not a perfect system as it is not always crystal clear what distinguishes "bug fix" from "new feature" or "backwards compatible change" until after the fact, but I will try to do my best.

Any change thought to break backwards compatibility will be tagged with "BREAKING CHANGE" in Changes.

AUTHOR AND COPYRIGHT

This is free software, licensed under:

The Artistic License 2.0 (GPL Compatible)