NAME
DOM::Tiny - A lightweight, self-contained DOM parser/manipulator
SYNOPSIS
use DOM::Tiny;
# Parse
my $dom = DOM::Tiny.parse('<div><p id="a">Test</p><p id="b">123</p></div>');
# Find
say $dom.at('#b').text;
say $dom.find('p').map(*.text).join("\n");
say $dom.find('[id]').map(*.attr('id')).join("\n");
# Iterate
$dom.find('p[id]').reverse.map({ .<id>.say });
# Loop
for $dom.find('p[id]') -> $e {
    say $e<id>, ':', $e.text;
}
# Modify
$dom.find('div p')[*-1].append('<p id="c">456</p>');
$dom.find(':not(p)').map(*.strip);
# Render
say "$dom";
DESCRIPTION
DOM::Tiny is a smallish, relaxed pure-Perl HTML/XML DOM parser. It is relatively robust owing mostly to the enormous test suite inherited from its progenitor. The HTML/XML parsing is very forgiving and the CSS parser supports a reasonable subset of CSS3 for selecting elements in the DOM tree.
This module started as a port of Mojo::DOM58 from Perl 5, but maintaining compatibility with that library is not a major aim of this project. In fact, features of Perl 6 render certain aspects of Mojo::DOM58 completely redundant. For example, the collection system that provides custom features such as map, each, reduce, etc. is completely unnecessary in Perl 6. The built-in syntax is as simple or simpler to use and safer in every case.
NODES AND ELEMENTS
When we parse an HTML/XML fragment, it gets turned into a tree of nodes.
<!DOCTYPE html>
<html>
<head><title>Hello</title></head>
<body>World!</body>
</html>
There are currently the following different kinds of nodes: Root, Text, Tag, Raw, PI, Doctype, Comment, and CDATA. These can also be grouped into the following roles: DocumentNode (anything but Root), Node (all kinds), HasChildren (Root and Tag), and TextNode (includes Text, CDATA, and Raw).
Root
|- Doctype (html)
+- Tag (html)
   |- Tag (head)
   |  +- Tag (title)
   |     +- Text (Hello)
   +- Tag (body)
      +- Text (World!)
While all node types are represented as DOM::Tiny objects, some methods like attr and namespace only apply to elements.
Under normal circumstances you will probably never need to use these objects directly, but they are available in case you have some special need.
These objects are all defined in the DOM::Tiny::HTML namespace. If you want to import the short names, they are exported by default by that compilation unit:
{
use DOM::Tiny;
my $t = DOM::Tiny::HTML::Text.new(:text<Hello>);
}
{
use DOM::Tiny;
use DOM::Tiny::HTML;
my $t = Text.new(:text<Hello>);
}
CASE SENSITIVITY
DOM::Tiny defaults to HTML semantics. That means all tag and attribute names are automatically lowercased at parse time. Selectors therefore need to be lowercase to match anything, because matching itself is still case-sensitive.
# HTML semantics
my $dom = DOM::Tiny.parse('<P ID="greeting">Hi!</P>');
say $dom.at('p[id]').text;
If an XML declaration is found at the start of the snippet to parse, the parser will automatically switch into XML mode and everything becomes case-sensitive.
# XML semantics
my $dom = DOM::Tiny.parse('<?xml version="1.0"?><P ID="greeting">Hi!</P>');
say $dom.at('P[ID]').text;
XML detection can also be disabled or forced by explicitly setting the :xml flag as needed.
# Force XML semantics
my $dom = DOM::Tiny.parse('<P ID="greeting">Hi!</P>', :xml);
say $dom.at('P[ID]').text;
# Force HTML semantics
$dom = DOM::Tiny.parse('<P ID="greeting">Hi!</P>', :!xml);
say $dom.at('p[id]').text;
SELECTORS
DOM::Tiny uses a CSS selector engine found in DOM::Tiny::CSS. We try to support all CSS selectors that make sense for a standalone parser.
*
Any element.
my $all = $dom.find('*');
E
An element of type E.
my $title = $dom.at('title');
E[foo]
An E element with a foo attribute.
my $links = $dom.find('a[href]');
E[foo="bar"]
An E element whose foo attribute value is exactly equal to bar.
my $case_sensitive = $dom.find('input[type="hidden"]');
my $case_sensitive = $dom.find('input[type=hidden]');
E[foo="bar" i]
An E element whose foo attribute value is exactly equal to any case-permutation of bar.
my $case_insensitive = $dom.find('input[type="hidden" i]');
my $case_insensitive = $dom.find('input[type=hidden i]');
my $case_insensitive = $dom.find('input[class~="foo" i]');
This selector is part of Selectors Level 4. The "i" modifier may be added to any attribute selector to make what is normally an exact match to one that matches any case-permutation.
E[foo~="bar"]
An E element whose foo attribute value is a list of whitespace-separated values, one of which is exactly equal to bar.
my $foo = $dom.find('input[class~="foo"]');
my $foo = $dom.find('input[class~=foo]');
E[foo^="bar"]
An E element whose foo attribute value begins exactly with the string bar.
my $begins_with = $dom.find('input[name^="f"]');
my $begins_with = $dom.find('input[name^=f]');
E[foo$="bar"]
An E element whose foo attribute value ends exactly with the string bar.
my $ends_with = $dom.find('input[name$="o"]');
my $ends_with = $dom.find('input[name$=o]');
E[foo*="bar"]
An E element whose foo attribute value contains the substring bar.
my $contains = $dom.find('input[name*="fo"]');
my $contains = $dom.find('input[name*=fo]');
E:root
An E element, root of the document.
my $root = $dom.at(':root');
E:nth-child(n)
An E element, the n-th child of its parent.
my $third = $dom.find('div:nth-child(3)');
my $odd = $dom.find('div:nth-child(odd)');
my $even = $dom.find('div:nth-child(even)');
my $top3 = $dom.find('div:nth-child(-n+3)');
E:nth-last-child(n)
An E element, the n-th child of its parent, but counting backwards from the end.
my $third = $dom.find('div:nth-last-child(3)');
my $odd = $dom.find('div:nth-last-child(odd)');
my $even = $dom.find('div:nth-last-child(even)');
my $bottom3 = $dom.find('div:nth-last-child(-n+3)');
E:nth-of-type(n)
An E element, the n-th sibling of its type.
my $third = $dom.find('div:nth-of-type(3)');
my $odd = $dom.find('div:nth-of-type(odd)');
my $even = $dom.find('div:nth-of-type(even)');
my $top3 = $dom.find('div:nth-of-type(-n+3)');
E:nth-last-of-type(n)
An E element, the n-th sibling of its type, counting backwards from the end.
my $third = $dom.find('div:nth-last-of-type(3)');
my $odd = $dom.find('div:nth-last-of-type(odd)');
my $even = $dom.find('div:nth-last-of-type(even)');
my $bottom3 = $dom.find('div:nth-last-of-type(-n+3)');
E:first-child
An E element, first child of its parent.
my $first = $dom.find('div p:first-child');
E:last-child
An E element, last child of its parent.
my $last = $dom.find('div p:last-child');
E:first-of-type
An E element, first sibling of its type.
my $first = $dom.find('div p:first-of-type');
E:last-of-type
An E element, last sibling of its type.
my $last = $dom.find('div p:last-of-type');
E:only-child
An E element, only child of its parent.
my $lonely = $dom.find('div p:only-child');
E:only-of-type
An E element, only sibling of its type.
my $lonely = $dom.find('div p:only-of-type');
E:empty
An E element that has no children (including text nodes, meaning the element does not even contain whitespace).
my $empty = $dom.find(':empty');
E:checked
A user interface element E which is checked (for instance a radio-button or checkbox).
my $input = $dom.find(':checked');
E.warning
An E element whose class is "warning".
my $warning = $dom.find('div.warning');
E#myid
An E element with an "id" attribute equal to "myid". Basically, a shorthand for E[id=myid].
my $foo = $dom.at('div#foo');
E:not(s)
An E element that does not match simple selector s.
my $others = $dom.find('div p:not(:first-child)');
E F
An F element descendant of an E element.
my $headlines = $dom.find('div h1');
E > F
An F element child of an E element.
my $headlines = $dom.find('html > body > div > h1');
E + F
An F element immediately preceded by an E element.
my $second = $dom.find('h1 + h2');
E ~ F
An F element preceded by an E element.
my $second = $dom.find('h1 ~ h2');
E, F, G
Elements of type E, F and G.
my $headlines = $dom.find('h1, h2, h3');
E[foo=bar][bar=baz]
An E element whose attributes match all following attribute selectors.
my $links = $dom.find('a[foo^=b][foo$=ar]');
OPERATORS AND COERCIONS
You can use array subscripts and hash subscripts with DOM::Tiny. Using this class as an array or hash, though, is not recommended as several of the standard methods for these do not work as expected.
Array
You may use array subscripts as a shortcut for calling children:
my $third-child = $dom[2];
Hash
You may use hash subscripts as a shortcut for calling attr:
my $id = $dom<id>;
Str
If you convert the DOM::Tiny object to a string using Str, ~, or by interpolating it in a string, it will render the markup.
my $html = "$dom";
METHODS
Construction, Parsing, and Rendering
method new
method new(DOM::Tiny:U: Bool :$xml) returns DOM::Tiny:D
Constructs a DOM::Tiny object with an empty DOM tree. Setting the optional $xml flag to true guarantees XML mode. Setting it to false guarantees HTML mode. If it is unset, DOM::Tiny will select a mode based upon the parsed text, defaulting to HTML.
method deep-clone
method deep-clone(DOM::Tiny:D:) returns DOM::Tiny:D
Returns a deep-cloned copy of the current DOM::Tiny object and its children. Any change to the original will not impact the copy and vice versa.
method parse
method parse(DOM::Tiny:U: Str $ml, Bool :$xml) returns DOM::Tiny:D
method parse(DOM::Tiny:D: Str $ml, Bool :$xml) returns DOM::Tiny:D
Parses the given string, $ml, as HTML or XML based upon the $xml flag or autodetection if the flag is not given. If called on an existing DOM::Tiny object, the newly parsed tree will replace the previous tree.
method render
method render(DOM::Tiny:D:) returns Str:D
This renders the current node and all its content back to a string and returns it. The format of the markup is determined by the current xml setting.
method Str
method Str(DOM::Tiny:D:) returns Str:D
This is a synonym for render.
method xml
method xml(DOM::Tiny:D:) is rw returns Bool:D
This is the boolean flag determining how the node was parsed and how it will be rendered.
Finding and Filtering Nodes
method at
method at(DOM::Tiny:D: Str:D $selector) returns DOM::Tiny
Given a CSS selector, this will return the first node matching that selector or Nil.
method find
method find(DOM::Tiny:D: Str:D $selector)
Returns all nodes matching the given CSS $selector within the current node.
method matches
method matches(DOM::Tiny:D: Str:D $selector) returns Bool:D
Returns True if the current node matches the given $selector or False otherwise.
Tag Details
postcircumfix:<{}>
method postcircumfix:<{}>(DOM::Tiny:D: Str:D $k) is rw
You may use the .{} operator as a shortcut for calling the attr method and getting attributes on a tag. You may also use the :exists and :delete adverbs.
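As a minimal sketch of this shortcut, the following reads, tests, and deletes attributes through .{}. The markup and variable names are illustrative assumptions and are not taken from the module's own examples.

```raku
use DOM::Tiny;

# Illustrative markup; $link and the attribute values are hypothetical.
my $dom  = DOM::Tiny.parse('<a href="/docs" title="Docs">Read</a>');
my $link = $dom.at('a');

# Read an attribute; this is the shortcut for $link.attr('href').
my $href = $link<href>;

# Check whether an attribute is present with the :exists adverb.
my $has-title = $link<title>:exists;

# Remove an attribute with the :delete adverb.
$link<title>:delete;

say $href;
say "$dom";
```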
method hash
method hash(DOM::Tiny:D:) returns Hash
This is a synonym for attr, when it is called with no arguments.
method all-text
method all-text(DOM::Tiny:D: Bool :$trim = False) returns Str
Pulls the text from all nodes under the current item in the DOM tree and returns it as a string. This is identical to calling text with the :recurse flag set to True. The :trim flag may be set to true, which will cause all trimmable space to be clipped from the returned text (i.e., text not in an RCDATA tag like title or textarea and not in a pre tag).
method attr
multi method attr(DOM::Tiny:D:) returns Hash:D
multi method attr(DOM::Tiny:D: Str:D $name) returns Str
multi method attr(DOM::Tiny:D: Str:D $name, Str() $value) returns DOM::Tiny:D
multi method attr(DOM::Tiny:D: Str:D $name, Nil) returns DOM::Tiny:D
multi method attr(DOM::Tiny:D: *%values) returns DOM::Tiny:D
The attr multi-method provides a getter/setter for attributes on the current tag. If the current node is not a tag, this is basically a no-op and will silently do nothing.
With no arguments, the method returns the attributes of the tag as a Hash.
With a single string argument, it returns the value of the named attribute or Nil.
With two string arguments, it will set the value of the named attribute and return the current node.
With a string argument and a Nil, it will delete the attribute and return the current node.
Given one or more named arguments, the named values will be set to the given values and the current node will be returned.
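The calling conventions above can be sketched as follows. This is a hedged illustration based only on the signatures listed for attr; the markup, attribute names, and variable names are assumptions made up for the example.

```raku
use DOM::Tiny;

# Illustrative markup; names below are hypothetical.
my $dom = DOM::Tiny.parse('<p id="a" class="x">Test</p>');
my $p   = $dom.at('p');

# No arguments: all attributes of the tag as a Hash.
my %attrs = $p.attr;

# One string argument: read a single attribute (Nil if absent).
my $id = $p.attr('id');

# Two string arguments: set an attribute; the node itself is returned.
$p.attr('class', 'note');

# Named arguments: set several attributes in one call.
$p.attr(title => 'Greeting', lang => 'en');

# A name plus Nil: delete the attribute.
$p.attr('id', Nil);

say "$dom";
```

Because the setter forms return the current node, they can be chained when several changes are needed.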
method content
multi method content(DOM::Tiny:D:) returns Str:D
multi method content(DOM::Tiny:D: DOM::Tiny:D $tree) returns DOM::Tiny:D
multi method content(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D
This multi-method works with the content of the node, something like innerHTML in the standard DOM.
Given no arguments, it returns the markup within the element rendered to a string. If the node is empty or has no markup, it will return an empty string.
Given a DOM::Tiny, the tree within that object will replace the content of the current node. If the current node cannot have children, then this is a no-op and will silently do nothing. This returns the current node.
Given a string, the string will be parsed into HTML or XML (based upon the value of the :xml
named argument, which defaults to the setting for the current node), and the generated node tree will be used to replace the content of the current node. The current node is returned.
method namespace
method namespace(DOM::Tiny:D:) returns Str
Returns the namespace URI of the current tag or Str if it has no namespace.
Returns Nil in all other cases (i.e., the current node is not a tag).
method tag
multi method tag(DOM::Tiny:D:) returns Str
multi method tag(DOM::Tiny:D: Str:D $tag) returns DOM::Tiny:D
If the current node is not a tag, both versions of this multi-method are no-ops that silently do nothing.
If no arguments are passed, the name of the tag is returned.
If a single string is passed, the name of the tag is changed to the given string and the current node is returned.
method text
method text(DOM::Tiny:D: Bool :$trim = False, Bool :$recurse = False) returns Str
This returns the text content of the current node. For a text node, this returns the text of the node itself. For a tag or the root, this will return the text of all of the immediate text node children of the current node concatenated together.
If the argument named :recurse is passed, this method will return the text of all descendants rather than just the immediate children. This is the same as calling all-text.
If the argument named :trim is passed, this method will compress all breaking space into single spaces while concatenating all the text together.
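The difference between the flags might be explored with a sketch like this one. The markup is an assumption chosen so that some text sits directly in the div and some sits in a nested tag; exact whitespace in the results depends on the parser, so no output is shown.

```raku
use DOM::Tiny;

# Illustrative markup: text both directly inside <div> and inside <b>.
my $dom = DOM::Tiny.parse('<div>Hello <b>big</b> world</div>');
my $div = $dom.at('div');

# Only the immediate text-node children of <div>.
my $direct = $div.text;

# Text of all descendants, including the text inside <b>.
my $deep = $div.text(:recurse);

# Documented as the same as text(:recurse).
my $same = $div.all-text;

# Additionally compress breaking space while concatenating.
my $trimmed = $div.text(:recurse, :trim);

.say for $direct, $deep, $same, $trimmed;
```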
method type
method type(DOM::Tiny:D:) returns Node:U
This method returns the type of node that is wrapped within the current DOM::Tiny object. This will be one of the following types:
- Root: The root node of the tree.
- Tag: Markup tag nodes within the tree.
- Text: A regular text node.
- CDATA: A CDATA text node.
- Comment: A comment node.
- Doctype: A DOCTYPE tag element.
- PI: An XML processing instruction. This is also used to represent the XML declaration even though it is technically not a PI.
- Raw: A special raw text node, used to represent the text inside of script and style tags.
In addition to these types, you may also want to make use of the following roles, which help group the node types together:
- Node: All nodes, including the root, implement this role.
- DocumentNode: All nodes that have a parent have this role, i.e., all but the root.
- HasChildren: Only the nodes that have children have this role, so just Tag and Root.
- TextNode: All nodes that contain text have this role. This includes Text, CDATA, and Raw.
Each of these classes and roles is exported by DOM::Tiny by default. If you prevent these from being exported, you will need to use their full names, which are each prefixed with DOM::Tiny::HTML::. For example, Tag has the full name DOM::Tiny::HTML::Tag and TextNode has the full name DOM::Tiny::HTML::TextNode.
method val
method val(DOM::Tiny:D:) returns Str
Returns the value of the tag. Returns Nil if the current tag has no notion of value or if the current node is not a tag.
Value is computed as follows, based on the tag name:
- option: If the option tag has a value attribute, that is the option's value. Otherwise, the option's text is used.
- input: The value attribute is used.
- button: The value attribute is used.
- textarea: The text content of the tag is used as the value.
- select: The value of the currently selected option is used. If no option is marked as selected, the select tag has no value. If the select tag has the multiple attribute set, then this returns all the selected values.
- Anything else will return Nil for the value.
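A small sketch of how these rules might be exercised on a form follows. The form markup and field names are hypothetical; the comments restate the rules above rather than verified output.

```raku
use DOM::Tiny;

# Hypothetical form used only to exercise the rules listed above.
my $dom = DOM::Tiny.parse(q:to/HTML/);
    <form>
      <input name="q" value="raku">
      <select name="size">
        <option value="s">Small</option>
        <option value="l" selected>Large</option>
      </select>
      <textarea name="note">hello</textarea>
      <p>not a form control</p>
    </form>
    HTML

# input: the value attribute is used.
my $input-val = $dom.at('input').val;

# select: the value of the option marked as selected is used.
my $select-val = $dom.at('select').val;

# textarea: the text content is used.
my $area-val = $dom.at('textarea').val;

# Anything else has no notion of value.
my $p-val = $dom.at('p').val;

say $input-val;
say $select-val;
say $area-val;
say $p-val.defined;
```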
Tree Navigation
method postcircumfix:<[]>
method postcircumfix:<[]>(DOM::Tiny:D: Int:D $i) is rw
The .[] operator can be used in place of child-nodes to retrieve children of the current root or tag from the DOM. The :exists and :delete adverbs also work.
method list
method list(DOM::Tiny:D:) returns List
This is a synonym for child-nodes.
method ancestors
method ancestors(DOM::Tiny:D: Str $selector?) returns Seq
Returns a sequence of ancestors of the current object as DOM::Tiny objects. A $selector may be given to filter the returned ancestors. This will return an empty sequence for the root or any node that no longer has a parent (such as may be the case for a recently removed node).
method child-nodes
method child-nodes(DOM::Tiny:D: Bool :$tags-only = False)
If the current node has children (i.e., a tag or root), this method returns all of the children. If the :tags-only flag is set, it returns only the children that are tags.
If the current node has no children or is not able to have children, an empty list will be returned.
method children
method children(DOM::Tiny:D: Str $selector?)
If the current node has children, this method returns only the tags that are children of the current node. The $selector may be set to a CSS selector to filter the children returned. Only those matching the selector will be returned.
If the current node has no children or is not able to have children, an empty list will be returned.
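To contrast child-nodes with children, here is a minimal sketch. The markup is an assumption that mixes text, tag, and comment children inside one div.

```raku
use DOM::Tiny;

# Illustrative markup: a text node, a tag, a comment, and another tag.
my $dom = DOM::Tiny.parse('<div>text<p>one</p><!-- note --><p class="x">two</p></div>');
my $div = $dom.at('div');

# Every child node, whatever its type.
my @nodes = $div.child-nodes;

# Only the children that are tags.
my @tags = $div.child-nodes(:tags-only);

# children also returns only tags and can filter them by selector.
my @kids     = $div.children;
my @filtered = $div.children('.x');

say @nodes.elems, ' nodes, ', @tags.elems, ' tags, ', @filtered.elems, ' matching .x';
```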
method descendant-nodes
method descendant-nodes(DOM::Tiny:D:)
Returns all the descendants of the current node or an empty list if none or the node cannot have descendants. They are returned in depth-first order.
method following
method following(DOM::Tiny:D: Str $selector?)
Returns all sibling tags of the current node that come after the current node. A $selector may be given to filter the returned tags.
method following-nodes
method following-nodes(DOM::Tiny:D:)
Returns all sibling nodes of the current node that come after the current node.
method next
method next(DOM::Tiny:D:) returns DOM::Tiny
Returns the next sibling tag of the current node. If there is no such sibling, it returns Nil.
method next-node
method next-node(DOM::Tiny:D:) returns DOM::Tiny
Returns the next sibling node of the current node. If there is no such sibling, it returns Nil.
method parent
method parent(DOM::Tiny:D:) returns DOM::Tiny
Returns the parent of the current node. If the current node is the root, this method returns Nil instead.
method preceding
method preceding(DOM::Tiny:D: Str $selector?)
Returns all sibling tags of the current node that come before the current node. A $selector may be given to filter the returned tags.
method preceding-nodes
method preceding-nodes(DOM::Tiny:D:)
Returns all sibling nodes of the current node that precede the current node.
method previous
method previous(DOM::Tiny:D:) returns DOM::Tiny
Returns the previous sibling tag of the current node. If there is no such sibling, it returns Nil.
method previous-node
method previous-node(DOM::Tiny:D:) returns DOM::Tiny
Returns the previous sibling node of the current node. If there is no such sibling, it returns Nil.
method root
method root(DOM::Tiny:D:) returns DOM::Tiny:D
Returns the root node of the tree.
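The navigation methods in this section can be combined as in the sketch below. The markup and variable names are illustrative assumptions; note the guard around next, since it returns Nil when no sibling exists.

```raku
use DOM::Tiny;

# Illustrative markup with three sibling tags inside a div.
my $dom   = DOM::Tiny.parse('<div><h1>Title</h1><p id="a">A</p><p id="b">B</p></div>');
my $first = $dom.at('#a');

# Move up, back, and forward from the starting tag.
my $parent-tag   = $first.parent.tag;
my $previous-tag = $first.previous.tag;
my $next-id      = $first.next<id>;

# next returns Nil for the last sibling, so guard before using it.
with $dom.at('#b').next -> $n {
    say 'unexpected sibling: ', $n.tag;
}

# Walk the ancestors, optionally filtered by a selector.
say .tag for $first.ancestors;
say .tag for $first.ancestors('div');

# root returns the node at the top of the tree.
say $first.root.render;
```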
Tree Modification
method append
method append(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D
Appends the given markup content immediately after the current node. The :xml flag may be set to determine whether the given markup should be parsed as XML or HTML (with the default being whatever the current document is being treated as).
If the current node is the root (i.e., $dom.type ~~ Root), this operation is a no-op. It will silently do nothing.
Returns the current node.
method append-content
method append-content(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D
If this is the root or a tag (i.e., $dom.type ~~ Root|Tag), the given markup will be parsed and appended to the end of the root's or tag's children. If this is a text node (i.e., $dom.type ~~ TextNode), then the markup will be appended to the text node parent's children. Otherwise this is a no-op and will silently do nothing.
The :xml flag may be used to specify the format for the markup being parsed, defaulting to the setting for the current document.
Returns the node whose children have been modified.
method prepend
method prepend(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D
Inserts the given markup content immediately before the current node. The :xml flag may be set to tell the parser to parse in XML mode or not (with the default being whatever is set for the current node).
If the current node is the root, this operation is a no-op and will silently do nothing.
This method will return the current node.
method prepend-content
method prepend-content(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D
Inserts the given markup content at the beginning of the current node's children, if it is the root or a tag. (This is a no-op that silently does nothing unless the current node is the root or a tag.) The :xml flag sets whether to parse the $ml as XML or not, with the default being the xml mode flag set on the current node.
This method returns the current node.
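The following sketch contrasts the sibling-level methods (prepend, append) with the content-level ones (prepend-content, append-content). The list markup is an assumption, and the target node is looked up again before each call to avoid relying on how a handle behaves after the tree changes.

```raku
use DOM::Tiny;

# Illustrative markup: a list with a single item to modify around.
my $dom = DOM::Tiny.parse('<ul><li id="b">B</li></ul>');

# Sibling level: the new markup lands outside the <li>, before or after it.
$dom.at('#b').prepend('<li id="a">A</li>');
$dom.at('#b').append('<li id="c">C</li>');

# Content level: the new markup lands inside the <li>,
# at the start or the end of its children.
$dom.at('#b').prepend-content('<em>first</em>');
$dom.at('#b').append-content('<em>last</em>');

say "$dom";
```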
method remove
method remove(DOM::Tiny:D:) returns DOM::Tiny:D
Removes the current node from the tree and returns the parent node. If this node is the root, then the tree is emptied and the current node (i.e., the root) is returned.
method replace
multi method replace(DOM::Tiny:D: DOM::Tiny:D $tree) returns DOM::Tiny:D
multi method replace(DOM::Tiny:D: Str() $ml) returns DOM::Tiny:D
The current node is replaced with the tree or markup given.
If the current node is the root, the current node is returned. Otherwise, the original parent of this node, which has been replaced with the new tree, is returned.
method strip
method strip(DOM::Tiny:D:) returns DOM::Tiny:D
If the current node is a tag, the tag is removed from the tree and its content moved up into the current node's original parent. This will then return the original node's parent.
If the current node is anything else, this is a no-op that will silently do nothing and return the current node.
method wrap
method wrap(DOM::Tiny:D: Str:D $ml, Bool :$xml = $!xml) returns DOM::Tiny:D
The given markup in $ml is parsed according to the format given by the :xml flag (defaulting to whatever the xml setting is for the current node). The current node is put within the innermost tag of the given markup. The current node is returned.
This is a no-op and will silently do nothing if the current node is the root.
method wrap-content
method wrap-content(DOM::Tiny:D: Str:D $ml, Bool :$xml = $!xml) returns DOM::Tiny:D
This is a no-op and will silently do nothing unless the current node is the root or a tag.
The given markup in $ml is parsed. The parsing proceeds as XML if the :xml flag is set or HTML otherwise (with the default being whatever the xml flag is set to on the current node). The content of the current node is then placed within the innermost tag of the parsed markup and that parsed markup replaces the content of the current node.
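As a hedged sketch of the difference between wrap and wrap-content, the example below applies both to the same paragraph. The markup and the wrapper tags are assumptions for illustration.

```raku
use DOM::Tiny;

# Illustrative markup: a single paragraph to wrap.
my $dom = DOM::Tiny.parse('<p>Test</p>');

# wrap: the <p> itself is placed inside the innermost tag of the markup.
$dom.at('p').wrap('<div class="box"></div>');

# wrap-content: the children of <p> are placed inside the innermost tag,
# and that markup becomes the new content of <p>.
$dom.at('p').wrap-content('<em></em>');

say "$dom";
```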
CAVEATS
This software is beta quality. It has been ported from a mature code base and survived many uses, but it has still only had a small number of bugs reported and fixed. It has a large test suite, but much of that has been ported from the Perl 5 module and is not necessarily specific to the kinds of bugs this port has. There has also been very little done to optimize the code or even to check to make sure it performs well in how it utilizes CPU and memory.
As of v0.5.0, this project is committed to the following signals regarding changes to this software in the future:
The major version number ("1" in "1.2.3") will be incremented whenever a documented feature changes in a way that is not backwards compatible.
The minor version number ("2" in "1.2.3") will be incremented whenever new features are added, any backwards compatible change is made to an undocumented feature, or some other significant change is made to the project.
The patch number ("3" in "1.2.3") will be incremented whenever any other change is made (e.g., documentation, testing, minor bug fixes, etc.)
Semantic versioning is not a perfect system as it is not always crystal clear what distinguishes "bug fix" from "new feature" or "backwards compatible change" until after the fact, but I will try to do my best.
Any change thought to break backwards compatibility will be tagged with "BREAKING CHANGE" in Changes.
AUTHOR AND COPYRIGHT
Copyright 2008-2016 Sebastian Riedel and others.
Copyright 2016 Andrew Sterling Hanenkamp for the port to Perl 6.
This is free software, licensed under:
The Artistic License 2.0 (GPL Compatible)