HTML entity parsing in David Carlisle's htmlparse.xsl fails in Saxon-JS 2

Added by Martin Honnen about 4 years ago

I have tested whether Saxon-JS 2 can handle the HTML parser that David Carlisle implemented in XSLT 2 and published online as htmlparse.xsl.

Saxon-JS 2 is able to import and compile the stylesheet, but I have found one issue: the parsing of HTML entity references doesn't work. Instead of resolving them, the stylesheet emits warnings such as htmlparse: Unknown entity: nbsp:

E:\SomeDir>xslt3 -t -s:htmlparse-test-input.xml -xsl:htmlparse-test1.xsl
Saxon-JS 2.0 from Saxonica
Node.js version v12.18.0
Compiling stylesheet E:\SomeDir\htmlparse-test1.xsl
Stylesheet compilation time: 2.337s
SEF generated by Saxon-JS 2.0 at 2020-07-02T20:15:01.226+02:00 with -target:JS -relocate:true

<!DOCTYPE html>
<html lang="en">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
         <h2>numeric character reference test</h2>
         <div>Text with non-breaking space: test.</div>
         <h2>hexadecimal character reference test</h2>
         <div>Text with non-breaking space: test.</div>
         <h2>htmlparse test</h2>htmlparse: Unknown entity: nbsp
htmlparse: Unknown entity: Auml

            <p>parse HTML &amp; XML:&amp;nbsp; HTML entity references
            <p lang="de">&amp;Auml;rger</p>
<!--Run with Saxon-JS 2.0-->

The same code works fine (i.e. htmlparse is able to parse the entities) with Saxon on Java or .NET.

The warning is emitted by the function d:chars (declared as xsl:function name="d:chars") in htmlparse.xsl.

Replies (2)

RE: HTML entity parsing in David Carlisle's htmlparse.xsl fails in Saxon-JS 2 - Added by Martin Honnen about 4 years ago

It seems the bug is in evaluating $d:ents/key('d:ents',regex-group(3)), which fails to find the entity. On the other hand, a simple $d:ents/key('d:ents', $entity-name), where the name comes from a variable, doesn't fail. So I am not sure whether the problem is related to the key call being made inside a function or to the use of regex-group inside the path expression.

A reduced test case would be

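<?xml version="1.0" encoding="utf-8"?>
<!-- Closing tags and wrapper elements missing from the listing are restored; the d: namespace URI and the html/body wrapper are assumptions. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:param name="entity-name" as="xs:string">aacute</xsl:param>
    <xsl:param name="entity-reference" as="xs:string"><![CDATA[&aacute;]]></xsl:param>
    <xsl:output method="html"/>
    <xsl:template match="/" name="xsl:initial-template">
        <html>
            <body>
                <h1>key test</h1>
                <p><code><xsl:value-of select="$entity-name || ': ' || $d:ents/key('d:ents', $entity-name)"/></code></p>
                <h1>function test</h1>
                <p><code><xsl:value-of select="$entity-reference || ': ' || d:chars($entity-reference)"/></code></p>
            </body>
        </html>
    </xsl:template>
    <xsl:variable name="d:ents">
        <entity name="Aacute">&#xC1;</entity>
        <entity name="aacute">&#xE1;</entity>
        <entity name="Acirc">&#xC2;</entity>
    </xsl:variable>
    <xsl:key name="d:ents" match="entity" use="@name"/>

    <xsl:function name="d:chars">
        <xsl:param name="s" as="xs:string"/>
        <xsl:value-of>
            <xsl:analyze-string select="$s" regex="&amp;(#?)(x?)([0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);">
                <xsl:matching-substring>
                    <xsl:choose>
                        <xsl:when test="$d:ents/key('d:ents',regex-group(3))">
                            <xsl:value-of select="$d:ents/key('d:ents',regex-group(3))"/>
                        </xsl:when>
                        <xsl:otherwise>
                            <xsl:message>htmlparse: Unknown entity: <xsl:value-of select="regex-group(3)"/></xsl:message>
                            <xsl:value-of select="regex-group(3)"/>
                        </xsl:otherwise>
                    </xsl:choose>
                </xsl:matching-substring>
                <xsl:non-matching-substring>
                    <xsl:value-of select="."/>
                </xsl:non-matching-substring>
            </xsl:analyze-string>
        </xsl:value-of>
    </xsl:function>
</xsl:stylesheet>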

When I rewrite the key use in the function to

                        <xsl:when test="key('d:ents', regex-group(3), $d:ents)">
                            <xsl:value-of select="key('d:ents', regex-group(3), $d:ents)"/>

the code works the same in Saxon-JS 2 and Saxon Java.
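
Since the lookup with a plain variable works, another variant that might be worth trying is to bind regex-group(3) to a variable before the path expression changes the focus. This is only a sketch and I have not verified it in Saxon-JS 2; the variable names name and match are made up for illustration, and the d: namespace URI is an assumption:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    exclude-result-prefixes="#all"
    version="3.0">

    <xsl:output method="text"/>

    <!-- Entity table and key as in the reduced test case; the d: namespace URI is assumed. -->
    <xsl:variable name="d:ents">
        <entity name="Aacute">&#xC1;</entity>
        <entity name="aacute">&#xE1;</entity>
        <entity name="Acirc">&#xC2;</entity>
    </xsl:variable>

    <xsl:key name="d:ents" match="entity" use="@name"/>

    <xsl:function name="d:chars">
        <xsl:param name="s" as="xs:string"/>
        <xsl:value-of>
            <xsl:analyze-string select="$s" regex="&amp;(#?)(x?)([0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);">
                <xsl:matching-substring>
                    <!-- Capture the group first, so that no regex-group() call sits inside the path expression. -->
                    <xsl:variable name="name" select="regex-group(3)"/>
                    <!-- Look the entity up once, using the two-argument key() form from the original code. -->
                    <xsl:variable name="match" select="$d:ents/key('d:ents', $name)"/>
                    <xsl:choose>
                        <xsl:when test="$match">
                            <xsl:value-of select="$match"/>
                        </xsl:when>
                        <xsl:otherwise>
                            <xsl:message>htmlparse: Unknown entity: <xsl:value-of select="$name"/></xsl:message>
                            <xsl:value-of select="$name"/>
                        </xsl:otherwise>
                    </xsl:choose>
                </xsl:matching-substring>
                <xsl:non-matching-substring>
                    <xsl:value-of select="."/>
                </xsl:non-matching-substring>
            </xsl:analyze-string>
        </xsl:value-of>
    </xsl:function>

    <!-- Entry point for a quick check; it needs no source document. -->
    <xsl:template name="xsl:initial-template">
        <xsl:value-of select="d:chars('&amp;aacute;')"/>
    </xsl:template>
</xsl:stylesheet>
```

If this variant resolves the entity while the original function does not, that would point at regex-group being evaluated inside the path step rather than at the function call itself.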

