HTML entity parsing in David Carlisle's htmlparse.xsl fails in Saxon-JS 2

Added by Martin Honnen 4 months ago

I tested whether Saxon-JS 2 can run the HTML parser that David Carlisle implemented in XSLT 2. The raw source is at https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl, and a browsable view is at https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl.

Saxon-JS 2 appears to import and compile the stylesheet, but I found one issue: HTML entity references are not resolved. Instead, the parser emits warnings such as htmlparse: Unknown entity: nbsp:

E:\SomeDir>xslt3 -t -s:htmlparse-test-input.xml -xsl:htmlparse-test1.xsl
Saxon-JS 2.0 from Saxonica
Node.js version v12.18.0
Compiling stylesheet E:\SomeDir\htmlparse-test1.xsl
Stylesheet compilation time: 2.337s
SEF generated by Saxon-JS 2.0 at 2020-07-02T20:15:01.226+02:00 with -target:JS -relocate:true

<!DOCTYPE html>
<html lang="en">
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <title>Test</title>
   </head>
   <body>
      <section>
         <h2>numeric character reference test</h2>
         <div>Text with non-breaking space: test.</div>
      </section>
      <section>
         <h2>hexadecimal character reference test</h2>
         <div>Text with non-breaking space: test.</div>
      </section>
      <section>
         <h2>htmlparse test</h2>htmlparse: Unknown entity: nbsp
htmlparse: Unknown entity: Auml

         <div>
            <p>parse HTML &amp; XML:&amp;nbsp; HTML entity references
                </p>
            <p lang="de">&amp;Auml;rger</p>
         </div>
      </section>
   </body>
</html>
<!--Run with Saxon-JS 2.0-->

The same code works fine (i.e. htmlparse is able to parse the entities) with Saxon on Java or .NET.

The warning is emitted by the function d:chars (declared as xsl:function name="d:chars") at https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl#L259.


Replies (2)

RE: HTML entity parsing in David Carlisle's htmlparse.xsl fails in Saxon-JS 2 - Added by Martin Honnen 4 months ago

The bug seems to be in the evaluation of $d:ents/key('d:ents',regex-group(3)), which fails to find the entity. A simple $d:ents/key('d:ents', $entity-name), on the other hand, does not fail. So I am not sure whether the problem is related to the use inside a function or to the use of regex-group.

A reduced test case would be

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="http://example.com/dc"
    exclude-result-prefixes="#all"
    expand-text="yes">
    
    <xsl:param name="entity-name" as="xs:string">aacute</xsl:param>
    
    <xsl:param name="entity-reference" as="xs:string"><![CDATA[&aacute;]]></xsl:param>
    
    <xsl:output method="html"/>
    
    <xsl:template match="/" name="xsl:initial-template">
        <html>
            <head>
                <title>Test</title>
            </head>
            <body>
                <section>
                    <h1>key test</h1>
                    <p><code><xsl:value-of select="$entity-name || ': ' || $d:ents/key('d:ents', $entity-name)"/></code></p>
                </section>
                <section>
                    <h1>function test</h1>
                    <p><code><xsl:value-of select="$entity-reference || ': ' || d:chars($entity-reference)"/></code></p>
                </section>
            </body>
        </html>
    </xsl:template>
    
    <xsl:variable name="d:ents">
        <entity name="Aacute">&#xC1;</entity>
        <entity name="aacute">&#xE1;</entity>
        <entity name="Acirc">&#xC2;</entity>
    </xsl:variable>
    
    <xsl:key name="d:ents" match="entity" use="@name"/>

    <xsl:function name="d:chars">
        <xsl:param name="s" as="xs:string"/>
        <xsl:value-of>
            <xsl:analyze-string select="$s" regex="&amp;(#?)(x?)([0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);">
                <xsl:matching-substring>
                    <xsl:choose>
                        <xsl:when test="$d:ents/key('d:ents',regex-group(3))">
                            <xsl:value-of select="$d:ents/key('d:ents',regex-group(3))"/>
                        </xsl:when>
                        <xsl:otherwise>
                            <xsl:message>htmlparse: Unknown entity: <xsl:value-of select="regex-group(3)"/></xsl:message>
                            <xsl:text>&amp;</xsl:text>
                            <xsl:value-of select="regex-group(3)"/>
                            <xsl:text>;</xsl:text>
                        </xsl:otherwise>
                    </xsl:choose>
                </xsl:matching-substring>
                <xsl:non-matching-substring>
                    <xsl:value-of select="."/>
                </xsl:non-matching-substring>
            </xsl:analyze-string>
        </xsl:value-of>
    </xsl:function>
    
</xsl:stylesheet>

When I rewrite the key call in the function to the three-argument form

                        <xsl:when test="key('d:ents', regex-group(3), $d:ents)">
                            <xsl:value-of select="key('d:ents', regex-group(3), $d:ents)"/>
                        </xsl:when>

the code works the same in Saxon-JS 2 and Saxon Java.
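
To narrow down whether the path expression or regex-group is responsible, one further variant could be tried. The following is only a sketch of such a diagnostic: the function name d:chars-via-variable and the variables $name and $hit are hypothetical and not part of htmlparse.xsl, and I have not established how Saxon-JS 2 behaves with it. It copies regex-group(3) into a variable before entering the path expression, so the lookup has the same shape as the $d:ents/key('d:ents', $entity-name) call that works.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="http://example.com/dc"
    exclude-result-prefixes="#all">

    <xsl:output method="text"/>

    <!-- Same lookup table and key as in the reduced test case -->
    <xsl:variable name="d:ents">
        <entity name="Aacute">&#xC1;</entity>
        <entity name="aacute">&#xE1;</entity>
        <entity name="Acirc">&#xC2;</entity>
    </xsl:variable>

    <xsl:key name="d:ents" match="entity" use="@name"/>

    <!-- Hypothetical diagnostic variant of d:chars (not in htmlparse.xsl) -->
    <xsl:function name="d:chars-via-variable">
        <xsl:param name="s" as="xs:string"/>
        <xsl:value-of>
            <xsl:analyze-string select="$s" regex="&amp;(#?)(x?)([0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);">
                <xsl:matching-substring>
                    <!-- Read the captured group outside the path expression -->
                    <xsl:variable name="name" as="xs:string" select="regex-group(3)"/>
                    <!-- Same path-based key lookup, but with a plain variable as the key value -->
                    <xsl:variable name="hit" select="$d:ents/key('d:ents', $name)"/>
                    <xsl:choose>
                        <xsl:when test="$hit">
                            <xsl:value-of select="$hit"/>
                        </xsl:when>
                        <xsl:otherwise>
                            <!-- Unknown entity: report it and emit the reference unchanged -->
                            <xsl:message>Unknown entity: <xsl:value-of select="$name"/></xsl:message>
                            <xsl:value-of select="."/>
                        </xsl:otherwise>
                    </xsl:choose>
                </xsl:matching-substring>
                <xsl:non-matching-substring>
                    <xsl:value-of select="."/>
                </xsl:non-matching-substring>
            </xsl:analyze-string>
        </xsl:value-of>
    </xsl:function>

    <!-- Entry point: call the function on a single entity reference -->
    <xsl:template name="xsl:initial-template">
        <xsl:value-of select="d:chars-via-variable('&amp;aacute;')"/>
    </xsl:template>

</xsl:stylesheet>
```

The stylesheet needs no source document and starts from the initial template (for example with the -it option). If this variant resolves the entity in Saxon-JS 2 while the original function does not, that would point at the evaluation of regex-group inside the path step rather than at the key lookup or the function call itself.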
