HTML entity parsing in David Carlisle's htmlparse.xsl fails in Saxon-JS 2
Added by Martin Honnen over 4 years ago
I have tested whether Saxon-JS 2 can handle the HTML parser that David Carlisle implemented in XSLT 2, available as raw source at https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl and viewable at https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl.
Saxon-JS 2 is able to import and compile the stylesheet, but I have found one issue: the parsing of HTML entity references doesn't work. Instead it emits warnings such as htmlparse: Unknown entity: nbsp, as the following run shows:
E:\SomeDir>xslt3 -t -s:htmlparse-test-input.xml -xsl:htmlparse-test1.xsl
Saxon-JS 2.0 from Saxonica
Node.js version v12.18.0
Compiling stylesheet E:\SomeDir\htmlparse-test1.xsl
Stylesheet compilation time: 2.337s
SEF generated by Saxon-JS 2.0 at 2020-07-02T20:15:01.226+02:00 with -target:JS -relocate:true
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Test</title>
</head>
<body>
<section>
<h2>numeric character reference test</h2>
<div>Text with non-breaking space: test.</div>
</section>
<section>
<h2>hexadecimal character reference test</h2>
<div>Text with non-breaking space: test.</div>
</section>
<section>
<h2>htmlparse test</h2>htmlparse: Unknown entity: nbsp
htmlparse: Unknown entity: Auml
<div>
<p>parse HTML & XML:&nbsp; HTML entity references
</p>
<p lang="de">&Auml;rger</p>
</div>
</section>
</body>
</html>
<!--Run with Saxon-JS 2.0-->
The same code works fine (i.e. htmlparse is able to parse the entities) with Saxon on Java or .NET.
The warning is emitted by the function declared as xsl:function name="d:chars" in https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl#L259.
htmlparse-test1.xsl (761 Bytes)
htmlparse.xsl (29.9 KB)
htmlparse-test-input.xml (649 Bytes)
Replies (2)
RE: HTML entity parsing in David Carlisle's htmlparse.xsl fails in Saxon-JS 2 - Added by Martin Honnen over 4 years ago
It seems the bug is in evaluating $d:ents/key('d:ents',regex-group(3)), which fails to find the entity. On the other hand, a simple $d:ents/key('d:ents', $entity-name) doesn't fail, so I am not sure whether the problem is related to the use inside a function or to the use of regex-group.
A reduced test case would be:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="3.0"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:d="http://example.com/dc"
exclude-result-prefixes="#all"
expand-text="yes">
<xsl:param name="entity-name" as="xs:string">aacute</xsl:param>
<xsl:param name="entity-reference" as="xs:string"><![CDATA[&aacute;]]></xsl:param>
<xsl:output method="html"/>
<xsl:template match="/" name="xsl:initial-template">
<html>
<head>
<title>Test</title>
</head>
<body>
<section>
<h1>key test</h1>
<p><code><xsl:value-of select="$entity-name || ': ' || $d:ents/key('d:ents', $entity-name)"/></code></p>
</section>
<section>
<h1>function test</h1>
<p><code><xsl:value-of select="$entity-reference || ': ' || d:chars($entity-reference)"/></code></p>
</section>
</body>
</html>
</xsl:template>
<xsl:variable name="d:ents">
<entity name="Aacute">Á</entity>
<entity name="aacute">á</entity>
<entity name="Acirc">Â</entity>
</xsl:variable>
<xsl:key name="d:ents" match="entity" use="@name"/>
<xsl:function name="d:chars">
<xsl:param name="s" as="xs:string"/>
<xsl:value-of>
<xsl:analyze-string select="$s" regex="&amp;(#?)(x?)([0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);">
<xsl:matching-substring>
<xsl:choose>
<xsl:when test="$d:ents/key('d:ents',regex-group(3))">
<xsl:value-of select="$d:ents/key('d:ents',regex-group(3))"/>
</xsl:when>
<xsl:otherwise>
<xsl:message>htmlparse: Unknown entity: <xsl:value-of select="regex-group(3)"/></xsl:message>
<xsl:text>&amp;</xsl:text>
<xsl:value-of select="regex-group(3)"/>
<xsl:text>;</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:value-of>
</xsl:function>
</xsl:stylesheet>
When I rewrite the key lookup in the function to pass $d:ents as the third argument of key(), as in
<xsl:when test="key('d:ents', regex-group(3), $d:ents)">
<xsl:value-of select="key('d:ents', regex-group(3), $d:ents)"/>
</xsl:when>
the code gives the same result in Saxon-JS 2 and in Saxon on Java.
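To help narrow down whether the function or the regex-group use is responsible, a further variant could bind regex-group() to a variable before the path expression is evaluated. The following is only a minimal sketch and has not been run against Saxon-JS 2.0: the function name d:chars-var, the variables $name and $hit, the simplified regex (named references only) and the one-entry entity table are assumptions made for illustration and are not taken from htmlparse.xsl.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:d="http://example.com/dc"
  exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Illustrative one-entry lookup table (assumption, not the full table) -->
  <xsl:variable name="d:ents">
    <entity name="aacute">&#xE1;</entity>
  </xsl:variable>

  <xsl:key name="d:ents" match="entity" use="@name"/>

  <!-- Hypothetical variant of d:chars that handles named references only -->
  <xsl:function name="d:chars-var">
    <xsl:param name="s" as="xs:string"/>
    <xsl:value-of>
      <xsl:analyze-string select="$s" regex="&amp;([a-zA-Z][a-zA-Z0-9]*);">
        <xsl:matching-substring>
          <!-- Read the captured group first, outside any path expression -->
          <xsl:variable name="name" as="xs:string" select="regex-group(1)"/>
          <!-- The key lookup now only sees a plain variable reference -->
          <xsl:variable name="hit" select="$d:ents/key('d:ents', $name)"/>
          <!-- Unknown names fall back to the matched text unchanged -->
          <xsl:value-of select="if ($hit) then $hit else ."/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:value-of>
  </xsl:function>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:value-of select="d:chars-var('&amp;aacute;rbol')"/>
  </xsl:template>

</xsl:stylesheet>
```

If this variant resolves the entity in Saxon-JS 2 while the original $d:ents/key('d:ents',regex-group(3)) form does not, that would point at regex-group() being evaluated on the right-hand side of the path step rather than at the function call itself.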
key-failure3.xsl (2.5 KB)
RE: HTML entity parsing in David Carlisle's htmlparse.xsl fails in Saxon-JS 2 - Added by Michael Kay over 4 years ago
Logged as a bug here: