SaxonC 12.1 EE Query.exe on Windows: how to use XPath 4.0 function parse-html

Added by Martin Honnen about 1 year ago

Saxon EE 12.1 Java seems to support the (proposed) XPath 4.0 parse-html function, provided I put e.g. htmlparser-1.4.16.jar on the Java classpath.

What about SaxonC EE 12.1, where I can't put anything on the classpath? Should it support parse-html? I get no output showing that it does, but no error either:

PS C:\Program Files\Saxonica\libsaxon-EEC-windows-v12.1\command> .\Query.exe -t -qversion:4.0 -qs:"parse-html('<p id=p1>This is a test')"
SaxonC-EE 12.1 from Saxonica
Java version 11.0.18
Using license serial number ...
Analyzing query from {parse-html('<p id=p1>This is a test')}
Analysis time: 0.9928 milliseconds
<?xml version="1.0" encoding="UTF-8"?>

Running EE Java with the same command line options shows the parsed HTML, e.g.

PS C:\Program Files\Saxonica\libsaxon-EEC-windows-v12.1\command> java -cp 'C:\Program Files\Saxonica\SaxonEE12-1J\saxon-ee-12.1.jar;C:\Users\marti\OneDrive\Documents\htmlparser\htmlparser-1.4.16.jar' net.sf.saxon.Query  -t -qversion:4.0 -qs:"parse-html('<p id=p1>This is a test')"
SaxonJ-EE 12.1 from Saxonica
Java version 11.0.12
Using license serial number ...
Analyzing query from {parse-html('<p id=p1>This is a test')}
Analysis time: 161.1342 milliseconds
Building tree for (unknown systemId) using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 3.8195ms
Tree size: 7 nodes, 14 characters, 1 attributes
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p id="p1">This is a test</p></body></html>Execution time: 50.7936ms
Memory used: 31Mb

Replies (8)


RE: SaxonC 12.1 EE Query.exe on Windows: how to use XPath 4.0 function parse-html - Added by Martin Honnen 12 months ago

Any answer here on how to use SaxonC EE and fn:parse-html?

RE: SaxonC 12.1 EE Query.exe on Windows: how to use XPath 4.0 function parse-html - Added by Norm Tovey-Walsh 12 months ago

Saxonica Developer Community writes:

Any answer here on how to use SaxonC EE and fn:parse-html?

Extensibility is an interesting challenge in SaxonC. I don’t know what
the right approach is, longer term. In the immediate term, I worked
around this problem in a Python script by parsing the HTML before
passing it to SaxonC. That’s not a general solution, of course, but it
got me over the hurdle.

Be seeing you,
norm

--
Norm Tovey-Walsh
Saxonica

RE: SaxonC 12.1 EE Query.exe on Windows: how to use XPath 4.0 function parse-html - Added by Martin Honnen 12 months ago

What kind of "parsing" in Python do you use to pass HTML (as a PyXdmNode?) to SaxonC? Or are you parsing the HTML into XHTML?

But I would mainly like to know whether the failure to bundle the HTML parser that Saxon Java uses for fn:parse-html with SaxonC EE is a build mistake/omission in the current release, or a known/deliberate limitation of SaxonC EE because license or technical reasons prevent you from bundling the third-party HTML parser with the rest of the software.

RE: SaxonC 12.1 EE Query.exe on Windows: how to use XPath 4.0 function parse-html - Added by Norm Tovey-Walsh 12 months ago

What kind of "parsing" in Python do you use to pass HTML (as a PyXdmNode?) to SaxonC? Or are
you parsing the HTML into XHTML?

I’m using the HTML parser to get an XML serialization which I then
reparse. It’s ugly, but it was the quickest way around an obstacle. (The
code where I’m doing this is a special-purpose link checker for
generated HTML.)

with open(htmlfile, "r", encoding="utf-8") as html:
    doc = html5_parser.parse(html.read())

# All my files are UTF-8!
text = lxml.etree.tostring(doc).decode("utf-8")

with PySaxonProcessor(license=False) as saxon:
    xslt = saxon.new_xslt30_processor()
    xexec = xslt.compile_stylesheet(stylesheet_text=extract)

    builder = saxon.new_document_builder()
    builder.set_base_uri('file:' + htmlfile)
    node = builder.parse_xml(xml_text=text)
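The workaround described above (turn the HTML into well-formed XML text first, then hand that text to an XML parser) can also be sketched with the Python standard library alone. This is a swapped-in technique, not the html5_parser approach from the post: it uses html.parser, which is not an HTML5-conformant parser. The names SoupToXml and html_to_xml_text are hypothetical, only the "a new p closes an open p" rule is modelled, and input with invalid XML names or control characters is not handled.

```python
from html import escape
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take an end tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupToXml(HTMLParser):
    """Hypothetical helper: re-serialize lenient HTML as well-formed XML text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # serialized output fragments
        self.open = []   # stack of currently open element names

    def handle_starttag(self, tag, attrs):
        # Assumption: only one implicit-close rule is modelled (<p> closes <p>).
        if tag == "p" and "p" in self.open:
            self._close_through("p")
        unique = {}
        for name, value in attrs:
            unique.setdefault(name, value or "")  # keep the first duplicate
        rendered = "".join(
            f' {name}="{escape(value, quote=True)}"' for name, value in unique.items()
        )
        if tag in VOID:
            self.out.append(f"<{tag}{rendered}/>")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.open.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open:  # stray end tags are ignored
            self._close_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def _close_through(self, tag):
        # Pop and close elements until `tag` itself has been closed.
        while self.open:
            name = self.open.pop()
            self.out.append(f"</{name}>")
            if name == tag:
                break

    def result(self):
        self.close()  # flush any buffered text
        while self.open:
            self.out.append(f"</{self.open.pop()}>")
        return "".join(self.out)


def html_to_xml_text(html):
    """Return XML text with a single <div> root wrapping the parsed fragment."""
    parser = SoupToXml()
    parser.feed(html)
    return "<div>" + parser.result() + "</div>"


if __name__ == "__main__":
    xml_text = html_to_xml_text("<p id=p1>This is a test.<p>This is a test.<br>This is a test.")
    ET.fromstring(xml_text)  # raises ParseError if the text is not well-formed
    print(xml_text)
```

The resulting string could then be passed to an XML parsing call such as the builder.parse_xml(xml_text=...) used in the post. A real HTML5 parser remains the better choice whenever one is available.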

But I mainly would like to know whether the omission to bundle the
HTML parser that Saxon Java uses for fn:parse-html with SaxonC EE is a
build mistake/omission of the current release or somehow a
known/deliberate shortcoming of SaxonC EE as license or technical
reasons prevent you from bundling the third party HTML parser with
the rest of the software.

“Yes.” (Which is kind of a flippant answer, I know. :-))

There are a number of extensions on Java that require additional
libraries that the user can supply as JAR files. In fact, there are some
APIs that allow for bespoke extension JARs.

Even if in principle this could be accomplished with dynamic libraries
for SaxonC, it would be far less practical as the libraries would have
to be provided for all of the platforms and all of the different
environments. (Turns out, “write once, run anywhere” was a good idea.)

We’ll want to include a library that supports fn:parse-html by the time
the 4.0 specs that define it are finished. Where it becomes practical in
the timeline between now and then will depend a little bit on how
difficult it turns out to be. That difficulty will involve both
technical choices (does this jar file work seamlessly with GraalVM?) and
non-technical ones (do we have a license to bundle this library, for
example?).

Be seeing you,
norm

--
Norm Tovey-Walsh
Saxonica

RE: SaxonC 12.1 EE Query.exe on Windows: how to use XPath 4.0 function parse-html - Added by Martin Honnen 12 months ago

I see.

In the meantime I have gone back to using David Carlisle's XSLT 2.0 implementation of an HTML tag soup parser, https://github.com/davidcarlisle/web-xslt/blob/main/htmlparse/htmlparse.xsl, from XSLT/XPath/XQuery. I give the htmlparse function declaration(s) a visibility="public" attribute and then call the function in SaxonC, either with fn:transform and initial-function or with SaxonC's API call_function_returning_..., e.g.

from saxonche import *

with PySaxonProcessor(license=False) as saxon_processor:

    print(saxon_processor.version)

    xslt_proc = saxon_processor.new_xslt30_processor()

    try:
        htmlparser_executable = xslt_proc.compile_stylesheet(stylesheet_file='htmlparse.xsl')
    except RuntimeError as e:
        print(f'Compiling htmlparse.xsl failed: {e}')
        exit(1)

    html = '''<p id=p1>This is a test.'''

    try:
        result = htmlparser_executable.call_function_returning_value('{data:,dpc}htmlparse', [saxon_processor.make_string_value(html), saxon_processor.make_string_value(''), saxon_processor.make_boolean_value(True)])
        print(result)
    except RuntimeError as e:
        print(f'Error parsing HTML: {e}')

    print()

    html = '''<p id=p1>This is a test.<p>This is a test.<br>This is a test.'''

    try:
        result = htmlparser_executable.call_function_returning_value('{data:,dpc}htmlparse', [saxon_processor.make_string_value(html), saxon_processor.make_string_value(''), saxon_processor.make_boolean_value(True)])
        print(result)
    except RuntimeError as e:
        print(f'Error parsing HTML: {e}')

That is sufficient for most use cases that need an XDM/PyXdmNode representation of an HTML fragment.
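For the other route mentioned above, fn:transform with initial-function, the XPath expression might look roughly like what the following sketch builds. The helper names xpath_string_literal and build_htmlparse_call are hypothetical; the option keys come from the fn:transform definition in XPath 3.1 Functions and Operators, the three arguments mirror the call_function_returning_value call above, and the expression has not been run against SaxonC here. It also assumes htmlparse.xsl can be resolved relative to the static base URI and that the function has been made public as described.

```python
def xpath_string_literal(text: str) -> str:
    """Quote text as an XPath string literal; an embedded ' is written as ''."""
    return "'" + text.replace("'", "''") + "'"


def build_htmlparse_call(html: str, stylesheet_location: str = "htmlparse.xsl") -> str:
    """Return an XPath 3.1 expression calling htmlparse through fn:transform.

    Assumption: the stylesheet exposes {data:,dpc}htmlparse with arity 3
    (HTML string, namespace string, boolean), as in the post above.
    """
    return (
        "transform(map {\n"
        f"  'stylesheet-location': {xpath_string_literal(stylesheet_location)},\n"
        "  'initial-function': QName('data:,dpc', 'htmlparse'),\n"
        f"  'function-params': [{xpath_string_literal(html)}, '', true()],\n"
        "  'delivery-format': 'raw'\n"
        "})?output"
    )


if __name__ == "__main__":
    print(build_htmlparse_call("<p id=p1>This is a test."))
```

Embedding the HTML as a string literal keeps the sketch short; binding it as an external variable would avoid the quoting step. The literal escaping shown applies to XPath; in XQuery an ampersand inside a string literal would also need escaping.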

RE: SaxonC 12.1 EE Query.exe on Windows: how to use XPath 4.0 function parse-html - Added by Michael Kay 12 months ago

I would add another pragmatic consideration: there's a good case for producing one implementation of each new QT4 feature as quickly as possible, so that we can create test cases, gain user experience, etc. But the specs will take a while to stabilise, so the case for producing multiple implementations right across the product line is much weaker - especially where implementation isn't trivial and where the spec is most likely to have edge cases that end up changing.

RE: SaxonC 12.1 EE Query.exe on Windows: how to use XPath 4.0 function parse-html - Added by O'Neil Delpratt 12 months ago

I have managed to build the SaxonC library with the htmlparser.jar. I confirm the above test case works.
