Bug #6122

closed

XSLT Processor unsuccessfully reads source XML

Added by William Stiller 10 months ago. Updated 9 months ago.

Status: Closed
Priority: Normal
Category: Python
Start date: 2023-07-07
Due date:
% Done: 0%
Estimated time:
Found in version: 12.3.0
Fixed in version:
Platforms:

Description

The SaxonC XSLT processor fails to read the XML files in a directory. This fails with version 12.3.0 but works with version 12.0.0.

Link to Repo: https://github.com/WillStill/SaxonXML


Related issues

Related to SaxonC - Bug #6325: saxonche.PySaxonApiError: Null found in Java string conversation. Line number: -1 (New, O'Neil Delpratt, 2024-01-19)

Actions #1

Updated by Michael Kay 10 months ago

Please indicate what steps are needed to reproduce the problem.

Actions #2

Updated by William Stiller 10 months ago

The error message "Error reported by XML parser: Content is not allowed in prolog.: Content is not allowed in prolog." appears when attempting a transformation with saxonche 12.3.0 using:

xsltProc.transform_to_file(source_file=source, stylesheet_file=xslt, output_file=result)

The source_file=source parameter refers to a directory containing several XML documents. This error does not occur with saxonche 12.0.0.

Actions #3

Updated by O'Neil Delpratt 10 months ago

  • Category set to Python
  • Status changed from New to AwaitingInfo
  • Assignee set to O'Neil Delpratt

The source_file argument of the transform_to_file method is designed to accept file names. Passing a directory name as the source file is not allowed, which is why you are getting the parse error in SaxonC 12.3. We could give a better error message in this case. It is unfortunate that SaxonC 12.0 did not complain about it, but that may be because the source_file is not actually used in the stylesheet from your GitHub project.

In your stylesheet I see that the collection on the 'data' directory is given as an xsl:variable:

<xsl:variable name="dataColl" as="document-node()+" select="collection('data/?select=*.xml')"/>

What worked for me was to omit the source_file argument in your Python script. The following ran on SaxonC 12.3:

xsltproc.transform_to_file(stylesheet_file=xslt, output_file=result)

I hope that helps.
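
The difference between a source file and a source directory can also be checked before the transformation is called. The following is a minimal sketch using only the standard library; `require_source_file` is a hypothetical helper, not part of the saxonche API, and it only illustrates rejecting a directory early with a clearer message than the parser error.

```python
from pathlib import Path


def require_source_file(source):
    """Return source as a Path if it names an existing regular file.

    Hypothetical helper (not part of saxonche): it raises early when
    the path is a directory or does not exist.
    """
    path = Path(source)
    if path.is_dir():
        raise IsADirectoryError(
            f"{path} is a directory; source_file expects a single XML file"
        )
    if not path.is_file():
        raise FileNotFoundError(f"{path} does not exist")
    return path
```

A script could call `require_source_file(source)` before passing `source_file=source`, or leave the argument out entirely when the stylesheet reads its input through collection().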

Actions #4

Updated by O'Neil Delpratt 10 months ago

Checked out the latest code from your GitHub project. Now getting the result:

python XMLDataPull.py
SaxonC-HE 12.3 from Saxonica
<?xml version="1.0" encoding="UTF-8"?>
<root>
   <metadata>
      <machName>Spotify</machName>
      <time>2023-07-06T18:39:40.9903796-04:00</time>
   </metadata>
   <data>
      <cd num="1"
           artist="Bob Dylan"
           title="Empire Burlesque"
           sequence="1"
           year="1985">10.90</cd>
      <cd num="2"
           artist="Garg Moore"
           title="Still got the blues"
           sequence="2"
           year="1990">10.20</cd>
      <cd num="3"
           artist="Eros Ramazzotti"
           title="Eros"
           sequence="3"
           year="1997">9.90</cd>
      <cd num="4"
           artist="Dolly Parton"
           title="Greatest Hits"
           sequence="4"
           year="1982">9.90</cd>
      <cd num="5"
           artist="Bonnie Tyler"
           title="Hide your heart"
           sequence="5"
           year="1988">9.90</cd>
   </data>
</root>

miniDom parsing:  <xml.dom.minidom.Document object at 0x107873580>
<?xml version="1.0" encoding="UTF-8"?>
<root>
   <metadata>
      <machName>Youtube</machName>
      <time>2023-07-06T18:39:40.9903796-04:00</time>
   </metadata>
   <data>
      <cd num="1"
           artist="Bob Dylan"
           title="Empire Burlesque"
           sequence="1"
           year="1985">10.90</cd>
      <cd num="2"
           artist="Garg Moore"
           title="Still got the blues"
           sequence="2"
           year="1990">10.20</cd>
      <cd num="3"
           artist="Eros Ramazzotti"
           title="Eros"
           sequence="3"
           year="1997">9.90</cd>
      <cd num="4"
           artist="Dolly Parton"
           title="Greatest Hits"
           sequence="4"
           year="1982">9.90</cd>
      <cd num="5"
           artist="Bonnie Tyler"
           title="Hide your heart"
           sequence="5"
           year="1988">9.90</cd>
   </data>
</root>

miniDom parsing:  <xml.dom.minidom.Document object at 0x107884a60>
Actions #5

Updated by O'Neil Delpratt 10 months ago

  • Status changed from AwaitingInfo to In Progress

The user reports that they are still experiencing problems on a Windows machine.

Actions #6

Updated by Martin Honnen 10 months ago

My suggestion would still be to use an initial template and no source document, and then call_template... Also, to ensure that the result files created with xsl:result-document end up in a subdirectory, it is necessary to set the base_output_uri, which in my opinion is most easily done with the help of pathlib, e.g.

from saxonche import PySaxonProcessor

from pathlib import Path

with PySaxonProcessor(license=False) as proc:
	print(proc.version)
	xsltproc = proc.new_xslt30_processor()
	source = "data"
	xslt = "DataXSLT.xsl"
	result = "output"
	base_output_uri = Path('.', result, 'result').absolute().as_uri()
	#print(base_output_uri)
	xslt_executable = xsltproc.compile_stylesheet(stylesheet_file=xslt)
	xslt_executable.call_template_returning_value(template_name=None, base_output_uri=base_output_uri)

and the XSLT having e.g.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0">

    <xsl:variable name="dataColl" as="document-node()+" select="collection('data/?select=*.xml')"/>

    <xsl:template name="xsl:initial-template" match="/">
        <xsl:for-each select="$dataColl">
            <xsl:result-document href="{//DeviceStream/@name}.xml" method="xml" indent="yes">
                <root>
                    <metadata>
                        <machName>
                            <xsl:value-of
                                select="//DeviceStream[not(@name = 'Agent')]/@name"/>
                        </machName>
                        <time>
                            <xsl:value-of select="current-dateTime()"/>
                        </time>
                    </metadata>
                    <data>
                        <xsl:for-each select="//*[@sequence]">
                            <xsl:sort select="xs:integer(@sequence)"/>
                            <xsl:copy>
                                <xsl:attribute name="num" select="@sequence"/>
                                <xsl:copy-of select="@*, node()"/>
                            </xsl:copy>
                        </xsl:for-each>
                    </data>
                </root>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>

</xsl:stylesheet>
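
The extra `result` segment at the end of base_output_uri may look odd. A relative href in xsl:result-document is resolved against the base output URI, so the last segment of that URI is replaced by the href. The sketch below illustrates this with the standard library only; `urljoin` stands in for the processor's own URI resolution, and the file name Spotify.xml is an illustrative value for what the stylesheet computes from @name.

```python
from pathlib import Path
from urllib.parse import urljoin

# Same construction as in the script above: the final segment ("result")
# is a placeholder that is replaced during relative URI resolution.
base_output_uri = Path(".", "output", "result").absolute().as_uri()

# Illustrative href value; in the stylesheet it is computed from @name.
resolved = urljoin(base_output_uri, "Spotify.xml")

print(base_output_uri)
print(resolved)
```

The second URI printed points into the output directory, which is where the secondary result files are expected to land.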


Actions #7

Updated by O'Neil Delpratt 10 months ago

Thanks, Martin, for your suggestion. Yes, that would work. Just to add, William may also find it useful to capture the result documents in SaxonC for further processing. They are returned as a dict object [URI, PyXdmValue], with the document URI as the key and the secondary result document as the PyXdmValue. For example:

from saxonche import PySaxonProcessor

from pathlib import Path

with PySaxonProcessor(license=False) as proc:
	print(proc.version)
	xsltproc = proc.new_xslt30_processor()
	source = "data"
	xslt = "DataXSLT.xsl"
	result = "output"
	base_output_uri = Path('.', result, 'result').absolute().as_uri()
	#print(base_output_uri)
	xslt_executable = xsltproc.compile_stylesheet(stylesheet_file=xslt)
	xslt_executable.set_capture_result_documents(True)
	result = xslt_executable.call_template_returning_value(template_name=None, base_output_uri=base_output_uri)
	# Dict mapping each result document URI to its PyXdmValue
	result_documents = xslt_executable.get_result_documents()
	# Unpacking a dict yields its keys, so this prints the URIs
	print(*result_documents, sep=", ")
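
Once the result documents have been captured, the keys of the returned dict can be mapped back to local paths for further processing. The sketch below assumes the keys are file: URI strings, as they would be when base_output_uri is built with Path.as_uri(). `uri_to_path` is a hypothetical helper, and the dict uses a placeholder string in place of a PyXdmValue so that the example runs without SaxonC.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def uri_to_path(uri):
    """Convert a file: URI to a local Path (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"not a file URI: {uri}")
    return Path(url2pathname(parsed.path))


# Stand-in for the dict returned by get_result_documents();
# real values would be PyXdmValue objects, not strings.
example_uri = Path("output", "Spotify.xml").absolute().as_uri()
result_documents = {example_uri: "<root/>"}

for uri, value in result_documents.items():
    print(uri_to_path(uri).name, "->", value)
```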
Actions #8

Updated by Elisa Beshero-Bondar 10 months ago

Hi O'Neil and Michael. I don't think the problem is with generating output. It is simply with reading input and launching the XSLT. That is, I do not think we need to change the structure of the XSLT document, and we're already set up with xsl:result-document to output files the way we want.

The issue is simply the function running the XSLT in the first place.

On Windows 10 and 11 (two different machines) we have this problem:

In saxonche 12.3, 12.2, and 12.1 on Windows systems, the XSLT does not run when we use this form of transform_to_file():

xsltproc.transform_to_file(stylesheet_file=xslt, output_file=result)  

If we specify an input, this fails with the originally posted error ("Content is not allowed in prolog").

We downgraded to saxonche 12.0 and we find this:

  1. The XSLT does not run when we use:
xsltproc.transform_to_file(stylesheet_file=xslt, output_file=result)
  2. The XSLT does run successfully and populates the output directory properly, but we MUST specify a source file in 12.0.

This works and it's the only thing that has worked on Windows:

xsltProc.transform_to_file(source_file=source, stylesheet_file=xslt, output_file=result)
Actions #9

Updated by Elisa Beshero-Bondar 10 months ago

Just a quick update to specify that in the line that works for us on Windows, the following are variables that store relative filepaths:

  • source
  • xslt
  • result
Actions #10

Updated by Martin Honnen 10 months ago

Is source supposed to be an XML document to be processed? Or a directory?

Are you running that code with Saxon from the command line providing -s or in oXygen naming an input document?

I still don't understand what source is supposed to be needed for, or the initial match="/", because the only code inside is <xsl:for-each select="$dataColl">..</xsl:for-each>, which does not use any XML input document, context node, or context item at all. That's why I think using a named template and starting with it is more appropriate in the context of XSLT 2 or 3, with any version and platform Saxon exists for.

That SaxonC with transform_to_file and source_file=source, where source is a directory, worked in some cases is in my view more a quirk or bug of that particular Saxon version than a reasonable result to expect.

But let's see what Saxonica thinks. My suggestion to use a different method in the API and a slight change in the stylesheet (i.e. start with a named template) was meant to allow you to get consistent results (hopefully; admittedly I only tested with 12.3) and not to rely on a quirk of 12.something that has gone away in the current release.

Actions #11

Updated by O'Neil Delpratt 10 months ago

  • Status changed from In Progress to AwaitingInfo

I looked over the Python script with Elisa, and we concluded that the source document is not actually required by this stylesheet and that SaxonC 12.3 is actually correct. As an improvement, the PyXsltExecutable is best suited for this use case.

Actions #12

Updated by William Stiller 10 months ago

Thank you all for the help. I managed to replicate Martin's and O'Neil's scripts successfully with SaxonC 12.3. I've created a named template in the XSLT following Martin's suggestion and now use call_template_returning_value() instead of transform_to_file(). My current script successfully outputs transformed XML. I'll update the repo I supplied with the solution soon.

Actions #13

Updated by O'Neil Delpratt 9 months ago

  • Status changed from AwaitingInfo to Closed

Closing this bug as the user has managed to get their Python script to work correctly using SaxonC 12.3.

Actions #14

Updated by O'Neil Delpratt 4 months ago

  • Related to Bug #6325: saxonche.PySaxonApiError: Null found in Java string conversation. Line number: -1 added
