Refactoring or compiling a stylesheet to remove need for collection() function - Saxon JS

Added by Nik Osvalds 12 months ago

I'm converting an existing XML validator from Saxon HE to Saxon JS. From my research and a proof of concept, it seems that using collection() in an .xslt that is run with Saxon JS hurts performance, because collectionFinder() cannot run asynchronously. See this section of the documentation: https://www.saxonica.com/saxon-js/documentation/index.html#!development/asynchrony .

So I'm wondering whether there is a way to restructure my .xslt file to remove the use of collections and get around this limitation of Saxon-JS.

This section of the .xslt file uses the collection() function to return nodes from several .xml files of codelist elements.

  <xsl:variable name="iati-codelists">
    <codes version="2.03">
      <xsl:apply-templates select="collection('../lib/schemata/2.03/codelist/?select=*.xml;recurse=yes')" mode="get-codelists"/>
      <xsl:apply-templates select="collection('../lib/schemata/non-embedded-codelist/?select=*.xml;recurse=yes')" mode="get-codelists"/>
    </codes>
  </xsl:variable>

  <xsl:function name="me:codeListFail" as="xs:boolean">
    <xsl:param name="code"/>
    <xsl:param name="codelist"/>
    <xsl:param name="iati-version"/>
    
    <xsl:sequence select="$code and 
      $iati-codelists/codes[@version=$iati-version]/codelist[@name=$codelist] and 
      not(($code, lower-case($code), upper-case($code))=$iati-codelists/codes[@version=$iati-version]/codelist[@name=$codelist]/code)"/>
  </xsl:function>

Example of one of the .xml files referenced by the collection():

BudgetType.xml

<codelist name="BudgetType" xml:lang="en" complete="1" embedded="1">
    <metadata>
        <name>
            <narrative>Budget Type</narrative>
        </name>
        <category>
            <narrative>Core</narrative>
        </category>
    </metadata>
    <codelist-items>
        <codelist-item>
            <code>1</code>
            <name>
                <narrative>Original</narrative>
                <narrative xml:lang="en">Original</narrative>
                <narrative xml:lang="fr">Original</narrative>
            </name>
            <description>
                <narrative>The original budget allocated to the activity</narrative>
                <narrative xml:lang="fr">Le budget initialement affecté à l'activité.</narrative>
            </description>
        </codelist-item>
        <codelist-item>
            <code>2</code>
            <name>
                <narrative>Revised</narrative>
                <narrative xml:lang="fr">Révisé</narrative>
            </name>
            <description>
                <narrative>The updated budget for an activity</narrative>
                <narrative xml:lang="fr">Le budget mis à jour pour une activité.</narrative>
            </description>
        </codelist-item>
    </codelist-items>
</codelist>

Is there some way to "compile" the collections into the .xslt before running the transform?

Or is this implementation (where I just load the nodes into memory before calling the transform function) basically the same thing?

const _ = require('lodash');
const fs = require('fs');
const fsPromises = fs.promises;
const SaxonJS = require('saxon-js');

// load codelists up front, since collectionFinder can't be async
let codelistPaths = [
    "non-embedded-codelist/",
    "2.03/codelist/",
    "2.02/codelist/",
    "2.01/codelist/",
    "1.05/codelist/",
    "1.04/codelist/",
    "1.03/codelist/"
];

// this returns an object with the codelistPaths as keys and, as values, arrays of the loaded codelist XML documents (the resolved results of the getResource promises)
let resources = _.zipObject(codelistPaths, await Promise.all(_.map(codelistPaths, async (path) => {
    let files = await fsPromises.readdir("./IATI-Rulesets/lib/schemata/" + path);
    return await Promise.all(files.map(async (file) => {
        return await SaxonJS.getResource({ type : 'xml', file : "./IATI-Rulesets/lib/schemata/" +  path + file })
    }))
})))

// this pulls the right array of SaxonJS resources from the resources object
const collectionFinder = (url) => {
    if (url.includes("codelist")) {
        // get the right key: drop everything up to "schemata/" and everything from the "?" onwards
        let path = url.split('schemata/')[1].split('?')[0];
        return resources[path]
    } else {
        return []
    }
}

// results filepath
let resultsPath = __dirname + "/file_storage/validated/" + xmlIn.md5 + '_results.xml'
// Applying the XSLT3 Ruleset to IATI Files Using SaxonJS
let transformStart = process.hrtime.bigint();
let results = await SaxonJS.transform({
    stylesheetFileName: "./IATI-Rulesets/rules/iati.sef.json",
    sourceFileName: filePath,
    destination: "serialized",
    collectionFinder: collectionFinder
}, "async")

Thanks, Nik


Replies (2)

RE: Refactoring or compiling a stylesheet to remove need for collection() function - Saxon JS - Added by Michael Kay 12 months ago

I think that the second approach is probably the best way of doing it in cases where the entire collection fits comfortably in memory.

I'm not sure about the detail. At first glance, you seem to be using "await" where you don't need to, especially before the call to getResource(), but I'm not going to try to advise you on writing good asynchronous JavaScript, since I'm not at all sure I have fully mastered the art.
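For readers following along, here is a minimal sketch of what the preloading step might look like with the redundant awaits removed. It is an illustration, not the code from either post: `loadCollections` and its `loadFile` parameter are hypothetical names, and the `.xml` filter is an assumption that mirrors `select=*.xml` in the stylesheet (it does not recurse into subdirectories).

```javascript
const fsPromises = require('fs').promises;
const path = require('path');

// Hypothetical helper: reads each directory under baseDir and loads its .xml
// files, returning an object of the form { [dir]: [loadedResource, ...] }.
// `loadFile` is an assumed callback that takes a file path and returns a Promise.
async function loadCollections(baseDir, dirs, loadFile) {
    const perDir = await Promise.all(dirs.map(async (dir) => {
        const files = await fsPromises.readdir(path.join(baseDir, dir));
        // No await before loadFile(): Promise.all already waits for every
        // promise in the array, so the inner function need not be async.
        return Promise.all(
            files
                .filter((file) => file.endsWith('.xml'))
                .sort()
                .map((file) => loadFile(path.join(baseDir, dir, file)))
        );
    }));
    // Pair each directory key with its array of loaded resources.
    return Object.fromEntries(dirs.map((dir, i) => [dir, perDir[i]]));
}
```

With SaxonJS, `loadFile` could be `(file) => SaxonJS.getResource({ type: 'xml', file })`, the call already used in the question. The only awaits that remain are the ones whose results are needed immediately.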

RE: Refactoring or compiling a stylesheet to remove need for collection() function - Saxon JS - Added by Nik Osvalds 12 months ago

Thanks Michael. I was able to load the full collections into memory when the Express app starts up, so they are ready to go when the API is called to transform the XML.

I did have to use asynchronous programming with getResource(). I wasn't able to get it to return anything synchronously. The documentation confirms this: https://www.saxonica.com/saxon-js/documentation/index.html#!api/getResource

SaxonJS.getResource(options)

This function retrieves a resource asynchronously, delivering a Promise which is fulfilled when the resource is available.

Here is my code for anyone else who needs it. I've removed some of the less important parts of my application for brevity, so it will not run on its own without modifications.

const _ = require('lodash');
const fs = require('fs');
const fsPromises = fs.promises;
const SaxonJS = require('saxon-js');

// load codelists since collectionFinder can't be async
let codelistPaths = [
    "non-embedded-codelist/",
    "2.03/codelist/",
    "2.02/codelist/",
    "2.01/codelist/",
    "1.05/codelist/",
    "1.04/codelist/",
    "1.03/codelist/"
];
let resources;
let beforeRSS = process.resourceUsage().maxRSS;
// this builds an object with the codelistPaths as keys and, as values, arrays of the loaded codelist XML documents (the resolved results of the getResource promises)
(async () => {
    try {
        resources = _.zipObject(codelistPaths, await Promise.all(_.map(codelistPaths, async (path) => {
            let files = await fsPromises.readdir("./IATI-Rulesets/lib/schemata/" + path);
            return await Promise.all(files.map(async (file) => {
                return await SaxonJS.getResource({ type : 'xml', file : "./IATI-Rulesets/lib/schemata/" +  path + file })
            }))
        })))    
    } catch (error) {
        console.log("Error loading collections: " + error);
    }           
    console.log((process.resourceUsage().maxRSS - beforeRSS) / 1000000 + " MB change in RSS");    // 89 Megabytes is rss
}) ()


// this pulls the right array of SaxonJS resources from the resources object
const collectionFinder = (url) => {
  if (url.includes("codelist")) {
    // get the right key: drop everything up to "schemata/" and everything from the "?" onwards
    let path = url.split('schemata/')[1].split('?')[0];
    return resources[path]
  } else {
    return []
  }
}

console.log(process.resourceUsage().maxRSS / 1000000 + "MB total in RSS");

// results filepath
let resultsPath = __dirname + "/file_storage/validated/" + xmlIn.md5 + '_results.xml'

// Applying the XSLT3 Ruleset to IATI Files Using SaxonJS
let transformStart = process.hrtime.bigint();
let results = await SaxonJS.transform({
  stylesheetFileName: "./IATI-Rulesets/rules/iati.sef.json",
  sourceFileName: filePath,
  destination: "serialized",
  collectionFinder: collectionFinder
}, "async")
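One thing to watch in the code above: `resources` is assigned inside an async function that nothing waits for, so a transform that starts before loading finishes (or after loading fails) would hit `resources[path]` while `resources` is still undefined. A minimal sketch of a more defensive finder follows. `makeCollectionFinder` is a hypothetical name, and the sketch assumes, as the code above does, that returning an empty array is acceptable when nothing matches.

```javascript
// Hypothetical helper: wraps an already-loaded resources object in a
// synchronous finder. `resources` maps keys such as "2.03/codelist/" to
// arrays of loaded documents.
function makeCollectionFinder(resources) {
    return (url) => {
        // Take the part of the URL after "schemata/"; bail out if it is absent.
        const afterSchemata = url.split('schemata/')[1];
        if (afterSchemata === undefined) {
            return [];
        }
        // Drop the "?select=...;recurse=..." query to get the lookup key.
        const key = afterSchemata.split('?')[0];
        // Unknown keys yield an empty collection instead of undefined.
        return resources[key] || [];
    };
}
```

The finder would be built only once the preload promise has resolved, for example by awaiting that promise before the Express app starts accepting requests, and then passed to `SaxonJS.transform` as `collectionFinder`.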
