Wrong sort order from API

Added by Lars Marius Garshol 5 months ago

Our stylesheets get the correct sort order when run on the command-line, but the wrong one when run via the API, and I can't figure out why. I've made a simplified test case to demonstrate the issue.

I have this stylesheet:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">
  <xsl:template match="/">
    <new-terms>
      <xsl:for-each select="terms/term">
        <xsl:sort select="." collation=
                  "http://www.w3.org/2013/collation/UCA?strength=primary;lang=en"/>

        <term><xsl:value-of select="."/></term>
      </xsl:for-each>
    </new-terms>
  </xsl:template>
</xsl:stylesheet>

which I apply to this input:

<terms>
  <term>block</term>
  <term>block data</term>
  <term>blockchain</term>
</terms>

When I run that with command-line Saxon, with no parameters other than the input, output, and stylesheet, I get (indented to make it readable):

<?xml version="1.0" encoding="UTF-8"?>
<new-terms>
  <term>block</term>
  <term>block data</term>
  <term>blockchain</term>
</new-terms>

This is correct. However, when I run the same stylesheet and input with the API, I get:

<?xml version="1.0" encoding="UTF-8"?>
<new-terms>
  <term>block</term>
  <term>blockchain</term>
  <term>block data</term>
</new-terms>

This is not correct. I wrote the minimal Java class below to reproduce the issue:

package com.realtaonline.render.xmlserver;

import java.io.File;

import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.XsltCompiler;
import net.sf.saxon.s9api.XsltExecutable;
import net.sf.saxon.s9api.XsltTransformer;

public class Transform {

  public static void main(String[] args) throws Exception {
    File style = new File(args[0]);
    File infile = new File(args[1]);
    File outfile = new File(args[2]);

    // true requests a licensed edition (PE/EE) if one is available
    Processor proc = new Processor(true);
    XsltCompiler compiler = proc.newXsltCompiler();
    StreamSource ssource = new StreamSource(style);
    XsltExecutable executable = compiler.compile(ssource);
    XsltTransformer transformer = executable.load();

    // Transform the input file and serialize the result to the output file
    transformer.setSource(new StreamSource(infile));
    transformer.setDestination(proc.newSerializer(outfile));
    transformer.transform();
  }
}

What am I doing wrong? How can I get the correct sort order via the API? I tried using an Xslt30Transformer like the Saxon command-line does, but that made no difference.


Replies (7)

RE: Wrong sort order from API - Added by Lars Marius Garshol 5 months ago

The formatting got messed up somehow. This is the correct output from the command-line:

<?xml version="1.0" encoding="UTF-8"?>
<new-terms>
  <term>block</term>
  <term>block data</term>
  <term>blockchain</term>
</new-terms>

RE: Wrong sort order from API - Added by Michael Kay 5 months ago

The most likely explanation is that one case is falling back to using collations provided by the JDK rather than collations provided by ICU. A good way to check this would be to add ;fallback=no to the collation URI - this will cause it to fail if ICU collations are not available.

Saxon should pick up ICU collations automatically if (a) the license is found, and (b) the ICU jar file can be found in the lib directory that's a sibling to the Saxon jar file. If it can't load the ICU collations, it will silently fall back to using JDK collations, unless you specify fallback=no.

Hope that gives you some pointers to investigate further.

There are of course other parameters you can use in a collation URI to control whether whitespace is ignorable. I have to say I find the rules in the Unicode spec highly confusing. I also find that different publishers seem to have very different ideas on what is "correct"!
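
To see how a JDK collator treats the three terms from the test case, a small experiment along the following lines may help. It is only a sketch: it assumes that a JDK-based fallback behaves roughly like `java.text.Collator` for English at primary strength, which is a guess and not something taken from Saxon's code. The class name `JdkCollationDemo` is made up for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class JdkCollationDemo {

    // Sorts the terms with the JDK's own English collator at primary strength.
    // Assumption: this approximates a JDK-based fallback; it is not Saxon code.
    static List<String> sortWithJdkCollator(List<String> terms) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(collator); // Collator is a Comparator<Object>
        return sorted;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("block", "block data", "blockchain");
        // Compare the printed order with the two outputs shown in the question.
        System.out.println(sortWithJdkCollator(terms));
    }
}
```

If the printed order matches the API output from the question, that would be consistent with the fallback explanation, though it does not prove it.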

RE: Wrong sort order from API - Added by Lars Marius Garshol 5 months ago

That was it!

I added ";fallback=no" and it crashed because ICU4J was missing. Once I added that to my dependencies the sort order became correct.

Thank you!
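
For anyone who assembles collation URIs in code, the change described above could be sketched like this. The helper name `withFallbackDisabled` is hypothetical; only the `fallback=no` parameter and the `;` separator come from this thread.

```java
final class CollationUris {

    // Appends fallback=no to a UCA collation URI unless a fallback
    // setting is already present. Hypothetical helper for illustration.
    static String withFallbackDisabled(String collationUri) {
        if (collationUri.contains("fallback=")) {
            return collationUri; // leave an explicit setting untouched
        }
        // Parameters follow '?' and are separated by ';'
        String separator = collationUri.contains("?") ? ";" : "?";
        return collationUri + separator + "fallback=no";
    }

    public static void main(String[] args) {
        String uri = "http://www.w3.org/2013/collation/UCA?strength=primary;lang=en";
        System.out.println(withFallbackDisabled(uri));
    }
}
```

The resulting string is what would go into the collation attribute of xsl:sort, so that a missing ICU4J causes a failure and not a silently different sort order.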

RE: Wrong sort order from API - Added by Lars Marius Garshol 5 months ago

> (b) the ICU jar file can be found in the lib directory that's a sibling to the Saxon Jar file

Is this really a good idea? It requires customers to have files placed in very specific locations, and if the files happen not to be there code will silently do the wrong thing. Or crash if you set ";fallback=no".

This means, for example, that you can't write an automated integration test to prove that the functionality works, because it relies on details in the filesystem layout.

We build our code with Maven and produce a "fat jar" that contains all the dependencies, but it seems that Saxon cannot find ICU4J via the classpath, but instead relies on this lib directory. If Saxon could look ICU4J up on the classpath instead of or in addition to using the lib directory, that would mean that we can produce a build we can automatically verify will have the correct behaviour in all environments.

RE: Wrong sort order from API - Added by Norm Tovey-Walsh 5 months ago

> Is this really a good idea? It requires customers to have files placed in very specific locations, and if the files happen not to be there code will silently do the wrong thing. Or crash if you set ";fallback=no".
>
> This means, for example, that you can't write an automated integration test to prove that the functionality works, because it relies on details in the filesystem layout.

Yes and no. If you download the release and unzip it, then you get a jar file that has a manifest that points to the additional jars in the lib directory. For users running Saxon from the command line, that has a couple of benefits. First, it just works and second, they can, if they need to, upgrade those components.

If you’re integrating Saxon into an application, a more common approach is to get the Saxon artifacts (along with all your other dependencies) through Maven or your build tool of choice.

In that case, the class path is managed by your build system and the bundled jar files don’t interfere with your application.

> We build our code with Maven and produce a "fat jar" that contains all the dependencies,

We could do that, but that binds the dependencies very tightly. If you wanted to update to a new point release of some dependency, you’d have to do a lot of work and create a new jar. That introduces additional complexity around signing the jar, etc.

> but it seems that Saxon cannot find ICU4J via the classpath, but instead relies on this lib directory. If Saxon could look ICU4J up on the classpath instead of or in addition to using the lib directory, that would mean that we can produce a build we can automatically verify will have the correct behaviour in all environments.

I’m not sure I follow. It isn’t Saxon per se that’s looking for ICU4J, it’s the Java classloader.

If you run net.sf.saxon.Transform with a classpath that includes ICU4J and the other dependencies (whether you do that from the command line or in some larger build system), it should work fine.

If you use ‘java -jar’ to run Saxon directly from the jar file, the built-in manifest and lib directory approach ensures that you don’t have to manage all those dependencies on the classpath yourself.

Does that make sense?

Be seeing you,
norm

--
Norm Tovey-Walsh
Saxonica
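
Since it is the Java classloader that locates ICU4J, a build can check for its presence without depending on the filesystem layout. The following is a minimal sketch, not part of Saxon: the class `IcuClasspathCheck` and the method `isLoadable` are invented names, and `com.ibm.icu.text.Collator` is assumed to be a suitable ICU4J class to probe for.

```java
final class IcuClasspathCheck {

    // A class shipped in the ICU4J jar; used here only as a probe.
    static final String ICU_COLLATOR_CLASS = "com.ibm.icu.text.Collator";

    // Returns true if the named class can be found by this class's loader.
    static boolean isLoadable(String className) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class.forName(className, false, IcuClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("ICU4J on classpath: " + isLoadable(ICU_COLLATOR_CLASS));
    }
}
```

A check like this only shows that the jar is visible to the classloader. As discussed elsewhere in this thread, a license must also be found, so adding fallback=no to the collation URI remains the more direct test.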

RE: Wrong sort order from API - Added by Lars Marius Garshol 5 months ago

Thanks for explaining. It turns out I misunderstood what Saxon is doing.

What I observed is that when I run my tests with "mvn test" Saxon finds the right collations if SAXON_HOME is set to the directory that has the license key and the lib/ directory. If I don't set SAXON_HOME then it doesn't find the collations.

I assumed this meant Saxon was using the environment variable to find ICU4J, but when I try setting the variable now and changing the name of the lib/ folder so Saxon can't use it I see that it still works. So I guess it must be the license key that Saxon really needs, and not the other parts.

That's perfectly fine.

Thanks for the help, and sorry about the confusion. I'm reverse engineering and trying to guess what is happening, and I guessed wrong.

RE: Wrong sort order from API - Added by Norm Tovey-Walsh 5 months ago

Saxonica Developer Community writes:

> I assumed this meant Saxon was using the environment variable to find ICU4J, but when I try setting the variable now and changing the name of the lib/ folder so Saxon can't use it I see that it still works. So I guess it must be the license key that Saxon really needs, and not the other parts.

That’s right. The use of ICU4J is a licensed, EE feature.

> Thanks for the help, and sorry about the confusion. I'm reverse engineering and trying to guess what is happening, and I guessed wrong.

Been there. Done that. :-)

Be seeing you,
norm

--
Norm Tovey-Walsh
Saxonica
