demo of what might be a bug in Saxon-JS 1.2.0

2020-04-18 by Syd Bauman

The associated XSLT program (./bug_demo.xslt) is supposed to return a single number (cast to a string for output). It counts how many “words” there are in the input document, where a word is defined as a whitespace-separated token in the content selected by a particular XPath, excluding anything that is a descendant of one of a handful of ignorable elements.

If it is being run in the browser, it is supposed to replace the contents of the paragraph below with the number.

[The number goes here]

If it is being run on the command line, it writes the output to a file in the /tmp/ directory. (I used <xsl:result-document> even from the command line just to keep the two cases parallel. I would probably get the same results if the output were just written to STDOUT or to wherever the -o: switch says.) It always seems to count correctly when run on the command line, whether using Saxon 9.9 or 10.0.

The browser, where I am using Saxon-EE 9.9.1.5 to generate a SEF file from the program, is a different story. When I started writing this, it did not seem to matter how big the input document (../bug_demo.xml) was: I got the same “XError: looping???” error from Saxon in the browser whether the document was over 11,000 words long or 11 words long. But later the looping error started to appear only with large input documents. (At ≤ 1002 counted words it works fine; at ≥ 1003 it fails.) The only difference I can think of between “sooner” and “later” is that I re-launched Firefox, and thus updates I received over the last N days via Ubuntu Software Updater may have been applied, I dunno. (I also probably made some code tweaks that I thought would not affect anything, but I don’t even remember if I did or not, let alone what they might have been.)

I don’t speak JavaScript, but I am very suspicious of this snippet of code I found on GitHub:

            if (loops++ > 1000) {
                throw XError("looping???");
            }
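
To see exactly when a guard shaped like the one quoted above would trip, here is a minimal sketch. It is not Saxon-JS code: makeGuard and callsBeforeThrow are hypothetical names invented for illustration, a plain Error stands in for XError, and the assumption that the check runs once per processed item is mine.

```javascript
// Hypothetical stand-in for the quoted guard; NOT taken from Saxon-JS.
function makeGuard(limit) {
    let loops = 0;
    return function check() {
        // Post-increment: the comparison sees the value from *before*
        // this call's increment.
        if (loops++ > limit) {
            throw new Error("looping???");
        }
    };
}

// Count how many consecutive checks succeed before the guard throws.
function callsBeforeThrow(limit) {
    const check = makeGuard(limit);
    let ok = 0;
    try {
        for (;;) {
            check();
            ok++;
        }
    } catch (e) {
        return ok;
    }
}

console.log(callsBeforeThrow(1000)); // prints 1001
```

With a limit of 1000, such a guard lets 1001 checks through and throws on the 1002nd. That is close to, but not the same as, the 1002/1003 word boundary reported above, so if this guard is the culprit it is presumably counting something other than words one-for-one. That is only a guess.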

It seems at first blush as if the error is flagged the first time the $all_content variable is referenced, whether it is processed step by step:

      <xsl:variable name="ac_normalized" select="normalize-space($all_content)"/>
      <xsl:variable name="all_tokens" select="tokenize( $ac_normalized,'&#x20;')"/>
      <xsl:variable name="num_tokens" select="count( $all_tokens )"/>

or all-at-once: <xsl:variable name="num_words" select="normalize-space( $all_content ) => tokenize(' ') => count()"/>.
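
For an independent cross-check of the count outside Saxon, the same three steps (normalize, tokenize on a single space, count) can be sketched in plain JavaScript. This is only an approximation under stated assumptions: normalizeSpace and countTokens are hypothetical helper names, and the input is assumed to be the already-extracted string value of $all_content.

```javascript
// Approximates XPath normalize-space(): collapse runs of XML whitespace
// (space, tab, CR, LF) to one space, then strip a leading/trailing space.
function normalizeSpace(text) {
    return text.replace(/[ \t\r\n]+/g, " ").replace(/^ | $/g, "");
}

// Approximates tokenize($s, ' ') followed by count().
// XPath tokenize() returns an empty sequence for a zero-length string,
// whereas JavaScript "".split(" ") returns [""], so handle that case.
function countTokens(text) {
    const normalized = normalizeSpace(text);
    return normalized === "" ? 0 : normalized.split(" ").length;
}

console.log(countTokens("  one two\n\tthree  ")); // prints 3
```

If this count and the command-line Saxon count agree for a given input, that is at least weak evidence that the expected number is right and that the problem lies in the browser run.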

Other issues

It also does not seem to matter whether or not I use a predicate. The predicate, [ matches( .,'\p{L}')], was in my original code just to get rid of tokens like “—” or “!!” that should not be counted as words for my application. (It is a poor approximation of what I really want, because “Sept. 11” should count as two words, not one.) However, using the predicate seems to make it more likely that I get the warning “Synchronous XMLHttpRequest on the main thread is deprecated because of its detrimental effects to the end user’s experience. For more help http://xhr.spec.whatwg.org/”. I have no idea what that warning means or what, if anything, to do about it, and I have even less of an idea after going to the recommended web page.
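
The effect of the letter predicate described above, including the “Sept. 11” shortfall, can be seen in a small JavaScript sketch. The helper names splitTokens and countLetterTokens and the sample string are made up for illustration; the regular expression is assumed to behave like the XPath matches( ., '\p{L}' ) test for this purpose.

```javascript
// Split on XML whitespace and drop empty strings.
function splitTokens(text) {
    return text.split(/[ \t\r\n]+/).filter((t) => t !== "");
}

// Mirrors the predicate [ matches( ., '\p{L}' ) ]: keep a token only if it
// contains at least one Unicode letter. The "u" flag enables \p{L} here.
function countLetterTokens(text) {
    return splitTokens(text).filter((t) => /\p{L}/u.test(t)).length;
}

const sample = "On Sept. 11 — wow !!";
console.log(splitTokens(sample).length);  // prints 6
console.log(countLetterTokens(sample));   // prints 3
```

The dash and the “!!” are dropped as intended, but so is “11”, which is why “Sept. 11” contributes one word instead of two.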

I am also wondering if there is a right way to ask “am I being run as a SEF in a browser?”. I found that function-available('ixsl:location') does the trick, but it seems a bit clumsy.

I also wonder if there is any way to use <xsl:result-document> to write to a disk file when running as a SEF in the browser. The documentation implies the answer is “no”, but I am hopeful, as it could make debugging a lot easier.