Bug #4446

Schema-Aware Transformation: wrong node set

Added by Frank Steimke 2 months ago. Updated about 1 month ago.

Status: Closed
Priority: Normal
Assignee:
Category: Schema-Aware processing
Sprint/Milestone: -
Start date: 2020-01-29
Due date:
% Done: 100%
Legacy ID:
Applies to branch: 9.9, trunk
Fix Committed on Branch: 9.9, trunk
Fixed in Maintenance Release:

Description

Hi, I have a medium-sized project dealing with Latin characters in Unicode. There is a database of Latin characters (latinchars.xml), which is an XML document valid with respect to an XSD 1.1 schema, latinchars.xsd. There is a schema-aware function library in XSLT 3. The overall goal is to produce DocBook documentation, which works fine. All this is done as an Oxygen project. The Oxygen version is 21.1 (recent), which includes Saxon EE 8.8.0.1, on Windows 10.

However, we want to analyze some aspects of NFD normalization. For this I added an extension element to the schema, which allows xs:any children. While reading the document from the database, we add an extension element with an nfd element as its child. The nfd element has a mandatory base element as child, followed by an optional diacritical element. The enriched document validates against the schema without any error.

When I apply transformations to this document, there is strange behaviour, which seems to be a bug. Unfortunately, I am unable to reproduce it with a small script. I can only describe what I see, and give you the attached project.

Observation:

The enriched database is held in a global variable $characterSet as='document-node(schema-element(lc:characterSet))'. There are 924 child elements of type element(*, Entry). Each of these has an nfd element (as a child of its extension element), and every nfd element has a base child element. Counting the nfd/base elements, I would expect 924 nodes.

This expression gives the correct result:

<xsl:value-of select="count($characterSet//element(*, Entry)/extension/nfd/base)"/>

This expression, however, incorrectly returns only 1 node:

<xsl:value-of select="count($characterSet//nfd/base)"/>

So, when I count the nfd/base descendants of $characterSet I get only one, but when I count the element(*, Entry)/extension/nfd/base descendants, I get 924.

I have tried to boil it down to a simple script, but failed. So I have attached the whole Oxygen project. The transformation that counts the nodes is xsl/dia-matrix.xsl.

Sincerely, Frank

latinchars.zip (1.31 MB) Oxygen project; xsl/dia-matrix.xsl counts nodes. Frank Steimke, 2020-01-29 11:08
dia-matrix.xsl (2.18 KB) Frank Steimke, 2020-01-30 09:38
issue-4446.zip (5.23 KB) Frank Steimke, 2020-02-01 09:11
2020-02-02-issue-4446.zip (31.5 KB) Frank Steimke, 2020-02-02 09:49

History

#1 Updated by Frank Steimke 2 months ago

Further observation: the following expression gives the correct result, 924:

<xsl:value-of select="count($characterSet//extension/nfd/base)"/>

So it seems to be important that the extension element is part of the XPath expression.

Frank

#2 Updated by Michael Kay 2 months ago

Thanks for reporting it.

Saxon, in schema-aware mode, looks at A//X and tries to rewrite it as A/B/C/D/X if that's the only way of reaching an X. It does try to take types derived by extension into account, but it looks like it's got it wrong in this particular case.
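To illustrate the rewrite described above, here is a minimal sketch in Java using the JDK's built-in XPath 1.0 engine (not Saxon, and not schema-aware). The miniature document and the class name DescendantRewriteDemo are invented for illustration; the element names follow the report. When every base element is reachable only via characterSet/char/extension/nfd, the descendant path and the fully spelled-out child path select the same nodes, which is the equivalence such a rewrite relies on.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class DescendantRewriteDemo {
    // Hypothetical miniature of the augmented document (three entries instead of 924).
    static final String XML =
          "<characterSet>"
        + "<char><extension><nfd><base>A</base></nfd></extension></char>"
        + "<char><extension><nfd><base>B</base><diacritical>x</diacritical></nfd></extension></char>"
        + "<char><extension><nfd><base>C</base></nfd></extension></char>"
        + "</characterSet>";

    // Evaluates count(path) against the sample document.
    static int count(String path) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Double n = (Double) xpath.evaluate(
                "count(" + path + ")",
                new InputSource(new StringReader(XML)),
                XPathConstants.NUMBER);
            return n.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Descendant form, as written in the stylesheet.
        System.out.println(count("//nfd/base"));                            // prints 3
        // Explicit child-step form, the shape an A//X -> A/B/C/D/X rewrite would produce.
        System.out.println(count("/characterSet/char/extension/nfd/base")); // prints 3
    }
}
```

The rewrite is only safe if the schema analysis finds every route to the target element. An element admitted through a wildcard is exactly the kind of route that is easy to miss.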

(Incidentally, I have occasionally asked myself whether this optimization is actually worthwhile. A scan of the descendant axis on the tiny tree looking for a particular element name is very fast; it's a very tight loop searching a single array, and the data is probably cached very effectively in the CPU. The rewritten path using repeated tests of the child axis is going to involve visiting fewer nodes in the tree, and fewer comparisons, but it's going to involve a lot more branch instructions and is probably not going to cache so well in the CPU. Worth doing some experiments.)
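The "tight loop searching a single array" can be pictured with a rough sketch. This is not Saxon's TinyTree code: the class FlatTreeScanDemo, the name codes, and the two arrays are assumptions made up for illustration, loosely modelled on a tree stored as parallel arrays in document order.

```java
class FlatTreeScanDemo {
    // Hypothetical integer name codes for the element names in the report.
    static final int CHARACTER_SET = 0, CHAR = 1, EXTENSION = 2, NFD = 3, BASE = 4;

    // One slot per element, in document order: a characterSet holding two char entries,
    // each with extension/nfd/base nested below it.
    static final int[] nameCode = {
        CHARACTER_SET, CHAR, EXTENSION, NFD, BASE, CHAR, EXTENSION, NFD, BASE };
    static final int[] depth = { 0, 1, 2, 3, 4, 1, 2, 3, 4 };

    // Counts descendants of node 'start' whose name code equals 'wanted'.
    // The subtree of 'start' is the run of following slots with a greater depth,
    // so the scan is a single linear pass over two int arrays.
    static int countDescendants(int start, int wanted) {
        int hits = 0;
        for (int i = start + 1; i < nameCode.length && depth[i] > depth[start]; i++) {
            if (nameCode[i] == wanted) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(countDescendants(0, BASE)); // prints 2 (both entries)
        System.out.println(countDescendants(1, BASE)); // prints 1 (first entry only)
    }
}
```

The sketch shows only the descendant scan. Whether it beats repeated child-axis steps in practice is the open question raised in the comment above and would need measuring.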

#3 Updated by Michael Kay 2 months ago

I had to make some adjustments to relative URIs to get it to run; when I do, I see different results from yours:

<?xml version="1.0" encoding="UTF-8"?>
<count xmlns:lc="http://xoev.de/latinchars">
   <entry>924 Entry Elements.</entry>
   <nfd>0 nfd Elements</nfd>
   <base>0 nfd/base Elements</base>
</count>

But this should be enough to start investigating.

#4 Updated by Michael Kay 2 months ago

OK, I was running it incorrectly: the source document latinchars.xml wasn't being validated, and as a result, no template rules were matching.

After fixing this, I get the correct result:

<count xmlns:lc="http://xoev.de/latinchars">
   <entry>924 Entry Elements</entry>
   <nfd>924 nfd Elements</nfd>
   <base>924 nfd/base Elements</base>
</count>

I'll try it on 9.8 (this was 9.9.1.6)

#5 Updated by Michael Kay 2 months ago

I get the same (correct) result on 9.8.

There's a possibility that the incorrect result is a consequence of a different order of loading schema modules, which could be affected by the precise way in which the transformation is run. It could also be affected by the catalog resolver (which I'm not using; I edited the relative URIs to make them resolve directly).

It might be useful to compare the -t output. This is what I'm getting on 9.9:

Saxon-EE 9.9.1.6J from Saxonica
Java version 1.8.0_121
Using license serial number K007537
URIResolver.resolve href="ucd.xsl" base="file:/Users/mike/bugs/2020/4446-Steimke/latinchars/xsl/dia-matrix.xsl"
URIResolver.resolve href="docbook-specification-functions.xsl" base="file:/Users/mike/bugs/2020/4446-Steimke/latinchars/xsl/ucd.xsl"
URIResolver.resolve href="../schema/latinchars.xsd" base="file:/Users/mike/bugs/2020/4446-Steimke/latinchars/xsl/dia-matrix.xsl"
Loading schema document file:/Users/mike/bugs/2020/4446-Steimke/latinchars/schema/latinchars.xsd
URIResolver.resolve href="xml.xsd" base="file:/Users/mike/bugs/2020/4446-Steimke/latinchars/schema/latinchars.xsd"
Loading schema document file:/Users/mike/bugs/2020/4446-Steimke/latinchars/schema/xml.xsd
Finished loading schema document file:/Users/mike/bugs/2020/4446-Steimke/latinchars/schema/xml.xsd
Finished loading schema document file:/Users/mike/bugs/2020/4446-Steimke/latinchars/schema/latinchars.xsd
Warning at xsl:import-schema on line 19 column 94 of docbook-specification-functions.xsl:
  SXWN9006: The schema document at latinchars.xsd is ignored because a schema for this
  namespace is already loaded
URIResolver.resolve href="../schema/unicode.xsd" base="file:/Users/mike/bugs/2020/4446-Steimke/latinchars/xsl/ucd.xsl"
Loading schema document file:/Users/mike/bugs/2020/4446-Steimke/latinchars/schema/unicode.xsd
Finished loading schema document file:/Users/mike/bugs/2020/4446-Steimke/latinchars/schema/unicode.xsd
Warning at function lc:gc on line 436 of ucd.xsl:
  SXWN9000: A function that computes atomic values should use xsl:sequence rather than xsl:value-of
Warning at function lc:name on line 615 of ucd.xsl:
  SXWN9000: A function that computes atomic values should use xsl:sequence rather than xsl:value-of
Warning at function ucd:is-base-character on line 188 of ucd.xsl:
  SXWN9000: A function that computes atomic values should use xsl:sequence rather than xsl:value-of
Warning at function lc:string-to-codepoints on line 367 of docbook-specification-functions.xsl:
  SXWN9000: A function that computes atomic values should use xsl:sequence rather than xsl:value-of
Warning at function lc:name-for-codepoints on line 633 of ucd.xsl:
  SXWN9000: A function that computes atomic values should use xsl:sequence rather than xsl:value-of
Warning at function ucd:name on line 592 of ucd.xsl:
  SXWN9000: A function that computes atomic values should use xsl:sequence rather than xsl:value-of
Warning at function lc:is-graphic on line 152 of ucd.xsl:
  SXWN9000: A function that computes atomic values should use xsl:sequence rather than xsl:value-of
Warning at function lc:generate-table-id on line 232 of docbook-specification-functions.xsl:
  SXWN9000: A function that computes atomic values should use xsl:sequence rather than xsl:value-of
Warning at function lc:is-combining-base-character on line 217 of ucd.xsl:
  SXWN9000: A function that computes atomic values should use xsl:sequence rather than xsl:value-of
Warning at function lc:is-in-entry-table on line 204 of docbook-specification-functions.xsl:
  SXWN9000: A function that computes atomic values should use xsl:sequence rather than xsl:value-of
Warning at function ucd:is-combining-character on line 203 of ucd.xsl:
  SXWN9000: A function that computes atomic values should use xsl:sequence rather than xsl:value-of
Warning at function ucd:is-graphic on line 171 of ucd.xsl:
  SXWN9000: A function that computes atomic values should use xsl:sequence rather than xsl:value-of
Stylesheet compilation time: 743.076351ms
Processing file:/Users/mike/bugs/2020/4446-Steimke/latinchars/xsl/dia-matrix.xsl
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/Users/mike/bugs/2020/4446-Steimke/latinchars/xsl/dia-matrix.xsl using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 6.17474ms
Tree size: 60 nodes, 92 characters, 16 attributes
URIResolver.resolve href="../src/datenbank/latinchars.xml" base="file:/Users/mike/bugs/2020/4446-Steimke/latinchars/xsl/ucd.xsl"
Writing to file:/Users/mike/bugs/2020/4446-Steimke/latinchars/augmentedCharacterSet.xml
Building tree for file:/Users/mike/bugs/2020/4446-Steimke/latinchars/src/datenbank/latinchars.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 383.704119ms
Tree size: 6879 nodes, 84327 characters, 4917 attributes
URIResolver.resolve href="../ucd/ucd.reduced.xml" base="file:/Users/mike/bugs/2020/4446-Steimke/latinchars/xsl/ucd.xsl"
Building tree for file:/Users/mike/bugs/2020/4446-Steimke/latinchars/ucd/ucd.reduced.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 58.651892ms
Tree size: 2427 nodes, 14 characters, 17554 attributes
Building tree for file:///Users/mike/bugs/2020/4446-Steimke/latinchars/normalizationData.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 2.228248ms
Tree size: 23 nodes, 204620 characters, 7 attributes
<?xml version="1.0" encoding="UTF-8"?>
<count xmlns:lc="http://xoev.de/latinchars">
   <entry>924 Entry Elements.</entry>
   <nfd>924 nfd Elements</nfd>
   <base>924 nfd/base Elements</base>
</count>
Execution time: 1.958542403s (1958.542403ms)
Memory used: 158,417,824

It would also be useful to see what is actually in the augmentedCharacterSet.xml file. I did this to allow inspection of the content:

   <xsl:result-document href="augmentedCharacterSet.xml">
      <xsl:copy-of select="$characterSet"/>
   </xsl:result-document>

I'm puzzled that you're getting 1 selected element, rather than zero. Can you investigate to see what that element is?

#6 Updated by Michael Kay 2 months ago

  • Category set to Schema-Aware processing
  • Status changed from New to AwaitingInfo
  • Assignee set to Michael Kay

#7 Updated by Frank Steimke 2 months ago

Hi Mike,

I had the issue when running Oxygen on the system at our office. Now I have tried to reproduce it on my home-office system. The environment should be almost identical (same Windows 10 64-bit OS, same Oxygen version), but I get the correct results here, so I am unable to reproduce the issue on this system. This is not the first time that I have got slightly different results that I can't explain. Maybe it's because of different Oxygen configurations (global options)?

Unfortunately I don't know how to use Saxon as a standalone application. Is that possible, or is it bundled with the Oxygen IDE?

I will try to reproduce the problem tomorrow, when I'm back at the office. To answer some of your questions:

  • When I got only one result, it was the very first of the entries in $characterSet (the char element for the U+0009 TAB character).
  • I had the same idea and wrote the augmented characterSet to a file and inspected it. Everything was fine. It could be validated without any problems, and it had the extension/nfd elements as expected. XPath worked as expected. Nothing special.

Sincerely, Frank

#8 Updated by Frank Steimke 2 months ago

Good news: I can reproduce the issue when running on the system at our office, and I have made some further observations.

I have changed the stylesheet dia-matrix.xsl (attached) so that we can compare the results of two different XPath expressions. It also shows the parent element of the first element in the node-set.

  • Using an Oxygen scenario that enables Saxon optimization (-opt), I count 1 element (wrong), and this element has no parent element.
  • Using an Oxygen scenario that does not enable Saxon optimization, I count 924 elements (correct), but the first of these elements also has no parent element.

I'm afraid it has something to do with the Oxygen environment. Since I don't know whether and how I can use the Saxon engine that is bundled with Oxygen as a standalone product, I am unable to test without Oxygen. Also, I don't know how to set the -t parameter within Oxygen. But it is not Oxygen alone: there is a dependency on Saxon's optimization. And with or without optimization, the elements in the resulting node-set are orphans. Why?

Sincerely, Frank

#9 Updated by Michael Kay 2 months ago

I think it would be a good idea if I sent you a temporary license key so you can investigate whether the problem is reproducible outside oXygen. (I have an email address for you from Jan 2018, is that likely to still reach you?) The license that comes with oXygen only covers embedded use within oXygen itself.

The fact that it fails with optimization on, and succeeds with optimization off, is certainly a useful data point, though until we can reproduce it, it doesn't help us much. With the free-standing product, as soon as we can reproduce the issue, we'll be able to investigate what rewrites are taking place during optimization.

The fact that the nodes are parentless is very mysterious. I can't think of any mechanism that would exhibit that particular failure mode. We absolutely need to reproduce this "in the lab".

#10 Updated by Frank Steimke 2 months ago

You can reach me at

I will also ask one or two of my colleagues whether they can reproduce it on their machines in Oxygen. I'd like to know where the difference is between the system at our office and the almost identical Oxygen installation at home (where I can't reproduce the issue).

Frank

On 31.01.2020 at 01:13, Saxonica Developer Community wrote:

#11 Updated by Frank Steimke 2 months ago

I was able to strip down the code. In the attached archive you will find a demonstration of issue 4446 with minimalistic files.

Just apply transform.xsl to input.xml. The content of the input file is completely irrelevant. The XSLT script operates only on lc-simple.xml, which is valid against lc-simple.xsd. The structure is very simple now, and the number of elements is reduced from 924 to 3. I do not use a catalog file. Every path is relative. It is still an Oxygen project, but I would be surprised if this were relevant to the issue.

The generated output counts the number of base elements in the augmented version of lc-simple.xml. The base elements are selected by different XPath expressions. I think that they should all select all base elements in the augmented input, so I would expect 3 as the result of counting the elements in all three sequences. But this is not the case: sometimes it only counts 1. And this single element in the sequence seems to be an orphan.

In the attached archive you will find the generated output with these strange results in output.xml, with comments I made afterwards, and also the augmented content in augmented.xml.

I hope that you can reproduce the issue on your side.

Sincerely, Frank

#12 Updated by Frank Steimke 2 months ago

The variable $characterSet is defined as <xsl:variable name="characterSet" as="document-node(schema-element(characterSet))">.

When you define it as <xsl:variable name="characterSet">, without validating, the error vanishes.

fst

#13 Updated by Frank Steimke 2 months ago

Some more observations:

  • It is not a matter of augmenting a separate file. The same issue occurs when the script is applied to an input that already has the base elements.
  • It is not the case that the base elements are orphans, but counting the number of parent elements gives zero, with or without optimization,
  • but only when $characterSet is defined as a global variable. This issue (counting the number of parent elements) does not occur when $characterSet is defined locally in the initial template, or when there is no $characterSet at all but the initial template matches document-node(schema-element(lc:characterSet)).
  • The sequence of base elements is wrong only if optimization is enabled. It is correct without optimization.
  • It is a matter of schema awareness. When $characterSet is defined as a validated document node, the sequence of base elements is wrong. The moment I remove the as attribute from the definition of the variable, everything seems to be fine.
  • But it is not a matter of schema version. The same issue occurs with schema version 1.0 or 1.1.

#14 Updated by Michael Kay 2 months ago

  • Status changed from AwaitingInfo to In Progress

#15 Updated by Michael Kay 2 months ago

I'm now getting the same results as you.

Looking at the output of -explain, it starts

OPT : At line 42 of file:/Users/mike/bugs/2020/4446-Steimke/latinchars0202/transform.xsl
OPT : Pre-evaluated function call fn:count(...)
OPT : Expression after rewrite: 0

which immediately suggests something is wrong: it has decided statically that the expression count($nfd-base-2/parent::*) will return an empty sequence.

#16 Updated by Michael Kay 2 months ago

Not making a great deal of progress in pinning this down. I can see that the second phase of type-checking of the path expression $nfd-base-2/parent-element() concludes that the path expression is void because the type of the variable is empty-sequence(), but I haven't been able to pin down where that inference comes from.

#17 Updated by Michael Kay 2 months ago

It seems that the typeCheck() code for schema-aware axis expressions handles wildcards (xs:anyType) correctly, but the computeCardinality() code does not. The analysis is computing how many nfd elements might be encountered within a characterSet, and is coming back with the answer zero, because it doesn't take account of elements that match the wildcard.

It's not clear to me at present (a) why it only fails for some of these expressions and not others, or (b) why the code for AxisExpression.computeCardinality() appears to replicate logic in AxisExpression.typeCheck().

The answer to (a) seems to be: because extension appears in the schema as an explicit descendant, while nfd relies on matching the wildcard.

The answer to (b) is that typeCheck is primarily concerned with computing the item type, and it only remembers the cardinality in one or two paths where it happens to stumble across it.

For both descendant::extension and descendant::nfd, we appear to be calling typeCheck() before we call computeCardinality(). In principle typeCheck() (at least on some paths) remembers the cardinality so it doesn't have to be recomputed; that doesn't seem to be happening here.

It seems that when typeCheck() exits having found that the descendant element is matched by a wildcard, it doesn't set the variable computedCardinality, which is why the computation is repeated. (It also seems odd that computeCardinality() doesn't set this variable; however, that's taken care of because the result goes into the staticProperties variable.)

Note: it would be useful if all the paths in AxisExpression.checkPlausibility() that issue a warning and then convert the expression to an empty sequence also logged the warning to the optimizer trace.

#18 Updated by Michael Kay 2 months ago

  • Status changed from In Progress to Resolved
  • Applies to branch 9.9, trunk added
  • Fix Committed on Branch 9.9, trunk added

For 9.9, the problem is fixed by adding a test to UserComplexType.getDescendantElementCardinality(): if the result of gatherAllPermittedDescendants() includes -1 (indicating a wildcard), then return ZERO_OR_MORE (meaning we know of no constraints).
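The shape of that fix can be sketched as follows. This is not the actual Saxon source: the class DescendantCardinalityDemo, the two-valued Cardinality enum, the integer name codes, and the beforeFix() logic are all assumptions reconstructed from the description in this thread. Only the idea that -1 marks a wildcard and that its presence should yield ZERO_OR_MORE comes from the comment above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class DescendantCardinalityDemo {
    // Simplified stand-in for the real cardinality values.
    enum Cardinality { EMPTY, ZERO_OR_MORE }

    // -1 marks "a wildcard permits arbitrary elements here", as in the report.
    static final int WILDCARD = -1;

    // Hypothetical name codes: extension is declared explicitly in the schema,
    // nfd is only admitted through the wildcard inside extension.
    static final int EXTENSION = 1, NFD = 2;

    // Stand-in for the set of descendants permitted below characterSet.
    static Set<Integer> permittedDescendants() {
        return new HashSet<>(Arrays.asList(EXTENSION, WILDCARD));
    }

    // Assumed pre-fix behaviour: only explicitly declared names are considered,
    // so a name reachable solely through the wildcard looks impossible.
    static Cardinality beforeFix(Set<Integer> permitted, int wanted) {
        return permitted.contains(wanted) ? Cardinality.ZERO_OR_MORE : Cardinality.EMPTY;
    }

    // Post-fix behaviour as described: a wildcard means no constraints are known.
    static Cardinality afterFix(Set<Integer> permitted, int wanted) {
        if (permitted.contains(WILDCARD)) {
            return Cardinality.ZERO_OR_MORE;
        }
        return beforeFix(permitted, wanted);
    }

    public static void main(String[] args) {
        Set<Integer> permitted = permittedDescendants();
        System.out.println(beforeFix(permitted, EXTENSION)); // prints ZERO_OR_MORE
        System.out.println(beforeFix(permitted, NFD));       // prints EMPTY
        System.out.println(afterFix(permitted, NFD));        // prints ZERO_OR_MORE
    }
}
```

In this toy model, the EMPTY result for nfd is what would let an optimizer fold count(...) to a constant, while extension is unaffected because it is declared explicitly. That matches the observation that paths mentioning extension gave the right answer.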

For 10.0 I'm also getting rid of the computedCardinality field, which seems to be serving very little purpose and doesn't justify the added complexity.

Added to test suite as test import-schema-202.

#19 Updated by O'Neil Delpratt about 1 month ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 9.9.1.7 added

Patch applied in the 9.9.1.7 maintenance release.
