Project

Profile

Help

Bug #4837

Two documents can have the same URI

Added by Michael Kay 10 months ago. Updated 11 days ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
JAXP Java API
Sprint/Milestone:
-
Start date:
2020-11-25
Due date:
% Done:

100%

Estimated time:
Legacy ID:
Applies to branch:
10, trunk
Fix Committed on Branch:
trunk
Fixed in Maintenance Release:

Description

When a document comes into Saxon from outside, typically as a Source object, we treat its SystemId property as representing its base URI. This enables relative references within the document to be correctly resolved, for example by the doc() and document() functions. But documents read using doc() and document() are required to have the property that two requests using the same absolute URI resolve to the same document. This can't work if the external input to a transformation supplies two different documents with the same SystemId property.

In a bug raised by Ihe Onwuka on the saxon-help list, this is happening on a call to fn:transform(), and it is happening as a result of rules in the XSLT specification. The XSLT spec says that xsl:copy (and various other instructions) create a document whose base URI is the same as that of the stylesheet. So we immediately have two documents with the same base URI, which (using fn:transform() or otherwise) can easily be used as input to another transformation. In this example the stylesheet is using doc("") (a "same document reference") to access the document identified by its base URI, and if two documents have the same base URI there is no way this can work. (It runs directly counter to the rule that two calls on doc() supplying the same URI must return the same document).

There are of course many complications here. For example, what do we do when doc(X) supplies one particular URI, and the returned document has a systemId property that is different? We addressed that question in bug #4795, and it's possible that the changes we made there exacerbated the problem we're seeing here. But both bugs have the same underlying cause: the whole architecture is making an assumption that there's a one-to-one correspondence between URIs and documents, and when the going gets rough that clearly isn't the case.

I think one immediate action we can take is that we should detect, when we put a document in the document pool, that it already contains a different document with the same URI. That will at least give earlier detection of the problem, and clearer diagnostics.

Perhaps we should also be clearer about the distinction between document URI and base URI. (The 3.1 specs have both properties, in recognition of the fact that here be dragons). When we create a document using xsl:copy or similar instructiions, perhaps we should carefully mark it to indicate that the base URI we allocate can be used for resolving relative URIs appearing in the document, but it must not be used as a unique identifier for the document (that is, as a document URI, or as a same-document reference). That's not easy because we use the JAXP interfaces such as Source and URIResolver so heavily, and these interfaces make no such distinctions.

History

#1 Updated by Ihe Onwuka 10 months ago

Question: Would recursing over a compiled stylesheet solve the problem? Suggestion: I don't think you are but do not presume the problem is restricted to the use of doc('') as mentioned in the bug report.

#2 Updated by Michael Kay 10 months ago

There are many ad-hoc things we could do that might fix this particular example, but there's an architectural issue that needs a general solution rather than a whack-a-mole approach.

#3 Updated by Michael Kay 10 months ago

We had a good discussion about this on the team call this morning.

Firstly, we need to prevent two documents supplied as input to a transformation having the same URI. That's fairly easy; the immediate benefit will be that you get better and earlier diagnostics indicating what's wrong in this situation.

Supplying two documents with the same URI is something that can happen in all sorts of ways, but fn:transform() makes it much more likely to happen unintentionally, because many of the "temporary trees" constructed in the course of a transformation have the same base URI as the stylesheet. In the past there has been (almost) no way of exposing these temporary trees to the outside world, so this didn't really matter; but with fn:transform (and xsl:evaluate, etc) they can be externalised, meaning we need some of the same disciplines xsl:result-document offers to ensure URI uniqueness.

We discussed a couple of language features that could be helpful here: perhaps an fn:rebase($doc, $uri) function that copies a document with a different base URI (we have internal support for this in Saxon that would allow it to be a virtual copy): the idea is that you would call this function when supplying a document as input to fn:transform(). Perhaps an attribute such as <xsl:document href="xxxx"> to allow you to construct an internal document with an explicitly chosen URI distinct from that of the stylesheet. Perhaps a mechanism to generate an arbitrary unique URI for use with these capabilities.

#4 Updated by Michael Kay 10 months ago

Reading the Data Model spec at https://www.w3.org/TR/xpath-datamodel-31/ is helpful.

According to the Data Model, a document node has two separate properties: base-uri and document-uri. There's no prescribed relationship between them, but the way that they are described makes it clear that while many documents can have the same base URI, the document-uri property of a document, if it exists, must identify the document uniquely.

The only definitive way in which a document can acquire a document-uri property is if it is read using doc() or collection(). The spec leaves it open for other documents, for example those supplied as the initial context item or as stylesheet parameters, to have one too, but it must be unique.

Internally in Saxon, document-uri($doc) works as follows

(a) if getUserData("saxon:document-uri") has a value, we return that value

(b) if the document is present in the document pool, we return the key under which it is held

(c) if node.getSystemId() has a non-null and non-empty value, we return that value.

As far as I can see we only do setUserData("saxon:document-uri") in fn:parse-xml(), and we use it there to set the document-uri() to "absent". But this was clearly an attempt to allow a document to have a document-uri that is different from the base-uri property,

(c) above seems highly questionable. There's no guarantee that node.getSystemId() will return a value that uniquely identifies a document node or that can be used to retrieve it using the doc() function.

(b) above is also problematic, because the document pool isn't immutable. While doc() and doc-available() are written to behave as if the document pool were immutable (so once doc-available() gets a result, that result remains fixed for all time), document-uri() isn't following the same protocol: a URI can be absent from the pool at one point in time, and present in the pool later. This means that the result of calling document-uri($doc) for a given document can change over time, which is clearly wrong.

#5 Updated by Michael Kay 10 months ago

Iniitial changes:

(a) get rid of the saxon:document-uri mechanism

(b) the document-uri() function ONLY looks in the document pool

(c) it's an error ("Two documents with the same document uri") to put a document in the document pool if there is already another one with the same URI.

Getting a handful of test failures in XSLT 3..0.

#6 Updated by Michael Kay 10 months ago

Collection-006 fails Cannot have two different documents with the same document-uri file:/Users/mike/GitHub/xslt30-test/tests/fn/collection/doc14.xml

This test is loading the same document in two different packages with different xsl:strip-space declarations. So in fact we can have two documents in the pool with the same document URI, because the mapping from URI to document is local to a package.

It seems this test has been working "by accident", because the two documents should have different document keys in the document pool, but they actually don't include the package name in the document key, so they are using the same key; it only works because we don't retrieve each of the documents more than once. I've changed the CollectionFn to include the package name and version in the document key, so the test now passes again (But I haven't changed the test to be more thorough).

#7 Updated by Michael Kay 10 months ago

Tests merge-065a and merge-098 are failing, plus 2 tests in si-empty and si-non-empty.

Confirmed that these were not failing before the changes.

si-on-non-empty-025 is relying on xsl:source-document setting the document-uri of the document. The spec for xsl:source-document doesn't say anything about document-uri(); but because it doesn't guarantee stability of results, while document-uri() does, there can't be a guarantee that document-uri() is set. By not taking document-uri() from the systemId of the document, we've essentially ensured that it isn't set.

merge-098 is using the document-uri() of the primary source document, which is streamed, so this suffers the same problem. Similarly merge-065a. In both cases, using base-uri() instead of document-uri() solves the problem.

#8 Updated by Michael Kay 10 months ago

Running through the JUnit tests, I'm getting 23 failures out of 1222 tests.

4 failures in TestCollections need to be looked at

Various tests are failing because of unfinished bytecode work: CompiledConversionTests, TestXMarkProjection, etc

TestXQueryEvaluator/testFlworTracing() is failing - probably unrelated, but needs checking.

TraxTest/testTransformerReuse() and TraxTest/testParam() are failing with the message "Cannot have two documents with the same document-uri" - that definitely needs looking into.

TestCanonicalSerializer/testCanonicalSerialization2() is failing trying to parse a ..DS_Store file - looks unrelated, but needs investigating.

BinarySerializationTest is crashing - again, looks unrelated.

TransformTests/testDirectory fails (Input is a directory, output is not).

#9 Updated by Michael Kay 10 months ago

The test TraxTest/testTransformerReuse() does two transformations, using the same compiled stylesheet and the same Transformer and the same source document; the second transformation is failing because there is already an entry in the document pool for the source document URI, and it's not (recognised as being) the same document. There is no call on Transformer.reset() between the two transformations.

This is a tricky one. We've always said that the only good reason for reusing a Transformer is so that you can reuse documents in the document pool; if you don't want that, then you should reset() the transformer or (more simply) create a new one. The problem here is that we're putting the initial source document in the document pool in case it's used again by a call on doc().

Perhaps a clean solution would be to distinguish entries in the document pool that are there to enforce the semantics of the doc() function, from those that are there for optimisation, to avoid having to reload a document that is already in memory. For example, mark entries in the pool as being either "strong" or "weak", allow weak entries to be replaced, and mark all existing entries in the pool as "weak" when a transformer is reused.

The problem with TraxTest.testParam() is identical.

#10 Updated by Michael Kay 10 months ago

It seems that we could be making more use here of the global document pool owned by the configuration. At present documents are only added to the global document pool if they are "pre-loaded" at stylesheet compile time, which only happens if Feature.PRE_EVALUATE_DOC_FUNCTION is set. We could use this pool additionally to hold documents that have been read by a previous transformation using the same transformer, by a different transformation in another thread, etc. There are complications: the document must not only have the same URI, it must also have the same space-stripping and validation options, and we need to be able to free up space by discarding documents that are no longer needed. The global document pool currently ignores these niceties, which it can probably afford to do because by default it isn't used.

A less pervasive fix would be for Controller.makeSourceTree() to check whether a document is in the pool before parsing it and trying to add it to the pool.

I have implemented this - changes Controller.makeSourceTree() - and these two tests now pass.

#11 Updated by Michael Kay 10 months ago

  • Status changed from New to In Progress

Looking now at the failing JUnit tests in TestCollections. All these tests explicitly access the document URI property of a document returned by the collection() function. In the past this succeeded because the document-uri() function looked at the systemId property of the document node; it now fails because the documents are not added to the document pool.

This exposes the fact that in Saxon, collections are not stable by default. The only become stable if the property Feature.STABLE_COLLECTION_URI is set. Setting this property causes these tests to pass.

There's a backwards-compatibility issue here but I think it's necessary for correctness. If collection() is not stable, then it cannot be guaranteed that the document-URIs of the documents in the collection can be used to retrieve the same documents using the doc() function, so the document-uri() function should not return such a URI. I'll add a test to check that it doesn't.

#12 Updated by Michael Kay 10 months ago

  • Status changed from In Progress to Resolved
  • Applies to branch 10, trunk added
  • Fix Committed on Branch trunk added

I've confirmed that the other unit test failures are unrelated, and I'll address them outside the scope of this bug.

The question now arises, which of these changes to implement in a Saxon 10 maintenance release, given that there's an impact on existing applications.

On balance, I think it's probably easier to live with the problem in Saxon 10 than to live with the cure. So I'm proposing not to retrofit these changes.

#13 Updated by Michael Kay 10 months ago

  • Status changed from Resolved to In Progress

On the 11 branch, test TestXsltTransformer.testSimple() does this:

            for (int i = 0; i < 10; i++) {
                XdmNode input = processor.newDocumentBuilder().build(
                        new File(ConfigTest.DATA_DIR + "caffo/Input.xml"));
                transformer.setInitialContextNode(input);
                transformer.transform();
            }

Which fails with the "two documents with the same URI" problem. Seems to be a variation on comment #10 - need to check the document pool before adding the new document in setInitialContextNode.

#14 Updated by Michael Kay 10 months ago

  • Status changed from In Progress to Resolved

Applied similar logic in Controller.setGlobalContextItem() - we only add the document to the pool if it doesn't already contain a document with that URI.

#15 Updated by Ihe Onwuka 10 months ago

Seeking clarification on when to expect this fix to be available and/or workaround if it is not in the near future. Thanks

#16 Updated by Michael Kay 10 months ago

I'm still working through the implications and consequences of the changes.

Some of the changes are clearly going to disrupt existing applications, so they aren't going to go in the 10 branch.

#17 Updated by O'Neil Delpratt 6 months ago

  • Status changed from Resolved to Closed
  • % Done changed from 0 to 100
  • Fixed in Maintenance Release 10.5 added

Bug fix applied to Saxon 10.5 maintenance release.

#18 Updated by Ihe Onwuka 3 months ago

This is not fixed in 10.5. The test case I provided on the mailing list still fails.

#19 Updated by O'Neil Delpratt 3 months ago

  • Status changed from Closed to In Progress
  • Fixed in Maintenance Release deleted (10.5)

Thanks for pointing this out. This bug issue should not have been marked as fixed against the 10.5 maintenance release. In respect to Mike comment #16 the fix has only been made on the Saxon 11 branch.

#20 Updated by Michael Kay 3 months ago

  • Status changed from In Progress to Closed
  • Fixed in Maintenance Release 10.5 added

Indeed, if you follow the tortuous reasoning in my comments on the bug, you'll see that we decided not to apply the more radical changes that would fix these issues in the 10.x branch, because they would likely cause more problems to existing user workloads than they solved.

#21 Updated by Michael Kay 3 months ago

  • Status changed from Closed to In Progress
  • Fixed in Maintenance Release deleted (10.5)

Reverting my accidental changes to O'Neil's simultaneous update.

#22 Updated by Michael Kay about 2 months ago

  • Status changed from In Progress to Resolved

Marking as resolved. As detailed in the contemporaneous notes, the changes were largely confined to the 11.x branch.

#23 Updated by O'Neil Delpratt 11 days ago

  • Status changed from Resolved to Closed
  • Fixed in Maintenance Release 10.6 added

Bug fix applied in the Saxon 10.6 maintenance release

Please register to edit this issue

Also available in: Atom PDF