Bug #6405: Saxon-EE 12.4J: saxon:column-number() seems incorrect for text nodes not inside xsl:text - Saxon - Saxonica Developer Community

Actions

Send by e-mail Copy link

Bug #6405

closed

Saxon-EE 12.4J: saxon:column-number() seems incorrect for text nodes not inside xsl:text

Added by A Galtman 9 months ago. Updated 7 months ago.

Status:

Closed

Priority:

Normal

Assignee:

Category:

Diagnostics

Sprint/Milestone:

-

Start date:

2024-04-28

Due date:

% Done:

100%

Estimated time:

Legacy ID:

Applies to branch:

12, trunk

Fix Committed on Branch:

12, trunk

Fixed in Maintenance Release:

Platforms:

Java

Description

I'm using the saxon:column-number() function to get column numbers of nodes in an XSLT stylesheet file. For a text node inside xsl:text , saxon:column-number() returns the column number that I expect, which is the column where the text node begins.

For a text node not inside xsl:text, the output of saxon:column-number() seems incorrect:

If the text node is all on one line of the XSLT file, saxon:column-number() seems to return the position of the parent element's end tag
If the text node is across multiple lines of the XSLT file, saxon:column-number() seems to return the position of the parent element's start tag

The documentation page https://www.saxonica.com/documentation12/index.html#!functions/saxon/column-number describes the result when the input node is an element node, but it doesn't say the input node must be an element node. What is supposed to happen when the input node is a text node, and why?

To reproduce this behavior, run the attached XSLT transformation in Saxon-EE 12.4J. (I've been running it via Oxygen 26.1 because I don't have an EE license.)

Thanks!

Files

Download all files

check-column-number.xsl (1.53 KB) check-column-number.xsl	Reproduces behavior	A Galtman, 2024-04-28 20:54
text-inside-when.xsl (861 Bytes) text-inside-when.xsl	Supporting file	A Galtman, 2024-04-28 20:55

Actions

#1

Updated by Michael Kay 9 months ago

First thing is to point out that any information Saxon has about line and column numbers comes from the XML parser, and XML parsers aren't all the same. There's a difference between Java parsers and .NET parsers, there's a difference on Java between pull parsers and push parsers, and even in the world of SAX parsers, the spec leaves some things open. (The Javadoc for SAX Locator.getColumnNumber() says "If possible, the SAX driver should provide the line position of the first character after the text associated with the document event. The first column in each line is column 1.") "Document event" means the chunk of text that the parser has chosen to hand over to the application in a single call.

The difference caused by newline characters is probably because SAX parsers deliver text nodes in multiple chunks, and they will tend to do this in a way that avoids moving characters within buffers. In particular, if the line ending is normalised from CRLF to NL, that will often be a good place to break the content, which may mean that different location information is notified for the each line. I haven't checked whether that is actually happening here.

Another point is that there are at least three different phases during which Saxon reports errors, and different information is available in each case. The first phase is actually during the XML parsing, when we determine whether each element is recognised and look for things such as namespace declarations and use-when attributes. The second phase is static analysis of the constructed node-tree, and the third phase is processing of the expression tree built from the node tree (both static processing such as type checking, and dynamic evaluation).

In the node tree built from parsing the stylesheet, we only hold location information for element nodes, not for text nodes. An error within a text node will therefore typically be notified as the location of the containing element.

So it's all pretty complicated!

I'll go on to take a look at your actual examples, and attempt to see how that explains the behaviour.

Actions

#2

Updated by Michael Kay 9 months ago

Also, when you say you are "using the saxon:column-number() function to get column numbers of nodes in an XSLT stylesheet file", that's different from when Saxon produces diagnostics during XSLT compilation. If you're using saxon:column-number then you presumably parsed the XSLT document as an ordinary XML file, not as a stylesheet? That's going to be different, because for example there's no special handling of xsl:text elements in an ordinary XML file.

Actions

#3

Updated by Adrian Bird 9 months ago

Can I add some comments here as I've been talking to Amanda about this issue in xspec. I only have Saxon-HE and extension functions for the line and column numbers but my investigation, which is related to parsing a stylesheet as XML, has shown that the column number for a text node is only correct (i.e. the column number of the enclosing element) when the enclosing element has no attributes. If the enclosing element has attributes the column number for the text node is no longer the same as for the enclosing element. It doesn't seem to matter what the enclosing element is - if you put expand-text="no" on an enclosing element, say xsl:text or xsl:comment or your own element, then the column number differs from the column number of the enclosing element.

The other thing is that the problem occurs with the tiny tree but not with the linked tree (haven't tried tinyc).

I can't provide an example that I've tested with saxon:column-number but could provide a example that should work for you in PE or EE if that would be helpful. Adrian

Actions

#4

Updated by Michael Kay 9 months ago

This data file appears to use NL line endings rather than CRLF.

There is indeed a difference between the Saxon TinyTree and LinkedTree implementations. The TinyTree maintains line/column information for text nodes, the LinkedTree does not. For the LinkedTree, getColumnNumber() will always return the column number of the containing element - which is rather misleading in the case of mixed content. I think there are paths where, to avoid that problem, we return the location of the preceding-sibling element of a text node rather than the parent element; but that's probably when producing diagnostic messages, as distinct from when you call saxon:column-number().

The observation that it depends on whether the element has attributes is probably because the TinyTree uses an optimized representation of elements that have no attributes, no namespace declarations, and a single text node child: the element and text nodes are coalesced into a single "TextualElementNode" internally, meaning they will share location information.

At the heart of this is that maintaining location information for every node in the tree is expensive. For both the TinyTree and the LinkedTree, we try to ensure there is zero overhead unless it has been specifically requested (e.g by using the -l option on the command line). This means holding the data "out of line".

Actions

#5

Updated by Michael Kay 9 months ago

With the TinyTree, I get the same results as you report: 51 53 13 27

It's instructive to add the output of saxon:line-number, which makes it 10/51 14/53 13/13 16/27

The XML parser is reporting characters() events for each line separately (even though the line ending is NL).

The first result, 10/51, is what I would expect: a SAX parser is supposed to report the position relevant to the end of the text in a node.

The second result: 14/53 is completely wrong. The result I would expect is the end of the text node, which is at 13/13. The XML parser is reporting three events, with line/col 11/13, 12/40, and 13/13 respectively.

~~The third result, 13/13 is also wrong. I would expect 14/61, and this is what the XML parser is reporting.~~

The third result, 13/13, relates to the second text node. This correctly reflects the position of the end of the text node.

The second result, 14/53, relates to the third text node. I would expect 14/61, the position of the end of the text node.

The fourth result, 16/27 is the start of the text node. I would expect the value to be 19/17, the end of the text node.

Actions

#6

Updated by Michael Kay 9 months ago

With the linked tree, I get 10/43 14/53 11/43 16/27. That's a different set of problems...

Actions

#7

Updated by A Galtman 9 months ago

Thanks for your investigations so far. Yes, it does sound complicated, and I am not knowledgeable about the different tree models Saxon uses.

Do you think that at least one of the tree models will be able to produce line/column results that match those in a trace listener? In part of the XSLT code coverage feature in XSpec, an XSLT transform has a node in hand, from the user's stylesheet that was parsed as XML, and we want to know whether that node was hit during execution of an XSpec test. We look at where the node is in the XML and compare that to the "hits" data reported by a trace listener. (More detail: Looking at where the node is in the XML actually uses the Java methods getLineNumber() and getColumnNumber() of net.sf.saxon.om.NodeInfo, not saxon:line-number() and saxon:column-number(). But I was hoping that understanding those extension functions would lead to understanding the Java methods.)

MHK response: The extension functions saxon:line-number() and saxon:column-number() applied to a given node return exactly the same result as the internal methods NodeInfo.getLineNumber() and NodeInfo.getColumnNumber(). However, there is a difference, in that the extension functions are used only on trees built as source documents, whereas the trace methods are also used on trees built as stylesheets, where (for example) the whitespace stripping rules are different. And stylesheets are built using the linked tree model rather than the tiny tree model.

Actions

#8

Updated by Michael Kay 9 months ago

So the first thing we're seeing is that the code is designed to return what SAX gives us, which is the END of the text node, whereas you are expecting the beginning.

Neither of these matches the documentation.

The documentation currently states (incorrectly) that: "SAX parsers report line and column numbers only for element nodes, so for any other kind of node, the returned value will be -1". I could change the implementation to match the documentation, which would be easy but unfriendly.

Although your expectation of getting the position of the start of the text is entirely reasonable, I'm a bit disinclined to change the implementation to match that (which we could do by counting backwards from the end of the text node). If we're getting it right (as designed) in many cases, it could be disruptive to existing users to change the spec.

The case where we are getting it wrong is where we create a "textual element node" that merges the element node and text node into one. The simplest solution here would seem to be to compute the line/column of the text node from the line/column of the containing element node, rather than returning the latter unchanged.

Doing this, I get 10/51 14/61 13/13 19/17 - which is the expected result.

Actions

#9

Updated by Michael Kay 9 months ago

The next thing to look at is the LinkedTree situation. In a way this is more important because line numbers on stylesheets (and schemas) are routinely used for diagnostics. In the past a text node was unlikely to be used as the location for an error, because they only held text and not executable code, but with text value templates this has changed. Ideally of course with text value templates we want the position of an expression within the text node, not just the position of the containing text node, which could be hundreds of lines long. But getting the position of the text node right would be a start.

Currently with the linked tree, the getLineNumber() and getColumnNumber() methods for a text node return the line number of the parent element, which in principle can be a long distance away. With XSLT (as in this example) that's a little unusual, but it can happen.

As it happens, in the example, the offsets for the text nodes in question are all given as the SAX location of the end of the start tag, and because the elements in question only own a single text node (there is no mixed content) this is the same as the location of the start of the text node. And as we've seen, getting the position of the start of the text node is probably a more satisfactory result than getting the end of the text node. However, this isn't the result we would get with mixed content.

I added the following to an executable stylesheet:

<xsl:sequence expand-text="yes">
            multiple-line text node
            in xsl:text with mixed
            <b>bold</b>
            content {23 div $zero}.
        </xsl:sequence>

and the division-by-zero error was reported at line 26 column 22 (or char 22): In fact line 26 is the location of the end of the xsl:sequence start tag, and the 22 is actually the character position of the opening brace within the text node. So it we had better information on the position of the text node we would be able to produce better information on the location of the embedded expression. And for this purpose, using the start of the text node is just fine - the only thing is, that's not always (as in this example) the same as the start tag of the parent element.

The SAX parser tells us where the text node ends, but that's less useful. In principle we can get the line number of the braced expression by subtracting the number of newlines that follow it from the line number of the end of the text node, and we can get its column number either from the number of characters since the preceding newline (if there is one). If the braced expression follows an end tag on the same line then getting a column position isn't possible, unless we start capturing the locations of end tags (which is available from SAX, but currently ignored).

So: should we maintain extra information in the linked tree about the position of text nodes? And if we do so, should we keep the start position or the end position? -- the start position is more useful for diagnostics, the end position is more consistent with SAX, with the TinyTree, and with current practice. If we know the start position, we can compute the end position, but the converse is not true, in the case where the text node follows an end tag. So one option might be to capture the start position in the tree, but report the end position in response to NodeInfo.getLineNumber() and NodeInfo.getColumnNumber(), for consistency with the tiny tree.

All rather messy.

Actions

#10

Updated by Michael Kay 9 months ago

Worth pointing out also is that if we start doing calculations, for example computing the position of a text node from the position of its parent, then we rely on knowlng exactly what the semantics of line numbers and column numbers on nodes in the tree are. This is not currently the case, because it can vary for different parsers. Most obviously, the calculations on SaxonJ and SaxonCS will be different.

For example, as mentioned in the above comments I changed TinyTextualElementText getLineNumber() and getColumnNumber() to compute the end-of-text-node position from the end-of-start-tag position of the containing element. That's working correctly for a tree built using a SAX push parser, but the calculation will be wrong for a JAXP pull parser or for the Microsoft pull parser in SaxonCS.

Actions

#11

Updated by A Galtman 9 months ago

Thanks for your reply in #note-7. Also...

Michael Kay wrote in #note-8:

So the first thing we're seeing is that the code is designed to return what SAX gives us, which is the END of the text node, whereas you are expecting the beginning.

[...]

Although your expectation of getting the position of the start of the text is entirely reasonable, I'm a bit disinclined to change the implementation to match that (which we could do by counting backwards from the end of the text node). If we're getting it right (as designed) in many cases, it could be disruptive to existing users to change the spec.

If this issue is closed by making Saxon return the position of the end of the text node, then I'm inferring that the XSpec maintainers should plan to make the XSpec feature do the counting-backward computation you mentioned. Does that sound like the right inference? I'm not following everything you've written on this page (a lot of it seems to be aimed at those familiar with the code inside Saxon), and I'm trying to determine what to expect from a future Saxon version and what XSpec will need to do to adapt to changes in Saxon.

Actions

#12

Updated by Michael Kay 9 months ago

a lot of it seems to be aimed at those familiar with the code inside Saxon

Yes, sorry, I use this space for my working notes while investigating a problem. I realise it might mean more to some readers than to others, but it keeps everything in once place -- and at least it shows that I'm thinking about it!

the XSpec maintainers should plan to make the XSpec feature do the counting-backward computation

I've come to the conclusion that counting backwards isn't feasible in the general case. With miixed content, If you have something like

<para>Some <b>rather 
   important</b><br role="decoration"/> information
   about quantum physics</para>

Then there's no way currently of getting the line/column number of the start of the "information about ..." text node. You can work out the line number of the start of the text, but not the column number.

I'm planning to do some experiments to see if it's possible to capture this with a SAX filter that does a one-event lookahead.

Actions

#13

Updated by Michael Kay 9 months ago

For the next major release I've been making some changes:

For the linked tree (used for stylesheets and schemas) I've now added the ability for text nodes to contain line/column information. Previously this was maintained only for element nodes.
I'm now using the location information reported for the previous SAX event, rather than the current SAX event. Generally this means that for an element node, the line and column will represent the start of the start tag, and for a text node, the start of the text node, which is probably what users expect. There are a few cases where I haven't found a way to get this information from a SAX parse, notably when a start tag is immediately preceded by a DOCTYPE declaration or an external entity reference; there might also be problems if it is preceded by an xi:include. Remember that SAX makes very weak promises about the accuracy of location information.
I'm looking at maintaining a more precise position for expressions that occur in a text value template.
I think that if line numbering is requested for a TinyTree, we should suppress the compression of "text only element nodes" so that we retain separate location for the element and its contained text node.

Actions

#14

Updated by Michael Kay 9 months ago

Category set to Diagnostics
Status changed from New to Resolved
Assignee set to Michael Kay
Priority changed from Low to Normal
Applies to branch 12, trunk added
Fix Committed on Branch 12, trunk added
Platforms Java added

I'm going to close this one. I have only made very minor changes on the 12.x branch (as described in comments); much more radical changes for 13.x.

Actions

#15

Updated by A Galtman 9 months ago

Thank you very much for your analysis and the work you're doing to improve the traces.

If Saxonica has some kind of Beta or Prerelease program for Saxon-HE, I'd be interested in having an early opportunity to try your changes (in 12.x or 13.x) with the XSpec code coverage reporting feature.

Actions

#16

Updated by O'Neil Delpratt 7 months ago

Status changed from Resolved to Closed
% Done changed from 0 to 100
Fixed in Maintenance Release 12.5 added

Bug fix applied in the Saxon 12.5 Maintenance release.

Please register to edit this issue

Actions

Send by e-mail Copy link

Also available in: Atom PDF