SaxonCS and Saxon-HE .NET shortcomings of attribute node line position/column number in lexical markup

Added by Martin Honnen over 1 year ago

While trying to identify the line number and line position/column number of attribute nodes, I find that both SaxonCS and Saxon.NET seem to just assign the line position/column number of the start tag of the element the attribute belongs to, not the attribute's own line position/column number.

The .NET XmlReader seems to output a more precise line position/column number for attribute nodes, however, so perhaps Saxon can be improved.

Here is some sample code using SaxonCS and .NET 6 and XmlReader:

using Saxon.Api;
using System.Xml;

var processor = new Processor();
var docBuilder = processor.NewDocumentBuilder();
docBuilder.LineNumbering = true;
docBuilder.BaseUri = new Uri("urn:from-string");

const string mixedContent = @"<text>This is <b id=""b1"" style=""font-weight: bold;"">mixed</b> content: <!-- test -->Can XPath tools select a text node?</text>";

var xdmDoc = docBuilder.Build(new StringReader(mixedContent));

foreach (XdmNode node in processor.NewXPathCompiler().Evaluate("//@*", xdmDoc))
{
    Console.WriteLine($"Node {node}; line: {node.LineNumber}; column: {node.ColumnNumber}");
}

using (var xr = XmlReader.Create(new StringReader(mixedContent)))
{
    while (xr.Read())
    {
        //Console.WriteLine($"Node {xr.NodeType}; Value: {xr.Value}; Line: {((IXmlLineInfo)xr).LineNumber}; Column: {((IXmlLineInfo)xr).LinePosition}");
        if (xr.HasAttributes)
        {
            xr.MoveToFirstAttribute();
            do
            {
                Console.WriteLine($"Node {xr.NodeType}; Value: {xr.Value}; Line: {((IXmlLineInfo)xr).LineNumber}; Column: {((IXmlLineInfo)xr).LinePosition}");
            }
            while (xr.MoveToNextAttribute());
        }
    }
}

The output I get is, for example:

Node id="b1"; line: 1; column: 16
Node style="font-weight: bold;"; line: 1; column: 16
Node Attribute; Value: b1; Line: 1; Column: 18
Node Attribute; Value: font-weight: bold;; Line: 1; Column: 26

So for both attribute nodes (id and style) Saxon reports 16 (which is the b element's column) as the column/line position, while the XmlReader reports the correct values 18 and 26.

Known flaw/shortcoming?

Any settings to get better results?


Replies (3)

RE: SaxonCS and Saxon-HE .NET shortcomings of attribute node line position/column number in lexical markup - Added by Michael Kay over 1 year ago

The internal data structure is based on what a Java parser gives us, which is the location of a start tag only. Maintaining location information is potentially expensive and we don't want to allocate space for it when it's not required/available.

RE: SaxonCS and Saxon-HE .NET shortcomings of attribute node line position/column number in lexical markup - Added by Martin Honnen over 1 year ago

I see, I can live with that.

Isn't SaxonCS always using a Microsoft XmlReader as its XML parser? In that case, the documentation at https://www.saxonica.com/html/documentation11/dotnetdoc/Saxon/Api/XdmNode.html#ColumnNumber, which says

For a document constructed using the document builder, this is available only if the line numbering option was set when the document was built (and then only for element nodes). If the column number is not available, the value -1 is returned. Line numbers will typically be as reported by a SAX parser; this means that the column number for an element node is the column number containing the closing ">" of the start tag.

is rather misleading: it seems the XmlReader reports the position of the < or of the element name in the start tag, but certainly not the position of the closing > of the start tag.

RE: SaxonCS and Saxon-HE .NET shortcomings of attribute node line position/column number in lexical markup - Added by Michael Kay over 1 year ago

I have improved the API documentation (on 11.x and 12.x branches) so it no longer refers to SAX parsers. No issue raised.
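
For readers who still need attribute-level positions, one possible workaround is a second pass over the same text with XmlReader, recording each attribute's own position keyed by the position of its owning element. The following is only a sketch, not a Saxon feature: the helper name CollectAttributePositions is hypothetical, and the sketch assumes that the line and column Saxon reports for an attribute's element match what XmlReader reports for that element (as the value 16 in the output above suggests for this sample).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

const string sample = @"<text>This is <b id=""b1"" style=""font-weight: bold;"">mixed</b> content: <!-- test -->Can XPath tools select a text node?</text>";

var map = CollectAttributePositions(sample);
foreach (var entry in map)
{
    // Key: (element line, element column, attribute name); value: attribute position
    Console.WriteLine($"{entry.Key} -> {entry.Value}");
}

// Hypothetical helper (not part of Saxon or .NET): maps
// (element line, element column, attribute name) to the attribute's own
// (line, column) as reported by XmlReader through IXmlLineInfo.
static Dictionary<(int, int, string), (int Line, int Column)> CollectAttributePositions(string xml)
{
    var positions = new Dictionary<(int, int, string), (int Line, int Column)>();
    using var reader = XmlReader.Create(new StringReader(xml));
    var lineInfo = (IXmlLineInfo)reader;

    while (reader.Read())
    {
        // Only element nodes are of interest (this skips the XML declaration).
        if (reader.NodeType != XmlNodeType.Element || !reader.HasAttributes)
        {
            continue;
        }

        // Position of the element itself, read before moving to its attributes.
        int elementLine = lineInfo.LineNumber;
        int elementColumn = lineInfo.LinePosition;

        // On an element node, MoveToNextAttribute moves to the first attribute.
        while (reader.MoveToNextAttribute())
        {
            positions[(elementLine, elementColumn, reader.Name)] =
                (lineInfo.LineNumber, lineInfo.LinePosition);
        }

        // Return to the owning element before continuing to read.
        reader.MoveToElement();
    }

    return positions;
}
```

With such a map, the line and column that Saxon reports for an attribute node, together with the attribute's name, could serve as the lookup key. Whether the two parsers agree on element positions for all inputs (for example with entity expansion or unusual line endings) is untested here.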
