Support #3541

closed

Remove empty tags from XML

Added by abhishek munjal over 6 years ago. Updated over 6 years ago.

Status: Closed
Priority: Normal
Category: -
Sprint/Milestone: -
Start date: 2017-11-23
% Done: 0%

Description

Michael,

I have a requirement to remove the empty tags from a large incoming XML file (>500 MB). Is there an optimized way to achieve this using the Saxon XSLT transformer?

Right now I am using two XSLT stylesheets: one for data transformation and another as a post-processing step that removes empty tags (in-memory processing). It would be great if both of these steps could be done together. Thanks in advance.


Files

SampleInput.xml (921 Bytes): the input XML file we need to process. Don Bosco Antony, 2017-12-07 11:07
error_filteremptynodes.log (3.01 KB): the error log we captured during the execution. Don Bosco Antony, 2017-12-07 11:07
FilterEmptyNodes.xsl (426 Bytes): the XSL that filters empty nodes in streamable mode. Don Bosco Antony, 2017-12-07 11:09
Input.png (37.7 KB). Don Bosco Antony, 2017-12-07 11:14
Expected Output.png (28 KB). Don Bosco Antony, 2017-12-07 11:14

Actions #1

Updated by Michael Kay over 6 years ago

Firstly, I would suggest that for general help with XSLT coding (where the question isn't at all specific to Saxon), you ask on StackOverflow. There are lots of people prepared to help you there. We keep an eye on the relevant questions, but they have usually been answered before we see them.

Running two transformations in a pipeline is generally a good idea because it makes your code more modular and reusable, so I don't know why you want to change this. If you want to remove empty tags in the output of your first transformation (as distinct from the original input) then it probably makes your code much simpler to do it in a post-processing phase.

Actions #2

Updated by Mohd Shadab over 6 years ago

When running two transformations, we use the following XSL to remove empty tags from the XML output generated by the first XSL.

Because the file is large, we are trying to run it in streaming mode, but it gives this error:

Template rule is not streamable
  • In a streaming apply-templates instruction, the select expression cannot select descendant elements (that is, it must not have crawling posture)

   <xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
       <xsl:output omit-xml-declaration="yes" indent="yes"/>
       <xsl:mode streamable="yes"/>
       <xsl:strip-space elements="*"/>
       <xsl:template match="node()|@*">
           <xsl:copy>
               <xsl:apply-templates select="node()|@*"/>
           </xsl:copy>
       </xsl:template>
       <xsl:template match="*[not(@*) and not(*) and (not(text()) or .=-1)]"/>
   </xsl:stylesheet>

Actions #3

Updated by Michael Kay over 6 years ago

Running this, I observe a number of things that are worth investigating:

(a) when running with the option -nogo from the command line, no errors are reported. This option is supposed to do complete static analysis so it should report the streamability errors.

(b) the report about the select attribute selecting descendant elements seems highly misleading. This may be a case where Saxon is trying to report the error in terms meaningful to the user, but has lost accuracy in the process.

(c) the match pattern for the empty template rule is clearly not a motionless pattern, and this should be reported as a static error.

Actions #4

Updated by Michael Kay over 6 years ago

Fixed (a) as bug #3559

Actions #5

Updated by Michael Kay over 6 years ago

As regards (b), the expression node()|@* does indeed have crawling posture, defined by §19.8.8.4 rule 4 in the XSLT 3.0 specification. Because most users are unlikely to understand what "crawling posture" means, and because it is usually associated with use of the descendant axis, Saxon puts a gloss on the error message and gets it wrong. A more accurate message might be:

In a streaming apply-templates instruction, the select expression must not have crawling posture (for example, it must not select descendant nodes).

You can make this template rule streamable by rewriting it as

   <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

However, a better solution is to remove this template rule, and add the attribute

on-no-match="shallow-copy"

to the xsl:mode declaration.

When you do this, the analysis then proceeds to the next template rule, and complains that the match pattern is not motionless. So the absence of any diagnostics for (c) is simply because the compiler stopped the analysis before getting that far.
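
As a rough sketch of the stylesheet at this intermediate stage (assembled from the fragments quoted in this thread, not taken from the attached FilterEmptyNodes.xsl), the explicit identity rule is gone and the mode declaration supplies the copying behaviour. It is still expected to be rejected, because the remaining pattern is not motionless:

```xml
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>

    <!-- The built-in shallow-copy rule replaces the explicit
         match="node()|@*" identity template. -->
    <xsl:mode streamable="yes" on-no-match="shallow-copy"/>
    <xsl:strip-space elements="*"/>

    <!-- Still not streamable: not(*), text() and "." all need to look
         at the element's content, so the pattern is not motionless. -->
    <xsl:template match="*[not(@*) and not(*) and (not(text()) or .=-1)]"/>
</xsl:stylesheet>
```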

Actions #6

Updated by Don Bosco Antony over 6 years ago

Thanks for your initial investigation. Could you help us resolve the issue? We are still unable to generate an XML with the empty nodes filtered out using the XSL shown inline above.

Attached are the artifacts you might need for reference.

Actions #7

Updated by Michael Kay over 6 years ago

Now, how to rewrite the match pattern to make it motionless?

This means, in effect, it must be possible for the match pattern for an element to be evaluated while the parser is positioned on the start tag, without reading any of the element's content.

The function has-children() was introduced to help with this requirement. You could change the match pattern to

match="*[not(@*) and not(has-children())]"

The has-children() function relies on the parser being able to do a tiny bit of look-ahead - although theoretically being "motionless" means you can only see the start tag, this function relies on looking ahead to see whether the start tag is immediately followed by an end tag (or is an empty tag).
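
A minimal sketch of how that pattern might sit in a complete stylesheet, assuming the on-no-match="shallow-copy" mode declaration suggested earlier in this thread. Note that the pattern is tested against the input, so an element whose only children are themselves empty elements still has children and would be copied; only childless leaf elements are dropped.

```xml
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:mode streamable="yes" on-no-match="shallow-copy"/>
    <xsl:strip-space elements="*"/>

    <!-- Suppress elements that have no attributes and no child nodes.
         Both tests can be answered while positioned on the start tag. -->
    <xsl:template match="*[not(@*) and not(has-children())]"/>
</xsl:stylesheet>
```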

But this match pattern is slightly different from yours. Firstly, an element that contains comments or processing instructions will match your pattern, but it won't match mine. Secondly, your pattern treats an element that contains the text value -1 as empty, and that definitely isn't motionless. Probably you don't really care about the comments and processing instructions, but the -1 problem is more difficult.

I think the answer might be to use the XSLT 3.0 xsl:where-populated instruction. If you write your stylesheet with a single template rule:

   <xsl:template match="*">
       <xsl:where-populated>
           <xsl:copy>
               <xsl:copy-of select="@*"/>
               <xsl:apply-templates select="node()"/>
           </xsl:copy>
       </xsl:where-populated>
   </xsl:template>

then the effect of where-populated is that constructed elements that are "deemed empty" will not be output. (Again this is achieved with a little bit of look-ahead, this time on the output side.) Unfortunately the definition of "deemed empty" isn't quite the same as your definition: with xsl:where-populated, the presence of attributes does not prevent an element being deemed empty. The answer to this is to use this template rule only for elements that have no attributes, which you can achieve by changing the pattern to match="*[not(@*)]".

Finally, to handle the "-1" problem, you can add a template rule

<xsl:template match="text()[number(.) = -1]"/>

which suppresses such text nodes, and thus causes the containing element to be "deemed empty".

So the stylesheet becomes:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    
    <xsl:mode streamable="yes" on-no-match="shallow-copy"/>
    <xsl:strip-space elements="*"/>
       
    <xsl:template match="*[not(@*)]">
      <xsl:where-populated>
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
      </xsl:where-populated>  
    </xsl:template>
    
    <xsl:template match="text()[number(.) = -1]"/>
</xsl:stylesheet>
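
To make the rules concrete, here is a small hypothetical input (invented for illustration; it is not the attached SampleInput.xml, and all element and attribute names are assumptions) that exercises each case discussed above:

```xml
<!-- Hypothetical input; names are illustrative only. -->
<order id="42">
    <note/>                <!-- no attributes, no content -->
    <qty>-1</qty>          <!-- only child is the text node -1 -->
    <item sku="A1"/>       <!-- empty, but has an attribute -->
    <address>
        <street/>          <!-- empty child of an otherwise empty parent -->
    </address>
    <name>Widget</name>    <!-- has real content -->
</order>
```

Tracing the template rules by hand, one would expect only order, item (with its sku attribute) and name to survive: note is constructed empty and discarded by xsl:where-populated, qty becomes empty once its -1 text node is suppressed, and address becomes empty once street has been discarded. The item element is kept because it has an attribute and so falls through to the shallow-copy rule. This has not been run against Saxon here, so treat it as a reading of the rules rather than captured output.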
Actions #8

Updated by Michael Kay over 6 years ago

I have improved the error message on the development branch.

Actions #9

Updated by Don Bosco Antony over 6 years ago

Thanks, Michael!!

It was really helpful for us in understanding the underlying functionality, and as far as the modified XSL is concerned, it worked for us for the sample input XML.

We are good to go for now, and for any further assistance, we would surely catch up with you.

Thanks again for your valuable efforts!

Actions #10

Updated by Michael Kay over 6 years ago

  • Status changed from New to Closed
  • Assignee set to Michael Kay
