Support #6376

closed

Xslt Map Memory Footprint

Added by fouad MOUTASSIM about 1 month ago. Updated 4 days ago.

Status: Closed
Priority: Normal
Assignee:
Category: Performance
Sprint/Milestone: -
Start date: 2024-03-22
Due date:
% Done: 0%
Estimated time:
Legacy ID:
Applies to branch:
Fix Committed on Branch:
Fixed in Maintenance Release:
Platforms: Java

Description

Hello,

We are using Saxon 10.4 (from Saxonica) with Java and have noticed that it consumes a lot of memory at run time. For example, when processing a 100MB payload, the XSLT map consumes 1.3GB, which causes memory issues in our application.

We kindly request your support in addressing this matter.

Please find attached the following elements:

  • The XSLT map.
  • The Java class.
  • The input payload.

Thank you in advance for your support.


Files

OOM_Issue.zip (7.66 MB), fouad MOUTASSIM, 2024-03-22 15:21
payload-filtered.zip (7.63 MB), Laabidi Raissi, 2024-03-26 12:08
simplified-m-tbw-map-GboMappingFull-T4.xslt (155 KB), fouad MOUTASSIM, 2024-03-27 16:00
Actions #1

Updated by Michael Kay about 1 month ago

Thanks for reporting it.

I've run the test case on the current development build with the -t output being

Execution time: 16.9082515s (16908.2515ms)
Memory used: 1084Mb

I'll take a look at a heap dump in due course to see where the memory is going and whether there is any scope for optimizations here.

Actions #2

Updated by Michael Kay about 1 month ago

You describe the data as a "map" but from what I can see so far there are no actual XDM maps involved here, is that right?

My first attempt at instrumenting this shows it creating 19 temporary trees, of which 5 have over a million nodes, and 5 (presumably the same 5) have between 70m and 85m characters of text. I haven't looked at any garbage collection, but a tree with 2 million nodes and 80m characters of text is going to occupy about 2m*20 + 80m*2 bytes = 200m bytes, so 5 of these would give you about 1Gb. The TinyTree (as the name suggests) is very finely tuned for memory usage, so there's no scope for bringing this down at the Saxon level, so I think it's a case of seeing whether there is anything you can do at application level.
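As a rough sketch of the estimate above, the following Java snippet evaluates "nodes * 20 + characters * bytes per character" for one tree and for five. The class and method names are hypothetical, and the 20-bytes-per-node figure is just the approximation quoted in this comment, not a measured Saxon constant.

```java
// Back-of-envelope tree size estimate (illustrative only, not Saxon code).
final class TreeEstimate {
    // Assumed per-node overhead in bytes, taken from the approximation quoted above.
    static final long BYTES_PER_NODE = 20;

    // Estimated size of one tree: node overhead plus text storage.
    static long estimateBytes(long nodes, long chars, int bytesPerChar) {
        return nodes * BYTES_PER_NODE + chars * bytesPerChar;
    }

    public static void main(String[] args) {
        long nodes = 2_000_000L;   // about 2 million nodes per large tree
        long chars = 80_000_000L;  // about 80 million characters of text
        long perTree = estimateBytes(nodes, chars, 2); // 16-bit characters
        System.out.printf("one tree:   %,d bytes%n", perTree);
        System.out.printf("five trees: %,d bytes%n", 5 * perTree);
    }
}
```

Passing 1 instead of 2 as the last argument gives the corresponding figure for text held in 8-bit form.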

Perhaps there is scope for taking advantage of XSLT 3.0 streaming? I'm afraid that with a 2000-line stylesheet it's difficult to form a quick view of its logic. Or perhaps saving intermediate data in XDM maps rather than node trees would give an improvement?

Actions #3

Updated by Michael Kay about 1 month ago

Some further investigation:

The first large TinyTree is the one containing the primary transformation input. This is using 8-bit characters throughout.

The second large tree is created by the xsl:variable at line 250. This is using 16-bit characters.

The third is from the xsl:variable at line 556. Again, 16-bit characters.

The fourth is from line 106. 16-bit characters.

The fifth is line 115. Again, 16-bit characters.

Note that these are in order of the tree being completed, not the order of it being started.

From the point of view of Saxon internals, I might investigate why we're holding these trees with 16 bits per character rather than 8.

From the point of view of the application-level design, I would think there are opportunities to reduce the amount of data being copied during each phase of processing.

Actions #4

Updated by Michael Kay about 1 month ago

The reason that the text content is being held with 16 bits per character is that the input does in fact contain a few characters outside the 1-255 range. Specifically, there are about eleven instances of code point 8206, one of code point 8211, and six of code point 65533. It only takes one character outside the 1-255 range to force Saxon to hold an entire buffer in 16-bit representation.
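To check an input for such characters before transformation, a scan along the following lines may help. It is a minimal sketch using only the Java standard library: the class and method names are hypothetical, and the one-versus-two-bytes rule is a simplified model of the behaviour described above, not Saxon's actual buffer logic.

```java
// Counts characters that would not fit an 8-bit representation (illustrative only).
final class CharWidthCheck {
    // Number of code points above 255 in the text.
    static long countWide(String text) {
        return text.codePoints().filter(cp -> cp > 255).count();
    }

    // Simplified model: a single wide character forces 2 bytes per character.
    static int bytesPerChar(String text) {
        return countWide(text) == 0 ? 1 : 2;
    }

    public static void main(String[] args) {
        String narrow = "externalCode";
        String wide = "externalCode\u200E"; // U+200E is code point 8206
        System.out.println(countWide(narrow) + " wide, " + bytesPerChar(narrow) + " byte(s) per char");
        System.out.println(countWide(wide) + " wide, " + bytesPerChar(wide) + " byte(s) per char");
    }
}
```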

Actions #5

Updated by Laabidi Raissi about 1 month ago

Hello Mr. Kay, My name is Laabidi and I am a colleague of Fouad. Please allow me to add a few details following your comments:

  1. I created a new input XML by filtering out non-ASCII characters (please find attached), but we have almost exactly the same memory footprint.
  2. When investigating the dump generated on OutOfMemory, we can see that for this input and a maximum heap of 1200M (-Xmx1200m), we have 28733 instances of the class net.sf.saxon.expr.Operand. Moreover, there are huge numbers of duplicate Strings. For example, the value "externalCode" occurs 11910 times in the dump. We are investigating the effect of using the -XX:+UseStringDeduplication option.
Actions #6

Updated by Michael Kay about 1 month ago

You might find it useful to use the "tiny tree condensed" option - this looks for duplicated attributes and text nodes and pools the instances.
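The general idea of pooling duplicated values can be illustrated with a small standard-library sketch. This is not Saxon's implementation; the ValuePool class is hypothetical and only shows why sharing one instance of a repeated value such as "externalCode" saves memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal value pool: equal strings are stored once and shared (illustrative only).
final class ValuePool {
    private final Map<String, String> pool = new HashMap<>();

    // Returns the pooled instance equal to value, adding it on first sight.
    String intern(String value) {
        String existing = pool.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    int distinctCount() {
        return pool.size();
    }

    public static void main(String[] args) {
        ValuePool pool = new ValuePool();
        List<String> stored = new ArrayList<>();
        for (int i = 0; i < 11910; i++) {
            // new String(...) forces a separate instance, much as a parser would produce
            stored.add(pool.intern(new String("externalCode")));
        }
        System.out.println(stored.size() + " references, "
                + pool.distinctCount() + " distinct instance(s)");
    }
}
```

Every reference in the list points at the same String object, so the character data is held once rather than once per occurrence.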

Actions #7

Updated by Michael Kay about 1 month ago

The instances of Operand are not a great concern, because they are part of the compiled stylesheet, and won't increase as the source document size increases. (It would be nice to get them garbage collected once they are no longer needed - but they might be used for diagnostics at any time if things fail.)

The string "externalCode" appears repeatedly in your data. As far as I can tell, your source document is a wrapper for data in a variety of micro formats including what looks like CSV and JSON. I'm not sure how you are parsing or processing this data. If you build intermediate structures to hold it, then there may well be opportunities to reduce the memory footprint of these structures, but that's essentially an application-level design issue, not something I can help you with much.

I see you are using fn:json-to-xml() in your code. It's worth noting that the XML this produces will almost certainly use more memory than the original JSON (though this is just a gut feeling, not something I have measured). You might be better off using fn:parse-json to parse the JSON into maps and arrays, but I can't judge this without knowing what you are doing with the data.

Given that the application repeatedly copies the XML tree in about 5 phases of processing, there's almost certainly some mileage in trying to reduce the amount of copying that's going on.

Actions #8

Updated by Michael Kay about 1 month ago

  • Status changed from New to In Progress
Actions #9

Updated by fouad MOUTASSIM about 1 month ago

Hello,

Thank you for your reply.

We have conducted several tests using the suggested solutions, but we are still experiencing the same issue.

  1. When using the "parse-json" function instead of "json-to-xml", we encountered the following error: XTDE0450 - Cannot add a map to an XDM node tree.
  2. Even when using the "TINY_TREE_CONDENSED" option, we are still encountering the Out of Memory (OOM) error.
  3. We have simplified the map (attached) and removed all the mentioned variables, but the OOM issue persists.

Best Regards, Fouad.

Actions #10

Updated by Michael Kay about 1 month ago

I'm afraid this is starting to cross the fuzzy line between product support and consultancy. There's no evidence here of a product defect, which is where the line is drawn; instead you are asking us for assistance with improving the design of your application, which definitely falls on the "consultancy" side of the boundary.

We enjoy working on problems like this because they help us to understand our own product better, and to identify opportunities for making improvements; but like everyone else we have limited resources and other priorities.

Actions #11

Updated by Michael Kay 4 days ago

  • Status changed from In Progress to Closed

I'm closing this as we have provided advice and there is no evidence of a product defect.

If you need more information, please feel free to open another support request.
