Bug #3609 (fixed in 9.8) resurfaces in 9.9 and later
I assume the fix was not applied to 9.9 and later?
It is potentially dangerous as this may eat text in replace() operations that rely on capturing groups. If this goes unnoticed, data is lost. (Found out the hard way after migrating to 10.3 with an XSLT based wiki-system that relies on regex to insert non-breaking spaces between numbers and units.)
Apologies if this is not the right way to reopen an existing bug. I'm not familiar with the bug tracking system.
#2 Updated by Michael Kay 7 months ago
An observation: I suspect we have very few tests for the ";j" and ";n" flags - certainly not at the level of detail that would be likely to catch this problem.
Also: in edge cases the semantics of the XPath regex functions (replace, matches, etc) are not defined over regex dialects other than the XPath regex dialect. For example, there's no provision in XPath for how a regex should be handled if it captures groups by name rather than by number. So this extension feature is always going to be on a "best efforts" basis.
#3 Updated by O'Neil Delpratt 7 months ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
- Fix Committed on Branch 10, trunk added
I have applied the same fix made way back on the 9.8 branch i.e. issue #3609) to the latest 10 and trunk branches.
This fix will be available in the next maintenance release.
#4 Updated by Tom De Herdt 7 months ago
As for Michael's observations: I understand, and I'm thankful for Saxon's very good "best efforts" basis in this regard.
Not sure if this is an edge case though, since the problem affects simple expressions with numbered captured groups.
In fact, most regular expressions in our application do not use special features and run correctly with the standard regex engine. But there was a significant performance drop when moving from 9.4 to 9.8 (which has its own regex engine, if I remember correctly). The application relies heavily on regular expressions to translate wiki-like markup to XML. Using the .NET regex engine makes a big difference.
To give you an idea: converting a 24-page document takes about 140 ms with Saxon 9.4N standard engine, 350 ms on 9.8N standard engine and about 180 ms on 9.8N with the .NET regex engine.
I'm not implying the new regex engine is too slow, far from that. I probably need to check and optimize my regular expressions and XSLT coding strategies in general. But for this particular application the native engine really is much faster and generates exactly the same output (until the capturing group bug showed up).
As for testing: wouldn't it be possible to run standard regex tests with ";j" and ";n" flags and see if they generate identical output? That doesn't cover non-standard features of course (which is impossible because of undefined semantics as you point out), but at least it would seem to imply that it's safe to run standard regex stylesheets with the native engine whenever there is a real performance benefit...?
#5 Updated by Michael Kay 7 months ago
Yes, of course it would be possible to test in this way. The problem is that if we ran 2000 regex tests this way we would probably get 200 failures (given that the tests are deliberately focused on edge cases) and we would need to inspect each of these failures carefully to understand whether it was legit or not.
I do understand the performance case, which is why the "escape" option was provided in the first place.
#6 Updated by Tom De Herdt 7 months ago
OK, I understand now.
It seems reasonable that applications using the non-standard "escape" option for performance reasons do their own specific testing, which could be as simple as comparing output for a large data set with/without the native engine flag.
Thank you very much for taking time to comment & clarify (even late at night)!
Regards, Tom De Herdt
Please register to edit this issue