Project

Profile

Help

Bug #6225

closed

Significant performance problem for case insensitive regex matching under high concurrency

Added by at055612 at055612 7 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
Performance
Sprint/Milestone:
-
Start date:
2023-10-18
Due date:
% Done:

0%

Estimated time:
Legacy ID:
Applies to branch:
10, 11, 12, 9.9, trunk
Fix Committed on Branch:
11, 12
Fixed in Maintenance Release:
Platforms:
.NET, Java

Description

Seen in Saxon-HE v9.9.1-8 (but appears to also be an issue in more recent versions)

Relates to previous fix https://saxonica.plan.io/issues/4449

The above fix added synchronized to the method net.sf.saxon.regex.CaseVariants#getCaseVariants. While this fixes the NPE it results in a significant performance hit when calling this method with high concurrency. We use saxon-HE in a large scale data processing system so may have high numbers of threads concurrently doing XSLT processing. If xpath functions like matches() are used with the case insensitive flag then processing is very very slow with a lot of blocked threads.

The following is taken from a thread dump. We have many threads in this state.

"Data Processor# #10227" #42246 prio=1 os_prio=0 cpu=4294222.87ms elapsed=26283.54s tid=0x00007f59c0167420 nid=0xef63 waiting for monitor entry  [0x00007f5798ab7000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at net.sf.saxon.regex.CaseVariants.getCaseVariants(CaseVariants.java:92)
	- waiting to lock <0x0000000264bf1c00> (a java.lang.Class for net.sf.saxon.regex.CaseVariants)
	at net.sf.saxon.regex.REMatcher.equalCaseBlind(REMatcher.java:715)
	at net.sf.saxon.regex.Operation$OpAtom.iterateMatches(Operation.java:631)
	at net.sf.saxon.regex.REMatcher.checkPreconditions(REMatcher.java:494)
	at net.sf.saxon.regex.REMatcher.match(REMatcher.java:426)
	at net.sf.saxon.regex.ARegularExpression.containsMatch(ARegularExpression.java:89)
	at net.sf.saxon.functions.Matches.call(Matches.java:78)
	at net.sf.saxon.functions.Matches.call(Matches.java:24)
	at net.sf.saxon.expr.FunctionCall.iterate(FunctionCall.java:532)
	at net.sf.saxon.expr.Expression.effectiveBooleanValue(Expression.java:889)
	at net.sf.saxon.expr.FilterIterator$NonNumeric.matches(FilterIterator.java:190)
	at net.sf.saxon.expr.FilterIterator.getNextMatchingItem(FilterIterator.java:76)
	at net.sf.saxon.expr.FilterIterator.next(FilterIterator.java:62)
	at net.sf.saxon.om.SequenceIterator.forEachOrFail(SequenceIterator.java:127)
	at net.sf.saxon.expr.sort.DocumentOrderIterator.<init>(DocumentOrderIterator.java:51)
	at net.sf.saxon.expr.sort.DocumentSorter.iterate(DocumentSorter.java:260)
	at net.sf.saxon.expr.parser.Evaluator$7.evaluate(Evaluator.java:239)
	at net.sf.saxon.expr.LetExpression.eval(LetExpression.java:532)
	at net.sf.saxon.expr.LetExpression.processLeavingTail(LetExpression.java:714)
	at net.sf.saxon.expr.instruct.Choose.processLeavingTail(Choose.java:896)
	at net.sf.saxon.expr.LetExpression.processLeavingTail(LetExpression.java:723)
	at net.sf.saxon.expr.instruct.TemplateRule.applyLeavingTail(TemplateRule.java:352)
	at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:533)
	at net.sf.saxon.expr.instruct.ApplyTemplates.apply(ApplyTemplates.java:300)
	at net.sf.saxon.expr.instruct.ApplyTemplates.processLeavingTail(ApplyTemplates.java:255)
	at net.sf.saxon.expr.instruct.Block.processLeavingTail(Block.java:735)
	at net.sf.saxon.expr.instruct.Instruction.process(Instruction.java:132)
	at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:353)
	at net.sf.saxon.expr.instruct.Copy.processLeavingTail(Copy.java:431)
	at net.sf.saxon.expr.instruct.TemplateRule.applyLeavingTail(TemplateRule.java:352)
	at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:533)
	at net.sf.saxon.trans.rules.TextOnlyCopyRuleSet.process(TextOnlyCopyRuleSet.java:71)
	at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:495)
	at net.sf.saxon.trans.XsltController.applyTemplates(XsltController.java:746)
	at net.sf.saxon.s9api.AbstractXsltTransformer.applyTemplatesToSource(AbstractXsltTransformer.java:347)
	at net.sf.saxon.s9api.XsltTransformer.transform(XsltTransformer.java:349)
	at net.sf.saxon.jaxp.TransformerImpl.transform(TransformerImpl.java:71)
	at net.sf.saxon.jaxp.TransformerHandlerImpl.endDocument(TransformerHandlerImpl.java:173)

The following class is both a test case to demonstrate the problem and a modified version of CaseVariants that fixes the problem (changes marked FIX:). It compares the time taken to repeatedly call getCaseVariants on both versions of the class. Running this main method results in the following output on my pc (with 12cores/24threads).

Round 1
CaseVariants completed in 1314ms
CaseVariantsModified completed in 40ms
Round 2
CaseVariants completed in 1875ms
CaseVariantsModified completed in 10ms

The synchronized method in CaseVariants is a huge performance hit. In CaseVariantsModified I have changed it to use double checked locking to avoid the synchronisation for all but the first few calls.

It would be greatly appreciated if you could patch saxon to fix this issue. Would you also be able to back port the fix into v9?

package test;

import net.sf.saxon.Configuration;
import net.sf.saxon.lib.ParseOptions;
import net.sf.saxon.lib.Validation;
import net.sf.saxon.om.AxisInfo;
import net.sf.saxon.om.NodeInfo;
import net.sf.saxon.pattern.NameTest;
import net.sf.saxon.regex.CaseVariants;
import net.sf.saxon.trans.XPathException;
import net.sf.saxon.tree.iter.AxisIterator;
import net.sf.saxon.type.Type;
import net.sf.saxon.z.IntArraySet;
import net.sf.saxon.z.IntHashMap;
import net.sf.saxon.z.IntToIntHashMap;
import net.sf.saxon.z.IntToIntMap;

import java.io.InputStream;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntFunction;
import javax.xml.transform.stream.StreamSource;

public class TestCaseVariants {
    public static final int THREADS = 24;
    public static final int ITERATIONS = 1_000;

    public static void main(String[] args) throws InterruptedException {

        // Ensure all the maps have been built, so we are only testing the performance
        // of getting variants
        CaseVariants.getCaseVariants('a');
        CaseVariantsModified.getCaseVariants('a');

        final ExecutorService executorService = Executors.newFixedThreadPool(THREADS);

        for (int round = 1; round <= 2; round++) {
            System.out.println("Round " + round);
            runTest("CaseVariants", executorService, CaseVariants::getCaseVariants);
            runTest("CaseVariantsModified", executorService, CaseVariantsModified::getCaseVariants);
        }
        executorService.shutdown();
    }

    private static void runTest(final String name,
                                final ExecutorService executorService,
                                final IntFunction<int[]> func) throws InterruptedException {

        final CountDownLatch startAllLatch = new CountDownLatch(1);
        final CountDownLatch threadReadyLatch = new CountDownLatch(THREADS);
        final CountDownLatch completionLatch = new CountDownLatch(THREADS);

        for (int k = 0; k < THREADS; k++) {
            executorService.submit(() -> {
                threadReadyLatch.countDown();
                try {
                    // All threads start together
                    startAllLatch.await();
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
                for (long j = 0; j < ITERATIONS; j++) {
                    for (int i = 0; i < 1000; i++) {
                        func.apply(i);
                    }
                }
                completionLatch.countDown();
            });
        }
        final Instant startTime = Instant.now();
        // Wait for all threads to be ready
        threadReadyLatch.await();
        // Release all threads
        startAllLatch.countDown();
        // Wait for all threads to complete
        completionLatch.await();
        final Duration duration = Duration.between(startTime, Instant.now());
        System.out.println(name + " completed in " + duration.toMillis() + "ms");
    }

    // --------------------------------------------------------------------------------


    ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    // Copyright (c) 2018 Saxonica Limited.
    // This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0.
    // If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
    // This Source Code Form is "Incompatible With Secondary Licenses", as defined by the Mozilla Public License, v. 2.0.
    ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    /**
     * Modified copy of {@link net.sf.saxon.regex.CaseVariants}.
     */
    private static class CaseVariantsModified {

        // Use one hashmap for characters with a single case variant, another for characters with multiple
        // case variants, to reduce the number of objects that need to be allocated
        // FIX: Make these volatile
        private static volatile IntToIntMap monoVariants = null;
        private static volatile IntHashMap<int[]> polyVariants = null;

        static void build() {

            // FIX: Use temporary local variables
            final IntToIntMap monoVariants = new IntToIntHashMap(2500);
            final IntHashMap<int[]> polyVariants = new IntHashMap<>(100);

            InputStream in = Configuration.locateResource("casevariants.xml", new ArrayList<>(), new ArrayList<>());
            if (in == null) {
                throw new RuntimeException("Unable to read casevariants.xml file");
            }

            Configuration config = new Configuration();
            ParseOptions options = new ParseOptions();
            options.setSchemaValidationMode(Validation.SKIP);
            options.setDTDValidationMode(Validation.SKIP);
            NodeInfo doc;
            try {
                doc = config.buildDocumentTree(new StreamSource(in, "casevariants.xml"), options).getRootNode();
            } catch (XPathException e) {
                throw new RuntimeException("Failed to build casevariants.xml", e);
            }

            AxisIterator iter = doc.iterateAxis(AxisInfo.DESCENDANT, new NameTest(Type.ELEMENT, "", "c", config.getNamePool()));
            while (true) {
                NodeInfo item = iter.next();
                if (item == null) {
                    break;
                }
                String code = item.getAttributeValue("", "n");
                int icode = Integer.parseInt(code, 16);
                String variants = item.getAttributeValue("", "v");
                String[] vhex = variants.split(",");
                int[] vint = new int[vhex.length];
                for (int i=0; i<vhex.length; i++) {
                    vint[i] = Integer.parseInt(vhex[i], 16);
                }
                if (vhex.length == 1) {
                    monoVariants.put(icode, vint[0]);
                } else {
                    polyVariants.put(icode, vint);
                }
            }
            // FIX: Set static class variables at the end
            CaseVariantsModified.polyVariants = polyVariants;
            // Must set this last as other threads will be checking this value
            CaseVariantsModified.monoVariants = monoVariants;
        }

        /**
         * Get the case variants of a character
         *
         * @param code the character whose case variants are required
         * @return the case variants of the character, excluding the character itself
         */

        // FIX: Method no longer synchronized
        public static int[] getCaseVariants(int code) {
            // FIX: Use local variables to avoid additional volatile reads
            IntToIntMap monoVariants = CaseVariantsModified.monoVariants;
            if (monoVariants == null) {
                synchronized (CaseVariantsModified.class) {
                    monoVariants = CaseVariantsModified.monoVariants;
                    if (monoVariants == null) {
                        build();
                        monoVariants = CaseVariantsModified.monoVariants;
                    }
                }
            }
            int mono = monoVariants.get(code);
            if (mono != monoVariants.getDefaultValue()) {
                return new int[]{mono};
            } else {
                int[] result = polyVariants.get(code);
                if (result == null) {
                    return IntArraySet.EMPTY_INT_ARRAY;
                } else {
                    return result;
                }
            }
        }

        /**
         * Get the case variants of roman letters (A-Z, a-z), other than the letters A-Z and a-z themselves
         */

        /*@NotNull*/ public static int[] ROMAN_VARIANTS = {0x0130, 0x0131, 0x212A, 0x017F};
    }
}


Files

CaseVariants.png (95.2 KB) CaseVariants.png VisualVM at055612 at055612, 2023-10-18 18:33

Please register to edit this issue

Also available in: Atom PDF