
Predict implementation strategy

Added by Anonymous over 14 years ago

Legacy ID: #8367273 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

Hello Mr Kay! I would like your advice on how to ensure that my functions are efficient in Saxon.

Context: I'm trying to parse text reports in XSLT. From the XSLT point of view, a report is a value of type xs:string*. The process is organized as a sequence of subviews imposed on the original data, so parsing looks like a sequence of filters over the original data.

The problem: Each data subview is implemented as a function, and I need to ensure that such functions are efficient.

An example: Consider this subview: the data before a specific line pattern. The function p:view1() implements such a filter.

[code]
<!-- Entry point: start scanning at the first row. -->
<xsl:function name="p:view1" as="xs:string*">
  <xsl:param name="view" as="xs:string*"/>

  <xsl:sequence select="p:view1($view, 1)"/>
</xsl:function>

<!-- Emit the current row, then recurse, until the stop condition holds. -->
<xsl:function name="p:view1" as="xs:string*">
  <xsl:param name="view" as="xs:string*"/>
  <xsl:param name="row" as="xs:integer"/>

  <xsl:variable name="line" as="xs:string?" select="$view[$row]"/>

  <xsl:if test="exists($line) and not(p:condition1($view, $row))">
    <xsl:sequence select="$line"/>
    <xsl:sequence select="p:view1($view, $row + 1)"/>
  </xsl:if>
</xsl:function>

<!-- Stop condition: a whitespace-only row followed by the end marker. -->
<xsl:function name="p:condition1" as="xs:boolean">
  <xsl:param name="view" as="xs:string*"/>
  <xsl:param name="row" as="xs:integer"/>

  <xsl:sequence select="
    matches($view[$row], '^\s+$') and
    matches($view[$row + 1], 'E N D O F R E P O R T')"/>
</xsl:function>

...
<xsl:variable name="view1" as="xs:string*" select="p:view1($view)"/>
...
<xsl:variable name="view2" as="xs:string*" select="p:view2($view1)"/>
...
[/code]

My concern is to make these functions efficient, meaning that they should be lazy enough and should not try to cache the whole output. Do you think it is possible to achieve this goal in Saxon?

P.S. After all I've written here, I am no longer convinced that this is the right implementation language. I see that I'm going deep into implementation details. Probably I should consider generating Java or C# report parsers. :-)


Replies (3)


RE: Predict implementation strategy - Added by Anonymous over 14 years ago

Legacy ID: #8367442 Legacy Poster: Michael Kay (mhkay)

The general advice I would give is (a) create a measurement framework that allows you to determine the performance you are getting with sufficient precision, including its scalability as workload factors (such as input document size) change, (b) set performance targets, (c) if performance is not meeting targets, try to understand why, by using the tools available: the -explain output to show the decisions made by the optimizer, timing profiles showing where the execution time is spent at the XSLT level, Java timing profiles showing where it is spent at the Java level.

If you find an example where Saxon's execution strategy is clearly sub-optimal, then I'm always interested to know. If you want to know why the optimizer made the decisions it did, then I will try and explain. However, in general I can't give free advice or help to people who want me to do open-ended performance studies or improvements on a particular workload.

There's nothing obviously inefficient about your code; if I were doing a performance exercise on it I would want to know a lot more about the project requirements, e.g. the actual and required performance, the data volumes, the environment in which the XSLT code is running, the opportunities for tuning components other than your XSLT code.

RE: Predict implementation strategy - Added by Anonymous over 14 years ago

Legacy ID: #8367764 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

Thank you. I realize that I could not expect an exhaustive answer. Probably I just needed a place to articulate my problems. Sorry for this misuse.

My problem is the unlimited size of the input (I've seen at least 3 GB). I could supply it to XSLT as xs:string* through an extension function. I need to prevent the engine from caching the input or any view of the input. From the implementation perspective, I would like the engine to work with a buffered sequence (by analogy with a buffered stream). I might be able to implement a "buffered sequence" with an extension function that wraps the output of each function that produces a subview.
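To make the "buffered sequence" idea concrete, here is a minimal sketch in plain Java of a lazy view that mirrors the filter posted in the question: it passes lines through until it sees a whitespace-only line followed by the end marker, and it holds only one line of lookahead. The class name `BeforeEndOfReport` is hypothetical, the marker string is copied as it appears in the post, and nothing here uses or describes Saxon's extension-function API; wiring such an iterator into Saxon is left open.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.regex.Pattern;

/**
 * Hypothetical lazy view: yields the lines that come before a
 * whitespace-only line which is immediately followed by the end marker.
 * Only the current line and one lookahead line are kept in memory.
 */
final class BeforeEndOfReport implements Iterator<String> {
    // Approximates the XPath test matches($line, '^\s+$').
    private static final Pattern BLANK = Pattern.compile("\\s+");
    // Marker text as it appears in the post (assumption: spacing is exact).
    private static final Pattern END = Pattern.compile("E N D O F R E P O R T");

    private final Iterator<String> source;
    private String current;    // next line to hand out, or null at end of input
    private String lookahead;  // line after current, or null at end of input
    private boolean done;      // set once the stop condition has been seen

    BeforeEndOfReport(Iterator<String> source) {
        this.source = source;
        this.current = pull();
        this.lookahead = pull();
    }

    private String pull() {
        return source.hasNext() ? source.next() : null;
    }

    @Override
    public boolean hasNext() {
        if (done || current == null) {
            return false;
        }
        // Stop condition: blank line followed by the end marker.
        if (BLANK.matcher(current).matches()
                && lookahead != null
                && END.matcher(lookahead).find()) {
            done = true;
            return false;
        }
        return true;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String result = current;
        current = lookahead;
        lookahead = pull();
        return result;
    }

    public static void main(String[] args) {
        List<String> report = Arrays.asList(
                "row 1", "row 2", "   ", "E N D O F R E P O R T", "trailer");
        Iterator<String> view = new BeforeEndOfReport(report.iterator());
        while (view.hasNext()) {
            System.out.println(view.next());
        }
    }
}
```

For a large file, the source iterator could come from `BufferedReader.lines().iterator()`, and further views can be stacked by wrapping one such iterator in another, so that memory use stays bounded by the lookahead of each layer rather than by the input size.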

RE: Predict implementation strategy - Added by Anonymous over 14 years ago

Legacy ID: #8367803 Legacy Poster: David Lee (daldei)

My opinions may not be the same as Mr Kay's, but I would not try to use XSLT to parse a 3 GB text file on any mortal machine. If it were me, I would write a pre-parser in another language such as C, Java, C# or even Perl. This pre-parser would split the file into smaller pieces at appropriate places (instead of just using "split", which is too blind) and also try to add some XML structure to the files. Once you have a directory of reasonably sized, even slightly XML-encoded files, you will have an easier time.

However, depending on the needs of your processing, it may still be difficult. Ideally you don't need access to the entire 3 GB all at once, so you can run XSLT on a subset of the files iteratively. 3 GB of text will take at least 10 GB of RAM once loaded into Java, if not more.
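A pre-parser along these lines could be sketched as follows. This is only an illustration under stated assumptions: the class name `ReportSplitter`, the `part-00001.xml` file naming, the `<report>`/`<line>` element names, the UTF-8 encoding, and the boundary pattern in `main` are all hypothetical choices, not anything taken from the actual report format. It streams the input line by line, so memory use does not grow with the file size.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical pre-parser: splits a large text report into small XML files. */
final class ReportSplitter {

    /** Escapes the characters that must not appear raw in element text. */
    static String escape(String s) {
        // Note: control characters that are illegal in XML 1.0 are not handled.
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Streams lines from the reader and closes the current chunk after each
     * line that matches the boundary pattern. Returns the number of files written.
     */
    static int split(BufferedReader in, Path outDir, Pattern boundary) throws IOException {
        int parts = 0;
        BufferedWriter out = null;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (out == null) {
                    // Start a new chunk file lazily, on the first line it receives.
                    parts++;
                    Path file = outDir.resolve(String.format(Locale.ROOT, "part-%05d.xml", parts));
                    out = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
                    out.write("<report>\n");
                }
                out.write("<line>" + escape(line) + "</line>\n");
                if (boundary.matcher(line).find()) {
                    out.write("</report>\n");
                    out.close();
                    out = null;
                }
            }
            if (out != null) {
                out.write("</report>\n"); // close a trailing chunk with no boundary line
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return parts;
    }

    /** Usage: java ReportSplitter input.txt outputDir (both arguments are assumptions). */
    public static void main(String[] args) throws IOException {
        Path outDir = Files.createDirectories(Paths.get(args[1]));
        // Assumed boundary: the end-of-report marker quoted earlier in the thread.
        Pattern boundary = Pattern.compile("E N D O F R E P O R T");
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            System.out.println(split(in, outDir, boundary) + " file(s) written");
        }
    }
}
```

Each resulting file can then be transformed on its own, which matches the suggestion to run XSLT iteratively over a subset of files instead of loading the whole input at once.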
