Bug #2271 » Bug #3876 - 2014-12-24T12_14_35Z.eml

Tomaž Erjavec, 2014-12-24 13:14

 
Return-Path: <Tomaz.Erjavec@ijs.si>
Received: from mi015.mc1.hosteurope.de ([80.237.138.240]) by wp245.webpack.hosteurope.de running ExIM with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) id 1Y3kpj-0006x6-9P; Wed, 24 Dec 2014 13:14:31 +0100
Received: from mail.ijs.si ([193.2.4.66]) by mx0.webpack.hosteurope.de (mi015.mc1.hosteurope.de) with esmtps (TLSv1.1:DHE-RSA-AES256-SHA:256) id 1Y3kph-00015i-Rf for dropbox+saxonica+f38e@plan.io; Wed, 24 Dec 2014 13:14:31 +0100
Received: from amavis-proxy-ori.ijs.si (localhost [IPv6:::1]) by mail.ijs.si (Postfix) with ESMTP id 3k6tdK3cBnzDJ for <dropbox+saxonica+f38e@plan.io>; Wed, 24 Dec 2014 13:14:29 +0100
Received: from mail.ijs.si ([IPv6:::1]) by amavis-proxy-ori.ijs.si (mail.ijs.si [IPv6:::1]) (amavisd-new, port 10012) with ESMTP id u6TylcPhGTvx for <dropbox+saxonica+f38e@plan.io>; Wed, 24 Dec 2014 13:14:25 +0100
Received: from mildred.ijs.si (mailbox.ijs.si [IPv6:2001:1470:ff80::143:1]) by mail.ijs.si (Postfix) with ESMTP for <dropbox+saxonica+f38e@plan.io>; Wed, 24 Dec 2014 13:14:25 +0100
Received: from [IPv6:2001:1470:ff80:e8:3c6d:3d1f:10bc:1667] (unknown [IPv6:2001:1470:ff80:e8:3c6d:3d1f:10bc:1667]) (using TLSv1 with cipher DHE-RSA-AES128-SHA (128/128 bits)) (No client certificate requested) by mildred.ijs.si (Postfix) with ESMTPSA id 3k6tdF0L3Xz1Ch for <dropbox+saxonica+f38e@plan.io>; Wed, 24 Dec 2014 13:14:25 +0100
Date: Wed, 24 Dec 2014 13:14:34 +0100
From: =?UTF-8?B?VG9tYcW+IEVyamF2ZWM=?= <Tomaz.Erjavec@ijs.si>
To: Saxonica Developer Community <dropbox+saxonica+f38e@plan.io>
Message-ID: <549AAE2A.5090107@ijs.si>
In-Reply-To: <redmine.journal-3874.20141223230614@plan.io>
References: <redmine.issue-2271.20141221205103@plan.io>
<redmine.journal-3874.20141223230614@plan.io>
Subject: Re: [Saxon - Bug #2271] (Resolved) AIOOBE with large xml file
Mime-Version: 1.0
Content-Type: multipart/alternative;
boundary=------------020308060604080408010703;
charset=UTF-8
Content-Transfer-Encoding: 7bit
Delivery-date: Wed, 24 Dec 2014 13:14:31 +0100
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=ijs.si; h=
content-type:content-type:in-reply-to:references:subject:subject
:mime-version:user-agent:from:from:date:date:message-id:received
:received:received; s=jakla4; t=1419423265; x=1422015266; bh=dJi
JcIiyvZzovhM8keDLi0m6lcBTdLX7/f89b+1vgSQ=; b=Q1CZQRq6aQPAwFuWvkF
q1ZTFLSG5F1TBlP7XMAcR6OnjxvPTxPi4Sr25hUzM8wW90cRgqpIKTDYRemmxEGt
sa56v2f14hNHUaCfozTOCQVpcOd4Jbb3WvYfVMm500a93gEGZcXiEdFPFSABDrkD
pUL+h0s1zqVvYLuZ4AJFkCfk=
X-Virus-Scanned: amavisd-new at ijs.si
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101
Thunderbird/31.3.0
X-HE-Spam-Level: -----
X-HE-Spam-Score: -5.0
X-HE-Spam-Report: Content analysis details: (-5.0 points) pts rule name
description ---- ----------------------
-------------------------------------------------- -5.0 RCVD_IN_DNSWL_HI RBL:
Sender listed at http://www.dnswl.org/, high trust [193.2.4.66 listed in
list.dnswl.org] 0.1 HTML_MESSAGE BODY: HTML included in message -0.1
DKIM_VALID_AU Message has a valid DKIM or DK signature from author's domain
-0.1 DKIM_VALID Message has at least one valid DKIM or DK signature 0.1
DKIM_SIGNED Message has a DKIM or DK signature, not necessarily valid
X-HE-SPF: PASSED
Envelope-to: dropbox+saxonica+f38e@plan.io

This is a multi-part message in MIME format.
--------------020308060604080408010703
Content-Type: text/plain;
charset=utf-8;
format=flowed
Content-Transfer-Encoding: quoted-printable

I won't pretend to understand the Saxon data structures and functions
that need to be modified but, yes, for me it would certainly be nice if
I could feed large files to XSLT. I'm working with corpora, where there
are lots of conversions to be done, and working on the whole corpus /
file is of course very convenient (it's not such a big thing to split
it, but it's one more layer of complication). And my server has 47G of
memory, so no problem there.
So, very glad that it seems doable, and fingers crossed that a dark cold
winter day comes along to provide the opportunity to do it :)
All the best,
Tomaž

On 24.12.2014 at 0:06, Saxonica Developer Community wrote:
>
> --- In your reply, please do not write below this line ---
>
> Issue #2271 has been updated by Michael Kay.
>
> * *Status* changed from /In Progress/ to /Resolved/
>
> I've been thinking a little about how one might tackle this, without
> relying on any Java changes.
>
> It would be easy to change the LargeStringBuffer to support 2^31
> segments of 2^16 characters, instead of 2^15 segments as at present.
> It wouldn't be possible to implement CharSequence properly, because
> CharSequence uses int offsets; but there's actually no great need for
> LargeStringBuffer to implement CharSequence. We would also need a
> variant of the TinyTree that uses longs instead of ints for the
> offsets into the buffer. (Actually, I don't think there's a really
> great need for the string value of the document to be held in
> pseudo-contiguous storage at all, but it would reduce the changes
> needed to keep it that way).
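
To make the idea of 2^16-character segments addressed by long offsets
concrete, here is a minimal sketch. It is a hypothetical illustration,
not Saxon's LargeStringBuffer: the class name SegmentedCharBuffer and
all of its members are assumptions made for this example. It drops
CharSequence (whose offsets are ints) and exposes long offsets instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Saxon's LargeStringBuffer.
class SegmentedCharBuffer {
    private static final int BITS = 16;            // 2^16 chars per segment
    private static final int SEGLEN = 1 << BITS;
    private static final int MASK = SEGLEN - 1;

    private final List<char[]> segments = new ArrayList<>();
    private long length = 0;                       // long: does not wrap at 2^31

    void append(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            int seg = (int) (length >>> BITS);     // index of the target segment
            if (seg == segments.size()) {
                segments.add(new char[SEGLEN]);    // allocate segments lazily
            }
            segments.get(seg)[(int) (length & MASK)] = s.charAt(i);
            length++;
        }
    }

    char charAt(long offset) {
        if (offset < 0 || offset >= length) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        return segments.get((int) (offset >>> BITS))[(int) (offset & MASK)];
    }

    long length() {
        return length;
    }

    public static void main(String[] args) {
        SegmentedCharBuffer buf = new SegmentedCharBuffer();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 70_000; i++) {
            sb.append((char) ('a' + i % 26));      // crosses the 65536 boundary
        }
        buf.append(sb);
        System.out.println(buf.length());
        System.out.println(buf.charAt(65_535) + " " + buf.charAt(65_536));
    }
}
```

Because the segment index is still an int, a structure like this tops
out at roughly 2^31 segments of 2^16 characters, which matches the
limit described in the paragraph above.
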
>
> If someone actually asks for the string value of the root node, which
> will be longer than 2^31 characters, we can return a StringValue that
> contains pointers into the LargeStringBuffer. That's no great problem
> at the XPath level. At the Java API level we would need to make
> changes to NodeInfo.getStringValue(), but I think we could do this by
> retaining this method and having it throw an exception if the string
> value is too long, and providing an alternative method for getting the
> string value without the 2^31 limit.
>
> In 9.6 we've started making greater use of the interface UnicodeString
> which provides direct addressing into strings using Unicode codepoint
> counting rather than 16-bit char counting. This also has the advantage
> that strings using only Latin-1 characters only need 8 bits per
> character. We could easily extend this interface to use longs rather
> than ints for codepoint addressing, and we could underpin it with
> something like the LargeStringBuffer data structure to bypass Java
> limits on string and array sizes. So I think it's do-able.
>
> (Just been looking at the specs for current MacBooks and I'm actually
> slightly surprised that they're not very much higher than my
> early-2011 model. Perhaps things are reaching a plateau? Who knows.)
>
> ------------------------------------------------------------------------
>
>
> Bug #2271: AIOOBE with large xml file
> <https://saxonica.plan.io/issues/2271#change-3874>
>
> * Author: Tomaž Erjavec
> * Status: Resolved
> * Priority: Normal
> * Assignee: Tomaž Erjavec
> * Category: Internals
> * Sprint/Milestone:
> * Legacy ID:
> * Found in version: 9.6
> * Fixed in version:
>
> Hi,
> Saxon gives me an array index out of bounds when I try to process a
> large file, and this happens even with an empty stylesheet. I can
> understand that it wouldn't work, but I would expect an exception
> saying out of memory, not an AIOOBE.
> I'm using Saxon 9.6.0.3 (I tried with some older versions, same
> problem) with java 1.8.0_25:
> Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
> Below is the trace.
> All the best,
> Tomaž
> PS: I can send the file if it would help.
>
> $ du -h blog.bug.xml
> 2,8G blog.bug.xml
> $ java -jar /usr/local/bin/saxon9he.jar -xsl:empty.xsl blog.bug.xml > bug.vert
> java.lang.ArrayIndexOutOfBoundsException: -32768
> at net.sf.saxon.tree.tiny.LargeStringBuffer.append(LargeStringBuffer.java:90)
> at net.sf.saxon.tree.tiny.TinyTree.appendChars(TinyTree.java:405)
> at net.sf.saxon.tree.tiny.TinyBuilder.makeTextNode(TinyBuilder.java:380)
> at net.sf.saxon.tree.tiny.TinyBuilder.characters(TinyBuilder.java:362)
> at net.sf.saxon.event.ReceivingContentHandler.flush(ReceivingContentHandler.java:544)
> at net.sf.saxon.event.ReceivingContentHandler.endElement(ReceivingContentHandler.java:435)
> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
> at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
> at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2973)
> at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
> at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
> at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
> at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
> at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
> at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:440)
> at net.sf.saxon.event.Sender.send(Sender.java:171)
> at net.sf.saxon.Controller.transform(Controller.java:1690)
> at net.sf.saxon.s9api.XsltTransformer.transform(XsltTransformer.java:547)
> at net.sf.saxon.Transform.processFile(Transform.java:1056)
> at net.sf.saxon.Transform.doTransform(Transform.java:659)
> at net.sf.saxon.Transform.main(Transform.java:80)
> Fatal error during transformation: java.lang.ArrayIndexOutOfBoundsException: -32768
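
The trace does not say how the index became -32768. One reading that
fits the "2^15 segments of 2^16 characters" layout mentioned in the
update is that an int character offset wrapped past 2^31 - 1 and was
then shifted right by 16 bits to select a segment. That is an
assumption, not something taken from the Saxon source; the sketch below
only demonstrates the arithmetic, and the class and method names are
made up for the example.

```java
// Hypothetical demo of the arithmetic; not code from Saxon.
class OffsetOverflowDemo {
    static final int BITS = 16;                    // 2^16 chars per segment

    // Segment index computed from an int offset (wraps at 2^31).
    static int segmentIndexFromInt(int offset) {
        return offset >> BITS;
    }

    // Segment index computed from a long offset (no wrap in this range).
    static long segmentIndexFromLong(long offset) {
        return offset >> BITS;
    }

    public static void main(String[] args) {
        int lastGood = Integer.MAX_VALUE;          // 2^31 - 1
        int wrapped = lastGood + 1;                // int overflow: becomes -2^31

        System.out.println(segmentIndexFromInt(lastGood));            // 32767
        System.out.println(segmentIndexFromInt(wrapped));             // -32768
        System.out.println(segmentIndexFromLong((long) lastGood + 1)); // 32768
    }
}
```

With a long offset the same position maps to segment 32768, which is
the first segment beyond the 2^15 available in the current layout.
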
>
> ------------------------------------------------------------------------
>
> You have received this notification because you have either subscribed
> to or are involved in a project on Saxonica Developer Community site.
> To change your notification preferences, please click here:
> https://saxonica.plan.io/my/account
>
> This notification was cheerfully delivered by <https://plan.io/>
> Planio <https://plan.io/>
>


--------------020308060604080408010703--