Project

Profile

Help

Custom tree model

Added by Anonymous about 14 years ago

Legacy ID: #8222602 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

Hello Mr Kay! I would like to commit an experiment: to implement a new tree model for the saxon (a variant of linked tree). My question is if it's possible to plug such tree model externally? I've seen that I can specify it through an enumeration only. The idea behind of this experiment is to separate node parent reference from all other node data, and as result to make subtree copy very cheap. I've described one reasoning behind of this approach in http://www.nesterovsky-bros.com/weblog/2010/03/22/InlineFunctionsInXslt21.aspx Thanks Vladimir Nesterovsky


Replies (40)

Please register to reply

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8223322 Legacy Poster: Michael Kay (mhkay)

Seems a good idea. Saxon tries a bit of this with the class net.sf.saxon.om.VirtualCopy which represents a virtual copy of a node; the tricky part is that the copied node and all its descendants need to have a different identity from the original, that is, A is B must return false when B is a copy of A. In Saxon's implementation, the VirtualCopy must be a complete tree (i.e. rooted at a parentless node), which severely limits the number of places it can be used; but I've always thought it would be possible to extend the idea so that two distinct trees (in terms of node-identity) can share storage for a common subtree. Some of this would be easier if Saxon accessed nodes via a stateful XPathNavigator object rather than a NodeInfo object, but that would be a very big internal change. It's true that Configuration.setTreeModel() takes an enumeration, which means you can't set a different tree model as the system default. But at the level of an individual transformation you can use Controller.setModel(TreeModel), which can be your own implementation of the TreeModel interface: the main task (apart from implementing NodeInfo, of course) is to implement a subclass of Builder to build your tree from parser events.

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8223345 Legacy Poster: Michael Kay (mhkay)

Interesting blog post, by the way. I share your concerns that using higher-order functions to plug the limitations in XDM's data structuring facilities will lead to poor usability. I've been hoping, however, that we could use HOF as the formal underpinning, and then provide more usable facilities as a "syntactic sugar" layer on top. But there's definitely a question about whether richer data structures should or should not be modelled as XDM nodes, and it's interesting to see you addressing that question. It's related to questions about user-defined axes, I think. One possible route is to extend XDM so the nodes can participate in an arbitrary graph, using arbitrary axes for the inter-node relationships, and then to treat the model of an XML tree as a special case of this in which the axes are the ones that are currently defined in 2.0. Don't expect such radical thinking to be acceptable to the W3C WGs any time soon; but it's worth exploring at a research level.

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8224206 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

... can participate in an arbitrary graph ... The following theorem should probably be proved, but it goes like this: staying within a functional language one cannot model a cyclic graph. Thus functional languages including xslt and xquery are limited to trees (forest). This means that XDM rather closely approximates all kinds of graphs available.

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8224234 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

Formally, XDM does not allow "diamond" like structures: A refers to B and to C, B and C refer to D, but this can be modeled.

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8225645 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

I'm trying noe to figure out how NodeInfo.copy() should work for this tree model. To understand interactions I'm looking at ElementCreator.constructElement() method. What I see is that: a SequenceReceiver or its wrapper is created to collect result; nested elements are added through SequenceReceiver.append(), which calls NodeInfo.copy() for a pipe of nested recievers; the most nested reciever (Builder) is hidden (not available in NodeInfo.copy()); Builder could allowed me to compose parent child nodes efficiently. Am I right in that the efficient external implementation of desired tree model is not possible?

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8225808 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

Things could work if there were Reciever.append() method, which might elminate need of NodeInfo.copy().

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8227481 Legacy Poster: Michael Kay (mhkay)

I think you are right in concluding that the Builder is not able to detect that the events being sent to it represent a copy of a node in some source document. On the other hand, if it's "your" node that is being copied (i.e. an instance of NodeInfo that uses your implementation of the NodeInfo interface), then your implementation of the NodeInfo.copy() method can behave differently. You can probably detect that the ultimate destination of the copy is your Builder class by following the pipeline of intermediate ProxyReceiver objects using getUnderlyingReceiver(). You could also consider exploiting the rather peculiar mechanism of the "CopyInformee" which is set in the PipelineConfiguration to be notified of xsl:copy-of instructions - though I agree this might not be general enough. It's entirely possible, however, that your ideas can't be implemented without changing the Saxon source code.

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8228094 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

I've found that SequenceOutputter used in ElementCreator does not expose underlying reciever. Also, at some cases SequenceOutputter is wrapped into Validator, which probably does not allow to pass item through. If one could pass a node to a reciever as a whole rather than a sequence of events, then different kinds of recievers could process it and pass a node further.

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8253452 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

I've implemented described tree model! It's called ComposableTree. I was not able to create external classes, so I've introduced a small set of changes to saxon. I've declared an interface: public interface NodeReceiver extends Receiver { public boolean canRecieve(NodeInfo node); public void node(NodeInfo node); } and have implemented it over: ComplexContentOutputter,ProxyReceiver, SequenceWrapper, SequenceWriter, SequenceWriter. Its goal is to push node to the tree builder. A tree builder for this tree model is called ComposableTreeBuilder. Classes are derived by refactoring linked tree. Definitely, better implementation is possible. My shallow tests show that approach is working. I can observe sharing of data among trees. This approach works when code passes through ElementCreator, but does not work when Copy is in action (see Copy.processLeavingTail()), as it decomposes node into a sequence of tree events. Extensive tests are required to verify implementation and to measure its performance. At this point I hardly can proceed, as I have no large test base. I can share the code if you're interested in the verification. Thanks. Vladimir Nesterovsky

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8253845 Legacy Poster: Michael Kay (mhkay)

Your node() method seems quite similar to the append() method in SequenceReceiver. Probably a better solution is to merge Receiver and SequenceReceiver, and then do what the pull pipeline does: you can wrap a Decomposer around any EventIterator to decompose nodes/trees into individual events, if the target can't do that for itself. Or one could inherit from an abstract base class than implements the append() method in terms of the other methods. How have you solved the problems of node identity mentioned in comment #2?

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8253892 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

Your node() method seems quite similar to the append() method in SequenceReceiver. You're right. The only reason I did not merge them is that I wanted minimal side effects. In fact SequenceWriter looks now like this: public abstract class SequenceWriter extends SequenceReceiver implements NodeReceiver { ... public boolean canRecieve(NodeInfo node) throws XPathException { if (outputter instanceof NodeReceiver) { NodeReceiver nodeReciever = (NodeReceiver)outputter; return nodeReciever.canRecieve(node); } return false; } public void node(NodeInfo node) throws XPathException { NodeReceiver nodeReciever = (NodeReceiver)outputter; nodeReciever.node(node); } public void append(Item item, int locationId, int copyNamespaces) throws XPathException { ... else { NodeInfo node = (NodeInfo)item; if (canRecieve(node)) { node(node); } else { node.copy(outputter, NodeInfo.ALL_NAMESPACES, true, locationId); } ... } } } } > How have you solved the problems of node identity mentioned in comment #2? Previously node identity were base on sequence number (in most cases). In this tree model, I think, we cannot rely on this. Thus identity related functions are changed as following: public void generateId(FastStringBuffer buffer) { parent.generateId(buffer); buffer.append(NODE_LETTER[getNodeKind()]); buffer.append(Integer.toString(index)); } public boolean isSameNodeInfo(NodeInfo other) { // default implementation: differs for attribute and namespace nodes if (other instanceof NodeImpl) { NodeImpl otherNode = (NodeImpl)other; return this.getNodeData() == otherNode.getNodeData(); } return false; } public final int compareOrder(NodeInfo other) { if (other instanceof NamespaceIterator.NamespaceNodeImpl) { return 0 - other.compareOrder(this); } // Nodes added by XQuery Update do not have sequence numbers return Navigator.compareOrder(this, ((NodeImpl)other)); }

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8253953 Legacy Poster: Michael Kay (mhkay)

I can't see how this works. If the same Java object is used to represent nodes in two different trees, then the identity-dependent methods must depend not only on the properties of this Java object, but on how it was reached. For the VirtualCopy implementation this is done by using a transient Java object (the VirtualCopy) which contains both a reference to the "shared" node holding the real data, and a documentNumber that distinguishes which tree it is part of. I would expect to see you doing something similar. Test case: <xsl:variable name="x" as="element()"> </xsl:variable> <xsl:variable name="y" as="element()"> <xsl:sequence select="$x"/> </xsl:variable> <xsl:value-of select="generate-id($x) = generate-id($y/e)"/> should return "false".

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8253968 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

I see already that isSameNodeInfo() is not correct. I must to check that parens are the same also.

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8253983 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

public boolean isSameNodeInfo(NodeInfo other) { // default implementation: differs for attribute and namespace nodes if (other instanceof NodeImpl) { NodeImpl otherNode = (NodeImpl)other; if (this.getNodeData() != otherNode.getNodeData()) { return false; } if (parent == null) { return otherNode.parent == null; } if (otherNode.parent != null) { return parent.isSameNodeInfo(otherNode.parent); } } return false; } as for generate-id() I think that above mentioned implementation addresses the question, as value depends on generate-id() of parent.

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8253995 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

For this code: <xsl:variable name="x" as="element()"> </xsl:variable> <xsl:variable name="y" as="element()"> <xsl:sequence select="$x"/> </xsl:variable> <xsl:message select="generate-id($x), generate-id($y/e)"/> I can see this output: d3e0 d4e0e0

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8254007 Legacy Poster: Michael Kay (mhkay)

But I thought the idea was that a node could have more than one parent: or more strictly, that a single Java object could represent two different nodes with different parents?

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8254052 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

NodeImpl stores protected ParentNodeImpl parent; protected int index; and has method: protected abstract NodeData getNodeData(); NodeData defines: public abstract NodeImpl getNodeInfo(ParentNodeImpl parent, int index); Descendants of NodeImpl define their NodeData: abstract class ParentNodeImpl extends NodeImpl { protected static abstract class ParentNodeData extends NodeData { public Object children = null; ... } public class ElementImpl extends ParentNodeImpl implements NamespaceResolver { protected static class ElementData extends ParentNodeData { protected int nameCode; protected int typeCode; protected AttributeCollection attributeList; // this excludes namespace attributes protected int[] namespaceList = null; // list of namespace codes ... } protected ElementData data; protected ParentNodeData getNodeData() { return data; } NodeData is shared. This tree model looks as variation of linked tree but on the other side it is similar to node wrappers.

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8254069 Legacy Poster: Michael Kay (mhkay)

OK, thanks, I think I get the idea.

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8290900 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

Hello, Continuing with tree models I have a somewhat different question. Looking at NamePool I think your wanted to make some agressive optimization, but I don't really get the idea. Can you please explain? I've stated my thoughts at http://www.nesterovsky-bros.com/weblog/2010/04/09/NamePool.aspx

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8292052 Legacy Poster: Michael Kay (mhkay)

Your equals() test for comparing two qualified names is MUCH more expensive than comparing two integers. To put this in context, Saxon's loop for scanning the TinyTree to evaluate //foo:bar contains little more than three integer comparisons and an integer increment (for nodes that don't match). There are problems with the NamePool because of synchronization - but your proposal doesn't solve that. You would still need to synchronize when adding names to the pool (and perhaps even when finding them.)

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8292657 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

Your equals() test for comparing two qualified names is MUCH more expensive than comparing two integers. The use pattern goes like this: 1. get cached QualifiedName for triple (here one should care about contention): getQualifiedName(prefix, localName, namespace) getQualifiedName(key) 2. Use (store) cached QualifiedName instance. 3. To get parts use getPrefix(), getLocalName(), getNamespace(), getExtendedName() that are primitive getters and do not require synchronization. 4. To test two qualified names for equality one compares two instance references: name1 == name2 5. To test two expanded names (localname, and namespace) one compares two instance references: name1.getExtendedName() == name2.getExtendedName() > qualified names is MUCH more expensive than comparing two integers I guess that reference comparisions are as fast as integer comparisions. > There are problems with the NamePool because of synchronization - but your proposal doesn't solve that. > You would still need to synchronize when adding names to the pool (and perhaps even when finding them.) There is only contention point during caching of a new value. In the sample I've used concurrent map with no explicit synchronization (http://www.nesterovsky-bros.com/download/saxon/NameCache.java.txt).

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8292700 Legacy Poster: Michael Kay (mhkay)

I don't see what benefits such a redesign would bring. It would involve storing object references in the TinyTree for node name and type annotation in place of integers, which would increase the size from 19 bytes per node to 27 bytes per node (or more, because spare bits in the type annotation field are used for other things). There would be more synchronization overhead (this is a tricky one. Saxon doesn't currently synchronize on getting a value from the pool. This isn't provably safe, but it seems to work reliably. But this wouldn't be safe with implementations of ConcurrentHashMap, where a writer can expand the map dynamically, causing concurrent readers to fail. I do need to make improvements to the NamePool to reduce contention for high-throughput applications, but this seems to be a step backwards.

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8292740 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

It would involve storing object references in the TinyTree for node name TinyTree stores names as int[] nameCode; but QualifiedName[] names; would occupy exactly the same space on 32 bits (and x2 on 64bit). > There would be more synchronization overhead My point is that, there won't be more synchronization (probably less), as operations to get name components from qualified name are primitive getters, so they do not need to access pool at all.

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8293292 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

TinyTree for node name and type annotation in place of integers, > which would increase the size from 19 bytes per node to 27 bytes per node Regardless of what's written above, to reduce memory footprint you can: 1. define NodeMetadata: NodeMetadata node type; qualified name; type annotation; 2. Cache it. 3. NodeInfo to refer to it. For TinyTree this would replace: public byte[] nodeKind; protected int[] nameCode; protected int[] typeCodeArray; with NodeMetadata[] metadatas;

RE: Custom tree model - Added by Anonymous about 14 years ago

Legacy ID: #8329713 Legacy Poster: Vladimir Nesterovsky (vnesterovsky)

I think you have a typo in net.sf.saxon.sreamedtree.NodeImpl.generateId(): public void generateId(FastStringBuffer buffer) { long seq = getSequenceNumber(); if (seq == -1L) { getPhysicalRoot().generateId(buffer); buffer.append(NODE_LETTER[getNodeKind()]); buffer.append(Long.toString(seq)); } else { parent.generateId(buffer); buffer.append(NODE_LETTER[getNodeKind()]); buffer.append(Integer.toString(index)); } } If node has no a sequence number (-1) then result will always be the same. This is masked, however, with method overrides, but still one is able to construct a tree with two text nodes with identical generateid() values.

(1-25/40)

Please register to reply