Hi,
Am I wrong, or is it impossible to retrieve the outputs of a pipeline
using the org.orbeon.oxf.pipeline.api package?

What I'd like to do is to instantiate a pipeline processor, connect its
config input to a static URI (so far so good), send it a second input as
SAX events and read its single output as SAX events too. The second
input and the output would have predefined names (like the data
input/output of a page view, if you like).

Would you have pointers that could help me do so without reinventing
the wheel?

Thanks,

Eric
--
The first 100% XML directory of beekeepers!
http://apiculteurs.info/
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(ISO) RELAX NG   ISBN:0-596-00421-4 http://oreilly.com/catalog/relax
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

--
You receive this message as a subscriber of the [hidden email] mailing list.
To unsubscribe: mailto:[hidden email]
For general help: mailto:[hidden email]?subject=help
ObjectWeb mailing lists service home page: http://www.objectweb.org/wws

Eric,
An easy one. To achieve this, you connect the output of your processor
to the DOMSerializer processor:

    DOMSerializer domSerializer = new DOMSerializer();
    PipelineUtils.connect(myProcessor, myOutput.getName(), domSerializer, "data");

Then you start the execution of your pipeline:

    domSerializer.start(pipelineContext);

When the execution has terminated, you can obtain the result as a dom4j
or W3C DOM document:

    domSerializer.getDocument(pipelineContext)
    domSerializer.getW3CDocument(pipelineContext)

-Erik

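The idea of ending a chain with a component that collects SAX events into a DOM can be tried outside of Orbeon. The sketch below does not use DOMSerializer or any Orbeon class; it swaps in the JDK's own JAXP TransformerHandler as the collecting end point, and the class name SaxToDomSketch and the hand-written events are purely illustrative assumptions.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Illustrative only: a JDK-based stand-in for "collect SAX events into a DOM".
final class SaxToDomSketch {

    static Document collect() {
        try {
            // Empty W3C document that will receive the nodes
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();

            // An identity TransformerHandler is a ContentHandler that writes to a Result
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler sink = factory.newTransformerHandler();
            sink.setResult(new DOMResult(document));

            // Hand-written events; in a real chain a processor output would emit them
            char[] text = "hello".toCharArray();
            sink.startDocument();
            sink.startElement("", "doc", "doc", new AttributesImpl());
            sink.characters(text, 0, text.length);
            sink.endElement("", "doc", "doc");
            sink.endDocument();

            return document;
        } catch (ParserConfigurationException
                | TransformerConfigurationException
                | SAXException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the handler is an ordinary org.xml.sax.ContentHandler, anything that can push SAX events at a content handler could feed it.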
Hi Erik,
On Friday 18 November 2005 at 15:48 +0100, Erik Bruchez wrote:
> Eric,
>
> An easy one.

Not that sure, I think I haven't been clear enough :-) ...

My problem isn't connecting the external inputs/outputs of a custom
processor...

I am writing a processor that calls a pipeline processor in Java, and my
question is the other way round: how can I, in my custom processor (in
Java), instantiate a pipeline processor, connect its config input to a
static URI (so far I have found how to do so with the
org.orbeon.oxf.pipeline.api package), send it a second input as SAX
events and read its single output as SAX events too (this is what I
don't think you can do with this package)?

It looks like org.orbeon.oxf.pipeline.api has been designed with the
minimal set of features needed to implement the command line utility,
and I need more than that!

I can probably create a new instance of PipelineProcessor directly and
call its createInput and createOutput methods, but then how do I read
its output and give it its inputs?

Eric

On Friday 18 November 2005 at 16:06 +0100, Eric van der Vlist wrote:
> It looks like the org.orbeon.oxf.pipeline.api has been designed with the
> minimal amount of features needed to implement the command line utility
> and I need more than that!

I have made some progress :-) ...

It's a quick hack since it handles only one output, but adding the
following method to InitUtils appears to do the job for me:

    public static void readOutput(Processor processor, ExternalContext externalContext,
            PipelineContext pipelineContext, String outputName,
            ContentHandler contentHandler) throws Exception {

        // Record start time for this request
        long tsBegin = logger.isInfoEnabled() ? System.currentTimeMillis() : 0;
        String requestPath = null;
        try {
            ExternalContext.Request request = externalContext.getRequest();
            requestPath = request.getRequestPath();
        } catch (UnsupportedOperationException e) {
            // Don't do anything
        }

        // Set ExternalContext
        if (externalContext != null) {
            if (logger.isInfoEnabled()) {
                String startLoggerString = externalContext.getStartLoggerString();
                if (startLoggerString != null && startLoggerString.length() > 0)
                    logger.info(startLoggerString);
            }
            pipelineContext.setAttribute(PipelineContext.EXTERNAL_CONTEXT, externalContext);
        }
        // Make the static context available
        StaticExternalContext.setStaticContext(
                new StaticExternalContext.StaticContext(externalContext, pipelineContext));

        try {
            // Set cache size
            Integer cacheMaxSize = OXFProperties.instance().getPropertySet()
                    .getInteger(CACHE_SIZE_PROPERTY);
            if (cacheMaxSize != null)
                ObjectCache.instance().setMaxSize(pipelineContext, cacheMaxSize.intValue());

            // Start execution
            processor.reset(pipelineContext);
            processor.createOutput(outputName);
            ProcessorOutput processorOutput = processor.getOutputByName(outputName);
            processorOutput.read(pipelineContext, contentHandler);
            if (!pipelineContext.isDestroyed())
                pipelineContext.destroy(true);
        } catch (Exception e) {
            try {
                if (!pipelineContext.isDestroyed())
                    pipelineContext.destroy(false);
            } catch (Exception f) {
                logger.error("Exception while destroying context after exception",
                        OXFException.getRootThrowable(f));
            }
            LocationData locationData = ValidationException.getRootLocationData(e);
            Throwable throwable = OXFException.getRootThrowable(e);
            String message = locationData == null
                    ? "Exception with no location data"
                    : "Exception at " + locationData.toString();
            logger.error(message, throwable);
            // Make sure the caller can do something about it, like trying to run an error page
            throw e;
        } finally {
            // Free context
            StaticExternalContext.removeStaticContext();

            if (logger.isInfoEnabled()) {
                // Display cache statistics
                CacheStatistics statistics = ObjectCache.instance().getStatistics(pipelineContext);
                int hitCount = statistics.getHitCount();
                int missCount = statistics.getMissCount();
                String successRate = null;
                if (hitCount + missCount > 0)
                    successRate = hitCount * 100 / (hitCount + missCount) + "%";
                else
                    successRate = "N/A";
                long timing = System.currentTimeMillis() - tsBegin;
                logger.info((requestPath != null ? requestPath : "Done running processor")
                        + " - Timing: " + timing
                        + " - Cache hits: " + hitCount
                        + ", fault: " + missCount
                        + ", adds: " + statistics.getAddCount()
                        + ", success rate: " + successRate);
            }
        }
    }

Most of it is shamelessly copied from the run() method, the difference
being:

    processor.createOutput(outputName);
    ProcessorOutput processorOutput = processor.getOutputByName(outputName);
    processorOutput.read(pipelineContext, contentHandler);

The question is now: what can we do with this :-) ...

The code uses a lot of private static declarations that seem to be
shared between processors, and I don't know if I could easily move it
to another class in my own package. Also, I don't know if I would need
all these lines if it were only a "custom" method.

OTOH, I don't think that this method is generic enough to deserve to be
committed and made generally available...

Eric

On Friday 18 November 2005 at 21:10 +0100, Eric van der Vlist wrote:
> OTOH, I don't think that this method is generic enough to deserve to be
> committed and generally available...

Hmm... I might have spoken too fast and be wrong.

That may not be optimal, but you should be able to read several outputs
by calling this method several times, since it doesn't destroy the
processor.

So, maybe you could include this new method in the source code?

Eric

On 11/18/05, Eric van der Vlist <[hidden email]> wrote:
> That's a quick hack since it would handle only one output, but adding
> the following method to InitUtils appears to do the job for me:
> [...]

Hi Eric,

If you are already in the code of your own processor, you don't need to
do all the things that are done in InitUtils, like setting the cache
size and creating a new pipeline context.

In pseudo-code, you need to:

1) Create the pipeline processor (by instantiating PipelineProcessor) =>
   pipelineProcessor object.
2) Create a URLGenerator and connect it to the "config" input of
   pipelineProcessor with PipelineUtils.connect().
3) Create an input for "data" and an output for "data".
4) Connect to the input your own implementation of ProcessorOutput
   which, when its read(content handler) method is called, calls
   readInputAsSAX() (passing the content handler along) to read the
   "data" input of the processor you are implementing.
5) Call read(content handler) on the processor output you created,
   passing the content handler you received in the readImpl method you
   are implementing.

I hope this makes some sense :).

Alex
--
Blog (XML, Web apps, Open Source): http://www.orbeon.com/blog/

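The pull-style flow in steps 4 and 5 can be sketched without any Orbeon classes. In the sketch below, SaxOutput is a hypothetical stand-in for a processor output (it is not the Orbeon ProcessorOutput API), and the "inner pipeline" is an assumed toy that wraps its input in an <indexed> element; only the org.xml.sax types come from the JDK. It shows that reading the final output is what triggers the read of the outer "data" input, with the same content handler passed down the chain.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a processor output; NOT the Orbeon ProcessorOutput API.
interface SaxOutput {
    void read(ContentHandler handler) throws SAXException;
}

final class PullChainSketch {

    // Plays the role of the outer processor's "data" input: emits <doc/>.
    static SaxOutput outerDataInput() {
        return handler -> {
            handler.startDocument();
            handler.startElement("", "doc", "doc", new AttributesImpl());
            handler.endElement("", "doc", "doc");
            handler.endDocument();
        };
    }

    // Step 4: an output that, when read, simply reads the upstream input
    // with the handler it was given.
    static SaxOutput bridge(SaxOutput upstream) {
        return upstream::read;
    }

    // Toy inner pipeline (an assumption): wraps its input's elements in <indexed>.
    static SaxOutput innerPipeline(SaxOutput input) {
        return handler -> {
            handler.startDocument();
            handler.startElement("", "indexed", "indexed", new AttributesImpl());
            // Forward element and text events only, so the document is started once
            input.read(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                        Attributes atts) throws SAXException {
                    handler.startElement(uri, localName, qName, atts);
                }
                @Override
                public void endElement(String uri, String localName, String qName)
                        throws SAXException {
                    handler.endElement(uri, localName, qName);
                }
                @Override
                public void characters(char[] ch, int start, int length)
                        throws SAXException {
                    handler.characters(ch, start, length);
                }
            });
            handler.endElement("", "indexed", "indexed");
            handler.endDocument();
        };
    }

    // Step 5: read the final output with the caller's handler and record what arrives.
    static String run() {
        StringBuilder trace = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                    Attributes atts) {
                trace.append('<').append(qName).append('>');
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                trace.append("</").append(qName).append('>');
            }
        };
        try {
            innerPipeline(bridge(outerDataInput())).read(recorder);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return trace.toString();
    }
}
```

Calling PullChainSketch.run() returns "<indexed><doc></doc></indexed>": nothing is produced until the last output is read, which is why no separate "start" call appears in the steps.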
Hi Alex,
On Tuesday 22 November 2005 at 11:41 -0800, Alessandro Vernet wrote:
.../...
> In pseudo-code, you need to:
> [...]
> I hope that somehow this will make sense :).

I had tried to find examples of this kind of flow in the source code,
but the ones I found were hidden in the complexity of the PFC or
pipeline controllers, and I hadn't been able to isolate the basic calls
like you did...

I'll try that on a simple example and let you know the outcome.

Thanks.

Eric

Hi Alex,
On Tuesday 22 November 2005 at 11:41 -0800, Alessandro Vernet wrote:
> If you are already in the code of your own processor, you don't need
> to do all those things that are done in InitUtils, like setting the
> cache size and creating a new pipeline context.
> In pseudo-code, you need to:
>
> 1) Create the pipeline processor (instantiating PipelineProcessor) =>
> pipelineProcessor object

Here you need to reset the processor...

> 2) Create URLGenerator, connect it to the "config" input of
> pipelineProcessor with PipelineUtils.connect().
> 3) Create input for "data" and create output for "data"
> 4) Connect to the input your own implementation of a ProcessorOutput
> which when the method read(content handler) is called calls
> readInputAsSAX(passing here the content handler) to read the "data"
> input of the processor you are implementing.
> 5) Call read(content handler) on the processor output you created
> passing the content handler you received from the method readImpl you
> are implementing.

I'll need to have a look at caching and performance issues later on, but
I can now invoke XPL pipelines to index documents based on their media
types...

Thanks,

Eric

Hi,
On Wednesday 23 November 2005 at 12:45 +0100, Eric van der Vlist wrote:
> I'll need to have a look at caching and performance issues later on, but
> I can now invoke XPL pipelines to index documents based on their media
> types...

Just some comments about a few things I have noticed and some related
questions...

1) The scheduler doesn't notice when a thread runs out of memory.

If you run a processor from the scheduler and this processor runs out
of memory, the scheduler doesn't seem to notice, and it won't restart a
processor with the same name (if you have asked it to check that no two
processors with the same name run concurrently).

2) Pipeline contexts grow.

In my setup, a long-running custom processor executes a large number of
pipeline processors (I have tried with collections of several thousand
documents). If you reuse a pipeline context, you rapidly run out of
memory (even if you don't reuse the same pipeline processor).

My interpretation is that the stuff added by each processor is never
removed, and I think it would be helpful to have a method that asks a
processor to clean up the pipeline context.

3) You can't safely reuse a pipeline processor if you don't delete its
inputs/outputs when you change them.

A comment in the pipeline processor says:

 * <p>This processor is not only not thread safe, but it can't even be
 * reused: if there is one data output (with a 1 cardinality), one can't call
 * read multiple times and get the same result. Only the first call to read
 * on the data output will succeed.

Despite that, I have tried to see how I could reuse these pipeline
processors.

I have a few .xpl pipelines attached to MIME types, and I keep them in a
Hashtable whose key is the address of the .xpl file.

When I reuse one of these processors, I keep its config input
unchanged, reset its data input with a new ProcessorOutput and read its
data output again.

If I do so without deleting the data input and output, I rapidly run
out of memory again.

4) My questions...

Is it safe to reuse a pipeline processor the way I do, keeping its
config input unchanged between runs?

Is that enough to cache the config input and things such as the XSLT
transformations that it includes?

Is there an easy way to use OPS' cache system instead?

Wouldn't it be better than what I am doing right now (which is a kind of
poor man's cache)?

Thanks,

Eric

On 11/24/05, Eric van der Vlist <[hidden email]> wrote:
> [...]
> Is that safe to reuse a pipeline processor like I do, keeping its config
> input unchanged between runs?

Hi Eric,

If you reset the processor instance, you can reuse it and call read
multiple times on outputs. That comment in the code seems to be
inaccurate, as it does not seem to take the existence of reset() into
account. However, you might want to create a different instance of the
pipeline context every time you run your "pipeline"; I am using quotes
here as you are creating the equivalent of an XPL pipeline with Java
code.

> Is that enough to cache the config input and things such as the XSLT
> transformations that it includes?

Yes, if you keep reusing the same pipeline instance the cache should
work as expected.

> Is there an easy way to use OPS' cache system instead?
> Wouldn't it be better than what I am doing right now (which is a kind of
> poor man's cache)?

Yes, instead of the Hashtable, you could store the pipelines you are
creating in the PresentationServer cache. Now, if you are creating a
limited number of those pipelines, for all practical purposes using a
Hashtable is just fine, as you might not care about getting rid of
those pipelines to save memory.

Alex

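For the case where the number of pipelines is not small, a middle ground between an unbounded Hashtable and a full cache system is a size-bounded map. The sketch below is not the PresentationServer cache; it swaps in a plain java.util.LinkedHashMap in access order with removeEldestEntry, and the class names and the oxf:/index/... keys are made-up examples.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative bounded LRU map keyed by the .xpl address; NOT the Orbeon cache.
final class BoundedPipelineCache<V> extends LinkedHashMap<String, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    BoundedPipelineCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate from least to most recently accessed
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        // Called after each put: drop the least recently used entry when over the limit
        return size() > maxEntries;
    }
}

final class BoundedPipelineCacheDemo {
    // Strings stand in for pipeline processor instances in this demo.
    static String survivingKeys() {
        BoundedPipelineCache<String> cache = new BoundedPipelineCache<>(2);
        cache.put("oxf:/index/html.xpl", "html pipeline");
        cache.put("oxf:/index/pdf.xpl", "pdf pipeline");
        cache.get("oxf:/index/html.xpl");                  // html is now most recent
        cache.put("oxf:/index/text.xpl", "text pipeline"); // evicts pdf, the eldest
        return String.join(",", cache.keySet());
    }
}
```

Here survivingKeys() returns "oxf:/index/html.xpl,oxf:/index/text.xpl". Like Hashtable's replacement HashMap, this map is not synchronized, so concurrent use would need external locking.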
Hi Alex,
On Friday 25 November 2005 at 17:12 -0800, Alessandro Vernet wrote:
> If you reset the processor instance, you can reuse it and call read
> multiple times on outputs. That comment in the code seems to be
> inaccurate as it does not seem to take the existence of a reset() into
> account. However, you might want to create a different instance of the
> pipeline context every time you run your "pipeline"; I am using quotes
> here as you are creating the equivalent of an XPL pipeline with Java
> code.

[...] pipeline processor, thus creating real XPL pipelines with Java
:-). But this would work with other processors too.

> > Is that enough to cache the config input and things such as the XSLT
> > transformations that it includes?
>
> Yes, if you keep reusing the same pipeline instance the cache should
> work as expected.

Great!

> > Is there an easy way to use OPS' cache system instead?
> > Wouldn't it be better than what I am doing right now (which is a kind of
> > poor man's cache)?
>
> Yes, instead of the Hashtable, you could store the pipelines you are
> creating in PresentationServer cache. Now if you are creating a
> limited number of those pipelines for all practical purpose using a
> Hashtable is just fine as you might not care about getting rid of
> those pipelines to save memory.

I am still not very clear on memory usage during the life cycle of a
pipeline processor...

In my indexer, I create each pipeline processor and connect it to its
config input once:

    PipelineProcessor pipelineProcessor = new PipelineProcessor();
    URLGenerator config = new URLGenerator("file:"
            + resourceManager.getRealPath(processorKey));
    PipelineUtils.connect(config, "data", pipelineProcessor, "config");

And each time I use it, I create its data input and output, reset the
processor, connect its data input, read from its data output and delete
its data input and output:

    pipelineProcessor.createInput("data");
    pipelineProcessor.createOutput("data");
    pipelineProcessor.reset(context);
    pipelineProcessor.getInputByName("data").setOutput(
            new DocumentObjectOutput(doc));
    XmlSax2JavaObjectPipe pipe = new XmlSax2JavaObjectPipe();
    pipe.setParameter("namespace2package", "",
            this.getClass().getPackage().getName());
    try {
        pipelineProcessor.getOutputByName("data").read(context,
                (XmlSaxSource) pipe.getSource());
        processedDoc = (Document) pipe.getObject();
        pipelineProcessor.deleteInput(pipelineProcessor.getInputByName("data"));
        pipelineProcessor.deleteOutput(pipelineProcessor.getOutputByName("data"));
        break;
    } catch (Exception e) {
        e.printStackTrace();
        pipelineProcessor.deleteInput(pipelineProcessor.getInputByName("data"));
        pipelineProcessor.deleteOutput(pipelineProcessor.getOutputByName("data"));
    }

Is that optimal?

In particular, I am expecting that since I am reusing the config input,
it will be read and "compiled" only once. Is that the case?

The data input is a small document, but during the execution of the
pipeline, large documents can be read and manipulated by the pipeline.

Is the memory used by these documents freed when I "leave" the pipeline
processor between two uses?

If not, how can I release this memory? Would resetting the processor
after having used it help? Would it be better to embed these large
documents in the data input?

Thanks for your help,

Eric

On 11/26/05, Eric van der Vlist <[hidden email]> wrote:
> In particular, I am expecting that since I am reusing the config input,
> it will be read and "compiled" only once. Is that the case?

Yes, the config should be cached in this case.

> Is the memory used by these documents freed when I "leave" the pipeline
> processor between two uses?

Processors should not store anything in instance properties. They
should only store information in the pipeline context or the cache. So
if you don't keep a reference to the pipeline context, the processors
are not going to use more memory after one run than they did before.
Of course, after one run, there might be more memory used by the
cache.

> If not, how can I release this memory? Would reseting the processor
> after having used it help? Would it be better to embed these large
> documents in the data input?

The reset() method is used in general to initialize a "state" in the
context. It is used in conjunction with the setState() method, both
defined in ProcessorImpl. So resetting after use would just free some
memory in the pipeline context, but it is better and simpler to just
start over with a new pipeline context.

Alex

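The memory argument above can be illustrated with a toy model. Everything in the sketch below is hypothetical: RunContext stands in for a per-run context (it is not Orbeon's PipelineContext) and StatelessProcessor for a processor that keeps no instance fields. It only shows why state accumulates when one context is shared across runs and why it becomes collectable when each run gets a fresh context.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-run context; NOT Orbeon's PipelineContext.
final class RunContext {
    private final Map<Object, Object> attributes = new HashMap<>();

    void setAttribute(Object key, Object value) {
        attributes.put(key, value);
    }

    int size() {
        return attributes.size();
    }
}

// Hypothetical processor with no instance fields: per-run state goes into the context.
final class StatelessProcessor {
    void run(RunContext context, String document) {
        // A fresh key per run mimics state that is never removed from the context
        context.setAttribute(new Object(), document);
    }
}

final class ContextLifetimeSketch {

    // One context shared by every run: entries pile up for as long as it is referenced.
    static int sharedContextSize(int runs) {
        RunContext shared = new RunContext();
        StatelessProcessor processor = new StatelessProcessor();
        for (int i = 0; i < runs; i++) {
            processor.run(shared, "doc-" + i);
        }
        return shared.size();
    }

    // A new context per run: each one holds a single run's state and is then dropped.
    static int largestFreshContextSize(int runs) {
        StatelessProcessor processor = new StatelessProcessor();
        int largest = 0;
        for (int i = 0; i < runs; i++) {
            RunContext fresh = new RunContext();
            processor.run(fresh, "doc-" + i);
            largest = Math.max(largest, fresh.size());
        }
        return largest;
    }
}
```

With 1000 runs, sharedContextSize returns 1000 while largestFreshContextSize returns 1; the processor instance itself is reused in both cases.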
Hi Alex,
On Tuesday 29 November 2005 at 19:30 -0800, Alessandro Vernet wrote:
> On 11/26/05, Eric van der Vlist <[hidden email]> wrote:
> > In particular, I am expecting that since I am reusing the config input,
> > it will be read and "compiled" only once. Is that the case?
>
> Yes, the config should be cached in this case.
>
> > Is the memory used by these documents freed when I "leave" the pipeline
> > processor between two uses?
>
> Processors should not store anything in instance properties. They
> should only store information in the pipeline context or the cache.

[...] own custom processors but I'll fix that ASAP!

> So
> if you don't keep a reference to the pipeline context, the processors
> are not going to use more memory after one run than they did before.
> Of course, after one run, there might be more memory used by the
> cache.
>
> > If not, how can I release this memory? Would reseting the processor
> > after having used it help? Would it be better to embed these large
> > documents in the data input?
>
> The reset() method is used in general to initialize a "state" in the
> context. It is used in conjunction with the setState() method, both
> defined in ProcessorImpl. So resetting after use would just free some
> memory in the pipeline context, but it is better and simpler to just
> start over with a new pipeline context.

[...] running pipeline.

Thanks for the explanations.

Eric