Hi,
In the documentation, you say about non-XML documents:

"Processors can stream binary and text documents by issuing a number of short character SAX events. It is therefore possible to generate "infinitely" long binary and text documents with a constant amount of memory, assuming both the sender and the receiver of the document are able to perform streaming. This is the case for example of the URL generator and the HTTP serializer."

This is true for the URL generator and the HTTP serializer, but not for processors that use libraries expecting an InputStream.

The common pattern for these processors is to use a ByteArrayOutputStream together with a ByteArrayInputStream:

    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    Base64ContentHandler b64ch = new Base64ContentHandler(bos);
    readInputAsSAX(context, "data", b64ch);
    document = bos.toByteArray();
    .../...
    ByteArrayInputStream bis = new ByteArrayInputStream(document);

That's the pattern I am using in my archive converter to process OpenOffice documents and in the MS document converters that I have recently developed based on POI, and it's also the pattern used by the PDF conversion processor.

The big downside is that documents (which can be very large) are held in memory in a byte array.

That could be improved by using piped input and output streams instead of byte array streams, but that requires forking a new thread and can be tricky to put into practice. Also, some processors need to read the document several times, and that would require forking these threads several times...

When documents are generated, this overhead makes sense, but when it's just a matter of accessing a static document stored on disk, that's less obvious...

The thing I am considering for my own converters is the following:

* Accept both an optional "config" and an optional "data" input on these processors.
* When there is a "config" input with a "uri" element, read the document from this URI.
* Otherwise, expect a non-XML document in the "data" input (the current behavior).

That would lead to processors that can be used either as generators or as converters.

I am now familiar enough with custom processors to implement these beasts, but I'd like to know whether you would consider this good practice and whether it is worth generalizing...

What do you think?

Thanks,

Eric

--
Weblog: http://eric.van-der-vlist.com/blog?t=category&a=English
------------------------------------------------------------------------
Eric van der Vlist http://xmlfr.org http://dyomedea.com
(ISO) RELAX NG ISBN:0-596-00421-4 http://oreilly.com/catalog/relax
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

--
You receive this message as a subscriber of the [hidden email] mailing list.
To unsubscribe: mailto:[hidden email]
For general help: mailto:[hidden email]?subject=help
ObjectWeb mailing lists service home page: http://www.objectweb.org/wws
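
For readers following the thread, the piped-stream alternative mentioned in the message above could look roughly like the sketch below. It is only an illustration under stated assumptions: `PipedBridge`, `Source` and `Sink` are hypothetical names, not OPS classes, and `Source` merely stands in for whatever pushes the bytes (in the message, a content handler fed by `readInputAsSAX`). Only `java.io` classes are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper, not part of OPS: bridges a push-style producer
// to a library that pulls from an InputStream, using a bounded pipe.
class PipedBridge {

    /** Stand-in for code that pushes the document, e.g. a SAX content handler. */
    public interface Source {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Stand-in for a library call that expects an InputStream. */
    public interface Sink<T> {
        T readFrom(InputStream in) throws IOException;
    }

    public static <T> T pipe(Source source, Sink<T> sink) {
        AtomicReference<IOException> failure = new AtomicReference<>();
        try {
            PipedInputStream in = new PipedInputStream(8192); // at most 8 KB buffered
            PipedOutputStream out = new PipedOutputStream(in);
            // The producer needs its own thread: with both ends in a single
            // thread, the pipe would block forever once the buffer is full.
            Thread writer = new Thread(() -> {
                try (OutputStream o = out) {
                    source.writeTo(o);
                } catch (IOException e) {
                    failure.set(e); // reported to the caller after join()
                }
            });
            writer.start();
            T result;
            try (InputStream i = in) {
                result = sink.readFrom(i);
            }
            // Closing the read end above unblocks a writer that is still waiting.
            writer.join();
            if (failure.get() != null) {
                throw failure.get();
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] document = new byte[1_000_000];
        Long total = PipedBridge.<Long>pipe(
            out -> out.write(document),
            in -> {
                long n = 0;
                byte[] chunk = new byte[4096];
                for (int r; (r = in.read(chunk)) != -1; ) {
                    n += r;
                }
                return n;
            });
        System.out.println(total + " bytes streamed through an 8 KB pipe");
    }
}
```

Memory use stays bounded by the pipe buffer, but each read of the document costs one extra thread, and a processor that needs several passes has to run the producer again for each pass, which is the drawback raised in the message.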
Hi Eric,
The 3 options you mention are:

1) Use a ByteArrayInputStream and ByteArrayOutputStream.
2) Create an additional thread so the InputStream can read from a ContentHandler.
3) Allow the processor to read directly from a URI, in addition to being able to read from a ContentHandler.

There is also a fourth option:

4) Similar to option 1, but writing the data in a temp file instead of using a buffer in memory.

I have a bias towards trying to avoid option 3 if possible, as it seems the API of the processor is changed here for performance reasons. If those are the options to choose from, I'd like to know what the performance impact of 2 and 4 is, so we can weigh that performance impact versus API change.

So to summarize the matter, I guess I don't really have a definitive answer to your question :).

Alex

On 11/29/05, Eric van der Vlist <[hidden email]> wrote:
> I am now familiar enough with custom processors to implement these
> beasts but I'd like to know if you would consider this as a good
> practice and if this is worth generalizing...
>
> What do you think?

--
Blog (XML, Web apps, Open Source): http://www.orbeon.com/blog/
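
As an illustration of option 4 from the message above, here is a minimal sketch that spools the document to a temporary file once and then opens as many independent input streams over it as a library needs. `SpooledDocument` and `Source` are hypothetical names introduced for this example, not OPS classes; the file handling uses only `java.nio.file.Files`.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of OPS: spools a document to a temp file
// once, then hands out a fresh InputStream for every pass over the data.
class SpooledDocument implements AutoCloseable {

    /** Stand-in for code that pushes the document, e.g. a SAX content handler. */
    public interface Source {
        void writeTo(OutputStream out) throws IOException;
    }

    private final Path file;

    private SpooledDocument(Path file) {
        this.file = file;
    }

    public static SpooledDocument spool(Source source) {
        try {
            Path file = Files.createTempFile("document-", ".bin");
            try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(file))) {
                source.writeTo(out);
            } catch (IOException | RuntimeException e) {
                Files.deleteIfExists(file); // do not leave a partial file behind
                throw e;
            }
            return new SpooledDocument(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Opens a new stream over the spooled bytes; may be called repeatedly. */
    public InputStream open() {
        try {
            return new BufferedInputStream(Files.newInputStream(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes the temp file. */
    @Override
    public void close() {
        try {
            Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        try (SpooledDocument doc = spool(out -> out.write("hello".getBytes(StandardCharsets.UTF_8)))) {
            for (int pass = 1; pass <= 2; pass++) {
                try (InputStream in = doc.open()) {
                    System.out.println("pass " + pass + ": first byte " + in.read());
                }
            }
        }
    }
}
```

Compared with the extra-thread approach of option 2, this trades disk I/O for simplicity: no second thread is needed, and re-reading the document does not require producing it again. How the two compare in practice is exactly the performance question left open in the message.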
I should mention, related to Alex's option 4, that we use "FileItem" objects in a few places in OPS. These are smart temporary containers of sorts: the data stays in memory up to a certain configurable size (for example 10 KB) and is written to disk only beyond that.

-Erik

Alessandro Vernet wrote:
> There is also a fourth option:
>
> 4) Similar to option 1, but writing the data in a temp file instead of
> using a buffer in memory.
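
The container behavior described in the message above (stay in memory up to a threshold, then move to disk) can be sketched with the JDK alone. To be clear about the assumptions: `ThresholdBuffer` is a hypothetical class written for this illustration. It is not the "FileItem" API and does not claim to reproduce how OPS uses it; only the 10 KB figure comes from the message.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class illustrating the idea only; this is NOT the FileItem API.
class ThresholdBuffer extends OutputStream {

    private final int threshold;      // maximum number of bytes kept in memory
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path file;                // non-null once the data has spilled to disk
    private OutputStream fileOut;

    public ThresholdBuffer(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        if (file == null && memory.size() + len > threshold) {
            spill();
        }
        OutputStream target = (file == null) ? memory : fileOut;
        target.write(b, off, len);
    }

    /** Moves what was buffered so far to a temp file and switches to disk. */
    private void spill() throws IOException {
        Path tmp = Files.createTempFile("buffer-", ".bin");
        OutputStream out = new BufferedOutputStream(Files.newOutputStream(tmp));
        memory.writeTo(out);
        memory = null; // release the in-memory copy
        file = tmp;
        fileOut = out;
    }

    @Override
    public void close() throws IOException {
        if (fileOut != null) {
            fileOut.close();
        }
    }

    public boolean isInMemory() {
        return file == null;
    }

    /** Call after close(); each call returns a fresh stream over the content. */
    public InputStream openInput() throws IOException {
        return (file == null)
            ? new ByteArrayInputStream(memory.toByteArray())
            : Files.newInputStream(file);
    }

    /** Removes the temp file, if one was created. */
    public void delete() throws IOException {
        if (file != null) {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        ThresholdBuffer buffer = new ThresholdBuffer(10 * 1024); // 10 KB, as in the message
        buffer.write(new byte[4 * 1024]);
        System.out.println("in memory after 4 KB: " + buffer.isInMemory());
        buffer.write(new byte[16 * 1024]);
        System.out.println("in memory after 20 KB: " + buffer.isInMemory());
        buffer.close();
        buffer.delete();
    }
}
```

Small documents never touch the disk, so the common case keeps the speed of the byte-array pattern, while large documents cost a bounded amount of memory.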