Design of processors handling non XML documents

Design of processors handling non XML documents

Eric van der Vlist
Hi,

In the documentation, the section on non-XML documents says:

"Processors can stream binary and text documents by issuing a number of
short character SAX events. It is therefore possible to generate
"infinitely" long binary and text documents with a constant amount of
memory, assuming both the sender and the receiver of the document are
able to perform streaming. This is the case for example of the URL
generator and the HTTP serializer."

This is true for the URL generator and the HTTP serializer, but not
true for processors that use libraries expecting an InputStream.

The common pattern for these processors is to use a
ByteArrayOutputStream together with a ByteArrayInputStream:  

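        // Decode the Base64 content of the "data" input into memory.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        Base64ContentHandler b64ch = new Base64ContentHandler(bos);
        readInputAsSAX(context, "data", b64ch);
        document = bos.toByteArray();
        .../...
        // Later, hand the whole byte array to the library as an InputStream.
        ByteArrayInputStream bis = new ByteArrayInputStream(document);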

That's the pattern I use in my archive converter to process
OpenOffice documents and in the MS document converters that I
recently developed on top of POI. It is also the pattern used by the
PDF conversion processor.

The big downside is that documents, which can be very large, are held
entirely in memory in a byte array.

That could be improved by using piped input and output streams instead of
byte array streams, but that requires forking a new thread and can be
tricky to get right. Also, some processors need to read the
document several times, which would require forking a thread for
each pass...
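
As a rough illustration of the piped-stream idea, here is a minimal
sketch. PipedBridge and its Producer interface are hypothetical names,
not part of OPS; the producer stands in for whatever writes the decoded
bytes (for example a content handler fed by readInputAsSAX). Error
propagation is deliberately left out, which is part of what makes this
approach tricky in practice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

/** Hypothetical helper: runs a producer on its own thread and exposes its output as an InputStream. */
final class PipedBridge {

    /** Assumed stand-in for the code that pushes the document bytes. */
    interface Producer {
        void writeTo(OutputStream out) throws IOException;
    }

    static InputStream open(Producer producer) throws IOException {
        // The pipe holds at most 8 KB, so memory use stays constant.
        PipedInputStream in = new PipedInputStream(8192);
        PipedOutputStream out = new PipedOutputStream(in);
        Thread writer = new Thread(() -> {
            // Closing the output side is what signals end-of-stream to the reader.
            try (OutputStream o = out) {
                producer.writeTo(o);
            } catch (IOException e) {
                // Sketch only: a real processor would have to report this
                // failure to the reading thread instead of swallowing it.
            }
        }, "piped-producer");
        writer.setDaemon(true);
        writer.start();
        return in;
    }
}
```

Each additional pass over the document needs a new call to open(), and
therefore a new thread.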

When documents are generated, this overhead makes sense, but when it's
just a matter of accessing a static document stored on disk, it is
harder to justify...

The thing I am considering for my own converters is the following:

      * Accept both an optional "config" and an optional "data" input
        on these processors.
      * When there is a "config" input with a "uri" element, read the
        document from this URI.
      * Otherwise, expect a non-XML document in the "data" input (the
        current behavior).

That would lead to processors that can be used either as generators or
as converters.
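
The selection logic described in the list above could be sketched as
follows. DocumentSource is a hypothetical name, and the way the "uri"
value and the "data" bytes are obtained from the processor inputs is
assumed rather than taken from the OPS API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.function.Supplier;

/** Hypothetical resolver for a processor with optional "config" and "data" inputs. */
final class DocumentSource {

    /**
     * @param configUri value of the "uri" element of the "config" input, or null if absent
     * @param dataInput lazily decodes the "data" input into bytes (the current behavior)
     */
    static InputStream open(String configUri, Supplier<byte[]> dataInput) throws IOException {
        if (configUri != null && !configUri.trim().isEmpty()) {
            // Generator mode: read straight from the URI, nothing is buffered here.
            return URI.create(configUri.trim()).toURL().openStream();
        }
        if (dataInput == null) {
            throw new IllegalArgumentException("Either a config URI or a data input is required");
        }
        // Converter mode: fall back to the buffered "data" input.
        return new ByteArrayInputStream(dataInput.get());
    }
}
```

Opening the URI again gives a second pass over the document without
keeping anything in memory between reads.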

I am now familiar enough with custom processors to implement these
beasts, but I'd like to know whether you would consider this good
practice and whether it is worth generalizing...

What do you think?

Thanks,

Eric

--
Weblog:
                 http://eric.van-der-vlist.com/blog?t=category&a=English
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(ISO) RELAX NG   ISBN:0-596-00421-4 http://oreilly.com/catalog/relax
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------




--
You receive this message as a subscriber of the [hidden email] mailing list.
To unsubscribe: mailto:[hidden email]
For general help: mailto:[hidden email]?subject=help
ObjectWeb mailing lists service home page: http://www.objectweb.org/wws
Re: Design of processors handling non XML documents

Alessandro  Vernet
Administrator
Hi Eric,

The 3 options you mention are:

1) Use a ByteArrayInputStream and ByteArrayOutputStream.
2) Create an additional thread so the InputStream can read from a
ContentHandler.
3) Allow the processor to read directly from a URI, in addition to
being able to read from a ContentHandler.

There is also a fourth option:

4) Similar to option 1, but writing the data in a temp file instead of
using a buffer in memory.
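
A minimal sketch of what option 4 might look like, assuming the
processor can write to any OutputStream where it currently writes to a
ByteArrayOutputStream. TempFileSpool is a hypothetical name, not an
existing OPS class.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical spool: the document is written once to a temp file and can be re-read many times. */
final class TempFileSpool implements AutoCloseable {
    private final Path file;

    TempFileSpool() throws IOException {
        file = Files.createTempFile("processor-", ".tmp");
    }

    /** Stream to use in place of the ByteArrayOutputStream; buffered because SAX events are short. */
    OutputStream newOutputStream() throws IOException {
        return new BufferedOutputStream(Files.newOutputStream(file));
    }

    /** A fresh stream over the spooled document; can be called once per pass. */
    InputStream newInputStream() throws IOException {
        return Files.newInputStream(file);
    }

    /** Removes the temp file once the processor is done with the document. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(file);
    }
}
```

Unlike the piped approach, reading the document several times costs
only another newInputStream() call, at the price of disk I/O.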

I have a bias towards avoiding option 3 if possible, as it
seems to change the API of the processor for performance
reasons. If those are the options to choose from, I'd like to know
what the performance impact of options 2 and 4 is, so we can weigh that
impact against the API change.

So to summarize the matter, I guess I don't really have a definitive
answer to your question :).

Alex

On 11/29/05, Eric van der Vlist <[hidden email]> wrote:

> [...]

--
Blog (XML, Web apps, Open Source): http://www.orbeon.com/blog/



--
Follow Orbeon on Twitter: @orbeon
Follow me on Twitter: @avernet
Re: Design of processors handling non XML documents

Erik Bruchez
Administrator
I should mention, related to Alex's option 4, that we use "FileItem"
objects in a few places in OPS. These are smart temporary
containers that stay in memory up to a configurable size (for
example 10 KB) and are only written to disk beyond that.
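
To make the idea concrete, here is a small sketch of a container with
that behavior. It is not the FileItem class used in OPS; SpillingBuffer
and its methods are hypothetical names written against the standard
library only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical container: buffers in memory up to a threshold, then spills to a temp file. */
final class SpillingBuffer extends OutputStream {
    private final int threshold;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path file;          // non-null once the threshold has been exceeded
    private OutputStream disk;  // stream to 'file' once spilled

    SpillingBuffer(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        if (disk == null && memory.size() + len > threshold) {
            // Crossing the threshold: move what is buffered so far to a temp file.
            file = Files.createTempFile("spill-", ".tmp");
            disk = Files.newOutputStream(file);
            memory.writeTo(disk);
            memory = null;
        }
        if (disk != null) {
            disk.write(b, off, len);
        } else {
            memory.write(b, off, len);
        }
    }

    @Override
    public void close() throws IOException {
        if (disk != null) {
            disk.close();
        }
    }

    boolean isInMemory() {
        return file == null;
    }

    /** Call after close(); may be called repeatedly to re-read the content. */
    InputStream newInputStream() throws IOException {
        return file == null
                ? new ByteArrayInputStream(memory.toByteArray())
                : Files.newInputStream(file);
    }

    /** Removes the temp file, if one was created. */
    void delete() throws IOException {
        if (file != null) {
            Files.deleteIfExists(file);
        }
    }
}
```

Small documents never touch the disk, while large ones cost no more
memory than the threshold.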

-Erik

Alessandro Vernet wrote:

> [...]

