Overhead to start a pipeline

Overhead to start a pipeline

Eric van der Vlist
I am working again on the Lucene processor that I had developed for
XMLfr.

I have created an indexer that runs in the background (started by the
scheduler) and I am extending this indexer to support new file formats.

Each file format requires different processing to extract the text
(when available) and to present the metadata in a consistent way.

I can register these processing steps in pure Java, but I could also
register pipelines, which would let me describe the processing logic in
XPL. In that case, a different pipeline would be executed for each
document to index, depending on its media type.
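
A minimal sketch of the registration idea, assuming a hypothetical
MediaTypeRegistry class that is not part of the actual indexer. A handler
is modelled as a plain function from raw bytes to extracted text; in
practice it could wrap either a Java extractor or an XPL pipeline.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry (not the real indexer): maps a media type to the
// handler that extracts indexable text from a document of that type.
final class MediaTypeRegistry {
    private final Map<String, Function<byte[], String>> handlers = new HashMap<>();

    // Associates a handler with a media type, replacing any previous one.
    void register(String mediaType, Function<byte[], String> handler) {
        handlers.put(mediaType, handler);
    }

    // Returns the extracted text, or an empty string when no handler is
    // registered for the media type (no text available).
    String extract(String mediaType, byte[] content) {
        Function<byte[], String> handler = handlers.get(mediaType);
        return handler == null ? "" : handler.apply(content);
    }

    public static void main(String[] args) {
        MediaTypeRegistry registry = new MediaTypeRegistry();
        registry.register("text/plain",
                bytes -> new String(bytes, StandardCharsets.UTF_8));
        byte[] document = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extract("text/plain", document));
    }
}
```

With this shape, the indexer only looks up the handler for each document;
whether the handler is Java or a pipeline stays hidden behind the function.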

Do you think that this would be wise (knowing that big collections may
end up being indexed), and do you have an idea of the overhead needed to
start new pipelines?

Thanks,

Eric
--
Don't you think all these XML schema languages should work together?
                                                         http://dsdl.org
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(ISO) RELAX NG   ISBN:0-596-00421-4 http://oreilly.com/catalog/relax
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------




--
You receive this message as a subscriber of the [hidden email] mailing list.
To unsubscribe: mailto:[hidden email]
For general help: mailto:[hidden email]?subject=help
ObjectWeb mailing lists service home page: http://www.objectweb.org/wws

Re: Overhead to start a pipeline

Alessandro  Vernet
Eric,

The key is to make sure that the pipeline object, and the objects it
creates, are reused, whether because the pipeline is retrieved from the
cache or by some other means.

If you are doing this from Java, you want to create one instance of
PipelineProcessor, create its input and outputs, and connect the input to
your own instance of ProcessorOutput, which implements the read() method
by reading a different document every time it is called.
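
The reuse pattern can be sketched roughly as follows. This uses stand-in
types only: DocumentSource, Pipeline and RotatingSource are hypothetical
and are NOT the Orbeon classes (the real PipelineProcessor and
ProcessorOutput have different signatures). The point is only that the
pipeline is built once and its input supplies a new document on each read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in types for illustration only; this is not the Orbeon API.
final class ReusedPipelineSketch {

    // Stand-in for an output whose read() feeds the pipeline's input.
    interface DocumentSource {
        String read();
    }

    // Stand-in for a pipeline that is expensive to build; counts constructions.
    static final class Pipeline {
        static int instancesCreated = 0;
        private final DocumentSource input;

        Pipeline(DocumentSource input) {
            instancesCreated++; // the cost we want to pay only once
            this.input = input;
        }

        // Each run pulls whatever document the input currently provides.
        String run() {
            return "indexed:" + input.read();
        }
    }

    // Returns a different document every time read() is called.
    static final class RotatingSource implements DocumentSource {
        private final Iterator<String> documents;

        RotatingSource(List<String> documents) {
            this.documents = documents.iterator();
        }

        @Override
        public String read() {
            return documents.next();
        }
    }

    // Builds the pipeline once, then runs it once per document.
    static List<String> indexAll(List<String> documents) {
        Pipeline pipeline = new Pipeline(new RotatingSource(documents));
        List<String> results = new ArrayList<>();
        for (int i = 0; i < documents.size(); i++) {
            results.add(pipeline.run());
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(indexAll(Arrays.asList("first", "second")));
    }
}
```

However many documents are indexed, only one Pipeline is constructed per
call to indexAll, so the setup cost does not grow with the collection.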

Alex

On 11/17/05, Eric van der Vlist <[hidden email]> wrote:

> Do you think that this would be wise (knowing that big collections may
> end up being indexed), and do you have an idea of the overhead needed to
> start new pipelines?

--
Blog (XML, Web apps, Open Source): http://www.orbeon.com/blog/


