I am working again on the Lucene processor that I had developed for XMLfr.

I have created an indexer that runs in the background (started by the scheduler), and I am extending it to support new file formats.

Each file format requires different processing to extract the text (when available) and to present the metadata in a consistent way.

I can register these processing steps in pure Java, but I could also register pipelines, which would let me describe the processing logic in XPL. In that case, a different pipeline would be executed for each document to index, depending on its media type.

Do you think this would be wise (knowing that big collections may end up being indexed), and do you have an idea of the overhead needed to start new pipelines?

Thanks,

Eric
--
Don't you think all these XML schema languages should work together?
http://dsdl.org
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(ISO) RELAX NG   ISBN:0-596-00421-4 http://oreilly.com/catalog/relax
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------

--
You receive this message as a subscriber of the [hidden email] mailing list.
To unsubscribe: mailto:[hidden email]
For general help: mailto:[hidden email]?subject=help
ObjectWeb mailing lists service home page: http://www.objectweb.org/wws
Eric,
The key is to make sure that the pipeline object, and the objects it creates, are reused, whether because the pipeline is retrieved from the cache or by some other means.

If you are doing this from Java, you want to create a single instance of PipelineProcessor, create its inputs and outputs, and connect the input to your own instance of ProcessorOutput, which implements the read() method by reading a different document every time it is called.

Alex

On 11/17/05, Eric van der Vlist <[hidden email]> wrote:
> [...]
> Do you think that this would be wise (knowing that big collections can
> eventually be indexed) and do you have an idea of the overhead needed to
> start new pipelines?

--
Blog (XML, Web apps, Open Source): http://www.orbeon.com/blog/
--
Follow Orbeon on Twitter: @orbeon
Follow me on Twitter: @avernet
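
The reuse pattern Alex describes (build the pipeline once, then have its input yield a different document on every read) might look roughly like the sketch below. This is only an illustration under stated assumptions: `DocumentSource`, `Pipeline`, and `Indexer` are hypothetical stand-ins written for this example, not Orbeon's `PipelineProcessor` or `ProcessorOutput`, whose real signatures are not shown in this thread. A plain Java function plays the role of the XPL processing logic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ReusedPipelineSketch {

    /** Hypothetical input source: each read() hands back the next queued document. */
    static final class DocumentSource {
        private final Deque<String> pending = new ArrayDeque<>();

        void offer(String document) { pending.addLast(document); }

        // Throws NoSuchElementException if no document has been queued.
        String read() { return pending.removeFirst(); }
    }

    /** Hypothetical stand-in for a pipeline; NOT Orbeon's PipelineProcessor. */
    static final class Pipeline {
        private final Function<String, String> extractor;
        private final DocumentSource input = new DocumentSource();

        Pipeline(Function<String, String> extractor) { this.extractor = extractor; }

        DocumentSource input() { return input; }

        /** One run pulls exactly one document from the connected input. */
        String run() { return extractor.apply(input.read()); }
    }

    /** Keeps one pipeline per media type and reuses it for every document. */
    static final class Indexer {
        private final Map<String, Function<String, String>> registered = new HashMap<>();
        private final Map<String, Pipeline> cache = new HashMap<>();
        private int pipelinesCreated = 0;

        void register(String mediaType, Function<String, String> extractor) {
            registered.put(mediaType, extractor);
        }

        String index(String mediaType, String content) {
            Pipeline pipeline = cache.get(mediaType);
            if (pipeline == null) {
                Function<String, String> extractor = registered.get(mediaType);
                if (extractor == null) {
                    throw new IllegalArgumentException("No pipeline registered for " + mediaType);
                }
                // The set-up cost is paid here, once per media type.
                pipeline = new Pipeline(extractor);
                pipelinesCreated++;
                cache.put(mediaType, pipeline);
            }
            // Same pipeline object, different document on each read().
            pipeline.input().offer(content);
            return pipeline.run();
        }

        int pipelinesCreated() { return pipelinesCreated; }
    }

    public static void main(String[] args) {
        Indexer indexer = new Indexer();
        indexer.register("text/plain", String::trim);
        indexer.register("text/html", html -> html.replaceAll("<[^>]*>", "").trim());

        System.out.println(indexer.index("text/plain", "  first document  "));
        System.out.println(indexer.index("text/html", "<p>second document</p>"));
        System.out.println(indexer.index("text/plain", "  third document  "));
        System.out.println("pipelines created: " + indexer.pipelinesCreated());
    }
}
```

In this sketch the number of pipeline objects grows with the number of media types, not with the number of documents, which is the property that matters for large collections. How much the real start-up cost is in Orbeon would still have to be measured.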