Adding additional parser features to the URLGenerator

Eric van der Vlist
Hi,

Title says it (almost) all...

I need to read external XML documents to index them.

Some of these documents have external DTDs and these DTDs are often
either nonexistent or hosted on other sites.

To improve performance and decrease the number of parsing errors, I'd
like to use two different and complementary approaches.

The first one has already been mentioned on this list and that would be
to implement XML catalogs. This would deal with "well known DTDs" (HTML,
XHTML, DocBook, OpenOffice, ...).
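As a rough illustration of the catalog idea, here is a minimal sketch using a plain SAX EntityResolver rather than a full XML catalog implementation. The class name LocalDtdResolver and the in-memory map of DTDs are assumptions made for the example; they are not part of the URLGenerator.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: a tiny stand-in for an XML catalog.
final class LocalDtdResolver implements EntityResolver {
    // Public identifier -> local DTD text. A real catalog would point to local files.
    private final Map<String, String> dtdsByPublicId;

    LocalDtdResolver(Map<String, String> dtdsByPublicId) {
        this.dtdsByPublicId = dtdsByPublicId;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtd = (publicId == null) ? null : dtdsByPublicId.get(publicId);
        if (dtd == null) {
            return null; // unknown DTD: let the parser fall back to its default behaviour
        }
        // Serve the local copy instead of fetching the system identifier.
        InputSource source = new InputSource(new StringReader(dtd));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    // Parses the document with the given resolver installed; throws if parsing fails.
    static void parse(String xml, EntityResolver resolver) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver);
        reader.setContentHandler(new DefaultHandler());
        reader.parse(new InputSource(new StringReader(xml)));
    }
}
```

With such a resolver, a document whose DOCTYPE names a known public identifier is parsed against the local copy, and the remote system identifier is never fetched.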

The second one, for DTDs not known to the catalog, would be
to use SAX features that are currently not exposed through the
URLGenerator, such as the following one:

http://apache.org/xml/features/nonvalidating/load-external-dtd
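For reference, this is roughly how that feature is switched off with plain JAXP, outside of the URLGenerator. It is a minimal sketch: the class name SkipExternalDtd is made up for the example, and it assumes a Xerces-based parser (such as the one bundled with the JDK) that recognizes the feature.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper showing the feature in isolation.
final class SkipExternalDtd {
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false); // the feature only matters for non-validating parsing
        // Throws if the underlying parser does not recognize the feature.
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        return factory.newSAXParser().getXMLReader();
    }

    // Parses the document without fetching its external DTD; throws if parsing fails.
    static void parse(String xml) throws Exception {
        XMLReader reader = newReader();
        reader.setContentHandler(new DefaultHandler());
        reader.parse(new InputSource(new StringReader(xml)));
    }
}
```

The trade-off is that entities declared only in the skipped external DTD are not expanded, which is usually acceptable for indexing.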

From a quick glance at the code, it doesn't seem so easy, because the
two features currently exposed (validation and XInclude) are used
in the cache keys, and adding more would require some refactoring to avoid an
exponential growth in the number of combinations (and of instructions to
test these combinations)...

Except maybe if we said that using these other features would disable
caching.
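One possible way around the combinatorial growth, sketched here under assumptions, would be to fold the feature settings into the key as data instead of testing each combination in code. ParserCacheKey is a hypothetical class for illustration, not the key actually used by the URLGenerator.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical cache key: the URL plus whatever parser features were requested.
final class ParserCacheKey {
    private final String url;
    // Feature name -> value, copied and sorted so that toString() is stable.
    private final Map<String, Boolean> features;

    ParserCacheKey(String url, Map<String, Boolean> features) {
        this.url = Objects.requireNonNull(url);
        this.features = new TreeMap<>(features);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ParserCacheKey)) {
            return false;
        }
        ParserCacheKey other = (ParserCacheKey) o;
        // Two keys match only if the URL and every feature setting match.
        return url.equals(other.url) && features.equals(other.features);
    }

    @Override
    public int hashCode() {
        return Objects.hash(url, features);
    }

    @Override
    public String toString() {
        return url + features;
    }
}
```

Adding a new feature then means adding one map entry; equals() and hashCode() do not change.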

What do you think?

Are there better ways to expose these features?

Thanks,

Eric

--
The first 100% XML directory of beekeepers!
                                                http://apiculteurs.info/
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(ISO) RELAX NG   ISBN:0-596-00421-4 http://oreilly.com/catalog/relax
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------




--
You receive this message as a subscriber of the [hidden email] mailing list.
To unsubscribe: mailto:[hidden email]
For general help: mailto:[hidden email]?subject=help
ObjectWeb mailing lists service home page: http://www.objectweb.org/wws
Re: Adding additional parser features to the URLGenerator

Erik Bruchez
Administrator
It looks like this email was left unanswered.

I don't think there is really a problem adding the load-external-dtd
element to the configuration from the URL generator's point of view,
while still allowing for caching.

In a first phase, we would not detect changes in the DTDs, though:
adding detection would require adding hooks to the parser or to the XML
catalog to add the DTD URI to the list of URIs that impact caching.
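To give an idea of what such a hook could look like, here is a minimal sketch based on a wrapping SAX EntityResolver that records every system identifier the parser asks for. The class RecordingResolver is hypothetical, and feeding the recorded URIs into the cache validity is left out because that part of the code is not shown in this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical hook: remembers which external entities (DTDs included) were requested.
final class RecordingResolver implements EntityResolver {
    private final EntityResolver delegate; // may be null: then the parser resolves by itself
    private final List<String> requested = new ArrayList<>();

    RecordingResolver(EntityResolver delegate) {
        this.delegate = delegate;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        if (systemId != null) {
            requested.add(systemId); // candidate for the list of URIs that impact caching
        }
        return (delegate == null) ? null : delegate.resolveEntity(publicId, systemId);
    }

    // System identifiers seen so far, in request order.
    List<String> requestedSystemIds() {
        return List.copyOf(requested);
    }

    // Parses the document with this resolver installed; throws if parsing fails.
    void parse(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(this);
        reader.setContentHandler(new DefaultHandler());
        reader.parse(new InputSource(new StringReader(xml)));
    }
}
```

After a parse, requestedSystemIds() gives the DTD URIs that the document depended on, which is the information the cache would need.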

We would have to solve the question of the number of parser factories
available, but that is certainly doable. The relevant code is:

   XMLUtils.newSAXParser(boolean validating, boolean handleXInclude)

That method would require a "boolean loadExternalDTD" flag.
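As a sketch of how the number of factories could stay manageable, the three flags can be packed into a small index and the factories created lazily. This assumes one factory per flag combination, which may not match how XMLUtils is actually organized; the class ParserFactories and its index() method are made up for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical replacement for a fixed set of factories: one lazily created
// factory per combination of flags.
final class ParserFactories {
    private static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    private static final Map<Integer, SAXParserFactory> FACTORIES = new ConcurrentHashMap<>();

    // Packs the three flags into one integer between 0 and 7.
    static int index(boolean validating, boolean handleXInclude, boolean loadExternalDTD) {
        return (validating ? 1 : 0) | (handleXInclude ? 2 : 0) | (loadExternalDTD ? 4 : 0);
    }

    // Synchronized because SAXParserFactory is not guaranteed to be thread-safe.
    static synchronized SAXParser newSAXParser(boolean validating, boolean handleXInclude,
                                               boolean loadExternalDTD) throws Exception {
        SAXParserFactory factory = FACTORIES.computeIfAbsent(
            index(validating, handleXInclude, loadExternalDTD),
            key -> create(validating, handleXInclude, loadExternalDTD));
        return factory.newSAXParser();
    }

    private static SAXParserFactory create(boolean validating, boolean handleXInclude,
                                           boolean loadExternalDTD) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(validating);
            factory.setXIncludeAware(handleXInclude);
            factory.setFeature(LOAD_EXTERNAL_DTD, loadExternalDTD);
            return factory;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot configure SAX parser factory", e);
        }
    }
}
```

A fourth flag would only add one bit to index() and one line to create(), without new per-combination code.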

-Erik
