Feature request: controlling entity references during HTML/XML serialisation

Eric van der Vlist
Hi,

One of the simple techniques to obfuscate email addresses is to replace
"@" characters with "%40" in mailto: URIs and with &#64; in plain
text.

It appears that most spammers aren't able to read something as simple
as:

<a href="mailto:vdv%40dyomedea.com">vdv&#64;dyomedea.com</a>

even though that's strictly equivalent to a plainly exposed address for
any minimally conformant piece of software (meaning that this
technique has none of the drawbacks of other obfuscation techniques).
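
To make the equivalence concrete, here is a minimal Python sketch (standard
library only; the function name obfuscate_mailto is hypothetical and the
"mailto:" prefix is an assumption based on the description above). It builds
the obfuscated link and checks that a conforming client would recover the
plain address from both encodings:

```python
from html import unescape
from urllib.parse import unquote


def obfuscate_mailto(address: str) -> str:
    """Return an HTML link in which every "@" is encoded (hypothetical helper)."""
    uri_part = address.replace("@", "%40")     # percent-encoding for the URI
    text_part = address.replace("@", "&#64;")  # numeric character reference for the text
    return f'<a href="mailto:{uri_part}">{text_part}</a>'


link = obfuscate_mailto("vdv@dyomedea.com")
print(link)

# A conforming client undoes both encodings and sees the plain address again.
assert unquote("vdv%40dyomedea.com") == "vdv@dyomedea.com"
assert unescape("vdv&#64;dyomedea.com") == "vdv@dyomedea.com"
```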

Escaping @s in URIs is easy enough in XSLT... Getting them replaced by
numeric entity references in plain text is more challenging in
OPS :-) ...

That could be achieved in "plain XSLT" using either disable-output-escaping
attributes or XSLT 2.0 character maps, but neither of them would
survive in an XML pipe: they would be usable only in a transformation
that is the last processor in the pipe.
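
The reason can be shown with a small Python sketch (standard library only;
it stands in for any XML pipeline stage and is not OPS code, and the
function name roundtrip is hypothetical). Once a document is parsed, the
character reference is gone, and the next serializer writes a plain "@":

```python
import xml.etree.ElementTree as ET


def roundtrip(markup: str) -> str:
    """Parse markup and serialize it again, as a downstream pipeline stage would."""
    element = ET.fromstring(markup)
    return ET.tostring(element, encoding="unicode")


# The parser resolves &#64; to "@", so the reference does not survive the stage.
print(roundtrip("<p>vdv&#64;dyomedea.com</p>"))
```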

Note that there might be other cases where applications would need
finer control over entity references.

How could we plug this behaviour into the existing converters?

I had a quick look at the code and noticed that the HTML and XML
converters are still using the "old legacy" serializers and that these
serializers are using Saxon identity transformations.

One option would be to add either an input, or an element in the existing
config input, to provide an alternate XSLT transformation to use as the
identity transformation. This transformation could then use either
disable-output-escaping (which is a hack) or character maps (which are
more elegant).
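
As a rough illustration of what a character map does at serialisation time,
here is a Python sketch (an analogy only, not Saxon or OPS code; the names
CHARACTER_MAP and serialize_text are hypothetical). It relies on the
standard xml.sax.saxutils.escape function, which escapes markup characters
first and then applies a caller-supplied replacement table:

```python
from xml.sax.saxutils import escape

# Hypothetical character map, in the spirit of an XSLT 2.0 xsl:character-map.
CHARACTER_MAP = {"@": "&#64;"}


def serialize_text(text, character_map=None):
    """Escape &, < and >, then apply the optional character map to the result."""
    return escape(text, character_map or {})


print(serialize_text("Write to vdv@dyomedea.com & say hi", CHARACTER_MAP))
# prints: Write to vdv&#64;dyomedea.com &amp; say hi
```

Because the mapping is applied after the regular escaping, the "&" of the
inserted reference is not escaped a second time, which is the behaviour a
converter would need at the very end of the pipe.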

Another option would be, of course, to write a new output method.

What do you think?
 
Thanks,

Eric
--
Read me on XML.com.
                                            http://www.xml.com/pub/au/74
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(ISO) RELAX NG   ISBN:0-596-00421-4 http://oreilly.com/catalog/relax
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------




--
You receive this message as a subscriber of the [hidden email] mailing list.
To unsubscribe: mailto:[hidden email]
For general help: mailto:[hidden email]?subject=help
ObjectWeb mailing lists service home page: http://www.objectweb.org/wws
Re: Feature request: controlling entity references during HTML/XML serialisation

Erik Bruchez
Administrator
Eric,

One question is: does XSLT serialization allow you to control this? If
so, then we could allow all XSLT serialization options to be passed to
the XML, XHTML or HTML converters. This would solve the problem neatly.

-Erik