Specify character encoding for URL-encoded UTF-8 query strings / "Illegal HTML character: decimal 128"

Gabe Martin-Dempesy
We're encountering a character encoding issue when reading a UTF-8 query string. A separate, outside application constructs links to our Orbeon application such as:

- http://localhost:8080/ops/encoding-test/?message=hello%20world
- http://localhost:8080/ops/encoding-test/?message=it%E2%80%99s%20a%20message

Our application's model reads the query string with the oxf:request processor and then displays the string in a view. In the first case above, the application displays "hello world" correctly. In the second case, "%E2%80%99" is the URL encoding of a UTF-8 apostrophe (U+2019, the right single quotation mark), and it causes the application to fail with:

> 2012-09-13 12:21:43,383 ERROR XSLTTransformer  - Error at line 174 of oxf:/config/theme-examples.xsl:
> Illegal HTML character: decimal 128
> 2012-09-13 12:21:43,384 ERROR ProcessorService  - Exception at line 174 of oxf:/config/theme-examples.xsl
> ; SystemID: oxf:/config/theme-examples.xsl; Line#: 174; Column#: -1
> org.orbeon.saxon.trans.XPathException: Illegal HTML character: decimal 128

- Full log output: https://gist.github.com/3716033
- Application test-case source: https://gist.github.com/3716159 (also attached as encoding-test.zip)

The error refers to the %80, the second byte of the multi-byte encoding of the apostrophe (0x80 is decimal 128). Note that in the log, not only does the theme raise an exception, but the XForms inspector does as well.

It appears that the URL is being decoded as Latin-1 (ISO-8859-1) instead of UTF-8, as the debug processor lists "it???s a message", with three characters for the apostrophe. In my research so far, it doesn't appear that HTTP has a way to specify the encoding of the query string itself.
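
To make the Latin-1 hypothesis concrete, here is a minimal Java sketch that decodes the second test URL's parameter value both ways. It is not Orbeon or Tomcat code; the class and method names are hypothetical, and it assumes Java 10 or later for the `URLDecoder.decode(String, Charset)` overload.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Orbeon or Tomcat.
class QueryDecodingDemo {

    // Percent-decode a query-string value, interpreting the bytes in the given charset.
    static String decodeAs(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        String raw = "it%E2%80%99s%20a%20message";

        // Latin-1 maps each of the three bytes E2 80 99 to its own character.
        String latin1 = decodeAs(raw, StandardCharsets.ISO_8859_1);
        // UTF-8 combines the same three bytes into the single character U+2019.
        String utf8 = decodeAs(raw, StandardCharsets.UTF_8);

        System.out.println(latin1.length());        // 16
        System.out.println(utf8.length());          // 14
        System.out.println((int) latin1.charAt(3)); // 128, the value named in the error
        System.out.println((int) utf8.charAt(2));   // 8217, i.e. U+2019
    }
}
```

The Latin-1 result contains the character with code 128, which matches the "Illegal HTML character: decimal 128" message in the log.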

1. Is there a way to specify the encoding of a query string when read with oxf:request? I didn't see a configuration property for the processor or anything relevant in properties-local.xml that would set a default.
2. If not, is there a way to force the associated encoding of the string? I suspect this could be done with XSLT, but was unable to find an example. I believe I want something equivalent to Ruby's String#force_encoding.
3. If not, is there any other suggested way to work around the error? My current worst-case hack-fix here is to just strip out any offending characters using mod_rewrite before it hits the servlet.
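
For reference on question 2, the closest plain-Java analogue of re-tagging the bytes is to re-encode the mis-decoded string as ISO-8859-1 and decode those bytes as UTF-8. This is only a sketch under the assumption that the string really was decoded as ISO-8859-1; the class and method names are hypothetical and it is not an Orbeon API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Orbeon.
class Latin1ToUtf8 {

    // Recover the original bytes (ISO-8859-1 maps chars 0-255 to bytes one-to-one),
    // then decode them as UTF-8. Characters above U+00FF cannot be represented in
    // ISO-8859-1 and would be replaced by '?', so this only suits mis-decoded input.
    static String reinterpretAsUtf8(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "it" + the three Latin-1 characters for bytes E2 80 99 + "s a message"
        String misdecoded = "it\u00E2\u0080\u0099s a message";
        String repaired = reinterpretAsUtf8(misdecoded);
        System.out.println(repaired.equals("it\u2019s a message")); // true
    }
}
```

A repair like this has to be applied to every parameter read, so fixing the decoding at its source is generally preferable when that is possible.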

Any guidance and assistance is appreciated!



--
You receive this message as a subscriber of the [hidden email] mailing list.
To unsubscribe: mailto:[hidden email]
For general help: mailto:[hidden email]?subject=help
OW2 mailing lists service home page: http://www.ow2.org/wws

encoding-test.zip (3K) Download Attachment

Re: Specify character encoding for URL-encoded UTF-8 query strings / "Illegal HTML character: decimal 128"

Alessandro Vernet
Administrator
Hi Gabe,

I also posted a response to this on Stack Overflow, but in essence, Orbeon Forms relies on what is returned by the servlet API: see getParameterMap() in ServletExternalContext:

https://github.com/orbeon/orbeon-forms/blob/master/src/java/org/orbeon/oxf/servlet/ServletExternalContext.java

So this seems to be something you need to set at the application server level. If using Tomcat, you can do so by adding `URIEncoding="UTF-8"` on the <Connector>, as mentioned in this Stack Overflow answer:

http://stackoverflow.com/questions/3278900/httpservletrequest-setcharacterencoding-seems-to-do-nothing

Alternatively, back in 2009, Jack mentioned that adding useBodyEncodingForURI="true" on the <Connector> worked for him. See:

http://orbeon-forms-ops-users.24843.n4.nabble.com/chinese-encoding-issue-td42472.html

Let us know whether either of those solutions works for you.

Alex
--
Follow Orbeon on Twitter: @orbeon
Follow me on Twitter: @avernet

Re: Specify character encoding for URL-encoded UTF-8 query strings / "Illegal HTML character: decimal 128"

Gabe Martin-Dempesy
Hi Alex -

I have verified that *both* of the methods you suggested fix this issue. On your lead, I also found http://wiki.apache.org/tomcat/FAQ/CharacterEncoding, which states:

> How do I change how GET parameters are interpreted?
>
> Tomcat will use ISO-8859-1 as the default character encoding of the entire URL, including the query string ("GET parameters").
>
> There are two ways to specify how GET parameters are interpreted:
>
> • Set the URIEncoding attribute on the <Connector> element in server.xml to something specific (e.g. URIEncoding="UTF-8").
> • Set the useBodyEncodingForURI attribute on the <Connector> element in server.xml to true. This will cause the Connector to use the request body's encoding for GET parameters.
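
As a rough illustration of the first option, a `<Connector>` entry in Tomcat's server.xml might look like the fragment below. The `port` and `protocol` values are assumptions chosen to match the localhost:8080 URLs in this thread; only the `URIEncoding` attribute comes from the FAQ text quoted above.

```xml
<!-- conf/server.xml (fragment). port and protocol are illustrative assumptions. -->
<Connector port="8080" protocol="HTTP/1.1"
           URIEncoding="UTF-8" />
```

For the second option, the FAQ's `useBodyEncodingForURI="true"` attribute would take the place of `URIEncoding`.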

Thanks for helping dig into this - it hadn't occurred to me to tackle this from the Tomcat-side.


Re: Specify character encoding for URL-encoded UTF-8 query strings / "Illegal HTML character: decimal 128"

Alessandro Vernet
Administrator
Hi Gabe,

Excellent, thank you for confirming this, and I will keep that link to the Tomcat wiki for future reference. I am sure you're not the last one to stumble on this one!

Alex