UTF-8 with BOM not supported?

ahenket
Orbeon Forms 3.9.0.201105152046 CE

I've used the exact code from the HowTo to upload XML:
saxon:parse(saxon:base64Binary-to-string(xs:base64Binary(instance('upload')), 'UTF-8'))

Whenever I upload UTF-8 encoded XML with a BOM, I get "no content allowed in prolog".

Bug or feature? How to avoid? Fixing all input at workplaces around the world is not feasible.

Re: UTF-8 with BOM not supported?

Alessandro  Vernet
Administrator
Hi ahenket,

saxon:parse() expects XML, and if what people upload isn't XML, it just won't work. That said, I wonder if something else could be happening. Could you add an <xf:output value="saxon:base64Binary-to-string(xs:base64Binary(instance('upload')), 'UTF-8')"/> somewhere in your form to see what that value looks like? Is it proper XML? If it looks like it is, but saxon:parse() still fails, could you share a specific example of that XML with us, so we can reproduce the issue?

Alex
--
Follow Orbeon on Twitter: @orbeon
Follow me on Twitter: @avernet

Re: UTF-8 with BOM not supported?

ahenket
Hi. It's absolutely XML. OxygenXML is my tool of choice for editing XML/XQuery etc., and it is set to be very picky. I validated the files in Oxygen before uploading and it gave no errors. I then removed the 3 UTF-8 BOM bytes with a hex editor (I found out later that you can instruct Oxygen to remove the UTF-8 BOM upon save) and uploaded without any problem. There's no question that the BOM was the only thing between me and a successful upload.

Re: UTF-8 with BOM not supported?

Alessandro  Vernet
Administrator
Hi ahenket,

Indeed, looks like a bug in saxon:base64Binary-to-string() to me. Since that function is UTF-8 aware, it should know how to interpret the BOM. Even if this is an issue with Saxon (at least the version we're using), I added an issue against Orbeon Forms:

https://github.com/orbeon/orbeon-forms/issues/1093
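
To illustrate what is probably going on, here is a minimal Python sketch (not Saxon code; the sample document is made up). A plain UTF-8 decode keeps the BOM as the character U+FEFF in front of the XML declaration, which is the kind of input that makes an XML parser complain about content in the prolog, while a BOM-aware decode drops it:

```python
import codecs

# Hypothetical upload: the 3-byte UTF-8 BOM followed by a small XML document.
raw = codecs.BOM_UTF8 + b'<?xml version="1.0"?><doc/>'

# Plain UTF-8 decoding keeps the BOM as the character U+FEFF.
plain = raw.decode("utf-8")

# Python's "utf-8-sig" codec removes a leading BOM if there is one.
aware = raw.decode("utf-8-sig")

print(plain.startswith("\ufeff"))   # the BOM is still in front of "<?xml"
print(aware.startswith("<?xml"))    # the BOM is gone
```

Whether saxon:base64Binary-to-string() ought to behave like the second decode is the question raised in the issue above.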

In the meantime, you can manually strip the BOM in XForms if present, as done in this example: view.xhtml. I have also copied the relevant part here:

    <xf:var name="dec" value="saxon:base64Binary-to-octets(xs:base64Binary(.))"/>
    <xf:var name="has-bom" value="$dec[1] = 239 and $dec[2] = 187 and $dec[3] = 191"/>
    <xf:bind ref="." type="xs:base64Binary" calculate="if ($has-bom) then saxon:octets-to-base64Binary($dec[position() > 3]) else ."/>
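
For readers who want to see the same octet-level logic outside XForms, here is a rough Python equivalent. It is only a sketch: the function name strip_bom_base64 is made up for illustration, and it assumes the uploaded value is standard base64 text.

```python
import base64

# The UTF-8 BOM as decimal octets, as in the XForms test above: 239, 187, 191.
UTF8_BOM = bytes([239, 187, 191])


def strip_bom_base64(value: str) -> str:
    """Return the base64 value with a leading UTF-8 BOM removed, if present."""
    octets = base64.b64decode(value)
    if octets[:3] == UTF8_BOM:
        # Same idea as $dec[position() > 3]: keep everything after the BOM.
        octets = octets[3:]
    return base64.b64encode(octets).decode("ascii")


# Usage with a hypothetical uploaded document.
uploaded = base64.b64encode(UTF8_BOM + b"<doc/>").decode("ascii")
print(base64.b64decode(strip_bom_base64(uploaded)))
```

Working on the bytes before decoding means the check does not depend on how the string decoder treats the BOM.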

Alex

Re: UTF-8 with BOM not supported?

ahenket

Hi, thanks for the workaround. I'll be on holiday for 3 weeks so I'll get back to it afterwards most likely.

Alexander

On 27 Jun 2013, at 03:12, Alessandro Vernet [via Orbeon Forms community mailing list] <[hidden email]> wrote:



Re: UTF-8 with BOM not supported?

Alessandro  Vernet
Administrator
Hi Alexander,

Sure, there is of course no rush at all; let us know when you get a chance to test this.

Alex

Re: UTF-8 with BOM not supported?

ahenket
This obviously fell off my radar. We decided to go a different but similar route and solve this in XQuery, since Saxon under eXist-db has the exact same issue, so we need to work around it deeper down.

let $file-data := if (request:exists()) then request:get-data() else ()
let $update :=
    if (not(empty($file-data))) then
        (: Hack alert: upload fails when the content starts with a UTF-8 byte order mark.
           The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF;
           once decoded to a string it is the single code point U+FEFF (65279). :)
        let $file-content   := util:base64-decode($file-data/content)
        let $content-no-bom := if (string-to-codepoints(substring($file-content, 1, 1)) = 65279) then substring($file-content, 2) else $file-content
        let $store          := xmldb:store($messageStoragePath, encode-for-uri($filename), $content-no-bom)
        return $store
    else ()
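
The same check, decoding first and then dropping a leading code point 65279, can be sketched in Python for comparison. The function name decode_without_bom is hypothetical, and the sketch assumes the content is base64-encoded UTF-8:

```python
import base64


def decode_without_bom(content_b64: str) -> str:
    """Decode base64-encoded UTF-8 text and drop a leading BOM character."""
    text = base64.b64decode(content_b64).decode("utf-8")
    # After decoding, the 3-byte BOM is the single character U+FEFF (65279).
    if text[:1] == "\ufeff":
        text = text[1:]
    return text


# Usage with a hypothetical upload that starts with a BOM.
sample = base64.b64encode(b"\xef\xbb\xbf<doc/>").decode("ascii")
print(decode_without_bom(sample))
```

Unlike the octet-level variant, this one tests a single character rather than three bytes, because the decoder has already turned the BOM bytes into one code point.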

Re: UTF-8 with BOM not supported?

Alessandro  Vernet
Administrator
Hi Alexander,

I'm glad doing this in eXist works for you. BTW, have you tried asking Mike Kay, the Saxon author, about this? (If you haven't already, the saxon-help mailing list would be a good place.)

Alex