Experimenting with PostgreSQL's JSONB for better performance

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Experimenting with PostgreSQL's JSONB for better performance

Aaron Spike
We make extensive use of PostgreSQL and Flat Views. A while ago I noticed that the views exposing our Orbeon Forms data in a relational fashion were getting slow to the point of pain. I converted one particularly slow view to a materialized view to try make the experience more tolerable but I kept wondering if the live view of the data could be made faster. Because I had some success with the JSONB datatype in PostgreSQL I wondered if the views would perform better if the data were stored as JSONB instead of XML in the orbeon_form_data table. I posted some questions on twitter at the time. Now I am posting a more complete description of my experience here in hopes that the community can provide some advice.

Our most complex form has the following statistics:

Sections 13
Repeats 6
Grids 42
Section templates 0
Controls 136
All 197

I'm guessing these stats are relatively modest compared to what some of you are doing. For this form the orbeon_form_data table stores 3054 rows with 1248 distinct document_ids. Again, I think this is relatively modest. The view we are using to access this data is a copy of the automatically generated flat view with a few type casts and an extra where condition to avoid draft forms that may have unvalidated data that cannot be properly casted.

The view has about 90 output columns and about 100 XPATH() calls. Running the view query in PGAdmin takes approximately 3 minutes to retrieve ~1000 rows. (Lest someone think that my modification are slowing things down, the modified query actually returns slightly faster than the raw automatically generated flat view.)

On Twitter @Orbeon suggested that I should be able to add a JSONB column to the orbeon_form_data table and populate the data with a trigger. I found a post on StackOverflow with an example XML to JSON function for PostgreSQL. I felt that the resulting JSON was too verbose so I simplified it a little. (I removed the outer <form/> document element, ignored attributes, and placed all text values in arrays even if there is only a single item to make indexing consistent.) The conversion function is listed below.
Using this function it took about 2.75 minutes to populate the JSON column for the 3054 rows in question. I think this works out to about 60 milliseconds which would probably be tolerable for a trigger executing on form submission.

I then wrote a query which returns identical results to the original view (i.e. the one that takes 3 minutes to materialize) but references the JSONB column instead of using XPATH() operations on the XML to extract the data. This new query executes in under 3 seconds.

After running this test, I believe the JSONB data storage alternative would provide a major performance enhancement to users of PostgreSQL and Orbeon's Flat Views. I think it would be most advantageous if Orbeon Forms provided the possibility to have a parallel JSONB column out of the box.

Here are my concerns with the trigger method:
1) Because PostgreSQL's XML capabilities are limited, I'm a little bit nervous that unexpected XML data (new namespaces, etc.) may cause exceptions in the trigger function that could cause the data to fail to be stored in PostgreSQL. Orbeon Forms would know how to properly convert any XML that it is able to generate.
2) The format of the generated JSONB is still a bit ugly. I would prefer to have JSON as described in Orbeon Issue #2109 but I don't think that there is enough information contained in the individual form submission's XML model data properly identify all attachments and to consistently interpret which elements may be duplicated due to repeating grids and section. I believe these things are necessary so that deep indexing into the JSONB data is consistent from record to record.

Comments welcome!

Aaron Spike

Orbeon XML to JSONB Function:
create or replace function orbeon_xml_to_json_noattr_helper(p_xml xml, OUT eltname text, OUT eltvalue jsonb) returns record as $$
declare
  children jsonb;
  txt text;
begin
  -- Get root element name
  select (xpath('local-name(/*)', p_xml))[1]::text into eltname;

  -- Process child nodes
  select jsonb_object_agg(k, vv)
  from (
    select k, jsonb_agg(v) as vv
    from unnest(xpath('/'||eltname||'/*', p_xml)) with ordinality as a(z,i)
      cross join lateral (
      --
      -- recursive call!
      --
      select k,v from orbeon_xml_to_json_noattr_helper(z) as xj(k,v)
      ) as b
    where v is not null
    group by k
  ) as e
  into children;

  -- Read text
  select (xpath('/*[local-name()='''||eltname||''']/text()', p_xml))[1]::text into txt;

  -- Build object
  eltvalue := case 
  when nullif(children,'null'::jsonb) is null then to_jsonb(nullif(regexp_replace(txt, '(^\s+|\s+$)','','g'),''))
  else children end;
end $$ language plpgsql immutable;

create or replace function orbeon_xml_to_json_noattr(p_xml xml) returns jsonb as $$
declare
  result jsonb;
begin
  -- Get root element name
  select v from orbeon_xml_to_json_noattr_helper(p_xml) as xj(k,v) into result;

  return result;
end $$ language plpgsql immutable;

Example Form Model:
<form>
  <section-1>
    <field-1>value-1</field-1>
    <field-2/>
    <field-3 xmlns:ns1="http://www.example.com/ns1" xmlns:ns2="http://www.example.com/ns2">value-3</field-3>
    <field-4>1 2 3</field-4>
    <field-5>
    
    value-1
    
    test
    
    </field-5>
    <attachment-1 filename="File.pdf" mediatype="application/pdf" size="1234">/fr/service/persistence/crud/APP/FORM/data/DOC/FILE.bin</attachment-1>
  </section-1>
  <section-2>
    <grid-1>
      <grid-field-1>1</grid-field-1>
      <grid-field-2>2</grid-field-2>
    </grid-1>
    <grid-1>
      <grid-field-1>3</grid-field-1>
      <grid-field-2>4</grid-field-2>
    </grid-1>
    <nested-section-1>
      <field-4>value-4</field-4>
      <field-5>value-5</field-5>
    </nested-section-1>
    <nested-repeating-section-1>
      <nested-repeating-section-1-iteration>
        <field-6>value-6</field-6>
        <field-7>value-7</field-7>
      </nested-repeating-section-1-iteration>
      <nested-repeating-section-1-iteration>
        <field-6>value-8</field-6>
        <field-7>value-9</field-7>
      </nested-repeating-section-1-iteration>
    </nested-repeating-section-1>
  </section-2>
</form>

Resulting JSONB output:
{
    "section-1": [
        {
            "field-1": [
                "value-1"
            ],
            "field-3": [
                "value-3"
            ],
            "field-4": [
                "1 2 3"
            ],
            "field-5": [
                "value-1\n    \n    test"
            ],
            "attachment-1": [
                "/fr/service/persistence/crud/APP/FORM/data/DOC/FILE.bin"
            ]
        }
    ],
    "section-2": [
        {
            "grid-1": [
                {
                    "grid-field-1": [
                        "1"
                    ],
                    "grid-field-2": [
                        "2"
                    ]
                },
                {
                    "grid-field-1": [
                        "3"
                    ],
                    "grid-field-2": [
                        "4"
                    ]
                }
            ],
            "nested-section-1": [
                {
                    "field-4": [
                        "value-4"
                    ],
                    "field-5": [
                        "value-5"
                    ]
                }
            ],
            "nested-repeating-section-1": [
                {
                    "nested-repeating-section-1-iteration": [
                        {
                            "field-6": [
                                "value-6"
                            ],
                            "field-7": [
                                "value-7"
                            ]
                        },
                        {
                            "field-6": [
                                "value-8"
                            ],
                            "field-7": [
                                "value-9"
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}





This electronic communication, including any attached documents, may contain confidential and/or legally privileged information that is intended only for use by the recipient(s) named above. If you have received this communication in error, please notify the sender immediately and delete the communication and any attachments. Views expressed by the author do not necessarily represent those of Martin Luther College.

--
You received this message because you are subscribed to the Google Groups "Orbeon Forms" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Reply | Threaded
Open this post in threaded view
|

Re: Experimenting with PostgreSQL's JSONB for better performance

Alessandro  Vernet
Administrator
Hi Aaron,

Thank you for sharing this. It will for sure come in handy to someone else
stumbling on the same situation. Going from 3 minutes to 3 seconds is quite
impressive.

Out of curiosity, what does the view query using the JSONB look like?

-Alex

-----
--
Follow Orbeon on Twitter: @orbeon
Follow me on Twitter: @avernet
--
Sent from: http://discuss.orbeon.com/

--
You received this message because you are subscribed to the Google Groups "Orbeon Forms" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
--
Follow Orbeon on Twitter: @orbeon
Follow me on Twitter: @avernet
Reply | Threaded
Open this post in threaded view
|

Re: Experimenting with PostgreSQL's JSONB for better performance

Aaron Spike
I've abbreviated the example a bit. This is what a current flat view definition looks like:

CREATE OR REPLACE VIEW orbeon_schema.orbeon_f_appname_formname AS 
 SELECT d.document_id AS metadata_document_id,
    d.created AS metadata_created,
    d.last_modified_time AS metadata_last_modified_time,
    d.last_modified_by AS metadata_last_modified_by,
    (xpath('/*/section-1/field-1/text()'::text, d.xml))[1]::text AS section_1_field_1,
    (xpath('/*/section-1/field-2/text()'::text, d.xml))[1]::text AS section_1_field_2,
    (xpath('/*/section-1/field-3/text()'::text, d.xml))[1]::text AS section_1_field_3,
    (xpath('/*/section-1/field-4/text()'::text, d.xml))[1]::text AS section_1_field_4,
    (xpath('/*/section-1/field-5/text()'::text, d.xml))[1]::text AS section_1_field_5
   FROM orbeon_schema.orbeon_form_data d,
    ( SELECT max(d_1.last_modified_time) AS last_modified_time,
            d_1.app,
            d_1.form,
            d_1.document_id
           FROM orbeon_schema.orbeon_form_data d_1
          WHERE d_1.app::text = 'AppName'::text AND d_1.form::text = 'FormName'::text AND d_1.draft = 'N'::bpchar
          GROUP BY d_1.app, d_1.form, d_1.document_id) m
  WHERE d.last_modified_time = m.last_modified_time AND d.app::text = m.app::text AND d.form::text = m.form::text AND d.document_id::text = m.document_id::text AND d.deleted = 'N'::bpchar;


With the JSONB column generated by the function in my previous post, it would look more like this:

CREATE OR REPLACE VIEW orbeon_schema.orbeon_f_appname_formname AS 
 SELECT d.document_id AS metadata_document_id,
    d.created AS metadata_created,
    d.last_modified_time AS metadata_last_modified_time,
    d.last_modified_by AS metadata_last_modified_by,
    (jsonb_data->'section-1'->0->'field-1'->>0)::text AS section_1_field_1,
    (jsonb_data->'section-1'->0->'field-2'->>0)::text AS section_1_field_2,
    (jsonb_data->'section-1'->0->'field-3'->>0)::text AS section_1_field_3,
    (jsonb_data->'section-1'->0->'field-4'->>0)::text AS section_1_field_4,
    (jsonb_data->'section-1'->0->'field-5'->>0)::text AS section_1_field_5
   FROM orbeon_schema.orbeon_form_data d,
    ( SELECT max(d_1.last_modified_time) AS last_modified_time,
            d_1.app,
            d_1.form,
            d_1.document_id
           FROM orbeon_schema.orbeon_form_data d_1
          WHERE d_1.app::text = 'AppName'::text AND d_1.form::text = 'FormName'::text AND d_1.draft = 'N'::bpchar
          GROUP BY d_1.app, d_1.form, d_1.document_id) m
  WHERE d.last_modified_time = m.last_modified_time AND d.app::text = m.app::text AND d.form::text = m.form::text AND d.document_id::text = m.document_id::text AND d.deleted = 'N'::bpchar;

Notice that accessing data in the JSONB is a little more verbose than one would think should be necessary. I would prefer if the bold red portions could be omitted: "jsonb_data->'section-1'->0->'field-1'->>0". The problem is that given only the XML data, the conversion function in postgresql can't tell which parts of the document might include repeated data. If the JSONB was calculated inside of Orbeon Form proper a much more pleasing JSONB structure could be generated. Would you ever consider adding such a thing?

My current plan to work around my concern about calling the conversion function from a trigger is to call pg_notify() from a trigger and have a daemon process somewhere listening for the notification to generate missing JSONB data. I want to reduce the chances that the XML data fails to be saved to an absolute minimum.

This electronic communication, including any attached documents, may contain confidential and/or legally privileged information that is intended only for use by the recipient(s) named above. If you have received this communication in error, please notify the sender immediately and delete the communication and any attachments. Views expressed by the author do not necessarily represent those of Martin Luther College.

--
You received this message because you are subscribed to the Google Groups "Orbeon Forms" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Reply | Threaded
Open this post in threaded view
|

Re: Experimenting with PostgreSQL's JSONB for better performance

Alessandro  Vernet
Administrator
Hi Aaron,

Interesting. About the JSON not being optimal, I'm wondering if, in the
future once the issue linked below is implemented, instead of relying on a
database trigger and your own logic to generate the JSON from the XML, you
could update your process running on save or submit to also `send()` the
data in JSON format to an end-point of yours, and have that code store the
JSON it receives in the database.

Of course, until that issue is taken care of, this is just a through, but it
implementing 2109 is likely to be done faster than full built-in JSONB
support ;).

https://github.com/orbeon/orbeon-forms/issues/2109

-Alex

-----
--
Follow Orbeon on Twitter: @orbeon
Follow me on Twitter: @avernet
--
Sent from: http://discuss.orbeon.com/

--
You received this message because you are subscribed to the Google Groups "Orbeon Forms" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
--
Follow Orbeon on Twitter: @orbeon
Follow me on Twitter: @avernet