PostgreSQL: Data from Multiple Queries in JSON

During a discussion the other day about SOA and a new service for a set of data we currently keep in PostgreSQL, a thought popped into my head: what if the database could do all of the selects and build the service's JSON document in one request, rather than having the service make all of the calls, build an object, and then JSONify it? Depending on document depth, that could reduce the number of calls to the database dramatically, gaining some of the value of a document database.

I went ahead and built a test case, with two VMs on the same LAN (so latency was minimal). I expected to see only minimal differences between a service where the database built the document and one where the service code built it by running multiple selects against the tables. Instead, I found significant improvements in the version where the database handled the JSON creation and returned it. Because all of the pieces were coded in Perl, I'm hypothesizing that most of the timing difference is latency related; if that's true, increasing the number of queries should widen the gap, and I intend to demonstrate that by the end of this post.

First, I need to describe how I made this work. I downloaded some relational data from data.gov about foods and their nutritional values. Basically, I ended up with a series of tables for my first test that looked like:

  • food_des (food description) with primary key ndb_no. (7539 records)
  • fd_group (food group) with primary key fdgrp_cd. food_des has a matching fdgrp_cd field; an fd_group has many food_des. (25 records)
  • nut_data (nutrition data) with an ndb_no field matching the ndb_no of food_des; a food_des record has many nut_data. (534542 records, index on ndb_no)

There are other tables, but for my first test I only used these three. My goal is a web service that, for a given food_des ndb_no, returns a JSON document containing all of that record's data, along with a 'group' field referencing an object with the fd_group data for that food_des, and a 'nutrition_data' field referencing all of its nut_data rows as an array of objects.
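Roughly, the Perl structure I want to hand to the JSON encoder looks like this (a sketch only; '01001' is an ndb_no I'll use for testing later, and the other values and elided columns are illustrative placeholders):

# Rough shape of the hash that will become the JSON document.
# Values here are illustrative placeholders, not real data.
my $doc = {
    ndb_no         => '01001',
    fdgrp_cd       => '0100',
    # ... the remaining food_des columns ...
    group          => {
        fdgrp_cd => '0100',
        # ... the remaining fd_group columns ...
    },
    nutrition_data => [
        # ... one hashref per matching nut_data row ...
    ],
};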

In order to make this happen, I used an Ubuntu 12.04 VM that already had PostgreSQL 9.1 on it.

Assuming the database is already created (in my case, it's called agri), I need to install PL/Perl (plperl/plperlu) in order to build the JSON documents inside the database. I install the following packages:

$ sudo apt-get install postgresql-plperl-9.1
$ sudo apt-get install libjson-perl

Then I install the languages in my database:

$ sudo -u postgres createlang plperl agri
$ sudo -u postgres createlang plperlu agri

The reason I need plperlu is that loading a module via 'use' is considered unsafe in the trusted plperl language. This is a real limitation, as functions in an untrusted language can only be created by a superuser. Calling the functions afterwards can be done by any user with permission to execute them, though.

I write the following function to build the document from my requirements:

CREATE OR REPLACE FUNCTION pgperlu_test(id varchar) RETURNS varchar AS $$
    use JSON;
    my $args = [ @_ ];
    # Fetch the food_des row for the requested ndb_no.
    my $pl = spi_prepare('select * from food_des where ndb_no = $1','varchar');
    my $data = spi_exec_prepared($pl,$args->[0]);
    my $d0 = $data->{rows}->[0];
    if ( $d0 ) {
        # Attach the matching fd_group row as a nested object.
        my $pl2 = spi_prepare('select * from fd_group where fdgrp_cd = $1','varchar');
        $data = spi_exec_prepared($pl2,$d0->{fdgrp_cd});
        $d0->{group} = $data->{rows}->[0];
        # Attach all matching nut_data rows as an array of objects.
        my $pl3 = spi_prepare('select * from nut_data where ndb_no = $1','varchar');
        $data = spi_exec_prepared($pl3,$args->[0]);
        $d0->{nutrition_data} = $data->{rows};
    }
    return JSON::encode_json($d0);
$$ LANGUAGE plperlu;

I could probably even save some time by caching the prepared plans in globals, but running it this way leaves me room to test that possibility later, perhaps in an edit.
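I haven't benchmarked that variant, but PL/Perl does provide a per-session %_SHARED hash that could hold the prepared plans so they are only built once per backend. A sketch of what that might look like (the _cached name is mine):

CREATE OR REPLACE FUNCTION pgperlu_test_cached(id varchar) RETURNS varchar AS $$
    # Sketch: cache the prepared plans in %_SHARED so they are built once
    # per backend session rather than on every call.
    use JSON;
    $_SHARED{food_plan}  ||= spi_prepare('select * from food_des where ndb_no = $1','varchar');
    my $d0 = spi_exec_prepared($_SHARED{food_plan},$_[0])->{rows}->[0];
    if ( $d0 ) {
        $_SHARED{group_plan} ||= spi_prepare('select * from fd_group where fdgrp_cd = $1','varchar');
        $d0->{group} = spi_exec_prepared($_SHARED{group_plan},$d0->{fdgrp_cd})->{rows}->[0];
        $_SHARED{nut_plan}   ||= spi_prepare('select * from nut_data where ndb_no = $1','varchar');
        $d0->{nutrition_data} = spi_exec_prepared($_SHARED{nut_plan},$_[0])->{rows};
    }
    return JSON::encode_json($d0);
$$ LANGUAGE plperlu;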

Now that I have that installed via:

$ cat food_with_group.sql | sudo -u postgres psql --dbname agri
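Before wiring this into a web application, a quick one-off DBI script is enough to confirm the function returns a document (the connection parameters here are placeholders for my setup, and 01001 is an ndb_no from the imported data):

#!/usr/bin/perl
# Sanity check: call pgperlu_test directly and print the JSON it returns.
use strict;
use warnings;
use DBI;

# Placeholder connection details; adjust user/password/host as needed.
my $dbh = DBI->connect('dbi:Pg:dbname=agri;host=localhost',
                       'someuser', 'somepass', { RaiseError => 1 });

my $sth = $dbh->prepare('select pgperlu_test(?)');
$sth->execute('01001');
my ($json) = $sth->fetchrow_array;
print "$json\n";
$dbh->disconnect;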

I am ready to do some testing. Since I'm already working with Perl, I create a quick Catalyst application with a working DBIx::Class schema connection, and write two actions in the Root.pm controller:

sub with_pgperlu :Local :Args(1) {
    my ( $self, $c ) = @_;
    my $dbh = $c->model('FoodDB')->storage->dbh;
    # One round trip: the database builds and returns the JSON document.
    my $stm = $dbh->prepare('select pgperlu_test(?)');
    $stm->execute($c->request->arguments->[0]);
    my $data = $stm->fetchrow_arrayref();
    $c->response->body($data->[0]);
}

sub without_pgperlu :Local :Args(1) {
    my ( $self, $c ) = @_;
    my $dbh = $c->model('FoodDB')->storage->dbh;
    # Fetch the food_des row for the requested ndb_no.
    my $stm = $dbh->prepare('select * from food_des where ndb_no = ?');
    $stm->execute($c->request->arguments->[0]);
    my $row = $stm->fetchrow_hashref('NAME_lc');
    $stm->finish;
    if ( $row ) {
        # Fetch the matching fd_group row.
        my $gstm = $dbh->prepare('select * from fd_group where fdgrp_cd = ?');
        $gstm->execute($row->{fdgrp_cd});
        my $grow = $gstm->fetchrow_hashref();
        $gstm->finish;
        # Fetch all matching nut_data rows.
        my $nstm = $dbh->prepare('select * from nut_data where ndb_no = ?');
        $nstm->execute($c->request->arguments->[0]);
        my $ndata = [ ];
        while ( my $nrow = $nstm->fetchrow_hashref() ) {
            push(@{$ndata},$nrow);
        }
        $row->{group} = $grow;
        $row->{nutrition_data} = $ndata;
    }
    # Assemble and encode the document in application code instead.
    $c->response->body(JSON::encode_json($row));
}

I downloaded httperf and ran the following commands for testing, against the server running in development mode:

$ httperf --server=localhost --port=3000 --uri=/with_pgperlu/01001 --wsess=1,5000,0
$ httperf --server=localhost --port=3000 --uri=/without_pgperlu/01001 --wsess=1,5000,0

This will create and analyze 5000 requests for each version of the functionality. My results included:

  • Request rate: 139.7/second with the pgperlu function, 84.4 without.
  • ms/request: 7.2 with the pgperlu function, 11.8 without.
  • Identical response sizes, both at about 28 KB apiece.

These results are quite promising, but what I'm really interested in is testing with more network latency, as I believe that with even a bit of latency the pgperlu solution will perform dramatically better still. For now, I'm happy enough to just add a few more tables of data and see what kind of change that causes in my results.

So, I'm going to add data from the weight table (13209 records; food_des has many weight rows) and from the footnote table (food_des has many footnote rows). This should be enough to see whether two extra queries significantly change the timing results. The additions to the function look roughly like the sketch below.
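The 'weights' and 'footnotes' field names here are my own, and I'm assuming both tables carry an ndb_no column the way nut_data does:

        # Sketch of the two additional lookups added inside the if-block
        # of pgperlu_test.
        my $pl4 = spi_prepare('select * from weight where ndb_no = $1','varchar');
        $d0->{weights} = spi_exec_prepared($pl4,$args->[0])->{rows};
        my $pl5 = spi_prepare('select * from footnote where ndb_no = $1','varchar');
        $d0->{footnotes} = spi_exec_prepared($pl5,$args->[0])->{rows};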

With two new tables, but not significantly more data returned, my results were:

  • Request rate: 121.7/second with the pgperlu function, 70.2 without.
  • ms/request: 8.2 with the pgperlu function, 14.2 without.
  • Identical response sizes, both at about 28.5 KB apiece.

I see a nominal increase of about 1 millisecond per request for the pgperlu version of the code, and an increase of about 2.5 milliseconds for the non-pgperlu version. Knowing that I can add a query to the document function for roughly a 0.5 millisecond increase in page time means that a single page can pull together a significant amount of data without any noticeable slowdown, if it is coded using the techniques described here and the queries themselves are well written. It also reduces the need to worry about latency between the database and the application. Based on this design and these results, my latency is either about 0.6 ms or 1.2 ms per round trip; the extra 2.5 ms covers two added queries, and the prepare may carry latency of its own, which is why 0.6 is a possibility rather than 1.2. Both are incredibly low numbers for a typical database setup.

What happens when that latency approaches 50 ms, or some other relatively large number? (For development systems pinging a database over a VPN, it is not difficult to approach 200 ms.) That is where the benefit of the JSON document functions becomes even clearer: their latency cost stays at a constant one or two round trips per request. With the non-document approach, every additional query adds another one or two latency-bound round trips, so in the last test the count would be either five or ten. I believe this is the very reason so many people think databases are 'slow': poorly designed data access.
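A back-of-the-envelope calculation makes the point. This is a rough sketch rather than a benchmark: the 6 ms of non-latency work per request is an assumed figure loosely based on the timings above, and the round-trip counts follow the one-or-two versus five-or-ten reasoning:

#!/usr/bin/perl
# Rough estimate of per-request time as latency grows, assuming the cost
# is dominated by latency-bound round trips plus a fixed amount of work.
use strict;
use warnings;

my $work_ms = 6;   # assumed non-latency cost per request
my %round_trips = (
    'document function (1 query)' => [ 1, 2 ],
    'five separate queries'       => [ 5, 10 ],
);

for my $latency_ms ( 0.6, 50, 200 ) {
    for my $label ( sort keys %round_trips ) {
        my ( $lo, $hi ) = @{ $round_trips{$label} };
        printf "latency %5.1f ms, %-28s: %7.1f - %7.1f ms/request\n",
            $latency_ms, $label,
            $work_ms + $lo * $latency_ms, $work_ms + $hi * $latency_ms;
    }
}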

In addition, this also protects us by letting people safely modify very limited subsets of the data, and it gives us multiple entry points into the data via the primary keys of different tables; something that is not readily available in your typical document data store.
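For instance, nothing stops me from adding a second document function keyed on the food group instead. This is an untested sketch built from the same tables; the function and field names are my own:

CREATE OR REPLACE FUNCTION pgperlu_group_test(id varchar) RETURNS varchar AS $$
    # Sketch: a second entry point into the same data, keyed on fdgrp_cd.
    use JSON;
    my $pl = spi_prepare('select * from fd_group where fdgrp_cd = $1','varchar');
    my $g = spi_exec_prepared($pl,$_[0])->{rows}->[0];
    if ( $g ) {
        # Attach every food_des row belonging to this group.
        my $pl2 = spi_prepare('select * from food_des where fdgrp_cd = $1','varchar');
        $g->{foods} = spi_exec_prepared($pl2,$_[0])->{rows};
    }
    return JSON::encode_json($g);
$$ LANGUAGE plperlu;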

It also looks like JSON support may be added directly to PostgreSQL in the near future (it is in the 9.2 development branch). Whether it will support building documents like the ones demonstrated here remains to be seen. Such a change might also improve the way these functions have to be created; currently they must be created by the administrative superuser, which is obviously not optimal.
