Data Normalization and Reporting Tutorial

Data is what I do for a living, and in this article, I’m going to load some raw data into a database, normalize it for reporting and website use, and run some reports, starting with grouping statements and finishing with analytic functions. I’m going to be working with PostgreSQL, for a couple of reasons. First, it’s free. Second, it has analytic functions, which are sadly missing from MySQL (as of this writing). And while PostgreSQL can be harder to work with and configure than MySQL, once it is up and running it is competitive in every major metric.

First, we need to spin up a box and get PostgreSQL up and running. Once again, I set up a VM (Ubuntu 12.04 for this article; mileage may vary). With the VM up and running, installing PostgreSQL takes all of a few seconds with:

sudo apt-get install postgresql postgresql-contrib-9.1

Anyone who tells you that Linux is difficult to work with has got to be kidding. Thirty seconds later, I have a running DBMS. I installed postgresql-contrib-9.1 so that I could generate UUIDs directly in the PostgreSQL database, my preferred method of filling primary keys. Next, I create a database for this activity:

sudo -u postgres psql
create database foreign_aid;
create user foreign_aid password 'foreign_aid';
grant all on database foreign_aid to foreign_aid;

I test this new user with the command:

psql --host=localhost -U foreign_aid -W

And everything should be good to go. Now we need to download the foreign aid data from the Data.gov website. For this tutorial, I’m going to focus on the U.S. Overseas Loans and Grants (Greenbook) data set at https://explore.data.gov/Foreign-Commerce-and-Aid/U-S-Overseas-Loans-and-Grants-Greenbook-/5gah-bvex, a super awesome data set that covers the loans and grants supplied to foreign governments. While the layout is excellent for an Excel spreadsheet, it is terrible for use in websites. We need to normalize this data for use in a web or reporting application, but first we are going to load it into a database table as-is and process it using various SQL commands.

Loading the data into the database

First we need to get the data into a workable format. I took the economic data, copied the actual rows along with the header line (there are a bunch of empty rows above it trying to make our work harder) into a new file, then saved it as a CSV. I do this manually for brevity; a better option would be to automate the data load directly from the .xls files, but that would require scripting against LibreOffice or Microsoft Office, which is beyond the scope of this tutorial. I’ve copied these files into the git repository that I’ll be building as a demonstration of this activity. I then copied the header line from the CSV file into a new DDL script that creates the table for the economic data, ran a few Vim tricks, and ended up with the following create table commands:

create table economic_aid (
country_name varchar,
program_name varchar,
FY1946 varchar,
FY1947 varchar,
FY1948 varchar,
FY1949 varchar,
FY1950 varchar,
FY1951 varchar,
FY1952 varchar,
FY1953 varchar,
FY1954 varchar,
FY1955 varchar,
FY1956 varchar,
FY1957 varchar,
FY1958 varchar,
FY1959 varchar,
FY1960 varchar,
FY1961 varchar,
FY1962 varchar,
FY1963 varchar,
FY1964 varchar,
FY1965 varchar,
FY1966 varchar,
FY1967 varchar,
FY1968 varchar,
FY1969 varchar,
FY1970 varchar,
FY1971 varchar,
FY1972 varchar,
FY1973 varchar,
FY1974 varchar,
FY1975 varchar,
FY1976 varchar,
FY1976tq varchar,
FY1977 varchar,
FY1978 varchar,
FY1979 varchar,
FY1980 varchar,
FY1981 varchar,
FY1982 varchar,
FY1983 varchar,
FY1984 varchar,
FY1985 varchar,
FY1986 varchar,
FY1987 varchar,
FY1988 varchar,
FY1989 varchar,
FY1990 varchar,
FY1991 varchar,
FY1992 varchar,
FY1993 varchar,
FY1994 varchar,
FY1995 varchar,
FY1996 varchar,
FY1997 varchar,
FY1998 varchar,
FY1999 varchar,
FY2000 varchar,
FY2001 varchar,
FY2002 varchar,
FY2003 varchar,
FY2004 varchar,
FY2005 varchar,
FY2006 varchar,
FY2007 varchar,
FY2008 varchar,
FY2009 varchar,
FY2010 varchar
);

create table military_aid (
country_name varchar,
program_name varchar,
FY1947 varchar,
FY1948 varchar,
FY1949 varchar,
FY1950 varchar,
FY1951 varchar,
FY1952 varchar,
FY1953 varchar,
FY1954 varchar,
FY1955 varchar,
FY1956 varchar,
FY1957 varchar,
FY1958 varchar,
FY1959 varchar,
FY1960 varchar,
FY1961 varchar,
FY1962 varchar,
FY1963 varchar,
FY1964 varchar,
FY1965 varchar,
FY1966 varchar,
FY1967 varchar,
FY1968 varchar,
FY1969 varchar,
FY1970 varchar,
FY1971 varchar,
FY1972 varchar,
FY1973 varchar,
FY1974 varchar,
FY1975 varchar,
FY1976 varchar,
FY1976tq varchar,
FY1977 varchar,
FY1978 varchar,
FY1979 varchar,
FY1980 varchar,
FY1981 varchar,
FY1982 varchar,
FY1983 varchar,
FY1984 varchar,
FY1985 varchar,
FY1986 varchar,
FY1987 varchar,
FY1988 varchar,
FY1989 varchar,
FY1990 varchar,
FY1991 varchar,
FY1992 varchar,
FY1993 varchar,
FY1994 varchar,
FY1995 varchar,
FY1996 varchar,
FY1997 varchar,
FY1998 varchar,
FY1999 varchar,
FY2000 varchar,
FY2001 varchar,
FY2002 varchar,
FY2003 varchar,
FY2004 varchar,
FY2005 varchar,
FY2006 varchar,
FY2007 varchar,
FY2008 varchar,
FY2009 varchar,
FY2010 varchar
);

Different databases have different methods of loading data from CSV or similar files. For PostgreSQL, this is the COPY command, documented at http://www.postgresql.org/docs/9.1/static/sql-copy.html. For the initial data load (with the data files copied to /tmp/, so that the postgres user can read them), the commands are:

sudo -u postgres psql --dbname=foreign_aid
copy economic_aid from '/tmp/economic_aid_data.csv' with csv header;
copy military_aid from '/tmp/military_aid_data.csv' with csv header;

You can test the data load by selecting from these tables. We are now ready to normalize the data for effective use.
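A couple of quick queries make a reasonable smoke test; the exact counts will depend on the snapshot of the data set you downloaded:

```sql
-- Spot-check the raw load (unquoted FY2010 was folded to lowercase fy2010)
select count(*) from economic_aid;
select count(*) from military_aid;
select country_name, program_name, fy2010
from economic_aid
limit 5;
```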

Data Normalization

We’re going to try to create a Third Normal Form (3NF) database from the data that we have loaded. Instead of writing all of the table DDL up front, I’m going to go through the data and develop the tables one at a time. Given our data set, I think our tables should probably represent the following objects:

  • Country
  • Aid Type (Military/Economic)
  • Program
  • Aid Amount (with Country, Aid Type, Program references, and with Year/Amount as data).

I would like to build this out using only SQL commands, with no looping or non-declarative code. This gives us an opportunity to see some of the more interesting aspects of SQL, and with this toolset, a developer can spend far less time preparing data. Before I can begin, since I’ll be using UUIDs, I need to install the uuid extension:

psalcido@data-normalization:/usr/share/postgresql/9.1/extension$ sudo -u postgres psql --dbname=foreign_aid
psql (9.1.4)
Type "help" for help.

foreign_aid=# create extension "uuid-ossp";
CREATE EXTENSION

Now I need to create a distinct country table from the data. First, the country table is created with the following command:

create table country (
   id uuid primary key,
   name varchar unique not null
);

Now we need to select a list of country names out of the military and economic aid tables. I use the following command to do so:

insert into country
select
  uuid_generate_v4() id,
  coalesce(ea.country_name,ma.country_name) as name
from
  (select distinct country_name from economic_aid) ea
  full outer join (select distinct country_name from military_aid) ma on
    ea.country_name = ma.country_name
;

The above command uses inner queries to get a distinct list of country names from the military aid and economic aid tables, and joins them together with a full outer join on country name. This way I end up with one record per country name, guaranteeing that any name appearing in either data set gets a record. I then coalesce the two names into one, select a generated UUID alongside it, and push that data directly into the table. I will use the same trick to supply data for the program table as well:

create table program (
   id uuid primary key,
   name varchar unique not null
);

insert into program
select
  uuid_generate_v4() id,
  coalesce(ea.program_name,ma.program_name) as name
from
  (select distinct program_name from economic_aid) ea
  full outer join (select distinct program_name from military_aid) ma on
    ea.program_name = ma.program_name
;
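As a quick check that the full outer join trick captured everything, each dimension table’s row count should match the de-duplicated union of names from the two source tables. For country, for example:

```sql
-- These two counts should be equal: every distinct name from either
-- source table gets exactly one row in country
select count(*) from country;

select count(*)
from (
    select country_name from economic_aid
    union  -- union (not union all) de-duplicates across both tables
    select country_name from military_aid
) names;
```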

It is relatively simple to build the Aid Type table, and it needs two records, which I will load manually:

create table aid_type (
    id uuid primary key,
    name varchar unique not null
);

insert into aid_type values (uuid_generate_v4(),'economic');
insert into aid_type values (uuid_generate_v4(),'military');

The final table is built with the command:

create table aid_amount(
   id uuid primary key,
   country uuid references country(id) not null,
   program uuid references program(id) not null,
   aid_type uuid references aid_type(id) not null,
   year date not null,
   amount float not null,
   unique(country,program,aid_type,year)
);

There are a number of ways to fill this table. One such way would be to write a plpgsql function to load the data using the data dictionary to insert and update the data. For instance:

create or replace function fill_amount_table() returns void as
/* fill_amount_table
 * This fills the table aid_amount using the data from the loaded tables
 * for economic aid and military aid using the information_schema table
 * definitions as the table reference.
 */
$BODY$
declare
    i information_schema.columns%rowtype;
    j varchar;
    k aid_amount%rowtype;
begin
/* Loop over the columns for the military and economic aid tables */
for i in select
        *
    from
        information_schema.columns
    where
        table_catalog = 'foreign_aid'
        and table_schema='public'
        and table_name in ('economic_aid','military_aid')
        and column_name like 'fy%' loop
    /* Set the aid_type based on the table name for the current column */
    if i.table_name = 'economic_aid' then
        j := 'economic';
    else
        j := 'military';
    end if;
    /* Create a new table record/update current tables based on the
     * data in the loaded tables */
    for k in execute 'select
            uuid_generate_v4() as id,
            c.id as country,
            p.id as program,
            atp.id as aid_type,
            regexp_replace(''' || i.column_name || ''',''[^\d]'','''',''g'') || ''/01/01'' as year,
            regexp_replace(ea.' || i.column_name || ',''[^\d]'','''',''g'') as amount
        from
            ' || j || '_aid ea
            join program p on ea.program_name = p.name
            join country c on ea.country_name = c.name
            join aid_type atp on atp.name = ''' || j || '''
        where
            ea.' || i.column_name || ' is not null' loop
        RAISE NOTICE '(%,%,%,%,%,%)',k.id,k.country,k.program,k.aid_type,k.year,k.amount;
        /* Update or insert if the record does not exist (this 'solves'
         * the problem with the transitional quarter in 1976 */
        update aid_amount
            set amount = amount + k.amount
        where
            program = k.program
            and country = k.country
            and aid_type = k.aid_type
            and year = k.year;
        if found then
            raise notice 'Updated existing';
        else
            insert into aid_amount values (k.*);
            raise notice 'Inserted new';
        end if;
    end loop;
end loop;
end;
$BODY$ language plpgsql;
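With the function in place, the load itself is a single call; the count afterwards is just a sanity check (the exact number depends on the data snapshot):

```sql
-- Run the load, then confirm that rows arrived
select fill_amount_table();
select count(*) from aid_amount;
```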

This model handles the transitional quarter in 1976 poorly, but I’m just having fun, so I’ll move on without worrying too much about that. There are also some SQL query tricks that can verticalize the data, but they generally require hardcoding the column name in the query for every year, so I’m not excited to do it. Since this solution appears to work well enough, I consider the data loaded.
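For reference, the pure-SQL verticalize trick I’m avoiding looks roughly like this: one union all arm per year column, here sketched for just two years of the economic data. The real version would need an arm for every one of the 65-plus columns, which is exactly the hardcoding I object to:

```sql
-- Sketch only: unpivoting two year columns with union all.
-- Don't run this on top of the function's load; the unique constraint
-- on aid_amount would reject the duplicate rows.
insert into aid_amount
select uuid_generate_v4(), c.id, p.id, atp.id,
       date '1946-01-01',
       regexp_replace(ea.fy1946, '[^\d]', '', 'g')::float
from economic_aid ea
join country c on c.name = ea.country_name
join program p on p.name = ea.program_name
join aid_type atp on atp.name = 'economic'
where ea.fy1946 is not null
union all
select uuid_generate_v4(), c.id, p.id, atp.id,
       date '1947-01-01',
       regexp_replace(ea.fy1947, '[^\d]', '', 'g')::float
from economic_aid ea
join country c on c.name = ea.country_name
join program p on p.name = ea.program_name
join aid_type atp on atp.name = 'economic'
where ea.fy1947 is not null;
```

This sketch also ignores the 1976 transitional quarter collision that the function handles with its update-then-insert logic.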

Now that we have the data loaded, let’s try to do some reporting off of the data that we have built out.

Reporting Queries

I’m going to ramp up the difficulty of the reporting queries as I go, and in doing so, demonstrate a large amount of SQL functionality available for reporting.

First, I’m going to join all the tables together to get a decent view of the data without the UUID’s mucking up the way things look:

select
    c.name as country,
    p.name as program,
    atp.name as aid_type,
    to_char(aa.year,'YYYY') as year,
    aa.amount as amount
from
    aid_amount aa
    join country c on
        c.id = aa.country
    join program p on
        p.id = aa.program
    join aid_type atp on
        atp.id = aa.aid_type
order by
    c.name,
    p.name,
    atp.name,
    aa.year;

Whenever I write joins, I have a strong preference for joining in the from clause using explicit ‘join’ and ‘on’. This habit developed mainly because it works well across databases (Oracle’s non-ANSI outer join syntax, for instance, should be avoided). It also keeps the join condition right next to the table it applies to, so searching the where clause is not necessary to find out how the tables relate. Here is an output demo for this query:

                  country                   |                           program                            | aid_type | year |   amount   
--------------------------------------------+--------------------------------------------------------------+----------+------+------------
 Afghanistan                                | Child Survival and Health                                    | economic | 2002 |    2150000
 Afghanistan                                | Child Survival and Health                                    | economic | 2003 |   47945019
 Afghanistan                                | Child Survival and Health                                    | economic | 2004 |   35000000
 Afghanistan                                | Child Survival and Health                                    | economic | 2005 |   35785000
 Afghanistan                                | Child Survival and Health                                    | economic | 2006 |   37974000
 Afghanistan                                | Child Survival and Health                                    | economic | 2007 |   69444000
 Afghanistan                                | Child Survival and Health                                    | economic | 2008 |   27813000
 Afghanistan                                | Department of Defense Security Assistance                    | economic | 2002 |    2464000
 Afghanistan                                | Department of Defense Security Assistance                    | economic | 2004 |   39717303
 Afghanistan                                | Department of Defense Security Assistance                    | economic | 2005 |  136006927
 Afghanistan                                | Department of Defense Security Assistance                    | economic | 2006 |  214239654
 Afghanistan                                | Department of Defense Security Assistance                    | economic | 2007 |  205387416
 Afghanistan                                | Department of Defense Security Assistance                    | economic | 2008 |  485340607
 Afghanistan                                | Department of Defense Security Assistance                    | economic | 2009 |  548353866
 Afghanistan                                | Department of Defense Security Assistance                    | economic | 2010 |  316514796
 Afghanistan                                | Development Assistance                                       | economic | 2001 |    3361305
 Afghanistan                                | Development Assistance                                       | economic | 2002 |    7283228
 Afghanistan                                | Development Assistance                                       | economic | 2003 |   46279941
 Afghanistan                                | Development Assistance                                       | economic | 2004 |  157126174
 Afghanistan                                | Development Assistance                                       | economic | 2005 |  173989599
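As an aside on the join style mentioned above, here is a toy fragment written both ways; it isn’t part of the tutorial’s scripts, just an illustration of why I keep join conditions out of the where clause:

```sql
-- Implicit join: the join condition hides in the where clause
select c.name, aa.amount
from aid_amount aa, country c
where c.id = aa.country;

-- Explicit join: the condition sits right next to the table it binds
select c.name, aa.amount
from aid_amount aa
join country c on c.id = aa.country;
```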

Now I’m going to get totals by country and program. This builds on the same basic principles, combining joins with a group by:

select
    c.name as country,
    p.name as program,
    sum(aa.amount) as amount
from
    aid_amount aa
    join country c on
        c.id = aa.country
    join program p on
        p.id = aa.program
group by
    c.name, p.name;

                  country                   |                           program                            |   amount    
--------------------------------------------+--------------------------------------------------------------+-------------
 Afghanistan                                | Child Survival and Health                                    |   256111019
 Afghanistan                                | Department of Defense Security Assistance                    |  1948024569
 Afghanistan                                | Development Assistance                                       |   902540100
 Afghanistan                                | Economic Support Fund/Security Support Assistance            | 10343697736
 Afghanistan                                | Food For Education                                           |     8913952
 Afghanistan                                | Global Health and Child Survival                             |   201160624
 Afghanistan                                | Inactive Programs                                            |      101000
 Afghanistan                                | Migration and Refugee Assistance                             |   251296098
 Afghanistan                                | Military Assistance, Total                                   | 26460597159
 Afghanistan                                | Narcotics Control                                            |  2082536932
 Afghanistan                                | Nonproliferation, Anti-Terrorism, Demining and Related       |   253428801
 Afghanistan                                | Other Active Grant Programs                                  |   102882199
 Afghanistan                                | Other Food Aid Programs                                      |   142326999
 Afghanistan                                | Other State Assistance                                       |    20236958
 Afghanistan                                | Other USAID Assistance                                       |  1099930404

Analytic Functions and More Advanced Queries

First, imagine that, for each country, you wanted to list the yearly data for the single program on which the US has spent the most money in total in that country. One solution is to sum the amounts by country and program in an inner query, find the maximum of those sums per country, join that maximum back against the per-program sums to identify the winning program for each country, and finally join back to the base table for the yearly rows. That query would look something like this:

select
    c.name as country,
    p.name as program,
    aa.year,
    aa.amount
from
    (select
        country,
        program,
        sum(amount) as sum_amount
    from
        aid_amount
    group by
        country,
        program) p1 join
    (select
        country,
        max(sum_amount) as max_sum_amount
    from
        (select
            country,
            program,
            sum(amount) as sum_amount
        from
            aid_amount
        group by
            country,
            program) p3
    group by
        country) p2 on p1.country = p2.country and p1.sum_amount = p2.max_sum_amount
    join aid_amount aa on
        aa.country = p1.country and aa.program = p1.program
    join country c on
        aa.country = c.id
    join program p on
        aa.program = p.id
order by
    c.name,
    p.name,
    aa.year;

Take a second to wrap your head around that. Note the repetition and the joins back to the original table. There is a way to do this without joining back to the base data table (aid_amount) at all, which means no hash joins to aid_amount and only one table scan of it. For a table as small as ours, this might not mean a whole lot of gain in cost, but for really large tables with millions, perhaps billions of rows, it can make a huge difference, or even the difference between a query that can finish and one that cannot.

The way to do this is with windowing and analytic/aggregate functions. I’m going to go step by step, demonstrating each query and showing an example of the dataset it produces. Windows allow you to perform grouping or analytic actions directly in the column specification, over a set of rows that you define. I’m actually opening with a fairly complex solution, but I’ll demonstrate some less complex ones afterwards.

I’m going to start by selecting the country, program, year and amount from each record of the aid_amount table, and next to them, in the same row, the total for that row’s country and program. (I’ve joined in the country and program tables so that the resulting data is human readable; I’ll have to remove and re-add those joins for each subsequent query.) The query looks like this:

select
    c.name country,
    p.name program,
    aa.year,
    aa.amount,
    sum(aa.amount) over (
        partition by aa.country,aa.program
    ) total_country_program
from
    aid_amount aa
    join country c on
        aa.country = c.id
    join program p on
        aa.program = p.id
order by
    c.name,
    p.name,
    aa.year;

And the resulting data looks like this:

                  country                   |                           program                            |    year    |   amount   | total_country_program 
--------------------------------------------+--------------------------------------------------------------+------------+------------+-----------------------
 Afghanistan                                | Child Survival and Health                                    | 2002-01-01 |    2150000 |             256111019
 Afghanistan                                | Child Survival and Health                                    | 2003-01-01 |   47945019 |             256111019
 Afghanistan                                | Child Survival and Health                                    | 2004-01-01 |   35000000 |             256111019
 Afghanistan                                | Child Survival and Health                                    | 2005-01-01 |   35785000 |             256111019
 Afghanistan                                | Child Survival and Health                                    | 2006-01-01 |   37974000 |             256111019
 Afghanistan                                | Child Survival and Health                                    | 2007-01-01 |   69444000 |             256111019
 Afghanistan                                | Child Survival and Health                                    | 2008-01-01 |   27813000 |             256111019
 Afghanistan                                | Department of Defense Security Assistance                    | 2002-01-01 |    2464000 |            1948024569
 Afghanistan                                | Department of Defense Security Assistance                    | 2004-01-01 |   39717303 |            1948024569
 Afghanistan                                | Department of Defense Security Assistance                    | 2005-01-01 |  136006927 |            1948024569
 Afghanistan                                | Department of Defense Security Assistance                    | 2006-01-01 |  214239654 |            1948024569
 Afghanistan                                | Department of Defense Security Assistance                    | 2007-01-01 |  205387416 |            1948024569
 Afghanistan                                | Department of Defense Security Assistance                    | 2008-01-01 |  485340607 |            1948024569
 Afghanistan                                | Department of Defense Security Assistance                    | 2009-01-01 |  548353866 |            1948024569
 Afghanistan                                | Department of Defense Security Assistance                    | 2010-01-01 |  316514796 |            1948024569
 Afghanistan                                | Development Assistance                                       | 2001-01-01 |    3361305 |             902540100

As you can see, this works quite nicely already, and gives insight into the data without much more input. I didn’t have to join in the totals using a group by; I computed them directly in the column specification of the select clause. If you have written a lot of SQL but haven’t used windows and analytic functions before, the implications should be pretty obvious.

Now, with this data, I want the highest-ranked rows for each country: the ones with the maximum ‘total_country_program’, if you will. I’m going to wrap the previous query as an inner query and add a new window with another analytic function, ‘rank’. The resulting query looks like this:

select
    c.name country,
    p.name program,
    aa.year,
    aa.amount,
    aa.total_country_program,
    rank() over (
        partition by aa.country
        order by aa.total_country_program desc
    ) as total_amount_rank
from
    (select
        country,
        program,
        year,
        amount,
        sum(amount) over (
            partition by country,program
        ) total_country_program
    from
        aid_amount) aa
    join country c on
        c.id = aa.country
    join program p on
        p.id = aa.program;

And here is some of the resulting data. Note that the rank jumps: the first seven rows are all ranked #1 (they all share the same total), so the next group of six is ranked #8, and the group after that #14. To change the behavior so that the second group is ranked #2 instead, the analytic function dense_rank() exists, but for our purposes, rank() is perfect:

                  country                   |                           program                            |    year    |   amount   | total_country_program | total_amount_rank 
--------------------------------------------+--------------------------------------------------------------+------------+------------+-----------------------+-------------------
 Congo (Brazzaville)                        | Title I                                                      | 1982-01-01 |    1900000 |              31480000 |                 1
 Congo (Brazzaville)                        | Title I                                                      | 1990-01-01 |    1900000 |              31480000 |                 1
 Congo (Brazzaville)                        | Title I                                                      | 1991-01-01 |    1940000 |              31480000 |                 1
 Congo (Brazzaville)                        | Title I                                                      | 1992-01-01 |    5000000 |              31480000 |                 1
 Congo (Brazzaville)                        | Title I                                                      | 1994-01-01 |    6298000 |              31480000 |                 1
 Congo (Brazzaville)                        | Title I                                                      | 1995-01-01 |    3446000 |              31480000 |                 1
 Congo (Brazzaville)                        | Title I                                                      | 1996-01-01 |   10996000 |              31480000 |                 1
 Congo (Brazzaville)                        | Other Food Aid Programs                                      | 1989-01-01 |    2488000 |              23443911 |                 8
 Congo (Brazzaville)                        | Other Food Aid Programs                                      | 2010-01-01 |    2449128 |              23443911 |                 8
 Congo (Brazzaville)                        | Other Food Aid Programs                                      | 2000-01-01 |    7570017 |              23443911 |                 8
 Congo (Brazzaville)                        | Other Food Aid Programs                                      | 2003-01-01 |    2842332 |              23443911 |                 8
 Congo (Brazzaville)                        | Other Food Aid Programs                                      | 2004-01-01 |    3256748 |              23443911 |                 8
 Congo (Brazzaville)                        | Other Food Aid Programs                                      | 2006-01-01 |    4837686 |              23443911 |                 8
 Congo (Brazzaville)                        | Other USAID Assistance                                       | 1978-01-01 |      10000 |              21854551 |                14
 Congo (Brazzaville)                        | Other USAID Assistance                                       | 1979-01-01 |      60000 |              21854551 |                14
 Congo (Brazzaville)                        | Other USAID Assistance                                       | 1980-01-01 |     227000 |              21854551 |                14
 Congo (Brazzaville)                        | Other USAID Assistance                                       | 1981-01-01 |    2000000 |              21854551 |                14
 Congo (Brazzaville)                        | Other USAID Assistance                                       | 1982-01-01 |    1000000 |              21854551 |                14
 Congo (Brazzaville)                        | Other USAID Assistance                                       | 1983-01-01 |    1000000 |              21854551 |                14
 Congo (Brazzaville)                        | Other USAID Assistance                                       | 1984-01-01 |    1000000 |              21854551 |                14
 Congo (Brazzaville)                        | Other USAID Assistance                                       | 1985-01-01 |    1201000 |              21854551 |                14
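If gap-free ranking were wanted instead, only the function name changes; the window clause stays the same. A one-line variation on the ranking column above, assuming the same inner query:

```sql
-- dense_rank() numbers tied groups 1, 2, 3, ... with no gaps
dense_rank() over (
    partition by aa.country
    order by aa.total_country_program desc
) as total_amount_dense_rank
```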

Now if we use this as one final inner query and select only the rows where total_amount_rank is ‘1’, we get the same dataset as the earlier demo with all of the joins. That final query looks like:

select
    c.name as country,
    p.name as program,
    aa.year,
    aa.amount
from
    (select
        aa.country,
        aa.program,
        aa.year,
        aa.amount,
        aa.total_country_program,
        rank() over (
            partition by aa.country
            order by aa.total_country_program desc
        ) as total_amount_rank
    from
        (select
            country,
            program,
            year,
            amount,
            sum(amount) over (
                partition by country,program
            ) total_country_program
        from
            aid_amount) aa) aa
    join country c on
        c.id = aa.country
    join program p on
        p.id = aa.program
where total_amount_rank = 1
order by country,program,year;

Now we need to compare the explain plans with analysis. When we do so, we get the following for the first query:

Sort  (cost=2898.95..2898.98 rows=14 width=55) (actual time=148.688..149.342 rows=5190 loops=1)

and the following for the second query:

Sort  (cost=9628.41..9628.82 rows=166 width=55) (actual time=213.127..213.652 rows=5190 loops=1)
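The ‘(actual time=…)’ figures mean these plans were gathered with explain analyze, which executes the statement and reports real timings alongside the planner’s estimates; only the top Sort node is reproduced above. To collect a plan yourself, prefix the query, for example:

```sql
-- Executes the query and prints the full plan tree with actual timings
explain analyze
select c.name as country, p.name as program, aa.year, aa.amount
from aid_amount aa
join country c on c.id = aa.country
join program p on p.id = aa.program
order by c.name, p.name, aa.year;
```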

Now you might catch yourself asking: why are we doing this when the second query performs worse? It only performs worse because of the small size of the dataset. As the dataset grows, the hash joins back to the original data become more expensive, and the second query starts to perform far better than the first. But to demonstrate some of the value of analytic functions beyond these debatable performance gains, let’s look at another query that might be impossible without them.

This final query takes Afghanistan and, for each program, shows the increase or decrease from the previous year that had data (for simplicity of this run), along with a running total of the amount spent on that program. It uses the lag analytic function, plus an aggregate with a window:

select
    c.name as country,
    p.name as program,
    aa.year,
    aa.amount,
    (aa.amount - lag(aa.amount) over (
        partition by aa.program
        order by aa.year
    ) ) as year_over_year,
    sum(aa.amount) over (
        partition by aa.program
        order by aa.year)
        as running_total
from
    aid_amount aa
    join country c on 
        aa.country = c.id and c.name = 'Afghanistan'
    join program p on
        aa.program = p.id
order by
    c.name,p.name,aa.year;

Running this, I see the following data:

   country   |                        program                         |    year    |   amount   | year_over_year | running_total 
-------------+--------------------------------------------------------+------------+------------+----------------+---------------
 Afghanistan | Child Survival and Health                              | 2002-01-01 |    2150000 |                |       2150000
 Afghanistan | Child Survival and Health                              | 2003-01-01 |   47945019 |       45795019 |      50095019
 Afghanistan | Child Survival and Health                              | 2004-01-01 |   35000000 |      -12945019 |      85095019
 Afghanistan | Child Survival and Health                              | 2005-01-01 |   35785000 |         785000 |     120880019
 Afghanistan | Child Survival and Health                              | 2006-01-01 |   37974000 |        2189000 |     158854019
 Afghanistan | Child Survival and Health                              | 2007-01-01 |   69444000 |       31470000 |     228298019
 Afghanistan | Child Survival and Health                              | 2008-01-01 |   27813000 |      -41631000 |     256111019
 Afghanistan | Department of Defense Security Assistance              | 2002-01-01 |    2464000 |                |       2464000
 Afghanistan | Department of Defense Security Assistance              | 2004-01-01 |   39717303 |       37253303 |      42181303
 Afghanistan | Department of Defense Security Assistance              | 2005-01-01 |  136006927 |       96289624 |     178188230
 Afghanistan | Department of Defense Security Assistance              | 2006-01-01 |  214239654 |       78232727 |     392427884
 Afghanistan | Department of Defense Security Assistance              | 2007-01-01 |  205387416 |       -8852238 |     597815300
 Afghanistan | Department of Defense Security Assistance              | 2008-01-01 |  485340607 |      279953191 |    1083155907
 Afghanistan | Department of Defense Security Assistance              | 2009-01-01 |  548353866 |       63013259 |    1631509773
 Afghanistan | Department of Defense Security Assistance              | 2010-01-01 |  316514796 |     -231839070 |    1948024569
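Two details of the query above are worth calling out: lag() returns NULL on each partition's first row (hence the blank year_over_year for 2002), and adding an order by to sum()'s window turns it into a running total, because the default frame runs from the start of the partition to the current row. Here is a minimal, hedged sketch of both behaviors with made-up numbers, again using SQLite's window functions (3.25+):

```python
import sqlite3

# Toy single-program table with a gap year, mimicking the article's data shape.
conn = sqlite3.connect(":memory:")
conn.execute("create table aid_amount (program text, year int, amount int)")
conn.executemany(
    "insert into aid_amount values (?, ?, ?)",
    [("health", 2002, 100), ("health", 2003, 300), ("health", 2005, 250)],
)

# lag() pulls the previous row's amount within the partition's order;
# sum() with an order by accumulates up to the current row.
rows = conn.execute("""
select year, amount,
    amount - lag(amount) over (partition by program order by year)
        as year_over_year,
    sum(amount) over (partition by program order by year)
        as running_total
from aid_amount
order by year
""").fetchall()
print(rows)
```

The first row's year_over_year comes back as None (SQL NULL), matching the blank cells in the output above, and the 2005 row diffs against 2003 since that is the previous year with data.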

Pretty slick. I've used these functions for some pretty mean tricks at work, and hopefully they help you with some query or code in the future that would otherwise have been a real pain. I've barely scratched the surface of what they can do, and there are several more available in PostgreSQL, listed at http://www.postgresql.org/docs/9.1/static/functions-window.html. I can't recommend the PostgreSQL bandwagon enough, and I hope these little tidbits of data convince you to hop on too.

Perhaps sometime soon I'll find myself analyzing another dataset. For now, I'm committing the current data to the github repository at https://github.com/paulsalcido/data-analysis, so no copying and pasting is necessary, although you might have already done so throughout the tutorial; I probably should have mentioned that at the beginning of this post.
