Distribution key in Netezza

When you create a table in a Netezza database, choosing the column(s) that make up the distribution key is one of the most important decisions for the performance of the queries that will run against that table.

The following are some key points to remember when selecting a distribution key:

Choosing the right distribution key is one of the most important factors in getting good performance out of a Netezza system. If you created a table with RANDOM distribution, or distributed it on a column with many duplicate values, you should change the distribution key right away; otherwise performance will suffer. Changing a Netezza table's distribution key is the process of redistributing the table, which you can do from Netezza nzsql.

Redistribute using CTAS

Creating a new table and loading the data at the same time is one of the best ways to redistribute a table. You can do this with CTAS (CREATE TABLE AS), which is one of the fastest ways to re-create a Netezza table.

Syntax:

CREATE TABLE <new_table> AS SELECT * FROM <existing_table> DISTRIBUTE ON (<required_column(s)>);

For example:

CREATE TABLE new_sales AS SELECT * FROM sales DISTRIBUTE ON (Year);
DROP TABLE sales;
ALTER TABLE new_sales RENAME TO sales;

Redistribute using Intermediate Table

The main drawback of using CTAS alone is that you have to drop the original table and rename the newly created table to the original name. This process invalidates any views built on top of the base table, and you have to explicitly re-compile all views that were built on the old table.

To work around this, you can follow a two-step procedure to redistribute the table: first, use CTAS to create a new intermediate table from the old table; second, truncate the original table and reload its data from the intermediate table.

For example:

CREATE TABLE int_sample_sales AS SELECT * FROM sample_sales DISTRIBUTE ON (MONTH_ID);
INSERT 0 7
TRAINING.ADMIN(ADMIN)=> TRUNCATE TABLE sample_sales;
TRUNCATE TABLE
TRAINING.ADMIN(ADMIN)=> INSERT INTO sample_sales SELECT * FROM int_sample_sales;
INSERT 0 7

You can also generate scripts to redistribute a Netezza table using Aginity Workbench. Here is an example of the script that Aginity generates to redistribute a table:

CREATE TABLE USER1."RENAME_DEMO_ ( ID INTEGER, NAME CHARACTER(10) ) DISTRIBUTE ON (ID);
INSERT INTO USER1."RENAME_DEMO_ SELECT * FROM USER1.RENAME_DEMO;
ALTER TABLE USER1."RENAME_DEMO_ SET PRIVILEGES TO USER1.RENAME_DEMO;
ALTER TABLE USER1.RENAME_DEMO RENAME TO USER1."RENAME_DEMO_
ALTER TABLE USER1."RENAME_DEMO_ RENAME TO USER1.RENAME_DEMO;
ALTER TABLE USER1.RENAME_DEMO OWNER TO USER1;
DROP TABLE USER1."RENAME_DEMO_
GENERATE EXPRESS STATISTICS ON USER1.RENAME_DEMO;

Changing Netezza Table Default Distribution key

By default, if you do not specify a DISTRIBUTE ON clause, Netezza distributes the table on a hash of its first column.
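As a quick illustration (the table and column names below are hypothetical), the three statements create otherwise identical tables using the default distribution, an explicit hash distribution, and random distribution:

-- No DISTRIBUTE ON clause: Netezza hashes on the first column (id)
CREATE TABLE demo_default ( id INTEGER, name CHARACTER(10) );

-- Explicit hash distribution on id
CREATE TABLE demo_hash ( id INTEGER, name CHARACTER(10) ) DISTRIBUTE ON (id);

-- Round-robin (random) distribution
CREATE TABLE demo_random ( id INTEGER, name CHARACTER(10) ) DISTRIBUTE ON RANDOM;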

To change this default distribution behavior, you can update the /nz/data/postgresql.conf configuration file and add the entry below:

enable_random_table_distribute = 1

You need to restart the database with nzstop/nzstart for the updated property to take effect, or you can issue the command “pkill -HUP -f postmaster”.

Distribution, Not Just for Joins

There are several other scenarios, besides joins, where data needs to be co-located on the same SPU in order to be processed.

The first that comes to mind is count distinct. Take, for example, a count of distinct users or customers, like count(distinct customer_id), perhaps when analyzing purchase transactions. An example SQL statement might be:

select count(distinct customer_id)  from purchase_transaction

For the sake of argument, we’ll say that purchase_transaction has random distribution, so Netezza has to break up the job. It can’t just run the count distinct on every SPU and add up the numbers at the end, because some customers might have rows on more than one SPU. The way to make the counts safely additive is to ensure that a customer’s rows can’t be on more than one SPU, and that means the data must be distributed by customer. That’s exactly what the query plan will do: redistribute on customer. But of course, if the data is already distributed on customer, you save that redistribution.
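One way to see this for yourself (the copy-table names below are hypothetical, and the exact plan text varies by release) is to compare plans for the same query against a randomly distributed copy and a customer-distributed copy of the data:

-- Two copies of the same data with different distributions
CREATE TABLE pt_rand AS SELECT * FROM purchase_transaction DISTRIBUTE ON RANDOM;
CREATE TABLE pt_cust AS SELECT * FROM purchase_transaction DISTRIBUTE ON (customer_id);

-- The plan for the random copy will typically contain a redistribute step
-- (hash on customer_id); the customer-distributed copy should not need one.
EXPLAIN VERBOSE SELECT COUNT(DISTINCT customer_id) FROM pt_rand;
EXPLAIN VERBOSE SELECT COUNT(DISTINCT customer_id) FROM pt_cust;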

This is just one example of where Netezza needs all of a grouping of data together. Another is windowed analytic functions like rank, dense_rank, lead, lag, etc. They always have a partition-over set, and to compute them you need all of the data for each partition on one SPU. If you partition over customer_id, reusing the example above, and want to rank dollar amounts for each customer, the plan will again redistribute the data so that each slice of customer data sits together on a single SPU.
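A minimal sketch of that ranking query, assuming a dollar_amount column on purchase_transaction (a hypothetical name):

-- All rows for a given customer_id must land on one SPU before RANK() can
-- be evaluated, so the plan redistributes on customer_id unless the table
-- is already distributed that way.
SELECT customer_id,
       dollar_amount,
       RANK() OVER (PARTITION BY customer_id ORDER BY dollar_amount DESC) AS amount_rank
FROM purchase_transaction;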

Still another example is any aggregate query with GROUP BY columns. If the data set is small enough and all the aggregate functions are summable, Netezza may simply merge each SPU’s results together on the host. But look at your plan files: often, if the result set will be large, it will redistribute on some, or often all, of the GROUP BY columns.

SELECT DISTINCT is in many ways like a GROUP BY without any aggregate functions, so to get like values together it will redistribute on all or most of the column set.
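A couple of small examples of queries that can trigger this kind of redistribution when the table is not already distributed on the grouping columns (the zipcode column is hypothetical):

-- Aggregate with GROUP BY: for a large result set, the plan will often
-- redistribute on the GROUP BY column(s) before aggregating.
SELECT customer_id, SUM(dollar_amount) AS total_spend
FROM purchase_transaction
GROUP BY customer_id;

-- SELECT DISTINCT behaves like a GROUP BY with no aggregates: like values
-- are brought together by redistributing on the selected columns.
SELECT DISTINCT customer_id, zipcode
FROM purchase_transaction;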

A key thing to realize is that Netezza doesn’t need to distribute purely on all of the distinct, group-by, or partition-over columns. It just needs to make sure that each set of related data is local to one SPU. For example, suppose you have zipcode and customer_id, and we know that customer is a pure child of zipcode: a customer has only one zipcode, and a zipcode has many customers. Netezza doesn’t need to redistribute on customer_id if the data is already distributed at the zipcode level, but it has to know about that relationship. That means including zipcode in the DISTINCT or GROUP BY column list, or, in a window function, writing the partition over zipcode, customer_id. What I’ve seen is that if the optimizer sees the data is already distributed at that higher, coarser grain, and that grain’s cardinality spreads things out fairly evenly, it will usually just leave things at that distribution. It can do this because if customers don’t cross zipcodes, and zipcodes don’t cross SPUs, then customers don’t cross SPUs. What I do see, though, is that if it does need to redistribute, it will redistribute on all of the column values; I see this especially on GROUP BYs. Sometimes joins and partition-overs are a little smarter when there is a common higher grain that can handle both.
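A minimal sketch of that idea, with hypothetical table and column names:

-- Distributed at the coarser zipcode grain; every customer_id belongs to
-- exactly one zipcode, so all of a customer's rows are already on one SPU.
CREATE TABLE cust_purchases (
    zipcode     CHARACTER(5),
    customer_id INTEGER,
    amount      NUMERIC(10,2)
) DISTRIBUTE ON (zipcode);

-- Including zipcode in the grouping lets the optimizer keep the existing
-- distribution instead of redistributing on customer_id.
SELECT zipcode, customer_id, SUM(amount) AS total_spend
FROM cust_purchases
GROUP BY zipcode, customer_id;

-- Same idea for a window function: partition on the coarser grain as well.
SELECT zipcode, customer_id, amount,
       RANK() OVER (PARTITION BY zipcode, customer_id ORDER BY amount DESC) AS amount_rank
FROM cust_purchases;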

So, to summarize: in addition to joins, co-location of data through table distribution or redistribution is also needed for count distinct (or any distinct aggregate), partition-overs in analytic functions, and column grouping in GROUP BY or SELECT DISTINCT. A really good table distribution strategy finds a column, perhaps something like user_id or customer_id, that can commonly be used for joins and for these other calculations. If you can find a good common distribution like that, things will run very, very fast.



