Netezza Distribute On

Reading Time: 6 minutes

When users create tables in a database and load data into them, the data is stored in disk extents, the minimum unit of storage allocated on disk. Netezza distributes the data in these extents across all available data slices based on the distribution key specified during table creation. If you don't want to distribute randomly, up to four columns can be specified as the distribution key when the table is created. If no distribution is specified, Netezza picks one of the columns to distribute the data on, and that choice cannot be influenced.

When the user specifies a particular column for distribution, Netezza hashes that column's values to determine the data slice into which each inserted record is stored. When the user chooses random distribution, the appliance uses a round-robin algorithm to spread the records uniformly across all available data slices.
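As a minimal sketch of the two options (the table and column names here are hypothetical), the DDL looks like this:

```sql
-- Hash distribution: rows with the same customer_id land on the same data slice.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id INTEGER,
    sale_date   DATE,
    amount      NUMERIC(12,2)
)
DISTRIBUTE ON (customer_id);

-- Random (round-robin) distribution: rows spread evenly, no co-location.
CREATE TABLE audit_log (
    event_id   BIGINT,
    event_time TIMESTAMP,
    detail     VARCHAR(1000)
)
DISTRIBUTE ON RANDOM;
```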

[Figure: Data slices in each SPU]

The key is to make sure that the data for a table is uniformly distributed across all the data slices so that there is no data skew. By spreading data across the data slices, every SPU in the system can participate in processing a query, which improves performance. The performance of any query is bounded by the slowest SPU handling it: if the data is not uniformly distributed, some SPUs will hold more data in their attached data slices (data skew), which lengthens their processing time and drags down overall query performance. Selecting columns with high cardinality as the distribution key is therefore a good practice.

Even a high-cardinality column can still create processing skew. For example, using a date column as the distribution key spreads the data fairly evenly across all the data slices. But if most queries look for data from a particular month, which is common in a data warehousing environment, only the data slices holding that month's rows need to be processed, so only a subset of SPUs does the work and the query performs suboptimally. This is called processing skew, and it is prevented by understanding the processing requirements and choosing the distribution key accordingly.
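To check how evenly a table's rows have landed, a common technique is to count rows per data slice using Netezza's hidden DATASLICEID column (the table name here is hypothetical):

```sql
-- Rows per data slice; a large spread between the highest and lowest
-- counts indicates data skew on the chosen distribution key.
SELECT DATASLICEID, COUNT(*) AS row_count
FROM sales
GROUP BY DATASLICEID
ORDER BY row_count DESC;
```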

[Figure: Netezza distribution key]

CBT – Clustered Base Tables

Along with the option to distribute data in the table definition, Netezza also provides an option to organize the distributed data within a data slice. For example, we may have distributed the employee table on employee id but want employee records from the same department stored close together; in that case, the dept id column can be specified in the ORGANIZE ON clause.

Netezza allows up to four columns to organize on. When data is stored on a data slice, records with the same ORGANIZE ON column values are stored in the same or nearby extents. ORGANIZE ON improves queries against fact tables when the frequently joined columns are used to organize the data. All the columns specified in the ORGANIZE ON clause are zone mapped, and by knowing the range of values of these columns stored in each physical extent, Netezza can skip reading unwanted extents during a join, which improves query performance.
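A minimal sketch of a clustered base table follows (names are hypothetical); note that after rows have already been loaded, a GROOM TABLE is typically needed to reorganize them into the clustered order:

```sql
-- Distribute on employee id, but keep rows for the same department
-- physically close together within each data slice.
CREATE TABLE employee (
    emp_id   INTEGER,
    dept_id  INTEGER,
    emp_name VARCHAR(100)
)
DISTRIBUTE ON (emp_id)
ORGANIZE ON (dept_id);

-- Reorganize previously loaded rows into the ORGANIZE ON order.
GROOM TABLE employee RECORDS ALL;
```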

The distribution of the data across the various disks is the single most important factor affecting performance. Consider the following points when choosing the distribution key:

Avoid distributing tables on columns that are often used in the WHERE clause, as this causes processing skew. Date columns are a typical example of what not to use, especially in data warehouse environments; a sketch of the anti-pattern follows below.
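As a hedged illustration (table and column names are hypothetical), the first definition below concentrates each month-restricted query on the few data slices holding that month's dates, while the second keeps the filter column out of the distribution key:

```sql
-- Anti-pattern: sale_date is heavily filtered in WHERE clauses, so a
-- month-restricted query touches only the slices holding those dates.
CREATE TABLE sales_by_date (
    sale_id   BIGINT,
    sale_date DATE,
    amount    NUMERIC(12,2)
)
DISTRIBUTE ON (sale_date);

-- Preferred: distribute on a high-cardinality key that is not commonly
-- used as a filter; sale_date can still be exploited via zone maps.
CREATE TABLE sales_by_id (
    sale_id   BIGINT,
    sale_date DATE,
    amount    NUMERIC(12,2)
)
DISTRIBUTE ON (sale_id);
```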

Why VARCHAR is not a good candidate for a distribution key: hashing a variable-length value is more expensive than hashing a fixed-width one, and joining a VARCHAR distribution key to a column of a different type (for example, CHAR or an integer) does not co-locate the matching rows, forcing the appliance to redistribute data at query time.

Some helpful documentation:

• https://www.ibm.com/developerworks/community/blogs/Netezza/entry/distribution_what_s_up_with_that13?lang=en


