Specifications

nodes in the appliance. There are performance considerations for the selection of a distribution
column, such as distinctness, data skew, and the types of queries executed on the system. For
a detailed discussion of the choice of distributed tables, refer to the product documentation.
To distribute the rows in the fact table, a hash function assigns each row to one of many storage
locations based on the distribution column. Each compute node has 8 storage locations,
called distributions, for the hashed rows. If a data rack has 8 compute nodes, the data rack has
64 distributions, which are queried in parallel.
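The row-to-distribution mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the appliance's actual hash function is not described here, so MD5 stands in for it, and the names `distribution_for`, `node_for`, and `customer_id` are hypothetical. The counts of 8 distributions per compute node and 8 compute nodes come from the text.

```python
import hashlib
from collections import Counter

DISTRIBUTIONS_PER_NODE = 8   # from the text
COMPUTE_NODES = 8            # the example data rack from the text
TOTAL_DISTRIBUTIONS = DISTRIBUTIONS_PER_NODE * COMPUTE_NODES  # 64


def distribution_for(value):
    """Map a distribution-column value to a distribution number (0..63).

    MD5 is a stand-in (assumption); it is not the appliance's hash function.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % TOTAL_DISTRIBUTIONS


def node_for(distribution):
    """Return the compute node (0..7) that holds a given distribution."""
    return distribution // DISTRIBUTIONS_PER_NODE


# Hypothetical fact-table rows, distributed on customer_id.
rows = [{"customer_id": i, "amount": i * 1.5} for i in range(10_000)]
counts = Counter(distribution_for(r["customer_id"]) for r in rows)

print("distributions in use:", len(counts))
print("rows in distribution 0:", counts[0], "on node", node_for(0))
```

Because the mapping depends only on the value of the distribution column, rows with the same value always land in the same distribution.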
FIGURE 6-4 Distributed strategy. A hash function assigns each row of the distributed table to a distribution on the compute nodes; each table row belongs to one distribution.
It is not essential that equal numbers of table rows are assigned to each distribution. There
will almost always be some data skew among the distributions. If the amount of data skew
becomes too large, the parallel system continues to run, but query times might increase because the distributions that hold the most rows take the longest to process.
You might have to experiment with several approaches before finding the best distributed
strategy. A distributed strategy does not affect other table options that you might want to
implement. For example, you can still define partitions and clustered indexes as needed.
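One way to compare candidate distribution columns before loading data is to simulate the hashing and measure the skew. The following Python sketch does that under stated assumptions: MD5 stands in for the appliance's hash function, the skew measure (largest distribution divided by the mean) is one simple choice among many, and the sample columns `order_ids` and `region_codes` are hypothetical.

```python
import hashlib
from collections import Counter

TOTAL_DISTRIBUTIONS = 64  # 8 compute nodes x 8 distributions, as in the text


def distribution_for(value, n=TOTAL_DISTRIBUTIONS):
    """Stand-in hash (assumption): not the appliance's real hash function."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def skew_ratio(column_values, n=TOTAL_DISTRIBUTIONS):
    """Largest distribution size divided by the mean size.

    1.0 means perfectly even; larger values mean more skew.
    """
    counts = Counter(distribution_for(v, n) for v in column_values)
    sizes = [counts.get(d, 0) for d in range(n)]  # count empty distributions too
    total = sum(sizes)
    if total == 0:
        raise ValueError("no rows to distribute")
    return max(sizes) / (total / n)


# Hypothetical candidate columns for a 100,000-row fact table.
order_ids = range(100_000)                    # many distinct values
region_codes = ["N", "S", "E", "W"] * 25_000  # only four distinct values

print("order_id skew:   ", round(skew_ratio(order_ids), 2))
print("region_code skew:", round(skew_ratio(region_codes), 2))
```

A column with only a few distinct values can fill only a few distributions, so its ratio is far above 1.0, while a column with many distinct values stays close to 1.0. This is the distinctness consideration mentioned earlier, expressed as a number.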
DDL Extensions
To support the MPP architecture, Parallel Data Warehouse includes a SQL language that
works with appliance databases. This SQL language includes data definition language (DDL)
statements to create and alter databases, tables, views, and other entities on the appliance.
You use these statements to operate on these objects as if they were on a single database
instance. Behind the scenes, Parallel Data Warehouse allocates space for the objects and
instantiates them across nodes.