Specifications

nodes in the appliance. There are performance considerations for the selection of a distribution

column, such as distinctness, data skew, and the types of queries executed on the system. For

a detailed discussion of the choice of distributed tables, refer to the product documentation.

To distribute the rows in the fact table, a hash function assigns each row to one of many

storage locations based on the distribution column. Each compute node has 8 storage locations,

called distributions, for the hashed rows. If a data rack has 8 compute nodes, the data rack has

64 distributions, which are queried in parallel.
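The row-to-distribution assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the appliance's actual hash function is not described here, so MD5 stands in for it, and the names `distribution_for`, `node_for`, and `customer_id` are hypothetical. The counts of 8 distributions per node and 8 compute nodes come from the text.

```python
import hashlib
from collections import Counter

DISTRIBUTIONS_PER_NODE = 8   # from the text
COMPUTE_NODES = 8            # the example data rack from the text
TOTAL_DISTRIBUTIONS = DISTRIBUTIONS_PER_NODE * COMPUTE_NODES  # 64

def distribution_for(value):
    """Map a distribution-column value to a distribution id (0..63).

    Assumption: MD5 is a stand-in for the appliance's real hash function.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % TOTAL_DISTRIBUTIONS

def node_for(distribution_id):
    """Compute node (0..7) that would hold the given distribution."""
    return distribution_id // DISTRIBUTIONS_PER_NODE

# Hypothetical fact-table rows, distributed on customer_id.
rows = [{"customer_id": i, "amount": i * 1.5} for i in range(10_000)]

# Count how many rows land in each distribution.
placement = Counter(distribution_for(r["customer_id"]) for r in rows)
```

Because the hash is deterministic, rows with the same distribution-column value always land in the same distribution, which is why the distinctness of that column matters when you choose it.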

[Figure: a hash function assigns each row of a table to one distribution on the compute nodes, producing a distributed table. Each table row belongs to one distribution.]

FIGURE 6-4 Distributed strategy

It is not essential that equal numbers of table rows are assigned to each distribution. There

will almost always be some data skew among the distributions. If the amount of data skew

becomes too large, the parallel system continues to run, but query times might be affected.

You might have to experiment with several approaches before finding the best distributed

strategy. A distributed strategy does not affect other table options that you might want to

implement. For example, you can still define partitions and clustered indexes as needed.
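One rough way to compare candidate distribution columns while experimenting is to measure the data skew each one produces. The sketch below is an assumption, not a product feature: `skew_ratio` is a hypothetical helper that divides the row count of the fullest distribution by the ideal even share. A value of 1.0 means a perfectly even spread, and larger values mean one distribution holds a disproportionate share of the rows.

```python
from collections import Counter

def skew_ratio(rows_per_distribution, total_distributions):
    """Rows in the fullest distribution divided by the ideal even share.

    Hypothetical metric for illustration; returns 0.0 for an empty table.
    """
    total = sum(rows_per_distribution.values())
    if total == 0:
        return 0.0
    ideal = total / total_distributions
    return max(rows_per_distribution.values()) / ideal

# 6,400 rows spread evenly over 64 distributions.
even = Counter({d: 100 for d in range(64)})

# The same 6,400 rows, with distribution 0 holding more than half of them.
skewed = Counter({d: 50 for d in range(64)})
skewed[0] = 3250

even_ratio = skew_ratio(even, 64)      # 1.0
skewed_ratio = skew_ratio(skewed, 64)  # 32.5
```

The maximum is used rather than the average because the distributions are queried in parallel, so the fullest one is the most likely to hold up a query.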

DDL Extensions

To support the MPP architecture, Parallel Data Warehouse includes a SQL language that

works with appliance databases. This SQL language includes data definition language (DDL)

statements to create and alter databases, tables, views, and other entities on the appliance.

You use these statements to operate on these objects as if they were on a single database

instance. Behind the scenes, Parallel Data Warehouse allocates space for the objects and

instantiates them across nodes.