MPP (massively parallel processing) grows hardware capacity by adding small, preferably cheap, 2 to 4 CPU servers to the existing infrastructure. SMP (symmetric multiprocessing) grows capacity by replacing the current hardware with a bigger, badder, meaner box.
ETL
While Ab Initio does provide hash partition based parallelism capabilities that can be used in MPP environments, the effective use of this approach is a non-trivial undertaking. MPP is rarely used for ETL processing.
Reporting
It is easy to split reporting load into smaller chunks based on user activity. It is quite common to use a cluster of small servers in a load-balanced configuration for reporting. Note that load balancing is not the same as MPP. Load balancing distributes a large number of small tasks to individual servers based on the load each server is currently handling. MPP splits one large task into pieces and distributes those pieces across all the servers.
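The difference between the two can be sketched in a few lines of Python. This is only an illustration: the server names, report names, and cost figures are invented, and the round-robin split in `mpp_split` is used purely for simplicity, not because any particular product works that way.

```python
# Hypothetical illustration: server names and workloads are made up.

def load_balance(tasks, servers):
    """Load balancing: each whole task goes to the least-loaded server."""
    load = {s: 0 for s in servers}
    assignment = {}
    for name, cost in tasks:
        target = min(servers, key=lambda s: load[s])
        assignment[name] = target
        load[target] += cost
    return assignment


def mpp_split(task_rows, servers):
    """MPP: one large task is cut up so every server works on a slice."""
    slices = {s: [] for s in servers}
    for i, row in enumerate(task_rows):
        slices[servers[i % len(servers)]].append(row)
    return slices


servers = ["node_a", "node_b", "node_c"]

# Many small reports: each one runs entirely on a single server.
reports = [("report_1", 5), ("report_2", 3), ("report_3", 4), ("report_4", 2)]
balanced = load_balance(reports, servers)

# One big query over nine rows: every server gets part of the work.
split = mpp_split(list(range(9)), servers)
```

In the first case no single report runs any faster; the cluster simply serves more users. In the second case the one large task can finish sooner because all the servers contribute to it.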
Database
The database technology is where you have more options, and therefore more confusion. It is possible to use either SMP or MPP in real-life situations. Range partitioning, bitmap indexes, and aggregations are simple and well-known techniques that provide acceptable performance for SMP solutions. The hierarchical approach needed for an MPP implementation trades off flexibility for performance and requires more planning, so it should only be adopted when really needed.
| | SMP/NUMA | MPP |
| Synonyms | Scaling up; vertical scaling | Scaling out; horizontal scaling; clustering; grid; data warehouse appliance. Note: data replication to support a larger number of concurrent queries is not MPP |
| Database | Oracle, DB2 | Teradata, DB2 |
| Hardware | Sun, HP, IBM | IBM, Teradata, Linux |
| Server size | Large box: 128 CPUs | Small boxes: 1-4 CPUs |
| Memory / Disk | Shared by all the CPUs | Distributed across the boxes |
| Complexity | Relatively simple | Never trust vendor claims that involve black boxes with magical powers and no transparency into the inner workings |
| Scalability | Upper limit on number of CPUs: keeps increasing with advances in technology | Dependent on an effective approach for partitioning data across smaller boxes |
| Bottlenecks | Memory bus contention | Inter-node network bandwidth |
| Partitioning preferred | Range | Hash |
| Data structure affinity | Star schema | Hierarchical: you have to think hierarchical thoughts to partition data across small boxes for optimizing the performance, even though the database engine might be relational |
| Reality | Good enough for most situations. | Needed once you cross a certain threshold (50TB) |
| Linux hype | Will Linux SMP capability (open solution) grow faster than Linux MPP hype (proprietary solutions)? | MPP does not necessarily imply Linux but when you start talking cheap hardware, Linux cannot be far behind. If Linux cannot be SMP, let us hype it for MPP! Hopefully, it will become more of a reality than hype going forward |
The biggest challenge with the MPP approach is to come up with an effective data partitioning scheme. The data is spread across multiple small boxes called nodes, using a hashing algorithm applied to the partition key defined for the purpose. All the data related to a given partition key value gets assigned to a single node on the MPP cluster.
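A minimal sketch of this kind of placement follows. The function names, the `customer_id` column, the sample rows, and the four-node cluster are all assumptions made for illustration, and CRC32 merely stands in for whatever hashing algorithm a real MPP database uses.

```python
import zlib


def node_for(partition_key, node_count):
    """Map a partition key value to a node number using a stable hash."""
    # CRC32 is an assumption for illustration, not a vendor's algorithm.
    digest = zlib.crc32(str(partition_key).encode("utf-8"))
    return digest % node_count


def distribute(rows, key_column, node_count):
    """Place every row on the node chosen by hashing its partition key."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[node_for(row[key_column], node_count)].append(row)
    return nodes


# Hypothetical sales rows; customer_id is the assumed partition key.
sales = [
    {"customer_id": 101, "amount": 20.0},
    {"customer_id": 102, "amount": 35.5},
    {"customer_id": 101, "amount": 12.0},
    {"customer_id": 103, "amount": 8.25},
    {"customer_id": 102, "amount": 60.0},
]

nodes = distribute(sales, "customer_id", node_count=4)
for number, rows in enumerate(nodes):
    print(number, rows)
```

Because the node is a pure function of the key value, every row for a given customer ends up on the same node. The flip side is that a poorly chosen key, one with few distinct values or a few very frequent ones, leaves some nodes overloaded and others idle.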
Scalability is achieved when the master node pushes most of the processing for loading or querying data down to the individual nodes, so that inter-node communication and the processing load on the master node are minimized. This requires that the join operation for the query can be pushed down to the nodes as well. The join can happen independently on each node when all the data needed for the join is available on that node. This is made possible by arranging our data in a hierarchical structure:
| Possible hierarchy (at least partial) | Totally outside the hierarchy |
| Region (or Sales Organization) > Sales Rep (or Clerk) > Customer Account > Sales Transaction > Transaction Detail | Product; Calendar |
The problem is that it is not always possible to arrange the data in such a neat hierarchy. A lot of energy in optimizing an MPP system is, therefore, spent in formulating optimal partition keys, data partition related maintenance, and dealing with exception data that does not fit into the hierarchy.
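The pushed-down join can be sketched as follows. Everything here is hypothetical: the three-node cluster, the table contents, the helper names, and the use of CRC32 as the hash. The point is only that when two tables are partitioned on the same key, each node can join its own rows without talking to the others.

```python
import zlib

NODE_COUNT = 3  # assumed cluster size


def node_for(key):
    """Stable hash placement; CRC32 is an illustrative stand-in."""
    return zlib.crc32(str(key).encode("utf-8")) % NODE_COUNT


def partition(rows, key_column):
    """Spread rows across nodes by hashing the given key column."""
    nodes = [[] for _ in range(NODE_COUNT)]
    for row in rows:
        nodes[node_for(row[key_column])].append(row)
    return nodes


def local_join(customers, transactions):
    """Join on customer_id using only the rows that were passed in."""
    names = {c["customer_id"]: c["name"] for c in customers}
    return [
        (names[t["customer_id"]], t["amount"])
        for t in transactions
        if t["customer_id"] in names
    ]


customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
    {"customer_id": 3, "name": "Initech"},
]
transactions = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 2, "amount": 25.0},
    {"customer_id": 1, "amount": 7.5},
    {"customer_id": 3, "amount": 40.0},
]

# Both tables use the same partition key, so matching rows are co-located.
customer_nodes = partition(customers, "customer_id")
transaction_nodes = partition(transactions, "customer_id")

# Each node joins its own slice; the master only concatenates the results.
pushed_down = []
for c_part, t_part in zip(customer_nodes, transaction_nodes):
    pushed_down.extend(local_join(c_part, t_part))

# Reference result: the same join done on one box with all the data.
single_box = local_join(customers, transactions)
```

The per-node results add up to the single-box result only because both tables share the partition key. A table keyed on something outside the hierarchy, such as Product, would not line up this way, which is where the exception handling mentioned above comes in.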
IMS was the famous implementation of the hierarchical model for operational OLTP systems. The relational model liberated operational systems from the tyranny of the hierarchy, and the dimensional model does the same for data warehousing systems. As such, starting to think in terms of hierarchical data structures again seems like going backwards.