|
|
Table of Contents |
|
5.5. Automatically Grouping DataYou have seen that finding all the rows together when doing a range scan can be highly beneficial to performance. There are, actually, other means to achieve a grouping of data than the somewhat constraining use of clustering indexes or index-organized tables. All database management systems let us partition tables and indexesan application of the old principle of divide and rule. A large table may be split into more manageable chunks. Moreover, in terms of process architecture, partitioning allows an increased concurrency and parallelism, thus leading to more scalable architectures, as you shall see in Chapters 9 and 10. First of all, beware that this very word, partition, has a different meaning depending on the DBMS under discussion, sometimes even depending on the version of the DBMS. There was a time, long ago, when what is now known as an Oracle tablespace used to be referred to as a partition. 5.5.1. Round-Robin PartitioningIn some cases, partitioning is a totally internal, non-data-driven mechanism. We arbitrarily define a number of partitions as distinct areas of disk storage, usually closely linked to the number of devices on which we want the data to be stored. One table may have one or more partitions assigned to it. When data is inserted, it is loaded to each partition according to some arbitrary method, in a round-robin fashion, so as to balance the load on disk I/O induced by the insertions. Incidentally, the scattering of data across several partitions may very well assist subsequent random searches. This mechanism is quite comparable to file striping over disk arrays. In fact, if your files are striped, the benefit of such a partitioning becomes slight and sometimes quite negligible. Round-robin scattering can be thought of as a mechanism designed only to arbitrarily spread data irrespective of logical data associations, rather than to regroup data on the basis of those natural associations. However, with some products, Sybase being one of them, one transaction will always write to the same partition, thus achieving some business-process-related grouping of data. 5.5.2. Data-Driven PartitioningThere is, however, a much more interesting type of partitioning known as data-driven partitioning . With data-driven partitioning, it is the values, found in one or several columns, that defines the partition into which each row is inserted. As always, the more the DBMS knows about the data and how it is stored, the better. Most really large tables are large because they contain historical data. However, our interest in a particular news event quickly wanes as new and fresher events crowd in to demand our attention, so it is a safe assumption to make that the most-often-queried subset of historical data is the most recent one. It is therefore quite natural to try to partition data by date, separating the wheat from the chaff, the active data from the dormant data. For instance, a manual way to partition by date is to split a large figures table (containing data for the last twelve months) into twelve separate tables, one for each month, namely jan_figures, feb_figures...all the way to dec_figures. To ensure that a global vision of the year is still available for any queries that require it, we just have to define figures as the union of those twelve tables. Such a union is often given some kind of official endorsement at the database level as a partitioned view , or (in MySQL) a merge table . During the month of March, we'll insert into the table mar_figures. Then we'll switch to apr_figures for the following month. The use of a view as a blanket object over a set of similarly structured tables may appear an attractive idea, but it has drawbacks:
Historically, the first step taken by most database management systems towards partitioning has been the support of partitioned views. The next logical step has been support for true data-driven partitioning. With true partitioning , we have a single table at the logical level, with a true primary key able to be referenced by other tables. In addition, we have one or several columns that are defined as the partition key ; their values are used to determine into which partition a row is inserted. We have all the advantages of partitioned views, transparency when operating on the table, and we can push back to the DBMS engine the task of protecting the integrity of the data, which is one of the primary functions of the DBMS. The kernel knows about partitioning, and the optimizer will know how to exploit such a physical structure, by either limiting operations to a small number of partitions (something known as partition pruning ), or by operating on several partitions in parallel. The exact way partitioning is implemented and the number of available options is product-dependent. There are several different ways to partition data, which may be more or less appropriate to particular situations:
The partitioning process can sometimes be repeated with the creation of subpartitions. A subpartition is merely a partition within a partition, giving you the ability to partition against a second dimension by creating, for instance, hash-partitions within a range-partition. Data partitioning is most valuable when it is based on the data values themselves. |
|
|
Table of Contents |
|