Building the Data Warehouse


Two basic techniques are used to trap data as updates occur in the legacy operational environment. One technique is called data replication; the other is called changed data capture, in which the changes that have occurred are pulled out of the log or journal tapes created during online update. Each approach has its pros and cons.

Replication requires that the data to be trapped be identified prior to the update. A trigger is set that causes the update activity to be captured; then, as the update occurs, the data is trapped. One advantage of replication is that the process of trapping can be selectively controlled: only the data that needs to be captured is, in fact, captured. Another advantage is that the format of the data is “clean” and well defined. The content and structure of the trapped data are well documented and readily understandable to the programmer. Replication has two disadvantages. First, extra I/O is incurred as a result of trapping the data. Second, because of the unstable, ever-changing nature of the data warehouse, the system requires constant attention to the definition of the parameters and triggers that control trapping. The amount of I/O required is usually nontrivial. Furthermore, the I/O that is consumed is taken out of the middle of the high-performance day, at the time when the system can least afford it.
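Trigger-based trapping of the kind described above can be sketched in a few lines. The following is a minimal illustration only, assuming an SQLite database accessed through Python's standard sqlite3 module; the table names (account, account_changes), the trigger name, and the sample rows are hypothetical and do not come from the text.

```python
import sqlite3

# In-memory database standing in for the operational system (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, balance REAL);

-- Hypothetical capture table: a clean, well-defined record of each change.
CREATE TABLE account_changes (
    change_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    account_id  INTEGER,
    old_balance REAL,
    new_balance REAL
);

-- Selective trapping: the trigger fires only for updates to balance.
CREATE TRIGGER trap_balance_update
AFTER UPDATE OF balance ON account
BEGIN
    INSERT INTO account_changes (account_id, old_balance, new_balance)
    VALUES (OLD.id, OLD.balance, NEW.balance);
END;
""")

conn.execute("INSERT INTO account VALUES (1, 'Ann', 100.0)")
conn.execute("UPDATE account SET balance = 80.0 WHERE id = 1")    # trapped
conn.execute("UPDATE account SET owner = 'Ann B.' WHERE id = 1")  # not trapped

trapped = conn.execute(
    "SELECT account_id, old_balance, new_balance FROM account_changes"
).fetchall()
print(trapped)
```

Only the balance update reaches account_changes, which illustrates selective control. The cost is also visible: every trapped update performs an additional insert inside the operational transaction, which is the extra I/O noted above.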

The second approach to efficient refreshment is changed data capture (CDC). One way to perform CDC is to read the log or journal tape to capture and identify the changes that have occurred throughout the online day. Reading a log tape is no small matter, however. Many obstacles are in the way, including the following:
