Building the Data Warehouse

The types of data from external sources are many and diverse. Some typical sources of interesting data include the following:

■    Wall Street Journal

■    Business Week

■    Forbes

■    Fortune

■    Industry newsletters

■    Technology reports

■    Dun & Bradstreet (now D&B)

■    Reports generated by consultants specifically for the corporation

■    Equifax reports

■    Competitive analysis reports

■    Marketing comparison and analysis reports

■    Sales analysis and comparison reports

■    New product announcements

In addition, reports internal to the corporation are of interest as well:

■    Auditor’s quarterly report

■    Annual report

■    Consultant reports

In a sense, the clickstream data generated by the Web-based e-business environment is also unstructured. It is recorded at such a low level of detail that it must be reconstituted before it is useful. Clickstream data, then, is merely a sophisticated form of unstructured data.
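One way to picture what "reconstituting" clickstream data might involve is grouping individual clicks into visitor sessions. The sketch below is illustrative only: the record layout (visitor, timestamp in seconds, URL), the 30-minute inactivity gap, and the name build_sessions are all assumptions made for the example, not a method prescribed here.

```python
from collections import defaultdict

# Assumed rule: a gap of more than 30 minutes starts a new session.
SESSION_GAP_SECONDS = 30 * 60


def build_sessions(clicks):
    """Group raw clicks into sessions.

    clicks: iterable of (visitor_id, timestamp_seconds, url) tuples
    (a hypothetical layout for low-level clickstream records).
    Returns a list of dicts, one per reconstructed session.
    """
    by_visitor = defaultdict(list)
    for visitor_id, ts, url in clicks:
        by_visitor[visitor_id].append((ts, url))

    sessions = []
    for visitor_id, events in by_visitor.items():
        events.sort()  # order each visitor's clicks by time
        current = None
        for ts, url in events:
            # Start a new session on the first click or after a long gap.
            if current is None or ts - current["end"] > SESSION_GAP_SECONDS:
                current = {"visitor": visitor_id, "start": ts,
                           "end": ts, "pages": [url]}
                sessions.append(current)
            else:
                current["end"] = ts
                current["pages"].append(url)
    return sessions


clicks = [
    ("a", 0, "/home"),
    ("a", 60, "/product"),
    ("a", 4000, "/home"),   # more than 30 minutes later: new session
    ("b", 10, "/home"),
]
sessions = build_sessions(clicks)
```

The session records, rather than the individual clicks, are the kind of higher-level unit that is more likely to be useful in the warehouse.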

External/Unstructured Data in the Data Warehouse

Several issues relate to the use and storage of external and unstructured data in the data warehouse. The first problem is frequency of availability. Unlike internally generated data, external data has no fixed pattern of appearance. This irregularity is a problem because constant monitoring must be set up to ensure that the right data is captured. For some environments, such as the Internet, monitoring programs can be created and used to build automated alerts.
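A monitoring program of this kind can be very simple at its core. The following sketch assumes that each external item can be given an identifier and that a listing of what is currently available can be obtained somehow; the function name find_new_arrivals and the sample identifiers are hypothetical.

```python
def find_new_arrivals(available_ids, seen_ids):
    """Return alert messages for items not seen before.

    available_ids: identifiers currently visible at the external source
    seen_ids: set of identifiers already captured (updated in place)
    """
    new_ids = sorted(set(available_ids) - seen_ids)
    seen_ids.update(new_ids)  # remember them so they are not reported twice
    return [f"New external data available: {item}" for item in new_ids]


# Hypothetical identifiers, for illustration only.
seen = {"report-q1"}
alerts = find_new_arrivals(["report-q1", "report-q2"], seen)
repeat = find_new_arrivals(["report-q1", "report-q2"], seen)
```

Because external data arrives irregularly, a check like this would have to run repeatedly on a schedule; how often is a judgment that depends on the source.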

The second problem with external data is that it is totally undisciplined. Before it can be placed in the warehouse and put to use, external data must be reformatted into an internally acceptable and usable form. A common practice is to convert the external data as it enters the data warehouse environment: external key data is converted to internal key data, or external data is passed through simple edits, such as a domain check. In addition, the data is often restructured so that it is compatible with internal data.
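The conversion steps just described (key conversion, a domain check, and restructuring) can be sketched in a few lines. Everything specific in this example is an assumption made for illustration: the cross-reference table, the field names ext_id, region, and revenue, the set of valid regions, and the choice to reject a bad record by returning None.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cross-reference from external identifiers to internal keys.
EXTERNAL_TO_INTERNAL_KEY = {"EXT-001": 1001, "EXT-002": 1002}

# Hypothetical domain of acceptable region codes.
VALID_REGIONS = {"NA", "EU", "APAC"}


@dataclass
class InternalRecord:
    """Assumed internal layout that external data is restructured into."""
    customer_key: int
    region: str
    revenue: float


def convert_external_record(raw: dict) -> Optional[InternalRecord]:
    """Convert one external record, or return None if it fails an edit."""
    # Key conversion: external key -> internal key.
    internal_key = EXTERNAL_TO_INTERNAL_KEY.get(raw.get("ext_id"))
    if internal_key is None:
        return None

    # Domain check: region must be one of the known codes.
    region = str(raw.get("region", "")).strip().upper()
    if region not in VALID_REGIONS:
        return None

    # Simple edit: revenue must be a non-negative number.
    try:
        revenue = float(raw.get("revenue"))
    except (TypeError, ValueError):
        return None
    if revenue < 0:
        return None

    # Restructure into the internal layout.
    return InternalRecord(internal_key, region, revenue)


good = convert_external_record(
    {"ext_id": "EXT-001", "region": " eu ", "revenue": "12.5"})
unknown_key = convert_external_record(
    {"ext_id": "EXT-999", "region": "EU", "revenue": "1"})
bad_domain = convert_external_record(
    {"ext_id": "EXT-002", "region": "MARS", "revenue": "1"})
```

In practice, rejected records would more likely be logged or routed for review than silently dropped; returning None simply keeps the sketch short.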

