Saturday, December 12, 2009

Pentaho: A Case in Point

That open source BI can provide a full-fledged solution to an organization's BI needs can be demonstrated by looking at how Pentaho's platform addresses the principal requirements of BI—data integration, reporting, and analysis.

ETTL with Kettle

Pentaho's BI platform implements the Common Warehouse Metamodel (CWM). The CWM, which has also been implemented by proprietary vendors such as Informatica, is a specification that uses XML Metadata Interchange (XMI) to exchange data warehouse metadata. This means that mappings can be migrated between tools that implement the interface. Pentaho's extract, transform, and load (ETL) system is based on its Kettle project; Kettle stands for "Kettle ETTL Environment," where ETTL is Pentaho's acronym for "extraction, transformation, transportation, and loading" of data. The ETL system supports:

- a variety of steps (a step is the smallest unit in a transformation and contains either predefined or custom logic that is applied to each row as it makes its way from source to target);
- slowly changing dimensions (SCDs);
- connectors for a multitude of data sources (access to proprietary databases such as Microsoft SQL Server and Oracle is via Java Database Connectivity [JDBC]);
- the ability to execute and schedule jobs both locally and remotely.

Scripting in JavaScript as well as pure Java allows developers to add custom code in any step of a transformation.
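To make the notion of a step concrete, here is a minimal Java sketch of the idea: logic applied to each row as it flows from source to target. This is illustrative only and does not use Kettle's actual step API; the `Step` interface and `RowStep` class are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only -- not Kettle's actual API. It mimics the core
// idea of a transformation step: per-row logic applied as data flows
// from the source to the target.
public class RowStep {

    // A "row" is just an array of field values here.
    public interface Step {
        Object[] processRow(Object[] row);
    }

    // Example step: upper-case the first field, in the spirit of a
    // user-defined scripting step.
    public static final Step UPPER_FIRST = new Step() {
        public Object[] processRow(Object[] row) {
            Object[] out = row.clone();
            out[0] = ((String) out[0]).toUpperCase();
            return out;
        }
    };

    // The engine's job, reduced to its essence: stream every source row
    // through the step and collect the transformed rows.
    public static List<Object[]> run(List<Object[]> source, Step step) {
        List<Object[]> target = new ArrayList<Object[]>();
        for (Object[] row : source) {
            target.add(step.processRow(row));
        }
        return target;
    }
}
```

In Kettle itself, steps like this are chained into a transformation, and a scripting step lets developers drop in exactly this kind of custom per-row logic.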

Two challenging issues that organizations face are data volume and latency requirements. To support high-volume environments, Pentaho offers a clustering solution (one that uses more than one node, or computing entity, to achieve high performance and availability) that works alongside database partitioning: by using slave servers (a group of servers that perform specific tasks on data sent by a master server) to distribute central processing unit (CPU) and input/output (I/O) load, performance improves through parallelism. Change data capture (CDC), however, is not supported. CDC is a data integration technique that triggers data transfer by listening for changes in the sources, typically by reading database transaction logs; with the exception of open source databases, transaction log readers are seldom open source.
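The master/slave division of labor can be sketched in a few lines of Java. This is a hypothetical illustration of hash-based partitioning, not Pentaho's actual clustering implementation; the `Partitioner` class and its methods are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of how a master might split rows across slave
// servers by hashing a partition key. Names and scheme are illustrative,
// not Pentaho's actual clustering code.
public class Partitioner {

    // Assign a row to one of nSlaves buckets based on its key field.
    public static int slaveFor(String key, int nSlaves) {
        // Math.abs guards against negative hash codes.
        return Math.abs(key.hashCode() % nSlaves);
    }

    // Distribute rows into per-slave batches; each slave then processes
    // its own batch in parallel, spreading CPU and I/O load.
    public static List<List<String>> distribute(List<String> keys, int nSlaves) {
        List<List<String>> batches = new ArrayList<List<String>>();
        for (int i = 0; i < nSlaves; i++) {
            batches.add(new ArrayList<String>());
        }
        for (String key : keys) {
            batches.get(slaveFor(key, nSlaves)).add(key);
        }
        return batches;
    }
}
```

Because the same key always hashes to the same slave, related rows stay together, which is what makes this kind of parallelism compatible with database partitioning.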

Although Pentaho's data integration still lacks a data quality and data cleansing solution, the development of a profiling server (a server dedicated to profiling tasks that help discover aberrations in data; see Distilling Data: The Importance of Data Quality in Business Intelligence) seems to be on the list of imminent improvements. Where the vendor does not support a specific piece of functionality, organizations can turn to complementary open source solutions; the DataCleaner project from eobjects.org, for instance, provides functionality to profile data and monitor data quality. This points to a significant advantage of open source applications: because software is developed by the community and for the community, innovative solutions can be shared quickly and seamlessly.
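A minimal sketch in Java shows the kind of statistics column profiling surfaces. This is in the spirit of what a profiling server or a tool like DataCleaner would report; the `ColumnProfile` class and its fields are illustrative names, not DataCleaner's API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal, illustrative column-profiling sketch -- not DataCleaner's API.
// Profiling scans a column and reports basic statistics that expose
// aberrations such as unexpected nulls or cardinality.
public class ColumnProfile {
    public final int rowCount;
    public final int nullCount;
    public final int distinctCount;

    private ColumnProfile(int rows, int nulls, int distinct) {
        this.rowCount = rows;
        this.nullCount = nulls;
        this.distinctCount = distinct;
    }

    // Scan one column's values and collect the statistics.
    public static ColumnProfile of(List<String> values) {
        int nulls = 0;
        Set<String> distinct = new HashSet<String>();
        for (String v : values) {
            if (v == null) {
                nulls++;
            } else {
                distinct.add(v);
            }
        }
        return new ColumnProfile(values.size(), nulls, distinct.size());
    }
}
```

A real profiler adds pattern analysis, value distributions, and rule-based monitoring on top of counts like these, but the principle is the same: measure the data before trusting it.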
