Saturday, December 12, 2009

Analysis with Mondrian

The structure of enterprise business activities is almost always multidimensional. This is because the content of a business is defined in terms of quantifiable or measurable properties (e.g., sales, inventory, or donations) and qualitative attributes (e.g., students, customers, or products). Each business activity can involve a combination of these quantitative and qualitative entities. Although enterprise systems may store incoming activities in a relational format, analyzing and gaining insight into the entire business requires a highly responsive, multidimensional environment.

Online analytical processing (OLAP), still a growing field in terms of research and development, refers to a way of storing and querying very large volumes of data across multiple dimensions. The relative merits of multidimensional OLAP (MOLAP) and relational OLAP (ROLAP) still evoke vigorous debate, but the choice depends on the nature of the data, the acceptable latency, and the available resources (both hardware and software). For instance, ROLAP may provide a better solution for data that is dimension-intensive, or in situations where new data must be available for analysis with very low latency, close to real time. On the other hand, MOLAP may be better suited to large sets of aggregations and more lenient latency requirements. In either case, adherence to sound design principles is essential for creating a successful OLAP solution.
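
To make the idea of measures and dimensions concrete, here is a minimal sketch in plain Python. It is not how Mondrian or any other OLAP engine works internally; the sample records and the `roll_up` helper are hypothetical and only illustrate summing one measure across whichever dimensions are chosen.

```python
from collections import defaultdict

# Hypothetical fact records: dimension values (year, region, product)
# paired with one measure (sales).
FACTS = [
    {"year": 2008, "region": "East", "product": "Books", "sales": 120.0},
    {"year": 2008, "region": "West", "product": "Books", "sales": 80.0},
    {"year": 2009, "region": "East", "product": "Music", "sales": 50.0},
    {"year": 2009, "region": "East", "product": "Books", "sales": 70.0},
    {"year": 2009, "region": "West", "product": "Music", "sales": 30.0},
]


def roll_up(facts, dimensions, measure):
    """Sum `measure` for every combination of the chosen dimensions."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# The same facts viewed at three different levels of detail.
by_year = roll_up(FACTS, ["year"], "sales")
by_year_region = roll_up(FACTS, ["year", "region"], "sales")
grand_total = roll_up(FACTS, [], "sales")
```

Every additional dimension in the list produces a finer-grained view of the same measure, which is why the work grows quickly once the fact set reaches millions of rows.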

Pentaho's answer to the question of multidimensional analysis is a ROLAP engine called Mondrian. The most important aspect of OLAP is how and where aggregations are stored. In a ROLAP environment, as with Mondrian, both data and aggregations are stored in a relational database, with precomputed aggregates kept in tables alongside the base fact tables. Such aggregate structures are necessary to avoid calculating over millions of fact records for each query. These tables are not built by the analytical engine itself; they have to be populated using an ETL-style process. Pentaho offers a tool called Aggregation Designer that helps create and maintain aggregate tables. In addition, Mondrian includes an in-memory aggregate cache that saves multidimensional result sets on first access for use in subsequent calculations, and it exposes the CacheControl application programming interface (API) for fine-grained control over that cache.
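
The aggregate-table idea can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table and column names (`sales_fact`, `agg_sales_year_region`, `fact_count`) are hypothetical, the "ETL" step is a single `CREATE TABLE ... AS SELECT`, and nothing here uses Mondrian or Aggregation Designer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base fact table (hypothetical schema and data).
    CREATE TABLE sales_fact (
        year INTEGER, region TEXT, product TEXT, amount REAL
    );
    INSERT INTO sales_fact VALUES
        (2008, 'East', 'Books', 120.0),
        (2008, 'West', 'Books',  80.0),
        (2009, 'East', 'Music',  50.0),
        (2009, 'East', 'Books',  70.0),
        (2009, 'West', 'Music',  30.0);

    -- ETL-style step: precompute totals at the (year, region) grain
    -- and store them in a table next to the fact table.
    CREATE TABLE agg_sales_year_region AS
        SELECT year, region,
               SUM(amount) AS amount,
               COUNT(*)    AS fact_count
        FROM sales_fact
        GROUP BY year, region;
""")


def sales_by_year(table):
    """Total sales per year, read from the given table."""
    query = f"SELECT year, SUM(amount) FROM {table} GROUP BY year ORDER BY year"
    return conn.execute(query).fetchall()


# Same answer either way; the aggregate table just has fewer rows to scan.
from_facts = sales_by_year("sales_fact")
from_aggregate = sales_by_year("agg_sales_year_region")
agg_rows = conn.execute("SELECT COUNT(*) FROM agg_sales_year_region").fetchone()[0]
```

With five sample rows the saving is trivial, but the same pattern lets a yearly total be answered from a few thousand aggregate rows instead of millions of fact rows. The aggregate must be rebuilt or updated whenever the fact table changes, which is the maintenance cost the paragraph above refers to.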

Organizations can choose from several approaches to provide a client tool for multidimensional analysis. A complementary open source project called JPivot offers a pivot-table client, implemented as a JavaServer Pages (JSP) tag library, for browsing cubes created using Mondrian. Mondrian also provides a Multidimensional Expressions (MDX) interface (note that its dialect is not entirely the same as Microsoft's implementation of MDX). Developers can write in-house applications using olap4j (OLAP for Java), an open specification being developed by several open source companies including Pentaho, JasperSoft, and LucidEra.
