Thinking of moving to a cloud-based data warehouse platform? Here's what you need to evaluate and how to do it.
With advancements in cloud-based databases, self-service analytics, cloud-to-cloud ETL, and data mashup, business intelligence technology has reached a point where moving to the cloud is not only viable but often the most cost-effective option. Yet the technological landscape has many players and can be difficult to navigate. Because procuring a cloud instance and getting started is so easy, some teams are tempted to skip the rigor they would apply when deciding which on-premise tools to acquire. This post is a handy guide to evaluating possible cloud solutions and walks through the objectives you should keep in mind while doing so.
Setting Objectives
When evaluating anything, clearly defined objectives must be set at project inception. The results of the evaluations can then be measured against these objectives to determine the best fit for the organization. The objectives in implementing a cloud-based business intelligence platform can, and will, differ by company. For example, some companies will emphasize self-service capabilities while treating the speed of data ingestion as less important; others might value cost savings over functionality. However, there are common objectives that tend to show up in every BI cloud initiative.
As stated, these are typically weighted differently by each company. Nevertheless, these are the general themes that run through a cloud business intelligence implementation.
What to Evaluate?
Once objectives are set, the work then falls to evaluating which set of tools and architecture paradigms can best support the objectives. One large benefit of leveraging cloud-based technologies is that it is cheap and easy to take the technology for a short test run. An environment can be procured inexpensively, and many software vendors offer trial periods that can be used to run a proof of concept. Below are the elements that should be evaluated.
| Element | Description |
| --- | --- |
| Cloud Data Warehouse | Cloud-hosted database that will be used to house the data |
| ETL/ELT Tool | The software that will be used to transform data and prepare it for reporting |
| Data Model Paradigm | Does storing the data in a star schema or a data lake paradigm yield better performance? |
| Data Analytics Tool | Software used for performing data analytics and visualization as well as performing data mashup |
This post will go into each element and detail an evaluation approach for each.
Pre-evaluation
Before an evaluation, there is some work that needs to be done.
Selecting the Tools to Be Evaluated
There are a lot of tools to evaluate. However, an evaluation takes time, and running one for every tool choice gets expensive. It is easy to stack the deck in your favor by doing some research, such as reading user groups and Gartner's recommendations, to pick out the tools most likely to meet the objectives and evaluating those first. That way, if a tool meets all the necessary criteria, it can be selected without running additional evaluations. At least three tools are needed to meet the typical objectives set for these types of projects. Below is a table of what they are and what to consider pre-evaluation.
| Tool | Description | What to Consider |
| --- | --- | --- |
| Cloud Data Warehouse | Cloud-hosted database that will be used to house the data | |
| ETL/ELT Tool | The software that will be used to transform data and prepare it for reporting | |
| Data Analytics Tool | Software used for performing data analytics and visualization as well as performing data mashup | |
Taking time to properly select the tools to be evaluated will save time and cost in the end.
Selecting Repeatable Tests
This is covered in more detail in the evaluation sections below, but before starting it is important to have a set of tests that everyone involved in the project agrees on. This ensures that if multiple evaluations are run, they are easily repeatable, and further that all tools are evaluated to the same standard.
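Agreed-upon tests are easiest to keep repeatable when they are written down as data rather than described informally. A minimal sketch in Python follows; all table names, queries, and field choices are illustrative assumptions, not prescriptions from any specific tool.

```python
from dataclasses import dataclass

@dataclass
class EvaluationTest:
    """One test every candidate tool must run identically."""
    name: str      # human-readable label for the result report
    category: str  # e.g. "query_performance", "ingestion"
    sql: str       # the statement to execute against each candidate
    runs: int = 3  # repeat to smooth out timing variance

# A shared suite like this keeps every evaluation to the same standard.
TEST_SUITE = [
    EvaluationTest(
        name="simple filter",
        category="query_performance",
        sql="SELECT COUNT(*) FROM sales WHERE region = 'EMEA'",
    ),
    EvaluationTest(
        name="multi-join aggregate",
        category="query_performance",
        sql=(
            "SELECT d.year, SUM(f.amount) FROM fact_sales f "
            "JOIN dim_date d ON f.date_key = d.date_key GROUP BY d.year"
        ),
    ),
]
```

Because the suite is plain data, it can be versioned and re-run unchanged against each tool under evaluation.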
Cloud Data Warehouse Evaluation
Perhaps the most important tool to evaluate, the cloud data warehouse will house all the data leveraged by the platform. It should be tested for ingestion, throughput, scalability, ease of administration, and performance. Typical tools used for this are Amazon Redshift and Microsoft Azure SQL Data Warehouse.
It should also be noted that modern columnar indexing and structuring techniques should be examined, as these features become important with larger data sets.
How to Evaluate
Identifying the optimal database configuration is an iterative process, and it is important to ensure that the evaluation is performed on the best possible scenario. To accomplish this, load a large data set (several dozen gigabytes to start) across a mixture of different table types. Then construct three to five queries against the data set, varying in complexity from low to high, and execute them to create a benchmark. Load can then be simulated on the server and the findings benchmarked again. This process is repeated until an optimal configuration is attained, and that configuration is what gets evaluated.
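The benchmarking step above can be sketched as a small timing harness. This is a hypothetical helper: it assumes your database driver exposes some callable that executes SQL (for example, a wrapper around a DB-API cursor's `execute`), which is not specified in the original process.

```python
import statistics
import time

def benchmark_query(run_query, sql, runs=3):
    """Execute `sql` several times via `run_query` and return the
    median elapsed seconds. `run_query` is whatever callable your
    database driver provides; it is an assumption of this sketch."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query(sql)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

In use, run the three to five benchmark queries before and after each configuration change and compare the medians; the median is less sensitive to one-off cache or network effects than a single run.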
Key Metrics for Evaluating a Cloud Data Warehouse
| Evaluation Criteria | Description |
| --- | --- |
| Total Cost of Ownership | Monthly/annual costs, maintenance costs, associated work activities |
| Query Performance | Return times on the optimized configuration benchmark for differing query complexities |
Query return times on the optimized configuration benchmark, for queries of differing complexity |
| Data Ingestion Performance | When data is loaded into the system, how long does it take to load? |
ETL/ELT Tool Evaluation
An ETL, or data preparation, tool should be evaluated on ease of development, performance, total cost of ownership, and maintenance effort (the key metrics detailed later in this section).
To accomplish this, design and develop several ETL jobs of varying complexity. Mapping complexity is determined by factors such as the number of sources involved, the transformation logic required, and whether the load is a simple replication or an incremental update.
Based on these criteria, a set of mappings should be developed in the evaluation tool, ranging in complexity from simple data integration jobs, such as object replication, to complex data warehouse mappings, such as a periodic snapshot fact that resolves multiple dimension keys or performs incremental updates on a warehouse table.
How to Evaluate
First, the ease of use and flexibility of the tool should be evaluated by building the set of mappings of varying complexity, recording the effort needed for each. Second, the tool's ability to support multiple database platforms should be evaluated by changing an external source of a mapping and measuring the effort needed to accommodate the new source in the code. Finally, the tool's scheduling and monitoring capabilities should be evaluated by setting up run schedules based on external triggers, such as processes invoked from outside the ETL platform or an external batch process.
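One of the complex mappings described above, an incremental update (upsert) on a warehouse table, can be modeled in plain Python to make the expected behavior concrete. This is only a sketch of the logic a candidate ETL tool should make easy to express; the lists of dicts stand in for tables, and all field names are illustrative.

```python
def incremental_update(target, changes, key="id"):
    """Merge changed rows into a target table (upsert).

    Rows in `changes` whose key exists in `target` overwrite the
    matching columns; rows with new keys are appended. `target` and
    `changes` are lists of dicts standing in for database tables.
    """
    merged = {row[key]: row for row in target}
    for row in changes:
        # Overlay the changed columns on any existing row for this key.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())

# Example: one updated row, one brand-new row.
warehouse = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}]
delta = [{"id": 2, "amount": 25.0}, {"id": 3, "amount": 30.0}]
refreshed = incremental_update(warehouse, delta)
```

An ETL tool that handles this pattern with a built-in merge or update strategy, rather than hand-written logic like the above, will score better on the ease-of-development metric.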
Key Metrics for ETL Tool Evaluation
| Evaluation Criteria | Description |
| --- | --- |
| Ease of Development | Duration of development effort and level of skills required for development |
| Total Cost of Ownership | Monthly/annual costs, maintenance costs, associated work activities |
| ETL Performance | Duration of ETL run times |
| Maintenance Effort | Amount of effort to maintain system and code |
Data Model Paradigm Evaluation
Business intelligence tools have evolved to the point where different data models are available to address reporting needs. Whereas in years past traditional Kimball- or Inmon-style structures, such as stars and snowflakes, were the only way to get reliable and fast reporting performance, the current business intelligence landscape provides the tools to deliver reliable and fast reporting via different models. As part of this project, differing data model approaches will be tested to determine and verify the proper approach.
How to Evaluate
There are several steps to executing an evaluation of these data models. A star schema should be set up with a mix of complex and simple data. At the same time, the tables that comprise the source for the star should be brought over into a data lake area. A report containing similar data should be designed and built, then tested against both data paradigms. Likewise, these data structures should be refreshed from the source to measure the performance of a data refresh. Maintenance of the objects should be projected based on likely cases. Lastly, these data structures should be combined with data from another source to evaluate how easily they can be federated.
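The star-versus-lake comparison can be prototyped at small scale before committing warehouse time. The sketch below uses an in-memory SQLite database to time the same business question against a star shape (fact plus dimension join) and a denormalized "lake-style" wide table; all table and column names are illustrative, and real timings would of course come from the cloud warehouse itself.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_key INTEGER, amount REAL);
    CREATE TABLE lake_sales (category TEXT, amount REAL);
""")
# Load the same data into both shapes.
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(i, f"cat{i % 5}") for i in range(100)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(i % 100, float(i)) for i in range(10_000)])
conn.executemany("INSERT INTO lake_sales VALUES (?, ?)",
                 [(f"cat{(i % 100) % 5}", float(i)) for i in range(10_000)])

def timed(sql):
    """Run a query and return (elapsed_seconds, rows)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return time.perf_counter() - start, rows

star_t, star_rows = timed(
    "SELECT d.category, SUM(f.amount) FROM fact_sales f "
    "JOIN dim_product d ON f.product_key = d.product_key "
    "GROUP BY d.category")
lake_t, lake_rows = timed(
    "SELECT category, SUM(amount) FROM lake_sales GROUP BY category")
# Both queries answer the same question; compare star_t vs lake_t
# after verifying the result sets match.
```

The important discipline is that both paradigms are asked the identical business question against identical data, so any timing difference reflects the model, not the workload.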
Key Metrics for Data Paradigm Evaluation
| Evaluation Criteria | Description |
| --- | --- |
| Ease of Development | Duration of development effort and level of skills required for development |
| Total Cost of Ownership | Monthly/annual costs, maintenance costs, associated work activities |
| Data Refresh Performance | How long does data ingestion and required transformation take? |
| Ease of Ad hoc Use | Level of effort needed to build a report |
| Report Performance | How long does the report take to run? |
| Projected Maintenance | How much effort will the data elements take to maintain? |
Data Analytics / Mashup Evaluation
Data mashup and analytics can be leveraged to let users self-serve their own data needs without intervention or large projects to facilitate basic reporting. There are three important scenarios to test in this evaluation: creating a new data source, combining two data sources of different granularity, and joining two data sources from different source systems that share the same granularity. These are detailed below.
How to Evaluate
Run evaluations of each of the three scenarios, measuring the tool against the key metrics listed at the end of this section.
To test the new data source creation scenario, use one of the tables within the data lake created for the data paradigm evaluation to create a new data source with two calculations, then build a basic columnar report containing these calculated columns.
To test the differing-granularity scenario, create a report leveraging data from two sources with different granularities, containing one calculation that uses both data sources.
Lastly, to test the same-granularity scenario, take data from two separate source systems at the same grain and create a report containing data from both sources, including one calculation that leverages columns from both.
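The same-grain mashup scenario can be sketched with pandas, which many analytics tools use under the hood for this kind of blending. All source names and figures below are illustrative assumptions; the point is the one-to-one join on the shared grain plus a calculation drawing a column from each side.

```python
import pandas as pd

# Two sources from different systems, both at monthly grain.
sales = pd.DataFrame({"month": ["2023-01", "2023-02"],
                      "revenue": [120_000.0, 135_000.0]})
budget = pd.DataFrame({"month": ["2023-01", "2023-02"],
                       "target": [100_000.0, 150_000.0]})

# Same grain, so the join is one row to one row on the shared key.
mashup = sales.merge(budget, on="month")

# The cross-source calculation uses one column from each source.
mashup["pct_of_target"] = mashup["revenue"] / mashup["target"]
print(mashup)
```

A mashup tool that requires this much hand work scores poorly on ease of use; the evaluation measures how close the tool gets to doing the join and calculation through its interface alone.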
Key Metrics for Data Analytics / Mashup Evaluation
| Evaluation Criteria | Description |
| --- | --- |
| Ease of Use | Duration of report building effort and level of skills required for developing the reports |
| Total Cost of Ownership | Monthly/annual costs, maintenance costs, associated work activities |
| Report Performance | How long does the report take to run? |
How can KPI help with your evaluation efforts?
Running an evaluation is important and can result in real cost savings for organizations. Yet running an effective and efficient evaluation requires a specialized skill set and a depth of experience. KPI has both the breadth of technological expertise to assist across many tools and the depth of experience to direct an evaluation to an efficient outcome. KPI has a pre-built, detailed evaluation plan that outputs a set of deliverables to help customers make the best decision for their organization at the lowest cost and least amount of risk.
For more information, email us at info@kpipartners.com
Ron Cruz is a Solution Architect and Project Management Professional (PMP) at KPI Partners, specializing in business intelligence tools and applications. Check out Ron's blog at KPIPartners.com.