Evaluating the Cloud for Your Data Warehouse


by Ron Cruz

Thinking of moving to a cloud-based data warehouse platform? Here's what you need to evaluate and how to go about it.

 

With advancements in cloud-based databases, self-service analytics, cloud-to-cloud ETL, and data mashup, business intelligence technology has reached a point where moving to the cloud is not only a viable option but can at times be the most cost-effective one. Yet the technological landscape has many different players and can be difficult to navigate. Because procuring a cloud instance and beginning work is so easy, some teams fall into the temptation of applying less rigor to this decision than they would when deciding which on-premise tools to acquire. This post is a handy guide to evaluating possible solutions and walks you through the objectives you should keep in mind while evaluating cloud solutions.

 

Setting Objectives

When evaluating anything, clearly defined objectives must be set from project inception. The results of the evaluation can then be measured against these objectives to determine the best fit for the organization. The objectives for implementing a cloud-based business intelligence platform can, and will, differ by company. For example, some companies will emphasize self-service capabilities while treating the speed of data ingestion as less important; others might value cost savings over functionality. However, there are common objectives that tend to show up in every BI cloud initiative.

  • System stability
  • Ease of reporting on multiple data sources
  • Data ingestion at a performance level that meets requirements
  • Ability to support self-service
  • Reporting performance
  • Total Cost of Ownership (TCO) savings

As stated, these are typically weighted differently by each company. Nevertheless, these are the general themes that run through a cloud business intelligence implementation.

 

What to Evaluate?

Once objectives are set, the work turns to evaluating which set of tools and architecture paradigms can best support them. One large benefit of leveraging cloud-based technologies is that it is cheap and easy to take the technology on a short test run. An environment can be procured inexpensively, and many software vendors offer trial periods that can be leveraged to run a proof of concept. Below are the elements that should be evaluated.

 

  • Cloud Data Warehouse: the cloud-hosted database that will house the data
  • ETL/ELT Tool: the software that will transform data and prepare it for reporting
  • Data Model Paradigm: does storing the data in a star schema or a data lake paradigm yield better performance?
  • Data Analytics Tool: software used for data analytics and visualization as well as data mashup

 

This post will go into each evaluated element and detail an approach of evaluation for each.

Pre-evaluation

 

Before an evaluation, there is some work that needs to be done.

 

Selecting the Tools to Be Evaluated

There are a lot of tools that could be evaluated, but an evaluation takes time, and running one for every tool choice gets expensive. It is easy to stack the deck in your favor by doing some research (reading user groups and Gartner's recommendations) to pick out the tools most likely to meet the objectives and evaluating those first. This way, if a tool meets all the necessary criteria, it can be selected without running additional evaluations. At least three tools are needed to meet the typical objectives set for these types of projects. Below is a table of what they are and what to consider pre-evaluation.

 

Cloud Data Warehouse: the cloud-hosted database that will house the data. Consider:

  • Security compliance of the vendor
  • Does the tool have a proven ability to perform on large data sets?
  • How does the tool scale?
  • Projected costs to support performance requirements
  • Gartner rating

ETL/ELT Tool: the software that will transform data and prepare it for reporting. Consider:

  • Native connectivity to sources and targets
  • Can it handle both cloud and non-cloud sources?
  • Ease of development
  • Does the tool have a proven record for transforming large data sets?
  • Projected cost to support the data requirements
  • Gartner rating

Data Analytics Tool: software used for data analytics and visualization as well as data mashup. Consider:

  • Native connectivity to sources
  • Ease of use
  • How does it mash up data? Can users do it themselves?
  • What formats can it export to?
  • How is security implemented in the tool?
  • Does it have a proven performance record on reports with large data sets?
 

Taking time to properly select the tools to be evaluated will save time and cost in the end.

Selecting Repeatable Tests

This is covered in more detail in the evaluation sections below, but before starting it is important to have a set of tests that everyone involved in the project agrees to. Doing this ensures that if multiple evaluations are run, they are easily repeatable and that all tools are evaluated against the same standard.
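One lightweight way to make the test suite repeatable is to record each agreed-upon test as data rather than as tribal knowledge. The sketch below is illustrative; the test names and thresholds are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvaluationTest:
    name: str            # what the test measures
    category: str        # e.g. "ingestion", "query", "self-service"
    max_seconds: float   # agreed performance threshold

# The same list is run, unchanged, against every candidate tool so
# that results are directly comparable across evaluations.
TEST_SUITE = [
    EvaluationTest("load_sales_history", "ingestion", max_seconds=3600),
    EvaluationTest("simple_aggregate_query", "query", max_seconds=5),
    EvaluationTest("five_way_join_query", "query", max_seconds=60),
]

def passed(test: EvaluationTest, observed_seconds: float) -> bool:
    """Did the observed timing meet the agreed threshold?"""
    return observed_seconds <= test.max_seconds
```

Keeping thresholds in one shared structure means a disagreement about "fast enough" is settled once, before any vendor demo biases the discussion.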

 

Cloud Data Warehouse Evaluation

Perhaps the most important tool to evaluate, the cloud data warehouse will house all the data leveraged in the platform. It should be tested for ingestion, throughput, scalability, ease of administration, and performance. Typical tools in this space are Amazon Redshift and Microsoft's Azure SQL Data Warehouse.

It should also be noted that modern columnar indexing and storage techniques should be examined, as these features become important with larger data sets.

 

How to Evaluate

Identifying the optimal database configuration is an iterative process, and it is important to ensure the evaluation was performed on the best possible setup. To accomplish this, load a large data set (several dozen gigabytes to start) across a mixture of different table types. Then construct 3 to 5 queries on the data set, varying in complexity from low to high, and execute them to create a benchmark. Load can then be simulated on the server and the findings benchmarked again. This process is repeated until an optimal configuration is attained and can be evaluated.
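The benchmark step above can be sketched as a small harness that runs the same queries and records timings. SQLite stands in here for the cloud warehouse so the sketch is self-contained; in a real evaluation you would swap in the vendor's Python driver and your own data set.

```python
import sqlite3
import time

# Stand-in for the cloud warehouse; replace with the vendor's connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("west", 20.0), ("east", 5.0)])

# Queries of increasing complexity; identical SQL is reused on every
# configuration iteration so benchmarks stay comparable.
BENCHMARK_QUERIES = {
    "low": "SELECT COUNT(*) FROM sales",
    "medium": "SELECT region, SUM(amount) FROM sales GROUP BY region",
}

def run_benchmark(connection, queries):
    """Execute each query and record its wall-clock duration."""
    timings = {}
    for name, sql in queries.items():
        start = time.perf_counter()
        connection.execute(sql).fetchall()
        timings[name] = time.perf_counter() - start
    return timings

results = run_benchmark(conn, BENCHMARK_QUERIES)
```

Re-running the same harness after each configuration change gives the before/after comparison the iterative process depends on.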

 

Key Metrics for Evaluating a Cloud Data Warehouse


 

  • Total Cost of Ownership: monthly/annual costs, maintenance costs, and associated work activities
  • Query Performance: benchmark results on the optimized configuration for queries of differing complexity
  • Data Ingestion Performance: how long data takes to load into the system

 

ETL/ELT Tool Evaluation

An ETL, or data preparation, tool should be evaluated on the following points:

  • Overall level of effort for development
  • Ability to incorporate changes to source systems quickly and easily
  • Scheduling and administrative capabilities

To accomplish this, the ETL/ELT tool should be evaluated by designing and developing several ETL jobs of varying complexity. Mapping complexity is determined by factors like:

  • Complexity of transformations
  • Sources to extract data from
  • Data volumes to be processed
  • Data mashups to be performed
  • Heterogeneous database platform support

Based on the above criteria, a set of mappings should be developed in the evaluation tool, with complexity ranging from simple data integration jobs (such as replicating an object) to complex data warehouse mappings (such as a periodic snapshot fact resolving multiple dimension keys, or incremental updates on a warehouse table).
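The "incremental updates on a warehouse table" mapping boils down to an upsert: update rows whose business key already exists, insert the rest. A minimal sketch, using SQLite as a stand-in for the warehouse and a hypothetical `dim_customer` table; ETL tools generate equivalent merge logic for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Acme')")

def incremental_update(connection, changed_rows):
    """Upsert changed source rows into the target table by primary key."""
    # INSERT OR REPLACE updates rows whose key exists and inserts new ones.
    connection.executemany(
        "INSERT OR REPLACE INTO dim_customer (customer_id, name) VALUES (?, ?)",
        changed_rows,
    )

# A change feed from the source: customer 1 renamed, customer 2 is new.
incremental_update(conn, [(1, "Acme Corp"), (2, "Globex")])
rows = conn.execute(
    "SELECT customer_id, name FROM dim_customer ORDER BY customer_id"
).fetchall()
```

When comparing tools, the question is how much of this merge logic the tool builds for you versus how much must be hand-coded and maintained.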

 

How to Evaluate

First, the ease of use and flexibility of the tool should be evaluated by building a set of mappings of varying complexity, recording the effort needed to build each one. Second, the tool's ability to support multiple database platforms should be evaluated by changing an external source of a mapping and measuring the effort needed to accommodate the new source in the code. Finally, the tool's scheduling and monitoring capabilities should be evaluated by setting up run schedules based on external triggers, such as invoking processes from outside the ETL platform or from an external batch process.

 

Key Metrics for ETL Tool Evaluation

  • Ease of Development: duration of development effort and level of skill required
  • Total Cost of Ownership: monthly/annual costs, maintenance costs, and associated work activities
  • ETL Performance: duration of ETL run times
  • Maintenance Effort: amount of effort to maintain the system and code

 

Data Model Paradigm Evaluation

Business intelligence tools have evolved to the point where different data models are available to address reporting needs. Whereas in years past, traditional Kimball- or Inmon-style structures such as stars and snowflakes were the only way to get reliable, fast reporting performance, the current business intelligence landscape has the tools to provide that performance via different models. As part of this project, differing data model approaches should be tested to determine and verify the proper approach.

How to Evaluate

There are several steps to executing an evaluation of these different data models. A star schema should be set up with a mix of complex and simple data; at the same time, the tables that comprise the source for the star should be brought over into a data lake area. A report containing similar data should be designed, built, and then tested against both data paradigms. Similarly, these data structures should be refreshed from the source to measure the performance of a data refresh. Maintenance of the objects should be projected based on likely cases. Lastly, these data structures should be combined with data from another source to evaluate how easily they can be federated.
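The side-by-side report test can be sketched as follows: the same report is run against a star-schema layout and a raw "data lake" style layout, and both results and timings are compared. SQLite again stands in for the warehouse; the table and column names are illustrative.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Star: a fact table with a pre-resolved dimension key.
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE fact_sales (product_key INTEGER, amount REAL)")
# Lake: the raw source table, untransformed.
conn.execute("CREATE TABLE raw_sales (product_name TEXT, amount REAL)")

conn.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
conn.execute("INSERT INTO fact_sales VALUES (1, 9.5)")
conn.execute("INSERT INTO raw_sales VALUES ('Widget', 9.5)")

def timed(connection, sql):
    """Run a report query, returning its rows and wall-clock duration."""
    start = time.perf_counter()
    rows = connection.execute(sql).fetchall()
    return rows, time.perf_counter() - start

star_rows, star_seconds = timed(conn,
    "SELECT p.name, SUM(f.amount) FROM fact_sales f "
    "JOIN dim_product p ON p.product_key = f.product_key GROUP BY p.name")
lake_rows, lake_seconds = timed(conn,
    "SELECT product_name, SUM(amount) FROM raw_sales GROUP BY product_name")
# Both paradigms should yield the same report data; the timings and the
# effort to build each layout are what differ.
```

At realistic volumes the timing gap between the two layouts, not the tiny one here, is the number that feeds the evaluation scorecard.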

 

Key Metrics for Data Paradigm Evaluation

  • Ease of Development: duration of development effort and level of skill required
  • Total Cost of Ownership: monthly/annual costs, maintenance costs, and associated work activities
  • Data Refresh Performance: how long data ingestion and required transformations take
  • Ease of Ad Hoc Use: level of effort needed to build a report
  • Report Performance: how long the report takes to run
  • Projected Maintenance: how much effort the data elements will take to maintain

 

Data Analytics / Mashup Evaluation

Data mashup and analytics can be leveraged to let users self-serve their own data needs without intervention or large projects to facilitate basic reporting. There are three important scenarios to test in this evaluation: creating a new data source, combining two data sources of different granularity, and joining two data sources from different source systems that share the same granularity. These are detailed below.

 


 

How to Evaluate

Run evaluations of each of the three scenarios. The tool should be measured against the following metrics:

  • Total time to complete task
  • Ease of use (low, medium, high)
  • Performance of report
  • Correctness of data (did it take multiple attempts to get the correct data?)

To facilitate the new data source creation scenario, one of the tables within the data lake created for the data paradigm evaluation can be used to create a new data source with two calculations and a basic columnar report containing those calculated columns.

To facilitate the differing-granularity scenario, create a report leveraging data from two sources with different granularities, containing one calculation that draws on both sources.

Lastly, to facilitate the joining of two different data sources, take data from two separate source systems at the same grain and create a report containing data from both. The report should contain one calculation that leverages columns from both data sources.
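The third scenario can be illustrated with a toy mashup: two sources at the same grain (here, day and region) are joined, and one calculation spans both. The source names, columns, and the return-on-spend metric are hypothetical; in practice the analytics tool performs this join through its own interface.

```python
# Source A: daily sales by region (e.g. from the warehouse).
sales = {("2024-01-01", "east"): 100.0, ("2024-01-01", "west"): 80.0}
# Source B: daily marketing spend by region (e.g. a flat-file export).
spend = {("2024-01-01", "east"): 25.0, ("2024-01-01", "west"): 40.0}

# Mashup: join on the shared (date, region) grain and derive one
# calculation that needs columns from both sources.
report = [
    {"date": day, "region": region, "sales": amount,
     "spend": spend[(day, region)],
     "return_on_spend": amount / spend[(day, region)]}
    for (day, region), amount in sales.items()
    if (day, region) in spend  # keep only keys present in both sources
]
```

The evaluation question is how easily a business user, not a developer, can express this join and calculation inside the tool.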

 

Key Metrics for Data Analytics / Mashup Evaluation

  • Ease of Use: duration of report-building effort and level of skill required to develop the reports
  • Total Cost of Ownership: monthly/annual costs, maintenance costs, and associated work activities
  • Report Performance: how long the report takes to run

 

How can KPI help with your evaluation efforts?

Running an evaluation is important and can result in real cost savings for organizations. Yet running an effective and efficient evaluation requires a specialized skill set and depth of experience. KPI has both the breadth of technological expertise to assist across many tools and the depth of experience to direct an evaluation to an efficient outcome. KPI has a pre-built, detailed evaluation plan that outputs a set of deliverables to help customers make the best decision for their organization at the lowest cost and with the least risk.

For more information, email us at info@kpipartners.com

 

Ron Cruz is a Solution Architect and Project Management Professional (PMP) at KPI Partners, specializing in business intelligence tools and applications.  Check out Ron's blog at KPIPartners.com.

