If you are not familiar with Synapse Analytics (formerly Microsoft Azure SQL Data Warehouse), it is a massively parallel processing (MPP) system. MPP systems have been around for a long time, but traditionally they required very expensive, large hardware investments. Providers in this space include Teradata, Netezza, and Microsoft with its Analytics Platform System offering. In this model, the data in the tables is distributed across the nodes, and the results are combined on the main, or control, node. It is a model that is optimized for large-scale data loading as well as for reporting.
As you know, data warehouses are divided into fact and dimension tables. Fact tables are commonly generated in a transactional system (think of the point of sale) and then loaded into the data warehouse. Dimension tables contain attributes such as dates or product names. These may change infrequently and are usually much smaller than fact tables.
In Synapse, fact tables are distributed across nodes using a hash column, while smaller dimensions are replicated to all nodes and larger dimensions use the same hash distribution. The objective is to minimize the movement of data between nodes: while reading data from the nodes is extremely fast, cross-node look-ups are very expensive, so designs usually aim to minimize them.
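To make this concrete, here is a minimal sketch of the corresponding T-SQL in a Synapse SQL pool, assuming hypothetical FactSales and DimProduct tables (the names and columns are made up for illustration):

-- Hypothetical fact table, hash-distributed on the join key so rows are spread across nodes
CREATE TABLE dbo.FactSales
(
    SaleId     BIGINT        NOT NULL,
    ProductKey INT           NOT NULL,
    SaleDate   DATE          NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH ( DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX );

-- Hypothetical small dimension, replicated to every compute node so joins to it need no data movement
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT           NOT NULL,
    ProductName NVARCHAR(100) NOT NULL
)
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );

A join between FactSales and DimProduct on ProductKey can then be satisfied locally on each node, which is exactly the cross-node data movement the design tries to avoid.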
Synapse allows you to quickly deploy a secure, high-performance cloud data warehouse that is available worldwide. Through a massively parallel processing architecture designed for the cloud, you can scale compute and storage independently, and pause and resume your data warehouse in minutes. It offers a guaranteed availability of 99.9 percent.
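As a minimal sketch of the independent compute scaling, a dedicated SQL pool can be resized with a single T-SQL statement run against the master database; the pool name and service objective below are placeholders (pausing and resuming are done from the portal, Azure CLI, or PowerShell rather than T-SQL):

-- Scale the hypothetical pool 'MySqlPool' to the DW400c compute size
ALTER DATABASE MySqlPool MODIFY (SERVICE_OBJECTIVE = 'DW400c');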
With Azure Synapse Analytics, Microsoft makes up for some functionality that was missing in Azure SQL Data Warehouse and in the Azure cloud more generally. Synapse is thus more than a pure rebranding.
With Synapse, we can finally run on-demand SQL or Spark queries. Rather than spinning up a Spark service (e.g., Databricks) or resuming a data warehouse to run a query, we can now write our SQL or PySpark code and pay per query. On-demand queries make it much easier for analysts to take a quick look at a .parquet file (just opening it in Synapse - see the example below) or to explore the Data Lake for interesting data (using the integrated Data Catalog).
Opening a Data Lake parquet file directly in a Notebook
SQL On Demand
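For illustration, a SQL on-demand (serverless) query over a parquet file can look like the sketch below; the storage account, container, and file path are hypothetical:

-- Preview a parquet file in the data lake with SQL on-demand, paying per query
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/2020/sales.parquet',
    FORMAT = 'PARQUET'
) AS result;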
Data in your data lake isn't always easy to browse through, and it's certainly not easy for a business user or analyst. Within Synapse, Azure has now integrated its Storage Explorer interface: a way to easily browse through the Data Lake and access all folders.
With Data Explorer integrated, an analyst can, in one interface, see and access all the data in the Data Lake and Data Warehouse that he or she has access to: no further connection strings need to be created and shared, and there is no need for local tools such as SQL Server Management Studio (for accessing the Data Warehouse) or Azure Storage Explorer (for Data Lake browsing).
Until now, analysts and data scientists had to work with local notebook tools (Jupyter), Databricks, and/or local SQL tools to access the different data in the Data Lake or Data Warehouse. Both Azure SQL Data Warehouse and Data Lake Store had a data explorer in preview, but the functionality was limited.
Within Synapse, Microsoft has introduced an end-to-end analysis workbench accessible from the web UI/portal. You can write SQL queries and execute them on SQL pools (data warehouse compute), on-demand SQL, or Spark compute.
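As an illustration, the same workbench could run a report query like the sketch below against a dedicated SQL pool, reusing the hypothetical FactSales and DimProduct tables from the earlier example:

-- Hypothetical report query executed on a dedicated SQL pool from the Synapse workbench
SELECT d.ProductName,
       SUM(f.Amount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimProduct AS d ON f.ProductKey = d.ProductKey
GROUP BY d.ProductName
ORDER BY TotalSales DESC;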
Parquet is a great, highly compressed format commonly used in data lakes. It is great for storage but a bit more cumbersome to read and analyze: you can't simply open a parquet file in Windows; you need a tool that can read parquet.
Within Synapse, Microsoft has integrated SQL analytics functionality on Data Lake formats: you can now run SQL scripts on parquet files directly in your Data Lake, e.g., by right-clicking a file and choosing 'Open in SQL script'.
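The script that gets generated is essentially an OPENROWSET query like the one shown earlier; the sketch below extends the same idea to a whole folder of parquet files, again with hypothetical paths:

-- Count the rows across every parquet file under a data lake folder (paths are hypothetical)
SELECT COUNT(*) AS row_count
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/2020/*.parquet',
    FORMAT = 'PARQUET'
) AS result;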
There have been constant improvements in Azure SQL Data Warehouse performance since the product was introduced. The first was the move from standard Azure storage to premium (SSD) storage, which happened quite early in the service's life cycle.
In May 2018, Microsoft announced Gen2 of the hardware for the product -- the "Compute Optimized" tier, which caches data on super-fast local NVMe drives while still storing the larger volume of data on networked premium storage. This allows the service to deliver up to 2 GB per second of local I/O bandwidth and up to 5x query performance improvements.
SQL Data Warehouse adapts to your workload through smart caching to accelerate data access and query performance for the most demanding data warehousing workloads. Build your analytics hub seamlessly with native connectivity to data integration and visualization services, all while using your existing SQL and BI skills.
The dynamic resource classes are truly dynamic and address the points mentioned above. The new rule is 3-10-22-70: the percentage memory allocations for the small, medium, large, and xlarge resource classes, regardless of service level. The consolidated memory allocation percentages and minimum numbers of concurrent queries, regardless of service level, are:

smallrc: 3% of memory, minimum 32 concurrent queries
mediumrc: 10% of memory, minimum 10 concurrent queries
largerc: 22% of memory, minimum 4 concurrent queries
xlargerc: 70% of memory, minimum 1 concurrent query
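A resource class is assigned by adding a user to the corresponding database role; a minimal sketch, with 'loaduser' as a hypothetical login, looks like this:

-- Give the hypothetical user 'loaduser' the larger memory grant of the largerc resource class
EXEC sp_addrolemember 'largerc', 'loaduser';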
Well, that's it, folks. Please watch this space for more exciting posts on Azure Synapse Analytics.
As a Microsoft Azure partner, KPI Partners can help you with consulting, mentoring, and training services for Microsoft Data Management, Business Intelligence, and Analytics. Let us contact you.