Every day we generate roughly 2.5 quintillion bytes of data. Almost all organizations are placing a high emphasis on building out their technology infrastructure to handle the ever-increasing challenges of data volume, velocity and variety. Now more than ever, companies need a way to collect and consolidate data into a single platform to derive insights quickly and efficiently.
This is one of the major reasons why cloud data platforms like Snowflake, Azure Synapse, Amazon Redshift and Google BigQuery are rising in popularity and replacing on-premises data platform solutions.
The process of making data available to the whole organization is called data democratization. A data platform can make data widely available and help an organization become data driven. Today, often only top executives have the luxury of asking various departments for reports, getting a sense of things from them and then making a decision. But what about middle management and the business units that need data not only to make decisions but also to run their specific processes? For 90% of organizations worldwide, that kind of access still does not exist. With the processing power of a cloud data platform, one can use tools to ensure that data is available across the organization and is of good quality. Some of the major benefits of a cloud data platform are:
- Data storage in native format
- Schema flexibility
- Support for languages beyond SQL
- Advanced analytics
Snowflake’s Data Cloud is powered by an advanced data platform provided as Software-as-a-Service (SaaS). Snowflake enables data storage, processing, and analytic solutions that are faster, easier to use, and far more flexible than traditional offerings.
The Snowflake data platform is not built on any existing database technology or “big data” software platforms such as Hadoop. Instead, Snowflake combines a completely new SQL query engine with an innovative architecture natively designed for the cloud. To the user, Snowflake provides all of the functionality of an enterprise analytic database, along with many additional special features and unique capabilities.
Synapse is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. Azure Synapse brings together the best of SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Data Explorer for log and time series analytics, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as Power BI, CosmosDB, and AzureML.
Both Snowflake and Azure Synapse are modern, cloud-based systems focused on data warehousing. Data warehousing is an excellent use case for the cloud, given its large data requirements and peak computational workloads such as nightly ETL runs or busy report and dashboard consumption windows.
PaaS vs. SaaS
Snowflake is delivered as SaaS that runs on top of Azure, AWS or Google Cloud. An abstraction layer separates the Snowflake compute credits and storage you pay for from the underlying cloud provider's storage and compute resources and their costs.
Azure Synapse is primarily a Platform as a Service (PaaS) solution, with a free Azure Synapse workspace development environment layered on top of the underlying Azure resources. You ultimately pay for the Azure resources you use. The benefit of this approach is that other Azure services such as Azure Active Directory and Power BI are tightly integrated when using Azure Synapse for data warehousing.
Snowflake’s architecture is a hybrid of traditional shared-disk and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture.
Synapse SQL leverages a scale out architecture to distribute computational processing of data across multiple nodes. Compute is separate from storage, which enables you to scale compute independently of the data in your system.
For a dedicated SQL pool, the unit of scale is an abstraction of compute power known as a data warehouse unit (DWU).
For a serverless SQL pool, scaling is done automatically to accommodate query resource requirements. As the topology changes over time through added nodes, removed nodes or failovers, the pool adapts and makes sure your query has enough resources and finishes successfully. For example, a serverless SQL pool might use 4 compute nodes to execute a single query.
Spark pools are a fully managed capability within Azure Synapse that lets users spin up and configure Apache Spark clusters to tackle more complicated workloads.
Both platforms allow the creation of SQL databases for data warehousing, but under the hood they work very differently. Snowflake has completely decoupled the SQL databases created in Snowflake from the compute resources that load or query those databases. This means any compute resource, called a “warehouse” in Snowflake, can operate on any SQL database in Snowflake. This approach enables multiple compute resources to use the same database concurrently. For example, one warehouse could be loading data while another is querying data for reports, without the performance of one job affecting the other.
Azure Synapse takes a different approach to computing power. A dedicated SQL pool is required to create a long-lived SQL database suitable for data warehousing. That SQL database is tightly coupled to the dedicated SQL pool compute resource. The dedicated SQL pool must be running to access the SQL database associated with it. Multiple SQL pools cannot access the same SQL database at the same time. Instead, Azure Synapse implements a massively parallel processing engine pattern that will distribute SQL commands across a range of compute nodes based on your selected SQL pool performance level.
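The massively parallel pattern can be pictured with a small Python sketch. It relies on the documented layout in which a dedicated SQL pool spreads each table over 60 distributions and divides those distributions among the compute nodes for the chosen performance level. The function names, the CRC32 hash, the contiguous block assignment and the sample keys are illustrative assumptions, not Synapse internals.

```python
import zlib
from collections import Counter

NUM_DISTRIBUTIONS = 60  # a dedicated SQL pool splits table data into 60 distributions


def distribution_for(key: str) -> int:
    """Map a distribution-column value to one of the 60 distributions.

    CRC32 is only a stand-in hash for illustration; the real hash is internal.
    """
    return zlib.crc32(key.encode("utf-8")) % NUM_DISTRIBUTIONS


def node_for(distribution: int, compute_nodes: int) -> int:
    """Assign distributions to compute nodes in equal-sized blocks (assumed layout)."""
    if compute_nodes < 1 or NUM_DISTRIBUTIONS % compute_nodes != 0:
        raise ValueError("compute_nodes must be a positive divisor of 60")
    return distribution // (NUM_DISTRIBUTIONS // compute_nodes)


def rows_per_node(keys, compute_nodes: int) -> Counter:
    """Count how many rows each compute node would end up processing."""
    return Counter(node_for(distribution_for(k), compute_nodes) for k in keys)


if __name__ == "__main__":
    customer_ids = [f"customer-{i}" for i in range(10_000)]  # made-up keys
    for nodes in (1, 4, 10):
        print(nodes, dict(sorted(rows_per_node(customer_ids, nodes).items())))
```

Raising the performance level adds compute nodes, so each node owns fewer of the 60 distributions and does a smaller share of the work for the same query.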
Snowflake’s multi-cluster warehouses enable you to scale compute resources to manage your user and query concurrency needs as they change, such as during peak and off hours. Multi-cluster warehouses are available only in Enterprise Edition or higher. You can run a multi-cluster warehouse in either of two modes: Maximized or Auto-scale. Maximized mode is selected by setting the minimum and maximum number of clusters to the same value (greater than 1); Auto-scale mode is selected by setting them to different values. Because each Snowflake warehouse runs on its own compute cluster, workloads can be isolated from one another, which allows near-unlimited concurrency. Snowflake also has a feature called zero-copy cloning, which lets users instantly clone databases without physically copying or storing the data again.
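As a rough illustration of how the two modes follow from the cluster counts, the Python sketch below classifies a min/max pair and renders a `CREATE WAREHOUSE` statement using Snowflake's `MIN_CLUSTER_COUNT` and `MAX_CLUSTER_COUNT` properties. The helper names, the warehouse name and the chosen size are hypothetical, and the statement is only built as text here, not executed.

```python
def multi_cluster_mode(min_clusters: int, max_clusters: int) -> str:
    """Classify a warehouse by its cluster counts."""
    if not 1 <= min_clusters <= max_clusters:
        raise ValueError("expected 1 <= min_clusters <= max_clusters")
    if max_clusters == 1:
        return "single-cluster"
    # Equal counts above 1 mean all clusters run at once (Maximized);
    # different counts let Snowflake start and stop clusters (Auto-scale).
    return "maximized" if min_clusters == max_clusters else "auto-scale"


def create_warehouse_sql(name: str, size: str, min_clusters: int,
                         max_clusters: int, auto_suspend_seconds: int = 60) -> str:
    """Render the DDL for a (hypothetical) multi-cluster warehouse."""
    multi_cluster_mode(min_clusters, max_clusters)  # validates the counts
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {name} WITH\n"
        f"  WAREHOUSE_SIZE = '{size}'\n"
        f"  MIN_CLUSTER_COUNT = {min_clusters}\n"
        f"  MAX_CLUSTER_COUNT = {max_clusters}\n"
        f"  AUTO_SUSPEND = {auto_suspend_seconds}\n"
        f"  AUTO_RESUME = TRUE;"
    )


if __name__ == "__main__":
    print(multi_cluster_mode(1, 4))  # auto-scale
    print(multi_cluster_mode(3, 3))  # maximized
    print(create_warehouse_sql("reporting_wh", "MEDIUM", 1, 4))
```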
You can scale compute for a Synapse dedicated SQL pool (data warehouse) using the Azure portal. Scale out compute for better performance, or scale back compute to save costs. SQL pool compute resources are scaled by increasing or decreasing data warehouse units, and the SQL pool must be online to scale. Azure Synapse has fewer scalability features than Snowflake since it is not a native SaaS offering. Serverless SQL pools and Spark pools scale automatically by default, but dedicated SQL pools must be adjusted manually by the user because there is no built-in auto-suspend/auto-resume feature like Snowflake's.
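Besides the portal, a dedicated SQL pool can be rescaled with T-SQL by changing its service objective (`ALTER DATABASE ... MODIFY (SERVICE_OBJECTIVE = ...)`, run against the `master` database). The Python sketch below only renders that statement; the function name and the pool name are hypothetical, and the pattern check is a loose sanity test rather than a list of the performance levels Azure actually offers.

```python
import re

# Loose shape check for names such as DW100c or DW1000c; it does not
# verify that the level is actually offered by Azure.
_DWU_PATTERN = re.compile(r"^DW\d+c$")


def scale_pool_sql(pool_name: str, service_objective: str) -> str:
    """Render the T-SQL that changes a dedicated SQL pool's DWU level."""
    if not _DWU_PATTERN.match(service_objective):
        raise ValueError(f"unexpected service objective: {service_objective!r}")
    return (
        f"ALTER DATABASE [{pool_name}] "
        f"MODIFY (SERVICE_OBJECTIVE = '{service_objective}');"
    )


if __name__ == "__main__":
    # Scale up before a nightly ETL run, then back down afterwards.
    print(scale_pool_sql("sales_dw", "DW1000c"))
    print(scale_pool_sql("sales_dw", "DW200c"))
```

Because nothing triggers these statements automatically, a scheduler or pipeline step has to issue them, which is the manual adjustment described above.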
Snowflake offers pay-as-you-go billing for compute, calculated on a per-second basis with a 60-second minimum, along with auto-suspend and auto-resume capabilities. This means that if a query takes 2 minutes to execute, you pay for only 2 minutes of compute, assuming the virtual warehouse is suspended right after the query finishes.
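The per-second rule with a 60-second minimum can be worked through in a few lines of Python. This is a sketch under stated assumptions: the credits-per-hour table reflects the commonly published rates for standard warehouses and should be checked against current Snowflake pricing, and the function names are made up for illustration.

```python
import math

# Assumed credits per hour for standard warehouses; verify against current pricing.
CREDITS_PER_HOUR = {"X-Small": 1, "Small": 2, "Medium": 4, "Large": 8}


def billed_seconds(run_seconds: float) -> int:
    """Per-second billing with a 60-second minimum each time the warehouse runs."""
    return max(60, math.ceil(run_seconds))


def credits_used(run_seconds: float, size: str = "X-Small") -> float:
    """Credits consumed by one run of a warehouse of the given size."""
    return CREDITS_PER_HOUR[size] * billed_seconds(run_seconds) / 3600


if __name__ == "__main__":
    print(billed_seconds(10))   # a 10-second query is still billed as 60 seconds
    print(billed_seconds(120))  # a 2-minute query is billed as 120 seconds
    print(round(credits_used(120, "Medium"), 4))
```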
With Synapse, on the other hand, compute usage is charged on an hourly basis. For example, if your data warehouse is active for 24 hours in a month, you will be billed only for those 24 hours. If your data warehouse exists for only 30 minutes in a month, you will be billed for 1 hour. Additionally, it is the organization’s responsibility to manage suspending and resuming Synapse, plan for capacity and educate developers to optimize overall consumption and cost.
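The hourly rounding in these examples amounts to a ceiling over the active time. The Python sketch below follows the rule exactly as stated in this article; the function names and the hourly rate are placeholders, not actual Azure prices.

```python
import math


def billed_hours(active_minutes: float) -> int:
    """Round active time up to whole hours, as in the examples above."""
    if active_minutes <= 0:
        return 0
    return math.ceil(active_minutes / 60)


def monthly_compute_cost(active_minutes: float, hourly_rate: float) -> float:
    """Compute cost for a month; hourly_rate is a placeholder value."""
    return billed_hours(active_minutes) * hourly_rate


if __name__ == "__main__":
    print(billed_hours(30))                     # 30 minutes is billed as 1 hour
    print(billed_hours(24 * 60))                # 24 hours is billed as 24 hours
    print(monthly_compute_cost(30, hourly_rate=1.5))
```

Under this rule a short-lived pool pays for idle time within the hour, which is why planning when to suspend and resume matters for cost.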
Snowflake offers many innovative features such as Data Sharing, Zero-Copy Cloning, Time Travel and Fail-safe for data protection, as well as high-level security and compliance features. As a SaaS offering, Snowflake seamlessly deploys weekly releases that provide an up-to-date experience, delivering value through innovation and an Agile mindset.
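Two of these features are exposed through short SQL statements: `CLONE` for zero-copy cloning and the `AT(OFFSET => ...)` clause for Time Travel. The Python helpers below merely render such statements as text; the helper names and the database and table names are hypothetical, and how far back a query can reach depends on the retention period configured for the object.

```python
def clone_database_sql(source: str, target: str) -> str:
    """Render a zero-copy clone of a database (no data is physically copied)."""
    return f"CREATE DATABASE {target} CLONE {source};"


def time_travel_query_sql(table: str, seconds_ago: int) -> str:
    """Render a query that reads a table as it was `seconds_ago` seconds earlier."""
    if seconds_ago < 0:
        raise ValueError("seconds_ago must be non-negative")
    return f"SELECT * FROM {table} AT(OFFSET => -{seconds_ago});"


if __name__ == "__main__":
    print(clone_database_sql("prod_db", "dev_db"))
    print(time_travel_query_sql("orders", 3600))  # the table as of one hour ago
```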
Azure Synapse Analytics also evolves rapidly to address advanced analytics use cases, leveraging the power of Apache Spark. Several recent enhancements complement a modern data warehouse architecture, such as advanced security features and a unified analytics experience within Azure Synapse Studio. Azure Synapse supports many programming languages, including SQL, Python, Scala, Spark SQL and .NET. Azure Synapse Analytics also integrates with Azure Machine Learning and Azure Cognitive Services.
Both cloud data platforms are great tools and leaders in the modern cloud data platform landscape. Organizations need to carefully evaluate which platform best suits their needs and current technology stack.