Redshift vs. Snowflake: The Definitive Guide
Learn the key differences between Redshift and Snowflake around architecture, pricing, security, compliance, data support, administration, data protection, cloud infrastructure, and performance.
December 5, 2022
Cloud services and data warehouses have changed drastically over the years, and there are now quite a few options available in the market, but Amazon Redshift and Snowflake still dominate most of the market, and choosing between the two can be challenging since there are so many technical and architectural differences you need to consider.
What is Snowflake?
Founded in 2012, Snowflake is a SaaS-based data platform that can run on any of the major cloud providers like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP.) In its simplest form, Snowflake helps you consolidate and aggregate your data into a single, centralized platform to tackle analytics use cases. These workloads include data warehousing, data lakes, data engineering, application development, data sharing, and business intelligence.
What is Redshift?
Funny enough, Amazon Redshift was strategically named as a deliberate dig at Oracle since all of Oracle's branding is red. Redshift is also a traditional data warehouse designed to tackle Business Intelligence use cases, among other things. However, whereas Snowflake is a SaaS offering, Redshift is a PaaS (Platform-as-a-Service) solution. AWS Redshift was one of the first cloud data warehouses to become available on the market, officially launching in 2013. Like Snowflake, Redshift lets you query data using SQL for various analytics and engineering-related use cases.
Snowflake is entirely serverless, eliminating the need for dedicated resources. You never have to manage or maintain any hardware, and storage and compute are completely decoupled to optimize for performance and query concurrency. Compute nodes and clusters within Snowflake are known as individual warehouses. Snowflake warehouses do not share compute resources with other virtual data warehouses, which means the platform provides near-unlimited query concurrency and robust computing power.
Snowflake stores a portion of every dataset locally within these clusters and uses micro partitions to optimize and compress all your data into a columnar storage format. Everything from file sizing, compression, structure, metadata, statistics, and every other data object that is not directly visible to you is automatically handled by Snowflake.
Snowflake takes advantage of Massive Parallel Processing (MPP) to process queries, and since the platform is based on ANSI SQL, the barrier to entry is really low. There's even an auto-complete feature where Snowflake automatically makes recommendations when you enter your script in the query editor.
The standard version of AWS Redshift is not serverless, and storage and compute are tightly coupled. However, Redshift offers several node types, so you have much more control and flexibility over how you configure clusters. All nodes within a cluster are automatically partitioned into slices, each representing an allocated portion of an individual node’s disk and memory space.
Redshift runs on PostgresSQL 8, which is older and less user-friendly than Snowflake and the platform only runs on AWS, which means it’s not accessible if you’re currently using a different cloud. Storage within Redshift is duplicated from S3, so you can compress and store data in a columnar format, just like Snowflake.
Snowflake has a unique auto-scaling feature that allows you to automatically spin up more computing resources to handle any query load; and an auto-suspend feature that helps you start and stop your virtual warehouses.
While you can resize clusters and virtual warehouses in a single click, Snowflake does not give you any ability to resize nodes. The only way to get larger nodes in Snowflake is to purchase larger virtual warehouses, which inevitably drives up compute costs and makes it inefficient to run specific queries.
Autoscaling in Amazon Redshift is limited. It can take anywhere from several minutes to hours since you have to add and remove individual nodes manually. Redshift offers a concurrency scaling feature, but it comes at an additional cost and is not as robust as Snowflake’s auto-scale function.
Security & Compliance
Snowflake provides role-based access control (RBAC) to manage users and privileges and multifactor authentication (MFA) for account security. You can control network access through policies, and since you can run Snowflake across clouds, you can take advantage of virtual private networks (VPNs) like Azure Private Link or AWS PrivateLink. In addition to this, Snowflake automatically encrypts all your data, and the platform has earned several certifications like SOC 1 Type II and SOC 2 Type II compliance,
HIPAA compliance, PCI DSS compliance, FedRAMP compliance, etc. One of the best facets of Snowflake security is RBAC (role-based access control), which you can use to manage users and privileges. Snowflake also offers MFA (multi-factor authentication) for increased account security. You can control network access by whitelisting and blacklisting IPs.
Aside from handling all of the infrastructure and hardware security, Redshift also provides security features for resource provisioning. Access management in Redshift is handled by AWS Identity and Access Management (IAM.) You can’t create, configure, or delete clusters unless you have the appropriate credentials. However, you can manage access to AWS Redshift resources and assign security groups to specific cluster instances to grant inbound access and manage cluster security.
To further protect access, you can launch clusters within AWS Virtual Private Cloud (VPC) and enable cluster encryption. AWS Redshift also offers secure sockets layer (SSL) connections to encrypt any connections between your SQL client and clusters. Redshift automatically encrypts all data at rest, and you can further protect your data with Advanced Encryption Standard (AES-256.) Compliance within Redshift is no problem because AWS offers even more compliance capabilities than Snowflake.
Snowflake supports both semi-structured (JSON, Avro, Orc, CSV, JSON, Parquet) and structured data. Snowflake also launched support for unstructured data in late 2021, enabling you to take advantage of more development languages like Python, Java, and Scala.
Amazon Redshift only supports structured and semi-structured data. However, the platform does not support semi-structured data types like objects, arrays, or variants. To query unstructured data, you’ll need to store it within an S3 data lake and query using Amazon Spectrum. One cool aspect of Redshift is that the platform supports PartiQL, which is a query language designed to process semi-structured data more efficiently.
Administration And Maintenance
There’s hardly ever a need for a full-time administrator with Snowflake because nearly every aspect of maintenance is automatic. All query optimization happens automatically, Snowflake offers automatic clustering, and even more performance tuning takes place as your data volume increases.
Since AWS Redshift is a PaaS solution, you must perform substantial manual maintenance to keep the platform up and running. For example, user access is managed on a cluster basis which means you’ll inevitably need multiple credentials and permission grants to work across numerous clusters and client connections. There’s also a lot of maintenance you have to take care of when it comes to data vacuuming (e.g., resorting rows and cleaning up unclaimed space.)
Snowflake has two core features known as Time Travel and Fail-safe. Time Travel keeps a state of your data before it is updated. You can apply this feature to databases, tables, and schemas. Time Travel allows you to access historical data (e.g., tables, schemas, or databases) in a defined period that might have been changed or deleted. Time Travel is limited to one day, but enterprise customers can choose any period of time up to 90 days.
Fail-safe is a 7-day period that works in unison immediately with Time Travel. This feature exists solely to recover data that might have been lost or damaged due to extreme operational failures, and it can take up to hours or days to recover your data.
Redshift offers Snapshots, which are point-in-time backups of clusters. You can configure Snapshots manually or have them run automatically. Automated Snapshots are enabled by default, but you can schedule them to run as you choose.
You can take manual Snapshots at any time. The default retention period is one day, but you can modify this in the Redshift Console, depending on your node type. For example, RA3 nodes have a retention period of 1-35 days. Snapshots store your data in S3, and when you restore from a backup, Redshift will create a new cluster automatically that you can query. Amazon Redshift automatically deletes all Snapshots after the retention period.
Snowflake has five different editions or plans that you can choose from:
- Business Critical
- Virtual Private Snowflake (VPS)
Every plan offers additional features, so you'll need to have a good understanding of your daily usage patterns and overall use cases. Snowflake pricing is based on individual warehouse usage. Computational warehouses come in several different sizes (X-Small, Small, Medium, Large, X-Large, etc.) An X-small warehouse uses one cluster.
For an X-Small warehouse, the pricing structure starts at approximately 0.0003 credits per second or one credit consumed per hour. Every time you increase your warehouse size, you also double the number of clusters and credits you’re consuming within Snowflake. The largest snowflake warehouse (6X-Large) burns 512 credits per hour or .1422 credits per second.
Depending on your Snowflake tier, the cost per credit can vary substantially, but the on-demand pricing for Snowflake Standard Edition starts at $2. Within Snowflake, you can pre-purchase credits or pay on-demand based on usage. The on-demand version of Snowflake Standard Edition starts at $2 per credit.
Storage costs in Snowflake are relatively straightforward. On-demand users pay a flat rate of $40 per month for each TB of data, and upfront, customers pay $23 per TB of data stored. Paying up-front and purchasing a contract is the most cost-effective way to use Snowflake because the company offers massive discounts.
AWS Redshift offers huge discounts if you pay upfront or sign a contract. However, the pricing model is more complex because there are two node types, DC2 and RA3. DC2 nodes are tightly coupled with storage. With RA3 nodes, storage and compute are managed independently.
The price for a DC2 instance ranges from $0.25 per hour for the smallest size to $4.80 per hour for the largest size. On the other hand, RA3 instances start at $1.086 per hour for the smallest size and $13.04 for the largest size.
Redshift also offers a serverless option for users who don’t want to provision and scale hardware. With this option, Redshift automatically scales up or down to meet the requirements of analytic workloads and shuts down during periods of inactivity. Consumption is calculated per minute based on RPU (Redshift Processing Unit) hours. The price for an RPU is $0.45 per hour.
One of the core things to keep in mind when it comes to Snowflake and Redshift is that Snowflake is a SaaS solution, and Redshift is a PaaS solution. With Snowflake, you don’t have to maintain any infrastructure. The same cannot be said for Redshift. On top of this, Snowflake runs across clouds, so it’s incredibly flexible. On the other hand, Redshift is limited to AWS and cannot run on another cloud provider like Azure or GCP (Google Cloud Platform).
Every benchmark can tell its unique story and be tailored to perform in a certain way. Out of the box with no fine-tuning, Snowflake tends to beat out every other cloud data warehouse with superior performance. This is not to say there is not a specific subset of data or use case where Amazon Redshift would outperform Snowflake. If you’re interested in actual benchmarks, Fivetran and Hashmap have done in-depth analyses comparing both platforms.
Pros And Cons
Snowflake is extremely user-friendly and designed to work straight out of the box with immediate value. Once you load your data into the platform, you can start querying it immediately. Snowflake also supports an extensive ecosystem of third-party partners and integrates directly with many different technologies like Fivetran and dbt.
Amazon Redshift is a much older platform than Snowflake, so it carries some legacy baggage. You’re forced to set up infrastructure and configure hardware before you can start seeing value. Redshift integrates natively with the rest of the AWS ecosystem (e.g., AWS Glue and Sage Maker.)
If you're operating a lot of on-premises technology that doesn't integrate easily with cloud-based services, Redshift will likely be a better option unless you want to undergo a full migration and move all of your data to the cloud. It’s much easier to optimize for cost in AWS Redshift for additional savings, but you'll most likely see slower performance.
The key difference to understand about these cloud-based data warehouses is that Redshift is "hands-on," and Snowflake is "set it and forget it." Either way, both platforms will connect seamlessly to your Business Intelligence tools.
What Comes After Redshift & Snowflake?
Cloud data warehouses like Snowflake and Redshift help you build dashboards and establish KPIs, but they don’t solve the “last-mile” analytics challenge of Data Activation. Data warehouses are only accessible to your technical users who know how to write SQL, which means your business teams can’t get access to the robust customer data living in your warehouse.
Hightouch solves this problem with Reverse ETL by querying directly against your warehouse and syncing your data directly to your frontline business systems (e.g., Salesforce, Hubspot, Google Ads, Braze, Iterable, etc.)