From time to time, an innovative solution appears that brings genuinely new quality to its field. In this article, I would like to discuss Snowflake – an innovation in the field of data analytics.
In my opinion, this service has enormous potential and, used well, it can take corporate analytics to a completely new level in every respect: from significantly simplifying the solution, through ease and convenience of work, to unlocking new opportunities for the efficient analysis of large volumes of data.
What is Snowflake then?
Simply put, Snowflake is a dedicated analytical service that allows you to analyze huge volumes of data with very high efficiency.
Snowflake was designed from the ground up as a fully cloud-native service, which made an innovative architecture possible: the data storage layer is separated from the query processing layer. Thanks to this separation, we can dynamically choose the computing power of the servers and spread the load across more than one so-called virtual warehouse.
The load distribution mentioned above is not limited, as one might suppose, to the parallel processing of a single query by many compute nodes (MPP – massively parallel processing); it also means the ability to set up separate virtual warehouses that perform different tasks at the same time.
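As a minimal sketch of this idea (warehouse names and sizes are illustrative, not a recommendation), separate virtual warehouses for separate workloads are created with plain SQL:

```sql
-- Illustrative setup: one warehouse for data loading, another for BI reporting.
-- Each runs on its own compute, so heavy loads do not slow down reports.
CREATE WAREHOUSE load_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND   = 60      -- suspend after 60 s of inactivity (no cost while suspended)
  AUTO_RESUME    = TRUE;   -- wake up automatically when a query arrives

CREATE WAREHOUSE report_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;
```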
As a result, we gain unprecedented flexibility in matching the performance of the solution to specific needs. For example:
- we can load data and, at the same time, run reporting efficiently – without the problems known from other database systems (performance drops, structures locked while in use) – which is extremely important for business users;
- analyst and data science teams can get their own environment with increased computing power for complex analyses; as a result, we obtain useful results faster than with traditional solutions and can implement the most sophisticated scenarios with fewer resources;
- if we share some of our data with partner organizations, we can give them convenient and secure access to it without building time-consuming data-copying mechanisms;
- development work can be carried out on an instant copy of the data without disrupting system operation;
- applications that need analysis results can access the latest data 24/7.
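For the analyst scenario above, a team can point its session at its own warehouse and, for a particularly heavy job, resize it on the fly (the warehouse name is illustrative):

```sql
-- The data science team works on its own compute, isolated from loading and reporting.
USE WAREHOUSE analytics_wh;

-- Temporarily scale up for a complex analysis, then scale back down afterwards.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XLARGE';
-- ... run the heavy queries here ...
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'SMALL';
```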
Each virtual warehouse is billed separately, so without major difficulty we can allocate costs to the tasks performed or to specific departments.
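Such cost allocation can be based on Snowflake's built-in usage views; for example, the credits consumed by each warehouse over the last month:

```sql
-- Credits used per virtual warehouse in the last 30 days.
-- SNOWFLAKE.ACCOUNT_USAGE is a built-in schema (note: its views lag by up to a few hours).
SELECT warehouse_name,
       SUM(credits_used) AS credits_30d
FROM   snowflake.account_usage.warehouse_metering_history
WHERE  start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP  BY warehouse_name
ORDER  BY credits_30d DESC;
```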
The Snowflake usage scenarios described above are possible thanks to the rich set of features built into the service. The most interesting of them are:
- the choice of cloud – Snowflake is currently available on Azure, AWS and Google Cloud Platform; we can pick the provider with whom we have already implemented solutions, or the one offering the most suitable services;
- isolation of load coming from different sources, thanks to the ability to create many so-called virtual warehouses;
- a very low entry threshold – the entire platform is operated with SQL; SQL commands can even be used to scale the system or create new virtual warehouses;
- an interface available through a browser – an authorized user can execute SQL commands in a web browser without downloading data to their computer;
- a well-thought-out and reliable data security layer – data encryption, masking of sensitive data, role-based security and security certificates let us cover virtually any data security scenario;
- data sharing – in addition to using your data for your own needs, you can safely share it externally, e.g. with partner organizations; such data exchange processes are usually troublesome (exports to SFTP, network drives, cloud storage), whereas here the scenario is covered “out of the box”;
- handling of structured and unstructured data;
- two types of dynamic scaling – scaling out (more clusters, to handle many queries at the same time) and scaling up (a larger warehouse, to shorten query processing time);
- pay-per-use billing – the better we use the automatic suspension and resumption of virtual warehouses and dynamic scaling, the better the ratio of results achieved to cost incurred; suspended virtual warehouses generate no costs, and the costs of each warehouse instance can be tracked separately (hence it is worth maintaining dedicated warehouses for specific classes of processes), which makes it possible to identify the most expensive elements of the solution and optimize them;
- features that developers will surely appreciate because they dramatically shorten development time, such as time travel (the ability to “go back in time” to a previous state of the data) or zero-copy clone (instant data copying, e.g. to spin up a test or UAT environment).
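The last two features look like this in practice (the table and database names are illustrative):

```sql
-- Time travel: query a table as it looked one hour ago.
SELECT *
FROM   orders AT (OFFSET => -60 * 60);

-- Zero-copy clone: an instant UAT environment that initially shares storage with production.
CREATE DATABASE uat_db CLONE prod_db;
```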
Thanks to its well-thought-out architecture and a great set of features and functionality, Snowflake can revolutionize analytics in almost any organization. It should be emphasized, however, that a Snowflake implementation must be carried out with due diligence, because misusing the service can turn out to be quite expensive: virtual warehouses left running, processes that never let servers suspend automatically, an inappropriate choice of computing power, a lack of regular cost supervision, or using Snowflake as a transactional database can all have dramatic effects. None of this, however, should stop anyone from trying the technology – preferably in cooperation with an experienced partner who will efficiently and reliably implement a solution that is a necessary step towards becoming a truly “data-driven company”.
Norbert Kulski – Transition Technologies MS