The reason that Excel either doesn’t work or is extremely slow with large Snowflake tables such is because of a combination of how Excel Power Query works and a limitation of the Snowflake ODBC driver. Power Query is a really great idea in that it is intended to allow non-technical (or at least semi-technical) users connect to data and do some quite advanced data transformation and then consume the transformed data straight into Excel. Before Power Query you would have either needed to do the transformation in lots of formulae in the spreadsheet once the data was loaded, do it in some more complex SQL queries, or have a data engineer already do the transformation in the database before connecting Excel to it.
A basic concept of how Power Query works is that you connect it to some data, and then you build up the transformations one step at a time. To make it really simple for non-technical users, you don’t need to know any SQL, you just need the connection details for the database and it will then give you a list of tables. You then pick the first table you need, and build your transformation from there.
By default, Power Query does all of the transformation on the data in memory on your PC. For example, if you choose to convert a date/time value into just a date type, Power Query reads the date time value into memory on your PC, converts to date and shows the date on your screen and loads that converted date value into Excel. That sounds fine, right? Well, the problem is more obvious when you want to do some transformation such as “filter by previous 10 days” to only show the past 10 days data. If you have got 1 million rows of data in the table in Snowflake, which is the last 10 years sales data. You want to bring back into Excel only the last 10 days sales data, so that’s why you filter on previous 10 days on date. By doing these transformations in memory, now Power Query has to download all 1 million rows from the database, inspect the date on every row and filter out only those in the past 10 days. Now its slow.
You would have though the clever people at Microsoft would have realised this, realised how useless Power Query would be in many cases when dealing with realistic volumes of business data? Well, they are clever and they did think about it.
There is a good reason why these transformation steps are done in memory, it is so that Power Query can support many different sources of data and still provide the same functionality for transformations. You can filter on a date and convert date/time to date whether the data source is Snowflake, a Microsoft Dynamics CRM system, or a CSV file. But in order to allow the transformation to perform a lot better and work with large and realistic volumes of data, it uses some functionality called Query Folding. Simply, this is where Power Query knows how to get the original data source to do the transformation on the data on its side before returning the data to Excel. With a traditional relational database such as Oracle or SQL Server, you write a SQL statement to select the data you want and if you need to filter only the last 10 days, you add a WHERE clause. Regardless of where you run the query from, the task of applying the WHERE and filtering the last 10 days data is always done at the database server, which is designed to do such things extremely efficiently and very fast. So regardless of how many rows of data are in the table, you only ever download and display the last 10 days data. If you were connecting Power Query to SQL Server and filtered on just the last 10 days data, Power Query would add the WHERE clause to the SQL that it sends to the database server, only the last 10 days data is downloaded, and no transformation is done in memory on your PC.
The task of doing the complicated Query Folding – working out how to convert the filter a business user has added in Power Query to the SQL or the language that the source data understands – is not entirely the responsibility of Power Query, it is something that the driver for the data source needs to enable. Which is exactly the problem with Snowflake, because the Snowflake driver you need to use with Excel Power Query is just a standard ODBC driver, and ODBC itself does not support Query Folding within Power Query in Excel.
So when you have got a lot of data in a table in Snowflake and you connect Excel to it using the ODBC driver and Power Query, it is really slow to work on that table because it is all done in memory and all the data has to be downloaded to your PC and transformed in memory.
The way of getting around this, is to construct a SQL query that does the important filtering to select only the rows you want, and use that SQL query in the data source. You can still then do transformations such as converting date/times to date in Power Query steps, but the important step of filtering out millions of unnecessary rows is done in SQL on the database server.