3 min read

Working with JSON in Snowflake

Picture of Tommy Ranney Tommy Ranney Dec 20, 2021

Snowflake JSON Data Engineering

Snowflake has made working with semi-structured data a cinch! In this blog, I’ll walk through how to query and create tables in Snowflake using data in the JSON format. The data I used was publicly available from the Minnesota Department of Natural Resources through the LakeFinder app. The data tracks fish catches in both counts and sizes as well as other lake information for many bodies of water within Minnesota.

Step 1: Creating Your Table

To work with JSON data within Snowflake, you need to create a table with a single column formatted as a variant data type. Variant is a universal data type that allows you to load semi-structured data into Snowflake.

Step 2: Leverage Dot Notation

Once loaded, querying this data becomes simple using dot notation. Don’t worry if you’ve never used dot notation before, it’s easy to pick up. I’ll outline a query to show how it works:

Below is a sample of the data I loaded into Snowflake in JSON format.

Data loaded into JSON

I named the column that has the above data in it “JSON”. If I want to query the name of the lake contained in the data, I need to first call the column (JSON) , then the array (result), and then the attribute (lakeName). I can do this using the dot notation line below:

SELECT

JSON:result.lakeName

FROM

“DNR_LAKE_FINDER”.”PUBLIC”.”SURVEY_UNPARSED”

Run in Snowflake

All dot notation does is provide a path to where the value you wish to return lives, almost exactly like a SQL Select statement with standard data types. The trickier thing to tackle is how to query a list within an array, which is where flatten commands come in handy.

Step 3: Flatten Your Data

Let’s use the below data as our new sample: Sample Data

As you can see, there’s an array called “fishCatchSummaries” within the array “result”. If we wanted to return species from the fishCatchSummaries array, your first guess may be to continue the dot notation in the first example and write something like the below:

Continue dot notation

However, if you execute that query, you’ll just get NULL as the result. What gives?!

The reason this doesn’t work is because there are lists within the survey and fishCatchSummaries arrays. There’s nothing in the SELECT statement that points to a value in the list. Dot notation can take us all the way to surveys, from there we’ll need to flatten the data. Take a look at the below:

Dot notation surveys

The table(flatten()) command in the above statement creates a flattened table where each value of the list becomes its own row. You can query individual lines by adding the corresponding Path at the end of a SELECT statement such as:

SELECT FC.value:fishCatchSummaries[0]

This still isn’t super helpful though, as you’d have to know the exact path or index value of the data you want to query, and that’s nearly impossible in large datasets. What we want to do is execute another flatten command after the one that we just executed.

Let’s look at that syntax:

SELECT

FC.*

FROM

“DNR_LAKE_FINDER”.”PUBLIC”.”SURVEY_UNPARSED”,

table(flatten(JSON:result.surveys)) S,

table(flatten(S.value:fishCatchSummaries)) FC

Execute flatten command

You can see that this result is very similar to the first flatten. However, the values are now each fishCatchSummaries list in every line. The only difference in the dot notation that we needed for the second flatten was to add a colon instead of a period prior to indicating the list we wanted to return.

We can now run a SELECT statement against every value in the fishCatchSummaries list using the same syntax to map to the desired values within each path indicated in the FROM clause.

The below snapshot is the final script that will create a cleansed table from our original VARIANT table loaded into Snowflake:

Each column is formatted with the data format that I need for final analysis. Any Snowflake data format is available to be used here, so be sure to use whatever is the best case for your use case when writing the final script.

Take note: The cleaner your data is here, the less compute power you’ll need while querying the data later on, and the less work you’ll need to do to format your data in a BI tool if you connect one to your table.

You’re done!

That’s really the gist of it! Once you understand how to use dot notation and flatten commands within Snowflake, semi-structured data becomes easy to mold. We’re able to build upon our standard SQL skills to make clean and organized tables and views to be consumed by users and BI tools. If you love fishing as much as I do, you may be interested to see the Tableau Dashboard I created from the data I parsed in the example above, you can check it out on Tableau Public.

This is just one of the many great features of using Snowflake as your data warehouse. As you can tell, we’re big fans. If you’d like help with your specific use case, use the button below to get in touch!