Sign in

How to flatten JSON array in Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, it is very handy to run the analytical or ad-hoc query against your big data set stored as one single file or many files in the S3 bucket. AWS Glue makes things, even more, convenient with a self-defined crawler to automatically generate the schema of the data and create a ready-to-use Athena table in minutes or seconds.

Athena uses Presto with full standard SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Apache Parquet, and Avro. The data can be compressed in Snappy, Zlib, LZO, and GZIP formats.

It is very common to use JSON to describe a deeply nested data record, such as personal information, property description, or any object with many attributes. When dealing with JSON data format, query the data inside the array structure often becomes necessary. Let’s look at the following JSON sample record:

{
"class_id":"LOcwnp",
"students":[
{
"student_id":"oyKxvv",
"student_name":"John Willson",
"scores":[
{
"type":"Math",
"value":"90",
},
{
"type":"English",
"value":"90",
},
{
"type":"Science",
"value":"89",
},
{
"type":"Math",
"value":"100",
}
]
},
{
"student_id":"gqeouhgs",
"student_name":"Olivia Lee",
"scores":[
{
"type":"English",
"value":"92",
},
{
"type":"Science",
"value":"93",
}
]
}
]
}

If Glue crawler is used to auto-generate schema from the above JSON record, there will be 2 columns created in Athena table, class_id column with class_id value, and students column with a nested JSON object contains all students score information of the class.

How to count the total number of students who have Math score and who have more than 1 Math scores?

UNNEST is the answer. Yes, we need to flatten the arrays in order to query and count the key-value pair inside the JSON object. The query looks like the below:

Select count(*) from
(SELECT count(*) as total FROM “class_scores”
CROSS JOIN UNNEST(students) as t(student)
CROSS JOIN UNNEST(student.scores) as t(score)
where score.type=’Math’
group by student.student_id)
where total>1

Hope it helps!