Thursday, August 31, 2017

Spark - Type casting when you group

Spark is still relatively new to most folks, and if you are like me, you are learning on the go.
I'm documenting some of the basics that are not readily available when you are combing the internet for a solution.

If you want a basic count(*) by date but your dataset only has a timestamp column, you can do it in two ways:

1. As a traditional SQL person, you can just write Spark SQL, which is the easy way:

sqlContext.sql("""
    select cast(timestamp_field as date) as date_field,
           count(*) as record_count
    from tablename
    group by cast(timestamp_field as date)
""")

2. If you are learning DataFrames and want to give them a go, here is how.
(As a SQL person, you might intuitively try to put the cast inside groupBy, but that does not work: groupBy takes column names, not SQL expressions. You need to typecast first, e.g. with withColumn, and then group on the new column.)

import org.apache.spark.sql.functions.{count, to_date}
import sqlContext.implicits._  // enables the $"column_name" syntax

val df = sqlContext.table("tablename")

// Cast the timestamp down to a date first, then group on the new column
df.withColumn("date_field", to_date($"timestamp_field"))
  .groupBy("date_field")
  .agg(count("*").as("record_count"))
  .show()
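
As an aside, the same typecast can also be written with Column.cast. A minimal variant using the same placeholder names (both to_date and cast("date") truncate a timestamp to its calendar date):

df.withColumn("date_field", $"timestamp_field".cast("date"))
  .groupBy("date_field")
  .agg(count("*").as("record_count"))
  .show()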