How to run dataframe as Spark SQL – PySpark

If you can express the result you need in SQL (or you already have the SQL written), you can register the DataFrame as a table and run the query directly against it. Convert the DataFrame to a table as below:

from pyspark.sql import SparkSession

# SparkSession creates the SparkContext internally; no separate SparkContext() needed
spark = SparkSession.builder.getOrCreate()

# Sample data: (name, salary, cnt, department, country)
myDF = spark.createDataFrame(
    [("Tom", 400, 50, "Teacher", "IND"),
     ("Jack", 420, 60, "Finance", "USA"),
     ("Brack", 500, 10, "Teacher", "IND"),
     ("Jim", 700, 80, "Finance", "JAPAN")],
    ("name", "salary", "cnt", "department", "country"))

# Register the DataFrame as a temporary view
# (createOrReplaceTempView replaces the deprecated registerTempTable)
myDF.createOrReplaceTempView("sql_df")

tot_salary = spark.sql("select department, sum(salary) as total_salary from sql_df group by department")
tot_salary.show(30, False)

+----------+------------+
|department|total_salary|
+----------+------------+
|Teacher   |900         |
|Finance   |1120        |
+----------+------------+

You can also use either of the below to get all the columns from a DataFrame:

# Both return every column of the DataFrame
tot_salary.selectExpr('*').show()
tot_salary.select('*').show()
