How to create tables from a Spark DataFrame and join them (createOrReplaceTempView)


createOrReplaceTempView

There are many scenarios in which it is easier to express a transformation in SQL rather than through direct Spark DataFrame operations. There may also be SQL migration tasks in which existing SQL has to be converted to Spark SQL and run against DataFrames.
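
As a quick illustration of the idea (the orders_df DataFrame and its columns below are made up for this sketch, not taken from the example later in this post), the same filter can be written either with DataFrame operations or as Spark SQL against a temporary view:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql_vs_dataframe_demo").getOrCreate()

# Hypothetical sample data: order id and amount
orders_df = spark.createDataFrame([(1, 120.0), (2, 45.5), (3, 300.0)], ["order_id", "amount"])

# 1. The transformation expressed with direct DataFrame operations
orders_df.filter(F.col("amount") > 100).show()

# 2. The same transformation expressed as Spark SQL against a temp view
orders_df.createOrReplaceTempView("tbl_orders")
spark.sql("select order_id, amount from tbl_orders where amount > 100").show()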

Function

pyspark.sql.DataFrame.createOrReplaceTempView

createOrReplaceTempView creates (or replaces) a local temporary view from the DataFrame it is called on. The lifetime of this temporary view is tied to the SparkSession that was used to create the DataFrame, so it disappears automatically when that session ends. Once created, the view can also be dropped explicitly before the Spark session exits with the dropTempView command (spark.catalog.dropTempView("table_name_to_drop")).
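
Before the full example below, here is a minimal lifecycle sketch; the DataFrame and the view name tbl_letters are hypothetical and are used only to show create, query, and drop:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("temp_view_lifecycle").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2)], ["letter", "num"])

# The view is visible only within this SparkSession
df.createOrReplaceTempView("tbl_letters")
spark.sql("select * from tbl_letters").show()

# Drop the view before the session ends; in recent Spark versions this returns
# True when the view existed and was dropped
print(spark.catalog.dropTempView("tbl_letters"))

# Querying tbl_letters after this point raises an AnalysisException,
# because the view no longer exists in the session catalog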

Example

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName('www.freshers.in training').getOrCreate()

# State sample data
state_data = [
    (1, "California", 90201, "Los Angeles County"),
    (2, "California", 90213, "Los Angeles County"),
    (3, "California", 90222, "Los Angeles County"),
    (4, "California", 90295, "Los Angeles County"),
    (5, "California", 90240, "Los Angeles County")]

# Zip code and city name
zip_city = [
    (90201, "Bell"),
    (90213, "Beverly Hills"),
    (90222, "Compton"),
    (90295, "Venice"),
    (90240, "Downey")]

# Predefined schema for state_data
state_data_schema = StructType([
    StructField("si_no", IntegerType(), True),
    StructField("state", StringType(), True),
    StructField("zip_code", IntegerType(), True),
    StructField("county", StringType(), True)])

# Predefined schema for zip code and city
zip_city_schema = StructType([
    StructField("zip_code", IntegerType(), True),
    StructField("city_name", StringType(), True)])

state_data_df = spark.createDataFrame(data=state_data, schema=state_data_schema)
zip_city_df = spark.createDataFrame(data=zip_city, schema=zip_city_schema)
state_data_df.show(20, False)
zip_city_df.show(20, False)

# Creating temp views
state_data_df.createOrReplaceTempView("tbl_state_data")
zip_city_df.createOrReplaceTempView("tbl_zip_city")

# Executing as SQL
spark.sql("select * from tbl_state_data").show(20, False)

# Executing SQL with a sample join
spark.sql("select st.si_no, st.state, st.zip_code, z.city_name, st.county from tbl_state_data st join tbl_zip_city z on st.zip_code = z.zip_code").show(20, False)

# Dropping the temp views once they are no longer needed
spark.catalog.dropTempView("tbl_state_data")
spark.catalog.dropTempView("tbl_zip_city")
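
As an optional follow-up, and an assumption on my part rather than part of the original example, spark.catalog.listTables() can be used to confirm that the two temporary views are gone after the dropTempView calls above:

# Optional verification: temp views should no longer appear after dropTempView
remaining = [t.name for t in spark.catalog.listTables()]
print(remaining)  # tbl_state_data and tbl_zip_city should not be listed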

