Pandas API on Spark for Reading SQL Database Tables: read_sql_table()


Pandas API on Spark serves as a bridge between the Pandas and Spark ecosystems, offering versatile functionality for data manipulation. In this article, we’ll explore the read_sql_table() function, which reads SQL database tables into DataFrame objects within the Spark environment. We’ll cover its usage and parameters, and walk through a practical example with output for efficient data retrieval from SQL databases.

Understanding the read_sql_table() Function: The read_sql_table() function in Pandas API on Spark retrieves data from a SQL database table and loads it into a DataFrame object, enabling seamless integration and analysis. Because it connects over JDBC, it works with any database that provides a JDBC driver, and it offers flexibility in specifying connection details, table names, and optional parameters for customization.
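
In its simplest form, the call takes just a table name and a JDBC connection URI. Here is a minimal sketch of the call shape; the table name and URI below are hypothetical placeholders, and a SparkSession with the appropriate JDBC driver on its classpath is assumed to be available:

import pyspark.pandas as ps

# General form: read_sql_table(table_name, con, schema=None,
#                              index_col=None, columns=None, **options)
# "my_table" and the URI are placeholder values.
df = ps.read_sql_table("my_table", con="jdbc:postgresql://host:5432/my_db")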

Parameters of read_sql_table() Function:

  1. table_name: Name of the SQL database table to read data from.
  2. con: JDBC connection URI string for the database. Note that this must be a JDBC URI, not a Python database connection object.
  3. schema: Name of the database schema to query, if the database flavor supports it. Optional parameter.
  4. index_col, columns, and **options: Additional optional parameters for customization, such as setting an index column, selecting specific columns, and passing extra options through to Spark’s JDBC reader (see the sketch after this list).
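
To see these optional parameters together, the following sketch reads only the “name” and “amount” columns, uses “id” as the index, and forwards placeholder credentials to the JDBC reader (the table, URI, and credentials are hypothetical):

import pyspark.pandas as ps

df = ps.read_sql_table(
    "sales_data",
    con="jdbc:postgresql://localhost:5432/sales_db",
    index_col="id",                 # use the "id" column as the DataFrame index
    columns=["name", "amount"],     # read only these columns
    user="your_username",           # extra keyword options are forwarded
    password="your_password",       # to Spark's JDBC data source
)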

Example: Reading a SQL Database Table into a DataFrame: Let’s illustrate read_sql_table() with a practical example. Suppose we have a PostgreSQL database named “sales_db” with a table named “sales_data”, and we want to read this table into a DataFrame.

# Import necessary libraries
from pyspark.sql import SparkSession
import pyspark.pandas as ps
# Initialize SparkSession; the PostgreSQL JDBC driver must be on the
# classpath (the driver package version below is illustrative)
spark = SparkSession.builder \
    .appName("ReadSQLTable") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3") \
    .getOrCreate()
# Define the JDBC connection URI for the database
uri = "jdbc:postgresql://localhost:5432/sales_db"
# Read the SQL table into a pandas-on-Spark DataFrame; credentials are
# passed as extra options to Spark's JDBC reader
df = ps.read_sql_table("sales_data", con=uri,
                       user="your_username", password="your_password")
# Display the DataFrame
print(df)
# Stop SparkSession
spark.stop()

Output:

   id   name  amount
0   1   John    1000
1   2  Alice    1500
2   3    Bob    2000
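
Since the result is a pandas-on-Spark DataFrame, familiar pandas-style operations run distributed on the Spark cluster, and the data can be handed off to native Spark when needed. A brief hypothetical follow-up on the df above:

# Pandas-style operations execute on Spark
total = df["amount"].sum()            # aggregate across the cluster
big_sales = df[df["amount"] > 1200]   # boolean filtering

# Convert to a native Spark DataFrame for Spark SQL workflows
sdf = df.to_spark()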