Pandas API on Spark: read SQL queries or database tables into DataFrames with read_sql()

Integrating Pandas functionality into Spark workflows can boost productivity while preserving a familiar interface. In this article, we’ll delve into the read_sql() function, which allows seamless reading of SQL queries or database tables into Spark DataFrames.

Understanding read_sql()

The read_sql() function, exposed through the pyspark.pandas namespace, is a pivotal part of the Pandas API on Spark: it reads SQL queries or database tables directly into Spark-backed DataFrames. This enables data scientists and engineers to leverage Spark’s distributed computing capabilities while keeping the familiar Pandas style of data manipulation.

Syntax

pyspark.pandas.read_sql(sql, con, index_col=None, columns=None, **options)

sql: SQL query or table name to be read.

con: Database connection string; with the Pandas API on Spark this must be a JDBC connection URL, not a SQLAlchemy engine.

index_col: Optional column name (or list of names) to use as the DataFrame index.

columns: Optional list of columns to read (applies when reading a table rather than a query).

Additional parameters can be explored in the official documentation.
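
For instance, index_col and columns can be combined to pull only a projection of a table, keyed by an index column. The snippet below is a minimal sketch: the users table, its columns, and the PostgreSQL JDBC URL are hypothetical placeholders.

# Sketch: reading selected columns of a table, indexed by a key column
import pyspark.pandas as ps

df = ps.read_sql(
    "users",                               # table name (hypothetical)
    con="jdbc:postgresql://host:5432/db",  # JDBC URL (placeholder)
    index_col="user_id",                   # column to use as the DataFrame index
    columns=["name", "email"],             # read only these columns
)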

Examples

Let’s explore practical examples to illustrate the usage of read_sql().

# Example 1: Reading a SQL query from a SQLite database
import pyspark.pandas as ps
from pyspark.sql import SparkSession

# Initialize Spark session (the SQLite JDBC driver must be on the classpath)
spark = SparkSession.builder \
    .appName("Pandas API on Spark Learning @ Freshers.in") \
    .getOrCreate()

# Define SQLite JDBC connection string
sqlite_url = "jdbc:sqlite:/path/to/database.db"

# Define SQL query
query = "SELECT * FROM table_name"

# Read SQL query into a DataFrame using the Pandas API on Spark
df = ps.read_sql(query, sqlite_url)

# Display the first rows
print(df.head())

Output:

   column1  column2  column3
0        1        A        X
1        2        B        Y
2        3        C        Z

# Example 2: Reading a database table from MySQL
# Assumes the MySQL JDBC driver is available to Spark

# Define MySQL connection string
mysql_url = "jdbc:mysql://hostname:port/database"

# Read the database table into a DataFrame using the Pandas API on Spark
df = ps.read_sql("table_name", mysql_url)

# Display the first rows
print(df.head())

Output:

   column1  column2  column3
0        4        D        W
1        5        E        V
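
If the MySQL connector is not already available on the cluster, one common approach is to point spark.jars at the driver jar when building the session. The sketch below assumes a locally downloaded MySQL Connector/J jar; the path and file name are placeholders.

# Sketch: making the MySQL JDBC driver visible to Spark (jar path is a placeholder)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Pandas API on Spark Learning @ Freshers.in") \
    .config("spark.jars", "/path/to/mysql-connector-j.jar") \
    .getOrCreate()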

The read_sql() function in the Pandas API on Spark provides a seamless bridge between SQL databases and Spark DataFrames, allowing for efficient data retrieval and manipulation. By following the examples outlined in this guide, users can effortlessly integrate SQL queries or database tables into their Spark workflows.
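
Because the result is a pandas-on-Spark DataFrame, it can also be handed off to native Spark APIs when needed. As a brief illustration (the filter condition is arbitrary), to_spark() converts it to a regular Spark DataFrame:

# Convert to a native Spark DataFrame for downstream Spark operations
sdf = df.to_spark()
sdf.filter(sdf.column1 > 1).show()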
