Spark : SQL query execution into DataFrames : read_sql_query()


Spark provides its own DataFrame API, but the Pandas API on Spark (the pyspark.pandas module) lets you keep writing familiar pandas-style code. One of its functions, read_sql_query(), runs a SQL query against a database over JDBC and returns the result as a pandas-on-Spark DataFrame, bridging the gap between SQL databases and Spark processing.

Understanding read_sql_query()

The read_sql_query() function is part of the Pandas API on Spark (pyspark.pandas). It reads the result of a SQL query directly into a pandas-on-Spark DataFrame, which is backed by a Spark DataFrame. The query is sent to the database through Spark’s JDBC data source, so the result can then be processed with Spark’s distributed computing capabilities.

Syntax

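pyspark.pandas.read_sql_query(sql, con, index_col=None, **options)

sql: SQL query to be executed.

con: JDBC URI of the database, given as a string. Unlike pandas, a Python database URI or SQLAlchemy engine is not accepted here.

index_col: Optional column name, or list of column names, to use as the DataFrame index.

**options: Additional keyword arguments passed directly to Spark’s JDBC data source. The params argument of pandas.read_sql_query() is not part of this signature.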

Additional parameters can be explored in the official documentation.
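Because con is a JDBC URI string in the Pandas API on Spark, and extra keyword arguments are forwarded to Spark’s JDBC data source, it can help to build the URI and the call in one place. The following is a minimal sketch rather than a definitive implementation. The helper names make_jdbc_url and read_query, the host db.example.com, the database sales, and the credentials are all hypothetical. The host-based URI form shown fits drivers such as MySQL or PostgreSQL; file-based SQLite URIs look different.

```python
def make_jdbc_url(subprotocol: str, host: str, port: int, database: str) -> str:
    """Hypothetical helper: build a host-based JDBC URI, e.g. jdbc:mysql://host:3306/db."""
    return f"jdbc:{subprotocol}://{host}:{port}/{database}"


def read_query(sql, url, index_col=None, **jdbc_options):
    """Hypothetical wrapper around pyspark.pandas.read_sql_query()."""
    # Imported lazily so the URI helper also works where PySpark is not installed.
    import pyspark.pandas as ps

    # Extra keyword arguments (for example user, password, driver) are
    # forwarded to Spark's JDBC data source.
    return ps.read_sql_query(sql, url, index_col=index_col, **jdbc_options)


# Placeholder host and database names, for illustration only.
url = make_jdbc_url("mysql", "db.example.com", 3306, "sales")
print(url)

# Usage, assuming a reachable MySQL server and its JDBC driver on Spark's classpath:
# df = read_query("SELECT id, amount FROM orders", url,
#                 index_col="id", user="report_user", password="secret")
```

Keeping the URI construction separate from the Spark call means the connection string can be checked without starting a Spark session.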

Examples

Let’s dive into practical examples to illustrate the usage of read_sql_query().

# Example 1: Reading from a SQLite database
# Assumes the SQLite JDBC driver is available on Spark's classpath
import pyspark.pandas as ps
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Pandas API on Spark : Learning @ Freshers.in") \
    .getOrCreate()

# Define SQLite JDBC URI
sqlite_url = "jdbc:sqlite:/path/to/database.db"

# Define SQL query
query = "SELECT * FROM table_name"

# Read SQL query into a pandas-on-Spark DataFrame
df = ps.read_sql_query(query, sqlite_url)

# Display DataFrame
print(df.head())

Output:

   column1  column2  column3
0        1        A        X
1        2        B        Y
2        3        C        Z

# Example 2: Reading from a MySQL database
# Assuming proper JDBC driver setup for MySQL
# Reuses the ps import and the Spark session from Example 1

# Define MySQL JDBC URI
mysql_url = "jdbc:mysql://hostname:port/database"

# Define SQL query
query = "SELECT * FROM table_name WHERE condition = 'value'"

# Read SQL query into a pandas-on-Spark DataFrame
# Credentials can be supplied as JDBC options (user, password) via keyword arguments
df = ps.read_sql_query(query, mysql_url)

# Display DataFrame
print(df.head())

Output:

   column1  column2  column3
0        4        D        W
1        5        E        V
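To try the SQLite example without an existing database, a small test file can be created with Python’s built-in sqlite3 module. This is only a sketch under stated assumptions: the helper create_sample_db and the temporary file location are hypothetical, and the table and column names simply mirror the placeholders used in the examples. Reading the file through Spark would still require the SQLite JDBC driver.

```python
import os
import sqlite3
import tempfile


def create_sample_db(path: str) -> None:
    """Hypothetical helper: create a SQLite file with the placeholder table from the examples."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS table_name "
            "(column1 INTEGER, column2 TEXT, column3 TEXT)"
        )
        # Start from an empty table so repeated runs give the same content.
        conn.execute("DELETE FROM table_name")
        conn.executemany(
            "INSERT INTO table_name VALUES (?, ?, ?)",
            [(1, "A", "X"), (2, "B", "Y"), (3, "C", "Z")],
        )
        conn.commit()
    finally:
        conn.close()


# Write the sample database to a temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "database.db")
create_sample_db(db_path)

# JDBC URI that could be passed as con to read_sql_query().
sqlite_url = f"jdbc:sqlite:{db_path}"

# Sanity check with the standard library before involving Spark.
conn = sqlite3.connect(db_path)
rows = conn.execute("SELECT * FROM table_name ORDER BY column1").fetchall()
conn.close()
print(rows)
```

Checking the rows with sqlite3 first makes it easier to tell a data problem apart from a JDBC driver or Spark configuration problem.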

The read_sql_query() function of the Pandas API on Spark is a convenient way to load the results of SQL queries into Spark-backed DataFrames. It lets data scientists and engineers keep a pandas-style workflow while relying on Spark for large-scale data processing.

Author: user