Spark : SQL query execution into DataFrames : read_sql_query()


While Spark provides its own APIs, the pandas API on Spark (pyspark.pandas) offers a familiar pandas-style interface on top of them. One of its functions, read_sql_query(), runs a SQL query against a database over JDBC and returns the result as a pandas-on-Spark DataFrame, bridging the gap between SQL databases and Spark processing.

Understanding read_sql_query()

The read_sql_query() function is part of the pandas API on Spark and reads the result of a SQL query directly into a pandas-on-Spark DataFrame. The query is sent to the database through Spark's JDBC data source, and the returned data can then be processed with Spark's distributed computing capabilities.


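pyspark.pandas.read_sql_query(sql, con, index_col=None, **options)

sql: SQL query to be executed.

con: JDBC URI of the database, given as a string (for example, "jdbc:sqlite:/path/to/database.db"). It must be a JDBC URI rather than a Python database URI.

index_col: Optional column name, or list of column names, to use as the DataFrame index.

options: Additional keyword arguments passed directly to Spark's JDBC data source.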

Additional parameters can be explored in the official documentation.
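
To make the parameters more concrete, here is a minimal sketch of a call that sets index_col and forwards extra keyword options to Spark's JDBC data source. The helper names jdbc_options and read_indexed, the credentials, and the choice of column1 as index are assumptions made for illustration; user, password, driver and fetchsize are standard Spark JDBC options.

```python
def jdbc_options(user, password, driver=None, fetchsize=None):
    """Collect keyword options for Spark's JDBC data source (hypothetical helper)."""
    options = {"user": user, "password": password}
    if driver is not None:
        options["driver"] = driver
    if fetchsize is not None:
        # Spark options are passed as strings
        options["fetchsize"] = str(fetchsize)
    return options


def read_indexed(sql, con, index_col, **options):
    """Run a query over JDBC and use index_col as the DataFrame index (hypothetical helper)."""
    # Imported lazily so jdbc_options() can be used without Spark installed
    import pyspark.pandas as ps

    return ps.read_sql_query(sql, con, index_col=index_col, **options)


# Placeholder credentials; replace with real values
opts = jdbc_options("report_user", "secret", driver="com.mysql.cj.jdbc.Driver")

# With a reachable database and the driver on the classpath, the call would be:
# df = read_indexed("SELECT * FROM table_name",
#                   "jdbc:mysql://hostname:port/database",
#                   "column1", **opts)
```

Keeping the options in a dictionary makes it easy to reuse the same connection settings across several queries.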


Let’s dive into practical examples to illustrate the usage of read_sql_query().

# Example 1: Reading from a SQLite database
# Assuming the SQLite JDBC driver is available on the Spark classpath
import pyspark.pandas as ps
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Pandas API on Spark : Learning @") \
    .getOrCreate()

# Define SQLite connection string (must be a JDBC URI)
sqlite_url = "jdbc:sqlite:/path/to/database.db"

# Define SQL query
query = "SELECT * FROM table_name"

# Read SQL query into DataFrame using Pandas API on Spark
df = ps.read_sql_query(query, sqlite_url)

# Display DataFrame
print(df)

Output:

   column1  column2  column3
0        1        A        X
1        2        B        Y
2        3        C        Z

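
To try Example 1 locally, a small SQLite file with matching contents is needed. The following is a minimal sketch using Python's standard sqlite3 module; the helper name create_sample_db is hypothetical, and the table and column names simply mirror the placeholders used in the example.

```python
import sqlite3


def create_sample_db(path):
    """Create table_name with three sample rows and return them (hypothetical helper)."""
    rows = [(1, "A", "X"), (2, "B", "Y"), (3, "C", "Z")]
    conn = sqlite3.connect(path)
    try:
        # The connection context manager commits on success
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS table_name "
                "(column1 INTEGER, column2 TEXT, column3 TEXT)"
            )
            # Clear old rows so repeated runs give the same contents
            conn.execute("DELETE FROM table_name")
            conn.executemany("INSERT INTO table_name VALUES (?, ?, ?)", rows)
        return conn.execute(
            "SELECT column1, column2, column3 FROM table_name ORDER BY column1"
        ).fetchall()
    finally:
        conn.close()
```

Calling create_sample_db("/path/to/database.db") once before running Example 1 would give the query a table to read.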
# Example 2: Reading from a MySQL database
# Assuming proper JDBC driver setup for MySQL
import pyspark.pandas as ps

# Define MySQL connection string (must be a JDBC URI)
mysql_url = "jdbc:mysql://hostname:port/database"

# Define SQL query
query = "SELECT * FROM table_name WHERE condition = 'value'"

# Read SQL query into DataFrame using Pandas API on Spark
df = ps.read_sql_query(query, mysql_url)

# Display DataFrame
print(df)

Output:

   column1  column2  column3
0        4        D        W
1        5        E        V
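
The hostname, port and database parts of the MySQL connection string are placeholders. As a small sketch of how they might be filled in, the hypothetical helper below assembles the JDBC URI from its parts; the host and database names in the usage line are invented, and 3306 is MySQL's default port.

```python
def mysql_jdbc_url(host, database, port=3306):
    """Build a MySQL JDBC URI from its parts (hypothetical helper)."""
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return f"jdbc:mysql://{host}:{port}/{database}"


# Invented host and database names, for illustration only
mysql_url = mysql_jdbc_url("db.example.com", "sales")
print(mysql_url)
```

The resulting string can be passed as the con argument of read_sql_query() in place of the placeholder URL.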

The read_sql_query() function of the pandas API on Spark reads the result of a SQL query over JDBC into a pandas-on-Spark DataFrame with a single call. It keeps data retrieval code short and pandas-like while the result remains available for large-scale processing in Spark, which makes it useful for data scientists and engineers working with large datasets.

Author: user