Spark : SQL query execution into DataFrames : read_sql_query()

user January 29, 2024

While Spark provides its own APIs, integrating Pandas functionalities can enhance productivity and familiarity. One such function, read_sql_query(), enables seamless execution of SQL queries into Spark DataFrames, bridging the gap between SQL databases and Spark processing.

Understanding read_sql_query()

The read_sql_query() function is a part of the Pandas API on Spark, specifically designed to read SQL queries directly into Spark DataFrames. It simplifies data retrieval from SQL databases, leveraging Spark’s distributed computing capabilities.

Syntax

pandas.DataFrame.read_sql_query(sql, con, index_col=None, params=None, ...*)

sql: SQL query to be executed.

con: Database connection string or SQLAlchemy engine.

index_col: Optional parameter to specify DataFrame index.

params: Optional parameter for SQL query parameters.

Additional parameters can be explored in the official documentation.

Examples

Let’s dive into practical examples to illustrate the usage of read_sql_query().

# Example 1: Reading from a SQLite database
import pandas as pd
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Pandas API on Spark : Learning @ Freshers.in") \
    .getOrCreate()

# Define SQLite connection string
sqlite_url = "jdbc:sqlite:/path/to/database.db"

# Define SQL query
query = "SELECT * FROM table_name"

# Read SQL query into DataFrame using Pandas API on Spark
df = pd.read_sql_query(query, sqlite_url)

# Display DataFrame
print(df.head())

Output:

   column1  column2  column3
0        1        A        X
1        2        B        Y
2        3        C        Z

# Example 2: Reading from a MySQL database
# Assuming proper JDBC driver setup for MySQL

# Define MySQL connection string
mysql_url = "jdbc:mysql://hostname:port/database"

# Define SQL query
query = "SELECT * FROM table_name WHERE condition = 'value'"

# Read SQL query into DataFrame using Pandas API on Spark
df = pd.read_sql_query(query, mysql_url)

# Display DataFrame
print(df.head())

Output:

   column1  column2  column3
0        4        D        W
1        5        E        V

read_sql_query() function of the Pandas API on Spark is a powerful tool for seamlessly integrating SQL queries into Spark DataFrames. Its simplicity and efficiency make it an invaluable asset for data scientists and engineers working with large-scale data processing.

Spark important urls to refer

Post Views: 2

Author: user

Spark : SQL query execution into DataFrames : read_sql_query()

Trending

Recent Posts

Featured Posts – Slider Widget

AWS EC2 vs Azure Virtual Machines

Production and Industrial Engineering

Engineering Technical campus placement question and answers

JavaScript’s reduceRight() method to iterate over an array from right to left

Merging Multiple Images into a Single PDF File Using Python

Nanotechnology

Electronics and Instrumentation

Chemical Engineering

Civil Engineering

Backpressure in AWS Kinesis Streams: Optimizing Data Processing

Most Viewed Posts

Related Articles

Trending

Recent Posts

Featured Posts – Slider Widget