Pandas API on Spark: read SQL queries or database tables into DataFrames with read_sql()

Integrating Pandas functionality into Spark workflows can boost productivity while preserving a familiar interface. In this article, we’ll delve into the read_sql() function, which allows seamless reading of SQL queries or database tables into Spark DataFrames.

Understanding read_sql()

The read_sql() function, exposed through the pyspark.pandas namespace, is a pivotal part of the Pandas API on Spark: it reads SQL queries or database tables directly into Spark-backed DataFrames. This enables data scientists and engineers to leverage Spark’s distributed computing capabilities while keeping the familiar Pandas style of data manipulation.

Syntax

pyspark.pandas.read_sql(sql, con, index_col=None, columns=None, **options)

sql: SQL query or table name to be read.

con: Database connection string; with the Pandas API on Spark this must be a JDBC connection URL, not a SQLAlchemy engine.

index_col: Optional column name (or list of names) to use as the DataFrame index.

columns: Optional list of columns to read (applies when reading a table rather than a query).

Additional parameters can be explored in the official documentation.
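
For instance, index_col and columns can be combined to pull only a projection of a table, keyed by an index column. The snippet below is a minimal sketch: the users table, its columns, and the PostgreSQL JDBC URL are hypothetical placeholders.

# Sketch: reading selected columns of a table, indexed by a key column
import pyspark.pandas as ps

df = ps.read_sql(
    "users",                               # table name (hypothetical)
    con="jdbc:postgresql://host:5432/db",  # JDBC URL (placeholder)
    index_col="user_id",                   # column to use as the DataFrame index
    columns=["name", "email"],             # read only these columns
)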

Examples

Let’s explore practical examples to illustrate the usage of read_sql().

# Example 1: Reading a SQL query from a SQLite database
import pyspark.pandas as ps
from pyspark.sql import SparkSession

# Initialize Spark session (the SQLite JDBC driver must be on the classpath)
spark = SparkSession.builder \
    .appName("Pandas API on Spark Learning @ Freshers.in") \
    .getOrCreate()

# Define SQLite JDBC connection string
sqlite_url = "jdbc:sqlite:/path/to/database.db"

# Define SQL query
query = "SELECT * FROM table_name"

# Read SQL query into a DataFrame using the Pandas API on Spark
df = ps.read_sql(query, sqlite_url)

# Display the first rows
print(df.head())

Output:

   column1  column2  column3
0        1        A        X
1        2        B        Y
2        3        C        Z

# Example 2: Reading a database table from MySQL
# Assumes the MySQL JDBC driver is available to Spark

# Define MySQL connection string
mysql_url = "jdbc:mysql://hostname:port/database"

# Read the database table into a DataFrame using the Pandas API on Spark
df = ps.read_sql("table_name", mysql_url)

# Display the first rows
print(df.head())

Output:

   column1  column2  column3
0        4        D        W
1        5        E        V
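
If the MySQL connector is not already available on the cluster, one common approach is to point spark.jars at the driver jar when building the session. The sketch below assumes a locally downloaded MySQL Connector/J jar; the path and file name are placeholders.

# Sketch: making the MySQL JDBC driver visible to Spark (jar path is a placeholder)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Pandas API on Spark Learning @ Freshers.in") \
    .config("spark.jars", "/path/to/mysql-connector-j.jar") \
    .getOrCreate()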

The read_sql() function in the Pandas API on Spark provides a seamless bridge between SQL databases and Spark DataFrames, allowing for efficient data retrieval and manipulation. By following the examples outlined in this guide, users can effortlessly integrate SQL queries or database tables into their Spark workflows.
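
Because the result is a pandas-on-Spark DataFrame, it can also be handed off to native Spark APIs when needed. As a brief illustration (the filter condition is arbitrary), to_spark() converts it to a regular Spark DataFrame:

# Convert to a native Spark DataFrame for downstream Spark operations
sdf = df.to_spark()
sdf.filter(sdf.column1 > 1).show()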
