Spark : How to reveal the underlying data’s dimensions – Series.shape


When dealing with large datasets, the distributed computing power of Apache Spark becomes indispensable. The Pandas API on Spark offers the best of both worlds, pairing the familiar Pandas interface with Spark’s scalability. One crucial aspect of data analysis is understanding the shape of a dataset, and the Series.shape attribute plays a pivotal role in this regard.

Understanding Series.shape

The Series.shape attribute in the Pandas API on Spark returns a tuple representing the dimensions of the underlying data. Note that it is a property, not a method, so it is accessed without parentheses. It provides insight into the structure of the dataset, which is crucial for many data manipulation tasks.
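To make the property-versus-method distinction concrete, here is a minimal sketch using plain Pandas (the Pandas API on Spark exposes the same attribute with the same semantics):

```python
import pandas as pd  # plain pandas; pyspark.pandas mirrors this attribute

# shape is a property, accessed without parentheses
s = pd.Series([28, 32, 25, 30, 27])

print(s.shape)        # a 1-tuple: (5,)
print(type(s.shape))  # <class 'tuple'>
print(s.shape[0])     # 5, the number of elements
```

Because a Series is always one-dimensional, the tuple always has exactly one entry.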

Example 1: Exploring Dataset Dimensions

Consider a scenario where we have a Pandas Series on Spark containing temperature data:

import pyspark.pandas as ps
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Pandas API On Spark : series Learning @") \
    .getOrCreate()

# Sample temperature data
data = [28, 32, 25, 30, 27]

# Create a Pandas Series on Spark
series = ps.Series(data)

# Get the shape of the Series
shape = series.shape
print("Shape of the Series:", shape)
Shape of the Series: (5,)

In this example, the shape of the Series is (5,), indicating that it has one dimension with five elements.
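Since a Series is one-dimensional, shape[0] agrees with len() and the size attribute. The following sketch, written in plain Pandas (the Pandas API on Spark mirrors all three), illustrates the relationship, including the edge case of an empty Series:

```python
import pandas as pd

s = pd.Series([28, 32, 25, 30, 27])

# For a one-dimensional Series these three agree
assert s.shape == (5,)
assert len(s) == 5
assert s.size == 5

# An empty Series still reports a valid 1-tuple
empty = pd.Series([], dtype="float64")
print(empty.shape)  # (0,)
```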

Example 2: Handling Multi-dimensional Data

Now, let’s examine a more complex scenario involving multi-dimensional data:

import pyspark.pandas as ps

# Sample multi-dimensional data
multi_data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]

# Create a Pandas DataFrame on Spark
df = ps.DataFrame(multi_data)

# Extract the first column of the DataFrame as a Series
series_from_df = df.iloc[:, 0]

# Get the shape of the Series
shape_df = series_from_df.shape
print("Shape of the Series from DataFrame:", shape_df)
Shape of the Series from DataFrame: (3,)

In this example, we extracted the first column from a DataFrame, resulting in a Series with three elements, hence the shape (3,).
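The contrast between a DataFrame’s two-dimensional shape and a Series’ one-dimensional shape can be seen side by side. The sketch below uses plain Pandas for brevity; the Pandas API on Spark behaves the same way:

```python
import pandas as pd

multi_data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(multi_data)

# A DataFrame reports two dimensions: (rows, columns)
print(df.shape)   # (3, 3)

# Selecting a single column collapses the column axis, yielding a Series
col = df.iloc[:, 0]
print(col.shape)  # (3,)

# Selecting with a list of positions keeps a DataFrame, so two dims remain
sub = df.iloc[:, [0]]
print(sub.shape)  # (3, 1)
```

This is why the extracted column in the example above reports (3,) rather than (3, 1): single-column selection returns a Series, not a one-column DataFrame.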
