Pandas API on Spark: Learn indexing and iteration with examples


Pandas, coupled with the scalability of Spark, offers a formidable toolset for data manipulation and analysis at scale. In this article, we look at indexing and iteration in the Pandas API on Spark, working through key functions: Series.at, Series.iat, Series.loc, Series.iloc, Series.keys(), and Series.pop(item). Through worked examples, we show what each one does and how it can be used in a Spark environment.

1. Series.at

The Series.at accessor lets you read a single value by its index label. Let’s illustrate its usage:

# Import necessary libraries
from pyspark.sql import SparkSession
import pandas as pd
# Create a SparkSession
spark = SparkSession.builder \
    .appName("Pandas API on Spark : Learning @ Freshers.in ") \
    .getOrCreate()
# Sample data
data = {'A': [1, 2, 3, 4, 5]}
df = spark.createDataFrame(pd.DataFrame(data))
# Collect column 'A' to the driver as a regular pandas Series
series = df.select('A').toPandas()['A']
# Access a single value by label
value = series.at[0]
# Print the value
print("Value at label 0:", value)

Output:

Value at label 0: 1
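
The setup above collects the column to the driver with toPandas(), so the later calls run on a plain pandas Series. To keep the data distributed, the same lookup can be done on a pandas-on-Spark Series from pyspark.pandas (bundled with PySpark 3.2+). A minimal sketch, assuming the same sample values:

# Same lookup with the Pandas API on Spark; the data stays distributed
import pyspark.pandas as ps
psser = ps.Series([1, 2, 3, 4, 5], name='A')
# .at accesses a single value by index label, just as in pandas
print("Value at label 0:", psser.at[0])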

2. Series.iat

The Series.iat accessor reads a single value by its integer position. Here’s how it works:

# Access a single value by integer position
value = series.iat[0]
# Print the value
print("Value at integer position 0:", value)

Output:

Value at integer position 0: 1
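
The difference between at and iat only shows up when the index labels are not the same as the integer positions. A small sketch with hypothetical string labels:

# With string labels, .at looks up by label and .iat by position
labeled = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
print(labeled.at['y'])   # 20, found by label
print(labeled.iat[1])    # 20, found by integer position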

3. Series.loc

Series.loc selects a group of values by label(s) or with a boolean Series; note that, unlike iloc, a loc slice includes both endpoints. Let’s see an example:

# Access a group of values by label
group = series.loc[1:3]
# Print the group
print("Group of values at labels 1 to 3:")
print(group)

Output:

Group of values at labels 1 to 3:
1    2
2    3
3    4
Name: A, dtype: int64
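
Besides a label slice, loc also accepts a boolean Series of the same length, which is handy for filtering. A minimal sketch on the same series:

# Select only the values greater than 2 using a boolean mask
mask = series > 2
print(series.loc[mask])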

4. Series.iloc

Series.iloc provides purely integer-location based indexing for selection by position. Here’s a demonstration:

# Access values by integer position
values = series.iloc[1:3]
# Print the values
print("Values at integer positions 1 to 2:")
print(values)

Output:

Values at integer positions 1 to 2:
1    2
2    3
Name: A, dtype: int64
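
Unlike loc, the end of an iloc slice is excluded, and negative positions count from the end, just as in plain Python indexing. A short sketch on the same series:

# Negative positions and end-exclusive slicing with .iloc
print(series.iloc[-1])   # value at the last position
print(series.iloc[:2])   # first two values; position 2 is excluded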

5. Series.keys()

Series.keys() is an alias for the index: calling it returns the Series index. Let’s see how it works:

# Get alias for index
alias = series.keys()
# Print the alias
print("Alias for index:", alias)

Output:

Alias for index: RangeIndex(start=0, stop=5, step=1)
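
Because keys() simply returns the index, it can be used wherever the index is needed, for example to iterate over the labels. A small sketch:

# keys() returns the Series index, so it can be compared and iterated
print(series.keys().equals(series.index))  # True
for label in series.keys():
    print(label, series.at[label])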

6. Series.pop(item)

The Series.pop(item) method removes the value with the given label from the Series and returns it. Here’s an example:

# Pop an item from the series
popped_item = series.pop(0)
# Print the popped item and the modified series
print("Popped Item:", popped_item)
print("Modified Series:")
print(series)

Output:

Popped Item: 1
Modified Series:
1    2
2    3
3    4
4    5
Name: A, dtype: int64
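
Keep in mind that pop() takes an index label, not an integer position, and raises a KeyError if the label is missing. A small sketch with hypothetical string labels:

# pop() removes by label; a missing label raises KeyError
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
print(s.pop('y'))        # 20, and 'y' is removed from s
try:
    s.pop('missing')
except KeyError:
    print("Label 'missing' not found")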
