147. What does getLocalProperty(key) do in Apache Spark ?
Returns a local property set in this thread, or None if it is missing. See setLocalProperty().
148. What is sequenceFile in Apache Spark ? How do you access it ?
sequenceFile(path, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)
Read a Hadoop SequenceFile with arbitrary key and value Writable classes from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is as follows:
1. A Java RDD is created from the SequenceFile or other InputFormat, and the key and value Writable classes.
2. Serialization is attempted via Pyrolite pickling.
3. If this fails, the fallback is to call 'toString' on each key and value.
4. PickleSerializer is used to deserialize pickled objects on the Python side.
Parameters: path – path to sequencefile
keyClass – fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")
valueClass – fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")
minSplits – minimum splits in dataset (default min(2, sc.defaultParallelism))
batchSize – the number of Python objects represented as a single Java object (default 0, choose batchSize automatically)
149. What is setCheckpointDir(dirName) in Apache Spark ?
Set the directory under which RDDs will be checkpointed. The directory must be an HDFS path if running on a cluster.
150. What is sparkUser() in Apache Spark ?
Get SPARK_USER for the user who is running the SparkContext.
151. What is startTime in Apache Spark ?
Return the epoch time (in milliseconds) when the Spark Context was started.
152. What is statusTracker() in Apache Spark ?
Return a StatusTracker object (low-level status reporting APIs for monitoring job and stage progress).
>>> status = sc.statusTracker()
>>> print(status)
<pyspark.status.StatusTracker object at 0x7f954a5c1dd0>
153. What is stop() in Apache Spark ?
Shut down the SparkContext.