How to start serverless Spark on GCP


To run a serverless Spark job on Google Cloud Platform (GCP), use Dataproc Serverless. Unlike classic Cloud Dataproc clusters, Dataproc Serverless lets you submit Apache Spark workloads (called batches) without provisioning or managing a cluster: Google allocates the compute resources for each workload and releases them when it finishes.

Here are the basic steps to run a Spark batch workload on Dataproc Serverless:

  1. Prepare your project: Enable the Dataproc API in your GCP project and make sure the VPC subnetwork you will use has Private Google Access enabled, which Dataproc Serverless requires. There is no cluster to create; resources are provisioned per workload.
  2. Submit a batch workload: Use the gcloud command-line tool, a Cloud Client Library, or the Dataproc Batches API to submit your Spark job. You specify the main Python file (for PySpark) or the main class and JAR (for Spark), along with arguments and Spark configuration properties.
  3. Monitor the workload's progress: Use the Batches page in the Cloud Console or the gcloud command-line tool to check the batch's state, and view the driver output and logs in Cloud Logging.
  4. Let autoscaling do its work: Dataproc Serverless scales executors automatically using Spark dynamic allocation; you can bound it with Spark properties such as spark.dynamicAllocation.maxExecutors.
  5. No teardown needed: Because there is no cluster, resources are released automatically when the batch finishes, and you are billed only for the resources used while the workload runs.
  6. Schedule recurring jobs: For repeated or dependent workloads, orchestrate batch submissions with a scheduler or workflow tool such as Cloud Composer or Cloud Scheduler.
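The steps above can be sketched with the gcloud CLI. This is a hedged example, not a full guide: the project ID, region, bucket, and file paths below are placeholders you would replace with your own.

```shell
# Submit a PySpark batch workload to Dataproc Serverless.
# my-project, us-central1, and gs://my-bucket/... are placeholders.
gcloud dataproc batches submit pyspark gs://my-bucket/jobs/wordcount.py \
    --project=my-project \
    --region=us-central1 \
    --batch=wordcount-batch-001 \
    -- gs://my-bucket/input/ gs://my-bucket/output/

# Check the state of the batch (PENDING, RUNNING, SUCCEEDED, ...).
gcloud dataproc batches describe wordcount-batch-001 \
    --project=my-project --region=us-central1

# List recent batches in the region.
gcloud dataproc batches list --project=my-project --region=us-central1
```

Note that everything after the bare `--` separator is passed through as arguments to the Spark job itself rather than interpreted as gcloud flags.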

Also be aware that Dataproc Serverless has quotas and some limitations compared with Dataproc on Compute Engine clusters, so check the current limits in the documentation. If you need other big data frameworks such as Apache Hive or Apache Pig, those run on regular Cloud Dataproc clusters rather than on Dataproc Serverless.
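If you submit programmatically through the Dataproc Batches API instead of gcloud, the request body is a JSON document. Below is a minimal sketch as a plain Python dict; the field names follow my reading of the v1 batches resource, and the bucket and file URIs are placeholder assumptions, so verify against the API reference before relying on it.

```python
# Sketch of a Dataproc Serverless batch request body (v1 "batches" resource).
# All gs:// URIs below are placeholders, not real resources.
def make_pyspark_batch_body(main_py_uri, args, max_executors=10):
    """Build the JSON body for a batches.create call (PySpark workload)."""
    return {
        "pysparkBatch": {
            "mainPythonFileUri": main_py_uri,  # gs:// URI of the driver script
            "args": list(args),                # passed to the script's argv
        },
        "runtimeConfig": {
            # Ordinary Spark properties; here we cap dynamic allocation.
            "properties": {
                "spark.dynamicAllocation.maxExecutors": str(max_executors),
            },
        },
    }

body = make_pyspark_batch_body(
    "gs://my-bucket/jobs/wordcount.py",
    ["gs://my-bucket/input/", "gs://my-bucket/output/"],
)
```

This body would be POSTed to the regional batches endpoint (or passed to a Cloud Client Library); the full schema, including environment and network settings, is in the Dataproc API reference.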

Please check the official Dataproc Serverless documentation for more details and a more complete guide.

