Introduction to Cloud Dataproc Spark Jobs on GKE


17 August 2020

What is Cloud Dataproc, and how can it be implemented on GKE?

Cloud Dataproc is Google Cloud’s fully managed Apache Hadoop and Spark service. The mission of Cloud Dataproc has always been to make it simple and intuitive for developers and data scientists to apply their existing tools, algorithms, and programming languages to cloud-scale datasets. It is flexible: teams can continue to use the skills and techniques they already know to explore data of any size. Many enterprises and SaaS companies around the world currently use Cloud Dataproc for data processing and analytics.

With Cloud Dataproc on Kubernetes, we can eliminate the need for multiple types of clusters, each with its own software stack and complex operational processes. By extending the Cloud Dataproc Jobs API to GKE, we can package all of a job’s dependencies into a single Docker container. That Docker container lets us integrate Spark jobs directly into the rest of our software development pipelines.

Also, by extending the Cloud Dataproc Jobs API to GKE, administrators get a unified management system where they can apply their existing Kubernetes knowledge. We can avoid having a silo of Spark applications that must be managed in standalone virtual machines or in Apache Hadoop YARN.

How to submit an Apache Spark job to Cloud Dataproc on GKE?

Registering the GKE cluster with Cloud Dataproc:

Before we can execute Cloud Dataproc jobs on GKE, we must first register the GKE cluster with Cloud Dataproc. Once the GKE cluster is registered, it appears alongside the rest of the Cloud Dataproc clusters when we run the following command:

$ gcloud dataproc clusters list --region {YOUR_GKE_REGION}
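Registration itself is done by creating a Cloud Dataproc cluster that points at an existing GKE cluster. As a sketch only — the flag names and the beta image version below are taken from the beta documentation and may differ in your release:

 $ gcloud beta dataproc clusters create "${DATAPROC_CLUSTER}" \
     --gke-cluster="${GKE_CLUSTER}" \
     --region="${YOUR_GKE_REGION}" \
     --image-version=1.4.27-beta \
     --bucket="${BUCKET}"

Here ${DATAPROC_CLUSTER}, ${GKE_CLUSTER}, and ${BUCKET} are placeholders for your own cluster names and staging bucket.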

Defining the Cloud Dataproc Docker container:

Cloud Dataproc offers various Docker images that match the software bundles on the Cloud Dataproc image version list. This makes it seamless to port Spark code between Cloud Dataproc running on Compute Engine and Cloud Dataproc jobs on GKE.

The Docker container not only packages Cloud Dataproc’s agent for job management but also builds on top of Google Cloud’s Spark Operator for Kubernetes. This is a fully open-source operator that provides many of the integrations between Kubernetes and the rest of Google Cloud Platform, including:

  • An integration with BigQuery, Google’s serverless data warehouse
  • Google Cloud Storage as a replacement for HDFS
  • Logs shipped to Stackdriver
  • Access to sparkctl, a command-line tool that simplifies handling client-local application dependencies in a Kubernetes environment
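To illustrate what the underlying operator consumes, here is a minimal SparkApplication manifest for the open-source Spark Operator. The field names come from the operator’s v1beta2 CRD; the image tag and jar path are illustrative only:

 apiVersion: "sparkoperator.k8s.io/v1beta2"
 kind: SparkApplication
 metadata:
   name: spark-pi
 spec:
   type: Scala
   mode: cluster
   image: "gcr.io/spark-operator/spark:v2.4.5"
   mainClass: org.apache.spark.examples.SparkPi
   mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar"
   sparkVersion: "2.4.5"
   driver:
     cores: 1
     serviceAccount: spark
   executor:
     cores: 1
     instances: 2

With Cloud Dataproc on GKE, manifests like this are handled for us by the Jobs API; they are shown here only to make the operator integration concrete.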

This Cloud Dataproc Docker container can be customized to include all the packages and configurations needed for the Spark job.
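A customization might look like the following Dockerfile sketch. The base image name is hypothetical — pick one that matches your target Spark version from the Cloud Dataproc image version list — and the added package and config file are examples only:

 # Base image name is hypothetical; use one from the Cloud Dataproc image version list
 FROM gcr.io/cloud-dataproc/dataproc:1.4

 # Add extra Python libraries that the Spark job needs
 RUN pip install --no-cache-dir pandas

 # Bake job-specific Spark configuration into the image
 COPY my-spark-defaults.conf /etc/spark/conf/spark-defaults.conf

The resulting image can then be pushed to Container Registry (e.g., gcr.io/${PROJECT}/my-spark-image) and referenced when submitting the job.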

Submitting the Cloud Dataproc job:

Once the Docker container is ready, we can submit a Cloud Dataproc job to the GKE cluster. We can follow the same instructions used to submit any Cloud Dataproc Spark job.

$ gcloud dataproc jobs submit spark \
    --cluster "${CLUSTER}" \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/examples/jars/spark-examples.jar

Extending Cloud Dataproc with our own container:

Running the above job duplicates on GKE the software environment found on Cloud Dataproc. With the GKE option, however, there is an extra benefit: we can specify a container image to associate with the job. The container property provides a reliable pairing of the job code and the necessary software configurations.

 --properties spark.kubernetes.container.image="gcr.io/${PROJECT}/my-spark-image"
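Putting it together, the property is simply appended to the same submit command shown earlier (the image path is illustrative):

$ gcloud dataproc jobs submit spark \
    --cluster "${CLUSTER}" \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --properties spark.kubernetes.container.image="gcr.io/${PROJECT}/my-spark-image"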
