If you are looking to migrate your existing on-premises Hadoop-based PySpark jobs to GCP and don’t want to create a Dataproc cluster to run them, here are 3 different ways you can achieve that.
Basic information about the services mentioned in this article:
Dataflow: A fully managed Google Cloud service for running Apache Beam data processing pipelines, in both batch and streaming modes.
BigQuery: Google Cloud's serverless, fully managed data warehouse for analyzing data with standard SQL.
Dataform: A Google Cloud service for developing, version-controlling, and orchestrating SQL-based (SQLX) transformations inside BigQuery.
In all of the pipelines below, the input/raw HDFS layer is replaced by GCS buckets and the final/processed HDFS layer is replaced by BigQuery tables. The suggested orchestration service on GCP is Cloud Composer; a minimal DAG sketch follows.
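For illustration, here is a minimal sketch of how a Cloud Composer (Airflow 2) DAG could chain the stages described in the pipelines below: a Dataflow load step followed by an in-BigQuery transformation step (as in pipelines 2 and 3; pipeline 1 needs only the first task). The project, dataset, procedure, file, and task names are hypothetical, so treat this as a starting point rather than a production DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Hypothetical DAG: run a Dataflow load job, then an in-BigQuery transformation.
with DAG(
    dag_id="gcs_to_bigquery_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Launch a Beam/Dataflow script (like the one sketched in pipeline 1 below).
    run_dataflow_load = BashOperator(
        task_id="run_dataflow_load",
        bash_command="python /home/airflow/gcs/dags/load_to_bigquery.py",
    )

    # Call a hypothetical stored procedure that produces the processed table
    # (pipeline 2); a Dataform invocation would slot in here for pipeline 3.
    run_bq_transform = BigQueryInsertJobOperator(
        task_id="run_bq_transform",
        configuration={
            "query": {
                "query": "CALL `my-gcp-project.analytics.process_orders`()",
                "useLegacySql": False,
            }
        },
    )

    run_dataflow_load >> run_bq_transform
```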
Pipeline 1: Dataflow (Apache Beam)
A) Raw data can reside on GCS buckets.
B) Review your existing PySpark code and identify the key data processing logic and transformations. Refactor that code to use Apache Beam, the programming model behind Google Cloud Dataflow, rewriting your PySpark transformations as Dataflow pipelines written as Python scripts (see the sketch after this list).
C) Final data is stored in BigQuery tables.
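To make step B concrete, here is a minimal Apache Beam sketch that reads raw CSV files from a GCS bucket, applies a simple transformation, and writes the result to BigQuery. The bucket, project, table, and schema names are invented for illustration, and the transformation is a stand-in for your real PySpark logic.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_csv(line):
    """Turn an 'order_id,amount' CSV line into a BigQuery-ready dict."""
    order_id, amount = line.split(",")
    return {"order_id": order_id, "amount": float(amount)}


def run():
    options = PipelineOptions(
        runner="DataflowRunner",  # use "DirectRunner" for local testing
        project="my-gcp-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromGCS" >> beam.io.ReadFromText(
                "gs://my-bucket/raw/orders.csv", skip_header_lines=1)
            | "ParseCsv" >> beam.Map(parse_csv)
            | "FilterLargeOrders" >> beam.Filter(lambda row: row["amount"] > 100)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-gcp-project:analytics.orders_processed",
                schema="order_id:STRING,amount:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()
```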
Pipeline 2: Dataflow + BigQuery stored procedures
A) Raw data can reside on GCS buckets.
B) Google Cloud Dataflow jobs will read the data from the GCS buckets and write it into input/raw BigQuery tables without any additional transformations.
C) Rewrite your data processing logic as SQL queries or stored procedures. This involves translating your PySpark transformations into SQL operations; you can create user-defined functions (UDFs) in BigQuery if needed. Write SQL stored procedures within BigQuery that encapsulate your data processing logic; you can create and manage them using the BigQuery web console, command-line tools, or client libraries (see the sketch after this list).
D) The stored procedures will store the final data in final/processed BigQuery tables.
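As an illustration of steps C and D, the sketch below uses the BigQuery Python client library to create and then call a stored procedure. The project, dataset, table, and procedure names are hypothetical, and the simple aggregation stands in for whatever your PySpark job actually did.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project

# A stored procedure that encapsulates the transformation formerly done in
# PySpark: aggregate the raw orders table into a processed table.
create_proc = """
CREATE OR REPLACE PROCEDURE `my-gcp-project.analytics.process_orders`()
BEGIN
  CREATE OR REPLACE TABLE `my-gcp-project.analytics.orders_processed` AS
  SELECT
    order_id,
    SUM(amount) AS total_amount
  FROM `my-gcp-project.analytics.orders_raw`
  GROUP BY order_id;
END;
"""
client.query(create_proc).result()

# Invoke the procedure; in practice Cloud Composer would schedule this call.
client.query("CALL `my-gcp-project.analytics.process_orders`()").result()
```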
Pipeline 3: Dataflow + Dataform
A) Raw data can reside on GCS buckets.
B) Google Cloud Dataflow jobs will read the data from the GCS buckets and write it into input/raw BigQuery tables without any additional transformations.
C) Rewrite your data processing logic as SQL-based Dataform transformations. Dataform uses a SQL-like syntax (SQLX) for defining transformations and models. Identify any custom functions or business logic in your PySpark code and determine how to implement them within Dataform. Define Dataform models that correspond to the tables or views in your data warehouse; these models represent the output of your data transformations (see the sketch after this list).
D) Dataform jobs will store the final data in final/processed BigQuery tables.
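To illustrate step C, here is a sketch of what a minimal Dataform model might look like. Dataform models live as .sqlx files in a Dataform repository; the snippet keeps the SQLX as a Python string and writes it to a hypothetical definitions/orders_processed.sqlx file, with invented table names and a placeholder aggregation in place of your real PySpark logic (it assumes orders_raw is declared as a source in the project).

```python
from pathlib import Path

# Contents of a hypothetical Dataform model, definitions/orders_processed.sqlx.
# The config block declares the output table; the SELECT replaces the
# aggregation previously performed in PySpark. ${ref(...)} lets Dataform
# resolve dependencies between models and declared sources.
SQLX_MODEL = """\
config {
  type: "table",
  schema: "analytics"
}

SELECT
  order_id,
  SUM(amount) AS total_amount
FROM ${ref("orders_raw")}
GROUP BY order_id
"""

# Write the model into a local Dataform project layout (illustrative only).
Path("definitions").mkdir(exist_ok=True)
Path("definitions/orders_processed.sqlx").write_text(SQLX_MODEL)
```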
PySpark functions mapping to Dataflow, BigQuery SQL & Dataform SQLX
Our team has created a mapping sheet covering 50+ PySpark transformation functions and their equivalents in Dataflow, BigQuery SQL & Dataform SQLX, which can help you get started converting existing PySpark scripts for execution in any of the 3 pipelines mentioned above. One illustrative mapping is sketched below.
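For example, a single row of that sheet might map a PySpark filter-and-aggregate to its equivalents, roughly as below. Column and table names are hypothetical; the SQL in the comments applies to both BigQuery SQL and the body of a Dataform SQLX model, and the Beam snippet is runnable locally with the DirectRunner.

```python
import apache_beam as beam

# PySpark original (for reference):
#   df.filter(df.amount > 100).groupBy("order_id").sum("amount")
#
# BigQuery SQL / Dataform SQLX equivalent:
#   SELECT order_id, SUM(amount) AS total_amount
#   FROM orders_raw
#   WHERE amount > 100
#   GROUP BY order_id

# Apache Beam (Dataflow) equivalent:
with beam.Pipeline() as p:
    (
        p
        | "CreateSampleRows" >> beam.Create([
            {"order_id": "a1", "amount": 120.0},
            {"order_id": "a1", "amount": 80.0},
            {"order_id": "b2", "amount": 150.0},
        ])
        | "FilterLargeOrders" >> beam.Filter(lambda row: row["amount"] > 100)
        | "KeyByOrderId" >> beam.Map(lambda row: (row["order_id"], row["amount"]))
        | "SumPerOrder" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```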