Data Engineering on Google Cloud

Get hands-on experience with designing and building data processing systems on Google Cloud. This course uses lectures, demos, and hands-on labs to show you how to design data processing systems, build end-to-end data pipelines, analyze data, and implement machine learning. It covers structured, unstructured, and streaming data.

What you will learn

Design and build data processing systems on Google Cloud.
Process batch and streaming data by implementing autoscaling data pipelines on Dataflow.
Derive business insights from extremely large datasets using BigQuery.
Leverage unstructured data using Spark and ML APIs on Dataproc.
Enable instant insights from streaming data.
Understand ML APIs and BigQuery ML, and learn to use AutoML to create powerful models without coding.

Who this course is for

This course is intended for developers who are responsible for:
Extracting, loading, transforming, cleaning, and validating data.
Designing pipelines and architectures for data processing.
Integrating analytics and machine learning capabilities into data pipelines.
Querying datasets, visualizing query results, and creating reports.

Level

Intermediate

Duration

4 x 8-hour sessions

Language

Delivered in English

Prerequisites

To benefit from this course, participants should have completed “Google Cloud Big Data and Machine Learning Fundamentals” or have equivalent experience. Participants should also have:
Basic proficiency with a common query language such as SQL.
Experience with data modeling and ETL (extract, transform, load) activities.
Experience with developing applications using a common programming language such as Python.
Familiarity with machine learning and/or statistics.

Course topics

Module 1: Introduction to Data Engineering

Explore the role of a data engineer
Analyze data engineering challenges
Introduction to BigQuery
Data lakes and data warehouses
Transactional databases versus data warehouses
Partner effectively with other data teams
Manage data access and governance
Build production-ready pipelines
Review Google Cloud customer case study

Module 2: Building a Data Lake

Introduction to data lakes
Data storage and ETL options on Google Cloud
Building a data lake using Cloud Storage
Securing Cloud Storage
Storing all sorts of data types
Cloud SQL as a relational data lake

Module 3: Building a Data Warehouse

The modern data warehouse
Introduction to BigQuery
Getting started with BigQuery
Loading data
Exploring schemas
Schema design
Nested and repeated fields
Optimizing with partitioning and clustering

Module 4: Introduction to Building Batch Data Pipelines

EL, ELT, ETL
Quality considerations
How to carry out operations in BigQuery
Shortcomings
ETL to solve data quality issues

Module 5: Executing Spark on Dataproc

The Hadoop ecosystem
Run Hadoop on Dataproc
Cloud Storage instead of HDFS
Optimize Dataproc

Module 6: Serverless Data Processing with Dataflow

Introduction to Dataflow
Why customers value Dataflow
Dataflow pipelines
Aggregating with GroupByKey and Combine
Side inputs and windows
Dataflow templates
Dataflow SQL

Module 7: Manage Data Pipelines with Cloud Data Fusion and Cloud Composer

Building batch data pipelines visually with Cloud Data Fusion
Components
UI overview
Building a pipeline
Exploring data using Wrangler
Orchestrating work between Google Cloud services with Cloud Composer
Apache Airflow environment
DAGs and operators
Workflow scheduling
Monitoring and logging

Module 8: Introduction to Processing Streaming Data

Process streaming data

Module 9: Serverless Messaging with Pub/Sub

Introduction to Pub/Sub
Pub/Sub push versus pull
Publishing with Pub/Sub code

Module 10: Dataflow Streaming Features

Streaming data challenges
Dataflow windowing

Module 11: High-Throughput BigQuery and Bigtable Streaming Features

Streaming into BigQuery and visualizing results
High-throughput streaming with Cloud Bigtable
Optimizing Cloud Bigtable performance

Module 12: Advanced BigQuery Functionality and Performance

Analytic window functions
Use WITH clauses
GIS functions
Performance considerations

Module 13: Introduction to Analytics and AI

What is AI?
From ad-hoc data analysis to data-driven decisions
Options for ML models on Google Cloud

Module 14: Prebuilt ML Model APIs for Unstructured Data

Unstructured data is hard
ML APIs for enriching data

Module 15: Big Data Analytics with Notebooks

What’s a notebook?
BigQuery magic and ties to Pandas

Module 16: Production ML Pipelines

Ways to do ML on Google Cloud
Vertex AI Pipelines
AI Hub

Module 17: Custom Model Building with SQL in BigQuery ML

BigQuery ML for quick model building
Supported models

Module 18: Custom Model Building with AutoML

Why AutoML?
AutoML Vision
AutoML NLP
AutoML Tables

Ref: T-GCPDE-I-03
