GOOGLE CLOUD

Data Engineering on Google Cloud

Get hands-on experience with designing and building data processing systems on Google Cloud. This course uses lectures, demos, and hands-on labs to show you how to design data processing systems, build end-to-end data pipelines, analyze data, and implement machine learning. It covers structured, unstructured, and streaming data.

What you will learn

  • Design and build data processing systems on Google Cloud.

  • Process batch and streaming data by implementing autoscaling data pipelines on Dataflow.

  • Derive business insights from extremely large datasets using BigQuery.

  • Leverage unstructured data using Spark and ML APIs on Dataproc.

  • Enable instant insights from streaming data.

  • Understand ML APIs and BigQuery ML, and learn to use AutoML to create powerful models without coding.

Who this course is for

This class is intended for developers who are responsible for:

  • Extracting, loading, transforming, cleaning, and validating data.

  • Designing pipelines and architectures for data processing.

  • Integrating analytics and machine learning capabilities into data pipelines.

  • Querying datasets, visualizing query results, and creating reports.

Level

  • Intermediate

Duration

  • 4 x 8-hour sessions

Language

  • Delivered in English

Prerequisites

  • To benefit from this course, participants should have completed “Google Cloud Big Data and Machine Learning Fundamentals” or have equivalent experience.

  • Basic proficiency with a common query language such as SQL.

  • Experience with data modeling and ETL (extract, transform, load) activities.

  • Experience with developing applications using a common programming language such as Python.

  • Familiarity with machine learning and/or statistics.

Course topics

Module 1: Introduction to Data Engineering

  • Explore the role of a data engineer

  • Analyze data engineering challenges

  • Introduction to BigQuery

  • Data lakes and data warehouses

  • Transactional databases versus data warehouses

  • Partner effectively with other data teams

  • Manage data access and governance

  • Build production-ready pipelines

  • Review Google Cloud customer case study

Module 2: Building a Data Lake

  • Introduction to data lakes

  • Data storage and ETL options on Google Cloud

  • Building a data lake using Cloud Storage

  • Securing Cloud Storage

  • Storing all sorts of data types

  • Cloud SQL as a relational data lake

Module 3: Building a Data Warehouse

  • The modern data warehouse

  • Introduction to BigQuery

  • Getting started with BigQuery

  • Loading data

  • Exploring schemas

  • Schema design

  • Nested and repeated fields

  • Optimizing with partitioning and clustering

Module 4: Introduction to Building Batch Data Pipelines

  • EL, ELT, ETL

  • Quality considerations

  • How to carry out operations in BigQuery

  • Shortcomings

  • ETL to solve data quality issues

Module 5: Executing Spark on Dataproc

  • The Hadoop ecosystem

  • Run Hadoop on Dataproc

  • Cloud Storage instead of HDFS

  • Optimize Dataproc

Module 6: Serverless Data Processing with Dataflow

  • Introduction to Dataflow

  • Why customers value Dataflow

  • Dataflow pipelines

  • Aggregating with GroupByKey and Combine

  • Side inputs and windows

  • Dataflow templates

  • Dataflow SQL

Module 7: Manage Data Pipelines with Cloud Data Fusion and Cloud Composer

  • Building batch data pipelines visually with Cloud Data Fusion

  • Components

  • UI overview

  • Building a pipeline

  • Exploring data using Wrangler

  • Orchestrating work between Google Cloud services with Cloud Composer

  • Apache Airflow environment

  • DAGs and operators

  • Workflow scheduling

  • Monitoring and logging

Module 8: Introduction to Processing Streaming Data

  • Process streaming data

Module 9: Serverless Messaging with Pub/Sub

  • Introduction to Pub/Sub

  • Pub/Sub push versus pull

  • Publishing with Pub/Sub code

Module 10: Dataflow Streaming Features

  • Streaming data challenges

  • Dataflow windowing

Module 11: High-Throughput BigQuery and Bigtable Streaming Features

  • Streaming into BigQuery and visualizing results

  • High-throughput streaming with Cloud Bigtable

  • Optimizing Cloud Bigtable performance

Module 12: Advanced BigQuery Functionality and Performance

  • Analytic window functions

  • Use WITH clauses

  • GIS functions

  • Performance considerations

Module 13: Introduction to Analytics and AI

  • What is AI?

  • From ad-hoc data analysis to data-driven decisions

  • Options for ML models on Google Cloud

Module 14: Prebuilt ML Model APIs for Unstructured Data

  • Unstructured data is hard

  • ML APIs for enriching data

Module 15: Big Data Analytics with Notebooks

  • What’s a notebook?

  • BigQuery magic and ties to Pandas

Module 16: Production ML Pipelines

  • Ways to do ML on Google Cloud

  • Vertex AI Pipelines

  • AI Hub

Module 17: Custom Model Building with SQL in BigQuery ML

  • BigQuery ML for quick model building

  • Supported models

Module 18: Custom Model Building with AutoML

  • Why AutoML?

  • AutoML Vision

  • AutoML NLP

  • AutoML Tables

Ref: T-GCPDE-I-03

© Copyright 2023. Axalon. All rights reserved.
