GOOGLE CLOUD
Data Engineering on Google Cloud
Get hands-on experience with designing and building data processing systems on Google Cloud. This course uses lectures, demos, and hands-on labs to show you how to design data processing systems, build end-to-end data pipelines, analyze data, and implement machine learning. This course covers structured, unstructured, and streaming data.
Design and build data processing systems on Google Cloud.
Process batch and streaming data by implementing autoscaling data pipelines on Dataflow.
Derive business insights from extremely large datasets using BigQuery.
Leverage unstructured data using Spark and ML APIs on Dataproc.
Enable instant insights from streaming data.
Understand ML APIs and BigQuery ML, and learn to use AutoML to create powerful models without coding.
This class is intended for developers who are responsible for:
Extracting, loading, transforming, cleaning, and validating data.
Designing pipelines and architectures for data processing.
Integrating analytics and machine learning capabilities into data pipelines.
Querying datasets, visualizing query results, and creating reports.
Intermediate
4 x 8-hour sessions
Delivered in English
To benefit from this course, participants should have completed “Google Cloud Big Data and Machine Learning Fundamentals” or have equivalent experience.
Basic proficiency with a common query language such as SQL.
Experience with data modeling and ETL (extract, transform, load) activities.
Experience with developing applications using a common programming language such as Python.
Familiarity with machine learning and/or statistics.
Explore the role of a data engineer
Analyze data engineering challenges
Introduction to BigQuery
Data lakes and data warehouses
Transactional databases versus data warehouses
Partner effectively with other data teams
Manage data access and governance
Build production-ready pipelines
Review Google Cloud customer case study
Introduction to data lakes
Data storage and ETL options on Google Cloud
Building a data lake using Cloud Storage
Securing Cloud Storage
Storing all sorts of data types
Cloud SQL as a relational data lake
The modern data warehouse
Introduction to BigQuery
Getting started with BigQuery
Loading data
Exploring schemas
Schema design
Nested and repeated fields
Optimizing with partitioning and clustering
EL, ELT, ETL
Quality considerations
How to carry out operations in BigQuery
Shortcomings
ETL to solve data quality issues
The Hadoop ecosystem
Run Hadoop on Dataproc
Cloud Storage instead of HDFS
Optimize Dataproc
Introduction to Dataflow
Why customers value Dataflow
Dataflow pipelines
Aggregating with GroupByKey and Combine
Side inputs and windows
Dataflow templates
Dataflow SQL
Building batch data pipelines visually with Cloud Data Fusion
Components
UI overview
Building a pipeline
Exploring data using Wrangler
Orchestrating work between Google Cloud services with Cloud Composer
Apache Airflow environment
DAGs and operators
Workflow scheduling
Monitoring and logging
Process Streaming Data
Introduction to Pub/Sub
Pub/Sub push versus pull
Publishing with Pub/Sub code
Streaming data challenges
Dataflow windowing
Streaming into BigQuery and visualizing results
High-throughput streaming with Cloud Bigtable
Optimizing Cloud Bigtable performance
Analytic window functions
Use WITH clauses
GIS functions
Performance considerations
What is AI?
From ad-hoc data analysis to data-driven decisions
Options for ML models on Google Cloud
Unstructured data is hard
ML APIs for enriching data
What’s a notebook?
BigQuery magic and ties to Pandas
Ways to do ML on Google Cloud
Vertex AI Pipelines
AI Hub
BigQuery ML for quick model building
Supported models
Why AutoML?
AutoML Vision
AutoML NLP
AutoML Tables
Ref: T-GCPDE-I-03