
Discover how to use Microsoft Azure for data engineering, from Azure SQL databases and Azure Blob Storage to Azure Data Factory pipelines and Azure Synapse Analytics.
Define data and categorize its sources and structure, from devices to observations. Examine categorical, ordinal, and quantitative data, and identify Microsoft Azure services to store and process this data.
Explain non-relational data by comparing semi-structured JSON-based records with unstructured data, noting that fields can vary across records, and outline formats such as Apache Spark, ORC, Avro, and Azure storage.
Understand data processing from raw data to meaningful derived data, including data cleansing and transformations. Compare batch processing and stream processing, with workflows and Azure or open-source tools.
Learn how a resource in Azure encompasses running cloud components like databases, servers, and networks, and how resource groups organize these resources by life cycle and security.
Install Azure Data Studio, connect to an Azure SQL Server, configure firewall to allow your IP address, and explore databases with autocompletion.
Create and connect a Linux virtual machine in the Azure portal, set up a resource group and network, and access an Azure SQL database from the VM using SSH.
Install the mssql CLI on the Ubuntu 18.04 VM and connect to Azure SQL using server, database, and credentials. Enable Azure services access to permit the connection.
Understand authentication and authorization in Azure SQL Database and SQL Server with payroll, log files, and settings databases today.
Create a sql login and a database user in azure sql, map the login to the adventureworks user, and grant read-only access.
Install and use SQL Server Management Studio to administer Azure SQL databases, compare it with Azure Data Studio, connect via object explorer, and launch Azure Data Studio from SSMS.
Explore tables and views in a relational database, showing how tables store data in rows and columns with primary keys and constraints, and how views store sql queries for reusable results.
Create a database diagram in SSMS to reveal relationships between tables, use foreign keys to reference related data, and visualize table connections in a relational database.
Create an Azure storage account in the portal by selecting a subscription, resource group, and a unique name, then configure region, standard performance, redundancy, soft delete, and access tier.
Upload data to an Azure storage account via the Azure portal by creating a private blob container, uploading files, and exploring blob management, hot access tier, and soft delete recovery.
Learn to use Azure Storage Explorer to manage Azure cloud storage from a desktop, connect to storage accounts and containers, and upload data using access keys.
Learn how to grant restricted, time-limited access to Azure storage resources using shared access signatures, including account-level and service-level SAS, with guidance on creation and rotation in the portal.
Generate a shared access signature with the Azure CLI by specifying account name, container, permissions, expiry, and authentication mode, then build the valid url.
Explore how Azure organizes data centers into regions and availability zones to ensure data redundancy, region-based resource deployment, and geography-driven privacy and compliance for storage and applications.
Explore Azure Data Factory to build pipelines. Extract from Azure sql and csv, transform with mapping data flows, and load into storage using linked services, datasets, activities, and integration runtimes.
Create a new SQL Server and database in a resource group. Import the Worldwide Importer backpack with SQL Server Management Studio and configure the firewall.
Create an Azure Data Factory v2 and two linked services for the SQL database and Azure Blob Storage, showing how to connect to external resources and validate connections.
Create two datasets in data factory: a sql server table dataset using the sql linked service, and a blob storage csv dataset with a pipe delimiter, then publish both.
Create an Azure Data Factory pipeline to move data from SQL Server to Azure Blob Storage using a copy data operation, datasets, and a CSV sink, with validation and debugging.
Join a sales order CSV with the sales order line SQL table using an inner join on order ID, then aggregate by customer ID and order ID to count items.
Add an aggregate transformation to group by customer ID and stock item ID, sum the quantity to create a total, and write the results to an inline dataset sink.
Execute a data flow by first copying data from a SQL table to CSV, then reading it with the data flow to join the order data and output JSON.
Explore how data flows transform data without code and execute on a scaled Apache Spark cluster within an Azure Data Factory pipeline, producing many small part files from parallel partitions.
Build a data pipeline in Azure Data Factory, reading json and copying from Azure SQL to parquet. Use a mapping data flow to derive columns, aggregate, and produce csv outputs.
Learn how contained database users store authentication inside the database, making databases portable across environments. Create Azure Data Factory SQL user, set a strong password, and grant data reader access.
Explain why hardcoding SQL server credentials in Azure Data Factory is risky, and show creating an Azure Key Vault to store SQL server secrets and link them to Data Factory.
Create a storage account in Azure with a resource group and a unique name, then create an input container and a transaction-optimized output file share for data factory access.
Create a dataset from delivery info json extracting invoice id and date. Set up a linked service to Azure blob storage using a managed identity and Azure Active Directory authentication.
Grant a managed identity read-only access to a specific blob container using Azure RBAC, assigning the Storage Blob Data Reader role to a data factory.
Create a metadata-container-dev, upload cities to process.json, and grant the data factory storage blob data contributor access. Then define a managed-identity dataset, test the path, and preview city_id and city_name.
Explore Azure Data Factory's lookup activity to process data by city using a metadata JSON file, avoiding hard-coded city IDs, and enable dynamic per-city pipelines.
Learn to implement for each loops in Azure Data Factory, parameterize items with dynamic content and expressions, and pass lookup outputs to copy data into parquet files per city.
Build a data factory pipeline using the for each activity to process three cities, with managed identity access, datasets and link services, and validate parquet outputs.
Parameterize the sink container name and include city names in file names, creating date-based folders and passing pipeline parameters to datasets in Azure Data Factory.
Parameterize the storage directory with expression language and the current date. Format the date with UTC now to create a year-month-day folder, publish the pipeline, and debug.
Read parquet source files organized by dynamic current date folders, use mapping data flow to create year-month-day paths, and infer schemas from parquet data.
Join delivery json and city parquet data with an inner join on invoice id, fix data types. Use a derived column to extract latitude and longitude, then drop unused fields.
Master aggregate transformations in a data flow by grouping by city name, city ID, and delivery status, and count invoices per group to produce total invoices.
Configure an Azure data factory pipeline to copy and merge multiple csv files from an Azure file share using parameterized datasets and a merge files copy behavior.
Execute an end-to-end azure data factory pipeline to process json inputs and parquet files, producing csv outputs, and monitor the run as the data flow merges into a single csv.
Compare online transaction processing (oltp) and online analytical processing (olap) using azure synapse analytics, highlighting etl, data warehouses, data lakes, and serverless versus dedicated query options.
Create an Azure Synapse workspace in the portal, configure resource groups, data lake storage, and access roles, then explore Synapse Studio hubs for data, develop, integration, monitor, and manage.
Learn to query data lake with the serverless SQL pool in Azure Synapse, paying for data processed. Use parquet and delta formats and connect with Azure Data Studio or SSMS.
Connect to a serverless SQL pool endpoint with Azure Data Studio using Azure AD authentication, configure access to storage accounts, and run queries against Parquet data in Azure Data Lake.
Understand server level credentials for serverless sql pool to access azure storage with open row set, and compare server level vs database-scoped credentials for shared access signatures.
Learn to use open row set to read remote data with Transact-SQL. Use the built-in bulk provider in Azure Synapse Analytics to query parquet files in Azure storage.
Query json documents with serverless sql pool in Azure Synapse Analytics. Read line-delimited json and arrays, and extract customer id and demographic data with json value and json query.
Learn to read a line-delimited json file containing an array of objects by using openrowset with a row terminator and the openjson function, returning documents as rows and columns.
Create an external table in serverless sql pool by configuring a database, a master key, and database-scoped credentials; the table stores metadata while data stays in the storage account.
Create an external table in Azure Synapse serverless pool by specifying database, schema, location, data source, and file format, then query CSV or parquet data as a native SQL table.
Explore how Polybase external tables in serverless SQL pool handle dirty records with reject options, and learn to resolve data type mismatches by using string columns or dedicated pools.
Explore CETAS in serverless sql pool to create external tables from a select and export results to Azure storage, using parquet and csv formats.
Create a spark notebook from the spark pool, load address.parquet into a spark dataframe, attach to the pool, and observe the driver and executors forming the cluster.
Learn how to scale a serverless Apache Spark pool in Synapse Analytics by adjusting node counts, executors, and driver, enabling autoscale options, and monitoring with the Spark History Server.
Learn how Azure Synapse Analytics workspace quotas cap CPU cores for data flows and Apache Spark pools, manage executors and drivers, and request capacity increases via the Azure portal.
Learn to read data from Azure Data Lake Storage in Synapse Analytics using a managed identity and linked services. Enable hierarchical namespaces for big data analytics.
Learn to read CSV data from Azure blob storage with Apache Spark in Synapse by creating a blob storage linked service and configuring authentication for multiple CSV files.
Learn to connect Apache Spark to Azure SQL from Synapse using the Azure SQL Connector, configure JDBC URL and Key Vault credentials, read, query, or write data with Spark.
Understand the dedicated sql pool architecture, including the control node, compute nodes, data movement service, distributed tables, and distribution methods like round robin, hash, and replicate.
Learn how to create a dedicated SQL pool in Synapse, name it, choose a performance level, estimate costs by data warehouse units, and manage compute with pause and scale options.
Learn to load data into a dedicated sql pool with the copy statement, including creating a schema and a data engineer user, granting permissions, and handling headers.
Learn to load data into a dedicated sql pool by creating an external table, configuring data sources and file formats, and using CTAS to populate a read table from storage.
Learn how a data warehouse uses a star schema to organize data into fact and dimension tables, using integration and staging tables, to relate sales to time, product, and customers.
Create a hash distributed table to improve performance of large fact tables by assigning rows to distributions with a hash function, and choose a distribution column.
Learn how to check data distribution in a hash-distributed table, spot skew beyond 10%, and select a distribution column that balances rows to prevent performance issues.
Learn to create a round-robin distributed table that evenly distributes rows across distributions, reducing data skew and enabling fast loading for staging tables, though joins may trigger data shuffles.
Create and load a data warehouse with a dimension table and a fixed table for the TPC dataset. Practice copying data into tables or using Azure Data Factory copy activity.
Do you want to jump-start your career as an Azure Data Engineer?
Welcome to the Definitive Guide to Data Engineering on Microsoft Azure.
In this course, you will learn how to use the extensive family of Azure Data Services to build a modern data and analytics platform.
I'll take you step-by-step through engaging video tutorials and teach you everything you need to know to succeed as an Azure Data Engineer.
By the time you complete this course, you will be able to:
Architect a Data Solution on Azure
Integrate relational data and unstructured data
Build data processing pipelines using Azure Data Factory
Integrate and transform data from various data systems
Securely access data stores with Azure Key Vault and Azure role-based access control (Azure RBAC)
Perform exploratory data analysis with Azure Synapse Analytics
Manage your costs
The lectures in this course are hands-on and with lots of explanations.
I also use animations to break down complex topics.
There are SQL scripts and Jupyter notebooks that you can use to follow along easily.
You will be working closely with the documentation of Microsoft Azure as it is essential to know how to find the most up-to-date information about any service.
You will learn how to combine a range of Azure services to ingest, store and process data of all types and sizes from any data source.
Throughout this course, we will learn how to use and combine multiple Azure Services and tools, including:
Azure SQL Databases
Azure Storage
Azure Role-Based Access Control - RBAC
Azure Data Studio
Azure Storage Explorer
Azure Cost Management
Azure Data Factory
Azure Key Vault
Azure Managed Identity
Azure Synapse Analytics
Synapse Workspace
Azure Synapse Studio
Serverless SQL Pool
Serverless Apache Spark Pool
Synapse Analytics Dedicated SQL Pool
At the beginning of each Data Service, you will be introduced to the Service, learn what it is for, and then learn how to use it.
Enroll now, and learn
Microsoft, Windows, Microsoft Azure, and all Azure Data Services are either registered trademarks or trademarks of Microsoft group of companies.
This course is not certified, accredited, affiliated with, or endorsed by Microsoft Corporation.