
What is Avro?

A free video tutorial from Stephane Maarek | AWS Certified Cloud Practitioner,Solutions Architect,Developer

Lecture description

Learn what Apache Avro is and how it came to be

Learn more from the full course

Apache Kafka Series - Confluent Schema Registry & REST Proxy

Kafka - Master Avro, the Confluent Schema Registry and Kafka REST Proxy. Build Avro Producers/Consumers, Evolve Schemas

04:24:21 of on-demand video • Updated March 2024

Write simple and complex Avro Schemas
Create, Write and Read Avro objects in Java
Write a Java Producer and Consumer leveraging Avro data and the Schema Registry
Learn about Schema Evolution
Perform Schema evolution using the command line and in Java
Utilize the REST Proxy using a REST Client
Okay, so welcome to this section on mastering Avro, where we'll learn about all the schema data types and real-life recommendations on Avro. But first you may be asking: what even is Avro? So let me walk you through an evolution of data formats, from the most basic all the way to Avro.

First we have comma-separated values, or CSV. CSV is very basic: we have a set of columns and then we add rows. So for example, in row one, column one would be John, column two Doe, column three 25, and so on, and column five is the boolean true. Then in row two we have Mary Poppins. But look at this: column three went from 25 to 60. So what is it? Is it an int or is it a string? Who knows, right? Then we get to row three, and row three is missing data; we don't have the last two columns. What the heck? We were expecting them. Those are most likely problems you have already run into with CSV.

So CSV's advantages are that it's easy to parse, easy to read, and easy to make sense of. But its disadvantages are big. The data types of the elements have to be inferred: you need to guess which column is what, and there's no guarantee; anyone can put anything in there. Parsing becomes very tricky when your data contains commas. The column names may or may not be there, columns may change, and so on. It's horrible.

So next we have relational tables in databases, and relational tables basically add types. Here is a CREATE TABLE statement for a database: it creates the table distributors, where did will be an integer and name will be a varchar, and the database will refuse any data that does not comply with these types. That's very important: we have just defined types, and the database will refuse anything that doesn't comply. On top of this, we have named columns. There is no such concept as an order; every column has a name, and that's how we refer to them. So the data is fully typed. That's amazing. The data fits in the table.
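The CSV problems described above are easy to demonstrate. Here is a minimal sketch in Python using the standard-library csv module; the names and values mirror the lecture's slide, with a made-up "thirty" in the last row to show the type ambiguity:

```python
import csv
import io

# Rows like the lecture's example: is column three an int or a string?
raw = """first_name,last_name,age,city,subscribed
John,Doe,25,London,true
Mary,Poppins,60,NYC,false
Bob,Smith,thirty
"""

rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]

# Every value comes back as a plain string -- the type must be guessed.
print(type(data[0][2]).__name__)  # -> str, even though "25" looks numeric

# Row three is short: the last two columns are simply missing,
# and the parser raises no error at all.
print(len(data[2]))  # -> 3 instead of 5
```

Nothing in the format itself stops a row from carrying the wrong type or the wrong number of columns; every consumer has to re-implement that validation.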
That's really cool. But there are disadvantages. Number one, your data has to be flat: rows and columns. Number two, the data is stored in a database, and the data definition will be different for each database, so it becomes extremely hard to access the data across databases and across languages; you need a driver for each specific database. So that becomes a bit tricky, and it's not a format you can share with others.

Then we have JSON, which is short for JavaScript Object Notation, and the JSON format is awesome because it can be shared across the network as much as you want. Here is an example of a JSON object: we have an id, a type, a name, an image and a thumbnail, and you can see there are some nested values, like the image has a URL, a width and a height, and the thumbnail also has a width and a height. So it's pretty cool, right? It's all text based, and it can hold nested data.

So the JSON format has advantages. Number one, the data can take any form you want: it could be an array, a nested element, whatever you want. JSON is widely accepted on the web; every single language has a library to parse JSON, so that's awesome. And it can easily be shared over a network; it's just text, just strings. It's easy, right?

But there are some inconveniences. The data has no schema being enforced: you could easily turn a string into an integer, and JSON will just accept anything. And finally, JSON objects can be really big in size because of the repeated keys. In the example before, keys like width and height were each repeated twice, and that can take a lot of space if you have a lot of images or a lot of thumbnails. So for all these disadvantages, there is one answer.
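Both JSON drawbacks above can be shown in a few lines. This is a minimal sketch with Python's standard-library json module; the record is made up to mirror the lecture's image/thumbnail example, not taken from the course:

```python
import json

# Hypothetical record shaped like the lecture's example.
record = {
    "id": "0001",
    "type": "image",
    "name": "sample",
    "image": {"url": "images/sample.png", "width": 200, "height": 200},
    "thumbnail": {"url": "images/thumb.png", "width": 32, "height": 32},
}

# No schema is enforced: nothing stops us from putting a string
# where an integer used to be.
record["image"]["width"] = "very wide"
payload = json.dumps(record)  # still serializes without complaint

# The keys are repeated in every nested object, and again in every
# message sent over the wire -- pure overhead for large volumes.
print(payload.count('"width"'))  # -> 2
```

Every consumer of `payload` would happily parse it, string-typed width and all; the mistake only surfaces later, at read time.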
It's Avro. Avro is defined by a schema, and the schema itself is written in JSON. So to get started, you can view Avro as JSON with a schema attached to it. This is what an Avro schema will look like, and over the course we are going to fully understand what that schema represents and what it means, so don't worry about it too much for now. Just remember that Avro has a schema and then a payload.

So what are the advantages? Well, the data is fully typed. Before, we defined that our username was a string and that our age was an integer. So the data is fully typed, and it's named as well. It can be compressed automatically, so by the way, even if a column name is very, very long, it doesn't matter; it will be compressed, which means less CPU usage. The schema comes alongside the data, so the data is never lonely: it always has a schema nearby, and that means the data itself is self-explanatory. You can embed documentation in the schema, so if anyone receives your data and reads the schema that comes with it, they will know exactly what your data represents. The data can be read across many languages, or any language really; it's just a binary protocol. Language support can differ, but it's usually pretty good, especially for Java. And then your schema can evolve over time in a safe manner: we can add elements, fields and types. Your schema can evolve based on some rules, because your data might change over time, and your schema may evolve with it.

A few disadvantages, though. Some languages may have trouble supporting Avro, and you can't really see or print the Avro data without using the Avro tools, because it's compressed and serialized. For a JSON document, you just double-click it and there you go, you can read it. But for Avro, you can't just double-click and read it.
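To make the idea concrete, here is what a small Avro schema can look like. This is a hypothetical example built from the fields mentioned above (a string username and an int age), not the exact schema shown in the course:

```json
{
  "type": "record",
  "name": "Customer",
  "namespace": "com.example",
  "doc": "A hypothetical customer record for illustration only",
  "fields": [
    {"name": "username", "type": "string", "doc": "the user's login name"},
    {"name": "age", "type": "int", "doc": "age in years"}
  ]
}
```

Note the `doc` attributes: this is how the documentation mentioned above gets embedded directly in the schema, so it travels with the data.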
You need to use some tools to read it. Now, before we dive deep into the course, you may be like: but what about Protobuf, or Thrift, or Parquet, or ORC, or my favorite data format? Overall, they're all pretty much doing the same thing, which is to compress data and structure it in some way. I won't get into the debates, but at the Kafka level, what we care about is one message being self-explanatory and fully describable, and because we're dealing with streaming, that rules out Parquet and the columnar-based formats. Avro already has really good support from Hadoop technologies like Hive and others, so Avro is already a good candidate in the Hadoop ecosystem. But also, Avro has been chosen as the only supported data format for the Confluent Schema Registry so far. So there's no choice: we'll just go along with it, and it's fine; it's been working for ages.

And finally, don't go and be like: oh, but what about the performance of Avro versus Protobuf versus whatever. Unless you start doing one million messages per second, you're fine. I've written programs using Avro that have reached insane volumes without once worrying about performance. So performance is great with Avro; don't worry about it. I hope you're just liking the format, understanding it, and going along with it. You're not in the optimization phase, you're in the development phase.

All right, so that was it for an introduction to Avro. I promise we're going to get deep into Avro next, but thanks for watching all the way. See you in the next lecture.