What is Avro?

A free video tutorial from Stephane Maarek | AWS Certified Cloud Practitioner, Solutions Architect, Developer
Best Selling Instructor, Kafka Guru, 9x AWS Certified
4.7 instructor rating • 41 courses • 929,058 students

Lecture description

Learn what Apache Avro is and how it came to be

Learn more from the full course

Apache Kafka Series - Confluent Schema Registry & REST Proxy

Kafka - Master Avro, the Confluent Schema Registry and Kafka REST Proxy. Build Avro Producers/Consumers, Evolve Schemas

04:23:56 of on-demand video • Updated July 2021

  • Write simple and complex Avro Schemas
  • Create, Write and Read Avro objects in Java
  • Write a Java Producer and Consumer leveraging Avro data and the Schema Registry
  • Learn about Schema Evolution
  • Perform Schema evolution using the command line and in Java
  • Utilize the REST Proxy using a REST Client
OK, so welcome to this section on mastering Avro. We'll learn about schemas, data types, and real-life recommendations on Avro. But first, you may be like, what is Avro? What even is it? So let me just walk you through an evolution of data formats, from the most basic all the way to Avro.

We start with comma-separated values, or CSV, and this is very basic. We have a set of columns and then we have rows. So, for example, in row one, column one would be "John", column three would be twenty-five, and so on, and column five is a boolean, true. Then in row two we have Mary Poppins. But look at this: column three went from twenty-five to "sixty". So what is it? Is it an integer or is it a string? Then we go to row three, and row three is missing data: we don't have the last two columns. What the heck? We wanted them, we were expecting them. So those are problems you've most likely already had with CSV. Its advantages are that it's easy to parse, easy to read, and easy to make sense of. But its disadvantages are big. The data types of the elements have to be inferred: you need to guess what each column is, and there's no guarantee, anyone can put anything in. Parsing becomes very tricky when the data contains commas. The column names may or may not be there, and so on.

So comes the next evolution: relational tables. Relational tables in databases basically add types. Here is a CREATE TABLE statement for a database, and it says: create the table distributors, where did (the distributor id) will be an integer and name will be a varchar, and the database will refuse any data that does not comply with these types. OK, so this is very important, because here we've defined types, and the database will refuse anything that doesn't comply. On top of this, we have named columns. There is no such concept as an order; every column has a name, and that's how we refer to them.
It's amazing: data fits in a table, that's really cool. But the disadvantages of this are, number one, your data has to be flat, OK, rows and columns. Number two, the data is stored in a database, and the data definition will be different for each database. So it will be extremely hard to access the data across databases and across languages; you need to have a driver for each specific database, so that becomes a bit tricky. It's not something that you share with others in that format.

Then we have JSON, and JSON is short for JavaScript Object Notation. The JSON format is awesome because it can be shared across the network as much as you want. So here's an example of a JSON object: we have an id, a type, a name, an image, and a thumbnail, and you see there are some nested values, like the image has a URL and a width and a height, and the thumbnail as well has a width and a height. So it's pretty cool, right? It's all text-based and it can hold nested data and so on. So the JSON format has advantages. Number one is that data can take any form you want: it could be an array, it could be a nested element, it could be whatever you want. JSON is widely accepted on the web; I mean, every single language has a library to parse JSON, so that's awesome. And then it can easily be shared over a network. It's just text, it's just strings, it's easy, right? But there are some inconveniences. One of them is that no schema is being enforced on the data: you could easily turn a string into an integer, and JSON will just accept anything. And then finally, JSON objects can be really big in size because of the repeated keys. Before, let me show you, we had the URL repeated twice, the width repeated twice, the height repeated twice, and so on. That may take a lot of space if you have a lot of images or thumbnails.
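Both JSON drawbacks can be shown in a few lines. In this sketch (the field names mirror the lecture's image and thumbnail example but are otherwise invented), the standard `json` module happily accepts a type change on the same key, and every object carries its own copy of the key names, inflating the payload.

```python
import json

# A nested object like the lecture's example: keys such as "url",
# "width" and "height" repeat in both the image and the thumbnail.
doc = {
    "id": 1,
    "type": "user",
    "name": "John",
    "image": {"url": "http://example.com/img.png", "width": 200, "height": 100},
    "thumbnail": {"url": "http://example.com/thumb.png", "width": 32, "height": 16},
}

# No schema is enforced: turning a number into a string is silently accepted.
doc["image"]["width"] = "two hundred"
round_tripped = json.loads(json.dumps(doc))
print(type(round_tripped["image"]["width"]))  # <class 'str'>, no error raised

# Repeated keys cost space: 100 records carry 200 copies of "width".
payload = json.dumps([doc] * 100)
print(payload.count('"width"'))  # 200
```

The key names dominate the payload here, which is the space problem the lecture describes.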
So for all these disadvantages, there is one answer: it's Avro. Avro is defined by a schema, and the schema itself is written in JSON. So to get started, you can view Avro as JSON with a schema attached to it. This is what an Avro schema will look like, and we are going, over the course, to fully understand what the schema represents and what it means, so don't worry about it too much for now. Just remember that Avro has a schema and then it has a payload.

So what are the advantages? Well, the data is fully typed. Before, we defined that our user name was a string and our age was an integer; so data is fully typed, and it's named as well. You can compress it automatically, so, by the way, if a column name is very, very long, it doesn't matter, it'll be compressed, so less key usage. The schema comes alongside the data, so the data is never just there, alone; it always has its schema nearby, and that means the data itself is self-explanatory. You can embed documentation in the schema, so that if anyone receives your data and takes your schema from the data, they will know exactly what the data represents. The data can be read across many languages, or any language; it's just a binary protocol. Compatibility can differ per language, but usually it's pretty good, especially for Java. And then your schema can evolve over time in a safe manner: we can add columns, we can add elements and fields and types, so your schema can evolve based on some rules. Because your data may change over time, your schema may evolve with it.

A few disadvantages, though. Some languages may have some trouble supporting Avro, and you can't really see or print the Avro data without using Avro tools, and that's because it's compressed and serialized. So with a JSON document, you just double-click and there you go, you read it. But not with Avro.
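To make "JSON with a schema attached" concrete, here is a minimal sketch of what an Avro schema for the user example above might look like. The record name, namespace, and field names are invented for illustration. The schema itself is plain JSON, so the snippet just parses it with the standard library; actually serializing records against it would need an Avro library, which the course covers later.

```python
import json

# An Avro schema is itself written in JSON: a named record with typed,
# named fields, plus optional embedded documentation ("doc").
schema_text = """
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "doc": "A user of our application",
  "fields": [
    {"name": "name", "type": "string", "doc": "The user's name"},
    {"name": "age",  "type": "int",    "doc": "The user's age in years"},
    {"name": "subscribed", "type": "boolean", "default": false}
  ]
}
"""

schema = json.loads(schema_text)

# The types are explicit, so nothing is left for the reader to infer.
for field in schema["fields"]:
    print(field["name"], "->", field["type"])
# name -> string
# age -> int
# subscribed -> boolean
```

Compare this to the CSV case: "age" can no longer silently become the string "sixty", because the schema pins it to an int.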
With Avro, you can't just double-click and read it; you need to use some tools to read it. So before I go any further, you may be like, but what about Protocol Buffers, or Thrift, or Parquet, or ORC? Maybe there's a data format that I really like. Overall, they're all pretty much doing the same thing, which is to compress data and type it in some way. OK, I won't get into the debates, but at the Kafka level, what we care about is one message being self-explicit and fully describable, because we're dealing with streaming. So no ORC, no Parquet, and no columnar-based format. Avro has really good support already from Hadoop technologies like Hive and others, so Avro is a really good candidate in the Hadoop ecosystem. But also, Avro has been chosen as the only supported data format for the Confluent Schema Registry so far. So there's no choice, we'll just go along with that, and it's fine; it's been working for ages, OK? And then finally, don't go and be like, oh, but what about the performance of Avro versus Protocol Buffers versus... oh my God. Unless you start doing one million messages per second, you're fine, OK? I've done programs using Avro that have been reaching insane volumes without even worrying once about performance. So performance is great with Avro, don't worry about it. I hope you're just liking the format, understanding it, and going along with it. You're not in the optimization phase, you're in the development phase. All right, so that was it for the introduction to Avro. I promise we're going to get deep into Avro next. But thanks for watching all the way. See you in the next lecture.