What is Apache Kafka

        I'd like to tell you what Apache Kafka is, but first, I wanna start with some background. For a long time now, we have written programs that store information in databases. Now, what databases encourage us to do is to think of the world in terms of things, things like, I don't know, users and maybe a thermostat. That's a thermometer, but you get the idea. Maybe a physical thing, like a train, let's see, here's a train. Things, there are things in the world. Database encourages us to think in those terms and those things have some state. We take that state, we store it in the database. This has worked well for decades, but now some people are finding that it's better, rather than thinking of things first, to think of events first. Now events have some state too, right? An event has a description of what happened with it, but the primary idea is that the event is an indication in time that the thing took place. Now it's a little bit cumbersome to store events in databases. Instead, we use a structure called a log and a log is just an ordered sequence of these events. An event happens and we write it into a log. A little bit of state, a little bit of description of what happens, and that says, hey, that event happened at that time. As you can see, logs are really easy to think about. They're also easy to build at scale, which historically has not quite been true of databases, which have been a little cumbersome in one way or another to build at scale. 

Now Apache Kafka is a system for managing these logs using a fairly standard, historical term. It calls them topics. This is a topic. A topic is just an ordered collection of events that are stored in a durable way, durable meaning that they're written to disk and they're replicated. So they're stored on more than one disk, on more than one server, somewhere, wherever that infrastructure runs, so that there's no one hardware failure that can make that data go away. Topics can store data for a short period of time, like a few hours or days or years or hundreds of years or indefinitely. Topics can also be relatively small or they can be enormous. There's nothing about the economics of Kafka that says that topics have to be large in order for it to make sense, and there's nothing about the architecture of Kafka that says that they have to say small. So they can be small, they can be big. They can remember data forever. They can remember data just for a little while, but they're a persistent record of events. Each one of those events represents a thing happening in the business. Like, remember our user, maybe a user updates her shipping address or a train unloads cargo or a thermostat reports that the temperature has gone from comfy to, is it getting hot in here. Each one o' those things can be an event stored in a topic, and Kafka encourages you to think of events, first and things, second. Now, back when databases ruled the world, it was kind of the trend to build one large program. We'll just build this gigantic program here that uses one big database all by itself, and it was customary, for a number of reasons, to do this. But, these things grew to a point where they were difficult to change and also difficult to think about. They got too big for any one developer to fit that whole program in his or her head at the same time. And, if you've lived like this, you know that that's true. 

Now the trend is to write lots and lots of small programs, each one of which is small enough to fit in your head and think about and version and change and evolve all on its own. And these things can talk to each other through Kafka topics. So each one of these services can consume a message from a Kafka topic, do whatever its computation is that goes on there, and then produce that message off to another Kafka topic that lives over here. So that output is now durably and maybe even permanently recorded for other services and other concerns in the system to process. So with all this data living in these persistent real time streams, and I've drawn two of them now, but imagine there are dozens or hundreds more in a large system. Now it's possible to build new services that perform real-time analysis of that data. So I can stand up some other service over here that does some kind of gauge, some sort of real-time analytics dashboard and that is just consuming messages from this topic here. That's in contrast to the way it used to be where you ran a batch process overnight. Now it's possible that yesterday is a long time ago for some businesses now. You might want that insight to be instant or as close to instant as it can possibly be. And, with data in these topics as events, that get processed as soon as they happen, it's now fairly straightforward to build these services that can do that analysis in real time. So you've got events, you've got topics, you've got all these little services talking to each other through topics. You got real-time analytics. I think if you have those four things in your head, you've got a decent idea of kind of the minimum viable understanding, not only of what Kafka is, which is this distributive log thing, but also of the kinds of software architectures that Kafka tends to give rise to. When people start building systems on it, this is what happens. Once a company starts using Kafka, it tends to have this viral effect. Right, we've got these persistent, distributed logs that are records of the things that have happened. We've got things talking through them, but there are other systems. I mean, what's this, there's this database. There's probably gonna be, you know, another database out there that was built before Kafka came along and you wanna integrate these systems. There could be other systems entirely. Maybe there's a search cluster. Maybe you used some SAS product to help your sales people organize their efforts, all these systems in the business, and their data isn't in Kafka. Well, Kafka Connect is a tool that helps get that data in and back out. When there's all these other systems in the world, you wanna collect data, so changes happen in the database, and you wanna collect that data and get it written into a topic like that. And now, I can stand up some new service that consumes that data and does whatever computation is on it now that it's in a Kafka topic. That's the whole point. Connect gets that data in. Then, that service produces some result which goes to a new topic over here and Connect is the piece that moves It to whatever that external legacy system is here. So, Kafka Connect is this process that does this inputting and this outputting and it's also an ecosystem of connectors. 

There are dozens, hundreds of connectors out there in the world. Some of them are open source. Some of them are commercial. Some of them are in between, but there are these little pluggable modules that you can deploy to get this integration done in a declarative way. You deploy them, you configure them. You don't write code to do this reading from the database, this writing to whatever that external system is. Those modules already exist. The code's already written. You just deploy them and Connect does that integration to those external systems. And let's think about the work that these things do. These services, these little boxes I'm drawing, they have some life of their own. They're programs, right, but they're gonna process messages from topics, and they're gonna have some computation that they wanna do over those messages. And it's amazing, there's really just a few things that people end up doing, like, say you have messages, these green messages, you wanna group all those up and add some field. Like, come up with the total weight of all the train cars that passed a certain point or something, but only a certain kind of car, only the green kinds of cars. And you've got these other, say you've got these orange ones here. So, right away, we see that we're gonna have to go through those messages. We're gonna have to group by some key and then we'll take the group and run some aggregation over it or maybe count them or something like that. Maybe you want to filter, maybe I've got this topic and, let's see, make some room for some other topic over here that's got some other kinda data, and I wanna take all the messages here and somehow link them with messages in this topic, and enrich, when I see this message happen here, I wanna go enrich it with the data that's in this other topic. These are common things. If it's the first time you've thought about it, that might seem unusual, but those things, grouping, aggregating, filtering, enrichment... 

Enrichment, by the way, goes by another name in database-land, that's a join, right? These are the things that these services are going to do. They're simple in principle to think about and to sketch, but to actually write the code to make all that happen takes some work, and that's not work you wanna do. So, Kafka again, in the box, just like it has Connect for doing data integration, it has an API called Kafka Streams. That's a Java API that handles all of the framework and infrastructure and kinda undifferentiated stuff you'd have to build to get that work done. So you can use that as a Java API in your services and get all that done in a scalable and fault-tolerant way just like we expect for modern applications to be able to do. And that's not framework code you have to write. You just get to use it because you're using Kafka. Now if I wanted to do some sort of analysis on the data in this topic and I didn't wanna put it in the service. I didn't want to stand up a Java application to run it. Confluent has a language called KSQL that I can now write some very SQL-looking thing here. Select from this topic, you know, group by etc, I won't write the whole query out, but it's a SQL-like language that I can run to be able to do real-time analysis of data in a topic. Where do those results go, you're wondering? Well, I think you'll be unsurprised to hear that the output of this query just goes into another topic which I can then do further analysis on, stand up other services to process, use Kafka Connect to send to a legacy system. It all kinda fits into the platform. And when I say platform, Confluent Platform is a distribution of Apache Kafka that you can use in a number of ways. Number one, there are open source and community license components of Confluent Platform that are free to use. There's a lot of these connectors. There's KSQL, components like that. You can just download and use for free. There are other components of Confluent Platform that come with an Enterprise subscription, things like multi-data center support, Enterprise-grade subscription and things like that. Even those, of course, if you just want to go download it and use it are free to use on a single node. So nothing is stopping you from experimenting even with those subscription features, and, if the idea of downloading, installing and running on your own infrastructure seems old-fashioned to you, you can also use Confluent Cloud. Confluent Cloud is a fully-managed, serverless, Kafka in the cloud. It lets you get all of this kinda thing done without thinking about infrastructure at all. (upbeat music) Now, if you're a developer and you wanna learn more, you know the thing to do is to start writing code. Got a resource for you.

About Home Study

Technology and Life