*This guide was created from lecture videos and is meant to help you gain an understanding of how Apache Kafka works.
Stream Processing: with stream processing, all the data from your data providers (called Producers in Apache Kafka) comes into the streaming platform, and operations are performed on the data as it is being written.
Data providers (or Producers) are the things that write data to a cluster.
Use Cases (or Consumers) are the things that are using the data.
Consumers can also write data back to other parts of Kafka, at which point they act as Producers themselves.
You can write directly to the Kafka cluster from applications or you can connect to applications using a Connector. Connectors can reach out and find events in systems and pull them into the data pipeline.
A messaging system is one where messages are sent out and other applications listen for those messages. The listening application takes a message and performs an action because of it. This is a good way to transfer data between systems: the messaging system acts as an intermediary between the two.
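A toy sketch of that intermediary role (illustrative only, not Kafka's API): subscribers register a callback for a topic, and every published message is handed to each listener.

```python
from collections import defaultdict

class MessageBus:
    """Toy publish/subscribe intermediary (illustration only, not Kafka's API)."""

    def __init__(self):
        # topic name -> list of subscriber callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Every listener on the topic receives the message and acts on it.
        for callback in self._subscribers[topic]:
            callback(message)

received = []
bus = MessageBus()
bus.subscribe("orders", received.append)
bus.publish("orders", {"id": 1, "item": "book"})
```

The point of the intermediary is that the publisher never needs to know who is listening; it only knows the topic name.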
Common use cases: web analytics, operational monitoring, log collection, and stream processing.
A log is a stream of the changes to a specific category or entity, recorded as changes or events occur within your data providers/Producers.
The log is used to store information to be used for all the specified data sources. When an update is made in one data provider, it also gets saved into the log, which can be used by the other data providers to ensure accurate information.
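The log idea can be sketched as an append-only list where each change gets a sequential offset (a toy illustration, not Kafka's implementation): one provider appends an update, and any other system can replay the changes from a chosen offset.

```python
class ChangeLog:
    """Toy append-only log: each record gets a sequential offset (sketch, not Kafka)."""

    def __init__(self):
        self._records = []

    def append(self, change):
        # Appends are the only write operation; existing records never change.
        self._records.append(change)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        # Any reader can replay changes starting from a chosen offset.
        return self._records[offset:]

log = ChangeLog()
log.append({"user": 1, "email": "a@example.com"})
log.append({"user": 1, "email": "b@example.com"})
```

Because the log keeps history, a new application can bootstrap itself by reading from offset 0.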
With Kafka, you do not need to make API calls to third-party systems. All the data is written to Kafka as a central repository, and any changes made in Kafka are available in real time.
Kafka is fast writing, offers historical tracking and is good for bootstrapping new applications.
Producers are data providers that make changes in applications and write those changes as entries into the Kafka cluster.
Consumers are the applications that use the data, and they can become producers if writing to the Kafka cluster.
Connectors will allow you to integrate things such as databases and you can pull the changes into the Kafka cluster.
Stream Processors are used for stream processing, where data arrives in real time and is transformed as it flows through.
In between the Producers and Consumers, there are Brokers. A Kafka Broker is a logical separation of tasks and is used to distribute the load and provide backups of your data. Each broker handles one or more Kafka topics. A broker can be thought of as a container or bucket for the partitions of topics.
An Apache Kafka Topic is a category of changes, such as a like on a social media post or a video view. Each topic is set up inside a broker. If there are multiple brokers, then you can split the topics across the brokers into partitions. These partitions are used to create a resilient system.
Each partition has a leader, which is where all the data is written. All incoming data for the partition comes in through the leader and is written to the topic there. The data is then replicated from the leader to the follower partitions on other brokers. This way a single broker handles the writes for a given partition, while the followers keep copies so that another broker can take over if the leader fails.
A topic can be seen as a category such as customer information, sales orders, or website visits.
Each topic is a separate log in Kafka, and topics are resilient due to partitioning. Partition numbering starts at 0; each partition has its own leader, and the leaders are spread across the brokers.
Producers publish data using a partitioner. The partitioner determines which partition of the topic a message is written to; the write goes to that partition's leader. The message response validates that the data has been written, and it can be configured so that it is sent only after the data has been replicated to another partition.
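A minimal sketch of the partitioner idea: hash the message key so that equal keys always land on the same partition (Kafka's default partitioner uses murmur2; crc32 here is just for illustration).

```python
import zlib

def pick_partition(key, num_partitions):
    """Map a message key to a partition so equal keys land on the same partition.
    Sketch of the idea only; Kafka's default partitioner uses murmur2, not crc32."""
    if key is None:
        # Real producers spread keyless messages round-robin/sticky;
        # this sketch just defaults to partition 0.
        return 0
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Keying matters because ordering is only guaranteed within a partition: sending all of one user's events with the same key keeps them in order.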
You can set the message durability at the lead partition: you can tell the messaging to respond immediately or to wait until the data has been replicated. You can also set ordering/retries; retries will pause or prevent other writes from happening. You can change batching and compression, which affect throughput. There is also a buffer.memory config item that sets the total memory available to the Java client.
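The producer settings above can be sketched as a config map. The keys are real Kafka producer config names for the Java client; the values here are illustrative only, not recommendations.

```python
# Sketch of the producer settings discussed above. Keys are real Kafka
# producer config names (Java client); values are illustrative examples.
producer_config = {
    "acks": "all",                # respond only after replication ("1" = leader only, "0" = immediately)
    "retries": 3,                 # resend on transient failure
    "max.in.flight.requests.per.connection": 1,  # preserve ordering while retrying
    "batch.size": 16384,          # batch records per partition (throughput)
    "compression.type": "snappy", # compress batches (throughput vs. CPU)
    "buffer.memory": 33554432,    # total memory available to the Java client's buffer
}
```

Setting `max.in.flight.requests.per.connection` to 1 is what makes retries ordering-safe: a retried batch cannot leapfrog a later one, at the cost of some throughput.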
Consumers are the applications that read data from the Kafka cluster. Consumers are organized into Groups, and the consumers in a group divide the partitions of a topic among themselves. Apache Zookeeper acts as a broker coordinator and monitors consumer groups for rebalancing.
When you set up the Consumer Group, ensure that you add a consumer group ID, which is the name or identifier for the Consumer Group. You can adjust the session timeout to avoid rebalancing, and you can adjust the heartbeat interval configuration.
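A minimal sketch of how a group's partitions might be divided among its consumers (round-robin here; real Kafka ships several pluggable assignors, and the group coordinator redoes this on every rebalance).

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment of topic partitions to the consumers in a group.
    Sketch of the idea only; Kafka has several pluggable partition assignors."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        # Deal partitions out like cards so the load is spread evenly.
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

# When a consumer joins or leaves, the coordinator recomputes the assignment;
# that recomputation is the "rebalance" the session timeout guards against.
group = assign_partitions([0, 1, 2, 3], ["consumer-1", "consumer-2"])
```

Note that a partition is always owned by exactly one consumer in the group, which is why adding more consumers than partitions leaves some of them idle.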
Autocommit assists with the consumer's offset, which is the position in the partition up to which the consumer has read. Autocommit commits the offset every five seconds by default, which helps if the consumer goes down: on restart it can resume near where it left off rather than losing track of the messages read before it went down.
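The autocommit idea can be sketched as a small offset tracker; the class and method names here are hypothetical, and a real consumer commits its offsets back to Kafka itself.

```python
import time

class OffsetTracker:
    """Sketch of consumer-side autocommit: remember the last offset read and
    persist ("commit") it periodically so a restarted consumer resumes nearby.
    All names are hypothetical; real consumers commit offsets to Kafka."""

    def __init__(self, commit_interval=5.0):
        self.position = 0          # next offset to read
        self.committed = 0         # last offset saved durably
        self.commit_interval = commit_interval
        self._last_commit = time.monotonic()

    def record_read(self, offset):
        self.position = offset + 1
        # Autocommit: only persist the position every commit_interval seconds.
        if time.monotonic() - self._last_commit >= self.commit_interval:
            self.commit()

    def commit(self):
        self.committed = self.position
        self._last_commit = time.monotonic()

tracker = OffsetTracker(commit_interval=5.0)
tracker.record_read(0)
tracker.record_read(1)
tracker.commit()
```

The gap between `position` and `committed` is exactly what is at stake in a crash: messages read after the last commit may be delivered again on restart.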
RAM: 32 GB+
CPU: 24 cores per machine
Disk: multiple drives
* you can use multiple log directories, with each directory mounted on a separate drive
Network: 10 Gb Ethernet
kafka-topics.sh -> specify where Apache Zookeeper is -> specify --create and the topic name -> set the number of partitions -> set the replication factor (how many times you want the data replicated) -> any other configuration values
kafka-topics.sh -> specify where Apache Zookeeper is -> --alter --topic topic_name -> --partitions 40 (specifying the number of partitions)
kafka-topics.sh -> specify where Apache Zookeeper is -> --alter --topic topic_name -> --config x=y (to set a configuration) OR --delete-config x (to delete a configuration from a Topic)
You need to first enable topic deletion with delete.topic.enable=true -> kafka-topics.sh -> specify the Zookeeper path (--zookeeper zk_host:port/chroot) -> --delete --topic topic_name
kafka-run-class.sh kafka.tools.ConsumerOffsetChecker -> point to Apache Zookeeper, --zookeeper localhost:2181 -> which Consumer Group to look at, --group test. This will show all the Consumers in the Consumer Group and their offsets.
kafka-consumer-groups.sh -> point at the bootstrap server, --bootstrap-server broker1:9092 -> --list OR --describe --group test-consumer-group
Kafka Monitor is an open-source component used for long-running tests to monitor your Kafka Cluster.
In Kafka Monitor, you can set up tests and have services that use those tests to interact with your Kafka Cluster. You can also test across data centers and set up a service that mirrors tests across the different data centers.
Auditing in Kafka is verifying that messages are being handled properly and that they are being delivered and processed.
Kafka Audit is a methodology of creating an auditing system by yourself. To do this, you would have multiple data centers and mirror the messages being delivered and processed to a third data center to compare the results of the two data centers that you have mirrored.
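One way to sketch that comparison step, assuming you can pull message timestamps from each data center (the function and parameter names here are hypothetical): bucket the timestamps per interval in each data center and report the buckets whose counts disagree.

```python
from collections import Counter

def audit_counts(primary_timestamps, mirror_timestamps, bucket_seconds=60):
    """Sketch of the audit idea: count messages per time bucket in each data
    center and report buckets whose counts disagree. All names hypothetical."""
    def bucket(timestamps):
        return Counter(ts // bucket_seconds for ts in timestamps)
    a, b = bucket(primary_timestamps), bucket(mirror_timestamps)
    # A mismatch in any bucket means messages were lost or duplicated in transit.
    return {t: (a[t], b[t]) for t in a.keys() | b.keys() if a[t] != b[t]}

mismatches = audit_counts([0, 10, 70], [0, 70])
```

An empty result means the two data centers agree, bucket by bucket, on how many messages were delivered.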
Chaperone uses a proxy Kafka interface that sends messages across and into a regional Kafka cluster. Chaperone audits each message, down to its timestamp, to locate where issues occur.
Confluent has a control center which is a series of charts, graphs and dashboards that will allow you to monitor your Kafka Cluster. You can find this at confluent.io.
Ensure you have the latest Java JDK -> download the latest version of Kafka (binary downloads) from kafka.apache.org -> go into terminal, unpack the compressed file using tar -xzf -> once unpacked, go into kafka directory -> start Zookeeper with bin/zookeeper-server-start.sh config/zookeeper.properties -> start Kafka server using bin/kafka-server-start.sh config/server.properties
Once you have Kafka and Zookeeper running, run the kafka topics script: bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test -> you can see the topic using bin/kafka-topics.sh --list --zookeeper localhost:2181
Ensure Kafka and Zookeeper are running -> in a new terminal window, run the Producer script with bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test -> in a new terminal window, run the Consumer script with bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning -> type into the Producer terminal window to see it being read into the Consumer window
Ensure Kafka and Zookeeper are running -> open a terminal window, create new property files with cp config/server.properties config/server-1.properties -> run cp config/server.properties config/server-2.properties to create another copy -> verify the properties were created using ls config/ -> run vi config/server-1.properties and change the broker.id, the listener, and the log directory -> start the two new servers using bin/kafka-server-start.sh config/server-1.properties -> bin/kafka-server-start.sh config/server-2.properties -> in a new terminal window, to create a new topic that is replicated to all three nodes use bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic -> use the describe command, telling it where Zookeeper is and what topic to look at: bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Open a terminal window, set up a producer using bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-replicated-topic -> open a new window and create a Consumer using bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic -> find where your servers are running using ps aux | grep server-1.properties to search for processes using server-1.properties -> find the process ID and kill the server -> see which node is now the leader for your topic using bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic -> to read from the Consumer, run bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic
This works because the file source connector in Kafka Connect watches a given file: any time the data in the file is modified, the changes are pulled into Kafka, where the Consumer (or the file sink connector) writes them out.
Open a terminal window -> create the file -> set up a standalone connector using bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties -> to see the data in the Consumer, use bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning
For more information on how Apache Kafka works, visit the official documentation or reach out in the comments.