JRush Episode 4: Event streaming with Apache Pulsar
Transcript:
Let me introduce myself. I’m a Senior Developer Advocate at DataStax. Alex has already given an overview, so I won’t repeat that. Previously, I was a Developer Advocate at IBM, focusing on Java and open-source technologies, including WebSphere. I’m based in Chicago and am an active community builder, serving as a Java Champion and President/Executive Board Member of the Chicago Java Users Group. Before this, I worked as a development engineer and technical architect at various companies in the Chicago area.
Let’s move on quickly, as time is limited. This session aims to excite you about event streaming, Apache Pulsar, and how DataStax supports it. Here’s a quick agenda: First, we’ll align on what event streaming is. Even for seasoned developers, it can be confusing at first. I’ll introduce the basic terms, discuss the value of event streaming, and explain how it improves systems.
Next, I’ll introduce Apache Pulsar—what makes it unique, why it’s worth considering, and some of its developer-friendly features. I’ll also cover the DataStax Managed Cloud Platform, which simplifies working with Pulsar. If time permits, I’ll do a quick demo. If not, I’ll share resources where you can watch live demonstrations.
Let’s start with the basics: what is an event? According to Merriam-Webster, an event is “something that happens—an occurrence.” In technical terms, it’s a point in space and time, with three spatial coordinates (X, Y, Z) and a time dimension. Events are immutable—they cannot be erased once they occur. For instance, the birth of a baby or a musical performance happens at a specific time.
There are two ways of describing messaging. One is event-driven, where the sender publishes events without needing to know who will consume them. The other is message-driven, in which the sender and receiver have to know each other; in other words, the receiver's address has to be known. That's the difference between the two. Either way, messaging is a highly decoupled way of passing data from one point to another.
Now, let's take a look: events are not a new thing. With the event approach, an event happens and you immediately process it. The more traditional way is batch processing, which has its place in the computing world: you gather all the data and, when you're ready, send it all in one batch. With events, though, you process each one right away, and the advantage is the real-time aspect of it.
Now let's look at the patterns. The event-driven approach is essentially streaming, built on the publish-subscribe ("pub-sub") architectural pattern. Your publishing client sends data to a topic; the broker owns that topic and, when messages arrive, delivers them to whoever subscribes to it.
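The pub-sub flow can be sketched in a few lines of plain Java. This is a toy in-memory broker for illustration only; `ToyBroker` and all the other names here are made up, not the Pulsar API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy in-memory "broker": it owns topics and, when a message arrives
// on a topic, fans it out to every client subscribed to that topic.
class ToyBroker {
    private final Map<String, List<Consumer<String>>> topics = new HashMap<>();

    // A subscribing client registers interest in a topic.
    void subscribe(String topic, Consumer<String> subscriber) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscriber);
    }

    // A publishing client sends a message; the broker delivers it
    // to all current subscribers of the topic.
    void publish(String topic, String message) {
        for (Consumer<String> s : topics.getOrDefault(topic, List.of())) {
            s.accept(message);
        }
    }

    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker();
        broker.subscribe("orders", m -> System.out.println("subscriber A got " + m));
        broker.subscribe("orders", m -> System.out.println("subscriber B got " + m));
        broker.publish("orders", "order-123"); // both subscribers receive it
    }
}
```

Note that, unlike the queue pattern below, delivering a message here does not remove it from anyone else's view: every subscriber gets its own copy.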
There’s another pattern: message queuing, which is a form of asynchronous service-to-service communication. The sender puts data onto a queue, and the receiver picks messages up from that queue. Once a message is picked up, it’s gone from the queue.
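That destructive-read behavior, where a message disappears once it is picked up, can be shown with a plain Java queue (a sketch only, not a real message broker):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class QueueDemo {
    public static void main(String[] args) {
        // The sender puts messages onto the queue.
        Queue<String> queue = new ArrayDeque<>();
        queue.add("payment-received");
        queue.add("invoice-created");

        // The receiver picks a message up; poll() removes it from the queue,
        // so no other receiver will ever see it.
        String msg = queue.poll();
        System.out.println("Processed: " + msg);
        System.out.println("Still queued: " + queue.size());
    }
}
```

The key contrast with pub-sub: here each message goes to exactly one receiver, whereas a topic delivers a copy to every subscriber.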
Event streaming is a step beyond event messaging. What’s driving the change is the need for real-time data to enhance customer experiences and create a competitive advantage. Delivering results to users faster is crucial. Today, with AI and machine learning applications, building data pipelines is essential. Event streaming supports scalability—it enables systems to ingest high volumes of data during peak times and scale back dynamically. Built-in mechanisms handle backpressure to prevent system overload.
Event streaming allows systems to watch for events and act immediately without waiting. Subscribing to specific topics ensures efficiency and low latency in processing high-frequency messages. Comparing traditional batch processing (e.g., ETL processes) to modern event streaming highlights the shift. Batch processing deals with huge data volumes but takes time as it involves disk I/O operations. In contrast, event streaming processes data in memory in real time, transforming it before outputting to a sink.
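The batch-versus-streaming contrast above can be sketched in plain Java. Everything here is illustrative: `transform` stands in for whatever per-event work a pipeline does, and stdout stands in for a real sink:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchVsStream {
    // The per-event transformation; in a real pipeline this might
    // parse, enrich, or filter the event before it reaches the sink.
    static String transform(String event) {
        return event.toUpperCase();
    }

    public static void main(String[] args) {
        List<String> arriving = List.of("e1", "e2", "e3");

        // Batch style: gather everything first, then process in one pass
        // (in a real ETL job this typically involves disk I/O between stages).
        List<String> batch = new ArrayList<>(arriving);
        batch.forEach(e -> System.out.println("batch-processed " + transform(e)));

        // Streaming style: transform each event in memory the moment it
        // arrives, then hand it straight to the sink.
        Consumer<String> sink = e -> System.out.println("streamed " + e);
        for (String event : arriving) {
            sink.accept(transform(event));
        }
    }
}
```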
Now, let’s talk about Apache Pulsar. Pulsar is open-source software created at Yahoo and donated to the Apache Software Foundation in 2016. It’s designed to be cloud-native, supporting multi-tenancy and separating compute from storage. Pulsar brokers handle message delivery, while Apache BookKeeper manages message storage efficiently. This separation simplifies scalability and ensures message reliability with features like deduplication and retry mechanisms.
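As a concrete illustration of the reliability point: deduplication is a broker-side feature you switch on in configuration. A minimal fragment, assuming a standard `broker.conf`, might look like this:

```properties
# broker.conf: enable message deduplication across the cluster
brokerDeduplicationEnabled=true
```

Deduplication can also be enabled or disabled per namespace through the `pulsar-admin` tool rather than globally.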
Pulsar also includes serverless "Pulsar Functions" for lightweight, real-time data transformation and processing. It supports offloading less active messages to cost-effective storage options like S3 buckets or Google Cloud Storage, improving sustainability and cost efficiency. Pulsar is compatible with multiple programming languages, including Java, Python, Go, and more.
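At its core, a Pulsar Function is a per-message transformation: a message comes in, a transformed message goes out. The plain-Java sketch below mirrors that shape without pulling in the Pulsar dependency; the real interface lives in `org.apache.pulsar.functions.api`, and its `process()` method also receives a context object:

```java
import java.util.function.Function;

public class ExclaimFunction {
    // Stand-in for a Pulsar Function: one input message in,
    // one transformed output message out. All names illustrative.
    static final Function<String, String> exclaim = input -> input + "!";

    public static void main(String[] args) {
        System.out.println(exclaim.apply("hello")); // prints "hello!"
    }
}
```

Deployed as an actual Pulsar Function, logic like this would be attached to an input topic and write its results to an output topic, giving you lightweight real-time transformation without running a separate stream-processing cluster.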
Pulsar’s architecture separates compute and storage, unlike Apache Kafka. This segmentation allows efficient log management, improving scalability. Pulsar also integrates with existing systems like Kafka, JMS, and RabbitMQ through Starlight APIs, ensuring compatibility during migration. It offers features like geo-replication for disaster recovery and Pulsar IO for building efficient data pipelines.
At DataStax, we offer Astra Streaming, a managed Pulsar platform for quick deployment, proof-of-concept projects, and scalability. Open-source Pulsar and enterprise support options are also available. Pulsar has a rich ecosystem of connectors and clients, making it a versatile unifying platform for messaging, streaming, and queuing.