Monday, July 14, 2008

Broker-based versus Distributed MOM

One thing that several people have asked me about over email is the difference between broker-based and distributed message-oriented middleware (MOM). In general, a broker-based system involves a central machine (or machines) that all clients (publishers and subscribers) connect to over TCP (or something equivalent) and route all traffic through. In a distributed MOM system there is no such intermediary: all machines simply listen directly on the network, and some type of one-to-many distribution mechanism (Ethernet broadcast or IP multicast) gets the required messages to all subscribers.

The major advantages of a distributed system are:
  • No Central Point of Failure. If you have a centralized broker getting in between all your publishers and subscribers, you have one central point where failure will really mess you up, and you have to engineer around that.
  • Low Latency. Because you're not relying on things like TCP, you can engineer a much lower-latency solution overall. In particular, very modern distributed MOM systems should be capable of leveraging InfiniBand, which would give you incredibly fast, low-latency distribution (though I'm not aware of any that do; if you are, please let me know!). Even IP multicast is lower latency than publisher-to-broker-to-consumer TCP distribution.
  • No Central Bottlenecks. If everything's going through one chokepoint, and it has performance problems, everything slows down.
  • Slow Consumer Problem Avoided. The default behavior of centralized systems is that if you have a publisher and multiple subscribers, one slow or misconfigured subscriber will throttle the publisher down to the rate that the slow subscriber can handle. This is the cause of many operational issues, as you have to track down the bad code and kill it. It is particularly common when someone is careless enough to run a debugger against a production subscription feed.
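
As a rough illustration of the slow consumer problem, here is a toy rate model (not any real product's flow control). The function names and the rates are made up for the example, and the multicast side assumes that a subscriber which cannot keep up simply misses messages rather than slowing anyone else down.

```python
def broker_publish_rate(publish_rate, subscriber_rates):
    """Rate (msgs/sec) the publisher actually achieves when the broker
    applies back-pressure: it is capped by the slowest subscriber."""
    return min([publish_rate, *subscriber_rates])


def multicast_outcome(publish_rate, subscriber_rates):
    """In this toy model the publisher is never throttled; each subscriber
    that is slower than the publisher falls behind by the difference."""
    return {
        "publisher_rate": publish_rate,
        "missed_per_sec": [max(0, publish_rate - r) for r in subscriber_rates],
    }


# Hypothetical numbers: one healthy subscriber, one stuck in a debugger.
print(broker_publish_rate(1000, [5000, 50]))
print(multicast_outcome(1000, [5000, 50]))
```

In the brokered case everyone pays for the slow subscriber; in the distributed case only the slow subscriber does.
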
Conversely, broker-based MOM systems offer:
  • Single Central Point of Resiliency. While you do have a single point of failure, it is one that you can engineer the heck out of so that no single component failure takes it down (for example, my firm's production infrastructure tolerates a number of failures across disks, network connections, power supplies, power feeds, cables, switches, RAM cards, etc.). Engineering that level of resiliency into multiple points can be very difficult; in the broker-based model you only have one point to harden, so you can focus all your efforts on it.
  • Never Losing A Message. In a distributed system, there are far too many places where you can end up losing a message, or, more difficult to guard against, losing the order of a message in a stream where the order might matter (for example, delivering a "trade" order after the "cancel" order has already been processed). That leads to:
  • In Order Delivery. The central broker can and does (sometimes) act as an ordering mechanism, which means that you can impose different ordering models onto your messages (for example, global order, where every message is guaranteed to be delivered in the order in which the single central broker receives it; or publisher order, where every message is guaranteed to be delivered in the order in which the publisher produced it).
  • Statistics, Management, Advanced Features. There's a lot of stuff you can do if you can put a single man in the middle of your messaging stream, and broker-based approaches tend to do it.
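
To make the two ordering models concrete, here is a minimal in-memory sketch. `SequencingBroker` and `respects_publisher_order` are hypothetical names invented for this example, not part of any MOM product: the broker stamps each message with a global sequence number as it arrives, and the helper checks whether a delivered stream keeps each publisher's own order.

```python
from collections import defaultdict
from itertools import count


class SequencingBroker:
    """Toy broker that imposes a global order by stamping arrivals."""

    def __init__(self):
        self._global_seq = count(1)
        self.log = []  # (global_seq, publisher, publisher_seq, payload)

    def receive(self, publisher, publisher_seq, payload):
        stamped = (next(self._global_seq), publisher, publisher_seq, payload)
        self.log.append(stamped)
        return stamped


def respects_publisher_order(messages):
    """True if, for every publisher, its sequence numbers only increase."""
    last_seen = defaultdict(int)
    for _, publisher, publisher_seq, _ in messages:
        if publisher_seq <= last_seen[publisher]:
            return False
        last_seen[publisher] = publisher_seq
    return True


broker = SequencingBroker()
broker.receive("desk-a", 1, "trade")
broker.receive("desk-b", 1, "trade")
broker.receive("desk-a", 2, "cancel")

# Delivering in global order keeps publisher order as a side effect.
print(respects_publisher_order(broker.log))
# Delivering desk-a's cancel before its trade violates publisher order.
reordered = [broker.log[2], broker.log[0], broker.log[1]]
print(respects_publisher_order(reordered))
```
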
To give you an example of where broker-based messaging is a pretty clean win, assume a case where a single message absolutely, positively, MUST be delivered to two services, no matter what. To do that in a broker-based system, using JMS terminology, you have two subscribers, each with its own Durable Subscription (which will store messages on the broker while the client process is disconnected), and the publisher publishes its messages as Persistent. Job done; the broker does the rest.
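
The idea behind a durable subscription can be sketched with a toy in-memory broker. This is not the JMS API: `ToyBroker` and its methods are invented for illustration, and a real broker would write pending messages to disk rather than hold them in a Python dict.

```python
from collections import deque


class ToyBroker:
    """Keeps a separate backlog for every named durable subscription."""

    def __init__(self):
        self._pending = {}  # subscription name -> deque of undelivered messages

    def create_durable_subscription(self, name):
        self._pending.setdefault(name, deque())

    def publish(self, message):
        # Every subscription gets its own copy, connected or not.
        for backlog in self._pending.values():
            backlog.append(message)

    def receive_all(self, name):
        """Called when a subscriber (re)connects: drain its backlog in order."""
        backlog = self._pending[name]
        delivered = list(backlog)
        backlog.clear()
        return delivered


broker = ToyBroker()
broker.create_durable_subscription("settlement")
broker.create_durable_subscription("risk")

broker.publish("trade-1")
print(broker.receive_all("settlement"))  # settlement is up and consumes at once
broker.publish("trade-2")                # risk has been down the whole time
print(broker.receive_all("risk"))        # on reconnect it still gets everything
```

The publisher never learns whether either service was up; all of the bookkeeping lives in the one place you have already hardened.
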

In a distributed case, you have to have disk storage (redundant, of course) on the publisher, and a distributed logging mechanism, and some type of handshaking between the logging mechanism and the publisher to notify the publisher that it can wipe the message from persistent storage, and some type of equivalent handshaking between the publisher and the clients and ... Yes, you can do it, but architecturally it's a lot more complicated, and there are a lot more points that you have to harden to make sure that no single failure can prevent a message from being delivered.
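
To give a feel for just one piece of that handshaking, here is a minimal sketch of the publisher side: keep each message until every required subscriber has acknowledged it. `RetainingPublisher` is a hypothetical class, the `store` dict stands in for redundant disk, and the network, the logging mechanism, and retransmission itself are all left out.

```python
class RetainingPublisher:
    """Holds each message until all required subscribers have acked it."""

    def __init__(self, required_subscribers):
        self._required = frozenset(required_subscribers)
        self._awaiting = {}  # message id -> subscribers that have not acked yet
        self.store = {}      # stand-in for redundant persistent storage

    def publish(self, message_id, payload):
        # Persist first, then (in a real system) put it on the wire.
        self.store[message_id] = payload
        self._awaiting[message_id] = set(self._required)

    def acknowledge(self, message_id, subscriber):
        waiting = self._awaiting.get(message_id)
        if waiting is None:
            return  # unknown or already fully acknowledged; ignore duplicates
        waiting.discard(subscriber)
        if not waiting:
            # Everyone has it, so it is finally safe to wipe the message.
            del self._awaiting[message_id]
            del self.store[message_id]

    def to_retransmit(self):
        """Messages still owed to someone, with the subscribers owed."""
        return {mid: sorted(subs) for mid, subs in self._awaiting.items()}


publisher = RetainingPublisher(["risk", "settlement"])
publisher.publish(1, "trade")
publisher.acknowledge(1, "risk")
print(publisher.to_retransmit())  # settlement has not confirmed yet
publisher.acknowledge(1, "settlement")
print(publisher.store)            # nothing left to hold
```

Even this small piece has state that must survive a publisher crash, which is exactly the kind of extra hardening point the broker-based model avoids.
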

There's one other thing that also factors in here: cost and licensing models. Distributed MOM systems tend to be priced per node, while broker-based MOM systems tend to be priced per broker CPU. Broker-based pricing usually works out much better when you're scaling out to your entire company's desktop infrastructure; distributed MOM pricing tends to work out better if your application is limited to a few machines. And it doesn't usually take very many machines for the broker-based systems to win.
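
The break-even point is simple arithmetic. The sketch below uses a hypothetical function name and entirely invented prices (no vendor's actual figures), and assumes flat per-node and per-CPU pricing with whole-number prices.

```python
def break_even_nodes(price_per_node, price_per_broker_cpu, broker_cpus):
    """Smallest node count at which per-node licensing costs strictly more
    than licensing the broker's CPUs. Prices are whole currency units."""
    broker_total = price_per_broker_cpu * broker_cpus
    return broker_total // price_per_node + 1


# Invented prices: 1,000 per node versus 10,000 per CPU on a 4-CPU broker.
print(break_even_nodes(1_000, 10_000, 4))
```

With those made-up numbers the two models cost the same at 40 nodes, and the broker is cheaper from the 41st node onward.
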

So what do I recommend? In general, these days, I recommend coding against a broker-based model unless you know you need the latency and performance characteristics that you can get out of a fully distributed system. If you've coded your system well, you can always retrofit a distributed MOM system later on.

Anybody else have any thoughts?