Thursday, April 12, 2007

Load balancing MSMQ

I've been investigating how to set up MSMQ in a load balanced environment using Microsoft's Network Load Balancing, so that messages are distributed over a number of servers without the hassle of having to reference each server individually.

In theory it’s relatively easy, and in practice it’s not too difficult once you’ve got past a few teething troubles.

First make sure the queues are named the same way on each server that you want to put in the cluster. For testing you can use virtual servers, but there are a few restrictions to be aware of.
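If you're scripting the setup rather than using the management console, a quick sketch of creating the queue looks something like this. I'm using the MSMQ COM objects from Python via pywin32 purely as an illustration, and the queue path is just an example - run the equivalent on each server so the names match:

    import win32com.client

    # Create a queue with the same name on every server in the cluster.
    qinfo = win32com.client.Dispatch("MSMQ.MSMQQueueInfo")
    qinfo.PathName = r".\queuename"   # for a private queue use r".\private$\queuename"
    qinfo.Create()                    # raises an error if the queue already exists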

Add each server to the cluster, and make sure port 1801 is forwarded. You can set the Affinity to None: any server should be able to accept a message, so there's no session to maintain.

When referencing the queue from code, use a Direct Format TCP name (e.g. FormatName:DIRECT=TCP:192.168.x.x\queuename). You can use the OS format instead, but you need to add a registry key for it to work properly; KB article 899611 tells you how to do this if you need it.
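As an illustration of sending with a direct format name, here's a rough sketch using the MSMQ COM objects from Python via pywin32 (the IP address and queue name are placeholders, and note that the COM FormatName property doesn't take the "FormatName:" prefix):

    import win32com.client

    qinfo = win32com.client.Dispatch("MSMQ.MSMQQueueInfo")
    # Address the cluster's virtual IP, not an individual server.
    qinfo.FormatName = r"DIRECT=TCP:192.168.x.x\queuename"

    queue = qinfo.Open(2, 0)   # 2 = MQ_SEND_ACCESS, 0 = MQ_DENY_NONE

    msg = win32com.client.Dispatch("MSMQ.MSMQMessage")
    msg.Label = "test message"
    msg.Body = "hello cluster"
    msg.Send(queue)

    queue.Close()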

Now you should be in a position to send a message to the cluster and have it delivered to one of the queues. In basic terms this is all that is required. At this point the load balancing works, so if that's all you need, you can stop here.

However, if you have a system where a lot of messages are being pushed around, you'll soon notice a few flaws. When you send to a remote queue, what actually happens is that the message goes into the client machine's Outgoing Queues. The MSMQ service is then responsible for connecting to the remote server and pushing the messages to it.

What this means in practice is that when you send a message, MSMQ opens a connection to the cluster. NLB assigns this connection to one server, and the messages are pushed through. If you stop this server in the cluster (for maintenance, for example), then the connection between client and server is ended.

You'll see messages piling up in the Outgoing Queues on the client machine. By default MSMQ reconnects after 60 seconds, though in practice I've seen it take closer to 70 seconds. Once it reconnects, the messages go to another server.

If you require more performance than this, then I'm sure you'll agree that 60-70 seconds is too long to wait for MSMQ to realise it needs to make another connection. Luckily there is a WaitTime registry setting you can add to reduce this time. However, the additional 10 seconds or so still seems to be added on top of whatever value you enter.
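As a sketch of setting it, assuming the usual MSMQ Parameters key under HKLM (do verify the exact location, units and a sensible value for your MSMQ version before relying on this):

    import winreg

    # Assumed location of the MSMQ service parameters.
    params = r"SOFTWARE\Microsoft\MSMQ\Parameters"

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, params, 0, winreg.KEY_SET_VALUE) as key:
        # WaitTime controls how long MSMQ waits before retrying the connection.
        # The value (5) and its units are illustrative only - check the KB article.
        winreg.SetValueEx(key, "WaitTime", 0, winreg.REG_DWORD, 5)

The Message Queuing service will most likely need restarting before it picks up the change.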

Once this is sorted out you're still left with a problem. Let's say we have 2 client machines sending messages to a cluster of 2 servers, with client 1 connected to server 1 and client 2 connected to server 2. If we take down server 2, client 2 reconnects to server 1, and everyone is happy.

When we bring server 2 back, though, it won't get any messages, because clients 1 and 2 already have connections established to server 1.

This is because by default MSMQ keeps an idle connection open for 5 minutes. If you're running a high-performance system, it's unlikely a connection will ever be idle for that long. This is where the CleanupInterval registry key comes in: it tells MSMQ how long to leave an idle connection open before closing it.

Reduce this to a suitable time for your system, and you should see a much more balanced distribution of messages across the servers in your cluster.
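Setting it follows the same pattern as the WaitTime sketch above (again, the key location, units and value are assumptions to check for your environment):

    import winreg

    params = r"SOFTWARE\Microsoft\MSMQ\Parameters"   # assumed location, as above

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, params, 0, winreg.KEY_SET_VALUE) as key:
        # CleanupInterval controls how long an idle connection is kept open.
        # 60000 is illustrative, assuming the value is in milliseconds.
        winreg.SetValueEx(key, "CleanupInterval", 0, winreg.REG_DWORD, 60000)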

And there you have it. Not too difficult once you've found the pieces of information you need.
