Real-life experiences gained on cutting edge BizTalk projects from the City of London.

Wednesday, December 15, 2004

Useful retry pattern

Reading a ‘Suspended (Not Resumable)’ status for a critical $20M financial transaction is not really the best thing you can wake up to. I’ve woken up to a few in my time, and trust me: this one is particularly unpleasant…

There are a number of reasons it can happen, but where orchestrations are concerned, it generally means that an exception has occurred that hasn’t been gracefully handled.

Although in theory this can happen anywhere in your orchestration, there are a few key points where it happens more often than anywhere else, and if you protect yourself at these points, you’ll be able to sleep a little easier. Common scenarios that could generate an error are:

  • Calling into the Rules Engine to execute a Policy – if the Rule Engine Update Service is not running (why does this never start on my laptop??), or the rule engine runs code that throws an exception
  • Sending a message to a destination that may sometimes be unavailable (that’s pretty much all destinations!)
  • Executing some .NET code that could go wrong for an infrastructure reason (e.g. database server is down)

All these scenarios are related to infrastructure problems: if you fix the infrastructure and resume the orchestration, everything should continue as if nothing ever happened.

In these situations, a good way of dealing with the failure is to execute some retry logic: gracefully put the orchestration into a Suspended mode and allow an administrator to resume it once the infrastructure problem has been resolved.

It’s pretty simple to achieve via nested scopes. Wrap the potentially risky operation in a set of shapes that together form a reusable retry pattern:

Essentially, the pattern requires that the risky operation is encapsulated within a loop shape, which in turn contains a non-transactional scope with an exception handler that traps the possible error conditions. If an exception occurs, a String variable and a Boolean variable are set. These are then used outside the non-transactional scope to determine whether to gracefully Suspend, and to write out a meaningful reason for the suspension. The administrator will get an alert that warns that an instance has been suspended. He can then resolve the problem (e.g. start the Rule Engine Update Service) and then use HAT to resume any suspended orchestration instances. Simple…
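The control flow described above can be sketched in a few lines of plain Python. This is only an illustration of the shape of the pattern, not BizTalk code: the names `InfrastructureError`, `run_with_retry`, `risky_operation` and `suspend` are all hypothetical, and `suspend` simply stands in for the Suspend shape (it is assumed to return once an administrator has resumed the instance).

```python
class InfrastructureError(Exception):
    """Hypothetical stand-in for any exception caused by unavailable infrastructure."""


def run_with_retry(risky_operation, suspend):
    """Repeat risky_operation until it succeeds, suspending after each failure.

    suspend(reason) models the Suspend shape: it is assumed to return only
    when an administrator has resumed the instance.
    """
    succeeded = False
    result = None
    while not succeeded:                      # the loop shape
        failed = False                        # Boolean variable set by the handler
        reason = ""                           # String variable set by the handler
        try:                                  # the non-transactional scope
            result = risky_operation()
            succeeded = True
        except InfrastructureError as exc:    # the exception handler
            failed = True
            reason = f"Risky operation failed: {exc}"
        if failed:                            # decision made outside the scope
            suspend(reason)                   # after resume, the loop retries
    return result
```

Note that the handler only records what went wrong; the decision to suspend is taken after the scope has completed, which mirrors the way the two variables are used outside the non-transactional scope. Exceptions that the handler does not trap still propagate as normal.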

Once I’ve worked out how to upload pictures of orchestrations, I’ll post a picture which will make it easier to understand!

One extra issue is that if you’re using this pattern when sending messages and the message fails to be transported, the message to be sent will also go into a Suspended (Resumable) state. If the administrator resumes both the Orchestration service instance and the Send service instance, you will send the same message twice. To work around this problem, you can use delivery notifications and a Nack Handler to automatically clean up the send service instances, but that’s a subject for another day…

Friday, December 03, 2004

Failover issues with MQ clusters

One issue we've had with MQ Series in a clustered environment is that if you initiate a failover to the passive node, and then failback, MQ Series will moan about Semaphore locking issues. This is a known issue with MQ that IBM is working on with Microsoft.

The workaround is to always reboot the original node before failing the resources back to it. This seems to cure the issue!