There are many ways to create a clustered environment. This topic describes one suggested method that offers failover support as well as network load balancing.

High availability for an AMI system can be accomplished in a variety of ways. This topic focuses on the use of clustering to provide failover for all AMI components in the event of a hardware failure. Remember that a fully functional AMI system requires Responder to be up and running in order to actually function as an OMS. It is prudent to configure the core components of Responder Server to be highly available as well. Documentation regarding Responder high availability is available here.

If for some reason (cost, hardware availability) it is not feasible to make the entire system fault tolerant, providing failover for the AMI components alone will allow for the continued receipt of meter and canary events after a hardware failure. In the event a Responder server fails, normal operation can resume once the machines hosting the Responder services (Data, Prediction and Archive) are repaired or replaced. In this case, all meter events will have been collected and will be available for prediction.

In the example setup described below, network load balancing as well as clustering is used to provide high availability. The example here involves four physical machines to host AMI components. The machine names are Node_1, Node_2, Node_3 and Node_4.

IIS Load Balancing

IIS is used to host several of the AMI components, including:

Event Service
Configuration Service
State Service
Configuration Web user interface

Telvent recommends using two or more machines to run IIS as a network load balanced service. Network load balancing for IIS will provide improved performance (see Figure 1). One virtual IP is used to access the hosted services.

Figure 1, Network Load Balancing cluster

IIS High Availability

In the event that IIS goes down due to hardware or network failure, the AMI system will be unable to receive any new meter events, bringing event processing to a halt. To prevent this situation, Telvent recommends using two or more machines to run IIS as an active/passive failover cluster.

MSMQ and the AMI services are located on machines Node_3 and Node_4 and employ active/passive clusters. Figure 2 below illustrates the configuration recommended for the clustered services, and an explanation of the reasoning follows.

The Log Service may use an active/active cluster. See the Log Service section below.

Figure 2, Failover cluster

MSMQ

As shown in Figure 2, MSMQ is configured as part of an active/passive cluster running on Node_3 and Node_4. It may be placed on Node_1 and Node_2 if desired. One potential advantage to clustering MSMQ on Node_1 and Node_2 along with the web services is that the Event Service does not need to traverse the LAN to send messages to an instance of MSMQ located elsewhere. The Event Service needs to be able to enqueue quickly in the event of a large spike in the event submission rate, so as not to lose events. In contrast, the Event Processor (which dequeues from MSMQ) does not pose a risk of losing data if network congestion or latency slows it. In practice, though, an uncongested subnet should allow for rapid communication with MSMQ regardless of the location.

The reason for the active/passive configuration is that it allows the MSMQ service to be accessible via one virtual IP address. By avoiding an active/active configuration we remove the need for a custom load balancing component to switch back and forth between two or more instances of MSMQ when enqueuing or dequeuing.
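
The single-address behavior described above can be illustrated with a small Python model. This is only a conceptual sketch, not how Windows failover clustering is implemented: the class names, the `10.0.0.50` address and the message strings are all hypothetical. It shows that clients keep addressing one virtual IP while the node that services the request changes after a failure.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class ActivePassiveCluster:
    """Toy model: one virtual address, exactly one active node at a time."""
    virtual_ip: str   # hypothetical address that clients are configured with
    nodes: list       # ordered by failover preference

    def active_node(self) -> Node:
        # The first healthy node in the list owns the virtual IP.
        for node in self.nodes:
            if node.healthy:
                return node
        raise RuntimeError("no healthy node available")

    def enqueue(self, message: str) -> str:
        # Clients only ever address the virtual IP; the cluster decides
        # which physical node actually services the request.
        owner = self.active_node().name
        return f"{message} -> {self.virtual_ip} (served by {owner})"


cluster = ActivePassiveCluster("10.0.0.50", [Node("Node_3"), Node("Node_4")])
before = cluster.enqueue("meter-event-1")
cluster.nodes[0].healthy = False   # simulate a hardware failure on Node_3
after = cluster.enqueue("meter-event-2")
print(before)
print(after)
```

Because the caller never names a physical node, no client-side logic is needed to choose between MSMQ instances, which is the point made in the paragraph above.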

Event Processor

The Event Processor is configured as an active/passive cluster with all instances pointing to the virtual IP for MSMQ. This configuration provides failover support for event processing up to the state store.

Post Event Processor

The Post Event Processor should be run in an active/passive configuration as shown in Figure 2. Running multiple instances of the Post Event Processor as part of an active/active configuration can cause the status field of events in the state store to have incorrect values at times, leading to problems with prediction.
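
The source does not describe the exact mechanism behind the incorrect status values. One generic way that two concurrent writers can corrupt a shared field is a lost update, sketched below in Python. The state store, event ID and status names are hypothetical and chosen only for illustration.

```python
# Hypothetical state store: event id -> status string.
state_store = {"event-42": "NEW"}


def read_status(event_id):
    return state_store[event_id]


def write_status(event_id, status):
    state_store[event_id] = status


# Instance A and instance B both read the same event before either writes.
seen_by_a = read_status("event-42")
seen_by_b = read_status("event-42")

# Instance A advances the event based on what it read.
if seen_by_a == "NEW":
    write_status("event-42", "PROCESSED")

# Instance B still acts on its stale read and overwrites A's result.
if seen_by_b == "NEW":
    write_status("event-42", "PENDING")

final_status = state_store["event-42"]
print(final_status)
```

With a single active instance the read and the write are never interleaved with another writer, so this class of problem cannot occur.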

Log Service

The Log Service is shown in Figure 2 using an active/passive configuration. However, there is no technical reason why the Log Service cannot be run in an active/active configuration. The Log Service operates by dequeuing events from its queue that other AMI components have enqueued for logging. These events are then logged by the service to their final destination. In the event that one instance of the Log Service is not able to keep up with the number of messages being sent to its queue, an active/active configuration may resolve the situation.
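
The dequeue-and-log pattern described above can be modeled as two consumers sharing one queue. The Python sketch below uses an in-process `queue.Queue` as a stand-in for the MSMQ queue and a list as a stand-in for the log destination; the instance names and message contents are hypothetical. It is meant only to show that two active instances can drain the same queue with each message handled once.

```python
import queue
import threading

log_queue = queue.Queue()        # stands in for the Log Service's queue
logged = []                      # stands in for the final log destination
logged_lock = threading.Lock()
STOP = object()                  # sentinel telling an instance to exit


def log_service_instance(instance_name):
    while True:
        message = log_queue.get()    # each message goes to exactly one instance
        if message is STOP:
            break
        with logged_lock:
            logged.append((instance_name, message))


messages = [f"log-entry-{i}" for i in range(100)]
for message in messages:
    log_queue.put(message)

workers = [
    threading.Thread(target=log_service_instance, args=(name,))
    for name in ("instance_on_Node_3", "instance_on_Node_4")
]
for _ in workers:
    log_queue.put(STOP)          # one sentinel per instance
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(len(logged))
```

How the 100 messages are split between the two instances varies from run to run, but the total logged is always the number enqueued.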

The Log Service is shown in Figure 2 as active/passive since testing has demonstrated that one instance of the service is enough to handle a large number of events (e.g., 100,000+ per hour) without difficulty. Additionally, one instance of the service results in fewer connections being made to MSMQ to look for messages to be logged.
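
To put the 100,000 events per hour figure in per-second terms, the short Python calculation below converts it to an average rate and a per-event time budget. It assumes events arrive evenly over the hour, which real traffic (for example, a storm-driven spike) will not do, so treat the result as an average rather than a peak.

```python
EVENTS_PER_HOUR = 100_000    # figure quoted in the text
SECONDS_PER_HOUR = 3600

# Average arrival rate, assuming a uniform distribution over the hour.
events_per_second = EVENTS_PER_HOUR / SECONDS_PER_HOUR

# Average time available to log one event before falling behind.
ms_per_event = SECONDS_PER_HOUR * 1000 / EVENTS_PER_HOUR

print(f"{events_per_second:.1f} events/s")   # 27.8 events/s
print(f"{ms_per_event:.0f} ms per event")    # 36 ms per event
```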