Software applications can be destabilized by many factors that are difficult to cover in a test or lab environment. These include timing-related issues, unanticipated bursts, cascading effects, unexpected administrative or batch processes, and integration complexities. Some of the most nefarious of these factors are network problems. Network infrastructure is often a complex black box and can behave inconsistently for various reasons. Network problems can cause serious application stability problems, including cascading failures, unrecoverable states, and outages.
Network problems can also be difficult to emulate in a test or lab environment. One technique for handling this is to use a network emulator in the test environment. Take two servers between which you want to test various network problems or impairments. These could be two services in a SOA environment, or a web and application server, an application server and a database server, etc. Plug one of the servers into a network emulator port. Plug the other server into another network emulator port which is tied to the first emulator port, as illustrated below:
With the network emulator in place between the two servers, run the application under load and introduce various network impairments, observing how the application behaves. The following are examples of network impairments that could be introduced:
- Network latency. Introduce latency at various levels, such as 1 ms, 10 ms, 100 ms, 1000 ms, 10000 ms. Resume normal functioning after varying lengths of time, such as 10 sec, 1 min, 10 min.
- Bandwidth throttling. Introduce throttling at various levels, such as 100 Mb/s, 10 Mb/s, 1 Mb/s, 100 kb/s. Resume normal functioning after varying lengths of time.
- Network down. Introduce 100% packet loss for varying lengths of time.
- Dropped packets. Emulate partial packet loss for varying lengths of time.
- Packet accumulation/burst. Emulate packets being held back and then released in a burst, for varying lengths of time.
- Other network impairments
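To keep the test runs systematic, the impairments above can be expanded into an explicit scenario matrix before testing starts. The following is a minimal sketch, not tied to any particular emulator: the names `Scenario`, `LEVELS`, `DURATIONS_S`, and `build_matrix` are hypothetical, and reusing the 10 sec / 1 min / 10 min durations for every impairment type is an assumption (the list only gives them for latency).

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    impairment: str   # kind of impairment, e.g. "latency"
    level: str        # severity, e.g. "100ms"
    duration_s: int   # how long to hold the impairment before restoring the network


# Levels taken from the list above.
LEVELS = {
    "latency": ["1ms", "10ms", "100ms", "1000ms", "10000ms"],
    "bandwidth": ["100Mb/s", "10Mb/s", "1Mb/s", "100kb/s"],
    "network_down": ["100% loss"],
}

# Assumption: the same hold times (10 sec, 1 min, 10 min) apply to every impairment.
DURATIONS_S = [10, 60, 600]


def build_matrix(levels=LEVELS, durations=DURATIONS_S):
    """Return one Scenario per (impairment, level, duration) combination."""
    return [
        Scenario(name, level, duration)
        for name, values in levels.items()
        for level, duration in product(values, durations)
    ]


if __name__ == "__main__":
    for s in build_matrix():
        print(f"{s.impairment:13} {s.level:10} hold {s.duration_s}s")
```

Each generated scenario would then be applied by hand or through whatever control interface the emulator offers, with the application under load for the whole run.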
With each network problem scenario, the application behavior should be carefully studied. Answers to questions such as the following should be determined:
- Does the application behave as expected under the network impairments?
- Is the application behavior appropriate? Do timeout, retry, and reconnect mechanisms work as expected?
- When the network recovers to a normal state, does the application recover, or is the application in an unrecoverable state?
- Is any manual intervention required to bring the application to a normal state?
- Do any applications or servers require restarting?
- Are appropriate messages logged?
- Are the logs flooded with excessive or repeated messages?
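Two of these questions lend themselves to simple automation: whether the application recovers once the network is restored, and whether it floods the logs in the meantime. The sketch below is one possible way to check both, under stated assumptions: `wait_for_recovery` and `spammy_messages` are hypothetical helper names, `probe` is any caller-supplied function that returns `True` when the application is healthy, and log lines are assumed to have timestamps already stripped so that repeated messages compare equal.

```python
import time
from collections import Counter


def wait_for_recovery(probe, timeout_s=60.0, interval_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True.

    Returns the seconds elapsed until recovery, or None if the
    application has not recovered within timeout_s.
    """
    start = clock()
    while True:
        try:
            if probe():
                return clock() - start
        except Exception:
            pass  # a probe that raises counts as "not recovered yet"
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


def spammy_messages(log_lines, threshold=100):
    """Return messages that occur more than `threshold` times, with their counts."""
    counts = Counter(line.strip() for line in log_lines)
    return {msg: n for msg, n in counts.items() if n > threshold}
```

A `None` result from `wait_for_recovery` after the network has been restored suggests the application is stuck in an unrecoverable state and may need manual intervention or a restart. The `clock` and `sleep` parameters exist only so the helper can be exercised without real waiting.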