
Monday, October 22, 2012

Virtualization Performance Testing


Virtualization technologies (such as VMware, Parallels, Xen, KVM, Waratek, etc.) have become very popular for good reasons, including but not limited to the following:
  • Application density can be increased
  • Server resource usage can be maximized
  • Hardware costs can be reduced many-fold
  • The number of servers required to run the same load can be reduced
  • Applications become more portable
It is critical to do performance, load, and stress testing of a virtualization solution before rolling it out for several reasons:
  • Different applications behave differently in a virtualized environment and some are more suitable than others
  • Performance or stability could be affected in a virtualized environment
  • Virtual machine sizing can drastically affect application behavior
  • Application configuration can be different when run in a virtualized environment.
This post gives a brief outline of how to do performance, load and stress testing of a virtualization solution.
  1. Include online and batch test cases as well as administrative operations
  2. Test bare metal (physical server) as a baseline to be compared to virtualization results
    1. See how many instances of the application can be run on bare metal
    2. Push bare metal up to server capacity
  3. Test virtualization
    1. Run VMs on the same bare metal server or an equivalent one
    2. Push one VM up to capacity, then two VMs, then three VMs, etc., up to the highest density desired.
    3. Capture hypervisor server resource usage metrics (cpu, memory, disk, network); see the sketch after this outline
    4. Verify that VM performance compares reasonably relative to bare metal
    5. Verify that VM capacity compares reasonably relative to bare metal
    6. Verify that VM application stability compares reasonably relative to bare metal
    7. If performance or stability is not acceptable, see if application tuning is needed or virtual machine tuning is needed.
    8. Include stability tests in which high load is applied for an extended period of time.
    9. Perform administrative operations while system is under high load.
    10. Once performance and stability are acceptable, publish the results and the tuning and configuration needed to achieve acceptable results.
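
As an illustration of the metrics capture in step 3.3, the following is a minimal sketch that samples hypervisor (or bare metal) resource usage at a fixed interval and writes it to a CSV file for later comparison against the baseline.  It assumes the psutil package is available on the host; the interval and output file name are illustrative.

```python
# Minimal sketch: sample host resource usage (CPU, memory, disk, network) at a
# fixed interval during a test run and write it to CSV for later comparison
# against the bare-metal baseline. Assumes the psutil package is installed;
# the interval and output file name are illustrative.
import csv
import time

import psutil

INTERVAL_SECONDS = 10          # sampling interval (illustrative)
OUTPUT_FILE = "hypervisor_metrics.csv"

def sample():
    """Return one row of resource usage counters."""
    cpu = psutil.cpu_percent(interval=None)
    mem = psutil.virtual_memory().percent
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    return [time.time(), cpu, mem,
            disk.read_bytes, disk.write_bytes,
            net.bytes_recv, net.bytes_sent]

with open(OUTPUT_FILE, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "cpu_pct", "mem_pct",
                     "disk_read_bytes", "disk_write_bytes",
                     "net_recv_bytes", "net_sent_bytes"])
    while True:                # stop with Ctrl-C when the test run ends
        writer.writerow(sample())
        f.flush()
        time.sleep(INTERVAL_SECONDS)
```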


Thursday, September 13, 2012

How to Test the Stability of an Application


Testing the stability of an application is critical.  It can prevent system outages by identifying problems before they occur in production.  Outages can severely damage a business, in some cases permanently.  The following outline provides a reasonable template for testing application stability.

  • Ramp load up incrementally to the breaking point of the system.  Do not stop at expected peak load, because bursts or unexpected traffic can produce load far higher than anticipated.  (A sketch of one approach appears at the end of this post.)
    • Load should cover critical dimensions such as transaction rate/throughput, connections, concurrent users, range of use cases/functionality
    • When the application breaks, investigate what broke
      • If the test infrastructure broke (test client capacity hit, test network capacity hit, test case crashed, etc.), the test infrastructure must be repaired so that the application is what breaks, not the test infrastructure.
      • If the application broke, diagnose the type of breakage and what broke.
      • Is breakage recoverable?
      • Does breakage affect already connected users, or just block new users?
      • Did the application code break (errors, deadlocks, thread blocking, etc.)?
      • Was a system resource limit hit (cpu, memory, network, disk)?
      • If system resource limits were not hit, does the application need to be fixed so that it is not the bottleneck?  The system should scale up until a system limit is hit, whether CPU, memory, disk I/O, or network bandwidth.
      • Did a downstream service break?
        • How can the downstream service be improved to provide more capacity and stability?
      • Did the system just slow down, remaining functional?
      • Is a restart required, and what must be restarted (services, server, downstream services, etc.)? 
      • Can the system be scaled out or scaled up to improve the capacity?  
        • If not, why not?  Is there an architectural limitation preventing further scalability?  How can scalability be improved?
    • From the test determine the peak capacity of the application and verify that proper production monitoring is in place to detect this threshold.
  • Run at near peak capacity for an extended period of time (this could be one day or more depending on uptime requirements)
    • Is the application stable when run for a long time or does it eventually crash?
      • Why does it crash?
    • Does performance degrade over time?
      • Why does it degrade?
  • Perform administrative operations that may need to be performed during production usage while system is near peak load.
    • Is the system stable when this happens?
  • Perform the full suite of functional tests while the system is near peak load.
    • Is the system stable when this happens?

Document the results of the test carefully.  Do not ignore crashes and instability.  Spend the time and effort to understand the behavior and harden the application to behave well under any conditions, anticipated or not.
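
The following is a minimal sketch of the incremental ramp described at the start of the outline above: concurrency is doubled each step until the error rate or response time crosses a threshold.  The target URL, thresholds, and request counts are illustrative, and the requests package is assumed; a real test would exercise the application's actual use cases.

```python
# Minimal sketch of an incremental ramp to the breaking point: double the number
# of concurrent workers each step and stop when the error rate or response time
# crosses a threshold. The target URL, thresholds, and request counts are
# illustrative placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

TARGET_URL = "http://app-under-test.example.com/api/search"  # hypothetical endpoint
REQUESTS_PER_WORKER = 200
MAX_ERROR_RATE = 0.01        # 1% errors considered "broken" (illustrative)
MAX_P95_SECONDS = 2.0        # response time ceiling (illustrative)

def worker():
    """Issue a batch of requests, returning (latencies, error_count)."""
    latencies, errors = [], 0
    for _ in range(REQUESTS_PER_WORKER):
        start = time.time()
        try:
            r = requests.get(TARGET_URL, timeout=10)
            if r.status_code >= 500:
                errors += 1
        except requests.RequestException:
            errors += 1
        latencies.append(time.time() - start)
    return latencies, errors

concurrency = 1
while True:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: worker(), range(concurrency)))
    latencies = sorted(l for lats, _ in results for l in lats)
    errors = sum(e for _, e in results)
    error_rate = errors / (concurrency * REQUESTS_PER_WORKER)
    p95 = latencies[int(len(latencies) * 0.95)]
    print(f"concurrency={concurrency} p95={p95:.3f}s error_rate={error_rate:.3%}")
    if error_rate > MAX_ERROR_RATE or p95 > MAX_P95_SECONDS:
        print("Breaking point reached; investigate what broke.")
        break
    concurrency *= 2
```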

Monday, August 27, 2012

Performance Testing ROI and Black Swans


What return on investment/ROI can be expected from performance testing?  There are two categories of "return" for performance and stability testing.

One category of return is a simple, quantifiable return in which performance testing results allow the amount of hardware required to be reduced in a couple of ways:

  • Application tuning and optimization allowing the same amount of hardware to handle additional traffic
  • Proof of hardware overcapacity, showing that a reduced amount of hardware would be sufficient to handle peak traffic, allowing hardware inventory to be reduced  

The second category of return is not a return so much as an insurance policy, or protection against a fat tail or black swan event.  In this case, you are protecting yourself against system stability bugs whose consequences can be absolutely catastrophic, up to the complete destruction of the business.

Consider the example of Knight Capital's botched software rollout on August 1st, 2012.  The rollout triggered unexpected trades that cost the firm $440 million, depleting its operating capital.  This weakened position allowed outside investors to take a controlling 70% stake in the company as terms of its bailout.

Even with performance and stability testing, catastrophic stability bugs may be missed due to limitations of the test environment or test data, or a failure to anticipate the production behavior that leads to the disaster.  However, if a catastrophic stability bug is discovered and fixed, which happens regularly, a black swan event has been quietly and successfully sidestepped.  The value is avoiding a devastating black swan event that brings down the company through loss of customers, loss of reputation, lawsuits, takeovers, etc., not to mention more individual losses such as loss of job, bonus, promotion, or career.  The cost of the testing is fixed and predictable: simply the cost of supporting ongoing performance testing, in effect the insurance premiums protecting against the black swan.


Monday, February 6, 2012

How to Performance Test in a Service-Oriented Architecture

How do you performance test, stress test, and load test in the world of service-oriented architecture (SOA)?

The answer is that you test it at different levels of granularity, typically three levels.  One obvious level is the service.  Another level is end to end.  The third level is the module level, the low-level building blocks making up the service.

One necessary precondition for adequate performance testing of services is proper instrumentation providing key performance metrics.  Response times, transaction rates, and success/failure counts must be available for all service entry points and downstream calls.  This allows response time contributions to be allocated accurately to the proper services and allows performance problems to be debugged quickly.
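
For example, a minimal sketch of this kind of instrumentation is shown below.  The in-memory metric store and the decorated downstream call are purely illustrative; in practice this role is usually filled by a metrics library or an APM agent.

```python
# Minimal sketch of service instrumentation: a decorator that records response
# time and success/failure counts per entry point or downstream call. The
# in-memory dictionary is purely illustrative.
import time
from collections import defaultdict
from functools import wraps

metrics = defaultdict(lambda: {"count": 0, "errors": 0, "total_time": 0.0})

def instrumented(name):
    """Wrap a service entry point or downstream call with timing and counters."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            except Exception:
                metrics[name]["errors"] += 1
                raise
            finally:
                metrics[name]["count"] += 1
                metrics[name]["total_time"] += time.time() - start
        return wrapper
    return decorator

@instrumented("inventory_service.lookup")   # hypothetical downstream call
def lookup_inventory(sku):
    ...  # call the downstream service here

# After a test run, metrics[...] holds call counts, error counts, and total time,
# from which average response time and failure rate per call can be derived.
```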

The three layers of SOA performance testing share the following in common.

  • Scalability Testing
    • What is capacity?
    • What bottleneck is limiting capacity?
    • What is response time at various loads?
    • What is the canonical performance chart?



  • Stability Testing
    • Is the application stable?
    • Is the application fault tolerant?


  • Performance Regression Testing
    • Does performance degrade from build to build?
    • Does server resource usage increase from build to build?


  • Metrics on server resource usage
    • CPU
    • Memory
    • Network
    • Disk


Module level SOA performance testing is done as follows:

  • Test key functional code paths at module level
  • Use multi-threaded, concurrent execution
  • Run within unit test framework
  • Run within continuous integration framework
  • Run frequently, each check-in, build, or version
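
As an illustration, the following is a minimal sketch of a module-level performance test that runs inside a unit test framework and can be executed in continuous integration.  The module under test (myservice.pricing.compute_price), the thread count, the iteration count, and the throughput floor are all hypothetical placeholders.

```python
# Minimal sketch of a module-level performance test run inside a unit test
# framework with multi-threaded, concurrent execution. All names and limits
# are illustrative.
import time
import unittest
from concurrent.futures import ThreadPoolExecutor

from myservice.pricing import compute_price   # hypothetical module under test

THREADS = 8
ITERATIONS_PER_THREAD = 1_000
MIN_OPS_PER_SECOND = 5_000                     # illustrative regression floor

class PricingPerformanceTest(unittest.TestCase):
    def test_concurrent_throughput(self):
        def run_batch():
            for _ in range(ITERATIONS_PER_THREAD):
                compute_price(sku="ABC-123", quantity=3)
        start = time.time()
        with ThreadPoolExecutor(max_workers=THREADS) as pool:
            list(pool.map(lambda _: run_batch(), range(THREADS)))
        elapsed = time.time() - start
        ops_per_second = (THREADS * ITERATIONS_PER_THREAD) / elapsed
        # Fail the build if throughput drops below the agreed floor.
        self.assertGreater(ops_per_second, MIN_OPS_PER_SECOND)

if __name__ == "__main__":
    unittest.main()
```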


Service level SOA performance testing is done as follows:

  • Test through public entry point
  • Isolate service under test from other services
  • Use spoofing or stubbing of backend services, mimicking their response time behavior (see the sketch after this list)
  • Determine response time and availability service level agreements (SLAs) based on test results
  • Thoroughly test the clustering or load balancing mechanism used to scale the service out horizontally.
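
The backend stub mentioned above can be very simple.  The following is a minimal sketch of a stub that returns a canned response after an artificial delay tuned to match the measured response time of the real backend; the port, latency, and payload are illustrative.

```python
# Minimal sketch of a backend stub that mimics the response time of a real
# downstream service so the service under test can be isolated. The latency
# and canned payload are illustrative and would be tuned to match measurements
# of the real backend.
import json
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

SIMULATED_LATENCY_SECONDS = 0.050   # match the real backend's typical response time
CANNED_RESPONSE = {"accountId": "12345", "status": "ACTIVE"}  # hypothetical payload

class StubHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(SIMULATED_LATENCY_SECONDS)          # emulate backend latency
        body = json.dumps(CANNED_RESPONSE).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass    # keep the stub quiet under load

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8081), StubHandler).serve_forever()
```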


End to end level SOA performance testing is done as follows:

  • Test public entry point into the application
  • Verify that bottlenecks hit are consistent with capacity of individual services discovered in service-level testing.
  • Verify fault tolerance of unavailable downstream services


An additional layer is infrastructure testing.  This could include messaging infrastructure, caching infrastructure, storage infrastructure, database infrastructure, etc.  In some cases, key infrastructure should be tested directly to ensure that it behaves and scales as expected.

SOA performance testing can be summarized in the following conceptual chart:

Friday, January 27, 2012

How to Test Software Resiliency to Network Problems Using a Network Emulator


Software applications can be destabilized by many factors that are difficult to cover in a test or lab environment.  These include timing-related issues, unanticipated bursts, cascading effects, unexpected administrative or batch processes, and integration complexities.  Some of the most nefarious factors that can destabilize applications are network problems.  Network infrastructure is often a complex black box and can perform inconsistently for various reasons.  Network problems can cause serious application stability problems including cascading failures, unrecoverable states, and outages.

Network problems can also be difficult to emulate in a test or lab environment.  One technique for handling this is to use a network emulator in the test environment.  Take two servers between which you want to test various network problems or impairments.  These could be two services in a SOA environment, or a web and application server, an application server and a database server, etc.  Plug one of the servers into a network emulator port.  Plug the other server into another network emulator port which is tied to the first emulator port, as illustrated below:

With the network emulator in place between the two servers, run the application under load, and introduce various network impairments, observing how the application behaves.  The following is an example of network impairments that could be introduced:
  • Network latency.  Introduce latency at various levels, such as 1ms, 10ms, 100ms, 1000ms, 10000ms.  Resume normal functioning after varying lengths of time, such as 10 sec, 1 min, 10 min.

  • Bandwidth throttling.  Introduce throttling at various levels, such as 100 Mb/s, 10 Mb/s, 1 Mb/s, 100 Kb/s.  Resume normal functioning after varying lengths of time.

  • Network down.  Introduce 100% packet loss for varying lengths of time.
  • Emulate dropped packets for varying lengths of time.

  • Emulate packet accumulation/burst for varying lengths of time.
  • Other network impairments
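
If a dedicated network emulator appliance is not available, many of these impairments can be emulated in software on a Linux host using tc/netem.  The following is a minimal sketch, assuming root access and that eth0 is the interface carrying traffic between the two servers; the delay, loss, and duration values mirror the examples above.

```python
# Minimal sketch: introduce network impairments between two servers using
# Linux tc/netem, as a software alternative to a hardware network emulator.
# Assumes root access; eth0 and the durations are illustrative.
import subprocess
import time

INTERFACE = "eth0"   # interface carrying traffic between the two servers

def impair(*netem_args):
    """Apply a netem impairment to the interface."""
    subprocess.run(["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem",
                    *netem_args], check=True)

def clear():
    """Remove the netem impairment, restoring normal networking."""
    subprocess.run(["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"],
                   check=True)

# Example: 100 ms of added latency for 10 minutes, then recover.
impair("delay", "100ms")
time.sleep(600)
clear()

# Example: network down (100% packet loss) for 1 minute, then recover.
impair("loss", "100%")
time.sleep(60)
clear()
```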


With each network problem scenario, the application behavior should be carefully studied.  Answers to questions such as the following should be determined:
  • Does the application behave as expected under the network impairments?
  • Is the application behavior appropriate?  Are timeout, retry, and reconnect mechanisms functioning as expected?
  • When the network recovers to a normal state, does the application recover, or is the application in an unrecoverable state?
  • Is any manual intervention required to bring the application to a normal state?
  • Do any applications or servers require restarting?
  • Are appropriate messages logged?
  • Does excessive message spamming occur?

 Testing network problems in the lab provides an extra measure of security and could be well worth the time, expense and effort.  If network problems still destabilize the application after doing this type of network problem testing, the test suite should be enhanced to cover the type of scenario that was missed.


Wednesday, November 2, 2011

Extending the Load Test Plan

The previous post covered the minimal plan for load testing, which included scalability testing and stability testing.  This post covers additional testing needed to ensure a stable, scalable, and well-performing application, specifically:

  1. Performance regression
  2. Fault tolerance
  3. Horizontal scalability


1. Performance Regression

A performance regression test consists of running the same performance test on the prior version of the application and then on the current version of the application using identical or at least equivalent hardware.  This will show whether performance has degraded in the current release versus the previous release.  The test should be run under load to include the impact of any concurrency or other load-related issues affecting performance.  The load could be at various levels, i.e., running the vertical scalability test already discussed on each version of the application.  Or, if a single load is used for performance regression, the load should be selected at slightly less than peak capacity.

Ideally, such a performance regression test or standard performance benchmark would be run on a variety of versions of the application over time which will provide a performance trend.  This will show whether performance is slowly degrading or improving over time.

In addition to response time regressions, look for regressions in server resource usage.  You want to know whether the application or client is burning more CPU to do the same amount of work, or whether it needs more memory or network resources.

If there is a performance regression, it should be investigated carefully and fixed if possible. 
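
A minimal sketch of an automated regression check is shown below.  It assumes each build's benchmark run produces a small JSON file of metrics; the file names, metric names, and the 10% tolerance are illustrative.

```python
# Minimal sketch of an automated regression check comparing benchmark results
# from the previous and current builds. The JSON file names, metric names, and
# the 10% tolerance are illustrative; the metrics would come from whatever
# harness produced the benchmark runs.
import json
import sys

TOLERANCE = 0.10    # flag regressions larger than 10% (illustrative)

def load(path):
    with open(path) as f:
        return json.load(f)   # e.g. {"p95_ms": 42.1, "throughput_tps": 4400, "cpu_pct": 61}

baseline = load("baseline_build.json")   # prior version of the application
current = load("current_build.json")     # current version of the application

failed = False
for metric, old in baseline.items():
    new = current[metric]
    # For throughput, lower is worse; for response time and resource usage, higher is worse.
    worse = (new < old * (1 - TOLERANCE)) if "throughput" in metric \
            else (new > old * (1 + TOLERANCE))
    status = "REGRESSION" if worse else "ok"
    print(f"{metric}: {old} -> {new} [{status}]")
    failed = failed or worse

sys.exit(1 if failed else 0)    # fail the pipeline on regression
```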


2. Fault Tolerance


Fault tolerance testing involves running various negative or destructive tests while the application is under load.  These could include the following:   

  • Bringing a downstream system down under load (stopping a downstream database or downstream webservice)
  • Slowing down a downstream service under load.
  • Applying a sudden heavy burst of traffic under load.
  • Triggering error scenarios under load.
  • Dropping network connections under load (using a tool such as TCPView).
  • Bouncing the application under load.
  • Failing over to another server under load.
  • Impairing the network (reducing bandwidth, dropping packets, etc.)
The behavior of the application is observed in each test:

  • Does the application recover automatically?  
  • Does it crash?  
  • Does it cause a cascading effect, affecting other systems? 
  • Does it enter into a degraded state and never recover?  
  • Can the event be monitored and alerted on with available tools?  
  • Are appropriate events logged?

In each case, the behavior could be as designed, or it could be unexpected and be a scenario that must be fixed prior to production deployment.  This type of testing can find issues that would otherwise not be found in the test lab and can greatly improve the stability of the application.



3. Horizontal Scalability

Vertical scalability of the application on a single server has already been tested and bugs fixed, allowing the application to scale up and use a majority of the resources of a single server.  A horizontal scalability test addresses the question of whether adding additional servers running the application to the cluster allows the application to scale further.  Once one server is nearing peak capacity, can another server be added to the cluster to double capacity?  How far out does this scale?  Does adding a second server double capacity, with no further gains beyond that level?

The simplest way to test this is to literally expand the cluster one server at a time and test peak capacity with each addition.  However, at some point this may be infeasible due to a lack of test servers.  In that case, one strategy might be to focus on individual downstream applications and verify that they can scale to the required level.  For example, a downstream database could be load tested directly to see how many queries per second it can support.  This provides a cap on the scalability of the upstream application.  The same thing can be done with a downstream web service.

Any capacity ceilings hit should be studied so that it is understood what is limiting further horizontal scalability, whether it is network bandwidth, capacity of a database, an architectural flaw in the application, etc.

The load balancing mechanism should also be examined carefully to make sure it does not become a bottleneck, particularly if it has not been used previously or if load is to be increased substantially.

Another possibility might be to deploy the application to the cloud and scale out using large numbers of virtual servers.

Monday, October 31, 2011

A Minimal Load Test Plan

What is a minimal load/stress test plan for a new service or application?  A minimal plan covers two scenarios:
  1. Scalability Test
  2. Stability Test
1. Scalability

Vertical scalability is a measure of how effectively an application can handle increasing amounts of load on a single server.   Ideally, the application can handle increasing amounts of load without significant degradation in response time until reaching some server resource limit such as CPU limits or network adaptor bandwidth limits. The results of a scalability test can be presented in a chart such as the following:
The chart shows, for each of 8 tested load levels, the response time and the transaction rate of the application.  In this case each load level is a number of concurrent requests in increments of one.  In other cases other increments may be appropriate (such as increments of 10) and other measures of load may be appropriate (message size, etc.).  Load should be driven high enough that throughput levels off.  Response time will ideally remain flat as load increases, eventually turning a knee or corner and heading upwards as capacity is reached.  In this case, the application scales nearly perfectly up to 5 concurrent requests, then begins to degrade, with peak throughput around 4,500 queries per second.
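
A minimal sketch of the kind of concurrency sweep that produces the data behind such a chart is shown below.  The target URL, the duration per load level, and the use of a simple closed loop (and the requests package) are illustrative; a real test would typically use a dedicated load testing tool.

```python
# Minimal sketch of a concurrency sweep: run a fixed-length closed loop at each
# load level and record average response time and throughput. The target URL
# and durations are illustrative.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

TARGET_URL = "http://app-under-test.example.com/query"   # hypothetical endpoint
SECONDS_PER_LEVEL = 60
LOAD_LEVELS = range(1, 9)    # 1 to 8 concurrent requests, in increments of one

def closed_loop(stop_at):
    """One virtual user: issue requests back-to-back until the level ends."""
    count, total_latency = 0, 0.0
    while time.time() < stop_at:
        start = time.time()
        requests.get(TARGET_URL, timeout=30)
        total_latency += time.time() - start
        count += 1
    return count, total_latency

print("concurrency, avg_response_ms, throughput_per_sec")
for level in LOAD_LEVELS:
    stop_at = time.time() + SECONDS_PER_LEVEL
    with ThreadPoolExecutor(max_workers=level) as pool:
        results = list(pool.map(lambda _: closed_loop(stop_at), range(level)))
    count = sum(c for c, _ in results)
    avg_ms = 1000 * sum(t for _, t in results) / count
    print(f"{level}, {avg_ms:.1f}, {count / SECONDS_PER_LEVEL:.0f}")
```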

The scalability chart provides a large amount of information that can be used for capacity planning, production configuration, etc.  It shows what response times are under typical loads.  It shows the throughput capacity of a single server running the application, and it shows the behavior as capacity is exceeded.

As part of a scalability test, metrics showing server resource usage at each load level should be captured, such as CPU usage, network usage, disk usage, and memory usage.  Logs and errors should be captured.  Similar information should be captured on any downstream systems involved in the test if any, such as databases or services.

Part of the test analysis should involve bottleneck analysis, which is analyzing and determining what is limiting the capacity of the application, what is limiting it to 4,500 queries per second.  This could be server resource usage (hitting CPU, network or disk limits), it could be an increase in response time of a downstream database or server as load increases, it could be contention within the application such as thread blocking or error paths hit at high loads, etc.

An appropriate environment for a scalability test would involve two servers, one for the application and one to act as the client driving the load:


Server resource usage on the load generator should be monitored as well to verify that it is not the bottleneck.

This type of vertical scalability test does not guarantee that the application will scale out horizontally for two reasons: (1) there may be downstream systems such as databases that become bottlenecks at higher loads, and (2) load balancing or clustering systems may not scale as expected.  More extended testing beyond a minimal plan would have to cover these factors as well.

For the vertical scalability test, it is important to drive the application up to peak capacity, regardless of what expected load may be.  Usage may be different than expected, spikes in load may occur, business might grow, etc.  Some performance or stability problems only manifest themselves at higher loads, and it is important to identify these even if production load is expected to be much lower.

2. Stability

The second part of the minimal load test plan is the stability test.  To verify stability, a high load should be run against the application for an extended period of time, at a minimum 24 hours and ideally for days or weeks.  A high load can be determined from the results of the scalability test, just below peak capacity, just below the point at which response time takes a turn for the worse.  In the example above, a load of 5 concurrent requests could be used, assuming those are the final results following resolution of performance bottlenecks.

During the run, server resource usage should be captured and monitored, and error logs monitored, as with the scalability test.  Trends should be watched closely.  Does response time degrade over time?  That may indicate a resource leak.  Does CPU usage increase over time for the same load?  That may point to a design or implementation problem.  Does memory leak?  Do errors begin to occur at some point or occur in some pattern?  Does the application eventually crash?

Beyond the Minimal Plan

Beyond the minimal plan, other tests are required to ensure the application is completely performant and stable.  These will be covered later (http://sub-second.blogspot.com/2011/11/extending-load-test-plan.html) and include:
  1. Performance regression
  2. Horizontal scalability
  3. Fault tolerance