
Thursday, September 13, 2012

How to Test the Stability of an Application


Testing the stability of an application is critical.  It can prevent system outages by identifying problems before they occur in production.  Outages can severely damage a business, in some cases permanently.  The following outline provides a reasonable template for testing application stability.

  • Ramp load up incrementally to the breaking point of the system.  Do not stop at expected peak load because bursts or unexpected traffic can entail load far higher than anticipated.
    • Load should cover critical dimensions such as transaction rate/throughput, connections, concurrent users, and the range of use cases/functionality.
    • When the application breaks, investigate what broke
      • If the test infrastructure broke (test client capacity hit, test network capacity hit, test case crashed, etc.), the test infrastructure must be repaired so that the application is what breaks, not the test infrastructure.
      • If the application broke, diagnose the type of breakage and what broke.
      • Is the breakage recoverable?
      • Does the breakage affect already connected users, or just block new users?
      • Did the application code break (errors, deadlocks, thread blocking, etc.)?
      • Was a system resource limit hit (cpu, memory, network, disk)?
      • If system resource limits were not hit, does the application need to be fixed so that it is not the bottleneck?  The system should scale up until a system limit is hit, whether CPU, memory, disk I/O, or network bandwidth.
      • Did a downstream service break?
        • How can the downstream service be improved to provide more capacity and stability?
      • Did the system just slow down, remaining functional?
      • Is a restart required, and what must be restarted (services, server, downstream services, etc.)? 
      • Can the system be scaled out or scaled up to improve the capacity?  
        • If not, why not?  Is there an architectural limitation preventing further scalability?  How can scalability be improved?
    • From the test, determine the peak capacity of the application and verify that proper production monitoring is in place to detect when load approaches this threshold.
  • Run at near peak capacity for an extended period of time (this could be one day or more depending on uptime requirements).
    • Is the application stable when run for a long time or does it eventually crash?
      • Why does it crash?
    • Does performance degrade over time?
      • Why does it degrade?
  • Perform administrative operations that may be needed in production while the system is near peak load.
    • Is the system stable when this happens?
  • Perform the full suite of functional tests while the system is near peak load.
    • Is the system stable when this happens?
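
The incremental ramp in the first step can be sketched as a small shell loop. This is only an illustration under stated assumptions: ramp_load is a hypothetical helper, and the command it runs (whatever drives one load step and exits non-zero on breakage) is not part of the original outline and has to be supplied by the test infrastructure.

```shell
#!/bin/sh
# ramp_load START STEP MAX CMD [ARGS...]   (hypothetical helper, not a standard tool)
# Runs CMD with the current load level appended as its last argument,
# raising the level by STEP until CMD fails or the level exceeds MAX.
# Progress goes to stderr; the highest level that passed goes to stdout
# (0 if even START failed).
ramp_load() {
    level=$1 step=$2 max=$3
    shift 3
    last_ok=0
    while [ "$level" -le "$max" ]; do
        echo "testing load level $level" >&2
        if "$@" "$level"; then
            last_ok=$level
        else
            echo "breakage at load level $level" >&2
            break
        fi
        level=$((level + step))
    done
    echo "$last_ok"
}
```

A run such as `ramp_load 100 100 5000 ./run_step.sh` (where run_step.sh is a placeholder for the real load driver) would report the last level that passed.  If the loop reaches MAX without any breakage, MAX was set too low: raise it and run again, since the goal is to find the breaking point rather than to confirm the expected peak.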

Document the results of the test carefully.  Do not ignore crashes and instability.  Spend the time and effort to understand the behavior and harden the application to behave well under any conditions, anticipated or not.

Wednesday, May 23, 2012

Monitoring Linux Server Usage With Sar

A simple way to monitor server resource usage is with sar, which is part of the sysstat package.  The following shell script, sar.sh, samples cpu, memory, network, and disk every 10 seconds and writes each to a separate log file that can be imported into a spreadsheet for charting.

Script


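#!/bin/sh
# Run sar every 10 seconds until stopped.
# Each sar runs in the background and writes to its own log file.

# cpu utilization
sar -u 10 > sar.cpu.log &
# memory utilization (free and used)
sar -r 10 > sar.freememory.log &
# disk I/O totals across all devices
sar -b 10 > sar.disk.log &
# network by device
#    - Note that you need to filter by the adaptor in use (eth1 here).
#    - Run "sar -n DEV 10" to see which adaptor is being used.
#    - --line-buffered (GNU grep) writes each matching line immediately
#      instead of holding it in a buffer while output goes to a file.
sar -n DEV 10 | grep --line-buffered eth1 > sar.network.log &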
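
The log files are column-aligned with spaces rather than comma-separated, so a small conversion step can make the spreadsheet import more predictable.  One possible sketch follows; the function name sar_to_csv is illustrative, and it assumes 12-hour timestamps (time followed by AM or PM) as in the sample output on this page.

```shell
#!/bin/sh
# sar_to_csv [FILE...]   (illustrative helper; reads stdin when no file is given)
# Keeps only rows that start with a clock time and AM/PM, which drops the
# "Linux ..." banner, blank lines, and "Average:" rows. The timestamp is
# kept as one CSV field and every remaining column becomes its own field.
# Column header rows also start with a timestamp, so they are kept, and sar
# may repeat them periodically in a long log.
sar_to_csv() {
    awk '
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && ($2 == "AM" || $2 == "PM") {
            line = $1 " " $2
            for (i = 3; i <= NF; i++) line = line "," $i
            print line
        }
    ' "$@"
}
```

For example, `sar_to_csv sar.cpu.log > sar.cpu.csv` produces a file that a spreadsheet can open directly.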

Output

The cpu log file shows user and system CPU % utilization (the sample output here and below was captured at a 60-second interval, as the timestamps show):

03:07:55 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
03:08:55 PM     all     73.99      0.00      2.43      0.21      0.00     23.37
03:09:55 PM     all     81.79      0.00      2.67      0.21      0.00     15.34
03:10:55 PM     all     82.29      0.00      2.68      0.17      0.00     14.86
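
When looking for the point where a CPU limit was hit, it can help to list only the samples above a chosen busy level.  A minimal sketch, assuming the column layout shown above (%idle as the last column, "all" in the CPU column); cpu_busy_over is a hypothetical helper name and the limit is whatever threshold the test calls for.

```shell
#!/bin/sh
# cpu_busy_over LIMIT [FILE...]   (hypothetical helper; reads stdin when no file is given)
# Prints the timestamp and busy percentage (100 - %idle) of every sample
# whose busy percentage is greater than LIMIT.
cpu_busy_over() {
    limit=$1
    shift
    awk -v limit="$limit" '
        # Data rows have "all" as the third field; header and Average rows do not.
        $3 == "all" && $NF ~ /^[0-9.]+$/ {
            busy = 100 - $NF
            if (busy > limit) printf "%s %s %.2f\n", $1, $2, busy
        }
    ' "$@"
}
```

Against the three samples above, `cpu_busy_over 85 sar.cpu.log` would print only the 03:10:55 PM sample, whose %idle of 14.86 corresponds to 85.14% busy.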

The free memory log file shows how much memory is free and used:

03:07:55 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
03:08:55 PM 110106128  88246712     44.49    356468  42850352  30363972      7.61
03:09:55 PM 110053452  88299388     44.52    356472  42879192  30371420      7.61
03:10:55 PM 109989584  88363256     44.55    356484  42914152  30372688      7.61
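
In this output format, kbmemused is total memory minus free memory (44.49% is 88246712 / (110106128 + 88246712)), so it includes buffers and page cache that the kernel can reclaim.  A rough estimate of memory held by applications subtracts those two columns.  The sketch below assumes exactly the column order shown above; newer sysstat releases use a different set of columns, so the field numbers would need adjusting there.  The name mem_excluding_cache is illustrative.

```shell
#!/bin/sh
# mem_excluding_cache [FILE...]   (illustrative helper; reads stdin when no file is given)
# Assumed columns: time, AM/PM, kbmemfree, kbmemused, %memused, kbbuffers, kbcached, ...
# Prints the timestamp and kbmemused - kbbuffers - kbcached in kB.
mem_excluding_cache() {
    awk '
        # Only rows with a numeric kbmemused column are data rows.
        ($2 == "AM" || $2 == "PM") && $4 ~ /^[0-9]+$/ {
            printf "%s %s %.0f\n", $1, $2, $4 - $6 - $7
        }
    ' "$@"
}
```

For the first sample above this gives 88246712 - 356468 - 42850352 = 45039892 kB, about half of the reported kbmemused.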

The disk log file shows total, read, and write transfers per second, and the number of blocks read and written per second (a block here is a 512-byte sector):

03:07:55 PM       tps      rtps      wtps   bread/s   bwrtn/s
03:08:55 PM   7889.59      0.00   7889.59      0.00  58582.09
03:09:55 PM   8454.59      0.00   8454.59      0.00  62458.76
03:10:55 PM   8456.30      0.00   8456.30      0.00  62645.15
03:11:55 PM   7257.61      0.00   7257.61      0.00  57384.76
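
sar -b reports bread/s and bwrtn/s in 512-byte blocks per second, which is awkward to compare against a disk's rated throughput.  The following sketch converts both columns to MiB/s, assuming the column order shown above; disk_mib_per_sec is an illustrative name.

```shell
#!/bin/sh
# disk_mib_per_sec [FILE...]   (illustrative helper; reads stdin when no file is given)
# Assumed columns: time, AM/PM, tps, rtps, wtps, bread/s, bwrtn/s.
# Converts blocks per second to MiB per second (blocks * 512 / 1048576).
disk_mib_per_sec() {
    awk '
        # Only rows with a numeric tps column are data rows.
        ($2 == "AM" || $2 == "PM") && $3 ~ /^[0-9.]+$/ {
            printf "%s %s read=%.2f write=%.2f\n", $1, $2, $6 * 512 / 1048576, $7 * 512 / 1048576
        }
    ' "$@"
}
```

The first sample above, 58582.09 blocks written per second, works out to roughly 28.60 MiB/s.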

The network log file shows packets received and transmitted per second and kilobytes received and transmitted per second:

03:00:01 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
03:08:55 PM      eth1   3285.46   2965.12    956.75   1824.97      0.00      0.00      1.05
03:09:55 PM      eth1   3640.33   3307.06   1053.38   2074.92      0.00      0.00      1.14
03:10:55 PM      eth1   3617.67   3283.23   1047.62   2061.22      0.00      0.00      1.65
03:11:55 PM      eth1   2917.34   2657.74    842.35   1686.10      0.00      0.00      1.38
03:12:55 PM      eth1   3859.74   3502.98   1119.06   2194.43      0.00      0.00      1.15
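
Network link capacity is usually quoted in megabits per second, so converting rxkB/s and txkB/s makes it easier to judge how close the adaptor is to its limit.  This is a sketch under two assumptions: the column order matches the output above, and a kB in sar output is 1024 bytes.  The name net_mbit_per_sec is illustrative.

```shell
#!/bin/sh
# net_mbit_per_sec [FILE...]   (illustrative helper; reads stdin when no file is given)
# Assumed columns: time, AM/PM, IFACE, rxpck/s, txpck/s, rxkB/s, txkB/s, ...
# Converts kB/s to Mbit/s, assuming 1 kB = 1024 bytes: kB * 1024 * 8 / 1000000.
net_mbit_per_sec() {
    awk '
        # Only rows with a numeric rxpck/s column are data rows.
        ($2 == "AM" || $2 == "PM") && $4 ~ /^[0-9.]+$/ {
            printf "%s %s %s rx=%.2f tx=%.2f\n", $1, $2, $3, $6 * 8192 / 1000000, $7 * 8192 / 1000000
        }
    ' "$@"
}
```

The first eth1 sample above comes to about 7.84 Mbit/s received and 14.95 Mbit/s transmitted, far below the capacity of a gigabit link.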