Oracle Stress Testing Tips


A case study by Donald Burleson, updated 24 August 2015

Oracle databases do not run in a vacuum, and it is the job of the DBA to know the thresholds at which their Oracle database becomes dependent on external factors such as RAM, CPU and network latency.

Also see my notes on Oracle benchmark testing.

A valid benchmark stress test will burden the database with transactions until we see a pronounced drop-off in response time at the peak transaction rate, measured in transactions per second (TPS). This stress test will also reveal the system bottleneck that emerges under heavy load:

 

A typical Oracle response-time drop-off

In this example we see a dramatic increase in response time as we pass 11 transactions per second.  Once we have "broken" the system (by exceeding the maximum transaction rate), we then correlate the increase with various internal Oracle metrics (wait events from ASH, AWR reports) and external metrics (CPU, RAM, disk and network enqueues).
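The drop-off point in a response-time curve like this can be located programmatically once the benchmark data is collected. Here is a minimal sketch (the function name, threshold factor and sample data are illustrative, not from the original benchmark):

```python
def find_dropoff(samples, factor=2.0):
    """Return the first load level where response time jumps to at
    least `factor` times the previous sample's response time.
    `samples` is a list of (load_tps, response_ms) tuples, sorted by load."""
    for (l0, r0), (l1, r1) in zip(samples, samples[1:]):
        if r0 > 0 and (r1 / r0) >= factor:
            return l1   # first load level past the knee
    return None

# Synthetic curve: flat until ~11 TPS, then response time explodes
curve = [(5, 40), (8, 45), (10, 50), (11, 55), (12, 160), (14, 500)]
print(find_dropoff(curve))   # -> 12
```

In practice you would feed in the (TPS, response-time) pairs captured at each load step, then pull the ASH/AWR and OS metrics for the interval surrounding the returned load level.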

A Methodology for Oracle Server Stress Testing

The goal of an Oracle server stress test is to place an increasing load on the Oracle instance until the instance becomes dependent on an external resource.  Of course, an application may have many patterns of SQL access, and the bottleneck will change over time as access patterns change, but the primary goal of server stress testing is to determine the "threshold" at which the server will "break", as evidenced by hardware-related enqueues and/or internal Oracle bottlenecks.

This is an actual case study using a pair of Oracle servers with the following resources:

  • Servers - Two Oracle servers connected via Oracle Streams n-way replication
  • CPUs - Dual Intel 64-bit processors
  • RAM - 16 gigabytes of RAM
  • Disk - Six 72-gigabyte disks using RAID 10
  • Network - TCP/IP using Oracle*Net with n-way Oracle Streams replication
  • Oracle Release - Oracle 10g Release 2, Enterprise Edition
  • Operating System - Red Hat Linux Advanced Server
  • Application - Oracle Application Express (formerly HTML-DB)
  • Client Software - Microsoft Internet Explorer

Figure 1 shows a model of the Oracle server architecture:

 

Figure 1 - The Oracle server architecture

 

This system architecture provides Oracle Streams cross-replication between two production servers in widely separated locations (> 100 miles apart) with high-speed interconnects.  This approach has the dual benefit of providing Oracle disaster recovery while allowing both servers to process transactions.  Since all transactions run through APEX, redirecting traffic is as simple as changing the domain name in the requesting URL.

Should one server location fail, end users simply click into the surviving server at the other location and continue processing as normal.

The Oracle Stress Testing Steps

To be realistic, the transaction load should closely parallel real-world transactions, and this benchmark will use the "Goal" test framework from Benchmark Factory.  There are several phases in preparation for the benchmark:

  1. SQL Extraction - We extract a representative sample of actual SQL and DML from the production database.
  2. SQL Preparation - We create PL/SQL wrappers around the SQL to pass in valid values for the host variables, using data from the application tables.
  3. Load Preparation - Benchmark Factory allows us to submit up to 2,000 transaction threads per TNS client; however, the practical limit is the network bandwidth of the client (for example, 60 users tends to saturate a 100 Mbit line).  We will configure two clients, installing the BF agent software and TNS, in order to set up a load of 200 transaction threads.
  4. Test Execution - We then start the server data capture (vmstat, iostat) and take an AWR snapshot.  After a pre-designated interval we step up the user load by 10 users at each step.
  5. Post-hoc Analysis - We plot the overall response time and correlate the drop-off with internal Oracle metrics (e.g. wait events) and external server metrics (CPU enqueues, RAM swapping).
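The step-up logic in steps 4 and 5 can be sketched as a simple driver loop. This is an illustrative harness, not Benchmark Factory code; `run_load()` stands in for whatever mechanism actually submits the transaction threads and measures TPS:

```python
def goal_test(run_load, start_users=10, step=10, max_users=200):
    """Increase the user load in steps; stop when TPS at the current
    step falls below BOTH of the two previous steps (the 'goal')."""
    history = []   # list of (users, tps)
    users = start_users
    while users <= max_users:
        tps = run_load(users)          # run one timed interval at this load
        history.append((users, tps))
        if len(history) >= 3 and tps < history[-2][1] and tps < history[-3][1]:
            break                      # TPS fell behind the previous two runs
        users += step
    return history

# Simulated workload that peaks at 60 users, mimicking the TPS curve seen later
sim = lambda u: u if u <= 60 else 120 - u
results = goal_test(sim)
print(results[-1])   # -> (80, 40): the step where the goal criterion fired
```

The real test would replace `sim` with a call that drives Benchmark Factory agents and returns the measured TPS for the interval.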

Let's take a closer look at each Oracle stress testing step.

1. SQL Extraction

To simulate a full production load, we start by extracting SQL and DML from the live production application.  You can gather the most resource-intensive SQL directly from the library cache based on any of these criteria:

  • Disk_reads - This extracts the SQL that typically exceeds the data buffer capacity, generally non-repeating data fetches that are unlikely to be cached.
  • Executions - This is a good metric because it simulates the actual load on the database, exercising the most-used transactions.
  • Buffer_gets - This measures logical I/O, and hence CPU and memory stress.

We must remember that some of the most frequently executed SQL will be against data blocks that are already cached (high consistent gets), whereas SQL with high disk_reads will generate a load that is disk I/O intensive (high physical reads).
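This buffer_gets versus disk_reads distinction can be made explicit when triaging candidate SQL. A hedged sketch, assuming the relevant v$sql columns have already been fetched into plain tuples (the 0.1 ratio is an arbitrary illustrative cutoff):

```python
def classify_sql(rows, io_ratio=0.1):
    """Split v$sql rows into CPU/memory-bound vs disk-bound candidates.
    Each row is (sql_text, executions, buffer_gets, disk_reads).
    A statement whose disk_reads exceed io_ratio * buffer_gets is
    treated as disk-intensive (its blocks are unlikely to stay cached)."""
    cached, disk_bound = [], []
    for sql_text, execs, gets, reads in rows:
        if gets and reads / gets > io_ratio:
            disk_bound.append(sql_text)
        else:
            cached.append(sql_text)
    return cached, disk_bound

rows = [
    ("select a from t1", 9000, 50000, 100),    # hot statement, fully cached
    ("select b from big", 10, 20000, 15000),   # large scans, disk-heavy
]
cached, disk = classify_sql(rows)
print(disk)   # -> ['select b from big']
```

Including a mix of both classes in the benchmark keeps the simulated load from being artificially CPU-only or I/O-only.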

If we wanted to extract the top SQL by executions we could issue the following command:

set lines 2000;
select sql_text from v$sql order by executions desc;
 
select  unsub_first_name, unsub_middle_name, unsub_last_name
   into :post_unsub_first_name, :post_unsub_middle_name,
        :post_unsub_last_name
   from
        fbi_person g,
        imm_arture i,
        fbi_immig_c gic,
        imm_fli if
  where
        g.fbi_person_id = i.fbi_person_id
        and g.fbi_imcat_cd = gic.fbi_imcat_cd
        and i.flight_seq_nbr_key = if.flight_seq_nbr_key
        and i.ed_card_nbr = :post_lookup_card;
 
insert into
   imm_argssr
      (fbi_person_id, flight_seq_nbr_key, edd_nbr,
       actual_imm, immcate,   
       est_immre_time,
       actual_immige, hotel_name,   
       flight_crew_indicator_flag, purpose_of_visit)
 values
  (:post_fbi_person_id,
   :post_ad_flight_key,
   :post_ed_nbr_save,
   to_date(:post_actl_date, 'dd/mm/yyyy'),
   :post_ad_passetegory,
   to_date(:post_ad_eparture_date,'dd/mm/yyyy'),
   sysdate, null, 'n', null);

2. Preparing the SQL for Benchmark Factory

For application SQL with the PL/SQL INTO clause, we must replace the INTO clause syntax with its generic SQL*Plus form.  If you want your test to keep the PL/SQL INTO clause, you can build anonymous PL/SQL blocks and execute them as prepared statements.  For our purposes, we only need the data transferred back to the calling machine, so we alter the SQL to remove the INTO clause:

select  unsub_stuff
   from
        fbi_person g,
        imm_arr_departure i,
        fbi_immigration_category gic,
        imm_flight if
  where
        g.fbi_person_id = i.fbi_person_id
        and i.flight_seq_key = if.flight_seq_key
        and i.ed_card_nbr = :post_LOOKUP_BY_NAME
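When many statements need this conversion, the INTO-clause removal can be scripted. A minimal regex-based sketch (it assumes the INTO clause sits between the select list and the FROM keyword, as in the extracted SQL above; a real SQL parser would be more robust):

```python
import re

def strip_into_clause(sql):
    """Remove a PL/SQL 'into :bind1, :bind2, ...' clause so the
    statement runs as plain SQL in a benchmark client."""
    # match 'into' followed by a comma-separated list of :bind variables,
    # up to (but not including) the 'from' keyword
    return re.sub(r"\binto\s+:\w+(\s*,\s*:\w+)*\s*(?=\bfrom\b)", "",
                  sql, flags=re.IGNORECASE | re.DOTALL)

plsql = "select c1, c2 into :b1, :b2 from t where id = :post_id"
print(strip_into_clause(plsql))
# -> "select c1, c2 from t where id = :post_id"
```

Statements without an INTO clause pass through unchanged, so the whole extracted workload can be run through the same filter.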

Providing data values for the benchmark

There are two ways to generate random testing data from the SQL that we have extracted:

  1. Random Values - We can use the Benchmark Factory functions $BFRandList and $BFRandRange to generate random lists of data values.  We can control all ranges of numeric values with valid ID ranges, and the list substitutions with valid values.
  2. Application Values - We can extract actual host variable data from the application tables. By examining tables and actual values, we can provide a realistic list of values to feed into the functions. For nearly all of the SQL selected for this test there were a limited number of values, with most selections based on integer keys easily generated using the $BFRandRange function.

Random Range Values

Here we see an example of SQL using the $BFRandRange function to generate random values for host variables:

select stuff
from
 fbi_person g, fbi_immigration ic
where
 g.fbi_person_id = $bfrandrange(10000,100000)
and
 g.fbi_imcat_cd = ic.fbi_imcat_cd;
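Outside of Benchmark Factory, the same substitution can be reproduced by expanding each $bfrandrange(low,high) token before the SQL is submitted. A sketch following the token syntax shown above ($BFRandList would be handled the same way with random.choice):

```python
import random
import re

def expand_bfrandrange(sql, rng=random):
    """Replace each $bfrandrange(low,high) token with a random
    integer in [low, high], as Benchmark Factory does at run time."""
    def repl(m):
        low, high = int(m.group(1)), int(m.group(2))
        return str(rng.randint(low, high))
    return re.sub(r"\$bfrandrange\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)",
                  repl, sql, flags=re.IGNORECASE)

sql = "where g.fbi_person_id = $bfrandrange(10000,100000)"
expanded = expand_bfrandrange(sql)
print(expanded)   # e.g. "where g.fbi_person_id = 57311"
```

Because each execution gets a fresh value, the workload touches a spread of key ranges rather than hammering one cached row.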

Application Values

By carefully selecting our SQL, we can use selection values that are easily loaded into the $BFRandList function.  To use actual application data, you would need to create a function that returns random values to the SQL statement; however, this would generate additional load on the database, possibly causing skewed or invalid results.

We are now ready to configure the client machines, installing the TNS layer and the Benchmark Factory software that will load down the database.  With the SQL ready, we can set up the BF clients.

3. Load Preparation

Stressing the database server involves setting up multiple Oracle*Net clients.  Benchmark Factory allows you to install an "agent" on an Oracle*Net client machine and launch the SQL from many client PCs.

How many users do we want to simulate? Up to 2,000 or so (subject to CPU and network bandwidth limits) can be driven from a single agent process; beyond that, we need to install one or more additional agents to submit requests through. Any Windows-based system can be used as an agent, but it must have Oracle's network stack installed.
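The sizing arithmetic — up to 2,000 threads per agent in theory, but roughly 60 users per 100 Mbit client line in practice — can be captured in a small helper. This is purely illustrative; the per-line figure is the empirical rule of thumb quoted earlier:

```python
import math

def agents_needed(target_users, per_agent_limit=2000, per_line_limit=60):
    """Number of client agents required for a given simulated user count.
    The binding constraint is usually network bandwidth (per_line_limit),
    not the agent software's thread ceiling (per_agent_limit)."""
    effective = min(per_agent_limit, per_line_limit)
    return math.ceil(target_users / effective)

print(agents_needed(200))   # -> 4 agents for 200 users at ~60 per line
```

By this rule of thumb a 200-thread load suggests three to four client lines; the case study used two clients, accepting roughly 100 threads per line.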

4. Test Execution & Capturing Results

Here is an example of the results from a "Goal" test, where the goal was to push the system to the point where the TPS dropped below that of the previous two runs.

The Benchmark Factory collects all transaction data and places it inside a spreadsheet for easy access and summarization:

 

 

Experiment Predictions

In our typical 64-bit Oracle system with 32 gigabytes of RAM we would predict results as follows:

  • The RAM data buffer cache (14 gigabytes in each of the two servers) will be sufficient to cache the working set of frequently referenced data blocks, so disk enqueues will not be the bottleneck.
  • Given the low DML rate of this system, and the fact that APEX returns whole screens of data in a single Oracle*Net round trip, we do not expect network latency to be responsible for the response-time drop-off.

Because of the architecture and the nature of transactions against this system, we expect that the main source of contention should be CPU and Oracle latch-related wait events.  We should start seeing CPU enqueues (in our case, the run queue exceeding 2 on each server) at about the same transaction rate at which total response time declines precipitously.  We predict that this configuration will bottleneck on CPU at approximately 30 transactions per second, and that RAM, network and disk enqueues will not be the bottleneck.
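The "run queue exceeding 2" criterion can be checked directly against the vmstat capture from the test-execution step. A sketch, assuming standard Linux vmstat output where the first numeric column (r) is the run queue:

```python
def cpu_enqueue_samples(vmstat_text, threshold=2):
    """Count vmstat samples whose run queue (first column, 'r')
    exceeds the CPU-enqueue threshold."""
    hits = 0
    for line in vmstat_text.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():   # skip the two header lines
            if int(fields[0]) > threshold:
                hits += 1
    return hits

sample = """procs -----------memory----------
 r  b   swpd   free   buff  cache
 1  0      0 812344  95520 651204
 4  0      0 801200  95520 651204
 3  1      0 799812  95520 651204
"""
print(cpu_enqueue_samples(sample))   # -> 2 samples with r > 2
```

Timestamping each sample lets the run-queue spikes be lined up against the TPS curve to confirm (or refute) the CPU-bound prediction.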

Our Results

For the initial test run we used only a single node, as the full-scale replication was not yet implemented. In this test we set up a run of 1-100 users executing the SQL transactions from Appendix A; in this phase of testing we did not test the replication.

In this goal test, the criterion for ending the test was for the transactions per second at the current user load to fall behind that of the two previous runs. The graph in Figure 4 shows the TPS results.

 Figure 4 - The Preliminary Test TPS Results

The results show that at 60 users the system TPS peaks in this configuration, but what about other signs of strain in the system? Is it CPU, disk, or network related? Look at Figure 5 for the CPU usage results.

 

Figure 5 - The Preliminary Test CPU Usage Results

Note that the user (us) CPU usage peaks at around 65 percent; with system (sy) peaking at only 5 percent, this still leaves about 30 percent CPU idle time. Clearly, based on Figure 5, we do not have CPU stress. A secondary indicator of CPU stress is run-queue length; Figure 6 shows the plot of the run queue for the same time period.

 

Figure 6 - The Preliminary Test Runqueue Results

As you can see, the run queue rarely rises above 4, indicating that CPU stress was not a factor. Next, let's examine the disk stress results. Look at Figure 7, showing disk writes.

 

Figure 7 - The Preliminary Test Disk Write Results

As can be seen from Figure 7, other than a single large burst of more than 30,000 I/Os in a 10-second interval near the mid-range of the test, disk I/O was well within the disks' operational limits. Figure 8 shows that disk reads were also well within specifications.
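For scale, that burst can be converted to a per-spindle rate. A back-of-the-envelope sketch (the even-spread assumption is ours; RAID 10 mirroring would roughly double the physical write load per disk):

```python
def burst_iops_per_disk(total_ios, interval_s, disks):
    """Average IOPS per spindle during a burst, assuming the I/O
    spreads evenly across all disks in the stripe set."""
    return total_ios / interval_s / disks

# 30,000 I/Os over 10 seconds across the six RAID 10 disks
print(burst_iops_per_disk(30_000, 10, 6))   # -> 500.0 IOPS per disk
```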

 

Figure 8 - The Preliminary Test Disk Read Results

 

The system reads are clearly below any stress levels that would indicate the disks were the cause of the low TPS values. This is also supported when we examine the system wait statistics in Figure 9.

 

Figure 9 - The Preliminary Test Wait Results

 

As can be seen from examining Figure 9, other than a peak that corresponds to the reads on disk hbc, system waits are not causing the TPS drop-off. The final possible bottlenecks are memory and network. Examination of the AWR report for the test period shows some interesting network statistics; Figure 10 shows an excerpt from the applicable AWR report for the test run.

Shared Pool Statistics        Begin    End
                              ------  ------
             Memory Usage %:   88.59   91.21
    % SQL with executions>1:   35.30   41.99
  % Memory for SQL w/exec>1:   44.32   45.20
 
Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time                                            555          67.3
latch: library cache                  7,068          69     10    8.3 Concurrenc
latch: shared pool                    4,291          15      4    1.9 Concurrenc
log file sync                         5,520          11      2    1.4     Commit
log file parallel write               6,464           9      1    1.1 System I/O
          -------------------------------------------------------------
 
Statistic Name                                       Time (s) % of DB Time
------------------------------------------ ------------------ ------------
DB CPU                                                  554.8         67.3
sql execute elapsed time                                395.7         48.0
parse time elapsed                                      384.1         46.6
hard parse elapsed time                                 311.9         37.9
 
                                                                  Avg
                                       %Time       Total Wait    wait     Waits
Wait Class                      Waits  -outs         Time (s)    (ms)      /txn
-------------------- ---------------- ------ ---------------- ------- ---------
Concurrency                    12,201     .0               96       8       2.0
System I/O                     10,306     .0               15       1       1.7
Commit                          5,520     .0               11       2       0.9
Network                     2,798,166     .0                8       0     451.2
Other                             247     .0                1       5       0.0
Application                       400     .0                0       0       0.1
User I/O                        1,407     .0                0       0       0.2
          -------------------------------------------------------------
 
 
Wait Events                              DB/Inst: FBIGT/fbigt  Snaps: 383-384
-> s  - second
-> cs - centisecond -     100th of a second
-> ms - millisecond -    1000th of a second
-> us - microsecond - 1000000th of a second
-> ordered by wait time desc, waits desc (idle events last)
 
                                                                   Avg
                                             %Time  Total Wait    wait     Waits
Event                                 Waits  -outs    Time (s)    (ms)      /txn
---------------------------- -------------- ------ ----------- ------- ---------
latch: library cache                  7,068     .0          69      10       1.1
latch: shared pool                    4,291     .0          15       4       0.7
log file sync                         5,520     .0          11       2       0.9
log file parallel write               6,464     .0           9       1       1.0
latch: cache buffers chains             597     .0           7      12       0.1
SQL*Net message to client         2,782,891     .0           5       0     448.8
SQL*Net message from client       2,782,946     .0     190,370      68     448.8
          -------------------------------------------------------------
 
Background Wait Events                   DB/Inst: FBIGT/fbigt  Snaps: 383-384
-> ordered by wait time desc, waits desc (idle events last)
 
                                                                   Avg
                                             %Time  Total Wait    wait     Waits
Event                                 Waits  -outs    Time (s)    (ms)      /txn
---------------------------- -------------- ------ ----------- ------- ---------
rdbms ipc message                    17,804   65.0      37,185    2089       2.9
          -------------------------------------------------------------

 

Figure 10 - The Preliminary Test AWR Results
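The "Avg wait (ms)" column in these AWR excerpts is simply total wait time divided by the wait count, and recomputing it is a quick sanity check when reading reports. Checking two rows from the figures above:

```python
def avg_wait_ms(total_wait_s, waits):
    """Average wait in milliseconds, as AWR derives it."""
    return total_wait_s * 1000 / waits

# latch: library cache -- 7,068 waits, 69 s total => ~10 ms average
print(round(avg_wait_ms(69, 7068)))   # -> 10
# log file sync -- 5,520 waits, 11 s total => ~2 ms average
print(round(avg_wait_ms(11, 5520)))   # -> 2
```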

 

The AWR report shows that we are seeing some shared pool stress and network stress. Review of the alert log entries shows that the database was also getting ORA-04031 errors, indicating that the shared pool was becoming fragmented. The CURSOR_SHARING parameter had been left at EXACT; as a result, numerous non-sharable SQL statements were loaded into the shared pool, causing the ORA-04031 errors.
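One way to confirm that literal-laden, non-sharable SQL is flooding the shared pool is to normalize statement text and count near-duplicates. This is a hedged sketch — the regex normalization below is our illustration, not an Oracle utility; on 10g and later, grouping v$sql by FORCE_MATCHING_SIGNATURE does this properly:

```python
import re
from collections import Counter

def literal_duplicates(sql_texts, min_count=2):
    """Group SQL statements that differ only in literal values.
    Many distinct texts collapsing to one normalized form indicates
    non-sharable SQL that CURSOR_SHARING=FORCE would make sharable."""
    def normalize(sql):
        sql = re.sub(r"'[^']*'", ":lit", sql)    # string literals
        sql = re.sub(r"\b\d+\b", ":lit", sql)    # numeric literals
        return re.sub(r"\s+", " ", sql).strip().lower()
    counts = Counter(normalize(s) for s in sql_texts)
    return {form: n for form, n in counts.items() if n >= min_count}

stmts = ["SELECT * FROM t WHERE id = 101",
         "select * from t where id = 202",
         "select name from dual"]
print(literal_duplicates(stmts))
# -> {'select * from t where id = :lit': 2}
```

A large cluster count here is exactly the shared-pool fragmentation pattern that produced the ORA-04031 errors above.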

When we retest using the dual nodes with replication, we will change CURSOR_SHARING (e.g. to FORCE) to reduce memory thrashing and parsing. In addition, we will make sure the clients used are on separate network segments.

Conclusions for Phase 1 Testing

Phase one showed that even with limited network bandwidth the system easily supported 50-60 users on a single node, more than is anticipated to be needed during normal operations. With better memory parameters and better network bandwidth, it is anticipated that the nodes will easily support 80-100 users.

Disk Load - Other than a single large burst of more than 30,000 I/Os in a 10-second interval near the mid-range of the test, disk I/O was well within the disks' operational limits.

CPU - The user (us) CPU usage peaks at around 65 percent; with system (sy) peaking at only 5 percent, this still leaves about 30 percent CPU idle time.

Memory and Network - Examination of the AWR report for the test period shows some interesting network and shared pool statistics.  As predicted, CPU time dominated the top timed events as active users rose above 60:

 

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time                                            555          67.3
latch: library cache                  7,068          69     10    8.3 Concurrenc
latch: shared pool                    4,291          15      4    1.9 Concurrenc
log file sync                         5,520          11      2    1.4     Commit
log file parallel write               6,464           9      1    1.1 System I/O

 

Copyright © 1996 -  2020

All rights reserved by Burleson

Oracle ® is the registered trademark of Oracle Corporation.