Disaster Recovery
Oracle Application Server Tips by Burleson Consulting
With an implementation such as the one
depicted in Figure 9-1, you have a completely redundant system,
capable of continuous availability, even with the loss of a server
in each tier. Before digging deeper into the high availability
capabilities of the individual components, we need to discuss
recovering from a disaster, say an earthquake or a fire. You need
the ability to recover if you lose your data center. Oracle
Application Server Disaster Recovery is the solution to provide
off-site replication of the application server. The administrator
periodically executes scripts that update the configuration of the
standby site to match the active site. If the active site is lost,
the standby site is activated, and the DNS is changed to address the
new location. The standby site must match the active site. The
standby back-end database must also be configured to stay current
with the active database, possibly using Oracle Data Guard.
Backup and Recovery
Sometimes it
is easier to recover a failed component than to spend time trying to
repair it. Oracle Application Server 10g comes with a new Backup and
Recovery component that allows you to create a checkpoint of the
system and, if need be, quickly recover to that checkpoint. This
capability is instrumental in implementing Disaster Recovery,
discussed in the previous section. For additional information, refer
to Chapter 11.
At this point
we need to discuss how the individual components implement high
availability.
Rolling Upgrades
One of the
requirements that creates downtime is maintenance. Part of
maintenance is upgrading the application server itself. Oracle
Application Server 10g has the ability to upgrade from Oracle9iAS
Release 2 (9.0.2) with minimal downtime. In fact, if you have
implemented a completely redundant system like Figure 9-1, you can
upgrade a component (such as the OHS/OC4J behind the Web Cache) and
test it while the other components are still supporting your
application. As each upgraded component is accepted, you can bring
it online and upgrade the next component. In this way you could
completely upgrade the application server with no downtime while
implementing a rigorous testing and verification routine.
Oracle plans
to implement the rolling upgrade in all future releases of the
Oracle Application Server. For more details on planning and
executing a minimal downtime upgrade from Oracle9iAS (9.0.2) to
Oracle Application Server 10g, refer to the Oracle Application
Server Upgrading to 10g documentation.
OC4J High Availability Features
The Oracle
Application Server 10g OC4J container was discussed in detail in
Chapter 7. The OC4J container has a number of its own high
availability features. These include the ability to deploy
components into a running container, to use multiple JVMs, and to
replicate state across containers and instances.
Hot Deployments and Redeployments
Rolling
upgrades are available not only for the application server, but also
for your application running on the application server. This is
referred to as a hot deployment or a redeployment. During a hot
deployment, the OC4J container deploys the new EJB/web components
while continuing to support the present components. This can cause a
temporary performance impact as the deployment occurs.
During a
redeployment, updates to existing application components are
deployed. Redeployments of a currently running application require
additional planning and testing to ensure success. The problem with
redeploying a live application is that the current session's state
may not survive once the new application components are deployed.
A stateful EJB will be upgraded, but there is no way to ensure that
the new EJB assumes the state of the previous version of the EJB.
One way around this problem is to always store the component's state
in the back-end database; however, there are trade-offs with this
solution that must be considered during development. A redeployment
that fails may leave the OC4J container in an inoperable state,
which may require you to restore the OC4J container using DCM (see
Chapter 7).
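The idea of keeping component state in the back-end database can be
sketched as follows. This is only an illustration under stated
assumptions: StateStore, InMemoryStateStore, and CartComponent are
hypothetical names, and the in-memory map stands in for a database
table so that the sketch runs on its own. The component keeps nothing
in instance fields, so a freshly deployed instance picks up where the
old one left off.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction over the back-end database; a real
// implementation would read and write rows through JDBC.
interface StateStore {
    void save(String sessionId, Map<String, String> state);
    Map<String, String> load(String sessionId);
}

// In-memory stand-in so the sketch runs without a database.
class InMemoryStateStore implements StateStore {
    private final Map<String, Map<String, String>> rows = new HashMap<>();

    public void save(String sessionId, Map<String, String> state) {
        rows.put(sessionId, new HashMap<>(state)); // copy, like a write
    }

    public Map<String, String> load(String sessionId) {
        Map<String, String> state = rows.get(sessionId);
        return state == null ? new HashMap<>() : new HashMap<>(state);
    }
}

// Holds no state between calls; every call reloads from the store.
class CartComponent {
    private final StateStore store;

    CartComponent(StateStore store) {
        this.store = store;
    }

    void addItem(String sessionId, String item) {
        Map<String, String> state = store.load(sessionId);
        int qty = Integer.parseInt(state.getOrDefault(item, "0"));
        state.put(item, Integer.toString(qty + 1));
        store.save(sessionId, state);
    }

    int quantity(String sessionId, String item) {
        return Integer.parseInt(store.load(sessionId).getOrDefault(item, "0"));
    }
}

class RedeployDemo {
    public static void main(String[] args) {
        StateStore db = new InMemoryStateStore();
        CartComponent before = new CartComponent(db);
        before.addItem("session-1", "book");

        // Simulate a redeployment: the old instance is discarded and a
        // new one is created against the same store.
        CartComponent after = new CartComponent(db);
        after.addItem("session-1", "book");
        System.out.println(after.quantity("session-1", "book")); // prints 2
    }
}
```

The trade-off mentioned above is visible even in this toy version:
every call pays for a load and a save, which in a real system means a
round trip to the database.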
OC4J Islands
Islands are
multiple containers working together to ensure availability. OC4J
containers start in a default island that contains one process. You
can increase the number of processes in the default island to ensure
that the container continues to run after a process fails. Multiple
OC4J containers, in a single island, will replicate state information
so that if one container fails, the other will continue to support the
active sessions. This can be expanded to multiple OC4J containers on
different servers so that the application continues to support
active sessions, even with the complete loss of a server. When
planning for and creating islands of containers, ensure that the
island spans multiple servers. Also, state replication among the
containers in an island requires some overhead. Creating multiple
islands reduces the overhead of propagating state information to
other containers in the island while maintaining the ability to
recover from the loss of a server or container.
[Figure 7: Distribution of islands across multiple servers. Midtier 1
hosts OC4J-APP1, OC4J-APP2, and OC4J-APP3; Midtier 2 hosts OC4J-APP1,
OC4J-APP2, and OC4J-APP4. The figure labels two islands, app 1 and
app 2.]
It is the job
of mod_oc4j to map sessions to OC4J islands. If a server fails and
the OC4J island spans multiple servers, mod_oc4j will route the
transaction to an available container within the original island. If
there are no OC4J containers remaining in the island, the session
state is lost. For more information on OC4J islands, refer to
Chapter 7.
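The routing behavior described above can be modeled with a small
sketch. This is a toy model, not the actual mod_oc4j algorithm: the
IslandRouter class, its method names, and the "host/container" naming
scheme are all assumptions made for illustration. It shows only the
rule stated in the text: a session stays within its island, and if no
member of the island survives, the session state is lost.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Toy model of session-to-island routing (hypothetical, illustrative).
class IslandRouter {
    // island name -> member containers, written as "host/container"
    private final Map<String, List<String>> islands = new HashMap<>();
    private final Set<String> failed = new HashSet<>();
    private final Map<String, String> sessionIsland = new HashMap<>();

    void addContainer(String island, String container) {
        islands.computeIfAbsent(island, k -> new ArrayList<>()).add(container);
    }

    void bindSession(String sessionId, String island) {
        sessionIsland.put(sessionId, island);
    }

    // Marks every container on the given host as unavailable.
    void markHostFailed(String host) {
        for (List<String> members : islands.values()) {
            for (String container : members) {
                if (container.startsWith(host + "/")) {
                    failed.add(container);
                }
            }
        }
    }

    // Returns a live container in the session's island, or empty when
    // no member survives (the session state is lost).
    Optional<String> route(String sessionId) {
        String island = sessionIsland.get(sessionId);
        if (island == null) {
            return Optional.empty();
        }
        List<String> members = islands.getOrDefault(island, Collections.emptyList());
        for (String container : members) {
            if (!failed.contains(container)) {
                return Optional.of(container);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        IslandRouter router = new IslandRouter();
        // An island that spans two servers, and one confined to a single server.
        router.addContainer("app1", "midtier1/OC4J-APP1");
        router.addContainer("app1", "midtier2/OC4J-APP1");
        router.addContainer("solo", "midtier1/OC4J-APP3");
        router.bindSession("s1", "app1");
        router.bindSession("s2", "solo");

        router.markHostFailed("midtier1");
        System.out.println(router.route("s1")); // surviving member on midtier2
        System.out.println(router.route("s2")); // empty: no member left
    }
}
```

The second session illustrates why an island confined to one server
does not protect against the loss of that server.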
Transparent Application Failover
Transparent
Application Failover (TAF) is available on the connection from the
application server to the back-end database. To use TAF, the
database connections must use the thick JDBC client, and the
back-end database must be running Real Application Clusters. When
the application server sends a request to the back-end database, it
gets assigned to an instance in the RAC cluster. That database
instance will execute the request and return the response. If the
assigned database instance fails (even in the middle of executing
the request), TAF will detect this and automatically route the
request to another instance in the database cluster.
TAF uses
an Oracle Net connection, and thus your application must use the thick
Oracle JDBC client to connect to the database. TAF is not something
that you can just turn on and walk away. Your application must
understand how TAF works and respond accordingly. TAF supports the
following functions:
* Active transactions - Uncommitted inserts,
updates, and deletes are automatically rolled back if the instance
fails. TAF will return an error to the application until a rollback
command is submitted.
* Database connections - TAF will
automatically reconnect to another database instance if the current
instance fails.
* Select failover - If your application is
retrieving data using a Select statement (open cursors) and is in
the process of fetching rows when the database instance fails, TAF
will reconnect and reexecute the cursor select statement, discard
the already returned rows, and allow you to fetch the remaining
rows. For example, if your application is processing 1 billion rows
and the instance fails after you have fetched 200 rows, TAF will
automatically reconnect, reexecute the cursor, discard the already
fetched rows, and allow the application to continue fetching the
remaining rows. To the application, it appears that the database
stops for a few seconds and then continues.
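The first item above, active transactions, is where the application
has to cooperate. A minimal sketch of one way to respond is shown
below: catch the error, issue the rollback, and rerun the whole
transaction. The TafRetry class and the Transaction and Rollback
interfaces are hypothetical helpers, and treating vendor code 25402
(ORA-25402, "transaction must roll back") as the signal is an
assumption you should verify against the error codes your driver
actually returns. In real code the rollback hook would call
Connection.rollback().

```java
import java.sql.SQLException;

// A unit of database work that can be rerun from the start.
interface Transaction<T> {
    T run() throws SQLException;
}

// Hook for issuing the rollback; with JDBC this would call
// Connection.rollback() on the failed-over connection.
interface Rollback {
    void rollback() throws SQLException;
}

class TafRetry {
    // Assumed vendor code for "transaction must roll back" after failover.
    static final int ORA_TXN_MUST_ROLL_BACK = 25402;

    static <T> T runWithRetry(Transaction<T> tx, Rollback rb, int maxAttempts)
            throws SQLException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return tx.run();
            } catch (SQLException e) {
                if (e.getErrorCode() != ORA_TXN_MUST_ROLL_BACK) {
                    throw e; // not a failover signal; let the caller decide
                }
                rb.rollback(); // acknowledge the lost transaction
                last = e;      // then loop and rerun the work
            }
        }
        throw last; // every attempt hit a failover
    }
}
```

The sketch reruns the entire transaction instead of resuming it,
because the uncommitted work was rolled back with the failed instance.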
TAF is not
fail-proof. When the connected database instance fails, the
nonpersistent session data is not automatically restored. Any
server-side program variables and PL/SQL package state are also lost.
TAF can also
be configured to create two connections, each to a separate database
instance, to reduce the time required to recover from an instance
failure.
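TAF behavior is requested through the FAILOVER_MODE clause of the
Oracle Net connect descriptor. As a rough sketch, the helper below
assembles a thick (OCI) JDBC URL that asks for select failover with a
preestablished backup connection. The TafDescriptor class, the host
and service names, and the backup alias are placeholders, and the
backup alias is assumed to be a net service name that your Oracle Net
configuration can resolve. Check the Oracle Net documentation for your
release before relying on any of these parameters.

```java
// Builds a connect descriptor with a FAILOVER_MODE clause (illustrative).
class TafDescriptor {
    static String ociUrl(String host1, String host2, int port,
                         String service, String backupAlias) {
        String descriptor =
            "(DESCRIPTION="
            + "(ADDRESS_LIST=(FAILOVER=ON)"
            + "(ADDRESS=(PROTOCOL=TCP)(HOST=" + host1 + ")(PORT=" + port + "))"
            + "(ADDRESS=(PROTOCOL=TCP)(HOST=" + host2 + ")(PORT=" + port + ")))"
            + "(CONNECT_DATA=(SERVICE_NAME=" + service + ")"
            // TYPE=SELECT asks for open cursors to be resumed;
            // METHOD=PRECONNECT asks for the backup connection up front.
            + "(FAILOVER_MODE=(TYPE=SELECT)(METHOD=PRECONNECT)"
            + "(BACKUP=" + backupAlias + "))))";
        // The "oci" subprotocol selects the thick JDBC driver.
        return "jdbc:oracle:oci:@" + descriptor;
    }

    public static void main(String[] args) {
        // All names below are placeholders.
        System.out.println(ociUrl("rac-node1", "rac-node2", 1521, "sales", "SALES_BACKUP"));
    }
}
```

Preconnecting trades resources for speed: the second connection exists
whether or not a failure ever occurs.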
TAF is a
powerful feature that requires additional planning to implement in
your application. For more information, refer to the Oracle
documentation or to the Oracle9i RAC book mentioned earlier.
High Availability of Applications
Each component of Oracle Application
Server 10g has the ability to create redundancy. However, you must
ensure that your application is implemented in a way to take
advantage of this capability. Ensure that your applications
replicate stateless components to multiple servers and that stateful
components are contained in islands that span multiple servers.
Complete redundancy in the application server will be of little use
if the back-end database is not available. Creating a highly available
infrastructure will also remove that critical single point of
failure.
Depending on your application, using
the Web Cache to continue to respond to user requests will allow you
some time to switch systems, but will eventually result in failed
requests or serving stale content. The bottom line is that you must
plan the infrastructure needed to ensure that all components of your
application are using the high availability features built into
Oracle Application Server 10g and Oracle Database 10g with Real
Application Clusters.
This is an excerpt from "Oracle
10g Application Server Administration Handbook" by Don Burleson
and John Garmany.