Oracle Database Backup and
Recovery
Oracle Application Server Tips by Burleson Consulting
Whether you are working on a development
system or a production system, you need to take precautions to
protect your valuable work from external problems such as power
outages or server crashes. In the case of a production system,
a company has in many cases bet its continued existence on Oracle
software. The Oracle database has built its reputation as a
powerful data management system that can safely maintain the data
entrusted to it. The Oracle Application Server 10g
continues in that vein with a robust capability to recover from
external or internal problems. Chapter 9 discussed High
Availability solutions that can reduce or eliminate application
server downtime. In this chapter we discuss methods to back up
the application server and, if necessary, restore it from a backup.
Types of Problems
Before discussing methods of backing up and
restoring the application server, we need to discuss the types of
problems that the system could encounter and how they affect the
application server. I have divided these problems into two
types, external and internal. An external problem happens
outside the application server or within the application server code
itself. An internal problem happens within the user application
running in the application server.
External Problems
Most system crashes are the result of human
error. An operator makes a change and the effect cascades to
the point of crashing the entire system. The second most
common cause of system crashes is equipment failures.
Equipment Failures
In the not too distant past, one of the
basic tasks of a DBA was to recover data files from disk failures.
With the rapid adoption of RAID systems, many DBAs have never had to
perform this task. As discussed in chapter 9, you can create
an array of application servers so that an equipment-related failure
on one node does not stop the application server from servicing user
requests. This is the basis for using multiple inexpensive commodity
servers to produce fault-tolerant systems. Even though the
application server keeps running, an operator must still recover the
failed node and place it back into operation.
Server Crashes
A crash is the sudden failure of a system.
It may be caused by the loss of power, equipment failure, or a
software problem such as a memory leak. The primary problem
with a system crash is that data in memory is lost before it can be
saved. The Oracle Application Server 10g and the Oracle
database (including the Metadata Repository database) can normally
recover on their own from a server crash; the user application,
however, may not.
User Error
User error is the most common problem that
administrators must deal with, and it is often the hardest to
recover from if you are not prepared. Suppose the development team
has spent the last three days working late at night to prepare an
application upgrade for deployment into the production system. A tired
developer accidentally deletes a critical application file. If
(as commonly happens) the team has not backed up the development
system, they may not be able to recover that critical file. In
many companies, the development system will require daily backups
while the production system may need only occasional backups.
It all depends on how often each system changes.
Another common error is deleting the wrong
data. This can be a user error or the failure of some type of
automation such as a nightly script.
SQL> delete from transactions where trans_date > SYSDATE - 7;
In the above example, if the user was trying
to delete all transactions older than 7 days, he made a
mistake: the statement instead deletes all transactions newer than 7
days. If the operator notices the mistake before committing,
the database can roll back the delete. If the user commits or
logs off (an implicit commit), the deleted data must be recovered.
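The effect of the reversed comparison, and the rollback window that exists before a commit, can be sketched outside Oracle. The following is a minimal illustration only: it uses Python's built-in sqlite3 module as a stand-in for an Oracle session, and the table layout, the fixed "today" date, and the helper names are assumptions made for the example. ISO date strings stand in for SYSDATE arithmetic.

```python
import sqlite3
from datetime import date, timedelta

def build_demo_db(today):
    # isolation_level=None puts sqlite3 in autocommit mode, so the
    # BEGIN / ROLLBACK / COMMIT statements below control transactions.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE transactions (id INTEGER PRIMARY KEY, trans_date TEXT)"
    )
    # One row per day, aged 0 to 13 days (hypothetical sample data).
    rows = [(i, (today - timedelta(days=i)).isoformat()) for i in range(14)]
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    return conn

def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]

today = date(2024, 1, 15)                       # fixed date, stands in for SYSDATE
cutoff = (today - timedelta(days=7)).isoformat()
conn = build_demo_db(today)

# The mistaken statement: '>' removes rows NEWER than the cutoff.
conn.execute("BEGIN")
wrong = conn.execute(
    "DELETE FROM transactions WHERE trans_date > ?", (cutoff,)
).rowcount
conn.execute("ROLLBACK")                        # caught before commit: rows come back
restored = count_rows(conn)

# The intended statement: '<' removes rows OLDER than the cutoff.
conn.execute("BEGIN")
right = conn.execute(
    "DELETE FROM transactions WHERE trans_date < ?", (cutoff,)
).rowcount
conn.execute("COMMIT")
remaining = count_rows(conn)

print(wrong, restored, right, remaining)
```

Checking the row count reported by the delete before committing is a cheap safeguard: a number far from what was expected is the signal to roll back rather than commit.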
Another form of user error is incomplete automation of a task. A
company implements a nightly script that processes the daily
transactions and then truncates the transaction table. The
processing step fails (maybe for something as simple as a database
link being unavailable), but the second step runs anyway and truncates
the table, causing the loss of that day's receipts.
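One way to guard against this failure is to make the clearing step conditional on the processing step. The sketch below again uses Python's sqlite3 module as a stand-in for Oracle, and every name in it (nightly_job, record_total, broken_link, the daily_totals table) is hypothetical. It uses DELETE inside a transaction rather than TRUNCATE, because in Oracle TRUNCATE is a DDL statement that commits implicitly and cannot be rolled back.

```python
import sqlite3

def nightly_job(conn, process_rows):
    """Process the day's transactions, then clear the table, all or nothing."""
    conn.execute("BEGIN")
    try:
        rows = conn.execute("SELECT id, amount FROM transactions").fetchall()
        process_rows(conn, rows)                  # hypothetical step; may raise
        conn.execute("DELETE FROM transactions")  # reached only if processing succeeded
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")                  # undo partial work, keep the receipts
        return False

def make_db():
    # Autocommit mode; nightly_job issues its own BEGIN/COMMIT/ROLLBACK.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
    conn.execute("CREATE TABLE daily_totals (total REAL)")
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?)", [(1, 10.0), (2, 32.5)]
    )
    return conn

def record_total(conn, rows):
    # Successful processing: store the sum of the day's amounts.
    conn.execute(
        "INSERT INTO daily_totals VALUES (?)", (sum(amount for _, amount in rows),)
    )

def broken_link(conn, rows):
    # Simulates the processing step failing partway through the job.
    raise ConnectionError("database link unavailable")

def count(conn, sql):
    return conn.execute(sql).fetchone()[0]

failed_db = make_db()
failed_ok = nightly_job(failed_db, broken_link)
failed_left = count(failed_db, "SELECT COUNT(*) FROM transactions")

good_db = make_db()
good_ok = nightly_job(good_db, record_total)
good_left = count(good_db, "SELECT COUNT(*) FROM transactions")
good_total = count(good_db, "SELECT total FROM daily_totals")

print(failed_ok, failed_left, good_ok, good_left, good_total)
```

If the table really must be truncated for performance reasons, the same idea applies: run the truncate as a separate step that executes only after the processing step has reported success.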
Site Disaster
A site disaster is the total loss of the
system site, from an earthquake or fire for example. In this
case, backups and/or failover systems must be maintained off site at a
location that will not be affected by the disaster that strikes the
main site. Failover systems are discussed in chapter 9.
Internal Problems
Internal problems happen within the user
application and involve the programming practices that ensure persistent
data is properly maintained and updated to reflect the application
data in memory. This topic was discussed briefly in chapter 7, but
is beyond the scope of this book. The Oracle Application
Server 10g has a number of features that assist with internal
problems, such as OC4J islands that maintain state across instances,
but these features (chapter 9) are designed to survive the loss of
an application server instance, not to compensate for poor
programming of the user applications.
Between external and internal problems,
there will come a time when you will need to recover the application
server or database. Oracle provides a robust set of tools to
ensure data protection, but you have to use them. I am always
amazed when I visit a new client and find that they have no backup
strategy and in many cases are not running their database in
archivelog mode. My philosophy is that if your data is worth
being in an Oracle product, it is worth implementing the tools to
protect it.
This is an excerpt from "Oracle
10g Application Server Administration Handbook" by Don Burleson
and John Garmany.