Thursday, February 26, 2009

RAC and Linux

Hi, for many years I have been supporting the Linux and Oracle combination.

After a few recent experiences I think that the newest RedHat 5 and Oracle 10g require more attention and work, and you can run into some very strange issues.

1. Bonding

A new card (a 4-port Intel NIC) was added to the server. All the new interfaces were configured in modprobe.conf, and after that a new bonding configuration was created. So far it looks good. Restart the network interfaces and, hurray, we have bonding interfaces.

Then a simple RAC reconfiguration and we have Clusterware using the bonding interface.

But unfortunately only until the server was rebooted.

After the reboot the order of the “ethX” interfaces had changed :-(

and eth0 became eth4 and so on.

OK, let's change the configuration to match the new order.

Next restart and .... ?

Yet another order of interfaces !!!

Luckily it was not my task; anyway, my colleague found a solution on the RH website.

After that it was OK.

Bonding was working ...

but RAC ....

see next point



2. RAC

So we have bonding up and running. According to Metalink we have to change the interfaces using oifcfg and then change the VIP configuration.
I stopped the cluster, made all the changes, restarted the cluster ...
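
For reference, here is a rough sketch of what such a change usually looks like on 10g. It only prints the commands (a dry run), so nothing is touched. The interface names, subnet, node name and VIP address in the usage line are made-up examples, and the function names are my own helpers, not Oracle tools. The exact order of the steps, and whether the stack has to be up or down, should be taken from the Metalink note, not from this sketch.

```shell
#!/bin/sh
# Dry-run sketch: print the oifcfg / srvctl commands for moving the
# interconnect and the VIP to a bonding interface. Nothing is executed.
# All arguments are hypothetical examples supplied by the caller.

# print_interconnect_change OLD_IF NEW_IF SUBNET
print_interconnect_change() {
    old_if=$1
    new_if=$2
    subnet=$3
    echo "oifcfg getif"
    echo "oifcfg delif -global $old_if"
    echo "oifcfg setif -global $new_if/$subnet:cluster_interconnect"
}

# print_vip_change NODE VIP NETMASK PUBLIC_IF
print_vip_change() {
    node=$1
    vip=$2
    netmask=$3
    public_if=$4
    echo "srvctl stop nodeapps -n $node"
    echo "srvctl modify nodeapps -n $node -A $vip/$netmask/$public_if"
    echo "srvctl start nodeapps -n $node"
}

# Example (prints only):
#   print_interconnect_change eth1 bond1 192.168.10.0
#   print_vip_change node1 10.0.0.101 255.255.255.0 bond0
```

Printing first and pasting by hand is deliberate: a typo in the interface name is much cheaper to catch on screen than in the OCR.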

Hurray, it is working ... yes ... until the servers were rebooted.

After the reboot only one node was up and running; on the second one Clusterware didn't start.

A quick check of the logs, and in ocssd.log I found:

[ CSSD]2009-02-20 19:43:44.758 [1115699552] >TRACE: clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(14467) LATS(11888034) Disk lastSeqNo(14467)
[ CSSD] 2009-02-20 19:43:44.758 [1220598112] >TRACE: clssnmRcfgMgrThread: Local Join

WTF ? Node 1 is up and running, so why a local join ?

Another review and ...

[ CSSD] 2009-02-20 19:43:44.758 [1147169120] >TRACE: clsc_send_msg: (0x7271b0) NS err (12571, 12560), transport (530, 113, 0)

My favourite error - no network. A quick check with ping - both interfaces, private and public, are up. So what is wrong with that box ?
The problem is that bonding needs some time after the interfaces come up before it starts working.
If Clusterware starts BEFORE bonding is 100 % operational, Oracle checks only once over the network whether there is another node, and then decides to start Clusterware in local configuration. But after that it realizes that the other node is already registered on the voting disk, so a local join is impossible.
BTW in my opinion Clusterware should be more flexible and check for the other node more than once.

A solution from Metalink is to add a sleep of a few seconds before Clusterware is started, in /etc/init.d/init.crs:

'start')
    CMD=`$BASENAME $0`

    # If we are being invoked by the user, perform manual startup.
    # If we are being invoked as an RC script, check for autostart.
    if [ "$CMD" = "init.crs" ]; then
        $LOGMSG "Oracle Cluster Ready Services starting by user request."
        $ID/init.cssd manualstart
    else
        $ID/init.cssd autostart
    fi
    ;;


Add 'sleep <seconds>' before '$ID/init.cssd autostart', for example to sleep 20 seconds:

    else
        sleep 20
        $ID/init.cssd autostart
    fi
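
A fixed sleep is a guess. As an alternative (my own idea, not from Metalink, and not tested against a real cluster), one could wait until the other node answers on the private interconnect, with an upper limit so the first node to boot does not hang forever. The sketch below assumes that a successful ping of the peer's private address is a good enough sign that bonding is working; the names wait_for_peer, probe_peer and PROBE are hypothetical helpers of mine, not part of init.crs.

```shell
#!/bin/sh
# Sketch: poll the peer node instead of sleeping for a fixed time.
# Assumption: "peer answers ping on the private IP" means bonding works.

# Default probe: one ICMP echo with a 1 second timeout (Linux ping syntax).
probe_peer() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

# wait_for_peer IP [MAX_TRIES] [DELAY_SECONDS]
# Returns 0 as soon as the probe succeeds, 1 if every attempt fails.
# Set PROBE to another function or command to replace the ping check.
wait_for_peer() {
    peer_ip=$1
    max_tries=${2:-30}
    delay=${3:-1}
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if ${PROBE:-probe_peer} "$peer_ip"; then
            return 0
        fi
        try=$((try + 1))
        sleep "$delay"
    done
    return 1
}

# Possible use in place of the fixed sleep (192.168.10.2 is a made-up
# private address of the other node):
#   wait_for_peer 192.168.10.2 30 2
#   $ID/init.cssd autostart
```

Note that the start goes ahead even when the wait times out: the other node may really be down, and then a local start is exactly what we want.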


To finish, my personal opinion - previous versions of Linux and Oracle were more stable.
I hope Linux and Oracle will not end up like MS products.

regards,
Marcin

Thursday, February 12, 2009

Strange error

Hi,

A short note so I don't forget about it.

I have to investigate it more deeply and reproduce it, but today I hit a very strange issue.
I started a standby DB in read-only mode, but unfortunately audit_trail was set to DB.
I got an error that I could not open the database in read-only mode with that parameter.
I changed it, and after a restart I got the information that my db files need to be recovered to be consistent :-(

Hmmm, right now I'm waiting for an airplane, but I will come back to this issue.


Update - Friday 13th :)
RTFM ...
It is a mixed PA-RISC and Itanium environment, and the standby can't be opened in read-only mode - see Metalink note 413484.1

BTW
Another example of very useful error messages !!!

Wednesday, February 4, 2009

RMAN - random errors for years

Hi,


I have been working with RMAN for 8 years and I'm still wondering why some RMAN errors are taken from /dev/random ;)

Latest example:

Environment : Linux 32 bit - Oracle 10g 10.2.0.4 on ASM

Performed steps:
  1. Drop the existing test DB from ASM - using drop database
  2. Copy the backup from the production server to the test server
  3. Restore the controlfile from the new location
  4. Mount the database
After that I wanted to restore the database, so I cataloged all the necessary backup pieces
in the controlfile and checked them using the list backupset command.
There was one backupset with all the datafiles, with correct status. So it is simple, let's try to restore
the DB.

RMAN>restore database;

creating datafile No=1 name=+DATA/oracle/orcl/datafile/o1_mf_system_3n5w1nky_.dbf
released channel: t1
released channel: t2
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 07/04/2008 10:54:29
ORA-01180: can not create datafile 1
ORA-01110: data file 1: '+DATA/dataprd/ORCL/datafile/o1_mf_system_3n5w1nky_.dbf'

Yeah, nice error.
There are some notes on Metalink related to duplicated incarnations and a corrupted controlfile
(BTW one solution there is to recreate the controlfile from the command line before you restore the datafiles - is it possible to recreate a controlfile without datafiles ???)

Anyway, that was not my case.

The solution is very simple - I found out that during the catalog phase RMAN scans the existing flash recovery area, and there I found archive logs in a backupset from the previous (dropped) database, and
because that archive log had a different incarnation ... the incarnation of my new database was changed too. And now we have strange RMAN behaviour.

RMAN> list backupset;

still displays valid backups for that incarnation

RMAN> restore database;

raises the error (see above)

Solution:

RMAN> reset database to incarnation xxx;

where xxx is the previous incarnation of the database.
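
To find the right xxx, the incarnation list shows the incarnation keys and which one is CURRENT. A small sketch that only generates the RMAN commands as text; the function name rman_reset_script is my own helper, and the key 2 in the usage line is just an example, so take the real value from the list output. Piping the text into rman as shown in the comment is an assumption about how you run RMAN on your box.

```shell
#!/bin/sh
# Sketch: print the RMAN commands for checking and resetting the incarnation.
# The incarnation key is supplied by the caller; nothing is executed here.

# rman_reset_script INCARNATION_KEY
rman_reset_script() {
    inc_key=$1
    cat <<EOF
list incarnation of database;
reset database to incarnation $inc_key;
EOF
}

# Possible usage against the mounted database (example key 2):
#   rman_reset_script 2 | rman target /
```

After the reset, restore database can be run the same way as before.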

I can understand that Oracle could use a backup from a previous incarnation in the new one (but why ?),
but why is there such a stupid error about datafile number 1 ?

Is it impossible to display something more useful, like "there is no backup for that incarnation" ?

ps.
All the databases have the same DBID - they are clones.
I know it is a bad idea to keep one DBID for many databases, but I thought that with an RMAN catalog there is no issue.

regards,
Marcin