Friday, March 27, 2009

Oracle CRS (oprocd) TOC for clusterware integrity


oprocd: A Journey to the Unknown


Troubleshooting node reboots can be frustrating and is definitely a tedious endeavor. On the other hand, it can also be rewarding when you nail it with supporting facts and reliable results. One of the main components of Oracle RAC is oprocd. This tiny piece of software is so powerful that it can reboot a node over and over again if it deems it necessary for I/O fencing. It is therefore very important for us to understand oprocd. However, browsing through the Oracle manuals gives you very little information about it.

oprocd (Linux)

Up until Oracle Clusterware 10.2.0.3, Oracle RAC on Linux used the hangcheck-timer module to detect nodes that hang and stop responding because of hardware issues or failed devices. Starting with Oracle Clusterware 10.2.0.4, Oracle indicated it will also use oprocd.


oprocd (other O/S)

Oracle Clusterware 10.2.0.3 on all other operating systems, including HP-UX 11.31, uses oprocd to implement I/O fencing.

oprocd bits and pieces

Julian Dyke (http://www.juliandyke.com/Presentations/RACTroubleshooting.ppt) gave a formula on how the oprocd values change when diagwait is configured:

If diagwait > reboottime then OPROCD_DEFAULT_MARGIN := (diagwait - reboottime) * 1000
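As a rough illustration, the formula can be written as a small Python function. This is only a sketch of the rule quoted above, not Oracle's code: the function name and the sample values are made up for the example, and falling back to 500 ms when diagwait is unset or not greater than reboottime is my assumption, based on the default margin that shows up in the oprocd log further down.

```python
# Hypothetical helper illustrating the quoted formula; not taken from init.cssd.
DEFAULT_MARGIN_MS = 500  # default margin seen in the oprocd log in this post


def oprocd_margin_ms(diagwait_s, reboottime_s):
    """Return the oprocd margin in milliseconds.

    diagwait_s   -- diagwait in seconds, or None when it is not set in OCR
    reboottime_s -- reboottime in seconds
    """
    if diagwait_s is not None and diagwait_s > reboottime_s:
        return (diagwait_s - reboottime_s) * 1000
    # Assumption: in every other case the default margin stays in effect.
    return DEFAULT_MARGIN_MS


if __name__ == "__main__":
    # Sample values chosen only for illustration.
    print(oprocd_margin_ms(13, 3))    # diagwait larger than reboottime
    print(oprocd_margin_ms(None, 3))  # diagwait not set
```

With the illustrative values diagwait = 13 and reboottime = 3, the margin works out to (13 - 3) * 1000 = 10000 ms.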

You can see this logic for yourself in $ORA_CRS_HOME/css/admin/init.cssd
where $ORA_CRS_HOME is where you installed your Oracle Clusterware. The active version of this code is in /sbin/init.d/init.cssd on HP-UX 11.31. I believe it is in /etc/init.d on other operating systems.

Both diagwait and reboottime are stored in the Oracle Cluster Registry (OCR). When you start the clusterware with

crsctl start crs

Oracle gets these values from the OCR and computes the margin, as can be seen in

/opt/var/oracle/oprocd/nodename.oprocd.log:
Oct 29 15:47:14.700 | INF | monitoring started with timeout(1000), margin(500), skewTimeout(125)

The values are in milliseconds. The default margin is 500 milliseconds, which is the lowest value; you get it by not setting diagwait in the OCR, as in this case.
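If you want to check these settings across several nodes, a few lines of Python can pull them out of the log line. This is a minimal sketch: the function name is hypothetical, and the pattern assumes every setting is written as name(number), which I have only verified against the single sample line shown above.

```python
import re

# Sample line copied from the oprocd log shown in this post.
LINE = ("Oct 29 15:47:14.700 | INF | monitoring started with "
        "timeout(1000), margin(500), skewTimeout(125)")

# Assumption: each setting appears as name(number), in milliseconds.
SETTING = re.compile(r"(\w+)\((\d+)\)")


def parse_oprocd_settings(line):
    """Return the settings found in one 'monitoring started' line as a dict."""
    return {name: int(value) for name, value in SETTING.findall(line)}


if __name__ == "__main__":
    settings = parse_oprocd_settings(LINE)
    print(settings["timeout"], settings["margin"], settings["skewTimeout"])
```

A margin of 500 in the result tells you that diagwait is not in effect on that node.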

I was told by Oracle support that oprocd wakes up every minute to get the current time. If it is within a 500 ms range of the last result, it goes back to sleep; otherwise it reboots the node.

I found the above statement vague and misleading. It raised more questions than it answered:

How does oprocd get the time?
What does "within range of the last result" mean?

I've spent countless hours at night trying to decipher how oprocd works, in a quest to understand why it was rebooting each node about 6 times a day.

The following is my own personal opinion and understanding of how oprocd works. It does not reflect the views of Oracle or HP, which owns the original code of Oracle Clusterware since Oracle bought Tru64 Clustering Technology some years ago. Of course I could be wrong, or Oracle may change it in a newer version.

oprocd logic

oprocd gets the current time, saves this value, and sleeps for 1 second (1000 ms). When it wakes up, it gets the current time again and compares it with the saved time plus 1 second. If the difference is more than 0.5 second (500 ms), oprocd assumes that something is wrong with the node and reboots it, implementing I/O fencing as designed. In pseudo-code form,

margin_time = 0.5 second;
sleep_time = 1 second;
save_time = get current time;

start loop
    sleep for sleep_time;
    current_time = get current time;
    if abs(current_time - (save_time + sleep_time)) > margin_time
    then
        reboot;
    else
        save_time = current_time;
    end if
end loop
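To experiment with this logic safely, here is a runnable Python sketch of the same loop. It reflects my understanding above, not Oracle's implementation: the function and parameter names are made up, it returns the measured drift instead of rebooting, and the clock and sleep functions are passed in so you can simulate a stalled node. Using time.monotonic as the default clock is my choice for the sketch; I do not know which clock oprocd reads.

```python
import time


def watchdog(sleep_s=1.0, margin_s=0.5, now=time.monotonic,
             sleep=time.sleep, max_checks=None):
    """Run the wake-up check; return the drift that exceeded the margin.

    Returns None if max_checks wake-ups pass without exceeding the margin.
    """
    saved = now()
    checks = 0
    while max_checks is None or checks < max_checks:
        sleep(sleep_s)
        current = now()
        # How far off are we from the moment we expected to wake up?
        drift = abs(current - (saved + sleep_s))
        if drift > margin_s:
            return drift  # the real oprocd would reboot the node here
        saved = current
        checks += 1
    return None


if __name__ == "__main__":
    # Simulated clock: the third wake-up arrives 0.9 s later than expected.
    ticks = iter([0.0, 1.0, 2.1, 4.0])
    drift = watchdog(now=lambda: next(ticks), sleep=lambda s: None)
    print("margin exceeded, drift in seconds:", drift)
```

In the simulated run, the first two wake-ups are on time or only slightly late, so the loop keeps going; the third one is late by more than the 0.5 second margin, which is the point where the node would be rebooted.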

How can this simple logic fail and cause headaches for DBAs, Project Managers (PMs), and everyone involved in the Oracle RAC deployment project?

Watch out for the answers next time...

Tuesday, March 10, 2009

My First

I have been planning to share my professional experience through blogging, but I have always been too busy to start.

Finally, I'm here doing it.

More to come...