Friday, October 30, 2015

GoldenGate Extract : Bounded Recovery


Scenario: Placed lag monitoring scripts on goldengate servers. One of our extract is running with 17 hour lag ! We are engaged  :) 

Initial investigation on database and checked for long running transactions and blocking sessions on database. Database is clear  
Then started looking at goldengate components
Checked for report file generated by the extract found process information as follows 

2015-10-05 07:59:03  INFO    OGG-01738  BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p8606_Redo Thread 1: start=SeqNo: 118841, RBA: 25235984, SCN: 3326.2439638626 (14287500865122), Timestamp: 2015-10-05 07:59:02.000000, end=SeqNo: 118841, RBA: 25249280, SCN: 3326.2439638675 (14287500865171), Timestamp: 2015-10-05 07:59:02.000000, Thread: 1.
2015-10-05 07:59:03  INFO    OGG-01738  BOUNDED RECOVERY: CHECKPOINT: for object pool 2: p8606_Redo Thread 2: start=SeqNo: 117802, RBA: 84960272, SCN: 3326.2439638649 (14287500865145), Timestamp: 2015-10-05 07:59:02.000000, Thread: 2, end=SeqNo: 117802, RBA: 84960768, SCN: 3326.2439638649 (14287500865145), Timestamp: 2015-10-05 07:59:02.000000, Thread: 2.
2015-10-05 07:59:03  INFO    OGG-01738  BOUNDED RECOVERY: CHECKPOINT: for object pool 3: p8606_Redo Thread 3: start=SeqNo: 118013, RBA: 16379408, SCN: 3326.2439638623 (14287500865119), Timestamp: 2015-10-05 07:59:01.000000, Thread: 3, end=SeqNo: 118013, RBA: 16379904, SCN: 3326.2439638623 (14287500865119), Timestamp: 2015-10-05 07:59:01.000000, Thread: 3.

Bounded Recovery happens when there are long running transaction or batch jobs that runs beyond the default bounded recovery time (4 hours)  
You know that extract collects only committed transactions from database but in the case of batch jobs, they run for long time on database and  have both committed and uncommitted transactions 
If the extract is waiting for commit for longer time and mean while uncommitted/committed data is flushed into the archive logs / flash recovery area on database 
At this point if the archive logs are deleted by RMAN, the transactions performed by the batch jobs will be lost 
To avoid these problems, Oracle / GG team developed the methodology to save those transactions to the disk / goldengate home directory BR 
It will save all long running transaction / uncommitted records info in BR directory under goldengate home, where extract reads records beyond whenever commit is performed 

Read checkpoints on database 
The current checkpoint and the recovery checkpoint are read checkpoints which keeps the redo log positions related to the extract process:
current checkpoint " up to here all committed transactions are written to trail by extract 
&
recovery checkpoint " is the position when the   old or long or batch processing transaction starts 
write checkpoints in the trail
When the extract stops, the cached data will be lost. When GoldenGate extract restarts, it has to go back to the recovery checkpoint to read the redo log files, discard all committed transactions before the current checkpoint, and recover all of the cached data for open 

Root Cause: 
I believe, in this process extract is struck in finding a record during reading records from BR directory. Restarted extract,It performed Bounded  recovery, lag is cleared 

GGSCI (tnc569) 3> info all
Program     Status      Group                Lag at Chkpt            Time since Chkpt
MANAGER     RUNNING
EXTRACT     RUNNING     EXTGRP     00:00:00      00:00:03


Oracle in memory --- :) 
Nikhil Tatineni --- 



Querys to monitor RAC

following few  Query's will help to find out culprits-  Query to check long running transaction from last 8 hours  Col Sid Fo...