Scenario: We placed lag monitoring scripts on the GoldenGate servers, and one of our extracts is running with a 17-hour lag! We are engaged :)
Initial investigation was on the database: checked for long-running transactions and blocking sessions. The database is clear.
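For reference, nothing fancy was needed for that check; a minimal SQL*Plus sketch of the kind of queries involved (standard v$transaction / v$session views; these are illustrative, not the exact queries from our monitoring scripts):

-- long-running open transactions, oldest first
SELECT s.sid, s.serial#, s.username, t.start_time, t.used_urec
  FROM v$transaction t
  JOIN v$session s ON s.taddr = t.addr
 ORDER BY t.start_time;

-- sessions currently blocked by another session
SELECT sid, serial#, blocking_session, event
  FROM v$session
 WHERE blocking_session IS NOT NULL;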
Then started looking at the GoldenGate components.
Checked the report file generated by the extract and found the following process information (a sketch of pulling up the report follows the log lines):
2015-10-05 07:59:03 INFO OGG-01738 BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p8606_Redo Thread 1: start=SeqNo: 118841, RBA: 25235984, SCN: 3326.2439638626 (14287500865122), Timestamp: 2015-10-05 07:59:02.000000, end=SeqNo: 118841, RBA: 25249280, SCN: 3326.2439638675 (14287500865171), Timestamp: 2015-10-05 07:59:02.000000, Thread: 1.
2015-10-05 07:59:03 INFO OGG-01738 BOUNDED RECOVERY: CHECKPOINT: for object pool 2: p8606_Redo Thread 2: start=SeqNo: 117802, RBA: 84960272, SCN: 3326.2439638649 (14287500865145), Timestamp: 2015-10-05 07:59:02.000000, Thread: 2, end=SeqNo: 117802, RBA: 84960768, SCN: 3326.2439638649 (14287500865145), Timestamp: 2015-10-05 07:59:02.000000, Thread: 2.
2015-10-05 07:59:03 INFO OGG-01738 BOUNDED RECOVERY: CHECKPOINT: for object pool 3: p8606_Redo Thread 3: start=SeqNo: 118013, RBA: 16379408, SCN: 3326.2439638623 (14287500865119), Timestamp: 2015-10-05 07:59:01.000000, Thread: 3, end=SeqNo: 118013, RBA: 16379904, SCN: 3326.2439638623 (14287500865119), Timestamp: 2015-10-05 07:59:01.000000, Thread: 3.
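Those BOUNDED RECOVERY checkpoint lines come straight out of the extract's report file. A quick sketch of pulling it up (the group name EXTGRP is taken from the info all output at the end of this post):

GGSCI> VIEW REPORT EXTGRP

or tail the same file on disk under the GoldenGate home:

$ tail -100 dirrpt/EXTGRP.rpt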
Bounded Recovery checkpoints like these show up when there are long-running transactions or batch jobs that run beyond the default Bounded Recovery interval (4 hours).
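The interval and the on-disk location are controlled by the BR parameter in the extract parameter file; a minimal sketch with the defaults spelled out (the group name and the rest of the file are assumptions, not our actual config):

EXTRACT EXTGRP
-- Bounded Recovery settings: BRINTERVAL defaults to 4 hours,
-- BRDIR defaults to the BR directory under the GoldenGate home
BR BRDIR ./BR, BRINTERVAL 4H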
You know that extract writes only committed transactions from the database to the trail, but batch jobs run on the database for a long time and contain both committed and uncommitted changes.
While extract is waiting on the commit, the redo for that work (committed and uncommitted) keeps getting flushed into the archive logs / fast recovery area on the database. If RMAN deletes those archive logs in the meantime, the transactions performed by the batch jobs would be lost to extract.
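Side note: for a classic extract you can also stop RMAN from purging logs that extract still needs by registering the group for log retention; a hedged GGSCI sketch (the ggadmin credentials are placeholders):

GGSCI> DBLOGIN USERID ggadmin, PASSWORD ********
GGSCI> REGISTER EXTRACT EXTGRP LOGRETENTION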
To avoid this problem, Oracle GoldenGate periodically persists the state of those long-running transactions to disk, in the BR directory under the GoldenGate home.
It saves the info for all long-running / uncommitted transactions in the BR directory under the GoldenGate home, and extract picks those records back up from there once the commit eventually comes in.
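You can see exactly which open transactions extract is holding, and for how long, with SHOWTRANS; a sketch (the group name and the 4-hour filter are assumptions):

GGSCI> SEND EXTRACT EXTGRP, SHOWTRANS DURATION 4 HOUR

Anything that stays open across a Bounded Recovery interval is what ends up persisted under the BR directory.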
Read checkpoints on the database
The current checkpoint and the recovery checkpoint are read checkpoints that track the redo log positions for the extract process:
current checkpoint - the position up to which all committed transactions have been written to the trail by extract
recovery checkpoint - the position in the redo where the oldest open (long-running / batch) transaction started
Write checkpoints in the trail
The write checkpoint is the position in the local trail up to which extract has written the captured data.
When the extract stops, the cached data is lost. When the GoldenGate extract restarts, it has to go back to the recovery checkpoint to read the redo log files, discard all committed transactions before the current checkpoint, and recover all of the cached data for the open transactions; Bounded Recovery keeps that rescan bounded by restoring the persisted transaction state from the BR directory instead of re-reading hours of redo.
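Both read checkpoints (current and recovery) and the write checkpoint can be dumped from GGSCI; a sketch, with the group name assumed:

GGSCI> INFO EXTRACT EXTGRP, SHOWCH

The recovery checkpoint section of that output tells you how far back in the redo extract would have to rewind on a restart.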
I believe that somewhere in this process the extract got stuck trying to find a record while reading from the BR directory. We restarted the extract; it performed Bounded Recovery and the lag cleared.
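The restart itself was just the standard stop/start from GGSCI (a sketch):

GGSCI> STOP EXTRACT EXTGRP
GGSCI> START EXTRACT EXTGRP

A status check right after confirmed the lag was gone: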
GGSCI (tnc569) 3> info all

Program     Status      Group       Lag at Chkpt    Time since Chkpt

MANAGER     RUNNING
EXTRACT     RUNNING     EXTGRP      00:00:00        00:00:03
Oracle in memory --- :)
Nikhil Tatineni ---