Problem(Abstract)
When a data transmission problem occurs between the Primary and the HDR Secondary (usually caused by a network problem), client applications working with the Primary may become blocked and appear hung, even if data replication is configured to be asynchronous (DRINTERVAL > 0).
Symptom
Once the ping timeout is written to the online.log file of the Primary/Secondary instance (see the sample output below), user sessions return to normal work.
11:27:57 DR: ping timeout
11:27:57 DR: Receive error
11:27:57 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr =
: Network connection is broken.
11:27:57 DR_ERR set to -1
11:27:59 DR: Turned off on primary server
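The following is a minimal sketch of how the two key messages in the excerpt above could be picked out of an online.log file. It assumes each line starts with an HH:MM:SS timestamp, as in the sample. The function name find_dr_events is hypothetical, and a real log may use a different timestamp layout.

```python
import re

# Message texts copied from the sample online.log excerpt above.
DR_EVENTS = ("DR: ping timeout", "DR: Turned off on primary server")


def find_dr_events(log_text):
    """Return (timestamp, message) pairs for the DR timeout and shutdown lines."""
    events = []
    for line in log_text.splitlines():
        # Assumption: lines look like "HH:MM:SS message", as in the sample.
        match = re.match(r"\s*(\d{2}:\d{2}:\d{2})\s+(.*\S)\s*$", line)
        if match and match.group(2) in DR_EVENTS:
            events.append((match.group(1), match.group(2)))
    return events


sample = """\
11:27:57 DR: ping timeout
11:27:57 DR: Receive error
11:27:57 DR_ERR set to -1
11:27:59 DR: Turned off on primary server
"""

for timestamp, message in find_dr_events(sample):
    print(timestamp, message)
```

The time between these two messages and the moment sessions first appeared blocked gives a rough idea of how long the instance was affected.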
Cause
When data replication is established, the primary and secondary regularly exchange ping messages. If a ping acknowledgment is not received before DRTIMEOUT elapses, the server re-sends the ping message three more times, then reports a ping timeout and turns off the DR subsystem. As a result, the time span between the first ping and the "DR: ping timeout" message can be as large as (DRTIMEOUT x 4).
For example, if DRTIMEOUT is set to 180 seconds, it can take up to 12 minutes before DR is turned off.
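The arithmetic above can be sketched as a small calculation. The function name and the 30 and 60 second values are illustrative assumptions. Only the 180 second case comes from the text.

```python
# First ping plus three re-sends, as described above.
PING_ATTEMPTS = 4


def worst_case_blocked_seconds(drtimeout_seconds):
    """Upper bound on the time between the first ping and "DR: ping timeout"."""
    return drtimeout_seconds * PING_ATTEMPTS


for drtimeout in (30, 60, 180):
    seconds = worst_case_blocked_seconds(drtimeout)
    print(f"DRTIMEOUT={drtimeout}s -> up to {seconds}s ({seconds / 60:g} min)")
```

This makes the trade-off behind lowering DRTIMEOUT visible: the worst-case blockage shrinks in direct proportion to the parameter.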
Scenario #1:
With asynchronous replication, transactions do not wait for an acknowledgment from the HDR secondary once the logical log record has been put in the DR buffer. However, when there is a transmission failure, the DR buffer can fill up quickly (how quickly depends on the DRTIMEOUT value, the LOGBUFF value, and the activity on the instance). Until DR is turned off, a user session has to wait until the DR buffer has enough space for the logical log record.
Scenario #2:
In addition to the above scenario, a checkpoint can be requested on the Primary between the first ping failure and the time when the "DR: ping timeout" message is reported. Checkpoints are synchronous between the Primary and the Secondary regardless of the DRINTERVAL value. Once a checkpoint is requested, it prevents any thread from entering a critical section. The instance remains blocked until the checkpoint acknowledgment is received from the Secondary or until DR is turned off.
Diagnosing the problem
For scenario #1, check whether the corresponding user thread shows a stack similar to the following:
Stack for thread: 73 sqlexec
base: 0x0700000011abc000
len: 69632
pc: 0x00000001000370f4
tos: 0x0700000011acafe0
state: sleeping
vp: 8
0x00000001000370f4 (oninit)yield_processor_mvp
0x0000000100041f30 (oninit)mt_yield
0x000000010076a5ac (oninit)cdrTimerWait
0x0000000100716908 (oninit)dr_buf_deq_int
0x00000001001fe3c0 (oninit)dr_logcopy
0x00000001001f2d0c (oninit)logwrite
0x000000010011b7c4 (oninit)log_put
0x0000000100121384 (oninit)logm_write
0x00000001001f3e68 (oninit)logputx
0x000000010017137c (oninit)rscommit
0x000000010022b70c (oninit)iscommit
0x00000001002865e4 (oninit)sqiscommit
0x0000000100533d38 (oninit)committx
0x0000000100536480 (oninit)commitcmd
0x000000010053b01c (oninit)excommand
0x000000010042893c (oninit)sq_execute
0x000000010026becc (oninit)sqmain
0x00000001002d51a4 (oninit)listen_verify
0x00000001002d33b8 (oninit)spawn_thread
0x0000000100e0b59c (oninit)startup
For scenario #2, check the 'onstat -g ath' output and see whether the user threads have the "cond wait cp" status.
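As a rough sketch, the scenario #2 check could be automated by filtering the 'onstat -g ath' output for the status named above. This assumes the status appears on each thread's line as "cond wait" followed by "cp". The exact column layout varies, so the pattern may need adjusting, and both function names are hypothetical.

```python
import re
import subprocess

# Assumption: the status is printed as "cond wait" followed by "cp" on the thread's line.
CP_WAIT = re.compile(r"\bcond wait\s+cp\b")


def capture_ath_output():
    """Run 'onstat -g ath' and return its text output (needs the Informix environment set)."""
    result = subprocess.run(
        ["onstat", "-g", "ath"], capture_output=True, text=True, check=True
    )
    return result.stdout


def threads_waiting_on_checkpoint(ath_output):
    """Return the output lines whose status indicates a wait on the checkpoint condition."""
    return [line.strip() for line in ath_output.splitlines() if CP_WAIT.search(line)]
```

Calling threads_waiting_on_checkpoint(capture_ath_output()) on the Primary while sessions appear hung would list the candidate threads. An empty list suggests that scenario #2 is not the cause.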
Resolving the problem
To resolve the problem, it may be necessary to:
1) Fix any problems that can cause data transmission issues between Primary and HDR Secondary (e.g. increase network reliability and throughput)
2) Decrease the value of the DRTIMEOUT configuration parameter.
Note: Increasing LOGBUFF may also help to reduce the blockage time; however, a large logical log buffer may result in data loss if the Primary fails.
http://www-01.ibm.com/support/docview.wss?uid=swg21643957