
Problem(Abstract)

When there is a data transmission issue between the Primary and the HDR Secondary (usually caused by a network problem), client applications working with the Primary may become blocked and appear hung, even if data replication is configured to be asynchronous (DRINTERVAL > 0).

Symptom

Once the ping timeout is written to the online.log file of the Primary/Secondary instance (see the sample output below), user sessions return to normal operation.

11:27:57  DR: ping timeout                                             
11:27:57  DR: Receive error                                            
11:27:57  ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr =
: Network connection is broken.                                        
11:27:57  DR_ERR set to -1                                             
11:27:59  DR: Turned off on primary server    

Cause

When data replication is established, the Primary and Secondary regularly exchange ping messages. If the ping acknowledgement is not received within DRTIMEOUT seconds, the server re-sends the ping message three more times and then reports a ping timeout and turns off the DR subsystem. As a result, the time span between the first ping and the "DR: ping timeout" message can be as long as (DRTIMEOUT x 4).


For example, if DRTIMEOUT is set to 180 seconds, it will take 12 minutes before DR is turned off.
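The worst-case window can be computed directly. A minimal shell sketch (the value of DRTIMEOUT below is just the example figure from the text):

```shell
# Worst-case blocking window before "DR: ping timeout" is reported:
# the server waits DRTIMEOUT seconds for each of the 4 ping attempts
# (the first ping plus three retries).
DRTIMEOUT=180
WINDOW=$((DRTIMEOUT * 4))
echo "Worst-case window: ${WINDOW} seconds ($((WINDOW / 60)) minutes)"
# prints: Worst-case window: 720 seconds (12 minutes)
```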

Scenario #1:
Although with asynchronous replication transactions do not wait for acknowledgement from the HDR Secondary after the logical log record is placed in the DR buffer, during a transmission failure the DR buffer may fill up quickly (how quickly depends on the DRTIMEOUT value, the LOGBUFF value, and the activity on the instance). Until DR is turned off, a user session has to wait until the DR buffer has enough space for its logical log record.

Scenario #2:
In addition to the above scenario, a checkpoint can be requested on the Primary between the first ping failure and the time when the "DR: ping timeout" message is reported. Checkpoints are synchronous between the Primary and the Secondary regardless of the DRINTERVAL value. Once a checkpoint is requested, it prevents any threads from entering a critical section. The instance remains blocked until a checkpoint acknowledgement is received from the Secondary or until DR is turned off.


Diagnosing the problem

For scenario #1, check whether the corresponding user thread shows a stack similar to the following:


Stack for thread: 73 sqlexec                  
base: 0x0700000011abc000                      
len: 69632                                    
pc: 0x00000001000370f4                        
tos: 0x0700000011acafe0                       
state: sleeping                               
vp: 8                                         
                                              
0x00000001000370f4 (oninit)yield_processor_mvp
0x0000000100041f30 (oninit)mt_yield           
0x000000010076a5ac (oninit)cdrTimerWait
0x0000000100716908 (oninit)dr_buf_deq_int
0x00000001001fe3c0 (oninit)dr_logcopy         
0x00000001001f2d0c (oninit)logwrite
0x000000010011b7c4 (oninit)log_put            
0x0000000100121384 (oninit)logm_write         
0x00000001001f3e68 (oninit)logputx            
0x000000010017137c (oninit)rscommit           
0x000000010022b70c (oninit)iscommit           
0x00000001002865e4 (oninit)sqiscommit         
0x0000000100533d38 (oninit)committx           
0x0000000100536480 (oninit)commitcmd          
0x000000010053b01c (oninit)excommand          
0x000000010042893c (oninit)sq_execute         
0x000000010026becc (oninit)sqmain             
0x00000001002d51a4 (oninit)listen_verify      
0x00000001002d33b8 (oninit)spawn_thread       
0x0000000100e0b59c (oninit)startup       
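A stack like the one above can be captured for a specific thread with 'onstat -g stk <tid>', using the thread id shown by 'onstat -g ath'. One quick check is to look for the dr_buf_deq/dr_logcopy frames; the sketch below simulates that against frames copied from the sample stack, since it assumes no live instance is available:

```shell
# On a live instance you would run (thread id 73 is from the sample above):
#   onstat -g stk 73 | grep -cE "dr_buf_deq|dr_logcopy"
# Simulated here with frames taken from the sample stack:
hits=$(grep -cE "dr_buf_deq|dr_logcopy" <<'EOF'
0x000000010076a5ac (oninit)cdrTimerWait
0x0000000100716908 (oninit)dr_buf_deq_int
0x00000001001fe3c0 (oninit)dr_logcopy
0x00000001001f2d0c (oninit)logwrite
EOF
)
echo "DR buffer wait frames: ${hits}"
# prints: DR buffer wait frames: 2
```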

For scenario #2, check the 'onstat -g ath' output and see whether user threads are in the "cond wait cp" status.
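A quick way to count the affected threads is to grep the 'onstat -g ath' output for that status. The sketch below uses simulated sample lines; on a live instance you would pipe onstat directly:

```shell
# Count user threads blocked on the checkpoint condition. On a live instance:
#   onstat -g ath | grep -c "cond wait cp"
# Simulated here with illustrative 'onstat -g ath' lines:
blocked=$(grep -c "cond wait cp" <<'EOF'
 73  sqlexec  cond wait cp
 74  sqlexec  cond wait cp
 75  sqlexec  running
EOF
)
echo "Threads waiting on checkpoint: ${blocked}"
# prints: Threads waiting on checkpoint: 2
```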


Resolving the problem

To resolve the problem, it may be necessary to:

1) Fix any problems that can cause data transmission issues between the Primary and the HDR Secondary (e.g. improve network reliability and throughput).

2) Decrease the value of the DRTIMEOUT configuration parameter to shorten the worst-case blocking window.

Note: increasing LOGBUFF may also help reduce the blocking time; however, a large logical log buffer may result in data loss in the event of a Primary failure.
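For illustration, the relevant onconfig entries might look like the fragment below. The values are assumptions for the sketch, not recommendations; tune them for your environment and verify the current settings with 'onstat -c'.

```
DRTIMEOUT  30   # seconds the server waits for an HDR ping acknowledgement
LOGBUFF    64   # logical log buffer size (KB); a larger buffer can reduce
                # blocking but increases potential data loss on Primary failure
```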



http://www-01.ibm.com/support/docview.wss?uid=swg21643957
