

Problem(Abstract)

When there is a data transmission issue between the Primary and the HDR Secondary (usually caused by a network problem), client applications working with the Primary may become blocked and appear to hang, even if data replication is configured to be asynchronous (DRINTERVAL > 0).

Symptom

Once the ping timeout is written to the online.log file of the Primary/Secondary instance (see the sample output below), user sessions return to normal operation.

11:27:57  DR: ping timeout                                             
11:27:57  DR: Receive error                                            
11:27:57  ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr =
: Network connection is broken.                                        
11:27:57  DR_ERR set to -1                                             
11:27:59  DR: Turned off on primary server    

Cause

When data replication is established, the Primary and Secondary regularly exchange ping messages. If the ping acknowledgement is not received before DRTIMEOUT elapses, the server re-sends the ping message three more times, then reports a ping timeout and turns off the DR subsystem. Consequently, the time span between the first ping and the "DR: ping timeout" message can be as large as (DRTIMEOUT x 4).


For example, if DRTIMEOUT is set to 180 seconds, it can take up to 12 minutes before DR is turned off.
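The (DRTIMEOUT x 4) window above can be sketched as a one-line calculation (an illustration of the arithmetic, not Informix code):

```python
# Sketch: worst-case delay before "DR: ping timeout" is reported.
# The server waits DRTIMEOUT for the first acknowledgement, then
# retries the ping three more times, so the window is 4 x DRTIMEOUT.

def ping_timeout_window(drtimeout_seconds: int) -> int:
    """Return the worst-case seconds between the first unanswered
    ping and the "DR: ping timeout" message (initial wait + 3 retries)."""
    return 4 * drtimeout_seconds

print(ping_timeout_window(180))   # 720 seconds = 12 minutes
```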

Scenario #1:
Although with asynchronous replication transactions do not wait for an acknowledgement from the HDR Secondary once the logical log record has been placed in the DR buffer, when there is a transmission failure the DR buffer may fill up quickly (the time required for this depends on the DRTIMEOUT value, the LOGBUFF value, and the activity on the instance). Until DR is turned off, a user session has to wait until the DR buffer has enough space for its logical log record.
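A back-of-the-envelope sketch of why the buffer fills so fast (assumptions: the DR buffer is LOGBUFF-sized, as the second article below notes, and the log generation rate is a hypothetical sustained figure):

```python
# Rough estimate (not Informix internals): if nothing is being drained
# to the Secondary, a DR buffer of roughly LOGBUFF size fills at the
# rate the instance generates logical log records.

def seconds_until_dr_buffer_full(logbuff_kb: float, log_rate_kb_per_s: float) -> float:
    """Estimated seconds for a ~LOGBUFF-sized DR buffer to fill
    when transmission to the Secondary has stalled."""
    return logbuff_kb / log_rate_kb_per_s

# e.g. a 64 KB buffer filling at 32 KB/s is full in ~2 seconds,
# long before the (DRTIMEOUT x 4) window expires.
print(seconds_until_dr_buffer_full(64, 32))
```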

Scenario #2:
In addition to the above scenario, a checkpoint can be requested on the Primary between the first ping failure and the time when the "DR: ping timeout" message is reported. Checkpoints are synchronous between the Primary and the Secondary regardless of the DRINTERVAL value. Once a checkpoint is requested, it prevents any threads from entering a critical section. The instance will remain blocked until the checkpoint acknowledgement is received from the Secondary or until DR is turned off.


Diagnosing the problem

For scenario #1, check whether the corresponding user thread shows a stack similar to the following:


Stack for thread: 73 sqlexec                  
base: 0x0700000011abc000                      
len: 69632                                    
pc: 0x00000001000370f4                        
tos: 0x0700000011acafe0                       
state: sleeping                               
vp: 8                                         
                                              
0x00000001000370f4 (oninit)yield_processor_mvp
0x0000000100041f30 (oninit)mt_yield           
0x000000010076a5ac (oninit)cdrTimerWait
0x0000000100716908 (oninit)dr_buf_deq_int
0x00000001001fe3c0 (oninit)dr_logcopy         
0x00000001001f2d0c (oninit)logwrite
0x000000010011b7c4 (oninit)log_put            
0x0000000100121384 (oninit)logm_write         
0x00000001001f3e68 (oninit)logputx            
0x000000010017137c (oninit)rscommit           
0x000000010022b70c (oninit)iscommit           
0x00000001002865e4 (oninit)sqiscommit         
0x0000000100533d38 (oninit)committx           
0x0000000100536480 (oninit)commitcmd          
0x000000010053b01c (oninit)excommand          
0x000000010042893c (oninit)sq_execute         
0x000000010026becc (oninit)sqmain             
0x00000001002d51a4 (oninit)listen_verify      
0x00000001002d33b8 (oninit)spawn_thread       
0x0000000100e0b59c (oninit)startup       

For scenario #2, check the 'onstat -g ath' output and see whether the user threads have "cond wait cp" status.
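A small helper for that check could look like the sketch below. It scans saved `onstat -g ath` output for threads blocked on the checkpoint condition; the sample text is illustrative, not real onstat output.

```python
# Sketch: count threads whose status column in `onstat -g ath` output
# shows "cond wait cp" (blocked waiting for the checkpoint).
# SAMPLE is made-up illustrative output, not a real capture.

SAMPLE = """\
 tid tcb              rstcb            prty status       vp-class name
 45  700000011aa1000  700000011bb1000  2    cond wait cp 1cpu     sqlexec
 46  700000011aa2000  700000011bb2000  2    running      1cpu     sqlexec
 47  700000011aa3000  700000011bb3000  2    cond wait cp 1cpu     sqlexec
"""

def count_cp_waiters(onstat_ath_output: str) -> int:
    """Count lines whose status shows 'cond wait cp'."""
    return sum(1 for line in onstat_ath_output.splitlines()
               if "cond wait cp" in line)

print(count_cp_waiters(SAMPLE))   # 2
```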


Resolving the problem

To resolve the problem it may be necessary to:

1) Fix any problems that can cause data transmission issues between the Primary and the HDR Secondary (e.g. improve network reliability and throughput).

2) Decrease the value of the DRTIMEOUT configuration parameter.

Note: increasing LOGBUFF may also help to reduce the blockage time; however, a large logical log buffer may result in data loss in case of a Primary failure.



http://www-01.ibm.com/support/docview.wss?uid=swg21643957



Problem(Abstract)

Sometimes you can get "ping timeout" and "send error" messages in the online.log even though the network environment checks out as normal, and the HDR relationship ends up broken. Why does this occur?

Symptom

ping timeout, receive error, send error

Cause

A ping timeout occurs if the "DR_MSG_PING" message cannot flow between the Primary and the Secondary, or if the acknowledgement takes longer than 4 x DRTIMEOUT. "DR_MSG_PING" is a message type placed in the DR buffer queue, so it must wait for DR buffer space; therefore a ping timeout may occur because the DR buffer is full or because the ping message has lower priority than the logical log buffer.

A logical log buffer that cannot be transferred may likewise lead to the ping timeout.


Environment

HDR environment

Diagnosing the problem

The DR buffer is the same size as the logical log buffer, and "DR_MSG_PING" messages are stored in the DR buffer, so the DR buffer can be adjusted by configuring LOGBUFF.

The Primary server sends logical logs to the Secondary server to keep the data consistent, as follows:
1. Primary: logical log buffer -> DR buffer.
2. Primary: the dr_prsend thread sends these logical logs across the network, using TCP/IP, to the DR buffer on the Secondary server.
3. Secondary: the dr_secrecv thread receives the logical logs on the Secondary.

The HDR Primary and Secondary servers ping each other and must receive an acknowledgement within an appointed time; otherwise the HDR relationship is broken with a "ping timeout" error. The acknowledgement window is 4 times the DRTIMEOUT value.
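The ping/acknowledgement exchange described above can be sketched as follows (an illustration of the retry behaviour, not server code; the function name and parameters are hypothetical):

```python
# Sketch: after a ping goes unanswered for DRTIMEOUT seconds, the server
# retries three more times before declaring "DR: ping timeout" and
# breaking the HDR relationship.

def ping_exchange(ack_arrives_after: float, drtimeout: float) -> str:
    """Simulate the exchange. `ack_arrives_after` is when the
    acknowledgement would arrive, in seconds after the first ping."""
    attempts = 4                          # initial ping + 3 retries
    for attempt in range(1, attempts + 1):
        deadline = attempt * drtimeout    # each retry extends the wait
        if ack_arrives_after <= deadline:
            return "ack received"
    return "DR: ping timeout"             # HDR relationship is broken

print(ping_exchange(ack_arrives_after=500, drtimeout=180))   # ack received
print(ping_exchange(ack_arrives_after=900, drtimeout=180))   # DR: ping timeout
```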


Resolving the problem

To avoid the "ping timeout" error, consider the following points:

1. Increase the LOGBUFF value to enlarge the DR buffer so that it can hold more signal messages.
2. A hang on the Secondary server may lead to the "ping timeout" because the DR buffer content cannot be received immediately, or because DR_MSG_PING has lower priority.
3. A long checkpoint duration on the Secondary server can also cause the timeout.



http://www-01.ibm.com/support/docview.wss?uid=swg21413380



Informix Enterprise Replication (CDR) vs. Informix MACH11 (HDR, SDS and RSS)

ER:     Replication granularity is at the table and column level.
MACH11: Replication granularity is at the instance level.

ER:     Supports hierarchical routing (root, non-root and leaf servers).
MACH11: All secondary servers have to be directly connected to the primary server.

ER:     Supports update-anywhere, data consolidation and data dissemination models.
MACH11: Secondary servers are read-only and can be used only for reporting activity.

ER:     Supports heterogeneous replication, i.e. ER can replicate data between 11.10 and 7.31 servers.
MACH11: Primary and secondary server versions have to be the same.

ER:     ER servers don't have to be running on the same operating system platform.
MACH11: The hardware platform has to be the same between the primary and all secondary servers.

ER:     ER needs a primary key for all replicated tables.
MACH11: Doesn't need a primary key for replication.

ER:     Can co-exist in a MACH11 environment.
MACH11: Can co-exist with ER.

ER:     Database must be a logging database.
MACH11: Database must be a logging database.

ER:     Supports blobspace blobs along with smartblobs and partition blobs.
MACH11: Doesn't support blobspace blobs; supports smartblobs and partition blobs.

ER:     Source and target must use the same code set for replicated tables.
MACH11: Database code set must be the same between primary and secondary servers.

ER:     Supports network encryption.
MACH11: Supports network encryption.

ER:     Supports compression before transmitting data through the network.
MACH11: Doesn't support data compression.


http://www.inturi.net/coranto/viewnews.cgi?id=EkpuAFpkkyOKeKEaaY

