[SERVER-17297] Sync target occasionally changes incorrectly (or not?) Created: 16/Feb/15  Updated: 12/Jun/15  Resolved: 12/Jun/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.7
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Aristarkh Zagorodnikov Assignee: Sam Kleinman (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File logs.tar.bz2    
Issue Links:
Depends
depends on SERVER-17429 the message logged when changing sync... Closed
Related
Operating System: ALL
Participants:

 Description   

While examining our logs I've found some messages like these:

2015-02-15T02:28:04.928+0300 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Feb 15 02:27:28:1 which is more than 30 seconds behind member d1.s1.fs.drive.bru:27021 whose most recent OpTime is 1423956484
2015-02-15T02:28:04.942+0300 [rsBackgroundSync] replset setting syncSourceFeedback to d2.s1.fs.drive.bru:27021
2015-02-15T02:28:04.946+0300 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Feb 15 02:27:28:1 which is more than 30 seconds behind member d1.s1.fs.drive.bru:27021 whose most recent OpTime is 1423956484
2015-02-15T02:28:04.948+0300 [rsBackgroundSync] replset setting syncSourceFeedback to d2.s1.fs.drive.bru:27021
2015-02-15T02:28:04.951+0300 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Feb 15 02:27:28:1 which is more than 30 seconds behind member d1.s1.fs.drive.bru:27021 whose most recent OpTime is 1423956484
2015-02-15T02:28:04.953+0300 [rsBackgroundSync] replset setting syncSourceFeedback to d2.s1.fs.drive.bru:27021
2015-02-15T02:28:04.956+0300 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Feb 15 02:27:28:1 which is more than 30 seconds behind member d1.s1.fs.drive.bru:27021 whose most recent OpTime is 1423956484
2015-02-15T02:28:04.958+0300 [rsBackgroundSync] replset setting syncSourceFeedback to d2.s1.fs.drive.bru:27021
2015-02-15T02:28:04.962+0300 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Feb 15 02:27:28:1 which is more than 30 seconds behind member d1.s1.fs.drive.bru:27021 whose most recent OpTime is 1423956484
2015-02-15T02:28:04.963+0300 [rsBackgroundSync] replset setting syncSourceFeedback to d2.s1.fs.drive.bru:27021
2015-02-15T02:28:04.966+0300 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Feb 15 02:27:28:1 which is more than 30 seconds behind member d1.s1.fs.drive.bru:27021 whose most recent OpTime is 1423956484
2015-02-15T02:28:04.968+0300 [rsBackgroundSync] replset setting syncSourceFeedback to d2.s1.fs.drive.bru:27021
2015-02-15T02:28:04.971+0300 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Feb 15 02:27:28:1 which is more than 30 seconds behind member d1.s1.fs.drive.bru:27021 whose most recent OpTime is 1423956484
2015-02-15T02:28:04.972+0300 [rsBackgroundSync] replset setting syncSourceFeedback to d2.s1.fs.drive.bru:27021
2015-02-15T02:28:04.976+0300 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Feb 15 02:27:28:1 which is more than 30 seconds behind member d1.s1.fs.drive.bru:27021 whose most recent OpTime is 1423956484
2015-02-15T02:28:04.977+0300 [rsBackgroundSync] replset setting syncSourceFeedback to d2.s1.fs.drive.bru:27021
2015-02-15T02:28:04.981+0300 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Feb 15 02:27:28:1 which is more than 30 seconds behind member d1.s1.fs.drive.bru:27021 whose most recent OpTime is 1423956484
2015-02-15T02:28:04.982+0300 [rsBackgroundSync] replset setting syncSourceFeedback to d2.s1.fs.drive.bru:27021
2015-02-15T02:28:04.985+0300 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Feb 15 02:27:28:1 which is more than 30 seconds behind member d1.s1.fs.drive.bru:27021 whose most recent OpTime is 1423956484
2015-02-15T02:28:04.987+0300 [rsBackgroundSync] replset setting syncSourceFeedback to d2.s1.fs.drive.bru:27021
2015-02-15T02:28:04.990+0300 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Feb 15 02:27:28:1 which is more than 30 seconds behind member d1.s1.fs.drive.bru:27021 whose most recent OpTime is 1423956484
2015-02-15T02:28:04.991+0300 [rsBackgroundSync] replset setting syncSourceFeedback to d2.s1.fs.drive.bru:27021
2015-02-15T02:28:04.995+0300 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Feb 15 02:27:28:1 which is more than 30 seconds behind member d1.s1.fs.drive.bru:27021 whose most recent OpTime is 1423956484
2015-02-15T02:28:04.996+0300 [rsBackgroundSync] replset setting syncSourceFeedback to d2.s1.fs.drive.bru:27021
2015-02-15T02:28:05.000+0300 [rsBackgroundSync] changing sync target because current sync target's most recent OpTime is Feb 15 02:27:28:1 which is more than 30 seconds behind member d1.s1.fs.drive.bru:27021 whose most recent OpTime is 1423956484
2015-02-15T02:28:05.001+0300 [rsBackgroundSync] replset setting syncSourceFeedback to d2.s1.fs.drive.bru:27021
...

At first I suspected time sync issue, rechecked twice, found no problems (ntpd is up and running, deviation is less than one second, it wasn't applying time changes at the moment this problem manifested).

Since this only occurs at night (local time is GMT+3), and this replica set has very light write load, I suspect that the problem might be in no writes being done for extended periods of time (>30 seconds) and background sync complaining about this erroneously.



 Comments   
Comment by Sam Kleinman (Inactive) [ 12/Jun/15 ]

I've looked over this case again, and it seems that the sync target selection contains has the following rule:

.bq When chainingAllowed is false, a member will refuse to sync from a member that isn't primary, from it's perspective, even if the other member has more recent oplog entries than itself.

This could explain the message you're seeing. See the following document on chained replication. I'm going to go ahead and close this ticket for now, and sorry again about the confusion.

Regards,
sam

Comment by Aristarkh Zagorodnikov [ 29/May/15 ]

There are backups, but they are run about 2 hours later than the problem usually manifests itself. Also, the backups are done on different disks. Besides, the network and CPU are never saturated (weekly idle CPU minimum is ~63%, max 1-min load is 0.63, network sometimes peaks at 800Mbps, but when these events happen the utilization is about 150Mbps max).
Logging happen on all secondaries.
It happened on each day for the last week, can check further but I don't think the pattern would change.

Comment by Sam Kleinman (Inactive) [ 27/May/15 ]

Thanks for this information. Wanted to ask a couple follow up questions:

  • Is there some other task running on these machines (backups, reporting jobs, updates, etc.) that might account for the changes in sync source? System, network, or I/O systems could all have an impact.
  • Do you see this logging on all secondaries or just on one? If only one machine is affected is there anything else you can
  • Does this happen every night or only some nights?

Regards,
sam

Comment by Aristarkh Zagorodnikov [ 18/May/15 ]

driveFS-4:PRIMARY> rs.conf()
{
        "_id" : "driveFS-4",
        "version" : 12,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "d1.s4.fs.drive.bru:27024",
                        "tags" : {
                                "group" : "d1"
                        }
                },
                {
                        "_id" : 1,
                        "host" : "d2.s4.fs.drive.bru:27024",
                        "tags" : {
                                "group" : "d2"
                        }
                },
                {
                        "_id" : 3,
                        "host" : "d3.s4.fs.drive.bru:27024",
                        "tags" : {
                                "group" : "d3"
                        }
                }
        ],
        "settings" : {
                "chainingAllowed" : false
        }
}

Comment by Aristarkh Zagorodnikov [ 18/May/15 ]

Logs from all 3 members of replicaset for a couple of days, look for "changing sync target".

Comment by Aristarkh Zagorodnikov [ 18/May/15 ]

No problem, Ramón, I understand that the team has to prioritize problems to make real progress. Unfortunately, the logs still contain these messages (in a new format). I will upload new logs in the following hours.

Comment by Ramon Fernandez Marina [ 15/May/15 ]

Apologies for the long delay in getting back to you onyxmaster. We made some improvements to the log messages in SERVER-17429 (part of 2.6.9), but it is unclear as of yet why you're seeing these messages in the first place. If this is still an issue for you, could you please upload logs from all your replica set nodes and the output of rs.conf()? I think this information may help us understand the root cause of the messages you're seeing.

Thanks,
Ramón.

Generated at Thu Feb 08 03:43:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.