[SERVER-13117] Secondary repeatedly fails to sync Created: 10/Mar/14  Updated: 22/Oct/15  Resolved: 14/Mar/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.5, 2.4.6, 2.4.7, 2.4.8, 2.4.9
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Ryan Witt Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File m5.clean.log.gz    
Issue Links:
Related
related to SERVER-21068 Ability to resume initial sync attempts Closed
Operating System: ALL
Participants:

 Description   

I've gotten this error many tens of times on 2.4.x

Symptoms:

  • Secondary clones dbs
  • Secondary builds indexes
  • Immediately upon completion, secondary resyncs again
  • My countenance darkens and my heart sinks into despair
  • Sometimes (maybe 10% of the time), it actually finishes initial sync

Setup

  • Replica set with ~1.5TB data
  • Biggest collection ~500GB data, 80M docs with 4 indexes 3GB-5GB each
  • 1 Primary, 3 replicas (1 trying to resync), 1 arbiter
  • Typically 8k-15k active connections
  • AWS, servers spread across availability zones
  • Initial sync takes 1-2 days
  • Oplog is ~5 days long (output below)
  • Seems worse on larger replica sets with more data (e.g., one that includes a 1.5TB db with large compound indexes)

# oplog size
rs1:PRIMARY> db.version()
2.4.8
rs1:PRIMARY> db.printReplicationInfo()
configured oplog size:   79149.0830078125MB
log length start to end: 459190secs (127.55hrs)
oplog first event time:  Wed Mar 05 2014 09:25:53 GMT+0000 (UTC)
oplog last event time:   Mon Mar 10 2014 16:59:03 GMT+0000 (UTC)
now:                     Mon Mar 10 2014 16:59:03 GMT+0000 (UTC)

The slightly anonymized log below contains the moment where the initial sync fails. The primary is m5. The secondary trying to sync is m4. There is some traffic on the servers, which causes slow responses. Connection spam is filtered out.

Mon Mar 10 16:51:11.048 [rsSync]                Index: (1/3) External Sort Progress: 6932500/8429282    82%
Mon Mar 10 16:51:12.389 [rsHealthPoll] couldn't connect to m5.example.com:27017: couldn't connect to server m5.example.com:27017
Mon Mar 10 16:51:18.390 [rsHealthPoll] couldn't connect to m5.example.com:27017: couldn't connect to server m5.example.com:27017
Mon Mar 10 16:51:21.013 [rsSync]                Index: (1/3) External Sort Progress: 7114000/8429282    84%
Mon Mar 10 16:51:24.391 [rsHealthPoll] replSet info m5.example.com:27017 is down (or slow to respond):
Mon Mar 10 16:51:24.391 [rsHealthPoll] replSet member m5.example.com:27017 is now in state DOWN
Mon Mar 10 16:51:26.405 [rsHealthPoll] replSet member m5.example.com:27017 is up
Mon Mar 10 16:51:26.406 [rsHealthPoll] replSet member m5.example.com:27017 is now in state PRIMARY
Mon Mar 10 16:51:31.011 [rsSync]                Index: (1/3) External Sort Progress: 7288600/8429282    86%
Mon Mar 10 16:51:41.011 [rsSync]                Index: (1/3) External Sort Progress: 7488900/8429282    88%
Mon Mar 10 16:51:47.191 [rsHealthPoll] couldn't connect to m8.example.com:27017: couldn't connect to server m8.example.com:27017
Mon Mar 10 16:51:51.026 [rsSync]                Index: (1/3) External Sort Progress: 7619400/8429282    90%
Mon Mar 10 16:51:53.192 [rsHealthPoll] couldn't connect to m8.example.com:27017: couldn't connect to server m8.example.com:27017
Mon Mar 10 16:51:54.980 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:52:01.057 [rsSync]                Index: (1/3) External Sort Progress: 7717000/8429282    91%
Mon Mar 10 16:52:11.568 [rsSync]                Index: (1/3) External Sort Progress: 7825800/8429282    92%
Mon Mar 10 16:52:21.009 [rsSync]                Index: (1/3) External Sort Progress: 7940600/8429282    94%
Mon Mar 10 16:52:31.007 [rsSync]                Index: (1/3) External Sort Progress: 8049300/8429282    95%
Mon Mar 10 16:52:41.003 [rsSync]                Index: (1/3) External Sort Progress: 8151500/8429282    96%
Mon Mar 10 16:52:51.008 [rsSync]                Index: (1/3) External Sort Progress: 8288500/8429282    98%
Mon Mar 10 16:52:59.235 [rsSync]         external sort used : 9 files  in 1038 secs
Mon Mar 10 16:53:09.003 [rsSync]                Index: (2/3) BTree Bottom Up Progress: 1120400/8429282  13%
Mon Mar 10 16:53:19.003 [rsSync]                Index: (2/3) BTree Bottom Up Progress: 2620200/8429282  31%
Mon Mar 10 16:53:29.003 [rsSync]                Index: (2/3) BTree Bottom Up Progress: 4067100/8429282  48%
Mon Mar 10 16:53:39.003 [rsSync]                Index: (2/3) BTree Bottom Up Progress: 5539300/8429282  65%
Mon Mar 10 16:53:49.003 [rsSync]                Index: (2/3) BTree Bottom Up Progress: 6962200/8429282  82%
Mon Mar 10 16:53:58.343 [rsSync]         done building bottom layer, going to commit
Mon Mar 10 16:54:00.684 [rsSync] build index done.  scanned 8429282 total records. 1100.07 secs
Mon Mar 10 16:54:00.684 [rsSync] build index cxxx_pxxx.day { stat: 1.0 }
Mon Mar 10 16:54:00.721 [rsSync] build index done.  scanned 0 total records. 0.037 secs
Mon Mar 10 16:54:00.721 [rsSync] build index cxxx_pxxx.day { caxxx: 1.0 }
Mon Mar 10 16:54:00.723 [rsSync] build index done.  scanned 0 total records. 0.001 secs
Mon Mar 10 16:54:00.723 [rsSync] build index cxxx_pxxx.day { bucket: 1.0 }
Mon Mar 10 16:54:00.724 [rsSync] build index done.  scanned 0 total records. 0 secs
Mon Mar 10 16:54:00.724 [rsSync] build index cxxx_pxxx.day { bucket: -1.0 }
Mon Mar 10 16:54:00.747 [rsSync] build index done.  scanned 0 total records. 0.023 secs
Mon Mar 10 16:54:00.747 [rsSync] build index cxxx_pxxx.day { caxxx: 1.0, stat: 1.0, bucket: 1.0 }
Mon Mar 10 16:54:00.749 [rsSync] build index done.  scanned 0 total records. 0.001 secs
Mon Mar 10 16:54:00.749 [rsSync] build index cxxx_pxxx.day { caxxx: 1.0, stat: 1.0, bucket: -1.0 }
Mon Mar 10 16:54:00.750 [rsSync] build index done.  scanned 0 total records. 0 secs
Mon Mar 10 16:54:00.750 [rsSync] replSet initial sync cloning indexes for : clxxxx
Mon Mar 10 16:54:00.860 [rsSync] Socket say send() errno:110 Connection timed out m5.example.com:27017
Mon Mar 10 16:54:00.923 [rsSync] replSet initial sync exception: 9001 socket exception [SEND_ERROR] server [m5.example.com:27017]  9 attempts remaining
Mon Mar 10 16:54:17.663 [rsHealthPoll] DBClientCursor::init call() failed
Mon Mar 10 16:54:17.663 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:54:18.663 [rsHealthPoll] replSet info m8.example.com:27017 is down (or slow to respond):
Mon Mar 10 16:54:18.663 [rsHealthPoll] replSet member m8.example.com:27017 is now in state DOWN
Mon Mar 10 16:54:26.664 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:54:26.884 [rsHealthPoll] replSet member m8.example.com:27017 is up
Mon Mar 10 16:54:26.884 [rsHealthPoll] replSet member m8.example.com:27017 is now in state SECONDARY
Mon Mar 10 16:54:30.941 [rsSync] replSet initial sync pending
Mon Mar 10 16:54:30.941 [rsSync] replSet syncing to: m5.example.com:27017
Mon Mar 10 16:54:31.042 [rsSync] replSet initial sync drop all databases
Mon Mar 10 16:54:31.061 [rsSync] dropAllDatabasesExceptLocal 19
Mon Mar 10 16:54:31.123 [rsSync] removeJournalFiles
Mon Mar 10 16:54:48.391 [rsHealthPoll] DBClientCursor::init call() failed
Mon Mar 10 16:54:48.391 [rsHealthPoll] replset info m9.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:54:48.923 [rsHealthPoll] DBClientCursor::init call() failed
Mon Mar 10 16:54:48.923 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:54:49.391 [rsHealthPoll] replSet info m9.example.com:27017 is down (or slow to respond):
Mon Mar 10 16:54:49.391 [rsHealthPoll] replSet member m9.example.com:27017 is now in state DOWN
Mon Mar 10 16:54:49.923 [rsHealthPoll] DBClientCursor::init call() failed
Mon Mar 10 16:54:49.923 [rsHealthPoll] replSet info m8.example.com:27017 is down (or slow to respond):
Mon Mar 10 16:54:49.923 [rsHealthPoll] replSet member m8.example.com:27017 is now in state DOWN
Mon Mar 10 16:54:52.052 [rsHealthPoll] replSet member m8.example.com:27017 is up
Mon Mar 10 16:54:52.052 [rsHealthPoll] replSet member m8.example.com:27017 is now in state SECONDARY
Mon Mar 10 16:54:53.217 [rsSync] removeJournalFiles

What more can I share that would help debug this problem? Does it seem like initial sync is sensitive to high load?



 Comments   
Comment by Eric Milkie [ 13/Mar/14 ]

1) You're totally right about the docs; I edited that page and pushed a fix just now to the 2.6 docs repo (it will be published to the web later).
2) We have many improvement ideas for initial sync; its slowness and its current "all-or-nothing" approach to error recovery leave a lot to be desired. There were no silver-bullet initial-sync improvements in 2.6, but we definitely want to make improvements for the 2.8 release.

Comment by Ryan Witt [ 13/Mar/14 ]

Yes, it looks like it's on the secondary indexes right now. Thanks! A couple more questions:

1) Is this behavior documented somewhere? This didn't matter as much at smaller data volumes, but it would be really helpful to know for big replica sets. Maybe it could be added to http://docs.mongodb.org/manual/core/replica-set-sync/, where I got the impression that there would only be one round of index building.
2) Am I doing something wrong by running multi-TB replica sets?

The fragility of initial sync is a little disappointing, but it looks like filesystem snapshotting tools are still the tried-and-true workaround for seeding a node (rough sketch of that approach below).
Looks to me like this is a "we wish it were better, but don't have a fix now" type of thing. Thanks for your help, Eric!
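
For the record, the snapshot-based seeding I have in mind on EBS looks roughly like this (volume/instance IDs, device, and paths are placeholders):

# freeze writes on a healthy secondary so its data files are consistent on disk
# (needed when the journal does not share a volume with the data files)
mongo --eval 'db.fsyncLock()'

# kick off a snapshot of that secondary's data volume, then unfreeze
aws ec2 create-snapshot --volume-id vol-11111111 --description "rs1 seed"
mongo --eval 'db.fsyncUnlock()'

# build a new volume from the finished snapshot and attach it to the new member
aws ec2 create-volume --snapshot-id snap-22222222 --availability-zone us-east-1a
aws ec2 attach-volume --volume-id vol-33333333 --instance-id i-44444444 --device /dev/sdf

# mount it as the new member's dbpath and start mongod with the same --replSet;
# the node then only has to replay the oplog instead of doing a full initial sync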

Comment by Eric Milkie [ 13/Mar/14 ]

Is it building secondary indexes? _id indexes are built as the data is copied, but the secondary indexes are built in a later phase after catching up on the oplog first. Once that's done, it will catch up on the oplog again and then deem the node ready to transition to SECONDARY state.
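
One easy way to watch where a node is in this sequence is to check the member states from any other member, for example:

# print each member's name and current state
# (the syncing node will not report SECONDARY until initial sync finishes)
mongo m5.example.com:27017 --eval 'rs.status().members.forEach(function(m) { print(m.name + " " + m.stateStr); })'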

Comment by Ryan Witt [ 13/Mar/14 ]

So it looks like keeping the load low lets the initial sync proceed. m4 got through a round of indexing, then caught up on its oplog... then went into another round of indexing, which I don't recall happening before. It is still indexing, which might mean it's in some kind of indexing loop. Is this expected? Should I open another ticket for this behavior?

Comment by Eric Milkie [ 11/Mar/14 ]

That's correct; the cloner uses a long-running query to scan through all the documents in each collection in natural order. If the query gets interrupted, it doesn't currently have a way to pick up where it left off. We're looking at ways to make initial sync better, where keeping one TCP connection alive for a long time won't be as critical.
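
Conceptually, the clone phase is similar to the secondary running one long query per collection against the sync source, along these lines (using one of your collections as an example):

# a full natural-order scan, similar in spirit to the cloner's query
mongo m5.example.com:27017/cxxx_pxxx --eval 'db.day.find().hint({$natural: 1}).itcount()'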

Comment by Ryan Witt [ 11/Mar/14 ]

Looks like keepalive was at the usual 7200 (it had not persisted across reboots). I also played with cutting down TIME_WAIT. Neither seemed to help.

However, I did try changing readahead, and very low values (32) on the primary seem to make it much more stable while syncing. We'll see by tomorrow morning whether it helps m4 complete its resync.
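
For reference, I'm adjusting readahead with blockdev along these lines (the device name here is just an example):

# check the current readahead, in 512-byte sectors
sudo blockdev --getra /dev/xvdf

# drop it to 32 sectors (16KB) on the primary's data volume
sudo blockdev --setra 32 /dev/xvdf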

So, just to confirm, connection interruptions can keep sync from completing?

Comment by Eric Milkie [ 11/Mar/14 ]

I went through the log for m5. Based on the socket timeout error from the secondary, I would surmise that the primary is too busy to support the long-lived connections that initial sync needs in order to complete. Can you check your TCP keepalive setting? We recommend 300 seconds if you haven't already adjusted it from the default.
(details here: http://docs.mongodb.org/manual/faq/diagnostics/#does-tcp-keepalive-time-affect-sharded-clusters-and-replica-sets )
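
On Linux, that setting can be checked and changed with sysctl, for example:

# current keepalive time in seconds (the Linux default is 7200)
sysctl net.ipv4.tcp_keepalive_time

# lower it to the recommended 300 seconds for the running system
sudo sysctl -w net.ipv4.tcp_keepalive_time=300

# persist it across reboots
echo "net.ipv4.tcp_keepalive_time = 300" | sudo tee -a /etc/sysctl.conf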

Comment by Ryan Witt [ 10/Mar/14 ]

Thanks, Eric. I'm always running mongostat, and I don't think m5 was near 20k connections. Here are the limits on the mongod process:

[ec2-user@m5 ~]$ cat /proc/28075/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             119685               119685               processes 
Max open files            64000                64000                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       119685               119685               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

There is a readahead warning on m5, but this replset's access pattern usually benefits from large reads:

Mon Mar 10 17:23:54.440 [initandlisten] 
Mon Mar 10 17:23:54.440 [initandlisten] ** WARNING: Readahead for /data is set to 2048KB
Mon Mar 10 17:23:54.440 [initandlisten] **          We suggest setting it to 256KB (512 sectors) or less
Mon Mar 10 17:23:54.440 [initandlisten] **          http://dochub.mongodb.org/core/readahead

I'll go grab that log.

Comment by Eric Milkie [ 10/Mar/14 ]

Can you provide a log from m5.example.com:27017 for the time that matches the log above? (Don't filter out the connection attempts).

It looks like m5.example.com is having trouble keeping a steady connection to the syncing node (and m8.example.com is as well). While load can slow things down, it shouldn't block connection attempts unless you are hitting a resource limit. Can you also provide the output of "uname -a" (assuming you are running mongod on Linux)?

Are there any warnings emitted into the server log at mongod startup from m5.example.com?
