[SERVER-2681] replica member never syncs up and starts from scratch multiple times. Created: 04/Mar/11  Updated: 12/Jul/16  Resolved: 14/Mar/11

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.7.5
Fix Version/s: 1.9.0

Type: Bug Priority: Major - P3
Reporter: Reinaldo Giudici Assignee: Kristina Chodorow (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ubuntu,
mongodb 1.7.5,
two replica members + 1 arbiter
java driver connecting to log11 and log12


Attachments: File mongod.finallysynched.gz     File mongod.log.log12.failedtosync.gz     File mongod.node1.log.afterRepairDatabase.gz    
Operating System: ALL
Participants:

 Description   

Seems like the member is not able to initialize fully and tries to start from scratch during the initial sync process.
The oplog window should be long enough for the process: copying the data and building indexes on the new member takes about 12h, while the oplog covers 51h.

log0x:PRIMARY> db.printReplicationInfo()
configured oplog size: 79892.4875MB
log length start to end: 184394secs (51.22hrs)
oplog first event time: Wed Mar 02 2011 06:38:52 GMT-0800 (PST)
oplog last event time: Fri Mar 04 2011 09:52:06 GMT-0800 (PST)
now: Fri Mar 04 2011 09:52:06 GMT-0800 (PST)
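As a quick sanity check of the claim above (numbers copied from the `db.printReplicationInfo()` output; the 12h figure is the reported copy-plus-index duration):

```shell
# Compare the oplog window to the reported initial-sync duration.
oplog_secs=184394    # "log length start to end" from db.printReplicationInfo()
sync_hours=12        # reported time to copy data and build indexes
echo "oplog window: $(( oplog_secs / 3600 ))h vs sync: ${sync_hours}h"
```

This prints `oplog window: 51h vs sync: 12h`, supporting the point that the oplog window alone should not be what forces the restarts.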

Attached is the log from log12, the member trying to sync up.



 Comments   
Comment by Kristina Chodorow (Inactive) [ 14/Mar/11 ]

Yay! Looks like it finally made it. The fix I made should make it a bit more robust in the future (so it doesn't have to keep starting over). Glad it finally worked out.

Comment by Reinaldo Giudici [ 14/Mar/11 ]

I cleaned up the local database: shut down mongod on log11 and ran rm local.*,
then started the sync from scratch again. It failed at least 50 times, but today it looks like log12 is a secondary !!!

cat /var/log/mongodb/mongod.node1.log | grep 13127 | wc -l
49
(but the log only goes back to Saturday)

Attached is the log.
This section of the log never appeared before (only a partial log here), and it seems there is some delay in the first part of the oplog sync: from the log it looks like it took 24 minutes to apply the first 100k ops, and 9 minutes to apply the remaining 1.9M.

In all previous attempts, initialSyncOplogApplication never appeared in the log at all.

Mon Mar 14 04:57:46 [replica set sync] replSet initial sync initial oplog application
[......]
Mon Mar 14 05:21:31 [replica set sync] replSet initialSyncOplogApplication 100000
Mon Mar 14 05:23:26 [replica set sync] replSet initialSyncOplogApplication 200000
Mon Mar 14 05:23:34 [replica set sync] replSet initialSyncOplogApplication 300000
Mon Mar 14 05:26:38 [replica set sync] replSet initialSyncOplogApplication 400000
Mon Mar 14 05:27:21 [replica set sync] replSet initialSyncOplogApplication 500000
Mon Mar 14 05:27:29 [replica set sync] replSet initialSyncOplogApplication 600000
Mon Mar 14 05:27:38 [replica set sync] replSet initialSyncOplogApplication 700000
Mon Mar 14 05:27:45 [replica set sync] replSet initialSyncOplogApplication 800000
Mon Mar 14 05:27:52 [replica set sync] replSet initialSyncOplogApplication 900000
Mon Mar 14 05:28:00 [replica set sync] replSet initialSyncOplogApplication 1000000
Mon Mar 14 05:28:07 [replica set sync] replSet initialSyncOplogApplication 1100000
Mon Mar 14 05:28:14 [replica set sync] replSet initialSyncOplogApplication 1200000
Mon Mar 14 05:28:21 [replica set sync] replSet initialSyncOplogApplication 1300000
Mon Mar 14 05:28:28 [replica set sync] replSet initialSyncOplogApplication 1400000
Mon Mar 14 05:28:36 [replica set sync] replSet initialSyncOplogApplication 1500000
Mon Mar 14 05:28:43 [replica set sync] replSet initialSyncOplogApplication 1600000
Mon Mar 14 05:28:50 [replica set sync] replSet initialSyncOplogApplication 1700000
Mon Mar 14 05:28:57 [replica set sync] replSet initialSyncOplogApplication 1800000
Mon Mar 14 05:29:04 [replica set sync] replSet initialSyncOplogApplication 1900000
Mon Mar 14 05:29:11 [replica set sync] replSet initialSyncOplogApplication 2000000
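The slow start can be quantified directly from the timestamps above (values copied from the log lines):

```shell
# Seconds between "initial oplog application" (04:57:46) and the 100000 mark (05:21:31),
# and between the 100000 (05:21:31) and 2000000 (05:29:11) marks.
first_100k=$(( (5*3600 + 21*60 + 31) - (4*3600 + 57*60 + 46) ))
rest_1_9m=$((  (5*3600 + 29*60 + 11) - (5*3600 + 21*60 + 31) ))
echo "first 100k: $(( first_100k / 60 )) min; next 1.9M: $(( rest_1_9m / 60 )) min"
```

This prints `first 100k: 23 min; next 1.9M: 7 min`, close to the 24- and 9-minute figures quoted (the second number depends on where you stop counting).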

Comment by auto [ 11/Mar/11 ]

Author: Kristina (kchodorow) <kristina@10gen.com>

Message: handle cursor timeouts during initial sync SERVER-2681
https://github.com/mongodb/mongo/commit/7e9c5f3dcaff349489f3cd342afa42c2830113e9

Comment by Kristina Chodorow (Inactive) [ 10/Mar/11 ]

That's an error code. Initial sync should handle that; I'll work on fixing it so that it doesn't get stopped by that error.

Comment by Reinaldo Giudici [ 10/Mar/11 ]

No killing of connections that I know of...
The strange thing is that it seems to move along until this line:

Wed Mar 9 19:45:37 [replica set sync] replSet initial sync initial oplog application

And every timeout shows the same number: is that an oplog position, or an error code?

[13:49:16 rgiudici@cloud-log12:~]$ cat mongod.log | grep oplog | grep failing
Wed Mar 9 20:33:33 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
Wed Mar 9 21:25:16 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
Wed Mar 9 22:46:44 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
Wed Mar 9 23:33:27 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
Thu Mar 10 02:12:27 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
Thu Mar 10 04:26:33 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
Thu Mar 10 07:00:30 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
Thu Mar 10 07:53:20 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
Thu Mar 10 08:34:50 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
Thu Mar 10 09:04:09 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
Thu Mar 10 11:28:44 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
Thu Mar 10 12:05:01 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
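One way to see that 13127 is a fixed code rather than a moving oplog position is to extract just that field from the failure lines; a sketch using two of the lines above (the scratch path is illustrative):

```shell
# Copy two of the failure lines above into a scratch file.
cat > /tmp/sample.log <<'EOF'
Wed Mar 9 20:33:33 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
Thu Mar 10 12:05:01 [replica set sync] replSet initial sync failing, error applying oplog 13127 getMore: cursor didn't exist on server, possible restart or timeout?
EOF
# An oplog position would differ between attempts hours apart; an error code stays fixed.
sed 's/.*error applying oplog \([0-9]*\).*/\1/' /tmp/sample.log | sort -u    # prints: 13127
```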

Comment by Kristina Chodorow (Inactive) [ 10/Mar/11 ]

Could you have some sort of router or switch that is killing connections after some length of time?

Comment by Kristina Chodorow (Inactive) [ 10/Mar/11 ]

The "no user in local.system.users to use for authentication" is harmless.

Comment by Kristina Chodorow (Inactive) [ 10/Mar/11 ]

The problem is that you keep having network blips:

Sun Mar 6 05:23:04 [replica set sync] MessagingPort recv() errno:104 Connection reset by peer 10.17.100.71:27017
Sun Mar 6 05:23:04 [replica set sync] SocketException: remote: 10.17.100.71:27017 error: 9001 socket exception [1]
Sun Mar 6 05:23:04 [replica set sync] DBClientCursor::init call() failed
Sun Mar 6 05:23:04 [replica set sync] replSet initial sync exception 10276 DBClientBase::findOne: transport error: cloud-log11:27017 query: { query: {}, orderby: { $natural: -1 } }
Sun Mar 6 05:23:34 [replica set sync] replSet initial sync pending
Sun Mar 6 05:23:34 [replica set sync] replSet syncing to: cloud-log11:27017
Sun Mar 6 05:23:34 [replica set sync] replSet initial sync drop all databases

It can't reach the server it's trying to sync to.

As I said above, you could try doing a fastsync. You could also see if you could do anything to fix your network or sync to a "closer" machine, if you're syncing over a WAN at the moment.

Initial sync will be getting more tolerant of network failures, but not until 1.8.1 (at least).
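For reference, the fastsync route suggested here would look roughly like the sketch below. The paths, host, and set name are hypothetical, and it assumes the seed snapshot is consistent and newer than the start of the sync source's oplog:

```shell
# Hypothetical sketch only -- adjust paths, host, and set name to your deployment.
# 1. Seed the member with a recent, consistent snapshot of the sync source's data
#    files (taken with mongod shut down on the source, or from a filesystem snapshot).
rsync -a cloud-log11:/var/lib/mongodb/node1/ /var/lib/mongodb/node1/
# 2. Start the member with --fastsync so it skips the full data clone and only
#    replays the oplog from the snapshot point forward.
mongod --replSet <setname> --dbpath /var/lib/mongodb/node1 --fastsync
```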

Comment by Reinaldo Giudici [ 10/Mar/11 ]

Attached is a new log. This one is from after repairing all databases on the primary (except local and admin) and after dropping some data from the primary.
It still fails to sync; the last entry related to the sync seems to be:

Thu Mar 10 11:42:37 [replica set sync] replSet initial sync initial oplog application

I see a few of these; we do not have authentication enabled:
Thu Mar 10 11:42:37 [replica set sync] replauthenticate: no user in local.system.users to use for authentication

The REST interface shows it is never able to catch up on replication. The optime on the secondary moves really slowly: sometimes it does not change for 20-30 seconds, and when it does change it only advances by 1. For example, one minute later it is at 4d790a52:37e.

Member                              id  Up  cctime       Last heartbeat  Votes  Priority  State       optime        skew
cloud-log11:27017 (me)              0   1   1.6e+02 hrs                  1      1         PRIMARY     4d790fd9:111
cloud-mongo-arbiter01.vm.dfw:27017  1   1   1.1e+02 hrs  1 sec ago       1      1         ARBITER     0:0           1
cloud-log12:27017                   2   1   16 hrs       0 secs ago      1      1         RECOVERING  4d790a52:37c

Comment by Kristina Chodorow (Inactive) [ 09/Mar/11 ]

Apparently it's cosmetic: https://jira.mongodb.org/browse/SERVER-2669.

From the logs, it looks like MongoDB lost the connection to the server it was syncing from... it isn't very tolerant of blippy networks at the moment.

Thu Mar 3 18:12:36 [replica set sync] MessagingPort say send() errno:32 Broken pipe 10.17.100.71:27017
Thu Mar 3 18:12:36 [replica set sync] replSet initial sync exception 9001 socket exception [2]
Thu Mar 3 18:13:06 [replica set sync] replSet initial sync pending

If you have a backup of log11, you could try starting up using --fastsync with it, which should make it pull less over the network.

Comment by Reinaldo Giudici [ 09/Mar/11 ]

Another possibly interesting find: the reported size for the oplog is negative. Or is that just an overflow?

log0x:PRIMARY> db.oplog.rs.stats()
{
"ns" : "local.oplog.rs",
"count" : 152530910,
"size" : -339245068,
"avgObjSize" : -171.17218302834488,
"storageSize" : 83773345024,
"numExtents" : 40,
"nindexes" : 0,
"lastExtentSize" : 62697328,
"paddingFactor" : 1,
"flags" : 0,
"totalIndexSize" : 0,
"indexSizes" : {

},
"capped" : 1,
"max" : 2147483647,
"ok" : 1
}
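The negative size looks like signed 32-bit overflow rather than corruption. Assuming the stats counter is a signed 32-bit integer (an assumption about this mongod version, not confirmed here), the wrap can be illustrated with plain shell arithmetic:

```shell
# If "size" is accumulated in a signed 32-bit counter, it wraps past 2^31-1.
# Reinterpreting the negative value as unsigned recovers only the low 32 bits;
# with ~83GB of storageSize the counter has likely wrapped many times
# (true size = reported size + k * 2^32 for some unknown k), so the exact
# logical size is not recoverable from this number alone.
echo $(( -339245068 & 0xFFFFFFFF ))    # low 32 bits as unsigned: 3955722228
```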

Comment by Reinaldo Giudici [ 07/Mar/11 ]

I have not run repairDatabase on all of them; I ran it on a few of them.

This is a "log server": we had set up a 50GB capped collection per log collection, and then realized that only production logs needed that much.
So we reduced them to 2GB capped collections: I removed the old collections, created the new ones, and ran repairDatabase on the non-production logs. That recovered the space on log11, and that server seems pretty healthy. But log12 has never been able to catch up again.

Will running repairDatabase() try to reindex the collections? If so, that will effectively bring down the server, since indexing on the secondaries seems to take multiple hours, which would be a problem in our setup.

Comment by Kristina Chodorow (Inactive) [ 07/Mar/11 ]

Running out of disk space on the master could definitely cause this on the slave. Have you run repair on every database that was handling writes when you ran out of disk space?

Comment by Reinaldo Giudici [ 04/Mar/11 ]

The master (log11) was having disk space issues until yesterday, when we changed the size of some capped collections and ran repairDatabase on a few databases.
The master is doing fine now, and after that the sync started much faster and better than before, but it still got into that strange repeat cycle.

Comment by Kristina Chodorow (Inactive) [ 04/Mar/11 ]

Is it trying to sync from the master that you're having corruption issues with?

Comment by Reinaldo Giudici [ 04/Mar/11 ]

This log is actually the result of that.

We did try multiple things, but yesterday we removed everything and started the sync from scratch.

I think this is the info from the last restart. We did remove the data.
You can see the indexing takes forever, and I suspect this might be related to that.
We have tried:

. start from scratch (no data on the dir)
. recover a backup and start
. recover a backup and start with fast sync
. recover with mongorestore and start

We also upped the memory to 96GB, and still no joy.

None of them has been able to finish at all.

Each attempt takes more than 12h.

Thu Mar 3 05:08:59 [initandlisten] MongoDB starting : pid=5555 port=27017 dbpath=/var/lib/mongodb/node1 64-bit

** NOTE: This is a development version (1.7.5) of MongoDB.
**       Not recommended for production.

Thu Mar 3 05:08:59 [initandlisten] db version v1.7.5, pdfile version 4.5
Thu Mar 3 05:08:59 [initandlisten] git version: 1978898d7a013657e5400133defdc996fb4c2c15
Thu Mar 3 05:08:59 [initandlisten] sys info: Linux domU-12-31-39-06-79-A1 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_41
Thu Mar 3 05:08:59 [initandlisten] waiting for connections on port 27017
Thu Mar 3 05:08:59 [websvr] web admin interface listening on port 28017
Thu Mar 3 05:08:59 [initandlisten] connection accepted from 127.0.0.1:19147 #1
Thu Mar 3 05:09:00 [initandlisten] connection accepted from 10.36.115.6:12188 #2
Thu Mar 3 05:09:00 [initandlisten] connection accepted from 10.36.115.6:12189 #3
Thu Mar 3 05:09:00 [initandlisten] connection accepted from 10.17.100.71:22092 #4
Thu Mar 3 05:09:00 [startReplSets] trying to contact cloud-log11:27017
Thu Mar 3 05:09:00 [initandlisten] connection accepted from 10.36.58.20:58175 #5
Thu Mar 3 05:09:00 [startReplSets] replSet STARTUP2
Thu Mar 3 05:09:00 [rs Manager] replSet can't see a majority, will not try to elect self
Thu Mar 3 05:09:00 [replica set sync] replSet initial sync pending
Thu Mar 3 05:09:00 [replica set sync] replSet couldn't find a member matching the sync criteria:
state? none
name? none
_id? -1
optime? Dec 31 18:00:00:0
Thu Mar 3 05:09:00 [replica set sync] replSet initial sync need a member to be primary or secondary to do our initial sync
Thu Mar 3 05:09:00 [ReplSetHealthPollTask] replSet info cloud-mongo-arbiter01.vm.dfw:27017 is up
Thu Mar 3 05:09:00 [ReplSetHealthPollTask] replSet member cloud-mongo-arbiter01.vm.dfw:27017 ARBITER
Thu Mar 3 05:09:00 [rs Manager] replSet info not trying to elect self, do not yet have a complete set of data from any point in time

Comment by Kristina Chodorow (Inactive) [ 04/Mar/11 ]

It looks like you guys killed mongod multiple times, sometimes in the middle of the initial sync. I think this may have confused it. Can you stop mongod, remove everything from the data directory, and start the initial sync over again?
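The steps described amount to the following sketch (paths and service name are hypothetical; note that wiping the dbpath destroys everything on that member):

```shell
# Hypothetical sketch -- paths and service name are illustrative.
sudo service mongodb stop        # stop the stuck member
rm -rf /var/lib/mongodb/node1/*  # wipe its data directory completely
sudo service mongodb start       # on restart it rejoins the set and re-runs initial sync
```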

Generated at Thu Feb 08 03:00:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.