[SERVER-10839] Race condition in rsStart on getting new version of config Created: 21/Sep/13  Updated: 10/Dec/14  Resolved: 28/Oct/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.5.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Asya Kamsky Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: elections
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-15770 Fassert after removing and re-adding ... Closed
Tested
Operating System: ALL
Participants:

 Description   

I removed a node from replica set, restarted mongod on a different port and added it back in with new host:port.

When coming back into the set, I noticed in the logs:

$ mongod --port 37020 --dbpath /data/rs3 --replSet asyaS17 --oplogSize 100
2013-09-21T11:21:30.780-0700 [initandlisten] MongoDB starting : pid=44500 port=37020 dbpath=/data/rs3 64-bit host=asyasmacbook.local
2013-09-21T11:21:30.780-0700 [initandlisten]
2013-09-21T11:21:30.780-0700 [initandlisten] ** NOTE: This is a development version (2.5.2) of MongoDB.
2013-09-21T11:21:30.780-0700 [initandlisten] **       Not recommended for production.
2013-09-21T11:21:30.780-0700 [initandlisten]
2013-09-21T11:21:30.780-0700 [initandlisten] ** WARNING: soft rlimits too low. Number of files is 256, should be at least 1000
2013-09-21T11:21:30.780-0700 [initandlisten]
2013-09-21T11:21:30.780-0700 [initandlisten] db version v2.5.2
2013-09-21T11:21:30.780-0700 [initandlisten] git version: b283169ae51328f3fb5e3c8a3c9017a2e42c148f
2013-09-21T11:21:30.780-0700 [initandlisten] build info: Darwin bs-osx-106-x86-64-2.10gen.cc 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7 16:32:41 PDT 2011; root:xnu-1504.15.3~1/RELEASE_X86_64 x86_64 BOOST_LIB_VERSION=1_49
2013-09-21T11:21:30.780-0700 [initandlisten] allocator: system
2013-09-21T11:21:30.780-0700 [initandlisten] options: { dbpath: "/data/rs3", oplogSize: 100, port: 37020, replSet: "asyaS17" }
2013-09-21T11:21:30.781-0700 [initandlisten] journal dir=/data/rs3/journal
2013-09-21T11:21:30.783-0700 [initandlisten] recover : no journal files present, no recovery needed
2013-09-21T11:21:30.824-0700 [initandlisten] waiting for connections on port 37020
2013-09-21T11:21:30.826-0700 [IndexRebuilder] assertion 13435 not master and slaveOk=false ns:cs.system.namespaces query:{}
2013-09-21T11:21:30.826-0700 [IndexRebuilder] warning: index rebuilding did not complete
2013-09-21T11:21:30.826-0700 [rsStart] replSet REMOVED
2013-09-21T11:21:30.826-0700 [rsStart] replSet info self not present in the repl set configuration:
2013-09-21T11:21:30.826-0700 [rsStart] { _id: "asyaS17", version: 7, members: [ { _id: 0, host: "asyasmacbook.local:37017", priority: 2.0 }, { _id: 1, host: "asyasmacbook.local:37018" } ] }
2013-09-21T11:21:30.826-0700 [rsStart] replSet info Couldn't load config yet. Sleeping 20sec and will try again.
2013-09-21T11:21:50.827-0700 [rsStart] replSet info Couldn't load config yet. Sleeping 20sec and will try again.
2013-09-21T11:21:50.899-0700 [initandlisten] connection accepted from 127.0.0.1:59868 #1 (1 connection now open)
2013-09-21T11:21:51.960-0700 [initandlisten] connection accepted from 127.0.0.1:59870 #2 (2 connections now open)
2013-09-21T11:22:09.971-0700 [conn2] end connection 127.0.0.1:59870 (1 connection now open)
2013-09-21T11:22:09.971-0700 [initandlisten] connection accepted from 127.0.0.1:59876 #3 (2 connections now open)
2013-09-21T11:22:10.827-0700 [rsStart] trying to contact asyasmacbook.local:37017
2013-09-21T11:22:10.829-0700 [rsStart] trying to contact asyasmacbook.local:37018
2013-09-21T11:22:10.830-0700 [rsStart] replSet I am asyasmacbook.local:37020
2013-09-21T11:22:10.830-0700 [rsStart] replSet got config version 8 from a remote, saving locally
2013-09-21T11:22:10.830-0700 [rsStart] replSet info saving a newer config version to local.system.replset
2013-09-21T11:22:10.836-0700 [rsStart] replSet saveConfigLocally done
2013-09-21T11:22:10.836-0700 [rsStart] replSet info Couldn't load config yet. Sleeping 20sec and will try again.
2013-09-21T11:22:10.918-0700 [conn1] end connection 127.0.0.1:59868 (1 connection now open)
2013-09-21T11:22:10.918-0700 [initandlisten] connection accepted from 127.0.0.1:59879 #4 (2 connections now open)
2013-09-21T11:22:30.833-0700 [clientcursormon] mem (MB) res:36 virt:3088
2013-09-21T11:22:30.833-0700 [clientcursormon]  mapped (incl journal view):576
2013-09-21T11:22:30.833-0700 [clientcursormon]  connections:2
2013-09-21T11:22:30.837-0700 [rsStart] trying to contact asyasmacbook.local:37017
2013-09-21T11:22:30.838-0700 [rsStart] trying to contact asyasmacbook.local:37018
2013-09-21T11:22:30.838-0700 [rsStart] replSet I am asyasmacbook.local:37020
2013-09-21T11:22:30.839-0700 [rsStart] replSet STARTUP2
2013-09-21T11:22:32.840-0700 [rsHealthPoll] replSet member asyasmacbook.local:37017 is up
2013-09-21T11:22:32.840-0700 [rsHealthPoll] replSet member asyasmacbook.local:37017 is now in state PRIMARY
2013-09-21T11:22:32.841-0700 [rsHealthPoll] replSet member asyasmacbook.local:37018 is up
2013-09-21T11:22:32.841-0700 [rsHealthPoll] replSet member asyasmacbook.local:37018 is now in state SECONDARY

It seemed like it got the config and applied it successfully and yet logged that it couldn't and slept for another 20 seconds.

Looking at the code (I looked for 2.5.2 to make sure it matched my executables) I see this:

https://github.com/mongodb/mongo/blob/r2.5.2/src/mongo/db/repl/rs.cpp#L793

so false was returned from _loadConfigFinish and the only place I see that is https://github.com/mongodb/mongo/blob/r2.5.2/src/mongo/db/repl/rs.cpp#L701 and the messages at 11:22:10 clearly indicate that we're in saveConfigLocally https://github.com/mongodb/mongo/blob/r2.5.2/src/mongo/db/repl/rs.cpp#L705 after which only true is returned.

Not sure if I'm misreading something, but it seems strange that the logs indicate first successful save of new version of config and then a warning that we "Couldn't load config yet".



 Comments   
Comment by Spencer Brody (Inactive) [ 28/Oct/14 ]

Tested on master and could not reproduce, as expected since the code that did this is dead now.

Comment by Eric Milkie [ 28/Aug/14 ]

We need to confirm this is now fixed in 2.7.

Generated at Thu Feb 08 03:24:12 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.