|
After a system reboot, the node that had been PRIMARY lost an election to the node that is now the primary. During that process, and in the seconds that followed, it hit a rollback error, got stuck in RECOVERING, and then crashed.
DAY:15:38.702-0500 I REPL [ReplicationExecutor] stepping down from primary, because a new term has begun: 21
DAY:15:38.703-0500 I REPL [replExecDBWorker-1] transition to SECONDARY
DAY:15:38.710-0500 I NETWORK [conn244] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [5.5.5.12:58244]
DAY:15:39.248-0500 I REPL [ReplicationExecutor] Member mongo01.domain.us:27017 is now in state PRIMARY
DAY:15:39.694-0500 I REPL [ReplicationExecutor] syncing from: mongo01.domain.us:27017
DAY:15:39.696-0500 I REPL [SyncSourceFeedback] setting syncSourceFeedback to mongo01.domain.us:27017
DAY:15:39.697-0500 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongo01.domain.us:27017
DAY:15:39.698-0500 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 20, timestamp: Jul 12 22:15:38:1ce). source's GTE: (term: 21, timestamp: Jul 12 22:15:38:1ce) hashes: (7825714044400008130/-1471538608383373916)
DAY:15:39.698-0500 I REPL [rsBackgroundSync] beginning rollback
DAY:15:39.698-0500 I REPL [rsBackgroundSync] rollback 0
DAY:15:39.698-0500 I REPL [ReplicationExecutor] transition to ROLLBACK
DAY:15:39.699-0500 I REPL [rsBackgroundSync] rollback 1
DAY:15:39.700-0500 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint
DAY:15:39.700-0500 I REPL [rsBackgroundSync] rollback our last optime: Jul 12 22:15:38:1ce
DAY:15:39.700-0500 I REPL [rsBackgroundSync] rollback their last optime: Jul 12 22:15:39:167
DAY:15:39.700-0500 I REPL [rsBackgroundSync] rollback diff in end of log times: -1 seconds
DAY:15:39.707-0500 I QUERY [conn545] assertion 13436 cannot read from oplog collection while in rollback ns:local.oplog.rs query:{}
DAY:15:39.707-0500 I QUERY [conn545] ntoskip:0 ntoreturn:1
DAY:15:39.809-0500 I QUERY [conn547] assertion 13436 cannot read from oplog collection while in rollback ns:local.oplog.rs query:{}
DAY:15:39.809-0500 I QUERY [conn547] ntoskip:0 ntoreturn:1
DAY:15:39.907-0500 I REPL [rsBackgroundSync] rollback 3 fixup
DAY:15:39.908-0500 I REPL [rsBackgroundSync] rollback 3.5
DAY:15:39.908-0500 I REPL [rsBackgroundSync] rollback 4 n:2
DAY:15:39.908-0500 I REPL [rsBackgroundSync] minvalid=(term: 21, timestamp: Jul 12 22:15:39:1df)
DAY:15:39.908-0500 I REPL [rsBackgroundSync] rollback 4.6
DAY:15:39.908-0500 I REPL [rsBackgroundSync] rollback 4.7
DAY:15:39.909-0500 I REPL [rsBackgroundSync] rollback 5 d:1 u:2
DAY:15:39.909-0500 I REPL [rsBackgroundSync] rollback 6
DAY:15:39.910-0500 I REPL [rsBackgroundSync] rollback done
DAY:15:39.910-0500 I REPL [rsBackgroundSync] rollback finished
DAY:15:39.910-0500 I NETWORK [SyncSourceFeedback] Socket say send() errno:9 Bad file descriptor 5.5.6.11:27017
DAY:15:39.910-0500 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 5.5.6.11:27017
DAY:15:39.910-0500 I REPL [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for 5.5.6.11:27017, will retry
DAY:15:39.910-0500 I QUERY [conn548] assertion 13436 cannot read from oplog collection while in rollback ns:local.oplog.rs query:{}
DAY:15:39.910-0500 I QUERY [conn548] ntoskip:0 ntoreturn:1
DAY:15:39.920-0500 I REPL [ReplicationExecutor] transition to RECOVERING
DAY:15:39.920-0500 I REPL [ReplicationExecutor] syncing from: mongo03.domain.us:27017
DAY:15:39.921-0500 I REPL [SyncSourceFeedback] setting syncSourceFeedback to mongo03.domain.us:27017
DAY:15:39.922-0500 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongo03.domain.us:27017
DAY:16:09.931-0500 I REPL [ReplicationExecutor] re-evaluating sync source because our current sync source's most recent OpTime is (term: 20, timestamp: Jul 12 22:15:38:1cc) which is more than 30s behind member mongo01.domain.us:27017 whose most recent OpTime is (term: 21, timestamp: Jul 12 22:16:09:a8)
DAY:16:09.932-0500 I REPL [ReplicationExecutor] syncing from: mongo01.domain.us:27017
DAY:16:09.933-0500 I REPL [SyncSourceFeedback] setting syncSourceFeedback to mongo01.domain.us:27017
DAY:16:09.934-0500 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 20, timestamp: Jul 12 22:15:38:1cc). source's GTE: (term: 21, timestamp: Jul 12 22:15:38:1cd) hashes: (1087207337565601823/-6762428419615138830)
DAY:16:09.935-0500 I - [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 21, timestamp: Jul 12 22:15:39:1df) > our last optime: (term: 20, timestamp: Jul 12 22:15:38:1cc)
DAY:16:09.935-0500 I - [rsBackgroundSync]

***aborting after fassert() failure
|
|
Followed by this on restart:
DAY::21:36.465-0500 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/mongodb/data/wt/diagnostic.data'
DAY::21:36.465-0500 I NETWORK [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
DAY::21:36.467-0500 I NETWORK [initandlisten] waiting for connections on port 27017
DAY::21:36.470-0500 I REPL [ReplicationExecutor] New replica set config in use: { _id: "OnspringCavern01", version: 8, protocolVersion: 1, members: [ { _id: 0, host: "mongo01.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "mongo02.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "mongo03.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 1000, heartbeatTimeoutSecs: 5, electionTimeoutMillis: 6000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('57857435bb2a73b423c7eb4c') } }
DAY::21:36.470-0500 I REPL [ReplicationExecutor] This node is mongo02.domain.us:27017 in the config
DAY::21:36.470-0500 I REPL [ReplicationExecutor] transition to STARTUP2
DAY::21:36.470-0500 I REPL [ReplicationExecutor] Starting replication applier threads
DAY::21:36.470-0500 I REPL [ReplicationExecutor] transition to RECOVERING
DAY::21:36.471-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo01.domain.us:27017
DAY::21:36.472-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo03.domain.us:27017
DAY::21:36.476-0500 I REPL [ReplicationExecutor] New replica set config in use: { _id: "OnspringCavern01", version: 9, protocolVersion: 1, members: [ { _id: 0, host: "mongo01.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "mongo02.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "mongo03.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: false, heartbeatIntervalMillis: 1000, heartbeatTimeoutSecs: 5, electionTimeoutMillis: 6000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('57857435bb2a73b423c7eb4c') } }
DAY::21:36.476-0500 I REPL [ReplicationExecutor] This node is mongo02.domain.us:27017 in the config
DAY::21:36.477-0500 I REPL [ReplicationExecutor] Member mongo03.domain.us:27017 is now in state SECONDARY
DAY::21:36.477-0500 I REPL [ReplicationExecutor] Member mongo01.domain.us:27017 is now in state PRIMARY
DAY::21:37.002-0500 I FTDC [ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost. OK
DAY::21:37.471-0500 I REPL [ReplicationExecutor] syncing from primary: mongo01.domain.us:27017
DAY::21:37.475-0500 I REPL [SyncSourceFeedback] setting syncSourceFeedback to mongo01.domain.us:27017
DAY::21:37.476-0500 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongo01.domain.us:27017
DAY::21:37.477-0500 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 20, timestamp: Jul 12 22:15:38:1cc). source's GTE: (term: 21, timestamp: Jul 12 22:15:38:1cd) hashes: (1087207337565601823/-6762428419615138830)
DAY::21:37.477-0500 I - [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 21, timestamp: Jul 12 22:15:39:1df) > our last optime: (term: 20, timestamp: Jul 12 22:15:38:1cc)
DAY::21:37.477-0500 I - [rsBackgroundSync]

***aborting after fassert() failure
|
|
|
Hi Spencer,
thanks for your answer.
I hope I did not attach the wrong time frame ...
However, in the meantime we resolved our dependencies between the different data centers to get our setup stable for the Christmas business ...
I will probably re-set up the cluster across multiple data centers next year and will also upgrade to 3.4 then ...
I will let you know if we hit the issue again.
Thanks anyway for all the work!
Greets
Steffen
|
|
Hi sbo@spreadshirt.net,
Sorry for the long response time - it's been a very busy couple of weeks getting ready for our upcoming 3.4 server release. I have taken a look at the logs you attached on November 9th, and I don't see any messages containing the words "fassert", "Fatal Assertion", or "UnrecoverableRollbackError", so it looks like you didn't actually hit this issue during that time frame. In fact, I don't see any indication of a process crash in any of those logs. If a different issue occurred during that time than the fatal assertion this ticket describes, please open a new ticket and we'd be happy to take a look.
Cheers,
-Spencer
|
|
I attached the logfiles with entries around the time the incident happened ...
The default primary was the one with the logfile
mongodb.log.2016-10-12.to.2016-10-14.mongodb01.dfw.gz
The incident most probably started around 2016-10-12T23:11:06 or a few minutes later ...
I'm not exactly sure, since I didn't note the time ...
I hope it helps you to find out more ...
|
|
sbo@spreadshirt.net, if you upload those logs I would be happy to take a look at them and see if it looks like anything new beyond the behavior we know about in SERVER-25145 and SERVER-21537
|
|
Hi Spencer,
if you want, I could upload the logs from this case too ...
Then you could check ... if you would like to confirm that it's the same bug ... shall I?
|
|
sbo@spreadshirt.net, I took a look at the logs you uploaded and they appear to me to be the same issues as Thomas referred to, which were recently fixed as part of SERVER-25145 and will be included in our upcoming 3.2.11 release. It is interesting that you say that you encountered the same issue even after disabling replication chaining - it is possible that you hit a combination of SERVER-25145 with SERVER-21537, though without logs from that event I cannot confirm that for certain. Either way I strongly recommend upgrading to 3.2.11 when it is released, which should prevent the crashes that you were seeing in the logs you uploaded.
-Spencer
|
|
And regarding our last crash ... this one was really special ...
It was not the case that one of the nodes crashed; instead, the primary stepped down to secondary and the repl set was not able to elect a new primary ...
It was strange behavior ...
See my comment from Oct 13 2016 05:18:31
|
|
Here is our rs.config() output:
{
    "_id" : "flexiApiCGN",
    "version" : 13,
    "protocolVersion" : NumberLong(1),
    "members" : [
        {
            "_id" : 0,
            "host" : "mongodb01.prv.cgn.sprd.net:27020",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 100,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 1,
            "host" : "mongodb02.prv.cgn.sprd.net:27020",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 50,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "mongodb01.prv.dfw.sprd.net:27020",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 20,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 3,
            "host" : "mongodb02.prv.dfw.sprd.net:27020",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 10,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 4,
            "host" : "10.100.23.30:27020",
            "arbiterOnly" : true,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : false,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "getLastErrorModes" : { },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        }
    }
}
|
|
|
Hi Keren,
I cannot give you an answer to your question, but I saw
"chainingAllowed" : true,
in your settings ...
Is this intentional?
Setting this to false reduced the problems for my repl sets a lot ...
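For reference, a minimal way to confirm the current value in the mongo shell (an illustrative snippet, not output from this ticket; it assumes a connection to any member of the replica set) is:

// Read the replication configuration and inspect the chaining setting
rs.conf().settings.chainingAllowed   // true means secondaries may sync from other secondaries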
|
|
After the primary changed, 3 servers changed to the "not reachable/healthy" state at the same time. I tried to restart them, but they always get "Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state".
What can I do to fix it besides deleting the data folder?
Mongo version: 3.2.8
We are using 2 shards with the same architecture.
{
    "_id" : "rs1",
    "version" : 10,
    "protocolVersion" : NumberLong(1),
    "members" : [
        {
            "_id" : 0,
            "host" : "server1:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 1,
            "host" : "server2:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "server3:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 3,
            "host" : "server4:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 4,
            "host" : "server5:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 5,
            "host" : "server6:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 6,
            "host" : "server7:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 0,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "getLastErrorModes" : { },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        },
        "replicaSetId" : ObjectId("aaaaa")
    }
}
|
|
|
Hi Boemm and keren.mantin,
Unfortunately, this bug is present in all previous versions of MongoDB. There are a number of open tickets that we are currently working on to mitigate this issue in MongoDB 3.2. In addition to SERVER-25848 and SERVER-26415, I would recommend watching SERVER-25145.
Would you please provide the output of rs.config(), so we can determine if there are additional steps you can take to avoid encountering this issue?
Thank you,
Thomas
|
|
Hi Thomas Schubert,
Is there an older, stable version where this issue doesn't happen?
When are you planning to release the fix for 3.2?
|
|
Furthermore, we had 1 instance where the previous primary stayed in state ROLLBACK after the network issue occurred ...
That's the first time we had this issue where no primary was elected after the rollback problems.
I restarted this "default" primary instance and then it rejoined the cluster and got the primary role back again ...
|
|
Unfortunately we just hit the bug again, even with chained replication disabled in the config ... I am surprised that this happened again ...
Shall I attach another bunch of log files?
|
|
Hi sallgeud,
I'm glad you haven't run into this issue again since disabling chained replication. Unfortunately, the changes included in SERVER-7200 are too large to responsibly backport into MongoDB 3.2. Instead, we have other tickets open to correct this issue in MongoDB 3.2, including SERVER-25848 (previously mentioned) and SERVER-26415.
Kind regards,
Thomas
|
|
Thanks for the explanation.
Will try the config change you mentioned next week ...
|
|
Obviously, given our experience upgrading within version 3.2, we might be very timid about moving to 3.4. Is there no chance SERVER-7200 will be backported?
|
|
Thanks.
We had actually inadvertently disabled chaining when we rebuilt the replica set. That explains why we haven't been able to repeat the issue (thank god). The rebuild also involved a clean resync.
|
|
Hi sallgeud and Boemm,
Thank you for providing the logs. The issue you are encountering will be corrected in MongoDB 3.4 by SERVER-7200. Additionally, we have a backport planned to MongoDB 3.2 of fixes that will resolve this issue. Please feel free to vote for SERVER-25848 and watch it for updates.
In the interim, I would recommend disabling chained replication, as we have observed that this mitigates the issue. To resolve the fatal assertion, please perform a clean resync on the affected secondary.
Kind regards,
Thomas
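A minimal sketch of that reconfiguration in the mongo shell, assuming a connection to the current primary (illustrative only, not an official procedure from this ticket; review the configuration before applying it to a production replica set):

// Fetch the current replica set configuration
cfg = rs.conf()
// Disable chained replication so secondaries sync directly from the primary
cfg.settings.chainingAllowed = false
// Apply the modified configuration
rs.reconfig(cfg)

After the change propagates, rs.conf().settings.chainingAllowed should report false on every member.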
|
|
And a second instance running in parallel on the same servers, which crashed close in time to the other one ...
|
|
Yesterday we had another incident of the same type ...
See attached files ...
|
|
This is a second example, with the keyword abc in the name.
The same explanation as above applies here, with the exception that this is a non-sharded repl set.
In this case the node dfw.mongodb01 crashed ...
|
|
I am uploading the first set of logfiles here ... those with the keyword flexi in the name.
The crash happened on 2016-09-15 in the early morning ...
Some network problems appeared beforehand; most probably this is related, but in our opinion it should not lead to a crash in the end ...
A few words of explanation:
This MongoDB setup is a sharded setup where only the nodes for one shard crashed.
The cgn, dfw and lej keywords mark servers in different datacenters, where
cgn is Cologne/Germany
lej is Leipzig/Germany
dfw is Dallas/US
The lej server is an arbiter only ...
In this case the dfw.mongodb02 node crashed in the morning of the 15th.
|
|
Hi Boemm, if you could upload logs from all members of the replica set that would be great. Thanks.
|
|
Hi guys, we have hit the same bug several times on different instances since upgrading to 3.2.9 ...
We will provide the logs if you want ...
|
|
Unfortunately they have been overwritten on most of the servers. There were no other errors in the logs at the time, only replica set notifications, as the secondary had just finished rebuilding from scratch about 5 minutes earlier.
|
|
Hi sallgeud,
Thanks for reporting this issue. So we can continue to investigate, would you please upload the complete logs for each node in the replica set?
Kind regards,
Thomas
|