|
udit, sorry you're running into this issue. If this behavior persists after upgrading to 3.2.0, can you please open a separate ticket so we can investigate your issue separately?
Thanks,
Ramón.
|
|
We are experiencing it very frequently (every 2-3 hours) on secondary nodes.
Logs:
2015-12-16T10:41:56.052-0500 F REPL [rsSync] replication oplog stream went back in time. previous timestamp: 56718779:76c newest timestamp: 56718704:536. Op being applied
2015-12-16T10:41:56.052-0500 I - [rsSync] Fatal Assertion 18905
2015-12-16T10:41:56.052-0500 I - [rsSync]
Environment:
OS: Ubuntu 14.04
MongoDB version: 3.0.8
Storage engine: WiredTiger
AWS instance: i2.xlarge (local SSD storage, XFS file system)
As a next step, we are upgrading the replica set to 3.2.0 to see if it fixes the issue.
|
|
I have a similar crash:
2015-12-09T06:30:46.903+0800 E REPL [rsBackgroundSync] sync producer problem: 16235 going to start syncing, but buffer is not empty
2015-12-09T06:30:47.260+0800 I REPL [ReplicationExecutor] syncing from: node_slave1:2701
2015-12-09T06:30:47.399+0800 I REPL [SyncSourceFeedback] replset setting syncSourceFeedback to node_slave1:2701
2015-12-09T06:30:47.604+0800 F REPL [rsSync] replication oplog stream went back in time. previous timestamp: 56675270:239 newest timestamp: 5667526f:653. Op being applied: { ts: Timestamp 1449611887000|1619, h: 7253060569348003209, v: 2, op: "u", ns: "db1.coll1", o2: { _id: ... }, o: { ... } }
2015-12-09T06:30:47.605+0800 I - [rsSync] Fatal Assertion 18905
2015-12-09T06:30:47.605+0800 I - [rsSync]
***aborting after fassert() failure
|
What's different:
1. All nodes are running version 3.0.5.
2. After the crash, we restarted the slave and it did not crash again.
There are some logs from before the crash:
2015-12-09T06:30:00.243+0800 I NETWORK [rsBackgroundSync] Socket recv() timeout node_slave1:2701
2015-12-09T06:30:00.273+0800 I NETWORK [rsBackgroundSync] SocketException: remote: node_slave1:2701 error: 9001 socket exception [RECV_TIMEOUT] server [node_slave1:2701]
2015-12-09T06:30:00.333+0800 E REPL [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: node_slave1:2701
2015-12-09T06:30:00.357+0800 I REPL [ReplicationExecutor] syncing from: node_slave1:2701
2015-12-09T06:30:00.397+0800 I NETWORK [conn832] end connection node_slave1:36844 (53 connections now open)
2015-12-09T06:30:00.397+0800 I NETWORK [initandlisten] connection accepted from node_slave1:37024 #837 (54 connections now open)
2015-12-09T06:30:00.874+0800 I REPL [SyncSourceFeedback] replset setting syncSourceFeedback to node_slave1:2701
2015-12-09T06:30:03.619+0800 I REPL [rsBackgroundSync] replSet syncTail condition 1
2015-12-09T06:30:04.619+0800 I - [rsBackgroundSync] Assertion: 16235:going to start syncing, but buffer is not empty
2015-12-09T06:30:04.710+0800 I CONTROL [rsBackgroundSync]
0xf77cc9 0xf16b81 0xefbddf 0xc15d20 0xc173b4 0xc17458 0xfc4e84 0x7f756be01e9a 0x7f756a8d838d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B77CC9"},{"b":"400000","o":"B16B81"},{"b":"400000","o":"AFBDDF"},{"b":"400000","o":"815D20"},{"b":"400000","o":"8173B4"},{"b":"400000","o":"817458"},{"b":"400000","o":"BC4E84"},{"b":"7F756BDFA000","o":"7E9A"},{"b":"7F756A7E5000","o":"F338D"}],"processInfo":{ "mongodbVersion" : "3.0.5", "gitVersion" : "8bc4ae20708dbb493cb09338d9e7be6698e4a3a3", "uname" : { "sysname" : "Linux", "release" : "3.2.0-23-generic", "version" : "#36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "1D3BB52AB05E62C119D95718E82CD0447A8CE0FA" }, { "b" : "7FFFE72FF000", "elfType" : 3, "buildId" : "0C0CC7757C6933FC35A55C90475D300ADAF970C5" }, { "b" : "7F756BDFA000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "C340AF9DEE97C17C730F7D03693286C5194A46B8" }, { "b" : "7F756BB9C000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "425A091D0C88F36982CD69122BDB853F6E261DA0" }, { "b" : "7F756B7C1000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "EB6660D716414E4EFB6C9E74657DDD709CB2383F" }, { "b" : "7F756B5B9000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "352C5B373A50E6C4AB881A5DB6F5766FDF81EEE0" }, { "b" : "7F756B3B5000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "D181AF551DBBC43E9D55913D532635FDE18E7C4E" }, { "b" : "7F756B0B5000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "B534DA725D06A04267EB2FEB92B9CC14C838B57B" }, { "b" : "7F756ADB9000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "3CD696AF1C2A66D38B949CB0D1D2A218B3ED858B" }, { "b" : "7F756ABA3000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "ECF322A96E26633C5D10F18215170DD4395AF82C" }, { "b" : "7F756A7E5000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "AA64A66AC46BFF200848C0A0694011BD0140AB4E" }, { "b" : "7F756C017000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "E25AD1A11CCF57E734116B8EC9C69F643DCA9F18" }, { "b" : "7F756A5CE000", "path" : "/lib/x86_64-linux-gnu/libz.so.1", "elfType" : 3, "buildId" : "F695ECFCF3918D5D34989398A14B7ECDD9F46CD0" } ] }}
mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf77cc9]
mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf16b81]
mongod(_ZN5mongo11msgassertedEiPKc+0xAF) [0xefbddf]
mongod(_ZN5mongo4repl14BackgroundSync5startEPNS_16OperationContextE+0x320) [0xc15d20]
mongod(_ZN5mongo4repl14BackgroundSync15_producerThreadEv+0xA4) [0xc173b4]
mongod(_ZN5mongo4repl14BackgroundSync14producerThreadEv+0x68) [0xc17458]
mongod(+0xBC4E84) [0xfc4e84]
libpthread.so.0(+0x7E9A) [0x7f756be01e9a]
libc.so.6(clone+0x6D) [0x7f756a8d838d]
----- END BACKTRACE -----
|
|
|
tredman@fb.com, dynamike, is this still an issue for you? Does Eric's last message address your questions? Please let us know.
Thanks,
Ramón.
|
|
Hi Travis,
The new instances you have posted appear to be a different problem. The original problem was due to a duplicated op in an oplog on a 2.6 primary node. For your new instances, the error message on the secondaries indicates to me that it isn't a duplicated op; instead, it would appear that the secondary nodes' query on the sync source's oplog is not returning data that matches up with a previous oplog query.
In the new log snippets, I see that just prior to querying a sync source's oplog, a new connection is made. If that connection is to a different node than the one that was used to copy the documents, this problem may have a chance of occurring. I can't tell more without the replica set configuration and the full logs from each of the nodes (the version of MongoDB that is running on each node is important, as well as each node's initial sync source, which appears earlier in the logs than what you have initially provided here).
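One rough way to test this theory, assuming you can identify the candidate sync source (db104:27017 shows up as the node the cloner connected to in the logs below, but whether it was the oplog sync source is exactly what needs confirming): take the "previous timestamp" from the fassert line on the failed secondary (for example 55f15435:c4, which is Timestamp(1441879093, 196) in shell syntax) and check whether that optime actually exists in the source's oplog. If it doesn't, the secondary's next oplog query cannot line up with what it has already applied. This is only a sketch of the check:
use local
db.oplog.rs.find( { ts: Timestamp(1441879093, 196) } )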
|
|
Hi Eric,
Sorry, I wasn't able to get an answer to your question because I didn't have log data going back that far. I'm pinging the ticket because we had another incident today: we had a replica set with three new nodes doing initial sync, and all three crashed within minutes of each other with this same issue.
The conditions were slightly different from our previous report because the timestamp difference was very large. In this first example, there was an 84762-second difference between the current op and the next op.
Thu Sep 10 20:39:09.514 I INDEX [rsSync] build index on: appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index" }
Thu Sep 10 20:39:09.514 I INDEX [rsSync] building index using bulk method
Thu Sep 10 20:39:09.514 D STORAGE [rsSync] appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index: clearing plan cache - collection info cache reset
Thu Sep 10 20:39:09.514 D INDEX [rsSync] bulk commit starting for index: _id_
Thu Sep 10 20:39:09.514 D INDEX [rsSync] done building bottom layer, going to commit
Thu Sep 10 20:39:09.514 I INDEX [rsSync] build index done. scanned 0 total records. 0 secs
Thu Sep 10 20:39:09.514 D STORAGE [rsSync] looking up metadata for: appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index @ RecordId(396770)
Thu Sep 10 20:39:09.514 D STORAGE [rsSync] looking up metadata for: appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index @ RecordId(396770)
Thu Sep 10 20:39:09.514 D STORAGE [rsSync] appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index: clearing plan cache - collection info cache reset
Thu Sep 10 20:39:09.514 D STORAGE [rsSync] appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index: clearing plan cache - collection info cache reset
Thu Sep 10 20:39:09.516 I REPL [rsSync] initial sync cloning db: config
Thu Sep 10 20:39:09.519 D NETWORK [rsSync] connected to server db104:27017 (10.252.5.205)
Thu Sep 10 20:39:09.554 I REPL [rsSync] initial sync data copy, starting syncup
Thu Sep 10 20:39:09.848 I REPL [rsSync] oplog sync 1 of 3
Thu Sep 10 20:39:09.885 F REPL [rsSync] replication oplog stream went back in time. previous timestamp: 55f15435:c4 newest timestamp: 55f0091b:6f. Op being applied: { ts: Timestamp 1441794331000|111, h: -3122181208205084538, v: 2, op: "u", ns: "appdata119.app_15acc0d6-faf3-4681-a963-f91ec5f7d588:Locations", o2: { _id: "Exiw3lvQZY" }, o: { $set: { _updated_at: new Date(1441794331548), rankValue: -7.445298078517805e-08 } } }
Thu Sep 10 20:39:09.885 I - [rsSync] Fatal Assertion 18905
Thu Sep 10 20:39:09.885 I - [rsSync]
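For reference, the 84762-second figure can be reproduced from the hex optimes in the fassert line: the part before the colon is seconds since the Unix epoch, in hex, and the part after is the increment. A quick sanity check in the mongo shell (plain JavaScript):
var prev = parseInt("55f15435", 16);   // 1441879093, the "previous timestamp"
var next = parseInt("55f0091b", 16);   // 1441794331, the "newest timestamp"
print(prev - next);                    // 84762 seconds, i.e. the stream jumped back roughly 23.5 hours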
|
In this second example, the difference was 84651 seconds. Note that the timestamps are different in each case, as is the op being applied at the time of the crash.
Thu Sep 10 20:39:10.229 I INDEX [rsSync] build index on: appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index" }
Thu Sep 10 20:39:10.229 I INDEX [rsSync] building index using bulk method
Thu Sep 10 20:39:10.229 D STORAGE [rsSync] appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index: clearing plan cache - collection info cache reset
Thu Sep 10 20:39:10.229 D INDEX [rsSync] bulk commit starting for index: _id_
Thu Sep 10 20:39:10.229 D INDEX [rsSync] done building bottom layer, going to commit
Thu Sep 10 20:39:10.229 I INDEX [rsSync] build index done. scanned 0 total records. 0 secs
Thu Sep 10 20:39:10.229 D STORAGE [rsSync] looking up metadata for: appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index @ RecordId(396770)
Thu Sep 10 20:39:10.229 D STORAGE [rsSync] looking up metadata for: appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index @ RecordId(396770)
Thu Sep 10 20:39:10.229 D STORAGE [rsSync] appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index: clearing plan cache - collection info cache reset
Thu Sep 10 20:39:10.229 D STORAGE [rsSync] appdata100.app_57d46c89-3aaf-4817-82ff-e6ac6404d106:_Index: clearing plan cache - collection info cache reset
Thu Sep 10 20:39:10.230 I REPL [rsSync] initial sync cloning db: config
Thu Sep 10 20:39:10.233 D NETWORK [rsSync] connected to server db104:27017 (10.252.5.205)
Thu Sep 10 20:39:10.254 I REPL [rsSync] initial sync data copy, starting syncup
Thu Sep 10 20:39:10.345 I REPL [rsSync] oplog sync 1 of 3
Thu Sep 10 20:39:10.361 F REPL [rsSync] replication oplog stream went back in time. previous timestamp: 55f153d6:125 newest timestamp: 55f0092b:7d. Op being applied: { ts: Timestamp 1441794347000|125, h: 7670102112123622001, v: 2, op: "u", ns: "appdata111.app_26044e27-cfbd-4cf2-8ab1-9a9228b3b2bb:_Installation", o2: { _id: "N8vdqzy1LJ" }, o: { $set: { _updated_at: new Date(1441794347499) } } }
Thu Sep 10 20:39:10.361 I - [rsSync] Fatal Assertion 18905
Thu Sep 10 20:39:10.361 I - [rsSync]
|
These nodes had been undergoing initial sync for about 48 hours. We've restarted initial sync and will see how it goes this time, but I wonder if there is anything we can do to debug this more proactively/efficiently? Waiting days to reproduce the issue makes this hard to track down. I'll happily set up some extra data collection or something during the initial sync if it will help.
Thanks,
Travis
|
|
Thanks Travis. That op was recorded on Tue, 18 Aug 2015 at 22:42:36, immediately after a reconfig was run. Do you know if the node with the duplicate op was primary at 22:42:36? If you have all the logs from Tuesday, we can check. I see that the node with the duplicate op is running version 2.6.7; the latest version of the 2.6 branch is 2.6.11. I will go through the fixes in later versions to see if something related to this symptom has been reported and fixed.
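As a quick cross-check of the quoted wall-clock time, the seconds component of the duplicated op's optime is an ordinary Unix timestamp and converts in the mongo shell as follows:
new Date(1439937756 * 1000)   // ISODate("2015-08-18T22:42:36Z")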
|
|
Build info for the primary.
mongodata3:PRIMARY> db.serverBuildInfo()
{
    "version" : "2.6.7",
    "gitVersion" : "a7d57ad27c382de82e9cb93bf983a80fd9ac9899",
    "OpenSSLVersion" : "",
    "sysInfo" : "Linux build7.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49",
    "loaderFlags" : "-fPIC -pthread -Wl,-z,now -rdynamic",
    "compilerFlags" : "-Wnon-virtual-dtor -Woverloaded-virtual -fPIC -fno-strict-aliasing -ggdb -pthread -Wall -Wsign-compare -Wno-unknown-pragmas -Winvalid-pch -pipe -Werror -O3 -Wno-unused-function -Wno-deprecated-declarations -fno-builtin-memcmp",
    "allocator" : "tcmalloc",
    "versionArray" : [
        2,
        6,
        7,
        0
    ],
    "javascriptEngine" : "V8",
    "bits" : 64,
    "debug" : false,
    "maxBsonObjectSize" : 16777216,
    "ok" : 1
}
|
|
|
Hi Eric, Mike is on leave for the next few weeks so I'm going to be managing this ticket from our end.
I ran the above query on the primary and it does look like there are multiple entries for the same op:
mongodata3:PRIMARY> db.oplog.rs.find( {ts: {$gte:Timestamp(1439937756,69) }}, {}, 5,0,5,8 )
{ "ts" : Timestamp(1439937756, 69), "h" : NumberLong("-5240046610690244083"), "v" : 2, "op" : "n", "ns" : "", "o" : { "msg" : "Reconfig set", "version" : 83 } }
{ "ts" : Timestamp(1439937756, 70), "h" : NumberLong("1660986903740963784"), "v" : 2, "op" : "i", "ns" : "ABC:XXX", "o" : { "_id" : "bWsEVm2o1J", "IP" : "X.X.X.X", "Serial" : "app", "OS" : "5.0.1", "Locale" : "", "FirstName" : "XXX", "LoyaltyPoints" : "5", "_rperm" : [ "xEqWRGEwoD" ], "LastName" : "", "Rank" : "5", "_wperm" : [ "xEqWRGEwoD" ], "DeviceId" : "357197051399544", "DevicePlatform" : "Android", "DeviceType" : "Phone", "SCNMemberKey" : "", "LoyaltyStatusId" : "10", "PointsDaily" : "7949", "PointsWeekly" : "7949", "Points" : "7949", "AppVersion" : "22.65.0", "Visits" : "4", "IsPayingCustomer" : "true", "Credits" : "15616", "_updated_at" : ISODate("2015-08-18T22:42:36.424Z"), "_acl" : { "xEqWRGEwoD" : { "r" : true, "w" : true } }, "PurchasedPremiumCount" : "0", "_created_at" : ISODate("2015-08-18T22:42:36.424Z"), "UserId" : "10700667" } }
{ "ts" : Timestamp(1439937756, 70), "h" : NumberLong("1660986903740963784"), "v" : 2, "op" : "i", "ns" : "ABC:XXX", "o" : { "_id" : "bWsEVm2o1J", "IP" : "X.X.X.X", "Serial" : "app", "OS" : "5.0.1", "Locale" : "", "FirstName" : "XXX", "LoyaltyPoints" : "5", "_rperm" : [ "xEqWRGEwoD" ], "LastName" : "", "Rank" : "5", "_wperm" : [ "xEqWRGEwoD" ], "DeviceId" : "357197051399544", "DevicePlatform" : "Android", "DeviceType" : "Phone", "SCNMemberKey" : "", "LoyaltyStatusId" : "10", "PointsDaily" : "7949", "PointsWeekly" : "7949", "Points" : "7949", "AppVersion" : "22.65.0", "Visits" : "4", "IsPayingCustomer" : "true", "Credits" : "15616", "_updated_at" : ISODate("2015-08-18T22:42:36.424Z"), "_acl" : { "xEqWRGEwoD" : { "r" : true, "w" : true } }, "PurchasedPremiumCount" : "0", "_created_at" : ISODate("2015-08-18T22:42:36.424Z"), "UserId" : "10700667" } }
{ "ts" : Timestamp(1439937782, 1), "h" : NumberLong("4124812040359931130"), "v" : 2, "op" : "u", "ns" : "ABC:XXX", "o2" : { "_id" : "FjuzozeNAk" }, "o" : { "$set" : { "message" : "{\"message\":\"TIME_OK\",\"status\":{},\"totalApiRequests\":28,\"timeStarted\":1439937726180,\"lateStartSeconds\":6.986,\"timeElapsedMills\":21918,\"timeElapsedMessage\":\"21.918s\",\"name\":\"MASTER\",\"timeSinceSeasonEndProcessRun\":1439937748098,\"timeSeasonProcessingStartedAt\":0,\"PROCESS_REJECTED_INVITATIONS\":true,\"PROCESS_EXPIRED_INVITATIONS\":true,\"PROCESS_SUCCESSFUL_INVITATIONS\":true,\"PROCESS_REQUESTED_INVITATIONS\":true,\"PROCESS_FINISHED_MATCHES\":true,\"PROCESS_PVP_REQUESTS\":true,\"PROCESS_TRANSFERS\":true,\"PROCESS_RENAMED_TEAMS\":true,\"CLEAR_INACTIVE_TEAMS\":true,\"PROCESS_PURCHASES\":true,\"PROCESS_CHECK_SEASON_END\":true,\"MAX_RUN_TIME\":170000,\"totalRuns\":3,\"runsAndTimes\":[\"1:852\",\"2:10862\",\"3:21011\"],\"terminationTime\":0,\"terminationAt\":\"\",\"day\":9,\"cancellationPromise\":{\"_resolved\":false,\"_rejected\":false,\"_resolvedCallbacks\":[null],\"_rejectedCallbacks\":[null]},\"FORCE_TEAMS_CREATION\":false}" } } }
{ "ts" : Timestamp(1439937782, 2), "h" : NumberLong("-6066342674646170879"), "v" : 2, "op" : "u", "ns" : "ABC:XXX", "o2" : { "_id" : "XdF2rYZ9Mu" }, "o" : { "$set" : { "sessions" : NumberLong(1262), "lastSession" : ISODate("2015-08-18T22:42:36.520Z"), "pageViews" : NumberLong(3625), "_updated_at" : ISODate("2015-08-18T22:42:36.744Z") } } }
|
|
|
Michael, the file you attached "md3-db2-full.log.tar.gz" is the same as the file you attached earlier "md3-db1.gz"; I suspect something went wrong with the generation of the full log?
|
|
Since the node fails with the same error when restarted, the problem may be on the source rather than the target node. Can you query the oplog on all the other nodes and confirm there is only one document with
{ "ts" : Timestamp(1439937756, 70), "h" : NumberLong("1660986903740963784") }
?
Something like this should work:
use local
rs.slaveOk()
db.oplog.rs.find( {ts: {$gte:Timestamp(1439937756,69) }}, {}, 5,0,5,8 )
|
The numbers (5,0,5,8) are, respectively, limit, skip, batchSize, and options (a bitmap) – "8" means use OplogReplay which enables the query to use a special optimized algorithm to return results. By using OplogReplay, the projection field of the find() function is ignored so I took it out.
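For readers less familiar with the positional arguments to find(), the following should be an equivalent formulation in the legacy mongo shell using explicit cursor modifiers (a sketch; DBQuery.Option.oplogReplay is the option bit with value 8):
use local
rs.slaveOk()
db.oplog.rs.find( { ts: { $gte: Timestamp(1439937756, 69) } } ).limit(5).batchSize(5).addOption(DBQuery.Option.oplogReplay)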
If the problem is indeed on the source, rebuilding the node will appear to help because the target will only look at the oplog on the source past the point of the problem.
|
|
Attached a larger log for m3-db2.
|
|
The node fails to restart with the same error, so I'm left with having to rebuild the node.
|
|
Do you have further history for the log for md3-db2? I was hoping to determine what node it was syncing from. This would be the node that fed it an incorrect op after the election of md3-db1.
|
|
I don't see why any oplog surgery is necessary; their oplogs are still correct, just getting stale. My theory is that they crashed because they used a query on their sync source that didn't precisely overlap with what they already had in their oplog. If you attempt to restart them, what happens?
|
|
Gotcha, but the fact that it happened in 2.6 means the code does get exercised. Is there any sort of oplog surgery I could do to the crashed 3.0 nodes rather than syncing from scratch?
|
|
Thanks for the logs! I'll take a look.
I removed the force-sync-from-primary logic as part of a big refactor because that particular bandaid was a layer violation and it would have taken a big effort to reproduce the behavior it provided. It's a condition that should be prohibited by the design of the logic, so I was comfortable with shutting down the server at that point so that a full diagnosis could take place.
|
|
Log files are attached, named by hostname. The state of the replica set before the crash was:
db111 - PRIMARY running 2.6.7
db109 - SECONDARY running 2.6.7
db110 - SECONDARY running 2.6.7
db112 - SECONDARY running 2.6.7
md3-db1 - SECONDARY running 3.0.5
md3-db2 - SECONDARY running 3.0.5
md3-db3 - SECONDARY running 3.0.5
I triggered an election to make md3-db1 PRIMARY, and both md3-db2 and md3-db3 crashed with the "back in time" fassert. The other 2.6.7 SECONDARY nodes were fine.
I noticed that after the election db111 (2.6.7) also logged a back-in-time error, but the logic in 2.6 is to force a sync from the PRIMARY:
Wed Aug 19 15:35:32.006 [rsSync] replication oplog stream went back in time. previous timestamp: 55d4a23f:d2 newest timestamp: 55d4a23f:cf. attempting to sync directly from primary.
|
The behavior change between 3.0 and 2.6 comes from this diff –
https://github.com/mongodb/mongo/commit/e86e08deff7293b5778fad27df9031c013595b12#diff-c64a61c81bff3e9fbae542a82463b76d
Why was it decided to fassert rather than force sync from the primary?
|
|
Sorry for the typo. Correct, this issue is new in 3.0.5.
|
|
I'm sorry, did you mean to say you have experienced this issue only after upgrading to 3.0.5? That is what the original description says.
|
|
I'll work on getting the logs. I suspect the issue could be related to SERVER-19375. We haven't had this issue since upgrading to 3.0.5 from 3.0.4.
|
|
You're correct that there is one op that is being applied twice. However, the sync source logic should prevent this from happening.
Do you have the full logs from the failing node and its sync source node for an instance of this issue?
|
|
I put the crashed secondary in standalone mode and took a look at the oplog by running:
db.oplog.rs.find( { }, { ts: 1, h: 1 } ).sort( {$natural : -1} )
{ "ts" : Timestamp(1439937756, 70), "h" : NumberLong("1660986903740963784") }
{ "ts" : Timestamp(1439937756, 69), "h" : NumberLong("-5240046610690244083") }
{ "ts" : Timestamp(1439937756, 68), "h" : NumberLong("364598519254495583") }
{ "ts" : Timestamp(1439937756, 67), "h" : NumberLong("-1977037730287425552") }
{ "ts" : Timestamp(1439937756, 66), "h" : NumberLong("1144686957426521030") }
|
It looks like the op is being applied twice.
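If it is useful, a generic way to scan an entire oplog for duplicated entries (a mongo shell sketch, not specific to this ticket) is to group on { ts, h } and keep any group that appears more than once; on a large oplog you may need to pass { allowDiskUse: true } as a second argument to aggregate():
db.getSiblingDB("local").oplog.rs.aggregate([
  { $group: { _id: { ts: "$ts", h: "$h" }, n: { $sum: 1 } } },
  { $match: { n: { $gt: 1 } } }
])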
|
|
Tue Aug 18 22:42:38.336 I REPL [ReplicationExecutor] New replica set config in use: { _id: "blah", version: 83, members: [ { _id: 9, host: "db109:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 10, host: "db110:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 11, host: "db111:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 12, host: "db112:27017", arbiterOnly: false, buildIndexes: true, hidden: true, priority: 0.0, tags: {}, slaveDelay: 0, votes: 0 }, { _id: 13, host: "arbiter3:30002", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 14, host: "arbiter4:30002", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 15, host: "db3:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 0 }, { _id: 16, host: "db2:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 17, host: "db1:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults:
{ w: 1, wtimeout: 0 }
} }
Tue Aug 18 22:42:38.336 I REPL [ReplicationExecutor] This node is db2:27017 in the config
Tue Aug 18 22:42:38.337 I REPL [ReplicationExecutor] Member db111:27017 is now in state SECONDARY
Tue Aug 18 22:42:38.337 I REPL [ReplicationExecutor] Standing for election
Tue Aug 18 22:42:38.337 I REPL [ReplicationExecutor] not electing self, db109:27017 would veto with 'db2:27017 has lower priority than db1:27017'
Tue Aug 18 22:42:38.337 I REPL [ReplicationExecutor] not electing self, we are not freshest
Tue Aug 18 22:42:38.338 I REPL [ReplicationExecutor] Standing for election
Tue Aug 18 22:42:38.338 I REPL [ReplicationExecutor] Member db1:27017 is now in state PRIMARY
Tue Aug 18 22:42:38.338 I REPL [ReplicationExecutor] not electing self, db109:27017 would veto with 'db2:27017 has lower priority than db1:27017'
Tue Aug 18 22:42:40.670 F REPL [rsSync] replication oplog stream went back in time. previous timestamp: 55d3b4dc:46 newest timestamp: 55d3b4dc:46. Op being applied: { ts: Timestamp 1439937756000|70, h: 1660986903740963784, v: 2, op: "i", ns: "...", o:
{ ..}
Tue Aug 18 22:42:40.671 I - [rsSync] Fatal Assertion 18905
Tue Aug 18 22:42:40.671 I - [rsSync]
***aborting after fassert() failure
|
|
We've hit this issue again. What could cause this to happen?
|