[SERVER-17081] Resync command under write load can fassert Created: 27/Jan/15  Updated: 06/Dec/22  Resolved: 11/Jun/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Scott Hernandez (Inactive) Assignee: Backlog - Replication Team
Resolution: Won't Fix Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File resync_command_with_write_load.js    
Issue Links:
Depends
depends on SERVER-18031 Data Replicator : resync Closed
Duplicate
is duplicated by SERVER-23907 While performing the load test on sin... Closed
Related
related to SERVER-25662 Data replicator should stop initial s... Closed
is related to SERVER-17071 Resync crashes secondary Closed
is related to SERVER-23250 One node in replicated sharded cluste... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

m31001| 2015-01-27T14:38:12.778-0500 I REPL     [rsSync] initial sync drop all databases
 m31001| 2015-01-27T14:38:12.778-0500 I STORAGE  [rsSync] dropAllDatabasesExceptLocal 2
 m31001| 2015-01-27T14:38:12.786-0500 I JOURNAL  [rsSync] journalCleanup...
 m31001| 2015-01-27T14:38:12.786-0500 I JOURNAL  [rsSync] removeJournalFiles
 m31001| 2015-01-27T14:38:12.788-0500 I JOURNAL  [rsSync] journalCleanup...
 m31001| 2015-01-27T14:38:12.788-0500 I JOURNAL  [rsSync] removeJournalFiles
 m31001| 2015-01-27T14:38:12.790-0500 I REPL     [rsSync] initial sync clone all databases
 m31001| 2015-01-27T14:38:12.790-0500 I REPL     [rsSync] initial sync cloning db: test
 m31000| 2015-01-27T14:38:12.792-0500 I NETWORK  [initandlisten] connection accepted from 192.168.1.104:60845 #21 (6 connections now open)
 m31000| 2015-01-27T14:38:12.792-0500 I NETWORK  [conn21] end connection 192.168.1.104:60845 (5 connections now open)
 m31000| 2015-01-27T14:38:12.793-0500 I NETWORK  [initandlisten] connection accepted from 192.168.1.104:60846 #22 (6 connections now open)
 m31001| 2015-01-27T14:38:12.799-0500 I INDEX    [rsSync] allocating new ns file /data/db/testReplSet-1/test.ns, filling with zeroes...
 m31001| 2015-01-27T14:38:12.846-0500 I STORAGE  [FileAllocator] allocating new datafile /data/db/testReplSet-1/test.0, filling with zeroes...
 m31001| 2015-01-27T14:38:12.916-0500 I STORAGE  [FileAllocator] done allocating datafile /data/db/testReplSet-1/test.0, size: 16MB,  took 0.069 secs
 m31001| 2015-01-27T14:38:12.971-0500 I INDEX    [rsSync] build index on: test.a properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.a" }
 m31001| 2015-01-27T14:38:12.971-0500 I INDEX    [rsSync] 	 building index using bulk method
 m31001| 2015-01-27T14:38:12.982-0500 I INDEX    [rsSync] build index done.  scanned 5479 total records. 0 secs
 m31001| 2015-01-27T14:38:12.982-0500 I REPL     [rsSync] initial sync data copy, starting syncup
 m31001| 2015-01-27T14:38:12.982-0500 I REPL     [rsSync] oplog sync 1 of 3
 m31001| 2015-01-27T14:38:12.983-0500 F REPL     [rsSync] replication oplog stream went back in time. previous timestamp: 54c7e924:1a6 newest timestamp: 54c7e924:1a4. Op being applied: { ts: Timestamp 1422387492000|420, h: 3326148221373857081, v: 2, op: "i", ns: "test.a", o: { _id: ObjectId('54c7e9249ee01fcbe529f117'), x: 5391.0 } }
 m31001| 2015-01-27T14:38:12.983-0500 I -        [rsSync] Fatal Assertion 18905
 m31001| 2015-01-27T14:38:12.983-0500 I -        [rsSync] 
 m31001| 
 m31001| ***aborting after fassert() failure

The issue below is that lastapplied is greater than minvalid, so setting minvalid on initial sync failure addresses this:

m31001| 2015-01-27T14:17:05.790-0500 I REPL     [rsSync] initial sync pending
 m31001| 2015-01-27T14:17:05.790-0500 I REPL     [ReplicationExecutor] syncing from: pans.local:31000
 m31000| 2015-01-27T14:17:05.790-0500 I NETWORK  [conn16] end connection 192.168.1.104:57695 (3 connections now open)
 m31000| 2015-01-27T14:17:05.793-0500 I NETWORK  [initandlisten] connection accepted from 192.168.1.104:57717 #17 (4 connections now open)
 m31000| 2015-01-27T14:17:05.796-0500 I NETWORK  [initandlisten] connection accepted from 192.168.1.104:57718 #18 (5 connections now open)
 m31001| 2015-01-27T14:17:05.797-0500 I REPL     [rsSync] initial sync drop all databases
 m31001| 2015-01-27T14:17:05.797-0500 I STORAGE  [rsSync] dropAllDatabasesExceptLocal 2
 m31001| 2015-01-27T14:17:05.803-0500 I REPL     [rsBackgroundSync] replSet our last op time fetched: Dec 31 19:00:00:0
 m31001| 2015-01-27T14:17:05.803-0500 I REPL     [rsBackgroundSync] replset source's GTE: Jan 27 14:16:11:1
 m31001| 2015-01-27T14:17:05.808-0500 I JOURNAL  [rsSync] journalCleanup...
 m31001| 2015-01-27T14:17:05.808-0500 I JOURNAL  [rsSync] removeJournalFiles
 m31001| 2015-01-27T14:17:05.811-0500 I JOURNAL  [rsSync] journalCleanup...
 m31001| 2015-01-27T14:17:05.811-0500 I JOURNAL  [rsSync] removeJournalFiles
 m31001| 2015-01-27T14:17:05.813-0500 I REPL     [rsSync] initial sync clone all databases
 m31001| 2015-01-27T14:17:05.813-0500 F REPL     [rsBackgroundSync] replSet need to rollback, but in inconsistent state
 m31001| 2015-01-27T14:17:05.813-0500 I REPL     [rsBackgroundSync] minvalid: 54c7e430:1de our last optime: 0:0
 m31001| 2015-01-27T14:17:05.813-0500 I REPL     [rsSync] initial sync cloning db: test
 m31001| 2015-01-27T14:17:05.813-0500 I -        [rsBackgroundSync] Fatal Assertion 18750
 m31001| 2015-01-27T14:17:05.813-0500 I -        [rsBackgroundSync] 
 m31001| 
 m31001| ***aborting after fassert() failure



 Comments   
Comment by Ben McCann [ 30/Jun/15 ]

Does it make sense to update the assertion message being printed in the logs to fix the format of the timestamps? Is it a hex representation? Not sure about the colon. We're seeing - previous timestamp: 55924d08:8 newest timestamp: 559241bc:d

Comment by Ben McCann [ 30/Jun/15 ]

We're seeing this on Mongo 3.0.4

Comment by Jonathan Abrahams [ 24/Mar/15 ]

I saw this in a 3 node replica set in 3.1.0, on a secondary (not from initial sync):

2015-03-24T18:23:34.868+0000 E REPL     [rsBackgroundSync] sync producer problem: 16235 going to start syncing, but buffer is not empty
2015-03-24T18:23:34.868+0000 I -        [rsBackgroundSync] Assertion: 16235:going to start syncing, but buffer is not empty
2015-03-24T18:23:34.872+0000 I CONTROL  [rsBackgroundSync]
 0xef13c9 0xe96c11 0xe7bb1f 0xbc8e5b 0xbca7a4 0xbca848 0xf40b73 0x7f8aa155e182 0x7f8aa128b00d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"AF13C9"},{"b":"400000","o":"A96C11"},{"b":"400000","o":"A7BB1F"},{"b":"400000","o":"7C8E5B"},{"b":"400000","o":"7CA7A4"},{"b":"400000","o":"7CA848"},{"b":"400000","o":"B40B73"},{"b":"7F8AA1556000","o":"8182"},{"b":"7F8AA1190000","o":"FB00D"}],"processInfo":{ "mongodbVersion" : "3.1.0", "gitVersion" : "7d15cd965ccb3ad684d8ae4e4f09d5b1e9394552", "uname" : { "sysname" : "Linux", "release" : "3.13.0-44-generic", "version" : "#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "619037FE8091C113A33690E692D2133F138909A0" }, { "b" : "7FFF127CD000", "elfType" : 3, "buildId" : "9D77366C6409A9EA266179080FA7C779EEA8A958" }, { "b" : "7F8AA277A000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "FF43D0947510134A8A494063A3C1CF3CEBB27791" }, { "b" : "7F8AA23A0000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "379F80D2768BA6A21F52781895EE9F47B34A0A85" }, { "b" : "7F8AA2198000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7F8AA1F94000", "path" :
"/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7F8AA1C90000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "19EFDDAB11B3BF5C71570078C59F91CF6592CE9E" }, { "b" : "7F8AA198A000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7F8AA1774000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "8D0AA71411580EE6C08809695C3984769F25725B" }, { "b" : "7F8AA1556000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }, { "b" : "7F8AA1190000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "882AD7AAD54790E2FA6EF64CA2E6188F06BF9207" }, { "b" : "7F8AA29D8000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xef13c9]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xe96c11]
 mongod(_ZN5mongo11msgassertedEiPKc+0xAF) [0xe7bb1f]
 mongod(_ZN5mongo4repl14BackgroundSync5startEPNS_16OperationContextE+0x2FB) [0xbc8e5b]
 mongod(_ZN5mongo4repl14BackgroundSync15_producerThreadEv+0xA4) [0xbca7a4]
 mongod(_ZN5mongo4repl14BackgroundSync14producerThreadEv+0x68) [0xbca848]
 mongod(+0xB40B73) [0xf40b73]
 libpthread.so.0(+0x8182) [0x7f8aa155e182]
 libc.so.6(clone+0x6D) [0x7f8aa128b00d]
-----  END BACKTRACE  -----
2015-03-24T18:23:34.872+0000 E REPL     [rsBackgroundSync] sync producer problem: 16235 going to start syncing, but buffer is not empty
2015-03-24T18:23:39.830+0000 I NETWORK  [conn508] end connection 10.138.86.132:52645 (1 connection now open)
2015-03-24T18:23:39.831+0000 I NETWORK  [initandlisten] connection accepted from 10.138.86.132:52647 #510 (2 connections now open)
2015-03-24T18:23:42.619+0000 I STORAGE  [DataFileSync] flushing mmaps took 192172ms  for 14 files
2015-03-24T18:24:03.216+0000 I NETWORK  [conn509] end connection 10.153.240.7:41372 (1 connection now open)
2015-03-24T18:24:03.216+0000 I NETWORK  [initandlisten] connection accepted from 10.153.240.7:41374 #511 (2 connections now open)
2015-03-24T18:24:09.840+0000 I NETWORK  [conn510] end connection 10.138.86.132:52647 (1 connection now open)
2015-03-24T18:24:09.841+0000 I NETWORK  [initandlisten] connection accepted from 10.138.86.132:52649 #512 (2 connections now open)
2015-03-24T18:24:14.036+0000 I NETWORK  [rsBackgroundSync] Socket flush send() errno:9 Bad file descriptor 10.138.86.132:37000
2015-03-24T18:24:14.036+0000 I -        [rsBackgroundSync] caught exception (socket exception [SEND_ERROR] for 10.138.86.132:37000) in destructor (~PiggyBackData)
2015-03-24T18:24:14.036+0000 I REPL     [ReplicationExecutor] syncing from: ec2-54-144-22-47.compute-1.amazonaws.com:37000
2015-03-24T18:24:14.073+0000 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to ec2-54-144-22-47.compute-1.amazonaws.com:37000
2015-03-24T18:24:14.485+0000 F REPL     [rsSync] replication oplog stream went back in time. previous timestamp: 5511aaae:fc newest timestamp: 5511aa59:70. Op being applied: { ts: Timestamp 1427221081000|112, h: -1875619851628002890, v: 2, op: "i", ns: "test.users", o: { _id: ObjectId('5511aa596d75276a8eefe619'), ssn: "330-07-9201", lname: "Emfinger", fname: "Laquita", addr: "675 South Mountain Street", zip: 84592.0 } }
2015-03-24T18:24:14.485+0000 I -        [rsSync] Fatal Assertion 18905
2015-03-24T18:24:14.485+0000 I -        [rsSync]
 
***aborting after fassert() failure

Comment by Scott Hernandez (Inactive) [ 02/Feb/15 ]

No, this was not committed as we hadn't fixed these issues and this test still fails.

The thing that was committed was fixing the issue of calling resync at all: SERVER-17071

Comment by Eric Milkie [ 02/Feb/15 ]

Was this committed?

Generated at Thu Feb 08 03:43:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.