[SERVER-28227] [ApplyBatchFinalizerForJournal] Got signal: 6 (Aborted) Created: 07/Mar/17  Updated: 31/May/17  Resolved: 21/Mar/17

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 3.4.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Robert Romano Assignee: Mark Agarunov
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

SUSE Enterprise Linux 12 SP2
MongoDB Server 3.4.2
Replica set with 3 nodes


Operating System: ALL
Participants:

 Description   

2017-03-04T05:50:45.828-0500 E STORAGE  [ApplyBatchFinalizerForJournal] WiredTiger error (28) [1488624645:828501][15745:0x7f7d863ca700], WT_SESSION.log_flush: /hostname/mongo/data/journal/WiredTigerLog.0000059170: handle-write: pwrite: failed to write 100992 bytes at offset 69279744: No space left on device
2017-03-04T05:50:45.864-0500 I -        [ApplyBatchFinalizerForJournal] Invariant failure: s->log_flush(s, "sync=on") resulted in status UnknownError: 28: No space left on device at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 229
2017-03-04T05:50:45.868-0500 I -        [ApplyBatchFinalizerForJournal]
 
***aborting after invariant() failure
 
 
2017-03-04T05:50:45.955-0500 F -        [ApplyBatchFinalizerForJournal] Got signal: 6 (Aborted).
 
 0x7f7db4bf5be1 0x7f7db4bf4cd9 0x7f7db4bf51bd 0x7f7db23c47c0 0x7f7db2073b55 0x7f7db2075131 0x7f7db3e841ff 0x7f7db4900196 0x7f7db48fbb85 0x7f7db46d3fea 0x7f7db5668880 0x7f7db23bc7b6 0x7f7db21189cd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"7F7DB3673000","o":"1582BE1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"7F7DB3673000","o":"1581CD9"},{"b":"7F7DB3673000","o":"15821BD"},{"b":"7F7DB23B5000","o":"F7C0"},{"b":"7F7DB2041000","o":"32B55","s":"gsignal"},{"b":"7F7DB2041000","o":"34131","s":"abort"},{"b":"7F7DB3673000","o":"8111FF","s":"_ZN5mongo25fassertFailedWithLocationEiPKcj"},{"b":"7F7DB3673000","o":"128D196","s":"_ZN5mongo22WiredTigerSessionCache16waitUntilDurableEb"},{"b":"7F7DB3673000","o":"1288B85","s":"_ZN5mongo22WiredTigerRecoveryUnit16waitUntilDurableEv"},{"b":"7F7DB3673000","o":"1060FEA"},{"b":"7F7DB3673000","o":"1FF5880"},{"b":"7F7DB23B5000","o":"77B6"},{"b":"7F7DB2041000","o":"D79CD","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.2", "gitVersion" : "3f76e40c105fc223b3e5aac3e20dcd026b83b38b", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.0.38-0.5-default", "version" : "#1 SMP Fri Aug 3 09:02:17 UTC 2012 (358029e)", "machine" : "x86_64" }, "somap" : [ { "b" : "7F7DB3673000", "elfType" : 3, "buildId" : "EEF048F2BC725C64C859DB42292814F165184232" }, { "b" : "7FFF38CFF000", "elfType" : 3, "buildId" : "F0DC8B581876594941DC196587ADA74A695C0C64" }, { "b" : "7F7DB31FE000", "path" : "/usr/lib64/libssl.so.0.9.8", "elfType" : 3, "buildId" : "E8358CAA116D347873309FD2B418A97CB5FD3938" }, { "b" : "7F7DB2E6E000", "path" : "/usr/lib64/libcrypto.so.0.9.8", "elfType" : 3, "buildId" : "921B613DF2F59194920EA4F00C63F4D22E9DC74A" }, { "b" : "7F7DB2C65000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "44612B93C19E6567318299411987B113D2387081" }, { "b" : "7F7DB2A61000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "3E4F6BFEE9FDF77CA975B77B8C325347D9228BB8" }, { "b" : "7F7DB27E8000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "B10C3CAE031BA5A87E715C117A83CD3BEF83EBD2" }, { "b" : "7F7DB25D2000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "FE7C25BFB3E605F9D6C1CB00B3C5F96ED95BE6E5" }, { "b" : "7F7DB23B5000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "368B7757BC9F9D7E2E93678C63CB3E5587A9934F" }, { "b" : "7F7DB2041000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "72E7B043935A2BD0B80D325F7F166A132CF37140" }, { "b" : "7F7DB3452000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "C81DE241A528795F8DBFDE0F0E0E236F9A6554E6" }, { "b" : "7F7DB1E2B000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "4C05D1EB180F9C02B81A0C559C813DADA91E0CA4" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x7f7db4bf5be1]
 mongod(+0x1581CD9) [0x7f7db4bf4cd9]
 mongod(+0x15821BD) [0x7f7db4bf51bd]
 libpthread.so.0(+0xF7C0) [0x7f7db23c47c0]
 libc.so.6(gsignal+0x35) [0x7f7db2073b55]
 libc.so.6(abort+0x181) [0x7f7db2075131]
 mongod(_ZN5mongo25fassertFailedWithLocationEiPKcj+0x0) [0x7f7db3e841ff]
 mongod(_ZN5mongo22WiredTigerSessionCache16waitUntilDurableEb+0x696) [0x7f7db4900196]
 mongod(_ZN5mongo22WiredTigerRecoveryUnit16waitUntilDurableEv+0x15) [0x7f7db48fbb85]
 mongod(+0x1060FEA) [0x7f7db46d3fea]
 mongod(+0x1FF5880) [0x7f7db5668880]
 libpthread.so.0(+0x77B6) [0x7f7db23bc7b6]
 libc.so.6(clone+0x6D) [0x7f7db21189cd]
-----  END BACKTRACE  -----
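
The invariant that fires here treats any nonzero return from the WiredTiger log flush as unrecoverable, so ENOSPC (errno 28) takes the process down rather than leaving the journal unsynced. Below is a minimal, self-contained C++ sketch of that pattern; it is not the actual mongod source, and FakeSession / invariantWTOK_sketch are illustrative stand-ins:

// Sketch only: a journal flush returns ENOSPC (28), the nonzero status is
// treated as an invariant violation, and the process aborts.
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Stand-in for a WT_SESSION whose log_flush fails with "no space left on device".
struct FakeSession {
    int log_flush(const char* /*config*/) { return ENOSPC; }  // WT reports errno values
};

// Simplified analogue of an invariant-on-status check: any nonzero status aborts.
void invariantWTOK_sketch(int status, const char* expr) {
    if (status != 0) {
        std::fprintf(stderr, "Invariant failure: %s resulted in status %d: %s\n",
                     expr, status, std::strerror(status));
        std::abort();  // raises SIGABRT, i.e. "Got signal: 6 (Aborted)"
    }
}

int main() {
    FakeSession s;
    // Same shape as the expression quoted in the log above.
    invariantWTOK_sketch(s.log_flush("sync=on"), "s->log_flush(s, \"sync=on\")");
    return 0;  // never reached in this sketch
}

Compiled and run, the sketch prints an invariant-failure message for status 28 and aborts with SIGABRT, matching the log lines above.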



 Comments   
Comment by Mark Agarunov [ 21/Mar/17 ]

Hello rromano,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Thanks,
Mark

Comment by Mark Agarunov [ 10/Mar/17 ]

Hello rromano,

Thank you for the report. To better investigate the behavior you've described, I'd like to request some additional information. Could you please provide the following:

  • The complete logs from mongod when this issue occurred, for all three instances involved.
  • The configuration file for mongod for all three instances.
  • An archive (tar or zip) of the $dbpath/diagnostic.data directory

I've created an upload portal so that you can securely send us these files.

Additionally, it appears that the three logs provided are timestamped a few days apart. Are the timestamps on the logs accurate, and if so, did this issue occur multiple times, or was there a delay between the primary and the secondary crashing?

Thanks,
Mark

Comment by Robert Romano [ 07/Mar/17 ]

I just noticed "No space left on device" in the stack trace, so this may be a user error. Still, it would be nice if another replica could have remained up as primary.

Comment by Robert Romano [ 07/Mar/17 ]

When the primary crashed, one of the secondaries also crashed with these log entries:

2017-03-06T17:17:56.066-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to torqmongodevbal801:27017
2017-03-06T17:17:56.067-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to torqmongodevbal801:27017 - HostUnreachable: Connection refused
2017-03-06T17:17:56.067-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to torqmongodevbal801:27017; HostUnreachable: Connection refused
2017-03-06T17:17:56.223-0500 F REPL     [rsBackgroundSync] rollback error RS100 reached beginning of remote oplog
2017-03-06T17:17:56.231-0500 I REPL     [rsBackgroundSync]   them:      torqmongodevbal800:27017 (10.110.208.30)
2017-03-06T17:17:56.231-0500 I REPL     [rsBackgroundSync]   theirTime: Mar  6 16:41:26 58bdd786:2869
2017-03-06T17:17:56.248-0500 I REPL     [rsBackgroundSync] rollback finished
2017-03-06T17:17:56.248-0500 F REPL     [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: need to rollback, but unable to determine common point between local and remote oplog: NoMatchingDocument: RS100 reached beginning of remote oplog [1] @ 18752
2017-03-06T17:17:56.255-0500 I -        [rsBackgroundSync] Fatal Assertion 28723 at src/mongo/db/repl/bgsync.cpp 676
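
For context, a highly simplified sketch of why this rollback error is fatal (an illustration under assumptions, not the real bgsync/rollback code): rollback scans the local and remote oplogs from newest to oldest looking for a matching entry, and if it reaches the beginning of the remote oplog without a match there is no common point, so only a full resync can reconcile the node.

// Sketch: search two oplogs for a common (ts, hash) entry.
#include <cstdio>
#include <optional>
#include <vector>

struct OplogEntry {
    long long ts;    // timestamp
    long long hash;  // per-entry hash used to match entries
};

// Walk both oplogs from newest to oldest until an identical entry is found.
std::optional<OplogEntry> findCommonPoint(const std::vector<OplogEntry>& local,
                                          const std::vector<OplogEntry>& remote) {
    for (auto l = local.rbegin(); l != local.rend(); ++l)
        for (auto r = remote.rbegin(); r != remote.rend(); ++r)
            if (l->ts == r->ts && l->hash == r->hash)
                return *l;
    return std::nullopt;  // "reached beginning of remote oplog" with no match
}

int main() {
    // The remote oplog no longer contains anything the local node has,
    // so no entry matches and rollback cannot proceed.
    std::vector<OplogEntry> local{{100, 1}, {101, 2}, {102, 3}};
    std::vector<OplogEntry> remote{{200, 9}, {201, 10}};
    if (!findCommonPoint(local, remote))
        std::printf("no common point: full resync needed\n");
    return 0;
}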

The last replica remained running and stayed in SECONDARY state, with these log entries repeating endlessly:

2017-03-07T08:25:47.517-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to torqmongodevbal801:27017
2017-03-07T08:25:47.518-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to torqmongodevbal801:27017 - HostUnreachable: Connection refused
2017-03-07T08:25:47.518-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to torqmongodevbal801:27017; HostUnreachable: Connection refused
2017-03-07T08:25:47.518-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to torqmongodevbal801:27017
2017-03-07T08:25:47.518-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to torqmongodevbal801:27017 - HostUnreachable: Connection refused
2017-03-07T08:25:47.518-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to torqmongodevbal801:27017; HostUnreachable: Connection refused
2017-03-07T08:25:47.518-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to torqmongodevbal801:27017
2017-03-07T08:25:47.518-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to torqmongodevbal801:27017 - HostUnreachable: Connection refused
2017-03-07T08:25:47.518-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to torqmongodevbal801:27017; HostUnreachable: Connection refused
2017-03-07T08:25:48.734-0500 I REPL     [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
2017-03-07T08:25:51.839-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to torqmongodevbal802:27017
2017-03-07T08:25:51.839-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to torqmongodevbal802:27017 - HostUnreachable: Connection refused
2017-03-07T08:25:51.839-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to torqmongodevbal802:27017; HostUnreachable: Connection refused

The log entry "Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)" is alarming. The last standing replica never took up role as primary, so cluster remained down. Very sad!
