[SERVER-3278] Master stopped allowing connections, didn't fail over "DR102 too much data written uncommitted" Created: 16/Jun/11  Updated: 29/Aug/11  Resolved: 04/Aug/11

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Chris McNabb Assignee: Mathias Stearn
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-3447 journal exception during large operat... Closed
Operating System: ALL
Participants:

 Description   

Thu Jun 16 16:43:34 [conn35976] core.Core_Model_User Assertion failure ! "DR102 too much data written uncommitted" db/dur_commitjob.cpp 204
0x51fe32 0x5305cf 0x5e6a28 0x5da88a 0x6947c8 0x698477 0x6948c2 0x60fb65 0x61193f 0x6cb29d 0x6d1520 0x6d23c2 0x548582 0x5cc648 0x35c3855112 0x35c385738f 0x35c38557b3 0x35c384f794 0x35c3855112 0x35c385738f
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0xb2) [0x51fe32]
/usr/bin/mongod(_ZN5mongo9wassertedEPKcS1_j+0xbf) [0x5305cf]
/usr/bin/mongod(_ZN5mongo3dur9CommitJob4noteEPvi+0x198) [0x5e6a28]
/usr/bin/mongod(_ZN5mongo3dur11DurableImpl10writingPtrEPvj+0xa) [0x5da88a]
/usr/bin/mongod(ZN5mongo12append_O_ObjEPcRKNS_7BSONObjES3+0x48) [0x6947c8]
/usr/bin/mongod [0x698477]
/usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb+0x42) [0x6948c2]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE+0x1f05) [0x60fb65]
/usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE+0x12f) [0x61193f]
/usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x4dd) [0x6cb29d]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x15a0) [0x6d1520]
/usr/bin/mongod(_ZN5mongo14DBDirectClient3sayERNS_7MessageE+0x62) [0x6d23c2]
/usr/bin/mongod(_ZN5mongo12DBClientBase6updateERKSsNS_5QueryENS_7BSONObjEbb+0x242) [0x548582]
/usr/bin/mongod(ZN5mongo12mongo_updateEP9JSContextP8JSObjectjPlS4+0x1c8) [0x5cc648]
/usr/lib64/libjs.so.1(js_Invoke+0x492) [0x35c3855112]
/usr/lib64/libjs.so.1(js_Interpret+0x14df) [0x35c385738f]
/usr/lib64/libjs.so.1(js_Invoke+0xb33) [0x35c38557b3]
/usr/lib64/libjs.so.1 [0x35c384f794]
/usr/lib64/libjs.so.1(js_Invoke+0x492) [0x35c3855112]
/usr/lib64/libjs.so.1(js_Interpret+0x14df) [0x35c385738f]
Thu Jun 16 16:43:34 [conn35976] local.oplog.rs Assertion failure ! "DR102 too much data written uncommitted" db/dur_commitjob.cpp 204

This message was repeating in the logs (unfortunately we failed to save the logs prior to restart so we don't have any more info from them.

Trying to connect to the master failed:

PROD root@docbase1-10-125-50-69 ~ $ mongo localhost:27018
MongoDB shell version: 1.8.1
connecting to: localhost:27018/test
Thu Jun 16 16:39:05 MessagingPort recv() errno:104 Connection reset by peer 127.0.0.1:27018
Thu Jun 16 16:39:05 SocketException: remote: error: 9001 socket exception [1]
Thu Jun 16 16:39:05 DBClientCursor::init call() failed
exception: DBClientBase::findOne: transport error: localhost:27018

But it was still sending heartbeats

query:

{ whatsmyuri: 1 }

"members" : [
{
"_id" : 0,
"name" : "ec2-50-16-30-183.compute-1.amazonaws.com:27018",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 35043,
"optime" :

{ "t" : 1308256847000, "i" : 28 }

,
"optimeDate" : ISODate("2011-06-16T20:40:47Z"),
"lastHeartbeat" : ISODate("2011-06-16T20:40:47Z")
},
{
"_id" : 1,
"name" : "ec2-184-72-213-192.compute-1.amazonaws.com:27018",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"optime" :

{ "t" : 1308254621000, "i" : 7 }

,
"optimeDate" : ISODate("2011-06-16T20:03:41Z"),
"self" : true
},



 Comments   
Comment by Mathias Stearn [ 04/Aug/11 ]

Judging by the stack track I think this is a dup of a bug involving multi-update inside of javascript. The fix has been backported and will be in the next release of both the 1.8 and 1.9 branches.

Comment by Matthew Crawford [ 17/Jun/11 ]

Also while the DR102 per the comment above "warning" is cycling the logs it stops being able to process any queries (but it was found will take new ones until it maxes out its threads), won't shutdown when requested, and even very important to use keeps responding to the replica set that its healthy even though it resulted in in its threads maxed out and unable to handle new real queries so despite for all purposes the server is "down" to the replica set its healthy as can be. Nothing else was produced in the logs except for the DR102 error once it started doing it. So far with journal disabled we have not seen a repeat.

Comment by Matthew Crawford [ 17/Jun/11 ]

Eliot,

When we had it fail over to the secondary after the above event it was running all fine but then the secondary started doing it as well out of the blue while the primary was recovering. While I have the error logs below I don't have the actual transaction oplog unfortunately. It was blown away in our very rapid attempt to restore the replication set. We haven't had the problem again since removing the journal option.

To keep the log snippet reasonability short the now promoted to primary secondary in summary it was going along and then started spitting out those assertions in a loop in the log. I tried doing a clean shutdown via mongo console and it refused to shutdown (or even recognize the request) resulting in us issuing a kill -9 eventually. As our main primary (from the initial post) and then secondary (from the below messages) we ended up stoping everything, copying a repaired database around, and starting everything cleanly without journaling to restore service.

Thu Jun 16 18:22:41 [conn37233] getmore local.oplog.rs cid:7917876430387765344 getMore: { ts:

{ $gte: new Date(5618910812035874823) }

} bytes:20 nreturned:0 3002ms
Thu Jun 16 18:22:44 [conn37233] getmore local.oplog.rs cid:7917876430387765344 getMore: { ts:

{ $gte: new Date(5618910812035874823) }

} bytes:20 nreturned:0 3005ms
Thu Jun 16 18:22:56 [dur] lsn set 41337550
Thu Jun 16 18:22:56 [conn37233] getmore local.oplog.rs cid:7917876430387765344 getMore: { ts:

{ $gte: new Date(5618910812035874823) }

} bytes:20 nreturned:0 3008ms
Thu Jun 16 18:23:56 [dur] lsn set 41397349
Thu Jun 16 18:24:02 [conn37233] getmore local.oplog.rs cid:7917876430387765344 getMore: { ts:

{ $gte: new Date(5618910812035874823) }

} bytes:20 nreturned:0 3008ms
Thu Jun 16 18:24:12 [conn42436] btree unused skipped:21
btree unused skipped:16
query core.Core_Model_Activity_Entry ntoreturn:150 scanAndOrder reslen:67330 nscanned:2436 { $query: { walls:

{ $in: [ "PRD__4ca2dc25bdc290_31299638", "PRD__4d0a8d5425d440_57225994", "PRD__4d0baab52ca9d9_09934089", "PRD__4c8a6e859dc131_83785660" ] }

, time:

{ $gt: "1970-05-06T16:25:10-04:00" }

}, $orderby:

{ time: -1 }

} nreturned:150 118ms
Thu Jun 16 18:24:21 [conn42443] local.oplog.rs Assertion failure ! "DR102 too much data written uncommitted" db/dur_commitjob.cpp 204
0x51fe32 0x5305cf 0x5e6a28 0x5da88a 0x6b80be 0x6b8be1 0x6af745 0x6b02ab 0x6e5e08 0x6983fc 0x6948c2 0x60fb65 0x61193f 0x6cb29d 0x6d1520 0x6d23c2 0x548582 0x5cc648 0x3513e55112 0x3513e5738f
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0xb2) [0x51fe32]
/usr/bin/mongod(_ZN5mongo9wassertedEPKcS1_j+0xbf) [0x5305cf]
/usr/bin/mongod(_ZN5mongo3dur9CommitJob4noteEPvi+0x198) [0x5e6a28]
/usr/bin/mongod(_ZN5mongo3dur11DurableImpl10writingPtrEPvj+0xa) [0x5da88a]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails7compactEv+0x43e) [0x6b80be]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails11cappedAllocEPKci+0x281) [0x6b8be1]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails6_allocEPKci+0x15) [0x6af745]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE+0x3b) [0x6b02ab]
/usr/bin/mongod(_ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci+0x1e8) [0x6e5e08]
/usr/bin/mongod [0x6983fc]
/usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb+0x42) [0x6948c2]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE+0x1f05) [0x60fb65]
/usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE+0x12f) [0x61193f]
/usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x4dd) [0x6cb29d]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x15a0) [0x6d1520]
/usr/bin/mongod(_ZN5mongo14DBDirectClient3sayERNS_7MessageE+0x62) [0x6d23c2]
/usr/bin/mongod(_ZN5mongo12DBClientBase6updateERKSsNS_5QueryENS_7BSONObjEbb+0x242) [0x548582]
/usr/bin/mongod(ZN5mongo12mongo_updateEP9JSContextP8JSObjectjPlS4+0x1c8) [0x5cc648]
/usr/lib64/libjs.so.1(js_Invoke+0x492) [0x3513e55112]
/usr/lib64/libjs.so.1(js_Interpret+0x14df) [0x3513e5738f]
Thu Jun 16 18:24:21 [conn42443] local.oplog.rs Assertion failure ! "DR102 too much data written uncommitted" db/dur_commitjob.cpp 204
0x51fe32 0x5305cf 0x5e6a28 0x5da88a 0x6b00ec 0x6b7e87 0x6b8be1 0x6af745 0x6b02ab 0x6e5e08 0x6983fc 0x6948c2 0x60fb65 0x61193f 0x6cb29d 0x6d1520 0x6d23c2 0x548582 0x5cc648 0x3513e55112
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0xb2) [0x51fe32]
/usr/bin/mongod(_ZN5mongo9wassertedEPKcS1_j+0xbf) [0x5305cf]
/usr/bin/mongod(_ZN5mongo3dur9CommitJob4noteEPvi+0x198) [0x5e6a28]
/usr/bin/mongod(_ZN5mongo3dur11DurableImpl10writingPtrEPvj+0xa) [0x5da88a]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails13addDeletedRecEPNS_13DeletedRecordENS_7DiskLocE+0x23c) [0x6b00ec]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails7compactEv+0x207) [0x6b7e87]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails11cappedAllocEPKci+0x281) [0x6b8be1]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails6_allocEPKci+0x15) [0x6af745]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE+0x3b) [0x6b02ab]
/usr/bin/mongod(_ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci+0x1e8) [0x6e5e08]
/usr/bin/mongod [0x6983fc]
/usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb+0x42) [0x6948c2]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE+0x1f05) [0x60fb65]
/usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE+0x12f) [0x61193f]
/usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x4dd) [0x6cb29d]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x15a0) [0x6d1520]
/usr/bin/mongod(_ZN5mongo14DBDirectClient3sayERNS_7MessageE+0x62) [0x6d23c2]
/usr/bin/mongod(_ZN5mongo12DBClientBase6updateERKSsNS_5QueryENS_7BSONObjEbb+0x242) [0x548582]
/usr/bin/mongod(ZN5mongo12mongo_updateEP9JSContextP8JSObjectjPlS4+0x1c8) [0x5cc648]
/usr/lib64/libjs.so.1(js_Invoke+0x492) [0x3513e55112]

Comment by Chris McNabb [ 17/Jun/11 ]

No, but it happened again after we forced failover, and we've got that log. I'll attach it tomorrow when I get ahold of it.

Comment by Eliot Horowitz (Inactive) [ 17/Jun/11 ]

The DR102 message is actually a warning, not an error.

Do you still have the end of the log?

Comment by Chris McNabb [ 16/Jun/11 ]

Comments are added in line above, sorry, I didn't realize how unreadable that would be.

Additionally, we had journaling turned on, we're turning it off on recommendation of this thread: http://groups.google.com/group/mongodb-user/browse_thread/thread/69f010e29ad22274?pli=1 Even though the situation was not exactly the same, we're convinced that the journaling was the most likely culprit.

Generated at Thu Feb 08 03:02:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.