[SERVER-3278] Master stopped allowing connections, didn't fail over "DR102 too much data written uncommitted" Created: 16/Jun/11 Updated: 29/Aug/11 Resolved: 04/Aug/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 1.8.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Chris McNabb | Assignee: | Mathias Stearn |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Operating System: | ALL |
| Participants: | |
| Description |
|
Thu Jun 16 16:43:34 [conn35976] core.Core_Model_User Assertion failure ! "DR102 too much data written uncommitted" db/dur_commitjob.cpp 204

This message was repeating in the logs (unfortunately we failed to save the logs prior to restart, so we don't have any more info from them). Trying to connect to the master failed:

PROD root@docbase1-10-125-50-69 ~ $ mongo localhost:27018

But it was still sending heartbeats:

query: { whatsmyuri: 1 }
"members" : [ , , |
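For reference, a minimal sketch (not taken from the original incident, and not part of the report) of how a member's health can be checked from the mongo shell of another replica set member. Health here reflects heartbeats, not whether the member can actually serve queries, which is why the node could look healthy while refusing connections:

```js
// Hypothetical diagnostic: print each member's state and health as seen by the set.
// Health is heartbeat-driven, so a wedged-but-heartbeating node still shows health: 1.
rs.status().members.forEach(function (m) {
    print(m.name + "  state: " + m.stateStr + "  health: " + m.health);
});
```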
| Comments |
| Comment by Mathias Stearn [ 04/Aug/11 ] |
|
Judging by the stack trace, I think this is a dup of a bug involving a multi-update inside of JavaScript. The fix has been backported and will be in the next release of both the 1.8 and 1.9 branches. |
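To illustrate the pattern Mathias refers to, a hedged sketch of a multi-document update issued from server-side JavaScript. The collection and field names are made up, and this is not claimed to be the exact reproduction of the linked bug:

```js
// Hypothetical example only: a multi-update (multi flag set to true) run inside
// server-side JavaScript via db.eval(), the general shape of operation implicated.
db.eval(function () {
    // update(query, update, upsert, multi)
    db.users.update({ active: true }, { $set: { flagged: true } }, false, true);
});
```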
| Comment by Matthew Crawford [ 17/Jun/11 ] |
|
Also, while the DR102 "warning" (per the comment above) is cycling in the logs, the server stops being able to process any queries (though we found it will keep accepting new connections until it maxes out its threads), won't shut down when requested, and, most importantly for us, keeps reporting to the replica set that it's healthy even though its threads are maxed out and it can't handle any real queries. So for all practical purposes the server is "down", yet to the replica set it looks as healthy as can be. Nothing else was produced in the logs except for the DR102 error once it started. So far, with journaling disabled, we have not seen a repeat. |
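As a side note, a small sketch (an assumption about how one might watch this, not something run during the incident) of checking connection headroom from the shell while the server is still accepting connections:

```js
// Hypothetical diagnostic: report how many connections are in use vs. still available.
// A node that accepts connections but never answers queries will show "current"
// climbing toward the limit while "available" drops.
var c = db.serverStatus().connections;
print("current: " + c.current + "  available: " + c.available);
```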
| Comment by Matthew Crawford [ 17/Jun/11 ] |
|
Eliot, when we had it fail over to the secondary after the above event, it ran fine for a while, but then the secondary started doing the same thing out of the blue while the primary was recovering. While I have the error logs below, I don't have the actual transaction oplog, unfortunately; it was blown away in our very rapid attempt to restore the replica set. We haven't had the problem again since removing the journal option.

To keep the log snippet reasonably short: in summary, the secondary (now promoted to primary) was going along fine and then started spitting out those assertions in a loop in the log. I tried doing a clean shutdown via the mongo console and it refused to shut down (or even recognize the request), so we eventually ended up issuing a kill -9. Between our main primary (from the initial post) and then the secondary (from the messages below), we ended up stopping everything, copying a repaired database around, and starting everything cleanly without journaling to restore service.

Thu Jun 16 18:22:41 [conn37233] getmore local.oplog.rs cid:7917876430387765344 getMore: { ts: { $gte: new Date(5618910812035874823) } } bytes:20 nreturned:0 3002ms
} bytes:20 nreturned:0 3005ms
} bytes:20 nreturned:0 3008ms
} bytes:20 nreturned:0 3008ms
, time: { $gt: "1970-05-06T16:25:10-04:00" }}, $orderby: { time: -1 } } nreturned:150 118ms |
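For context, a hedged sketch of the clean-shutdown attempt and the journal-free restart described above. The dbpath, port, and set name are assumptions, not taken from the report:

```js
// Attempted clean shutdown from the mongo shell (this is the step that hung in the incident):
db.getSiblingDB("admin").shutdownServer();

// After kill -9 and a repair, the node would be restarted without the --journal flag,
// e.g. (hypothetical invocation, paths and set name made up):
//   mongod --replSet <setName> --port 27018 --dbpath /data/db
```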
| Comment by Chris McNabb [ 17/Jun/11 ] |
|
No, but it happened again after we forced failover, and we've got that log. I'll attach it tomorrow when I get ahold of it. |
| Comment by Eliot Horowitz (Inactive) [ 17/Jun/11 ] |
|
The DR102 message is actually a warning, not an error. Do you still have the end of the log? |
| Comment by Chris McNabb [ 16/Jun/11 ] |
|
Comments are added inline above; sorry, I didn't realize how unreadable that would be. Additionally, we had journaling turned on; we're turning it off on the recommendation of this thread: http://groups.google.com/group/mongodb-user/browse_thread/thread/69f010e29ad22274?pli=1 Even though the situation was not exactly the same, we're convinced that journaling was the most likely culprit. |
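As a quick sanity check after such a restart, a hedged sketch (an assumption, not from the report) of confirming from the shell that journaling is off. The "dur" section of serverStatus() is only reported when journaling is enabled:

```js
// Hypothetical check: with journaling disabled, serverStatus() should have no "dur" section.
var status = db.serverStatus();
print("journaling active: " + (status.dur !== undefined));
```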