[SERVER-21943] MR crashes in a sharded WT environment Created: 18/Dec/15  Updated: 05/Mar/16  Resolved: 05/Mar/16

Status: Closed
Project: Core Server
Component/s: MapReduce, Sharding, WiredTiger
Affects Version/s: 3.0.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Moshe Kaplan [X] Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

1. Install a sharded 3.0.4 MongoDB setup with WiredTiger as the storage engine on the mongod instances
2. Run a map-reduce (MR) job every 5 minutes (see the sketch after this list)
3. After a few days the MR job (and the mongod daemon) will crash with the error below
4. The same is currently being recreated with MMAPv1
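
A minimal sketch of such a periodic MR job in the mongo shell, assuming hypothetical database, collection, map/reduce functions and output names (none of these identifiers come from the report); the job would be triggered every 5 minutes by an external scheduler such as cron:

// Hypothetical map-reduce job; database, source collection, and output names are placeholders.
var map = function () { emit(this.account, 1); };
var reduce = function (key, values) { return Array.sum(values); };

db.getSiblingDB("XXXX").events.mapReduce(
    map,
    reduce,
    { out: { reduce: "account_totals", sharded: true } }
);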

Participants:

 Description   

We got an error similar to SERVER-16429 on version 3.0.4.
It happens once a week in a sharded environment with the WT engine on the mongod instances.
The evidence is a crashed shard, while on the other shard there is a leftover tmp collection that was not deleted.
Attached is the error log from the failed shard:

2015-12-12T00:35:40.814+0000 I COMMAND  [conn92487] mr failed, removing collection :: caused by :: WriteConflict
2015-12-12T00:35:40.818+0000 I COMMAND  [conn92487] CMD: drop XXXX.tmp.mr.account_231015
2015-12-12T00:35:40.822+0000 I NETWORK  [initandlisten] connection accepted from XXX.XXX.XXX.XXX:XXXXX #110728 (47 connections now open)
2015-12-12T00:35:40.920+0000 I COMMAND  [conn92487] command XXXX.$cmd command: drop { drop: "tmp.mr.account_231015" } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:122 locks:{ Global: { acquireCount: { r: 8, w: 4 } }, Database: { acquireCount: { r: 1, w: 1, R: 1, W: 4 }, acquireWaitCount: { W: 4 }, timeAcquiringMicros: { W: 7627049380 } }, Collection: { acquireCount: { r: 1, w: 1, W: 1 } } } 102ms
2015-12-12T00:35:40.920+0000 I QUERY    [conn110694] query XXXX.endpoints query: { $query: { gw: { $gt: 0 }, $or: [ { status: "unmanaged" }, { status: "managed" } ] }, $readPreference: { mode: "secondaryPreferred" } } planSummary: IXSCAN { gw: -1.0, status: -1.0 } ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 }, acquireWaitCount: { r: 2 }, timeAcquiringMicros: { r: 3690878996 } }, Collection: { acquireCount: { r: 2 } } } 106ms
2015-12-12T00:35:40.920+0000 I QUERY    [conn110692] query XXXX.endpoints query: { $query: { gw: { $gt: 0 }, $or: [ { status: "unmanaged" }, { status: "managed" } ] }, $readPreference: { mode: "secondaryPreferred" } } planSummary: IXSCAN { gw: -1.0, status: -1.0 } ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 writeConflicts:0 numYields:3 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 8 } }, Database: { acquireCount: { r: 4 }, acquireWaitCount: { r: 4 }, timeAcquiringMicros: { r: 3690908386 } }, Collection: { acquireCount: { r: 4 } } } 182338ms
2015-12-12T00:35:40.927+0000 I COMMAND  [conn92491] command admin.$cmd command: listDatabases { listDatabases: 1 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:290 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 14825915727 } } } 179101ms
2015-12-12T00:35:40.978+0000 E NETWORK  [conn92487] Uncaught std::exception: std::exception, terminating
2015-12-12T00:35:40.978+0000 I CONTROL  [conn92487] dbexit:  rc: 100
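
As a side note, a hedged sketch of how the leftover temporary MR collection mentioned above could be listed on the surviving shard from the mongo shell (the database name is a placeholder):

// List any leftover tmp.mr.* collections left behind after the failed map-reduce.
db.getSiblingDB("XXXX").getCollectionNames().filter(function (name) {
    return name.indexOf("tmp.mr.") === 0;
});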



 Comments   
Comment by Moshe Kaplan [X] [ 11/Feb/16 ]

The customer still suffers from the issue on v3.0.8.
They will soon try to upgrade to v3.2.1 and I'll report back on the results.

Comment by Ramon Fernandez Marina [ 10/Feb/16 ]

Hi MosheKaplan, we haven't heard back from you for a while. Are you still experiencing this issue?

MongoDB 3.0 is at version 3.0.9, so if you were going to upgrade and try to reproduce the problem I'd suggest you consider 3.0.9.

Thanks,
Ramón.

Comment by Moshe Kaplan [X] [ 30/Dec/15 ]

Thanks Thomas,
We are currently trying to eliminate the issue, and we probably won't have results before early January.
I'll keep you updated regarding our results.

Comment by Kelsey Schubert [ 30/Dec/15 ]

Hi MosheKaplan,

I am not able to point you to a specific fix - there have been numerous improvements to the WiredTiger storage engine since 3.0.4 that may affect what you are observing. If the issue persists after upgrading would you be able to upload the relevant datasets so we can attempt to reproduce this behavior on our side?

Thank you,
Thomas

Comment by Moshe Kaplan [X] [ 22/Dec/15 ]

Hi Thomas,
We may be able to do that on Thursday, and it may take a few more days to recreate.
Do you suspect any specific fix implemented between 3.0.5 and 3.0.8 that should have solved it?
Moshe

Comment by Kelsey Schubert [ 22/Dec/15 ]

Hi MosheKaplan,

Can you please upgrade to 3.0.8 and confirm if the issue persists?

Thank you,
Thomas
