[SERVER-13175] Write command with invalid write concern crashes mongos in a mixed cluster. Created: 12/Mar/14 Updated: 11/Jul/16 Resolved: 17/Mar/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.6.0-rc1 |
| Fix Version/s: | 2.6.0-rc2 |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Bernie Hackett | Assignee: | Randolph Tan |
| Resolution: | Done | Votes: | 0 |
| Labels: | 26qa | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Operating System: | ALL | ||||||||
| Steps To Reproduce: |
In my example, both shards are standalone mongod instances. A write concern with w > 1 causes mongos to crash. |
||||||||
| Participants: | |||||||||
| Description |
|
Using a mixed-version cluster (2.6 mongos and config, 2.6 shard, 2.4 shard), a write operation that affects both shards causes mongos to crash if the write concern is invalid for one of the shards.
Commands in Python:
mongos log:
sh.status:
|
| Comments |
| Comment by Randolph Tan [ 17/Mar/14 ] |
|
Confirmed that |
| Comment by Randolph Tan [ 13/Mar/14 ] |
Some extra info about the crash:

Happens only in mixed clusters; clusters whose shards are all on 2.4 or all on 2.6 run OK.

How the crash happens:

1. The broadcast write is broken down into child writes, one for each shard. https://github.com/mongodb/mongo/blob/r2.6.0-rc1/src/mongo/s/write_ops/batch_write_op.cpp#L483-490

5. Then the response from the 2.4 shard is processed. Since the getLastError response will give { ok: 1 }, processing continues (as opposed to the early return at step #4). https://github.com/mongodb/mongo/blob/r2.6.0-rc1/src/mongo/s/write_ops/batch_write_op.cpp#L531

6. It then crashes when it calls either noteWriteComplete or noteWriteError: https://github.com/mongodb/mongo/blob/r2.6.0-rc1/src/mongo/s/write_ops/batch_write_op.cpp#L543-556 because both of them try to access an element of the _childOps vector by index, and this vector was already cleared during the cancel call at step #4.

Current status: It is very likely that
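
The failure mode in steps 4 to 6 above (a list of child operations is cleared, then a later response still indexes into it) can be modelled with a small toy example. This is only a sketch under stated assumptions, not the mongos code: the class `BatchWriteSketch`, the shard names, and the index used for the 2.4 shard are all hypothetical, and Python raises an `IndexError` where the C++ vector access would crash the process.

```python
class BatchWriteSketch:
    """Toy stand-in for the per-shard bookkeeping described above (not mongos code)."""

    def __init__(self, shard_names):
        # One child write per targeted shard, as in step 1.
        self.child_ops = [{"shard": name, "state": "pending"} for name in shard_names]

    def cancel(self):
        # Models the cancel call at step 4, which clears the child-op list.
        self.child_ops.clear()

    def note_write_complete(self, index):
        # Models the indexed access at step 6.
        self.child_ops[index]["state"] = "completed"


def replay_crash_sequence():
    # Hypothetical shard names; index 1 for the 2.4 shard is an assumption.
    op = BatchWriteSketch(["shard_2_6", "shard_2_4"])
    op.cancel()  # step 4: child ops cleared, followed by an early return
    try:
        # Steps 5-6: the 2.4 shard's { ok: 1 } response is still processed.
        op.note_write_complete(1)
    except IndexError:
        return "stale index"
    return "ok"
```

Calling `replay_crash_sequence()` returns `"stale index"`, which mirrors the point of step 6: once the child operations have been cleared, any later response that is still processed refers to an entry that no longer exists.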