ISSUE SUMMARY
MongoDB with WiredTiger may produce WriteConflict exceptions under the following conditions:
- when doing bulk or multi updates, and
- these updates have upsert=true set, and
- no documents match the update condition, and
- these operations use a unique index on the updated field
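For illustration, a minimal PyMongo 2.x sketch of an operation with this shape (database, collection and field names are hypothetical; it mirrors the shell reproducer further down in this ticket):
from pymongo import MongoClient, ASCENDING

coll = MongoClient()["testdb"]["foo"]         # hypothetical database/collection

# Unique index on the field that the update modifies.
coll.create_index([("x", ASCENDING)], unique=True)

# Multi update with upsert=True whose query matches no documents: on
# 3.0.0-3.0.3 with WiredTiger, the insert phase of this upsert can fail
# with a WriteConflict error when run concurrently.
coll.update({"y": 42},                        # matches nothing
            {"$set": {"x": 7}},
            multi=True, upsert=True)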
USER IMPACT
These exceptions cause the insert part of the write operation to fail and an error is returned to the application.
WORKAROUNDS
Retry failed write operations as inserts at the application level.
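A hedged sketch of what such an application-level retry might look like with PyMongo 2.x (the exception handling details and names are assumptions, not code from this ticket):
from pymongo import MongoClient
from pymongo.errors import OperationFailure

coll = MongoClient()["testdb"]["foo"]         # hypothetical collection

def upsert_with_retry(query, update, doc_on_conflict):
    # Try the multi upsert; if it fails with a WriteConflict, retry the
    # failed write as an insert of the document the upsert would have created.
    try:
        coll.update(query, update, multi=True, upsert=True)
    except OperationFailure as exc:
        if "WriteConflict" in str(exc):
            coll.insert(doc_on_conflict)
        else:
            raise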
AFFECTED VERSIONS
MongoDB 3.0.0 to 3.0.3 with WiredTiger.
FIX VERSION
The fix is included in the 3.0.4 production release.
Original description
Intro
We have a fairly complex Python application running on Apache Storm 0.9.2, MongoEngine 0.8.7, PyMongo 2.8, and MongoDB 3.0.2. Our MongoDB setup is a sharded cluster (3 mongos, 3 config servers and 5 replica sets), and all replica sets are running WiredTiger.
Under a very high load of concurrent updates, I get a lot of exceptions at the application level:
OperationError: Update failed (exception: WriteConflict)
And looking into the mongod logs:
2015-04-22T16:24:02.100+0200 I WRITE [conn81979] Had WriteConflict during multi update, aborting
2015-04-22T16:24:04.907+0200 I WRITE [conn75303] Had WriteConflict during multi update, aborting
2015-04-22T16:24:18.538+0200 I WRITE [conn81979] Had WriteConflict during multi update, aborting
2015-04-22T16:24:27.539+0200 I WRITE [conn75526] Had WriteConflict during multi update, aborting
2015-04-22T16:24:34.413+0200 I WRITE [conn81979] Had WriteConflict during multi update, aborting
2015-04-22T16:24:34.455+0200 I WRITE [conn77036] Had WriteConflict during multi update, aborting
2015-04-22T16:24:46.693+0200 I WRITE [conn81979] Had WriteConflict during multi update, aborting
I've mitigated the issue with a try/except around the update operation, re-doing the update in the except block. This seems to work fine almost all the time, but sometimes it fails twice in a row.
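Roughly, the mitigation looks like the following sketch (a MongoEngine-style sketch with a hypothetical helper name, not the exact application code):
from mongoengine.errors import OperationError

def update_with_retry(queryset, **kwargs):
    # Sketch of the mitigation: retry the upsert once if it raises the
    # OperationError that wraps the WriteConflict exception.
    try:
        queryset.update(upsert=True, **kwargs)
    except OperationError:
        # Occasionally this second attempt fails as well.
        queryset.update(upsert=True, **kwargs)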
I've asked about this same issue on StackOverflow (http://stackoverflow.com/questions/29800823/mongodb-with-wiredtiger-lots-of-writeconflict) and on the mongodb-user mailing list. There, Asya (thanks!) asked me to open this ticket.
The update operation that fails with WriteConflict
With MongoEngine we have created a model with a custom QuerySet that simply does a number of:
self.update(inc__valueX=CONSTANT * ammount, upsert=True)
Where ammount is 1.0 99% of the time.
What the mongo server really does (what this translates to) is:
db.collection.update({ s: ObjectId('53cd0cdac2f4ed5f76527f9a'), u: ObjectId('553df07f996d7110b3a2843b') }, { $inc: { value: VALUE } }, { multi: true, upsert: true })
Where VALUE is the result of (CONSTANT * ammount).
The relevant files are attached.
Current setup
Currently we have 3 mongos, 3 config servers (running on the same servers as the mongos) and 5 replica sets, each with 3 members. The config servers are running MMAPv1 and the replica sets WiredTiger. All of them are running this MongoDB version:
2015-04-10T10:34:06.213+0200 I CONTROL [initandlisten] MongoDB starting : pid=21243 port=27017 dbpath=/opt/mongodb 64-bit host=bb07
2015-04-10T10:34:06.213+0200 I CONTROL [initandlisten] db version v3.0.2
2015-04-10T10:34:06.213+0200 I CONTROL [initandlisten] git version: 6201872043ecbbc0a4cc169b5482dcf385fc464f
2015-04-10T10:34:06.213+0200 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
2015-04-10T10:34:06.213+0200 I CONTROL [initandlisten] build info: Linux ip-10-229-1-2 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2015-04-10T10:34:06.213+0200 I CONTROL [initandlisten] allocator: tcmalloc
2015-04-10T10:34:06.213+0200 I CONTROL [initandlisten] options: { config: "/etc/mongod.conf", net: { http: { enabled: false } }, replication: { replSet: "rs01" }, storage: { dbPath: "/opt/mongodb", engine: "wiredTiger" }, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb/mongod.log" } }

> mongos --version
MongoS version 3.0.2 starting: pid=19084 port=27017 64-bit host=bb01 (--help for usage)
git version: 6201872043ecbbc0a4cc169b5482dcf385fc464f
OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
build sys info: Linux ip-10-229-1-2 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
The Apache Storm machines (the ones performing the update() operations) run on different servers with the same OS version. The whole setup is running on top of VMware 5.5.
Logs
After running
db.adminCommand({setParameter:1, logLevel:5})
on one of the primaries and waiting for the error to reproduce, this is the log output:
2015-04-27T10:17:03.359+0200 D COMMAND [conn63749] run command database.$cmd { update: "collection", updates: [ { q: { s: ObjectId('53cd0cdac2f4ed5f76527f9a'), u: ObjectId('553df07f996d7110b3a2843b') }, u: { $inc: { value: 0.0328 } }, multi: true, upsert: true } ], ordered: true, metadata: { shardName: "bbrs03", shardVersion: [ Timestamp 856000|4, ObjectId('552f8fbfeaa225ff8d70a7dd') ], session: 0 } }
2015-04-27T10:17:03.359+0200 D QUERY [conn63749] [QLOG] Tagging the match expression according to cache data:
Filter:
    $and
        s == ObjectId('53cd0cdac2f4ed5f76527f9a')
        u == ObjectId('553df07f996d7110b3a2843b')
Cache data:
    (index-tagged expression tree: tree=Node
    ---Leaf { u: 1, s: 1 }, pos: 1
    ---Leaf { u: 1, s: 1 }, pos: 0
    )
2015-04-27T10:17:03.359+0200 D QUERY [conn63749] [QLOG] Index 0: { _id: 1 }
2015-04-27T10:17:03.359+0200 D QUERY [conn63749] [QLOG] Index 1: { u: 1, s: 1 }
2015-04-27T10:17:03.359+0200 D QUERY [conn63749] [QLOG] Index 2: { u: 1, valueX: -1 }
2015-04-27T10:17:03.359+0200 D QUERY [conn63749] [QLOG] Index 3: { u: 1, valueY: -1 }
2015-04-27T10:17:03.359+0200 D QUERY [conn63749] [QLOG] Index 4: { u: 1, value: -1 }
2015-04-27T10:17:03.359+0200 D QUERY [conn63749] [QLOG] Index 5: { u: 1, valueZ: -1 }
2015-04-27T10:17:03.359+0200 D QUERY [conn63749] [QLOG] Tagged tree:
$and
    u == ObjectId('553df07f996d7110b3a2843b')  || Selected Index #1 pos 0
    s == ObjectId('53cd0cdac2f4ed5f76527f9a')  || Selected Index #1 pos 1
2015-04-27T10:17:03.359+0200 D QUERY [conn63749] [QLOG] Planner: solution constructed from the cache:
FETCH
---fetched = 1
---sortedByDiskLoc = 1
---getSort = [{}, { s: 1 }, { u: 1 }, { u: 1, s: 1 }, ]
---Child:
------IXSCAN
---------keyPattern = { u: 1, s: 1 }
---------direction = 1
---------bounds = field #0['u']: [ObjectId('553df07f996d7110b3a2843b'), ObjectId('553df07f996d7110b3a2843b')], field #1['s']: [ObjectId('53cd0cdac2f4ed5f76527f9a'), ObjectId('53cd0cdac2f4ed5f76527f9a')]
---------fetched = 0
---------sortedByDiskLoc = 1
---------getSort = [{}, { s: 1 }, { u: 1 }, { u: 1, s: 1 }, ]
2015-04-27T10:17:03.359+0200 D STORAGE [conn63749] WT begin_transaction
2015-04-27T10:17:03.359+0200 D WRITE [conn63749] update validate options -- updatedFields: Fields:[ value,] immutableAndSingleValueFields.size:1 fromRepl: 0 validate:1
2015-04-27T10:17:03.359+0200 D STORAGE [conn63749] WT rollback_transaction
2015-04-27T10:17:03.359+0200 D WRITE [conn63749] Caught WriteConflictException doing update on database.collection, attempt: 2 retrying
2015-04-27T10:17:03.359+0200 D STORAGE [conn63749] WT begin_transaction
2015-04-27T10:17:03.359+0200 D WRITE [conn63749] update validate options -- updatedFields: Fields:[ value,] immutableAndSingleValueFields.size:1 fromRepl: 0 validate:1
2015-04-27T10:17:03.359+0200 D STORAGE [conn63749] WT rollback_transaction
2015-04-27T10:17:03.359+0200 D WRITE [conn63749] Caught WriteConflictException doing update on database.collection, attempt: 3 retrying
...
2015-04-27T10:17:03.359+0200 D STORAGE [conn63749] WT begin_transaction
2015-04-27T10:17:03.360+0200 D WRITE [conn63749] update validate options -- updatedFields: Fields:[ value,] immutableAndSingleValueFields.size:1 fromRepl: 0 validate:1
2015-04-27T10:17:03.360+0200 D STORAGE [conn63749] WT rollback_transaction
2015-04-27T10:17:03.360+0200 D WRITE [conn63749] Caught WriteConflictException doing update on database.collection, attempt: 4 retrying
...
Final note
I modified some names (database, collection, field and index names). If you want, I can upload the full logs or configuration (via SCP) with all the original names. If required, we could even plan a full restart of all the processes (that would need some planning).
|
|
This appears to now be a recurring condition in MongoDB 5.0.15. We upgraded from 4.4.18 and are now getting 100k of these an hour, with all of them showing "nMatched | nWouldModify | nWouldUpsert" of 0.
|
|
Author: Geert Bosch (GeertBosch) <geert@mongodb.com>
Message: SERVER-18213: Retry insert in case of a write conflict during multi-upsert
(cherry picked from commit 054c874f25a91121ed16cedbc44b7d4b919a430b)
Conflicts:
src/mongo/db/exec/update.cpp
Branch: v3.0
https://github.com/mongodb/mongo/commit/bfdf548bc4c882ab944e0cf2f1e6015752fc05c2
|
|
Thanks for your quick response!
|
|
Author: Geert Bosch (GeertBosch) <geert@mongodb.com>
Message: SERVER-18213: Retry insert in case of a write conflict during multi-upsert
Branch: master
https://github.com/mongodb/mongo/commit/054c874f25a91121ed16cedbc44b7d4b919a430b
|
|
Thanks asya 
|
|
Alex,
The following run in mongo shell will usually reproduce it:
db.foo.drop();
db.foo.ensureIndex({x:1},{unique:true});
for (i=0; i<10000; i++) { db.foo.insert({x:i, y:0}); }
db.foo.remove({});
op={ op: "update", ns: "test.foo", query: { y: { '#RAND_INT': [0, 5 ] } }, update:{ $set: { x: { '#RAND_INT': [0, 10 ] }} }, multi: true, upsert: true, writeCmd: true, safe: false }
benchRun( { parallel: 16, seconds: 16, ops:[op], host:db._mongo.host } )
You may need to use a different value for number of threads (parallel) depending on your number of available cores.
This uses the built-in 'benchRun' utility.
Asya
|
|
Hi asya,
I'm glad you've been able to reproduce it. Could you please give us more info on how you did it? I'm curious whether julien.thebault@1000mercis.com was right or whether it wasn't the same issue.
Thanks 
|
|
julien.thebault@1000mercis.com, alex.sobrino@gmail.com,
Thanks for your help - I've been able to reproduce this failure.
Asya
|
|
Julien,
"enabled any FailPoint"
No - the failpoint does not make it easier to hit the failure, it simulates hitting that failure, so if you turn on the failpoint you are of course guaranteed to hit the exception even if the conditions for the real error are not met - Geert's point above was that his script couldn't reproduce the error, so he had to enable the failpoint to get the exception.
Unfortunately, this means we're back to needing the actual conditions which reproduce this exception for you in your actual system. What is the "compound unique index on the affected collection" and what is the exact update that triggered the exception? If you have any of your logs from when this happened, that would be most helpful, or if you can recreate the exception in a test system and generate fresh verbose logs, that would help us generate a reproducer.
Asya
|
"I'm not able to produce the exception - I want to make sure I'm using the same version, would you include the output of"
are you enabled any FailPoint ?
mongodb: 3.0.3
pymongo: 2.7.2 / 2.8 (tests with both versions)
gevent: 0.13.6 / 1.0.2 (tests with both versions)
os: Gentoo Linux
|
|
I'm not able to produce the exception - I want to make sure I'm using the same version, would you include the output of
mongod --version
and the version of the pymongo driver you are running this test with? Also the platform - I tried it on my mac with 3.0.3 and no luck reproducing so far.
|
If you can get us the reproducer test code, that would be ideal as we will be able to create the failing test locally which will speed up the process tremendously.
see this comment above: https://jira.mongodb.org/browse/SERVER-18213?focusedCommentId=925667&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-925667
|
|
julien.thebault@millemercis.com
Thank you for the logs - pastebin works fine - could you rerun the second test (the one without the unique index) but in addition to enabling traces could you increase verbosity?
db.adminCommand({setParameter:1,logLevel:1})
|
If you can get us the reproducer test code, that would be ideal as we will be able to create the failing test locally which will speed up the process tremendously.
|
|
We are testing this issue on a local instance using WiredTiger, so no worries about the data/logs.
I can continue to use this pastebin (0bin) if that works for you, or would you prefer a direct upload to this issue?
We can talk on IRC (Freenode, Lujeni) if you want more details without spamming this thread.
with unique index:
without unique index:
|
|
julien.thebault@1000mercis.com
If you can trigger this without a unique index, I'm not at all certain that this is the same issue. Can you please provide the actual update statement from the logs that triggers the exception and the actual output of the db.collection.getIndexes() command?
I can provide a secure location to upload the information to if you are concerned about exposing your actual data publicly. Most helpful would be full logs after enabling
db.adminCommand( {setParameter:1, traceWriteConflictExceptions:true} );
|
Asya
|
|
Hi all,
From a quick test, it seems that setting multi:false in the update() operation stops the WriteConflict error.
In my original post I wrote that this is what we were doing in MongoEngine:
self.update(inc__valueX=CONSTANT * ammount, upsert=True)
It can be changed to:
self.update(inc__valueX=CONSTANT * ammount, upsert=True, multi=False)
or:
self.update_one(inc__valueX=CONSTANT * ammount, upsert=True)
to remove the WriteConflict error. Of course, this solution works because we were performing the update on a combination of fields that is unique (enforced by an index).
Bulk operations are probably treated as multi:true operations when performing an update(), and that's why we also get the WriteConflict errors.
Hope this helps.
|
|
Hi Asya,
Julien Thebault above is part of our team, so his answer speaks for me as well. It looks like multi is indeed the culprit here, since we can reproduce the error even without a unique index on the collection.
Hope this helps.
|
|
Hi Asya Kamsky,
more details about our context:
- pymongo 2.8
- inserts through a greenlet (gevent)
"the exact update(s) that cause(s) this exception"
We do the same thing as Alex Sobrino (bulk operations). However, the query uses 5 fields for the spec.
"the indexes present on the collection that the update is running on"
- 1x compound index (2 fields), unique=True, dropDups=True
- 1x compound index (2 fields)
testing results:
- we use a python script to simulate the write conflict (see the sketch below)
- bulk or update operations trigger the same issue.
- indexes with or without the unique flag also trigger the issue.
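For reference, a rough sketch of the kind of concurrent-upsert script described above (hypothetical database/collection/field names, plain threads instead of gevent, PyMongo 2.x style calls):
import threading
from pymongo import MongoClient, ASCENDING
from pymongo.errors import OperationFailure

coll = MongoClient()["test"]["wc_repro"]      # hypothetical collection
coll.create_index([("a", ASCENDING), ("b", ASCENDING)], unique=True)
coll.remove({})                               # start with no matching documents

def worker():
    for i in range(1000):
        spec = {"a": i % 5, "b": 0}           # several threads race on the same keys
        try:
            coll.update(spec, {"$inc": {"value": 1}}, multi=True, upsert=True)
        except OperationFailure as exc:
            print("update failed:", exc)

threads = [threading.Thread(target=worker) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()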
|
|
Hi asya,
In the first project (the one used as the example to open this issue) we use MongoEngine 3.0 (on top of PyMongo 2.8). multi:true is the default behavior, provided by either PyMongo or MongoEngine. I'll have to take a look and see if it's possible to pass multi:false to the update().
In the second project we're executing a bulk.find().upsert().update() like this one: http://api.mongodb.org/python/2.8/examples/bulk.html#ordered-bulk-write-operations . Correct me if I'm wrong, but I've tried to add a multi:false parameter to the update() and it seems that in bulk ops it's not possible. If I'm right, I won't be able to test your suggestion in this project.
What I've tried in python code is:
from pymongo import *
db = MongoClient()['test']
bulk = db.test.initialize_unordered_bulk_op()
bulk.find({'a':'b','c':'d'}).upsert().update({'$set': {'e':'f'}}, multi=False)
|
|
ultrabug
So we can confirm whether this is exactly the same issue, could you please provide the following details:
- the exact update(s) that cause(s) this exception
- the indexes present on the collection that the update is running on
In particular, we're looking for details of the options on update (upsert/multi) and all attributes of indexes involved.
Asya
|
|
alexsobrino
We are still working on figuring out how to reproduce this issue (so that we can know that our fix works correctly), but meanwhile I may have spotted a work-around so that you don't get these errors.
I noticed that the update is made on the same combination of fields that has a unique index on them, which means that you can never be updating more than one document with this update. You should be able to remove multi:true from the update and see if that resolves the write conflict exceptions in your application. Since the exception happens on the multi-update code path, avoiding it with a single update will allow your application to proceed without unexpected errors to handle.
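In PyMongo 2.x terms, the suggested change amounts to something like the following sketch (collection and field names are placeholders, not the actual schema):
from pymongo import MongoClient
from bson import ObjectId

coll = MongoClient()["testdb"]["counters"]    # hypothetical collection
user_id, source_id, amount = ObjectId(), ObjectId(), 1.0

# Multi upsert (hits the WriteConflict-prone code path)...
coll.update({"u": user_id, "s": source_id},
            {"$inc": {"value": amount}},
            multi=True, upsert=True)

# ...versus a single-document upsert, which is equivalent here because the
# {u, s} unique index guarantees at most one document can ever match.
coll.update({"u": user_id, "s": source_id},
            {"$inc": {"value": amount}},
            multi=False, upsert=True)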
If you are able to try that, would you let us know if that completely eliminates the exceptions? It would confirm then what the minimum conditions for reproducing this issue are.
Asya
|
|
Hello Geert Bosch,
What's the aim of this procedure? Is it for your internal developers?
In any case, the results of running the process are below.
mongodb.conf:
storage:
  syncPeriodSecs: 60
  dbPath: "/var/lib/mongodb"
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      directoryForIndexes: true
    collectionConfig:
      blockCompressor: snappy

systemLog:
  destination: file
  path: "/var/log/mongodb/mongodb.log"
  quiet: false
  logAppend: true

net:
  port: 27017

operationProfiling:
  slowOpThresholdMs: 500

setParameter:
  enableTestCommands: 1
log:
2015-05-27T18:15:47.930+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:44740 #2 (2 connections now open)
2015-05-27T18:15:47.936+0200 I COMMAND [conn2] CMD: drop test.multiupsertwc
2015-05-27T18:15:47.983+0200 I INDEX [conn2] build index on: test.multiupsertwc properties: { v: 1, key: { x: 1.0 }, name: "x_1", ns: "test.multiupsertwc" }
2015-05-27T18:15:47.983+0200 I INDEX [conn2] building index using bulk method
2015-05-27T18:15:47.984+0200 I INDEX [conn2] build index done. scanned 0 total records. 0 secs
2015-05-27T18:15:51.416+0200 I WRITE [conn2] Had WriteConflict during multi update, aborting
2015-05-27T18:15:52.865+0200 I WRITE [conn2] Had WriteConflict during multi update, aborting
2015-05-27T18:15:54.779+0200 I WRITE [conn2] Had WriteConflict during multi update, aborting
2015-05-27T18:15:55.935+0200 I WRITE [conn2] Had WriteConflict during multi update, aborting
2015-05-27T18:15:57.235+0200 I WRITE [conn2] Had WriteConflict during multi update, aborting
2015-05-27T18:15:57.243+0200 I NETWORK [conn2] end connection 127.0.0.1:44740 (1 connection now open)
output:
18:15 $ mongo localhost:27017/test writeconflist.js
MongoDB shell version: 3.0.3
connecting to: localhost:27017/test
{ "writeError" : { "code" : 112, "errmsg" : "exception: WriteConflict" } }
{ "writeError" : { "code" : 112, "errmsg" : "exception: WriteConflict" } }
{ "writeError" : { "code" : 112, "errmsg" : "exception: WriteConflict" } }
{ "writeError" : { "code" : 112, "errmsg" : "exception: WriteConflict" } }
{ "writeError" : { "code" : 112, "errmsg" : "exception: WriteConflict" } }
Regards
|
|
Here's an attempt at a standalone reproducer. I needed to configure a fail point, which causes some other issues.
(function() {
    "use strict";
    // SERVER-18213: Test WriteConflictException recovery in multi-upsert path

    var t = db.multiupsertwc;
    t.drop()

    t.ensureIndex({x:1});

    for (var j = 0; j < 1000; j++) {
        t.insert({ x : j});
    }

    assert.commandWorked(db.adminCommand({ configureFailPoint: 'WTWriteConflictException',
                                           mode: { 'activationProbability' : 0.0001 }}));

    for (var k = 0; k < 10; k++) {
        for (var j = 0; j < 1000; j++) {
            var n = Math.floor(Math.random(1000));
            var res = t.update({ x : n },
                               { $set: { x : (j + 1) * 337 % 1000 }},
                               { multi : 1, upsert : 1});
            if (!res.nModified && !res.nUpserted) {
                printjson(res);
                break;
            }
        }
    }

    assert.commandWorked(db.adminCommand({ configureFailPoint: 'WTWriteConflictException',
                                           mode: 'off'}));
}())
|
|
Hi,
I'd like to add more information since we've had this behavior in another project/database.
Things in common:
- WiredTiger as storage engine.
- "Heavy" insert()/update(). In fact, it's a bulk operation that performs ~30-100 upserts.
- It uses a compound unique index.
- More indexes (not unique) are defined.
Update: this one is a single MongoDB server.
|
|
We are migrating our 3.0.3 MMAPv1 cluster of 5 replica sets to WiredTiger and got hit by the same issue. So far only our secondaries had been running on WT, and they were doing okay, but only as secondaries.
Our test migration was to switch the MMAPv1 primaries with the WT secondaries so that we could see the effect of the WT nodes getting all the writes and load, while being able to fall back on MMAPv1 if something went wrong (which it did).
The same kind of operations and workload produced about 50-250 WriteConflicts per minute, and we also have a compound unique index on the affected collections.
Obviously it killed our app's performance, so we had to roll back quite quickly.
|
|
Hi david.daly,
I see 5 indexes from your output above. Are any of them unique (besides _id?).
Yes, we do have another unique index, named u_1_s_1. This is the output of a db.collection.getIndexes():
[
    {
        "v" : 1,
        "key" : { "_id" : 1 },
        "name" : "_id_",
        "ns" : "ourdbname.collection"
    },
    {
        "v" : 1,
        "unique" : true,
        "key" : { "u" : 1, "s" : 1 },
        "name" : "u_1_s_1",
        "ns" : "ourdbname.collection",
        "background" : true
    },
    {
        "v" : 1,
        "key" : { "u" : 1, "value2d" : -1 },
        "name" : "u_1_value2d_-1",
        "ns" : "ourdbname.collection",
        "background" : true
    },
    {
        "v" : 1,
        "key" : { "u" : 1, "value7d" : -1 },
        "name" : "u_1_value7d_-1",
        "ns" : "ourdbname.collection",
        "background" : true
    },
    {
        "v" : 1,
        "key" : { "u" : 1, "v" : -1 },
        "name" : "u_1_v_-1",
        "ns" : "ourdbname.collection",
        "background" : true
    },
    {
        "v" : 1,
        "key" : { "u" : 1, "value2y" : -1 },
        "name" : "u_1_v2y_-1",
        "ns" : "ourdbname.collection",
        "background" : true
    }
]
What are you using as the shard key?
We're using the field u. Here's the partial output of a sh.status():
ourdbname.collection
    shard key: { "u" : 1 }
    chunks:
        rs01  228
        rs02  231
        rs03  234
        rs04  227
        rs05  227
    too many chunks to print, use verbose if you want to force print
The field u is a ReferenceField from MongoEngine; it is basically an ObjectId that references the _id field in another collection.
How frequent are the "Had WriteConflict during multi update, aborting" messages? Do they happen in bunches, or do they occur regularly in the trace?
It depends on the application load. Right now, this is what I see in the log of one of the primary mongod processes:
2015-05-05T10:13:12.153+0200 I WRITE [conn94734] Had WriteConflict during multi update, aborting
2015-05-05T10:13:40.340+0200 I WRITE [conn94125] Had WriteConflict during multi update, aborting
2015-05-05T10:14:05.404+0200 I WRITE [conn94125] Had WriteConflict during multi update, aborting
2015-05-05T10:14:15.952+0200 I WRITE [conn93215] Had WriteConflict during multi update, aborting
2015-05-05T10:14:28.920+0200 I WRITE [conn93215] Had WriteConflict during multi update, aborting
2015-05-05T10:14:30.749+0200 I WRITE [conn90462] Had WriteConflict during multi update, aborting
2015-05-05T10:14:37.799+0200 I WRITE [conn94734] Had WriteConflict during multi update, aborting
2015-05-05T10:14:47.016+0200 I WRITE [conn94734] Had WriteConflict during multi update, aborting
2015-05-05T10:14:51.805+0200 I WRITE [conn94734] Had WriteConflict during multi update, aborting
2015-05-05T10:15:21.282+0200 I WRITE [conn71125] Had WriteConflict during multi update, aborting
2015-05-05T10:15:47.750+0200 I WRITE [conn94125] Had WriteConflict during multi update, aborting
2015-05-05T10:15:51.644+0200 I WRITE [conn63785] Had WriteConflict during multi update, aborting
2015-05-05T10:15:59.008+0200 I WRITE [conn93215] Had WriteConflict during multi update, aborting
2015-05-05T10:16:10.894+0200 I WRITE [conn95373] Had WriteConflict during multi update, aborting
2015-05-05T10:16:22.795+0200 I WRITE [conn94125] Had WriteConflict during multi update, aborting
As you can see, we get about 4-6 WriteConflicts per minute. Hopefully the ss.log file has been correctly uploaded to the upload location Asya provided. david.daly, if you want, I can provide you with a larger ss.log; this one covers a few minutes with a delay of 1 sec.
|
|
Looking through the log file Alex attached (generated with the tracing from Eliot's comment), there was one occurrence of an update aborting in the trace and 47 other WriteConflictExceptions that were caught. This is the backtrace for the one that caused the abort.
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B4F859"},{"b":"400000","o":"5D3F68"},{"b":"400000","o":"979FFF"},{"b":"400000","o":"95E8F5"},{"b":"400000","o":"95F778"},{"b":"400000","o":"68675B"},{"b":"400000","o":"524B8E"},{"b":"400000","o":"524F05"},{"b":"400000","o":"512F20"},{"b":"400000","o":"51331D"},{"b":"400000","o":"63A4A4"},{"b":"400000","o":"63B0CC"},{"b":"400000","o":"7D1544"},{"b":"400000","o":"7D18F4"},{"b":"400000","o":"7D1F2D"},{"b":"400000","o":"5B231D"},{"b":"400000","o":"5B3B0C"},{"b":"400000","o":"5B404B"},{"b":"400000","o":"5B69ED"},{"b":"400000","o":"5DA114"},{"b":"400000","o":"5DB0FD"},{"b":"400000","o":"5DBE0B"},{"b":"400000","o":"79FC36"},{"b":"400000","o":"6B65C0"},{"b":"400000","o":"40E09D"},{"b":"400000","o":"B0342B"},{"b":"7F7ECFDC8000","o":"8182"},{"b":"7F7ECE79E000","o":"FA47D"}],"processInfo":{ "mongodbVersion" : "3.0.2", "gitVersion" : "6201872043ecbbc0a4cc169b5482dcf385fc464f", "uname" : { "sysname" : "Linux", "release" : "3.16.0-34-generic", "version" : "#45~14.04.1-Ubuntu SMP Tue Mar 24 11:14:29 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "72DF3CBF268392E00CB7F12561E5A4EDC3939078" }, { "b" : "7FFD161AE000", "elfType" : 3, "buildId" : "5552B9335DDE9349419BA10896C1E75C9432A946" }, { "b" : "7F7ECFDC8000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }, { "b" : "7F7ECFB6A000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "FF43D0947510134A8A494063A3C1CF3CEBB27791" }, { "b" : "7F7ECF78F000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "B927879B878D90DD9FF4B15B00E7799AA8E0272F" }, { "b" : "7F7ECF587000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7F7ECF383000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7F7ECF07F000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "19EFDDAB11B3BF5C71570078C59F91CF6592CE9E" }, { "b" : "7F7ECED79000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7F7ECEB63000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "8D0AA71411580EE6C08809695C3984769F25725B" }, { "b" : "7F7ECE79E000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "30C94DC66A1FE95180C3D68D2B89E576D5AE213C" }, { "b" : "7F7ECFFE6000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf4f859]
mongod(_ZN5mongo22WriteConflictExceptionC1Ev+0xA8) [0x9d3f68]
mongod(_ZN5mongo17wtRCToStatus_slowEiPKc+0x32F) [0xd79fff]
mongod(_ZN5mongo21WiredTigerIndexUnique7_insertEP11__wt_cursorRKNS_7BSONObjERKNS_8RecordIdEb+0x425) [0xd5e8f5]
mongod(_ZN5mongo15WiredTigerIndex6insertEPNS_16OperationContextERKNS_7BSONObjERKNS_8RecordIdEb+0xC8) [0xd5f778]
mongod(_ZN5mongo22BtreeBasedAccessMethod6insertEPNS_16OperationContextERKNS_7BSONObjERKNS_8RecordIdERKNS_19InsertDeleteOptionsEPl+0x19B) [0xa8675b]
mongod(_ZN5mongo12IndexCatalog12_indexRecordEPNS_16OperationContextEPNS_17IndexCatalogEntryERKNS_7BSONObjERKNS_8RecordIdE+0x6E) [0x924b8e]
mongod(_ZN5mongo12IndexCatalog11indexRecordEPNS_16OperationContextERKNS_7BSONObjERKNS_8RecordIdE+0x85) [0x924f05]
mongod(_ZN5mongo10Collection15_insertDocumentEPNS_16OperationContextERKNS_7BSONObjEb+0xB0) [0x912f20]
mongod(_ZN5mongo10Collection14insertDocumentEPNS_16OperationContextERKNS_7BSONObjEb+0x8D) [0x91331d]
mongod(_ZN5mongo11UpdateStage8doInsertEv+0x194) [0xa3a4a4]
mongod(_ZN5mongo11UpdateStage4workEPm+0xCC) [0xa3b0cc]
mongod(_ZN5mongo12PlanExecutor18getNextSnapshottedEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0xA4) [0xbd1544]
mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x34) [0xbd18f4]
mongod(_ZN5mongo12PlanExecutor11executePlanEv+0x3D) [0xbd1f2d]
mongod(_ZN5mongo18WriteBatchExecutor10execUpdateERKNS_12BatchItemRefEPNS_7BSONObjEPPNS_16WriteErrorDetailE+0x71D) [0x9b231d]
mongod(_ZN5mongo18WriteBatchExecutor11bulkExecuteERKNS_21BatchedCommandRequestERKNS_19WriteConcernOptionsEPSt6vectorIPNS_19BatchedUpsertDetailESaIS9_EEPS7_IPNS_16WriteErrorDetailESaISE_EE+0x23C) [0x9b3b0c]
mongod(_ZN5mongo18WriteBatchExecutor12executeBatchERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE+0x37B) [0x9b404b]
mongod(_ZN5mongo8WriteCmd3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x15D) [0x9b69ed]
mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9da114]
mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC7D) [0x9db0fd]
mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9dbe0b]
mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERKNS_15NamespaceStringERNS_5CurOpES3_+0x746) [0xb9fc36]
mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xB10) [0xab65c0]
mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xDD) [0x80e09d]
mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x34B) [0xf0342b]
libpthread.so.0(+0x8182) [0x7f7ecfdd0182]
libc.so.6(clone+0x6D) [0x7f7ece89847d]
----- END BACKTRACE -----
2015-04-28T17:33:26.161+0200 I WRITE [conn75303] Had WriteConflict during multi update, aborting
It's in WiredTigerIndexUnique::_insert and from the stack trace it appears to be from a WT_ROLLBACK return code from the cursor insert call (to insert into the index).
alex@v5tech.es, thanks for the report. Some follow-up questions for you:
- I see 5 indexes from your output above. Are any of them unique (besides _id?).
- What are you using as the shard key?
- How frequent are the "Had WriteConflict during multi update, aborting" messages? Do they happen in bunches, or do they occur regularly in the trace?
And a request: could you collect some more data? Could you run
delay=1 # pick a number in seconds
mongo --eval "while(true) {print(JSON.stringify(db.serverStatus())); sleep(1000*$delay)}" >ss.log &
and upload the ss.log file? I'd like to correlate that data with the times when the write conflicts cause aborts. If you need to run for a long time (hours) to get the aborts to happen, set delay=10 or so to limit the file size.
|
|
Hi eliot,
I've uploaded it (mongod-log.txt). It's from one of the primary replica set mongod processes. The stack traces were on for about a minute; I hope it's enough (if not, I can run it again).
PS: I've modified the domain name and collection names.
|
|
Alex,
Assuming this is still happening, can you run the following in the shell:
db.adminCommand( {setParameter:1, traceWriteConflictExceptions:true} );
|
That will generate stack traces in the logs that we can use to track this down.
You'll want to turn it off pretty soon after you get stack traces, as it'll hurt performance while it's on.
Hopefully once you send us those stack traces it'll be easy for us to fix.
|
|
SERVER-17194 is a performance regression ticket showing high levels of WriteConflicts at high thread counts with an update-heavy workload. This sounds similar, with high concurrency and with the write conflicts delaying application-level threads enough to cause trouble.
SERVER-17194 doesn't involve an application-level or driver-level error, however, so the two definitely differ in that respect.