[SERVER-10578] got writeback waitfor for older id (Sharded Environment) Created: 20/Aug/13  Updated: 05/Sep/13  Resolved: 05/Sep/13

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.4.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Brad C. Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 6.4 x86_64 + pymongo 2.5.2


Operating System: Linux
Participants:

 Description   

The issue only came about when we moved from a single mongod instance to a sharded environment. After a few million inserts I start getting the following OperationFailure error and inserts slow down to a crawl.

I assume this is a bug as the source states:
// it should be impossible because once we call GLE, no other
// writebacks should happen with that connection id

OperationFailure Errors:
ERROR: error inserting 1 documents to shard shard0002:mongo3:27017 at version 16|103||5212c053b8650139d3328156 :: caused by :: got writeback waitfor for older id oid: 5212d98f0000000000000021 s.id: 52133f4c0000000000000082 ident: localhost.localdomain:2
ERROR: error inserting 1 documents to shard shard0001:mongo2:27017 at version 16|103||5212c053b8650139d3328156 :: caused by :: got writeback waitfor for older id oid: 5212d975000000000000001c s.id: 52133f4c0000000000000083 ident: localhost.localdomain:5
ERROR: error inserting 3 documents to shard shard0001:mongo2:27017 at version 16|103||5212c053b8650139d3328156 :: caused by :: got writeback waitfor for older id oid: 5212d98f0000000000000020 s.id: 52133f4c000000000000007d ident: localhost.localdomain:36
(It happens over and over)

mongos log output:
Tue Aug 20 16:36:00.735 [conn58] warning: exception during insert (continue on
error set) :: caused by :: 16460 error inserting 1 documents to shard shard000
0:mongo1:27017 at version 16|103||5212c053b8650139d3328156 :: caused by :: got
writeback waitfor for older id oid: 5212d975000000000000001c s.id: 52133f4c00
00000000000083 ident: localhost.localdomain:5
Tue Aug 20 16:36:00.735 [conn58] warning: swallowing exception during insert
:: caused by :: 16460 error inserting 1 documents to shard
shard0000:mongo1:2701
7 at version 16|103||5212c053b8650139d3328156 :: caused by :: got writeback
waitfor for older id oid: 5212d975000000000000001c s.id:
52133f4c000000000000008
3 ident: localhost.localdomain:5
Tue Aug 20 16:36:00.736 [conn58] Assertion: 14041:got writeback waitfor for
older id oid: 5212d975000000000000001c s.id: 52133f4c0000000000000083 ident:
loc
alhost.localdomain:5
0xa86db1 0xa4cbbb 0xa4d0fc 0x9f2b4f 0x8d376f 0x8d7413 0x9c2960 0x9c522c
0x994011 0x66a161 0xa7351e 0x3bb3e07851 0x3bb36e890d
mongos(_ZN5mongo15printStackTraceERSo+0x21) [0xa86db1]
mongos(_ZN5mongo11msgassertedEiPKc+0x9b) [0xa4cbbb]
mongos() [0xa4d0fc]
mongos(_ZN5mongo17WriteBackListener7waitForERKNS0_15ConnectionIdentERKNS_3OIDE+0xa7f)
[0x9f2b4f]
mongos(_ZN5mongo10ClientInfo17_handleWriteBacksERKSt6vectorINS0_6WBInfoESaIS2_EEb+0x9f)
[0x8d376f]
mongos(_ZN5mongo10ClientInfo12getLastErrorERKSsRKNS_7BSONObjERNS_14BSONObjBuilderERSsb+0x15d3)
[0x8d7413]
mongos(_ZN5mongo13ShardStrategy7_insertERKSsRNS_9DbMessageEiRNS_7RequestE+0x640)
[0x9c2960]
mongos(_ZN5mongo13ShardStrategy7writeOpEiRNS_7RequestE+0x53c) [0x9c522c]
mongos(_ZN5mongo7Request7processEi+0xd1) [0x994011]
mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71)
[0x66a161]
mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xa7351e]
/lib64/libpthread.so.0() [0x3bb3e07851]
/lib64/libc.so.6(clone+0x6d) [0x3bb36e890d]

It's probably also worth noting that the python script doing these inserts is multiprocess.



 Comments   
Comment by Ian Whalen (Inactive) [ 05/Sep/13 ]

Thanks for the follow-up, Brad. I'm closing this now.

Comment by Brad C. [ 05/Sep/13 ]

I'm closing this as it was ultimately an issues with the hostnames of the servers. Even though all shards used unique hostnames in mongodb, it appears that the issue was that two systems had the same hostname at the OS level.

Generated at Thu Feb 08 03:23:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.