[SERVER-5949] Updates can be lost when issued near the time of migration
Created: 28/May/12  Updated: 15/Aug/12  Resolved: 29/May/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.1.1
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Ian Whalen (Inactive)
Assignee: Randolph Tan
Resolution: Duplicate
Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File bbfail    
Issue Links:
Duplicate
is duplicated by SERVER-5200 slowNightly tests failing on sharding... Closed
Operating System: ALL

 Description   

 m30999| Mon May 28 09:58:45 [WriteBackListener-localhost:30001] initializing shard connection to localhost:30001
 m30999| Mon May 28 09:58:45 [WriteBackListener-localhost:30001]     setShardVersion  shard0001 localhost:30001  test.foo  { setShardVersion: "test.foo", configdb: "localhost:30000", version: Timestamp 7000|1, versionEpoch: ObjectId('000000000000000000000000'), serverID: ObjectId('4fc34c265d1a283b7f98f90c'), shard: "shard0001", shardHost: "localhost:30001" } 0x7f40200018b0
 m30999| Mon May 28 09:58:45 [WriteBackListener-localhost:30001]        setShardVersion failed!
 m30999| { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ns: "test.foo", version: Timestamp 7000|1, versionEpoch: ObjectId('000000000000000000000000'), globalVersion: Timestamp 8000|0, globalVersionEpoch: ObjectId('4fc34c265d1a283b7f98f90e'), reloadConfig: true, errmsg: "shard global version for collection is higher than trying to set to 'test.foo'", ok: 0.0 }
 m30999| Mon May 28 09:58:45 [Balancer] moveChunk result: { ok: 1.0 }
 m30999| Mon May 28 09:58:45 [WriteBackListener-localhost:30001] ChunkManager: time to load chunks for test.foo: 0ms sequenceNumber: 66 version: 8|1||000000000000000000000000 based on: 7|1||000000000000000000000000
 m30999| Mon May 28 09:58:45 [WriteBackListener-localhost:30001]     setShardVersion  shard0001 localhost:30001  test.foo  { setShardVersion: "test.foo", configdb: "localhost:30000", version: Timestamp 8000|1, versionEpoch: ObjectId('000000000000000000000000'), serverID: ObjectId('4fc34c265d1a283b7f98f90c'), authoritative: true, shard: "shard0001", shardHost: "localhost:30001" } 0x7f40200018b0
 m30999| Mon May 28 09:58:45 [Balancer] *** end of balancing round
 m30999| Mon May 28 09:58:45 [Balancer] distributed lock 'balancer/bs-linux64con:30999:1338199078:1804289383' unlocked. 
 m30999| Mon May 28 09:58:45 [WriteBackListener-localhost:30001]       setShardVersion success: { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ok: 1.0 }
 m30999| Mon May 28 09:58:45 [conn1]     setShardVersion  shard0000 localhost:30000  test.foo  { setShardVersion: "test.foo", configdb: "localhost:30000", version: Timestamp 8000|0, versionEpoch: ObjectId('000000000000000000000000'), serverID: ObjectId('4fc34c265d1a283b7f98f90c'), shard: "shard0000", shardHost: "localhost:30000" } 0x7f4034002ef0
 m30999| Mon May 28 09:58:45 [conn1]       setShardVersion success: { oldVersion: Timestamp 7000|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ok: 1.0 }
going to assert for id: 413 correct count is: 9 db says count is: {
	"_id" : 413,
	"s" : "asdasd...",
	"x" : 8
}
assert failed : GLE diff myid: 413 1: {
...

http://buildbot.mongodb.org/builders/Nightly%20Linux%2064-bit%20concurrency/builds/59/steps/test_1/logs/stdio
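
The "setShardVersion failed!" entry in the log above can be read as a version comparison: mongos asked for Timestamp 7000|1 while the shard already reported a global version of Timestamp 8000|0. The sketch below is only an illustration of that comparison. It assumes the first Timestamp field is the major version times 1000 (inferred from the log printing 8|1 and Timestamp 8000|1 for the same version), and parse_version is a hypothetical helper, not a server function.

```python
import re


def parse_version(text):
    """Parse 'Timestamp <major*1000>|<minor>' into a (major, minor) tuple.

    Assumption: the first field is the major version scaled by 1000,
    as suggested by the log showing both '8|1' and 'Timestamp 8000|1'.
    """
    match = re.fullmatch(r"Timestamp (\d+)\|(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return int(match.group(1)) // 1000, int(match.group(2))


# Values copied from the failed setShardVersion reply in the log.
requested = parse_version("Timestamp 7000|1")
shard_global = parse_version("Timestamp 8000|0")

# Tuples compare major first, then minor.
is_stale = requested < shard_global
print(requested, shard_global, is_stale)
```

Under these assumptions the requested version is older than the shard's global version, which matches the errmsg in the log and the reloadConfig: true hint that follows it.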



 Comments   
Comment by Randolph Tan [ 29/May/12 ]

Summary:
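1. moveChunk moved the range _id [374, 428) from shard0001 (localhost:30001) to shard0000 (localhost:30000).
2. An increment upsert for _id 413 was still directed to shard0001, even though the chunk had moved.
3. The assert fails because the doc on shard0000 is missing one increment (the query is correctly being directed to shard0000).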
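
The three steps above can be pictured with a toy model. This is a sketch only and does not claim to reproduce the server's actual protocol or the root cause: Shard, run_scenario, the single-chunk routing table and the assumption that eight increments landed before the migration are all hypothetical, chosen to match the counts in the log (expected 9, found 8).

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    name: str
    docs: dict = field(default_factory=dict)

    def inc_upsert(self, doc_id, amount=1):
        # Increment with upsert: a missing doc is created, then incremented.
        self.docs[doc_id] = self.docs.get(doc_id, 0) + amount


def run_scenario(check_version):
    shard0, shard1 = Shard("shard0000"), Shard("shard0001")
    owner = {"shard": shard1, "version": 7}  # authoritative chunk owner
    router_cache = dict(owner)               # router's cached copy

    def routed_inc(doc_id):
        if check_version and router_cache["version"] != owner["version"]:
            router_cache.update(owner)       # refresh the stale routing info
        router_cache["shard"].inc_upsert(doc_id)

    for _ in range(8):                       # assumed: 8 increments pre-migration
        routed_inc(413)

    # Migration: the doc moves to shard0000 and the chunk version is bumped.
    shard0.docs[413] = shard1.docs.pop(413)
    owner.update(shard=shard0, version=8)

    routed_inc(413)                          # 9th increment, issued right after
    return shard0.docs.get(413), shard1.docs.get(413)


print(run_scenario(check_version=False))
print(run_scenario(check_version=True))
```

Without the version check, the ninth increment creates a fresh doc on the old shard, so the new owner is left with 8. With the check, the router refreshes first and the new owner ends at 9.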

Log excerpt:

 m30001| Mon May 28 09:58:45 [conn4] command admin.$cmd command: { moveChunk: "test.foo", from: "localhost:30001", to: "localhost:30000", fromShard: "shard0001", toShard: "shard0000", min: { _id: 374.0 }, max: { _id: 428.0 }, maxChunkSizeBytes: 1048576, shardId: "test.foo-_id_374.0", configdb: "localhost:30000" } ntoreturn:1 keyUpdates:0 reslen:37 1024ms
 m30999| Mon May 28 09:58:45 [Balancer] moveChunk result: { ok: 1.0 }
...
 
going to assert for id: 413 correct count is: 9 db says count is: {
	"_id" : 413,
	"s" : "asd...",
	"x" : 8
}
 
assert failed : GLE diff myid: 413 1: {
	"shards" : [
		"localhost:30000",
		"localhost:30001"
	],
	"shardRawGLE" : {
		"localhost:30000" : {
			"n" : 0,
			"connectionId" : 12,
			"err" : null,
			"ok" : 1
		},
		"localhost:30001" : {
			"updatedExisting" : false,
			"n" : 1, // <------- write was directed to 30001, even though the chunk should now be on 30000
			"connectionId" : 6,
			"err" : null,
			"ok" : 1
		}
	},
	"n" : 1,
	"updatedExisting" : false,
	"err" : null,
	"writeback" : ObjectId("4fc34c55014f31bb223a2449"),
	"instanceIdent" : "bs-linux64con:30001",
	"connectionId" : 3,
	"ok" : 1,
	"writebackGLE" : {
		"shards" : [
			"localhost:30000",
			"localhost:30001"
		],
		"shardRawGLE" : {
			"localhost:30000" : {
				"n" : 0,
				"connectionId" : 12,
				"err" : null,
				"ok" : 1
			},
			"localhost:30001" : {
				"updatedExisting" : false,
				"n" : 1,
				"connectionId" : 6,
				"err" : null,
				"ok" : 1
			}
		},
		"n" : 1,
		"updatedExisting" : false,
		"err" : null
	},
	"initialGLEHost" : "localhost:30001"
}
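
The check made by eye in the GLE output above (which shard reported n: 1) can also be done programmatically. The following is a minimal sketch, assuming the GLE document is available as a plain dictionary; hosts_that_applied_write and expected_owner are hypothetical names, and only the top-level shardRawGLE fields from the output are reproduced.

```python
def hosts_that_applied_write(gle):
    """Return the hosts whose raw GLE entry reports at least one affected doc."""
    return sorted(
        host for host, raw in gle["shardRawGLE"].items() if raw.get("n", 0) > 0
    )


# Subset of the GLE document from the failed assert.
gle = {
    "shards": ["localhost:30000", "localhost:30001"],
    "shardRawGLE": {
        "localhost:30000": {"n": 0, "connectionId": 12, "err": None, "ok": 1},
        "localhost:30001": {
            "updatedExisting": False,
            "n": 1,
            "connectionId": 6,
            "err": None,
            "ok": 1,
        },
    },
    "n": 1,
    "updatedExisting": False,
    "err": None,
}

# After the moveChunk, the chunk holding _id 413 should live here.
expected_owner = "localhost:30000"

applied = hosts_that_applied_write(gle)
misrouted = [host for host in applied if host != expected_owner]
print(misrouted)  # ['localhost:30001']
```

A non-empty misrouted list means a write was applied on a shard that no longer owns the chunk, which is the situation flagged in the output.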
