[SERVER-6790] '[Balancer] Assertion: 13141:Chunk map pointed to incorrect chunk' error in the mongos log Created: 17/Aug/12  Updated: 15/Feb/13  Resolved: 20/Aug/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.0-rc1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vladimir Poluyaktov Assignee: Spencer Brody (Inactive)
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

All servers are EC2 hosts, Ubuntu 11.04 (GNU/Linux 2.6.38-13-virtual x86_64)
9 Shard hosts (m2.4xlarge), 3 Replica sets, 3 servers in each
3 config servers (m1.small)
2 mongos hosts (m1.medium)


Attachments: File configdb.dmp.tgz     File configdb.dmp.tgz     Zip Archive logs.zip    
Issue Links:
Depends
depends on SERVER-6809 splitVector doesn't filter additional... Closed
Duplicate
Operating System: Linux
Participants:

 Description   

Recently we sharded a few big collections in our database (2.2.0-rc1).

The first collection was balanced just fine:
oggiReporting.site.mainStats chunks:
RS02 238
RS03 238
RS01 239

But for the second collection we get a balancer error in the mongos log every few seconds:

Fri Aug 17 00:56:27 [Balancer] ns: oggiReporting.playlist.mainStats going to move { _id: "oggiReporting.playlist.mainStats-campaignId_MinKeyplaylistId_MinKeydate_MinKey", lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('502be6c21d848a61a13d51eb'), ns: "oggiReporting.playlist.mainStats", min: { campaignId: MinKey, playlistId: MinKey, date: MinKey }, max: { campaignId: "014869ee-5d62-4212-a609-40edcc2169d3", playlistId: "0c5bb01a-5780-413d-8d7c-b22c722e6ad4", date: new Date(1309309200000) }, shard: "RS01" } from: RS01 to: RS02 tag []
foo: { campaignId: "0095af50-203b-4e1a-9337-6dad60a46688", playlistId: "09038ddd-ac89-4818-b6c3-90c3c43cdce9", domain: "biography.com", date: new Date(1311120000000), partition: "2011-07" }
*c: ns:oggiReporting.domain.mainStats at: RS02:RS02/RPTDB-RS02-Zuse1b-S01.oggifinogi.com:27018,RPTDB-RS02-Zuse1c-S01.oggifinogi.com:27018,RPTDB-RS02-Zuse1d-S01.oggifinogi.com:27018 lastmod: 2|0||000000000000000000000000 min: { campaignId: MinKey, playlistId: MinKey, domain: MinKey, date: MinKey } max: { campaignId: "0095af50-203b-4e1a-9337-6dad60a46688", playlistId: "09038ddd-ac89-4818-b6c3-90c3c43cdce9", domain: "biography.com", date: new Date(1311120000000), partition: "2011-07" } key: { campaignId: "0095af50-203b-4e1a-9337-6dad60a46688", playlistId: "09038ddd-ac89-4818-b6c3-90c3c43cdce9", domain: "biography.com", date: new Date(1311120000000) }

Fri Aug 17 00:56:27 [Balancer] Assertion: 13141:Chunk map pointed to incorrect chunk
0x5389f1 0x6ed1cb 0x57d090 0x5312ec 0x532d66 0x670b9e 0x6724e4 0x72ab79 0x7f4a00f4cd8c 0x7f4a002eec2d
/usr/bin/mongos(_ZN5mongo15printStackTraceERSo+0x21) [0x5389f1]
/usr/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x9b) [0x6ed1cb]
/usr/bin/mongos(_ZNK5mongo12ChunkManager9findChunkERKNS_7BSONObjE+0x540) [0x57d090]
/usr/bin/mongos(_ZN5mongo8Balancer11_moveChunksEPKSt6vectorIN5boost10shared_ptrINS_11MigrateInfoEEESaIS5_EEb+0x5ac) [0x5312ec]
/usr/bin/mongos(_ZN5mongo8Balancer3runEv+0x986) [0x532d66]
/usr/bin/mongos(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0xbe) [0x670b9e]
/usr/bin/mongos(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x74) [0x6724e4]
/usr/bin/mongos() [0x72ab79]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6d8c) [0x7f4a00f4cd8c]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f4a002eec2d]
Fri Aug 17 00:56:27 [Balancer] distributed lock 'balancer/RPTSRVC-Zuse1b-S01.oggifinogi.com:27017:1345162100:1804289383' unlocked.
Fri Aug 17 00:56:27 [Balancer] scoped connection to RPTDB-CFG-Zuse1b-S01.oggifinogi.com:27019,RPTDB-CFG-Zuse1c-S01.oggifinogi.com:27019,RPTDB-CFG-Zuse1d-S01.oggifinogi.com:27019 not being returned to the pool
Fri Aug 17 00:56:27 [Balancer] caught exception while doing balance: Chunk map pointed to incorrect chunk

I tried rebooting mongos as well as the config and replica set servers - no luck.



 Comments   
Comment by Greg Studer [ 20/Aug/12 ]

Thanks for the info, and thanks for trying rc-1 - the problem you're running into is SERVER-6809. It will be fixed for the release, but the current workaround is to drop the collection and create a new index over exactly the shard key (you can keep all your other indexes).
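
For reference, a minimal mongo shell sketch of that workaround, using the collection and shard key that appear elsewhere in this ticket (the data-reload step and the use of sh.shardCollection are assumptions, not commands quoted from the reporter):

db = db.getSiblingDB("oggiReporting")
// drop the collection that triggers the assertion
db.getCollection("domain.mainStats").drop()
// ... reload the data (e.g. with mongorestore), then build an index over exactly the shard key fields
db.getCollection("domain.mainStats").ensureIndex({ campaignId: 1, playlistId: 1, domain: 1, date: 1 })
// other, wider indexes can be kept or recreated as usual
// finally, shard the collection on that same key
sh.shardCollection("oggiReporting.domain.mainStats", { campaignId: 1, playlistId: 1, domain: 1, date: 1 })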

Comment by Vladimir Poluyaktov [ 20/Aug/12 ]

Hi Greg!

Well, I did the following:

1. I dropped the oggiReporting.domain.mainStats collection.
2. Loaded the collection back from a dump (not sharded yet).
3. Created an index for the collection: db.domain.mainStats.ensureIndex({ "campaignId" : 1, "playlistId" : 1, "domain" : 1, "date" : 1}, {background : 1}) (see the index check sketched after this list)
4. About 24 hours after that I stopped everything - the replica set servers, config servers, and mongos servers. The primary and secondary servers in the replica sets were synchronized by that time. Removed old logs.
5. Started the replica set servers, config servers, and one of the mongos servers with the "-vv" flag.
6. Then I sharded the collection: db.runCommand({ shardcollection : "oggiReporting.domain.mainStats", key : { "campaignId" : 1, "playlistId" : 1, "domain" : 1, "date" : 1}})
7. A few minutes after that I received the same error in the mongos log.
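
Since the workaround Greg describes above hinges on having an index over exactly the shard key, a quick check before sharding - a suggestion of this write-up, not something requested in the ticket - is to list the collection's indexes and confirm the shard key index carries no extra fields:

db = db.getSiblingDB("oggiReporting")
db.getCollection("domain.mainStats").getIndexes()
// expect an index whose key is exactly { campaignId: 1, playlistId: 1, domain: 1, date: 1 }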

Log files from all servers are attached to this issue (see the logs.zip file):

RPTSRVC-Zuse1b-S01.log - mongos server

RPTDB-CFG-Zuse1b-S01.log - config server
RPTDB-CFG-Zuse1c-S01.log - config server
RPTDB-CFG-Zuse1d-S01.log - config server

RPTDB-RS01-Zuse1b-S01.log - primary server in Replica Set RS01
RPTDB-RS01-Zuse1c-S01.log - secondary server in Replica Set RS01
RPTDB-RS01-Zuse1d-S01.log - secondary server in Replica Set RS01

RPTDB-RS02-Zuse1b-S01.log - primary server in Replica Set RS02
RPTDB-RS02-Zuse1c-S01.log - secondary server in Replica Set RS02
RPTDB-RS02-Zuse1d-S01.log - secondary server in Replica Set RS02

RPTDB-RS03-Zuse1b-S01.log - primary server in Replica Set RS03
RPTDB-RS03-Zuse1c-S01.log - secondary server in Replica Set RS03
RPTDB-RS03-Zuse1d-S01.log - secondary server in Replica Set RS03

I also attached a fresh dump of our config db, made right after I received the error (configdb.dmp.tgz).

What is interesting is that we have already sharded four of our big collections, and all of them were balanced just fine:

mongos> db.printShardingStatus()
--- Sharding Status ---
sharding version:

{ "_id" : 1, "version" : 3 }

shards:

{ "_id" : "RS01", "host" : "RS01/RPTDB-RS01-Zuse1b-S01.oggifinogi.com:27018,RPTDB-RS01-Zuse1c-S01.oggifinogi.com:27018,RPTDB-RS01-Zuse1d-S01.oggifinogi.com:27018" } { "_id" : "RS02", "host" : "RS02/RPTDB-RS02-Zuse1b-S01.oggifinogi.com:27018,RPTDB-RS02-Zuse1c-S01.oggifinogi.com:27018,RPTDB-RS02-Zuse1d-S01.oggifinogi.com:27018" } { "_id" : "RS03", "host" : "RS03/RPTDB-RS03-Zuse1b-S01.oggifinogi.com:27018,RPTDB-RS03-Zuse1c-S01.oggifinogi.com:27018,RPTDB-RS03-Zuse1d-S01.oggifinogi.com:27018" }

databases:

{ "_id" : "admin", "partitioned" : false, "primary" : "config" } { "_id" : "social", "partitioned" : false, "primary" : "RS01" } { "_id" : "test", "partitioned" : false, "primary" : "RS01" } { "_id" : "oggiReportingTest", "partitioned" : false, "primary" : "RS01" } { "_id" : "oggiReporting", "partitioned" : true, "primary" : "RS01" }

oggiReporting.cmSite chunks:
RS02 29
RS03 29
RS01 30
too many chunks to print, use verbose if you want to force print
oggiReporting.playlist.displayTimes chunks:
RS02 26
RS03 26
RS01 26
too many chunks to print, use verbose if you want to force print
oggiReporting.playlist.mainStats chunks:
RS02 48
RS03 48
RS01 49
too many chunks to print, use verbose if you want to force print
oggiReporting.site.mainStats chunks:
RS02 238
RS03 238
RS01 239
too many chunks to print, use verbose if you want to force print

Only oggiReporting.domain.mainStats failed with the error, so I think it could just be a data issue.
I could provide you with a dump of the domain.mainStats collection if you need it; just take into account that it is fairly big (the .gz file is about 2.5 GB).

Comment by Greg Studer [ 20/Aug/12 ]

Thanks for the database dump - it looks like there is somehow confusion between the different mainStats collections. Is it possible to also send the full log files of the admin and balancing mongoses you're using?

I'm trying to track down why the balancer selected a chunk from "oggiReporting.playlist.mainStats", which is apparently dropped, while "oggiReporting.domain.mainStats" is the collection being checked.

Also, if you can reproduce this every time you restart mongos, logs from a fresh restart covering a few minutes while the error appears several times would be very helpful. To do this you just need to start mongos with "-vv". If that is not an option, the full log files or any mongos logs you have available will still be useful.
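
For example, restarting the balancing mongos with increased verbosity might look something like the following (the config server string is taken from the log lines in the description; --logpath and --fork are illustrative assumptions about how the process is normally started):

mongos --configdb RPTDB-CFG-Zuse1b-S01.oggifinogi.com:27019,RPTDB-CFG-Zuse1c-S01.oggifinogi.com:27019,RPTDB-CFG-Zuse1d-S01.oggifinogi.com:27019 -vv --logpath /var/log/mongodb/mongos.log --fork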

Just to verify - did you restart the replica set servers (or just step down the primaries) after dropping the old collection that was erroring, and before creating the new collection?

Comment by Vladimir Poluyaktov [ 18/Aug/12 ]

Config database dump (tar.gzip archive)

Comment by Vladimir Poluyaktov [ 18/Aug/12 ]

Hi Spencer!

Config db dump is attached.

I tried to dump/drop/restore/shard the collection three times - each time I got the same error a few minutes after sharding the collection.

Comment by Vladimir Poluyaktov [ 18/Aug/12 ]

Config database dump (tar.gzip archive)

Comment by Pavlo Grinchenko [ 17/Aug/12 ]

We are migrating our non-sharded 2.0.6 to sharded 2.2.0-rc1

Comment by Pavlo Grinchenko [ 17/Aug/12 ]

Vladimir and I are describing the same situation; we both work at the same company.

We will prepare a dump of the config servers for you.

Comment by Spencer Brody (Inactive) [ 17/Aug/12 ]

Do you see the same problem when running on 2.0.7, or only in 2.2.0-rc1?

Pavlo, are you also running 2.2.0-rc1?

Could you attach a dump of your config database, taken by running mongodump against a config server? I'd like to see if your chunk mappings somehow got messed up.

If you'd rather not attach that to this publicly viewable ticket, you can create a ticket in the "Community Private" project, attach the dump there, and then post a link to the Community Private ticket here. Tickets in the Community Private project are only viewable to the reporter and to employees of 10gen.
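
For example, a dump of just the config database could be produced with something like the following (the host name is taken from the logs in the description; the output path and archive name are illustrative):

mongodump --host RPTDB-CFG-Zuse1b-S01.oggifinogi.com --port 27019 --db config --out configdb.dmp
tar czf configdb.dmp.tgz configdb.dmp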

Comment by Pavlo Grinchenko [ 17/Aug/12 ]

We had the following situation:
1) The 1st collection sharded successfully
2) The 2nd collection was in progress
3) The 3rd collection was also going

The 2nd collection started to fail with the exception specified above.

We tried the following work-around:
1) backed up the 2nd collection
2) created an empty sharded collection with the same name
3) ran the restore process

We saw that during the restore process data was already being put into the shards - which is good.
Soon this exception started to happen again. Now we are officially stuck.
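
A rough command-line sketch of that work-around, with <collection> left as a placeholder since the ticket does not name the 2nd collection (hosts and paths are illustrative):

# back up the collection through mongos
mongodump --host RPTSRVC-Zuse1b-S01.oggifinogi.com --db oggiReporting --collection <collection> --out backup/
# in the mongo shell: drop the old collection and re-create it as sharded
#   db.getSiblingDB("oggiReporting").getCollection("<collection>").drop()
#   sh.shardCollection("oggiReporting.<collection>", { /* shard key */ })
# restore the data back through mongos so it is distributed across the shards
mongorestore --host RPTSRVC-Zuse1b-S01.oggifinogi.com --db oggiReporting --collection <collection> backup/oggiReporting/<collection>.bson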
