[SERVER-3547] Collections are not droppable but are present in system.namespaces Created: 08/Aug/11  Updated: 30/Mar/12  Resolved: 18/Aug/11

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.8.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kristina Chodorow (Inactive) Assignee: Aaron Staple
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-3570 allocation failure creating collectio... Closed
Operating System: ALL
Participants:

 Description   

They seem not to have been added to the nsdetails hash.

> db.getCollectionNames()
"tmp.mr.type_foo_tmp_reduce_items_45993_47129_inc",
"tmp.mr.type_foo_tmp_reduce_items_45994_47130_inc",
"type_foo"
]

> db.getCollection("tmp.mr.type_foo_tmp_reduce_items_45994_47130_inc").stats()
{
"ns" : "indexer_cache.tmp.mr.type_foo_tmp_reduce_items_45994_47130_inc",
"sharded" : false,
"primary" : "p",
"errmsg" : "ns not found",
"ok" : 0
}

> db.getCollection("tmp.mr.type_foo_tmp_reduce_items_45994_47130_inc").drop()
false
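
For reference, one way to confirm that the orphaned entries are still listed in system.namespaces (which is what getCollectionNames() reads from in this version) is to query it directly; the name pattern below is taken from the output above:

> db.system.namespaces.find({ name: /tmp\.mr\.type_foo_tmp_reduce_items/ })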

On the replica set:

PRIMARY> db.oplog.rs.findOne({"o.drop": "tmp.mr.type_foo_tmp_reduce_items_45994_47130_inc"})
null

From the log:

Mon Aug 8 08:23:59 [conn50679] CMD: drop dbname.tmp.mr.type_foo_tmp_reduce_items_45925_47047
Mon Aug 8 08:23:59 [conn50679] error: hashtable namespace index max chain reached:1335
Mon Aug 8 08:23:59 [conn50679] CMD: drop dbname.tmp.mr.type_foo_tmp_reduce_items_45925_47047_inc
Mon Aug 8 08:23:59 [conn50679] error: hashtable namespace index max chain reached:1335
Mon Aug 8 08:23:59 [conn50679] error: hashtable namespace index max chain reached:1335
Mon Aug 8 08:23:59 [conn50679] error: hashtable namespace index max chain reached:1335
Mon Aug 8 08:23:59 [conn50679] mr failed, removing collection
Mon Aug 8 08:23:59 [conn50679] CMD: drop dbname.tmp.mr.type_foo_tmp_reduce_items_45925_47047
Mon Aug 8 08:23:59 [conn50679] error: hashtable namespace index max chain reached:1335
Mon Aug 8 08:23:59 [conn50679] CMD: drop dbname.tmp.mr.type_foo_tmp_reduce_items_45925_47047_inc
Mon Aug 8 08:30:00 [conn50679] error: hashtable namespace index max chain reached:1335
Mon Aug 8 08:30:00 [conn50679] CMD: drop dbname.tmp.mr.type_foo_tmp_reduce_items_45934_47056
Mon Aug 8 08:30:00 [conn50679] error: hashtable namespace index max chain reached:1335
Mon Aug 8 08:30:00 [conn50679] CMD: drop dbname.tmp.mr.type_foo_tmp_reduce_items_45934_47056_inc
Mon Aug 8 08:30:00 [conn50679] error: hashtable namespace index max chain reached:1335
Mon Aug 8 08:30:00 [conn50679] error: hashtable namespace index max chain reached:1335
Mon Aug 8 08:30:00 [conn50679] error: hashtable namespace index max chain reached:1335
Mon Aug 8 08:30:00 [conn50679] mr failed, removing collection
Mon Aug 8 08:30:00 [conn50679] error: hashtable namespace index max chain reached:1335
Mon Aug 8 08:30:00 [conn50679] CMD: drop dbname.tmp.mr.type_foo_tmp_reduce_items_45934_47056
Mon Aug 8 08:30:00 [conn50679] error: hashtable namespace index max chain reached:1335



 Comments   
Comment by Mathieu Poumeyrol [ 10/Aug/11 ]

Thanks for the answer, and glad to know it will help nail the root cause of a major issue. And maybe the recovery procedure will help others.

Unfortunately for me, this procedure comes a bit too late. I was unaware the replicas were stuck, so I stupidly scrubbed the primary node's data directory to force a full recovery. I now have to deal with the loss of 5 days of valuable, though not critical, data. The daily backups won't help either, as they are taken from the stale secondaries.

We are still trying to understand how we could remain unaware of the secondaries' drift for 5 days. Obviously, our monitoring was insufficient. Nevertheless, I will spin off one or two bugs and/or feature requests in the coming days, as I feel secondary node state reporting has proven inadequate in this instance and could be greatly improved, to keep somebody else from being as careless as I have been this morning.

Comment by Aaron Staple [ 10/Aug/11 ]

Hi Mathieu,

The root cause of this is SERVER-3570. For now you can drop the bad collection documents via the following steps (a mongo shell sketch follows the list):

1) Drop some number of collections, until you are able to create new collections.
2) Create a new collection with the same collection name as in the orphaned collection document.
3) Drop the collection you just created. This will remove the orphaned collection documents.
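
To illustrate the steps above, here is a minimal mongo shell sketch, reusing the orphaned namespace from the description; the out_* pattern in step 1 is only a hypothetical example of collections you may be willing to drop:

// 1) free up a few namespace slots by dropping collections you no longer need
//    (the out_* pattern is just an example)
db.getCollectionNames().forEach(function(n) {
    if (n.indexOf("out_") == 0) db.getCollection(n).drop();
});

// 2) re-create a collection with the same name as the orphaned entry
db.createCollection("tmp.mr.type_foo_tmp_reduce_items_45994_47130_inc");

// 3) drop it again; per the steps above, this should also remove the orphaned
//    system.namespaces document
db.getCollection("tmp.mr.type_foo_tmp_reduce_items_45994_47130_inc").drop();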

Comment by Eliot Horowitz (Inactive) [ 10/Aug/11 ]

On the secondary, you can use a larger --nssize as a workaround for now.
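
For example (the dbpath and replica set name below are placeholders; as far as I know, a larger --nssize only applies to newly allocated .ns files, so an existing database would still need a resync or repair to pick it up):

# restart the secondary with a larger namespace file (value in MB; the default is 16)
mongod --replSet rs0 --dbpath /data/db --nssize 128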

Comment by Mathieu Poumeyrol [ 10/Aug/11 ]

I just figured out that the issue has also halted replication to the replica set, with the following message:

Wed Aug 10 06:34:33 [replica set sync] replSet syncThread: 10081 too many namespaces/collections
06:34:50 .

Now, I'm in deep.

Comment by Mathieu Poumeyrol [ 09/Aug/11 ]

Trying to find a way to get back to a sane configuration from here, I took the primary offline to run a repair on it. The repair runs fine, but when mongod restarts, I get this kind of error:

[...]
2011-08-09_16:55:38.30319 Tue Aug 9 16:55:38 [replica set sync] replSet WARNING ignoring op on rollback no _id TODO : kali.system.indexes { _id: ObjectId('4e415f2b644c6fce100e2d94'), ts: Timestamp 1312901240000|18, h: 8754756684372246472, op: "i", ns: "kali.system.indexes", o: { ns: "kali.tmp.mr.data_out_12876_41716_inc", key: { 0: 1 }, name: "0_1", v: 0 } }
2011-08-09_16:55:38.30326 Tue Aug 9 16:55:38 [replica set sync] replSet WARNING ignoring op on rollback no _id TODO : kali.system.indexes { _id: ObjectId('4e415f2b644c6fce100e2d91'), ts: Timestamp 1312901240000|15, h: -8852462082678340227, op: "i", ns: "kali.system.indexes", o: { ns: "kali.tmp.mr.data_out_12874_41714_inc", key: { 0: 1 }, name: "0_1", v: 0 } }
2011-08-09_16:55:38.30332 Tue Aug 9 16:55:38 [replica set sync] replSet info rollback of renameCollection is slow in this version of mongod
2011-08-09_16:55:38.30340 Tue Aug 9 16:55:38 [replica set sync] replSet WARNING ignoring op on rollback no _id TODO : kali.system.indexes { _id: ObjectId('4e415f2b644c6fce100e2d89'), ts: Timestamp 1312901240000|7, h: -1640308357504338439, op: "i", ns: "kali.system.indexes", o: { ns: "kali.tmp.mr.data_out_12871_41711_inc", key: { 0: 1 }, name: "0_1", v: 0 } }
2011-08-09_16:55:38.30347 Tue Aug 9 16:55:38 [replica set sync] replSet info rollback of renameCollection is slow in this version of mongod
2011-08-09_16:55:38.30356 Tue Aug 9 16:55:38 [replica set sync] replSet WARNING ignoring op on rollback no _id TODO : kali.system.indexes { _id: ObjectId('4e415f2b644c6fce100e2d80'), ts: Timestamp 1312901239000|12, h: 6925292605345270750, op: "i", ns: "kali.system.indexes", o: { ns: "kali.tmp.mr.data_out_12870_41710_inc", key: { 0: 1 }, name: "0_1", v: 0 } }
2011-08-09_16:55:38.30362 Tue Aug 9 16:55:38 [replica set sync] replSet info rollback of renameCollection is slow in this version of mongod
2011-08-09_16:55:38.30370 Tue Aug 9 16:55:38 [replica set sync] replSet WARNING ignoring op on rollback no _id TODO : kali.system.indexes { _id: ObjectId('4e415f2b644c6fce100e2d7b'), ts: Timestamp 1312901239000|7, h: 3779087353189048104, op: "i", ns: "kali.system.indexes", o: { ns: "kali.tmp.mr.data_out_12869_41709_inc", key: { 0: 1 }, name: "0_1", v: 0 } }
2011-08-09_16:55:38.30376 Tue Aug 9 16:55:38 [replica set sync] replSet WARNING ignoring op on rollback no _id TODO : kali.system.indexes { _id: ObjectId('4e415f2b644c6fce100e2d79'), ts: Timestamp 1312901239000|5, h: -8505161559152385289, op: "i", ns: "kali.system.indexes", o: { ns: "kali.tmp.mr.data_out_12868_41708_inc", key: { 0: 1 }, name: "0_1", v: 0 } }
2011-08-09_16:55:38.30384 Tue Aug 9 16:55:38 [replica set sync] replSet WARNING ignoring op on rollback no _id TODO : kali.system.indexes { _id: ObjectId('4e415f2b644c6fce100e2d77'), ts: Timestamp 1312901239000|3, h: -4092830567795822378, op: "i", ns: "kali.system.indexes", o: { ns: "kali.tmp.mr.data_out_12867_41707_inc", key: { 0: 1 }, name: "0_1", v: 0 } }
2011-08-09_16:55:38.30390 Tue Aug 9 16:55:38 [replica set sync] replSet info rollback of renameCollection is slow in this version of mongod
2011-08-09_16:55:38.30400 Tue Aug 9 16:55:38 [replica set sync] replSet WARNING ignoring op on rollback no _id TODO : kali.system.indexes { _id: ObjectId('4e415f2b644c6fce100e2d6a'), ts: Timestamp 1312901238000|19, h: -2013834961402568477, op: "i", ns: "kali.system.indexes", o: { ns: "kali.tmp.mr.data_out_12866_41706_inc", key: { 0: 1 }, name: "0_1", v: 0 } }
2011-08-09_16:55:38.30407 Tue Aug 9 16:55:38 [replica set sync] replSet info rollback of renameCollection is slow in this version of mongod
2011-08-09_16:55:38.30414 Tue Aug 9 16:55:38 [replica set sync] replSet WARNING ignoring op on rollback no _id TODO : kali.system.indexes { _id: ObjectId('4e415f2b644c6fce100e2d5f'), ts: Timestamp 1312901238000|8, h: 315461714917508211, op: "i", ns: "kali.system.indexes", o: { ns: "kali.tmp.mr.data_out_12865_41705_inc", key: { 0: 1 }, name: "0_1", v: 0 } }
2011-08-09_16:55:38.30421 Tue Aug 9 16:55:38 [replica set sync] replSet WARNING ignoring op on rollback no _id TODO : kali.system.indexes { _id: ObjectId('4e415f2b644c6fce100e2d5d'), ts: Timestamp 1312901238000|6, h: 3404851968451820918, op: "i", ns: "kali.system.indexes", o: { ns: "kali.tmp.mr.data_out_12864_41704_inc", key: { 0: 1 }, name: "0_1", v: 0 } }
2011-08-09_16:55:38.30439 Tue Aug 9 16:55:38 [replica set sync] replSet WARNING ignoring op on rollback no _id TODO : kali.system.indexes { _id: ObjectId('4e415f2b644c6fce100e2d59'), ts: Timestamp 1312901238000|2, h: 7190354169504677580, op: "i", ns: "kali.system.indexes", o: { ns: "kali.tmp.mr.data_out_12863_41703_inc", key: { 0: 1 }, name: "0_1", v: 0 } }
2011-08-09_16:55:38.30446 Tue Aug 9 16:55:38 [replica set sync] replSet rollback found matching events at Aug 9 14:47:18:1
2011-08-09_16:55:38.30451 Tue Aug 9 16:55:38 [replica set sync] replSet rollback findcommonpoint scanned : 83547
2011-08-09_16:55:38.30457 Tue Aug 9 16:55:38 [replica set sync] replSet replSet rollback 3 fixup
2011-08-09_16:55:40.94426 Tue Aug 9 16:55:40 [initandlisten] connection accepted from 10.203.38.176:49256 #11
2011-08-09_16:55:43.71309 Tue Aug 9 16:55:43 [replica set sync] replSet rollback 3.5
2011-08-09_16:55:43.71419 Tue Aug 9 16:55:43 [replica set sync] replSet rollback 4 n:6140
2011-08-09_16:55:43.71424 Tue Aug 9 16:55:43 [replica set sync] replSet minvalid=Aug 9 16:55:43 4e41668f:9
2011-08-09_16:55:43.96736 Tue Aug 9 16:55:43 [replica set sync] replSet rollback 4.1 coll resync core.computed_best_recent_members
2011-08-09_16:55:43.97397 Tue Aug 9 16:55:43 [replica set sync] building new index on { _id: 1 } for core.computed_best_recent_members
2011-08-09_16:55:43.97403 Tue Aug 9 16:55:43 [replica set sync] done for 0 records 0secs
2011-08-09_16:55:43.97404 Tue Aug 9 16:55:43 [replica set sync] Assertion: 13312:replSet error : logOp() but not primary?
2011-08-09_16:55:43.97405 0x555051 0x6bd43f 0x6bc022 0x7179d2 0x6cf80a 0x6cfad5 0x6abedc 0x6b093b 0x6b1be9 0x6b6937 0x6b7b40 0x6b808c 0x81353e 0x7f81db05c971 0x7f81da61992d
2011-08-09_16:55:43.97579 /opt/mongodb/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x131) [0x555051]
2011-08-09_16:55:43.97581 /opt/mongodb/bin/mongod() [0x6bd43f]
2011-08-09_16:55:43.97582 /opt/mongodb/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb+0x42) [0x6bc022]
2011-08-09_16:55:43.97583 /opt/mongodb/bin/mongod(_ZN5mongo12userCreateNSEPKcNS_7BSONObjERSsbPb+0x4d2) [0x7179d2]
2011-08-09_16:55:43.97585 /opt/mongodb/bin/mongod(_ZN5mongo6Cloner14copyCollectionERKSsS2_RKNS_7BSONObjERSsbb+0xaca) [0x6cf80a]
2011-08-09_16:55:43.97586 /opt/mongodb/bin/mongod(_ZN5mongo24copyCollectionFromRemoteERKSsS1_RKNS_7BSONObjERSsb+0x35) [0x6cfad5]
2011-08-09_16:55:43.97587 /opt/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl9syncFixUpERNS_10HowToFixUpERNS_11OplogReaderE+0xc6c) [0x6abedc]
2011-08-09_16:55:43.97588 /opt/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl13_syncRollbackERNS_11OplogReaderE+0x19db) [0x6b093b]
2011-08-09_16:55:43.97590 /opt/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl12syncRollbackERNS_11OplogReaderE+0x9) [0x6b1be9]
2011-08-09_16:55:43.97593 /opt/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl8syncTailEv+0x6b7) [0x6b6937]
2011-08-09_16:55:43.97595 /opt/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x80) [0x6b7b40]
2011-08-09_16:55:43.97596 /opt/mongodb/bin/mongod(_ZN5mongo15startSyncThreadEv+0x24c) [0x6b808c]
2011-08-09_16:55:43.97598 /opt/mongodb/bin/mongod(thread_proxy+0x6e) [0x81353e]
2011-08-09_16:55:43.97599 /lib/libpthread.so.0(+0x7971) [0x7f81db05c971]
2011-08-09_16:55:43.97600 /lib/libc.so.6(clone+0x6d) [0x7f81da61992d]
2011-08-09_16:55:43.97603 Tue Aug 9 16:55:43 [replica set sync] replset rollback error resyncing collection core.computed_best_recent_members
2011-08-09_16:55:43.98326 Tue Aug 9 16:55:43 [replica set sync] replSet unexpected exception in syncThread()

Comment by Mathieu Poumeyrol [ 09/Aug/11 ]

Trying to reproduce the error, I did the following:

db.data.save({ value: "blah" });

m = function() { emit(this.value, 1); }

r = function(key, values) { return 1; }

var i = 0;
for (i = 0; i < 100000; i++) {
    db.runCommand({ mapreduce: "data", map: m, reduce: r, out: "out_" + i })
}

I was expecting it to crash when db.getCollectionNames(), monitored in another shell, reached 28k collections, but it actually went on until I killed it at around 35k.

At that point, the primary node of the RS constantly complains:

2011-08-09_15:24:35.99697 Tue Aug 9 15:24:35 [conn20135] error: hashtable namespace index max chain reached:1335
2011-08-09_15:24:36.00489 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.00521 Tue Aug 9 15:24:36 [conn20135] error: hashtable namespace index max chain reached:1335
2011-08-09_15:24:36.00531 Tue Aug 9 15:24:36 [conn20135] CMD: drop kali.tmp.mr.data_out_37476_66317
2011-08-09_15:24:36.00546 Tue Aug 9 15:24:36 [conn20135] error: hashtable namespace index max chain reached:1335
2011-08-09_15:24:36.00576 Tue Aug 9 15:24:36 [conn20135] error: hashtable namespace index max chain reached:1335
2011-08-09_15:24:36.00588 Tue Aug 9 15:24:36 [conn20135] error: hashtable namespace index max chain reached:1335
2011-08-09_15:24:36.00604 Tue Aug 9 15:24:36 [conn20135] mr failed, removing collection
2011-08-09_15:24:36.00620 Tue Aug 9 15:24:36 [conn20135] error: hashtable namespace index max chain reached:1335
2011-08-09_15:24:36.00629 Tue Aug 9 15:24:36 [conn20135] CMD: drop kali.tmp.mr.data_out_37476_66317
2011-08-09_15:24:36.00643 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.00652 Tue Aug 9 15:24:36 [conn20135] CMD: drop kali.tmp.mr.data_out_37476_66317_inc
2011-08-09_15:24:36.00662 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.00673 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.06410 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.06431 Tue Aug 9 15:24:36 [conn20135] error: hashtable namespace index max chain reached:1335
2011-08-09_15:24:36.06442 Tue Aug 9 15:24:36 [conn20135] error: hashtable namespace index max chain reached:1335
2011-08-09_15:24:36.06497 Tue Aug 9 15:24:36 [conn20135] query kali.$cmd ntoreturn:1 assertion command: { mapreduce: "data", map: function () { emit(this.value, 1); }, reduce: function (key, values) { return 1; }, out: "out_37476" } reslen:152 121ms
2011-08-09_15:24:36.10552 Tue Aug 9 15:24:36 [conn20135] CMD: drop kali.tmp.mr.data_out_37477_66318
2011-08-09_15:24:36.10576 Tue Aug 9 15:24:36 [conn20135] CMD: drop kali.tmp.mr.data_out_37477_66318_inc
2011-08-09_15:24:36.10605 Tue Aug 9 15:24:36 [conn20135] building new index on { 0: 1 } for kali.tmp.mr.data_out_37477_66318_inc
2011-08-09_15:24:36.10645 Tue Aug 9 15:24:36 [conn20135] error: hashtable namespace index max chain reached:1335
2011-08-09_15:24:36.10661 Tue Aug 9 15:24:36 [conn20135] error: hashtable namespace index max chain reached:1335
2011-08-09_15:24:36.10672 Tue Aug 9 15:24:36 [conn20135] error: hashtable namespace index max chain reached:1335
2011-08-09_15:24:36.10706 Tue Aug 9 15:24:36 [conn20135] error: hashtable namespace index max chain reached:1335
2011-08-09_15:24:36.12388 Tue Aug 9 15:24:36 [conn20135] CMD: drop kali.tmp.mr.data_out_37477_66318
2011-08-09_15:24:36.12429 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.12441 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.12451 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.12464 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.12479 Tue Aug 9 15:24:36 [conn20135] building new index on { _id: 1 } for kali.tmp.mr.data_out_37477_66318
2011-08-09_15:24:36.12491 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.12531 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.12543 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.12567 Tue Aug 9 15:24:36 [conn20135] done for 0 records 0secs
2011-08-09_15:24:36.14832 Tue Aug 9 15:24:36 [conn20135] Assertion failure foundIndex db/commands/mr.cpp 587
2011-08-09_15:24:36.14851 0x5457f1 0x5547bf 0x78e508 0x7949ec 0x76011c 0x761c40 0x618bab 0x61c295 0x6f8660 0x8075c0 0x81353e 0x7f9b70a9d971 0x7f9b7005a92d
2011-08-09_15:24:36.15062 /opt/mongodb/bin/mongod(_ZN5mongo12sayDbContextEPKc+0xb1) [0x5457f1]
2011-08-09_15:24:36.15066 /opt/mongodb/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0xbf) [0x5547bf]
2011-08-09_15:24:36.15067 /opt/mongodb/bin/mongod(_ZN5mongo2mr5State11finalReduceEPNS_5CurOpERNS_19ProgressMeterHolderE+0x5f8) [0x78e508]
2011-08-09_15:24:36.15069 /opt/mongodb/bin/mongod(_ZN5mongo2mr16MapReduceCommand3runERKSsRNS_7BSONObjERSsRNS_14BSONObjBuilderEb+0xb6c) [0x7949ec]
2011-08-09_15:24:36.15070 /opt/mongodb/bin/mongod(_ZN5mongo11execCommandEPNS_7CommandERNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x76c) [0x76011c]
2011-08-09_15:24:36.15073 /opt/mongodb/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_10BufBuilderERNS_14BSONObjBuilderEbi+0x350) [0x761c40]
2011-08-09_15:24:36.15074 /opt/mongodb/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_10BufBuilderERNS_14BSONObjBuilderEbi+0x3b) [0x618bab]
2011-08-09_15:24:36.15075 /opt/mongodb/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0xff5) [0x61c295]
2011-08-09_15:24:36.15077 /opt/mongodb/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0xe30) [0x6f8660]
2011-08-09_15:24:36.15078 /opt/mongodb/bin/mongod(_ZN5mongo10connThreadEPNS_13MessagingPortE+0x240) [0x8075c0]
2011-08-09_15:24:36.15080 /opt/mongodb/bin/mongod(thread_proxy+0x6e) [0x81353e]
2011-08-09_15:24:36.15081 /lib/libpthread.so.0(+0x7971) [0x7f9b70a9d971]
2011-08-09_15:24:36.15082 /lib/libc.so.6(clone+0x6d) [0x7f9b7005a92d]
2011-08-09_15:24:36.15099 Tue Aug 9 15:24:36 [conn20135] mr failed, removing collection
2011-08-09_15:24:36.15120 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.15130 Tue Aug 9 15:24:36 [conn20135] CMD: drop kali.tmp.mr.data_out_37477_66318
2011-08-09_15:24:36.15140 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain
2011-08-09_15:24:36.15155 Tue Aug 9 15:24:36 [conn20135] warning: hashtable namespace index long chain

db.getCollectionNames() shows a mix of expected out_xxxxx and "tmp.mr.data_out_xxxxx_yyyyy_inc".

I then tried to discard the unwanted collections:

> db.getCollectionNames().forEach(function(n) { if (n.substr(0,4) == "out_") db.getCollection(n).drop() })
> db.getCollectionNames().forEach(function(n) { if (n.substr(0,3) == "tmp") db.getCollection(n).drop() })

After that, db.getCollectionNames() still shows some "tmp.mr.data_out_xxxxx_yyyyy_inc" collections:

"tmp.mr.data_out_37623_66464_inc",
"tmp.mr.data_out_37624_66465_inc",
"tmp.mr.data_out_37625_66466_inc",
"tmp.mr.data_out_37626_66467_inc"
]

and back to square one:

> db.getCollection("tmp.mr.data_out_37626_66467_inc").stats()

{
"ns" : "kali.tmp.mr.data_out_37626_66467_inc",
"sharded" : false,
"primary" : "pink-alpha",
"errmsg" : "ns not found",
"ok" : 0
}

Comment by Mathieu Poumeyrol [ 09/Aug/11 ]

OK, I reproduced it by running m/r jobs without deleting the output until mongo insulted me. I got just one "rogue" collection, not thousands of them, but it shows the same symptoms:

> show collections
admin
admin_indexer
system.indexes
tmp.mr.type_albums_tmp_reduce_items_12440_28828
type_albums
> db.getCollectionNames()
[
"admin",
"admin_indexer",
"system.indexes",
"tmp.mr.type_albums_tmp_reduce_items_12440_28828",
"type_albums"
]
> db.getCollection("tmp.mr.type_albums_tmp_reduce_items_12440_28828").stats()
{
"ns" : "indexer_cache.tmp.mr.type_albums_tmp_reduce_items_12440_28828",
"sharded" : false,
"primary" : "pink-alpha",
"errmsg" : "ns not found",
"ok" : 0
}

Comment by Mathieu Poumeyrol [ 09/Aug/11 ]

I have not tried to reproduce it. I was actually more looking for a way to get rid of the condition, which is becoming a concern on a production system.

  • our setup is a sharded system with only one shard ("pink-alpha"), which is a 3-node replica set
  • the impacted database was overflowing with collections (the application stack was not discarding an m/r result)
  • I found out, through errors raised by the application, that the collection limit had been exceeded, and immediately fixed the issue in the application layer
  • I then tried to discard the collections by scripting against the mongos: some collections went away gracefully (the "tmp_reduce_items_xxxx" ones generated by the application layer), but the "tmp.mr.type_albums_tmp_reduce_items_xxxxx_yyyyy_inc" ones did not, leading to the situation described in the report
  • after some discussion with Kristina, I started running a repair on the secondaries, planning to run it on the primary next. Both secondaries are now repaired; the primary has not been repaired yet.
    The situation is now as follows:
  • getCollectionNames on the primary and the mongos reports thousands of "tmp.mr.type_albums_tmp_reduce_items_xxxxx_yyyyy_inc" and "tmp.mr.type_albums_tmp_reduce_items_xxxxx_yyyyy" collections, and no "tmp_reduce_items_xxxx"
  • getCollectionNames on both secondaries reports "tmp_reduce_items_xxxx" collections and no "tmp.mr.type_albums_tmp_reduce_items_xxxxx_yyyyy_inc"

The issue is a source of concern for us, but as everything seems to work fine at the moment, I will not attempt a stepDown() and repair on the primary for some time, in case you guys want to investigate. I will also spend some time trying to reproduce the issue on a test system.

kali @freenode, european daylight time.

Comment by Eliot Horowitz (Inactive) [ 09/Aug/11 ]

Do you have a way to reproduce this, or did you just see it happen?
