[SERVER-8594] Killing parallel M/R jobs causes temporary collection corruption Created: 15/Feb/13  Updated: 11/Jul/16  Resolved: 01/Mar/13

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 2.4.0-rc2

Type: Bug Priority: Blocker - P1
Reporter: Ben Becker Assignee: Mathias Stearn
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

All


Attachments: Text File invalid_bsonobj_log.txt     File mr_memleak_test.js    
Operating System: ALL
Steps To Reproduce:

Execute attached MapReduce script, and kill mongod as soon as a few emit progress messages show up:

Fri Feb 15 13:56:52.000 [conn5] 		M/R: (1/3) Emit Progress: 1600/5000	32%
Fri Feb 15 13:56:52.005 [conn3] 		M/R: (1/3) Emit Progress: 1700/5000	34%
Fri Feb 15 13:56:52.005 [conn4] 		M/R: (1/3) Emit Progress: 1700/5000	34%

Participants:

 Description   

Bisected down to commit f62bc8177b52420e7dc4d0086b6902ccd188f725.

When running parallel M/R jobs and killing mongod via C-c, mongod fails to start up due to an invalid record in a temp collection

Fri Feb 15 14:03:34.857 [initandlisten] ** WARNING: soft rlimits too low. Number of files is 256, should be at least 1000
Fri Feb 15 14:03:34.857 [initandlisten] db version v2.4.0-rc1-pre-, pdfile version 4.5
Fri Feb 15 14:03:34.857 [initandlisten] git version: ea4b79b64fc4b7fd80aa5b4ec171ac8a37190801
Fri Feb 15 14:03:34.857 [initandlisten] build info: Darwin leaf.local 11.4.2 Darwin Kernel Version 11.4.2: Thu Aug 23 16:25:48 PDT 2012; root:xnu-1699.32.7~1/RELEASE_X86_64 x86_64 BOOST_LIB_VERSION=1_49
Fri Feb 15 14:03:34.857 [initandlisten] allocator: tcmalloc
Fri Feb 15 14:03:34.857 [initandlisten] options: {}
Fri Feb 15 14:03:34.860 [initandlisten] journal dir=/data/db/journal
Fri Feb 15 14:03:34.860 [initandlisten] recover : no journal files present, no recovery needed
Fri Feb 15 14:03:34.900 [initandlisten] Assertion: 10334:BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: name: "test.tmp.mr.foo_2_inc.$0_1"
0x10a776405 0x10a73983b 0x10a717a43 0x10a717c5a 0x109ffba58 0x10a510eec 0x109f4fa8c 0x109fefbb7 0x10a109c66 0x10a119c85 0x10a11d175 0x10a04625e 0x10a0465b5 0x109e7f795 0x109e82d6e 0x109e83488 0x109e8f480 0x109e8f548 0x109e7a8b4 
 0   mongod                              0x000000010a776405 _ZN5mongo15printStackTraceERSo + 37
 1   mongod                              0x000000010a73983b _ZN5mongo10logContextEPKc + 123
 2   mongod                              0x000000010a717a43 _ZN5mongo11msgassertedEiPKc + 339
 3   mongod                              0x000000010a717c5a _ZNK5mongo13ExceptionInfo6appendERNS_14BSONObjBuilderEPKcS4_ + 0
 4   mongod                              0x0000000109ffba58 _ZNK5mongo7BSONObj14_assertInvalidEv + 984
 5   mongod                              0x000000010a510eec _ZN5mongo7BSONObj4initEPKc + 76
 6   mongod                              0x0000000109f4fa8c _ZN5mongo7BSONObjC1EPKc + 60
 7   mongod                              0x0000000109fefbb7 _ZN5mongo7BSONObj4makeEPKNS_6RecordE + 55
 8   mongod                              0x000000010a109c66 _ZN5mongo11BasicCursor7currentEv + 66
 9   mongod                              0x000000010a119c85 _ZN5mongo8Database19clearTmpCollectionsEv + 397
 10  mongod                              0x000000010a11d175 _ZN5mongo14DatabaseHolder11getOrCreateERKSsS2_Rb + 1993
 11  mongod                              0x000000010a04625e _ZN5mongo6Client7Context11_finishInitEv + 210
 12  mongod                              0x000000010a0465b5 _ZN5mongo6Client7ContextC1ERKSsS3_b + 225
 13  mongod                              0x0000000109e7f795 _ZN5mongoL30repairDatabasesAndCheckVersionEv + 725
 14  mongod                              0x0000000109e82d6e _ZN5mongo14_initAndListenEi + 5262
 15  mongod                              0x0000000109e83488 _ZN5mongo13initAndListenEi + 24
 16  mongod                              0x0000000109e8f480 _ZL11mongoDbMainiPPcS0_ + 784
 17  mongod                              0x0000000109e8f548 main + 40
 18  mongod                              0x0000000109e7a8b4 start + 52
Fri Feb 15 14:03:34.915 [initandlisten] exception in initAndListen: 10334 BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: name: "test.tmp.mr.foo_2_inc.$0_1", terminating
Fri Feb 15 14:03:34.915 dbexit: 



 Comments   
Comment by auto [ 01/Mar/13 ]

Author:

{u'date': u'2013-02-28T18:13:47Z', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-8594 fix clearTmpCollections for collections with indexes
Branch: master
https://github.com/mongodb/mongo/commit/c909370ff97ccf17d340ab3c337269f8c181af92

Comment by Mathias Stearn [ 28/Feb/13 ]

The issue is that dropCollection() removes the collection and its indexes. If the cursor is currently on one of those indexes it will be invalidated. This was handled correctly prior to https://github.com/mongodb/mongo/commit/f62bc8177b52420e7dc4d0086b6902ccd188f725 due to the use of DBDirectClient which handles the deletion under the cursor automatically. I'll post a fix for this soon.

Easy repro:

use screwy
db.foo.runCommand('create', {temp:true})
db.foo.ensureIndex({x:1})
// restart mongod and watch it crash

Comment by Ben Becker [ 27/Feb/13 ]

I can reproduce this again, but not as reliably. Testing with the latest from master after killing mongod with C-\. Data files have been cleared.

leaf:~/projects/mongo (master*) $ rm -rf /data/db/*
leaf:~/projects/mongo (master*) $ ./mongod
./mongod --help for help and startup options
Wed Feb 27 02:22:34.560 [initandlisten] MongoDB starting : pid=2841 port=27017 dbpath=/data/db/ 64-bit host=leaf.local
Wed Feb 27 02:22:34.561 [initandlisten] 
Wed Feb 27 02:22:34.561 [initandlisten] ** WARNING: soft rlimits too low. Number of files is 256, should be at least 1000
Wed Feb 27 02:22:34.561 [initandlisten] db version v2.4.0-rc2-pre-, pdfile version 4.5
Wed Feb 27 02:22:34.561 [initandlisten] git version: b06ea7b670826856cbc28ba99ae766c7ca397084
Wed Feb 27 02:22:34.561 [initandlisten] build info: Darwin leaf.local 12.2.0 Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_64 BOOST_LIB_VERSION=1_49
Wed Feb 27 02:22:34.561 [initandlisten] allocator: tcmalloc
Wed Feb 27 02:22:34.561 [initandlisten] options: {}
Wed Feb 27 02:22:34.564 [initandlisten] journal dir=/data/db/journal
Wed Feb 27 02:22:34.564 [initandlisten] recover : no journal files present, no recovery needed
Wed Feb 27 02:22:34.577 [FileAllocator] allocating new datafile /data/db/local.ns, filling with zeroes...
Wed Feb 27 02:22:34.577 [FileAllocator] creating directory /data/db/_tmp
Wed Feb 27 02:22:34.607 [FileAllocator] done allocating datafile /data/db/local.ns, size: 16MB,  took 0.03 secs
Wed Feb 27 02:22:34.651 [FileAllocator] allocating new datafile /data/db/local.0, filling with zeroes...
Wed Feb 27 02:22:34.960 [FileAllocator] done allocating datafile /data/db/local.0, size: 64MB,  took 0.308 secs
Wed Feb 27 02:22:35.009 [initandlisten] command local.$cmd command: { create: "startup_log", size: 10485760, capped: true } ntoreturn:1 keyUpdates:0  reslen:37 431ms
Wed Feb 27 02:22:35.009 [websvr] admin web console waiting for connections on port 28017
Wed Feb 27 02:22:35.009 [initandlisten] waiting for connections on port 27017
Wed Feb 27 02:22:53.338 [initandlisten] connection accepted from 127.0.0.1:50000 #1 (1 connection now open)
Wed Feb 27 02:23:34.256 [FileAllocator] allocating new datafile /data/db/test.ns, filling with zeroes...
Wed Feb 27 02:23:34.289 [FileAllocator] done allocating datafile /data/db/test.ns, size: 16MB,  took 0.031 secs
Wed Feb 27 02:23:34.332 [FileAllocator] allocating new datafile /data/db/test.0, filling with zeroes...
Wed Feb 27 02:23:34.635 [FileAllocator] done allocating datafile /data/db/test.0, size: 64MB,  took 0.302 secs
Wed Feb 27 02:23:34.690 [FileAllocator] allocating new datafile /data/db/test.1, filling with zeroes...
Wed Feb 27 02:23:34.692 [conn1] build index test.system.js { _id: 1 }
Wed Feb 27 02:23:34.713 [conn1] build index done.  scanned 0 total records. 0.021 secs
Wed Feb 27 02:23:34.713 [conn1] insert test.system.js ninserted:1 keyUpdates:0 locks(micros) w:457421 457ms
Wed Feb 27 02:23:35.438 [FileAllocator] done allocating datafile /data/db/test.1, size: 128MB,  took 0.747 secs
Wed Feb 27 02:23:39.068 [conn1] end connection 127.0.0.1:50000 (0 connections now open)
Wed Feb 27 02:24:19.605 [initandlisten] connection accepted from 127.0.0.1:50005 #2 (1 connection now open)
Wed Feb 27 02:24:19.625 [conn2] dropDatabase test starting
Wed Feb 27 02:24:19.638 [conn2] removeJournalFiles
Wed Feb 27 02:24:19.648 [conn2] dropDatabase test finished
Wed Feb 27 02:24:19.658 [FileAllocator] allocating new datafile /data/db/test.ns, filling with zeroes...
Wed Feb 27 02:24:19.687 [FileAllocator] done allocating datafile /data/db/test.ns, size: 16MB,  took 0.028 secs
Wed Feb 27 02:24:19.728 [FileAllocator] allocating new datafile /data/db/test.0, filling with zeroes...
Wed Feb 27 02:24:20.030 [initandlisten] connection accepted from 127.0.0.1:50006 #3 (2 connections now open)
 
...
 
Wed Feb 27 02:28:08.403 [conn19] build index test.tmp.mr.foods.foo_15 { value.l3m.l: -1.0 }
Wed Feb 27 02:28:08.403 [conn19] build index done.  scanned 0 total records. 0 secs
Wed Feb 27 02:28:08.404 [conn19] build index test.tmp.mr.foods.foo_15 { value.l6m.f: -1.0 }
Wed Feb 27 02:28:08.404 [conn19] build index done.  scanned 0 total records. 0 secs
Wed Feb 27 02:28:08.404 [conn19] build index test.tmp.mr.foods.foo_15 { value.l6m.w: -1.0 }
Wed Feb 27 02:28:08.405 [conn19] build index done.  scanned 0 total records. 0 secs
Wed Feb 27 02:28:08.405 [conn19] build index test.tmp.mr.foods.foo_15 { value.l6m.l: -1.0 }
Wed Feb 27 02:28:08.405 [conn19] build index done.  scanned 0 total records. 0 secs
^\Wed Feb 27 02:28:09.514 Got signal: 3 (Quit: 3).
 
Wed Feb 27 02:28:09.523 Backtrace:
0x1024b17e5 0x101bb3f3e 0x7fff8744d8ea 0x10254466e 0x102495095 0x101bba67f 0x101bbbd4b 0x101bbc2a8 0x101bc82a0 0x101bc8368 0x7fff8863a7e1 0x1 
 0   mongod                              0x00000001024b17e5 _ZN5mongo15printStackTraceERSo + 37
 1   mongod                              0x0000000101bb3f3e _ZN5mongo10abruptQuitEi + 446
 2   libsystem_c.dylib                   0x00007fff8744d8ea _sigtramp + 26
 3   mongod                              0x000000010254466e _ZL20CheckCachedSizeClassPv + 78
 4   mongod                              0x0000000102495095 _ZN5mongo17PortMessageServer3runEv + 37
 5   mongod                              0x0000000101bba67f _ZN5mongo6listenEi + 399
 6   mongod                              0x0000000101bbbd4b _ZN5mongo14_initAndListenEi + 5707
 7   mongod                              0x0000000101bbc2a8 _ZN5mongo13initAndListenEi + 24
 8   mongod                              0x0000000101bc82a0 _ZL11mongoDbMainiPPcS0_ + 784
 9   mongod                              0x0000000101bc8368 main + 40
 10  libdyld.dylib                       0x00007fff8863a7e1 start + 0
 11  ???                                 0x0000000000000001 0x0 + 1
 
leaf:~/projects/mongo (master*) $ ./mongod
./mongod --help for help and startup options
Wed Feb 27 02:28:11.074 [initandlisten] MongoDB starting : pid=2969 port=27017 dbpath=/data/db/ 64-bit host=leaf.local
Wed Feb 27 02:28:11.075 [initandlisten] 
Wed Feb 27 02:28:11.075 [initandlisten] ** WARNING: soft rlimits too low. Number of files is 256, should be at least 1000
Wed Feb 27 02:28:11.075 [initandlisten] db version v2.4.0-rc2-pre-, pdfile version 4.5
Wed Feb 27 02:28:11.075 [initandlisten] git version: b06ea7b670826856cbc28ba99ae766c7ca397084
Wed Feb 27 02:28:11.075 [initandlisten] build info: Darwin leaf.local 12.2.0 Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_64 BOOST_LIB_VERSION=1_49
Wed Feb 27 02:28:11.075 [initandlisten] allocator: tcmalloc
Wed Feb 27 02:28:11.075 [initandlisten] options: {}
Wed Feb 27 02:28:11.078 [initandlisten] journal dir=/data/db/journal
Wed Feb 27 02:28:11.078 [initandlisten] recover begin
Wed Feb 27 02:28:11.078 [initandlisten] recover lsn: 274226
Wed Feb 27 02:28:11.078 [initandlisten] recover /data/db/journal/j._4
Wed Feb 27 02:28:11.078 [initandlisten] recover skipping application of section seq:221391 < lsn:274226
Wed Feb 27 02:28:11.247 [initandlisten] recover cleaning up
Wed Feb 27 02:28:11.247 [initandlisten] removeJournalFiles
Wed Feb 27 02:28:11.247 [initandlisten] recover done
Wed Feb 27 02:28:11.287 [initandlisten] Assertion: 10334:BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: name: "test.tmp.mr.foods.foo_15_inc.$0_1"
0x10f4627e5 0x10f425cfb 0x10f403f03 0x10f40411a 0x10ece6038 0x10f1fba2c 0x10ec39c4c 0x10ecda207 0x10edf4596 0x10ee045b5 0x10ee07b17 0x10ed305be 0x10ed30915 0x10eb695b5 0x10eb6cb8e 0x10eb6d2a8 0x10eb792a0 0x10eb79368 0x7fff8863a7e1 0x1 
 0   mongod                              0x000000010f4627e5 _ZN5mongo15printStackTraceERSo + 37
 1   mongod                              0x000000010f425cfb _ZN5mongo10logContextEPKc + 123
 2   mongod                              0x000000010f403f03 _ZN5mongo11msgassertedEiPKc + 339
 3   mongod                              0x000000010f40411a _ZNK5mongo13ExceptionInfo6appendERNS_14BSONObjBuilderEPKcS4_ + 0
 4   mongod                              0x000000010ece6038 _ZNK5mongo7BSONObj14_assertInvalidEv + 984
 5   mongod                              0x000000010f1fba2c _ZN5mongo7BSONObj4initEPKc + 76
 6   mongod                              0x000000010ec39c4c _ZN5mongo7BSONObjC1EPKc + 60
 7   mongod                              0x000000010ecda207 _ZN5mongo7BSONObj4makeEPKNS_6RecordE + 55
 8   mongod                              0x000000010edf4596 _ZN5mongo11BasicCursor7currentEv + 66
 9   mongod                              0x000000010ee045b5 _ZN5mongo8Database19clearTmpCollectionsEv + 397
 10  mongod                              0x000000010ee07b17 _ZN5mongo14DatabaseHolder11getOrCreateERKSsS2_Rb + 1993
 11  mongod                              0x000000010ed305be _ZN5mongo6Client7Context11_finishInitEv + 210
 12  mongod                              0x000000010ed30915 _ZN5mongo6Client7ContextC1ERKSsS3_b + 225
 13  mongod                              0x000000010eb695b5 _ZN5mongoL30repairDatabasesAndCheckVersionEv + 725
 14  mongod                              0x000000010eb6cb8e _ZN5mongo14_initAndListenEi + 5262
 15  mongod                              0x000000010eb6d2a8 _ZN5mongo13initAndListenEi + 24
 16  mongod                              0x000000010eb792a0 _ZL11mongoDbMainiPPcS0_ + 784
 17  mongod                              0x000000010eb79368 main + 40
 18  libdyld.dylib                       0x00007fff8863a7e1 start + 0
 19  ???                                 0x0000000000000001 0x0 + 1
Wed Feb 27 02:28:11.302 [initandlisten] exception in initAndListen: 10334 BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: name: "test.tmp.mr.foods.foo_15_inc.$0_1", terminating
Wed Feb 27 02:28:11.302 dbexit: 

Comment by auto [ 15/Feb/13 ]

Author:

{u'date': u'2013-02-15T23:06:28Z', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}

Message: SERVER-8594 do not attempt to drop index namespaces when clearing temp collections
Branch: master
https://github.com/mongodb/mongo/commit/caab2a0821cf3e2ad17b6a6ec44979d5881797f0

Comment by Ben Becker [ 15/Feb/13 ]

Attaching invalid_bsonobj_log.txt, which was captured after killing mongod via C-c, then attempting to restart mongod.

Generated at Thu Feb 08 03:17:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.