[SERVER-6585] stale_clustered.js fails during overflow oplog stage Created: 25/Jul/12  Updated: 15/Aug/12  Resolved: 03/Aug/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Critical - P2
Reporter: Ian Whalen (Inactive)
Assignee: Randolph Tan
Resolution: Duplicate
Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File log    
Issue Links:
Duplicate
is duplicated by SERVER-6538 Invalid access at address: 0x7f2e3402... Closed
Related
related to SERVER-6629 zbigMapReduce.js replica set fassert ... Closed
Operating System: ALL
Participants:

 Description   

http://buildlogs.mongodb.org/build/5010010fd2a60f1944000556/test/5010010fd2a60f7a91000b05/
http://buildbot.mongodb.org/builders/Linux%2032-bit%20debug/builds/2017



 Comments   
Comment by Randolph Tan [ 31/Jul/12 ]

Attaching plain text bbot log before it disappears.

Comment by Tad Marshall [ 30/Jul/12 ]

Similar crash in 32-bit Windows:

http://buildbot.mongodb.org/builders/Windows%2032-bit/builds/5288/steps/test_8/logs/stdio
http://buildlogs.mongodb.org/build/5015b161d2a60f46720008b4/test/5015c855d2a60f6faf000c10/

Sun Jul 29 19:33:42 shell: started program mongod.exe --oplogSize 40 --port 31101 --noprealloc --smallfiles --rest --replSet clusteredstale-rs0 --dbpath /data/db/clusteredstale-rs0-1
 m31101| Sun Jul 29 19:33:42 
 m31101| Sun Jul 29 19:33:42 warning: 32-bit servers don't have journaling enabled by default. Please use --journal if you want durability.
 m31101| Sun Jul 29 19:33:42 
 m31101| Sun Jul 29 19:33:42 [initandlisten] MongoDB starting : pid=1264 port=31101 dbpath=/data/db/clusteredstale-rs0-1 32-bit host=ip-0A420969
 m31101| Sun Jul 29 19:33:42 [initandlisten] 
 m31101| Sun Jul 29 19:33:42 [initandlisten] ** NOTE: when using MongoDB 32 bit, you are limited to about 2 gigabytes of data
 m31101| Sun Jul 29 19:33:42 [initandlisten] **       see http://blog.mongodb.org/post/137788967/32-bit-limitations
 m31101| Sun Jul 29 19:33:42 [initandlisten] **       with --journal, the limit is lower
 m31101| Sun Jul 29 19:33:42 [initandlisten] 
 m31101| Sun Jul 29 19:33:42 [initandlisten] db version v2.2.0-rc1-pre-, pdfile version 4.5
 m31101| Sun Jul 29 19:33:42 [initandlisten] git version: a2bb2d0bdc92cbfd02f88a0395ad76422dcdea4a
 m31101| Sun Jul 29 19:33:42 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=0, build=6002, platform=2, service_pack='Service Pack 2') BOOST_LIB_VERSION=1_49
 m31101| Sun Jul 29 19:33:42 [initandlisten] options: { dbpath: "/data/db/clusteredstale-rs0-1", noprealloc: true, oplogSize: 40, port: 31101, replSet: "clusteredstale-rs0", rest: true, smallfiles: true }
 m31101| Sun Jul 29 19:33:42 [initandlisten] waiting for connections on port 31101
// ... snip ...
 m31101| Sun Jul 29 19:35:17 [rsHealthPoll] couldn't connect to ip-0A420969:31102: couldn't connect to server ip-0A420969:31102
ReplSetTest Could not call ismaster on node 2
ReplSetTest Timestamp(1343604916000, 5128)
Sun Jul 29 19:35:17 reconnect 127.0.0.1:31102 failed couldn't connect to server 127.0.0.1:31102
	"ts" : Timestamp(1343604916000, 5128),
	"h" : NumberLong("741572296314734452"),
	"op" : "i",
	"ns" : "_overflow.coll",
	"o" : {
		"_id" : ObjectId("5015c8b41350583c79d40e59"),
		"overflow" : "value"
	}
{
ReplSetTest await TS for connection to ip-0A420969:31101 is 1343604916000:5128 and latest is 1343604916000:5128
ReplSetTest await oplog size for connection to ip-0A420969:31101 is 160006
ReplSetTest await synced=true
}
ReplSetTest overflow count : 160007 prev : 150007
ReplSetTest overflow inserting 10000
2012-07-29 19:35:19 EDT	
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] *** unhandled exception (access violation) at 0x00C555D4, terminating
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] *** access violation was a read from 0x00000013
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] *** unhandled exception (access violation) at 0x00C555D4, terminating
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] writing minidump diagnostic file mongo.dmp
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] *** access violation was a read from 0x00000013
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] failed to open minidump file mongo.dmp : errno:32 The process cannot access the file because it is being used by another process.
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] shutdown: going to close listening sockets...
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] closing listening socket: 476
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] closing listening socket: 480
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] shutdown: going to flush diaglog...
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] shutdown: going to close sockets...
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] shutdown: waiting for fs preallocator...
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] shutdown: closing all files...
 m31100| Sun Jul 29 19:35:18 [conn11] end connection 10.66.9.105:56083 (19 connections now open)
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] closeAllFiles() finished
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] shutdown: removing fs lock...
 m31101| Sun Jul 29 19:35:18 dbexit: really exiting now
// ... snip ...
Sun Jul 29 19:35:43 Error: 9001 socket exception [2] server [127.0.0.1:31100]  src/mongo/shell/collection.js:179
failed to load: C:\10gen\buildslaves\mongo\Windows_32bit\mongo\jstests\replsets\stale_clustered.js

The crash dump was evidently clobbered when two threads tried to write to it at the same time; the resulting file has length 0.
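One way to check the two-writers theory against a saved log is to look for crash messages that appear more than once for the same node. The following is a minimal sketch, not part of the test suite: the line pattern and the helper name repeated_crash_messages are assumptions made for illustration, and the sample lines are copied from the excerpt above. A repeated message is consistent with two threads reporting the same fault, but it does not prove it, since several prefetch workers may share one thread name.

```python
import re
from collections import Counter

# Sample lines copied from the Windows log excerpt in this comment.
LOG = """\
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] *** unhandled exception (access violation) at 0x00C555D4, terminating
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] *** access violation was a read from 0x00000013
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] *** unhandled exception (access violation) at 0x00C555D4, terminating
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] writing minidump diagnostic file mongo.dmp
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] *** access violation was a read from 0x00000013
 m31101| Sun Jul 29 19:35:18 [repl prefetch worker] failed to open minidump file mongo.dmp : errno:32 The process cannot access the file because it is being used by another process.
"""

# Assumed line layout: "<node>| <weekday> <month> <day> <time> [<thread>] <message>"
LINE_RE = re.compile(
    r"^\s*(?P<node>m\d+)\|\s+\w{3} \w{3} +\d+ [\d:]+ "
    r"\[(?P<thread>[^\]]+)\] (?P<msg>.*)$"
)


def repeated_crash_messages(log_text):
    """Return {(node, thread, message): count} for '***' lines seen more than once."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group("msg").startswith("***"):
            key = (match.group("node"), match.group("thread"), match.group("msg"))
            counts[key] += 1
    return {key: n for key, n in counts.items() if n > 1}


repeats = repeated_crash_messages(LOG)
for (node, thread, msg), n in repeats.items():
    print(f"{node} [{thread}] x{n}: {msg}")
```

Run against the full stdio log rather than the embedded sample, the same function would show whether other nodes logged duplicated crash lines as well.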

Comment by Randolph Tan [ 25/Jul/12 ]

It looks like one of the replica set nodes ran out of memory and aborted:

m31101| Wed Jul 25 09:22:55 [repl prefetch worker] Unhandled std::exception in prefetchOp(): std::bad_alloc
m31101| Wed Jul 25 09:22:55 [repl prefetch worker] Unhandled std::exception in prefetchOp(): std::bad_alloc
m31101| Wed Jul 25 09:22:55 [repl prefetch worker] Fatal Assertion 16397

Discussed with Eric, and he thinks that 32-bit machines might barely have enough address space to handle the test.
