[SERVER-5680] repl13.js failing on Windows 64-bit Created: 21/Apr/12  Updated: 11/Jul/16  Resolved: 01/May/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 2.0.7, 2.1.1

Type: Bug Priority: Major - P3
Reporter: Ian Whalen (Inactive) Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: Windows, buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

http://buildbot.mongodb.org/builders/Windows%2064-bit/builds/4658/steps/test_10/logs/stdio


Issue Links:
Duplicate
is duplicated by SERVER-5623 access violation in rollback4.js Closed
Related
related to SERVER-2942 MapViewOfFileEx failed during large i... Closed
Backwards Compatibility: Fully Compatible
Operating System: Windows
Participants:

 Description   

http://buildbot.mongodb.org/builders/Windows%2064-bit/builds/4658/steps/test_10/logs/stdio



 Comments   
Comment by auto [ 27/Jul/12 ]

Author: Eric Milkie <milkie@10gen.com> (2012-04-30T10:42:02-07:00)

Message: SERVER-5680 fix Windows accvios due to incorrect recursive locking of MongoFiles

Conflicts:
db/dur.cpp
Branch: v2.0
https://github.com/mongodb/mongo/commit/c88805a6f6447458290719703887fb736692359e

Comment by Eric Milkie [ 30/Apr/12 ]

Note that this should fix the test failures from the slow weekly builders as well. I looked at their dump files and rollback4.js is crashing in the same place (rec->touch()).

Comment by auto [ 30/Apr/12 ]

Author: Eric Milkie (milkie) <milkie@10gen.com>

Message: SERVER-5680 fix Windows accvios due to incorrect recursive locking of MongoFiles
Branch: master
https://github.com/mongodb/mongo/commit/8b099a96ac9b5a1f73954204806ef8e0c77d481a

Comment by Eric Milkie [ 30/Apr/12 ]

Running in debug mode, repl13.js hits an assert on startup:

 m31000| Mon Apr 30 12:03:39 [journal]   Assertion failure s <= 0 e:\m\mongo\src\mongo\db\../util/concurrency/rwlock.h 175
 m31000| Mon Apr 30 12:03:39 [journal] *** unhandled exception 0x80000003 at 0x000007FEFD6E3172, terminating
 m31000| Mon Apr 30 12:03:39 [journal] writing minidump dignostic file 0000000140C38BD0

Comment by Eric Milkie [ 30/Apr/12 ]

Unfortunately, this test failed again with the same access violation, even after Andy's change was introduced.

It's odd that this one particular test keeps hitting an accvio on the same line of code. I'm going to try to flush it out by running it in the debugger.

Comment by Tad Marshall [ 26/Apr/12 ]

This should be fixed by Andy's addition of LockMongoFilesExclusive to remapPrivateView in Windows. Memory accesses to the private view being remapped would generate access violations if they happened during the window of time between the unmap and remap operations.
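
The race described above can be illustrated with a small standalone sketch. It is not the server code: View, filesLock, touch, remap and stress are hypothetical names, std::shared_mutex stands in for the "mongo files" lock, and a boolean flag stands in for the real unmap and remap calls. It also assumes that readers take the lock in shared mode, which this ticket does not state explicitly.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a private memory-mapped view.
// 'mapped' is false during the gap between the unmap and remap steps.
struct View {
    std::vector<char> data;
    bool mapped = true;
};

// Hypothetical stand-in for the "mongo files" lock.
std::shared_mutex filesLock;

// Reader side: holds the lock in shared mode (an assumption of this sketch),
// so it cannot run while a remap holds the lock exclusively.
// Returns false if the view was observed unmapped; on Windows that
// observation would be the access violation.
bool touch(const View& v, std::size_t offset) {
    std::shared_lock<std::shared_mutex> lk(filesLock);
    if (!v.mapped) return false;
    volatile char c = v.data[offset];
    (void)c;
    return true;
}

// Writer side: unmap and remap are two separate steps, so the whole
// sequence runs under the exclusive lock.
void remap(View& v) {
    std::unique_lock<std::shared_mutex> lk(filesLock);
    v.mapped = false;  // unmap step
    v.mapped = true;   // remap step
}

// Runs reader threads against repeated remaps and returns how many
// times a reader saw the view unmapped.
int stress(int readers, int iterations) {
    View v;
    v.data.assign(4096, 0);
    std::atomic<int> failures{0};
    std::vector<std::thread> threads;
    for (int r = 0; r < readers; ++r)
        threads.emplace_back([&] {
            for (int i = 0; i < iterations; ++i)
                if (!touch(v, i % v.data.size())) ++failures;
        });
    for (int i = 0; i < iterations; ++i) remap(v);
    for (auto& t : threads) t.join();
    return failures.load();
}
```

With the exclusive lock in remap, stress(4, 1000) should report zero failures. Dropping that lock would let a reader land between the two steps, which is the window described above.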

Comment by auto [ 25/Apr/12 ]

Author: Andy Schwerin (andy10gen) <schwerin@10gen.com>

Message: LockMongoFilesExclusive in remapViewOfFiles on Windows.

Since remapViewOfFiles isn't atomic on Windows, it must exclusively acquire the
"mongo files" lock. Otherwise, "touch" operations in other threads might try
to access memory during the window when it is not mapped.

See SERVER-5680, SERVER-5663.
Branch: master
https://github.com/mongodb/mongo/commit/d8462a26b9089c5e58d1e340dcada83719ea4e47

Comment by Tad Marshall [ 23/Apr/12 ]

I tried to find the faulting code (hard because ASLR gives you 254 possible locations and we haven't added any code to allow us to adjust for this) and found one solid candidate: NamespaceDetailsTransient::notifyOfWriteOp() at line 537 in db/namespace_details.h.

/* you must notify the cache if you are doing writes, as query plan utility will change */
void notifyOfWriteOp() {
    if ( _qcCache.empty() )
        return;
    if ( ++_qcWriteCount >= 100 ) // this is the line matching the access violation
        clearQueryCache();
}

I tried running the test on my home machine and was not able to duplicate the crash. It is possible that this is a compiler bug, broken in the original Visual Studio 2010 run by the Buildbot but fixed in the Service Pack 1 version running on my machine, but that is a guess and probably wishful thinking.
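
Adjusting a faulting address for ASLR comes down to one subtraction and one addition, provided the load address of the module is known. The sketch below shows only that arithmetic. The function name rebase is hypothetical, and so is any actual base passed to it: the log above does not record where the module was loaded.

```cpp
#include <cassert>
#include <cstdint>

// Translate an address seen in a crashed process to the address the same
// instruction has when the image sits at its preferred base.
//   faultAddr     - address reported in the crash log
//   actualBase    - where the module was really loaded (not in the log;
//                   any value used here is an assumption)
//   preferredBase - image base recorded in the PE header
constexpr std::uint64_t rebase(std::uint64_t faultAddr,
                               std::uint64_t actualBase,
                               std::uint64_t preferredBase) {
    return faultAddr - actualBase + preferredBase;
}
```

For example, if the module had been loaded at the made-up base 0x13FC00000 and its preferred base were 0x140000000, the reported address 0x13FD52EFE would correspond to 0x140152EFE in a map file or disassembly.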

Comment by Tad Marshall [ 21/Apr/12 ]

This seems relevant:

m31000| Sat Apr 21 02:23:13 [conn2] *** unhandled exception (access violation) at 0x000000013FD52EFE, terminating
m31000| Sat Apr 21 02:23:13 [conn2] *** access violation was a read from 0x0000000009B80018
