[SERVER-16616] "chunks out of order" error during md5sum command Created: 20/Dec/14  Updated: 22/Dec/14  Resolved: 22/Dec/14

Status: Closed
Project: Core Server
Component/s: Concurrency
Affects Version/s: 2.8.0-rc3
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Bernie Hackett Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Java Source File SERVER16616.java     File mongod.log    
Issue Links:
Related
is related to SERVER-16437 Simple index scans should work like C... Closed
Operating System: ALL
Participants:

 Description   

I just started seeing this on Jenkins, beginning with 2.8.0-rc3. The test that triggers it writes 100 single-chunk files to GridFS using 10 threads in Python. A few of the threads end up throwing this error. All writes are done with w=1 write concern.

The test code is here:
https://github.com/mongodb/mongo-python-driver/blob/v2.8/test/test_gridfs.py#L179-L194

An example of the failure in Jenkins can be seen here:
http://jenkins.bci.10gen.cc:8080/view/Python/job/mongo-python-driver-v2.8/75/extensions=with-extensions,label=linux64,mongodb_configuration=single_server,mongodb_server=master-nightly-release,python_language_version=3.2/console

I've attached the log for this failure. I can't reproduce the problem locally, only in Jenkins, where the failure is fairly consistent. Let me know how to help debug.
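
For context on where the error message comes from: as far as I understand it, the server-side filemd5 command walks a file's chunks in ascending n order and raises "chunks out of order" when a chunk's n is not the next expected value. The sketch below is a simplified pure-Python model of that check, not the server's C++ code. The function name, the exception class, and the plain list standing in for the index scan are all hypothetical.

```python
import hashlib


class ChunksOutOfOrder(Exception):
    """Hypothetical stand-in for the server's "chunks out of order" error."""


def file_md5(scan):
    """Hash chunk data in the order the scan yields it.

    `scan` is any iterable of dicts with keys "n" and "data"; it plays the
    role of the index scan over one file's chunks (an assumption of this
    sketch, not the server's actual interface).
    """
    digest = hashlib.md5()
    expected_n = 0
    for chunk in scan:
        # The ordering check: each chunk must carry the next sequence number.
        if chunk["n"] != expected_n:
            raise ChunksOutOfOrder(
                "expected n=%d, got n=%d" % (expected_n, chunk["n"])
            )
        digest.update(chunk["data"])
        expected_n += 1
    return digest.hexdigest()
```

Under this model, a scan that is positioned on the wrong entry fails the check even for a single-chunk file, which would fit the symptom described above.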



 Comments   
Comment by J Rassi [ 22/Dec/14 ]

Thanks for the report and repro help. I would guess from the attached test failure that f984b532 introduced a bug in IndexScan::saveState(), IndexScan::restoreState(), or IndexScan::invalidate(). Per discussion with schwerin, I'm going to revert the commit and have Dave or Mathias (who co-authored the commit) debug this issue next week when they're back from vacation, as I don't have any spare cycles to look at this until then.

Re-opening SERVER-16437, resolving this issue as a dup of that ticket.

Comment by Daniel Pasette (Inactive) [ 21/Dec/14 ]

Found it. Thanks guys.

f984b532331e46298d52d4c786cb359fa208f3d9 is the first bad commit
commit f984b532331e46298d52d4c786cb359fa208f3d9
Author: Jason Rassi <rassi@10gen.com>
Date:   Tue Dec 16 15:12:57 2014 -0500
 
    SERVER-16437 IndexScan optimize end checker for single interval scans
 
:040000 040000 9602d070ab5a8893b552610a24932f8a7c10d12f 0d3cc6b588024175370564e2d0c86c691fb1a478 M	src
bisect run success
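
The bisect script itself is not attached to this ticket. As a hypothetical illustration only, a predicate for `git bisect run` with an intermittent failure like this one might map the outcome of a repro run to the exit codes git expects: 0 for a good commit, 125 for a commit that cannot be tested, and any other code from 1 to 127 for a bad one. The function name and its parameters are assumptions.

```python
# Exit codes understood by `git bisect run`.
GOOD = 0    # commit does not show the problem
BAD = 1     # any code in 1..127 except 125 marks the commit bad
SKIP = 125  # commit cannot be tested (for example, the build failed)


def bisect_exit_code(build_ok, failed_calls):
    """Translate one repro run into a `git bisect run` exit code.

    build_ok: whether the server built and started at this commit.
    failed_calls: number of failed calls observed during the repro run.
    Both inputs are hypothetical; a real script would gather them itself.
    """
    if not build_ok:
        return SKIP
    return BAD if failed_calls > 0 else GOOD
```

Because the failure is intermittent, a predicate like this only gives a trustworthy "good" verdict if each run makes enough calls for the error to show up reliably on a bad commit.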

Comment by Daniel Pasette (Inactive) [ 21/Dec/14 ]

Definite regression in rc3. I'm bisecting with a slightly modified version of Jeff's script.

Comment by Jeffrey Yemin [ 21/Dec/14 ]

I can reproduce this locally with the attached Java program. I get about 100 failed calls to filemd5 for every 25000 calls.
