[JAVA-628] Read after WriteConcern.SAFE does not guarantee to be able to get data? Created: 22/Aug/12  Updated: 16/Nov/21  Resolved: 30/Aug/12

Status: Closed
Project: Java Driver
Component/s: API
Affects Version/s: 2.7.3
Fix Version/s: None

Type: Task Priority: Minor - P4
Reporter: Solomon Wu Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MongoDB 2.0.3
Linux


Attachments: Text File ok.txt     Text File problem.txt     Text File update.txt     Text File update_1.txt    

 Description   

3 nodes are involved

  • 1 mongo node - no replica set
  • 2 of our application node. let's call the first node appserver node, and the second node worker node.
  • using mongo Java driver

steps
1. appserver node writes to mongo with WriteConcern.SAFE.
2. appserver sends a message to worker node
3. upon receiving the message, worker node tries to read the data from mongo. the data that appserver just wrote is not there.

from our understanding of the document, Writeconcern.safe is equivalent of calling getLastError(). the subsequent read should be able to read the data just written. if we write with writeconern.normal, the subsequent read is not guaranteed to be able to read the data just written. but if we write with writeconern.safe, we should be able to read the data.



 Comments   
Comment by Jeffrey Yemin [ 30/Aug/12 ]

Please open a new case if you find a different issue.

Comment by Tihomir Slavkov [ 30/Aug/12 ]

Thanks Jeff,
Following your comment we found an application error in our code, which is not related to MongoDB. We will address it and then come back. At the moment you could put the case on-hold.

Comment by Jeffrey Yemin [ 30/Aug/12 ]

I didn't realize you were using count. I believe that ntoreturn is always 1 for count, regardless of the return value, because it's a command. But I think this is enough to diagnose for now.

At least in these logs, the query for the number 39710019 is for a different field than the update. The query looks like:

query: { messageId: 39710019 } 

while the update looks like:

{..., cpk: 39710019, ...}

So I don't see how this works at all. The only place in the logs I see an update to a messageId field is here:

Thu Aug 30 09:23:40 [conn20082] update quad.messageThread query: { _id: ObjectId('503f3100e4b0381a2afe65bd') } update: { $set: { messageId: 39710011, imid: "quad-1d152265548240bea269d92d5bf75bc4-39710011-1346318592859@dev.dodo.local", threadIndex: "Ac2GkRQQgIDVCw54RXOvob5rerfFRg==", subject: "sample title XQ6IATAG8P51ZOF3ZH04C4T7REYW9KA2", resourceId: 0 } } fastmodinsert:1 0ms

Comment by Tihomir Slavkov [ 30/Aug/12 ]

I attached new file, update_1.txt. The id of the object, which could not be found is 39710019. I'm not sure do i read MongoDB log file properly, but it seems to me that it reports the value was found. However, reading the value with code below fails sometimes:

DBCursor mtDbCursor = <DBCollection>.find(mtDBDbObject);

if (mtDbCursor != null && mtDbCursor.count() > 0)

{ // do something }

else {
throw new NoSuchMessageThreadException("Message Thread is not found " + messageId);
}

Comment by Jeffrey Yemin [ 28/Aug/12 ]

It's true there is a difference between two separate physical nodes and two threads, but I don't see how it's a significant difference so long as the two threads are using two different sockets.

Comment by Jeffrey Yemin [ 28/Aug/12 ]

No problem on the delay. What's the id of the document that the fineOne failed to find?

Comment by Tihomir Slavkov [ 28/Aug/12 ]

Sorry for the delay Jeff, i just attached a log snippet from execution, where insert was replaced with update as per instructions. Initially all went fine and then the problem appeared again. I waited ~10 seconds and then stopped collecting log data.

Looking at the code snippet which you posted on 23-th, there is one more difference - in our case we have 2 separate physical nodes, one is performing insert and the second is performing read. Perhaps we might need to look in this direction?

Comment by Tihomir Slavkov [ 28/Aug/12 ]

This call
<DBCollection>.insert(mtMetaDataDBObject, WriteConcern.SAFE);
was replaced with this one
<DBCollection>.update(new BasicDBObject("_id", new ObjectId()), new BasicDBObject("$set", mtMetaDataDBObject), true, false, WriteConcern.SAFE);

Comment by Jeffrey Yemin [ 23/Aug/12 ]

jeff.yemin@10gen.com

Comment by Solomon Wu [ 23/Aug/12 ]

Hi Jeff,
Do you have a private email address that we can send it to?
Thanks,

Comment by Jeffrey Yemin [ 23/Aug/12 ]

The logs don't show enough

I wrote a test to demonstrate what you're doing, but all in one VM:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;
import com.mongodb.WriteConcern;
 
import java.net.UnknownHostException;
import java.util.concurrent.ArrayBlockingQueue;
 
public class JAVA628 {
    public static void main(String[] args) throws UnknownHostException, InterruptedException {
        Mongo m = new Mongo();
 
        final DBCollection c = m.getDB("test").getCollection("JAVA626");
        final ArrayBlockingQueue<Integer> queue = new ArrayBlockingQueue<Integer>(10000000);
 
        c.drop();
 
        Thread t = new Thread() {
            public void run() {
                while (true) {
                    try {
                        int i = queue.take();
                        DBObject o = c.findOne(new BasicDBObject("_id", i));
                        while (o == null) {
                            System.out.println("Couldn't find " + i );
                            o = c.findOne(new BasicDBObject("_id", i));
                        }
                    } catch (InterruptedException e) {
                        //
                    }
                }
            }
 
        };
        t.setDaemon(true);
        t.start();
 
        int i = 0;
        while (i < Integer.MAX_VALUE) {
            c.update(new BasicDBObject("_id", i), new BasicDBObject("$inc", new BasicDBObject("i", 1)), true, false,
                    WriteConcern.SAFE);
            queue.add(i++);
            Thread.sleep(1);
        }
    }
}

Only difference is that I'm using update instead of insert because updates are logged more fully than inserts on the server. Using WriteConcern.SAFE, I haven't seen the reading thread ever fail to find the document that it takes from the blocking queue.

There's nothing fundamentally different about this test than your situation, so far as I can see. Due to the way the connection pool works, the two threads are guaranteed to use different sockets.

Can you paste the code you're using for the insert?

Comment by Tihomir Slavkov [ 23/Aug/12 ]

Jeff, please find attached 2 snippets - ok.txt is from when there was no error in application and problem.txt is from where there was an "not found" error in our application. The logs are rolling pretty fast, so i hope i was able to capture what is needed. The db collection in problem is "messageThread"

Comment by Jeffrey Yemin [ 23/Aug/12 ]

I think -vv will do it, but if not, keep turning it up by oner until you see them.

Comment by Solomon Wu [ 23/Aug/12 ]

Hi Jeff,
We do not see log showing insert or find. Which verbose level would you
like us to turn on?

Comment by Jeffrey Yemin [ 23/Aug/12 ]

Can you confirm in the mongod server logs that the query from the worker node is logged after the getlasterror command from the appserver node? The only way I can see this happening is if the getlasterror does not complete first or is not happening at all, setting up this race condition.

Comment by Tihomir Slavkov [ 23/Aug/12 ]

Hi Jeff,

Yes, we see the issue on various testbeds, the only difference being the frequency. For example, on our automation tests environment it is reproducible >90% of the times, and on production environment it happens <0.1% of the times.

The code which reads from MongoDB from the second node looks like:

DBObject mtDBDbObject = new BasicDBObject();
mtDBDbObject.put(key, value);

DBCursor mtDbCursor = <DBCollection>.find(mtDBDbObject);

while (mtDbCursor.hasNext())

{ DBObject object = mtDbCursor.next(); // do something with }

The same code when invoked on first node right after the write code execution, works correctly 100% of the times.

I didn't have time today to test scenario with read from different thread on the same node.

Generally, our observation is that there is a small period of time T in order for the second node to be able to retrieve correctly what the first node was written. If the read is attempted before that period T, then it fails; if it is attempted later than that period T, then it succeeds.

Comment by Jeffrey Yemin [ 23/Aug/12 ]

Are you seeing the problem both in the testbed and in production?

Comment by Solomon Wu [ 23/Aug/12 ]

Hi Tihomir,

Do you want to comment on some of these questions?

Hi Jeff,
On the writing side, it is doing something like:

DBCollection messageThread = // get collection.
messageThread.insert(mtMetaDataDBObject, WriteConcern.SAFE);

Where mtMetaDataDBObject is a BasicDBObject.

We can probably provide the code we use for inserting and reading the
messageThread.

Two more things I may forget to mention:

  • the situation occurs occasionally, not always.
  • on production system, our mongo is in replica set. Although in this
    testbed setup, it is a single node mongo.
Comment by Jeffrey Yemin [ 22/Aug/12 ]

Your understanding is correct.

What does the insert/update from appserver node look like and what does the query from worker node look like? Does the exact same query work from appserver node? What if the query is done on a different thread than the update?

Generated at Thu Feb 08 08:52:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.