[SERVER-1768] getLastError(2) hangs forever when "replSet error RS102 too stale to catch up" Created: 09/Sep/10  Updated: 19/Sep/15  Resolved: 21/Apr/11

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.7.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tony Hannan Assignee: Kristina Chodorow (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

db version v1.7.1-pre-, pdfile version 4.5
Thu Sep 9 15:50:22 git version: 10f750fa617e7499a89e9a140d89ddb4a7427ad6


Operating System: ALL
Participants:

 Description   

1. Set up replica set of 3 servers.
2. Insert 2000 docs, each 1K size.
3. Call getLastError(2), ie. wait for last write to reach at least two servers

Problem: call never returns.

However, works when only inserting 1000 docs.

Servers log during problem:
m30002| Thu Sep 9 17:22:45 [rs_sync] replSet error RS102 too stale to catch up, at least from primary: 127.0.0.1:30000
m30002| Thu Sep 9 17:22:45 [rs_sync] replSet our last optime : Sep 9 17:22:43 4c895023:6
m30002| Thu Sep 9 17:22:45 [rs_sync] replSet oldest at 127.0.0.1:30000 : Sep 9 17:22:43 4c895023:c2
m30002| Thu Sep 9 17:22:45 [rs_sync] replSet See http://www.mongodb.org/display/DOCS/Resyncing+a+Very+Stale+Replica+Set+Member
m30002| Thu Sep 9 17:22:45 [rs_sync] replSet error RS102 too stale to catch up
m30000| Thu Sep 9 17:22:46 [initandlisten] connection accepted from 127.0.0.1:61974 #15
m30001| Thu Sep 9 17:22:46 [rs_sync] replSet error RS102 too stale to catch up, at least from primary: 127.0.0.1:30000
m30001| Thu Sep 9 17:22:46 [rs_sync] replSet our last optime : Sep 9 17:22:43 4c895023:6
m30001| Thu Sep 9 17:22:46 [rs_sync] replSet oldest at 127.0.0.1:30000 : Sep 9 17:22:43 4c895023:c2
m30001| Thu Sep 9 17:22:46 [rs_sync] replSet See http://www.mongodb.org/display/DOCS/Resyncing+a+Very+Stale+Replica+Set+Member
m30001| Thu Sep 9 17:22:46 [rs_sync] replSet error RS102 too stale to catch up
m30000| Thu Sep 9 17:24:45 [conn14] end connection 127.0.0.1:61973
m30000| Thu Sep 9 17:24:46 [conn15] end connection 127.0.0.1:61974
m30000| Thu Sep 9 17:24:46 [initandlisten] connection accepted from 127.0.0.1:62005 #16
m30002| Thu Sep 9 17:24:46 [rs_sync] replSet error RS102 too stale to catch up, at least from primary: 127.0.0.1:30000
m30002| Thu Sep 9 17:24:46 [rs_sync] replSet our last optime : Sep 9 17:22:43 4c895023:6
m30002| Thu Sep 9 17:24:46 [rs_sync] replSet oldest at 127.0.0.1:30000 : Sep 9 17:22:43 4c895023:c2
m30002| Thu Sep 9 17:24:46 [rs_sync] replSet See http://www.mongodb.org/display/DOCS/Resyncing+a+Very+Stale+Replica+Set+Member
m30002| Thu Sep 9 17:24:46 [rs_sync] replSet error RS102 too stale to catch up
m30000| Thu Sep 9 17:24:47 [initandlisten] connection accepted from 127.0.0.1:62006 #17
m30001| Thu Sep 9 17:24:47 [rs_sync] replSet error RS102 too stale to catch up, at least from primary: 127.0.0.1:30000
m30001| Thu Sep 9 17:24:47 [rs_sync] replSet our last optime : Sep 9 17:22:43 4c895023:6
m30001| Thu Sep 9 17:24:47 [rs_sync] replSet oldest at 127.0.0.1:30000 : Sep 9 17:22:43 4c895023:c2
m30001| Thu Sep 9 17:24:47 [rs_sync] replSet See http://www.mongodb.org/display/DOCS/Resyncing+a+Very+Stale+Replica+Set+Member
m30001| Thu Sep 9 17:24:47 [rs_sync] replSet error RS102 too stale to catch up



 Comments   
Comment by Githook User [ 02/Sep/15 ]

Author:

{u'username': u'benety', u'name': u'Benety Goh', u'email': u'benety@mongodb.com'}

Message: SERVER-18384 removed inactive test. SERVER-1768 was closed as works as designed
Branch: master
https://github.com/mongodb/mongo/commit/5cf0aca99109ee132fd2f5da011eb6583b028d84

Comment by Eliot Horowitz (Inactive) [ 21/Apr/11 ]

There is no default timeout, so it'll wait forever correctly.

Comment by Tony Hannan [ 10/Sep/10 ]

Not what I observed. Just repeated saying "replSet error RS102 too stale to catch up"

Comment by Eliot Horowitz (Inactive) [ 10/Sep/10 ]

Doesn't the secondary then start a resync?

Comment by Tony Hannan [ 10/Sep/10 ]

According to Dwight, this happens when the oplog rolls over before the secondary can read it.
In this case, getLastError should return an error instead of hanging forever.

Comment by Tony Hannan [ 10/Sep/10 ]

I just sent a pull request from TonyGen containing my test case.
Other test cases in jstest/replsets that use getLastError (replset2.js, replset5.js, and rollback.js) work for me, but this one does not.

Comment by Dwight Merriman [ 10/Sep/10 ]

see replset2 or replset5.js
try:
mongo --nodb jstests/replsets/replset2.js

Comment by Dwight Merriman [ 10/Sep/10 ]

tony, please take one of hte jstests/replsets/ and modify it s.t. it reproduces this. then i can reproduce and we will also have a regression test forever.

i would take one of the existing getlasterror tests there and modify iut.

also a good place to start is to see if those tests pass for you on your box.

once that is done, assign it back to me.
tx

Comment by Tony Hannan [ 10/Sep/10 ]

Still hung after 45 minutes.

Comment by Eliot Horowitz (Inactive) [ 09/Sep/10 ]

This should recover...
How long did you let if go?

Generated at Thu Feb 08 02:57:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.