[SERVER-36870] Replication fails if server date exceeds January 19th, 2038 Created: 24/Aug/18  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Alex Bevilacqua Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Replication
Operating System: ALL
Steps To Reproduce:
  • set up an Ubuntu 18.04 server VM
  • set up a 3-node replica set using mlaunch (mlaunch --replicaset --nodes 3)
  • access the primary via the mongo shell and enable verbose logging
    • db.setLogLevel(1)
  • start inserting random documents into a collection (a consolidated mongo shell sketch follows this list):
    • for (var x = 1; x < 100000; x++) { db.test.insert( { a: x } ) }

  • in another terminal, set the system date beyond January 19th, 2038:
    • sudo date -s "15 JAN 2039 18:00:00"
  • crash details should be present in one of the replica set member mongod.log files
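
For reference, the in-shell portion of the steps above can be driven as a single script. This is a minimal sketch, assuming the mlaunch defaults (primary reachable on localhost:27017) and the legacy mongo shell in use at the time of this report:

  // run against the PRIMARY in the mongo shell
  db.setLogLevel(1);                                        // verbose logging, as above

  // keep the oplog advancing while the system clock is moved past 2038
  // in another terminal
  for (var x = 1; x < 100000; x++) {
      db.test.insert({ a: x });
  }

  // inspect the newest oplog entry; its "ts" field is the optime the
  // secondaries are tailing and the value that overflows past 2038
  db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(1);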

Description

If a replica set is running on a node whose date exceeds the bound set by the Year 2038 problem (see https://en.wikipedia.org/wiki/Year_2038_problem), replication will fail due to a timestamp overflow:

2039-01-15T18:00:00.037-0500 I REPL [replication-1] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 13017 -- target:localhost:27018 db:local expDate:2018-08-24T17:06:58.217-0400 cmd:{ getMore: 16995111421, collection: "oplog.rs", maxTimeMS: 5000, term: 4, lastKnownCommittedOpTime: { ts: ... } }. Last fetched optime (with hash): { ts: Timestamp 1535144753000|331, t: 4 } [-5142302661271950628]. Restarts remaining: 3
2039-01-15T18:00:00.049-0500 I REPL [ReplicationExecutor] transition to PRIMARY
2039-01-15T18:00:00.049-0500 I REPL [ReplicationExecutor] Entering primary catch-up mode.
2039-01-15T18:00:00.050-0500 I REPL [ReplicationExecutor] Member localhost:27017 is now in state SECONDARY
2039-01-15T18:00:00.050-0500 I REPL [ReplicationExecutor] Heartbeats updated catchup target optime to { ts: Timestamp 1535144753000|351, t: 4 }
2039-01-15T18:00:00.050-0500 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to localhost:27018: InvalidSyncSource: Sync source was cleared. Was localhost:27018
2039-01-15T18:00:00.050-0500 I REPL [rsSync] Caught up to the latest known optime successfully after becoming primary.
2039-01-15T18:00:00.050-0500 I REPL [rsSync] Exited primary catch-up mode.
2039-01-15T18:00:00.059-0500 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
2039-01-15T18:00:01.051-0500 I - [rsSync] Fatal assertion 28560 BadValue: ts secs too high at src/mongo/db/repl/oplog.cpp 168
2039-01-15T18:00:01.051-0500 I - [rsSync]

This crashes the mongod process. Note that this was observed during a test of Year 2038 compatibility and has not been identified as a critical issue in any actual production deployments.
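
The "ts secs too high" assertion above is consistent with the seconds component of the new oplog timestamp no longer fitting in a signed 32-bit value. As a rough illustration of that boundary only (a sketch of the arithmetic, not of the actual check in oplog.cpp), in the mongo shell:

  // largest value a signed 32-bit seconds-since-epoch counter can hold
  var maxInt32Secs = Math.pow(2, 31) - 1;                   // 2147483647
  new Date(maxInt32Secs * 1000);                            // 2038-01-19T03:14:07Z

  // the date used in the reproduction is well past that boundary
  var reproSecs = Math.floor(new Date("2039-01-15T18:00:00Z").getTime() / 1000);
  reproSecs > maxInt32Secs;                                 // true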



Comments
Comment by Andy Schwerin [ 18/Mar/22 ]

I think we're still ok. Durable history retention periods are unlikely to be more than hours or days, so they roll over pretty quickly.

Comment by Mathias Stearn [ 18/Mar/22 ]

Because we don't store timestamps in the data rows, but only in the oplog, I think a migration today would have pretty manageable upgrade behavior. If we start storing history in the tables, it will become more complicated.

Is this still true now that we have durable history, or are we now in the "more complicated" world?

Comment by Ramon Fernandez Marina [ 28/Aug/18 ]

Sending this to the Replication team for consideration; some alternatives have already been discussed as part of other projects.
