conductor: add heartbeat monitor for background workers #1023
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,85 @@ | ||
use std::sync::{ | ||
atomic::{AtomicU64, Ordering}, | ||
Arc, | ||
}; | ||
use std::time::{Duration, SystemTime, UNIX_EPOCH}; | ||
|
||
fn current_timestamp() -> u64 { | ||
SystemTime::now() | ||
.duration_since(UNIX_EPOCH) | ||
.expect("now() is not later than UNIX_EPOCH") | ||
.as_secs() | ||
} | ||
|
||
#[derive(Clone)] | ||
pub struct HeartbeatMonitor { | ||
shared_heartbeat: Arc<AtomicU64>, | ||
update_interval: Duration, | ||
} | ||
|
||
#[derive(Clone)] | ||
pub struct HeartbeatUpdater { | ||
shared_heartbeat: Arc<AtomicU64>, | ||
} | ||
|
||
/// Initializes and returns both a [`HeartbeatMonitor`] and [`HeartbeatUpdater`]. | ||
pub fn start(expected_update_interval: Duration) -> (HeartbeatMonitor, HeartbeatUpdater) { | ||
let heartbeat = Arc::new(AtomicU64::new(current_timestamp())); | ||
|
||
let heartbeat_monitor = HeartbeatMonitor { | ||
shared_heartbeat: heartbeat.clone(), | ||
update_interval: expected_update_interval, | ||
}; | ||
let heartbeat_updater = HeartbeatUpdater { | ||
shared_heartbeat: heartbeat, | ||
}; | ||
|
||
(heartbeat_monitor, heartbeat_updater) | ||
} | ||
|
||
impl HeartbeatMonitor { | ||
/// Checks whether the heartbeat is still active. | ||
/// | ||
/// Returns `true` if the heartbeat has been updated within twice the expected update interval, and `false` otherwise, including when the system clock appears to have gone backwards. | ||
pub fn is_heartbeat_active(&self) -> bool { | ||
let last_update = self.shared_heartbeat.load(Ordering::Relaxed); | ||
let current_time = current_timestamp(); | ||
|
||
if current_time >= last_update { | ||
let elapsed = Duration::from_secs(current_time - last_update); | ||
elapsed < self.update_interval * 2 | ||
what is the
To check if there's been an update within twice the expected timeout duration
I figured that, but why 2x? Maybe that should just be part of the update interval config? Keep in mind there is healthcheck config on the Kubernetes side too. Like how many consecutive failed requests will restart the pod. I
Maybe we could replace |
||
} else { | ||
// System time went backwards or clock drift, consider the heartbeat stale | ||
false | ||
} | ||
} | ||
} | ||
|
||
impl HeartbeatUpdater { | ||
pub fn update_heartbeat(&self) { | ||
self.shared_heartbeat | ||
.store(current_timestamp(), Ordering::Relaxed); | ||
} | ||
} | ||
I think it's better to keep this as something manually updated rather than being updated by yet another background thread. |
||
|
||
#[cfg(test)] | ||
mod tests { | ||
use std::time::Duration; | ||
|
||
use crate::heartbeat_monitor; | ||
|
||
#[tokio::test] | ||
async fn check_heartbeat_monitor() { | ||
let (monitor, updater) = heartbeat_monitor::start(Duration::from_secs(1)); | ||
|
||
// Is alive since there's been an update in the last second | ||
assert!(monitor.is_heartbeat_active()); | ||
|
||
tokio::time::sleep(Duration::from_secs(4)).await; | ||
|
||
assert!(!monitor.is_heartbeat_active()); | ||
updater.update_heartbeat(); | ||
|
||
assert!(monitor.is_heartbeat_active()); | ||
} | ||
} |
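On the "why 2x?" thread: one way to address it might be to let the caller pass the full staleness threshold, so the check has no hard-coded multiplier and can be tuned alongside the Kubernetes probe settings. The sketch below is only an illustration of that idea, not code from this PR. `ThresholdMonitor`, `stale_after`, and `is_active_at` are hypothetical names, and the current time is passed in as a parameter purely so the check can be exercised without sleeping.

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};
use std::time::Duration;

/// Hypothetical variant of the monitor: the caller chooses the full
/// staleness threshold instead of the check multiplying an interval by 2.
#[derive(Clone)]
pub struct ThresholdMonitor {
    shared_heartbeat: Arc<AtomicU64>,
    stale_after: Duration,
}

impl ThresholdMonitor {
    pub fn new(shared_heartbeat: Arc<AtomicU64>, stale_after: Duration) -> Self {
        Self {
            shared_heartbeat,
            stale_after,
        }
    }

    /// Returns true if the last heartbeat is less than `stale_after` old.
    /// `now_secs` is seconds since the Unix epoch, supplied by the caller
    /// (an assumption of this sketch, made for testability).
    pub fn is_active_at(&self, now_secs: u64) -> bool {
        let last_update = self.shared_heartbeat.load(Ordering::Relaxed);
        match now_secs.checked_sub(last_update) {
            Some(elapsed) => Duration::from_secs(elapsed) < self.stale_after,
            // The clock went backwards: treat the heartbeat as stale,
            // matching the behaviour in the diff above.
            None => false,
        }
    }
}
```

With this shape, a 1-second update interval and the current 2x behaviour would correspond to `stale_after = Duration::from_secs(2)`, and the 4-second sleep in the test could be replaced by passing a later `now_secs`.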
Went with a lock-free approach rather than something like `Arc<RwLock<Instant>>`, as the issue with workers getting stuck might stem from lock contention, so adding more lock contention probably wouldn't help.
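For comparison, the lock-free approach could probably also be kept while moving to a monotonic clock, which would make the "system time went backwards" branch unnecessary. A minimal sketch, assuming it is acceptable to store milliseconds elapsed since a shared `Instant` origin in the atomic; `MonotonicHeartbeat`, `beat`, and `since_last_beat` are hypothetical names and not part of this PR.

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};
use std::time::{Duration, Instant};

/// Hypothetical lock-free heartbeat on a monotonic clock.
/// Clones share the same atomic and the same origin.
#[derive(Clone)]
pub struct MonotonicHeartbeat {
    origin: Instant,
    last_beat_millis: Arc<AtomicU64>,
}

impl MonotonicHeartbeat {
    pub fn new() -> Self {
        Self {
            origin: Instant::now(),
            // 0 means "last beat happened at the origin".
            last_beat_millis: Arc::new(AtomicU64::new(0)),
        }
    }

    fn millis_since_origin(&self) -> u64 {
        self.origin.elapsed().as_millis() as u64
    }

    /// Called manually by the worker, as with `update_heartbeat` in the diff.
    pub fn beat(&self) {
        self.last_beat_millis
            .store(self.millis_since_origin(), Ordering::Relaxed);
    }

    /// Time since the most recent beat; never negative because
    /// `Instant` is monotonic and the subtraction saturates.
    pub fn since_last_beat(&self) -> Duration {
        let last = self.last_beat_millis.load(Ordering::Relaxed);
        Duration::from_millis(self.millis_since_origin().saturating_sub(last))
    }
}
```

The monitor side would then compare `since_last_beat()` against whatever threshold is configured. Whether the extra `Instant` field is worth it over `SystemTime` is a judgment call for the PR.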