Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement retry in readonly open #2122

Open
Ngoguey42 opened this issue Oct 19, 2022 · 10 comments
Open

Implement retry in readonly open #2122

Ngoguey42 opened this issue Oct 19, 2022 · 10 comments

Comments

@Ngoguey42
Copy link
Contributor

Since #2119 we are able to detect if a control file is being read with a data race. These data races can be solved by sleeping and retrying. We are however not capable of distinguishing between data races and corrupted control files, this implies that the number of retry should be bounded to fallback on a Corrupted_control_file error.

Also, since the GC finalisation deletes old files, a data race is possible if a RO instance stalls between the read the control file and the opening of the files. The data race would materialise as a No_such_file_or_directory error.

Both these data races could be fixed at once by implementing a retry in the file manager.

@Athishpranav2003
Copy link

Athishpranav2003 commented Aug 9, 2024

Correct me if i am wrong

let open_ro config =
let open Result_syntax in
let indexing_strategy = Conf.indexing_strategy config in
let root = Irmin_pack.Conf.root config in
let lower_root = Irmin_pack.Conf.lower_root config in
let use_fsync = Irmin_pack.Conf.use_fsync config in
(* 1. Open the control file *)
let* control =
let path = Layout.control ~root in
Control.open_ ~readonly:true ~path ~tmp_path:None
(* If no control file, then check whether the store is in v1 or v2. *)
|> Result.map_error (function
| `No_such_file_or_directory _ -> (
let pack = Irmin_pack.Layout.V1_and_v2.pack ~root in
match Io.classify_path pack with
| `File -> `Migration_needed
| `No_such_file_or_directory -> `No_such_file_or_directory pack
| `Directory | `Other -> `Invalid_layout)
This function is used to open files in read_only format. Since Dataraces is additional cause of error(which is temporary) we should have retry mechanism in this procedure in case the file exist but not in ready state.

@Athishpranav2003
Copy link

@kayceesrk I am not sure whom to ping for this :(. Pinging you for help

@kayceesrk
Copy link
Contributor

@art-w @samoht are be the best to ping. Note that most folks are likely to be away in August (holiday month in France).

@Athishpranav2003
Copy link

Athishpranav2003 commented Aug 9, 2024

I see
Fine
Anyways no harm in waiting until they get back :)

@art-w
Copy link
Contributor

art-w commented Aug 30, 2024

@Athishpranav2003 > Indeed you found the right function to fix! Since the issue was written, we can now distinguish between a Corrupted_control_file error (= nothing can be done) and a data-race. So if the code that you highlighted succeed (step 1: open a control file), a data-race can happen on the lines that follow where we try to open all the other database files. If one of them turns out to be missing (because the main RW process deleted them), then we need to retry by re-reading the control file (which will have been updated with the new list of files to open).

Thanks for looking into this, let me know if you have any questions!

@Athishpranav2003
Copy link

so @art-w can we map the Corrupted_control_file normally and the rest of the errors we can do a sleep and rety?

@art-w
Copy link
Contributor

art-w commented Sep 30, 2024

Yes so any No_such_file_or_directory error that happens during step 2 (open suffix, prefix, dict, index) and step 3 (open lower) should retry the whole open_ro. I believe the other errors should only result in closing any successfully open files and propagating the error (not retry).

@Athishpranav2003
Copy link

@art-w so i was thinking about performing this.
We can check the value of suffix, prefix, dict etc and if its pointing to no_such_file_or_directory we sleep(give up the computation) and awake to recursively retry the entire function. In order retry the entire function i was thinking of using exceptions in this manner

let try_finalize f x finally y =
  let res = try f x with exn -> finally y; raise exn in
  finally y;
  res

Do u have any suggestions?

@art-w
Copy link
Contributor

art-w commented Oct 4, 2024

That makes sense, but you might want to check a tutorial on Lwt, which is the scheduler used by every let* in the code (you might see it as let%lwt in the tutorials, which is the same). Sadly the try ... with ... syntax doesn't work with Lwt, but it provides alternative like Lwt.catch or Lwt.finalize to achieve the same.

Since the bug is almost impossible to replicate, it could help if you start by writing a unit test that creates a new irmin-pack repo, add a few things, and then remove one of the internal irmin files before trying to reopen the store. This should trigger the No_such_file_or_directory error and allow you to experiment with the catch-and-retry logic.

@Athishpranav2003
Copy link

Hmm ok

Will try the same,

Seems like small but tricky thing. i have some problem with type hinting so will resolve it to start with

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants