Implement V4TrainingData #722
Conversation
This reverts commit 318b6cb.
For compatibility with the WDL value head, a draw value should also be included. Currently the WDL head outputs the same Q and additionally D in the range 0 to 1 for the draw probability. It's enough to add a field for D to support it.
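Since the WDL head outputs Q and D together, the full win/draw/loss distribution can be recovered from just those two numbers. A minimal sketch, assuming the usual convention q = w − l with w + d + l = 1 (the helper name `wdl_from_q_d` is illustrative, not from this PR):

```python
def wdl_from_q_d(q: float, d: float):
    """Recover (win, draw, loss) probabilities from Q in [-1, 1] and
    D in [0, 1], assuming q = w - l and w + d + l = 1.
    This convention is an assumption for illustration, not the PR's code."""
    w = (1.0 - d + q) / 2.0
    l = (1.0 - d - q) / 2.0
    return w, d, l

# Example: q = 0.2, d = 0.5 gives w = 0.35, l = 0.15
w, d, l = wdl_from_q_d(0.2, 0.5)
```

This is why storing only D alongside the existing Q is sufficient: no separate W and L fields are needed.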
I agree with ttl here: there's no point adding a v4 format without including the WDL via the root and best move d. But this would only be a temporary thing; more input planes will likely render it obsolete relatively soon. It's probably good to invest more in working out a better protobuf format when we can.
I will add the d of the root and the best move. Since that depends on #635, do I wait until that's merged, or what is the procedure here? My thinking was that we could start with this V4TrainingData to try things in the T50 run ASAP and move to the protobuf format afterwards.
Doing some more thinking. Unless the update to the training PR to support both v3 and v4 goes in before the a0 policy head PR to the training codebase, the engine will have to output v3 or v4 depending on whether the input net is in the right format, or someone will need to write a v4-to-v3 down-converter that I can stick into the training pipeline next to the rescorer. This is because the a0 policy head PR to training doesn't support backwards compatibility, so I can't update the T40 run to a new version of lczero-training once it's in (or at least the version I saw on ttl's branch didn't support backwards compatibility, IIRC).
The AZ policy head training code is backwards compatible. The policy head type can be specified in the yaml file, and the code understands both old and new policy heads.
The same does not appear to be true for the WDL change, though. Do you plan on updating it to also be backwards compatible?
Yes, I will make it also backwards compatible. |
Okay, all good then, but the training-side version of this PR should go in first. With regard to ordering with respect to #635: if this is ready to submit before #635, just store 0 for the draw values and then update them as part of #635.
I'll approve this once I've upgraded the server. |
Actually, there is going to be a small window of breakage either way, as I need to merge this PR into the rescorer and make the rescorer's loading code support both v3 and v4.
This PR updates the current V3TrainingData to V4TrainingData with the following additional information:
Aggregate evaluation Q: Storing Q allows us to perform resign analysis (LeelaChessZero/lczero-common#1) and to try training schemes involving Q, such as the (Q+Z)/2 proposal by Oracle, which I also discussed on our blog. The Q stored here is from the same player's perspective as the outcome Z. I included both root Q and best Q since we might as well have both; each is only an extra 32-bit float.
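The (Q+Z)/2 idea blends the search evaluation with the final game outcome as the value-head training target. A minimal training-side sketch, assuming Q and Z are both in [-1, 1] from the same player's perspective (the function name and `q_ratio` parameter are illustrative, not from this PR):

```python
def blended_value_target(q: float, z: float, q_ratio: float = 0.5) -> float:
    """Blend the stored search evaluation Q with the game outcome Z.
    q_ratio = 0.5 gives the (Q+Z)/2 target discussed above; q_ratio = 0
    recovers plain Z training. Both inputs are assumed to be from the
    same player's perspective, as the PR stores them."""
    return q_ratio * q + (1.0 - q_ratio) * z

# A drawn game (z = 0) the search evaluated as winning (q = 0.6):
target = blended_value_target(q=0.6, z=0.0)  # (0.6 + 0.0) / 2 = 0.3
```

Keeping a `q_ratio` knob rather than hard-coding 0.5 would let the training run sweep between pure-Z and pure-Q targets.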
Legal moves: By initialising the vector of probabilities to -1, we can easily distinguish legal from illegal moves on the training side. This lets us try alternative training schemes where the policy is trained only on legal moves rather than on all moves, making the training setting as close as possible to the evaluation setting (after all, we don't care what probability the network puts on illegal moves during gameplay). In my initial testing on CCRL, this was around 30 Elo stronger, and it could prevent issues where the network places very low probability on some moves simply because it wrongly thought they were illegal. With this change, all consumers of the training data must threshold the -1s back to 0 if they want to treat the vector as a probability distribution.
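On the consumer side, the required thresholding and the legal-move mask are both one-liners. A minimal sketch (helper names are illustrative; consumers may implement this however they like):

```python
def threshold_illegal(probs):
    """Clamp the -1 markers for illegal moves back to 0, so the vector
    can again be treated as a probability distribution over all moves."""
    return [p if p >= 0.0 else 0.0 for p in probs]

def legal_mask(probs):
    """Boolean mask of legal moves, usable to restrict the policy loss
    to legal moves only, as described above."""
    return [p >= 0.0 for p in probs]

# A toy 4-move policy vector: two illegal moves marked with -1.
probs = [-1.0, 0.2, 0.8, -1.0]
clean = threshold_illegal(probs)   # [0.0, 0.2, 0.8, 0.0]
mask = legal_mask(probs)           # [False, True, True, False]
```

The key point is that 0 is a valid probability for a legal move, so only a sentinel outside [0, 1] such as -1 can encode legality losslessly.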
I have tried implementing a protobuf format for the training data in https://github.com/Cyanogenoid/lc0/tree/chunk. With the current protobuf training data, the files are bigger after compression, so there is no immediate benefit. I think it's better to separate the functional changes in this PR from the format change to protobuf anyway.
I believe that fersbery, dkappe, and ASilver have been using a prior version of this branch with good success. With decode_training from https://github.com/Cyanogenoid/lczero-training/tree/q, here is the output of a game (note the Root Q and Best Q lines): https://gist.github.com/Cyanogenoid/cc3a4d6a717b9605b074a50d35c320e3
Is there any other data we want to store in the training data?