Fix training resumption #312

Draft
wants to merge 51 commits into
base: sichu/match-resumption-loss-curve
from


sichu2023
Collaborator

@sichu2023 sichu2023 commented Oct 15, 2024

Summary

The loss curve after resuming training does not match the loss curve of a single uninterrupted run.

Details

The inconsistency was traced in part to incorrect datamodule behavior and is fixed by implementing state_dict for the datamodule. A very minor dip in the validation loss curve remains, but it does not affect the subsequent training curve.
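As a rough illustration of the kind of fix described above, the sketch below shows a datamodule-like object that saves and restores how many samples it has consumed, so a resumed run continues from the same position in the data. It is not the code from this PR: the class name `ResumableDataModule`, the `consumed_samples` field, the `run_stop_and_go` helper, and the plain-Python batching are all assumptions made for the example. Only the `state_dict` / `load_state_dict` method pair mirrors the hook names that Lightning datamodules expose.

```python
from typing import Any, Dict, Iterator, List


class ResumableDataModule:
    """Hypothetical stand-in for a datamodule; not the PR's implementation."""

    def __init__(self, data: List[int], batch_size: int) -> None:
        self.data = data
        self.batch_size = batch_size
        self.consumed_samples = 0  # assumed bookkeeping field

    def state_dict(self) -> Dict[str, Any]:
        # Saved alongside the model checkpoint so the data position survives a restart.
        return {"consumed_samples": self.consumed_samples}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Restore the data position recorded at checkpoint time.
        self.consumed_samples = state_dict["consumed_samples"]

    def batches(self) -> Iterator[List[int]]:
        # Always start from the first sample that has not been consumed yet.
        while self.consumed_samples < len(self.data):
            start = self.consumed_samples
            batch = self.data[start : start + self.batch_size]
            self.consumed_samples += len(batch)
            yield batch


def run_stop_and_go(data: List[int], batch_size: int, stop_after: int) -> List[List[int]]:
    """Consume `stop_after` batches, checkpoint, then resume in a fresh object."""
    first = ResumableDataModule(data, batch_size)
    seen: List[List[int]] = []
    for i, batch in enumerate(first.batches()):
        seen.append(batch)
        if i + 1 == stop_after:
            break
    checkpoint = first.state_dict()

    resumed = ResumableDataModule(data, batch_size)
    resumed.load_state_dict(checkpoint)
    seen.extend(resumed.batches())
    return seen
```

With the state restored, the stop-and-go run sees the same batches in the same order as a continuous run. Without `load_state_dict`, the resumed object would start again from the first sample, which is one way a resumed loss curve can drift from the uninterrupted one.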

@sichu2023 sichu2023 changed the title from "add trainer.sanity_check" to "Fix training resumption" Oct 22, 2024
@sichu2023 sichu2023 marked this pull request as ready for review October 22, 2024 21:24
@sichu2023 sichu2023 added the bug and enhancement labels Oct 22, 2024
Collaborator

@skothenhill-nv skothenhill-nv left a comment

Left some comments on stop-and-go; LGTM once they are addressed.

@sichu2023
Collaborator Author

/build-ci

@sichu2023
Collaborator Author

Confirmed that we can reproduce the loss curve upon resumption, despite a small jump in validation loss at the beginning.
https://wandb.ai/clara-discovery/bionemo2-esm2-debug/reports/val_ppl-24-10-25-14-39-37---Vmlldzo5ODg0ODMy
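One way to make a check like this automatic is to compare the two curves step by step within a tolerance. The sketch below is an assumption about how such a check could look, not what was done in the linked report: the function name `curves_match`, the tolerance values, and the option to skip the first few points after resumption (to allow for the small initial jump) are all hypothetical.

```python
import math
from typing import Sequence


def curves_match(
    continuous: Sequence[float],
    resumed: Sequence[float],
    resume_step: int,
    skip_after_resume: int = 0,
    rel_tol: float = 1e-3,
    abs_tol: float = 1e-6,
) -> bool:
    """Return True if `resumed` tracks `continuous` from `resume_step` onward.

    `resumed[0]` is assumed to correspond to `continuous[resume_step]`.
    The first `skip_after_resume` points of the resumed run are ignored.
    """
    if len(resumed) != len(continuous) - resume_step:
        return False
    for offset in range(skip_after_resume, len(resumed)):
        expected = continuous[resume_step + offset]
        actual = resumed[offset]
        if not math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol):
            return False
    return True
```

For example, a resumed curve whose first point is slightly off but whose later points agree would fail with `skip_after_resume=0` and pass with `skip_after_resume=1`.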

Comment on lines +35 to +85
def create_dummy_parquet_train_val_inputs(tmp_path: Path) -> Tuple[Path, Path]:
    """Create a mock protein train and val cluster parquet."""
    train_cluster_path = tmp_path / "train_clusters.parquet"
    train_clusters = pd.DataFrame(
        {
            "ur90_id": [["UniRef90_A"], ["UniRef90_B", "UniRef90_C"]],
        }
    )
    train_clusters.to_parquet(train_cluster_path)

    valid_cluster_path = tmp_path / "valid_clusters.parquet"
    valid_clusters = pd.DataFrame(
        {
            "ur50_id": ["UniRef50_A", "UniRef50_B", "UniRef90_A", "UniRef90_B"],
        }
    )
    valid_clusters.to_parquet(valid_cluster_path)
    return train_cluster_path, valid_cluster_path


def create_dummy_protein_dataset(tmp_path) -> Path:
    """Create a mock protein dataset."""
    if not isinstance(tmp_path, Path):
        tmp_path = Path(str(tmp_path))

    db_file = tmp_path / "protein_dataset.db"
    conn = sqlite3.connect(str(db_file))
    cursor = conn.cursor()

    cursor.execute(
        """
        CREATE TABLE protein (
            id TEXT PRIMARY KEY,
            sequence TEXT
        )
        """
    )

    proteins = [
        ("UniRef90_A", "ACDEFGHIKLMNPQRSTVWY"),
        ("UniRef90_B", "DEFGHIKLMNPQRSTVWYAC"),
        ("UniRef90_C", "MGHIKLMNPQRSTVWYACDE"),
        ("UniRef50_A", "MKTVRQERLKSIVRI"),
        ("UniRef50_B", "MRILERSKEPVSGAQLA"),
    ]
    cursor.executemany("INSERT INTO protein VALUES (?, ?)", proteins)

    conn.commit()
    conn.close()

    return db_file
Collaborator

These are testing-specific functions; let's not have them in the library itself.
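A minimal sketch of that suggestion, assuming the helper is defined in a test module (for instance a hypothetical `conftest.py`) instead of in the package: the trimmed-down helper, the `count_proteins` function, and the test name below are illustrative only and are not taken from this PR.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_dummy_protein_dataset(tmp_path: Path) -> Path:
    """Test-only helper: build a tiny SQLite protein table (shortened for illustration)."""
    db_file = Path(tmp_path) / "protein_dataset.db"
    conn = sqlite3.connect(str(db_file))
    try:
        conn.execute("CREATE TABLE protein (id TEXT PRIMARY KEY, sequence TEXT)")
        conn.executemany(
            "INSERT INTO protein VALUES (?, ?)",
            [
                ("UniRef90_A", "ACDEFGHIKLMNPQRSTVWY"),
                ("UniRef50_A", "MKTVRQERLKSIVRI"),
            ],
        )
        conn.commit()
    finally:
        conn.close()
    return db_file


def count_proteins(db_file: Path) -> int:
    """Hypothetical check used by the test below."""
    conn = sqlite3.connect(str(db_file))
    try:
        return conn.execute("SELECT COUNT(*) FROM protein").fetchone()[0]
    finally:
        conn.close()


def test_dummy_dataset_has_expected_rows() -> None:
    # A temporary directory keeps the mock database out of the source tree.
    with tempfile.TemporaryDirectory() as tmp:
        db_file = create_dummy_protein_dataset(Path(tmp))
        assert count_proteins(db_file) == 2
```

Keeping the helper next to the tests means the library does not ship mock-data code, while the tests can still share it.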

@sichu2023
Collaborator Author

Broke the PR down into smaller pieces. We now have a PR that specifically tackles the inconsistency between the stop-and-go and continuous training curves.
#362

@pstjohn pstjohn marked this pull request as draft October 25, 2024 15:42
@pstjohn pstjohn force-pushed the sichu/fix-pplcallback-sanity branch from 631a51e to 4994705 Compare October 25, 2024 15:47
@jstjohn jstjohn changed the base branch from main to sichu/match-resumption-loss-curve October 25, 2024 20:02
@jstjohn
Collaborator

jstjohn commented Oct 25, 2024

/build-ci

@jstjohn
Collaborator

jstjohn commented Oct 25, 2024

/build-ci

@jstjohn
Collaborator

jstjohn commented Oct 25, 2024

/build-ci

@jstjohn
Collaborator

jstjohn commented Oct 25, 2024

/build-ci

@jstjohn
Collaborator

jstjohn commented Oct 25, 2024

/build-ci

Labels: bug, enhancement, NOT_related_to_v24.10
4 participants