Fix training resumption #312

sichu2023 · 2024-10-15T22:03:28Z

Summary

The loss curve from a resumed training run is inconsistent with the loss curve from a single uninterrupted run.

Details

The inconsistency has been traced in part to incorrect datamodule behavior and is fixed by implementing state_dict for the datamodule. However, there is still a very minor dip in the validation loss curve that does not affect the subsequent training curve.
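
For readers unfamiliar with the pattern, here is a minimal sketch of what saving datamodule state for resumption can look like. It is not the code in this PR: the class `ResumableDataModule`, the `consumed_samples` field, and the `run` helper are hypothetical names, and the sketch does not import Lightning. It only mirrors the `state_dict` / `load_state_dict` hook pair that a datamodule exposes so its state can travel with a checkpoint.

```python
from typing import Any, Dict, Iterator, List, Optional


class ResumableDataModule:
    """Toy stand-in for a datamodule (hypothetical, not the bionemo class)."""

    def __init__(self, samples: List[int], batch_size: int) -> None:
        self.samples = samples
        self.batch_size = batch_size
        # Hypothetical cursor: how many samples training has already consumed.
        self.consumed_samples = 0

    def train_batches(self) -> Iterator[List[int]]:
        # Start from the saved cursor instead of always restarting at index 0.
        while self.consumed_samples < len(self.samples):
            start = self.consumed_samples
            batch = self.samples[start : start + self.batch_size]
            self.consumed_samples += len(batch)
            yield batch

    def state_dict(self) -> Dict[str, Any]:
        # Saved alongside the model checkpoint.
        return {"consumed_samples": self.consumed_samples}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Restored before training resumes.
        self.consumed_samples = state_dict["consumed_samples"]


def run(dm: ResumableDataModule, max_batches: Optional[int] = None) -> List[List[int]]:
    """Collect batches, optionally stopping early to simulate an interruption."""
    seen: List[List[int]] = []
    for i, batch in enumerate(dm.train_batches()):
        seen.append(batch)
        if max_batches is not None and i + 1 >= max_batches:
            break
    return seen


data = list(range(10))

# One uninterrupted pass over the data.
continuous = run(ResumableDataModule(data, batch_size=3))

# Stop after two batches, save the state, then resume in a fresh object.
first = ResumableDataModule(data, batch_size=3)
before_stop = run(first, max_batches=2)
checkpoint = first.state_dict()

second = ResumableDataModule(data, batch_size=3)
second.load_state_dict(checkpoint)
after_resume = run(second)

stop_and_go = before_stop + after_resume
```

Without the `load_state_dict` call, the second object would start again from the first sample, and the resumed run would see different batches than the uninterrupted one.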

sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_stop_and_go.py

sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_stop_and_go.py

sub-packages/bionemo-testing/src/bionemo/testing/harnesses/stop_and_go.py

sub-packages/bionemo-llm/src/bionemo/llm/data/datamodule.py

sub-packages/bionemo-testing/src/bionemo/testing/testing_callbacks.py

skothenhill-nv

left some comments on stop and go, when addressed lgtm

sichu2023 · 2024-10-25T11:19:47Z

/build-ci

sichu2023 · 2024-10-25T12:40:26Z

Confirmed that we can reproduce the loss curve upon resumption, despite a small jump in validation loss at the beginning.
https://wandb.ai/clara-discovery/bionemo2-esm2-debug/reports/val_ppl-24-10-25-14-39-37---Vmlldzo5ODg0ODMy
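
A check like this can also be expressed in code. The sketch below is an assumption about how one might compare the two curves, not the comparison used in the report above: `curves_match`, the tolerance, and the `ignore` argument are hypothetical, and the numbers are placeholders rather than measured losses.

```python
import math
from typing import Collection, Sequence


def curves_match(
    continuous: Sequence[float],
    resumed: Sequence[float],
    rel_tol: float = 1e-3,
    ignore: Collection[int] = (),
) -> bool:
    """Return True if two loss curves agree pointwise within rel_tol.

    Indices listed in `ignore` are skipped, e.g. the first validation
    point after resumption where a small jump is expected.
    """
    if len(continuous) != len(resumed):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol)
        for i, (a, b) in enumerate(zip(continuous, resumed))
        if i not in ignore
    )


# Placeholder values only: the resumed run deviates at index 2.
continuous_curve = [2.0, 1.5, 1.2, 1.0]
resumed_curve = [2.0, 1.5, 1.3, 1.0]

strict = curves_match(continuous_curve, resumed_curve)
lenient = curves_match(continuous_curve, resumed_curve, ignore={2})
```

Here `strict` is False because of the deviation at index 2, while `lenient` is True once that single point is excluded.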

pstjohn · 2024-10-25T13:40:58Z

sub-packages/bionemo-esm2/src/bionemo/esm2/data/dataset.py

+def create_dummy_parquet_train_val_inputs(tmp_path: Path) -> Tuple[Path, Path]:
+    """Create a mock protein train and val cluster parquet."""
+    train_cluster_path = tmp_path / "train_clusters.parquet"
+    train_clusters = pd.DataFrame(
+        {
+            "ur90_id": [["UniRef90_A"], ["UniRef90_B", "UniRef90_C"]],
+        }
+    )
+    train_clusters.to_parquet(train_cluster_path)
+
+    valid_cluster_path = tmp_path / "valid_clusters.parquet"
+    valid_clusters = pd.DataFrame(
+        {
+            "ur50_id": ["UniRef50_A", "UniRef50_B", "UniRef90_A", "UniRef90_B"],
+        }
+    )
+    valid_clusters.to_parquet(valid_cluster_path)
+    return train_cluster_path, valid_cluster_path
+
+
+def create_dummy_protein_dataset(tmp_path) -> Path:
+    """Create a mock protein dataset."""
+    if not isinstance(tmp_path, Path):
+        tmp_path = Path(str(tmp_path))
+
+    db_file = tmp_path / "protein_dataset.db"
+    conn = sqlite3.connect(str(db_file))
+    cursor = conn.cursor()
+
+    cursor.execute(
+        """
+        CREATE TABLE protein (
+            id TEXT PRIMARY KEY,
+            sequence TEXT
+        )
+    """
+    )
+
+    proteins = [
+        ("UniRef90_A", "ACDEFGHIKLMNPQRSTVWY"),
+        ("UniRef90_B", "DEFGHIKLMNPQRSTVWYAC"),
+        ("UniRef90_C", "MGHIKLMNPQRSTVWYACDE"),
+        ("UniRef50_A", "MKTVRQERLKSIVRI"),
+        ("UniRef50_B", "MRILERSKEPVSGAQLA"),
+    ]
+    cursor.executemany("INSERT INTO protein VALUES (?, ?)", proteins)
+
+    conn.commit()
+    conn.close()
+
+    return db_file


These are testing-specific functions, let's not have them in the library itself

sichu2023 · 2024-10-25T15:09:12Z

Broke down the PR into smaller pieces. Now we have a PR that specifically tackles the inconsistency between stop-and-go and continuous training curves.
#362

jstjohn · 2024-10-25T21:07:16Z

/build-ci

jstjohn · 2024-10-25T22:00:59Z

/build-ci

jstjohn · 2024-10-25T22:20:30Z

/build-ci

jstjohn · 2024-10-25T22:52:39Z

/build-ci

jstjohn · 2024-10-25T23:05:52Z

/build-ci

sichu2023 added the NOT_related_to_v24.10 label Oct 15, 2024

sichu2023 force-pushed the sichu/fix-pplcallback-sanity branch from c8a7d1a to 4575a61 Compare October 21, 2024 18:03

sichu2023 changed the title ~~add trainer.sanity_check~~ Fix training resumption Oct 22, 2024

sichu2023 marked this pull request as ready for review October 22, 2024 21:24

sichu2023 requested review from jstjohn, malcolmgreaves, skothenhill-nv, farhadrgh, dorotat-nv, gwarmstrong, jomitchellnv and pstjohn as code owners October 22, 2024 21:24

sichu2023 added the bug and enhancement labels Oct 22, 2024