How to save and restore a model with dynamic embeddings? #451

Open
alykhantejani opened this issue Aug 1, 2024 · 2 comments

Comments

@alykhantejani

Hi,

I am training a model with dynamic embeddings (specifically HvdAllToAllEmbeddings). I save the model to disk with de.keras.models.de_save_model, and the dynamic embedding variables do appear to be written to disk.

However, when I restore from this directory, only the dense weights appear to be restored. I am restoring with model.load_weights(FLAGS.model_dir) as shown here.

Am I supposed to restore a KVCreator too?

@ZunwenYou

Same here!

When I load a trained model from disk for incremental training, it fails as soon as I call fit(train_dataset).

I load the model with model = tf.keras.models.load_model(FLAGS.model_dir)

The error log is:

Traceback (most recent call last):
  File "/apdcephfs/dd_model/recommenders-addons-0.7.2/demo/dynamic_embedding/movielens-1m-keras/movielens-1m-keras.py", line 247, in <module>
    app.run(main)
  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/apdcephfs/dd_model/recommenders-addons-0.7.2/demo/dynamic_embedding/movielens-1m-keras/movielens-1m-keras.py", line 237, in main
    train()
  File "/apdcephfs/dd_model/recommenders-addons-0.7.2/demo/dynamic_embedding/movielens-1m-keras/movielens-1m-keras.py", line 147, in train
    model.fit(dataset, epochs=FLAGS.epochs, steps_per_epoch=FLAGS.steps_per_epoch)
  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node Adam/ResourceScatterAdd_3 defined at (most recent call last):
  File "/apdcephfs/dd_model/recommenders-addons-0.7.2/demo/dynamic_embedding/movielens-1m-keras/movielens-1m-keras.py", line 247, in <module>

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/absl/app.py", line 308, in run

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main

  File "/apdcephfs/dd_model/recommenders-addons-0.7.2/demo/dynamic_embedding/movielens-1m-keras/movielens-1m-keras.py", line 237, in main

  File "/apdcephfs/dd_model/recommenders-addons-0.7.2/demo/dynamic_embedding/movielens-1m-keras/movielens-1m-keras.py", line 147, in train

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/engine/training.py", line 1807, in fit

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/engine/training.py", line 1401, in train_function

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/engine/training.py", line 1384, in step_function

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/engine/training.py", line 1373, in run_step

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/engine/training.py", line 1154, in train_step

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 544, in minimize

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 1223, in apply_gradients

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 652, in apply_gradients

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 1253, in _internal_apply_gradients

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 1345, in _distributed_apply_gradients_fn

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 1342, in apply_grad_to_update_var

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/optimizer.py", line 241, in _update_step

  File "/root/miniconda3/envs/py39tfra072/lib/python3.9/site-packages/keras/src/optimizers/adam.py", line 185, in update_step

indices[0] = 0 is not in [0, 0)
         [[{{node Adam/ResourceScatterAdd_3}}]] [Op:__inference_train_function_3810]
2024-08-08 16:21:52.222686: W tensorflow/core/kernels/data/cache_dataset_ops.cc:858] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2024-08-08 16:21:52.232673: W tensorflow/core/kernels/data/cache_dataset_ops.cc:858] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
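
For readers decoding the message "indices[0] = 0 is not in [0, 0)": it is a bounds check reporting that index 0 was scattered into a target whose first dimension has size 0. The sketch below reproduces that check in plain Python. The function scatter_add is a hypothetical stand-in written for illustration, not the TensorFlow kernel, and it says nothing about why the optimizer's target ended up empty after loading.

```python
def scatter_add(target, indices, updates):
    """Add updates[i] to target[indices[i]], rejecting out-of-range indices."""
    n = len(target)
    for i, idx in enumerate(indices):
        # Valid indices lie in the half-open range [0, n).
        if not 0 <= idx < n:
            raise IndexError(f"indices[{i}] = {idx} is not in [0, {n})")
        target[idx] += updates[i]
    return target


# A target with three rows accepts indices 0 and 2.
print(scatter_add([0.0, 0.0, 0.0], [0, 2], [1.0, 0.5]))

# A target with zero rows rejects every index, including 0.
try:
    scatter_add([], [0], [1.0])
except IndexError as err:
    print(err)
```

With an empty target the range [0, 0) contains no valid index at all, so the failure points at the size of the variable being updated rather than at the input data.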

@MoFHeka
Collaborator

MoFHeka commented Aug 8, 2024

Sorry, it is hard for TFRA to support the tf.keras.models.load_model API. load_model creates TensorFlow's own trainable variable objects, but the TFRA trainable wrapper is not part of the TF code, so load_model cannot recreate it.
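
As a rough illustration of the point above, here is a toy loader that only knows how to rebuild its own built-in class. Every name in it (Variable, DynamicWrapper, BUILTIN_CLASSES, toy_load) is hypothetical; this is not TensorFlow's or TFRA's actual loading code, only a sketch of how a wrapper class defined outside the loader can be lost on restore.

```python
class Variable:
    """Stand-in for a framework's built-in, fixed-size variable."""

    def __init__(self, rows):
        self.rows = list(rows)


class DynamicWrapper(Variable):
    """Hypothetical add-on wrapper that grows its rows on demand."""

    def lookup(self, key):
        # Extend storage until the requested key exists.
        while key >= len(self.rows):
            self.rows.append(0.0)
        return self.rows[key]


# Classes the toy loader can rebuild; the add-on wrapper is not among them.
BUILTIN_CLASSES = {"Variable": Variable}


def toy_load(saved):
    """Rebuild an object from a saved record, falling back to the built-in class."""
    cls = BUILTIN_CLASSES.get(saved["class"], Variable)
    return cls(saved["rows"])


saved = {"class": "DynamicWrapper", "rows": []}
restored = toy_load(saved)
print(type(restored).__name__)      # Variable
print(hasattr(restored, "lookup"))  # False
```

In this toy, the restored object is a plain Variable with no rows and no lookup method, so any code that relies on the wrapper's behavior would fail after loading.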
