I'm currently trying to deploy a video inference model for SAM2 using TensorRT + C++. Following the approach in https://github.com/Aimol-l/OrtInference , I split the model into four parts: image encoder, image decoder, memory encoder, and memory attention. I first convert them to ONNX files and then generate TensorRT engine files from those. So far I have completed inference for frame 0 (image encoder + image decoder), modeled after the SAM1 deployment process. However, the inference process for subsequent frames seems quite complex, especially the storage and updating of obj_ptr and mask_mem. I'm a beginner; are there any detailed articles/videos explaining this part of the source code, or a C++ deployment project I could reference? Much appreciated.
I have a few specific questions:
1. After I have completed inference for frame 0, do the input prompts need to be updated when predicting subsequent frames (e.g. using the box from the previous frame's result as the prompt for the new frame)? What should the input prompt for the image encoder be for frames without prompts?
2. For obj_ptr storage, should it hold the contents of frame 0 (the prompted frame, a "conditioning" frame in the paper) plus the contents of the 15 most recent frames? If I add a new prompt at frame 20, should it store the contents of frame 20, or still the contents of frame 0 (plus the 15 most recent frames)?
3. https://github.com/Aimol-l/OrtInference adds a temporal encoding (the dark green [7,1,64,64] part in its diagram). I don't know whether it exists in the original source code; what is its significance?
4. I have some objects that may only exist in certain frames. If I want to run inference starting from a frame in the middle of the video, should the C++ implementation do it by running inference backwards and forwards from that frame?
I've also been trying to make sense of the video processing sequence, so I can try to answer some of these (though I may still have some parts wrong):
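For context, here is the overall per-frame sequence as I currently understand it, using the four-engine split from the question. Every type and run* function below is a placeholder standing in for the corresponding TensorRT engine call, not a real API:

```cpp
// Hedged sketch of the per-frame sequence, under my assumptions about
// the four-engine split. All run* functions are placeholders for
// invoking the corresponding TensorRT engine.
#include <map>
#include <vector>

using Tensor = std::vector<float>;
struct Prompt { Tensor coords, labels; };
struct FrameOut { Tensor maskLogits, objPtr, maskMem; };

// Placeholder engine wrappers (assumptions, not real functions).
Tensor   runImageEncoder(int frameIdx);
Tensor   runMemoryAttention(const Tensor& feats, const std::vector<FrameOut>& mems);
FrameOut runImageDecoder(const Tensor& feats, const Prompt& prompt);
Tensor   runMemoryEncoder(const Tensor& feats, const FrameOut& out);

std::map<int, FrameOut> memoryBank;                 // frame index -> stored memory
std::vector<FrameOut> selectMemories(int frameIdx); // see the pointer-selection sketch below
Prompt userPrompt();
Prompt emptyPrompt();                               // two padding points, label -1 (see below)

FrameOut processFrame(int frameIdx, bool hasPrompt) {
    Tensor feats = runImageEncoder(frameIdx);       // runs on every frame

    FrameOut out;
    if (hasPrompt) {
        // Conditioning frame (e.g. frame 0): decode straight from the user prompt.
        out = runImageDecoder(feats, userPrompt());
    } else {
        // Non-prompted frame: fuse past memories into the image features
        // via memory attention, then decode with an empty prompt.
        Tensor fused = runMemoryAttention(feats, selectMemories(frameIdx));
        out = runImageDecoder(fused, emptyPrompt());
    }
    out.maskMem = runMemoryEncoder(feats, out);     // encode this frame's memory
    memoryBank[frameIdx] = out;
    return out;
}
```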
> What should the input prompt for the image encoder be for frames without prompts?
Having no inputs seems to work fine, but the model actually uses a single point with label -1 (i.e. a padding point) when no inputs are given. It also uses a box input of None, which ends up adding a second padding point due to the way the prompt encoder is set up, so the final result is a prompt made of 2 padding-point embeddings.
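In a C++ deployment that behavior can be reproduced by passing the padding points explicitly. A minimal sketch, assuming the exported decoder takes point_coords [1,N,2] and point_labels [1,N] inputs (the names and shapes are my assumption, not the official export signature):

```cpp
// Hedged sketch: feeding the "no prompt" case to an exported SAM2
// decoder whose prompt inputs mirror the Python model.
#include <cstdint>
#include <vector>

struct PromptInput {
    std::vector<float> point_coords; // flattened [1, N, 2]
    std::vector<float> point_labels; // flattened [1, N]
    int64_t            num_points;
};

// Two padding points with label -1: one standing in for the missing
// point prompt, one standing in for the missing (None) box prompt.
PromptInput makeEmptyPrompt() {
    PromptInput p;
    p.num_points   = 2;
    p.point_coords = {0.0f, 0.0f, 0.0f, 0.0f}; // coords are ignored for padding points
    p.point_labels = {-1.0f, -1.0f};
    return p;
}
```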
> If I add a new prompt at frame 20, should it store the contents of frame 20, or still the contents of frame 0 (plus the 15 most recent frames)?
With the default settings, the model uses the pointers from all prompted frames at or before the current frame. So at frame 19 only the frame 0 pointer would be used (plus the 15 most recent frame pointers), and then at frames 20, 21, 22, etc. both the frame 0 and frame 20 pointers would be used.
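So the bank grows with each new prompt rather than replacing the old ones. A minimal bookkeeping sketch under that reading (the ObjPtr type and container layout are mine, not from the SAM2 code):

```cpp
// Hedged sketch of the pointer bookkeeping described above: keep all
// prompted-frame ("conditioning") pointers, plus a rolling window of
// the 15 most recent non-prompted pointers.
#include <deque>
#include <map>
#include <utility>
#include <vector>

using ObjPtr = std::vector<float>; // e.g. one pointer token per frame

struct PointerBank {
    std::map<int, ObjPtr>              prompted; // frame index -> pointer (conditioning frames)
    std::deque<std::pair<int, ObjPtr>> recent;   // rolling window of non-prompted frames
    static constexpr size_t kMaxRecent = 15;

    void addPrompted(int frame, ObjPtr ptr) { prompted[frame] = std::move(ptr); }

    void addRecent(int frame, ObjPtr ptr) {
        recent.emplace_back(frame, std::move(ptr));
        if (recent.size() > kMaxRecent) recent.pop_front();
    }

    // Gather the pointers that feed memory attention for `frame`.
    std::vector<ObjPtr> select(int frame) const {
        std::vector<ObjPtr> out;
        for (const auto& entry : prompted)
            if (entry.first <= frame) out.push_back(entry.second); // frame 0, frame 20, ...
        for (const auto& entry : recent)
            out.push_back(entry.second);                           // up to 15 most recent
        return out;
    }
};
```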
> adds a temporal encoding (the dark green [7,1,64,64] part). I don't know whether it exists in the original source code; what is its significance?
I can't read the label in the diagram, but it's most likely the maskmem_tpos_enc, which acts like a time-based position encoding: one entry for each of the 6 most recent frame memory encodings, plus 1 entry for the prompt-frame memory encodings (which is where the 7 comes from).
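If that's what it is, applying it just means adding the slot's encoding onto the corresponding memory features before memory attention. A loose sketch of that broadcast-add; the shapes, layout, and slot ordering here are my assumptions based on the [7,1,64,64] tensor above, not confirmed against the export:

```cpp
// Hedged sketch: broadcast-adding maskmem_tpos_enc onto one memory's
// features. Slot 6 might be reserved for prompted-frame memories and
// slots 0..5 for the 6 most recent frames (ordering is an assumption).
#include <cstddef>
#include <vector>

constexpr size_t kEncDim = 64; // channel dim of the encoding (assumed)

void addTposEnc(std::vector<float>& memFeat,       // [H*W, kEncDim], channel-last
                const std::vector<float>& tposEnc, // flattened [7, kEncDim]
                int slot)                          // 0..5 recent, 6 = prompted
{
    const float* enc = tposEnc.data() + static_cast<size_t>(slot) * kEncDim;
    // memFeat.size() is assumed to be a multiple of kEncDim.
    for (size_t i = 0; i < memFeat.size(); i += kEncDim)
        for (size_t c = 0; c < kEncDim; ++c)
            memFeat[i + c] += enc[c];              // broadcast over spatial positions
}
```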
> If I want to run inference starting from a frame in the middle of the video, should the C++ implementation do it by running inference backwards and forwards from that frame?
Yes, that makes sense, especially if the object isn't easily visible when it first appears. The original propagation code has a reverse flag to help with this sort of thing.
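In C++ that could look like two passes radiating out from the prompted frame. A sketch, where runFrame is a hypothetical per-frame step (one encoder -> memory attention -> decoder -> memory encoder cycle), not a real API; note each direction should keep its own recent-memory window:

```cpp
// Hedged sketch of bidirectional propagation from a mid-video prompt
// frame, mirroring the `reverse` flag mentioned above.
#include <functional>

void propagateBothWays(int promptFrame, int numFrames,
                       const std::function<void(int frame, bool reverse)>& runFrame)
{
    // Backward pass: frames promptFrame-1 .. 0, seeded from the
    // prompted frame's memory.
    for (int f = promptFrame - 1; f >= 0; --f)
        runFrame(f, /*reverse=*/true);

    // Forward pass: frames promptFrame+1 .. numFrames-1.
    for (int f = promptFrame + 1; f < numFrames; ++f)
        runFrame(f, /*reverse=*/false);
}
```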