YAML model config alterations (added comments) #207

Open
wants to merge 9 commits into base: main
1,891 changes: 1,891 additions & 0 deletions Image_Predictor_Experiments.ipynb

Large diffs are not rendered by default.

1,116 changes: 680 additions & 436 deletions notebooks/automatic_mask_generator_example.ipynb

Large diffs are not rendered by default.

3,262 changes: 1,917 additions & 1,345 deletions notebooks/video_predictor_example.ipynb

Large diffs are not rendered by default.

99 changes: 99 additions & 0 deletions sam2_configs/sam2_hiera_b+.yaml
@@ -111,3 +111,102 @@ model:
use_mlp_for_obj_ptr_proj: true
# Compilation flag
compile_image_encoder: False


#Let's break down this configuration file, which defines the architecture of the SAM 2 (Segment Anything Model 2) promptable segmentation model.
#Key Components
#The configuration uses a Python-like syntax to define the model's structure. Here's a breakdown of the key sections and their functions:
#@package _global_
#A Hydra package directive: it places the keys defined in this file at the root (global) level of the composed configuration rather than under a file-specific namespace.
#model:
#Defines the main model structure.
#_target_: sam2.modeling.sam2_base.SAM2Base
#The _target_ key names the class to instantiate for this node; the top-level model is the SAM2Base class from the sam2 package.
#image_encoder:
#Describes the image encoder part of the model, which takes an image as input and extracts features.
#_target_: sam2.modeling.backbones.image_encoder.ImageEncoder
#Points to a specific image encoder implementation.
#scalp: 1
#Likely the number of lowest-resolution feature levels dropped from the image encoder's output (here, the coarsest level is discarded).
#trunk:
#Represents the backbone of the image encoder, often a convolutional neural network.
#_target_: sam2.modeling.backbones.hieradet.Hiera
#Specifies the Hiera backbone, a hierarchical vision transformer.
#embed_dim: 112
#The dimensionality of the feature vectors produced by the backbone.
#num_heads: 2
#The initial number of attention heads in the Hiera backbone's attention blocks.
#neck:
#A feature pyramid network (FPN) responsible for combining features from different levels of the backbone.
#_target_: sam2.modeling.backbones.image_encoder.FpnNeck
#Points to a specific FPN implementation.
#position_encoding:
#A mechanism to encode spatial information into the features, which is important for tasks like object detection and segmentation.
#_target_: sam2.modeling.position_encoding.PositionEmbeddingSine
#A common method for positional encoding using sinusoidal functions.
#d_model: 256
#The dimensionality of the features processed by the FPN.
#backbone_channel_list: [896, 448, 224, 112]
#Specifies the number of channels in different levels of the backbone.
#fpn_top_down_levels: [2, 3]
#The FPN output levels that receive top-down (merged) features; the remaining levels (0 and 1) use the backbone features directly.
#fpn_interp_model: nearest
#The interpolation method used in the FPN for upsampling features.
#memory_attention:
#Defines the memory attention module, which lets the current frame's features attend to memories encoded from previous frames.
#_target_: sam2.modeling.memory_attention.MemoryAttention
#Points to the implementation of the memory attention module.
#d_model: 256
#The dimensionality of the features used in the memory attention.
#pos_enc_at_input: true
#Indicates that positional encoding is applied at the input of the memory attention.
#layer:
#The specific layer within the memory attention mechanism.
#_target_: sam2.modeling.memory_attention.MemoryAttentionLayer
#The class of the layer within the memory attention.
#self_attention:
#A self-attention mechanism, which attends to different parts of the input sequence or feature map.
#_target_: sam2.modeling.sam.transformer.RoPEAttention
#A specific type of self-attention using a mechanism called RoPE (Rotary Position Embedding) to encode positional information.
#cross_attention:
#A cross-attention mechanism, in which the current frame's features (queries) attend to the memory features (keys and values).
#_target_: sam2.modeling.sam.transformer.RoPEAttention
#The type of cross-attention, likely also using RoPE.
#memory_encoder:
#Defines the memory encoder, which processes and encodes the information from the memory attention module.
#_target_: sam2.modeling.memory_encoder.MemoryEncoder
#The class of the memory encoder.
#out_dim: 64
#The dimensionality of the output from the memory encoder.
#position_encoding:
#Positional encoding used in the memory encoder.
#mask_downsampler:
#A module for downsampling the mask representations.
#fuser:
#A module for fusing features from different levels of the memory encoder.
#num_maskmem: 7
#The number of mask-memory frames kept (typically the conditioning frame plus the most recent previous frames).
#image_size: 1024
#The expected input image size for the model.
#sigmoid_scale_for_mem_enc: 20.0
#A scale applied to the sigmoid of the mask logits before they are passed to the memory encoder.
#sigmoid_bias_for_mem_enc: -10.0
#A bias added to the sigmoid of the mask logits before they are passed to the memory encoder.
#use_mask_input_as_output_without_sam: true
#When a mask is provided as input, use it directly as the output mask instead of running it through the SAM mask decoder.
#directly_add_no_mem_embed: true
#When no memory is available (e.g., on the first frame), add a learned "no memory" embedding directly to the image features instead of running memory attention.
#use_high_res_features_in_sam: true
#Indicates that high-resolution features are used in the SAM (Segment Anything Model) module.
#multimask_output_in_sam: true
#Suggests that the SAM module can produce multiple masks as output.
#iou_prediction_use_sigmoid: True
#Apply a sigmoid to the predicted Intersection over Union (IoU) scores so they fall in the [0, 1] range.
#use_obj_ptrs_in_encoder: true
#Cross-attend to object pointer tokens (derived from the SAM output tokens of other frames) in the encoder.
#pred_obj_scores: true
#A flag indicating that the model predicts object scores (confidence scores for object presence).
#multimask_output_for_tracking: true
#Output multiple candidate masks during tracking to resolve ambiguity (multiple hypotheses per object, not multiple objects).
#compile_image_encoder: False
#A flag for whether the image encoder should be compiled for optimization.
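The comments above repeatedly point at _target_ keys. As a minimal sketch of how such keys are typically resolved (not part of this PR, and assuming the project composes these files with Hydra/OmegaConf, as the @package _global_ directive suggests), the snippet below instantiates a toy stand-in class from a _target_ node; PositionEmbeddingSine here is a hypothetical placeholder, not the real sam2 implementation:

from omegaconf import OmegaConf
from hydra.utils import instantiate

# Toy stand-in for a class referenced by a _target_ key (not the real sam2 class).
class PositionEmbeddingSine:
    def __init__(self, num_pos_feats, normalize=True, temperature=10000, scale=None):
        self.num_pos_feats = num_pos_feats
        self.normalize = normalize
        self.temperature = temperature
        self.scale = scale

# Build a small config node shaped like the YAML above.
cfg = OmegaConf.create({
    "position_encoding": {
        "_target_": "__main__.PositionEmbeddingSine",
        "num_pos_feats": 256,
        "normalize": True,
        "temperature": 10000,
    }
})

# instantiate() imports the dotted path named in _target_ and calls it with the remaining keys.
pos_enc = instantiate(cfg.position_encoding)
print(type(pos_enc).__name__, pos_enc.num_pos_feats)  # PositionEmbeddingSine 256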
208 changes: 104 additions & 104 deletions sam2_configs/sam2_hiera_l.yaml
@@ -1,117 +1,117 @@
# @package _global_

# Model
# @package _global_ # Hydra directive placing this config at the global (root) package
#
# Model # Section defining the model architecture
model:
_target_: sam2.modeling.sam2_base.SAM2Base
image_encoder:
_target_: sam2.modeling.backbones.image_encoder.ImageEncoder
scalp: 1
trunk:
_target_: sam2.modeling.backbones.hieradet.Hiera
embed_dim: 144
num_heads: 2
stages: [2, 6, 36, 4]
global_att_blocks: [23, 33, 43]
window_pos_embed_bkg_spatial_size: [7, 7]
window_spec: [8, 4, 16, 8]
neck:
_target_: sam2.modeling.backbones.image_encoder.FpnNeck
position_encoding:
_target_: sam2.modeling.position_encoding.PositionEmbeddingSine
num_pos_feats: 256
normalize: true
scale: null
temperature: 10000
d_model: 256
backbone_channel_list: [1152, 576, 288, 144]
fpn_top_down_levels: [2, 3] # output level 0 and 1 directly use the backbone features
fpn_interp_model: nearest
_target_: sam2.modeling.sam2_base.SAM2Base # Specifies the base model class
image_encoder: # Configuration for the image encoder
_target_: sam2.modeling.backbones.image_encoder.ImageEncoder # Specifies the image encoder class
scalp: 1 # Number of lowest-resolution feature levels dropped from the encoder output
trunk: # Configuration for the backbone of the image encoder
_target_: sam2.modeling.backbones.hieradet.Hiera # Specifies the backbone class
embed_dim: 144 # Dimensionality of the feature embeddings
num_heads: 2 # Number of attention heads in the backbone
stages: [2, 6, 36, 4] # Stages of the backbone network
global_att_blocks: [23, 33, 43] # Global attention blocks in the backbone
window_pos_embed_bkg_spatial_size: [7, 7] # Spatial size for window position embedding
window_spec: [8, 4, 16, 8] # Window specifications for the backbone
neck: # Configuration for the feature pyramid network (FPN)
_target_: sam2.modeling.backbones.image_encoder.FpnNeck # Specifies the FPN class
position_encoding: # Positional encoding for the FPN
_target_: sam2.modeling.position_encoding.PositionEmbeddingSine # Specifies the positional encoding class
num_pos_feats: 256 # Number of positional features
normalize: true # Normalize positional features
scale: null # Scaling factor for positional features
temperature: 10000 # Temperature for positional encoding
d_model: 256 # Dimensionality of the FPN features
backbone_channel_list: [1152, 576, 288, 144] # Channel list for the backbone
fpn_top_down_levels: [2, 3] # Output levels that receive top-down FPN features (levels 0 and 1 use backbone features directly)
fpn_interp_model: nearest # Interpolation method for upsampling in the FPN

memory_attention:
_target_: sam2.modeling.memory_attention.MemoryAttention
d_model: 256
pos_enc_at_input: true
layer:
_target_: sam2.modeling.memory_attention.MemoryAttentionLayer
activation: relu
dim_feedforward: 2048
dropout: 0.1
pos_enc_at_attn: false
self_attention:
_target_: sam2.modeling.sam.transformer.RoPEAttention
rope_theta: 10000.0
feat_sizes: [32, 32]
embedding_dim: 256
num_heads: 1
downsample_rate: 1
dropout: 0.1
d_model: 256
pos_enc_at_cross_attn_keys: true
pos_enc_at_cross_attn_queries: false
cross_attention:
_target_: sam2.modeling.sam.transformer.RoPEAttention
rope_theta: 10000.0
feat_sizes: [32, 32]
rope_k_repeat: True
embedding_dim: 256
num_heads: 1
downsample_rate: 1
dropout: 0.1
kv_in_dim: 64
num_layers: 4
memory_attention: # Configuration for the memory attention module
_target_: sam2.modeling.memory_attention.MemoryAttention # Specifies the memory attention class
d_model: 256 # Dimensionality of features in the memory attention
pos_enc_at_input: true # Apply positional encoding at the input
layer: # Configuration for the layer within the memory attention
_target_: sam2.modeling.memory_attention.MemoryAttentionLayer # Specifies the layer class
activation: relu # Activation function used in the layer
dim_feedforward: 2048 # Dimensionality of the feedforward network
dropout: 0.1 # Dropout rate for the layer
pos_enc_at_attn: false # Apply positional encoding during attention
self_attention: # Configuration for the self-attention mechanism
_target_: sam2.modeling.sam.transformer.RoPEAttention # Specifies the self-attention class
rope_theta: 10000.0 # Theta value for Rotary Position Embedding (RoPE)
feat_sizes: [32, 32] # Feature sizes for RoPE
embedding_dim: 256 # Embedding dimension for RoPE
num_heads: 1 # Number of attention heads
downsample_rate: 1 # Downsampling rate
dropout: 0.1 # Dropout rate for self-attention
d_model: 256 # Dimensionality of the features in the layer
pos_enc_at_cross_attn_keys: true # Apply positional encoding for cross-attention keys
pos_enc_at_cross_attn_queries: false # Apply positional encoding for cross-attention queries
cross_attention: # Configuration for the cross-attention mechanism
_target_: sam2.modeling.sam.transformer.RoPEAttention # Specifies the cross-attention class
rope_theta: 10000.0 # Theta value for RoPE in cross-attention
feat_sizes: [32, 32] # Feature sizes for RoPE in cross-attention
rope_k_repeat: True # Repeat keys for RoPE
embedding_dim: 256 # Embedding dimension for RoPE in cross-attention
num_heads: 1 # Number of attention heads in cross-attention
downsample_rate: 1 # Downsampling rate in cross-attention
dropout: 0.1 # Dropout rate for cross-attention
kv_in_dim: 64 # Dimensionality of key and value inputs for cross-attention
num_layers: 4 # Number of layers in the memory attention module

memory_encoder:
_target_: sam2.modeling.memory_encoder.MemoryEncoder
out_dim: 64
position_encoding:
_target_: sam2.modeling.position_encoding.PositionEmbeddingSine
num_pos_feats: 64
normalize: true
scale: null
temperature: 10000
mask_downsampler:
_target_: sam2.modeling.memory_encoder.MaskDownSampler
kernel_size: 3
stride: 2
padding: 1
fuser:
_target_: sam2.modeling.memory_encoder.Fuser
layer:
_target_: sam2.modeling.memory_encoder.CXBlock
dim: 256
kernel_size: 7
padding: 3
layer_scale_init_value: 1e-6
use_dwconv: True # depth-wise convs
num_layers: 2
memory_encoder: # Configuration for the memory encoder
_target_: sam2.modeling.memory_encoder.MemoryEncoder # Specifies the memory encoder class
out_dim: 64 # Dimensionality of the output from the memory encoder
position_encoding: # Positional encoding for the memory encoder
_target_: sam2.modeling.position_encoding.PositionEmbeddingSine # Specifies the positional encoding class
num_pos_feats: 64 # Number of positional features
normalize: true # Normalize positional features
scale: null # Scaling factor for positional features
temperature: 10000 # Temperature for positional encoding
mask_downsampler: # Configuration for the mask downsampler
_target_: sam2.modeling.memory_encoder.MaskDownSampler # Specifies the mask downsampler class
kernel_size: 3 # Kernel size for downsampling
stride: 2 # Stride for downsampling
padding: 1 # Padding for downsampling
fuser: # Configuration for the feature fuser
_target_: sam2.modeling.memory_encoder.Fuser # Specifies the feature fuser class
layer: # Configuration for the layer within the fuser
_target_: sam2.modeling.memory_encoder.CXBlock # Specifies the layer class
dim: 256 # Dimensionality of the layer
kernel_size: 7 # Kernel size for the layer
padding: 3 # Padding for the layer
layer_scale_init_value: 1e-6 # Initial value for layer scaling
use_dwconv: True # Use depthwise convolution
num_layers: 2 # Number of layers in the fuser

num_maskmem: 7
image_size: 1024
num_maskmem: 7 # Number of memory slots for storing masks
image_size: 1024 # Expected input image size
# apply scaled sigmoid on mask logits for memory encoder, and directly feed input mask as output mask
sigmoid_scale_for_mem_enc: 20.0
sigmoid_bias_for_mem_enc: -10.0
use_mask_input_as_output_without_sam: true
sigmoid_scale_for_mem_enc: 20.0 # Scale applied to the sigmoid of the mask logits fed to the memory encoder
sigmoid_bias_for_mem_enc: -10.0 # Bias added to the sigmoid of the mask logits fed to the memory encoder
use_mask_input_as_output_without_sam: true # Use the input mask directly as the output mask without SAM decoder processing
# Memory
directly_add_no_mem_embed: true
directly_add_no_mem_embed: true # Add the learned "no memory" embedding directly to image features when no memory is available
# use high-resolution feature map in the SAM mask decoder
use_high_res_features_in_sam: true
use_high_res_features_in_sam: true # Use high-resolution features in SAM
# output 3 masks on the first click on initial conditioning frames
multimask_output_in_sam: true
multimask_output_in_sam: true # Output multiple masks in SAM
# SAM heads
iou_prediction_use_sigmoid: True
iou_prediction_use_sigmoid: True # Use sigmoid for IoU prediction
# cross-attend to object pointers from other frames (based on SAM output tokens) in the encoder
use_obj_ptrs_in_encoder: true
add_tpos_enc_to_obj_ptrs: false
only_obj_ptrs_in_the_past_for_eval: true
use_obj_ptrs_in_encoder: true # Use object pointers in the encoder
add_tpos_enc_to_obj_ptrs: false # Add temporal positional encoding to object pointers
only_obj_ptrs_in_the_past_for_eval: true # Only use past object pointers for evaluation
# object occlusion prediction
pred_obj_scores: true
pred_obj_scores_mlp: true
fixed_no_obj_ptr: true
pred_obj_scores: true # Predict object scores
pred_obj_scores_mlp: true # Use MLP for object score prediction
fixed_no_obj_ptr: true # Use a fixed object pointer for "no object"
# multimask tracking settings
multimask_output_for_tracking: true
use_multimask_token_for_obj_ptr: true
multimask_min_pt_num: 0
multimask_max_pt_num: 1
use_mlp_for_obj_ptr_proj: true
multimask_output_for_tracking: true # Output multiple candidate masks during tracking
use_multimask_token_for_obj_ptr: true # Use multi-mask tokens for object pointers
multimask_min_pt_num: 0 # Minimum number of points for multi-mask
multimask_max_pt_num: 1 # Maximum number of points for multi-mask
use_mlp_for_obj_ptr_proj: true # Use MLP for object pointer projection
# Compilation flag
compile_image_encoder: False
compile_image_encoder: False # Flag for compiling the image encoder
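As a side note on the fpn_interp_model: nearest and fpn_top_down_levels comments above, the short sketch below (not from this PR; the tensors and shapes are invented purely for illustration, assuming PyTorch is available) shows the generic top-down FPN step they describe: a coarser pyramid level is upsampled with nearest-neighbor interpolation and merged into the next finer level.

import torch
import torch.nn.functional as F

d_model = 256                              # matches d_model in the neck config above
coarse = torch.randn(1, d_model, 32, 32)   # a higher (coarser) pyramid level
fine = torch.randn(1, d_model, 64, 64)     # the next (finer) pyramid level

# Nearest-neighbor upsampling of the coarse level, then an element-wise merge.
top_down = F.interpolate(coarse, scale_factor=2.0, mode="nearest")
merged = fine + top_down
print(merged.shape)  # torch.Size([1, 256, 64, 64])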