
[Feature Request] use one pass to compute mean and variance of recorded data #452

Open
tanjunyao7 opened this issue Sep 25, 2024 · 3 comments
tanjunyao7 commented Sep 25, 2024

Hi,

first of all, thanks for the great work.

I recorded 50 episodes with a real robot, each episode lasting 20 seconds. When recording finishes, the statistics of the data are computed for normalization. However, this computation takes almost an hour. After investigating the code, I found that it iterates over the data twice: first to compute the mean, then the variance.

first_batch = None
running_item_count = 0  # for online mean computation
dataloader = create_seeded_dataloader(dataset, batch_size, seed=1337)
for i, batch in enumerate(
    tqdm.tqdm(dataloader, total=ceil(max_num_samples / batch_size), desc="Compute mean, min, max")
):
    this_batch_size = len(batch["index"])
    running_item_count += this_batch_size
    if first_batch is None:
        first_batch = deepcopy(batch)
    for key, pattern in stats_patterns.items():
        batch[key] = batch[key].float()
        # Numerically stable update step for mean computation.
        batch_mean = einops.reduce(batch[key], pattern, "mean")
        # Hint: to update the mean we need x̄ₙ = (Nₙ₋₁x̄ₙ₋₁ + Bₙxₙ) / Nₙ, where the subscript represents
        # the update step, N is the running item count, B is this batch size, x̄ is the running mean,
        # and x is the current batch mean. Some rearrangement is then required to avoid risking
        # numerical overflow. Another hint: Nₙ₋₁ = Nₙ - Bₙ. Rearrangement yields
        # x̄ₙ = x̄ₙ₋₁ + Bₙ * (xₙ - x̄ₙ₋₁) / Nₙ
        mean[key] = mean[key] + this_batch_size * (batch_mean - mean[key]) / running_item_count
        max[key] = torch.maximum(max[key], einops.reduce(batch[key], pattern, "max"))
        min[key] = torch.minimum(min[key], einops.reduce(batch[key], pattern, "min"))
    if i == ceil(max_num_samples / batch_size) - 1:
        break

first_batch_ = None
running_item_count = 0  # for online std computation
dataloader = create_seeded_dataloader(dataset, batch_size, seed=1337)
for i, batch in enumerate(
    tqdm.tqdm(dataloader, total=ceil(max_num_samples / batch_size), desc="Compute std")
):
    this_batch_size = len(batch["index"])
    running_item_count += this_batch_size
    # Sanity check to make sure the batches are still in the same order as before.
    if first_batch_ is None:
        first_batch_ = deepcopy(batch)
        for key in stats_patterns:
            assert torch.equal(first_batch_[key], first_batch[key])
    for key, pattern in stats_patterns.items():
        batch[key] = batch[key].float()
        # Numerically stable update step for mean computation (where the mean is over squared
        # residuals). See notes in the mean computation loop above.
        batch_std = einops.reduce((batch[key] - mean[key]) ** 2, pattern, "mean")
        std[key] = std[key] + this_batch_size * (batch_std - std[key]) / running_item_count
    if i == ceil(max_num_samples / batch_size) - 1:
        break

I believe both the mean and variance can be computed in a single pass, roughly halving the total computation time. Are there any plans for this improvement?
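For reference, a single-pass update could look roughly like the sketch below: a batched form of Welford's algorithm using the Chan et al. parallel-merge update. The function name and the `batches` iterable are illustrative stand-ins, not from the repo.

import torch

# Sketch of a one-pass running mean/std (batched Welford / Chan et al. update).
# `batches` stands in for the seeded dataloader; names are illustrative.
def one_pass_mean_std(batches):
    count = 0
    mean = None
    m2 = None  # running sum of squared deviations from the current mean
    for x in batches:  # x: (batch, features) float tensor
        b = x.shape[0]
        batch_mean = x.mean(dim=0)
        batch_m2 = ((x - batch_mean) ** 2).sum(dim=0)
        if mean is None:
            count, mean, m2 = b, batch_mean, batch_m2
            continue
        delta = batch_mean - mean
        new_count = count + b
        mean = mean + delta * b / new_count
        # Merge the two groups' squared deviations (numerically stable).
        m2 = m2 + batch_m2 + delta ** 2 * count * b / new_count
        count = new_count
    return mean, torch.sqrt(m2 / count)  # population std

On random data this should match x.mean(0) and x.std(0, unbiased=False) to float precision, and unlike the E[x²] − E[x]² identity it stays stable even when the mean is large relative to the spread.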

Cadene commented Sep 25, 2024

@tanjunyao7 Yes! It's on our todo list, but we don't have the bandwidth as of now. If you have time, could you please create a PR? That would be extremely helpful!!!

cc @michel-aractingi for visibility

tanjunyao7 (Author) commented:
Yes, I could create a PR. I'll close this issue.


tanjunyao7 commented Sep 26, 2024

Sorry, I decided to paste the code here since I don't have time to write a test script. It has been manually tested by comparing the original result with the new result on the same data. Here is the code snippet:

first_batch = None
running_item_count = 0.0
dataloader = create_seeded_dataloader(dataset, batch_size, seed=1337)
for i, batch in enumerate(
    tqdm.tqdm(dataloader, total=ceil(max_num_samples / batch_size), desc="Compute mean, min, max")
):
    this_batch_size = len(batch["index"])

    if first_batch is None:
        first_batch = deepcopy(batch)
    for key, pattern in stats_patterns.items():
        batch_key = batch[key].float()
        batch_mean = einops.reduce(batch_key, pattern, "mean")
        batch_sq_mean = einops.reduce(batch_key**2, pattern, "mean")

        mean[key] = (running_item_count * mean[key] + this_batch_size * batch_mean) / (
            running_item_count + this_batch_size
        )

        # As of now this holds the running mean of squares, not the std.
        std[key] = (running_item_count * std[key] + this_batch_size * batch_sq_mean) / (
            running_item_count + this_batch_size
        )

        max[key] = torch.maximum(max[key], einops.reduce(batch_key, pattern, "max"))
        min[key] = torch.minimum(min[key], einops.reduce(batch_key, pattern, "min"))
    running_item_count += this_batch_size
    if i == ceil(max_num_samples / batch_size) - 1:
        break

# Convert the running mean of squares into a standard deviation: Var[x] = E[x^2] - E[x]^2.
# The clamp guards against tiny negative values caused by floating-point rounding.
for key in stats_patterns.keys():
    std[key] = torch.sqrt(torch.clamp(std[key] - mean[key] ** 2, min=0.0))
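One caveat with the E[x²] − E[x]² identity: floating-point rounding can make the difference slightly negative (hence the clamp before the sqrt), and precision degrades when the mean is large relative to the std. A quick sanity check against torch's direct computation could look like this (synthetic data and tolerances are illustrative):

import torch

# Hypothetical sanity check: compare the single-pass E[x^2] - E[x]^2 result
# against torch's direct mean/std on the same synthetic data.
x = torch.randn(10_000, 6) * 3.0 + 5.0
mean = x.mean(dim=0)
var = (x ** 2).mean(dim=0) - mean ** 2
std = torch.sqrt(var.clamp(min=0.0))
assert torch.allclose(std, x.std(dim=0, unbiased=False), atol=1e-3)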
