[Feature] Updated with voxel observation mode. Added observation mode config supports #411

Open · wants to merge 15 commits into main · Changes from 8 commits
67 changes: 66 additions & 1 deletion docs/source/user_guide/concepts/observation.md
@@ -92,6 +92,71 @@ For a quick demo to visualize pointclouds, you can run
python -m mani_skill.examples.demo_vis_pcd -e "PushCube-v1"
```

### voxel
This observation mode has the same data format as the [sensor_data mode](#sensor_data), except that all camera sensor data is removed and a new key, `voxel_grid`, is added.

To use this observation mode, you must pass a dictionary of observation config parameters via the `obs_mode_config` keyword argument when the environment is initialized (`gym.make()`). It should contain the following voxelization hyperparameters (a minimal usage sketch follows the list):

- `coord_bounds`: a list of six `torch.float32` values of the form **[x_min, y_min, z_min, x_max, y_max, z_max]**, defining the metric volume to be voxelized.
- `voxel_size`: `torch.int`. The number of voxels along each side of the voxel grid, assuming the grid is cubic.
- `device`: `torch.device`. The device on which voxelization takes place, e.g. **torch.device("cuda" if torch.cuda.is_available() else "cpu")**.
- `segmentation`: `bool`. Whether to estimate voxel segmentation from the point cloud segmentation. If true, num_channels=11 (including one channel for voxel segmentation); otherwise num_channels=10.
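
Below is a minimal sketch of building such a config and creating the environment with it; the bound and resolution values here are illustrative, not required.

```python
import gymnasium as gym
import torch

import mani_skill.envs  # registers the ManiSkill environments

# Illustrative voxelization config: a 3 m cube around the tabletop, 200 voxels per side
voxel_obs_config = dict(
    coord_bounds=[-1.0, -1.0, -1.0, 2.0, 2.0, 2.0],  # [x_min, y_min, z_min, x_max, y_max, z_max]
    voxel_size=200,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    segmentation=True,  # adds a segmentation channel, so num_channels = 11
)

env = gym.make("PushCube-v1", obs_mode="voxel", obs_mode_config=voxel_obs_config)
```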

Then, as you step through the environment and retrieve observations, you will see the extra key `voxel_grid` containing the generated voxel grid:


- `voxel_grid`: a `torch` tensor of shape **[N, voxel_size, voxel_size, voxel_size, num_channels]**, generated by fusing the point cloud and RGB data from all cameras. `N` is the batch size, `voxel_size` is the grid resolution specified in the voxelization config, and `num_channels` is the number of feature channels per voxel.
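
Continuing the sketch above, a quick way to confirm the grid's shape after a reset is:

```python
obs, _ = env.reset(seed=0)
voxel_grid = obs["voxel_grid"]

# With the illustrative config above, this should print torch.Size([1, 200, 200, 200, 11]):
# 11 channels because segmentation=True; it would be 10 with segmentation=False.
print(voxel_grid.shape)
```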


The voxel grid can be visualized as below. The image shows the voxelized PushCube-v1 scene with slightly tuned default hyperparameters. The voxel grid is reconstructed from the front camera only, following the default camera settings of the PushCube-v1 task, so it contains only the voxels visible from the front rather than the entire scene.

```{image} images/voxel_pushcube.png
---
alt: Voxelized PushCube-v1 scene at the initial state
---
```

The RGBD image data used to reconstruct the voxel scene above is shown in the following figure. Here we use only the single base camera of the PushCube-v1 task.

```{image} images/voxel_cam_view_one.png
---
alt: Corresponding RGBD observations
---
```

For a quick demo to visualize voxel grids, you can run

<!-- TODO: add command line args -->
```bash
python -m mani_skill.examples.demo_vis_voxel -e "PushCube-v1" --voxel-size 200 --zoom-factor 2.2 --coord-bounds -1 -1 -1 2 2 2
```

Or, to use just the default settings, simply run

```bash
python -m mani_skill.examples.demo_vis_voxel -e "PushCube-v1"
```

Furthermore, if you use more sensors (currently only RGB and depth cameras) to capture the scene and collect point cloud and RGB data from more poses, you can obtain a more complete and accurate voxel reconstruction of the scene. The figure below shows a more completely reconstructed voxel scene of PushCube-v1 using additional RGBD cameras.

```{image} images/voxel_pushcube_complete.png
---
alt: Densely voxelized PushCube-v1 scene at the initial state
---
```

It is reconstructed using 5 cameras placed above, to the left of, to the right of, in front of, and behind the tabletop scene, respectively, as shown in the visualized RGBD observations below; a sketch of how such a multi-camera setup could be defined follows the figure.

```{image} images/voxel_cam_view_all.png
---
alt: Corresponding RGBD observations
---
```
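
One way to collect such multi-view data is to register a task variant whose sensor configs contain several `CameraConfig` entries. The sketch below is illustrative only: the environment id, the camera poses, and the import paths are assumptions (the overridden property may be named `_sensor_configs` or `_default_sensor_configs` depending on your ManiSkill version), and it is not necessarily the exact setup used for the figures above.

```python
import numpy as np

from mani_skill.envs.tasks import PushCubeEnv  # import path may differ across ManiSkill versions
from mani_skill.sensors.camera import CameraConfig
from mani_skill.utils import sapien_utils
from mani_skill.utils.registration import register_env


@register_env("PushCube-MultiCam-v1", max_episode_steps=50)
class PushCubeMultiCamEnv(PushCubeEnv):
    @property
    def _default_sensor_configs(self):
        # Cameras above, left of, right of, in front of, and behind the tabletop,
        # all looking at the workspace center
        eyes = [[0.0, 0.0, 0.8], [0.0, 0.6, 0.4], [0.0, -0.6, 0.4], [0.6, 0.0, 0.4], [-0.6, 0.0, 0.4]]
        return [
            CameraConfig(
                uid=f"camera_{i}",
                pose=sapien_utils.look_at(eye=eye, target=[0.0, 0.0, 0.1]),
                width=128,
                height=128,
                fov=np.pi / 2,
                near=0.01,
                far=100,
            )
            for i, eye in enumerate(eyes)
        ]
```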



## Segmentation Data

Objects upon being loaded are automatically assigned a segmentation ID (the `per_scene_id` attribute of `sapien.Entity` objects). To get information about which IDs refer to which Actors / Links, you can run the code below
@@ -113,4 +113,4 @@ Note that ID 0 refers to the distant background. For a quick demo of this, you can run
```bash
python -m mani_skill.examples.demo_vis_segmentation -e "PushCube-v1" # plot all segmentations
python -m mani_skill.examples.demo_vis_segmentation -e "PushCube-v1" --id cube # mask everything but the object with name "cube"
```
```
13 changes: 13 additions & 0 deletions docs/source/user_guide/demos/index.md
@@ -190,6 +190,19 @@ python -m mani_skill.examples.demo_vis_rgbd -e "StackCube-v1"
```{figure} images/rgbd_vis.png
```

## Visualize Voxel Data

You can run the following to visualize voxel data. Under the default sensor settings, with only one camera at the front of the scene, it produces the voxelized scene shown below.

```bash
python -m mani_skill.examples.demo_vis_voxel -e "PushCube-v1"
```


```{figure} images/voxel_pushcube.png
```


## Visualize Reset Distributions

Determining how difficult a task might be for ML algorithms like reinforcement learning and imitation learning can heavily depend on the reset distribution of the task. To see what the reset distribution of any task (the result of repeated env.reset calls) looks like you can run the following to save a video to the `videos` folder
19 changes: 17 additions & 2 deletions mani_skill/envs/sapien_env.py
@@ -23,6 +23,7 @@
from mani_skill.envs.utils.observations import (
sensor_data_to_pointcloud,
sensor_data_to_rgbd,
sensor_data_to_voxel
)
from mani_skill.sensors.base_sensor import BaseSensor, BaseSensorConfig
from mani_skill.sensors.camera import (
@@ -66,6 +67,8 @@ class BaseEnv(gym.Env):

sensor_cfgs (dict): configurations of sensors. See notes for more details.

obs_mode_config (dict): configuration hyperparameters for the observation mode (e.g. voxelization parameters for the voxel observation mode). See notes for more details.

human_render_camera_cfgs (dict): configurations of human rendering cameras. Similar usage as @sensor_cfgs.

robot_uids (Union[str, BaseAgent, List[Union[str, BaseAgent]]]): List of robots to instantiate and control in the environment.
@@ -99,7 +102,7 @@ class BaseEnv(gym.Env):
SUPPORTED_ROBOTS: List[Union[str, Tuple[str]]] = None
"""Override this to enforce which robots or tuples of robots together are supported in the task. During env creation,
setting robot_uids auto loads all desired robots into the scene, but not all tasks are designed to support some robot setups"""
SUPPORTED_OBS_MODES = ("state", "state_dict", "none", "sensor_data", "rgb", "rgbd", "pointcloud")
SUPPORTED_OBS_MODES = ("state", "state_dict", "none", "sensor_data", "rgb", "rgbd", "pointcloud", "voxel")
SUPPORTED_REWARD_MODES = ("normalized_dense", "dense", "sparse", "none")
SUPPORTED_RENDER_MODES = ("human", "rgb_array", "sensors")
"""The supported render modes. Human opens up a GUI viewer. rgb_array returns an rgb array showing the current environment state.
@@ -120,6 +123,8 @@ class BaseEnv(gym.Env):
"""all sensor configurations parsed from self._sensor_configs and agent._sensor_configs"""
_agent_sensor_configs: Dict[str, BaseSensorConfig]
"""all agent sensor configs parsed from agent._sensor_configs"""
_obs_mode_config: Dict
"""configurations for converting sensor data to observations under the current observation mode (e.g. voxel size and scene bounds for voxel observations)"""
_human_render_cameras: Dict[str, Camera]
"""cameras used for rendering the current environment retrievable via `env.render_rgb_array()`. These are not used to generate observations"""
_default_human_render_camera_configs: Dict[str, CameraConfig]
@@ -146,6 +151,7 @@ def __init__(
shader_dir: str = "default",
enable_shadow: bool = False,
sensor_configs: dict = None,
obs_mode_config: dict = None,
human_render_camera_configs: dict = None,
robot_uids: Union[str, BaseAgent, List[Union[str, BaseAgent]]] = None,
sim_cfg: Union[SimConfig, dict] = dict(),
@@ -156,6 +162,7 @@
self.num_envs = num_envs
self.reconfiguration_freq = reconfiguration_freq if reconfiguration_freq is not None else 0
self._reconfig_counter = 0
self._obs_mode_config = obs_mode_config
self._custom_sensor_configs = sensor_configs
self._custom_human_render_camera_configs = human_render_camera_configs
self._parallel_gui_render_enabled = parallel_gui_render_enabled
@@ -408,7 +415,7 @@ def get_obs(self, info: Dict = None):
obs = common.flatten_state_dict(state_dict, use_torch=True, device=self.device)
elif self._obs_mode == "state_dict":
obs = self._get_obs_state_dict(info)
elif self._obs_mode in ["sensor_data", "rgbd", "rgb", "pointcloud"]:
elif self._obs_mode in ["sensor_data", "rgbd", "rgb", "pointcloud", "voxel"]:
obs = self._get_obs_with_sensor_data(info)
if self._obs_mode == "rgbd":
obs = sensor_data_to_rgbd(obs, self._sensors, rgb=True, depth=True, segmentation=True)
@@ -417,6 +424,14 @@
obs = sensor_data_to_rgbd(obs, self._sensors, rgb=True, depth=False, segmentation=True)
elif self.obs_mode == "pointcloud":
obs = sensor_data_to_pointcloud(obs, self._sensors)
elif self.obs_mode == "voxel":
# validate _obs_mode_config here, then pass it to the conversion function
assert self._obs_mode_config is not None, "The voxel observation mode requires a config dict passed via the obs_mode_config keyword arg in gym.make(). See the ManiSkill docs for details."
assert "voxel_size" in self._obs_mode_config.keys(), "Missing voxel_size (voxel grid resolution) in observation configs"
assert "coord_bounds" in self._obs_mode_config.keys(), "Missing coord_bounds (coordinate bounds) in observation configs"
assert "device" in self._obs_mode_config.keys(), "Missing device (device for voxelization) in observation configs"
assert "segmentation" in self._obs_mode_config.keys(), "Missing segmentation (a boolean indicating whether to include voxel segmentation) in observation configs"
obs = sensor_data_to_voxel(obs, self._sensors, self._obs_mode_config)
else:
raise NotImplementedError(self._obs_mode)
return obs
1 change: 1 addition & 0 deletions mani_skill/envs/utils/observations/__init__.py
@@ -1 +1,2 @@
from .observations import *
from .voxelizer import *
97 changes: 96 additions & 1 deletion mani_skill/envs/utils/observations/observations.py
@@ -11,7 +11,7 @@
from mani_skill.sensors.base_sensor import BaseSensor, BaseSensorConfig
from mani_skill.sensors.camera import Camera
from mani_skill.utils import common

from mani_skill.envs.utils.observations.voxelizer import VoxelGrid

def sensor_data_to_rgbd(
observation: Dict,
@@ -113,3 +113,98 @@ def sensor_data_to_pointcloud(observation: Dict, sensors: Dict[str, BaseSensor])
# observation["pointcloud"]["segmentation"].numpy().astype(np.uint16)
# )
return observation

def sensor_data_to_voxel(
observation: Dict,
sensors: Dict[str, BaseSensor],
obs_mode_config: Dict
):
"""convert all camera data in sensor to voxel grid"""
sensor_data = observation["sensor_data"]
camera_params = observation["sensor_param"]
coord_bounds = obs_mode_config["coord_bounds"] # [x_min, y_min, z_min, x_max, y_max, z_max] - the metric volume to be voxelized
voxel_size = obs_mode_config["voxel_size"] # size of the voxel grid (assuming cubic)
device = obs_mode_config["device"] # device on which doing voxelization
seg = obs_mode_config["segmentation"] # device on which doing voxelization
pcd_rgb_observations = dict()

# Collect all cameras' observations
for (cam_uid, images), (sensor_uid, sensor) in zip(
sensor_data.items(), sensors.items()
):
assert cam_uid == sensor_uid
if isinstance(sensor, Camera):
cam_data = {}

# Extract point cloud and segmentation data
images: Dict[str, torch.Tensor]
position = images["PositionSegmentation"]
if seg:
segmentation = position[..., 3].clone()
position = position.float()
position[..., 3] = 1 # convert to homogeneous coordinates
position[..., :3] = (
position[..., :3] / 1000.0
) # convert the raw depth from millimeters to meters

# Convert to world space position and update camera data
cam2world = camera_params[cam_uid]["cam2world_gl"]
xyzw = position.reshape(position.shape[0], -1, 4) @ cam2world.transpose(
1, 2
)
xyz = xyzw[..., :3] / xyzw[..., 3].unsqueeze(-1) # dehomogenize
cam_data["xyz"] = xyz
if seg:
cam_data["seg"] = segmentation.reshape(segmentation.shape[0], -1, 1)

# Extract rgb data
if "Color" in images:
rgb = images["Color"][..., :3].clone()
rgb = rgb / 255 # convert to range [0, 1]
cam_data["rgb"] = rgb.reshape(rgb.shape[0], -1, 3)

pcd_rgb_observations[cam_uid] = cam_data

# just free sensor_data to save memory
for k in pcd_rgb_observations.keys():
del observation["sensor_data"][k]

# merge features from different cameras together
pcd_rgb_observations = common.merge_dicts(pcd_rgb_observations.values())
for key, value in pcd_rgb_observations.items():
pcd_rgb_observations[key] = torch.concat(value, axis=1)

# prepare features for voxel conversion
xyz_dev = pcd_rgb_observations["xyz"].to(device)
rgb_dev = pcd_rgb_observations["rgb"].to(device)
if seg:
seg_dev = pcd_rgb_observations["seg"].to(device)
coord_bounds = torch.tensor(coord_bounds, device=device).unsqueeze(0)
batch_size = xyz_dev.shape[0]
max_num_coords = rgb_dev.shape[1]
vox_grid = VoxelGrid(
coord_bounds=coord_bounds,
voxel_size=voxel_size,
device=device,
batch_size=batch_size,
feature_size=3,
max_num_coords=max_num_coords,
)

# convert to the batched voxel grids
# per-voxel features: 3 (pcd xyz coordinates) + 3 (rgb) + 3 (voxel xyz indices) + 1 (seg id, if enabled) + 1 (occupancy), i.e. 11 channels with segmentation and 10 without
if seg: # add voxel segmentations
voxel_grid = vox_grid.coords_to_bounding_voxel_grid(xyz_dev,
coord_features=rgb_dev,
coord_bounds=coord_bounds,
clamp_vox_id=True,
pcd_seg=seg_dev)
else: # no voxel segmentation
voxel_grid = vox_grid.coords_to_bounding_voxel_grid(xyz_dev,
coord_features=rgb_dev,
coord_bounds=coord_bounds,
clamp_vox_id=False)

# update voxel grids to the observation dict
observation["voxel_grid"] = voxel_grid
return observation