Neural Radiance Field (NeRF)
Introduction
This project is all about implementing a Neural Radiance Field (NeRF) for 3D scene reconstruction from multi-view images. We will first calibrate our camera and capture a 3D scan of our own scene, which will serve as our dataset. Before we dive into the implementation of 3D NeRF, we will explore a 2D case first to get ourselves familiar with the key concepts. Then, we will move on to the 3D NeRF implementation, for both the given Lego dataset and our own data. My resized Labubu photos and camera calibration photos are available here.
Part 0: Camera Calibration and 3D Scanning
In this part, we will take a 3D scan of our own object for building a NeRF later. We will use ArUco tags to calibrate our camera and then estimate the pose of each image after undistorting them. The calibration intrinsics and estimated extrinsics, together with the undistorted images, will be saved as a dataset for training our NeRF later.
0.1: Calibrating The Camera
Images of the ArUco tags, taken under the same zoom but from different angles and distances, were used to calibrate the camera. Since real cameras introduce lens distortion, the smartphone camera used for capture was calibrated explicitly so that its distortion can be corrected later. The pipeline is as follows:
- Loop through all the calibration images
- For each image, detect the ArUco tags using OpenCV's cv2.aruco.detectMarkers (also handle the case where no tags are detected)
- Extract the corner coordinates from the detected tags
- Collect all detected corners and their corresponding 3D world coordinates by referring to the ArUco tags as the world origin
- Use cv2.calibrateCamera to compute the camera intrinsics and distortion coefficients
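To make this concrete, here is a minimal sketch of that calibration loop. It is illustrative only: the image folder, the choice of ArUco dictionary, and the board_corners_3d helper (which would return the known 3D corner positions of each tag on the board) are assumptions, and the cv2.aruco detection API differs slightly across OpenCV versions.

```python
# A minimal calibration sketch, assuming resized images in calib_resized/ and a
# hypothetical board_corners_3d(tag_id) helper that returns the (4, 3) world
# coordinates of a tag's corners based on the known board layout.
import glob
import cv2
import numpy as np

aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_6X6_250)
obj_points, img_points = [], []   # per-image 3D board corners and 2D detections
img_size = None

for path in glob.glob("calib_resized/*.jpg"):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img_size = gray.shape[::-1]                                   # (W, H)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)   # older-style aruco API
    if ids is None:                                               # skip images with no tags
        continue
    img_pts = np.concatenate([c.reshape(4, 2) for c in corners]).astype(np.float32)
    obj_pts = np.concatenate([board_corners_3d(i) for i in ids.flatten()]).astype(np.float32)
    img_points.append(img_pts)
    obj_points.append(obj_pts)

rms, K, dist, _, _ = cv2.calibrateCamera(obj_points, img_points, img_size, None, None)
print(f"RMS reprojection error: {rms:.2f} px")
```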
Resizing raw images. The original photos were too large to process efficiently, so they were resized first, and all of the following parts of this project work with the resized images. Since all calibration and later pose estimation are also performed on the resized images, the estimated intrinsics correspond to the resized resolution, so the NeRF pipeline remains consistent.
Results: The RMS reprojection error is around 0.71 pixels, which is within a reasonable range for a smartphone camera and resized images, indicating that the calibration is successful.
0.2: Capturing a 3D Object Scan
Here, I cut one of the ArUco tags from the board used in Part 0.1 and placed it next to the object to be scanned; in my case, I scanned my Labubu toy. To get good NeRF results later, we follow the guidelines below when capturing the photos:
- Take all photos under the same zoom as used in Part 0.1 (24mm focal length without digital zoom in my case).
- Keep the brightness and exposure consistent across all photos.
- Keep our hands steady while taking photos to avoid motion blur.
- Capture photos from different angles around the object to cover all sides.
- Keep the same distance from the object for all photos, i.e., photos distributed like a dome.
- Keep the object reasonably centered in each frame and ensure it occupies enough pixels for NeRF to learn fine details.
0.3: Estimating Camera Pose
With the intrinsics from camera calibration, together with the photos taken in Part 0.2, we can estimate camera poses (position and orientation) for each image of our Labubu scan. This is a Perspective-n-Point (PnP) problem: estimating camera extrinsics given a set of known 3D points in the world coordinate system and their corresponding 2D projections in an image.
Here we use OpenCV's cv2.solvePnP function to solve this problem. The implemented pipeline is roughly as follows:
- Loop through all the images.
- For each image, detect the ArUco tag that we placed next to our object.
- Extract the corner coordinates from the detected tag.
- Define the 3D-2D correspondences between the known 3D points and the detected 2D points. Here, I define the ArUco tag's top-left corner as the world origin, and the other corners follow from the known physical tag size (6 cm × 6 cm).
- Call cv2.solvePnP() with the 3D-2D correspondences, camera intrinsics, and distortion coefficients to estimate the rotation and translation vectors.
- If successful, convert the world-to-camera transformation to camera-to-world extrinsics and save them; otherwise, return an identity matrix.
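As a concrete illustration, below is a minimal sketch of this per-image pose estimation. The estimate_c2w name and the aruco_dict argument are illustrative; K and dist are the intrinsics and distortion coefficients from Part 0.1, and the 6 cm tag size matches the correspondence definition above.

```python
# A minimal per-image pose estimation sketch (hypothetical helper names).
import cv2
import numpy as np

tag_size = 0.06  # 6 cm ArUco tag
# 3D corners of the tag, with its top-left corner as the world origin
# (order matches detectMarkers: TL, TR, BR, BL)
obj_pts = np.array([[0, 0, 0],
                    [tag_size, 0, 0],
                    [tag_size, tag_size, 0],
                    [0, tag_size, 0]], dtype=np.float32)

def estimate_c2w(img, K, dist, aruco_dict):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)
    if ids is None:
        return None                       # no tag visible: skip this image
    img_pts = corners[0].reshape(4, 2).astype(np.float32)
    ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, dist)
    if not ok:
        return np.eye(4)                  # fall back to identity, as described above
    R, _ = cv2.Rodrigues(rvec)            # rotation vector -> rotation matrix
    w2c = np.eye(4)
    w2c[:3, :3], w2c[:3, 3] = R, tvec.ravel()
    return np.linalg.inv(w2c)             # world-to-camera -> camera-to-world
```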
Results: Some images did not contain a clearly visible tag and were therefore skipped. This is expected and does not harm the reconstruction, as long as we retain enough diverse viewpoints. At the end of this process, we obtained 139 valid images with their estimated camera extrinsics. To verify that the estimated camera poses are reasonable, screenshots of the estimated camera poses rendered in Viser are shown below. The frustums form a dome-like distribution around the Labubu toy, which matches the way I captured the photos in Part 0.2 and confirms that the pose estimation pipeline is working correctly.
Visualizing Cloud of Cameras in Viser
It can be observed that our cloud of cameras forms a dome-like distribution around the Labubu, matching our capture setup. Careful photographing here pays off when we train a NeRF on this data later (as we will see).
Cloud of Cameras 1
Cloud of Cameras 2
Cloud of Cameras 3
0.4: Undistorting images and creating a dataset
Now, with the camera intrinsics and pose estimates, we can undistort our images and create a dataset for training NeRF in the later parts. For the division of training and validation sets, I randomly selected $80\%$ of the valid images for training and the remaining $20\%$ for validation. The undistorted images, along with the camera poses (c2ws) and focal length extracted from the intrinsic matrix, were saved in a .npz file with the same format as the provided synthetic Lego dataset, ensuring compatibility with the NeRF implementation in the subsequent parts.
Part 1: Fit a Neural Field to a 2D Image
Before jumping into 3D NeRF ($F: \{x,y,z,\theta,\phi\} \to \{r,g,b,\sigma\}$), let's first play with the idea in a 2D example. In 2D, NeRF reduces to a plain Neural Field ($F: \{u,v\} \to \{r,g,b\}$), since there is no concept of radiance in 2D space.
In this part, we will create a neural field that can represent a 2D image and optimize it to fit this image. Specifically, let's train a Multilayer Perceptron (MLP) network with Sinusoidal Positional Encoding (PE) as our neural field $F$ that maps 2D pixel coordinates to the corresponding RGB colors. Some implementation details are introduced as follows.
Model Architecture and Training Configuration
Positional Encoding. To help the MLP learn high-frequency details in the image effectively, we apply Sinusoidal Positional Encoding to the pixel coordinates, together with the original coordinates, as our input to the MLP. Given an original input $x$, the PE with $L$ as its highest frequency level is defined as: $$ PE(x) = [x, \sin(2^0 \pi x), \cos(2^0 \pi x), \sin(2^1 \pi x), \cos(2^1 \pi x), \ldots, \sin(2^{L-1} \pi x), \cos(2^{L-1} \pi x)] $$ which applies to each dimension of the input independently, hence expanding an input $x \in \mathbb{R}^d$ to dimension $d \times (2L + 1)$.
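As a reference, a minimal PyTorch sketch of this encoding could look like the following (the function name positional_encoding is illustrative):

```python
# A minimal sketch of the sinusoidal positional encoding defined above.
import math
import torch

def positional_encoding(x: torch.Tensor, L: int) -> torch.Tensor:
    """x: (..., d). Returns (..., d * (2L + 1)), keeping the raw input as well."""
    feats = [x]
    for i in range(L):
        feats.append(torch.sin((2.0 ** i) * math.pi * x))
        feats.append(torch.cos((2.0 ** i) * math.pi * x))
    return torch.cat(feats, dim=-1)

# e.g. 2D pixel coordinates (normalized to [0, 1]) with L=10 -> 2 * (2*10 + 1) = 42 channels
uv = torch.rand(4096, 2)
print(positional_encoding(uv, L=10).shape)  # torch.Size([4096, 42])
```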
MLP architecture. The MLP architecture for this part follows the suggested structure from the problem statement. Later parts may change some hyperparameters here and there, but they all build on the structure below as a foundation:
- Positional Encoder: apply sinusoidal PE to project the input space from in_dim=2 to in_dim=2*(2L+1), where L is the highest frequency level.
- Hidden layers: 3 fully connected linear layers with hidden_dim=256 and ReLU activations.
- Output layer: a fully connected linear layer with out_dim=3 followed by a Sigmoid activation to ensure the output colors are in [0, 1].
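Put together, a minimal sketch of this network (reusing the positional_encoding helper above; the class name Mlp2d is illustrative) could be:

```python
# A minimal 2D neural field MLP following the structure listed above.
import torch.nn as nn

class Mlp2d(nn.Module):
    def __init__(self, L: int = 10, hidden_dim: int = 256):
        super().__init__()
        self.L = L
        in_dim = 2 * (2 * L + 1)          # PE-expanded input dimension
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, uv):                # uv: (N, 2), normalized to [0, 1]
        return self.net(positional_encoding(uv, self.L))
```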
Dataset class. A custom dataset class is inherited from torch.utils.data.Dataset that loads the image and prepares the (pixel coordinate, color) pairs for training, both of which are normalized within $[0, 1]$ range to facilitate training.
Data loader. A data loader function is implemented to randomly sample N (pixel coordinate, color) pairs from the dataset class for each training iteration.
Trainer & training configuration. The mean squared error (MSE) loss (torch.nn.MSELoss) between the predicted colors and the ground truth is used as the objective. The model is optimized using the Adam (torch.optim.Adam) optimizer with a learning rate of lr=1e-2 for n_iters=10000 iterations. Each iteration randomly samples N=10000 pixels from the image to compute the loss and update the model. The model performance is evaluated with the PSNR metric, which (for colors normalized to [0, 1]) is computed from MSE by:
$$
PSNR = 10 \cdot \log_{10}\left(\frac{1}{MSE}\right)
$$
Since we are fitting a single image, the validation set is the same as the training set here.
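For concreteness, one training iteration under this configuration might look as follows. This is a sketch: sample_batch stands in for the dataloader described above, and dataset for the custom dataset class.

```python
# A minimal training-loop sketch for the 2D neural field (hypothetical sampler).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Mlp2d(L=10, hidden_dim=256).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = torch.nn.MSELoss()

for it in range(10000):                              # n_iters
    uv, rgb = sample_batch(dataset, N=10000)         # hypothetical sampler: (N, 2), (N, 3)
    uv, rgb = uv.to(device), rgb.to(device)
    loss = criterion(model(uv), rgb)                 # MSE between predicted and true colors
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if it % 500 == 0:
        psnr = 10.0 * torch.log10(1.0 / loss.detach())   # PSNR from MSE, colors in [0, 1]
        print(f"iter {it}: PSNR = {psnr.item():.2f} dB")
```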
Fox Example: Fox Image & Ablation Study
Here we start from this image of a fox. To see how the hyperparameters of layer width (channel size, or hidden_dim) and max frequency $L$ for the PE affect the fitting quality, we conduct a small ablation study by training the model with different combinations of them. Four configurations are tested:
- hidden_dim=32, L=2
- hidden_dim=32, L=10
- hidden_dim=256, L=2
- hidden_dim=256, L=10
All other configurations remain the same as described before:
- Number of iterations: n_iters=10000
- Learning rate: lr=1e-2
- Number of sampled pixels per iteration: N=10000
- Number of hidden layers: 3
Fox: Comparison of Different Configurations
From the training logs shown below, it can be observed that increasing both the layer width (hidden_dim) and the maximum frequency level (L) leads to better fitting quality, as indicated by higher PSNR values. It makes sense because a wider network has a larger capacity to learn complex mappings, and a higher frequency level allows the model to capture finer details in the image.
Fox: Visualization of Reconstructed Images During Training
Now, let's visualize the reconstructed images at different training iterations for all of the configurations. The images below show the model's output at various training stages, demonstrating how the neural field progressively learns to represent the fox image more accurately over time. With higher layer width and frequency level, the model converges faster and achieves higher quality.
Labubu Example: Reconstruct Our Own Image
Here, let's apply the same neural field fitting pipeline to reconstruct my own image. The model and training configurations are the same as the best-performing one in the ablation study of the fox image above. The reconstructed image is one of our resized Labubu photos taken in Part 0.
Part 2: Fit a Neural Radiance Field from Multi-view Images
Once we are familiar with the 2D case, we can proceed to the more interesting task of using a neural radiance field to represent a 3D space, through inverse rendering with multi-view calibrated images as the supervision signal. For this part, we will use the Lego scene from the original paper but with lower resolution ($200 \times 200$) and preprocessed cameras (available from here).
Let's first build the bricks for our NeRF implementation, including creating rays from the estimated camera, sampling points along each ray, and volume rendering, which will be used together to form the NeRF training and inference pipeline.
Part 2.1: Create Rays from Cameras
Recall that NeRF is a volumetric representation of a 3D scene, where each point in space emits radiance in different directions. To render an image from a given camera viewpoint, we need to generate rays that originate from the camera and pass through each pixel in the image plane. Therefore, here we need a function to convert a pixel coordinate to a ray with origin and normalized direction, say, ray_o, ray_d = px2ray(K, c2w, uv). Instead of implementing the full function directly, let's first decompose it into multiple steps, each of which can be implemented and verified separately, and then we can combine them to form the final function.
- get_homoge_coords(x): convert Euclidean coordinates x into homogeneous coordinates by appending a 1 to each coordinate.
- homoge_back(x_h): convert homogeneous coordinates x_h back to Euclidean coordinates by dividing by the last coordinate and removing it.
Camera to World Coordinate Conversion. The transformation between the world space $\mathbf{X}_w = (x_w, y_w, z_w)$ and the camera space $\mathbf{X}_c = (x_c, y_c, z_c)$ is a rigid body transformation composed of a rotation ($\mathbf{R} \in \mathbb{R}^{3 \times 3}$) and a translation ($\mathbf{t} \in \mathbb{R}^{3}$). The whole transformation can be represented as a single linear transformation in the homogeneous coordinate system:
$$ \begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} = \mathbf{T}_{c \gets w} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} $$ in which the $\mathbf{T}_{c \gets w}$ is called world-to-camera (w2c) transformation matrix. Similarly, we can have the camera-to-world (c2w) transformation matrix $\mathbf{T}_{w \gets c}$ by inverting $\mathbf{T}_{c \gets w}$.
Implementation details: I implemented a transform(c2w, x_c) function that maps camera coordinates x_c to world coordinates using the c2w transformation matrix, with the following steps:
- Convert x_c into homogeneous coordinates using get_homoge_coords, producing tensors of shape (N, 4) for a single camera or (B, N, 4) for batched cameras.
- Apply the camera-to-world transform using np.einsum, supporting both (N, 3) and (B, N, 3) inputs:
  - single camera: x_w_h = np.einsum('ij,nj->ni', c2w, x_c_h)
  - batched cameras: x_w_h = np.einsum('bij,bnj->bni', c2w, x_c_h)
- Convert the transformed points back to Euclidean coordinates using homoge_back, obtaining the final world-space coordinates.
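A minimal NumPy sketch of this function, including the two homogeneous-coordinate helpers introduced above, might look like:

```python
# A minimal sketch of transform(c2w, x_c) following the steps above.
import numpy as np

def get_homoge_coords(x):
    # append a 1 to the last dimension: (..., 3) -> (..., 4)
    return np.concatenate([x, np.ones((*x.shape[:-1], 1))], axis=-1)

def homoge_back(x_h):
    # divide by the last coordinate and drop it: (..., 4) -> (..., 3)
    return x_h[..., :-1] / x_h[..., -1:]

def transform(c2w, x_c):
    x_c_h = get_homoge_coords(x_c)                    # (N, 4) or (B, N, 4)
    if c2w.ndim == 2:                                 # single camera
        x_w_h = np.einsum('ij,nj->ni', c2w, x_c_h)
    else:                                             # batched cameras
        x_w_h = np.einsum('bij,bnj->bni', c2w, x_c_h)
    return homoge_back(x_w_h)                         # (N, 3) or (B, N, 3)
```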
Pixel to Camera Coordinate Conversion. Consider a pinhole camera with focal length $(f_x, f_y)$ and principal point $(o_x, o_y)$ where $o_x = W / 2$ and $o_y = H / 2$. The intrinsic matrix $\mathbf{K}$ of this camera can be represented as: $$ \mathbf{K} = \begin{bmatrix} f_x & 0 & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix} $$ which can be used to project a 3D point $(x_c, y_c, z_c)$ in the camera coordinate system to a 2D point $(u,v)$ in the image pixel coordinate system: $$ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} $$ in which $s = z_c$ is the depth of this point along the optical axis. Here, we implement an inverse mapping of this projection, i.e., from pixel coordinate $(u,v)$ to camera coordinate $(x_c, y_c, z_c)$ given a specific depth $s$ and the intrinsic matrix $\mathbf{K}$: $$ \mathbf{x}_c = s \cdot \mathbf{K}^{-1} [u, v, 1]^{\top} $$
Implementation details: I implemented a px2cam(K, uv, s) function that maps pixel coordinates uv to camera coordinates:
- Convert uv to homogeneous coordinates uv_h = (u, v, 1) using get_homoge_coords.
- Compute the inverse intrinsic matrix K_inv = np.linalg.inv(K).
- Apply the inverse projection x_c = s * K_inv @ uv_h using np.einsum, supporting both (N, 2) and batched (B, N, 2) inputs.
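A minimal sketch of px2cam, consistent with the steps above, might be:

```python
# A minimal sketch of px2cam(K, uv, s) following the steps above.
import numpy as np

def px2cam(K, uv, s=1.0):
    uv_h = get_homoge_coords(uv)                      # (N, 3) or (B, N, 3)
    K_inv = np.linalg.inv(K)
    if uv_h.ndim == 2:                                # single set of pixels
        x_c = np.einsum('ij,nj->ni', K_inv, uv_h)
    else:                                             # batched pixels
        x_c = np.einsum('ij,bnj->bni', K_inv, uv_h)
    return s * x_c                                    # scale by the chosen depth
```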
Pixel to Ray. A ray can be defined by an origin vector $\mathbf{r}_o \in \mathbb{R}^3$ and a direction vector $\mathbf{r}_d \in \mathbb{R}^3$. In the case of a pinhole camera, we want to know the $\{\mathbf{r}_o, \mathbf{r}_d\}$ for every pixel $(u,v)$. The origin $\mathbf{r}_o$ of those rays is easy to get because it is just the location of the camera in world coordinates. For the $\mathbf{T}_{w \gets c}$ transformation matrix, the camera origin is simply the translation component: $$ \mathbf{r}_o = \mathbf{t}_{w \gets c} $$
To calculate the ray direction for pixel $(u,v)$, we can simply choose a point along this ray with depth equal to 1 ($s=1$) and find its coordinate in world space $\mathbf{X}_w = (x_w, y_w, z_w)$ using our previously defined transformations, then the normalized direction vector can be computed as: $$ \mathbf{r}_d = \frac{\mathbf{X}_w - \mathbf{r}_o}{\|\mathbf{X}_w - \mathbf{r}_o\|_2} $$
Implementation details: The implemented px2ray(K, c2w, uv) function combines the above steps to compute the ray origin and direction for pixel coordinates uv, by:
- Convert pixel coordinates uv to camera coordinates using x_c = px2cam(K, uv, s=1.0).
- Convert camera coordinates x_c to world coordinates using x_w = transform(c2w, x_c).
- Extract the camera origin r_o from the translation component of c2w.
- Compute the ray direction r_d by normalizing the vector from r_o to x_w.
- Finally, return r_o and r_d.
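Combining the two helpers above, a sketch of px2ray could look like:

```python
# A minimal sketch of px2ray(K, c2w, uv), combining the helpers above.
import numpy as np

def px2ray(K, c2w, uv):
    x_c = px2cam(K, uv, s=1.0)                 # a point at depth 1 on each ray
    x_w = transform(c2w, x_c)                  # lift it to world space
    r_o = c2w[..., :3, 3]                      # camera center = translation of c2w
    r_o = np.broadcast_to(r_o[..., None, :], x_w.shape)
    r_d = x_w - r_o
    r_d = r_d / np.linalg.norm(r_d, axis=-1, keepdims=True)   # normalize directions
    return r_o, r_d
```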
Part 2.2: Sampling
Recall that the training signal of NeRF comes from the rendered images, which are generated by casting rays from the camera through each pixel and sampling points along these rays to query the neural radiance field. Before building a volumetric renderer, we first need to implement the functions to sample points along rays, which can be broken down into two steps: (1) sampling rays from images and (2) sampling points along each ray.
Dataset Class. Similar to what we did in Part 1 for the 2D case, here we again create a custom dataset class DatasetNerf that loads the multi-view images images, their extrinsics c2ws, and intrinsics K. Its methods include:
- __init__(): Load the images, camera extrinsics, and intrinsics, compute the pixel coordinates uv with a 0.5 offset, and normalize the colors.
- __len__(): Return the number of pixels.
- __getitem__(): Return the pixel coordinates, normalized colors, and the c2w matrix of the current view.
Sampling Rays from Images. Here we extend what we did in Part 1 (sample pixel coordinates and colors), together with the camera intrinsics and extrinsics, to convert the pixel coordinates to ray origins and directions.
Implementation details: The implemented sample_rays_from_imgs(dataset, N_rays) function randomly samples N_rays rays from multi-view images in the dataset by:
- Randomly select N_rays pixels from all images in the dataset.
- Return the selected pixel coordinates uv, colors, and corresponding c2w matrices, with shapes (N_rays, 2), (N_rays, 3), and (N_rays, 4, 4) respectively.
Sampling Points along Rays. After having rays, we need to discretize each ray into samples that live in the 3D space. The simplest way is to uniformly sample points along the ray (t = np.linspace(near, far, n_samples)). For the Lego scene that we have, we can set near=2.0 and far=6.0. The actual 3D coordinates can be acquired by $\mathbf{x} = \mathbf{r}_o + \mathbf{r}_d \cdot t$. However, this would lead to a fixed set of 3D points, which could potentially lead to overfitting when we train the NeRF later on. On top of this, we want to introduce some small perturbation on the points only during training, so that every location along the ray is touched upon during training. This can be achieved by something like t = t + np.random.rand(*t.shape) * t_width, where t is initialized to the start of each interval before the perturbation.
Implementation details: The implemented sample_pts_from_rays(r_os, r_ds, N_pts, near, far, perturbation) function samples points along each ray defined by ray_o and ray_d:
- Get the number of rays from N_rays = r_os.shape[0].
- Create N_pts + 1 depth boundaries between near and far using torch.linspace, and derive the interval starts t_start and widths t_width, both of shape (N_pts,).
- If perturbation=True, sample a random offset noise in [0, 1) for each ray and each interval, and set t = t_start[None, :] + noise * t_width[None, :], so that each t lies inside its interval. Otherwise, use the interval midpoints.
- Compute the 3D sample positions with pts = r_os[:, None, :] + r_ds[:, None, :] * t[..., None], producing pts of shape (N_rays, N_pts, 3).
- Prepare expanded versions of the ray origins and directions for later rendering: r_os_expand and r_ds_expand with shape (N_rays, N_pts, 3), the per-interval distances deltas of shape (N_rays, N_pts, 1) obtained from t_width, and the sampled depths ts of shape (N_rays, N_pts, 1).
These outputs will be used directly by the volume rendering module in the next part to accumulate colors and depths along each ray.
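A minimal PyTorch sketch consistent with the steps above (the default arguments use the Lego settings) could be:

```python
# A minimal sketch of sample_pts_from_rays; r_os and r_ds are (N_rays, 3) tensors.
import torch

def sample_pts_from_rays(r_os, r_ds, N_pts=64, near=2.0, far=6.0, perturbation=True):
    N_rays = r_os.shape[0]
    edges = torch.linspace(near, far, N_pts + 1)        # (N_pts + 1,) interval boundaries
    t_start, t_width = edges[:-1], edges[1:] - edges[:-1]
    if perturbation:
        noise = torch.rand(N_rays, N_pts)               # jitter each sample inside its interval
        t = t_start[None, :] + noise * t_width[None, :]
    else:
        t = (t_start + 0.5 * t_width)[None, :].expand(N_rays, N_pts)   # interval midpoints
    pts = r_os[:, None, :] + r_ds[:, None, :] * t[..., None]           # (N_rays, N_pts, 3)
    r_os_expand = r_os[:, None, :].expand(-1, N_pts, -1)
    r_ds_expand = r_ds[:, None, :].expand(-1, N_pts, -1)
    deltas = t_width[None, :, None].expand(N_rays, -1, -1)             # (N_rays, N_pts, 1)
    ts = t[..., None]                                                  # (N_rays, N_pts, 1)
    return pts, r_os_expand, r_ds_expand, deltas, ts
```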
Part 2.3: Putting the Dataloading Together
Similar to Part 1, we prepare a simple dataloader that randomly samples pixels from the multi-view training images. The main difference here is that instead of returning only pixel coordinates and colors, we also convert each sampled pixel into a 3D ray in world space and return the ray origin, ray direction, and pixel color.
Implementation details: The function sample_rays(dataset, N_rays, K) wraps the previous utilities into a single call:
- First, it calls sample_rays_from_imgs(dataset, N_rays) to randomly sample N_rays pixels across all views. This returns pixel coordinates uvs, colors, and the corresponding c2w matrices.
- To use the batched interface of px2ray, the sampled pixels are reshaped to uvs_batch of shape (N_rays, 1, 2), and the c2w matrices are stacked into an array of shape (N_rays, 4, 4).
- We then call px2ray(K, c2ws_batch, uvs_batch) to obtain ray origins and directions with shape (N_rays, 1, 3), and squeeze the singleton dimension to get (N_rays, 3).
- Finally, we convert the NumPy arrays back to torch.float32 tensors and return r_os, r_ds, and colors as tensors of shape (N_rays, 3). These are the basic ray batches that will be used in the NeRF training loop.
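A minimal sketch of this wrapper, built on the NumPy helpers sketched earlier, might be:

```python
# A minimal sketch of sample_rays, wrapping sample_rays_from_imgs and px2ray.
import numpy as np
import torch

def sample_rays(dataset, N_rays, K):
    uvs, colors, c2ws = sample_rays_from_imgs(dataset, N_rays)   # (N_rays, 2), (N_rays, 3), (N_rays, 4, 4)
    uvs_batch = uvs.reshape(N_rays, 1, 2)                        # batched px2ray interface
    r_os, r_ds = px2ray(K, c2ws, uvs_batch)                      # each (N_rays, 1, 3)
    to_tensor = lambda a: torch.tensor(np.asarray(a), dtype=torch.float32)
    return (to_tensor(r_os.squeeze(1)),
            to_tensor(r_ds.squeeze(1)),
            to_tensor(colors))
```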
Visualization of Ray & Point Sampling in Viser
As a demonstration, here we visualize how the rays, as well as the points along each ray, are sampled in 3D space, which are the core components of the NeRF rendering and training process. For each view, we visualize 100 rays, each of which has 64 sampled points along it (with perturbation) within the near and far bounds.
Ray and Point Sampling
Ray and Point Sampling
Ray and Point Sampling
Ray and Point Sampling
Part 2.4: Neural Radiance Field
With the samples in 3D space, we can then use a neural net to predict the density and color for those samples. Similar to Part 1, here we create an MLP network with three main differences:
- The input is now a 3D world coordinate, alongside a 3D vector for the ray direction, and the output is not only the color but also the corresponding density of the 3D point. In a radiance field, the color of each point depends on the view direction, so we use the view direction as a condition when predicting colors. Here we use Sigmoid to constrain the color outputs to [0, 1], and ReLU to keep the output density non-negative. The ray direction also needs to be encoded by PE but can use fewer frequencies (e.g., L=4) than the coordinate PE (e.g., L=10).
- Since we are doing a more challenging task of optimizing a 3D representation, here we need a deeper network with more parameters.
- Inject the input (after PE) into the middle of our MLP through concatenation, which is a general trick for training deep neural networks.
The architectural design and implementation follow the suggested one in the project description, illustrated below. The following parts may change hyperparameters, but all of them are modified from this one as the foundation. The modifications will be explicitly reported later if applicable.
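For reference, a minimal PyTorch sketch of such an architecture is given below. The exact depth, widths, and head sizes here are assumptions based on the standard NeRF MLP; the actual MlpNerf used in this project may differ in detail. It reuses the positional_encoding helper from Part 1.

```python
# A sketch of a NeRF-style MLP with a skip connection and a view-conditioned color head.
import torch
import torch.nn as nn

class MlpNerf(nn.Module):
    def __init__(self, L_pos=10, L_dir=4, hidden_dim=256):
        super().__init__()
        self.L_pos, self.L_dir = L_pos, L_dir
        pos_dim = 3 * (2 * L_pos + 1)
        dir_dim = 3 * (2 * L_dir + 1)
        self.stage1 = nn.Sequential(
            nn.Linear(pos_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # skip connection: re-inject the encoded position in the middle of the network
        self.stage2 = nn.Sequential(
            nn.Linear(hidden_dim + pos_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden_dim, 1)        # density, made non-negative by ReLU
        self.feat = nn.Linear(hidden_dim, hidden_dim)
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden_dim + dir_dim, hidden_dim // 2), nn.ReLU(),
            nn.Linear(hidden_dim // 2, 3), nn.Sigmoid(),  # colors in [0, 1]
        )

    def forward(self, x, d):
        x_enc = positional_encoding(x, self.L_pos)        # encoded 3D positions
        d_enc = positional_encoding(d, self.L_dir)        # encoded view directions
        h = self.stage1(x_enc)
        h = self.stage2(torch.cat([h, x_enc], dim=-1))
        sigma = torch.relu(self.sigma_head(h))
        rgb = self.rgb_head(torch.cat([self.feat(h), d_enc], dim=-1))
        return sigma, rgb
```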
Part 2.5: Volume Rendering
The core volume rendering equation is as follows: $$ C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t)) \mathbf{c}(\mathbf{r}(t), \mathbf{d}) dt, \quad \text{where} \quad T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s)) ds\right) $$ This fundamentally means that at every small step $dt$ along the ray, we add the contribution of that small interval $[t, t+dt]$ to the final color; conceptually, we sum contributions over infinitely many infinitesimal intervals, which becomes an integral.
The discrete approximation (thus tractable to compute) of this equation can be stated as follows: $$ \hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \alpha_i \mathbf{c}_i = \sum_{i=1}^{N} w_i \mathbf{c}_i, \quad \text{where} \quad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right), \quad \alpha_i = 1 - \exp\left(-\sigma_i \delta_i\right) $$ where $\mathbf{c}_i$ is the color obtained from our network at sample location $i$, $T_i$ is the probability of a ray not terminating before sample location $i$, and $\alpha_i = 1 - \exp\left(-\sigma_i \delta_i\right)$ is the probability of terminating at sample location $i$.
We can also interpret this rendering process as a weighted sum of colors along the ray, where the weight $w_i = T_i \alpha_i$ indicates how much each sample contributes to the final color.
Implementation details: The function volrend(sigmas, rgbs, deltas, ts, depth) implements this discrete renderer:
- All inputs have shape (N_rays, N_pts, ...) except for depth, which indicates whether to render an RGB image or a depth map.
- Compute the opacities alphas = 1 - exp(-sigmas * deltas), giving shape (N_rays, N_pts, 1).
- Compute the transmittances T_i:
  - accumulate sigmas * deltas along the ray,
  - shift by one position and prepend a zero to obtain the sum of previous intervals,
  - apply exp(-...) to obtain T_i.
- Render color by taking the weighted sum torch.sum(alphas * T_i * rgbs, dim=1), producing (N_rays, 3).
- If depth=True, we instead compute torch.sum(w_i * ts, dim=1), producing (N_rays, 1). The role of depth rendering will be introduced in a later section.
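A minimal PyTorch sketch of this renderer, following the steps above, could be:

```python
# A minimal sketch of volrend; sigmas, deltas, ts are (N_rays, N_pts, 1), rgbs is (N_rays, N_pts, 3).
import torch

def volrend(sigmas, rgbs, deltas, ts, depth=False):
    alphas = 1.0 - torch.exp(-sigmas * deltas)                  # opacity of each interval
    # transmittance: exp of minus the accumulated optical depth of the *previous* intervals
    accum = torch.cumsum(sigmas * deltas, dim=1)
    accum = torch.cat([torch.zeros_like(accum[:, :1]), accum[:, :-1]], dim=1)
    T = torch.exp(-accum)
    weights = T * alphas                                        # w_i = T_i * alpha_i
    if depth:
        return torch.sum(weights * ts, dim=1)                   # expected depth, (N_rays, 1)
    return torch.sum(weights * rgbs, dim=1)                     # rendered color, (N_rays, 3)
```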
Here we build a train_val_nerf function that encapsulates the whole NeRF pipeline: sampling rays, sampling points along each ray, querying the MLP, volume rendering, and computing the photometric loss, together with periodic validation and checkpointing.
Implementation details: In each training iteration, the following steps are performed:
- Randomly sample N_rays_per_iter rays from the training images using sample_rays.
- Sample N_pts points along each ray with sample_pts_from_rays, with perturbation enabled during training.
- Feed the 3D points and expanded view directions to MlpNerf to predict densities and colors.
- Apply the volume renderer volrend to obtain the predicted pixel colors, and minimize the MSE loss to the ground-truth colors using Adam.
- Every 100 iterations, run the same pipeline on the validation set with perturbation disabled to log the validation PSNR.
- Periodically save model checkpoints for later visualization and video rendering.
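Putting the pieces together, one training iteration might look like the following sketch; train_dataset, K, n_iters, and N_rays_per_iter are assumed to be defined as described above, and the checkpointing/validation steps are omitted for brevity.

```python
# A minimal sketch of the core NeRF training loop (no validation/checkpointing).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = MlpNerf().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
criterion = torch.nn.MSELoss()

for it in range(n_iters):
    r_os, r_ds, colors = sample_rays(train_dataset, N_rays_per_iter, K)
    pts, _, r_ds_e, deltas, ts = sample_pts_from_rays(
        r_os, r_ds, N_pts=64, near=2.0, far=6.0, perturbation=True)
    sigmas, rgbs = model(pts.to(device), r_ds_e.to(device))    # per-sample density and color
    pred = volrend(sigmas, rgbs, deltas.to(device), ts.to(device))
    loss = criterion(pred, colors.to(device))                  # photometric MSE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```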
To use our trained NeRF model to render novel views, here we implement a function render_frames that takes in a set of camera extrinsics and intrinsics, and produces the rendered images. This function is similar to the training loop but without optimization, and it processes all pixels in the image instead of sampling a subset of rays.
Implementation details: The function render_frames(nerf_model, c2ws, K, H, W, N_pts, near, far, depth) is implemented as follows:
- Create a meshgrid of all pixel coordinates uv in the image of size (H, W), and reshape them to (H*W, 2), during which the 0.5 offset is added to each coordinate.
- Get all rays for the image by calling px2ray(K, c2ws, uv), producing r_os and r_ds of shape (H*W, 3).
- Sample points along all rays using sample_pts_from_rays(r_os, r_ds, N_pts, near, far, perturbation=False), obtaining the 3D sample positions pts, ray origins r_os_expand, ray directions r_ds_expand, deltas, and depths ts.
- Query the NeRF model to get densities and colors for all sampled points.
- Perform volume rendering using volrend to get the final rendered image or depth map.
- Reshape the output back to (H, W, 3) for images or (H, W, 1) for depth maps, and return the result.
Experiment: NeRF of Lego Scene
Lego NeRF with 1k iterations
Here we first adopt the suggested hyperparameters from the project problem statement; the configuration is as follows:
- Number of rays per iteration: 10000
- Number of points per ray: 64
- Number of training iterations: 1000
- Learning rate: 5e-4
- Near and far bounds: 2.0 and 6.0
- Highest frequency level for position PE: 10
- Highest frequency level for direction PE: 4
The training progress, illustrated by the predicted images across iterations, the PSNR curves on the training and validation sets, and the rendered test-set video, is shown below:
Test video
Iteration 200
Iteration 400
Iteration 600
Iteration 800
Iteration 1k
Lego NeRF with more iterations
It is apparent from the training log of 1k iterations that the model has not yet fully converged. Here we increase the training budget to see how the model improves: all other hyperparameters are kept the same as before, except that the number of training iterations is increased to 10000.
The training progress, illustrated by the predicted images across iterations, the PSNR curves on the training and validation sets, and the rendered test-set video, is shown below:
Test video
Iteration 600
Iteration 1k
Iteration 3k
Iteration 6k
Iteration 10k
Observations
From the above results, we can see that with more training iterations, the NeRF model is able to produce significantly better renderings of the Lego scene. The PSNR on both training and validation sets improves steadily, indicating that the model is learning a more accurate representation of the 3D scene. The rendered images become sharper and more detailed, with fewer artifacts compared to the 1k iteration model. This demonstrates the effectiveness of NeRF in capturing complex 3D structures and view-dependent effects when given sufficient training time.
Part 2.6: Training with our own data
Now, it's time to train a NeRF model using our own captured Labubu dataset from Part 0. The model architecture stays the same as before, except that the highest frequency level for the position PE is increased to 20 and for the direction PE to 8, to better capture the fine details in the real-world scene. The training configurations are as follows:
- Number of rays per iteration: 10000
- Number of points per ray: 96
- Number of training iterations: 30000
- Learning rate: 5e-4
- Near and far bounds: 0.02 and 0.4
- Highest frequency level for position PE: 20
- Highest frequency level for direction PE: 8
Results & Discussion
In my experiments, I found that increasing the frequency levels can improve the model's performance a little, but the model overfits the training data if they are set too high without regularization. Another observation is that the sampling range defined by [near, far] is crucial and very sensitive for obtaining accurate results, which seems to stem from the drawback of the uniform sampling strategy.
The training progress, illustrated by one predicted training image across iterations, the PSNR curves on the training and validation sets, and the rendered novel-view video, is shown below. It can be observed that the model already starts to overfit the training data at around 2k iterations; however, continuing to train typically still gives better results when rendering novel views.
Labubu Novel Video
Iteration 200
Iteration 1k
Iteration 2k
Iteration 5k
Iteration 10k
Iteration 30k
Bells & Whistles: Depth map video for the Lego scene
We can also render the depth map with the weight $w_i = T_i \alpha_i$ computed during volume rendering, which is the "expected depth" expressed as: $$ \bar{t} = \sum_{i} T_i \alpha_i t_i = \sum_{i} w_i t_i $$
Implementation details: The volrend function already supports rendering depth maps by setting the depth=True flag. During rendering, if this flag is set, the function computes the expected depth along each ray using the weights and sampled depths, returning a depth map instead of an RGB image. Another trick to improve the quality of the visualized depth maps is to normalize each frame to the range $[0, 1]$; in my implementation, I achieve this with frame = np.clip((frame - near) / (far - near), 0, 1) before saving the depth frames as images or videos.
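As a small sketch of that normalization step (the function name depth_to_frame is illustrative, and the default bounds are the Lego near/far values):

```python
# Map an expected-depth map to an 8-bit grayscale frame for video export.
import numpy as np

def depth_to_frame(depth_map, near=2.0, far=6.0):
    """depth_map: (H, W, 1) output of volrend/render_frames with depth=True."""
    frame = np.clip((depth_map - near) / (far - near), 0, 1)   # [near, far] -> [0, 1]
    return (frame * 255).astype(np.uint8)                      # ready to write as an image
```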
The depth map video of the Lego scene is shown below: