Images of the Russian Empire -- Colorizing the Prokudin-Gorskii Photo Collection
Aligning RGB channels from glass plate negatives and enhancing them for better visual quality
Introduction
Sergei Mikhailovich Prokudin-Gorskii (1863-1944) was a chemist and photographer (his portrait is shown above), best known as a pioneer of color photography. With special permission from the Tsar, he traveled across Russia and neighboring regions carrying a custom-designed camera. Due to the technical limitations of that era, each so-called “color photograph” was actually recorded as three separate negatives on a single specially shaped glass plate, each taken through a red, green, or blue filter. Fortunately, many of these high-quality negatives have been preserved and are now housed at the Library of Congress (LOC), which has digitized and publicly released them. As a result, we have the unique opportunity to glimpse the Russian Empire in color from more than a century ago.
In this project, we start from the digitized glass plates provided by the LOC, separate the three RGB channels, and align them to reconstruct plausible color images that Prokudin-Gorskii wanted to share. Different metrics and alignment algorithms are explored to achieve accurate and efficient alignment even for large high-resolution scans. In addition, a series of post-processing enhancement techniques, including automatic cropping, white balance, contrast adjustment, and color remapping, are applied to improve visual quality and produce more natural results.
Highlights
- Exhaustive search for small JPGs using NCC
- Image pyramid for large TIFs for better efficiency
- Robust alignment via Sobel gradient-domain NCC
- Automatic border cropping (edge density + k-part trick)
- Automatic white balance (white patch assumption with tricks)
- Color remapping (linear mapping)
- Automatic contrast adjustment (CLAHE)
Aligning Small JPG Images
Metrics for Image Alignment
Before aligning the images, one needs to define a metric that quantifies how well two images are aligned. This can be achieved by measuring the similarity between two images (patches). Given a reference image $P$ and a target image $I$ that we want to align to $P$, we define a region/patch $N$ over which the comparison is made; the goal is then to quantify how similar the region $N$ in $I$ is to $P$ after shifting $I$ by $(u,v)$. Two simple metrics are considered:
- Sum of Squared Differences (SSD): the squared L2 norm between two images in the pixel/feature space,
$$ \text{SSD}(u,v) = \sum_{(x,y)\in N} \left[I(u+x, v+y) - P(x,y)\right]^2 $$
- Normalized Cross-Correlation (NCC): the dot product between two images in the pixel/feature space after mean subtraction and normalization, i.e., their cosine similarity,
$$ \text{NCC}(u,v) = \frac{\sum_{(x,y)\in N}\left[I(u+x,v+y)-\bar I\right]\left[P(x,y)-\bar P\right]} {\sqrt{\sum_{(x,y)\in N}\left[I(u+x,v+y)-\bar I\right]^2}\sqrt{\sum_{(x,y)\in N}\left[P(x,y)-\bar P\right]^2}} $$
A glance at the definitions above makes it apparent that NCC should behave more robustly than SSD, since it is less sensitive to global intensity shifts (via mean subtraction) and scaling (via normalization), both of which are common issues in our case because Prokudin-Gorskii's three channels were not captured simultaneously or with identical exposure. Therefore, our implementation starts with NCC.
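For concreteness, here is a minimal sketch of NCC on two equally sized patches (the helper name and float conversion are illustrative, not the exact project code):

```python
import numpy as np

def ncc(a: np.ndarray, b: np.ndarray) -> float:
    """NCC between two same-shaped patches: the cosine similarity
    after subtracting each patch's mean."""
    a = a.astype(np.float64).ravel() - a.mean()
    b = b.astype(np.float64).ravel() - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```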
Pre-Processing: Patching
Before applying the alignment algorithm, appropriate pre-processing is required to improve robustness. In particular, computing the similarity metric only on the central region of the image (e.g., the central $90\%$) helps to avoid boundary artifacts. Since metrics like SSD and NCC rely heavily on pixel intensities, including these border regions may significantly degrade their reliability. Therefore, we restrict the calculation to a central patch for more stable scoring.
Search Algorithm: Exhaustive Search
With the similarity metric and the search space (either the raw pixel space or a transformed feature space) defined, the optimal alignment can be obtained by searching over possible shifts. Here, we adopt a simple exhaustive search in the raw pixel space. The implementation details are as follows, with a code sketch after the list:
- Search grid: allow shifts within $[-15, 15]$ pixels around the center $(0, 0)$ in both $x$ and $y$ directions
- Metric: use NCC, which is more robust than SSD as we discussed above.
- Scoring region: restrict metric computation to a central crop, ignoring approximately $10\%$ of the margin to reduce border artifacts
- Output: evaluate all candidate shifts in the grid and return the optimal $(dx, dy)$ for G and R relative to B
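A sketch of the search loop under these settings, reusing the $\texttt{ncc}$ helper above ($\texttt{np.roll}$ wraps pixels around, which is acceptable here because the scored central crop excludes the wrapped margin for the shift ranges used):

```python
import numpy as np

def align_exhaustive(target, ref, max_shift=15, center=(0, 0), margin=0.1):
    """Return the shift (dx, dy) of `target` that best matches `ref`,
    searched exhaustively in a window around `center`."""
    h, w = ref.shape
    my, mx = int(h * margin), int(w * margin)   # margins ignored when scoring
    cx, cy = center
    best_score, best_shift = -np.inf, center
    for dy in range(cy - max_shift, cy + max_shift + 1):
        for dx in range(cx - max_shift, cx + max_shift + 1):
            shifted = np.roll(target, (dy, dx), axis=(0, 1))
            # score only the central crop to avoid border artifacts
            score = ncc(shifted[my:h - my, mx:w - mx],
                        ref[my:h - my, mx:w - mx])
            if score > best_score:
                best_score, best_shift = score, (dx, dy)
    return best_shift
```

Aligning a plate then amounts to calling this once for G and once for R, each against the reference B channel.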
Results of Single-Scale Alignment (Exhaustive Search + NCC over Raw Pixel Space on Small JPG Images)
The results are presented below; each entry lists the shift of the R channel $(dx, dy)$, the shift of the G channel $(dx, dy)$, and the run time.
R(2, 3) G(2, -3) 0.84s
R(3, 6) G(2, 3) 0.87s
R(3, 12) G(2, 5) 0.83s
Aligning Large TIF Images
Image Pyramid for Faster Alignment
The exhaustive search algorithm works reasonably well for small images with limited search ranges. However, its computational cost grows dramatically with the search window and image resolution, which quickly becomes impractical. For instance, performing the exhaustive search over a shift range of $[-50, 50]$ pixels in both $x$ and $y$ directions on the raw TIF image (without parallelization) may take over 20 minutes!
Therefore, more efficient methods should be considered. Here we implement a coarse-to-fine alignment with an image pyramid, where an exhaustive search is performed at each level, but only within a much smaller local search window.
More specifically, each level is constructed by applying a blur followed by downsampling (by a factor of 2 in our implementation). Blurring before downsampling at each level is necessary to avoid aliasing. The alignment starts from the coarsest level, and the optimal shift is propagated to the next finer level as the initial guess, until the finest level (the raw image) is reached. Therefore, except at the coarsest level, the search space at each level is reduced significantly compared to that of the naive exhaustive search.
Implementation details are as follows (a code sketch follows the list):
- Build pyramids through "blur + downsampling by 2" (via $\texttt{cv2.resize}$ with $\texttt{INTER_AREA}$ interpolation, which averages pixels and thus blurs), until a minimum size is reached (e.g., 300 px).
- Start from the coarsest level with a wide search range (e.g., $[-20, 20]$) around the initial guess (e.g., the center $(0, 0)$), get the optimal shift.
- At finer levels, use the last level's optimal shift (scaled by the downsampling factor, e.g., 2) as the initial guess, then search over a smaller range (e.g., $[-3, 3]$) around it for the new optimal shift.
- Repeat the previous step until the finest level, i.e., the raw image, is reached.
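Putting these steps together, a minimal coarse-to-fine sketch could look like the following (parameter names are illustrative; it reuses $\texttt{align_exhaustive}$ from above and assumes single-channel float32 images, which $\texttt{cv2.resize}$ handles directly):

```python
import cv2

def align_pyramid(target, ref, coarse_range=20, fine_range=3, min_size=300):
    """Coarse-to-fine alignment over a 'blur + downsample by 2' pyramid."""
    # build the pyramid from finest (index 0) to coarsest;
    # INTER_AREA averages pixels, acting as the anti-aliasing blur
    levels = [(target, ref)]
    while min(levels[-1][0].shape) // 2 >= min_size:
        levels.append(tuple(
            cv2.resize(im, None, fx=0.5, fy=0.5, interpolation=cv2.INTER_AREA)
            for im in levels[-1]))

    shift = (0, 0)
    for i, (t, r) in enumerate(reversed(levels)):   # coarsest -> finest
        if i > 0:
            shift = (2 * shift[0], 2 * shift[1])    # propagate the last guess
        rng = coarse_range if i == 0 else fine_range
        shift = align_exhaustive(t, r, max_shift=rng, center=shift)
    return shift
```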
Results of Multi-Scale Alignment (Exhaustive Search + NCC over Raw Pixel Space on All Images)
R(3,12) G(2,5) 1.52s
R(-5,58) G(-12,24) 16.01s
R(55,104) G(24,49) 16.40s
R(14,124) G(16,60) 16.39s
R(23,89) G(17,40) 16.56s
R(35,77) G(21,38) 16.58s
R(-9,76) G(-2,-3) 16.83s
R(-29,92) G(-17,41) 16.63s
R(12,178) G(10,82) 16.82s
R(2,3) G(2,-3) 1.51s
R(34,175) G(29,79) 17.10s
R(-25,96) G(-7,49) 17.23s
R(10,112) G(13,55) 16.54s
R(3,6) G(2,3) 1.54s
Bells & Whistles
Failure Mode Analysis
The aforementioned baseline, which performs exhaustive search with NCC over the raw pixel space at each pyramid level, can already achieve reasonable alignment. However, the processed images still contain artifacts that make them less visually appealing and imperfectly aligned. For instance, systematic failures can be observed in some TIF images, such as $\texttt{emir.tif}$, where the red channel is clearly misaligned (shown below). Therefore, we analyze and address these systematic failures here, and then apply several techniques to enhance the visual quality.
Better Alignment: Robust Metric and/or Feature Space
While NCC is fundamentally better than SSD thanks to its normalization, applying it directly in the raw pixel space still leaves it highly sensitive to absolute intensities. This results in poor robustness against local defects and outliers (e.g., saturation, scratches, noise) and leads to systematic failures when the channels differ significantly in contrast and brightness.
In $\texttt{emir.tif}$, visualizing the three channels separately (see below), one can easily observe that the R channel exhibits a markedly different contrast and saturation distribution compared to the B and G channels. Consequently, NCC on raw intensities tends to systematically misalign the red channel.
Red channel in red colormap
Green channel in green colormap
Blue channel in blue colormap
To address this, one can:
- either adopt a more robust metric, or
- map the raw pixel space into feature spaces that are less sensitive to intensity variations.
In my implementation, the latter strategy was adopted. Specifically, I compute NCC in the Sobel gradient space, which shifts the focus from pixel intensities to edges and structural information. As shown below, this resolves the misalignment.
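A sketch of the feature mapping (using the gradient magnitude; per-direction gradients would be an equally plausible variant):

```python
import cv2

def sobel_features(img):
    """Map raw intensities to the Sobel gradient magnitude, emphasizing
    edges and structure over absolute brightness."""
    gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)
    return cv2.magnitude(gx, gy)

# the alignment pipeline runs unchanged, just in the new feature space:
# shift_r = align_pyramid(sobel_features(r), sobel_features(b))
```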
Results of Multi-Scale Alignment (Exhaustive Search + NCC over Sobel Gradient Space on All Images)
R(3,12) G(2,5) 1.50s
R(-4,58) G(4,25) 16.50s
R(40,107) G(23,49) 16.92s
R(13,123) G(17,60) 18.67s
R(23,90) G(17,42) 17.76s
R(36,77) G(22,39) 16.91s
R(-8,76) G(-2,-3) 18.04s
R(-29,91) G(-17,41) 19.10s
R(12,177) G(10,80) 17.90s
R(2,3) G(2,-3) 1.52s
R(37,176) G(29,78) 17.53s
R(-24,96) G(-7,49) 18.32s
R(8,111) G(12,54) 16.96s
R(3,6) G(2,3) 1.52s
Post-Processing: Automatic Cropping
Once the alignment is done, the colorized images often look quite convincing. However, they typically contain artifacts that reduce their visual quality compared to modern digital photos. Therefore, we explore some post-processing techniques to enhance their visual quality.
The most obvious artifacts are the borders, which are typically single-colored and contain (almost) no information. To address this, we remove them through automatic cropping.
One method is to analyze the edges of the image. The borders are usually nearly uniform in color, with very few edges along the perpendicular direction. Therefore, under the assumption that the borders are approximately parallel to the image edges, we can detect them by finding the locations where the edge gradient changes dramatically, and crop them out. To improve robustness, we analyze the cumulative distribution of the absolute Sobel gradient values, operating on the Sobel gradient maps with respect to the $x$ and $y$ directions separately. The details are as follows, with a code sketch after the list:
- Convert the image to grayscale to keep only the intensity information
- Compute the Sobel gradient maps with respect to the $x$ and $y$ directions, and take their absolute values
- For each gradient map, project the values by summing along the perpendicular axis (e.g., sum columns for the $x$-gradient)
- Compute the cumulative percentage of these projected values, and crop at the positions where the percentage first exceeds a predefined threshold (e.g., $2\%$)
- To avoid over-cropping edges where the gradients are not significant (e.g., some clean background), the maximum cropping ratio should be limited by a threshold as well (e.g., $7\%$)
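A condensed sketch of this procedure (thresholds match the values above; the helpers are illustrative and assume an 8-bit or float32 RGB image):

```python
import cv2
import numpy as np

def crop_bounds(profile, thresh=0.02, max_ratio=0.07):
    """Crop where the cumulative |gradient| first exceeds `thresh` from
    either end, capped at `max_ratio` of the image size per side."""
    total = profile.sum()
    lo = int(np.searchsorted(np.cumsum(profile) / total, thresh))
    hi = len(profile) - int(
        np.searchsorted(np.cumsum(profile[::-1]) / total, thresh))
    cap = int(len(profile) * max_ratio)
    return min(lo, cap), max(hi, len(profile) - cap)

def auto_crop(img_rgb):
    gray = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2GRAY)
    gx = np.abs(cv2.Sobel(gray, cv2.CV_32F, 1, 0))
    gy = np.abs(cv2.Sobel(gray, cv2.CV_32F, 0, 1))
    x0, x1 = crop_bounds(gx.sum(axis=0))   # column sums of |d/dx|
    y0, y1 = crop_bounds(gy.sum(axis=1))   # row sums of |d/dy|
    return img_rgb[y0:y1, x0:x1]
```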
Failure Analysis & Improvement (k-part strategy)
The initial implementation of automatic cropping works well for most images; however, some images contain noise on the borders (e.g., writing that records metadata, or rugged edges due to poor preservation), which can be recognized as edges as well. Since the initial cropping method is based on detecting a dramatic change in the cumulative edge density, such writing on the border causes the density curve to rise quickly and triggers the threshold too early. However, simply increasing the threshold would cause the cropping to degenerate into a simple crop with a fixed ratio: since most other borders do not contain such writing, the crop amount would simply be controlled by the maximum cropping ratio.
To address this, we can instead adopt a multi-region strategy: instead of applying border detection to the entire image, we divide the image into $k$ vertical (or horizontal) partitions and perform the same border detection independently on each partition. This is based on the observation that the writing typically appears in a specific region of one border, while the rest of the borders are clean. The assumption is therefore that at least one partition will not be heavily influenced by writing or artifacts, and can thus provide a more reasonable cropping boundary. Finally, we take the most aggressive (largest) crop among the partitions as the final result, as sketched below.
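A sketch of the k-part variant, reusing $\texttt{crop_bounds}$ from above: the gradient map is split into $k$ strips, borders are detected per strip, and the most aggressive bound per side wins.

```python
import numpy as np

def kpart_bounds(grad_map, axis, k=4, **kwargs):
    """Detect borders independently on k strips of a |gradient| map,
    keeping the most aggressive crop across strips."""
    strips = np.array_split(grad_map, k, axis=axis)
    bounds = [crop_bounds(s.sum(axis=axis), **kwargs) for s in strips]
    return max(b[0] for b in bounds), min(b[1] for b in bounds)

# left/right bounds from |d/dx| split into horizontal strips (axis=0),
# top/bottom bounds from |d/dy| split into vertical strips (axis=1):
# x0, x1 = kpart_bounds(gx, axis=0); y0, y1 = kpart_bounds(gy, axis=1)
```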
Results of Automatic Cropping (with & without k-part trick)
Some selected results of automatic cropping are demonstrated below. One can see that the initial implementation without the k-part trick can already crop the borders appropriately. However, in some cases the borders cannot be cropped cleanly without over-cropping, due to noise within the borders. For instance, in the first example the right black border contains the writing "5030" in its upper part, and the second example shows rugged edges on the right and red shading at the bottom. After applying the aforementioned k-part trick with $k=4$, the cropped images are visibly cleaner.
Before cropping
Cropped w/o k-part trick
Cropped w/ k-part trick (k=4)
Before cropping
Cropped w/o k-part trick
Cropped w/ k-part trick (k=4)
Trade-Off Analysis
However, there is still a trade-off. After observing the aligned images, one may ask: how do we define the so-called borders? Are they just single-colored straight bands? Or should we also consider edge regions that are mostly shaded in a single color yet still contain semantic information? Whether to cut the latter out or adjust them back to their natural color was a trade-off that took me a long time to consider. The $\texttt{melons}$ example below is a good illustration: it contains a large region of single-colored shaded bands at the top, which is apparently the top of the building behind the fruit farmer; if we treat these bands as borders, they are all cut off and we lose the information inside! The ideal way to handle them would be to map them back to natural colors, but this seems too hard to automate. Therefore, in my final implementation, I tuned the thresholds to make the cropping less aggressive, so that some of these regions and the information inside are kept, even though they may look ugly.
Before cropping
Cropped w/ k-part trick (k=4), keeping some edge regions (top)
Post-Processing: Automatic White Balancing
Limited by the technology of Prokudin-Gorskii's time, the colors of aligned images are typically not natural enough. To enhance the visual quality, here we manipulate the pixel intensities and distributions as post-processing.
From observations, it is common that the aligned colored photos exhibit color casts, i.e., the overall color shifts toward one color. Therefore, to make the colors more natural, here we correct the color through automatic white balancing, where we identify and adjust the white points to be displayed as pure white. Like other color constancy algorithms, white balancing consists of two steps:
- Determining the illuminant under which an image was captured
- Adjusting the colors to counteract the illuminant and simulate a neutral illuminant
There are various algorithms for automatic white balancing, among which the most straightforward yet effective are the Gray World Assumption and the White Patch Assumption. The former assumes that the average reflectance in a scene is achromatic (gray), while the latter assumes that there exists at least one perfectly white pixel (with equal RGB values); either serves as the reference for step 1. Once we have the illuminant information from step 1, correcting the colors (step 2) is simply a linear transformation that scales the RGB channels accordingly.
The linear transformation for adjusting RGB channels can be expressed as a $3 \times 3$ matrix multiplication on the raw RGB values. Assuming the image was taken under a single illuminant, and that Prokudin-Gorskii's RGB filters were ideal so that the RGB channels are uncoupled (similar to the Von Kries coefficient law, where the human cones adapt separately), the transformation matrix simplifies to a diagonal form with per-channel scaling factors:
$$ \begin{bmatrix} R \\ G \\ B \end{bmatrix} = \begin{bmatrix} \alpha & 0 & 0 \\ 0 & \beta & 0 \\ 0 & 0 & \gamma \end{bmatrix} \begin{bmatrix} R' \\ G' \\ B' \end{bmatrix} $$
where $[R', G', B']^{\top}$ are the original RGB values before white balancing and $[R, G, B]^{\top}$ are the adjusted ones. The scaling factors $\alpha, \beta, \gamma$ are determined from the illuminant estimated in step 1.
But one may ask, which algorithm should we implement for this project? Here we can analyze the photos we have, together with the properties of each algorithm. One common pattern we can observe in Prokudin-Gorskii's collection is that most of them exhibit darkened edges, shadows, or aging artifacts, which bias the global average toward darker or desaturated tones. As a result, applying the Gray World Assumption tends to introduce undesirable color shifts.
On the other hand, by analyzing the subject matter of Prokudin-Gorskii's photographs, we find that they are mainly landscapes, architecture, and scenes of everyday life, which typically contain some white objects. Therefore, weighing the two algorithms, the White Patch Assumption is the better choice for this project.
Recall that the White Patch Assumption calibrates the photo based on the brightest pixels in each channel, say $R_{\max}, G_{\max}, B_{\max}$. With the intensity range defined as $[I_{\min}, I_{\max}]$, the scaling factors can be naively determined as:
$$ \alpha = \frac{I_{\max}}{R_{\max}}, \qquad \beta = \frac{I_{\max}}{G_{\max}}, \qquad \gamma = \frac{I_{\max}}{B_{\max}} $$
However, simply applying the White Patch Assumption on the single brightest pixel of each channel may be sensitive to noise and extreme cases, e.g., some noise pixels are extremely bright, or some images contain saturated regions that are very bright in only one channel, both of which can potentially harm the performance. To address this and enhance the robustness, my implementation involves two strategies:
- Only consider pixels whose channel values are mutually similar (within a threshold $\delta$), which are more likely to be near-white/neutral rather than single-colored highlights.
- Instead of using the single maximum, use the average over the $\texttt{top_p}\%$ brightest pixels (measured in grayscale luminance).
With the constraints above, we only consider pixels $i$ that are both bright (i.e., of high grayscale intensity $L_i$) and approximately neutral in chromaticity, collected into the set:
$$ \Omega = \left\{ i \;\middle|\; L_i \ge Q_p(L), \;\; \max(R_i, G_i, B_i) - \min(R_i, G_i, B_i) \le \delta \right\} $$
where $Q_p(L)$ is the threshold corresponding to the $\texttt{top_p}\%$ brightest pixels in the grayscale distribution (set to $\texttt{top_p}=2$ in my implementation, i.e., keep the top $2\%$ brightest pixels), and $\delta$ is the threshold that controls the chromaticity similarity ($\delta=0.05$ in my case). Then, $R_{\max}, G_{\max}, B_{\max}$ are computed as the averages of the corresponding channels over the set $\Omega$:
$$ R_{\max} = \frac{1}{|\Omega|}\sum_{i\in\Omega} R_i, \qquad G_{\max} = \frac{1}{|\Omega|}\sum_{i\in\Omega} G_i, \qquad B_{\max} = \frac{1}{|\Omega|}\sum_{i\in\Omega} B_i $$
After scaling, pixels whose intensities exceed the range of $[I_{\min}, I_{\max}]$ are clipped to the nearest boundary.
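The following sketch implements both robustness tricks on a float RGB image in $[0, 1]$ (so $I_{\max} = 1$); the constants mirror the values above, while the luminance weights and the fallback branch are my own assumptions:

```python
import numpy as np

def white_patch_wb(img, top_p=2.0, delta=0.05):
    """White-patch WB: scale channels so near-white bright pixels become white."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    lum = 0.299 * r + 0.587 * g + 0.114 * b           # grayscale luminance
    bright = lum >= np.percentile(lum, 100 - top_p)   # top_p% brightest pixels
    neutral = img.max(axis=-1) - img.min(axis=-1) <= delta
    omega = bright & neutral                          # the candidate set
    if not omega.any():        # fall back if no near-white pixel exists
        omega = bright
    maxes = img[omega].mean(axis=0)                   # R_max, G_max, B_max
    return np.clip(img / maxes, 0.0, 1.0)             # scale, then clip
```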
Results of Automatic White Balancing
Some selected results are shown below. In my implementation, I aimed for a non-aggressive white balancing adjustment, so the $\texttt{top_p}$ and $\delta$ were set to $2$ and $0.05$, respectively. For example, the roof of the fruit farmer and the hat of the Emir were balanced to white appropriately, demonstrating that the method can effectively correct global color casts while preserving a natural appearance.
After Cropping, before WB
After WB
After cropping, before WB
After WB
Post-Processing: Better Color Mapping
There is no need, and indeed it would be incorrect, to assume that Prokudin-Gorskii's RGB filters captured the three channels as accurately as modern cameras do. Therefore, we require a mapping to correct the systematic color cast caused by these filters. This means that the diagonal matrix used in white balancing is not sufficient; there should also be coupling effects between channels. Here, I implement a simple linear color mapping, following the same approach as in white balancing, namely a $3 \times 3$ matrix multiplication on the raw RGB values. However, instead of being diagonal, the transformation matrix is now full. This color remapping is applied after white balance in my pipeline; thus, assuming that the white-balanced images are already neutral, the mapping should not affect neutral colors (gray or white). The transformation matrix is therefore defined as:
$$ \begin{bmatrix} R \\ G \\ B \end{bmatrix} = \begin{bmatrix} 1-a-b & a & b \\ c & 1-c-d & d \\ e & f & 1-e-f \end{bmatrix} \begin{bmatrix} R' \\ G' \\ B' \end{bmatrix} $$
which ensures that each row sums to 1, so any neutral input color ($R'=G'=B'$) remains neutral after the transformation. Since we don't have a "ground truth" for optimizing the parameters, I tune them manually based on reference photographs and visual inspection (e.g., correcting white walls, blue skies, and earthy tones). Although somewhat subjective, this method consistently improves the perceptual realism of the reconstructed images.
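Applying the matrix is then a one-liner per pixel. The sketch below assumes the parameterization shown above, with the tuned values reported in the next section:

```python
import numpy as np

a, b, c, d, e, f = -0.12, -0.15, 0.30, -0.08, 0.02, 0.00
M = np.array([[1 - a - b, a,         b        ],
              [c,         1 - c - d, d        ],
              [e,         f,         1 - e - f]])   # each row sums to 1

def remap_colors(img):
    """Full 3x3 color mapping on a float RGB image in [0, 1]."""
    return np.clip(img @ M.T, 0.0, 1.0)   # out_i = sum_j M[i, j] * in_j
```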
Results of Color Remapping
Some selected results are shown below. My final implemented parameters are: $a=-0.12$, $b=-0.15$, $c=0.30$, $d=-0.08$, $e=0.02$, $f=0.00$.
After cropping + WB, before color remapping
After color remapping
After cropping + WB, before color remapping
After color remapping
Post-Processing: Automatic Contrasting
Another common issue in Prokudin-Gorskii's collection is low contrast: the image histograms do not span the full intensity range, making details in some regions hard to distinguish. Therefore, a subsequent contrast adjustment is needed to make weak signals more visible, revealing details that were already present in the data but previously hidden in low-contrast regions. Here we enhance the contrast by manipulating the image histograms.
The simplest way is to map the histogram linearly to span the full intensity range, namely contrast stretching. Say, for a grayscale space with intensity range $[I_{\min}, I_{\max}]$ and a total of $LI$ gray levels (e.g., for uint8, $LI=256$ and the range is $[0,255]$), if we want to stretch an image whose maximum and minimum intensities are $r_{\max}$ and $r_{\min}$, the function mapping an original intensity $r$ to the new one $s$ (assuming $I_{\min}=0$ and $I_{\max}=LI-1$) is defined as:
$$ s = T(r) = \frac{r - r_{\min}}{r_{\max} - r_{\min}} \,(LI - 1) $$
However, this method is sensitive to noise and outliers, since it relies heavily on $r_{\min}$ and $r_{\max}$; any noise or outliers that deviate from the main distribution will affect the result significantly. Moreover, such a linear mapping does not change the shape of the histogram, and therefore cannot effectively enhance regions where pixel intensities are densely clustered (i.e., locally low-contrast regions due to over- or under-exposure).
A more robust way is histogram equalization (HE), which manipulates the cumulative distribution function (CDF) of the pixel intensities. Starting from the image histogram, one can normalize it to obtain the probability distribution function (PDF) and then obtain the CDF through cumulative summation, which is essentially the empirical CDF (ECDF) of the pixel intensities, defined as:
$$ \hat{F}(r) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(R_i \le r) $$
where $R_i$ is the intensity of pixel $i \in \{1,2,\dots,n\}$ among the $n$ pixels of the image, and $\mathbf{1}(\cdot)$ is the indicator function. Then, we can use the ECDF to construct the mapping from original intensity $r$ to new intensity $s$ satisfying $T: [I_{\min}, I_{\max}] \to [I_{\min}, I_{\max}]$, which is simply a scaled version of the ECDF:
$$ s = T(r) = I_{\min} + (I_{\max} - I_{\min})\, \hat{F}(r) $$
From the definition, one can observe that this method is less sensitive to outliers and redistributes the pixel intensities more evenly, and is thus able to address the aforementioned issues of contrast stretching.
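As a reference, global HE on a float grayscale image in $[0, 1]$ (so $I_{\min}=0$, $I_{\max}=1$) can be written directly from the ECDF; this is a minimal sketch, not the final implementation:

```python
import numpy as np

def hist_equalize(gray):
    """Global histogram equalization: map each intensity to its ECDF value."""
    values, counts = np.unique(gray.ravel(), return_counts=True)
    ecdf = np.cumsum(counts) / gray.size              # empirical CDF
    return np.interp(gray.ravel(), values, ecdf).reshape(gray.shape)
```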
However, it still has limitations in our case:
- Both of the above methods operate globally; however, the photos at hand typically contain noise and defects in local regions (e.g., uneven aging artifacts and unusual monochrome shading around the edges). Such local defects and noise are incorporated into the new histogram, propagated to the whole image, and degrade the overall contrast enhancement.
- Since HE redistributes pixel intensities to approximate a uniform distribution, originally dense regions may be excessively spread out, while sparse regions are filled with additional pixels, sometimes introducing artificial contrast patterns or visual artifacts.
With such issues in mind, the final implementation adopts the Contrast Limited Adaptive Histogram Equalization (CLAHE), which:
- Operates HE on small tile regions locally and combines the results via bilinear interpolation, addressing the first issue.
- To alleviate the second issue, HE on each tile operates on a clipped and redistributed histogram (controlled by a $\texttt{clip_limit}$ parameter), yielding better local contrast enhancement without over-amplifying noise.
Luckily, skimage provides a well-implemented function for CLAHE, so I used $\texttt{skimage.exposure.equalize_adapthist}$ for my final implementation; a minimal usage sketch follows.
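The wrapper below matches the $\texttt{clip_limit}$ value reported in the results; $\texttt{equalize_adapthist}$ accepts a color or grayscale image and returns a float image in $[0, 1]$:

```python
from skimage import exposure

def auto_contrast(img, clip_limit=0.004):
    """CLAHE: local HE on tiles with a clipped histogram, blended bilinearly."""
    return exposure.equalize_adapthist(img, clip_limit=clip_limit)
```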
Results of Automatic Contrasting
Some selected results are shown below. Again, to avoid over-enhancement, I set the $\texttt{clip_limit}$ parameter to a small value ($0.004$). One can see that after applying CLAHE, details in some low-contrast regions are revealed, e.g., the texture of the fruit and the details of the Emir's clothing.
After cropping + WB + remapping, before contrast
After CLAHE
After cropping + WB + remapping, before contrast
After CLAHE
Results of Final Pipeline (Alignment + Pyramid + Sobel $\to$ Cropping $\to$ WB $\to$ Color Remapping $\to$ Contrast)
Finally, we apply the aforementioned algorithms and pre-/post-processing techniques to the 14 images at hand; the results are shown below. The reported run time covers only the alignment step, excluding post-processing.
R(3,12) G(2,5) 1.50s
R(-4,58) G(4,25) 16.50s
R(40,107) G(23,49) 16.92s
R(13,123) G(17,60) 18.67s
R(23,90) G(17,42) 17.76s
R(36,77) G(22,39) 16.91s
R(-8,76) G(-2,-3) 18.04s
R(-29,91) G(-17,41) 19.10s
R(12,177) G(10,80) 17.90s
R(2,3) G(2,-3) 1.52s
R(37,176) G(29,78) 17.53s
R(-24,96) G(-7,49) 18.32s
R(8,111) G(12,54) 16.96s
R(3,6) G(2,3) 1.52s
Performance on My Own Selections (looks good :D)
As a structural engineer, I selected some photos of structures (buildings and bridges) to test the performance and robustness of my algorithm, and they all look good to me!