3D Projection

April 29, 2019

In 3D graphics, objects are rendered from some viewer's position and displayed on a flat screen, like a phone or laptop. Projection describes the transformation of a three-dimensional point into a two-dimensional point. This transformation can be represented by a projection matrix, which may encode both perspective, such as a camera's focal length, and the transformation to normalized device coordinates (NDC).

Projection matrices are one of the more confusing parts of the GL pipeline, are notoriously difficult to debug, and can be parameterized in several different ways. The following fundamentals and equations attempt to clarify the process and provide reference for common projection tasks and conversions.

The camera moves as the field of view changes to ensure that the subject is always consistently framed, a technique known as the dolly zoom.
Boy Finnigon by Fion Noir (CC-BY-4.0), Forest Block by Jarlan Perez (CC-BY-3.0)

Projection Transformation

The two most common types of projection are orthographic and perspective projection. Axonometric projections, such as isometric projection, are common in games as well.

Orthographic projections do not visualize depth, and are often used for schematics, architectural drawings, and 3D software when lining up vertices. As there is no applied perspective, lines can be absolutely measured and compared.

Perspective projection, however, accounts for depth in a way that simulates how humans perceive the world. Objects that are farther away appear smaller, resulting in roughly a single vanishing point in the center of our vision.

Graffiti on walls in an alley — Parallel lines of a road appear to eventually converge due to perspective.
Haight Street, San Francisco, Jens-Oliver Pukke

Whatever type of projection is used, the end result is a 4D homogeneous coordinate in clip space. In the OpenGL pipeline, primitives are clipped against the view volume, and each coordinate is then divided by $w$, becoming a 3D vector in normalized device coordinates, where visible points fall within the $-1$ to $1$ range on every axis.

The projection transformation maps the viewing volume into an NDC cube in OpenGL.
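As a rough illustration of the divide by $w$, the following JavaScript sketch converts a clip-space coordinate to NDC and checks whether it lands inside the cube. The function names `clipToNdc` and `isInsideNdc` and the sample coordinates are made up for this example; a real pipeline performs these steps on the GPU.

```javascript
// Perspective divide: 4D homogeneous clip-space coordinate to 3D NDC.
function clipToNdc([x, y, z, w]) {
  return [x / w, y / w, z / w];
}

// A point is inside the NDC cube when every component lies in [-1, 1].
function isInsideNdc(ndc) {
  return ndc.every((c) => c >= -1 && c <= 1);
}

// Hypothetical clip-space coordinates, chosen only for illustration.
const visible = clipToNdc([1, -2, 3, 4]); // [0.25, -0.5, 0.75]
const outside = clipToNdc([6, 0, 3, 4]);  // [1.5, 0, 0.75]

console.log(isInsideNdc(visible), isInsideNdc(outside)); // true false
```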

Viewing Frustum

A camera abstraction in a 3D engine has a visible region of space: a cuboid viewing volume for orthographic projections, or a frustum for perspective projections. The human visual system, although a series of lies and magic, has a viewing volume that spans roughly 180° horizontally and 90° vertically, and extends essentially without limit. After all, we can see V762 Cas in Cassiopeia, 16,308 light-years away! Cameras in 3D engines are much more constrained.

A camera's frustum can be thought of as 6 planes, and any objects between those planes are visible and within the camera's field of view. Frustums are generally defined in terms of the near and far planes' distance from the camera on the Z axis, and how far the frustum extends on the near plane to the left, right, top and bottom from the Z axis. The near plane is the 2D plane that the rendered image will be projected upon.

Frustum visualization, using extents parameterization.

Perspective projection

With the six extent values (near, far, left, right, top, bottom), a perspective projection matrix can be created:

\begin{bmatrix} \dfrac{2n}{r - l} & 0 & \dfrac{r + l}{r - l} & 0 \\ 0 & \dfrac{2n}{t - b} & \dfrac{t + b}{t - b} & 0 \\ 0 & 0 & \dfrac{f + n}{n - f} & \dfrac{2fn}{n - f} \\ 0 & 0 & -1 & 0 \\ \end{bmatrix}

Most 3D engines or libraries will have a function that creates a perspective matrix from these values, like glFrustum or three.js's Matrix4#makePerspective.

These values are in world units. The near and far values are positive distances from the camera, measured along its forward axis. The extents are signed offsets on the near plane, measured from the point where the camera's forward axis intersects that plane.
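A minimal sketch of building the matrix by hand can help when checking a library's output. The helpers `perspectiveFromExtents` and `project` below are hypothetical names, and the matrix is stored as an array of rows in the same order as it is written above. OpenGL and three.js store matrices in column-major order, so the rows would need transposing before being handed to either.

```javascript
// Perspective matrix from frustum extents, as an array of four rows.
function perspectiveFromExtents(l, r, b, t, n, f) {
  return [
    [(2 * n) / (r - l), 0, (r + l) / (r - l), 0],
    [0, (2 * n) / (t - b), (t + b) / (t - b), 0],
    [0, 0, (f + n) / (n - f), (2 * f * n) / (n - f)],
    [0, 0, -1, 0],
  ];
}

// Transform a camera-space point (w = 1) and apply the perspective divide.
function project(m, [x, y, z]) {
  const [cx, cy, cz, cw] = m.map(
    (row) => row[0] * x + row[1] * y + row[2] * z + row[3]
  );
  return [cx / cw, cy / cw, cz / cw];
}

// Example extents are arbitrary: a 4x2 near plane at n = 1, far plane at f = 100.
const m = perspectiveFromExtents(-2, 2, -1, 1, 1, 100);

// The top-right corner of the near plane lands at approximately (1, 1, -1).
console.log(project(m, [2, 1, -1]));
// The bottom-left corner of the far plane lands at approximately (-1, -1, 1).
console.log(project(m, [-200, -100, -100]));
```

The camera looks down the negative Z axis here, which is why the near-plane point has $z = -n$.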

The following figures illustrate the context of the extent values, and how they can be used with trigonometry to measure any length or angle.

Diagram of a symmetric camera's frustum on the YZ plane in camera space — **Figure 1** A view of a symmetric camera's frustum on the YZ plane in camera space. Notice how the top (t), bottom (b) and near plane distance (n) all determine the vertical field of view (θ). The Y extents on the far plane are calculated with the ratio f/n, as the angles are identical for both the origin/near triangle, as well as the origin/far triangle.

Diagram of a symmetric camera's frustum on the XZ plane in camera space — **Figure 2** A view of a symmetric camera's frustum on the XZ plane in camera space. Almost identical to *Figure 1*, except using the left (l) and right (r) frustum extents.
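The trigonometry in the two figures can be sketched in a few lines of JavaScript. The function names `verticalFov` and `farExtent` are assumptions made for this example. The angle is computed as a difference of two arctangents so that it also holds when the top and bottom extents are not mirrored.

```javascript
// Vertical field of view (radians) spanned by the top and bottom extents
// of a near plane at distance n.
function verticalFov(t, b, n) {
  return Math.atan(t / n) - Math.atan(b / n);
}

// Scale a near-plane extent out to the far plane using the ratio f / n.
function farExtent(extent, n, f) {
  return extent * (f / n);
}

// Example values are arbitrary.
const n = 1, f = 100, t = 1, b = -1;

const fovDegrees = (verticalFov(t, b, n) * 180) / Math.PI; // about 90
const farTop = farExtent(t, n, f);                          // 100

console.log(fovDegrees, farTop);
```

The horizontal field of view follows the same pattern with the left and right extents in place of bottom and top.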

Projection Symmetry

Note that the simulation and images so far have been symmetric projections. The symmetric frustums' extents are symmetrical both vertically and horizontally around the Z axis at the near plane, such that $r = -l$ and $t = -b$ . Symmetric projections are common in 3D renderings, although asymmetric projections can be used in stereoscopic VR rendering, augmented reality platforms, or immersive installations.

A simplified form of the perspective projection matrix can be used for symmetric projections, where $r = -l$ and $t = -b$ :

\begin{bmatrix} \dfrac{n}{r} & 0 & 0 & 0 \\ 0 & \dfrac{n}{t} & 0 & 0 \\ 0 & 0 & \dfrac{f + n}{n - f} & \dfrac{2fn}{n - f} \\ 0 & 0 & -1 & 0 \\ \end{bmatrix}
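As a short worked step, each simplified entry follows from substituting the symmetry conditions into the general matrix. With $l = -r$ and $b = -t$, the denominators become $2r$ and $2t$, and the off-center terms in the third column vanish:

\dfrac{2n}{r - l} = \dfrac{2n}{2r} = \dfrac{n}{r}, \qquad \dfrac{2n}{t - b} = \dfrac{2n}{2t} = \dfrac{n}{t}, \qquad \dfrac{r + l}{r - l} = \dfrac{t + b}{t - b} = 0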

Parameterization

Defining a perspective projection in terms of its frustum extents is just one option. Projections can be defined via aspect ratio, field of view, focal length, or other parameters, depending on background or purpose.

Field of view

Perhaps more commonly, perspective cameras are defined by a vertical field of view and the projection screen's aspect ratio, as well as the near and far plane values. This parameterization is (subjectively) more human-understandable: aspect ratio usually must be configurable to work across different screen resolutions, and the field of view is more intuitive than frustum extents.

Frustum visualization, using aspect ratio/FOV parameterization.

Referencing Figure 1 above and using some trigonometry, the vertical field of view and aspect ratio can be converted to frustum extents, or used directly in the creation of the matrix. This assumes a symmetric projection.

  // fov is the vertical field of view in radians; near is the near plane distance.
  let top = near * Math.tan(fov / 2);
  let bottom = -top;
  let right = aspect * top;
  let left = -right;

e = \dfrac{1}{\tan(FOV/2)}

\begin{bmatrix} \dfrac{e}{aspect} & 0 & 0 & 0 \\ 0 & e & 0 & 0 \\ 0 & 0 & \dfrac{f + n}{n - f} & \dfrac{2fn}{n - f} \\ 0 & 0 & -1 & 0 \\ \end{bmatrix}

$e$ above can be thought of as the focal length. While rendering has no direct equivalent of a physical lens's focal length, Eric Lengyel shared some matrix tricks at GDC 2007 to simulate the parameterization. Paul Bourke's brief note, "Field of view and focal length", sketches out the relationship between the two as well.
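The field of view form can be sketched in JavaScript and compared against the symmetric extents form. `perspectiveFromFov` and `fovFromFocalLength` are hypothetical helper names, the matrix is an array of rows in the order written above, and the field of view is assumed to be in radians.

```javascript
// Perspective matrix from vertical FOV (radians), aspect ratio (width / height),
// and near and far plane distances.
function perspectiveFromFov(fov, aspect, n, f) {
  const e = 1 / Math.tan(fov / 2); // the focal length term
  return [
    [e / aspect, 0, 0, 0],
    [0, e, 0, 0],
    [0, 0, (f + n) / (n - f), (2 * f * n) / (n - f)],
    [0, 0, -1, 0],
  ];
}

// Invert e = 1 / tan(fov / 2) to recover the vertical FOV in radians.
function fovFromFocalLength(e) {
  return 2 * Math.atan(1 / e);
}

// Example values are arbitrary.
const fov = (60 * Math.PI) / 180;
const aspect = 16 / 9;
const near = 0.1;
const m = perspectiveFromFov(fov, aspect, near, 100);

// Symmetric extents derived from the same parameters.
const topExtent = near * Math.tan(fov / 2);
const rightExtent = aspect * topExtent;

console.log(m[1][1], near / topExtent);   // both approximately 1.732
console.log(m[0][0], near / rightExtent); // both approximately 0.974
```

The two printed pairs agree because $n / t = 1 / \tan(FOV/2) = e$ and $n / r = e / aspect$, so the field of view matrix and the symmetric extents matrix are the same matrix written in different parameters.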

Camera intrinsics

If working with OpenCV or augmented reality platforms (ARCore, ARKit), controlling projections via camera intrinsics may be necessary.

The intrinsic matrix contains $f_{x}$ and $f_{y}$, the horizontal and vertical focal lengths in pixels; an often unused skew term $s$; and $c_{x}$ and $c_{y}$, the principal point, or the horizontal and vertical offset from the bottom-left in pixels. For symmetric projections, $c_{x} = width / 2$ and $c_{y} = height / 2$.

\begin{bmatrix} f_{x} & s & c_{x} \\ 0 & f_{y} & c_{y} \\ 0 & 0 & 1 \\ \end{bmatrix}

Koshy George shared a specialized form of representing camera intrinsics in OpenGL, for symmetric projections that have adjustable near/far planes:

\begin{bmatrix} \dfrac{f_{x}}{c_{x}} & 0 & 0 & 0 \\ 0 & \dfrac{f_{y}}{c_{y}} & 0 & 0 \\ 0 & 0 & \dfrac{f + n}{n - f} & \dfrac{2fn}{n - f} \\ 0 & 0 & -1 & 0 \\ \end{bmatrix}

George's solution derives from Kyle Simek's excellent and detailed series on camera calibration and OpenGL, where more background and a generalized form are described.
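A minimal sketch of the specialized form, assuming zero skew and a principal point at the image center, is shown below. The names `perspectiveFromIntrinsics` and `verticalFovFromIntrinsics` and the 640×480 camera values are invented for illustration and are not taken from OpenCV, ARCore, or ARKit.

```javascript
// Projection matrix from pinhole intrinsics (all in pixels), for a symmetric
// projection: zero skew and the principal point at the image center.
function perspectiveFromIntrinsics(fx, fy, cx, cy, n, f) {
  return [
    [fx / cx, 0, 0, 0],
    [0, fy / cy, 0, 0],
    [0, 0, (f + n) / (n - f), (2 * f * n) / (n - f)],
    [0, 0, -1, 0],
  ];
}

// Vertical field of view (radians) implied by fy and the image height.
function verticalFovFromIntrinsics(fy, height) {
  return 2 * Math.atan(height / (2 * fy));
}

// Hypothetical camera: 640x480 image with a 500 pixel focal length.
const width = 640, height = 480;
const fx = 500, fy = 500;

const m = perspectiveFromIntrinsics(fx, fy, width / 2, height / 2, 0.1, 100);
const fov = verticalFovFromIntrinsics(fy, height);

// m[1][1] matches e = 1 / tan(fov / 2), and m[0][0] matches e / aspect.
console.log(m[1][1], 1 / Math.tan(fov / 2));
console.log(m[0][0], 1 / Math.tan(fov / 2) / (width / height));
```

With $c_{y} = height / 2$, the entry $f_{y} / c_{y}$ is the same quantity as $e$ in the field of view parameterization, which is why the two forms can be converted freely for symmetric projections.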

Framing

Sometimes it's desirable to change the position of the camera such that some object is framed relative to the viewport. Unlike the very specific dolly zoom example above, the field of view is most likely a fixed size.

Illustration of how the position of a camera can change to accommodate framing any distance from the camera with any field of view. Boy Finnigon by Fion Noir (CC-BY-4.0)

For example, a lot of thought went into creating the framing rules used in model-viewer. We wanted an arbitrarily-sized model to look great inside of an arbitrarily sized viewport. To ensure good "framing", the model is placed inside of a "room" representing the camera frustum that maximizes the model's size given the current aspect ratio. The camera's near plane "frames" the room's forward plane.

Given a static vertical field of view, and the height of the frame in world units, the corresponding camera's position can be calculated via similar triangles, using values from Figure 1 above. Using half of the height and half of the field of view (in radians), the distance can be derived the same way as the near plane ( $n = t / \tan(fov/2)$ ).

const d = (height / 2) / Math.tan(fov / 2)

Similarly, this can be done with horizontal field of view and extents, or revised to find the size of a frustum at a given distance from the camera.
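These variations can be sketched as small helpers. The function names are assumptions for this example, all angles are in radians, and the aspect ratio is width divided by height.

```javascript
// Distance at which a frame of the given size exactly fills the field of view.
function distanceForSize(size, fov) {
  return size / 2 / Math.tan(fov / 2);
}

// Inverse: the size of the frustum cross-section at a distance from the camera.
function sizeAtDistance(distance, fov) {
  return 2 * distance * Math.tan(fov / 2);
}

// Horizontal field of view from the vertical field of view and aspect ratio.
function horizontalFov(verticalFov, aspect) {
  return 2 * Math.atan(aspect * Math.tan(verticalFov / 2));
}

// Example values are arbitrary.
const fov = Math.PI / 2; // 90 degrees vertically
const aspect = 2;

// Fit a frame 2 units tall vertically: about 1 unit away.
const dVertical = distanceForSize(2, fov);
// Fit a frame 4 units wide horizontally: also about 1 unit away.
const dHorizontal = distanceForSize(4, horizontalFov(fov, aspect));

console.log(dVertical, dHorizontal, sizeAtDistance(dVertical, fov));
```

To frame an object on both axes, one option is to take the larger of the vertical and horizontal distances so that neither dimension is cropped.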

Orthographic projection

Orthographic projections lack perspective and are a bit more straightforward than perspective projections.

Frustum visualization of orthographic projection.

As with perspective projection, the orthographic projection matrix can be constructed from its extent values:

\begin{bmatrix} \dfrac{2}{r - l} & 0 & 0 & -\dfrac{r + l}{r - l} \\ 0 & \dfrac{2}{t - b} & 0 & -\dfrac{t + b}{t - b} \\ 0 & 0 & \dfrac{-2}{f - n} & -\dfrac{f + n}{f - n} \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}

A simplified form can be used for symmetric projections, where $r = -l$ and $t = -b$ .

\begin{bmatrix} \dfrac{1}{r} & 0 & 0 & 0 \\ 0 & \dfrac{1}{t} & 0 & 0 \\ 0 & 0 & \dfrac{-2}{f - n} & -\dfrac{f + n}{f - n} \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}
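As a quick check of the general orthographic matrix, the sketch below builds it from extents and transforms two corners of the viewing volume. `orthographicFromExtents` and `transformPoint` are hypothetical names, and the matrix is an array of rows in the order written above.

```javascript
// Orthographic matrix from extents, as an array of four rows.
function orthographicFromExtents(l, r, b, t, n, f) {
  return [
    [2 / (r - l), 0, 0, -(r + l) / (r - l)],
    [0, 2 / (t - b), 0, -(t + b) / (t - b)],
    [0, 0, -2 / (f - n), -(f + n) / (f - n)],
    [0, 0, 0, 1],
  ];
}

// Transform a camera-space point (w = 1). The resulting w stays 1, so no
// perspective divide is needed.
function transformPoint(m, [x, y, z]) {
  return m.map((row) => row[0] * x + row[1] * y + row[2] * z + row[3]);
}

// Example extents are arbitrary: an 8x4 volume from n = 1 to f = 9.
const m = orthographicFromExtents(-4, 4, -2, 2, 1, 9);

console.log(transformPoint(m, [4, 2, -1]));   // [1, 1, -1, 1]
console.log(transformPoint(m, [-4, -2, -9])); // [-1, -1, 1, 1]
```

Because the bottom row is $(0, 0, 0, 1)$, the output $w$ is always 1 and a point's X and Y in NDC do not depend on its depth, which is the absence of perspective described above.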