Model View Projection

April 14, 2019

In 3D engines, scenes are typically described as objects in three-dimensional space, with each object composed of many three-dimensional vertices. Ultimately, these objects are rendered and displayed on a flat screen. A scene is always rendered relative to a camera, so the scene's vertices must also be defined relative to the camera's view.

A scene being visualized in world space, camera space, and then normalized device coordinates, representing the stages of transformation in the Model View Projection pipeline.

When drawing a mesh in an OpenGL pipeline, a vertex shader processes every vertex and is expected to output the vertex's position in clip space. Model View Projection is a common series of matrix transformations that can be applied to a vertex defined in model space, transforming it into clip space, where it can then be rasterized.

v^{\prime} = P \cdot V \cdot M \cdot v

A vertex position is transformed by a model matrix, then a view matrix, followed by a projection matrix, hence the name Model View Projection, or MVP.

Model Space

Models, geometry, and meshes are each a series of vertices defined in model space. For example, a cube geometry could be defined by 8 vertices: $(1, 1, 1)$, $(-1, -1, -1)$, $(1, 1, -1)$, and so on. This results in a 2x2x2 cube centered at $(0, 0, 0)$.

2x2x2 cube with corners at (-1, -1, -1) and (1, 1, 1) — A geometry's vertices defined in model space.
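As a small illustration of the cube described above, the 8 corners can be generated rather than typed out. This is only a sketch; the names `cube_vertices`, `size`, and `center` are chosen here for illustration.

```python
from itertools import product

# Every combination of -1 and 1 on the three axes gives one corner of the cube.
cube_vertices = [list(corner) for corner in product((-1.0, 1.0), repeat=3)]

# The extent along each axis is the largest coordinate minus the smallest.
size = [max(v[i] for v in cube_vertices) - min(v[i] for v in cube_vertices)
        for i in range(3)]

# The center is the average of all the corners.
center = [sum(v[i] for v in cube_vertices) / len(cube_vertices) for i in range(3)]
```

Here `size` comes out as 2 on every axis and `center` as the origin, matching the 2x2x2 cube centered at $(0, 0, 0)$.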

Often geometry is reused multiple times in the same render, at different locations or different sizes. Pushing unique vertices for each model instance is costly and unnecessary. A single set of geometry vertices can be shared across multiple instances, with each instance applying its own unique set of transformations, represented by a model matrix. The model matrix transforms vertices from model space to world space. A 2x2x2 cube centered at $(0, 0, 0)$ can be resized, twisted and placed anywhere when combined with a model matrix.

Cubes of various sizes and rotations — Many cubes in world space.
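The idea of sharing one set of vertices across many instances can be sketched as follows. The instance names, positions, and scales are made up for illustration, and only two of the cube's corners are listed to keep the example short.

```python
def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx], [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz], [0.0, 0.0, 0.0, 1.0]]

def uniform_scale(s):
    return [[s, 0.0, 0.0, 0.0], [0.0, s, 0.0, 0.0],
            [0.0, 0.0, s, 0.0], [0.0, 0.0, 0.0, 1.0]]

# One shared geometry in model space (two opposite corners of the cube).
shared_vertices = [[-1.0, -1.0, -1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]

# Each instance stores only a model matrix, never its own copy of the vertices.
instances = {
    "small_cube": mat_mul(translation(-4.0, 0.0, 0.0), uniform_scale(0.5)),
    "large_cube": mat_mul(translation(6.0, 0.0, 0.0), uniform_scale(3.0)),
}

# The same vertices land in different places in world space per instance.
world_vertices = {name: [transform(m, v) for v in shared_vertices]
                  for name, m in instances.items()}
```

In a real renderer the per-instance multiplication would typically happen in the vertex shader; doing it on the CPU here just makes the result easy to inspect.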

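A model matrix $M$ is composed from an object's translation transform $T$, rotation transform $R$, and scale transform $S$. Multiplying a vertex position $v$ by this model matrix transforms the vector into world space. Because the vertex sits on the right of the product, the scale is applied first, then the rotation, then the translation.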

\begin{aligned} M &= T \cdot R \cdot S \\ v_{world} &= M \cdot v_{model} \\ \end{aligned}
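A minimal sketch of this composition, assuming column vectors and row-major nested lists for the matrices. The helper names and the specific translation, rotation, and scale values are illustrative only.

```python
import math

def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx], [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz], [0.0, 0.0, 0.0, 1.0]]

def rotation_y(angle):
    """Right-handed rotation about the Y axis, angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s, 0.0], [0.0, 1.0, 0.0, 0.0],
            [-s, 0.0, c, 0.0], [0.0, 0.0, 0.0, 1.0]]

def scale(sx, sy, sz):
    return [[sx, 0.0, 0.0, 0.0], [0.0, sy, 0.0, 0.0],
            [0.0, 0.0, sz, 0.0], [0.0, 0.0, 0.0, 1.0]]

T = translation(10.0, 0.0, 0.0)
R = rotation_y(math.pi / 2)
S = scale(2.0, 2.0, 2.0)

# M = T . R . S: with column vectors, S acts on the vertex first, then R, then T.
M = mat_mul(T, mat_mul(R, S))

v_model = [1.0, 1.0, 1.0, 1.0]    # a cube corner, with w = 1
v_world = transform(M, v_model)   # approximately [12, 2, -2, 1]
```

The corner is first scaled to $(2, 2, 2)$, rotated a quarter turn about Y to $(2, 2, -2)$, and finally moved 10 units along X.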

View

World space is the shared global 3D Cartesian coordinate system. Renderable objects, lights, and cameras all exist within this space, defined by their model matrix, all relative to the same $(0, 0, 0)$ point.

As all renders are from some camera's perspective, all vertices must be defined relative to the camera.

Camera space is the coordinate system in which the camera sits at $(0, 0, 0)$, facing down its -Z axis. The camera also has a model matrix defining its position and orientation in world space. The inverse of the camera's model matrix is the view matrix, and it transforms vertices from world space to camera space, also called view space.

A camera at (0, 0, 0), highlighting its visible volume — A scene in camera space, where everything is relative to the camera, the origin.

Sometimes the view matrix and model matrix are premultiplied and stored as a model-view matrix. While each object has its own model matrix, the view matrix is shared by all objects in the scene, as they are all rendered from the same camera. Given a camera's model matrix $C$, any vector $v$ can be transformed from model space, to world space, to camera space.

\begin{aligned} V &= C^{-1} \\ v_{camera} &= V \cdot M \cdot v_{model} \\ \end{aligned}

In an OpenGL system where the camera faces down -Z, any vertex that will be rendered must be in front of the camera, and in camera space, will have a negative Z value.
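To make the view transform concrete, here is a sketch that builds a camera model matrix, inverts it, and checks where the world origin ends up in camera space. It assumes the camera's matrix contains only rotation and translation (no scale), which allows a cheap inverse; `invert_rigid` is a helper written for this example, not a library function.

```python
import math

def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx], [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz], [0.0, 0.0, 0.0, 1.0]]

def rotation_y(angle):
    """Right-handed rotation about the Y axis, angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s, 0.0], [0.0, 1.0, 0.0, 0.0],
            [-s, 0.0, c, 0.0], [0.0, 0.0, 0.0, 1.0]]

def invert_rigid(m):
    """Invert a 4x4 matrix made only of rotation and translation (assumed: no scale)."""
    # Transposing the 3x3 rotation block inverts the rotation.
    rt = [[m[j][i] for j in range(3)] for i in range(3)]
    t = [m[i][3] for i in range(3)]
    # The inverse translation is -R^T . t.
    inv_t = [-sum(rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [rt[0] + [inv_t[0]], rt[1] + [inv_t[1]], rt[2] + [inv_t[2]],
            [0.0, 0.0, 0.0, 1.0]]

# Hypothetical camera: placed at (5, 0, 0) and turned 90 degrees about Y,
# so that its -Z axis points back toward the world origin.
C = mat_mul(translation(5.0, 0.0, 0.0), rotation_y(math.pi / 2))
V = invert_rigid(C)

origin_world = [0.0, 0.0, 0.0, 1.0]
origin_camera = transform(V, origin_world)   # approximately [0, 0, -5, 1]
```

The world origin is 5 units in front of this camera, so its camera-space Z value is negative, as expected for anything the camera can see.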

Projection

Once vertices are in camera space, they can finally be transformed into clip space by applying a projection transformation. The projection matrix encodes how much of the scene is captured in a render by defining the extents of the camera's view. The two most common types of projection are perspective and orthographic.

Perspective projection results in the natural effect of things appearing smaller the further away they are from the viewer. Orthographic projection does not have this effect, which makes it useful for technical schematics or architectural blueprints, for example. Much like how different lenses on a traditional camera can drastically change the field of view or distortion, the projection matrix transforms the scene in a similar way.
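For reference, one common form of these two matrices is given here, assuming a symmetric view volume and the conventions of OpenGL's classic gluPerspective and glOrtho helpers. The symbols are introduced for this example and are not from the text above: $\theta$ is the vertical field of view, $a$ is the aspect ratio (width over height), $n$ and $f$ are the distances to the near and far planes, and $r$ and $t$ are the half-width and half-height of the orthographic volume.

\begin{aligned} P_{persp} &= \begin{bmatrix} \frac{1}{a \tan(\theta/2)} & 0 & 0 & 0 \\ 0 & \frac{1}{\tan(\theta/2)} & 0 & 0 \\ 0 & 0 & \frac{f+n}{n-f} & \frac{2fn}{n-f} \\ 0 & 0 & -1 & 0 \end{bmatrix} \\ P_{ortho} &= \begin{bmatrix} \frac{1}{r} & 0 & 0 & 0 \\ 0 & \frac{1}{t} & 0 & 0 \\ 0 & 0 & \frac{-2}{f-n} & -\frac{f+n}{f-n} \\ 0 & 0 & 0 & 1 \end{bmatrix} \end{aligned}

The bottom row is the key difference: $(0, 0, -1, 0)$ copies the negated camera-space depth into the fourth component of the result, while $(0, 0, 0, 1)$ leaves that component untouched.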

After applying a projection matrix, the scene's vertices are in clip space. Note that throughout these transformations the 3D vertex positions are represented by 4D vectors of homogeneous coordinates, starting with $w = 1$.

v_{clip} = P \cdot V \cdot M \cdot v

In camera space, after the model-view transformations, $w$ is unchanged and still equal to 1. Perspective projection is a large part of why the fourth coordinate is needed: after a perspective projection is applied, $w$ may no longer equal 1.
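The effect on $w$ can be checked numerically. The sketch below assumes projection matrices in the form built by gluPerspective and glOrtho; the field of view, plane distances, and the sample vertex are arbitrary illustrative values.

```python
import math

def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def perspective(fovy, aspect, near, far):
    """Symmetric perspective matrix in the form built by gluPerspective."""
    f = 1.0 / math.tan(fovy / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def orthographic(left, right, bottom, top, near, far):
    """Orthographic matrix in the form built by glOrtho."""
    return [
        [2.0 / (right - left), 0.0, 0.0, -(right + left) / (right - left)],
        [0.0, 2.0 / (top - bottom), 0.0, -(top + bottom) / (top - bottom)],
        [0.0, 0.0, -2.0 / (far - near), -(far + near) / (far - near)],
        [0.0, 0.0, 0.0, 1.0],
    ]

v_camera = [1.0, 1.0, -5.0, 1.0]   # 5 units in front of the camera, w = 1

clip_persp = transform(perspective(math.radians(90), 1.0, 1.0, 10.0), v_camera)
clip_ortho = transform(orthographic(-10.0, 10.0, -10.0, 10.0, 1.0, 10.0), v_camera)

w_persp = clip_persp[3]   # 5.0: the bottom row (0, 0, -1, 0) copies -z into w
w_ortho = clip_ortho[3]   # 1.0: the bottom row (0, 0, 0, 1) leaves w alone
```

With this form of perspective matrix, $w_{clip}$ is simply the vertex's distance in front of the camera, which is what later makes distant vertices shrink.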

The vertex shader in OpenGL expects vec4 gl_Position to be set to clip space coordinates. Once the vertex shader finishes and the clip space position is known, the pipeline automatically performs perspective division: it divides the $[x, y, z]$ components by the $w$ value, turning the 4D vector back into a 3D vector. The vertex is then finally in normalized device coordinates.

\begin{bmatrix} x_{ndc} \\ y_{ndc} \\ z_{ndc} \end{bmatrix} = \begin{bmatrix} x_{clip}/w_{clip} \\ y_{clip}/w_{clip} \\ z_{clip}/w_{clip} \end{bmatrix}
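The division above is what produces the foreshortening of perspective projection. As a sketch, again assuming a gluPerspective-style matrix and arbitrary sample values, two points with the same camera-space X and Y but different depths end up at different positions in normalized device coordinates:

```python
import math

def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def perspective(fovy, aspect, near, far):
    """Symmetric perspective matrix in the form built by gluPerspective."""
    f = 1.0 / math.tan(fovy / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def perspective_divide(clip):
    """Divide x, y, z by w to go from clip space to normalized device coordinates."""
    x, y, z, w = clip
    return [x / w, y / w, z / w]

P = perspective(math.radians(90), 1.0, 1.0, 10.0)

# Two camera-space points with the same x and y, one twice as far away.
near_point = perspective_divide(transform(P, [1.0, 1.0, -2.0, 1.0]))  # x_ndc about 0.5
far_point = perspective_divide(transform(P, [1.0, 1.0, -4.0, 1.0]))   # x_ndc about 0.25
```

Doubling the distance halves the X and Y values after division, so the farther point is drawn closer to the center of the screen.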

A scene being transformed into NDC space, highlighting corners at (-1, -1, 1) and (1, 1, -1) — Visualization of objects in normalized device coordinates. Note that the Z axis has flipped, where the camera is now facing down the +Z axis.

At this point, geometry outside of a 2x2x2 cube with extents at $(-1, -1, -1)$ and $(1, 1, 1)$ is not drawn. (OpenGL performs this clipping in clip space, before the division, by testing $-w \le x, y, z \le w$, which describes the same volume.) The entire visible scene, defined by the projection matrix, is now collapsed into this cube, with the frustum extents defining how much was squashed into it: the near plane is mapped to $z=-1$ and the far plane to $z=1$.
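Both the near and far plane mapping and the visibility test can be checked with a short sketch. It assumes a gluPerspective-style matrix with illustrative near and far distances of 1 and 10; `inside_clip_volume` is a helper written for this example and only tests single points, whereas a real pipeline clips whole primitives.

```python
import math

def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def perspective(fovy, aspect, near, far):
    """Symmetric perspective matrix in the form built by gluPerspective."""
    f = 1.0 / math.tan(fovy / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def inside_clip_volume(clip):
    """True when -w <= x, y, z <= w, the clip-space form of the NDC cube test."""
    x, y, z, w = clip
    return all(-w <= c <= w for c in (x, y, z))

near, far = 1.0, 10.0
P = perspective(math.radians(90), 1.0, near, far)

# Points on the near and far planes, straight ahead of the camera.
on_near = transform(P, [0.0, 0.0, -near, 1.0])
on_far = transform(P, [0.0, 0.0, -far, 1.0])
z_ndc_near = on_near[2] / on_near[3]   # approximately -1
z_ndc_far = on_far[2] / on_far[3]      # approximately 1

# A point between the planes is kept; one closer than the near plane is not.
visible = inside_clip_volume(transform(P, [0.0, 0.0, -5.0, 1.0]))
too_close = inside_clip_volume(transform(P, [0.0, 0.0, -0.5, 1.0]))
```

Here `visible` is true and `too_close` is false, since the second point lies between the camera and the near plane.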

The model, view, and projection matrices carry vertices from model space through world space and camera space into clip space. The vertices are then transformed into normalized device coordinates via implicit perspective division. Finally, a viewport transform maps normalized device coordinates to a window space position: the X and Y position of a pixel in two dimensions. Some point in 3D space, relative to some viewer, has become a specific pixel on a screen.

\begin{aligned} v_{world} &= M \cdot v_{model} \\ v_{camera} &= V \cdot M \cdot v_{model} \\ v_{clip} &= P \cdot V \cdot M \cdot v_{model} \\ \end{aligned}
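Putting the whole chain together, the following sketch takes one vertex from model space all the way to a pixel position. Everything specific in it is an assumption made for illustration: the 800x600 viewport, the object and camera placements, the gluPerspective-style projection, and a viewport whose origin is at $(0, 0)$. The camera is only translated, so its inverse is written out directly rather than computed.

```python
import math

def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx], [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz], [0.0, 0.0, 0.0, 1.0]]

def perspective(fovy, aspect, near, far):
    """Symmetric perspective matrix in the form built by gluPerspective."""
    f = 1.0 / math.tan(fovy / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def viewport_transform(ndc, width, height):
    """Map NDC x, y in [-1, 1] to window coordinates for a viewport at (0, 0)."""
    return [(ndc[0] + 1.0) * width / 2.0, (ndc[1] + 1.0) * height / 2.0]

width, height = 800, 600

M = translation(2.0, 0.0, 0.0)    # object placed at x = 2 in world space
V = translation(0.0, 0.0, -6.0)   # inverse of a camera translated to z = 6
P = perspective(math.radians(90), width / height, 1.0, 10.0)

# The three matrices can be combined once and reused for every vertex.
MVP = mat_mul(P, mat_mul(V, M))

v_model = [1.0, 1.0, 1.0, 1.0]
v_clip = transform(MVP, v_model)
v_ndc = [v_clip[i] / v_clip[3] for i in range(3)]   # perspective division
pixel = viewport_transform(v_ndc, width, height)    # approximately [580, 360]
```

In an actual OpenGL program only the product up to `v_clip` would be computed in the vertex shader; the division and viewport mapping are carried out by the pipeline itself.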