# Camera Calibration

Cameras have distortions. You know this after seeing Stuff Made Here's video.

See Pinhole Camera Geometry which goes into this in more detail.

https://docs.opencv.org/3.4.3/dc/dbb/tutorial_py_calibration.html - does intrinsics calibration
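
A minimal sketch of what that tutorial does, assuming a folder `calib/` of checkerboard images and a 9×6 inner-corner board (both are my own assumptions, swap in your data):

```python
import glob
import cv2
import numpy as np

# Assumed board geometry: 9x6 inner corners, square size in arbitrary units.
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []  # 3D board points and 2D detections per image
gray = None
for fname in glob.glob("calib/*.jpg"):  # hypothetical image folder
    img = cv2.imread(fname)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern, None)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K is the intrinsic matrix, dist the distortion coefficients;
# rvecs/tvecs are the per-image extrinsics.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("Intrinsic matrix K:\n", K)
```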

Resources

- Camera Parameters - Extrinsics and Intrinsics (Cyrill Stachniss)
- slides here

- Camera Intrinsics and Extrinsics - 5 Minutes with Cyrill
- Mapping the 3D World to an Image - 5 Minutes with Cyrill

Two parts to camera calibration:

- Camera Intrinsics: focal length + the principal point shift
- Camera Extrinsics: Static transforms from global frame of reference

This is something you personally saw during your time at NVIDIA.

All the matrices used:

- Intrinsic Matrix
- Extrinsic Matrix
- Projection Matrix
- Reprojection Matrix
- Essential Matrix
- Fundamental Matrix

Repositories

### WATonomous

Apparently, I need to do this before doing Euclidean Clustering.

Personally, I have never done this before. It seems that F1TENTH does this, see Lab8.

### Intrinsics vs. Extrinsics

I really, really learned these things through NVIDIA.

Intrinsics are the parameters inside the camera (think Focal Length), whereas extrinsics are the outside parameters relative to the environment (think Static Transforms).

We need to calibrate both.

## Camera Calibration notes from Cyrill Stachniss

Notation

I don't like the notation in these videos, but I think I need to stick to it to stay consistent as I learn from the material. It also helps me learn about coordinate systems in different contexts.

This is Camera Calibration fundamentals. Camera Parameters https://www.youtube.com/watch?v=uHApDqH-8UE&list=PLgnQpQtFTOGRYjqjdZxTEQPZuFHQa7O7Y

Slides: http://www.ipb.uni-bonn.de/html/teaching/photo12-2021/2021-pho1-20-camera-params.pptx.pdf

Most fundamental equation: $x=PX$

- Lowercase = points in image (2D)
- Uppercase = points in 3D World coordinate (3D)

Coordinate Systems

- World/object coordinate system $o$
- Camera coordinate system $k$
- Image plane coordinate system $c$
- Sensor coordinate system $s$

Why aren't the image plane and the sensor coordinate system the same thing?

It is just a matter of a simple Affine Transformation, because you don't know how the sensor is configured. Actually, if you look at Camera Intrinsics (written from the SLAM book), there is no sensor coordinate system; the image plane is assumed to be corrected so that it is zeroed at the Principal Point.

We want to compute the mapping

$\begin{bmatrix} {_{s}x} \\ {_{s}y} \\ 1 \end{bmatrix} = {_{s}H_{c}} \, {_{c}P_{k}} \, {_{k}H_{o}} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$

- $_{s}H_{c}$ is the image plane to sensor mapping
- $_{c}P_{k}$ is the camera to image mapping
- $_{k}H_{o}$ is the object to camera mapping

To get all these transforms, we break it down between extrinsic parameters and intrinsic parameters.

Extrinsic vs. Intrinsic Parameters

**Extrinsic parameters** describe the pose of the camera in the world.
**Intrinsic parameters** describe the mapping of the scene in front of the camera to the pixels in the final image (sensor).

**Extrinsic Parameters**

- Describe the Pose of the camera with respect to the world coordinate frame
- Is an invertible transformation
- Number of parameters: 3 for the position + 3 for the orientation
- (although I would argue 4?? if you use a quaternion)

Point $P$ with coordinates in **world** coordinates
$X_{P}=\begin{bmatrix} X_{P} \\ Y_{P} \\ Z_{P} \end{bmatrix}$

Center $O$ of the projection (origin of the camera coordinate system): $X_{O}=\begin{bmatrix} X_{O} \\ Y_{O} \\ Z_{O} \end{bmatrix}$

- $X_{O}$ sometimes also called $Z$ or $Z_{O}$

*Translation* between the origin of the world coordinate system and the camera coordinate system: $X_{O}=\begin{bmatrix} X_{O} \\ Y_{O} \\ Z_{O} \end{bmatrix}$

*Rotation* $R$ from $S_{O}$ to $S_{K}$.

In Euclidean coordinates, combining translation and rotation yields $_{k}X_{P}=R(X_{P}-X_{O})$

- $_{k}X_{p}$ means point $p$ measured in frame $k$

I am confusion

Normally, we rotate first and then translate, but why is it the case here that we translate first? It's important to understand what frames these points are expressed in. The issue is that $X_{O}$ (the translation) is expressed in the world frame, whereas usually that translation is given in the target (camera) frame. Expressed in the camera frame, the equivalent translation is $R(-X_{O})=-RX_{O}$, which is exactly the translation that shows up in the homogeneous matrix below.

But we like to work with Homogeneous Coordinates. We get $_{k}H_{o}=\begin{bmatrix} R & -RX_{O} \\ 0^{T} & 1 \end{bmatrix}$

Italicized is in Euclidean coordinates. Not italicized is in homogeneous coordinates.
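
A small numpy sketch of the extrinsics above, with made-up numbers for $R$ and $X_{O}$, just to check that the homogeneous $_{k}H_{o}$ and the Euclidean $R(X_{P}-X_{O})$ agree:

```python
import numpy as np

# Made-up extrinsics: camera at X_O, rotated 30 degrees about the z-axis.
theta = np.radians(30)
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
X_O = np.array([2.0, 1.0, 0.5])  # camera origin in world coordinates

# Homogeneous object-to-camera transform kHo = [R, -R X_O; 0^T, 1]
kHo = np.eye(4)
kHo[:3, :3] = R
kHo[:3, 3] = -R @ X_O

# A world point, transformed both ways to confirm they agree.
X_P = np.array([4.0, 3.0, 2.0])
euclidean = R @ (X_P - X_O)                    # kX_P = R (X_P - X_O)
homogeneous = (kHo @ np.append(X_P, 1.0))[:3]  # same point via kHo
print(np.allclose(euclidean, homogeneous))     # True
```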

The central projection is not invertible. Why?

- because we go from 3D to 2D. There is a projection. We lose a degree of freedom.

Camera Intrinsics Steps

The camera intrinsics mapping is split into 3 steps:

- Ideal perspective projection to the image plane → central projection
- Mapping to the sensor coordinate system ("where the pixels are") → an affine transformation
- Compensation for the fact that the two previous mappings are idealized
$P \to \dots \to P'$

Assumption is that all rays pass through the same point (pinhole).

- This is not true in reality, because the pinhole cannot be infinitesimally small

##### First Mapping

- This slide is illustrating how, physically, the image on the image plane is flipped upside down (shown on the right)
- However, we flip the image by 180 degrees to make it more convenient to work with
- The diagram is kind of confusing, I don't really understand what is going on, I made my own diagram below

Ideal Perspective Projection

- Distortion-free lens
- All rays are straight lines and pass through the projection center. This point is the origin of the camera coordinate system $S_{k}$
- Focal point and principal point lie on the optical axis
- The distance from the camera origin to the image plane is the constant $c$

- We basically project the 3D point onto the 2D image plane, and then drop the 3rd coordinate ($z$)

The above are results derived from pure geometry. I spent a bit of time deriving it myself:

So basically the derivation is like similar triangles. We know that $\frac{_{k}X_{p}}{_{k}Z_{p}}=\frac{_{k}X_{p'}}{_{k}Z_{p'}}$

You rearrange the terms and it's super simple: since $c={_{k}Z_{p'}}$ (the image plane sits at depth $c$), we get $_{k}X_{p'}={_{k}Z_{p'}}\frac{_{k}X_{p}}{_{k}Z_{p}}=c\frac{_{k}X_{p}}{_{k}Z_{p}}$

This is easy to see from the image that I drew below.
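
A quick sanity check with numbers I made up (not from the slides): say $c = 0.05$ m and the point sits at $_{k}X_{p} = 1$ m, $_{k}Z_{p} = 5$ m. Then

$_{k}X_{p'} = c\frac{_{k}X_{p}}{_{k}Z_{p}} = 0.05 \cdot \frac{1}{5} = 0.01$ m,

i.e. the point lands 1 cm from the principal point on the image plane.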

I am confused, what is `c`?

$c$ is the distance from the camera origin to the image plane. So then what is $_{k}Z_{p}$?

- That is the distance from the camera origin to the actual point $p$
- See the diagram above

How does the `Z` coordinate magically get lost?

Intuitively, it makes sense, because everything is getting flattened into a 2D image plane. All the projected points have the same $Z$ value, which is the constant $c$. You're projecting onto a fixed image plane.

Now, we want to try expressing this in homogeneous coordinates, so we get the following:

- So why do they divide it by $_{k}Z_{p}$? Cyrill says it is to turn it from Homogeneous to Euclidean Coordinates

$_{c}x_{p'}={_{c}P_{k}}\,{_{k}X_{p}}$ with $_{c}P_{k}=\begin{bmatrix} c & 0 & 0 & 0 \\ 0 & c & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$
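
A tiny numpy sketch of this step, with a made-up $c$ and camera-frame point; the division by the last coordinate is exactly the division by $_{k}Z_{p}$ mentioned above:

```python
import numpy as np

c = 0.05  # assumed camera constant (distance to the image plane), metres

# cPk: ideal central projection from homogeneous 3D (camera frame) to homogeneous 2D
cPk = np.array([[c, 0, 0, 0],
                [0, c, 0, 0],
                [0, 0, 1, 0]], dtype=float)

kX_p = np.array([1.0, 2.0, 5.0, 1.0])  # homogeneous camera-frame point

x_h = cPk @ kX_p      # homogeneous image point: [c*X, c*Y, Z]
x = x_h[:2] / x_h[2]  # divide by kZ_p to get Euclidean image coordinates
print(x)              # [0.01, 0.02]
```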

Recap

So now, we can go from world to camera, and camera to image plane.

**Calibration Matrix**
We can now define the calibration matrix for the ideal camera

$_{c}K=\begin{bmatrix} c & 0 & 0 \\ 0 & c & 0 \\ 0 & 0 & 1 \end{bmatrix}$

$_{c}P={_{c}K}\,[R \mid -RX_{O}]={_{c}K}\,R\,[I_{3} \mid -X_{O}]$

- The $\mid$ is just a more compact way of writing a matrix, you're essentially combining the blocks into a single matrix, so $[I_{3} \mid -X_{O}]$ is a $3 \times 4$ matrix
- $_{c}P$ is a $3 \times 4$ matrix then?? Yup, remember, that isn't the actual point
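
A small numpy sketch of the ideal projection matrix above, again with made-up $c$, $R$ and $X_{O}$, just to see that $_{c}P$ really is $3 \times 4$ and maps world points to image-plane points:

```python
import numpy as np

c = 0.05                 # assumed camera constant
cK = np.diag([c, c, 1.0])  # calibration matrix of the ideal camera

theta = np.radians(30)   # made-up extrinsics, same style as earlier
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
X_O = np.array([2.0, 1.0, 0.5])

# cP = cK [R | -R X_O] = cK R [I | -X_O], a 3x4 matrix
cP = cK @ np.hstack([R, (-R @ X_O).reshape(3, 1)])
print(cP.shape)  # (3, 4)

# Project a homogeneous world point and normalize the homogeneous result
X = np.array([4.0, 3.0, 2.0, 1.0])
x_h = cP @ X
print(x_h[:2] / x_h[2])  # image-plane coordinates of the point
```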

##### Second Mapping for Intrinsics

The next step is to move to the sensor coordinate frame. The image plane and sensor frames differ due to a few factors:

- Different origins (principal point)
- Scale difference
- Shear compensation

The principal point of the image is the point where the optical axis passes through the image plane.

- the point on the image plane onto which the perspective center is projected

I'm still not sure what the principal point is...

The principal point is basically the origin of our image plane coordinate frame $c$. The thing is that the sensor's origin and this point are not exactly the same. We compensate for this shift using $x_{H}$ and $y_{H}$, which are basically the offset between the two coordinate frames $s$ and $c$.

Why don't we just treat the camera frame as the sensor frame?

Well the whole point of working with different coordinate frames is that it is reflective of the real world. Same reason you wouldn't have everything expressed in the same frame, see Coordinate Frame.

The origin of the sensor system is not exactly at the principal point.

- We compensate for this through a shift $x_{H}$ and $y_{H}$

$_{s}H_{c}=\begin{bmatrix} 1 & 0 & x_{H} \\ 0 & 1 & y_{H} \\ 0 & 0 & 1 \end{bmatrix}$

The shift $(x_{H}, y_{H})$ locates the principal point on the sensor, which is roughly its physical center.

We add in the other factors:

- scale difference $m$ in $x$ and $y$
- Shear compensation $s$ (for digital cameras, we typically have $s \approx 0$)

so we end up with this calibration matrix $_{s}H_{c}=\begin{bmatrix} 1 & s & x_{H} \\ 0 & 1+m & y_{H} \\ 0 & 0 & 1 \end{bmatrix}$

If we multiply it with the first mapping, we get

$K={_{s}H_{c}}\,{_{c}K}=\begin{bmatrix} c & cs & x_{H} \\ 0 & c(1+m) & y_{H} \\ 0 & 0 & 1 \end{bmatrix}$

Something I realized

See the Camera Intrinsics page, but basically you end up with an unnormalized $3 \times 1$ vector; you normalize it so the third coordinate becomes 1, and that is how you lose the extra dimension.
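
Tying it together, a small sketch with made-up values for $c$, $m$, $s$, $x_{H}$, $y_{H}$ (treating $c$ in pixel units for simplicity, which is an assumption on my part): build $K$, apply it to a camera-frame point, and normalize so the third coordinate becomes 1.

```python
import numpy as np

# Made-up intrinsics: camera constant, scale difference, shear, principal point
c, m, s = 800.0, 0.01, 0.0
x_H, y_H = 320.0, 240.0

K = np.array([[c, c * s,       x_H],
              [0, c * (1 + m), y_H],
              [0, 0,           1  ]])

# Camera-frame point, projected and expressed in pixels
kX_p = np.array([0.2, -0.1, 2.0])  # X, Y, Z in the camera frame
x_h = K @ kX_p                     # unnormalized 3x1 homogeneous vector
u, v = x_h[:2] / x_h[2]            # normalize so the 3rd coordinate is 1
print(u, v)                        # pixel coordinates
```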

DLT: Direct Linear Transform

Has 2 meanings

- 6 extrinsics + 5 intrinsics

$P=KR[I_{3} \mid -X_{O}]$

- 6 extrinsics: $R$, $X_{O}$
- 5 intrinsics: $c,x_{H},y_{H},m,s$

- Used for estimating those parameters

Affine camera = camera with an affine mapping to the sensor.
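
A rough sketch of the second meaning of DLT above, estimating $P$ from known 2D-3D correspondences with the standard homogeneous linear system and SVD; the correspondences here are synthetic and the helper `dlt` is my own naming, not from the lecture:

```python
import numpy as np

def dlt(world_pts, image_pts):
    """Estimate the 3x4 projection matrix P from 2D-3D correspondences.

    Each correspondence (X, x) contributes two rows to A such that A p = 0,
    where p holds the 12 entries of P; the solution is the null vector of A.
    """
    A = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        Xh = [X, Y, Z, 1.0]
        A.append([*Xh, 0, 0, 0, 0, *(-u * np.array(Xh))])
        A.append([0, 0, 0, 0, *Xh, *(-v * np.array(Xh))])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)  # smallest singular vector, reshaped to P

# Synthetic test: make up a P, generate correspondences, re-estimate it.
P_true = np.array([[800, 0, 320, 10],
                   [0, 800, 240, 20],
                   [0, 0, 1, 1]], dtype=float)
world = np.random.rand(10, 3) * 5 + np.array([0, 0, 2])  # points in front of camera
img_h = (P_true @ np.c_[world, np.ones(10)].T).T
img = img_h[:, :2] / img_h[:, 2:]
P_est = dlt(world, img)
print(np.allclose(P_est / P_est[2, 3], P_true / P_true[2, 3]))  # True (up to scale)
```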

##### 3rd mapping: Non-Linear Errors

So far, we considered only linear errors (DLT)... but the real world is non-linear.

Reasons for non-linear errors:

- Imperfect lens
- Planarity of the sensor

Idea: add a last step that covers the non-linear effects

- Location-dependent shift in the sensor coordinate system
- Individual shift for each pixel
- General mapping
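
The lecture keeps the "general mapping" abstract. As one concrete and widely used instance (my own addition, this is the Brown-Conrady / plumb-bob model that OpenCV uses, not necessarily what Cyrill derives), the location-dependent shift can be written as a radial plus tangential polynomial in the ideal image coordinates:

```python
import numpy as np

def distort(x, y, k1, k2, p1, p2):
    """Apply radial (k1, k2) and tangential (p1, p2) distortion to
    normalized, ideal image-plane coordinates (x, y)."""
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2 * r2
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    y_d = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return x_d, y_d

# Made-up coefficients: a point near the image corner shifts noticeably,
# a point near the principal point barely moves (the shift is location-dependent).
print(distort(0.5, 0.4, k1=-0.2, k2=0.05, p1=0.001, p2=0.001))
print(distort(0.01, 0.01, k1=-0.2, k2=0.05, p1=0.001, p2=0.001))
```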