
Stereo Visual Odometry Example (Large Dataset)


In this notebook we build on StereoVOExample.ipynb to tackle a more realistic stereo visual-odometry problem. A robot carries a stereo camera, moves forward over a longer trajectory, and observes a large sample of landmarks. Our goals are:

  1. Define a noise model and stereo calibration.
  2. Assemble a factor graph of stereo-measurement factors.
  3. Provide initial estimates for both camera poses and landmark positions.
  4. Run Levenberg–Marquardt optimization to recover the most likely trajectory and map.
  5. Analyze and visualize final poses and landmarks.

If any section seems unclear, we recommend working through StereoVOExample.ipynb first and coming back to this example afterwards.

Note: This example is also available in C++. If you’re interested in the C++ implementation, refer to StereoVOExample_large.cpp in the GTSAM examples directory.

import gtsam
import gtsam.utils.plot as gtsam_plot
import numpy as np
import plotly.graph_objects as go

# Shorthand for common GTSAM types (like X for poses, L for landmarks)
from gtsam.symbol_shorthand import X, L

1. Setup Factor Graph

We start by creating an empty NonlinearFactorGraph.

# Create an empty nonlinear factor graph
graph = gtsam.NonlinearFactorGraph()

2. Define Camera Calibration and Measurement

To accurately model the stereo observations, we must define both the measurement noise model and the intrinsic calibration of the stereo camera system.

  • Measurement Noise: We assume an isotropic Gaussian noise model for the stereo measurements. Since each observation consists of three pixel values $(u_L, u_R, v)$, we use Isotropic.Sigma(3, 1.0) to define a noise model with unit standard deviation in each of the three dimensions.
  • Camera Calibration (Cal3_S2Stereo): We load precomputed intrinsic parameters and baseline values to construct a stereo calibration model. These values are as follows:
    • Focal lengths: $f_x = f_y = 721.5377$
    • Skew: $s = 0$
    • Principal point: $(u, v) = (609.5593,\, 172.854)$
    • Baseline: $b = 0.537150588$

These parameters define how 3D world points are projected into pixel coordinates in the left and right stereo images, and are essential for constructing the GenericStereoFactor3D measurements in Section 5.

# Retrieve Calibration file
cal_params_file = gtsam.findExampleDataFile("VO_calibration.txt")

# Read calibration parameters
cal_params = np.loadtxt(cal_params_file)

# Create a Cal3_S2Stereo calibration object
K = gtsam.Cal3_S2Stereo(cal_params)
print(K)

# Define the stereo measurement noise model (isotropic, 1 pixel standard deviation)
measurement_noise_model = gtsam.noiseModel.Isotropic.Sigma(3, 1.0)
K: 721.538       0 609.559
      0 721.538 172.854
      0       0       1
Baseline: 0.537151
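
For reference, the calibration could also be constructed explicitly from the values listed above; a minimal sketch, assuming the six-argument $(f_x, f_y, s, u_0, v_0, b)$ constructor of Cal3_S2Stereo:

# Equivalent explicit construction from the listed parameters
# (fx, fy, skew, u0, v0, baseline)
K_explicit = gtsam.Cal3_S2Stereo(721.5377, 721.5377, 0.0, 609.5593, 172.854, 0.537150588)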

3. Construct Initial Pose Estimates from Odometry Measurements

The next step in our visual odometry pipeline is to provide an initial guess for the robot’s trajectory. GTSAM relies on good initial values for its nonlinear optimization to converge accurately and efficiently. These initial pose estimates serve as the starting point from which the optimizer iterates toward a globally consistent solution.

In this example, the pose data is stored in a text file, where each row corresponds to a unique camera pose. Each row contains 17 values:

  • The first value is an integer index/ID that identifies the camera pose.
  • The remaining 16 values encode the camera’s 3D pose as a flattened $4 \times 4$ transformation matrix, stored in row-major order. This matrix represents a rigid-body transformation in $SE(3)$, combining both rotation and translation.

The file contains pose data for 26 time steps along the robot’s trajectory. We load these poses and insert them into a gtsam.Values object named initial_estimates. This object functions like a dictionary, mapping symbolic keys (e.g. X(1), X(2),...) to corresponding Pose3 objects. These estimates will later be refined during optimization, but they must be reasonably close to the true trajectory to ensure successful convergence.

# Odometry data file with camera poses
odometry_data_file = gtsam.findExampleDataFile("VO_camera_poses_large.txt")

# Read poses
camera_poses = np.loadtxt(odometry_data_file)

# Create a gtsam.Values object to hold the pose data
initial_estimates = gtsam.Values()

# Parse the raw camera pose data
for pose in camera_poses:
	pose_id = int(pose[0])
	m = pose[1:].copy().reshape(4, 4)
	initial_estimates.insert(X(pose_id), gtsam.Pose3(m))
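
To confirm the poses loaded correctly, we can query the Values container; a quick check, assuming (as in this dataset) pose IDs start at 1:

# Sanity check: number of loaded poses and the first pose
print(f"Loaded {initial_estimates.size()} initial pose estimates.")
print("First pose:\n", initial_estimates.atPose3(X(1)))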

4. Load and Inspect Stereo Measurement Data

In this section, we load the raw stereo measurement data that encodes the pixel observations and 3D landmark positions captured by the robot’s stereo camera. This data provides the core geometric constraints that link camera poses to landmarks in our factor graph.

Each row in the data file represents a single stereo observation, consisting of eight values:

  • The first two values are integer indices/IDs: one for the camera pose and one for the landmark.
  • The next three values are the rectified stereo pixel coordinates $(u_L, u_R, v)$. These represent the horizontal pixel positions in the left and right images and the shared vertical coordinate.
  • The last three values are the 3D coordinates $(X, Y, Z)$ of the observed landmark, expressed in the camera frame using stereo triangulation.

This data will be used in the next steps to build measurement factors and initialize landmark positions in the world frame.

# Factor data file with pixel coordinates and landmark coordinates
factor_data_file = gtsam.findExampleDataFile("VO_stereo_factors_large.txt")

# Each row has 8 columns corresponding to [pose_id, landmark_id, uL, uR, v, X, Y, Z]
data_matrix = np.loadtxt(factor_data_file)

print(f"Loaded {data_matrix.shape[0]} stereo observations.")
print(f"First row (for reference):\n{data_matrix[0]}")
Loaded 8189 stereo observations.
First row (for reference):
[  1.        3.      209.979   185.87     61.5418   -8.90263  -2.48003
  16.0758 ]
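
As a quick consistency check, the depth stored in the last column should agree with the rectified-stereo triangulation relation $Z = f_x b / (u_L - u_R)$, using the calibration from Section 2:

# Verify the stored depth against stereo triangulation: Z = fx * b / (uL - uR)
fx, b = 721.5377, 0.537150588
uL, uR = data_matrix[0, 2], data_matrix[0, 3]
print(f"Triangulated depth: {fx * b / (uL - uR):.4f}")  # should closely match the file value 16.0758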

5. Add Stereo Factors to the Factor Graph

With the measurement data loaded, we now translate each stereo observation into a factor that constrains the relative position of a camera pose and a landmark. These constraints form the backbone of our factor graph and are essential for computing an optimized trajectory and map.

For every observation entry (or row) in the measurement matrix:

  • We create a GenericStereoFactor3D using the pixel coordinates $(u_L, u_R, v)$, the stereo calibration object K, and the noise model defined earlier.
  • Each factor connects a camera pose variable X(i) and a landmark variable L(j) with the associated stereo observation.
  • The stereo measurement provides enough geometric information to triangulate depth and establish spatial relationships in 3D.

This step builds up the graph structure that the optimizer will later use to refine both the robot’s trajectory and the map of landmarks.

# Add a stereo measurement factor for each observation
for entry in data_matrix:
	pose_id = int(entry[0])
	landmark_id = int(entry[1])
	uL, uR, v = entry[2:5]

	graph.add(gtsam.GenericStereoFactor3D(
		gtsam.StereoPoint2(uL, uR, v),	# Stereo measurement
		measurement_noise_model, 		# Assumed pixel measurement noise
		X(pose_id), 					# Camera pose key
		L(landmark_id), 				# Landmark key
		K								# Stereo calibration
	))
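
After this loop, every stereo observation is represented as a factor; a quick check of the resulting graph size:

# One factor per observation was added at this point
print(f"The factor graph now contains {graph.size()} factors.")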

6. Initialize Landmark Positions from Triangulated Observations

Now that we have added measurement factors to the graph, we need to provide an initial guess for the location of each landmark. These guesses are essential for nonlinear optimization; without them, the optimizer would have no starting point from which to refine the 3D landmark positions.

As mentioned before, each observation row includes a 3D point $(X, Y, Z)$ representing the landmark’s position in the camera frame at the time of observation. However, GTSAM operates in a global coordinate frame, so we must convert each point from the camera’s local frame to the world frame.

To do this:

  • We check whether a landmark has already been initialized in initial_estimates.
  • If it hasn’t, we retrieve the pose of the observing camera and apply a transformation from the camera frame to the world frame using transformFrom().
  • We then insert the transformed landmark point into the initial estimate under its corresponding symbolic key.

Each landmark is initialized only once, using the pose from its first observation.

# Provide an initial guess for each unique landmark in world coordinates
for entry in data_matrix:
	landmark_id = int(entry[1])
	if initial_estimates.exists(L(landmark_id)):
		continue	# Skip this landmark if already initialized

	pose_id = int(entry[0])
	land_X, land_Y, land_Z	= entry[5:8]

	cam_pose = initial_estimates.atPose3(X(pose_id))	# Get pose of observing camera
	world_point = cam_pose.transformFrom(				# Convert from camera frame to world frame
		gtsam.Point3(land_X, land_Y, land_Z)
	)
	initial_estimates.insert(L(landmark_id), world_point)
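
For clarity, transformFrom() applies the rigid-body transformation $p_{world} = R\, p_{cam} + t$. A minimal illustration with the first camera pose, assuming (as in recent GTSAM Python wrappers) that Point3 behaves as a NumPy 3-vector:

# transformFrom maps a local (camera-frame) point into the world frame:
# p_world = R @ p_cam + t
pose = initial_estimates.atPose3(X(1))
p_cam = gtsam.Point3(1.0, 2.0, 3.0)  # arbitrary test point
manual = pose.rotation().matrix() @ p_cam + pose.translation()
assert np.allclose(pose.transformFrom(p_cam), manual)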

7. Fix the First Pose to Anchor the World Frame

Before running optimization, we fix the very first camera pose to anchor the entire system in space. Without this constraint, the entire solution would be underdetermined up to a rigid-body transformation, meaning the optimizer could slide or rotate all poses and landmarks arbitrarily while still minimizing the objective function.

Intuitively, this step establishes a reference origin for the map of the environment we’re trying to build, pinning it down so that all subsequent positions are expressed relative to that origin. As an analogy, it is much like specifying an initial condition when solving a differential equation: we need to fix a starting point in order to recover a particular solution.

To enforce this constraint, we add a NonlinearEqualityPose3 factor to the graph. It ensures that the first pose, X(1), remains exactly equal to its initial estimate and does not change during optimization. This convention is common in SLAM and structure-from-motion systems to eliminate gauge freedom and define a fixed global frame.

# Fix the first pose to serve as the world frame origin
first_pose = initial_estimates.atPose3(X(1))
graph.add(gtsam.NonlinearEqualityPose3(X(1), first_pose))
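
As an aside, an alternative used in some GTSAM examples is to anchor the first pose with a very tight PriorFactorPose3 instead of a hard equality constraint; a sketch, not added to this graph:

# Alternative anchoring (not used here): a soft prior with very tight noise
# prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.full(6, 1e-6))
# graph.add(gtsam.PriorFactorPose3(X(1), first_pose, prior_noise))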

8. Optimize the Factor Graph with Levenberg-Marquardt

With the factor graph fully constructed and initial values provided for all variables, we now run nonlinear optimization to compute the most probable configuration of camera poses and landmarks. GTSAM uses the Levenberg-Marquardt algorithm, a robust and widely used solver that blends the Gauss-Newton method with gradient descent to handle nonlinearity and poor initial guesses.

We configure the optimizer using a LevenbergMarquardtParams object. One optional parameter we set here is the variable elimination ordering strategy, which influences how the underlying linear system is solved during optimization. We use "METIS", a graph partitioning strategy from the METIS software package that often improves performance by reducing fill-in during matrix factorization. While this step is not strictly necessary for small problems, it becomes valuable in larger graphs.

Once the optimizer runs, it returns a new Values object containing the optimized pose and landmark estimates. This step may take several seconds depending on the size of the graph.

# Set up optimizer with optional METIS ordering strategy
params = gtsam.LevenbergMarquardtParams()
params.setOrderingType("METIS")

# Optimize the factor graph to compute maximum posterior estimates
optimizer = gtsam.LevenbergMarquardtOptimizer(graph, initial_estimates, params)
result = optimizer.optimize()
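
For larger problems it can help to monitor convergence. A small sketch using standard optimizer accessors; calling params.setVerbosityLM("SUMMARY") before constructing the optimizer would additionally print per-iteration error:

# Inspect convergence after optimization
print(f"Converged after {optimizer.iterations()} iterations; final error: {optimizer.error():.2f}")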

9. Analyze the Results

With optimization complete, we now examine the quality of the resulting map and trajectory. A key benefit of using factor graphs and nonlinear optimization is that we can recover a globally consistent trajectory and landmark map even when the initial estimates are noisy or imprecise.

In this last section, we extract and visualize both:

  • The initial estimates of poses and landmarks prior to optimization.
  • The optimized result returned by the Levenberg-Marquardt optimizer.

To highlight the improvement, we plot both sets side-by-side in a single interactive 3D view. The viewer can rotate, zoom, and pan to inspect the geometry from different perspectives.

# Function to extract camera coordinates and put them in an N x 3 matrix
def extract_camera_xyz(values: gtsam.Values, prefix='x'):
    coords = []
    for key in values.keys():
        symbol = gtsam.Symbol(key)
        if symbol.chr() == ord(prefix):
            coords.append(values.atPose3(key).translation())
    return np.array(coords)

# Function to extract landmark coordinates and put them in an N x 3 matrix
def extract_landmark_xyz(values: gtsam.Values, prefix='l'):
    coords = []
    for key in values.keys():
        symbol = gtsam.Symbol(key)
        if symbol.chr() == ord(prefix):
            coords.append(values.atPoint3(key))
    return np.array(coords)

# Extract both initial and optimized estimates
init_cam = extract_camera_xyz(initial_estimates)
opt_cam = extract_camera_xyz(result)
init_land = extract_landmark_xyz(initial_estimates)
opt_land = extract_landmark_xyz(result)

# Define traces
trace_init_cam = go.Scatter3d(
    x=init_cam[:, 0], y=init_cam[:, 1], z=init_cam[:, 2],
    mode="lines+markers",
    name="Initial Trajectory",
    line=dict(color="lightgray", width=2, dash="dash"),
    marker=dict(size=2, color="black")
)

trace_opt_cam = go.Scatter3d(
    x=opt_cam[:, 0], y=opt_cam[:, 1], z=opt_cam[:, 2],
    mode="lines+markers",
    name="Optimized Trajectory",
    line=dict(color="blue", width=2),
    marker=dict(size=2, color="blue")
)

scatter_init_land = go.Scatter3d(
    x=init_land[:, 0], y=init_land[:, 1], z=init_land[:, 2],
    mode="markers",
    name="Initial Landmarks",
    marker=dict(size=2, color="black", opacity=0.5)
)

scatter_opt_land = go.Scatter3d(
    x=opt_land[:, 0], y=opt_land[:, 1], z=opt_land[:, 2],
    mode="markers",
    name="Optimized Landmarks",
    marker=dict(size=2, color="orange")
)

# Assemble the figure
fig = go.Figure(data=[trace_init_cam, trace_opt_cam, scatter_init_land, scatter_opt_land])

fig.update_layout(
    title="SLAM Result: Initial vs Optimized Estimates",
    scene=dict(
        xaxis_title='X',
        yaxis_title='Y',
        zaxis_title='Z',
        aspectmode='data',
        camera=dict(
            up=dict(x=0, y=-1, z=0),
            center=dict(x=0, y=0.1, z=0),
            eye=dict(x=-1, y=-1, z=-1)
		)
	),
    legend=dict(x=0, y=1),
    margin=dict(l=0, r=0, b=0, t=30)
)

fig.show()
[Interactive 3D plot: initial vs. optimized trajectories and landmarks]

While the visual difference between the initial and optimized trajectories may be subtle, especially in a controlled, hallway-like environment such as the one in this example, we can still measure the effect of optimization quantitatively.

GTSAM provides a convenient method, graph.error(), to evaluate the total error of the factor graph under any estimate. This cost is one half the sum of squared Mahalanobis distances between the predicted and actual measurements over all factors in the graph, $\frac{1}{2}\sum_k \lVert h_k(\Theta) - z_k \rVert^2_{\Sigma_k}$. By computing this cost both before and after optimization, we obtain a numerical measure of how much the optimizer improved the estimate, even in cases where the visual improvement is minor.

# Compute total error (cost) before and after optimization
init_cost = graph.error(initial_estimates)
opt_cost = graph.error(result)
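
# For reference, print the raw costs alongside the bar chart below
print(f"Initial cost: {init_cost:.2f}")
print(f"Optimized cost: {opt_cost:.2f}")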

fig_cost = go.Figure(data=[
    go.Bar(name="Initial Estimate", x=["Initial"], y=[init_cost], marker_color="gray"),
    go.Bar(name="Optimized Result", x=["Optimized"], y=[opt_cost], marker_color="blue")
])

fig_cost.update_layout(
    title="Total Graph Error: Initial vs. Optimized",
    yaxis_title="Total Error (sum of squared residuals)",
    bargap=0.4,
    showlegend=True
)

fig_cost.show()
[Bar chart: total graph error, initial vs. optimized]