Preprint arXiv:2601.09698 2026

COMPOSE: Hypergraph Cover Optimization for Multi-view 3D Human Pose Estimation

  • Training-free
  • Off-the-shelf 2D detectors
  • ILP + BP
  • ~5 ms per frame

Prior optimization methods match cameras pairwise and then stitch those matches together, which becomes brittle when views are occluded or noisy. COMPOSE instead matches each person across all views at once, framing the problem as a weighted exact cover over a hypergraph. An ILP solves it exactly, while Belief Propagation gives an answer in milliseconds on the GPU. No training is involved.

  • 1 Technical University of Munich (TUM)
  • 2 Munich Center for Machine Learning (MCML)
  • 3 Imperial College London
01 — Abstract

COMPOSE matches whole multi-view hypotheses at once rather than stitching pairwise links, so it keeps the correspondences that pairwise methods lose under occlusion — and it trains nothing.

  • Problem. Matching cameras two at a time is fragile under occlusion and noisy detections. One bad local match can corrupt the whole global solution.
  • Idea. Match every detection at once as one weighted exact cover over a hypergraph of person hypotheses, instead of resolving view pairs one by one.
  • Solvers. Geometric pruning first trims the candidate hyperedges. An ILP then solves the cover exactly, or Belief Propagation solves it at scale on the GPU.
  • Result. State of the art among optimization-based methods, and ahead of recent self-supervised ones (numbers below).

3D human pose estimation from sparse multi-view camera rigs is an essential task for numerous applications, including action recognition, sports analysis, and human-robot interaction. While learned methods dominate the field on benchmarks, they require large annotated datasets; training-free optimization-based methods remain promising as they circumvent 3D supervision by solving a correspondence problem across views from 2D detections.

Existing combinatorial formulations rely on pairwise associations to model this correspondence problem and enforce global consistency across views only as a downstream constraint. However, reconciling locally plausible pairwise matches becomes brittle under occlusion and noisy detections, where local errors propagate globally.

We propose COMPOSE, which recasts multi-view 3D human pose estimation as a weighted exact-cover optimization over a hypergraph of person hypotheses. Our formulation replaces pairwise association and post-hoc consistency enforcement with a single global combinatorial objective. To address the exponentially large candidate space, we introduce a geometric pruning strategy alongside two complementary solvers: an exact Integer Linear Programming formulation and a scalable relaxation via Belief Propagation.

Without any 3D supervision, COMPOSE improves average precision by up to 31 points over the best optimization-based method and 13 points over self-supervised learned methods, demonstrating the effectiveness of higher-order combinatorial association for training-free multi-view 3D human pose estimation.

02 — Overview

Why higher-order matching helps

Pairwise matching links view pairs in isolation: each match can look consistent on its own, yet the matches taken together cannot all belong to one person, so a mismatched correspondence triangulates to a wrong 3D pose. COMPOSE scores whole multi-view hypotheses and selects one global exact cover, recovering the correct pose for everyone with nothing to reconcile. Example shown: CMU Panoptic, 5 views.

The figure compares both pipelines over three steps: (1) graph construction, (2) correspondence matching, (3) pose triangulation. Green marks geometrically consistent relations, red marks inconsistent ones.

Previous approaches

  • Graph construction. Detections across views form a graph of pairwise edges.
  • Correspondence matching. Pairwise edges are filtered and matched independently between view pairs, which leaves inconsistent matches.
  • Triangulation. 3D skeletons are triangulated after stitching the pairwise matches.

COMPOSE (ours)

  • Graph construction. Detections across all views form a V-partite hypergraph of multi-view hyperedges.
  • Correspondence matching. A single global hypergraph cover is selected, keeping only the consistent hyperedges across every view at once.
  • Triangulation. 3D skeletons are triangulated from the globally consistent hypergraph cover.

Previous approaches filter pairwise edges. COMPOSE filters hyperedges instead, then selects one global cover and triangulates from it.
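
To make the failure mode concrete, here is a minimal sketch (not the authors' code) that merges pairwise matches with a union-find and checks whether a merged group contains two detections from the same view. The `(view, detection)` labels and both function names are hypothetical, chosen only for illustration.

```python
def merge_pairwise(matches):
    """Union-find over (view, detection) nodes linked by pairwise matches."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return list(groups.values())


def is_consistent(group):
    """A person hypothesis may hold at most one detection per view."""
    views = [view for view, _ in group]
    return len(views) == len(set(views))


# Three locally plausible matches across views A, B, C (toy data).
matches = [(("A", 0), ("B", 0)), (("B", 0), ("C", 0)), (("C", 0), ("A", 1))]
groups = merge_pairwise(matches)
print([is_consistent(g) for g in groups])
```

Each match looks fine on its own, yet the stitched group holds two detections from view A, so it cannot be a single person.
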
03 — Headline results

Headline results on CMU Panoptic

Evaluated on CMU Panoptic with the same 2D detector as the strongest optimization baseline.

  • +31 AP25 over the best optimization-based method (MvPose 37.63 → COMPOSE-BP 68.88)
  • +13 AP25 over the self-supervised SelfPose3d (55.13)
  • 22.78 mm MPJPE on CMU Panoptic, below the best optimization baseline (MvPose 26.46)

04 — Method

Method: a single global hypergraph cover

  • Hypergraph construction. Each hyperedge picks one detection per camera view to form a single multi-view person hypothesis, giving a V-partite (V camera views) hypergraph over all detections.
  • Geometric pruning. A memory-bounded builder grows the hypergraph level by level, avoiding the O((N+1)^V) blow-up, and keeps only hyperedges whose multi-view reprojection cost C(e) stays within a threshold τ (C(e) ≤ τ).
  • Correspondence estimation. Choose a weighted exact cover that assigns every detection to one person, solved exactly by ILP or at scale by Belief Propagation.
  • Triangulation. The chosen correspondences are ray-triangulated into 3D skeletons, using nothing beyond 2D detections and camera calibration.
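
The level-by-level construction with pruning can be sketched as follows. This is a toy illustration, not the authors' builder: each detection carries a hypothetical scalar cue, and the spread of cues stands in for the reprojection cost C(e). The `None` entry (a view that missed the person) is an assumption suggested by the (N+1) factor, taking N as detections per view. Pruning partial hyperedges is safe here only because the toy cost never decreases as detections are added.

```python
def spread(values):
    """Toy stand-in for the reprojection cost C(e): range of the scalar cues."""
    return max(values) - min(values) if values else 0.0


def build_hyperedges(views, tau):
    """Grow hyperedges one view at a time, dropping partial ones with cost > tau.

    views: list of dicts {detection_id: scalar cue}, one dict per camera view.
    Returns tuples with one entry per view (a detection id, or None if missed).
    """
    partial = [()]
    for view in views:
        grown = []
        for edge in partial:
            for det in list(view) + [None]:
                cand = edge + (det,)
                cues = [views[v][d] for v, d in enumerate(cand) if d is not None]
                if spread(cues) <= tau:
                    grown.append(cand)
        partial = grown
    # Drop the empty hypothesis that selects no detection at all.
    return [e for e in partial if any(d is not None for d in e)]


# Two people near cue 0 and cue 5; view c only sees the first one.
views = [{"a0": 0.0, "a1": 5.0}, {"b0": 0.1, "b1": 5.2}, {"c0": 0.2}]
edges = build_hyperedges(views, tau=0.5)
print(len(edges), "hyperedges kept out of", 3 * 3 * 2, "candidates")
```

Mixed hypotheses such as `("a0", "b1", "c0")` are pruned as soon as the second view is added, so they are never extended further.
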

COMPOSE-ILP exact

Finds the global optimum of the weighted exact cover with branch-and-cut Integer Linear Programming.
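
For reference, a weighted exact cover is commonly written as the following binary program. The notation is an assumption for illustration: E is the set of pruned hyperedges, D the set of 2D detections, x_e selects hyperedge e, and w_e is a cost derived from C(e) (the exact weighting is not stated here, and a maximization form is equally possible).

```latex
\begin{aligned}
\min_{x \in \{0,1\}^{|E|}} \quad & \sum_{e \in E} w_e \, x_e \\
\text{s.t.} \quad & \sum_{e \in E \,:\, d \in e} x_e = 1 \qquad \forall d \in D
\end{aligned}
```

The equality constraint is what makes the cover exact: every detection belongs to exactly one selected person hypothesis.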

COMPOSE-BP scalable · GPU

Relaxes the same problem and runs it as loopy Belief Propagation in batched GPU tensor operations.
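
As a rough illustration of the idea, below is a minimal CPU sketch of min-sum Belief Propagation on the exact-cover factor graph, assuming lower weight is better. It is not the batched GPU implementation described above: the function name, the damping value, and the message convention (cost of selecting minus cost of not selecting) are illustrative choices, and on loopy graphs the decoded set is not guaranteed to be a valid cover.

```python
def min_sum_exact_cover(hyperedges, weights, iters=50, damping=0.5):
    """Min-sum BP sketch. hyperedges: list of frozensets of detection ids."""
    detections = sorted(set().union(*hyperedges))
    cover = {d: [e for e, h in enumerate(hyperedges) if d in h] for d in detections}
    # f2v[(d, e)]: message from the "exactly one" factor of detection d
    # to hyperedge variable e, as cost(x_e = 1) - cost(x_e = 0).
    f2v = {(d, e): 0.0 for d in detections for e in cover[d]}
    big = 1e6  # finite stand-in for +inf when a detection has a single candidate
    for _ in range(iters):
        # Variable -> factor: own weight plus messages from the other factors.
        v2f = {
            (e, d): weights[e] + sum(f2v[(d2, e)] for d2 in hyperedges[e] if d2 != d)
            for (d, e) in f2v
        }
        # Factor -> variable: if e is off, the cheapest competitor must be on.
        for d in detections:
            for e in cover[d]:
                others = [v2f[(e2, d)] for e2 in cover[d] if e2 != e]
                new = -min(others) if others else -big
                f2v[(d, e)] = damping * f2v[(d, e)] + (1 - damping) * new
    beliefs = [
        weights[e] + sum(f2v[(d, e)] for d in hyperedges[e])
        for e in range(len(hyperedges))
    ]
    return [e for e, b in enumerate(beliefs) if b < 0]  # negative belief: select


# One two-view hypothesis versus two single-view hypotheses (toy weights).
edges = [frozenset("ab"), frozenset("a"), frozenset("b")]
print(min_sum_exact_cover(edges, [1.0, 2.0, 2.0]))
```

Every update is a sum or a minimum over a fixed neighbourhood, which is the kind of operation that maps naturally onto batched tensor code.
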

The full pipeline

The COMPOSE pipeline: multi-view RGB inputs → 2D pose detection → weighted hypergraph construction with geometric pruning (C(e) ≤ τ) → correspondence estimation as a weighted exact cover, solved by ILP or Belief Propagation → ray-based triangulation into 3D skeletons.
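
For the last stage, one common form of ray-based triangulation finds the 3D point that minimizes the summed squared distance to the viewing rays of a joint. The sketch below assumes the ray origins (camera centers) and directions have already been derived from the calibration; it is a generic least-squares formulation and may differ from the authors' implementation. All function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(A, b):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(A)
    out = []
    for k in range(3):
        m = [[b[i] if j == k else A[i][j] for j in range(3)] for i in range(3)]
        out.append(det3(m) / d)
    return out


def triangulate_rays(origins, directions):
    """Point minimizing the summed squared distance to a set of 3D rays.

    Solves sum_i (I - u_i u_i^T) x = sum_i (I - u_i u_i^T) o_i,
    where u_i is the unit direction and o_i the origin of ray i.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = sum(c * c for c in d) ** 0.5
        u = [c / norm for c in d]
        for i in range(3):
            for j in range(3):
                p = (1.0 if i == j else 0.0) - u[i] * u[j]
                A[i][j] += p
                b[i] += p * o[j]
    return solve3(A, b)


# Three rays that all pass through the joint at (1, 2, 3).
origins = [(0, 0, 0), (4, 0, 0), (0, 5, 0)]
directions = [(1, 2, 3), (-3, 2, 3), (1, -3, 3)]
print(triangulate_rays(origins, directions))
```

At least two non-parallel rays are needed, otherwise the 3x3 system is singular.
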
05 — Benchmarks

Full benchmark results

The gain concentrates at AP25, the strictest threshold, where one wrong cross-view match ruins the pose. Precise association is exactly what COMPOSE buys: it nearly doubles the best optimization baseline there (37.63 → 68.88), while AP50 and above are already saturated.

CMU Panoptic. AP / Recall higher is better; MPJPE (mm) lower is better. COMPOSE rows highlighted; best optimization-based value per column is marked ▲.
Method                         AP25    AP50    AP100   AP150   R@500 ↑   MPJPE ↓
Fully-supervised
Plane Sweep Pose               92.12   98.96   99.81   99.84   -         16.78
Wu et al.                      93.93   98.93   99.78   99.90   99.97     15.63
TEMPO                          89.01   99.08   99.76   99.93   -         14.68
VoxelPose + 3DSA               94.20   98.49   99.21   99.31   -         13.98
Self-supervised
SelfPose3d                     55.13   96.44   98.46   98.98   99.60     24.47
DSP (†, 9 temporal frames)     57.60 / 86.10 / 94.00 (AP values; blank cells not recoverable)   23.10
Optimization-based (training-free)
ACTOR                          -       -       -       -       -         168.40
MvPose (‡, same 2D detector)   37.63   95.70   97.84   98.28   99.60     26.46
COMPOSE-ILP (ours)             66.70   98.23   99.43   99.62   99.81     22.78
COMPOSE-BP (ours)              68.88   98.37   99.42   99.61   99.81     22.78

DSP uses 9 temporal frames. MvPose uses the same 2D detector as COMPOSE. ▲ = best among optimization-based methods. R@500 = Recall@500 mm.
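
For readers new to the metrics, MPJPE is the mean Euclidean distance between predicted and ground-truth joints. The sketch below also shows a thresholded match test, assuming (as in the common evaluation protocol) that AP25 counts a predicted pose as correct when its MPJPE to a ground-truth pose is under 25 mm. The ranking step of the full AP computation is omitted, and the function names are hypothetical.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the input (mm here)."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def is_match(pred, gt, threshold_mm):
    """Assumed AP-style test: the pose counts if its MPJPE is below the threshold."""
    return mpjpe(pred, gt) < threshold_mm


# Two joints with errors of 5 mm and 10 mm (toy data).
gt = [(0, 0, 0), (0, 0, 100)]
pred = [(3, 4, 0), (0, 0, 110)]
print(mpjpe(pred, gt), is_match(pred, gt, 25), is_match(pred, gt, 5))
```
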

On the sparse 4-view CMU3 setup, COMPOSE reaches 74.43 mAP, beating the self-supervised SelfPose3d (61.43) without training a pose model of its own.

Generalization across camera setups (CMU Panoptic). Same model, no retraining — only the number and placement of cameras change. mAP and Recall@500 mm higher is better. COMPOSE rows highlighted; best optimization-based value per column is marked ▲.
                       CMU1 (7 cams)      CMU2 (7 cams)      CMU3 (4 cams)      CMU4 (4 cams)
Method                 mAP ↑   R@500 ↑    mAP ↑   R@500 ↑    mAP ↑   R@500 ↑    mAP ↑   R@500 ↑
Self-supervised
SelfPose3d             74.50   97.98      59.06   94.32      61.43   83.96      62.85   98.32
Optimization-based
MvPose                 84.62   99.53      80.07   99.37      59.74   98.80      74.85   98.59
COMPOSE-ILP (ours)     88.49   99.61      84.45   99.58      73.83   98.40      80.17   99.31
COMPOSE-BP (ours)      88.10   99.45      84.34   99.41      74.43   98.39      79.60   99.31

CMU1/CMU2 use 7 cameras; CMU3/CMU4 use 4 cameras with different placements. ▲ = best among optimization-based methods. R@500 = Recall@500 mm.

Also evaluated on Shelf & Campus (PCP)
Shelf & Campus. Percentage of Correct Parts (PCP %, higher is better). COMPOSE rows highlighted; best optimization-based value per column is marked ▲.
                       Shelf                          Campus
Method                 A1     A2     A3     Avg       A1     A2     A3     Avg
Fully-supervised
VoxelPose              99.3   94.1   97.6   97.0      97.6   93.8   98.8   96.7
Wu et al.              99.3   96.5   97.3   97.7      -      -      -      -
TEMPO                  99.3   95.1   97.8   97.4      97.7   95.5   97.9   97.3
Self-supervised
SelfPose3d             97.2   90.3   97.9   95.1      92.5   82.2   89.2   87.9
Optimization-based
3DPS                   75.3   69.7   87.6   77.5      93.5   75.7   84.4   84.5
MvPose                 98.8   94.1   97.8   96.9      97.6   93.3   98.0   96.3
COMPOSE-ILP (ours)     99.8   92.4   96.3   96.2      99.4   94.3   98.1   97.3
COMPOSE-BP (ours)      99.8   92.4   96.3   96.2      99.4   94.3   93.6   95.7

Datasets from Belagiannis et al., 2014. Shelf — 4 people, heavy occlusion, 5 cameras. Campus — 3 people, outdoor, 3 cameras. A1/A2/A3 = Actor 1/2/3. ▲ = best among optimization-based methods.
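
PCP is usually defined per limb: a limb counts as correct when the mean error of its two endpoints is at most a fraction α of the ground-truth limb length, with α = 0.5 as the customary value. A minimal sketch under that assumption follows; the joint indices and limb list are toy values, not the benchmark's skeleton definition.

```python
import math


def pcp(pred, gt, limbs, alpha=0.5):
    """Percentage of Correct Parts over a list of (start_joint, end_joint) limbs."""
    correct = 0
    for s, e in limbs:
        err = (math.dist(pred[s], gt[s]) + math.dist(pred[e], gt[e])) / 2
        if err <= alpha * math.dist(gt[s], gt[e]):
            correct += 1
    return 100.0 * correct / len(limbs)


# Three joints on a line, two limbs of length 10; the last joint is far off.
gt = {0: (0, 0, 0), 1: (0, 0, 10), 2: (0, 0, 20)}
pred = {0: (0, 0, 0), 1: (0, 0, 12), 2: (0, 0, 40)}
print(pcp(pred, gt, limbs=[(0, 1), (1, 2)]))
```

Because the test is relative to the annotated limb, a drifting ground-truth annotation can mark a visually correct prediction as wrong.
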

06 — Qualitative results

Qualitative reconstructions

A few calibrated camera views are enough to reconstruct the full 3D scene.

Multi-view input

The input is five calibrated camera views with off-the-shelf 2D detections and no 3D labels.

Whole-scene 3D reconstruction

MvPose: pairwise cross-view association misses one person where matching fails, leaving a gap (arrowed) in the reconstructed scene.
COMPOSE (ours): the global hypergraph cover recovers complete 3D skeletons for every person in the scene, including the one (arrowed) that MvPose misses.
From those views, MvPose misses one person where its pairwise matching breaks down (red arrow, left), while COMPOSE recovers the complete 3D scene (green arrow, right).

Annotation noise on Shelf

Visual evidence exposes annotation noise. In several occluded Shelf frames the public 3D ground-truth annotation (dashed red) drifts from the visible actor, while COMPOSE's prediction (solid colors) stays consistent with the image; the corresponding 3D reconstructions are shown at the right. PCP then penalizes poses that are actually correct (the Shelf & Campus PCP table is in Benchmarks).
07 — Scalability & efficiency

Runtime and pruning efficiency

Belief Propagation tracks the exact ILP's accuracy closely while running in about 5 ms, regardless of how many cameras are used.

COMPOSE-BP holds a near-constant ~5 ms runtime as cameras are added, on par with MvPose, whereas the exact ILP grows steeply with more views.
Geometric pruning keeps under 2% of candidate hyperedges, and the retained set stays small as views grow, so the combinatorial search remains tractable.
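
To get a feel for the scale involved, the snippet below evaluates the unpruned candidate count (N+1)^V and the corresponding 2% retention bound. The choice of N = 4 detections per view is a hypothetical value for illustration, as is reading N as a uniform per-view count.

```python
def candidate_count(n_per_view, n_views):
    """Unpruned candidates if every view offers N detections plus a 'missed' option."""
    return (n_per_view + 1) ** n_views


for v in (3, 5, 7):
    total = candidate_count(4, v)
    # views, unpruned candidates, upper bound on hyperedges kept at 2% retention
    print(v, total, int(0.02 * total))
```
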

Limitations: COMPOSE assumes calibrated cameras and an off-the-shelf 2D detector, and the runtime of the exact ILP solver grows steeply with the number of views (see the runtime plot above), so Belief Propagation is the scalable alternative.

08 — Citation

Cite this work

BibTeX
@article{wang2026compose,
  title   = {COMPOSE: Hypergraph Cover Optimization for Multi-view 3D Human Pose Estimation},
  author  = {Tony Danjun Wang and Tolga Birdal and Nassir Navab and Lennart Bastian},
  journal = {arXiv preprint arXiv:2601.09698},
  year    = {2026}
}