Multi-person 3D pose from several cameras, learned with no 3D ground truth.
Match people first, then estimate their poses.
A combinatorial matching problem, made learnable.
The assignment-as-diffusion formulation and the hypergraph pose decoder.
750 hand-annotated frames of authentic surgical scenes. Loose surgical attire and severe occlusion.
New self-supervised state-of-the-art across four benchmarks, and 75% vs 59% mAP on unseen camera rigs without any fine-tuning.
Existing multi-view 3D pose benchmarks (CMU Panoptic, Shelf, Campus) feature relatively moderate occlusion and standard street attire. Real clinical scenes break both assumptions: surgeons wear loose smocks, and the field is crowded with medical equipment, lights, and staff. We extend the MM-OR dataset (Özsoy et al., 2025), recordings of knee surgeries, with 750 manually annotated 3D pose frames across three surgical sequences, releasing the labels as the MM-OR Pose benchmark.
TL;DR: a self-supervised framework that solves multi-view person assignment as diffusion on the polystochastic polytope, then regresses 3D skeletons with a hypergraph decoder. New self-supervised state of the art, robust to unseen camera rigs, no 3D ground truth.
Prevailing methods fuse all views into a 3D voxel grid and solve assignment and regression jointly, which ties the model to a fixed camera rig and hurts generalization. DisPOSE keeps the two apart and recasts assignment as a matching problem over the multi-view hypergraph below.
An edge links two detections: “view 1 ↔ view 2 is the same person”. Many such pairwise edges must then be reconciled into a globally consistent matching (cycle consistency, synchronization). Local mistakes propagate.
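To make the reconciliation problem concrete, here is a minimal sketch of a cycle-consistency check across three views. The dictionary representation of pairwise matches and the function name `is_cycle_consistent` are illustrative assumptions, not part of DisPOSE.

```python
def is_cycle_consistent(m12, m23, m13):
    """Check that composing view1->view2 and view2->view3 agrees with view1->view3.

    Each argument maps a detection index in the source view to the matched
    detection index in the target view (hypothetical toy representation).
    """
    for a, b in m12.items():
        c = m23.get(b)
        if c is None:
            continue  # chain broken by a missing match; nothing to compare
        if a in m13 and m13[a] != c:
            return False
    return True


# Two people, three views. In view 3 the detection order is swapped.
m12 = {0: 0, 1: 1}
m23 = {0: 1, 1: 0}
consistent = is_cycle_consistent(m12, m23, {0: 1, 1: 0})    # direct match agrees
inconsistent = is_cycle_consistent(m12, m23, {0: 0, 1: 1})  # direct match disagrees
```

A single wrong pairwise edge is enough to make the second case fail, which is the kind of local mistake that then has to be repaired globally.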
A single hyperedge groups one detection per view into one multi-view person hypothesis. Consistency is enforced jointly across all cameras at once. The atomic unit of correspondence is the hypothesis rather than the pairwise match.
Following COMPOSE (Wang et al., 2026), DisPOSE adopts this higher-order formulation: every candidate cross-view association is a hyperedge of the multi-view correspondence hypergraph. Stage I then runs diffusion on the polytope of hyperedge selections.
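As a rough illustration of what "every candidate cross-view association" means, the sketch below enumerates one-detection-per-view tuples. It assumes every person is detected in every view; handling of missed detections and the name `enumerate_hyperedges` are assumptions of this example, not details taken from the paper.

```python
from itertools import product


def enumerate_hyperedges(num_detections_per_view):
    """List every candidate hyperedge as a tuple with one detection index per view."""
    ranges = [range(n) for n in num_detections_per_view]
    return list(product(*ranges))


# Three views with 2, 3 and 2 detections give 2 * 3 * 2 candidate hyperedges.
hyperedges = enumerate_hyperedges([2, 3, 2])
```

The candidate set grows multiplicatively with the number of views, which is why selecting a consistent subset is a combinatorial problem.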
Per-view 2D root candidates (extracted via soft-argmax on the heatmaps $H_v$) become the nodes of a multi-view correspondence hypergraph; each hyperedge is a candidate cross-camera association. We relax the discrete assignment over this hypergraph into a continuous polystochastic tensor: the multi-mode analogue of a doubly-stochastic matrix.
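For readers unfamiliar with soft-argmax, a minimal NumPy sketch follows. It handles a single-peak heatmap only; how multiple candidates per view are extracted is not described here, and the function name and `temperature` parameter are assumptions of this example.

```python
import numpy as np


def soft_argmax_2d(heatmap, temperature=1.0):
    """Return the softmax-weighted expected (x, y) location of a 2D heatmap."""
    h, w = heatmap.shape
    logits = heatmap.reshape(-1) / temperature
    logits = logits - logits.max()            # subtract the max for numerical stability
    probs = np.exp(logits)
    probs = (probs / probs.sum()).reshape(h, w)
    ys, xs = np.mgrid[0:h, 0:w]               # row (y) and column (x) coordinate grids
    return float((probs * xs).sum()), float((probs * ys).sum())


heatmap = np.zeros((5, 7))
heatmap[2, 4] = 50.0                          # sharp peak at row 2, column 4
x, y = soft_argmax_2d(heatmap)
```

Unlike a hard argmax, the result is a smooth function of the heatmap values, so gradients can flow through it.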
A hypergraph denoiser learns the reverse-time process on this polytope. At each step it predicts a clean assignment $\chi_0$; the differentiable Sinkhorn projection $\Pi_{\mathcal{S}}$ then snaps the state back onto the polystochastic manifold, and a DDIM update walks the latent one step closer to $t = 0$.
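The two operations in this step can be sketched in NumPy under explicit assumptions. First, "polystochastic" is read here as "every slice along every mode sums to one"; the paper's exact constraint set may differ. Second, the projection is applied to the predicted clean assignment, which is one possible reading of the description. The denoiser is replaced by random numbers, and neither function is the authors' implementation.

```python
import numpy as np


def sinkhorn_project(tensor, num_iters=100, eps=1e-12):
    """Rescale a non-negative tensor until every slice along every mode sums to ~1.

    Assumption: 'polystochastic' means unit slice sums per mode.
    """
    x = np.asarray(tensor, dtype=np.float64) + eps   # keep entries strictly positive
    for _ in range(num_iters):
        for mode in range(x.ndim):
            other_axes = tuple(a for a in range(x.ndim) if a != mode)
            x = x / x.sum(axis=other_axes, keepdims=True)
    return x


def ddim_step(x_t, x0_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) given a clean-sample prediction."""
    eps_pred = (x_t - np.sqrt(alpha_bar_t) * x0_pred) / np.sqrt(1.0 - alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred


rng = np.random.default_rng(0)
x_t = rng.normal(size=(3, 3, 3))                # noisy state: 3 views, 3 detections each
raw_prediction = rng.random((3, 3, 3)) + 0.1    # stand-in for the denoiser output
chi0 = sinkhorn_project(raw_prediction)         # projected clean assignment
x_prev = ddim_step(x_t, chi0, alpha_bar_t=0.5, alpha_bar_prev=0.8)
```

When `alpha_bar_prev` reaches 1 the update returns the projected prediction itself, so the final state satisfies the slice-sum constraints by construction.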
Each individual starts as a T-Pose $\mathcal{P}^{(0)}$ anchored at its Stage-I root. We sample per-joint multi-view features at the projected 3D locations $p_{v,k,j}$, then iteratively refine the pose $\mathcal{P}^{(\tau)} \to \mathcal{P}^{(\tau+1)}$ with a Hypergraph Convolutional Decoder. The decoder runs attention over two edge types: cross-view hyperedges fuse evidence across cameras, and person-part hyperedges enforce articulated joint consistency. This resolves fine-grained pose without ever materialising a 3D voxel grid.
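The feature-sampling step can be pictured as "project each 3D joint into a view, then read the feature map at that pixel". The sketch below assumes a standard pinhole camera with intrinsics `K`, rotation `R` and translation `t`, and uses bilinear interpolation; the function names and the toy feature map are illustrative, and the actual sampling scheme in DisPOSE may differ.

```python
import numpy as np


def project_points(points_3d, K, R, t):
    """Project (N, 3) world points to (N, 2) pixel coordinates with a pinhole model."""
    cam = points_3d @ R.T + t          # world frame -> camera frame
    uvw = cam @ K.T                    # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide


def bilinear_sample(feature_map, u, v):
    """Bilinearly interpolate an (H, W) or (H, W, C) feature map at pixel (u, v)."""
    h, w = feature_map.shape[:2]
    u = float(np.clip(u, 0, w - 1))
    v = float(np.clip(v, 0, h - 1))
    x0, y0 = int(np.floor(u)), int(np.floor(v))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = u - x0, v - y0
    top = (1 - ax) * feature_map[y0, x0] + ax * feature_map[y0, x1]
    bottom = (1 - ax) * feature_map[y1, x0] + ax * feature_map[y1, x1]
    return (1 - ay) * top + ay * bottom


K = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
joints = np.array([[0.0, 0.0, 2.0], [0.1, -0.2, 2.0]])   # two joints, 2 m in front of the camera
pixels = project_points(joints, K, np.eye(3), np.zeros(3))

features = np.arange(12, dtype=float).reshape(3, 4)      # toy single-channel feature map
sample = bilinear_sample(features, 1.5, 0.5)
```

Because only a handful of pixel locations per joint and view are read, the cost scales with the number of joints rather than with the resolution of a 3D grid.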
DisPOSE leads every self-supervised baseline on all four benchmarks below, and the margin grows on the hardest cases (surgery, unseen rigs). We compare against fully-supervised, optimization-based, and self-supervised methods on CMU Panoptic, Shelf and Campus, and the new MM-OR Pose benchmark.
Large-scale multi-view dome capture (Joo et al., 2015) with moderate occlusion and social interactions.
Bottom line: best self-supervised method on every metric, +11 AP25 over the next best.
| Method | AP@25 mm ↑ | AP@50 mm ↑ | AP@100 mm ↑ | AP@150 mm ↑ | Recall@500 ↑ | MPJPE (mm) ↓ |
|---|---|---|---|---|---|---|
| Fully-supervised | ||||||
| VoxelPose | 83.59 | 98.33 | 99.76 | 99.91 | — | 17.68 |
| Plane Sweep | 92.12 | 98.96 | 99.81 | 99.84 | — | 16.75 |
| MvP | 92.28 | 96.60 | 97.45 | 97.69 | — | 15.76 |
| Wu et al. | 93.93 | 98.93 | 99.78 | 99.90 | 99.97 | 15.63 |
| Faster VoxelPose | 85.22 | 98.08 | 99.32 | 99.48 | — | 18.26 |
| TEMPO | 89.01 | 99.08 | 99.76 | 99.93 | — | 14.68 |
| MVGFormer | 92.32 | 97.93 | 99.32 | 99.55 | 99.86 | 15.99 |
| VoxelPose+3DSA | 94.20 | 98.49 | 99.21 | 99.31 | — | 13.98 |
| MV-SSM | 93.50 | — | — | — | — | 15.70 |
| Optimization-based | ||||||
| ACTOR | — | — | — | — | — | 168.40 |
| MvPose | 37.63 | 95.70 | 97.84 | 98.28 | 99.60 | 26.46 |
| COMPOSE | 54.66 | 97.27 | 98.94 | 99.17 | 99.83 | 23.62 |
| Self-supervised | ||||||
| SelfPose3D | 55.13 | 96.44 | 98.46 | 98.98 | 99.60 | 24.47 |
| DSP† | 57.60 | 86.10 | 94.00 | — | — | 23.10 |
| DisPOSE (Ours) | 68.59 | 98.59 | 99.60 | 99.80 | 99.91 | 21.20 |
Among self-supervised methods, DisPOSE wins every metric: +11.0 AP25 over the next-best baseline and an 8% MPJPE improvement (21.20 vs 23.10 mm). DisPOSE also surpasses every optimization-based competitor at every precision threshold.
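The two headline numbers can be reproduced from the table with a few lines of Python. The `mpjpe` helper shows the usual definition of the metric (mean Euclidean distance over joints) as a sketch; the exact evaluation protocol behind the table is not restated here, and both function names are assumptions of this example.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def relative_reduction(ours, baseline):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Values copied from the self-supervised rows of the table.
ap25_gain = round(68.59 - 57.60, 2)             # DisPOSE vs DSP, AP at 25 mm
mpjpe_gain = relative_reduction(21.20, 23.10)   # DisPOSE vs DSP, MPJPE in mm

# Toy check of the metric itself: joint errors of 5 and 0 average to 2.5.
toy_error = mpjpe(np.zeros((2, 3)), np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]]))
```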
Surgical operating rooms with loose attire, severe occlusion, and unusual postures. 5 calibrated cameras, 750 hand-annotated frames.
Bottom line: on hard surgical scenes DisPOSE nearly doubles AP50 and cuts joint error by 14 mm.
| Method | AP@50 mm ↑ | AP@100 mm ↑ | AP@150 mm ↑ | Recall@500 ↑ | MPJPE (mm) ↓ |
|---|---|---|---|---|---|
| Self-supervised | |||||
| SelfPose3D | 23.38 | 69.68 | 82.82 | 94.33 | 70.69 |
| DisPOSE (Ours) | 47.06 | 83.59 | 92.01 | 97.04 | 56.91 |
Surgical scenes break baselines. SelfPose3D's AP50 collapses from 96.44 on CMU Panoptic to 23.38 here. DisPOSE roughly doubles that AP50 (47.06, +23.7 points), raises Recall@500 by 2.7 points, and reduces MPJPE by about 14 mm (70.69 → 56.91).
Two classic small-scale multi-view benchmarks reported jointly. Shelf (Belagiannis et al., 2014): 4 people interacting around a wooden shelf in a confined indoor space, 5 calibrated cameras. Campus (Belagiannis et al., 2014): 3 subjects navigating an outdoor courtyard, minimal 3-camera setup.
Bottom line: highest average PCP among self-supervised methods on both Shelf and Campus, and competitive with fully-supervised methods.
| Method (all values PCP %, ↑) | Shelf Act 1 | Shelf Act 2 | Shelf Act 3 | Shelf Avg. | Campus Act 1 | Campus Act 2 | Campus Act 3 | Campus Avg. |
|---|---|---|---|---|---|---|---|---|
| Fully-supervised | ||||||||
| Ershadi-Nasab et al. | 93.3 | 75.9 | 94.8 | 88.0 | 94.2 | 92.9 | 84.6 | 90.6 |
| VoxelPose | 99.3 | 94.1 | 97.6 | 97.0 | 97.6 | 93.8 | 98.8 | 96.7 |
| Wu et al. | 99.3 | 96.5 | 97.3 | 97.7 | — | — | — | — |
| MvP | 99.3 | 95.1 | 97.8 | 97.4 | 98.2 | 94.1 | 97.4 | 96.6 |
| Faster VoxelPose | 99.4 | 96.0 | 97.5 | 97.6 | 96.5 | 94.1 | 97.9 | 96.2 |
| TEMPO | 99.3 | 95.1 | 97.8 | 97.4 | 97.7 | 95.5 | 97.9 | 97.3 |
| Optimization-based | ||||||||
| 3DPS | 75.3 | 69.7 | 87.6 | 77.5 | 93.5 | 75.7 | 84.4 | 84.5 |
| MvPose | 98.8 | 94.1 | 97.8 | 96.9 | 97.6 | 93.3 | 98.0 | 96.3 |
| COMPOSE | 99.8 | 92.4 | 96.3 | 96.2 | 99.4 | 94.3 | 98.1 | 97.3 |
| Self-supervised | ||||||||
| SelfPose3D | 97.2 | 90.3 | 97.9 | 95.1 | 92.5 | 82.2 | 89.2 | 87.9 |
| DSP† | 99.1 | 92.8 | 98.3 | 96.7 | 94.9 | 91.0 | 92.4 | 92.8 |
| DisPOSE (Ours) | 99.5 | 94.1 | 97.8 | 97.1 | 98.8 | 93.7 | 94.3 | 95.6 |
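
For reference, PCP (percentage of correct parts) counts a limb as correct when the mean error of its two endpoints is at most a fraction of the ground-truth limb length. The sketch below uses the common threshold of 0.5; the limb list, the threshold default and the function name are assumptions of this example, and the exact protocol behind the table is not restated here.

```python
import numpy as np


def pcp(pred, gt, limbs, alpha=0.5):
    """Percentage of limbs whose mean endpoint error is <= alpha * limb length."""
    correct = 0
    for a, b in limbs:
        limb_length = np.linalg.norm(gt[a] - gt[b])
        endpoint_error = 0.5 * (np.linalg.norm(pred[a] - gt[a])
                                + np.linalg.norm(pred[b] - gt[b]))
        correct += int(endpoint_error <= alpha * limb_length)
    return 100.0 * correct / len(limbs)


# Toy skeleton: three joints on a line, two limbs of length 1.
gt = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 2.0]])
pred = gt.copy()
pred[2, 0] += 1.2                      # push the last joint 1.2 units sideways
score = pcp(pred, gt, limbs=[(0, 1), (1, 2)])
```

In the toy case the first limb is exact and the second has a mean endpoint error of 0.6, above the 0.5 threshold, so half of the limbs count as correct.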
Because assignment and regression are decoupled, DisPOSE is nearly agnostic to the camera arrangement. Voxel-grid baselines collapse on unseen rigs; DisPOSE holds its ground and even improves as more views become available.
Bottom line: on unseen rigs the baseline's root mAP falls to roughly 30% on three of the four setups; DisPOSE stays above 69% on all four.
| Setup / Method | Root mAP ↑ | Root Recall ↑ | Root MDE ↓ | Pose mAP ↑ | Pose Recall ↑ | Pose MPJPE ↓ |
|---|---|---|---|---|---|---|
| CMU1 (7 cameras) | ||||||
| SelfPose3D | 50.99 | 98.00 | 74.22 | 74.50 | 97.98 | 42.08 |
| DisPOSE (Ours) | 73.42 | 99.90 | 46.34 | 91.97 | 99.92 | 23.26 |
| CMU2 (7 cameras) | ||||||
| SelfPose3D | 32.31 | 94.58 | 115.13 | 59.06 | 94.32 | 67.90 |
| DisPOSE (Ours) | 78.94 | 99.52 | 38.28 | 83.01 | 99.80 | 30.44 |
| CMU3 (4 cameras) | ||||||
| SelfPose3D | 29.94 | 85.86 | 114.91 | 61.43 | 83.96 | 49.61 |
| DisPOSE (Ours) | 69.20 | 90.78 | 43.37 | 75.57 | 90.75 | 35.72 |
| CMU4 (4 cameras) | ||||||
| SelfPose3D | 31.19 | 97.92 | 132.07 | 62.85 | 98.32 | 72.91 |
| DisPOSE (Ours) | 71.89 | 99.86 | 50.29 | 87.02 | 99.85 | 28.10 |
Across all four unseen rigs, SelfPose3D's root mAP collapses to 29–51%; DisPOSE stays above 69% on every setup and improves with denser arrays (70.5% on 4-cam → 76.2% on 7-cam average).
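The 4-camera and 7-camera averages quoted above follow directly from the DisPOSE root mAP column. A short check, with the values copied from the table and a helper name chosen for this example:

```python
from statistics import mean

# DisPOSE root mAP on each unseen rig: (number of cameras, root mAP).
dispose_root_map = {
    "CMU1": (7, 73.42),
    "CMU2": (7, 78.94),
    "CMU3": (4, 69.20),
    "CMU4": (4, 71.89),
}


def mean_by_camera_count(rows):
    """Group root mAP values by camera count and average each group."""
    groups = {}
    for cameras, value in rows.values():
        groups.setdefault(cameras, []).append(value)
    return {cameras: mean(values) for cameras, values in groups.items()}


averages = mean_by_camera_count(dispose_root_map)
```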
Bottom line: DisPOSE's pose accuracy keeps improving as cameras are added; SelfPose3D's peaks at four views and degrades beyond that.
| Setup / Method | Root mAP ↑ | Root Recall ↑ | Root MDE ↓ | Pose mAP ↑ | Pose Recall ↑ | Pose MPJPE ↓ |
|---|---|---|---|---|---|---|
| CMU0 ·3 views | ||||||
| SelfPose3D | 54.22 | 92.91 | 71.42 | 66.42 | 93.40 | 53.57 |
| DisPOSE (Ours) | 76.68 | 98.77 | 40.82 | 72.06 | 98.83 | 47.29 |
| CMU0 ·4 views | ||||||
| SelfPose3D | 62.29 | 99.44 | 63.98 | 86.59 | 99.44 | 28.99 |
| DisPOSE (Ours) | 79.26 | 99.83 | 39.36 | 92.13 | 99.85 | 23.83 |
| CMU0 ·6 views | ||||||
| SelfPose3D | 62.90 | 99.39 | 62.93 | 78.77 | 99.41 | 37.49 |
| DisPOSE (Ours) | 80.42 | 99.90 | 37.29 | 95.15 | 99.91 | 20.87 |
| CMU0 ·7 views | ||||||
| SelfPose3D | 63.02 | 99.24 | 62.58 | 78.73 | 99.20 | 37.11 |
| DisPOSE (Ours) | 80.41 | 99.90 | 37.34 | 95.65 | 99.91 | 20.49 |
DisPOSE scales monotonically with view count: pose mAP climbs from 72.06 at 3 views to a peak of 95.65 at 7 views, with MPJPE dropping from 47.3 mm to 20.5 mm. SelfPose3D peaks at 4 views and then regresses (86.59 → 78.77 → 78.73): more cameras hurt it.
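The scaling claim can be checked mechanically against the pose mAP column. The values below are copied from the table; the helper names are assumptions of this example.

```python
views = [3, 4, 6, 7]
pose_map = {
    "DisPOSE": [72.06, 92.13, 95.15, 95.65],
    "SelfPose3D": [66.42, 86.59, 78.77, 78.73],
}


def is_non_decreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


def best_view_count(values):
    """Number of views at which the metric is highest."""
    best_index = max(range(len(values)), key=values.__getitem__)
    return views[best_index]


dispose_monotonic = is_non_decreasing(pose_map["DisPOSE"])
selfpose_monotonic = is_non_decreasing(pose_map["SelfPose3D"])
selfpose_peak = best_view_count(pose_map["SelfPose3D"])
```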
Despite running a full diffusion-based assignment process, DisPOSE is 3–4× faster than SelfPose3D across every view count, with modest memory growth. The differentiable Sinkhorn projection accounts for roughly 3% of total runtime.
Measured on a single NVIDIA A40, CMU Panoptic, batch size 8 (paper Tables 17 & 18). The differentiable Sinkhorn projection costs 2.8–3.0% of total runtime; “Other” groups all remaining operations (data movement, I/O, and post-processing).
Frames are processed independently; motion continuity is unused. Temporal trajectory priors are a natural next step.
A person visible in only one camera cannot be triangulated, and therefore cannot be reconstructed.
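The single-view limitation is geometric: one 2D observation constrains a 3D point only to a ray. A minimal linear (DLT) triangulation sketch makes this explicit by refusing fewer than two views. The camera matrices and the function name are illustrative assumptions, and this is a textbook method rather than the DisPOSE pipeline.

```python
import numpy as np


def triangulate_dlt(projections, points_2d):
    """Linear triangulation of one 3D point from two or more 3x4 projection matrices."""
    if len(projections) < 2:
        raise ValueError("need at least two views to triangulate")
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])   # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]                         # null-space direction in homogeneous coordinates
    return X[:3] / X[3]


def project(P, X):
    """Project a 3D point with a 3x4 projection matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


K = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])   # second camera offset along x

X_true = np.array([0.2, -0.1, 4.0])
observations = [project(P1, X_true), project(P2, X_true)]
X_hat = triangulate_dlt([P1, P2], observations)
```

With a single view the linear system has only two equations for three unknowns, so its solution set is a whole ray rather than a point.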
@inproceedings{wang2026dispose,
  title     = {DisPOSE: Projected Polystochastic Diffusion for
               Self-Supervised Multi-View 3D Human Pose Estimation},
  author    = {Wang, Tony Danjun and Birdal, Tolga and
               Navab, Nassir and Bastian, Lennart},
  booktitle = {Proceedings of the 43rd International Conference on
               Machine Learning},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
  year      = {2026},
}