DisPOSE: Projected Polystochastic Diffusion

At a glance

DisPOSE in a nutshell

We estimate multi-person 3D poses from several cameras, learned without 3D ground truth.

1 · Disentangle assignment from regression

Match people first, then estimate their poses.

Who? link each person across views, then triangulate the 3D root location
Where? regress the full body on that root
Payoff: one model fits different camera rigs without fine-tuning

2 · Approximate discrete matching by leveraging diffusion

A combinatorial matching problem, made learnable.

Relax: discrete hypergraph matching to a continuous polytope of valid multi-view assignments
Diffuse: denoise toward a match, staying feasible at every step
Train: fully differentiable, no 3D ground truth

+11 AP₂₅ over the best self-supervised baseline on CMU Panoptic

3–4× faster than SelfPose3D across all view counts

75% vs 59% mAP against the best self-supervised baseline on unseen camera rigs

**The pipeline.** The DisPOSE framework has two stages. Stage I (root assignment) matches people across views and triangulates their 3D root positions; Stage II (pose regression) regresses each full-body pose with a hypergraph-convolutional decoder.

Projected diffusion in motion. Detections across the camera views form a noisy multi-view correspondence hypergraph. DisPOSE denoises it as a diffusion on the space of polystochastic tensors (the sphere inset), projecting each reverse step back onto this feasible set, until the correspondences resolve into consistent matches and full 3D poses.

Key Contributions

Contribution ①

Projected Polystochastic Diffusion

The assignment-as-diffusion formulation and the hypergraph pose decoder.

Contribution ②

MM-OR Pose Benchmark

750 hand-annotated frames of authentic surgical scenes. Loose surgical attire and severe occlusion.

Contribution ③

Camera-Agnostic SOTA

New self-supervised state-of-the-art across four benchmarks, and 75% vs 59% mAP on unseen camera rigs without any fine-tuning.

Contribution ② · New Benchmark

MM-OR Pose: Surgical Operating Rooms

Existing multi-view 3D pose benchmarks (CMU Panoptic, Shelf, Campus) feature relatively moderate occlusion and standard street attire. Real clinical scenes break both assumptions: surgeons wear loose smocks, the field is crowded with medical equipment, lights and staff. We extend the MM-OR dataset (Özsoy et al., 2025), recordings of knee surgeries, with 750 manually annotated 3D pose frames across three surgical sequences, releasing the labels as the MM-OR Pose benchmark.

MM-OR Pose · Ground Truth

Showcase of the new MM-OR Pose ground-truth annotations across the three labelled sequences (004 PKA, 011 TKA, 036 PKA).

Abstract

TL;DR: a self-supervised framework that solves multi-view person assignment as diffusion on the polystochastic polytope, then regresses 3D skeletons with a hypergraph decoder. New self-supervised state of the art, robust to unseen camera rigs, no 3D ground truth.

Recovering 3D human poses for multiple individuals from different camera views is a fundamental bottleneck for analyzing interacting behaviors. Existing self-supervised approaches rely on synthetic catalogues of 3D poses; however, this leads to poor generalization in real-world scenarios due to distribution shifts. We therefore introduce DisPOSE, a self-supervised framework that approximates the inherently discrete multi-view person-assignment problem as a generative diffusion process over the space of polystochastic tensors. By employing differentiable Sinkhorn projections during denoising, our model learns to guide solutions toward feasible assignments based on 2D image priors. The complete 3D skeletons of localized individuals are then regressed using a Hypergraph-Convolutional Decoder that explicitly models relational structures and articulated joints across multiple views. The proposed approach outperforms current state-of-the-art self-supervised methods on standard datasets and demonstrates strong performance on a newly proposed benchmark featuring highly occluded scenes from surgical operating rooms. Our diffusion-based localization demonstrates high label efficiency, retaining 99% of its performance with only 10% of the pseudo-labels. Disentangling the assignment and root regression components while maintaining differentiability makes DisPOSE nearly agnostic to different camera arrangements.

Contribution ① · Method

From Voxel Grids to a Diffusion on the Polytope

Prevailing methods fuse all views into a 3D voxel grid and solve assignment and regression jointly, which ties the model to a fixed camera rig and hurts generalization. DisPOSE keeps the two apart and recasts assignment as a matching problem over the multi-view hypergraph below.

Preliminary

From pairwise edges to hyperedges

Pairwise edges

An edge links two detections: “view 1 ↔ view 2 is the same person”. Many such pairwise edges must then be reconciled into a globally consistent matching (cycle consistency, synchronization). Local mistakes propagate.

Hyperedges (ours)

A single hyperedge groups one detection per view into one multi-view person hypothesis. Consistency is enforced jointly across all cameras at once. The atomic unit of correspondence is the hypothesis rather than the pairwise match.

Following COMPOSE (Wang et al., 2026), DisPOSE adopts this higher-order formulation: every candidate cross-view association is a hyperedge of the multi-view correspondence hypergraph. Stage I then runs diffusion on the polytope of hyperedge selections.
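To make the difference between the two formulations concrete, the sketch below enumerates both kinds of candidates for a small scene. It is only an illustration under simplifying assumptions: the function names and detection counts are hypothetical, and every person is assumed to be detected in every view, so missed detections are ignored.

```python
from itertools import product


def enumerate_hyperedges(detections_per_view):
    """Return every candidate hyperedge: one detection index per view."""
    index_ranges = [range(n) for n in detections_per_view]
    return list(product(*index_ranges))


def pairwise_edges(detections_per_view):
    """Return every pairwise edge ((view_a, i), (view_b, j)) with view_a < view_b."""
    edges = []
    num_views = len(detections_per_view)
    for a in range(num_views):
        for b in range(a + 1, num_views):
            for i in range(detections_per_view[a]):
                for j in range(detections_per_view[b]):
                    edges.append(((a, i), (b, j)))
    return edges


# Hypothetical scene: 2, 2 and 3 detections in views 0, 1 and 2.
counts = [2, 2, 3]
hyperedges = enumerate_hyperedges(counts)
pairs = pairwise_edges(counts)

print(len(hyperedges), "hyperedges, first:", hyperedges[0])
print(len(pairs), "pairwise edges, first:", pairs[0])
```

Each hyperedge is already a complete multi-view hypothesis, whereas the pairwise edges still have to be stitched together and checked for cycle consistency afterwards.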

Stage I: Projected Polystochastic Diffusion

1. Build a multi-view correspondence hypergraph from 2D root heatmaps
2. Diffuse on the polystochastic polytope (Multi-Marginal Sinkhorn projection every step)
3. Differentiable triangulation to 3D root locations

Per-view 2D root candidates (extracted via soft-argmax on the heatmaps H_v) become the nodes of a multi-view correspondence hypergraph; each hyperedge is a candidate cross-camera association. We relax the discrete assignment over this hypergraph into a continuous polystochastic tensor: the multi-mode analogue of a doubly-stochastic matrix.
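A minimal sketch of such a relaxation is shown below, assuming that feasibility means every detection in every view carries unit total mass over the hyperedges it belongs to. This marginal-based reading, the sparse dictionary layout, and the names `sinkhorn_project` and `mode_marginals` are assumptions made for illustration; the paper's exact constraint set and implementation may differ.

```python
from itertools import product


def mode_marginals(tensor, num_views):
    """Sum the mass attached to each detection index, separately for each view."""
    sums = [dict() for _ in range(num_views)]
    for edge, weight in tensor.items():
        for view, index in enumerate(edge):
            sums[view][index] = sums[view].get(index, 0.0) + weight
    return sums


def sinkhorn_project(weights, num_iters=200):
    """Cyclically rescale positive hyperedge weights until all marginals are ~1."""
    tensor = dict(weights)
    num_views = len(next(iter(tensor)))
    for _ in range(num_iters):
        for view in range(num_views):
            totals = mode_marginals(tensor, num_views)[view]
            for edge in tensor:
                # Normalise the mass of every detection in this view to one.
                tensor[edge] /= totals[edge[view]]
    return tensor


# Hypothetical scores: 2 people, 3 views, two strongly supported hypotheses.
scores = {edge: 0.1 for edge in product(range(2), repeat=3)}
scores[(0, 0, 0)] = 2.0
scores[(1, 1, 1)] = 1.0

projected = sinkhorn_project(scores)
marginals = mode_marginals(projected, 3)
print("marginals per view:", marginals)
print("mass on (0, 0, 0):", projected[(0, 0, 0)])
```

Because the loop consists only of sums and divisions, the same iteration is differentiable when written in an autodiff framework, which is what makes a projection of this kind usable inside a learned reverse process.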

**Stage I.** A denoising hypergraph network operates on the polystochastic tensor space 𝒮^(V). Every reverse-time prediction is projected back onto the feasible polytope via Π_𝒮, and differentiable triangulation produces per-individual 3D root locations **𝒫_root**.

A hypergraph denoiser learns the reverse-time process on this polytope. At each step it predicts a clean assignment χ₀; the differentiable Sinkhorn projection Π_𝒮 then snaps the state back onto the polystochastic manifold, and a DDIM update walks the latent one step closer to t = 0.
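The DDIM part of this step can be sketched with the standard deterministic update (η = 0) applied to a flattened latent. This is an illustration rather than the authors' code: the noise schedule `alpha_bars`, the stand-in `toy_denoiser`, and the omission of the Sinkhorn projection are all assumptions made to keep the example short.

```python
import math


def ddim_step(u_t, u0_hat, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update: re-noise the clean estimate to the previous level."""
    sqrt_a_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_a_t = math.sqrt(1.0 - alpha_bar_t)
    # Noise implied by the current latent and the predicted clean state.
    eps_hat = [(u - sqrt_a_t * u0) / sqrt_one_minus_a_t for u, u0 in zip(u_t, u0_hat)]
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_a_prev = math.sqrt(1.0 - alpha_bar_prev)
    return [sqrt_a_prev * u0 + sqrt_one_minus_a_prev * e for u0, e in zip(u0_hat, eps_hat)]


def toy_denoiser(u_t, target):
    """Stand-in for the learned denoiser: always predicts the same clean state."""
    return list(target)


alpha_bars = [1.0, 0.9, 0.5, 0.1]   # hypothetical schedule, indexed by timestep t
target = [1.0, 0.0, 0.0, 1.0]       # flattened clean assignment (hypothetical)
latent = [0.3, -0.8, 1.2, 0.1]      # noisy start at t = 3

for t in range(3, 0, -1):
    u0_hat = toy_denoiser(latent, target)
    latent = ddim_step(latent, u0_hat, alpha_bars[t], alpha_bars[t - 1])

print("latent at t = 0:", latent)
```

In the projected variant described above, the feasibility projection would be applied inside each iteration of this loop, so the state never leaves the polytope.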

**Zoom on one reverse-time step.** The latent *u_t* is projected onto the polystochastic polytope via Π_𝒮, denoised by *f_θ* to a clean estimate û₀, and combined by the DDIM update to produce *u_{t−1}*. Iterating this step walks the latent back to *t = 0*.
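Stage I ends with triangulation of the matched root detections. The page does not specify which triangulation is used, so the sketch below swaps in a weighted least-squares ray intersection as one simple choice: it finds the point closest to a set of camera rays, with per-view weights that could come from the soft assignment. The camera centres, ray directions, weights, and function names are all hypothetical.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(a, b):
    """Solve the 3x3 system a x = b with Cramer's rule."""
    d = det3(a)
    solution = []
    for col in range(3):
        replaced = [[b[r] if c == col else a[r][c] for c in range(3)] for r in range(3)]
        solution.append(det3(replaced) / d)
    return solution


def triangulate(origins, directions, weights):
    """Weighted least-squares point closest to rays origin + s * direction."""
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d, w in zip(origins, directions, weights):
        norm = math.sqrt(sum(x * x for x in d))
        unit = [x / norm for x in d]
        for r in range(3):
            for c in range(3):
                # Entry of (I - d d^T), which removes the component along the ray.
                proj = (1.0 if r == c else 0.0) - unit[r] * unit[c]
                a[r][c] += w * proj
                b[r] += w * proj * o[c]
    return solve3(a, b)


# Hypothetical setup: three camera centres whose rays all pass through one root.
true_root = [1.0, 2.0, 3.0]
origins = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
directions = [[p - o for p, o in zip(true_root, origin)] for origin in origins]
weights = [0.9, 0.8, 0.7]

root = triangulate(origins, directions, weights)
print("triangulated root:", root)
```

Every operation here is smooth in the weights and ray directions, so gradients could flow through it in an autodiff framework. The linear system becomes singular when all rays are parallel, which is the degenerate case for any triangulation.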

Stage II: Hypergraph Pose Regression

1. T-Pose initialization at each predicted 3D root
2. Sample per-joint features at projected 3D locations in every view
3. Hypergraph convolutions over cross-view + person-part edges

Each individual starts as a T-Pose 𝒫⁽⁰⁾ anchored at its Stage-I root. We sample per-joint multi-view features at the projected 3D locations p_{v,k,j}, then iteratively refine the pose 𝒫^(τ) → 𝒫^(τ+1) with a Hypergraph Convolutional Decoder. The decoder runs attention over two edge types: cross-view hyperedges fuse evidence across cameras, and person-part hyperedges enforce articulated joint consistency. This resolves fine-grained pose without ever materialising a 3D voxel grid.
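The feature-sampling step can be sketched as a pinhole projection followed by bilinear interpolation. The 3×4 camera matrix, the tiny single-channel feature map, and the function names below are hypothetical stand-ins; a real model would sample multi-channel backbone features from every view.

```python
import math


def project(point, camera):
    """Project a 3D point with a 3x4 camera matrix and return the pixel (x, y)."""
    homogeneous = list(point) + [1.0]
    u, v, w = (sum(row[i] * homogeneous[i] for i in range(4)) for row in camera)
    return u / w, v / w


def bilinear_sample(feature_map, x, y):
    """Interpolate feature_map[row][col] at column x and row y, clamped to the border."""
    height, width = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = (1.0 - fx) * feature_map[y0][x0] + fx * feature_map[y0][x1]
    bottom = (1.0 - fx) * feature_map[y1][x0] + fx * feature_map[y1][x1]
    return (1.0 - fy) * top + fy * bottom


# Hypothetical camera: unit focal length, principal point at (0.5, 0.5).
camera = [[1.0, 0.0, 0.5, 0.0],
          [0.0, 1.0, 0.5, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
feature_map = [[0.0, 1.0],
               [2.0, 3.0]]
joint = (0.0, 0.0, 2.0)  # a joint two units in front of the camera

pixel = project(joint, camera)
feature = bilinear_sample(feature_map, *pixel)
print("pixel:", pixel, "sampled feature:", feature)
```

Sampling features only at projected joint locations is what lets the decoder avoid building a dense 3D volume: the cost scales with the number of joints and views rather than with the number of voxels.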

**Stage II.** Starting from T-Pose initializations, with per-joint features sampled at the projected 3D locations, the Hypergraph Convolutional Decoder iteratively refines each skeleton over cross-view (blue) and person-part (pink) hyperedges, yielding the final per-root 3D pose **𝒫^(T)**.
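To illustrate how information moves over the two hyperedge types, the sketch below runs one node-to-hyperedge-to-node pass with plain mean aggregation. This is a deliberate simplification: the decoder described here uses attention, and the scalar features, joint names, and grouping of person-part hyperedges per view are hypothetical choices made for the example.

```python
def hypergraph_mean_pass(node_features, hyperedges):
    """One message-passing round: average nodes into hyperedges, then back into nodes."""
    edge_features = [
        sum(node_features[node] for node in edge) / len(edge) for edge in hyperedges
    ]
    updated = {}
    for node, value in node_features.items():
        incident = [edge_features[i] for i, edge in enumerate(hyperedges) if node in edge]
        # Nodes that belong to no hyperedge keep their feature unchanged.
        updated[node] = sum(incident) / len(incident) if incident else value
    return updated


# Nodes are (view, joint) pairs carrying a scalar feature (hypothetical values).
features = {
    (0, "hip"): 1.0, (1, "hip"): 3.0,
    (0, "knee"): 5.0, (1, "knee"): 7.0,
}
# Cross-view hyperedges tie the same joint together across cameras.
cross_view = [((0, "hip"), (1, "hip")), ((0, "knee"), (1, "knee"))]
# Person-part hyperedges tie connected joints together within a view.
person_part = [((0, "hip"), (0, "knee")), ((1, "hip"), (1, "knee"))]

refined = hypergraph_mean_pass(features, cross_view + person_part)
print(refined)
```

After one pass, each joint feature mixes evidence from the other camera and from its neighbouring joint, which is the kind of coupling the two edge types are meant to provide.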

Results

DisPOSE outperforms every self-supervised baseline on all four benchmarks below (by average PCP on Shelf and Campus), and the margin grows on the hardest cases (surgery, unseen rigs). We compare against fully-supervised, optimization-based, and self-supervised methods on CMU Panoptic, Shelf & Campus, and the new MM-OR Pose.

Dataset 1 CMU Panoptic

Large-scale multi-view dome capture (Joo et al., 2015) with moderate occlusion and social interactions.

Bottom line: best self-supervised method on every metric, +11 AP₂₅ over the next best.

Table 1 Pose estimation on CMU Panoptic (AP thresholds and MPJPE in mm). Best self-supervised results in **bold**; DisPOSE row highlighted. † uses 9 temporal frames as input.
Method	AP₂₅ ↑	AP₅₀ ↑	AP₁₀₀ ↑	AP₁₅₀ ↑	Recall@500 ↑	MPJPE (mm) ↓
Fully-supervised
VoxelPose	83.59	98.33	99.76	99.91	—	17.68
Plane Sweep	92.12	98.96	99.81	99.84	—	16.75
MvP	92.28	96.60	97.45	97.69	—	15.76
Wu et al.	93.93	98.93	99.78	99.90	99.97	15.63
Faster VoxelPose	85.22	98.08	99.32	99.48	—	18.26
TEMPO	89.01	99.08	99.76	99.93	—	14.68
MVGFormer	92.32	97.93	99.32	99.55	99.86	15.99
VoxelPose+3DSA	94.20	98.49	99.21	99.31	—	13.98
MV-SSM	93.50	—	—	—	—	15.70
Optimization-based
ACTOR	—	—	—	—	—	168.40
MvPose	37.63	95.70	97.84	98.28	99.60	26.46
COMPOSE	54.66	97.27	98.94	99.17	99.83	23.62
Self-supervised
SelfPose3D	55.13	96.44	98.46	98.98	99.60	24.47
DSP†	57.60	86.10	94.00	—	—	23.10
DisPOSE (Ours)	68.59	98.59	99.60	99.80	99.91	21.20

Among self-supervised methods, DisPOSE wins every metric: +11.0 AP₂₅ over the next-best baseline and an 8% MPJPE improvement (21.20 vs 23.10 mm). DisPOSE also surpasses every optimization-based competitor at every precision threshold.
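The two headline numbers follow directly from the self-supervised rows of Table 1. The short check below recomputes them; the dictionary layout is just a convenient way to hold the table values.

```python
# Values copied from the self-supervised rows of Table 1.
dispose = {"AP25": 68.59, "MPJPE": 21.20}
dsp = {"AP25": 57.60, "MPJPE": 23.10}  # next-best baseline on both metrics

ap25_gain = dispose["AP25"] - dsp["AP25"]
mpjpe_reduction = (dsp["MPJPE"] - dispose["MPJPE"]) / dsp["MPJPE"]

print(f"AP25 gain: {ap25_gain:+.1f} points")
print(f"Relative MPJPE reduction: {mpjpe_reduction:.1%}")
```

The AP₂₅ gain is an absolute difference in points, while the MPJPE figure is a relative reduction with the baseline error as the denominator.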

Dataset 2 MM-OR Pose (new; see Contribution ②)

Surgical operating rooms with loose attire, severe occlusion, and unusual postures. 5 calibrated cameras, 750 hand-annotated frames.

Bottom line: on hard surgical scenes DisPOSE doubles AP₅₀ and cuts joint error by 14 mm.

Table 2 Results on the proposed MM-OR Pose benchmark (AP thresholds and MPJPE in mm). Best results in **bold**. AP₂₅ omitted: surgical helmets and gowns make millimeter-precise GT annotation infeasible.
Method	AP₅₀ ↑	AP₁₀₀ ↑	AP₁₅₀ ↑	Recall@500 ↑	MPJPE (mm) ↓
Self-supervised
SelfPose3D	23.38	69.68	82.82	94.33	70.69
DisPOSE (Ours)	47.06	83.59	92.01	97.04	56.91

Surgical scenes break baselines. SelfPose3D's AP₅₀ collapses from 96.44 on CMU Panoptic to 23.38 on MM-OR Pose, while DisPOSE reaches 47.06, double the baseline (+23.7), lifts Recall@500 by 2.7 points, and shrinks MPJPE by 14 mm.

Datasets 3 & 4 Shelf & Campus

Two classic small-scale multi-view benchmarks reported jointly. Shelf (Belagiannis et al., 2014): 4 people interacting around a wooden shelf in a confined indoor space, 5 calibrated cameras. Campus (Belagiannis et al., 2014): 3 subjects navigating an outdoor courtyard, minimal 3-camera setup.

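Bottom line: highest average PCP among self-supervised methods on both datasets (only on Shelf Act 3 do DSP at 98.3 and SelfPose3D at 97.9 edge ahead of its 97.8), while staying competitive with fully-supervised methods.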

Table 3 Pose estimation on Shelf and Campus (PCP %). Best self-supervised results in **bold**. † uses 81 temporal frames as input.
Method	Shelf Act 1 ↑	Shelf Act 2 ↑	Shelf Act 3 ↑	Shelf Avg. ↑	Campus Act 1 ↑	Campus Act 2 ↑	Campus Act 3 ↑	Campus Avg. ↑
Fully-supervised
Ershadi-Nasab et al.	93.3	75.9	94.8	88.0	94.2	92.9	84.6	90.6
VoxelPose	99.3	94.1	97.6	97.0	97.6	93.8	98.8	96.7
Wu et al.	99.3	96.5	97.3	97.7	—	—	—	—
MvP	99.3	95.1	97.8	97.4	98.2	94.1	97.4	96.6
Faster VoxelPose	99.4	96.0	97.5	97.6	96.5	94.1	97.9	96.2
TEMPO	99.3	95.1	97.8	97.4	97.7	95.5	97.9	97.3
Optimization-based
3DPS	75.3	69.7	87.6	77.5	93.5	75.7	84.4	84.5
MvPose	98.8	94.1	97.8	96.9	97.6	93.3	98.0	96.3
COMPOSE	99.8	92.4	96.3	96.2	99.4	94.3	98.1	97.3
Self-supervised
SelfPose3D	97.2	90.3	97.9	95.1	92.5	82.2	89.2	87.9
DSP†	99.1	92.8	98.3	96.7	94.9	91.0	92.4	92.8
DisPOSE (Ours)	99.5	94.1	97.8	97.1	98.8	93.7	94.3	95.6

Contribution ③ Camera Generalization

Because assignment and regression are decoupled, DisPOSE is nearly agnostic to the camera arrangement. Voxel-grid baselines collapse on unseen rigs; DisPOSE holds its ground and even improves as more views become available.

Bottom line: on unseen rigs the baseline's root mAP falls as low as ~30%; DisPOSE stays above 69% on every setup.

Table 4 Pose estimation across four **unseen camera arrangements** on CMU Panoptic. No fine-tuning. Best results in **bold**.
Setup / Method	Root mAP ↑	Root Recall ↑	Root MDE ↓	Pose mAP ↑	Pose Recall ↑	Pose MPJPE ↓
CMU1 (7 cameras)
SelfPose3D	50.99	98.00	74.22	74.50	97.98	42.08
DisPOSE (Ours)	73.42	99.90	46.34	91.97	99.92	23.26
CMU2 (7 cameras)
SelfPose3D	32.31	94.58	115.13	59.06	94.32	67.90
DisPOSE (Ours)	78.94	99.52	38.28	83.01	99.80	30.44
CMU3 (4 cameras)
SelfPose3D	29.94	85.86	114.91	61.43	83.96	49.61
DisPOSE (Ours)	69.20	90.78	43.37	75.57	90.75	35.72
CMU4 (4 cameras)
SelfPose3D	31.19	97.92	132.07	62.85	98.32	72.91
DisPOSE (Ours)	71.89	99.86	50.29	87.02	99.85	28.10

Across all four unseen rigs, SelfPose3D's root mAP falls to 30–51%; DisPOSE stays above 69% on every setup and improves with denser arrays (root mAP of 70.5% averaged over the two 4-camera rigs → 76.2% averaged over the two 7-camera rigs).
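The per-density averages quoted above can be recomputed from the DisPOSE root mAP column of Table 4. The grouping helper below is a hypothetical convenience, not part of any released code.

```python
# DisPOSE root mAP from Table 4, stored as setup -> (number of cameras, root mAP).
root_map = {
    "CMU1": (7, 73.42),
    "CMU2": (7, 78.94),
    "CMU3": (4, 69.20),
    "CMU4": (4, 71.89),
}


def mean_by_cameras(table):
    """Average the metric over all setups that share the same camera count."""
    groups = {}
    for cameras, value in table.values():
        groups.setdefault(cameras, []).append(value)
    return {cameras: sum(values) / len(values) for cameras, values in groups.items()}


averages = mean_by_cameras(root_map)
for cameras in sorted(averages):
    print(f"{cameras} cameras: {averages[cameras]:.1f}% root mAP")
```

Each average covers only two rigs, so it indicates a trend across camera density rather than a precise scaling law.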

Bottom line: DisPOSE keeps improving as cameras are added; SelfPose3D plateaus after four views.

Table 10 Scaling on the standard CMU0 setup with a **varying number of inference cameras** (3 / 4 / 6 / 7). DisPOSE's pose metrics improve monotonically as more cameras are added; SelfPose3D's pose metrics peak at 4 views. Best results in **bold**.
Setup / Method	Root mAP ↑	Root Recall ↑	Root MDE ↓	Pose mAP ↑	Pose Recall ↑	Pose MPJPE ↓
CMU0 · 3 views
SelfPose3D	54.22	92.91	71.42	66.42	93.40	53.57
DisPOSE (Ours)	76.68	98.77	40.82	72.06	98.83	47.29
CMU0 · 4 views
SelfPose3D	62.29	99.44	63.98	86.59	99.44	28.99
DisPOSE (Ours)	79.26	99.83	39.36	92.13	99.85	23.83
CMU0 · 6 views
SelfPose3D	62.90	99.39	62.93	78.77	99.41	37.49
DisPOSE (Ours)	80.42	99.90	37.29	95.15	99.91	20.87
CMU0 · 7 views
SelfPose3D	63.02	99.24	62.58	78.73	99.20	37.11
DisPOSE (Ours)	80.41	99.90	37.34	95.65	99.91	20.49

DisPOSE's pose accuracy improves monotonically with view count: pose mAP climbs from 72.06 at 3 views to a peak of 95.65 at 7 views, with MPJPE dropping from 47.3 mm to 20.5 mm, while its root metrics level off between 6 and 7 views (root mAP 80.42 vs 80.41). SelfPose3D's pose mAP peaks at 4 views and then regresses (86.59 → 78.77 → 78.73): more cameras hurt it.

Efficiency Inference Latency & GPU Memory

Despite running a full diffusion-based assignment process, DisPOSE is 3–4× faster than SelfPose3D across every view count, with modest memory growth. The differentiable Sinkhorn projection accounts for about 3% of total runtime (2.8–3.0%).

Figure: Efficiency, DisPOSE vs SelfPose3D. Panels: inference latency (ms, lower is better; DisPOSE ≈ 3–4× faster), peak GPU memory (GiB, lower is better), and the breakdown of DisPOSE's 452 ms at 5 views by component (2D backbone, hypergraph decoder, diffusion denoising, Sinkhorn projection at 2.8%, other).

Measured on a single NVIDIA A40, CMU Panoptic, batch size 8 (paper Tables 17 & 18). The differentiable Sinkhorn projection costs 2.8–3.0% of total runtime; “Other” groups all remaining operations (data movement, I/O, and post-processing).

Qualitative Side-by-side comparisons

**Out-of-distribution generalization on CMU Panoptic.** Red is the ground-truth pose. SelfPose3D (left) misses the toddler crawling on the ground: its red ground truth is left unmatched, a result of training on synthetic adult pose catalogues. DisPOSE (right) detects and articulates it.

**MM-OR Pose, a bending surgeon.** SelfPose3D (left) fails on the non-canonical bent-over posture. DisPOSE (right) recovers both surgeons despite severe occlusion from sterile gowns and surgical helmets.

Limitations & Future Work

Limitation 1

No temporal modeling

Frames are processed independently; motion continuity is unused. Temporal trajectory priors are a natural next step.

Limitation 2

Single-view occlusion

A person visible in only one camera cannot be triangulated, and therefore cannot be reconstructed.

BibTeX

@inproceedings{wang2026dispose,
    title     = {DisPOSE: Projected Polystochastic Diffusion for
                 Self-Supervised Multi-View 3D Human Pose Estimation},
    author    = {Wang, Tony Danjun and Birdal, Tolga and
                 Navab, Nassir and Bastian, Lennart},
    booktitle = {Proceedings of the 43rd International Conference on
                 Machine Learning},
    series    = {Proceedings of Machine Learning Research},
    publisher = {PMLR},
    year      = {2026},
}