Beyond Role-Based Surgical Domain Modeling: A Staff-Centric Approach
TL;DR:
We challenge the traditional, role-based view of the operating room and propose a "staff-centric" model that recognizes the unique impact of each individual on surgical dynamics.
Surgical data science has aimed to optimize operating room (OR) workflows by analyzing the roles of the surgical staff. However, this approach treats team members as interchangeable parts, failing to capture the nuanced differences between them. Given that factors like team familiarity and individual habits significantly affect surgical outcomes, from operative times to complication rates, we argue that this role-based view is not enough. We propose a fundamental shift towards a staff-centric model where each person is recognized as a unique individual, not simply an abstract role. However, achieving this presents a significant challenge.
To illustrate the limitation of the role-centric model, we show two different teams performing the same surgery with the same role assignments. Notice how their coordination and use of space differ, highlighting how much team dynamics can change even when roles remain the same.
Challenges in Surgical Operating Rooms
The surgical OR is a challenging environment compared to everyday settings, where individuals are easily distinguished by their clothing and faces. Team members wear standardized, often visually indistinct smocks and gowns to maintain sterility. This homogeneity causes traditional appearance-based identification methods, which rely on texture and color, to fail. Furthermore, surgical masks, skull caps, and other protective equipment hide facial features and hair, which are among the most prominent cues for person re-identification.
As shown in the figure below, the standardized attire makes it difficult to differentiate individuals based on appearance alone.
Generalizable Re-Identification in the Operating Room
Mitigating Biases in Surgical Operating Rooms with Geometry
Deep neural networks are known to have a strong bias towards texture. In the context of the OR, these networks learn spurious correlations from incidental visual cues rather than robust biometric features. Our saliency analysis reveals a clear example of this "shortcut learning" on the 4D-OR dataset, where models fixate on simulation artifacts like street shoes and distinct eyewear visible beneath surgical gowns. These shortcuts allow the models to achieve high accuracy within that specific dataset. However, when these artifacts vanish in the more realistic MM-OR dataset, the models' activation maps become diffuse and unfocused, suggesting they struggle to find any reliable features.
To overcome this, we shift our focus from appearance to geometry and articulated motion. Our approach encodes surgical personnel as sequences of 3D point clouds, naturally disentangling identity-relevant shape and motion patterns from misleading, appearance-based confounders. By capturing invariant geometric properties such as an individual's stature, body shape, and unique movement patterns, we can create a robust biometric signature that persists even when attire is identical.
The video below illustrates this concept. While the surgeons are visually ambiguous in the RGB image, their distinct heights and body shapes become clear and measurable in the point cloud representation.
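As a toy illustration of how such geometric cues can be measured, the sketch below estimates a person's stature and horizontal extents directly from a segmented point cloud. The percentile-based estimate and the synthetic data are illustrative assumptions, not our exact feature extraction.

```python
# A minimal sketch: given a person's segmented point cloud as an (N, 3) array in
# metres, derive coarse geometric cues such as stature and body build that remain
# stable even when attire is identical.
import numpy as np

def geometric_cues(points: np.ndarray, up_axis: int = 2) -> dict:
    """Estimate coarse body-shape statistics from a segmented person point cloud."""
    heights = points[:, up_axis]
    # Robust stature estimate: span between low/high percentiles along the up axis.
    stature = np.percentile(heights, 99) - np.percentile(heights, 1)
    # Horizontal extents serve as a rough proxy for body build.
    horizontal = np.delete(points, up_axis, axis=1)
    extents = horizontal.max(axis=0) - horizontal.min(axis=0)
    return {"stature_m": float(stature),
            "width_m": float(extents.max()),
            "depth_m": float(extents.min())}

# Example with synthetic points standing in for a real segmentation.
person = np.random.rand(5000, 3) * np.array([0.5, 0.4, 1.8])
print(geometric_cues(person))
```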
Methodology
We first segment each individual from the fused 3D point cloud of the scene. For each person, we then use their segmented point cloud to extract RGB bounding boxes and render a series of 2D depth maps from multiple virtual camera angles.
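The sketch below shows one simple way such virtual views could be produced: the person's point cloud is rotated about the vertical axis and rasterized orthographically into a depth image. The resolution, number of views, and orthographic projection are illustrative assumptions rather than our exact rendering setup.

```python
# A minimal sketch, assuming the person's point cloud is already segmented and the
# z-axis points up: render orthographic depth maps from several virtual viewpoints.
import numpy as np

def render_depth_views(points: np.ndarray, n_views: int = 4, res: int = 128) -> np.ndarray:
    """Return (n_views, res, res) depth maps of a centred person point cloud."""
    pts = points - points.mean(axis=0)                   # centre the person
    views = []
    for k in range(n_views):
        theta = 2 * np.pi * k / n_views                  # rotate about the vertical axis
        rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                        [np.sin(theta),  np.cos(theta), 0],
                        [0, 0, 1]])
        p = pts @ rot.T
        # Project onto the y-z plane; the x coordinate becomes the depth value.
        u = np.clip(((p[:, 1] - p[:, 1].min()) / (np.ptp(p[:, 1]) + 1e-6) * (res - 1)).astype(int), 0, res - 1)
        v = np.clip(((p[:, 2] - p[:, 2].min()) / (np.ptp(p[:, 2]) + 1e-6) * (res - 1)).astype(int), 0, res - 1)
        depth = np.full((res, res), np.inf)
        np.minimum.at(depth, (res - 1 - v, u), p[:, 0])  # keep the closest point per pixel
        depth[np.isinf(depth)] = 0.0                     # empty pixels become background
        views.append(depth)
    return np.stack(views)

depth_maps = render_depth_views(np.random.rand(5000, 3))
print(depth_maps.shape)  # (4, 128, 128)
```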
Once we have the input sequences, we use a modality-agnostic encoding strategy. A lightweight ResNet-9 encoder individually processes each frame in a sequence. The features are then aggregated over time and processed to generate a final probe embedding for each view. During training, the model learns to pull embeddings of the same individual closer together in the latent space while pushing those of different individuals apart. During inference, we use a simple majority vote across all views to determine the person's identity.
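The PyTorch sketch below captures the shape of this idea. The small CNN stands in for the ResNet-9 backbone, temporal average pooling stands in for the aggregation step, and a standard triplet margin loss is used as one common way to pull same-identity embeddings together; none of these are the exact training recipe from the paper.

```python
# A minimal sketch of per-frame encoding, temporal aggregation, and metric learning.
import torch
import torch.nn as nn

class SequenceEmbedder(nn.Module):
    def __init__(self, in_channels: int = 1, dim: int = 128):
        super().__init__()
        self.frame_encoder = nn.Sequential(               # per-frame feature extractor
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, time, channels, H, W) -> one embedding per sequence
        b, t, c, h, w = seq.shape
        feats = self.frame_encoder(seq.view(b * t, c, h, w)).view(b, t, -1)
        emb = feats.mean(dim=1)                            # temporal average pooling
        return nn.functional.normalize(emb, dim=-1)

model = SequenceEmbedder()
triplet = nn.TripletMarginLoss(margin=0.3)
anchor = model(torch.randn(8, 10, 1, 128, 128))            # sequences of depth maps
positive = model(torch.randn(8, 10, 1, 128, 128))           # same person as the anchor
negative = model(torch.randn(8, 10, 1, 128, 128))           # different person
loss = triplet(anchor, positive, negative)                   # pull same IDs together, push others apart
loss.backward()
```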
Results
In our quantitative results, all approaches perform well on the simulated 4D-OR dataset, which contains artificial visual cues. However, their effectiveness diverges dramatically on the more authentic OR_REID_13 dataset, where our point cloud-based method outperforms its RGB counterpart by a 12% margin in accuracy. This trend is even more pronounced in cross-dataset evaluations.
We also visualize the learned feature spaces. The visualization on the left reveals that the RGB model tends to cluster individuals by their surgical attire and role, grouping people with similar scrubs together even if they are different individuals. In contrast, the point cloud model structures the latent space around more meaningful physical characteristics such as stature ("Tall" vs. "Short").
Downstream Analysis of OR Workflows through 3D Activity Imprints
A key benefit of our staff-centric model is the ability to analyze surgical workflows at the individual level. We propose 3D activity imprints, a visualization technique that plots a person's movement and spatial occupancy as a heatmap overlaid on a 3D representation of the OR. The videos below show imprints for two different head surgeons performing the same procedure. Notice how they develop distinct positional preferences, one consistently favoring the patient's left side and the other using both sides.
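At its core, such an imprint can be thought of as a time-weighted occupancy histogram over the OR floor. The sketch below shows this idea in its simplest 2D form; the room size, grid resolution, and synthetic trajectory are assumed values for illustration, not our rendering pipeline.

```python
# A minimal sketch of an activity imprint: accumulate how long a person occupies
# each floor location during a surgery as a normalised 2D histogram, which can then
# be overlaid on a 3D OR model as a heatmap.
import numpy as np

def activity_imprint(positions: np.ndarray, room_size=(8.0, 6.0), cell=0.1) -> np.ndarray:
    """positions: (T, 2) floor coordinates in metres, one per tracked frame."""
    nx, ny = int(room_size[0] / cell), int(room_size[1] / cell)
    imprint, _, _ = np.histogram2d(
        positions[:, 0], positions[:, 1],
        bins=[nx, ny], range=[[0, room_size[0]], [0, room_size[1]]])
    return imprint / max(imprint.sum(), 1)               # normalise to occupancy fractions

# Example: a synthetic random-walk trajectory standing in for one staff member's track.
track = np.cumsum(np.random.randn(10000, 2) * 0.01, axis=0) + np.array([4.0, 3.0])
print(activity_imprint(track).shape)
```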
We can also aggregate these imprints to analyze the dynamics of an entire surgical team. The heatmaps below illustrate two different team constellations and their collective movement patterns.
Towards Personalized Intelligent Operating Rooms Through Robust Tracking (TrackOR)
While our geometric representation provides a strong biometric signature, a complete tracking system must handle the dynamic nature of the OR, where staff frequently leave and re-enter the room for extended periods. Traditional multi-object trackers often fail in these scenarios: they can associate individuals from one frame to the next but lose their identity permanently after a prolonged absence. This confines their utility to short, uninterrupted video segments and makes true longitudinal analysis of a full surgery impossible.
To solve this, we introduce TrackOR, a framework for long-term multi-person tracking and re-identification. TrackOR integrates our 3D geometric signature into a complete online tracking pipeline, allowing the system to handle temporary occlusions and to correctly re-identify staff members who return to the OR after a long absence. The result is the ability to reconstruct complete, persistent, and temporally aware trajectories for each staff member, enabling the staff-centric analyses required for personalized intelligent systems in the OR.
Methodology
TL;DR:
TrackOR is a "tracking-by-detection" framework. Its core is a real-time online tracker that achieves SOTA performance. For downstream applications requiring fully reconstructed paths, the framework also includes an optional offline recovery process.
The online tracking stage operates in real-time during the surgery. For each frame, the system first detects all individuals in the scene as 3D human poses using multi-view RGB images. We then extract our robust 3D geometric ReID feature for each detection from the person's corresponding point cloud. To associate people from one frame to the next, we compute a cost based on their spatial proximity and the similarity of their geometric signatures.
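The sketch below illustrates one standard way such an association step can be realized, assuming each track and detection carries a 3D position and an L2-normalised ReID embedding, and using the Hungarian algorithm for assignment. The weighting between the two cost terms and the gating threshold are illustrative hyper-parameters, not the paper's exact values.

```python
# A minimal sketch of frame-to-frame association with a combined spatial + ReID cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_pos, track_emb, det_pos, det_emb, w_geom=0.5, max_cost=1.0):
    """Return (track_idx, det_idx) pairs matched by a combined cost."""
    dist = np.linalg.norm(track_pos[:, None, :] - det_pos[None, :, :], axis=-1)
    dist = dist / (dist.max() + 1e-6)                       # spatial proximity term
    sim = track_emb @ det_emb.T                             # cosine similarity of ReID features
    cost = w_geom * dist + (1 - w_geom) * (1 - sim)         # low cost = likely the same person
    rows, cols = linear_sum_assignment(cost)                # optimal one-to-one assignment
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]

# Example with random placeholders for three tracks and three detections.
matches = associate(np.random.rand(3, 3), np.random.rand(3, 32),
                    np.random.rand(3, 3), np.random.rand(3, 32))
print(matches)
```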
For downstream tasks, we can perform an offline global trajectory recovery step that corrects any errors or fragmentation from the online stage. We take all the tracklets and use our ReID model to assign a definitive identity to each one. Finally, we group all tracklets with the same assigned identity to reconstruct a single, complete global trajectory for each person from the start to the end of the surgery.
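A simple way to picture this recovery step is as majority voting over per-frame ReID predictions followed by a merge of same-identity tracklets, as in the sketch below. Here `predict_identity` is a hypothetical placeholder for the trained ReID classifier, and the toy observations stand in for real per-frame detections.

```python
# A minimal sketch of offline global trajectory recovery from tracklets.
from collections import Counter, defaultdict

def recover_global_trajectories(tracklets, predict_identity):
    """tracklets: list of lists of per-frame observations (each with a timestamp 't')."""
    trajectories = defaultdict(list)
    for tracklet in tracklets:
        votes = Counter(predict_identity(obs) for obs in tracklet)
        identity = votes.most_common(1)[0][0]              # majority identity for the tracklet
        trajectories[identity].extend(tracklet)
    # Order each person's merged observations in time to get one complete trajectory.
    return {pid: sorted(obs, key=lambda o: o["t"]) for pid, obs in trajectories.items()}

# Example with toy observations where the "ReID model" simply reads a label field.
toy = [[{"t": 0, "id": "A"}, {"t": 1, "id": "A"}], [{"t": 5, "id": "A"}], [{"t": 2, "id": "B"}]]
print(recover_global_trajectories(toy, lambda o: o["id"]))
```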
Results
We benchmark our framework against a comprehensive set of modern 2D and 3D trackers. The quantitative results show that TrackOR achieves the highest overall tracking performance, driven primarily by its association accuracy.
The qualitative results show a common failure mode for traditional trackers. In the sequence below, a state-of-the-art 2D tracker loses track of a person during an occlusion and swaps their identity with a visually similar person. In contrast, TrackOR successfully maintains the correct identities throughout the sequence.
Downstream Analysis of OR Workflows through Temporal Pathway Imprints
The figure below shows two temporal pathway imprints from the same robot technician across two different surgeries. In Surgery 2, the imprints capture the non-sterile technician's pathway coming into close proximity to the sterile patient table, a potential safety concern. Ultimately, these imprints demonstrate the potential to move towards a data-driven science of the OR, enabling automated workflow analysis, objective safety monitoring, and personalized feedback for the entire surgical team.
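The sketch below shows the kind of automated safety check such pathways make possible: flagging frames where a non-sterile staff member's trajectory comes within a chosen distance of the sterile table footprint. The table corners, the distance threshold, and the synthetic trajectory are assumed values for illustration only.

```python
# A minimal sketch of a proximity check between a tracked trajectory and a sterile zone.
import numpy as np

def proximity_violations(trajectory: np.ndarray, table_min, table_max, threshold=0.5):
    """trajectory: (T, 2) floor positions; table_min/table_max: corners of the sterile zone."""
    closest = np.clip(trajectory, table_min, table_max)    # nearest point inside the zone
    dist = np.linalg.norm(trajectory - closest, axis=1)    # 0 if the person is inside the footprint
    return np.nonzero(dist < threshold)[0]                 # frame indices that come too close

frames = proximity_violations(np.random.rand(1000, 2) * 8.0,
                              table_min=np.array([3.0, 2.0]),
                              table_max=np.array([5.0, 3.0]))
print(f"{len(frames)} frames within the proximity threshold")
```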
BibTeX
@article{wang2025_beyond_role,
  title={Beyond role-based surgical domain modeling: Generalizable re-identification in the operating room},
  author={Tony Danjun Wang and Lennart Bastian and Tobias Czempiel and Christian Heiliger and Nassir Navab},
  journal={Medical Image Analysis},
  volume={105},
  pages={103687},
  year={2025}
}
@inproceedings{wang2025_trackor,
title={TrackOR: Towards Personalized Intelligent Operating Rooms Through Robust Tracking},
author={Tony Danjun Wang and Christian Heiliger and Nassir Navab and Lennart Bastian},
booktitle={Workshop Collaborative Intelligence and Autonomy in Image-guided Surgery (COLAS) at MICCAI},
year={2025},
organization={Springer Nature}
}
@inproceedings{wang2025_mitigating_biases,
title={Mitigating Biases in Surgical Operating Rooms with Geometry},
author={Tony Danjun Wang and Tobias Czempiel and Christian Heiliger and Nassir Navab and Lennart Bastian},
booktitle={Workshop Collaborative Intelligence and Autonomy in Image-guided Surgery (COLAS) at MICCAI},
year={2025}
}