From Role-Based Surgical Domain Modeling
To Personalized OR Intelligence


1Technical University of Munich (TUM)  ·  2Munich Center for Machine Learning (MCML)  ·  3University College London (UCL)  ·  4University Hospital of Munich (LMU)
*Equal Contribution

Beyond role-based surgical domain modeling: Generalizable re-identification in the operating room

Medical Image Analysis '25

TrackOR: Towards Personalized Intelligent Operating Rooms Through Robust Tracking

Full Research Paper @ COLAS Workshop MICCAI'25

Mitigating Biases in Surgical Operating Rooms with Geometry

Extended Abstract @ COLAS Workshop MICCAI'25
Abstract
Surgical data science increasingly relies on role-based domain models. However, these fail to capture the significance of individual team members and their collaborative dynamics on surgical outcomes. We propose a paradigm shift towards staff-centric surgical domain models, which capture individual traits rather than treating staff as interchangeable surgical roles. To achieve this, we address the prerequisite problem of person re-identification in the operating room (OR), which has been hindered by the challenging visual environment, where traditional biometric cues are obscured. To overcome the monotonous texture appearance of standardized attire, we introduce a novel approach that leverages 3D shape and articulated motion cues to produce robust, invariant biometric signatures for personnel re-identification.

Beyond Role-Based Surgical Domain Modeling: A Staff-Centric Approach

TL;DR:

We challenge the traditional, role-based view of the operating room and propose a "staff-centric" model that recognizes the unique impact of each individual on surgical dynamics.

Surgical data science has aimed to optimize operating room (OR) workflows by analyzing the roles of the surgical staff. However, this approach treats team members as interchangeable parts, failing to capture the nuanced differences between them.

Given that factors like team familiarity and individual habits significantly affect surgical outcomes, from operative times to complication rates, we argue that this role-based view is not enough. We propose a fundamental shift towards a staff-centric model where each person is recognized as a unique individual, not simply an abstract role.

To illustrate the limitations of the role-based model, we show two different teams performing the same surgery with the same roles. Notice how their coordination and use of the space differ, highlighting how much team dynamics can vary even when roles remain the same.

Team 1
Team 2
Comparison: Team 1 vs Team 2 performing the same surgery with identical roles.

Challenges in Surgical Operating Rooms

The surgical OR is a challenging environment compared to everyday settings, where individuals are easily distinguished by their clothing and faces. Team members wear standardized smocks and gowns to maintain sterility, and this visual homogeneity causes traditional appearance-based identification methods to fail.

Visual Challenges
Standardized attire and protective gear erase the features traditional models rely on to differentiate people.

Generalizable Re-Identification with Geometry

Deep neural networks have a strong bias towards texture. In the OR, these networks learn spurious correlations like street shoes or distinct eyewear visible beneath gowns. When these artifacts vanish in realistic settings, the models fail.

Activation Maps
Saliency maps showing how models fixate on simulation artifacts rather than robust biometric features.

We shift our focus to geometry and articulated motion. Our approach encodes personnel as sequences of 3D point clouds to disentangle identity-relevant shape from appearance confounders.

While surgeons are visually ambiguous in RGB, their distinct heights and shapes are measurable in point clouds.
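A stature cue of this kind can be extracted directly from a segmented point cloud. The following is a minimal, illustrative sketch (the function name and percentile choices are our assumptions, not the paper's code): it estimates a person's height as the vertical span of their points, using percentiles to stay robust to sensor outliers.

```python
import numpy as np

# Hypothetical stand-in for a segmented person point cloud:
# (N, 3) array of (x, y, z) coordinates in meters, z pointing up.
rng = np.random.default_rng(0)
points = rng.uniform([0.0, 0.0, 0.0], [0.5, 0.5, 1.8], size=(5000, 3))

def estimate_height(points: np.ndarray) -> float:
    """Robust height estimate: span between low/high z percentiles,
    which is less sensitive to stray points than min/max."""
    z = points[:, 2]
    return float(np.percentile(z, 99.5) - np.percentile(z, 0.5))

print(f"estimated height: {estimate_height(points):.2f} m")
```

Unlike RGB appearance, such a measurement is unaffected by gowns or masks, which is what makes geometric cues attractive in the OR.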

Methodology & Results

We first segment individuals from fused 3D point clouds, then render multi-view 2D depth maps to feed into a ResNet-9 encoder. Features are aggregated over time to generate identity embeddings.
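The steps above can be sketched in code. This is a simplified, assumed rendition (function names, view count, and the placeholder encoder are ours; the paper uses a ResNet-9): project a person's point cloud into multi-view orthographic depth maps, encode each view, and average features over time into one identity embedding.

```python
import numpy as np

def render_depth_map(points, view_angle, size=64):
    """Orthographic depth map of a point cloud rotated about the z (up) axis."""
    c, s = np.cos(view_angle), np.sin(view_angle)
    rot = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    p = points @ rot.T
    depth = np.zeros((size, size))
    # Map y (horizontal) and z (vertical) coordinates to pixel indices.
    u = np.clip(((p[:, 1] - p[:, 1].min()) / (np.ptp(p[:, 1]) + 1e-9)
                 * (size - 1)).astype(int), 0, size - 1)
    v = np.clip(((p[:, 2] - p[:, 2].min()) / (np.ptp(p[:, 2]) + 1e-9)
                 * (size - 1)).astype(int), 0, size - 1)
    # Keep the maximum depth value per pixel.
    np.maximum.at(depth, (v, u), p[:, 0] - p[:, 0].min())
    return depth

def encode(depth_map):
    """Placeholder for the ResNet-9 encoder: a flat downsampled feature."""
    return depth_map[::8, ::8].ravel()

def identity_embedding(frames, n_views=4):
    """Aggregate per-view features over all frames of a tracked person."""
    feats = [encode(render_depth_map(pts, k * np.pi / n_views))
             for pts in frames for k in range(n_views)]
    emb = np.mean(feats, axis=0)
    return emb / (np.linalg.norm(emb) + 1e-9)
```

Averaging over frames and views is one simple temporal-aggregation choice; it yields a single unit-norm vector per person that can be compared by cosine similarity.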

Pipeline

Our point cloud method outperforms RGB counterparts by a 12% margin in accuracy on authentic clinical data. More importantly, it demonstrates superior generalization across different clinical environments.

Results
Latent Space
t-SNE visualization: Point cloud models structure the latent space by stature rather than attire.

Robust Tracking (TrackOR)

Maintaining persistent identities is difficult when staff frequently leave and re-enter. TrackOR integrates our geometric signature into an online tracking pipeline to handle temporary absences and occlusions.
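The recovery mechanism can be illustrated with a small gallery-matching sketch (class and parameter names are ours, not the TrackOR implementation): keep one embedding per known identity, match each unassociated detection to its nearest gallery entry by cosine similarity, and only create a new identity when no match clears a threshold. A person who left the room is thus re-assigned their old ID on return.

```python
import numpy as np

class ReIDGallery:
    """Illustrative identity gallery for re-ID-based track recovery."""

    def __init__(self, match_threshold=0.7, momentum=0.9):
        self.embeddings = {}              # track_id -> unit-norm embedding
        self.match_threshold = match_threshold
        self.momentum = momentum
        self._next_id = 0

    def match_or_create(self, embedding):
        embedding = embedding / (np.linalg.norm(embedding) + 1e-9)
        if self.embeddings:
            ids = list(self.embeddings)
            sims = np.array([self.embeddings[i] @ embedding for i in ids])
            best = int(np.argmax(sims))
            if sims[best] >= self.match_threshold:
                tid = ids[best]
                # Exponential moving average keeps the signature current.
                mixed = (self.momentum * self.embeddings[tid]
                         + (1 - self.momentum) * embedding)
                self.embeddings[tid] = mixed / np.linalg.norm(mixed)
                return tid
        tid = self._next_id                # no match: start a new identity
        self._next_id += 1
        self.embeddings[tid] = embedding
        return tid
```

The threshold trades off identity switches against missed re-identifications; because the geometric signature is attire-invariant, the same gallery entry remains valid across gowning changes and long absences.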

TrackOR Overview

3D Activity Imprints

We propose 3D activity imprints to visualize how individuals utilize the OR space. These imprints provide insight into the coordination of surgical teams and usage patterns for a given surgery.

Head Surgeon 1
Head Surgeon 2
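One simple way to realize such an imprint (an assumed representation, not necessarily the paper's exact one) is an occupancy histogram over the OR floor plan, accumulated from a person's tracked floor positions:

```python
import numpy as np

def activity_imprint(positions, room_size=(6.0, 6.0), bins=32):
    """Normalized dwell-frequency map of one person over the OR floor.

    positions: (T, 2) array of tracked (x, y) floor locations in meters.
    room_size: assumed OR floor extent in meters.
    """
    hist, _, _ = np.histogram2d(
        positions[:, 0], positions[:, 1],
        bins=bins, range=[[0, room_size[0]], [0, room_size[1]]])
    total = hist.sum()
    return hist / total if total > 0 else hist
```

Rendered as a heatmap, such imprints make differences in spatial habits between two head surgeons directly visible.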

Temporal Pathways

Building on persistent trajectories, we develop Temporal Pathways: dynamic visualizations that reveal the specific paths staff members take through the OR over time.

Pathways
Temporal pathway imprints revealing potential safety concerns in sterile field proximity.
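A pathway of this kind also supports simple safety analyses. The sketch below is hypothetical (the sterile-field zone and threshold are illustrative assumptions): it flags the timestamps at which a trajectory comes within a given radius of a sterile region.

```python
import numpy as np

def sterile_field_violations(trajectory, field_center, radius=1.0):
    """trajectory: (T, 3) array of (t, x, y) samples in seconds/meters.
    Returns the timestamps at which the person is within `radius`
    meters of the assumed sterile field center."""
    dists = np.linalg.norm(trajectory[:, 1:] - np.asarray(field_center), axis=1)
    return trajectory[dists < radius, 0]
```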

BibTeX

@article{wang2025_beyond_role,
    title   = {Beyond role-based surgical domain modeling: Generalizable re-identification in the operating room},
    journal = {Medical Image Analysis},
    volume  = {105},
    pages   = {103687},
    year    = {2025},
    author  = {Tony Danjun Wang and Lennart Bastian and Tobias Czempiel and Christian Heiliger and Nassir Navab}
}

@inproceedings{wang2025_trackor,
    title     = {TrackOR: Towards Personalized Intelligent Operating Rooms Through Robust Tracking},
    author    = {Tony Danjun Wang and Christian Heiliger and Nassir Navab and Lennart Bastian},
    booktitle = {Workshop Collaborative Intelligence and Autonomy in Image-guided Surgery (COLAS) at MICCAI},
    year      = {2025},
    publisher = {Springer Nature}
}

@inproceedings{wang2025_mitigating_biases,
    title     = {Mitigating Biases in Surgical Operating Rooms with Geometry},
    author    = {Tony Danjun Wang and Tobias Czempiel and Christian Heiliger and Nassir Navab and Lennart Bastian},
    booktitle = {Workshop Collaborative Intelligence and Autonomy in Image-guided Surgery (COLAS) at MICCAI},
    year      = {2025}
}