Abstract.
Marker-less motion capture from a single RGB camera has improved dramatically, but the output of even strong monocular pose estimators is still too noisy for direct use in production. Joint positions jitter frame-to-frame, foot contacts skate, and accumulated rotation errors compound through the kinematic chain.
Filmstorm's mocap pipeline addresses this with a dedicated refinement stage. We train a small, fast network purely on paired (noisy, clean) motion data generated from a closed-loop synthetic pipeline that we control end-to-end. The result is a refiner that drops in behind any monocular pose estimator and brings its output to a working production bar, without retraining the upstream model, without changing the camera setup, and without manual cleanup.
This report describes the data pipeline, the network architecture, the composite loss used to make rotation predictions semantically meaningful, and the measured improvement across categories from idle to high-velocity dynamic motion.
Why training mocap from real data alone hits a ceiling.
The largest publicly available mocap datasets contain tens of thousands of clips. That sounds like a lot. It isn't.
Long-tail starvation.
Real-world mocap is acquired in controlled studios with marker suits, multi-camera rigs, and consenting subjects. Most of it is recorded by a small number of facilities with overlapping motion vocabularies: walking, running, simple gestures, dance, locomotion under everyday conditions. The motion categories that production actually demands (combat, stunts, falls, contact sports, partial-occlusion interactions, full-range acrobatic motion) are either absent or represented by only a handful of clips each.
A model trained on this distribution learns idle and locomotion to within a couple of centimetres of ground truth and then collapses outside the easy regions. The gap is structural, not a tuning problem.
Inter-frame jitter.
Single-camera pose estimators are typically trained on independent frames or short windows. Even when the overall pose is approximately correct, the frame-to-frame velocity profile of each joint is rarely smooth. The artefacts read on screen as foot skating, hand flicker, and twitchy spinal motion: anything that breaks the perceived inertia of the captured body.
Yaw ambiguity.
From a single camera, the rotation of the body about the gravity axis cannot be uniquely recovered. Rotating the subject and the camera together about the gravity axis leaves the image unchanged, so without known camera extrinsics the estimator can recover yaw only relative to the camera, not in the world frame. The estimator has to guess, and downstream consumers inherit the guess.
Fixing these failure modes requires paired (noisy single-view input, clean ground-truth motion) supervision at scale, covering every regime where the estimator fails. No public dataset provides it, and collecting it physically is impractical. So we generate it.
A closed-loop pipeline that generates its own supervision.
We render photoreal video from known motion, run the upstream pose estimator across the rendered video to produce its (deliberately noisy) prediction, and pair every prediction with the underlying ground-truth motion. Every clip is a free supervision signal.
Sample motion
Draw a clip from a curated motion library covering idle, locomotion, dynamic and edge-case categories. Subject body shape is sampled separately.
Render
Photoreal render of the motion onto a parametric body, with randomised camera azimuth, distance, focal length, and full-spectrum HDRI lighting from a 30-environment bank.
Estimate
Run our monocular pose estimator on the rendered video. Output is a noisy joint-rotation stream with the same failure modes seen on real input.
Pair
Store the estimator's noisy output alongside the original ground-truth motion. Both are expressed in the same skeleton convention, frame-aligned to the source.
Train
Train the refinement network on the paired stream. The supervision signal is the difference between noisy and clean: exactly what we want the network to learn to remove.
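The sample, render, estimate and pair steps above can be sketched as a single generator. This is a minimal illustration, not the production pipeline: `generate_pairs`, `TrainingPair`, the `Motion` type and the `render` and `estimate` callables are all hypothetical names standing in for the renderer and the upstream estimator.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Sequence

# Hypothetical motion type: one flat list of pose parameters per frame.
Motion = List[List[float]]


@dataclass
class TrainingPair:
    noisy: Motion  # estimator output on the rendered video
    clean: Motion  # ground-truth motion that drove the render


def generate_pairs(
    clips: Iterable[Motion],
    render: Callable[[Motion], Sequence[object]],
    estimate: Callable[[Sequence[object]], Motion],
) -> Iterator[TrainingPair]:
    """Yield one (noisy, clean) pair per source clip."""
    for clean in clips:
        frames = render(clean)    # known motion -> video frames
        noisy = estimate(frames)  # video frames -> noisy motion
        # Pairs are only usable as supervision if they stay frame-aligned.
        if len(noisy) != len(clean):
            raise ValueError("estimator output is not frame-aligned")
        yield TrainingPair(noisy=noisy, clean=clean)
```

Because the ground truth is the input to the render rather than something measured afterwards, every clip that passes the alignment check yields a usable pair.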
Why this works.
The estimator's failure modes on synthetic video closely match its failure modes on real video. Jitter, foot drift, occlusion bleed and yaw ambiguity all appear in the synthetic pipeline, because the estimator does not know it is looking at a render. The synthetic data is therefore directly useful as training material for the refiner downstream.
What the scale looks like.
A per-joint, per-frame transformer with residual heads.
The refiner takes the noisy rotation stream and a root translation track as input, and predicts both as a residual on top of the noisy signal. Predicting deltas, not absolutes, lets the network defer to a strong-enough upstream estimate and only correct where it disagrees.
Representation choices.
- 6D rotation, not axis-angle, internally. The 6D parameterisation is continuous, has no antipodal ambiguity, and is well-suited to L1 regression; axis-angle's wraparound makes naïve regression unstable.
- Per-joint, per-frame tokens. A clip of T frames and J joints becomes a sequence of T·J tokens. Self-attention is unrestricted across both time and joints, so the network can learn long-range temporal smoothness and kinematic-chain constraints in the same layer stack.
- Translation as an additional token, with its own learned query embedding, attended jointly with the rotation tokens. The root translation isn't an afterthought; it's where most of the global motion signal lives.
- Residual prediction. The output heads emit a delta that is added to the original noisy signal. If the input is already correct, the network only needs to learn the zero map. If the input is wrong, the network only has to predict the correction, not the whole pose.
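The 6D and residual choices above can be made concrete with a short sketch. It assumes the standard Gram-Schmidt mapping from a 6D vector to a rotation matrix and assumes the residual is added in 6D space before orthonormalisation; the report does not state either detail, and the function names are illustrative.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def _normalise(v: Sequence[float]) -> List[float]:
    # Assumes v is non-zero; a production version would guard this.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _cross(a: Sequence[float], b: Sequence[float]) -> List[float]:
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rot6d_to_matrix(r6: Sequence[float]) -> Matrix:
    """Map a 6D rotation (two stacked 3-vectors) to a row-major 3x3 matrix."""
    a1, a2 = r6[:3], r6[3:]
    b1 = _normalise(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    # Remove the component of a2 along b1, then normalise (Gram-Schmidt).
    b2 = _normalise([y - dot * x for x, y in zip(b1, a2)])
    b3 = _cross(b1, b2)
    # b1, b2, b3 are the columns of the rotation matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def apply_residual(noisy6: Sequence[float], delta6: Sequence[float]) -> Matrix:
    """Add the predicted delta to the noisy 6D rotation, then orthonormalise."""
    return rot6d_to_matrix([n + d for n, d in zip(noisy6, delta6)])
```

A zero delta leaves the upstream estimate untouched, which is the "zero map" case described above, and any non-degenerate 6D vector still maps to a valid rotation, so the regression target has no discontinuity to work around.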
The small variant runs comfortably alongside the upstream estimator on the same GPU at inference time. The scaled variant is used when we have headroom and want every last point of accuracy.
Five components, each fixing a specific failure mode.
Rotation L1 alone is necessary but not sufficient: small angular errors compound along the kinematic chain and produce large position errors at the extremities. We supervise both the rotation prediction directly and its forward-kinematics consequence, then add smoothness terms in position space.
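A toy forward-kinematics example shows how a rotation error grows along the chain. The sketch uses a planar two-segment arm with made-up segment lengths (0.30 m and 0.25 m) and a 3-degree shoulder error; none of these numbers come from our skeleton or results.

```python
import math
from typing import List, Sequence, Tuple


def planar_fk(lengths: Sequence[float], angles: Sequence[float]) -> List[Tuple[float, float]]:
    """Joint positions of a planar chain; each angle is relative to its parent."""
    x = y = heading = 0.0
    positions = []
    for length, angle in zip(lengths, angles):
        heading += angle  # child inherits every rotation above it
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        positions.append((x, y))
    return positions


def distance(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Hypothetical arm: upper arm 0.30 m, forearm 0.25 m, held straight.
lengths = [0.30, 0.25]
clean = planar_fk(lengths, [0.0, 0.0])
noisy = planar_fk(lengths, [math.radians(3.0), 0.0])  # shoulder error only

elbow_error = distance(clean[0], noisy[0])
wrist_error = distance(clean[1], noisy[1])
```

With these illustrative numbers the same 3-degree error moves the elbow by roughly 1.6 cm and the wrist by roughly 2.9 cm, which is why a loss in rotation space alone under-weights errors near the root of the chain.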
- L1 on 6D rotation: the core regression target. Directly penalises disagreement between the predicted joint rotations and the clean motion.
- L1 on root translation: keeps the global trajectory anchored. Without it, the network is free to slide the body arbitrarily; with it, foot-on-floor contact reads cleanly.
- L1 on forward-kinematics joint positions: runs FK on both predicted and target rotations and penalises the difference in 3D joint position. A small shoulder rotation error becomes a large wrist position error; this term makes the network pay for the consequences in metres, not radians.
- L1 on joint velocity: the first difference of position. Penalises jitter, and is the single biggest lever on visual quality.
- L1 on joint acceleration: the second difference of position. Smooths the velocity profile itself, removing high-frequency jerk that survives velocity regularisation.
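The five terms can be combined as in the sketch below. It assumes the FK joint positions have already been computed and are expressed in metres, and it uses placeholder weights of 1.0 for every term; the actual default weights are not reproduced here, and the dictionary keys are illustrative.

```python
from typing import Dict, List

Frames = List[List[float]]  # [T][D]: one flat feature vector per frame

# Placeholder weights only; these are not the report's defaults.
PLACEHOLDER_WEIGHTS = {"rot": 1.0, "trans": 1.0, "fk": 1.0, "vel": 1.0, "acc": 1.0}


def l1(a: Frames, b: Frames) -> float:
    """Mean absolute difference over all frames and features."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count


def diff(frames: Frames) -> Frames:
    """First difference along time; the result has one fewer frame."""
    return [[c - p for p, c in zip(prev, cur)] for prev, cur in zip(frames, frames[1:])]


def composite_loss(
    pred: Dict[str, Frames],
    target: Dict[str, Frames],
    weights: Dict[str, float] = PLACEHOLDER_WEIGHTS,
) -> Dict[str, float]:
    """Return each term and the weighted total. Clips need at least 3 frames."""
    terms = {
        "rot": l1(pred["rot6d"], target["rot6d"]),   # 6D rotation
        "trans": l1(pred["trans"], target["trans"]), # root translation
        "fk": l1(pred["pos"], target["pos"]),        # FK joint positions
        "vel": l1(diff(pred["pos"]), diff(target["pos"])),
        "acc": l1(diff(diff(pred["pos"])), diff(diff(target["pos"]))),
    }
    terms["total"] = sum(weights[name] * terms[name] for name in weights)
    return terms
```

A single-frame position spike illustrates why the smoothness terms matter: the spike is averaged away in the position term but shows up at full strength in the first and second differences.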
Default weighting.
Position terms are scaled from centimetres to metres so they share an order of magnitude with the rotation L1. If we leave them in cm, the position terms dominate by two orders of magnitude and the rotation L1 becomes a vestigial regulariser.
What each term measurably buys.
Ablating each term in turn isolates its contribution. The FK joint term carries the largest single accuracy gain, about half the total error reduction at convergence. The velocity and acceleration terms produce smaller absolute MPJPE gains but the largest qualitative improvement in playback, which is what production cares about.
Setup and tricks worth keeping.
Mixed precision, modest hardware.
The model trains in bf16 AMP on a single A100 80GB. At batch size 32 with 96-frame clips, each epoch over the full corpus takes 10–20 minutes. Eighty epochs is approximately 15–25 GPU-hours; on commodity cloud at sub-$1.50/hour that's a sub-fifty-dollar full training run.
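The budget can be checked with back-of-envelope arithmetic from the figures quoted above (80 epochs, 10 to 20 minutes per epoch, at most $1.50 per GPU-hour). The helper name is illustrative.

```python
from typing import Tuple


def training_cost(epochs: int, minutes_per_epoch: float, dollars_per_gpu_hour: float) -> Tuple[float, float]:
    """Return (GPU-hours, dollars) for a single-GPU training run."""
    gpu_hours = epochs * minutes_per_epoch / 60.0
    return gpu_hours, gpu_hours * dollars_per_gpu_hour


fast_hours, fast_cost = training_cost(80, 10, 1.50)  # best case
slow_hours, slow_cost = training_cost(80, 20, 1.50)  # worst case
```

The two ends of the range work out to roughly 13 and 27 GPU-hours, or $20 and $40 at the $1.50 ceiling, so the run stays under fifty dollars even at the slow end.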
Optimiser.
AdamW with a cosine schedule and a 1,000-step linear warm-up. Weight decay 0.01, gradient clipping at norm 1.0. No exotic optimisers needed at this scale.
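The schedule and clipping rule can be written out framework-free, as a sketch. The peak learning rate, the total step count and the floor `min_lr` are not stated in this report and appear here only as caller-supplied parameters; the function names are illustrative.

```python
import math
from typing import List, Sequence


def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_steps: int = 1000, min_lr: float = 0.0) -> float:
    """Linear warm-up to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr past the final step
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: Sequence[float], max_norm: float = 1.0) -> List[float]:
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

In a real training loop both pieces would normally come from the framework's optimiser and scheduler utilities; the explicit form is only meant to show the shape of the schedule.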
Yaw canonicalisation.
Because monocular pose estimators cannot recover absolute yaw, a noisy clip and its clean counterpart can disagree on global rotation in a way that the network cannot learn to fix from the visible signal. We resolve this at data load: each of the clean and noisy streams is rotated about the gravity axis, once per clip, so that frame zero of both lands at identity yaw. The network learns to refine motion, not to invent missing camera information.
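A minimal sketch of the canonicalisation step follows. It assumes a Y-up world with +Z as the character's forward axis, root orientations stored as row-major 3x3 matrices, and non-root joints stored as parent-relative rotations (so they need no change); these conventions and the function names are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def rot_y(angle: float) -> Matrix:
    """Rotation about the gravity axis (assumed to be +Y)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(a: Matrix, v: Sequence[float]) -> List[float]:
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def yaw_of(root_rot: Matrix) -> float:
    """Heading of the rotated forward axis, projected on the ground plane.
    Ill-defined if the forward axis points straight up or down."""
    return math.atan2(root_rot[0][2], root_rot[2][2])


def canonicalise_yaw(root_rots: Sequence[Matrix],
                     root_trans: Sequence[Sequence[float]]) -> Tuple[List[Matrix], List[List[float]]]:
    """Rotate a whole clip about gravity so that frame zero has zero yaw."""
    undo = rot_y(-yaw_of(root_rots[0]))
    return ([matmul(undo, r) for r in root_rots],
            [matvec(undo, t) for t in root_trans])
```

Calling `canonicalise_yaw` once on the clean stream and once on the noisy stream removes their disagreement about initial heading while leaving the relative yaw between frames untouched.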
Subject-level split.
Training and held-out test splits are drawn at the level of subjects, not clips. All clips of the same subject always land in the same split. Without this, the model can overfit to subject morphology and report misleadingly low test error.
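One simple way to guarantee a subject-level split is to derive the split from a hash of the subject identifier, as sketched below. The 10% test fraction, the identifiers and the function names are assumptions for illustration, not the values we use.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def split_of(subject_id: str, test_fraction: float = 0.1) -> str:
    """Deterministically assign a subject to 'train' or 'test'."""
    digest = hashlib.sha256(subject_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_clips(clips: Iterable[Tuple[str, str]],
                test_fraction: float = 0.1) -> Dict[str, List[str]]:
    """clips is an iterable of (clip_id, subject_id) pairs."""
    out: Dict[str, List[str]] = {"train": [], "test": []}
    for clip_id, subject_id in clips:
        # The split depends only on the subject, never on the clip.
        out[split_of(subject_id, test_fraction)].append(clip_id)
    return out
```

Because the assignment is a pure function of the subject identifier, new clips of an existing subject can be added later without any risk of that subject straddling both splits.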
Mirror augmentation.
Joints come in left/right pairs. A reflected motion is also a valid motion. Random L↔R mirroring at training time effectively doubles the dataset for free, with no risk of label leakage.
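The mirroring step can be sketched as a swap table plus a reflection of each rotation. The sketch assumes the sagittal plane is x = 0, that joint rotations are row-major 3x3 matrices in mirror-symmetric local frames, and a made-up set of joint names; a real skeleton would need its own table.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]

# Hypothetical left/right pairs; unpaired joints (spine, head) map to themselves.
_PAIRS = [("l_shoulder", "r_shoulder"), ("l_hip", "r_hip")]
SWAP = {**{l: r for l, r in _PAIRS}, **{r: l for l, r in _PAIRS}}

_SIGN = (-1.0, 1.0, 1.0)  # reflection M = diag(-1, 1, 1) across the plane x = 0


def mirror_rotation(r: Matrix) -> Matrix:
    """Conjugate by the reflection: R' = M R M."""
    return [[_SIGN[i] * _SIGN[j] * r[i][j] for j in range(3)] for i in range(3)]


def mirror_translation(t: Sequence[float]) -> List[float]:
    return [-t[0], t[1], t[2]]


def mirror_pose(pose: Dict[str, Matrix]) -> Dict[str, Matrix]:
    """Swap left/right joint names and reflect every rotation."""
    return {SWAP.get(name, name): mirror_rotation(rot) for name, rot in pose.items()}
```

The same transform has to be applied to the noisy input and the clean target of a pair, otherwise the augmentation would introduce exactly the kind of disagreement the refiner is trained to remove.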
All hyperparameters above are the defaults used to produce the numbers in the Results section. The configuration file is versioned alongside the model checkpoint so any reported result can be traced back to an exact training recipe.
Measured improvement, by category and over time.
All numbers are on the held-out subject split. Baseline is the raw monocular pose estimator without refinement; refined is the same input passed through our trained refiner. Lower is better.
[Figure: position error by motion category.]
[Figure: temporal stability, shown as joint acceleration error across a 240-frame clip.]
[Figure: loss curves, shown as total composite loss over training epochs.]
On real-camera footage outside the synthetic distribution, the refiner removes acceleration spikes and tightens foot contacts without altering the actor's underlying motion. When over-smoothing is detected (usually on very-high-velocity sequences, where the refiner is asked to trade jerk for accuracy), the acceleration loss weight is lowered and the network is fine-tuned for a small number of epochs.
What we learned by building this.
1. The refiner doesn't replace the estimator. It completes it.
We tried earlier versions that re-estimated pose from video frames. They were larger, slower, and no better. Letting the upstream estimator do what it's good at (coarse pose recovery) and adding a small refinement pass for the failure modes it can't fix is the right factoring of the problem.
2. Forward kinematics in the loss is non-negotiable.
Without the FK joint position term, the network minimises rotation L1 but the resulting limbs flail. Position-space supervision is what makes the rotation predictions semantically meaningful.
3. Synthetic data must look like the estimator's failure mode, not like reality.
Photoreal rendering matters less than producing inputs that the upstream estimator gets wrong in the same way it gets real video wrong. We spent more time tuning camera randomisation and lighting variation, both of which affect estimator behaviour, than chasing skin shading.
4. Yaw canonicalisation is the single largest "trick".
Without it, the network spends capacity trying to learn an impossibility and the loss plateaus high. With it, the loss curve simply continues falling.
5. Mirror augmentation alone roughly doubles effective dataset size.
For the cost of one swap-table per skeleton, training-set diversity doubles. There is no analogue for translation- or time-axis flipping: motion is directional in time, and a time-reversed clip is no longer a valid motion. But L↔R mirroring is essentially free.
6. The refiner is fast enough to be inline.
At 4.8 M parameters in bf16, the small variant runs at well above real-time on the same GPU that hosts the upstream estimator. There is no architectural reason to expose the refinement as a separate stage; it can be folded into the estimator's inference path and shipped as a single drop-in component.
Roadmap.
Per-subject canonicalisation.
The current pipeline uses a single canonical body shape for retargeting between the estimator's output skeleton and the clean motion reference. The natural next step is per-clip subject-shape estimation, so the refiner sees skeletons that share proportions with the rendered body, closing one more gap between training and inference.
Hand and face.
The current refinement target is the body skeleton. The synthetic pipeline already generates hand articulation; head/face refinement is a separate problem with a different loss design, and is on the roadmap as a follow-up release.
Online refinement for live capture.
The current refiner consumes a temporal window and emits the refined window. A causal variant, which refines the latest frame given only past context, would enable true real-time mocap with a small fixed latency. Architectural changes are minimal; the work is in the loss recipe.
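One plausible form of the architectural change is a frame-causal attention mask over the per-joint, per-frame tokens, sketched below. This is an assumption about how a causal variant could be built, not a description of a planned design; it assumes tokens are laid out frame-major (index = frame * J + joint) and ignores the translation token.

```python
from typing import List


def frame_causal_mask(num_frames: int, num_joints: int) -> List[List[bool]]:
    """mask[q][k] is True if query token q may attend to key token k.

    A token may see every joint in its own frame and in earlier frames,
    but nothing from later frames.
    """
    n = num_frames * num_joints
    return [
        [(k // num_joints) <= (q // num_joints) for k in range(n)]
        for q in range(n)
    ]
```

Attention within a frame stays unrestricted, so kinematic-chain reasoning across joints is unchanged; only the look-ahead in time is removed.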
Domain adaptation to in-the-wild footage.
The largest open question is how much the synthetic-trained refiner needs adapting before it generalises to specific production environments such as dimly lit interiors, mirrored studios, and smartphone footage. Initial results suggest little adaptation is needed for the position-space behaviour, but the appearance distribution of the underlying renders does affect estimator behaviour, which the refiner inherits.