Alok Upadhyay | March 2026

TL;DR: We show that the intrinsic Riemannian geometry of pretrained neural network embedding spaces predicts how well face and voice can be matched across modalities — without any cross-modal training. A single metric, Centered Kernel Alignment (CKA), correlates with cross-modal matching error at Spearman rho = -0.87.

Paper: Riemannian Geometry of Multimodal Biometric Embedding Spaces (MathAI 2026, Oral)

Code: github.com/alok-upadhyay/riemannian-biometric-geometry


The Question

Here’s a thought experiment: you have a face recognition model and a speaker recognition model, both pretrained independently. Can you predict — without ever training them together — how well you could match a person’s face to their voice?

It turns out you can. And the answer lives in the geometry of their embedding spaces.

Why This Matters

Cross-modal biometric matching (face-voice, face-gait, etc.) is important for security, forensics, and accessibility. But building cross-modal systems is expensive: you need paired training data, alignment networks, and extensive evaluation.

What if you could look at two pretrained encoders and say, “these two will work well together” — before doing any of that work? That’s exactly what our geometric analysis enables.

The Setup

We took 7 pretrained encoders — 4 face, 3 voice — and extracted embeddings for 1,249 identities from VoxCeleb1:

| Modality | Encoder     | Architecture      | Dim  |
|----------|-------------|-------------------|------|
| Face     | ArcFace     | ResNet-100        | 512  |
| Face     | SigLIP      | ViT-B/16          | 768  |
| Face     | DINOv2      | ViT-B/14          | 768  |
| Face     | CLIP        | ViT-B/16          | 512  |
| Voice    | WavLM       | Transformer-Large | 1024 |
| Voice    | HuBERT      | Transformer-Large | 1024 |
| Voice    | wav2vec 2.0 | Transformer-Large | 1024 |

This gives us 12 face-voice encoder pairs, each with a measurable cross-modal Equal Error Rate (EER).

Measuring Geometry

We characterized each encoder’s embedding manifold using four geometric properties:

1. Intrinsic Dimensionality

Neural embeddings live in high-dimensional spaces (512-1024 dims), but the data itself occupies a much lower-dimensional manifold. We estimated intrinsic dimensionality using Maximum Likelihood Estimation (MLE).
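The standard MLE estimator of this kind is the Levina-Bickel nearest-neighbor estimator. A minimal numpy sketch, assuming that estimator (the function name, the choice k=10, and the inverse-averaging correction are illustrative, not necessarily the paper's exact configuration):

```python
import numpy as np

def mle_intrinsic_dim(X, k=10):
    """Levina-Bickel MLE of intrinsic dimensionality.

    For each point, compares the distance to its k-th nearest neighbour
    against the distances to the closer neighbours; averaging the inverse
    per-point estimates before inverting is a common stabilising correction.
    """
    # pairwise Euclidean distances, self-distances masked out
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    knn = np.sort(D, axis=1)[:, :k]                          # T_1 <= ... <= T_k
    inv_m = np.log(knn[:, -1:] / knn[:, :-1]).mean(axis=1)   # 1 / m_hat_i per point
    return 1.0 / inv_m.mean()
```

On data sampled from a low-dimensional manifold embedded in a higher-dimensional space, the estimate tracks the manifold dimension rather than the ambient one.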

[Figure: Intrinsic dimensionality per encoder] Face encoders have lower intrinsic dimensionality (16-20) than voice encoders (29-39), suggesting face identity occupies a more compact manifold.

2. Local Curvature

We estimated Riemannian curvature via the second fundamental form — fitting local quadratic surfaces to the embedding manifold using kNN neighborhoods. This tells us how “curved” the identity manifold is in each encoder’s space.
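One way to realise this is to estimate a tangent plane by local PCA on the kNN neighborhood, least-squares-fit a quadratic height function over it, and take the norm of the fitted second-order coefficients as a curvature proxy. A sketch under those assumptions (k, the assumed local dimension d, and the helper name are illustrative; in practice d would come from the estimated intrinsic dimensionality, and the paper's exact fitting procedure may differ):

```python
import numpy as np

def local_curvature(X, idx, k=15, d=2):
    """Curvature proxy at point idx: fit a quadratic height function over a
    PCA-estimated tangent plane of the k-NN neighbourhood, and return the
    Frobenius norm of the fitted second-order (second-fundamental-form-like)
    coefficients. Simplified sketch, not the paper's exact estimator."""
    p = X[idx]
    dists = np.linalg.norm(X - p, axis=1)
    nbrs = X[np.argsort(dists)[1:k + 1]] - p        # centred neighbourhood
    # PCA via SVD: first d right-singular vectors span the tangent plane
    _, _, Vt = np.linalg.svd(nbrs, full_matrices=False)
    t = nbrs @ Vt[:d].T                             # tangent coordinates (k, d)
    h = nbrs @ Vt[d:].T                             # normal "heights"
    # design matrix of quadratic monomials t_i * t_j  (i <= j)
    ii, jj = np.triu_indices(d)
    A = t[:, ii] * t[:, jj]                         # (k, d(d+1)/2)
    coef, *_ = np.linalg.lstsq(A, h, rcond=None)    # second-order coefficients
    return np.linalg.norm(coef)
```

For points on a flat plane the proxy is near zero; on a curved surface such as a sphere it is bounded away from zero.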

[Figure: Curvature distributions per encoder] Curvature distributions vary widely across encoders. ArcFace shows the most concentrated (low-variance) curvature, while self-supervised voice encoders exhibit broader distributions.

3. Cluster Topology

We measured the compactness gap: the difference between inter-class and intra-class distances. A larger gap means identities are better separated.
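A direct way to compute this gap, sketched here with cosine distance (the paper's exact metric and aggregation are assumptions):

```python
import numpy as np

def compactness_gap(emb, labels):
    """Mean inter-class cosine distance minus mean intra-class cosine
    distance. Larger positive values = identities better separated.
    Sketch only; the exact distance/aggregation may differ."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dist = 1.0 - emb @ emb.T                       # pairwise cosine distance
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(emb), dtype=bool)       # exclude self-pairs
    intra = dist[same & off_diag].mean()
    inter = dist[~same].mean()
    return inter - intra
```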

4. Cross-Modal Metrics

For each of the 12 face-voice pairs, we computed:

- CKA (Centered Kernel Alignment) — measures structural similarity between kernel matrices
- Gromov-Wasserstein distance — optimal transport between metric spaces
- Spectral gap divergence — difference in graph Laplacian spectra
- ID mismatch — difference in intrinsic dimensionality
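Of these, CKA is by far the cheapest to compute. A minimal sketch of the linear variant, over two embedding matrices whose rows correspond to the same identities (which CKA variant the paper uses is not restated here, so treat this as illustrative):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between embeddings X (n, d1) and Y (n, d2).
    Rows of both matrices must correspond to the same n identities."""
    X = X - X.mean(axis=0)                       # column-centre both spaces
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2    # HSIC-style alignment term
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

A useful property: linear CKA is invariant to orthogonal transforms and isotropic scaling of either space, which is what makes it a comparison of geometry rather than of coordinates.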

The Key Finding

CKA predicts cross-modal matching performance.

[Figure: Geometric metrics vs. cross-modal EER] Four geometric metrics plotted against cross-modal EER. CKA (top-left) shows the strongest correlation: higher CKA = lower EER = better matching.

The numbers:

- CKA vs. EER: Spearman rho = -0.87, p < 0.001
- ID mismatch vs. EER: rho = -0.79, p = 0.002
- GW distance & spectral gap: not significant

A multivariate regression with CKA and ID mismatch achieves leave-one-out cross-validated R-squared = 0.72, meaning we can explain most of the variance in cross-modal matching difficulty from geometry alone.
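Leave-one-out R-squared is straightforward to reproduce given the 12 per-pair (CKA, ID mismatch, EER) triples. A sketch with ordinary least squares (the real per-pair values are in the paper; the function name is illustrative):

```python
import numpy as np

def loocv_r2(X, y):
    """Leave-one-out cross-validated R^2 for an OLS regression.

    X: (n, p) predictors (e.g. CKA and ID mismatch per encoder pair),
    y: (n,) targets (e.g. cross-modal EER)."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        A = np.c_[np.ones(mask.sum()), X[mask]]         # add intercept column
        beta, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        preds[i] = np.r_[1.0, X[i]] @ beta              # predict held-out pair
    ss_res = ((y - preds) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot
```

With only 12 pairs, LOO is the natural choice: every pair serves as a test point once, so the R-squared is not inflated by in-sample fitting.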

The CKA Heatmap

[Figure: CKA similarity heatmap] CKA between all encoder pairs. CLIP-WavLM has the highest cross-modal CKA (0.211) and indeed the lowest EER (24.4%). ArcFace pairs have the lowest CKA and highest EER.

The CKA heatmap reveals something intuitive: CLIP — trained on image-text pairs with contrastive learning — produces face embeddings that are geometrically most compatible with voice embeddings. This makes sense: CLIP’s training objective encourages embedding structure that transfers across modalities.

Why Does This Work?

CKA measures the alignment of inner-product structures between two embedding spaces. In Riemannian terms, the inner product defines the metric tensor — the fundamental object that determines all geometry (distances, angles, curvature). When two spaces have similar CKA, their metric tensors induce similar geometric structure on identity representations.

This means that CCA-based alignment (which we use for cross-modal matching) faces an easier optimization problem: the directions it needs to match already exist in both spaces.

Practical Implications

Encoder selection without paired data. Before building a cross-modal biometric system, compute CKA between candidate encoders on a small calibration set. Pick the pair with highest CKA. This is ~1000x cheaper than training and evaluating a full cross-modal pipeline.
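As a concrete picture of that selection workflow (the encoder names, the calibration-set shape, and the use of plain linear CKA are all illustrative):

```python
import numpy as np
from itertools import product

def linear_cka(X, Y):
    # linear CKA; rows of X and Y are the same calibration identities
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    return np.linalg.norm(X.T @ Y, "fro") ** 2 / (
        np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def select_encoder_pair(face_embs, voice_embs):
    """Pick the (face, voice) encoder names with the highest CKA on a
    small calibration set. Inputs: dicts mapping encoder name to an
    (n, d) embedding matrix, rows aligned by identity. Illustrative."""
    return max(product(face_embs, voice_embs),
               key=lambda fv: linear_cka(face_embs[fv[0]], voice_embs[fv[1]]))
```

No paired training happens here: each encoder only embeds the same small identity list, and the comparison is purely geometric.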

Understanding cross-modal transfer. The geometry-to-performance link suggests that successful cross-modal matching isn’t about modality-specific features — it’s about whether two encoders organize identity information with similar geometric structure.

Validation

We validated on MAV-Celeb (70 identities, different demographics): CKA correlation held at rho = -0.61 (p = 0.002) across 24 pooled pairs (2 datasets x 12 pairs). The geometric signal is robust.

What’s Next


Citation

If you find this work useful, please cite:

@inproceedings{upadhyay2026riemannian,
  title={Riemannian Geometry of Multimodal Biometric Embedding Spaces},
  author={Upadhyay, Alok},
  booktitle={Proceedings of the International Conference on Mathematics of Artificial Intelligence (MathAI)},
  year={2026},
  url={https://openreview.net/forum?id=SPIdRsn5GD}
}

Accepted for oral presentation at MathAI 2026. Full paper and code available at the links above.