Alok Upadhyay | March 2026
TL;DR: We show that the intrinsic Riemannian geometry of pretrained neural network embedding spaces predicts how well face and voice can be matched across modalities — without any cross-modal training. A single metric, Centered Kernel Alignment (CKA), correlates with cross-modal matching error at Spearman rho = -0.87.
Paper: Riemannian Geometry of Multimodal Biometric Embedding Spaces (MathAI 2026, Oral)
Code: github.com/alok-upadhyay/riemannian-biometric-geometry
The Question
Here’s a thought experiment: you have a face recognition model and a speaker recognition model, both pretrained independently. Can you predict — without ever training them together — how well you could match a person’s face to their voice?
It turns out you can. And the answer lives in the geometry of their embedding spaces.
Why This Matters
Cross-modal biometric matching (face-voice, face-gait, etc.) is important for security, forensics, and accessibility. But building cross-modal systems is expensive: you need paired training data, alignment networks, and extensive evaluation.
What if you could look at two pretrained encoders and say, “these two will work well together” — before doing any of that work? That’s exactly what our geometric analysis enables.
The Setup
We took 7 pretrained encoders — 4 face, 3 voice — and extracted embeddings for 1,249 identities from VoxCeleb1:
| Modality | Encoder | Architecture | Dim |
|---|---|---|---|
| Face | ArcFace | ResNet-100 | 512 |
| Face | SigLIP | ViT-B/16 | 768 |
| Face | DINOv2 | ViT-B/14 | 768 |
| Face | CLIP | ViT-B/16 | 512 |
| Voice | WavLM | Transformer-Large | 1024 |
| Voice | HuBERT | Transformer-Large | 1024 |
| Voice | wav2vec 2.0 | Transformer-Large | 1024 |
This gives us 12 face-voice encoder pairs, each with a measurable cross-modal Equal Error Rate (EER).
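For readers less familiar with biometric evaluation: EER is the operating point where the false accept rate (FAR) equals the false reject rate (FRR). A minimal sketch of how it can be computed from pairwise similarity scores (the function and the threshold sweep are illustrative, not our evaluation code):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the Equal Error Rate: the operating point where the
    false accept rate (FAR) equals the false reject rate (FRR).
    scores: similarity per trial; labels: 1 = genuine pair, 0 = impostor."""
    order = np.argsort(scores)[::-1]           # high score = "same identity"
    labels = np.asarray(labels)[order]
    n_genuine = labels.sum()
    n_impostor = len(labels) - n_genuine
    # Sweep thresholds: after accepting the top-i scores,
    # FAR = impostors accepted / impostors, FRR = genuines rejected / genuines.
    tp = np.cumsum(labels)
    fp = np.cumsum(1 - labels)
    far = fp / n_impostor
    frr = 1 - tp / n_genuine
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2
```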
Measuring Geometry
We characterized each encoder’s embedding manifold using four geometric properties:
1. Intrinsic Dimensionality
Neural embeddings live in high-dimensional spaces (512-1024 dims), but the data itself occupies a much lower-dimensional manifold. We estimated intrinsic dimensionality using Maximum Likelihood Estimation (MLE).
Face encoders have lower intrinsic dimensionality (16-20) than voice encoders (29-39), suggesting face identity occupies a more compact manifold.
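Assuming the standard Levina-Bickel form of the MLE estimator, the idea fits in a few lines of NumPy (a simplified sketch with a brute-force distance matrix; our implementation details may differ):

```python
import numpy as np

def mle_intrinsic_dim(X, k=10):
    """Levina-Bickel maximum-likelihood intrinsic-dimension estimate.
    X: (n, d) array of embeddings; k: number of nearest neighbours."""
    X = np.asarray(X, dtype=float)
    # Brute-force pairwise Euclidean distances (fine for small calibration
    # sets; swap in a kNN index for large n).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    dists = np.sqrt(np.sort(d2, axis=1))[:, 1:k + 1]   # drop self-distance
    # Per-point MLE: inverse mean log-ratio of the k-th to j-th NN distance.
    log_ratio = np.log(dists[:, -1:] / dists[:, :-1])
    m = (k - 1) / log_ratio.sum(axis=1)
    return m.mean()
```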
2. Local Curvature
We estimated Riemannian curvature via the second fundamental form — fitting local quadratic surfaces to the embedding manifold using kNN neighborhoods. This tells us how “curved” the identity manifold is in each encoder’s space.
Curvature distributions vary significantly across encoders. ArcFace shows the most concentrated (low-variance) curvature, while self-supervised voice encoders exhibit broader distributions.
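A rough sketch of the quadratic-fit idea (not our exact estimator): split a kNN neighborhood into tangent and normal directions via local PCA, fit a quadratic height function over the tangent plane, and read off a curvature proxy from the fitted coefficients.

```python
import numpy as np

def local_curvature(X, idx, k=20, d=2):
    """Curvature proxy at X[idx]: fit a quadratic height function over the
    local PCA tangent plane and return the norm of its coefficients.
    d is the assumed local manifold dimension."""
    x0 = X[idx]
    nbr = np.argsort(((X - x0) ** 2).sum(1))[1:k + 1]
    Y = X[nbr] - x0
    # Tangent directions = top-d principal components; the rest are normal.
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    t = Y @ Vt[:d].T            # tangent coordinates (k, d)
    h = Y @ Vt[d:].T            # normal ("height") coordinates
    # Design matrix of quadratic monomials t_i * t_j (i <= j).
    pairs = [(i, j) for i in range(d) for j in range(i, d)]
    A = np.stack([t[:, i] * t[:, j] for i, j in pairs], axis=1)
    coef, *_ = np.linalg.lstsq(A, h, rcond=None)
    return np.linalg.norm(coef)
```

On a flat plane the fitted coefficients vanish; on a sphere they recover the reciprocal radius up to a constant.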
3. Cluster Topology
We measured the compactness gap: the difference between inter-class and intra-class distances. A larger gap means identities are better separated.
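The compactness gap is simple to compute; a minimal NumPy version (function name ours):

```python
import numpy as np

def compactness_gap(emb, ids):
    """Mean inter-class distance minus mean intra-class distance.
    A larger gap means identity clusters are better separated."""
    emb, ids = np.asarray(emb, float), np.asarray(ids)
    d = np.sqrt(((emb[:, None] - emb[None, :]) ** 2).sum(-1))
    same = ids[:, None] == ids[None, :]
    off_diag = ~np.eye(len(ids), dtype=bool)
    intra = d[same & off_diag].mean()   # within-identity distances
    inter = d[~same].mean()             # between-identity distances
    return inter - intra
```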
4. Cross-Modal Metrics
For each of the 12 face-voice pairs, we computed:
- CKA (Centered Kernel Alignment): structural similarity between kernel matrices
- Gromov-Wasserstein distance: optimal transport between metric spaces
- Spectral gap divergence: difference in graph Laplacian spectra
- ID mismatch: difference in intrinsic dimensionality
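Linear CKA is a one-screen computation. A sketch in the feature-space form (we show the linear kernel; kernel variants exist but are not assumed here). Note that CKA compares the n x n inner-product structures, so the two embedding dimensions need not match:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two embedding sets for the same n identities.
    X: (n, d1), Y: (n, d2); dimensions may differ, since CKA compares
    the n x n Gram (inner-product) matrices, not the features directly."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    # HSIC-style cross term and normalizers, in the feature-space form.
    cross = np.linalg.norm(X.T @ Y, 'fro') ** 2
    normx = np.linalg.norm(X.T @ X, 'fro')
    normy = np.linalg.norm(Y.T @ Y, 'fro')
    return cross / (normx * normy)
```

A useful property: linear CKA is invariant to orthogonal transforms and isotropic scaling of either space, so it compares geometric structure rather than coordinates.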
The Key Finding
CKA predicts cross-modal matching performance.
Four geometric metrics plotted against cross-modal EER. CKA (top-left) shows the strongest correlation: higher CKA = lower EER = better matching.
The numbers:
- CKA vs. EER: Spearman rho = -0.87, p < 0.001
- ID mismatch vs. EER: rho = -0.79, p = 0.002
- GW distance and spectral gap: not significant
A multivariate regression with CKA and ID mismatch achieves leave-one-out cross-validated R-squared = 0.72, meaning we can explain most of the variance in cross-modal matching difficulty from geometry alone.
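With only 12 pairs, leave-one-out cross-validation is the natural way to get an honest R-squared. A sketch of the procedure with scikit-learn (the helper name is ours; feature columns would be [CKA, ID mismatch]):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def loo_r2(features, target):
    """Leave-one-out cross-validated R^2 for predicting cross-modal EER
    from geometric features of the encoder pair."""
    pred = cross_val_predict(LinearRegression(), features, target,
                             cv=LeaveOneOut())
    ss_res = ((target - pred) ** 2).sum()
    ss_tot = ((target - target.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot
```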
The CKA Heatmap
CKA between all encoder pairs. CLIP-WavLM has the highest cross-modal CKA (0.211) and indeed the lowest EER (24.4%). ArcFace pairs have the lowest CKA and highest EER.
The CKA heatmap reveals something intuitive: CLIP — trained on image-text pairs with contrastive learning — produces face embeddings that are geometrically most compatible with voice embeddings. This makes sense: CLIP’s training objective encourages embedding structure that transfers across modalities.
Why Does This Work?
CKA measures the alignment of inner-product structures between two embedding spaces. In Riemannian terms, the inner product defines the metric tensor — the fundamental object that determines all geometry (distances, angles, curvature). When the CKA between two spaces is high, their metric tensors induce similar geometric structure on identity representations.
This is why CCA-based alignment (which we use for cross-modal matching) has an easier optimization landscape when CKA is high: the strongly correlated directions it searches for already exist in both spaces.
Practical Implications
Encoder selection without paired data. Before building a cross-modal biometric system, compute CKA between candidate encoders on a small calibration set. Pick the pair with highest CKA. This is ~1000x cheaper than training and evaluating a full cross-modal pipeline.
Understanding cross-modal transfer. The geometry-to-performance link suggests that successful cross-modal matching isn’t about modality-specific features — it’s about whether two encoders organize identity information with similar geometric structure.
Validation
We validated on MAV-Celeb (70 identities, different demographics): CKA correlation held at rho = -0.61 (p = 0.002) across 24 pooled pairs (2 datasets x 12 pairs). The geometric signal is robust.
What’s Next
- More encoders and datasets to strengthen statistical power
- Jointly-trained models (e.g., ImageBind) for causal evidence — does joint training explicitly increase CKA?
- Riemannian CKA using geodesic kernels instead of linear kernels
- Theoretical bounds connecting CKA to alignment error
Citation
If you find this work useful, please cite:
@inproceedings{upadhyay2026riemannian,
  title={Riemannian Geometry of Multimodal Biometric Embedding Spaces},
  author={Upadhyay, Alok},
  booktitle={Proceedings of the International Conference on Mathematics of Artificial Intelligence (MathAI)},
  year={2026},
  url={https://openreview.net/forum?id=SPIdRsn5GD}
}
Accepted for oral presentation at MathAI 2026. Full paper and code available at the links above.