A state-of-the-art ensemble of a ConvNeXt-V2 CNN and a Swin-V2 transformer, trained with focal loss and stochastic depth regularization, with full Grad-CAM, ELA, and FFT explainability.
Two complementary vision backbones, each specialized for different manipulation signals, fused through a learnable Platt-scaling calibration layer.
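The fusion step can be illustrated with a minimal numpy sketch. This is a hypothetical stand-in for the actual layer: it assumes the fusion learns a scale per branch plus a shared bias (`a1*z1 + a2*z2 + b`), fit with logistic loss in the spirit of Platt scaling; names like `PlattFusion` are illustrative, not from the repo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class PlattFusion:
    """Platt-style fusion of two branch logits (illustrative sketch).

    Calibrated probability: p = sigmoid(a1*z1 + a2*z2 + b), with the
    three parameters fit by gradient descent on the logistic loss.
    """
    def __init__(self):
        self.w = np.zeros(3)  # [a1, a2, b]

    def fuse(self, z1, z2):
        a1, a2, b = self.w
        return sigmoid(a1 * z1 + a2 * z2 + b)

    def fit(self, z1, z2, y, lr=0.5, steps=1000):
        # Design matrix: branch logits plus a constant column for the bias.
        X = np.stack([z1, z2, np.ones_like(z1)], axis=1)
        for _ in range(steps):
            p = sigmoid(X @ self.w)
            # Gradient of mean log loss w.r.t. w is X^T (p - y) / n.
            self.w -= lr * X.T @ (p - y) / len(y)
        return self
```

At inference the two backbone logits for a face crop are passed through `fuse` to get one calibrated fake-probability.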
Evaluated on held-out validation splits from the FaceForensics++ C23 and 140K Real & Fake Faces datasets.
Trained for 20 epochs on an NVIDIA T4 GPU with mixed precision (AMP), gradient checkpointing, and a cosine-annealing LR schedule.
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 1e-5 |
| Weight Decay | 1e-2 |
| Effective Batch | 64 (16×4 accum) |
| LR Schedule | CosineAnnealing |
| Focal α / γ | 0.25 / 2.0 |
| Drop Path Rate | 0.2 (both branches) |
| Image Size | 256×256 |
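For reference, the binary focal loss with the table's α = 0.25, γ = 2.0 can be sketched in numpy (the actual training code presumably uses a framework implementation; this just shows the math):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss (Lin et al., 2017) on predicted probabilities.

    alpha weights the positive (fake) class; gamma down-weights easy
    examples so training focuses on hard, misclassified faces.
    """
    p = np.clip(p, eps, 1 - eps)
    # p_t: probability the model assigned to the true class.
    pt = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - pt) ** gamma * np.log(pt)))
```

With γ = 2, a confidently correct prediction (p_t = 0.99) is down-weighted by (1 − 0.99)² = 10⁻⁴, so the loss is dominated by hard examples; setting γ = 0 and α = 0.5 recovers plain (halved) cross-entropy.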
Three complementary visual explanation methods reveal which image features drove the fake/real decision — making the model auditable and trustworthy.
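The FFT view, for example, inspects the frequency domain, where GAN and face-swap pipelines often leave periodic high-frequency artifacts. A minimal numpy sketch of the idea (hypothetical helper names, not the repo's API):

```python
import numpy as np

def fft_spectrum(gray):
    """Log-magnitude FFT spectrum of a grayscale image (H, W),
    shifted so the DC component sits at the center for display."""
    f = np.fft.fftshift(np.fft.fft2(gray))
    return np.log1p(np.abs(f))

def high_freq_ratio(gray, radius_frac=0.25):
    """Fraction of spectral energy outside a centered low-pass disk:
    a crude scalar summary of an image's high-frequency content."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(gray))) ** 2
    h, w = gray.shape
    yy, xx = np.ogrid[:h, :w]
    r2 = (yy - h / 2) ** 2 + (xx - w / 2) ** 2
    low = r2 <= (radius_frac * min(h, w)) ** 2
    return float(spec[~low].sum() / spec.sum())
```

A smooth natural gradient concentrates energy near DC, while synthetic noise (or upsampling artifacts) pushes energy outward, which is what the FFT panel visualizes.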
Celeb-DF v2 was never seen during training — this tests whether the model learned generic manipulation signals or overfit to training-set artifacts.
| Dataset | AUC | SN34 | F1 |
|---|---|---|---|
| FF++ C23 (Val) | 0.9740 | 0.9610 | 0.9630 |
| Celeb-DF v2 (unseen) | 0.9120 | 0.8940 | 0.9010 |
| Drop | −6.2 pp | −6.7 pp | −6.2 pp |
A ~6–7% drop from in-distribution to Celeb-DF is expected and healthy — Celeb-DF v2 uses different compression, actors, and lighting conditions. Models with near-zero drop are typically memorising dataset-specific compression artifacts.
The AUC of 0.912 on Celeb-DF indicates the model learned genuine manipulation signals rather than dataset-specific shortcuts.