Live Detection System · v4.0

Detecting Deepfakes at the Pixel Level

A state-of-the-art ensemble of a ConvNeXt-V2 CNN and a Swin-V2 transformer, trained with Focal Loss and stochastic depth regularization, with full Grad-CAM, ELA, and FFT explainability.

Val AUC: 0.974 · BitMind SN34: 0.961 · F1 Score: 0.963 · Celeb-DF AUC: 0.912 · Input Resolution: 256px · XAI Methods: 3

Dual-Branch Ensemble

Two complementary vision backbones, each specialized for different manipulation signals, merged by soft voting with learnable per-branch temperatures (a single-parameter form of Platt scaling).

CNN BRANCH
ConvNeXt-V2-Base
Detects local pixel-level artifacts — JPEG blocking, blending seams, texture discontinuities, and compression anomalies introduced during face-swap synthesis.
87.8M params
drop_path=0.2
grad checkpoint
TRANSFORMER BRANCH
Swin-V2-Base
Captures global lighting inconsistencies, shadow directionality mismatches, and high-level semantic coherence failures across the full face region.
87.9M params
shifted windows
256×256 input
FUSION
Temperature-Scaled Ensemble
Soft-voting with learnable per-branch temperature parameters (Platt scaling). Each branch's confidence is calibrated independently before averaging — preventing overconfident predictions.
T_cnn · T_vit learned
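The calibration-then-average step can be sketched in a few lines of plain Python. The temperature values below are illustrative placeholders, not the actual learned T_cnn / T_vit:

```python
import math

def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fused_probability(logit_cnn: float, logit_vit: float,
                      t_cnn: float = 1.5, t_vit: float = 1.2) -> float:
    """Temperature-scaled soft voting: each branch's logit is divided by
    its learned temperature before the sigmoid, then the two calibrated
    probabilities are averaged. T > 1 softens an overconfident branch,
    pulling its probability toward 0.5 before it can dominate the vote.
    (t_cnn/t_vit defaults are illustrative, not the trained values.)"""
    p_cnn = _sigmoid(logit_cnn / t_cnn)
    p_vit = _sigmoid(logit_vit / t_vit)
    return 0.5 * (p_cnn + p_vit)
```

In practice the temperatures would be fit by minimizing the negative log-likelihood of the ensemble output on a held-out calibration split, with the backbone weights frozen.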
TRAINING
Focal Loss + Stochastic Depth
Focal Loss (α=0.25, γ=2.0) down-weights easy examples, concentrating the gradient on hard, uncertain samples. Stochastic depth (drop_path=0.2) randomly skips residual blocks during training, forcing diverse, redundant feature pathways.
α=0.25 γ=2.0
eff. batch=64
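Both training tricks reduce to a few lines each. A minimal per-sample sketch (the real training loop operates on batched tensors, and stochastic depth rates are typically ramped linearly across blocks rather than held constant):

```python
import math
import random

def binary_focal_loss(p: float, y: int,
                      alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss. The (1 - p_t)^gamma factor shrinks the loss of
    well-classified samples, so gradient mass shifts to hard examples;
    alpha balances the positive/negative classes."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def drop_path(x: float, residual: float,
              p: float = 0.2, training: bool = True) -> float:
    """Stochastic depth on one residual connection: with probability p the
    residual branch is skipped for this sample; surviving branches are
    rescaled by 1/(1-p) so the expected output matches evaluation mode."""
    if not training or p == 0.0:
        return x + residual
    if random.random() < p:
        return x  # block disabled for this sample
    return x + residual / (1.0 - p)
```

A confident correct prediction (p = 0.99 on a fake) contributes orders of magnitude less loss than a borderline one (p = 0.6), which is the whole point of the focusing term.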
Pipeline: Input (face crop, 256×256 RGB) → Augment → Normalize (ImageNet stats) → Branch 1: ConvNeXt-V2 (local features) ‖ Branch 2: Swin-V2 (global context) → Fusion: ensemble with Platt scaling → Output: P(Fake) ∈ [0.0, 1.0]

In-Distribution Metrics

Evaluated on held-out validation split from FaceForensics++ C23 and 140K Real & Fake Faces datasets.

ROC AUC: 0.9740 · SN34 Score: 0.9610 · F1 Score: 0.9630 · Precision: 0.971 · Recall: 0.958
ROC Curve
Precision–Recall Curve
Score Distribution — Real vs Fake
Confidence Distribution
Model confidence (how certain is each prediction?)
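The ROC AUC reported above has a direct probabilistic reading: the chance that a randomly drawn fake scores higher than a randomly drawn real. A minimal sketch via the Mann-Whitney statistic (fine for small lists; real evaluation would use a sorted O(n log n) routine such as sklearn's):

```python
def roc_auc(scores_real, scores_fake):
    """ROC AUC as P(score_fake > score_real) over all real/fake pairs,
    counting ties as half a win (the Mann-Whitney U statistic divided
    by n_real * n_fake). 1.0 = perfect separation, 0.5 = chance."""
    wins = 0.0
    for f in scores_fake:
        for r in scores_real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(scores_fake) * len(scores_real))
```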

Learning Curves

20 epochs on NVIDIA T4 GPU with mixed precision (AMP), gradient checkpointing, and cosine annealing LR schedule.

Focal Loss (Train vs Val)
Validation AUC Over Epochs
BitMind SN34 Score Over Epochs
Hyperparameter Summary
Learning Rate: 1e-5
Weight Decay: 1e-2
Effective Batch: 64 (16 × 4 accum)
LR Schedule: CosineAnnealing
Focal α / γ: 0.25 / 2.0
Drop Path Rate: 0.2 (both branches)
Image Size: 256×256
Optimizer: AdamW
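The cosine annealing schedule listed above follows half a cosine period from the base rate down to a floor. A minimal sketch with the run's 20 epochs and 1e-5 base rate (eta_min = 0 is an assumption; PyTorch's CosineAnnealingLR defaults to it):

```python
import math

def cosine_annealing_lr(epoch: int, total_epochs: int = 20,
                        base_lr: float = 1e-5, eta_min: float = 0.0) -> float:
    """Cosine annealing: lr(t) = eta_min + (base_lr - eta_min)
    * (1 + cos(pi * t / T)) / 2. Starts at base_lr, decays slowly at
    first, fastest mid-run, and flattens out near eta_min at the end."""
    cos_term = math.cos(math.pi * epoch / total_epochs)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cos_term)
```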

Why Did It Decide That?

Three complementary visual explanation methods reveal which image features drove the fake/real decision — making the model auditable and trustworthy.

Grad-CAM Heatmaps
Gradient-weighted class activation maps highlight which facial regions — eyes, mouth, hairline — the ConvNeXt-V2 backbone found most discriminative. Red = high attention.
ConvNeXt Stage 3
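The Grad-CAM core is framework-independent once activations and gradients are in hand. A minimal sketch on nested lists (a real implementation would hook ConvNeXt stage 3 tensors in PyTorch and backprop the fake logit):

```python
def grad_cam(activations, gradients):
    """Grad-CAM core: weight each channel's activation map by the spatial
    mean of its gradient w.r.t. the target score, sum over channels, and
    clamp negatives (ReLU) so only evidence *for* the class remains.
    activations / gradients: nested lists [channel][row][col]."""
    h, w = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for act_map, grad_map in zip(activations, gradients):
        # alpha_k: global-average-pooled gradient = channel importance
        alpha = sum(sum(row) for row in grad_map) / (h * w)
        for i in range(h):
            for j in range(w):
                cam[i][j] += alpha * act_map[i][j]
    return [[max(v, 0.0) for v in row] for row in cam]
```

The resulting map is then bilinearly upsampled to the input resolution and overlaid as the red/blue heatmap shown below.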
Error Level Analysis
Re-saves the face at JPEG quality 90 and amplifies the difference. Manipulation introduces regions with inconsistent compression artifacts — revealed as bright patches against a dark background.
JPEG quality=90
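The amplification step at the heart of ELA is a one-liner per pixel. A sketch on grayscale 0-255 values; the quality-90 JPEG re-save itself would be done with an image library such as Pillow (assumption, not shown here):

```python
def ela_map(original, resaved, gain: float = 15.0):
    """Error Level Analysis core: amplified absolute difference between
    an image and its JPEG re-save, clamped to the displayable range.
    Regions that recompress differently from their surroundings (pasted
    or regenerated patches) light up against a dark background. The
    gain of 15 is an illustrative display constant."""
    return [[min(255.0, gain * abs(o - r))
             for o, r in zip(orow, rrow)]
            for orow, rrow in zip(original, resaved)]
```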
FFT Frequency Spectrum
Log-magnitude 2D FFT shows the frequency fingerprint of the image. AI-generated faces exhibit regular grid artifacts in frequency space — a telltale sign of GAN upsampling and convolutional synthesis patterns.
2D DFT log-mag
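The spectrum panel reduces to a log-magnitude 2D DFT. A naive O(N⁴) sketch for tiny inputs; the production pipeline would use `numpy.fft.fft2` plus `fftshift` to center the DC component:

```python
import cmath
import math

def fft_log_magnitude(img):
    """Log-magnitude of the 2D DFT of a grayscale image (nested lists).
    Real photos show a smooth, radially decaying spectrum; periodic GAN
    upsampling artifacts appear as regular off-center peaks. log1p
    compresses the huge dynamic range for display."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    acc += img[y][x] * cmath.exp(
                        -2j * math.pi * (u * y / h + v * x / w))
            out[u][v] = math.log1p(abs(acc))
    return out
```

A quick sanity check: for a constant image all energy sits in the DC bin (magnitude = sum of pixels) and every other bin is zero.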
Real Face — Grad-CAM
Diffuse, low-intensity activation distributed across the face. The model finds no concentrated manipulation region. Heat is spread across natural features.
Fake Face — Grad-CAM
Concentrated hot spots at eye corners, jaw boundary, and ear-hair transitions — exactly where face-swap compositing introduces blending seams and texture inconsistencies.
Fake Face — FFT
Regular cross-shaped frequency artifacts centered at DC component. These periodic peaks arise from GAN decoder upsampling (checkerboard artifact) and are absent in real photographs.

Generalisation to Unseen Data

Celeb-DF v2 was never seen during training — this tests whether the model learned generic manipulation signals or overfit to training-set artifacts.

Dataset              | AUC    | SN34   | F1
FF++ C23 (Val)       | 0.9740 | 0.9610 | 0.9630
Celeb-DF v2 (unseen) | 0.9120 | 0.8940 | 0.9010
Drop (pts)           | −6.2   | −6.7   | −6.2
Generalisation Analysis

A ~6–7% drop from in-distribution to Celeb-DF is expected and healthy — Celeb-DF v2 uses different compression, actors, and lighting conditions. Models with near-zero drop are typically memorising dataset-specific compression artifacts.


The AUC of 0.912 on Celeb-DF indicates the model learned genuine manipulation signals rather than dataset-specific shortcuts.

In-Distribution vs Cross-Dataset Comparison

Datasets Used

FaceForensics++ C23
~1,000 videos · 4 manipulation methods
Primary benchmark dataset with Deepfakes, Face2Face, FaceSwap, and NeuralTextures manipulations at light (C23) compression.
140K Real & Fake Faces
140,000 images · StyleGAN2
Kaggle dataset with 70K real faces from Flickr-Faces-HQ and 70K AI-generated faces from StyleGAN2 — excellent class balance.
Celeb-DF v2 (Test Only)
590 real + 5,639 fake videos
High-quality celebrity deepfake dataset with reduced visible artifacts. Used exclusively for cross-dataset generalisation testing — zero training exposure.