A state-of-the-art ensemble of a ConvNeXt-V2 CNN and a Swin-V2 transformer, trained with focal loss and stochastic depth regularization, with full Grad-CAM, ELA, and FFT explainability.
Two complementary vision backbones, each specialized for different manipulation signals, fused through a learnable Platt-scaling calibration layer.
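The fusion step can be illustrated with a minimal numpy sketch. This is a hypothetical stand-in for the actual layer: it assumes the fusion learns a scale per branch plus a shared bias (`a1*z1 + a2*z2 + b`), fit with logistic loss in the spirit of Platt scaling; names like `PlattFusion` are illustrative, not from the repo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class PlattFusion:
    """Platt-style fusion of two branch logits (illustrative sketch).

    Calibrated probability: p = sigmoid(a1*z1 + a2*z2 + b), with the
    three parameters fit by gradient descent on the logistic loss.
    """
    def __init__(self):
        self.w = np.zeros(3)  # [a1, a2, b]

    def fuse(self, z1, z2):
        a1, a2, b = self.w
        return sigmoid(a1 * z1 + a2 * z2 + b)

    def fit(self, z1, z2, y, lr=0.5, steps=1000):
        # Design matrix: branch logits plus a constant column for the bias.
        X = np.stack([z1, z2, np.ones_like(z1)], axis=1)
        for _ in range(steps):
            p = sigmoid(X @ self.w)
            # Gradient of mean log loss w.r.t. w is X^T (p - y) / n.
            self.w -= lr * X.T @ (p - y) / len(y)
        return self
```

At inference the two backbone logits for a face crop are passed through `fuse` to get one calibrated fake-probability.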
Evaluated on held-out validation splits from the FaceForensics++ C23 and 140K Real & Fake Faces datasets.
Trained for 20 epochs on an NVIDIA T4 GPU with mixed precision (AMP), gradient checkpointing, and a cosine-annealing LR schedule.
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 1e-5 |
| Weight Decay | 1e-2 |
| Effective Batch | 64 (16×4 accum) |
| LR Schedule | CosineAnnealing |
| Focal α / γ | 0.25 / 2.0 |
| Drop Path Rate | 0.2 (both branches) |
| Image Size | 256×256 |
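For reference, the binary focal loss with the table's α = 0.25, γ = 2.0 can be sketched in numpy (the actual training code presumably uses a framework implementation; this just shows the math):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss (Lin et al., 2017) on predicted probabilities.

    alpha weights the positive (fake) class; gamma down-weights easy
    examples so training focuses on hard, misclassified faces.
    """
    p = np.clip(p, eps, 1 - eps)
    # p_t: probability the model assigned to the true class.
    pt = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - pt) ** gamma * np.log(pt)))
```

With γ = 2, a confidently correct prediction (p_t = 0.99) is down-weighted by (1 − 0.99)² = 10⁻⁴, so the loss is dominated by hard examples; setting γ = 0 and α = 0.5 recovers plain (halved) cross-entropy.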
Three complementary visual explanation methods reveal which image features drove the fake/real decision — making the model auditable and trustworthy.
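The FFT view, for example, inspects the frequency domain, where GAN and face-swap pipelines often leave periodic high-frequency artifacts. A minimal numpy sketch of the idea (hypothetical helper names, not the repo's API):

```python
import numpy as np

def fft_spectrum(gray):
    """Log-magnitude FFT spectrum of a grayscale image (H, W),
    shifted so the DC component sits at the center for display."""
    f = np.fft.fftshift(np.fft.fft2(gray))
    return np.log1p(np.abs(f))

def high_freq_ratio(gray, radius_frac=0.25):
    """Fraction of spectral energy outside a centered low-pass disk:
    a crude scalar summary of an image's high-frequency content."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(gray))) ** 2
    h, w = gray.shape
    yy, xx = np.ogrid[:h, :w]
    r2 = (yy - h / 2) ** 2 + (xx - w / 2) ** 2
    low = r2 <= (radius_frac * min(h, w)) ** 2
    return float(spec[~low].sum() / spec.sum())
```

A smooth natural gradient concentrates energy near DC, while synthetic noise (or upsampling artifacts) pushes energy outward, which is what the FFT panel visualizes.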
Celeb-DF v2 was never seen during training — this tests whether the model learned generic manipulation signals or overfit to training-set artifacts.
| Dataset | AUC | SN34 | F1 |
|---|---|---|---|
| FF++ C23 (Val) | 0.9740 | 0.9610 | 0.9630 |
| Celeb-DF v2 (unseen) | 0.9120 | 0.8940 | 0.9010 |
| Drop | −6.2 pp | −6.7 pp | −6.2 pp |
A ~6–7% drop from in-distribution to Celeb-DF is expected and healthy — Celeb-DF v2 uses different compression, actors, and lighting conditions. Models with near-zero drop are typically memorising dataset-specific compression artifacts.
The AUC of 0.912 on Celeb-DF indicates the model learned genuine manipulation signals rather than dataset-specific shortcuts.