Model Used
- Feature extractor: VideoMAE / ViT-Base features, 768 dimensions
- Classifier: AnomalyTransformer temporal encoder
- Architecture: 768 -> 512 projection, 4 Transformer layers, 8 attention heads, feed-forward (FF) dimension 1024, dropout 0.3
- Parameters: 9,201,922 trainable parameters
- Runtime checkpoint: best_model.pt, 105.38 MB
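
As a rough sanity check on the figures above, the sketch below counts the parameters of the listed components only. It assumes a standard Transformer encoder layer layout (biased Q/K/V/output projections, a two-layer feed-forward block, and two layer norms); the actual AnomalyTransformer code may differ. The helper names are hypothetical, and the float32 storage assumption is ours, not stated in the source.

```python
# Rough parameter accounting for the listed components.
# ASSUMPTION: standard Transformer encoder layers with biases and two
# LayerNorms each; the real AnomalyTransformer layout is not given here.

D_IN, D_MODEL, N_LAYERS, D_FF = 768, 512, 4, 1024
REPORTED_PARAMS = 9_201_922      # from the summary above
REPORTED_CHECKPOINT_MB = 105.38  # from the summary above


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one assumed encoder layer.

    The number of attention heads does not change the count, because the
    heads split the same d_model-wide projections.
    """
    attention = 4 * linear_params(d_model, d_model)  # Q, K, V, output
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * (2 * d_model)  # scale and shift, two norms
    return attention + feed_forward + layer_norms


projection = linear_params(D_IN, D_MODEL)
encoder = N_LAYERS * encoder_layer_params(D_MODEL, D_FF)
listed = projection + encoder
unlisted = REPORTED_PARAMS - listed  # head, norms, embeddings, etc. (unknown)

# ASSUMPTION: weights stored as 4-byte float32 values.
fp32_weight_mb = REPORTED_PARAMS * 4 / 1e6

print(f"projection: {projection:,}")
print(f"encoder:    {encoder:,}")
print(f"unlisted:   {unlisted:,}")
print(f"float32 weights alone: {fp32_weight_mb:.2f} MB "
      f"(checkpoint: {REPORTED_CHECKPOINT_MB} MB)")
```

Under these assumptions the projection and the four encoder layers account for 8,804,864 parameters, leaving 397,058 for components not itemized above (for example a classification head or extra normalization layers). The float32 weights alone would occupy roughly a third of the checkpoint size, so best_model.pt may store more than model weights; optimizer state is one possible explanation, which the source does not confirm.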