Leveraging ViViT transformers and foresight pruning for scalable scene change detection on a distributed architecture
DOI: 10.31673/2412-9070.2025.017844
Abstract
Nowadays, the amount of video content is growing rapidly and requires more efficient ways to analyze it. Scene change detection is a crucial step in video processing because it allows grouping results by context and enables more detailed analysis. This research focuses on applying the Video Vision Transformer (ViViT) architecture to overcome two challenges: the significant computational power requirements and the limited context capture of convolutional and recurrent architectures. We also apply foresight pruning to ViViT to further reduce its resource requirements.
By training the ViViT model, we achieved a 5.5% improvement in F1 score over existing scene detection approaches. By applying foresight pruning, we also reduced the ViViT model size by 43% and inference time by 10% while maintaining state-of-the-art accuracy.
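To make the pruning step concrete, the following is a minimal sketch of foresight pruning (pruning at initialization) in a SNIP-style formulation, assuming a PyTorch setting. The generic model, calibration batch, and the 43% sparsity value are illustrative stand-ins, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

def foresight_prune(model: nn.Module, inputs, targets, sparsity: float = 0.43):
    """Score each weight by |grad * weight| on one calibration batch,
    then zero out the lowest-scoring fraction before any training."""
    criterion = nn.CrossEntropyLoss()
    model.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()

    # Collect per-weight saliency scores from all prunable layers.
    weights, scores = [], []
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d, nn.Conv3d)) and module.weight.grad is not None:
            weights.append(module.weight)
            scores.append((module.weight.grad * module.weight).abs().flatten())
    all_scores = torch.cat(scores)

    # Global threshold: remove the `sparsity` fraction with smallest saliency.
    k = max(1, int(sparsity * all_scores.numel()))
    threshold = torch.kthvalue(all_scores, k).values

    with torch.no_grad():
        for weight in weights:
            mask = ((weight.grad * weight).abs() > threshold).float()
            weight.mul_(mask)  # zero out pruned connections
```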
We also propose a pipeline based on a shot detection algorithm that significantly reduces computational complexity by analyzing only key frames for scene and attribute detection. A parallelized processing architecture enables simultaneous scene and attribute detection, leading to a 24.21x speedup over the classic approach of analyzing every frame. This study presents a robust, efficient, and scalable solution that enables further improvement of scene and attribute detection applications, with the potential for real-time analytics.
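The following is an illustrative sketch of the key-frame pipeline described above: detect shot boundaries, keep one representative frame per shot, and run scene and attribute detection on those frames in parallel. The functions detect_shot_boundaries, detect_scene, and detect_attributes are hypothetical placeholders, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: any shot-boundary detector and frame-level
# classifiers could be plugged in here.
def detect_shot_boundaries(frames):
    return [0]  # placeholder: treat the whole clip as one shot

def detect_scene(frame):
    return "scene_label"  # placeholder scene classifier

def detect_attributes(frame):
    return ["attribute"]  # placeholder attribute detector

def process_video(frames):
    # 1. Shot detection yields the indices where a new shot begins.
    boundaries = detect_shot_boundaries(frames)

    # 2. Only one representative key frame per shot is kept for analysis.
    key_frames = [frames[i] for i in boundaries]

    # 3. Scene and attribute detection run concurrently over the key frames
    #    instead of sequentially over every frame of the video.
    with ThreadPoolExecutor(max_workers=2) as pool:
        scenes = pool.submit(lambda: [detect_scene(f) for f in key_frames])
        attrs = pool.submit(lambda: [detect_attributes(f) for f in key_frames])
        return scenes.result(), attrs.result()
```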
Keywords: Scene Change Detection, Video Vision Transformer (ViViT), Video Analysis, Parallel Processing, Scalability, Neural Networks, Artificial Intelligence, Software Engineering.