Cross-Model Pseudo-Labeling for
Semi-Supervised Action Recognition
Yinghao Xu1, Fangyun Wei3, Xiao Sun3, Ceyuan Yang1,
Yujun Shen1, Bo Dai2, Bolei Zhou1, Stephen Lin3
1The Chinese University of Hong Kong, 2S-Lab, Nanyang Technological University,
3Microsoft Research Asia
Overview
Semi-supervised action recognition is a challenging but important task due to the high cost of data annotation. A common approach to this problem is to assign pseudo-labels to unlabeled data and then use them as additional supervision in training. In this work, we propose a more effective pseudo-labeling scheme, called Cross-Model Pseudo-Labeling (CMPL). Concretely, we introduce a lightweight auxiliary network in addition to the primary backbone, and ask the two networks to predict pseudo-labels for each other. We observe that, due to their different structural biases, these two models tend to learn complementary representations from the same video clips. Each model can thus benefit from its counterpart by utilizing cross-model predictions as supervision. Experiments on different data partition protocols demonstrate a significant improvement of our framework over existing alternatives.
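The cross-model exchange described above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. It assumes a FixMatch-style setup that the overview does not spell out: each network sees a weakly and a strongly augmented view of the same unlabeled clip, a pseudo-label is kept only if its confidence passes a threshold (0.95 here, an assumed value), and the loss is plain cross-entropy. All function and parameter names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(logits_weak, threshold):
    """Turn one model's prediction on the weak view into (class index, keep flag)."""
    probs = softmax(logits_weak)
    confidence = max(probs)
    return probs.index(confidence), confidence >= threshold


def cross_entropy(logits, target):
    """Cross-entropy of a single prediction against a hard label."""
    return -math.log(softmax(logits)[target])


def cross_model_loss(primary_weak, primary_strong, aux_weak, aux_strong,
                     threshold=0.95):
    """Unsupervised loss for one unlabeled clip; returns (loss_primary, loss_aux).

    Assumption: the threshold and the weak/strong views follow a FixMatch-style
    recipe and are not taken from the text above.
    """
    label_from_primary, keep_primary = pseudo_label(primary_weak, threshold)
    label_from_aux, keep_aux = pseudo_label(aux_weak, threshold)

    # The swap is the key step: each network is supervised by the OTHER
    # network's pseudo-label, never by its own.
    loss_primary = cross_entropy(primary_strong, label_from_aux) if keep_aux else 0.0
    loss_aux = cross_entropy(aux_strong, label_from_primary) if keep_primary else 0.0
    return loss_primary, loss_aux
```

In this sketch a network contributes supervision only when it is confident, so a clip on which the auxiliary network is unsure adds nothing to the primary network's loss, and vice versa.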
Results
  • Quantitative Results
  • CMPL achieves 17.6% and 25.1% top-1 accuracy on Kinetics-400 and UCF-101, respectively, using only the RGB modality and 1% labeled data, outperforming other state-of-the-art methods. More detailed comparisons and ablation studies are presented in our paper.

  • Empirical Study
  • Representation learned by different networks: Figure 5 displays accuracy curves of the two networks with inputs of varying temporal dynamics. It can be seen that the auxiliary network is more sensitive to temporal dynamics. This suggests that the tiny model using a lower channel capacity can better capture fast motion without building a detailed spatial representation, providing complementary temporal semantics in relation to the spatial patterns learned by the backbone network.

    Primary Gain vs. Auxiliary Accuracy: Figure 7 indicates that the performance gain of the backbone network is positively correlated with the accuracy of the auxiliary network on the corresponding classes. This study supports our motivation that the auxiliary network complements the backbone network, particularly on action classes that the auxiliary network recognizes well.

    Bibtex
    @article{xu2021cmpl,
      title   = {Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition},
      author  = {Xu, Yinghao and Wei, Fangyun and Sun, Xiao and Yang, Ceyuan and Shen, Yujun and Dai, Bo and Zhou, Bolei and Lin, Stephen},
      journal = {arXiv preprint arXiv:2112.09690},
      year    = {2021}
    }