Our CMPL achieves 17.6% and 25.1% top-1 accuracy on Kinetics-400 and UCF-101, respectively, using only the RGB modality and 1% labeled data, outperforming other state-of-the-art methods. More detailed comparisons and ablation studies are presented in our paper.
Representation learned by different networks: Figure 5 displays the accuracy curves of the two networks on inputs with varying temporal dynamics. The auxiliary network is clearly more sensitive to temporal dynamics. This suggests that the tiny model, with its lower channel capacity, can better capture fast motion without building a detailed spatial representation, providing temporal semantics that complement the spatial patterns learned by the backbone network.
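As an illustration, a sensitivity probe of this kind can be run by evaluating each network on clips sampled at increasing temporal strides and comparing the resulting accuracy curves. The sketch below is only an assumption of such a probe, not the released evaluation code; `make_loader`, the stride values, and the model handles are placeholders.

```python
# Hedged sketch: probe sensitivity to temporal dynamics by evaluating a model
# on clips resampled at several temporal strides. `make_loader(stride)` is a
# hypothetical helper that yields (clips, labels) batches sampled at that stride.
import torch

@torch.no_grad()
def accuracy_vs_stride(model, make_loader, strides=(1, 2, 4, 8), device="cuda"):
    """Return {stride: top-1 accuracy} for clips sampled at each temporal stride."""
    model.eval().to(device)
    curve = {}
    for stride in strides:
        correct, total = 0, 0
        for clips, labels in make_loader(stride):
            logits = model(clips.to(device))
            correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
            total += labels.numel()
        curve[stride] = correct / max(total, 1)
    return curve

# Usage (placeholder model handles):
# backbone_curve  = accuracy_vs_stride(backbone_net, make_loader)
# auxiliary_curve = accuracy_vs_stride(auxiliary_net, make_loader)
# A curve that changes more sharply across strides indicates stronger
# dependence on temporal dynamics.
```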
Primary Gain vs. Auxiliary Accuracy: Figure 7 indicates that the performance gain of the backbone network is positively correlated with the accuracy of the auxiliary network on the corresponding classes. This supports our motivation that the auxiliary network complements the backbone network, particularly on action classes that the auxiliary network recognizes well.
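This per-class analysis amounts to a simple correlation over class-wise accuracies. The sketch below shows one way it could be computed; the array names are assumptions and do not come from the released code.

```python
# Hedged sketch: correlate the backbone's per-class gain with the auxiliary
# network's per-class accuracy. Inputs are NumPy arrays of shape (num_classes,):
#   backbone_base - backbone per-class top-1 accuracy without CMPL (placeholder)
#   backbone_cmpl - backbone per-class top-1 accuracy with CMPL (placeholder)
#   aux_acc       - auxiliary network per-class top-1 accuracy (placeholder)
import numpy as np

def gain_vs_auxiliary_accuracy(backbone_base, backbone_cmpl, aux_acc):
    """Return the per-class gain of the backbone and its Pearson correlation
    with the auxiliary network's per-class accuracy."""
    gain = backbone_cmpl - backbone_base          # per-class improvement of the backbone
    corr = np.corrcoef(gain, aux_acc)[0, 1]       # positive value means gains concentrate
    return gain, corr                             # on classes the auxiliary handles well
```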
@article{xu2021cmpl,
  title   = {Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition},
  author  = {Xu, Yinghao and Wei, Fangyun and Sun, Xiao and Yang, Ceyuan and Shen, Yujun and Dai, Bo and Zhou, Bolei and Lin, Stephen},
  journal = {arXiv preprint arXiv:2112.09690},
  year    = {2021}
}