Large-scale transformer-based pre-training has recently revolutionized vision-and-language (V+L) research. Models such as LXMERT, ViLBERT and UNITER have significantly lifted the state of the art over a wide range of V+L tasks. However, the large number of parameters in such models hinders their application in practice. In parallel, work on the lottery ticket hypothesis has shown that deep neural networks contain small matching subnetworks that can achieve on-par or even better performance than the dense networks when trained in isolation. In this work, we perform the first empirical study to assess whether such trainable subnetworks also exist in pre-trained V+L models. We use UNITER, one of the best-performing V+L models, as the testbed, and consolidate 7 representative V+L tasks for experiments, including visual question answering, visual commonsense reasoning, visual entailment, referring expression comprehension, image-text retrieval, GQA, and NLVR$^2$. Through comprehensive analysis, we summarize our main findings as follows.