Prototypical Contrastive Transfer Learning for Multimodal Language Understanding