Description
Notice that the results reported in the paper 'Deep Cross-Modal Projection Learning for Image-Text Matching' are {top-1 = 49.37%, top-10 = 79.27%},
while the results in this project are {top-1 = 42.999%, top-10 = 67.869%}, which come from a model based on MobileNet.
So, could you provide a new version based on ResNet? ^^ It would be greatly helpful for beginners like us (a rough sketch of what we mean is included below). Thanks a lot!
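For reference, here is a minimal sketch of what a ResNet-based image branch could look like, assuming a PyTorch-style setup. The class name `ResNetImageEncoder`, the 512-d `embed_dim`, and the use of torchvision's `resnet50` are illustrative assumptions, not part of this project's code; the projection size would need to match the text branch used here.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ResNetImageEncoder(nn.Module):
    """Sketch of an image branch using a ResNet-50 backbone instead of MobileNet.
    `embed_dim` (hypothetical name) is the size of the joint image-text embedding
    space and should match the text encoder used in this project."""
    def __init__(self, embed_dim=512, pretrained=True):
        super().__init__()
        backbone = models.resnet50(pretrained=pretrained)
        # Drop the ImageNet classification head; keep conv features + global pooling.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        # Project the pooled 2048-d ResNet features into the joint embedding space.
        self.proj = nn.Linear(2048, embed_dim)

    def forward(self, images):
        x = self.features(images)   # (B, 2048, 1, 1)
        x = torch.flatten(x, 1)     # (B, 2048)
        return self.proj(x)         # (B, embed_dim)

if __name__ == "__main__":
    # Quick shape check with random input; pretrained=False avoids a weight download.
    encoder = ResNetImageEncoder(embed_dim=512, pretrained=False)
    dummy = torch.randn(2, 3, 224, 224)
    print(encoder(dummy).shape)     # torch.Size([2, 512])
```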