A New Paradigm for Computer Vision Based on Compositional Representation

December 11, 2020


Abstract

Deep convolutional neural networks, the state-of-the-art technique in artificial intelligence for computer vision, achieve notable success rates at simple classification tasks but are fundamentally lacking when it comes to representation.

These neural networks encode fuzzy textural patterns into vast matrices of numbers, which lack the semantic structure of human representations (e.g., "a table is a flat horizontal surface supported by an arrangement of identical legs").

This paper takes several steps towards filling these gaps. I first propose a series of tractable milestone problems set in the abstract two-dimensional ShapeWorld, thus isolating the challenge of object compositionality. I then demonstrate the effectiveness of a new compositional representation approach that identifies structure among the primitive elements comprising an image and represents this structure through an augmented primitive element tree and a coincidence list. On my object representation milestone tasks, this approach outperforms state-of-the-art benchmark algorithms in speed and structural representation while yielding comparable classification accuracy. Finally, I present a mathematical framework for a probabilistic programming approach that can learn highly structured generative stochastic representations of compositional objects from just a handful of examples.
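The abstract names a primitive element tree and a coincidence list without defining them, so the following is only an illustrative sketch of what such a representation might look like, not the paper's method. It assumes that primitives are 2D line segments, that the tree groups them into labelled parts, and that a "coincidence" is a pair of segment endpoints at the same location. The names `Node`, `leaves`, and `coincidence_list` are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations
from math import hypot
from typing import Iterator, List, Optional, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


@dataclass
class Node:
    """Node of a hypothetical primitive element tree.

    Leaves carry one primitive (here assumed to be a line segment);
    internal nodes only group their children under a label.
    """
    label: str
    segment: Optional[Segment] = None
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> Iterator["Node"]:
        """Yield leaf nodes in depth-first, left-to-right order."""
        if self.segment is not None:
            yield self
        for child in self.children:
            yield from child.leaves()


def coincidence_list(root: Node, tol: float = 1e-6):
    """List pairs of endpoints from different primitives that coincide.

    Each entry is ((label_a, endpoint_index_a), (label_b, endpoint_index_b)).
    This definition of "coincidence" is an assumption made for the sketch.
    """
    endpoints = [
        (leaf.label, i, p)
        for leaf in root.leaves()
        for i, p in enumerate(leaf.segment)
    ]
    pairs = []
    for (la, ia, pa), (lb, ib, pb) in combinations(endpoints, 2):
        if la != lb and hypot(pa[0] - pb[0], pa[1] - pb[1]) <= tol:
            pairs.append(((la, ia), (lb, ib)))
    return pairs


# Side view of a table: one horizontal top supported by two identical legs.
top = Node("top", ((0.0, 1.0), (2.0, 1.0)))
left_leg = Node("left_leg", ((0.0, 0.0), (0.0, 1.0)))
right_leg = Node("right_leg", ((2.0, 0.0), (2.0, 1.0)))
table = Node("table", children=[top, Node("legs", children=[left_leg, right_leg])])

pairs = coincidence_list(table)
```

In this toy example the tree records that the table consists of a top and a group of legs, while the coincidence list records that each leg meets one end of the top. Together they give the kind of explicit structural description that the abstract contrasts with unstructured weight matrices.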

Keywords – Deep convolutional neural networks, state-of-the-art benchmark algorithms, two-dimensional ShapeWorld, compositional objects


