We have waited for this moment for a long time. As a small startup with resource constraints but ambitious goals to reach the pinnacle of what’s possible in artificial intelligence research, we think we have come a long way in less than one and a half years.
One thing that sets us apart from a lot of similar-looking companies doing visual intelligence is the existence of Artifacia Research. We kept it active in its rudimentary form through our ups and downs, believing it would be our best investment in a sector where continuous technology innovation is all that matters. With our recent growth in customers and the backing of great technology angels, this unit started getting the time and resources it deserves.
After months of hard work, we are excited to announce our most ambitious project yet: Project Turing. The project is aimed at solving the Turing test in its new avatar, the Visual Turing test, on which we have recently made a lot of progress. We expect it to be a long-term project at Artifacia Research, benefiting all our existing and future products, and hence our customers, in innumerable ways. To reach this stage, we first built a state-of-the-art image captioning system from the ground up and applied what we learned there to build a very early prototype of a Visual Q&A system. This kind of visual description task could be used to assess a machine’s intelligence relative to a human, and so enhance the Turing test.
Image captioning is a hard AI problem that has recently started attracting a lot of interest from some of the topmost researchers in the field. Among companies working in AI, only Google Research, Microsoft Research and, more recently, Facebook AI Research seem to have made progress on it. All three have an abundance of resources and top scientists and engineers from all over the world, and we have learned a lot from the work of research groups at these companies and at universities such as Montreal, Toronto and Stanford.
Automatically describing the content of an image is a fundamental problem in artificial intelligence that combines computer vision and natural language processing. We have seen a lot of progress in object detection, classification and localization in the last couple of years thanks to recent research in deep neural networks. But accurately describing a complex scene requires a deeper representation of what’s going on in it: capturing how the various objects relate to one another and translating it all into natural-sounding language. Many researchers see image captioning as the basis for more sophisticated artificial intelligence systems that can see, hear, speak and even understand.
We humans can easily summarize a complex scene in a few words without thinking twice. It’s much more difficult for computers, but we’ve started getting closer to solving this problem. We’ve developed a machine-learning system that, after some initial training, can automatically produce captions accurately describing images it has never seen before. This kind of system could eventually help visually impaired people understand pictures and video content, and perhaps help robots navigate natural environments.
Visual Q&A System
The Turing test is a test of a machine’s ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Alan Turing proposed that a human evaluator would judge natural-language conversations between a human and a machine designed to generate human-like responses. If the evaluator cannot reliably tell the machine from the human (Turing originally suggested that the machine would convince a human 70% of the time after five minutes of conversation), the machine is said to have passed the test.
The test was introduced by Alan Turing in his 1950 paper “Computing Machinery and Intelligence,” written while he was working at the University of Manchester. It opens with the words: “I propose to consider the question, ‘Can machines think?’” Because “thinking” is difficult to define, Turing chooses to “replace the question by another, which is closely related to it and is expressed in relatively unambiguous words.” Turing’s new question is: “Are there imaginable digital computers which would do well in the imitation game?” This question, Turing believed, is one that can actually be answered. In the remainder of the paper, he argued against all the major objections to the proposition that “machines can think”. (Source: Wikipedia)
Since Turing first introduced his test, it has proven to be both highly influential and widely criticized, and it has become an important concept in the philosophy of artificial intelligence. Our goal with this project is to get closer to passing the new Turing test being developed on the basis of Visual Q&A. In this newly proposed Visual Turing test, an AI system should be able to answer questions based on the content of an image. Such a test could also inspire software and devices that describe and interact with their surroundings more like humans do; for example, satnavs that give clearer instructions, using landmarks at the side of the road to direct drivers into turnings.
How We Did It
We have been working on various image recognition problems for a while now. Last summer we experimented with scene recognition in computer vision for the first time, for which we requested MIT’s popular scene dataset, Places. We built one of our product demos using this new technology and closed a deal. Around that time we also started looking at scene understanding at a deeper level and found it fascinating: if we could build a system that understood a whole scene as well as we do, we would get closer to building a powerful AI system. We identified the problem as image captioning and discovered that only a few groups before us had tried it, with decent success only on restricted datasets such as Flickr8k, Flickr30k and MS COCO; the field is very nascent and will need more time to mature, which is why none of them has opened it up to the public. These groups included Google Research, Microsoft Research and, more recently, Facebook AI Research. We started building our own system from the ground up, as we saw there was a lot to be done and we could help move the field forward with our own contributions over the coming months and years.
A CNN Network Design
A RNN Network Design
As for the process, we merged recent computer vision and language models into a single jointly trained system that takes an image and directly produces a human-readable sequence of words describing it. We used a convolutional neural network (CNN) together with a recurrent neural network (RNN) to build this joint model. The same idea has worked really well in the machine translation community, where two RNNs can be used to translate from one language to another; the captioning system works a bit differently, but essentially uses the same approach, with the CNN encoding the image in place of a source-language encoder. Our software can currently write a caption in English describing a picture, and as you see below, the early results show a lot of promise.
The last example shows how the system screwed up. But it will learn to produce more accurate descriptions as we train it better and experiment more with our models. We expect a significant improvement after our ongoing training over a much larger dataset, which is going to take at least a month.
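To make the encoder–decoder idea above concrete, here is a minimal NumPy sketch of the decoding side: a feature vector (standing in for a CNN’s output) initializes a vanilla RNN, which then greedily emits one word at a time until it predicts an end token. The vocabulary, dimensions and weights are all toy placeholders of our choosing, not the actual model; a trained system learns these weights jointly and uses far larger vocabularies and hidden sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; a real system uses thousands of words.
vocab = ["<start>", "<end>", "a", "dog", "runs", "on", "grass"]
V = len(vocab)
feat_dim, hidden = 8, 16  # tiny sizes for illustration

# Stand-in for the CNN encoder's output (e.g. a final pooling layer).
image_feature = rng.normal(size=feat_dim)

# Randomly initialised weights; training would learn these jointly.
W_img = rng.normal(scale=0.1, size=(hidden, feat_dim))  # image -> initial state
W_emb = rng.normal(scale=0.1, size=(hidden, V))         # word embeddings
W_hh  = rng.normal(scale=0.1, size=(hidden, hidden))    # recurrent weights
W_out = rng.normal(scale=0.1, size=(V, hidden))         # state -> vocab logits

def caption(image_feature, max_len=6):
    """Greedy decoding: feed the last word back in until <end> (or max_len)."""
    h = np.tanh(W_img @ image_feature)    # condition the RNN on the image
    word = vocab.index("<start>")
    out = []
    for _ in range(max_len):
        x = W_emb[:, word]                # embed the previous word
        h = np.tanh(W_hh @ h + x)         # vanilla RNN step
        word = int(np.argmax(W_out @ h))  # pick the most likely next word
        if vocab[word] == "<end>":
            break
        out.append(vocab[word])
    return out

print(caption(image_feature))
```

With untrained weights the output is arbitrary, but the control flow — image conditioning, word feedback, greedy argmax — is the essence of how such a captioner generates a sentence.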
Our next task was to build a Visual Q&A system in the context of the Visual Turing test. We used similar base networks for this problem too, but later simplified them to suit our problem statement. Ideally, the system should answer questions about an image with enough confidence that a human evaluator finds it difficult to tell whether the answers came from a human or a machine. For the time being we have narrowed the problem down to one-word answers, and we want it to answer in full sentences after some more work on our language modelling, an area our research group has only recently started getting into. But you can still appreciate the beauty of how the system works at this stage from the examples given below.
As you can see, the system screwed up in counting the number of animals in the third example. Maybe it’s still a little weak at maths at this stage and will need to learn more about the real world!
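The one-word-answer formulation described above can be sketched as a classifier over a fixed answer set: project the image feature and a question encoding into a common space, fuse them, and pick the highest-scoring answer. This is only an illustrative skeleton under assumed names and sizes (the answer list, the bag-of-words question encoding, and all weights are placeholders); the real system uses learned networks on both sides.

```python
import numpy as np

rng = np.random.default_rng(1)

answers = ["yes", "no", "dog", "cat", "two", "red"]  # toy one-word answer set
q_vocab = {"is": 0, "there": 1, "a": 2, "dog": 3, "what": 4, "color": 5}
feat_dim, hidden = 8, 16

# Random placeholder weights; training would fit these on Q&A data.
W_img = rng.normal(scale=0.1, size=(hidden, feat_dim))   # image projection
W_q   = rng.normal(scale=0.1, size=(hidden, len(q_vocab)))  # question projection
W_ans = rng.normal(scale=0.1, size=(len(answers), hidden))  # joint -> answer logits

def answer(image_feature, question):
    """Classify a (image, question) pair into one word from `answers`."""
    # Bag-of-words question encoding; a fuller system would use an RNN here too.
    q = np.zeros(len(q_vocab))
    for w in question.lower().split():
        if w in q_vocab:
            q[q_vocab[w]] += 1.0
    # Fuse the two modalities by element-wise product of their projections.
    joint = np.tanh(W_img @ image_feature) * np.tanh(W_q @ q)
    return answers[int(np.argmax(W_ans @ joint))]

img = rng.normal(size=feat_dim)
print(answer(img, "is there a dog"))
```

Restricting to a closed answer set is what makes the one-word version tractable; moving to full-sentence answers replaces this final classifier with a language-model decoder, which is the harder step we mention above.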
We hope to come up with even more interesting results from our ongoing research and experimentation over the next few months as part of Project Turing. This is just a start. Every day we are learning something new about intelligence, memory and probably consciousness too, and we want to bring that understanding to light with software that helps people in all sorts of ways. And we have no intention of building software that could be used by bad people to power Terminators.
Thanks for reading!
P.S. We participated as team Snapshopr Research in MSCOCO 2015 challenge for the Image Captioning problem. We achieved a global rank of 14 ahead of deep learning leaders like Andrej Karpathy. In terms of corporate research teams, only Google Research and Microsoft Research were ahead of us at the time of submission.