A Different Model of A.I. Vision

How A.I. Can Make Sense of the World Around It

Mar 19, 2020

No two people look exactly the same. No two pictures of the same person are identical either. Every day, your eyes make sense of a world that is constantly changing and never exactly repeats itself. Lighting conditions change, and angles shift ever so slightly. How is it possible that you can recognize even your own family members, if every time you see them they look a little different? It would seem that our vision is quite coarse if it ignores such differences in detail.

And yet you can spot the slightest differences between identical twins, based on a small mole or other facial feature. This implies you can, in fact, detect small details. What gives your eyes such a breadth of ability, so that in some cases you can, and do, pay attention to details, and in other cases you can completely ignore them?

If you’ve read the other articles on this blog, you’ll realize that an A.I. which learns by reasoning needs special types of senses to succeed in its environment. In this article I’ve outlined how this works for one sense: vision. Everything here also applies to all other senses.

This article is divided into three parts:

Just Enough Sense
Common and Consistent
Animation

Just Enough Sense

Let’s see how your eye can both pay attention to detail and at the same time ignore it. Below is a simplified eye. An eye is an organ that detects colour and light. The eye is covered with many smaller sensors, each of which is responsible for a tiny area of vision. Each can detect different colours or contrasts:

In reality there are a lot more sensors of different sizes, and they overlap significantly. I’ve simplified it so you can see what’s happening.

This eye is looking at an image in front of it:

Each of the circles can detect one of the following:

A primary colour: either red, green or blue.

Based on the colour it sees, the sensor will either turn on or not turn on.

Secondary colours like yellow, purple and cyan are made of combinations of primary colours. Below I’ve shown what a simplified eye would “see” looking at a house. Different sensors will activate based on what they see.

This image shows only the colour sensors.
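As a rough sketch of the colour sensors described above, the snippet below averages each channel over a small patch and turns a primary-colour sensor on when that average passes a threshold. The function name, the patch format and the threshold of 128 are assumptions made for illustration, not details from the demo.

```python
# Hypothetical sketch: one sensor per primary colour, all watching the same small patch.
PRIMARIES = ("red", "green", "blue")

def colour_sensors(patch, threshold=128):
    """Return the set of primary-colour sensors that turn on for a patch.

    patch: list of (r, g, b) pixels with values 0-255.
    threshold: assumed cut-off on the average channel value.
    """
    n = len(patch)
    means = [sum(pixel[i] for pixel in patch) / n for i in range(3)]
    return {name for name, mean in zip(PRIMARIES, means) if mean >= threshold}

# Secondary colours show up as combinations of primary sensors.
yellow = colour_sensors([(255, 255, 0)] * 4)   # red and green turn on
cyan = colour_sensors([(0, 255, 255)] * 4)     # green and blue turn on
purple = colour_sensors([(200, 0, 200)] * 4)   # red and blue turn on
dark = colour_sensors([(10, 10, 10)] * 4)      # nothing turns on
```

Each sensor is either on or off, so a secondary colour is simply the set of primaries it activates.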

Contrast: Contrast sensors behave differently from colour sensors. If the top of a contrast sensor’s coverage area is brighter than the bottom, it triggers a ‘top’ contrast; if the left is brighter than the right, a ‘left’ contrast; and so on.

Diagonal lines trigger a combination of left, right, up and down.
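The contrast sensors can be sketched in a similar way. The version below compares the average brightness of the two halves of a square patch in each direction; the `margin` parameter and the patch layout are assumptions for illustration.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def contrast_sensors(patch, margin=30):
    """Return which contrast directions fire for a square brightness patch.

    patch: 2D list of brightness values (0-255) with an even side length.
    margin: assumed minimum brightness difference before a sensor fires.
    """
    half = len(patch) // 2
    top = mean(v for row in patch[:half] for v in row)
    bottom = mean(v for row in patch[half:] for v in row)
    left = mean(v for row in patch for v in row[:half])
    right = mean(v for row in patch for v in row[half:])

    fired = set()
    if top - bottom > margin:
        fired.add("top")
    if bottom - top > margin:
        fired.add("bottom")
    if left - right > margin:
        fired.add("left")
    if right - left > margin:
        fired.add("right")
    return fired

horizontal_edge = contrast_sensors([[200, 200], [0, 0]])
diagonal_edge = contrast_sensors([
    [255, 255, 255, 255],
    [255, 255, 255, 0],
    [255, 255, 0, 0],
    [255, 0, 0, 0],
])
flat = contrast_sensors([[120, 120], [120, 120]])
```

The horizontal edge fires only ‘top’, while the diagonal edge fires both ‘top’ and ‘left’, which is the combination behaviour described above.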

When any scene is presented to the eye, some of those sensors turn on. So when you are looking in a certain direction, at a certain view, your eyes have one specific combination of active sensors for that image.

These are the basic sensors for vision. I’ve simplified away a few abilities such as detecting angles and shades of colour to keep this article short. I’ll discuss those in a later post.

With A.I., since we’re not tied to biology, we can invent any type of sensor we want. We could detect infrared, complex objects, 3D objects, or even an abstract object. The system would work in all cases.

This may be difficult to grok, so it might help to see it in action. As the eye moves over the scene below, different sensors activate.

It’s important that your eye detects the same object even when it is shown slightly different images. Fortunately, because the sensors cover a large space, the image can change slightly, or your eye can move a small distance over the same image, and it will still detect the same thing.

The bigger the sensor, the more the image can change before it no longer triggers. Earlier we worried about making a tradeoff between flexibility and detail. But since your eye has sensors of many different sizes over the same space, you can get the best of both worlds.

The larger the area a sensor covers, the more the image can change while the sensor stays on
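To make the size argument concrete, here is a deliberately simplified one-dimensional sketch. It assumes sensors of a given width tile a line without overlapping (the real sensors overlap), and asks whether a feature that shifts a few pixels is still seen by the same sensor. All names and numbers are illustrative.

```python
def active_sensor(feature_pos, sensor_width):
    """Index of the sensor whose coverage area contains the feature.

    Assumes sensors of equal width tile a 1D line with no overlap.
    """
    return feature_pos // sensor_width

def still_on(feature_pos, shift, sensor_width):
    """True if the same sensor sees the feature before and after the shift."""
    before = active_sensor(feature_pos, sensor_width)
    after = active_sensor(feature_pos + shift, sensor_width)
    return before == after

# A feature at position 17 moves 3 pixels to the right.
widths = (4, 8, 16, 32)
surviving = [w for w in widths if still_on(17, 3, w)]
```

Here the smallest sensor (width 4) loses the feature, while the sensors of width 8, 16 and 32 stay on. Keeping all sizes at once gives fine detail from the small sensors and stability from the large ones.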

How does the mind select which senses are relevant and which are unimportant?

Common and Consistent

When a friend of yours wears a new set of clothes, you don’t have difficulty recognizing him. Some things matter more than others. In this case, your friend’s face matters more than his clothes, because his clothes can change from day to day. On the other hand, in the case of twins, a particular mole or a slightly different head shape can distinguish two people who look very similar.

Looking at the two images of the same person below, you’d want to recognize them as the same person. This means ignoring everything that doesn’t matter.

On each day, your eye detects colours and shapes, with sensors of different sizes:

Looking at the signal we get from them both, your eye finds the common features.

The colour of the clothes changed, but the shape of the body remained the same. Therefore the colour sensors on the body aren’t common to both, but the contrast sensors are.
In parts where the details are the same on both days, such as the face, the high-detail contrast sensors are common. Where the details are different but the general shape is the same, such as near the hands, only the low-granularity contrast sensors are common to both.

The mind is trying to discover what stays consistent, and to ignore what changes over time. It does this by overlapping the two groups of triggered sensors and trimming away what is inconsistent.
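If each day’s view is treated as a set of triggered sensors, the trimming step is a set intersection. The sensor labels below are invented for illustration; a real system would have many more sensors, identified by position and size.

```python
def common_features(*observations):
    """Keep only the sensors that were triggered in every observation."""
    return set.intersection(*(set(obs) for obs in observations))

# Hypothetical sensor labels in the form "region:kind:detail".
day1 = {
    "face:contrast:fine",
    "face:colour:skin",
    "body:contrast:coarse",
    "body:colour:red",
    "hands:contrast:coarse",
    "hands:contrast:fine-open",
}
day2 = {
    "face:contrast:fine",
    "face:colour:skin",
    "body:contrast:coarse",
    "body:colour:blue",
    "hands:contrast:coarse",
    "hands:contrast:fine-closed",
}

common = common_features(day1, day2)
```

The clothing colour and the fine detail around the hands drop out, while the face and the coarse body shape survive.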

In the above demo, the A.I. is trying to identify the same person, “John”, as he changes clothes. The A.I. isolates the features that are consistent: in this case, his face and his general form. It gradually ignores the colour of his clothes. By the end, you’ll notice the colours around the body weaken to light grey. The colour features have been pruned, and only contrast and luminance are left. The A.I. recognizes that there should be something around the body area, but the colour is irrelevant. In contrast, the colour detection around his face remains, because his skin colour is consistent.

In some cases, two examples of the same “thing” may be so different that there is almost no overlap. This can happen with different breeds of dog, or different angles on the same dog. In this case, the mind creates a brand new group rather than trying to find what little is common between them.
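One possible way to decide between trimming an existing group and starting a new one is to measure how much two sets of sensors overlap. The sketch below uses the ratio of shared to total sensors and an arbitrary threshold of 0.3; both the measure and the threshold are assumptions, not something taken from the demo.

```python
def overlap(a, b):
    """Fraction of sensors shared by two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def learn(groups, observation, min_overlap=0.3):
    """Trim the best-matching group, or start a new group if nothing matches well."""
    scores = [overlap(group, observation) for group in groups]
    if scores and max(scores) >= min_overlap:
        best = scores.index(max(scores))
        groups[best] = groups[best] & observation  # keep only what is common
    else:
        groups.append(set(observation))            # too different: brand new group
    return groups

# Hypothetical labels standing in for triggered sensors.
groups = []
learn(groups, {"four-legs", "tail", "long-snout", "short-fur"})
learn(groups, {"four-legs", "tail", "long-snout", "long-fur"})  # similar view
learn(groups, {"two-eyes", "nose", "front-view"})               # very different angle
```

After these three observations there are two groups: the first has been trimmed to what the two similar views share, and the front view sits in a group of its own.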

Everything I’ve described above happens almost entirely in the eyes, and it all happens automatically. There is much more to the act of seeing than what your eyes do. Abstraction and thinking in high-level concepts happen elsewhere in the mind.

Animation

To react to the world around you, you need to apply these ideas not only to where things are in space, but also how things change over time.

As a pedestrian walks across your view, many things may be happening at the same time. Cars may be moving behind him, another person may walk in front of him, etc. Given such a jumble of sights, how can you separate the signal from the noise?

Look at the sample scene below, and how it changes over time. Imagine each of the sense activations over a span of time drawn on a timeline. I’ve put a dot for when each sense is activated. I’ve only selected a few dots for each frame.

[Work in progress: depict this in visualizer]

For every dot, I’m going to draw a line between it and each dot that appeared in the one second before it.

By doing this, I’ve tied the appearance of an event to the events that came before it. This way of recording a series of events not only records what happened, but also the order in which it happened.
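The linking step can be sketched as follows, assuming each activation is stored as a (time, sensor) pair. The function name and the event format are illustrative; only the one-second window comes from the description above.

```python
def link_events(events, window=1.0):
    """Link every activation to the activations that came shortly before it.

    events: list of (time_in_seconds, sensor_name) pairs.
    Returns a set of (earlier_sensor, later_sensor) pairs.
    """
    ordered = sorted(events)
    links = set()
    for i, (t_later, later) in enumerate(ordered):
        for t_earlier, earlier in ordered[:i]:
            # Only strictly earlier events inside the time window are linked.
            if 0 < t_later - t_earlier <= window:
                links.add((earlier, later))
    return links

events = [(0.0, "A"), (0.4, "B"), (0.9, "C"), (1.8, "D")]
links = link_events(events)
```

Because each link is an ordered pair, the set records which activation came first as well as which activations occurred. "D" is linked only to "C", since "A" and "B" fall outside its one-second window.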

I’m going to do the same for a slightly different version of the same scene. In this version, the first red box is not there anymore, and the third box is now blue.

Taking these two cases, I’ll do the same exercise as before of trimming away anything that doesn’t overlap. But this time, I’ll trim not just the individual sensors that didn’t turn on in both cases, but also the connections between them. By doing this I’m recording the order that was common between them too.

Green lines show the connections that were common between them

The green lines show the consistent connections between the two versions. These help you identify not just what was common, but also in what order the events happened.
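Here is a small sketch of trimming both the sensors and the connections between two versions of a scene. The timings, the sensor labels and the colour of the second box are made up for illustration; the first box being absent and the third box turning blue follow the description above.

```python
def link_events(events, window=1.0):
    """Ordered (earlier, later) sensor pairs for events at most `window` seconds apart."""
    return {(a, b) for ta, a in events for tb, b in events if 0 < tb - ta <= window}

# Version 1: three red boxes appear half a second apart.
v1 = [
    (0.0, "box1:contrast"), (0.0, "box1:red"),
    (0.5, "box2:contrast"), (0.5, "box2:red"),
    (1.0, "box3:contrast"), (1.0, "box3:red"),
]
# Version 2: the first box is missing and the third box is blue.
v2 = [
    (0.5, "box2:contrast"), (0.5, "box2:red"),
    (1.0, "box3:contrast"), (1.0, "box3:blue"),
]

common_sensors = {s for _, s in v1} & {s for _, s in v2}
common_links = link_events(v1) & link_events(v2)
```

The surviving links run from the second box to the contrast sensor of the third box, so the shared order of events is kept while the third box’s colour is dropped.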

In the case of the third box, which changed colour from red to blue, the contrast sensors were the same in both cases, but the colour sensor was different. This means that the shape is common, but the colour is not.

Interestingly, such a system would still work even if the events were slowed down or sped up (within limits).
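One way to see why this could hold, under the assumption that connections store only the order of events and not the exact gaps between them: stretching or compressing time leaves the connections unchanged as long as the events stay within the time window. The helper and the numbers below are illustrative.

```python
def link_events(events, window=1.0):
    """Ordered (earlier, later) sensor pairs for events at most `window` seconds apart."""
    return {(a, b) for ta, a in events for tb, b in events if 0 < tb - ta <= window}

def rescale(events, factor):
    """Play the same events back faster (factor < 1) or slower (factor > 1)."""
    return [(t * factor, sensor) for t, sensor in events]

events = [(0.0, "A"), (0.3, "B"), (0.6, "C")]
base = link_events(events)

faster = link_events(rescale(events, 0.5))    # gaps shrink, still inside the window
slower = link_events(rescale(events, 1.5))    # gaps grow, still inside the window
too_slow = link_events(rescale(events, 4.0))  # gaps now exceed the window
```

Playing the scene at half speed or at one and a half times the original duration produces the same connections, while stretching it fourfold pushes every gap past the window and the connections disappear.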

Examples

Shape Classifier

In this demo, which is recorded in real time, the software learns to recognize and name 2D shapes, regardless of their colour and their position in the view. You can see its guesses as they are typed out at the top-centre of the screen.

A red outline appears when it makes a mistake, as well as dark red boxes near the top of the timeline. When it makes a mistake, it is corrected. Over time, there are fewer and fewer such boxes.

Most notably, it separates the shapes into uniquely learned “objects”. Near the end of the video I click through each of the “objects” it has learned, and show the features that were determined to be the defining, i.e. common, features of that object. Colour and position are notably absent, so the impressions are all greyscale and centred. In some cases (such as the pentagon), only the outline contrast is considered relevant. This means it encountered some dark-coloured pentagons, so luminance, as a feature, was removed.

Walk Cycle Recognition

In the above video you can see the agent being trained to recognize different walk cycles. This is a purely visual process, without interpretation. It’s crucial that the agent is able to select the inputs, and order of inputs, that distinguish the walk cycles, and ignore the rest as noise. That way, once a background is added, it can ignore the background and only focus on the inputs that matter.

The video shows two important features:

The first is how it separates, or pulls out, the inputs that are relevant to recognizing an item, and ignores the rest as noise. As a bonus, doing it this way both fixes the problem of brittleness and adversarial inputs, and gives you perfect transparency into which inputs affect its decisions.
The second is how it predicts subsequent inputs based on its memory of previous inputs (the white circles).
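The prediction idea can be illustrated with a very simple stand-in: count which input tends to follow which, then predict the most frequent follower. This is not necessarily how the agent in the video works; the function names and the walk-cycle labels are hypothetical.

```python
from collections import Counter, defaultdict

def train(sequences):
    """Count, for each input, which inputs were observed right after it."""
    follows = defaultdict(Counter)
    for sequence in sequences:
        for current, nxt in zip(sequence, sequence[1:]):
            follows[current][nxt] += 1
    return follows

def predict_next(follows, current):
    """Return the most frequently observed follower, or None if unseen."""
    if current not in follows:
        return None
    return follows[current].most_common(1)[0][0]

# Hypothetical labels for the phases of one walk cycle, repeated three times.
cycle = ["contact-left", "passing-left", "contact-right", "passing-right"]
memory = train([cycle * 3])

guess = predict_next(memory, "contact-left")
wrap_around = predict_next(memory, "passing-right")
unknown = predict_next(memory, "jump")
```

After seeing the cycle a few times, the sketch predicts "passing-left" after "contact-left", and wraps around from "passing-right" back to "contact-left".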

Extra Credit: Direction

For the sake of brevity, I left out one crucial detail in this article. I’ll include it here for those who like an extra challenge.

Are you also interested in applying Artificial Intelligence to human creativity, human understanding, even human values? I’m looking to connect with others who have a similarly ambitious vision of the future of A.I., whose goal is to tap the full creative potential of human intelligence through software.