Chris Mueller

Computer Vision

Dr. Olaf Hall-Holt, lead researcher
Aaron Etshokin, Matt Handley, David Middlecamp, Chris Mueller, Justin Von Stroh
Fall, 2004

 

The Problem

The Computer Vision problem has been around for a while. (There is a web site at Middlebury dedicated to the problem.) The basic problem is to make a computer that has two "eyes" understand itself in a three dimensional world. This is something that humans are able to do really well. We call it depth perception. (It's really nice when playing tennis, for example, so you know how far away the tennis ball is.) Computers, however, have a much harder time determining distance.

These are the left and right Tsukuba images, considered the standard in stereo vision research.

So suppose we simplify the problem a little bit. We give the computer two images: one image is for the left eye, the other is for the right eye. The trick is then to make the computer understand which parts of the left image correspond to which parts of the right image. Once that is done, we will know how far an object in the left image has shifted in the right image. Objects that shift more are closer, and there are some basic geometric properties that allow us to determine exactly how far back an object is. The hard part of this problem is matching pixels in one image to objects in the other image.

How could we determine which pixel in the left image corresponds to a given pixel in the right image? If we take the problem down to its very root, the only thing we really know about each pixel is its color. So we could look at every pixel in the right image and find the closest matching color in the left image, and proceed from there. But this produces terrible results. There is too much variation at the pixel level. A better approach is to find some sort of average of a particular region, and to try to find similar regions in the left and right images.

Our Approach

The approach we took was to try to define each image in terms of boxes. We seeded the image with boxes on a grid, then tried growing the boxes and wiggling them around and rotating them. All the while we were keeping track of the average color, and trying to grow or wiggle the box in such a way that the average color was not drastically altered. Once we had found a few thousand boxes in one image, we tried placing each of those boxes on the other image and sliding it around to find the best match in average color. We stored our best guess for each box as a "hypothesis".

But we also needed to do some pruning on our boxes. There were too many of them, and many of them overlapped. We were especially concerned with the case where two boxes overlapped about halfway: something is likely very wrong in such a situation. So we created a graph, with nodes being hypotheses, and edges between nodes being a number that indicated how likely it was that both hypotheses along that edge should be included in the final result. Our goal was then to optimize the graph so that we had as many good hypotheses (nodes) as possible, with as few bad edges as possible. This is not so easy to do, and another area of much concern for computer scientists. In fact, it is generally assumed that finding the most optimal solution to a graph problem like ours will take much longer than the lifetime of the universe, even with supercomputers. Instead of trying to find the absolute best solution, we tried to find a pretty good solution. One technique we used was the "greedy" algorithm: take the best node, then the next best, then the next best, etc. until taking new nodes only makes things worse. This produced a satisfactory result, though we developed other algorithms that were also efficient and high-scoring.

Once we had produced a solution to the graph problem, we could draw our results as a disparity map, assigning different colors to different depths in the picture. (White was the closest, black the farthest away.) Because we were using rectangles to approximate curved regions in the images, there were some places that we didn't cover. We were also missing information in places where there was a "shadow" from one image to the other: i.e., the left eye could see things that the right eye could not. Our solution was basically to look at each void in our disparity map, then to look at the pixels in the left and right images near the void, and do a best guess as to which region or box the void belonged to.

At the time that this page was published (Dec. 17, 2004), we had produced a result, using the Tuskaba images, that was 75% accurate.

Future work

There is much work to be done. We hope that this approach will eventually be competitive with some of the best stereo alogorithms.

One improvement we hope to make is to "link" boxes together. In some cases, a large background object (such as a wall) has been sliced by a small foreground object (such as a cord). It makes sense to human eyes to regard the wall as one object, but currently our algorithm regards the two portions of the wall as separate entities. If we could link those two regions together, we could produce better results. Our idea currently is to try and draw shapes that can encompass both regions and jump over the foreground objects, then to indicate which regions belong together by adding information to our graph.

We also hope to make it possible to use n-sided polygons instead of boxes. This would allow us to better handle regions with curved or convex edges.

 

CG Index Square Fountain Van Gogh Raytracer Computer Vision
© cm