Week 5 Reading: The Evolution of Computer Vision – From Myron Krueger to OpenAI's Sora

On February 15, 2024, OpenAI released a preview of Sora, a text-to-video diffusion transformer model. With that, almost everyone will be able to generate (to an extent) the videos they imagine. We have come a long, long way since Myron Krueger's 1989 Videoplace (gosh, his implementation makes all my VR experiences look weak). In recent years, a lot of computer vision models became public and accessible: YOLO, GANs, Stable Diffusion, DALL-E, Midjourney, etc. The entire world was amazed when DALL-E showed off its inpainting functionality. However, it should be noted that such capabilities (or at least the theories behind them) have been around for ages; PatchMatch, for example, is a 2009 inpainting algorithm that later got integrated into Photoshop as the famous Content-Aware Fill tool.

What a time to be alive.

And back in 2006, Golan Levin, another artistic engineer, wrote "Computer Vision for Artists and Designers." He gave a brief overview of the state of computer vision and discussed frame differencing, background subtraction, and brightness thresholding as extremely simple algorithms that artists can utilize, then linked to some Processing code at the end as examples. I wish the writing contained a bit more how-to guidance, with figures on how to set up the Processing environment and so on.
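To show how little code these techniques take, here is a minimal frame-differencing sketch in Python with OpenCV (my own rough translation of the idea; Golan's original examples were in Processing). The webcam index 0 is an assumption:

```python
import cv2

# Grab frames from the default webcam (index 0 is an assumption)
cap = cv2.VideoCapture(0)
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev)          # pixels that changed since the last frame
    _, motion = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    cv2.imshow("motion", motion)            # white = movement
    prev = gray
    if cv2.waitKey(1) & 0xFF == ord("q"):   # press q to quit
        break

cap.release()
cv2.destroyAllWindows()
```

Background subtraction and brightness thresholding are barely more code than this, which is exactly Golan's point.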

Golan wanted to stress that, in his own words, "a number of widely-used and highly effective techniques can be implemented by novice programmers in as little as an afternoon," bringing the power of computer vision to the masses. However, getting computer vision to the masses runs into certain challenges... mainly not technology, but digital literacy.

The Digital Literacy Gap in Utilizing Computer Vision

From observation, a stunning number of people (including the generation that grew up with iPads) lack basic digital literacy. There are some "things" you are expected to figure out yourself once you have used a computer for a while: to select multiple files at once, hold the Ctrl key and click on the files; on Windows, your applications are most likely installed in C:\Program Files (x86); if an app is not responding, fire up Task Manager and kill the process on Windows, Force Quit it on macOS, or use the pkill command on Linux; if you run an application and the GUI is not showing up, it is probably running as a process in the system tray; etc., etc.

However, many of the masses who have used computers daily for nearly a decade (a.k.a. my dad, and a lot more people, even young ones) still struggle to navigate around their machines. For them, Golan Levin's article is not a novice programmer tutorial but already an intermediate one: you have to have installed Processing on your computer, set up Java before that, and so on. Personally, I feel that a lot of potential artists give up on integrating technology because of the barrier to entry of environment setup (for code-based tools and computer vision). Hence, as soon as an enthusiastic artist tries to run some OpenCV code from GitHub and their computer says "Could not find a version that satisfies the requirement opencv" (pip fails because the package is actually named opencv-python), they just give up.

Nevertheless, things are becoming a lot more accessible. Nowadays, if you want to do this kind of computer vision processing but don't want to touch code, there are Blender Geometry Nodes and Unity Shader Graph, where you can drag nodes around instead of typing. For code demonstrations, there is Google Colaboratory, where you can run Python OpenCV code without dealing with any Python dependency errors (and even get GPUs if your own computer is not powerful enough).
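As an illustration, here is a brightness-thresholding snippet along the lines of what you could paste into a Colab cell and run as-is (cv2, numpy, and matplotlib come preinstalled there); the image URL is just a placeholder:

```python
import urllib.request

import cv2
import numpy as np
import matplotlib.pyplot as plt

# Placeholder URL: swap in any image you like
url = "https://example.com/sample.jpg"
data = np.frombuffer(urllib.request.urlopen(url).read(), dtype=np.uint8)
img = cv2.imdecode(data, cv2.IMREAD_GRAYSCALE)

# Brightness thresholding: pixels above 128 become white, the rest black
_, mask = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)

plt.imshow(mask, cmap="gray")
plt.axis("off")
plt.show()
```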

Golan mentioned: "The fundamental challenge presented by digital video is that it is computationally 'opaque.' Unlike text, digital video data in its basic form — stored solely as a stream of rectangular pixel buffers — contains no intrinsic semantic or symbolic information." This is no longer true in 2024: you can run semantic segmentation, or plug your image into a transformer model, and have every pixel labeled. Computers are no longer dumb.
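To make that concrete, here is a minimal sketch of per-pixel labeling with a pretrained semantic segmentation model from torchvision (the image path is a placeholder assumption):

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained DeepLabV3: predicts one of 21 Pascal VOC classes per pixel
model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
model.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("photo.jpg").convert("RGB")   # placeholder image path
batch = preprocess(img).unsqueeze(0)

with torch.no_grad():
    out = model(batch)["out"]                  # shape: [1, 21, H, W]
labels = out.argmax(dim=1).squeeze(0)          # per-pixel class index, [H, W]
print(labels.unique())                         # classes found in the image
```

A stream of rectangular pixel buffers comes out the other end with a semantic label on every pixel, which is exactly the "opacity" Golan was describing, dissolved.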

The Double-Edged Sword of User-Friendly Computer Vision Tools

With computer vision and image generation tools such as DALL-E, you can type text to generate images, with limitations of course. I had an amusing time watching a friend try to generate his company logo in DALL-E with the text in it: it failed to spell the name correctly, and he kept typing the prompt again and again, getting more and more frustrated with the wrong spelling.

In such cases, I feel that technology has gone too far. This is the type of computer vision practitioner that these new generations of easy tools are going to produce: one who will never bother to open up an IDE and try coding a few lines, or to just get Photoshop or GIMP and place the letters themselves. Just because the tools get better does not mean you don't have to put in any effort to get quality work. The ease of use of these tools might discourage people from learning the underlying principles and skills, such as basic programming or graphic editing.
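For the logo case above, "placing the letters yourself" is a few lines of Pillow; the file names, font, and text here are placeholders for illustration:

```python
from PIL import Image, ImageDraw, ImageFont

# Placeholder paths: the AI-generated base image and any .ttf font you have
logo = Image.open("generated_logo.png")
font = ImageFont.truetype("DejaVuSans-Bold.ttf", 48)

draw = ImageDraw.Draw(logo)
draw.text((40, 40), "ACME Corp", font=font, fill="white")  # spell it yourself
logo.save("logo_fixed.png")
```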

However…

The rate of improvement of these tools is really alarming. 

Initially, I was also gonna say the masses need to step up their game and upgrade their tech skills, but anyway... at this rate of improvement in readily available AI-based computer vision tools, computer vision may really have reached the masses.
