The UCSD automatic cameraman
|A short movie introducing the automatic cameraman|
The automatic cameraman (TAC) is an experimental multi-modal adaptive computer interface. Using TAC, people can interact with a computer through speech and gestures. Ultimately, we would like to have a system through which people can play games, create music, and interact with others through the Internet, all without holding any special device such as a keyboard, a computer mouse, or a cell phone.
The system localizes people when they speak using sound source localization, points the PTZ camera at the speaker, detects the person's face in the video, and records the audio and video for as long as the face is in view. The result is a system that anybody can use to record a video of themselves. The videos are then made available here.
The system consists of two loosely coupled computer systems. The first implements the functionality described above using a multi-core workstation. A second system, intended for fast, low-latency reaction, is based on a video-processing FPGA (Field-Programmable Gate Array). The goal of the FPGA implementation is to achieve a reaction time shorter than 1/10 of a second, so as to engage the subconscious eye-hand coordination of the user and make the interaction natural and smooth.
A paper about this project appeared in ICMI-MLMI 2009: Detecting, tracking and interacting with people in a public space. Here is the poster presented at the conference.
Controlling the recording
|Yoav demonstrating how to start and stop a recording|
As the detection and localization abilities of the system improved, we started getting complaints from people who were being recorded when they would rather not be. To solve this problem we implemented the following simple protocol: to start a recording, the detected person raises a hand and places it so that its image covers a small square marked on the screen, a short distance from their detected face. To stop the recording, the user raises their other hand and places its image over a small square on the other side of the detected face. This seems to work pretty well. If you look at the videos that are recorded now, you'll see that they always start with one hand raised, and usually end with the other hand raised. The recording also stops if the system can no longer detect the face of the person.
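The protocol above amounts to a small state machine: a skin-covered square near the face starts the recording, the square on the other side (or losing the face) stops it. Here is a minimal sketch of that logic; the function names, the binary skin mask, and the 70% coverage threshold are our illustrative assumptions, not the exact code running in TAC.

```python
import numpy as np

def square_covered(skin_mask, square, min_coverage=0.7):
    """True if at least `min_coverage` of the square's pixels are
    classified as skin, i.e. a hand is placed over the square."""
    x, y, size = square
    region = skin_mask[y:y + size, x:x + size]
    return region.mean() >= min_coverage

def recording_state(prev_recording, start_covered, stop_covered, face_visible):
    """One step of the start/stop protocol: the start square begins a
    recording, the stop square (or losing the face) ends it."""
    if not face_visible:
        return False
    if not prev_recording and start_covered:
        return True
    if prev_recording and stop_covered:
        return False
    return prev_recording
```

In practice the two squares would be positioned each frame relative to the detected face box, so they move with the person.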
Sound source localization
The source localizer measures the relative time delay of arrival between pairs of microphones (TDOA values). The TDOA is defined for each pair of microphones; as we have 7 microphones, we have 21 TDOA values. If we know the locations of the 7 microphones and the TDOA values, we can calculate the location of the sound source (in fact, 4 microphones are enough for doing that). Given the location of the speaker and the location and orientation of the camera, we can calculate the pan and tilt angles required for the camera to point towards the speaker. All this seems simple enough, but is quite hard to achieve in practice because of measurement errors in the TDOAs and in the relative locations of the microphones and the camera.
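A standard way to estimate a single TDOA is to locate the peak of the cross-correlation between two microphone signals (real systems often use a whitened variant such as GCC-PHAT for robustness; the plain version below is a sketch). The pair count 21 = C(7,2) follows directly:

```python
import numpy as np
from itertools import combinations

def tdoa_samples(sig_a, sig_b):
    """Estimate the delay (in samples) of sig_b relative to sig_a
    by locating the peak of their cross-correlation."""
    corr = np.correlate(sig_b, sig_a, mode="full")
    return int(np.argmax(corr)) - (len(sig_a) - 1)

# With 7 microphones there are C(7,2) = 21 microphone pairs,
# hence 21 TDOA values.
n_pairs = len(list(combinations(range(7), 2)))
```

Converting a TDOA in samples to one in seconds is just a division by the sampling rate; multiplying by the speed of sound then gives the path-length difference used in geometric localization.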
Instead, we use machine learning to learn the direct mapping from TDOA values to Pan-tilt angles. The basic idea of this method is described in this paper: [Evan Ettinger and Yoav Freund. Coordinate-Free Calibration of an Acoustically Driven Camera Pointing System].
This initial work used a linear mapping from the TDOA vector to the pan-tilt values. A linear approximation is appropriate for a small region inside the room, but not for the room as a whole. To extend the mapping to the whole room we need to allow for non-linearities. A simplifying factor is that all correct TDOA vectors lie close to a 3-dimensional manifold in this 21-dimensional space, and the mapping to pan-tilt values is a smooth mapping on this manifold. To approximate the 3D manifold we use the RP-tree algorithm of Dasgupta and Freund, described in the following papers: [Yoav Freund, Sanjoy Dasgupta, Mayank Kabra, Nakul Verma / Learning the structure of manifolds using random projections / NIPS 2007], and in full detail here: [Sanjoy Dasgupta and Yoav Freund / Random projection trees and low dimensional manifolds / UCSD Technical Report CS2007-0890]
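The core of an RP-tree is very simple: project the points in a cell onto a random direction and split near the median of the projections, recursing on each half. The sketch below conveys that idea only; the published algorithm additionally uses jittered split points and a cell-diameter rule, which we omit here.

```python
import numpy as np

def build_rptree(points, max_leaf=20, rng=None):
    """Recursively split `points` (an n x d array) by projecting onto
    a random unit direction and splitting at the median projection.
    A simplified sketch of the RP-tree idea."""
    rng = rng or np.random.default_rng()
    if len(points) <= max_leaf:
        return {"leaf": points}
    direction = rng.standard_normal(points.shape[1])
    direction /= np.linalg.norm(direction)
    proj = points @ direction
    split = np.median(proj)
    left, right = points[proj <= split], points[proj > split]
    if len(left) == 0 or len(right) == 0:  # degenerate split
        return {"leaf": points}
    return {"dir": direction, "split": split,
            "left": build_rptree(left, max_leaf, rng),
            "right": build_rptree(right, max_leaf, rng)}
```

Each leaf then holds a small, roughly local patch of the data, on which a separate linear map from TDOA vectors to pan-tilt angles can be fit.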
|Detecting human faces using TAC.||Identifying regions of human skin using TAC.|
Another important component is the face detector. We currently use the face detector of Viola & Jones, augmented with skin color detection. We are currently working on a new face detector that uses histogram-of-gradients features instead of Haar features. Here are some results from our new face detector; they are very encouraging.
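The Viola-Jones detector is available off the shelf (e.g. OpenCV's `cv2.CascadeClassifier`). The skin color part can be as simple as a per-pixel rule in RGB space; the commonly used heuristic below is only an illustration, not the classifier TAC actually runs.

```python
import numpy as np

def skin_mask(rgb):
    """Classify pixels of an RGB image as skin using a simple
    rule-based test (a common heuristic, shown for illustration)."""
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    spread = rgb.max(axis=-1).astype(int) - rgb.min(axis=-1).astype(int)
    return ((r > 95) & (g > 40) & (b > 20) & (spread > 15) &
            (abs(r - g) > 15) & (r > g) & (r > b))
```

Combining the two signals (face box from the cascade, skin fraction inside the box) helps suppress the false positives each detector produces on its own.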
Some Potential Applications
Using Skype through TAC
We have tried using TAC as the front end for Skype. It works reasonably well in terms of tracking the video. However, the audio is pretty poor and full of echoes. We plan to use sound source reconstruction through beamforming to focus the audio on the speaker. We believe we can achieve higher quality than speaker phones because we combine several microphones and we learn the geometry and acoustics of the room.
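The simplest beamformer is delay-and-sum: shift each microphone signal by its known arrival delay so the speaker's sound lines up across channels, then average. Sound from the speaker's direction adds coherently while echoes and noise tend to cancel. A minimal sketch, assuming integer-sample delays:

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Align each microphone signal by its per-mic arrival delay
    (in samples, relative to a reference) and average: a basic
    delay-and-sum beamformer."""
    out = np.zeros(len(signals[0]))
    for sig, d in zip(signals, delays):
        # Advance by d samples to align; np.roll wraps around,
        # real code would zero-pad instead.
        out += np.roll(sig, -d)
    return out / len(signals)
```

The per-microphone delays come straight out of the TDOA machinery above, which is why learning the room geometry and the array calibration pays off twice.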
|Mayank and Yoav using Skype through TAC|
Playing Music Using TAC
The cameraman can also act as an interactive generative music system. Music is constrained by a melodic generator which requires no training on the user's part. The system is controlled by human motion, passively detected by TAC rather than any special input device.
A change detector generates the input signals for the music generator: TAC feeds video into the change detector, and the generator computes the average central point (on the horizontal axis) of the detected motion. Movement of this central point determines the direction (up or down) of the notes played.
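The motion-to-notes mapping can be sketched in a few lines: difference consecutive grayscale frames, take the horizontal center of mass of the changed pixels, and turn its movement into a pitch direction. The threshold value and the choice of rightward motion meaning "up" are our assumptions for illustration.

```python
import numpy as np

def motion_centroid_x(prev_frame, frame, threshold=25):
    """Horizontal center of mass of the pixels that changed between
    two grayscale frames; None if nothing moved."""
    changed = np.abs(frame.astype(int) - prev_frame.astype(int)) > threshold
    ys, xs = np.nonzero(changed)
    return xs.mean() if len(xs) else None

def note_direction(prev_x, x):
    """Map movement of the motion centroid to a note direction:
    +1 for up, -1 for down, 0 for hold."""
    if prev_x is None or x is None or x == prev_x:
        return 0
    return 1 if x > prev_x else -1
```

The melodic generator then constrains which pitches are actually played in each direction, which is what makes the result sound musical without any training on the user's part.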
|Noah creating music using TAC.|
FPGA based architecture
All of our current applications have a delay of about 0.2-0.25 seconds between a change in the video and the initial response on the screen; the response of the annotation comes an additional 0.1 seconds later (see the slow-motion movie on the right). This delay will necessarily increase as we make the analysis more sophisticated. On the other hand, for an interactive audio-visual system to feel responsive, the delay has to be shorter than 1/10 of a second. In order to achieve such a short delay between camera input and display response we are developing a dedicated hardware approach using FPGAs.
|A recording of the FPGA based video change detector in action|
Some measurements of time delays are recorded here: TAC_Delay
- [new movies from the 4th floor]
- [Badly synced movies from the 4th floor]
- [Movies from the old system, in the lab and in NIPS]
|Yoav Freund||Evan Ettinger||Sunsern Cheamanunkul||Matt Jacobsen||Patrick Lai|
- Noah Tye
- Thanks to Steve G. at CamTwist for providing a great camera virtualization solution.