Eavesdropping by Visual Vibrations

Researchers are able to recover sound through soundproof glass by recording the vibrations of a plastic bag.

Researchers at MIT, Microsoft, and Adobe have developed an algorithm that can reconstruct an audio signal by analyzing minute vibrations of objects depicted in video. In one set of experiments, they were able to recover intelligible speech from the vibrations of a potato-chip bag photographed from 15 feet away through soundproof glass.

In other experiments, they extracted useful audio signals from videos of aluminum foil, the surface of a glass of water, and even the leaves of a potted plant.

This isn’t a new idea. I remember military security policies requiring people to close the window blinds to prevent someone from shining a laser on the window and recovering the sound from the vibrations. But both the camera and processing technologies are getting better.

News story.

Posted on August 8, 2014 at 11:50 AM • 25 Comments


Anura August 8, 2014 12:05 PM

“I remember military security policies requiring people to close the window blinds to prevent someone from shining a laser on the window and recovering the sound from the vibrations.”

Wouldn’t that allow people to just read the vibrations on the window blinds, or am I missing something?

Sebastian August 8, 2014 12:26 PM

As I understand the description in the link, the researchers did not use a laser microphone, which reconstructs the vibrations from the interference of a reflected laser beam with itself (https://en.wikipedia.org/wiki/Laser_microphone). Instead they use a high-speed camera to just observe the vibrations of objects in the room. One advantage is that they don’t need to shine a laser, but can just use naturally occurring light, the other is that they don’t need a particularly shiny surface.

Mark August 8, 2014 12:35 PM

The really interesting part was not just that this was done with a camera, but it could be done with a regular 60fps camera as mentioned in this part of the story:

In other experiments, however, they used an ordinary digital camera. Because of a quirk in the design of most cameras’ sensors, the researchers were able to infer information about high-frequency vibrations even from video recorded at a standard 60 frames per second. While this audio reconstruction wasn’t as faithful as that with the high-speed camera, it may still be good enough to identify the gender of a speaker in a room; the number of speakers; and even, given accurate enough information about the acoustic properties of speakers’ voices, their identities.

Robert August 8, 2014 12:45 PM

Seems kind of meaningless. I have never worked in a secure building that had windows. Secure conference rooms do not have windows or phones. They have at least double wallboard construction to minimize sound transmission.

Alex August 8, 2014 1:02 PM

Oh, I forgot to mention things like this here
An article about safely “air-gapping” a computer would be very interesting in this case.

Bob S. August 8, 2014 1:57 PM

I have always made my tinfoil hats with two or three layers of foil irregularly crumpled in anticipation of an event like this. I am looking into rubber band vibration dampeners just in case, however.

David Dyer-Bennet August 8, 2014 2:33 PM

I knew about using a laser beam to read vibration, most famously off a window but also potentially off other stuff. I’m surprised that a normal video carries enough information to reconstruct voices, though — it’s sampled at 30 frames per second, which ought to rather limit the upper frequency of the data they can capture out of it. Clearly they’re doing something more than the naively obvious!

Anura August 8, 2014 2:46 PM

@Bob S.

Not only will that not solve this problem, but tinfoil hats have been shown to amplify frequencies known to be used by the government for mind control. This is why I wear a Faraday suit. I think I’m also going to attach a white-noise generator to it and turn it up to 90 dB.

Anura August 8, 2014 2:49 PM

@David Dyer-Bennet

They used a high speed camera at 2000fps-6000fps to capture the good quality audio, and used a 60fps camera to capture the kind-of-okay, possibly useful audio.

Carl "Bear" Bussjaeger August 8, 2014 2:50 PM

Anyone know what that “quirk in the design of most cameras’ sensors” that lets them “infer information about high-frequency vibrations even from video recorded at a standard 60 frames per second” is? 60FPS seems like a rather low Nyquist rate for reconstructing voice signals.

My mil security days are long past, but as I recall, “close the blinds” was only to stop visual observation of NOFORN/CONFIDENTIAL/SECRET documents. TS/SCI stuff wasn’t to be dragged out and discussed in anything with windows, because of the laser trick, which involves bouncing a beam off any resonating, reflecting surface like a window pane. SCIF walls are supposed to have decent sound damping just for that. Or so I was told; I made a point of never needing a clearance above SECRET.

Carl "Bear" Bussjaeger August 8, 2014 2:56 PM

“As it turns out, it’s less expensive to design the sensor hardware so that it reads off the measurements of one row of photodetectors at a time. Ordinarily, that’s not a problem, but with fast-moving objects, it can lead to odd visual artifacts. An object — say, the rotor of a helicopter — may actually move detectably between the reading of one row and the reading of the next.”

Never mind. That must be what they mean. If you know the detector array scan rate, you can effectively multiply the formal rate by that. But you’d be losing audio resolution since each row scan is naturally of fewer pixels.
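A rough sketch of that multiplication, for illustration only. The 1080-row sensor and the `readout_fraction` parameter are hypothetical, and this is back-of-envelope arithmetic rather than the researchers’ actual method:

```python
def effective_row_rate(frame_rate_hz: float, rows_per_frame: int,
                       readout_fraction: float = 1.0) -> float:
    """Approximate rate (rows per second) at which a rolling-shutter
    sensor samples the scene while it is reading out a frame.

    readout_fraction is the assumed share of each frame period spent
    reading rows; 1.0 means rows are read back to back with no idle gap.
    """
    if not 0.0 < readout_fraction <= 1.0:
        raise ValueError("readout_fraction must be in (0, 1]")
    return frame_rate_hz * rows_per_frame / readout_fraction


# Hypothetical 60 fps camera with 1080 rows read continuously.
row_rate_hz = effective_row_rate(60, 1080)
print(row_rate_hz)  # 64800.0
```

Each of those samples is a single row rather than a whole frame, which is the loss of resolution mentioned above, and any idle time between frames leaves gaps in the sample stream.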

me August 8, 2014 4:21 PM

During one of The Guardian’s meetings with GCHQ the spooks mentioned in passing that “this plastic cup on the table” is a good source of audio…

Clive Robinson August 9, 2014 5:59 AM

@ Bruce

The problem with saying “plastic bag” is that there are many types of plastic bag, and most won’t work for this unless they are mechanically stiffened in some way. Crisp bags are generally made of a plastic suitable for this, as the surface holds ridges and folds and thus has edge-stiffened areas (and it’s the edges they are working with).

There is a bunch of physics behind this that appears in materials science. Importantly, there are nonlinearities, which make it difficult to judge what would and would not be suitable for this, because normal real-world experience conflicts with the physics. It’s over a third of a century since I had a need to use the physics in anger, so I’m assuming it has moved on a lot in that time.

In some respects it’s similar to the use of glass for the platters inside hard drives; most people outside the industry would assume that some metal would be more appropriate, based on their real-world experience.

@ me,

There are various ways objects behave under the influence of the mechanical vibration caused by the pressure changes that are sound. Any object will try to absorb the pressure changes by moving toward the sound source on low pressure and away on high pressure. The stiffness and thickness of the object’s surface facing the source will determine how much of the sound energy is absorbed, reflected, or transmitted through the object. If the dimensions of the object are less than 1/16th of a wavelength of the sound, then the object will react more as a solid that moves in its entirety than by resonance or flexure in the surface. However, round thin-walled objects like plastic cups exhibit complex and often unexpected modes of oscillation, which you can see in slow-motion films of wine glasses building up resonant energy before shattering under the influence of a trained singer’s voice. You can see that the frequencies it resonates at are not related to the simple measures of height or width of the object.

@ Carl,

If you think about a video signal it has three limiting frequencies of interest, the frame rate, line rate and pixel width bandwidth.

From the article you will see they are using visual knife edges on the object that move in sympathy with the audio and thus change the luminance or colour (or both) between adjacent pixels in a given scan line. If you only compare at the frame rate, your sampling frequency will be the frame rate, or double that if the video is interlaced. However, if you use the signal from several lines, then you get bursts of samples at the line rate, where the bursts repeat at the frame rate.

Further, if you have several visual knife edges you can get a multitude of bursts of signals. If these can be appropriately aligned in time, then you start to have more bursts and fill in the holes.

There is a problem though… you are using a sampled signal. The output of a sampled signal suffers from the problem of spectrum folding, where a 1.25 kHz signal sampled at 1 kHz will appear as a 0.25 kHz signal in the sampled output, as will 2.25 kHz and 3.25 kHz signals.
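The folding described above can be checked numerically. This is a minimal sketch; the function name `aliased_frequency` is made up for illustration:

```python
import math


def aliased_frequency(signal_hz: float, sample_rate_hz: float) -> float:
    """Frequency at which a sine of signal_hz appears after sampling
    at sample_rate_hz (folded into the range 0..sample_rate_hz/2)."""
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    folded = signal_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


for f in (1250.0, 2250.0, 3250.0):
    print(f, "->", aliased_frequency(f, 1000.0))  # each prints 250.0

# Cross-check: sampled at 1 kHz, a 1.25 kHz sine yields the same
# sample values as a 0.25 kHz sine.
samples_high = [math.sin(2 * math.pi * 1250.0 * n / 1000.0) for n in range(16)]
samples_low = [math.sin(2 * math.pi * 250.0 * n / 1000.0) for n in range(16)]
identical = all(math.isclose(a, b, abs_tol=1e-9)
                for a, b in zip(samples_high, samples_low))
print(identical)  # True
```

Once the samples are taken, nothing in them distinguishes the three input frequencies, which is why extra timing information (such as the line-rate bursts) is needed to unfold the spectrum.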

I can think of some tricks whereby the frequency folding can be partially unfolded with the use of the line-frequency sampling, but my brain is no longer up to doing the maths side in anything close to real time 😉 So you would be looking to find a researcher in the likes of radar signal processing to give you a good starting point.

I would note that both the NSA and GCHQ have been recruiting signal processing researchers and specialists for well over a quarter of a century, so I suspect they already know how to do this from other related systems.

@ Bob S,

Due to all the crumples and folds in your tinfoil hat, the number of visual knife edges would be very large, making it an almost ideal target for this type of eavesdropping.

@ All,

If you think about it due to the use of visual knife edges, these would also make ideal “reflection targets” unlike other methods, as the curved surface of many common objects would magnify the knife edge rather usefully…

This also means you could, using further signal processing, take the sound from several points in the room, giving in effect a stereo or better signal. This can be used to give the equivalent of VLB measurements, which make it much easier to pick out desired signals from undesired signals.

Thus you could pick up the sound of you pressing keys on your keyboard even in a room with a radio on and people talking. Oh, and it would also easily overcome the “running water” sound masking you see in spy movies…

Coyne Tibbets August 9, 2014 1:24 PM

@Carl “Bear” Bussjaeger

I suspect the “quirk” is that most CCD sensors do not scan across the visual image, as older cameras used to do. Instead, the entire image is captured by the chip as a frame, then scanned off of the captured image.

So consider a sheet of plastic, 1 foot by 1 foot that is being deformed by speech. Various parts of the plastic will be deformed by different degrees; corresponding to the physical position of the pressure waves from the sound.

The CCD then reads the entire sheet of plastic simultaneously in a frame; repeated at 60 frames per second.

Sound travels at 1125 feet per second, so the wavelength of 300 Hz speech is around 3.75 feet, which you can read as general curvature of the sheet. The wavelength of 3000 Hz is around 4.5 inches, so across the 1-foot sheet you would get two whole waveforms and 2/3 of a third.
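As a quick check of the wavelength arithmetic, here is a small sketch using the 1125 ft/s figure and the 1-foot sheet from the comment; the function and constant names are illustrative only:

```python
SPEED_OF_SOUND_FT_PER_S = 1125.0  # figure used in the comment above
SHEET_WIDTH_IN = 12.0             # the 1 foot by 1 foot sheet


def wavelength_ft(frequency_hz: float) -> float:
    """Wavelength in feet of a sound wave in air at the given frequency."""
    return SPEED_OF_SOUND_FT_PER_S / frequency_hz


low_ft = wavelength_ft(300.0)            # 3.75 feet
high_in = wavelength_ft(3000.0) * 12.0   # 4.5 inches
waves_across_sheet = SHEET_WIDTH_IN / high_in

print(low_ft, high_in, round(waves_across_sheet, 2))  # 3.75 4.5 2.67
```

By this arithmetic a 1-foot sheet spans roughly 2.67 wavelengths at 3000 Hz and well under one wavelength at 300 Hz.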

Sampling at 60 FPS, with analysis of the changes in the plastic over time, would therefore not get a complete picture of the speech waveform, but you should be able to guess enough to reconstruct something like the full speech envelope at 60 per second intervals.

Since most people speak at a rate of 200 words per minute or less, this gives you around 200 samples per word.

The result should be understandable–enough to get the sense anyway–but would definitely not be what we would think of as a normal recording.

Coyne Tibbets August 9, 2014 1:26 PM


There is a miscalculation in the next-to-last paragraph, due to sloppiness on my part. The sampling rate would be around 18 per word, not 200.
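The corrected figure follows from simple division; a sketch of the arithmetic, with an illustrative function name:

```python
def frames_per_word(frame_rate_fps: float, words_per_minute: float) -> float:
    """Average number of video frames captured while one word is spoken."""
    frames_per_minute = frame_rate_fps * 60.0
    return frames_per_minute / words_per_minute


print(frames_per_word(60, 200))  # 18.0
```

Slower speech gives more frames per word: at 100 words per minute the same camera would capture 36.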

K-Veikko August 10, 2014 3:20 AM

Sound travels at about 300 m/s. Thus, in a video filmed at 30 fps, the sound wave advances 10 meters between frames.

We can virtually increase the framerate by individually observing several objects in the same videoframe. Objects have to be at a different distance from the sound source. If we know the distance of the object from the sound source, it is possible to generate missing frames and position them (by time) into the generated “high speed” video.
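A toy sketch of that interleaving idea, under strong assumptions: the object distances are hypothetical and known exactly, every object responds to the sound identically, and the 300 m/s figure is the round number from the comment. It only computes which moments of the source signal each frame would capture:

```python
SPEED_OF_SOUND_M_PER_S = 300.0  # round figure from the comment


def source_sample_times(frame_rate_fps, distances_m, n_frames):
    """Emission times (seconds) of the sound seen at each object in each frame.

    An object d meters from the source shows, in the frame taken at
    time t, the sound that left the source at t - d / c.
    """
    times = []
    for k in range(n_frames):
        frame_time = k / frame_rate_fps
        for d in distances_m:
            times.append(frame_time - d / SPEED_OF_SOUND_M_PER_S)
    return sorted(times)


# Hypothetical objects spaced 2.5 m apart along the sound path.
times = source_sample_times(30.0, [0.0, 2.5, 5.0, 7.5], n_frames=3)
gaps = [b - a for a, b in zip(times, times[1:])]
print(round(1.0 / gaps[0]))  # 120
```

With four objects spaced a quarter of the 10-meter step apart, the merged samples land evenly at 120 per second instead of 30. Unevenly spaced objects would give unevenly spaced samples.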

vas pup August 11, 2014 10:05 AM

Is a white noise generator placed inside the room sufficient protection, overriding all informative sounds with wideband noise, or is it possible to filter out the multiple vibrations superimposed on the plastic bag based on the captured video? Thank you.

Seth Rice August 11, 2014 10:46 AM

@ Anura
The procedure you are referencing is due to the fact that some government embassies in foreign countries did have windows put in by contractors that had security-compromised glass. There was either a Nova or Frontline special around 2005 or 2007 about how the Cold War era contractors put in a small crystal that would be placed in a specific location, like the top left corner of the glass. Embedded, you wouldn’t notice it by eye, except to maybe notice a weird-looking speck or flaw in the glass. However, if you pointed a laser listening device at it from a building across the street, the crystal would reflect back at the same angle, and they could decipher the vibrations and get sound out of it. This is the same design used to measure how far away the moon is from earth. Astronauts (I want to say Apollo 15) placed some reflectors on the moon, and we can now point a laser at that spot and see exactly how much the distance has changed.

Now as for this article. It’s referencing the same concept but with cameras. So instead of relying on a laser to reflect off something, it uses the ambient light in the room to ‘transmit’ the vibrations. NPR had a pretty cool story about it on Science Friday last week, where the guy sings Mary Had a Little Lamb into a chip bag, and into a house plant. The bag of course picked up more than just the voice, like the air coming out of the guy’s mouth, pushing on the bag. But the plant pretty much just caught the sound.

Clive Robinson August 11, 2014 5:02 PM

@ vas pup,

Is a white noise generator placed inside the room sufficient protection, overriding all informative sounds with wideband noise, or is it possible to filter out the multiple vibrations superimposed on the plastic bag based on the captured video?

A white noise generator only works against a single microphone “sometimes”…

To see why, try a simple thought experiment:

Take a room of the usual size, about 8 ft high, 16 ft long, and 12 ft wide. Place a 1 kHz omnidirectional source at the top end of the room and a 1.2 kHz source at the bottom end, both producing sine waves of identical amplitude. With a suitable program it’s easily possible to make a predictive plot of exactly what pattern would be picked up in the room by an omnidirectional microphone.

What the plot will show is that the relative amplitude of the two signals is quite predictable, and experimentation generally verifies the plot.

What is obvious is that there are very few points where the amplitudes are the same, and boundary diagrams can be quite easily drawn.

Multiple microphones on the boundaries won’t pick up amplitude differences, but they will pick up delay differences, which with sine waves will be seen as phase differences. Thus by delaying the signal from one mic you can arrange for one of the two signals to be out of phase with itself; adding the two mic signals together will then cancel the out-of-phase signal, leaving the vector sum of the other signal.

Obviously two mics not on a boundary will need both amplitude correction and delay correction to achieve the same effect.
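The two-microphone cancellation can be sketched in a few lines. This is an idealized illustration, not a model of a real room: the propagation delays are hypothetical, amplitudes are taken as equal at both mics (the “boundary” case), and there are no reflections. It aligns the unwanted 1 kHz tone between the mics and subtracts, which is equivalent to putting it out of phase and adding:

```python
import math

F_NOISE_HZ = 1000.0   # tone to cancel
F_WANTED_HZ = 1200.0  # tone to keep

# Hypothetical propagation delays (seconds) from each source to each mic.
DELAYS = {
    "A": {"noise": 0.00230, "wanted": 0.00900},
    "B": {"noise": 0.00505, "wanted": 0.00610},
}


def mic(name, t, wanted_gain=1.0):
    """Signal at microphone `name` at time t: delayed noise plus wanted tone."""
    d = DELAYS[name]
    noise = math.sin(2 * math.pi * F_NOISE_HZ * (t - d["noise"]))
    wanted = math.sin(2 * math.pi * F_WANTED_HZ * (t - d["wanted"]))
    return noise + wanted_gain * wanted


def cancel_noise(t, wanted_gain=1.0):
    """Delay mic A so the noise lines up with mic B, then subtract."""
    shift = DELAYS["B"]["noise"] - DELAYS["A"]["noise"]
    return mic("A", t - shift, wanted_gain) - mic("B", t, wanted_gain)


times = [n / 48000.0 for n in range(480)]  # 10 ms at 48 kHz
noise_only_peak = max(abs(cancel_noise(t, wanted_gain=0.0)) for t in times)
full_peak = max(abs(cancel_noise(t)) for t in times)

print(noise_only_peak < 1e-9)  # True: the 1 kHz tone cancels
print(full_peak > 1.0)         # True: the 1.2 kHz tone survives
```

The wanted tone survives because its path-length difference between the two mics is not the same as the noise’s. If the two differences happened to match, both tones would cancel together.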

This gets more complicated with broadband signals, where both amplitude and phase may change differently for two dissimilar microphones. However, modern software available for PCs for studio work has sufficient signal processing ability to correct this.

So with this camera system the output from two different plastic diaphragms in different parts of the room will overcome a single white noise source.

So you use two or more noise sources in different parts of the room… However this turns into the old ECM ECCM ECCCM game of diminishing returns for both parties.

Wael August 9, 2016 12:03 AM


Read the PDF. It describes the difference between previous work and the current one. Excellent work. The PDF was slow to download today.

r August 9, 2016 9:46 PM


Glad you liked it, I strive to be useful/helpful. Sometimes I mess up though, hopefully you guys can forgive me. I know I need to be more cool, calm and collected than I am though so thank you for all of your patience with me sir. 🙂

Dirk Praet August 10, 2016 9:08 AM

@ r

I know I need to be more cool, calm and collected than I am though so thank you for all of your patience with me sir.

For what it’s worth, no hard feelings here 😎
