Picking a Single Voice out of a Crowd

Squarehead’s new system is like bullet-time for sound. 325 microphones sit in a carbon-fiber disk above the stadium, and a wide-angle camera looks down on the scene from the center of this disk. All the operator has to do is pinpoint a spot on the court or field using the screen, and the Audioscope works out how far that spot is from each of the mics, corrects for delay and then synchronizes the audio from all 315 of them. The result is a microphone that can pick out the pop of a bubblegum bubble in the middle of a basketball game….

[…]

Audio from all microphones is stored in separate channels, so you can even go back and listen in on any sounds later. Want to hear the whispered insult that caused one player to lose it and attack the other? You got it.

Tags: privacy, sensors, surveillance

Posted on October 14, 2010 at 12:10 PM • 45 Comments

Comments

aPro • October 14, 2010 12:32 PM

Barring the legality/privacy concerns in a public setting this will have some cool applications. Wireless amplification for theatre performances or public address/interview situations would be interesting.

kangaroo • October 14, 2010 12:35 PM

aPro: seems an exorbitant cost for almost all usages, rather than a wireless mike.

The usage is for what the speaker doesn’t want everyone to hear, rather than the reverse.

miscounted • October 14, 2010 12:42 PM

So is it 325 or 315 microphones? The article uses both numbers.

Brian • October 14, 2010 12:48 PM

@miscounted They used 325, 315, and 300. Pick your favorite, I guess.

Bob • October 14, 2010 12:53 PM

It’s probably more than 3.

drootzler • October 14, 2010 12:55 PM

I hate Wired’s idea that surveillance can address problems relating to violence in professional sports.

Wouldn’t it be much simpler, and more humane, to give players mental and psychological tools to keep them from losing their cool to begin with?

Christian Vogel • October 14, 2010 12:56 PM

Unfortunately the article, just like the company’s website, is a little short on details. But almost certainly they built a phased array for audio. Phased arrays have been used for radar antennas, among other things, for quite some time.

http://en.wikipedia.org/wiki/Phased_array

Bob Saget • October 14, 2010 1:02 PM

This would have more applications for ascertaining the other team’s strategy than for supposed “security.”

Just think of beaming an audio feed of the opposing coach’s communications to a wireless mic of an assistant coach/staff whose primary job is “intelligence.”

Game over!

Eyegor • October 14, 2010 1:03 PM

Phased arrays are how sonar systems work. The actual number of input sources doesn’t matter too much, other than that more is usually better. Since storage is so cheap and audio files tend to be rather small, keeping the input from each microphone and forming new beams from the stored input is a great idea (other than the privacy implications).
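
The idea of keeping every channel and forming beams afterwards can be illustrated with a delay-and-sum sketch. This is a minimal Python illustration, not Squarehead's implementation: the function names, the 343 m/s speed of sound, and the rounding of delays to whole samples are all assumptions made for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 °C


def steering_delays(mic_positions, focus_point, sample_rate):
    """Per-mic arrival delay in whole samples, relative to the closest mic."""
    distances = [math.dist(mic, focus_point) for mic in mic_positions]
    nearest = min(distances)
    return [round((d - nearest) / SPEED_OF_SOUND * sample_rate) for d in distances]


def delay_and_sum(channels, delays):
    """Advance each stored channel by its delay, then average across mics.

    Sound from the focus point lines up and adds coherently; sound from
    elsewhere arrives misaligned and is averaged down.
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]
```

Because the raw channels are stored, `steering_delays` can be recomputed for any other focus point later and `delay_and_sum` run again on the same recording.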

Mike • October 14, 2010 1:50 PM

Wouldn’t this entire system be defeated by a loud ultrahigh pitch speaker?

Jerra • October 14, 2010 2:07 PM

New technology, oldish methodology…

Richard Steven Hack • October 14, 2010 2:25 PM

It looks like the dish isn’t TOO big to be put into a surveillance van – although how you’d disguise it is a question. The control panel is no bigger than a workstation.

Clearly this device has security implications and privacy implications. It’s one thing to overhear someone you’re standing next to on the street, it’s another to be able to pick out any conversation from a thousand people standing on the street or in a rail terminal or an airport.

Basically this technology is a death knell for the idea of holding any kind of confidential meeting outdoors or in any building which might house it. It was bad enough with shotgun mikes but standing in a noisy crowd could defeat those. This kills that tactic.

GregW • October 14, 2010 2:38 PM

@Richard,

Good point.

Although ironically, if you record crowd noise and play it back from a single point source at/near the same location you are speaking from, this technology breaks, since it is depending on the spatial triangulation rather than the frequency analysis to pick out the voices, right?

Clive Robinson • October 14, 2010 2:46 PM

The first beam-steerable microphone I’m aware of was developed by BBC engineers in Power Road, Chiswick (London).

They later produced a very practical system using 4 cardioid mics positioned orthogonally to each other (think points of a four-pointed pyramid). This was back in the 1970s, when DSP chips were not even a twinkle in the eye of engineers.

Based on what I had learnt from one of the Beeb engineers I later designed a system for surveillance that consisted of two metal plates separated by about two sixteenths of an inch on one side and about half an inch on the other, with a line of electret mics spaced about an inch apart down the narrow side. These went into a bunch of variable-length delay lines (bucket brigade with variable clock) and a series of analog summing circuits (there were also linear phase filters etc).

When fitted at the top of a picture or painting or cupboard door it enabled you to hear very quiet conversations in an otherwise very noisy room.

So the system as described is not exactly original; it just uses modern technology to do what in the past could not be done.

The real question, if it’s to be used for football stadia etc, is what they are going to do about “beam bending” due to refraction etc as the sun heats up some air hotter than other bits…

jgreco • October 14, 2010 2:56 PM

@GregW

Is there a reason you wouldn’t be able to combine the two? I’m not familiar with what the state of the art for picking voices out of crowds is.

mcb • October 14, 2010 3:01 PM

Wouldn’t it just be easier for the FBI to put a covert wireless microphone & GPS transmitter under the seat of every potential suspect at the arena?

Shane • October 14, 2010 3:04 PM

I can’t wait!!! Now I won’t even be able to whisper to a friend while in public!

Please, PLEASE, keep developing more and more invasive, ubiquitous, and subversive technologies to dissolve my privacy!

Shane • October 14, 2010 3:08 PM

It’s not enough that the terrorists won in 2001. They apparently have to have their victory renewed, exalted, and rewarded anew with every headline declaring an ‘interesting new technology’.

Somebody fire the %#@(%&@( idiots who worked on this garbage, and hire them to figure out a way to power the planet for another 100 years without destroying it. THAT’d be interesting.

Shane • October 14, 2010 3:16 PM

…. it is pretty neat though…

savanik • October 14, 2010 3:28 PM

While there are good arguments as to privacy concerns with this, we’re rapidly approaching the point where the barriers to privacy are no longer technological, but behavioral. In other words, countermeasures are becoming more limited, and we’ll actually need to start trusting other people to respect our privacy.

That said, I can see legitimate noncommercial applications of this technology. Stagecraft – being able to drop amplification on actors without having to mike them. Since you’re doing it on a recorded feed, you can track multiple people at the same time and sync their tracks to another camera recording the performance, removing the coughs and sneezes you might otherwise hear in a recording.

Or just general PA use – being able to mike someone without electronics seems like quite a feat. Especially if you call someone up from the audience and then instantly mike them at a moment’s notice. I can see a lot of uses for that right there.

JimFive • October 14, 2010 3:42 PM

@Kangaroo re: wireless mics for performances

I hate it when I go to a play and the performers have wireless mics, either on their cheeks or glued to their foreheads. It’s very distracting.

JimFive

aPro • October 14, 2010 3:50 PM

@JimFive @Kangaroo Also think about both live and studio recordings… The ability to ‘place your mics’ after the fact. Just trying to point out pros to the technology where the article focuses on a negative application.

Nick P • October 14, 2010 4:00 PM

I wonder what we would get if we combined this with tiny cameras, object tracking and facial recognition software. Maybe some Micro-UAV’s with silenced pistol, object tracking, and networking with the audio-visual surveillance system. This technology brings to mind many applications in the area of surveillance, but it would benefit a police state most.

Walter Underwood • October 14, 2010 4:13 PM

Interesting how the costs come down. Beam-steered phased arrays aren’t exactly new, PAVE PAWS radar was first operational in 1980.

The idea of recording all the channels and steering the beam later is pretty cool. You could probably auto-track a voice as it moved.

Jeff Bell • October 14, 2010 6:31 PM

Humans are already pretty good at picking out single voices in a busy room. It’s called the “cocktail party effect” (http://en.wikipedia.org/wiki/Cocktail_party_effect)

This is just recording in enough detail to do it computationally.

Johns • October 14, 2010 6:45 PM

I spoke with a Google Street View vehicle driver here in San Francisco who showed me a similar receptor being tested in a few select Google vehicles. He said there are multiple microphones for each camera lens and the computer in the vehicle isolates and records 8 to 12 seconds of relevant sound for each image.

Woofle • October 14, 2010 7:39 PM

@NickP
“I wonder what we would get if we combined this with tiny cameras, object tracking and facial recognition software.”

A shirt-load of false positives!

Woofle

thecoldspy • October 14, 2010 8:57 PM

Always reminds me of the Jack Tar Hotel and

“He’d kill us if he got the chance.”

How prescient they all were in The Conversation.

Roger • October 14, 2010 9:04 PM

As others have pointed out, this is just an interesting commercialisation of a technology that has been used for many years in sonar, and which was in turn inspired by a technique that has long been fairly common for RF antennas (and is now nearly universal in cell phone towers).

The idea that it is portable enough to be used to monitor conversations on the street from a moving vehicle is not true, however. The article doesn’t actually give any information about the size of the device but it’s easy to calculate that it has to be larger than a van.

To pick out a conversation from the background noise it needs to have an angular resolution on the same order as, or finer than, the typical angular spacing between noise sources. So, in a crowd with 1 person per 2 metres, to pick up an individual conversation at a distance of (say) 40 m requires an angular resolution of 0.05 radian, or 2.9°. The lower limit on angular resolution can be calculated from the Rayleigh limit equation. Many people have nearly all of their voice energy at wavelengths longer than 12 cm. So for this, we see the array needs a diameter of at least 3 m. This is the fundamental physical limit for marginal performance, at which the two sources can just be distinguished with difficulty. To get good performance, or if implementation issues mean that it performs less well than the fundamental limit (which is usually the case), it would need to be several times larger.
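
The arithmetic above can be checked with a few lines of Python. This sketch assumes the circular-aperture form of the Rayleigh criterion, θ ≈ 1.22 λ / D, and a small-angle approximation for the separation between sources; the function names are hypothetical.

```python
import math


def rayleigh_resolution(wavelength_m, diameter_m):
    """Smallest resolvable angle (radians) for a circular aperture."""
    return 1.22 * wavelength_m / diameter_m


def rayleigh_min_diameter(wavelength_m, source_spacing_m, distance_m):
    """Smallest aperture diameter (metres) that just separates two sources."""
    angle = source_spacing_m / distance_m  # small-angle approximation, radians
    return 1.22 * wavelength_m / angle


# Figures from the comment: 2 m between people, 40 m away, 12 cm wavelength.
angle = 2.0 / 40.0
print(round(math.degrees(angle), 1))                      # 2.9 (degrees)
print(round(rayleigh_min_diameter(0.12, 2.0, 40.0), 2))   # 2.93 (metres)
```

The result of roughly 2.9 m agrees with the "at least 3 m" estimate, and longer wavelengths (deeper voices) push the required diameter up in direct proportion.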

Note that this is the resolution in a direction which sees the whole width of the array. In particular, when the source is near the plane of the array, the vertical resolution would be lousy. Clearly this particular array is designed for one purpose: to be mounted on a ceiling “looking down” onto the sources it is recording. For a mobile unit you need a three dimensional structure (like the pyramidal arrangement described by Clive) — but that would be challenging to build into a moving vehicle.

The arrangement of the microphones in this array is curious. The mathematical composition of the signals is much easier if they are in a square array, and there doesn’t seem to be any constructional reason not to have built this one in that way. In fact the variation of spacing as the radius increases suggests that this array has been designed (presumably through an inverse Fourier transform) to achieve a particular beam pattern. Somebody might like to transform the array pattern back into a beam pattern to see what they were after, but I’m too busy at the moment. However, my first hunch is that this pattern is designed to give a reasonably uniform coverage in some cone beneath the array, and then a fairly sharp cut-off at the edge of the beam. That would be ideal for monitoring a sports arena as it would minimise interference from the crowd.

Roger • October 14, 2010 9:20 PM

@Walter Underwood:
“Interesting how the costs come down. Beam-steered phased arrays aren’t exactly new, PAVE PAWS radar was first operational in 1980.”

It’s a lot older than that, even. Some pre-WW2 radio telescopes could be considered to be 1-dimensional precursors of the concept, while some WW2 radars were steerable phased arrays.

“The idea of recording all the channels and steering the beam later is pretty cool. You could probably auto-track a voice as it moved.”

Also older than it looks: in radio astronomy, the VLBI project has been doing this (with RF signals) since the 1960s. Of course in the RF domain the precision of the signal’s time base needs to be extremely high, and that drove a lot of early work on maser clocks. Doing it at audio frequencies is comparatively trivial.

Roger • October 14, 2010 9:47 PM

Some more information on the size of these arrays. At Squarehead’s website, they advertise 3 models: a small one for phone conferencing, a medium one for mikeless pickup of a conference presenter, and the big one, for sports arenas.

They don’t actually list the dimensions of any of them, but the medium sized one has a photograph that shows its size in comparison to a person:
http://www.squarehead.no/DesignFiler/Conference/statoil.jpg
Assuming the speaker is of average height for an adult male, this dish is about 2.4 m across, and is picking up a presenter separated from the rest of the audience by about 0.4 radians (24°). (Compared to our prediction of 0.06 radians for the shortest wavelengths of his voice, up to 0.6 radians for the deepest notes.) It clearly is both too large for covert surveillance, and yet also has too coarse an angular resolution for recording in crowds.

Recall that the basic technology has been around for decades. The reason these things haven’t been used before for spying on people in crowds is that they are too big. This isn’t a technological limitation, it’s fundamental physics. To get finer resolution, you must either use higher frequencies (not useful for voice recordings) or use a very large collector (extremely difficult to conceal, especially for a moving target.)

I think it says something interesting when an old technology is adapted in an ingenious way for sports broadcasting, of all things, and the immediate response is “oh nos, they will use this to spy on me!” What was Bruce saying about fear?

Nick N • October 15, 2010 12:47 AM

@thecoldspy, The Conversation is the very first thing I thought of too.

BF Skinner • October 15, 2010 6:41 AM

@Clive ” due to refraction ”

Reflection, distortion, interference, diffraction, absorption, dispersion, tens of thousands of point sources from varying distances to the monitored source in an acoustic bowl? Crosstalk like no one’s business.

It would be interesting to test this to see if it would even work in the scenario described.

Redrobes • October 15, 2010 9:06 AM

It’s old technology:
http://www.youtube.com/watch?v=gGlrY46nfe4

Would it be hammered by a single high-frequency source? No, you can filter that and isolate it spatially away.

Is it better than the multi-mic stick mics used to pick up secret conversations? Not necessarily better quality, but this way is steerable in a crowd situation.

The mathematical composition is the same for any shape of array, as you have to find each mic’s phase distance to the audio source. Some arrangements offer less scope for aliasing and side lobes, so not having a fixed-spacing grid is better; a stochastic arrangement is best. Putting them onto a disk shape is probably most efficient, as mics near the edge have to be attenuated more than those at the center.

x • October 15, 2010 10:05 AM

Notice that it’s a large dish mounted high above the arena. This is no accident; this kind of technology needs proper feng shui to work well. It could not have done this from any of the places near the ground that the coach was at; it would be much more limited in that kind of placement. Versions of this are all over my neighborhood. They triangulate gunshot sounds and supposedly are also able to point the neighborhood video surveillance cameras. The accuracy is still a matter of how close the shot is to the microphone array and the other arrays around the neighborhood, and of the effects of echo and bounce from structures.
A basketball court is a very hard environment for audio recording without some kind of filter program: the high-pitched squeak of the shoes on the wooden floor, the bang and echo of the ball off the walls and ceiling. It’s just not a quiet room with plenty of carpet and stuffed furniture and drapes to moderate all the sharp sounds of feet moving.
Similar techniques could be applied to other shapes, even the grille of a vehicle. It would not be as long-range or omnidirectional, but it could be used aimed in its most functional way at some target. Of course it would have to have a dedicated program to interpret the sounds, and it might need calibration to each new location.

Douglas2 • October 15, 2010 5:34 PM

Yes, it’s old techniques made more practical by use of current technology.
I’m surprised that no-one has mentioned that for uses such as theatre and voice amplification, the bandwidth of the device must be several octaves. Three to four octaves of frequencies with good directivity would work for understanding speech, but twice that will be needed to make it sound “good”. Beam-forming becomes more difficult when you are dealing with multiple octaves of bandwidth rather than discrete frequencies.

s • October 15, 2010 10:43 PM

@mcb
I think from the GPS you’d find that stadium seats pretty much stay in the same place.

BW • October 16, 2010 4:55 PM

I will never pick a basketball game as a cover for selling the soviets secrets again!

alfred • October 17, 2010 11:41 AM

didn’t Batman have something like this, only with cellphones?

mcb • October 18, 2010 9:55 AM

@ s

“I think from the GPS you’d find that stadium seats pretty much stay in the same place.”

Can’t be too careful. Some hooligan might prankishly relocate one to the wheel well of some innocent college student’s car. Then there’d be ‘splaining to do.

Davi Ottenheimer • October 19, 2010 4:04 PM

I saw something similar a few years ago in video conferencing. The array of mics was supposed to help give the receiving end directional voice: sound corresponded in space/time to the video feed, so an image to your left with moving lips would produce sound to your left.

@Roger

Your estimates give me the impression that the easy way to defeat this is line-of-sight to the mics. Sit far enough away and put someone directly in-between you and the array and it will be unable to distinguish.

Doug Coulter • October 19, 2010 10:16 PM

This is indeed interesting, and perhaps practical.

I used to consult for a large paging equipment manufacturer and tried to sell them on the idea. They already had two way paging, and “follow me” as you walked down the hall (the speakers could be used as microphones — they even made their own speakers and they were high impedance to make that easy) and we even made a not too crude demo for things like use in airports. Not for surveillance (9/11 hadn’t yet happened), but for practical two way communication in noisy places.

We also did it going the other way. When you have a large installation, you have a ton of speakers, and doing phased array transmit you could pick someone out of the crowd and just about blow them out of their socks without disturbing anyone else much. A thousand efficient one watt paging speakers is still a kilowatt…

And not only that, like some of the more advanced radars, you could do it to more than one location at a time, with different signals and not too much cross talk. If you weren’t in one of the beams you’d hear both, a little, but it wasn’t a problem to do a few channels. All the outputs only add up in one spot where all the phases are right, and as some mentioned above, more sources are better to make the sidelobes better; it’s not just total size there, though that is important as well for the longer wavelengths (think 100 Hz or so cutoff).

They didn’t go for it as the system costs in other things (like figuring out where to point it at any time, and making that simple for the operator) and other complexities wouldn’t sell. But it was a fun demo back in 1999 or so.

What we wound up doing, which was still pretty cool, was a trick that used two microphones on each speaker, one a noise canceler (2nd order gradient) and the other a cardioid. Both heard the speaker, but only the cardioid heard the room in general, and with an adaptive FIR filter we could completely cancel the speaker sound from the cardioid, allowing full duplex hands-free comm. Even that ten buck solution was pricey for them, but some people loved it and it sold some.
This meant we didn’t have to try and cancel speaker distortion, else we could have just used the speaker input signal instead of the other microphone. We also worked up a demo to cancel the room reverb, the bane of hands-free, but it was just too expensive and had too much latency for it to be practical then. A real time version with no latency had instead a too long training time if the target was moving, which was the norm.
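
The two-microphone trick can be sketched with a normalized LMS (NLMS) filter. NLMS is an assumption here, chosen as a common adaptive FIR update; the comment does not say which algorithm was actually used. In this sketch `reference` stands in for the noise-canceling mic that hears only the speaker, `primary` for the cardioid that hears the speaker plus the room, and the tap count and step size are illustrative values.

```python
import random


def nlms_cancel(reference, primary, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively filtered copy of `reference` from `primary`.

    Returns the residual signal and the final FIR weights.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for ref, pri in zip(reference, primary):
        history = [ref] + history[:-1]
        estimate = sum(w * x for w, x in zip(weights, history))
        error = pri - estimate  # what remains after cancelling the speaker
        power = sum(x * x for x in history) + eps
        weights = [w + mu * error * x / power for w, x in zip(weights, history)]
        residual.append(error)
    return residual, weights


# Demo with a made-up 3-tap acoustic path from speaker to cardioid mic.
random.seed(0)
speaker = [random.uniform(-1.0, 1.0) for _ in range(2000)]
path = [0.6, -0.3, 0.1]  # hypothetical path, for illustration only
leak = [
    sum(path[k] * speaker[n - k] for k in range(len(path)) if n - k >= 0)
    for n in range(len(speaker))
]
residual, weights = nlms_cancel(speaker, leak)
print(max(abs(e) for e in residual[-100:]))  # close to zero once adapted
```

With a talker in the room, the residual would be the talker's voice with the speaker sound removed. The time the weights take to settle is the training time mentioned above, and it has to be paid again whenever the acoustic path changes.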

It was a good place to work, and a well-run outfit, so I guess I should say who it was: Valcom. They are kind of low profile, since most of their sales go to re-branders, but they sort of own the paging business, as well as background music and phone over LAN, which we developed for them. And all that stuff that makes it easy for a building architect to satisfy all the Americans with Disabilities Act requirements for public buildings.

Kind of sad (statement on current affairs) that now the first thing everyone thinks of is violating privacy with that kind of tech, which yes, would work. It did seem to have some very reasonable uses that were completely cool.

kme • October 21, 2010 12:14 AM

You should be able to use it in reverse, as a “voice of God” device too.

earson • October 21, 2010 10:17 PM

Noise phase cancellation is so cool. For a fairly old example of this, A/B compare King Crimson’s recording of “The Night Watch” to the song of the same name on the 2 CD compilation of the same name. It’s the same recording! Of course they probably used gates and duckers and possibly added a sonic bit hear and there. I mean here and there.

earson • October 21, 2010 10:30 PM

Oops. Forgot to state that the original “The Night Watch” is on “Starless and Bible Black.”

Schneier on Security