Automatically Identifying Government Secrets

Interesting research: “Using Artificial Intelligence to Identify State Secrets,” by Renato Rocha Souza, Flavio Codeco Coelho, Rohan Shah, and Matthew Connelly.

Abstract: Whether officials can be trusted to protect national security information has become a matter of great public controversy, reigniting a long-standing debate about the scope and nature of official secrecy. The declassification of millions of electronic records has made it possible to analyze these issues with greater rigor and precision. Using machine-learning methods, we examined nearly a million State Department cables from the 1970s to identify features of records that are more likely to be classified, such as international negotiations, military operations, and high-level communications. Even with incomplete data, algorithms can use such features to identify 90% of classified cables with <11% false positives. But our results also show that there are longstanding problems in the identification of sensitive information. Error analysis reveals many examples of both overclassification and underclassification. This indicates both the need for research on inter-coder reliability among officials as to what constitutes classified material and the opportunity to develop recommender systems to better manage both classification and declassification.
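For readers unfamiliar with the two figures quoted in the abstract, here is a minimal Python sketch of the arithmetic behind them. It assumes "identify 90% of classified cables" means recall and that "<11% false positives" means the false positive rate (the share of unclassified cables wrongly flagged); the abstract does not spell out either definition. The function names and counts are hypothetical, picked only to make the numbers easy to follow, and are not taken from the paper.

```python
def recall(tp, fn):
    """Share of truly classified cables that the model flags."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Share of truly unclassified cables that the model wrongly flags."""
    return fp / (fp + tn)


# Hypothetical counts, chosen only to illustrate the arithmetic.
tp, fn = 900, 100      # classified cables: flagged vs. missed
fp, tn = 1_000, 9_000  # unclassified cables: wrongly flagged vs. correctly passed

print(f"recall = {recall(tp, fn):.2f}")
print(f"false positive rate = {false_positive_rate(fp, tn):.2f}")
```

With these made-up counts, the number of wrongly flagged cables (1,000) exceeds the number of correctly flagged ones (900), because unclassified cables are assumed to be far more numerous. How much that matters in practice depends on the real mix of classified and unclassified records.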

Tags: academic papers, AI, national security policy, secrecy

Posted on November 11, 2016 at 1:18 PM • 16 Comments

Comments

Tunchi • November 11, 2016 2:40 PM

Can they identify the acoustics of a government blowj0b?

Frank • November 11, 2016 2:48 PM

A utility patent has just been granted for a breakthrough digital security innovation : Graphic Access Tabular Entry [ GATE ], an interception-proof authentication and encryption system and method, for detailed information, go to : nmjava.com/gate

Using Common Sense to Identify State Secrets • November 11, 2016 3:53 PM

I was trained in ITAR and handling classified information. Most of it is just common sense as to what is sensitive and needs to be classified. If in doubt ask!
However, most items were simply company proprietary. Any employee would be fired if found acting with “extreme carelessness”.
As the FBI director stated, several foreign intelligence services had access to hundreds of thousands of Department of State sensitive or classified emails/documents. ‘They’ all knew years ago what Americans only recently were made aware of. Looks like all stakeholders agreed on this massive dereliction of duty.

Andrew • November 11, 2016 3:59 PM

Just as face recognition can be fooled with some glasses, classification based on a keyword filter applied to communications can be fooled with simple tricks.

Starting with letters swicth or missspe11ing – til replacing “kill the motherfucker” with “take care of the problem”, an automatic search can miss a lot. Whether it is simple keyword search, Bayesian filtering, or machine learning, these methods still search for words or word associations.
We are tens of years away from a cognitive understanding of the message, and this requires much more than simple word processing.
Based on this, automatic processing of, let’s say, 600,000 documents can be simply… bullshit.

John Peterson • November 11, 2016 5:53 PM

So the new line is: “I could tell you what’s going on, but then I’d need to unplug you…”

CallMeLateForSupper • November 12, 2016 9:03 AM

After all the bugs in such a system have been eliminated (cough), the logical follow-on system would be automatic redaction. (Big backlog of FOIA requests, doncha know.) Automating the process “could save taxpayers trillions of dollars over (an unfathomable period of) time”.

I see it so clearly[1]: Classified not identified and therefore not redacted. Document owners scramble to stop the presses, as it were. Thousands of sinister letters are printed and mailed to FOIA recipients, ordering them to return their newly acquired reams and to wipe their memories of any and all knowledge of their contents, “under penalty of law” (citation follows).

[1] My normally high SSL (Serum Skepticism Level) has been off the chart this week as a result of the November Surprise you might have read about.

Clive Robinson • November 12, 2016 10:08 AM

@ CallMeLateForSupper,

“After all the bugs in such a system have been eliminated (cough), the logical follow-on system would be automatic redaction.”

You’ve forgotten about the nature of these things and the mixture of evolutionary growth and cost-plus payments that will guarantee without any doubt that the bugs will never be eliminated, nor the system finished.

“My normally high SSL (Serum Skepticism Level) has been off the chart this week as a result of the November Surprise you might have read about.”

Whilst it might well have been a surprise for the pollsters of the MSM and some people, those taking bets on it were happy to give 25:1 odds if H Clinton won… which suggests that those with real skin in the game had much more accurate views and were thus anything but surprised…

Steve • November 12, 2016 1:17 PM

@Andrew

“Starting with letters swicth or missspe11ing – til replacing ‘kill the motherfucker’ with ‘take care of the problem’, an automatic search can miss a lot.”

We have had spelling and grammar correction for years now. Of course an automatic system would take all of this into account. This wouldn’t cover everything, of course; the article itself says it only identified 90%, but this is an impressive success rate.

albert • November 12, 2016 4:14 PM

And here we go again, doing AI without actually having AI.

When you think about millions of government documents, you gotta wonder how many are totally unnecessary and meaningless BS.

I want to see a program that can determine -that-. Then we can delete them instead of classifying them.

Problem solved.

. .. . .. — ….

Andrew • November 12, 2016 4:29 PM

@steve
https://www.google.com/search?num=40&site=&source=hp&q=missspe11ing&oq=missspe11ing&gs_l=hp.3…804.1859.0.2253.2.2.0.0.0.0.152.284.0j2.2.0….0…1c.1.64.hp..0.0.0.QTKVihn8zSw

Andrew • November 13, 2016 12:59 AM

@steve back with a full answer: what I meant is that keyword filtering often misses the real message of the document, particularly in email communications containing, for example, diplomatic language. It’s even difficult to define what should be searched for to detect something. The 90% detection rate in official documents doesn’t seem high to me…

Bruce Ediger • November 13, 2016 9:12 AM

So what happens when the Classification AI classifies things properly?

That is, it actually keeps missile ranges and intelligence methods and practices and the shape of submarine prows classified, but it lets all the career-ending blunders, the graft, bribery and nepotism and the straight up waste out in the open? Let’s face it, there’s only a very, very few real state secrets, but there’s tons and tons of blunders, graft and waste.

I think this is a real opportunity for AI – to get shut down very quickly.

It’s abundantly clear that people aren’t thinking these things through.

AC2 • November 14, 2016 8:31 AM

I wish they’d stop calling it Artificial Intelligence.

“Using statistics to identify state secrets”

mark • November 14, 2016 11:21 AM

And then there’s my take, which is the same as a lot of other “ordinary” folks: what percentage of classified documents are classified because they’re a) embarrassing to the author, or b) such a FUBAR that if it wasn’t classified, they’d be out of a career, pension, and possibly in jail?

WhiskersInMenlo • November 16, 2016 10:36 PM

“The declassification of millions of electronic records has made it possible to
analyze these issues with greater rigor and precision. ”

This is an interesting sample.
One flaw is that these are cables, not other communications. Routine cables
would have routine signatures.
A second flaw is that they are all declassified, so the bigger secrets
are omitted.

The scope and range of a classified document can be as mundane
as an invoice for toilet paper or other food, fuel or power into an
embassy or facility. At the other end, a named target or nuclear deployment
or stuff I have no clue about is not mundane.

The buckets, Classified, Secret, Top Secret, are one of the key nuts
that need to be pondered. A system that cannot classify information
on a continuous scale, rather than into three simple buckets, seems limited.

A real problem with secrets is that one leak or failure has serious impact.
A second is that secrets change. Classifications have persistence that
outlasts the secret and hopefully its value.
Another is that classifications can be retroactively increased.
Also some classifications are quantity related. For example the full and partial
APO postal delivery documents are highly classified but the APO address on a
letter is not.
This last APO example is interesting in the context of millions of records
that might allow the assembly of information that, as a large dictionary, is
very interesting even though no single snowflake in the data set is classified.

Bulk declassification may prove to be a real monster in the closet.
Interesting…

Bill93 • November 19, 2016 5:36 PM

“We have had spelling and grammar correction for years now”

And we know how well that doesn’t work; although the suggestions the program gives can be helpful, they are often wrong.

Another problem with classification that I’ve had to deal with is probably an even bigger problem for automated systems: aggregation of information.

Fact A may not be classified and person A uses it in unclassified communications. Fact B may not be explicitly classified, and person B uses it. But someone in possession of Fact A and Fact B can deduce Fact C that is explicitly classified.

If person A and person B are working closely together, they may realize the need to treat Fact B as sensitive even though it is not marked as such. An automated program will not be able to do so.

Schneier on Security