Robert's Stochastic thoughts

Tuesday, May 23, 2006

Kevin Drum is very smart, but I think the older of two successive posts undermines the analogy in the newer post.

First, Drum quotes Seymour Hersh, who is outstanding as always. Hersh has sources who confirm the general guess about what the NSA was doing with calling records:

The N.S.A. also programmed computers to map the connections between telephone numbers in the United States and suspect numbers abroad, sometimes focussing on a geographic area, rather than on a specific person — for example, a region of Pakistan. Such calls often triggered a process, known as “chaining,” in which subsequent calls to and from the American number were monitored and linked.

The way it worked, one high-level Bush Administration intelligence official told me, was for the agency “to take the first number out to two, three, or more levels of separation, and see if one of them comes back” — if, say, someone down the chain was also calling the original, suspect number. As the chain grew longer, more and more Americans inevitably were drawn in.

Drum simply comments: "Hersh's source says that this eavesdropping is a 'violation of the spirit of the law.' But if the program works the way Hersh says it does, it doesn't violate the 'spirit' of anything. It just flatly violates the law." And of course he is right.

My thought is that the statistical analysis is not only illegal but also almost certainly ineffective. It makes sense to "chain" calls to someone the NSA has good reason to suspect is an al Qaeda operative. The FISA court would almost certainly approve such analysis (remember, the FISA court had never ever denied a requested warrant when the program began).

It makes much less sense to "chain" calls to a region in Pakistan, or to numbers suspected of being related to al Qaeda on very little evidence. The initial guess is feeble. The analysis is based on that guess. The data available to the NSA can't help them if their guess is wrong. Consider the region in Pakistan. First, it is possible that al Qaeda operatives in the USA avoid calling Pakistan, since the NSA strategy is obvious (and they surely have what turns out to be a not so paranoid distrust of US privacy protections). They might call, say, Hamburg instead. Second, they might use one phone for calls up the hierarchy and another for calls down the hierarchy, making chaining useless.

Third, and this is the main point, many people in the USA call that region in Pakistan for perfectly innocent reasons. They might be immigrants or temporary migrants from that region, they might be US-born and calling a US expat (betcha the NSA did a good bit of spying on private calls of CIA agents), or they might be someone in the USA who met a Pakistani at college or something. The initial evidence of al Qaeda affiliation is clearly very weak. A pattern of calls found by chaining makes it no stronger. Say there is a group of people in the USA who call each other and the region in Pakistan. The innocent explanations all still apply. You have something which al Qaeda operatives might or might not do and which many other people definitely do. No huge data set nor powerful computer can get around that problem.
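To put the base-rate problem in rough numbers, here is a minimal sketch of the Bayes' rule arithmetic. Every number in it is made up purely for illustration (I have no real figures), and the names posterior, p_call_if_operative and so on are my own inventions, not anything the NSA uses.

```python
def posterior(prior, p_signal_if_operative, p_signal_if_innocent):
    """Bayes' rule: probability of being an operative given the signal."""
    hit = prior * p_signal_if_operative
    false_alarm = (1.0 - prior) * p_signal_if_innocent
    return hit / (hit + false_alarm)

# All numbers below are hypothetical, chosen only to show the arithmetic.
prior = 1e-6                 # assumed share of US phone users who are operatives
p_call_if_operative = 0.5    # operatives "might or might not" call the region
p_call_if_innocent = 0.001   # assumed share of innocent people who call it

after_call = posterior(prior, p_call_if_operative, p_call_if_innocent)

# Assumption: belonging to a group of people who call each other is just as
# likely for innocent callers as for operatives (likelihood ratio of 1).
p_group_if_operative = 0.9
p_group_if_innocent = 0.9
after_chain = posterior(after_call, p_group_if_operative, p_group_if_innocent)

print(f"after the call to the region: {after_call:.6f}")
print(f"after the chaining pattern:   {after_chain:.6f}")
```

With these made-up inputs the probability after the call is still well under one in a thousand, and the chaining pattern leaves it where it was, because a signal that is equally common among the innocent and the guilty carries no information.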

Now consider individual suspect phone numbers. You can find a group of people who call each other, some of whom are al Qaeda suspects. The pattern of a group of people who call each other, so that the chains of calls intersect, is perfectly normal. Members of the same terrorist organisation might or might not do that, but groups of friends and acquaintances certainly do. The analysis has not added discriminating capacity to the original guess.
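For concreteness, here is a toy sketch of "chaining" as Hersh's source describes it: start from the American number, go out a few levels of separation, and see whether anyone down the chain also calls the suspect number. It is a plain breadth-first search of my own devising, not the NSA's actual method, and the call records and names (family, cell, "abroad") are invented for illustration.

```python
from collections import deque

def build_graph(calls):
    """Turn a list of (caller, callee) pairs into an undirected call graph."""
    graph = {}
    for a, b in calls:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def chain(calls, suspect, first_number, levels):
    """Breadth-first search out to `levels` hops from `first_number`.

    Returns (hops, came_back): everyone drawn in with their distance, and
    those down the chain who also call the suspect number.
    """
    graph = build_graph(calls)
    hops = {first_number: 0}
    queue = deque([first_number])
    while queue:
        number = queue.popleft()
        if hops[number] == levels:
            continue  # do not expand past the requested depth
        for other in graph.get(number, ()):
            if other != suspect and other not in hops:
                hops[other] = hops[number] + 1
                queue.append(other)
    came_back = {n for n in hops
                 if n != first_number and suspect in graph.get(n, ())}
    return hops, came_back

# Hypothetical call records: an innocent family with a relative abroad ...
family = [("abroad", "ann"), ("ann", "bob"), ("bob", "cat"),
          ("cat", "abroad"), ("ann", "dan")]
# ... and an imagined cell with exactly the same calling structure.
cell = [("abroad", "p1"), ("p1", "p2"), ("p2", "p3"),
        ("p3", "abroad"), ("p1", "p4")]

family_hops, family_back = chain(family, "abroad", "ann", levels=2)
cell_hops, cell_back = chain(cell, "abroad", "p1", levels=2)
print(family_hops, family_back)
print(cell_hops, cell_back)
```

The two groups produce the same picture: four people drawn in and one chain that "comes back". Nothing in the call records distinguishes the family from the cell, which is all the search has to go on.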

I think the only way to find an al Qaeda specific pattern is with a lot of data on al Qaeda operatives. They might be careless and call according to simple rules. With data on thousands of al Qaeda operatives in the USA it might be possible to learn those rules. This clearly has nothing to do with the real world. However, it has a lot to do with the unfortunate analogy in Drum's next post:

Take a different, but equally incendiary example. Suppose that we could semi-reliably create a statistical portrait of child molesters: their age, geographical location, gender, and calling and buying patterns. Suppose they tend to rent certain kinds of videos, make phone calls to certain kinds of chat lines, and call up other known child molesters.

Needless to say, the FBI could track these patterns using the same methods as the NSA and then exploit the results to create lists of "possible child molesters." And it might work. But would we be OK with the FBI tapping someone's phone just because they fit a statistical profile? Or staking out their house? Or investigating their friends?

He asks if we would find such a use of statistics acceptable (I say yes, except for using race, ethnicity, religion, etc. as variables). However, the analogy is totally faulty. Sad to say, we have a huge known sample of child molesters. With that sample, we can find patterns (one commenter claims that child molesters are more likely than non-molesters to be Star Trek fans). We do not have a huge sample of al Qaeda sleepers in the USA. In fact, I'm not sure any have been identified. We can't use any available data set to estimate the typical behavior of al Qaeda operatives in the USA. The NSA can make wild guesses based on other data (like that they call a region in Pakistan), but it can't evaluate those guesses and improve on them.

To me this point is blindingly obvious, but no one seems to mention it. Aside from whether we want to pry with statistics, statistical analysis is useless without data on the phenomenon of interest.
