Robert's Stochastic thoughts

Tuesday, January 24, 2006

What is Data Mining?

I am familiar with the phrase as a pejorative. If a researcher performs a large number of statistical tests and cherry picks the one which provides the strongest evidence for his or her hypothesis, he or she has come close to fraud. A reader informed only of the selected results can be misled. This is a particularly serious problem if readers rely on standard significance tests, which assume that only one test was run. Data snooping is a less severe offence against statistics. In data snooping the researcher informally looks at the data without generating test statistics. This can still create the illusion that a statistically significant pattern has been found.
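To put a number on the cherry picking problem, here is a minimal sketch. It assumes the tests are independent and that every null hypothesis is actually true, which is a simplification; the function name and the 5% significance level are my own choices for illustration.

```python
def prob_at_least_one_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_tests tests looks 'significant'
    at level alpha when there is really nothing to find.

    Assumes independent tests and true null hypotheses throughout.
    """
    # Each test stays quiet with probability (1 - alpha); all of them
    # staying quiet has probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for k in (1, 10, 20, 100):
        print(f"{k:>3} tests: {prob_at_least_one_false_positive(k):.3f}")
```

Under these assumptions, a researcher who runs twenty tests and reports only the best one has roughly a two in three chance of having something "significant" to report.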

The phrase is suddenly all over the papers. However, it is not used as a pejorative. It definitely is not used to mean "cherry picking". It has now been widely noticed that cherry picking can lead to false conclusions.

"Data mining", in contrast, is now used to refer to mysterious information processing techniques which are very powerful, perhaps dangerously powerful. They are not described in detail, but they have something to do with huge powerful computers and data sets so massive as to be incomprehensible to mere humans. Evidently the computers search through huge amounts of data looking for suspicious patterns. Also, the computers find a huge number of such suspicious patterns.

Unfortunately, the people processing the ore from the data mine are not impressed:

"We'd chase a number, find it's a schoolteacher with no indication they've ever been involved in international terrorism - case closed," said one former FBI official, who was aware of the program and the data it generated for the bureau. "After you get a thousand numbers and not one is turning up anything, you get some frustration."

Ah yes, that sounds like the result of data mining to me. If you look through a huge amount of data concerning innocent people for suspicious patterns, you will find many cases of patterns so suspicious that the probability they are due to chance is very low, like one in a hundred thousand. If you check every phone number in the world, you will end up sending harassed FBI agents to harass thousands of innocent people. I mean, the math isn't complicated.
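The math can be sketched in a few lines. The one in a hundred thousand false alarm rate is from the paragraph above; the one billion phone numbers and the ten genuine targets are made-up round figures for illustration, not numbers from any report, and I assume every genuine target gets flagged.

```python
def expected_false_alarms(records_checked: int, true_targets: int,
                          false_alarm_rate: float) -> float:
    """Expected number of innocent records flagged purely by chance."""
    innocent_records = records_checked - true_targets
    return innocent_records * false_alarm_rate


def share_of_flags_that_are_real(records_checked: int, true_targets: int,
                                 false_alarm_rate: float) -> float:
    """Fraction of flagged records that are genuine targets,
    assuming (generously) that every genuine target is flagged."""
    false_alarms = expected_false_alarms(records_checked, true_targets,
                                         false_alarm_rate)
    return true_targets / (true_targets + false_alarms)


if __name__ == "__main__":
    # Hypothetical inputs: 1 billion numbers, 10 real targets,
    # and a 1 in 100,000 chance that an innocent number looks suspicious.
    numbers, targets, rate = 1_000_000_000, 10, 1 / 100_000
    print(expected_false_alarms(numbers, targets, rate))
    print(share_of_flags_that_are_real(numbers, targets, rate))
```

With these made-up inputs there are about ten thousand chance alarms, so only about one flagged number in a thousand belongs to a real target. That is the former FBI official's thousand schoolteachers.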

However, a 1 in 100,000 pattern is very, very impressive. Our brains are not made to understand that 1 in 100,000 is very different from 1 in 10,000 or 1 in 10,000,000. Furthermore, computers remain strange and very impressive (especially Google; I mean, how the hell can it search so fast?).

It seems that the new meaning of "data mining" is very close to the old meaning and, thus, about the same as "cherry picking" or, to be exact, mechanised cherry picking.
Its connotations were not pejorative a year or so ago. People have to learn the same lesson again and again.

1 comment:

Anonymous said...: In biology, data mining is often discussed casually as the automated search for correlations among huge datasets. After finding strong correlations you can go in and either think up hypotheses as to why the correlation exists or use graphical models or structural equation modelling to try to find a causative architecture. So the term is pejorative if one publishes the correlation alone without further analysis, but non-pejorative if one uses the mining result to generate new hypotheses. But this is only in casual conversation; I have no idea whether the bioinformatics community has a formal definition of data mining. (4:29 AM)