VisLargeKoreaDec2000 - Visualisation for Data Mining Antony...

Visualisation for Data Mining

Antony Unwin
Department of Computer-Oriented Statistics and Data Analysis
University of Augsburg, 86135 Augsburg, Germany
[email protected]
Seoul, December 2000

Abstract

Modern computing power makes possible the analysis of larger and larger data sets, and many new methods have been suggested under the broad heading of Data Mining. Visualisation of data, of model-fitting, and of results plays an important part, but large data sets are different and new methods of display are needed for dealing with them. This paper reviews the standard problems in displaying large numbers of cases and variables, both continuous and categorical, and emphasises the need for improving current software. Much could be achieved by adding interactive tools like querying, linking and sorting to standard displays to provide greater flexibility and to facilitate a more exploratory approach.

1 What is Data Mining?

Large data sets are more and more common. Every organisation is able to collect and store vast quantities of information. Supermarkets have sales figures for individual items and for customers. Phone companies have details of every phone call made. Weather computers store records of all manner of meteorological data. Websites try to monitor internet usage. And so on and so on. There is no point in maintaining data sets unless some attempt is made to get information out of them.

Statisticians have always analysed large data sets, but what is meant by large has changed over the years with the increasing power of computers. Analyses which took months by hand fifty years ago can now be carried out in a second. Much larger data sets can be considered and new problems have arisen in consequence. Some standard statistical methods do not scale up well to the big data sets to be analysed nowadays. New ideas and new approaches are needed.

One term which has been heard more and more often in this connection in recent years is Data Mining. It is so new that not all are agreed what it might mean. David Hand has suggested that any definition should include the qualification that Data Mining is usually applied to data sets which have been collected for another purpose; in other words, that Data Mining analyses are secondary analyses of data. This has implications for the quality of the data and for the difficulties of interpreting and generalising any results obtained. Results should not be reported as if they were based on random samples from a population of interest.

Another unexpected characteristic of Data Mining to be borne in mind is that the "best" results are not likely to be the ones that are of most interest. The strongest results will either be known already or be superficially obvious. The results which were previously unknown and do not stand out require more careful elicitation and will appear further down any list of outputs from Data Mining analyses. This suggests

This note was uploaded on 02/13/2012 for the course CS 91.510 taught by Professor Staff during the Fall '09 term at UMass Lowell.

