Wednesday, April 18, 2007

R packages with interesting datasets

The asuR package:
Functions and data sets for a lecture in Advanced Statistics using R. The functions mancontr() and inspect() in particular may be of general interest.
The alr3 package
The faraway package

Monday, April 16, 2007

Clustering Music Clips

Data from

http://www.public.iastate.edu/~dicook/stat503/music-plusnew-sub-full.csv

1 Description [from Di Cook's website]
This data was collected by Dr Cook from her own CDs. Using a Mac, she read each track into the music editing software Amadeus II, then snipped and saved the first 40 seconds as a WAV file. (WAV is an audio format developed by Microsoft, commonly used on Windows, but it is becoming less popular.) These files were read into R using the package tuneR, which converts the audio file into numeric data. All of the CDs contained left and right channels, and variables were calculated on both channels. The resulting data has 62 rows (cases) and 7 columns (variables).

• LVar, LAve, LMax: variance, average, and maximum of the frequencies of the left channel.
• LFEner: an indicator of the amplitude or loudness of the sound.
• LFreq: median of the locations of the 15 highest peaks in the periodogram.

There are 11 tracks by Abba, 11 by the Beatles and 10 by the Eels, which would be considered to be Rock, and 13 tracks by Vivaldi, 6 by Mozart and 8 by Beethoven, considered to be Classical. There are also 3 tracks by Enya, considered to be New Wave. The main question we want to answer is:

Can we group the tracks into a small number of clusters according to their similarity on audio characteristics?

This information might be used to arrange tracks on a digital music player. Other questions of interest might be:
• Do the rock tracks have different characteristics than classical tracks?
• How does Enya compare to rock and classical tracks?
• Are there differences between the tracks of different artists?
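
One way the grouping question above might be approached is k-means on standardized variables. The following is only a minimal sketch in Python (the original analysis was done in R): the toy rows are made-up numbers, not values from the music data, and the function names standardize, farthest_first and kmeans are hypothetical helpers written for this illustration. The starting centers are chosen with a simple farthest-first rule, which is an assumption made here to keep the result deterministic.

```python
import math

def standardize(rows):
    """Scale every column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / (len(col) - 1)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def dist2(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(rows, k):
    """Start from the first row, then repeatedly add the row farthest from all chosen centers."""
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda r: min(dist2(r, c) for c in centers)))
    return centers

def kmeans(rows, k, n_iter=50):
    """Alternate between assigning rows to the nearest center and recomputing the centers."""
    centers = farthest_first(rows, k)
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: dist2(row, centers[j])) for row in rows]
        new_centers = []
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(col) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's old center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Made-up two-variable rows: three tracks of one kind, then three of another.
toy = [
    [10.0, 200.0], [12.0, 220.0], [11.0, 210.0],
    [30.0, 50.0], [32.0, 60.0], [31.0, 55.0],
]
labels, centers = kmeans(standardize(toy), k=2)
print(labels)
```

Standardizing first matters because variables such as LVar and LMax are on very different scales, and without it the largest-valued variable would dominate the distances.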

Saturday, April 7, 2007

The Coma Supercluster

Still in need of an actual dataset


The Coma supercluster is a very famous supercluster. The map below is a plot of the brightest galaxies (from the Principal Galaxies Catalogue) in this region of the sky. Dominating this picture is the Virgo cluster - the nearest rich cluster in the universe and the dominant cluster in the Virgo supercluster. Above the Virgo cluster and much further away are two much richer clusters - A1367 and A1656. These are the two main clusters in the Coma supercluster.

The Coma Supercluster

Wednesday, April 4, 2007

Vehicle Silhouette (Clustering)

Spam Dataset for classification

1. Title:  SPAM E-mail Database

2. Sources:
(a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt
Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304
(b) Donor: George Forman (gforman at nospam hpl.hp.com) 650-857-7835
(c) Generated: June-July 1999

3. Past Usage:
(a) Hewlett-Packard Internal-only Technical Report. External forthcoming.
(b) Determine whether a given email is spam or not.
(c) ~7% misclassification error.
False positives (marking good mail as spam) are very undesirable.
If we insist on zero false positives in the training/testing set,
20-25% of the spam passed through the filter.
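
The two figures above (overall misclassification error, and the share of spam that gets through when no false positives are allowed) can be illustrated with a small sketch. Everything numeric here is hypothetical and chosen only to show the arithmetic; the counts and scores are not taken from the spam database, and error_rates and zero_fp_threshold are names invented for this example.

```python
def error_rates(tp, fp, tn, fn):
    """Summary rates from confusion counts, treating spam as the positive class."""
    total = tp + fp + tn + fn
    return {
        "misclassification": (fp + fn) / total,   # any wrong decision
        "false_positive_rate": fp / (fp + tn),    # good mail marked as spam
        "spam_passed": fn / (tp + fn),            # spam that reached the inbox
    }

def zero_fp_threshold(scores, is_spam):
    """Highest score of any non-spam message; flagging only above it gives zero false positives."""
    return max(s for s, spam in zip(scores, is_spam) if not spam)

# Hypothetical confusion counts for 1000 messages.
rates = error_rates(tp=350, fp=20, tn=580, fn=50)

# Hypothetical classifier scores (higher means more spam-like).
scores = [0.95, 0.90, 0.60, 0.40, 0.70, 0.20, 0.10, 0.30]
is_spam = [True, True, True, True, False, False, False, False]

cutoff = zero_fp_threshold(scores, is_spam)
spam_scores = [s for s, spam in zip(scores, is_spam) if spam]
passed = sum(s <= cutoff for s in spam_scores) / len(spam_scores)
print(rates, cutoff, passed)
```

Raising the cut-off until no good mail is flagged trades false positives for missed spam, which is the trade-off the report describes.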

4. Relevant Information:
The "spam" concept is diverse: advertisements for products/web
sites, make money fast schemes, chain letters, pornography...
Our collection of spam e-mails came from our postmaster and
individuals who had filed spam. Our collection of non-spam
e-mails came from filed work and personal e-mails, and hence
the word 'george' and the area code '650' are indicators of
non-spam. These are useful when constructing a personalized
spam filter. One would either have to blind such non-spam
indicators or get a very wide collection of non-spam to
generate a general purpose spam filter.

For background on spam:
Cranor, Lorrie F., LaMacchia, Brian A. Spam!
Communications of the ACM, 41(8):74-83, 1998.
http://www.ics.uci.edu/~mlearn/databases/spambase/

Forest Coverage dataset

http://kdd.ics.uci.edu/databases/covertype/covertype.data.html

A random subset of this dataset is available as the dataset covtest in the DAAGxtras package for R.

The data give the forest cover type for 30 x 30 meter cells, obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data.

Task: Classification

Past Usage

Blackard, Jock A. 1998. "Comparison of Neural Networks and Discriminant Analysis in Predicting Forest Cover Types." Ph.D. dissertation. Department of Forest Sciences. Colorado State University. Fort Collins, Colorado.

World Craniometric Variation

Dataset from "Model-Based Clustering of World Craniometric Variation" by Dienekes Pontikos

http://dienekes.angeltowns.net/articles/anthropologica/clustering.html

Dataset: (still looking for it)

Abstract
Model-based clustering is applied to 2,504 crania of 28 populations of recent Homo sapiens using 57 cranial metric variates. This technique uses no a priori knowledge about the population affiliation of each skull. Model-based clustering varies the number and form of the clusters and selects a “good” model, showing a balance of data fit and parsimony, using the Bayes Information Criterion. Fourteen separate clusters were identified in the best run, each of which corresponds strongly to either one of the original populations, or to a racial group. It is shown that cranial variation can be used to infer ethno-racial affiliation.
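
The abstract's idea of balancing data fit against parsimony can be sketched with the usual BIC formula. This is only an illustration under stated assumptions: the log-likelihood values are hypothetical, the parameter count assumes Gaussian mixture components with unrestricted covariance matrices (the article varies the cluster form, so this is just one case), and the sign convention used here is the one where a lower BIC is better (some software, such as the R package mclust, reports the negated quantity so that higher is better).

```python
import math

def n_params_full_gmm(n_clusters, n_dims):
    """Free parameters of a Gaussian mixture with unrestricted covariances."""
    weights = n_clusters - 1                              # mixing proportions sum to 1
    means = n_clusters * n_dims
    covariances = n_clusters * n_dims * (n_dims + 1) // 2
    return weights + means + covariances

def bic(log_likelihood, n_params, n_obs):
    """BIC = -2 log L + p log n; lower values indicate a better fit/parsimony balance."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Hypothetical maximized log-likelihoods for 1, 2 and 3 clusters
# fitted to 100 observations of 2 variables.
n_obs, n_dims = 100, 2
log_likelihoods = {1: -420.0, 2: -380.0, 3: -376.0}

scores = {
    g: bic(ll, n_params_full_gmm(g, n_dims), n_obs)
    for g, ll in log_likelihoods.items()
}
best_g = min(scores, key=scores.get)
print(scores, best_g)
```

In this toy case the third cluster improves the likelihood only slightly, so the penalty for its extra parameters outweighs the gain and the two-cluster model is selected.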