hey there.

Just wondering if anyone could lend their expertise on cluster analysis.

Basically I am trying to see if I can find 5-10 customer segments based on shopping habits.

I have a table (a single customer view) of 10,000 randomly selected customers, with a column for each category containing a value from 1 to 4, denoting that they have transacted in that category 1, 2, 3 or 4+ times. The cut-off at 4 is there to limit the influence of outliers, and I have normalised the values by converting them to z-scores.
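In case it helps, this is roughly how I've prepared the data — a minimal sketch in Python with pandas (an assumption about my tooling; the column names and counts here are made up for illustration):

```python
# Sketch of the capping and z-scoring step (pandas assumed; columns are
# hypothetical stand-ins for my real category columns).
import pandas as pd

df = pd.DataFrame({
    "grocery":  [1, 1, 2, 7, 3],   # raw transaction counts per category
    "clothing": [5, 2, 1, 1, 2],
})

capped = df.clip(upper=4)                    # 4 means "4 or more"
z = (capped - capped.mean()) / capped.std()  # per-column z-scores
print(z)
```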

I'm looking to use a two-step approach to cluster this large data set effectively: first partitioning it into "sub-clusters" with k-means, then using agglomerative hierarchical clustering (AHC) to cluster those sub-clusters.
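To be concrete, the pipeline I have in mind looks something like this — a sketch using scikit-learn (an assumption, since I haven't settled on a tool; the data here is random and the cluster counts are placeholders):

```python
# Two-step clustering sketch: k-means into sub-clusters, then
# agglomerative clustering on the sub-cluster centroids.
# (scikit-learn assumed; 8 columns and 100 sub-clusters are made-up values.)
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(42)
X = rng.standard_normal((10_000, 8))  # stand-in for the z-scored table

# Step 1: k-means into many small sub-clusters.
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)

# Step 2: AHC on the sub-cluster centroids, down to 7 segments.
ahc = AgglomerativeClustering(n_clusters=7, linkage="ward")
sub_labels = ahc.fit_predict(km.cluster_centers_)

# Map each customer to its final segment via its sub-cluster.
final_labels = sub_labels[km.labels_]
print(np.bincount(final_labels))  # customers per final segment
```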

The problem I have is that every time I run k-means I get a massively different result, especially if I randomly shuffle the data set before each run, so by the time the sub-clusters are fed into the AHC the damage is already done, so to speak.
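This is how I've been checking the run-to-run instability — comparing two k-means runs with different seeds using the adjusted Rand index (scikit-learn assumed; the random data is just for illustration):

```python
# Quantifying k-means run-to-run agreement with the adjusted Rand index.
# (scikit-learn assumed; n_init restarts within each run are meant to
# reduce sensitivity to the random initialisation.)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
X = rng.standard_normal((2_000, 8))  # stand-in for the z-scored table

a = KMeans(n_clusters=50, n_init=20, random_state=0).fit_predict(X)
b = KMeans(n_clusters=50, n_init=20, random_state=1).fit_predict(X)

score = adjusted_rand_score(a, b)
print(score)  # close to 1 means the two runs agree
```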

Half of the issue is potentially that I don't know how many sub-clusters to create. Is there any best practice for the number of sub-clusters, e.g. if you have X observations, create X/10 sub-clusters?

Also, does anyone have any general advice on the approach? I'm quite new to modelling, and I feel like I've read every post on every forum, plus lots of academic papers and web resources, but am still no better off.

If my data set is fraught with things that will cause issues then how do I know? Can certain data sets not be clustered? Are there any other alternative techniques I could use?

I've attached the first 1k rows (of 10k total) of my original data set in tab-delimited format (the normalised z-score variables are on the right).

Any advice would be really appreciated!

cheers,

Sammy
