View the step-by-step solution to:

Example nding teen market segments using k-means clustering Interacting with friends on a social networking service (SNS), such as Facebook, Tumblr,...

in R 

Follow the steps given in Machine Learning With R, Chapter 3 section "Diagnosing Breast Cancer with the kNN Algorithm."[;vnd.vst.idref=id286762181]!/4[page]/2/2/2/2/[email protected]:0

Follow the steps given in Machine Learning With R, Chapter 9 section "Finding Teen Market Segments Using k-Means Clustering."[;vnd.vst.idref=id286799594]!/4[page]/2/2/2/2/[email protected]:0

Chapter screenshots are below

Attach brief that reports the execution steps and outcomes of the k-NN and k-means lab. In addition, the following questions will be addressed in the brief:

  • Use the caret package to automatically tune the k parameter for the k-NN algorithm. Were you able to identify a k parameter that increased the accuracy from previous attempts? Show your work and the final result.
  • Train the k-means model again using k=3 and then k=10. How did this affect the cluster distribution for mean age and proportion of females?

This submission must be one to two pages in length; screenshots of work are acceptable for showing result outcomes. Screenshots of chapters are below Chapter 9-09.jpgChapter 9-08.jpgChapter 9-07.jpgChapter 9-06.jpgChapter 9-05.jpgChapter 9-04.jpgChapter 9-03.jpgChapter 9-02.jpgin R 

Follow the steps given in Machine Learning With R, Chapter 3 section "Diagnosing Breast Cancer with the kNN Algorithm."[;vnd.vst.idref=id286762181]!/4[page]/2/2/2/2/[email protected]:0

Follow the steps given in Machine Learning With R, Chapter 9 section "Finding Teen Market SegmenChapter 9-01.jpgts Using k-Means Clustering."[;vnd.vst.idref=id286799594]!/4[page]/2/2/2/2/[email protected]:0

Attach brief that reports the execution steps and outcomes of the k-NN and k-means lab. In addition, the following questions will be addressed in the brief:

  • Use the caret package to automatically tune the k parameter for the k-NN algorithm. Were you able to identify a k parameter that increased the accuracy from previous attempts? Show your work and the final result.
  • IMG_4692.JPGTrain the k-means model again using k=3 and then k=10. How did this affect the cluster distribution for mean age and proportion of females?

This submission must be one to two pages in length; screenshots of work are acceptable for showing result outcomes.


screenshot images below of chapters mentioned above

Chapter 9-01.jpg

Example — finding teen market segments
using k-means clustering Interacting with friends on a social networking service (SNS), such as Facebook, Tumblr, and
Instagram has become a rite of passage for teenagers around the world. Having a relatively large
amount of disposable income, these adolescents are a coveted demographic for businesses
hoping to sell snacks, beverages, electronics, and hygiene products. The many millions of teenage consumers using such sites have attracted the attention of
marketers struggling to find an edge in an increasingly competitive market. One way to gain this
edge is to identify segments of teenagers who share similar tastes, so that clients can avoid
targeting advertisements to teens with no interest in the product being sold. For instance, sporting
apparel is likely to be a difficult sell to teens with no interest in sports. Given the text of teenagers' SNS pages, we can identify groups that share common interests such
as sports, religion, or music. Clustering can automate the process of discovering the natural
segments in this population. However, it will be up to us to decide whether or not the clusters are
interesting and how we can use them for advertising. Let's try this process from start to finish. Step 1 — collecting data For this analysis, we will use a dataset representing a random sample of 30,000 U.S. high school
students who had profiles on a well-known SNS in 2006. To protect the users' anonymity, the
SNS will remain unnamed. However, at the time the data was collected, the SNS was a popular
web destination for US teenagers. Therefore, it is reasonable to assume that the profiles represent
a fairly wide cross section of American adolescents in 2006. Tip This dataset was compiled by Brett Lantz while conducting sociological research on the teenage
identities at the University of Notre Dame. If you use the data for research purposes, please cite
this book chapter. The full dataset is available at the Packt Publishing website with the filename
snsdata. csv. To follow along interactively, this chapter assumes that you have saved this file to
your R working directory. The data was sampled evenly across four high school graduation years (2006 through 2009)
representing the senior, junior, sophomore, and freshman classes at the time of data collection.
Using an automated web crawler, the full text of the SNS profiles were downloaded, and each
teen's gender, age, and number of SNS friends was recorded. A text mining tool was used to divide the remaining SNS page content into words. From the top
500 words appearing across all the pages, 36 words were chosen to represent five categories of
interests: namely extracurricular activities, fashion, religion, romance, and antisocial behavior.

Chapter 9-02.jpg

The 36 words include terms such as football, sexy, kissed, bible, shopping, death, and drugs. The
final dataset indicates, for each person, how many times each word appeared in the person's SNS
profile. Step 2 — exploring and preparing the data
We can use the default settings of read. csv () to load the data into a data frame: > teens <- read.csv("snsdata.csv") Let's also take a quick look at the specifics of the data. The first several lines of the st]: () output
are as follows: > str(teens)
'data.frame': 30000 obs. of 40 variables: 5 gradyear : int 2006 2006 2006 2006 2006 2006 2006 2006 ...
5 gender : Factor w/ 2 levels "F","M“: 2 1 2 1 NA 1 1 2 ...
$ age : num 19 18.8 18.3 18.9 19 ... $ friends : int 7 0 69 0 10 142 72 17 52 39 ... $ basketball : int 0 0 0 0 O 0 0 0 0 0 ... As we had expected, the data include 30,000 teenagers with four variables indicating personal
characteristics and 36 words indicating interests. Do you notice anything strange around the gender row? If you were looking carefully, you may
have noticed the NA value, which is out of place compared to the 1 and 2 values. The NA is R's
way of telling us that the record has a missing value—we do not know the person's gender. Until
now, we haven't dealt with missing data, but it can be a significant problem for many types of
analyses. Let's see how substantial this problem is. One option is to use the table () command, as follows: > table(teens$gender)
22054 5222 Although this command tells us how many F and M values are present, the table () function
excluded the NA values rather than treating it as a separate category. To include the NA values (if
there are any), we simply need to add an additional parameter: > table(teens$gender, useN = "ifany")
F M <NA>
22054 5222 2724 Here, we see that 2,724 records (9 percent) have missing gender data. Interestingly, there are
over four times as many females as males in the SNS data, suggesting that males are not as
inclined to use SNS websites as females.

Chapter 9-03.jpg

If you examine the other variables in the data frame, you will find that besides gender, only age
has missing values. For numeric data, the summary () command tells us the number of missing
NA values: > summary(teens$age)
Min. lst Qu. Median Mean 3rd Qu. Max. NA's
3.086 16.310 17.290 17.990 18.260 106.900 5086 A total of 5,086 records (17 percent) have missing ages. Also concerning is the fact that the
minimum and maximum values seem to be unreasonable; it is unlikely that a 3 year old or a 106
year old is attending high school. To ensure that these extreme values don't cause problems for
the analysis, we'll need to clean them up before moving on. A more reasonable range of ages for the high school students includes those who are at least 13
years old and not yet 20 years old. Any age value falling outside this range should be treated the
same as missing data—we cannot trust the age provided. To recode the age variable, we can use
the ifelse () function, assigning teen$age the value of teen$age ifthe age is at least 13 and
less than 20 years; otherwise, it will receive the value NA: > teens$age <— ifelse(teens$age >= 13 E teens$age < 20,
teenssage, NA) By rechecking the summary () output, we see that the age range now follows a distribution that
looks much more like an actual high school: > summary(teens$age)
Min. lst Qu. Median Mean 3rd Qu. Max. NA's
13.03 16.30 17.26 17.25 18.22 20.00 5523 Unfortunately, now we've created an even larger missing data problem. We'll need to find a way
to deal with these values before continuing with our analysis. Data preparation — dummy coding missing values An easy solution for handling the missing values is to exclude any record with a missing value.
However, if you think through the implications of this practice, you might think twice before
doing so—just because it is easy does not mean it is a good idea! The problem with this approach
is that even if the missingness is not extensive, you can easily exclude large portions of the data. For example, suppose that in our data, the people with the NA values for gender are completely
different from those with missing age data. This would imply that by excluding those missing
either gender or age, you would exclude 9% + 1 7% : 26% of the data, or over 7,500 records.
And this is for missing data on only two variables! The larger the number of missing values
present in a dataset, the more likely it is that any given record will be excluded. Fairly soon, you
will be left with a tiny subset of data, or worse, the remaining records will be systematically
different or non-representative of the full population.

Chapter 9-04.jpg

An alternative solution for categorical variables like gender is to treat a missing value as a
separate category. For instance, rather than limiting to female and male, we can add an additional
category for the unknown gender. This allows us to utilize dummy coding, which was covered in
Chapter 3, Lazy Learning — Classrfication Using Nearest Neighbors. If you recall, dummy coding involves creating a separate binary (1 or O) valued dummy variable
for each level of a nominal feature except one, which is held out to serve as the reference group.
The reason one category can be excluded is because its status can be inferred fiom the other
categories. For instance, if someone is not female and not unknown gender, they must be male.
Therefore, in this case, we need to only create dummy variables for female and unknown gender: > teens$female <- ifelse(teens$gender == "F" 5
!$gender), 1, 0)
> teens$no_gender <— ifelse($gender), 1, 0) As you might expect, the is . na () function tests whether gender is equal to NA. Therefore, the
first statement assigns teens$ female the value 1 if gender is equal to F and the gender is not
equal to NA; otherwise, it assigns the value 0. In the second statement, if is .na () returns TRUE,
meaning the gender is missing, the teensSno_gender variable is assigned 1; otherwise, it is
assigned the value 0. To confirm that we did the work correctly, let's compare our constructed
dummy variables to the original gender variable: > table(teans$gender, useNA = "ifany")
F M <NA> 22054 5222 2724 > tab1e(teens$female, useNA
0 1 7946 22054 > table(teens$no_gender, useNA = "ifany")
0 1 27276 2724 Ilifanyll) The number of 1 values for teens$female and teens$noigender matches the number of F and
NA values, respectively, so we should be able to trust our work. Data preparation — imputing the missing values Next, let's eliminate the 5,523 missing age values. As age is numeric, it doesn't make sense to
create an additional category for the unknown values—where would you rank "unknown"
relative to the other ages? Instead, we'll use a different strategy known as imputation, which
involves filling in the missing data with a guess as to the true value. Can you think of a way we might be able to use the SNS data to make an informed guess about a
teenager’s age? If you are thinking of using the graduation year, you've got the right idea. Most
people in a graduation cohort were born within a single calendar year. If we can identify the
typical age for each cohort, we would have a fairly reasonable estimate of the age of a student in
that graduation year.

Chapter 9-05.jpg

One way to find a typical value is by calculating the average or mean value. If we try to apply
the mean () function, as we did for previous analyses, there's a problem: > mean(teens$age)
[1] NA The issue is that the mean is undefined for a vector containing missing data. As our age data
contains missing values, mean (teens$age) returns a missing value. We can correct this by
adding an additional parameter to remove the missing values before calculating the mean: > mean(teens$age, na.rm = TRUE)
[1] 17.25243 This reveals that the average student in our data is about 17 years old. This only gets us part of
the way there; we actually need the average age for each graduation year. You might be tempted
to calculate the mean four times, but one of the benefits of R is that there's usually a way to avoid
repeating oneself. In this case, the aggregate () function is the tool for the job. It computes
statistics for subgroups of data. Here, it calculates the mean age by graduation year after
removing the NA values: > aggregate(data = teens, age ~ gradyear, mean, na.rm = TRUE)
gradyear age 2006 18.65586 2007 17.70617 2008 16.76770 2009 15.81957 thH The mean age differs by roughly one year per change in graduation year. This is not at all
surprising, but a helpful fmding for confirming our data is reasonable. The aggregate () output is a data frame. This is helpfill for some purposes, but would require
extra work to merge back onto our original data. As an alternative, we can use the ave ()
function, which returns a vector with the group means repeated so that the result is equal in
length to the original vector: > ave_age <— avetteens$age, teens$gradyear, FUN =
function(x) mean(x, na.rm = TRUE)) To impute these means onto the missing values, we need one more ifelse () call to use the
ave_age value only if the original age value was NA: > teenssage <— ifelse($age), ave_age, teens$age) The summary () results show that the missing values have now been eliminated: > summary(teens$age)
Min. lst Qu. Median Mean 3rd Qu. Max.
13.03 16.28 17.24 17.24 18.21 20.00

Chapter 9-06.jpg

With the data ready for analysis, we are ready to dive into the interesting part of this project.
Let's see whether our efforts have paid off. Step 3 — training a model on the data To cluster the teenagers into marketing segments, we will use an implementation of k-means in
the stats package, which should be included in your R installation by default. If by chance you
do not have this package, you can install it as you would any other package and load it using the
library (stats) command. Although there is no shortage of k—means functions available in
various R packages, the kmeans () function in the stats package is widely used and provides a
vanilla implementation of the algorithm. The kmeans () function requires a data frame containing only numeric data and a parameter
specifying the desired number of clusters. If you have these two things ready, the actual process
of building the model is simple. The trouble is that choosing the right combination of data and
clusters can be a bit of an art; sometimes a great deal of trial and error is involved. We'll start our cluster analysis by considering only the 36 features that represent the number of
times various interests appeared on the teen SNS profiles. For convenience, let's make a data
fi‘ame containing only these features: > interests <— teens[5:40] If you recall from Chapter 3, Lazy Learning — Classification Using Nearest Neighbors, a
common practice employed prior to any analysis using distance calculations is to normalize or z-
score standardize the features so that each utilizes the same range. By doing so, you can avoid a
problem in which some features come to dominate solely because they have a larger range of
values than the others. The process of z-score standardization rescales features so that they have a mean of zero and a
standard deviation of one. This transformation changes the interpretation of the data in a way that
may be useful here. Specifically, if someone mentions football three times on their profile,
without additional information, we have no idea whether this implies they like football more or
less than their peers. On the other hand, if the z-score is three, we know that that they mentioned
football many more times than the average teenager. To apply the z-score standardization to the interests data frame, we can use the scale ()
fimotion with lapply () as follows: > interests_z <- as . data. . frame (lapply (interests , scale) ) Since lapply () returns a matrix, it must be coerced back to data frame form using the
as . data . frame () function.

Chapter 9-07.jpg

Our last decision involves deciding how many clusters to use for segmenting the data. If we use
too many clusters, we may find them too specific to be useful; conversely, choosing too few may
result in heterogeneous groupings. You should feel comfortable experimenting with the values of
k. If you don't like the result, you can easily try another value and start over. Tip Choosing the number of clusters is easier if you are familiar with the analysis population. Having
a hunch about the true number of natural groupings can save you some trial and error. To help us predict the number of clusters in the data, 1'11 defer to one of my favorite films, The
Brealgrast Club, a coming-of-age comedy released in 1985 and directed by John Hughes. The
teenage characters in this movie are identified in terms of five stereotypes: a brain, an athlete, a
basket case, a princess, and a criminal. Given that these identities prevail throughout popular
teen fiction, five seems like a reasonable starting point for k. To use the k—means algorithm to divide the teenagers' interest data into five clusters, we use the
krneans () function on the interests data frame. Because the k-means algorithm utilizes
random starting points, the set . seed () function is used to ensure that the results match the
output in the examples that follow. If you recall from the previous chapters, this command
initializes R's random number generator to a specific sequence. In the absence of this statement,
the results will vary each time the k—means algorithm is run: > set.seed(2345)
> teen_clusters <— kmeans(interests_z, 5) The result of the k-means clustering process is a list named teen_clusters that stores the
properties of each of the five clusters. Let's dig in and see how well the algorithm has divided the
teens' interest data. Tip If you find that your results differ from those shown here, ensure that the set. seed (2 345)
command is run immediately prior to the kmeans () function. Step 4 — evaluating model performance Evaluating clustering results can be somewhat subjective. Ultimately, the success or failure of
the model hinges on whether the clusters are usefiil for their intended purpose. As the goal of this
analysis was to identify clusters of teenagers with similar interests for marketing purposes, we
will largely measure our success in qualitative terms. For other clustering applications, more
quantitative measures of success may be needed. One of the most basic ways to evaluate the utility of a set of clusters is to examine the number of
examples falling in each of the groups. If the groups are too large or too small, they are not likely

Chapter 9-08.jpg

to be very useful. To obtain the size of the kmeans () clusters, use the teen_clusters$size
component as follows: > teen_clusters$size
[1] 871 600 5981 1034 21514 Here, we see the five clusters we requested. The smallest cluster has 600 teenagers (2 percent)
while the largest cluster has 21,514 (72 percent). Although the large gap between the number of
people in the largest and smallest clusters is slightly concerning, without examining these groups
more carefully, we will not know whether or not this indicates a problem. It may be the case that
the clusters' size disparity indicates something real, such as a big group of teens that share similar
interests, or it may be a random fluke caused by the initial k—means cluster centers. We'll know
more as we start to look at each cluster's homogeneity. Tip Sometimes, k-means may find extremely small clusters—occasionally, as small as a single point.
This can happen if one of the initial cluster centers happens to fall on an outlier far from the rest
of the data. It is not always clear whether to treat such small clusters as a true finding that
represents a cluster of extreme cases, or a problem caused by random chance. If you encounter
this issue, it may be worth re-running the k-means algorithm with a different random seed to see
whether the small cluster is robust to different starting points. For a more in-depth look at the clusters, we can examine the coordinates of the cluster centroids
using the teen_clusters$centers component, which is as follows for the first four interests: > teen_c1usters$centers
basketball football soccer softball 1 0.16001227 0.2364174 0.10385512 0.07232021
2 —0.09195886 0.0652625 -0.09932124 -0.01739428
3 0.52755083 0.4373480 0.29778605 0.37178377
4 0.34081039 0.3593965 0.12722250 0.16384661
5 -0.16695523 -0.1641499 -0.09033520 -0.11367669 The rows of the output (labeled 1 to 5) refer to the five clusters, while the numbers across each
row indicate the cluster‘s average value for the interest listed at the top of the column. As the
values are z-score standardized, positive values are above the overall mean level for all the teens
and negative values are below the overall mean. For example, the third row has the highest value
in the basketball column, which means that cluster 3 has the highest average interest in
basketball among all the clusters. By examining whether the clusters fall above or below the mean level for each interest category,
we can begin to notice patterns that distinguish the clusters from each other. In practice, this
involves printing the cluster centers and searching through them for any patterns or extreme
values, much like a word search puzzle but with numbers. The following screenshot shows a
highlighted pattern for each of the five clusters, for 19 of the 36 teen interests:

Chapter 9-09.jpg

Given this subset of the interest data, we can already infer some characteristics of the clusters.
Cluster 3 is substantially above the mean interest level on all the sports. This suggests that this
may be a group of Athletes per The Brealfast Club stereotype. Cluster 1 includes the most
mentions of "cheerleading," the word "hot," and is above the average level of football interest.
Are these the so-called Princesses? By continuing to examine the clusters in this way, it is possible to construct a table listing the
dominant interests of each of the groups. In the following table, each cluster is shown with the
features that most distinguish it from the other clusters, and The Breakfast Club identity that
most accurately captures the group's characteristics. Interestingly, Cluster 5 is distinguished by the fact that it is unexceptional; its members had
lower-than—average levels of interest in every measured activity. It is also the single largest group
in terms of the number of members. One potential explanation is that these users created a profile
on the website but never posted any interests. Tip When sharing the results of a segmentation analysis, it is often helpful to apply informative
labels that simplify and capture the essence of the groups such as The Breakfast Club typology
applied here. The risk in adding such labels is that they can obscure the groups' nuances by
stereotyping the group members. As such labels can bias our thinking, important patterns can be
missed if labels are taken as the whole truth. Given the table, a marketing executive would have a clear depiction of five types of teenage
visitors to the social networking website. Based on these profiles, the executive could sell
targeted advertising impressions to businesses with products relevant to one or more of the
clusters. In the next section, we will see how the cluster labels can be applied back to the original
population for such uses. Step 5 — improving model performance Because clustering creates new information, the performance of a clustering algorithm depends
at least somewhat on both the quality of the clusters themselves as well as what is done with that
information. In the preceding section, we already demonstrated that the five clusters provided
usefiil and novel insights into the interests of teenagers. By that measure, the algorithm appears
to be performing quite well. Therefore, we can now focus our effort on turning these insights into











Recently Asked Questions

Why Join Course Hero?

Course Hero has all the homework and study help you need to succeed! We’ve got course-specific notes, study guides, and practice tests along with expert tutors.


Educational Resources
  • -

    Study Documents

    Find the best study resources around, tagged to your specific courses. Share your own to gain free Course Hero access.

    Browse Documents
  • -

    Question & Answers

    Get one-on-one homework help from our expert tutors—available online 24/7. Ask your own questions or browse existing Q&A threads. Satisfaction guaranteed!

    Ask a Question