Name___________________________
This assignment is worth 100 points. There are 11 questions/tasks for you to complete. Questions/tasks
1-10 are worth 9 points each and question #11 is worth 10 points.
Assignment Due Date:
The assignment is due Thursday October 26, 2023 by 11:59pm CST. If the assignment is submitted late,
points will be deducted according to the late assignment policy on the syllabus.
Course Coverage:
This assignment covers the material in Modules 7 (Variable Clustering) and 8 (Observational Clustering).
Submission Instructions:
Upload this Word document with your responses/SAS output to the Cluster Analysis Assignment posted
in Module 8.
Scenario, Data Description, and Data Dictionary:
You work for a financial institution that has a line of business in granting and managing credit cards. You
are going to perform a cluster analysis (both variable and observational) to see if you can gain any insight
into your customers' credit card usage.
In SAS OnDemand in the sq library there is a data set entitled credit_cards that contains 8,950 rows and
18 variables/inputs. Each row represents a customer. A description of the variables/inputs in
credit_cards is provided in the table below.
Variable/Input Name | Description
CUST_ID | Unique identifier for the customer (not to be used in analysis).
BALANCE | Balance amount left in their account to make purchases.
BALANCE_FREQUENCY | How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated).
PURCHASES | Amount of purchases made from account.
ONEOFF_PURCHASES | Maximum purchase amount done in one-go.
INSTALLMENTS_PURCHASES | Amount of purchase done in installment.
CASH_ADVANCE | Cash in advance given by the user.
PURCHASES_FREQUENCY | How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased).
ONEOFFPURCHASESFREQUENCY | How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased).
PURCHASESINSTALLMENTSFREQUENCY | How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done).
CASHADVANCEFREQUENCY | How frequently the cash in advance is being paid.
CASHADVANCETRX | Number of Transactions made with “Cash in Advanced”.
PURCHASES_TRX | Number of purchase transactions made.
CREDIT_LIMIT | Limit of Credit Card for user.
PAYMENTS | Amount of Payment made by customer.
MINIMUM_PAYMENTS | Minimum amount of payments made by user.
PRCFULLPAYMENT | Percent of full payment paid by user.
TENURE | Tenure of credit card service for user (number of years customer
Note: Do not use the variable CUST_ID for any analysis in this assignment; this is just a value to identify
each customer.
Reminder of Steps for a Cluster Analysis:
In general, the following steps can be employed when performing an observational cluster analysis like k-means.
1. Perform variable selection (we will be using a variable clustering method).
2. Determine the number of clusters.
3. Standardize the data.
4. Perform the observational clustering analysis (via k-means algorithm).
5. Un-standardize the data (in preparation for visualization/profiling the clusters).
6. Visualize the clusters.
7. Profile the clusters (that is, describe the customers based on the values of the inputs/variables used in the k-means algorithm).
This assignment is going to step you through each of these cluster analysis steps.
Step 1 Variable Selection:
Use the 17 inputs/variables and perform a variable clustering analysis. Use a second eigenvalue cutoff of
0.70.
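To make the second eigenvalue cutoff concrete, here is a minimal sketch in Python (not SAS) for the simplest possible case: a cluster holding just two variables. For two standardized variables with correlation r, the correlation matrix has eigenvalues 1 + |r| and 1 - |r|, so the cluster would be split when 1 - |r| exceeds the cutoff. The function names are hypothetical, and treating the cutoff this way is an assumption about how the variable clustering procedure applies it (in SAS this is presumably the MAXEIGEN= option of PROC VARCLUS); real clusters with more variables need a full eigenvalue decomposition.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def second_eigenvalue_2var(r):
    """Second (smaller) eigenvalue of the 2x2 correlation matrix [[1, r], [r, 1]]."""
    return 1.0 - abs(r)


def should_split(r, cutoff=0.70):
    """True when the second eigenvalue exceeds the cutoff (assumed split rule)."""
    return second_eigenvalue_2var(r) > cutoff
```

Under this rule, two variables stay together at a cutoff of 0.70 only when the absolute value of their correlation is at least 0.30.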
1. Copy and paste the table you used to select the variables that will move forward into the k-means
clustering algorithm.
2. Using the table from #1, list the variables that will move forward into the k-means clustering
algorithm. Hint: there should be 10 variables.
3. How did you make the decision to select the 10 variables you listed in part #2?
Step 2 Determine the Optimal Number of Clusters:
Modify the number_of_clusters_code.sas file in the Module 8 folder in SAS Studio and determine the
optimal number of clusters to be used in the k-means algorithm.
Hint: remember to standardize the data before the CCC_PSF_Kmeans_Macro.sas file is used and only use
the variables/inputs chosen from the variable selection process from Step 1.
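As background for one of the two statistics used in this step, the following is a minimal Python sketch (not the course's SAS macro) of the pseudo F statistic for a given cluster assignment: the between-cluster sum of squares divided by (k - 1), over the within-cluster sum of squares divided by (n - k). The function name and data layout are hypothetical, and the CCC is not reproduced here because its computation is considerably more involved.

```python
def pseudo_f(points, labels):
    """Pseudo F statistic for points (tuples of numbers) and their cluster labels."""
    n = len(points)
    dims = len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)

    # Overall centroid of all observations.
    overall = [sum(p[d] for p in points) / n for d in range(dims)]

    between = 0.0
    within = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        # Between-cluster SS: cluster size times squared distance to the overall centroid.
        between += len(members) * sum((centroid[d] - overall[d]) ** 2 for d in range(dims))
        # Within-cluster SS: squared distance of each member to its own centroid.
        within += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dims)) for p in members)

    return (between / (k - 1)) / (within / (n - k))
```

Larger values indicate clusters that are compact relative to how far apart they sit, which is why peaks in this statistic across candidate values of k are of interest.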
4. Using the smallest number that both the CCC and Pseudo F statistic agree upon, what is the
number of clusters that should be used in the k-means algorithm?
Steps 3 and 4 are required steps to perform observational clustering (k-means algorithm).
Step 3 Standardize the Data:
Using the variables/inputs selected from the variable selection procedure (Step 1), standardize the data
using the range method. Output the standardized data into a SAS output data set. Output the
parameters of the standardization in a SAS output data set (you will need these parameters in a future
step to un-standardize the data).
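For intuition about what the range method does, here is a minimal Python sketch (not SAS) that rescales each selected variable to the interval from 0 to 1 and keeps the parameters needed to reverse the transformation. It assumes the range method uses the minimum as the location and (maximum - minimum) as the scale; the function name, the dictionary layout, and the handling of constant columns are hypothetical choices for illustration.

```python
def range_standardize(rows, variables):
    """Range-standardize the named variables in a list of row dictionaries.

    Returns (standardized_rows, params), where params maps each variable to
    its location (minimum) and scale (maximum - minimum).
    """
    params = {}
    for v in variables:
        values = [row[v] for row in rows]
        lo, hi = min(values), max(values)
        params[v] = {"location": lo, "scale": hi - lo}

    standardized = []
    for row in rows:
        new_row = dict(row)  # untouched columns such as CUST_ID are carried along
        for v in variables:
            scale = params[v]["scale"]
            # A constant column has scale 0; map it to 0.0 rather than divide by zero.
            new_row[v] = (row[v] - params[v]["location"]) / scale if scale else 0.0
        standardized.append(new_row)
    return standardized, params
```

Returning the parameters alongside the data mirrors the requirement to save them in their own output data set.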
5. Select Yes or No to the following statement: I have standardized the data in SAS according to the
requirements stated above in Step 3.
☐ Yes.
☐ No.
Step 4 Perform the k-means clustering algorithm (observational clustering):
Using the number of clusters stated in Step 2, #4, perform the k-means clustering on the
standardized data. Use 50 iterations. Output a SAS data set that has the cluster number that
each customer was placed into. The CLUSTER variable will contain the cluster that each customer
was placed into.
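The k-means idea behind this step can be sketched in a few lines of Python (not SAS): assign every observation to its nearest centroid, recompute the centroids, and repeat until nothing changes or the iteration limit is reached. This is only an illustration under stated assumptions: the function name is hypothetical, the first k points are used as initial seeds for reproducibility, and SAS procedures select seeds and handle convergence differently.

```python
def kmeans(points, k, max_iter=50):
    """Basic k-means on a list of numeric tuples.

    Returns (labels, centroids); labels are numbered from 1, like a CLUSTER variable.
    """
    # Assumption for illustration: the first k points serve as the initial seeds.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)

    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c]))) + 1
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c + 1]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    return labels, centroids
```

Because distances drive the assignments, variables on larger scales would dominate the result, which is why the data are standardized first.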
6. Select Yes or No to the following statement: I have performed the k-means clustering algorithm
according to the requirements stated above in Step 4.
☐ Yes.
☐ No.
Step 5 Un-standardize the data (in preparation for visualization/profiling):
Using the standardization parameters that were saved in the SAS output data set from Step 3,
un-standardize the data from the k-means algorithm that contains the CLUSTER variable
(variable that allocates each customer to a specific cluster).
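Reversing a range standardization is simple arithmetic: multiply each standardized value by the saved scale and add the saved location. The Python sketch below (not SAS) illustrates this, assuming the saved parameters are a location and a scale per variable; the function name and dictionary layout are hypothetical.

```python
def unstandardize(rows, params):
    """Restore original units for the variables listed in params.

    params maps a variable name to {"location": ..., "scale": ...}; columns
    not listed there, such as CLUSTER, are carried through unchanged.
    """
    restored = []
    for row in rows:
        new_row = dict(row)
        for variable, p in params.items():
            new_row[variable] = row[variable] * p["scale"] + p["location"]
        restored.append(new_row)
    return restored
```

For example, with a location of 1000 and a scale of 9000, a standardized CREDIT_LIMIT of 0.5 maps back to 5500, while the CLUSTER value is left as it was.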
7. Select Yes or No to the following statement: I have un-standardized the data according to the
requirements stated above in Step 5.
☐ Yes.
☐ No.
Step 6 Visualize the clusters:
Select a random sample of 250 customers and construct a matrix of scatterplots for the clusters
using CASH_ADVANCE_TRX, INSTALLMENTS_PURCHASES, CREDIT_LIMIT, and TENURE.
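The sampling part of this step can be illustrated with a short Python sketch (not SAS): draw 250 customers without replacement using a fixed seed so the sample is reproducible, then enumerate the variable pairs a scatterplot matrix would show. The function names and the seed value are hypothetical; the plotting itself is left to the SAS graphics procedure used in the course.

```python
import random
from itertools import combinations

PLOT_VARIABLES = ["CASH_ADVANCE_TRX", "INSTALLMENTS_PURCHASES", "CREDIT_LIMIT", "TENURE"]


def sample_customers(rows, size=250, seed=12345):
    """Simple random sample without replacement; the seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, size)


def scatter_pairs(variables):
    """Distinct variable pairs that appear as panels in a scatterplot matrix."""
    return list(combinations(variables, 2))
```

With four variables there are six distinct pairs, each of which appears twice in a full matrix (once on each side of the diagonal).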
8. Copy and paste the matrix of scatterplots below.
Step 7 Profile the clusters:
A successful cluster analysis will provide insight into your customers, patients, transactions, etc…
(whatever ‘object’ you are performing the cluster analysis on). Using the matrix of scatterplots,
answer the following questions about the customers in the clusters.
9. Which customers have been customers for a longer time? Customers in cluster 1 or cluster 2?
10. Which customers use their credit cards more for installment purchases? Customers in cluster 1 or
cluster 2?
11. Pick one of the scatterplots and make a statement or two about the customers based on two
inputs/variables in the scatterplot. Do not make a statement based on the responses in #9 or
#10.