Question

# INFS5710

C.-L. Tseng

Homework #3

- The total mark for this assignment is

100.

1

day. This is a rich data set and there are a lot to explore from it. To limit our scope, we only focus on some basic analyses in this homework.

Given a data file opened in SAS Enterprise Guide, you can see some analysis and visualisation functions available (from the tool bar below).

Most functions are straightforward to use. Graphs can be found under Graph; some useful analysis tools can be found under Analyze in the tool bar. You are expected to try them by yourself.

Note that the data visualisation functions only apply to a SAS data file only. When you write a query, before you can graph the table of the query outcome, you need to save the result table as a SAS data file. There are three ways of doing it.

Method 1: Create a table in your query. For such a big data set with raw data, executing a query may take a while. Sometimes it may be beneficial to consolidate data into new tables for analysis (the same idea for having a data warehouse). You can create a table by having the first row in your query as follows:

Create table orion.your_file_name as

Select x as name_x, y as name_y, z as name_z From ...

Where ...

This saves your query result as a data file. Later, you can even retrieve data from this new table by Select name_x, name_y, name_z

From Orion.your_file_name

Where ...

Method 2: copy your output table to a spreadsheet and import it as a data file. This, however, cannot work on large tables, but should be good enough for your output tables. To know how to plot a graph in Excel, there are many tutorials on the Internet, e.g., click this link. This approach will not be further addressed here.

Method 3: incorporate plotting functions in your query, which is too advanced for this course. You are recommended to use Method 1 or Method 2 for this homework and your Project 2. Question 1 (25%)

This question focuses on shoppers.

(a) (15%) Write a query to list the total number of shoppers who spent more than $100 each day in these two months. Your result should look like the following:

Date

Number of Shoppers

01Jan2013

02 Jan2013

2

:

:

28Feb2013

List the first and last 10 rows of the table. Furthermore, graph the output using a line plot with date as the horizontal axis and the number of shoppers as the vertical axis.

Provide the screenshot of the line chart.

(b) (10%) Write a query to list the total number of shopper visits by weekdays. In SAS, weekday =1 for Sunday, 2 for Monday, etc. Your solution should look like the following table.

List the whole table. Graph the output using a bar chart with weekday as the horizontal axis and the number of shoppers as the vertical axis.

Weekday

Number of Shoppers

1

2

:

:

7

Provide the screenshot of the bar chart.

Question 2 (25%)

This question analyses shopping behaviors of customers (id).

(a) (10%) Write a query to list the average amount of money each customer spent in each visit. Your result should look like the following table.

id

Avg. Amount

$xxxxx.xx

3

Sort the data by amount in descending order. List the first and the last 10 rows of the table.

(b) (15%) Each row displayed in the transaction data represents a purchased item. Write a query to find the average number of items purchased by each customer in a visit. List only those who bought on average less than 100 items in a visit. Your result should look like the following one.

Sort the table by the number of items in descending order. List the first and last 10 rows of the table. Furthermore, plot the histogram (probability distribution) of average number of items purchased in a visit. If you are not familiar with the concept of histogram, please read the following site about histogram. You may view the following link to learn how to plot a histogram using Excel. https://support.office.com/en-us/article/create-a-histogram-85680173-064b-4024-b39d- 80f17ff2f4e8

To plot a histogram, choose Bar Chart Wizard. In Step 2 out of 4, choose Percentage for the Bar height. Provide the screenshot of the histogram. What can you observe from the histogram?

Question 3 (25%)

This question analyses the correlation between total money spent and the total number of visits by each customer. If you are not familiar with the concept of correlation, please read it from the following site about correlation. Basically, if their correlation is positive (or negative), we can say that a customer who visits the supermarket more frequently tends to spend more (or less) money there. It is possible that you may also find that these two things have no correlation (or correlation close to 0), which means the total money spent and the number of visits are independent.

Write a query to list the total amount money spent and the number visits for each customer. Your result should look like the following table.

We shall focus on those customers whose total amount spent is positive and is less than $1,000 (i.e., to excllude returns/refunds and corporate customers). Sort your data based on id in ascending order. List the first and last 10 rows of the table. Furthermore, plot a 2D scatter chart for the Amount (horizontal axis) and the No. of Visits (vertical axis).

id

Avg. No. of Purchased Items

xxx.xx

id

Amount

Visits

$xxxxx.xx

4

Is there a correlation revealed from the scatter chart? Calculate the exact correlation between Amount and Visits of the result table. What is the (Pearson) correlation calculated by SAS? Provide a screenshot of the result.

Drag Amount and Visits from the left pane to the right pane.

PC Excel users, you will need the Analysis ToolPak add-in to use the correlation function. Please see the following links to install the tool pack and learn how to calculate a correlation.

https://www.excel-easy.com/data-analysis/analysis-toolpak.html

https://www.excel-easy.com/examples/correlation.html

Mac Excel users, I believe Mac of a recent version has this add-in already

https://support.office.com/en-us/article/load-the-analysis-toolpak-in-excel-6a63e598-cd6d-42e3- 9317-6b40ba1a66b4. Otherwise, you can download a similar, free add-in StatPlus:Mac LE https://www.analystsoft.com/en/products/statplusmacle/

5

Question 4 (25%)

Given transactions data, marketing people are interested in identifying "very valuable" customers who visit frequently with a certain level of spending. Here we focus on those customers who have spent (in total) between $500 and $1000 and have visited the stores over 6 times during these two months. We want to find out the most preferred product (identified by both category and brand) for each such customer and the total amount of money that the customer spends for this product during these two months. Your result should look like the following. You may write multiple queries for solving this question.

Sort the table by money in descending order. List the first and last 10 rows and detail how you obtain the result.

Customer id

Category

Brand

Money Spent

$xxxxx.xx

6