- The total mark for this assignment is
- This homework assignment accounts for 5% of the total marks for this course.
- This homework is due in Week 8 in the lecture.
- This homework prepares you to do Project 2 including using the data visualisation functions
- provided by Enterprise Guide. Please follow the instructions carefully. Mac users may use
- Excel for data visualisation.
- In some questions below, you will be asked to graph some results. You may use the graphic functions of SAS Enterprise Guide or Excel to graph. Provide the screenshot of the graphs if asked.
- Please refer to the document "Homework Submission Requirements" to prepare your submission.
- The following questions refer to the data set given by hw3_transactions.csv (file size ≈ 1.4 GB). In the file, you see detailed anonymised transaction data from thousands of shoppers in Jan 2013-Feb 2013. It contains more than 2 million rows of transaction data, which cannot be completely open via Excel.
- To begin, you need to use SAS Enterprise Guide to open the file by using File > Import Data. You will then be asked to specify data; please try to understand what each step means. In this case, you may keep clicking next and finish. After the file is loaded to SAS, you need to save it as a SAS data file (e.g., hw3_transactions) in your OrionDB folder to be linked by your queries.
- The definition of each column is given below.
- id: a unique id representing a customer
- chain: an integer representing a store chain
- dept: an aggregate grouping of the category (e.g. water)
- category: the product category (e.g. sparkling water)
- company: an id of the company that sells the item
- brand: an id of the brand to which the item belongs
- date: the date of purchase
- productsize: the amount of the product purchase (e.g. 16 oz of water)
- productmeasure: the units of the product purchase (e.g. ounces)
- purchasequantity: the number of units purchased
- purchaseamount: the dollar amount of the purchase
- To interpret the transaction data, for example rows 1 - 10 describe what Customer (id 86246) bought at Store (id 205) on 2/01/2013. This particular customer bought 10 different items. The exact items purchased by this customer are unknown, but we know their product categories, brands, and product companies. We also know what department (section) in a supermarket where each item is displayed. Since the only time stamp that we have from this data set is the date when a shopper shops, we assume in this homework that each shopper shops at one place no more than once per
day. This is a rich data set and there are a lot to explore from it. To limit our scope, we only focus on some basic analyses in this homework.
Given a data file opened in SAS Enterprise Guide, you can see some analysis and visualisation functions available (from the tool bar below).
Most functions are straightforward to use. Graphs can be found under Graph; some useful analysis tools can be found under Analyze in the tool bar. You are expected to try them by yourself.
Note that the data visualisation functions only apply to a SAS data file only. When you write a query, before you can graph the table of the query outcome, you need to save the result table as a SAS data file. There are three ways of doing it.
Method 1: Create a table in your query. For such a big data set with raw data, executing a query may take a while. Sometimes it may be beneficial to consolidate data into new tables for analysis (the same idea for having a data warehouse). You can create a table by having the first row in your query as follows:
Create table orion.your_file_name as
Select x as name_x, y as name_y, z as name_z From ...
This saves your query result as a data file. Later, you can even retrieve data from this new table by Select name_x, name_y, name_z
Method 2: copy your output table to a spreadsheet and import it as a data file. This, however, cannot work on large tables, but should be good enough for your output tables. To know how to plot a graph in Excel, there are many tutorials on the Internet, e.g., click this link. This approach will not be further addressed here.
Method 3: incorporate plotting functions in your query, which is too advanced for this course. You are recommended to use Method 1 or Method 2 for this homework and your Project 2. Question 1 (25%)
This question focuses on shoppers.
(a) (15%) Write a query to list the total number of shoppers who spent more than $100 each day in these two months. Your result should look like the following:
Number of Shoppers
List the first and last 10 rows of the table. Furthermore, graph the output using a line plot with date as the horizontal axis and the number of shoppers as the vertical axis.
Provide the screenshot of the line chart.
(b) (10%) Write a query to list the total number of shopper visits by weekdays. In SAS, weekday =1 for Sunday, 2 for Monday, etc. Your solution should look like the following table.
List the whole table. Graph the output using a bar chart with weekday as the horizontal axis and the number of shoppers as the vertical axis.
Number of Shoppers
Provide the screenshot of the bar chart.
Question 2 (25%)
This question analyses shopping behaviors of customers (id).
(a) (10%) Write a query to list the average amount of money each customer spent in each visit. Your result should look like the following table.
Sort the data by amount in descending order. List the first and the last 10 rows of the table.
(b) (15%) Each row displayed in the transaction data represents a purchased item. Write a query to find the average number of items purchased by each customer in a visit. List only those who bought on average less than 100 items in a visit. Your result should look like the following one.
Sort the table by the number of items in descending order. List the first and last 10 rows of the table. Furthermore, plot the histogram (probability distribution) of average number of items purchased in a visit. If you are not familiar with the concept of histogram, please read the following site about histogram. You may view the following link to learn how to plot a histogram using Excel. https://support.office.com/en-us/article/create-a-histogram-85680173-064b-4024-b39d- 80f17ff2f4e8
To plot a histogram, choose Bar Chart Wizard. In Step 2 out of 4, choose Percentage for the Bar height. Provide the screenshot of the histogram. What can you observe from the histogram?
Question 3 (25%)
This question analyses the correlation between total money spent and the total number of visits by each customer. If you are not familiar with the concept of correlation, please read it from the following site about correlation. Basically, if their correlation is positive (or negative), we can say that a customer who visits the supermarket more frequently tends to spend more (or less) money there. It is possible that you may also find that these two things have no correlation (or correlation close to 0), which means the total money spent and the number of visits are independent.
Write a query to list the total amount money spent and the number visits for each customer. Your result should look like the following table.
We shall focus on those customers whose total amount spent is positive and is less than $1,000 (i.e., to excllude returns/refunds and corporate customers). Sort your data based on id in ascending order. List the first and last 10 rows of the table. Furthermore, plot a 2D scatter chart for the Amount (horizontal axis) and the No. of Visits (vertical axis).
Avg. No. of Purchased Items
Is there a correlation revealed from the scatter chart? Calculate the exact correlation between Amount and Visits of the result table. What is the (Pearson) correlation calculated by SAS? Provide a screenshot of the result.
Drag Amount and Visits from the left pane to the right pane.
PC Excel users, you will need the Analysis ToolPak add-in to use the correlation function. Please see the following links to install the tool pack and learn how to calculate a correlation.
Mac Excel users, I believe Mac of a recent version has this add-in already
https://support.office.com/en-us/article/load-the-analysis-toolpak-in-excel-6a63e598-cd6d-42e3- 9317-6b40ba1a66b4. Otherwise, you can download a similar, free add-in StatPlus:Mac LE https://www.analystsoft.com/en/products/statplusmacle/
Question 4 (25%)
Given transactions data, marketing people are interested in identifying "very valuable" customers who visit frequently with a certain level of spending. Here we focus on those customers who have spent (in total) between $500 and $1000 and have visited the stores over 6 times during these two months. We want to find out the most preferred product (identified by both category and brand) for each such customer and the total amount of money that the customer spends for this product during these two months. Your result should look like the following. You may write multiple queries for solving this question.
Sort the table by money in descending order. List the first and last 10 rows and detail how you obtain the result.