Engr 9397
Week 2
Exploratory Data Analysis using numerical and graphical methods
Basic Probability concepts

If things were done right 99.9% of the time, we’d end up with
• 1 hr of unsafe drinking water per month
• 2,000 incorrect drug prescriptions per year
• 32,000 missed heartbeats per person per year

Why take measurements?
“You manage what you measure”
“What gets measured gets done”
• Measures are indicators of performance and play a critical role in controlling quality
• Effective measurements help to:
  – justify change to a process
  – quantify performance, results and improvements
  – determine priorities and optimize resources
• Measures need to encompass values of both the customer and the organization

Statistics – some quotes
“Statistics are like bikinis: what they reveal is suggestive but what they conceal is vital”
“Numbers are like people, torture them enough and they will tell you anything”
“There are three kinds of lies: lies, damned lies and statistics”
“Statistics will prove anything, even the truth”
“Statistics is the art of lying by means of figures”
“Statistics is never having to say you are certain”
“One should use statistics as a drunken man uses lampposts – for support rather than for illumination”

Why study statistics?
• It helps us to make informed decisions based on data collected in the face of uncertainty and variation

What are statistics?
• Statistics is a set of tools and techniques used to describe, organize, model and interpret data

Why are Statistics useful in Quality Engineering?
• Statistics, defined as the collection, tabulation, analysis, interpretation and presentation of numerical data (ex. quality measurements), enables one to make decisions (ex.
actions and changes to a product, process or service) about a population using sample data

Use of Statistics
• Supports decision making by looking beyond the face value of the data
• If used effectively, statistical analysis results will illuminate and verify important aspects of the issue or problem being studied
• More informed decision-making, better design, improved performance, lower cost, higher efficiency, bigger profits!

Example – Arctic Pipeline Engineering
• The decision to build a subsea pipeline in a geographical area that has a history of iceberg scouring is a very complex (and expensive) one. Empirical data and statistical analysis of risk are used to gauge the benefit/cost trade-offs of the project. How many factors can you think of that might impact this calculation? What kind of data would help you to determine which factors are the most important?

Some Basic Terminology
• Population: collection of all possible elements, object of interest
• Sample: a population subset – representative sampling that allows for predictions of the entire population and a degree of confidence to be assigned
• Random: an unpredictable result, but having a known probability of occurrence
• Bias: not random, sample does not adequately represent the population
• Inference: process of drawing a conclusion using rules

Types of Variables
• Variable / Continuous data:
  – uncountable quality characteristics that can be measured by the real number scale (ex. temperature, speed)
• Discrete / Categorical data:
  – countable quality characteristics that are measured using whole numbers – either present or absent (ex. # of components (attribute data), # of failures)
• Mixed Data
  – A mix of discrete and continuous data, upper/lower bound values (min/max quantity of water)

Random Variable
• A characteristic whose value may change unpredictably
• Associates events, or values of experimental outcomes, with probabilities
• Needed to define probability distributions
• Random variables can be continuous, discrete or mixed
• With a continuous random variable (rv), the probability of any specific value is zero, whereas a discrete rv has a probability associated with each possible value

Types of data
• Univariate data:
  – 1 variable (ex. velocity)
• Multivariate Data:
  – >1 variable (ex. velocity, temp)
• Ungrouped data:
  – observations have no order
• Grouped data:
  – observations organized based on when values occurred

Measurement Error
• The validity and usefulness of a measurement depend on sample selection, the characteristic(s) being measured and the measurement technique
• Measurement Error = Difference between measured and true value
• Measurement error can be classified into two categories: accuracy and precision

Error Terminology
• Bias = Average value - True value of measurement
• Accuracy (lack of bias)
  – Measure of data tendency to center around the actual value
• Precision (repeatability)
  – ability to consistently repeat measurements of ~same value
  – Measurements differ little from one another
• Usually, excess variability is harder to correct than inaccuracy (why do you think this is so?)
• What are some common sources of error?

Error from Variation – Link to Quality Engineering
• Variation error can be separated into two kinds: assignable causes + residual error
• Variation from assignable causes results from sources outside the process (ex.
human error, fatigue, poor instructions, poor conditions)
• Residual error is what remains after assignable causes have been identified – it is associated with measurement limits and background variability
• Question: If you were testing for the presence of toxins in the local water supply, what are the assignable causes and residual error?

Descriptive Statistics (Exploratory Data Analysis)
• A thorough statistical analysis uses graphical, analytical and interpretive methods
• Descriptive Statistics includes:
  – Graphical (empirical) methods
    • Ex: frequency diagrams, histograms, bar graphs etc.
  – Numerical (analytical/mathematical) methods
    • Data location is described by measures of central tendency, data spread is described by measures of dispersion

Descriptive Statistics
Measures of Location
• Sample Size: number of data points (n)
• Mean (a.k.a. Average)
  – Most common measurement to locate center of the data distribution – histogram balance point – centroid of data
  – Same units as sample/population data
  – Sensitive to outliers
• Trimmed Mean
  – Mean calculated after the smallest and largest α% of the data are removed

Median and Mode
• Median
  – Middle value of the sorted data: if n is odd, the median is the middle value; if n is even, the median is the mean of the two middle values
  – Preferred measure of location when data is skewed
• Mode
  – Value with the largest frequency – used often with categorical data (counts units)

Descriptive Statistics
Measures of Dispersion (Spread)
• Variance
  – Classic measure of data spread, concentration of data about the mean
  – Measured in square units of the observations
  – 2nd moment about the mean
• Standard deviation (root-mean-squared deviation)
  – Measured in the original units of the observations
  – Most common dispersion measure

Measures of Dispersion continued
• Range (r)
  – Dependent on two observations only – OK for small data sets
    r = max(x_i) - min(x_i)
• Quartiles
  – Data divided into 4 equal parts: 1st, 2nd and 3rd quartiles (Q1, Q2 and Q3)
  – Q1 is the lower 25% mark, Q2 = 50%, Q3 = 75%
  – Gives a better indication of data characteristics (vs. the mean) when there are outliers
• Interquartile Range (IQR)
  – range of middle 50% of data
    IQR = Q3 - Q1
• Midrange

Coefficient of Variation
• Coefficient of Variation cv
  – dimensionless measure of spread
  – expresses σ as a percentage of µ
  – Useful to understand spread in context of the mean and also useful when comparing different data sets
  – Measure not so useful if mean value is close to zero

Descriptive Statistics – Example using Minitab
Example: Calculate the mean, median, Q1, Q3 and 5% trimmed mean, variance, std deviation, IQR and Range of the following data:
Data: 2, 208, 3, 5, 90, 151, 45, 46, 47, 48, 50 ...
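
The measures of location and dispersion listed above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the Minitab procedure the slide refers to. Several things here are assumptions: the data list contains only the eleven values visible above (the original list is truncated, so results describe this partial sample only); quartiles use the "exclusive" interpolation method of `statistics.quantiles`; variance and standard deviation use the sample (n - 1) form; and the trimmed mean drops a count from each end equal to the trim percentage of n rounded to the nearest integer. The helper names `trimmed_mean` and `describe` are hypothetical.

```python
import statistics

# Only the values visible in the source text. The original list is truncated
# ("..."), so these results describe this partial sample, not the full exercise.
data = [2, 208, 3, 5, 90, 151, 45, 46, 47, 48, 50]


def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the points from each end.

    Assumption: the number dropped per end is percent% of n, rounded to the
    nearest integer. Other tools may round differently.
    """
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100 + 0.5)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)


def describe(values):
    """Return the location and dispersion measures named in the example."""
    # Assumption: "exclusive" quartiles, i.e. positions based on (n + 1) * p.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (n - 1)
    return {
        "n": len(values),
        "mean": mean,
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": max(values) - min(values),
        "midrange": (max(values) + min(values)) / 2,
        "trimmed_mean_5": trimmed_mean(values, 5),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "stdev": stdev,
        "cv_percent": 100 * stdev / mean,  # unreliable when mean is near 0
    }


summary = describe(data)
for name, value in summary.items():
    print(f"{name:>15}: {value:.3f}")
```

On this partial sample the mean sits well above the median because of the two large values (151 and 208), which is the sensitivity to outliers noted under Measures of Location. The trimmed mean and IQR are much less affected.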
 Winter '11
 SusanHunt