Lecture-2 Box-Plots
Dr. Naveen Kumar Boiroju
Five Number Summary
Minimum
First Quartile (Q1)
Second Quartile (Median)
Third Quartile (Q3)
Maximum
Box-Plot
A Box-and-Whisker plot, sometimes simply called a box-plot, is a graphical display in which a box is drawn from Q1 to Q3, so that it contains the middle 50% of the data. Lines, called whiskers, are drawn from Q1 to the smallest value and from Q3 to the largest value. In addition, a line is drawn inside the box at the median (a vertical line when the plot is drawn horizontally).
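The five numbers behind a box-plot can be computed directly. The following is a minimal sketch, assuming Python's standard `statistics` module and its "inclusive" quartile method. The function name `five_number_summary` and the sample values are hypothetical and are not taken from the lecture's data set. Other quartile conventions exist and may give slightly different Q1 and Q3.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # method="inclusive" interpolates between data points; this is one of
    # several quartile conventions and is an assumption of this sketch.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Hypothetical sample, not from the lecture.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 30]
summary = five_number_summary(sample)
print(summary)  # (2, 4.25, 7.5, 10.5, 30)
```

The box would run from 4.25 to 10.5 with the median line at 7.5, and the plain (unmodified) whiskers would reach 2 and 30.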
Box-Plots
Box plots are an excellent tool for conveying location and variation information in data sets, particularly for detecting and illustrating location and variation changes between different groups of data.
Uses
Boxplots generally display less information than histograms. A histogram with many columns gives a detailed picture of the location of the data values, to within the width of a narrow column, whereas a boxplot does little more than show a division of the data into four parts. However, boxplots are useful for making a large number of visual comparisons. Imagine that we wanted to compare people's incomes from twenty different regions. A set of twenty histograms displaying all the data could not be easily absorbed by the eye. Twenty boxplots, however, could be drawn one underneath the other down the page, and it would be obvious which region had the larger overall incomes and which region had the greatest inequality in terms of the spread of values between rich and poor.
Interpretation of Box-Plots
Center: The median can be read directly from the box-plot.
Variation: The lengths of the whiskers and the IQR (the length of the box) indicate the variation in the data.
Skewness: If the distribution were symmetric, the median would be equidistant from Q1 and Q3. Otherwise, the distribution is skewed.
Right skewed: The whisker to the right of Q3 is longer than the whisker to the left of Q1 (for a horizontal boxplot) and/or the median line is closer to Q1.
Left skewed: The whisker to the right of Q3 is shorter than the whisker to the left of Q1 (for a horizontal boxplot) and/or the median line is closer to Q3.
For example, when the distance between Q3 and the median is greater than the distance between Q1 and the median, the distribution is skewed to the right.
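The median-position rule can be checked numerically. This is a rough sketch of that rule only (it ignores whisker lengths), assuming the "inclusive" quartile method of Python's `statistics` module. The function name `box_skew` and the income values are hypothetical.

```python
import statistics

def box_skew(values):
    """Classify skewness from the position of the median inside the box."""
    q1, median, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    upper = q3 - median   # distance from the median to Q3
    lower = median - q1   # distance from Q1 to the median
    if upper > lower:
        return "right"    # median closer to Q1
    if upper < lower:
        return "left"     # median closer to Q3
    return "symmetric"

# Hypothetical incomes with a long right tail.
incomes = [1, 2, 2, 3, 3, 4, 5, 7, 10, 15]
print(box_skew(incomes))  # right
```

Here Q1 = 2.25, the median is 3.5 and Q3 = 6.5, so the median sits closer to Q1 and the sample is classified as right skewed.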
Modified Box-Plot for detection of outliers
The whiskers extend to the furthest data points within a distance of 1 or 1.5 IQR on either side of the box, and any more extreme points (outliers) are marked individually. Any observation outside the range [T1, T2] is considered an outlier (informal rule), where
T1 = max(minimum observation, Q1 - 1.5 × IQR)
T2 = min(maximum observation, Q3 + 1.5 × IQR)
In a boxplot, outliers are typically shown by a “*”.
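The informal rule can be sketched in a few lines. The example below assumes the 1.5 × IQR multiplier and the "inclusive" quartile method of Python's `statistics` module. The function name `outlier_thresholds`, the parameter `k`, and the sample values are hypothetical.

```python
import statistics

def outlier_thresholds(values, k=1.5):
    """Return (T1, T2, outliers) using the informal k * IQR rule."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # T1 and T2 never go beyond the observed minimum and maximum.
    t1 = max(data[0], q1 - k * iqr)
    t2 = min(data[-1], q3 + k * iqr)
    outliers = [x for x in data if x < t1 or x > t2]
    return t1, t2, outliers

# Hypothetical sample, not from the lecture.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 30]
t1, t2, outliers = outlier_thresholds(sample)
print(t1, t2, outliers)  # 2 19.875 [30]
```

With Q1 = 4.25 and Q3 = 10.5, the IQR is 6.25, so T2 = 10.5 + 9.375 = 19.875 and the value 30 is flagged. In the modified box-plot the upper whisker would then stop at 12, the largest observation inside [T1, T2].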
Box-Plot can provide answers to
What is the shape of the distribution?
Does the location differ between subgroups?
Does the variation differ between subgroups?
Are there any outliers?
Box-Plot does not explain
The bi-modality or multi-modality of the distribution.
The peakedness or flatness of the histogram/frequency curve.
Why do we need box-plots?
To visualize the five number summary.
To compare two or more data sets.
To identify the outliers.
Data Set Employee data set
Where is most of the data?
Q. What does "most of the data" mean?
Ans: We usually want to give a range of values that covers a certain percentage of the data in the sample (say 90%, 95%, or 99%). Why not 100%?
The answer depends on the shape of the data. If the data are bell shaped, we use the Empirical Rule. If the data are not bell shaped, we use Chebyshev’s Rule.
Empirical Rule
The empirical rule states that for a data set having a bell-shaped distribution, approximately 68% of the observations lie within one standard deviation of the mean, approximately 95% lie within two standard deviations of the mean, and approximately 99.7% lie within three standard deviations of the mean. The empirical rule applies to either large samples or populations.
Empirical Rule
Interval                      Percentage of Data
Mean - SD to Mean + SD        68
Mean - 2SD to Mean + 2SD      95
Mean - 3SD to Mean + 3SD      99.7
The empirical rule says that almost all of the data fall within three standard deviations of the mean.
Example
Test scores have a bell-shaped histogram with a mean of 70 and a standard deviation of 5.
Give the range of scores that contains
68% of the students
95% of the students
99.7% of the students
About what percent of students scored
More than 70?
Between 70 and 80?
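One way to work this example is to apply the empirical rule directly to the stated mean of 70 and standard deviation of 5. The sketch below does that arithmetic. The names `EMPIRICAL_COVERAGE` and `empirical_interval` are hypothetical, and the two percentage answers rely on the symmetry of a bell-shaped distribution.

```python
# Approximate coverage (%) within k = 1, 2, 3 standard deviations of the mean.
EMPIRICAL_COVERAGE = {1: 68.0, 2: 95.0, 3: 99.7}

def empirical_interval(mean, sd, k):
    """Return the interval mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd

mean, sd = 70, 5
ranges = {EMPIRICAL_COVERAGE[k]: empirical_interval(mean, sd, k) for k in (1, 2, 3)}
print(ranges)  # {68.0: (65, 75), 95.0: (60, 80), 99.7: (55, 85)}

# By symmetry, about half of the scores lie above the mean of 70.
pct_above_70 = 100.0 / 2
# 80 is two SDs above the mean, so 70 to 80 holds about half of the central 95%.
pct_70_to_80 = EMPIRICAL_COVERAGE[2] / 2
print(pct_above_70, pct_70_to_80)  # 50.0 47.5
```

So about 68% of students scored between 65 and 75, about 95% between 60 and 80, and about 99.7% between 55 and 85. Roughly 50% scored more than 70 and roughly 47.5% scored between 70 and 80.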
Chebyshev’s Rule
Provides a useful interpretation of the standard deviation.
Can be used for all data, including data that are not bell shaped.
Also works for bell-shaped data, but its estimates are conservative, so for bell-shaped data the Empirical Rule should be used.
Example
Test scores have a mean of 70 and a standard deviation of 5 (assume the distribution is not bell shaped).
Give the range of scores that contains
At least 75% of the students
At least 89% of the students
At least 93.75% of the students
About what percentage of the students scored
More than 80 (assume symmetric, not bell shaped)?
Below 50 (assume symmetric, not bell shaped)?
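Chebyshev's rule states that, for any k > 1, at least 1 - 1/k² of the observations lie within k standard deviations of the mean. The sketch below applies that bound to the example, taking k = 2, 3 and 4 for the three requested percentages (k = 3 gives about 88.9%, which the example rounds to 89%). The function names are hypothetical, and the last two answers use the symmetry assumption stated in the question.

```python
def chebyshev_min_fraction(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule is informative only for k > 1")
    return 1 - 1 / k**2

def interval(mean, sd, k):
    """Return the interval mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd

mean, sd = 70, 5
for k in (2, 3, 4):
    low, high = interval(mean, sd, k)
    print(f"k={k}: at least {chebyshev_min_fraction(k):.2%} of scores lie in [{low}, {high}]")

# Under symmetry, the share outside the interval splits equally between the tails.
max_above_80 = (1 - chebyshev_min_fraction(2)) / 2   # 80 = mean + 2 SD
max_below_50 = (1 - chebyshev_min_fraction(4)) / 2   # 50 = mean - 4 SD
print(max_above_80, max_below_50)  # 0.125 0.03125
```

The ranges are 60 to 80, 55 to 85 and 50 to 90. Under the symmetry assumption, at most 12.5% of students scored more than 80 and at most 3.125% scored below 50. These are upper bounds rather than estimates, which is why the rule is described as conservative.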
Employee data: Descriptive Statistics
Variable          N      Mean          Std. Deviation
Current Salary    474    $34,419.57    $17,075.661
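As an illustration of "where is most of the data" for these figures, the sketch below applies Chebyshev's rule with k = 2 to the reported mean and standard deviation. Using Chebyshev's rule here is an assumption of this sketch: the summary above does not state the shape of the salary distribution, and Chebyshev's rule holds for any shape. Clipping the lower limit at 0 is also an added assumption (salaries cannot be negative).

```python
# Figures from the Descriptive Statistics summary for Current Salary.
mean_salary = 34419.57
sd_salary = 17075.661

k = 2                                   # number of standard deviations (assumed choice)
min_fraction = 1 - 1 / k**2             # Chebyshev lower bound on coverage
low = max(0.0, mean_salary - k * sd_salary)   # clip at 0: salaries are non-negative
high = mean_salary + k * sd_salary

print(f"At least {min_fraction:.0%} of salaries lie between ${low:,.2f} and ${high:,.2f}")
```

This gives an interval from roughly $268 to roughly $68,571 that must contain at least 75% of the 474 salaries, whatever the shape of the distribution.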
Reference Stine, R.E. and Foster, D. (2012), Statistics for Business Decision Making and Analysis, Pearson Education.