Chapter 6 Describing Data
There are lots of different ways to describe data and some of them have been mentioned already (mode, mean and median). This chapter details how various descriptive statistics are calculated.
6.1 Measures of Central Tendency
Measures of central tendency are used to find the middle, or the average, of a data set. The measures of central tendency are the mean, median and mode.
6.1.1 Mean
The mean (sometimes referred to as the arithmetic mean) is the sum of the recorded values divided by the number of values recorded. The formula for the mean then is given by:
\[ \textrm{Mean} = \frac{\textrm{Sum of recorded values}}{\textrm{Number of values recorded}}.\]
6.1.1.1 Example
The table below shows data on male and female life expectancy at birth in the period 2018-2020 (NISRA 2022a).
Local Government District (2014) | Male life expectancy at birth | Female life expectancy at birth |
---|---|---|
Antrim and Newtownabbey | 78.8 | 82.6 |
Ards and North Down | 79.6 | 82.7 |
Armagh City, Banbridge and Craigavon | 79.3 | 83.2 |
Belfast | 75.8 | 80.5 |
Causeway Coast and Glens | 79.7 | 82.6 |
Derry City and Strabane | 78.0 | 81.6 |
Fermanagh and Omagh | 79.2 | 83.2 |
Lisburn and Castlereagh | 80.3 | 83.3 |
Mid and East Antrim | 79.0 | 82.3 |
Mid Ulster | 79.6 | 83.1 |
Newry, Mourne and Down | 79.3 | 83.2 |
The mean male life expectancy at birth is:
\[\textrm{Mean}=\frac{(78.8+79.6+79.3+75.8+79.7+78.0+79.2+80.3+79.0+79.6+79.3)}{11},\]
\[\textrm{Mean}=79.0\]
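The calculation above can also be reproduced in Python. This is a minimal sketch using the standard-library `statistics` module; the variable names are illustrative and not part of the original data source.

```python
from statistics import mean

# Male life expectancy at birth (2018-2020), one value per Local Government District
male_life_expectancy = [78.8, 79.6, 79.3, 75.8, 79.7, 78.0, 79.2, 80.3, 79.0, 79.6, 79.3]

# Sum of recorded values divided by the number of values recorded
manual_mean = sum(male_life_expectancy) / len(male_life_expectancy)

print(round(manual_mean, 1))                 # 79.0
print(round(mean(male_life_expectancy), 1))  # 79.0
```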
6.1.2 Median
The median is the middle number in a sorted, ascending or descending, list of values.
If there is an odd number of values, the median is simply the middle value.
For an even number of values there will be two values in the center. Those values are summed and divided by two.
The median is sometimes used as opposed to the mean when there are outliers that might skew the average of the values. For ordinal data, the median is usually the best indicator of central tendency (McHugh M. L. and Hudson-Barr D. 2003).
6.1.2.1 Example
Using the same table as before calculate the median male life expectancy at birth.
Local Government District (2014) | Male life expectancy at birth | Female life expectancy at birth |
---|---|---|
Antrim and Newtownabbey | 78.8 | 82.6 |
Ards and North Down | 79.6 | 82.7 |
Armagh City, Banbridge and Craigavon | 79.3 | 83.2 |
Belfast | 75.8 | 80.5 |
Causeway Coast and Glens | 79.7 | 82.6 |
Derry City and Strabane | 78.0 | 81.6 |
Fermanagh and Omagh | 79.2 | 83.2 |
Lisburn and Castlereagh | 80.3 | 83.3 |
Mid and East Antrim | 79.0 | 82.3 |
Mid Ulster | 79.6 | 83.1 |
Newry, Mourne and Down | 79.3 | 83.2 |
To calculate the median, the male life expectancy values first need to be sorted.
Local Government District (2014) | Male life expectancy at birth |
---|---|
Belfast | 75.8 |
Derry City and Strabane | 78.0 |
Antrim and Newtownabbey | 78.8 |
Mid and East Antrim | 79.0 |
Fermanagh and Omagh | 79.2 |
Armagh City, Banbridge and Craigavon | 79.3 |
Newry, Mourne and Down | 79.3 |
Ards and North Down | 79.6 |
Mid Ulster | 79.6 |
Causeway Coast and Glens | 79.7 |
Lisburn and Castlereagh | 80.3 |
The middle value (median) in this case is the 6th entry (Armagh City, Banbridge and Craigavon): 79.3.
The same data can be broken down by Administrative Area rather than Local Government District. As there is an even number of administrative areas (18), there are two middle values (East Antrim and North Down):
Administrative Area | Male life expectancy at birth |
---|---|
Belfast West | 73.7 |
Belfast North | 73.8 |
Foyle | 77.1 |
Belfast East | 77.6 |
Upper Bann | 78.5 |
West Tyrone | 79.2 |
North Antrim | 79.2 |
Fermanagh and South Tyrone | 79.4 |
East Antrim | 79.4 |
North Down | 79.5 |
South Down | 79.5 |
Newry and Armagh | 79.7 |
Belfast South | 79.7 |
South Antrim | 79.7 |
Mid Ulster | 79.9 |
East Londonderry | 79.9 |
Strangford | 80.0 |
Lagan Valley | 80.1 |
When there are two middle values they should be summed and divided by two to find the median:
\[\textrm{Median} =\frac{79.4+79.5}{2}=79.45.\]
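Both cases (odd and even numbers of values) can be sketched in Python. The helper `median_by_hand` is a hypothetical name used here for illustration; the standard-library function `statistics.median` gives the same results.

```python
from statistics import median

# Male life expectancy by Local Government District (11 values, odd)
lgd_values = [78.8, 79.6, 79.3, 75.8, 79.7, 78.0, 79.2, 80.3, 79.0, 79.6, 79.3]

# Male life expectancy by Administrative Area (18 values, even)
area_values = [73.7, 73.8, 77.1, 77.6, 78.5, 79.2, 79.2, 79.4, 79.4,
               79.5, 79.5, 79.7, 79.7, 79.7, 79.9, 79.9, 80.0, 80.1]


def median_by_hand(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: a single middle value
        return ordered[middle]
    # Even number of values: sum the two middle values and divide by two
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_by_hand(lgd_values))             # 79.3
print(round(median_by_hand(area_values), 2))  # 79.45
```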
Information
Try finding the mean and median of this list of numbers: 2, 3, 3, 4, 20.
The mean is 6.4 while the median is 3.
The mean is being skewed by the outlier (20) while the median remains closer to what might be considered to be the middle of the data set if the outlier was not present. This illustrates one of the main uses of the median. It is often used when there are outliers in a data set that might skew the average of the values.
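The effect of the outlier can be checked quickly in Python. This is a small sketch using the standard-library `statistics` module; the list `without_outlier` is an illustrative name for the same data with the value 20 removed.

```python
from statistics import mean, median

numbers = [2, 3, 3, 4, 20]
without_outlier = [x for x in numbers if x != 20]

print(mean(numbers))    # 6.4
print(median(numbers))  # 3

# With the outlier removed the mean falls to 3, matching the median
print(mean(without_outlier))
print(median(without_outlier))
```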
6.1.3 Mode
The mode of a set of data values is the value that appears most often. It is the value that is most likely to be sampled. There can be multiple modes or no modes. The mode is the only useful measure of central tendency when a variable is measured on a nominal scale (McHugh M. L. and Hudson-Barr D. 2003).
6.1.3.1 Example
The table below shows the total number of events funded by the Arts Council of Northern Ireland in 2010 through the Annual Support for Organisations Programme (NISRA 2012) filtered by Local Government District.
Local Government District | Events Funded |
---|---|
Cookstown | 11 |
Magherafelt | 14 |
Moyle | 20 |
Limavady | 22 |
Ballymoney | 23 |
Carrickfergus | 23 |
Strabane | 23 |
Ballymena | 28 |
Dungannon | 32 |
Ards | 34 |
Larne | 36 |
Banbridge | 38 |
Fermanagh | 38 |
Newtownabbey | 40 |
Omagh | 50 |
Down | 95 |
Coleraine | 109 |
Lisburn | 117 |
Antrim | 143 |
North Down | 145 |
Newry and Mourne | 152 |
Armagh | 169 |
Craigavon | 207 |
Castlereagh | 216 |
Derry | 750 |
Belfast | 3,177 |
Ballymoney, Carrickfergus and Strabane all had a total of 23 events funded by the Arts Council of Northern Ireland and this represents the mode as this is the most frequently occurring value.
6.1.3.2 Example
It is also possible to have multiple modes. For instance, consider the list of numbers:
\[ 7, 3, 5, 3, 4, 3, 5, 6, 8, 5.\]
The frequency table below counts how often each value appears.
Value | Frequency |
---|---|
3 | 3 |
4 | 1 |
5 | 3 |
6 | 1 |
7 | 1 |
8 | 1 |
This list is bimodal: it has two modes, 3 and 5.
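The frequency count and the modes for this list can be reproduced in Python. This is a minimal sketch; `collections.Counter` builds the frequency table and `statistics.multimode` (available from Python 3.8) returns every mode.

```python
from collections import Counter
from statistics import multimode

values = [7, 3, 5, 3, 4, 3, 5, 6, 8, 5]

# Frequency table: how often each value appears
frequency = Counter(values)

# Every value whose frequency equals the highest frequency is a mode
highest = max(frequency.values())
modes = sorted(v for v, count in frequency.items() if count == highest)

print(sorted(frequency.items()))  # [(3, 3), (4, 1), (5, 3), (6, 1), (7, 1), (8, 1)]
print(modes)                      # [3, 5]
print(multimode(values))          # [3, 5]
```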
6.1.4 Mean or Median?
The median may be a better indicator of the most typical value if a set of scores has outliers. Outliers are extreme values that differ greatly from other values. When the sample size is large and does not contain outliers the mean score usually provides a better measure of central tendency.
6.1.5 Using Excel
It is useful to calculate descriptive statistics by hand for understanding but for larger data sets it is not always possible to arrange data and perform calculations by hand.
Excel has a number of functions designed to perform descriptive statistics.
Frequency
=FREQUENCY(start:end,bins_array)
The frequency() function will return a frequency table describing your data. It takes two arguments, the first being the array of values and the second being an array describing the upper boundary of the bins used.
Average
=AVERAGE(start:end)
The mean is calculated using the average() function. There are several other functions relating to means: geomean(), harmean() and trimmean(). Take care not to use these as they are quite different from calculating the mean that has been described here.
Median
=MEDIAN(start:end)
The median is calculated using the median() function.
Mode
=MODE.SNGL(start:end)
=MODE.MULT(start:end)
There are several functions for calculating the mode: mode(), mode.sngl() and mode.mult().
mode() was used in Excel 2007 and may still appear as an option in some versions of Excel.
mode.sngl() will return one mode and mode.mult() will return multiple modes (if there are multiple modes).
Neither mode() nor mode.sngl() will provide a warning if there are multiple modes so mode.mult() is usually the safest option.
Summary
Measures of Central Tendency
The mean (sometimes referred to as the arithmetic mean) is the sum of the recorded values divided by the number of values recorded. The formula for the mean then is given by:
\[ \textrm{Mean} = \frac{\textrm{Sum of recorded values}}{\textrm{Number of values recorded}}.\]
The median is the middle number in a sorted, ascending or descending, list of values.
If there is an odd number of values, the median is simply the middle value.
For an even number of values there will be two values in the center. Those values are summed and divided by two.
The mode of a set of data values is the value that appears most often.
6.2 Frequency
The frequency of an observation is the number of times it occurs or is recorded. A frequency table for data, like the one shown below detailing fictional exam grades, is a commonly used method of depicting frequency.
Grade | Frequency |
---|---|
A | 15 |
B | 20 |
C | 25 |
D | 21 |
E | 14 |
A frequency distribution is a collection of observations produced by sorting observations into classes and showing their frequency of occurrence in each class. A frequency distribution helps us discern patterns in data (assuming they exist) by imposing a structure to the data (Witte R. S. and Witte J. S. 2017).
The total of all frequencies so far in a frequency distribution is the cumulative frequency. It is the ‘running total’ of frequencies.
Grade | Frequency | Cumulative Frequency |
---|---|---|
A | 15 | 15 |
B | 20 | 35 |
C | 25 | 60 |
D | 21 | 81 |
E | 14 | 95 |
The relative frequency is the ratio of the category frequency to the total number of outcomes. For grade A, the relative frequency is:
\[ \textrm{Relative Frequency}=\frac{15}{15+20+25+21+14}=0.16. \]
The table can be extended to include the relative frequency.
Grade | Frequency | Relative Frequency |
---|---|---|
A | 15 | 0.16 |
B | 20 | 0.21 |
C | 25 | 0.26 |
D | 21 | 0.22 |
E | 14 | 0.15 |
The relative frequency relates the count for a particular event to the total number of events using percentages, proportions or fractions and it can be reported as a percentage by multiplying the values by 100%. For grade A, the relative frequency reported as a percentage is: 100% x 0.16 = 16%.
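The cumulative and relative frequency columns for the exam grades can be built in Python as follows. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from itertools import accumulate

grades = ["A", "B", "C", "D", "E"]
frequencies = [15, 20, 25, 21, 14]

total = sum(frequencies)                                # 95
cumulative = list(accumulate(frequencies))              # running total
relative = [round(f / total, 2) for f in frequencies]   # proportion of the total

print("Grade Frequency Cumulative Relative")
for grade, f, c, r in zip(grades, frequencies, cumulative, relative):
    print(grade, f, c, r)
```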
6.2.1 Mean of a Frequency Distribution
While it is common to calculate the mean of a data set directly, sometimes we receive data in the form of a frequency table. To calculate the mean we multiply each value by its frequency, sum the results and divide by the total frequency (the sum of all the frequencies).
6.2.1.1 Example
Calculate the mean given the values and their respective frequencies in the table below:
Value | Frequency | Value x Frequency |
---|---|---|
1 | 2 | 2 |
2 | 3 | 6 |
3 | 5 | 15 |
4 | 6 | 24 |
5 | 5 | 25 |
6 | 4 | 24 |
7 | 2 | 14 |
8 | 1 | 8 |
The products of the values and their frequencies have been calculated in the table above. All that is left is to sum them and divide by the total frequency:
\[ \textrm{Mean}=\frac{2+6+15+24+25+24+14+8}{2+3+5+6+5+4+2+1}=\frac{118}{28}=4.21 \]
Information
We can write this using mathematical notation:
\[\mu=\frac{\sum_{i=1}^n x_i f_i}{\sum_{i=1}^n f_i},\]
where \(x_i\) are the individual values and \(f_i\) their respective frequencies.
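The formula translates directly into Python. This is a minimal sketch using the values and frequencies from the example table; the variable names are illustrative.

```python
values = [1, 2, 3, 4, 5, 6, 7, 8]
frequencies = [2, 3, 5, 6, 5, 4, 2, 1]

# Numerator: sum of each value multiplied by its frequency
weighted_sum = sum(x * f for x, f in zip(values, frequencies))

# Denominator: total frequency (the sum of all the frequencies)
total_frequency = sum(frequencies)

mean_value = weighted_sum / total_frequency

print(weighted_sum, total_frequency)  # 118 28
print(round(mean_value, 2))           # 4.21
```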
6.2.2 Mode of a Frequency Distribution
The modal value (or the modal class in the case of a frequency distribution) is simply the value which corresponds to the largest frequency. In the example above the modal value is 4.
6.2.3 Median of a Frequency Distribution
To find the median of a frequency distribution we need to first calculate the cumulative frequency:
Value | Frequency | Value x Frequency | Cumulative Frequency |
---|---|---|---|
1 | 2 | 2 | 2 |
2 | 3 | 6 | 5 |
3 | 5 | 15 | 10 |
4 | 6 | 24 | 16 |
5 | 5 | 25 | 21 |
6 | 4 | 24 | 25 |
7 | 2 | 14 | 27 |
8 | 1 | 8 | 28 |
We divide the total frequency (the final cumulative frequency, 28) by 2 to find the midpoint. In this case, it’s 14. Then, check each value in turn to see if its cumulative frequency is greater than that number. The first value whose cumulative frequency is greater than the midpoint is the median. In the table above, the first value with a cumulative frequency greater than 14 is 4, so the median is 4. (If a cumulative frequency were exactly equal to the midpoint, the median would instead be the average of that value and the next one.)
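The mode and median of this frequency table can be found in Python. This is a minimal sketch: it applies the cumulative-frequency rule and then cross-checks the result by expanding the table back into its 28 raw observations. The variable names are illustrative.

```python
from itertools import accumulate
from statistics import median

values = [1, 2, 3, 4, 5, 6, 7, 8]
frequencies = [2, 3, 5, 6, 5, 4, 2, 1]

# Mode: the value with the largest frequency
modal_value = max(zip(values, frequencies), key=lambda pair: pair[1])[0]

# Median: first value whose cumulative frequency exceeds half the total frequency
midpoint = sum(frequencies) / 2
median_value = next(
    value
    for value, running_total in zip(values, accumulate(frequencies))
    if running_total > midpoint
)

# Cross-check: rebuild the raw observations and take their median directly
expanded = [value for value, f in zip(values, frequencies) for _ in range(f)]

print(modal_value)       # 4
print(median_value)      # 4
print(median(expanded))  # 4.0
```

Expanding the table is the more robust check, because `statistics.median` also handles the case where the two middle observations differ.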
6.2.4 Mean of a Grouped Frequency Distribution
If the frequency table is a grouped data frequency table, where the values are banded (0-5, 5-10, 10-15, etc.), then the equation for the mean uses the midpoint of the band (the sum of the lower and upper limits divided by two) in place of a single value.
Take the table below for instance:
Bin | Frequency |
---|---|
10-14 | 1 |
15-19 | 3 |
20-24 | 9 |
25-29 | 2 |
To calculate the mean we would rewrite this table as follows:
Midpoint | Frequency |
---|---|
12 | 1 |
17 | 3 |
22 | 9 |
27 | 2 |
Previously we created a new column for the product of the value and the frequency. We do the same again but this time the new column will hold values for the product of the midpoint with the frequency:
Midpoint | Frequency | Mf |
---|---|---|
12 | 1 | 12 |
17 | 3 | 51 |
22 | 9 | 198 |
27 | 2 | 54 |
The process is the same as before. We sum the products of the midpoint and the frequency and divide by the total frequency:
\[\textrm{Mean}=\frac{12+51+198+54}{1+3+9+2}=21\]
Information
We can write this in mathematical notation as:
\[ \mu=\frac{\sum_{i=1}^n M_i f_i}{\sum_{i=1}^n f_i},\]
where \(M_i\) are the midpoints and \(f_i\) are their respective frequencies.
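The grouped mean can be sketched in Python. The midpoint of each bin is taken as the average of its lower and upper limits; the variable names are illustrative.

```python
# Each bin is given as (lower limit, upper limit)
bins = [(10, 14), (15, 19), (20, 24), (25, 29)]
frequencies = [1, 3, 9, 2]

# Midpoint of each bin: (lower + upper) / 2
midpoints = [(lower + upper) / 2 for lower, upper in bins]

# Sum of midpoint x frequency, divided by the total frequency
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / sum(frequencies)

print(midpoints)     # [12.0, 17.0, 22.0, 27.0]
print(grouped_mean)  # 21.0
```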
6.2.5 Median of a Grouped Frequency Distribution
To find the median we need several values: \(l\), the lower limit of the median class; \(n\), the total number of observations; \(c_f\), the cumulative frequency of the class preceding the median class; \(f\), the frequency of the median class; and \(c_l\), the class length. Given these, the median is:
\[\textrm{Median}=l + c_l \frac{\frac{n}{2}-c_f}{f} \]
Bin | Frequency | cf | Mf |
---|---|---|---|
10-14 | 1 | 1 | 12 |
15-19 | 3 | 4 | 51 |
20-24 | 9 | 13 | 198 |
25-29 | 2 | 15 | 54 |
The total number of observations \(n = 15\).
Divide this by 2 to get 7.5.
The median class is the first class whose cumulative frequency is larger than this number. For us that’s the 20-24 class, whose cumulative frequency is 13.
The lower limit, \(l\), of this class is 20.
The cumulative frequency of the class preceding the median class, \(c_f\), is 4.
The frequency of the median class, \(f\), is 9.
The class length, \(c_l\), is 4.
The median then is calculated by plugging these values into the formula above:
\[\textrm{Median} = l + c_l \frac{\frac{n}{2}-c_f}{f},\]
\[\textrm{Median} = 20 + \frac{4(\frac{15}{2}-4)}{9}, \]
\[\textrm{Median} =21.6.\]
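The substitution above can be checked in Python. This is a minimal sketch; `grouped_median` is a hypothetical helper whose parameters mirror the symbols in the formula, and the inputs are the values read from the table in this example.

```python
def grouped_median(lower_limit, class_length, n, cf_before, class_frequency):
    """Median = l + c_l * (n / 2 - c_f) / f for a grouped frequency table."""
    return lower_limit + class_length * (n / 2 - cf_before) / class_frequency


result = grouped_median(
    lower_limit=20,     # l: lower limit of the median class (20-24)
    class_length=4,     # c_l: class length as used in this example
    n=15,               # total number of observations
    cf_before=4,        # c_f: cumulative frequency of the preceding class
    class_frequency=9,  # f: frequency of the median class
)

print(round(result, 1))  # 21.6
```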
Information
The variance (more on this later) of a grouped frequency distribution is given by:
\[ V=\frac{\sum_{i=1}^n f_i M_i^2 - N\mu^2}{N -1}, \]
where \(f_i\) are the frequencies, \(M_i\) are the midpoints of the bands (or bins), \(\mu\) is the mean and \(N=\sum_{i=1}^n f_i\) is the total number of observations.
The standard deviation is given by the square root of \(V\).
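As an illustration, the sketch below computes the variance of the grouped table used earlier. It assumes the sample form of the variance, that is, the sum of each frequency multiplied by the squared distance of its midpoint from the mean, divided by the total number of observations minus one. The variable names are illustrative.

```python
from math import sqrt

midpoints = [12, 17, 22, 27]
frequencies = [1, 3, 9, 2]

total = sum(frequencies)                                            # 15 observations
mu = sum(m * f for m, f in zip(midpoints, frequencies)) / total     # grouped mean, 21.0

# Sample variance: sum of f * (M - mu)^2 divided by (N - 1)
squared_deviations = sum(f * (m - mu) ** 2 for m, f in zip(midpoints, frequencies))
grouped_variance = squared_deviations / (total - 1)

# Equivalent shortcut form: (sum of f * M^2 - N * mu^2) / (N - 1)
shortcut = (sum(f * m ** 2 for m, f in zip(midpoints, frequencies)) - total * mu ** 2) / (total - 1)

print(grouped_variance)                  # 15.0
print(round(sqrt(grouped_variance), 2))  # 3.87
```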
Summary
Frequency
The frequency of an observation is the number of times it occurs or is recorded. A frequency table is a commonly used method of depicting frequency.
The total of all frequencies so far in a frequency distribution is the cumulative frequency. It is the ‘running total’ of frequencies.
The relative frequency is the ratio of the category frequency to the total number of outcomes.
6.3 Measures of Dispersion
Dispersion (or variability) describes how far apart data points lie from each other and from the center of a distribution. The range, interquartile range, variance and standard deviation are all measures of dispersion.
6.3.1 Range
The range is the difference between the highest and lowest values and is calculated by subtracting the minimum value from the maximum value.
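For instance, the range of the male life expectancy values from the Local Government District table in section 6.1 can be computed in Python. This is a minimal sketch; the variable names are illustrative.

```python
# Male life expectancy at birth by Local Government District
male_life_expectancy = [78.8, 79.6, 79.3, 75.8, 79.7, 78.0, 79.2, 80.3, 79.0, 79.6, 79.3]

# Range: maximum value minus minimum value
value_range = max(male_life_expectancy) - min(male_life_expectancy)

print(round(value_range, 1))  # 4.5
```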
6.3.2 Interquartile Range
The interquartile range (IQR) describes the spread of the middle half of a distribution. How the interquartile range is calculated depends on whether there are an even or an odd number of values in a dataset.
For an even number of values, the sorted dataset is split in half. The median of each half is calculated. The positive difference between those two medians is the interquartile range.
For an odd number of values either the inclusive or the exclusive method of finding the interquartile range must be used.
The algorithm for the exclusive method is detailed below:
- Arrange the data in numeric order.
- Remove the median and split the data about its center.
- Find the medians of the two new subsets of data.
- Calculate the difference.
The algorithm for the inclusive method is detailed below:
- Arrange the data in numeric order.
- Remove the median and split the data about its center.
- Append the two new subsets of data with the median.
- Find the medians of the two newly appended subsets of data.
- Calculate the difference.
6.3.2.1 Example
Find the interquartile range for the list of numbers below:
\[6, 7, 8, 8, 7, 6, 9, 5, 10, 4. \]
There is an even number of values. Arrange them in numeric order:
\[ 4, 5, 6, 6, 7, 7, 8, 8, 9, 10.\]
Split the values about their center into two subsets of data:
\[ (4, 5, 6, 6, 7), (7, 8, 8, 9, 10). \]
Find the medians of each of these subsets. The first subset has a median of 6 while the second has a median of 8.
The interquartile range is:
\[ \textrm{IQR} = 8 - 6 = 2.\]
Note: To calculate the interquartile range the smaller median value is always subtracted from the larger.
6.3.2.2 Example
Find the interquartile range for the list of numbers below using the exclusive method:
\[2, 3, 2, 4, 3, 5, 4, 4, 2.\]
Arrange the values in numeric order:
\[2, 2, 2, 3, 3, 4, 4, 4, 5. \]
Remove the median (3) and split the data as before:
\[ (2, 2, 2, 3), (4, 4, 4, 5).\]
The interquartile range is:
\[ \textrm{IQR}=\textrm{Median of subset 2}- \textrm{Median of subset 1},\]
\[ \textrm{IQR}=\frac{4+4}{2} - \frac{2+2}{2}=\frac{8}{2} - \frac{4}{2} = 4 - 2= 2.\]
6.3.2.3 Example
Find the interquartile range of the same list of numbers using the inclusive method:
\[ 2, 3, 2, 4, 3, 5, 4, 4, 2.\]
Sort in numeric order as before:
\[2, 2, 2, 3, 3, 4, 4, 4, 5.\]
Split the data as before but add the median to each subset (at the end of the first subset and at the start of the second):
\[(2, 2, 2, 3, 3),(3, 4, 4, 4, 5).\]
Find the medians of each of the subsets and calculate the interquartile range. The median of the first subset is 2 and the median of the second subset is 4.
\[ \textrm{IQR} = 4 - 2 = 2 \]
The interquartile range is a useful measure of variability for skewed distributions. It can show where most values lie and how clustered they are. It is useful for datasets with outliers as it is based on the middle half of the distribution and is less influenced by extreme values. Exclusive calculations give an interquartile range that is at least as wide as inclusive calculations, and often wider; for the list used in the examples above, both methods happen to give 2.
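The procedures described in this section can be sketched in Python. The function `iqr` and its `method` parameter are hypothetical names used for illustration; the code follows the split-and-take-medians approach described here, which may differ from the quartile definitions used by some software packages.

```python
from statistics import median


def iqr(values, method="exclusive"):
    """Interquartile range by splitting the sorted data about its center."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    if n % 2 == 0:
        # Even number of values: split the data in half
        upper = ordered[half:]
    else:
        # Odd number of values: remove the median before splitting
        upper = ordered[half + 1:]
        if method == "inclusive":
            # Inclusive method: add the median back to both subsets
            middle = ordered[half]
            lower = lower + [middle]
            upper = [middle] + upper
    return median(upper) - median(lower)


print(iqr([6, 7, 8, 8, 7, 6, 9, 5, 10, 4]))            # 2
print(iqr([2, 3, 2, 4, 3, 5, 4, 4, 2], "exclusive"))   # 2.0
print(iqr([2, 3, 2, 4, 3, 5, 4, 4, 2], "inclusive"))   # 2
```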
6.3.3 Variance
The standard deviation describes to what extent a set of numbers lie apart (their spread). It is the square root of variance which is also an indicator of the spread of values.
To calculate the variance:
- Start by finding the mean of the values in the dataset.
- Find the difference between each recorded value and the mean.
- Square those differences.
- Sum the squared differences.
- Divide the sum by the number of values recorded for the population variance, or by the number of values minus 1 for the sample variance.
Information
The population variance is given by:
\[V_{p} = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2,\]
where \(V_p\) is the population variance, \(n\) is the number of observations, \(x_i\) are the observations and \(\mu\) is the population mean.
The sample variance is given by:
\[V_{s} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2,\]
where \(V_s\) is the sample variance, \(n\) is the number of observations, \(x_i\) are the observations and \(\bar{x}\) is the sample mean.
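The two formulas differ only in the divisor, which the sketch below makes explicit. The functions `population_variance` and `sample_variance` are hypothetical helpers, and the data are made-up values chosen for easy arithmetic; the standard-library functions `statistics.pvariance` and `statistics.variance` compute the same quantities.

```python
from statistics import pvariance, variance


def population_variance(values):
    """Mean squared difference from the mean, dividing by n."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared differences from the mean, dividing by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [1, 2, 3, 4, 5]  # illustrative data, mean 3, squared differences sum to 10

print(population_variance(data))  # 2.0
print(sample_variance(data))      # 2.5
```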
6.3.4 Standard Deviation
Taking the square root of the variance gives the standard deviation. Because the differences were squared when calculating the variance, taking the square root returns the measure to the same units as the original data.
A standard deviation can range from 0 to infinity. A standard deviation of 0 means that a list of numbers are all equal and they don’t lie apart at all.
To make sense of this through an example, the plot below shows some simulated data for test scores. Three groups given the same test could achieve the same average score but with different spreads of scores.
For the group with a mean test score of 30 and a standard deviation of 5, most of the test scores are tightly packed within the range 25-35.
Information
In statistics there is a rule called the empirical rule that states that, for data that follow a normal distribution, approximately 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively (Lee D. K., In J. and Lee S. 2015).
For a mean of 30 and standard deviation of 5: about 68% of the values will lie within the range 25-35.
For a mean of 30 and standard deviation of 10: about 68% of the values will lie within the range 20-40.
For a mean of 30 and standard deviation of 15: about 68% of the values will lie within the range 15-45.
Statisticians will sometimes use a z-score to indicate how far from the mean a particular element in the data set is.
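The empirical rule percentages can be checked with the standard-library `statistics.NormalDist` class (Python 3.8 or later). This is a minimal sketch that assumes the test scores follow a normal distribution with mean 30 and standard deviation 5; it also shows a z-score calculated as the distance from the mean divided by the standard deviation.

```python
from statistics import NormalDist

mean_score, sd = 30, 5
scores = NormalDist(mu=mean_score, sigma=sd)

# Share of values within 1, 2 and 3 standard deviations of the mean
shares = {}
for k in (1, 2, 3):
    lower, upper = mean_score - k * sd, mean_score + k * sd
    shares[k] = scores.cdf(upper) - scores.cdf(lower)
    print(f"{lower}-{upper}: {shares[k] * 100:.1f}%")

# z-score of a test score of 35: how many standard deviations it lies from the mean
z = (35 - mean_score) / sd
print(z)  # 1.0
```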
6.3.4.1 Example
Calculate the sample estimate of variance and sample estimate of standard deviation for the following list of values:
\[ 2, 4, 4, 5, 6.\]
Start by finding the mean of the values in the dataset:
\[ \textrm{Mean}= \frac{2 + 4 + 4 + 5 + 6}{5}=4.2.\]
Find the difference between each recorded value and the mean.
Value | Difference |
---|---|
2 | 2 - 4.2 = -2.2 |
4 | 4 - 4.2 = -0.2 |
4 | 4 - 4.2 = -0.2 |
5 | 5 - 4.2 = 0.8 |
6 | 6 - 4.2 = 1.8 |
Square the differences.
Value | Difference | Squared Difference |
---|---|---|
2 | -2.2 | 4.84 |
4 | -0.2 | 0.04 |
4 | -0.2 | 0.04 |
5 | 0.8 | 0.64 |
6 | 1.8 | 3.24 |
Sum the squared differences.
\[\textrm{Sum} = 4.84 + 0.04 + 0.04 + 0.64 + 3.24 = 8.8. \]
Divide the sum by the number of values recorded minus one to get the sample estimate of variance.
\[ \textrm{Variance}_{s} = \frac{8.8}{5-1} = 2.2.\]
To get the sample estimate of the standard deviation take the square root of this value:
\[ \textrm{Standard Deviation}_s = \sqrt{ \textrm{Variance}_{s}} = \sqrt{2.2} = 1.48.\]
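The same steps can be followed in Python. This is a minimal sketch that mirrors the hand calculation and then cross-checks it against the standard-library functions `statistics.variance` and `statistics.stdev`, which both use the sample (n - 1) form. The variable names are illustrative.

```python
from math import sqrt
from statistics import stdev, variance

values = [2, 4, 4, 5, 6]

# Step 1: mean of the values
mean_value = sum(values) / len(values)

# Steps 2 and 3: difference from the mean for each value, then squared
squared_differences = [(x - mean_value) ** 2 for x in values]

# Steps 4 and 5: sum the squared differences and divide by n - 1
sample_variance = sum(squared_differences) / (len(values) - 1)
sample_sd = sqrt(sample_variance)

print(mean_value)                 # 4.2
print(round(sample_variance, 2))  # 2.2
print(round(sample_sd, 2))        # 1.48
```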
6.3.5 Using Excel
Calculating the variance and standard deviation by hand is a long process and due to the number of steps involved it is prone to error. Excel, SPSS, Python and R all have functions which allow users to calculate these descriptive statistics and their use is highly recommended over calculating the statistics by hand.
Range
=MAX(start:end)-MIN(start:end)
There is no single function for calculating the range in Excel but the formula above will subtract the smallest value from the largest value in an array.
Standard Deviation
=STDEV.S(start:end)
=STDEV.P(start:end)
stdev.s() estimates standard deviation based on a sample. stdev.p() calculates standard deviation based on the entire population given as arguments.
Variance
=VAR.S(start:end)
=VAR.P(start:end)
var.s() estimates variance based on a sample. var.p() calculates variance based on the entire population given as arguments.
Summary
Measures of Dispersion
The range is the difference between the highest and lowest values.
The interquartile range (IQR) describes the spread of the middle half of a distribution. How the interquartile range is calculated depends on whether there are an even or an odd number of values in a dataset.
The standard deviation describes to what extent a set of numbers lie apart (their spread). It is the square root of variance which is also an indicator of the spread of values.
To calculate the variance:
- Start by finding the mean of the values in the dataset.
- Find the difference between each recorded value and the mean.
- Square those differences.
- Sum the squared differences.
- Divide the sum by the number of values recorded for the population variance, or by the number of values minus 1 for the sample variance.