Chapter 6 Describing Data


There are lots of different ways to describe data and some of them have been mentioned already (mode, mean and median). This chapter details how various descriptive statistics are calculated.

6.1 Measures of Central Tendency

Measures of central tendency are used to find the middle, or the average, of a data set. The measures of central tendency are the mean, median and mode.

6.1.1 Mean

The mean (sometimes referred to as the arithmetic mean) is the sum of the recorded values divided by the number of values recorded. The formula for the mean then is given by:

\[ \textrm{Mean} = \frac{\textrm{Sum of recorded values}}{\textrm{Number of values recorded}}.\]

6.1.1.1 Example

The table below shows data on male and female life expectancy at birth in the period 2018-2020 (NISRA 2022a).

Life Expectancy at Birth, 2020
Local Government District (2014) Male life expectancy at birth Female life expectancy at birth
Antrim and Newtownabbey 78.8 82.6
Ards and North Down 79.6 82.7
Armagh City, Banbridge and Craigavon 79.3 83.2
Belfast 75.8 80.5
Causeway Coast and Glens 79.7 82.6
Derry City and Strabane 78.0 81.6
Fermanagh and Omagh 79.2 83.2
Lisburn and Castlereagh 80.3 83.3
Mid and East Antrim 79.0 82.3
Mid Ulster 79.6 83.1
Newry, Mourne and Down 79.3 83.2

The mean male life expectancy at birth is:

\[\textrm{Mean}=\frac{(78.8+79.6+79.3+75.8+79.7+78.0+79.2+80.3+79.0+79.6+79.3)}{11},\]

\[\textrm{Mean}=79.0\]
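As a quick check, the calculation above can be reproduced in a few lines of Python. This is only a sketch: the variable names are illustrative and the values are copied from the male life expectancy column of the table.

```python
# Male life expectancy at birth for the 11 Local Government Districts
# (values copied from the table above).
male_life_expectancy = [78.8, 79.6, 79.3, 75.8, 79.7, 78.0,
                        79.2, 80.3, 79.0, 79.6, 79.3]

# Mean = sum of recorded values / number of values recorded.
mean = sum(male_life_expectancy) / len(male_life_expectancy)

print(round(mean, 1))  # 79.0
```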

6.1.2 Median

The median is the middle number in a sorted, ascending or descending, list of values.

If there are an odd number of values the median is simply the middle value.

For an even number of values there will be two values in the center. Those values are summed and divided by two.

The median is sometimes used instead of the mean when there are outliers that might skew the mean. For ordinal data, the median is usually the best indicator of central tendency (McHugh M. L. and Hudson-Barr D. 2003).

6.1.2.1 Example

Using the same table as before calculate the median male life expectancy at birth.

Life Expectancy at Birth, 2020
Local Government District (2014) Male life expectancy at birth Female life expectancy at birth
Antrim and Newtownabbey 78.8 82.6
Ards and North Down 79.6 82.7
Armagh City, Banbridge and Craigavon 79.3 83.2
Belfast 75.8 80.5
Causeway Coast and Glens 79.7 82.6
Derry City and Strabane 78.0 81.6
Fermanagh and Omagh 79.2 83.2
Lisburn and Castlereagh 80.3 83.3
Mid and East Antrim 79.0 82.3
Mid Ulster 79.6 83.1
Newry, Mourne and Down 79.3 83.2

To calculate the median the male life expectancy at birth needs to be ordered.

Life Expectancy at Birth, 2020
Local Government District (2014) Male life expectancy at birth
Belfast 75.8
Derry City and Strabane 78.0
Antrim and Newtownabbey 78.8
Mid and East Antrim 79.0
Fermanagh and Omagh 79.2
Armagh City, Banbridge and Craigavon 79.3
Newry, Mourne and Down 79.3
Ards and North Down 79.6
Mid Ulster 79.6
Causeway Coast and Glens 79.7
Lisburn and Castlereagh 80.3

The middle value (median) in this case is the 6th entry (Armagh City, Banbridge and Craigavon): 79.3.

The same data can be sorted by Administrative Area rather than Local Government District. As there is an even number of administrative areas, there are two middle values (East Antrim and North Down):

Life Expectancy at Birth, 2020
Administrative Area Male life expectancy at birth
Belfast West 73.7
Belfast North 73.8
Foyle 77.1
Belfast East 77.6
Upper Bann 78.5
West Tyrone 79.2
North Antrim 79.2
Fermanagh and South Tyrone 79.4
East Antrim 79.4
North Down 79.5
South Down 79.5
Newry and Armagh 79.7
Belfast South 79.7
South Antrim 79.7
Mid Ulster 79.9
East Londonderry 79.9
Strangford 80.0
Lagan Valley 80.1

When there are two middle values they should be summed and divided by two to find the median:

\[\textrm{Median} =\frac{79.4+79.5}{2}=79.45.\]
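The odd and even cases can be sketched as a single Python function. The function name `median` and the list names are illustrative; the data are the two male life expectancy columns shown above.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: the single middle value.
        return ordered[middle]
    # Even number of values: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

districts = [78.8, 79.6, 79.3, 75.8, 79.7, 78.0, 79.2, 80.3, 79.0, 79.6, 79.3]
areas = [73.7, 73.8, 77.1, 77.6, 78.5, 79.2, 79.2, 79.4, 79.4, 79.5,
         79.5, 79.7, 79.7, 79.7, 79.9, 79.9, 80.0, 80.1]

print(median(districts))        # 79.3
print(round(median(areas), 2))  # 79.45
```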

Information

Try finding the mean and median of this list of numbers: 2, 3, 3, 4, 20.

The mean is 6.4 while the median is 3.

The mean is being skewed by the outlier (20) while the median remains closer to what might be considered to be the middle of the data set if the outlier was not present. This illustrates one of the main uses of the median. It is often used when there are outliers in a data set that might skew the average of the values.

6.1.3 Mode

The mode of a set of data values is the value that appears most often. It is the value that is most likely to be sampled. There can be multiple modes or no modes. The mode is the only useful measure of central tendency when a variable is measured on a nominal scale (McHugh M. L. and Hudson-Barr D. 2003).

6.1.3.1 Example

The table below shows the total number of events funded by the Arts Council of Northern Ireland in 2010 through the Annual Support for Organisations Programme (NISRA 2012) filtered by Local Government District.

Events Funded by the Arts Council, 2010
Local Government District Events Funded
Cookstown 11
Magherafelt 14
Moyle 20
Limavady 22
Ballymoney 23
Carrickfergus 23
Strabane 23
Ballymena 28
Dungannon 32
Ards 34
Larne 36
Banbridge 38
Fermanagh 38
Newtownabbey 40
Omagh 50
Down 95
Coleraine 109
Lisburn 117
Antrim 143
North Down 145
Newry and Mourne 152
Armagh 169
Craigavon 207
Castlereagh 216
Derry 750
Belfast 3,177

Ballymoney, Carrickfergus and Strabane all had a total of 23 events funded by the Arts Council of Northern Ireland and this represents the mode as this is the most frequently occurring value.

6.1.3.2 Example

It is also possible to have multiple modes. For instance, consider the list of numbers:

\[ 7, 3, 5, 3, 4, 3, 5, 6, 8, 5.\]

The frequency table below counts how often each value appears.

Frequency Table of Values
Value Frequency
3 3
4 1
5 3
6 1
7 1
8 1

This list is bimodal: it has two modes, 3 and 5.

6.1.3.3 Example

Find the mode of this list of numbers:

\[ 1, 2, 3, 4, 5, 6.\]

Frequency Table of Values
Value Frequency
1 1
2 1
3 1
4 1
5 1
6 1

Every value is unique and occurs only once so this data has no mode.
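The frequency-table approach used in the last two examples can be sketched in Python with `collections.Counter`. The function name `modes` is illustrative, and returning an empty list when no value repeats is a convention chosen here to match the text.

```python
from collections import Counter

def modes(values):
    """Return a sorted list of modes; empty if every value occurs once."""
    counts = Counter(values)          # value -> frequency
    highest = max(counts.values())    # largest frequency in the table
    if highest == 1:
        # No value repeats, so the data has no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([7, 3, 5, 3, 4, 3, 5, 6, 8, 5]))  # [3, 5]
print(modes([1, 2, 3, 4, 5, 6]))              # []
```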

6.1.4 Mean or Median?

The median may be a better indicator of the most typical value if a set of scores has outliers. Outliers are extreme values that differ greatly from other values. When the sample size is large and does not contain outliers the mean score usually provides a better measure of central tendency.

6.1.5 Using Excel

It is useful to calculate descriptive statistics by hand for understanding but for larger data sets it is not always possible to arrange data and perform calculations by hand.

Excel has a number of functions designed to perform descriptive statistics.

Frequency

=FREQUENCY(start:end,bins_array)

The frequency() function will return a frequency table describing your data. It takes two arguments, the first being the array of values and the second being an array describing the upper boundary of the bins used.

Average

=AVERAGE(start:end)

The mean is calculated using the average() function. There are several other functions relating to means: geomean(), harmean() and trimmean(). Take care not to use these as they are quite different from calculating the mean that has been described here.

Median

=MEDIAN(start:end)

The median is calculated using the median() function.

Mode

=MODE.SNGL(start:end)

=MODE.MULT(start:end)

There are several functions for calculating the mode: mode(), mode.sngl() and mode.mult().

mode() was used in Excel 2007 and may still appear as an option in some versions of Excel.

mode.sngl() will return one mode and mode.mult() will return multiple modes (if there are multiple modes).

Neither mode() nor mode.sngl() will provide a warning if there are multiple modes so mode.mult() is usually the safest option.
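For readers working outside Excel, Python's standard `statistics` module offers rough counterparts to these functions. The sketch below assumes Python 3.8 or later (needed for `multimode`) and uses the short list from the mean and median comparison earlier in the chapter.

```python
import statistics

values = [2, 3, 3, 4, 20]

print(statistics.mean(values))       # 6.4
print(statistics.median(values))     # 3
print(statistics.multimode(values))  # [3]
```

Note that `multimode` returns every value when each one occurs exactly once, whereas this chapter treats such data as having no mode.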

Summary

Measures of Central Tendency

The mean (sometimes referred to as the arithmetic mean) is the sum of the recorded values divided by the number of values recorded. The formula for the mean then is given by:

\[ \textrm{Mean} = \frac{\textrm{Sum of recorded values}}{\textrm{Number of values recorded}}.\]

The median is the middle number in a sorted, ascending or descending, list of values.

If there are an odd number of values the median is simply the middle value.

For an even number of values there will be two values in the center. Those values are summed and divided by two.

The mode of a set of data values is the value that appears most often.

6.2 Frequency

The frequency of an observation is the number of times it occurs or is recorded. A frequency table for data, like the one shown below detailing fictional exam grades, is a commonly used method of depicting frequency.

Frequency Table
Grade Frequency
A 15
B 20
C 25
D 21
E 14

A frequency distribution is a collection of observations produced by sorting observations into classes and showing their frequency of occurrence in each class. A frequency distribution helps us discern patterns in data (assuming they exist) by imposing a structure on the data (Witte R. S. and Witte J. S. 2017).

The total of all frequencies so far in a frequency distribution is the cumulative frequency. It is the ‘running total’ of frequencies.

Cumulative Frequency Table
Grade Frequency Cumulative Frequency
A 15 15
B 20 35
C 25 60
D 21 81
E 14 95

The relative frequency is the ratio of the category frequency to the total number of outcomes. For grade A, the relative frequency is:

\[ \textrm{Relative Frequency}=\frac{15}{15+20+25+21+14}=0.16. \]

The table can be extended to include the relative frequency.

Relative Frequency Table
Grade Frequency Relative Frequency
A 15 0.16
B 20 0.21
C 25 0.26
D 21 0.22
E 14 0.15

The relative frequency relates the count for a particular event to the total number of events using percentages, proportions or fractions and it can be reported as a percentage by multiplying the values by 100%. For grade A, the relative frequency reported as a percentage is: 100% x 0.16 = 16%.
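The cumulative and relative frequency columns can be generated from the frequency column alone. The following Python sketch uses the fictional exam grades above; the variable names are illustrative.

```python
from itertools import accumulate

grades = ["A", "B", "C", "D", "E"]
frequencies = [15, 20, 25, 21, 14]

total = sum(frequencies)                               # 95
cumulative = list(accumulate(frequencies))             # running total
relative = [round(f / total, 2) for f in frequencies]  # share of the total

for row in zip(grades, frequencies, cumulative, relative):
    print(*row)
```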

6.2.1 Mean of a Frequency Distribution

While it is common to calculate the mean of a raw data set, sometimes we receive data in the form of a frequency table. To calculate the mean we multiply each value by its frequency, sum the results and divide by the total frequency (the sum of the frequencies, which is also the final cumulative frequency).

6.2.1.1 Example

Calculate the mean given the values and their respective frequencies in the table below:

Frequency Table
Value Frequency Value x Frequency
1 2 2
2 3 6
3 5 15
4 6 24
5 5 25
6 4 24
7 2 14
8 1 8

The products of the values and their frequencies have been calculated in the table above. All that is left is to sum them and divide by the total frequency:

\[ \textrm{Mean}=\frac{2+6+15+24+25+24+14+8}{2+3+5+6+5+4+2+1}=\frac{118}{28}=4.21 \]

Information

We can write this using mathematical notation:

\[\mu=\frac{\sum_{i=1}^n x_i f_i}{\sum_{i=1}^n f_i},\]

where \(x_i\) are the individual values and \(f_i\) their respective frequencies.
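The same calculation can be sketched in Python, with the values and frequencies taken from the table in the example. The variable names are illustrative.

```python
values = [1, 2, 3, 4, 5, 6, 7, 8]
frequencies = [2, 3, 5, 6, 5, 4, 2, 1]

# Sum of value x frequency, divided by the total frequency.
weighted_sum = sum(x * f for x, f in zip(values, frequencies))  # 118
total_frequency = sum(frequencies)                              # 28
mean = weighted_sum / total_frequency

print(round(mean, 2))  # 4.21
```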

6.2.2 Mode of a Frequency Distribution

The modal value (or the modal class in the case of a frequency distribution) is simply the value which corresponds to the largest frequency. In the example above the modal value is 4.

6.2.3 Median of a Frequency Distribution

To find the median of a frequency distribution we need to first calculate the cumulative frequency:

Frequency Table
Value Frequency Value x Frequency Cumulative Frequency
1 2 2 2
2 3 6 5
3 5 15 10
4 6 24 16
5 5 25 21
6 4 24 25
7 2 14 27
8 1 8 28

We divide the total frequency (the final cumulative frequency) by 2 to find the midpoint. In this case, it’s 14. Then, check each value in turn to see if its cumulative frequency is greater than that number. The first value with a cumulative frequency greater than the midpoint is the median. In the table above that value is 4, so the median is 4. If a cumulative frequency is exactly equal to the midpoint, the median is instead the mean of that value and the next value in the table.
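One way to sketch this in Python is to look up the value that occupies the middle position (or the two middle positions when the total frequency is even), which also handles the case where a cumulative frequency lands exactly on the midpoint. The function names are illustrative, and the values are assumed to be listed in ascending order.

```python
from itertools import accumulate

def value_at(position, values, cumulative):
    """Return the value occupying a 1-based position in the ordered data."""
    for value, running_total in zip(values, cumulative):
        if position <= running_total:
            return value
    raise ValueError("position is larger than the total frequency")

def frequency_median(values, frequencies):
    """Median of data given as ascending values with their frequencies."""
    cumulative = list(accumulate(frequencies))
    n = cumulative[-1]
    if n % 2 == 1:
        return value_at((n + 1) // 2, values, cumulative)
    # Even total: average the two middle positions, n/2 and n/2 + 1.
    lower = value_at(n // 2, values, cumulative)
    upper = value_at(n // 2 + 1, values, cumulative)
    return (lower + upper) / 2

values = [1, 2, 3, 4, 5, 6, 7, 8]
frequencies = [2, 3, 5, 6, 5, 4, 2, 1]
print(frequency_median(values, frequencies))  # 4.0
```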

6.2.4 Mean of a Grouped Frequency Distribution

If the frequency table is a grouped frequency table, where the values are banded (0-5, 5-10, 10-15, etc.), then the equation for the mean uses the midpoint of each band (the sum of the lower and upper limits divided by two) in place of a single value.

Take the table below for instance:

Grouped Frequency Table
Bin Frequency
10-14 1
15-19 3
20-24 9
25-29 2

To calculate the mean we would rewrite this table as follows:

Frequency Table
Midpoint Frequency
12 1
17 3
22 9
27 2

Previously we created a new column for the product of the value and the frequency. We do the same again but this time the new column will hold values for the product of the midpoint with the frequency:

Frequency Table
Midpoint Frequency Mf
12 1 12
17 3 51
22 9 198
27 2 54

The process is the same as before. We sum the products of the midpoint and the frequency and divide by the total frequency:

\[\textrm{Mean}=\frac{12+51+198+54}{1+3+9+2}=21\]

Information

We can write this in mathematical notation as:

\[ \mu=\frac{\sum_{i=1}^n M_i f_i}{\sum_{i=1}^n f_i},\]

where \(M_i\) are the midpoints and \(f_i\) are their respective frequencies.
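A short Python sketch of the grouped mean, using the bins from the table above. Each midpoint is computed as the average of the bin's lower and upper limits, which reproduces the midpoints 12, 17, 22 and 27 in the table. The tuple layout and variable names are illustrative.

```python
# Each bin is written as (lower limit, upper limit, frequency).
bins = [(10, 14, 1), (15, 19, 3), (20, 24, 9), (25, 29, 2)]

# Midpoint of a bin = (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper, _ in bins]
frequencies = [f for _, _, f in bins]

mean = sum(m * f for m, f in zip(midpoints, frequencies)) / sum(frequencies)

print(midpoints)  # [12.0, 17.0, 22.0, 27.0]
print(mean)       # 21.0
```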

6.2.5 Median of a Grouped Frequency Distribution

To find the median we need several values: \(l\), the lower limit of the median class; \(n\), the total number of observations; \(c_f\), the cumulative frequency of the class preceding the median class; \(f\), the frequency of the median class; and \(c_l\), the class length. Given these, the median is:

\[\textrm{Median}=l + c_l \frac{\frac{n}{2}-c_f}{f} \]

Grouped Frequency Table
Bin Frequency cf Mf
10-14 1 1 12
15-19 3 4 51
20-24 9 13 198
25-29 2 15 54

The total number of observations \(n = 15\).

Divide this by 2 to get 7.5.

The median class is the first class whose cumulative frequency is larger than this number. For us that’s the 20-24 class, which has a cumulative frequency of 13.

The lower limit, \(l\), of this class is 20.

The cumulative frequency of the class preceding the median class, \(c_f\), is 4.

The frequency of the median class, \(f\), is 9.

The class length, \(c_l\), is 4 (taken here as the upper limit minus the lower limit, 24 - 20).

The median then is calculated by plugging these values into the formula above:

\[\textrm{Median} = l + c_l \frac{\frac{n}{2}-c_f}{f},\]

\[\textrm{Median} = 20 + \frac{4(\frac{15}{2}-4)}{9}, \]

\[\textrm{Median} =21.6.\]
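The steps above can be sketched as a small Python function. The name `grouped_median` and its arguments are illustrative, and the class length of 4 is simply the value used in the worked example.

```python
from itertools import accumulate

def grouped_median(lower_limits, frequencies, class_length):
    """Estimate the median of grouped data by interpolating in the median class."""
    n = sum(frequencies)
    cumulative = list(accumulate(frequencies))
    # The median class is the first whose cumulative frequency exceeds n / 2.
    k = next(i for i, cf in enumerate(cumulative) if cf > n / 2)
    # Cumulative frequency of the class before the median class (0 if none).
    cf_before = cumulative[k - 1] if k > 0 else 0
    return lower_limits[k] + class_length * (n / 2 - cf_before) / frequencies[k]

median = grouped_median([10, 15, 20, 25], [1, 3, 9, 2], class_length=4)
print(round(median, 1))  # 21.6
```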

Information

The variance (more on this later) of a grouped frequency distribution is given by:

\[ V=\frac{\sum_{i=1}^n f_i M_i^2 - \mu^2 \sum_{i=1}^n f_i}{\sum_{i=1}^n f_i - 1}, \]

where \(f_i\) are the frequencies, \(M_i\) are the midpoints of the bands (or bins) and \(\mu\) is the mean. Dividing by the total frequency minus 1 gives the sample variance.

The standard deviation is given by the square root of \(V\).

Summary

Frequency

The frequency of an observation is the number of times it occurs or is recorded. A frequency table is a commonly used method of depicting frequency.

The total of all frequencies so far in a frequency distribution is the cumulative frequency. It is the ‘running total’ of frequencies.

The relative frequency is the ratio of the category frequency to the total number of outcomes.

6.3 Measures of Dispersion

Dispersion (or variability) describes how far apart data points lie from each other and from the center of a distribution. The range, interquartile range, variance and standard deviation are all measures of dispersion.

6.3.1 Range

The range is the difference between the highest and lowest values and is calculated by subtracting the minimum value from the maximum value.

6.3.1.1 Example

Calculate the range for the following set of numbers:

\[ 23, 42, 75, 19, 74. \]

First, arrange the values in ascending order:

\[ 19, 23, 42, 74, 75. \]

The maximum value is 75 and the minimum is 19.

\[ \textrm{Range}= 75 - 19, \]

\[ \textrm{Range} = 56.\]
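In Python the same result can be obtained without sorting, since `max()` and `min()` find the extremes directly. A minimal sketch, with an illustrative variable name:

```python
values = [23, 42, 75, 19, 74]

# Range = maximum value - minimum value.
value_range = max(values) - min(values)

print(value_range)  # 56
```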

6.3.2 Interquartile Range

The interquartile range (IQR) describes the spread of the middle half of a distribution. How the interquartile range is calculated depends on whether there are an even or an odd number of values in a dataset.

For an even number of values the dataset is split in half. The medians of the two new subsets of data are calculated. The positive difference between those medians is the interquartile range.

For an odd number of values either the inclusive or the exclusive method of finding the interquartile range must be used.

The algorithm for the exclusive method is detailed below:

  1. Arrange the data in numeric order.
  2. Remove the median and split the data about its center.
  3. Find the medians of the two new subsets of data.
  4. Calculate the difference.

The algorithm for the inclusive method is detailed below:

  1. Arrange the data in numeric order.
  2. Remove the median and split the data about its center.
  3. Append the two new subsets of data with the median.
  4. Find the medians of the two newly appended subsets of data.
  5. Calculate the difference.

6.3.2.1 Example

Find the interquartile range for the list of numbers below:

\[6, 7, 8, 8, 7, 6, 9, 5, 10, 4. \]

There is an even number of values. Arrange them in numeric order:

\[ 4, 5, 6, 6, 7, 7, 8, 8, 9, 10.\]

Split the values about their center into two subsets of data:

\[ (4, 5, 6, 6, 7), (7, 8, 8, 9, 10). \]

Find the median of each of these subsets. The first subset has a median of 6 while the second has a median of 8.

The interquartile range is:

\[ \textrm{IQR} = 8 - 6 = 2.\]

Note: To calculate the interquartile range the smaller median value is always subtracted from the larger.

6.3.2.2 Example

Find the interquartile range for the list of numbers below:

\[2, 3, 2, 4, 3, 5, 4, 4, 2.\]

Arrange the values in numeric order:

\[2, 2, 2, 3, 3, 4, 4, 4, 5. \]

Using the exclusive method, remove the median (3) and split the data as before:

\[ (2, 2, 2, 3), (4, 4, 4, 5).\]

The interquartile range is:

\[ \textrm{IQR}=\textrm{Median of subset 2}- \textrm{Median of subset 1},\]

\[ \textrm{IQR}=\frac{4+4}{2} - \frac{2+2}{2}=\frac{8}{2} - \frac{4}{2} = 4 - 2= 2.\]

6.3.2.3 Example

Find the interquartile range of the list of numbers below:

\[ 2, 3, 2, 4, 3, 5, 4, 4, 2.\]

Sort in numeric order as before:

\[2, 2, 2, 3, 3, 4, 4, 4, 5.\]

Using the inclusive method, split the data as before but append the median (3) to each subset (at the end of the first subset and the start of the second):

\[(2, 2, 2, 3, 3),(3, 4, 4, 4, 5).\]

Find the medians of each of the subsets and calculate the interquartile range. The median of the first subset is 2 and the median of the second subset is 4.

\[ \textrm{IQR} = 4 - 2 = 2 \]

The interquartile range is a useful measure of variability for skewed distributions. It can show where most values lie and how clustered they are. It is useful for datasets with outliers as it is based on the middle half of the distribution and is less influenced by extreme values. Exclusive calculations give an interquartile range that is at least as wide as inclusive calculations, and often wider, although the two methods can agree, as they do in the examples above.
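The exclusive and inclusive algorithms described in this section can be sketched in Python as follows. The function names and the `method` argument are illustrative, and the sketch assumes at least two values. Quartile routines in statistical software often interpolate between values instead, so they can give slightly different results.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def interquartile_range(values, method="exclusive"):
    """IQR as the difference between the medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 0 or method == "exclusive":
        # Even count: two equal halves. Odd count, exclusive: drop the median.
        lower, upper = ordered[:half], ordered[n - half:]
    else:
        # Odd count, inclusive: the median joins both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    return median(upper) - median(lower)

print(interquartile_range([6, 7, 8, 8, 7, 6, 9, 5, 10, 4]))           # 2
print(interquartile_range([2, 3, 2, 4, 3, 5, 4, 4, 2]))               # 2.0
print(interquartile_range([2, 3, 2, 4, 3, 5, 4, 4, 2], "inclusive"))  # 2
```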

6.3.3 Variance

The standard deviation describes to what extent a set of numbers lie apart (their spread). It is the square root of variance which is also an indicator of the spread of values.

To calculate the variance:

  1. Start by finding the mean of the values in the dataset.
  2. Find the difference between each recorded value and the mean.
  3. Square those differences.
  4. Sum the squared differences.
  5. Divide the sum by the number of values recorded for population variance, or by the number of values minus 1 for sample variance.

Information

The population variance is given by:

\[V_{p} = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2,\]

where \(V_p\) is the population variance, \(n\) is the number of observations, \(x_i\) are the observations and \(\mu\) is the population mean.

The sample variance is given by:

\[V_{s} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2,\]

where \(V_s\) is the sample variance, \(n\) is the number of observations, \(x_i\) are the observations and \(\bar{x}\) is the sample mean.

6.3.4 Standard Deviation

Taking the square root of the variance corrects for the fact that all the differences were squared and returns the measure to the original units of the data. The result is the standard deviation, which is also an indicator of spread.

A standard deviation can range from 0 to infinity. A standard deviation of 0 means that all the numbers in a list are equal and do not lie apart at all.

To make sense of this through an example, the plot below shows some simulated data for test scores. Three groups given the same test could achieve the same average score but with different spreads of scores.

For the group with a mean test score of 30 and a standard deviation of 5, most of the test scores are tightly packed within the range 25-35.

Information

In statistics there is a rule called the empirical rule which states that, for data that are approximately normally distributed, about 68%, 95% and 99.7% of the values lie within one, two and three standard deviations of the mean, respectively (Lee D. K., In J. and Lee S. 2015).

For a mean of 30 and standard deviation of 5: 68% of the values will lie within the range 25-35.

For a mean of 30 and standard deviation of 10: 68% of the values will lie within the range 20-40.

For a mean of 30 and standard deviation of 15: 68% of the values will lie within the range 15-45.

Statisticians will sometimes use a z-score to indicate how far from the mean a particular element in the data set is.
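The ranges quoted above, and a z-score, can be sketched in Python. The z-score formula used here (the value minus the mean, divided by the standard deviation) is the usual definition and is not given elsewhere in this chapter. The function names and the test score of 40 are hypothetical.

```python
def within(mean, sd, k=1):
    """Range covering k standard deviations either side of the mean."""
    return (mean - k * sd, mean + k * sd)

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

for sd in (5, 10, 15):
    print(sd, within(30, sd))  # (25, 35), then (20, 40), then (15, 45)

print(z_score(40, mean=30, sd=5))  # 2.0
```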

6.3.4.1 Example

Calculate the sample estimate of variance and sample estimate of standard deviation for the following list of values:

\[ 2, 4, 4, 5, 6.\]

Start by finding the mean of the values in the dataset:

\[ \textrm{Mean}= \frac{2 + 4 + 4 + 5 + 6}{5}=4.2.\]

Find the difference between each recorded value and the mean.

Calculating Differences
Value Difference
2 2 - 4.2 = -2.2
4 4 - 4.2 = -0.2
4 4 - 4.2 = -0.2
5 5 - 4.2 = 0.8
6 6 - 4.2 = 1.8

Square the differences.

Squaring Differences
Value Difference Squared Difference
2 -2.2 4.84
4 -0.2 0.04
4 -0.2 0.04
5 0.8 0.64
6 1.8 3.24

Sum the squared differences.

\[\textrm{Sum} = 4.84 + 0.04 + 0.04 + 0.64 + 3.24 = 8.8. \]

Divide the sum by the number of values recorded minus one to get the sample estimate of variance.

\[ \textrm{Variance}_{s} = \frac{8.8}{5-1} = 2.2.\]

To get the sample estimate of the standard deviation take the square root of this value:

\[ \textrm{Standard Deviation}_s = \sqrt{ \textrm{Variance}_{s}} = \sqrt{2.2} = 1.48.\]
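The five steps of the worked example can be followed line by line in Python. This is a sketch with illustrative variable names; the population variance is included only for comparison.

```python
import math

values = [2, 4, 4, 5, 6]
n = len(values)

mean = sum(values) / n                      # step 1: mean
differences = [x - mean for x in values]    # step 2: differences from the mean
squared = [d ** 2 for d in differences]     # step 3: squared differences
sum_of_squares = sum(squared)               # step 4: sum of squared differences
sample_variance = sum_of_squares / (n - 1)  # step 5: divide by n - 1 (sample)
population_variance = sum_of_squares / n    # step 5: divide by n (population)
sample_sd = math.sqrt(sample_variance)

print(round(sample_variance, 2), round(sample_sd, 2))  # 2.2 1.48
print(round(population_variance, 2))                   # 1.76
```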

6.3.5 Using Excel

Calculating the variance and standard deviation by hand is a long process and due to the number of steps involved it is prone to error. Excel, SPSS, Python and R all have functions which allow users to calculate these descriptive statistics and their use is highly recommended over calculating the statistics by hand.

Range

=MAX(start:end)-MIN(start:end)

There is no single function for calculating the range in Excel but the formula above will subtract the smallest value from the largest value in an array.

Standard Deviation

=STDEV.S(start:end)

=STDEV.P(start:end)

stdev.s() estimates standard deviation based on a sample. stdev.p() calculates standard deviation based on the entire population given as arguments.

Variance

=VAR.S(start:end)

=VAR.P(start:end)

var.s() estimates variance based on a sample. var.p() calculates variance based on the entire population given as arguments.
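As the text notes, Python also has functions for these statistics. The sketch below uses the standard `statistics` module on the list from the worked example. The pairing with the Excel functions is approximate, based on each pair making the same sample or population choice.

```python
import statistics

values = [2, 4, 4, 5, 6]

print(max(values) - min(values))               # range: 4
print(round(statistics.variance(values), 2))   # sample variance (cf. VAR.S): 2.2
print(round(statistics.pvariance(values), 2))  # population variance (cf. VAR.P): 1.76
print(round(statistics.stdev(values), 2))      # sample SD (cf. STDEV.S): 1.48
print(round(statistics.pstdev(values), 2))     # population SD (cf. STDEV.P): 1.33
```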

Summary

Measures of Dispersion

The range is the difference between the highest and lowest values.

The interquartile range (IQR) describes the spread of the middle half of a distribution. How the interquartile range is calculated depends on whether there are an even or an odd number of values in a dataset.

The standard deviation describes to what extent a set of numbers lie apart (their spread). It is the square root of variance which is also an indicator of the spread of values.

To calculate the variance:

  1. Start by finding the mean of the values in the dataset.
  2. Find the difference between each recorded value and the mean.
  3. Square those differences.
  4. Sum the squared differences.
  5. Divide the sum by the number of values recorded for population variance, or by the number of values minus 1 for sample variance.

References

Lee D. K., In J. and Lee S. 2015. “Standard Deviation and Standard Error of the Mean.” Korean Journal of Anesthesiology 68 (3): 220–23. https://doi.org/10.4097/kjae.2015.68.3.220.
McHugh M. L. and Hudson-Barr D. 2003. “Descriptive Statistics, Part II: Most Commonly Used Descriptive Statistics.” Journal for Specialists in Pediatric Nursing 8 (3): 111–16. https://doi.org/10.1111/j.1088-145X.2003.00111.x.
NISRA. 2012. “Arts Council Funded Activities (Administrative Geographies), 2010.”
———. 2022a. “Life Expectancy at Birth (Administrative Geographies), 2018-2020.”
Witte R. S. and Witte J. S. 2017. Statistics. John Wiley & Sons. https://www.wiley.com/en-us/Statistics,+11th+Edition-p-9781119254515.