Chapter 2 Sampling

Samples and populations are important in both inferential and descriptive statistics. Some descriptive statistics are calculated differently depending on whether they are being calculated for a sample or a population.

2.1 Populations and Samples

2.1.1 Population

The population includes all the elements from a set of data. The population characteristics that we calculate (e.g. mean, median, mode) are called parameters.

Populations can include people but other examples include objects, events, businesses and so on. Before starting a research study it is important to define the population that is being studied. Many populations can be stratified into sub-populations which share attributes (Frost J. 2019). A country for instance can be sub-divided into male and female or sub-divided by age. The differences between sub-populations are sometimes unimportant but other times they are crucial.

2.1.2 Sample

A sample differs from a population in that it consists of one or more observations drawn from the population. It is a subset of the population. The sample characteristics that we calculate (e.g. mean, median, mode) are called statistics. Where possible, we use samples to draw inferences about wider populations. For instance, the voting intentions of 1,000 people (a sample) might be used to predict the outcome of a general election involving 68,000,000 people (the population).

Surveying a sample rather than an entire population is often more cost-effective, but using a sample comes with a different kind of cost. The reported statistics come with associated measures of uncertainty which describe how the estimate might differ from the true value of the population (Office for National Statistics 2022). If we were to measure the heights of 1,000 individuals in Northern Ireland, the average height of the sample would typically not be the same as the average height of all 1.8 million people in the country. The difference between the sample value and the population value is the sampling error. If the sample mean is 176 cm and the population mean is 178 cm then the sampling error is 2 cm.
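The height example can be sketched in Python. This is a minimal illustration, not real data: the population is simulated, and its size (100,000 rather than 1.8 million, to keep the run short), mean (178 cm) and spread (7 cm) are assumptions chosen only to mirror the example above. The sampling error is computed here as a signed difference, so a negative value means the sample underestimates the population mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of heights in cm (simulated, not real data).
population = [random.gauss(178, 7) for _ in range(100_000)]

# Draw 1,000 individuals at random, without replacement.
sample = random.sample(population, 1_000)

population_mean = statistics.fmean(population)  # a parameter
sample_mean = statistics.fmean(sample)          # a statistic

# Sampling error: how far the sample estimate is from the true value.
sampling_error = sample_mean - population_mean

print(f"Population mean: {population_mean:.2f} cm")
print(f"Sample mean:     {sample_mean:.2f} cm")
print(f"Sampling error:  {sampling_error:+.2f} cm")
```

In a real survey the population mean is unknown, so this calculation is only possible because the population here has been simulated.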

The exact measurement of sampling error is generally not feasible, since the true population values are not usually known. However, its likely size can be estimated using techniques such as the calculation of confidence intervals.
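As a rough sketch of the confidence interval approach, the snippet below computes an approximate 95% confidence interval for a sample mean. It assumes a normal approximation (the multiplier 1.96) and uses a simulated sample of 1,000 heights; the sample values and variable names are illustrative assumptions rather than part of any particular survey method.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical sample of 1,000 heights in cm (simulated, not real data).
sample = [random.gauss(176, 7) for _ in range(1_000)]

n = len(sample)
sample_mean = statistics.fmean(sample)

# Standard error of the mean: sample standard deviation / sqrt(n).
standard_error = statistics.stdev(sample) / math.sqrt(n)

# Normal approximation: 1.96 standard errors either side of the mean
# gives an approximate 95% confidence interval.
z = 1.96
lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(f"Sample mean: {sample_mean:.2f} cm")
print(f"Approximate 95% CI: ({lower:.2f}, {upper:.2f}) cm")
```

The interval narrows as the sample size grows, because the standard error shrinks in proportion to the square root of n.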

The various measures of uncertainty used to describe how estimates differ from the true value of the population include (Office for National Statistics 2022):

Calculating parameters of a population or statistics for a sample often falls within the remit of descriptive statistics but if we want to convert sample responses to population estimates we need to use inferential statistics.

To generalise the results from a sample to the full population, the sample must be representative of the population (this is known as representative sampling). For a sample to be representative of the population it must accurately represent the characteristics of the population.

In practice, this means that if we conducted a survey to gauge the views of the Northern Ireland population on an upcoming election and only residents of Antrim were surveyed then the sample results could not be used to infer attitudes of all Northern Ireland residents.

Summary

Populations and Samples

The population includes all the elements from a set of data. The population characteristics (e.g. mean or standard deviation) that we measure are called parameters.

A sample is comprised of one or more observations which are drawn from the population. Sample characteristics that we measure are called statistics.

2.2 Sampling Methods

2.2.1 Probability Sampling

Probability sampling involves random selection allowing you to make statistical inferences about the whole group.

This chapter focuses on simple random sampling as it is one of the most common forms of probability sampling. However, there are a variety of other methods of probability sampling including: systematic sampling, stratified random sampling, cluster sampling, multi-stage sampling and multi-phase sampling.

2.2.1.1 Simple Random Sampling

Simple random sampling (SRS) is a procedure for selecting samples from a population. Under this procedure samples are taken from a population where each sample has equal probability of being selected. A very simple example of this is removing marbles from a bag of ten marbles one at a time, recording some characteristic (their colour or weight for instance) and replacing them. All the marbles have equal probability of being selected (p=0.1) regardless of how many samples we have taken.
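The marble example can be simulated with Python's standard library. In this sketch the marble labels and the number of draws are arbitrary assumptions; random.choices draws with replacement, so every marble keeps the same selection probability (p=0.1) on every draw.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# A bag of ten marbles (labels are hypothetical).
marbles = [f"marble_{i}" for i in range(1, 11)]

# Sampling with replacement: each draw is made from the full bag.
draws = random.choices(marbles, k=10_000)

# The observed share of draws for each marble should be close to 0.1.
counts = Counter(draws)
for marble in marbles:
    print(f"{marble}: {counts[marble] / len(draws):.3f}")
```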

Not replacing the marbles we sampled results in simple random sampling without replacement (SRSWOR). The first time we take a sample, each marble has a p=0.1 chance of being sampled. If we do not replace that marble before drawing a second one, however, the chance of selecting the first marble again is 0 and each of the other 9 marbles now has a p≈0.11 (1/9) chance of being selected. By the time we get to the bottom of the bag and two marbles remain, they each have a p=0.5 chance of being selected.
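The changing probabilities can be traced with a short sketch. The loop below empties a bag of ten marbles one draw at a time and records, before each draw, the probability that any given remaining marble is selected. The variable names are illustrative; exact fractions are used so that 1/9 is not rounded.

```python
import random
from fractions import Fraction

random.seed(0)  # fixed seed so the sketch is repeatable

bag = list(range(1, 11))  # ten marbles, numbered 1 to 10
drawn = []
selection_probabilities = []

while bag:
    # Every marble still in the bag is equally likely to be picked next.
    selection_probabilities.append(Fraction(1, len(bag)))
    marble = random.choice(bag)
    bag.remove(marble)  # not replaced, so it cannot be drawn again
    drawn.append(marble)

print(", ".join(str(p) for p in selection_probabilities))
```

This prints 1/10, 1/9, 1/8, 1/7, 1/6, 1/5, 1/4, 1/3, 1/2, 1. In practice, random.sample(bag, k) draws k items without replacement in a single call.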

In real-world scenarios SRS can be difficult to achieve, either due to the scale of the population or because SRS can introduce resampling (inadvertently giving a survey to the same person twice). SRSWOR is often used instead for these reasons.

2.2.2 Non-Probability Sampling

Non-probability sampling involves non-random selection. To draw conclusions about a population from a sample, the sample must be representative of the population. With non-probability sampling this is not guaranteed, and it is important to take this into account when using it as a sampling method in a study.

Non-probability sampling has seen some use within official statistics as a result of the growing popularity of non-probability data sources such as social media (webscraping Twitter or Facebook to conduct sentiment analysis for instance) and the desire for real time statistics. There are a range of methods which can be used to conduct non-probability sampling including: convenience or haphazard sampling, volunteer sampling, judgement sampling, quota sampling, snowball or network sampling, crowd-sourcing and web panels.

One of the more widely used forms of non-probability sampling is convenience sampling. Convenience sampling consists of drawing from a source that is conveniently accessible to us (Andrade C. 2020). A convenience sample of students may be drawn from a physics department in Queen’s University but these students may not be representative of all students in Northern Ireland.

The findings of a study based on convenience sampling can normally only be generalized to the sub population from which the sample is drawn and not to the entire population (Andrade C. 2020).

2.3 Independent and Dependent Samples

Samples are independent if the subjects in one sample do not determine which subjects are chosen for a second sample. Each group contains different subjects with no meaningful way to pair them. Independent groups are more commonly seen in hypothesis testing. For instance, medical drug trials typically have a control group and a treatment group with different subjects. These studies typically use inferential statistical tests to determine if there are differences between the groups.

Dependent samples (sometimes referred to as matched pairs) differ from independent samples in that subjects in one sample can be matched with a corresponding subject in another sample. This can get confusing because sometimes a matched pair consists of just one subject. For instance, if a sample is drawn of people who have hip replacement surgery with the NHS and each person in the sample is interviewed before and after the surgery to assess their mobility, then the study is engaged in dependent sampling. The same person was interviewed at two points in time.
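One way to picture dependent samples is that each observation in the first sample has a known partner in the second, so the comparison can be made within each pair. The sketch below uses made-up mobility scores for five hypothetical patients (the identifiers, the scores and the scoring scale are all illustrative assumptions) and computes the change for each person.

```python
import statistics

# Hypothetical mobility scores (higher means more mobile), keyed by patient.
# These values are invented for illustration and are not real data.
before = {"patient_a": 42, "patient_b": 55, "patient_c": 38,
          "patient_d": 60, "patient_e": 47}
after = {"patient_a": 58, "patient_b": 63, "patient_c": 51,
         "patient_d": 66, "patient_e": 59}

# Because the samples are dependent, each "before" score is matched
# with the "after" score of the same person.
differences = {patient: after[patient] - before[patient] for patient in before}

mean_difference = statistics.fmean(list(differences.values()))

for patient, change in differences.items():
    print(f"{patient}: {change:+d}")
print(f"Mean change in mobility score: {mean_difference:.1f}")
```

With independent samples no such pairing exists, so only group-level summaries (such as the two group means) could be compared.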

Summary

Sampling

To generalise the results from a sample to the full population, the sample must be representative of the population.

Surveys are based on a sample rather than the whole population so they are subject to sampling error.

The sampling error is the difference between the sample estimate and the ‘true’ value (which would have been obtained if a census of the whole population were undertaken).

Probability sampling involves random selection.

Simple random sampling (SRS) is a method for selecting samples from a population where all possible samples are equally likely to occur. Taking a marble from a bag, recording a property (like its weight) and replacing it in the bag is an example of SRS.

Not replacing a sample would constitute simple random sampling without replacement (SRSWOR).

Non-probability sampling involves non-random selection based on convenience or other criteria.

In independent samples subjects in one group provide no information about subjects in another.

References

Andrade C. 2020. “The Inconvenient Truth About Convenience and Purposive Samples.” Indian Journal of Psychological Medicine 43(1): 86–88. https://doi.org/10.1177/0253717620977000.

Frost J. 2019. Introduction to Statistics. Statistics By Jim Publishing. https://statisticsbyjim.com/basics/introduction-statistics-intuitive-guide/.

Office for National Statistics. 2022. “Uncertainty and How We Measure It for Our Surveys.” https://www.ons.gov.uk/methodology/methodologytopicsandstatisticalconcepts/uncertaintyandhowwemeasureit.