We are all aware that statistics is a tool for converting data into information. Without data, then, statistics is all but useless. But where does data come from, and how should it be gathered to ensure its accuracy and reliability? Is it representative of the population from which it was drawn? Today we tackle these issues for perhaps the most complicated kind of data source: a large populace, such as a vast target population or database. When collecting this kind of data, the two most popular methods are a census and sampling.
A census is the procedure of systematically acquiring and recording information about the members of a given population. Under this method, data is collected for each and every unit in the population, database, or even universe, for example every person, household, field, shop, factory etc.
For instance, if the average salary of players in Major League Baseball is to be calculated by census, the figure would be obtained by dividing the total salary received by all of the players by the number of players currently active in the MLB. A census is designed to be a complete enumeration.
Sampling, on the other hand, is concerned with the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population. Unlike the census, instead of examining every unit of a populace, only a part of the population is studied and consequent conclusions are drawn on that basis for the entire populace. Sampling is designed to be a partial enumeration.
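To make the contrast concrete, here is a minimal Python sketch of the two approaches applied to the salary example. The salary figures are randomly generated stand-ins, not real MLB data, and the function names census_mean and sample_mean are our own illustrative choices. It assumes a simple random sample, which is only one of many sampling designs.

```python
import random

def census_mean(population):
    """Complete enumeration: average over every unit in the population."""
    return sum(population) / len(population)

def sample_mean(population, n, rng):
    """Partial enumeration: average over a simple random sample of n units."""
    sample = rng.sample(population, n)  # draws n distinct units without replacement
    return sum(sample) / n

rng = random.Random(42)

# Hypothetical salaries for 900 players (made-up numbers, not real MLB figures).
salaries = [rng.lognormvariate(14.5, 1.0) for _ in range(900)]

print("Census mean:         ", round(census_mean(salaries)))
print("Sample mean (n=100): ", round(sample_mean(salaries, 100, rng)))
```

The census figure is exact for this population, while the sample figure is an estimate that will change from one random draw to the next.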
Interestingly, while the theory of sampling has really only taken off in relatively recent years, the idea of sampling is actually pretty old. Since time immemorial, humans have examined a handful of grains to ascertain the quality of the entire lot. For example, a cook examines only two or three grains of boiling rice to know whether the pot of rice is ready or not. Additionally, like most examples in a statistics textbook, a businessman places orders for material by examining only a small sample of the same.
Apart from quite unusual circumstances, a census is going to be more accurate than a sample by definition. Samples have a margin of error, or sampling error, which shrinks as the sample size increases. In other words, sampling more people yields more precise estimates. This type of error is absent from a census, since each and every element of the populace is examined. Of course, census data can get stale if not updated regularly, as shown aptly by this comic from xkcd:
A sample could nonetheless be more accurate than a census if running a full census, rather than a sample, increases the bias from non-sampling error. This could come about, for example, if the census provokes an adverse political campaign advocating non-response, which is something less likely to happen to a sample.
Consider a common source of non-sampling error: systematic non-response by a particular socio-demographic group. If people from group X are likely to refuse the census, they are just as likely to refuse the sample. Even with post-stratification weighting to weight up the responses of those people from group X who do happen to answer the survey, or whatever instrument is used to collect the data, a problem still persists, because those respondents might be the very segment of X that is pro-survey. There is no real way around this problem other than to be as careful as possible with the design of the instrument and the delivery method.
Another possible issue that could make a census less accurate than a sample is that, technically speaking, every census is only an attempted census. Samples routinely have post-stratification weighting to the population, which mitigates bias from issues such as systematic non-response. An attempted census that does not get a full 100% return is really just a large sample, and should in principle be subject to the same processing.
However, because it is seen as a "census" rather than an attempted census, this analysis may be neglected. In this case, a census might be less accurate than an appropriately weighted sample. The problem here, though, is the analytical processing technique (or its omission), not something intrinsic to it being an attempted census.
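As a rough illustration of the post-stratification weighting described above, the Python sketch below compares a raw average of responses with a weighted one. The groups, population shares, and response values are all invented for the example, and the helper names raw_mean and post_stratified_mean are our own. It assumes the true population share of each group is known and that every group has at least one respondent.

```python
def raw_mean(responses):
    """Unweighted mean over all respondents, ignoring group membership."""
    values = [v for group_values in responses.values() for v in group_values]
    return sum(values) / len(values)

def post_stratified_mean(responses, population_shares):
    """Average each group's responses, then weight each group mean
    by that group's known share of the population."""
    total = 0.0
    for group, share in population_shares.items():
        values = responses[group]
        total += share * (sum(values) / len(values))
    return total

# Hypothetical setup: group X is 40% of the population,
# but only 3 of the 12 respondents (25%) come from it.
population_shares = {"X": 0.4, "Y": 0.6}
responses = {
    "X": [20, 22, 18],
    "Y": [50, 52, 48, 51, 49, 50, 47, 53, 50],
}

print("Raw mean:            ", raw_mean(responses))
print("Post-stratified mean:", post_stratified_mean(responses, population_shares))
```

With these made-up numbers the raw mean comes out near 42.5, while the weighted mean is pulled down to about 38 because group X is counted at its true share. The weighting only helps if the members of X who responded resemble the members of X who did not, which is exactly the caveat raised earlier.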
Sampling, however, is almost always more efficient than a census, which tends to be demanding in terms of cost, time, and other logistics.
If good sampling techniques are used, the results can be very representative of the actual population, while also eliminating those inherent issues of a census.
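The earlier point that sampling error falls as the sample size grows can be sketched in Python as well. The population here is simulated, and margin_of_error is a hypothetical helper using the common normal approximation for a 95% interval around a sample mean. It ignores the finite population correction and assumes a simple random sample.

```python
import math
import random
import statistics

def margin_of_error(sample, z=1.96):
    """Approximate 95% margin of error for a sample mean:
    z times the sample standard deviation over the square root of n."""
    return z * statistics.stdev(sample) / math.sqrt(len(sample))

rng = random.Random(0)

# Simulated population of 100,000 values (illustrative only).
population = [rng.gauss(50, 10) for _ in range(100_000)]

print("Population mean:", round(statistics.mean(population), 2))
for n in (25, 100, 400, 1600):
    sample = rng.sample(population, n)
    estimate = statistics.mean(sample)
    print(n, round(estimate, 2), "+/-", round(margin_of_error(sample), 2))
```

Because the margin of error scales with one over the square root of n, quadrupling the sample size only roughly halves it, which is part of why a well-designed sample can get close to census-level precision at a fraction of the effort.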
Essentially, data collected through a census is generally more accurate because it takes the entire population into account, while data collected through sampling is close to the real information but leaves room for error, since it comes from only a subset of the population. This does not mean that data collected through sampling is unimportant, as it can be more efficient in terms of cost, time, and other logistics.
How do you use sampling in your everyday life? Let us know down below!
The SaberSmart Team