Statistical Data Types
All variables can roughly be categorized into three broad groups: continuous, ordinal, and nominal, with discrete variables treated as a special case of continuous. [1]
The names vary a little between sources but the types are always the same; we’ll use the following:
Continuous Variables – Anything that could be measured with an unambiguous measurement tool, something where 4 means precisely “twice the value of 2”, where decimal-point precision has clear meaning, where more precise tools will lead to more precise measurements, etc. If a variable does not naturally fall into a limited number of categories but could take many different values across a range, it is likely continuous. Examples: weight, length, age, time, estimated blood loss, drug dosages, etc.
Note that it still depends on how you measure it, though! For something like weight: you put a patient on a scale and you get a weight. Two different observers should not disagree on the measurement, you’ll get a precise value, and your data will be continuous. On the other hand, if you don’t weigh the patients but merely group them into “underweight”, “normal weight”, “overweight”, “severely overweight”, then you’ve changed weight into an ordinal variable! In some cases, changing a continuous variable into an ordinal variable by grouping it like this makes sense; see below for details.
Discrete Variables – Discrete variables measure specific, objective, comparable values (i.e. “4” means double whatever “2” is), so they clearly do not qualify as either ordinal or nominal. They differ from a ‘true’ continuous variable because they only come in specific ‘discrete’ values, which makes them a sort of degenerate version of continuous variables. The way to tell the difference: if you could theoretically measure your variable to any number of digits (arbitrary precision), it is truly continuous. If there is a limit beyond which no further precision is possible, it is likely discrete. For example, weight, as described above, is truly continuous. The number of phone calls coming in to a call center each hour, however, is discrete; only whole numbers will be recorded. Rolling dice is discrete. Discrete variables are generally tested and treated like continuous variables, but in some cases they must be treated differently.
In the same way that a true continuous variable can be converted to an ordinal variable by splitting it into arbitrary groups, a continuous variable may be turned into a discrete variable by rounding measurements and decreasing precision. To use weights again as an example, if you rounded all patient weights to 10-kg increments, the distribution becomes discrete, because only specific stepwise values are possible.
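The two conversions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample weights, the cut-offs in `WEIGHT_CUTOFFS_KG`, and the function names `to_ordinal` and `to_discrete` are made up for the example (real weight classes would normally be defined clinically, for instance from BMI, not from raw weight).

```python
from bisect import bisect_right

# Hypothetical cut-offs in kg, chosen only for illustration.
WEIGHT_CUTOFFS_KG = [50.0, 80.0, 100.0]
WEIGHT_LABELS = ["underweight", "normal weight", "overweight", "severely overweight"]


def to_ordinal(weight_kg):
    """Map a continuous weight onto an ordered category (ordinal)."""
    # bisect_right counts how many cut-offs are <= the weight,
    # which is exactly the index of the matching label.
    return WEIGHT_LABELS[bisect_right(WEIGHT_CUTOFFS_KG, weight_kg)]


def to_discrete(weight_kg, step_kg=10):
    """Round a continuous weight to the nearest step (discrete).

    Python's round() sends exact ties to the even multiple.
    """
    return step_kg * round(weight_kg / step_kg)


weights = [47.3, 62.8, 81.5, 104.2]  # continuous measurements (made up)
ordinal = [to_ordinal(w) for w in weights]
discrete = [to_discrete(w) for w in weights]

print(ordinal)
print(discrete)
```

After either conversion the original precision is gone: 62.8 kg and 64.9 kg would both end up as "normal weight" and as 60 kg, so the step cannot be undone from the converted data alone.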
Ordinal Variables – Measures that fall into categories that can be put into an order like “worst to best” or similar, but are not precise measurements where something like 4 means double of 2. Examples: ASA physical status class, Glasgow Coma Scale, Likert scales and most survey responses. If it’s not clear whether the measure is continuous or ordinal: if putting a decimal on it doesn’t make sense (like “ASA class 3.7”), then it’s ordinal. “Before” and “after” groups can be viewed as ordinal (there is an order to the sequence of measurements, after all). However, when there are only two groups it doesn’t matter whether you choose ordinal or nominal approaches; the results will be the same! Ordinal tests only differ from nominal tests when there are 3 or more levels of the variable.
Pain scores on a scale from 1 to 10 are an especially troublesome example. Is it continuous or is it ordinal? Does a pain of 4 mean double the pain of 2? Maybe it could, so that doesn’t settle it. Does a pain score of 5.5 out of 10 make sense? Sort of; if someone said that, I’d have some idea what they meant, after all. When in doubt, it’s safer (and always valid) to treat the variable as ordinal.
Nominal Variables – Measures that fall into categories or groups that do not have any meaningful ordering. Examples: gender, colors, organ systems, ethnicity. Two different groups (treatment / no-treatment) in a randomized trial are nominal. Data coming from three different high schools are nominally grouped.
Changing continuous data into ordinals or binaries – If you have a continuous measure, in some cases it can be useful to split it up into a grouped ordinal variable instead (as we did with weight in the continuous example above). You could also arbitrarily split something like age into “elderly” (75 and over) versus “not elderly” (under 75) to look for a specific effect on these groupings. This approach is most often employed when you want to interpret your results based on the groupings instead of the raw continuous data (like weight classes instead of exact BMI); that is, when there is a clinical rationale for dividing your variable into specific groups.
Any time you split a continuous variable into groups you will probably lose power in your statistical tests (power is the ability to detect a significant difference if one exists), but you may make your results more applicable or interpretable. If an effect exists only over one part of the continuous range (in “elderly” patients, for example), you may in some cases even increase your power.
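One way to see the usual loss of power is a small simulation. The sketch below is an illustration only, and everything in it is an assumption: the data are simulated, the outcome is made to rise steadily with age across the whole range, and the helper `pearson` is written out by hand so the example has no dependencies. It covers only the steady-effect case; it does not show the situation where an effect exists solely among the elderly.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated data (an assumption): outcome rises steadily with age, plus noise.
ages = [rng.uniform(40, 90) for _ in range(2000)]
outcome = [0.1 * age + rng.gauss(0, 2) for age in ages]

# Dichotomize at the cut-off used in the text: 75 and over is "elderly".
elderly = [1 if age >= 75 else 0 for age in ages]

r_continuous = pearson(ages, outcome)
r_dichotomized = pearson(elderly, outcome)

print(f"age as continuous:   r = {r_continuous:.2f}")
print(f"age as elderly/not:  r = {r_dichotomized:.2f}")
```

In a setup like this, the association with the dichotomized variable should come out weaker than with the raw ages, because all the variation in age within each group has been thrown away. A weaker observed association is harder to detect at a given sample size, which is what a loss of power means in practice.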
Caution: Arbitrarily splitting continuous variables up into different groups without a reason for the groupings, particularly when repeatedly testing those groups for significant differences, is a form of data mining! Beware any results obtained! If you’re going to be categorizing or dichotomizing a continuous variable, you should know in advance of doing so how you’re going to do it and how you’re going to test it and there should be a justification for doing so!
References
1. Note that the explanations provided here very much simplify or ignore the underlying mathematical foundations of these distributions – the probability density functions. These explanations are an attempt to describe the types observationally; sort of like being able to recognize an elephant by description without needing to know all the bits and pieces inside that make the elephant work.