CH3 Summarizing Data

(Nominal, Ordinal, Numerical)

based on:

Review of Basic & Clinical Biostatistics by

Beth Dawson, Robert Trapp (2001) CH3

**SCALES OF MEASUREMENT**

**Nominal scale** (placed in **named** categories; grouped, qualitative)

- organized by an attribute

- is a member of a group or not (discrete membership)

- count the number of observations with or without an attribute that fit the category

- percentages or proportions used

- often represented by frequencies by category rather than the raw numbers by category

- often displayed in contingency tables, dividing up into numbers which meet each selected criteria

**Ordinal scale** (nominal scale plus the groups are also put in some order; semi-quantitative)

- organized by an attribute that is more or greater than

- difference between categories is not the same throughout the scale

- percentages or proportions used

- sometimes summarized by median values

- example is rank-order from highest to lowest

**Numerical scale** (characterized by a number, quantitative)

- interval or continuous

- characterized by a number that varies in some continuous way

- discrete

- characterized by integer values

- may be used to describe a continuous scale

**MEASURES OF NOMINAL (GROUPED) DATA**

**Proportion**

**part divided by the whole** (i.e., the number with a characteristic divided by the total number)

**Percentage** is a proportion multiplied by 100%

- useful for ordinal, numerical and nominal data

- dimensionless

**Ratio**

**part divided by another part**

- example for a given characteristic:

**(number with the characteristic) / (number without the characteristic)**

- dimensionless, no units
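As a minimal sketch of the three definitions above, the following Python snippet computes a proportion, a percentage, and a ratio from the same pair of counts. The counts are hypothetical and chosen only for illustration.

```python
# Hypothetical counts (illustration only, not from the source text).
with_attribute = 30      # observations that have the characteristic
without_attribute = 70   # observations that do not have it
total = with_attribute + without_attribute

proportion = with_attribute / total           # part divided by the whole
percentage = proportion * 100                 # proportion multiplied by 100%
ratio = with_attribute / without_attribute    # part divided by another part

print(f"proportion = {proportion:.2f}")
print(f"percentage = {percentage:.0f}%")
print(f"ratio      = {ratio:.3f}")
```

The proportion uses the whole group as its denominator, while the ratio compares one part with the other part, so the two numbers differ even though they start from the same counts.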

**Rate**

- a ratio times a multiplier (read as rate per units of the multiplier factor)

**for a given unit time interval** (ratio per unit time)

**Crude rate**

- rate for the whole population (everyone)

- affected by the distribution of characteristics in the population that affect the rate

**Sex-specific rate**

- restricted to a given sex

**Cause-specific rate**

- restricted to a given cause

**Age adjusted rate**

- the rate is adjusted by weighting for the ages of the people at risk

- allows comparing changes in a given population over time (since the ages and the numbers in each age group change over time)

**Mortality Rate**

**(number who died in the given time interval) / (number who are at risk of dying)**

- the number at risk halfway through the period is often used for the denominator as an estimate

- Infant Mortality Rate

- number of infants who die before age 1 per 1000 live births

**Morbidity Rate**

**(number who develop a disease in the given time interval) / (number at risk of the disease)**

**Incidence Rate** (per time interval)

(number of *new cases* in the given *time interval*) / (number at the *beginning* of the time interval)
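The rate definitions above share one pattern: a count of events divided by the number at risk, times a multiplier. A minimal sketch in Python follows; the helper name `rate` and all the figures are hypothetical, introduced only to illustrate the arithmetic.

```python
def rate(events, at_risk, multiplier=1000):
    """Events per `multiplier` people at risk during the stated time interval."""
    if at_risk <= 0:
        raise ValueError("number at risk must be positive")
    return events / at_risk * multiplier

# Hypothetical one-year figures (illustration only).
# Mortality: denominator is the number at risk halfway through the period.
mortality = rate(events=45, at_risk=15000)
# Incidence: new cases over the number at the beginning of the interval.
incidence = rate(events=120, at_risk=24000)
# Infant mortality: deaths before age 1 per 1000 live births.
infant_mortality = rate(events=14, at_risk=2000)

print(f"mortality rate   = {mortality:.1f} per 1000 per year")
print(f"incidence rate   = {incidence:.1f} per 1000 per year")
print(f"infant mortality = {infant_mortality:.1f} per 1000 live births")
```

The multiplier only changes how the rate is read (per 1000, per 10,000, and so on); the underlying ratio is the same.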

**Standardized Rates**

- rates have to be adjusted for the distribution in order to be compared with other populations

- assumes weighted averages maintain the validity of the comparison

- often adjusted, category by category, to one of the populations as the reference population (*direct* method of rate standardization)

**Standardized mortality ratio** (*indirect* method of rate standardization)

**(number of observed deaths) / (number of expected deaths)**

- expected deaths are calculated in each population using the specific rates from a standard population as a reference

- no dimensions
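The two standardization methods can be sketched as follows. This is an illustration under stated assumptions, not the textbook's worked example: the three age groups, the counts, and the standard population's rates are all hypothetical.

```python
# Hypothetical data for three age groups (illustration only).
study_deaths = [10, 30, 120]             # deaths in the study population
study_at_risk = [10000, 8000, 4000]      # people at risk in the study population
standard_at_risk = [20000, 10000, 5000]  # age distribution of the standard population
standard_rates = [0.0005, 0.003, 0.025]  # age-specific rates of the standard population

# Crude rate: whole study population, ignoring its age distribution.
crude_rate = sum(study_deaths) / sum(study_at_risk)

# Age-specific rates in the study population.
study_rates = [d / n for d, n in zip(study_deaths, study_at_risk)]

# Direct method: apply the study's specific rates to the standard population's
# age distribution, i.e. a weighted average of the specific rates.
deaths_if_standard_ages = sum(r * n for r, n in zip(study_rates, standard_at_risk))
direct_adjusted_rate = deaths_if_standard_ages / sum(standard_at_risk)

# Indirect method: apply the standard's specific rates to the study population
# to get expected deaths, then divide observed deaths by expected deaths.
expected_deaths = sum(r * n for r, n in zip(standard_rates, study_at_risk))
smr = sum(study_deaths) / expected_deaths

print(f"crude rate           = {crude_rate * 1000:.2f} per 1000")
print(f"direct adjusted rate = {direct_adjusted_rate * 1000:.2f} per 1000")
print(f"SMR                  = {smr:.2f}")
```

In this made-up example the adjusted rate is lower than the crude rate because the study population is weighted less heavily toward the oldest group once the standard age distribution is applied.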

**Prevalence** (a *proportion* and not a rate, since there is *no time interval*; a snapshot)

**(number with a disease at a moment in time) / (number at risk at that same moment in time)**

**MEASURES OF NUMERICAL DATA**

**Distributions (the spread of Numerical Data)**

**Scale**(how closely spaced)

**Shape**(how the frequency changes along the scale)

- symmetric (evenly distributed about some middle)

- skewed (not evenly distributed about some middle)

**Measures of the Middle** (Statistical Tests on Numerical Data)

**Arithmetic Mean or Mean** (average of the data values)

- weighted average (used to calculate the mean if frequencies of a measure are reported and not the raw data)

- used for numerical, symmetric data

**Median** (middle or midpoint observation: half of the values are smaller and half are larger)

- less sensitive to extreme values or shape of distribution of data than the mean

- used for ordinal, or numerical data if the distribution is skewed

**Mode** (most frequent value)

- often a range if data grouped in intervals

- used primarily when distribution is bimodal

**Geometric Mean** (the nth root of the product of the n data values)

- used when data are measured on a logarithmic scale

- logarithm of the geometric mean is equal to the mean of the logarithms of the data values
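A minimal sketch of the measures of the middle described above, using Python's standard `statistics` module. The data values and the frequency table are hypothetical; the geometric mean is computed through logarithms to show the identity stated in the last bullet.

```python
import math
import statistics

values = [2, 4, 4, 8, 16]   # hypothetical raw data

mean = statistics.mean(values)       # arithmetic mean
median = statistics.median(values)   # middle observation
mode = statistics.mode(values)       # most frequent value

# Weighted average: mean recovered from a frequency table (value -> count)
# when the raw data are not reported.
frequencies = {2: 1, 4: 2, 8: 1, 16: 1}
weighted_mean = (
    sum(value * count for value, count in frequencies.items())
    / sum(frequencies.values())
)

# Geometric mean: nth root of the product of the values, computed here as
# exp(mean of the logarithms), which is the same quantity.
log_values = [math.log(v) for v in values]
geo_mean = math.exp(statistics.mean(log_values))

print(f"mean = {mean}, median = {median}, mode = {mode}")
print(f"weighted mean  = {weighted_mean}")
print(f"geometric mean = {geo_mean:.3f}")
```

For skewed data like these, the median and geometric mean sit below the arithmetic mean, which is pulled upward by the largest value.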

**Measures of Spread or Dispersion**(Statistical Tests on Numerical Data)

**Range**(spread of lowest to highest data value)

- emphasizes the extreme values

**Percentiles**

- divides up the distribution into percentile intervals

- allows comparison of data to a norm

**Normal or Standard Distribution**

- applies if the distribution is bell-shaped

**Standard Deviation** (**SD**, how data clusters about the mean)

- square root of the average squared deviation from the mean

- (the similar average of the absolute values of the deviations from the mean is not used)

- used with a mean

**Variance** is the SD squared (the average squared deviation from the mean)

**Coefficient of Variation**

- Standard Deviation adjusted or normalized by dividing by the mean value

- makes it possible to compare distributions of different ranges and means

**Degrees of Freedom**

- number of observations minus one

- number of independent variables
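The measures of spread above can be sketched in a few lines of Python. The data are hypothetical, and the sketch assumes the sample form of the variance, in which the sum of squared deviations is divided by the degrees of freedom (n - 1) rather than by n.

```python
import statistics

values = [4, 8, 6, 5, 3, 7, 9]   # hypothetical raw data

n = len(values)
degrees_of_freedom = n - 1            # number of observations minus one
mean = statistics.mean(values)

data_range = max(values) - min(values)   # lowest to highest value

# Sample variance (assumption: divide by n - 1), then SD as its square root.
variance = sum((v - mean) ** 2 for v in values) / degrees_of_freedom
sd = variance ** 0.5

# Coefficient of variation: SD normalized by the mean (dimensionless).
cv = sd / mean

# Quartiles (25th, 50th, 75th percentiles). Several conventions exist;
# this uses the default method of statistics.quantiles (Python 3.8+).
quartiles = statistics.quantiles(values, n=4)

print(f"range = {data_range}, variance = {variance:.3f}, SD = {sd:.3f}")
print(f"coefficient of variation = {cv:.3f}")
print(f"quartiles = {quartiles}")
```

Because the coefficient of variation has no units, it can be compared across measurements with different scales and means.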

**COMPARING TWO OR MORE CHARACTERISTICS**

**Nominal Data (grouped items)**

**Basic Inquiry Statistics**(refer to Fig Definitions of Symbols for Nominal Relationships)

**Experimental Event Rate** in Exposed (**ERR**)

- ERR = A/(A+B)

- number with risk factor who have or develop the outcome

**Control Event Rate** in Unexposed (**CRR**)

- CRR = C/(C+D)

- number without risk factor who have or develop the outcome

**Absolute Risk Reduction/Increase**(**ARR/ARI**)

- ARR = ERR - CRR

- way to appraise the reduction in risk compared to the baseline risk

- events avoided per 10,000 people

**Number Needed to Treat/Harm**(**NNT/NNH**)

- NNT = 1/ARR = 1/(ERR - CRR)

- number needed to treat to prevent one event

**Relative Risk Reduction** (**RRR**)

- RRR = ARR/CRR

- can obtain ARR by multiplying RRR by the Control Event Rate (ARR = RRR * CRR)
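The chain from event rates to NNT and RRR can be sketched as below. The cell layout is an assumption inferred from ERR = A/(A+B) and CRR = C/(C+D), since the figure is not reproduced here: A and B are exposed subjects with and without the outcome, C and D are unexposed subjects with and without the outcome. The counts are hypothetical, and the absolute value is taken so the same code covers both a reduction and an increase in risk.

```python
# Hypothetical 2x2 counts (illustration only).
a, b = 15, 85   # exposed/treated: with outcome, without outcome
c, d = 25, 75   # unexposed/control: with outcome, without outcome

err = a / (a + b)       # experimental event rate
crr = c / (c + d)       # control event rate

# Assumption: use the size of the difference, whichever rate is larger.
arr = abs(err - crr)    # absolute risk reduction (or increase)
nnt = 1 / arr           # number needed to treat (or harm)
rrr = arr / crr         # relative risk reduction

print(f"ERR = {err:.2f}, CRR = {crr:.2f}")
print(f"ARR = {arr:.2f}, NNT = {nnt:.1f}, RRR = {rrr:.2f}")
```

The same RRR can correspond to very different ARR and NNT values depending on the baseline (control) rate, which is why the absolute measures are reported alongside the relative one.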

**Measures of Significance for Nominal Data**(more Inquiry Statistics)

**Relative Risk** ratio (**RR**)

- ratio of the rate of the outcome in those with the risk factor (exposed) to the rate in those without the risk factor (not exposed)

- RR = ERR/CRR = [A/(A+B)] / [C/(C+D)]

- investigator decides the number of subjects with and without risk factors

- inquiry runs **from risk factor to outcome**, *forward* in time

- contrast to Odds Ratio

- so *can be calculated only from a cohort or clinical trial*

- a cohort study or clinical trial is also forward in time (from causes to outcomes), so RR is the appropriate measure

- persons with and without the risk factor are followed over time to determine which persons develop the outcome of interest

**Odds Ratio** (**OR**)

- (odds that a person with an adverse outcome was exposed or at risk prior to the outcome) divided by the (odds that a person without an adverse outcome was exposed or at risk)

- OR = (A/C)/(B/D) = **AD/BC** (see also fig Definitions of Symbols for Nominal Relationships)

- also called cross-product ratio

- investigator decides the number of subjects with and without disease outcome

- inquiry runs **from outcome to risk factor**, *backward* in time

- contrast to Relative Risk

- Odds ratio usually used for case-control studies

- a case-control study is also backward in time (from outcomes to causes), so OR is the appropriate measure

- logistic regression can also be interpreted in terms of odds ratio in addition to relative risk (see chapter 8)

- OR is non-linear, so *exaggerates extremely high and low odds*

- matching measure statistic (RR, OR) to appropriate study type (Cohort, Trials, Case-control)

- each observational **study type** differs in both its __Tense__ and __Direction of Inquiry Statistics__

**Tense**:

*in the future = prospective*

*in the past = retrospective*

**Inquiry Statistics Direction**:

*forward (risk to outcome)*

*backward (outcome to risk)*

- each **measure statistic** also differs in its __Direction of Inquiry Statistics__

- RR Relative Risk (*forward* in time)

- OR Odds Ratio (*backward* in time)

- matching appropriate statistic (see __fig observational studies__)

- case-control is **retrospective** (*backward*), so use **OR**

- cohort is **prospective** (*forward*), so use **RR**

- historical cohort is **retrospective** (*forward*), so use **RR**
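The two formulas, RR = [A/(A+B)] / [C/(C+D)] and OR = AD/BC, can be compared on the same table with a short Python sketch. The function names and the counts are hypothetical, and the cell layout is an assumption inferred from those formulas: A and B are exposed subjects with and without the outcome, C and D are unexposed subjects with and without the outcome.

```python
def relative_risk(a, b, c, d):
    """RR = [A/(A+B)] / [C/(C+D)]: outcome rate in exposed over unexposed."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR = AD/BC, the cross-product ratio."""
    return (a * d) / (b * c)

# Hypothetical 2x2 counts (illustration only).
a, b = 30, 70   # exposed: with outcome, without outcome
c, d = 10, 90   # unexposed: with outcome, without outcome

rr = relative_risk(a, b, c, d)
oratio = odds_ratio(a, b, c, d)

print(f"RR = {rr:.2f}")
print(f"OR = {oratio:.2f}")
```

With these made-up counts the OR is further from 1 than the RR, which illustrates the point above that the odds ratio exaggerates when the outcome is not rare.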

**Ordinal Data (ranked items)**

**Spearman rank correlation (rs = -1 to +1)**

- for two ordinal variables, one ordinal and one numerical variable, or two numerical variables if the data are skewed

- rank order the data (a *derived* statistic) from lowest to highest by some characteristic

- ranks then used to calculate the statistic instead of the data

- +1 or -1 indicates perfect agreement between the ranks (order) of the values, but *not* the values themselves

- tedious
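The rank-then-correlate procedure described above is tedious by hand but short in code. The sketch below is one common way to do it, assuming tied values share their average rank; the function names and the data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank from lowest (1) to highest; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1   # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Ordinary correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: the correlation of the ranks, not the data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: y always increases with x, but not along a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]

print(f"Spearman rs = {spearman(x, y):.3f}")
print(f"Pearson r   = {pearson(x, y):.3f}")
```

Here the ranks agree perfectly even though the values do not lie on a line, so the rank correlation is at its maximum while the ordinary correlation is not.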

**Numerical Data (raw numbers)**

**Correlation Coefficient (r = -1 to +1)**

- between two *numbers* (numerical characteristics)

- independent of units

- greatly influenced by outlying (extreme) data values, so not good for skewed data

- use a transformation that changes the scale before the correlation is computed

- these transformations provide a weighted correlation

- rank or logarithmic transformations

__crude rule of thumb for interpreting correlation r__ (same for negative r's)

0.00 to 0.25 (little or no relationship)

0.25 to 0.50 (fair degree of relationship)

0.50 to 0.75 (moderate to good relationship)

0.75 or greater (very good to excellent relationship)

- correlations of r = 0.95 to 1.00 are suspect and may indicate artifact or error in the biological sciences

- only measures a straight-line (linear) relationship; it does not capture a curvilinear relationship, where one characteristic changes more as the other changes

- "**correlation does not imply causation**"; causation *must* be justified by experimental observations or logical argument

**coefficient of determination (r²)**

- indicates the percent of variation shared between the characteristics (accountability or predictability), not causation
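A minimal sketch of r and r² in Python, written out from the definition rather than taken from any particular library. The height and weight values are hypothetical, and the `describe` helper simply encodes the crude rule of thumb above; how it treats the shared boundary values (0.25, 0.50, 0.75) is an assumption.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient of two equal-length numerical lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def describe(r):
    """Crude rule of thumb; boundary values fall in the higher band (assumption)."""
    size = abs(r)   # same interpretation for negative r
    if size < 0.25:
        return "little or no relationship"
    if size < 0.50:
        return "fair degree of relationship"
    if size < 0.75:
        return "moderate to good relationship"
    return "very good to excellent relationship"

# Hypothetical measurements (illustration only).
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 75, 86]

r = pearson_r(height_cm, weight_kg)
r_squared = r ** 2

# Changing units (cm to m) leaves r unchanged: it is independent of units.
height_m = [h / 100 for h in height_cm]
r_in_metres = pearson_r(height_m, weight_kg)

print(f"r = {r:.3f} ({describe(r)}), r squared = {r_squared:.3f}")
print(f"r with height in metres = {r_in_metres:.3f}")
```

Note that a value this close to 1 would fall in the "suspect" range mentioned above if it came from real biological data; it arises here only because the numbers were invented to lie near a line.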

**TABLES CAN GIVE MISLEADING PERCENTAGES** (a common error)

- how the data is presented and the scaling affects the interpretation

- when two or more measures are of interest, the purpose of the study generally determines which measure is viewed within the context of the other (which is dependent and independent).

- can imply a causality that is misleading if the data are normalized to the **outcomes** rather than the **risks**

- alters the percentage statistic relative to a different denominator, as part of a different whole

- difficult to keep clear: each is a percent of which part, and in which causal direction?

- this example (Table 3-23 p 57) shows that data on compliance and insurance status, collected from a survey at one point in time, indicate that 35% of patients with a low level of compliance have no insurance (A. table), and 55% of patients with no insurance have low compliance (B. Table).

- shows whether the data are consistent with the interpretation, but doesn't prove causation. Here the results are calculated from a survey at one point in time, so causality is a confusing question. Especially when groups are subdivided, the ability to distinguish causality from coincidence is diminished. The table for a coincidence looks exactly the same as the table for a causality.

- can also imply causality that is *false*, if the table is inverted to display something in the reverse direction that was causal only in the forward direction (C. Table).
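The denominator effect can be seen by computing both percentages from one table. The counts below are hypothetical: they were chosen only so that the two percentages match the 35% and 55% quoted above, and they are not the counts from Table 3-23.

```python
# Hypothetical survey counts (NOT the actual Table 3-23 data).
low_no_insurance = 77      # low compliance, no insurance
low_insured = 143          # low compliance, insured
high_no_insurance = 63     # high compliance, no insurance
high_insured = 237         # high compliance, insured

low_compliance_total = low_no_insurance + low_insured
no_insurance_total = low_no_insurance + high_no_insurance

# Same cell, two different denominators (two different "wholes").
pct_uninsured_among_low = 100 * low_no_insurance / low_compliance_total
pct_low_among_uninsured = 100 * low_no_insurance / no_insurance_total

print(f"{pct_uninsured_among_low:.0f}% of low-compliance patients have no insurance")
print(f"{pct_low_among_uninsured:.0f}% of uninsured patients have low compliance")
```

Both statements describe the same 77 patients; only the group they are divided by changes, and neither percentage by itself says which characteristic, if either, causes the other.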