Biostatistics

Types of Statistics

Descriptive Statistics
  • Buzzword → Collect, describe
  • Collect, organize, summarize, and present the data using numbers and graphs
  • Uses measures of central tendency and dispersion
  • Used when the data set is small

Inferential Statistics
  • Buzzword → Conclude (inference)
  • Use sample data to draw conclusions, perform estimations, and make predictions (inferences) about the population
  • Uses statistical tests
  • Used when the data set is large

Q. What is the branch of statistics dealing with describing the internal collection of data? (INICET-2022)
A. Theoretical statistics
B. Analytical statistics
C. Inferential statistics
D. Descriptive statistics

Ans:
Descriptive statistics
What is the branch of statistics dealing with method of data collection and counting?
A. Theoretical statistics
B. Descriptive statistics
C. Analytical statistics
D. Inferential statistics
Ans
Descriptive statistics

Data and Central Tendency Measures

Data Types

notion image

Data Scales → NOIR

notion image

Example

  • GCS, Visual Analogue Scale → Ordinal
  • Has negative values → Interval
    • Mnemonic: football (FC → Fahrenheit and Celsius) → Interval scale
  • Temperature measured
      1. Kelvin → ratio
      1. Celsius/F → Interval
      1. hot or cold → nominal
      1. Hot, hotter, hottest → Ordinal
  • Hb
      1. Ratio as such
      1. Anaemic and non anaemic → Nominal
      1. Mild, mod and severe anaemia → ordinal

Types of Ordinal Scale

  • Likert scale:
    • Mnemonic: Likert → Like or dislike (only distinct options)
    • Agree/disagree continuum
    • Summative (summated rating) scale
  • Guttman scale:
    • Cumulative scale
      • notion image

Variations and Deviations

Measures of Variations

  • Range = Maximum Value - Minimum Value
  • Standard Deviation (σ):
    • σ = √[Σ(x-x̄)²/n], where:
      • Mnemonic: RMSD (Root of Mean of Squared Deviation).
      • Σ: Summation
      • n: Sample size
      • x̄: Mean
      • x: Given value
    • In small samples, "n" is replaced by n-1 (correction factor).
      • notion image
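
The RMSD formula above can be checked with a minimal sketch. The function name and the Hb readings are made up for illustration; the `sample` flag switches the divisor from n to n-1.

```python
import math

def standard_deviation(values, sample=False):
    """Root of the Mean of Squared Deviations; divides by n-1 when sample=True."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n  # n-1 is the small-sample correction
    return math.sqrt(squared_deviations / divisor)

hb_values = [10, 12, 14]  # hypothetical Hb readings (g/dL)
print(standard_deviation(hb_values))               # divides by n
print(standard_deviation(hb_values, sample=True))  # divides by n-1
```

The n-1 version is always slightly larger, which matters most when the sample is small.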

Coefficient of Variation:

  • CV = (SD / Mean) × 100
  • Used to find variation in 2 different groups.
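
As a rough sketch of why the coefficient of variation is handy for comparing two groups, the snippet below uses two made-up groups measured on very different scales. It assumes the population SD (divide by n); the group values are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """CV (%) = SD / mean x 100, using the population SD (divide by n)."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

group_a = [8, 10, 12]      # hypothetical values, small scale
group_b = [90, 100, 110]   # hypothetical values, large scale

print(round(coefficient_of_variation(group_a), 1))
print(round(coefficient_of_variation(group_b), 1))
```

Group B has the larger SD in absolute terms, but group A varies more relative to its own mean.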

Variance

  • = SD x SD

Standard Error (SE)

  • For Mean (SE of Mean): SD / √n
  • For Standard Deviation (SE of SD): SD / √(2n)
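
A small sketch of the two standard errors, assuming the usual textbook forms SE of mean = SD / √n and SE of SD ≈ SD / √(2n). The function names and input numbers are illustrative only.

```python
import math

def standard_error_of_mean(sd, n):
    """SE of mean = SD / sqrt(n)."""
    return sd / math.sqrt(n)

def standard_error_of_sd(sd, n):
    """SE of SD, commonly approximated as SD / sqrt(2n)."""
    return sd / math.sqrt(2 * n)

print(standard_error_of_mean(sd=10, n=100))
print(standard_error_of_sd(sd=10, n=50))
```

Both shrink as the sample size grows, which is why larger samples give more precise estimates.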

SAMPLE SIZE

  • n = 4pq / L²
    • p = Probability (%)
    • q = 100 - p
    • L = Allowable error (%)
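
A worked sketch of the sample size formula. It assumes p and L are both entered as percentages and that L is an absolute allowable error; the prevalence and error values are hypothetical.

```python
import math

def sample_size(p, allowable_error):
    """n = 4pq / L^2, with p, q and L all expressed in percent."""
    q = 100 - p
    return math.ceil(4 * p * q / allowable_error ** 2)

# Hypothetical: expected prevalence 10%, allowable error 2%
print(sample_size(p=10, allowable_error=2))
```

Halving the allowable error quadruples the required sample size, because L is squared in the denominator.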

Z-score

  • (x - mean)/ SD
  • For individual values → Osteoporosis (T score)
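
The Z-score formula as a minimal sketch; the reading, mean, and SD below are made-up numbers.

```python
def z_score(x, mean, sd):
    """How many SDs an individual value lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical: individual reading 130, population mean 100, SD 15
print(z_score(130, mean=100, sd=15))
```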

Random error (RE) ∝ 1 / Precision; Precision ∝ Reproducibility / Reliability

  • Mnemonic: poke random (random error) people precisely (precision) and you can reproduce (reproducibility) it

Systematic error (SE) ∝ 1 / Accuracy; Accuracy ∝ Validity

  • Mnemonic: Vaccurat (Validity → Accuracy) systematic (systematic error)
  • Validity:
    • Hit the board
    • Results are within a desired range.
  • Accuracy:
    • Bull’s eye
    • Nearness/closeness to the actual true value.
  • Reliability:
    • Repeatability
    • Out of the board, but same result when repeated
    • Dependable, reproducible, repeatable results.
  • Note:
    • Serial interval:
      • Proxy indicator for incubation period.
    • CFR:
      • Denote virulence of a disease.

Normal Distribution Curve and Skew

notion image

Features of Normal Distribution Curve

  • Bilaterally symmetrical, bell-shaped curve.
    • Depends on mean and standard deviation
  • Also known as Gaussian distribution.
  • Unimodal
    • One peak
  • Ends never touch baseline.
  • Mean = median = mode:
    • Coincide at centre.
  • Standard normal curve: Mean = 0, SD = 1, Variance = 1.
  • Area Under Curve (AUC) = 100% (1).
  • Note:
    • Standard Deviation (SD) at centre = zero.
    • ± 1SD = 68.3%
    • ± 2SD = 95.4% (Normal zone).
    • ± 3SD = 99.7%
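
The areas under the curve can be reproduced with a short sketch. It relies on the standard identity that the fraction within ±k SD equals erf(k / √2), which is not stated in these notes; the function name is illustrative.

```python
import math

def area_within(k):
    """Fraction of a normal distribution lying within ±k SD of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"±{k} SD: {area_within(k) * 100:.1f}%")
```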

NOTE

  • Base of graph α margin of error
    • notion image

Bimodal distribution

notion image
  • Two peaks
  • Bi mode = Two Modes
  • Mean = Median ⇏ Mode

Skew

  • Abnormal distribution curve
notion image
notion image
  • Right/positive skew:
    • Mean > median > mode
    • Right meen
  • Left/negative skew:
    • Mean < median < mode
    • Left mode

Binomial Distribution

  • Only 2 possible outcomes
    • Example: Yes / No, Success / Failure

Poisson Distribution

  • Describes chance of events in a given time frame
    • Examples:
      • Number of OPD cases per week
      • Predicting future case counts
  • Used for rare events
  • Formula:
    • μ = λ × t
      • λ = average rate
      • t = time
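
A minimal sketch of a Poisson calculation. It uses μ = λ × t from above together with the standard probability formula P(k) = e^(−μ) μ^k / k!, which is an addition not given in these notes. The OPD rate is a made-up figure.

```python
import math

def poisson_probability(k, rate, time):
    """P(exactly k events) when events occur at an average `rate` per unit time."""
    mu = rate * time  # expected count: mu = lambda x t
    return math.exp(-mu) * mu ** k / math.factorial(k)

# Hypothetical OPD: on average 2 cases of a rare disease per week, over 1 week
for k in range(4):
    print(k, round(poisson_probability(k, rate=2, time=1), 3))
```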

Measures of Central Tendency

notion image
Mean
  • Average of all values
  • Easy, simple measure
  • Most affected by extreme values
  • Mnemonic: Mean = average
  • Best for metric data (Mean = Metric)

Median = 50th centile
  • Central value after arranging in ascending/descending order
  • Mnemonic: Median → Middle
  • Best for ordinal data (Middle of order)

Mode
  • Most frequently occurring value (aka robust value)
  • Least affected by extreme values
  • Last to change with data variations
    • E.g. [1, 2, 3, 3, 3, 40000] → mode stays 3, but not useful
  • Mnemonic: Mo → Most recurring
  • Best for nominal data (No Mo)
  • Mode = 3 × Median − 2 × Mean
  • Best measure of central tendency:
    • Mean > median > mode
  • Best measure when there is diverse range of values
    • Median
  • In skewed data (extreme)
    • Most affected: Mean
    • Least affected: Mode
    • Most useful measure: Median
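
To see how one extreme value affects each measure, here is a quick sketch using the data set quoted earlier in this section, [1, 2, 3, 3, 3, 40000], with Python's standard `statistics` module.

```python
import statistics

values = [1, 2, 3, 3, 3, 40000]  # one extreme value at the end

print("Mean:", statistics.mean(values))      # pulled far upwards by 40000
print("Median:", statistics.median(values))  # stays in the middle of the data
print("Mode:", statistics.mode(values))      # most frequent value, unchanged
```

The mean is dragged into the thousands while the median and mode stay at 3, which is why the median is preferred for skewed data.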
What is the most appropriate measure of central tendency when there is a diverse range of values within a community?
A. Mean
B. Mode
C. Median
D. Standard deviation
ANS
C. Median
  • The median is less affected by outliers or extreme values compared to the mean.
  • In distributions with a diverse range of values or skewed data, the median provides a more robust and representative measure of the center.
What is the term used to describe a probability distribution that represents the likelihood of a specific number of events occurring within a specific time period?
A. Binomial distribution
B. Gaussian distribution
C. Poisson distribution
D. Normal distribution
ANS
C. Poisson distribution
Types of distribution

Population distribution
  • Meaning: How values would look in the whole population (everyone)
  • Example: Heights of all students in the whole school → the full set of values (usually unknown, theoretical)

Sample distribution
  • Meaning: How values look in the group we actually studied (sample)
  • Example: You measure heights of 30 students in one class → the actual values you see form the sample distribution

Sampling distribution
  • Meaning: How a statistic (like mean) would vary if we took many samples of the same size
  • Example: If you take many different samples of 30 students from the school and calculate the average height for each sample, the pattern of those averages forms the sampling distribution
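
The three distributions can be simulated in a rough sketch. The school of 5000 students, the height parameters, and the fixed random seed are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Population distribution: hypothetical heights (cm) of all 5000 students
population = [rng.gauss(150, 10) for _ in range(5000)]

# Sample distribution: the 30 heights actually measured in one class
one_sample = rng.sample(population, 30)

# Sampling distribution: means of many samples of the same size
sample_means = [statistics.mean(rng.sample(population, 30)) for _ in range(1000)]

print("Population SD:", round(statistics.pstdev(population), 2))
print("SD of sample means:", round(statistics.stdev(sample_means), 2))
```

The sample means cluster much more tightly than the individual heights; their spread is what the standard error of the mean estimates.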

P-Value

notion image

P-Value (Probability Value)

  • Range from 0 to 1.
  • P-value at ± 2 SD = 0.05 (Normal).

Null Hypothesis (H₀)

  • No difference between the groups being compared.

Non-significant p-value
  • P-value ≥ 0.05
  • Value within the normal zone
  • No effect is observed (normal variation) → H₀: Accepted

Significant p-value
  • P-value < 0.05
  • Value outside the normal zone (abnormal zone)
  • Effect observed → Difference is present (between the 2 groups tested) → H₀: Rejected
  • Does not comment on anything else

P-value = 0.04 means:

  • If the null hypothesis is true, there is only a 4% chance of seeing a difference this large by chance
  • 0.04 < 0.05 → Result is significant → Null hypothesis rejected
  • Exam shorthand: 4% chance of a false positive conclusion
    • Null hypothesis was true but rejected → Type 1 error (AFP)

A randomized control trial is being conducted in patients with direct inguinal hernia, to compare the outcomes of a new surgery 'A' and the gold standard surgery 'B'. The p-value of this trial is found to be 0.04. What can we conclude from this?

A. Type II error is small and we can accept the findings of the study
B. The probability of a false negative conclusion that operation A is better than B is 4%
C. The power of the study to detect the difference between operations A and B is 96%
D. The probability of a false positive conclusion that operation A is better than B is 4%
ANS
  • A p-value of 0.04 (< 0.05) → null hypothesis is rejected → statistically significant difference present
  • But a 4% chance remains that the difference arose by chance, i.e. no real difference existed between the two surgeries.
  • False positive conclusion = Type I error 
    • Actually No difference in Surgeries
    • But study concluded opposite
  • Thus, the p-value of 0.04 = 4% chance of Type 1 error/ False positive conclusion

Option-wise Analysis:

  • A. Type II error is small...
    • ❌ Incorrect: Type II error (false negative) is not related to p-value. It relates to power.
  • B. Probability of false negative...
    • ❌ Incorrect: This refers to Type II error, which is not the p-value. The p-value refers to false positives.
  • C. Power is 96%...
    • ❌ Incorrect: Power = 1 − β. P-value does not give power directly.
  • D. Probability of a false positive conclusion is 4%
    • Correct: This is the definition of p-value — the chance of false positive (Type I error) if the null hypothesis is true.

✅ Final Answer: D. The probability of a false positive conclusion that operation A is better than B is 4%

Errors in P-value (Hypothesis Testing)

Memory Tips:

  • Type I Error: First letter = α = False positive = Rejected True hypothesis
    • AFP → RT (α - Reject True)
    • P < 0.05 → Reject → true
    • (him α guy) → False positive guy → Rejected true love
  • Type II Error: Second = β = False negative = Accepted false hypothesis
    • Boyfriend (β) was fun (FN) accepted False gf (False hypothesis)
    • P > 0.05 → Accept → false
    • (me β guy) → False negative guy → Accepted false love

Null Hypothesis (H₀)

  • States: "No difference exists"
  • We assume it's true, then perform tests to reject or accept it based on data.

Type 1 Error (α error)

  • Reality: H₀ is true (no difference between drugs)
  • Error: Rejected a true H₀
  • Buzzword:
    • H₀ true, False positive study
    • 1 → True, Positive
  • Example:
    • New drug is not better, but launched due to clinical trial
  • Outcome:
    • P value < 0.05 ⇒ Positive
      • Significant difference = rejected a true null hypothesis
    • False positive trial
      • Meaning = Trial concluded positive, but It was false
    • Serious error
    • Drug launched unnecessarily
  • Basically what it means
    • Some fool made a useless drug, no different from the existing drug
      (i.e. reality → null hypothesis is true)
      • But the trial said it is an effective drug → So it was introduced into the market
      • He is a fool, but hailed as an alpha by everyone
      • This useless drug may replace the existing drug
        • It is a serious error

Type 2 Error (β error)

  • Reality: H₀ is false (new drug is better)
  • Error: Accepted a false H₀
  • Buzzword:
    • H₀ false, False negative trial
    • 2 → False, negative
  • Example:
    • New drug is better, but not launched due to clinical trial
  • Outcome:
    • P > 0.05 ⇒ Negative
    • False negative trial
    • Less serious
    • Missed opportunity for better drug
  • Basically what it means
    • I made a good drug → which is better than existing drug
      (i.e. null hypothesis was false)
      • But trial somehow rejected it by saying my drug is not better
        (interpreted it as true → [& accepted a null hypothesis]→ which was actually false)
      • Trial showed negative → but it was false → False negative
      • Made me look like a β
      • But it is not a serious error, because it won't make anyone's health worse than it currently is

MCQs

Q. Type 1 statistical error is said to have occurred if:
  • A. Null hypothesis is true and is accepted
  • B. Null hypothesis is false but is accepted
  • C. Null hypothesis is true but is rejected
  • D. Null hypothesis is false and is rejected
    • ANS
      Null hypothesis is true but is rejected
Q. A randomized trial comparing the efficacy of two drugs showed a difference between the two (p<0.05). Assume that in reality, however, the two drugs do not differ. This is, therefore, an example of
  • A. Type I error
  • B. Type II error
  • C. 1 - alpha
  • D. 1 - beta
    • ANS
      A. Type I error
 

Power of a Test

  • The power of a test can be increased by increasing sample size and precision
  • Power of a Test Formula: (1 − β)
  • Why (1 − β)?
    • β error = Type II error (false negative): H₀ is false but accepted
    • Power = probability of correctly rejecting a false H₀ = 1 − β

Test of Significance

notion image
Mathematical formula to derive p-value.

Parametric tests
  • Type of data: Quantitative
  • Compare means and standard deviations
  • Normal distributions
  • More powerful
  • Mnemonic: Students with lots of Quantity (quantitative) set traps ("para" → Parametric), get set (paired) and break up (unpaired), bathe with Pears soap (Pearson), arrive in an Innova (ANOVA), and study A to Z (Z-test)

Non-parametric tests
  • Type of data: Qualitative
  • Compare percentages and proportions
  • Skewed distributions
  • Less powerful
  • Mnemonic: No (Non-parametric) quality (qualitative) → Screwed (skewed) Properly (proportions and percentages) → gets fried

Choice of test by number of groups (before and after testing)

  • 1 group
    • Parametric: Student's paired t-test
      • E.g. Compare mean Hb in a group of dengue patients BEFORE and AFTER treatment
    • Non-parametric (paired tests):
      • McNemar test
      • Wilcoxon Signed Rank Test
      • Mnemonic: Neymar sign for 1 person
  • 2 groups
    • Parametric: Student's unpaired t-test
      • E.g. Compare mean Hb in a group of dengue patients and a group of malaria patients
    • Non-parametric (unpaired tests):
      • Wilcoxon Rank Sum test
      • Chi square test
        • Degrees of freedom = (Rows − 1) × (Columns − 1)
      • Mnemonic: Square and sum → 2
  • ≥3 groups
    • Parametric: ANOVA (analysis of variance)
    • Non-parametric: Kruskal Wallis test >> Chi square test
  • Others
    • Parametric:
      • Pearson's Correlation Coefficient
      • Z-Test: used in place of the t-test when the sample size is > 30
    • Non-parametric:
      • Friedman/Fisher exact test: used in place of the chi-square test when the sample size is < 30
  • BP or sugar of 50 people checked before and after one month of drug
    • Quantitative → Same group
      → Use Paired t-test

  • HB levels compared in 2 villages
    • One village gets IFA tablets, other gets diet modifications
      • Quantitative → Different groups,
        → Use Unpaired t-test

  • Effect of 3 different interventions measured by % of smokers in 3 different groups:
    • 1st group: Technical lecture
    • 2nd group: Spiritual talks
    • 3rd group: Nicotine tablets
      • % = Qualitative
      • → Use Kruskal Wallis test > Chi square test

  • Birthweight of baby measured in 2 groups of pregnant women:
    • Group 1: Eats 200g papaya
    • Group 2: Eats 200g guava
      • Birthweight → Quantitative
        • → Use Unpaired t-test
      • If instead of birth weight, LBW vs HBW was plotted, it would be a qualitative study
        • Chi square test
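
For the qualitative version of the birthweight example, a chi-square calculation might look like the sketch below. The 2 × 2 counts are invented, and the statistic Σ (O − E)² / E is the standard formula rather than something given in these notes; the degrees of freedom follow (rows − 1) × (columns − 1).

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and outcome were unrelated
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

# Hypothetical counts: rows = papaya / guava group, columns = LBW / not LBW
table = [[20, 30],
         [30, 20]]
print(chi_square(table))
```

The statistic is then compared with the critical value for the given degrees of freedom (3.84 for 1 degree of freedom at p = 0.05).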

Chi-square Test vs Wilcoxon Signed-Rank Test

Chi-square test
  • Use case: To check association between two categorical variables (e.g. Obesity vs Breast cancer)
  • Paired/Unpaired: Unpaired → 2 groups
  • Suitable for association between risk factor & outcome? ✅ Yes

Wilcoxon signed-rank test
  • Use case: Compare paired observations (e.g. before and after intervention on same subjects)
  • Paired/Unpaired: Paired → 1 group
  • Suitable for association between risk factor & outcome? ❌ No

Other Statistical Tests

Kolmogorov-Smirnov test
  • Normalcy of data
  • Mnemonic: normal name for Russians

Dixon's Q test
  • Outliers of data
  • Mnemonic: dicks lie outside

Kappa statistic
  • Agreement b/w observers
  • Mnemonic: Agreement for Kappa TV
  • Formula = (Observed agreement − Expected agreement) / (1 − Expected agreement)

Yates correction
  • Small sample size

Fisher exact test
  • Cell sample size < 5 (very small sample)
  • Mnemonic: Small amount of fish (<5)

Wilcoxon rank test
  • Ordinal data (stages/grades of data)
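
The kappa formula, (observed agreement − expected agreement) / (1 − expected agreement), can be worked through with a hedged sketch. The two-observer table is invented, and computing expected agreement from the row and column totals is the usual Cohen's kappa approach, assumed here.

```python
def kappa(table):
    """Cohen's kappa for a square table: rows = observer 1, columns = observer 2."""
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Observed agreement: proportion of cases on the diagonal
    observed = sum(table[i][i] for i in range(len(table))) / total
    # Expected (chance) agreement from the marginal totals
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical: two observers reading 100 X-rays as abnormal / normal
readings = [[40, 10],
            [5, 45]]
print(round(kappa(readings), 2))
```

Here the observers agree on 85% of films, but half of that agreement is expected by chance alone, so kappa is lower than the raw agreement.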

Sampling Methods

Non probability

Non-random sampling: key points

  • Convenience sampling
    • Easier to do.
  • Purposive sampling
    • 2° intention.
  • Quota sampling
    • Pre-determined group of sampling.
  • Snowball sampling
    • Selected samples will select more samples.
    • Used in diseases with social stigma (e.g. drug abuse, rape, acid attack).
    • E.g. needle exchange program for drug abuse

Probability (Better)

notion image
notion image
Probability sampling: key points

  • Simple Random
    • Random number method.
    • Adopted for a homogeneous population.
    • A sampling frame is available.
    • Used for smaller populations.
  • Stratified
    • Simple stratified sampling
    • Population proportionate to size sampling (better method)
    • Heterogeneous population converted into homogeneous groups (strata)
    • Predetermined criteria → based on which population is stratified
      • Ex: Religion and age groups
  • Systematic Random
    • Nth individual selected.
    • Adopted for a heterogeneous population.
    • Large population.
    • E.g. every 3rd participant is selected; the kth interval is calculated.
    • kth interval = total population / sample size
  • Cluster
    • Done to evaluate healthcare programs in a large homogeneous population.
    • 30 × 7 concept
    • Design effect → To account for loss of data → while selecting clusters
    • Clusters: Naturally occurring groups → We select these groups randomly.
    • Used for evaluation of immunization coverage & antenatal coverage
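
Systematic random sampling with a kth interval (total population / sample size) can be sketched as follows. The random starting point within the first interval and the fixed seed are assumptions; the population is just a list of made-up ID numbers.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick a random start within the first interval, then every kth individual."""
    k = len(population) // sample_size  # kth interval
    start = random.Random(seed).randrange(k)
    return population[start::k][:sample_size]

household_ids = list(range(1, 101))  # hypothetical sampling frame of 100 households
print(systematic_sample(household_ids, sample_size=10, seed=1))
```

With 100 households and a sample of 10, k = 10, so every 10th household after the random start is chosen.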

Q. The urban areas of Delhi have 4000 population with different religions. Research is being done to study the dietary habits of the population. Which of the following techniques can be used to obtain a study sample?

A. Cluster random sampling
B. Stratified random sampling
C. Simple random sampling
D. Systematic random sampling.
ANS
B. Stratified random sampling
notion image

Regression

  • Used to calculate one variable with the help of another variable.

Types:

  1. Simple Regression
      • Using 1 independent variable
      • Calculate 1 dependent variable
  1. Multiple Regression
      • Using > 1 independent variable
      • Calculate 1 dependent variable
  1. Logistic Regression
      • Outcome is binary
      • Yes/No
      • e.g.,
        • Death/No Death,
        • Clean/Not Clean
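
Simple regression (1 independent variable predicting 1 dependent variable) can be illustrated with a least-squares sketch. The least-squares method and the age/weight numbers are assumptions added for illustration, not taken from these notes.

```python
def simple_regression(xs, ys):
    """Least-squares line y = intercept + slope * x; slope is the regression coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

ages = [1, 2, 3, 4]     # hypothetical independent variable
weights = [3, 5, 7, 9]  # hypothetical dependent variable
slope, intercept = simple_regression(ages, weights)
print(slope, intercept)
print("Predicted value at x = 5:", intercept + slope * 5)
```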
How is the calculation of one variable using another variable typically performed?
A. Coefficient of correlation
B. Coefficient of Regression
C. Coefficient of variation
D. Coefficient of determination
ANS
B. Coefficient of Regression

Graphical Presentations

Quantitative
  • Histogram
  • Frequency polygon/chart
  • Scatter plot
  • Line diagram
  • Cumulative frequency curve (Ogive)

Qualitative
  • Bar chart
  • Pie chart
  • Pictogram
  • Spot maps
  • Venn diagram
  • Choropleth

Mnemonic: Qualitative → Pictorial
  • Bar chart → Bar
  • Pie chart → Pie
  • Pictogram → Picture
  • Spot maps → Maps
  • Venn diagram → Venice picture
  • Choropleth → Chlorin

Charts

notion image
notion image

Kaplan-Meier Curve

  • PYQ → Compare survival probability
    • notion image

ROC Curve

  • Receiver Operating Characteristic (ROC) Curve
  • Likelihood ratio (positive) = Sn / (1 − Sp) = true positive rate / false positive rate
  • Ideal Test → Maximum AUC
  • Like a Rock
    • notion image
      notion image
      A → Screening ?
      • Uses:
        • Defining cut-off.
        • Comparing investigations.
      • Best investigation:
        • Maximum area under curve.
        • Top left-most peak on a curve.
      notion image
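
The positive likelihood ratio, Sn / (1 − Sp), can be computed from a 2 × 2 table as in this sketch. The counts are hypothetical and the function name is illustrative.

```python
def positive_likelihood_ratio(tp, fp, fn, tn):
    """LR+ = sensitivity / (1 - specificity) = true positive rate / false positive rate."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity / (1 - specificity)

# Hypothetical test results: 100 diseased and 100 healthy people
print(round(positive_likelihood_ratio(tp=90, fp=20, fn=10, tn=80), 2))
```

Each point on an ROC curve plots sensitivity against (1 − specificity) for one cut-off, so the ratio above is the slope of the line from the origin to that point.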
       

STEM and LEAF

notion image

TREE PLOT

notion image

Bar Chart

notion image
  • Columns separated
  • Use:
    • Qualitative data
    • Frequency on other axis

Multiple Bar Chart

notion image
  • Multiple data in single column
  • Don't confuse with Component Bar Chart
 

Component Bar Chart/ Composite bar chart

notion image
  • Same data is represented across different timelines
  • total magnitude is divided into different subset
 
notion image
ANS
C

Histogram

notion image
  • Continuous data
    • History is continuous
  • Columns joined
  • Use:
    • Quantitative data
    • Frequency and variable
      • NOT TIME
    • Flow cytometry

Frequency Chart/Polygon Curve

notion image
  • Midpoints of histogram joined
    • Between variable and frequency
 

Stacked Bar Chart

notion image
  • Multiple bars stacked

Line Diagram or Chart

notion image
  • Time vs. frequency
  • Also Time Trend Chart
    • Between Frequency and time
 
What is the most effective approach to graphically represent the variations in the occurrence of a disease within a specific geographical region throughout a period of time?
A. Line graph
B. Histogram
C. Ogive
D. Tree diagram
ANS
A. Line graph
What is the most effective method for graphically representing the fluctuations in disease occurrence within a specific geographical region?
A. Line graph
B. Histogram
C. Ogive
D. Tree diagram
ANS
A. Line graph

Pie Chart

notion image
  • Sectoral presentation
  • Always in percentage
  • Best for technical data

Venn Diagram

notion image
  • For overlapping qualitative data
  • Mnemonic: When on your lap

Spot Map

notion image
  • Geographical location of data on map
 

Choropleth

notion image
  • Geographic intensity of data on map

Pictogram

notion image
  • Symbolic data presentation
  • Best for general population
  • Creates mass awareness

Ogive

notion image
 
notion image
  • Cumulative frequency vs. variable
    • notion image
  • Curve always rising
  • Defines:
    • Class interval
    • Cut-off

Box and Whisker Plot

Leiomyoma → Max variability
Hyperplasia → Less distributed
notion image
  • Shows data distribution
  • Based on quartile
  • Values in box = 50%
  • Inter-quartile range: Q3 - Q1
  • Does not show mean
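
The quartiles behind a box and whisker plot can be sketched with Python's `statistics.quantiles`. The data values are made up, and quartile conventions differ slightly between textbooks and software; this uses the module's default method.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7]  # hypothetical measurements

q1, q2, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
iqr = q3 - q1  # inter-quartile range: width of the box, holding the middle 50%

print("Q1:", q1, "Median:", q2, "Q3:", q3, "IQR:", iqr)
```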
 
notion image
notion image
notion image
notion image
ANS
B
 
notion image
ANS
Mean > Median > Mode → Right skew, Positive

Scatter Plot

notion image
  • Between two quantitative variables
  • Finds correlation
  • Shows direction, strength, degree of relation

FLOW CYTOMETRY

  • SCATTER PLOT
  • HISTOGRAM
 

Pearson's Correlation Coefficient (r)

notion image
notion image
  • Used in scatter plot
  • NOT AN INDICATOR OF VARIATION
  • Helps to find correlation between two variables.
    • r = -1 : Perfect negative correlation = PROTECTIVE
    • r = 0 : No correlation
    • r = +1 : Perfect positive correlation = RISK FACTOR
      • b/w 1 and 0 = Weak correlation
  • Spearman correlation coefficient also used

Which of the following statements accurately describes the correlation between bone marrow density and gestational age, as depicted in two separate studies?

notion image
notion image
notion image
ANS
notion image
 
 

A homogenous sample of 4 groups was taken as shown below. The height and weight of each group had a correlation coefficient of 0.6. What will be the total coefficient of correlation of the whole group?

notion image
A. Equal to 0.6
B. Less than 0.6
C. More than 0.6
D. Skewed coefficient of 0.6
ANS
A. Equal to 0.6

Ishikawa Diagram / Fishbone Diagram

notion image
  • Also known as Cause-and-Effect Diagram.
  • Purpose:
    • Identifies all possible causes contributing to a specific problem or effect.
  • Major Cause Categories (Primary Causes):
    • Materials
    • Methods / Processes
    • Manpower / People
    • Machines / Equipment
    • Mother Nature / Environment
    • Measurements
  • Structure:
    • Main “bones” represent primary causes.
    • Each primary cause is further broken down into secondary causes.
  • Effect:
    • Represented at the fish head (right side): The Problem to be analyzed.
notion image
You are initiating hypertension services at your Primary Health Center (PHC). 50 patients in need of antihypertensive treatment have been transferred from another facility. Among them, 40 were prescribed amlodipine (5mg orally) and 10 were prescribed lisinopril (10mg orally) due to contraindications with amlodipine. Medications are supplied monthly at the PHC, and you are responsible for ordering them. How many tablets do you need to order and what should be the reorder factor?
A. 1000, rf=3
B. 1200, rf=2
C. 1400, rf=3
D. 1600, rf=2
Ans

Monthly Requirement:

  • Amlodipine = 40 patients × 30 days = 1200 tablets
  • Lisinopril = 10 patients × 30 days = 300 tablets
  • Total tablets required = 1200 + 300 = 1500 tablets

Reorder Factor (rf):

  • Reorder factor is the buffer stock multiplier.
Let’s test each option:
  • A. 1000 tablets: Too low (base requirement is 1500) → ❌
  • B. 1200 tablets: Too low → ❌
  • C. 1400 tablets: Still low → ❌
  • D. 1600 tablets: The only option that covers the 1500-tablet base requirement → ✅
    • Note: 1600 / 1500 ≈ 1.07, so the rf = 2 in this option is not derived from these numbers; it is taken as the standard buffer factor in public health.