Biostatistics
Types of Statistics
Descriptive Statistics | Inferential Statistics |
Buzzword → Collect, describe | Buzzword → Conclude |
Collect and describe | Inference → Conclude |
Collect, organize, summarize, and present the data using numbers and graphs | Use sample data to draw conclusions, perform estimations, and make predictions, inference about population |
Uses measures of central tendency and dispersion | Uses statistical tests |
Used when the data set is small | Used when the data set is large |
Q. What is the branch of statistics dealing with describing the internal collection of data? (INICET-2022)
A. Theoretical statistics
B. Analytical statistics
C. Inferential statistics
D. Descriptive statistics
Ans:
Descriptive statistics
What is the branch of statistics dealing with method of data collection and counting?
A. Theoretical statistics
B. Descriptive statistics
C. Analytical statistics
D. Inferential statistics
Ans
Descriptive statistics
Data and Central Tendency Measures
Data Types

Data Scales → NOIR

Example
- GCS, Visual Analogue Scale → Ordinal
- Has negative values → Interval
- Mnemonic: football club (FC → Fahrenheit and Celsius) → Interval scale
- Temperature measured
- Kelvin → ratio
- Celsius/F → Interval
- hot or cold → nominal
- Hot, hotter, hottest → Ordinal
- Hb
- Ratio as such
- Anaemic and non anaemic → Nominal
- Mild, mod and severe anaemia → ordinal
Types of Ordinal Scale
- Likert scale:
- Mnemonic: Likert → Like or dislike (only distinct options)
- Agree/disagree continuum
- Guttman scale:
- Summative scale

Variations and Deviations
Measures of Variations
- Range = Maximum Value - Minimum Value
- Standard Deviation (σ):
- σ = √[Σ(x − x̄)² / n], where:
- Σ: Summation
- x: Given value
- x̄: Mean
- n: Sample size
- Mnemonic: RMSD (Root of Mean of Squared Deviations).
- In small samples, "n" is replaced by n − 1 (correction factor).

Coefficient of Variation:
- CV = (SD / mean) × 100
- Used to compare variation between 2 different groups.
Variance
- = SD x SD
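The measures of variation above can be worked through on a small data set. This is a minimal sketch: the function name `describe_spread` and the Hb values are hypothetical, chosen only for illustration.

```python
import math

def describe_spread(values, sample=True):
    """Return range, variance, SD and coefficient of variation (%)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    # n - 1 is the small-sample correction factor; n is used otherwise
    denominator = n - 1 if sample else n
    variance = squared_deviations / denominator  # variance = SD x SD
    sd = math.sqrt(variance)                     # root of mean of squared deviations
    return {
        "range": max(values) - min(values),
        "variance": variance,
        "sd": sd,
        "cv_percent": sd / mean * 100,
    }

hb_values = [10.0, 12.0, 14.0]  # hypothetical Hb readings (g/dL)
spread = describe_spread(hb_values, sample=True)
print(spread)
```

Passing `sample=False` divides by n instead of n − 1, which gives a slightly smaller SD for the same data.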
Standard Error (SE)
- For Mean (SE of Mean): SD / √n

- For Standard Deviation (SE of SD): SD / √(2n)



SAMPLE SIZE
- 4pq/ (L x L)
- p = Probability (%)
- q = 100 - p
- L = Allowable error (%)
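A quick worked example of the 4pq / L² formula may help. This sketch assumes L is given as an absolute percentage (not as a fraction of p); the function name and the numbers are hypothetical.

```python
import math

def sample_size(p_percent, allowable_error_percent):
    """Sample size = 4pq / L^2, with p, q and L all in percent."""
    q_percent = 100 - p_percent
    n = 4 * p_percent * q_percent / allowable_error_percent ** 2
    return math.ceil(n)  # round up to a whole number of subjects

# Hypothetical: expected prevalence 10%, allowable error 2% (absolute)
required = sample_size(10, 2)
print(required)
```

Halving the allowable error quadruples the required sample size, because L is squared in the denominator.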
Z-score
- (x - mean)/ SD
- For individual values → Osteoporosis (T score)
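The Z-score formula can be illustrated in a couple of lines. The values below are hypothetical and only show how far one observation lies from the mean in SD units.

```python
def z_score(x, mean, sd):
    """Number of standard deviations by which x differs from the mean."""
    return (x - mean) / sd

# Hypothetical: observed value 130, group mean 100, SD 15
z = z_score(130, mean=100, sd=15)
print(z)
```

A negative result means the value lies below the mean.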
Random error (RE) ∝ 1 / Precision; Precision ∝ Reproducibility / Reliability
- Mnemonic: stab Random people Precisely and you can Reproduce it (Random error, Precision, Reproducibility)
Systematic error (SE) ∝ 1 / Accuracy; Accuracy ∝ Validity
- Mnemonic: "Vaccurate" → Validity, Accuracy, Systematic error



- Validity:
- Hit the board
- Results are within a desired range.
- Accuracy:
- Bull’s eye
- Nearness/closeness to the actual true value.
- Reliability:
- Repeatability
- Out of the board, but same result when repeated
- Dependable, reproducible, repeatable results.
- Note:
- Serial interval:
- Proxy indicator for incubation period.
- CFR:
- Denote virulence of a disease.
Normal Distribution Curve and Skew

Features of Normal Distribution Curve
- Bilaterally symmetrical, bell-shaped curve.
- Depends on mean and standard deviation
- Also known as Gaussian distribution.
- Unimodal
- One peak
- Ends never touch baseline.
- Mean = median = mode:
- Coincide at centre.
- Standard normal curve: Mean = 0, SD = 1, Variance = 1.
- Area Under Curve (AUC) = 100% (1).
- Note:
- Deviation from the mean (z-score) at the centre = zero.
- ± 1SD = 68.3%
- ± 2SD = 95.5% (Normal zone).
- ± 3SD = 99.7%
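The areas within ±1, ±2 and ±3 SD can be checked numerically. This sketch uses the standard result that the area within ±k SD of a normal curve equals erf(k / √2); the function name is hypothetical.

```python
import math

def area_within(k):
    """Fraction of a normal distribution lying within ±k SD of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"±{k} SD covers {area_within(k) * 100:.2f}% of values")
```

The ±2 SD area comes out at about 95.45%, which is why it is quoted as 95.4% or 95.5% depending on rounding.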
NOTE
- Base of graph α margin of error

Bimodal distribution

- Two peaks
- Bi mode = Two Modes
- Mean = Median ≠ Mode
Skew
- Abnormal distribution curve


- Right/positive skew:
- Mean > median > mode
- Right meen
- Left/negative skew:
- Mean < median < mode
- Left mode
Binomial Distribution
- Only 2 possible outcomes
- Example: Yes / No, Success / Failure
Poisson Distribution
- Describes chance of events in a given time frame
- Examples:
- Number of OPD cases per week
- Predicting future case counts
- Used for rare events
- Formula:
- μ = λ × t
- λ = average rate
- t = time
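To see how μ = λ × t is used, the sketch below plugs it into the standard Poisson probability formula P(k) = e^(−μ) × μ^k / k!. That formula is not written out in these notes and is added here as an assumption-free textbook result; the function name and the OPD numbers are hypothetical.

```python
import math

def poisson_probability(k, rate, time):
    """Probability of exactly k events when the expected count is rate * time."""
    mu = rate * time  # μ = λ × t
    return math.exp(-mu) * mu ** k / math.factorial(k)

# Hypothetical: on average 2 rare cases per week, observed over 3 weeks (μ = 6)
p_no_cases = poisson_probability(0, rate=2, time=3)
p_six_cases = poisson_probability(6, rate=2, time=3)
total_probability = sum(poisson_probability(k, 2, 3) for k in range(60))
print(p_no_cases, p_six_cases, total_probability)
```

The probabilities over all possible counts add up to 1, which is a useful sanity check on the formula.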
Measures of Central Tendency

Mean | Median = 50th centile | Mode |
Average of all values Easy, simple measure | Central value after arranging in ascending/descending order | Most frequently occurring value (aka robust value) |
Most affected by extreme values | Least affected by extreme values | Last to change with data variations [1, 2, 3, 3, 3, 40000] → But not useful |
Mean = average | Median → Middle | Mo → Most recurring |
Best for metric data Mean = Metric | Best for ordinal data Middle of order | Best for nominal data No Mo |
- Mode = 3 × median − 2 × mean (empirical relation)
- Best measure of central tendency:
- Mean > median > mode
- Best measure when there is diverse range of values
- Median
- In skewed data (extreme)
- Most affected: Mean
- Least affected: Mode
- Most useful measure: Median
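The effect of an extreme value on the three measures can be seen with the data set quoted in the table above, [1, 2, 3, 3, 3, 40000]. A minimal sketch using Python's standard `statistics` module:

```python
import statistics

values = [1, 2, 3, 3, 3, 40000]  # data set from the table, with one extreme value

mean_value = statistics.mean(values)      # pulled far upwards by 40000
median_value = statistics.median(values)  # middle of the ordered values
mode_value = statistics.mode(values)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```

The mean ends up in the thousands while the median and mode both stay at 3, which is why the median is preferred for skewed data.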
What is the most appropriate measure of central tendency when there is a diverse range of values within a community?
A. Mean
B. Mode
C. Median
D. Standard deviation
ANS
C. Median
- The median is less affected by outliers or extreme values compared to the mean.
- In distributions with a diverse range of values or skewed data, the median provides a more robust and representative measure of the center.
What is the term used to describe a probability distribution that represents the likelihood of a specific number of events occurring within a specific time period?
A. Binomial distribution
B. Gaussian distribution
C. Poisson distribution
D. Normal distribution
ANS
C. Poisson distribution
Type | Meaning |
Population distribution | How values would look in the whole population (everyone) Heights of all students in the whole school → the full set of values (usually unknown, theoretical) |
Sample distribution | How values look in the group we actually studied (sample) You measure heights of 30 students in one class → actual values you see form the sample distribution |
Sampling distribution | How a statistic (like mean) would vary if we took many samples of the same size If you take many different samples of 30 students from the school and calculate the average height for each sample, the pattern of those averages forms the sampling distribution |
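The three distributions in the table can be simulated to make the difference concrete. This is a sketch under stated assumptions: the school of 2000 students, the mean height of 160 cm and the SD of 10 cm are all hypothetical, and the seed is fixed only so the run is repeatable.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Population distribution: heights (cm) of every student in a hypothetical school
population = [random.gauss(160, 10) for _ in range(2000)]

# Sample distribution: the 30 heights actually measured in one class
one_sample = random.sample(population, 30)

# Sampling distribution: means of many different samples of 30 students
sample_means = [
    statistics.mean(random.sample(population, 30)) for _ in range(1000)
]

population_mean = statistics.mean(population)
population_sd = statistics.pstdev(population)
mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_se = population_sd / math.sqrt(30)  # usual standard error of the mean

print(population_mean, mean_of_means)
print(sd_of_means, expected_se)
```

The sample means cluster tightly around the population mean, and their spread should sit close to SD / √n rather than to the SD of individual heights.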
P-Value

P-Value (Probability Value)
- Range from 0 to 1.
- P-value at ± 2 SD = 0.05 (Normal).
Null Hypothesis (N°H)
- No difference between the groups being compared.
Non-significant p-value | Significant p-value |
P-value ≥ 0.05 | P-value < 0.05 |
Value within the normal zone | Value outside the normal zone → Abnormal zone → Abnormal effect between 2 groups → Effect observed |
No effect is observed → N°H: Accepted (Normal variation) | Effect observed → Difference is Present (between 2 groups tested) → N°H: Rejected Does not comment on anything else |
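P-value = 0.04 means:
- 96% chance that the difference is significant
- Null hypothesis → False
- 4% chance that the difference is insignificant
- Null hypothesis → True
- So there is also a 4% chance of a false positive
- Rejecting a true null hypothesis → Type 1 error (AFP: α = False Positive)
- Strictly, p = 0.04 is the probability of seeing a difference at least this large if the null hypothesis were true; exam questions read this as a 4% chance of a false positive.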
A randomized control trial is being conducted in patients with direct inguinal hernia, to compare the outcomes of a new surgery 'A' and the gold standard surgery 'B'. The p-value of this trial is found to be 0.04. What can we conclude from this?
A. Type II error is small and we can accept the findings of the study
B. The probability of a false negative conclusion that operation A is better than B is 4%
C. The power of the study to detect the difference between operations A and B is 96%
D. The probability of a false positive conclusion that operation A is better than B is 4%
ANS
- A p-value of 0.04 ⇔ null hypothesis is rejected ⇔ statistically significant difference present
- But a 4% chance exists ⇔ no difference existed between the two surgeries.
- False positive conclusion = Type I error
- Actually No difference in Surgeries
- But study concluded opposite
- Thus, the p-value of 0.04 = 4% chance of Type 1 error/ False positive conclusion
Option-wise Analysis:
- A. Type II error is small...
❌ Incorrect: Type II error (false negative) is not related to p-value. It relates to power.
- B. Probability of false negative...
❌ Incorrect: This refers to Type II error, which is not the p-value. The p-value refers to false positives.
- C. Power is 96%...
❌ Incorrect: Power = 1 − β. P-value does not give power directly.
- D. Probability of a false positive conclusion is 4%
✅ Correct: This is the definition of p-value — the chance of false positive (Type I error) if the null hypothesis is true.
✅ Final Answer: D. The probability of a false positive conclusion that operation A is better than B is 4%
Errors in P-value (Hypothesis Testing)
Memory Tips:
- Type I Error: First letter = α = False positive = Rejected True hypothesis
- AFP → RT (α - Reject True)
- P < 0.05 → Reject → true
- (him α guy) → False positive guy → Rejected true love
- Type II Error: Second = β = False negative = Accepted false hypothesis
- Boyfriend (β) was fun (FN) accepted False gf (False hypothesis)
- P > 0.05 → Accept → false
- (me β guy) → False negative guy → Accepted false love
Null Hypothesis (H₀)
- States: "No difference exists"
- We assume it's true, then perform tests to reject or accept it based on data.
Type 1 Error (α error)
- Reality: H₀ is true (no difference between drugs)
- Error: Rejected a true H₀
- Buzzword:
- H₀ true, False positive study
- 1 → True, Positive
- Example:
- New drug is not better, but launched due to clinical trial
- Outcome:
- P value < 0.05 ⇒ Positive
- Significant difference = rejected a true null hypothesis
- False positive trial
- Meaning = Trial concluded positive, but It was false
- Serious error
- Drug launched unnecessarily
- Basically what it means
- Someone made a useless drug that is no different from the existing drug (i.e. reality → null hypothesis is true)
- But the trial said it is an effective drug → so it was introduced into the market
- He is a fool, but is hailed as an alpha (α) by everyone
- This useless drug may replace the existing drug
- It is a serious error
Type 2 Error (β error)
- Reality: H₀ is false (new drug is better)
- Error: Accepted a false H₀
- Buzzword:
- H₀ false, False negative trial
- 2 → False, negative
- Example:
- New drug is better, but not launched due to clinical trial
- Outcome:
- P > 0.05 ⇒ Negative
- False negative trial
- Less serious
- Missed opportunity for better drug
- Basically what it means
- I made a good drug → which is better than the existing drug (i.e. null hypothesis was false)
- But the trial somehow rejected it by saying my drug is not better (interpreted the null hypothesis as true → accepted a null hypothesis that was actually false)
- Trial showed negative → but it was false → False negative
- Made me look like a β
- But it is a less serious error, because it won't affect anyone's health more badly than it currently is
MCQs
Q. Type 1 statistical error is said to have occurred if:
- A. Null hypothesis is true and is accepted
- B. Null hypothesis is false but is accepted
- C. Null hypothesis is true but is rejected
- D. Null hypothesis is false and is rejected
ANS
Null hypothesis is true but is rejected
Q. A randomized trial comparing the efficacy of two drugs showed a difference between the two (p<0.05). Assume that in reality, however, the two drugs do not differ. This is, therefore, an example of
- A. Type I error
- B. Type II error
- C. 1 - alpha
- D. 1 - beta
ANS
A. Type I error
Power of a Test
- The power of the test can be increased by ↑↑ sample size, Precision
- Power of a Test Formula: (1 − β)
- Why (1 − β)?
- β error = Type II error (false negative)
- H₀ is false but accepted
- Power is the remaining probability: H₀ is false and is correctly rejected
Test of Significance

Mathematical formula to derive p-value.
ㅤ | Parametric | Non-parametric |
Type of data | Quantitative | Qualitative |
ㅤ | Compare means and standard deviations | Compare percentages and proportions |
ㅤ | Normal distributions | Skewed distributions |
ㅤ | More powerful | Less powerful |
Number of groups (Before and after testing) | Mnemonic: Students with lots of Quantity (quantitative) → play pranks ("para" → Parametric), get set (paired) and break up (unpaired), bathe with Pears (Pearson) soap, arrive in an Innova (ANOVA) car, and study A to Z (Z-test) | Mnemonic: No (Non-parametric) quality → Screwed (skewed) Properly (proportions and percentages) → gets fried |
1 group | Student's paired t-test E.g. Compare mean Hb in a group of dengue patients BEFORE and AFTER treatment | Paired non-parametric tests: • McNemar test • Wilcoxon Signed Rank Test • Mnemonic: Neymar sign for 1 person |
2 groups | Student's unpaired t-test E.g. Compare mean Hb in a group of dengue patients and a group of malaria patients | Unpaired non-parametric tests: • Wilcoxon Rank Sum test • Chi-square test ↳ Degrees of freedom = (rows − 1) × (columns − 1) • Mnemonic: Square and sum → 2 |
≥3 groups | ANOVA (analysis of variance) | Non-parametric equivalents of ANOVA: • Kruskal Wallis test >> • Chi-square test |
ㅤ | Pearson’s Correlation Coefficient | ㅤ |
ㅤ | Z-Test • Used in place of the T-test when the sample size is > 30 | Friedman/ Fisher exact test • Used in place of the chi-square test when the sample size is <30. |



- BP or sugar of 50 people checked before and after one month of drug
→ Quantitative → Same group
→ Use Paired t-test
- HB levels compared in 2 villages
- One village gets IFA tablets, other gets diet modifications
→ Quantitative → Different groups,
→ Use Unpaired t-test
- Effect of 3 different intervention measured by % of smokers in 3 different groups:
- 1st group: Technical lecture
- 2nd group: Spiritual talks
- 3rd group: Nicotine tablets
- % = Qualitative
- → Use Kruskal Wallis test > Chi square test
- Birthweight of baby measured in 2 groups of pregnant women:
- Group 1: Eats 200g papaya
- Group 2: Eats 200g guava
- Birthweight → Quantitative
→ Use Unpaired t-test
- If instead of birth weight, LBW vs HBW was compared, it would be a qualitative study
→ Use Chi square test
Chi-square Test vs Wilcoxon Signed-Rank Test
Test | Use Case | Paired/Unpaired | Suitable for Association Between Risk Factor & Outcome? |
Chi-square test | To check association between two categorical variables (e.g. Obesity vs Breast cancer) | Unpaired → 2 group | ✅ Yes |
Wilcoxon signed-rank test | Compare paired observations (e.g. before and after intervention on same subjects) | Paired → 1 group | ❌ No |
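The chi-square statistic and its degrees of freedom, (rows − 1) × (columns − 1), can be computed by hand for a small table. This is a sketch: the function name and the 2 × 2 counts (obesity vs breast cancer) are hypothetical, and the expected counts use the usual row total × column total / grand total rule.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

# Hypothetical counts: rows = obese / not obese, columns = cancer / no cancer
observed_counts = [[20, 30],
                   [30, 20]]
statistic, df = chi_square(observed_counts)
print(statistic, df)
```

With 1 degree of freedom, a statistic above 3.84 corresponds to p < 0.05, so this hypothetical table would count as a significant association.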
Other Statistical Tests
Tests | Description |
Kolmogorov Smirnov test | • Normalcy of data • normal name for Russians |
Dixon's Q test | • Outliers of data • dicks lie outside |
Kappa statistic | • Agreement for kappa tv • Agreement b/w observers • Formula = (Observed agreement − Expected agreement) / (1 − Expected agreement) |
Yates correction | • Small sample size |
Fisher exact test | • Cell sample size < 5 (Very small sample) • Small amount of fish (<5) |
Wilcoxon rank test | • Ordinal data (Stages/grades of data) |
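The kappa statistic, (observed agreement − expected agreement) / (1 − expected agreement), can be worked out from a 2 × 2 table of two observers' ratings. A minimal sketch; the function name and the counts are hypothetical.

```python
def cohen_kappa(table):
    """Kappa for a square table: rows = observer 1, columns = observer 2."""
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    # Observed agreement: both observers gave the same rating (diagonal cells)
    observed = sum(table[i][i] for i in range(len(table))) / total
    # Expected agreement: agreement that would occur by chance alone
    expected = sum(r * c for r, c in zip(row_totals, column_totals)) / total ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical: 50 X-rays read as abnormal/normal by two observers
ratings = [[20, 5],
           [10, 15]]
kappa = cohen_kappa(ratings)
print(kappa)
```

Here the observers agree on 70% of films, but half of that agreement is expected by chance, so kappa is well below 0.7.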
Sampling Methods
Non probability
Non-random sampling : | Key Point |
Convenient sampling | Easier to do. |
Purposive sampling | 2° intention. |
Quota sampling | Pre-determined group of sampling. |
Snowball sampling | - Selected samples will select more samples. - Used in diseases with social stigma (e.g. drug abuse, rape, acid attack). Eg: needle exchange program for drug abuse |
Probability (Better)


Sampling : | Key Points |
Simple Random | Random number method. • Adopted for a Homogenous population. • A Sampling frame is available. • Used for smaller populations. |
Stratified | Simple stratified sampling Population proportionate to size sampling (Better method) Heterogenous population converted into homogenous population • Predetermined criteria → based on which population is stratified • Ex: Religion and age groups |
Systematic Random | Nth individual selected. • Adopted for a heterogeneous population. • Large population. • E.g. every 3rd participant is selected; the sampling interval (k) is calculated as k = total population / sample size |
Cluster | Done to evaluate healthcare programs in large homogenous population. • 30x7 concept • DESIGN EFFECT → To account for loss of data → while selecting clusters Clusters: Naturally occurring groups → We select these groups randomly. Evaluation of immunization coverage & Antenatal coverage |
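The sampling interval from the systematic sampling row (k = total population / sample size) can be sketched as follows. The function name, the random starting point within the first interval, and the population figures are illustrative assumptions.

```python
import random

def systematic_sample(population_size, sample_size, start=None):
    """Return the interval k and the serial numbers of the selected individuals."""
    k = population_size // sample_size  # sampling interval
    if start is None:
        start = random.randint(1, k)    # random start within the first interval
    selected = [start + i * k for i in range(sample_size)]
    return k, selected

# Hypothetical: pick 100 people from a numbered list of 1000, starting at person 3
k, selected = systematic_sample(1000, 100, start=3)
print(k, selected[:5])
```

Only the first individual is chosen at random; every later pick follows automatically at the fixed interval.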
Q. The urban areas of Delhi have 4000 population with different religions. Research is being done to study the dietary habits of the population. Which of the following techniques can be used to obtain a study sample?
A. Cluster random sampling
B. Stratified random sampling
C. Simple random sampling
D. Systematic random sampling.
ANS
B. Stratified random sampling

Regression
- Used to calculate one variable with the help of another variable.
Types:
- Simple Regression
- Using 1 independent variable
- Calculate 1 dependent variable
- Multiple Regression
- Using > 1 independent variable
- Calculate 1 dependent variable
- Logistic Regression
- Outcome is binary
- Yes/No
- e.g.,
- Death/No Death,
- Clean/Not Clean
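Simple regression, where one dependent variable is calculated from one independent variable, can be sketched with an ordinary least-squares line. The least-squares method is an assumption here (the notes do not name a fitting method), and the function name and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; slope is the regression coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: x = independent variable, y = dependent variable
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
predicted_y = intercept + slope * 5  # calculate y for a new x value
print(slope, intercept, predicted_y)
```

Multiple regression extends the same idea to more than one independent variable, and logistic regression replaces the numeric outcome with a yes/no outcome.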
How is the calculation of one variable using another variable typically performed?
A. Coefficient of correlation
B. Coefficient of Regression
C. Coefficient of variation
D. Coefficient of determination
Graphical Presentations

• Bar chart → Bar
• Pie chart → Pie
• Pictogram → Picture
• Spot maps → Maps
• Venn diagram → Venice picture
• Choropleth → Chlorin
Quantitative | Qualitative |
• Histogram • Frequency polygon/chart • Scatter plot • Line diagram • Cumulative frequency curve (Ogive) | • Bar chart • Pie chart • Pictogram • Spot maps • Venn diagram • Choropleth |
Charts


Kaplan-Meier Curve
- PYQ → Compare survival probability

ROC Curve
- Receiver Operating Characteristic (ROC) Curve
- Positive likelihood ratio (LR+) = Sn / (1 − Sp) = true positive rate / false positive rate
- Ideal Test → Maximum AUC
- Like a Rock
- Uses:
- Defining cut-off.
- Comparing investigations.
- Best investigation:
- Maximum area under curve.
- Top left-most peak on a curve.
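Each point on an ROC curve is one cut-off's sensitivity plotted against 1 − specificity, and the ratio of the two is the positive likelihood ratio. A minimal sketch with hypothetical counts and a hypothetical function name:

```python
def roc_point(tp, fn, fp, tn):
    """Return sensitivity, false positive rate (1 - specificity) and their ratio."""
    sensitivity = tp / (tp + fn)           # true positive rate
    false_positive_rate = fp / (fp + tn)   # 1 - specificity
    likelihood_ratio = sensitivity / false_positive_rate
    return sensitivity, false_positive_rate, likelihood_ratio

# Hypothetical cut-off: 90 true positives, 10 false negatives,
# 20 false positives, 80 true negatives
sensitivity, false_positive_rate, likelihood_ratio = roc_point(90, 10, 20, 80)
print(sensitivity, false_positive_rate, likelihood_ratio)
```

A cut-off near the top-left corner of the curve has high sensitivity with a low false positive rate, which gives a large ratio.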


A → Screening ?

STEM and LEAF

TREE PLOT

Bar Chart

- Columns separated
- Use:
- Qualitative data
- Frequency on other axis
Multiple Bar Chart

- Multiple data in single column
- Don't confuse with Component Bar Chart
Component Bar Chart/ Composite bar chart

- Same data is represented across different timelines
- total magnitude is divided into different subset

ANS
C
Histogram

- Continuous data
- History is continuous
- Columns joined
- Use:
- Quantitative data
- Frequency and variable
- NOT TIME
- Flow cytometry
Frequency Chart/Polygon Curve

- Midpoints of histogram joined
- Between variable and frequency
Stacked Bar Chart

- Multiple bars stacked
Line Diagram or Chart

- Time vs. frequency
- Also Time Trend Chart
- Between Frequency and time
What is the most effective approach to graphically represent the variations in the occurrence of a disease within a specific geographical region throughout a period of time?
A. Line graph
B. Histogram
C. Ogive
D. Tree diagram
ANS
A. Line graph
What is the most effective method for graphically representing the fluctuations in disease occurrence within a specific geographical region?
A. Line graph
B. Histogram
C. Ogive
D. Tree diagram
ANS
A. Line graph
Pie Chart

- Sectoral presentation
- Always in percentage
- Best for technical data
Venn Diagram

- For overlapping qualitative data
- Mnemonic: When on your lap
Spot Map

- Geographical location of data on map
Choropleth

- Geographic intensity of data on map
Pictogram

- Symbolic data presentation
- Best for general population
- Creates mass awareness
Ogive


- Cumulative frequency vs. variable

- Curve always rising
- Defines:
- Class interval
- Cut-off
Box and Whisker Plot

Hyperplasia → Less distributed

- Shows data distribution
- Based on quartile
- Values in box = 50%
- Inter-quartile range: Q3 - Q1
- Does not show mean
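The quartiles behind a box and whisker plot can be computed with Python's standard `statistics.quantiles`. This is a sketch on hypothetical data; different quartile conventions can give slightly different values for the same data, and the default method of the library is used here.

```python
import statistics

values = list(range(1, 12))  # hypothetical data: 1, 2, ..., 11

# Three cut points split the ordered data into four quarters
q1, q2, q3 = statistics.quantiles(values, n=4)
interquartile_range = q3 - q1  # width of the box, holding the middle 50% of values

print(q1, q2, q3, interquartile_range)
```

The line inside the box is Q2, the median; the mean does not appear anywhere on the plot.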




ANS
B

ANS
Mean > Median > Mode → Right skew, Positive
Scatter Plot

Pearson's Correlation Coefficient (r)


- Used in scatter plot
- NOT AN INDICATOR OF VARIATION
- Helps to find correlation between two variables.
- r = -1 : Perfect negative correlation = PROTECTIVE
- r = 0 : No correlation
- r = +1 : Perfect positive correlation = RISK FACTOR
- b/w 1 and 0 = Weak correlation
- Spearman correlation coefficient also used
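Pearson's r can be computed directly from paired values to see how the −1, 0 and +1 cases arise. A minimal sketch with a hypothetical function name and made-up data:

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)

exposure = [1, 2, 3, 4, 5]            # hypothetical variable
rising_outcome = [2, 4, 6, 8, 10]     # rises with exposure
falling_outcome = [10, 8, 6, 4, 2]    # falls as exposure rises

r_positive = pearson_r(exposure, rising_outcome)
r_negative = pearson_r(exposure, falling_outcome)
print(r_positive, r_negative)
```

Real data rarely fall on a perfect line, so r usually lands somewhere between these two extremes.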
Which of the following statements accurately describes the correlation between bone marrow density and gestational age, as depicted in two separate studies?



ANS

A homogenous sample of 4 groups was taken as shown below. The height and weight of each group had a correlation coefficient of 0.6. What will be the total coefficient of correlation of the whole group?

A. Equal to 0.6
B. Less than 0.6
C. More than 0.6
D. Skewed coefficient of 0.6
ANS
A. Equal to 0.6
Ishikawa Diagram / Fishbone Diagram

- Also known as Cause-and-Effect Diagram.
- Purpose:
- Identifies all possible causes contributing to a specific problem or effect.
- Major Cause Categories (Primary Causes):
- Materials
- Methods / Processes
- Manpower / People
- Machines / Equipment
- Mother Nature / Environment
- Measurements
- Structure:
- Main “bones” represent primary causes.
- Each primary cause is further broken down into secondary causes.
- Effect:
- Represented at the fish head (right side): The Problem to be analyzed.

You are initiating hypertension services at your Primary Health Center (PHC). 50 patients in need of antihypertensive treatment have been transferred from another facility. Among them, 40 were prescribed amlodipine (5mg orally) and 10 were prescribed lisinopril (10mg orally) due to contraindications with amlodipine. Medications are supplied monthly at the PHC, and you are responsible for ordering them. How many tablets do you need to order and what should be the reorder factor?
A. 1000, rf=3
B. 1200, rf=2
C. 1400, rf=3
D. 1600, rf=2
Ans
Monthly Requirement:
- Amlodipine = 40 patients × 30 days = 1200 tablets
- Lisinopril = 10 patients × 30 days = 300 tablets
- Total tablets required = 1200 + 300 = 1500 tablets
Reorder Factor (rf):
- Reorder factor is the buffer stock multiplier.
Let’s test each option:
- A. 1000 tablets: Too low (even base requirement is 1500) → ❌
- B. 1200 tablets: Too low → ❌
- C. 1400 tablets: Still low → ❌
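- D. 1600 tablets: The only option that covers the 1500-tablet monthly requirement.
→ 1600 / 1500 ≈ 1.07, so the quantity itself is the monthly requirement plus a small buffer; the option pairs it with rf = 2, taken here as a standard buffer factor in public health.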