Statistics

[statistic] Basic Statistics _ T-test

개발자열무 2023. 4. 28. 12:15

✋ T-test

  ✔  one sample t-test

  ✔  student t-test

  ✔  welch's t-test

  ✔  paired sample t-test

 

✋ one sample t-test

  ✔  The population standard deviation is unknown, so it is estimated from the sample standard deviation

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/ethanweed/pythonbook/main/Data/zeppo.csv")
df.head()

 

import statistics
statistics.stdev(df['grades'])

# 9.520614752375915

👉 We cannot claim that the population standard deviation (σ) is 9.5, but we can say that the sample standard deviation (σ̂) is 9.52 and use it as the estimate of σ
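The distinction between the sample estimate and the population value also shows up in the divisor. A minimal sketch, using hypothetical scores rather than the zeppo.csv data: statistics.stdev divides by N - 1 (the estimate σ̂), while statistics.pstdev divides by N.

```python
import math
import statistics

# Hypothetical scores, for illustration only (not the zeppo.csv data).
scores = [50, 60, 60, 64, 66, 66, 67, 69, 70, 74]
n = len(scores)

sample_sd = statistics.stdev(scores)       # divides by N - 1 (estimate of sigma)
population_sd = statistics.pstdev(scores)  # divides by N (treats data as the whole population)

print("stdev  (N - 1):", sample_sd)
print("pstdev (N)    :", population_sd)

# The two differ only by the factor sqrt(N / (N - 1)).
print(math.isclose(sample_sd, population_sd * math.sqrt(n / (n - 1))))
```

The gap between the two shrinks as N grows, which is why the choice matters most for small samples.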

📌 Setting up the hypotheses

  ✔  Null hypothesis : population mean == 100
  ✔  Alternative hypothesis : population mean != 100
  ✔  The basic premise is that the population standard deviation is unknown
  ✔  With the population standard deviation unknown, decide whether the population mean equals the hypothesized value, using the sample mean as evidence

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as stats

mu = 0
sigma = 1
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
y = 100* stats.norm.pdf(x, mu, sigma)

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

sns.lineplot(x=x,y=y, color='black', ax=axes[0])
sns.lineplot(x=x,y=y, color='black', ax=axes[1])

axes[0].set_frame_on(False)
axes[1].set_frame_on(False)
axes[0].get_yaxis().set_visible(False)
axes[1].get_yaxis().set_visible(False)
axes[0].get_xaxis().set_visible(False)
axes[1].get_xaxis().set_visible(False)


axes[0].axhline(y=0, color='black')
axes[0].axvline(x=mu, color='black', linestyle='--')

axes[1].axhline(y=0, color='black')
axes[1].axvline(x=mu + sigma, color='black', linestyle='--')

axes[0].hlines(y=23.6, xmin = mu-sigma, xmax = mu, color='black')
axes[1].hlines(y=23.6, xmin = mu-sigma, xmax = mu, color='black')


axes[0].text(mu,42, r'$\mu = \mu_0$', size=20, ha="center")
axes[1].text(mu + sigma, 42, r'$\mu \neq \mu_0$', size=20, ha="center")

axes[0].text(mu-sigma - 0.2, 23.6, r'$\sigma = ??$', size=20, ha="right")
axes[1].text(mu-sigma - 0.2, 23.6, r'$\sigma = ??$', size=20, ha="right")

📌 Degrees of freedom

  ✔  df  :  the total sample size (N) minus 1

  ✔  When df is small, the t distribution has heavier tails than the normal distribution; as df grows, it approaches the normal distribution

  ✔  In other words, the larger the sample, the more accurately the population parameters are estimated

mu = 0
variance = 1
sigma = np.sqrt(variance)


x = np.linspace(-4, 4, 100)
y_norm = stats.norm.pdf(x, mu, sigma)


fig, axes = plt.subplots(1, 2, figsize=(15, 5))


# t-distribution with 2 degrees of freedom
y_t = stats.t.pdf(x, 2)
sns.lineplot(x = x, y = y_norm, color = 'black', linestyle='--', ax = axes[0])
sns.lineplot(x = x, y = y_t, color = 'black', ax = axes[0])

# t-distribution with 10 degrees of freedom
y_t = stats.t.pdf(x, 10)
sns.lineplot(x = x, y = y_norm, color = 'black', linestyle='--', ax = axes[1])
sns.lineplot(x = x, y = y_t, color = 'black', ax = axes[1])

axes[0].text(0, 0.42, r'$df = 2$', size=20, ha="center")
axes[1].text(0, 0.42, r'$df = 10$', size=20, ha="center")


# sns.despine() removes the axes spines (borders)
# See https://hleecaster.com/python-seaborn-set-style-and-context/ for reference
axes[0].get_yaxis().set_visible(False)
axes[1].get_yaxis().set_visible(False)
axes[0].set_frame_on(False)
axes[1].set_frame_on(False)

plt.show()
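The plots compare the shapes of the curves; the same convergence can be read off numerically. A small sketch (the chosen df values are arbitrary) that prints the two-sided 95% critical value of the t distribution next to the normal one:

```python
from scipy import stats

# Two-sided 95% critical value of the standard normal distribution.
z_crit = stats.norm.ppf(0.975)

# Same quantile of the t distribution for a few (arbitrarily chosen) degrees of freedom.
t_crits = {dof: stats.t.ppf(0.975, dof) for dof in (2, 10, 30, 1000)}

for dof, t_crit in t_crits.items():
    print(f"df = {dof:>4}: t critical = {t_crit:.3f}  (normal: {z_crit:.3f})")
```

The critical value falls toward the normal value as df increases, so small samples need a larger t statistic to reach significance.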

🙌 Example

  ✔  Premise : the national average score of university students is 67.5  /  what about the average of 20 students from our university?

  ✔  Null hypothesis : the average score of our students does not differ from the national average

  ✔  Alternative hypothesis : the average score of our students differs from the national average

📌 Computing the t statistic and p-value

stats.ttest_1samp(data, popmean=population mean) : the one sample t-test function

from scipy.stats import ttest_1samp
t, p = ttest_1samp(a = df['grades'], popmean = 67.5)
t, p

# (2.25471286700693, 0.03614521878144544)

📌 Computing the sample mean and degrees of freedom

N = len(df['grades'])
degfree = N-1
sample_mean = statistics.mean(df['grades'])
print('Sample mean:', sample_mean)
print('Degrees of freedom:', degfree)

# Sample mean: 72.3
# Degrees of freedom: 19
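The t statistic and p-value reported above can be reproduced by hand from the summary values in this post (sample mean 72.3, sample standard deviation 9.52, N = 20). A minimal sketch; the variable names are chosen here for illustration:

```python
import math
from scipy import stats

# Summary values reported in this post for the 'grades' column.
sample_mean = 72.3
sample_sd = 9.520614752375915
n = 20
popmean = 67.5  # hypothesized population mean

standard_error = sample_sd / math.sqrt(n)           # SE of the mean
t_stat = (sample_mean - popmean) / standard_error   # one sample t statistic
dof = n - 1
p_value = 2 * stats.t.sf(abs(t_stat), dof)          # two-sided p-value

print(f"t({dof}) = {t_stat:.4f}, p = {p_value:.4f}")
```

This should agree with the ttest_1samp() output up to rounding, since the function uses the same formula internally on the raw data.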

📌 Computing the confidence interval

from scipy import stats

confidence_level = 0.95
degrees_freedom = len(df['grades'])-1
sample_mean = statistics.mean(df['grades'])
sample_standard_error = stats.sem(df['grades'])

confidence_interval = stats.t.interval(confidence_level, degrees_freedom, sample_mean, sample_standard_error)
confidence_interval

# (67.84421513791415, 76.75578486208585)
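What stats.t.interval() does here can be written out as mean ± critical t value × standard error. A sketch using the summary values reported in this post (variable names are illustrative):

```python
import math
from scipy import stats

# Summary values reported in this post.
sample_mean = 72.3
sample_sd = 9.520614752375915
n = 20

standard_error = sample_sd / math.sqrt(n)
t_crit = stats.t.ppf(0.975, n - 1)   # two-sided 95% critical value, df = 19
margin = t_crit * standard_error

ci = (sample_mean - margin, sample_mean + margin)
print(ci)
```

The interval does not contain 67.5, which matches the p-value below .05 from the t-test.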

👉  Our students' mean score of 72.3 is slightly higher than the national mean of 67.5 (t(19) = 2.25, p < .05, 95% confidence interval [67.8, 76.8])

 

📌 pingouin library

 

Installation: pingouin 0.5.3 documentation (pingouin-stats.org)

# install the pingouin library
!pip install --upgrade pingouin

# t-test
from pingouin import ttest
ttest(df['grades'], 67.5)

✋ Comparing two means

  ✔  Tests whether two groups share the same population mean

  ✔  If the assumptions are met : Independent Sample T-test

  ✔  If the assumptions are not met : Mann-Whitney Test
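Both options are available in scipy.stats. A minimal sketch on hypothetical grades (the data below are made up for illustration and are not the harpo.csv values):

```python
from scipy import stats

# Hypothetical grades for two independent groups (illustration only).
group_a = [65, 74, 73, 83, 76, 65, 86, 70, 80, 55]
group_b = [72, 66, 71, 66, 76, 69, 79, 73, 62, 69]

# Independent samples t-test: compares the means, relies on the assumptions.
t_res = stats.ttest_ind(group_a, group_b)

# Mann-Whitney U test: rank-based alternative when the assumptions fail.
u_res = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print("t-test      :", t_res.statistic, t_res.pvalue)
print("Mann-Whitney:", u_res.statistic, u_res.pvalue)
```

The two tests answer slightly different questions (means vs. ranks), so their p-values need not match.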

📌 Load Data

  ✔  Grades of the students in each tutor's class

  ✔  Anastasia(N=15) & Bernadette(N=18)

# load data
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/ethanweed/pythonbook/main/Data/harpo.csv")
df.head()

# data summary
import numpy as np 
df.groupby('tutor')['grade'].agg(['mean', 'std', 'count']).reset_index()

📌 Visualizing the two groups

  ✔  Histplot

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
Anastasia = pd.DataFrame(df.loc[df['tutor'] == 'Anastasia']['grade'])
Bernadette = pd.DataFrame(df.loc[df['tutor'] == 'Bernadette']['grade'])

sns.histplot(Anastasia['grade'], ax = axes[0], binwidth = 5)
sns.histplot(Bernadette['grade'], ax = axes[1], binwidth = 5)

axes[0].set_xlim(50,100)
axes[1].set_xlim(50,100)

axes[0].set_ylim(0,7)
axes[1].set_ylim(0,7)

axes[0].set_title('Anastasia')
axes[1].set_title('Bernadette')

sns.despine()
plt.show()

👉 Anastasia's grades are distributed higher than Bernadette's

  ✔  Pointplot

# pointplot() shows each group's mean and its confidence interval
sns.pointplot(x = 'tutor', y = 'grade', data = df)
sns.despine()

📌 Two tests

  ✔  Student's T-test

     - Parametric test  :  applies strict assumptions, including equal variances in both groups

  ✔  Welch's T-test

     - Also a parametric test, but more flexible  :  does not assume equal variances
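In scipy the two tests differ only in the equal_var argument of ttest_ind(). A sketch on hypothetical data with unequal group sizes and clearly unequal spreads (values made up for illustration):

```python
from scipy import stats

# Hypothetical groups: different sizes and very different variances.
group_a = [74, 80, 65, 90, 71, 68, 77, 85]
group_b = [69, 70, 68, 71, 72, 67, 70, 69, 71, 70, 68, 72]

student = stats.ttest_ind(group_a, group_b, equal_var=True)   # pooled variance
welch = stats.ttest_ind(group_a, group_b, equal_var=False)    # separate variances

print("Student:", student.statistic, student.pvalue)
print("Welch  :", welch.statistic, welch.pvalue)
```

With data like these, where the smaller group has the larger variance, pooling understates the standard error, so Student's test reports a smaller p-value than Welch's.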

📌 Assumptions

  ✔  Independent observations

     - The observations in the two groups are unrelated to each other

  ✔  Normality

     - The data follow a normal distribution. With a sample size of N > 30 this matters less, since the sampling distribution of the mean is then approximately normal

  ✔  Homogeneity of variance

     - Checked with Levene's test
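These assumptions can be checked before running the t-test. A minimal sketch on hypothetical data; Levene's test is the one named above, while the Shapiro-Wilk test is an assumption of this example (the post does not say which normality check to use):

```python
from scipy import stats

# Hypothetical grades for two groups (illustration only).
group_a = [65, 74, 73, 83, 76, 65, 86, 70, 80, 55]
group_b = [72, 66, 71, 66, 76, 69, 79, 73, 62, 69]

# Normality: Shapiro-Wilk per group (H0: the data come from a normal distribution).
shapiro_a = stats.shapiro(group_a)
shapiro_b = stats.shapiro(group_b)

# Homogeneity of variance: Levene's test (H0: the groups have equal variances).
levene = stats.levene(group_a, group_b)

print("Shapiro A:", shapiro_a.pvalue)
print("Shapiro B:", shapiro_b.pvalue)
print("Levene   :", levene.pvalue)
```

A small p-value in either check is evidence against the corresponding assumption; if Levene's test rejects, Welch's t-test is the safer choice.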

📌 Hypotheses

  ✔  Null hypothesis  :  the means of the two groups are equal

  ✔  Alternative hypothesis  :  the means of the two groups are not equal

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as stats

mu1 = 0
sigma = 1
mu2 = 2

x1 = np.linspace(mu1 - 4*sigma, mu1 + 4*sigma, 100)
y1 = 100* stats.norm.pdf(x1, mu1, sigma)
x2 = np.linspace(mu2 - 4*sigma, mu2 + 4*sigma, 100)
y2 = 100* stats.norm.pdf(x2, mu2, sigma)


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))


sns.lineplot(x=x1,y=y1, color='black', ax = ax1)

sns.lineplot(x=x1,y=y1, color='black', ax = ax2)
sns.lineplot(x=x2,y=y2, color='black', ax = ax2)

ax1.text(0, 43, 'null hypothesis', size=20, ha="center")
ax2.text(0, 43, 'alternative hypothesis', size=20, ha="center")

ax1.set_frame_on(False)
ax2.set_frame_on(False)
ax1.get_yaxis().set_visible(False)
ax2.get_yaxis().set_visible(False)
ax1.get_xaxis().set_visible(False)
ax2.get_xaxis().set_visible(False)
ax1.axhline(y=0, color='black')
ax2.axhline(y=0, color='black')

plt.show()

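👉 If the null hypothesis is true, the difference between the two group means is 0

👉 But an observed difference in sample means can arise from sampling error alone, so the t statistic divides the difference by its standard error

👉 For Student's t-test, the standard error is computed from the pooled standard deviation of the two groups and their sample sizes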
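For reference, the usual textbook form for Student's t-test is written out below. The symbols (group means, per-group standard deviation estimates, group sizes N1 and N2, and the subscript p for "pooled") are notation chosen here, not taken from the post:

```latex
\hat{\sigma}_p^2 = \frac{(N_1 - 1)\,\hat{\sigma}_1^2 + (N_2 - 1)\,\hat{\sigma}_2^2}{N_1 + N_2 - 2},
\qquad
\mathrm{SE}(\bar{X}_1 - \bar{X}_2) = \hat{\sigma}_p \sqrt{\frac{1}{N_1} + \frac{1}{N_2}},
\qquad
t = \frac{\bar{X}_1 - \bar{X}_2}{\mathrm{SE}(\bar{X}_1 - \bar{X}_2)}
```

The statistic is compared against a t distribution with N1 + N2 - 2 degrees of freedom.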

 

📌 Data processing

  ✔  Convert the table from Long format data  →  Wide format data

wide_df = pd.pivot(df, columns = 'tutor', values = 'grade')
wide_df.head()

👉 NaN (Not a Number) marks the missing values

     These would normally need separate handling, but here ttest() takes care of them (it drops missing values automatically)
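Why the NaN values appear is easier to see on a tiny table. A sketch with a hypothetical five-row long table (not the harpo.csv data): each row belongs to one tutor only, so the other tutor's column is empty in that row.

```python
import pandas as pd

# Hypothetical long format data: one row per student (illustration only).
long_df = pd.DataFrame({
    "tutor": ["Anastasia", "Bernadette", "Anastasia", "Bernadette", "Bernadette"],
    "grade": [65, 72, 74, 66, 71],
})

# Wide format: one column per tutor, original row index kept.
wide_df = long_df.pivot(columns="tutor", values="grade")
print(wide_df)

# Manual handling of the gaps, if a function does not drop NaN by itself.
anastasia = wide_df["Anastasia"].dropna()
bernadette = wide_df["Bernadette"].dropna()
print(len(anastasia), len(bernadette))
```

Dropping NaN per column recovers the original group sizes, which is the same cleanup a NaN-aware test function performs internally.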

 

📌 t-test by pingouin library

from pingouin import ttest
ttest(wide_df['Anastasia'], wide_df['Bernadette'], correction = False)  # Independent t-test

wide_df.describe()

📒 Results report

  ✔  Mean grade in Anastasia's class = 74.5 (std dev = 9.0)

  ✔  Mean grade in Bernadette's class = 69.1 (std dev = 5.8)

  ✔  The independent samples t-test shows that the difference of about 5.4 points is statistically significant.
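As a rough check, the Student's t statistic can be recomputed by hand from the rounded summary values above and the group sizes N=15 and N=18. Because the inputs are rounded, the result is only approximate and may differ slightly from the ttest() output:

```python
import math

# Rounded summary values from the report above.
mean_a, sd_a, n_a = 74.5, 9.0, 15   # Anastasia
mean_b, sd_b, n_b = 69.1, 5.8, 18   # Bernadette

dof = n_a + n_b - 2
# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / dof
# Standard error of the difference between the two means.
standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_stat = (mean_a - mean_b) / standard_error

print(f"t({dof}) = {t_stat:.2f}")
```

A t value a little above 2 with 31 degrees of freedom sits just past the two-sided 5% critical value, consistent with the conclusion that the difference is significant.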

 
