[statistic] Basic Statistics _ T-test

개발자열무 2023. 4. 28. 12:15

✋ T-test

✔ one sample t-test

✔ Student's t-test

✔ Welch's t-test

✔ paired sample t-test

✋ one sample t-test

✔ The population standard deviation is unknown, so it is estimated from the sample standard deviation

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/ethanweed/pythonbook/main/Data/zeppo.csv")
df.head()

import statistics
statistics.stdev(df['grades'])

# 9.520614752375915

👉 We cannot say that the population standard deviation (σ) is 9.5, but we can say that the sample standard deviation (σ^) is 9.52
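The distinction above shows up in which divisor is used. The sketch below uses a small made-up list of scores (an assumption for illustration, not the zeppo.csv data) to compare the N - 1 divisor of `statistics.stdev` with the N divisor of `statistics.pstdev`.

```python
import statistics
import numpy as np

# Hypothetical scores, used only for illustration (not the zeppo.csv data)
scores = [50, 60, 60, 64, 66, 66, 67, 69, 70, 74]

sample_sd = statistics.stdev(scores)       # divides by N - 1 (estimate of sigma)
population_sd = statistics.pstdev(scores)  # divides by N (treats data as the whole population)

# NumPy uses the N divisor by default; ddof=1 switches to N - 1
numpy_sample_sd = np.std(scores, ddof=1)

print(sample_sd, population_sd, numpy_sample_sd)
```

The N - 1 version is always the larger of the two, and it is the one a t-test uses.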

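📌 Setting up the hypotheses

✔  Null hypothesis : population mean == 100
✔  Alternative hypothesis : population mean != 100
✔  The basic premise is that the population standard deviation is unknown
✔  With the population standard deviation unknown, we use the sample mean to judge whether the population mean equals the hypothesized value or not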

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as stats

mu = 0
sigma = 1
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
y = 100* stats.norm.pdf(x, mu, sigma)

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

sns.lineplot(x=x,y=y, color='black', ax=axes[0])
sns.lineplot(x=x,y=y, color='black', ax=axes[1])

axes[0].set_frame_on(False)
axes[1].set_frame_on(False)
axes[0].get_yaxis().set_visible(False)
axes[1].get_yaxis().set_visible(False)
axes[0].get_xaxis().set_visible(False)
axes[1].get_xaxis().set_visible(False)


axes[0].axhline(y=0, color='black')
axes[0].axvline(x=mu, color='black', linestyle='--')

axes[1].axhline(y=0, color='black')
axes[1].axvline(x=mu + sigma, color='black', linestyle='--')

axes[0].hlines(y=23.6, xmin = mu-sigma, xmax = mu, color='black')
axes[1].hlines(y=23.6, xmin = mu-sigma, xmax = mu, color='black')


axes[0].text(mu,42, r'$\mu = \mu_0$', size=20, ha="center")
axes[1].text(mu + sigma, 42, r'$\mu \neq \mu_0$', size=20, ha="center")

axes[0].text(mu-sigma - 0.2, 23.6, r'$\sigma = ??$', size=20, ha="right")
axes[1].text(mu-sigma - 0.2, 23.6, r'$\sigma = ??$', size=20, ha="right")

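📌 Degrees of Freedom

✔ df : the total sample size (N) minus 1

✔ With a small df the t-distribution has heavier tails than the normal distribution; with a large df it comes close to the normal distribution

✔ In other words, the larger the sample, the more accurately the population parameter is estimated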

mu = 0
variance = 1
sigma = np.sqrt(variance)


x = np.linspace(-4, 4, 100)
y_norm = stats.norm.pdf(x, mu, sigma)


fig, axes = plt.subplots(1, 2, figsize=(15, 5))


# t-distribution with 2 degrees of freedom
y_t = stats.t.pdf(x, 2)
sns.lineplot(x = x, y = y_norm, color = 'black', linestyle='--', ax = axes[0])
sns.lineplot(x = x, y = y_t, color = 'black', ax = axes[0])

# t-distribution with 10 degrees of freedom
y_t = stats.t.pdf(x, 10)
sns.lineplot(x = x, y = y_norm, color = 'black', linestyle='--', ax = axes[1])
sns.lineplot(x = x, y = y_t, color = 'black', ax = axes[1])

axes[0].text(0, 0.42, r'$df = 2$', size=20, ha="center")
axes[1].text(0, 0.42, r'$df = 10$', size=20, ha="center")


# sns.despine() removes axis spines (borders)
# See https://hleecaster.com/python-seaborn-set-style-and-context/
axes[0].get_yaxis().set_visible(False)
axes[1].get_yaxis().set_visible(False)
axes[0].set_frame_on(False)
axes[1].set_frame_on(False)

plt.show()
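The same convergence can be read off numerically. This is a minimal sketch that compares the two-sided 95% critical value of the t-distribution with that of the standard normal for a few degrees of freedom chosen for illustration.

```python
import scipy.stats as stats

# Two-sided 95% critical value of the standard normal distribution
z_crit = stats.norm.ppf(0.975)

# The same critical value for t-distributions with increasing degrees of freedom
dofs = [2, 10, 30, 100, 1000]
t_crits = [stats.t.ppf(0.975, dof) for dof in dofs]

for dof, t_crit in zip(dofs, t_crits):
    print(f"df={dof:>4}: t critical={t_crit:.3f} (normal: {z_crit:.3f})")
```

The t critical value is always larger than the normal one and shrinks toward it as df grows, which is the heavier-tail effect visible in the plots.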

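🙌 Example

✔ Premise : the national average score of university students is 67.5 / what about the average of 20 students at our university?

✔ Null hypothesis : the average score of our students is equal to the national average

✔ Alternative hypothesis : the average score of our students differs from the national average

📌 Computing the t statistic and p-value

stats.ttest_1samp(data, popmean=population mean) : the one sample t-test function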

from scipy.stats import ttest_1samp
t, p = ttest_1samp(a = df['grades'], popmean = 67.5)
t, p

# (2.25471286700693, 0.03614521878144544)

📌 Computing the sample mean and degrees of freedom

N = len(df['grades'])
degfree = N-1
sample_mean = statistics.mean(df['grades'])
print('Sample mean:', sample_mean)
print('Degrees of freedom:', degfree)

# Sample mean: 72.3
# Degrees of freedom: 19
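As a cross-check, the t statistic can be rebuilt by hand from the numbers printed in this post. This is a sketch that plugs in the reported sample mean, sample standard deviation and degrees of freedom as constants instead of loading the CSV; the variable names are only illustrative.

```python
import math
import scipy.stats as stats

# Values printed earlier in this post
sample_mean = 72.3
sample_sd = 9.520614752375915   # statistics.stdev(df['grades'])
degfree = 19
n = degfree + 1                 # 20 students
popmean = 67.5                  # hypothesized population mean

# Standard error of the mean, then t = (sample mean - hypothesized mean) / SE
standard_error = sample_sd / math.sqrt(n)
t_manual = (sample_mean - popmean) / standard_error

# Two-sided p-value: probability of a |t| at least this large under the null
p_manual = 2 * stats.t.sf(abs(t_manual), degfree)

print(t_manual, p_manual)
```

The result should agree with the `ttest_1samp` output above up to rounding.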

📌 Computing the confidence interval

from scipy import stats

confidence_level = 0.95
degrees_freedom = len(df['grades'])-1
sample_mean = statistics.mean(df['grades'])
sample_standard_error = stats.sem(df['grades'])

confidence_interval = stats.t.interval(confidence_level, degrees_freedom, sample_mean, sample_standard_error)
confidence_interval

# (67.84421513791415, 76.75578486208585)

👉 With a mean of 72.3, our students scored significantly higher than the national average of 67.5 (t(19) = 2.25, p < .05, 95% confidence interval [67.8, 76.8])
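The interval is simply the sample mean plus or minus a t critical value times the standard error. A minimal sketch, again using the numbers reported in this post as constants:

```python
import math
import scipy.stats as stats

# Values printed earlier in this post
sample_mean = 72.3
sample_sd = 9.520614752375915
n = 20
confidence_level = 0.95

standard_error = sample_sd / math.sqrt(n)

# Critical value that leaves (1 - confidence_level) / 2 in each tail
t_crit = stats.t.ppf(1 - (1 - confidence_level) / 2, n - 1)

lower = sample_mean - t_crit * standard_error
upper = sample_mean + t_crit * standard_error
print(lower, upper)
```

This should reproduce the `stats.t.interval` result up to rounding.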

📌 pingouin library

https://pingouin-stats.org/build/html/index.html

# Install the pingouin library
!pip install --upgrade pingouin

# t-test
from pingouin import ttest
ttest(df['grades'], 67.5)

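✋ Comparing two means

✔ Tests whether two groups share the same value of a single parameter (here, the mean)

✔ If the assumptions are met : Independent Sample T-test

✔ If the assumptions are not met : Mann-Whitney Test

📌 Load Data

✔ Scores of the students in each tutor's class

✔ Anastasia(N=15) & Bernadette(N=18)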
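For the case where the assumptions are not met, a minimal sketch of the Mann-Whitney test with `scipy.stats.mannwhitneyu` is shown here. The two skewed samples are generated at random purely for illustration and are not the data used in this post.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical skewed (non-normal) samples, for illustration only
group_a = rng.exponential(scale=2.0, size=15)
group_b = rng.exponential(scale=1.0, size=18)

# Rank-based test, so it does not rely on the normality assumption
u_stat, p_value = mannwhitneyu(group_a, group_b, alternative='two-sided')
print(u_stat, p_value)
```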

# load data
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/ethanweed/pythonbook/main/Data/harpo.csv")
df.head()

# data summary
import numpy as np 
df.groupby('tutor')['grade'].agg(['mean', 'std', 'count']).reset_index()

📌 Visualizing the two groups

✔ Histplot

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
Anastasia = pd.DataFrame(df.loc[df['tutor'] == 'Anastasia']['grade'])
Bernadette = pd.DataFrame(df.loc[df['tutor'] == 'Bernadette']['grade'])

sns.histplot(Anastasia['grade'], ax = axes[0], binwidth = 5)
sns.histplot(Bernadette['grade'], ax = axes[1], binwidth = 5)

axes[0].set_xlim(50,100)
axes[1].set_xlim(50,100)

axes[0].set_ylim(0,7)
axes[1].set_ylim(0,7)

axes[0].set_title('Anastasia')
axes[1].set_title('Bernadette')

sns.despine()
plt.show()

👉 Anastasia's score distribution sits higher than Bernadette's

✔ Pointplot

# pointplot() shows the mean and its confidence interval
sns.pointplot(x = 'tutor', y = 'grade', data = df)
sns.despine()

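📌 Two tests

✔ Student's T-test

- Parametric test : applies strict assumptions, including equal variances in the two groups

✔ Welch's T-test

- Also a parametric test, but more flexible : it does not assume equal variances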
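To see how the two tests differ in practice, the sketch below feeds the rounded group summaries reported in this post (mean, std dev, N) to `scipy.stats.ttest_ind_from_stats`. Because the inputs are rounded, the output is only approximate and will not exactly match a test run on the raw data.

```python
from scipy.stats import ttest_ind_from_stats

# Rounded summaries reported in this post (mean, std dev, N)
mean_a, sd_a, n_a = 74.5, 9.0, 15   # Anastasia
mean_b, sd_b, n_b = 69.1, 5.8, 18   # Bernadette

# equal_var=True  -> Student's t-test (pooled variance)
student = ttest_ind_from_stats(mean_a, sd_a, n_a, mean_b, sd_b, n_b, equal_var=True)

# equal_var=False -> Welch's t-test (separate variances, adjusted df)
welch = ttest_ind_from_stats(mean_a, sd_a, n_a, mean_b, sd_b, n_b, equal_var=False)

print('Student:', student.statistic, student.pvalue)
print('Welch  :', welch.statistic, welch.pvalue)
```

With these inputs Welch's test gives a somewhat smaller t statistic and a larger p-value than Student's test, since it does not pool the two variances.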

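📌 Assumptions

✔ Independent observations

- The observations in the two groups have no particular relationship to each other

✔ Normality

- Each group follows a normal distribution. As a rule of thumb, this is often treated as less of a concern when the sample size N > 30

✔ Homogeneity of variance (equal variances)

- Checked with Levene's test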
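The normality and equal-variance assumptions can be checked with `scipy.stats.shapiro` and `scipy.stats.levene`. This is a sketch on randomly generated groups (an assumption for illustration, not the harpo.csv data); using the Shapiro-Wilk test for normality is also a choice made here, not something this post prescribes.

```python
import numpy as np
from scipy.stats import levene, shapiro

rng = np.random.default_rng(42)

# Hypothetical groups, for illustration only
group_a = rng.normal(loc=74, scale=9, size=15)
group_b = rng.normal(loc=69, scale=6, size=18)

# Normality: Shapiro-Wilk test per group (null: the data are normally distributed)
shapiro_a = shapiro(group_a)
shapiro_b = shapiro(group_b)

# Equal variances: Levene's test (null: the groups have equal variances)
levene_result = levene(group_a, group_b)

print('Shapiro A:', shapiro_a.pvalue)
print('Shapiro B:', shapiro_b.pvalue)
print('Levene   :', levene_result.pvalue)
```

A small p-value in either test is evidence against the corresponding assumption.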

📌 Hypotheses

✔ Null hypothesis : the two group means are equal

✔ Alternative hypothesis : the two group means are not equal

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as stats

mu1 = 0
sigma = 1
mu2 = 2

x1 = np.linspace(mu1 - 4*sigma, mu1 + 4*sigma, 100)
y1 = 100* stats.norm.pdf(x1, mu1, sigma)
x2 = np.linspace(mu2 - 4*sigma, mu2 + 4*sigma, 100)
y2 = 100* stats.norm.pdf(x2, mu2, sigma)


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))


sns.lineplot(x=x1,y=y1, color='black', ax = ax1)

sns.lineplot(x=x1,y=y1, color='black', ax = ax2)
sns.lineplot(x=x2,y=y2, color='black', ax = ax2)

ax1.text(0, 43, 'null hypothesis', size=20, ha="center")
ax2.text(0, 43, 'alternative hypothesis', size=20, ha="center")

ax1.set_frame_on(False)
ax2.set_frame_on(False)
ax1.get_yaxis().set_visible(False)
ax2.get_yaxis().set_visible(False)
ax1.get_xaxis().set_visible(False)
ax2.get_xaxis().set_visible(False)
ax1.axhline(y=0, color='black')
ax2.axhline(y=0, color='black')

plt.show()

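👉 If the null hypothesis is true, the difference between the two population means is 0

👉 In a sample the means still differ by some amount, so the t statistic divides the observed difference by its standard error

👉 For Student's t-test, the standard error of the difference is SE = σ^p × √(1/N1 + 1/N2), where σ^p is the pooled standard deviation of the two groups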
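A minimal sketch of the pooled standard error and the resulting Student's t statistic, computed by hand and compared with `scipy.stats.ttest_ind`. The two groups are randomly generated for illustration and are not the data used in this post.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Hypothetical groups, for illustration only
x1 = rng.normal(74, 9, 15)
x2 = rng.normal(69, 6, 18)
n1, n2 = len(x1), len(x2)

# Pooled variance: the two sample variances weighted by their degrees of freedom
pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)

# Standard error of the difference between the two means
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_manual = (x1.mean() - x2.mean()) / se
t_scipy = ttest_ind(x1, x2, equal_var=True).statistic

print(t_manual, t_scipy)
```

The two values should agree, since `equal_var=True` makes SciPy use the same pooled standard error.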

📌 Data preparation

✔ Reshape the table from long format to wide format

wide_df = pd.pivot(df, columns = 'tutor', values = 'grade')
wide_df.head()

👉 NaN (Not a Number) marks the cells that have no value after pivoting

They would normally have to be handled separately, but here we let ttest() take care of them
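To see where the NaN values come from, here is a sketch with a tiny made-up long-format table (hypothetical grades, not harpo.csv). Each row belongs to only one tutor, so the other tutor's column is empty for that row; `dropna()` recovers the original per-group values.

```python
import pandas as pd

# Hypothetical long-format data: one row per student
long_df = pd.DataFrame({
    'grade': [65, 72, 66, 74, 73],
    'tutor': ['Anastasia', 'Bernadette', 'Anastasia', 'Bernadette', 'Anastasia'],
})

# One column per tutor; rows keep the original index, so gaps become NaN
wide = pd.pivot(long_df, columns='tutor', values='grade')
print(wide)

# Dropping NaN per column gives back each group's grades
anastasia = wide['Anastasia'].dropna()
bernadette = wide['Bernadette'].dropna()
print(len(anastasia), len(bernadette))
```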

📌 t-test by pingouin library

from pingouin import ttest
ttest(wide_df['Anastasia'], wide_df['Bernadette'], correction = False)  # Independent t-test

wide_df.describe()

📢 Reporting the results

✔ Mean score of Anastasia's class = 74.5 (std dev = 9.0)

✔ Mean score of Bernadette's class = 69.1 (std dev = 5.8)

✔ The independent samples t-test shows that the difference of about 5.4 points is statistically significant.

License: Attribution, No Derivatives