[statistic] Basic Statistics _ T-test
• T-test
• one sample t-test
• student t-test
• welch's t-test
• paired sample t-test
• one sample t-test
• The population standard deviation is unknown, so the sample standard deviation is used to estimate it
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/ethanweed/pythonbook/main/Data/zeppo.csv")
df.head()
import statistics
statistics.stdev(df['grades'])
# 9.520614752375915
▶ We cannot claim that the population standard deviation (σ) is 9.5, but we can say that the estimate from the sample (σ^) is 9.52
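The estimate σ^ divides the squared deviations by N - 1 rather than N. A minimal sketch of that calculation, assuming a small made-up list of grades (not the zeppo.csv data), compared against statistics.stdev and NumPy:

```python
import statistics
import numpy as np

# Hypothetical grades for illustration only (NOT the zeppo.csv data)
grades = [50, 60, 60, 64, 66, 66, 67, 69, 70, 74]

n = len(grades)
mean = sum(grades) / n

# Sample standard deviation: sum of squared deviations divided by N - 1
manual_sd = (sum((g - mean) ** 2 for g in grades) / (n - 1)) ** 0.5

print(manual_sd)
print(statistics.stdev(grades))   # uses the same N - 1 estimator
print(np.std(grades, ddof=1))     # NumPy needs ddof=1 to match
print(np.std(grades))             # default ddof=0 divides by N, so it is smaller
```

Note that np.std() and statistics.stdev() disagree by default, which is an easy source of small mismatches.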
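▶ Setting up the hypotheses
• Null hypothesis: population mean == 100
• Alternative hypothesis: population mean != 100
• The basic premise is that the population standard deviation is unknown
• With the population standard deviation unknown, the test uses the sample mean to judge whether the population mean equals the hypothesized value or differs from it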
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as stats
mu = 0
sigma = 1
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
y = 100* stats.norm.pdf(x, mu, sigma)
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.lineplot(x=x,y=y, color='black', ax=axes[0])
sns.lineplot(x=x,y=y, color='black', ax=axes[1])
axes[0].set_frame_on(False)
axes[1].set_frame_on(False)
axes[0].get_yaxis().set_visible(False)
axes[1].get_yaxis().set_visible(False)
axes[0].get_xaxis().set_visible(False)
axes[1].get_xaxis().set_visible(False)
axes[0].axhline(y=0, color='black')
axes[0].axvline(x=mu, color='black', linestyle='--')
axes[1].axhline(y=0, color='black')
axes[1].axvline(x=mu + sigma, color='black', linestyle='--')
axes[0].hlines(y=23.6, xmin = mu-sigma, xmax = mu, color='black')
axes[1].hlines(y=23.6, xmin = mu-sigma, xmax = mu, color='black')
axes[0].text(mu,42, r'$\mu = \mu_0$', size=20, ha="center")
axes[1].text(mu + sigma, 42, r'$\mu \neq \mu_0$', size=20, ha="center")
axes[0].text(mu-sigma - 0.2, 23.6, r'$\sigma = ??$', size=20, ha="right")
axes[1].text(mu-sigma - 0.2, 23.6, r'$\sigma = ??$', size=20, ha="right")
▶ Degrees of freedom
• df: the sample size (N) minus 1
• When df is small, the t distribution has heavier tails than the normal distribution; when df is large, it is close to the normal distribution
• In other words, the larger the sample, the more accurately the population parameter is estimated
mu = 0
variance = 1
sigma = np.sqrt(variance)
x = np.linspace(-4, 4, 100)
y_norm = stats.norm.pdf(x, mu, sigma)
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# t-distribution with 2 degrees of freedom
y_t = stats.t.pdf(x, 2)
sns.lineplot(x = x, y = y_norm, color = 'black', linestyle='--', ax = axes[0])
sns.lineplot(x = x, y = y_t, color = 'black', ax = axes[0])
# t-distribution with 10 degrees of freedom
y_t = stats.t.pdf(x, 10)
sns.lineplot(x = x, y = y_norm, color = 'black', linestyle='--', ax = axes[1])
sns.lineplot(x = x, y = y_t, color = 'black', ax = axes[1])
axes[0].text(0, 0.42, r'$df = 2$', size=20, ha="center")
axes[1].text(0, 0.42, r'$df = 10$', size=20, ha="center")
# sns.despine() removes the axes spines (borders)
# Reference: https://hleecaster.com/python-seaborn-set-style-and-context/
axes[0].get_yaxis().set_visible(False)
axes[1].get_yaxis().set_visible(False)
axes[0].set_frame_on(False)
axes[1].set_frame_on(False)
plt.show()
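The same convergence can be checked numerically instead of visually. A small sketch, assuming a two-sided 5% test, that compares the critical value of the t distribution with the normal one for a few df values (the df values are arbitrary choices):

```python
from scipy import stats

# Two-sided 5% test: the critical value is the 97.5th percentile
z_crit = stats.norm.ppf(0.975)

t_crits = {}
for dof in (2, 10, 30, 1000):
    t_crits[dof] = stats.t.ppf(0.975, dof)
    # The gap to the normal critical value shrinks as df grows
    print(dof, round(t_crits[dof], 3), round(t_crits[dof] - z_crit, 3))
```

With few degrees of freedom the critical value is much larger, so a small sample needs a larger t statistic to reject the null hypothesis.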
▶ Example
• Premise: the national mean score of university students is 67.5. Is the mean of the 20 students at our university the same?
• Null hypothesis: the mean score of our students is equal to the national mean score
• Alternative hypothesis: the mean score of our students is different from the national mean score
▶ Computing the t statistic and p-value
stats.ttest_1samp(data, popmean=population mean) : one sample t-test function
from scipy.stats import ttest_1samp
t, p = ttest_1samp(a = df['grades'], popmean = 67.5)
t, p
# (2.25471286700693, 0.03614521878144544)
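To see where these two numbers come from, the t statistic can be rebuilt by hand from the summary statistics reported in this post (N = 20, mean 72.3, standard deviation 9.5206...). This is a sketch of the textbook one sample formula, not the internals of ttest_1samp:

```python
import numpy as np
from scipy import stats

# Summary statistics taken from the outputs shown in this post
n = 20
sample_mean = 72.3
sample_sd = 9.520614752375915
mu0 = 67.5  # hypothesized population mean

# t = (sample mean - hypothesized mean) / standard error of the mean
standard_error = sample_sd / np.sqrt(n)
t_manual = (sample_mean - mu0) / standard_error

# Two-sided p-value from the t distribution with N - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

print(t_manual, p_manual)
```

The result should agree with the ttest_1samp output above up to rounding.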
▶ Computing the sample mean and degrees of freedom
N = len(df['grades'])
degfree = N-1
sample_mean = statistics.mean(df['grades'])
print('Sample mean:', sample_mean)
print('Degrees of freedom:', degfree)
# Sample mean: 72.3
# Degrees of freedom: 19
▶ Computing the confidence interval
from scipy import stats
confidence_level = 0.95
degrees_freedom = len(df['grades'])-1
sample_mean = statistics.mean(df['grades'])
sample_standard_error = stats.sem(df['grades'])
confidence_interval = stats.t.interval(confidence_level, df=degrees_freedom, loc=sample_mean, scale=sample_standard_error)
confidence_interval
# (67.84421513791415, 76.75578486208585)
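The interval is simply the sample mean plus or minus a critical t value times the standard error. A sketch using the summary statistics reported in this post, assuming the usual two-sided 95% interval:

```python
import numpy as np
from scipy import stats

# Summary statistics taken from the outputs shown in this post
n = 20
sample_mean = 72.3
sample_sd = 9.520614752375915
confidence_level = 0.95

standard_error = sample_sd / np.sqrt(n)

# Critical value: leave (1 - confidence_level) / 2 in each tail
t_crit = stats.t.ppf(1 - (1 - confidence_level) / 2, df=n - 1)

manual_ci = (sample_mean - t_crit * standard_error,
             sample_mean + t_crit * standard_error)
print(manual_ci)
```

The hypothesized mean 67.5 falls just outside this interval, which is consistent with p < .05 in the t-test.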
▶ The mean score of our students, 72.3, is slightly higher than the national mean score of 67.5 (t(19) = 2.25, p < .05, 95% confidence interval [67.8, 76.8])
▶ pingouin library
# install the pingouin library
!pip install --upgrade pingouin
# t-test
from pingouin import ttest
ttest(df['grades'], 67.5)
• Comparing two means
• Tests whether two groups share the same value of a population parameter (here, the mean)
• If the assumptions are met: Independent Sample T-test
• If the assumptions are not met: Mann-Whitney Test
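For the case where the assumptions do not hold, a minimal sketch of the Mann-Whitney test with SciPy. The two grade lists are made up for illustration, and SciPy is used here instead of pingouin:

```python
from scipy import stats

# Hypothetical grades for two independent groups (illustration only)
group_a = [65, 72, 66, 74, 73, 71, 66, 76, 69, 79]
group_b = [72, 66, 71, 66, 76, 69, 63, 60, 68, 64, 61]

# Rank-based test: no normality assumption on the grades
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(u_stat, p_value)
```

The test compares ranks rather than means, so it answers a slightly different question than the t-test.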
▶ Load Data
• Grades of the students in each tutor's class
• Anastasia(N=15) & Bernadette(N=18)
# load data
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/ethanweed/pythonbook/main/Data/harpo.csv")
df.head()
# data summary
import numpy as np
df.groupby('tutor')['grade'].agg(['mean', 'std', 'count']).reset_index()
▶ Visualizing the two groups
• Histplot
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
Anastasia = pd.DataFrame(df.loc[df['tutor'] == 'Anastasia']['grade'])
Bernadette = pd.DataFrame(df.loc[df['tutor'] == 'Bernadette']['grade'])
sns.histplot(Anastasia['grade'], ax = axes[0], binwidth = 5)
sns.histplot(Bernadette['grade'], ax = axes[1], binwidth = 5)
axes[0].set_xlim(50,100)
axes[1].set_xlim(50,100)
axes[0].set_ylim(0,7)
axes[1].set_ylim(0,7)
axes[0].set_title('Anastasia')
axes[1].set_title('Bernadette')
sns.despine()
plt.show()
▶ Anastasia's grade distribution sits higher than Bernadette's
• Pointplot
# pointplot() - shows the mean and its confidence interval
sns.pointplot(x = 'tutor', y = 'grade', data = df)
sns.despine()
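▶ Two tests
• Student's T-test
- Parametric test: applies strict assumptions, including equal variances in the two groups
• Welch's T-test
- Also a parametric test, but more flexible: it does not assume equal variances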
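In SciPy the two tests differ only by one flag. A sketch with made-up grades (unequal group sizes and spreads on purpose), using scipy.stats.ttest_ind rather than pingouin:

```python
from scipy import stats

# Hypothetical grades (illustration only): group_a is smaller and more spread out
group_a = [70, 85, 90, 62, 78, 95, 66, 81]
group_b = [68, 71, 69, 72, 70, 67, 73, 70, 69, 71, 72, 68]

# Student's t-test: pools the two variances (assumes they are equal)
student = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test: keeps the variances separate and adjusts the degrees of freedom
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(student.statistic, student.pvalue)
print(welch.statistic, welch.pvalue)
```

When the group sizes and variances are similar the two results are nearly identical; they drift apart as the groups become more unbalanced.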
▶ Assumptions
• Independent observations
- The observations in the two groups have no particular relationship to each other
• Normality
- The data follow a normal distribution. As a rule of thumb, this matters less when the sample size is N > 30
• Homogeneity of variance
- Checked with Levene's test
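The normality and equal-variance assumptions can be checked before choosing a test. A sketch with made-up grades, assuming the Shapiro-Wilk test for normality (the post does not name a normality test) and Levene's test for equal variances:

```python
from scipy import stats

# Hypothetical grades (illustration only)
group_a = [65, 72, 66, 74, 73, 71, 66, 76, 69, 79, 83, 60]
group_b = [72, 66, 71, 66, 76, 69, 63, 60, 68, 64, 61, 70]

# Normality: Shapiro-Wilk, run separately for each group
shapiro_a_stat, shapiro_a_p = stats.shapiro(group_a)
shapiro_b_stat, shapiro_b_p = stats.shapiro(group_b)

# Homogeneity of variance: Levene's test across both groups
levene_stat, levene_p = stats.levene(group_a, group_b)

# A small p-value (e.g. < .05) suggests the assumption is violated
print(shapiro_a_p, shapiro_b_p, levene_p)
```

If Levene's test rejects equal variances, Welch's T-test is the safer choice; if normality is clearly violated in small samples, the Mann-Whitney Test is the fallback.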
▶ Hypotheses
• Null hypothesis: the two group means are equal
• Alternative hypothesis: the two group means are not equal
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import scipy.stats as stats
mu1 = 0
sigma = 1
mu2 = 2
x1 = np.linspace(mu1 - 4*sigma, mu1 + 4*sigma, 100)
y1 = 100* stats.norm.pdf(x1, mu1, sigma)
x2 = np.linspace(mu2 - 4*sigma, mu2 + 4*sigma, 100)
y2 = 100* stats.norm.pdf(x2, mu2, sigma)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
sns.lineplot(x=x1,y=y1, color='black', ax = ax1)
sns.lineplot(x=x1,y=y1, color='black', ax = ax2)
sns.lineplot(x=x2,y=y2, color='black', ax = ax2)
ax1.text(0, 43, 'null hypothesis', size=20, ha="center")
ax2.text(0, 43, 'alternative hypothesis', size=20, ha="center")
ax1.set_frame_on(False)
ax2.set_frame_on(False)
ax1.get_yaxis().set_visible(False)
ax2.get_yaxis().set_visible(False)
ax1.get_xaxis().set_visible(False)
ax2.get_xaxis().set_visible(False)
ax1.axhline(y=0, color='black')
ax2.axhline(y=0, color='black')
plt.show()
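▶ If the null hypothesis is true, the difference between the two group means is 0
▶ However, sampling error means the observed difference is rarely exactly 0, so the t statistic divides the observed difference by its standard error
▶ The standard error is computed from the standard deviations and sample sizes of the two groups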
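For Student's t-test the standard error is usually written with a pooled variance: SE = sqrt(s_p^2 * (1/N1 + 1/N2)), where s_p^2 = ((N1 - 1) * s1^2 + (N2 - 1) * s2^2) / (N1 + N2 - 2). A sketch assuming that pooled form, fed with the rounded means and standard deviations reported in this post, so the result is only approximate:

```python
import numpy as np
from scipy import stats

# Rounded summary statistics reported in this post
mean_a, sd_a, n_a = 74.5, 9.0, 15   # Anastasia
mean_b, sd_b, n_b = 69.1, 5.8, 18   # Bernadette

# Pooled variance: weighted average of the two sample variances
dof = n_a + n_b - 2
pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / dof

# Standard error of the difference between the two means
se_diff = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_manual = (mean_a - mean_b) / se_diff
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# Cross-check against SciPy's summary-statistics version of the test
check = stats.ttest_ind_from_stats(mean_a, sd_a, n_a, mean_b, sd_b, n_b, equal_var=True)
print(t_manual, p_manual, check.statistic, check.pvalue)
```

Because the inputs are rounded, the t value will differ slightly from a test run on the raw grades.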
▶ Data preparation
• Reshape the table from long format to wide format
wide_df = pd.pivot(df, columns = 'tutor', values = 'grade')
wide_df.head()
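▶ NaN (Not a Number): not a numeric value
After the pivot, each row has a grade for only one tutor, so the other column holds NaN. These would normally need separate handling, but ttest() removes them automatically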
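If the NaN values have to be handled by hand, for example when using SciPy instead of pingouin, dropping them per column is enough. A sketch with a tiny made-up long-format table (the column names match harpo.csv, the values do not):

```python
import pandas as pd
from scipy import stats

# Hypothetical long-format data (illustration only)
long_df = pd.DataFrame({
    "tutor": ["Anastasia", "Bernadette", "Anastasia", "Bernadette",
              "Anastasia", "Bernadette", "Bernadette"],
    "grade": [65, 72, 66, 74, 73, 71, 66],
})

# Long -> wide: one column per tutor, NaN where the row belongs to the other tutor
wide_df = long_df.pivot(columns="tutor", values="grade")

# Drop the NaN placeholders separately for each group
anastasia = wide_df["Anastasia"].dropna()
bernadette = wide_df["Bernadette"].dropna()

result = stats.ttest_ind(anastasia, bernadette, equal_var=True)
print(len(anastasia), len(bernadette), result.pvalue)
```

Dropping per column matters here: calling wide_df.dropna() on the whole table would remove every row, because each row has one NaN.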
▶ t-test by pingouin library
from pingouin import ttest
ttest(wide_df['Anastasia'], wide_df['Bernadette'], correction = False) # Independent t-test
wide_df.describe()
📢 Results report
• Mean grade in Anastasia's class = 74.5 (std dev = 9.0)
• Mean grade in Bernadette's class = 69.1 (std dev = 5.8)
• The independent samples t-test showed that the difference of about 5.4 points is statistically significant.