
[kaggle] Kaggle Notebook Transcription - Parkinson's Disease Progression Prediction 1

개발자열무 2023. 4. 27. 10:37

✋ Kaggle Competition  -  AMP®-Parkinson's Disease Progression Prediction

https://www.kaggle.com/competitions/amp-parkinsons-disease-progression-prediction

 


✔  The Goal of Competition  -  Predict the MDS-UPDRS scores, which measure the progression of Parkinson's disease in patients

✔  Develop a model trained on protein and peptide level data collected over time

     from Parkinson's disease patients and age-matched normal controls

 

✋ Copying into a New Kaggle Notebook

✔  Copying Notebook  :  Simple Linear model with only clinical data

# List every file under the read-only input directory for datasets
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
/kaggle/input/amp-parkinsons-disease-progression-prediction/train_proteins.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/train_clinical_data.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/public_timeseries_testing_util.py
/kaggle/input/amp-parkinsons-disease-progression-prediction/supplemental_clinical_data.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/train_peptides.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/amp_pd_peptide/competition.cpython-37m-x86_64-linux-gnu.so
/kaggle/input/amp-parkinsons-disease-progression-prediction/amp_pd_peptide/__init__.py
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/sample_submission.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/test_proteins.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/test_peptides.csv
/kaggle/input/amp-parkinsons-disease-progression-prediction/example_test_files/test.csv

✔ Loading Libraries

# Loading libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
import tqdm                      # progress bar for checking the status of long-running tasks
import re                        # standard library for regular-expression processing
from itertools import product    # builds an iterator over the Cartesian product, for efficient looping
from functools import reduce     # higher-order function that folds an iterable into one value

import warnings   # warning control
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Display floats with only 3 decimal places
pd.set_option('display.float_format',lambda x: '%.3f' % x)

✔ Reading the Datasets

# Reading the datasets
protein_data = pd.read_csv('/kaggle/input/amp-parkinsons-disease-progression-prediction/train_proteins.csv')
peptides_data = pd.read_csv('/kaggle/input/amp-parkinsons-disease-progression-prediction/train_peptides.csv')
target_data = pd.read_csv('/kaggle/input/amp-parkinsons-disease-progression-prediction/train_clinical_data.csv')
sup_target_data = pd.read_csv('/kaggle/input/amp-parkinsons-disease-progression-prediction/supplemental_clinical_data.csv')
protein_data.shape, peptides_data.shape, target_data.shape

## ((232741, 5), (981834, 6), (2615, 8))

✔ target data + sup_target_data (Merging)

👉 Only the clinical data (train_clinical_data = target_data) will be used for modelling,

      so target_data and sup_target_data are merged

# Merging target data and sup_target_data since we will be using only clinical data for modelling
target_data = pd.concat([target_data,sup_target_data],axis = 0).reset_index(drop = True)  # merge (stack the rows)
target_data = target_data[target_data.visit_month != 5].copy()  # drop the visit_month == 5 rows, treated as uninformative

# Check the number of unique values
target_data.shape, target_data.visit_id.nunique(), target_data.patient_id.nunique(), target_data.visit_month.nunique()

## ((4720, 8), 4720, 1019, 17)
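
To make the merge-and-filter step concrete, here is a minimal sketch on two toy frames. The frame names `clinical` and `supplemental` and all of their values are invented for illustration; only the `pd.concat` / `reset_index` / `visit_month != 5` pattern comes from the notebook.

```python
import pandas as pd

# Toy stand-ins for train_clinical_data and supplemental_clinical_data (values are made up)
clinical = pd.DataFrame({"visit_id": ["1_0", "1_6"], "patient_id": [1, 1], "visit_month": [0, 6]})
supplemental = pd.DataFrame({"visit_id": ["2_0", "2_5"], "patient_id": [2, 2], "visit_month": [0, 5]})

# Stack the rows of both frames, then rebuild a clean 0..n-1 index
merged = pd.concat([clinical, supplemental], axis=0).reset_index(drop=True)

# Drop the visit_month == 5 rows, mirroring the filter in the notebook
merged = merged[merged.visit_month != 5].copy()

print(merged.shape, merged.visit_id.nunique())
```

Without `reset_index(drop=True)` the stacked frame would keep the original row labels of both inputs, so the same label could appear twice.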

 

✔ Filling missing values in target_data

👉 Missing values in the upd23b_clinical_state_on_medication column are filled with 'unknown'

# Fill missing values
# Replace missing values in upd23b_clinical_state_on_medication with 'unknown'
target_data['upd23b_clinical_state_on_medication'] = target_data['upd23b_clinical_state_on_medication'].fillna('unknown')
# Check the remaining missing values in target_data
target_data.isna().sum()

# visit_id                                  0
# patient_id                                0
# visit_month                               0
# updrs_1                                  96
# updrs_2                                  98
# updrs_3                                  30
# updrs_4                                1863
# upd23b_clinical_state_on_medication       0
# dtype: int64

 

✔ Setting the category columns

# Category columns
id_cols = ['visit_id','patient_id','visit_month']
target_cols = ['updrs_1','updrs_2','updrs_3','updrs_4']
month_list  =  [0,6,12,24]

✔ Creating the create_target_features function (by target data)

def create_target_features(target_data,test=None):
    
    # With no test data, build a dataframe holding only 'patient_id', 'visit_month'
    # and 'upd23b_clinical_state_on_medication' from the target data
    if test is None:
        test1 = target_data[['patient_id','visit_month','upd23b_clinical_state_on_medication']].copy()
        test1['upd23b_clinical_state_on_medication'] = test1['upd23b_clinical_state_on_medication'].fillna('unknown')
        
    # With test data, copy it; if the 'upd23b_clinical_state_on_medication' column is missing,
    # add it filled with 'unknown', otherwise replace its missing values with 'unknown'
    else:
        test1 = test.copy()
        if 'upd23b_clinical_state_on_medication' not in test1.columns.tolist():
            test1['upd23b_clinical_state_on_medication'] = 'unknown'
        else:
            test1['upd23b_clinical_state_on_medication'] = test1['upd23b_clinical_state_on_medication'].fillna('unknown')
            
        # Keep only 'patient_id', 'visit_month', 'upd23b_clinical_state_on_medication' and drop duplicates
        test1 = test1[['patient_id','visit_month',
                       'upd23b_clinical_state_on_medication']].drop_duplicates().copy()
        
        # Store the unique patient_id / visit_month values
        patients = test1.patient_id.unique()
        visit_months = test1.visit_month.unique()
        
        # Keep the original visit_month of the test data in its own column for later comparison
        test1['visit_month_orig'] = test1['visit_month']

        # When test data is provided, expand it so that every patient covers all relevant visit months:
        # loop over the patients, copy each patient's rows once per offset (0, 6, 12, 24 months),
        # add the offset to visit_month, and concatenate the copies into the new test1
        test_data = pd.DataFrame()
        for patient in patients:
            for month in [0,6,12,24]:
                p_data = test1[test1['patient_id'] == patient].copy()
                p_data['visit_month'] = p_data['visit_month'].values + month
                test_data = pd.concat([test_data,p_data],axis=0)
        test1 = test_data.copy()
        
    # Impute missing target values with the median of each (visit_month, medication) group
    target_data1 = target_data.copy()
    tmeds = target_data1.groupby(['visit_month','upd23b_clinical_state_on_medication']
                            ,group_keys =False)[target_cols].apply(lambda x: x.fillna(x.median())).sort_index()
    target_data1 = target_data1[['patient_id','visit_month','upd23b_clinical_state_on_medication']].join(tmeds)
    
    # Create the group features
    # For each visit_month / medication grouping, aggregate the targets and count the updrs_1 rows
    all_grp_cols = ([['visit_month'],['upd23b_clinical_state_on_medication']]
                    + [['visit_month','upd23b_clinical_state_on_medication']])
    
    target_data2 = target_data1[['visit_month','upd23b_clinical_state_on_medication']].drop_duplicates()
    for grp_col in all_grp_cols:    
        temp = target_data1.groupby(grp_col)[target_cols].agg(['min','max','mean','median','sum','std'])
        temp.columns = [i+'_' + j + '_'+ '_'.join(grp_col) for i,j in temp.columns]
        target_data2 = target_data2.join(temp,on = grp_col)
        target_data2  = target_data2.join(target_data1.groupby(grp_col)['updrs_1'].count().rename('_'.join(grp_col)+'_count')
                                         , on = grp_col)
                                         
    # Add the 'med_unknown', 'med_off', 'med_on' columns
    # Each is 1 when 'upd23b_clinical_state_on_medication' is 'unknown', 'Off', 'On' respectively, else 0
    test1['med_unknown'] = (test1['upd23b_clinical_state_on_medication'] == 'unknown').astype(int)
    test1['med_off'] = (test1['upd23b_clinical_state_on_medication'] == 'Off').astype(int)
    test1['med_on'] = (test1['upd23b_clinical_state_on_medication'] == 'On').astype(int)
    
    model_data = test1.merge(target_data2,on = ['visit_month','upd23b_clinical_state_on_medication']
                             , how = 'inner').drop(columns = 'upd23b_clinical_state_on_medication')
    feature_cols = model_data.drop(columns = ['patient_id']).columns.tolist()
    
    # Fill the remaining missing values in model_data with the median of each visit_month group
    model_data = model_data.groupby('visit_month',group_keys = False).apply(lambda x: x.fillna(x.median()))
    
    # With no test data, merge the target_cols back in on 'patient_id' and 'visit_month'
    if test is None:
        model_data = model_data.merge(target_data1[['patient_id','visit_month'] + target_cols],
                                      on = ['patient_id','visit_month'],
                                      how = 'inner')

    return model_data,feature_cols
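
The group-wise median imputation inside the function is easier to follow on a tiny frame. The sketch below is an illustration under assumptions, not the notebook's exact code: the values are invented, and it uses `groupby(...).transform` in place of the notebook's `groupby(...).apply(...)` followed by `join`, which should fill the same cells for this case.

```python
import numpy as np
import pandas as pd

# Toy data: two (visit_month, medication) groups, each with one missing updrs_1 (values are made up)
df = pd.DataFrame({
    "visit_month": [0, 0, 0, 6, 6],
    "upd23b_clinical_state_on_medication": ["On", "On", "On", "Off", "Off"],
    "updrs_1": [4.0, np.nan, 8.0, 10.0, np.nan],
})
group_cols = ["visit_month", "upd23b_clinical_state_on_medication"]

# Replace each NaN with the median of its own (visit_month, medication) group
df["updrs_1"] = df.groupby(group_cols)["updrs_1"].transform(lambda s: s.fillna(s.median()))

print(df["updrs_1"].tolist())
```

`transform` returns a result aligned to the original index, so it can be assigned straight back to the column without a separate `sort_index()` and `join`.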

✔ Getting the features (by create_target_features)

# Getting the features 
model_data, feature_cols = create_target_features(target_data)
model_data.shape, len(feature_cols)

## ((4720, 84), 79)


# Check for duplicate (patient_id, visit_month) pairs
model_data[['patient_id','visit_month']].drop_duplicates().shape, model_data.shape

## ((4720, 2), (4720, 84))
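
The call above only exercises the training path (`test=None`). When a test frame is passed, `create_target_features` instead replicates every row at offsets of 0, 6, 12 and 24 months. Here is a minimal sketch of that expansion on invented data; it builds the copies with a list comprehension rather than the notebook's nested loop, so the row order differs but the set of rows should be the same.

```python
import pandas as pd

# Toy test frame: one row per patient (values are made up)
test1 = pd.DataFrame({"patient_id": [10, 20], "visit_month": [0, 6]})
test1["visit_month_orig"] = test1["visit_month"]  # remember the month the visit actually happened

# One shifted copy of the frame per offset, concatenated in a single call
offsets = [0, 6, 12, 24]
expanded = pd.concat(
    [test1.assign(visit_month=test1["visit_month"] + m) for m in offsets],
    ignore_index=True,
)

print(expanded.sort_values(["patient_id", "visit_month"]))
```

Each patient ends up with four rows, one per offset, while `visit_month_orig` still records the month of the actual visit.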

✔ Setting feature_cols

# feature_cols = ['visit_month','med_unknown','med_off','med_on','visit_month_count',
#                 'upd23b_clinical_state_on_medication_count','visit_month_upd23b_clinical_state_on_medication_count']
feature_cols = ['visit_month','visit_month_count']

 
