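Student Alcohol Consumption Analysis

  • Summary: the results are largely what one would expect.
  • Male students who live far from school and often go out with friends are more likely to drink.
  • Students who drink on workdays also tend to drink on weekends, and vice versa (unsurprisingly).
  • Students who are frequently absent also show some association with drinking, but it is weak.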

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

  • school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
  • sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
  • age - student’s age (numeric: from 15 to 22)
  • address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
  • famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
  • Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
  • Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
  • Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
  • Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  • reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
  • guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
  • traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
  • studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  • failures - number of past class failures (numeric: n if 1<=n<3, else 4)
  • schoolsup - extra educational support (binary: yes or no)
  • famsup - family educational support (binary: yes or no)
  • paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  • activities - extra-curricular activities (binary: yes or no)
  • nursery - attended nursery school (binary: yes or no)
  • higher - wants to take higher education (binary: yes or no)
  • internet - Internet access at home (binary: yes or no)
  • romantic - with a romantic relationship (binary: yes or no)
  • famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  • freetime - free time after school (numeric: from 1 - very low to 5 - very high)
  • goout - going out with friends (numeric: from 1 - very low to 5 - very high)
  • Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  • Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
  • health - current health status (numeric: from 1 - very bad to 5 - very good)
  • absences - number of school absences (numeric: from 0 to 93)
  • G1 - first period grade (numeric: from 0 to 20)
  • G2 - second period grade (numeric: from 0 to 20)
  • G3 - final grade (numeric: from 0 to 20, output target)
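
Several of the numeric attributes above are ordinal codes rather than measurements. A minimal sketch of how the codes could be turned back into readable labels is given below; the label text is copied from the attribute list, while the `toy` rows and the `*_LABELS` names are hypothetical and only serve as illustration.

```python
import pandas as pd

# Labels copied from the attribute list above.
TRAVELTIME_LABELS = {1: '<15 min', 2: '15 to 30 min', 3: '30 min to 1 hour', 4: '>1 hour'}
STUDYTIME_LABELS = {1: '<2 hours', 2: '2 to 5 hours', 3: '5 to 10 hours', 4: '>10 hours'}

# Hypothetical toy rows; the real values come from the two CSV files.
toy = pd.DataFrame({'traveltime': [1, 2, 4], 'studytime': [2, 3, 1]})

# Series.map looks every code up in the dictionary.
toy['traveltime_label'] = toy['traveltime'].map(TRAVELTIME_LABELS)
toy['studytime_label'] = toy['studytime'].map(STUDYTIME_LABELS)
print(toy)
```

Keeping the numeric codes for the correlation analysis and using the labels only for display avoids treating the labels as extra nominal variables.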
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
%matplotlib inline
df1 = pd.read_csv("data/student-mat.csv") # Math
df2 = pd.read_csv("data/student-por.csv") # Portuguese
df1['class'] = 'math'
df2['class'] = 'por'
df = pd.concat([df1, df2])
df.T.iloc[:,1:5]
1 2 3 4
school GP GP GP GP
sex F F F F
age 17 15 15 16
address U U U U
famsize GT3 LE3 GT3 GT3
Pstatus T T T T
Medu 1 1 4 3
Fedu 1 1 2 3
Mjob at_home at_home health other
Fjob other other services other
reason course other home home
guardian father mother mother father
traveltime 1 1 1 1
studytime 2 2 3 2
failures 0 3 0 0
schoolsup no yes no no
famsup yes no yes yes
paid no yes yes yes
activities no no yes no
nursery no yes yes yes
higher yes yes yes yes
internet yes yes yes no
romantic no no yes no
famrel 5 4 3 4
freetime 3 3 2 3
goout 3 2 2 2
Dalc 1 2 1 1
Walc 1 3 1 2
health 3 3 5 5
absences 4 10 2 4
G1 5 7 15 6
G2 5 8 14 10
G3 6 10 15 10
class math math math math
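
Stacking the two files keeps each file's own row labels, so the labels 0, 1, 2, ... occur twice in the combined frame. The sketch below illustrates this on hypothetical two-row frames (`math_toy` and `por_toy` are made-up names) and shows `ignore_index=True` as one way to get unique labels, should that be wanted.

```python
import pandas as pd

# Hypothetical stand-ins for the two course files.
math_toy = pd.DataFrame({'age': [18, 17]})
por_toy = pd.DataFrame({'age': [15, 16]})
math_toy['class'] = 'math'
por_toy['class'] = 'por'

# Default behaviour: each frame keeps its own labels (0, 1, 0, 1).
stacked = pd.concat([math_toy, por_toy])

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([math_toy, por_toy], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Duplicate labels are harmless for counting and correlation, but label-based lookups such as `df.loc[0]` then return more than one row.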

Exploratory Analysis

print(df['class'].value_counts())
sns.catplot(x='class', kind='count', data=df)
por     649
math    395
Name: class, dtype: int64

print(df['sex'].value_counts())
sns.catplot(x='sex', kind='count', data=df)
F    591
M    453
Name: sex, dtype: int64

print(df['age'].value_counts())
sns.catplot(x='age', kind='count', data=df)
16    281
17    277
18    222
15    194
19     56
20      9
21      3
22      2
Name: age, dtype: int64

print(df['school'].value_counts())
sns.catplot(x='school', kind='count', data=df)
GP    772
MS    272
Name: school, dtype: int64

  • There is probably a way to lay these plots out in a grid, as with gridExtra in R; still need to look into it.
sns.catplot(x='famsize', kind='count', data=df)
sns.catplot(x='Pstatus', kind='count', data=df)
sns.catplot(x='Medu', kind='count', data=df)
sns.catplot(x='Fedu', kind='count', data=df)
sns.catplot(x='Mjob', kind='count', data=df)
sns.catplot(x='Fjob', kind='count', data=df)
sns.catplot(x='reason', kind='count', data=df)
sns.catplot(x='guardian', kind='count', data=df)
sns.catplot(x='traveltime', kind='count', data=df)
sns.catplot(x='studytime', kind='count', data=df)
sns.catplot(x='failures', kind='count', data=df)
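
On the open question of a grid layout: one possible approach, sketched here under assumptions, is to create the grid with `plt.subplots` and draw each count plot with the axes-level `sns.countplot`, which accepts an `ax` argument. The helper name `plot_count_grid` and the toy rows are hypothetical; with the real data the frame `df` would be passed instead.

```python
import math

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_count_grid(data, columns, ncols=3):
    """Draw one count plot per column on a shared grid of axes."""
    nrows = math.ceil(len(columns) / ncols)
    fig, axes = plt.subplots(nrows, ncols,
                             figsize=(4 * ncols, 3 * nrows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, columns):
        sns.countplot(x=col, data=data, ax=ax)
        ax.set_title(col)
    # Hide the cells of the grid that were not needed.
    for ax in flat_axes[len(columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes


# Hypothetical toy rows using a few of the column names from the dataset.
toy = pd.DataFrame({
    'famsize': ['GT3', 'LE3', 'GT3', 'GT3'],
    'Pstatus': ['T', 'T', 'A', 'T'],
    'Medu': [1, 1, 4, 3],
    'guardian': ['father', 'mother', 'mother', 'other'],
})
fig, axes = plot_count_grid(toy, ['famsize', 'Pstatus', 'Medu', 'guardian'])
```

The figure-level plotting functions create their own figure on every call, which is why an axes-level function is used for the grid.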

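Finding Correlations with Alcohol

  • Split the variables into continuous and nominal ones, and convert the nominal ones to dummy variables.
  • A correlation cannot be computed for columns that are not numeric.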
columns = df.columns
discrete = []
continuous = []
for i in columns:
    if df[i].dtype =='object':
        discrete.append(i)
    else:
        continuous.append(i)
print(discrete)
['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'class']
print(continuous)
['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'G3']
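
The same split can also be written without an explicit loop. The following is a sketch of an alternative, not the code used for the outputs here: `select_dtypes` with `'number'` picks the numeric columns and everything else is treated as nominal. The toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows with two nominal and two numeric columns.
toy = pd.DataFrame({
    'school': ['GP', 'MS'],
    'sex': ['F', 'M'],
    'age': [15, 16],
    'Dalc': [1, 2],
})

# Numeric columns count as "continuous", all others as "discrete".
continuous = toy.select_dtypes(include='number').columns.tolist()
discrete = toy.select_dtypes(exclude='number').columns.tolist()

print(discrete)
print(continuous)
```

Selecting on "numeric or not" does not depend on how the text columns happen to be stored internally.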
dummy = pd.get_dummies(df[discrete])
dummy.head(3)
school_GP school_MS sex_F sex_M address_R address_U famsize_GT3 famsize_LE3 Pstatus_A Pstatus_T ... nursery_no nursery_yes higher_no higher_yes internet_no internet_yes romantic_no romantic_yes class_math class_por
0 1 0 1 0 0 1 1 0 1 0 ... 0 1 0 1 1 0 1 0 1 0
1 1 0 1 0 0 1 1 0 0 1 ... 1 0 0 1 0 1 1 0 1 0
2 1 0 1 0 0 1 0 1 0 1 ... 0 1 0 1 0 1 1 0 1 0

3 rows × 45 columns

  • Combine the numeric columns with the dummy columns.
X = pd.concat([df[continuous], dummy], axis=1)
X.head()
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc ... nursery_no nursery_yes higher_no higher_yes internet_no internet_yes romantic_no romantic_yes class_math class_por
0 18 4 4 2 2 0 4 3 4 1 ... 0 1 0 1 1 0 1 0 1 0
1 17 1 1 1 2 0 5 3 3 1 ... 1 0 0 1 0 1 1 0 1 0
2 15 1 1 1 2 3 4 3 2 2 ... 0 1 0 1 0 1 1 0 1 0
3 15 4 2 1 3 0 3 2 2 1 ... 0 1 0 1 0 1 0 1 1 0
4 16 3 3 1 2 0 4 3 2 1 ... 0 1 0 1 1 0 1 0 1 0

5 rows × 61 columns

corr = X.corr()
corr
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc ... nursery_no nursery_yes higher_no higher_yes internet_no internet_yes romantic_no romantic_yes class_math class_por
age 1.000000 -0.130196 -0.138521 0.049216 -0.007870 0.282364 0.007162 0.002645 0.118510 0.133453 ... 0.046846 -0.046846 0.244601 -0.244601 0.033229 -0.033229 -0.173800 0.173800 -0.018790 0.018790
Medu -0.130196 1.000000 0.642063 -0.238181 0.090616 -0.187769 0.015004 0.001054 0.025614 0.001515 ... -0.149287 0.149287 -0.206551 0.206551 -0.249728 0.249728 0.008685 -0.008685 0.101246 -0.101246
Fedu -0.138521 0.642063 1.000000 -0.196328 0.033458 -0.191390 0.013066 0.002142 0.030075 -0.000165 ... -0.104681 0.104681 -0.191956 0.191956 -0.170012 0.170012 0.039906 -0.039906 0.094795 -0.094795
traveltime 0.049216 -0.238181 -0.196328 1.000000 -0.081328 0.087177 -0.012578 -0.007403 0.049740 0.109423 ... 0.018641 -0.018641 0.081857 -0.081857 0.169485 -0.169485 -0.013603 0.013603 -0.079881 0.079881
studytime -0.007870 0.090616 0.033458 -0.081328 1.000000 -0.152024 0.012324 -0.094429 -0.072941 -0.159665 ... -0.056817 0.056817 -0.186556 0.186556 -0.049695 0.049695 -0.038435 0.038435 0.060934 -0.060934
failures 0.282364 -0.187769 -0.191390 0.087177 -0.152024 1.000000 -0.053676 0.102679 0.074683 0.116336 ... 0.083027 -0.083027 0.284893 -0.284893 0.074263 -0.074263 -0.076042 0.076042 0.083043 -0.083043
famrel 0.007162 0.015004 0.013066 -0.012578 0.012324 -0.053676 1.000000 0.136901 0.080619 -0.076483 ... -0.024599 0.024599 -0.041502 0.041502 -0.065972 0.065972 0.051891 -0.051891 0.007091 -0.007091
freetime 0.002645 0.001054 0.002142 -0.007403 -0.094429 0.102679 0.136901 1.000000 0.323556 0.144979 ... 0.013837 -0.013837 0.086824 -0.086824 -0.061016 0.061016 -0.012372 0.012372 0.025949 -0.025949
goout 0.118510 0.025614 0.030075 0.049740 -0.072941 0.074683 0.080619 0.323556 1.000000 0.253135 ... -0.013779 0.013779 0.062837 -0.062837 -0.083766 0.083766 -0.003606 0.003606 -0.032011 0.032011
Dalc 0.133453 0.001515 -0.000165 0.109423 -0.159665 0.116336 -0.076483 0.144979 0.253135 1.000000 ... 0.080647 -0.080647 0.112964 -0.112964 -0.039511 0.039511 -0.045311 0.045311 -0.011335 0.011335
Walc 0.098291 -0.029331 0.019524 0.084292 -0.229073 0.107432 -0.100663 0.130377 0.399794 0.627814 ... 0.084874 -0.084874 0.087271 -0.087271 -0.043615 0.043615 0.016426 -0.016426 0.004043 -0.004043
health -0.029129 -0.013254 0.034288 -0.029002 -0.063044 0.048311 0.104101 0.081517 -0.013736 0.065515 ... 0.005869 -0.005869 -0.008036 0.008036 0.041685 -0.041685 0.002096 -0.002096 0.006205 -0.006205
absences 0.153196 0.059708 0.040829 -0.022669 -0.075594 0.099998 -0.062171 -0.032079 0.056142 0.132867 ... 0.010842 -0.010842 0.072556 -0.072556 -0.090652 0.090652 -0.105323 0.105323 0.160125 -0.160125
G1 -0.124121 0.226101 0.195898 -0.121053 0.211314 -0.374175 0.036947 -0.051985 -0.101163 -0.150943 ... -0.047878 0.047878 -0.271476 0.271476 -0.104772 0.104772 0.055869 -0.055869 -0.079727 0.079727
G2 -0.119475 0.224662 0.182634 -0.140163 0.183167 -0.377172 0.042054 -0.068952 -0.108411 -0.131576 ... -0.052818 0.052818 -0.250619 0.250619 -0.122517 0.122517 0.097719 -0.097719 -0.126459 0.126459
G3 -0.125282 0.201472 0.159796 -0.102627 0.161629 -0.383145 0.054461 -0.064890 -0.097877 -0.129642 ... -0.039950 0.039950 -0.236578 0.236578 -0.107064 0.107064 0.098363 -0.098363 -0.187166 0.187166
school_GP -0.169938 0.235114 0.187611 -0.258834 0.133255 -0.066856 0.036359 -0.026008 -0.037000 -0.066006 ... -0.019349 0.019349 -0.131382 0.131382 -0.222993 0.222993 0.074506 -0.074506 0.256088 -0.256088
school_MS 0.169938 -0.235114 -0.187611 0.258834 -0.133255 0.066856 -0.036359 0.026008 0.037000 0.066006 ... 0.019349 -0.019349 0.131382 -0.131382 0.222993 -0.222993 -0.074506 0.074506 -0.256088 0.256088
sex_F 0.038832 -0.109387 -0.070786 -0.042508 0.239972 -0.065543 -0.074725 -0.181603 -0.062530 -0.275928 ... -0.030492 0.030492 -0.078775 0.078775 0.062671 -0.062671 -0.108944 0.108944 -0.062192 0.062192
sex_M -0.038832 0.109387 0.070786 0.042508 -0.239972 0.065543 0.074725 0.181603 0.062530 0.275928 ... 0.030492 -0.030492 0.078775 -0.078775 -0.062671 0.062671 0.108944 -0.108944 0.062192 -0.062192
address_R 0.071257 -0.179720 -0.124303 0.343803 -0.037480 0.061160 0.016801 0.009744 -0.030790 0.064030 ... 0.031946 -0.031946 0.074716 -0.074716 0.194790 -0.194790 -0.021209 0.021209 -0.087916 0.087916
address_U -0.071257 0.179720 0.124303 -0.343803 0.037480 -0.061160 -0.016801 -0.009744 0.030790 -0.064030 ... -0.031946 0.031946 -0.074716 0.074716 -0.194790 0.194790 0.021209 -0.021209 0.087916 -0.087916
famsize_GT3 -0.013290 0.025556 0.047290 -0.031550 0.035109 0.044589 0.005328 0.007249 -0.005889 -0.075646 ... 0.101279 -0.101279 0.000650 -0.000650 0.008315 -0.008315 -0.007656 0.007656 0.007705 -0.007705
famsize_LE3 0.013290 -0.025556 -0.047290 0.031550 -0.035109 -0.044589 -0.005328 -0.007249 0.005889 0.075646 ... -0.101279 0.101279 -0.000650 0.000650 -0.008315 0.008315 0.007656 -0.007656 -0.007705 0.007705
Pstatus_A -0.006887 0.077133 0.049156 -0.033883 -0.005049 0.004615 -0.042448 -0.038714 -0.020498 -0.015777 ... -0.054016 0.054016 0.007339 -0.007339 0.065260 -0.065260 -0.050021 0.050021 -0.029497 0.029497
Pstatus_T 0.006887 -0.077133 -0.049156 0.033883 0.005049 -0.004615 0.042448 0.038714 0.020498 0.015777 ... 0.054016 -0.054016 -0.007339 0.007339 -0.065260 0.065260 0.050021 -0.050021 0.029497 -0.029497
Mjob_at_home 0.089702 -0.387814 -0.188731 0.170171 -0.018424 0.070264 -0.017289 -0.047825 -0.036958 -0.015903 ... 0.025619 -0.025619 0.153985 -0.153985 0.240790 -0.240790 -0.036321 0.036321 -0.073121 0.073121
Mjob_health -0.093470 0.258135 0.133393 -0.106540 -0.015221 -0.025398 -0.040978 -0.015520 0.046969 -0.076301 ... -0.048189 0.048189 -0.089128 0.089128 -0.088132 0.088132 -0.021277 0.021277 0.021842 -0.021842
Mjob_other 0.037066 -0.231026 -0.200426 0.038616 -0.007451 -0.001451 0.003394 -0.017702 0.006338 -0.004774 ... 0.089281 -0.089281 0.021075 -0.021075 0.063474 -0.063474 -0.042049 0.042049 -0.040494 0.040494
Mjob_services -0.024883 0.104984 0.079390 -0.068560 0.019401 0.058457 0.044812 0.017525 0.031040 0.044716 ... -0.027609 0.027609 -0.035714 0.035714 -0.127412 0.127412 0.056836 -0.056836 0.059108 -0.059108
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Fjob_at_home 0.065349 -0.091603 -0.084975 -0.052228 0.004087 0.028483 -0.069595 0.045318 -0.023500 -0.029547 ... -0.054813 0.054813 0.068422 -0.068422 0.071043 -0.071043 -0.033594 0.033594 -0.028896 0.028896
Fjob_health -0.106505 0.128323 0.202267 -0.090635 0.107722 -0.036386 0.003336 -0.039445 0.006843 -0.017665 ... -0.051856 0.051856 -0.044062 0.044062 0.005809 -0.005809 0.016175 -0.016175 0.025293 -0.025293
Fjob_other 0.038894 -0.115679 -0.230861 0.099122 -0.038541 0.007676 0.017535 0.038416 0.043242 -0.060645 ... 0.043820 -0.043820 0.001482 -0.001482 0.021934 -0.021934 0.018271 -0.018271 -0.015745 0.015745
Fjob_services 0.001709 -0.019372 0.024698 -0.031258 0.011951 0.031904 0.045152 -0.051199 -0.021470 0.097602 ... 0.008235 -0.008235 0.008462 -0.008462 -0.066757 0.066757 0.003417 -0.003417 0.002293 -0.002293
Fjob_teacher -0.061390 0.260111 0.348978 -0.021649 -0.033607 -0.073646 -0.054509 0.003558 -0.031480 -0.013600 ... -0.010030 0.010030 -0.050270 0.050270 0.004782 -0.004782 -0.024030 0.024030 0.036023 -0.036023
reason_course 0.018524 -0.116806 -0.059851 0.128033 -0.084553 0.098883 -0.005015 0.082123 0.028489 -0.033164 ... 0.033645 -0.033645 0.086021 -0.086021 0.103707 -0.103707 -0.004853 0.004853 -0.070995 0.070995
reason_home -0.002368 0.024313 0.011945 -0.112132 -0.019542 -0.021017 -0.017715 -0.064393 -0.012108 0.045040 ... 0.007495 -0.007495 -0.055619 0.055619 -0.063627 0.063627 -0.010746 0.010746 0.052131 -0.052131
reason_other 0.006563 -0.022861 -0.025451 0.040928 -0.097277 -0.007442 0.003139 -0.008310 -0.002354 0.133297 ... 0.002981 -0.002981 0.076511 -0.076511 0.043032 -0.043032 -0.050078 0.050078 -0.031532 0.031532
reason_reputation -0.023719 0.126800 0.075322 -0.063705 0.187202 -0.087729 0.021509 -0.023762 -0.018991 -0.102682 ... -0.048640 0.048640 -0.097860 0.097860 -0.086240 0.086240 0.052339 -0.052339 0.051832 -0.051832
guardian_father -0.126978 -0.043620 0.094286 0.024526 0.011457 -0.059589 0.008734 -0.032711 -0.064810 0.034565 ... 0.018996 -0.018996 -0.022041 0.022041 -0.025185 0.025185 0.049031 -0.049031 -0.009065 0.009065
guardian_mother -0.081701 0.097703 -0.046298 -0.061961 -0.020958 -0.090476 0.003844 0.003161 0.056714 -0.077368 ... -0.087220 0.087220 -0.052720 0.052720 0.024057 -0.024057 0.024851 -0.024851 -0.010492 0.010492
guardian_other 0.357601 -0.103730 -0.072834 0.070983 0.018770 0.261738 -0.021398 0.048511 0.005225 0.082103 ... 0.125651 -0.125651 0.131501 -0.131501 -0.001605 0.001605 -0.126020 0.126020 0.033924 -0.033924
schoolsup_no 0.202824 0.023618 -0.032450 0.033940 -0.070598 -0.002483 0.007634 0.026126 0.051227 0.025852 ... 0.028795 -0.028795 0.077115 -0.077115 -0.016827 0.016827 -0.089979 0.089979 -0.037141 0.037141
schoolsup_yes -0.202824 -0.023618 0.032450 -0.033940 0.070598 0.002483 -0.007634 -0.026126 -0.051227 -0.025852 ... -0.028795 0.028795 -0.077115 0.077115 0.016827 -0.016827 0.089979 -0.089979 0.037141 -0.037141
famsup_no 0.116904 -0.143063 -0.153342 0.026117 -0.143858 0.027574 -0.002261 -0.006227 -0.005252 0.022275 ... 0.039921 -0.039921 0.088449 -0.088449 0.082522 -0.082522 -0.009997 0.009997 0.000590 -0.000590
famsup_yes -0.116904 0.143063 0.153342 -0.026117 0.143858 -0.027574 0.002261 0.006227 0.005252 -0.022275 ... -0.039921 0.039921 -0.088449 0.088449 -0.082522 0.082522 0.009997 -0.009997 -0.000590 0.000590
paid_no 0.027917 -0.161349 -0.118897 0.083679 -0.105704 0.036389 -0.015404 0.034747 0.012943 -0.041919 ... 0.053074 -0.053074 0.124097 -0.124097 0.114189 -0.114189 -0.020512 0.020512 -0.473453 0.473453
paid_yes -0.027917 0.161349 0.118897 -0.083679 0.105704 -0.036389 0.015404 -0.034747 -0.012943 0.041919 ... -0.053074 0.053074 -0.124097 0.124097 -0.114189 0.114189 0.020512 -0.020512 0.473453 -0.473453
activities_no 0.073648 -0.116924 -0.093800 0.025834 -0.078847 0.027500 -0.051574 -0.128601 -0.072236 0.010584 ... 0.025370 -0.025370 0.061667 -0.061667 0.072016 -0.072016 0.042559 -0.042559 -0.022794 0.022794
activities_yes -0.073648 0.116924 0.093800 -0.025834 0.078847 -0.027500 0.051574 0.128601 0.072236 -0.010584 ... -0.025370 0.025370 -0.061667 0.061667 -0.072016 0.072016 -0.042559 0.042559 0.022794 -0.022794
nursery_no 0.046846 -0.149287 -0.104681 0.018641 -0.056817 0.083027 -0.024599 0.013837 -0.013779 0.080647 ... 1.000000 -1.000000 0.044429 -0.044429 -0.002605 0.002605 -0.003646 0.003646 0.009498 -0.009498
nursery_yes -0.046846 0.149287 0.104681 -0.018641 0.056817 -0.083027 0.024599 -0.013837 0.013779 -0.080647 ... -1.000000 1.000000 -0.044429 0.044429 0.002605 -0.002605 0.003646 -0.003646 -0.009498 0.009498
higher_no 0.244601 -0.206551 -0.191956 0.081857 -0.186556 0.284893 -0.041502 0.086824 0.062837 0.112964 ... 0.044429 -0.044429 1.000000 -1.000000 0.063407 -0.063407 -0.103002 0.103002 -0.096707 0.096707
higher_yes -0.244601 0.206551 0.191956 -0.081857 0.186556 -0.284893 0.041502 -0.086824 -0.062837 -0.112964 ... -0.044429 0.044429 -1.000000 1.000000 -0.063407 0.063407 0.103002 -0.103002 0.096707 -0.096707
internet_no 0.033229 -0.249728 -0.170012 0.169485 -0.049695 0.074263 -0.065972 -0.061016 -0.083766 -0.039511 ... -0.002605 0.002605 0.063407 -0.063407 1.000000 -1.000000 0.049882 -0.049882 -0.078377 0.078377
internet_yes -0.033229 0.249728 0.170012 -0.169485 0.049695 -0.074263 0.065972 0.061016 0.083766 0.039511 ... 0.002605 -0.002605 -0.063407 0.063407 -1.000000 1.000000 -0.049882 0.049882 0.078377 -0.078377
romantic_no -0.173800 0.008685 0.039906 -0.013603 -0.038435 -0.076042 0.051891 -0.012372 -0.003606 -0.045311 ... -0.003646 0.003646 -0.103002 0.103002 0.049882 -0.049882 1.000000 -1.000000 0.034534 -0.034534
romantic_yes 0.173800 -0.008685 -0.039906 0.013603 0.038435 0.076042 -0.051891 0.012372 0.003606 0.045311 ... 0.003646 -0.003646 0.103002 -0.103002 -0.049882 0.049882 -1.000000 1.000000 -0.034534 0.034534
class_math -0.018790 0.101246 0.094795 -0.079881 0.060934 0.083043 0.007091 0.025949 -0.032011 -0.011335 ... 0.009498 -0.009498 -0.096707 0.096707 -0.078377 0.078377 0.034534 -0.034534 1.000000 -1.000000
class_por 0.018790 -0.101246 -0.094795 0.079881 -0.060934 -0.083043 -0.007091 -0.025949 0.032011 0.011335 ... -0.009498 0.009498 0.096707 -0.096707 0.078377 -0.078377 -0.034534 0.034534 -1.000000 1.000000

61 rows × 61 columns

Alcohol Consumption Level

  • Workday alcohol consumption (Dalc)
  • Weekend alcohol consumption (Walc)
pd.DataFrame({'Walc':corr['Walc'], 'Dalc':corr['Dalc']}).T
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc ... nursery_no nursery_yes higher_no higher_yes internet_no internet_yes romantic_no romantic_yes class_math class_por
Dalc 0.133453 0.001515 -0.000165 0.109423 -0.159665 0.116336 -0.076483 0.144979 0.253135 1.000000 ... 0.080647 -0.080647 0.112964 -0.112964 -0.039511 0.039511 -0.045311 0.045311 -0.011335 0.011335
Walc 0.098291 -0.029331 0.019524 0.084292 -0.229073 0.107432 -0.100663 0.130377 0.399794 0.627814 ... 0.084874 -0.084874 0.087271 -0.087271 -0.043615 0.043615 0.016426 -0.016426 0.004043 -0.004043

2 rows × 61 columns

Rel_Dalc = corr['Dalc']
Rel_Walc = corr['Walc']
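  • Keep only the variables whose correlation coefficient is larger than 0.1 in absolute value. The absolute value is used so that variables with a negative association (the higher the variable, the less the student drinks) are found as well.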
Rel_Dalc[abs(Rel_Dalc)>0.1]
age                  0.133453
traveltime           0.109423
studytime           -0.159665
failures             0.116336
freetime             0.144979
goout                0.253135
Dalc                 1.000000
Walc                 0.627814
absences             0.132867
G1                  -0.150943
G2                  -0.131576
G3                  -0.129642
sex_F               -0.275928
sex_M                0.275928
reason_other         0.133297
reason_reputation   -0.102682
higher_no            0.112964
higher_yes          -0.112964
Name: Dalc, dtype: float64
Rel_Walc[abs(Rel_Walc)>0.1]
studytime   -0.229073
failures     0.107432
famrel      -0.100663
freetime     0.130377
goout        0.399794
Dalc         0.627814
Walc         1.000000
health       0.106669
absences     0.139703
G1          -0.142401
G2          -0.128114
G3          -0.115740
sex_F       -0.302623
sex_M        0.302623
Name: Walc, dtype: float64
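
To illustrate why the absolute value matters, here is a small sketch on hypothetical toy data. The helper `strongly_correlated` is an illustrative name; unlike the cells above, it also drops the target's correlation with itself.

```python
import pandas as pd


def strongly_correlated(frame, target, threshold=0.1):
    """Return the correlations with `target` whose magnitude exceeds `threshold`."""
    corr = frame.corr()[target].drop(target)
    return corr[corr.abs() > threshold]


# Hypothetical values: goout rises with Dalc, studytime falls, noise is unrelated.
toy = pd.DataFrame({
    'Dalc': [1, 2, 3, 4, 5],
    'goout': [2, 4, 6, 8, 10],
    'studytime': [5, 4, 3, 2, 1],
    'noise': [1, -1, 0, -1, 1],
})

selected = strongly_correlated(toy, 'Dalc')
print(selected)
```

A filter of the form `corr > 0.1` without the absolute value would keep `goout` but silently discard `studytime`.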
  • Extract the index names of the selected variables.
Rel_Col_Dal = Rel_Dalc[abs(Rel_Dalc)>0.1].index.tolist()
Rel_Col_Wal = Rel_Walc[abs(Rel_Walc)>0.1].index.tolist()
Dal_df = X[Rel_Col_Dal]
Wal_df = X[Rel_Col_Wal]
Dal_df.head()
age traveltime studytime failures freetime goout Dalc Walc absences G1 G2 G3 sex_F sex_M reason_other reason_reputation higher_no higher_yes
0 18 2 2 0 3 4 1 1 6 5 6 6 1 0 0 0 0 1
1 17 1 2 0 3 3 1 1 4 5 5 6 1 0 0 0 0 1
2 15 1 2 3 3 2 2 3 10 7 8 10 1 0 1 0 0 1
3 15 1 3 0 2 2 1 1 2 15 14 15 1 0 0 0 0 1
4 16 1 2 0 3 2 1 2 4 6 10 10 1 0 0 0 0 1
Wal_df.head()
studytime failures famrel freetime goout Dalc Walc health absences G1 G2 G3 sex_F sex_M
0 2 0 4 3 4 1 1 3 6 5 6 6 1 0
1 2 0 5 3 3 1 1 3 4 5 5 6 1 0
2 2 3 4 3 2 2 3 3 10 7 8 10 1 0
3 3 0 3 2 2 1 1 5 2 15 14 15 1 0
4 2 0 4 3 2 1 2 5 4 6 10 10 1 0
Dal_corr = Dal_df.corr()
Wal_corr = Wal_df.corr()

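Conclusions

  • Male students drink more than female students (correlation of sex_M with Dalc: 0.28, with Walc: 0.30).
  • Unsurprisingly, students who go out with friends more often are more exposed to alcohol and drink more (goout: 0.25 with Dalc, 0.40 with Walc).
  • Students with more free time are more exposed for the same reason, although the association is weaker (freetime: 0.14 with Dalc, 0.13 with Walc).
  • Workday and weekend consumption are, as expected, the most strongly correlated pair (0.63).
  • Students with many absences also show a slight correlation (0.13 with Dalc, 0.14 with Walc), but it is not conclusive.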
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(Dal_corr,
            xticklabels=Dal_corr.columns.values,
            yticklabels=Dal_corr.columns.values,
            annot=True, linewidths=.5, ax=ax)

fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(Wal_corr,
            xticklabels=Wal_corr.columns.values,
            yticklabels=Wal_corr.columns.values,
            annot=True, linewidths=.5, ax=ax)

sns.boxplot(x='goout',y='Walc',data=df)

sns.boxplot(x='goout',y='Dalc',data=df)

  • Female students report much lower alcohol consumption than male students.
sns.catplot(x='Walc', kind='count', hue='sex', data=df)
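
Because the data contains more female (591) than male (453) rows, raw counts per consumption level mix group size with drinking behaviour. A sketch of two numeric summaries that avoid this, on hypothetical toy rows, could look as follows; with the real data `df` would replace `toy`.

```python
import pandas as pd

# Hypothetical toy rows; the real frame has the same column names.
toy = pd.DataFrame({
    'sex': ['F', 'F', 'M', 'M'],
    'Dalc': [1, 1, 2, 3],
    'Walc': [1, 2, 3, 4],
})

# Mean consumption level per sex.
means = toy.groupby('sex')[['Dalc', 'Walc']].mean()

# Share of each Walc level within each sex (every row sums to 1).
shares = pd.crosstab(toy['sex'], toy['Walc'], normalize='index')

print(means)
print(shares)
```

Dalc and Walc are ordinal codes, so the means are only a rough summary; the within-group shares are the safer comparison.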

Machine Learning Prediction

X[['Walc','Dalc']].head()
Walc Dalc
0 1 1
1 1 1
2 3 2
3 1 1
4 2 1
X['Alcohol'] = X['Walc'] + X['Dalc']
X['Alcohol'].head()
0    2
1    2
2    5
3    2
4    3
Name: Alcohol, dtype: int64
ml_df = X.copy()
ml_df['Alcohol'].value_counts()
2     391
3     182
4     159
5     118
6      85
7      49
8      26
10     24
9      10
Name: Alcohol, dtype: int64
from sklearn.utils import shuffle
ml_df = shuffle(ml_df)
ml_df = ml_df.reset_index(drop=True)
ml_df_columns = ml_df.columns.difference(['Walc','Dalc','Alcohol'])
ml_df_columns
Index(['Fedu', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services',
       'Fjob_teacher', 'G1', 'G2', 'G3', 'Medu', 'Mjob_at_home', 'Mjob_health',
       'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Pstatus_A', 'Pstatus_T',
       'absences', 'activities_no', 'activities_yes', 'address_R', 'address_U',
       'age', 'class_math', 'class_por', 'failures', 'famrel', 'famsize_GT3',
       'famsize_LE3', 'famsup_no', 'famsup_yes', 'freetime', 'goout',
       'guardian_father', 'guardian_mother', 'guardian_other', 'health',
       'higher_no', 'higher_yes', 'internet_no', 'internet_yes', 'nursery_no',
       'nursery_yes', 'paid_no', 'paid_yes', 'reason_course', 'reason_home',
       'reason_other', 'reason_reputation', 'romantic_no', 'romantic_yes',
       'school_GP', 'school_MS', 'schoolsup_no', 'schoolsup_yes', 'sex_F',
       'sex_M', 'studytime', 'traveltime'],
      dtype='object')
X = ml_df[ml_df_columns]
y = ml_df['Alcohol']
print(ml_df.columns)
ml_df.head()
Index(['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
       'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2',
       'G3', 'school_GP', 'school_MS', 'sex_F', 'sex_M', 'address_R',
       'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T',
       'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services',
       'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other',
       'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home',
       'reason_other', 'reason_reputation', 'guardian_father',
       'guardian_mother', 'guardian_other', 'schoolsup_no', 'schoolsup_yes',
       'famsup_no', 'famsup_yes', 'paid_no', 'paid_yes', 'activities_no',
       'activities_yes', 'nursery_no', 'nursery_yes', 'higher_no',
       'higher_yes', 'internet_no', 'internet_yes', 'romantic_no',
       'romantic_yes', 'class_math', 'class_por', 'Alcohol'],
      dtype='object')
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc ... nursery_yes higher_no higher_yes internet_no internet_yes romantic_no romantic_yes class_math class_por Alcohol
0 16 4 4 1 2 0 4 4 2 1 ... 1 0 1 0 1 0 1 1 0 2
1 16 1 2 1 2 0 4 4 3 1 ... 1 0 1 0 1 1 0 0 1 2
2 16 4 4 1 2 0 4 2 4 2 ... 1 0 1 0 1 1 0 1 0 6
3 17 4 4 1 1 0 5 2 3 1 ... 1 0 1 0 1 1 0 0 1 3
4 19 2 3 1 3 1 4 1 2 1 ... 1 0 1 0 1 0 1 1 0 2

5 rows × 62 columns

y[:5]
0    2
1    2
2    6
3    3
4    2
Name: Alcohol, dtype: int64

Train/Test Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X_train.head()
Fedu Fjob_at_home Fjob_health Fjob_other Fjob_services Fjob_teacher G1 G2 G3 Medu ... romantic_no romantic_yes school_GP school_MS schoolsup_no schoolsup_yes sex_F sex_M studytime traveltime
830 3 0 0 1 0 0 14 14 15 4 ... 1 0 1 0 1 0 1 0 2 1
172 2 0 0 1 0 0 15 14 15 3 ... 0 1 0 1 1 0 1 0 2 2
1005 3 0 0 1 0 0 12 11 12 1 ... 0 1 0 1 1 0 0 1 1 2
1013 3 0 0 1 0 0 11 10 10 2 ... 1 0 1 0 1 0 1 0 3 1
397 3 0 0 0 1 0 10 10 11 3 ... 0 1 1 0 1 0 0 1 2 1

5 rows × 59 columns

lm.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lm.intercept_ # intercept
-552496061212.00085
lm.coef_ # coefficients (slopes)
array([  1.10681983e-01,   7.62085702e+11,   7.62085702e+11,
         7.62085702e+11,   7.62085702e+11,   7.62085702e+11,
        -1.16823836e-01,   6.91050842e-02,  -3.45437923e-03,
        -1.71508789e-02,  -1.42382326e+10,  -1.42382326e+10,
        -1.42382326e+10,  -1.42382326e+10,  -1.42382326e+10,
        -2.79352456e+09,  -2.79352456e+09,   4.22363281e-02,
        -1.04961259e+10,  -1.04961259e+10,   2.35242131e+09,
         2.35242131e+09,   1.11968994e-01,  -4.36248354e+09,
        -4.36248354e+09,   7.52563477e-02,  -3.29772949e-01,
        -1.10872374e+08,  -1.10872374e+08,   1.12212080e+10,
         1.12212080e+10,   8.78906250e-03,   5.69213867e-01,
        -7.05294877e+09,  -7.05294877e+09,  -7.05294877e+09,
         1.04492188e-01,   8.30723019e+09,   8.30723019e+09,
         5.57260644e+09,   5.57260644e+09,  -6.15471185e+08,
        -6.15471186e+08,   4.73224834e+09,   4.73224834e+09,
        -8.98184669e+10,  -8.98184669e+10,  -8.98184669e+10,
        -8.98184669e+10,   4.35991140e+10,   4.35991140e+10,
        -1.55858529e+11,  -1.55858529e+11,   1.15810618e+10,
         1.15810618e+10,  -1.16088761e+10,  -1.16088761e+10,
        -1.85943604e-01,   1.70471191e-01])
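
The coefficients of magnitude 1e8 to 1e11 above, repeated identically within each dummy group (the five Fjob_* columns all show 7.62e+11, for example), are a typical sign of perfectly collinear inputs: every complete set of dummy columns sums to 1 and therefore duplicates the intercept. The sketch below is not part of the original analysis; it uses hypothetical toy rows and an illustrative helper `design_matrix` to show that dropping one level per variable with `drop_first=True` removes the redundancy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with two nominal columns and one numeric column.
toy = pd.DataFrame({
    'sex': ['F', 'M', 'F', 'M', 'F', 'M'],
    'school': ['GP', 'GP', 'MS', 'MS', 'GP', 'MS'],
    'age': [15, 16, 17, 15, 16, 18],
})


def design_matrix(frame, drop_first):
    """Intercept + numeric column + dummy columns, as a float DataFrame."""
    dummies = pd.get_dummies(frame[['sex', 'school']],
                             drop_first=drop_first, dtype=float)
    features = pd.concat([frame[['age']].astype(float), dummies], axis=1)
    features.insert(0, 'intercept', 1.0)
    return features


full = design_matrix(toy, drop_first=False)     # sex_F, sex_M, school_GP, school_MS
reduced = design_matrix(toy, drop_first=True)   # sex_M, school_MS only

rank_full = np.linalg.matrix_rank(full.to_numpy())
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())

print(full.shape[1], rank_full)        # more columns than rank: redundant
print(reduced.shape[1], rank_reduced)  # columns equal rank: no redundancy
```

With redundant columns the least-squares solution is not unique, so individual coefficients are not interpretable even though the predictions, and hence the mean squared errors reported next, can still be reasonable.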
from sklearn.metrics import mean_squared_error
mean_squared_error(y_train,lm.predict(X_train))
2.659151246163943
mean_squared_error(y_test,lm.predict(X_test))
2.4619246322163351
compare_alcohol = pd.DataFrame({'prediction':lm.predict(X_test),'real_value':y_test})
compare_alcohol['prediction'] = compare_alcohol['prediction'].round()
compare_alcohol['diff'] = compare_alcohol['prediction'] - compare_alcohol['real_value']
compare_alcohol['diff'].value_counts()
 1.0    83
 0.0    72
-1.0    54
 2.0    49
-2.0    30
-3.0     8
 3.0     8
 4.0     3
-5.0     3
-4.0     3
 6.0     1
Name: diff, dtype: int64
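
The table of differences can be condensed into a few summary numbers. The sketch below copies the counts from the output above and the test MSE from the earlier cell; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above:
# difference (rounded prediction - real value) -> number of test rows.
diff_counts = {1: 83, 0: 72, -1: 54, 2: 49, -2: 30, -3: 8,
               3: 8, 4: 3, -5: 3, -4: 3, 6: 1}

total = sum(diff_counts.values())

# Share of test rows predicted exactly, and within one point of the real value.
exact = diff_counts[0] / total
within_one = sum(n for d, n in diff_counts.items() if abs(d) <= 1) / total

# Mean absolute error of the rounded predictions.
mae_rounded = sum(abs(d) * n for d, n in diff_counts.items()) / total

# Root mean squared error, from the test MSE reported above.
rmse_test = 2.4619246322163351 ** 0.5

print(total, round(exact, 3), round(within_one, 3),
      round(mae_rounded, 3), round(rmse_test, 2))
```

On a target that ranges from 2 to 10, about 23% of the rounded predictions are exact and about two thirds fall within one point.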

Solving It as a Classification Problem

from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
gb = GradientBoostingClassifier(n_estimators=3000)
gb.fit(X_train,y_train)
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=3000, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
def getResult(y_test,y_pred):
    # Print the confusion matrix (rows: first argument, columns: second) and the accuracy.
    print(metrics.confusion_matrix(y_test, y_pred))
    print('accuracy:', metrics.accuracy_score(y_test, y_pred))
gb.predict(X_test)
array([ 3,  3,  2,  2,  3,  3,  2,  5,  3,  4,  2,  4,  3,  2,  7,  2,  5,
        2,  5,  2,  2,  2,  4,  2,  2,  6,  2,  5,  2,  6,  5,  2,  2,  7,
        5,  2,  2,  2,  6,  2,  3,  4,  2,  5,  2,  5,  2,  3,  6, 10,  4,
        3,  4, 10,  2,  2,  2,  4,  4,  2,  2,  2,  2,  2,  2,  3,  2,  4,
        4,  4,  3,  2,  3,  3,  2,  2,  2,  2,  6,  2,  5,  2,  5,  3,  3,
        6,  2,  2,  3,  7,  6,  5,  5,  4,  2,  2,  3,  3,  2,  4,  6,  4,
        4,  5,  2,  5,  3,  2,  3,  2,  2,  3,  3,  4,  6,  5,  5,  6,  2,
        2,  8,  2,  6,  5,  2,  3,  2,  6,  2,  5,  2,  2,  3,  6,  4,  3,
        2,  3,  3,  2,  2,  4,  2,  2,  4,  2,  4,  5,  3,  2,  2,  2,  3,
        2,  4,  8,  2,  2,  2,  5,  2,  5,  2,  2,  2,  2,  3,  6,  3,  7,
        6,  2,  2,  3,  2,  8,  4,  2,  2,  2,  4,  2,  6,  6,  2,  2,  2,
        2,  2,  2,  2,  2,  5,  2,  3,  6,  2,  6,  2,  3,  2,  2,  2,  2,
        4,  4,  3,  2,  4,  3,  2,  7,  4,  5,  6,  2,  2,  4,  3,  4,  6,
        3,  2,  7,  3,  4,  5,  2,  2,  3,  2,  6,  4,  3,  6,  3,  2,  2,
        3,  4,  2,  2,  2,  5,  7,  7,  4,  5,  2,  4,  2,  2,  2,  3,  4,
        3,  2,  3,  2,  3,  2,  4,  2,  2,  6,  2,  5,  7,  2,  3,  2,  2,
        4,  5,  2,  2,  5,  4,  3,  3,  2,  2,  3,  4,  7,  2,  3,  2,  3,
        2,  2,  2,  2,  3,  4,  2,  4,  3,  3,  4,  2,  3,  5,  2,  3,  2,
        2,  2,  2,  4,  2,  2,  2,  7], dtype=int64)
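# Note: the arguments are passed as (predictions, true labels), so the rows of the
# matrix below are predicted labels and the columns are true labels; accuracy is unaffected.
getResult(gb.predict(X_test),y_test)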
[[86 22 14 14  5  2  0  0  0]
 [14 18 12  9  2  3  0  0  0]
 [ 6  7 25  2  2  0  0  0  1]
 [ 2  3  1 16  2  2  2  0  2]
 [ 3  2  0  2  9  6  1  1  0]
 [ 3  2  0  0  1  5  0  0  0]
 [ 1  0  0  0  0  1  1  0  0]
 [ 0  0  0  0  0  0  0  0  0]
 [ 0  0  1  0  0  0  0  0  1]]
accuracy: 0.512738853503
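
To put the accuracy in context, the confusion matrix can be summarised further. The sketch below copies the matrix from the output above. Since `getResult` is defined as `(y_test, y_pred)` but called with the predictions first, rows are taken to be predicted labels and columns true labels, with the labels assumed to run from 2 to 10; the variable names are illustrative.

```python
import numpy as np

# Matrix copied from the output above (rows: predicted 2..10, columns: true 2..10).
cm = np.array([
    [86, 22, 14, 14, 5, 2, 0, 0, 0],
    [14, 18, 12,  9, 2, 3, 0, 0, 0],
    [ 6,  7, 25,  2, 2, 0, 0, 0, 1],
    [ 2,  3,  1, 16, 2, 2, 2, 0, 2],
    [ 3,  2,  0,  2, 9, 6, 1, 1, 0],
    [ 3,  2,  0,  0, 1, 5, 0, 0, 0],
    [ 1,  0,  0,  0, 0, 1, 1, 0, 0],
    [ 0,  0,  0,  0, 0, 0, 0, 0, 0],
    [ 0,  0,  1,  0, 0, 0, 0, 0, 1],
])

total = cm.sum()

# Exact matches sit on the diagonal.
accuracy = np.trace(cm) / total

# Predictions that miss the true label by at most one level.
positions = np.arange(cm.shape[0])
offsets = np.abs(np.subtract.outer(positions, positions))
within_one = cm[offsets <= 1].sum() / total

# Baseline: always predict the most frequent true label (column sums = true counts).
majority_baseline = cm.sum(axis=0).max() / total

print(round(accuracy, 3), round(within_one, 3), round(majority_baseline, 3))
```

Always predicting the most common level (2) would already be right for roughly 37% of the test rows, so the classifier's 51% is a moderate improvement, and about 74% of its predictions land within one level of the true value.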