n112a Assignment

막막한 2023. 3. 16. 15:52

In [1]:

# Load the Titanic dataset bundled with the seaborn library

import numpy as np
import pandas as pd
import seaborn as sns
# Load the dataset

df=sns.load_dataset("titanic")
df

Out[1]:

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	0	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	0	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	0	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	0	2	male	27.0	0	0	13.0000	S	Second	man	True	NaN	Southampton	no	True
887	1	1	female	19.0	0	0	30.0000	S	First	woman	False	B	Southampton	yes	True
888	0	3	female	NaN	1	2	23.4500	S	Third	woman	False	NaN	Southampton	no	False
889	1	1	male	26.0	0	0	30.0000	C	First	man	True	C	Cherbourg	yes	True
890	0	3	male	32.0	0	0	7.7500	Q	Third	man	True	NaN	Queenstown	no	True

891 rows × 15 columns

In [2]:

# Which group has the higher mean age: survivors or non-survivors?

# Use .groupby() to get the mean age by survival status

df.groupby('survived')['age'].mean()  # survived = 1, died = 0

Out[2]:

survived
0    30.626179
1    28.343690
Name: age, dtype: float64
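The comparison above is read off by eye. As a minimal sketch, the group with the higher mean can also be picked out programmatically with `idxmax()`. The `toy` frame below is a hypothetical miniature with invented values, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the Titanic frame (values invented for illustration)
toy = pd.DataFrame({
    "survived": [0, 0, 1, 1, 1],
    "age": [40.0, 30.0, 20.0, None, 24.0],
})

# mean() skips missing ages, just as it does on the real data
mean_age = toy.groupby("survived")["age"].mean()

# idxmax() returns the index label (here the survived value) of the largest mean
older_group = mean_age.idxmax()

print(mean_age)
print("Group with the higher mean age:", older_group)
```

On the real `df`, the same two lines would return 0, since the non-survivors' mean age (30.63) is higher than the survivors' (28.34).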

In [4]:

# Median passenger class of males who died - use .median()

df.groupby(['survived', 'sex'])[['pclass']].median()  # double brackets [[...]] return a DataFrame

Out[4]:

		pclass
survived	sex
0	female	3.0
0	male	3.0
1	female	2.0
1	male	2.0
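The table above lists all four groups, while the question only asks about males who died. A minimal sketch of pulling out that single value, either by indexing the grouped result with a `(survived, sex)` tuple or by filtering first. The `toy` frame is hypothetical and its values are invented.

```python
import pandas as pd

# Hypothetical miniature of the Titanic frame (values invented for illustration)
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1],
    "sex": ["male", "male", "male", "female", "male"],
    "pclass": [3, 3, 2, 1, 1],
})

# Single brackets give a Series with a (survived, sex) MultiIndex
medians = toy.groupby(["survived", "sex"])["pclass"].median()

# Option 1: look up one group with a tuple key
dead_male_median = medians.loc[(0, "male")]

# Option 2: filter the rows first, then take the median
mask = (toy["survived"] == 0) & (toy["sex"] == "male")
dead_male_median_alt = toy.loc[mask, "pclass"].median()

print(dead_male_median, dead_male_median_alt)
```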

In [5]:

# Give the death rate of the young group, rounded to two decimal places
# Split the passengers by age with .query()
'''
young: under 20
middle: 20 to under 60
old: 60 and over
'''
young=df.query("age<20")
middle=df.query("20<= age<60")
old=df.query("age>=60")

In [6]:

df.groupby('age')['survived'].count()

Out[6]:

age
0.42     1
0.67     1
0.75     2
0.83     2
0.92     1
        ..
70.00    2
70.50    1
71.00    2
74.00    1
80.00    1
Name: survived, Length: 88, dtype: int64

In [8]:

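# Set the normalize parameter of value_counts() to get the proportion of each survived value per group
# The output recorded below shows raw counts rather than proportions;
# from those counts the young death rate is 85 / (85 + 79), about 0.52
young['survived'].value_counts(normalize=True)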

Out[8]:

0    85
1    79
Name: survived, dtype: int64

In [9]:

middle['survived'].value_counts(normalize=True)

Out[9]:

0    0.610687
1    0.389313
Name: survived, dtype: float64

In [10]:

old['survived'].value_counts(normalize=True)

Out[10]:

0    0.730769
1    0.269231
Name: survived, dtype: float64

In [11]:

# Data visualization

# Show the mean fare by survival status and sex as a bar plot

import matplotlib.pyplot as plt

df.groupby(['survived', 'sex'])[['fare']].mean().plot(kind='bar')

Out[11]:

<AxesSubplot:xlabel='survived,sex'>
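The plot above puts all four (survived, sex) pairs on one axis. As a hedged alternative, the grouped means can be reshaped with `unstack()` so that `sex` becomes the columns. The `toy` frame below is hypothetical and its fares are invented.

```python
import pandas as pd

# Hypothetical miniature of the Titanic frame (values invented for illustration)
toy = pd.DataFrame({
    "survived": [0, 0, 1, 1, 1, 0],
    "sex": ["male", "female", "male", "female", "female", "male"],
    "fare": [8.0, 10.0, 30.0, 80.0, 60.0, 12.0],
})

# Series of mean fares indexed by (survived, sex)
mean_fare = toy.groupby(["survived", "sex"])["fare"].mean()

# Move the sex level into the columns: rows are survived, columns are female/male
fare_table = mean_fare.unstack("sex")
print(fare_table)
```

Calling `fare_table.plot(kind='bar')` on such a table draws one cluster of bars per `survived` value with one bar per sex, which can make the comparison between sexes easier to read.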

In [12]:

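# Survival ratio - pie chart

# Pie plot of the distribution of survived

ratio=df['survived'].value_counts(normalize=True)

# Take the labels from ratio.index so they always follow the order of value_counts()
plt.pie(ratio, labels=ratio.index, autopct='%.0f%%', explode=[0,0.05], colors=['lightgrey', 'skyblue'])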
plt.title('Survived');

In [13]:

# Check the pclass frequencies - seaborn's .countplot()

# Count the rows for each unique pclass value
df['pclass'].value_counts()

Out[13]:

3    491
1    216
2    184
Name: pclass, dtype: int64

In [17]:

# Bar plot - check the pclass distribution
sns.countplot(data=df, x='pclass', color='orange')
plt.show()

In [18]:

# Visualizing continuous variables

# Plot the age column as a histogram with 8 bins

# Check the summary statistics of age

df['age'].describe()

Out[18]:

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: age, dtype: float64

In [19]:

# Distribution of age as a histogram (default bins)
plt.hist(data=df, x='age')
plt.show()

In [20]:

# Set the bin size to show the distribution by age decade
bin_size=10
bins=np.arange(0, df.age.max()+bin_size, bin_size)

plt.hist(data=df, x='age', bins=bins)
plt.show()
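To confirm that these edges give the 8 bins the task asks for, the same `np.arange` call can be checked without plotting. This is a minimal sketch: `max_age` is hard-coded to the maximum reported by `describe()` above, and `ages` holds only a handful of ages that appear in the outputs above, not the full column.

```python
import numpy as np

max_age = 80.0  # maximum age reported by df['age'].describe() above
bin_size = 10

# Edges 0, 10, ..., 80: adding bin_size to the stop value keeps the maximum inside the last bin
bins = np.arange(0, max_age + bin_size, bin_size)
n_bins = len(bins) - 1  # 9 edges make 8 bins

# A few ages taken from the outputs above (illustrative subset only)
ages = np.array([0.42, 22.0, 38.0, 26.0, 35.0, 80.0])

# np.histogram counts values per bin; the last bin includes its right edge, so 80 is counted
counts, edges = np.histogram(ages, bins=bins)
print(n_bins)
print(counts)
```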

