Titanic data pre-processing with pandas
EthanJ 2022. 11. 13. 11:19
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Data Extraction
data = pd.read_csv('data/train.csv')
data.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Data Set Overview
- Survived: survival
  - 0 = No
  - 1 = Yes
- Pclass: ticket class
  - 1 = 1st
  - 2 = 2nd
  - 3 = 3rd
- Sex: sex
- Age: age in years
- SibSp: number of siblings / spouses aboard
- Parch: number of parents / children aboard
- Ticket: ticket number
- Name
- Fare: passenger fare
- Cabin: cabin number
- Embarked: port of embarkation
  - C = Cherbourg
  - Q = Queenstown
  - S = Southampton
Extracting the list of columns
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
Using map(): apply str.upper() to the column labels
data.columns.map(str.upper)
Index(['PASSENGERID', 'SURVIVED', 'PCLASS', 'NAME', 'SEX', 'AGE', 'SIBSP', 'PARCH', 'TICKET', 'FARE', 'CABIN', 'EMBARKED'], dtype='object')
Convert all of the DataFrame's column labels to upper case at once
data.columns = data.columns.map(str.upper)
data = data.rename(columns=str.upper)
data.head()
PASSENGERID | SURVIVED | PCLASS | NAME | SEX | AGE | SIBSP | PARCH | TICKET | FARE | CABIN | EMBARKED | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
data.shape
(891, 12)
Checking Missing Values
Count per column. When the data set is small, do not delete records that contain missing values (NaN); replace the values instead (a quick check of what row-wise deletion would cost is sketched after the ratio check below).
data.isnull().sum()
PASSENGERID 0 SURVIVED 0 PCLASS 0 NAME 0 SEX 0 AGE 177 SIBSP 0 PARCH 0 TICKET 0 FARE 0 CABIN 687 EMBARKED 2 dtype: int64
Checking the missing-value ratio: element-wise arithmetic between Series
data.isna().sum() / len(data.index) * 100
PASSENGERID 0.000000 SURVIVED 0.000000 PCLASS 0.000000 NAME 0.000000 SEX 0.000000 AGE 19.865320 SIBSP 0.000000 PARCH 0.000000 TICKET 0.000000 FARE 0.000000 CABIN 77.104377 EMBARKED 0.224467 dtype: float64
- CABIN column missing values: about 77%, a very high ratio, so consider dropping the column itself or turning it into a derived variable
- AGE column missing values: about 19%
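As a quick illustration of why dropping rows would be too costly here, a minimal sketch (it only assumes the data DataFrame loaded above):

# how many records would survive if every row containing a NaN were dropped?
print(data.shape[0])             # 891 rows in total
print(data.dropna().shape[0])    # far fewer remain, since CABIN alone is ~77% missing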
Removing Unnecessary Variables
Column that can safely be removed: TICKET

- Reason: NAME and PASSENGERID are enough to uniquely identify each record
data = data.drop('TICKET', axis=1)
data.head()
PASSENGERID | SURVIVED | PCLASS | NAME | SEX | AGE | SIBSP | PARCH | FARE | CABIN | EMBARKED | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | NaN | S |
Creating a Derived Variable from the CABIN Data
Create a derived variable from the CABIN data: process the original data to extract meaningful information.

- Derived variable: the first letter of each CABIN value (str[0])
  - Reason: it may be correlated with FARE
  - Reason: the first letter of a CABIN value indicates the deck
- Derived variable: add it as a DECK column
  - Collect the unique first letters of the CABIN data into a list
  - For CABIN values whose first letter is in deck_list, add that letter in a DECK column
  - Alternatively, use the Series str accessor
deck_list = list()

for cabin in data['CABIN']:
    # NaN is a float, so this type check skips missing CABIN values
    if type(cabin) == type(np.nan):
        continue
    # collect each deck letter (first character) only once
    if cabin[0] in deck_list:
        continue
    else:
        deck_list.append(cabin[0])
deck_list
['C', 'E', 'G', 'D', 'A', 'B', 'F', 'T']
- For CABIN values whose first letter is in deck_list: add that letter in a DECK column
- For data not covered by deck_list (missing values): fill with 0
sample = data.copy()
sample.CABIN = sample.CABIN.fillna(0)
sample.head()
PASSENGERID | SURVIVED | PCLASS | NAME | SEX | AGE | SIBSP | PARCH | FARE | CABIN | EMBARKED | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | 0 | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | 0 | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | 0 | S |
deck_data = list()

for cabin in sample['CABIN']:
    if cabin == 0:                   # filled-in missing value
        deck_data.append('Unknown')
    elif cabin[0] in deck_list:      # known deck letter
        deck_data.append(cabin[0])
    else:
        deck_data.append('Unknown')

deck_data
['Unknown', 'C', 'Unknown', 'C', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'G', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'A', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'B', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'C', 'Unknown', 'Unknown', 'Unknown', 'F', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'F', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'A', 'D', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'E', 'D', 'Unknown', 'Unknown', 'Unknown', 'F', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'D', 'C', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'F', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'A', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'F', 'Unknown', 'A', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'F', 'B', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'G', 'Unknown', 'Unknown', 'Unknown', 'A', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'Unknown', 'D', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'D', 'Unknown', 'Unknown', 'G', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'E', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'C', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'D', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'A', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'D', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'C', 'B', 'Unknown', 'Unknown', 'Unknown', 'E', 'Unknown', 'C', 'Unknown', 'C', 'Unknown', 'E', 'C', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'D', 'Unknown', 'B', 'Unknown', 'C', 'C', 'Unknown', 'Unknown', 'Unknown', 'C', 'E', 'Unknown', 'T', 'F', 'C', 'Unknown', 'Unknown', 'Unknown', 'F', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'Unknown', 'B', 
'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'D', 'G', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'E', 'C', 'Unknown', 'Unknown', 'Unknown', 'E', 'B', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'A', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'C', 'C', 'Unknown', 'Unknown', 'E', 'D', 'Unknown', 'Unknown', 'E', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'A', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'C', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'D', 'F', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'B', 'B', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'A', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'A', 'Unknown', 'E', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'A', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'F', 'Unknown', 'Unknown', 'D', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'D', 'Unknown', 'Unknown', 'A', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'A', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'D', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'F', 'C', 'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'C', 'C', 'C', 'Unknown', 'Unknown', 'F', 'C', 'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 
'Unknown', 'Unknown', 'D', 'C', 'B', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'D', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'D', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'F', 'Unknown', 'Unknown', 'B', 'Unknown', 'B', 'D', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'A', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'D', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'A', 'Unknown', 'Unknown', 'Unknown', 'D', 'B', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown', 'B', 'Unknown', 'C', 'Unknown']
sample['DECK'] = deck_data
sample.tail()
PASSENGERID | SURVIVED | PCLASS | NAME | SEX | AGE | SIBSP | PARCH | FARE | CABIN | EMBARKED | DECK | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 13.00 | 0 | S | Unknown |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 30.00 | B42 | S | B |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | 23.45 | 0 | S | Unknown |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 30.00 | C148 | C | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 7.75 | 0 | Q | Unknown |
Using the Series str accessor: data that is not str (such as NaN) is simply skipped, which makes this the most efficient approach.
data['DECK'] = data['CABIN'].str[0]
data.DECK = data['DECK'].fillna('Unknown')
data.head()
PASSENGERID | SURVIVED | PCLASS | NAME | SEX | AGE | SIBSP | PARCH | FARE | CABIN | EMBARKED | DECK | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | NaN | S | Unknown |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C85 | C | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | NaN | S | Unknown |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | C123 | S | C |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | NaN | S | Unknown |
The CABIN derived variable is complete: drop the column.
data = data.drop('CABIN', axis=1)
data.tail()
PASSENGERID | SURVIVED | PCLASS | NAME | SEX | AGE | SIBSP | PARCH | FARE | EMBARKED | DECK | |
---|---|---|---|---|---|---|---|---|---|---|---|
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 13.00 | S | Unknown |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 30.00 | S | B |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | 23.45 | S | Unknown |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 30.00 | C | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 7.75 | Q | Unknown |
Creating a Derived Variable from the NAME Data
An individual name is not a meaningful variable by itself.

The title written in the name, however, can be extracted as a categorical feature:

- Mr, Ms, Miss, Master, ...
Extract the title from NAME and add it as a TITLE column.

1. Extract with two calls to split()
   - 1.1. using a lambda expression
   - 1.2. using a user-defined function with the map() method
2. Extract with a regular expression
data.NAME.head(10)
0 Braund, Mr. Owen Harris 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 2 Heikkinen, Miss. Laina 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 Allen, Mr. William Henry 5 Moran, Mr. James 6 McCarthy, Mr. Timothy J 7 Palsson, Master. Gosta Leonard 8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 9 Nasser, Mrs. Nicholas (Adele Achem) Name: NAME, dtype: object
sample_lam = data.copy()
sample_map = data.copy()
sample_reg = data.copy()
1. Extract with two calls to split()

1.1. Using a lambda expression
title = sample_lam.NAME.apply(lambda x: x.split('.')[0].split(',')[1])
title
0 Mr 1 Mrs 2 Miss 3 Mrs 4 Mr ... 886 Rev 887 Miss 888 Miss 889 Mr 890 Mr Name: NAME, Length: 891, dtype: object
sample_lam['TITLE'] = title
sample_lam.TITLE.value_counts()
Mr 517 Miss 182 Mrs 125 Master 40 Dr 7 Rev 6 Mlle 2 Major 2 Col 2 the Countess 1 Capt 1 Ms 1 Sir 1 Lady 1 Mme 1 Don 1 Jonkheer 1 Name: TITLE, dtype: int64
1.2. Using a user-defined function with the map() method
def name_split(name) -> str:
    return name.split(',')[1].split('.')[0]
sample_map['TITLE'] = sample_map['NAME'].map(name_split)
sample_map.TITLE.value_counts()
Mr 517 Miss 182 Mrs 125 Master 40 Dr 7 Rev 6 Mlle 2 Major 2 Col 2 the Countess 1 Capt 1 Ms 1 Sir 1 Lady 1 Mme 1 Don 1 Jonkheer 1 Name: TITLE, dtype: int64
2. Extract with a regular expression
import re

def get_title(x):
    # note: 'Mr' comes before 'Mrs' in the alternation, so 'Mrs' names match as 'Mr' here
    pattern = re.compile('Sir|Mr|Don|Major|Capt|Jonkheer|Rev|Col|Dr|Mrs|Countess|Dona|Mme|Ms|Miss|Mlle|Master', re.I)
    result = pattern.search(x)
    if result:
        return result.group()
    else:
        return str(np.nan)
sample_reg['NAME'].apply(get_title)
0 Mr 1 Mr 2 Miss 3 Mr 4 Mr ... 886 Rev 887 Miss 888 Miss 889 Mr 890 Mr Name: NAME, Length: 891, dtype: object
sample_reg['TITLE'] = sample_reg['NAME'].str.extract('([a-zA-Z]+)\.',expand=False)
sample_reg.TITLE.value_counts()
Mr 517 Miss 182 Mrs 125 Master 40 Dr 7 Rev 6 Mlle 2 Major 2 Col 2 Countess 1 Capt 1 Ms 1 Sir 1 Lady 1 Mme 1 Don 1 Jonkheer 1 Name: TITLE, dtype: int64
Use the lambda approach on the original data.
data['TITLE'] = data['NAME'].apply(lambda x: x.split(',')[1].split('.')[0])
data.TITLE.value_counts()
Mr 517 Miss 182 Mrs 125 Master 40 Dr 7 Rev 6 Mlle 2 Major 2 Col 2 the Countess 1 Capt 1 Ms 1 Sir 1 Lady 1 Mme 1 Don 1 Jonkheer 1 Name: TITLE, dtype: int64
Drop the NAME column.
data = data.drop('NAME', axis=1)
data.head()
PASSENGERID | SURVIVED | PCLASS | SEX | AGE | SIBSP | PARCH | FARE | EMBARKED | DECK | TITLE | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Unknown | Mr |
1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C | Mrs |
2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Unknown | Miss |
3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | C | Mrs |
4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Unknown | Mr |
data.groupby('TITLE')['AGE'].mean()
TITLE Capt 70.000000 Col 58.000000 Don 40.000000 Dr 42.000000 Jonkheer 38.000000 Lady 48.000000 Major 48.500000 Master 4.574167 Miss 21.773973 Mlle 24.000000 Mme 24.000000 Mr 32.368090 Mrs 35.898148 Ms 28.000000 Rev 43.166667 Sir 49.000000 the Countess 33.000000 Name: AGE, dtype: float64
data[['TITLE', 'AGE']].head()
TITLE | AGE | |
---|---|---|
0 | Mr | 22.0 |
1 | Mrs | 38.0 |
2 | Miss | 26.0 |
3 | Mrs | 35.0 |
4 | Mr | 35.0 |
.transform('func') chained onto df.groupby('column')['col'] returns each element of col replaced by the value of the aggregate function 'func' for its group.
data.groupby('TITLE')['AGE'].transform('mean')
0 32.368090 1 35.898148 2 21.773973 3 35.898148 4 32.368090 ... 886 43.166667 887 21.773973 888 21.773973 889 32.368090 890 32.368090 Name: AGE, Length: 891, dtype: float64
Fill the AGE missing values with the mean of the matching TITLE group: obj.transform('mean')
data['AGE'] = data['AGE'].fillna(data.groupby('TITLE')['AGE'].transform('mean'))
print(data['AGE'].value_counts())
print(data.AGE.isnull().sum())
32.368090 119 21.773973 36 24.000000 30 22.000000 27 18.000000 26 ... 14.500000 1 70.500000 1 12.000000 1 36.500000 1 74.000000 1 Name: AGE, Length: 92, dtype: int64 0
Converting the AGE Data to Categorical Data

Age bands:

- 0 ~ 20
- 20 ~ 40
- 40 ~ 60
- 60 ~

Methods:

- A. user-defined function with map()
- B. pd.cut()
- C. np.digitize()
sample_A = data.copy()
sample_B = data.copy()
sample_C = data.copy()
A. Using a user-defined function with the map() method
def age_class(age):
    if age < 20: return 'under20'
    if age < 40: return 'under40'
    if age < 60: return 'under60'
    else: return 'over60'
sample_A['AGE_RANGE'] = sample_A['AGE'].map(age_class)
sample_A['AGE_RANGE'].value_counts()
under40 559 under20 168 under60 138 over60 26 Name: AGE_RANGE, dtype: int64
sample_A.tail()
PASSENGERID | SURVIVED | PCLASS | SEX | AGE | SIBSP | PARCH | FARE | EMBARKED | DECK | TITLE | AGE_RANGE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
886 | 887 | 0 | 2 | male | 27.000000 | 0 | 0 | 13.00 | S | Unknown | Rev | under40 |
887 | 888 | 1 | 1 | female | 19.000000 | 0 | 0 | 30.00 | S | B | Miss | under20 |
888 | 889 | 0 | 3 | female | 21.773973 | 1 | 2 | 23.45 | S | Unknown | Miss | under40 |
889 | 890 | 1 | 1 | male | 26.000000 | 0 | 0 | 30.00 | C | C | Mr | under40 |
890 | 891 | 0 | 3 | male | 32.000000 | 0 | 0 | 7.75 | Q | Unknown | Mr | under40 |
B. pd.cut()
sample_B['AGE_RANGE']=pd.cut(sample_B['AGE'],[0,19,39,59,100],
labels=['under20','under40','under60','over60'])
sample_B['AGE_RANGE'].value_counts()
under40 559 under20 168 under60 138 over60 26 Name: AGE_RANGE, dtype: int64
sample_B.tail()
PASSENGERID | SURVIVED | PCLASS | SEX | AGE | SIBSP | PARCH | FARE | EMBARKED | DECK | TITLE | AGE_RANGE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
886 | 887 | 0 | 2 | male | 27.000000 | 0 | 0 | 13.00 | S | Unknown | Rev | under40 |
887 | 888 | 1 | 1 | female | 19.000000 | 0 | 0 | 30.00 | S | B | Miss | under20 |
888 | 889 | 0 | 3 | female | 21.773973 | 1 | 2 | 23.45 | S | Unknown | Miss | under40 |
889 | 890 | 1 | 1 | male | 26.000000 | 0 | 0 | 30.00 | C | C | Mr | under40 |
890 | 891 | 0 | 3 | male | 32.000000 | 0 | 0 | 7.75 | Q | Unknown | Mr | under40 |
C. np.digitize()
sample_C['AGE_RANGE'] = np.digitize(sample_C['AGE'], [20, 40, 60])
range_list = list(sample_C['AGE_RANGE'].unique())
range_list
[1, 2, 0, 3]
range_list.sort()
range_map = dict()

for k, v in zip(range_list, ['under20', 'under40', 'under60', 'over60']):
    range_map[k] = v

range_map
{0: 'under20', 1: 'under40', 2: 'under60', 3: 'over60'}
sample_C['AGE_RANGE'] = sample_C['AGE_RANGE'].map(range_map)
sample_C.AGE_RANGE.value_counts()
under40 559 under20 168 under60 138 over60 26 Name: AGE_RANGE, dtype: int64
Apply the user-defined function approach to the original data.
data['AGE_RANGE'] = data['AGE'].apply(age_class)
data['AGE_RANGE'].value_counts()
under40 559 under20 168 under60 138 over60 26 Name: AGE_RANGE, dtype: int64
data['SEX'].info()
<class 'pandas.core.series.Series'> RangeIndex: 891 entries, 0 to 890 Series name: SEX Non-Null Count Dtype -------------- ----- 891 non-null object dtypes: object(1) memory usage: 7.1+ KB
sample_map = data.copy()
sample_func = data.copy()
Converting SEX to a number with a mapping dict: obj.map({ })
sex_map = {'male':0, 'female':1}
sample_map['SEX_NUM'] = sample_map['SEX'].map(sex_map)
sample_map['SEX_NUM'].value_counts()
0 577 1 314 Name: SEX_NUM, dtype: int64
Using a user-defined function with apply()
sample_func['SEX_NUM'] = -1

def change_str_to_num(value):
    if value == 'male': return 0
    if value == 'female': return 1

sample_func['SEX_MAP'] = sample_func['SEX'].apply(change_str_to_num)
sample_func['SEX_MAP'].value_counts()
0 577 1 314 Name: SEX_MAP, dtype: int64
Apply the mapping to the original data.
data['SEX_NUM'] = data['SEX'].map(sex_map)
data['SEX_NUM'].value_counts()
0 577 1 314 Name: SEX_NUM, dtype: int64
data.head()
PASSENGERID | SURVIVED | PCLASS | SEX | AGE | SIBSP | PARCH | FARE | EMBARKED | DECK | TITLE | AGE_RANGE | SEX_NUM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Unknown | Mr | under40 | 0 |
1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C | Mrs | under40 | 1 |
2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Unknown | Miss | under40 | 1 |
3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | C | Mrs | under40 | 1 |
4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Unknown | Mr | under40 | 0 |
data = data.drop('SEX', axis=1)
data.tail()
PASSENGERID | SURVIVED | PCLASS | AGE | SIBSP | PARCH | FARE | EMBARKED | DECK | TITLE | AGE_RANGE | SEX_NUM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
886 | 887 | 0 | 2 | 27.000000 | 0 | 0 | 13.00 | S | Unknown | Rev | under40 | 0 |
887 | 888 | 1 | 1 | 19.000000 | 0 | 0 | 30.00 | S | B | Miss | under20 | 1 |
888 | 889 | 0 | 3 | 21.773973 | 1 | 2 | 23.45 | S | Unknown | Miss | under40 | 1 |
889 | 890 | 1 | 1 | 26.000000 | 0 | 0 | 30.00 | C | C | Mr | under40 | 0 |
890 | 891 | 0 | 3 | 32.000000 | 0 | 0 | 7.75 | Q | Unknown | Mr | under40 | 0 |
Representing the FARE Data as Integers

The values are adjusted at the first decimal place.

- Python built-in round(): round to the nearest integer
- NumPy np.ceil(): round up (a sketch of this variant follows the round() example below)
Using the Python built-in round()
data['FARE'] = data['FARE'].round(0)
data.FARE.value_counts()
8.0 206 7.0 65 26.0 46 13.0 46 10.0 43 ... 75.0 1 76.0 1 4.0 1 212.0 1 5.0 1 Name: FARE, Length: 90, dtype: int64
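The np.ceil() variant listed above is not applied to data itself; a minimal sketch that re-reads the raw fares so the two behaviours can be compared side by side (raw_fare is a name used only for this illustration):

# re-read the untouched Fare column and compare rounding vs. rounding up
raw_fare = pd.read_csv('data/train.csv')['Fare']
print(raw_fare.round(0).head())    # 7.25 -> 7.0 (nearest integer)
print(np.ceil(raw_fare).head())    # 7.25 -> 8.0 (always rounds up)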
data.columns
Index(['PASSENGERID', 'SURVIVED', 'PCLASS', 'AGE', 'SIBSP', 'PARCH', 'FARE', 'EMBARKED', 'DECK', 'TITLE', 'AGE_RANGE', 'SEX_NUM'], dtype='object')
- Family size = SIBSP (siblings / spouses) + PARCH (parents / children) + 1 (the passenger themself)
data["FAMILY_SIZE"] = data.SIBSP + data.PARCH + 1
data.FAMILY_SIZE.value_counts()
1 537 2 161 3 102 4 29 6 22 5 15 7 12 11 7 8 6 Name: FAMILY_SIZE, dtype: int64
- Categorize using 3 and 7 as the boundaries
data.FAMILY_SIZE = np.digitize(data.FAMILY_SIZE, [3, 7])
data.FAMILY_SIZE.unique()
array([0, 1, 2], dtype=int64)
int_var = data.FAMILY_SIZE.unique()
fam_map = dict()

# relies on unique() returning [0, 1, 2] in ascending order here (see the check above)
for k, v in zip(int_var, ['small', 'medium', 'large']):
    fam_map[k] = v

fam_map
{0: 'small', 1: 'medium', 2: 'large'}
data['FAMILY_SIZE'] = data['FAMILY_SIZE'].map(fam_map)
data.FAMILY_SIZE.value_counts()
small 698 medium 168 large 25 Name: FAMILY_SIZE, dtype: int64
Deriving the Correlation Between the Independent Variables and the Dependent Variable SURVIVED
data[['FARE', 'SURVIVED']].corr()
FARE | SURVIVED | |
---|---|---|
FARE | 1.000000 | 0.257088 |
SURVIVED | 0.257088 | 1.000000 |
> Positive, but only a weak correlation (about 0.26).
data[['SEX_NUM', 'SURVIVED']].corr()
SEX_NUM | SURVIVED | |
---|---|---|
SEX_NUM | 1.000000 | 0.543351 |
SURVIVED | 0.543351 | 1.000000 |
> A positive value that is reasonably meaningful (about 0.54).
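Instead of checking pairs one at a time, the correlation of every numeric column with SURVIVED can be pulled in a single call; a minimal sketch (numeric_only=True is needed on newer pandas versions, while older ones skip the string columns by default):

# correlation of all numeric columns with SURVIVED, sorted for readability
data.corr(numeric_only=True)['SURVIVED'].sort_values(ascending=False)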
Data Visualization
Visualizing the age-group (AGE_RANGE) proportions as a pie chart
data['AGE_RANGE'].value_counts().plot(kind='pie', autopct='%.2f%%')
<AxesSubplot:ylabel='AGE_RANGE'>
pv_tb = data.pivot_table(index='SURVIVED', columns='AGE_RANGE',
aggfunc='count')['AGE']
pv_tb.plot(kind='bar', stacked=True)
<AxesSubplot:xlabel='SURVIVED'>
- Creating data for visualization
passed_away = data[data['SURVIVED'] == 0]['AGE_RANGE'].value_counts()
survived = data[data.SURVIVED == 1]['AGE_RANGE'].value_counts()
passed_away.name = 'Passed Away'
survived.name = 'Survived'
df_vis = pd.DataFrame([passed_away, survived])
df_vis
under40 | under20 | under60 | over60 | |
---|---|---|---|---|
Passed Away | 359 | 87 | 84 | 19 |
Survived | 200 | 81 | 54 | 7 |
df_vis.plot(kind='bar', stacked=True)
<AxesSubplot:>
Visualizing SURVIVED by AGE_RANGE
age_surv = data.pivot_table(index='AGE_RANGE', columns='SURVIVED',
aggfunc='count')['AGE']
age_surv.plot(kind='bar', stacked=True)
<AxesSubplot:xlabel='AGE_RANGE'>
Bar chart of SURVIVED by FAMILY_SIZE
fam_surv = data.pivot_table(index='FAMILY_SIZE', columns='SURVIVED',
aggfunc='count')['SEX_NUM']
fam_surv.plot(kind='bar', stacked=True)
<AxesSubplot:xlabel='FAMILY_SIZE'>
surv_class = data.groupby('PCLASS')['SURVIVED'].sum()     # survivors per class (SURVIVED == 1)
gone_class = data.groupby('PCLASS')['SURVIVED'].count()   # total passengers per class; survivors are subtracted below
print(surv_class)
print(gone_class)
PCLASS 1 136 2 87 3 119 Name: SURVIVED, dtype: int64 PCLASS 1 216 2 184 3 491 Name: SURVIVED, dtype: int64
gone_class = gone_class.sub(surv_class)
gone_class
PCLASS 1 80 2 97 3 372 Name: SURVIVED, dtype: int64
gone_class = gone_class.rename('PASSED AWAY')
pclass_surv = pd.DataFrame([surv_class, gone_class])
pclass_surv
PCLASS | 1 | 2 | 3 |
---|---|---|---|
SURVIVED | 136 | 87 | 119 |
PASSED AWAY | 80 | 97 | 372 |
pclass_surv.plot(kind='bar', stacked=True)
<AxesSubplot:>
Survival Ratio by FAMILY_SIZE
data['SIZE_COUNT'] = data['SIBSP'] + data['PARCH'] + 1
data.pivot_table(index='FAMILY_SIZE',columns='SURVIVED',
aggfunc='count')['SIZE_COUNT'].plot(kind='bar',stacked=True)
<AxesSubplot:xlabel='FAMILY_SIZE'>
Visualizing EMBARKED
data.EMBARKED.isna().sum()
2
- Handling the missing values: replace them with the most frequent value (a mode-based variant is sketched below)
data.EMBARKED.value_counts()
S 644 C 168 Q 77 Name: EMBARKED, dtype: int64
data.EMBARKED = data.EMBARKED.fillna('S')
print(data.EMBARKED.count())
print(data.EMBARKED.value_counts())
891 S 646 C 168 Q 77 Name: EMBARKED, dtype: int64
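The 'S' above was read off value_counts() by hand; a minimal sketch that derives the fill value from the data instead (most_common_port is a name used only for this illustration, and the result is identical):

# mode() returns the most frequent value(s); take the first one
most_common_port = data.EMBARKED.mode()[0]    # 'S' for this data set
data.EMBARKED = data.EMBARKED.fillna(most_common_port)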
data['EMBARKED_COUNT'] = data['EMBARKED'].map({'S':0, 'C':1, 'Q':2})
data.EMBARKED_COUNT.value_counts()
0 646 1 168 2 77 Name: EMBARKED_COUNT, dtype: int64
data.pivot_table(index='EMBARKED', columns='SURVIVED',
aggfunc='count')['EMBARKED_COUNT'].plot(kind='bar', stacked=True)
<AxesSubplot:xlabel='EMBARKED'>
data.EMBARKED.value_counts().plot(kind='pie')
<AxesSubplot:ylabel='EMBARKED'>