Titanic data pre-processing with pandas 판다스 타이타닉 데이터 전처리
EthanJ 2022. 11. 13. 11:19
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Data Extraction
data = pd.read_csv('data/train.csv')
data.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Data Set Overview
- Survived: survival
  - 0 = No
  - 1 = Yes
- Pclass: ticket class
  - 1 = 1st
  - 2 = 2nd
  - 3 = 3rd
- Sex
- Age
- SibSp: number of siblings/spouses aboard
- Parch: number of parents/children aboard
- Ticket: ticket number
- Name
- Fare: fare paid
- Cabin: cabin number
- Embarked: port of embarkation
  - C = Cherbourg
  - Q = Queenstown
  - S = Southampton
Extract the list of columns
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
Using `map()`: apply `str.upper()` to the column labels
data.columns.map(str.upper)
Index(['PASSENGERID', 'SURVIVED', 'PCLASS', 'NAME', 'SEX', 'AGE', 'SIBSP',
'PARCH', 'TICKET', 'FARE', 'CABIN', 'EMBARKED'],
dtype='object')
Uppercase all of the DataFrame's column labels at once
data.columns = data.columns.map(str.upper)
# equivalent: data = data.rename(columns=str.upper)
data.head()
| | PASSENGERID | SURVIVED | PCLASS | NAME | SEX | AGE | SIBSP | PARCH | TICKET | FARE | CABIN | EMBARKED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
data.shape
(891, 12)
Checking for Missing Values
Count per column. When the data set is small, do not delete records containing missing values (NaN); replace the values instead.
data.isnull().sum()
PASSENGERID      0
SURVIVED         0
PCLASS           0
NAME             0
SEX              0
AGE            177
SIBSP            0
PARCH            0
TICKET           0
FARE             0
CABIN          687
EMBARKED         2
dtype: int64
Check the ratio of missing values (element-wise arithmetic between Series)
data.isna().sum() / len(data.index) * 100
PASSENGERID     0.000000
SURVIVED        0.000000
PCLASS          0.000000
NAME            0.000000
SEX             0.000000
AGE            19.865320
SIBSP           0.000000
PARCH           0.000000
TICKET          0.000000
FARE            0.000000
CABIN          77.104377
EMBARKED        0.224467
dtype: float64
CABIN: about 77% missing, a very high ratio; consider dropping the column itself or using it for a derived variable. AGE: about 20% missing.
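The "drop columns with too many missing values" idea can also be automated. A minimal sketch (the 50% threshold and the toy values are assumptions, not from this post):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the Titanic data (hypothetical values)
df = pd.DataFrame({
    'AGE':   [22.0, np.nan, 26.0, 35.0],
    'CABIN': [np.nan, 'C85', np.nan, np.nan],
    'FARE':  [7.25, 71.28, 7.93, 53.10],
})

# Fraction of NaN per column; keep columns under a 50% missing threshold
ratio = df.isna().mean()
df = df.loc[:, ratio < 0.5]
print(list(df.columns))  # CABIN (75% missing) is dropped
```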
Removing Unnecessary Variables
- Column that can be dropped: TICKET
  - Reason: NAME and PASSENGERID are enough to uniquely identify each record
data = data.drop('TICKET', axis=1)
data.head()
| | PASSENGERID | SURVIVED | PCLASS | NAME | SEX | AGE | SIBSP | PARCH | FARE | CABIN | EMBARKED |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | NaN | S |
Creating a Derived Variable from CABIN
Use the CABIN data to create a derived variable: extract meaningful data by processing the original data.
- Derived variable: the first letter of each CABIN value (`str[0]`)
  - Reason: it may be correlated with FARE, since the first letter of CABIN indicates the deck
- Add it as a DECK column
- Approach 1: collect the unique first letters of CABIN in a list, then add the first letter as the DECK column when it appears in `deck_list`
- Approach 2: use the Series `str` accessor
deck_list = list()
for cabin in data['CABIN']:
    # NaN is a float, so skip missing values
    if isinstance(cabin, float):
        continue
    if cabin[0] not in deck_list:
        deck_list.append(cabin[0])
deck_list
['C', 'E', 'G', 'D', 'A', 'B', 'F', 'T']
- Add the first letter of CABIN as the DECK column when it appears in `deck_list`
- Fill data not in `deck_list` (missing values) with 0
sample = data.copy()
sample.CABIN = sample.CABIN.fillna(0)
sample.head()
| | PASSENGERID | SURVIVED | PCLASS | NAME | SEX | AGE | SIBSP | PARCH | FARE | CABIN | EMBARKED |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | 0 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | 0 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | 0 | S |
deck_data = list()
for cabin in sample['CABIN']:
if cabin == 0:
deck_data.append('Unknown')
elif cabin[0] in deck_list:
deck_data.append(cabin[0])
else:
deck_data.append('Unknown')
deck_data
['Unknown', 'C', 'Unknown', 'C', 'Unknown', 'Unknown', 'E', 'Unknown', 'Unknown', 'Unknown', 'G', 'C', ..., 'B', 'Unknown', 'C', 'Unknown']  (891 entries; output truncated)
sample['DECK'] = deck_data
sample.tail()
| | PASSENGERID | SURVIVED | PCLASS | NAME | SEX | AGE | SIBSP | PARCH | FARE | CABIN | EMBARKED | DECK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 13.00 | 0 | S | Unknown |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 30.00 | B42 | S | B |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | 23.45 | 0 | S | Unknown |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 30.00 | C148 | C | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 7.75 | 0 | Q | Unknown |
Using the Series `str` accessor: non-string data is skipped (NaN stays NaN) instead of raising an error > the most efficient approach
data['DECK'] = data['CABIN'].str[0]
data.DECK = data['DECK'].fillna('Unknown')
data.head()
| | PASSENGERID | SURVIVED | PCLASS | NAME | SEX | AGE | SIBSP | PARCH | FARE | CABIN | EMBARKED | DECK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | NaN | S | Unknown |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C85 | C | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | NaN | S | Unknown |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | C123 | S | C |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | NaN | S | Unknown |
The CABIN derived variable is done: drop the column
data = data.drop('CABIN', axis=1)
data.tail()
| | PASSENGERID | SURVIVED | PCLASS | NAME | SEX | AGE | SIBSP | PARCH | FARE | EMBARKED | DECK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 13.00 | S | Unknown |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 30.00 | S | B |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | 23.45 | S | Unknown |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 30.00 | C | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 7.75 | Q | Unknown |
Creating a Derived Variable from NAME
An individual name is not a meaningful variable on its own, but the title written in the name can be extracted as categorical data with distinct characteristics.
- Mr, Ms, Miss, Master, ...

Extract the title from NAME and add it as a TITLE column:
1. Extract with two `split()` calls
   - 1.1. using a `lambda` expression
   - 1.2. using a user-defined function with the `map()` method
2. Extract with a regular expression
data.NAME.head(10)
0    Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2    Heikkinen, Miss. Laina
3    Futrelle, Mrs. Jacques Heath (Lily May Peel)
4    Allen, Mr. William Henry
5    Moran, Mr. James
6    McCarthy, Mr. Timothy J
7    Palsson, Master. Gosta Leonard
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9    Nasser, Mrs. Nicholas (Adele Achem)
Name: NAME, dtype: object
sample_lam = data.copy()
sample_map = data.copy()
sample_reg = data.copy()
1.1. Extract with two `split()` calls, using a `lambda` expression
title = sample_lam.NAME.apply(lambda x: x.split('.')[0].split(',')[1])
title
0 Mr
1 Mrs
2 Miss
3 Mrs
4 Mr
...
886 Rev
887 Miss
888 Miss
889 Mr
890 Mr
Name: NAME, Length: 891, dtype: object
sample_lam['TITLE'] = title
sample_lam.TITLE.value_counts()
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Mme               1
Don               1
Jonkheer          1
Name: TITLE, dtype: int64
1.2. Use a user-defined function with the `map()` method
def name_split(name) -> str:
return name.split(',')[1].split('.')[0]
sample_map['TITLE'] = sample_map['NAME'].map(name_split)
sample_map.TITLE.value_counts()
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Mme               1
Don               1
Jonkheer          1
Name: TITLE, dtype: int64
2. Extract with a regular expression
import re
def get_title(x):
pattern=re.compile('Sir|Mr|Don|Major|Capt|Jonkheer|Rev|Col|Dr|Mrs|Countess|Dona|Mme|Ms|Miss|Mlle|Master',re.I)
result = pattern.search(x)
if result:
return result.group()
else:
return str(np.nan)
sample_reg['NAME'].apply(get_title)
0 Mr
1 Mr
2 Miss
3 Mr
4 Mr
...
886 Rev
887 Miss
888 Miss
889 Mr
890 Mr
Name: NAME, Length: 891, dtype: object
sample_reg['TITLE'] = sample_reg['NAME'].str.extract(r'([a-zA-Z]+)\.', expand=False)
sample_reg.TITLE.value_counts()
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: TITLE, dtype: int64
- Apply the `lambda` approach to the original `data`
data['TITLE'] = data['NAME'].apply(lambda x: x.split(',')[1].split('.')[0])
data.TITLE.value_counts()
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Mme               1
Don               1
Jonkheer          1
Name: TITLE, dtype: int64
Drop the NAME column
data = data.drop('NAME', axis=1)
data.head()
| | PASSENGERID | SURVIVED | PCLASS | SEX | AGE | SIBSP | PARCH | FARE | EMBARKED | DECK | TITLE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Unknown | Mr |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C | Mrs |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Unknown | Miss |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | C | Mrs |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Unknown | Mr |
data.groupby('TITLE')['AGE'].mean()
TITLE
Capt            70.000000
Col             58.000000
Don             40.000000
Dr              42.000000
Jonkheer        38.000000
Lady            48.000000
Major           48.500000
Master           4.574167
Miss            21.773973
Mlle            24.000000
Mme             24.000000
Mr              32.368090
Mrs             35.898148
Ms              28.000000
Rev             43.166667
Sir             49.000000
the Countess    33.000000
Name: AGE, dtype: float64
data[['TITLE', 'AGE']].head()
| | TITLE | AGE |
|---|---|---|
| 0 | Mr | 22.0 |
| 1 | Mrs | 38.0 |
| 2 | Miss | 26.0 |
| 3 | Mrs | 35.0 |
| 4 | Mr | 35.0 |
`.transform('func')`, used after `df.groupby('column')['col']`
> returns each value of `col` replaced by the result of applying the aggregate function `'func'` to its group (same length as the original)
data.groupby('TITLE')['AGE'].transform('mean')
0 32.368090
1 35.898148
2 21.773973
3 35.898148
4 32.368090
...
886 43.166667
887 21.773973
888 21.773973
889 32.368090
890 32.368090
Name: AGE, Length: 891, dtype: float64
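The `transform` semantics above can be checked on a toy frame (illustrative values, not the Titanic data): `mean()` collapses to one row per group, while `transform('mean')` broadcasts the group mean back to every row.

```python
import pandas as pd

toy = pd.DataFrame({'TITLE': ['Mr', 'Miss', 'Mr'],
                    'AGE':   [30.0, 20.0, 40.0]})

# Aggregation: one value per group
print(toy.groupby('TITLE')['AGE'].mean().to_dict())
# {'Miss': 20.0, 'Mr': 35.0}

# transform: the group mean broadcast back to every row
print(toy.groupby('TITLE')['AGE'].transform('mean').tolist())
# [35.0, 20.0, 35.0]
```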
Replace AGE missing values with the mean of the same TITLE group: `obj.transform('mean')`
data['AGE'] = data['AGE'].fillna(data.groupby('TITLE')['AGE'].transform('mean'))
print(data['AGE'].value_counts())
print(data.AGE.isnull().sum())
32.368090 119
21.773973 36
24.000000 30
22.000000 27
18.000000 26
...
14.500000 1
70.500000 1
12.000000 1
36.500000 1
74.000000 1
Name: AGE, Length: 92, dtype: int64
0
Converting AGE to Categorical Data
Bins: 0–20, 20–40, 40–60, 60+
- A. user-defined function with `map()`
- B. `pd.cut()`
- C. `np.digitize()`
sample_A = data.copy()
sample_B = data.copy()
sample_C = data.copy()
A. Use the `map()` method with a user-defined function
def age_class(age):
if age < 20: return 'under20'
if age < 40: return 'under40'
if age < 60: return 'under60'
else: return 'over60'
sample_A['AGE_RANGE'] = sample_A['AGE'].map(age_class)
sample_A['AGE_RANGE'].value_counts()
under40    559
under20    168
under60    138
over60      26
Name: AGE_RANGE, dtype: int64
sample_A.tail()
| | PASSENGERID | SURVIVED | PCLASS | SEX | AGE | SIBSP | PARCH | FARE | EMBARKED | DECK | TITLE | AGE_RANGE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | male | 27.000000 | 0 | 0 | 13.00 | S | Unknown | Rev | under40 |
| 887 | 888 | 1 | 1 | female | 19.000000 | 0 | 0 | 30.00 | S | B | Miss | under20 |
| 888 | 889 | 0 | 3 | female | 21.773973 | 1 | 2 | 23.45 | S | Unknown | Miss | under40 |
| 889 | 890 | 1 | 1 | male | 26.000000 | 0 | 0 | 30.00 | C | C | Mr | under40 |
| 890 | 891 | 0 | 3 | male | 32.000000 | 0 | 0 | 7.75 | Q | Unknown | Mr | under40 |
B. pd.cut()
sample_B['AGE_RANGE']=pd.cut(sample_B['AGE'],[0,19,39,59,100],
labels=['under20','under40','under60','over60'])
sample_B['AGE_RANGE'].value_counts()
under40    559
under20    168
under60    138
over60      26
Name: AGE_RANGE, dtype: int64
sample_B.tail()
| | PASSENGERID | SURVIVED | PCLASS | SEX | AGE | SIBSP | PARCH | FARE | EMBARKED | DECK | TITLE | AGE_RANGE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | male | 27.000000 | 0 | 0 | 13.00 | S | Unknown | Rev | under40 |
| 887 | 888 | 1 | 1 | female | 19.000000 | 0 | 0 | 30.00 | S | B | Miss | under20 |
| 888 | 889 | 0 | 3 | female | 21.773973 | 1 | 2 | 23.45 | S | Unknown | Miss | under40 |
| 889 | 890 | 1 | 1 | male | 26.000000 | 0 | 0 | 30.00 | C | C | Mr | under40 |
| 890 | 891 | 0 | 3 | male | 32.000000 | 0 | 0 | 7.75 | Q | Unknown | Mr | under40 |
C. np.digitize()
sample_C['AGE_RANGE'] = np.digitize(sample_C['AGE'], [20, 40, 60])
range_list = list(sample_C['AGE_RANGE'].unique())
range_list
[1, 2, 0, 3]
range_list.sort()
range_map = dict()
for k, v in zip(range_list, ['under20', 'under40', 'under60', 'over60']):
range_map[k] = v
range_map
{0: 'under20', 1: 'under40', 2: 'under60', 3: 'over60'}
sample_C['AGE_RANGE'] = sample_C['AGE_RANGE'].map(range_map)
sample_C.AGE_RANGE.value_counts()
under40    559
under20    168
under60    138
over60      26
Name: AGE_RANGE, dtype: int64
- Apply the user-defined function approach to the original `data`
data['AGE_RANGE'] = data['AGE'].apply(age_class)
data['AGE_RANGE'].value_counts()
under40    559
under20    168
under60    138
over60      26
Name: AGE_RANGE, dtype: int64
data['SEX'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 891 entries, 0 to 890
Series name: SEX
Non-Null Count  Dtype
--------------  -----
891 non-null    object
dtypes: object(1)
memory usage: 7.1+ KB
sample_map = data.copy()
sample_func = data.copy()
- Map with a dict: `obj.map({...})`
sex_map = {'male':0, 'female':1}
sample_map['SEX_NUM'] = sample_map['SEX'].map(sex_map)
sample_map['SEX_NUM'].value_counts()
0    577
1    314
Name: SEX_NUM, dtype: int64
- Use `apply()` with a user-defined function
sample_func['SEX_NUM'] = -1
def change_str_to_num(value):
if value == 'male': return 0
if value == 'female': return 1
sample_func['SEX_MAP'] = sample_func['SEX'].apply(change_str_to_num)
sample_func['SEX_MAP'].value_counts()
0    577
1    314
Name: SEX_MAP, dtype: int64
- Apply the mapping to the original `data`
data['SEX_NUM'] = data['SEX'].map(sex_map)
data['SEX_NUM'].value_counts()
0    577
1    314
Name: SEX_NUM, dtype: int64
data.head()
| | PASSENGERID | SURVIVED | PCLASS | SEX | AGE | SIBSP | PARCH | FARE | EMBARKED | DECK | TITLE | AGE_RANGE | SEX_NUM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Unknown | Mr | under40 | 0 |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C | Mrs | under40 | 1 |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Unknown | Miss | under40 | 1 |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | C | Mrs | under40 | 1 |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Unknown | Mr | under40 | 0 |
data = data.drop('SEX', axis=1)
data.tail()
| | PASSENGERID | SURVIVED | PCLASS | AGE | SIBSP | PARCH | FARE | EMBARKED | DECK | TITLE | AGE_RANGE | SEX_NUM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | 27.000000 | 0 | 0 | 13.00 | S | Unknown | Rev | under40 | 0 |
| 887 | 888 | 1 | 1 | 19.000000 | 0 | 0 | 30.00 | S | B | Miss | under20 | 1 |
| 888 | 889 | 0 | 3 | 21.773973 | 1 | 2 | 23.45 | S | Unknown | Miss | under40 | 1 |
| 889 | 890 | 1 | 1 | 26.000000 | 0 | 0 | 30.00 | C | C | Mr | under40 | 0 |
| 890 | 891 | 0 | 3 | 32.000000 | 0 | 0 | 7.75 | Q | Unknown | Mr | under40 | 0 |
Integer Representation of FARE
Round at the first decimal place.
- Python built-in `round()`: round to nearest
- NumPy `np.ceil()`: round up

- Using Python's `round()`
data['FARE'] = data['FARE'].round(0)
data.FARE.value_counts()
8.0 206
7.0 65
26.0 46
13.0 46
10.0 43
...
75.0 1
76.0 1
4.0 1
212.0 1
5.0 1
Name: FARE, Length: 90, dtype: int64
data.columns
Index(['PASSENGERID', 'SURVIVED', 'PCLASS', 'AGE', 'SIBSP', 'PARCH', 'FARE',
'EMBARKED', 'DECK', 'TITLE', 'AGE_RANGE', 'SEX_NUM'],
dtype='object')
- Family size = (siblings + spouses) + (parents + children) + self (1)
data["FAMILY_SIZE"] = data.SIBSP + data.PARCH + 1
data.FAMILY_SIZE.value_counts()
1     537
2     161
3     102
4      29
6      22
5      15
7      12
11      7
8       6
Name: FAMILY_SIZE, dtype: int64
- Categorize using 3 and 7 as thresholds
data.FAMILY_SIZE = np.digitize(data.FAMILY_SIZE, [3, 7])
data.FAMILY_SIZE.unique()
array([0, 1, 2], dtype=int64)
int_var = data.FAMILY_SIZE.unique()
fam_map = dict()
for k, v in zip(int_var, ['small', 'medium', 'large']):
fam_map[k] = v
fam_map
{0: 'small', 1: 'medium', 2: 'large'}
data['FAMILY_SIZE'] = data['FAMILY_SIZE'].map(fam_map)
data.FAMILY_SIZE.value_counts()
small     698
medium    168
large      25
Name: FAMILY_SIZE, dtype: int64
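The `np.digitize()` plus dict-mapping steps above could also be collapsed into one call to `pd.cut()`. A sketch under the assumption that the 3 and 7 thresholds translate to the bin edges below:

```python
import pandas as pd

fam = pd.Series([1, 2, 3, 6, 7, 11])  # sample family sizes

# (0, 2] -> small, (2, 6] -> medium, (6, 11] -> large,
# matching np.digitize(x, [3, 7]): x < 3, 3 <= x < 7, x >= 7
size = pd.cut(fam, bins=[0, 2, 6, 11], labels=['small', 'medium', 'large'])
print(size.tolist())  # ['small', 'small', 'medium', 'medium', 'large', 'large']
```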
Correlation Between the Independent Variables and the Target Variable SURVIVED
data[['FARE', 'SURVIVED']].corr()
| | FARE | SURVIVED |
|---|---|---|
| FARE | 1.000000 | 0.257088 |
| SURVIVED | 0.257088 | 1.000000 |
> Positive, but only a weak correlation
data[['SEX_NUM', 'SURVIVED']].corr()
| | SEX_NUM | SURVIVED |
|---|---|---|
| SEX_NUM | 1.000000 | 0.543351 |
| SURVIVED | 0.543351 | 1.000000 |
> Positive and reasonably meaningful
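Instead of checking pairs one at a time, the whole correlation row for SURVIVED can be ranked in one call. A hedged sketch with toy values (`numeric_only=True` assumes pandas 1.5 or newer):

```python
import pandas as pd

df = pd.DataFrame({'SURVIVED': [0, 1, 1, 0, 1],
                   'FARE':     [7.0, 71.0, 8.0, 8.0, 53.0],
                   'SEX_NUM':  [0, 1, 1, 0, 1]})

# Correlation of every numeric column with the target, target itself excluded
corr = df.corr(numeric_only=True)['SURVIVED'].drop('SURVIVED')
print(corr.sort_values(ascending=False))  # SEX_NUM first, then FARE
```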
Data Visualization
Pie chart of the AGE_RANGE distribution
data['AGE_RANGE'].value_counts().plot(kind='pie', autopct='%.2f%%')
<AxesSubplot:ylabel='AGE_RANGE'>
pv_tb = data.pivot_table(index='SURVIVED', columns='AGE_RANGE',
aggfunc='count')['AGE']
pv_tb.plot(kind='bar', stacked=True)
<AxesSubplot:xlabel='SURVIVED'>
- Create the data for visualization
passed_away = data[data['SURVIVED'] == 0]['AGE_RANGE'].value_counts()
survived = data[data.SURVIVED == 1]['AGE_RANGE'].value_counts()
passed_away.name = 'Passed Away'
survived.name = 'Survived'
df_vis = pd.DataFrame([passed_away, survived])
df_vis
| | under40 | under20 | under60 | over60 |
|---|---|---|---|---|
| Passed Away | 359 | 87 | 84 | 19 |
| Survived | 200 | 81 | 54 | 7 |
df_vis.plot(kind='bar', stacked=True)
<AxesSubplot:>
Visualizing SURVIVED by AGE_RANGE
age_surv = data.pivot_table(index='AGE_RANGE', columns='SURVIVED',
aggfunc='count')['AGE']
age_surv.plot(kind='bar', stacked=True)
<AxesSubplot:xlabel='AGE_RANGE'>
SURVIVED by FAMILY_SIZE Bar Chart
fam_surv = data.pivot_table(index='FAMILY_SIZE', columns='SURVIVED',
aggfunc='count')['SEX_NUM']
fam_surv.plot(kind='bar', stacked=True)
<AxesSubplot:xlabel='FAMILY_SIZE'>
surv_class = data.groupby('PCLASS')['SURVIVED'].sum()
gone_class = data.groupby('PCLASS')['SURVIVED'].count()
print(surv_class)
print(gone_class)
PCLASS
1    136
2     87
3    119
Name: SURVIVED, dtype: int64
PCLASS
1    216
2    184
3    491
Name: SURVIVED, dtype: int64
gone_class = gone_class.sub(surv_class)
gone_class
PCLASS
1     80
2     97
3    372
Name: SURVIVED, dtype: int64
gone_class = gone_class.rename('PASSED AWAY')
pclass_surv = pd.DataFrame([surv_class, gone_class])
pclass_surv
| PCLASS | 1 | 2 | 3 |
|---|---|---|---|
| SURVIVED | 136 | 87 | 119 |
| PASSED AWAY | 80 | 97 | 372 |
pclass_surv.plot(kind='bar', stacked=True)
<AxesSubplot:>
Survival Ratio by FAMILY_SIZE
data['SIZE_COUNT'] = data['SIBSP'] + data['PARCH'] + 1
data.pivot_table(index='FAMILY_SIZE',columns='SURVIVED',
aggfunc='count')['SIZE_COUNT'].plot(kind='bar',stacked=True)
<AxesSubplot:xlabel='FAMILY_SIZE'>
EMBARKED Visualization
data.EMBARKED.isna().sum()
2
- Handle missing values: fill with the mode
data.EMBARKED.value_counts()
S    644
C    168
Q     77
Name: EMBARKED, dtype: int64
data.EMBARKED = data.EMBARKED.fillna('S')
print(data.EMBARKED.count())
print(data.EMBARKED.value_counts())
891
S    646
C    168
Q     77
Name: EMBARKED, dtype: int64
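Filling with the literal 'S' works for this split; a more reusable sketch (my variable names, not from the post) derives the fill value from `mode()` instead:

```python
import pandas as pd
import numpy as np

emb = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S'])  # toy EMBARKED column

# mode() returns a Series (ties are possible); take the first mode
fill_value = emb.mode()[0]
emb = emb.fillna(fill_value)
print(fill_value, int(emb.isna().sum()))  # S 0
```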
data['EMBARKED_COUNT'] = data['EMBARKED'].map({'S':0, 'C':1, 'Q':2})
data.EMBARKED_COUNT.value_counts()
0    646
1    168
2     77
Name: EMBARKED_COUNT, dtype: int64
data.pivot_table(index='EMBARKED', columns='SURVIVED',
aggfunc='count')['EMBARKED_COUNT'].plot(kind='bar', stacked=True)
<AxesSubplot:xlabel='EMBARKED'>
data.EMBARKED.value_counts().plot(kind='pie')
<AxesSubplot:ylabel='EMBARKED'>