import pandas as pd


url = 'https://raw.githubusercontent.com/dev-EthanJ/scikit-learn_Machine_Learning/main/data/insurance.csv'

df = pd.read_csv(url)

df.head()


df.tail()


# Check which variables (columns) the data contains
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   expenses  1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


# Format pandas float data to two decimal places
pd.options.display.float_format = '{:.2f}'.format


# Descriptive statistics
df.describe()


from sklearn.linear_model import LinearRegression


# Preprocess the df.smoker object column (convert it to numbers)
df.smoker

0       yes
1        no
2        no
3        no
4        no
       ... 
1333     no
1334     no
1335     no
1336     no
1337    yes
Name: smoker, Length: 1338, dtype: object


df.smoker.unique()

array(['yes', 'no'], dtype=object)


# df.smoker == 'yes'
df.smoker.eq('yes')

0        True
1       False
2       False
3       False
4       False
        ...  
1333    False
1334    False
1335    False
1336    False
1337     True
Name: smoker, Length: 1338, dtype: bool


# Convert the object (str) values to numbers

# Equivalent to: df.smoker.eq('yes') * 1
df.smoker.eq('yes').mul(1)

0       1
1       0
2       0
3       0
4       0
       ..
1333    0
1334    0
1335    0
1336    0
1337    1
Name: smoker, Length: 1338, dtype: int64


df.smoker = df.smoker.eq('yes').mul(1)
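The `yes`/`no` to `1`/`0` conversion can be checked on a small stand-alone Series. This is a minimal sketch with made-up values (the `toy` Series is hypothetical, not taken from the insurance data). It also shows `map` with an explicit dictionary as one possible alternative that gives the same result here.

```python
import pandas as pd

# Hypothetical toy column standing in for df.smoker
toy = pd.Series(['yes', 'no', 'no', 'yes'], name='smoker')

# Approach used in this post: boolean comparison, then multiply by 1
encoded_mul = toy.eq('yes').mul(1)

# Alternative: an explicit mapping from label to number
encoded_map = toy.map({'yes': 1, 'no': 0})

print(encoded_mul.tolist())
print(encoded_map.tolist())
```

With `map`, any label missing from the dictionary becomes `NaN`, so it surfaces unexpected values, whereas `eq('yes')` silently turns everything that is not `'yes'` into `0`.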

df.head()


df.sex.unique()

array(['female', 'male'], dtype=object)


print(df.region.unique())
df.region.nunique()

['southwest' 'southeast' 'northwest' 'northeast']

4


df.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker        int64
region       object
expenses    float64
dtype: object


df = pd.get_dummies(df, columns=['sex', 'region'], drop_first=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               1338 non-null   int64  
 1   bmi               1338 non-null   float64
 2   children          1338 non-null   int64  
 3   smoker            1338 non-null   int64  
 4   expenses          1338 non-null   float64
 5   sex_male          1338 non-null   uint8  
 6   region_northwest  1338 non-null   uint8  
 7   region_southeast  1338 non-null   uint8  
 8   region_southwest  1338 non-null   uint8  
dtypes: float64(2), int64(3), uint8(4)
memory usage: 57.6 KB
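To see what `drop_first=True` does, the sketch below encodes a hypothetical four-row `toy` frame with and without it. Passing `dtype=int` is an addition not used in the post; it keeps the dummy columns as 0/1 integers, since newer pandas versions return booleans by default instead of the `uint8` shown above.

```python
import pandas as pd

# Hypothetical toy frame with the four region labels from the data
toy = pd.DataFrame({'region': ['southwest', 'southeast', 'northwest', 'northeast']})

# One column per category
full = pd.get_dummies(toy, columns=['region'], dtype=int)

# drop_first=True drops the first category in sorted order (region_northeast)
reduced = pd.get_dummies(toy, columns=['region'], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))

# The dropped category is represented by a row of all zeros
print(reduced.iloc[3].tolist())
```

Dropping one category avoids perfectly collinear columns: a row with zeros in all three remaining `region_*` columns can only be `northeast`.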


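origin_list = ['a', 'b', 'c', 'd']

# join + split builds a new, independent list of strings;
# list(origin_list) or origin_list.copy() would do the same here
copied_list = '.'.join(origin_list).split('.')

copied_list.append('e')

print(copied_list)
print(origin_list)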

['a', 'b', 'c', 'd', 'e']
['a', 'b', 'c', 'd']
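For a flat list of strings, any copy is enough to keep the original untouched. The difference between a shallow and a deep copy only shows up with nested objects. The following is a minimal sketch using the standard `copy` module; the nested `origin` list is a made-up example.

```python
import copy

# A list that contains other lists
origin = [['a', 'b'], ['c', 'd']]

shallow = copy.copy(origin)    # new outer list, inner lists are shared
deep = copy.deepcopy(origin)   # inner lists are copied as well

# Mutate an inner list of the original
origin[0].append('x')

print(shallow[0])  # ['a', 'b', 'x'] because the inner list is shared
print(deep[0])     # ['a', 'b'] because the deep copy is independent
```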


df.columns

Index(['age', 'bmi', 'children', 'smoker', 'expenses', 'sex_male',
       'region_northwest', 'region_southeast', 'region_southwest'],
      dtype='object')


# X: independent variables (features), y: dependent variable (target)
X = df.drop('expenses', axis=1)
y = df.expenses


from sklearn.model_selection import train_test_split


help(train_test_split)

Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
    Split arrays or matrices into random train and test subsets.
    
    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.
    
    test_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. If ``train_size`` is also None, it will
        be set to 0.25.
    
    train_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples. If None,
        the value is automatically set to the complement of the test size.
    
    random_state : int, RandomState instance or None, default=None
        Controls the shuffling applied to the data before applying the split.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.
    
    shuffle : bool, default=True
        Whether or not to shuffle the data before splitting. If shuffle=False
        then stratify must be None.
    
    stratify : array-like, default=None
        If not None, data is split in a stratified fashion, using this as
        the class labels.
        Read more in the :ref:`User Guide <stratification>`.
    
    Returns
    -------
    splitting : list, length=2 * len(arrays)
        List containing train-test split of inputs.
    
        .. versionadded:: 0.16
            If the input is sparse, the output will be a
            ``scipy.sparse.csr_matrix``. Else, output type is the same as the
            input type.
    
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> X, y = np.arange(10).reshape((5, 2)), range(5)
    >>> X
    array([[0, 1],
           [2, 3],
           [4, 5],
           [6, 7],
           [8, 9]])
    >>> list(y)
    [0, 1, 2, 3, 4]
    
    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     X, y, test_size=0.33, random_state=42)
    ...
    >>> X_train
    array([[4, 5],
           [0, 1],
           [6, 7]])
    >>> y_train
    [2, 0, 3]
    >>> X_test
    array([[2, 3],
           [8, 9]])
    >>> y_test
    [1, 4]
    
    >>> train_test_split(y, shuffle=False)
    [[0, 1, 2], [3, 4]]


# test_size: ratio of the data held out as the test set
# random_state: seed value that keeps the otherwise random split fixed (reproducible)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state = 200
)

len(train_test_split(X, y, test_size = 0.2, random_state=200))

4
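The four returned objects are, in order, the train and test parts of `X` followed by the train and test parts of `y`. The sketch below uses synthetic arrays (`X_toy`, `y_toy` are hypothetical) with the same number of rows as the insurance data to show how `test_size=0.2` divides 1338 rows and that a fixed `random_state` reproduces the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same row count as the insurance data
n_rows = 1338
X_toy = np.arange(n_rows * 2).reshape(n_rows, 2)
y_toy = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=200
)
print(X_tr.shape, X_te.shape)  # (1070, 2) (268, 2)

# Same seed, same split
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=200
)
print(np.array_equal(X_te, X_te_again))
```

The test part is rounded up (0.2 × 1338 = 267.6, so 268 rows), which matches the 268 predictions returned by `model.predict(X_test)`.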


from sklearn.linear_model import LinearRegression


model = LinearRegression()


model.fit(X_train, y_train)

LinearRegression()


pred = model.predict(X_test)

pred

array([11583.75098752,  8723.46560984,  7794.17799753,  8537.18904201,
        1900.99306943,  5000.1965654 , 40184.13707971,  4332.65458179,
       33604.62827378,  8566.77438349, 34761.58427257, 10617.81670868,
        9203.54356439,  6425.0666393 , 14236.30630426,  5926.50264561,
       15334.62115765, 27775.22335916, 11729.02225496, 10502.41605544,
       10966.06403594,  2757.84040418, 11719.86336504, 34984.82793476,
        5779.75874448,  -144.15498223, 25007.46443988, 10374.70604051,
        8619.47240785, 40243.70301175, 18891.94370162, 12035.16897936,
        5042.03755231,  3137.62318663,  4433.98245405,  9068.79116388,
       12923.54652222,  -601.63917125, 13236.88773829,  1919.81372069,
       30080.74972726, 38498.04504965, 28122.74348022, 12606.12034078,
       11730.91113956, 11933.2727501 , 16303.21034674, 12627.02280213,
        -338.02188294, 14181.44145783, 32937.55388043, 13037.92391076,
       16870.81536476, 33940.65858658, 13331.7608698 , 38656.52249415,
       11907.55131166,  5940.03344456, 32417.93860425,   915.65265758,
        7341.08147231, 10037.71696504, 12918.74951327, 26991.92835549,
       14660.60445307,  5817.1786837 ,  4924.14408515, 37221.16703746,
        5747.21765373, 13563.17881399,  9658.83318639, 37366.68256975,
        6413.18158197, 11040.45673381,  4224.51283403, 27144.75945564,
       31018.99279629,   304.991258  ,  2924.60964279,  -457.81019999,
       12342.24880966,  6395.59258452, 27761.29144612, 14307.42913867,
        4658.62215747, 30788.4671386 , 16113.81117814, 14249.13557958,
       12244.76383861,  7080.40736321,  9453.25521437, 11072.89456597,
       16291.12702863,  5870.18557518,  3830.14961297, 13568.17555395,
       10853.02827709,  9687.26209656, 14095.95004976,  4331.6442913 ,
        8960.00148268,  9111.58268242, 15500.40378838, 17148.31044572,
        6270.54007193,  2853.53349279,  7036.9620462 , 13959.27415863,
        5394.01708623, 34782.37079229,  7089.96893769, 11735.20645614,
        2137.12258111, 28259.5194457 ,  2350.11889996, 11977.3272435 ,
        3806.3328139 , 33949.91447865,  8262.6949245 , 14400.86056615,
        9192.95725498,  9370.06222451, 13979.49085112, 11097.07729822,
        3795.89411553, 28821.63641303, 40382.26313254,  7122.47386397,
        2031.70283772, 34429.51021004,  3332.04268958, 12415.26894819,
       37065.37068971,  8324.74109994,   -55.491305  ,  3084.61482491,
        7755.4495209 , 11220.59257484,  5657.86820765,  8579.34157636,
       31051.18689333,  8667.8988103 , 28254.82301502,  1875.54946821,
       36147.07060116,  9870.6640242 , 32219.08009814,  6694.90429325,
        2548.35770971,  4896.43710831, 39243.70091389, 26954.9917613 ,
       35517.83765857,  3570.79147678,  3318.64358703,  8117.86031767,
       31944.1297626 , 12452.12537381, 14233.85430973, 13736.01484192,
       36320.56776279, 29605.87461309,  8904.0545967 , 23603.34472123,
       10453.51155504, 39539.09126734, 10059.47761079,   941.91719992,
       16802.09886207, 13815.6019484 ,  9225.18679829,  3562.52546551,
        5769.47215221, 31875.33784653, 34748.77540695,  5147.13863909,
         209.75278546,  7630.79078717, 11562.09259099, 11089.14789902,
        9318.83364913, 17274.91782292, -1424.87218118,  2776.06910403,
       12641.27543416, 15032.29898712, 28850.52766639,  3539.02732305,
        5957.29585801,  6403.58907784, 13820.75544938, 10664.36156177,
       13335.88367059, 26183.30499935,  8966.87977199,  3310.38223069,
       15412.71462859, 37208.6669387 , 36978.894144  , 30350.99374237,
       12072.36447129, 29431.19051996, 38427.40496812,  5849.94641063,
       15189.48158661,  7485.67162582,  2699.81317833, 34145.3646818 ,
       13590.43323714,  2229.96524188, 11412.49507732,  1165.14045242,
        7175.836307  ,  4891.50746247,  4456.19492304,  4375.77537716,
       11906.79844868, 10560.07794025, 10769.90703628, 27074.38889206,
       12567.91921328, 13876.30502392, 10377.736912  , 12780.96119493,
        7708.77893669, 32925.71242354, 29872.21063284, 24785.91683704,
        5214.47773903, 26292.43129261,  1399.51414587, 11745.21521119,
       33938.0031466 , 11262.24807177, 12313.47031323, 12378.58611719,
         643.39917356,  2278.2348833 , 15525.49148971,  6314.83305338,
       13677.04294947, 16040.89646129, 15278.48134617,  8909.20809769,
        8393.14726528, 34190.30314244,  3716.08315392, 25803.67579323,
       15114.36168253, 11398.87789642, 13073.57544943, -2164.52160712,
        9191.83567778, 29007.4348829 , 16921.5117476 ,  9246.5673576 ,
       -1249.55322907, 17146.52688254, 11061.57727351,  4346.78407526,
       15391.53751473, 33711.16887616, 12728.10640954, 11206.25977424,
         960.58109016, 33708.45420703, 10076.46585137, 24190.37162703])


# Compare the actual and predicted values side by side to check accuracy
comparison = pd.DataFrame({'actual':y_test, 'pred':pred})

X_test.head()


comparison.head()


import matplotlib.pyplot as plt
import seaborn as sns


# Set the figure size
plt.figure(figsize=(8, 8))

sns.scatterplot(x='actual', y='pred', data=comparison)
plt.show()


from sklearn.metrics import mean_squared_error


# MSE (mean squared error): the mean of the squared errors between predicted and actual values
mean_squared_error(y_test, pred)

36760484.5645236


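# RMSE (root mean squared error): the square root of MSE
# Equivalent to: mean_squared_error(y_test, pred) ** 0.5
# Note: squared=False works on the scikit-learn 1.0.2 used here; it was deprecated in 1.4 in favor of root_mean_squared_error
mean_squared_error(y_test, pred, squared=False)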

6063.042517129795
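Because the `squared` argument depends on the scikit-learn version, taking the square root of the MSE yourself is a version-independent way to get the RMSE. A minimal sketch, assuming small made-up arrays (`y_true`, `y_hat`) in place of `y_test` and `pred`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values standing in for y_test and pred
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)

# The same MSE written out by hand
manual_mse = np.mean((y_true - y_hat) ** 2)

print(mse, manual_mse, rmse)

# Square root of the MSE reported in this post
print(np.sqrt(36760484.5645236))
```

The last line reproduces the RMSE of about 6063.04 shown above. Unlike the MSE, it is in the same unit as `expenses`.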


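# R^2 (coefficient of determination): how well the independent variables explain the variation in the dependent variable
# Note: this is scored on the training set; model.score(X_test, y_test) would give the held-out value
R_squared = model.score(X_train, y_train)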

R_squared

0.7462462636958302
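`model.score` returns R² = 1 − SS_res / SS_tot for whichever data it is given. The sketch below computes that by hand on synthetic data and compares it with `score` on both the training and the held-out part. All names here (`X_toy`, `toy_model`, `r_squared`) are hypothetical and the numbers are unrelated to the insurance model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic linear data with noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
toy_model = LinearRegression().fit(X_tr, y_tr)


def r_squared(y_true, y_hat):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


manual_train_r2 = r_squared(y_tr, toy_model.predict(X_tr))
manual_test_r2 = r_squared(y_te, toy_model.predict(X_te))

print(manual_train_r2, toy_model.score(X_tr, y_tr))
print(manual_test_r2, toy_model.score(X_te, y_te))
```

Reporting the test-set value alongside the training value is the usual way to check that the model generalizes rather than just fits the data it has seen.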


model.coef_

array([  262.45023678,   326.77442073,   551.41412416, 23798.72509414,
         -77.28662589,  -166.61566228,  -795.09255719,  -957.17123014])


pd.Series(model.coef_, index=X.columns)

age                  262.45
bmi                  326.77
children             551.41
smoker             23798.73
sex_male             -77.29
region_northwest    -166.62
region_southeast    -795.09
region_southwest    -957.17
dtype: float64


model.intercept_

-12010.489564845571
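A linear regression prediction is the intercept plus the sum of each coefficient times its feature value. As a rough worked example, the sketch below applies the rounded coefficients printed above to row 0 of the data (age 19, female, bmi 27.9, no children, smoker, southwest). Because the inputs are rounded, the result only approximates what `model.predict` would return for that row.

```python
# Rounded coefficients and intercept copied from the outputs above
coef = {
    'age': 262.45, 'bmi': 326.77, 'children': 551.41, 'smoker': 23798.73,
    'sex_male': -77.29, 'region_northwest': -166.62,
    'region_southeast': -795.09, 'region_southwest': -957.17,
}
intercept = -12010.49

# Row 0 after encoding: a 19-year-old female smoker from the southwest
row = {
    'age': 19, 'bmi': 27.9, 'children': 0, 'smoker': 1,
    'sex_male': 0, 'region_northwest': 0,
    'region_southeast': 0, 'region_southwest': 1,
}

# prediction = intercept + sum(coefficient * feature value)
manual_pred = intercept + sum(coef[name] * row[name] for name in coef)
print(round(manual_pred, 2))
```

The `smoker` coefficient dominates: being a smoker adds roughly 23,800 to the predicted expenses on its own, while the negative intercept explains how the model can output negative predictions for young non-smokers.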


!pip install mlxtend

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: mlxtend in /usr/local/lib/python3.7/dist-packages (0.14.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from mlxtend) (57.4.0)
Requirement already satisfied: pandas>=0.17.1 in /usr/local/lib/python3.7/dist-packages (from mlxtend) (1.3.5)
Requirement already satisfied: scikit-learn>=0.18 in /usr/local/lib/python3.7/dist-packages (from mlxtend) (1.0.2)
Requirement already satisfied: matplotlib>=1.5.1 in /usr/local/lib/python3.7/dist-packages (from mlxtend) (3.2.2)
Requirement already satisfied: scipy>=0.17 in /usr/local/lib/python3.7/dist-packages (from mlxtend) (1.7.3)
Requirement already satisfied: numpy>=1.10.4 in /usr/local/lib/python3.7/dist-packages (from mlxtend) (1.21.6)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=1.5.1->mlxtend) (1.4.4)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=1.5.1->mlxtend) (2.8.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=1.5.1->mlxtend) (0.11.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=1.5.1->mlxtend) (3.0.9)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from kiwisolver>=1.0.1->matplotlib>=1.5.1->mlxtend) (4.1.1)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.17.1->mlxtend) (2022.6)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.1->matplotlib>=1.5.1->mlxtend) (1.15.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.18->mlxtend) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.18->mlxtend) (1.2.0)


# Create a .pkl file with the joblib library
import joblib

joblib.dump(model, 'first_model.pkl')

['first_model.pkl']


# A .pkl file lets you move the trained machine learning model around as a single file
model_from_joblib = joblib.load('first_model.pkl')

pd.Series(model_from_joblib.coef_, index=X.columns)

age                  262.45
bmi                  326.77
children             551.41
smoker             23798.73
sex_male             -77.29
region_northwest    -166.62
region_southeast    -795.09
region_southwest    -957.17
dtype: float64
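Matching coefficients suggest the model survived the round trip. Comparing predictions is a slightly stronger check. The sketch below does this with a tiny hypothetical model (`toy_model`) saved to a temporary directory, so it does not touch `first_model.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset following y = 2x + 1
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])
toy_model = LinearRegression().fit(X_toy, y_toy)

# Save and reload inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored model should predict exactly like the original
same_predictions = np.allclose(toy_model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Pickle-based files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.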

Output of `df.describe()`:

          age      bmi  children  expenses
count 1338.00  1338.00   1338.00   1338.00
mean    39.21    30.67      1.09  13270.42
std     14.05     6.10      1.21  12110.01
min     18.00    16.00      0.00   1121.87
25%     27.00    26.30      0.00   4740.29
50%     39.00    30.40      1.00   9382.03
75%     51.00    34.70      2.00  16639.92
max     64.00    53.10      5.00  63770.43

Output of `comparison.head()`:

        actual      pred
992   10118.42  11583.75
937    8965.80   8723.47
688   26236.58   7794.18
1185   8603.82   8537.19
1137   3176.29   1900.99

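Metric	Description
MAE (Mean Absolute Error)	- Mean of the absolute errors between actual and predicted values - The closer to 0, the better
MSE (Mean Squared Error)	- Mean of the squared errors between actual and predicted values - The closer to 0, the better
RMSE (Root Mean Squared Error)	- Square root of MSE - The closer to 0, the better - The most commonly used metric when predicting a continuous variable
R²	- Coefficient of determination - Indicates how much of the dependent variable the independent variables explain, i.e. explanatory power - The closer to 1, the better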
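The post computes MSE, RMSE, and R² but not MAE. As a small illustration, the sketch below evaluates all four metrics on just the five rows shown by `comparison.head()`. Five rows are far too few to judge the model, so the numbers only show how the metrics are called and how they react to one large miss (row 688).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# The five rows from comparison.head()
actual = np.array([10118.42, 8965.80, 26236.58, 8603.82, 3176.29])
predicted = np.array([11583.75, 8723.47, 7794.18, 8537.19, 1900.99])

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)
r2 = r2_score(actual, predicted)

print(f'MAE:  {mae:.2f}')
print(f'MSE:  {mse:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'R^2:  {r2:.2f}')
```

Because the errors are squared before averaging, the single large miss weighs much more heavily in RMSE than in MAE.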


Try to Developer: EthanJ's Growth Log

scikit-learn Machine Learning Linear Regression
Regression analysis with scikit-learn machine learning

1. Data collection

2. Data pre-processing

2.1. Pre-processing categorical data

2.2. Dummy variables and one-hot encoding

Shallow copy vs. deep copy

2.3. Train set and test set

Independent and dependent variables

3. Model training

3.1. `model.fit(X_train, y_train)`

3.2. `model.predict(X_test)`

4. Model evaluation

4.1. Evaluating with a table

4.2. Evaluating with a plot

4.3. RMSE & R² (coefficient of determination)

5. Model deployment


Output of the first `df.head()`:

   age     sex   bmi  children smoker     region  expenses
0   19  female  27.9         0    yes  southwest  16884.92
1   18    male  33.8         1     no  southeast   1725.55
2   28    male  33.0         3     no  southeast   4449.46
3   33    male  22.7         0     no  northwest  21984.47
4   32    male  28.9         0     no  northwest   3866.86

Output of `df.tail()`:

      age     sex   bmi  children smoker     region  expenses
1333   50    male  31.0         3     no  northwest  10600.55
1334   18  female  31.9         0     no  northeast   2205.98
1335   18  female  36.9         0     no  southeast   1629.83
1336   21  female  25.8         0     no  southwest   2007.95
1337   61  female  29.1         0    yes  northwest  29141.36

Output of `df.head()` after converting `smoker` to numbers:

   age     sex    bmi  children  smoker     region  expenses
0   19  female  27.90         0       1  southwest  16884.92
1   18    male  33.80         1       0  southeast   1725.55
2   28    male  33.00         3       0  southeast   4449.46
3   33    male  22.70         0       0  northwest  21984.47
4   32    male  28.90         0       0  northwest   3866.86

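Output of `X_test.head()` (the `smoker` and `region_southeast` columns are missing from the extracted table):

      age    bmi  children  sex_male  region_northwest  region_southwest
992    50  31.60         2         0                 0                 1
937    39  24.20         5         0                 1                 0
688    47  24.10         1         0                 0                 1
1185   45  23.60         2         1                 0                 0
1137   26  22.20         0         0                 1                 0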
