728x90
시계열 데이터 결측치 처리¶
0. 라이브러리 불러오기 & 데이터 준비¶
In [ ]:
import pandas as pd
import seaborn as sns
import numpy as np
df= pd.read_csv('seattle-weather.csv') # 시계열 데이터 불리오기
In [ ]:
df
Out[ ]:
| date | precipitation | temp_max | temp_min | wind | weather | |
|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle |
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain |
| ... | ... | ... | ... | ... | ... | ... |
| 1456 | 2015-12-27 | 8.6 | 4.4 | 1.7 | 2.9 | rain |
| 1457 | 2015-12-28 | 1.5 | 5.0 | 1.7 | 1.3 | rain |
| 1458 | 2015-12-29 | 0.0 | 7.2 | 0.6 | 2.6 | fog |
| 1459 | 2015-12-30 | 0.0 | 5.6 | -1.0 | 3.4 | sun |
| 1460 | 2015-12-31 | 0.0 | 5.6 | -2.1 | 3.5 | sun |
1461 rows × 6 columns
In [ ]:
df.info() # date 의 타입이 object > datetime으로 타입 변경이 필요하다.
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1461 entries, 0 to 1460 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 1461 non-null object 1 precipitation 1461 non-null float64 2 temp_max 1461 non-null float64 3 temp_min 1461 non-null float64 4 wind 1461 non-null float64 5 weather 1461 non-null object dtypes: float64(4), object(2) memory usage: 68.6+ KB
- 결측치 확인
In [ ]:
df.isna().sum() # 결측치 없음
Out[ ]:
date 0 precipitation 0 temp_max 0 temp_min 0 wind 0 weather 0 dtype: int64
- 분포 확인
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns
cols = ['precipitation', 'temp_max', 'temp_min', 'wind']
fig,ax = plt.subplots()
for col in cols:
ax.plot(df['date'], df[col], label = col)
plt.show()
1. 결측치 임의 생성¶
- 'precipitation'을 y로, 'date','temp_max','temp_min','wind'를 x로 하는 모델로 선형보간법을 진행하여 모델 성능을 확인할 것이다.
In [ ]:
df_x = df[['temp_max','temp_min','wind']]
In [ ]:
## 결측치를 임의로 만들어야 하는 상황
msv = np.random.randint(0,389, size=40) # 40개의 무작위 정수를 msv 배열에 저장
df_x.iloc[msv] = np.nan # msv 행 인덱스를 사용하여 지정된 인덱스에 np.nan 값 부여
# 결측치 만들기
df_x.isna().sum()
<ipython-input-7-5e393bea1303>:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_x.iloc[msv] = np.nan # msv 행 인덱스를 사용하여 지정된 인덱스에 np.nan 값 부여
Out[ ]:
temp_max 40 temp_min 40 wind 40 dtype: int64
In [ ]:
df_t = pd.concat([df_x,df['date'],df['precipitation']], axis = 1) # 선형보간(method = 'time')
df_t1 = pd.concat([df_x,df['precipitation']], axis = 1) # impute
3. 결측치 채우기¶
- 선형보간법 사용
In [ ]:
df_t['date'] = pd.to_datetime(df_t['date'])
df_t.set_index('date', inplace=True)
df_t = df_t.interpolate(method = 'time')
- Simple imputer(mean/median/most_frequent)
In [ ]:
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer # 임퓨터 불러오기
# SimpleImputer 인스턴스 생성
mean_imputer = SimpleImputer()
median_imputer = SimpleImputer(strategy='median', fill_value=None, verbose=0, copy=True, add_indicator=False)
most_imputer = SimpleImputer(strategy='most_frequent', fill_value=None, verbose=0, copy=True, add_indicator=False)
# strategy가 constant(지정값)인 경우에는 fill_value값을 적어줘야 한다.
In [ ]:
df_mean = pd.DataFrame(mean_imputer.fit_transform(df_t1))
df_median = pd.DataFrame(median_imputer.fit_transform(df_t1))
df_most = pd.DataFrame(most_imputer.fit_transform(df_t1))
/usr/local/lib/python3.10/dist-packages/sklearn/impute/_base.py:382: FutureWarning: The 'verbose' parameter was deprecated in version 1.1 and will be removed in 1.3. A warning will always be raised upon the removal of empty columns in the future version. warnings.warn( /usr/local/lib/python3.10/dist-packages/sklearn/impute/_base.py:382: FutureWarning: The 'verbose' parameter was deprecated in version 1.1 and will be removed in 1.3. A warning will always be raised upon the removal of empty columns in the future version. warnings.warn(
- Iterative Imputer
In [ ]:
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(imputation_order = 'descending',
max_iter=10, random_state=111,
n_nearest_features=4)
df_itr =pd.DataFrame(imputer.fit_transform(df_t1))
/usr/local/lib/python3.10/dist-packages/sklearn/impute/_iterative.py:785: ConvergenceWarning: [IterativeImputer] Early stopping criterion not reached. warnings.warn(
In [ ]:
df_mean
Out[ ]:
| 0 | 1 | 2 | 3 | |
|---|---|---|---|---|
| 0 | 12.8 | 5.0 | 4.7 | 0.0 |
| 1 | 10.6 | 2.8 | 4.5 | 10.9 |
| 2 | 11.7 | 7.2 | 2.3 | 0.8 |
| 3 | 12.2 | 5.6 | 4.7 | 20.3 |
| 4 | 8.9 | 2.8 | 6.1 | 1.3 |
| ... | ... | ... | ... | ... |
| 1456 | 4.4 | 1.7 | 2.9 | 8.6 |
| 1457 | 5.0 | 1.7 | 1.3 | 1.5 |
| 1458 | 7.2 | 0.6 | 2.6 | 0.0 |
| 1459 | 5.6 | -1.0 | 3.4 | 0.0 |
| 1460 | 5.6 | -2.1 | 3.5 | 0.0 |
1461 rows × 4 columns
In [ ]:
# 확인
df_mean.isna().sum()
Out[ ]:
0 0 1 0 2 0 3 0 dtype: int64
In [ ]:
#컬럼명 바꿔주기(simpleImputer 사용할 경우, numpy 배열로 나옴)
df_mean.columns = ['temp_max','temp_min','wind','precipitation']
df_median.columns = ['temp_max','temp_min','wind','precipitation']
df_most.columns = ['temp_max','temp_min','wind','precipitation']
df_itr.columns = ['temp_max','temp_min','wind','precipitation']
회귀분석¶
- 회귀분석 테스트 함수
In [ ]:
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
def mse_test(data):
X_train, X_test, y_train, y_test = train_test_split(data.drop('precipitation', axis=1), data['precipitation'], test_size=0.3, random_state=111)
fit_train = sm.OLS(y_train, X_train)
fit_train = fit_train.fit()
mse = mean_squared_error(y_true=y_test, y_pred=fit_train.predict(X_test))
# 시각화 코드
plt.plot(np.array(fit_train.predict(X_test)), label='pred')
plt.plot(np.array(y_test), label='True')
plt.legend()
plt.show()
return mse
- 선형보간
In [ ]:
mse_test(df_t)
Out[ ]:
47.15698626290301
- simpleimputer(strategy = 'mean')
In [ ]:
mse_test(df_mean)
Out[ ]:
47.200015812078554
- simpleimputer(strategy = 'median')
In [ ]:
mse_test(df_median)
Out[ ]:
47.19312106746696
- simpleimputer(strategy = 'most_frequent')
In [ ]:
mse_test(df_most)
Out[ ]:
47.18673224704773
- Iterative Imputer
In [ ]:
mse_test(df_itr)
Out[ ]:
46.97680025211388
728x90
'Data Science & AI > Data Analysis' 카테고리의 다른 글
| 시계열 분석 - 시계열 데이터 특성, ARIMA (1) | 2024.01.08 |
|---|---|
| Correlation Analytics - 상관계수, 공분산 계산 (0) | 2023.12.01 |
| 다중 시각화 그래프 (matplotlib, gridspec, seaborn) (1) | 2023.12.01 |
| [텍스트 분석] 정규표현식(전화번호 패턴, 이메일 패턴) (0) | 2023.08.31 |
| [Python] 데이터 분석_ Index Alignment (1) | 2023.08.10 |