Kaggle Time-Series Data Analysis

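This post walks through a time-series analysis workflow implemented on the Kaggle Aquifer_Petrignano dataset.

Data

The dataset contains a range of measurements: date, rainfall, depth to groundwater, temperature, drainage volume, and river level (hydrometry).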

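Overview

0. Separating target and features

1. Missing-value visualization

Visualization with fillna(np.inf)
Missing values in River_Hydrometry and Drainage_Volume
Heatmap visualization

2. Missing-value imputation

Comparing imputation strategies for Drainage_Volume

3. Checking trends with resampling

4. Downsampling, with a different aggregation per variable

5. Stationarity

Characteristics of stationary data
Why stationarity matters

6. Assessing stationarity

7. Augmented Dickey-Fuller (ADF) test

Hypotheses
Rejecting the null hypothesis

8. Achieving stationarity (differencing, transformation)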

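0. Data

- Before analysis and modeling, set 'Depth_to_Groundwater' as the target variable and the remaining columns as features. Rows where rainfall is missing are dropped first.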

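import numpy as np
import pandas as pd

# df is assumed to hold the Aquifer_Petrignano CSV, already loaded with pd.read_csv
df = df[df.Rainfall_Bastia_Umbra.notna()] # keep only the rows where Rainfall_Bastia_Umbra is not missing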
df = df.drop(['Depth_to_Groundwater_P24', 'Temperature_Petrignano'], axis=1)
df.columns = ['Date', 'Rainfall', 'Depth_to_Groundwater', 'Temperature', 'Drainage_Volume', 'River_Hydrometry']

targets = ['Depth_to_Groundwater']
features = [feature for feature in df.columns if feature not in targets]

- Convert the 'Date' column to datetime

from datetime import datetime,date
df['Date'] = pd.to_datetime(df.Date, format = '%d/%m/%Y')
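
The explicit format string matters because it tells pandas to read the dates day-first. As a minimal sketch on two made-up date strings (illustrative values, not taken from the dataset):

```python
import pandas as pd

# Hypothetical day-first strings in the same dd/mm/yyyy layout the format string expects
raw = pd.Series(['03/01/2009', '14/02/2009'])

# With format='%d/%m/%Y', '03/01/2009' is read as 3 January rather than 1 March
parsed = pd.to_datetime(raw, format='%d/%m/%Y')

print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```

Passing the format also makes parsing fail loudly on a row that does not match, instead of silently guessing.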

- The converted 'Date' column

df.head().style.set_properties(subset = ['Date'], **{'background-color':'dodgerblue'})

1. Missing-value visualization

- Visualization with fillna(np.inf)

import matplotlib.pyplot as plt
import seaborn as sns

def plot_features_and_target(df):
    f, ax = plt.subplots(nrows=5, ncols=1, figsize=(15, 25))

    features = ['Rainfall', 'Temperature', 'Drainage_Volume', 'River_Hydrometry', 'Depth_to_Groundwater']

    for i, feature in enumerate(features):
        sns.lineplot(x=df.Date, y=df[feature].fillna(np.inf), ax=ax[i], color='dodgerblue')
        ax[i].set_title(f'Feature: {feature}', fontsize=14)
        ax[i].set_ylabel(ylabel=feature, fontsize=14)
        ax[i].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])

    plt.show()

# Call the function
plot_features_and_target(df)

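For every column, missing values are filled with np.inf before the line plot is drawn, so that gaps show up as breaks in the line instead of being bridged.

Drainage_Volume and River_Hydrometry contain both missing values and zeros.

- Zeros in 'River_Hydrometry' and 'Drainage_Volume' are treated as missing and replaced with NaN.

The original and the modified data are drawn in different colors for comparison.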

f, ax = plt.subplots(nrows = 2, ncols = 1, figsize = (15,15))
old = df.River_Hydrometry.copy()
df['River_Hydrometry'] = np.where(df.River_Hydrometry == 0, np.nan, df.River_Hydrometry) # treat 0 as a missing value

sns.lineplot(x = df.Date, y = old.fillna(np.inf), ax = ax[0], color = 'darkorange', label = 'original')
sns.lineplot(x = df.Date, y = df.River_Hydrometry.fillna(np.inf), ax = ax[0], color = 'dodgerblue', label = 'modified')

old = df.Drainage_Volume.copy()
df['Drainage_Volume'] = np.where(df.Drainage_Volume == 0, np.nan, df.Drainage_Volume) # treat 0 as a missing value

sns.lineplot(x = df.Date, y = old.fillna(np.inf), ax = ax[1], color = 'darkorange', label = 'original')
sns.lineplot(x = df.Date, y = df.Drainage_Volume.fillna(np.inf), ax = ax[1], color = 'dodgerblue', label = 'modified')
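
The effect of the zero-to-NaN rule can also be checked numerically. The sketch below applies the same np.where rule to a small invented frame (the column names follow the renamed columns above, the values are made up) and counts the missing values before and after:

```python
import numpy as np
import pandas as pd

# Toy data: the zeros stand in for readings that should be treated as missing
toy = pd.DataFrame({
    'River_Hydrometry': [2.4, 0.0, 2.5, np.nan, 0.0],
    'Drainage_Volume': [250.0, 260.0, 0.0, 240.0, np.nan],
})

missing_before = toy.isna().sum()

# Same rule as above: a value of exactly 0 becomes NaN
for col in ['River_Hydrometry', 'Drainage_Volume']:
    toy[col] = np.where(toy[col] == 0, np.nan, toy[col])

missing_after = toy.isna().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
```

Comparing the two counts shows how many zeros were reclassified in each column, which is worth knowing before choosing an imputation method.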

- Heatmap of missing values

f, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (16,5))
sns.heatmap(df.T.isna(), cmap = 'Blues', ax = ax) # df.T.isna() transposes the DataFrame and yields a boolean (True/False) matrix marking the missing values
ax.set_title('Fields with Missing Values', fontsize = 14)

# Set the font size of the y-axis major tick labels to 13
for tick in ax.yaxis.get_major_ticks():
    tick.label1.set_fontsize(13) # label1 replaces the deprecated tick.label attribute
plt.show()

2. Missing-value imputation

- Comparing imputation strategies for the missing Drainage_Volume values

f, ax = plt.subplots(nrows=4, ncols=1, figsize=(15, 12))

sns.lineplot(x=df.Date, y=df.Drainage_Volume.fillna(0), ax=ax[0], color='darkorange', label = 'modified')
sns.lineplot(x=df.Date, y=df.Drainage_Volume.fillna(np.inf), ax=ax[0], color='dodgerblue', label = 'original')
ax[0].set_title('Fill NaN with 0', fontsize=14)
ax[0].set_ylabel(ylabel='Volume C10 Petrignano', fontsize=14)

mean_val = df.Drainage_Volume.mean()
sns.lineplot(x=df.Date, y=df.Drainage_Volume.fillna(mean_val), ax=ax[1], color='darkorange', label = 'modified')
sns.lineplot(x=df.Date, y=df.Drainage_Volume.fillna(np.inf), ax=ax[1], color='dodgerblue', label = 'original')
ax[1].set_title(f'Fill NaN with Mean Value ({mean_val:.0f})', fontsize=14)
ax[1].set_ylabel(ylabel='Volume C10 Petrignano', fontsize=14)

sns.lineplot(x=df.Date, y=df.Drainage_Volume.ffill(), ax=ax[2], color='darkorange', label = 'modified')
sns.lineplot(x=df.Date, y=df.Drainage_Volume.fillna(np.inf), ax=ax[2], color='dodgerblue', label = 'original')
ax[2].set_title(f'FFill', fontsize=14)
ax[2].set_ylabel(ylabel='Volume C10 Petrignano', fontsize=14)

sns.lineplot(x=df.Date, y=df.Drainage_Volume.interpolate(), ax=ax[3], color='darkorange', label = 'modified')
sns.lineplot(x=df.Date, y=df.Drainage_Volume.fillna(np.inf), ax=ax[3], color='dodgerblue', label = 'original')
ax[3].set_title(f'Interpolate', fontsize=14)
ax[3].set_ylabel(ylabel='Volume C10 Petrignano', fontsize=14)

for i in range(4):
    ax[i].set_xlim([date(2019, 5, 1), date(2019, 10, 1)])
plt.tight_layout()
plt.show()

# Fill the missing values by interpolation
df['Drainage_Volume'] = df['Drainage_Volume'].interpolate()
df['River_Hydrometry'] = df['River_Hydrometry'].interpolate()
df['Depth_to_Groundwater'] = df['Depth_to_Groundwater'].interpolate()
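
To make the difference between the four strategies concrete, here is a sketch on a tiny invented series with a two-step gap. The numbers are illustrative only; the calls are the same pandas methods used in the plots above.

```python
import numpy as np
import pandas as pd

# Invented series with a gap of two missing values
s = pd.Series([10.0, np.nan, np.nan, 40.0])

filled_zero = s.fillna(0)          # gap becomes 0, far from its neighbours
filled_mean = s.fillna(s.mean())   # mean of the observed values: (10 + 40) / 2
filled_ffill = s.ffill()           # last observed value is carried forward
filled_interp = s.interpolate()    # straight line between the neighbours

for name, filled in [('zero', filled_zero), ('mean', filled_mean),
                     ('ffill', filled_ffill), ('interpolate', filled_interp)]:
    print(f'{name:12s} {filled.tolist()}')
```

Only interpolation produces values that move smoothly from one side of the gap to the other, which fits the choice made above for slowly changing quantities.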

3. Checking trends with resampling

- Use resample to plot rainfall and temperature at daily, weekly, monthly, and annual resolution

import matplotlib.pyplot as plt
import seaborn as sns
from datetime import date

fig, ax = plt.subplots(ncols=2, nrows=4, sharex=True, figsize=(16, 12))

ax[0, 0].bar(df.Date, df.Rainfall, width=5, color='dodgerblue')
ax[0, 0].set_title('Daily Rainfall (Acc.)', fontsize=14)

resampled_df = df[['Date', 'Rainfall']].resample('7D', on='Date').sum().reset_index(drop=False)
ax[1, 0].bar(resampled_df['Date'], resampled_df['Rainfall'], width=10, color='dodgerblue')
ax[1, 0].set_title('Weekly Rainfall (Acc.)', fontsize=14)

resampled_df = df[['Date', 'Rainfall']].resample('M', on='Date').sum().reset_index(drop=False)
ax[2, 0].bar(resampled_df['Date'], resampled_df['Rainfall'], width=15, color='dodgerblue')
ax[2, 0].set_title('Monthly Rainfall (Acc.)', fontsize=14)

resampled_df = df[['Date', 'Rainfall']].resample('12M', on='Date').sum().reset_index(drop=False)
ax[3, 0].bar(resampled_df['Date'], resampled_df['Rainfall'], width=20, color='dodgerblue')
ax[3, 0].set_title('Annual Rainfall (Acc.)', fontsize=14)

for i in range(4):
    ax[i, 0].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])

sns.lineplot(x=df['Date'], y=df['Temperature'], color='dodgerblue', ax=ax[0, 1])
ax[0, 1].set_title('Daily Temperature', fontsize=14)

# Temperature is averaged within each period, not accumulated
resampled_df = df[['Date', 'Temperature']].resample('7D', on='Date').mean().reset_index(drop=False)
sns.lineplot(x=resampled_df['Date'], y=resampled_df['Temperature'], color='dodgerblue', ax=ax[1, 1])
ax[1, 1].set_title('Weekly Temperature (Mean)', fontsize=14)

resampled_df = df[['Date', 'Temperature']].resample('M', on='Date').mean().reset_index(drop=False)
sns.lineplot(x=resampled_df['Date'], y=resampled_df['Temperature'], color='dodgerblue', ax=ax[2, 1])
ax[2, 1].set_title('Monthly Temperature (Mean)', fontsize=14)

resampled_df = df[['Date', 'Temperature']].resample('365D', on='Date').mean().reset_index(drop=False)
sns.lineplot(x=resampled_df['Date'], y=resampled_df['Temperature'], color='dodgerblue', ax=ax[3, 1])
ax[3, 1].set_title('Annual Temperature (Mean)', fontsize=14)

for i in range(4):
    ax[i, 1].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])
    ax[i, 1].set_ylim([-5, 35])

plt.show()

4. Downsampling

df_downsampled = df[['Date',
                     'Depth_to_Groundwater',
                     'Temperature',
                     'Drainage_Volume',
                     'River_Hydrometry',
                    ]].resample('7D', on = 'Date').mean().reset_index(drop = False)
                    
df_downsampled['Rainfall'] = df[['Date',
                                 'Rainfall'
                                ]].resample('7D', on='Date').sum().reset_index(drop=False)[['Rainfall']]

df = df_downsampled # continue with the downsampled (weekly) data
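
The two resample calls above aggregate the variables differently: rainfall is summed, everything else is averaged. As an alternative sketch (not the approach used in this post), the same idea can be written as a single call with a per-column aggregation dictionary. The toy frame below is invented for illustration.

```python
import pandas as pd

# Fourteen invented daily observations, i.e. two full 7-day bins
toy = pd.DataFrame({
    'Date': pd.date_range('2020-01-01', periods=14, freq='D'),
    'Rainfall': [1.0] * 7 + [2.0] * 7,
    'Temperature': [10.0] * 7 + [20.0] * 7,
})

# Rainfall is an accumulated quantity, so it is summed per bin;
# temperature is a level, so it is averaged per bin
weekly = (
    toy.resample('7D', on='Date')
       .agg({'Rainfall': 'sum', 'Temperature': 'mean'})
       .reset_index()
)

print(weekly)
```

Keeping both aggregations in one call avoids having to line up two separately resampled frames by position.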

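5. Stationarity

- Stationarity is an important assumption for some time-series models, such as ARIMA.

Characteristics of stationary data
- The mean is constant over time.
- The variance is constant over time.
- The covariance between two observations depends only on the lag between them, not on the time at which they are observed.

Why stationarity matters
- Modeling and forecasting rely on patterns that stay stable over time.
- Data with a trend or seasonality are generally not stationary.
- A stationary series can be expected to behave in the future much as it did in the past.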
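
A crude way to see the first two properties is to split a series in half and compare the mean and variance of each half. The sketch below does this on two synthetic series (invented here, not part of the dataset): pure noise, and the same noise with a linear trend added. The helper name half_stats is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
noise = rng.normal(loc=0.0, scale=1.0, size=n)  # stationary by construction
trended = noise + 0.01 * np.arange(n)           # mean drifts upward over time

def half_stats(x):
    """Return (mean, variance) of the first half and of the second half of x."""
    mid = len(x) // 2
    first, second = x[:mid], x[mid:]
    return (first.mean(), first.var()), (second.mean(), second.var())

(noise_m1, noise_v1), (noise_m2, noise_v2) = half_stats(noise)
(trend_m1, trend_v1), (trend_m2, trend_v2) = half_stats(trended)

print(f'noise   mean: {noise_m1:.2f} vs {noise_m2:.2f}, var: {noise_v1:.2f} vs {noise_v2:.2f}')
print(f'trended mean: {trend_m1:.2f} vs {trend_m2:.2f}, var: {trend_v1:.2f} vs {trend_v2:.2f}')
```

The noise series has nearly identical statistics in both halves, while the trended series shows a clear shift in the mean. This is only an informal check; the rolling statistics and the ADF test in the following sections are more systematic.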

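6. Assessing stationarity

- Compute the rolling mean and rolling standard deviation, then plot them to judge visually whether each series is stationary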

rolling_window = 52
f, ax = plt.subplots(nrows=3, ncols=2, figsize=(15, 12))

sns.lineplot(x=df.Date, y=df.Rainfall, ax=ax[0, 0], color='indianred')
sns.lineplot(x=df.Date, y=df.Rainfall.rolling(rolling_window).mean(), ax=ax[0, 0], color='black', label='rolling mean')
sns.lineplot(x=df.Date, y=df.Rainfall.rolling(rolling_window).std(), ax=ax[0, 0], color='blue', label='rolling std')
ax[0, 0].set_title('Rainfall: Non-stationary \nnon-constant mean & non-constant variance', fontsize=14)
ax[0, 0].set_ylabel(ylabel='Rainfall', fontsize=14)

sns.lineplot(x=df.Date, y=df.Temperature, ax=ax[1, 0], color='indianred')
sns.lineplot(x=df.Date, y=df.Temperature.rolling(rolling_window).mean(), ax=ax[1, 0], color='black', label='rolling mean')
sns.lineplot(x=df.Date, y=df.Temperature.rolling(rolling_window).std(), ax=ax[1, 0], color='blue', label='rolling std')
ax[1, 0].set_title('Temperature: Non-stationary \nvariance is time-dependent (seasonality)', fontsize=14)
ax[1, 0].set_ylabel(ylabel='Temperature', fontsize=14)

sns.lineplot(x=df.Date, y=df.River_Hydrometry, ax=ax[0, 1], color='indianred')
sns.lineplot(x=df.Date, y=df.River_Hydrometry.rolling(rolling_window).mean(), ax=ax[0, 1], color='black', label='rolling mean')
sns.lineplot(x=df.Date, y=df.River_Hydrometry.rolling(rolling_window).std(), ax=ax[0, 1], color='blue', label='rolling std')
ax[0, 1].set_title('Hydrometry: Non-stationary \nnon-constant mean & non-constant variance', fontsize=14)
ax[0, 1].set_ylabel(ylabel='Hydrometry', fontsize=14)

sns.lineplot(x=df.Date, y=df.Drainage_Volume, ax=ax[1, 1], color='indianred')
sns.lineplot(x=df.Date, y=df.Drainage_Volume.rolling(rolling_window).mean(), ax=ax[1, 1], color='black', label='rolling mean')
sns.lineplot(x=df.Date, y=df.Drainage_Volume.rolling(rolling_window).std(), ax=ax[1, 1], color='blue', label='rolling std')
ax[1, 1].set_title('Volume: Non-stationary \nnon-constant mean & non-constant variance', fontsize=14)
ax[1, 1].set_ylabel(ylabel='Volume', fontsize=14)

sns.lineplot(x=df.Date, y=df.Depth_to_Groundwater, ax=ax[2, 0], color='indianred')
sns.lineplot(x=df.Date, y=df.Depth_to_Groundwater.rolling(rolling_window).mean(), ax=ax[2, 0], color='black', label='rolling mean')
sns.lineplot(x=df.Date, y=df.Depth_to_Groundwater.rolling(rolling_window).std(), ax=ax[2, 0], color='blue', label='rolling std')
ax[2, 0].set_title('Depth to Groundwater: Non-stationary \nnon-constant mean & non-constant variance', fontsize=14)
ax[2, 0].set_ylabel(ylabel='Depth to Groundwater', fontsize=14)


for i in range(3):
    ax[i,0].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])
    ax[i,1].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])

f.delaxes(ax[2, 1])
plt.tight_layout()
plt.show()
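
For readers unfamiliar with rolling statistics, the sketch below shows what rolling(window).mean() and rolling(window).std() return on a five-value invented series. The first window - 1 entries are NaN because a full window is not yet available; with rolling_window = 52 on weekly data, this is why the rolling lines in the plots start about a year after the raw series.

```python
import pandas as pd

# Invented series, small enough to verify by hand
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
window = 3

rolling_mean = s.rolling(window).mean()  # mean of each run of 3 consecutive values
rolling_std = s.rolling(window).std()    # sample standard deviation of the same runs

print(rolling_mean.tolist())
print(rolling_std.tolist())
```

For a stationary series both curves stay roughly flat; a drifting rolling mean or a changing rolling standard deviation points to non-stationarity.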

- Histogram visualization

f, ax = plt.subplots(nrows=3, ncols=2, figsize=(15, 9))

# sns.distplot is deprecated; histplot with kde=True is the closest replacement.
# Missing values are dropped here, because np.inf cannot be placed in a histogram bin.
sns.histplot(df.Rainfall.dropna(), kde=True, ax=ax[0, 0], color='indianred')
ax[0, 0].set_title('Rainfall: Non-stationary \nnon-constant mean & non-constant variance', fontsize=14)
ax[0, 0].set_ylabel(ylabel='Rainfall', fontsize=14)

sns.histplot(df.Temperature.dropna(), kde=True, ax=ax[1, 0], color='indianred')
ax[1, 0].set_title('Temperature: Non-stationary \nvariance is time-dependent (seasonality)', fontsize=14)
ax[1, 0].set_ylabel(ylabel='Temperature', fontsize=14)

sns.histplot(df.River_Hydrometry.dropna(), kde=True, ax=ax[0, 1], color='indianred')
ax[0, 1].set_title('Hydrometry: Non-stationary \nnon-constant mean & non-constant variance', fontsize=14)
ax[0, 1].set_ylabel(ylabel='Hydrometry', fontsize=14)

sns.histplot(df.Drainage_Volume.dropna(), kde=True, ax=ax[1, 1], color='indianred')
ax[1, 1].set_title('Volume: Non-stationary \nnon-constant mean & non-constant variance', fontsize=14)
ax[1, 1].set_ylabel(ylabel='Volume', fontsize=14)

sns.histplot(df.Depth_to_Groundwater.dropna(), kde=True, ax=ax[2, 0], color='indianred')
ax[2, 0].set_title('Depth to Groundwater: Non-stationary \nnon-constant mean & non-constant variance', fontsize=14)
ax[2, 0].set_ylabel(ylabel='Depth to Groundwater', fontsize=14)

f.delaxes(ax[2, 1])
plt.tight_layout()
plt.show()

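7. Augmented Dickey-Fuller (ADF) test

- A statistical test of the kind known as a unit root test

Hypotheses
- Null hypothesis (H0): the series has a unit root (it is not stationary).
- Alternative hypothesis (H1): the series has no unit root (it is stationary).

Rejecting the null hypothesis
- The null hypothesis can be rejected when the p-value is at or below the chosen significance level.
- p-value > significance level (commonly 0.05): fail to reject H0. A unit root cannot be ruled out, so the series is treated as non-stationary.
- p-value <= significance level (commonly 0.05): reject H0. The data show no evidence of a unit root, so the series is treated as stationary.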

from statsmodels.tsa.stattools import adfuller

result = adfuller(df.Depth_to_Groundwater.values)
adf_stat = result[0]
p_val = result[1]
crit_val_1 = result[4]['1%']
crit_val_5 = result[4]['5%']
crit_val_10 = result[4]['10%']

f, ax = plt.subplots(nrows=3, ncols=2, figsize=(15, 9))

def visualize_adfuller_results(series, title, ax):
    result = adfuller(series)
    significance_level = 0.05
    adf_stat = result[0]
    p_val = result[1]
    crit_val_1 = result[4]['1%']
    crit_val_5 = result[4]['5%']
    crit_val_10 = result[4]['10%']

    if (p_val < significance_level) & ((adf_stat < crit_val_1)):
        linecolor = 'forestgreen' 
    elif (p_val < significance_level) & (adf_stat < crit_val_5):
        linecolor = 'gold'
    elif (p_val < significance_level) & (adf_stat < crit_val_10):
        linecolor = 'orange'
    else:
        linecolor = 'indianred'
    sns.lineplot(x=df.Date, y=series, ax=ax, color=linecolor)
    ax.set_title(f'ADF Statistic {adf_stat:0.3f}, p-value: {p_val:0.3f}\nCritical Values 1%: {crit_val_1:0.3f}, 5%: {crit_val_5:0.3f}, 10%: {crit_val_10:0.3f}', fontsize=14)
    ax.set_ylabel(ylabel=title, fontsize=14)

visualize_adfuller_results(df.Rainfall.values, 'Rainfall', ax[0, 0])
visualize_adfuller_results(df.Temperature.values, 'Temperature', ax[1, 0])
visualize_adfuller_results(df.River_Hydrometry.values, 'River_Hydrometry', ax[0, 1])
visualize_adfuller_results(df.Drainage_Volume.values, 'Drainage_Volume', ax[1, 1])
visualize_adfuller_results(df.Depth_to_Groundwater.values, 'Depth_to_Groundwater', ax[2, 0])

f.delaxes(ax[2, 1])
plt.tight_layout()
plt.show()

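8. Achieving stationarity

Transformation: apply a mathematical transformation to stabilize a variance that changes over time. Taking the logarithm or the square root, for example, compresses large values so that the spread stays more consistent over time. The goal is to make statistical properties such as the variance more constant over time, bringing the series closer to stationarity.
Differencing: subtract the previous value from the current value of the series. This removes trend or seasonal patterns from the data. First-order differencing subtracts consecutive observations; higher-order differencing is applied only when a trend or seasonality remains.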
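
Both ideas can be verified by hand on very small invented series. In this sketch, Series.diff() is used for differencing; note that it leaves NaN in the first position, whereas the code later in this section pads the first position with 0 instead.

```python
import numpy as np
import pandas as pd

# Linear trend: first-order differencing leaves a constant
linear = pd.Series([2.0, 4.0, 6.0, 8.0])
diff_1 = linear.diff()   # y_t - y_(t-1); the first value has no predecessor
diff_2 = diff_1.diff()   # differencing the differences

# Exponential growth: the log turns it into a linear trend,
# so the differenced log series is constant
growth = pd.Series([1.0, 10.0, 100.0, 1000.0])
log_diff = np.log(growth).diff()

print(diff_1.tolist())
print(diff_2.tolist())
print(log_diff.round(4).tolist())
```

The linear series needs one round of differencing to become flat, while the exponential one needs a log transform first. Transformation and differencing are often combined for that reason.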

- Transformation

df['Depth_to_Groundwater_log'] = np.log(abs(df.Depth_to_Groundwater))

f, ax = plt.subplots(nrows=2, ncols=2, figsize=(15, 6))
visualize_adfuller_results(abs(df.Depth_to_Groundwater), 'Absolute \n Depth to Groundwater', ax[0, 0])

sns.histplot(abs(df.Depth_to_Groundwater), kde=True, ax=ax[0, 1]) # distribution before the log transform
visualize_adfuller_results(df.Depth_to_Groundwater_log, 'Transformed \n Depth to Groundwater', ax[1, 0])

sns.histplot(df.Depth_to_Groundwater_log, kde=True, ax=ax[1, 1]) # distribution after the log transform

plt.tight_layout()
plt.show()

- Differencing

# First Order Differencing
ts_diff = np.diff(df.Depth_to_Groundwater)
df['Depth_to_Groundwater_diff_1'] = np.append([0], ts_diff)

# Second Order Differencing
ts_diff = np.diff(df.Depth_to_Groundwater_diff_1)
df['Depth_to_Groundwater_diff_2'] = np.append([0], ts_diff)

f, ax = plt.subplots(nrows=2, ncols=1, figsize=(15, 6))

visualize_adfuller_results(df.Depth_to_Groundwater_diff_1, 'Differenced (1. Order) \n Depth to Groundwater', ax[0])
visualize_adfuller_results(df.Depth_to_Groundwater_diff_2, 'Differenced (2. Order) \n Depth to Groundwater', ax[1])
plt.tight_layout()
plt.show()

In general, differencing is used more often than transformation to make a time series stationary.

License: Attribution, NonCommercial, NoDerivatives (CC BY-NC-ND)


DataPilots
