Tick Data Resampling & OHLCV Conversion + Troubleshooting

skwkiix 2024. 2. 18. 21:52

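0. Competition Overview

2024-1 Data Society Alliance ASCEND Recruitment-Linked Data Analysis Competition

Modeling short-term volatility prediction for Bitcoin (BTC) using tick data

Organizers / hosts: BDA, ASCEND (a quant trading firm)

Participating societies: BDA, KUBIG, KHUDA, ESAA, ESC, PARROT

Model used: Random Forest

Data: BTC tick data in 10-minute units, 2023.01 ~ 2024.01

Evaluation method / metric: the model is applied to 'actual market data' for the following 7 days to evaluate performance / MAPE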
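
For reference, MAPE (mean absolute percentage error) can be computed roughly as in the sketch below. The helper name `mape` and the sample numbers are my own illustrative assumptions, not the competition's scoring code.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Undefined when a true value is zero; realized volatility is normally positive.
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


# Made-up volatility values, for illustration only
example_score = mape([0.010, 0.020, 0.040], [0.011, 0.018, 0.040])
```

Because the error is divided by the true value, a miss on a low-volatility period is penalized more heavily than the same absolute miss on a high-volatility period.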

1. Provided Data & Resampling Plan

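Column Name	Description
id	Unique identifier assigned to each trade, used to tell trades apart
price	Price at which the trade was executed (in USD)
qty	Amount of Bitcoin traded
quote_qty	Total USD value of the Bitcoin traded (price * qty)
time	Unix timestamp at which the trade was recorded
is_buyer_maker	Boolean indicating whether the buyer was the maker (True) or the taker (False)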

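The very first task is to resample this data into one-hour bars by computing Open, High, Low, Close and so on, and then to compute volatility. Along with price, our team also included qty (sum), quote_qty (std), and is_buyer_maker (sum) in the resampling (for quote_qty we used the standard deviation because the scale of its values varies widely), and then computed returns and volatility on the resampled dataset.

We used a volatility window of 20 days, a commonly used value. (In calculate_volatility the window counts rows, so on the 1-hour bars window=20 covers 20 bars.)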

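import pandas as pd


def convert_tick_to_ohlcv(data):
    """
    Converts given Binance tick data into 1-hour interval OHLCV (Open, High, Low, Close, Volume) data.
    :param data: DataFrame with Tick data
    :return: DataFrame with the Open, High, Low, Close, Volume values
    """

    # Convert Unix timestamps in milliseconds to datetimes (modifies `data` in place)
    data['time'] = pd.to_datetime(data['time'], unit='ms')
    # '1h' is the hourly alias; the uppercase '1H' is deprecated in recent pandas
    ohlcv = data.resample('1h', on='time').agg({
        'price': ['first', 'max', 'min', 'last'],
        'qty': 'sum',
        'quote_qty': 'std',  # added: spread of per-trade USD value within the hour
        'is_buyer_maker': 'sum'})  # added: number of trades where the buyer was the maker

    # Flatten the MultiIndex columns produced by agg into plain names
    ohlcv.columns = ['Open', 'High', 'Low', 'Close', 'Volume', 'quote_qty', 'is_buyer_maker']
    return ohlcv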

def calculate_volatility(data, window=20):
    """
    Calculate the rolling volatility using the standard deviation of returns.
    :param data: DataFrame with OHLCV data
    :param window: The number of periods to use for calculating the standard deviation
    :return: DataFrame with the volatility values
    """

    # Calculate period-over-period returns (hourly when applied to the 1-hour bars)
    data['returns'] = data['Close'].pct_change()

    # Calculate the rolling standard deviation of returns
    data['volatility'] = data['returns'].rolling(window=window).std()

    return data
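
To make the two steps concrete, here is a small sketch that runs the same kind of resampling and rolling-volatility calculation on a synthetic tick table. The random data and the variable names are assumptions made up for illustration; only the aggregation rules come from the functions above.

```python
import numpy as np
import pandas as pd

# Synthetic ticks (assumption): one trade every 5 minutes for 3 days
rng = np.random.default_rng(0)
n = 3 * 24 * 12
ticks = pd.DataFrame({
    "time": pd.date_range("2023-01-01", periods=n, freq="5min"),
    "price": 20000 + rng.normal(0, 10, n).cumsum(),
    "qty": rng.uniform(0.001, 0.5, n),
    "is_buyer_maker": rng.integers(0, 2, n).astype(bool),
})
ticks["quote_qty"] = ticks["price"] * ticks["qty"]

# Same aggregation rules as convert_tick_to_ohlcv
bars = ticks.resample("1h", on="time").agg({
    "price": ["first", "max", "min", "last"],
    "qty": "sum",
    "quote_qty": "std",
    "is_buyer_maker": "sum",
})
bars.columns = ["Open", "High", "Low", "Close", "Volume", "quote_qty", "is_buyer_maker"]

# Same calculation as calculate_volatility: window counts rows (hourly bars here)
bars["returns"] = bars["Close"].pct_change()
bars["volatility"] = bars["returns"].rolling(window=20).std()

# If a window of 20 calendar days were wanted on hourly bars, it would be this many rows
rows_for_20_days = 20 * 24
```

The first 20 volatility values are NaN: the first return is undefined, and the rolling window then needs 20 valid returns before it produces a number.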

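2. Solving the Large-Data Processing Problem

The provided data covers 2023.01 ~ 2024.01. One file was provided per month, and the files are large, averaging about 4GB each.

So we ran a study session on ways to process large data, and I summarized it on the blog.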

https://datapilots.tistory.com/78

Python large-data processing parameters - Pandas

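Solutions within the Pandas library

Method 1. Use chunksize (set it to fit the available memory) > with time-series data, the calculations can get tangled or duplicated across chunks

Method 2. Specify data types with dtype

Method 3. Select columns with usecols

Method 4. Apply all three at once

Method 5. Process each file individually without a for loop, then concat > takes far too long

I ran the approaches above, but the kernel kept dying on machines with 8GB / 16GB of RAM.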
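
The time-series issue noted under Method 1 comes from chunks that end in the middle of an hour, so the same hourly bar shows up in two consecutive chunks. One way to handle this, sketched below under assumptions, is to aggregate each chunk into partial bars and then merge the bars that share a timestamp. The helper name `ohlcv_from_chunks`, the in-memory CSV, and the tiny `chunksize` are illustrative; this is not the code we actually ran, and it covers only the OHLCV columns because a standard deviation cannot be merged this simply.

```python
import io

import pandas as pd


def ohlcv_from_chunks(csv_source, chunksize):
    """Build 1-hour OHLCV bars from a CSV read chunk by chunk (hypothetical helper)."""
    partial_bars = []
    reader = pd.read_csv(
        csv_source,
        usecols=["price", "qty", "time"],
        dtype={"price": "float64", "qty": "float64", "time": "int64"},
        chunksize=chunksize,
    )
    for chunk in reader:
        chunk["time"] = pd.to_datetime(chunk["time"], unit="ms")
        bars = chunk.resample("1h", on="time").agg(
            {"price": ["first", "max", "min", "last"], "qty": "sum"}
        )
        bars.columns = ["Open", "High", "Low", "Close", "Volume"]
        # Drop hours that have no trades inside this chunk
        partial_bars.append(bars.dropna(subset=["Open"]))

    combined = pd.concat(partial_bars)
    # An hour split across two chunks appears twice; merge the pieces.
    # Assumes rows are in time order, so 'first'/'last' keep the right Open/Close.
    return combined.groupby(level=0).agg(
        {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
    )


# Tiny made-up CSV: one trade every 20 minutes for 3 hours
lines = ["id,price,qty,quote_qty,time,is_buyer_maker"]
prices = [100, 101, 99, 102, 103, 104, 98, 97, 105]
for i, p in enumerate(prices):
    t = 1672531200000 + i * 20 * 60 * 1000  # ms since epoch, from 2023-01-01 00:00 UTC
    lines.append(f"{i},{p},1.0,{p},{t},True")
demo_csv = "\n".join(lines)

# chunksize=4 forces chunk boundaries to fall inside an hour
bars = ohlcv_from_chunks(io.StringIO(demo_csv), chunksize=4)
```

Only the small per-chunk bars stay in memory, so peak usage is governed by `chunksize` rather than by the file size.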

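Process files individually, then concat

The code that split the files, processed each one individually, and then concatenated them also failed, unsurprisingly.

Using the Dask library -- success

I wrote the code as the following loop: read a file into a Dask DataFrame with the dask library (using usecols and dtype) > convert that file back into a pandas DataFrame > apply the convert_tick_to_ohlcv function > append() the result to a list. After repeating this for every file, all the results are concatenated, and the calculate_volatility function is applied to the concatenated data. It ran without problems.

(How to use the dask library is also summarized in the post below.)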

https://datapilots.tistory.com/79

Python large-data processing library - Dask

import dask.dataframe as dd
import pandas as pd

# file_list is assumed to hold the paths of the monthly CSV files (defined elsewhere)

# Empty list to collect the per-file results
combined_dfs = []

# Process each file
for file in file_list:
    print(f"Reading and processing file: {file}")
    
    # Read the file as a Dask DataFrame
    dask_df = dd.read_csv(file, usecols=['price', 'qty', 'quote_qty', 'time', 'is_buyer_maker'], dtype={'price': float, 'qty': float, 'quote_qty': float, 'time': float})
    
    try:
        # Materialize the Dask DataFrame as a pandas DataFrame
        computed_df = dask_df.compute()
        # Handle conversion errors separately from read errors
        try:
            # Apply the conversion function
            processed_df = convert_tick_to_ohlcv(computed_df)
            combined_dfs.append(processed_df)
        except Exception as e:
            print(f"Error processing file {file}: {e}")
    except Exception as e:
        print(f"Error reading file {file}: {e}")

# Once every file is processed, concat all DataFrames in the list into a single DataFrame
combined_df = pd.concat(combined_dfs, ignore_index=False)

To guard against problems in individual files, I also added exception handling.
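
The last step, applying calculate_volatility to the concatenated result, is sensitive to row order because it uses pct_change and a rolling window. The sketch below shows one way to guard that step; the synthetic frames and the `fake_month` helper are assumptions for illustration, and the function is a copy of the one above that works on a copy of its input.

```python
import numpy as np
import pandas as pd


def calculate_volatility(data, window=20):
    """Same calculation as in the post, but on a copy so the input is untouched."""
    data = data.copy()
    data["returns"] = data["Close"].pct_change()
    data["volatility"] = data["returns"].rolling(window=window).std()
    return data


def fake_month(start, periods, seed):
    """Synthetic stand-in for one file's hourly bars (assumption, Close only)."""
    rng = np.random.default_rng(seed)
    index = pd.date_range(start, periods=periods, freq="1h", name="time")
    close = 20000 + rng.normal(0, 50, periods).cumsum()
    return pd.DataFrame({"Close": close}, index=index)


# Deliberately listed out of chronological order
combined_dfs = [fake_month("2023-02-01", 48, seed=1), fake_month("2023-01-30", 48, seed=0)]

combined_df = pd.concat(combined_dfs).sort_index()
# Rolling statistics depend on row order and would double count repeated hours
assert combined_df.index.is_monotonic_increasing
assert not combined_df.index.has_duplicates

result = calculate_volatility(combined_df, window=20)
```

Sorting by the time index before the rolling calculation means the result does not depend on the order in which the files happened to be listed.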

Full code: https://github.com/sin09135/btc_volatility_prediction

License: Attribution, NonCommercial, NoDerivatives