Crawling with Scrapy

Scrapy is an open-source, Python-based framework for web crawling and scraping. It extracts data from web pages and stores the information you need in a structured format.

Crawling Worldometer population data

https://www.worldometers.info/world-population/population-by-country

0. Preparation before crawling

1) Folder and virtual environment setup

Create a working folder at the path of your choice.

$ mkdir worldometer_scrapy

Open VS Code.

$ cd worldometer_scrapy
$ code .

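Set up a virtual environment in the [Terminal]. The virtualenv package must already be installed (for example with pip install virtualenv).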

$ virtualenv venv
$ source venv/bin/activate

Install the Scrapy framework.

$ pip install scrapy

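Check that the scrapy command runs correctly. Run without arguments, it prints the installed version and the list of available commands.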

$ scrapy
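
The same check can also be done from Python. The snippet below is only a minimal sketch using the standard library; the helper name installed_version is made up for this example and is not part of Scrapy.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string inside the activated virtual environment,
# or None if `pip install scrapy` has not been run there yet.
print(installed_version("scrapy"))
```

If this prints None, the virtual environment is probably not activated, or the install step was run in a different environment.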

2) Create the worldometer project

scrapy startproject "project name"

$ scrapy startproject worldometer
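
The startproject command generates a project folder containing scrapy.cfg and an inner package with a spiders folder. As a rough sketch, the helper below (a hypothetical function, not a Scrapy API) lists which of the files that a default Scrapy project template normally contains are missing, assuming the project is named worldometer.

```python
from pathlib import Path

# Files that the default `scrapy startproject worldometer` template creates,
# relative to the outer worldometer/ folder.
EXPECTED = [
    "scrapy.cfg",
    "worldometer/__init__.py",
    "worldometer/items.py",
    "worldometer/middlewares.py",
    "worldometer/pipelines.py",
    "worldometer/settings.py",
    "worldometer/spiders/__init__.py",
]


def missing_project_files(project_root):
    """Return the expected project files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


# Run from the folder where `scrapy startproject worldometer` was executed.
print(missing_project_files("worldometer"))
```

An empty list means the layout is in place. Spider files go into the worldometer/spiders folder.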

2. Understanding the structure of the data to crawl

Website URL

https://www.worldometers.info/world-population/population-by-country

Main page of the website

Create worldmeter_crawl.py

Create worldmeter_crawl.py inside the spiders folder. The crawling code will be saved in this file from here on.

Basic Scrapy structure

import scrapy

class WorldometerSpider(scrapy.Spider):
    name = "worldometer"  # spider name passed to `scrapy crawl`

    # Crawling code goes here
    def parse(self, response):
        yield {
            # Each dictionary yielded here is passed to Scrapy as one scraped item.
        }

The code above shows the basic structure of a spider file.
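
To see what yield does here, it may help to look at a plain Python generator without Scrapy. This is only an illustrative sketch: the parse function below takes a list of tuples instead of a response, and the row values are made-up placeholders.

```python
def parse(rows):
    """Stand-in for a spider's parse(): yield one dictionary per row."""
    for country, population in rows:
        yield {"country": country, "population": population}


# Made-up placeholder rows, not real population figures.
sample_rows = [("Country A", "1,000"), ("Country B", "2,000")]

items = parse(sample_rows)  # nothing has run yet; this is a generator object
first = next(items)         # runs the function up to the first yield
rest = list(items)          # consumes the remaining yielded dictionaries

print(first)
print(rest)
```

Scrapy consumes the generator returned by parse in a similar way and treats each yielded dictionary as one scraped item.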

Let's work out the structure, starting with the title element.

In the browser's developer tools, copy the title's XPath with Copy XPath.

Code implementation

import scrapy


class WorldometerSpider(scrapy.Spider):
    
    # Spider name and the domain it is allowed to crawl
    name = "worldometer"
    allowed_domains = ["www.worldometers.info"]
    
    # Web page where crawling starts
    start_urls = ["https://www.worldometers.info/world-population/population-by-country"]

    def parse(self, response):
        
        # Extract the title
        title = response.xpath('//h1/text()').get()
  
        yield {
            'title' : title 
        }

Paste the copied XPath into title = response.xpath('copied XPath').get() to extract the title, then put the result inside yield {} so that it is emitted as a dictionary.

Check the result
- Run scrapy crawl followed by the spider name (the name attribute, which here is the same as the project name)

$ scrapy crawl worldometer

You can confirm that the title is printed as a dictionary.
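
The spider can also be started from a Python script instead of the command line, which makes it easy to save the items to a file. The following is a sketch under a few assumptions: it must be run from the project folder that contains scrapy.cfg, the output file name population.json is arbitrary, and feed_settings and run_spider are helper names invented for this example.

```python
def feed_settings(output_path="population.json"):
    """Build a value for Scrapy's FEEDS setting that writes items to one JSON file."""
    return {output_path: {"format": "json", "encoding": "utf8", "indent": 2}}


def run_spider(spider_name="worldometer", output_path="population.json"):
    """Run a spider of the current project and export its items to output_path."""
    # Imported inside the function so the module can be imported without Scrapy.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()  # reads the project's settings.py
    settings.set("FEEDS", feed_settings(output_path))

    process = CrawlerProcess(settings)
    process.crawl(spider_name)  # looked up by the spider's `name` attribute
    process.start()             # blocks until the crawl has finished
```

Calling run_spider() from the project root should then produce population.json with one object per yielded dictionary.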

Crawling country, population, and Yearly Change

1. Understanding the structure

Press Command + F in the developer tools and type XPath expressions until you find the path that matches.

Country : //table/tbody/tr/td[2]/a/text()

Population : //table/tbody/tr/td[3]/text()

Yearly Change : //table/tbody/tr/td[4]/text()
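
The idea behind these paths is to select each table row first and then address the cells relative to that row. The sketch below tries this offline with the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors. ElementTree supports only a subset of XPath (no text() step, so findtext is used instead), and the table content is made-up placeholder data, not the real page.

```python
import xml.etree.ElementTree as ET

# Made-up placeholder table that mimics the column order of the page:
# rank, country (as a link), population, yearly change.
SAMPLE = """
<table>
  <tbody>
    <tr><td>1</td><td><a href="/a">Country A</a></td><td>1,000,000</td><td>1.00 %</td></tr>
    <tr><td>2</td><td><a href="/b">Country B</a></td><td>500,000</td><td>-0.50 %</td></tr>
  </tbody>
</table>
"""


def extract_rows(markup):
    """Return one dictionary per table row, using row-relative paths."""
    root = ET.fromstring(markup.strip())
    records = []
    for row in root.findall("./tbody/tr"):
        records.append({
            "country": row.findtext("./td[2]/a"),    # text of the link in cell 2
            "population": row.findtext("./td[3]"),   # text of cell 3
            "yearly_change": row.findtext("./td[4]"),
        })
    return records


for record in extract_rows(SAMPLE):
    print(record)
```

Note that XPath positions start at 1, so td[2] is the second cell of the row.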

2. Implementing worldmeter_crawl.py

import scrapy

class WorldometerSpider(scrapy.Spider):
    
    name = "worldometer"
    allowed_domains = ["www.worldometers.info"]
    start_urls = ["https://www.worldometers.info/world-population/population-by-country"]

    def parse(self, response):
       
        # Extract country, population, and yearly_change from each table row
        rows = response.xpath('//table/tbody/tr')
        
        for row in rows:
            country = row.xpath('./td[2]/a/text()').get()
            population = row.xpath('./td[3]/text()').get()
            yearly_change = row.xpath('./td[4]/text()').get()

            yield {
                'country': country,
                'population': population,
                'yearly_change': yearly_change
            }

Output

You can confirm that country, population, and yearly_change are printed correctly as dictionaries.
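
All three values arrive as strings, and .get() returns None when an XPath matches nothing. If numbers are needed later, a small clean-up step could look like the sketch below. It assumes that populations use commas as thousands separators and that the yearly change ends with a percent sign. The helper names are made up for this example.

```python
def to_int(text):
    """Convert a string such as '1,234,567' to an int; None or blank gives None."""
    if text is None or not text.strip():
        return None
    return int(text.replace(",", "").strip())


def to_percent(text):
    """Convert a string such as '0.50 %' to a float; None or blank gives None."""
    if text is None or not text.strip():
        return None
    return float(text.replace("%", "").strip())


def clean_item(item):
    """Return a copy of a scraped item with numeric fields converted."""
    country = item.get("country")
    return {
        "country": country.strip() if country else None,
        "population": to_int(item.get("population")),
        "yearly_change": to_percent(item.get("yearly_change")),
    }


# Made-up placeholder values in the assumed format.
print(clean_item({"country": " Country A ", "population": "1,234,567", "yearly_change": "0.50 %"}))
```

Such a function could be applied to each dictionary before it is yielded, or afterwards when the saved output is loaded.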

License: Attribution, NonCommercial, NoDerivatives

DataPilots

[Web] Crawling population data by country with Scrapy - 1
