Python crawler with traversal

Try to Developer: EthanJ's Growth Log

EthanJ 2022. 10. 18. 16:03

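A crawler that traverses pages sharing the same layout and collects data from each of them.
Build a one-page crawler first, then put the finished crawler inside a loop.

The most important part is working out where the loop should start!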

# crawling library imports
from bs4 import BeautifulSoup
from selenium import webdriver
import requests

# time, used to pause the script while pages load
import time

# By, needed for XPath lookups since the Selenium update around 2022-07
from selenium.webdriver.common.by import By

# file io
import codecs

Steps

1. Open page N
2. Fetch the page source
3. Parse it
4. Extract the data
5. Save it to a txt file
6. Go back to step 1
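
The steps above can be sketched as a loop around a one-page crawler. This is only a minimal sketch: `crawl_all`, `fetch`, `parse` and `FAKE_PAGES` are hypothetical names, and the "pages" are plain strings standing in for real page sources so the loop can run without a browser.

```python
from typing import Callable, List, Optional


def crawl_all(fetch: Callable[[int], Optional[str]],
              parse: Callable[[str], List[str]]) -> List[str]:
    """Run the one-page steps for page 1, 2, 3, ... until fetch returns None."""
    collected: List[str] = []
    page = 1
    while True:
        source = fetch(page)             # steps 1-2: open the page, get its source
        if source is None:               # no such page: stop traversing
            break
        collected.extend(parse(source))  # steps 3-5: parse, extract, store
        page += 1                        # step 6: back to step 1 with the next page
    return collected


# Hypothetical stand-in data: each "page" is a comma-separated string of titles.
FAKE_PAGES = {1: "A,B", 2: "C,D", 3: "E"}

titles = crawl_all(FAKE_PAGES.get, lambda source: source.split(","))
print(titles)
```

In the real crawler `fetch` would be the Selenium page load and `parse` the BeautifulSoup extraction; keeping them as separate pieces is what lets the one-page crawler be finished first and wrapped in the loop afterwards.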

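Turn pages by clicking the next-page button through its XPath

List-style pages: press [F12], click the [Network] menu, then click the list's next page
> even if the URL does not change, the resulting Network activity can be inspected in the [Headers] and [Payload] tabs
> the XPath can be found this way!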
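
When the [Headers] or [Payload] tab shows the page number travelling as a query parameter, the request URL can be taken apart with the standard library to see what changes between pages. A rough sketch under loud assumptions: the URL `https://example.com/api/list?page=1&per=20`, the parameter name `page` and the helper `with_page` are all made up for illustration and are not taken from the site crawled in this post.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def with_page(request_url: str, page: int, param: str = "page") -> str:
    """Return request_url with its (hypothetical) page parameter set to `page`."""
    parts = urlsplit(request_url)
    query = parse_qs(parts.query)          # e.g. {'page': ['1'], 'per': ['20']}
    query[param] = [str(page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


# Hypothetical request URL, as it might be copied from the [Headers] tab.
copied = "https://example.com/api/list?page=1&per=20"

print(parse_qs(urlsplit(copied).query))
print(with_page(copied, 2))
```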

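# Selenium 4.6+ finds a matching chromedriver on its own; the positional driver
# path used originally, webdriver.Chrome('chromedriver'), is rejected by Selenium 4.10+
chrome_driver = webdriver.Chrome()

# open the first page
chrome_driver.get("https://product.kyobobook.co.kr/bestseller/online?period=001")

# stores the first title of every page; needed for the loop's stop condition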
check_name_list = list()

rank_list = list()
title_list = list()
price_list = list()
author_list = list()

time.sleep(6)

# loop over the pages
while True:

    # scroll all the way down (keeps ads from covering the page)
    chrome_driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # source code crawling
    source = chrome_driver.page_source

    # parsing
    html_parsed_source = BeautifulSoup(source, "html.parser")

    # extract data(span_prod_name) & saving in list

    ####### Title
    span_prod_name = html_parsed_source.find_all("span", class_="prod_name")   

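    # stop condition: this page's first title is already in check_name_list
    # (an empty result also ends the loop instead of raising IndexError)
    if not span_prod_name or span_prod_name[0].text in check_name_list:
        chrome_driver.quit()
        break
    check_name_list.append(span_prod_name[0].text)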

    for title in span_prod_name:
        title_list.append(title.text)     

    ######## Rank
    div_prod_rank = html_parsed_source.find_all("div", class_="prod_rank")

    for rank in div_prod_rank:
        rank_list.append(rank.text)

    ######## Price        
    span_val = html_parsed_source.find_all("span", class_="val")

    for price in span_val:
        # skip entries whose text is "0"
        if price.text != "0":
            price_list.append(price.text)

    ######### Author
    span_prod_author = html_parsed_source.find_all("span", class_="prod_author")

    for author in span_prod_author:
        author_list.append(author.text.split(" ·")[0])

    # move on by clicking the next-page button, located by XPath
    chrome_driver.find_element(By.XPATH, '//*[@id="tabRoot"]/div[4]/div[2]/button[2]').click()

    time.sleep(6)
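
The stop condition in the loop above relies on the same page coming back once there is no further page, so the first title repeats. A minimal sketch of that idea with no browser involved: `collect_until_repeat` and the sample `pages` are hypothetical, each inner list standing in for the titles parsed from one page.

```python
def collect_until_repeat(pages):
    """Collect titles page by page until a page's first title has been seen before."""
    seen_first_titles = set()
    titles = []
    for page_titles in pages:
        if not page_titles:                 # empty page: nothing to compare, stop
            break
        first = page_titles[0]
        if first in seen_first_titles:      # same page served again: stop
            break
        seen_first_titles.add(first)
        titles.extend(page_titles)
    return titles


# Hypothetical data: the last page is served twice.
pages = [["A", "B"], ["C", "D"], ["C", "D"]]
print(collect_until_repeat(pages))
```

A set is used here instead of a list because only membership matters; the check would misfire if two different pages happened to start with the same title.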

# check that every list holds the same number of extracted items
book_list = [ title_list, rank_list, price_list, author_list ]

for book in book_list:
    print(len(book))
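
Printing the lengths shows a mismatch but does not stop the script. One possible way to make the check strict is sketched below, assuming the four lists should line up row by row; `to_rows` is a hypothetical helper, not part of the original crawler.

```python
def to_rows(rank_list, title_list, author_list, price_list):
    """Zip the four columns into rows, refusing to continue if their lengths differ."""
    columns = [rank_list, title_list, author_list, price_list]
    lengths = [len(column) for column in columns]
    if len(set(lengths)) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return list(zip(*columns))


# Hypothetical sample values.
rows = to_rows(["1", "2"], ["Book A", "Book B"], ["Kim", "Lee"], ["16,200", "15,120"])
print(rows)
```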

# write to CSV ("utf-8-sig" adds a BOM so Excel detects the encoding)
with codecs.open("C:/Users/EthanJ/develop/PLAYDATA/Python_basic/crawling/crawler_with_traversal.csv", 'w', "utf-8-sig") as w_csv:
    for i in range(len(title_list)):
        # replace ASCII commas with full-width ones so a value never splits into extra columns
        this_line = "%s,%s,%s,%s\n" % (rank_list[i].replace(',', '，'), title_list[i].replace(',', '，'),
                                       author_list[i].replace(',', '，'), price_list[i].replace(',', '，'))
        w_csv.write(this_line)
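
As an alternative to swapping commas by hand, the standard `csv` module quotes any value that contains a comma, so the original text is kept. This is a sketch of that swapped-in technique, not the approach used above; `write_books` and the sample row are hypothetical, and an in-memory buffer replaces the real file so the result can be inspected.

```python
import csv
import io


def write_books(rows, file_obj):
    """Write (rank, title, author, price) rows; csv.writer quotes values containing commas."""
    writer = csv.writer(file_obj)
    writer.writerows(rows)


# Hypothetical sample row with commas inside the title and the price.
buffer = io.StringIO()
write_books([("1", "Hello, World", "Kim", "16,200")], buffer)
print(buffer.getvalue())
```

For a real file, `open(path, "w", newline="", encoding="utf-8-sig")` can take the place of the buffer; `newline=""` is what the `csv` documentation recommends so that line endings are not translated twice.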
