Python Web Crawling with BeautifulSoup - Saichoiblog

Contents

Python Web Crawling with BeautifulSoup
- Crawling newspaper names from Naver News
- Crawling weather data from Naver

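Python Web Crawling with BeautifulSoup

Data collection methods

Public data APIs
Web crawling
Direct collection: usually, you compare the normal situation with the target situation to pick out features for machine learning.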

Crawling newspaper names from Naver News

Step1. Analyze the URL

https://news.naver.com/main/read.naver?mode=LSD&mid=shm&sid1=100&oid=366&aid=0000760936

sid1	Category (politics, economy, culture, etc.)
oid	Newspaper
aid	Article number (may not exist, so be careful)
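
The query parameters in the article URL above can be pulled apart with the standard library, which makes the role of each one easier to see. This is a small sketch using `urllib.parse`; the variable names are chosen here for illustration only.

```python
from urllib.parse import urlparse, parse_qs

article_url = (
    "https://news.naver.com/main/read.naver"
    "?mode=LSD&mid=shm&sid1=100&oid=366&aid=0000760936"
)

# parse_qs returns a dict that maps each parameter name to a list of values
params = parse_qs(urlparse(article_url).query)

category_id = params["sid1"][0]  # category (politics, economy, culture, ...)
press_id = params["oid"][0]      # newspaper
article_id = params["aid"][0]    # article number

print(category_id, press_id, article_id)
```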

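Step2. Scrape with a for loop

Each three-digit code is requested as the pcode parameter of the Newsstand page, and the codes that respond with HTTP 200 are kept.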

import requests

oid_list = []

# Try every three-digit code from 001 to 999
for num in range(1, 1000):
    oid = str(num).zfill(3)  # zero-pad, e.g. 1 -> "001"
    uri = f"https://newsstand.naver.com/?list=&pcode={oid}"

    response = requests.get(uri)

    # Keep only the codes for which the page responds with HTTP 200
    if response.status_code == 200:
        oid_list.append(oid)

print(f"Total number of oids : {len(oid_list)}")
print(oid_list)

Total number of oids : 229
['002', '003', '005', '006', '008', '009', '011', '013', '014', '015', '016', '018', '020', '021', '022', '023', '024', '025', '028', '029', '030', '031', '032', '038', '040', '042', '044', '047', '050', '052', '055', '056', '057', '073', '075', '076', '079', '081', '082', '087', '088', '089', '092', '094', '108', '109', '117', '120', '122', '123', '135', '138', '139', '140', '143', '144', '213', '214', '215', '241', '243', '277', '293', '296', '301', '308', '310', '311', '312', '314', '326', '327', '328', '329', '330', '331', '332', '333', '334', '335', '336', '337', '338', '339', '340', '344', '345', '346', '353', '354', '355', '356', '361', '362', '363', '364', '366', '368', '374', '376', '384', '385', '387', '388', '389', '391', '396', '410', '416', '417', '421', '422', '440', '477', '529', '536', '539', '801', '802', '803', '804', '805', '806', '807', '808', '809', '810', '811', '812', '813', '814', '815', '816', '817', '818', '819', '820', '821', '822', '823', '824', '825', '826', '827', '828', '829', '830', '901', '902', '903', '904', '905', '906', '907', '908', '909', '910', '911', '913', '914', '915', '916', '917', '920', '921', '922', '923', '924', '925', '926', '927', '928', '930', '932', '934', '935', '936', '937', '938', '940', '941', '942', '943', '944', '945', '947', '948', '949', '950', '951', '952', '953', '954', '955', '956', '957', '958', '959', '960', '961', '962', '963', '964', '965', '966', '967', '968', '969', '970', '971', '972', '973', '974', '975', '976', '977', '978', '979', '980', '981', '982', '983', '984', '986', '988', '989', '990', '991', '993']

Alternative approach

import requests

# Goal: collect oids
oid_list = []
for num in range(0, 1000):
    oid = '%03d' % num  # zero-pad with printf-style formatting
    uri = f"https://newsstand.naver.com/?list=&pcode={oid}"

    response = requests.get(uri)

    if response.status_code == 200:
        oid_list.append(oid)

print(f"Total number of oids : {len(oid_list)}")
print(oid_list)
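
Both versions above build the three-digit code by zero-padding an integer. As a quick check, the sketch below shows that `str.zfill`, the `%03d` format, and an f-string format spec give the same result. The f-string form is not used in the original code and is included only for comparison.

```python
n = 7

padded_zfill = str(n).zfill(3)  # pad the string form with leading zeros
padded_percent = "%03d" % n     # printf-style formatting
padded_fstring = f"{n:03d}"     # format spec inside an f-string

print(padded_zfill, padded_percent, padded_fstring)

# All three-digit codes from 001 to 999
codes = [f"{i:03d}" for i in range(1, 1000)]
print(codes[0], codes[-1], len(codes))
```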

Step3. Parse with a library

Parsing an HTML document by hand is difficult, so we use a library called BeautifulSoup.

1. Fetch the HTML to parse
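
To see why hand-parsing gets awkward, here is a rough sketch that pulls the `alt` text of an `img` tag using only the standard library's `html.parser`. The class name `ImgAltCollector` and the sample HTML are made up for illustration. Even this simple task needs a custom class and a callback, and it cannot yet express a condition such as "only the img inside a particular div".

```python
from html.parser import HTMLParser


class ImgAltCollector(HTMLParser):
    """Collects the alt attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "img":
            for name, value in attrs:
                if name == "alt":
                    self.alts.append(value)


# Hypothetical markup for illustration only
sample_html = '<div><h3><a href="#"><img src="logo.png" alt="Example Press"></a></h3></div>'

collector = ImgAltCollector()
collector.feed(sample_html)
print(collector.alts)
```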

import requests

# Only the first two codes, just to inspect the returned HTML
for num in range(1, 3):
    oid = str(num).zfill(3)
    uri = f"https://newsstand.naver.com/?list=&pcode={oid}"

    response = requests.get(uri)

    if response.status_code == 200:
        print(response.text)  # the body is HTML, so read it as text

…(omitted)…
<div id="focusPanelCenter" class="panel"><div class="panel_inner">
        <h3><a href="https://www.pressian.com/" target="_blank" onclick="Newsstand.Panel.onClickLogo(this, event, '프레시안')"><img src="https://ssl.pstatic.net/static/newsstand/up/2016/0325/nsd185740259.png" width="260" height="55" alt="프레시안"></a></h3>
        <div class="sc_tp">
                <div class="fr">
                        <a href="#" class="btn_sbc" id="btn_002">MY언론사로 구독하기</a>
                        <div class="tooltip2 tt_btn_sbc" style="top:23px;left:-41px;display:none;"><p class="msg2">구독하신 언론사는 MY뉴스에서 확인할 수 있습니다.</p></div>
                        <div class="tooltip2 tt_btn_unsbc" style="top:23px;left:-41px;display:none;"><p class="msg5">MY뉴스 언론사 목록에서 제외됩니다.</p></div>
                        <a href="http://cafe.naver.com/navernewscast/menu/53" class="btn_ombs" target="_blank" onclick="Newsstand.Panel.onClickOmbs(event)"><span class="blind">이용자 한마디</span><span class="rd"></span></a>
                        <div class="tooltip2 tt_btn_ombs" style="top: 23px; left: 24px; display: none;">
                                <p class="msg6">뉴스 편집판에 대해 의견 주시면 옴부즈맨의 자료로 활용됩니다.</p>
                        </div>
                        <a href="#" class="btn_social" onclick="Newsstand.Panel.openSns(event)" title="소셜보내기"><span class="blind">소셜보내기</span></a>
                        <div class="tooltip2 social_tooltip" style="top:26px;left:38px; display:none;">
                        <div class="lyr_social">
                        <a href="#" class="btn_band" title="밴드로 뉴스보내기" onclick="Newsstand.Panel.onClickPost(event, 'band')"><span class="btn_ico"></span>밴드</a>
                        <a href="#" class="btn_twit" title="트위터로 뉴스보내기" target="_blank" onclick="Newsstand.Panel.onClickPost(event, 'twitter')"><span class="btn_ico"></span>트위터</a>
                        <a href="#" class="btn_facebook" title="페이스북으로 뉴스보내기" target="_blank" onclick="Newsstand.Panel.onClickPost(event, 'facebook')"><span class="btn_ico"></span>페이스북</a>
                        </div>
                        </div>
                </div>
                <span class="fl"><em>09-13 17:24</em>편집</span>
        </div>
        <iframe src="about:blank" width="840" height="380" frameborder="0" scrolling="no" class="ifr_arc" allowTransparency="true"></iframe>
    <iframe class="ifr_ad2" src="about:blank" title="광고" width="468" height="60" scrolling="no" frameborder="0" data-veta-preview="p_news_stand_00"></iframe>
    <a href="https://www.pressian.com/" target="_blank" class="btn_gomedia"><strong>프레시안</strong> 사이트 바로가기<span class="ico"></span></a>

…(omitted)…

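2. Parse the newspaper name from the alt attribute of the img tag

3. Prepare the BeautifulSoup library for extracting data from HTML

https://beautiful-soup-4.readthedocs.io/en/latest/

Install the library: pip install beautifulsoup4

pip is Python's package installer; it also installs the dependencies of the packages you request.

BeautifulSoup test

Use select instead of find! select and select_one take CSS selectors, so a nested lookup fits in one expression.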

from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<h1>H1 tag</h1>
<div class='a'>find by class - a</div>
<div class='b'>find by class - b</div>
<div class='b'>find by class - b2</div>
<div id='hello'>find by id</div>
<div id='focusPanelCenter'>
    <div class='panel_inner'>
        <img alt='국민일보'></img>
    </div>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# HTML elements make up the DOM
m_body = soup.body
#print(m_body)

# Find the h1 tag
m_h1 = soup.h1
#print(m_h1)

# Find class='a'
m_a = soup.find(class_="a")
#print(m_a)

# Find class='b' (find returns only the first match)
m_b = soup.find(class_="b")
#print(m_b)

# Find every element with class='b'
m_b_all = soup.find_all(class_="b")  # returns a list-like ResultSet
#print(m_b_all)

# Find by id
m_hello = soup.find(id="hello")
#print(m_hello)

# Get the text inside the first element with class 'b'
target2 = soup.select_one(".b")
#print(target2.text)

# Find the img inside #focusPanelCenter
target = soup.select_one('#focusPanelCenter .panel_inner img')
#print(target)
title = target["alt"]
#print(title)
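
One reason to prefer `select` is that a nested lookup fits in a single CSS selector, while the `find` version needs one call per level. A minimal comparison, using a cut-down copy of the test document above:

```python
from bs4 import BeautifulSoup

html_doc = """
<div id='focusPanelCenter'>
    <div class='panel_inner'>
        <img alt='국민일보'>
    </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find: walk down one level at a time
panel = soup.find(id="focusPanelCenter")
inner = panel.find(class_="panel_inner")
alt_via_find = inner.find("img")["alt"]

# select_one: describe the whole path with one CSS selector
alt_via_select = soup.select_one("#focusPanelCenter .panel_inner img")["alt"]

print(alt_via_find, alt_via_select)

# select_one returns None when nothing matches, so check before indexing
missing = soup.select_one("#noSuchId img")
print(missing)
```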

4. Parse the newspaper names using what we tested

import requests
from bs4 import BeautifulSoup

oid_names = []

# Try every three-digit code from 001 to 999
for num in range(1, 1000):
    oid = str(num).zfill(3)
    uri = f"https://newsstand.naver.com/?list=&pcode={oid}"

    response = requests.get(uri)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        target = soup.select_one('#focusPanelCenter .panel_inner img')
        # select_one returns None when nothing matches, so check before reading alt
        if target is not None and target.has_attr("alt"):
            oid_names.append(target["alt"])

print(oid_names)
print(f"Total number of newspapers : {len(oid_names)}")

Fetching and parsing all the pages takes about 1 to 2 minutes, so please be patient.
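
Since the loop sends close to a thousand requests, it can be worth setting a timeout and pausing briefly between requests. The sketch below is one possible arrangement, not part of the original code: `newsstand_url`, `fetch_html`, `collect_codes`, and the `delay` parameter are hypothetical names, and the fetch function is passed in so the loop can be tried with a stub instead of the network.

```python
import time


def newsstand_url(code):
    """Build the Newsstand URL used in the steps above."""
    return f"https://newsstand.naver.com/?list=&pcode={code}"


def fetch_html(url, timeout=5.0):
    """Return the page HTML, or None on a non-200 status or a request error."""
    import requests  # imported lazily so the stub demo below runs without it

    try:
        response = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return None
    return response.text if response.status_code == 200 else None


def collect_codes(codes, fetch, delay=0.0):
    """Return the codes whose page could be fetched, pausing between requests."""
    found = []
    for code in codes:
        if fetch(newsstand_url(code)) is not None:
            found.append(code)
        time.sleep(delay)
    return found


# Stub that pretends only pcode=002 exists, so no network access is needed
def fake_fetch(url):
    return "<html></html>" if url.endswith("pcode=002") else None


result = collect_codes(["001", "002", "003"], fake_fetch)
print(result)
```

For a real run, a call such as `collect_codes(codes, fetch_html, delay=0.2)` would apply; the 0.2 second pause is an arbitrary choice.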

Crawling weather data from Naver

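import requests
from bs4 import BeautifulSoup

# Naver search results for the query "날씨" (weather), URL-encoded
uri = (
    "https://search.naver.com/search.naver"
    "?where=nexearch&sm=top_sly.hst&fbm=0&acr=1"
    "&acq=%EB%82%A0%EC%94%A8&qdt=0&ie=utf8&query=%EB%82%A0%EC%94%A8"
)

response = requests.get(uri)

soup = BeautifulSoup(response.text, 'html.parser')
target = soup.select_one('.todaytemp')

# select_one returns None if the page has no element with class "todaytemp"
if target is not None:
    print(target.text)
else:
    print("No element with class 'todaytemp' was found.")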
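
The result of `select_one` depends entirely on the page's markup, which can change. As an offline illustration, the sketch below runs the same selector against a hypothetical HTML snippet (the markup and the value 23 are invented, not taken from Naver) and shows one way to handle the case where the selector matches nothing. The function name `read_temperature` is also hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
sample_html = """
<div class="main_info">
    <span class="todaytemp">23</span><span class="tempmark">℃</span>
</div>
"""


def read_temperature(html):
    """Return the text of the .todaytemp element, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    target = soup.select_one(".todaytemp")
    return target.text.strip() if target is not None else None


print(read_temperature(sample_html))
print(read_temperature("<div>no temperature here</div>"))
```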
