The naver_news_search_scraper from lovit

Notice

We recently found that news scraping no longer works (see issue #3). This notice will be removed once the problem is resolved.

Install

Written in Python 3. It uses the following packages.

beautifulsoup4 >= 4.7.1
requests >= 2.14.2

To install, fetch the code with git clone or download it.

Python version issue

On Python 3.7.x, the body of some news articles is not scraped correctly. It works correctly on 3.6.x.

BeautifulSoup4 version issue

When the code was written, BeautifulSoup4 was at version 4.6.x or lower, and some parts did not work with 4.7.x (see Issue #1). This has been fixed, so if you use 4.7.x or later, please run git pull once (written 2019.02.02 23:20).

Usage

The scripts are py files that you run with Python. First move to the naver_news_search_crawler folder.

python searching_news_comments.py --verbose --debug --comments

Running searching_news_comments.py saves news and comments in the output folder. The script accepts several arguments.

argument name	default value	note
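--root_directory	../output/	Where the collected news and comments are stored
--begin_date	2018-10-26	First day of data collection, in yyyy-mm-dd format
--end_date	2018-10-28	Last day of data collection, in yyyy-mm-dd format
--sleep	0.1	Pause, in seconds, that keeps the scraper from putting load on the Naver news site. If it is too short, Naver may treat the access as an attack and block the connection
--header	None	Name used for the news files and the storage folder. Described in detail below
--query_file	queries.txt	Text file containing the queries. Enter one query term per line
--debug	False	True when --debug is passed. Collects only 3 pages of news and comments per day
--verbose	False	True when --verbose is passed. Shows detailed progress
--comments	False	True when --comments is passed. Also collects the comments of each news article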

Collected data is stored in separate text files per date. To merge them into single files, run the script below. For the example data, the news articles, the news article index, and the comments in ../output/2019-01-14_05-31-08/economy are saved to the following files, respectively: ../output/2019-01-14_05-31-08/economy.txt, ../output/2019-01-14_05-31-08/economy.index, ../output/2019-01-14_05-31-08/economy.comment.txt

python merging_scrap_results.py --directory ../output/2019-01-14_05-31-08/economy

Query file format

The query_file that holds the queries (queries.txt in the example code) can be written in three forms.

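The first form contains only the query. Entering just 외교 collects articles from the default begin_date through end_date into a folder named 0.

The second form gives the query and a folder name. Articles for 경제 are collected into a folder named economic, covering the default begin_date through end_date.

The third form gives the query, the folder name, and the begin and end dates of collection. Regardless of the default dates, articles from 2018-01-01 through 2018-01-03 are collected.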

외교
경제	economic
사회	social	2018-01-01	2018-01-03
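
The three forms above can be read with a few lines of Python. This is a minimal sketch, not the scraper's own parser: it assumes the fields are separated by tabs, that a line without a folder name uses its 0-based position among the non-blank lines as the folder name, and that missing dates fall back to --begin_date and --end_date. QuerySpec and parse_query_lines are hypothetical names introduced only for this illustration.

```python
from typing import List, NamedTuple, Optional


class QuerySpec(NamedTuple):
    """One line of the query file (hypothetical container, not part of the scraper)."""
    query: str
    folder: str
    begin_date: Optional[str]  # None means "fall back to --begin_date"
    end_date: Optional[str]    # None means "fall back to --end_date"


def parse_query_lines(lines: List[str]) -> List[QuerySpec]:
    """Parse query-file lines, assuming tab-separated fields."""
    specs: List[QuerySpec] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        query = fields[0]
        # Assumption: without a folder name, the folder is the query's 0-based position.
        folder = fields[1] if len(fields) >= 2 else str(len(specs))
        if len(fields) >= 4:
            begin_date, end_date = fields[2], fields[3]
        else:
            begin_date, end_date = None, None
        specs.append(QuerySpec(query, folder, begin_date, end_date))
    return specs


example = ["외교", "경제\teconomic", "사회\tsocial\t2018-01-01\t2018-01-03"]
for spec in parse_query_lines(example):
    print(spec)
```

A date of None in the result means that the command-line defaults apply to that query.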

Directory structure

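This section assumes the default arguments. The default storage location for collected data is ../output/.

If you pass a header, a folder named after the header is created under the output folder. The diplomacy folder below is the result of passing --header diplomacy. Under diplomacy, folders numbered from 0 are created in the order of the query terms. Each of them contains a news folder and a comments folder. The news folder holds, for each date, a news file (.txt) and a news index file (.index). The comments folder holds the comments of every article that has comments, stored as tab-separated (tsv) files.

If you do not pass a header, the folder is named after the time the script was started (down to the second). In that case, no header is appended to the news and index file names.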

--| naver_news_search_crawler
--| output
    --| diplomacy
        --| 0
            --| news
                --| 2018-10-26_diplomacy.txt
                --| 2018-10-26_diplomacy.index
            --| comments
                --| 001-0010429592.txt
                --| 001-0010429850.txt
                --| ...
    --| 2018-10-29_18-52-27
        --| 0
            --| news
                --| 2018-10-26.txt
                --| 2018-10-26.index
            --| comments
                --| 001-0010429592.txt
                --| 001-0010429850.txt
                --| ...
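
The naming rule in the tree above can be written down as a small helper. This is a sketch for illustration only; news_paths is a hypothetical function, not part of the scraper, and it simply rebuilds the paths shown in the tree from their parts.

```python
from pathlib import Path
from typing import Optional, Tuple


def news_paths(root: str, run_folder: str, query_folder: str,
               date: str, header: Optional[str] = None) -> Tuple[Path, Path]:
    """Return the (.txt, .index) paths for one date, following the tree above.

    run_folder is the header name, or the start timestamp when no header is given.
    """
    # The header is appended to the file name only when one was passed.
    suffix = f"_{header}" if header else ""
    news_dir = Path(root) / run_folder / query_folder / "news"
    return news_dir / f"{date}{suffix}.txt", news_dir / f"{date}{suffix}.index"


txt_path, index_path = news_paths("../output", "diplomacy", "0",
                                  "2018-10-26", header="diplomacy")
print(txt_path)
print(index_path)
```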

News file structure

Named 2018-10-26[_header].txt. Each line is one news article, and line breaks inside an article are represented by two spaces.

Index file structure

Named 2018-10-26[_header].index. Each line is the index entry of one news article. Line by line, it refers to the same article as the 2018-10-26[_header].txt file.

The index file is in tab-separated value (tsv) format with four fields. It has no header row.

column	example	note
key	052/2018/10/01/0001199008	press ID / yyyy / mm / dd / article ID
category name (or number)	104	news category number 104
article written time	2018-10-01 23:52	or the last modified time. The format is occasionally inconsistent
article title	.	article title
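
Because the news file and the index file are aligned line by line, they can be read together. The following is a minimal sketch under that assumption. The column names in INDEX_COLUMNS and the function pair_news_with_index are hypothetical, chosen here for illustration, since the index file itself has no header row.

```python
from typing import Dict, Iterator, List

# Names chosen for this example; the index file has no header row.
INDEX_COLUMNS = ("key", "category", "written_time", "title")


def pair_news_with_index(news_lines: List[str],
                         index_lines: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one record per article by pairing the n-th lines of both files."""
    for body, index_line in zip(news_lines, index_lines):
        # Split into at most four fields so that the title stays in one piece.
        fields = index_line.rstrip("\n").split("\t", 3)
        record = dict(zip(INDEX_COLUMNS, fields))
        # Two consecutive spaces stand for a line break inside the article.
        record["text"] = body.rstrip("\n").replace("  ", "\n")
        yield record


news = ["first paragraph  second paragraph\n"]
index = ["052/2018/10/01/0001199008\t104\t2018-10-01 23:52\tSome title\n"]
for record in pair_news_with_index(news, index):
    print(record["key"], record["title"])
    print(record["text"])
```

With real files, the two lists would come from reading 2018-10-26[_header].txt and 2018-10-26[_header].index. Note that replacing every double space also changes any double space that appeared in the original article text.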

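Comments file structure

001-0010429592.txt holds the comments on the article (press=001, article=0010370550) written on 10-26. It is a tab-separated (tsv) file.

The first line of this file is the column header.

We could not confirm since when, but the hash of the commenter's ID is no longer returned. In that case, the scraper now takes the user name with its trailing part masked by asterisks.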

column	example	note
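comment_no	1514778615	unique comment ID
user_id_no	6EVlK	hash of the commenter's ID, or, if no hash is provided, the user name with its trailing characters masked
contents	좋은 방향으로 얘기 잘 되었으면..	comment text
reg_time	2018-10-28T23:41:26+0900	time the comment was posted
sympathy_count	0	number of likes on the comment
antipathy_count	0	number of dislikes on the comment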
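
A comments file can be loaded with Python's standard csv module. This is a minimal sketch that assumes the header row uses exactly the column names listed in the table above; read_comments is a hypothetical helper, not part of the scraper. Quoting is disabled so that quotation marks inside a comment are kept as plain text.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_comments(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-separated comments file whose first line is the header row."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# In-memory sample built from the example values in the table above.
sample = io.StringIO(
    "comment_no\tuser_id_no\tcontents\treg_time\tsympathy_count\tantipathy_count\n"
    "1514778615\t6EVlK\t좋은 방향으로 얘기 잘 되었으면..\t2018-10-28T23:41:26+0900\t0\t0\n"
)
for row in read_comments(sample):
    print(row["comment_no"], row["reg_time"], int(row["sympathy_count"]))
```

For a file on disk, pass a handle opened with open(path, encoding="utf-8", newline="") instead of the in-memory sample.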

Error In "search_crawler.py"

Hello! I ran into an error, and I am leaving this issue after working out a fix.

Environment:

Ubuntu Server 18.04 LTS
Python : 3.6.5
beautifulsoup4 : 4.7.1
requests : 2.21.0

Command

$ python searching_news_comments.py --query_file queries.txt

Error description

The error occurred in the _parse_urls_from_page function in search_crawler.py:

url_patterns = ('a[href=https://news.naver.com/main/read.nhn?]',
            'a[href^=https://entertain.naver.com/main/read.nhn?]',
            'a[href^=https://sports.news.naver.com/sports/index.nhn?]',
            'a[href^=https://news.naver.com/sports/index.nhn?]')
...
for pattern in url_patterns:
        article_urls = [link['href'] for link in article_blocks.select(pattern)]
        urls_in_page.update(article_urls)

The article_blocks variable (a bs4.element.Tag object) failed to recognize the pattern.
Putting double quotes (" ") around the link in each pattern string resolved it.

Before: ('a[href=https://news.naver.com/main/read.nhn?]'
After: ('a[href="https://news.naver.com/main/read.nhn?"]'

PS: It may be just me, but I hope this helps anyone else who hits the same error. :)
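
The likely reason the quotes matter: from version 4.7 on, BeautifulSoup4 delegates select() to soupsieve (visible in the traceback), which follows CSS syntax, and in CSS an unquoted attribute value must be a plain identifier. A URL contains characters such as : and /, so it has to be quoted. The sketch below applies the same fix to a list of patterns using only the standard library. quote_attribute_values is a hypothetical helper written for this illustration, not code from the repository, and it handles only simple selectors like the ones shown above.

```python
import re

# Matches an attribute selector whose value is not quoted, e.g. [href^=https://...]
_UNQUOTED_ATTR = re.compile(r"\[([\w-]+)([~|^$*]?=)([^\"'\]]+)\]")


def quote_attribute_values(selector: str) -> str:
    """Wrap unquoted attribute values in double quotes; leave quoted ones as they are."""
    return _UNQUOTED_ATTR.sub(r'[\1\2"\3"]', selector)


url_patterns = (
    'a[href^=https://entertain.naver.com/main/read.nhn?]',
    'a[href^=https://sports.news.naver.com/sports/index.nhn?]',
)
for pattern in url_patterns:
    print(quote_attribute_values(pattern))
```

In the repository itself the simpler fix is the one described above: write the quotes directly in the pattern strings.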

Error details

Traceback (most recent call last):
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 97, in _parse_urls_from_page
    article_urls = [link['href'] for link in article_blocks.select(pattern)]
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/bs4/element.py", line 1376, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/__init__.py", line 108, in select
    return compile(select, namespaces, flags).select(tag, limit)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/__init__.py", line 59, in compile
    return cp._cached_css_compile(pattern, namespaces, flags)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 192, in _cached_css_compile
    CSSParser(pattern, flags).process_selectors(),
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 894, in process_selectors
    return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 744, in parse_selectors
    key, m = next(iselector)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 881, in selector_iter
    raise SyntaxError(msg)
SyntaxError: Malformed attribute selector at position 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "searching_news_comments.py", line 73, in <module>
    main()
  File "searching_news_comments.py", line 70, in main
    crawler.search(query, bd, ed)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 131, in search
    scrap_date, verbose=self.verbose, debug=self.debug)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 26, in get_article_urls
    search_result_url, num_articles, verbose, debug)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 70, in _extract_urls_from_search_result
    urls_in_page = _parse_urls_from_page(search_result_url, page)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 100, in _parse_urls_from_page
    raise ValueError('Failed to extract urls from page %s' % str(e))
ValueError: Failed to extract urls from page Malformed attribute selector at position 1
