lovit / naver_news_search_scraper
Python code that collects Naver News articles and comments by search query
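I have run your code a few times before, and I recently tried it again because I wanted to reuse it. It finds the articles and comments, but nothing is actually collected.
Could you tell me whether it still works for you?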
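I tested with this link and it does not work:
https://n.news.naver.com/mnews/article/comment/448/0000438468?sid=104
I extracted aid and oid with request() and BeautifulSoup and passed them into the base_url used by get_comments(), but it does not work. It looks to me like there is an error in base_url. I have seen similar URLs on other sites, and none of them work either. What could be the cause?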
base_url = ''.join(['https://apis.naver.com/commentBox/cbox/web_naver_list_jsonp.json?ticket=news&',
'templateId=view_politics&pool=cbox5&lang=ko&country=KR&objectId=news',
'{}%2C{}&pageSize={}&page={}&sort={}&initialize=true&useAltSort=true&indexSize=10'])
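For anyone debugging the same problem, the following is a minimal sketch of how the template above gets filled in, so the final URL can be printed and inspected. It assumes that the two numeric path segments of the article link are oid followed by aid; the helper names `extract_oid_aid` and `build_comment_url` are hypothetical and not part of the repository. It only builds the string and says nothing about whether the endpoint still answers.

```python
import re

# Template copied from the snippet above. The five placeholders are filled,
# in order, with oid, aid, page size, page number, and sort order.
BASE_URL = ''.join([
    'https://apis.naver.com/commentBox/cbox/web_naver_list_jsonp.json?ticket=news&',
    'templateId=view_politics&pool=cbox5&lang=ko&country=KR&objectId=news',
    '{}%2C{}&pageSize={}&page={}&sort={}&initialize=true&useAltSort=true&indexSize=10'])


def extract_oid_aid(article_url):
    """Return (oid, aid) as strings.

    Assumption: the URL path ends with .../<oid>/<aid>, optionally followed
    by a query string or fragment.
    """
    match = re.search(r'/(\d+)/(\d+)(?:[?#]|$)', article_url)
    if match is None:
        raise ValueError('No oid/aid pair found in %s' % article_url)
    return match.group(1), match.group(2)


def build_comment_url(oid, aid, page_size, page, sort):
    """Fill the template; str.format leaves the literal %2C untouched."""
    return BASE_URL.format(oid, aid, page_size, page, sort)


if __name__ == '__main__':
    link = 'https://n.news.naver.com/mnews/article/comment/448/0000438468?sid=104'
    oid, aid = extract_oid_aid(link)
    # page_size=20, page=1, sort='new' are placeholder values for illustration.
    print(build_comment_url(oid, aid, 20, 1, 'new'))
```

Keeping oid and aid as strings preserves the leading zeros in aid, which would be lost if the value were converted to an integer.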
Hello! I ran into an error and am leaving this issue after working through it.
Ubuntu Server 18.04 LTS
Python : 3.6.5
beautifulsoup4 : 4.7.1
requests : 2.21.0
$ python searching_news_comments.py --query_file queries.txt
The error occurred in the _parse_urls_from_page function in search_crawler.py:
url_patterns = ('a[href=https://news.naver.com/main/read.nhn?]',
'a[href^=https://entertain.naver.com/main/read.nhn?]',
'a[href^=https://sports.news.naver.com/sports/index.nhn?]',
'a[href^=https://news.naver.com/sports/index.nhn?]')
...
for pattern in url_patterns:
    article_urls = [link['href'] for link in article_blocks.select(pattern)]
    urls_in_page.update(article_urls)
The article_blocks variable (a bs4.element.Tag object) could not parse the patterns.
Wrapping the URL in each pattern string in double quotes (" ") fixed it:
Before: ('a[href=https://news.naver.com/main/read.nhn?]'
After:  ('a[href="https://news.naver.com/main/read.nhn?"]'
PS: I am not sure whether this only happens to me, but I hope it helps anyone else who hits the same error. :)
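The quoting fix described above can also be applied to the whole tuple programmatically instead of editing each string by hand. The sketch below uses only the standard `re` module; `quote_attr_values` is a hypothetical helper name, and the regular expression only covers simple attribute selectors like the ones in this issue (no nested quotes or escapes).

```python
import re

# Matches [attr=value], [attr^=value], [attr$=value], etc. where the value
# does not already start with a quote character.
_UNQUOTED_ATTR = re.compile(r'\[(\w+)([~|^$*]?=)([^"\'\]]+)\]')


def quote_attr_values(pattern):
    """Wrap unquoted attribute values in double quotes.

    Patterns that are already quoted are returned unchanged.
    """
    return _UNQUOTED_ATTR.sub(r'[\1\2"\3"]', pattern)


url_patterns = ('a[href^=https://news.naver.com/main/read.nhn?]',
                'a[href^=https://entertain.naver.com/main/read.nhn?]',
                'a[href^=https://sports.news.naver.com/sports/index.nhn?]',
                'a[href^=https://news.naver.com/sports/index.nhn?]')

quoted_patterns = tuple(quote_attr_values(p) for p in url_patterns)

if __name__ == '__main__':
    for p in quoted_patterns:
        print(p)
```

Each string in `quoted_patterns` can then be passed to `article_blocks.select(pattern)` in place of the unquoted original.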
Traceback (most recent call last):
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 97, in _parse_urls_from_page
    article_urls = [link['href'] for link in article_blocks.select(pattern)]
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/bs4/element.py", line 1376, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/__init__.py", line 108, in select
    return compile(select, namespaces, flags).select(tag, limit)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/__init__.py", line 59, in compile
    return cp._cached_css_compile(pattern, namespaces, flags)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 192, in _cached_css_compile
    CSSParser(pattern, flags).process_selectors(),
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 894, in process_selectors
    return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 744, in parse_selectors
    key, m = next(iselector)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 881, in selector_iter
    raise SyntaxError(msg)
SyntaxError: Malformed attribute selector at position 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "searching_news_comments.py", line 73, in <module>
    main()
  File "searching_news_comments.py", line 70, in main
    crawler.search(query, bd, ed)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 131, in search
    scrap_date, verbose=self.verbose, debug=self.debug)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 26, in get_article_urls
    search_result_url, num_articles, verbose, debug)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 70, in _extract_urls_from_search_result
    urls_in_page = _parse_urls_from_page(search_result_url, page)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 100, in _parse_urls_from_page
    raise ValueError('Failed to extract urls from page %s' % str(e))
ValueError: Failed to extract urls from page Malformed attribute selector at position 1
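The two-part traceback above appears because a ValueError is raised while the original SyntaxError is still being handled. As a rough illustration of how to get back to the underlying error when debugging, the sketch below reproduces that wrapping with stand-in functions; `compile_selector_stub`, `parse_urls_stub`, and `root_cause` are hypothetical names, and the stub does not call BeautifulSoup or soupsieve.

```python
def compile_selector_stub(pattern):
    # Stand-in for the selector compile step that rejected the unquoted URL.
    raise SyntaxError('Malformed attribute selector at position 1')


def parse_urls_stub(pattern):
    try:
        return compile_selector_stub(pattern)
    except Exception as e:
        # Same wrapping style as the raise shown at the end of the traceback.
        raise ValueError('Failed to extract urls from page %s' % str(e))


def root_cause(exc):
    """Follow __cause__ / __context__ links back to the first exception."""
    while exc.__cause__ is not None or exc.__context__ is not None:
        exc = exc.__cause__ if exc.__cause__ is not None else exc.__context__
    return exc


if __name__ == '__main__':
    try:
        parse_urls_stub('a[href=https://news.naver.com/main/read.nhn?]')
    except ValueError as err:
        print(type(root_cause(err)).__name__, '<-', err)
```

Python stores the exception that was being handled in `__context__` automatically, so the original SyntaxError stays reachable even though only its message was copied into the ValueError.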
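I would like to put environment-related keywords into the query and extract the articles and comments for those keywords.
I entered 기후변화 (climate change) as the query, and tried both with and without setting the other options and a specific date range.
The result contained only the news and the news index; no comments were collected.
What is going wrong in this case? Please let me know whether the way I put the keyword in the query is wrong, whether comment extraction is simply not supported for certain keywords, or whether there is some other cause. Thank you!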