
naver_news_search_scraper's Issues

Question about running the program

I have run your code a few times in the past. Recently I wanted to use it again, but when I re-ran it, it finds the articles and comments yet nothing actually gets collected.

May I ask whether it still runs on your end?

Comment collection does not work

I tested it with this link and it does not work:
https://n.news.naver.com/mnews/article/comment/448/0000438468?sid=104

I extracted the aid and oid with request() and BeautifulSoup and plugged them into base_url for get_comments(), but it still fails. As far as I can tell, there is an error in base_url. I have seen similar URLs on other sites, and none of them work either. Why would that be?

base_url = ''.join(['https://apis.naver.com/commentBox/cbox/web_naver_list_jsonp.json?ticket=news&',
                    'templateId=view_politics&pool=cbox5&lang=ko&country=KR&objectId=news',
                    '{}%2C{}&pageSize={}&page={}&sort={}&initialize=true&useAltSort=true&indexSize=10'])
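For reference, here is a minimal sketch of how the oid/aid could be pulled from a comment-page URL like the one above and substituted into this template. The placeholder order (oid, aid, pageSize, page, sort), the `sort='favorite'` default, and the widely reported requirement that the endpoint rejects requests lacking a `Referer` header are all assumptions, not confirmed API behavior:

```python
# Hypothetical helper sketch: extract oid/aid from a comment-page URL and
# fill the base_url template. The placeholder order (oid, aid, pageSize,
# page, sort) is inferred from the query string above, not confirmed.
import json
import re

BASE_URL = ''.join(['https://apis.naver.com/commentBox/cbox/web_naver_list_jsonp.json?ticket=news&',
                    'templateId=view_politics&pool=cbox5&lang=ko&country=KR&objectId=news',
                    '{}%2C{}&pageSize={}&page={}&sort={}&initialize=true&useAltSort=true&indexSize=10'])

def parse_oid_aid(url):
    """Pull oid/aid out of a URL like .../mnews/article/comment/448/0000438468?sid=104."""
    match = re.search(r'/article/comment/(\d+)/(\d+)', url)
    if match is None:
        raise ValueError('no oid/aid found in url: %s' % url)
    return match.group(1), match.group(2)

def build_comment_url(article_url, page=1, page_size=20, sort='favorite'):
    """Fill the template; 'favorite' as the default sort key is a guess."""
    oid, aid = parse_oid_aid(article_url)
    return BASE_URL.format(oid, aid, page_size, page, sort)

def strip_jsonp(text):
    """The endpoint answers with JSONP, e.g. `_callback({...});`; unwrap to JSON."""
    match = re.search(r'\((.*)\)\s*;?\s*$', text, re.DOTALL)
    if match is None:
        raise ValueError('response is not JSONP')
    return json.loads(match.group(1))

# Actually fetching would look roughly like this (the Referer requirement
# is an assumption based on common reports about this endpoint):
# resp = requests.get(build_comment_url(article_url),
#                     headers={'Referer': article_url})
# data = strip_jsonp(resp.text)
```

If the request comes back empty even with correct oid/aid, the missing `Referer` header is the first thing worth checking.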

Error in "search_crawler.py"

Hello! I ran into an error, and after working out a fix I decided to leave this issue.

Environment:

Ubuntu Server 18.04 LTS
Python : 3.6.5
beautifulsoup4 : 4.7.1
requests : 2.21.0

Command

$ python searching_news_comments.py --query_file queries.txt

Error description

The error occurred in the _parse_urls_from_page function in search_crawler.py:

url_patterns = ('a[href=https://news.naver.com/main/read.nhn?]',
                'a[href^=https://entertain.naver.com/main/read.nhn?]',
                'a[href^=https://sports.news.naver.com/sports/index.nhn?]',
                'a[href^=https://news.naver.com/sports/index.nhn?]')
...
for pattern in url_patterns:
    article_urls = [link['href'] for link in article_blocks.select(pattern)]
    urls_in_page.update(article_urls)

The article_blocks variable (a bs4.element.Tag object) could not parse these patterns.
Wrapping the URL inside each pattern string in double quotes (" ") fixed it.

  • Before: ('a[href=https://news.naver.com/main/read.nhn?]'
  • After: ('a[href="https://news.naver.com/main/read.nhn?"]'

PS: It may just be me, but if anyone else runs into this error, I hope this helps. :)


Error details

Traceback (most recent call last):
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 97, in _parse_urls_from_page
    article_urls = [link['href'] for link in article_blocks.select(pattern)]
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/bs4/element.py", line 1376, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/__init__.py", line 108, in select
    return compile(select, namespaces, flags).select(tag, limit)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/__init__.py", line 59, in compile
    return cp._cached_css_compile(pattern, namespaces, flags)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 192, in _cached_css_compile
    CSSParser(pattern, flags).process_selectors(),
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 894, in process_selectors
    return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 744, in parse_selectors
    key, m = next(iselector)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 881, in selector_iter
    raise SyntaxError(msg)
SyntaxError: Malformed attribute selector at position 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "searching_news_comments.py", line 73, in <module>
    main()
  File "searching_news_comments.py", line 70, in main
    crawler.search(query, bd, ed)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 131, in search
    scrap_date, verbose=self.verbose, debug=self.debug)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 26, in get_article_urls
    search_result_url, num_articles, verbose, debug)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 70, in _extract_urls_from_search_result
    urls_in_page = _parse_urls_from_page(search_result_url, page)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 100, in _parse_urls_from_page
    raise ValueError('Failed to extract urls from page %s' % str(e))
ValueError: Failed to extract urls from page Malformed attribute selector at position 1

Question about comment collection!

I want to put environment-related keywords into the query and pull the articles and comments for those keywords.
I put 기후변화 (climate change) in the query, and tried both with and without setting the miscellaneous category and a specific date range.
The result only fetched the news articles and the news index; no comments were collected.
What is going wrong in this case? Is the way I put keywords into the query wrong, does comment extraction simply not work when collecting by a specific keyword, or is it something else? I would appreciate any guidance!
