
naver_news_search_scraper's Issues

Question about running the program

I have run your code a few times in the past. Recently I wanted to use it again, but when I re-ran it, it finds the articles and comments yet nothing actually gets collected.

May I ask whether it still runs on your end?

Comment collection does not work

I tested it with this link and it does not work:
https://n.news.naver.com/mnews/article/comment/448/0000438468?sid=104

I extracted the aid and oid with request() and BeautifulSoup and plugged them into base_url for get_comments(), but it still fails. As far as I can tell, there is an error in base_url. I have seen similar URLs on other sites, and none of them work either. Why would that be?

base_url = ''.join(['https://apis.naver.com/commentBox/cbox/web_naver_list_jsonp.json?ticket=news&',
                    'templateId=view_politics&pool=cbox5&lang=ko&country=KR&objectId=news',
                    '{}%2C{}&pageSize={}&page={}&sort={}&initialize=true&useAltSort=true&indexSize=10'])
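For reference, here is a minimal sketch of how the oid/aid could be pulled from a comment-page URL like the one above and substituted into this template. The placeholder order (oid, aid, pageSize, page, sort), the `sort='favorite'` default, and the widely reported requirement that the endpoint rejects requests lacking a `Referer` header are all assumptions, not confirmed API behavior:

```python
# Hypothetical helper sketch: extract oid/aid from a comment-page URL and
# fill the base_url template. The placeholder order (oid, aid, pageSize,
# page, sort) is inferred from the query string above, not confirmed.
import json
import re

BASE_URL = ''.join(['https://apis.naver.com/commentBox/cbox/web_naver_list_jsonp.json?ticket=news&',
                    'templateId=view_politics&pool=cbox5&lang=ko&country=KR&objectId=news',
                    '{}%2C{}&pageSize={}&page={}&sort={}&initialize=true&useAltSort=true&indexSize=10'])

def parse_oid_aid(url):
    """Pull oid/aid out of a URL like .../mnews/article/comment/448/0000438468?sid=104."""
    match = re.search(r'/article/comment/(\d+)/(\d+)', url)
    if match is None:
        raise ValueError('no oid/aid found in url: %s' % url)
    return match.group(1), match.group(2)

def build_comment_url(article_url, page=1, page_size=20, sort='favorite'):
    """Fill the template; 'favorite' as the default sort key is a guess."""
    oid, aid = parse_oid_aid(article_url)
    return BASE_URL.format(oid, aid, page_size, page, sort)

def strip_jsonp(text):
    """The endpoint answers with JSONP, e.g. `_callback({...});`; unwrap to JSON."""
    match = re.search(r'\((.*)\)\s*;?\s*$', text, re.DOTALL)
    if match is None:
        raise ValueError('response is not JSONP')
    return json.loads(match.group(1))

# Actually fetching would look roughly like this (the Referer requirement
# is an assumption based on common reports about this endpoint):
# resp = requests.get(build_comment_url(article_url),
#                     headers={'Referer': article_url})
# data = strip_jsonp(resp.text)
```

If the request comes back empty even with correct oid/aid, the missing `Referer` header is the first thing worth checking.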

Error in "search_crawler.py"

Hello! I ran into an error, and after working out a fix I decided to leave this issue.

Environment:

Ubuntu Server 18.04 LTS
Python : 3.6.5
beautifulsoup4 : 4.7.1
requests : 2.21.0

Command

$ python searching_news_comments.py --query_file queries.txt

Error description

The error occurred in the _parse_urls_from_page function in search_crawler.py:

url_patterns = ('a[href=https://news.naver.com/main/read.nhn?]',
                'a[href^=https://entertain.naver.com/main/read.nhn?]',
                'a[href^=https://sports.news.naver.com/sports/index.nhn?]',
                'a[href^=https://news.naver.com/sports/index.nhn?]')
...
for pattern in url_patterns:
    article_urls = [link['href'] for link in article_blocks.select(pattern)]
    urls_in_page.update(article_urls)

The article_blocks variable (a bs4.element.Tag object) could not parse these patterns.
Wrapping the URL inside each pattern string in double quotes (" ") fixed it.

  • Before: ('a[href=https://news.naver.com/main/read.nhn?]'
  • After: ('a[href="https://news.naver.com/main/read.nhn?"]'

PS: It may just be me, but if anyone else runs into this error, I hope this helps. :)


Error details

Traceback (most recent call last):
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 97, in _parse_urls_from_page
    article_urls = [link['href'] for link in article_blocks.select(pattern)]
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/bs4/element.py", line 1376, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/__init__.py", line 108, in select
    return compile(select, namespaces, flags).select(tag, limit)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/__init__.py", line 59, in compile
    return cp._cached_css_compile(pattern, namespaces, flags)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 192, in _cached_css_compile
    CSSParser(pattern, flags).process_selectors(),
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 894, in process_selectors
    return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 744, in parse_selectors
    key, m = next(iselector)
  File "/home/simonjisu/miniconda3/envs/venv/lib/python3.6/site-packages/soupsieve/css_parser.py", line 881, in selector_iter
    raise SyntaxError(msg)
SyntaxError: Malformed attribute selector at position 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "searching_news_comments.py", line 73, in <module>
    main()
  File "searching_news_comments.py", line 70, in main
    crawler.search(query, bd, ed)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 131, in search
    scrap_date, verbose=self.verbose, debug=self.debug)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 26, in get_article_urls
    search_result_url, num_articles, verbose, debug)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 70, in _extract_urls_from_search_result
    urls_in_page = _parse_urls_from_page(search_result_url, page)
  File "/home/simonjisu/code/naver_news_search_scraper/naver_news_search_crawler/search_crawler.py", line 100, in _parse_urls_from_page
    raise ValueError('Failed to extract urls from page %s' % str(e))
ValueError: Failed to extract urls from page Malformed attribute selector at position 1

Question about comment collection!

I want to put environment-related keywords into the query and pull the articles and comments for those keywords.
I put 기후변화 (climate change) in the query, and tried both with and without setting the miscellaneous category and a specific date range.
The result only fetched the news articles and the news index; no comments were collected.
What is going wrong in this case? Is the way I put keywords into the query wrong, does comment extraction simply not work when collecting by a specific keyword, or is it something else? I would appreciate any guidance!
