
finnlp's Introduction


FinNLP: Internet-scale Financial Data

Downloads Downloads Python 3.8 PyPI License

FinNLP provides a playground for all people interested in LLMs and NLP in Finance. Here we provide full pipelines for LLM training and finetuning in the field of finance.


Ⅰ. How to Use

1. News

  • US

    # Finnhub (Yahoo Finance, Reuters, SeekingAlpha, CNBC...)
    from finnlp.data_sources.news.finnhub_date_range import Finnhub_Date_Range
    
    start_date = "2023-01-01"
    end_date = "2023-01-03"
    config = {
        "use_proxy": "us_free",    # use proxies to prevent IP blocking
        "max_retry": 5,
        "proxy_pages": 5,
        "token": "YOUR_FINNHUB_TOKEN"  # Available at https://finnhub.io/dashboard
    }
    
    news_downloader = Finnhub_Date_Range(config)                      # init
    news_downloader.download_date_range_stock(start_date,end_date)    # Download headers
    news_downloader.gather_content()                                  # Download contents
    df = news_downloader.dataframe
    selected_columns = ["headline", "content"]
    df[selected_columns].head(10)
    
    --------------------
    
    # 	headline						content
    # 0	My 26-Stock $349k Portfolio Gets A Nice Petrob...	Home\nInvesting Strategy\nPortfolio Strategy\n...
    # 1	Apple’s Market Cap Slides Below $2 Trillion fo...	Error
    # 2	US STOCKS-Wall St starts the year with a dip; ...	(For a Reuters live blog on U.S., UK and Europ...
    # 3	Buy 4 January Dogs Of The Dow, Watch 4 More	Home\nDividends\nDividend Quick Picks\nBuy 4 J...
    # 4	Apple's stock market value falls below $2 tril...	Jan 3 (Reuters) - Apple Inc's \n(AAPL.O)\n sto...
    # 5	CORRECTED-UPDATE 1-Apple's stock market value ...	Jan 3 (Reuters) - Apple Inc's \n(AAPL.O)\n sto...
    # 6	Apple Stock Falls Amid Report Of Product Order...	Apple stock got off to a slow start in 2023 as...
    # 7	US STOCKS-Wall St starts the year with a dip; ...	Summary\nCompanies\nTesla shares plunge on Q4 ...
    # 8	More than $1 trillion wiped off value of Apple...	apple store\nMore than $1 trillion has been wi...
    # 9	McLean's Iridium inks agreement to put its sat...	The company hasn't named its partner, but it's...
  • China

    # Sina Finance
    from finnlp.data_sources.news.sina_finance_date_range import Sina_Finance_Date_Range
    
    start_date = "2016-01-01"
    end_date = "2016-01-02"
    config = {
        "use_proxy": "china_free",   # use proxies to prevent IP blocking
        "max_retry": 5,
        "proxy_pages": 5,
    }
    
    news_downloader = Sina_Finance_Date_Range(config)                # init
    news_downloader.download_date_range_all(start_date,end_date)     # Download headers
    news_downloader.gather_content()                                 # Download contents
    df = news_downloader.dataframe
    selected_columns = ["title", "content"]
    df[selected_columns].head(10)
    
    --------------------
    
    #         title	                                 content
    # 0	分析师:伊朗重回国际原油市场无法阻止	        新浪美股讯 北京时间1月1日晚CNBC称,加拿大皇家银行(RBC)分析师Helima Cro...
    # 1	FAA:波音767的逃生扶梯存在缺陷	          新浪美股讯 北京时间1日晚,美国联邦航空局(FAA)要求航空公司对波音767机型的救生扶梯进...
    # 2	非制造业新订单指数创新高 需求回升力度明显	   中新社北京1月1日电 (记者 刘长忠)记者1日从**物流与采购联合会获悉,在最新发布的201...
    # 3	雷曼兄弟针对大和证券提起索赔诉讼	          新浪美股讯 北京时间1日下午共同社称,2008年破产的美国金融巨头雷曼兄弟公司的清算法人日前...
    # 4	国内钢铁PMI有所回升 钢市低迷形势有所改善	   新华社上海1月1日专电(记者李荣)据中物联钢铁物流专业委员会1日发布的指数报告,2015年1...
    # 5	马息岭凸显朝鲜旅游体育战略	                 新浪美股北京时间1日讯 三位单板滑雪手将成为最早拜访马息岭滑雪场的西方专业运动员,他们本月就...
    # 6	五洲船舶破产清算 近十年来首现国有船厂倒闭	   (原标题:**首家国有船厂破产倒闭)\n低迷的**造船市场,多年来首次出现国有船厂破产清算的...
    # 7	过半城市房价环比上涨 百城住宅均价加速升温	    资料图。中新社记者 武俊杰 摄\n中新社北京1月1日电 (记者 庞无忌)**房地产市场在20...
    # 8	经济学人:巴西病根到底在哪里	              新浪美股北京时间1日讯 原本,巴西人是该高高兴兴迎接2016年的。8月间,里约热内卢将举办南...
    # 9	**首家国有船厂破产倒闭:五洲船舶目前已停工	 低迷的**造船市场,多年来首次出现国有船厂破产清算的一幕。浙江海运集团旗下的五洲船舶修造公司...
    
    # Eastmoney 东方财富
    from finnlp.data_sources.news.eastmoney_streaming import Eastmoney_Streaming
    
    pages = 3
    stock = "600519"
    config = {
        "use_proxy": "china_free",
        "max_retry": 5,
        "proxy_pages": 5,
    }
    
    news_downloader = Eastmoney_Streaming(config)
    news_downloader.download_streaming_stock(stock,pages)
    df = news_downloader.dataframe
    selected_columns = ["title", "create time"]
    df[selected_columns].head(10)
    
    --------------------
    
    #     title	create time
    # 0	茅台2022年报的12个小秘密	04-09 19:40
    # 1	东北证券维持贵州茅台买入评级 预计2023年净利润同比	04-09 11:24
    # 2	贵州茅台:融资余额169.34亿元,创近一年新低(04-07	04-08 07:30
    # 3	贵州茅台:融资净买入1248.48万元,融资余额169.79亿	04-07 07:28
    # 4	贵州茅台公益基金会正式成立	04-06 12:29
    # 5	贵州茅台04月04日获沪股通增持19.55万股	04-05 07:48
    # 6	贵州茅台:融资余额169.66亿元,创近一年新低(04-04	04-05 07:30
    # 7	4月4日北向资金最新动向(附十大成交股)	04-04 18:48
    # 8	大宗交易:贵州茅台成交235.9万元,成交价1814.59元(	04-04 17:21
    # 9	第一上海证券维持贵州茅台买入评级 目标价2428.8元	04-04 09:30
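
The Finnhub sample output above contains a row whose content is the literal string Error, which suggests that a failed article download is recorded in the dataframe rather than raised. A minimal sketch of dropping such rows before further processing, assuming the marker is exactly "Error" and the column is named content as in the sample. The helper name drop_failed_rows is hypothetical and not part of FinNLP.

```python
import pandas as pd


def drop_failed_rows(df: pd.DataFrame, column: str = "content",
                     placeholder: str = "Error") -> pd.DataFrame:
    """Return a copy of df without rows whose `column` is missing or equals `placeholder`."""
    # Assumption: failed downloads are marked with the literal string "Error",
    # as in the sample output; adjust `placeholder` if the marker differs.
    failed = df[column].isna() | (df[column] == placeholder)
    return df.loc[~failed].reset_index(drop=True)


# Illustrative data only; real frames come from news_downloader.dataframe.
sample = pd.DataFrame({
    "headline": ["Headline A", "Headline B", "Headline C"],
    "content": ["Body text A", "Error", None],
})
clean = drop_failed_rows(sample)
print(clean)
```

Filtering on a copy keeps the original dataframe available, so the failed rows can still be inspected or retried later.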

2. Social Media

  • US

    # Stocktwits
    from finnlp.data_sources.social_media.stocktwits_streaming import Stocktwits_Streaming
    
    pages = 3
    stock = "AAPL"
    config = {
        "use_proxy": "us_free",
        "max_retry": 5,
        "proxy_pages": 2,
    }
    
    downloader = Stocktwits_Streaming(config)
    downloader.download_streaming_stock(stock, pages)
    selected_columns = ["created_at", "body"]
    downloader.dataframe[selected_columns].head(10)
    
    --------------------
    
    # created_at	body
    # 0	2023-04-07T15:24:22Z	NANCY PELOSI JUST BOUGHT 10,000 SHARES OF APPL...
    # 1	2023-04-07T15:17:43Z	$AAPL $SPY \n \nhttps://amp.scmp.com/news/chi...
    # 2	2023-04-07T15:17:25Z	$AAPL $GOOG $AMZN I took a Trump today. \n\nH...
    # 3	2023-04-07T15:16:54Z	$SPY $AAPL will take this baby down, time for ...
    # 4	2023-04-07T15:11:37Z	$SPY $3T it ALREADY DID - look at the pre-COV...
    # 5	2023-04-07T15:10:29Z	$AAPL $QQQ $STUDY We are on to the next one! A...
    # 6	2023-04-07T15:06:00Z	$AAPL was analyzed by 48 analysts. The buy con...
    # 7	2023-04-07T14:54:29Z	$AAPL both retiring. \n \nCraig....
    # 8	2023-04-07T14:40:06Z	$SPY $QQQ $TSLA $AAPL SPY 500 HAS STARTED🚀😍 BI...
    # 9	2023-04-07T14:38:57Z	Nancy 🩵 (Tim) $AAPL
    
    # Reddit Wallstreetbets
    from finnlp.data_sources.social_media.reddit_streaming import Reddit_Streaming
    
    pages = 3
    config = {
        "use_proxy": "us_free",
        "max_retry": 5,
        "proxy_pages": 2,
    }
    
    downloader = Reddit_Streaming(config)
    downloader.download_streaming_all(pages)
    selected_columns = ["created", "title"]
    downloader.dataframe[selected_columns].head(10)
    
    --------------------
    
    # created	title
    # 0	2023-04-07 15:39:34	Y’all making me feel like spooderman
    # 1	2022-12-21 04:09:42	Do you track your investments in a spreadsheet...
    # 2	2022-12-21 04:09:42	Do you track your investments in a spreadsheet...
    # 3	2023-04-07 15:29:23	Can a Blackberry holder get some help 🥺
    # 4	2023-04-07 14:49:55	The week of CPI and FOMC Minutes… 4-6-23 SPY/ ...
    # 5	2023-04-07 14:19:22	Well let’s hope your job likes you, thanks Jerome
    # 6	2023-04-07 14:06:32	Does anyone else feel an overwhelming sense of...
    # 7	2023-04-07 13:47:59	Watermarked Jesus explains the market being cl...
    # 8	2023-04-07 13:26:23	Jobs report shows 236,000 gain in March. Hot l...
    # 9	2023-04-07 13:07:15	The recession is over! Let's buy more stocks!
  • China (Weibo)

    # Weibo
    from finnlp.data_sources.social_media.weibo_date_range import Weibo_Date_Range
    
    start_date = "2016-01-01"
    end_date = "2016-01-02"
    stock = "茅台"
    config = {
        "use_proxy": "china_free",
        "max_retry": 5,
        "proxy_pages": 5,
        "cookies": "Your_Login_Cookies",
    }
    
    downloader = Weibo_Date_Range(config)
    downloader.download_date_range_stock(start_date, end_date, stock = stock)
    df = downloader.dataframe
    df = df.drop_duplicates()
    selected_columns = ["date", "content"]
    df[selected_columns].head(10)
    
    --------------------
    
    # date	content
    # 0	2016-01-01		#舆论之锤#唯品会发声明证实销售假茅台-手机腾讯网O网页链接分享来自浏览器!
    # 2	2016-01-01		2016元旦节快乐酒粮网官方新品首发,茅台镇老酒,酱香原浆酒:酒粮网茅台镇白酒酱香老酒纯粮原...
    # 6	2016-01-01		2016元旦节快乐酒粮网官方新品首发,茅台镇老酒,酱香原浆酒:酒粮网茅台镇白酒酱香老酒纯粮原...
    # 17	2016-01-01		开心,今天喝了两斤酒(茅台+扎二)三个人,开心!
    # 18	2016-01-01		一家专卖假货的网站某宝,你该学学了!//【唯品会售假茅台:供货商被刑拘顾客获十倍补偿】O唯品...
    # 19	2016-01-01		一家专卖假货的网站//【唯品会售假茅台:供货商被刑拘顾客获十倍补偿】O唯品会售假茅台:供货商...
    # 20	2016-01-01		前几天说了几点不看好茅台的理由,今年过节喝点茅台支持下,个人口感,茅台比小五好喝,茅台依然是...
    # 21	2016-01-01		老杜酱酒已到货,从明天起正式在甘肃武威开卖。可以不相信我说的话,但一定不要怀疑@杜子建的为人...
    # 22	2016-01-01		【唯品会售假茅台后续:供货商被刑拘顾客获十倍补偿】此前,有网友投诉其在唯品会购买的茅台酒质量...
    # 23	2016-01-01		唯品会卖假茅台,供货商被刑拘,买家获十倍补偿8888元|此前,有网友在网络论坛发贴(唯品会宣...
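
The Reddit sample output above lists the same post twice (rows 1 and 2), and the Weibo example already calls drop_duplicates for the same reason. A small sketch of tidying any of these dataframes, assuming the timestamp column holds strings that pandas can parse. The helper name tidy_posts is hypothetical, and the column names are passed in because they differ per source ("created", "created_at", "date").

```python
import pandas as pd


def tidy_posts(df: pd.DataFrame, time_column: str, text_column: str) -> pd.DataFrame:
    """Drop repeated posts and sort the rest from newest to oldest."""
    out = df.drop_duplicates(subset=[time_column, text_column]).copy()
    # Parse timestamps so that sorting is chronological rather than lexicographic.
    out[time_column] = pd.to_datetime(out[time_column])
    return out.sort_values(time_column, ascending=False).reset_index(drop=True)


# Illustrative data shaped like the Reddit sample output.
posts = pd.DataFrame({
    "created": ["2023-04-07 15:39:34", "2022-12-21 04:09:42",
                "2022-12-21 04:09:42", "2023-04-07 15:29:23"],
    "title": ["Post A", "Post B", "Post B", "Post C"],
})
tidy = tidy_posts(posts, time_column="created", text_column="title")
print(tidy)
```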

3. Company Announcement

  • US

    # SEC
    from finnlp.data_sources.company_announcement.sec import SEC_Announcement
    
    start_date = "2020-01-01"
    end_date = "2020-06-01"
    stock = "AAPL"
    config = {
        "use_proxy": "us_free",
        "max_retry": 5,
        "proxy_pages": 3,
    }
    
    downloader = SEC_Announcement(config)
    downloader.download_date_range_stock(start_date, end_date, stock = stock)
    selected_columns = ["file_date", "display_names", "content"]
    downloader.dataframe[selected_columns].head(10)
    
    --------------------
    
    # file_date	display_names	content
    # 0	2020-05-12	[KONDO CHRIS (CIK 0001631982), Apple Inc. (A...	SEC Form 4 \n FORM 4UNITED STATES SECURITIES...
    # 1	2020-04-30	[JUNG ANDREA (CIK 0001051401), Apple Inc. (A...	SEC Form 4 \n FORM 4UNITED STATES SECURITIES...
    # 2	2020-04-17	[O'BRIEN DEIRDRE (CIK 0001767094), Apple Inc....	SEC Form 4 \n FORM 4UNITED STATES SECURITIES...
    # 3	2020-04-17	[KONDO CHRIS (CIK 0001631982), Apple Inc. (A...	SEC Form 4 \n FORM 4UNITED STATES SECURITIES...
    # 4	2020-04-09	[Maestri Luca (CIK 0001513362), Apple Inc. (...	SEC Form 4 \n FORM 4UNITED STATES SECURITIES...
    # 5	2020-04-03	[WILLIAMS JEFFREY E (CIK 0001496686), Apple I...	SEC Form 4 \n FORM 4UNITED STATES SECURITIES...
    # 6	2020-04-03	[Maestri Luca (CIK 0001513362), Apple Inc. (...	SEC Form 4 \n FORM 4UNITED STATES SECURITIES...
    # 7	2020-02-28	[WAGNER SUSAN (CIK 0001059235), Apple Inc. (...	SEC Form 4 \n FORM 4UNITED STATES SECURITIES...
    # 8	2020-02-28	[LEVINSON ARTHUR D (CIK 0001214128), Apple In...	SEC Form 4 \n FORM 4UNITED STATES SECURITIES...
    # 9	2020-02-28	[JUNG ANDREA (CIK 0001051401), Apple Inc. (A...	SEC Form 4 \n FORM 4UNITED STATES SECURITIES...
  • China

    # Juchao
    from finnlp.data_sources.company_announcement.juchao import Juchao_Announcement
    
    start_date = "2020-01-01"
    end_date = "2020-06-01"
    stock = "000001"
    config = {
        "use_proxy": "china_free",
        "max_retry": 5,
        "proxy_pages": 3,
    }
    
    downloader = Juchao_Announcement(config)
    downloader.download_date_range_stock(start_date, end_date, stock = stock, get_content = True, delate_pdf = True)
    selected_columns = ["announcementTime", "shortTitle","Content"]
    downloader.dataframe[selected_columns].head(10)
    
    --------------------
    
    # announcementTime	shortTitle	Content
    # 0	2020-05-27	关于2020年第一期小型微型企业贷款专项金融债券发行完毕的公告	证券代码: 000001 证券简称:平安银行 ...
    # 1	2020-05-22	2019年年度权益分派实施公告	1 证券代码: 000001 证券简称:平安银行 ...
    # 2	2020-05-20	关于获准发行小微企业贷款专项金融债券的公告	证券代码: 000001 证券简称:平安银行 ...
    # 3	2020-05-16	监事会决议公告	1 证券代码: 000001 证券简称: 平安银行 ...
    # 4	2020-05-15	2019年年度股东大会决议公告	1 证券代码: 000001 证券简称:平安银行 ...
    # 5	2020-05-15	2019年年度股东大会的法律意见书	北京总部 电话 : (86 -10) 8519 -1300 传真 : (86 -10...
    # 6	2020-04-30	中信证券股份有限公司、平安证券股份有限公司关于公司关联交易有关事项的核查意见	1 中信证券股份有限公司 、平安证券股份有限 公司 关于平安银行股份有限公司 关联交易 有...
    # 7	2020-04-30	独立董事独立意见	1 平安银行股份有限公司独立董事独立意见 根据《关于在上市公司建立独立董事制度的指导...
    # 8	2020-04-30	关联交易公告	1 证券代码: 000001 证券简称:平安银行 ...
    # 9	2020-04-21	2020年第一季度报告全文	证券代码: 000001 证券简称:平安银行 ...

Ⅱ. Data Sources

1. News

Platform | Data Type | Related Market | Specified Company | Range Type | Limits | Support
Yahoo | Financial News | US Stocks | | Date Range | N/A |
Reuters | General News | US Stocks | × | Date Range | N/A | Soon
Seeking Alpha | Financial News | US Stocks | | Streaming | N/A |
Sina | Financial News | CN Stocks | × | Date Range | N/A |
Eastmoney | Financial News | CN Stocks | | Date Range | N/A |
Yicai | Financial News | CN Stocks | | Date Range | N/A | Soon
CCTV | General News | CN Stocks | × | Date Range | N/A |
US Mainstream Media | Financial News | US Stocks | | Date Range | Account (Free) |
CN Mainstream Media | Financial News | CN Stocks | × | Date Range | Account (¥500/year) |

2. Social Media

Platform | Data Type | Related Market | Specified Company | Range Type | Source Type | Limits | Support
Twitter | Tweets | US Stocks | | Date Range | Official | N/A |
Twitter | Sentiment | US Stocks | | Date Range | Third Party | N/A |
StockTwits | Tweets | US Stocks | | Latest | Official | N/A |
Reddit (wallstreetbets) | Threads | US Stocks | × | Latest | Official | N/A |
Reddit | Sentiment | US Stocks | | Date Range | Third Party | N/A |
Weibo | Tweets | CN Stocks | | Date Range | Official | Cookies |
Weibo | Tweets | CN Stocks | | Latest | Official | N/A |

3. Company Announcement

Platform | Data Type | Related Market | Specified Company | Range Type | Source Type | Limits | Support
Juchao (Official Website) | Text | CN Stocks | | Date Range | Official | N/A |
SEC (Official Website) | Text | US Stocks | | Date Range | Official | N/A |
Sina | Text | CN Stocks | | Latest | Third Party | N/A |

4. Data Sets

Data Source | Type | Stocks | Dates Available
AShare | News | 3680 | 2018-07-01 to 2021-11-30
stocknet-dataset | Tweets | 87 | 2014-01-02 to 2015-12-30
CHRNN | Tweets | 38 | 2017-01-03 to 2017-12-28

Ⅲ. Large Language Models (LLMs)

LICENSE

MIT License

Disclaimer: We share this code for academic purposes under the MIT education license. Nothing herein is financial advice or a recommendation to trade real money. Please use common sense and always consult a professional before trading or investing.

finnlp's People

Contributors

athe-kunal, bruceyanghy, eltociear, greg-tarr, oliverwang15, pranavgarg, yangletliu


finnlp's Issues

I get a module not found error when attempting to run the example in the docs

I get this error:
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/media/yfprime/763D304F18FA13FA/tbot1/venv/lib/python3.10/site-packages/main.py", line 106, in <module>
    stock_industry_category_cninfo_df = stock_industry_category_cninfo(
  File "/media/yfprime/763D304F18FA13FA/tbot1/venv/lib/python3.10/site-packages/main.py", line 59, in stock_industry_category_cninfo
    js_content = _get_file_content_ths("cninfo.js")
  File "/media/yfprime/763D304F18FA13FA/tbot1/venv/lib/python3.10/site-packages/main.py", line 30, in _get_file_content_ths
    setting_file_path = get_ths_js(file)
  File "/media/yfprime/763D304F18FA13FA/tbot1/venv/lib/python3.10/site-packages/main.py", line 17, in get_ths_js
    with resources.path(package="py_mini_racer.data", resource=file) as f:
  File "/usr/lib/python3.10/importlib/resources.py", line 119, in path
    reader = _common.get_resource_reader(_common.get_package(package))
  File "/usr/lib/python3.10/importlib/_common.py", line 66, in get_package
    resolved = resolve(package)
  File "/usr/lib/python3.10/importlib/_common.py", line 57, in resolve
    return cand if isinstance(cand, types.ModuleType) else importlib.import_module(cand)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'py_mini_racer.data'

I am using Python 3.10 and a relative import path.

Here is my code:

from .FinNLP.finnlp.data_sources.news.finnhub_date_range import Finnhub_Date_Range

start_date = "2024-04-01"
end_date = "2024-04-03"
config = {
    #"use_proxy": "us_free", # use proxies to prevent IP blocking
    "max_retry": 5,
    "proxy_pages": 5,
    "token": "TOKEN_HERE" # Available at https://finnhub.io/dashboard
}

news_downloader = Finnhub_Date_Range(config) # init
news_downloader.download_date_range_stock(start_date,end_date) # Download headers
news_downloader.gather_content() # Download contents
df = news_downloader.dataframe
df.head(10)
selected_columns = ["headline", "content"]
df[selected_columns].head(10)

personal accounts for sentiment analysis and so on

Would it be possible to connect personal accounts, such as Twitter / X, for sentiment analysis, tweets, and so on, instead of relying only on the data sources and APIs that are included by default in the finnlp code?

('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) - Error

I ran into this error when running the same cell you provided in the readme

('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

The error is raised by each of these lines:
news_downloader = Finnhub_Date_Range(config)
downloader = Stocktwits_Streaming(config)
downloader = SEC_Announcement(config)
that is, by all of the US data sources.

My config was:
start_date = "2023-01-01"
end_date = "2023-01-03"
config = {
    "use_proxy": "us_free", # use proxies to prevent ip blocking
    "max_retry": 5,
    "proxy_pages": 5,
    "token": "YOUR_FINNHUB_TOKEN" # Available at https://finnhub.io/dashboard
}

Any idea?

Getting started with FinNLP

Hi, I'm a PhD student & a beginner.

When I run this line of code I get an error message that there's no module named 'finnlp'.
from finnlp.data_sources.sec_filings import SECFilingsLoader

I saw the lines of code below in another post and tried them, but got error messages as well. I'm trying to get access to earnings call transcripts and SEC filings. Could you help?

# you have to clone first
!git clone https://github.com/AI4Finance-Foundation/FinNLP

# then change the directory
!cd FinNLP

# add the repository to the Python path
import sys
sys.path.append('/content/FinNLP')

fatal: could not create work tree dir 'FinNLP': Read-only file system
zsh:cd:1: no such file or directory: FinNLP

from finnlp.data_sources.news.finnhub import Finnhub_News


ModuleNotFoundError Traceback (most recent call last)
Cell In[17], line 9
7 from tqdm.notebook import tqdm
8 # from meta.data_processors.yahoofinance import Yahoofinance
----> 9 from finnlp.data_sources.news.finnhub import Finnhub_News
10 from finnlp.large_language_models.openai.openai_chat_agent import Openai_Chat_Agent

ModuleNotFoundError: No module named 'finnlp.data_sources.news.finnhub'
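
In the attempt above the clone failed because the working directory was read-only, so the directory later appended to sys.path never existed. Note also that in IPython/Jupyter a `!cd` command runs in a subshell and does not change the notebook's working directory. A minimal sketch that adds a checkout to the import path only if it actually contains the package directory. The helper name add_checkout_to_path is hypothetical, and the expectation of a finnlp subdirectory in the repository root is an assumption based on the paths shown in the tracebacks on this page.

```python
import sys
from pathlib import Path


def add_checkout_to_path(repo_dir: str, package: str = "finnlp") -> Path:
    """Put a cloned repository on sys.path after checking the package directory exists."""
    root = Path(repo_dir).expanduser().resolve()
    if not (root / package).is_dir():
        # Fail early with a clear message instead of a later ModuleNotFoundError.
        raise FileNotFoundError(
            f"{root} does not contain a '{package}' directory; did the clone succeed?"
        )
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root


# Example (path is an assumption; use wherever the clone was written):
# add_checkout_to_path("/content/FinNLP")
```

Even with the path set up, an import only succeeds if the named module exists in the checked-out version of the repository.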

US_proxy connection error

from finnlp.data_sources.news.finnhub_date_range import Finnhub_Date_Range

start_date = "2023-01-01"
end_date = "2023-01-02"
config = {
    "use_proxy": "us_free", # use proxies to prevent IP blocking
    "max_retry": 5,
    "proxy_pages": 5,
    "token": "YOUR_FINNHUB_TOKEN" # Available at https://finnhub.io/dashboard
}

news_downloader = Finnhub_Date_Range(config) # init
news_downloader.download_date_range_stock(start_date,end_date) # Download headers
news_downloader.gather_content() # Download contents
df = news_downloader.dataframe
selected_columns = ["headline", "content"]
print(df[selected_columns].head(10))

Fetching the US proxy list raises a connection error:
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

The error is raised in the function get_us_free_proxy, at the line response = requests.get(url, headers=headers)
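
A connection reset like the one reported above can be transient or permanent (for example, if the proxy-list site blocks the client). For the transient case, one generic option is to retry with exponential backoff. The sketch below is not FinNLP's code: call_with_retry is a hypothetical helper, and it only helps if the remote site eventually answers.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(func: Callable[[], T], max_retry: int = 5, base_delay: float = 1.0,
                    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,)) -> T:
    """Call func(); on a listed exception, wait and try again, up to max_retry calls in total."""
    for attempt in range(1, max_retry + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retry:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ... base_delay
    raise ValueError("max_retry must be at least 1")
```

The default retry_on covers the built-in ConnectionError family, including ConnectionResetError. requests wraps such failures in requests.exceptions.ConnectionError, which is a separate class, so a call such as requests.get would need retry_on=(requests.exceptions.ConnectionError,).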

AttributeError: 'DataFrame' object has no attribute 'datetime'

When I run the code:

# Finnhub (Yahoo Finance, Reuters, SeekingAlpha, CNBC...)
from finnlp.data_sources.news.finnhub_date_range import Finnhub_Date_Range

start_date = "2023-01-01"
end_date = "2023-01-03"
config = {
    "use_proxy": "us_free",    # use proxies to prevent IP blocking
    "max_retry": 5,
    "proxy_pages": 5,
    "token": "YOUR_FINNHUB_TOKEN"  # Available at https://finnhub.io/dashboard
}

news_downloader = Finnhub_Date_Range(config)                      # init
news_downloader.download_date_range_stock(start_date,end_date)    # Download headers
news_downloader.gather_content()                                  # Download contents
df = news_downloader.dataframe
selected_columns = ["headline", "content"]
df[selected_columns].head(10)

the following error is thrown:
Checking ips: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [01:04<00:00, 1.16it/s]
Get proxy ips: 75.
Usable proxy ips: 75.
stoped--
Downloading Titles: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00, 1.45s/it]
stop
Traceback (most recent call last):
  File "/home/bbbs/Videos/FINTECH/main.py", line 15, in <module>
    news_downloader.download_date_range_stock(start_date,end_date) # Download headers
  File "/home/bbbs/Videos/FINTECH/FinNLP/finnlp/data_sources/news/finnhub_date_range.py", line 50, in download_date_range_stock
    self.dataframe.datetime = pd.to_datetime(self.dataframe.datetime,unit = "s")
  File "/home/bbbs/anaconda3/envs/fin_nlp/lib/python3.8/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'datetime'
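
The traceback above fails on attribute access to a datetime column that the dataframe does not have. One plausible cause, stated here as an assumption rather than a diagnosis, is that the API returned no articles (for instance with the placeholder token), leaving an empty dataframe. A minimal sketch of a guarded conversion, with convert_epoch_column as a hypothetical helper name:

```python
import pandas as pd


def convert_epoch_column(df: pd.DataFrame, column: str = "datetime") -> pd.DataFrame:
    """Convert a Unix-seconds column to timestamps; return df unchanged if the column is absent."""
    if column not in df.columns:
        # e.g. nothing was downloaded, so the frame has no columns at all
        return df
    out = df.copy()
    out[column] = pd.to_datetime(out[column], unit="s")
    return out


print(convert_epoch_column(pd.DataFrame()))                          # empty frame passes through
print(convert_epoch_column(pd.DataFrame({"datetime": [1672531200]})))
```

Using bracket access (df["datetime"]) instead of attribute access also makes a missing column surface as a KeyError naming the column, which is easier to read.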

connection error in dataframe values

run:

# Finnhub (Yahoo Finance, Reuters, SeekingAlpha, CNBC...)
from finnlp.data_sources.news.finnhub_date_range import Finnhub_Date_Range

start_date = "2023-01-01"
end_date = "2023-01-03"
config = {
    "use_proxy": "us_free",    # use proxies to prevent IP blocking
    "max_retry": 5,
    "proxy_pages": 5,
    "token": "YOUR_FINNHUB_TOKEN"  # Available at https://finnhub.io/dashboard
}

news_downloader = Finnhub_Date_Range(config)                      # init
news_downloader.download_date_range_stock(start_date,end_date)    # Download headers
news_downloader.gather_content()                                  # Download contents
df = news_downloader.dataframe
selected_columns = ["headline", "content"]
df[selected_columns].head(10)

# 	headline						content
# 0	My 26-Stock $349k Portfolio Gets A Nice Petrob...	Home\nInvesting Strategy\nPortfolio Strategy\n...
# 1	Apple’s Market Cap Slides Below $2 Trillion fo...	Error
# 2	US STOCKS-Wall St starts the year with a dip; ...	(For a Reuters live blog on U.S., UK and Europ...
# 3	Buy 4 January Dogs Of The Dow, Watch 4 More	Home\nDividends\nDividend Quick Picks\nBuy 4 J...
# 4	Apple's stock market value falls below $2 tril...	Jan 3 (Reuters) - Apple Inc's \n(AAPL.O)\n sto...
# 5	CORRECTED-UPDATE 1-Apple's stock market value ...	Jan 3 (Reuters) - Apple Inc's \n(AAPL.O)\n sto...
# 6	Apple Stock Falls Amid Report Of Product Order...	Apple stock got off to a slow start in 2023 as...
# 7	US STOCKS-Wall St starts the year with a dip; ...	Summary\nCompanies\nTesla shares plunge on Q4 ...
# 8	More than $1 trillion wiped off value of Apple...	apple store\nMore than $1 trillion has been wi...

====================================================================
Error:
The dataframe values contain connection errors (screenshot not preserved).

Evaluation time-consuming on FIQA

Hi,

I hope this message finds you well. First and foremost, I would like to express my gratitude for the incredible work you have put into this project; it has been instrumental in my work.

I am reaching out for guidance on the evaluation time of the model across different test sets. In my current setup the evaluation phase is very slow: on FIQA, for example, it takes roughly two to three hours on an A6000 with batch size 64 for LLaMA. The duration is similar across the various test sets.

I am wondering if there might be any specific recommendations or strategies that could potentially help in accelerating the evaluation process.

Here are a few questions I have in mind:

  • Are there any known bottlenecks in the evaluation process for FIQA that I should be aware of?
  • Could you please suggest any best practices or settings that could help in reducing the evaluation time?
  • Is there any parallelization or optimization technique available that is recommended for speeding up the evaluation?

I am more than willing to provide additional information or clarify any aspects if needed. My main goal is to ensure that I am utilizing the tool to its fullest potential and in the most efficient manner.

Thank you very much for taking the time to read my inquiry. I am looking forward to any advice or suggestions you might have.

Private accounts of social networks for accurate sentiment analysis

Would it be possible to develop or connect an API, or use existing ones, so that private / proprietary accounts on the main networks (Twitter / X, Stocktwits, Reddit, ...) can also be used to improve sentiment analysis? Most retail traders tend to go against the direction of the market, so training the models on better data should make them more reliable. It would be useful to connect personal accounts for sentiment analysis, tweets, threads, and so on, instead of using only the data sources and APIs included by default in the finnlp code. A good starting point would be the personal Twitter / X account, which is the most important and can provide the most information.

Stocktwits_Streaming demo has a typo

Hi, I think there is a typo in the demo code below:
downloader = Stocktwits_Streaming(config)
downloader.download_date_range_stock(stock, pages)
"download_date_range_stock" should be "download_streaming_stock", because "download_date_range_stock" is not implemented in Stocktwits_Streaming.


finnlp/data_sources/news/eastmoney_streaming.py xpath bug.

The XPath of the page has changed; the corrected code is as follows.

def _gather_pages(self, stock, page):
    ....
    # gather the content of the first page
    page = etree.HTML(response.text)
    trs = page.xpath('//*[@id="mainlist"]/div/ul/li[1]/table/tbody/tr')
    have_one = False
    for item in trs:
        have_one = True
        # each table row holds one post: counts, title link, author, time
        read_amount = item.xpath("./td[1]//text()")[0]
        comments = item.xpath("./td[2]//text()")[0]
        title = item.xpath("./td[3]/div/a//text()")[0]
        content_link = item.xpath("./td[3]/div/a/@href")[0]
        author = item.xpath("./td[4]//text()")[0]
        time = item.xpath("./td[5]//text()")[0]
        tmp = pd.DataFrame([read_amount, comments, title, content_link, author, time]).T
        columns = [ "read amount", "comments", "title", "content link", "author", "create time" ]
        tmp.columns = columns
        self.dataframe = pd.concat([self.dataframe, tmp])
        #print(title)
    # no rows on this page: tell the caller to stop paging
    if not have_one:
        return "break"
    ...
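
The corrected code above indexes every XPath result with [0], which raises IndexError as soon as one cell is empty or the layout shifts again. A small sketch of a defensive alternative, where first_text is a hypothetical helper and not part of FinNLP. It differs from a plain [0] in that it also skips whitespace-only text nodes.

```python
from typing import Optional, Sequence


def first_text(nodes: Sequence[str], default: Optional[str] = None) -> Optional[str]:
    """Return the first non-blank, stripped string from an XPath text() result, or default."""
    for node in nodes:
        text = str(node).strip()
        if text:
            return text
    return default


print(first_text(["  ", " 04-09 19:40 "]))   # first non-blank entry, stripped
print(first_text([], default=""))            # empty result falls back to the default
```

Inside the loop this would be used as, for example, title = first_text(item.xpath("./td[3]/div/a//text()")), so that a missing cell yields None instead of aborting the whole page.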

Server error

Tried running the below code as demonstrated in README.md

# Finnhub (Yahoo Finance, Reuters, SeekingAlpha, CNBC...)
from finnlp.data_sources.news.finnhub_date_range import Finnhub_Date_Range

start_date = "2023-01-01"
end_date = "2023-01-03"
config = {
    "use_proxy": "us_free",    # use proxies to prevent IP blocking
    "max_retry": 5,
    "proxy_pages": 5,
    "token": "finnhub_api_token"  # Available at https://finnhub.io/dashboard
}

news_downloader = Finnhub_Date_Range(config)                      # init
news_downloader.download_date_range_stock(start_date,end_date)    # Download headers
news_downloader.gather_content()                                  # Download contents
df = news_downloader.dataframe
selected_columns = ["headline", "content"]
df[selected_columns].head(10)

# 	headline						content
# 0	My 26-Stock $349k Portfolio Gets A Nice Petrob...	Home\nInvesting Strategy\nPortfolio Strategy\n...
# 1	Apple’s Market Cap Slides Below $2 Trillion fo...	Error
# 2	US STOCKS-Wall St starts the year with a dip; ...	(For a Reuters live blog on U.S., UK and Europ...
# 3	Buy 4 January Dogs Of The Dow, Watch 4 More	Home\nDividends\nDividend Quick Picks\nBuy 4 J...
# 4	Apple's stock market value falls below $2 tril...	Jan 3 (Reuters) - Apple Inc's \n(AAPL.O)\n sto...
# 5	CORRECTED-UPDATE 1-Apple's stock market value ...	Jan 3 (Reuters) - Apple Inc's \n(AAPL.O)\n sto...
# 6	Apple Stock Falls Amid Report Of Product Order...	Apple stock got off to a slow start in 2023 as...
# 7	US STOCKS-Wall St starts the year with a dip; ...	Summary\nCompanies\nTesla shares plunge on Q4 ...
# 8	More than $1 trillion wiped off value of Apple...	apple store\nMore than $1 trillion has been wi...
# 9	McLean's Iridium inks agreement to put its sat...	The company hasn't named its partner, but it's...

but got an error (screenshot not preserved).

I then visited the URL https://openproxy.space/list/http and got a page shown in a second screenshot (not preserved).

ModuleNotFoundError: No module named 'unstructured'

from finnlp.data_sources.sec_filings import SECFilingsLoader


ModuleNotFoundError Traceback (most recent call last)
Cell In[43], line 1
----> 1 from finnlp.data_sources.sec_filings import SECFilingsLoader

File ~/FinNLP/finnlp/data_sources/sec_filings/__init__.py:1
----> 1 from finnlp.data_sources.sec_filings.main import SECFilingsLoader

File ~/FinNLP/finnlp/data_sources/sec_filings/main.py:1
----> 1 from finnlp.data_sources.sec_filings.sec_filings import SECExtractor
2 import concurrent.futures
3 import json

File ~/FinNLP/finnlp/data_sources/sec_filings/sec_filings.py:3
1 from typing import Any, Dict, List
----> 3 from finnlp.data_sources.sec_filings.prepline_sec_filings.sec_document import (
4 REPORT_TYPES,
5 VALID_FILING_TYPES,
6 SECDocument,
7 )
8 from finnlp.data_sources.sec_filings.prepline_sec_filings.sections import (
9 ALL_SECTIONS,
10 SECTIONS_10K,
(...)
14 validate_section_names,
15 )
16 from finnlp.data_sources.sec_filings.utils import get_filing_urls_to_download

File ~/FinNLP/finnlp/data_sources/sec_filings/prepline_sec_filings/sec_document.py:18
14 import numpy.typing as npt
17 from sklearn.cluster import DBSCAN
---> 18 from unstructured.cleaners.core import clean
19 from unstructured.documents.elements import (
20 Element,
21 ListItem,
(...)
24 Title,
25 )
26 from unstructured.documents.html import HTMLDocument

ModuleNotFoundError: No module named 'unstructured'
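
The traceback above shows that sec_document.py imports third-party modules (unstructured, sklearn, numpy) that are not guaranteed to be present in the environment. A minimal sketch for checking up front which modules are missing, using only the standard library. missing_modules is a hypothetical helper, and the list of names is taken from the traceback rather than from the project's requirements.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names from `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec imports parent packages for dotted names and can raise if one is absent
            found = False
        if not found:
            missing.append(name)
    return missing


print(missing_modules(["unstructured", "sklearn", "numpy"]))
```

Any name printed here needs to be installed before the SEC filings loader can be imported. Note that the import name and the installable package name can differ (sklearn is installed as scikit-learn).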

china_free proxy xpath bug

trs = res.xpath("/html/body/div[1]/div[4]/div[2]/div[2]/div[2]/table/tbody/tr")
This XPath now returns an empty list.
Changing it to trs = res.xpath('//table/tbody/tr') fixes the problem.
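
The fix above replaces a fully positional path with one that searches for the table anywhere in the document. The sketch below illustrates why the positional form is fragile. It uses the standard library's xml.etree.ElementTree (which supports a subset of XPath) on two made-up page layouts, not the real proxy site and not the parser FinNLP uses.

```python
import xml.etree.ElementTree as ET

# Two hypothetical layouts holding the same table; the second adds a wrapper element.
OLD_LAYOUT = """<html><body><div><div><table><tbody>
<tr><td>1.2.3.4</td><td>8080</td></tr>
</tbody></table></div></div></body></html>"""

NEW_LAYOUT = """<html><body><div><section><div><table><tbody>
<tr><td>1.2.3.4</td><td>8080</td></tr>
</tbody></table></div></section></div></body></html>"""

POSITIONAL = "./body/div[1]/div[1]/table/tbody/tr"   # depends on every ancestor
RELATIVE = ".//table/tbody/tr"                       # depends only on the table itself


def count_rows(markup: str, path: str) -> int:
    """Parse the markup and count the elements matched by the path."""
    root = ET.fromstring(markup)
    return len(root.findall(path))


for name, markup in (("old", OLD_LAYOUT), ("new", NEW_LAYOUT)):
    print(name, count_rows(markup, POSITIONAL), count_rows(markup, RELATIVE))
```

The trade-off is that the relative path matches every table on the page, so it is only safe while the page contains a single table of interest.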

Reddit scraping doesn't work - AttributeError: 'NoneType' object has no attribute 'text'

I simply pasted the example code for Reddit and it errored out..

Downloading by pages...: 0%| | 0/3 [00:00<?, ?it/s]
Downloading by pages...: 33%|███████████████████████████████ | 1/3 [00:02<00:04, 2.22s/it]

AttributeError Traceback (most recent call last)
Cell In[12], line 11
4 config = {
5 "use_proxy": "us_free",
6 "max_retry": 5,
7 "proxy_pages": 2,
8 }
10 downloader = Reddit_Streaming(config)
---> 11 downloader.download_streaming_all(pages)
12 selected_columns = ["created", "title"]
13 downloader.dataframe[selected_columns].head(10)

File ~/FinNLP/finnlp/data_sources/social_media/reddit_streaming.py:40, in Reddit_Streaming.download_streaming_all(self, rounds)
38 if rounds > 1:
39 for _ in range(1,rounds):
---> 40 last_id = self._fatch_other_pages(last_id, pbar)

File ~/FinNLP/finnlp/data_sources/social_media/reddit_streaming.py:82, in Reddit_Streaming._fatch_other_pages(self, last_page, pbar)
49 data = {
50 "id": "02e3b6d0d0d7",
51 "variables": {
(...)
79 }
80 }
81 response = self._request_post(url = url, headers= headers, json = data)
---> 82 data = json.loads(response.text)
83 data = data["data"]["subredditInfoByName"]["elements"]["edges"]
84 for d in data:

AttributeError: 'NoneType' object has no attribute 'text'
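
The traceback above shows response being None when response.text is read, which suggests (as an assumption) that the request helper returns None once its retries are exhausted. A minimal sketch of guarding that step. extract_edges is a hypothetical helper, the key path is copied from the traceback, and the fake response and its node/title keys are purely illustrative.

```python
import json
from types import SimpleNamespace
from typing import Any, List, Optional


def extract_edges(response: Optional[Any]) -> List[dict]:
    """Return the list of post edges from a response, or [] if there is no response."""
    if response is None:
        # Assumption: the request helper returns None after all retries failed.
        return []
    payload = json.loads(response.text)
    return payload["data"]["subredditInfoByName"]["elements"]["edges"]


# Stand-in for a real HTTP response object; only the .text attribute is used.
fake = SimpleNamespace(text=json.dumps(
    {"data": {"subredditInfoByName": {"elements": {"edges": [{"node": {"title": "Post A"}}]}}}}
))
print(extract_edges(fake))
print(extract_edges(None))
```

Returning an empty list lets the caller stop paging cleanly and keep the pages already downloaded, instead of losing them to an unhandled exception.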
