brianleect / etherscan-labels

Full label data dump of top EVM chains in JSON/CSV.

License: MIT License

Python 100.00%
ethereum etherscan pandas scraper selenium-python csv json labels arb avalanche avax fantom ftm arbitrium optimism crypto web3

etherscan-labels's Introduction

EVM Labels

Scrapes labels from the Etherscan, BscScan, PolygonScan, Optimistic Etherscan, Arbiscan, FTMScan and Snowtrace (Avalanche) websites and stores them in JSON/CSV.

🔴 Currently broken: undetected-chromedriver is not working.

| Chain | Site | Label count | Status | Last scraped |
| --- | --- | ---: | --- | --- |
| ETH | https://etherscan.io | 29945 | ✅ ok | 18/6/2023 |
| BSC | https://bscscan.com | 6726 | ✅ ok | 26/3/2023 |
| POLY | https://polygonscan.com | 4997 | ✅ ok | 26/3/2023 |
| OPT | https://optimistic.etherscan.io | 546 | ✅ ok | 29/3/2023 |
| ARB | https://arbiscan.io | 837 | ✅ ok | 26/3/2023 |
| FTM | https://ftmscan.com | 1085 | ✅ ok | 26/3/2023 |
| AVAX | https://snowtrace.io | 1062 | ✅ ok | 26/3/2023 |

Total Chains: 7

Total Labels: 45198

Setup

  1. On the command line, run pip install -r requirements.txt from the folder containing the code.
  2. (Optional) Add ETHERSCAN_USER and ETHERSCAN_PASS to sample.config.json and rename it to config.json.
  3. Run the script with python main.py.
  4. Enter either eth, bsc or poly to specify the chain of interest.
  5. Log in to your ___scan account (prevents popups/missing data).
  6. Press Enter in the CLI once logged in.
  7. Enter either single (retrieve a specific label) or all (retrieve ALL labels).
  8. If single: follow up with the specific label, e.g. exchange, bridge, ...
  9. If all: simply let it run (takes about an hour or more to retrieve everything; note that it occasionally crashes as well).
  10. Individual JSON and CSV data is dumped into the data subfolder.
  11. Consolidated JSON label info is dumped into the combined subfolder.
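The optional credentials from step 2 can be read with a small helper like the sketch below. `load_credentials` is a hypothetical name; the keys ETHERSCAN_USER / ETHERSCAN_PASS come from the setup steps, but how main.py actually consumes config.json may differ.

```python
import json
from pathlib import Path


def load_credentials(path="config.json"):
    """Return (user, password) from config.json, or (None, None) when the
    optional file was never created (step 2 above is optional)."""
    p = Path(path)
    if not p.exists():
        return None, None
    cfg = json.loads(p.read_text())
    return cfg.get("ETHERSCAN_USER"), cfg.get("ETHERSCAN_PASS")
```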

etherscan-labels's People

Contributors: brianleect, c0mm4nd, starguyaman

etherscan-labels's Issues

[QOL] Skipping 0 number cases

Currently, time is often wasted looping through cases where there are 0 accounts or tokens.

E.g. augmented-finance has 0 tokens, but every scraping run still attempts to check it.

Ideally, rather than hitting the page and finding out, there should be a step at the labelcloud scraping stage that determines the number of labels and whether tokens/accounts even exist for that label.
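A sketch of that labelcloud-stage filter, assuming each cloud entry exposes its count as text like "exchange (312)" (the exact markup on the labelcloud page is an assumption):

```python
import re

# Matches "label-name (count)" entries as they might appear in the labelcloud
ENTRY_RE = re.compile(r"^(.*?)\s*\((\d+)\)\s*$")


def labels_worth_scraping(cloud_entries):
    """Filter labelcloud entries down to label names with a non-zero count,
    so labels with 0 accounts/tokens are never visited at all."""
    keep = []
    for entry in cloud_entries:
        m = ENTRY_RE.match(entry)
        if m and int(m.group(2)) > 0:
            keep.append(m.group(1))
    return keep
```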

[QOL] File struct, separate empty files

In one of the prior commits, we initialized empty JSON and CSV files to speed up reruns.

Action: separate the empty files into their own folder so that valid labels of interest can be searched through quickly.

[Framework] Basic extraction

Flow

  1. Identify the label of interest.
  2. Get the table starting at INDEX=0, or at the indicated index.
  3. If table_length == 100, continue; otherwise we have reached the end, so exit. (Note there's an edge case where the total is an exact multiple of 100; exit if length == 0 as well.)
  4. Else: INDEX += 100 -> getTable(INDEX)
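The flow above can be sketched as a loop (assuming a getTable-style fetcher that returns a list of rows for a start index; page size 100 as described):

```python
def scrape_label(get_table, page_size=100):
    """Paginate through a label's table: fetch at index 0, 100, 200, ...
    Stops when a page returns fewer than page_size rows. A final empty page
    also stops the loop, which covers the edge case where the total is an
    exact multiple of page_size."""
    rows, index = [], 0
    while True:
        page = get_table(index)
        rows.extend(page)
        if len(page) < page_size:  # reached the end (or an empty table)
            break
        index += page_size
    return rows
```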

Observations

[QOL] Delete and re-scrape selected

Rationale: the current implementation requires the user to delete the old folder of labels and then scrape everything again.

A nice QOL addition would be a prompt during scraping asking whether we want a fresh scrape of all labels and, if yes, deleting the labels for that chain.

Not high priority atm, just nice to have.

Account labels data also contain tokens

The account labels also contain tokens; I understood these should be in a separate file? E.g. 0x391e86e2c002c70dee155eaceb88f7a3c38f5976 is a token but appears in the account labels.

[Bug] Table address column name, accounts != tokens

It seems that for tokens the address column is named Contract Address, while for accounts it is Address. Not sure if this was a recent change. Unable to test on etherscan atm due to Cloudflare blocking, but visually it appears to be the case.

Simple fix: change df.address to df['Contract Address'] for the token field.
Better fix: loop through the first row and dynamically determine the column name at runtime by checking which field contains an address.
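The "better fix" could look like this sketch. It is shown on a plain dict of columns so it stays dependency-free; with a pandas DataFrame, iterating df.columns and checking df[col].iloc[0] works the same way.

```python
import re

# An EVM address is 0x followed by exactly 40 hex characters
ADDR_RE = re.compile(r"0x[0-9a-fA-F]{40}")


def find_address_column(table):
    """Given a table as {column name: list of cell strings}, return the
    column whose first row looks like an EVM address. This covers both
    'Address' (accounts) and 'Contract Address' (tokens) without
    hardcoding either name."""
    for col, values in table.items():
        if values and ADDR_RE.fullmatch(str(values[0])):
            return col
    raise KeyError("no address-like column found")
```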

[QOL] Csv column cleaning, remove 'ETH' & 'Txn count'

Problem

  • The columns are not used.
  • Removing them gives more clarity on actual changes to our labels when scraped, rather than diffs caused by ETH balance or txn count changes.

Action

  • Remove all 'ETH' & 'Txn count' columns from the existing CSVs.
  • Modify main.py to drop the relevant columns before saving.
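The drop-before-saving step is a one-liner with pandas; errors="ignore" keeps it safe for tables that never had those columns. (The exact column names 'ETH' and 'Txn count' are taken from the issue title; casing on the live site may differ.)

```python
import pandas as pd

# Columns whose values churn between scrapes without the label itself changing
VOLATILE_COLUMNS = ["ETH", "Txn count"]


def clean_columns(df):
    """Drop volatile columns before saving so CSV diffs only reflect
    actual label changes."""
    return df.drop(columns=VOLATILE_COLUMNS, errors="ignore")
```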

[Research] EVM address quirks

Considering the nature of the EVM, where an address being accessible/controlled by a private key on one chain usually indicates it is controlled on all other chains, it might not make sense to keep labels completely segregated by chain. It might make sense to treat a label on etherscan as equally valid as one found on bscscan or polygonscan.

Todo

  1. Determine the number of overlapping addresses and whether it's worthwhile to combine them.

Edge case: old Gnosis multi-sigs were deployed using the nonce of the contract rather than the address of the sender. That is a counterexample where a contract address can be owned by a differing address, but it is likely an extreme edge case.
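The todo's overlap count could be estimated from the per-chain label dumps with something like this sketch (`overlap_report` is a hypothetical helper operating on already-loaded address-to-nametag mappings):

```python
from itertools import combinations


def overlap_report(chain_labels):
    """chain_labels maps chain -> {address: nametag}. Returns the number of
    addresses shared by each pair of chains, as a first estimate of whether
    merging labels across chains is worthwhile."""
    sets = {chain: set(labels) for chain, labels in chain_labels.items()}
    return {(a, b): len(sets[a] & sets[b]) for a, b in combinations(sets, 2)}
```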

Possible to get the full label?

Is it possible to configure the app in such a way as to retrieve the full label?

For example, many labels are truncated with ..., example:

  • 0x495f947276749ce646f68ac8c248420045cb7b5e
  • OpenSea Shar... (OPENST...)

In this example, is it possible to retrieve OpenSea Shared Storefront (OPENSTORE) ?

[Bug] Missing 1 row for all labels (bsc,poly)

It seems that recent changes removed the unnecessary final row that contained aggregated balances and information.

Should be a simple fix: remove the code that drops the last row.

This can actually be removed for etherscan as well, since their format changed.

[QOL] Rework label ignore list to be flexible instead of hardcoded

Problem: the existing label ignore list is hardcoded to skip manually reviewed large labels we deem unnecessary.

When new labels are added, users must either manually add the label name to the hardcoded ignore list or spend a lot of time scraping an irrelevant label, which adds unnecessary friction.

Soln:
A better implementation would be to

  1. Take a user-defined value IGNORE_IF_EXCEED at runtime or from config.
  2. Extract the token and account counts for each label while reading the label cloud.
  3. Skip the label if the value from 2 exceeds IGNORE_IF_EXCEED.
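The three steps above reduce to a small predicate once the labelcloud counts are available (how those counts are extracted from the page is left open here):

```python
def should_skip(label, label_counts, ignore_if_exceed):
    """Decide at labelcloud-reading time whether a label is too large to
    scrape. label_counts holds the token+account totals gathered in step 2;
    ignore_if_exceed is the user-defined threshold from step 1. Unknown
    labels default to 0 and are never skipped."""
    return label_counts.get(label, 0) > ignore_if_exceed
```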

Account type detection

Hi Brian - thanks for this script - really saved me a lot of time.

Long-term, we could think about algorithmically tagging wallets by account / user types. There are a few papers out with methodologies. This is just an idea for the very long term -- it would be helpful for us to work towards having this kind of taxonomy on Ethereum data. I wonder what Etherscan itself has on its roadmap. It seems like such a basic and necessary resource. I see several teams doing this for their own projects all over again each time.

Great job so far - just leaving a note. New to Github. Thanks again! Much appreciate your work.

[Feat] Coverage of other scan sites

The logic used here should be replicable for other scan sites due to their similar structure. Would be good for coverage.

Further research is needed to determine whether the site structure / logic used can actually be re-used.

[Feat] [Bug] Etherscan Cloudflare bypass

It seems that etherscan might have implemented an additional layer of scraping protection. In an attempt to scrape today, even while logged in I was blocked by a Cloudflare page. Might be a major problem.

Will need to research further and see if it occurs often or was a one off case.

[QOL] Net address stats

It would be nice to have an overview of the total labels from each chain etc. displayed at the top of the README.

[Bug] Single scraping not working

Should scrape both accounts and tokens.

Currently it only scrapes accounts; it gets stuck at a prompt asking whether to continue.

Potential soln: remove the prompt asking for more label scraping so that the token function call gets to execute.

[Research][QOL] JSON/CSV vs Single file soln

Currently each run of getLabels is saved as {labelName}.json / {labelName}.csv.

It might be a major QOL improvement to consolidate these into a single file:

address: {'name': NAMETAG, 'label': LABEL}

We might need to look at the total size of all label information and decide whether it makes sense to keep it in a single large JSON file or move it into a DB (SQLite perhaps?).
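A sketch of the single-file consolidation. It extends the proposed `address: {'name': ..., 'label': ...}` shape to a list of labels, since one address can appear under several labels; `consolidate` and the directory layout are assumptions, not the repo's actual API.

```python
import json
from pathlib import Path


def consolidate(data_dir, out_path):
    """Merge per-label {labelName}.json dumps (each {address: nametag})
    into one {address: {'name': ..., 'labels': [...]}} mapping and write
    it out as a single JSON file."""
    combined = {}
    for f in sorted(Path(data_dir).glob("*.json")):
        label = f.stem  # file name doubles as the label name
        for addr, name in json.loads(f.read_text()).items():
            entry = combined.setdefault(addr, {"name": name, "labels": []})
            entry["labels"].append(label)
    Path(out_path).write_text(json.dumps(combined, indent=2))
    return combined
```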

[QOL] [combinedLabels] Remove duplicate nameTags

Change combinedLabels.json to

{'nameTag':NAME,'labels':[LABEL1,LABEL2,LABEL3]}

The main mistake made initially was thinking there were multiple nameTags for the same address. This results in unnecessary repetition.

Example:
["Paradex (0x-ecosystem)", "Paradex (dex)"] -> nametag:"Paradex", labels:[0x-ecosystem,dex]

[Bug] [RetrieveAll] Suspected RL hit? ImportError: html5lib

ImportError: html5lib not found, please install it

After retrieving 10 labels, it crashed on the line

    newTable = pd.read_html(driver.page_source)[0]

Suspect it's the rate limit (RL) returning an invalid table and thus causing the crash.

Will attempt introducing a 3-4s sleep between each page and see if that fixes the problem.
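The sleep-and-retry idea could be wrapped like this (`read_table_with_retry` is a hypothetical wrapper; the exceptions caught mirror the ImportError/ValueError behaviour pd.read_html showed in the crash above):

```python
import time


def read_table_with_retry(fetch, retries=3, delay=4):
    """Retry a flaky table fetch, sleeping between attempts so a suspected
    rate limit has time to reset. Re-raises on the final failure."""
    for attempt in range(retries):
        try:
            return fetch()
        except (ImportError, ValueError):
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```

Usage would be `read_table_with_retry(lambda: pd.read_html(driver.page_source)[0])`.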

[Bug] Undetected Chrome Driver not working

Unable to start the Chrome driver. Issue with the undetected-chromedriver dependency. Not sure if a fix exists.

ultrafunkamsterdam/undetected-chromedriver#1491

File "C:\Python310\lib\site-packages\webdriver_manager\core\http.py", line 16, in validate_response
    raise ValueError(f"There is no such driver by url {resp.url}")
ValueError: There is no such driver by url https://chromedriver.storage.googleapis.com/117.0.5938/chromedriver_win32.zip

[Bug] Etherscan scraping broken

Length of values (0) does not match length of index (3)

Code of interest

# Retrieve all addresses from table
elems = driver.find_elements("xpath", "//tbody//a[@href]")
addressList = []
addrIndex = len(baseUrl + '/address/')
for elem in elems:
    href = elem.get_attribute("href")
    # Likely culprit: this compares against the literal string
    # 'baseUrl/address/', which never matches, so addressList stays empty
    if (href.startswith('baseUrl/address/')):
        addressList.append(href[addrIndex:])

# Quickfix: Optimism uses etherscan subcat style but differing address format
if targetChain == 'eth':
    # Replace address column in newTable dataframe with addressList
    curTable['Address'] = addressList
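The empty addressList explains the "Length of values (0) does not match length of index (3)" error: the startswith check uses the literal string 'baseUrl/address/' instead of the interpolated prefix, so no href ever matches. A corrected extraction, factored into a testable helper (`extract_addresses` is a hypothetical name), would be:

```python
def extract_addresses(hrefs, base_url):
    """Keep only links under base_url/address/ and strip that prefix,
    returning the bare addresses. Fixes the bug where the prefix was
    compared as the literal string 'baseUrl/address/'."""
    prefix = base_url + "/address/"
    return [h[len(prefix):] for h in hrefs if h.startswith(prefix)]
```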

[Bug] Etherscan CSV not matching JSON data, address wrong

,#,Contract Address,Token Name,Market Cap,Holders,Website,Address
0,1,0x06a6a7...3C54aBA8,0xUniverse (PLANET),$0.00,10850,https://0xuniverse.com/,0x06a6a7af298129e3a2ab396c9c06f91d3c54aba8

{"0x06a6a7af298129e3a2ab396c9c06f91d3c54aba8": "0xUniverse (PLANET)"}

The issue probably arose during integration of the new etherscan CSV format: the new field was not considered.

Fix: write a script to replace the column with the JSON data, assuming it is in the same order.
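That repair script could look like the sketch below. The column name "Contract Address" is taken from the sample row above, and the same-order assumption is checked rather than trusted; `repair_csv` itself is a hypothetical helper, not part of the repo.

```python
import json
import pandas as pd


def repair_csv(csv_path, json_path, column="Contract Address"):
    """Overwrite the CSV's truncated address column with the full
    addresses from the matching JSON dump, assuming both files list
    entries in the same order."""
    df = pd.read_csv(csv_path)
    full = list(json.load(open(json_path)))  # JSON keys are the addresses
    if len(full) != len(df):
        raise ValueError("CSV and JSON lengths differ; cannot align by order")
    df[column] = full
    df.to_csv(csv_path, index=False)
    return df
```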

[QOL] [RetrieveAll] Skip if exist/no change

Currently every run of retrieveAll gets ALL labels again, regardless of whether there was any change.

Possible solns:

  1. Skip if the JSON/CSV already exists (easy, but will miss updates).
  2. Check the length of the JSON/CSV against the count indicated in the labelcloud (slightly more troublesome, but will capture changes).
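Option 2 could be sketched as follows, assuming the per-label JSON dumps live in the data subfolder as {labelName}.json (`needs_rescrape` is a hypothetical helper):

```python
import json
from pathlib import Path


def needs_rescrape(label, data_dir, expected_count):
    """Rescrape only when the stored JSON is missing or its entry count
    no longer matches the count shown in the labelcloud."""
    path = Path(data_dir) / f"{label}.json"
    if not path.exists():
        return True
    return len(json.loads(path.read_text())) != expected_count
```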

[QOL] [retrieveAll] Ignore large labels

If numberOfEntries > X, skip the label.

Alternatively we can hardcode labels to ignore, but it is better to filter by entry count. Need to identify the specific element when retrieving from the labelcloud.
