brianleect / etherscan-labels

Full label data dump of top EVM chains in JSON/CSV.

License: MIT License

Python 100.00%
ethereum etherscan pandas scraper selenium-python csv json labels arb avalanche avax fantom ftm arbitrium optimism crypto web3

etherscan-labels's Introduction

EVM Labels

Scrapes labels from the Etherscan, BscScan, PolygonScan, Optimistic Etherscan, Arbiscan, FTMScan and Snowtrace (Avalanche) websites and stores them in JSON/CSV.

🔴 Currently broken: undetected-chromedriver is not working.

| Chain | Site | Label count | Status | Last scraped |
| --- | --- | ---: | --- | --- |
| ETH | https://etherscan.io | 29945 | ✅ ok | 18/6/2023 |
| BSC | https://bscscan.com | 6726 | ✅ ok | 26/3/2023 |
| POLY | https://polygonscan.com | 4997 | ✅ ok | 26/3/2023 |
| OPT | https://optimistic.etherscan.io | 546 | ✅ ok | 29/3/2023 |
| ARB | https://arbiscan.io | 837 | ✅ ok | 26/3/2023 |
| FTM | https://ftmscan.com | 1085 | ✅ ok | 26/3/2023 |
| AVAX | https://snowtrace.io | 1062 | ✅ ok | 26/3/2023 |

Total Chains: 7

Total Labels: 45198

Setup

  1. On the command line, run pip install -r requirements.txt from the folder containing the code.
  2. (Optional) Add ETHERSCAN_USER and ETHERSCAN_PASS to sample.config.json and rename it to config.json.
  3. Run the script with python main.py.
  4. Enter either eth, bsc or poly to specify the chain of interest.
  5. Log in to your ___scan account (prevents popups/missing data).
  6. Press Enter in the CLI once logged in.
  7. Enter either single (retrieve a specific label) or all (retrieve ALL labels).
  8. If single: follow up with the specific label, e.g. exchange, bridge, ...
  9. If all: simply let it run (takes about an hour or more to retrieve everything; note that it occasionally crashes as well).
  10. Individual JSON and CSV data is dumped into the data subfolder.
  11. Consolidated JSON label info is dumped into the combined subfolder.
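The optional credentials from step 2 can be read with a small helper like the sketch below. `load_credentials` is a hypothetical name; the keys ETHERSCAN_USER / ETHERSCAN_PASS come from the setup steps, but how main.py actually consumes config.json may differ.

```python
import json
from pathlib import Path


def load_credentials(path="config.json"):
    """Return (user, password) from config.json, or (None, None) when the
    optional file was never created (step 2 above is optional)."""
    p = Path(path)
    if not p.exists():
        return None, None
    cfg = json.loads(p.read_text())
    return cfg.get("ETHERSCAN_USER"), cfg.get("ETHERSCAN_PASS")
```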

etherscan-labels's People

Contributors: brianleect, c0mm4nd, starguyaman

etherscan-labels's Issues

[QOL] Skipping 0 number cases

Currently, time is often wasted looping through cases where there are 0 accounts or tokens.

E.g. augmented-finance has 0 tokens, but every scraping run still attempts to check it.

Ideally, rather than hitting the page and finding out, there should be a step at the labelcloud scraping stage that determines the number of labels and whether tokens/accounts even exist for that label.
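A sketch of that labelcloud-stage filter, assuming each cloud entry exposes its count as text like "exchange (312)" (the exact markup on the labelcloud page is an assumption):

```python
import re

# Matches "label-name (count)" entries as they might appear in the labelcloud
ENTRY_RE = re.compile(r"^(.*?)\s*\((\d+)\)\s*$")


def labels_worth_scraping(cloud_entries):
    """Filter labelcloud entries down to label names with a non-zero count,
    so labels with 0 accounts/tokens are never visited at all."""
    keep = []
    for entry in cloud_entries:
        m = ENTRY_RE.match(entry)
        if m and int(m.group(2)) > 0:
            keep.append(m.group(1))
    return keep
```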

[QOL] File struct, separate empty files

In one of the prior commits, we initialized empty JSON and CSV files to speed up reruns.

Action: separate the empty files into their own folder so that valid labels of interest can be searched through quickly.

[Framework] Basic extraction

Flow

  1. Identify the label of interest.
  2. Get the table starting at INDEX=0, or at the indicated index.
  3. If table_length == 100, continue; otherwise we have reached the end, so exit. (Note there's an edge case where the total is an exact multiple of 100; exit if length == 0 as well.)
  4. Else: INDEX += 100 -> getTable(INDEX)
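The flow above can be sketched as a loop (assuming a getTable-style fetcher that returns a list of rows for a start index; page size 100 as described):

```python
def scrape_label(get_table, page_size=100):
    """Paginate through a label's table: fetch at index 0, 100, 200, ...
    Stops when a page returns fewer than page_size rows. A final empty page
    also stops the loop, which covers the edge case where the total is an
    exact multiple of page_size."""
    rows, index = [], 0
    while True:
        page = get_table(index)
        rows.extend(page)
        if len(page) < page_size:  # reached the end (or an empty table)
            break
        index += page_size
    return rows
```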

Observations

[QOL] Delete and re-scrape selected

Rationale: the current implementation requires the user to delete the old folder of labels and then scrape everything again.

A nice QOL addition would be a prompt during scraping asking whether we want a fresh scrape of all labels and, if yes, deleting the labels for that chain.

Not high priority atm, just nice to have.

Account labels data also contain tokens

The account labels also contain tokens; I understood these should be in a separate file? E.g. 0x391e86e2c002c70dee155eaceb88f7a3c38f5976 is a token but appears in the account labels.

[Bug] Table address column name, accounts != tokens

It seems that for tokens the address column is named Contract Address, while for accounts it is Address. Not sure if this was a recent change. Unable to test on etherscan atm due to Cloudflare blocking, but visually it appears to be the case.

Simple fix: change df.address to df['Contract Address'] for the token field.
Better fix: loop through the first row and dynamically determine the column name at runtime by checking which field contains an address.
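The "better fix" could look like this sketch. It is shown on a plain dict of columns so it stays dependency-free; with a pandas DataFrame, iterating df.columns and checking df[col].iloc[0] works the same way.

```python
import re

# An EVM address is 0x followed by exactly 40 hex characters
ADDR_RE = re.compile(r"0x[0-9a-fA-F]{40}")


def find_address_column(table):
    """Given a table as {column name: list of cell strings}, return the
    column whose first row looks like an EVM address. This covers both
    'Address' (accounts) and 'Contract Address' (tokens) without
    hardcoding either name."""
    for col, values in table.items():
        if values and ADDR_RE.fullmatch(str(values[0])):
            return col
    raise KeyError("no address-like column found")
```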

[QOL] Csv column cleaning, remove 'ETH' & 'Txn count'

Problem

  • The columns are not used.
  • Removing them gives more clarity on actual changes to our labels when scraped, rather than diffs caused by ETH balance or txn count changes.

Action

  • Remove all 'ETH' & 'Txn count' columns from the existing CSVs.
  • Modify main.py to drop the relevant columns before saving.
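The drop-before-saving step is a one-liner with pandas; errors="ignore" keeps it safe for tables that never had those columns. (The exact column names 'ETH' and 'Txn count' are taken from the issue title; casing on the live site may differ.)

```python
import pandas as pd

# Columns whose values churn between scrapes without the label itself changing
VOLATILE_COLUMNS = ["ETH", "Txn count"]


def clean_columns(df):
    """Drop volatile columns before saving so CSV diffs only reflect
    actual label changes."""
    return df.drop(columns=VOLATILE_COLUMNS, errors="ignore")
```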

[Research] EVM address quirks

Considering the nature of the EVM, where an address being accessible/controlled by a private key on one chain usually indicates it is controlled on all other chains, it might not make sense to keep labels completely segregated by chain. It might make sense to treat a label on etherscan as equally valid as one found on bscscan or polygonscan.

Todo

  1. Determine the number of overlapping addresses and whether it's worthwhile to combine them.

Edge case: old Gnosis multi-sigs were deployed using the nonce of the contract rather than the address of the sender. That is a counterexample where a contract address can be owned by a differing address, but it is likely an extreme edge case.
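The todo's overlap count could be estimated from the per-chain label dumps with something like this sketch (`overlap_report` is a hypothetical helper operating on already-loaded address-to-nametag mappings):

```python
from itertools import combinations


def overlap_report(chain_labels):
    """chain_labels maps chain -> {address: nametag}. Returns the number of
    addresses shared by each pair of chains, as a first estimate of whether
    merging labels across chains is worthwhile."""
    sets = {chain: set(labels) for chain, labels in chain_labels.items()}
    return {(a, b): len(sets[a] & sets[b]) for a, b in combinations(sets, 2)}
```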

Possible to get the full label?

Is it possible to configure the app in such a way as to retrieve the full label?

For example, many labels are truncated with ..., example:

  • 0x495f947276749ce646f68ac8c248420045cb7b5e
  • OpenSea Shar... (OPENST...)

In this example, is it possible to retrieve OpenSea Shared Storefront (OPENSTORE) ?

[Bug] Missing 1 row for all labels (bsc,poly)

It seems that recent changes removed the unnecessary final row that contained aggregated balances and information.

Should be a simple fix: remove the code that drops the last row.

This can actually be removed for etherscan as well, since their format changed.

[QOL] Rework label ignore list to be flexible instead of hardcoded

Problem: the existing label ignore list is hardcoded to skip manually reviewed large labels we deem unnecessary.

When new labels are added, users must either manually add the label name to the hardcoded ignore list or spend a lot of time scraping an irrelevant label, which adds unnecessary friction.

Soln:
A better implementation would be to

  1. Take a user-defined value IGNORE_IF_EXCEED at runtime or from config.
  2. Extract the token and account counts for each label while reading the label cloud.
  3. Skip the label if the value from 2 exceeds IGNORE_IF_EXCEED.
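The three steps above reduce to a small predicate once the labelcloud counts are available (how those counts are extracted from the page is left open here):

```python
def should_skip(label, label_counts, ignore_if_exceed):
    """Decide at labelcloud-reading time whether a label is too large to
    scrape. label_counts holds the token+account totals gathered in step 2;
    ignore_if_exceed is the user-defined threshold from step 1. Unknown
    labels default to 0 and are never skipped."""
    return label_counts.get(label, 0) > ignore_if_exceed
```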

Account type detection

Hi Brian - thanks for this script - really saved me a lot of time.

Long-term, we could think about algorithmically tagging wallets by account / user types. There are a few papers out with methodologies. This is just an idea for the very long term -- it would be helpful for us to work towards having this kind of taxonomy on Ethereum data. I wonder what Etherscan itself has on its roadmap. It seems like such a basic and necessary resource. I see several teams doing this for their own projects all over again each time.

Great job so far - just leaving a note. New to Github. Thanks again! Much appreciate your work.

[Feat] Coverage of other scan sites

The logic used here should be replicable for other scan sites due to their similar structure. Would be good for coverage.

Further research is needed to determine whether the site structure / logic used can actually be re-used.

[Feat] [Bug] Etherscan Cloudflare bypass

It seems that etherscan might have implemented an additional layer of scraping protection. In an attempt to scrape today, even while logged in I was blocked by a Cloudflare page. Might be a major problem.

Will need to research further and see if it occurs often or was a one off case.

[QOL] Net address stats

It would be nice to have an overview of the total labels from each chain etc. displayed at the top of the README.

[Bug] Single scraping not working

Should scrape both accounts and tokens.

Currently it only scrapes accounts; it gets stuck at a prompt asking whether to continue.

Potential soln: remove the prompt asking for more label scraping so that the token function call gets to execute.

[Research][QOL] JSON/CSV vs Single file soln

Currently each run of getLabels is saved as {labelName}.json / {labelName}.csv.

It might be a major QOL improvement to consolidate these into a single file:

address: {'name': NAMETAG, 'label': LABEL}

We might need to look at the total size of all label information and decide whether it makes sense to keep it in a single large JSON file or move it into a DB (SQLite perhaps?).
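A sketch of the single-file consolidation. It extends the proposed `address: {'name': ..., 'label': ...}` shape to a list of labels, since one address can appear under several labels; `consolidate` and the directory layout are assumptions, not the repo's actual API.

```python
import json
from pathlib import Path


def consolidate(data_dir, out_path):
    """Merge per-label {labelName}.json dumps (each {address: nametag})
    into one {address: {'name': ..., 'labels': [...]}} mapping and write
    it out as a single JSON file."""
    combined = {}
    for f in sorted(Path(data_dir).glob("*.json")):
        label = f.stem  # file name doubles as the label name
        for addr, name in json.loads(f.read_text()).items():
            entry = combined.setdefault(addr, {"name": name, "labels": []})
            entry["labels"].append(label)
    Path(out_path).write_text(json.dumps(combined, indent=2))
    return combined
```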

[QOL] [combinedLabels] Remove duplicate nameTags

Change combinedLabels.json to

{'nameTag':NAME,'labels':[LABEL1,LABEL2,LABEL3]}

The main mistake made initially was thinking there were multiple nameTags for the same address. This results in unnecessary repetition.

Example:
["Paradex (0x-ecosystem)", "Paradex (dex)"] -> nametag:"Paradex", labels:[0x-ecosystem,dex]

[Bug] [RetrieveAll] Suspected RL hit? ImportError: html5lib

ImportError: html5lib not found, please install it

After retrieving 10 labels, it crashed on the line

    newTable = pd.read_html(driver.page_source)[0]

Suspect it's the rate limit (RL) returning an invalid table and thus causing the crash.

Will attempt introducing a 3-4s sleep between each page and see if that fixes the problem.
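The sleep-and-retry idea could be wrapped like this (`read_table_with_retry` is a hypothetical wrapper; the exceptions caught mirror the ImportError/ValueError behaviour pd.read_html showed in the crash above):

```python
import time


def read_table_with_retry(fetch, retries=3, delay=4):
    """Retry a flaky table fetch, sleeping between attempts so a suspected
    rate limit has time to reset. Re-raises on the final failure."""
    for attempt in range(retries):
        try:
            return fetch()
        except (ImportError, ValueError):
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```

Usage would be `read_table_with_retry(lambda: pd.read_html(driver.page_source)[0])`.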

[Bug] Undetected Chrome Driver not working

Unable to start the Chrome driver. Issue with the undetected-chromedriver dependency. Not sure if a fix exists.

ultrafunkamsterdam/undetected-chromedriver#1491

File "C:\Python310\lib\site-packages\webdriver_manager\core\http.py", line 16, in validate_response
    raise ValueError(f"There is no such driver by url {resp.url}")
ValueError: There is no such driver by url https://chromedriver.storage.googleapis.com/117.0.5938/chromedriver_win32.zip

[Bug] Etherscan scraping broken

Length of values (0) does not match length of index (3)

Code of interest

# Retrieve all addresses from table
elems = driver.find_elements("xpath", "//tbody//a[@href]")
addressList = []
addrIndex = len(baseUrl + '/address/')
for elem in elems:
    href = elem.get_attribute("href")
    # Likely culprit: this compares against the literal string
    # 'baseUrl/address/', which never matches, so addressList stays empty
    if (href.startswith('baseUrl/address/')):
        addressList.append(href[addrIndex:])

# Quickfix: Optimism uses etherscan subcat style but differing address format
if targetChain == 'eth':
    # Replace address column in newTable dataframe with addressList
    curTable['Address'] = addressList
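The empty addressList explains the "Length of values (0) does not match length of index (3)" error: the startswith check uses the literal string 'baseUrl/address/' instead of the interpolated prefix, so no href ever matches. A corrected extraction, factored into a testable helper (`extract_addresses` is a hypothetical name), would be:

```python
def extract_addresses(hrefs, base_url):
    """Keep only links under base_url/address/ and strip that prefix,
    returning the bare addresses. Fixes the bug where the prefix was
    compared as the literal string 'baseUrl/address/'."""
    prefix = base_url + "/address/"
    return [h[len(prefix):] for h in hrefs if h.startswith(prefix)]
```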

[Bug] Etherscan CSV not matching JSON data, address wrong

,#,Contract Address,Token Name,Market Cap,Holders,Website,Address
0,1,0x06a6a7...3C54aBA8,0xUniverse (PLANET),$0.00,10850,https://0xuniverse.com/,0x06a6a7af298129e3a2ab396c9c06f91d3c54aba8

{"0x06a6a7af298129e3a2ab396c9c06f91d3c54aba8": "0xUniverse (PLANET)"}

The issue probably arose during integration of the new etherscan CSV format: the new field was not considered.

Fix: write a script to replace the column with the JSON data, assuming it is in the same order.
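That repair script could look like the sketch below. The column name "Contract Address" is taken from the sample row above, and the same-order assumption is checked rather than trusted; `repair_csv` itself is a hypothetical helper, not part of the repo.

```python
import json
import pandas as pd


def repair_csv(csv_path, json_path, column="Contract Address"):
    """Overwrite the CSV's truncated address column with the full
    addresses from the matching JSON dump, assuming both files list
    entries in the same order."""
    df = pd.read_csv(csv_path)
    full = list(json.load(open(json_path)))  # JSON keys are the addresses
    if len(full) != len(df):
        raise ValueError("CSV and JSON lengths differ; cannot align by order")
    df[column] = full
    df.to_csv(csv_path, index=False)
    return df
```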

[QOL] [RetrieveAll] Skip if exist/no change

Currently every run of retrieveAll gets ALL labels again, regardless of whether there was any change.

Possible solns:

  1. Skip if the JSON/CSV already exists (easy, but will miss updates).
  2. Check the length of the JSON/CSV against the count indicated in the labelcloud (slightly more troublesome, but will capture changes).
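Option 2 could be sketched as follows, assuming the per-label JSON dumps live in the data subfolder as {labelName}.json (`needs_rescrape` is a hypothetical helper):

```python
import json
from pathlib import Path


def needs_rescrape(label, data_dir, expected_count):
    """Rescrape only when the stored JSON is missing or its entry count
    no longer matches the count shown in the labelcloud."""
    path = Path(data_dir) / f"{label}.json"
    if not path.exists():
        return True
    return len(json.loads(path.read_text())) != expected_count
```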

[QOL] [retrieveAll] Ignore large labels

If numberOfEntries > X, skip the label.

Alternatively we can hardcode labels to ignore, but it is better to filter by entry count. Need to identify the specific element when retrieving from the labelcloud.
