Miss-matching counts about twarc-hashtags

docnow commented on August 15, 2024
Miss-matching counts

from twarc-hashtags.

Comments (9)

edsu commented on August 15, 2024

I'm a little bit confused by your code but I do think you've found a difference in how twarc-hashtags works and what is in the entities.hashtags column that twarc-csv generates.

It looks like twarc-csv includes not only the tweets that were collected but also the tweets that those tweets reference (replies and quotes), the so-called "includes".
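
To make the distinction concrete, here is a rough sketch of how a Twitter API v2 response page keeps collected tweets under "data" and referenced tweets under "includes". The page contents and variable names are made up for illustration; this is not code from twarc-csv or twarc-hashtags.

```python
# A made-up response page in the general Twitter API v2 shape:
# collected tweets live under "data", referenced ones under "includes".
page = {
    "data": [
        {"id": "2", "text": "a reply #collected"},
    ],
    "includes": {
        "tweets": [
            {"id": "1", "text": "the tweet being replied to #referenced"},
        ]
    },
}

collected = page["data"]
referenced = page.get("includes", {}).get("tweets", [])

# Tweets that are only present because a collected tweet points at them.
collected_ids = {t["id"] for t in collected}
referenced_only = [t for t in referenced if t["id"] not in collected_ids]

print([t["id"] for t in collected])
print([t["id"] for t in referenced_only])
```

Whether hashtags from the second group should be counted is exactly the question raised here.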

Personally I would expect to only get hashtags for the tweets that were collected, not the tweets that were referenced. But I guess having an --all flag to get all might be appropriate?

I wonder if users of twarc-csv understand this behavior when using the data though ...

luisignaciomenendez commented on August 15, 2024

Sure, I tried with a random sample using:
twarc2 sample sample.jsonl
(I have also done some extra trials, but this is the most immediate one.) I know this is hard to replicate, as it's using a live stream of tweets, but I will try to attach/send you the original file that I have.

Here are the results: (for twarc only those that appear with a count=2)

twarc2 hashtags sample.jsonl

Screenshot 2022-01-25 at 12 55 10

from my code:
Screenshot 2022-01-25 at 12 55 37

sample.jsonl.zip

edsu commented on August 15, 2024

@luisignaciomenendez I think @igorbrigadir means where df comes from in:

# Using the generator to create a new dataframe.
a = pd.DataFrame(list(hash_retrieve(df)),
                 columns=['hashtag', 'id'])

Is df loaded from a CSV generated with twarc2 csv?

luisignaciomenendez commented on August 15, 2024

Yes, exactly. I converted it using twarc2 and then loaded it with pandas.

igorbrigadir commented on August 15, 2024

I think I found what the problem is: it's retweets. twarc-csv processes retweets so that they match what you would expect to find, using the full text of the tweet, not what the JSON actually contains. So, for a retweet in the JSON like this:

{
  "entities": {
    "hashtags": [
      {
        "start": 107,
        "end": 115,
        "tag": "EndSARS"
      }
    ]
  },
  "id": "1388203310327508995",
  "referenced_tweets": [
    {
      "type": "retweeted",
      "id": "1388174000472432650"
    }
  ],
  "text": "RT @abjghost: @imoleayomichael was abducted by DSS at 2.30am in his residence and detained for 41days over #EndSARS protest. They still wan…"
}

The retweet is truncated, so only 1 Hashtag is counted by twarc-hashtags: EndSARS

The twarc-csv code, by contrast, will dig into the referenced tweet, 1388174000472432650, which is:

{
  "entities": {
    "urls": [
      {
        "start": 280,
        "end": 303,
        "url": "https://t.co/fDgTVvbQBZ",
        "expanded_url": "https://twitter.com/abjghost/status/1388174000472432650/photo/1",
        "display_url": "pic.twitter.com/fDgTVvbQBZ"
      }
    ],
    "mentions": [
      {
        "start": 0,
        "end": 16,
        "username": "imoleayomichael"
      }
    ],
    "hashtags": [
      {
        "start": 93,
        "end": 101,
        "tag": "EndSARS"
      },
      {
        "start": 224,
        "end": 237,
        "tag": "FreeImoleAyo"
      }
    ]
  },
  "id": "1388174000472432650",
  "in_reply_to_user_id": "927129038933626880",
  "text": "@imoleayomichael was abducted by DSS at 2.30am in his residence and detained for 41days over #EndSARS protest. They still want to convict him.\n\nImoleayo is a Programmer NOT A CRIMINAL!\n\nPls lend your voice in solidarity to \n#FreeImoleAyo\nIt could be you or me.\nPls tweet, RT, Tag https://t.co/fDgTVvbQBZ"
}

So it will count 2 hashtags.
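
A minimal sketch of the two behaviours, using trimmed copies of the two tweets above. This is not the actual twarc-hashtags or twarc-csv code: the helper names own_hashtags and resolved_hashtags and the lookup dict are hypothetical, chosen only for illustration.

```python
# Trimmed copies of the retweet and the original tweet shown above.
retweet = {
    "id": "1388203310327508995",
    "entities": {"hashtags": [{"tag": "EndSARS"}]},
    "referenced_tweets": [{"type": "retweeted", "id": "1388174000472432650"}],
}
original = {
    "id": "1388174000472432650",
    "entities": {"hashtags": [{"tag": "EndSARS"}, {"tag": "FreeImoleAyo"}]},
}


def own_hashtags(tweet):
    """Hashtags listed in the tweet's own entities (truncated for retweets)."""
    return [h["tag"] for h in tweet.get("entities", {}).get("hashtags", [])]


def resolved_hashtags(tweet, lookup):
    """For a retweet, use the retweeted tweet's hashtags if it is available."""
    for ref in tweet.get("referenced_tweets", []):
        if ref["type"] == "retweeted" and ref["id"] in lookup:
            return own_hashtags(lookup[ref["id"]])
    return own_hashtags(tweet)


# Hypothetical index of referenced tweets, keyed by tweet id.
lookup = {original["id"]: original}

print(own_hashtags(retweet))
print(resolved_hashtags(retweet, lookup))
```

The first call sees only the hashtag that survived truncation, while the second sees both hashtags of the original tweet.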

A second source of variation is that twarc-hashtags ignores case, while your code is case sensitive, so EndSARS and endsars will be counted separately, for example. Also, ensure_flattened(data) is meant more for handling entire responses, not small JSON objects within tweets, but since the function is robust enough to handle that, it's OK to keep using it like that. It simply does not do anything to the data, so you can leave it out and have for hashtag in data:
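
To illustrate the case point, here is a small sketch with a made-up list of tags (not taken from the sample file), counted once as-is and once after lowercasing:

```python
from collections import Counter

# Made-up tags, differing only in case for the first three.
tags = ["EndSARS", "endsars", "ENDSARS", "FreeImoleAyo"]

# Case-sensitive: every spelling gets its own bucket.
case_sensitive = Counter(tags)

# Case-insensitive: normalise before counting.
case_insensitive = Counter(tag.lower() for tag in tags)

print(case_sensitive)
print(case_insensitive)
```

With the case-sensitive count, the three spellings each get a count of 1; after lowercasing they collapse into a single entry with a count of 3.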

These aren't mistakes or bugs as such, they're just different things that we should be aware of and decide to count one way or another.

Personally, I'm inclined to edit twarc-hashtags to count the retweeted hashtags the same way as twarc-csv, and keep it ignoring case, same as the Twitter UI. This does mean adding a bit more code, but I think it's less surprising to users, because if someone were to manually verify a count, the numbers should match.

edsu commented on August 15, 2024

I think that twarc-csv is including hashtags from tweets that are referenced using conversation_id and also replies? That was the source of one discrepancy at least. I thought that twarc-hashtags was counting retweets. If that's not the case, it definitely feels like a bug in twarc-hashtags. I'm not sure it makes sense to count hashtags in tweets that are being replied to, quoted etc. though -- unless asked to? I might need to think about this. I guess as a user of a hashtag report I'd want to see counts for tweets that I collected, not tweets related to the tweets I collected, but where one tweet begins and ends is a fuzzy area.

igorbrigadir commented on August 15, 2024

Do you have a sample of what your dataframe contains? How is it generated in the first place? It's hard to say or compare it to the code otherwise.

igorbrigadir commented on August 15, 2024

I think that twarc-csv is including hashtags from tweets that are referenced using conversation_id and also replies?

It used to, but by default in the latest version, no. Just the original tweets merged into the retweets.

Also agree with not counting them from all referenced tweets like replies. Quotes are different though: the quote tweet itself yes, but the quoted tweet? I'm not sure. Right now it will count the quote itself but not the quoted tweet. Still on the fence here too. I guess making command line switches for this will work.
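
As a rough sketch of what such switches could control, assuming a dict of referenced tweets keyed by id is available. The function count_hashtags and its merge_retweets and include_quoted parameters are hypothetical; they are not existing twarc-hashtags options.

```python
from collections import Counter


def tags_of(tweet):
    """Lowercased hashtags from a tweet's own entities."""
    return [h["tag"].lower() for h in tweet.get("entities", {}).get("hashtags", [])]


def count_hashtags(tweets, lookup, merge_retweets=True, include_quoted=False):
    """Count hashtags under a chosen policy (hypothetical switches)."""
    counts = Counter()
    for tweet in tweets:
        # Map each reference type to the referenced tweet, if we have it.
        refs = {
            r["type"]: lookup.get(r["id"])
            for r in tweet.get("referenced_tweets", [])
        }
        if merge_retweets and refs.get("retweeted"):
            # Use the full original tweet instead of the truncated retweet.
            counts.update(tags_of(refs["retweeted"]))
        else:
            counts.update(tags_of(tweet))
        if include_quoted and refs.get("quoted"):
            # Optionally add hashtags from the tweet being quoted.
            counts.update(tags_of(refs["quoted"]))
    return counts


# Made-up example: a quote tweet and the tweet it quotes.
quoted = {"id": "1", "entities": {"hashtags": [{"tag": "A"}]}}
quote = {
    "id": "2",
    "entities": {"hashtags": [{"tag": "B"}]},
    "referenced_tweets": [{"type": "quoted", "id": "1"}],
}
lookup = {"1": quoted}

print(count_hashtags([quote], lookup))
print(count_hashtags([quote], lookup, include_quoted=True))
```

The defaults here mirror the behaviour described above: retweets are merged with their originals, and the quoted tweet is left out unless asked for.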

Some of this overlaps with what I was planning with DocNow/twarc-statistics#2 and with DocNow/twarc#562.

edsu commented on August 15, 2024

@igorbrigadir OK, thanks! I'll have to double check. I just got a new computer and am using the latest twarc-csv. I thought I noticed it pulling in hashtags from the included conversation_id after flattening.
