Comments (10)
My hunch is that this behaviour is a bug in the hashtag handling implementation I introduced, compounded by the number of tweets with those hashtags exceeding `max_tweets`.
Would you mind providing the value of `max_tweets` in your config so I can try to reproduce and fix the bug?
from pleroma-bot.
Yes - when I ran that small test I believe it was something like 40. I then tried to set it to 3199, based on the documentation indicating that the Twitter API would tolerate no more than 3200, but I got an error message from pleroma-bot stating that the max was 100, so that's where things stand at the moment.
You are right - there are more than 100 total tweets with the hashtags I'm using. My planned workaround, assuming that pleroma-bot does them sequentially, is to do a run of 100, find the last tweet copied over and then use its timestamp +1 minute for the start time for the next run, and that way get through all of them. But if they aren't done sequentially that won't work, hence the issue report. I'm open to other workarounds or suggestions as to how to do it if you have any!
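The date arithmetic for that incremental workaround is easy to script. Here is a small sketch (the function name is mine, not part of pleroma-bot), assuming timestamps in the ISO 8601 format Twitter returns:

```python
# Given the timestamp of the last tweet copied over, compute the start
# time for the next run: that timestamp plus one minute.
from datetime import datetime, timedelta

def next_start_time(created_at):
    """created_at: ISO 8601 string as Twitter returns it,
    e.g. '2018-08-17T06:01:47.000Z'."""
    ts = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%S.%fZ")
    ts += timedelta(minutes=1)
    # Re-emit with millisecond precision to match Twitter's format
    return ts.strftime("%Y-%m-%dT%H:%M:%S") + f".{ts.microsecond // 1000:03d}Z"
```

Whether the sequential assumption holds is exactly the open question here, of course.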
Just noticed that the readme indicates the following:
"If the --noProfile argument is passed, only new tweets will be posted. The profile picture, banner, display name and bio will not be updated on the Fediverse account."
I believe I had been using that switch in my previous tests - perhaps this had something to do with the order being off? I had been using the --noProfile switch because I don't want my profile updated from my birdsite profile, but it would be easy to set it back to the original values after the pleroma-bot import. Seems like this is worth trying out again!
Hmmm... I think it has more to do with that section of the README being poorly worded.
The `--noProfile` flag should not interfere with the order or number of tweets gathered and posted.
It really does what you initially assumed: if the flag is present, the profile isn't updated (profile image, banner image, display name, etc.). The tweets are unaffected either way by the `--noProfile` argument.
The help message maybe does a better job of explaining what it's supposed to do:

```
  -n, --noProfile       skips Fediverse profile update (no background image,
                        profile image, bio text, etc.)
```
Perhaps the use of "new" on the README is a bit misleading:
"If the --noProfile argument is passed, only new tweets will be posted."
Rewording it to something along these lines would be more appropriate:
"If the --noProfile argument is passed, the profile picture, banner, display name and bio will not be updated on the Fediverse account. However, it will still gather and post the tweets following the config parameters."
In any case, do let me know if somehow you find it actually changes (for better or worse) how it behaves in relation to your issue.
Hi,
I've reworked the pagination and how the processing is performed on the gathered tweets in my test branch and I was also trying to reproduce your issue to see how we can approach it.
I verified that the order of the tweets and the hashtags are working on the test branch, so no issue there. The exception raised when the `max_tweets` value was higher than 100 was an issue with the pagination, sorry about that.
Regarding your earliest tweet retrieved only being from 2018 for your hashtag:
Unfortunately, Twitter only makes available the full archive search endpoints to projects with academic research access level (even worse, the publicly available /search/recent endpoint only goes back 1 week).
So the workaround we use is to get them from the /2/tweets endpoint, which itself is capped to 3200 total tweets (100 per page).
The issue here is that we cannot use this endpoint with a query to filter tweets (that have a specific hashtag, for example), so we can only retrieve the latest 3200 tweets for the user and then check locally which of those to keep.
We cannot even get them from a start date onwards: if we provide a start date, Twitter's response still goes from latest to oldest until it reaches that start date (or hits the 3200 cap).
So what we do goes something like this:
- Get the latest 3200 tweets' metadata for that Twitter user
- Remove those tweets that we don't want (if they are replies, RTs, not including hashtags, etc. and the config says so)
- Download related media for those tweets we want to keep
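A rough sketch of that gather-then-filter flow (illustrative only, not pleroma-bot's actual code; the endpoint path, the 100-per-page `max_results` cap and the `pagination_token` cursor are from Twitter's v2 API, but all function names and the injectable fetch callable are mine):

```python
# Page through a user's timeline (up to ~3200 tweets), then filter for a
# hashtag locally, since the endpoint accepts no query filter.
import json
import urllib.request

API = "https://api.twitter.com/2/users/{user_id}/tweets"

def fetch_page(url, bearer_token):
    req = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {bearer_token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def get_user_tweets(user_id, bearer_token, max_tweets=3200, fetch=fetch_page):
    tweets, next_token = [], None
    while len(tweets) < max_tweets:
        url = (API.format(user_id=user_id)
               + "?max_results=100&tweet.fields=created_at,entities")
        if next_token:
            url += f"&pagination_token={next_token}"
        page = fetch(url, bearer_token)
        tweets.extend(page.get("data", []))
        next_token = page.get("meta", {}).get("next_token")
        if not next_token:           # no more pages available
            break
    return tweets[:max_tweets]

def keep_with_hashtag(tweets, hashtag):
    # Local filtering step: keep only tweets tagged with the hashtag
    wanted = hashtag.lstrip("#").lower()
    return [t for t in tweets
            if any(h["tag"].lower() == wanted
                   for h in t.get("entities", {}).get("hashtags", []))]
```

Any tweet older than the 3200th latest simply never shows up in this loop, which is the crux of the problem.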
I'm sure you see the issue here, if there are older tweets (than the 3200th latest) which meet the criteria (have a specific hashtag), we won't be able to fetch them without Premium/Enterprise or Academic Research account access levels.
Testing locally, the oldest tweet I was able to get with the hashtags you wanted was from 2018-08-17T06:01:47.000Z, after fetching 3200 total tweets for the Twitter account, of which only 194 were kept because they matched the criteria.
It looks pretty grim; I don't see any workaround/solution that could work for this specific case.
It's a limitation imposed by Twitter for Essential and Elevated access levels.
First - thank you @robertoszek for diving so deeply into this and explaining it to me. That's some hairy stuff - and rather disappointing on Twitter's part, but I'm sure they have their reasons. Since I don't think I'll be able to get Essential or Elevated access levels, the next thing to try is an import without any start date to get the parts of the corpus stretching back to 2018 - which is still something! Tbh, I'm not so keen on Twitter anymore, so if I can find some utility for mass deletion of tweets, then I will probably get what I can down to 2018, delete all tweets down to the last one from 2018, and then get the rest if I can... that being a big "if" that I can find such a utility and that the strategy works. Either way - I'm super grateful! Do you have a Patreon or something I can "buy you a beer" through? (And I guess I should close this now?)
No problem @lightnin, happy to help!
Twitter's stance on this is rather disappointing, I agree.
Hmm... That's a good idea. I actually don't know if it would work; I'd need to check whether removing a tweet lets you retrieve the next 3200 latest (not counting removed tweets) once you try to fetch them again.
If that's the case, I could see myself adding some archival capabilities for people who are in the same boat as you: a way to download the 3200 latest tweets locally (with metadata, dates, media and so on), then automatically remove them from Twitter and continue the archival process with the next batch until the tweets run out (and maybe zip them up, letting you keep them as a backup).
You would have to be fully committed at that point, there's no going back after that haha.
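Sketched out, the idea could look something like this (purely hypothetical; none of these function names exist in pleroma-bot, and the fetch/delete/save callables stand in for real API calls):

```python
# Archive-then-delete loop: back up the newest batch of tweets locally,
# delete them from Twitter so the next fetch exposes older tweets, repeat.
def archive_and_purge(fetch_latest, delete_tweet, save_locally):
    total = 0
    while True:
        batch = fetch_latest(limit=3200)   # capped by the timeline endpoint
        if not batch:
            break                          # no tweets left on the account
        for tweet in batch:
            save_locally(tweet)            # keep metadata/media as a backup
            delete_tweet(tweet["id"])      # destructive: no going back!
        total += len(batch)
    return total
```

The destructive step is why you'd have to be fully committed: once a batch is deleted, the local copy is the only copy.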
Lastly, just as a clarification: you should have at the very least Essential access if you were able to use the bot (and Elevated if you created the tokens before Nov 15th):

> If you're already using the Twitter API v2, you'll automatically see your Projects upgraded to Elevated access today.
The issue is that the full search API endpoint is only available for Academic Research accounts.
I'd leave this issue open until I create a new release version from the test branch (which includes some pagination fixes) if you don't mind.
Oh, and I don't have a Patreon, Subscribestar or anything like that. I guess I have a PayPal link if you feel inclined to donate but please, don't feel pressured into it. The fact you and other people find this software useful is reward enough.
https://paypal.me/robertoszek
I can confirm that the /2/tweets endpoint does not seem to include deleted tweets; next I need to verify that they don't count towards the 3200 limit either.
Regarding the mass delete, Twitter rate limits the delete endpoint to 50 requests per 15-minute window:
https://developer.twitter.com/en/docs/twitter-api/tweets/manage-tweets/api-reference/delete-tweets-id
And even worse, to 1000 successful requests per 24-hour window per user:
https://developer.twitter.com/en/docs/twitter-api/rate-limits
So deleting 3200 tweets would take a very long time (73 hrs.) just based on rate limits alone.
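The 73-hour figure follows directly from those two limits; here is a quick back-of-the-envelope calculation (the function is mine, just for illustration):

```python
# Estimate the wall-clock hours needed to delete N tweets under Twitter's
# documented limits: 50 DELETE requests per 15-minute window (200/hour)
# and 1000 successful requests per 24-hour window per user.
def deletion_hours(total_tweets, per_15min=50, per_day=1000):
    hours = 0.0
    remaining = total_tweets
    while remaining > 0:
        batch = min(remaining, per_day)        # daily cap
        burst_hours = batch / (per_15min * 4)  # 50 per 15 min -> 200/hour
        remaining -= batch
        if remaining > 0:
            hours += 24            # must wait out the 24-hour window
        else:
            hours += burst_hours   # final day: only the burst time counts
    return hours
```

Three full 24-hour windows for the first 3000 tweets, plus one hour for the last 200, gives the 73 hours quoted above; the daily cap clearly dominates.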
I'd like to adapt pleroma-bot to allow running it as a daemon/service so it's feasible to leave it running in the background doing these types of tasks.
The alternative would be using the official Twitter archive files as an input to get the tweets to delete (and possibly also use them to post to the Fediverse).
I found a project that uses it to bulk-delete tweets:
https://github.com/koenrh/delete-tweets
It would be nice if pleroma-bot could use that official archive to post those tweets to the Fediverse too.
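For reference, the tweets in an official archive export live in data/tweets.js, which is a JSON array wrapped in a JavaScript assignment; a minimal sketch of reading it (the `window.YTD` prefix matches current exports but could change, and the function name is mine):

```python
# Read tweet objects out of a Twitter archive's data/tweets.js file.
# The file looks like: window.YTD.tweets.part0 = [ {"tweet": {...}}, ... ]
import json

def load_archive_tweets(path):
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    # Strip the JavaScript assignment prefix, keep the JSON array
    json_text = raw[raw.index("["):]
    return [entry["tweet"] for entry in json.loads(json_text)]
```

Since the archive contains the full tweet history, this sidesteps both the 3200 cap and the missing full-archive search.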
I've opened an issue to keep track of it here: #59
I've also added some donation links so it's easier for me to point people to them in case I get asked in the future.
https://robertoszek.github.io/pleroma-bot/#funding
Hey @lightnin,
I just wanted to let you know v1.0.0 is officially out and it includes some support for using Twitter's archives. It allows you to get tweets older than 2010 (and more than 3200 of them); hopefully you find it useful!
Fantastic! I will give it a shot, hopefully this weekend.