UnicodeEncodeError about twint (CLOSED)

twintproject commented on May 22, 2024
UnicodeEncodeError

from twint.

Comments (9)

haccer commented on May 22, 2024

Ok, I’m going to try reproducing this.

grepsedawk commented on May 22, 2024

haccer commented on May 22, 2024

Yeah, it appears that when I search for ü (which is \xfc), tweep won't scrape any tweets. I tested this with some other non-ASCII Unicode characters and got the same result. I'm going to try to figure out a fix next week.
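
For reference, the character mentioned above and the kind of error in the issue title can be reproduced in a few lines. This is only a minimal sketch, not code from tweep; `can_encode` is a hypothetical helper, and whether this is the exact failure path inside tweep is not established here.

```python
def can_encode(text, encoding):
    """Return True if `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


query = "\xfc"  # the same character as "ü" (U+00FC)

print(query == "ü")                # True
print(can_encode(query, "utf-8"))  # True
print(can_encode(query, "ascii"))  # False: encoding would raise UnicodeEncodeError
```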

tredmill commented on May 22, 2024

Thank you for the script, Haccer. We changed getTweets so that it skips a tweet it cannot process instead of stopping, but it is by no means an elegant solution. This might work for those in a hurry until you find a more sustainable alternative.


async def getTweets(init):
    tweets, init = await getFeed(init)
    count = 0
    for tweet in tweets:
        try:
            tweetid = tweet["data-item-id"]
            datestamp="01 jan 1970"
            try:
                datestamp = tweet.find("a", "tweet-timestamp")["title"].rpartition(" - ")[-1]
            except TypeError:
                pass
            d = datetime.datetime.strptime(datestamp, "%d %b %Y")
            date = d.strftime("%Y-%m-%d")
            timestamp="00:00:00"
            try:
                timestamp = str(datetime.timedelta(seconds=int(tweet.find("span", "_timestamp")["data-time"]))).rpartition(", ")[-1]
            except TypeError:
                pass        
            t = datetime.datetime.strptime(timestamp, "%H:%M:%S")
            time = t.strftime("%H:%M:%S")
            username = tweet.find("span", "username").text.replace("@", "")
            timezone = strftime("%Z", gmtime())
            text = tweet.find("p", "tweet-text").text.replace("\n", " ").replace("http"," http").replace("pic.twitter"," pic.twitter")
            hashtags = ",".join(re.findall(r'(?i)\#\w+', text, flags=re.UNICODE))
            replies = tweet.find("span", "ProfileTweet-action--reply u-hiddenVisually").find("span")["data-tweet-stat-count"]
            retweets = tweet.find("span", "ProfileTweet-action--retweet u-hiddenVisually").find("span")["data-tweet-stat-count"]
            likes = tweet.find("span", "ProfileTweet-action--favorite u-hiddenVisually").find("span")["data-tweet-stat-count"]
            try:
                mentions = tweet.find("div", "js-original-tweet")["data-mentions"].split(" ")
                for i in range(len(mentions)):
                    mention = "@{}".format(mentions[i])
                    if mention not in text:
                        text = "{} {}".format(mention, text)
            except:
                pass
            if arg.users:
                output = username
            elif arg.tweets:
                output = text  # print the tweet text, not the whole list of tweets
            else:
                output = "{} {} {} {} <{}> {}".format(tweetid, date, time, timezone, username, text)
                if arg.hashtags:
                    output+= " {}".format(hashtags)
                if arg.stats:
                    output+= " | {} replies {} retweets {} likes".format(replies, retweets, likes)

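            if arg.o is not None:
                if arg.csv:
                    dat = [tweetid, date, time, timezone, username, text, hashtags, replies, retweets, likes]
                    # Open the file as UTF-8 instead of encoding the row: a list has no .encode()
                    with open(arg.o, "a", newline='', encoding="utf-8") as csv_file:
                        writer = csv.writer(csv_file, delimiter="|")
                        writer.writerow(dat)
                else:
                    # Write text rather than bytes so the file does not end up with b'...' literals
                    with open(arg.o, "a", encoding="utf-8") as out_file:
                        print(output, file=out_file)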

            count += 1
            print(output)
        except:
            print("skipped tweet")
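
As a possible alternative to skipping the whole tweet, the print itself could be made tolerant of characters the terminal encoding cannot represent. This is a rough sketch under that assumption, not part of tweep; `safe_print` is a hypothetical helper name.

```python
import sys


def safe_print(text, stream=None):
    """Print `text`, escaping characters the stream's encoding cannot represent."""
    if stream is None:
        stream = sys.stdout
    # Some streams (e.g. io.StringIO) report no encoding; fall back to UTF-8.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # backslashreplace turns an unencodable "ü" into the text "\xfc"
    # instead of raising UnicodeEncodeError.
    safe = text.encode(encoding, errors="backslashreplace").decode(encoding)
    print(safe, file=stream)


safe_print("Grüße aus München")
```

With a UTF-8 terminal the text is printed unchanged; with an ASCII-only one the non-ASCII characters show up as escapes, and the tweet is not dropped.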

haccer commented on May 22, 2024

Sorry it took until today for me to replicate your environment.

tl;dr
The problem was your locale setting. To fix it, simply run:

export LC_CTYPE=en_GB.UTF-8

--
This kinda threw me off a little because Python 3 is supposed to be Unicode by default.
I tested with the default locale against tweets that contained special characters and had no issues, but after changing to your specific locale settings I got that error.

After exporting that variable, I confirmed that it worked, so this was not an issue with Tweep itself, nor is it connected to the known issue of searching for special characters with Tweep.
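
To see which encoding Python actually picked up from the environment, something like the following can be run before and after exporting the variable. It is only a diagnostic sketch; `describe_output_encoding` is a hypothetical name, and the values it reports depend entirely on the machine.

```python
import locale
import sys


def describe_output_encoding():
    """Collect the settings that influence how print() encodes text."""
    return {
        # Encoding used when writing str to standard output
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale (LC_CTYPE / LANG)
        "locale_preferred": locale.getpreferredencoding(False),
        # Encoding used for file names
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in describe_output_encoding().items():
    print(f"{name}: {value}")
```

If changing the locale is not an option, setting the PYTHONIOENCODING environment variable to utf-8 is another way to force the encoding of the standard streams.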

grepsedawk commented on May 22, 2024

haccer commented on May 22, 2024

@pachonk nah, his locale had: LC_CTYPE=UTF-8

I guess the language has to be specified, so for the USA it'd be en_US.UTF-8
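
One way to check whether a given locale name is accepted on a particular machine is to ask the C library through Python's `locale` module. This is a sketch only; `locale_is_available` is a hypothetical helper, and the results differ between systems, so no expected output is shown.

```python
import locale


def locale_is_available(name, category=locale.LC_CTYPE):
    """Return True if the C library accepts `name` for `category`."""
    previous = locale.setlocale(category)  # query only, changes nothing
    try:
        locale.setlocale(category, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(category, previous)  # restore the active setting


for candidate in ("UTF-8", "en_US.UTF-8", "en_GB.UTF-8"):
    print(candidate, locale_is_available(candidate))
```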

grepsedawk commented on May 22, 2024

grepsedawk commented on May 22, 2024
