chris-greening / instascrape

Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically

Home Page: https://chris-greening.github.io/instascrape/

License: MIT License

Python 99.99% Shell 0.01%

python instagram webscraping data-mining instagram-scraper lightweight python3 data-science python-scraper instagram-data

instascrape's Introduction

instascrape: powerful Instagram data scraping toolkit

Note: This module is no longer actively maintained.

DISCLAIMER:

Instagram has gotten increasingly strict with scraping and using this library can result in getting flagged for botting AND POSSIBLE DISABLING OF YOUR INSTAGRAM ACCOUNT. This is a research project and I am not responsible for how you use it. Independently, the library is designed to be responsible and respectful and it is up to you to decide what you do with it. I don't claim any responsibility if your Instagram account is affected by how you use this library.

What is it?

instascrape is a lightweight Python package that provides an expressive and flexible API for scraping Instagram data. It is geared towards being a high-level building block on the data scientist's toolchain and can be seamlessly integrated and extended with industry standard tools for web scraping, data science, and analysis.

Key features

Here are a few of the things that instascrape does well:

Powerful, object-oriented scraping tools for profiles, posts, hashtags, reels, and IGTV
Scrapes HTML, BeautifulSoup, and JSON
Download content to your computer as png, jpg, mp4, and mp3
Dynamically retrieve HTML embed code for posts
Expressive and consistent API for concise and elegant code
Designed for seamless integration with Selenium, Pandas, and other industry standard tools for data collection and analysis
Lightweight; no boilerplate or configurations necessary
The only hard dependencies are Requests and Beautiful Soup

Installation
Sample Usage
Documentation
Blog Posts
Contributing
Dependencies
License
Support

💻 Installation

Minimum Python version

This library currently requires Python 3.7 or higher.

pip

Install from PyPI using

$ pip3 install insta-scrape

WARNING: make sure you install insta-scrape and not a package with a similar name!

🔎 Sample Usage

All top-level, ready-to-use features can be imported using:

from instascrape import *

instascrape uses clean, consistent, and expressive syntax to make the developer experience as painless as possible.

# Instantiate the scraper objects 
google = Profile('https://www.instagram.com/google/')
google_post = Post('https://www.instagram.com/p/CG0UU3ylXnv/')
google_hashtag = Hashtag('https://www.instagram.com/explore/tags/google/')

# Scrape their respective data 
google.scrape()
google_post.scrape()
google_hashtag.scrape()

print(google.followers)
print(google_post['hashtags'])
print(google_hashtag.amount_of_posts)
>>> 12262794
>>> ['growwithgoogle']
>>> 9053408

See the Scraped data points section of the Wiki for a complete list of the scraped attributes provided by each scraper.

📚 Documentation

The official documentation can be found on Read the Docs.

📰 Blog Posts

Check out blog posts on the official site or DEV for ideas and tutorials!

🙏 Contributing

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome!

Feel free to open an Issue, check out existing Issues, or start a discussion.

Beginners to open source are highly encouraged to participate and ask questions if you're unsure what to do/where to start ❤️

🕸️ Dependencies

💳 License

This library operates under the MIT license.

❔ Support

Check out the FAQ

Reach out to me if you want to connect or have any questions and I will do my best to get back to you

Email:
- [email protected]
Twitter:
- @ChrisGreening
LinkedIn
- Chris Greening
Personal contact form:
- www.christophergreening.com

instascrape's Issues

Generate a .csv from the scrape_posts()

Hello,

First of all, amazing job on the instascrape package; the new updates are awesome.

I have a question.

I'm using the code below to retrieve post information (I based it on the Joe Biden scrape example):

from selenium.webdriver import Chrome
from instascrape import Profile, scrape_posts
import pandas as pd
import json

# Creating our webdriver
webdriver = Chrome("chromedriver.exe")

# Scraping Joe Biden's profile
SESSIONID = '....'
headers = {"user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
           "cookie": f"sessionid={SESSIONID};"}
Insta = Profile("clicktays")
Insta.scrape(headers=headers)

# Scraping the posts
posts = Insta.get_posts(webdriver=webdriver, amount=2, login_first=True)
scraped, unscraped = scrape_posts(posts, silent=False, headers=headers, pause=10)

I saw this code for retrieving post information, but I'm having difficulty saving it to a CSV or Excel file, or even loading it into pandas.

I looked around but couldn't work it out, as I'm not an expert in Python.

Can you give me a hand?
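
One possible approach, sketched under assumptions: each scraped post exposes a `to_dict()` method (another report in this tracker calls `post.to_dict()`), so the dictionaries can be collected and written with the standard `csv` module. `FakePost`, `posts_to_rows`, and `write_rows_to_csv` are hypothetical names used only for illustration; in real use the `scraped` list would take the place of the stand-in objects.

```python
import csv
import os
import tempfile
from typing import Any, Dict, Iterable, List


def posts_to_rows(posts: Iterable[Any]) -> List[Dict[str, Any]]:
    """Collect one dictionary per post via each object's to_dict() method."""
    return [post.to_dict() for post in posts]


def write_rows_to_csv(rows: List[Dict[str, Any]], path: str) -> None:
    """Write rows to a CSV file, using the union of all keys as the header."""
    fieldnames: List[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # An explicit utf-8 encoding keeps emoji in captions from failing on
    # platforms whose default encoding cannot represent them.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


class FakePost:
    """Hypothetical stand-in for a scraped post object."""

    def __init__(self, **data: Any) -> None:
        self._data = data

    def to_dict(self) -> Dict[str, Any]:
        return dict(self._data)


rows = posts_to_rows(
    [FakePost(shortcode="abc", likes=10), FakePost(shortcode="def", likes=3, comments=1)]
)
csv_path = os.path.join(tempfile.gettempdir(), "posts.csv")
write_rows_to_csv(rows, csv_path)
```

If pandas is installed, `pd.DataFrame(rows)` should give an equivalent table from the same list of dictionaries.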

How to download all photos if post have multiple photos

The 'download_recent_photos' example downloads only one image per post. If a post contains multiple photos, how can I download all of them?

Scrape Only Videos?

Problem?
I only need to download videos from IG, not photos. Is there a way I could skip photos while scraping?

Solution?
Perhaps a parameter in get_posts or scrape_posts for video_only=True or comparable flag
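
Until such a flag exists, filtering after scraping may be enough. The sketch below assumes each post can be indexed with an `'is_video'` key, as another report in this tracker does with `x['is_video']`; the dictionaries are hypothetical stand-ins for scraped posts and `only_videos` is an illustrative name, not part of the library.

```python
from typing import Any, Iterable, List, Mapping


def only_videos(posts: Iterable[Mapping[str, Any]]) -> List[Mapping[str, Any]]:
    """Keep posts whose 'is_video' value is exactly True.

    An identity check is used because a missing value may come back as nan,
    which is truthy and would slip through a plain `if post['is_video']`.
    """
    return [post for post in posts if post["is_video"] is True]


# Hypothetical stand-ins for scraped posts.
sample_posts = [
    {"shortcode": "a1", "is_video": True},
    {"shortcode": "b2", "is_video": False},
    {"shortcode": "c3", "is_video": float("nan")},
]
video_posts = only_videos(sample_posts)
```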

Alternatives I've considered:
https://github.com/drawrowfly/instagram-scraper — not super well maintained!

Examples

Provide simple and/or complex examples showcasing the different ways you can use instascrape.

Jupyter Notebooks explaining your examples in detail are encouraged but not required!

UnicodeEncodeError when using Post.to_csv if emoji in caption

Describe the bug
If there is an emoji in the caption of a post that has been scraped, it will throw a UnicodeEncodeError when attempting to use Post.to_csv instance method to write the data to .csv.

To Reproduce

from instascrape import Post 
url = 'https://www.instagram.com/p/CGa0nQBljxN/'
post = Post(url)
post.load()
post.to_csv('test.csv')

and this raises

D:\Programming\pythonstuff\instascrape\instascrape\scrapers\post.py in to_csv(self, fp)
     42         # have to convert to serializable format
     43         self.upload_date = datetime.datetime.timestamp(self.upload_date)
---> 44         super().to_csv(fp=fp)
     45         self.upload_date = datetime.datetime.fromtimestamp(self.upload_date)
     46

D:\Programming\pythonstuff\instascrape\instascrape\core\_static_scraper.py in to_csv(self, fp)
     86             writer = csv.writer(csv_file)
     87             for key, value in self.to_dict().items():
---> 88                 writer.writerow([key, value])
     89
     90     def to_json(self, fp: str) -> None:

~\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\u2728' in position 224: character maps to <undefined>

Expected behavior
It's expected that this would write the scraped data to a .csv file

Desktop (please complete the following information):

Windows 10
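
The traceback ends in `cp1252.py`, which suggests the file was opened with the Windows default encoding. A minimal sketch of the usual remedy, independent of instascrape, is to pass `encoding="utf-8"` when opening the file; the caption text and file name here are made up for illustration.

```python
import csv
import os
import tempfile

caption = "New drop \u2728"  # \u2728 is the character named in the traceback
csv_path = os.path.join(tempfile.gettempdir(), "emoji_caption.csv")

# Passing the encoding explicitly avoids falling back to the platform default.
with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["caption", caption])

# Read the file back with the same encoding to confirm the round trip.
with open(csv_path, newline="", encoding="utf-8") as csv_file:
    rows = list(csv.reader(csv_file))
```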

Downloading stories

Hello everybody

Does this repo have any feature to support story downloading?

future feature annotations is not defined (profile.py, line 1)

Describe the bug
Heroku can't find the latest version of the library; the latest version it offers is 0.6.6. After installing 0.6.6, the following error appears:

Django Version:	3.1.7
SyntaxError
future feature annotations is not defined (profile.py, line 1)
/app/.heroku/python/lib/python3.6/site-packages/instascrape/scrapers/init.py, line 1, in
/app/.heroku/python/bin/python
3.6.13

Expected behavior
My Heroku project should have worked.
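
The error output shows Python 3.6.13, while `from __future__ import annotations` only exists from Python 3.7 onward, which is also the minimum version the README states. A small guard like the following (the function name is hypothetical) can make the mismatch explicit before the import fails:

```python
import sys
from typing import Tuple

# `from __future__ import annotations` was introduced in Python 3.7.
MINIMUM_VERSION = (3, 7)


def supports_future_annotations(version: Tuple[int, ...] = tuple(sys.version_info[:3])) -> bool:
    """Return True when the given interpreter version is at least 3.7."""
    return tuple(version[:2]) >= MINIMUM_VERSION
```

At the time of this report, Heroku selected the interpreter from a `runtime.txt` file, so requesting a 3.7+ runtime there would be the likely fix.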

MissingCookiesWarning and InstagramLoginRedirectError when using session id

Describe the bug

As the title

C:\Users\User\Anaconda3\lib\site-packages\instascrape\core\_static_scraper.py:136: MissingCookiesWarning: Request header does not contain cookies! It's recommended you pass at least a valid sessionid otherwise Instagram will likely redirect you to their login page.
  MissingCookiesWarning
Traceback (most recent call last):
  File "postscraper.py", line 5, in <module>
    from instascrape import Profile, scrape_posts
  File "C:\Users\User\Anaconda3\lib\site-packages\instascrape\__init__.py", line 9, in <module>
    google.scrape()
  File "C:\Users\User\Anaconda3\lib\site-packages\instascrape\core\_static_scraper.py", line 144, in scrape
    return_data = self._get_json_from_source(self.source, headers=headers, session=session)
  File "C:\Users\User\Anaconda3\lib\site-packages\instascrape\core\_static_scraper.py", line 265, in _get_json_from_source
    self._validate_scrape(json_dict)
  File "C:\Users\User\Anaconda3\lib\site-packages\instascrape\core\_static_scraper.py", line 301, in _validate_scrape
    raise InstagramLoginRedirectError
instascrape.exceptions.exceptions.InstagramLoginRedirectError: Instagram is redirecting you to the login page instead of the page you are trying to scrape. This could be occuring because you made too many requests too quickly or are not logged into Instagram on your machine. Try passing a valid session ID to the scrape method as a cookie to bypass the login requirement

To Reproduce

from selenium import webdriver
from instascrape import Profile, scrape_posts


webdriver = webdriver.Chrome("C:/usr/local/bin/chromedriver.exe")
SESSIONID = 'xxxxxxxxxxxxxxxxxxx'

headers = {"user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
           "cookie": f"sessionid={SESSIONID};"}
profile = Profile("google")
profile.scrape(headers=headers)

posts = profile.get_posts(webdriver=webdriver, login_first=True)
scraped_posts, unscraped_posts = scrape_posts(posts, headers=headers, pause=10, silent=False
)

OS: Windows 10
Browser chrome
Version 88.0.4324.150 (Official Build) (64-bit)

Additional context
I got the session ID as described in http://valvepress.com/how-to-get-instagram-session-cookie/.
I'm just trying code taken from the posts and don't know why it's not working.
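
A sketch of a helper for building the headers dictionary used in the script above, assuming the same `user-agent` and `cookie` layout. It strips stray whitespace and rejects obvious placeholder values; `build_headers` is a hypothetical name, and the check does not by itself explain the redirect reported here.

```python
from typing import Dict

DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57"
)


def build_headers(session_id: str, user_agent: str = DEFAULT_USER_AGENT) -> Dict[str, str]:
    """Build request headers carrying the sessionid cookie."""
    cleaned = session_id.strip()
    # Reject empty strings and values made only of filler characters.
    if not cleaned or set(cleaned) <= {"x", "X", ".", "*"}:
        raise ValueError("session_id looks like a placeholder, not a real cookie value")
    return {"user-agent": user_agent, "cookie": f"sessionid={cleaned};"}
```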

Add proxy options

I am trying instascrape, but I need to set a proxy (otherwise I get an error). How should I proceed?

Import Error cannot import name 'exceptions' from 'instascrape.exceptions'

How do I get all posts from a specific hashtag passing the session id output=csv?

Is this possible in the current version? I can't figure out how to do it.

Get the first comment

Hey hey Chris,

First of all, great job with this package; it's really interesting and useful. Is there a way to add a tool that retrieves the first comment on a post? That would be helpful because some users put their hashtags in the first comment after posting the photo/video to the feed.

Is there a way to do it?

Thanks in advance

Reels views

When I retrieve the view count of a Reel, I get a number that is lower than the real one shown on Instagram. This doesn't happen with normal video posts, only with Reels. How do you get this number?

Thanks

Add unit tests

Add some unit tests that better cover the codebase.

Feel free to submit a PR with even just one or two tests, any help would be much appreciated

ImportError: cannot import name 'extract_email' from 'helpers'

Scrape by location id

Is your feature request related to a problem? Please describe.
For my project I would like to scrape Instagram media from particular places. I have already found a way to retrieve the place IDs.

Describe the solution you'd like
Scrape this endpoint: https://www.instagram.com/explore/locations/

Cannot import name 'Profile' from 'instascrape'

Describe the bug
After installation, I tried to run it and got this error:
Cannot import name 'Profile' from 'instascrape'.

Why do you make it so difficult?

I have installed insta-scrape, why can't I just enter it in the terminal as with other projects and take action from there? Why do I have to write a script to use this program?

Unit testing

Is your feature request related to a problem? Please describe.
Want to get some simple unit tests going to make sure nothing critical breaks as changes are made, will continue developing this in the future

Describe the solution you'd like
Write some tests for the library

Describe alternatives you've considered
N/A

Additional context
I'm going to start working on this soon but wouldn't mind help

Unable to download Instagram Videos

When I run this code:

from instascrape import Hashtag

google_hashtag = Hashtag('https://www.instagram.com/explore/tags/google/')
google_hashtag.scrape()
# Download the first recent post that is flagged as a video
for x in google_hashtag.get_recent_posts():
    if x['is_video'] is True:
        x.download('tmp.mp4')
        break

I get the error below. I am trying to download a video from Instagram; any assistance would be appreciated. Downloading images works fine.

File "C:\Users\jacks\AppData\Local\Programs\Python\Python39\lib\site-packages\instascrape\scrapers\post.py", line 84, in download
    resp = requests.get(url, stream=True)
  File "C:\Users\jacks\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\jacks\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\jacks\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 528, in request
    prep = self.prepare_request(req)
  File "C:\Users\jacks\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 456, in prepare_request
    p.prepare(
  File "C:\Users\jacks\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 316, in prepare
    self.prepare_url(url, params)
  File "C:\Users\jacks\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 390, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'nan': No schema supplied. Perhaps you meant http://nan?
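
The final line shows that the URL handed to `requests.get` was `nan`, so the video URL was apparently never populated. A defensive check before downloading could look like this sketch; `is_downloadable_url` is a hypothetical helper and not part of instascrape or requests.

```python
from typing import Any
from urllib.parse import urlparse


def is_downloadable_url(value: Any) -> bool:
    """Return True only for strings that parse as http(s) URLs with a host."""
    if not isinstance(value, str):
        return False  # covers float nan and None
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Skipping posts that fail this check avoids the `MissingSchema` exception, although it does not recover the missing URL itself.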

Making a data frame from all instagram posts of a user

Hi there, I want to build a dataframe from all of a user's Instagram posts (number of likes, tags, number of comments, etc.). I am using these lines of code, but how do I make a dataframe from the result?
posts = kyliecosmetics.get_posts(webdriver=webdriver,login_first=True)
scraped_posts, unscraped_posts = scrape_posts(posts, headers=headers, pause=10, silent=False)

It does seem to work with this code for the last 12 posts:
recent_postskylie = kyliecosmetics.get_recent_posts()
posts_datakylie = [post.to_dict() for post in recent_postskylie]
posts_dfkylie = pd.DataFrame(posts_datakylie)

But I would like to do this for all posts. Thank you in advance.

Login

Hi, I always receive the error "InstagramLoginRedirectError", so I want to log in.
Taking your code as an example:

# Instantiate the scraper objects
google = Profile('https://www.instagram.com/google/')

# Scrape their respective data
google.scrape()

How can I log in with a username and password?

Update the README

Add/update details on the README for this project. Feel free to be creative, we want it to be bright, creative, and exciting! I've been kind of following pandas README for inspiration but other ideas are absolutely welcome

Some ideas to tackle:

update Features section to be more comprehensive
add more images/gifs
short snippets/examples
Dependencies section
badge(s) that add something meaningful to understanding this repo
emojis..... lots of emojis
etc!

Really open ended, feel free to comment/submit a PR with ideas that you feel would enhance the README for this project

Rewrite examples to reflect changes

There have been recent breaking, backwards incompatible changes and thus, the examples section is outdated and sorely in need of being updated to reflect these changes.

The changes aren't massive so they shouldn't be hard to fix (just slight syntax changes) but they are certainly broken and in need of updating.

lxml lib requirements

The lxml library is required by the BeautifulSoup parser initialization in

instascrape/instascrape/core/_static_scraper.py

Line 240 in d1700be

return BeautifulSoup(html, features="lxml")

Steps to reproduce

Clone the repository
Copy instascrape/instascrape folder to the root of another project to use it
pip3 install from requirements.txt
Make a simple profile scrape:

from instascrape import Profile
olidroide_profile = Profile('olidroide')
olidroide_profile.scrape()

This error appears:

Traceback (most recent call last):
  File "/usr/share/pycharm/plugins/python-ce/helpers/pydev/pydevd.py", line 1448, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/usr/share/pycharm/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/olidroide/python/tests/main.py", line 13, in <module>
    print_hi()
  File "/home/olidroide/python/tests/main.py", line 6, in print_hi
    olidroide_profile.scrape()
  File "/home/olidroide/python/tests/instascrape/core/_static_scraper.py", line 112, in scrape
    self.json_dict = self._get_json_from_source(self.source)
  File "/home/olidroide/python/tests/instascrape/core/_static_scraper.py", line 205, in _get_json_from_source
    self.soup = self._soup_from_html(self.html)
  File "/home/olidroide/python/tests/instascrape/core/_static_scraper.py", line 238, in _soup_from_html
    return BeautifulSoup(html, features="lxml")
  File "/home/olidroide/python/tests/venv/lib/python3.8/site-packages/bs4/__init__.py", line 243, in __init__
    raise FeatureNotFound(
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
python-BaseException

Pull Request

#43
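
Installing lxml (for example with `pip3 install lxml`) addresses the error directly. As an alternative sketch, and not necessarily what the linked pull request does, the parser name could be chosen at runtime, falling back to Python's built-in `html.parser` when lxml is absent. `pick_parser` is a hypothetical helper.

```python
from importlib.util import find_spec


def pick_parser() -> str:
    """Return 'lxml' when the package is importable, else the built-in parser."""
    return "lxml" if find_spec("lxml") is not None else "html.parser"


parser_name = pick_parser()
```

The result could then be passed along as `BeautifulSoup(html, features=parser_name)`.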

_static_scraper.py wrong json data argument is passed

_static_scraper.py, line 155:

scraped_dict = parse_data_from_json(
                json_dict=flat_json_dict,
                map_dict=mapping,
            )

The flat_json_dict object is passed as the argument; the expected argument is json_dict.
Because the wrong argument is passed, _mappings.py returns nan for some of the mapped keys.

For example:
The key "caption" in the following function will be nan because flat_json_dict is passed.

    @classmethod
    def post_from_hashtag_mapping(cls):
        """
        Return the mapping needed for parsing a post's JSON data from the JSON
        served back after requesting a Hashtag page.
        """
        return {
            "comments_disabled": deque(["comments_disabled"]),
            "id": deque(["id"]),
            "caption": deque(["edge_media_to_caption", "edges", 0, "node", "text"]),
            "shortcode": deque(["shortcode"]),
            "comments": deque(["edge_media_to_comment", "count"]),
            "upload_date": deque(["taken_at_timestamp"]),
            "dimensions": deque(["dimensions"]),
            "height": deque(["height"]),
            "width": deque(["width"]),
            "display_url": deque(["display_url"]),
            "likes": deque(["edge_media_preview_like_count"]),
            "owner": deque(["owner_id"]),
            "is_video": deque(["is_video"]),
            "accessibility_caption": deque(["accessibility_caption"]),
        }
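
To illustrate the reported behaviour, here is a simplified sketch of walking a `deque` of keys through a dictionary. It is not the library's actual implementation; `follow_path` and the sample data are hypothetical. A nested path such as the `caption` one resolves against nested JSON but yields nan against a flattened dictionary.

```python
import math
from collections import deque
from typing import Any, Deque, Union

Key = Union[str, int]


def follow_path(data: Any, path: Deque[Key]) -> Any:
    """Walk the keys in order; return nan as soon as a step cannot be taken."""
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return math.nan
    return current


caption_path = deque(["edge_media_to_caption", "edges", 0, "node", "text"])

nested = {"edge_media_to_caption": {"edges": [{"node": {"text": "hello"}}]}}
# Hypothetical flattened form: the nesting has been collapsed to one level.
flat = {"text": "hello"}

from_nested = follow_path(nested, caption_path)
from_flat = follow_path(flat, caption_path)
```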

Instagram Web-Scraping Bugs

Describe the bug

When using the Selenium webdriver, I get an error saying that Profile.url isn't a String (it's a NoneType object).
When forcing it to be set by Profile.url = url_string, I get a bs4 error:

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

To Reproduce

Steps to reproduce the behavior:
First Error: Header file not working. Reproduce by running the following script

import time

from instascrape import *
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Raw string so the backslashes in the Windows path are not treated as escapes
service = Service(r'C:\webdriver\chromedriver.exe')
service.start()

driver = webdriver.Remote(service.service_url)

url = "some_url"
user = Profile(url)

user.scrape(webdriver=driver)
time.sleep(60)
driver.quit()

Expected behavior

The scrape call should not throw a bs4 error when used with the syntax described in the documentation.

Desktop (please complete the following information):

OS: Windows 10
Browser: Chrome
Version: 88.0.4324.146

Get Comments Made

Is your feature request related to a problem? Please describe.
I'd like to extract the # of comments that a user made (either over all time to date, or some fixed period of time, either works for me).

Describe the solution you'd like
Similar to how you can scrape how many posts someone has made, I'd like to scrape how many comments someone has made.

Describe alternatives you've considered
Haven't found any yet.

Additional context
Love the tool! Great work!!! 💯

Scrape user's Reels

While it's great to have a Reel scraper, it would also be very useful to have a way to retrieve a list of Reels from a user's profile.

For example, have equivalent methods to get_recent_posts and get_posts (ie. get_recent_reels and get_reels) which return a list of Reels.

profile_pic_url and profile_pic_url_hd give incorrect values when using session ID

Code to Reproduce

Let's use Kim Kardashian's IG account as a good example. The session ID can be retrieved in the usual way.

from instascrape import Profile
user = Profile("kimkardashian")
headers = {
    "user-agent": USER_AGENT,
    "cookie": "sessionid=%s" % SESSION_ID
}
user.scrape(headers=headers)
user.to_dict()

This yields correct data in all the fields I checked except the profile_pic_url and profile_pic_url_hd, where the URL sends me to my own profile picture (for my session ID). Possibly this is an IG anti-scraping technique?

Version

latest install as of this writing with python version 3.8

Always saves login users biography

Describe the bug
The biography returned is always that of the logged-in user.

To Reproduce
Script that gives the wrong output:
from instascrape import *

SESSIONID = ' xxx '
headers = {"user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
"cookie": f"sessionid={SESSIONID};"}
user = Profile("anna")
user.scrape(headers=headers)
print(user.username, " bio ", user.biography)

user = Profile("andreas")
user.scrape(headers=headers)
print(user.username, " bio ", user.biography)

output:
anna bio Login users text
andreas bio Login users text

Expected behavior
It should write out the bio of the individual users.

Desktop (please complete the following information):

OS: Centos 7
Latest Version of instascrape, 5th of feb

Thank you for the nice software.

Broken on Python3.8.5 / WSL2 / Ubuntu 20.04

>>> google = Profile('https://www.instagram.com/google/')
>>> google.scrape()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/emil/.local/lib/python3.8/site-packages/instascrape/core/_static_scraper.py", line 110, in scrape
    self.json_dict = self._get_json_from_source(self.source, headers=headers)
  File "/home/emil/.local/lib/python3.8/site-packages/instascrape/core/_static_scraper.py", line 206, in _get_json_from_source
    json_dict_str = self._json_str_from_soup(self.soup)
  File "/home/emil/.local/lib/python3.8/site-packages/instascrape/core/_static_scraper.py", line 237, in _json_str_from_soup
    json_script = [str(script) for script in soup.find_all("script") if "config" in str(script)][0]
IndexError: list index out of range
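
The `IndexError` arises because the list comprehension found no script containing "config" and `[0]` was applied to an empty list. A more defensive lookup, sketched here with plain strings standing in for the soup's script tags, uses `next` with a default so the empty case can be reported clearly. `first_config_script` is a hypothetical helper.

```python
from typing import Iterable, Optional


def first_config_script(scripts: Iterable[str]) -> Optional[str]:
    """Return the first script text containing 'config', or None if absent."""
    return next((script for script in scripts if "config" in script), None)


# Stand-in for a page that has no matching script, for example when a
# different page than expected was served.
page_scripts = ["<script>var a = 1;</script>"]
match = first_config_script(page_scripts)
if match is None:
    message = "No script containing 'config' was found; the page may not be the one expected."
else:
    message = "Found the config script."
```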

email / followers (not the number of followers, but the list of followers)

Thank you for sharing this. It's awesome and really helpful.

I am looking for 'email' and 'list of followers' (not the number of followers) info.

Would it be possible to increase your scraped data points?

Thank you in advance!

Get all post comments

Is your feature request related to a problem? Please describe.
Get all post comments

Describe the solution you'd like
Get all post comments similar to get_recent_comments but for all

Describe alternatives you've considered
Get all post comments

Additional context
If the post has 100 comments, get all comments by paginating (?).

Profile object is empty

When I run
from instascrape import *
g = Profile("https://instagram.com/google/")
g.__dict__
The output is: {'source': 'https://instagram.com/google/', 'url': None, 'html': None, 'soup': None, 'json_dict': None, 'flat_json_dict': None, 'scrape_timestamp': None}

Empty field in scraped profile

Describe the bug
Profile fields are empty after scraping.

To Reproduce

Code
profile = Profile("hm")

profile.scrape()

profile.username

Gives the output
nan

Expected behavior
Right output (in this case "hm")

Desktop (please complete the following information):

OS: Ubuntu 20.04 (LTS) x64
Python 3.8.5
insta-scrape==1.1.0

Function get_posts return always 12 posts

Hi, sorry for bothering you again. As the title says, each time I run get_posts on an Instagram profile it always returns 12 posts.
My code is:
sessionid = '********************'

    lista = []
    # headers = {"User-Agent": "user-agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
    # "cookie": f"sessionid={os.environ.get('sessionid')};"}
    headers = {
        "User-Agent": "user-agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
        "cookie": f"sessionid={sessionid};"}
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    target = 'https://www.instagram.com/molteni_matteo/'
    # target = 'https://www.instagram.com/lubecreopratolacasteldisangro/'
    insta_profile = Profile(target)
    insta_profile.scrape(headers=headers)
    insta_profile.url = target
    list_post = insta_profile.get_posts(webdriver=driver,amount=13)
    print(insta_profile.followers)
    print("Numero post : " + str(len(list_post)))
    for profile_post in list_post:
        profile_post.scrape(headers=headers)
        lst = get_image_urls(profile_post)
        # for y in lst:
        # print("LINK INSIDE POST : ", y)
        # html = profile_post.embed()
        # soup = BeautifulSoup(html, "html.parser")
        # href = None
        # for a in soup.find_all('a', href=True):
        # href = a['href']
        # break
        # sep = '?'
        # href = href.split(sep, 1)[0]
        # href = href+"?__a=1"
        # print('href : ', href)
        # response = requests.get(href,headers=headers).json()
        # json_data = json.loads(response.text)
        # print(response)
        # print(response.get('edge_sidecar_to_children'))
        post_dict = profile_post.to_dict(metadata=False)
        post_dict['images_links'] = lst
        lista.append(post_dict)
    for x in lista:
        print(x)
    print('fine')

list_post = insta_profile.get_posts(webdriver=driver,amount=13)
Even if I set amount to None or don't specify it, I always get 12 posts.
How can I solve this problem?
Sorry to disturb you again.

Add some docstrings

Is your feature request related to a problem? Please describe.
Add some docstrings to places they're missing to improve the documentation for this repo.

Describe the solution you'd like
Docstrings in numpy style

Create simple examples

Create some simple examples showcasing different ways of using instascrape

Question

Hi, I'm using your library and I really love your work. I have a question about post scraping: when I scrape a post, it shows me all the related information, but for images it gives only a link to the first picture, even though posts often contain several pictures. How can I get links to all the pictures in a post, not only the first?
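
One possible direction, sketched under assumptions: a commented-out snippet elsewhere in this tracker reads an `edge_sidecar_to_children` key from a post's JSON, and the mappings shown in another report use an `edges` / `node` layout and a `display_url` key. Assuming multi-image posts follow that same layout (not confirmed here), the URLs could be gathered as below. `all_display_urls` and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def all_display_urls(post_json: Dict[str, Any]) -> List[str]:
    """Collect display_url values from sidecar children, if there are any."""
    children = post_json.get("edge_sidecar_to_children", {}).get("edges", [])
    if not children:
        # Single-image post: fall back to the top-level URL when present.
        url = post_json.get("display_url")
        return [url] if url else []
    return [
        edge["node"]["display_url"]
        for edge in children
        if "display_url" in edge.get("node", {})
    ]


# Hypothetical JSON for a post holding two pictures.
sample_post = {
    "display_url": "https://example.com/1.jpg",
    "edge_sidecar_to_children": {
        "edges": [
            {"node": {"display_url": "https://example.com/1.jpg"}},
            {"node": {"display_url": "https://example.com/2.jpg"}},
        ]
    },
}
urls = all_display_urls(sample_post)
```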

get_recent_posts() raises MissingCookiesWarning but we can't pass a valid cookie

Describe the bug
The get_recent_posts() method raises MissingCookiesWarning, but we can't pass a valid cookie header to avoid it.

To Reproduce

from instascrape import *

instagram_sessionid = "xxx"
headers = {"user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
"cookie": f"sessionid={instagram_sessionid};"}
profile = Profile('https://www.instagram.com/google/')
profile.scrape(headers=headers)
print(profile.posts)
recents = profile.get_recent_posts() #We should pass a cookie here

The code executes correctly, but we get this warning: MissingCookiesWarning: Request header does not contain cookies! It's recommended you pass at least a valid sessionid otherwise Instagram will likely redirect you to their login page.

If I try to pass a header cookie:

from instascrape import *

instagram_sessionid = "xxx"
headers = {"user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
"cookie": f"sessionid={instagram_sessionid};"}
profile = Profile('https://www.instagram.com/google/')
profile.scrape(headers=headers)
print(profile.posts)
recents = profile.get_recent_posts(headers=headers)  # This time I try to pass a header cookie

I get a TypeError: get_recent_posts() got an unexpected keyword argument 'headers'

Expected behavior
We should be able to pass a valid cookie to avoid the warning or the warning should not be triggered altogether.
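
As a stopgap, the warning can be filtered with Python's standard `warnings` module. The sketch below uses a stand-in warning class because the import path of the real `MissingCookiesWarning` is not shown in this report; `noisy_call` is likewise a hypothetical function that mimics a call emitting the warning.

```python
import warnings


class MissingCookiesWarningStandIn(UserWarning):
    """Hypothetical stand-in; the real class lives inside instascrape."""


def noisy_call() -> str:
    """Mimic a library call that emits the cookie warning."""
    warnings.warn("Request header does not contain cookies!", MissingCookiesWarningStandIn)
    return "done"


# record=True captures whatever passes the filters, so the effect is visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Added last, so this filter is checked first and takes precedence.
    warnings.simplefilter("ignore", category=MissingCookiesWarningStandIn)
    result = noisy_call()
```

This only silences the message; it does not supply the cookie to the underlying request.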

Login to Instagram

Is your feature request related to a problem? Please describe.
Depending on usage, this library won't work as intended because Instagram seemingly checks cookies to make sure the user isn't a random bot. If using instascrape from a personal computer that has been logged into Instagram before, this doesn't seem to be a problem. Unfortunately, the library breaks if trying to use it from say a remote server that has never logged into Instagram before.

Describe the solution you'd like
Provide a way to bypass these restrictions by either

logging into Instagram using requests.post or a similar lightweight library
coming up with a way of spoofing or determining the cookies that Instagram looks for
finding out how to request JSON or similar data directly from their server
any other method that allows users to bypass Instagram's restrictions
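For the cookie option, the other issues on this page build the headers dictionary by hand and pass it to scrape(). A minimal sketch of that step as a reusable helper follows. The build_headers function name is an assumption made for this example and is not part of instascrape.

```python
def build_headers(sessionid, user_agent):
    """Return request headers carrying the sessionid cookie.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not sessionid:
        raise ValueError("sessionid must be a non-empty string")
    return {
        "user-agent": user_agent,
        "cookie": f"sessionid={sessionid};",
    }


# "xxx" is a placeholder; a real value comes from a logged-in browser session.
headers = build_headers("xxx", "Mozilla/5.0")
print(headers["cookie"])
```

The resulting dictionary has the same shape as the headers passed to profile.scrape(headers=headers) in the reports above.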

Describe alternatives you've considered
I considered using Selenium to log in, but I really don't want to force people to install or use it. The whole purpose of the library is to be lightweight, and Selenium is slow and adds too much overhead.

Additional context
N/A

Wrong profile_pic_url_hd and biography when passing SESSION_ID

Describe the bug

If we pass a valid SESSION_ID cookie to the scraper, the values profile_pic_url_hd and biography of the Profile are wrong.

To Reproduce

Code

Run this code by providing a valid instagram_sessionid

from instascrape import *

instagram_sessionid = 'xxxxx'
headers = {"user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
"cookie": f"sessionid={instagram_sessionid};"}

profile = Profile("roses_are_rosie")
profile.scrape(headers=headers)
print(profile.profile_pic_url_hd)
print(profile.biography)

Expected output

We expect the output to print the profile pic url and biography of @roses_are_rosie:

https://scontent-fco1-1.cdninstagram.com/v/t51.2885-19/s320x320/120911233_113942167015019_7757793538086741578_n.jpg?_nc_ht=scontent-fco1-1.cdninstagram.com&_nc_ohc=vLiuePti8-oAX81KArw&tp=1&oh=6347fbcf84fa8021ca0f427c4b355573&oe=604ABB41
ROSÉ

Output

Instead we got the profile pic url and the biography of the account that the provided instagram_sessionid is logged in as.
In my specific case, it prints the data of @puntiburraco.
Please be aware that the output depends on the account you are logged in with, but it is always wrong.

https://scontent-fco1-1.cdninstagram.com/v/t51.2885-19/s320x320/66284490_468664600594844_7307310439468105728_n.jpg?_nc_ht=scontent-fco1-1.cdninstagram.com&_nc_ohc=y6o_TmT3c6AAX9PGXzO&tp=1&oh=240222f677c66049485e2edad1d76c77&oe=604E1EC2
App Android
Segnapunti per le tue partite di Burraco
♠️♥️♣️♦️🃏

Tested on:

OS: Windows 10 x64
Python 3.8
instascrape 2.0.2

Additional context
I understand that the latest changes to Instagram broke the library, but we need more documentation on how to handle these new changes.

instascrape.exceptions.exceptions.InstagramLoginRedirectError even though I used get_posts([...] login_first=True)

Describe the bug
I used the tutorial code to scrape my own profile and write the data straight to a CSV file, but I always get this error even though I call get_posts(login_first=True).
The code is from the Joe Biden tutorial; I only changed Joe Biden's handle to my own.
https://github.com/chris-greening/instascrape/blob/master/tutorial/examples/JoeBiden/joebiden.py

Traceback (most recent call last):
  File "scraper.py", line 18, in <module>
    scraped, unscraped = scrape_posts(posts, silent=False, headers=headers,pause=10)
  File "C:\Users\Paulo\AppData\Local\Programs\Python\Python37\lib\site-packages\instascrape\scrapers\scrape_tools.py", line 179, in scrape_posts
    post.scrape(session=session, webdriver=webdriver, headers=headers)
  File "C:\Users\Paulo\AppData\Local\Programs\Python\Python37\lib\site-packages\instascrape\scrapers\post.py", line 80, in scrape
    webdriver=webdriver
  File "C:\Users\Paulo\AppData\Local\Programs\Python\Python37\lib\site-packages\instascrape\core\_static_scraper.py", line 144, in scrape
    return_data = self._get_json_from_source(self.source, headers=headers, session=session)
  File "C:\Users\Paulo\AppData\Local\Programs\Python\Python37\lib\site-packages\instascrape\core\_static_scraper.py", line 265, in _get_json_from_source
    self._validate_scrape(json_dict)
  File "C:\Users\Paulo\AppData\Local\Programs\Python\Python37\lib\site-packages\instascrape\core\_static_scraper.py", line 301, in _validate_scrape
    raise InstagramLoginRedirectError
instascrape.exceptions.exceptions.InstagramLoginRedirectError: Instagram is redirecting you to the login page instead of the page you are trying to scrape. This could be occuring because you made too many requests too quickly or are not logged into Instagram on your machine. Try passing a valid session ID to the scrape method as a cookie to bypass the login requirement

Desktop

OS: Windows 8.1
Browser: Google Chrome

Generate CSV

Hi Chris,

How can I generate the same type of CSV file that you provided? I mean one with all the information organized.

Best.
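One way to approach this, sketched with the standard csv module only: collect one dictionary per scraped item and write them out with a shared header row. The rows below are made-up stand-ins, and how each scraped object is turned into a dictionary is an assumption here; the tutorial's own approach may differ.

```python
import csv
import io


def dicts_to_csv(rows):
    """Serialize a list of dicts to CSV text, one column per distinct key."""
    # Use the union of all keys so rows with missing fields still fit.
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical data standing in for scraped posts.
rows = [
    {"shortcode": "abc", "likes": 10, "comments": 2},
    {"shortcode": "def", "likes": 7},
]
csv_text = dicts_to_csv(rows)
print(csv_text)
```

To write a file instead of a string, the same writer can be pointed at a file opened with open(path, "w", newline="").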

Introduce pre-commit with isort, flake8 & black

Is your feature request related to a problem? Please describe.
I looked around in the docs but couldn't find anything on formatting or linting, so I'm wondering if it's okay to add a pre-commit config and format the codebase to a set standard. For example, lines longer than 120 characters would be detected by flake8 and then reformatted automatically using Black.

Describe the solution you'd like
Automate linting & formatting instead of doing them manually.

Describe alternatives you've considered
Manual formatting

Additional context
I did something similar in this PR for another project - MissMeg/home-automation-app#22

Please let me know if this idea is worth implementing (or not) and if I can add this to the changes 👍

Incorrect Full Name being returned for Profile Object

Describe the bug
When creating a Profile object, the full_name attribute can be incorrect.

To Reproduce
Steps to reproduce the behavior:

Without a sessionid, create the profile: prof = Profile("https://www.instagram.com/atlassian/")
Print the name: print(prof.full_name)
See that the name returned is Max Sutton (at the time of creating this issue) instead of Atlassian

Expected behavior
Expecting prof.full_name to return Atlassian

Additional context
I think the issue is in instascrape.core._mappings._ProfileMapping where full_name is mapped to user_full_name. I'm not sure why this is the case, but if this was just mapped to full_name I believe, after my testing, that this should solve the issue. Happy to PR if need be.
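To illustrate the kind of mismatch described, here is a toy sketch of attribute-to-key mapping. The payload, its values, and the extract helper are all hypothetical and only mirror the key names mentioned in the report; this is not the library's actual _ProfileMapping code.

```python
# Hypothetical flattened payload: keys and values are illustrative only.
flat_json = {
    "full_name": "Atlassian",        # name of the profile being scraped
    "user_full_name": "Max Sutton",  # a different user's name in the same payload
}


def extract(flat, mapping):
    """Build attributes by looking up each mapped key in the flat payload."""
    return {attr: flat[key] for attr, key in mapping.items()}


current_mapping = {"full_name": "user_full_name"}   # behavior described in the report
proposed_mapping = {"full_name": "full_name"}       # change suggested in the report

print(extract(flat_json, current_mapping)["full_name"])
print(extract(flat_json, proposed_mapping)["full_name"])
```

With this toy data the first lookup yields the other user's name and the second yields the profile's name, which matches the symptom reported.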

IndexError: list index out of range

Hi there.

I am trying to get this script running on Debian with Python 3.9, but this is the error I am getting from this simple example:

from instascrape import Profile 
profile = Profile('chris_greening')
profile.scrape()

/home/username/.local/lib/python3.9/site-packages/instascrape/core/_static_scraper.py:134: MissingCookiesWarning: Request header does not contain cookies! It's recommended you pass at least a valid sessionid otherwise Instagram will likely redirect you to their login page.
  warnings.warn(
Traceback (most recent call last):
  File "/home/username/web/phibrows.training/public_html/scrape/sample.py", line 3, in <module>
    profile.scrape()
  File "/home/username/.local/lib/python3.9/site-packages/instascrape/core/_static_scraper.py", line 144, in scrape
    return_data = self._get_json_from_source(self.source, headers=headers, session=session)
  File "/home/username/.local/lib/python3.9/site-packages/instascrape/core/_static_scraper.py", line 264, in _get_json_from_source
    json_dict = json_dict_arr[1]
IndexError: list index out of range

Any idea how to solve this?
Thanks.
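The traceback shows the failure at json_dict_arr[1], meaning the list had fewer than two entries. Why that happens is not stated here, though the preceding warning suggests a login redirect is one possibility. A minimal sketch of a more defensive lookup follows; the ScrapeParseError class and pick_json_blob function are hypothetical names for this example, not part of instascrape.

```python
class ScrapeParseError(Exception):
    """Hypothetical error raised when the expected JSON blob is missing."""


def pick_json_blob(json_dict_arr, index=1):
    """Return json_dict_arr[index], or raise a descriptive error if absent."""
    if len(json_dict_arr) <= index:
        raise ScrapeParseError(
            f"expected at least {index + 1} JSON blobs, found {len(json_dict_arr)}; "
            "the page may be a login redirect or have a changed layout"
        )
    return json_dict_arr[index]


blob = pick_json_blob([{"a": 1}, {"b": 2}])
print(blob)
```

A check like this would not fix the scrape, but it replaces a bare IndexError with a message that points toward the likely cause.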

Add load image from url

Hello! I'd like to add a simple feature of loading an image given the URL like load_image_file. What do you think?

FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

Bug
I receive the following error:
FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

To Reproduce
This happens when calling the .get_posts() method on a Profile object

Expected behavior
I expected to receive an object containing my posts.

Desktop (please complete the following information):

OS
Browser: chrome
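Beautiful Soup raises this error when the requested parser, here lxml, is not installed; installing it (for example with pip install lxml) normally resolves it. For code that calls Beautiful Soup directly, a fallback to the built-in html.parser can be chosen at runtime. The pick_parser helper below is a hypothetical name for this sketch, and it does not change which parser instascrape itself requests.

```python
import importlib.util


def pick_parser():
    """Return "lxml" when it is installed, else the built-in "html.parser"."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser = pick_parser()
print(parser)
```

The returned string can be passed as the second argument to BeautifulSoup(markup, parser).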
