ilyasozkurt / mobilephone-brands-and-models

A database of mobile phone manufacturers and their models.

License: GNU General Public License v3.0

PHP 0.84% Blade 0.09% PLpgSQL 99.06%
brands gsmarena mobilephone mysql

mobilephone-brands-and-models's Issues

Some URLs are not scraped

URLs with the following syntax are not scraped:

  1. .com/_ - don't know why, e.g. Umidigi F1. Maybe the scraper omits the part after / and downloads the main page.

  2. [g] - throws the following error: GuzzleHttp\Exception\TooManyRedirectsException. It's just one model: vivo Y20s [G]. I tried to change the URL to %5Bg%5D but to no avail.

P.S.: Glad you are back and with a cute website.

Multiple models per page

Some pages contain multiple models, e.g.
HTC U11 Life For Global market (GLOBAL) and HTC U11 Life For North America (NA): www.gsmarena.com/htc_u11_life-8885.php

The "ALL VERSIONS" model is rather misleading 'cause it actually shows the information for the first version only, which in the example above is the GLOBAL version. The information about the NA version was not scraped at all.

Looks like the additional info is handled by some JavaScript, so it's completely new scraping logic to add.

Missing Information

Some information is missing (JSON as example):

E.g.: Huawei Mate 40 Pro+

  1. Display Type ("OLED, HDR10, 90Hz"). Due to conflict with Battery Type.
  2. Memory unnamed ("SFS 1.0"). Due to conflict with Sound unnamed.
  3. Main Camera Features ("Leica optics, LED flash, panorama, HDR"). Due to conflict with Selfie Camera Features.
  4. Main Camera Video ("4K@30/60fps, 1080p@30/60/120/240/480fps, 720p@960fps, 720p@3840fps, HDR, gyro-EIS"). Due to conflict with Selfie Camera Video.

Also, it is wrong to use [Main Camera] "Penta" & [Selfie Camera] "Dual" as JSON keys. I think it's necessary to add two additional nested JSON objects, "main_camera" & "selfie_camera", with the following structure:

"main_camera": {
  "Main_Camera_Type": "Penta",
  "Main_Camera_Specifications": "50 MP, f/1.9, 23mm (wide), 1/1.28\", 1.22µm, omnidirectional PDAF, Laser AF, OIS 12 MP, f/2.4, (telephoto), PDAF, OIS, 3x optical zoom 8 MP, f/4.4, 240mm (periscope telephoto), PDAF, OIS, 10x optical zoom 20 MP, f/2.4, 14mm (ultrawide), PDAF TOF 3D, (depth)",
  "Main_Camera_Features": "Leica optics, LED flash, panorama, HDR",
  "Main_Camera_Video": "4K@30/60fps, 1080p@30/60/120/240/480fps, 720p@960fps, 720p@3840fps, HDR, gyro-EIS"
},
"selfie_camera": {
  "Selfie_Camera_Type": "Dual",
  "Selfie_Camera_Specifications": "13 MP, f/2.4, 18mm (ultrawide) TOF 3D, (depth/biometrics sensor)",
  "Selfie_Camera_Features": "HDR, panorama",
  "Selfie_Camera_Video": "4K@30/60fps, 1080p@30/60/240fps"
},

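As a rough illustration of the nested structure proposed above, the sketch below builds the two camera blocks and serializes them. The helper name buildCameraBlock and the shape of its input are assumptions made for this example, not code from the repository; the values are the ones quoted in this issue.

```php
<?php
// Hypothetical helper (not from the repository): builds one nested camera
// block whose keys are prefixed with the section name, so that "Features"
// and "Video" no longer collide between the two cameras.
function buildCameraBlock(string $section, string $type, array $rows): array
{
    $prefix = str_replace(' ', '_', $section); // "Main Camera" -> "Main_Camera"
    $block = [$prefix . '_Type' => $type];
    foreach ($rows as $title => $value) {
        $block[$prefix . '_' . $title] = $value;
    }
    return $block;
}

$phone = [
    'main_camera' => buildCameraBlock('Main Camera', 'Penta', [
        'Features' => 'Leica optics, LED flash, panorama, HDR',
        'Video' => '4K@30/60fps, 1080p@30/60/120/240/480fps, 720p@960fps, 720p@3840fps, HDR, gyro-EIS',
    ]),
    'selfie_camera' => buildCameraBlock('Selfie Camera', 'Dual', [
        'Features' => 'HDR, panorama',
        'Video' => '4K@30/60fps, 1080p@30/60/240fps',
    ]),
];

// json_encode escapes embedded double quotes (such as an inch mark) itself.
echo json_encode($phone, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), PHP_EOL;
```

Because each block carries its own prefixed keys, the main camera values survive even though the selfie camera rows use the same titles.
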
Data mining script

Hi,

By any chance, would you share the data mining script you used to get the data?

Thanks

is there a tutorial?

Thank you for the work you have done! Is there a guide explaining how to get the updated data extraction script working?

Thank you very much, greetings from Italy.

file hash

I ran 2 scrapes, and each one gets stuck at about 10630 items, more or less the same set as in your scrape from 2021. How can I scrape the others too?

Non-Unique DB Keys in Scrapper

Non-unique database keys are being used in ScrapeCommand.php:

$ttl = $row->find('.ttl')[0]->plaintext ?? null;

As a result, some information has been overwritten (previously mentioned here: #5).

I hope that the following should work correctly:

$ttl = $row->find('.ttl')[0]->{'data-spec'} ?? null;

Please, check and re-scrape.
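
To make the overwrite concrete, here is a minimal sketch of why title-only keys lose data, together with one possible alternative: qualifying each key with its section name. This is a different technique from the data-spec suggestion above, and the row format and function names are assumptions for illustration only.

```php
<?php
// Rows as [section, title, value] triples (hypothetical format), using the
// camera values quoted in issue #5.
$rows = [
    ['Main Camera', 'Features', 'Leica optics, LED flash, panorama, HDR'],
    ['Selfie Camera', 'Features', 'HDR, panorama'],
];

// Keyed by title only: the second "Features" row overwrites the first.
function keyByTitle(array $rows): array
{
    $out = [];
    foreach ($rows as [$section, $title, $value]) {
        $out[$title] = $value;
    }
    return $out;
}

// Keyed by section and title: both rows are kept.
function keyBySectionAndTitle(array $rows): array
{
    $out = [];
    foreach ($rows as [$section, $title, $value]) {
        $out[$section . ' / ' . $title] = $value;
    }
    return $out;
}

$flat = keyByTitle($rows);
$qualified = keyBySectionAndTitle($rows);

printf("title only: %d entries, section + title: %d entries\n", count($flat), count($qualified));
```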

State of the repo

Is this repo still being updated? I'm assuming it's not, but just to make sure.

Terms of Service Violation?

I am also trying to scrape gsmarena.com, and I noticed that their Terms of Service ask users not to modify the information. I was wondering if using this data (or scraping the data) is allowed?

You may download, view, copy and print documents and graphics incorporated in these documents (the "Documents") from this website subject to the following: (1) the Documents may be used solely for personal, informational, non-commercial purposes; and (2) the Documents may not be modified or altered in any way. Except as expressly provided herein, you may not use, download, upload, copy, print, display, perform, reproduce, publish, license, post, transmit or distribute any information from this website in whole or in part without the prior written permission of GSMArena.com.

Recent devices are missing

Thanks for the rapid scraper update!

I have re-scraped the data, but all recent devices (more than 400 models to date!) are missing due to an outdated sitemap-phones.xml. Looks like they update it once a year or even less frequently.

Other minor issues:

  1. The download progress bar shows 32523 links, which is rather misleading 'cause the links containing -pictures- & related.php are skipped. And the links containing -3d-spin- (360° view) should be skipped, too:

if (!strpos($url->loc, 'related.php') && !strpos($url->loc, '-3d-spin-') && !strpos($url->loc, '-pictures-')) {

  2. Battery Type (# 2) is still missing. Battery Stand-by (# 2) and Battery Talk time (# 2) are mysteriously present, but useless without Battery Type (# 2). Fortunately, the information is unimportant.

E.g.: Nokia 5110

"Battery":{"Type":["Removable Li-Po 600 mAh battery"],"Stand-by":["40 - 180 h","60-270 h"],"Talk time":["2 h - 3 h 20 min","3-5 h"]}

  3. Nameless data (sub-row) is "named" after the previous one and gets into the wrong array.

E.g.: Nokia 5110

"Camera":{"Call records":["No"]}

"Display":{"Type":["Monochrome graphic"],"Size":[""],"Resolution":["5 lines","Dynamic font size\\r\\n Softkey\\r\\n Welcome message"]}

I named it "Additional Information" in my spreadsheet, but getting it all out from the wrong columns was rather time consuming.
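
For the first of the minor issues above, the link filter could be pulled into a small function so that the progress bar total and the download loop agree on what gets skipped. This is only a sketch: the function name, the constant, and all sample URLs except the HTC U11 Life page are made up for illustration.

```php
<?php
// Substrings that mark non-phone pages, taken from the condition quoted above.
const SKIPPED_MARKERS = ['related.php', '-3d-spin-', '-pictures-'];

// Hypothetical helper: true when the URL looks like a phone specification page.
function isPhonePage(string $url): bool
{
    foreach (SKIPPED_MARKERS as $marker) {
        // Strict comparison, so a match at position 0 still counts as a match.
        if (strpos($url, $marker) !== false) {
            return false;
        }
    }
    return true;
}

// Illustrative URLs; only the first one appears in these issues.
$urls = [
    'https://www.gsmarena.com/htc_u11_life-8885.php',
    'https://www.gsmarena.com/htc_u11_life-pictures-8885.php',
    'https://www.gsmarena.com/htc_u11_life-3d-spin-8885.php',
    'https://www.gsmarena.com/related.php3?idPhone=8885',
];

$phonePages = array_values(array_filter($urls, 'isPhonePage'));
printf("%d of %d links will be downloaded\n", count($phonePages), count($urls));
```

Counting $phonePages instead of the raw sitemap entries would make the progress bar reflect the links that are actually fetched.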

Network information is placed incorrectly

Fixing previous issue #7

  3. Nameless data (sub-row) is "named" after the previous one and gets into the wrong array.

has broken Network information scraping.

Now the additional "xG" info from a nameless sub-row is being placed in Network: Additional_#, though it should be added as Network: 2G bands_#, Network: 3G bands_# and so on. Even better would be to concatenate it with the info from the previous sub-row using some delimiter, like | or a new line \r\n.
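
The concatenation idea could look roughly like the sketch below. It assumes rows arrive as [title, value] pairs in page order and that a nameless sub-row has an empty title or one holding only a non-breaking space; the function name and the sample band values are illustrative, not taken from the scraper.

```php
<?php
// Hypothetical helper: appends the value of a nameless sub-row to the
// previous named row, separated by $delimiter.
function mergeNamelessRows(array $rows, string $delimiter = ' | '): array
{
    $merged = [];
    $lastTitle = null;
    foreach ($rows as [$title, $value]) {
        // Treat a non-breaking space as blank before trimming.
        $title = trim(str_replace("\u{00A0}", ' ', $title));
        if ($title === '' && $lastTitle !== null) {
            $merged[$lastTitle] .= $delimiter . $value;
            continue;
        }
        if ($title === '') {
            $title = 'Additional'; // nameless first row: nothing to attach to
        }
        $merged[$title] = $value;
        $lastTitle = $title;
    }
    return $merged;
}

// Illustrative band values, not copied from a real page.
$network = mergeNamelessRows([
    ['2G bands', 'GSM 850 / 900 / 1800 / 1900'],
    ["\u{00A0}", 'CDMA 800 / 1900'],
    ['3G bands', 'HSDPA 850 / 900 / 2100'],
]);

print_r($network);
```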

autoload.php seems to be missing

When trying to run the re-scrape command, I get

PHP Warning: require(/<install_directory>/mobilephone-brands-and-models/scrapper/vendor/autoload.php): Failed to open stream: No such file or directory in /<install_directory>mobilephone-brands-and-models/scrapper/artisan on line 18

The autoload.php file in the vendor directory seems to be missing from the code base.

problem

The procedure is blocked at 4800 records. How can I read the errors, or the reason for the block? Can I restart it from the breakpoint?
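
One generic way to make a long scrape restartable is to record each finished URL in a checkpoint file and to log failures instead of stopping. The sketch below shows that pattern in isolation; it is not how the repository's command works, and the function name, the file handling and the $fetch callable are all assumptions.

```php
<?php
// Hypothetical resumable loop. $fetch is any callable that scrapes one URL
// and throws on failure. Finished URLs are appended to $checkpointFile, and
// failures are written to $errorLog with their reason.
function scrapeResumable(array $urls, callable $fetch, string $checkpointFile, string $errorLog): int
{
    $done = is_file($checkpointFile)
        ? array_flip(file($checkpointFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES))
        : [];
    $processed = 0;
    foreach ($urls as $url) {
        if (isset($done[$url])) {
            continue; // already scraped in an earlier run
        }
        try {
            $fetch($url);
            file_put_contents($checkpointFile, $url . PHP_EOL, FILE_APPEND);
            $processed++;
        } catch (Throwable $e) {
            // Keep going, but leave a readable trace of what went wrong.
            file_put_contents($errorLog, $url . "\t" . $e->getMessage() . PHP_EOL, FILE_APPEND);
        }
    }
    return $processed;
}

// Demo with placeholder URLs and a simulated failure on the second one.
$checkpoint = tempnam(sys_get_temp_dir(), 'done');
$errors = tempnam(sys_get_temp_dir(), 'err');
$urls = ['page-a', 'page-b', 'page-c'];

$firstRun = scrapeResumable($urls, function (string $url): void {
    if ($url === 'page-b') {
        throw new RuntimeException('simulated failure');
    }
}, $checkpoint, $errors);

// The second run only retries the URL that failed before.
$secondRun = scrapeResumable($urls, function (string $url): void {
}, $checkpoint, $errors);

printf("first run: %d, second run: %d\n", $firstRun, $secondRun);
```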
