ilyasozkurt / mobilephone-brands-and-models

A database of mobile phone manufacturers and their models.

License: GNU General Public License v3.0

PHP 0.84% Blade 0.09% PLpgSQL 99.06%

brands gsmarena mobilephone mysql

mobilephone-brands-and-models's Issues

Some URLs are not scraped

URLs with the following syntax are not scraped:

.com/_ - don't know why, e.g. Umidigi F1. Maybe the scraper omits the part after / and downloads the main page.
[g] - throws the following error: GuzzleHttp\Exception\TooManyRedirectsException. It's just one model: vivo Y20s [G]. I tried to change the URL to %5Bg%5D but to no avail.
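For reference, the %5Bg%5D form mentioned above is what PHP's rawurlencode() produces for [g]. The sketch below percent-encodes each path segment of a URL; the function name encodeUrlPath and the example.com URL are hypothetical, and the query string is deliberately ignored. Since the reporter already tried the encoded form, this only illustrates the encoding step and is not a confirmed fix for the redirect loop.

```php
<?php
// Percent-encode every path segment of a URL, leaving the "/" separators,
// scheme and host untouched. Query strings and fragments are dropped here
// to keep the sketch short. The URL used below is hypothetical.
function encodeUrlPath(string $url): string
{
    $parts = parse_url($url);
    $path = $parts['path'] ?? '';

    // Decode first so that an already-encoded segment is not encoded twice.
    $segments = array_map(
        fn (string $segment): string => rawurlencode(rawurldecode($segment)),
        explode('/', $path)
    );

    $scheme = isset($parts['scheme']) ? $parts['scheme'] . '://' : '';
    $host = $parts['host'] ?? '';

    return $scheme . $host . implode('/', $segments);
}

echo encodeUrlPath('https://example.com/vivo_y20s_[g]-1.php'), PHP_EOL;
```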

P.S.: Glad you are back and with a cute website.

Is this legal?

Can we use this database for any purpose legally?

Add page URL to devices DB

I think that it's relevant and useful information. Isn't it?

UPDATE, ITS ALMOST 2022

Hello,

Can you run the scraper again? I would love an updated dataset.

Thank you

Some pages contain multiple models, e.g.
HTC U11 Life For Global market (GLOBAL) and HTC U11 Life For North America (NA): www.gsmarena.com/htc_u11_life-8885.php
The "ALL VERSIONS" model is rather misleading 'cause it actually shows the information about the 1st version; in the above-mentioned example it's the GLOBAL version. The information about the NA version was not scraped at all.

Looks like the additional info is handled by some JavaScript, so it's completely new scraping logic to add.

Missing Information

Some information is missing (JSON as example):

E.g.: Huawei Mate 40 Pro+

Display Type ("OLED, HDR10, 90Hz"). Due to conflict with Battery Type.
Memory unnamed ("SFS 1.0"). Due to conflict with Sound unnamed.
Main Camera Features ("Leica optics, LED flash, panorama, HDR"). Due to conflict with Selfie Camera Features.
Main Camera Video ("4K@30/60fps, 1080p@30/60/120/240/480fps, 720p@960fps, 720p@3840fps, HDR, gyro-EIS"). Due to conflict with Selfie Camera Video.

Also, it is wrong to use [Main Camera] "Penta" & [Selfie Camera] "Dual" as JSON keys. I think it's necessary to add two additional nested JSONs "main_camera" & "selfie_camera" with the following structure:

"main_camera": {
  "Main_Camera_Type": "Penta",
  "Main_Camera_Specifications": "50 MP, f/1.9, 23mm (wide), 1/1.28\", 1.22µm, omnidirectional PDAF, Laser AF, OIS 12 MP, f/2.4, (telephoto), PDAF, OIS, 3x optical zoom 8 MP, f/4.4, 240mm (periscope telephoto), PDAF, OIS, 10x optical zoom 20 MP, f/2.4, 14mm (ultrawide), PDAF TOF 3D, (depth)",
  "Main_Camera_Features": "Leica optics, LED flash, panorama, HDR",
  "Main_Camera_Video": "4K@30/60fps, 1080p@30/60/120/240/480fps, 720p@960fps, 720p@3840fps, HDR, gyro-EIS"
},
"selfie_camera": {
  "Selfie_Camera_Type": "Dual",
  "Selfie_Camera_Specifications": "13 MP, f/2.4, 18mm (ultrawide) TOF 3D, (depth/biometrics sensor)",
  "Selfie_Camera_Features": "HDR, panorama",
  "Selfie_Camera_Video": "4K@30/60fps, 1080p@30/60/240fps"
},
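A minimal sketch of how such a nested structure could be built, assuming the scraper has each spec available as a (section, label, value) triple. The function name nestBySection and the input shape are hypothetical and not taken from the repository's code. Prefixing every key with its section name means that "Features" or "Video" under Main Camera can no longer collide with the same label under Selfie Camera.

```php
<?php
// Group flat (section, label, value) rows under a per-section key so that
// identical labels in different sections cannot overwrite each other.
// The input shape is an assumption made for this sketch.
function nestBySection(array $rows): array
{
    $nested = [];
    foreach ($rows as [$section, $label, $value]) {
        // "Main Camera" -> "main_camera"
        $sectionKey = strtolower(str_replace(' ', '_', $section));
        // "Main Camera" + "Type" -> "Main_Camera_Type"
        $fieldKey = str_replace(' ', '_', $section . ' ' . $label);
        $nested[$sectionKey][$fieldKey] = $value;
    }
    return $nested;
}

$rows = [
    ['Main Camera', 'Type', 'Penta'],
    ['Main Camera', 'Features', 'Leica optics, LED flash, panorama, HDR'],
    ['Selfie Camera', 'Type', 'Dual'],
    ['Selfie Camera', 'Features', 'HDR, panorama'],
];

echo json_encode(nestBySection($rows), JSON_PRETTY_PRINT), PHP_EOL;
```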

Data mining script

Hi,

By any chance, would you share the data mining script you used to get the data?

Thanks

is there a tutorial?

Thank you for the work you have done! Is there a guide explaining how to get the updated data extraction script working?

thank you very much, greetings from Italy

file hash

I ran the scraper twice. It always gets stuck at about 10630 items, more or less the same number as in your scrape from 2021. How can I scrape the others too?

Non-Unique DB Keys in Scrapper

Non-unique database keys are being used in ScrapeCommand.php:

$ttl = $row->find('.ttl')[0]->plaintext ?? null;

As a result, some information has been overwritten (previously mentioned here: #5).

I hope that the following should work correctly:

$ttl = $row->find('.ttl')[0]->{'data-spec'} ?? null;

Please, check and re-scrape.
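The collision can be reproduced in isolation. The sketch below uses PHP's built-in DOM extension instead of the repository's own parser, and the markup is a hypothetical, simplified table. Which cell actually carries the data-spec attribute, and the attribute values shown, are assumptions to be checked against the real pages. Two rows share the visible label "Type", so keying by label keeps only one of them, while keying by the attribute keeps both.

```php
<?php
// Key each spec by its data-spec attribute instead of the visible label.
// Assumes the DOM extension is enabled (it is in default PHP builds).
function specsByAttribute(string $html): array
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $specs = [];
    foreach ($xpath->query('//td[@data-spec]') as $cell) {
        $specs[$cell->getAttribute('data-spec')] = trim($cell->textContent);
    }
    return $specs;
}

// Hypothetical markup: both rows are labelled "Type".
$html = <<<HTML
<table>
<tr><td class="ttl">Type</td><td class="nfo" data-spec="displaytype">OLED, HDR10, 90Hz</td></tr>
<tr><td class="ttl">Type</td><td class="nfo" data-spec="batdescription1">Li-Po battery</td></tr>
</table>
HTML;

print_r(specsByAttribute($html));
```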

State of the repo

Is this repo still being updated? I'm assuming it's not, but just to make sure.

Terms of Service Violation?

I am also trying to scrape gsmarena.com, and I noticed that their Terms of Service ask users not to modify the information. I was wondering if using this data (or scraping the data) is allowed?

You may download, view, copy and print documents and graphics incorporated in these documents (the "Documents") from this website subject to the following: (1) the Documents may be used solely for personal, informational, non-commercial purposes; and (2) the Documents may not be modified or altered in any way. Except as expressly provided herein, you may not use, download, upload, copy, print, display, perform, reproduce, publish, license, post, transmit or distribute any information from this website in whole or in part without the prior written permission of GSMArena.com.

Recent devices are missing

Thanks for the rapid scraper update!

I have re-scraped the data but all recent devices (more than 400 models! to this date) are missing due to outdated sitemap-phones.xml. Looks like they update it once a year or even less frequently.

Other minor issues:

The download progress bar shows 32523 links, which is rather misleading 'cause the links containing -pictures- & related.php are skipped. And the links containing -3d-spin- (360° view) should be skipped, too:

if (strpos($url->loc, 'related.php') === false && strpos($url->loc, '-3d-spin-') === false && strpos($url->loc, '-pictures-') === false) {
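One way to make the progress bar total match what is actually scraped is to filter the link list before counting it. This is a sketch only: the helper name isDevicePage and the example.com URLs are hypothetical, and the three fragments are the ones listed above. It compares strpos() against false explicitly, because strpos() returns 0 (which is falsy) for a match at the very start of the string.

```php
<?php
// URL fragments that mark pages without spec data (taken from the issue text).
const SKIPPED_FRAGMENTS = ['related.php', '-3d-spin-', '-pictures-'];

// True when the URL contains none of the skipped fragments.
function isDevicePage(string $url): bool
{
    foreach (SKIPPED_FRAGMENTS as $fragment) {
        // strpos() returns int|false, so compare against false explicitly.
        if (strpos($url, $fragment) !== false) {
            return false;
        }
    }
    return true;
}

// Hypothetical sitemap entries.
$urls = [
    'https://example.com/nokia_5110-1.php',
    'https://example.com/nokia_5110-pictures-1.php',
    'https://example.com/nokia_5110-3d-spin-1.php',
    'https://example.com/related.php3?idPhone=1',
];

// Filter first, then use the filtered count as the progress bar total.
$deviceUrls = array_values(array_filter($urls, 'isDevicePage'));
echo count($deviceUrls), ' of ', count($urls), ' links will be scraped', PHP_EOL;
```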

Battery Type (# 2) is still missing. Battery Stand-by (# 2) and Battery Talk time (# 2) are mysteriously present, but useless without Battery Type (# 2). Fortunately, the information is unimportant.

E.g.: Nokia 5110

"Battery":{"Type":["Removable Li-Po 600 mAh battery"],"Stand-by":["40 - 180 h","60-270 h"],"Talk time":["2 h - 3 h 20 min","3-5 h"]}

Nameless data (sub-row) is "named" after the previous one and gets into the wrong array.

E.g.: Nokia 5110

"Camera":{"Call records":["No"]}

"Display":{"Type":["Monochrome graphic"],"Size":[""],"Resolution":["5 lines","Dynamic font size\\r\\n Softkey\\r\\n Welcome message"]}

I named it "Additional Information" in my spreadsheet, but getting it all out from the wrong columns was rather time consuming.

Network information is placed incorrectly

Fixing previous issue #7

Nameless data (sub-row) is "named" after the previous one and gets into the wrong array.

has broken Network information scraping.

Now the additional "xG" info from a nameless sub-row is placed in Network: Additional_#, though it should be added as Network: 2G bands_#, Network: 3G bands_# and so on. Or, even better, concatenate it to the info from the previous sub-row with some delimiter, like | or a new line \r\n.
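The concatenation option could look roughly like the sketch below. It assumes each section is available as a list of (label, value) pairs in which a nameless sub-row has an empty or non-breaking-space label. The function name mergeSubRows, the "Additional" fallback key and the band values are hypothetical. It does not handle two named rows with the same label inside one section.

```php
<?php
// Append the value of every nameless sub-row to the preceding named row.
function mergeSubRows(array $rows, string $delimiter = ' | '): array
{
    $merged = [];
    $current = null; // label of the most recent named row

    foreach ($rows as [$label, $value]) {
        // Treat labels made only of whitespace or non-breaking spaces as nameless.
        $label = trim(str_replace("\u{00A0}", ' ', $label));

        if ($label !== '') {
            $current = $label;
            $merged[$current] = $value;
        } elseif ($current !== null) {
            $merged[$current] .= $delimiter . $value;
        } else {
            // Nameless row with no named row before it: use a fallback key.
            $current = 'Additional';
            $merged[$current] = $value;
        }
    }
    return $merged;
}

// Hypothetical Network section with one nameless sub-row.
$network = [
    ['2G bands', 'GSM 850 / 900'],
    ["\u{00A0}", 'CDMA 800'],
    ['3G bands', 'HSDPA 2100'],
];

print_r(mergeSubRows($network));
```

Passing "\r\n" as the second argument gives the new-line variant instead of the | delimiter.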

autoload.php seems to be missing

When trying to run the re-scrape command, I get

PHP Warning: require(/<install_directory>/mobilephone-brands-and-models/scrapper/vendor/autoload.php): Failed to open stream: No such file or directory in /<install_directory>mobilephone-brands-and-models/scrapper/artisan on line 18

The vendor directory's autoload.php file seems to be missing from the code base.

problem

The procedure is blocked at 4800 records. How can I read the errors, or the reason for the block? Can I restart it from the breakpoint?
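Nothing in the issues above indicates that the scraper supports resuming, so the following is only a sketch of one common approach: record every successfully scraped URL in a checkpoint file, skip those URLs on the next run, and collect error messages instead of aborting. The function name scrapeResumable, the checkpoint file and the callback are all hypothetical.

```php
<?php
// Scrape each URL once across runs. Successful URLs are appended to a
// checkpoint file; failures are returned as [url => error message].
function scrapeResumable(array $urls, string $checkpointFile, callable $scrapeOne): array
{
    $done = is_file($checkpointFile)
        ? array_flip(file($checkpointFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES))
        : [];
    $failed = [];

    foreach ($urls as $url) {
        if (isset($done[$url])) {
            continue; // already scraped in an earlier run
        }
        try {
            $scrapeOne($url);
            file_put_contents($checkpointFile, $url . "\n", FILE_APPEND | LOCK_EX);
        } catch (Throwable $e) {
            $failed[$url] = $e->getMessage(); // keep going, report later
        }
    }
    return $failed;
}

// Usage with a stand-in callback that fails for one hypothetical URL.
$checkpoint = tempnam(sys_get_temp_dir(), 'scrape_');
$urls = ['https://example.com/a.php', 'https://example.com/b.php'];
$failed = scrapeResumable($urls, $checkpoint, function (string $url): void {
    if (substr($url, -5) === 'b.php') {
        throw new RuntimeException('simulated failure');
    }
});
print_r($failed);
```

Running the same call again would skip a.php and retry only b.php, and the returned array shows the reason for each failure.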