ilyasozkurt / mobilephone-brands-and-models
A database includes mobilephone manufacturers and their models.
License: GNU General Public License v3.0
URLs with the following syntax are not scraped:
- .com/_ : I don't know why, e.g. Umidigi F1. Maybe the scraper omits the part after / and downloads the main page.
- [g] : throws the following error: GuzzleHttp\Exception\TooManyRedirectsException. It's just one model: vivo Y20s [G]. I tried to change the URL to %5Bg%5D, but to no avail.
P.S.: Glad you are back and with a cute website.
Can we use this database for any purpose legally?
I think that it's relevant and useful information. Isn't it?
Hello,
Can you run the scraper again? Would love an updated dataset.
Thank you
Some pages contain multiple models, e.g.
HTC U11 Life For Global market (GLOBAL) and HTC U11 Life For North America (NA): www.gsmarena.com/htc_u11_life-8885.php
The "ALL VERSIONS" model is rather misleading because it actually shows the information about the first version; in the above-mentioned example, that is the GLOBAL version. The information about the NA version was not scraped at all.
Looks like the additional info is handled by some JavaScript, so it's completely new scraping logic to add.
Some information is missing (JSON as example):
E.g.: Huawei Mate 40 Pro+
- Display Type ("OLED, HDR10, 90Hz"), due to a conflict with Battery Type.
- Memory unnamed ("SFS 1.0"), due to a conflict with Sound unnamed.
- Main Camera Features ("Leica optics, LED flash, panorama, HDR"), due to a conflict with Selfie Camera Features.
- Main Camera Video ("4K@30/60fps, 1080p@30/60/120/240/480fps, 720p@960fps, 720p@3840fps, HDR, gyro-EIS"), due to a conflict with Selfie Camera Video.
Also, it is wrong to use [Main Camera] "Penta" & [Selfie Camera] "Dual" as JSON keys. I think it's necessary to add two additional nested JSONs "main_camera" & "selfie_camera" with the following structure:
"main_camera": {
  "Main_Camera_Type": "Penta",
  "Main_Camera_Specifications": "50 MP, f/1.9, 23mm (wide), 1/1.28\", 1.22µm, omnidirectional PDAF, Laser AF, OIS 12 MP, f/2.4, (telephoto), PDAF, OIS, 3x optical zoom 8 MP, f/4.4, 240mm (periscope telephoto), PDAF, OIS, 10x optical zoom 20 MP, f/2.4, 14mm (ultrawide), PDAF TOF 3D, (depth)",
  "Main_Camera_Features": "Leica optics, LED flash, panorama, HDR",
  "Main_Camera_Video": "4K@30/60fps, 1080p@30/60/120/240/480fps, 720p@960fps, 720p@3840fps, HDR, gyro-EIS"
},
"selfie_camera": {
  "Selfie_Camera_Type": "Dual",
  "Selfie_Camera_Specifications": "13 MP, f/2.4, 18mm (ultrawide) TOF 3D, (depth/biometrics sensor)",
  "Selfie_Camera_Features": "HDR, panorama",
  "Selfie_Camera_Video": "4K@30/60fps, 1080p@30/60/240fps"
},
Hi,
By any chance, would you share the data mining script you used to get the data?
Thanks
Thank you for the work you have done! Is there a guide explaining how to get the updated data extraction script working?
Thank you very much, greetings from Italy.
I ran two scrapes, and each one gets stuck at about 10630 items, more or less the same number as in your scrape from 2021. How can I scrape the others too?
Non-unique database keys are being used in ScrapeCommand.php:
$ttl = $row->find('.ttl')[0]->plaintext ?? null;
As a result, some information has been overwritten (previously mentioned here: #5).
I hope that the following should work correctly:
$ttl = $row->find('.ttl')[0]->{'data-spec'} ?? null;
Please check and re-scrape.
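To illustrate why non-unique keys lose data, here is a small self-contained sketch. It does not use the scraper's code or the data-spec attribute suggested above; instead it shows a different way to get unique keys, namely prefixing each title with its section name. The row shape is hypothetical and the battery value is a made-up placeholder.

```php
<?php
// Hypothetical rows: [section, visible title, value]. "Type" appears in two sections.
$rows = [
    ['Display', 'Type', 'OLED, HDR10, 90Hz'],
    ['Battery', 'Type', 'Li-Po, non-removable'], // placeholder value
];

// Keyed by the visible title alone: the second "Type" silently overwrites the first.
$flat = [];
foreach ($rows as [$section, $title, $value]) {
    $flat[$title] = $value;
}

// Keyed by section plus title: both values survive.
$qualified = [];
foreach ($rows as [$section, $title, $value]) {
    $qualified[$section . ' ' . $title] = $value;
}

print_r($flat);
print_r($qualified);
```

Any key that is unique per row would avoid the overwrite; the section prefix is just the simplest one to show without assuming anything about the page markup.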
Is this repo still being updated? I'm assuming it's not, but just to make sure.
I was also trying to scrape gsmarena.com when I noticed that their Terms of Service ask not to modify the information. I was wondering whether using this data (or scraping the data) is allowed.
You may download, view, copy and print documents and graphics incorporated in these documents (the "Documents") from this website subject to the following: (1) the Documents may be used solely for personal, informational, non-commercial purposes; and (2) the Documents may not be modified or altered in any way. Except as expressly provided herein, you may not use, download, upload, copy, print, display, perform, reproduce, publish, license, post, transmit or distribute any information from this website in whole or in part without the prior written permission of GSMArena.com.
Thanks for the rapid scraper update!
I have re-scraped the data, but all recent devices (more than 400 models to date!) are missing due to an outdated sitemap-phones.xml. It looks like they update it once a year or even less frequently.
Other minor issues:
- Links containing -pictures- & related.php are skipped. And the links containing -3d-spin- (360° view) should be skipped, too:
if (!strpos($url->loc, 'related.php') && !strpos($url->loc, '-3d-spin-') && !strpos($url->loc, '-pictures-')) {
- Battery Type (# 2) is still missing. Battery Stand-by (# 2) and Battery Talk time (# 2) are mysteriously present, but useless without Battery Type (# 2). Fortunately, the information is unimportant.
E.g.: Nokia 5110
"Battery":{"Type":["Removable Li-Po 600 mAh battery"],"Stand-by":["40 - 180 h","60-270 h"],"Talk time":["2 h - 3 h 20 min","3-5 h"]}
- Nameless data (sub-row) is "named" after the previous one and gets into the wrong array.
E.g.: Nokia 5110
"Camera":{"Call records":["No"]}
"Display":{"Type":["Monochrome graphic"],"Size":[""],"Resolution":["5 lines","Dynamic font size\\r\\n Softkey\\r\\n Welcome message"]}
I named it "Additional Information" in my spreadsheet, but getting it all out from the wrong columns was rather time consuming.
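The link filter from the first item above can also be written as a small helper. This is a sketch rather than the scraper's actual code: `shouldSkip()` and `SKIP_MARKERS` are hypothetical names, and only the first URL comes from this page; the other three are invented for illustration. It compares the result of `strpos()` with `false` explicitly, since `strpos()` returns `0` for a match at the start of the string and `!0` would read as "not found".

```php
<?php
// Substrings that mark non-spec pages, taken from the issue text.
const SKIP_MARKERS = ['related.php', '-3d-spin-', '-pictures-'];

// Returns true when a sitemap URL should be skipped.
function shouldSkip(string $loc): bool
{
    foreach (SKIP_MARKERS as $marker) {
        // strpos() returns int|false; a strict comparison keeps offset 0 a match.
        if (strpos($loc, $marker) !== false) {
            return true;
        }
    }
    return false;
}

// The last three URLs are invented for illustration only.
$urls = [
    'https://www.gsmarena.com/htc_u11_life-8885.php',
    'https://www.gsmarena.com/htc_u11_life-pictures-8885.php',
    'https://www.gsmarena.com/htc_u11_life-3d-spin-8885.php',
    'https://www.gsmarena.com/related.php3?idPhone=8885',
];

// Keep only the URLs that do not contain any skip marker.
$kept = array_values(array_filter($urls, fn (string $u): bool => !shouldSkip($u)));
print_r($kept);
```

Keeping the markers in one list means a new kind of page can be excluded by adding a single entry.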
Fixing the previous issue #7 ("Nameless data (sub-row) is "named" after the previous one and gets into the wrong array.") has broken Network information scraping.
Now the additional "xG" info from a nameless sub-row is being placed in Network: Additional_#, though it should be added as Network: 2G bands_#, Network: 3G bands_#, and so on. Or, even better, concatenate it to the info from the previous sub-row with some delimiter, like | or a new line \r\n.
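The concatenation idea could look roughly like this. It is a sketch under assumptions: the `[title, value]` row shape, the `mergeNamelessRows()` helper, and the sample band strings are hypothetical, and nameless titles are assumed to have been normalized to an empty string beforehand.

```php
<?php
// Hypothetical sub-rows of one section as [title, value]; an empty title marks
// a nameless continuation row.
$networkRows = [
    ['2G bands', 'GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2'],
    ['', 'CDMA 800 / 1900'],
    ['3G bands', 'HSDPA 850 / 900 / 1900 / 2100'],
];

// Appends each nameless row to the closest named row above it.
function mergeNamelessRows(array $rows, string $delimiter = ' | '): array
{
    $merged = [];
    $lastTitle = null;
    foreach ($rows as [$title, $value]) {
        $title = trim($title);
        if ($title === '' && $lastTitle !== null) {
            // Continuation: concatenate onto the previous named sub-row.
            $merged[$lastTitle] .= $delimiter . $value;
            continue;
        }
        if ($title === '') {
            // Nameless row with nothing before it: keep it under a fallback key.
            $title = 'Additional';
        }
        $merged[$title] = $value;
        $lastTitle = $title;
    }
    return $merged;
}

$network = mergeNamelessRows($networkRows);
print_r($network);
```

Passing "\r\n" as the second argument would give the new-line variant mentioned above.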
When trying to run the re-scrape command, I get
PHP Warning: require(/<install_directory>/mobilephone-brands-and-models/scrapper/vendor/autoload.php): Failed to open stream: No such file or directory in /<install_directory>mobilephone-brands-and-models/scrapper/artisan on line 18
The vendor directory's autoload.php file seems to be missing from the code base.
The procedure is blocked at 4800 records. How can I read the errors, or find the reason for the block? Can I restart it from the breakpoint?