java-webscrapper's Introduction

COMP3111: Software Engineering Project - Webscrapper

Group Name: #35-SHIBE

Member            Task     Task
albertparedandan  Basic 4  Basic 6
hanifdean         Basic 1  Basic 5
nwihardjo         Basic 2  Basic 3

Assumptions

  1. amazon.com is used as the additional reselling portal.
  2. Prices are shown in USD. A price of 0 is used if no price information is available.
  3. The posted date timezone is HKT (Hong Kong Time) (TODO: check what is shown in the posted date for null).
  4. Pagination of the amazon portal is not handled.
  5. Keywords that return a whole sub-section on amazon (e.g. book) are not handled, since they are not specific enough and the result page is not solely a list of available items. The search returns nothing in this case.
  6. Items listed without a title / name are not scraped, as they are not valid items.
  7. The main price of an amazon item is used, not the 'more buying options' or 'offer price' (usually a cheaper price for the same item listed by a different seller). The average is used when the main price is a range between two prices (usually due to different sizes, colours, etc). The cheapest 'more buying options' or 'offer' price is used when no main price is available, as a rough estimate of the item's price.
  8. The posted date from the amazon portal is the date on which the item was first posted.
  9. Service listings on the amazon portal (which are not items) are handled as well.
  10. If results are found but all prices are 0, the average selling price and lowest selling price are displayed as 0.0 rather than "-". "-" is displayed only if no results are found.
  11. Functions without access modifiers are purposely package-private for unit testing purposes.
  12. As craigslist is scraped concurrently, the console output will only be "[int] page(s) of craigslist are being scraped in parallel ..." rather than how many pages have been scraped so far, since multiple pages are scraped at the same time.
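
The price rules in assumptions 2, 7 and 10 can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the class name PriceRules, its method names, and the assumption that prices arrive as text such as "$10.00 - $20.00" are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; names and input format are assumptions, not project code. */
class PriceRules {
    // Matches dollar amounts such as "$25" or "$1,299.00".
    private static final Pattern AMOUNT =
            Pattern.compile("\\$\\s*([0-9][0-9,]*(?:\\.[0-9]+)?)");

    /** Extracts every dollar amount found in the text (empty list for null or no match). */
    static List<Double> amounts(String text) {
        List<Double> found = new ArrayList<>();
        if (text == null) {
            return found;
        }
        Matcher m = AMOUNT.matcher(text);
        while (m.find()) {
            found.add(Double.parseDouble(m.group(1).replace(",", "")));
        }
        return found;
    }

    /**
     * Assumptions 2 and 7: use the main price (averaged when it is a range),
     * else the cheapest offer price, else 0.
     */
    static double resolvePrice(String mainPriceText, String offerPriceText) {
        List<Double> main = amounts(mainPriceText);
        if (main.size() >= 2) {
            return (main.get(0) + main.get(1)) / 2.0;
        }
        if (main.size() == 1) {
            return main.get(0);
        }
        return amounts(offerPriceText).stream()
                .mapToDouble(Double::doubleValue)
                .min()
                .orElse(0.0);
    }

    /** Assumption 10: "-" only when there are no results; otherwise the average, even if 0.0. */
    static String averageLabel(List<Double> prices) {
        if (prices.isEmpty()) {
            return "-";
        }
        double sum = 0.0;
        for (double p : prices) {
            sum += p;
        }
        return String.valueOf(sum / prices.size());
    }

    public static void main(String[] args) {
        System.out.println(resolvePrice("$10.00 - $20.00", null));  // 15.0
        System.out.println(resolvePrice(null, "$30.00 $12.50"));    // 12.5
        System.out.println(averageLabel(new ArrayList<Double>()));  // -
    }
}
```

Keeping "no results" ("-") distinct from "results with unknown prices" (0.0) lets the summary show whether the search itself came back empty.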

TL;DR

WebScraper scrapes both the amazon and New York craigslist websites based on the keyword specified. It uses multi-threading to handle craigslist pagination and the retrieval of amazon items' posted dates concurrently, which significantly improves performance.
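
As a rough illustration of scraping several result pages in parallel, the sketch below uses a fixed thread pool from java.util.concurrent. It is not the project's implementation: the class ParallelPages, the scrapePage callback and the pool size cap of 8 are assumptions made for the example, and the callback stands in for whatever fetches and parses one page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

/** Hypothetical sketch of parallel pagination; not taken from the project. */
class ParallelPages {
    /**
     * Scrapes pages 0..pageCount-1 concurrently and returns all items in page order.
     * scrapePage is a stand-in for the real per-page fetch and parse step.
     */
    static List<String> scrapeAll(int pageCount, IntFunction<List<String>> scrapePage) {
        List<String> items = new ArrayList<>();
        if (pageCount <= 0) {
            return items;
        }
        System.out.println(pageCount + " page(s) of craigslist are being scraped in parallel ...");
        ExecutorService pool = Executors.newFixedThreadPool(Math.min(pageCount, 8));
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (int page = 0; page < pageCount; page++) {
                final int p = page;
                tasks.add(() -> scrapePage.apply(p));
            }
            // invokeAll waits for every task and returns futures in submission order.
            for (Future<List<String>> done : pool.invokeAll(tasks)) {
                items.addAll(done.get());
            }
            return items;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scraping", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a page failed to scrape", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Fake scraper: each page yields two placeholder items.
        System.out.println(scrapeAll(3, p -> Arrays.asList("page" + p + "-a", "page" + p + "-b")));
    }
}
```

Because the pages finish in no particular order, a single "being scraped in parallel" message is simpler to print than a running count, which matches assumption 12.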


Dependencies

  1. Java 8 JDK with Gradle
  2. JavaFX for GUI framework
  3. JUnit 4.12 for testing suite
  4. Jacoco for test coverage measurement

Running the programme

We configure the project with Gradle. Gradle can be thought of as a Makefile-like tool that streamlines compilation for you.

Compile with Windows Command Prompt

  • Go to your project root folder
  • Type gradlew run. This will build and run the project.

If you want to just rerun the project without rebuilding it,

  • Go to the build\jar\ folder under the project root
  • Double-click the jar file (e.g. webscraper-0.1.0.jar). Yes, you need a GUI screen to run it.

Compile with Mac/Linux terminal

  • Go to your project root folder
  • ./gradlew build. This will build the project.
  • ./gradlew run. This will run the application.

If you want to just rerun the project without rebuilding it,

  • Go to the folder build/jar/
  • Double-click the jar file (e.g. webscraper-0.1.0.jar), or simply run ./gradlew run

Unit test and jacoco coverage report

  • Go to the project root directory

  • Run ./gradlew test jacocoTestReport. This will run all unit tests and generate the test and coverage reports

  • The Jacoco coverage report can be accessed at ./build/jacocoHTML/index.html

  • The unit test report is at ./build/reports/tests/test/index.html

Some of the unit tests use cached pages from both portals. The tests use reflection to unit test private functions (not good practice, I know).
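
For readers unfamiliar with the technique, calling a private method through reflection looks roughly like the sketch below. The classes KeywordCleaner and ReflectionTestSketch and the method normalise are hypothetical stand-ins, not code from this project. In the real test suite the call would sit inside a JUnit @Test method.

```java
import java.lang.reflect.Method;
import java.util.Locale;

/** Hypothetical production class with a private helper. */
class KeywordCleaner {
    private String normalise(String keyword) {
        return keyword.trim().toLowerCase(Locale.ROOT);
    }
}

/** Hypothetical test helper showing how a private method can be invoked. */
class ReflectionTestSketch {
    static String callNormalise(KeywordCleaner target, String keyword) {
        try {
            // Look the method up by name and parameter types.
            Method m = KeywordCleaner.class.getDeclaredMethod("normalise", String.class);
            // Bypass the private modifier for the duration of the test.
            m.setAccessible(true);
            return (String) m.invoke(target, keyword);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError("reflection call failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callNormalise(new KeywordCleaner(), "  MacBook Pro "));  // macbook pro
    }
}
```

The method is looked up by a string name, so renaming it breaks the test only at run time. That is one reason the approach is usually considered poor practice.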

Documentation / javadoc

See here for the latest javadoc. Or, if you prefer to compile it yourself,

  • In the project root directory, run ./gradlew javadoc to generate the javadoc
  • Documentation is available at ./build/docs/javadoc/index.html.

java-webscrapper's People

Contributors

albertparedandan, hanifdean, khwang0, nwihardjo

java-webscrapper's Issues

Unit test cases on different machine

After the pull request is accepted, please try to build the programme and let me know should any exceptions / errors occur. I wanted to check whether the code for generating the system path works on different machines.

searching some keywords give no result

Searching some keywords, such as macbook 2018 or macbook pro 2015 15, does not return anything. The bug sometimes appears and sometimes doesn't. I have to restart the program.

Null pointer exception thrown by controller's methods

Would you mind taking a look at the methods that throw an exception when no results are retrieved? In addition, the GUI freezes when the hyperlinks to the lowest price and the most recent date are pressed. It'd be awesome if you could take a look into that as well.

Exceptions raised when sorting Table tab

Got this error when sorting the table tab, with or without data.

In addition, should "summary tab selected" be printed out in the terminal when the Table tab is selected after the Summary tab?

Exceptions raised when searching after refining previous search

Searching for a new item after refining a previous search raises the exception below. It happens for any item searched.


> Task :run
Nov 27, 2018 4:32:11 PM javafx.fxml.FXMLLoader$ValueElement processValue
WARNING: Loading FXML document with JavaFX API of version 8.0.171 by JavaFX runtime of version 8.0.151
actionSearch: shock
   DEBUG: scraping amazon...
   DEBUG: scraping craigslist...
DEBUG: scraping finished
refineSearch: 2017
actionSearch: iphone
   DEBUG: scraping amazon...
   DEBUG: scraping craigslist...
DEBUG: scraping finished
Exception in thread "Thread-5" java.lang.IllegalStateException: Not on FX application thread; currentThread = Thread-5
	at com.sun.javafx.tk.Toolkit.checkFxUserThread(Toolkit.java:279)
	at com.sun.javafx.tk.quantum.QuantumToolkit.checkFxUserThread(QuantumToolkit.java:423)
	at javafx.scene.Parent$2.onProposedChange(Parent.java:367)
	at com.sun.javafx.collections.VetoableListDecorator.clear(VetoableListDecorator.java:294)
	at com.sun.javafx.scene.control.skin.TableColumnHeader.updateSortGrid(TableColumnHeader.java:544)
	at com.sun.javafx.scene.control.skin.TableColumnHeader.updateSortPosition(TableColumnHeader.java:537)
	at com.sun.javafx.scene.control.skin.TableColumnHeader.lambda$new$1(TableColumnHeader.java:191)
	at javafx.collections.WeakListChangeListener.onChanged(WeakListChangeListener.java:88)
	at com.sun.javafx.collections.ListListenerHelper$Generic.fireValueChangedEvent(ListListenerHelper.java:329)
	at com.sun.javafx.collections.ListListenerHelper.fireValueChangedEvent(ListListenerHelper.java:73)
	at javafx.collections.ObservableListBase.fireChange(ObservableListBase.java:233)
	at javafx.collections.ListChangeBuilder.commit(ListChangeBuilder.java:482)
	at javafx.collections.ListChangeBuilder.endChange(ListChangeBuilder.java:541)
	at javafx.collections.ObservableListBase.endChange(ObservableListBase.java:205)
	at com.sun.javafx.collections.ObservableListWrapper.clear(ObservableListWrapper.java:157)
	at javafx.scene.control.TableView$6.invalidated(TableView.java:837)
	at javafx.beans.property.ObjectPropertyBase.markInvalid(ObjectPropertyBase.java:111)
	at javafx.beans.property.ObjectPropertyBase.set(ObjectPropertyBase.java:146)
	at javafx.scene.control.TableView.setItems(TableView.java:843)
	at comp3111.webscraper.Controller.refreshTableTab(Controller.java:215)
	at comp3111.webscraper.Controller.lambda$actionSearch$0(Controller.java:200)
	at java.lang.Thread.run(Thread.java:748)
[The same IllegalStateException and stack trace are printed three more times.]
refineSearch: gold
