
itinerum-trip-breaker's People

Contributors

bochuliu, felipevh, mwidener, nate-wessel

itinerum-trip-breaker's Issues

KDE grid is inefficient sample of sparse points

The KDE calculation takes an enormous amount of time at a sufficiently fine resolution: tens of thousands of grid points are evaluated that are nowhere near any point in the input dataset.

Changing from a grid to a set of provided eval.points will affect the peak-finding algorithm, but could save a huge amount of time.
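
A minimal sketch of the eval-points approach, using scipy's gaussian_kde as a stand-in for the actual KDE backend: estimate the density only at the observed points, so no effort is spent on empty grid cells. Peak-finding would then mean looking for local maxima among neighbouring points rather than grid cells.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_at_points(x, y, weights):
    """Evaluate a weighted 2D KDE only at the observed points."""
    xy = np.vstack([x, y])  # shape (2, N), as gaussian_kde expects
    kde = gaussian_kde(xy, weights=weights)
    # No dense grid: density is computed only where data actually is.
    return kde(xy)
```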

Limit the number of points sent to KDE

We are using a KDE to estimate locations where more than 10 minutes have been spent in one spot. To do this, we feed in all the interpolated points, even though no location can or should be found where there are only interpolated points. The only reason we include them all is that it's an easier way to spread out time weights, which shouldn't be too concentrated on the original (sparse) input points.

Better/smarter weight assignments to the original points should reduce the time spent in the KDE step.
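
One simple weighting scheme, sketched below: give each original point half of the interval to each of its temporal neighbours, so the weights sum to the observed duration without needing any interpolated points in the KDE at all.

```python
def time_weights(timestamps):
    """timestamps: sorted epoch seconds. Each point gets half the interval
    to each temporal neighbour, so weights sum to the observed duration."""
    weights = []
    last = len(timestamps) - 1
    for i, t in enumerate(timestamps):
        before = (t - timestamps[i - 1]) / 2 if i > 0 else 0
        after = (timestamps[i + 1] - t) / 2 if i < last else 0
        weights.append(before + after)
    return weights
```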

Error when running main.py

nate@nate-worklap:~/itinerum/itinerum-trip-breaker$ python3 main.py
3 user(s) to clean
Traceback (most recent call last):
  File "main.py", line 54, in <module>
    user = Trace(user_id, user_ids[user_id], survey_responses[user_id])
  File "/home/nate/itinerum/itinerum-trip-breaker/trace.py", line 29, in __init__
    h = read_headers()
TypeError: read_headers() missing 1 required positional argument: 'fname'
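
The call in trace.py needs to pass the filename through; something like the following, where the constant name is only a guess at wherever the input path actually lives:

```python
# In trace.py around line 29: read_headers() now requires the file name.
# COORDINATES_FNAME is a hypothetical stand-in for the real input path.
h = read_headers(COORDINATES_FNAME)
```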

Implement smarter point clustering algorithm

Cluster detection is currently the most computationally expensive process in the script. We end up calculating an entire N×N distance matrix, where N can be as large as 15,000, and separating clusters based on simple distance. There should be a faster and more memory-efficient way to partition that many points into clusters. This is almost certainly an out-of-the-box function from an existing library; we just need to pick the ideal algorithm based on the nature of the problem and choose a library.
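
One off-the-shelf candidate: scikit-learn's DBSCAN with the haversine metric and a ball tree, which clusters by density without materialising the full N×N matrix. A sketch, with an illustrative 50 m threshold:

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6371000.0

def cluster_points(lats, lons, eps_m=50.0, min_samples=5):
    """Density-based clustering on lat/lon without a dense distance matrix."""
    coords = np.radians(np.column_stack([lats, lons]))  # haversine wants radians
    labels = DBSCAN(eps=eps_m / EARTH_RADIUS_M, min_samples=min_samples,
                    metric='haversine', algorithm='ball_tree').fit_predict(coords)
    return labels  # -1 marks noise; other integers are cluster ids
```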

Different platforms giving different clustering results

On the test data, Nate's computer yields 16 activity locations with default parameters. Felipe's has many more points above the threshold and yields only 14 locations.

Possibly an issue of differing precision, but the magnitude of the difference seems to indicate otherwise.

Note identical points in input data, somehow flag as more likely being error

Android phones seem to produce many points at the same location, perhaps while they are acquiring a signal or when the signal is weak. Most of these are essentially noise and should be discarded if they aren't already. This information should be used in the data-cleaning stage to remove repeated points, with limited exceptions where points repeat just by chance due to rounding.
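
A sketch of what the cleaning-stage flag could look like; the dictionary keys and the repeat threshold here are illustrative, not the script's actual structures:

```python
from collections import Counter

def flag_repeats(points, max_repeats=3):
    """Mark points whose exact (lat, lon) pair recurs suspiciously often."""
    counts = Counter((p['lat'], p['lon']) for p in points)
    for p in points:
        p['suspect_repeat'] = counts[(p['lat'], p['lon'])] > max_repeats
    return points
```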

Run Viterbi algorithm on interpolated subsets

Currently we run this only on the original point sets. Improved spatial resolution should hopefully improve performance, and better interpolation and subsetting algorithms will eventually lead to better results.

Refactor compare.py/compare_locations

It should be structured more like compare_episodes.

Also, the location comparison should exclude computed locations that aren't referenced in the episodes file.

Output compare.py diagnostics to CSV

The output from this script already takes up a full screen to display properly on my computer. Once we add a few more diagnostics, printing to the console will be untenable.

Or maybe the printout can be reformatted?
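
If it goes to a file, a minimal sketch of the CSV dump; the field names would come from whatever compare.py actually reports:

```python
import csv

def write_diagnostics(rows, fname='compare_diagnostics.csv'):
    """rows: list of dicts, one per diagnostic; field names are illustrative."""
    if not rows:
        return
    with open(fname, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```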

How should ground truth data be classified?

The purpose of the ground truth data is to test the performance of the algorithm on a known dataset. It seems to me that there are two broad potential approaches to this:

  1. We can classify the data according to what actually happened on the ground, just translated into the required language of discrete trips and activities.
  2. Or we can classify according to what we see in the GPS points, informed by what actually happened on the ground.

The ground truth data we currently have (my own) is a sloppy mix of these.

To give an example, should we include activity locations that we actually visited but that don't look like activities in coordinates.csv, perhaps because of missing or inaccurate data?

The benefit of producing a properly true ground truth is that we can measure how far our algorithm (considered as encompassing the app, the phone, etc.) is from actual reality as interpreted by the one who lived it, or at least from a more traditional activity survey.

The benefit of ground truth as manual classification of input data is that it tells us how far we are from the best possible results we can get from the data we have available.

My Reality > Phone's Reality > Our interpretation of Phone's Reality

Ground truth user aborting after/during KDE

nate@nate-worklap:~/itinerum/itinerum-trip-breaker$ python3 main.py
1 user(s) to clean
User: 1 2128 points at start for 44b9444a-ccb8-426b-878b-8b09318348e5
	 123 points removed as high stated error
	 28 points removed as duplicate
	 45 points removed by positional cleaning
	 17 points removed as duplicate
	 1 gap(s) found in data
	Running KDE on 4672 points
44b9444a-ccb8-426b-878b-8b09318348e5 aborted

Output trip geometry file

Coded as an additional column on the coordinates file, labeling each point with the trip/activity it corresponds to, if any.

Also, see if any standards exist for representing trip geometry.
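
On standards: GeoJSON (RFC 7946) is one widely supported option, representing each trip as a LineString feature. A sketch, assuming points are dicts with lat/lon keys:

```python
import json

def trip_to_geojson(points, trip_id):
    """Serialise one trip's ordered points as a GeoJSON LineString feature."""
    return json.dumps({
        'type': 'Feature',
        'properties': {'trip_id': trip_id},
        'geometry': {
            'type': 'LineString',
            # GeoJSON coordinate order is [longitude, latitude]
            'coordinates': [[p['lon'], p['lat']] for p in points],
        },
    })
```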

Many long, obscure functions everywhere should be shortened

In particular:

  • init file reading should be handled elsewhere (part of issue #4)
  • flush, likewise part of issue #4
  • the get_days inner loop can be abstracted

Refactor for readability:

  • get_known_subsets
  • get_activity_locations
  • break_trips
  • find_peaks
  • observe_neighbours

Output points geometry/diagnostics/classification file

We need to output information on each point used in the algorithm:

  • Location, obviously

  • Viterbi state classification (trip? activity location?), which can be used to construct trip geometry

  • Whether it was original input point or interpolated

  • Whether it was removed during cleaning (used to debug/verify cleaning process)

  • Temporal weight assigned during KDE step

  • Estimated PDF probability at that location

  • etc.

point_diagnostics.csv (the header is already being written) should replace a couple of intermediate diagnostic files.

This data should already be stored with each point object, so we just need to write the output.
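
A sketch of the writer, assuming the point objects carry attributes roughly matching the list above (the attribute names here are guesses):

```python
import csv

# Guessed attribute names for the fields listed above
POINT_FIELDS = ['lat', 'lon', 'ts', 'state', 'interpolated',
                'discarded', 'time_weight', 'kde_p']

def write_point_diagnostics(points, fname='point_diagnostics.csv'):
    with open(fname, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(POINT_FIELDS)
        for p in points:
            writer.writerow([getattr(p, field, '') for field in POINT_FIELDS])
```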

Limit bottlenecks on large datasets

Larger datasets in the tower survey sample can take more than 5 minutes to run, e.g. user 0c7714d8-ce0b-421a-a698-5d77152e165a. This user happens to have more days in the dataset than expected, but they make a good test case for speed improvements.

Some suggestions:

  • Thin out the interpolation if it results in too many points.
  • Thin out just the points at which the KDE is estimated, through sampling (see the sketch below).
  • Run Viterbi on only the original points? This may actually be quite fast already; that needs to be tested.
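
A sketch of the sampling idea from the second suggestion, with an illustrative cap of 10,000 points:

```python
import random

def thin(points, cap=10000, seed=0):
    """Randomly subsample down to `cap` points, preserving original order."""
    if len(points) <= cap:
        return points
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(points)), cap))
    return [points[i] for i in keep]
```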

Locations are underdetected after R ks update?

Check to be sure that nothing in the KDE calculation changed with the package update such that the peak-height thresholds are now too high. Locations are now underdetected for all users.

Some temporal interpolation needed.

In the new algorithm, activity time attribution will be done on the set of interpolated points. The spatial interpolation is fine, but some additional temporal interpolation will be needed where there are large temporal gaps. It doesn't need to be even across the whole gap, unless that is much easier.

It may be more efficient to do this after the KDE.
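
A sketch of the even-spread option, returning synthetic timestamps for one gap; max_interval is an illustrative parameter:

```python
def fill_time_gap(t_start, t_end, max_interval=60):
    """Return evenly spaced synthetic timestamps so no interval in the gap
    exceeds max_interval seconds; empty if the gap is already small enough."""
    gap = t_end - t_start
    if gap <= max_interval:
        return []
    n = int(gap // max_interval)  # number of points to insert
    step = gap / (n + 1)
    return [t_start + step * (i + 1) for i in range(n)]
```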

Check for repeating records in the coordinates table

Not sure if this would actually cause us any trouble, but there are some (many?) records in the input coordinate data that are repeated verbatim: same timestamp, etc.
It would be good to check for this in the cleaning step just in case.
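
A minimal check for the cleaning step, treating the entire record (timestamp included) as the key:

```python
def drop_exact_duplicates(rows):
    """Keep only the first occurrence of each verbatim-identical record.
    Assumes rows are sequences (e.g. lists from csv.reader)."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(row)  # the whole record, timestamp included
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```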

KDE crashes on CFS data

Various bugs come up: a ZeroDivisionError, runtime errors in clustering, a segmentation fault on larger traces, and errors when a trace is too short.

Parallelize processing

One user per core?

One potential issue here is the current high memory requirements for some users in the clustering step.
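
A sketch of one-user-per-core with the standard library; process_user stands in for the existing per-user pipeline:

```python
from multiprocessing import Pool

def run_all(user_ids, workers=4):
    # process_user(user_id) is assumed to run the whole per-user pipeline.
    # maxtasksperchild=1 recycles each worker after one user, returning its
    # memory to the OS, which matters given the clustering step's footprint.
    with Pool(processes=workers, maxtasksperchild=1) as pool:
        pool.map(process_user, user_ids)
```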

Study location needs to be made optional

The script fails for users with no study location (and probably also when another location, like work, is missing).

nate@nate-worklap:~/itinerum/itinerum-trip-breaker$ python3 main.py
Traceback (most recent call last):
  File "main.py", line 46, in <module>
    school = Location(row['location_study_lon'], row['location_study_lat'])
  File "/home/nate/itinerum/itinerum-trip-breaker/location.py", line 6, in __init__
    self.latitude = float(latitude)
ValueError: could not convert string to float: 

The string in question is empty: ''
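
A minimal guard in main.py, keeping the same Location call as in the traceback and skipping construction when the fields are empty:

```python
# Only build the study location when both fields are non-empty; downstream
# code would then need to tolerate school being None.
school = None
if row['location_study_lon'] and row['location_study_lat']:
    school = Location(row['location_study_lon'], row['location_study_lat'])
```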

KDE bandwidth not consistent between users?

It appears that the distance-decay function (kernel bandwidth) for user F95144C6-73FC-4310-9B71-7CCD217F13E4 is much too large, while for users 44b9444a-ccb8-426b-878b-8b09318348e5 and 1FEBD82C-8ED9-46AB-B3AF-2111BA007FAE it seems about right.

The KDE values for the first just appear overly smooth.

Why would this be? No parameters should be changing between them.
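
One way to rule out per-user bandwidth adaptation is to pin the bandwidth explicitly rather than leaving it to an automatic selector. A sketch with scipy's gaussian_kde (the project appears to use R's ks package, but the same idea applies there):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_fixed_bandwidth(xy, factor=0.1):
    # A scalar bw_method is used directly as kde.factor, pinning the
    # multiplier that Scott's rule would otherwise derive from sample size.
    # The kernel covariance still scales with the data's spread, so fully
    # identical bandwidths across users would need an explicit covariance.
    return gaussian_kde(np.asarray(xy), bw_method=factor)
```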
