sausy-lab / itinerum-trip-breaker
Trip/activity delimiting tool for Itinerum travel survey app
The KDE calculation takes an enormous amount of time at a sufficiently fine resolution: tens of thousands of grid cells are evaluated that are nowhere near any point in the input dataset.
Changing from a grid to a set of provided eval points will affect the peak-finding algorithm, but could save a huge amount of time.
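A minimal sketch of the idea, assuming a scikit-learn KDE (the script's actual KDE implementation may differ): evaluate the density only at the observed points rather than over a grid. Peak-finding would then have to operate on scattered points rather than grid cells.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_at_points(points, bandwidth_m):
    """points: (N, 2) array of projected x/y coordinates in metres."""
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth_m).fit(points)
    # score_samples returns log-density; evaluating only at the data
    # points skips the tens of thousands of empty grid cells.
    return np.exp(kde.score_samples(points))
```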
Felipe needs to provide function descriptions and use-case examples, and to comment his code wherever its purpose is not clear.
Should be generalized to accept headers in any order.
Also double-check the output file headers for consistency with compare.py.
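One way to make header order irrelevant, assuming the inputs are CSV: read columns by name rather than by position. The field names here are illustrative.

```python
import csv

# DictReader keys each row by header name, so column order no longer matters.
with open('coordinates.csv', newline='') as f:
    for row in csv.DictReader(f):
        lat = float(row['latitude'])   # field names are assumptions
        lon = float(row['longitude'])
```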
We are using a KDE to estimate locations where more than 10 minutes have been spent in one spot. To do this, we feed in all the interpolated points, even though we can't/shouldn't find any locations where there are only interpolated points. The only reason we include them all is that it's an easier way to spread out time weights so they aren't too concentrated on the original (sparse) input points.
Better/smarter weight assignments to the original points should reduce the time spent in the KDE step.
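One hypothetical weighting that would let the KDE run on the original points alone: give each point half the elapsed time to each of its temporal neighbours.

```python
def time_weights(timestamps):
    """timestamps: sorted POSIX times for one user's original points.
    Each point gets half the interval to each temporal neighbour."""
    weights = []
    for i, t in enumerate(timestamps):
        before = (t - timestamps[i - 1]) / 2 if i > 0 else 0
        after = (timestamps[i + 1] - t) / 2 if i < len(timestamps) - 1 else 0
        weights.append(before + after)
    return weights
```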
nate@nate-worklap:~/itinerum/itinerum-trip-breaker$ python3 main.py
3 user(s) to clean
Traceback (most recent call last):
  File "main.py", line 54, in <module>
    user = Trace(user_id, user_ids[user_id], survey_responses[user_id])
  File "/home/nate/itinerum/itinerum-trip-breaker/trace.py", line 29, in __init__
    h = read_headers()
TypeError: read_headers() missing 1 required positional argument: 'fname'
This is a very cryptic function. Documentation needed along with better variable names.
I wonder if interpolation is going awry? There are fewer points in the interpolated input to the KDE than in the original (cleaned) dataset.
Cluster detection is currently the most computationally expensive process in the script. We end up calculating an entire NxN distance matrix, where N can be as large as 15,000, and separating clusters based on simple distance. There should be a faster and more memory-efficient way to partition that many points into clusters. This is almost certainly an out-of-the-box function from an existing library; we just need to pick the ideal algorithm based on the nature of the problem and choose a library.
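For example, scikit-learn's DBSCAN with a ball tree computes neighbourhoods lazily instead of materializing the full NxN matrix; the distance threshold below is a placeholder, not a tuned value.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_points(lats, lons, distance_m=50):
    """Partition points into clusters of mutual proximity."""
    coords = np.radians(np.column_stack([lats, lons]))
    earth_radius_m = 6371000
    labels = DBSCAN(
        eps=distance_m / earth_radius_m,  # haversine works in radians
        min_samples=1,                    # every point joins some cluster
        algorithm='ball_tree',            # avoids the full distance matrix
        metric='haversine',
    ).fit_predict(coords)
    return labels                         # cluster id per input point
```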
On the test data, Nate's computer yields 16 activity locations with default parameters. Felipe's has many more points above the threshold and yields only 11–14 locations.
Possibly an issue of differing precision, but the magnitude of the difference seems to indicate otherwise.
I added some description for the point diagnostic file, but the others need something like this too.
We've discussed changing the kde function to improve location detection accuracy, but short of that it should be refactored for readability.
How about just a, b, c, etc.?
That would make things easier to type and talk about.
Android phones seem to produce many points at exactly the same location, perhaps while they are acquiring a signal or when the signal is weak. Most of these are basically junk and should be discarded if they aren't already. This information should be used in the data-cleaning stage to remove repeated points, with limited exceptions where points repeat just by chance due to rounding.
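A sketch of one such cleaning pass, which collapses runs of identical coordinates while keeping the first and last point of each run to preserve the dwell interval. The chance-repeat exception would need an extra check (e.g. on elapsed time) that is omitted here.

```python
def collapse_repeats(points):
    """points: time-ordered objects with .latitude and .longitude.
    Drops interior points of runs repeated at identical coordinates."""
    kept = []
    for i, p in enumerate(points):
        coord = (p.latitude, p.longitude)
        same_prev = i > 0 and coord == \
            (points[i - 1].latitude, points[i - 1].longitude)
        same_next = i < len(points) - 1 and coord == \
            (points[i + 1].latitude, points[i + 1].longitude)
        if not (same_prev and same_next):  # keep the ends of each run
            kept.append(p)
    return kept
```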
Currently running this on only the original point sets. Improved spatial resolution should, we hope, improve performance. Better interpolation and subsetting algorithms will eventually lead to better results.
Should be structured more like compare_episodes
Also location comparison should exclude computed locations that aren't referenced in the episodes file.
Mischa's survey data (already uploaded) should be a good test of our classification of unknown time. Will need detailed manual reclassification.
Right now we have a naive enough algorithm that we are basically not even trying to do this.
The output from this script already takes up a full screen on my computer. Once we add a few more diagnostics, printing to the console will be untenable.
Or maybe the printout can be reformatted?
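One option would be to send the per-user detail through the standard logging module and keep the console to a short summary. The file name and format below are placeholders.

```python
import logging

# Verbose diagnostics go to a file; the console keeps a one-line summary.
logging.basicConfig(filename='diagnostics.log', level=logging.DEBUG,
                    format='%(asctime)s %(message)s')
log = logging.getLogger(__name__)

log.debug('123 points removed as high stated error')  # file only
print('1 user(s) cleaned')                            # console summary
```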
The purpose of the ground truth data is to test the performance of the algorithm on a known dataset. It seems to me that there are two broad potential approaches to this:

1. A properly true ground truth: a record of what actually happened, as interpreted by the person who lived it.
2. Ground truth as a manual classification of the input data: the best interpretation possible of the points we actually recorded.

The ground truth data we currently have (my own) is a sloppy mix of these.
To give an example: should we include activity locations that we actually visited but that don't look like activities in coordinates.csv, perhaps because of missing or inaccurate data?
The benefit of producing a properly true ground truth is that we can measure how far our algorithm (considered as encompassing the app, the phone, etc.) is from actual reality as interpreted by the one who lived it, or at least from a more traditional activity survey.
The benefit of ground truth as manual classification of input data is that it tells us how far we are from the best possible results we can get from the data we have available.
My Reality > Phone's Reality > Our interpretation of Phone's Reality
nate@nate-worklap:~/itinerum/itinerum-trip-breaker$ python3 main.py
1 user(s) to clean
User: 1 2128 points at start for 44b9444a-ccb8-426b-878b-8b09318348e5
123 points removed as high stated error
28 points removed as duplicate
45 points removed by positional cleaning
17 points removed as duplicate
1 gap(s) found in data
Running KDE on 4672 points
44b9444a-ccb8-426b-878b-8b09318348e5 aborted
Coded as an additional column in the coordinates file, labeling each point with the trip/activity it corresponds to, if any.
Also, see if any standards exist for representing trip geometry.
We need to output information on each point used in the algorithm:

- location, obviously
- Viterbi state classification (trip? activity location?), which can be used to construct the geometry of trips
- whether it was an original input point or interpolated
- whether it was removed during cleaning (used to debug/verify the cleaning process)
- temporal weight assigned during the KDE step
- estimated PDF probability at that location
- etc.
point_diagnostics.csv (the header is being written already) should replace a couple of intermediate diagnostic files.
This data should already be stored with each point object, so we just need to write the output.
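A sketch of that writer; the attribute names are assumptions about the point class, mirroring the fields listed above.

```python
import csv

def write_point_diagnostics(points, fname='point_diagnostics.csv'):
    """Write one diagnostic row per point object."""
    fields = ['latitude', 'longitude', 'state', 'interpolated',
              'removed_by_cleaning', 'time_weight', 'kde_probability']
    with open(fname, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for p in points:
            # Missing attributes are left blank rather than crashing.
            writer.writerow({field: getattr(p, field, '') for field in fields})
```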
Larger datasets in the tower survey sample can take more than 5 minutes to run. E.g. user 0c7714d8-ce0b-421a-a698-5d77152e165a. This user happens to have too many days in the dataset, but we can use them as a test case for speed improvements.
Some suggestions:

- Check with Mischa/Kyle to be sure the app is working again, and how to tell if it isn't.
- Check that nothing in the KDE calculation changed with the package update such that the peak-height thresholds are now too high. Locations are now underdetected for all users.
Should be similar metrics to unknown time detection.
Likely in find_peaks.
In the new algorithm, activity time attribution will be done on the set of interpolated points. The spatial interpolation is fine, but some additional temporal interpolation will be needed where there are large temporal gaps. It doesn't need to be even across the whole gap, unless that is much easier.
It may be more efficient to do this after the KDE.
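A minimal sketch of the simple case (even spacing across the gap, which the issue allows); the gap threshold is a placeholder.

```python
def fill_time_gaps(timestamps, max_gap=120):
    """Insert evenly spaced synthetic timestamps (in seconds) into any
    gap longer than max_gap, so no interval exceeds the threshold."""
    filled = [timestamps[0]]
    for t in timestamps[1:]:
        start = filled[-1]
        gap = t - start
        if gap > max_gap:
            n = int(gap // max_gap)   # synthetic points needed
            step = gap / (n + 1)
            filled.extend(start + step * k for k in range(1, n + 1))
        filled.append(t)
    return filled
```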
Not sure if this would actually cause us any trouble, but there are some (many?) records in the input coordinate data that are repeated verbatim: same timestamp, same coordinates, everything.
It would be good to check for this in the cleaning step just in case, though!
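A cheap guard, assuming dict rows and the field names shown: keep only the first occurrence of each (timestamp, latitude, longitude) triple.

```python
def drop_exact_duplicates(rows):
    """Drop records repeated verbatim, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = (row['timestamp'], row['latitude'], row['longitude'])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```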
Various bugs come up: a zero-division error, some runtime errors in clustering, a segmentation fault on larger traces, and errors when a trace is too short.
There are 'locations' being found that are unacceptably close to one another, i.e. closer than the clustering distance.
Some trips and activities have improbably short durations and should be merged into the surrounding activities and trips.
One user per core?
One potential issue here is the current high memory requirements for some users in the clustering step.
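A sketch of the one-user-per-core idea with multiprocessing; process_user stands in for the per-user pipeline in main.py, and the worker cap is there because of the memory concern above.

```python
from multiprocessing import Pool

def process_user(user_id):
    # Stand-in for the per-user pipeline (clean, KDE, cluster, output).
    return user_id

def run_all(user_ids, workers=4):
    # Cap the pool size: each worker may hold a large distance matrix
    # during clustering, so one process per core may be too many.
    with Pool(processes=workers) as pool:
        return pool.map(process_user, user_ids)
```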
The script fails when users are present with no study location (and probably also when another location, like work, is missing).
nate@nate-worklap:~/itinerum/itinerum-trip-breaker$ python3 main.py
Traceback (most recent call last):
  File "main.py", line 46, in <module>
    school = Location(row['location_study_lon'], row['location_study_lat'])
  File "/home/nate/itinerum/itinerum-trip-breaker/location.py", line 6, in __init__
    self.latitude = float(latitude)
ValueError: could not convert string to float:

The string in question is empty: ''
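A sketch of a guard at the call site; the Location class here is a minimal stand-in for location.py's, based on the traceback.

```python
class Location:
    # Minimal stand-in for location.py's Location class.
    def __init__(self, longitude, latitude):
        self.longitude = float(longitude)
        self.latitude = float(latitude)

def location_or_none(lon, lat):
    """Return a Location, or None when either coordinate is blank."""
    if not lon.strip() or not lat.strip():
        return None  # the user simply didn't supply this location
    return Location(lon, lat)
```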
Not a major issue, but it does cause minor headaches and has led to multiple different timestamp-parsing functions.
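One shared parser would remove the duplication; the format strings below are guesses at the variants in the survey data.

```python
from datetime import datetime

def parse_timestamp(text):
    """Try each known timestamp format in turn."""
    for fmt in ('%Y-%m-%d %H:%M:%S', '%Y-%m-%dT%H:%M:%S%z'):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError('unrecognised timestamp: %r' % text)
```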
We should come up with/adopt standards for documentation and style, and identify where we're lacking.
It appears that the distance decay function (kernel bandwidth) for user F95144C6-73FC-4310-9B71-7CCD217F13E4 is much too high, while for user 44b9444a-ccb8-426b-878b-8b09318348e5 it seems about right (as it does for 1FEBD82C-8ED9-46AB-B3AF-2111BA007FAE).
The KDE values for the first just appear overly smooth.
Why would this be? No parameters should be changing between them.
Something to teach people how to use the software; ideally a PDF (TeX).
We need ground truth data