sausy-lab / itinerum-trip-breaker
Trip/activity delimiting tool for Itinerum travel survey app
The KDE calculation takes an enormous amount of time at a sufficiently fine resolution: tens of thousands of grid cells are evaluated that are nowhere near any point in the input dataset.
Changing from a grid to a set of provided eval points will affect the peak-finding algorithm, but could save a huge amount of time.
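A minimal sketch of the idea, assuming a scikit-learn KDE (the script's actual KDE implementation may differ): evaluate the density only at the observed points rather than over a grid. Peak-finding would then have to operate on scattered points rather than grid cells.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_at_points(points, bandwidth_m):
    """points: (N, 2) array of projected x/y coordinates in metres."""
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth_m).fit(points)
    # score_samples returns log-density; evaluating only at the data
    # points skips the tens of thousands of empty grid cells.
    return np.exp(kde.score_samples(points))
```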
Felipe needs to provide function descriptions and use-case examples, and to comment his code wherever its purpose is not clear.
Should be generalized to accept headers in any order.
Also double-check the output file headers for consistency with compare.py.
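One way to make header order irrelevant, assuming the inputs are CSV: read columns by name rather than by position. The field names here are illustrative.

```python
import csv

# DictReader keys each row by header name, so column order no longer matters.
with open('coordinates.csv', newline='') as f:
    for row in csv.DictReader(f):
        lat = float(row['latitude'])   # field names are assumptions
        lon = float(row['longitude'])
```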
We are using a KDE to estimate locations where more than 10 minutes have been spent in one spot. To do this, we feed in all the interpolated points, even though we can't/shouldn't find any locations where there are only interpolated points. The only reason we include them all is that it's an easier way to spread out time weights so they aren't too concentrated on the original (sparse) input points.
Better/smarter weight assignments to the original points should reduce the time spent in the KDE step.
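One hypothetical weighting that would let the KDE run on the original points alone: give each point half the elapsed time to each of its temporal neighbours.

```python
def time_weights(timestamps):
    """timestamps: sorted POSIX times for one user's original points.
    Each point gets half the interval to each temporal neighbour."""
    weights = []
    for i, t in enumerate(timestamps):
        before = (t - timestamps[i - 1]) / 2 if i > 0 else 0
        after = (timestamps[i + 1] - t) / 2 if i < len(timestamps) - 1 else 0
        weights.append(before + after)
    return weights
```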
nate@nate-worklap:~/itinerum/itinerum-trip-breaker$ python3 main.py
3 user(s) to clean
Traceback (most recent call last):
  File "main.py", line 54, in <module>
    user = Trace(user_id, user_ids[user_id], survey_responses[user_id])
  File "/home/nate/itinerum/itinerum-trip-breaker/trace.py", line 29, in __init__
    h = read_headers()
TypeError: read_headers() missing 1 required positional argument: 'fname'
This is a very cryptic function. Documentation needed along with better variable names.
I wonder if interpolation is going awry? There are fewer points in the interpolated input to the KDE than in the original (cleaned) dataset.
Cluster detection is currently the most computationally expensive process in the script. We end up calculating an entire NxN distance matrix, where N can be as large as 15,000, and separating clusters based on simple distance. There should be a faster and more memory-efficient way to partition that many points into clusters. This is almost certainly an out-of-the-box function from an existing library; we just need to pick the ideal algorithm based on the nature of the problem and choose a library.
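For example, scikit-learn's DBSCAN with a ball tree computes neighbourhoods lazily instead of materializing the full NxN matrix; the distance threshold below is a placeholder, not a tuned value.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_points(lats, lons, distance_m=50):
    """Partition points into clusters of mutual proximity."""
    coords = np.radians(np.column_stack([lats, lons]))
    earth_radius_m = 6371000
    labels = DBSCAN(
        eps=distance_m / earth_radius_m,  # haversine works in radians
        min_samples=1,                    # every point joins some cluster
        algorithm='ball_tree',            # avoids the full distance matrix
        metric='haversine',
    ).fit_predict(coords)
    return labels                         # cluster id per input point
```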
On the test data, Nate's computer yields 16 activity locations with default parameters. Felipe's has many more points above the threshold and yields only 11–14 locations.
Possibly an issue of differing precision, but the magnitude of the difference seems to indicate otherwise.
I added some description for the point diagnostic file, but the others need something like this too.
We've discussed changing the kde function to improve location detection accuracy, but short of that it should be refactored for readability.
How about just a, b, c, etc.?
That would make things easier to type and talk about.
Android phones seem to produce many points at exactly the same location, perhaps while they are acquiring a signal or when the signal is weak. Most of these are basically junk and should be discarded if they aren't already. This information should be used in the data-cleaning stage to remove repeated points, with limited exceptions where points repeat just by chance due to rounding.
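A sketch of one such cleaning pass, which collapses runs of identical coordinates while keeping the first and last point of each run to preserve the dwell interval. The chance-repeat exception would need an extra check (e.g. on elapsed time) that is omitted here.

```python
def collapse_repeats(points):
    """points: time-ordered objects with .latitude and .longitude.
    Drops interior points of runs repeated at identical coordinates."""
    kept = []
    for i, p in enumerate(points):
        coord = (p.latitude, p.longitude)
        same_prev = i > 0 and coord == \
            (points[i - 1].latitude, points[i - 1].longitude)
        same_next = i < len(points) - 1 and coord == \
            (points[i + 1].latitude, points[i + 1].longitude)
        if not (same_prev and same_next):  # keep the ends of each run
            kept.append(p)
    return kept
```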
Currently running this on only the original point sets. Improved spatial resolution should, we hope, improve performance. Better interpolation and subsetting algorithms will eventually lead to better results.
Should be structured more like compare_episodes
Also location comparison should exclude computed locations that aren't referenced in the episodes file.
Mischa's survey data (already uploaded) should be a good test of our classification of unknown time. Will need detailed manual reclassification.
Right now we have a naive enough algorithm that we are basically not even trying to do this.
The output from this script already takes up a full screen on my computer. Once we add a few more diagnostics, printing to the console will be untenable.
Or maybe the printout can be reformatted?
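One option would be to send the per-user detail through the standard logging module and keep the console to a short summary. The file name and format below are placeholders.

```python
import logging

# Verbose diagnostics go to a file; the console keeps a one-line summary.
logging.basicConfig(filename='diagnostics.log', level=logging.DEBUG,
                    format='%(asctime)s %(message)s')
log = logging.getLogger(__name__)

log.debug('123 points removed as high stated error')  # file only
print('1 user(s) cleaned')                            # console summary
```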
The purpose of the ground truth data is to test the performance of the algorithm on a known dataset. It seems to me that there are two broad potential approaches to this:

1. A properly true ground truth: a record of what actually happened, as interpreted by the person who lived it.
2. Ground truth as a manual classification of the input data: the best interpretation possible of the points we actually recorded.

The ground truth data we currently have (my own) is a sloppy mix of these.
To give an example: should we include activity locations that we actually visited but that don't look like activities in coordinates.csv, perhaps because of missing or inaccurate data?
The benefit of producing a properly true ground truth is that we can measure how far our algorithm (considered as encompassing the app, the phone, etc.) is from actual reality as interpreted by the one who lived it, or at least from a more traditional activity survey.
The benefit of ground truth as manual classification of input data is that it tells us how far we are from the best possible results we can get from the data we have available.
My Reality > Phone's Reality > Our interpretation of Phone's Reality
nate@nate-worklap:~/itinerum/itinerum-trip-breaker$ python3 main.py
1 user(s) to clean
User: 1 2128 points at start for 44b9444a-ccb8-426b-878b-8b09318348e5
123 points removed as high stated error
28 points removed as duplicate
45 points removed by positional cleaning
17 points removed as duplicate
1 gap(s) found in data
Running KDE on 4672 points
44b9444a-ccb8-426b-878b-8b09318348e5 aborted
Coded as an additional column in the coordinates file, labeling each point with the trip/activity it corresponds to, if any.
Also, see if any standards exist for representing trip geometry.
We need to output information on each point used in the algorithm:

- location, obviously
- Viterbi state classification (trip? activity location?), which can be used to construct the geometry of trips
- whether it was an original input point or interpolated
- whether it was removed during cleaning (used to debug/verify the cleaning process)
- temporal weight assigned during the KDE step
- estimated PDF probability at that location
- etc.
point_diagnostics.csv (the header is being written already) should replace a couple of intermediate diagnostic files.
This data should already be stored with each point object, so we just need to write the output.
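A sketch of that writer; the attribute names are assumptions about the point class, mirroring the fields listed above.

```python
import csv

def write_point_diagnostics(points, fname='point_diagnostics.csv'):
    """Write one diagnostic row per point object."""
    fields = ['latitude', 'longitude', 'state', 'interpolated',
              'removed_by_cleaning', 'time_weight', 'kde_probability']
    with open(fname, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for p in points:
            # Missing attributes are left blank rather than crashing.
            writer.writerow({field: getattr(p, field, '') for field in fields})
```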
Larger datasets in the tower survey sample can take more than 5 minutes to run. E.g. user 0c7714d8-ce0b-421a-a698-5d77152e165a. This user happens to have too many days in the dataset, but we can use them as a test case for speed improvements.
Some suggestions:

- Check with Mischa/Kyle to be sure the app is working again, and how to tell if it isn't.
- Check that nothing in the KDE calculation changed with the package update such that the peak-height thresholds are now too high. Locations are now underdetected for all users.
Should be similar metrics to unknown time detection.
Likely in find_peaks.
In the new algorithm, activity time attribution will be done on the set of interpolated points. The spatial interpolation is fine, but some additional temporal interpolation will be needed where there are large temporal gaps. It doesn't need to be even across the whole gap, unless that is much easier.
It may be more efficient to do this after the KDE.
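A minimal sketch of the simple case (even spacing across the gap, which the issue allows); the gap threshold is a placeholder.

```python
def fill_time_gaps(timestamps, max_gap=120):
    """Insert evenly spaced synthetic timestamps (in seconds) into any
    gap longer than max_gap, so no interval exceeds the threshold."""
    filled = [timestamps[0]]
    for t in timestamps[1:]:
        start = filled[-1]
        gap = t - start
        if gap > max_gap:
            n = int(gap // max_gap)   # synthetic points needed
            step = gap / (n + 1)
            filled.extend(start + step * k for k in range(1, n + 1))
        filled.append(t)
    return filled
```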
Not sure if this would actually cause us any trouble, but there are some (many?) records in the input coordinate data that are repeated verbatim: same timestamp, same coordinates, everything.
It would be good to check for this in the cleaning step just in case, though!
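A cheap guard, assuming dict rows and the field names shown: keep only the first occurrence of each (timestamp, latitude, longitude) triple.

```python
def drop_exact_duplicates(rows):
    """Drop records repeated verbatim, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = (row['timestamp'], row['latitude'], row['longitude'])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```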
Various bugs come up: a zero-division error, some runtime errors in clustering, a segmentation fault on larger traces, and errors when a trace is too short.
There are 'locations' being found that are unacceptably close to one another, i.e. closer than the clustering distance.
Some trips and activities have improbably short durations and should be merged into the surrounding activities and trips.
One user per core?
One potential issue here is the current high memory requirements for some users in the clustering step.
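A sketch of the one-user-per-core idea with multiprocessing; process_user stands in for the per-user pipeline in main.py, and the worker cap is there because of the memory concern above.

```python
from multiprocessing import Pool

def process_user(user_id):
    # Stand-in for the per-user pipeline (clean, KDE, cluster, output).
    return user_id

def run_all(user_ids, workers=4):
    # Cap the pool size: each worker may hold a large distance matrix
    # during clustering, so one process per core may be too many.
    with Pool(processes=workers) as pool:
        return pool.map(process_user, user_ids)
```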
The script fails when users are present with no study location (and probably also when another location, like work, is missing).
nate@nate-worklap:~/itinerum/itinerum-trip-breaker$ python3 main.py
Traceback (most recent call last):
  File "main.py", line 46, in <module>
    school = Location(row['location_study_lon'], row['location_study_lat'])
  File "/home/nate/itinerum/itinerum-trip-breaker/location.py", line 6, in __init__
    self.latitude = float(latitude)
ValueError: could not convert string to float:

The string in question is empty: ''
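A sketch of a guard at the call site; the Location class here is a minimal stand-in for location.py's, based on the traceback.

```python
class Location:
    # Minimal stand-in for location.py's Location class.
    def __init__(self, longitude, latitude):
        self.longitude = float(longitude)
        self.latitude = float(latitude)

def location_or_none(lon, lat):
    """Return a Location, or None when either coordinate is blank."""
    if not lon.strip() or not lat.strip():
        return None  # the user simply didn't supply this location
    return Location(lon, lat)
```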
Not a major issue, but it does cause minor headaches and has led to multiple different timestamp-parsing functions.
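One shared parser would remove the duplication; the format strings below are guesses at the variants in the survey data.

```python
from datetime import datetime

def parse_timestamp(text):
    """Try each known timestamp format in turn."""
    for fmt in ('%Y-%m-%d %H:%M:%S', '%Y-%m-%dT%H:%M:%S%z'):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError('unrecognised timestamp: %r' % text)
```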
We should come up with/adopt standards for documentation and style, and identify where we're lacking.
It appears that the distance decay function (kernel bandwidth) for user F95144C6-73FC-4310-9B71-7CCD217F13E4 is much too high, while for user 44b9444a-ccb8-426b-878b-8b09318348e5 it seems about right (as it does for 1FEBD82C-8ED9-46AB-B3AF-2111BA007FAE).
The KDE values for the first just appear overly smooth.
Why would this be? No parameters should be changing between them.
Something to teach people how to use the software; ideally a PDF (TeX).
We need ground truth data