fathom5corp / ais Goto Github PK

Package ais provides types and methods for conducting data science on signals generated by maritime entities radiating from an Automated Identification System (AIS) transponder as mandated by the International Maritime Organization (IMO) for vessels over 300 gross tons and all passenger vessels.

License: MIT License

Go 100.00%

data-science ais maritime maritime-data-science csv-files

ais's Issues

Overall Goals for the package ais

From @zacsketches on October 30, 2018 16:7

Like most development projects that start off one chunk at a time it is starting to become clear that the collection of utilities would probably be more useful as a library. This issue will collect some ideas around what the eventual library API should look like.

Some initial thoughts are that the library should deal with a customized *File type that has some sort of validation in its NewAISFile method to ensure the headers and all subsequent lines can be read. Once we have a *ais.File then it could be passed to Reader and Writer. Also you could get subsets of ais.Record back from it for utilities like subset. There might also need to be a type for ais.Vessel that has all the pertinent info for a specific vessel and a slice of this type could come back from the current vessels utility.

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#10

Test RecordSet.AppendField

This test would bang out several percentage points of coverage.

Sort by geohash

This feature was an early concept in the package that did not prove useful in early use. Providing the feature introduced several types and methods to the package that complicated the API. The source code for this feature is maintained in a private branch and can be added back in if a user community develops and would like to see this feature.

Add package AIS to Zenodo when coverage and community feedback warrants publication

It is an intent of this project to support academic research on the large AIS datasests. When the code reaches sufficient test coverage and community user feedback warrants broader adoption we will publish the package on Zenodo so it can be found and used by others more easily.

https://guides.github.com/activities/citable-code/

Use reflect in ais Parse() to iterate over Report fields

From @zacsketches on November 13, 2018 0:39

An ais Report contains the fields necessary to conduct data science. However, the current implementation of Parse which converts string data present in public data sources into numeric data relies on several brittle practices.

Parse should use the reflect package to iterate over the fields of a Report instead of a hardcoded list of requiredFields that exists in the current version. From this iteration it should check the Headers of the passed record to ensure the record has the minimally viable set of headers necessary. Alternatively, it could parse every field that it can identify in the Headers and return a Report with those fields set.

Additionally, if each field in a Report was an interface instead of a fundamental type then Parse could call an interface method to parse that field. For example, this interface could take the design

type Parsable interface{
     Parse(s string) (interface{}, error)
}

type IntField int

// Parse wraps strconv.Atoi for integer fields in an ais Report
func (f IntField) Parse(s string) (int, error) {
     	i, err := strconv.Atoi(s)
	if err != nil {
		return 0, err
	}
	return i, nil
}

type Report struct {
     MMSI     IntField
     IMO      IntField
         . . .
     Lat      FloatField
}

This design would eliminate the second hardcoded list in the existing function that calls the specific ParseFOO function depending on the field name. The resulting Parse pseudocode would look more like this.

nameParse := make(map[string]Parsable
rep := Report{}
s := reflect.ValueOf(&rep).Elem()
typeOfRep := s.Type()
for i := 0; i < s.NumField(); i++ {
    f := s.Field(i)
    name := fmt.Sprintf(typeOfRep.Field(i).Name)
    nameParse[name] = f.MethodByName("Parse")
} 

for i , str := range record {
    header := headers[i]
    var v interface{}
    v, ok := nameParse[header]
    if !ok {
        continue
    }
    rep.header = v(str)
}

return header

There are most definitely some syntax errors here that will need to get worked out, but this method is much more robust against changes in an AIS report and flexible to user input.

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#26

Profile results are needed about the utility of calling Flush() explicitly

Explicit calls to Flush() are contained in functions like LimitSubset while the process is iterating over a RecordSet. Since all writer objects are built on types that implement buffered IO, it may be an unnecessary precaution/optimization to be calling Flush() at some randomly selected flushThreshold

VesselSet should include the count of records in the set

Modify UniqueVessels to count the number of times a Vessel appears in the set. Again, this is an API breaking change that is best accomplished early in development.

How will the package deal with interaction pairs

From @zacsketches on November 2, 2018 12:50

There should be an indexable ID for every AIS interaction pair.

As the time convolution moves down a dataset the same pair may remain in the window for several minutes, but it should only appear in the output interactions dataset one time. By taking a hash of the interaction and ensuring that the final dataset contains only one instance of the pair then we can ensure there is only one report of an interaction in each output set.

This is the time to start the ais. Building off of the work done in the geo tool with the type AisRecord add a function signature along these lines:

type PairHash uint64
func Hash64 (a1 AisRecord, a2 AisRecord) (ais.PairHash, error)

This deserves a bit more thought but the hash of the pair should be derived of the MMSI BaseDateTime LAT and LON for each vessel.

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#16

Add RecordSet() method to the ais sorting interface

Sorting as implemented does not expose internal details of loading the RecordSet into a slice of *Record, and is not very flexible for package clients. The sorting facility needs to be reworked to make sorting more flexible.

I think there should be an interface that extends sort.Interface. This new interface would provide methods to LoadRecords() and return a RecordSet() from user created types that implement the Len, Swap and 'Lessmethods requried bysort`.

The new RecordSet() should be used in SortByTime and SortByGeohash to return the recordset instead of the currently repeated code to write the ByTimestamp and ByGeohash slices to a RecordSet.

Generate uses Record instead of *Record

It may be a performance improvement to avoid the copy in the Generate function.

Window validate fails for a few reasons

getting index out of range errors as I prune the slice.

Then getting dups when I append values that are "in" where I is beyond the array.

I think the right thing here is to make the Window data structure a map[hash]*Record delete is provide by the map as well as dealing with duplicates. Since I don't care about the order of the Records in the window this will probably work.

All receivers (rec Record) need to operate on pointers not values

From @zacsketches on November 15, 2018 12:18

various operations on a Record may get called millions of times in a program. These should not be making value copies of the record. They should be operating on the Pointer to an existing Record.

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#28

UniqueVessels does not rewind the file pointer

After a call to rs.UniqueVessels the file pointer in the RecordSet reader has already reached io.EOF and any further calls to the Reader do not behave as expected. Users may want to call UniqueVessels and see if a particular vessel is present in the data, and then use the set for additional operations that fail because the pointer needs be reset.

Implement proximity by geohash

From @zacsketches on October 30, 2018 22:19

The following sites are germane to the we might use geohashing to find nearby neighbors. One technique is to store the points into an in memory database, Redis, and then use the databases pre-existing API calls to search for neighbors.

The other technique would be to implement the proximity detection natively.

https://en.wikipedia.org/wiki/Geohash

https://gis.stackexchange.com/questions/18330/using-geohash-for-proximity-searches

https://redis.io/commands/#geo

https://www.alexedwards.net/blog/working-with-redis

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#14

Test RecordSet.SubsetLimit()

This function is used in almost every utility. This is the highest priority for testing as of mid-dec 2018.

Add Validate method to RecordSet

See issue #30.
Users should have the option to validate a RecordSet to ensure that the headers contain the right information in order to do follow on actions. This could take the form

func (rs *RecordSet) Validate(format FileFormat) (bool, error)

The FileFormat argument could be from a set of constants that define the list of file sources the package has been tested against. For example this set would contain MARINECADASTRE and INTERACTIONS.

Build ais.Match examples to remove tugs

From @zacsketches on November 9, 2018 18:14

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#17

Update readme for changes made to Matching and Subset

Simplify the Headers struct

After writing about 30 different examples and user programs I have not found it useful to have the JSON parsed definitions of the headers included in the struct. For the rare purpose when this might be useful, like providing info to headers in a GUI, there are other ways to accomplish this.

Let's remove the dictionary from the Headers struct and simplify their creation and interface. This may break a lot of code, but I'd rather get this done now while I am the only user.

SubsetLimit does not rewind the Read() pointer

Similar to issue #35

SubsetLimit leaves the RecordSet in a state where the underlying Reader has already reached EOF and any subsequent use the set will not function as expected. The solution to this needs to be carefully considered because it could allow for reading the set into memory as it is being used so that the records are held in a more performant data structure. It could also be an increase in complexity that might not be required.

This bug show up in the following use case:

rs, _  := ais.OpenRecordSet("foo.csv")
ship1, _  := rs.Track(mmsi1, ais.Beginning, ais.All)
ship2, _ := rs.Track(mmsi2, ais.Beginning, ais.All)

There will be no good data for ship2 because the call to SubsetLimit initiated by the second call to Track will try to read from a file that has already reached EOF.

RecordSet.Write is missing

The existing RecordSet.Save method is not idiomatic go. An implementation that is more idiomatic along the lines of

func (rs *RecordSet) Write( io.Writer) (int n, err error)

would be an improvement.

Change the API so subsetting has an interface Matching instead of a type

The method Matching(match Match) *RecordSet, Error requires the type Match which is a thin abstraction for a function that implements func (*Record) bool.

I now want to be able to create a Box defined by min/max lat and long, and subset based off of everything in the Box. In this use case Box is a type and would implement the Matching interface instead of the type.

This will require the API to change so the signature looks more like this

func (rs *RecodSet) Subset(m Matching) (*RecordSet, error)

Making this change is going to break the existing API and I will have to go back and fix all the broken code...which wouldn't be that hard if I had better testing to find the breaks... ☹️

But since I am the only user I want to get this right and not let it linger...it will come up again.

Add examples to the `ais` package for common use cases

From @zacsketches on November 10, 2018 16:8

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#18

Fix needs to be updated to use the ais package

From @zacsketches on November 17, 2018 13:40

The fix utility does not use the library. This tool needs to be updated to use the library.

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#31

ByTimestamp.Less() calls Contains()

The call to contains can be reduced to a single call when a new ByTimestamp is created in the NewByTimestamp function.

Profile the existing code and see if changing this might improve performance.

Write tests for RecordSet: Open, Read, readFirst, Stash

These fundamental ops need testing.

Create a test of RecordSet.Close()

From @zacsketches on November 19, 2018 16:54

The rs.Close() function should be able to deal with calls on the unexported data field regardless of whether or not data implements io.Closer. This is supported through reflection, but should be included in the test package to ensure all cases are covered.

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#33

Add feature for RecordSet.Track(...) (*RecordSet, error)

A Track is a collection of ais.Record that are sequential in time and belong to the same MMSI.

This will be a custom implementation of a Subset returning records that match mmsi in the inteval [start, start+dur). I think the function signature for this new feature should be

func (rs *RecordSet) Track(mmsi int64, start time.Time, dur time.Duration) (*RecordSet, error)

In addition to normal error handling, the returned error may need some custom implementations. Specifically, if the function successfully executes, but len(returned *Recordset)==0, then the error should provide some semaphore that there is an empty set. A potential implementation could be

var EmptySet = errors.New("track returned no matching records")

The naive implementation will move through the recordset exhaustively and is the first implementation planned. This will have linear performance with the size of the set. More mature implementations could search for the start time more intelligently. This might require some creative Seek calls, or more elegantly a map or binary tree storage of the RecordSet.

ContainsMulti should return a safer map

As it is currently implemented

Headers.ContainsMulti(fields ...string) (idxmap map[string]int, ok bool

can lead to errors if a user mistypes a key when accessing the returned idxMap. Consider the following snippet:

idxMap, ok := rs.Headers().ContainsMulti("MMSI", "BaseDateTime")
if !ok {
     panic("missing headers")
}

rec, _ := rs.Read()

time := []string(*rec)[idxMap["Timestamp"]] // <--- Note incorrect header name

In this case, it is obvious that the user wants time equal to the timestamp of the record, but the incorrect header name will return the zero value for the map value, which is 0. Accessing rec[0] will quietly and incorrectly return the MMSI header instead of the desired result.

ContainsMulti should protect users from this sort of quiet but pernicious error.

Add example for RecordSet.Read()

The idiomatic way to read a recordset should be provided in the Godoc. It is insufficient for new users to just say "the same was as csvReader.Read()"

ContainsMulti would be useful

Several places in using the library and within the package it is necessary to check a RecordSet for multiple different headers (i.e. MMSI, BaseDateTime, LAT and LON. This requires several pieces of boilerplate code to call rs.Headers().Contains("foo") and subsequent handling of !ok for each call.

A function with a signature like

func (h *Headers) ContainsMulti(fields ...string) (idx map[string]int, ok bool)

would be very helpful and cut way down boilerplate checks in user code.

Add a Scanner to RecordSet

Client's of package ais are expected to properly employ the idiom of calling Read() and then testing for io.EOF or passing the error along to another check for error != nil. A Scanner as implmented in bufio would alleviate this boilerplate and then clients could just make calls to Scan() and have this low level io.EOF handling taken care of. This would reduce client code to iterate over a recordset to

for scanner.Scan(){
    rec := scanner.Record()
} 
if err := scanner.Error(); err != nil {
    // Deal with scanner errors.
}

Create a feedback channel for long-running functions like subset and sort to provide status

From @zacsketches on November 12, 2018 12:4

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#25

Headers.Valid() fails on sets with InteractionFields as the header

The InteractionFields headers do not pass the Valid() test and therefore RecordSets generated by the solution cannot be opened for further processing.

Headers valid may be too brittle at this point to include in the package. As some point when the package gains more widespread use we might have enough knowledge to apply a Valid() check. However, the current version is a user protection that does not serve a useful purpose for the early adopters who understand their data.

At this point, since the use user community is so small I think the right thing to do is remove Valid(). The other option is to make Valid() a no-op and keep it in the repo for future implementation.

Create a test of RecordSet.Close()

From @zacsketches on November 19, 2018 16:54

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#33

Write a doc.go for the package that introduces key types and methods

The godoc.org introductory material needs to be more thorough in order to make the package more accessible and easy to put into action.

Using the doc.go capability of the language we need to explain the key types, and their idiomatic usage.

Build a test to see if internal Read calls are dropping records

Perhaps internal calls to Read() are dropping records and should use readFirst() instead. A test that determines this is desired before improvements are contemplated.

Internal calls to Read() are in LimitMatching, AppendField and maybe even other places.

Create an interface Matcher so Records instead of Reports can be used in Matching()

type Matcher interface {
    Match(data interface{}) bool
}

type DateMatch Record

func (d DateMatch) Match( data interface{} ) bool {

}

//Then declare a variable as
var rdm DataMatch

coveralls.io badge is caching

The coveralls.io badge is getting cached by Github and does not update when coverage increases.

shields.io allows dynamic badge creation via query arguments, which also changes the url for the image, and should clear the Github cache.

Create the cover percentage from go test -cover and then update README with the new dynamic URL before committing. The URL for the image should still link back to the Coveralls page so that folks can see the coverage report if they click on the badge.

Wrte a test to ensure RecordSet.Read() calls are not moving the file pointer incorrectly

From @zacsketches on November 15, 2018 12:14

There may be a subtle bug in the library where calls to csv.Read() are moving the file pointer over the first line when Headers are being checked. This could cause some operations later in a program to miss the first line. There needs to be a consistent way that the file pointer on the underlying file handle in a RecordSet is returned to the top of the file after each access.

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#27

fathom5corp / ais Goto Github PK

ais's Issues

Recommend Projects

Recommend Topics

Recommend Org