fathom5corp / ais

Package ais provides types and methods for conducting data science on signals radiated by maritime entities from Automatic Identification System (AIS) transponders, as mandated by the International Maritime Organization (IMO) for vessels over 300 gross tons and all passenger vessels.

License: MIT License


ais's Introduction


Jump straight to Usage

Package AIS - Beta Release

Note: This repo is actively maintained with focus on increasing code test coverage to 100%. Until then it is appropriate for use in research, but should not be used for navigation or other mission critical applications.

In September 2018 the United States Navy hosted the annual HACKtheMACHINE Navy Digital Experience in Seattle, Washington. The three-day prototyping, public engagement, and educational experience is designed to generate insights into maritime cybersecurity, data science, and rapid prototyping. Track 2, the data science track, focused on collision avoidance between ships.

The U.S. Navy is the largest international operator of unmanned and autonomous systems sailing on and under the world's oceans. Developing algorithms that contribute to safe navigation by autonomous and manned vessels is in the best interest of the Navy and the public. To support the development of such AI-driven navigational systems the Navy sponsored HACKtheMACHINE Seattle Track 2 to create collision avoidance training data from publicly available maritime shipping data. Read the full challenge description here.

This repository is the open-source release of a Go language software toolkit built from insights gained during HACKtheMACHINE Seattle. Over the course of the multi-day challenge, teams tried several approaches using a variety of software languages and geospatial information systems (GIS). Ultimately, the complexity of the challenge prevented any single team from providing a complete solution, but all of the winners (see acknowledgements below) provided useful ideas that are captured in this package. The Navy's decision to open-source the prototype data science tools built on the ideas generated at HACKtheMACHINE is meant to continue building a vibrant community of practice for maritime data science hobbyists and professionals. Please use this code, submit issues to improve it, and join in. Our community is organized here and on LinkedIn. Please reach out with questions about the code, suggestions to make the usage documentation better, maritime data science in general, or just to ask a few questions about the community.

What's in the package?

Package FATHOM5/ais contains tools for creating machine learning datasets for navigation systems based on open data released by the U.S. Government.

The largest and most comprehensive public data source for maritime domain awareness is the Automatic Identification System (AIS) data collected and released to the public by the U.S. Government on the marinecadastre.gov website. These comma-separated value (CSV) data files average more than 25,000,000 records per file, and a single month of data is a set of 20 files totaling over 60 GB of information. Therefore, the first hurdle to building a machine learning dataset from these files is a big-data challenge: finding interesting interactions in this large corpus of records.

The ais package contains tools for abstracting the process of opening, reading, and manipulating these large files, plus additional tools that support algorithm development to identify interesting interactions. The primary goal of ais is to provide high-performance abstractions for dealing with large AIS datasets. In this Beta release, high performance means identifying potential two-ship interactions in a full day of data in about 17 seconds. We know this can be improved upon and are eager to get the Beta into use within the community to make it better. That 17-second figure builds on ideas from HACKtheMACHINE but exceeds any approach demonstrated at the competition by several orders of magnitude.

Installation

Package FATHOM5/ais is a standard Go language library installed in the typical fashion.

go get github.com/FATHOM5/ais

Import the package in your code with

import "github.com/FATHOM5/ais"

Usage

The package contains many facilities for abstracting the use of large AIS csv files, creating subsets of those files, sorting large AIS datasets, appending data to records, and implementing time convolution algorithms. This usage guide introduces many of these ideas with more detailed guidelines available in the godocs.

Basic Operations on RecordSets
Basic Operations on Records
Subsets
Sorting
Appending Fields to Records
Convolution Algorithms

Basic Operations on RecordSets

The first requirement of the package is to reduce the complexity of working with large CSV files and allow algorithm developers to focus on the type RecordSet, which is an abstraction over the on-disk CSV files.

func OpenRecordSet(filename string) (*RecordSet, error)
func (rs *RecordSet) Save(filename string) error

The facilities OpenRecordSet and Save allow users to open a CSV file downloaded from marinecadastre.gov into a RecordSet, and to save a RecordSet to disk after completing other operations. Since the RecordSet often manages a *os.File object that requires closing, it is a best practice to call defer rs.Close() right after opening a RecordSet.

The typical workflow is to open a RecordSet from disk, analyze the data using other tools in the package, then save the modified set back to disk.

rs, err := ais.OpenRecordSet("data.csv")
if err != nil {
    panic(err)
}
defer rs.Close()

// analyze the recordset...

err = rs.Save("result.csv")
if err != nil {
    panic(err)
}

To create an empty RecordSet the package provides

func NewRecordSet() *RecordSet

The *RecordSet returned from this function maintains its data in memory until the Save function is called to write the set to disk.

A RecordSet consists of two parts. First, there is a Headers object derived from the first row of CSV data in the file that was opened. The set of Headers can be associated with a JSON dictionary that provides Definitions for all of the data fields. For any production use the data Dictionary should be considered a mandatory addition to the project, but it is often omitted in early data analysis work. The Dictionary should be a JSON file with multiple ais.Definition objects serialized into the file. Loading and assigning the dictionary is demonstrated in this code snippet.

// Most error handling omitted for brevity, but should definitely be 
// included in package use.
rs, _ := ais.OpenRecordSet("data.csv")
defer rs.Close()
j, _ := os.Open("dataDictionary.json")
defer j.Close()
jsonBlob, _ := ioutil.ReadAll(j)
if err := rs.SetDictionary(jsonBlob); err != nil {
	panic(err)
}

h := rs.Headers()
fmt.Println(h)

The final call to fmt.Println(h) will call the Stringer interface for Headers and pretty print the index, header name, and definition for all of the column names contained in the underlying csv file that rs now accesses.

Second, in addition to Headers the RecordSet contains an unexported data store of the AIS reports in the set. Each line of data in the underlying CSV files is a single Record that can be accessed through calls to the Read() method. Each call to Read() advances the file pointer in the underlying CSV file until reaching io.EOF. The idiomatic way to process through each Record in the RecordSet is

// Some error handling omitted for brevity, but should definitely be 
// included in package use.
rs, err := ais.OpenRecordSet("data.csv")
if err != nil {
    panic(err)
}
defer rs.Close()

for {
	rec, err := rs.Read()
	if err == io.EOF {
		break
	}
	if err != nil {
		panic(err)
	}

    // Do something with rec
}

A RecordSet also supports Write(rec Record) error calls. This allows users to create new RecordSet objects. As previously stated, high performance is an important goal of the package and therefore slow IO operations to disk are minimized through buffering. So after completing a series of Write(...) operations package users must call Flush() to flush out any remaining contents of the buffer.

rs := ais.NewRecordSet()
defer rs.Close()

h := strings.Split("MMSI,BaseDateTime,LAT,LON,SOG,COG,Heading,VesselName,IMO,CallSign,VesselType,Status,Length,Width,Draft,Cargo", ",")
data := strings.Split("477307900,2017-12-01T00:00:03,36.90512,-76.32652,0.0,131.0,352.0,FIRST,IMO9739666,VRPJ6,1004,moored,337,,,", ",")

rs.SetHeaders(ais.NewHeaders(h, nil))  // note dictionary is not assigned

rec1 := ais.Record(data)

rs.Write(rec1) // error ignored for brevity
err := rs.Flush()
if err != nil {
    panic(err)
}

rs.Save("test.csv")

In many of the examples that follow, error handling is omitted for brevity. However, in use error handling should never be omitted since IO operations and large data set manipulation are error prone activities.

One example of an algorithm against a complete RecordSet is finding all of the unique vessels in a file. This particular algorithm is provided as a method on a RecordSet and returns the type ais.VesselSet.

rs, _ := ais.OpenRecordSet("data.csv")
defer rs.Close()

var vessels ais.VesselSet
vessels, _ = rs.UniqueVessels()

From this point, you can query the vessels map to determine if a particular vessel is present in the RecordSet or count the number of unique vessels in the set with len(vessels).

Basic Operations on Records

Most data science tasks for an AIS RecordSet deal with comparisons on individual lines of data. Package ais abstracts individual lines as Record objects. In order to make comparisons between data fields in a Record it is sometimes necessary to convert the string representation of the data in the underlying csv file into an int, float or time type. The package provides utility functions for this purpose.

func (r Record) ParseFloat(index int) (float64, error)
func (r Record) ParseInt(index int) (int64, error)
func (r Record) ParseTime(index int) (time.Time, error)

The index argument for the functions is the index of the header value that you are trying to parse. The idiomatic way to use these functions is

h := strings.Split("MMSI,BaseDateTime,LAT,LON,SOG,COG,Heading,VesselName,IMO,CallSign,VesselType,Status,Length,Width,Draft,Cargo", ",")
data := strings.Split("477307900,2017-12-01T00:00:03,36.90512,-76.32652,0.0,131.0,352.0,FIRST,IMO9739666,VRPJ6,1004,moored,337,,,", ",")

headers := ais.NewHeaders(h, nil)
rec := ais.Record(data)

timeIndex, _ := headers.Contains("BaseDateTime")

var t time.Time
t, err := rec.ParseTime(timeIndex)
if err != nil {
	panic(err)
}
fmt.Printf("The record timestamp is at %s\n", t.Format(ais.TimeLayout))

Another common operation is to measure the distance between two Record reports. The package provides a Record method to compute this directly.

func (r Record) Distance(r2 Record, latIndex, lonIndex int) (nm float64, err error)

The calculated distance is computed using the haversine formula implemented in FATHOM5/haversine. For users unfamiliar with computing great circle distance see this package for an explanation of great circles and the haversine formula.

h := strings.Split("MMSI,BaseDateTime,LAT,LON,SOG,COG,Heading,VesselName,IMO,CallSign,VesselType,Status,Length,Width,Draft,Cargo", ",")
headers := ais.NewHeaders(h, nil)
latIndex, _ := headers.Contains("LAT") // !ok checking omitted for brevity
lonIndex, _ := headers.Contains("LON")

data1 := strings.Split("477307900,2017-12-01T00:00:03,36.90512,-76.32652,0.0,131.0,352.0,FIRST,IMO9739666,VRPJ6,1004,moored,337,,,", ",")
data2 := strings.Split("477307902,2017-12-01T00:00:03,36.91512,-76.22652,2.3,311.0,182.0,SECOND,IMO9739800,XHYSF,,underway using engines,337,,,", ",")
rec1 := ais.Record(data1)
rec2 := ais.Record(data2)

nm, err := rec1.Distance(rec2, latIndex, lonIndex)
if err != nil {
    panic(err)
}
fmt.Printf("The ships are %.1fnm away from one another.\n", nm)

This example and the one above it create Record objects directly instead of reading them from an ais.OpenRecordSet call like the previous examples. This usage can also come into play when writing data to a new RecordSet. For example, in the previous snippet, the variable rec1 could be written to a dataset like this:

// Record and Headers created per the previous example
rs := ais.NewRecordSet()
rs.SetHeaders(headers)

rs.Write(rec1) // error checking omitted for brevity
rs.Flush()
rs.Save("newData.csv")

Many more ways of working with RecordSet and Record objects follow in the more advanced uses of the package in the next few sections.

Subsets

The most common operation on multi-gigabyte files downloaded from marinecadastre.gov is to create subsets of about one million records. The original datafiles are a one-month set covering a single UTM zone. The natural subset is to break this into single-day files and then perform analysis on these one-day subsets. To accomplish this operation the package provides the interface Matching and two functions provided by RecordSet that take arguments implementing the Matching interface in order to return a subset.

type Matching interface {
    Match(*Record) (bool, error)
}

func (rs *RecordSet) SubsetLimit(m Matching, n int) (*RecordSet, error)
func (rs *RecordSet) Subset(m Matching) (*RecordSet, error)

Package clients define a type that implements the Matching interface and then pass this type as an argument to Subset or SubsetLimit. The returned *RecordSet contains only those lines from the original RecordSet that return true from the Match function of m.

type subsetOneDay struct {
	rs        *ais.RecordSet
	d1        time.Time // date we want to match
	timeIndex int       //index value of BaseDateTime in the record
}

func (sod *subsetOneDay) Match(rec *ais.Record) (bool, error) {
	d2, err := time.Parse(ais.TimeLayout, (*rec)[sod.timeIndex])
	if err != nil {
		return false, fmt.Errorf("subsetOneDay: %v", err)
	}
	d2 = d2.Truncate(24 * time.Hour)
	return sod.d1.Equal(d2), nil
}

func main(){
	rs, _ := ais.OpenRecordSet("largeData.csv")
	defer rs.Close()

	// Instantiate a concrete subsetOneDay to return records
	// from 25 Dec 2017.
	timeIndex, ok := rs.Headers().Contains("BaseDateTime")
	if !ok {
		panic("recordset does not contain the header BaseDateTime")
	}
	targetDate, _ := time.Parse("2006-01-02", "2017-12-25")
	sod := &subsetOneDay{
		rs:        rs,
		d1:        targetDate,
		timeIndex: timeIndex,
	}

	matches, _ := rs.Subset(sod)
	//matches.Save("newSet.csv")
	subsetRec, _ := matches.Read()
	subsetDate := (*subsetRec)[timeIndex]
	date, _ := time.Parse(ais.TimeLayout, subsetDate)
	fmt.Printf("The first record in the subset has BaseDateTime %v\n", date.Format("2006-01-02"))

	// Output:
	// The first record in the subset has BaseDateTime 2017-12-25
}

This example introduces two additional features of the package. First, the call to rs.Headers().Contains(headerName) is the idiomatic way to get the index value of a header used in a later function call. Always check the ok parameter of this return to ensure the RecordSet includes the necessary Header entry. Second, the package includes the constant TimeLayout = "2006-01-02T15:04:05", which represents the timestamp format in the Marinecadastre files and is designed to be passed to the time.Parse function as the layout string argument.

During algorithm development it is sometimes desirable to create a RecordSet with only a few dozen or a few hundred data lines in order to avoid long computation times between successive iterations of the program. Therefore, the package also provides SubsetLimit(m Matching, n int) where the resulting *RecordSet will only contain the first n matches.

Sorting

The package uses the Go standard library sort capabilities for high performance sorting. The most common operation is to sort a single day of data into chronological order by the BaseDateTime header. This operation is implemented within the package and is exposed to users with a single call to SortByTime().

rs, _ := ais.OpenRecordSet("oneDay.csv")
defer rs.Close()
rs, err := rs.SortByTime()
if err != nil {
    log.Fatalf("unable to sort the recordset: %v", err)
}
rs.Save("oneDaySorted.csv")

In this example, note that the original *RecordSet, named rs, created from the OpenRecordSet call is reused to hold the return value from SortByTime. This presents no issues and prevents another memory allocation. The automatic garbage collection in Go (...yeah...automatic garbage collection in a high-performance language) will deal with the pointer reference abandoned by reusing rs.

Package users are encouraged to use the idiomatic sorting method presented above, but sorting is an important operation for AIS data science. So the implementation details are presented here for community discussion to improve the interface to allow more generic sorting. Issue #19 deals with this needed enhancement. The key challenge is that sorting large AIS files presents a big-data issue because a RecordSet is a pointer to an on-disk file or in-memory buffer. In order to sort the data it must be loaded into a []*Record. This requires reading every Record in a set and loading them all into memory...an expensive operation. To accomplish this only when needed the package introduces two new types: ByGeohash and ByTimestamp. In this section we will explain sorting ByTimestamp.

A new ByTimestamp object must read all of the underlying records and load them into a []*Record. This is accomplished in the implementation of NewByTimestamp() by calling the unexported method loadRecords(). Users should not create a ByTimestamp object using the builtin new(Type) command. The example below demonstrates incorrect and correct use of the ByTimestamp type.

 bt := new(ais.ByTimestamp) // Wrong 
 sort.Sort(bt) // Will panic
 
 rs, _ := ais.OpenRecordSet("oneDay.csv")
 defer rs.Close()
 
 bt2, _ := ais.NewByTimestamp(rs)  
 sort.Sort(bt2)
 
// Write the data from the ByTimestamp object into a Recordset
// NOTE: Headers are written only when the RecordSet is saved to disk
rsSorted := ais.NewRecordSet()
defer rsSorted.Close()
rsSorted.SetHeaders(rs.Headers())

for _, rec := range *bt2.data {
	rsSorted.Write(rec)
}
err := rsSorted.Flush()
if err != nil {
	log.Fatalf("flush error writing to new recordset: %v", err)
}
rsSorted.Save("oneDaySorted.csv")

The ByTimestamp type implements the Len, Swap and Less methods required by sort.Interface, so bt2 can be passed directly to sort.Sort(bt2) in the example. Admittedly, the sort.Interface could be implemented better in package ais, and a draft design is suggested in Issue #19 for community comment.

This example also introduces another usage pattern. Note the way the output was created with NewRecordSet() and, specifically, the way the Headers of the new set were assigned from the existing set in the line rsSorted.SetHeaders(rs.Headers()).

Appending Fields to Records

Oftentimes a new field is needed for every Record to capture some derived or computed element about the vessel in the Record. The new field can come from a cross-source lookup. For example, marinetraffic.com offers a vessel lookup service by MMSI. More commonly, new fields come from results computed from data already in the Record. In this example we add a geohash to each Record.

Package ais provides the RecordSet method

 func (rs *RecordSet) AppendField(newField string, requiredHeaders []string, gen Generator) (*RecordSet, error)

Arguments to this function are the new field name passed as a string and two additional arguments that bear a little explanation. The second argument, requiredHeaders, is a []string of the header names in the Record that will be used to derive the new Field. In our example we will be passing the "LAT" and "LON" fields, so we verify they exist before calling AppendField. The final argument is a type that implements the ais.Generator interface.

type Generator interface {
    Generate(rec Record, index ...int) (Field, error)
}

Types that implement the Generator interface will have the Generate method called for every record in the RecordSet. The package provides one implementation of a Generator called Geohasher to append a geohash to every Record. Putting this all together in an example we get

rs, _ := ais.OpenRecordSet("oneDay.csv") // error handling ignored
defer rs.Close()

// Verify that rs contains "LAT" and "LON" Headers
_, ok := rs.Headers().Contains("LAT")
if !ok {
    panic("recordset does not contain 'LAT' header")
}
_, ok = rs.Headers().Contains("LON") // !ok omitted for brevity

// Append the field
requiredHeaders := []string{"LAT", "LON"}
gen := ais.NewGeohasher(rs)
rs, err := rs.AppendField("Geohash", requiredHeaders, gen)
if err != nil {
	panic(err)
}

rs.Save("oneDayGeo.csv")

Convolution Algorithms

The last set of facilities discussed in the usage guidelines relates to creating algorithms that pass a time window over a chronologically sorted RecordSet and apply an analysis to the Record data in the Window. From a data science point of view this applies a time convolution to the underlying Record data, which can be visualized like the gif from the Wikipedia page on convolutions.

In package ais the red window from the figure is implemented by the type Window created with a call to

func NewWindow(rs *RecordSet, width time.Duration) (*Window, error)

The Width of the red Window and the rate that it Slides are configurable parameters of a Window. The blue function in the figure represents the Record data that is analyzed as it comes into the Window. Users should call SortByTime() on the RecordSet before applying the convolution so that the Window is in fact sliding forward in time. The resulting data, represented by the black line in the figure, is usually written to a new RecordSet and saved when the convolution is complete. One way to configure a window from a RecordSet is shown in this snippet.

rs, _ := ais.OpenRecordSet("data.csv")
defer rs.Close()

win, err := ais.NewWindow(rs, 10*time.Minute)
if err != nil {
    panic(err)
}

The call to NewWindow sets the left marker for the Window equal to the time in the next call to Read on rs, and the Width is set to ten minutes in this example. Once the window is created it is used by successive calls to Slide. The idiomatic way to implement this is

for {
	rec, err := rs.Read()
	if err == io.EOF {
		break
	}
	if err != nil {
		panic(err)
	}	
	
	ok, _ := win.RecordInWindow(rec)
	if ok {
		win.AddRecord(*rec)
	} else {
		rs.Stash(rec)
		
		// Do something with the Records in the Window
		
		win.Slide(windowSlide) // windowSlide is a time.Duration chosen by the caller
	}
}

The first part of this, the RecordSet traversal, should look familiar by this point in the tutorial. This is the idiomatic way to process a RecordSet, repeated here for emphasis. The new parts come with the call to RecordInWindow(rec), where the newly read Record is tested to see whether it is in the time window. If ok, then the Record is added to the data held by win. The internal data structure for this recordkeeping is a standard Go map, but the key is a fast fnv hash of the Record. This hash returns a uint64 for the key, which provides a low probability of hash collision and results in a performant data structure with approximately O(1) complexity on lookup and insertion.

The next interesting feature of a RecordSet that has not been addressed yet is the call to rs.Stash(rec) if RecordInWindow returns false. This is critical because the most recent call to Read() provided a Record that was not in the window; however, it may be in the Window after a Slide. So this Record must be stashed so that we get to compare it again after the window slides down. The call to rs.Stash puts the record back on the metaphorical shelf, and the next loop call to Read will return this same Record for the next comparison.

Finally, after the call to Stash the algorithm has reached a point where all the data that is in the Window has been loaded. When sliding down a RecordSet that is already sorted chronologically finding a Record that is not in the Window means that that all Records within that window of time have already been found. So now we can process the Record data to find whatever relationship the time dependent algorithm is trying to identify.

For example, HACKtheMACHINE Seattle challenged participants to find two-vessel interactions that indicate potential maneuvering situations between ships close to one another in time and space. The Window in this case guarantees that vessels are close to one another in time. By adding a geohash to each record in the file clean_data.csv before running this code, sliding the Window can also identify ships that are within the same geohash box. In the worked example that follows, these boxes in time and space are each a Cluster. When there is more than one vessel in a Cluster, each two-vessel pair in the Window that shares the same geohash is an Interaction.

// Interaction completes the workflow to write a RecordSet that uniquely
// identifies two-ship interaction that occur closely separated in time and
// share a geohash that ensures the vessels are within about 4nm of one another.
package main

import (
	"fmt"
	"io"
	"time"

	"github.com/FATHOM5/ais"
)

// Use a negative number to slide over the full file.  A positive integer will
// break out of the iteration loop after the specified number of slides.
const maxSlides = -1

const filename = `clean_data.csv`
const outFilename = `twoShipInteractions.csv`
const windowWidth time.Duration = 10 * time.Minute
const windowSlide time.Duration = 5 * time.Minute

func main() {
	rs, _ := ais.OpenRecordSet(filename)
	defer rs.Close()

	win, _ := ais.NewWindow(rs, windowWidth)
	fmt.Print(win.Config())

	inter, _ := ais.NewInteractions(rs.Headers())
	geoIndex, _ := rs.Headers().Contains("Geohash")

	for slides := 0; slides != maxSlides; {
		rec, err := rs.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}

		ok, _ := win.RecordInWindow(rec)
		if ok {
			win.AddRecord(*rec)
		} else {
			rs.Stash(rec)

			cm := win.FindClusters(geoIndex)
			for _, cluster := range cm {
				if cluster.Size() > 1 {
					inter.AddCluster(cluster) // error ignored for brevity
				}
			}
			win.Slide(windowSlide)
			slides++
		}
	}

	// Save the interactions to a File
	fmt.Println("Saving the interactions to:", outFilename)
	inter.Save(outFilename)
}

This last example provides a full use case applying many of the facilities in package ais to build a dataset of potential two-ship interactions that can train a navigation system artificial intelligence. For the complete example, which includes all required error handling, some timing parameters for performance measurement, and a few pretty-printing additions, see the solution posted to the HACKtheMACHINE Track 2 repository. There are a few new methods presented in this example, like win.Config() and win.FindClusters, but they are well documented in the online package documentation along with other facilities and methods that did not get discussed in the tutorial. Check out the full package documentation at godoc.org for more examples and additional explanations.

More importantly, if you have read to this point you are more than casually interested in maritime data science, so give the repo a star, try some of the examples, and reach out. You have now read a few thousand lines, so let's hear from you. We are actively growing the community and want you to be a part of it!

Acknowledgements

The solutions presented in this repo were made possible by the idea generation and execution of the contestant teams over the weekend. Competitors came from government, academia, and across industries to collaboratively develop solutions to a series of critical and challenging problems. In a single weekend teams:

  • Developed data quality indicators and tools,
  • Identified key inconsistencies in the data,
  • Improved dataset quality,
  • Created algorithms that worked on small subsets of the data, and
  • Suggested and prototyped methods for extending the analysis to larger datasets.

Maintenance

FATHOM5 is a proud partner of the U.S. Navy in creating a community of practice for maritime digital security. The code developed in this repo released under MIT License is an important contribution to growing the HACKtheMACHINE community and part of our corporate commitment to creating a new wave of maritime technology innovation. And oh yeah...we are hiring!

Community

AIS will only increase in importance over the next couple of years with improved accuracy and reduced latency. With the right algorithms, real-time tracking and predictive modeling of ships' behavior and position will be possible for the first time. With techniques developed by the community, AIS data will assist in developing safe autonomous ships, help prevent collisions, reduce environmental impacts, and make the waterways safer and more enjoyable for all.

We want to create a vibrant and thriving maritime innovation community around the potential of large AIS datasets. Please consider joining us. Open issues for bugs, provide an experience report on your use of the package, or just give the repo a star, because we are trying to create algorithms that serve the greater good, not just advertisers!

ais's People

Contributors

zacsketches

ais's Issues

coveralls.io badge is caching

The coveralls.io badge is getting cached by Github and does not update when coverage increases.

shields.io allows dynamic badge creation via query arguments, which also changes the url for the image, and should clear the Github cache.

Create the cover percentage from go test -cover and then update README with the new dynamic URL before committing. The URL for the image should still link back to the Coveralls page so that folks can see the coverage report if they click on the badge.

Write a test to ensure RecordSet.Read() calls are not moving the file pointer incorrectly

From @zacsketches on November 15, 2018 12:14

There may be a subtle bug in the library where calls to csv.Read() are moving the file pointer over the first line when Headers are being checked. This could cause some operations later in a program to miss the first line. There needs to be a consistent way that the file pointer on the underlying file handle in a RecordSet is returned to the top of the file after each access.

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#27

Change the API so subsetting has an interface Matching instead of a type

The method Matching(match Match) (*RecordSet, error) requires the type Match, which is a thin abstraction for a function that implements func(*Record) bool.

I now want to be able to create a Box defined by min/max lat and long, and subset based off of everything in the Box. In this use case Box is a type and would implement the Matching interface instead of the type.

This will require the API to change so the signature looks more like this

func (rs *RecordSet) Subset(m Matching) (*RecordSet, error)

Making this change is going to break the existing API and I will have to go back and fix all the broken code...which wouldn't be that hard if I had better testing to find the breaks... ☹️

But since I am the only user I want to get this right and not let it linger...it will come up again.

Add Validate method to RecordSet

See issue #30.
Users should have the option to validate a RecordSet to ensure that the headers contain the right information for follow-on actions. This could take the form

func (rs *RecordSet) Validate(format FileFormat) (bool, error)

The FileFormat argument could come from a set of constants that defines the list of file sources the package has been tested against. For example, this set would contain MARINECADASTRE and INTERACTIONS.

Add a Scanner to RecordSet

Clients of package ais are expected to properly employ the idiom of calling Read() and then testing for io.EOF or passing the error along to another err != nil check. A Scanner, as implemented in bufio, would alleviate this boilerplate: clients could just make calls to Scan() and have the low-level io.EOF handling taken care of. This would reduce the client code needed to iterate over a RecordSet to

for scanner.Scan() {
    rec := scanner.Record()
}
if err := scanner.Error(); err != nil {
    // Deal with scanner errors.
}

Use reflect in ais Parse() to iterate over Report fields

From @zacsketches on November 13, 2018 0:39

An ais Report contains the fields necessary to conduct data science. However, the current implementation of Parse which converts string data present in public data sources into numeric data relies on several brittle practices.

Parse should use the reflect package to iterate over the fields of a Report instead of a hardcoded list of requiredFields that exists in the current version. From this iteration it should check the Headers of the passed record to ensure the record has the minimally viable set of headers necessary. Alternatively, it could parse every field that it can identify in the Headers and return a Report with those fields set.

Additionally, if each field in a Report was an interface instead of a fundamental type then Parse could call an interface method to parse that field. For example, this interface could take the design

type Parsable interface {
    Parse(s string) (interface{}, error)
}

type IntField int

// Parse wraps strconv.Atoi for integer fields in an ais Report.
func (f IntField) Parse(s string) (interface{}, error) {
    i, err := strconv.Atoi(s)
    if err != nil {
        return 0, err
    }
    return i, nil
}

type Report struct {
    MMSI IntField
    IMO  IntField
        . . .
    Lat  FloatField
}

This design would eliminate the second hardcoded list in the existing function that calls the specific ParseFOO function depending on the field name. The resulting Parse pseudocode would look more like this.

nameParse := make(map[string]reflect.Value)
rep := Report{}
s := reflect.ValueOf(&rep).Elem()
typeOfRep := s.Type()
for i := 0; i < s.NumField(); i++ {
    f := s.Field(i)
    name := typeOfRep.Field(i).Name
    nameParse[name] = f.MethodByName("Parse")
}

for i, str := range record {
    header := headers[i]
    parse, ok := nameParse[header]
    if !ok {
        continue
    }
    // Call the field's Parse method and assign the result to rep.
    v := parse.Call([]reflect.Value{reflect.ValueOf(str)})
    s.FieldByName(header).Set(v[0].Elem())
}

return rep

There are most definitely some syntax errors here that will need to get worked out, but this approach is much more robust against changes in an AIS report and more flexible to user input.

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#26

Build a test to see if internal Read calls are dropping records

Perhaps internal calls to Read() are dropping records and should use readFirst() instead. A test that determines this is desired before improvements are contemplated.

Internal calls to Read() occur in LimitMatching, AppendField, and possibly other places.

Create a test of RecordSet.Close()

From @zacsketches on November 19, 2018 16:54

The rs.Close() function should be able to deal with calls on the unexported data field regardless of whether or not data implements io.Closer. This is supported through reflection, but should be included in the test package to ensure all cases are covered.

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#33

RecordSet.Write is missing

The existing RecordSet.Save method is not idiomatic Go. An implementation more idiomatic along the lines of

func (rs *RecordSet) Write(w io.Writer) (n int, err error)

would be an improvement.

Headers.Valid() fails on sets with InteractionFields as the header

The InteractionFields headers do not pass the Valid() test and therefore RecordSets generated by the solution cannot be opened for further processing.

Headers.Valid() may be too brittle at this point to include in the package. At some point, when the package gains more widespread use, we might have enough knowledge to apply a Valid() check. However, the current version is a user protection that does not serve a useful purpose for the early adopters who understand their data.

At this point, since the user community is so small, I think the right thing to do is remove Valid(). The other option is to make Valid() a no-op and keep it in the repo for future implementation.

Window validate fails for a few reasons

I am getting index-out-of-range errors as I prune the slice.

I am then getting duplicates when I append values that are "in" the window where i is beyond the array.

I think the right fix is to make the Window data structure a map[hash]*Record: delete is provided by the map, as is dealing with duplicates. Since I don't care about the order of the Records in the window this will probably work.

How will the package deal with interaction pairs

From @zacsketches on November 2, 2018 12:50

There should be an indexable ID for every AIS interaction pair.

As the time convolution moves down a dataset the same pair may remain in the window for several minutes, but it should only appear in the output interactions dataset one time. By taking a hash of the interaction and ensuring that the final dataset contains only one instance of the pair then we can ensure there is only one report of an interaction in each output set.

This is the time to start the ais package. Building off of the work done in the geo tool with the type AisRecord, add a function signature along these lines:

type PairHash uint64

func Hash64(a1, a2 AisRecord) (ais.PairHash, error)

This deserves a bit more thought, but the hash of the pair should be derived from the MMSI, BaseDateTime, LAT, and LON for each vessel.

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#16

Add feature for RecordSet.Track(...) (*RecordSet, error)

A Track is a collection of ais.Record that are sequential in time and belong to the same MMSI.

This will be a custom implementation of a Subset returning records that match mmsi in the interval [start, start+dur). I think the function signature for this new feature should be

func (rs *RecordSet) Track(mmsi int64, start time.Time, dur time.Duration) (*RecordSet, error)

In addition to normal error handling, the returned error may need some custom implementation. Specifically, if the function executes successfully but the returned *RecordSet has zero length, then the error should provide a semaphore that there is an empty set. A potential implementation could be

var EmptySet = errors.New("track returned no matching records")

The naive implementation will move through the recordset exhaustively and is the first implementation planned. This will have linear performance with the size of the set. More mature implementations could search for the start time more intelligently. This might require some creative Seek calls, or more elegantly a map or binary tree storage of the RecordSet.

Implement proximity by geohash

From @zacsketches on October 30, 2018 22:19

The following sites are germane to the ways we might use geohashing to find nearby neighbors. One technique is to store the points in an in-memory database such as Redis and then use the database's pre-existing API calls to search for neighbors.

The other technique would be to implement the proximity detection natively.

https://en.wikipedia.org/wiki/Geohash

https://gis.stackexchange.com/questions/18330/using-geohash-for-proximity-searches

https://redis.io/commands/#geo

https://www.alexedwards.net/blog/working-with-redis

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#14

UniqueVessels does not rewind the file pointer

After a call to rs.UniqueVessels the file pointer in the RecordSet reader has already reached io.EOF, and any further calls to the Reader do not behave as expected. Users may want to call UniqueVessels to see if a particular vessel is present in the data, and then use the set for additional operations that fail because the pointer needs to be reset.

Add example for RecordSet.Read()

The idiomatic way to read a RecordSet should be provided in the Godoc. It is insufficient to tell new users that it works "the same way as csv.Reader.Read()".

Sort by geohash

This feature was an early concept in the package that did not prove useful in early use. Providing the feature introduced several types and methods to the package that complicated the API. The source code for this feature is maintained in a private branch and can be added back in if a user community develops and would like to see this feature.

Add RecordSet() method to the ais sorting interface

Sorting as implemented does not expose internal details of loading the RecordSet into a slice of *Record, and is not very flexible for package clients. The sorting facility needs to be reworked to make sorting more flexible.

I think there should be an interface that extends sort.Interface. This new interface would provide methods to LoadRecords() and return a RecordSet() from user-created types that implement the Len, Swap, and Less methods required by sort.

The new RecordSet() should be used in SortByTime and SortByGeohash to return the recordset instead of the currently repeated code to write the ByTimestamp and ByGeohash slices to a RecordSet.

Profile results are needed about the utility of calling Flush() explicitly

Explicit calls to Flush() are contained in functions like LimitSubset while the process is iterating over a RecordSet. Since all writer objects are built on types that implement buffered I/O, it may be an unnecessary precaution/optimization to call Flush() at some arbitrarily selected flushThreshold.

ByTimestamp.Less() calls Contains()

The call to Contains() can be reduced to a single call when a new ByTimestamp is created in the NewByTimestamp function.

Profile the existing code and see if changing this might improve performance.

Overall Goals for the package ais

From @zacsketches on October 30, 2018 16:7

Like most development projects that start one chunk at a time, it is becoming clear that this collection of utilities would probably be more useful as a library. This issue will collect some ideas about what the eventual library API should look like.

Some initial thoughts: the library should deal with a customized *File type that has some sort of validation in its NewAISFile method to ensure the headers and all subsequent lines can be read. Once we have an *ais.File it could be passed to a Reader and Writer. You could also get subsets of ais.Record back from it for utilities like subset. There might also need to be a type ais.Vessel that holds all the pertinent info for a specific vessel; a slice of this type could come back from the current vessels utility.

Copied from original issue: FATHOM5/Seattle_Reasonable_Track2#10

Simplify the Headers struct

After writing about 30 different examples and user programs I have not found it useful to have the JSON-parsed definitions of the headers included in the struct. For the rare cases where this might be useful, like providing header info in a GUI, there are other ways to accomplish it.

Let's remove the dictionary from the Headers struct and simplify their creation and interface. This may break a lot of code, but I'd rather get this done now while I am the only user.

SubsetLimit does not rewind the Read() pointer

Similar to issue #35

SubsetLimit leaves the RecordSet in a state where the underlying Reader has already reached EOF, and any subsequent use of the set will not function as expected. The solution needs to be carefully considered because it could allow for reading the set into memory as it is being used, so that the records are held in a more performant data structure. It could also be an increase in complexity that might not be required.

This bug shows up in the following use case:

rs, _  := ais.OpenRecordSet("foo.csv")
ship1, _  := rs.Track(mmsi1, ais.Beginning, ais.All)
ship2, _ := rs.Track(mmsi2, ais.Beginning, ais.All)

There will be no good data for ship2 because the call to SubsetLimit initiated by the second call to Track will try to read from a file that has already reached EOF.

ContainsMulti should return a safer map

As it is currently implemented

func (h *Headers) ContainsMulti(fields ...string) (idxMap map[string]int, ok bool)

can lead to errors if a user mistypes a key when accessing the returned idxMap. Consider the following snippet:

idxMap, ok := rs.Headers().ContainsMulti("MMSI", "BaseDateTime")
if !ok {
     panic("missing headers")
}

rec, _ := rs.Read()

time := []string(*rec)[idxMap["Timestamp"]] // <--- Note incorrect header name

In this case, it is obvious that the user wants time equal to the timestamp of the record, but the incorrect header name will return the zero value for the map, which is 0. Accessing rec[0] will quietly and incorrectly return the MMSI field instead of the desired result.

ContainsMulti should protect users from this sort of quiet but pernicious error.

ContainsMulti would be useful

In several places, both in client code and within the package, it is necessary to check a RecordSet for multiple different headers (e.g. MMSI, BaseDateTime, LAT, and LON). This requires several pieces of boilerplate code to call rs.Headers().Contains("foo") and handle !ok for each call.

A function with a signature like

func (h *Headers) ContainsMulti(fields ...string) (idx map[string]int, ok bool)

would be very helpful and cut way down on boilerplate checks in user code.
