This project builds on a series of efforts to increase the speed and efficiency of processing large amounts of raw AIS log data.
This test project is built as a C# console application.
The console application has four primary processes, each controlled via a bool setting in program.cs. A built-in stopwatch prints the total run time to the console after every successful execution.
This project is intended for demonstration and the sharing of ideas/concepts only (at least at the current stage of development).
Basic functionality is complete; it generally passes the tests for valid output and generally meets the goals for processing speed.
Application and script structure and syntax are still quite rough, and no attempt has been made to produce 'finished' or 'final' code. There is quite a bit of redundant and unoptimized code throughout, and it needs considerable refactoring.
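As a rough illustration of the bool-controlled processes and the stopwatch described above, here is a minimal sketch. The four setting names are hypothetical placeholders, not the names used in program.cs, and each branch only counts the stage instead of doing real work.

```csharp
using System;
using System.Diagnostics;

// Hypothetical setting names; the real flags live in program.cs.
bool runFirstPass = true;
bool runBuildDictionaries = true;
bool runLoadHelpers = false;
bool runSecondPass = false;

var stopwatch = Stopwatch.StartNew();
int stagesRun = 0;

// Each branch stands in for one of the four primary processes.
if (runFirstPass) stagesRun++;
if (runBuildDictionaries) stagesRun++;
if (runLoadHelpers) stagesRun++;
if (runSecondPass) stagesRun++;

stopwatch.Stop();
Console.WriteLine($"Stages run: {stagesRun}");
Console.WriteLine($"Total run time: {stopwatch.Elapsed.TotalSeconds:F1} sec");
```

Keeping the switches as plain bools makes it easy to time one stage in isolation by turning the others off.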
Arrange pipeline to:
- Flag invalid records on read where validity can be checked without calculation/conversion
- Filter first by invalid records
- Then filter by any other values that do not need conversion or calculation
- Find unique values to convert
- Create lookups for conversion
- Convert values only once
- Replace values in original records with conversions or calculations using hash-table/dictionary retrieval
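The last four steps of this pipeline can be sketched as follows. This is only an illustration under assumptions: the record values are made up, and `BinaryToDecimal` is a hypothetical stand-in for whichever conversion a field needs. The counter shows that the conversion runs once per unique value, not once per record.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical field values, one per record, with many repeats.
string[] records = { "0001", "0011", "0001", "0001", "0011" };

int conversions = 0;

// Stand-in for an expensive conversion (binary to decimal); counts its calls.
string BinaryToDecimal(string bits)
{
    conversions++;
    return Convert.ToInt32(bits, 2).ToString();
}

// Find the unique values, then convert each one exactly once into a lookup.
var lookup = new Dictionary<string, string>();
foreach (string value in records.Distinct())
    lookup[value] = BinaryToDecimal(value);

// Replace the values in the original records via dictionary retrieval.
string[] converted = records.Select(value => lookup[value]).ToArray();

Console.WriteLine(string.Join(",", converted));
Console.WriteLine($"{records.Length} records, {conversions} conversions");
```

With these five sample records the conversion runs twice; the saving grows with the number of records that share a value.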
Original Methodology
- Full conversion of the log line
- Calculations required were O(n * cf * cpf), where n is the number of log lines, cf is the number of fields that require conversion, and cpf is the number of conversions needed per field to achieve the desired end state
- Roughly, calculations = n * cf * cpf
Example
1 bil log lines (n)
12 fields that require conversion (cf)
10 conversions per field (cpf), where one field can chain several conversions (6-bit to binary and binary to decimal, for instance)
= 1 bil * 12 * 10 = 120 bil calculations
Improved Methodology
- Improved conversion processing will be O(n), where n is now the number of unique values rather than the number of log lines
- Speed improves (relative to a 'convert by record' approach) as the ratio of redundant records per conversion type increases (for example, an average of 10, 100, or 1000 records per MMSI or Epoch value)
- Lookup value search will be O(1) search time per replacement or filter
- This is achieved via use of hash-tables/dictionaries for all (appropriate) lookup functions
- A beneficial side-effect of the prior processes is the set of databases created by serializing the lookup dictionaries/tables
- Use the work of the prior stage/process to further filter/refine the data being passed forward to further reduce downstream processing time
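One way the serialized lookup dictionaries could double as simple databases is sketched below. The `Save` and `Load` helpers are hypothetical names, and a `StringWriter`/`StringReader` pair stands in for the dictionary files on disk. The sketch assumes keys never contain a comma, so each line is split at the first comma only.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Writes one "key,value" line per entry.
void Save(Dictionary<string, string> lookup, TextWriter writer)
{
    foreach (var pair in lookup)
        writer.WriteLine($"{pair.Key},{pair.Value}");
}

// Rebuilds the dictionary, splitting each line at the first comma only
// so that values may themselves contain commas.
Dictionary<string, string> Load(TextReader reader)
{
    var lookup = new Dictionary<string, string>();
    while (reader.ReadLine() is string line)
    {
        int comma = line.IndexOf(',');
        if (comma < 0) continue; // skip malformed lines
        lookup[line.Substring(0, comma)] = line.Substring(comma + 1);
    }
    return lookup;
}

var segmentLookup = new Dictionary<string, string>
{
    ["13k8ff0"] = "254947000,1,00,0000"
};

// In-memory stand-in for a dictionary file.
var buffer = new StringWriter();
Save(segmentLookup, buffer);
Dictionary<string, string> restored = Load(new StringReader(buffer.ToString()));

Console.WriteLine(restored["13k8ff0"]);
```

Swapping the in-memory writer and reader for `StreamWriter` and `StreamReader` would persist the same lines to a file.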
- Filter Out Any Invalid Lines
- Save Only Relevant Processing Info
Note: Relevant processing info is currently only 'payload' and 'epoch'
Example (raw log line): !AIVDM,1,1,,B,13k8ff0wh0Pfp9TI65`Vk7gl2<0t,0*21,1646006400
Example (saved payload and epoch): 13cnPQ?P1KPf7DBI5O7CK?vb2<12,1646074221
- Payload checked for min length (28 char)
- Epoch checked for min length (10 char)
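The first-pass filtering described above might look roughly like this. `TryFirstPass` is a hypothetical helper name. The sketch assumes single-fragment AIVDM sentences, where the payload is the sixth comma-separated field, and assumes the epoch is appended as the last field, as in the raw example line.

```csharp
using System;

const int MinPayloadLength = 28;
const int MinEpochLength = 10;

// Keeps only "payload,epoch" for a valid line; returns false for lines to drop.
bool TryFirstPass(string rawLine, out string result)
{
    result = "";
    string[] fields = rawLine.Split(',');
    if (fields.Length < 8) return false; // too few fields to hold payload and epoch

    string payload = fields[5];
    string epoch = fields[fields.Length - 1];
    if (payload.Length < MinPayloadLength || epoch.Length < MinEpochLength)
        return false;

    result = payload + "," + epoch;
    return true;
}

string raw = "!AIVDM,1,1,,B,13k8ff0wh0Pfp9TI65`Vk7gl2<0t,0*21,1646006400";
bool isValid = TryFirstPass(raw, out string kept);
if (isValid)
    Console.WriteLine(kept);
// prints: 13k8ff0wh0Pfp9TI65`Vk7gl2<0t,1646006400
```

Both checks are plain length comparisons, so invalid lines are dropped before any conversion work is done.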
Epoch Dictionary Example: 1646006400,28/02/2022 00:00:00
- Format: Epoch (key), Day/Month/Year HH:MM:SS
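A minimal sketch of building such an epoch lookup follows. `FormatEpoch` is a hypothetical helper, and the conversion assumes the epoch is in seconds and rendered in UTC, which is consistent with the example entry above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Converts a Unix epoch in seconds to "dd/MM/yyyy HH:mm:ss" (UTC assumed).
string FormatEpoch(string epoch)
{
    long seconds = long.Parse(epoch, CultureInfo.InvariantCulture);
    return DateTimeOffset.FromUnixTimeSeconds(seconds)
        .UtcDateTime
        .ToString("dd/MM/yyyy HH:mm:ss", CultureInfo.InvariantCulture);
}

// Sample epoch values taken from first-pass lines, including a repeat.
string[] epochs = { "1646006400", "1646074221", "1646006400" };

// Each distinct epoch is converted once and stored under its original string.
var epochLookup = new Dictionary<string, string>();
foreach (string epoch in epochs)
{
    if (!epochLookup.ContainsKey(epoch))
        epochLookup[epoch] = FormatEpoch(epoch);
}

string key = "1646006400";
Console.WriteLine($"{key},{epochLookup[key]}");
// prints: 1646006400,28/02/2022 00:00:00
```

Using the invariant culture keeps the date separator and digits stable regardless of the machine's regional settings.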
MMSI Binary Dictionary Example: 13k8ff0,254947000,1,00,0000
- Format: Payload Segment (key, the first 7 chars of the 6-bit ASCII payload), MMSI, Msg ID, Repeat Binary, Remaining Binary from Segment
- Decodes MMSI, Msg ID
- Extracts Repeat Binary Vals (00, 01, 10, 11)
- Appends leftover binary data
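The decoding behind one dictionary entry can be sketched as below. `ToBits` and `DecodeSegment` are hypothetical helper names. The sketch relies on the standard AIS layout (6 bits of message ID, 2 bits of repeat indicator, 30 bits of MMSI), so a 7-character segment yields 42 bits with 4 bits left over.

```csharp
using System;
using System.Text;

// Expands AIS 6-bit ASCII characters into a bit string (6 bits per character).
string ToBits(string sixBitAscii)
{
    var bits = new StringBuilder(sixBitAscii.Length * 6);
    foreach (char c in sixBitAscii)
    {
        int value = c - 48;
        if (value > 40) value -= 8; // skip the gap in the AIS character table
        bits.Append(Convert.ToString(value, 2).PadLeft(6, '0'));
    }
    return bits.ToString();
}

// Builds one dictionary line: segment (key), MMSI, Msg ID, repeat bits, leftover bits.
string DecodeSegment(string segment)
{
    string bits = ToBits(segment);                        // 7 chars -> 42 bits
    int msgId = Convert.ToInt32(bits.Substring(0, 6), 2); // bits 0-5
    string repeat = bits.Substring(6, 2);                 // bits 6-7
    int mmsi = Convert.ToInt32(bits.Substring(8, 30), 2); // bits 8-37
    string leftover = bits.Substring(38);                 // bits 38-41
    return $"{segment},{mmsi},{msgId},{repeat},{leftover}";
}

Console.WriteLine(DecodeSegment("13k8ff0"));
// prints: 13k8ff0,254947000,1,00,0000
```

The leftover bits belong to the next field of the message, which is why they are carried forward rather than discarded.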
MMSI Dictionary Example: 254947000
- Format: MMSI (key)
- Faster version of a simple list
- Allows for easy expansion later if need arises for a lookup by MMSI
- Only run as a separate/stand-alone process for testing and timing observations
- Helper objects contain dictionaries for replacement lookups in later steps/processes
- Generally match the serialized dictionary files, though these can be optimized as needed in the helpers for specific functions/needs
Example: 254947000,1,00,0000,wh0Pfp9TI65`Vk7gl2<0t,28/02/2022 00:00:00
- Format: MMSI, Msg ID, Repeat Binary, Binary Segment, Remaining Payload (6-Bit ASCII), Date/Time
- Ingests 'first-pass' processed log lines
- Performs a fast hash-table replacement for the first 7 chars of the payload
- Performs a fast hash-table replacement for the epoch
- Prepends MMSI, Msg ID, Repeat Binary, and the leftover Binary Segment to the remaining payload characters (the initial 7 are removed)
- Appends the Date/Time
- Saves new log lines
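The second-pass steps above can be sketched as follows. The two dictionaries are hypothetical in-memory helpers pre-filled with the example values, and `TrySecondPass` is a made-up name. The segment length of 7 is taken from the dictionary key shown in the MMSI Binary Dictionary example.

```csharp
using System;
using System.Collections.Generic;

const int SegmentLength = 7; // length of the segment dictionary key

// Hypothetical helper dictionaries, pre-filled with the example values.
var segmentLookup = new Dictionary<string, string>
{
    ["13k8ff0"] = "254947000,1,00,0000"
};
var epochLookup = new Dictionary<string, string>
{
    ["1646006400"] = "28/02/2022 00:00:00"
};

// Rewrites one "payload,epoch" line using only dictionary lookups.
bool TrySecondPass(string firstPassLine, out string result)
{
    result = "";
    int comma = firstPassLine.LastIndexOf(',');
    if (comma < SegmentLength) return false; // payload too short or epoch missing

    string payload = firstPassLine.Substring(0, comma);
    string epoch = firstPassLine.Substring(comma + 1);

    if (!segmentLookup.TryGetValue(payload.Substring(0, SegmentLength), out var decoded))
        return false;
    if (!epochLookup.TryGetValue(epoch, out var dateTime))
        return false;

    // Decoded fields first, then the rest of the payload, then the date/time.
    result = $"{decoded},{payload.Substring(SegmentLength)},{dateTime}";
    return true;
}

bool replaced = TrySecondPass("13k8ff0wh0Pfp9TI65`Vk7gl2<0t,1646006400", out string newLine);
if (replaced)
    Console.WriteLine(newLine);
// prints: 254947000,1,00,0000,wh0Pfp9TI65`Vk7gl2<0t,28/02/2022 00:00:00
```

No decoding happens in this pass; a line whose segment or epoch is missing from the helpers is simply reported as not replaced.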
Format: MMSI, Msg ID, Repeat Binary, Binary Segment, Remaining Payload (6-Bit ASCII), Date/Time
254947000,1,00,0000,wh0Pfp9TI65`Vk7gl2<0t,28/02/2022 00:00:00
538007404,1,00,0000,02mPdn@pHIg@=GJon06sd,28/02/2022 00:00:00
247380270,18,00,0000,008;?q46Ci?;Q3wTUoP06,28/02/2022 00:00:00
477076700,1,00,0000,00HPdgr`HhuA2dR5n0@06,28/02/2022 00:00:00
247066850,1,00,0000,P0NPfF98I4r3nfOv02@0H,28/02/2022 00:00:00
538006934,1,00,0000,1030`c5@IAkooGpa00D1?,28/02/2022 00:00:00
Note: Only partial payload is left to decode (column 5)
- Log 1: 779k lines
- Log 2: 737k lines
- Total lines: 1.516m lines
Time: approx 3.3 sec
Time: approx 1.9 sec
Time: approx 0.2 sec
Time: approx 0.8 sec
Time: approx 1.0 sec
Time: approx 0.1 sec
Time: approx 1.4 sec