Create CSV to JSON Converter

The Batch Ingest API accepts records in JSON format, as an array of ULI payload items. For example:

POST http://localhost/uli-service/v1/ingest/M0000033
[
    {
        "MemberFullName": "ohai2 ohai",
        "MemberLastName": "ohai",
        "MemberFirstName": "ohai2",
        "MemberMiddleInitial": "ohai",
        "MemberNickname": "ohai",
        "MemberType": "ohai",
        "MemberNationalAssociationId": "ohai",
        "MemberStateLicense": "ohai",
        "MemberStateLicenseType": "ohai",
        "MemberStateLicenseState": "ohai",
        "MemberMlsId": "ohai",
        "OfficeName": "ohai",
        "OfficeMlsId": "ohai",
        "SourceSystemID": "ohai",
        "SourceSystemName": "ohai",
        "OriginatingSystemID": "ohai",
        "OriginatingSystemName": "ohai"
    }
]

To make the system easier to test, the UI should allow the user to upload a CSV file of membership data that matches the ULI Template Schema.

This means the data must be converted from CSV to JSON, either on the client side before it is submitted to the Batch Ingest API, or on the server by extending the Ingest API to accept CSV.

If the conversion is implemented in the API, there are two options: a) reuse the same endpoint, in which case the handler checks the content type and converts the body if it is CSV, or b) expose a different URL. Reusing the same URL is probably preferable if this is done in the ULI API rather than the UI.
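As a rough illustration of the client-side option, here is a minimal, dependency-free sketch of the conversion. The names `csvToUliJson` and `splitCsvLine` are hypothetical, and the sketch assumes the first CSV row is a header whose column names match the ULI Template Schema. It handles quoted cells and doubled quotes but not line breaks inside a quoted cell, so a real implementation would more likely use an established CSV parser.

```typescript
// Hypothetical helper types and names; not part of the existing service.
type UliRecord = Record<string, string>;

// Splits one CSV line into cells, honoring double-quoted cells and "" escapes.
function splitCsvLine(line: string): string[] {
  const cells: string[] = [];
  let cell = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        cell += '"';
        i++; // skip the second quote of the escaped pair
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        cell += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      cells.push(cell);
      cell = "";
    } else {
      cell += ch;
    }
  }
  cells.push(cell);
  return cells;
}

// Converts CSV text (header row first) into an array of ULI payload items.
function csvToUliJson(csv: string): UliRecord[] {
  const lines = csv.split(/\r?\n/).filter((l) => l.trim() !== "");
  if (lines.length === 0) return [];
  const headers = splitCsvLine(lines[0]).map((h) => h.trim());
  return lines.slice(1).map((line) => {
    const cells = splitCsvLine(line);
    const record: UliRecord = {};
    headers.forEach((h, i) => {
      record[h] = cells[i] ?? ""; // missing trailing cells become empty strings
    });
    return record;
  });
}
```

The resulting array can be serialized with `JSON.stringify` and sent as the body of the ingest request shown above.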

Set up Node and React Docker Containers

The goal in this issue is to set up a Docker environment for a NodeJS and Nginx server to host a React app.

Populate Elastic Instance with Simulated ULI Data

So that any developer can test the project locally, we'll need to be able to generate realistic test data.

To run the environment locally, see this guide: https://github.com/RESOStandards/uli-service/blob/main/docs/running-the-pilot.md
Simulated data will be inserted into the local Elastic instance when someone wants to bulk-upload data to test the system. Rather than the upload data button in #14, we would most likely have a "generate" data button that uses something like Faker to populate records in the format shown in the ULI proposal.
We want to be able to simulate the cases we know cause problems. For example, to test First Initial and First Name matching, the data set should include someone with a first name and someone with only a first initial, in one pairing that causes a collision and one that doesn't.
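The first-initial scenario could be seeded along these lines. This is only a sketch: `SimulatedMember`, `makeMember`, and `firstInitialCases` are hypothetical names, the record is trimmed to four of the ULI fields, and whether the matcher actually treats the second record as a collision and the third as distinct depends on the scoring weights, which are not defined here. Random values from Faker could replace the hard-coded inputs.

```typescript
// Hypothetical, trimmed-down record shape using a few of the ULI fields.
interface SimulatedMember {
  MemberFirstName: string;
  MemberLastName: string;
  MemberFullName: string;
  MemberStateLicense: string;
}

function makeMember(first: string, last: string, license: string): SimulatedMember {
  return {
    MemberFirstName: first,
    MemberLastName: last,
    MemberFullName: `${first} ${last}`,
    MemberStateLicense: license,
  };
}

// Returns three records for one test scenario:
//   [0] the base member with a full first name,
//   [1] the same person recorded with only a first initial (same license),
//   [2] a different person who shares the initial and last name (other license).
function firstInitialCases(base: SimulatedMember, otherLicense: string): SimulatedMember[] {
  const initial = base.MemberFirstName.charAt(0);
  return [
    base,
    makeMember(initial, base.MemberLastName, base.MemberStateLicense),
    makeMember(initial, base.MemberLastName, otherLicense),
  ];
}
```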

With the backend Elastic instance running, we'll want to have a Node/Express service that can connect to the instance and insert the simulated data. We'll want to make sure we simulate this with auth (bearer tokens):

POST /uli-service/v1/<providerUoi>/generate-test-data/[num-records]
  Authorization: Bearer <token>

We want the num-records parameter so we can load test the system with varying sizes of member data sets, up to 100,000.
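A handler for this route would need to read the bearer token and bound the requested record count. The sketch below shows only those two checks, framework-free. `extractBearerToken`, `parseNumRecords`, and the default of 100 records when the optional segment is omitted are assumptions for illustration; verifying the token itself is out of scope.

```typescript
const MAX_TEST_RECORDS = 100000; // upper bound stated in this issue

// Returns the token from an "Authorization: Bearer <token>" header, or null.
function extractBearerToken(header: string | undefined): string | null {
  const match = /^Bearer\s+(\S+)$/i.exec(header ?? "");
  return match ? match[1] : null;
}

// Parses the optional num-records path segment.
// The fallback of 100 is an assumed default, not something the issue specifies.
function parseNumRecords(raw: string | undefined, fallback: number = 100): number {
  if (raw === undefined) return fallback;
  if (!/^\d+$/.test(raw)) {
    throw new RangeError(`num-records must be a positive integer, got "${raw}"`);
  }
  const n = Number(raw);
  if (n < 1 || n > MAX_TEST_RECORDS) {
    throw new RangeError(`num-records must be between 1 and ${MAX_TEST_RECORDS}`);
  }
  return n;
}
```

In an Express route these would typically map a `null` token to a 401 response and a `RangeError` to a 400 response.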

We'd probably have a spinner or progress indicator on the frontend to show how far along the generation and matching process is.

Let's assume this will go into an index called uli-import. Once the instance is running, you can use the Elastic Node Client to POST data into that index. See the documentation on connecting to an Elastic instance.
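For large batches, Elasticsearch's Bulk API is the usual route. It takes newline-delimited JSON in which each document is preceded by an action line, and the body must end with a newline. The sketch below only builds that body; `toBulkNdjson` is a hypothetical helper, and how the body is handed to the Elastic Node Client depends on the client version in use, so that call is left out.

```typescript
// Builds a Bulk API request body that indexes every document into one index.
// Each document contributes two lines: the action line, then the source.
function toBulkNdjson(index: string, docs: object[]): string {
  const lines: string[] = [];
  for (const doc of docs) {
    lines.push(JSON.stringify({ index: { _index: index } }));
    lines.push(JSON.stringify(doc));
  }
  // The Bulk API requires the final line to be newline-terminated.
  return lines.length === 0 ? "" : lines.join("\n") + "\n";
}
```

For example, `toBulkNdjson("uli-import", records)` yields a body suitable for a `POST /_bulk` request with the `application/x-ndjson` content type.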

Add ULI Fields to Whitepaper

Add the following fields to the white paper so we can link to it from other information:

uli-service/uli-pilot-ingest.json

Lines 6 to 22 in f71787a

    "MemberFullName",
    "MemberLastName",
    "MemberFirstName",
    "MemberMiddleInitial",
    "MemberNickname",
    "MemberType",
    "MemberNationalAssociationId",
    "MemberStateLicense",
    "MemberStateLicenseType",
    "MemberStateLicenseState",
    "MemberMlsId",
    "OfficeName",
    "OfficeMlsId",
    "SourceSystemID",
    "SourceSystemName",
    "OriginatingSystemID",
    "OriginatingSystemName"

Create Mock for Initial Bulk Ingest

For the initial onboarding process, the ULI System needs to do bulk matching against the data set provided and find the following:

Potential duplicates within the provider's own data set
The goal is to create a rough, lightweight UI to show all items that are being processed in bulk format and their outcomes. This is for testing only, so that orgs can run the scoring algorithm on their data set and see how the matching works.

When a new system is being onboarded, we want them to de-duplicate their own data first before trying to match existing bad records against the larger set (in the long run we probably want the same sanity check for live, inbound streaming records).

We probably want to add an "Upload Roster" button to the page that accepts a CSV file, to make it easier for them to work with the system during initial testing. We'll want to check that the CSV they provide matches the schema of the template file.
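The header check could be as simple as comparing the first CSV row against the template's field names. In this sketch `checkCsvHeaders` and `HeaderCheck` are hypothetical names, header cells are assumed to contain no commas or quotes, and treating every template column as required is an assumption that the template file would need to confirm.

```typescript
interface HeaderCheck {
  missing: string[];    // template fields absent from the CSV header
  unexpected: string[]; // CSV columns that are not in the template
}

// Compares the CSV header row with the template's field names.
function checkCsvHeaders(headerLine: string, templateFields: string[]): HeaderCheck {
  const headers = headerLine.split(",").map((h) => h.trim());
  const template = new Set(templateFields);
  const present = new Set(headers);
  return {
    missing: templateFields.filter((f) => !present.has(f)),
    unexpected: headers.filter((h) => !template.has(h)),
  };
}
```

An upload would be accepted only when both arrays are empty; otherwise the UI can list the offending columns.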

This page would show all Assigned ULIs, Suggested ULIs, and ones that were still processing in a simple view.

Existing comps can be found here: https://drive.google.com/drive/folders/18nFqTI1ySaWf5i0kaS5QkaI8pvsH4Tik

Potential duplicates across other data sets
We'll deal with this in another issue, but we may want to allow providers to see potential matches against other data sets as well on this "bulk import" screen (this would normally be handled on the other matching screens in the Sample UI).
TODO: create separate issue for this.

Create ULI Pilot Ingest Service

Due Date: Demo on Jan 7 2022.

In order to onboard ULI Pilot participants and test the matching algorithm, we need to be able to onboard each org one at a time and return the possible duplicates and their confidence scores in a payload.

The matching algorithm that this service uses will also be queried from an interactive search form down the road, and should share the same search functions.

ULI Service

Create an endpoint that can ingest new data from a given Provider UOI.

A request to this endpoint would look something like the following:

POST /uli-service/v1/<providerUoi>
{
  "value": [
    {
      "MemberFullName": "",
      "MemberLastName": "",
      "MemberFirstName": "",
      "MemberMiddleInitial": "",
      "MemberNickname": "",
      "MemberType": "",
      "MemberNationalAssociationId": "",
      "MemberStateLicense": "",
      "MemberLicenseType": "",
      "MemberStateLicenseState": "",
      "MemberAddress1": "",
      "MemberCity": "",
      "MemberStateOrProvince": "",
      "MemberPostalCode": "",
      "OfficeName": "",
      "OfficeMlsId": "",
      "OfficeAddress1": "",
      "OfficeCity": "",
      "OfficeStateOrProvince": "",
      "OfficePostalCode": "",
      "OrganizationUniqueId": "",
      "OrganizationName": ""
    }
  ]
}

Response:

  HTTP/2 200 OK
  {
    "status": "Import Queued",
    "numRecords": 134,
    "eventQueueUrl": "/uli-service/v1/<providerUoi>/processing",
    "method": "GET"
  }

Where each item in the value array is a ULI data structure matching the above schema.

As such, schema validation should be performed to ensure that the fields present are a subset of the dictionary fields above and that they are passed in the value array. If validation succeeds, respond with a 200; if not, respond with a 400 and report which field(s) caused the error. Use ajv or Yup for schema validation.
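To make the intended checks concrete, here is a dependency-free sketch of the validation rules. In the service itself these rules would be expressed as an ajv or Yup schema, as noted above. `validateIngestBody` and `ValidationResult` are hypothetical names, and requiring at least one field per item and string values for every field are assumptions.

```typescript
// Field names taken from the sample request in this issue.
const ULI_FIELDS = new Set<string>([
  "MemberFullName", "MemberLastName", "MemberFirstName", "MemberMiddleInitial",
  "MemberNickname", "MemberType", "MemberNationalAssociationId",
  "MemberStateLicense", "MemberLicenseType", "MemberStateLicenseState",
  "MemberAddress1", "MemberCity", "MemberStateOrProvince", "MemberPostalCode",
  "OfficeName", "OfficeMlsId", "OfficeAddress1", "OfficeCity",
  "OfficeStateOrProvince", "OfficePostalCode", "OrganizationUniqueId",
  "OrganizationName",
]);

interface ValidationResult {
  status: 200 | 400;
  errors: string[]; // one message per offending field, empty when valid
}

function validateIngestBody(body: unknown): ValidationResult {
  const value = (body as { value?: unknown } | null)?.value;
  if (!Array.isArray(value)) {
    return { status: 400, errors: ["value: expected an array"] };
  }
  const errors: string[] = [];
  value.forEach((item: unknown, i: number) => {
    if (typeof item !== "object" || item === null || Array.isArray(item)) {
      errors.push(`value[${i}]: expected an object`);
      return;
    }
    const fields = item as Record<string, unknown>;
    const keys = Object.keys(fields);
    if (keys.length === 0) errors.push(`value[${i}]: no fields present`);
    for (const key of keys) {
      if (!ULI_FIELDS.has(key)) {
        errors.push(`value[${i}].${key}: unknown field`);
      } else if (typeof fields[key] !== "string") {
        errors.push(`value[${i}].${key}: expected a string`);
      }
    }
  });
  return { status: errors.length === 0 ? 200 : 400, errors };
}
```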

Sync Orgs with UOI Production Sheet

We should also validate UOIs against the reference sheet. This will mean we'll have a cache of orgs running locally on the API server.

Later on, we'll need the ability to refresh from the UOI sheet. We could take this from the Cert API - just the Sync service and UI. We don't need to do this by the Jan 7 demo though. For now, have a singleton service that populates the cache the first time it's used, and then returns what's there until restarted. If we reuse what's in Cert then we can deal with it that way.
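The "populate on first use, then serve from memory until restart" behavior might look like the following sketch. `UoiCache` and `OrgLoader` are hypothetical names, and the loader that actually reads the UOI sheet is passed in rather than implemented here.

```typescript
// Hypothetical loader signature: resolves to the list of known UOIs.
type OrgLoader = () => Promise<string[]>;

class UoiCache {
  // Holds the in-flight or completed load so the sheet is read only once.
  private orgs: Promise<Set<string>> | null = null;

  constructor(private readonly loader: OrgLoader) {}

  private load(): Promise<Set<string>> {
    if (this.orgs === null) {
      this.orgs = this.loader().then((ids) => new Set(ids));
    }
    return this.orgs;
  }

  // True when the UOI appears in the cached reference data.
  async isKnownUoi(uoi: string): Promise<boolean> {
    return (await this.load()).has(uoi);
  }
}
```

Caching the promise rather than the resolved set means concurrent first requests share a single load. A later refresh feature could simply reset `orgs` to `null`.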

Event Types: ULI Assigned, ULI Suggested, Processing

Once the ingest job has started, a request to the queue won't return records right away; it will take some time before there are results. We'll need some kind of notification in the future.

Once each record gets pushed, then we'll run a scavenging job on them against what's already in the system.

In order to do this, we'll dynamically form a query based on the information each record contains according to the matching formula, and be able to support a variable set of weights through a separate index that would eventually have its own UI in production.
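One way to form such a query is a `bool` query whose `should` clauses are `match` queries boosted by per-field weights, skipping fields the record leaves empty. The sketch below assumes that shape; `buildMatchQuery` and the weight values are hypothetical, and in the design described above the weights would be read from the separate index. Note that boosts influence Elasticsearch's relevance score, which is not a normalized confidence value, so mapping it onto a 0 to 1 score is a separate step not shown here.

```typescript
// Hypothetical weight table: field name -> boost.
type Weights = Record<string, number>;

// Builds an Elasticsearch search body from the fields a record actually has.
function buildMatchQuery(record: Record<string, string>, weights: Weights) {
  const should = Object.keys(weights)
    .filter((field) => (record[field] ?? "").trim() !== "") // skip empty fields
    .map((field) => ({
      match: { [field]: { query: record[field], boost: weights[field] } },
    }));
  return { query: { bool: { should, minimum_should_match: 1 } } };
}
```

For a record with only `MemberFullName` populated, the result contains a single `match` clause for that field.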

In the case where a new ULI can be created or an existing match is suggested, we'll use a format similar to the following:

   urn:reso:uli:f81d4fae-7dec-11d0-a765-00a0c91e6bf6

which is a UUID that uses the RESO URN namespace (3.4.3).
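A small helper can keep the identifier format consistent across the service. This sketch assumes lowercase hexadecimal UUIDs as in the example above; `toUliUrn`, `isUliUrn`, and `ULI_URN_PATTERN` are hypothetical names, and generating the UUID itself is left to whatever UUID library the service adopts.

```typescript
// Matches urn:reso:uli:<uuid> with a lowercase, hyphenated UUID.
const ULI_URN_PATTERN =
  /^urn:reso:uli:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/;

function isUliUrn(value: string): boolean {
  return ULI_URN_PATTERN.test(value);
}

// Wraps an existing UUID in the ULI URN form, normalizing it to lowercase.
function toUliUrn(uuid: string): string {
  const urn = `urn:reso:uli:${uuid.toLowerCase()}`;
  if (!isUliUrn(urn)) {
    throw new Error(`not a valid UUID: ${uuid}`);
  }
  return urn;
}
```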

The API will classify the matching events into event types such as ULI Assigned and ULI Suggested.

ULI Assigned - in this case, a ULI was assigned since there was no matching record within the configurable confidence threshold. A confidence score will be shown for the item including the fields that were matched on. A ULI can also be assigned through the review process.
ULI Suggested - for each inbound record there may be one or more suggested ULIs pertaining to that record. They will also be shown with their confidence scores and which fields they match on.
Processing - we will also likely want access to the records currently being processed as well, so there should be a third event type of Processing, but it shouldn't be returned unless asked for.
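The threshold rule behind the first two event types can be sketched as follows. `classify`, `Candidate`, and `Outcome` are hypothetical names, and the sketch assumes candidate scores have already been normalized to the 0 to 1 range used in the examples in this issue.

```typescript
interface Candidate {
  uli: string;
  score: number; // assumed to be normalized to 0..1
}

type Outcome =
  | { eventType: "ULI Assigned" }
  | { eventType: "ULI Suggested"; suggestions: Candidate[] };

// No candidate at or above the threshold: a new ULI is assigned.
// Otherwise: the qualifying candidates are suggested, best score first.
function classify(candidates: Candidate[], threshold: number): Outcome {
  const suggestions = candidates
    .filter((c) => c.score >= threshold)
    .sort((a, b) => b.score - a.score);
  return suggestions.length === 0
    ? { eventType: "ULI Assigned" }
    : { eventType: "ULI Suggested", suggestions };
}
```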

There may be other events added in the future, but this is a good start.

As such, we'll want the root path to also take an optional eventType parameter:

  GET /uli-service/v1/<providerUoi>[/<uli-assigned|possible-match|processing>]

  {
    "value": [
      {
        "EventKey": "uniqueKey1",
        "EventType": "ULI Assigned",
        "ULI": "urn:reso:uli:f81d4fae-7dec-11d0-a765-00a0c91e6bf6",
        "OriginatingSystemUsi": "<providerUsi>",
        "PathToOriginalRecord": "/uli-service/v1/<providerUoi>/ingest/<key>",
        "Score": 0.89,
        "ScoringFactors": {
           "MemberFullName": 0.55,
           "MemberNationalAssociationId": 0.34
        },
        "ModificationTimestamp": "2022-01-04T00:39:56Z"
      }, {
        "EventKey": "uniqueKey2",
        "EventType": "Possible Match",
        "ULI": "urn:reso:uli:f81d4fae-7dec-11d0-a765-00a0c91e6b00",
        "OriginatingSystemUsi": "<providerUsi>",
        "PathToOriginalRecord": "/uli-service/v1/<providerUoi>/ingest/<key>",
        "Score": 0.99,
        "ScoringFactors": {
           "MemberFullName": 0.85,
           "MemberNationalAssociationId": 0.14
        },
        "ModificationTimestamp": "2022-01-03T00:39:56Z"
      }, {
        "EventKey": "uniqueKey3",
        "EventType": "Possible Match",
        "ULI": "urn:reso:uli:f81d4fae-7dec-11d0-a765-00a0c91e6b00",
        "OriginatingSystemUsi": "<providerUsi>",
        "PathToOriginalRecord": "/uli-service/v1/<providerUoi>/ingest/<key>",
        "Score": 0.82,
        "ScoringFactors": {
           "MemberStateLicenseState": 0.40,
           "MemberStateLicense": 0.33,
           "MemberFullName": 0.09
        },
        "ModificationTimestamp": "2022-01-03T00:39:56Z"
      }
    ]
  }

Omitting the event type returns all items except Processing.
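The filtering rule for the optional path segment could be implemented roughly as below. The slugs and the "Possible Match" label follow the sample request and response in this issue; `filterEvents`, `UliEvent`, and `SLUG_TO_TYPE` are hypothetical names, and the event shape is trimmed to the two fields the filter needs.

```typescript
type EventType = "ULI Assigned" | "Possible Match" | "Processing";

interface UliEvent {
  EventKey: string;
  EventType: EventType;
}

// Maps the path segment to the event type it selects.
const SLUG_TO_TYPE = new Map<string, EventType>([
  ["uli-assigned", "ULI Assigned"],
  ["possible-match", "Possible Match"],
  ["processing", "Processing"],
]);

// With no slug, Processing events are hidden; with a slug, only that type is returned.
function filterEvents(events: UliEvent[], slug?: string): UliEvent[] {
  if (slug === undefined) {
    return events.filter((e) => e.EventType !== "Processing");
  }
  const wanted = SLUG_TO_TYPE.get(slug);
  if (wanted === undefined) {
    throw new RangeError(`unknown event type: ${slug}`);
  }
  return events.filter((e) => e.EventType === wanted);
}
```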

The ScoringFactors data should be accessible by having Elastic explain each query. It's not required for MVP though, just a nice to have.

Add White Paper

The ULI Pilot R&D process will produce a white paper containing the following information:

Business and Outreach

Technical

Outline of the methodology. How are we doing things differently? 1) Insight: it's a search and matching problem rather than a conventional data model or transport standards problem. Describe the search methodology and how it addresses the issues. 2) Novel approach: a crowd-sourced, cooperative data scrubber with human review, supplemented by probabilistic, consensus-based ranking using open source algorithms. The ability to adjust weights based on success rates, error rates, and analytics provides a feedback loop that continues to improve the results over time.
Reproducible results to back up our claims that it's a viable solution. This has been tested on M markets with L licensees, and we were able to match with an average score of S, with error rate E.
Potential next steps to put it into production. Benefits from shared (neutral) aggregate pool. Make a nice diagram, etc. Rough estimate of resources required to do so.
Additional opportunities: same methodology can be used to help de-duplicate listings into their underlying property records. Or anything that has a collection of weighted fields that can be scored and matched against using this approach.
