The smooshr from tsdataclinic

Display selected categories somewhere in the interface

Can use suggestions here from the original design

Set up CI

Once we know where the server is going to run, we will need to set up CI to be able to deploy it easily.

GitHub seems to have a new Actions service. Check this out for the deploy!

Set up an initial clustering guess

Attempt to use the word embeddings to generate a starting point for possible categories.

Open questions here:

Can we prompt users to suggest category starting points, basically seeds for the clustering algorithm?
How do we define a catch-all category that picks out everything that is nothing like anything else?
How do we select the number of clusters? Do we try to do that automatically? If not, how do we inform the user?

- [ ] Can clustering be done interactively on the remaining categories?

Fix the weird flexbox issues we have been having with scroll overflow

There are a few places where flexbox and scroll overflow are not playing particularly well together. Need to resolve this.

Explore storing entire file in IndexedDB with storage manager

Could be interesting to give this a go. It would allow for some more interesting free-text entry cleaning.

Switch to using the app layout from @dataclinic/dataclinic

Currently, the project uses a custom layout. Change this to use the side bar, footer, main area components here:

https://github.com/tsdataclinic/DataClinicComponents/blob/master/packages/app-layout/src/AppLayout.tsx

Investigate negative mappings issue

Adding to negative mappings occasionally fails. Not sure why; need to check that out.

Either disable multiple file uploads at the same time or fix multiple file uploads

Currently the file loader is stuck in between: it works OK with one dataset, yet it allows multiple files to be uploaded without handling them properly.

We should be able to support multiple uploads pretty well, given that the CSV parser generates multiple web workers to do the parsing, so uploading 2 files shouldn't affect performance too much.

Currently, the word embedding data set we are using is over 3.2 GB of data. This gets loaded into memory by gensim when the server starts up. This takes a while and is not ideal if we want to move this process to a worker, for example, as each worker would need to load the data into memory.

Instead we should look to see if the embedding can be loaded into a database and queried, or even just a key-value store. This will reduce the overall memory usage of the app and load times.

This wouldn't need to do much more than look up a given word and return its vector, seeing as we are doing most of the similarity / clustering client side.

Batch request embeddings from the server for performance improvement

Currently we send one request per unique word to the embedding server to get that word's embedding vector.

The server supports sending multiple words at a time and getting back the results. We should chunk up the requests to make fewer API calls which should make the embedding fetching quicker.

smooshr/src/utils/calc_embedings.js

Lines 1 to 20 in 8b11ccb

    
    const get_embedings_from_server = entries => {
      let unique_words = new Set();
      entries.forEach(entry => {
        entry.name.split(' ').forEach(word => {
          unique_words.add(word);
        });
      });
      return Promise.all(
        Array.from(unique_words).map(entry =>
          fetch(
            `${
              process.env.REACT_APP_API_URL
            }/embedding/${entry.toLowerCase().replace(/[\W_]+/g, '')}`,
          )
            .then(r => r.json())
            .then(r => r[0]),
        ),
      );
    };

This is the function that will need to be modified to run the queries in batches and then correctly assign the results once each batch has returned.
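
A minimal sketch of what the batched version could look like. It assumes the server accepts a comma-separated list of words on the same `/embedding/` route and returns a list of `{key, embedding}` objects, as the server code in this issue suggests. The `batchSize` and `fetchFn` parameters and the `getEmbeddingsInBatches` name are hypothetical, added only for illustration.

```javascript
// Split an array into consecutive chunks of at most `size` items.
const chunk = (items, size) => {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
};

// Normalize a word the same way the current per-word request does.
const normalize = word => word.toLowerCase().replace(/[\W_]+/g, '');

// Fetch embeddings in batches. `batchSize` and `fetchFn` are hypothetical
// parameters; `fetchFn` defaults to the global fetch.
const getEmbeddingsInBatches = (entries, apiUrl, batchSize = 50, fetchFn = fetch) => {
  const uniqueWords = new Set();
  entries.forEach(entry => {
    entry.name.split(' ').forEach(word => {
      const cleaned = normalize(word);
      if (cleaned) uniqueWords.add(cleaned);
    });
  });

  // One request per chunk of words instead of one request per word.
  const requests = chunk(Array.from(uniqueWords), batchSize).map(words =>
    fetchFn(`${apiUrl}/embedding/${words.join(',')}`).then(r => r.json()),
  );

  // Merge all batches into a Map keyed by word, so words that are missing
  // from the corpus are simply absent rather than shifting results around.
  return Promise.all(requests).then(batches => {
    const byWord = new Map();
    batches.flat().forEach(({key, embedding}) => byWord.set(key, embedding));
    return byWord;
  });
};
```

Returning a `Map` keyed by word, rather than an array in request order, is one way to keep the assignment correct when some words come back without an embedding.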

Things to consider:

The server might fail if one or more of the words does not have a representation in the corpus. We would need to fix that here:

smooshr/server/server.py

Lines 66 to 80 in 8b11ccb

    
    @app.route('/embedding/<words>')
    def embeding(words):
        conn = get_db()
        try:
            words = words.split(',')
            sql = "select * from embeddings where key in ({seq})".format(seq=','.join(['?']*len(words)))
            result = conn.execute(sql, words)
            result = [[r[0], r[1].tolist()] for r in result]
            result = [{"key": key, "embedding": embed} for key, embed in dict(result).items()]
            return jsonify(result)
        except:
            return jsonify([])

    if __name__=='__main__':
        print('starting up server')
        app.run(host='0.0.0.0', port=5000, debug=True)

It would also be good to give some feedback on this process in the classification interface, to let a user know how much of the embedding has been loaded.

Define a schema for the different components of the smooshr data model

As we move to a different storage system and way of representing operations on a dataset, we will need a more robust schema. Currently, the very simple schema we have is:

Project: Contains multiple datasets
Dataset: Represents the full dataset as a set of summary data and multiple Columns and MetaColumns
Column: Represents a column in the original dataset; has a name and a list of unique entries
MetaColumn: A simple way of treating two columns as one; this ultimately gets merged into a single column when we run the code output
Entry: A unique entry in a column, which has a value and the number of times it occurs in that column
Mapping: A collection of entries for a specific column that will be mapped to another value

We probably want to rethink this schema to make it a lot more robust to other tasks we want to run in smooshr.
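
To make the discussion concrete, here is a rough sketch of the current schema as plain objects. All field names (`id`, `name`, `count`, `columnIds`, `entryNames`, `newValue`) and the helper names are assumptions for illustration, apart from `entry.name`, which the existing embedding code already reads. Attaching mappings at the project level is also an assumption.

```javascript
// Hypothetical factory helpers mirroring the schema listed above.
const makeEntry = (name, count) => ({name, count});

const makeColumn = (id, name, entries) => ({id, name, entries});

// A MetaColumn treats several columns as one by referencing their ids.
const makeMetaColumn = (id, name, columnIds) => ({id, name, columnIds});

// A Mapping sends a collection of entry names in one column to a new value.
const makeMapping = (columnId, newValue, entryNames) => ({
  columnId,
  newValue,
  entryNames,
});

const makeDataset = (id, name, columns, metaColumns = []) => ({
  id,
  name,
  columns,
  metaColumns,
});

const makeProject = (id, name, datasets, mappings = []) => ({
  id,
  name,
  datasets,
  mappings,
});

// Apply the mappings for one column to a raw value, falling back to the
// original value when no mapping covers it.
const applyMappings = (mappings, columnId, value) => {
  const match = mappings.find(
    m => m.columnId === columnId && m.entryNames.includes(value),
  );
  return match ? match.newValue : value;
};
```

Writing the shapes down like this might make it easier to spot where the schema falls short for the other tasks we have in mind.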

Align with data packages in terms of project structure and output format.

Look at using data packages for the final output of the project and for project description! https://frictionlessdata.io/data-packages/

Switch to using the typography classes in smooshr rather than standard HTML tags.

Currently, we are using custom styling for the text in smooshr. We should move to using the typography elements from the data clinic component library

Update the finished text to something more appropriate.

Resize modals to be a little more in line with their content

A bunch of the modals are too big. Should fix this.

Allow loading of projects

Currently we can save a project but not load it.

This should be pretty simple and will help with facilitating sharing on the community server when we have it.

Create concept of a meta-column

Currently we don't have a way to combine columns from multiple datasets. We need to create some kind of meta-column entry that can reference columns from multiple datasets as the same column in the rest of the app, combining entries from each, for example.

Open questions here:

Do meta columns support concatenation of columns within a dataset? Or extraction of data from those columns?
Even if two columns from separate files are selected as the same, are there mappings that need to be file-specific for each of them? I don't think so, but need to check this.
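
As a rough illustration of "combining entries from each", a meta-column could expose a merged entry list in which counts for the same value are summed across the referenced columns. The `{name, count}` entry shape and the `mergeColumnEntries` name are assumptions.

```javascript
// Combine the unique entries of several columns into one list, summing the
// counts of entries that share the same name.
const mergeColumnEntries = columns => {
  const totals = new Map();
  columns.forEach(column => {
    column.entries.forEach(({name, count}) => {
      totals.set(name, (totals.get(name) || 0) + count);
    });
  });
  return Array.from(totals, ([name, count]) => ({name, count}));
};
```

This sketch treats values from different files as identical whenever their text matches, which is exactly the behavior the second open question asks about.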

[New Feature]: Building crosswalks using smooshr

Given the existing mapping functionality, I imagine we could extend it to build crosswalks by mapping (one-to-one, one-to-many?) entities from a column in one dataset to entities in a column from another dataset.

Add DC Favicon

Importing data from NYC Open Data Portal appears broken

Every set I've tried results in '0 rows and 0 columns'

Add prompts everywhere to help with more intuitive flow

Move to using IndexedDB instead of local storage

Currently we are using local storage to store the project definitions offline. It would be good to move this to IndexedDB to allow more space (50 MB vs 5 MB). Dexie might be a good way to do this:

https://dexie.org/

Might be interesting to move all state management there? Not sure how this interacts with React.

Interesting example here: https://github.com/dfahlander/Dexie.js/tree/master/samples/react-redux

Cull unused code

There is a bunch of unused code that we should purge for launch.

Create Project level abstraction

Create a catch-all project abstraction that can reference multiple data files, with the mappings scoped to that entity.

Explore moving to TypeScript

As this becomes more complicated, it might be worth investing the time to move the project to TypeScript. Some basic type checking might help as we grow the project.

Change the server for embeddings to SQLite

Making the embeddings more portable will make deploys easier. Currently we are using PostgreSQL, which is perhaps more than we need. Moving to SQLite might make more sense here, as we can simply download the .sqlite file and run to get going.

Investigate different ways of storing the data smooshr is using

Currently, smooshr uses in-memory storage to represent a dataset while users are working on it.

As we move to more sophisticated analysis, we might need to rethink how we do this in a more efficient way.

Some options are:

Using IndexedDB, the browser's built-in database system, probably through a library like Dexie. Note we currently use IndexedDB as dumb offline storage, but this would move it to a more structured database.
Using sql.js, a compiled version of SQLite that runs in WebAssembly and provides basically native performance in the browser. We would still need to figure out how to store the SQLite database offline, but this could give us a really nice, flexible interface (SQL) for performing operations on the datasets.
Something else? The local files API is worth keeping an eye on (https://web.dev/file-system-access/), as it would let us save and read projects in a similar way to a native app.

Explore undo functionality

Explore how to do this using the React context API. Time travel like this is doable with Redux; not sure about the context API.
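
One possible approach that does not need Redux: wrap the existing reducer in a higher-order reducer that keeps `past`, `present` and `future` states. The `undoable` helper and the `'UNDO'` / `'REDO'` action types below are hypothetical names; this is a sketch, not how smooshr currently manages state.

```javascript
// Wrap a reducer so that its state keeps a history of past and future values.
const undoable = reducer => {
  const initialState = present => ({past: [], present, future: []});

  const undoableReducer = (state, action) => {
    const {past, present, future} = state;
    switch (action.type) {
      case 'UNDO':
        if (past.length === 0) return state;
        return {
          past: past.slice(0, -1),
          present: past[past.length - 1],
          future: [present, ...future],
        };
      case 'REDO':
        if (future.length === 0) return state;
        return {
          past: [...past, present],
          present: future[0],
          future: future.slice(1),
        };
      default: {
        const next = reducer(present, action);
        if (next === present) return state;
        // Any new change clears the redo stack.
        return {past: [...past, present], present: next, future: []};
      }
    }
  };

  return {undoableReducer, initialState};
};
```

Because the result is an ordinary reducer, it could in principle be passed to React's `useReducer` and shared through a context provider, with the rest of the app reading `state.present`.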

Tweak loading for URL and open data options to indicate loading without progress

Right now we can't give progress updates on URL or Open Data sources. This is because we can't do a range request against these resources, which is what Papa Parse uses to stream data. We either need to implement this streaming on the proxy server somehow or figure out some other solution. For now we can caution that a dataset might take a while to read and display a spinner rather than a progress bar.

Explore collaborative work using a P2P system.

Currently we have no way to collaborate on a taxonomy between multiple machines. We don't want to have a centralized store of the data, so if we are to implement this, we probably want to use a P2P system.

Explore distributing state using OrbitDB or something similar.

Add progress bar to file loading

We can probably figure out how to add a progress bar to the file loading modal by monitoring how many bytes we have read in already. This would make for a nicer experience, as you could see how long is left when loading a file.

Tweak the design to bring in line with Newerhoods

Currently the design is using some Data Clinic colors, but we should bring the general design into line with Newerhoods.

Delete project button

Currently there is no way to delete a project. Should be easy enough to fix.

Allow metadata to be added to new columns / mappings and bundle this with results

It would be a shame if the new mappings and columns generated by smooshr did not also come with metadata. We should encourage people to add metadata to the new mappings and export this automatically with the mappings and results.

Figure out what to persist in terms of public projects on the backend

It would be great to be able to share projects on the site, showcase how people are using it. Need to figure out how to persist enough of the project to make this happen.

Add ability to pull datasets from URLs

Currently, all datasets need to be loaded locally. It would be interesting to have datasets be reference-able by a URL instead. This would let us have projects that can pull from and tidy open data specifically, perhaps making those mappings public then.

Auto select new mapping when created

On the mappings page, auto select a new mapping when it is created.

Install the DataClinic components library

Super simple: add @dataclinic/dataclinic to the package.json and wrap the application in the DCThemeProvider.

Make drop zone bigger

Currently the drop zone for files is just the text. Expand this to use the entire box.

Paginate the list of entries to improve performance

Currently the classification page creates a card for each entry in the datasets and displays it as part of a long list. This leads to performance hits when the number of entries is > a few thousand.

To fix this we should paginate the list, either through an infinite scroll or through explicit pages.

Code that would need to be changed for this lives here: https://github.com/tsdataclinic/smooshr/blob/master/src/pages/ColumnPage.js
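
For the explicit-pages option, the core logic could be as small as the sketch below. The `paginate` name and the default `pageSize` of 100 are assumptions; the page would then render cards only for `items`.

```javascript
// Return one page of entries plus the total number of pages.
// `page` is zero-based and is clamped to the valid range.
const paginate = (entries, page, pageSize = 100) => {
  const totalPages = Math.max(1, Math.ceil(entries.length / pageSize));
  const safePage = Math.min(Math.max(page, 0), totalPages - 1);
  const start = safePage * pageSize;
  return {
    page: safePage,
    totalPages,
    items: entries.slice(start, start + pageSize),
  };
};
```

An infinite scroll could reuse the same function by appending successive pages as the user scrolls.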

Add ability to reject suggestions to better define future ones

Currently, the word embedding system only looks for things near the mean of the current entries in the category selection. It would be great to have another button that explicitly removes the suggestion from the list and incorporates that embedding into the suggestions search in a negative way.
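
One way this could work, sketched under assumptions: score each candidate by its cosine similarity to the mean of the accepted entries, minus a penalty proportional to its similarity to the mean of the rejected ones. The `penalty` weight, the `{name, embedding}` candidate shape and all function names here are hypothetical, and this is not necessarily how the current suggestion search is implemented.

```javascript
// Element-wise mean of a non-empty list of equal-length vectors.
const meanVector = vectors => {
  const sum = new Array(vectors[0].length).fill(0);
  vectors.forEach(v => v.forEach((x, i) => { sum[i] += x; }));
  return sum.map(x => x / vectors.length);
};

const cosineSimilarity = (a, b) => {
  const dot = a.reduce((acc, x, i) => acc + x * b[i], 0);
  const norm = v => Math.sqrt(v.reduce((acc, x) => acc + x * x, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
};

// Rank candidates by similarity to the accepted mean, penalized by
// similarity to the rejected mean. `accepted` must be non-empty.
const rankSuggestions = (candidates, accepted, rejected, penalty = 0.5) => {
  const positive = meanVector(accepted);
  const negative = rejected.length > 0 ? meanVector(rejected) : null;
  return candidates
    .map(({name, embedding}) => ({
      name,
      score:
        cosineSimilarity(embedding, positive) -
        (negative ? penalty * cosineSimilarity(embedding, negative) : 0),
    }))
    .sort((a, b) => b.score - a.score);
};
```

With no rejections the ranking reduces to plain similarity to the accepted mean, so the button would only change results once a user starts rejecting suggestions.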

Implement new design

Implement the new design

Smooshr - V1 feedback.pdf

Add prompt text and potential first category when no mappings defined

This will help with workflow and getting people to understand what to do at this point.

Simplify docker setup for backend

Currently we have a Flask app + Redis + Celery and workers. This was because we anticipated doing much more of this work on the backend, but seeing as we aren't, we should remove these dependencies.

Make "Create mappings" and merge options more clear in Project UI

Add ability to split entries in a column by some value

Some datasets have concatenated categories that you want to blow out into multiple categories.

Need something that explodes the categories:

"dog;cat;bird" -> "dog", "cat", "bird"

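A small sketch of the explode step, assuming entries look like `{name, count}` and that each part inherits the count of the entry it came from. The `splitEntries` name and the default `';'` separator are illustrative only.

```javascript
// Split each entry on `separator` and re-aggregate the counts of the parts.
const splitEntries = (entries, separator = ';') => {
  const totals = new Map();
  entries.forEach(({name, count}) => {
    name
      .split(separator)
      .map(part => part.trim())
      .filter(part => part.length > 0)
      .forEach(part => {
        totals.set(part, (totals.get(part) || 0) + count);
      });
  });
  return Array.from(totals, ([name, count]) => ({name, count}));
};
```

Parts that already exist as standalone entries are merged with them, so "cat" from "dog;cat;bird" adds to the count of an existing "cat" entry.
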
Fix click targets for the project boxes

Project box click targets are a little fiddly just now. Clean these up

Investigate different models for describing an analysis flow using a DAG or similar structure.

We currently only have 2 types of operation in smooshr:

Combine columns together
Create a taxonomy for a given column

In the future we would like to have more steps, for example:

Extract part of a column as a new column. For example an address like "23 Some Street, Some City, US, 11221" -> "Some City" to
Standardize a time column
Merge the contents of two columns together to form a new column
Do entity matching on a given column
etc

Some of these steps will have dependencies on previous steps that are hard to predict at run time. It would be great to have each individual transform defined as a node in a graph, with dependencies linked by edges. Essentially a DAG.

This would inform the UI and the Python code that is ultimately spit out by the tool.
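
To illustrate the idea, the sketch below orders a set of transform nodes so that every node runs after the nodes it depends on (a topological sort using Kahn's algorithm). The node shape `{id, dependsOn}` and the `orderTransforms` name are assumptions, not an existing smooshr structure.

```javascript
// Each node is a transform; `dependsOn` lists ids of nodes that must run first.
const orderTransforms = nodes => {
  const byId = new Map(nodes.map(node => [node.id, node]));
  const remaining = new Map(nodes.map(node => [node.id, node.dependsOn.length]));
  const dependents = new Map(nodes.map(node => [node.id, []]));
  nodes.forEach(node => {
    node.dependsOn.forEach(dep => {
      if (!byId.has(dep)) throw new Error(`Unknown dependency: ${dep}`);
      dependents.get(dep).push(node.id);
    });
  });

  // Start with the nodes that have no dependencies at all.
  const ready = nodes.filter(n => n.dependsOn.length === 0).map(n => n.id);
  const order = [];
  while (ready.length > 0) {
    const id = ready.shift();
    order.push(id);
    dependents.get(id).forEach(next => {
      remaining.set(next, remaining.get(next) - 1);
      if (remaining.get(next) === 0) ready.push(next);
    });
  }
  // If some nodes were never released, the graph contains a cycle.
  if (order.length !== nodes.length) throw new Error('Cycle detected');
  return order;
};
```

The resulting order could drive both the sequence of steps shown in the UI and the order of statements in the generated code.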

Some links to projects that might be worth looking at

