tsdataclinic / smooshr
Tool to consolidate entries and columns from multiple datasets
Home Page: https://tsdataclinic.github.io/smooshr/
License: Apache License 2.0
Can use suggestions here from the original design
Once we know where the server is going to run, we will need to set up CI to be able to deploy it easily.
GitHub seems to have a new Actions service. Check this out for the deploy!
Attempt to use the word embeddings to generate a starting point for possible categories.
Open questions here:
Can we prompt users to suggest category starting points, basically seeds for the clustering algorithm?
How do we define a catch-all category that picks out everything that is nothing like anything else?
How do we select the number of clusters? Do we try to do that automatically? If not, how do we inform the user?
-[ ] Can clustering be done interactively on the remaining categories?
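One way to explore the seed and catch-all questions is a single assignment pass: each entry goes to the most similar user-supplied seed, and anything below a similarity cutoff falls into a catch-all bucket. This is only a sketch, assuming entries and seeds already have embedding vectors; `assignToSeeds`, the `threshold` default and the `other` label are hypothetical names, not part of smooshr.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Assign each entry to its most similar seed category. Entries whose best
// similarity is below `threshold` end up in the catch-all category.
// `entries` and `seeds` are plain objects mapping a name to a vector
// (an assumed shape, chosen for illustration only).
function assignToSeeds(entries, seeds, threshold = 0.5, catchAll = 'other') {
  const assignment = {};
  for (const [entry, vector] of Object.entries(entries)) {
    let best = catchAll;
    let bestScore = threshold;
    for (const [seed, seedVector] of Object.entries(seeds)) {
      const score = cosineSimilarity(vector, seedVector);
      if (score >= bestScore) {
        best = seed;
        bestScore = score;
      }
    }
    assignment[entry] = best;
  }
  return assignment;
}
```

The cutoff would probably need to be tunable in the UI, since a sensible value depends on the embedding and the data.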
There are a few places where flexbox and scroll overflow are not playing particularly well together. Need to resolve this.
Could be interesting to give this a go; it would allow for some more interesting free-text entry cleaning.
Currently, the project uses a custom layout. Change this to use the side bar, footer, main area components here:
Selecting to add to negative mappings fails occasionally. Not sure why; need to check that out.
Currently the file loader is stuck halfway: it works OK with one dataset and allows several to be uploaded, but it does not yet work properly with multiple datasets.
We should be able to support multiple uploads pretty well, given that the CSV parser generates multiple web workers to do the parsing, so uploading 2 files shouldn't affect performance too much.
Currently, the word embedding dataset we are using is over 3.2 GB of data. This gets loaded into memory by gensim when the server starts up. This takes a while and is not ideal if we want to move this process to a worker, for example, as each worker would need to load the data into memory.
Instead we should look to see if the embedding can be loaded into a database and queried, or even just a key-value store. This will reduce the overall memory usage of the app and load times.
This wouldn't need to do much more than look up a given word and return its vector, seeing as we are doing most of the similarity / clustering client side.
Currently we send one request per unique word to the embedding server to get that word's embedding vector.
The server supports sending multiple words at a time and getting back the results. We should chunk up the requests to make fewer API calls, which should make the embedding fetching quicker.
smooshr/src/utils/calc_embedings.js
Lines 1 to 20 in 8b11ccb
This is the function that will need to be modified to run the queries in batches and then correctly assign the results once each batch has completed.
Things to consider:
The server might fail if one or more of the words does not have a representation in the corpus. We would need to fix that here:
Lines 66 to 80 in 8b11ccb
It would also be good to give some feedback on this process in the classification interface, to let a user know how much of the embedding has been loaded.
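A rough sketch of the batching idea, with progress reporting. It is not the code in calc_embedings.js: `fetchBatch` is a hypothetical async function standing in for the real API call, and the sketch assumes the server simply leaves out words it has no vector for rather than failing.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Fetch embeddings for the unique words in `words`, one batch at a time.
// `fetchBatch(batch)` (hypothetical) resolves to an object mapping each
// known word to its vector. `onProgress` receives a fraction in (0, 1]
// after every batch, which could drive a progress indicator.
async function fetchEmbeddingsInBatches(words, fetchBatch, batchSize = 100, onProgress = () => {}) {
  const unique = [...new Set(words)];
  const embeddings = {};
  let done = 0;
  for (const batch of chunk(unique, batchSize)) {
    const result = await fetchBatch(batch);
    for (const word of batch) {
      // Words without a representation in the corpus are skipped.
      if (result[word] !== undefined) embeddings[word] = result[word];
    }
    done += batch.length;
    onProgress(done / unique.length);
  }
  return embeddings;
}
```

Batches run sequentially here to keep the progress value simple; running a few batches in parallel would be a natural follow-up.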
As we move to a different storage system and way of representing operations on a dataset, we will need a more robust schema. Currently, the very simple schema we have is
We probably want to rethink this schema to make it a lot more robust to other tasks we want to run in smooshr.
Look at using data packages for the final output of the project and for project description! https://frictionlessdata.io/data-packages/
Currently, we are using custom styling for the text in smooshr. We should move to using the typography elements from the data clinic component library
A bunch of the modals are too big. Should fix this
Currently we can save a project but not load it.
This should be pretty simple and will help with facilitating sharing on the community server when we have it.
Currently we don't have a way to combine columns from multiple datasets. We need to create some kind of meta-column entry that can reference columns from multiple datasets as the same column in the rest of the app, combining entries from each, for example.
Open questions here:
Do meta columns support concatenation of columns within a dataset? Or extraction of data from those columns?
Even if two columns from separate files are selected as the same, are there mappings that need to be file-specific for each of them? I don't think so, but need to check this.
Given the existing mapping functionality, I imagine we could extend it to build crosswalks by mapping (one-to-one, one-to-many?) entities from a column in one dataset to entities in a column from another dataset.
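To make the meta-column idea concrete, here is a small sketch of one possible representation: a meta column lists the (dataset, column) pairs it stands for, and its entries are the combined values with counts. The shape `{ name, columns: [{ dataset, column }] }` and the function name are assumptions for illustration, not the existing smooshr data model.

```javascript
// Collect the distinct values of a meta column across datasets.
// `datasets` maps a dataset name to an array of row objects.
// `metaColumn.columns` lists which column to read in which dataset.
function metaColumnEntries(metaColumn, datasets) {
  const counts = new Map();
  for (const { dataset, column } of metaColumn.columns) {
    const rows = datasets[dataset] || [];
    for (const row of rows) {
      const value = row[column];
      // Skip missing or empty cells.
      if (value === undefined || value === null || value === '') continue;
      counts.set(value, (counts.get(value) || 0) + 1);
    }
  }
  // Values come back in first-seen order.
  return [...counts.entries()].map(([value, count]) => ({ value, count }));
}
```

With this shape, mappings could be defined once against the meta column; whether some mappings still need to be file-specific stays an open question.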
Every set I've tried results in '0 rows and 0 columns'
Currently we are using local storage to store the project definitions offline. Would be good to move this to IndexedDB to allow more space (50 MB vs 5 MB). Dexie might be a good way to do this.
Might be interesting to move all state management there? Not sure how this interacts with React.
Interesting example here : https://github.com/dfahlander/Dexie.js/tree/master/samples/react-redux
There is a bunch of unused code that we should purge for launch.
Create a catch-all project abstraction that can be used to reference multiple data files, and have the mappings scoped to that entity.
As this becomes more complicated, it might be worth investing the time to move the project to TypeScript. Some basic type checking might help as we grow the project.
Making the embeddings more portable will make deploys easier. Currently we are using PostgreSQL, which is perhaps more than we need. Moving to SQLite might make more sense here, as we can simply download the .sqlite file and run to get going.
Currently, smooshr uses in-memory storage to represent a dataset while users are working on it.
As we move to more sophisticated analysis, we might need to rethink how we do this in a more efficient way.
Some options are
Using IndexedDB, the browser's built-in database system, probably through a library like Dexie. Note we currently use IndexedDB as a dumb offline storage, but this would move it to a more structured database.
Using sql.js, a compiled version of SQLite that runs in WebAssembly and provides basically native performance in the browser. We would still need to figure out how to store the SQLite database offline, but this could give us a really nice flexible interface (SQL) for performing operations on the datasets.
Something else? The local files api is worth keeping an eye on https://web.dev/file-system-access/ as it would let us save and read projects in a similar way to a native app.
Explore how to do this using the React Context API. Time travel like this is doable with Redux; not sure about the Context API.
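One possible approach that does not depend on Redux is to wrap the app's reducer so that the state itself carries its history. The sketch below is plain JavaScript with no React import; `withHistory`, `initialHistory` and the `UNDO` / `REDO` action types are hypothetical names.

```javascript
// Wrap a reducer so that state becomes { past, present, future }.
// 'UNDO' and 'REDO' move through the history; any other action is passed
// to the wrapped reducer and, if it changes the state, recorded.
function withHistory(reducer) {
  return function historyReducer(state, action) {
    const { past, present, future } = state;
    switch (action.type) {
      case 'UNDO':
        if (past.length === 0) return state;
        return {
          past: past.slice(0, -1),
          present: past[past.length - 1],
          future: [present, ...future],
        };
      case 'REDO':
        if (future.length === 0) return state;
        return {
          past: [...past, present],
          present: future[0],
          future: future.slice(1),
        };
      default: {
        const next = reducer(present, action);
        // Ignore actions that leave the state untouched.
        if (next === present) return state;
        // A new change discards any redo history.
        return { past: [...past, present], present: next, future: [] };
      }
    }
  };
}

// Build the initial history-carrying state.
function initialHistory(present) {
  return { past: [], present, future: [] };
}
```

In React, the wrapped reducer could be handed to `useReducer` and the resulting state and dispatch shared through a context provider. Keeping every past state in memory may get expensive for large datasets, so the history might need a length cap.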
Right now we can't give progress updates on URL or Open Data sources. This is because we can't do a range request against these resources, which is what papa-parse uses to stream data. We either need to implement this streaming on the proxy server somehow or figure out some other solution. For now we can caution that a dataset might take a while to read and display a spinner rather than a progress bar.
Currently we have no way to collaborate on a taxonomy between multiple machines. We don't want to have a centralized store of the data, so if we are to implement this, we probably want to use a P2P system.
Explore distributing state using OrbitDB or something similar.
We can probably figure out how to add a progress bar to the file loading modal by monitoring how many bytes we have read in already. Would make for a nicer experience as you could see how long you have to go loading a file in
Currently the design is using some data clinic colors, but we should bring the general design into line with Newerhoods.
Currently no way to delete a project. Should be easy enough to fix
It would be a shame if the new mappings and columns generated by smooshr did not also come with metadata. We should encourage people to add metadata to the new mappings and export this automatically with the mappings and results.
It would be great to be able to share projects on the site, showcase how people are using it. Need to figure out how to persist enough of the project to make this happen.
Currently, all datasets need to be loaded locally. It would be interesting to have datasets be reference-able by a URL instead. This would let us have projects that can pull from and tidy open data specifically, perhaps making those mappings public then.
On the mappings page, auto-select a new mapping when it is created.
Super simple: add @dataclinic/datalinic to the package.json and wrap the application in the DCThemeProvider.
Currently the drop zone for files is just the text; expand this to use the entire box.
Currently the classification page creates a card for each entry in the datasets and displays it as part of a long list. This leads to performance hits when the number of entries is more than a few thousand.
To fix this we should paginate the list, either through an infinite scroll or through explicit pages.
Code that would need to be changed for this lives here: https://github.com/tsdataclinic/smooshr/blob/master/src/pages/ColumnPage.js
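The explicit-pages option could start from a small helper like the one below, so that only one slice of entries is turned into cards at a time. This is a sketch, not the code in ColumnPage.js; `paginate` and the default page size of 50 are made up for illustration.

```javascript
// Return one page of `items`. `page` is zero-based and is clamped to the
// valid range, so stale page numbers (for example after filtering) still
// produce a usable result.
function paginate(items, page, pageSize = 50) {
  const pageCount = Math.max(1, Math.ceil(items.length / pageSize));
  const current = Math.min(Math.max(page, 0), pageCount - 1);
  const start = current * pageSize;
  return {
    page: current,
    pageCount,
    items: items.slice(start, start + pageSize),
  };
}
```

An infinite scroll could reuse the same helper by appending page after page as the user nears the end of the list.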
Currently, the word embedding system only looks for things near the mean of the current entries in the category selection. It would be great to have another button that explicitly removes a suggestion from the list and incorporates that embedding into the suggestion search in a negative way.
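A common way to fold rejected suggestions into the search is to subtract the mean of the negative vectors from the mean of the positive ones and rank candidates against the result. The sketch below assumes that approach; it is not necessarily how smooshr computes suggestions today, and `queryVector`, `rankCandidates` and `negativeWeight` are hypothetical names.

```javascript
// Element-wise mean of a non-empty list of equal-length vectors.
function meanVector(vectors) {
  const dim = vectors[0].length;
  const mean = new Array(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) mean[i] += v[i] / vectors.length;
  }
  return mean;
}

// Mean of the accepted entries minus a weighted mean of the rejected ones.
function queryVector(positives, negatives, negativeWeight = 1) {
  const query = meanVector(positives);
  if (negatives.length > 0) {
    const negativeMean = meanVector(negatives);
    for (let i = 0; i < query.length; i++) {
      query[i] -= negativeWeight * negativeMean[i];
    }
  }
  return query;
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Rank candidate names (object of name -> vector) by similarity to the query.
function rankCandidates(candidates, positives, negatives, negativeWeight = 1) {
  const query = queryVector(positives, negatives, negativeWeight);
  return Object.entries(candidates)
    .map(([name, vector]) => ({ name, score: cosine(query, vector) }))
    .sort((x, y) => y.score - x.score);
}
```

The weight on the negative term is a tuning knob: too high and a single rejection can push the search away from the category altogether.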
Implement the new design
This will help with workflow and getting people to understand what to do at this point.
Currently we have a Flask app + Redis + Celery and workers. This was because we anticipated doing much more of this work on the backend, but seeing as we aren't, we should remove these dependencies.
Some datasets have concatenated categories that you want to blow out into multiple categories.
Need something that explodes the categories
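A minimal sketch of such an explode step, assuming rows are plain objects and the delimiter is known up front. `explodeCategories` and `explodeColumn` are hypothetical helper names.

```javascript
// Split one concatenated value into trimmed, non-empty categories.
function explodeCategories(value, delimiter = ';') {
  if (value === undefined || value === null) return [];
  return String(value)
    .split(delimiter)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

// Produce one output row per category found in `column`, copying the
// other fields unchanged. Rows whose value is empty produce no output rows.
function explodeColumn(rows, column, delimiter = ';') {
  return rows.flatMap((row) =>
    explodeCategories(row[column], delimiter).map((category) => ({
      ...row,
      [column]: category,
    }))
  );
}
```

Whether rows with an empty value should be dropped or kept with an empty category is a design choice this sketch does not settle. For a single value, the intended result is: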
"dog;cat;bird" -> "dog", "cat", "bird"
Project box click targets are a little fiddly just now. Clean these up
We currently only have 2 types of operation on smooshr
In the future we would like to have more steps for example
Some of these steps will have dependencies on previous steps that are hard to predict at run time. It would be great to have each individual transform be defined as a node in a graph with dependencies linked by edges. Essentially a DAG.
This would inform the UI and the Python code that is ultimately spit out by the tool.
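As an illustration of the DAG idea, the sketch below orders transform steps so that every step runs after the steps it depends on (a Kahn-style topological sort). The step shape `{ id, dependsOn }` and the step names in the usage example are hypothetical.

```javascript
// Order steps so each one comes after everything in its `dependsOn` list.
// Throws if the graph has a cycle or refers to a step that does not exist.
function orderSteps(steps) {
  // Map each step id to the set of dependencies not yet satisfied.
  const remaining = new Map(
    steps.map((step) => [step.id, new Set(step.dependsOn || [])])
  );
  const ordered = [];
  while (remaining.size > 0) {
    const ready = [...remaining.keys()].filter(
      (id) => remaining.get(id).size === 0
    );
    if (ready.length === 0) {
      throw new Error('Cycle or missing dependency in transform graph');
    }
    for (const id of ready) {
      ordered.push(id);
      remaining.delete(id);
    }
    // The steps just emitted no longer block anything else.
    for (const deps of remaining.values()) {
      for (const id of ready) deps.delete(id);
    }
  }
  return ordered;
}

// Usage with made-up step names:
const order = orderSteps([
  { id: 'rename', dependsOn: ['load'] },
  { id: 'load' },
  { id: 'map', dependsOn: ['rename'] },
  { id: 'explode', dependsOn: ['load'] },
]);
console.log(order.join(' -> '));
```

The same ordering could drive both the order of steps shown in the UI and the order of statements in the generated code.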
Some links to projects that might be worth looking at