duckdb-postal's Issues

Notes on global context

@Maxxen I am just copy-pasting our private messages here for posterity, and so that others can hopefully learn from them.

Context for future readers: I started writing https://github.com/NickCrews/libpostal-duckdb, but then learned about this repo, which I think is currently in a better state.

NickCrews — 12/04/2023 10:53 AM

Hi! I'm just following up re the libpostal wrapper you mentioned you wrote. Any luck finding that? If you just dump me a copy that would be super helpful even with no other explanation. Really appreciate it!

maxxen — 12/04/2023 12:30 PM

Oh yeah, shoot, sorry, I actually gave it a shot to try to revive it. The code works, but I realized that libpostal currently has some limitations that will probably keep it from working well with DuckDB. Most notably, it depends on initializing a bunch of global state, which means that if you load the extension from multiple connections it won't be reinitialized, and if you change any config options in libpostal it won't be thread-safe. This mostly became an issue because I wanted to let you set the directory path for the required "data" (the model weights used by libpostal) from within DuckDB after the extension is loaded. That's not possible, though, since libpostal hard-codes the default location at build time, and you can't update it after initialization (which, again, is global).

I was going to look into eventually forking libpostal and replacing the global state with some sort of context object, but I haven't had the time for it yet.

Here's the code, just made it public
https://github.com/Maxxen/duckdb-postal
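
To make the global-state limitation above concrete, here is a minimal toy model in Python. It is only a sketch of the behavior described in the message, not libpostal's or DuckDB's actual code: `setup`, `BUILD_TIME_DATADIR`, and the paths are hypothetical names chosen for illustration.

```python
# Toy model only: these names are hypothetical and are not the libpostal API.
BUILD_TIME_DATADIR = "/some/dir"  # stands in for the path baked in at build time

_global_state = None  # one shared slot for the whole process


def setup(datadir=BUILD_TIME_DATADIR):
    """Initialize the shared state once; later calls leave it untouched."""
    global _global_state
    if _global_state is None:
        _global_state = {"datadir": datadir}
    return _global_state


# The first "connection" loads the extension, which runs setup with the default.
conn_a = setup()

# A second "connection" asks for a different directory, but setup already ran.
conn_b = setup("/custom/dir")

print(conn_a is conn_b)   # True: both connections share one state object
print(conn_b["datadir"])  # /some/dir: the later request was ignored
```

In this model the second caller has no way to get its own data directory, which mirrors the "it won't be reinitialized" problem.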

NickCrews — 12/04/2023 2:03 PM

Wow, thanks so much! Sounds like there might be a much larger can of worms waiting for me than just not understanding duckdb's Vector data model. Also I have only used duckdb from the python API wrapped in ibis, so a lot of these low-level details are new to me. Thanks for your patience.

What is the problem with having a global data dir? Isn't it RO so wouldn't we want it to be shared?

Is the problem that the weights aren't actually bundled into the extension, so if we ship the extension to another machine it goes looking in /some/dir/ and of course there is nothing there because it isn't the machine where libpostal was built?

If we don't change any config values, then is the global state not a problem? Or if we only use one connection then would that avoid the problem?

maxxen — 12/04/2023 2:09 PM

So there's like two dimensions to the problem:
You can't change the datadir after initialization. But the only sensible place to initialize it is when the extension is loaded, and at that point the user has had no chance to set the data dir path. By default it is hardcoded to the install directory during build, which is of course not portable.

Even if you could change the datadir after initialization (e.g. through a DuckDB config setting, a SET some_option = some_value SQL statement), that's not thread-safe. E.g. if I have a DuckDB connection open that is currently executing a libpostal function, and you then open another connection and change the datadir/reinitialize the global state, you'll probably break something.

Additionally, it seems like libpostal has other long-standing threading issues too:
openvenues/libpostal#307
openvenues/libpostal#476
Sooo idk, my plan would be to fork it, swap the build system to CMake and add a thread-context struct to remove the global state. After that you could maybe even look at downloading the weights through DuckDB's httpfs or something to the .duckdb home or extension install directory post-build.
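
As a rough sketch of the context idea mentioned above, the toy model below passes the state explicitly instead of reading a global. It assumes nothing about how a real fork would look: `PostalContext`, `parse_address`, and the paths are hypothetical, and the "parse" just echoes its inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostalContext:
    """Hypothetical per-connection context; not part of libpostal today."""
    datadir: str


def parse_address(ctx, address):
    """Each call receives its context explicitly instead of reading a global."""
    return {"address": address, "datadir": ctx.datadir}


# Two "connections" can hold independent, immutable contexts side by side.
ctx_a = PostalContext(datadir="/data/v1")
ctx_b = PostalContext(datadir="/data/v2")

print(parse_address(ctx_a, "123 Main St")["datadir"])  # /data/v1
print(parse_address(ctx_b, "123 Main St")["datadir"])  # /data/v2
```

Because each context is frozen and owned by its caller, changing the data directory means creating a new context rather than mutating state that another running query depends on.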
