larryhastings / gilectomy

Gilectomy branch of CPython. Use the "gilectomy" branch in git. Read the important, short README below!
License: Other
I just redid the platform abstraction. I tried to keep the Windows port going, but I don't have a Windows box handy so I can't test it. Please test and fix!
Hi Larry,
I know this isn't strictly an issue and doesn't really belong here, but for some reason I'm having a hard time finding your email address, Twitter handle, or any other way to contact you directly, so here you go.
I think this would make a great topic on my podcast, Talk Python To Me. I'd love to have you on as a guest. What do you say? Interested? We could also talk about other things, like Python 3, the release-management process, things like that.
Thanks,
Michael
The Linux kernel switched from spinlocks to seqlocks to RCU.
Perhaps there's a parallel with Python containers, at least. If two threads want to expand a dict's storage, perhaps it doesn't matter which thread's allocation is ultimately used. Likewise, if two threads want to list.append (blindly), it doesn't matter what order the entries are appended in.
The upside is no slowdown for reads (so reference counts are out).
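The read-side of a seqlock can be modeled in a few lines of Python. This is a sketch for illustration only: real seqlocks depend on memory barriers and torn-read guarantees that Python doesn't expose, and the SeqLock class here is invented, not anything from the codebase.

```python
import threading

class SeqLock:
    """Minimal seqlock model: a writer bumps a sequence counter (odd while
    a write is in progress); readers take no lock at all and simply retry
    until they observe a stable, even sequence number."""

    def __init__(self, value):
        self._seq = 0
        self._value = value
        self._write_lock = threading.Lock()  # serializes writers only

    def write(self, value):
        with self._write_lock:
            self._seq += 1          # odd: write in progress
            self._value = value
            self._seq += 1          # even: write complete

    def read(self):
        while True:
            start = self._seq
            if start % 2:           # a writer is active; retry
                continue
            value = self._value
            if self._seq == start:  # nothing changed underneath us
                return value

counter = SeqLock(0)
counter.write(41)
counter.write(42)
print(counter.read())  # -> 42
```

The appeal for read-mostly containers is exactly the point made above: uncontended reads cost nothing but two plain loads of the sequence counter.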
The first one found in pydebug mode:
../gilectomy/Objects/methodobject.c:403:12: error: use of undeclared identifier 'counter'; did you mean 'count'?
assert(counter == 0);
The first one found NOT in pydebug mode:
../gilectomy/Modules/gcmodule.c:1588:40: error: initializer element is not a compile-time constant
static furtex_t gc_furtex = { 0, 0, 0, "gc_lock" };
^~~~~~~~~
I expect there are other errors. clang is your friend. :)
The OS X support for "primitivelock_t" (I'm renaming to "cheaplock_t") has this comment:
// An initialized flag is need because some futexes
// are 0-initialized
// e.g. list instances created via PyType_GenericNew
And in "futex_lock_primitive()" (I'm renaming to "cheaplock_lock()") there is this code:
if (!f->initialized) {
    futex_init_primitive(lock);
}
Q: Is it safe to initialize a mutex from inside the lock function?
A: Uh, no.
What do we need to do so that list instances created via PyType_GenericNew (&c) are correctly, safely initialized?
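One race-free option is to guard the one-time initialization with a separate, eagerly created lock (the moral equivalent of pthread_once), rather than checking an unprotected flag inside the lock function. Here is a hypothetical Python model; CheapLock and _init_guard are invented for illustration, and in the real gilectomy C code the equivalent would use pthread_once or atomics, or simply initialize the lock in PyType_GenericNew:

```python
import threading

# A dedicated guard lock, created eagerly (e.g. at interpreter startup),
# protects the one-time initialization. Checking an "initialized" flag
# with no such guard is exactly the race in cheaplock_lock().
_init_guard = threading.Lock()

class CheapLock:
    """Hypothetical model of cheaplock_t with race-free lazy init."""

    def __init__(self):
        self._initialized = False
        self._lock = None

    def _ensure_initialized(self):
        # Double-checked: cheap test first, then re-check under the guard
        # so exactly one thread performs the initialization.
        if not self._initialized:
            with _init_guard:
                if not self._initialized:
                    self._lock = threading.Lock()
                    self._initialized = True

    def acquire(self):
        self._ensure_initialized()
        self._lock.acquire()

    def release(self):
        self._lock.release()

lock = CheapLock()
lock.acquire()
lock.release()
```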
When running the following script
import threading

lock = threading.Lock()

def task():
    while True:
        with lock:
            pass

for _ in range(10):
    threading.Thread(target=task).start()
I'm always sooner or later getting the following traceback:
Traceback (most recent call last):
  File "/..../gilectomy/Lib/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/..../gilectomy/Lib/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "locksafety.py", line 7, in task
    pass
RuntimeError: release unlocked lock
I'm trying to figure out what's wrong, but so far I have no clue... any suggestions welcome.
P.S. This is a reduced reproducer for the issue that was sometimes seen when launching x.py
Perhaps gilectomy could learn from databases?
https://en.wikipedia.org/wiki/Two-phase_locking has a thorough description and a nice diagram.
Perhaps a sensible trade-off could be picked there?
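The growing/shrinking discipline of two-phase locking can be sketched in a few lines of Python. TwoPhaseTxn and the acquire-in-sorted-order rule are assumptions for illustration, not anything from the Gilectomy codebase:

```python
import threading

class TwoPhaseTxn:
    """Sketch of two-phase locking: a growing phase that only acquires
    locks, then a shrinking phase that only releases them. Acquiring in a
    fixed (sorted) order prevents deadlock between transactions."""

    def __init__(self):
        self._held = []

    def lock(self, *locks):
        # Growing phase: acquire everything, in a canonical order.
        for lk in sorted(locks, key=id):
            lk.acquire()
            self._held.append(lk)

    def commit(self):
        # Shrinking phase: once we start releasing, we acquire no more.
        while self._held:
            self._held.pop().release()

a, b = threading.Lock(), threading.Lock()
txn = TwoPhaseTxn()
txn.lock(a, b)   # growing phase
# ... read/modify the protected containers here ...
txn.commit()     # shrinking phase
print(a.locked(), b.locked())  # -> False False
```

The trade-off the database literature describes is serializability at the cost of held-lock duration, which is where the tuning knob for Python containers would sit.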
From my profiling of x.py
(as shown at the sprint at PyCon 2016), a lot of time was spent locking the global dictionary due to contention.
Implementing an RW lock for an arbitrary dictionary is really hard, because one would have to somehow implement an RW lock that can recurse in any pattern (e.g. R+W+R, R+R+W, etc.), due to the fact that the comparison operator for dictionary keys can execute just about anything (an example might be an object with an overridden __eq__ method that inserts something into the same dictionary it is being used as a key in; weird, but possible).
But my profiling also suggests that most problems (at least in certain benchmark cases) are caused by contention on dictionaries of one particular kind: dictionaries with strings as keys.
Also, looking at the dictobject.c
implementation, we can see that Python dicts have a special case for exactly those dicts, which does not call a custom compare but rather compares memory directly.
That said, I think it's possible to implement more efficient locking for those objects using RW locks, and lock all other dicts as before.
I'll try prototyping that.
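For reference, a minimal readers-writer lock can be built on threading.Condition. This is a generic sketch, not the dict-specific prototype; note it is not recursion-safe, which is fine for string-keyed dicts precisely because their lookups never call back into Python code:

```python
import threading

class RWLock:
    """Minimal readers-writer lock: many concurrent readers, writers
    exclusive."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        # Hold the condition's lock (blocking new readers) and wait for
        # the readers already inside to drain.
        self._cond.acquire()
        while self._readers:
            self._cond.wait()

    def release_write(self):
        self._cond.release()

lock = RWLock()
lock.acquire_read()
lock.acquire_read()   # a second reader gets in concurrently
lock.release_read()
lock.release_read()
lock.acquire_write()  # writer now has exclusive access
lock.release_write()
```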
gilectomy doesn't have a mailing list to discuss ideas yet, so I'm using the tracker to keep a record of my ideas.
I had a conversation with William Brown, a co-worker and developer of the 389 Directory Server. He has been working on a lock-free database for 389 DS [1]. William pointed me to liblfds [2], an open-source library of lock-free data structures. liblfds provides freelists and queues. The wiki also contains good information on memory barriers and CAS.
[1] http://firstyear.id.au/blog/html/2016/06/07/lock_free_database.html
[2] http://liblfds.org/
This error does not block python installation.
I think I understand correctly that what you've changed so far is basically removing global locking and making increfs and decrefs atomic operations. If that is true, then there's (I think) a possible data race when an object is borrowed.
Consider the following example. One thread does:
x = PyList_GetItem(lst, 0)
which returns a borrowed reference to x. At that moment, before the first thread gets a chance to call its incref, some other thread replaces the first item of the list with something else, e.g. PyList_SetItem(lst, 0, y). That decrefs x, which may reach a zero refcount and be destroyed; the first thread then tries to increase the refcount of the now-freed x. Bam! Segfault :( I think the probability of this is really low, but it's possible.
I don't think I have any ideas on how to fix this issue, though, so I'll just drop a note here and maybe someone else will have an idea.
To sum up, there's a possible data race leading to a crash, because using an API that borrows a reference and then incrementing that reference is not atomic.
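One mitigation used elsewhere (not something gilectomy implements) is a try-incref that refuses to resurrect an object whose refcount has already hit zero; the borrowing caller retries the lookup on failure. A toy Python model, with a plain lock standing in for the atomic compare-and-swap a C implementation would use:

```python
import threading

class ModelObject:
    """Toy refcounted object; the lock simulates atomic CAS."""

    def __init__(self):
        self._lock = threading.Lock()
        self.refcnt = 1

    def decref(self):
        with self._lock:
            self.refcnt -= 1
            return self.refcnt == 0   # True: caller must destroy it

    def try_incref(self):
        # Like a CAS loop: succeed only while the count is nonzero,
        # i.e. never resurrect an object that is already being destroyed.
        with self._lock:
            if self.refcnt == 0:
                return False
            self.refcnt += 1
            return True

x = ModelObject()
assert x.try_incref()      # borrowed ref safely promoted
x.decref()
x.decref()                 # count hits zero; object is dying
assert not x.try_incref()  # the race loses cleanly instead of segfaulting
```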
Does Sam Gross' nogil officially end the gilectomy?
I have changed x.py to be the following:
import threading
import sys
import time

def fib(n):
    if n < 2: return 1
    return fib(n-1) + fib(n-2)

def test():
    print(fib(30))

threads = 7
if len(sys.argv) > 1:
    threads = int(sys.argv[1])

for i in range(threads - 1):
    threading.Thread(target=test, daemon=True).start()
time.sleep(3)
and after a few runs got a segfault.
My current understanding is that when the interpreter is shutting down, it calls _PyGILState_Fini at some point, which invalidates the TLS key for the PyThreadState, so PyThreadState_GET yields a NULL thread state, leading to a crash.
I'm not sure how to fix that, and I think it's kind of low priority now, but I thought I'd file an issue about it anyway.
I am not sure if this is implemented yet, but if refcounts are expensive, it could be worthwhile to mark some values as not refcounted at all. The Python interpreter doesn't exist without None, True, False, the integers -5 ... 256, the empty tuple, the empty Unicode string... so it doesn't make sense to count references to these at all; there aren't even many real uses, in threaded environments, for tracking the number of references to these singleton objects.
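The idea can be sketched as refcount operations that skip a sentinel value; IMMORTAL_REFCNT and ModelObject here are invented for illustration, not the actual representation:

```python
IMMORTAL_REFCNT = 2**31  # hypothetical sentinel; any huge value would do

class ModelObject:
    """Toy model of immortal objects: incref/decref become no-ops when the
    refcount equals the sentinel, so singletons like None, True, False and
    small ints never cause contended atomic writes to shared cache lines."""

    def __init__(self, immortal=False):
        self.refcnt = IMMORTAL_REFCNT if immortal else 1

    def incref(self):
        if self.refcnt != IMMORTAL_REFCNT:
            self.refcnt += 1

    def decref(self):
        if self.refcnt != IMMORTAL_REFCNT:
            self.refcnt -= 1

none_like = ModelObject(immortal=True)
for _ in range(1000):
    none_like.incref()
    none_like.decref()
print(none_like.refcnt == IMMORTAL_REFCNT)  # -> True: count never touched
```

The cost is one extra branch on every refcount operation, which is the usual objection to this scheme for mortal objects.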
Just using issues as a locking mechanism
python -m test test_hashlib
hangs.
thread 0 waits for threads to complete
thread 1 is waiting for gc furtex, locked by thread 0
backtrace(5) of last lock is
#0 gc_lock () at Modules/gcmodule.c:1592
#1 0x000000000055d4bf in invoke_gc_callback (phase=0x62ea40 "stop", generation=0, collected=0, uncollectable=0) at Modules/gcmodule.c:1090
#2 0x000000000055d705 in collect_with_callback (generation=0) at Modules/gcmodule.c:1131
#3 0x000000000055d78d in collect_generations () at Modules/gcmodule.c:1155
#4 0x000000000055e763 in _PyObject_GC_Alloc (use_calloc=0, basicsize=48) at Modules/gcmodule.c:1807
[New Thread 0x7ffff262c700 (LWP 18554)]
@larryhastings
The benchmark is a Fibonacci computation; duplicating this work over many cores, even with 100% efficiency, will never yield any speedup.
Work division is needed to see a speedup. I propose a simple, valid benchmark:
multiply many elements of a list together, dividing the list into chunks and giving a chunk to each thread.
Alternatively, use a loop to multiply simple numbers many times (>10^9 iterations), and divide the loop iterations among threads.
The good news is that if work duplication is currently a similar speed to single threaded code, division of work will already be faster.
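The proposed chunked benchmark might look like the following; parallel_product and the chunking scheme are just one possible shape, not a definitive harness:

```python
import threading

def multiply_chunk(chunk, results, i):
    # Each thread reduces its own slice; the only shared state is the
    # per-slot results list, written at distinct indices.
    total = 1
    for n in chunk:
        total *= n
    results[i] = total

def parallel_product(numbers, nthreads=4):
    chunk_size = (len(numbers) + nthreads - 1) // nthreads
    chunks = [numbers[i:i + chunk_size]
              for i in range(0, len(numbers), chunk_size)]
    results = [1] * len(chunks)
    threads = [threading.Thread(target=multiply_chunk, args=(c, results, i))
               for i, c in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Combine the per-thread partial products.
    total = 1
    for r in results:
        total *= r
    return total

print(parallel_product(list(range(1, 11))))  # -> 3628800, i.e. 10!
```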
I've watched both of your talks on the Gilectomy project online and I read the thread here: https://lwn.net/Articles/723514/ but I haven't seen any updates to this repo. In your talk you even mentioned commits that appeared to be more recent than what's here. Is there somewhere else I should be looking for updates or new commits?
I'm not a core CPython contributor but as I use Cython and C++ more and more I look at this project as something that would be a great motivator for me to learn more and try to contribute. But that's really difficult if I can't tell what the state of the project is.
Eric
I just watched your PyCon talk about this project, and I had an idea to reduce cache misses due to atomic incr/decr.
What if you introduced a thread-local refcount and a thread-global thread-refcount?
The idea would be that the number of threads that access an object rarely changes. So if a thread wants to change the refcount it can do so locally and quickly, and only when its refcount drops to zero does it need to do an atomic decr on the thread-refcount and free if it's 0. If it's non-zero it's the job of the remaining threads to clean up once their local refcount drops to 0.
As you mentioned, most objects are only ever used in one thread. So in addition to the above concept, which would still require 2 atomic operations at creation and destruction, the thread id of the object creator could be stored so that touching the thread-global counter can be deferred until another thread incr's the object, in which case it'd need to be set to 2.
[edit] Actually, thread-local storage in C works nothing like threading.local, which is implemented using a dict. That makes it a lot slower, I guess. There is probably a lot more I overlooked.
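Despite that caveat, the scheme itself can be modeled at the Python level. BiasedObject below is invented for illustration, with a plain lock standing in for the atomic operations on the shared thread-refcount:

```python
import threading

class BiasedObject:
    """Toy model of the proposal above: each thread keeps a cheap local
    refcount; the shared thread-refcount only changes when a thread's
    local count first rises from 0 or finally falls back to 0."""

    def __init__(self):
        self._local = {}                       # thread id -> local refcount
        self._threads_lock = threading.Lock()  # stands in for atomic ops
        self.thread_refcnt = 0
        self.destroyed = False

    def incref(self):
        tid = threading.get_ident()
        if self._local.get(tid, 0) == 0:
            with self._threads_lock:           # the rare, "slow" path
                self.thread_refcnt += 1
        self._local[tid] = self._local.get(tid, 0) + 1  # the fast path

    def decref(self):
        tid = threading.get_ident()
        self._local[tid] -= 1
        if self._local[tid] == 0:
            with self._threads_lock:
                self.thread_refcnt -= 1
                if self.thread_refcnt == 0:
                    self.destroyed = True      # last thread cleans up

obj = BiasedObject()
obj.incref(); obj.incref(); obj.decref()  # all fast-path after the first
obj.decref()                              # local count hits 0: slow path
print(obj.destroyed)  # -> True
```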
"hotspot bias locks" text presents an explanation why CAS is typically fast (doesn't require bus activity if this CPU has the value in a cache line in M state).
Perhaps CAS would prove faster than atomic inc/dec for cases without contention?
The fib test case may be particularly nasty here (at least 4 refcount operations on fib per 34 bytecode instructions, plus references on the stack), and I wonder just how well it represents real-world programs.
On one hand, multithreaded CPU-bound programs are very likely to execute the exact same code in most threads;
on the other, useful programs have notably more code, and the ratio of shared references (fib) to private ones (locals, ephemeral objects) is going to be much lower.
That should probably increase cache hits.
dict and collections.deque are among the few objects that are heavily used by multiple threads. Dicts are used for module namespaces. Module dicts are mostly read-only. threading.Condition and the queue module use deque internally.
It makes sense to implement r/w locking to allow multiple concurrent readers.
I wonder if it would be worthwhile maintaining Gilectomy as a separate branch in the python/cpython repo. It might
Just thinking out loud...
TL;DR let's not make it about GIL.
Let's call it free-threading or multicore enhancements.
I for one am tired of the GIL-flame.
(Assume devil's advocate mode)
I'd posit that Python has only one competitor in the dynamic-language world: modern JS (i.e. ES6/ES2016, with its implementation of async/await, etc.).
That's a language and runtime that doesn't support threads at all, forget free-threading.
It's a language with more human users, more active runtimes and, most importantly, a language where it's much easier to see the source code and therefore learn and improve (compare a Python back-end to a JS front-end, where the source code, breakpoints, etc. are just developer tools away).
Ultimately, the number of cores is still way smaller than the number of open tabs, or the number of VMs or containers someone provisions in the cloud.
Perhaps cores/native threads are the wrong focus?
Perhaps effort is better spent teaching existing Python developers new functionality? Or on some form of run-from-GitHub (e.g. PythonAnywhere), where a regular end user can take a peek at the source code, learn from it, fork it, and make their own enhancements? Or on getting Python to run nicely on mobile platforms?
portable, license-free, lock-free data structure library written in C89
Although Python's str type is immutable from Python space, it is a C-mutable object. PyUnicodeObject has several writable members. In fact, PyUnicodeObject's payload itself is writable from C code when Py_REFCNT(o) == 1. @larryhastings and I agree that a per-object lock for str is too costly. Instead we'd like to go with an optimistic global unicode lock (GUL).
Disclaimer: I don't fully understand the details of the current implementation and PEP 393.
https://www.python.org/dev/peps/pep-0393/#specification
The hash member caches the value of hash('somestring'). It is only computed on demand. Since hashing doesn't involve any storage, no locking is required; at worst two threads compute the same hash value and overwrite each other.
The C-mutable members that involve memory allocation are:
- wchar_t *wstr (PyASCIIObject)
- char *utf8 (PyCompactUnicodeObject)
- PyUnicodeObject.data
Write access to any and all C-mutable members that involve memory allocation must be synchronized by the GUL. Otherwise two threads may set the same pointer, which results in a memory leak of one of the allocated buffers. My gut feeling is that conflicts are scarce, so optimistic locking is going to perform better here.
The optimistic protocol for the utf8 member would be:
- If the utf8 member is already set, nothing needs to be done.
- If the utf8 member is not set, compute the UTF-8 value without holding the lock; another thread may set the utf8 member in the meantime.
- Acquire the GUL. If the utf8 member is still NULL, set the member.
- If the utf8 member has been set by another thread in the meantime, discard and free the just-computed UTF-8 value.
Python's str uses a special case to optimize string concatenation and in _PyUnicodeWriter. As far as I am able to figure out, _PyUnicodeWriter requires the special case to work. I'm not yet sure how to handle this special case. I have been considering a new flag, constructable, which can be set if-and-only-if a PyUnicodeObject is in C API calls in a single thread. struct state has 24 unused bits left.
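The optimistic scheme for filling the utf8 cache can be modeled in Python; ModelStr and the GUL lock here are illustrative stand-ins, not the real C structures:

```python
import threading

GUL = threading.Lock()  # stand-in for the hypothetical global unicode lock

class ModelStr:
    """Toy model of optimistically filling the utf8 cache member."""

    def __init__(self, text):
        self.text = text
        self.utf8 = None  # lazily filled, as in PyCompactUnicodeObject

    def as_utf8(self):
        if self.utf8 is not None:        # already set: no locking at all
            return self.utf8
        buf = self.text.encode('utf-8')  # compute OUTSIDE the lock
        with GUL:
            if self.utf8 is None:        # still unset: publish our buffer
                self.utf8 = buf
            # else: another thread won the race; drop our buffer and use
            # theirs (in C, free() the loser's allocation here).
        return self.utf8

s = ModelStr('gilectomy')
print(s.as_utf8())  # -> b'gilectomy'
```

The lock is held only for the pointer check-and-set, never for the encoding work, which is what makes the scheme "optimistic".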
I have started a branch but gave up after a couple of hours, https://github.com/tiran/gilectomy/tree/gul
Rather than attempting to modify the whole of Python to work multithreaded, perhaps a more incremental approach could be taken. What I envision is this: have a with nogil: statement that simply unlocks the GIL, and then have Python move all of the variables coming into the context into a special "nogil" section that the interpreter with the GIL would not be able to access. Python would then just be responsible for converting any segfaults into Python SystemExit exceptions rather than crashing hard.
The main advantage is that it could be implemented now, because user Python code would be responsible for managing the locks rather than the interpreter itself. This would also support iterative design. In addition, single-threaded performance would suffer minimally when nogil is not invoked, and the same interpreter could run existing code without nogil.
The main disadvantage is that it would make nogil code more complex. But solving problems in Python is easier and faster than solving problems in C.
Reasons you should listen to me: none. I will not have time to devote to this as I am already working on other projects. But I would like to see a nogil implementation in Python and if I can give feedback toward that end, then I will.
I just redid all the platform abstraction stuff. I tried to keep the Mac stuff working, but I might have broken it. Please download, fix, and send me a pull request!
This will change what people see on the front page, as well as the default merge target for pull requests, which I believe are both desirable changes for your project.
https://help.github.com/articles/setting-the-default-branch/