larryhastings / gilectomy

Gilectomy branch of CPython. Use the "gilectomy" branch in git. Read the important, short README below!
License: Other
I just redid the platform abstraction. I tried to keep the Windows port going, but I don't have a Windows box handy so I can't test it. Please test and fix!
Hi Larry,
I know this isn't strictly an issue and doesn't really belong here, but for some reason I'm having a hard time finding your email address, Twitter handle, or any other way to contact you directly, so here you go.
I think this would make a great topic on my podcast, Talk Python To Me. I'd love to have you on as a guest. What do you say? Interested? We could also talk about other things, like Python 3, the release-management process, things like that.
Thanks,
Michael
The Linux kernel switched from spinlocks to seqlocks to RCU.
Perhaps there's a parallel with Python containers, at least. If two threads want to expand a dict's storage, perhaps it doesn't matter which thread's allocation is ultimately used. Likewise, if two threads want to list.append (blindly), it doesn't matter what order the entries are appended in.
The upside is no slowdown for reads (so reference counts are out).
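The read-side of a seqlock can be modeled in a few lines of Python. This is a sketch for illustration only: real seqlocks depend on memory barriers and torn-read guarantees that Python doesn't expose, and the SeqLock class here is invented, not anything from the codebase.

```python
import threading

class SeqLock:
    """Minimal seqlock model: a writer bumps a sequence counter (odd while
    a write is in progress); readers take no lock at all and simply retry
    until they observe a stable, even sequence number."""

    def __init__(self, value):
        self._seq = 0
        self._value = value
        self._write_lock = threading.Lock()  # serializes writers only

    def write(self, value):
        with self._write_lock:
            self._seq += 1          # odd: write in progress
            self._value = value
            self._seq += 1          # even: write complete

    def read(self):
        while True:
            start = self._seq
            if start % 2:           # a writer is active; retry
                continue
            value = self._value
            if self._seq == start:  # nothing changed underneath us
                return value

counter = SeqLock(0)
counter.write(41)
counter.write(42)
print(counter.read())  # -> 42
```

The appeal for read-mostly containers is exactly the point made above: uncontended reads cost nothing but two plain loads of the sequence counter.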
The first one found in pydebug mode:
../gilectomy/Objects/methodobject.c:403:12: error: use of undeclared identifier 'counter'; did you mean 'count'?
assert(counter == 0);
The first one found NOT in pydebug mode:
../gilectomy/Modules/gcmodule.c:1588:40: error: initializer element is not a compile-time constant
static furtex_t gc_furtex = { 0, 0, 0, "gc_lock" };
^~~~~~~~~
I expect there are other errors. clang is your friend. :)
The OS X support for "primitivelock_t" (I'm renaming to "cheaplock_t") has this comment:
// An initialized flag is need because some futexes
// are 0-initialized
// e.g. list instances created via PyType_GenericNew
And in "futex_lock_primitive()" (I'm renaming to "cheaplock_lock()") there is this code:
if (!f->initialized) {
    futex_init_primitive(lock);
}
Q: Is it safe to initialize a mutex from inside the lock function?
A: Uh, no.
What do we need to do so that list instances created via PyType_GenericNew (&c) are correctly, safely initialized?
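One race-free option is to guard the one-time initialization with a separate, eagerly created lock (the moral equivalent of pthread_once), rather than checking an unprotected flag inside the lock function. Here is a hypothetical Python model; CheapLock and _init_guard are invented for illustration, and in the real gilectomy C code the equivalent would use pthread_once or atomics, or simply initialize the lock in PyType_GenericNew:

```python
import threading

# A dedicated guard lock, created eagerly (e.g. at interpreter startup),
# protects the one-time initialization. Checking an "initialized" flag
# with no such guard is exactly the race in cheaplock_lock().
_init_guard = threading.Lock()

class CheapLock:
    """Hypothetical model of cheaplock_t with race-free lazy init."""

    def __init__(self):
        self._initialized = False
        self._lock = None

    def _ensure_initialized(self):
        # Double-checked: cheap test first, then re-check under the guard
        # so exactly one thread performs the initialization.
        if not self._initialized:
            with _init_guard:
                if not self._initialized:
                    self._lock = threading.Lock()
                    self._initialized = True

    def acquire(self):
        self._ensure_initialized()
        self._lock.acquire()

    def release(self):
        self._lock.release()

lock = CheapLock()
lock.acquire()
lock.release()
```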
When running the following script
import threading

lock = threading.Lock()

def task():
    while True:
        with lock:
            pass

for _ in range(10):
    threading.Thread(target=task).start()
I'm always sooner or later getting the following traceback:
Traceback (most recent call last):
  File "/..../gilectomy/Lib/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/..../gilectomy/Lib/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "locksafety.py", line 7, in task
    pass
RuntimeError: release unlocked lock
I'm trying to figure out what's wrong, but so far I have no clue... any suggestions welcome.
P.S. This is a reduced reproducer for the issue that was sometimes seen when launching x.py
Perhaps gilectomy could learn from databases?
https://en.wikipedia.org/wiki/Two-phase_locking has a thorough description and a nice diagram.
Perhaps a sensible trade-off could be picked there?
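The growing/shrinking discipline of two-phase locking can be sketched in a few lines of Python. TwoPhaseTxn and the acquire-in-sorted-order rule are assumptions for illustration, not anything from the Gilectomy codebase:

```python
import threading

class TwoPhaseTxn:
    """Sketch of two-phase locking: a growing phase that only acquires
    locks, then a shrinking phase that only releases them. Acquiring in a
    fixed (sorted) order prevents deadlock between transactions."""

    def __init__(self):
        self._held = []

    def lock(self, *locks):
        # Growing phase: acquire everything, in a canonical order.
        for lk in sorted(locks, key=id):
            lk.acquire()
            self._held.append(lk)

    def commit(self):
        # Shrinking phase: once we start releasing, we acquire no more.
        while self._held:
            self._held.pop().release()

a, b = threading.Lock(), threading.Lock()
txn = TwoPhaseTxn()
txn.lock(a, b)   # growing phase
# ... read/modify the protected containers here ...
txn.commit()     # shrinking phase
print(a.locked(), b.locked())  # -> False False
```

The trade-off the database literature describes is serializability at the cost of held-lock duration, which is where the tuning knob for Python containers would sit.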
From my profiling of x.py
(as shown at the sprint at PyCon 2016), a lot of time was spent locking the global dictionary due to contention.
Implementing an RW lock for an arbitrary dictionary is really hard, because one would have to somehow implement an RW lock that can recurse in any pattern (e.g. R+W+R, R+R+W, etc.), due to the fact that the comparison operator for dictionary keys can execute just about anything (an example might be an object with an overridden __eq__ method that inserts something into the same dictionary it is being used as a key in; weird, but possible).
But my profiling also suggests that most problems (at least in certain benchmark cases) are caused by contention on dictionaries of one particular kind: dictionaries with strings as keys.
Also, looking at the dictobject.c
implementation, we can see that Python dicts have a special case for exactly those dicts, which does not call a custom compare but rather compares memory directly.
That said, I think it's possible to implement more efficient locking for those objects using RW locks, and lock all other dicts as before.
I'll try prototyping that.
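For reference, a minimal readers-writer lock can be built on threading.Condition. This is a generic sketch, not the dict-specific prototype; note it is not recursion-safe, which is fine for string-keyed dicts precisely because their lookups never call back into Python code:

```python
import threading

class RWLock:
    """Minimal readers-writer lock: many concurrent readers, writers
    exclusive."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        # Hold the condition's lock (blocking new readers) and wait for
        # the readers already inside to drain.
        self._cond.acquire()
        while self._readers:
            self._cond.wait()

    def release_write(self):
        self._cond.release()

lock = RWLock()
lock.acquire_read()
lock.acquire_read()   # a second reader gets in concurrently
lock.release_read()
lock.release_read()
lock.acquire_write()  # writer now has exclusive access
lock.release_write()
```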
gilectomy doesn't have a mailing list to discuss ideas yet, so I'm using the tracker to keep a record of my ideas.
I had a conversation with William Brown, a co-worker and developer of the 389 Directory Server. He has been working on a lock-free database for 389 DS [1]. William pointed me to liblfds [2], an open-source library of lock-free data structures. liblfds provides freelists and queues. The wiki also contains good information on memory barriers and CAS.
[1] http://firstyear.id.au/blog/html/2016/06/07/lock_free_database.html
[2] http://liblfds.org/
This error does not block python installation.
I think I understand correctly that what you've changed so far is basically removing global locking and making increfs and decrefs atomic operations. If that is true, then there's (I think) a possible data race when an object is borrowed.
Consider the following example. One thread does:
x = PyList_GetItem(lst, 0)
which returns a borrowed reference to x. At that moment, before the first thread gets a chance to call its incref, some other thread replaces the first item of the list with something else, e.g. PyList_SetItem(lst, 0, y). That decrefs x, which may reach a zero refcount and be destroyed; the first thread then tries to increase the refcount of the now-freed x. Bam! Segfault :( I think the probability of this is really low, but it's possible.
I don't think I have any ideas on how to fix this issue, though, so I'll just drop a note here and maybe someone else will have an idea.
To sum up, there's a possible data race leading to a crash, because using an API that borrows a reference and then incrementing that reference is not atomic.
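One mitigation used elsewhere (not something gilectomy implements) is a try-incref that refuses to resurrect an object whose refcount has already hit zero; the borrowing caller retries the lookup on failure. A toy Python model, with a plain lock standing in for the atomic compare-and-swap a C implementation would use:

```python
import threading

class ModelObject:
    """Toy refcounted object; the lock simulates atomic CAS."""

    def __init__(self):
        self._lock = threading.Lock()
        self.refcnt = 1

    def decref(self):
        with self._lock:
            self.refcnt -= 1
            return self.refcnt == 0   # True: caller must destroy it

    def try_incref(self):
        # Like a CAS loop: succeed only while the count is nonzero,
        # i.e. never resurrect an object that is already being destroyed.
        with self._lock:
            if self.refcnt == 0:
                return False
            self.refcnt += 1
            return True

x = ModelObject()
assert x.try_incref()      # borrowed ref safely promoted
x.decref()
x.decref()                 # count hits zero; object is dying
assert not x.try_incref()  # the race loses cleanly instead of segfaulting
```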
Does Sam Gross' nogil officially end the gilectomy?
I have changed x.py to be the following:
import threading
import sys
import time

def fib(n):
    if n < 2: return 1
    return fib(n-1) + fib(n-2)

def test():
    print(fib(30))

threads = 7
if len(sys.argv) > 1:
    threads = int(sys.argv[1])

for i in range(threads - 1):
    threading.Thread(target=test, daemon=True).start()
time.sleep(3)
and after a few runs got a segfault.
My current understanding is that when the interpreter is shutting down, it calls _PyGILState_Fini at some point, which invalidates the TLS key for the PyThreadState, so PyThreadState_GET yields a NULL thread state, leading to a crash.
I'm not sure how to fix that, and I think it's kind of low priority now, but I thought I'd file an issue about it anyway.
I am not sure if this is implemented yet, but if refcounts are expensive, it could be worthwhile to mark some values as not refcounted at all. The Python interpreter doesn't exist without None, True, False, the integers -5 ... 256, the empty tuple, the empty Unicode string... so it doesn't make sense to count references to these at all; there aren't even many real uses, in threaded environments, for tracking the number of references to these singleton objects.
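The idea can be sketched as refcount operations that skip a sentinel value; IMMORTAL_REFCNT and ModelObject here are invented for illustration, not the actual representation:

```python
IMMORTAL_REFCNT = 2**31  # hypothetical sentinel; any huge value would do

class ModelObject:
    """Toy model of immortal objects: incref/decref become no-ops when the
    refcount equals the sentinel, so singletons like None, True, False and
    small ints never cause contended atomic writes to shared cache lines."""

    def __init__(self, immortal=False):
        self.refcnt = IMMORTAL_REFCNT if immortal else 1

    def incref(self):
        if self.refcnt != IMMORTAL_REFCNT:
            self.refcnt += 1

    def decref(self):
        if self.refcnt != IMMORTAL_REFCNT:
            self.refcnt -= 1

none_like = ModelObject(immortal=True)
for _ in range(1000):
    none_like.incref()
    none_like.decref()
print(none_like.refcnt == IMMORTAL_REFCNT)  # -> True: count never touched
```

The cost is one extra branch on every refcount operation, which is the usual objection to this scheme for mortal objects.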
Just using issues as a locking mechanism
python -m test test_hashlib
hangs.
thread 0 waits for threads to complete
thread 1 is waiting for gc furtex, locked by thread 0
backtrace(5) of last lock is
#0 gc_lock () at Modules/gcmodule.c:1592
#1 0x000000000055d4bf in invoke_gc_callback (phase=0x62ea40 "stop", generation=0, collected=0, uncollectable=0) at Modules/gcmodule.c:1090
#2 0x000000000055d705 in collect_with_callback (generation=0) at Modules/gcmodule.c:1131
#3 0x000000000055d78d in collect_generations () at Modules/gcmodule.c:1155
#4 0x000000000055e763 in _PyObject_GC_Alloc (use_calloc=0, basicsize=48) at Modules/gcmodule.c:1807
[New Thread 0x7ffff262c700 (LWP 18554)]
@larryhastings
The benchmark is a Fibonacci computation; duplicating this work over many cores, even with 100% efficiency, will never yield any speedup.
Work division is needed to see a speedup. I propose a simple, valid benchmark:
multiply many elements of a list together, dividing the list into chunks and giving a chunk to each thread.
Alternatively, use a loop to multiply simple numbers many times (>10^9 iterations), and divide the loop iterations among threads.
The good news is that if work duplication is currently a similar speed to single threaded code, division of work will already be faster.
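The proposed chunked benchmark might look like the following; parallel_product and the chunking scheme are just one possible shape, not a definitive harness:

```python
import threading

def multiply_chunk(chunk, results, i):
    # Each thread reduces its own slice; the only shared state is the
    # per-slot results list, written at distinct indices.
    total = 1
    for n in chunk:
        total *= n
    results[i] = total

def parallel_product(numbers, nthreads=4):
    chunk_size = (len(numbers) + nthreads - 1) // nthreads
    chunks = [numbers[i:i + chunk_size]
              for i in range(0, len(numbers), chunk_size)]
    results = [1] * len(chunks)
    threads = [threading.Thread(target=multiply_chunk, args=(c, results, i))
               for i, c in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Combine the per-thread partial products.
    total = 1
    for r in results:
        total *= r
    return total

print(parallel_product(list(range(1, 11))))  # -> 3628800, i.e. 10!
```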
I've watched both of your talks on the Gilectomy project online and I read the thread here: https://lwn.net/Articles/723514/ but I haven't seen any updates to this repo. In your talk you even mentioned commits that appeared to be more recent than what's here. Is there somewhere else I should be looking for updates or new commits?
I'm not a core CPython contributor but as I use Cython and C++ more and more I look at this project as something that would be a great motivator for me to learn more and try to contribute. But that's really difficult if I can't tell what the state of the project is.
Eric
I just watched your PyCon talk about this project, and I had an idea to reduce cache misses due to atomic incr/decr.
What if you introduced a thread-local refcount and a thread-global thread-refcount?
The idea would be that the number of threads that access an object rarely changes. So if a thread wants to change the refcount it can do so locally and quickly, and only when its refcount drops to zero does it need to do an atomic decr on the thread-refcount and free if it's 0. If it's non-zero it's the job of the remaining threads to clean up once their local refcount drops to 0.
As you mentioned, most objects are only ever used in one thread. So in addition to the above concept, which would still require 2 atomic operations at creation and destruction, the thread id of the object creator could be stored so that touching the thread-global counter can be deferred until another thread incr's the object, in which case it'd need to be set to 2.
[edit] Actually, thread-local storage in C works nothing like threading.local, which is implemented using a dict. That makes it a lot slower, I guess. There is probably a lot more I overlooked.
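Despite that caveat, the scheme itself can be modeled at the Python level. BiasedObject below is invented for illustration, with a plain lock standing in for the atomic operations on the shared thread-refcount:

```python
import threading

class BiasedObject:
    """Toy model of the proposal above: each thread keeps a cheap local
    refcount; the shared thread-refcount only changes when a thread's
    local count first rises from 0 or finally falls back to 0."""

    def __init__(self):
        self._local = {}                       # thread id -> local refcount
        self._threads_lock = threading.Lock()  # stands in for atomic ops
        self.thread_refcnt = 0
        self.destroyed = False

    def incref(self):
        tid = threading.get_ident()
        if self._local.get(tid, 0) == 0:
            with self._threads_lock:           # the rare, "slow" path
                self.thread_refcnt += 1
        self._local[tid] = self._local.get(tid, 0) + 1  # the fast path

    def decref(self):
        tid = threading.get_ident()
        self._local[tid] -= 1
        if self._local[tid] == 0:
            with self._threads_lock:
                self.thread_refcnt -= 1
                if self.thread_refcnt == 0:
                    self.destroyed = True      # last thread cleans up

obj = BiasedObject()
obj.incref(); obj.incref(); obj.decref()  # all fast-path after the first
obj.decref()                              # local count hits 0: slow path
print(obj.destroyed)  # -> True
```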
"hotspot bias locks" text presents an explanation why CAS is typically fast (doesn't require bus activity if this CPU has the value in a cache line in M state).
Perhaps CAS would prove faster than atomic inc/dec for cases without contention?
The fib test case may be particularly nasty here (at least 4 refcount operations on fib per 34 bytecode instructions, plus references on the stack), and I wonder just how well it represents real-world programs.
On one hand, multithreaded CPU-bound programs are very likely to execute the exact same code in most threads;
on the other, useful programs have notably more code, and the ratio of shared references (fib) to private ones (locals, ephemeral objects) is going to be much lower.
That should probably increase cache hits.
dict and collections.deque are among the few objects that are heavily used by multiple threads. Dicts are used for module namespaces. Module dicts are mostly read-only. threading.Condition and the queue module use deque internally.
It makes sense to implement r/w locking to allow multiple concurrent readers.
I wonder if it would be worthwhile maintaining Gilectomy as a separate branch in the python/cpython repo. It might
Just thinking out loud...
TL;DR let's not make it about GIL.
Let's call it free-threading or multicore enhancements.
I for one am tired of the GIL-flame.
(Assume devil's advocate mode)
I'd posit that Python has only one competitor in the dynamic-language world: modern JS (i.e. ES6/ES2016, with its implementation of async/await, etc.).
That's a language and runtime that doesn't support threads at all, forget free-threading.
It's a language with more human users, more active runtimes and, most importantly, a language where it's much easier to see the source code and therefore learn and improve (compare a Python back-end to a JS front-end, where the source code, breakpoints, etc. are just developer tools away).
Ultimately, the number of cores is still way smaller than the number of open tabs, or the number of VMs or containers someone provisions in the cloud.
Perhaps cores/native threads are the wrong focus?
Perhaps effort is better spent teaching existing Python developers new functionality? Or on some form of run-from-GitHub (e.g. PythonAnywhere), where a regular end user can take a peek at the source code, learn from it, fork it, and make their own enhancements? Or on getting Python to run nicely on mobile platforms?
portable, license-free, lock-free data structure library written in C89
Although Python's str type is immutable from Python space, it is a C-mutable object. PyUnicodeObject has several writable members. In fact, PyUnicodeObject's payload itself is writable from C code when Py_REFCNT(o) == 1. @larryhastings and I agree that a per-object lock for str is too costly. Instead we'd like to go with an optimistic global unicode lock (GUL).
Disclaimer: I don't fully understand the details of the current implementation and PEP 393.
https://www.python.org/dev/peps/pep-0393/#specification
The hash member caches the value of hash('somestring'). It is only computed on demand. Since hashing doesn't involve any storage, no locking is required; at worst two threads compute the same hash value and overwrite each other.
The C-mutable members that involve memory allocation are:
- wchar_t *wstr (PyASCIIObject)
- char *utf8 (PyCompactUnicodeObject)
- PyUnicodeObject.data
Write access to any and all C-mutable members that involve memory allocation must be synchronized by the GUL. Otherwise two threads may set the same pointer, which results in a memory leak of one of the allocated buffers. My gut feeling is that conflicts are scarce, so optimistic locking is going to perform better here.
The optimistic protocol for the utf8 member would be:
- If the utf8 member is already set, nothing needs to be done.
- If the utf8 member is not set, compute the UTF-8 value without holding the lock; another thread may set the utf8 member in the meantime.
- Acquire the GUL. If the utf8 member is still NULL, set the member.
- If the utf8 member has been set by another thread in the meantime, discard and free the just-computed UTF-8 value.
Python's str uses a special case to optimize string concatenation and in _PyUnicodeWriter. As far as I am able to figure out, _PyUnicodeWriter requires the special case to work. I'm not yet sure how to handle this special case. I have been considering a new flag, constructable, which can be set if-and-only-if a PyUnicodeObject is in C API calls in a single thread. struct state has 24 unused bits left.
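The optimistic scheme for filling the utf8 cache can be modeled in Python; ModelStr and the GUL lock here are illustrative stand-ins, not the real C structures:

```python
import threading

GUL = threading.Lock()  # stand-in for the hypothetical global unicode lock

class ModelStr:
    """Toy model of optimistically filling the utf8 cache member."""

    def __init__(self, text):
        self.text = text
        self.utf8 = None  # lazily filled, as in PyCompactUnicodeObject

    def as_utf8(self):
        if self.utf8 is not None:        # already set: no locking at all
            return self.utf8
        buf = self.text.encode('utf-8')  # compute OUTSIDE the lock
        with GUL:
            if self.utf8 is None:        # still unset: publish our buffer
                self.utf8 = buf
            # else: another thread won the race; drop our buffer and use
            # theirs (in C, free() the loser's allocation here).
        return self.utf8

s = ModelStr('gilectomy')
print(s.as_utf8())  # -> b'gilectomy'
```

The lock is held only for the pointer check-and-set, never for the encoding work, which is what makes the scheme "optimistic".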
I have started a branch but gave up after a couple of hours, https://github.com/tiran/gilectomy/tree/gul
Rather than attempting to modify the whole of Python to work multithreaded, perhaps a more incremental approach could be taken. What I envision is this: have a with nogil: statement that simply unlocks the GIL, and then have Python move all of the variables coming into the context into a special "nogil" section that the interpreter with the GIL would not be able to access. Python would then just be responsible for converting any segfaults into Python SystemExit exceptions rather than crashing hard.
The main advantage is that it could be implemented now, because user Python code would be responsible for managing the locks rather than the interpreter itself. This would also support iterative design. In addition, single-threaded performance would suffer minimally when nogil is not invoked, and the same interpreter could run existing code without nogil.
The main disadvantage is that it would make nogil code more complex. But solving problems in Python is easier and faster than solving problems in C.
Reasons you should listen to me: none. I will not have time to devote to this as I am already working on other projects. But I would like to see a nogil implementation in Python and if I can give feedback toward that end, then I will.
I just redid all the platform abstraction stuff. I tried to keep the Mac stuff working, but I might have broken it. Please download, fix, and send me a pull request!
This will change what people see on the front page, as well as the default merge target for pull requests, which I believe are both desirable changes for your project.
https://help.github.com/articles/setting-the-default-branch/