modin-project / modin
Modin: Scale your Pandas workflows by changing a single line of code
Home Page: http://modin.readthedocs.io
License: Apache License 2.0
Here is what I did to try out Modin as a pandas replacement:
import modin.pandas as pd
d = pd.read_csv('boston_housing.csv')
d.head()
and I got the following error:
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.6/site-packages/modin/pandas/utils.py", line 380, in create_blocks
return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined
Any clue as to why this might have happened?
pip install modin
Run the following twice:
import modin.pandas as mpd
df = mpd.read_csv('4kk_lines.csv', sep=';')
Modin doesn't free memory when a variable is reassigned. Concretely, the expected behavior is that while reading a table from disk, memory usage grows until the whole dataframe fits into RAM, and then drops back to the previous level when the variable that held the old dataframe is reassigned. This is how pandas (and any regular logic) works.
But with Modin, the memory isn't freed when the variable is reassigned. Instead, usage doubles, so every time I rerun this code the memory footprint grows, which suggests a memory leak somewhere.
I also tried some slicing on the loaded dataframe. I expected memory not to grow since I don't copy the data, but it did. Here is the example:
df[df['id'] == 123].shape
In my table there is 4 000 000 lines with 14 columns, which takes about 3 Gb of RAM when loaded. Running the code above 50 times (to make performance test) I took all 110 Gb of RAM on my remote server.
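The expected lifecycle can be illustrated with plain Python object semantics (a minimal sketch, not using Modin): when the only variable referencing a large object is reassigned, the old object should become collectible and its memory returned to the allocator.

```python
import gc
import weakref

# Minimal sketch of the expected behavior: reassigning the only variable
# that references an object should make the old object collectible.
class BigObject:
    """Stand-in for a large dataframe loaded from disk."""
    pass

obj = BigObject()            # analogous to df = mpd.read_csv(...)
old = weakref.ref(obj)       # track the first object without keeping it alive
obj = BigObject()            # reassign, as in rerunning the read_csv cell
gc.collect()
print(old() is None)         # True: the first object was freed
```

The bug report above is that Modin's remote partitions keep the old data alive even after this point, so the analogous check on the cluster's memory fails.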
Someone posted some benchmarks on Twitter comparing us against a couple of other tools. head and tail were performing incredibly slowly. The culprit is column_partitions: we wait until entire columns are collected and then perform the head, which is incredibly inefficient. The same is true for tail.
modin.__git_revision__ could return the full git SHA of the latest commit. It is useful for developers to track which particular revision they are running at the moment.
Something like dask/dask#1760
Diff returns the wrong values when axis='rows'
import modin.pandas as pd
data = {
"col1": [0, 1, 2, 3],
"col2": [4, 5, 6, 7],
"col3": [8, 9, 10, 11],
"col4": [12, 13, 14, 15],
"col5": [0, 0, 0, 0],
}
modin_df = pd.DataFrame(data)
modin_df.diff(axis='rows')
We expect to get:
col1 col2 col3 col4 col5
0 NaN NaN NaN NaN NaN
1 1.0 1.0 1.0 1.0 0.0
2 1.0 1.0 1.0 1.0 0.0
3 1.0 1.0 1.0 1.0 0.0
But get:
col1 col2 col3 col4 col5
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
/usr/local/lib/python2.7/site-packages/_pytest/python.py:197: RemovedInPytest4Warning: Fixture "test_ndim" called directly. Fixtures are not meant to be called directly, are created automatically when test functions request them as parameters. See https://docs.pytest.org/en/latest/fixture.html for more information.
Our CI is giving us tons of these kinds of warnings. Worth investigating.
Opening this to track ray-project/ray#2082
drop does not drop columns or rows from the same partition if they contain the same name. This required a large refactor of the overall implementation.
import modin.pandas as pd
import pandas

pd.DEFAULT_NPARTITIONS = 2
nu_df = pandas.DataFrame(pandas.compat.lzip(range(3), range(-3, 1),
                                            list('abc')), columns=['a', 'a', 'b'])
ray_nu_df = pd.DataFrame(nu_df)
x = ray_nu_df.drop('a', axis=1)
y = nu_df[['b']]
print(x)
Printing x gives an error because it did not drop all of the columns it should have. However, after accessing x._col_partitions, print(x) works correctly.
df.mean can incorrectly produce np.nan values in the result.
This is really only a problem on extremely small datasets, which is why it went unnoticed before.
import modin.pandas as pd
frame_data = {
"col1": [1, 2, 3, 4],
"col2": [4, 5, 6, 7],
"col3": [8.0, 9.4, 10.1, 11.3],
"col4": ["a", "b", "c", "d"],
}
df = pd.DataFrame(frame_data)
df.mean(skipna=False, axis='columns', numeric_only=None)
We only use a maximum of 8 partitions. We should set the number of partitions automatically instead of requiring users to set it themselves.
This is the only change we should need:
import multiprocessing
DEFAULT_NPARTITIONS = multiprocessing.cpu_count()
Calling df.iloc[0:5, :] results in the following error:
Traceback (most recent call last):
File "bakeoffModin.py", line 15, in
print( f'Sample:\n {df.iloc[0:5,:]} ')
File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 230, in str
return repr(self)
File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 325, in repr
return repr(self.repr_helper())
File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 235, in repr_helper
return to_pandas(self)
File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 227, in to_pandas
pandas_df = pandas.concat(ray.get(df._row_partitions), copy=False)
File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 198, in _get_row_partitions
empty_rows_mask = self._row_metadata._lengths > 0
TypeError: '>' not supported between instances of 'list' and 'int'
Currently Ray's Redis ports are not secured by default, which is a problem on systems exposed to the internet.
Once ray-project/ray#2952 is merged, I recommend securing the Redis ports with ray.init(redis_password=password), where password is securely generated, e.g. using the secrets module.
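A minimal sketch of generating such a password with the standard-library secrets module (the redis_password keyword is the one proposed in ray-project/ray#2952):

```python
import secrets

# Generate a cryptographically strong random password for the Redis server.
password = secrets.token_hex(32)   # 64 hex characters, 256 bits of entropy
print(len(password))               # 64

# ray.init(redis_password=password)  # the same value must be given to every node
```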
0.1.1
Groupby with lists of columns not yet supported.
Waiting for redis server at 127.0.0.1:59835 to respond...
Waiting for redis server at 127.0.0.1:16671 to respond...
Starting local scheduler with the following resources: {'CPU': 56, 'GPU': 4}.
Traceback (most recent call last):
File "gen_fea_online.py", line 318, in
df = get_data_log_hive(pre_time=(2018,8,7))
File "gen_fea_online.py", line 290, in get_data_log_hive
df_vod40 = gen_fea_active_log(df_vod40)
File "gen_fea_online.py", line 238, in gen_fea_active_log
t1 = t1.groupby(["subid","label"])["cnt"].count().reset_index()
File "/root/anaconda3/lib/python3.6/site-packages/modin/pandas/dataframe.py", line 823, in groupby
"Groupby with lists of columns not yet supported.")
NotImplementedError: Groupby with lists of columns not yet supported.
Opening this to track ray-project/ray#1988
cc @Veryku
Original question from the Ray repo: ray-project/ray#1858
cc @dmadeka
After the backend re-write, we no longer have a global view mapping global_index -> (blk_index, internal_index) anywhere. This makes re-indexing do a full copy (it will be a bottleneck in the distributed case). It also creates a problem for DataManagerView. Take the following example:
In [6]: df
Out[6]:
col1 col2 col3
0 1 6 10
1 2 7 11
2 3 8 12
3 4 9 13
In [7]: df.iloc[:, [1,2,0]]
Out[7]:
col2 col3 col1
0 6 10 1
1 7 11 2
2 8 12 3
3 9 13 4
Say the dataframe is partitioned as (2,2) blocks, with block widths [2, 1]. Then rows of col1 and col2 share RemotePartitions. The current DataManagerView does not support an operation like this: it applies an iloc call to the internal dataframes, which does not work in this case.
We need a way to shuffle without gathering data. Here's an idea: whenever we need to do a shuffle, we break the metadata down to each row/column, duplicate certain RemotePartition objects, and attach an apply_func to select certain columns. Here's an example:
When we are going to reorder ['col1', 'col2', 'col3', 'col4'] into ['col3', 'col2', 'col4', 'col1'], blocks will be organized as follows to represent the dataframe:
I'm not planning on implementing this in the re-write, but it would be great to include it in the repartition/efficient shuffle PR.
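The bookkeeping for this can be sketched in plain Python (the name column_locations is illustrative, not a Modin internal). For the four-column example above with block widths [2, 2], each reordered column maps back to a (block_index, index_within_block) pair, so the new blocks can be described by attaching column-selection apply_funcs to the existing RemotePartitions instead of gathering data:

```python
# Illustrative sketch: map a reordered column list back to
# (block_index, index_within_block) pairs without moving any data.
def column_locations(block_widths):
    """Map global column position -> (block index, position within block)."""
    locations, start = {}, 0
    for block, width in enumerate(block_widths):
        for i in range(width):
            locations[start + i] = (block, i)
        start += width
    return locations

columns = ['col1', 'col2', 'col3', 'col4']
new_order = ['col3', 'col2', 'col4', 'col1']

loc = column_locations([2, 2])   # two column blocks of width 2
plan = [loc[columns.index(c)] for c in new_order]
print(plan)  # [(1, 0), (0, 1), (1, 1), (0, 0)]
```

Each entry in the plan names an existing block and the column to select from it, which is exactly the information an attached apply_func would need.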
Currently as_blocks is not implemented.
In [1]: import modin.pandas as pd
...: import numpy as np
...:
...: frame_data = np.random.randint(0, 100, size=(2**12, 2**8))
...: df = pd.DataFrame(frame_data)
...:
...:
Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:59321 to respond...
Waiting for redis server at 127.0.0.1:36048 to respond...
Starting the Plasma object store with 27.00 GB memory.
Starting local scheduler with the following resources: {'GPU': 0, 'CPU': 8}.
======================================================================
View the web UI at http://localhost:8888/notebooks/ray_ui66732.ipynb?token=f0ef48a507f20daa610b2bd8da93e7e2f62955bec34e6bcb
======================================================================
In [2]: df.as_blocks()
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-2-0a8fd5d307e9> in <module>()
----> 1 df.as_blocks()
~/anaconda/lib/python3.5/site-packages/modin/pandas/dataframe.py in as_blocks(self, copy)
1242 def as_blocks(self, copy=True):
1243 raise NotImplementedError(
-> 1244 "To contribute to Pandas on Ray, please visit "
1245 "github.com/modin-project/modin.")
1246
NotImplementedError: To contribute to Pandas on Ray, please visit github.com/modin-project/modin.
For reference, it is useful to have links to the corresponding pandas documentation pages from our own documentation.
This should be possible with a script and hopefully would not require significant manual data entry.
We can discuss here whether to create a new page for the links, or to just link from the existing methods pages.
When the IndexMetadata and block partitions don't match, getting _col_partitions and _row_partitions will raise an error.
In [4]: df
Out[4]:
col1 col2 col3 col4 col5
0 0 4 8 12 0
1 1 5 9 13 0
2 2 6 10 14 0
3 3 7 11 15 0
In [5]: df.iloc[:3]
Out[5]: ---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
700 type_pprinters=self.type_printers,
701 deferred_pprinters=self.deferred_printers)
--> 702 printer.pretty(obj)
703 printer.flush()
704 return stream.getvalue()
~/anaconda3/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
398 if cls is not object \
399 and callable(cls.__dict__.get('__repr__')):
--> 400 return _repr_pprint(obj, self, cycle)
401
402 return _default_pprint(obj, self, cycle)
~/anaconda3/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
693 """A pprint that just redirects to the normal repr function."""
694 # Find newlines and replace them with p.break_()
--> 695 output = repr(obj)
696 for idx,output_line in enumerate(output.splitlines()):
697 if idx:
~/Desktop/modin/modin/modin/pandas/dataframe.py in __repr__(self)
451 if len(self._row_metadata) <= 60 and \
452 len(self._col_metadata) <= 20:
--> 453 return repr(self._repr_pandas_builder())
454 # The split here is so that we don't repr pandas row lengths.
455 result = self._repr_pandas_builder()
~/Desktop/modin/modin/modin/pandas/dataframe.py in _repr_pandas_builder(self)
380 # If we don't exceed the maximum number of values on either dimension
381 if len(self.index) <= 60 and len(self.columns) <= 20:
--> 382 return to_pandas(self)
383
384 if len(self.index) >= 60:
~/Desktop/modin/modin/modin/pandas/utils.py in to_pandas(df)
225 A new pandas DataFrame.
226 """
--> 227 pandas_df = pandas.concat(ray.get(df._row_partitions), copy=False)
228 pandas_df.index = df.index
229 pandas_df.columns = df.columns
~/Desktop/modin/modin/modin/pandas/dataframe.py in _get_row_partitions(self)
200 self._row_metadata._lengths = \
201 self._row_metadata._lengths[empty_rows_mask]
--> 202 self._block_partitions = self._block_partitions[empty_rows_mask, :]
203 return [_blocks_to_row.remote(*part)
204 for i, part in enumerate(self._block_partitions)]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 3 but corresponding boolean dimension is 4
See code
In the test, the Modin DataFrame returns two dots for the row separating the head and tail, but pandas returns three dots for that section:
E.g.
Modin
28 70 48 66 14 23 82 26 6 7 14 ... 8 44 8 28 60 38 1
29 90 63 26 73 14 36 83 72 15 9 ... 40 84 6 44 2 54 94
.. .. .. .. .. .. .. .. .. .. .. ... .. .. .. .. .. .. ..
970 46 41 68 61 89 3 42 13 58 4 ... 30 11 86 58 99 77 86
971 73 30 40 31 85 59 39 23 60 36 ... 47 66 90 46 23 82 69
pandas
28 70 48 66 14 23 82 26 6 7 14 ... 8 44 8 28 60 38 1
29 90 63 26 73 14 36 83 72 15 9 ... 40 84 6 44 2 54 94
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
970 46 41 68 61 89 3 42 13 58 4 ... 30 11 86 58 99 77 86
971 73 30 40 31 85 59 39 23 60 36 ... 47 66 90 46 23 82 69
We use pandas __repr__, but it is truncating a dot.
In a recent PR (#78) we removed the dependency on Ray master. Let's add it back as a separate build on Ray master to test whether our code is future-proof.
Ray's master wheels are named after the latest PyPI version and hosted on S3 (https://ray.readthedocs.io/en/latest/installation.html).
Here's how it can be installed without hardcoding it:
pip install -U ray; python -c "import ray; print(ray.__version__)"
We just need to run the test on Ray master for py3.6 and Linux.
I think it would be good to convert our formatter to black. Depending on the version, yapf can give different outcomes, and black formatting looks better IMO.
Additionally, we should add some git hooks to make sure that submitted code is always formatted correctly.
cc @simon-mo
Uses only numeric values even when numeric_only=False. We should be consistent with pandas and throw TypeErrors.
import modin.pandas as pd
import numpy as np
data = {
"col1": 1.0,
"col2": np.datetime64("2011-06-15T00:00"),
"col3": np.array([3] * 4, dtype="int32"),
"col4": "foo",
"col5": True,
}
modin_df = pd.DataFrame(data)
modin_df.median(axis='rows', skipna=False, numeric_only=False)
Should throw a TypeError
but returns
col1 1.0
col3 3.0
col5 1.0
dtype: float64
as if numeric_only=True
Opening this to track ray-project/ray#2206
cc @chanansh
describe gives an error after read_csv if not all columns are described.
import subprocess
import modin.pandas as pd

subprocess.call(['wget', 'https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2017-01.csv',
                 '-O', '/tmp/green_tripdata_2017-01.csv'])
csv_data = pd.read_csv('/tmp/green_tripdata_2017-01.csv')
csv_data.describe()
read_csv falls back to the pandas implementation in certain situations. This is expensive in terms of memory due to data duplication: first we create a pandas dataframe using pandas.read_csv, and then convert it to a Modin dataframe.
The following cases fall back to pandas.read_csv and should be fixed to be more memory efficient:
- filepath_or_buffer is not an instance of str, py.path.local, or pathlib.Path
- as_recarray is True
- chunksize is not None
- skiprows is list-like or callable
- nrows is not None
Most changes need to be made in io.py.
Thanks to @Bidek56 for reporting!
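The dispatch amounts to a guard like the following (a sketch; the function name and exact checks are illustrative, and the py.path.local case is omitted to keep it standard-library only):

```python
import pathlib

def should_defer_to_pandas(filepath_or_buffer, as_recarray=False,
                           chunksize=None, skiprows=None, nrows=None):
    """Return True when read_csv arguments force a fallback to pandas.read_csv."""
    if not isinstance(filepath_or_buffer, (str, pathlib.Path)):
        return True                      # e.g. an open file object or buffer
    if as_recarray or chunksize is not None or nrows is not None:
        return True
    if callable(skiprows) or isinstance(skiprows, (list, tuple)):
        return True
    return False

print(should_defer_to_pandas("data.csv"))            # False: parallel fast path
print(should_defer_to_pandas("data.csv", nrows=10))  # True: falls back to pandas
```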
groupby.median() will return a float dataframe in most cases; however, when possible, it may return an int dataframe instead:
In [40]: df = pd.DataFrame(np.random.randint(0, 8, size=(100, 4)),
...: columns=list('ABCD'))
...: df.groupby(df['A'].tolist()).median()
...:
Out[40]:
A B C D
0 0.0 4.0 4.0 4.0
1 1.0 3.0 1.0 1.0
2 2.0 6.0 4.0 4.0
3 3.0 3.0 5.5 3.0
4 4.0 4.0 4.0 5.0
5 5.0 4.0 4.0 3.0
6 6.0 5.0 2.5 2.0
7 7.0 2.5 3.0 4.5
In [41]: df = pd.DataFrame(np.random.randint(0, 8, size=(100, 4)),
...: columns=list('ABCD'))
...: df.groupby(df['A'].tolist()).median()
...:
Out[41]:
A B C D
0 0 3 4 2
1 1 3 4 2
2 2 2 2 3
3 3 4 3 2
4 4 1 4 4
5 5 1 3 2
6 6 2 4 5
7 7 5 4 4
We need to address this in our groupby.
Currently, align is not implemented for DataFrame objects.
In [1]: import modin.pandas as pd
...: import numpy as np
...:
...: frame_data = np.random.randint(0, 100, size=(2**12, 2**8))
...: df = pd.DataFrame(frame_data)
...:
...:
Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:24882 to respond...
Waiting for redis server at 127.0.0.1:55764 to respond...
Starting the Plasma object store with 27.00 GB memory.
Starting local scheduler with the following resources: {'GPU': 0, 'CPU': 8}.
======================================================================
View the web UI at http://localhost:8888/notebooks/ray_ui89247.ipynb?token=ca5e30c50fb8d5a3bc873a8adf2539198dbfef8b003ff195
======================================================================
In [2]: df.align(df)
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-2-afa7038926eb> in <module>()
----> 1 df.align(df)
~/anaconda/lib/python3.5/site-packages/modin/pandas/dataframe.py in align(self, other, join, axis, level, copy, fill_value, method, limit, fill_axis, broadcast_axis)
1094 broadcast_axis=None):
1095 raise NotImplementedError(
-> 1096 "To contribute to Pandas on Ray, please visit "
1097 "github.com/modin-project/modin.")
1098
NotImplementedError: To contribute to Pandas on Ray, please visit github.com/modin-project/modin.
For about 3.5 years we have been developing a unified expression-based front end system for SQL systems as well as pandas: https://github.com/ibis-project/ibis. I suspect there are ripe opportunities for collaboration here.
Throws a RayGetError when doing cumulative functions on non-numeric dtypes.
import modin.pandas as pd
import numpy as np
data = {
"col1": 1.0,
"col2": np.datetime64("2011-06-15T00:00"),
"col3": np.array([3] * 4, dtype="int32"),
"col4": "foo",
"col5": True,
}
modin_df = pd.DataFrame(data)
modin_df.cummax(axis=1)
Throws the following error:
RayGetError: Could not get objectid ObjectID(01000000f0cc325805f5c83895ed0532827d52de). It was created by remote function modin.data_management.partitioning.axis_partition.deploy_ray_axis_func which failed with:
Remote function modin.data_management.partitioning.axis_partition.deploy_ray_axis_func failed with:
Traceback (most recent call last):
File "/Users/William/Documents/modin/modin/data_management/partitioning/axis_partition.py", line 188, in deploy_ray_axis_func
result = func(dataframe, **kwargs)
File "/Users/William/Documents/modin/modin/data_management/data_manager.py", line 158, in helper
def helper(df, internal_indices=[]):
File "/Users/William/Documents/modin/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 9661, in cum_func
result = accum_func(y, axis)
File "/Users/William/Documents/modin/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 8829, in <lambda>
lambda y, axis: np.maximum.accumulate(y, axis), "max",
File "pandas/_libs/tslibs/timestamps.pyx", line 170, in pandas._libs.tslibs.timestamps._Timestamp.__richcmp__
TypeError: Cannot compare type 'Timestamp' with type 'float'
pd.describe() should describe all the columns when there are no numeric columns in the DataFrame; otherwise it should describe only the numeric columns. Currently, pd.describe() isn't ignoring the non-numeric columns (for example, booleans and datetime/timedelta columns). This causes an index length mismatch during set_axis, because columns with different types have differently-sized indices for their descriptions.
mean returns the wrong values
import modin.pandas as pd
data = {
"col1": [0, 1, 2, 3],
"col2": [4, 5, 6, 7],
"col3": [8, 9, 10, 11],
"col4": [12, 13, 14, 15],
"col5": [0, 0, 0, 0],
}
modin_df = pd.DataFrame(data)
modin_df.mean(axis=1)
We expect to get:
0 4.8
1 5.6
2 6.4
3 7.2
dtype: float64
but we get
0 4.000000
1 4.666667
2 5.333333
3 6.000000
dtype: float64
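The observed values are consistent with dividing each row's sum by 6 instead of 5, i.e. mishandling the per-partition element counts when combining partial results. One plausible fix, sketched here with illustrative names, is to combine per-partition (sum, count) pairs rather than averaging partial means:

```python
# Illustrative sketch: a row mean over column partitions must combine
# (partial_sum, element_count) pairs, not average the partial means.
def combine_row_mean(partials):
    """partials: list of (partial_sum, count), one entry per column partition."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Row 0 of the example, split across column partitions [0, 4, 8] and [12, 0]:
print(combine_row_mean([(12, 3), (12, 2)]))  # 4.8, matching pandas
```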
By default, full-reduce operations in pandas (e.g. max, min, mean) take a numeric_only argument.
In Modin, we decided on the following behavior:
numeric_only = True if axis else kwargs.get("numeric_only", False)
because of the asynchronous nature of our computation model.
However, this leads to the following behavior in Python 2:
In [1]: max([1,2,3,'a'])
Out[1]: 'a'
In a mixed type dataframe:
col1 col2 col3 col4
0 1 4 8.0 a
1 2 5 9.4 b
2 3 6 10.1 c
3 4 7 11.3 d
taking max over rows will lead to
0 a
1 b
2 c
3 d
dtype: object
This is not expected behavior, therefore we chose not to follow pandas' behavior in this situation.
Modin currently depends on pandas 0.22, which depends on a version of NumPy that won't compile against Python 3.7. I had to downgrade my system Python from 3.7 to 3.6 to install Modin (i.e. brew switch python 3.6.5). If you could upgrade to the latest version of pandas and add a Python 3.7 build, it would save others from having to figure out this workaround.
The standard pd.__version__ does not work.
When inserting a new column at the back of a Modin DataFrame (or invoking __setitem__ with a new column name), Modin would error with an axis length mismatch because the new column was not being inserted properly. I used the following script to reproduce:
import modin.pandas as pd
import numpy as np
frame_data = np.random.randint(0, 100, size=(2**20, 2**8))
df = pd.DataFrame(frame_data)
df['new'] = 0
Pandas 0.23.4 returns
0 NaN
1 NaN
2 NaN
3 NaN
dtype: float64
only when it cannot calculate the mean, median, and other similar functions with numeric_only=True or numeric_only=None (if possible) and axis=1. If axis=0, then pandas returns an empty series of type np.int64.
import modin.pandas as pd
data = {
"col1": [1, 'a', 3, 4],
"col2": [4, 5, 6, 'd'],
"col3": [8.0, 9.4, 'e', 11.3],
"col4": ["a", "b", "c", "d"],
}
modin_df = pd.DataFrame(data)
modin_df.mean(axis='columns', skipna=False, numeric_only=None)
Line 21 in cccea0b
Currently the git revision lookup tries to find the git commit by executing Popen directly from the user process. This runs wherever the user process starts, which might not be a git repository. To fix it, we just need to run the above line inside the directory of modin.__file__ or something like that.
This was discovered when I was working on #49.
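A sketch of the fix (the helper name is illustrative): resolve the git SHA relative to the package's own directory instead of the user's current working directory, and fail gracefully when the package was not installed from a git checkout:

```python
import os
import subprocess

def git_revision(module_file):
    """Return the git SHA for the repo containing module_file, or None."""
    module_dir = os.path.dirname(os.path.abspath(module_file))
    try:
        sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=module_dir,                # run inside the package directory
            stderr=subprocess.DEVNULL,
        )
        return sha.decode("ascii").strip()
    except (subprocess.CalledProcessError, OSError):
        return None                        # not a git checkout, or no git

# e.g. __git_revision__ = git_revision(__file__) inside modin/__init__.py
```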
Setting npartitions to something less than 4 will break the test_fillna_sanity test when it tries to apply the dictionary replacement.
During a recent pull request #88, we had a linting failure from black. We need to update the codebase with the most recent version of black.
groupby can return incorrect or truncated results when by is a list of integers that are all contained in the index of each partition.
In [1]: import modin.pandas as pd
Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:49938 to respond...
Waiting for redis server at 127.0.0.1:39318 to respond...
Starting local scheduler with the following resources: {'CPU': 8, 'GPU': 0}.
======================================================================
View the web UI at http://localhost:8888/notebooks/ray_ui75359.ipynb?token=546e4e9fca85392d39d82072b5d206ef67d2753a5b3743e3
======================================================================
In [2]: import ray
In [3]: import pandas
In [4]: pandas_df = pandas.DataFrame({'col1': [0, 1, 2, 3],
'col2': [4, 5, 6, 7],
'col3': [3, 8, 12, 10],
'col4': [17, 13, 16, 15],
'col5': [-4, -5, -6, -7]})
In [5]: modin_df = pd.DataFrame(pandas_df)
In [6]: for k, v in modin_df.groupby(by=[1,2,1,2]):
   ...:     print(k)
   ...:     print(v)
pd.clip() does not clip correctly when both an upper bound list and a lower bound list are passed in.
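For reference, pandas broadcasts list bounds along the given axis; a minimal pandas example of the behavior Modin should match (per-row bounds with axis=0):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 9], "b": [2, 6, 10]})

# Per-row lower and upper bounds, broadcast down each column with axis=0.
out = df.clip(lower=[2, 4, 6], upper=[8, 8, 8], axis=0)
print(out["a"].tolist())  # [2, 5, 8]
print(out["b"].tolist())  # [2, 6, 8]
```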
Modin errors out when we try to return an empty dataframe from full-reduce operations (operations that return a series, such as all, any, and count).
import modin.pandas as pd
data = {
"col1": [1, 'a', 3, 4],
"col2": [4, 5, 6, 'd'],
"col3": [8.0, 9.4, 'e', 11.3],
"col4": ["a", "b", "c", "d"],
}
modin_df = pd.DataFrame(data)
modin_df.count(numeric_only=True)
When bool_only=True for all and any (the only functions that take the bool_only argument), Modin throws a ValueError.
import modin.pandas as pd
import numpy as np
data = {
"col1": 1.0,
"col2": np.datetime64("2011-06-15T00:00"),
"col3": np.array([3] * 4, dtype="int32"),
"col4": "foo",
"col5": True,
}
modin_df = pd.DataFrame(data)
modin_df.any(bool_only=True)
Throws the following error:
ValueError: Length mismatch: Expected axis has 1 elements, new values have 5 elements
Opening this to track ray-project/ray#2262
cc @crystalzyan
Running mode returns either a wrong value or a RayGetError.
import modin.pandas as pd
data = {'col1': [1, 2, 3, 4],
'col2': [4, 5, 6, 7],
'col3': [8.0, 9.4, 10.1, 11.3],
'col4': ['a', 'b', 'c', 'd']}
modin_df = pd.DataFrame(data)
modin_df.mode(axis='rows', numeric_only=True)
returns a RayGetError
Running Flake8 and manually fixing lint issues hurts the developer experience.
We should use an automated code formatter like black or yapf and have automated scripts format the code before commit.
Flake8 can still be run to check for code style issues like unused variables, but line width and whitespace should be handled via automation.
I'll start adding type hints to the source code starting from the re-write (#70).
Here is my proposed approach: use Python 3 annotation syntax (def func(x: int) -> float) and add mypy to the CI checks. Instead of the `# type: (int) -> float` comment workaround, when we distribute the package we will strip away all the type hints using strip-hints so the code is Python 2 compatible. Any comments and suggestions welcome!
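A sketch of the proposed style (the function here is a hypothetical example, not Modin code): the source carries Python 3 annotations that mypy can check in CI, and a packaging step removes them for the Python 2 distribution.

```python
# Source form: annotated, checkable by mypy in CI.
def partition_row_count(lengths: list) -> int:
    """Sum per-partition row lengths into a total row count."""
    return sum(lengths)

# After strip-hints, the distributed Python 2 form would simply be:
#   def partition_row_count(lengths):
#       return sum(lengths)

print(partition_row_count([2, 2]))  # 4
```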
Previously, transpose was extremely slow and copied the entire dataset. Now we store some metadata and do the transpose at the same time as another operation.
It is instant to do df.T, but when you repr(df.T) it triggers the transpose. We need to debug why it takes so long on the repr.
> %time x = df.T
> %time repr(x)