
Comments (34)

rgommers commented on June 11, 2024

IIRC the size limit is 250 MB unzipped, rather than 50 MB on upload.

You can significantly cut down the size by deleting all the tests/ directories. Also, you probably don't need the 3 scipy/misc/*.dat test images and they are large. Deleting all that may cut the package size by ~25% or so.
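
A minimal sketch of that pruning, assuming the packages were pip-installed into a python/ directory (the usual layer layout):

# Hedged sketch: prune tests/ directories and SciPy's sample-data files.
find python -type d -name tests -prune -exec rm -rf {} +
rm -f python/scipy/misc/*.dat   # the large sample images (e.g. face.dat)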

It used to be possible to get numpy/scipy/pandas in a single layer. I'd be curious what the status is now.

aperture147 commented on June 11, 2024

IIRC the size limit is 250 MB unzipped, rather than 50 MB on upload.

You can significantly cut down the size by deleting all the tests/ directories. Also, you probably don't need the 3 scipy/misc/*.dat test images and they are large. Deleting all that may cut the package size by ~25% or so.

It used to be possible to get numpy/scipy/pandas in a single layer. I'd be curious what the status is now.

Tested, and it works. I also added --no-compile and deleted all the dist-info directories, and now NumPy, SciPy, and Pandas can be placed in a single layer. All of them take approx. 195M, so I have an extra 50M for all of my imagination.
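
For anyone checking their own build against that limit, a quick sketch (python/ being the layer's install directory):

du -sh python/   # unzipped size; must stay under the 250 MB (262,144,000 byte) limit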

This is some of the most hilarious black magic I've ever seen.

keithrozario commented on June 11, 2024

Thank you so much, all. I'll look into this in the next week or so, and hopefully we can get a scipy layer out!!!

I'm not sure how much of this is generic (can be applied to all packages) and how much is specific to scipy though. Will have to think a bit more.

keithrozario commented on June 11, 2024

Test layer is here:

arn:aws:lambda:ap-southeast-1:367660174341:layer:Klayers-p312-scipy:1

We pass the --no-compile flag to avoid the .pyc and __pycache__ files, and we also delete all directories named 'tests', as recommended by the experts on this thread :)

Feel free to run some tests on the layer. If all goes well, I'll push this to production before the end of the week, and we'll have 'optimized' builds going forward.
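
One way to smoke-test the layer, sketched with the AWS CLI (my-test-func is a hypothetical function in ap-southeast-1):

# Hedged sketch: attach the test layer to an existing function, then invoke it.
aws lambda update-function-configuration --function-name my-test-func \
    --layers arn:aws:lambda:ap-southeast-1:367660174341:layer:Klayers-p312-scipy:1
aws lambda invoke --function-name my-test-func /tmp/out.json && cat /tmp/out.json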

keithrozario commented on June 11, 2024

I love this conversation. I did a test today using just numpy, comparing a layer that had __pycache__ vs a layer that didn't, on a 128MB function using Python 3.12.

The findings:

With __pycache__ init times were: 635ms, 593ms, 637ms
Without __pycache__ init times were: 677ms, 684ms, 708ms

Which suggests a ~50ms penalty for compiling the .py files into .pyc. Unless the package is huge (numpy is quite big already), I think you won't see any discernible performance gain, and if you tweak Lambda settings like memory size, the difference would shrink even further.

Given this, if you're importing something like boto3 or requests, the difference is so small that nobody will notice whether the cache is included or not. For larger packages like numpy and scipy, most (not all) users will want to optimize for space, so that their own code or additional layers can be larger. Defaulting to removing __pycache__ seems like a logical decision.

So right now, we will remove .pyc files from all layers moving forward. Again, this will not meet 100% of everyone's requirements, but it will serve the majority of users the majority of the time. Let me know your thoughts below.
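
For users who would rather have the ~50ms back, a sketch of pre-compiling the bytecode into a layer themselves before zipping (the interpreter must match the target runtime):

# Hedged sketch: regenerate __pycache__ for a layer built with --no-compile.
python3.12 -m compileall -q python/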

Does that mean I can remove the need for separate packages for different versions of python??? Interesting....!!

keithrozario commented on June 11, 2024

If someone makes a pull request to add these packages, I'll merge it and the builds will run automatically :)

keithrozario commented on June 11, 2024

Thanks, I'll check and see if it's possible. But there's a lot of bespoke effort that may be unsustainable.

The Lambda limit is 50MB zipped, and currently the total zipped size is bigger than that :(.

rgommers commented on June 11, 2024

@aperture147 that's historical. Once upon a time, many more users built from source, and back then it was critical to be able to run tests with numpy.test() in order to diagnose all sorts of weird issues. Having tests in tests/ subfolders of the package used to be very common, maybe even the standard place to store them.

For new projects started today, the test suite usually goes outside of the importable tree. Moving everything in numpy now, though, would be very disruptive, as it would (among other things) make all open PRs unmerge-able.

rgommers commented on June 11, 2024

I'll need to write docs for it, but this command will already remove test data as well as some large-ish _test_xxx.so extension modules that live outside of the tests/ directories:

$ python -m build -wnx -Cinstall-args=--tags=runtime,python-runtime,devel

It's available in SciPy's main branch since a week ago (scipy/scipy#20712).

I forgot to remove .dat and dist-info as well. That's up next.

You probably want to keep .dist-info. It's actually a functional part of a package, e.g. importlib.metadata uses it. And the license file is mandatory to keep when you're redistributing. .dist-info is also small, ~100 kB or so. If you really need to shave things off, I'd only remove RECORD, since it's both the largest file in there and not a very important one within a Lambda layer.
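
A sketch of that RECORD-only removal, assuming a python/ install directory:

# Hedged sketch: keep .dist-info for importlib.metadata and the license,
# but drop RECORD, the largest file in it and one unused at runtime.
find python -path '*.dist-info/RECORD' -delete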

keithrozario commented on June 11, 2024

Thanks. Unfortunately, I do not build the package from source; I merely pip install.

I'll take your advice on keeping the dist-info, but I'll see if I can identify any _test_xxx.so files to remove as well.

keithrozario commented on June 11, 2024

Thanks -- the challenge for Klayers, at least, is that we need to keep the script generic. I'm very hesitant to include package-specific build steps for something like scipy, because maintaining that going forward would be difficult.

Although it sounds OK, deleting every file that matches _test*.so might cause issues with other packages; that said, I would say the probability that someone has a runtime-required .so file beginning with _test is very low.
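
Scoping the deletion to scipy itself rather than applying it globally would remove most of that risk; a sketch:

# Hedged sketch: prune test extension modules inside scipy only, so no
# other package can be affected (note it misses ndimage's _ctest/_cytest).
find python/scipy -name '_test_*.so' -delete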

Still pondering. Wonder what others are thinking.

rgommers commented on June 11, 2024

I'm thinking that there could be a way to make scipy and numpy use the same GFortran and OpenBLAS libraries, so we could save about 25M. Is there any way to achieve this @rgommers? I'm not a guru at building statically-linked libraries, especially with the meson build system. If we build this layer on amazonlinux2 and dynamically link some libraries that already exist in the environment, we could shrink the layer even more.

Not really when building the layer from wheels published to PyPI. NumPy uses 64-bit (ILP64) OpenBLAS, while SciPy uses 32-bit (LP64). We have a long-term plan to unify these two builds, but PyPI/wheels make this very complex. I would not recommend doing manual surgery here.
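
The duplication is easy to see in an installed tree; a sketch:

# Hedged sketch: each wheel vendors its own OpenBLAS/GFortran shared
# libraries in a separate directory next to the package.
ls -lh python/numpy.libs python/scipy.libs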

keithrozario commented on June 11, 2024

Test layer is here:

arn:aws:lambda:ap-southeast-1:367660174341:layer:Klayers-p312-scipy:1

We pass the --no-compile flag to avoid the .pyc and __pycache__ files, and we also delete all directories named 'tests', as recommended by the experts on this thread :)

Feel free to run some tests on the layer. If all goes well, I'll push this to production before the end of the week, and we'll have 'optimized' builds going forward.

I notice that stripping the Python bytecode increases the cold start time. Should we keep the bytecode to reduce cold starts, or is it just me fiddling too much with the layer?

Yes. Do you know how much slower the cold start is? Python will need to compile the .py files into bytecode, and that will incur some latency. For big packages this might be a lot, but I'm not sure.

keithrozario commented on June 11, 2024

No, it's probably bytecode compilation. Let me think about this a bit more. Bytecode is specific to the Python version it was compiled with, so it's shareable across functions on the same runtime, though a runtime upgrade means recompiling.

But bytecode also takes space, so we have to trade off space against speed. Nothing will work for everyone -- my thinking is to remove bytecode only if the package is large.

keithrozario commented on June 11, 2024

If you're asking about the official AWS layer, I don't really know.

We can try to add SciPy for 3.10 here, but we may run into the size limit, which is a hard limit that can't be worked around.

dschmitz89 commented on June 11, 2024

SciPy wheels are roughly 30-40 MB in size lately: https://github.com/scipy/scipy/releases/tag/v1.11.4. Does that seem like too much?

I would like to see if I can help out with this issue. As a regular SciPy contributor I am familiar with the scipy tooling, and I use Lambda at my day job, but I am pretty new to Lambda layer creation. Do you still have the old scripts for SciPy lying around?

keithrozario commented on June 11, 2024

I tried building SciPy for Lambda, but currently its size exceeds what Lambda accepts.

Lambda has a limit of 50MB, and SciPy's size is above that (~57MB). Note this is the result of a pip install scipy ... which includes not just SciPy but numpy as well.

I will see if we can remove the cache files to reduce the size, but at the moment this is where it stands :(

keithrozario commented on June 11, 2024

Currently the output looks like this:
[screenshot of the build output]

keithrozario commented on June 11, 2024

I will experiment with removing the __pycache__ directories, and separately with keeping only the __pycache__ directories, to see what happens.

gpap-gpap commented on June 11, 2024

I am also interested in a scipy layer for 3.10+, and can't find a workaround for the size limit. I am not sure if you already do this, but running something like find . | grep -E "(/tests$|__pycache__$|\.pyc$|\.pyo$)" | xargs rm -rf before zipping gets rid of files that are not needed in the layer. If that fails, then all you can do is install the submodules of scipy separately as needed, which is not ideal.

dschmitz89 commented on June 11, 2024

Friendly ping: was there any progress here? For the custom removal of code, is it possible to automatically inject such package-specific code into the whole terraform build script?

keithrozario commented on June 11, 2024

If someone could modify the build function, that'd be much appreciated :). I think for now we can remove all __pycache__ files to save space; that may help.

alexiskat commented on June 11, 2024

Not sure if this will help at all, but this saved a lot of space when building the layer.

docker run -v "$PWD":/var/task "public.ecr.aws/sam/build-python3.9" /bin/sh -c "pip install -r requirements.txt --platform manylinux2014_x86_64 --implementation cp --python-version 3.9 --only-binary=:all: --upgrade --trusted-host pypi.org --trusted-host files.pythonhosted.org -t python/lib/python3.9/site-packages/; exit"
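
From there, zipping and publishing the layer is a couple more commands; a sketch (the layer name is hypothetical):

# Hedged sketch: package the tree and publish it as a layer version.
zip -r9 layer.zip python/
aws lambda publish-layer-version --layer-name scipy-p39 \
    --zip-file fileb://layer.zip --compatible-runtimes python3.9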

keithrozario commented on June 11, 2024

Wow. I need to find some way to automate this. What does --no-compile do?

rgommers commented on June 11, 2024

@keithrozario we have just implemented proper support for this in NumPy, via "install tags" in the build system. Here is how to use it: numpy/numpy#26289 (comment). I'm planning to do the same for SciPy. It would come down to adding -Cinstall-args="--tags=runtime,devel,python-runtime" to your pip install (or pip wheel or python -m build) invocation in order to drop the test suite.

--no-compile is a pip flag: https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-no-compile

That together should make all this a one-liner. It should work for NumPy now.
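
A sketch of what the combined one-liner might look like (untested; the install tags only take effect when pip builds from the sdist, hence the --no-binary flag):

# Hedged sketch: combine --no-compile with the install tags in one invocation.
pip install numpy --no-compile --no-binary numpy \
    -Cinstall-args="--tags=runtime,devel,python-runtime" -t python/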

aperture147 commented on June 11, 2024

Wow. I need to find some way to automate this. What does --no-compile do?

It tells pip not to precompile the Python code into bytecode during the install process. But it's the test suites that consume the most megabytes; the bytecode takes just a few megabytes at most.

My approach is summed up in this script:

# install CPython wheels only (pip requires --only-binary with --implementation)
pip install numpy pandas scipy --no-compile --implementation cp --only-binary=:all: -t python

# remove all dist-info directories
rm -r python/*.dist-info

# delete all tests directories
find . | grep -E "/tests$" | xargs rm -rf

# clean up python byte code if any
find . | grep -E "(/__pycache__$|\.pyc$|\.pyo$)" | xargs rm -rf

# also delete pyproject.toml since it isn't needed
find . | grep -E "pyproject\.toml$" | xargs rm -rf

# delete the unused .dat files, deprecated since scipy 1.10
find . | grep -E "scipy/misc/.*\.dat$" | xargs rm -rf

Btw, I think modifying the bundled source code is not good practice, though.

aperture147 commented on June 11, 2024

@keithrozario we have just implemented proper support for this in NumPy, via "install tags" in the build system. Here is how to use it: numpy/numpy#26289 (comment). I'm planning to do the same for SciPy. It would come down to adding -Cinstall-args="--tags=runtime,devel,python-runtime" to your pip install (or pip wheel or python -m build) invocation in order to drop the test suite.

--no-compile is a pip flag: https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-no-compile

That together should make all this a one-liner. It should work for NumPy now.

I don't get why numpy and scipy have their test suites in the wheel, when they contribute nothing at runtime. I thought they were a sanity check run on every import, but they're just the test packages from the meson build phase. It's a bummer to have to rebuild numpy just to get rid of the test suite.

aperture147 commented on June 11, 2024

AFAIK, SciPy and NumPy are safe to have their tests directories removed. NumPy's tests directories are even larger than SciPy's.

keithrozario commented on June 11, 2024

I forgot to remove .dat and dist-info as well. That's up next.

rgommers commented on June 11, 2024

I think this is the full list:

$ ls -l build/scipy/*/*.so | rg test
-rwxr-xr-x 1 rgommers rgommers    28664 17 mei 18:04 build/scipy/integrate/_test_multivariate.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers   270968 17 mei 18:04 build/scipy/integrate/_test_odeint_banded.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers   151456 17 mei 18:04 build/scipy/io/_test_fortran.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers    52912 17 mei 18:04 build/scipy/_lib/_test_ccallback.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers   158752 21 mei 13:45 build/scipy/_lib/_test_deprecation_call.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers    92272 21 mei 13:45 build/scipy/_lib/_test_deprecation_def.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers    31336 17 mei 18:04 build/scipy/ndimage/_ctest.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers   386480 21 mei 13:45 build/scipy/ndimage/_cytest.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers  1095216 21 mei 13:45 build/scipy/special/_test_internal.cpython-312-x86_64-linux-gnu.so

dschmitz89 commented on June 11, 2024

Yep, this would be a nightmare to maintain in the long run.

I would be interested in testing it out on a fork of this repo, though, without making a PR to your main repo. Any chance we can make that work?

aperture147 commented on June 11, 2024

You could try adding a specific script for each special library: for example, a file called scipy.sh that customizes the installation (by deleting unwanted files). Then, whenever you install scipy, check whether a scipy.sh exists in the repo; if it does, use scipy.sh instead of a plain pip install scipy to install into the layer.
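
A sketch of that dispatch logic, where overrides/ is a hypothetical directory of bespoke install scripts:

# Hedged sketch: use a per-package override script when one exists,
# otherwise fall back to a plain pip install.
pkg="scipy"
if [ -x "overrides/${pkg}.sh" ]; then
    "overrides/${pkg}.sh"
else
    pip install "${pkg}" -t python/
fi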

I noticed that scipy and numpy both use GFortran and OpenBLAS, but each uses a slightly different version, stored separately as .so files in the numpy.libs and scipy.libs directories. I'm thinking that there could be a way to make scipy and numpy use the same GFortran and OpenBLAS libraries, so we could save about 25M. Is there any way to achieve this @rgommers? I'm not a guru at building statically-linked libraries, especially with the meson build system. If we build this layer on amazonlinux2 and dynamically link some libraries that already exist in the environment, we could shrink the layer even more.

aperture147 commented on June 11, 2024

Test layer is here:

arn:aws:lambda:ap-southeast-1:367660174341:layer:Klayers-p312-scipy:1

We pass the --no-compile flag to avoid the .pyc and __pycache__ files, and we also delete all directories named 'tests', as recommended by the experts on this thread :)

Feel free to run some tests on the layer. If all goes well, I'll push this to production before the end of the week, and we'll have 'optimized' builds going forward.

I notice that stripping the Python bytecode increases the cold start time. Should we keep the bytecode to reduce cold starts, or is it just me fiddling too much with the layer?

aperture147 commented on June 11, 2024

Yes. Do you know how much slower the cold start is? Python will need to compile the .py files into bytecode, and that will incur some latency. For big packages this might be a lot, but I'm not sure.

Normally it only takes about 500ms to 1s to warm up the Lambda, but now it takes 2s+ (sometimes up to 5s+ if I import all of numpy, scipy, and pandas) to spin up (tested on a 1024MB Python 3.10 Lambda function). Is it a bytecode compilation problem, or is it just me doing too many surgeries on the layer?
