
Comments (42)

chrxh avatar chrxh commented on August 24, 2024 3

Good to know that it's a more general problem. I've a wild guess as to what it might be. But it's hard for me to fix it if I can't reproduce the bug myself. I'll try to provide a special debug version and then come back to you.

from alien.

chrxh avatar chrxh commented on August 24, 2024 3

Thanks for all the info! I'll provide automatic tests that I used during the development.
This should give a more precise indication. I'll get back to you when I have them ready.


Aurel-C avatar Aurel-C commented on August 24, 2024 2

When I run the tests with a 1070 Ti, they get stuck while running CleanupGpuTests.testCleanupCells (no failure).


chrxh avatar chrxh commented on August 24, 2024 2

Super, thanks for all the testing!
This is still a work in progress. I haven't yet tried all extreme situations. There is now a heuristic that constantly predicts the near-future memory requirements of particles, cells, etc., and the arrays on the GPU are automatically adjusted to that prediction.
Maybe it can't handle very sudden and large changes yet. Or, as you said, the GPU memory was simply no longer sufficient.
EDIT: You can get more information if you enable the log window and switch to "verbose".
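To illustrate the kind of heuristic described above, here is a minimal sketch of a capacity predictor. It is not taken from the alien source: the name `CapacityPredictor`, the linear extrapolation, and the `headroomFactor` safety margin are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the alien code base): remembers the last
// observed entity count and predicts the capacity an array should have
// for the next time steps.
struct CapacityPredictor
{
    std::size_t lastCount = 0;

    // `headroomFactor` is an assumed safety margin applied to the prediction.
    std::size_t predict(std::size_t currentCount, std::size_t currentCapacity, double headroomFactor = 2.0)
    {
        assert(headroomFactor >= 1.0);

        // Growth since the previous observation; a shrinking count is treated as zero growth.
        std::size_t growth = currentCount > lastCount ? currentCount - lastCount : 0;
        lastCount = currentCount;

        // Extrapolate one step ahead linearly, then add the safety margin.
        auto predicted = static_cast<std::size_t>((currentCount + growth) * headroomFactor);

        // This sketch only ever grows the array; it never shrinks it.
        return std::max(predicted, currentCapacity);
    }
};
```

A predictor like this only looks one step back, so a burst that more than doubles the entity count within a single step would still exceed the predicted capacity, which matches the "very sudden and large changes" caveat.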

If you are interested, there are more examples to play around with: https://github.com/chrxh/alien/tree/features/version3/examples/simulations
There is currently no built-in editor like in the old version. The examples were constructed in the old one and then imported.

It will take a few more months until the new built-in editor is ready.


scrps avatar scrps commented on August 24, 2024 1

@hermanTenuki: Can you also please send me the list of tests which were successful? Thx!

I've also been working on that, sorry for the wall of text. This was tested on Win 10 with a GTX 1060 6GB and the latest drivers. Any test listed without an error message hung (maybe add a 5000-6000 ms timeout to the test code?):

CleanupGpuTests:

	Failed on:
	testCleanupCells
	testCleanupCellPointers
	testCleanupTokens

ClusterGpuTests:

	Failed on:
	testFusionOfHorizontalClusters
	testFastRotatingCluster
	regressionTestManyOverlappingRectangleClusters
	regressionTestManyRectangleClusters_concentratedAtUniverseBoundary

ClusterGpuWithManyThreadsPerBlockTests:

	All failed

ClusterGpuWithOneBlockTests:

	Failed on:      
	regressionTestFusionAndHeavyCollision

CommunicatorGpuTests:

	All failed

ConstructorGpuTests:

	Failed on:
	testConstructFirstCellOnHorizontalCluster_standardParameters
	testConstructFirstCellOnHorizontalCluster_nonStandardParameters1
	testConstructFirstCellOnHorizontalCluster_nonStandardParameters2
	testConstructFirstCellOnHorizontalCluster_ignoreDistanceOnFirstConstructedCell1
	testConstructFirstCellOnHorizontalCluster_ignoreDistanceOnFirstConstructedCell2
	testConstructFirstCellOnHorizontalCluster_rightHandSide
	testConstructFirstCellOnHorizontalCluster_errorNoEnergy
	testConstructFirstCellOnHorizontalCluster_otherClusterRightObstacle_safe
	testConstructFirstCellOnHorizontalCluster_otherClusterLeftObstacle_safe
	testConstructFirstCellOnHorizontalCluster_ownClusterObstacle_backward
	testConstructFirstCellOnWedgeCluster_leftHandSide
	testConstructFirstCellOnTiangleCluster
	testConstructFirstCellOnHorizontalCluster_withEmptyToken
	testConstructFirstCellOnHorizontalCluster_withDuplicatedToken
	testConstructFirstCellOnHorizontalCluster_finishWithoutSeparation
	testConstructFirstCellOnHorizontalCluster_finishWithSeparation
	testConstructFirstCellOnHorizontalCluster_finishWithSeparation_otherClusterRightObstacle_safe
	testConstructFirstCellOnHorizontalCluster_finishWithSeparationAndReduction
	testConstructSecondCellOnHorizontalCluster_standardParameters
	testConstructSecondCellOnHorizontalCluster_maxDistance
	testConstructSecondCellOnHorizontalCluster_rightHandSide
	testConstructSecondCellOnHorizontalCluster_leftHandSide
	testConstructSecondCellOnHorizontalCluster_errorNoEnergy
	testConstructSecondCellOnHorizontalCluster_otherClusterRightObstacle_safe
	testConstructSecondCellOnHorizontalCluster_otherClusterLeftObstacle_safe
	testConstructSecondCellOnSelfTouchingCluster_ownClusterObstacle_safe
	testConstructSecondCellOnHorizontalCluster_ownClusterObstacle_backward
	testConstructSecondCellOnHorizontalCluster_withEmptyToken
	testConstructSecondCellOnHorizontalCluster_withEmptyToken_tokenAnreadyOnConstructionSite
	testConstructSecondCellOnHorizontalCluster_withDuplicatedToken
	testConstructSecondCellOnHorizontalCluster_withDuplicatedToken_suppressed
	testConstructSecondCellOnHorizontalCluster_finishWithoutSeparation
	testConstructSecondCellOnHorizontalCluster_finishWithSeparation_standardPosition
	testConstructSecondCellOnHorizontalCluster_finishWithSeparation_nonStandardPosition
	testConstructSecondCellOnHorizontalCluster_finishWithSeparationAndReduction
	testConstructSecondCellOnHorizontalCluster_finishWithTokenAndSeparationAndReduction
	testConstructSecondCellOnHorizontalCluster_finishWithTokenAndSeparationAndReduction_suppressed
	testConstructSecondCellOnHorizontalCluster_finishWithSeparation_otherClusterRightObstacle_safe
	testConstructThirdCellOnHorizontalCluster_standardParameters
	testConstructThirdCellOnLineCluster_nonStandardParameters
	testConstructThirdCellOnLineCluster_tokenOnConstructionSite
	testConstructThirdCellOnLineCluster_noMultipleConnections_limitMaxConnections
	testRotationOnlyOnHorizontalCluster
	testRotationOnlyOnHorizontalCluster_otherClusterObstacle_safe
	testRotationOnlyOnHorizontalCluster_ownClusterObstacle_safe
	testLargeCluster_limitSize
	testParallelConstructionFromDifferentSources
	testParallelConstructionFromDifferentConstructors_manyIsolatedClusters
	testParallelConstructionFromDifferentConstructors_touchingClusters
	testMultipleConnectedConstructionSites_errorConnection

ConstructorGpuWithHighBlockCountTests:

	All failed

GpuBenchmark:

	All failed
	testClusterAndParticleMovement caused tests.exe to crash with:
	Error: unknown file: error: C++ exception with description "CUDA error at CudaSimulation.cu:192 code=700(cudaErrorIllegalAddress) "cudaGetLastError()"" thrown in the test body.
	CUDA error at D:\temp\alien\source\EngineGpuKernels\Array.cuh:47 code=46(cudaErrorDevicesUnavailable) "cudaMemcpy(&data, _data, sizeof(T*), cudaMemcpyDeviceToHost)"

PropulsionGpuTests:

	Failed on:
	testParallelization1
	testParallelization2

ReplicatorGpuTests:

	All failed
	Error: unknown file: error: C++ exception with description "<unspecified file>(1): expected value" thrown in the test body.

ReplicatorGpuTestsWithManyThreads:

	All failed
	Error: unknown file: error: C++ exception with description "<unspecified file>(1): expected value" thrown in the test body.

ScannerGpuTests:

	Failed on:
	ScannerGpuTests.testScanRestart2

SensorGpuTests:

	Failed on:
	testSearchVicinity_success
	testSearchByAngle_success
	testSearchFromCenter_success
	testSearchTowardCenter_success

TokenEnergyGuidanceGpuTests:

	Failed on:
	testParallelization

TokenSpreadingGpuTests:

	Failed on:
	testMovementWithFittingBranchNumbers_manyLargeClusters
	testMovementWithEncounter
	testMovementDuringDecomposition
	testCreationAfterSecondFusion
	testMovementWithTooManyTokens
	testMassiveMovements
	testMovementOnDestroyedCell_closeCell
	regressionTestManyStickyRotatingTokenClusters
	testCellDecayDueToTokenUsage

WeaponGpuTests:

	Failed on:
	testStrike
	Error: D:\temp\alien\source\Tests\WeaponGpuTests.cpp(155): error: Expected equality of these values:
		  -expectedEnergyLoss
		    Which is: -11
		  *result.energyDiffOfTarget1
		    Which is: -10.5


Aurel-C avatar Aurel-C commented on August 24, 2024 1

Setting fusion velocity to 1000 fixes collisions.sim and obstacles.sim but not default.sim.


chrxh avatar chrxh commented on August 24, 2024 1

This exception seems to occur during termination (maybe something different from the problem above).
The actual problem has something to do with thread divergence. The behavior of some GeForce 10-series cards is apparently different from that of the 20-series and 30-series. I'll have to think about this in more detail.
For testing I've included alien.exe in the Debug.zip. Maybe you can check whether the standard example works now.


chrxh avatar chrxh commented on August 24, 2024 1

It'll take a little while. The background of the problem is as follows:
In CUDA it's very costly to implement critical sections and one should avoid this by organizing the data appropriately.
Unfortunately, this isn't possible here everywhere due to the complexity of the simulation code. The implementation of the mutex doesn't seem to work on some cards of the 10-series. Therefore a few code places have to be reorganized/rewritten.
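As a CPU-side analogy for "organizing the data appropriately", the sketch below lets every worker write only to its own slot, so no mutex or critical section is needed. It uses plain C++ threads rather than CUDA, and the function name `sumWithoutLock` is hypothetical; it is meant to illustrate the idea, not the simulation code.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// CPU analogy only (the real simulation runs in CUDA kernels).
// Each worker owns exactly one output slot, so no lock is required.
long long sumWithoutLock(std::vector<int> const& values, std::size_t workerCount)
{
    assert(workerCount > 0);

    std::vector<long long> partial(workerCount, 0);
    std::vector<std::thread> workers;

    for (std::size_t w = 0; w < workerCount; ++w) {
        workers.emplace_back([&values, &partial, w, workerCount] {
            // Strided loop: worker w handles elements w, w + workerCount, ...
            long long local = 0;
            for (std::size_t i = w; i < values.size(); i += workerCount) {
                local += values[i];
            }
            partial[w] = local;  // the only write, and only to this worker's slot
        });
    }
    for (auto& t : workers) {
        t.join();
    }

    // The partial results are combined by a single thread after all workers finished.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

When the outputs cannot be partitioned this way (as is apparently the case for parts of the simulation), some form of locking remains necessary, which is where the hardware-dependent behavior comes in.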


chrxh avatar chrxh commented on August 24, 2024 1

I'm in the process of rewriting and improving the entire simulation code for the next major version. In the new engine this blocking issue should not appear anymore. Could someone with a GeForce 10 card please test alien-project.org/files/alien-version3-preview.zip (built from branch features/version3) and run the simulation, which will be loaded automatically after startup?
This simulation is intended to show machines that create elastic ribbons which in turn fold into curved structures. It would be nice to let me know if this works.
Thanks in advance!
Many functions are disconnected (e.g. editor does not work) and there is still a lot to do. This is only an intermediate result. You can zoom and apply forces to the particles with the mouse button pressed while the simulation is running (in "Action mode").


nodeSpace avatar nodeSpace commented on August 24, 2024 1

Just ran the new version, nope, no freeze this time!


chrxh avatar chrxh commented on August 24, 2024 1

Great, thank you very much! The performance is greatly reduced in this version due to the massive console outputs.
In the future, I'll probably need some further tests from time to time.


chrxh avatar chrxh commented on August 24, 2024 1

I've been working a lot on the new major version lately and it would be a great help for me to know if everything works on an Nvidia 10 card. There are many visible changes now. In particular, I've changed the GUI framework and now use Dear ImGui (...and thus got rid of many problems).

Could someone please give this a try?
The source code is on the feature branch https://github.com/chrxh/alien/tree/features/version3.
(A Windows binary/compiled version is available on https://alien-project.org/files/alien-version3-preview.zip.)


nodeSpace avatar nodeSpace commented on August 24, 2024 1

Seems to work pretty well, no issues so far for me!


hermanTenuki avatar hermanTenuki commented on August 24, 2024 1

Now it works fine on a 1050 Ti, well done.


nodeSpace avatar nodeSpace commented on August 24, 2024 1

I managed to crash the new version when I maxed out radiation strength, set minimum energy very low, and removed the spot areas. I think it's because it tried to use more memory than my GPU had, as these settings cause many cells to spawn simultaneously.
(This is probably more of an adversarial case though; it ran fine before I set the extreme values.)


chrxh avatar chrxh commented on August 24, 2024

Hi,
this sounds like a CUDA kernel timeout issue because the GUI thread still seems to be running. I had received reports that on Nvidia 10 series, the default CUDA thread block count might be too high. If this is the case for you, a simple solution would be to, for example, halve the number of blocks (under Settings -> General Settings -> Number of Blocks).
Does this work for you?

Unfortunately, I haven't yet found a good heuristic for the default value.


jhugard avatar jhugard commented on August 24, 2024

No luck. Cut threads and blocks to 32/64, as well as to 16/16. Also tried halving all other settings, but no change in symptoms.


chrxh avatar chrxh commented on August 24, 2024

Hm... It seems to be difficult to get CUDA programs to work on different hardware. I had tested it on a 1050 Ti without problems.
Do you experience the same issue with the simple physics example in examples/simulations/collisions.sim ?
It uses only the physics part of the engine.


chrxh avatar chrxh commented on August 24, 2024

Sorry, I actually meant obstacles.sim.


jhugard avatar jhugard commented on August 24, 2024

The simulation runs until shortly after the 2nd vertical bar hits the 3rd one. I can exit without a hang until this happens.

So, obstacles.sim also fails.


domano avatar domano commented on August 24, 2024

Having the same issue with a 1070 (not Ti).


UltraSabreman avatar UltraSabreman commented on August 24, 2024

I'm having the same issue with my 1080ti, including the same results with obstacles.sim.


scrps avatar scrps commented on August 24, 2024

I've actually managed to get obstacles.sim running on my 1060 6GB. I set the number of threads per block to 1000. After changing it around, it seems to fail when set to anything under 900. Hope that helps.

Edit: Realized I put 1080 instead of 1060.


chrxh avatar chrxh commented on August 24, 2024

The installer of the latest version 2.5.3 now provides ~ 300 compiled integration tests.
Could you please run them (Tests.exe in the bin folder)? Alternatively, you can compile them for yourself (it's the Tests project).
Possibly 1-2 tests fail nondeterministically (I haven't analyzed them yet... this can be ignored for the moment).
Thanks a lot!


chrxh avatar chrxh commented on August 24, 2024

Ok, there we have a timeout. Unfortunately, this info is of limited help.
This test checks a kind of garbage collector. It runs some test simulation where many cells are created and destroyed in the process and finally proves that there is no memory leak.
There are many more subtle, fine-grained tests that will (unfortunately) be executed later because the tests run in alphabetical order. It would be very helpful to know all test results. I don't know of any mechanism to terminate a CUDA kernel from host code (i.e. on the CPU). To achieve this, you have to set a timeout for the CUDA kernels in the OS.
This is explained, for example, here: https://forums.developer.nvidia.com/t/cuda-kernel-timeout/12160/5
Would be great if you could try this out.

If this is not working: A more time-consuming solution would be to skip this test with a command line parameter:
Tests.exe --gtest_filter=-*CleanupGpuTests*
But then, of course, we can have the same problem with another coarse-granular test that runs into timeout. Nevertheless, it would give me valuable information.
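Since further suites may need to be excluded one after another, a small helper can assemble the filter value. The sketch below is a hypothetical convenience function (`buildExcludeFilter` is not part of GoogleTest or alien); it only relies on the documented filter syntax, in which everything after a leading `-` is a `:`-separated list of negative patterns.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: builds the value for --gtest_filter that excludes
// every test whose full name contains one of the given suite names.
std::string buildExcludeFilter(std::vector<std::string> const& suites)
{
    if (suites.empty()) {
        return "*";  // GoogleTest's default filter: run everything
    }

    std::string filter;
    for (auto const& suite : suites) {
        assert(!suite.empty());
        // The first pattern is introduced by '-', all further ones are separated by ':'.
        filter += filter.empty() ? "-" : ":";
        filter += "*" + suite + "*";
    }
    return filter;
}
```

For example, `buildExcludeFilter({"CleanupGpuTests", "ClusterGpuTests"})` yields `-*CleanupGpuTests*:*ClusterGpuTests*`, which would then be passed as `Tests.exe --gtest_filter=...`. Note that the pattern `*ClusterGpuTests*` also matches any other suite whose name contains that string.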


hermanTenuki avatar hermanTenuki commented on August 24, 2024

All the same symptoms as above on a 1050 Ti + i5 7400. The tests get stuck on CleanupGpuTests.testCleanupCells as well.
I tried Tests.exe --gtest_filter=-*CleanupGpuTests*, which got stuck on ClusterGpuTests.testFusionOfHorizontalClusters.
Then Tests.exe --gtest_filter=-*CleanupGpuTests*:*ClusterGpuTests* got stuck on ClusterGpuWithManyThreadsPerBlockTests.regressionTestManyRectangleClusters, and so on with ClusterGpuWithOneBlockTests.regressionTestFusionAndHeavyCollision (I haven't tried to ignore more of them at this point).
I hope this helps you somewhat :/


UltraSabreman avatar UltraSabreman commented on August 24, 2024

I've attempted to disable the timeout using multiple methods: the linked TdrLevel registry edit, going through Nsight Monitor, and other things I found while googling around, but nothing has worked. It still gets stuck/times out on CleanupGpuTests.testCleanupCells.


chrxh avatar chrxh commented on August 24, 2024

@hermanTenuki: This may be helpful. Perhaps the problem is located in the fusion algorithm. One can prevent fusions by setting the velocity threshold very high. So you can try the following: Load obstacles.sim simulation, open the dialog Settings -> Simulation parameters and set the simulation parameter cell -> fusion velocity to maybe 1000.
Does it crash now?


chrxh avatar chrxh commented on August 24, 2024

@hermanTenuki: Can you also please send me the list of tests which were successful? Thx!


chrxh avatar chrxh commented on August 24, 2024

Thanks a lot for your work!
Something fundamental doesn't seem to be working. I'm trying to figure out what it could be based on the test list.
But I'm afraid that without the hardware it will be hard for me to identify the problem. I'll probably compose a test with detailed console output. I can't think of anything else right now :-/

Yes, I'd like to be able to automatically terminate a test after x seconds timeout. But it doesn't seem that easy to terminate a running CUDA kernel from CPU code. If you have any ideas, I would be very grateful!
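One possible direction, sketched below under stated assumptions: a running kernel cannot be cancelled from the host, but the host can at least detect that a test did not finish in time. The helper `finishesWithin` is hypothetical and uses only the C++ standard library; it runs the work on a detached thread and waits on a future with a timeout.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <thread>
#include <utility>

// Hypothetical watchdog: runs `work` on a detached thread and reports
// whether it completed within `timeout`. The work itself is NOT cancelled.
bool finishesWithin(std::function<void()> work, std::chrono::milliseconds timeout)
{
    assert(timeout.count() >= 0);

    std::packaged_task<void()> task(std::move(work));
    std::future<void> done = task.get_future();

    // Detached so that a hanging task does not block this function on return.
    std::thread(std::move(task)).detach();

    return done.wait_for(timeout) == std::future_status::ready;
}
```

If this returns `false`, the worker thread is still stuck, so the caller could only log the test name and end the whole process (for example with `std::abort()`). To continue with the remaining tests after a hang, each test would presumably have to run in its own process.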


chrxh avatar chrxh commented on August 24, 2024

I've prepared a test with verbose console output. It's stored in alien-project.org/files/Debug.zip.
It contains the compiled Test project. Please run it with Test.exe --gtest_filter=*testFusionOfHorizontalClusters.
Only one GPU thread is used (at least I hope that the error can be reproduced with a single thread).
Could you please tell me the output (the last line will do assuming it hangs)?


UltraSabreman avatar UltraSabreman commented on August 24, 2024

Ran as requested, output here: https://termbin.com/0lyi

I also ran the full set of tests while dumping the output to a file. Killed the process when the file reached half a gig. The hangup seems to have the program bouncing around between lines 180, 182, 178, 170, and 168, in the D:\temp\alien\source\EngineGpuKernels\SimulationKernels.cuh file.


chrxh avatar chrxh commented on August 24, 2024

Some of the other tests use thousands of GPU threads, which is why the console can't cope with the output anymore.
I think we need to play some more ping pong :) I modified the above test (testFusionOfHorizontalClusters) to run with multiple GPU threads again (so it should now run into timeout), but only create output at sync points: alien-project.org/files/Debug.zip. This would at least allow a little more precise localization in the code.


UltraSabreman avatar UltraSabreman commented on August 24, 2024

Hangs on D:\temp\alien\source\EngineGpuKernels\SimulationKernels.cuh:157

Full output here if you need it: https://termbin.com/mu0x


chrxh avatar chrxh commented on August 24, 2024

Next round: I now use only 1 thread block (but with more than one thread) and more outputs.


UltraSabreman avatar UltraSabreman commented on August 24, 2024

Hangs on D:\temp\alien\source\EngineGpuKernels\SimulationKernels.cuh:81. Output: https://termbin.com/rknp


chrxh avatar chrxh commented on August 24, 2024

That helps! Thanks! Is there still a timeout? If not, can you please run all tests?


UltraSabreman avatar UltraSabreman commented on August 24, 2024

Major progress, I can run the tests now, however it errors out about 1/3 of the way through:

[ RUN      ] ClusterGpuTests.regressionTestManyOverlappingRectangleClusters
Qt has caught an exception thrown from an event handler. Throwing
exceptions from an event handler is not supported in Qt.
You must not let any exception whatsoever propagate through Qt code.
If that is not possible, in Qt 5 you must at least reimplement
QCoreApplication::notify() and catch all exceptions there.

D:\temp\alien\source\Tests\ClusterGpuTests.cpp(916): error: Expected: IntegrationTestHelper::runSimulation(300, _controller) doesn't throw an exception.
  Actual: it throws.
CUDA error at D:\temp\alien\source\EngineGpuKernels\Array.cuh:47 code=46(cudaErrorDevicesUnavailable) "cudaMemcpy(&data, _data, sizeof(T*), cudaMemcpyDeviceToHost)"


UltraSabreman avatar UltraSabreman commented on August 24, 2024

It works. I ran every simulation included in the examples folder. Only planet gaia and replicator - ecosystem crashed: gaia after a minute or two of the simulation running, while ecosystem crashes every time it's started. I sent out error reports for both crashes. Let me know if you need more.


nodeSpace avatar nodeSpace commented on August 24, 2024

I have a GTX 1060 3GB. After running alien-project.org/files/alien-version3-preview.zip (on Windows 10), the view window freezes after a second or so. When I step through it frame by frame, the entire program freezes after about 150 steps (not just the view window) and I get the "alien.exe is not responding" Windows popup.


chrxh avatar chrxh commented on August 24, 2024

Thanks a lot for your help!
I've added a correction and an additional console log output: It would still help me a lot if you could run the program (alien-project.org/files/alien-version3-preview.zip) again. Does it freeze again?
If so, could you please send me the last output from the console (which is in the background and opens in addition to the main window). Thanks a lot!
I know this is a bit of a nuisance. But unfortunately I see no other way to get information about the problem because I cannot reproduce it with my hardware.


chrxh avatar chrxh commented on August 24, 2024

I'm closing this issue because the problem is now solved.
