jbush001 / nyuziprocessor
GPGPU microprocessor architecture
License: Apache License 2.0
Seems to be a regression. Happens on MacOS. I haven't tested on Linux.
Launch doom in debugger:
doom> make debug
Start the process, stop it, then restart it
(lldb) c
Process 1 resuming
(lldb) process interrupt
Process 1 stopped
(lldb) c
Process 1 resuming
The emulator will restart, but the first time the window is clicked on (or presumably receives any event), the emulator will crash:
* thread #1: tid = 0x1ed6dd, 0x00007fff92da3c43 libsystem_c.dylib`__findenv + 90, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x00007fff92da3c43 libsystem_c.dylib`__findenv + 90
libsystem_c.dylib`__findenv + 90:
-> 0x7fff92da3c43: movb (%r11), %cl
0x7fff92da3c46: testb %cl, %cl
0x7fff92da3c48: je 0x7fff92da3c6a ; __findenv + 129
0x7fff92da3c4a: movzbl (%rbx), %r15d
* thread #1: tid = 0x1ed6dd, 0x00007fff92da3c43 libsystem_c.dylib`__findenv + 90, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
* frame #0: 0x00007fff92da3c43 libsystem_c.dylib`__findenv + 90
frame #1: 0x00007fff92da3cc7 libsystem_c.dylib`getenv + 29
frame #2: 0x00007fff894f3cc6 CarbonCore`GetDYLDEntryPointWithImage + 56
frame #3: 0x00007fff8eef5eea HIToolbox`HLTBRegisterLazyHIObjectClass + 140
frame #4: 0x00007fff8ee95bbb HIToolbox`HIObjectClass::Lookup(__CFString const*, unsigned char) + 47
frame #5: 0x00007fff8ee95a59 HIToolbox`HIObject::Create(__CFString const*, OpaqueEventRef*, HIObject**) + 41
frame #6: 0x00007fff8ee95a0b HIToolbox`HIObjectCreate + 90
frame #7: 0x00007fff8ef8b929 HIToolbox`HIMenuBarView::GetDrawingDelegate() + 37
frame #8: 0x00007fff8eea8854 HIToolbox`HIMenuBarView::HIMenuBarView(OpaqueHIObjectRef*) + 222
frame #9: 0x00007fff8eea875a HIToolbox`HIMenuBarView::Construct(OpaqueHIObjectRef*) + 34
frame #10: 0x00007fff8eea6156 HIToolbox`HIView::EventHandler(OpaqueEventHandlerCallRef*, OpaqueEventRef*, void*) + 890
frame #11: 0x00007fff8ee95f31 HIToolbox`HIObject::Construct(HIObjectClass*, HIObject**, OpaqueHIObjectRef*) + 181
frame #12: 0x00007fff8ee95b0e HIToolbox`HIObject::Create(__CFString const*, OpaqueEventRef*, HIObject**) + 222
frame #13: 0x00007fff8ee95a0b HIToolbox`HIObjectCreate + 90
frame #14: 0x00007fff8eea8510 HIToolbox`HIMenuBarView::Create(CGRect const*, OpaqueMenuRef*, OpaqueControlRef**) + 86
frame #15: 0x00007fff8eea7e6c HIToolbox`HIMenuBarFrameView::Initialize(OpaqueEventRef*) + 38
frame #16: 0x00007fff8ee9c450 HIToolbox`HIObject::HandleClassHIObjectEvent(OpaqueEventHandlerCallRef*, OpaqueEventRef*, void*) + 490
frame #17: 0x00007fff8ee9c24a HIToolbox`HIObject::EventHook(OpaqueEventHandlerCallRef*, OpaqueEventRef*, void*) + 128
frame #18: 0x00007fff8ee9b98c HIToolbox`DispatchEventToHandlers(EventTargetRec*, OpaqueEventRef*, HandlerCallRec*) + 1260
frame #19: 0x00007fff8ee9adce HIToolbox`SendEventToEventTargetInternal(OpaqueEventRef*, OpaqueEventTargetRef*, HandlerCallRec*) + 386
frame #20: 0x00007fff8ee9ac42 HIToolbox`SendEventToEventTargetWithOptions + 43
frame #21: 0x00007fff8ee95b31 HIToolbox`HIObject::Create(__CFString const*, OpaqueEventRef*, HIObject**) + 257
frame #22: 0x00007fff8ee95a0b HIToolbox`HIObjectCreate + 90
frame #23: 0x00007fff8eea5768 HIToolbox`NewWindowCommon(WindowData**, unsigned int, unsigned long long, WindowDefSpec const*, Rect const*, unsigned char const*, unsigned char, OpaqueWindowPtr*, void*, unsigned int, unsigned short*, bool) + 945
frame #24: 0x00007fff8eea4f73 HIToolbox`_HIWindowCreateWithCGWindow + 335
frame #25: 0x00007fff8eea4d49 HIToolbox`MBWindows::CreateWindow(CGRect, unsigned int) + 245
frame #26: 0x00007fff8eea49f6 HIToolbox`MBWindows::GetWindowIDOnDisplay(unsigned int, unsigned char) + 174
frame #27: 0x00007fff8eea47ee HIToolbox`MenuBarInstance::ForEachWindowDo(void (unsigned int, unsigned int) block_pointer) + 162
frame #28: 0x00007fff8eea454d HIToolbox`MenuBarInstance::UpdateWindowBoundsAndResolution() + 155
frame #29: 0x00007fff8eea4117 HIToolbox`MenuBarInstance::Show(MenuBarAnimationStyle, unsigned char, unsigned char, unsigned char) + 229
frame #30: 0x00007fff8eecfc95 HIToolbox`SetMenuBarObscured + 232
frame #31: 0x00007fff8eecf8f2 HIToolbox`HIApplication::HandleActivated(OpaqueEventRef*, unsigned char, OpaqueWindowPtr*) + 184
frame #32: 0x00007fff8eece4f0 HIToolbox`HIApplication::EventObserver(unsigned int, OpaqueEventRef*, void*) + 238
frame #33: 0x00007fff8ee9b12c HIToolbox`_NotifyEventLoopObservers + 155
frame #34: 0x00007fff8f8ea216 AppKit`-[NSWindow _reallySendEvent:] + 10671
frame #35: 0x00007fff8f37116e AppKit`-[NSWindow sendEvent:] + 446
frame #36: 0x000000010009836a libSDL2-2.0.0.dylib`-[SDLWindow sendEvent:] + 48
frame #37: 0x00007fff8f323451 AppKit`-[NSApplication sendEvent:] + 4183
frame #38: 0x0000000100094d1f libSDL2-2.0.0.dylib`Cocoa_PumpEvents + 171
frame #39: 0x000000010003bc97 libSDL2-2.0.0.dylib`SDL_PumpEvents_REAL + 23
frame #40: 0x000000010003bd05 libSDL2-2.0.0.dylib`SDL_WaitEventTimeout_REAL + 55
frame #41: 0x0000000100006f46 emulator`poll_fb_window_event + 38
frame #42: 0x00000001000069d8 emulator`run_until_interrupt + 104
frame #43: 0x000000010000628d emulator`remote_gdb_main_loop + 1293
In tests/compiler/multiply64.c, should return 'FixedMul af07b15d', instead returns 'FixedMul 8fcdb15d'. The top 16 bits were computed by mulhs_i.
Hi,
There is no public email address for you. I would like to consult with you.
My email is [email protected]. Thank you very much!
--Jian Liu
Investigate adding two new instructions that perform signed addition and subtraction, but check for overflow and handle as follows:
if a + b > 2^31-1
sum = 2^31-1
else if a + b < -(2^31)
sum = -(2^31)
else
sum = a + b;
Useful for fixed point operations. Need a benchmark that could exercise use case.
The following
static void foo()
{
bar(a, (b + 1));
}
Crashes:
Assertion failed: (Entry != DelayedTypos.end() && "Failed to get the state for a TypoExpr!"), function getTypoExprState, file /Users/jeffbush/src/NyuziToolchain/tools/clang/lib/Sema/SemaLookup.cpp, line 5032.
0 clang-3.9 0x000000010a6207ae llvm::sys::PrintStackTrace(llvm::raw_ostream&) + 46
1 clang-3.9 0x000000010a620c09 PrintStackTraceSignalHandler(void*) + 25
2 clang-3.9 0x000000010a61d719 llvm::sys::RunSignalHandlers() + 425
3 clang-3.9 0x000000010a620f94 SignalHandler(int) + 372
4 libsystem_platform.dylib 0x00007fff937aef1a _sigtramp + 26
5 clang-3.9 0x000000010df4d8ac guard variable for shouldAddRequirement(clang::Module*, llvm::StringRef, bool&)::IOKitAVC + 81916
6 clang-3.9 0x000000010a620c2b raise + 27
7 clang-3.9 0x000000010a620ce2 abort + 18
8 clang-3.9 0x000000010a620cc1 __assert_rtn + 129
9 clang-3.9 0x000000010cb57c81 clang::Sema::getTypoExprState(clang::TypoExpr*) const + 225
10 clang-3.9 0x000000010ca9881c (anonymous namespace)::TransformTypos::TransformTypoExpr(clang::TypoExpr*) + 188
11 clang-3.9 0x000000010ca89c5a clang::TreeTransform<(anonymous namespace)::TransformTypos>::TransformExpr(clang::Expr*) + 5114
12 clang-3.9 0x000000010ca42ae7 (anonymous namespace)::TransformTypos::TryTransform(clang::Expr*) + 71
13 clang-3.9 0x000000010ca418c1 (anonymous namespace)::TransformTypos::Transform(clang::Expr*) + 65
14 clang-3.9 0x000000010ca4169d clang::Sema::CorrectDelayedTyposInExpr(clang::Expr*, clang::VarDecl*, llvm::function_ref<clang::ActionResult<clang::Expr*, true> (clang::Expr*)>) + 493
15 clang-3.9 0x000000010c9f3eca clang::Sema::CorrectDelayedTyposInExpr(clang::Expr*, llvm::function_ref<clang::ActionResult<clang::Expr*, true> (clang::Expr*)>) + 74
16 clang-3.9 0x000000010c9f6947 clang::Sema::CorrectDelayedTyposInExpr(clang::ActionResult<clang::Expr*, true>, clang::VarDecl*, llvm::function_ref<clang::ActionResult<clang::Expr*, true> (clang::Expr*)>) + 119
17 clang-3.9 0x000000010c1fe4cb clang::Parser::ParseRHSOfBinaryExpression(clang::ActionResult<clang::Expr*, true>, clang::prec::Level) + 6107
18 clang-3.9 0x000000010c1fccd8 clang::Parser::ParseAssignmentExpression(clang::Parser::TypeCastState) + 280
19 clang-3.9 0x000000010c1fcb8f clang::Parser::ParseExpression(clang::Parser::TypeCastState) + 31
20 clang-3.9 0x000000010c206f41 clang::Parser::ParseParenExpression(clang::Parser::ParenParseOption&, bool, bool, clang::OpaquePtr<clang::QualType>&, clang::SourceLocation&) + 5729
21 clang-3.9 0x000000010c200d71 clang::Parser::ParseCastExpression(bool, bool, bool&, clang::Parser::TypeCastState) + 401
22 clang-3.9 0x000000010c1fe773 clang::Parser::ParseCastExpression(bool, bool, clang::Parser::TypeCastState) + 83
23 clang-3.9 0x000000010c1fccba clang::Parser::ParseAssignmentExpression(clang::Parser::TypeCastState) + 250
24 clang-3.9 0x000000010c20a933 clang::Parser::ParseExpressionList(llvm::SmallVectorImpl<clang::Expr*>&, llvm::SmallVectorImpl<clang::SourceLocation>&, std::__1::function<void ()>) + 371
25 clang-3.9 0x000000010c1ff917 clang::Parser::ParsePostfixExpressionSuffix(clang::ActionResult<clang::Expr*, true>) + 4263
26 clang-3.9 0x000000010c205278 clang::Parser::ParseCastExpression(bool, bool, bool&, clang::Parser::TypeCastState) + 18072
27 clang-3.9 0x000000010c1fe773 clang::Parser::ParseCastExpression(bool, bool, clang::Parser::TypeCastState) + 83
28 clang-3.9 0x000000010c1fccba clang::Parser::ParseAssignmentExpression(clang::Parser::TypeCastState) + 250
29 clang-3.9 0x000000010c1fcb8f clang::Parser::ParseExpression(clang::Parser::TypeCastState) + 31
30 clang-3.9 0x000000010c25ec7c clang::Parser::ParseExprStatement() + 60
31 clang-3.9 0x000000010c25dbff clang::Parser::ParseStatementOrDeclarationAfterAttributes(llvm::SmallVector<clang::Stmt*, 32u>&, clang::Parser::AllowedContsructsKind, clang::SourceLocation*, clang::Parser::ParsedAttributesWithRange&) + 2751
32 clang-3.9 0x000000010c25cff8 clang::Parser::ParseStatementOrDeclaration(llvm::SmallVector<clang::Stmt*, 32u>&, clang::Parser::AllowedContsructsKind, clang::SourceLocation*) + 168
33 clang-3.9 0x000000010c264b1a clang::Parser::ParseCompoundStatementBody(bool) + 1322
34 clang-3.9 0x000000010c265740 clang::Parser::ParseFunctionStatementBody(clang::Decl*, clang::Parser::ParseScope&) + 496
35 clang-3.9 0x000000010c286462 clang::Parser::ParseFunctionDefinition(clang::ParsingDeclarator&, clang::Parser::ParsedTemplateInfo const&, clang::Parser::LateParsedAttrList*) + 3922
36 clang-3.9 0x000000010c1bbdf5 clang::Parser::ParseDeclGroup(clang::ParsingDeclSpec&, unsigned int, clang::SourceLocation*, clang::Parser::ForRangeInit*) + 1061
37 clang-3.9 0x000000010c2854d5 clang::Parser::ParseDeclOrFunctionDefInternal(clang::Parser::ParsedAttributesWithRange&, clang::ParsingDeclSpec&, clang::AccessSpecifier) + 1397
38 clang-3.9 0x000000010c284b65 clang::Parser::ParseDeclarationOrFunctionDefinition(clang::Parser::ParsedAttributesWithRange&, clang::ParsingDeclSpec*, clang::AccessSpecifier) + 197
39 clang-3.9 0x000000010c2842ed clang::Parser::ParseExternalDeclaration(clang::Parser::ParsedAttributesWithRange&, clang::ParsingDeclSpec*) + 3981
40 clang-3.9 0x000000010c283315 clang::Parser::ParseTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&) + 1061
41 clang-3.9 0x000000010c1a24ee clang::ParseAST(clang::Sema&, bool, bool) + 766
42 clang-3.9 0x000000010b1d548f clang::ASTFrontendAction::ExecuteAction() + 511
43 clang-3.9 0x000000010ac34bcb clang::CodeGenAction::ExecuteAction() + 6043
44 clang-3.9 0x000000010b1d49b8 clang::FrontendAction::Execute() + 120
45 clang-3.9 0x000000010b117bc7 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) + 1847
46 clang-3.9 0x000000010b268466 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) + 4870
47 clang-3.9 0x000000010944164a cc1_main(llvm::ArrayRef<char const*>, char const*, void*) + 4986
48 clang-3.9 0x0000000109431011 ExecuteCC1Tool(llvm::ArrayRef<char const*>, llvm::StringRef) + 481
49 clang-3.9 0x000000010942eb23 main + 3283
50 libdyld.dylib 0x00007fff86f785c9 start + 1
51 libdyld.dylib 0x0000000000000024 start + 2030598748
Stack dump:
0. Program arguments: /usr/local/llvm-nyuzi/bin/clang-3.9 -cc1 -triple nyuzi-none-none -emit-obj -disable-free -main-file-name thread.c -mrelocation-model static -mthread-model posix -fmath-errno -masm-verbose -mconstructor-aliases -munwind-tables -nostdsysteminc -fuse-init-array -target-cpu nyuzi -target-linker-version 241.9 -momit-leaf-frame-pointer -dwarf-column-info -debugger-tuning=gdb -O3 -ferror-limit 19 -fmessage-length 80 -fobjc-runtime=gcc -fdiagnostics-show-option -fcolor-diagnostics -x c thread-8755e8.c
1. thread-8755e8.c:6:18: current parser token ')'
2. thread-8755e8.c:5:1: parsing function body 'foo'
3. thread-8755e8.c:5:1: in compound statement ('{}')
./thread-8755e8.sh: line 4: 88650 Illegal instruction: 4 "/usr/local/llvm-nyuzi/bin/clang-3.9" "-cc1" "-triple" "nyuzi-none-none" "-emit-obj" "-disable-free" "-main-file-name" "thread.c" "-mrelocation-model" "static" "-mthread-model" "posix" "-fmath-errno" "-masm-verbose" "-mconstructor-aliases" "-munwind-tables" "-nostdsysteminc" "-fuse-init-array" "-target-cpu" "nyuzi" "-target-linker-version" "241.9" "-momit-leaf-frame-pointer" "-dwarf-column-info" "-debugger-tuning=gdb" "-O3" "-ferror-limit" "19" "-fmessage-length" "80" "-fobjc-runtime=gcc" "-fdiagnostics-show-option" "-fcolor-diagnostics" "-x" "c" "thread-8755e8.c"
This may be a bug that was pulled in from the last upstream integration. Retest with next one.
Add more bits to physical page index in TLB entry to access more than 4GB of physical memory.
But need to leave control bits for features like large pages, for example.
Spinlocks burn a lot of cycles. Can threads be suspended at barriers without consuming issue cycles?
https://gist.github.com/jbush001/22a3c336f0b59b095547025cdb7cee5d
I'd say this is more of a suggestion than an issue.
As you already use submodules for the toolchain and verilator, I would suggest you split out the software and RTL code into different repos as well. For the RTL code I would even suggest moving out the board-specific stuff (hardware/fpga/de2_115) and the common peripherals (hardware/fpga/common) so that it will be easier to create new board ports and reuse your peripheral controllers.
When a bit is set in the flags control register, it should execute exactly one instruction and then raise a trap.
However if the issued instruction causes another trap, what should happen then?
This occurred in the io_interrupt test:
io_interrupt
Process returned error: Random seed is 1460932741
cores 1|threads per core 4|l1i$ 16k 4 ways|l1d$ 16k 4 ways|l2$ 128k 8 ways|itlb 64 entries|dtlb 64 entries
>ABC*DE*FGHI*JKLM*NOPQ*RSTU*VWXYZ*abcd*efgh*ijklm*nopq*rstu*vwxy*z0123*4567*89
[32128] %Error: l1_store_queue.sv:208: Assertion failed in TOP.v.nyuzi.core_gen[0].core.l1_l2_interface.l1_store_queue.thread_store_buf_gen[0]
%Error: core/l1_store_queue.sv:208: Verilog $stop
The code is here:
if (store_requested_this_entry)
begin
if (is_restarted_sync_request)
begin
...
assert(!rollback[thread_idx]); // <----- Here
The problem appears to be as follows:
l1_store_queue assumes a core will not attempt to issue another store after it's been rolled back until the l1_store_queue wakes it:
else if (pending_stores[thread_idx].valid && !can_write_combine
&& !got_response_this_entry)
rollback[thread_idx] = 1;
The reason it occurred in this test was because there is a spinlock in crt0.s when the program terminates (for calling destructors). A compiler update changed the timing enough to hit this race condition. A better test would be to issue synchronized stores in a loop with interrupts enabled.
The fix is probably to disable interrupts while a synchronized store is pending. Something similar is already done for I/O operations: the ic_interrupt_pending bitmask goes into the instruction_decode_state to detect this stage. An additional signal could go from the store buffer to indicate this.
In tests/cosimulation:
$ ./generate_random.py -i
$ ./runtest.sh random.s
...
COSIM MISMATCH, thread 0 instruction d4d414e7
Advantages:
In tests/kernel/multiprocess, modify makefile to set randseed:
--- a/tests/kernel/multiprocess/Makefile
+++ b/tests/kernel/multiprocess/Makefile
@@ -30,7 +30,7 @@ run: program.elf fsimage.bin
$(EMULATOR) -b fsimage.bin $(TOPDIR)/software/kernel/kernel.hex
verirun: program.elf fsimage.bin
- $(VERILATOR) +bin=$(TOPDIR)/software/kernel/kernel.hex +block=fsimage.bin
+ $(VERILATOR) +randseed=1469249757 +bin=$(TOPDIR)/software/kernel/kernel.hex +block=fsimage.bin
The program will run for a while, then crash:
Loading segment 0 offset 00000000 vaddr 00001000 file size 000001d8 mem size 000001d8 flags 5
Loading segment 0 offset 00000000 vaddr 00001000 file size 00000280 mem size 00000280 flags 5
Loading segment 0 offset 00000000 vaddr 00001000 file size 00000280 mem size 00000280 flags 5
Loading segment 0 offset 00000000 vaddr 00001000 file size 00000280 mem size 00000280 flags 5
Loading segment 0 offset 00000000 vaddr 00001000 file size 00000280 mem size 00000280 flags 5
Loading segment 0 offset 00000000 vaddr 00001000 file size 00000280 mem size 00000280 flags 5
KERNEL PANIC: ASSERT FAILED: rwlock.c:75: m->active_read_count > 0
Does not happen when running in emulator.
The AXI bus has signals to indicate an error, but the processor ignores them. Because bus writes are deferred, these are necessarily imprecise.
Versions 14.1+ do not work.
Executing a floating point compare instruction that checks for equality when both operands are negative infinity will incorrectly not treat them as equal. That's because the equality checks that the result of a subtraction is zero and it is not in this case. There needs to be logic to check for infinity explicitly in these operations.
This could improve performance, as writes often happen back-to-back.
https://jbush001.github.io/2016/11/30/measure-twice-cut-once.html
Currently, when a TLB miss occurs, it raises a trap and software walks the page table and fills in the TLB entry. This feature would perform that walk in hardware to reduce the overhead.
Single step, read/write registers, read/write memory.
Currently integer and floating point multiplication occur in one stage (fp_execute_stage2) using the '*' Verilog operator. This is the critical path when synthesizing for silicon. However, three stages are reserved in the pipeline for it. Create a proper multi-stage multiplier that uses modified Booth encoding and a Wallace tree to accumulate partial products.
https://web.stanford.edu/class/archive/ee/ee371/ee371.1066/lectures/lect_05.2up.pdf
Or, eliminate booth encoding and use a single row of 4:2 compressors:
http://www.acsel-lab.com/Publications/Papers/38-booth-para-multi-EL93.pdf
USE_VERILATOR=1 ./runtest.sh longlong.c
FAIL: line 12 expected string ffee1580 was not found
searching here:
e0b4157f
Works correctly when run in simulator
In NyuziRegisterInfo::eliminateFrameIndex and NyuziInstrInfo::loadConstant, a combination of shifts and moves is used to load constants. Switch this to use MOVEHI/OR
Currently the perspective transform does not divide z by w. The view volume is z (-1.0 -> -inf), which is nonstandard and will break non-floating point depth buffers.
http://fabiensanglard.net/polygon_codec/
http://research.microsoft.com/pubs/73937/p245-blinn.pdf
http://www.cs.unc.edu/~olano/papers/2dh-tri/
http://www.songho.ca/opengl/gl_projectionmatrix.html
tests/misc/dflush will fail sometimes (depending on random seed). The test passes consistently if +autoflushl2=1 is specified, so it appears the flushes are not completing consistently.
Doing a setlt.f 1.13418960571, inf results in false. There needs to be special case logic for doing comparisons to infinity.
Depth buffer is currently floating point. Convert to packed 24 bit integer depth and 8 bit stencil buffer. Need to fix perspective transform so that Z is constrained to view volume, otherwise integer depth will overflow:
Depends on: issue #68
On TravisCI, this times out once in a while. The timeout is set to 120 seconds. It takes about 86 seconds to run on my laptop. It's possible that the timeout simply doesn't have enough margin when Travis's build server is running slowly.
Currently, the only way to access external coprocessors is to use the I/O bus. Writes to addresses at high physical memory addresses, instead of going through the normal cache hierarchy, are redirected to the I/O bus. While this is useful for relatively low-speed peripherals, it has performance limitations when used with coprocessors such as texture fetch units:
The proposal is to create a new bus for use in high-speed peripherals. This would not replace the low speed bus, but would address different use cases and constraints.
Design here:
https://gist.github.com/jbush001/09f51178a366c0f6b8f07363c30f414f
Hello, sorry for asking this question, but I was wondering if people could list what this particular GPGPU processor is good for once synthesized. I don't want a general description of what GPGPUs are good for, but rather what this project will help speed up or facilitate.
The following code:
#include <stdio.h>

float a = 3.40282347E-24F;
float b = 3.40282347E-24F;
int main()
{
union {
float fval;
int ival;
} u;
u.fval = a * b;
printf("%08x\n", u.ival);
}
Should print 00000000, but prints 7187625d when run in Verilog simulation. Hardware needs to explicitly detect if the sum of the exponents underflows and set the result to zero.
With new implementation, a constant pool takes 3 instructions:
movehi s0, hi(.LCPI1_0)
or s0, s0, lo(.LCPI1_0)
load_32 s0, (s0)
This can be done in two instructions by using an immediate offset with the memory instruction:
movehi s0, hi(.LCPI1_0)
load_32 s0, lo(.LCPI1_0)(s0)
This will require a new type of relocation that can patch the memory instruction. Global loads and stores can be optimized similarly.
Make performance counter have the ability to generate an interrupt when a specific event has occurred some number of times. This allows statistically sampling which routines cause more of that type of event.
There would be a register for the event type, and perhaps one for the interrupt threshold. It would also need a way to reset the counter when the interrupt handler occurred, which could be automatic or driven by software.
This passes most of the time, but if the random seed is hardcoded to 1419094753, it flags a memory mismatch:
Building cache_stress.s
Random seed is 1419094753
400036 total instructions
Binary files WORK/vmem.bin and WORK/mmem.bin differ
13480c13480
< 0036110 00 00 00 00 a2 d3 00 00 00 00 00 00 00 00 00 00
---
> 0036110 00 00 00 00 a2 d3 00 00 00 00 00 00 ca fb 01 00
FAIL: final memory contents do not match
The current implementation in l2_axi_bus_interface.sv uses a single state machine that:
However, AXI is designed to allow the next address to be sent before the first data transfer finishes. This improves performance on high latency links (for example, ones with multiple clock-domain-crossings). This design could be modified to use pipelines for write and read transactions, where the first stage issues the address and the second performs the transfer.
The current testbench has a relatively low latency link, with SDRAM coupled directly to the processor. For testing, it might be interesting to add a parameterized FIFO to simulate different latencies and measure the performance difference with various workloads.
It's also possible to separate the write and read state machines and allow them to operate independently, but this adds more edge cases (e.g. if a line is evicted and then there is a cache miss on it before writeback, the write must be sequenced before the read) and is of questionable performance benefit.
When shared memory is specified in the emulator using the '-s' flag, it creates a file, but it does not delete it when finished or if there is an error.
USE_VERILATOR=1 ./runtest.sh fconv.cpp
FAIL: line 37 expected string 0x81234000 was not found
searching here:
0x80000000
Works correctly in simulator.
Add ability to send an interrupt from one thread to another/all, across cores. The easiest way to do this is probably to send it as a message in the L2 cache, as it already has the ability to broadcast messages to all cores and to serialize requests. The l1_l2_interface would decode the message and have a signal directly to the control register unit to flag an interrupt. The control register would probably treat this interrupt specially, as it is routed directly to threads unlike other interrupts.
This manifests with a number of programs, but the test in tests/fpga/atomic_bug/ reproduces it consistently. While this program is running, it will output the result of __sync_add_and_fetch in binary on the 7 segment LEDs--with each segment corresponding to a bit, and each digit corresponding to a hardware thread. If the program is behaving correctly, the digits should be updated continuously. However, after the first iteration, the program hangs and two of the digits generally display the same value (which violates the semantics of __sync_add_and_fetch).
This problem does not occur when running the same program in RTL simulation under Modelsim (invoked from Quartus, which should ostensibly be using almost the same configuration).
Recently I managed to port Nyuzi to the Xilinx ZC706 evaluation board and run doom and quakeview successfully. During this process, I noticed a cache performance degradation when debugging with Modelsim SE 10.5c. Since the degradation seems to occur only in some simulators and not in Verilator or after synthesis, this is perhaps more of a portability suggestion than an issue.
When running the core on some simulators such as Modelsim, the x value coming from uninitialized SRAM causes the casez statement in module cache_lru to fail to match any pattern. As a result, only way 0 is ever used in every cache set. To illustrate, here is a snippet from cache_lru.sv for the case where parameter NUM_WAYS is 4.
...
4:
begin
    always_comb
    begin
        casez (lru_flags)
            3'b00?: fill_way = 0;
            3'b10?: fill_way = 1;
            3'b?10: fill_way = 2;
            3'b?11: fill_way = 3;
            default: fill_way = '0;
        endcase
    end

    always_comb
    begin
        case (new_mru)
            2'd0: update_flags = {2'b11, lru_flags[0]};
            2'd1: update_flags = {2'b01, lru_flags[0]};
            2'd2: update_flags = {lru_flags[2], 2'b01};
            2'd3: update_flags = {lru_flags[2], 2'b00};
            default: update_flags = '0;
        endcase
    end
end
...
Consider the case where, after reset, two or more cache fills happen in the same cache set. When the first fill request arrives, the value of "lru_flags" is 3'bxxx (the value of uninitialized SRAM), so the default branch takes effect, way 0 is filled, and "update_flags" becomes 3'b11x. For every subsequent request, the casez statement fails to match any pattern against "lru_flags" (3'b11x), so the default branch always takes effect, way 0 is always filled, and "update_flags" is always 3'b11x. A similar issue exists when NUM_WAYS is 8. The area marked in red in the attached Modelsim waveform shows this situation:
Although it sounds like a bad idea, one solution to this problem is to change every casez in cache_lru.sv to casex. I ran an experiment to estimate the performance impact. First, some logic was inserted into the writeback_stage to trace function activity. Then the same program was executed for a fixed amount of time on two designs: the original one and one with the modified cache_lru. Since the program is the same, the amount of log output generated during simulation is related to the number of instructions executed, so comparing the number of function trace records gives an approximate performance comparison. The screenshot below shows the result: the design with the modified cache_lru produced 2.8x more log output than the original, which can roughly be read as running 2.8x faster.
The version I use is ce221ff, but this issue still exists in the latest version.
Finally, thanks for creating such an interesting project!
Probably will be multiple tests. Ideally exercises:
tests/stress/mmu/
Add floating point values represented by bit patterns 0x41000001 and 0xBF800004. The result should be 40E00001, but this implementation will return 40E00002. The shifted GRS (guard/round/sticky) bits are 100, which looks like it should round to even (because it's half way), but the normalizing shift that happens after should shift the guard bit into the LSB. This page describes in more detail:
http://pages.cs.wisc.edu/~david/courses/cs552/S12/handouts/guardbits.pdf
./generate_random -i
./runtest random.s
Has about an 8% failure rate. In all cases, hardware executes an instruction from the interrupt handler, while the emulator continues executing code. In each example below, hardware jumps to 1dc (beginning of the interrupt handler). The value the interrupt handler loads into s11 is the PC that was interrupted. In each case, the address of the instruction executed by the reference emulator is equal to the PC loaded by the hardware implementation.
interrupt_handler:
1dc: 62 01 00 ac getcr s11, 2
COSIM MISMATCH, thread 0
Reference: 000755e0 s4 <= 00000005
Hardware: 000001dc s11 <= 000755e0
COSIM MISMATCH, thread 2
Reference: 00000260 v4{ffff} <= 18f878a5 040630c3 bc0614e0 253f20a5 bc060cc0 c0738063 12900022 bc0810c1 41fbf904 d0a400e6 028f0022 d0e400e0 257e9485 128d0022 c47380a4 c14380e4
Hardware: 000001dc s11 <= 00000260
This would suggest an issue where the reference implementation is not getting the cosimulation interrupt message, or has interrupts masked for some reason.
From IEEE754-2008, 6.2.3:
"If two or more inputs are NaN, then the payload of the resulting NaN should be identical to the payload of one of the input NaNs if representable in the destination format. This standard does not specify which of the input NaNs will provide the payload."
The convention seems to be to take the first operand. For example:
#include <stdint.h>
#include <stdio.h>

union uval {
float fval;
int32_t ival;
};
volatile union uval a, b, c;
int main()
{
a.ival = 0xff8abcde;
b.ival = 0xff812345;
c.fval = a.fval + b.fval;
printf("%08x\n", c.ival);
}
On a desktop machine, this returns 0xffcabcde (the most significant bit of the significand is set). On Nyuzi, it returns the hardcoded NaN value 0x7fffffff.
The following code:
#include <stdio.h>

float a = 3.40282347E+38F;
float b = 3.40282347E+38F;
int main()
{
union {
float fval;
int ival;
} u;
u.fval = a + b;
printf("%08x\n", u.ival);
}
Should print '7f800000', which corresponds to 'inf'. However, when running in Verilog simulation, it prints 7fffffff (NaN), because hardware does not explicitly detect overflow.
%Error: ../fpga_common/axi_internal_ram.v:204: Internal: Extra arguments for $display-like format
%Error: core/writeback_stage.v:150: Internal: Extra arguments for $display-like format
%Error: core/instruction_fetch_stage.v:202: Internal: Extra arguments for $display-like format
%Error: Exiting due to 3 error(s)
%Error: Command Failed /usr/local/Cellar/verilator/3.860/bin/verilator_bin --assert -DENABLE_PERFORMANCE_COUNTERS -DSIMULATION -Icore -y fpga -y testbench -y ../fpga_common -Wno-fatal -Werror-implicit --cc testbench/verilator_tb.v --exe testbench/verilator_main.cpp
make[1]: *** [verilator]
Currently hardcoded to round-to-nearest, ties to even. Add bits to flags control register to configure and logic in fp_execute_stage3 to compute do_round based on the mode and GRS bits.
https://en.wikipedia.org/wiki/Floating-point_arithmetic#Rounding_modes
Probably need to fix issue #58 first.