shibihe / q-optimality-tightening
This is my implementation of the Optimality Tightening.
License: MIT License
@ShibiHe
Hi, thanks for your great paper, and sorry to bother you.
In the paper, the upper bound and lower bound are incorporated into the algorithm via quadratic penalties. But I cannot find the implementation corresponding to these two quadratic penalties.
It seems that the loss function is defined in the __init__ method of the DeepQLearner class, but no penalties are added there.
The main differences from the original DQN code appear in the _do_training function of the OptimalityTightening class. I am not sure what the targets1 variable means, or how this implementation realizes the two quadratic penalties from the paper.
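For reference, this is how I read the objective in the paper: the usual Bellman error plus two quadratic penalty terms for violating the tightest lower and upper bounds. Below is only a minimal NumPy sketch of that equation as I understand it, not the repository's Theano code; q_pred, y, lower_bound, upper_bound, and lam are hypothetical names for Q(s_j, a_j), the DQN target, L_j^max, U_j^min, and the penalty weight lambda.

```python
# Minimal NumPy sketch of the penalized objective as stated in the paper;
# NOT the repository's Theano code. All names here are hypothetical.
import numpy as np

def optimality_tightening_loss(q_pred, y, lower_bound, upper_bound, lam=4.0):
    """Bellman error plus quadratic penalties for bound violations.

    q_pred      -- Q(s_j, a_j) for a batch
    y           -- standard DQN targets r_j + gamma * max_a Q'(s_{j+1}, a)
    lower_bound -- tightest lower bound L_j^max per sample
    upper_bound -- tightest upper bound U_j^min per sample
    lam         -- penalty weight (lambda in the paper)
    """
    bellman = (q_pred - y) ** 2
    # Penalize Q-values that fall below the best lower bound ...
    below = np.maximum(lower_bound - q_pred, 0.0) ** 2
    # ... or rise above the smallest upper bound.
    above = np.maximum(q_pred - upper_bound, 0.0) ** 2
    return np.mean(bellman + lam * (below + above))
```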
Please correct me if I'm wrong, and thank you very much!!
Apologies if there is an obvious answer, but from the readme I gathered that, when running properly, the steps per second should remain roughly constant throughout training. Running on a GTX 970, I started out at ~90 steps per second with 25% GPU utilization. After leaving it to run overnight, I found it had only completed 6 epochs and had slowed to about 46 steps per second, with about 15% GPU utilization. Everything else runs perfectly; the issue takes several hours to appear, and restarting brings the rate back to normal. Is there a known cause/solution for this?
Thank you
@ShibiHe,
First of all, thanks for this inspiring paper and implementation, great work!
In the paper, you use index substitution to derive the upper bound for Q, which makes perfect sense mathematically.
However, in the implementation, the upper bound is used the same way as the lower bound, without any dependency (and thus gradient) with respect to the parameters.
This means, for example, that at time step t, in a trajectory (s[t-2], a[t-2], r[t-2], s[t-1], a[t-1], r[t-1], s[t], a[t], r[t], ...), if r[t-2] and r[t-1] are very low, we need to decrease the value of Q[t] according to the upper bounds introduced by r[t-2] and r[t-1]. In other words, what happened before time step t will have an impact on the value of Q[t].
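To make this concrete, here is a small sketch of how I understand the upper bounds at step t to be assembled from earlier transitions (my own illustration with hypothetical names, following the paper's derivation rather than this repository's code): each past Q-value, minus the discounted reward collected since that state, is divided by the matching power of gamma, and the minimum over k constrains Q[t] from above.

```python
# Hedged sketch of the upper-bound construction from the paper's derivation;
# not this repository's code. past_q and past_r are hypothetical inputs taken
# from the target network and the stored trajectory.

def tightest_upper_bound(past_q, past_r, gamma=0.99):
    """past_q[k] is Q(s_{t-k-1}, a_{t-k-1}) from the frozen target network;
    past_r[i] is r_{t-1-i}, i.e. rewards counted backwards from step t."""
    bounds = []
    for k in range(len(past_q)):
        # Discounted reward collected between s_{t-k-1} and s_t:
        #   sum_{i=0}^{k} gamma**i * r[t-k-1+i]
        collected = sum(gamma ** (k - i) * past_r[i] for i in range(k + 1))
        # Rearranging Q[t-k-1] >= collected + gamma**(k+1) * Q[t] gives:
        bounds.append((past_q[k] - collected) / gamma ** (k + 1))
    return min(bounds)  # the tightest upper bound U_t^min on Q(s_t, a_t)
```

Since every bound here mixes a past Q-value with rewards observed before step t, the history does enter the constraint on Q[t], which is exactly my concern.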
Does that conflict with the definition of discounted future reward and with the Markov assumption of an MDP?
Please correct me if anything is wrong.
Thanks!