ericaltendorf / plotman
Chia plotting manager
License: Apache License 2.0
alfonsopereze on keybase found the root cause:
It seems that if you have running jobs with the dest dirs that are not part of the config file, they can still be picked up here and added to the dst list. IMO, this is NOT good.
Line 37 in c1665b2
Potential Fix
He suggests adding a filter by the current list of dest dirs as per the config file.
My problem is that while I can write python enough to write a script that filters a logfile, i don't know how to fix this issue in this code yet.
My current workaround: after editing config.yaml to reflect my new dst folders, I create a symlink from CHIA1 to the new destination CHIA3. That kind of breaks plotman's ability to balance the load between destinations and is a total hack, but it works for now.
Thanks in advance.
-g
I had to make the following two changes to get archiving working:
$ git diff archive.py
diff --git a/archive.py b/archive.py
index 4f69ea5..274229d 100644
--- a/archive.py
+++ b/archive.py
@@ -7,6 +7,7 @@ import psutil
import re
import random
import sys
+import contextlib
import texttable as tt
@@ -49,7 +50,7 @@ def compute_priority(phase, gb_free, n_plots):
def get_archdir_freebytes(arch_cfg):
archdir_freebytes = { }
- df_cmd = ('ssh %s@%s df -aBK | grep " %s/"' %
+ df_cmd = ('ssh %s@%s df -aBK | grep " %s"' %
(arch_cfg['rsyncd_user'], arch_cfg['rsyncd_host'], arch_cfg['rsyncd_path']) )
with subprocess.Popen(df_cmd, shell=True, stdout=subprocess.PIPE) as proc:
for line in proc.stdout.readlines():
Currently plotman grabs the first tmp dir that is ready.
This means that if the global stagger setting is the limitation (rather than disk phase readiness), jobs will bunch up on the earliest disks. We should schedule jobs to the disks that are in the best state to take a job (similar to how we prioritize archiving from dst dirs).
It seems likely that scheduling would work with a lower global delay during initial ramp-up of jobs, e.g., starting with a 5m delay until most tmp drives are occupied, then backing off to a more normal delay, e.g. 20m.
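A minimal sketch of scheduling to the most lightly loaded tmp dir rather than the first ready one. The function and parameter names here are illustrative, not plotman's actual API:

```python
def best_tmpdir(tmpdirs, dir2jobs, max_jobs):
    """Pick the tmp dir in the best state to take a new job.

    dir2jobs maps each tmp dir to its currently running jobs; a dir is
    eligible when it is under max_jobs. Prefer the most lightly loaded
    eligible dir, with ties broken by config order.
    """
    eligible = [d for d in tmpdirs if len(dir2jobs.get(d, [])) < max_jobs]
    if not eligible:
        return None  # no tmp dir can take a job right now
    return min(eligible, key=lambda d: len(dir2jobs.get(d, [])))
```

A dir's phase state could be folded into the sort key the same way archiving prioritizes dst dirs.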
Currently, archive jobs are scheduled one at a time.
If there are multiple dst and multiple archive dirs, and a fast network (significantly faster than individual drive I/O), we could go faster by running multiple archive jobs at once.
This would require making archive scheduling aware of other jobs, however, so it doesn't hit the same drives.
Here: https://github.com/ericaltendorf/plotman/blob/main/LICENSE#L189
It should be filled in with your details, @ericaltendorf.
When archiving is inactive, the status message never gets updated.
This means an old archiving job's PID may continue to be displayed as running long after it's gone.
SSDs and HDDs require different plotting parameters for optimal speed. When there is a mixture of drive types in a system, there should be a way to tailor the plotting parameters for each drive type, rather than choosing just one.
This could be achieved by specifying a plotting configuration for each tmp drive, and then listing multiple plotting configurations, each with a name. Then, when initiating a plot, select the named plotting configuration for the tmp drive.
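One possible shape for this, sketched as a hypothetical config fragment (the plotting_configs key and per-tmp-dir plotting_config references do not exist in plotman today):

```yaml
plotting_configs:
        ssd_fast:
                n_threads: 4
                n_buckets: 128
                job_buffer: 3400
        hdd_gentle:
                n_threads: 2
                n_buckets: 64
                job_buffer: 4500

directories:
        tmp:
                - path: /mnt/ssd0
                  plotting_config: ssd_fast
                - path: /mnt/hdd0
                  plotting_config: hdd_gentle
```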
ValueError: time data 'Fri Feb 12 16:02:56 2021' does not match format '%a %b %d %H:%M:%S %Y'
User reported some confusing issues with paths. Turns out they had tmp: /my/path instead of tmp: [/my/path] (or a multiline YAML list). The config file should be checked against a schema before we go using it. The following code expects a list but is just fine taking a string '/my/path' and turning it into a bunch of temporary directories like ['/', 'm', 'y', '/', 'p', 'a', 't', 'h']. This bug (we guess) also resulted in temporary files getting dumped in /, which strikes me as extra bad.
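The failure mode is easy to reproduce, and a minimal guard (a real schema validator would be better; as_list is a hypothetical helper) could normalize the value before use:

```python
# Demonstrates the bug: iterating a YAML scalar as if it were a list.
tmp = "/my/path"       # what the user wrote: tmp: /my/path
chars = list(tmp)      # one "directory" per character: ['/', 'm', 'y', ...]

# A minimal guard before the config is used:
def as_list(value):
    """Accept either a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)
```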
tmp: /root/venvChia/plots/temp1
Lines 79 to 80 in 31ccb8f
# Where to plot and log.
directories:
        # One directory in which to store all plot job logs (the STDOUT/
        # STDERR of all plot jobs).  In order to monitor progress, plotman
        # reads these logs on a regular basis, so using a fast drive is
        # recommended.
        log: /root/venvChia/plotman/logs

        # One or more directories to use as tmp dirs for plotting.  The
        # scheduler will use all of them and distribute jobs among them.
        # It assumes that IO is independent for each one (i.e., that each
        # one is on a different physical device).
        #
        # If multiple directories share a common prefix, reports will
        # abbreviate and show just the uniquely identifying suffix.
        tmp: /root/venvChia/plots/temp1

        # Optional: tmp2 directory.  If specified, will be passed to
        # chia plots create as -2.  Only one tmp2 directory is supported.
        # tmp2:

        # One or more directories; the scheduler will use all of them.
        # These again are presumed to be on independent physical devices,
        # so writes (plot jobs) and reads (archivals) can be scheduled
        # to minimize IO contention.
        dst: /root/venvChia/plots/plot4

        # Archival configuration.  Optional; if you do not wish to run the
        # archiving operation, comment this section out.
        #
        # Currently archival depends on an rsync daemon running on the remote
        # host, and that the module is configured to match the local path.
        # See code for details.
        # archive:
        #         rsyncd_module: plots
        #         rsyncd_path: /plots
        #         rsyncd_bwlimit: 80000  # Bandwidth limit in KB/s
        #         rsyncd_host: myfarmer
        #         rsyncd_user: chia

# Plotting scheduling parameters
scheduling:
        # Don't run a job on a particular temp dir until all existing jobs
        # have progressed at least this far.  Phase major corresponds to the
        # plot phase, phase minor corresponds to the table or table pair
        # in sequence.
        tmpdir_stagger_phase_major: 2
        tmpdir_stagger_phase_minor: 1

        # Don't run more than this many jobs at a time on a single temp dir.
        tmpdir_max_jobs: 1

        # Don't run any jobs (across all temp dirs) more often than this.
        global_stagger_m: 30

        # How often the daemon wakes to consider starting a new plot job
        polling_time_s: 1000

# Plotting parameters.  These are pass-through parameters to chia plots create.
# See documentation at
# https://github.com/Chia-Network/chia-blockchain/wiki/CLI-Commands-Reference#create
plotting:
        k: 32
        e: True             # Use -e plotting option
        n_threads: 4        # Threads per job
        n_buckets: 128      # Number of buckets to split data into
        job_buffer: 3300    # Per job memory
$ python plotman.py plot
...starting plot loop
Traceback (most recent call last):
File "plotman.py", line 92, in <module>
wait_reason = manager.maybe_start_new_plot(dir_cfg, sched_cfg, plotting_cfg)
File "/home/billy/Desktop/plotman/manager.py", line 120, in maybe_start_new_plot
p = subprocess.Popen(plot_args,
File "/home/billy/anaconda3/lib/python3.8/subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/home/billy/anaconda3/lib/python3.8/subprocess.py", line 1702, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'chia'
Here is what my config file looks like:
# Where to plot and log.
directories:
        # One directory in which to store all plot job logs (the STDOUT/
        # STDERR of all plot jobs).  In order to monitor progress, plotman
        # reads these logs on a regular basis, so using a fast drive is
        # recommended.
        log: /home/billy/.chia/mainnet/plotman

        # One or more directories to use as tmp dirs for plotting.  The
        # scheduler will use all of them and distribute jobs among them.
        # It assumes that IO is independent for each one (i.e., that each
        # one is on a different physical device).
        #
        # If multiple directories share a common prefix, reports will
        # abbreviate and show just the uniquely identifying suffix.
        tmp:
                - /media/billy/Data/chiatmp
                - /media/billy/Windows/chiatmp

        # Optional: tmp2 directory.  If specified, will be passed to
        # chia plots create as -2.  Only one tmp2 directory is supported.
        # tmp2: /mnt/tmp/a

        # One or more directories; the scheduler will use all of them.
        # These again are presumed to be on independent physical devices,
        # so writes (plot jobs) and reads (archivals) can be scheduled
        # to minimize IO contention.
        dst:
                - /media/billy/chia-1-16TB/chiaplots
                - /media/billy/chia-2-16TB/chiaplots
                - /media/billy/chia-3-16TB/chiaplots
                - /media/billy/chia-4-16TB/chiaplots
                - /media/billy/chia-5-16TB/chiaplots

        # Archival configuration.  Optional; if you do not wish to run the
        # archiving operation, comment this section out.
        #
        # Currently archival depends on an rsync daemon running on the remote
        # host, and that the module is configured to match the local path.
        # See code for details.

# Plotting scheduling parameters
scheduling:
        # Don't run a job on a particular temp dir until all existing jobs
        # have progressed at least this far.  Phase major corresponds to the
        # plot phase, phase minor corresponds to the table or table pair
        # in sequence.
        tmpdir_stagger_phase_major: 3
        tmpdir_stagger_phase_minor: 4

        # Don't run more than this many jobs at a time on a single temp dir.
        tmpdir_max_jobs: 4

        # Don't run any jobs (across all temp dirs) more often than this.
        global_stagger_m: 15

        # How often the daemon wakes to consider starting a new plot job
        polling_time_s: 60

# Plotting parameters.  These are pass-through parameters to chia plots create.
# See documentation at
# https://github.com/Chia-Network/chia-blockchain/wiki/CLI-Commands-Reference#create
plotting:
        k: 32
        e: True             # Use -e plotting option
        n_threads: 4        # Threads per job
        n_buckets: 128      # Number of buckets to split data into
        job_buffer: 6750    # Per job memory
I have no idea where or why this is looking for a directory named 'chia' when nothing I've specified has a directory name of 'chia'.
Add an option to analyze to emit CSV instead of a rendered ASCII table. This would allow people to more easily pull stats into spreadsheets.
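A hedged sketch of what the CSV mode could do, using the stdlib csv module in place of the texttable rendering. Function and parameter names here are illustrative, not plotman's actual API:

```python
import csv
import io

def emit_csv(headers, rows, out):
    """Write the same analyze stats as CSV instead of an ASCII table."""
    writer = csv.writer(out)
    writer.writerow(headers)
    writer.writerows(rows)

# Example usage with hypothetical stats:
buf = io.StringIO()
emit_csv(["phase", "mean", "stdev"], [["1", "4520", "310"]], buf)
```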
With main as the default branch, PRs are automatically made against it. If this is kept, then a PR template should likely be added to remind people, as they create the PR, that it should be developed and submitted against development. If I understand the branch usage intention correctly...
Alternatively, the branch which should be developed against could be made the default. I'm guessing the interest in main being the default is so that people who aren't developing can clone and jump straight to running a 'stable' version. Following #61, where plotman becomes installable, we can start publishing releases to PyPI, and people who aren't developing can skip the git clone and the source entirely.
In theory, plotman status could be used to generate a text file summarizing the current system plotting (e.g. './plotman.py status > /chia/chialogs/currentplots.txt'). In theory, this command can be scripted and added as a cronjob and that text file can then be served with apache/lighttpd for a simple web view of what's going on.
ENHANCEMENT: it would be nice if plotman status added the following to this simple text output
Now, I'm currently running into an issue where the plotmanstatus.sh script I created works perfectly to generate the text file when run from the command line, but generates a 0-byte file when run from cron. I suspect it's related to environment variable differences between the non-interactive shell started by crontab and the 'real' user shell I use to start the command manually, but it could also be tied to some Python packages that exist for my real user yet are somehow not accessible from the crontab-started shell. I'm still working through this.
Once I solve this issue, though, it would be nice to have these two enhancements.
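For the cron symptom specifically, a common fix is to make the crontab entry self-contained rather than relying on the login shell's environment: set PATH in the crontab and invoke the venv's interpreter by absolute path. A sketch (all paths here are examples, not from the issue):

```
PATH=/usr/local/bin:/usr/bin:/bin
*/10 * * * * cd /home/chia/plotman && ./venv/bin/python plotman.py status > /chia/chialogs/currentplots.txt 2>&1
```

Redirecting stderr into the file also captures the error that is currently producing the silent 0-byte output.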
psutil offers some functions which are platform dependent. We could be more robust and simply not report the ones that aren't available.
E.g., fix this stacktrace:
File "/home/chia/plotman-main/interactive.py", line 207, in curses_main
jobs_win.addstr(0, 0, reporting.status_report(jobs, n_cols, jobs_height,
File "/home/chia/plotman-main/reporting.py", line 107, in status_report
plot_util.time_format(j.get_time_iowait())
File "/home/chia/plotman-main/job.py", line 282, in get_time_iowait
return int(self.proc.cpu_times().iowait)
AttributeError: 'pcputimes' object has no attribute 'iowait'
Should check the other calls to psutil as well.
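A minimal sketch of the guard for the iowait case above, using getattr so platforms whose cpu_times() result lacks the field (e.g. macOS) return None instead of raising. The namedtuples below simulate psutil's pcputimes results so the sketch is self-contained; the reporting layer would skip None values:

```python
from collections import namedtuple

def iowait_seconds(cpu_times):
    """Return iowait seconds from a psutil cpu_times() result,
    or None on platforms where the field doesn't exist."""
    value = getattr(cpu_times, "iowait", None)
    return None if value is None else int(value)

# Simulated psutil results for a platform with and without iowait:
LinuxTimes = namedtuple("LinuxTimes", ["user", "system", "iowait"])
MacTimes = namedtuple("MacTimes", ["user", "system"])
```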
Right now plotman fails if it sees an existing plot job but can't find its log (typically because the job was started manually outside of plotman). It should be robust to this; the simplest thing would be to issue a warning and ignore the job.
Need to fork my local config.yaml so I can maintain both my own version for use and a canonical default for the GitHub repository.
see: https://gist.github.com/eFishCent/dcce711d6babb123d8ab8d0ba3dc0532
@ericaltendorf I found two more issues I wanted to see if you or someone else could confirm, as my second two HDDs were filled today. I was busy getting my farmer 100% ready for mainnet (including evicting some small HDDs back into USB cases and into a milkcrate), but:
My use model is that as I fill a USB HDD, I remove it from plotman's config.yaml, add the new one, and then restart plotman interactive. The idea is that plotman will just pick up and send new plots to the new drive(s) instead of the old ones.
I can live with removing the drive when it's full and temporarily creating a symlink to the new drive so the old plots complete, but then I ran into problem #2.
Yeah, this is not sustainable. I can keep things going continuously by just creating symlinks from the old targets to the new target, but it's not clean. I guess on my next reboot (when I take down ALL my //plots), I'll configure plotman to target 1-2 symlinks called output1 and output2. I will then symlink those to my destination drives until plotman can detect that dst drives are full.
I will then avoid messing with plotman's folders entirely, but it's kind of a hack.
manager.py currently implements tmpdir selection logic. Meanwhile, separate parallel code in reporting.py computes when a tmpdir is eligible for plotting. We should create a single library for tmpdir prioritization and share it between both locations.
Create a web dashboard that shows basically what the interactive curses-based dashboard shows.
plotman status currently just shows current jobs. We should make it more like what's currently shown on the plotman interactive dashboard. Maybe make the dashboard obsolete...
Currently, unconfigured or misconfigured archive settings probably cause plotman to not work at all. Also, when starting plotman interactive, it always begins with archiving active.
Make plotman support the use case of not running archiving (i.e., just leaving plots in the configured dst dirs).
I killed server jobs with the command ./plotman.py kill xxxx; plotman correctly identifies the job and plot and asks if I want to kill it. When I said yes, it errored out saying that it cannot find the log file. The plot process is killed, but I have to manually go in and delete all the temp files from the temp directory.
Currently, the plotting loop runs infinitely, continuing to plot whenever conditions permit.
Some users would like to be able to request the plotting of exactly n plots. Implement this.
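A hedged sketch of how the loop could be bounded, where start_job and conditions_permit stand in for plotman's internals (they are not real plotman functions):

```python
def plot_loop(start_job, conditions_permit, n=None):
    """Run the plotting loop. n=None preserves today's run-forever
    behavior; an integer stops after exactly n plots have started."""
    started = 0
    while n is None or started < n:
        if not conditions_permit():
            break  # real plotman would sleep and re-check instead
        start_job()
        started += 1
    return started
```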
On the current development branch (haven't tried anywhere else), this command fails:
plotman analyze # anything else, doesn't matter
because the analyzer module is shadowed by a variable of the same name.
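The failure mode can be reproduced with any module; here with the stdlib string module standing in for plotman's analyzer:

```python
# Rebinding an imported module's name to a plain value breaks any later
# attribute access on what used to be the module.
import string

words = string.capwords("hello world")   # fine: 'string' is still a module
string = words                           # now 'string' names a str...
try:
    string.capwords("again")             # ...so this raises AttributeError
    failure = ""
except AttributeError as e:
    failure = str(e)
```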
Actual time of day doesn't matter, we just want a timer.
https://docs.python.org/3/library/os.path.html#os.path.expanduser
https://docs.python.org/3/library/os.path.html#os.path.expandvars
At least a couple of people have run into exceptions over ~ not being supported in paths. I don't know if there are good arguments for not supporting this or not. If using more structured deserialization (possibly a result of working on #77), then this could be integrated there to readily cover all paths.
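The two stdlib functions linked above compose into a one-line helper the config loader could apply to every path it reads. A sketch (the helper name is an assumption, not plotman's API):

```python
import os

def expand(path):
    """Expand ~ and environment variables in a configured path."""
    return os.path.expandvars(os.path.expanduser(path))
```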
Currently plotman distributes plotting, destination, and archival across arrays of disks. However, it assumes a single -2 tmp dir.
Personal experience suggests the load on the -2 tmp dir is low, and a single drive can support the plotting operations of about a dozen tmp drives. However, for robustness and scalability, we should support distributing -2 usage across multiple configured drives/directories.
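A hypothetical config shape for this (the list form of tmp2 is not supported today; the scheduler would rotate -2 usage across the entries):

```yaml
directories:
        tmp2:
                - /mnt/tmp2a
                - /mnt/tmp2b
```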
This line:
Line 203 in 31ccb8f
is shown unconditionally, which means it's also shown when archiving is disabled, in which case it seems to suggest that the dst prefix is remote, which it isn't.
It should be conditional on archiving being active.
Sure it's available in other monitoring systems, but it's pretty relevant to tuning plotting.
plotman interactive, on startup, doesn't show useful information in the first-line status message. Only after the first refresh does it show status. It should show this immediately on startup.
When things go wrong, old files get left in the tmp dirs. Implement a cleanup operation to remove orphaned files that do not appear to be owned by any active chia processes.
analyze reports means and stdev but not sample size. Sample size can vary across the columns, but we could still report something about it.
Installs fine.
However, when I run "python3 plotman.py", nothing happens. It simply returns to the prompt without an error or anything.
Does configuration happen in a YAML file? I can't see any documentation. Am I blind? :)
It would be useful to see the actual delta in wall time for a job compared to its previous job, i.e., to see how regularly the jobs are starting.
I am occasionally seeing the limit on how many plotters per temp drive being ignored.
My suspicion is that phases like 2:? and 4:? are to blame.
My config only allows 1 plotter per temp drive. In the screenshot below, notice how drive "temp7" shows as OK even though there is a job present. This happened while in phase 2:?. Once it moved on to phase 2:1, it was no longer showing as ready.
If the dst dirs are SSDs, people may wish to not use a separate -2 dir and instead use the final dst dirs as the -2 tmp dir.
The configs don't currently allow this to be set up -- we should support it.
Thanks for plotman!
I noticed a bug:
I started plotman (interactive) with two dst folders, say dst01 and dst02. Plotting went fine, plots were created etc.
Then I removed dst01 folder from the config, leaving only dst02.
I restarted plotman (interactive).
Then I noticed that a plot command with "-d dst01" was still being generated, though only one dst is (correctly) shown in plotman (interactive).
I suspect this is due to the method "dstdirs_to_youngest_phase" in manager.py. This method takes the running jobs (including ones with dst01 from before the config update) and their dst directories for selection. However, these dst directories should be filtered to exclude any not in the config.
How to fix:
In manager.py, variable dir2ph should not contain dst entries which are not in the config.
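A sketch of the suggested fix: build the dir-to-phase map only for dst dirs still present in config.yaml. To keep the example self-contained, jobs is a list of (dstdir, phase) pairs, a simplification of plotman's Job objects:

```python
def dstdirs_to_youngest_phase(jobs, configured_dsts):
    """Map each configured dst dir to the youngest phase of any job
    writing to it, ignoring dirs removed from the config."""
    dir2ph = {}
    for dstdir, phase in jobs:
        if dstdir not in configured_dsts:
            continue  # skip jobs targeting dirs no longer in config.yaml
        if dstdir not in dir2ph or phase < dir2ph[dstdir]:
            dir2ph[dstdir] = phase
    return dir2ph
```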
E.g., each time a plot is created. This would let people do various alerts or monitoring, e.g. a curl to a service like https://healthchecks.io.
A nonzero priority is harmless but is slightly confusing to the user and makes it harder to quickly spot which dst dirs have any plots at all.
E.g.:
$ ./plotman.py interactive
Warning: unrecognized args: -k32 -n1
Warning: unrecognized args: -t/tmp -2/tmp2
Warning: unrecognized args: -d/dst -b6000
Warning: unrecognized args: -u128 -r3
...
Should probably use a standard library for parsing the args. :)
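A sketch with stdlib argparse handling the concatenated single-dash flags from the warnings above. The flag set shown is only the subset appearing in those warnings; a real implementation would mirror chia's full option list:

```python
import argparse

# Parse chia plots create style flags instead of ad hoc string handling.
parser = argparse.ArgumentParser(prog="chia plots create", add_help=False)
parser.add_argument("-k", type=int)
parser.add_argument("-n", type=int)
parser.add_argument("-r", type=int)
parser.add_argument("-u", type=int)
parser.add_argument("-b", type=int)
parser.add_argument("-t", dest="tmpdir")
parser.add_argument("-2", dest="tmp2dir")
parser.add_argument("-d", dest="dstdir")

# argparse accepts values attached to single-char flags, e.g. -k32:
args = parser.parse_args("-k32 -n1 -t/tmp -2/tmp2 -d/dst -b6000 -u128 -r3".split())
```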
I'm not really sure what the priority should be in the case where there's a chia on the $PATH and also one in the same env as plotman, but at least in the case where chia is not on the $PATH and is available in the env, it seems like it should be used? Or maybe it must be configured? At a high level this relates to being able to run plotman without activating the environment.
sysconfig.get_path() can get us the scripts (bin) path:
$ venv/bin/python -c 'import sysconfig; print(sysconfig.get_path("scripts"))'
/farm/venv/bin
Or, maybe we can get the executable path from psutil when there are existing plots, and if they all match then we can presume that's the chia to use?
Here are some issues that are at least in part related to chia-finding.
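One possible priority, sketched as a helper (the name and the exact fallback order are assumptions, not a decided design): prefer a chia in the same scripts (bin) dir as the running interpreter, then fall back to $PATH.

```python
import os
import shutil
import sysconfig

def find_chia():
    """Return a path to a chia executable, or None if none is found."""
    # Same env as plotman: the interpreter's scripts (bin) directory.
    candidate = os.path.join(sysconfig.get_path("scripts"), "chia")
    if os.access(candidate, os.X_OK):
        return candidate
    # Fall back to whatever is on $PATH, if anything.
    return shutil.which("chia")
```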
addwstr() is used to display text in curses and throws an error if there's not enough space for the text.
Aside from trying not to add too much text, we could at least wrap the writes in a try, catch the error, and print out something like "try increasing your terminal size".
current workaround: try increasing your terminal size...
We scan the process table then inspect each process. During that time, the process can disappear. We should ensure we're robust to that possibility. The following stacktrace suggests we're not:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/psutil/_pslinux.py", line 1517, in wrapper
return fun(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/psutil/_pslinux.py", line 1637, in cmdline
with open_text("%s/%s/cmdline" % (self._procfs_path, self.pid)) as f:
File "/usr/local/lib/python3.8/dist-packages/psutil/_common.py", line 724, in open_text
return open(fname, "rt", **kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/2755564/cmdline'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./plotman.py", line 123, in <module>
interactive.run_interactive()
File ".../chia/plotman/interactive.py", line 293, in run_interactive
curses.wrapper(curses_main)
File "/usr/lib/python3.8/curses/__init__.py", line 105, in wrapper
return func(stdscr, *args, **kwds)
File ".../chia/plotman/interactive.py", line 134, in curses_main
(started, msg) = manager.maybe_start_new_plot(dir_cfg, sched_cfg, plotting_cfg)
File ".../chia/plotman/manager.py", line 66, in maybe_start_new_plot
jobs = job.Job.get_running_jobs(dir_cfg['log'])
File ".../chia/plotman/job.py", line 53, in get_running_jobs
if proc.name() == 'chia':
File "/usr/local/lib/python3.8/dist-packages/psutil/__init__.py", line 622, in name
cmdline = self.cmdline()
File "/usr/local/lib/python3.8/dist-packages/psutil/__init__.py", line 675, in cmdline
return self._proc.cmdline()
File "/usr/local/lib/python3.8/dist-packages/psutil/_pslinux.py", line 1524, in wrapper
raise NoSuchProcess(self.pid, self._name)
psutil.NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=2755564, name='/usr/sbin/munin')
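The standard remedy is to guard every per-process call, since a process can exit between process_iter() listing it and our inspecting it. A hedged sketch of what the scan in job.py could look like (function name is illustrative):

```python
import psutil

def running_chia_jobs():
    """Scan the process table for chia processes, tolerating races."""
    jobs = []
    for proc in psutil.process_iter():
        try:
            if proc.name() == "chia":
                jobs.append(proc)
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            continue  # process disappeared or is unreadable; skip it
    return jobs
```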
Overall, and broken down by temp dir
There should be a central location for config.yaml (see this discussion).
I'm making this issue mostly because @altendky suggested I split this feature off of #61.
Currently, the archiving process assumes archiving to a remote host using rsync.
We should allow multiple transports; at least, rsync between local dirs.
The main complication here is that we not only need to transfer files, but also to check free disk space on the target directories. Currently this is implemented by remotely executing the df command on the remote host. So we need a clean and robust way to check free space that is coordinated with the file transport mechanism.
There's an additional complication: when transferring via rsync to a remote rsync daemon, we see virtual paths as exported by the rsync daemon's configured modules, but when we execute df, we see paths as they exist natively on the remote host. We currently have a fairly crufty mechanism for mapping between the two views of paths. This mechanism needs to be robust and to generalize across whatever transport mechanisms we implement.