netflix / fenzo
Extensible Scheduler for Mesos Frameworks
I'd like the "good enough" calculation to vary with the urgency of the task. For example, imagine that a task should be scheduled within 30 seconds. At first, the fitness bar should be held high, then be gradually lowered as the task request gets older; at the 30 second mark, "anything that meets the hard constraints will do".
I suggest that the FitnessGoodEnoughFunction accept the TaskAssignmentResult, to convey the TaskRequest along with the fitness measurement.
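A minimal sketch of the decaying fitness bar described above, assuming a linear decay from an initial bar down to zero at the deadline. The class UrgencyAwareThreshold and its methods are hypothetical and not part of Fenzo; the request age would have to come from the TaskRequest, which is why the function needs more than the fitness value alone.

```java
/** Hypothetical helper, not part of Fenzo: a fitness bar that drops as a request ages. */
final class UrgencyAwareThreshold {
    private final double initialBar;   // fitness required for a brand-new request
    private final long deadlineMillis; // age at which any fitness is accepted

    UrgencyAwareThreshold(double initialBar, long deadlineMillis) {
        if (deadlineMillis <= 0) {
            throw new IllegalArgumentException("deadlineMillis must be positive");
        }
        this.initialBar = initialBar;
        this.deadlineMillis = deadlineMillis;
    }

    /** Required fitness for a request of the given age; decays linearly to 0 at the deadline. */
    double requiredFitness(long ageMillis) {
        if (ageMillis >= deadlineMillis) {
            return 0.0; // past the deadline: anything that meets the hard constraints will do
        }
        if (ageMillis <= 0) {
            return initialBar;
        }
        double remaining = 1.0 - (double) ageMillis / deadlineMillis;
        return initialBar * remaining;
    }

    /** True if the measured fitness clears the bar for a request of this age. */
    boolean isGoodEnough(double fitness, long ageMillis) {
        return fitness >= requiredFitness(ageMillis);
    }
}
```

With an initial bar of 0.9 and a 30-second deadline, a fitness of 0.5 is rejected for a fresh request but accepted once the request is 15 seconds old.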
Is the individual contributor agreement different from the corporate contributor agreement?
Also, is it as easy as signing off your PR commits via git commit -s?
Is it just the normal Apache 2 agreement, i.e. https://www.apache.org/licenses/cla-corporate.txt ?
Assume two task requests and a VM lease come in:
VMLease1: { fooset: {foo-a, foo-b} }
TaskRequest1: requires foo-a
TaskRequest2: requires foo-a
Both task requests are entered simultaneously:
taskScheduler.scheduleOnce( [TaskRequest1, TaskRequest2], [VMLease1])
For both tasks, I have overridden getHardConstraints() to include a ConstraintEvaluator that checks whether a VMLease has the set resource "foo-a"
If TaskRequest1 is evaluated first and succeeds, how does Fenzo tell TaskRequest2 that "foo-a" is not available anymore when it runs TaskRequest2's constraints evaluator? Or is it the case that each task request's foo-a ConstraintEvaluator sees a different VirtualMachineCurrentState when ConstraintEvaluator.evaluate() is called (so that only one of the task requests' foo-a evaluators will see foo-a)?
What I'm trying to ask is, if VMLease1 satisfies both TaskRequest1 and TaskRequest2, how does Fenzo know not to return from taskScheduler.scheduleOnce() with a success to both TaskRequests since they both ask for the same resource?
The library is currently linking against Mesos 0.24. Consider updating the Mesos system requirements to a more recent version, e.g. 1.0+. This would help with #83 (which requires a more recent protocol), and with reducing the test scope.
The Javadocs don't match the latest code. It'd be awesome to update them!
Generally there are a couple of options:
new Thread().run()
When developing a large framework - i.e.: relatively involved, where mesos is a small part - based on what you guys built at netflix, is there anything that works particularly well ?
The downside of (1) is using thread.sleep() which is blocking.
The downside of (2) is that, given the async nature of fulfilling offers, it will probably not work for short expiration tasks.
The downside of (3) is that you commit to their threading model - i.e.: actors, or callbacks for the time wheel approach which seem ok.
Also this is more of a mailing list discussion, but couldn't find any, so I thought of issues as a way to record some fenzo wisdom. :) - maybe it can turn into docs PR later !
Problem Description
Fenzo misinterprets offers containing a mix of reserved and unreserved resources, causing it to fail to consider all offered resources. For example, given an offer of 2 reserved CPUs and 3 unreserved CPUs, Fenzo behaves as though the offer contains 2 (or 3) CPUs, not 5 CPUs as it should.
This situation arises when the operator (or another framework in the same role) reserves a subset of a host for the framework's role. This is an increasingly common phenomenon due to:
Here's an example depicting the resources within such an offer (2 cpus for myrole, 3 unreserved):
cpus(myrole):2.0; mem(myrole):4096.0; ports(myrole):[1025-2180];
disk(*):28829.0; cpus(*):3.0; mem(*):10766.0; ports(*):[2182-3887,8082-8180,8182-32000]
Problem Location
The root cause is within com.netflix.fenzo.plugins.VMLeaseObject. The VMLeaseObject assumes that a given resource name (e.g. cpus) will appear at most once in the offer.
Suggested fix
VMLeaseObject should aggregate all resources with the same name (subject to a set of roles to filter on).
A suggested workaround is for the framework to use an alternate implementation of com.netflix.fenzo.VirtualMachineLease. See example here.
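The aggregation suggested above can be sketched as follows. This is only an illustration under stated assumptions: ScalarResource is a simplified stand-in for a Mesos scalar resource, and OfferResources.total is a hypothetical helper, not the VMLeaseObject code or the linked example.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper: sums same-named scalar resources across the roles of interest. */
final class OfferResources {

    /** Simplified stand-in for a scalar Mesos resource (name, role, value). */
    static final class ScalarResource {
        final String name;
        final String role;
        final double value;

        ScalarResource(String name, String role, double value) {
            this.name = name;
            this.role = role;
            this.value = value;
        }
    }

    /** Total of all resources called {@code name} whose role is in {@code roles}. */
    static double total(List<ScalarResource> resources, String name, Set<String> roles) {
        double sum = 0.0;
        for (ScalarResource r : resources) {
            if (r.name.equals(name) && roles.contains(r.role)) {
                sum += r.value; // accumulate instead of letting the last entry win
            }
        }
        return sum;
    }
}
```

For the example offer above, filtering on the roles myrole and * yields 5.0 cpus, while filtering on * alone yields 3.0.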
Fill this in with links to more complete documentation.
11:30:45.916 [ERROR] [org.gradle.BuildExceptionReporter]
11:30:45.917 [ERROR] [org.gradle.BuildExceptionReporter] FAILURE: Build failed with an exception.
11:30:45.918 [ERROR] [org.gradle.BuildExceptionReporter]
11:30:45.918 [ERROR] [org.gradle.BuildExceptionReporter] * What went wrong:
11:30:45.918 [ERROR] [org.gradle.BuildExceptionReporter] Execution failed for task ':fenzo-core:compileJava'.
11:30:45.918 [ERROR] [org.gradle.BuildExceptionReporter] > invalid source release: 1.8
11:30:45.919 [ERROR] [org.gradle.BuildExceptionReporter]
11:30:45.919 [ERROR] [org.gradle.BuildExceptionReporter] * Exception is:
11:30:45.920 [ERROR] [org.gradle.BuildExceptionReporter] org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':fenzo-core:compileJava'.
11:30:45.921 [ERROR] [org.gradle.BuildExceptionReporter] at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeActions(ExecuteActionsTaskExecuter.java:69)
11:30:45.921 [ERROR] [org.gradle.BuildExceptionReporter] at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.execute(ExecuteActionsTaskExecuter.java:46)
11:30:45.921 [ERROR] [org.gradle.BuildExceptionReporter] at org.gradle.api.internal.tasks.execution.PostExecutionAnalysisTaskExecuter.execute(PostExecutionAnalysisTaskExecuter.java:35)
11:30:45.921 [ERROR] [org.gradle.BuildExceptionReporter] at org.gradle.api.internal.tasks.execution.SkipUpToDateTaskExecuter.execute(SkipUpToDateTaskExecuter.java:68)
11:30:45.921 [ERROR] [org.gradle.BuildExceptionReporter] at org.gradle.api.internal.tasks.execution.ValidatingTaskExecuter.execute(ValidatingTaskExecuter.java:58)
11:30:45.921 [ERROR] [org.gradle.BuildExceptionReporter] at org.gradle.api.internal.tasks.execution.SkipEmptySourceFilesTaskExecuter.execute(SkipEmptySourceFilesTaskExecuter.java:52)
We have multiple Mesos frameworks in a Mesos Cluster with three hosts(agents). Some of the frameworks developed by ourselves are using Fenzo and some of the frameworks are not using Fenzo (e.g. Marathon). We have configured leaseOfferExpirySecs to 2 and have found that frameworks that use Fenzo have been starving frameworks that do not use Fenzo.
We would like to ask the following questions.
I read about Fenzo on the Netflix blog. The autoscaling concept sounds interesting, though I have yet to try Fenzo myself. Can you please explain what autoscaling means in this context? Does Fenzo shut down and bring up VMs based on demand?
Thanks in advance,
Mani
If one builds a framework on top of Fenzo, what are the guidelines for making the framework highly available? Specifically, assuming ZooKeeper is used for leader election between framework instances, how can Fenzo's local state (for example task queues, running state, etc.) be synchronized to the other instances of the framework?
Thanks for your help.
Apologies if this isn't the place to ask questions - I couldn't find a mailing list.
My understanding is that a TaskScheduler is quite stateful and should be a singleton within a cluster of JVM instances. Is this view incorrect? If so then what are the considerations around scaling?
Thanks!
After reading about Fenzo, I still don't understand how to get framework information from Mesos.
Hi,
I am using the UniqueHostAttrConstraint to ensure tasks are assigned to different hosts.
However, after a system restart the constraint stops working. I believe this is because the assignment history is lost on restart.
What is the correct way to persist the assignment history and recover it after a system restart?
The explanation for TaskScheduler:disableVM indicates that it will remember the "disabled" state if the VM is not yet known by Fenzo.
http://netflix.github.io/Fenzo/fenzo-core/com/netflix/fenzo/TaskScheduler.html#disableVM-java.lang.String-long-
In my testing, I was not observing this to be working. If I disable a VM before offers are received from it, it still comes up in an enabled state.
Completely describe the Fenzo API in javadoc comments, including classes, interfaces, methods, attributes, and enums.
Use the active voice in order to make the documentation unambiguous -- see http://go/pv
See #51 for discussion
I am running tasks with a custom dockerized executor that needs 0.5 cpus.
If (let's say) all my tasks need 0.1 cpus, and Fenzo gets a lease for 1.0 cpu, what normally happens is that "scheduleOnce" tries to pair up ten tasks against that lease... so I schedule those ten.
I can work around it by checking whether my custom executor is not part of the offer, then manually summing resources and not scheduling tasks that would cause a TASK_ERROR. But of course this somewhat duplicates what I hope Fenzo would do for me, and I am bound to screw it up.
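The manual accounting in that workaround amounts to the arithmetic below. This is a hedged sketch only: ExecutorAwarePacking and tasksThatFit are hypothetical names, and CPU amounts are expressed in milli-CPUs (an assumption made here to avoid floating-point rounding).

```java
/** Hypothetical helper: how many tasks fit in an offer once executor overhead is reserved. */
final class ExecutorAwarePacking {

    /**
     * @param offeredMilliCpus       CPUs in the offer, in thousandths of a CPU
     * @param executorMilliCpus      CPUs the executor itself needs
     * @param taskMilliCpus          CPUs each task needs
     * @param executorAlreadyRunning true if the executor is already running on the agent,
     *                               so its CPUs need not be taken from this offer
     */
    static int tasksThatFit(int offeredMilliCpus, int executorMilliCpus,
                            int taskMilliCpus, boolean executorAlreadyRunning) {
        if (taskMilliCpus <= 0) {
            throw new IllegalArgumentException("taskMilliCpus must be positive");
        }
        int overhead = executorAlreadyRunning ? 0 : executorMilliCpus;
        int available = offeredMilliCpus - overhead;
        return available <= 0 ? 0 : available / taskMilliCpus;
    }
}
```

For the numbers above (a 1.0 cpu lease, a 0.5 cpu executor, 0.1 cpu tasks), only five tasks fit when the executor still has to be launched, rather than the ten that a naive pairing produces.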
It'd be great if the (I assume) NPE was logged or bubbled up somewhere.
We're not sure how best to do this, so it may take some investigation. Do we have to host these elsewhere, or can we get them here in github with the rest of our docs?
Didn't see any docs around this:
TaskRequest.java {
    public NamedResourceSetRequest(String resName, String resValue, int numSets,
                                   int numSubResources) {
        this.resName = resName;
        this.resValue = resValue;
        this.numSets = numSets;
        this.numSubResources = numSubResources;
    }
    ....
}
Is it for the Mesos agent's --attributes flag?
Sorry that I couldn't find a place to ask questions so had to open an issue here.
I found that after calling TaskScheduler.getTaskAssigner().call(...), Fenzo will hold all subsequent resource offers of the corresponding slave forever without even looking at leaseOfferExpirySecs. I'd like to know why Fenzo needs to do that. It makes other frameworks unable to use the remaining resources of that slave. Or is it expected that there should NOT be other frameworks in the cluster? In other words, is it expected that the framework that uses Fenzo should be the only framework in the cluster?
Thanks a lot!
After reading the features I'm struggling to understand what actual use cases are where Fenzo comes in. For example in my case, I have a Spark framework (on Mesos) which runs interactive queries with a measured average response time. Now if I increase the cluster resources by a factor 2 or 4 for example, and increase the amount of Spark frameworks by that same factor, the variance increases greatly. So far, I can see that this stems from the fact that a Spark framework is very greedy; if a resource offer is made, the framework accepts all, and in the meantime other frameworks stall. The average response time is pretty much the same only because the added resources make up for the longer waiting periods for the frameworks. But the higher variance makes this unfavorable to interactive Spark sessions.
When a TaskRequest is submitted to a Fenzo scheduler (scheduler.scheduleOnce()), it stays in the scheduler's internal queue until a resource offer comes in that fits its requirements. However, it may be the case that no such offered resource will ever satisfy it, and so we would like for the task to be auto-removed from the internal queue after sitting there for a specified amount of time, instead of bad TaskRequests filling up the queue forever. To support this, the following things would be needed:
SchedulerBuilder.withQueueWaitTimeout(1000), where 1000 is seconds
SchedulerBuilder.withRemovedFromQueueCallback(Action1<TaskRequest>) or scheduler.setRemovedFromQueueCallback(Action1<TaskRequest>)
For a framework based on Fenzo, what are the guidelines for scheduling service style tasks?
I am looking at a use case that schedules a mix of service and batch jobs. The queueable task input to Fenzo makes no distinction between service and batch jobs. This implies that the framework should restart a service job when the job finishes or fails. One way is to push the failed/finished service job back into the pending queue and wait for Fenzo to schedule it. However, this may interrupt the service until the jobs already pending in the queue get scheduled.
Is there any recommendation for handling restarts of service style tasks while minimizing the interruption of the service?
If we're going to encourage people to use the sample projects to help them make their first Fenzo-aware frameworks, these should be full of good code comments to explain what's going on and why and what options are available.
Currently, Fenzo supports releasing offers at a fixed rate. In order to have a Fenzo-scheduled framework function in a multiframework environment, it's important for every offer to be declined in a timely manner if it isn't used. For example, another framework may have specific constraints or properties it's looking for in an agent--currently, it's possible for Fenzo to hold onto an offer for many minutes (if you're unlucky on a very large cluster). This can negatively impact other frameworks that are looking for very specific hosts.
The solution I'd like to see is the ability to configure a maximum time to hold an offer before declining it; this way, a Fenzo-based framework could choose to hold no offers for longer than 30 seconds; this would greatly benefit multi-framework Mesos clusters.
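A minimal sketch of the requested behavior, assuming the framework tracks offer arrival times itself. OfferHoldTracker is a hypothetical helper, not a Fenzo class; timestamps are passed in explicitly so the logic does not depend on a clock.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: tracks how long each offer has been held. */
final class OfferHoldTracker {
    private final long maxHoldMillis;
    private final Map<String, Long> receivedAt = new LinkedHashMap<>();

    OfferHoldTracker(long maxHoldMillis) {
        this.maxHoldMillis = maxHoldMillis;
    }

    /** Record an offer when it arrives. */
    void offerReceived(String offerId, long nowMillis) {
        receivedAt.put(offerId, nowMillis);
    }

    /** Forget an offer that was used to launch tasks. */
    void offerUsed(String offerId) {
        receivedAt.remove(offerId);
    }

    /** Removes and returns the ids of offers held for at least the maximum hold time. */
    List<String> drainExpired(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = receivedAt.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= maxHoldMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

A framework could call drainExpired periodically, decline the returned offers with Mesos, and tell Fenzo to expire the matching leases, so that no offer is held for longer than, say, 30 seconds.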
Fenzo supports the concept of a preferential named consumable resource, which models a collection
of two-level resources. The top-level resource is tagged with a name during the task placement process,
which defines its runtime profile. Multiple tasks matching the same profile can be
associated with the same consumable resource and be allocated a portion of its sub-resources.
For example, in AWS an ENI and its security group can be modeled as a two-level resource. The ENI
interface models the resource, the sub-resource is the number of IPs that can be associated with an ENI
interface, and the runtime profile is defined by the security group(s) associated with an ENI.
Tasks with identical security groups placed on the same agent may thus share a single ENI interface
until the pool of available IPs (sub-resources) is exhausted. When the last task associated with an ENI
interface is terminated, its runtime profile becomes undefined again.
As calling the AWS API is expensive, it makes sense to reduce the number of calls related to network stack configuration by reusing already provisioned resources. This means Fenzo should promote task placement on an agent/ENI slot that already holds the required resources. As Fenzo has limited insight into this (unless a task is already associated with an ENI), we need a pluggable API to externalize this evaluation process.
To achieve this goal, two new callback interfaces are proposed. PreferentialNamedConsumableResourceEvaluator computes a fitness score for each valid task/ENI assignment. SchedulingEventListener provides notifications from within the scheduling loop, so newly placed tasks can be accounted for during the fitness calculation process.
/**
 * Evaluator for the {@link PreferentialNamedConsumableResource} selection process. Given an agent with a matching
 * ENI slot (either empty or with a matching name), this evaluator computes the fitness score.
 * A custom implementation can provide fitness calculators augmented with additional information not available to
 * Fenzo for making the best placement decision.
 *
 * <h1>Example</h1>
 * {@link PreferentialNamedConsumableResource} can be used to model AWS ENI interfaces together with IP and security
 * group assignments. To minimize the number of AWS API calls and to improve efficiency, it is beneficial to place a task
 * on an agent which has an ENI with a matching security group profile, so the ENI can be reused. Or, if a task
 * is terminated but the agent releases its resources lazily, they can be reused by another task with a matching profile.
 */
public interface PreferentialNamedConsumableResourceEvaluator {
    /**
     * Provide a fitness score for an idle consumable resource.
     *
     * @param hostname hostname of an agent
     * @param resourceName name to be associated with a resource with the given index
     * @param index a consumable resource index
     * @param subResourcesNeeded an amount of sub-resources required by a scheduled task
     * @param subResourcesLimit a total amount of sub-resources available
     * @return fitness score
     */
    double evaluateIdle(String hostname, String resourceName, int index, double subResourcesNeeded, double subResourcesLimit);

    /**
     * Provide a fitness score for a consumable resource that is already associated with some tasks. These tasks and
     * the current one have matching profiles, so they can share the resource.
     *
     * @param hostname hostname of an agent
     * @param resourceName name associated with a resource with the given index
     * @param index a consumable resource index
     * @param subResourcesNeeded an amount of sub-resources required by a scheduled task
     * @param subResourcesUsed an amount of sub-resources already used by other tasks
     * @param subResourcesLimit a total amount of sub-resources available
     * @return fitness score
     */
    double evaluate(String hostname, String resourceName, int index, double subResourcesNeeded, double subResourcesUsed, double subResourcesLimit);
}

/**
 * A callback API providing notification about Fenzo task placement decisions during the scheduling process.
 */
public interface SchedulingEventListener {
    /**
     * Called before a new scheduling iteration is started.
     */
    void onScheduleStart();

    /**
     * Called when a new task placement decision is made (a task gets resources allocated on a server).
     *
     * @param taskAssignmentResult task assignment result
     */
    void onAssignment(TaskAssignmentResult taskAssignmentResult);

    /**
     * Called when the scheduling iteration completes.
     */
    void onScheduleFinish();
}
For mocking purposes (w/ Mockito) it would be handy if the builder weren't final.
Currently, I am trying to debug an issue where I provide one task and one lease to schedule, and Fenzo reports zero successful and zero failed assignments. Since there's no debug logging available, it's tricky to trace what's going on.
Hi,
I find that my application fails to shut down gracefully after using TaskScheduler. After tracing the code, it seems the ExecutorService is never shut down (and there is no way to do so). Can we have a shutdown call in TaskScheduler to clean up the thread pool?
Are there any plans to support multi-disk resources and persistent volume allocation / creation?
Hi @spodila, just testing out Fenzo.
Looking at BinPackingFitnessCalculators.*, it's effectively
sum ( fitness_i( request ).... + fitness_n( request) ) / total_fitness_fns
# or | fitness |
Wondering if it'd be worth it to make it OR objects... so that one can do
builder.
.....
.withFitnessCalculator ( Fitness1 | Fitness2 | Fitness3 )
.build()
Interested in a patch?
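The averaging described above could be sketched as a small combinator. This is an illustration under stated assumptions, not Fenzo's implementation: it uses java.util.function.ToDoubleFunction as a stand-in for a fitness calculator, and AveragedFitness is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical combinator: the mean of several fitness functions over the same request. */
final class AveragedFitness<T> implements ToDoubleFunction<T> {
    private final List<ToDoubleFunction<T>> parts;

    @SafeVarargs
    AveragedFitness(ToDoubleFunction<T>... parts) {
        if (parts.length == 0) {
            throw new IllegalArgumentException("need at least one fitness function");
        }
        // Copy the array so later changes by the caller cannot affect this instance.
        this.parts = Arrays.asList(parts.clone());
    }

    @Override
    public double applyAsDouble(T request) {
        double sum = 0.0;
        for (ToDoubleFunction<T> part : parts) {
            sum += part.applyAsDouble(request);
        }
        return sum / parts.size(); // sum of fitness values divided by the number of functions
    }
}
```

Since Java has no operator overloading, a constructor or a static factory taking varargs is the closest equivalent to the Fitness1 | Fitness2 | Fitness3 notation.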
I guess two things:
Travis are now recommending removing the sudo tag.
"If you currently specify sudo: false in your .travis.yml, we recommend removing that configuration"
Hostname appears to be informational to Mesos: hostname does not appear in TaskStatus messages or TaskInfo messages. SlaveID does, and for Mesos tasks are assigned by SlaveID, not by hostname. This means that a framework that uses Fenzo must map SlaveIDs to hostnames just so that it can call Fenzo when a task state changes.
Better yet, make the taskUnassigner take a TaskStatus and then just pull out whatever is required.
From a quick glance in this line
The wiki docs should:
Some questions to answer:
A possible outline:
Fenzo misunderstands offers that contain numerous disk resources, as can occur when an agent is configured with multiple disks (as described here). Fenzo consequently miscalculates the available disk resources.
The VMLeaseObject constructor iterates over the offered resources to identify the cpus, mem, and disk. The last encountered disk resource wins, becoming the basis for the diskMB quantity. The logic should only consider the 'root' disk resource, i.e. the disk without a source component. This approach makes sense for frameworks that don't explicitly support non-root disks.
How to fully support multi-disk scheduling with Fenzo is considered a separate issue.
@spodila Does Fenzo support OpenStack environments? If so, how do I set it up and get it running?
Testing.
The unknownLeaseIdsToExpire array is modified directly when invoking the TaskScheduler#expireLease method. This array is not concurrent, and most likely the intent was to handle it like all the other mutations by the internal thread.
Here is an example stack trace:
fenzo.TaskScheduler:? - Error with scheduling run: null
java.util.ConcurrentModificationException
    at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
    at java.util.ArrayList$Itr.next(ArrayList.java:851)
    at com.netflix.fenzo.AssignableVMs.expireAnyUnknownLeaseIds(AssignableVMs.java:257)
    at com.netflix.fenzo.AssignableVMs.prepareAndGetOrderedVMs(AssignableVMs.java:270)
    at com.netflix.fenzo.TaskScheduler.doSchedule(TaskScheduler.java:754)
    at com.netflix.fenzo.TaskScheduler.doScheduling(TaskScheduler.java:736)
    at com.netflix.fenzo.TaskScheduler.scheduleOnce(TaskScheduler.java:711)
    at com.netflix.fenzo.TaskSchedulingService.scheduleOnce(TaskSchedulingService.java:275)
    at com.netflix.fenzo.TaskSchedulingService.access$700(TaskSchedulingService.java:73)
    at com.netflix.fenzo.TaskSchedulingService$1.run(TaskSchedulingService.java:140)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
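One common way to avoid this kind of failure is to hand lease ids to the scheduling thread through a concurrent queue, so that only one thread ever touches the plain list. The sketch below shows that pattern under stated assumptions; LeaseExpiryBuffer is a hypothetical class, and this is not necessarily how Fenzo fixed (or would fix) the problem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical buffer: callers enqueue lease ids, the scheduling thread drains them. */
final class LeaseExpiryBuffer {
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();

    /** Safe to call from any thread. */
    void expireLease(String leaseId) {
        pending.add(leaseId);
    }

    /** Intended for the scheduling thread only: takes everything queued so far. */
    List<String> drain() {
        List<String> drained = new ArrayList<>();
        String id;
        while ((id = pending.poll()) != null) {
            drained.add(id);
        }
        return drained;
    }
}
```

The scheduling loop would call drain() at the start of each iteration and iterate over the returned private list, which no other thread can modify.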
UPFRONT DISCLAIMER:
Nothing is broken - I am only asking if the Fenzo developer community would have any interest in pursuing or accepting a new feature.
I am working on a development effort where our team is adding new capabilities to a Mesos Framework that uses Fenzo. One of our requirements is "greedy" scheduling. Effectively, we work with a lot of scientific algorithms that can run sequentially or can run in parallel modes. The parallel modes run faster but need more resources, making them harder to schedule sometimes. Currently, these algorithms must be configured with high resource requirements, but sometimes they take forever to schedule. Alternatively, we can configure them with low resource requirements, and they run, but much more slowly.
We are looking at introducing a data model on top of TaskRequest that takes "resource ranges" (min, max). We are then looking at two different design approaches:
We recognize that approach 2, although more optimal and faster to schedule, is not readily possible using TaskQueues and TaskSchedulingService. There would be a lot of changes to Fenzo needed to allow concurrent calls to scheduleOnce() on the same sets of resourceOffers and taskRequests to generate the permutations... (honestly not even sure if you'd use 1 taskQueue per permutation or try to handle it all in one).
Our proposed solution is to choose approach 2 ONLY IF the Fenzo team is interested in helping to support the feature, or if the Fenzo team would be willing to accept the additional complexity of a pull request with resource ranges and permutations.
If the Fenzo team is not interested in this "greedy" scheduling feature or doesn't believe it adds general-purpose value to the community, then our team is going to select approach 1.
Any insight you have on this matter is greatly appreciated! Whether it be a "yes" or "no" on contributing support...... a "yes" or "no" on accepting a pull request.... or even general guidance on something we've missed, whereby Fenzo is already capable of solving this problem elegantly.
Thank you!
In com.netflix.fenzo.AssignableVMs#removeLimitedLeases there is this comment:
// randomize the list so we don't always reject leases of the same VM before hitting the reject limit
Don't we want to free up large chunks of contiguous space on a single VM, rather than fragmenting space across lots of machines?
See @spodila comment on #50
Thanks to Fenzo for the separation of concerns that abstracts the scheduling logic. I was able to create a simple framework (using Fenzo) to scale tasks (Docker) up and down, and also to scale the platform (Softlayer) up and down as needed.
One issue I faced was balancing a shorter offer expiration (to increase platform sharing) against a less aggressive scale-up trigger (because an offer has just expired).
My eventual solution was, besides setting the offer expiration and scale-up delay configuration, to add another configuration, "wait seconds since last lease expiration before scale up", which ignores the scale-up if the last offer expiration happened within that duration.
Please let me know if there are alternatives to introducing this new configuration. You can find details of my framework here: https://github.com/yanglei99/Mesos_Auto_Scale
Thanks.
Hey there,
Great to see the new OptimizingShortfallEvaluator. Any chance this class/interface hierarchy could be made public so that we can extend it to implement our own shortfall evaluation strategies?
The use case I have is where we are scheduling only short-lived tasks on a dedicated auto-scaling group (or possibly groups in the future). If there are tasks that have a lifetime in the order of seconds, then the current shortfall evaluation ends up grossly overestimating resource needs. Ideally we want to pseudo-schedule some pseudo tasks that represent what we think our resource requirements will be for the next n minutes (where n is probably derived from the auto-scaling cooldown period) based on currently running tasks and pending tasks (and maybe some task history that we record as well).
We might even just start with something fairly naive that doesn't even use pseudo-scheduling, so it would be cool just to be able to implement ShortfallEvaluator ourselves.
I started porting our Fenzo-based scheduler over to the Mesos HTTP API using mesos-rxjava.
One issue I ran into is documented in the mesos-rxjava project (mesosphere/mesos-rxjava#74). To summarize, the HTTP API now hands me an org.apache.mesos.v1.Protos.Offer, but Fenzo only accepts org.apache.mesos.Protos.Offer.
I have a few ideas for how to convert to/from these Offers, but they feel risky. I was wondering if (a) Fenzo plans to support v1.Protos in the future? ... and (b) if you had any idea how to cleanly convert from v1.Protos to legacy Protos in the meantime?