
Easing your on-premise Data Processing

Home Page: https://daxnet.github.io/abacuza

License: Apache License 2.0


Abacuza

A Simplified Data Processing Platform


Architecture Diagram

(architecture diagram image)

Table of Contents

  • Prerequisites
  • How to Build
  • How to Debug (Services)
  • How to Run (Debug Mode)
  • How to Run
  • Quick Start: Word Count
  • Documentation

Prerequisites

  • Docker Engine: v19.03 or above
  • Docker Compose: v1.27.2 or above

How to Build

  1. Clone the repo:

    git clone https://github.com/daxnet/abacuza

  2. Build everything with the following command:

    docker-compose -f docker-compose.build.yaml build
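
Optionally, you can verify that the images were built by listing them; the exact image names depend on the compose file, so treat this as a quick sanity check rather than an official step:

    docker images | grep -i abacuza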

How to Debug (Services)

  1. Start the infrastructure services, such as the database and the Redis cache:

    docker-compose -f docker-compose.dev.yaml up

  2. Open abacuza.sln from the src/services directory in Visual Studio 2019

  3. Press F5 to debug

How to Run (Debug Mode)

  1. Follow the instructions in How to Debug (Services) to start the infrastructure services and the backend services
  2. Go to the src/client directory
  3. Run npm install to install the dependencies
  4. Run npm start to start the Angular development server at localhost:4200 (the client commands are summarized after this list)
  5. Navigate to http://localhost:4200 in a web browser to access the Abacuza Administrator dashboard
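
The client-side steps above boil down to the following commands (a quick summary, assuming Node.js and npm are installed):

    cd src/client
    npm install
    npm start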

How to Run

  1. Execute the following command to run everything:

    docker-compose up

  2. Navigate to http://localhost:9320 in a web browser to access the Abacuza Administrator dashboard
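
If the dashboard does not load, a quick way to check whether all of the containers came up is to list the compose services (a troubleshooting aid, not part of the original steps):

    docker-compose ps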

Quick Start: Word Count

Microsoft provides a .NET for Spark tutorial that demonstrates counting the words in a given text file. We will use that demo to showcase the features and data processing capabilities provided by Abacuza.

Develop the Word Count Application

An application in Abacuza describes how data should be processed or transformed; it is usually developed by data scientists to meet their analysis needs. Applications are assigned to job runners and loaded by the job runner when a project requests a data processing session. Developing an application for Abacuza involves the following tasks:

  • Create a new .NET 5 console application
  • Add the Microsoft.Spark and Abacuza.JobRunners.Spark.SDK NuGet package references
  • Customize the application
  • Build and pack the application
  1. Create a new .NET 5 console application
    $ dotnet new console -f net5.0 -n WordCountApp
  2. Add NuGet package reference
    $ dotnet add package Microsoft.Spark --version 2.0.0
    $ dotnet add package Abacuza.JobRunners.Spark.SDK --prerelease
  3. Add a new class that derives from SparkRunnerBase; its transformation logic is copied from the example code provided by Microsoft:
    using Abacuza.JobRunners.Spark.SDK;
    using Microsoft.Spark.Sql;
    
    namespace WordCountApp
    {
       public class WordCountRunner : SparkRunnerBase
       {
          public WordCountRunner(string[] args) : base(args)
          {
          }
    
          protected override DataFrame RunInternal(SparkSession sparkSession, DataFrame dataFrame)
                => dataFrame
                   .Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
                   .Select(Functions.Explode(Functions.Col("words")).Alias("word"))
                   .GroupBy("word")
                   .Count()
                   .OrderBy(Functions.Col("count").Desc());
       }
    }
  4. Modify Program.cs so that the Main method simply invokes the WordCountRunner:
    static void Main(string[] args)
    {
       new WordCountRunner(args).Run();
    }
  5. Under the WordCountApp project folder, execute the following command to publish the application targeting the Linux x64 platform:
    $ dotnet publish -c Release -f net5.0 -r linux-x64 -o published
  6. Zip the contents of the published folder. Note that the zip file should only contain the content under the published folder; the published folder itself shouldn't be zipped. For example, executing the following command on Linux will zip the contents of the published folder into a ZIP file:
    $ zip -rj WordCountApp.zip published/.
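
To double-check the archive layout, you can list the zip contents; the published files (e.g. WordCountApp.dll) should appear at the root of the archive rather than under a published/ directory (a quick sanity check, assuming the zip/unzip utilities are installed):

    $ unzip -l WordCountApp.zip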

Creating the Spark Cluster Connection

Before running the data transformation, you need to create a cluster connection in Abacuza that connects to a data processing cluster. By default, Abacuza ships with a Spark cluster implementation, which is also the one used here.

  1. Determine your IP address by using ifconfig (or ipconfig on Windows)

  2. Edit the template.env file and set the ACCESS_HOST environment variable to your IP address

  3. Start Abacuza services and front-end dashboard by using the following command:

    $ docker-compose --env-file template.env up

    For more information about running Abacuza locally, please refer to the steps above

  4. Open your web browser and navigate to http://<your-ip-address>:9320; this opens the Abacuza dashboard

  5. Log in with your credentials; by default, the username is super and the password is P@ssw0rd

  6. In the left pane, click Cluster Connections, then in the Cluster Connections page, click the Add Connection button to create a new cluster connection

  7. In the Add Connection dialog, fill in the name and description fields, and choose spark for the Cluster type. In the Settings text box, enter the Spark settings in JSON format. To keep things simple, we only specify the base URL of the Spark Livy endpoint. Click the Save button to save the changes:

    Create Cluster Connection

    {
       "baseUrl": "http://192.168.0.110:8998"
    }
  8. Your cluster connection to the running Spark instance should now be ready
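
Optionally, you can verify that the Livy endpoint configured above is reachable before moving on; Livy exposes a REST API, and a GET request against /sessions should return a JSON response (a quick check outside of Abacuza, using the example address from step 7):

    $ curl http://192.168.0.110:8998/sessions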

Preparing the Job Runner

Follow the steps below to create a job runner in Abacuza.

  1. Click Job Runners menu, then click the Add Job Runner button to create a new job runner

  2. In the Create Job Runner dialog, fill in the name and description for the job runner, and for the Cluster type, choose Spark:

    New Job Runner

  3. Click the Save button; Abacuza will redirect you to the Job Runner Details page

  4. In the Job Runner Details page, under the Binaries section, add the following two files to the Job Runner:

    1. microsoft-spark-3-1_2.12-2.0.0.jar - you can find it in your published folder
    2. WordCountApp.zip - This is the Zip file you created in step 6 of chapter Develop the Word Count Application
  5. Under the Payload template section, use the following JSON document:

    {
       "file": "${jr:binaries:microsoft-spark-3-1_2.12-2.0.0.jar}",
       "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
       "args": [
          "${jr:binaries:WordCountApp.zip}",
          "WordCountApp",
          "${proj:input-defs}",
          "${proj:output-defs}",
          "${proj:context}"
       ]
    }

    Note that the ${jr:binaries} placeholder refers to the binary files that you've uploaded to the current job runner.

  6. Save the job runner

Creating the Project

  1. Click Projects menu

  2. In the Projects page, click the Add Project button to add a new project

  3. In the Add Project dialog, fill in the name and description of the project. For Input endpoint, choose Text Files; for Output endpoint, choose Console, which means that we want the output of the data processing to be shown in the console log. For the Job Runner, choose the one that we just created in the previous steps

    Add Project

  4. Save the project; the Project Details page will be shown

  5. Let's prepare some data. Follow the instructions on Microsoft's official site to create an input.txt file

  6. On the Project Details page, under INPUT tab, add the input.txt as the project input

    Input Text File

  7. Click the Submit button; the data processing job will be submitted to one of the clusters whose type is spark, and on that cluster, the customized application we developed above will be executed for data processing. You can monitor the status of the execution from the REVISIONS tab of the Project Details page

    Project Revisions

  8. Once the job completes successfully, you can click the log icon to see the logs. In this example, you can see the following output in the log:

    Revision Logs

For more information about the architecture, the design concepts and the developer's manual, please refer to the Abacuza Documentation.

Documentation

See https://daxnet.github.io/abacuza for the documentation.
