duongcongtoai / emberdb Goto Github PK

View Code? Open in Web Editor NEW

0.0 0.0 0.0 364 KB

Rust 99.83% Shell 0.13% Makefile 0.04%

emberdb's People

Contributors

Watchers

emberdb's Issues

Impl Seq scan

Description

The SeqScanExecutor iterates over a table and return its tuples, one-at-a-time. A sequential scan is specified by a SeqScanPlanNode. The plan node specifies the table over which the scan is performed. The plan node may also contain a predicate; if a tuple does not satisfy the predicate, it is not produced by the scan.
Hint: Be careful when using the TableIterator object. Make sure that you understand the difference between the pre-increment and post-increment operators. You may find yourself getting strange output by switching between ++iter and iter++.
Hint: You will want to make use of the predicate in the sequential scan plan node. In particular, take a look at AbstractExpression::Evaluate. Note that this returns a Value, on which you can invoke GetAs() to evaluate the result as a native C++ boolean type.
Hint: The output of sequential scan is a copy of each matched tuple and its original record identifier (RID).

Current package has no standard for Operator interface, maybe it's time to make it relevant

trait Operator {
    fn next(&mut self, t: ExecutionContext) -> Result<Batch> {}
}

ExecutionContext is put there for later phase (it contains reference to bpm, lock_manager, txn)
It may hold a reference to an underlying storage that support seq scan. This scan action should receive a filter that can be evaluated (check Expression in ToyDB)

SeqScan operator on construction receive some informatin such as predicate/filter, tablename, schema for output (which fields are selected).

Test

TEST_F(ExecutorTest, DISABLED_SimpleSeqScanTest) {
  // SELECT col_a, col_b FROM test_1 WHERE col_a < 500

  // Construct query plan
  TableMetadata *table_info = GetExecutorContext()->GetCatalog()->GetTable("test_1");
  Schema &schema = table_info->schema_;
  auto *col_a = MakeColumnValueExpression(schema, 0, "col_a");
  auto *col_b = MakeColumnValueExpression(schema, 0, "col_b");
  auto *const500 = MakeConstantValueExpression(ValueFactory::GetIntegerValue(500));
  auto *predicate = MakeComparisonExpression(col_a, const500, ComparisonType::LessThan);
  auto *out_schema = MakeOutputSchema({{"col_a", col_a}, {"col_b", col_b}});
  SeqScanPlanNode plan{out_schema, predicate, table_info->oid_};

  // Execute
  std::vector<Tuple> result_set;
  GetExecutionEngine()->Execute(&plan, &result_set, GetTxn(), GetExecutorContext());

  // Verify
  std::cout << "col_a, col_b" << std::endl;
  for (const auto &tuple : result_set) {
    ASSERT_TRUE(tuple.GetValue(out_schema, out_schema->GetColIdx("col_a")).GetAs<int32_t>() < 500);
    ASSERT_TRUE(tuple.GetValue(out_schema, out_schema->GetColIdx("col_b")).GetAs<int32_t>() < 10);
    std::cout << tuple.GetValue(out_schema, out_schema->GetColIdx("col_a")).GetAs<int32_t>() << ", "
              << tuple.GetValue(out_schema, out_schema->GetColIdx("col_b")).GetAs<int32_t>() << std::endl;
  }
  ASSERT_EQ(result_set.size(), 500);
}

Impl external mergesort join

There is only 1 implementation of join operator currently (grace hash join). There is an edge case that if a partition containing all the records that share the same value on join attribute, recursive hashing will not work and result in infinite recursion (all hashed value will eventually go to the same bucket again) => we need a fallback join operator if this case is met, and we can use sortmerge join like CockroachDB

Impl Insert

Description

The InsertExecutor inserts tuples into a table and updates indexes. There are two types of inserts that your executor must support.
The first type are insert operations in which the values to be inserted are embedded directly inside the plan node itself. We refer to these as raw inserts.
The second type are inserts that take the values to be inserted from a child executor. For example, you may have an InsertPlanNode with a SeqScanPlanNode as is child to implement an INSERT INTO .. SELECT ...
You may assume that the InsertExecutor is always at the root of the query plan in which it appears. The InsertExecutor should not modify its result set.
Hint: You will need to lookup table information for the target of the insert during executor initialization. See the System Catalog section below for additional information on accessing the catalog.
Hint: You will need to update all indexes for the table into which tuples are inserted. See the Index Updates section below for further details.
Hint: You will need to use the TableHeap class to perform table modifications.

Test

TEST_F(ExecutorTest, DISABLED_SimpleRawInsertTest)

=> Insert then run seq scan to test result

TEST_F(ExecutorTest, DISABLED_SimpleSelectInsertTest)

This test use a seed table called "table test_1"
Create a seq scan node with some predicate
Embed this node inside insert node
After the execution, execute both seqscan node and seqscan on newly created table and compare

TEST_F(ExecutorTest, DISABLED_SimpleRawInsertWithIndexTest)

Insert some raw values into an empty table
Execute createIndex operator on the recent table
SeqScan on the table, for each tuple, get the value of the indexed columns, then use that value to find RID using the index, then use that RID to fetch from the underlying storage, now we have original tuple and newly fetched tuple using index, compare them

Index:

An abstraction with underlying data structure (may use Btree)
Basic method:

  virtual void InsertEntry(const Tuple &key, RID rid, Transaction *transaction) = 0;

  // delete the index entry linked to given tuple
  virtual void DeleteEntry(const Tuple &key, RID rid, Transaction *transaction) = 0;

  virtual void ScanKey(const Tuple &key, std::vector<RID> *result, Transaction *transaction) = 0;

Async mode for executor trait

To support async mode, the trait itself must be async.
Reference: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#

For now we only care about 1 method:

    async fn execute(
        &self,
        partition: usize,
        context: Arc<ExecutionContext>,
    ) -> Result<SendableRecordBatchStream>

where SendableRecordBatchStream impl RecordBatchStream

pub trait RecordBatchStream: Stream<Item = ArrowResult<RecordBatch>> {
    /// Returns the schema of this `RecordBatchStream`.
    ///
    /// Implementation of this trait should guarantee that all `RecordBatch`'s returned by this
    /// stream should have the same schema as returned from this method.
    fn schema(&self) -> SchemaRef;
}

It impl std Stream trait, each poll will results to a RecordBatch

#[derive(Clone, Debug, PartialEq)]
pub struct RecordBatch {
    schema: SchemaRef,
    columns: Vec<Arc<dyn Array>>,
}

dyn Array because we don't know a head at compiled time which type the data is. We can use the same set of implementation like arrow crate: https://docs.rs/arrow/latest/arrow/array/index.html

duongcongtoai / emberdb Goto Github PK

emberdb's People

Contributors

Watchers

emberdb's Issues

Impl Seq scan

Description

Test

Impl external mergesort join

Impl Insert

Description

Test

Async mode for executor trait

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent