emberdb's People
emberdb's Issues
Async mode for executor trait
To support async mode, the trait itself must be async.
Reference: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#
For now we only care about 1 method:
async fn execute(
&self,
partition: usize,
context: Arc<ExecutionContext>,
) -> Result<SendableRecordBatchStream>
where SendableRecordBatchStream impl RecordBatchStream
pub trait RecordBatchStream: Stream<Item = ArrowResult<RecordBatch>> {
/// Returns the schema of this `RecordBatchStream`.
///
/// Implementation of this trait should guarantee that all `RecordBatch`'s returned by this
/// stream should have the same schema as returned from this method.
fn schema(&self) -> SchemaRef;
}
It impl std Stream trait, each poll will results to a RecordBatch
#[derive(Clone, Debug, PartialEq)]
pub struct RecordBatch {
schema: SchemaRef,
columns: Vec<Arc<dyn Array>>,
}
dyn Array because we don't know a head at compiled time which type the data is. We can use the same set of implementation like arrow crate: https://docs.rs/arrow/latest/arrow/array/index.html
Impl external mergesort join
There is only 1 implementation of join operator currently (grace hash join). There is an edge case that if a partition containing all the records that share the same value on join attribute, recursive hashing will not work and result in infinite recursion (all hashed value will eventually go to the same bucket again) => we need a fallback join operator if this case is met, and we can use sortmerge join like CockroachDB
Impl Seq scan
Description
The SeqScanExecutor iterates over a table and return its tuples, one-at-a-time. A sequential scan is specified by a SeqScanPlanNode. The plan node specifies the table over which the scan is performed. The plan node may also contain a predicate; if a tuple does not satisfy the predicate, it is not produced by the scan.
Hint: Be careful when using the TableIterator object. Make sure that you understand the difference between the pre-increment and post-increment operators. You may find yourself getting strange output by switching between ++iter and iter++.
Hint: You will want to make use of the predicate in the sequential scan plan node. In particular, take a look at AbstractExpression::Evaluate. Note that this returns a Value, on which you can invoke GetAs() to evaluate the result as a native C++ boolean type.
Hint: The output of sequential scan is a copy of each matched tuple and its original record identifier (RID).
Current package has no standard for Operator interface, maybe it's time to make it relevant
trait Operator {
fn next(&mut self, t: ExecutionContext) -> Result<Batch> {}
}
ExecutionContext is put there for later phase (it contains reference to bpm, lock_manager, txn)
It may hold a reference to an underlying storage that support seq scan. This scan action should receive a filter that can be evaluated (check Expression in ToyDB)
SeqScan operator on construction receive some informatin such as predicate/filter, tablename, schema for output (which fields are selected).
Test
TEST_F(ExecutorTest, DISABLED_SimpleSeqScanTest) {
// SELECT col_a, col_b FROM test_1 WHERE col_a < 500
// Construct query plan
TableMetadata *table_info = GetExecutorContext()->GetCatalog()->GetTable("test_1");
Schema &schema = table_info->schema_;
auto *col_a = MakeColumnValueExpression(schema, 0, "col_a");
auto *col_b = MakeColumnValueExpression(schema, 0, "col_b");
auto *const500 = MakeConstantValueExpression(ValueFactory::GetIntegerValue(500));
auto *predicate = MakeComparisonExpression(col_a, const500, ComparisonType::LessThan);
auto *out_schema = MakeOutputSchema({{"col_a", col_a}, {"col_b", col_b}});
SeqScanPlanNode plan{out_schema, predicate, table_info->oid_};
// Execute
std::vector<Tuple> result_set;
GetExecutionEngine()->Execute(&plan, &result_set, GetTxn(), GetExecutorContext());
// Verify
std::cout << "col_a, col_b" << std::endl;
for (const auto &tuple : result_set) {
ASSERT_TRUE(tuple.GetValue(out_schema, out_schema->GetColIdx("col_a")).GetAs<int32_t>() < 500);
ASSERT_TRUE(tuple.GetValue(out_schema, out_schema->GetColIdx("col_b")).GetAs<int32_t>() < 10);
std::cout << tuple.GetValue(out_schema, out_schema->GetColIdx("col_a")).GetAs<int32_t>() << ", "
<< tuple.GetValue(out_schema, out_schema->GetColIdx("col_b")).GetAs<int32_t>() << std::endl;
}
ASSERT_EQ(result_set.size(), 500);
}
Impl Insert
Description
The InsertExecutor inserts tuples into a table and updates indexes. There are two types of inserts that your executor must support.
The first type are insert operations in which the values to be inserted are embedded directly inside the plan node itself. We refer to these as raw inserts.
The second type are inserts that take the values to be inserted from a child executor. For example, you may have an InsertPlanNode with a SeqScanPlanNode as is child to implement an INSERT INTO .. SELECT ...
You may assume that the InsertExecutor is always at the root of the query plan in which it appears. The InsertExecutor should not modify its result set.
Hint: You will need to lookup table information for the target of the insert during executor initialization. See the System Catalog section below for additional information on accessing the catalog.
Hint: You will need to update all indexes for the table into which tuples are inserted. See the Index Updates section below for further details.
Hint: You will need to use the TableHeap class to perform table modifications.
Test
TEST_F(ExecutorTest, DISABLED_SimpleRawInsertTest)
=> Insert then run seq scan to test result
TEST_F(ExecutorTest, DISABLED_SimpleSelectInsertTest)
This test use a seed table called "table test_1"
Create a seq scan node with some predicate
Embed this node inside insert node
After the execution, execute both seqscan node and seqscan on newly created table and compare
TEST_F(ExecutorTest, DISABLED_SimpleRawInsertWithIndexTest)
- Insert some raw values into an empty table
- Execute createIndex operator on the recent table
- SeqScan on the table, for each tuple, get the value of the indexed columns, then use that value to find RID using the index, then use that RID to fetch from the underlying storage, now we have original tuple and newly fetched tuple using index, compare them
Index:
- An abstraction with underlying data structure (may use Btree)
Basic method:
virtual void InsertEntry(const Tuple &key, RID rid, Transaction *transaction) = 0;
// delete the index entry linked to given tuple
virtual void DeleteEntry(const Tuple &key, RID rid, Transaction *transaction) = 0;
virtual void ScanKey(const Tuple &key, std::vector<RID> *result, Transaction *transaction) = 0;
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. ๐๐๐
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google โค๏ธ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.