thulab / tsfile

THIS REPO HAS MOVED TO https://github.com/apache/incubator-iotdb. TsFile is a columnar file format designed for time-series data that supports efficient compression and querying. It is easy to integrate TsFile with your IoT big data processing frameworks.

Home Page: https://github.com/apache/incubator-iotdb

Languages: Shell 0.07%, Java 99.93%
Topics: time-series-database, time-series, columnar-storage-format, iot

tsfile's People

Contributors

beyyes, jianmin-wang, jixuan1989, jt2594838, kr11, leirui, little-emotion, liukun4515, liuruiyiyang, mdf369, myxof, qiaojialin, stefaniexin, vivid-kr, wujysh


tsfile's Issues

Write negative array to outputstream

Exception:

09:26:26.295 [main] WARN  c.e.t.t.t.write.InternalRecordWriter - too large actual row group size!:actual:2208,threshold:2000
Exception in thread "main" java.lang.NegativeArraySizeException
	at cn.edu.tsinghua.tsfile.timeseries.write.io.TSFileIOWriter.fillInRowGroup(TSFileIOWriter.java:198)
	at cn.edu.tsinghua.tsfile.timeseries.write.InternalRecordWriter.fillInRowGroupSize(InternalRecordWriter.java:166)
	at cn.edu.tsinghua.tsfile.timeseries.write.InternalRecordWriter.flushRowGroup(InternalRecordWriter.java:151)
	at cn.edu.tsinghua.tsfile.timeseries.write.InternalRecordWriter.checkMemorySize(InternalRecordWriter.java:123)
	at cn.edu.tsinghua.tsfile.timeseries.write.InternalRecordWriter.write(InternalRecordWriter.java:74)
	at cn.edu.tsinghua.tsfile.timeseries.basis.TsFile.writeLine(TsFile.java:110)
	at cn.edu.thu.tsfile.hadoop.TsFileTestHelper.writeTsFile(TsFileTestHelper.java:48)
	at cn.edu.thu.tsfile.hadoop.TsFileTestHelper.main(TsFileTestHelper.java:103)

Code:

    public void fillInRowGroup(long diff) throws IOException {
        if (diff <= Integer.MAX_VALUE)
            out.write(new byte[(int) diff]);
        else
            throw new IOException("write too much blank byte array!array size:" + diff);
    }
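
A minimal sketch of a more defensive version of this method, assuming the intended behavior is to fail fast on a negative padding size (the actual fix applied in the repository may differ). In the log above the row group grew past the threshold, so diff became negative and new byte[(int) diff] threw NegativeArraySizeException:

    public void fillInRowGroup(long diff) throws IOException {
        // diff is the number of blank bytes needed to pad the row group up to
        // the threshold; it is negative when the actual size already exceeds it
        if (diff < 0)
            throw new IOException("negative padding size, actual row group is larger than the threshold: " + diff);
        if (diff <= Integer.MAX_VALUE)
            out.write(new byte[(int) diff]);
        else
            throw new IOException("write too much blank byte array! array size: " + diff);
    }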

modify the method in TsFile to fit spark

the method below in QueryEngine.java returns an empty list now

    public ArrayList<SeriesSchema> getAllSeriesSchema() {
        return recordReader.getAllSeriesSchema();
    }

it's not suitable for Spark usage

Ask a question about TsFile

As an illustrative example, a single wind turbine can generate hundreds of data points every 20 ms for fault detection or prediction through a set of sophisticated operations against time series by data scientists, such as signal decomposition and filtration, segmentation for varied working conditions, pattern matching, frequency domain analysis etc.

Does tsfile provide functions like those in the example, i.e. 'signal decomposition and filtration, segmentation for varied working conditions, pattern matching, frequency domain analysis etc.'? In Get Started I only see write and read functions. What is the difference between this and an ordinary database?

the statistics (min time, max time) at the series level are never used in the read process

in the read process, when constructing the ValueReader, the code is as below:

    public RowGroupReader(RowGroupMetaData rowGroupMetaData, ITsRandomAccessFileReader raf) {
        logger.debug("init a new RowGroupReader..");
        seriesDataTypeMap = new HashMap<>();
        deltaObjectUID = rowGroupMetaData.getDeltaObjectID();
        measurementIds = new ArrayList<>();
        this.totalByteSize = rowGroupMetaData.getTotalByteSize();
        this.raf = raf;

        for (TimeSeriesChunkMetaData tscMetaData : rowGroupMetaData.getTimeSeriesChunkMetaDataList()) {
            if (tscMetaData.getVInTimeSeriesChunkMetaData() != null) {
                measurementIds.add(tscMetaData.getProperties().getMeasurementUID());
                seriesDataTypeMap.put(tscMetaData.getProperties().getMeasurementUID(),
                        tscMetaData.getVInTimeSeriesChunkMetaData().getDataType());

                ValueReader si = new ValueReader(tscMetaData.getProperties().getFileOffset(),
                        tscMetaData.getTotalByteSize(),
                        tscMetaData.getVInTimeSeriesChunkMetaData().getDataType(),
                        tscMetaData.getVInTimeSeriesChunkMetaData().getDigest(), this.raf,
                        tscMetaData.getVInTimeSeriesChunkMetaData().getEnumValues(),
                        tscMetaData.getProperties().getCompression(), tscMetaData.getNumRows());
                valueReaders.put(tscMetaData.getProperties().getMeasurementUID(), si);
            }
        }
    }

the min time and max time are in tscMetaData.getTInTimeSeriesChunkMetaData(), but they are not used now, so we need to add support for them, for example as sketched below
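
A hedged sketch of how the time statistics could be used when building the readers; getTInTimeSeriesChunkMetaData() is taken from the issue text, while the getStartTime()/getEndTime() accessors and the setTimeRange(...) method on ValueReader are assumptions for illustration only:

    // inside the loop over TimeSeriesChunkMetaData, after constructing "si"
    TInTimeSeriesChunkMetaData tMeta = tscMetaData.getTInTimeSeriesChunkMetaData();
    if (tMeta != null) {
        long minTime = tMeta.getStartTime(); // hypothetical accessor
        long maxTime = tMeta.getEndTime();   // hypothetical accessor
        // a query whose time filter lies entirely outside [minTime, maxTime]
        // could then skip reading this series chunk altogether
        si.setTimeRange(minTime, maxTime);   // hypothetical method on ValueReader
    }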

remove the pagewrite interface

TsFileWriter -> RowGroupWriter -> SeriesWriter -> (PageWriter and ValueWriter)
We can remove the PageWriter to simplify the write process.

Refactor the aggregation interfaces and methods in TsFile

It is foreseeable that IoTDB will support more and more aggregations, so the pre-aggregated information in PageHeader will keep growing. To keep a unified external interface and preserve compatibility as aggregations are added, the way aggregation information is retrieved from PageHeader needs to be refactored.
PageHeader needs the following interface:
Object getAggregation(String aggrName);
Example call: int count = (int) pageHeader.getAggregation(Aggregation.COUNT);
The aggregation names should stay consistent with those in IoTDB; if the requested aggregation information does not exist, simply return null.
Please handle this issue together with issue #106 if possible.
@liukun4515 @MyXOF @kr11

Discussion: as more aggregation information is stored, write speed will inevitably suffer. Can this problem be solved, and how?

An ideal solution is to store the aggregation information as an independent data block and keep a reference to that block in PageHeader. Nothing is computed at write time; at query time the aggregation block is read, and if it has already been computed it can be used directly, otherwise the aggregation is computed and the block is filled in. This way, only aggregation information that is actually needed gets computed, and it is computed only once.
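
A minimal sketch of the proposed accessor, assuming the pre-aggregated values are kept in a simple name-to-value map inside PageHeader; the map field and the Aggregation constant class are illustrative, not the existing implementation:

    import java.util.HashMap;
    import java.util.Map;

    public class PageHeader {
        // aggregation name (kept consistent with IoTDB, e.g. Aggregation.COUNT) -> value
        private final Map<String, Object> aggregations = new HashMap<>();

        // returns the stored aggregation value, or null if it does not exist
        public Object getAggregation(String aggrName) {
            return aggregations.get(aggrName);
        }
    }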

Unused metadata info

In the metadata class TSFileMetaData, the List<TimeSeriesMetadata> timeSeriesList is not used by any other external class. I think it should be removed.

datatype and encoding should not be case sensitive

I should be able to type int32 or INT32 and rle or RLE, which will make developers feel more comfortable. As the console transcript below shows, only the exact uppercase forms are accepted today.

IoTDB> create timeseries root.vehicle.d0.s1
statement error: Statement format is not right:parsing error,statement: create timeseries root.vehicle.d0.s1 .message:line 1:36 mismatched input '' expecting WITH near 's1'. Please refer to SQL document and check if there is any keyword conflict.
IoTDB> create timeseries root.vehicle.d0.s1 with datatype=int32 encoding=rle
statement error: Statement format is not right:parsing error,statement: create timeseries root.vehicle.d0.s1 with datatype=int32 encoding=rle .message:line 1:57 missing , at 'encoding' near ''. Please refer to SQL document and check if there is any keyword conflict.
IoTDB> create timeseries root.vehicle.d0.s1 with datatype=int32,encoding=rle
statement error: data type int32 not support
IoTDB> create timeseries root.vehicle.d0.s1 with datatype=INT32,encoding=rle
statement error: encoding rle is not support
IoTDB> create timeseries root.vehicle.d0.s1 with datatype=INT32,encoding=RLE
execute successfully.
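
A hedged sketch of case-insensitive resolution in the statement parser; TSDataType and TSEncoding are the enum names used by the project, but the helper methods themselves are illustrative, not the actual IoTDB parsing code:

    static TSDataType parseDataType(String token) {
        // "int32" and "INT32" both resolve to the INT32 constant
        return TSDataType.valueOf(token.trim().toUpperCase());
    }

    static TSEncoding parseEncoding(String token) {
        // "rle" and "RLE" both resolve to the RLE constant
        return TSEncoding.valueOf(token.trim().toUpperCase());
    }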

add document for thrift

We should add a document that helps developers install Thrift.
The supported OSes should include Linux, Windows, and macOS.

Memory size is twice as much as block_size

block_size is the threshold for flushing to disk and releasing resources.
We write one device and one sensor using Plain encoding, setting block_size to 32 MB and page_size to 4 KB.
Although -Xmx is set to 50 MB, it throws a java heap space error.

We located the cause in the class ByteArrayOutputStream.
If its capacity is too small to hold all elements, ByteArrayOutputStream doubles its capacity.
The initial capacity is 32.
Since page_size is set to 4096, the sizes of time_stream and value_stream are both 2056, which is slightly larger than 2048.
Therefore, the memory occupied by each of them is enlarged to 4096, almost twice as much as 2056.

PR #12 fixes this bug.
Now, when the data in memory reaches page_size, ValueWriter writes the two output streams into a new output stream whose totalSize has been specified in advance.
After the fix, the heap profile shows a base memory size of 5 MB and a total size of about 40 MB.
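
A small self-contained sketch of the growth behavior described above and of the pre-sized stream used by the fix; the sizes are the ones from this issue, and the copy step is illustrative rather than the exact code of PR #12:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    public class GrowthDemo {
        public static void main(String[] args) throws IOException {
            // the default stream starts at capacity 32 and doubles on overflow
            // (32 -> 64 -> ... -> 2048 -> 4096), so 2056 bytes of data end up
            // in a 4096-byte buffer, roughly twice the data size
            ByteArrayOutputStream grownStream = new ByteArrayOutputStream();
            for (int i = 0; i < 2056; i++) {
                grownStream.write(0);
            }

            // the fix, sketched: pre-size one output stream to the known total
            // of the time and value streams so no capacity is wasted
            ByteArrayOutputStream timeStream = new ByteArrayOutputStream(2056);
            ByteArrayOutputStream valueStream = new ByteArrayOutputStream(2056);
            ByteArrayOutputStream pageStream = new ByteArrayOutputStream(2056 + 2056);
            timeStream.writeTo(pageStream);
            valueStream.writeTo(pageStream);
        }
    }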

add simplified write api

Provide a new write API write(deviceId, time, value), meaning the device has only one default sensor, so users do not need to construct a two-layer schema (device + sensor).
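
A hedged sketch of what such a convenience method could look like on the writer, assuming the existing TSRecord/DoubleDataPoint record classes; the default sensor name "default", the double-typed overload, and the delegation target are assumptions, not a confirmed design:

    // A sketch only: a single implicit sensor hides the device + sensor
    // two-layer schema from the caller.
    public void write(String deviceId, long time, double value)
            throws IOException, WriteProcessException {
        TSRecord record = new TSRecord(time, deviceId);
        record.addTuple(new DoubleDataPoint("default", value)); // "default" sensor name is an assumption
        write(record); // delegate to the existing record-based write
    }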

change groupid to cn.edu.tsinghua

To deploy the project to the central repository, we requested the groupId cn.edu.tsinghua, so we need to change the groupId and the package names.

Can not find tsfile-format.properties

When using TsFile, it cannot read the properties file:

java.io.FileNotFoundException: src/test/resources/tsfile-format.properties (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at cn.edu.tsinghua.tsfile.common.conf.TSFileDescriptor.loadProps(TSFileDescriptor.java:45)

It should use the getResources method instead.
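
A minimal sketch of loading the file from the classpath instead of a relative file path, assuming the built-in defaults should still apply when the resource is missing; this is illustrative, not the exact change to TSFileDescriptor.loadProps:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Properties;

    public final class ConfigLoader {
        static Properties loadProps() {
            Properties props = new Properties();
            try (InputStream in = ConfigLoader.class.getClassLoader()
                    .getResourceAsStream("tsfile-format.properties")) {
                if (in != null) {
                    props.load(in);
                }
                // if the resource is missing, keep the built-in defaults
            } catch (IOException e) {
                // ignore and keep the built-in defaults
            }
            return props;
        }
    }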

re-modify the method in tsfile to fit spark

The branch 'add_rowGroupReader_support_for_spark' is trying to solve #108.

But the method below in QueryEngine.java, as modified by this branch, still does not meet the requirements of tsfile-spark-connector:

    public ArrayList<SeriesSchema> getAllSeriesSchema() {
        return recordReader.getAllSeriesSchema();
    }

Detail:
For example, if the schema of a tsfile is predefined as 's1 s2 s3 s4 s5' and the actual data in the file is a delta_object 'root.car.d1' with only 's1 s2 s3', then the getAllSeriesSchema method in branch 'add_rowGroupReader_support_for_spark' would return 's1 s2 s3' with the help of rowGroupReader,
while tsfile-spark-connector simply expects the predefined complete schema 's1 s2 s3 s4 s5', without needing to resort to rowGroupReader.
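
A hedged sketch of the behavior tsfile-spark-connector expects: build the schema from the file-level metadata (the predefined s1 s2 s3 s4 s5) instead of from rowGroupReader. The getTimeSeriesList() accessor matches the field mentioned in the 'Unused metadata info' issue above, but the TimeSeriesMetadata accessors and the SeriesSchema constructor are assumptions for illustration:

    public ArrayList<SeriesSchema> getAllSeriesSchema() {
        ArrayList<SeriesSchema> schemas = new ArrayList<>();
        // fileMetaData is the file-level TSFileMetaData already held by the reader
        for (TimeSeriesMetadata ts : fileMetaData.getTimeSeriesList()) {
            schemas.add(new SeriesSchema(ts.getMeasurementUID(), ts.getType())); // hypothetical accessors/constructor
        }
        return schemas;
    }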

refine reader of tsfile

Too ugly.
If the reader is not shared by multiple threads, we SHOULD NOT initialize all the readers of a file at the beginning, which is very inefficient.

TsFile can not close

I created a new TsFile to read data, but it cannot be closed.

    LocalFileInput input = new LocalFileInput(sourcePath);
    TsFile tsFile = new TsFile(input);
    tsFile.close();
