thulab / tsfile

THIS REPO HAS MOVED TO https://github.com/apache/incubator-iotdb. TsFile is a columnar file format designed for time-series data that supports efficient compression and querying. It is easy to integrate TsFile with your IoT big data processing frameworks.

Home Page: https://github.com/apache/incubator-iotdb

Languages: Shell 0.07%, Java 99.93%
Topics: time-series-database, time-series, columnar-storage-format, iot

tsfile's People

Contributors

beyyes, jianmin-wang, jixuan1989, jt2594838, kr11, leirui, little-emotion, liukun4515, liuruiyiyang, mdf369, myxof, qiaojialin, stefaniexin, vivid-kr, wujysh


tsfile's Issues

Write negative array to outputstream

Exception:

09:26:26.295 [main] WARN  c.e.t.t.t.write.InternalRecordWriter - too large actual row group size!:actual:2208,threshold:2000
Exception in thread "main" java.lang.NegativeArraySizeException
	at cn.edu.tsinghua.tsfile.timeseries.write.io.TSFileIOWriter.fillInRowGroup(TSFileIOWriter.java:198)
	at cn.edu.tsinghua.tsfile.timeseries.write.InternalRecordWriter.fillInRowGroupSize(InternalRecordWriter.java:166)
	at cn.edu.tsinghua.tsfile.timeseries.write.InternalRecordWriter.flushRowGroup(InternalRecordWriter.java:151)
	at cn.edu.tsinghua.tsfile.timeseries.write.InternalRecordWriter.checkMemorySize(InternalRecordWriter.java:123)
	at cn.edu.tsinghua.tsfile.timeseries.write.InternalRecordWriter.write(InternalRecordWriter.java:74)
	at cn.edu.tsinghua.tsfile.timeseries.basis.TsFile.writeLine(TsFile.java:110)
	at cn.edu.thu.tsfile.hadoop.TsFileTestHelper.writeTsFile(TsFileTestHelper.java:48)
	at cn.edu.thu.tsfile.hadoop.TsFileTestHelper.main(TsFileTestHelper.java:103)

Code:

    public void fillInRowGroup(long diff) throws IOException {
        if (diff <= Integer.MAX_VALUE)
            out.write(new byte[(int) diff]);
        else
            throw new IOException("write too much blank byte array!array size:" + diff);
    }
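
A minimal sketch of a more defensive version of this method, assuming the intended behavior is to fail fast on a negative padding size (the actual fix applied in the repository may differ). In the log above the row group grew past the threshold, so diff became negative and new byte[(int) diff] threw NegativeArraySizeException:

    public void fillInRowGroup(long diff) throws IOException {
        // diff is the number of blank bytes needed to pad the row group up to
        // the threshold; it is negative when the actual size already exceeds it
        if (diff < 0)
            throw new IOException("negative padding size, actual row group is larger than the threshold: " + diff);
        if (diff <= Integer.MAX_VALUE)
            out.write(new byte[(int) diff]);
        else
            throw new IOException("write too much blank byte array! array size: " + diff);
    }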

modify the method in TsFile to fit spark

the method below in QueryEngine.java returns an empty list now

    public ArrayList<SeriesSchema> getAllSeriesSchema() {
        return recordReader.getAllSeriesSchema();
    }

it's not suitable for Spark usage

Ask a question about TsFile

As an illustrative example, a single wind turbine can generate hundreds of data points every 20 ms for fault detection or prediction through a set of sophisticated operations against time series by data scientists, such as signal decomposition and filtration, segmentation for varied working conditions, pattern matching, frequency domain analysis etc.

Does tsfile provide functions like those in the example, i.e. 'signal decomposition and filtration, segmentation for varied working conditions, pattern matching, frequency domain analysis etc.'? In Get Started I only see write and read functions. What is the difference between this and an ordinary database?

the statistics (min time, max time) at the series level are never used in the read process

in the read process, when constructing the ValueReader, the code is as below:

    public RowGroupReader(RowGroupMetaData rowGroupMetaData, ITsRandomAccessFileReader raf) {
        logger.debug("init a new RowGroupReader..");
        seriesDataTypeMap = new HashMap<>();
        deltaObjectUID = rowGroupMetaData.getDeltaObjectID();
        measurementIds = new ArrayList<>();
        this.totalByteSize = rowGroupMetaData.getTotalByteSize();
        this.raf = raf;

        for (TimeSeriesChunkMetaData tscMetaData : rowGroupMetaData.getTimeSeriesChunkMetaDataList()) {
            if (tscMetaData.getVInTimeSeriesChunkMetaData() != null) {
                measurementIds.add(tscMetaData.getProperties().getMeasurementUID());
                seriesDataTypeMap.put(tscMetaData.getProperties().getMeasurementUID(),
                        tscMetaData.getVInTimeSeriesChunkMetaData().getDataType());

                ValueReader si = new ValueReader(tscMetaData.getProperties().getFileOffset(),
                        tscMetaData.getTotalByteSize(),
                        tscMetaData.getVInTimeSeriesChunkMetaData().getDataType(),
                        tscMetaData.getVInTimeSeriesChunkMetaData().getDigest(), this.raf,
                        tscMetaData.getVInTimeSeriesChunkMetaData().getEnumValues(),
                        tscMetaData.getProperties().getCompression(), tscMetaData.getNumRows());
                valueReaders.put(tscMetaData.getProperties().getMeasurementUID(), si);
            }
        }
    }

the min time and max time are in tscMetaData.getTInTimeSeriesChunkMetaData(), but they are not used now, so we need to add support for them, for example as sketched below
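
A hedged sketch of how the time statistics could be used when building the readers; getTInTimeSeriesChunkMetaData() is taken from the issue text, while the getStartTime()/getEndTime() accessors and the setTimeRange(...) method on ValueReader are assumptions for illustration only:

    // inside the loop over TimeSeriesChunkMetaData, after constructing "si"
    TInTimeSeriesChunkMetaData tMeta = tscMetaData.getTInTimeSeriesChunkMetaData();
    if (tMeta != null) {
        long minTime = tMeta.getStartTime(); // hypothetical accessor
        long maxTime = tMeta.getEndTime();   // hypothetical accessor
        // a query whose time filter lies entirely outside [minTime, maxTime]
        // could then skip reading this series chunk altogether
        si.setTimeRange(minTime, maxTime);   // hypothetical method on ValueReader
    }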

remove the pagewrite interface

TsFileWriter -> RowGroupWriter -> SeriesWriter -> (PageWriter and ValueWriter)
We can remove the PageWriter to simplify the write process.

Refactor the aggregation interfaces and methods in TsFile

It is foreseeable that IoTDB will support more and more aggregations, so the pre-aggregated information in PageHeader will keep growing. To keep a unified external interface and preserve compatibility as aggregations are added, the way aggregation information is retrieved from PageHeader needs to be refactored.
PageHeader needs the following interface:
Object getAggregation(String aggrName);
Example call: int count = (int) pageHeader.getAggregation(Aggregation.COUNT);
The aggregation names should stay consistent with those in IoTDB; if the requested aggregation information does not exist, simply return null.
Please handle this issue together with issue #106 if possible.
@liukun4515 @MyXOF @kr11

Discussion: as more aggregation information is stored, write speed will inevitably suffer. Can this problem be solved, and how?

An ideal solution is to store the aggregation information as an independent data block and keep a reference to that block in PageHeader. Nothing is computed at write time; at query time the aggregation block is read, and if it has already been computed it can be used directly, otherwise the aggregation is computed and the block is filled in. This way, only aggregation information that is actually needed gets computed, and it is computed only once.
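
A minimal sketch of the proposed accessor, assuming the pre-aggregated values are kept in a simple name-to-value map inside PageHeader; the map field and the Aggregation constant class are illustrative, not the existing implementation:

    import java.util.HashMap;
    import java.util.Map;

    public class PageHeader {
        // aggregation name (kept consistent with IoTDB, e.g. Aggregation.COUNT) -> value
        private final Map<String, Object> aggregations = new HashMap<>();

        // returns the stored aggregation value, or null if it does not exist
        public Object getAggregation(String aggrName) {
            return aggregations.get(aggrName);
        }
    }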

Unused metadata info

In the metadata class TSFileMetaData, the List<TimeSeriesMetadata> timeSeriesList is not used by any other external class. I think it should be removed.

datatype and encoding should not be case sensitive

I should be able to type int32 or INT32 and rle or RLE, which will make developers feel more comfortable. As the console transcript below shows, only the exact uppercase forms are accepted today.

IoTDB> create timeseries root.vehicle.d0.s1
statement error: Statement format is not right:parsing error,statement: create timeseries root.vehicle.d0.s1 .message:line 1:36 mismatched input '' expecting WITH near 's1'. Please refer to SQL document and check if there is any keyword conflict.
IoTDB> create timeseries root.vehicle.d0.s1 with datatype=int32 encoding=rle
statement error: Statement format is not right:parsing error,statement: create timeseries root.vehicle.d0.s1 with datatype=int32 encoding=rle .message:line 1:57 missing , at 'encoding' near ''. Please refer to SQL document and check if there is any keyword conflict.
IoTDB> create timeseries root.vehicle.d0.s1 with datatype=int32,encoding=rle
statement error: data type int32 not support
IoTDB> create timeseries root.vehicle.d0.s1 with datatype=INT32,encoding=rle
statement error: encoding rle is not support
IoTDB> create timeseries root.vehicle.d0.s1 with datatype=INT32,encoding=RLE
execute successfully.
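
A hedged sketch of case-insensitive resolution in the statement parser; TSDataType and TSEncoding are the enum names used by the project, but the helper methods themselves are illustrative, not the actual IoTDB parsing code:

    static TSDataType parseDataType(String token) {
        // "int32" and "INT32" both resolve to the INT32 constant
        return TSDataType.valueOf(token.trim().toUpperCase());
    }

    static TSEncoding parseEncoding(String token) {
        // "rle" and "RLE" both resolve to the RLE constant
        return TSEncoding.valueOf(token.trim().toUpperCase());
    }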

add document for thrift

We should add a document that helps developers install Thrift.
The supported OSes should include Linux, Windows, and macOS.

Memory size is twice as much as block_size

block_size is the threshold for flushing to disk and releasing resources.
We write one device and one sensor using Plain encoding, setting block_size to 32 MB and page_size to 4 KB.
Although -Xmx is set to 50 MB, it throws a java heap space error.

We located the cause in the class ByteArrayOutputStream.
If its capacity is too small to hold all elements, ByteArrayOutputStream doubles its capacity.
The initial capacity is 32.
Since page_size is set to 4096, the sizes of time_stream and value_stream are both 2056, which is slightly larger than 2048.
Therefore, the memory occupied by each of them is enlarged to 4096, almost twice as much as 2056.

PR #12 fixes this bug.
Now, when the data in memory reaches page_size, ValueWriter writes the two output streams into a new output stream whose totalSize has been specified in advance.
After the fix, the heap profile shows a base memory size of 5 MB and a total size of about 40 MB.
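
A small self-contained sketch of the growth behavior described above and of the pre-sized stream used by the fix; the sizes are the ones from this issue, and the copy step is illustrative rather than the exact code of PR #12:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    public class GrowthDemo {
        public static void main(String[] args) throws IOException {
            // the default stream starts at capacity 32 and doubles on overflow
            // (32 -> 64 -> ... -> 2048 -> 4096), so 2056 bytes of data end up
            // in a 4096-byte buffer, roughly twice the data size
            ByteArrayOutputStream grownStream = new ByteArrayOutputStream();
            for (int i = 0; i < 2056; i++) {
                grownStream.write(0);
            }

            // the fix, sketched: pre-size one output stream to the known total
            // of the time and value streams so no capacity is wasted
            ByteArrayOutputStream timeStream = new ByteArrayOutputStream(2056);
            ByteArrayOutputStream valueStream = new ByteArrayOutputStream(2056);
            ByteArrayOutputStream pageStream = new ByteArrayOutputStream(2056 + 2056);
            timeStream.writeTo(pageStream);
            valueStream.writeTo(pageStream);
        }
    }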

add simplified write api

Provide a new write API write(deviceId, time, value), meaning the device has only one default sensor, so users do not need to construct a two-layer schema (device + sensor).
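
A hedged sketch of what such a convenience method could look like on the writer, assuming the existing TSRecord/DoubleDataPoint record classes; the default sensor name "default", the double-typed overload, and the delegation target are assumptions, not a confirmed design:

    // A sketch only: a single implicit sensor hides the device + sensor
    // two-layer schema from the caller.
    public void write(String deviceId, long time, double value)
            throws IOException, WriteProcessException {
        TSRecord record = new TSRecord(time, deviceId);
        record.addTuple(new DoubleDataPoint("default", value)); // "default" sensor name is an assumption
        write(record); // delegate to the existing record-based write
    }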

change groupid to cn.edu.tsinghua

To deploy the project to the central repository, we requested the groupId cn.edu.tsinghua, so we need to change the groupId and the package names.

Can not find tsfile-format.properties

When using TsFile, it cannot read the properties file:

java.io.FileNotFoundException: src/test/resources/tsfile-format.properties (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at cn.edu.tsinghua.tsfile.common.conf.TSFileDescriptor.loadProps(TSFileDescriptor.java:45)

It should use the getResources method instead.
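
A minimal sketch of loading the file from the classpath instead of a relative file path, assuming the built-in defaults should still apply when the resource is missing; this is illustrative, not the exact change to TSFileDescriptor.loadProps:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Properties;

    public final class ConfigLoader {
        static Properties loadProps() {
            Properties props = new Properties();
            try (InputStream in = ConfigLoader.class.getClassLoader()
                    .getResourceAsStream("tsfile-format.properties")) {
                if (in != null) {
                    props.load(in);
                }
                // if the resource is missing, keep the built-in defaults
            } catch (IOException e) {
                // ignore and keep the built-in defaults
            }
            return props;
        }
    }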

re-modify the method in tsfile to fit spark

The branch 'add_rowGroupReader_support_for_spark' is trying to solve #108.

But the method below in QueryEngine.java, as modified by this branch, still does not meet the requirements of tsfile-spark-connector:

    public ArrayList<SeriesSchema> getAllSeriesSchema() {
        return recordReader.getAllSeriesSchema();
    }

Detail:
For example, if the schema of a tsfile is predefined as 's1 s2 s3 s4 s5' and the actual data in the file is a delta_object 'root.car.d1' with only 's1 s2 s3', then the getAllSeriesSchema method in branch 'add_rowGroupReader_support_for_spark' would return 's1 s2 s3' with the help of rowGroupReader,
while tsfile-spark-connector simply expects the predefined complete schema 's1 s2 s3 s4 s5', without needing to resort to rowGroupReader.
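
A hedged sketch of the behavior tsfile-spark-connector expects: build the schema from the file-level metadata (the predefined s1 s2 s3 s4 s5) instead of from rowGroupReader. The getTimeSeriesList() accessor matches the field mentioned in the 'Unused metadata info' issue above, but the TimeSeriesMetadata accessors and the SeriesSchema constructor are assumptions for illustration:

    public ArrayList<SeriesSchema> getAllSeriesSchema() {
        ArrayList<SeriesSchema> schemas = new ArrayList<>();
        // fileMetaData is the file-level TSFileMetaData already held by the reader
        for (TimeSeriesMetadata ts : fileMetaData.getTimeSeriesList()) {
            schemas.add(new SeriesSchema(ts.getMeasurementUID(), ts.getType())); // hypothetical accessors/constructor
        }
        return schemas;
    }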

refine reader of tsfile

Too ugly.
If the reader is not shared by multiple threads, we SHOULD NOT initialize all the readers of a file at the beginning, which is very inefficient.

TsFile can not close

I created a new TsFile to read data, but it cannot be closed.

    LocalFileInput input = new LocalFileInput(sourcePath);
    TsFile tsFile = new TsFile(input);
    tsFile.close();
