Apache™ Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
The MATLAB interface for Apache Parquet supports reading and writing Apache Parquet files from within MATLAB. Functionality includes:
- Read and write of local Parquet files
- Access to the metadata of a Parquet file
- A MATLAB Datastore for reading Parquet files
For newer MATLAB releases, starting with R2019a, consider using the Parquet support that ships with MATLAB; see https://www.mathworks.com/help/releases/R2019b/matlab/parquet-files.html.
MathWorks Products (http://www.mathworks.com)
- Requires MATLAB release R2017b or newer
To build the JAR file, make sure the following products are already installed (or download and install them from the provided links):
Download & unzip binaries from Apache Hadoop official website to a local folder.
On Windows, a compatible utility called winutils.exe can be downloaded from https://github.com/steveloughran/winutils/raw/master/hadoop-2.8.3/bin/winutils.exe
After downloading, we recommend placing the executable at <repo_root>\Software\MATLAB\lib\hadoop\bin\winutils.exe
Note that you must first create the lib\hadoop\bin folders manually.
More detailed information on Windows install can be found here.
Installing the interface requires building the support package (JAR file) and setting the HADOOP_HOME environment variable. Before proceeding:
- Install the Java SDK and Maven.
- Clone the repository, or download and extract the latest source release.
- Create or set the HADOOP_HOME environment variable to point to the local Apache™ Hadoop® installation folder (Linux/macOS) or to the folder where the winutils.exe executable is located (Windows), as described in the winutils.exe instructions in this document.
The links to download these products are provided in the section 3rd party products.
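Before building, it can help to confirm that the prerequisites listed above are on the PATH. The following is a minimal sketch for Linux/macOS shells, not part of this package: `check_tool` is a hypothetical helper, and the sketch assumes the Java SDK and Maven are invoked as `javac` and `mvn`.

```shell
# check_tool is a hypothetical helper, not part of this package.
# It reports whether a command can be found on the current PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

check_tool javac   # Java compiler from the Java SDK
check_tool mvn     # Maven launcher
```

If either tool is reported as missing, install it or adjust the PATH before running the build.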
To set the environment variable, follow the conventions for your operating system. Note that this environment variable must be set before starting MATLAB. Changing it from within MATLAB will not have the desired effect.
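As an illustration for Linux/macOS, the variable can be exported in the shell from which MATLAB is later launched. This is a sketch only: `/opt/hadoop` is a hypothetical location, so substitute the folder where Hadoop was actually unpacked.

```shell
# /opt/hadoop is a hypothetical path; replace it with your Hadoop folder.
export HADOOP_HOME="/opt/hadoop"

# Confirm that child processes inherit the value, as a MATLAB session
# started from this shell would.
sh -c 'echo "HADOOP_HOME=$HADOOP_HOME"'
```

An export like this only affects the current shell session and processes started from it. To make it persistent, the same line is commonly added to a shell startup file such as `~/.bashrc`.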
To install the interface, you must first build the Jar file.
cd <this_repo>
cd Software/Java
mvn clean package
Now you can open MATLAB and install the support package.
cd <this_repo>/Software
install
Restart MATLAB and verify the installation:
Windows
parquetwin('verify')
In case of issues, please refer to the following documentation. Otherwise, you're good to go.
Linux
parquettools('meta')
To write a variable to a Parquet file:
A = magic(5);
parquetwrite('m5.parquet', A);
and you can read the same file with
B = parquetread('m5.parquet');
A few unit tests can be run with
results = runParquetTests()
For more details, look at the Basic Usage document.
See documentation for more information.
The license for the MATLAB interface for Parquet is available in the LICENSE.md file in this GitHub repository. This package uses certain third-party content that is licensed under separate license agreements. See the pom.xml file for third-party software downloaded at build time.
Provide suggestions for additional features or capabilities using the following link:
https://www.mathworks.com/products/reference-architectures/request-new-reference-architectures.html