enterprise-content-management / infoarchive-sip-sdk

Software Development Kit (SDK) for OpenText InfoArchive, facilitating the assembly and ingestion of Submission Information Packages (SIPs).

Home Page: http://documentum.opentext.com/infoarchive/

License: Mozilla Public License 2.0

Java 98.57% XProc 0.25% Groovy 1.18%
infoarchive sip

infoarchive-sip-sdk's Introduction

OpenText InfoArchive SDK

The OpenText InfoArchive SDK is a Java library that makes it quick and easy to create SIPs, regardless of what type of data they contain or where that data originates. A SIP (Submission Information Package) is a package consisting of packaging information, metadata (structured data in the form of XML), and optionally a collection of unstructured data files.

The IA SDK aims to simplify creating SIPs by allowing a developer to dynamically assemble both the XML file containing the structured data and the SIP itself. It's especially easy to create SIPs from any collection or stream of Plain Old Java Objects, regardless of whether they represent files, SQL query result sets, emails, tweets, etc.

You can also use the SDK to ingest SIPs into InfoArchive and even to configure InfoArchive. For this functionality you must have access to a running InfoArchive server. The SDK supports version 4.0 of InfoArchive and newer.

Overview

The SDK consists of the following jars:

All jars can be found in the Central Repository. The easiest way to get them is through a dependency management system like Gradle or Maven. For the latest version, see the maven-central badge at the top of this page.

Gradle

dependencies {
  implementation 'com.opentext.ia:infoarchive-sdk-core:12.8.2'
}

Maven

<dependencies>
  <dependency>
    <groupId>com.opentext.ia</groupId>
    <artifactId>infoarchive-sdk-core</artifactId>
    <version>12.8.2</version>
  </dependency>
</dependencies>

Versioning

The InfoArchive SDK uses semantic versioning, which means that backwards incompatible changes will only occur in major versions. These breaking changes are documented on the wiki.

An overview of changes since version 6.1.0 can be found in the change log.

Usage

For an introduction to the SDK and some lab exercises, see the related lab project. For examples on how to use the SDK, see the sample programs.

Contributing

See the CONTRIBUTING file on how to get started.

infoarchive-sip-sdk's People

Contributors

blauwefant, conradopoole, dsmithj0, gentlewind, igornikiforov, joanneshen, kevinhu168, kovaloid, raysinnema, sgijsenot, torsv454, voseldop

infoarchive-sip-sdk's Issues

ArchiveClients doesn't work correctly with many Applications

We get the error "Application " + applicationName + " not found." from the getApplication method of ArchiveClients when there are many applications and the REST response is paged. The method iterates over only the first page:
InfoArchiveLinkRelations.LINK_APPLICATIONS is http://localhost:8765/systemdata/tenants/7be2689a-0d3-4e89-9bc7-0de0d0c82389/applications, but the restClient.follow method works with http://localhost:8765/systemdata/tenants/7be2689a-0d83-4e89-9bc7-0de0d0c82389/applications?page=0&size=10

private static Application getApplication(RestClient restClient, Tenant tenant, String applicationName)
    throws IOException {
  Applications applications = restClient.follow(tenant, InfoArchiveLinkRelations.LINK_APPLICATIONS,
      Applications.class);
  return Objects.requireNonNull(applications.byName(applicationName),
      "Application " + applicationName + " not found.");
}
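
One way to handle a paged response is to keep requesting pages until the name is found or the last page has been read. The sketch below only illustrates that loop in isolation. The Page class and the fetchPage function are hypothetical stand-ins and not SDK types; real code would presumably follow the next-page link of the REST response through the SDK's RestClient.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.IntFunction;

public class PagedLookup {

  /** Hypothetical stand-in for one page of a paged REST collection. */
  public static final class Page {
    final List<String> names;
    final boolean hasNext;

    public Page(List<String> names, boolean hasNext) {
      this.names = names;
      this.hasNext = hasNext;
    }
  }

  /** Scans pages in order until the name is found or the last page is reached. */
  public static Optional<String> findByName(IntFunction<Page> fetchPage, String name) {
    int pageNumber = 0;
    while (true) {
      Page page = fetchPage.apply(pageNumber);
      if (page.names.contains(name)) {
        return Optional.of(name);
      }
      if (!page.hasNext) {
        return Optional.empty();
      }
      pageNumber++;
    }
  }

  public static void main(String[] args) {
    // Two pages of made-up application names; "app-12" is only on the second page.
    List<Page> pages = Arrays.asList(
        new Page(Arrays.asList("app-01", "app-02"), true),
        new Page(Arrays.asList("app-11", "app-12"), false));
    System.out.println(findByName(pages::get, "app-12").isPresent()); // true
    System.out.println(findByName(pages::get, "missing").isPresent()); // false
  }
}
```

A lookup that stops after the first page would report "app-12" as not found, which matches the behavior described above.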

Expand ingestion sample with content and search

Hi @RemonSinnema ,
Thank you for the ingest sample code - samples/sample3.

Looking at the sample, I see that the SIP contains only eas_sip.xml and eas_pdi.xml and no actual content file. I am sure you would agree that on a typical ECM platform, SIPs carry actual content files and an associated search configuration.

Hence, I modified samples/sample3 to try to ingest a SIP with some content. To do this, I changed pdi.xml and pdi-schema.xsd and created a new SIP.

While I have been able to ingest a SIP with content, I have not been able to search for and retrieve the ingested SIP via the IA UI (I do not get any search results; please see attached).

Would you be kind enough to update the samples/sample3 code to include the following:

  1. Ingest a SIP with some content
  2. Enable search & retrieve of the ingested SIP content through IA UI.

Just adding the pdi-schema.xsd for your reference.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" targetNamespace="urn:emc:ia:schema:sample:animal:1.0" xmlns:ns1="urn:emc:ia:schema:sample:animal:1.0">
    <xs:element name="animals">
        <xs:complexType>
            <xs:sequence>
                <xs:element maxOccurs="unbounded" ref="ns1:animal"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    <xs:element name="animal">
        <xs:complexType>
            <xs:sequence>
                <xs:element ref="ns1:AnimalName"/>
                <xs:element ref="ns1:FileName"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    <xs:element name="AnimalName" type="xs:string"/>
    <xs:element name="FileName" type="xs:string"/>
</xs:schema>

Thank You very much @RemonSinnema .
Aswini

Support other stores than local File Store (e.g. ECS / S3 / CAS)

Hi @RemonSinnema ,
I see that the SIP SDK configuration currently creates only a local folder store, "filestore_01", as shown at the link below:

https://github.com/Enterprise-Content-Management/infoarchive-sip-sdk/blob/master/core/src/main/java/com/emc/ia/sdk/configurer/PropertyBasedConfigurer.java#L50

Since ECS, S3, and CAS are available out of the box as file store options, could you please include the required configuration for each of them?

Thank You.

Heap space issue with BatchSipAssembler

Hello All,

I am reading a CSV file that has one million records, each with 500 columns. I am able to read the records and create a list of POJOs, after which I try to create SIP packages with XmlPdiAssembler and BatchSipAssembler, with the batch limit set to 1 million. I get the error below while creating a SIP file.
Exception in thread "main" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.io.PrintWriter.flush(PrintWriter.java:320)
at com.opentext.ia.sdk.support.xml.PrintingXmlBuilder.build(PrintingXmlBuilder.java:149)
at com.opentext.ia.sdk.support.xml.PrintingXmlBuilder.build(PrintingXmlBuilder.java:15)
at com.opentext.ia.sdk.sip.XmlPdiAssembler.add(XmlPdiAssembler.java:122)
at com.opentext.ia.sdk.sip.PdiAssembler.add(PdiAssembler.java:27)
at com.opentext.ia.sdk.sip.PdiAssembler.add(PdiAssembler.java:15)
at com.opentext.ia.sdk.sip.PrintWriterAssembler.add(PrintWriterAssembler.java:77)
at com.opentext.ia.sdk.sip.SipAssembler.add(SipAssembler.java:317)
at com.opentext.ia.sdk.sip.BatchSipAssembler.add(BatchSipAssembler.java:85)

When I reduce the batch limit to 50000, I am able to create SIP packages, but 50000 is too low a count. As I have large files to process, I need a limit of 1 million, or at least half a million.
Below are the versions of the library and language I am using.
Java: 1.8, 64-bit
IA SIP SDK: 11.1.0
Thanks in advance.

regards,
Araamuthan

CDATA element creation through xmlbuilder

We have a text field that contains many special characters, so we would like to place that string inside a CDATA section so that it won't get validated against the schema. However, I couldn't find an option to build a CDATA section through XmlBuilder.
Please provide an example of creating a CDATA section through XmlBuilder.

regards,
Araamuthan
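
For reference, the sketch below writes a CDATA section with the JDK's own StAX API (javax.xml.stream), not with the SDK's XmlBuilder; whether XmlBuilder offers an equivalent is not known here. The class and method names are made up for illustration. Note that CDATA only changes how characters are escaped in the document: a schema validator still sees the content as ordinary character data.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class CDataExample {

  /** Returns an element whose text content is wrapped in a CDATA section. */
  public static String elementWithCData(String elementName, String text) {
    try {
      StringWriter out = new StringWriter();
      XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
      xml.writeStartElement(elementName);
      // A literal "]]>" would end the section early, so split it over two sections.
      xml.writeCData(text.replace("]]>", "]]]]><![CDATA[>"));
      xml.writeEndElement();
      xml.flush();
      xml.close();
      return out.toString();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Failed to write XML", e);
    }
  }

  public static void main(String[] args) {
    // Special characters such as < and & are kept as-is inside the section.
    System.out.println(elementWithCData("note", "a < b & c"));
  }
}
```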

Support SIP encryption

Hi @RemonSinnema ,

Hope you are doing well, thank you for your continued efforts on this forum.

We are using SIP SDK to ingest SIPs with content, we will be contributing the working code very shortly.

We would like to apply SIP encryption (Bouncy Castle encryption) to the SIPs ingested into IA. Would you be able to update the SIP SDK to include encryption?

Thank You very much.

Still using emc?

In current master com.opentext.ia.sdk.sip.InfoArchivePackagingInformationAssembler I saw this line:
.namespace("urn:x-emc:ia:schema:sip:1.0")

Is this correct, or would you want to change emc into opentext?
Or would that actually break old connections which require emc?
Just a thought

SIPAssembler is creating a temp file under temp directory but not removing the tmp file after calling sipAssembler.end()

Hi Team,

I am using SipAssembler with FileBuffer to create SIP packages for CSV files. I use this approach because we have very large CSV files to process, and it handles them.
However, SipAssembler creates a temp file in the Windows/Linux temp directory and does not remove it after the SIP has been created. As a result, the temp folder runs out of free space.

Below is the code we use to create SIP packages.

public void assembler(final List<CSVData> csvRecords, final PackagingInformationFactory factory,
    final ResponseBean status, List<File> sipFiles, PdiAssembler<CSVData> pdiAssembler, String fileName)
    throws SipCreationException, IOException {
  // Buffer SIP data in files rather than in memory, and ignore content.
  SipAssembler<CSVData> sipAssembler = new SipAssembler<>(factory, pdiAssembler, new NoHashAssembler(),
      new DataBufferSupplier<>(FileBuffer.class), ContentAssembler.ignoreContent());
  File sipFile = new File(CsvSipConstants.OUTPUT_PATH,
      CsvSipConstants.SIP_FILE_PREFIX + fileName + "_" + getRandomNumber() + ".zip");
  sipAssembler.start(new FileBuffer(sipFile));
  for (CSVData record : csvRecords) {
    sipAssembler.add(record);
  }
  sipAssembler.end();
  LOGGER.info("Sip file : " + sipFile);
  sipFiles.add(sipFile);
  LOGGER.info("SIP package created");
}

Could you please help me with this?

Regards,
Amuthan

BatchSipAssemblerWithCallback returns unusable SIP files

When using the BatchSipAssemblerWithCallback class the assembler returns invalid SIP packages. These packages have invalid settings for the seqno and the islast fields. Error code:
Aip reception fails with code ERROR due to: 'Error(s) during Aip validation:\naip : Error exceedingSeqNo, Seqno:2 is > to max: 1 \n'. Consult aip logs.

With the current implementation, it isn't possible to use the callback to ingest each SIP package directly and delete the file immediately afterwards, which would reduce memory usage when archiving large amounts of data. Is there another way to do this, or should the implementation of the BatchSipAssemblerWithCallback class be changed to make this possible?

Request for thread-safety

Our client has a huge number of files that need to be archived. It takes a few weeks (!) just to create the SIPs, so the idea is to use parallelism. You can use several nodes (computers/servers), several independent processes, or several threads.
Several nodes is not an option for us. We do use two processes, but we need the memory, so more than two is not feasible. We also run into a configuration, handling, and monitoring nightmare.
That leaves multi-threading as the only option. To make it work, however, all our code and all InfoArchive code must be thread-safe. The request is to make this SDK thread-safe. If one thread reads data and makes decisions based on it while another thread changes that same data, chaos ensues. As an administrator, you will see corrupt SIP/ZIP files that cannot be opened by Windows (they can be opened with 7-Zip, which reveals that the mandatory eas_pdi.xml and eas_sip.xml files are missing). As a result, InfoArchive (not the SDK) cannot "ant receive-ingest" them.

What would be needed is splitting the code into singletons (there is only one instance) and POJOs (there are many, but they can easily be reused, and each thread uses its own).

Example issues are:
BatchSipAssembler.current
DataSubmissionSession.DataSubmissionSessionBuilder and all those fields
PackagingInformation.PackagingInformationBuilder and all those fields
SipAssembler.pdibuffer, sipFileBuffer, pdiHash, metrics
and many more!
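
Until shared state like this is removed, a common workaround for non-thread-safe classes is thread confinement: each task creates and uses its own instance, and nothing stateful is shared between threads. The sketch below shows that pattern with the standard java.util.concurrent API. The Assembler class is a hypothetical stand-in for a stateful assembler, not an SDK type, and whether the SDK's classes can safely be used this way has not been verified here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerTaskAssemblers {

  /** Hypothetical stand-in for a stateful, non-thread-safe assembler. */
  static final class Assembler {
    private final StringBuilder buffer = new StringBuilder();

    void add(String record) {
      buffer.append(record).append('\n');
    }

    String end() {
      return buffer.toString();
    }
  }

  /** Assembles each batch on a pool thread, using one private assembler per batch. */
  public static List<String> assembleAll(List<List<String>> batches, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<String>> tasks = new ArrayList<>();
      for (List<String> batch : batches) {
        tasks.add(() -> {
          Assembler assembler = new Assembler(); // created per task, never shared
          for (String record : batch) {
            assembler.add(record);
          }
          return assembler.end();
        });
      }
      List<String> results = new ArrayList<>();
      // invokeAll returns the futures in the same order as the submitted tasks.
      for (Future<String> future : pool.invokeAll(tasks)) {
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    List<List<String>> batches = Arrays.asList(
        Arrays.asList("a", "b"), Arrays.asList("c"));
    System.out.println(assembleAll(batches, 2).size()); // 2
  }
}
```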

Program hangs

Hi,

For JWT refresh, the DefaultClock class manages a thread that causes the program to hang.
It would be great to add a close() function to ArchiveClient that cancels the active timer.

Regards,
Olivier
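
The requested close() could look roughly like the sketch below. It assumes the refresh is scheduled on a java.util.Timer, which the SDK source does not necessarily do; the RefreshingClient class is hypothetical and only shows the pattern. A Timer created this way runs a non-daemon thread, so the JVM cannot exit until cancel() is called.

```java
import java.util.Timer;
import java.util.TimerTask;

public class RefreshingClient implements AutoCloseable {

  // A non-daemon timer thread keeps the JVM alive until the timer is cancelled.
  private final Timer timer = new Timer("token-refresh");
  private volatile boolean closed;

  public RefreshingClient(long periodMillis) {
    timer.schedule(new TimerTask() {
      @Override
      public void run() {
        // Placeholder for the periodic token refresh.
      }
    }, periodMillis, periodMillis);
  }

  public boolean isClosed() {
    return closed;
  }

  /** Cancels the refresh timer so that its thread can terminate. */
  @Override
  public void close() {
    timer.cancel();
    closed = true;
  }

  public static void main(String[] args) {
    // try-with-resources calls close(), after which the program can exit.
    try (RefreshingClient client = new RefreshingClient(60_000L)) {
      System.out.println("working, closed = " + client.isClosed());
    }
  }
}
```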

Declarative Configuration(SIP Archiving) - NullPointerException during SIP ingestion

ingest applications/Driver --from data
17:03:08.503 ERROR - Command failed java.lang.IllegalStateException: java.lang.Exception: Cannot upload the file: 'D:\StudyMaterial\infoarchive-ep4\infoarchive\examples\applications\Driver\data\eas_pdi.zip'. Please, contact 'logs/iashell/iashell.log' for additional information.
400 errors:
aip Aip reception fails with code ERROR due to: 'java.lang.NullPointerException: null data supplied'. Consult aip logs.
on POST request for "http://localhost:8765/systemdata/applications/33185a18-7e86-41a3-baa0-a73e32d500e2/aips"

SipAssembler reports incorrect "size of SIP"

The SipAssembler collects metrics on each SIP it generates, but after the packaging information is added, the SIP_SIZE of the generated SIP is incorrectly reported as:

size of pdi + size of all digital objects

when it should be:

size of pdi + size of all digital objects + size of packaging information

Unnecessary conversion from String to Integer in YAML

While installing a new holding using the Holding Wizard, we found an issue: the holding name is unexpectedly converted from a String to an Integer in the YAML configuration on the IA Server side.
For instance, suppose the name of the holding is 0000 in the YAML string:

holding:
    name: 0000

When an object model is built from the YAML string, the SnakeYaml library determines the data type of 0000 to be Integer, and the value becomes 0.
In this case, the user enters the holding name 0000 in the Holding Wizard, but the server tries to save the holding under the different name 0, and an error occurs.
The same behavior occurs in the following example cases (expected name -> actual name):
00123 -> 123
03210 -> 3210
etc.
A possible solution is to add single quotes around names that begin with 0 and consist only of digits.
As for the code changes, a new regexp condition ^0\d*$ needs to be added to the needsToBeSingleQuoted() function, which currently reads:

return text.isEmpty() || text.matches("([\"%@*,].*)|(.*#.*)|(.*:\\s.*)");

It will look like this:

return text.isEmpty() || text.matches("([\"%@*,].*)|(.*#.*)|(.*:\\s.*)|(^0\\d*$)");
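
The proposed condition can be tried out in isolation. The sketch below copies the return statement above into a standalone class; the class name YamlQuoting and the sample inputs are made up for illustration, and the surrounding SDK code is not reproduced.

```java
public class YamlQuoting {

  /** Standalone copy of the proposed check, including the new ^0\d*$ alternative. */
  public static boolean needsToBeSingleQuoted(String text) {
    return text.isEmpty()
        || text.matches("([\"%@*,].*)|(.*#.*)|(.*:\\s.*)|(^0\\d*$)");
  }

  public static void main(String[] args) {
    System.out.println(needsToBeSingleQuoted("0000"));    // true
    System.out.println(needsToBeSingleQuoted("00123"));   // true
    System.out.println(needsToBeSingleQuoted("123"));     // false
    System.out.println(needsToBeSingleQuoted("holding")); // false
  }
}
```

Names such as 123 that do not start with 0 are still left unquoted, so the change only affects all-digit names with a leading zero.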

Ingest SIPs through SIP SDK

We have contributed sample code for SIP ingestion through the SIP SDK. However, the sample currently has issues, which could be due either to some missing configuration or to the SIP SDK itself. Could you please review pull request #16, which contains the sample code, and suggest the required changes?

Memory leak for large set of data

Hello All,
I am using the SIP SDK jar to create SIPs: I read rows from a database table and create SIPs from them. For smaller amounts of data, the SIPs are generated without any issues.

But for a huge number of records (> 2000000), I get a heap space error.

I then modified the code to create a SIP for every 100000 records. Even so, I run into memory issues.
The server has 4 GB of RAM available. On analysis, I found that the sipAssembler is not cleared properly in the source code. I couldn't find any option to clear the sipAssembler or FileBuffer explicitly.
Can anybody provide comments or suggestions on this?
Regards,
Amuthan
