Chris P: this will meet our needs

Import

Susan:
After Richard's fixes, import now routinely accepts batches of 5K records
Still need to restart Tomcat from time to time

Would like to get to 10K and perhaps more in imports

Maybe some issues with macro substitution, ampersand substitutions, etc.

Had seen these issues even with the Java Client Library in the past
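
The notes do not record the root cause of the ampersand problems. One common source, when XML is assembled by hand, is unescaped special characters in field values. A minimal sketch of escaping text before it goes into an import file follows; the class and method names are hypothetical and not part of CollectionSpace or its Java Client Library.

```java
final class XmlText {
    private XmlText() {}

    /** Escapes the five characters that XML treats specially in text and attribute values. */
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

Values written through an XML library (JAXB, DOM) are normally escaped by the library, so this only matters where strings are concatenated into XML directly.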

Chris P:
Using Talend
Data imported directly from a MSSQL database via a Talend module
Go through transforms, merging together

For the output part, was using 'Advanced XML'
Problem: it doesn't support multiple loops in a tree
Instead, using Java with JAXB-generated classes
if I have repeated fields, turn them into an object list
lots of modules with code, inside of Talend
write code to map objects

originally was using services APIs to import
now using import service
just using JAXB bindings to create the XML
if anything's changed in the schema, I get a compile error
which helps me find out what to change
just import new libraries in, and see what doesn't compile

A write-up is on the main CollectionSpace wiki

Yuteh:
Generating individual files for main record and repeatable structures
Previously using Susan's 10K record merge
Doing merging
XMLMerge works for smaller files like Person
CollectionObject is too big
Richard: it probably can be made to work

Am now splitting every 3K (78 MB, not including any repeating)
have hundreds of files
if I can keep my delta in one file
can run this over and over again on the same delta file
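
A rough sketch of the splitting step described above: cut a list of records into fixed-size batches, one per import file. The class name and the generic record type are assumptions for illustration; the 3K figure comes from the notes, and the right size will depend on how large each record is.

```java
import java.util.ArrayList;
import java.util.List;

final class Batcher {
    private Batcher() {}

    /** Splits records into consecutive batches of at most batchSize items each. */
    static <T> List<List<T>> split(List<T> records, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < records.size(); start += batchSize) {
            int end = Math.min(start + batchSize, records.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(records.subList(start, end)));
        }
        return batches;
    }
}
```

For example, `Batcher.split(records, 3000)` on 7,000 records gives three batches: two of 3,000 and one of 1,000.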

Susan:
Have 5K, but my objects at MMI aren't as large as PAHMA's objects

Patrick:
Pulled all 600K records, denormalized into 1 million rows, for Delphi, 270 MB

Yuteh:
Wrote Java code to strip off 'easy' empty records

Chris P:
Last did this in 1.7, don't recall how many batches, wasn't too bad
Have 46K object records

Chris H:
Talend XML generator has a 'create elements even if empty' checkbox, which is checked ('on') by default

Susan:
Required in groups and lists, perhaps in repeatables
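
A sketch of the empty-element stripping discussed above, not Yuteh's actual code. It removes elements that have no attributes, no child elements, and only whitespace text, and it leaves a caller-supplied set of element names untouched, to allow for the point that empty elements may be required in groups and lists. The class name, method names, and the protected-names mechanism are all assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class EmptyElementStripper {
    private EmptyElementStripper() {}

    /** True if the element has no attributes, no child elements, and only whitespace text. */
    static boolean isEmpty(Element e) {
        if (e.hasAttributes()) return false;
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node k = kids.item(i);
            short type = k.getNodeType();
            if (type == Node.ELEMENT_NODE) return false;
            boolean isText = type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE;
            if (isText && !k.getTextContent().trim().isEmpty()) return false;
        }
        return true;
    }

    /**
     * Depth-first prune. Children are handled before their parent, so a parent
     * that becomes empty is removed too. Elements whose (qualified) name is in
     * protectedNames are skipped along with their whole subtree.
     */
    static void prune(Element parent, Set<String> protectedNames) {
        NodeList kids = parent.getChildNodes();
        // Walk backwards: the NodeList is live and shrinks as children are removed.
        for (int i = kids.getLength() - 1; i >= 0; i--) {
            Node k = kids.item(i);
            if (k.getNodeType() != Node.ELEMENT_NODE) continue;
            Element child = (Element) k;
            if (protectedNames.contains(child.getNodeName())) continue;
            prune(child, protectedNames);
            if (isEmpty(child)) parent.removeChild(child);
        }
    }

    /** Convenience parser for small documents held in a string. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }
}
```

Which element names need protecting would have to be worked out per record type; the notes leave that open ("perhaps in repeatables").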

Relations are difficult with custom extensions

Patrick:
Should be able to have generic doctype in there
Richard will think about this
Already marked with a tenant
We don't need that in the doctype

  • When filtering relations, could do a stem search

Richard:
Nuxeo shouldn't care, due to its derivation model, if derived from the common doctype

Susan:
When custom tenant isn't there

Richard:
The fix will mean that you won't need to re-import the relation records; you can leave the tenant-qualified doctypes in there

Susan:
Display predicate name in relation not used; different in app layer?
Doesn't appear to be used at all

Richard:
Dan asked for this some time ago

Chris at Walker:
Hooking up Talend right now

Nate:
Sending payloads now using the services
The downside is you can't run it again, without querying whether the object already exists
May not necessarily be a bad fit for us

There was a set of tools that could take various data sources, transform them, and spit out uniform output
Kettle

Susan:
I assemble the XML myself in JavaScript
Kettle lets you make fragments and assemble them in JavaScript
Quick and easy

Patrick:
Talend can import a schema and generate XML in that schema

Nate:
Pre-populate CSIDs with GUIDs?

Susan:
Yes

Richard:
Easier for creating relations

Chris H:
Simple Java method to get a GUID/UUID, which you can put in your CSID in Talend
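
A minimal sketch of such a method, using the JDK's `java.util.UUID`. The class and method names are hypothetical; whether CollectionSpace places any further constraints on CSID format is not stated in these notes.

```java
import java.util.UUID;

final class CsidGenerator {
    private CsidGenerator() {}

    /** Returns a random (version 4) UUID in its canonical 36-character string form. */
    static String newCsid() {
        return UUID.randomUUID().toString();
    }
}
```

In Talend, a static method like this could be called from a custom routine or a Java component to fill the CSID field of each row.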

Nate:
Collection we're importing is 11K objects
Even if we have to do it again, talking to services is appealing to us
Might look again at Talend, Kettle
Our starting data is in CSV files from FileMaker Pro, I can generate good CDWA Lite data from that

Chris P:
Relations, movements ... not just objects

Susan:
By sheer count, relation records outnumber everything else

Patrick:
If you use import, you can prepopulate records with CSIDs, with all relations using those, etc.
If you use services, you will need to retrieve what you imported to get the CSIDs
Import is close to an order of magnitude faster than services
If you're fiddling, that speed difference can be important
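
A sketch of why prepopulated CSIDs help with relations: every source record gets its CSID up front, so relation records can be built in the same pass without reading anything back from the server. The `Relation` holder, its field names, and the use of plain string keys for source records are illustrative assumptions, not the CollectionSpace relation schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

final class RelationPlanner {
    private RelationPlanner() {}

    /** Hypothetical holder; field names are illustrative only. */
    static final class Relation {
        final String subjectCsid;
        final String objectCsid;
        final String predicate;

        Relation(String subjectCsid, String objectCsid, String predicate) {
            this.subjectCsid = subjectCsid;
            this.objectCsid = objectCsid;
            this.predicate = predicate;
        }
    }

    /** Assigns one fresh CSID per distinct source key (e.g. a legacy database ID). */
    static Map<String, String> assignCsids(List<String> sourceKeys) {
        Map<String, String> csids = new LinkedHashMap<>();
        for (String key : sourceKeys) {
            csids.putIfAbsent(key, UUID.randomUUID().toString());
        }
        return csids;
    }

    /** Turns pairs of source keys into relations that use the preassigned CSIDs. */
    static List<Relation> link(Map<String, String> csids,
                               List<String[]> sourcePairs,
                               String predicate) {
        List<Relation> relations = new ArrayList<>();
        for (String[] pair : sourcePairs) {
            String subject = csids.get(pair[0]);
            String object = csids.get(pair[1]);
            if (subject == null || object == null) {
                throw new IllegalArgumentException(
                        "No CSID assigned for " + pair[0] + " or " + pair[1]);
            }
            relations.add(new Relation(subject, object, predicate));
        }
        return relations;
    }
}
```

With the services route, the equivalent of `assignCsids` is a round trip per record to learn the server-assigned CSID before any relation can be written.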

A Talend script importing from CDWA Lite would be interesting to many people

Can export a job from Talend, and someone else can look at what you've done

Chris H:
Talend is great, but has its own mindset

Not always clear about what should be shared

Yuteh has been creating some great documentation; e.g. on creating relationships

Chris P:
Has a page on the main wiki about what he did in 1.7

What would be really good is a standalone output module
You can do whatever you need on the import side
But the maintenance is quite high on that, while the schemas are changing

There might be significant benefit in a monthly implementers' call
Problems go out on the Work or Talk list
But successes don't always get reported or discussed

Richard:
It's possible I could get the Nuxeo shell and/or Webapp installed

If you get the Nuxeo DM webapp and configure it with the repository settings CollectionSpace uses now,
you can run it in its own container; it doesn't need to be in Tomcat or the same Tomcat
The configuration settings that are in Tomcat might be enough to figure this out

The worst case might be that you need to shut down / undeploy CollectionSpace while using the console or shell for an export, but you might not need to.