
NB: Early Draft! Caveat Lector!

Of the 649,241 objects listed (as of 1/13/2012) in the PAHMA TMS, 72,124 have a non-null value in the dimensions field. This field is populated in one of two ways:

  • In 22,426 cases (so-called "template records"), the value is computed via relationships with other tables that specify its constituent parts (e.g. "Inventory: 27.9 x 43.2 cm (11 x 17 in)"). That is, some bit of code in TMS is responsible for unit conversion and formatting of displayable object dimensions.
  • In the rest of the cases (n=49,697), the value is free text, entered by hand or acquired via some other legacy component, usually according to an implicit convention (e.g. "Ht. 8.3 cm; Dia. 19.0 cm", " 13"x7"").

Before these data migrate to CollectionSpace, the free text values need to be analyzed into their constituent parts and inserted into the appropriate TMS tables, so that they can be correctly converted to CSpace's dimension representation schema. Otherwise, this important metadata will end up as textual notes or in some other less-than-ideal niche.

...

| Objectid | ObjectPart | Extent | Approx | Value | Unit | Note | Pattern | Original data* | Confidence** |
|---|---|---|---|---|---|---|---|---|---|
| 326318 | | Dia. | | 24.5 | cm | of rim | | Dia. of rim, 24.5 cm; Ht. 16.5 cm | 3 |
| 326318 | | Ht. | | 16.5 | cm | | | Dia. of rim, 24.5 cm; Ht. 16.5 cm | 3 |
| 203854 | | Wd. | | 70 | cm | | | Wd. 70 cm; L. ca. 128 cm; Th. 1 cm | 3 |
| 203854 | | L. | ca. | 128 | cm | | | Wd. 70 cm; L. ca. 128 cm; Th. 1 cm | 3 |
| 203854 | | Th. | | 1 | cm | | | Wd. 70 cm; L. ca. 128 cm; Th. 1 cm | 3 |
| 321474 | a | L. | | 11.6 | cm | (point and barbs) | | a) L. (point and barbs) 11.6 ... etc., see above | 3 |
| 321474 | b | L. | | 72 | cm | | | a) L. (point and barbs) 11.6 ... | 3 |
| 321474 | b | Wd. | | 2.1 | cm | | | a) L. (point and barbs) 11.6 ... etc., see above | 3 |
| 184233 | | L. | | 94 | cm | (along back) | | L. (along back) 94 cm; ... etc., see above | 3 |
| 184233 | | L. | | 88 | cm | (straight across from tip to tip) | | L. (along back) 94 cm; ... etc., see above | 3 |
| 184233 | | L. | | 80.5 | cm | (cordage) | | L. (along back) 94 cm; ... etc., see above | 3 |
| 8666 | | L. | | 76.2 | cm | | | 76.2 x 106.7 cm (2 ft 6 in x 3 ft 6 in) | 3 |
| 8666 | | L. | | 106.7 | cm | | | 76.2 x 106.7 cm (2 ft 6 in x 3 ft 6 in) | 3 |
| 122011 | | L. | ca. | 1 | in. | | | L. ca. 1" | 3 |
| 186165 | | Dia. | | 7.5-8.0 | cm | (ea. bowl) | | Dia. (ea. bowl) 7.5-8.0 cm | 3 |
| 184981 | | Wd. | | 6-7 | cm | | | Dia. (ea. bowl) 7.5-8.0 cm | 3 |
| 186753 | | Dia. | | 27.5-27.8 | cm | (top) | | Dia. (ea. bowl) 7.5-8.0 cm | 3 |
| 497471 | | L. | | 191 | mm | | | 191x96x34mm | 2 |
| 497471 | | Wd. | | 96 | mm | | | 191x96x34mm | 2 |
| 497471 | | | | 34 | mm | | | 191x96x34mm | 1 |


* The full original data field is included for each parsed dimension, to facilitate checking and further cleanup.
** 3 = extraction is highly likely to be correct; 2 = extraction is probably correct, but should be checked; 1 = data is missing, or otherwise likely to require human intervention; 0 = definitely needs the human touch!

The goal of this cleaning task is to build a recognizer that recognizes as many of the patterns as possible, and flags the rest for manual cleanup.

...

A grammar of the most frequent patterns has been generated to map these data elements to the correct columns in the TMS dimension schema. Note that some of the patterns correspond to a single dimension record (e.g. "E N D", "E N-N D" and "N D"), while others describe several dimensions (e.g. "N D x N D x N D", "N x N x N D") and will be exploded into several dimension records.
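As an illustration of how a free-text value might be reduced to a pattern signature, here is a minimal Python sketch. It assumes that E stands for an extent label (e.g. "Ht."), N for a number, and D for a unit, which is how the examples read; the vocabularies and the function name `signature` are hypothetical and far smaller than what the attached script handles. Unlike the patterns in the table, this sketch normalizes whitespace between tokens.

```python
import re

# Hypothetical, abbreviated vocabularies (assumption: the real recognizer knows many more).
EXTENTS = {"ht.", "dia.", "wd.", "l.", "th."}
UNITS = {"cm", "mm", "in", "in.", "ft", "ft.", "gr"}

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
# A token is a number, a word with an optional trailing period, or any other single symbol.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+\.?|\S")


def signature(text):
    """Map a dimension string to a pattern such as 'E N D' (whitespace is normalized)."""
    out = []
    for tok in TOKEN_RE.findall(text):
        low = tok.lower()
        if NUMBER_RE.fullmatch(tok):
            out.append("N")
        elif low in EXTENTS:
            out.append("E")
        elif low in UNITS:
            out.append("D")
        else:
            out.append(tok)  # separators such as 'x', '-', ';' are kept verbatim
    return " ".join(out)


print(signature("Ht. 8.3 cm"))
print(signature("191x96x34mm"))
```

Counting how often each signature occurs over all free-text values gives a frequency table of the kind shown next.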

Table 5: "Dimension Patterns", in descending frequency order (i.e. "Pareto order")

| Frequency | Pattern | Relative F | Cumulative F |
|---|---|---|---|
| 45796 | E N D | 71.90% | 71.90% |
| 3367 | N D x N D | 5.29% | 77.18% |
| 3020 | N D | 4.74% | 81.93% |
| 1460 | N D x D | 2.29% | 84.22% |
| 1067 | N x N D (N x N D) | 1.68% | 85.89% |
| 1044 | N x N D | 1.64% | 87.53% |
| 596 | N x N x N D | 0.94% | 88.47% |
| 399 | N D (N D) | 0.63% | 89.09% |
| 298 | E N-N D | 0.47% | 89.56% |
| 240 | E N - N D | 0.38% | 89.94% |
| 231 | N x N x D | 0.36% | 90.30% |
| 222 | N X N D | 0.35% | 90.65% |
| 192 | N D x N D x N D | 0.30% | 90.95% |
| 169 | Rim Dia. N D | 0.27% | 91.22% |
| 158 | E N | 0.25% | 91.46% |
| 142 | E N D x N D | 0.22% | 91.69% |
| 139 | E N % of L. | 0.22% | 91.91% |
| 135 | N gr | 0.21% | 92.12% |
| 132 | E of rim, N D | 0.21% | 92.32% |
| 121 | N D (h) x N D (diam.) | 0.19% | 92.51% |
| 118 | N D X N D | 0.19% | 92.70% |
| 110 | Length N D | 0.17% | 92.87% |
| ...etc. | 1,450 more patterns, 4,650 tokens | | |
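
The relative and cumulative columns follow directly from the raw counts. A minimal Python sketch, assuming the pattern signatures are already available as a list of strings (the function name `pareto_rows` is made up for illustration and is not taken from the attached script):

```python
from collections import Counter


def pareto_rows(patterns):
    """Return (count, pattern, relative %, cumulative %) rows, most frequent first."""
    counts = Counter(patterns)
    total = sum(counts.values())
    running = 0
    rows = []
    for pattern, n in counts.most_common():
        running += n
        rows.append((n, pattern, 100.0 * n / total, 100.0 * running / total))
    return rows


# Toy input, not the real data.
for n, pattern, rel, cum in pareto_rows(["E N D"] * 3 + ["N D"]):
    print(f"{n}\t{pattern}\t{rel:.2f}%\t{cum:.2f}%")
```

Reading the cumulative column shows how many patterns the recognizer must handle to reach a given coverage of the data.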

Details

Remarks, Observations, and Heuristics

...

The recognizer, a Perl script, is attached. It takes a TSV file extracted from TMS as input and outputs an "exploded file" as described in Table 4 above.

SQL command to extract the needed data from TMS:

Code Block

SELECT ObjectID, ObjectName, Medium, Dimensions, ID
FROM Objects
LEFT JOIN DimItemElemXrefs
ON (ObjectID = ID)
ORDER BY ObjectID
  • NB: the ID value is copied from DimItemElemXrefs; its presence indicates that the Dimensions value is generated, so the recognizer need not attempt to recognize it.
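
The ID test can be sketched in Python as follows. This is an illustration only: it assumes a headerless TSV whose columns are in the SELECT order above, that embedded newlines have already been repaired, and that missing values appear as an empty string or the literal "(null)". The function name `split_rows` is hypothetical.

```python
import csv
import io


def split_rows(tsv_text):
    """Separate generated ("template") dimension values from free-text ones."""
    # QUOTE_NONE matters: dimension strings use the double quote as an inch mark.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    generated, free_text = [], []
    for row in reader:
        if len(row) < 5:
            continue  # malformed line; skipped in this sketch
        object_id, _name, _medium, dimensions, xref_id = row[:5]
        if dimensions in ("", "(null)"):
            continue  # nothing to recognize
        target = generated if xref_id not in ("", "(null)") else free_text
        target.append((object_id, dimensions))
    return generated, free_text
```

Only the `free_text` rows would then be passed to the recognizer.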
Code Block

# (actually, there is no such thing as a command-line command called 'sql'; it is a placeholder.)
$ sql [...] <command> > dimensions.20120618.csv
# fix fields which have embedded newlines
$ perl -pe "chomp;print '%%'" dimensions.20120618.csv | perl -pe 's/%%(\d+)\t/\n\1\t/g;s/\(null\)//g;' | perl -pe "s/\r%%/ | /g" >  dimensions.fix.20120618.csv
$ perl dimPatsv2.pl < dimensions.fix.20120618.csv > dimensions.extract.20120618.csv
abbreviations: 89
$ cut -f10 dimensions.extract.20120618.csv | sort | uniq -c | sort -rn

61850 3
7607 1
4366 0

# clean up the extract file a bit: get rid of "(null)", 4.0000000000 -> 4.0, handle multiline fields.
$ perl -pe "chomp;s/\t(\d+)\.(\d+?)00+\t/\t\1.\2\t/;print '%%'" dimensions.structured.20120619.csv | perl -pe 's/%%(\d+)\t/\n\1\t/g;s/\(null\)//g;' | perl -pe "s/\r%%/ | /g" >  dimensions.structured.fixed.20120619.csv

# combine the structured and extracted data sets
$ cat dimensions.structured.fixed.20120619.csv dimensions.extract.20120618.csv > dimensions.both.20120618.csv
$ wc dimensions.both.20120618.csv
  113376 1288083 6877617 dimensions.both.20120618.csv
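
For readers who do not use Perl, the "fix fields which have embedded newlines" step can be approximated in Python. The sketch below is only loosely equivalent to the one-liners above: it assumes that every real record starts with a numeric ObjectID followed by a tab, treats any other line as a continuation of the previous record, and joins continuations with " | ". The function name `join_records` is hypothetical.

```python
import re

# Assumption: a new record begins with digits (the ObjectID) followed by a tab.
RECORD_START = re.compile(r"^\d+\t")


def join_records(lines):
    """Merge continuation lines into the record they belong to and drop '(null)'."""
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n").replace("(null)", "")
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " | " + line
    return records


sample = ["1\tbowl\tHt. 8 cm\r\n", "Dia. 9 cm\n", "2\tjar\t(null)\n"]
for record in join_records(sample):
    print(record)
```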

Table 6: Effectiveness of Extraction (i.e. distribution of confidence values)

| Count | Percent | Cumulative Percent | Confidence Value | Legend |
|---|---|---|---|---|
| 47659 | 65.1% | 65.1% | 3 | great: all data elements found in original |
| 14487 | 19.8% | 84.9% | 2 | pretty good, but some values propagated or imputed |
| 7343 | 10.0% | 94.9% | 1 | good, but more assumptions made: units assumed to be inches, other assumptions made |
| 3749 | 5.1% | 100.0% | 0 | not “grammatical”, or otherwise complicated; needs work |
| 73238 | 100.0% | | | Total single dimensions identified |


Table 7: "End Patterns" identified (i.e. after rewriting and segmenting all original patterns)

| Count | Pattern |
|---|---|
| 46168 | END |
| 17010 | ND |
| 3756 | N |
| 3749 | PatNoGood |
| 1562 | EAND |
| 752 | NDE |
| 182 | EN |
| 31 | AND |
| 16 | AN |
| 5 | NE |
| 5 | EAN |
| 2 | ENDE |


Remarks, Observations, and Heuristics

  • Apply the distributive law to units, e.g. 21 x 16mm becomes 21 mm x 16 mm.
  • When the type of extent is not provided, e.g. 17"X9", use "L." for the first, "W." for the second, and leave the remaining extents blank. Indicate "medium confidence" (i.e. 2) for rows with these empty extents. Ergo, 17"X9" becomes "L. 17 in.; W. 9 in.".
  • Numeric ranges (e.g. 6.5-7.1) are retained as is, despite the fact that this makes it more challenging to convert these values to actual numeric data types.
  • In cases where the dimensions are restated in different units, omit the second set of values (usually the English measures).
  • " and ' can become in. and ft., respectively.
  • The actual recognizer is considerably more complex than presented here: more units are recognized, and more patterns and variants are handled. Consult the attached Perl script to understand the reality.
  • Note that the values for unit (e.g. "cm") and dimension (e.g. "W") will need to be converted to their standard CSpace representation prior to uploading!
  • Don't forget to combine the values extracted from free text with the values extracted directly as structured data (see Object Dimensions data mapping).
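
Several of the heuristics above (distributing units, default extents, quote marks as units, retained ranges, dropping restated measures) can be sketched together in Python. This is an illustration under stated assumptions, not the attached recognizer: it handles only unlabeled "N D x N D" shapes, returns None for everything else, and does not assign confidence values. The names `explode`, `PART_RE` and `DEFAULT_EXTENTS` are invented for the example.

```python
import re

DEFAULT_EXTENTS = ["L.", "W."]  # per the heuristic above; any further parts stay blank
# One component: a number or numeric range, optionally followed by a unit.
PART_RE = re.compile(r"^\s*(\d+(?:\.\d+)?(?:-\d+(?:\.\d+)?)?)\s*([A-Za-z.]*)\s*$")


def explode(text):
    """Split an unlabeled dimension string into (extent, value, unit) rows."""
    # Drop a trailing parenthesized restatement such as "(11 x 17 in)".
    text = re.sub(r"\s*\([^)]*\d[^)]*\)\s*$", "", text)
    # Spell out quote-mark units.
    text = text.replace('"', " in.").replace("'", " ft.")
    parsed = []
    for part in re.split(r"\s*[xX]\s*", text.strip()):
        m = PART_RE.match(part)
        if m is None:
            return None  # not a simple shape; leave for other rules or manual cleanup
        parsed.append([m.group(1), m.group(2)])
    # Distributive law: a unit applies to every preceding value that lacks one.
    unit = ""
    for item in reversed(parsed):
        if item[1]:
            unit = item[1]
        else:
            item[1] = unit
    rows = []
    for i, (value, part_unit) in enumerate(parsed):
        extent = DEFAULT_EXTENTS[i] if i < len(DEFAULT_EXTENTS) else ""
        rows.append((extent, value, part_unit))
    return rows


print(explode("21 x 16mm"))
print(explode('17"X9"'))
```

Ranges such as 6.5-7.1 pass through as strings, which matches the decision above to retain them as is.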