PAHMA data cleaning—Dimensions
Of the 649,241 objects listed (as of 1/13/2012) in the PAHMA TMS, 72,124 have a non-null value in the dimensions field. This field is filled in in one of two ways:
- In 22,426 cases (so-called "template records"), the value is computed via relationships with other tables which specify the constituent parts of the value (e.g. Inventory: 27.9 x 43.2 cm (11 x 17 in). That is, some bit of code in TMS is responsible for unit conversion and formatting of displayable object dimensions.
- In the rest of the cases (n=49,697), the value is free text, entered by hand or acquired via some other legacy component, usually according to an implicit convention (e.g. Ht. 8.3 cm; Dia. 19.0 cm, 13"x7")
Before these data migrate to CollectionSpace, the free text values need to be analyzed into their constituent parts and inserted into the appropriate TMS tables, so that they can be correctly converted to CSpace's dimension representation schema. Otherwise, this important metadata will end up as textual notes or in some other less-than-ideal niche.
For the vast majority of cases, it is a simple matter to parse the text into appropriate chunks using a few basic patterns. For a small but significant segment, however, the free text values will need to be munged either by hand or by hack to come into conformance.
This is a data cleaning task: the data in PAHMA's TMS will be corrected to the extent possible prior to the migration (as opposed to doing the cleaning as part of the migration).
Analysis of the Dimensions column in the Objects table in TMS
Consider the values in Table 1.
Table 1: Some basic values in the Dimensions column of the Objects table in TMS
Objectid |
Dimension |
---|---|
189890 |
L. 4 1/2 ft. |
163013 |
L. 56 cm |
325414 |
L. ca 73.0 cm |
326545 |
L. ca. 63 cm |
202836 |
Dia. 18 cm; Ht. 19 cm |
191141 |
Dia. 15.5 cm; Ht. 6.5 cm |
190288 |
Ht. 23.0 cm; Jdia. 42.8 cm |
These all conform to the following hierarchical pattern (i.e. regular expression):
/Dim(; Dim)*/
Where:
Dim = /Extent (Approx)? Value Unit/
And
Extent = /(Ht|Wd|L|Dia|Jdia).?/ Approx = /ca.?/ Value = a numeric value, decimal or fractional Unit = /(cm|mm|in|gm|ft|yd|m).?/
Indeed, about 80% of the values are matched by this particular regular language (i.e. pattern).
However, consider Table 2 below.
Table 2: Some more complex values in the Dimensions column of the Objects table in TMS
Objectid |
Dimension |
---|---|
326318 |
Dia. of rim, 24.5 cm; Ht. 16.5 cm |
203854 |
Wd. 70 cm; L. ca. 128 cm; Th. 1 cm |
321474 |
a) L. (point and barbs) 11.6 cm; b) L. 72 cm; Wd. 2.1 cm |
184233 |
L. (along back) 94 cm; L. (straight across from tip to tip) 88 cm; L. (cordage) 80.5 cm |
These dimensions (n = approx. 10,000) are more complicated. A more elaborate version of the regular language is required.
Finally, consider the even more interesting patterns exemplified in Table 4, below.
Table 3: Other interesting inhabitants of the Dimensions column of the Objects table in TMS
ObjectID |
Dimension |
Pattern |
---|---|---|
8666 |
76.2 x 106.7 cm (2 ft 6 in x 3 ft 6 in) |
N x N D (N D N D x N D N D ) |
122011 |
ca. 1 in. |
N D |
186165 |
Dia. (ea. bowl) 7.5-8.0 cm |
E N-N D |
184981 |
Wd. 6-7 cm |
E N-N D |
186753 |
Dia. (top) 27.5-27.8 cm |
E N-N D |
497471 |
191x96x34mm |
N x N x N D |
31581 |
4.6 x 3.9 x 2.5 cm (1 13/16 x 1 9/16 x 1 in) |
N x N x N D (N x N x N D) |
(NB: E = extent, N = numeric value, D = unit, see patterns above)
Points to note:
- In some cases, the unit is expressed only once, and is thus "implied" for the other extents; the extraction process therefore needs to "backpropagate" or "project" units where they are not explicit.
- Sometimes elements are omitted altogether; defaults may or may not be available in all cases.
- Sometimes the value is a range (e.g. 6-7 cm)
- Often the values are repeated (using different units). There would be no reason to store both sets in CSpace.
For the majority of the data presented in these examples, a recognizer could be created which would do the right thing, transforming free text diments into a relatively straightforward schema such as the one shown below in Table 4.
Table 4: Free Text dimensions, parsed and exploded into intermediate table
Objectid |
ObjectPart |
Extent |
Approx |
Value |
Unit |
Note |
Original data* |
Confidence** |
---|---|---|---|---|---|---|---|---|
326318 |
|
Dia. |
|
24.5 |
cm |
of rim |
Dia. of rim, 24.5 cm; Ht. 16.5 cm |
3 |
326318 |
|
Ht. |
|
16.5 |
cm |
|
Dia. of rim, 24.5 cm; Ht. 16.5 cm |
3 |
203854 |
|
Wd. |
|
70 |
cm |
|
Wd. 70 cm; L. ca. 128 cm; Th. 1 cm |
3 |
203854 |
|
L. |
ca. |
128 |
cm |
|
Wd. 70 cm; L. ca. 128 cm; Th. 1 cm |
3 |
203854 |
|
Th. |
|
1 |
cm |
|
Wd. 70 cm; L. ca. 128 cm; Th. 1 cm |
3 |
321474 |
a |
L. |
|
11.6 |
cm |
(point and barbs) |
a) L. (point and barbs) 11.6 ... etc., see above |
3 |
321474 |
b |
L. |
|
72 |
cm |
|
a) L. (point and barbs) 11.6 ... |
3 |
321474 |
b |
Wd. |
|
2.1 |
cm |
|
a) L. (point and barbs) 11.6 ... etc., see above |
3 |
184233 |
|
L. |
|
94 |
cm |
(along back) |
L. (along back) 94 cm; ... etc., see above |
3 |
184233 |
|
L. |
|
88 |
cm |
(straight across from tip to tip) |
L. (along back) 94 cm; ... etc., see above |
3 |
184233 |
|
L. |
|
80.5 |
cm |
(cordage) |
L. (along back) 94 cm; ... etc., see above |
3 |
8666 |
|
L. |
|
76.2 |
cm |
|
76.2 x 106.7 cm (2 ft 6 in x 3 ft 6 in) |
3 |
8666 |
|
L. |
|
106.7 |
cm |
|
76.2 x 106.7 cm (2 ft 6 in x 3 ft 6 in) |
3 |
122011 |
|
L. |
ca. |
1 |
in. |
|
L. ca. 1" |
3 |
186165 |
|
Dia. |
|
7.5-8.0 |
cm |
(ea. bowl) |
Dia. (ea. bowl) 7.5-8.0 cm |
3 |
184981 |
|
Wd. |
|
6-7 |
cm |
|
Dia. (ea. bowl) 7.5-8.0 cm |
3 |
186753 |
|
Dia. |
|
27.5-27.8 |
cm |
(top) |
Dia. (ea. bowl) 7.5-8.0 cm |
3 |
497471 |
|
L. |
|
191 |
mm |
|
191x96x34mm |
2 |
497471 |
|
Wd. |
|
96 |
mm |
|
191x96x34mm |
2 |
497471 |
|
|
|
34 |
mm |
|
191x96x34mm |
1 |
* The full original data field is included for each parsed dimension, to facilitate checking and further cleanup.
** 3 = extraction is highly likely to be correct; 2 = extraction is probably correct, but should be checked ; 1 = data is missing, or otherwise likely to require human intervention; 0 = definitely needs the human touch!
The goal of this cleaning task is to build a recognizer that recognizes as many of the patterns as possible, and flags the rest for manual cleanup.
Method
As suggested above, the "non-template dimensions" generally conform to one of several "high-level patterns" (e.g. Dim, above), which in turn are composed of other patterns (e.g. Extent, above). Using the semicolon and the pattern Xx as "Dim separators" (as in 3 x 4, "three by four"), the 49,697 table cells can be exploded into 73,098 Dim elements. In most cases these smaller chunks correspond to a single object dimension (e.g. "L. ca. 94 cm."), though many still express several dimensions in another syntax (e.g. 17"X9", 31 x 21 x 16 mm, etc.)
Defining regular expressions for "E" (Extent), "N" (Numeric value), and "D" (unit) as above, 1,473 "subpatterns" can be discerned in the input. As noted above, the single pattern "E N D" occurs 45,796 times, (71.9% of the tokens). The top 23 patterns, shown below, occur more than 100 times and subsume 59,156 tokens (almost 93% of the entire dataset).
The remaining 4,650 tokens (about 6%) occur in 1,450 patterns, most of these occurring only one or two times.
A grammar of the most frequent patterns has been generated to map these data elements to the correct columns in the TMS dimension schema. Note that some of the patterns correspond to a single dimension record (e.g. "E N D", "E N-N D" and "N D"), while others describe several dimensions (e.g. "N D x N D x N D, "N x N x N D"), and will be exploded into several dimension records.
Table 5: "Dimension Patterns", in descending frequency order (i.e. "Pareto order")
Frequency |
Pattern |
Relative F |
Cumulative F |
---|---|---|---|
45796 |
E N D |
71.90% |
71.90% |
3367 |
N D x N D |
5.29% |
77.18% |
3020 |
N D |
4.74% |
81.93% |
1460 |
N D x D |
2.29% |
84.22% |
1067 |
N x N D (N x N D) |
1.68% |
85.89% |
1044 |
N x N D |
1.64% |
87.53% |
596 |
N x N x N D |
0.94% |
88.47% |
399 |
N D (N D) |
0.63% |
89.09% |
298 |
E N-N D |
0.47% |
89.56% |
240 |
E N - N D |
0.38% |
89.94% |
231 |
N x N x D |
0.36% |
90.30% |
222 |
N X N D |
0.35% |
90.65% |
192 |
N D x N D x N D |
0.30% |
90.95% |
169 |
Rim Dia. N D |
0.27% |
91.22% |
158 |
E N |
0.25% |
91.46% |
142 |
E N D x N D |
0.22% |
91.69% |
139 |
E N % of L. |
0.22% |
91.91% |
135 |
N gr |
0.21% |
92.12% |
132 |
E of rim, N D |
0.21% |
92.32% |
121 |
N D (h) x N D (diam.) |
0.19% |
92.51% |
118 |
N D X N D |
0.19% |
92.70% |
110 |
Length N D |
0.17% |
92.87% |
...etc. |
1,450 more patterns, 4,650 tokens |
Details
The recognizer, in the form of a perl script, is attached. It takes a TSV file extracted from TMS as input, and outputs an "exploded file" as described in Table 4 above.
SQL command to extract the needed data from TMS:
SELECT ObjectID, ObjectName, Medium, Dimensions,ID FROM Objects LEFT JOIN DimItemElemXrefs ON (ObjectID = ID) ORDER BY ObjectID
- NB: the ID values is copied from the DimItemElemXrefs; its presence indicate that the value is generated, and so the recognizer need not attempt to recognize it.
# (actually, there is no such thing as a command line command called 'sql'. $ sql [...] <command> > dimensions.20120618.csv # fix fields which have embedded newlines $ perl -pe "chomp;print '%%'" dimensions.20120618.csv | perl -pe 's/%%(\d+)\t/\n\1\t/g;s/\(null\)//g;' | perl -pe "s/\r%%/ | /g" > dimensions.fix.20120618.csv $ perl dimPatsv2.pl < dimensions.fix.20120618.csv > dimensions.extract.20120618.csv abbreviations: 89 $ cut -f10 dimensions.extract.20120618.csv | sort | uniq -c | sort -rn 61850 3 7607 1 4366 0 # clean up the extract file a bit: get rid of "(null)", 4.0000000000 -> 4.0, handle multiline fields. $ perl -pe "chomp;s/\t(\d+)\.(\d+?)00+\t/\t\1.\2\t/;print '%%'" dimensions.structured.20120619.csv | perl -pe 's/%%(\d+)\t/\n\1\t/g;s/\(null\)//g;' | perl -pe "s/\r%%/ | /g" > dimensions.structured.fixed.20120619.csv # combined the structured and extracted data sets $ cat dimensions.structured.fixed.20120619.csv dimensions.extract.20120618.csv > dimensions.both.20120618.csv $ wc dimensions.both.20120618.csv 113376 1288083 6877617 dimensions.both.20120618.csv
Table 6: Effectiveness of Extraction (i.e. distribution of confidence values)
Count |
Percent |
Cumulative Percent |
Confidence Value |
Legend |
---|---|---|---|---|
47659 |
65.1% |
65.1% |
3 |
great: all data elements found in original |
14487 |
19.8% |
84.9% |
2 |
pretty good, but some values propagated or imputed |
7343 |
10.0% |
94.9% |
1 |
good, but more assumptions made: units assumed to be inches, other assumptions made |
3749 |
5.1% |
100.0% |
0 |
not “grammatical”, or otherwise complicated; needs work |
73238 |
100.0% |
|
|
Total single dimensions identified |
Table 7: "End Patterns" identified (i.e. after rewriting and segmenting all original patterns)
Count |
Pattern |
---|---|
46168 |
END |
17010 |
ND |
3756 |
N |
3749 |
PatNoGood |
1562 |
EAND |
752 |
NDE |
182 |
EN |
31 |
AND |
16 |
AN |
5 |
NE |
5 |
EAN |
2 |
ENDE |
Remarks, Observations, and Heuristics
- Apply the distributive law to units e.g. 21 x 16mm becomes 21 mm x 16 mm
- When the type of extent is not provided, e.g. 17"X9", use "L." for the first, "W." for the second, and leave the remaining extents blank. Indicate "medium confidence" (i.e. 2) for rows with these empty extents. Ergo, 17"X9" becomes "L. 17 in. ; W. 9 in..
- Numeric ranges (e.g. 6.5-7.1) are retained as is, despite the fact that this makes it more challenging to convert these values to actual numeric data types.
- In cases where the dimensions are restated in different units, omit the second set of values (usually the English measures).
- " and ' can become in. and ft., respectively.
- The actual recognizer is considerably more complex than presented here: more units are recognized, and more patterns and variants are handled. Consult the perl script attached to understand the reality.
- Note that the values for unit (e.g. "cm") and dimension (e.g. "W") will need to be converted to their standard CSpace representation prior to uploading!
- Don't forget to combine the values extracted from free text with the values extracted directly as structured data (see Object Dimensions data mapping)