...
Code Block |
---|
# (Note: there is no actual command-line program called 'sql'; the next line is a placeholder.)
$ sql [...] <command> > dimensions.20120618.csv
# fix fields which have embedded newlines
$ perl -pe "chomp;print '%%'" dimensions.20120618.csv | perl -pe 's/%%(\d+)\t/\n\1\t/g;s/\(null\)//g;' | perl -pe "s/\r%%/ | /g" > dimensions.fix.20120618.csv
$ perl dimPatsv2.pl < dimensions.fix.20120618.csv > dimensions.extract.20120618.csv
abbreviations: 89
$ cut -f10 dimensions.extract.20120618.csv | sort | uniq -c | sort -rn
61850 3
7607 1
4366 0
# clean up the extract file a bit: get rid of "(null)", 4.0000000000 -> 4.0, handle multiline fields.
$ perl -pe "chomp;s/\t(\d+)\.(\d+?)00+\t/\t\1.\2\t/;print '%%'" dimensions.structured.20120619.csv | perl -pe 's/%%(\d+)\t/\n\1\t/g;s/\(null\)//g;' | perl -pe "s/\r%%/ | /g" > dimensions.structured.fixed.20120619.csv
# combine the structured and extracted data sets
$ cat dimensions.structured.fixed.20120619.csv dimensions.extract.20120618.csv > dimensions.both.20120618.csv
$ wc dimensions.both.20120618.csv
113376 1288083 6877617 dimensions.both.20120618.csv
|
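The multiline-field repair above can be sketched in Python as well. This is a simplified illustration, not a port of the Perl one-liners: it assumes, as the regular expressions above suggest, that every real record starts with a numeric ID followed by a tab, and it treats any other line as a continuation of the previous record. The function name `join_multiline_records` is hypothetical.

```python
import re

# Assumption: a genuine record begins with a numeric ID and a tab.
RECORD_START = re.compile(r"^\d+\t")

def join_multiline_records(lines):
    """Fold continuation lines into the preceding record and drop "(null)"."""
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if RECORD_START.match(line) or not records:
            records.append(line)           # start of a new record
        else:
            records[-1] += " | " + line    # embedded newline becomes " | "
    return [r.replace("(null)", "") for r in records]

sample = ["1\tfoo\tbar\r\n", "more\r\n", "2\t(null)\tbaz\n"]
fixed = join_multiline_records(sample)
```

Unlike the Perl pipeline, this version joins every continuation line with " | ", whether or not it was preceded by a carriage return.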
...
Count | Percent | Cumulative Percent | Confidence Value | Legend |
---|---|---|---|---|
47659 | 65.1% | 65.1% | 3 | great: all data elements found in original |
14487 | 19.8% | 84.9% | 2 | pretty good, but some values propagated or imputed |
7343 | 10.0% | 94.9% | 1 | good, but more assumptions made: units assumed to be inches, other assumptions made |
3749 | 5.1% | 100.0% | 0 | not "grammatical", or otherwise complicated; needs work |
73238 | 100.0% | | | Total single dimensions identified |
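The Percent and Cumulative Percent columns follow directly from the counts. A minimal sketch that recomputes them, using only the counts from the table (the variable names are illustrative):

```python
# Counts per confidence value, copied from the table above.
counts = {3: 47659, 2: 14487, 1: 7343, 0: 3749}

total = sum(counts.values())
running = 0
rows = []
for confidence in sorted(counts, reverse=True):
    n = counts[confidence]
    running += n
    percent = round(100 * n / total, 1)
    cumulative = round(100 * running / total, 1)
    rows.append((n, percent, cumulative, confidence))

for n, percent, cumulative, confidence in rows:
    print(f"{n} | {percent}% | {cumulative}% | {confidence}")
```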
...
- Apply the distributive law to units, e.g. 21 x 16mm becomes 21 mm x 16 mm.
- When the type of extent is not provided, e.g. 17"X9", use "L." for the first, "W." for the second, and leave the remaining extents blank. Indicate "medium confidence" (i.e. 2) for rows with these empty extents. Ergo, 17"X9" becomes "L. 17 in. ; W. 9 in."
- Numeric ranges (e.g. 6.5-7.1) are retained as is, despite the fact that this makes it more challenging to convert these values to actual numeric data types.
- In cases where the dimensions are restated in different units, omit the second set of values (usually the English measures).
- " and ' can become in. and ft., respectively.
- The actual recognizer is considerably more complex than presented here: more units are recognized, and more patterns and variants are handled. Consult the attached Perl script for the full details.
- Note that the values for unit (e.g. "cm") and dimension (e.g. "W") will need to be converted to their standard CSpace representation prior to uploading!
- Don't forget to combine the values extracted from free text with the values extracted directly as structured data (see PAHMA Collection Objects Object Dimensions data mapping).
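A few of the rules above (distributing units, mapping " and ' to in. and ft., keeping numeric ranges as is, and defaulting unlabeled extents to "L." and "W." with confidence 2) can be illustrated with a short Python sketch. This is not the attached Perl recognizer: the unit table, the function names, and the restriction to values separated by "x" are all assumptions made for illustration, and restated measurements in a second unit system are not handled.

```python
import re

# Hypothetical unit table; the real recognizer handles many more units.
UNITS = {'"': "in.", "'": "ft.", "in": "in.", "ft": "ft.", "mm": "mm", "cm": "cm"}
DEFAULT_EXTENTS = ("L.", "W.")   # labels used when no extent type is given
MEDIUM_CONFIDENCE = 2

# A value (optionally a range such as 6.5-7.1) followed by an optional unit.
TERM = re.compile(
    r"""(\d+(?:\.\d+)?(?:-\d+(?:\.\d+)?)?)\s*("|'|mm|cm|in|ft)?\.?""",
    re.IGNORECASE,
)

def parse_terms(text):
    """Return [(value, unit), ...] with units distributed, or None."""
    terms = []
    for chunk in re.split(r"\s*[xX]\s*", text.strip()):
        m = TERM.fullmatch(chunk)
        if m is None:
            return None
        unit = UNITS[m.group(2).lower()] if m.group(2) else None
        terms.append((m.group(1), unit))
    # Distributive law: a value without a unit borrows the next unit to its right.
    filled, unit = [], None
    for value, own_unit in reversed(terms):
        unit = own_unit or unit
        if unit is None:
            return None
        filled.append((value, unit))
    return filled[::-1]

def distribute_units(text):
    """'21 x 16mm' -> '21 mm x 16 mm'; None if the text is not understood."""
    terms = parse_terms(text)
    if terms is None:
        return None
    return " x ".join(f"{value} {unit}" for value, unit in terms)

def label_extents(text):
    """Label unlabeled extents L., W.; return (string, confidence)."""
    terms = parse_terms(text)
    if terms is None:
        return None, 0
    parts = []
    for i, (value, unit) in enumerate(terms):
        label = DEFAULT_EXTENTS[i] if i < len(DEFAULT_EXTENTS) else ""
        parts.append(f"{label} {value} {unit}".strip())
    return " ; ".join(parts), MEDIUM_CONFIDENCE
```

With these definitions, `distribute_units("21 x 16mm")` gives `"21 mm x 16 mm"` and `label_extents('17"X9"')` gives `("L. 17 in. ; W. 9 in.", 2)`. The output units are still plain strings and would need converting to their CSpace representation before upload.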