Comment: Migrated to Confluence 5.3

...

Code Block
# (Actually, there is no command-line command called 'sql'; it stands in for whatever SQL client is used.)

$ sql [...] <command> > dimensions.20120618.csv
# fix fields which have embedded newlines
$ perl -pe "chomp;print '%%'" dimensions.20120618.csv | perl -pe 's/%%(\d+)\t/\n\1\t/g;s/\(null\)//g;' | perl -pe "s/\r%%/ | /g" >  dimensions.fix.20120618.csv
$ perl dimPatsv2.pl < dimensions.fix.20120618.csv > dimensions.extract.20120618.csv
abbreviations: 89
$ cut -f10 dimensions.extract.20120618.csv | sort | uniq -c | sort -rn
61850 3
7607 1
4366 0

# clean up the extract file a bit: get rid of "(null)", 4.0000000000 -> 4.0, handle multiline fields.
$ perl -pe "chomp;s/\t(\d+)\.(\d+?)00+\t/\t\1.\2\t/;print '%%'" dimensions.structured.20120619.csv | perl -pe 's/%%(\d+)\t/\n\1\t/g;s/\(null\)//g;' | perl -pe "s/\r%%/ | /g" >  dimensions.structured.fixed.20120619.csv

# combine the structured and extracted data sets
$ cat dimensions.structured.fixed.20120619.csv dimensions.extract.20120618.csv > dimensions.both.20120618.csv
$ wc dimensions.both.20120618.csv
  113376 1288083 6877617 dimensions.both.20120618.csv

Table 6: Effectiveness of Extraction (i.e. distribution of confidence values)

| Count | Percent | Cumulative Percent | Confidence Value | Legend |
|-------|---------|--------------------|------------------|--------|
| 47659 | 65.1%   | 65.1%              | 3                | great: all data elements found in original |
| 14487 | 19.8%   | 84.9%              | 2                | pretty good, but some values propagated or imputed |
| 7343  | 10.0%   | 94.9%              | 1                | good, but more assumptions made: units assumed to be inches, other assumptions made |
| 3749  | 5.1%    | 100.0%             | 0                | not "grammatical", or otherwise complicated; needs work |
| 73238 | 100.0%  |                    |                  | Total single dimensions identified |
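The Percent and Cumulative Percent columns of Table 6 can be recomputed from the raw counts. The following is a minimal sketch, not part of the original workflow: the function name `confidence_summary` is hypothetical, and it assumes input lines of the form "count value", as produced by `uniq -c`, already sorted from highest to lowest confidence.

```shell
#!/bin/sh
# Hypothetical helper: reads "count confidence" lines (uniq -c style) on stdin
# and prints count, percent, cumulative percent and confidence, tab-separated,
# followed by a total line.
confidence_summary() {
  awk '
    { n = NR; count[NR] = $1; value[NR] = $2; total += $1 }
    END {
      for (i = 1; i <= n; i++) {
        running += count[i]
        printf "%d\t%.1f%%\t%.1f%%\t%s\n", count[i], 100 * count[i] / total, 100 * running / total, value[i]
      }
      printf "%d\t100.0%%\n", total
    }'
}

# Counts taken from Table 6.
printf '47659 3\n14487 2\n7343 1\n3749 0\n' | confidence_summary
```

On a real extract, the same function could be fed from a pipeline such as `cut -f10 <file> | sort | uniq -c | sort -k2,2nr`, assuming the confidence value is in column 10 as in the commands above.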

...

  • Apply the distributive law to units, e.g. 21 x 16mm becomes 21 mm x 16 mm.
  • When the type of extent is not provided, e.g. 17"X9", use "L." for the first, "W." for the second, and leave the remaining extents blank. Indicate "medium confidence" (i.e. 2) for rows with these empty extents. Ergo, 17"X9" becomes "L. 17 in. ; W. 9 in."
  • Numeric ranges (e.g. 6.5-7.1) are retained as is, despite the fact that this makes it more challenging to convert these values to actual numeric data types.
  • In cases where the dimensions are restated in different units, omit the second set of values (usually the English measures).
  • " and ' can become in. and ft., respectively.
  • The actual recognizer is considerably more complex than presented here: more units are recognized, and more patterns and variants are handled. Consult the attached Perl script for the full details.
  • Note that the values for unit (e.g. "cm") and dimension (e.g. "W") will need to be converted to their standard CSpace representation prior to uploading!
  • Don't forget to combine the values extracted from free text with the values extracted directly as structured data (see Object Dimensions data mapping).
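Three of the rules above (quote marks to units, the distributive law, and default L./W. labels) can be illustrated with a few `sed` substitutions. This is only a sketch under stated assumptions: the function names are hypothetical, the unit list (mm, cm, in., ft.) is a small assumed subset, and none of this is taken from the attached Perl script, which handles far more patterns.

```shell
#!/bin/sh
# Hypothetical helpers illustrating three rules from the list above.
# They are NOT the logic of dimPatsv2.pl; the unit list is an assumption.

# Rule: " and ' become in. and ft., respectively.
quotes_to_units() {
  printf '%s\n' "$1" |
    sed -E -e 's/([0-9]+\.?[0-9]*)"/\1 in./g' -e "s/([0-9]+\.?[0-9]*)'/\\1 ft./g"
}

# Rule: distribute a trailing unit over both numbers (21 x 16mm -> 21 mm x 16 mm).
distribute_units() {
  printf '%s\n' "$1" |
    sed -E 's/([0-9]+\.?[0-9]*) *[xX] *([0-9]+\.?[0-9]*) *(mm|cm|in\.|ft\.)/\1 \3 x \2 \3/'
}

# Rule: when no extent type is given, label the first value L. and the second W.
# (Rows handled this way would get confidence 2, per the list above.)
label_extents() {
  printf '%s\n' "$1" |
    sed -E 's/^([0-9]+\.?[0-9]* (mm|cm|in\.|ft\.)) *[xX] *([0-9]+\.?[0-9]* (mm|cm|in\.|ft\.))$/L. \1 ; W. \3/'
}

distribute_units '21 x 16mm'                   # prints: 21 mm x 16 mm
label_extents "$(quotes_to_units '17"X9"')"    # prints: L. 17 in. ; W. 9 in.
```

Keeping each rule in its own function makes it easy to test the rules one at a time; a real recognizer would also need to handle ranges, restated units, and labelled extents.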