NB: Early Draft! Caveat Lector!

Of the 649,241 objects listed (as of 1/13/2012) in the PAHMA TMS, 72,124 have a non-null value in the dimensions field. This field is populated in one of two ways:

...

  • NB: the ID value is copied from DimItemElemXrefs; its presence indicates that the value is generated, so the recognizer need not attempt to recognize it.

# (actually, there is no such thing as a command-line command called 'sql'.)
$ sql [...] <command> > dimensions.20120618.csv
# fix fields which have embedded newlines
$ perl -pe "chomp;print '%%'" dimensions.20120618.csv | perl -pe 's/%%(\d+)\t/\n\1\t/g;s/\(null\)//g;' | perl -pe "s/\r%%/ | /g" > dimensions.fix.20120618.csv
$ perl dimPatsv2.pl < dimensions.fix.20120618.csv > dimensions.extract.20120618.csv
abbreviations: 89
$ cut -f10 dimensions.extract.20120618.csv | sort | uniq -c | sort -rn

47659	3
14487	2
7343	1
3749	0

# clean up the extract file a bit: get rid of "(null)", 4.0000000000 -> 4.0, handle multiline fields.
$ perl -pe "chomp;s/\t(\d+)\.(\d+?)00+\t/\t\1.\2\t/;print '%%'" dimensions.structured.20120619.csv | perl -pe 's/%%(\d+)\t/\n\1\t/g;s/\(null\)//g;' | perl -pe "s/\r%%/ | /g" >  dimensions.structured.fixed.20120619.csv

# combine the structured and extracted data sets
$ cat dimensions.structured.fixed.20120619.csv dimensions.extract.20120618.csv > dimensions.both.20120618.csv
$ wc dimensions.both.20120618.csv
  113376 1288083 6877617 dimensions.both.20120618.csv
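
The two clean-up steps in the commands above (rejoining records that were split by embedded newlines, and trimming long runs of trailing zeros) can be sketched in Python for readers who find the one-liners hard to follow. This is only an approximation of the Perl pipeline, not a replacement for it. It assumes that every real record starts with a numeric ID followed by a tab; the function names `rejoin_records` and `trim_zeros` are hypothetical.

```python
import re

# Assumption: a genuine record begins with a numeric ID and a tab.
RECORD_START = re.compile(r"^\d+\t")


def rejoin_records(lines):
    """Merge continuation lines into the record they belong to."""
    records = []
    for raw in lines:
        # Drop line endings and the literal "(null)" placeholder.
        line = raw.rstrip("\r\n").replace("(null)", "")
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # An embedded newline split this record; glue it back with " | ".
            records[-1] += " | " + line
    return records


def trim_zeros(field):
    """Shorten a run of trailing zeros, e.g. 4.0000000000 -> 4.0."""
    m = re.fullmatch(r"(\d+)\.(\d+?)00+", field)
    return f"{m.group(1)}.{m.group(2)}" if m else field
```

For example, `rejoin_records(["1\tH. 5 cm\r\n", "W. 3 cm\n"])` yields a single record whose two parts are joined by " | ".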

Table 6: Effectiveness of Extraction (i.e. distribution of confidence values)

Count   Percent   Cumulative Percent   Confidence Value   Legend
47659   65.1%     65.1%                3                  great: all data elements found in original
14487   19.8%     84.9%                2                  pretty good, but some values propagated or imputed
 7343   10.0%     94.9%                1                  good, but more assumptions made: units assumed to be inches, other assumptions made
 3749    5.1%     100.0%               0                  not "grammatical", or otherwise complicated; needs work
73238   100.0%                                            Total single dimensions identified
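
The Percent and Cumulative Percent columns of Table 6 follow directly from the counts per confidence value. A minimal Python sketch of that arithmetic is given here as a cross-check; the function name `confidence_table` is hypothetical, and the counts are the ones reported in the table.

```python
def confidence_table(counts):
    """Return (total, rows); each row is (count, percent, cumulative percent, confidence)."""
    total = sum(counts.values())
    rows = []
    running = 0
    # Highest confidence first, as in Table 6.
    for conf in sorted(counts, reverse=True):
        running += counts[conf]
        rows.append((
            counts[conf],
            round(100 * counts[conf] / total, 1),
            round(100 * running / total, 1),
            conf,
        ))
    return total, rows


# Counts per confidence value, taken from Table 6.
total, rows = confidence_table({3: 47659, 2: 14487, 1: 7343, 0: 3749})
for count, pct, cum, conf in rows:
    print(f"{count:>6}  {pct:>5}%  {cum:>5}%  {conf}")
print(f"{total:>6}  100.0%")
```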

...

  • Apply the distributive law to units, e.g. 21 x 16mm becomes 21 mm x 16 mm.
  • When the type of extent is not provided, e.g. 17"X9", use "L." for the first, "W." for the second, and leave the remaining extents blank. Indicate "medium confidence" (i.e. 2) for rows with these empty extents. Ergo, 17"X9" becomes "L. 17 in. ; W. 9 in.".
  • Numeric ranges (e.g. 6.5-7.1) are retained as is, despite the fact that this makes it more challenging to convert these values to actual numeric data types.
  • In cases where the dimensions are restated in different units, omit the second set of values (usually the English measures).
  • " and ' can become in. and ft., respectively.
  • The actual recognizer is considerably more complex than presented here: more units are recognized, and more patterns and variants are handled. Consult the attached Perl script to understand the reality.
  • Note that the values for unit (e.g. "cm") and dimension (e.g. "W") will need to be converted to their standard CSpace representation prior to uploading!
  • Don't forget to combine the values extracted from free text with the values extracted directly as structured data (see Object Dimensions data mapping).
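
Three of the rules above (distributing units, mapping " and ' to in. and ft., and defaulting the extent labels to "L." and "W.") can be illustrated with a short Python sketch. It is a toy illustration, not the attached Perl recognizer: the names `parse_terms`, `distribute_units` and `label_extents` are hypothetical, only the units mentioned on this page are handled, and anything else (including numeric ranges such as 6.5-7.1) is simply reported as unparseable.

```python
import re

# Only the units mentioned on this page; the real recognizer knows more.
UNITS = {'"': "in.", "'": "ft.", "in": "in.", "ft": "ft.", "mm": "mm", "cm": "cm"}
DEFAULT_EXTENTS = ["L.", "W."]  # any further extents are left blank

# A number, then an optional unit, then an optional trailing period.
TERM = re.compile(r"""(\d+(?:\.\d+)?)\s*(mm|cm|in|ft|"|')?\.?""")


def parse_terms(text):
    """Split '21 x 16mm' into [(value, unit or None), ...]; None if unparseable."""
    terms = []
    for part in re.split(r"\s*[xX]\s*", text.strip()):
        m = TERM.fullmatch(part)
        if m is None:
            return None
        terms.append((m.group(1), UNITS.get(m.group(2))))
    return terms


def distribute_units(terms):
    """Give each unitless value the unit of the nearest value to its right."""
    out, unit = [], None
    for value, own in reversed(terms):
        unit = own or unit
        out.append((value, unit))
    return out[::-1]


def label_extents(terms):
    """Attach the default extent labels and join the terms with ' ; '."""
    labels = DEFAULT_EXTENTS + [""] * len(terms)
    return " ; ".join(
        f"{label} {value} {unit or ''}".strip()
        for label, (value, unit) in zip(labels, terms)
    )
```

With these definitions, `distribute_units(parse_terms("21 x 16mm"))` gives both values the unit mm, and `label_extents(parse_terms('17"X9"'))` produces the "L. 17 in. ; W. 9 in." form described above.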