NB: Early Draft! Caveat Lector!

Of the 649,241 objects listed (as of 1/13/2012) in the PAHMA TMS, 72,124 have a non-null value in the dimensions field. This field is populated in one of two ways:

...

NB: the ID value is copied from DimItemElemXrefs; its presence indicates that the value is generated, so the recognizer need not attempt to recognize it.

# (actually, there is no such thing as a command-line command called 'sql')
$ sql [...] <command> > dimensions.20120618.csv
# fix fields which have embedded newlines
$ perl -pe "chomp;print '%%'" dimensions.20120618.csv | perl -pe 's/%%(\d+)\t/\n\1\t/g;s/\(null\)//g;' | perl -pe "s/\r%%/ | /g" > dimensions.fix.20120618.csv
$ perl dimPatsv2.pl < dimensions.fix.20120618.csv > dimensions.extract.20120618.csv
abbreviations: 89
$ cut -f10 dimensions.extract.20120618.csv | sort | uniq -c | sort -rn

47659	3
14487	2
7343	1
3749	0

# clean up the extract file a bit: get rid of "(null)", 4.0000000000 -> 4.0, handle multiline fields.
$ perl -pe "chomp;s/\t(\d+)\.(\d+?)00+\t/\t\1.\2\t/;print '%%'" dimensions.structured.20120619.csv | perl -pe 's/%%(\d+)\t/\n\1\t/g;s/\(null\)//g;' | perl -pe "s/\r%%/ | /g" > dimensions.structured.fixed.20120619.csv

# combine the structured and extracted data sets
$ cat dimensions.structured.fixed.20120619.csv dimensions.extract.20120618.csv > dimensions.both.20120618.csv
$ wc dimensions.both.20120618.csv
  113376 1288083 6877617 dimensions.both.20120618.csv
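
The newline repair in the perl one-liners above is terse. Purely as an illustration, here is a minimal Python sketch of the same idea. It assumes that every record starts with a numeric ID followed by a tab and that any other line continues the previous record. The name `rejoin_records` is hypothetical. Unlike the perl pipeline, which only joins lines that end in a carriage return, this sketch joins every continuation line with " | ".

```python
import re

# Assumption: a new record begins with a numeric ID followed by a tab.
RECORD_START = re.compile(r"^\d+\t")


def rejoin_records(lines):
    """Fold continuation lines into the record they belong to."""
    records = []
    for raw in lines:
        # Drop line endings and the literal "(null)" placeholder.
        line = raw.rstrip("\r\n").replace("(null)", "")
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # Continuation of a multi-line field: join with " | ".
            records[-1] += " | " + line
    return records


sample = ["1\tL. 4 in.\r\n", "W. 2 in.\r\n", "2\t(null)\r\n"]
for record in rejoin_records(sample):
    print(record)
```

Reading the file line by line keeps the record boundary rule in one place, which makes it easier to adjust if the ID format turns out to differ.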

Table 6: Effectiveness of Extraction (i.e. distribution of confidence values)

Count	Percent	Cumulative Percent	Confidence Value	Legend
47659	65.1%	65.1%	3	great: all data elements found in original
14487	19.8%	84.9%	2	pretty good, but some values propagated or imputed
7343	10.0%	94.9%	1	good, but more assumptions made: units assumed to be inches, other assumptions made
3749	5.1%	100.0%	0	not "grammatical", or otherwise complicated; needs work
73238	100.0%			Total single dimensions identified
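
The Percent and Cumulative Percent columns follow directly from the counts. The short Python check below recomputes them using only the four counts from Table 6; the variable names are illustrative.

```python
# Counts per confidence value, copied from Table 6.
counts = {3: 47659, 2: 14487, 1: 7343, 0: 3749}

total = sum(counts.values())
running = 0
rows = []
for confidence in sorted(counts, reverse=True):
    count = counts[confidence]
    running += count
    percent = round(100 * count / total, 1)
    cumulative = round(100 * running / total, 1)
    rows.append((count, percent, cumulative, confidence))

for count, percent, cumulative, confidence in rows:
    print(f"{count}\t{percent}%\t{cumulative}%\t{confidence}")
print(f"{total}\t100.0%")
```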

...

Apply the distributive law to units, e.g. 21 x 16mm becomes 21 mm x 16 mm.
When the type of extent is not provided, e.g. 17"X9", use "L." for the first, "W." for the second, and leave the remaining extents blank. Indicate "medium confidence" (i.e. 2) for rows with these empty extents. Ergo, 17"X9" becomes "L. 17 in. ; W. 9 in."
Numeric ranges (e.g. 6.5-7.1) are retained as is, despite the fact that this makes it more challenging to convert these values to actual numeric data types.
In cases where the dimensions are restated in different units, omit the second set of values (usually the English measures).
" and ' can become in. and ft., respectively.
The actual recognizer is considerably more complex than presented here: more units are recognized, and more patterns and variants are handled. Consult the attached perl script to understand the reality.
Note that the values for unit (e.g. "cm") and dimension (e.g. "W") will need to be converted to their standard CSpace representation prior to uploading!
Don't forget to combine the values extracted from free text with the values extracted directly as structured data (see Object Dimensions data mapping)
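
A few of the rules above can be sketched in a handful of lines. The following is a minimal Python illustration, not the attached perl script: the function name `recognize`, the short unit list, and the decision to handle only unlabeled two-value strings are all assumptions made for the example. It covers the distributive law for units, the " and ' abbreviations, the default "L." and "W." extents with confidence 2, numeric ranges kept as is, and the fallback to inches with confidence 1.

```python
import re

# A number, or a range such as 6.5-7.1 (ranges are kept as is).
NUM = r"\d+(?:\.\d+)?(?:-\d+(?:\.\d+)?)?"

# Assumption: only a handful of units; the real script recognizes more.
UNIT_ALIASES = {'"': "in.", "'": "ft.", "in": "in.", "in.": "in.",
                "ft": "ft.", "ft.": "ft.", "mm": "mm", "cm": "cm", "m": "m"}
UNIT = "|".join(re.escape(u) for u in sorted(UNIT_ALIASES, key=len, reverse=True))
PART = re.compile(rf"({NUM})\s*({UNIT})?")


def recognize(text):
    """Return (normalized string, confidence) for an unlabeled 'A x B' string."""
    parts = re.split(r"\s*[xX]\s*", text.strip())
    if len(parts) != 2:
        return text, 0  # outside what this sketch handles
    values, units = [], []
    for part in parts:
        m = PART.fullmatch(part)
        if m is None:
            return text, 0  # not "grammatical"
        values.append(m.group(1))
        units.append(UNIT_ALIASES.get(m.group(2)))
    confidence = 2  # extent types are imputed, so medium confidence at best
    if units == [None, None]:
        units = ["in.", "in."]  # assumption: inches when no unit is given
        confidence = 1
    else:
        # Distributive law: a unit given once applies to both values.
        units = [u or units[1 - i] for i, u in enumerate(units)]
    labels = ["L.", "W."]
    out = " ; ".join(f"{l} {v} {u}" for l, v, u in zip(labels, values, units))
    return out, confidence


print(recognize('17"X9"'))
print(recognize("21 x 16mm"))
```

Strings with labeled extents, three or more values, or measurements restated in a second unit are deliberately left at confidence 0 here; handling them is where the real recognizer earns its complexity.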
