TestDataResults

From PreservWiki

Results

After the classification process was complete, the results were collated and compared with classification by file extension and classification by ROAR.

The Differences

ROAR: DROID v1.2 with signature file version 12.
Preserv2-EPrints-Toolkit: DROID v3.0 with signature file version 13.

Results URL

For an exact breakdown of the classification results and comparisons for all 13 repositories:

http://www.preserv.org.uk/testing/repositories/

General Observations

The Preserv2-EPrints-Toolkit version of DROID has difficulty differentiating between different types of text and rtf files.
As a result it classifies all text files as one type and, similarly, all rtf files as one type.
PRONOM is lacking a complete set of mime-types for its format data, which is key when it comes to changing identifiers.
The Preserv2-EPrints-Toolkit version of DROID is able to classify more objects, cutting down on the number of unknowns.
Some files have changed classification in the base repositories but remain of the same mime-type. This could be a DROID classification error or a manual user update; unfortunately there is no way to tell.

Typical Repository Outcomes (994 Files)

2 files previously of known format are now unknown?
Out of the 994 files only 83 remain of concern, with 43 still unknown and 40 not matching their original classification even by mime-type. Of the remaining files, 567 matched their previous classification exactly and 256 (most likely text and rtf) matched by mime-type. This backs the general conclusion that DROID v3.0 with sig file v13 has issues determining the exact file version of some types of simple file.
- 184 text files changed classification but stayed of the same mime-type.
- 45 tiff files changed classification but stayed of the same mime-type.
Extension Mismatches: 88
- Reasons have been investigated only for individual repositories, where it is believed the majority of reasons have been covered.
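
The outcome categories used throughout this page (exact match, match on mime-type, newly classified, still unknown, non-matching) can be sketched as a small comparison function. This is only an illustration of the bucketing, not the toolkit's actual code: the `Classification` type, the function names and the category strings are all assumptions made for the example.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Classification(NamedTuple):
    """One classification result (hypothetical type for this sketch)."""
    format_id: Optional[str]  # format identifier, None when the file is unknown
    mime_type: Optional[str]  # None when no mime-type is registered


def compare(old: Classification, new: Classification) -> str:
    """Return the outcome category for one file (category names are illustrative)."""
    if new.format_id is None:
        return "still unknown" if old.format_id is None else "now unknown"
    if old.format_id is None:
        return "newly classified"
    if old.format_id == new.format_id:
        return "exact match"
    if old.mime_type is not None and old.mime_type == new.mime_type:
        return "mime-type match"
    return "non-matching"


def tally(pairs):
    """Count outcome categories over (old, new) classification pairs."""
    return Counter(compare(old, new) for old, new in pairs)


# Made-up identifiers, only to exercise each branch.
html = Classification("example/html-4.0", "text/html")
xhtml = Classification("example/xhtml", "text/html")
pdf = Classification("example/pdf", "application/pdf")
unknown = Classification(None, None)

print(tally([(html, html), (html, xhtml), (unknown, pdf), (pdf, html), (unknown, unknown)]))
```

Checking the mime-type only after the exact identifier has failed to match is what separates the "matched on mime-type" bucket from the "absolute non matching" one.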

Specific Repository Outcomes

Philsci (94 Files)

EPrints 2.2.1 (pepper)

10 objects classified which were previously unknown.
75 exact matching classifications and 6 matching on mime-type (all text/rtf files).
5 remain unknown.

Senado (99 Files)

DSpace Repository
89 unknown on input, 89 unknown on output
- These 89 files were in fact only 2 distinct files, which were provided by the publication URL.
- As I can't read Portuguese/Spanish, I can only guess that these pages describe a method of getting at the document, such as payment required (HTTP 402) or resource unavailable (HTTP 404). The pages are malformed HTML which doesn't provide an HTTP header code (as stated by me) to describe the page. Basically a very bad web site implementation (Web < 0.1).
All other classifications matched exactly.
Further analysis of these files was not done: such a bad repository is rare enough that it would be a study in its own right.

Minho (98 Files)

DSpace 1.5.1
3 newly classified files
90 exact matches
3 matching on mime-type
2 non-matching
- Previously classified as HTML, now PDF (embargoes?)

Good result

ANU (76 Files)

DSpace
All 76 files I could access matched their imported classifications in every way.
HOWEVER: DROID wrongly classifies the first 10 objects in the repository as tar files when they are actually compressed JPEGs.
Objects which I couldn't retrieve could not be classified and compared.

Roskilde (98 files)

DSpace
Newly classified (previously unknown): 2
1 still unknown
1 non-matching classification
3 matched on mime-type
- PDF changes this time.
91 exact matches
1 item with extension mismatch
- PDF which has since been removed from the repository (or hidden)
- Sends a 200 OK header for Item Withdrawn page - It should send a "410 Gone" header.

Stirling (102 Files)

DSpace
Newly classified (previously unknown): 1
Matched on mime-type: 5
Exact matches: 63
Absolute non matching: 33
- These are documents previously classified as PDF which have been newly classified as HTML.
- This is due to an embedded 302 error not being sent as a header; this appears to have been fixed since I took the original dataset.
- This also causes a lot of extension mismatches
Format Mismatches (after above files are ruled out): 3
- These three are Word docs which have been classified as Excel spreadsheets. All wrongly!

ECS (94 Files)

EPrints 3.1
Still Unknown: 2
Newly Classified (previously UNKNOWN): 1
Matched on Mime Type: 1
Exact Matching Classifications: 88
Absolute non matching classifications: 2
- 1 HTML is now a PDF
- 1 HTML is now a Web Archive Document (message/rfc822)
0 extension mismatches :)

Glasgow (99 Files)

EPrints 3.1.1
Newly Classified (previously UNKNOWN): 7
Matched on Mime Type: 2
Exact Matching Classifications: 88
Still UNKNOWN: 2
- Invalid HTML redirect pages; the redirect pages are correctly used.
Extension Mismatches: 0

Good repository

Queensland (99 Files)

EPrints 3.1.1 (Port and Brandy)
Newly Classified (previously UNKNOWN): 6
Exact Matching Classifications: 93
Extension Mismatches: 0

Good Repository

All PDF repository, everything classified, no problems, other than the fact that it's all PDF.

E-Lis (96 Files)

EPrints 3.1.2.1 (Chocolate-coated Coffee Bean)
Newly Classified (previously UNKNOWN): 7
Exact Matching Classifications: 88
Absolute non matching classifications: 1
- Probably due to the upgrade in the DROID sig file, as they are similar types
- fmt-99 (Hypertext Markup Language (4.0)) to x-fmt-429 (Microsoft Web Archive - message/rfc822)
Extension Mismatches: 0

Good Repository

Soton (97 Files)

EPrints 2.x

Newly Classified (previously UNKNOWN): 4
Matched on Mime Type: 6
Exact Matching Classifications: 85
Absolute non matching classifications: 1
- PDF now HTML
Still UNKNOWN: 1
- The file has a Shockwave Flash (Flash) extension; DROID may not have a signature for this format.
Extension Mismatches: 1
- This file does not exist but sends back a 302 Found HTTP code!

Tartu (98 Files)

DSpace
Newly Classified (previously UNKNOWN): 4
Matched on Mime Type: 14
Exact Matching Classifications: 60
Absolute non matching classifications: 19
- 18 are TIFFs which are now classified as HTML pages
- 1 is a PDF which is now classified as an HTML page
Still UNKNOWN: 1
- A 650MB zip file, part of a much larger set of zip files. Should DROID check for consecutively numbered zip files, and how should this be handled?
Extension Mismatches: 23
- Authentication Required or Item Removed
- Authentication required sends back a 302 (temporarily moved!!! wrong header)
- Item removed sends a 200 OK header

Conclusions - DROID (Ordered)

The Preserv2-EPrints-Toolkit, which includes a DROID wrapper, caching database (as an EPrints dataset) and results page (as EPrints Admin Screen), was fully tested and found to be fully working (after some minor bug fixes). The toolkit can be downloaded at http://files.eprints.org/422/.

Once all the data was imported into an EPrints repository it was then run through the toolkit, allowing DROID to classify the files before comparing these file classifications with the ones from ROAR, which is running an older version of DROID and version 12 of the signature file.

The new version of DROID with the new signature file is able to classify more files than before. 275 were unknown on import and 146 remain unknown; however, 89 are unknown in both datasets, which can be put down to the fact that these files are not valid. Discounting these, 186 were unknown originally versus 57 now, which is a major improvement.
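
The adjustment above is simple arithmetic and can be checked directly. The variable names are only labels chosen for this sketch; the three input counts are the ones quoted in the paragraph.

```python
unknown_on_import = 275  # unknown in the ROAR classification
unknown_after = 146      # unknown after re-classification with the toolkit
invalid_in_both = 89     # unknown in both datasets because the files are not valid

adjusted_before = unknown_on_import - invalid_in_both
adjusted_after = unknown_after - invalid_in_both
reduction = 1 - adjusted_after / adjusted_before

print(adjusted_before, adjusted_after)  # 186 57
print(f"{reduction:.1%}")               # share of the remaining unknowns now classified
```

With the invalid files discounted, roughly two thirds of the previously unknown files are now classified.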

That said, the newer version of DROID is much worse at identifying the exact file version for some formats, notably text, rtf and tiff. In these cases DROID could not differentiate between types of text, e.g. Comma Separated Values text file and Macintosh formatted text file. The same holds for rtf and tiff, and perhaps for other formats, but these are the cases that exist in significant numbers within our dataset, i.e. in repositories. For each of these types DROID identified all files as the same file version, grouping them together inaccurately.

To address the above problem, we feel it is important to use DROID to establish the mime-type of the file first, and only then attempt to classify the exact version of the file. Unfortunately the PRONOM registry lacks the mime-type data for many formats (most of the text/plain ones in particular).

A few files of other types show the same symptom as mentioned in the previous point. However, there are too few of them to conclude whether DROID is at fault or whether the repository has in fact updated the file to a newer version of the same type. Unfortunately there is no way to tell.

More concerning: DROID wrongly classifies some Word docs and Excel spreadsheets. This is a big problem as it changes the suspected mime-type of the file. Maybe it should pay more attention to the file extension when doing an initial classification. This may also cut down on the classification time.

- You can't blame DROID entirely, however, as most of the files it can't classify are badly formed. In other words, a repository can lie about the mime-type, whereas DROID requires the data within the file to be accurate. Which is more important in the long term?
- Missing signatures, for example for .tex files, are one area where DROID could be better.

An alternative means of format classification is simply to examine file extensions. Comparing results based on file extensions rather than DROID classification reveals some surprising results. Doing the whole process by file-extension more accurately locates the correct mime-type. Doing the process this way does not tell you file format versions, however.
Taking into account all of the errors in the harvesting, 99.8% of the files harvested in our survey matched by file extension to their mime-type. However, in some cases the file extension was followed by a .1, .2, .3 or .backup, among other human-readable descriptions. Only 4 (0.2%) of 2144 files in the entire dataset had no extension.
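
As a rough sketch of classification by extension, the following uses Python's standard `mimetypes` module and strips trailing markers such as .1, .2 or .backup before looking up the extension. The function name and the exact list of trailing markers are assumptions made for this example; this is not the script used in the survey.

```python
import mimetypes
import re

# Trailing markers seen after the real extension (.1, .2, .3, .backup).
# The exact pattern is an assumption; other human-readable suffixes exist.
_TRAILING = re.compile(r"\.(\d+|backup)$", re.IGNORECASE)


def guess_mime_by_extension(filename):
    """Guess a mime-type from the extension, ignoring trailing .N or .backup."""
    name = filename
    while _TRAILING.search(name):
        name = _TRAILING.sub("", name)
    mime, _encoding = mimetypes.guess_type(name)
    return mime  # None when there is no usable extension


for example in ("thesis.pdf", "thesis.pdf.1", "notes.txt.backup", "README"):
    print(example, "->", guess_mime_by_extension(example))
```

As noted above, this only yields a mime-type and says nothing about the file format version.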
Using DROID, 146 of the 2144 files could not be classified, which is a 93.2% classification rate. Allowing for the further files that were wrongly classified, this comes down to 92.75%. Classification by extension will classify an estimated 99.8% of the files correctly (if the extension is correct; this was not tested for all files, only for the fringe cases). However, this does not give you any information about the file version.
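
The headline rates can be recomputed from the raw counts given in the text. The variable names are labels for this sketch only; the count of wrongly classified files is not stated, so the lower DROID figure is not recomputed here.

```python
total_files = 2144
unclassified_by_droid = 146
files_without_extension = 4

droid_rate = (total_files - unclassified_by_droid) / total_files
no_extension_share = files_without_extension / total_files

print(f"DROID classified:  {droid_rate:.2%}")          # 93.19%
print(f"No file extension: {no_extension_share:.2%}")  # 0.19%
```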

In Progress

Conclusions - Digital Preservation

Repositories do not know how to use HTTP status codes correctly. This is based on the number of files that appeared to change format because they were delivered as HTML when originally they were in a completely different format. What was happening was that the original content could not be delivered for some reason, and an HTML error page was generated instead. In these cases the source provided incorrect or incomplete header information; correct headers would have resolved the problem. Thus, externally classifying a repository using HTTP as the transport method and OAI-PMH as the harvesting method can lead to results that are inconsistent with the actual content of the repository.
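
A harvester can at least flag the suspicious cases. The sketch below assumes the harvester knows the mime-type it expects for each file (for example from the repository metadata) and compares it with the status code and Content-Type actually served. The function name and the result strings are invented for this example.

```python
def check_response(status, content_type, expected_mime):
    """Classify one HTTP response against the mime-type the harvester expected."""
    served = content_type.split(";", 1)[0].strip().lower()
    expected = expected_mime.strip().lower()
    if status >= 400:
        # The repository reported the problem properly (e.g. 401, 404, 410).
        return "error reported correctly"
    if 300 <= status < 400:
        return "redirect: follow and re-check"
    if served == "text/html" and expected != "text/html":
        # A success code with an HTML body where another format was promised:
        # probably an error, login or "item withdrawn" page.
        return "suspect: HTML served in place of " + expected
    return "ok"


print(check_response(200, "text/html; charset=UTF-8", "application/pdf"))
print(check_response(410, "text/html", "application/pdf"))
print(check_response(200, "application/pdf", "application/pdf"))
```

Files flagged as suspect can then be excluded from the format comparison instead of being counted as a change from PDF to HTML.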

OLD NOTES (should now be mostly ignored)

For now this is a bullet pointed list which needs bulking out:

When harvesting repositories, about 1/3 of downloads fail for various reasons. Further investigation?

ROAR identifies a lot of stuff as HTML, as it may get redirected to an HTML page in the process of trying to get a resource. This is wrong, as the page is not the resource; we believe the bug is that the repository sends an HTTP 200 and not an HTTP 401 header. This is particularly the case on the ANU repository, where we have only 74 items, as fmt/94 (html) is the redirected 401 page. Further investigation?

DROID does not like being fed documents which contain possibly bad URLs; I will need to check whether this is a problem with DROID (I suspect not) or with my code.
- File in question: http://dspace.mit.edu/47:14Z%20(GMT).%20No.%20of%20bitstreams:%20263191853.pdf:%207781517%20bytes,%20checksum:%20f3b11c5d00fa6a9a02d4c906c1825757%20(MD5)63191853MIT.pdf:%207786389%20bytes,%20checksum:%2032bfd24738365b6456b4d2a10a961c6b%20(MD5)&.
- FIXED: DROID does not handle shell escaping (in this case bash) when file names are handed to it, thus it cannot read the file. This was worked around by handing DROID a "droid list XML file" which contains an XML-encoded link to the file.
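
The idea behind the workaround can be sketched as follows: write the awkward file names into an XML document, letting the XML library do the encoding, so that the names never pass through a shell. The element names `FileList` and `File` are hypothetical placeholders and are NOT DROID's actual file list schema, which should be taken from the DROID documentation.

```python
import xml.etree.ElementTree as ET


def build_file_list(paths):
    """Build an XML list of file paths (element names are hypothetical)."""
    root = ET.Element("FileList")
    for path in paths:
        # ElementTree escapes &, < and > in element text when serialising.
        ET.SubElement(root, "File").text = path
    return ET.tostring(root, encoding="unicode")


awkward = "/tmp/harvest/No. of bitstreams: 2 (MD5) & more.pdf"
xml_text = build_file_list([awkward])
print(xml_text)

# Reading the document back gives the original path unchanged.
assert ET.fromstring(xml_text)[0].text == awkward
```

Spaces, parentheses and ampersands then need no shell quoting at all, since only the name of the list file appears on the command line.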

FIRST PARSE RESULTS (with the above correction)
- All 994 files were located by the EPrints3.2 classification script which can then be fed to DROID.
- Only 45 objects were left unclassified which is better than the 129 which were UNKNOWN at the time of importation.
- All other comparisons are difficult at this stage.
- Some initial observations:
  - DROID is classifying the objects differently from the original classification; however, these are still the same in mime-type. For example, an HTML 4.0 file is now being classified as an XHTML or just HTML file. The same is true of text files and some PDFs. We suspect this is due to the Tentative classifications which DROID provides for some files. As a result we are now including the classification quality (tentative, positive) in the data output from a classification.
- The update_pronom_uids script tries to process NULL and thus dies because it receives an error 400; technically it should only die if the service is not reachable, or it should handle errors on a per-object basis.
- In the philsci repository the number of files with pronomids is correct but the division by mime-type is wrong.
- No error handling on the formats_risks page if nothing has been classified!
- The files missing bug was a problem with the script generating the EP3XML not escaping URLs correctly. With this fixed all repositories populate correctly. NEED TO SUBMIT THIS TO GOOGLE CODE OR SOMETHING?
