WikiProjectMed:Tech/NC Commons IA Extension


Overview

The aim of this extension is to allow NC Commons to store media files on the Internet Archive.

Ticket requesting review: https://store.skizzerz.net/viewticket.php?tid=305545&c=69OwEE3t

Requirements

  • Media files are on IA
  • Metadata is in database for NC Commons
  • Thumbnails are generated on the fly
  • Search uses NC Commons data

Current Behavior

Viewing a file and its metadata on NC Commons

  • Note that this is a wiki page, but the underlying img src is a direct link
  • img src is <img ... src="/media/2/21/Large_melanoma_in_situ_%28DermNet_NZ_0230-MiS-macro-v2%29.jpg?20210420112541" ...>
  • Clicking on the image displays that direct URL, served from the NC Commons file system

Use of image in MDWiki

  • View https://mdwiki.org/wiki/Melanoma
  • Has the above large_melanoma image on page
  • img link is to the above image page in NC Commons
  • Thumbnail is direct link to NC Commons file system
  • img src is <img ... src="https://nccommons.org/media/thumb/d/de/Melanoma_in_situ_12_macro_%28DermNet_NZ_0230-MiS-macro%29.jpg/250px-Melanoma_in_situ_12_macro_%28DermNet_NZ_0230-MiS-macro%29.jpg"
  • Click on image to view it
  • Goes to https://mdwiki.org/wiki/Melanoma#/media/File:Melanoma_in_situ_12_macro_(DermNet_NZ_0230-MiS-macro).jpg
  • Still on mdwiki
  • But image link is to NC Commons file system: <img crossorigin="anonymous" src="https://nccommons.org/media/d/de/Melanoma_in_situ_12_macro_%28DermNet_NZ_0230-MiS-macro%29.jpg"
  • Click on image to enlarge
  • User is taken to above NC Commons direct rendering of image
  • https://nccommons.org/media/d/de/Melanoma_in_situ_12_macro_%28DermNet_NZ_0230-MiS-macro%29.jpg
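The URLs listed above are consistent with MediaWiki's hashed upload directories, where the two directory levels are taken from the MD5 hash of the file name. The following is a minimal sketch of that layout in Python (used here for brevity; the extension itself would be PHP). It assumes NC Commons uses the default two-level hashing, which is not confirmed on this page, and all function names are hypothetical.

```python
import hashlib
from urllib.parse import quote

# Assumption: media is served from this base path, as in the URLs above.
MEDIA_BASE = "https://nccommons.org/media"


def hash_path(file_name: str) -> str:
    """Return the two-level hashed directory (for example 'd/de') for a file name."""
    # MediaWiki stores file names with underscores instead of spaces.
    # First-letter capitalisation is not handled in this sketch.
    key = file_name.replace(" ", "_")
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}"


def original_url(file_name: str, base: str = MEDIA_BASE) -> str:
    """Build the direct URL of the full-size file."""
    key = file_name.replace(" ", "_")
    return f"{base}/{hash_path(key)}/{quote(key, safe='')}"


def thumb_url(file_name: str, width: int, base: str = MEDIA_BASE) -> str:
    """Build the direct URL of a thumbnail of the given pixel width."""
    key = file_name.replace(" ", "_")
    encoded = quote(key, safe="")
    return f"{base}/thumb/{hash_path(key)}/{encoded}/{width}px-{encoded}"
```

Any mechanism that moves the originals to IA would have to keep these URLs working or rewrite them when the HTML is generated, because both NC Commons and MDWiki pages link to them directly.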

Caveats

  • There needs to be a mechanism so that NC Commons image names don't collide with IA names. I had thought of naming the containing item nc_commons_<id>, where id came from the image table, but no such column exists.
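One possible way around the missing id column is to derive the item name from the file name itself. The sketch below is only an illustration of that idea, not a decided design: the nc_commons_ prefix comes from the caveat above, while the hash suffix, the allowed character set and the 80-character cap are assumptions that would need checking against IA's identifier rules.

```python
import hashlib
import re

# Assumption: a conservative length cap for IA identifiers.
MAX_IDENTIFIER_LENGTH = 80


def ia_identifier(file_name: str, prefix: str = "nc_commons_") -> str:
    """Derive a deterministic IA item identifier from a wiki file name."""
    key = file_name.replace(" ", "_")
    # A short hash of the full name keeps identifiers distinct even when
    # two names reduce to the same sanitised slug.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
    # Assumption: only letters, digits, '.', '_' and '-' are safe to use.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", key).strip("-._")
    max_slug = MAX_IDENTIFIER_LENGTH - len(prefix) - len(digest) - 1
    slug = slug[:max_slug]
    return f"{prefix}{slug}_{digest}" if slug else f"{prefix}{digest}"
```

A prefix makes collisions with unrelated IA items unlikely but cannot rule them out, so the upload step would still need to check whether the identifier is already taken.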

Strategy 1

  • Use the AWS extension as a model
  • Assume it has all the functions required to intercept media reads and writes
  • Use it to retrieve media through the S3 interface to the Internet Archive (remains unproven)
  • Shell out from PHP to the Internet Archive ia command-line tool to upload media to the Internet Archive
  • Include metadata with upload for reference, but search remains via NC Commons
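The upload step could look roughly like the following. This is a sketch in Python rather than PHP, and it only builds the argument list for the ia tool; the function name and the example values are hypothetical. It relies on the tool's `ia upload <identifier> <file> --metadata=<key:value>` form and assumes the tool has already been configured with credentials.

```python
import shlex
from typing import Dict, List


def build_ia_upload_command(
    identifier: str, file_path: str, metadata: Dict[str, str]
) -> List[str]:
    """Build the argument list for uploading one file with the ia tool."""
    cmd = ["ia", "upload", identifier, file_path]
    # Each metadata field is passed as its own --metadata=key:value option.
    for key, value in metadata.items():
        cmd.append(f"--metadata={key}:{value}")
    return cmd


def as_shell_string(cmd: List[str]) -> str:
    """Render the command with shell quoting, for logging or debugging only."""
    return shlex.join(cmd)
```

Passing the list straight to a process runner (for example `subprocess.run(cmd, check=True)`) avoids shell interpolation of file names; a PHP implementation would need the same care with escaping.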

Strategy 2

Two main parts: viewing and uploading. The biggest difference between MediaWiki files and IA files is that MediaWiki treats each individual file as a page with its own metadata, whereas the IA has metadata for a set of files. In the MediaWiki use case, quite often that would mean that an item has only one file (e.g. a single photograph) — but there are plenty of cases where derivatives, cropped versions, etc. could be sensibly stored together with the same metadata.

In this strategy, there is no distinction made between files uploaded via the new wiki-based IA uploading mechanism and files already on IA or uploaded there directly. Files uploaded

Using and viewing

  • A new parser function, e.g. {{#internetarchive |image |id=<id> |file=<path to image> |size= |caption= |<etc…> }} (including most of the normal parameters of images).
  • It would generally be wrapped in a template for easier use, e.g. {{ia-image|<id>|<path>|<size>}}.
  • For example, the following wikitext would result in the following HTML (simplified):
    • {{ia-image|AbcDef123|Lorem_ipsum.png|300px|thumb|Lorem ipsum.}}
    • <div class="thumb tright"><div class="thumbinner" style="width:300px;"><a href="/wiki/Special:InternetArchive/AbcDef123/Lorem_ipsum.png" class="image"><img src="/wiki/Special:InternetArchiveFile/AbcDef123/Lorem_ipsum.png?width=300px" decoding="async" class="thumbimage"></a><div class="thumbcaption">Lorem ipsum.</div></div></div>
  • When parsed, it would fetch (and cache) the item's metadata, and look for the requested file.
  • If found, the file would be downloaded and resized, and the smaller version kept in a new FileRepo (the original would not be kept locally, to reduce storage requirements).
  • The files would be stored in the local filesystem by default, or, if the AWS extension or another FileBackend system is installed, in that backend instead.
  • It sounds like serving thumbnails directly from IA is not a good idea, because it can be pretty slow. This should be investigated more though, because it could be a good design to have all of an item's derivative files stored together in one place (a standard filename system for this would be needed). IA doesn't have an API for thumbnails of arbitrary images (it does for book images, at Open Library, but this is limited to paged files such as PDFs and DjVus).
  • The parser function could also have other methods included, for displaying item metadata.
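The metadata lookup described in the list above can be sketched as follows. It assumes IA's metadata endpoint at archive.org/metadata/<id>, whose JSON response includes a "files" list with a "name" for each file, and the archive.org/download/<id>/<path> URL form for fetching the original. The function names are hypothetical, and caching and resizing are left out.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen


def fetch_item_metadata(item_id: str) -> dict:
    """Fetch an item's metadata from IA (makes a network request)."""
    url = f"https://archive.org/metadata/{quote(item_id, safe='')}"
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def find_file(item_metadata: dict, path: str) -> Optional[dict]:
    """Return the entry for the requested file, or None if the item lacks it."""
    for entry in item_metadata.get("files", []):
        if entry.get("name") == path:
            return entry
    return None


def download_url(item_id: str, path: str) -> str:
    """Build the URL from which the original file would be downloaded."""
    # Slashes in the path are kept so files in subdirectories still resolve.
    return f"https://archive.org/download/{quote(item_id, safe='')}/{quote(path)}"
```

For the ia-image example, the parser function would call `find_file(metadata, "Lorem_ipsum.png")` on the metadata for AbcDef123 and show an error instead of an image when the result is None.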

This would mean that NC Commons and MDWiki would handle IA files pretty much the same, rather than NC Commons being a traditional MediaWiki shared repository, because effectively IA becomes the shared repository so there's no need for an extra step.

Uploading

  • A new special page, e.g. Special:InternetArchive, with an upload form somewhat similar to the normal IA upload form (i.e. all the useful fields).
  • New items would all be created with a wiki-specific service account as their owner, and perhaps be added to a new IA collection if that's desired, but in all other ways would be normal IA items.
  • Existing items would be editable: all metadata could be changed, and files added and removed. There might be some limitations on this interface, to keep it simpler to implement (if people need the full power, they can always go direct to IA and upload and manage more complicated items there).

Searching and browsing

  • All IA items used within the wiki would have view pages similar to MediaWiki's File namespace, e.g. at Special:InternetArchive/<id>, where the relevant metadata would be displayed.
  • Items' metadata would be cached locally (and refreshed periodically via a maintenance script) and displayed in normal search results (although perhaps in a separate column, similar to how interwiki results are shown on Wikimedia sites).
  • The items that are included would probably have to be either those that are in use on the wiki, or those that belong to an IA collection (or both).
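The local metadata cache and its periodic refresh could work along these lines. This is a sketch only: it uses SQLite in place of the wiki's database, and the table name, column names and function names are all hypothetical.

```python
import json
import sqlite3
import time
from typing import List, Optional


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open the cache database and create the table if it is missing."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS ia_item_cache ("
        "item_id TEXT PRIMARY KEY, metadata TEXT NOT NULL, fetched_at REAL NOT NULL)"
    )
    return db


def store(db: sqlite3.Connection, item_id: str, metadata: dict,
          now: Optional[float] = None) -> None:
    """Insert or refresh the cached metadata for one item."""
    now = time.time() if now is None else now
    db.execute(
        "INSERT OR REPLACE INTO ia_item_cache VALUES (?, ?, ?)",
        (item_id, json.dumps(metadata), now),
    )
    db.commit()


def stale_items(db: sqlite3.Connection, max_age_seconds: float,
                now: Optional[float] = None) -> List[str]:
    """List items the maintenance script should fetch again."""
    now = time.time() if now is None else now
    rows = db.execute(
        "SELECT item_id FROM ia_item_cache WHERE fetched_at < ? ORDER BY item_id",
        (now - max_age_seconds,),
    )
    return [row[0] for row in rows]


def search(db: sqlite3.Connection, term: str) -> List[str]:
    """Naive substring search over cached metadata ('%' and '_' act as wildcards)."""
    rows = db.execute(
        "SELECT item_id FROM ia_item_cache WHERE metadata LIKE ? ORDER BY item_id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

A real implementation would more likely index selected fields (title, description, license) for the wiki's search backend than match against the raw JSON.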

Questions/Issues Strategy 1

  • Tim: https://www.mediawiki.org/wiki/Manual:Image_administration says images are served directly by the web server, so not sure how to intercept that.
  • Tim: Unless an NC Commons image that is on IA is treated as an external image when html is generated.
  • Sam: Is there any concern with many items being created and each containing only one or a small number of files?
  • Tim: this is IA's recommendation:

Currently we recommend that items not exceed 1000 files or 500GB. Larger than that will tend to break either the use of xml files on the site or prevent internal operations such as shuffling files between servers.

  • Tim: I would expect an item to contain the primary media object and at least one thumbnail.
    • This makes sense to me, and I like the way that it'd keep everything together in one place. But we talked at some point about the fact that serving images direct from IA is quite slow sometimes. Is that a concern? Samwilson (talk) 03:52, 9 April 2023 (UTC)

Questions/Issues Strategy 2

  • Tim: To summarize, this would replace NC Commons with IA and allow any IA file to be included on a MediaWiki page.
  • Tim: Will this meet WMF security requirements in terms of quality of content displayed and privacy leakage?
  • Tim: How do we guarantee that metadata like license is available for any embedded content?
    • We can't guarantee it, but we can look for the licenseurl field and if it's not present or it's not an appropriate license then show an error. Samwilson (talk) 04:02, 9 April 2023 (UTC)
  • Tim: I don't find search very strong on IA.
    • I agree, and I think we'd probably want to mirror the metadata into a DB table within MediaWiki (for all items in use, or all in a designated collection… I'm not sure). Samwilson (talk) 03:53, 9 April 2023 (UTC)
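The licenseurl check suggested in the discussion above could be sketched like this. It assumes the item-level fields sit under the "metadata" key of IA's metadata response. The list of accepted licenses is purely a placeholder assumption, since this page does not say which licenses NC Commons accepts, and the function name is hypothetical.

```python
from typing import Optional

# Placeholder assumption: the real list of accepted licenses is a policy
# decision that is not specified on this page.
ACCEPTED_LICENSE_PREFIXES = (
    "https://creativecommons.org/licenses/by-nc/",
    "https://creativecommons.org/licenses/by-nc-sa/",
    "https://creativecommons.org/licenses/by-nc-nd/",
)


def license_error(item_metadata: dict) -> Optional[str]:
    """Return an error message if the item may not be embedded, else None."""
    url = item_metadata.get("metadata", {}).get("licenseurl")
    if not isinstance(url, str) or not url.strip():
        return "No licenseurl field is present for this item."
    # Treat http and https forms of the same license URL as equivalent.
    normalized = url.strip().replace("http://", "https://", 1)
    if not normalized.startswith(ACCEPTED_LICENSE_PREFIXES):
        return f"License not accepted: {url}"
    return None
```

The parser function would render the returned message in place of the image, so that content without a usable license is never embedded silently.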