User talk:JL-Bot/Archive 5


Follow-up

This is a follow-up to User talk:JL-Bot/Archive 4 § Need help. Neither of the two pages mentioned in the previous post has updated since you manually updated them on 1 April 2019. Any idea what's going on? Mitchumch (talk) 15:36, 18 April 2019 (UTC)

@Mitchumch: are there pages it should have picked up, but didn't? Headbomb {t · c · p · b} 15:41, 18 April 2019 (UTC)
I'm not sure. I've been adding pages to the project off and on. I thought I would see a weekly run, even if nothing changed. Is that an incorrect assumption? Mitchumch (talk) 15:43, 18 April 2019 (UTC)
I believe if there's nothing to report, there won't be any changes to those pages. The last run for WP:RECOG stuff occurred on April 2nd however. Headbomb {t · c · p · b} 17:46, 18 April 2019 (UTC)
Does the bot run on a specific day of each week? Mitchumch (talk) 03:35, 19 April 2019 (UTC)
Thank you for manually running the bot. There were additional entries added for "Quality articles" here and "Open tasks" here. Any idea why the bot didn't run for this project? Mitchumch (talk) 06:26, 19 April 2019 (UTC)
My time is limited at the moment so bot runs will be erratic for a while. -- JLaTondre (talk) 16:41, 21 April 2019 (UTC)

Latest JCW run

The bot ran... but it didn't seem to do anything. Did you feed it the old dump by mistake? Headbomb {t · c · p · b} 14:39, 24 April 2019 (UTC)

Not quite, but the effect was the same. It will be up shortly. -- JLaTondre (talk) 00:27, 25 April 2019 (UTC)
I think whatever issue was with that messed up run happened again. Headbomb {t · c · p · b} 18:26, 5 May 2019 (UTC)
I'm not seeing anything unexpected. There have been no changes to Citations.cfg or Publishers.cfg since the last run and only one addition to Questionable.cfg[1]. The majority of the updates for Publishers and Questionable seem related to changes in the wiki itself for the relevant categories or redirects based on the configurations. -- JLaTondre (talk) 19:23, 5 May 2019 (UTC)
Well, I know for a fact that I did a lot of cleanup related /Target /Publisher and /Questionable before May 1st. And even if I didn't, [2] should show a lot of changes in /Target alone from natural updates occurring from April 20th to May 1st, or have some changes in the comprehensive alphabetical listings. Headbomb {t · c · p · b} 20:34, 5 May 2019 (UTC)
Ah, I think I understand the confusion. That run was an update against the 04/20 dump (trying to get closer to this). The 05/01 dump is downloading and then will be processed. -- JLaTondre (talk) 21:13, 5 May 2019 (UTC)
Ah yes, my bad. The timing made me believe it was the new dump. I should have been more careful reading the output. Headbomb {t · c · p · b} 03:34, 6 May 2019 (UTC)

By publisher

Since the CRAPWATCH code is now relatively polished, we can now revisit the 'easy' thing I had in mind back in October. It's simply (or maybe 'simply') taking User:JL-Bot/Publishers.cfg, and outputting things at WP:JCW/PUB.


The code should be pretty much identical (with exclusions still set up at WP:JCW/EXCLUDE), except using {{JCW-PUB-top}}/{{JCW-PUB-rank}} instead of {{JCW-CRAP-top}}/{{JCW-CRAP-rank}}. That and |source= won't exist in those, so it can be omitted. {{JCW-bottom}} will also support |p-id=, similar to |q-id=. Headbomb {t · c · p · b} 23:03, 16 April 2019 (UTC)

Limiting this to 10 publishers per page might be a good idea, given how massive each entry could get. Headbomb {t · c · p · b} 15:07, 17 April 2019 (UTC)
Done. -- JLaTondre (talk) 03:18, 18 April 2019 (UTC)
Thanks. The first and second pages are pretty slow to load though... might be worth revisiting the idea of a size limit instead, like CRAPWATCH used to be done. With ~250 KB per page? Headbomb {t · c · p · b} 03:27, 18 April 2019 (UTC)
Put this on hold for a while. I need to think. It's slow to load, but it should get a bit faster after the cleanup I've been doing. Headbomb {t · c · p · b} 14:56, 14 May 2019 (UTC)

publisher issue

In WP:JCW/Publisher8, the entry for Allen Press reports Proceedings of The Biological Society of Washington, a miscapitalized version of Proceedings of the Biological Society of Washington, but it doesn't report Proceedings of the Biological Society of Washington itself. It also doesn't report Proc. Biol. Soc. Washington and similar, although this might be a bit tricky to do. Headbomb {t · c · p · b} 16:51, 24 May 2019 (UTC)

Processing was based on targets. In this case, Proceedings of the Biological Society of Washington is part of Category:Allen Press academic journals (which is included per the configuration settings), but is a redirect to Biological Society of Washington (which is not included per the configuration settings). As such, it was not being picked up (since "Biological Society of Washington" wasn't specified as being included), but the typo to it was (since typos against the configuration are checked). I changed it so that if a configuration page is not found as a target (the right side of the individual pages), it is checked as a citation (the left side of the individual pages). This impacts both Questionable and Publishers. Please check the new versions. -- JLaTondre (talk) 20:15, 25 May 2019 (UTC)
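Illustrative sketch of the fallback described above (not JL-Bot's actual code). The function name and the two sets are hypothetical stand-ins for the right-hand (target) and left-hand (citation) columns of the individual pages.

```python
def classify_config_entry(entry, targets, citations):
    """Decide how a configuration entry is looked up.

    targets   -- hypothetical set of titles appearing as targets
    citations -- hypothetical set of titles appearing as citations
    """
    if entry in targets:
        return "target"
    # New behaviour: an entry that is not a target is retried as a citation.
    if entry in citations:
        return "citation"
    return "missing"


targets = {"Biological Society of Washington"}
citations = {"Proceedings of the Biological Society of Washington"}
print(classify_config_entry(
    "Proceedings of the Biological Society of Washington", targets, citations))
```

Under these assumptions the redirect is no longer dropped just because its target is outside the configured category.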
Will do. Headbomb {t · c · p · b} 17:31, 26 May 2019 (UTC)
Don't archive this just yet. I'll have some additional feedback. Headbomb {t · c · p · b} 20:39, 4 June 2019 (UTC)

Special issues

For TAR and TAR-like things, if you have Foobar Special Issue Barfoo, treat it as Foobar. For example, treat Critical Inquiry, Special Issue: "Race," Writing, and Difference as Critical Inquiry. Headbomb {t · c · p · b} 20:35, 4 June 2019 (UTC)
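Illustrative sketch of the requested normalization (an assumption about how it could be done, not the bot's implementation): everything from "Special Issue" onward, plus any separator just before it, is dropped.

```python
import re

# Hypothetical rule: cut at "Special Issue" (any capitalisation) and
# remove the punctuation or whitespace that precedes it.
_SPECIAL_ISSUE = re.compile(r"[\s,:;-]*\bspecial\s+issue\b.*$", re.IGNORECASE)


def strip_special_issue(citation):
    """Return the citation with a trailing 'Special Issue ...' part removed."""
    return _SPECIAL_ISSUE.sub("", citation).strip()


print(strip_special_issue('Critical Inquiry, Special Issue: "Race," Writing, and Difference'))
```

A citation that begins with "Special Issue" would be reduced to an empty string by this sketch, so a real implementation would need a guard for that case.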

Done. -- JLaTondre (talk) 13:02, 8 June 2019 (UTC)
Seems to work flawlessly. Thanks! Headbomb {t · c · p · b} 17:06, 8 June 2019 (UTC)

New run

When you have a chance, could you run the bot again? I'm investigating STM Journals, a possible affiliate of OMICS, and need to dig in a bit further. Headbomb {t · c · p · b} 16:06, 10 June 2019 (UTC)

Running. Results should show up in a few hours. -- JLaTondre (talk) 19:41, 10 June 2019 (UTC)

Other things to strip

Both before/after strings

  • Accepted/Accepted in
  • In press
  • Submitted/Submitted to/Submitted in
  • To be published/To be published in
  • To appear/To appear in

This way things like "J Virological Methods 2011 (in press)" get treated as "J Virological Methods", and "Submitted to the International Journal of Social Education" gets treated as "International Journal of Social Education". Headbomb {t · c · p · b} 17:15, 20 June 2019 (UTC)
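Illustrative sketch of stripping these status strings from either end of a citation. It is not the bot's code. Two extra steps are assumptions made only so that the two examples in the request come out as stated: a trailing year is removed, and a leading "the" after the status phrase is dropped.

```python
import re

_STATUS = r"(?:accepted|in\s+press|submitted|to\s+be\s+published|to\s+appear)"
# Status phrase at the start, optionally followed by "in"/"to" and "the".
_PREFIX = re.compile(rf"^\W*{_STATUS}(?:\s+(?:in|to))?(?:\s+the)?\b\W*", re.IGNORECASE)
# Status phrase at the end, with any surrounding brackets or punctuation.
_SUFFIX = re.compile(rf"\W*\b{_STATUS}\W*$", re.IGNORECASE)
# Assumed extra step: drop a trailing four-digit year.
_YEAR = re.compile(r"\W*\b(?:1[5-9]|20)\d{2}\W*$")


def strip_status(citation):
    """Remove publication-status phrases before or after the journal name."""
    text = _PREFIX.sub("", citation)
    text = _SUFFIX.sub("", text)
    text = _YEAR.sub("", text)
    return text.strip()


print(strip_status("J Virological Methods 2011 (in press)"))
print(strip_status("Submitted to the International Journal of Social Education"))
```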

Done. It's running, but will be several hours for results to show up. -- JLaTondre (talk) 12:56, 29 June 2019 (UTC)

If at AFD, no link?

[3]

That seems weird. Could possibly affect RFDs too. Headbomb {t · c · p · b} 20:41, 29 June 2019 (UTC)

The Antique Wireless Association Review is not in the dump. A valid redirect target not being in the dump happens every once in a while. I assume it results under some condition when a page move occurs during a dump, but I'm not sure. I was attempting to implement better error handling for that case, but set the formatting incorrectly. I fixed the formatting. -- JLaTondre (talk) 13:17, 30 June 2019 (UTC)

Automate run whenever /config is updated?

Could the bot check if User:JL-Bot/Citations.cfg or User:JL-Bot/Questionable.cfg were updated, say once per day at 05:00 (UTC) and then automatically run if they were? Headbomb {t · c · p · b} 13:24, 25 October 2018 (UTC)

(Or User:JL-Bot/Publishers.cfg. Headbomb {t · c · p · b} 04:05, 23 June 2019 (UTC))
I don't have it running on a permanent server. That part alone takes about 2.5 hours. I'll see what I can do, but it might not be that frequently. -- JLaTondre (talk) 14:03, 27 October 2018 (UTC)
Well it wouldn't be the full run, just the reprocessing of WP:JCW/TAR and WP:JCW/CRAP (if there's an update to the .cfg pages). But if a full time server is needed, maybe that could be hosted on the WMF servers? Headbomb {t · c · p · b} 14:43, 27 October 2018 (UTC)
2.5 hrs is just for TAR and CRAP. It's several hours more for everything else (and another couple hours to download the dump). There is a LOT of data processing for all this. The fuzzy matching used for both TAR and CRAP in particular. I have some thoughts on performance improvements, but they are unlikely to be dramatic changes. I also have another computer lying around that I think has a faster CPU than the one I use for the bot. But both cases require more work than simply limping along ;-) so I haven't gotten around to it yet. Switching over to Wikimedia Cloud Services is a similar situation. -- JLaTondre (talk) 15:41, 27 October 2018 (UTC)

I've been thinking about this some more, and how about automating just the /CRAP /TAR /PUB parts of the run overnight, if there's been changes to WP:JCW/EXCLUDE, WP:JCW/PUBSETUP or WP:CITEWATCH/SETUP since the last time the bot ran? Either way, feel free to do a bot run whenever. I've been doing lots of cleanup. Headbomb {t · c · p · b} 16:34, 26 June 2019 (UTC)

My plan is to add a mode that only processes the deltas (new or changed entries). That will significantly speed things up. I have been tied up lately, but should have time to start working things again soon. -- JLaTondre (talk) 12:41, 29 June 2019 (UTC)
It probably would, although if you have something like Category:Elsevier academic journals, and the category would acquire new articles during run, that should be factored in for a 'delta'. I feel running the /CRAP/TAR/PUB stuff overnight has the best result/effort ratio, but it's your time, so if you want to optimize the use of computing resources... have at it! Headbomb {t · c · p · b} 15:19, 29 June 2019 (UTC)
Good point about the categories. That would also be the case with redirects. I've set it up to do a full target, selected, & publisher update each night (if applicable configurations have changed). -- JLaTondre (talk) 13:08, 14 July 2019 (UTC)
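Illustrative sketch of the "only if applicable configurations have changed" check. The function name and the timestamps are hypothetical; how the bot actually obtains the last-edit times is not stated in this thread.

```python
from datetime import datetime, timezone


def changed_configs(last_run, last_edited):
    """Return the configuration pages edited after the previous run.

    last_run    -- datetime of the previous bot run
    last_edited -- dict mapping page title to its last-edit datetime
    """
    return sorted(page for page, edited in last_edited.items() if edited > last_run)


# Made-up timestamps, for illustration only.
last_run = datetime(2019, 7, 13, 5, 0, tzinfo=timezone.utc)
edits = {
    "User:JL-Bot/Citations.cfg": datetime(2019, 7, 12, 18, 0, tzinfo=timezone.utc),
    "User:JL-Bot/Questionable.cfg": datetime(2019, 7, 13, 9, 30, tzinfo=timezone.utc),
    "User:JL-Bot/Publishers.cfg": datetime(2019, 7, 13, 21, 15, tzinfo=timezone.utc),
}
pages = changed_configs(last_run, edits)
if pages:
    print("Nightly update needed for:", ", ".join(pages))
```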
Great, that'll be awesome! Headbomb {t · c · p · b} 01:36, 15 July 2019 (UTC)

Regex matching...

Not that I'd want this to run for everything, but, I'm trying to expand the crapwatch to include certain patterns in WP:CRAPWATCH/SETUP#To be published. In particular, if you look at something like WP:JCW/T14, you'll have "To appear in [...]" which has a lot of [...]. While this could probably be hard coded at the bot-level, it would be good to have a way to specify when something is to be treated as a prefix. Something like

{{JCW-selected|TARGET|Prefix*}}

Which in the case of

{{JCW-selected|Unpublished|To appear.*|To be published.*|Submitted.*}}

would match all of

  • Submitted
  • Submitted for a TI Workshop on Corruption and Political Party Funding in la Pietra, Italy.
  • Submitted to IEEE SMC
  • Submitted to Nature
  • Submitted to the International Journal of Social Education
  • Submitted to U.S. Game and Fish, Atlanta, GA.
  • Submitted To: The US Naval Institute's Proceedings - but Not Published in the Journal
  • To Appear in The Cambridge Companion to Einstein (Co-edited with Christoph Lehner)
  • To appear in New Zealand Journal of Mathematics
  • To Appear in Proceedings of the 17th International Conference on Computing in High Energy and Nuclear Physics
  • To appear in Proceedings of the 18th International Symposium on Space Terahertz Technology, Caltech, March 21–23, 2007
  • To appear in the Proceedings of the 17th International Symposium on Space Terahertz Technology, Paris, May 10–12, 2006
  • To be Published
  • To be published
  • To be Published In: "The Local Group as an Astrophysical Laboratory"
  • TO BE PUBLISHED S

The syntax {{JCW-selected|Unpublished|To appear.*|To be published.*|Submitted.*}} could be something different if it makes things easier, like

{{JCW-selected|Unpublished|/To appear.*/|/To be published.*/|/Submitted.*/}}

or

{{JCW-selected|Unpublished|regex1=/To appear.*/|regex2=/To be published.*/|regex3=/Submitted.*/}}

or whatever. Headbomb {t · c · p · b} 17:09, 20 June 2019 (UTC)

This is different than the normalization based processing of common, selected, & publisher pages. This is more akin to the Invalid page. It's easy to do. Let's use a separate template to simplify distinguishing the processing type (it can still remain on that setup page though). To provide the flexibility to have other matches like this in the future, perhaps {{JCW-pattern|Unpublished|...}} (where the first parameter would be the term grouped under). For the patterns themselves, it can use .* to indicate any characters (so would support prefixes: words.*, suffixes: .*words, or even .*words.*words.*). Note: it will not directly accept regex to avoid possible security issues. -- JLaTondre (talk) 13:54, 29 June 2019 (UTC)
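Illustrative sketch of one way to honour ".*" as a wildcard without accepting raw regex: the pattern is split on ".*" and every remaining piece is escaped, so any other metacharacter is matched literally. The helper names are hypothetical, and case-insensitive matching is an assumption based on the example list above (where "To be published.*" is expected to catch "TO BE PUBLISHED S").

```python
import re


def compile_pattern(pattern):
    """Turn a '.*' wildcard pattern into a safe compiled regex."""
    literal_parts = pattern.split(".*")
    body = ".*".join(re.escape(part) for part in literal_parts)
    return re.compile(body, re.IGNORECASE)


def matches(pattern, citation):
    """True if the whole citation fits the wildcard pattern."""
    return compile_pattern(pattern).fullmatch(citation) is not None


print(matches("Submitted.*", "Submitted to IEEE SMC"))
print(matches(".*Sports Encyclopedia", "The 1985 Edition Of The Sports Encyclopedia"))
```

Because only ".*" survives as regex syntax, a configuration entry such as "(a+)+$" is treated as plain text rather than as an expression the regex engine has to evaluate.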
Sounds like a sensible solution. Not sure what security issue there would be, but honestly, having full regex capabilities isn't really critical and I can't think of anything at the moment that would require them. /.*PATTERN.*/ is likely the most complex 'regex' it would get, and if it's still doing the case-insensitive typo matching, then having something like /(The\s*\d+\s*Edition\s*Of\s*)?(The\s*)Sports\s*Encyclopedia/ is pointless when .*Sports Encyclopedia would do just as well. Headbomb {t · c · p · b} 15:28, 29 June 2019 (UTC)
I added [4]. This way things can get tested. Headbomb {t · c · p · b} 15:36, 29 June 2019 (UTC)

Upon further thought, it would be better if this could be combined with the regular template. I.e.

{{JCW-selected|ACM Conference on Foobarology|pattern1=Proceedings of .* Conference on Foobarology}}

to catch ACM Conference on Foobarology + everything that redirects there + typos as usual, PLUS the Proceedings of .* Conference on Foobarology pattern to catch things like Proceedings of the Seventeenth Conference on Foobarology + Proceedings of the 18th Conference on Foobarology + Proceedings of the XIVth Conference on Foobarology + Proceedings of the 2001 Conference on Foobarology Headbomb {t · c · p · b} 20:04, 6 July 2019 (UTC)

Is there a reason to simply not strip "Proceedings of the [NUMBER]" as part of the normalization? Where NUMBER could be 14th, XIVth, or fourteenth formats? If you prefer regex, then I would still want a separate template. If there is a selected and a pattern with the same first parameter, the results of both can be grouped together (i.e. {{JCW-selected|ACM Conference on Foobarology|...}} and {{JCW-pattern|ACM Conference on Foobarology|...}} would have a combined output under ACM Conference on Foobarology). -- JLaTondre (talk) 02:01, 7 July 2019 (UTC)
Stripping "Proceedings of the [NUMBER]" is a bit tricky since you could also have "Transactions of the [NUMBER]", and they can refer to different publications. But you'd also have more complex patterns like "[Number] Proceedings of the Foobar Conference, Held in Helsinki, 3-4 June 2029" etc... Hence why I'd find it desirable to be able to customize what is needed.
That said, combining/merging the results of multiple {{JCW-selected}}/{{JCW-pattern}} when their |1= parameter is an exact match would work. Headbomb {t · c · p · b} 02:18, 7 July 2019 (UTC)
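Illustrative sketch of the grouping agreed on above: results from {{JCW-selected}} and {{JCW-pattern}} entries are combined when their first parameter is an exact match. The tuple layout is a hypothetical representation, not the bot's data model.

```python
from collections import defaultdict


def merge_by_group(entries):
    """Combine hits from all templates sharing the same first parameter.

    entries -- iterable of (template_name, group, hits) tuples
    """
    merged = defaultdict(list)
    for _template, group, hits in entries:
        for hit in hits:
            if hit not in merged[group]:  # keep first occurrence only
                merged[group].append(hit)
    return dict(merged)


entries = [
    ("JCW-selected", "ACM Conference on Foobarology",
     ["ACM Conference on Foobarology", "ACM Conf. Foobarology"]),
    ("JCW-pattern", "ACM Conference on Foobarology",
     ["Proceedings of the 18th Conference on Foobarology"]),
]
print(merge_by_group(entries))
```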
Done. -- JLaTondre (talk) 03:09, 15 July 2019 (UTC)
I'll be building a few more patterns like that and test things further, but it seems to work well so far! Headbomb {t · c · p · b} 09:56, 15 July 2019 (UTC)
@JLaTondre: regex matching doesn't seem to work for Publishers. Headbomb {t · c · p · b} 00:12, 17 July 2019 (UTC)
Extended it to publishers. -- JLaTondre (talk) 23:34, 17 July 2019 (UTC)

Weird stuff going on at WP:JCW/Publisher1

See [5]. Headbomb {t · c · p · b} 10:45, 22 July 2019 (UTC)

Stupidity. I was testing output of unnecessary exclusions report and forgot to remove my debug. -- JLaTondre (talk) 23:15, 22 July 2019 (UTC)

Missing a few?

In WP:JCW/Publisher4#40 (Nauka) Automation and Remote Control, as well as Avtomatika I Telemekhanika are being reported. However, Avtomatika i Telemekhanika isn't, despite being cited many times (see WP:JCW/A79). Headbomb {t · c · p · b} 22:10, 28 July 2019 (UTC)

Also, Avtomatika I Telemekhanika isn't listed at WP:JCW/A79. Headbomb {t · c · p · b} 22:34, 28 July 2019 (UTC)
Seems there was a sync issue somewhere since it's now in WP:JCW/A80. Headbomb {t · c · p · b} 07:15, 29 July 2019 (UTC)
Yes, there was a disconnect with the last full run. All should be good now. -- JLaTondre (talk) 21:52, 29 July 2019 (UTC)

Unnecessary exclusions report

WP:JCW/EXCLUDE is getting pretty large, and as the matching algorithms get tweaked, many of those are no longer needed. So, if the bot could look at WP:JCW/EXCLUDE every time it runs, and post a report on WT:JCW/EXCLUDE to identify things that would not match under the current algorithms, that would be great.

Something simple like

==JL-Bot report==
<!-- Report begin-->
The following exclusions are no longer needed:
{{JCW-exclude|Journal of Popular Science|J. Populist Scientology}}
<!-- Report end-->
Report by [[User:JL-Bot]]

Headbomb {t · c · p · b} 17:56, 14 March 2019 (UTC)

To be clear, this wouldn't report

if J. Populist Scientology simply isn't found in citations. Rather it would report it if J. Populist Scientology cannot conceivably be a match to Journal of Popular Science under whatever bot logic is employed at runtime. Headbomb {t · c · p · b} 18:12, 14 March 2019 (UTC)
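Illustrative sketch of how such a report could be assembled in the format proposed above. The `could_match` callback is a hypothetical stand-in for whatever matching logic the bot applies at runtime; nothing here is taken from the bot's code.

```python
def build_exclusion_report(exclusions, could_match):
    """Build report wikitext listing exclusions the matcher would never produce.

    exclusions  -- iterable of (target, citation) pairs from the exclusion list
    could_match -- hypothetical callback returning True if the current logic
                   could pair this citation with this target
    """
    unneeded = [(t, c) for t, c in exclusions if not could_match(t, c)]
    lines = [
        "==JL-Bot report==",
        "<!-- Report begin-->",
        "The following exclusions are no longer needed:",
    ]
    for target, citation in unneeded:
        lines.append("{{JCW-exclude|" + target + "|" + citation + "}}")
    lines += ["<!-- Report end-->", "Report by [[User:JL-Bot]]"]
    return "\n".join(lines)
```

The decision depends only on the matcher, not on whether the citation currently occurs in the dump, which is the distinction drawn in the comment above.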

This is starting to get higher up the list, since we're starting to run into template expansion issues. Headbomb {t · c · p · b} 06:02, 8 July 2019 (UTC)
@Headbomb: First cut done. I did a testing sample, but there are a lot of results. I would like to do the following to validate:
  1. Allow the bot to do its run tonight (will catch any recent configuration changes if any)
  2. Remove the reported unnecessary exclusions tomorrow (but no other configuration changes)
  3. Allow the bot to do its run tomorrow night
  4. Compare results of #1 & #3
If there are no errors in the report, then the results should be the same (at least none of the exclusions should be present; it's possible a redirect or category change could occur in that time period). That will allow easily seeing that the report doesn't have any errors. -- JLaTondre (talk) 00:08, 23 July 2019 (UTC)
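Illustrative sketch of the comparison in step 4, assuming each run's output can be reduced to a set of (target, citation) pairs. That representation and the function name are hypothetical.

```python
def validate_removals(run_before, run_after, removed_exclusions):
    """Compare the runs before and after removing reported exclusions.

    Returns (reappeared, other_changes):
      reappeared    -- removed exclusions that now show up in the output,
                       meaning the report was wrong about them
      other_changes -- remaining differences, e.g. from redirect or
                       category changes between the two runs
    """
    removed = set(removed_exclusions)
    reappeared = sorted(removed & set(run_after))
    other_changes = sorted((set(run_before) ^ set(run_after)) - removed)
    return reappeared, other_changes


before = {("A", "a1"), ("B", "b1")}
after = {("A", "a1"), ("B", "b1"), ("C", "c x")}
removed = [("C", "c x"), ("D", "d")]
print(validate_removals(before, after, removed))
```

An empty first list is the outcome expected when the report contains no errors.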
I'll take a look as soon as results are up! Headbomb {t · c · p · b} 00:12, 23 July 2019 (UTC)
That's... a lot of results! If the logic is sound, how about the bot takes care of those removals? I mean if it works... why would humans need to intervene? Headbomb {t · c · p · b} 00:23, 23 July 2019 (UTC)
I cleared up the A so that should be enough for a preliminary test run and see if anything majorly busted up. Tomorrow we can do the Bs and the rest and see what happens with a 'clean' diff. Headbomb {t · c · p · b} 00:36, 23 July 2019 (UTC)

Most in the A's seemed correct, however a few were not. For example,

were reported as unneeded, but were needed to exclude things at WP:JCW/Publisher13#129. Headbomb {t · c · p · b} 11:57, 23 July 2019 (UTC)

Likewise for

in WP:JCW/Publisher9#90, WP:JCW/Publisher6#57, WP:JCW/Publisher5#45 and elsewhere. Headbomb {t · c · p · b} 12:08, 23 July 2019 (UTC)

Issue was with publisher ones only & should be fixed. What is the status of the exclusions configuration? You manually added the valid ones back in? I was just going to revert the removal and re-run the exclusions report, but looks like you may have also added ones not in the pre-removal version? -- JLaTondre (talk) 23:02, 23 July 2019 (UTC)
Feel free to do whatever feels natural to you. I added those back in case it was something that could not be dealt with for whatever reason. Any change I made is too small potatoes to worry over, and represent at best 2 minutes of effort from my part. Revert to the old version and rerun if it's simpler. Headbomb {t · c · p · b} 23:46, 23 July 2019 (UTC)
I restored the prior version just to make sure nothing was inadvertently lost. I will let the bot do its normal run tonight and then re-run the unnecessary exclusions. -- JLaTondre (talk) 02:12, 24 July 2019 (UTC)
@JLaTondre: the bot ran. Let me know when you need my feedback. Headbomb {t · c · p · b} 09:36, 24 July 2019 (UTC)
Can I start removing some entries following this or should I wait a bit? Headbomb {t · c · p · b} 00:28, 25 July 2019 (UTC)
I'm currently removing the A's. It's a pain so I can see why you asked about the bot doing it. I'll finish off the A's and maybe a couple of others and then let it run again tonight. If all looks good, I can update the bot to remove them (but it might be a couple of days depending on schedule). -- JLaTondre (talk) 00:32, 25 July 2019 (UTC)
Okay, let's see how that works. -- JLaTondre (talk) 00:37, 25 July 2019 (UTC)
Indeed a pain. The way I did it was to open Notepad++ and compare the list with your exclusions. Highlights differences, much like a diff on Wikipedia. Then I just delete in Notepad++ and copy back the results. Still inefficient though, but efficient enough to deal with the A's in a reasonable amount of time. Headbomb {t · c · p · b} 01:26, 25 July 2019 (UTC)
Looks like everything worked fine. This is due to a new journal in the category. Not sure where [6] comes from, but presumably changes in the categories as well. Headbomb {t · c · p · b} 10:23, 25 July 2019 (UTC)

I'm wondering why things like

didn't get picked up as unnecessary. I would have thought this update in logic would have made them redundant, but maybe not. Headbomb {t · c · p · b} 10:46, 25 July 2019 (UTC)

That logic is excluding things that are all uppercase. None of those match that. They contain numbers, spaces, and lowercase. It could potentially be extended to handle those cases, but I would have to think about the best way to do that without excluding more than desired. -- JLaTondre (talk) 20:59, 25 July 2019 (UTC)
Not really critical at this point. Was just wondering. Headbomb {t · c · p · b} 22:10, 25 July 2019 (UTC)

This part isn't automated yet I believe? Would be useful to have a new one. Even if the removals aren't automated just yet. Headbomb {t · c · p · b} 23:42, 29 July 2019 (UTC)

Done. Removals implemented as well. -- JLaTondre (talk) 00:27, 31 July 2019 (UTC)
Cool beans. Wonder how hard it would be to implement per-section alphabetical sorting. There's another bot in trial for this, but it could be streamlined into JL-bot if it's doing daily-ish runs now. Headbomb {t · c · p · b} 00:36, 31 July 2019 (UTC)
The exclusion report only needs to be run when there is a logic change or a new dump. If there is another bot already for maintaining the list, that would be better. -- JLaTondre (talk) 23:52, 5 August 2019 (UTC)
Also, would there be a point in User talk:JL-Bot/Citations.cfg#JL-Bot report anymore? Headbomb {t · c · p · b} 00:38, 31 July 2019 (UTC)
Removed the listing. -- JLaTondre (talk) 23:52, 5 August 2019 (UTC)

It missed a few things, mostly those with articles from before the more recent dump, e.g.

Headbomb {t · c · p · b} 00:41, 31 July 2019 (UTC)

Fixed it to check for false positives that are now articles (since those no longer hit based on a change in logic a while back). I also fixed a case where it wasn't properly removing unnecessary false positives from the configuration page if the entries contained special characters (it was properly finding them, but the regex to remove them from the page was failing). So the last run removed a bunch more. -- JLaTondre (talk) 23:52, 5 August 2019 (UTC)
It also missed those. Possibly simply because the bot didn't run since your last code update. Headbomb {t · c · p · b} 04:22, 6 August 2019 (UTC)
Should be getting all now. -- JLaTondre (talk) 01:22, 7 August 2019 (UTC)

Misses some 'new series'

In WP:JCW/Target13#1221, the bot picks up Transactions of the American Philosophical Society, but it doesn't pick up Transactions of the American Philosophical Society, New Ser. as found in [7].

The bot should strip

  • Original Series / Original Ser. / Orig. Ser.
  • Série Originale / Sér. Originale / Série Orig. / Sér. Orig.
  • New Series / New Ser.
  • Nouvelle Série / Nouvelle Sér. / Nouv. Sér. / Série Nouvelle / Série Nouv. / Sér. Nouv.
  • Neue Folge
  • N.S.
  • N.F.
  • O.S.

Likewise for

  • First Series / Series One / Series I
  • Second Series / Series Two / Series II
  • Third Series / Series Three / Series III
  • Fourth Series / Series Four / Series IV
  • ...


  • Première Série / Série Première / Série Un / Série I
  • Seconde Série / Série Seconde / Série Deux / Série II
  • Troisième Série / Série Troisième / Série Trois / Série III
  • Quatrième Série / Série Quatrième / Série Quatre / Série IV
  • Cinquième Série / Série Cinquième / Série Cinq / Série V
  • Sixième Série / Série Sixième / Série Six / Série VI
  • Septième Série / Série Septième / Série Sept / Série VII
  • Huitième Série / Série Huitième / Série Huit / Série VIII
  • Neuvième Série / Série Neuvième / Série Neuf / Série IX
  • Dixième Série / Série Dixième / Série Dix / Série X
  • ...

In whichever caps variant and é/è/e variants.

Headbomb {t · c · p · b} 02:31, 8 July 2019 (UTC)

Current processing is based on User talk:JL-Bot/Archive 4#More things that don't count and User talk:JL-Bot/Archive_3#New JCW / MCW run. For the first part, I'll update it for the abbreviations and the additional cases. For the second part (everything after 'Likewise' above), how does that mesh with the prior request to change "Part|Section|Series|Série A" to "A"? Keep alphabetic, but delete numbers and Roman numerals (though there is overlap there)? Or now delete them all? -- JLaTondre (talk) 13:28, 14 July 2019 (UTC)
First part implemented. Second part dependent on clarification. -- JLaTondre (talk) 00:57, 15 July 2019 (UTC)
Well, basically if you have "Journal de Physique, Dixième Série" or "Transactions of the Foobar Society, Series Four", those should be treated as "Journal de Physique" and "Transactions of the Foobar Society", respectively. It's possible some of this already overlaps with previous behaviour, in which case things should just be expanded to cover the new cases. Headbomb {t · c · p · b} 01:39, 15 July 2019 (UTC)
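Illustrative sketch of stripping a trailing series designation after folding accents, so that é/è/e and capitalisation variants behave the same. It covers only a subset of the lists above and is not the bot's implementation. Roman numerals are left out on purpose because of the overlap with letter designations raised earlier in this thread. The function returns the accent-folded form, on the assumption that the result is used as a normalized comparison key.

```python
import re
import unicodedata


def fold_accents(text):
    """Drop combining marks so that é, è and e compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Words that may precede "Series" (written without accents).
_BEFORE = (r"(?:new|original|orig\.?|nouvelle|nouv\.?|first|second|third|fourth|"
           r"premiere|seconde|troisieme|quatrieme|cinquieme|sixieme|septieme|"
           r"huitieme|neuvieme|dixieme)")
# Words that may follow "Series".
_AFTER = (r"(?:originale|orig\.?|nouvelle|nouv\.?|premiere|seconde|one|two|three|"
          r"four|un|deux|trois|quatre|cinq|six|sept|huit|neuf|dix)")
_SERIES = r"(?:series|serie|ser\.?)"
_TRAILING = re.compile(
    rf"[\s,:;(]*\b(?:{_BEFORE}\s+{_SERIES}|{_SERIES}\s+{_AFTER}"
    rf"|neue\s+folge|n\.\s?[sf]\.|o\.\s?s\.)[\s).]*$",
    re.IGNORECASE,
)


def strip_series(title):
    """Return the accent-folded title without a trailing series designation."""
    return _TRAILING.sub("", fold_accents(title)).strip()


print(strip_series("Journal de Physique, Dixième Série"))
print(strip_series("Transactions of the Foobar Society, Series Four"))
```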
Implemented. Currently running so results should be up in a few hours. -- JLaTondre (talk) 01:40, 12 August 2019 (UTC)

Improved Chinese support?

Did you do something related to Chinese support recently? There's a lot of extra Chinese-related stuff picked up in the diffs. This is an improvement, just unexpected. Headbomb {t · c · p · b} 11:58, 12 August 2019 (UTC)

Some Russian stuff as well. And a couple of other typos-related things. Headbomb {t · c · p · b} 12:37, 12 August 2019 (UTC)
Improved Unicode handling for the normalizations. Basically does a simple transliteration, so it's limited, but it does pick up a couple more things. -- JLaTondre (talk) 01:02, 15 August 2019 (UTC)

More things that don't count: Abteilung, Supplementbände, Beihefte, Teil

Much like "Journal of Physics Series A" = "Journal of Physics A", "Palaeontographica Abteilung B" should = "Palaeontographica B". Likewise for Supplementbände/Supplementbande, Beihefte, and Teil. Headbomb {t · c · p · b} 13:45, 12 August 2019 (UTC)
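Illustrative sketch of collapsing a designator word that sits in front of a single-letter part name. The word list combines the earlier "Part|Section|Series|Série" request with the additions asked for here. Whether the bot handles Supplementbände, Beihefte and Teil in exactly this way is an assumption.

```python
import re

# Designator word followed by a single letter: keep the letter, drop the word.
_DESIGNATOR = re.compile(
    r"[\s,:;]+(?:part|section|series|s[ée]rie|abteilung|"
    r"supplementb[aä]nde|beihefte|teil)\s+(?=[A-Z]\b)",
    re.IGNORECASE,
)


def collapse_designator(title):
    """Rewrite e.g. 'Foo Abteilung B' as 'Foo B'."""
    return _DESIGNATOR.sub(" ", title)


print(collapse_designator("Palaeontographica Abteilung B"))
print(collapse_designator("Journal of Physics Series A"))
```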

Implemented. Also tweaked a couple of regexes to pick up more variations on issue | section | page (etc) and associated numbers. Running so results will be up later. -- JLaTondre (talk) 01:04, 15 August 2019 (UTC)
Great! Headbomb {t · c · p · b} 01:19, 15 August 2019 (UTC)

'Exact' searches

I tried something with {{R from misspellings}} and {{R from miscaps}}. There's potential... but there's a lot of tweaking that needs to happen if they're to be useful, because this (Journals_cited_by_Wikipedia/Publisher4, diff=prev, oldid=908690643) is kinda ridiculous.

So it would be nice to have a way of having

  • {{JCW-selected|Redirects from misspellings|Category:Redirects from misspellings|mode=caps}}
  • {{JCW-selected|Redirects from misspellings|Category:Redirects from miscaps|mode=caps}}

or something that would just search for things tagged with Category:Redirects from misspellings and Category:Redirects from miscapitalisations, and redlinked variants that differ only in capitalization.

The idea would be to collect all the typos (and redlinked capitalized variations) and all the miscaps (and their redlinked capitalized variants) under these two categories.

Miscaps

So basically, for Journal of Neurology, which has the associated, it would report

Journal of neurology (tagged with {{R from miscaps}})
Journal of NEUROlogy (redlinked capital variant of Journal of Neurology)

And for Bulletin for Biblical Research, it would report

Bulletin for biblical Research (redlinked capital variant of Bulletin for Biblical Research)
Bulletin for biblical research (redlinked capital variant of Bulletin for Biblical Research)

and any other variants marked with {{R from miscaps}}/Category:Redirects from misspellings, or redlinked variants that only differ in capitalization. So J. phys. A would get reported for Journal of Physics A (as a redlinked variant of J. Phys. A).

Typos

And American Journal of Physics has

American Journal of Physic (tagged with {{R from typo}})
American journal of Physic (redlinked capital variant of American Journal of Physic)

Or something. Feel free to refine my idea. Headbomb {t · c · p · b} 10:44, 31 July 2019 (UTC)

This is different processing than the normal publisher processing. Like the pattern matching, it would need to be separated out. If you want the results on the Publisher pages, that can still be done, but it needs either:
  1. Another template, perhaps {{JCW-misspellings}} or {{JCW-category}}, to control it; or
  2. If it will only ever be these two, hardcode instead of using a configuration
It looks like you want the misspellings and the capitalization differences in the same output? -- JLaTondre (talk) 00:55, 6 August 2019 (UTC)
If a template, I think {{JCW-caps}} (since it's base + caps variant) would be the one. But honestly, this could probably be hardcoded and put on a separate page at WP:JCW/Maintenance. Could put the WP:JCW/Invalid results there too. Headbomb {t · c · p · b} 04:41, 6 August 2019 (UTC)
At the same time... it would be good if that 'maintenance' page was customizable. Perhaps the invalid + typo + miscaps stuff hardcoded, with additional patterns that could be specified on a config page like WP:JCW/Publisher8#71 and WP:JCW/Publisher11#104 per WP:JCW/Maintenance.cfg. Headbomb {t · c · p · b} 05:37, 12 August 2019 (UTC)
Set up the structure & moved Invalid under it. Also created the processing for Wikipedia:JCW/Patterns & made the first run. I re-used one of the existing formats. If you want a different one, let me know. I'll start working on the typos next. -- JLaTondre (talk) 01:42, 16 August 2019 (UTC)

@JLaTondre: Wikipedia:WikiProject Academic Journals/Journals cited by Wikipedia/Typos format is messed up. Also the table should be sortable. Headbomb {t · c · p · b} 18:54, 17 August 2019 (UTC)

I know. I'm still working on it. -- JLaTondre (talk) 19:00, 17 August 2019 (UTC)
The more I think of it, the more I'm convinced it would be better to separate caps/typo pages. Headbomb {t · c · p · b} 19:38, 17 August 2019 (UTC)
Okay, will do that. -- JLaTondre (talk) 00:57, 23 August 2019 (UTC)
Miscapitalisations & Misspellings have been broken into their own pages. -- JLaTondre (talk) 00:35, 2 September 2019 (UTC)

Missing run

I added [8] on August 28, but the August 29 run didn't touch [9]. Headbomb {t · c · p · b} 20:27, 29 August 2019 (UTC)

The resultant output is larger than the maximum Wikipedia page size. I will have to split that into multiple pages (I had that on my to do list, but you got there faster than I could). -- JLaTondre (talk) 13:12, 1 September 2019 (UTC)
Too large? Would have thought those are barely used and would only amount to a few dozen entries? Headbomb {t · c · p · b} 13:22, 1 September 2019 (UTC)
Digging deeper, "Identifiers" is retrieving 528,846 citations. The *.OL.* is the main culprit since that pattern appears in *ology (biology, technology, etc.), volume, and many other common words. I can add a safeguard that when a specific term returns too many results (perhaps more than 500?), it displays an error vs. trying to show them all. -- JLaTondre (talk) 20:06, 1 September 2019 (UTC)
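Illustrative sketch of such a safeguard. The 500 limit is the figure floated above; the function name and the wording of the error message are hypothetical.

```python
MAX_RESULTS = 500  # threshold suggested in the comment above


def render_term(term, citations, limit=MAX_RESULTS):
    """Return the listing for a term, or an error line if it is too large."""
    if len(citations) > limit:
        return (f"Error: '{term}' matched {len(citations)} citations "
                f"(limit {limit}); results not shown.")
    return "\n".join(citations)


print(render_term("Identifiers", ["citation"] * 528846))
```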
Safeguard implemented & results updated. I still need to put in paging in case patterns expand, but that is lower priority. -- JLaTondre (talk) 00:34, 2 September 2019 (UTC)

I wasn't thinking about OL, it's going to be too common to be useful. Best to just remove it from the list. Headbomb {t · c · p · b} 13:22, 2 September 2019 (UTC)
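A minimal sketch of the safeguard described above, assuming results arrive as a plain list of citation strings. The function name, the error wording, and the default limit of 500 (the figure floated in the discussion) are illustrative assumptions, not the bot's actual code.

```python
MAX_RESULTS = 500  # assumed default; the thread only suggests "perhaps more than 500?"

def render_pattern_results(pattern, citations, limit=MAX_RESULTS):
    """Return the lines to display for one pattern.

    If the pattern matched more than `limit` citations, return a single
    error line instead of trying to list them all.
    """
    if len(citations) > limit:
        return [
            f"Error: pattern {pattern} matched {len(citations)} citations "
            f"(limit {limit}); results not shown"
        ]
    return sorted(citations)
```

With this shape, an overly broad pattern such as one matching *ology would collapse to one error line rather than pushing the page past the size limit.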

Miscaps missing

And many more are not reported as miscapitalizations. It finds those tagged with {{R from miscaps}} and their variants, but miscaps and variants of those not tagged with {{R from miscaps}} are not picked up. Headbomb {t · c · p · b} 00:14, 25 August 2019 (UTC)

Expanded it to pull in all nonexistent pages that differ in only capitalization from either a target or a redirect to a target. It still also reports redirects tagged as miscaps. -- JLaTondre (talk) 15:48, 2 September 2019 (UTC)
Wonderful. That will really help with miscapitalization cleanup. Shame we're after the 31st, but things should substantially improve by the September 20th dump. Headbomb {t · c · p · b} 20:09, 2 September 2019 (UTC)
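The expanded check can be sketched roughly as follows, assuming the cited names and the set of existing page titles (targets plus redirects) are already available in memory. The function and argument names are hypothetical, and redirects tagged as miscaps would still need to be handled separately.

```python
def find_miscaps(cited_names, existing_titles):
    """Map each nonexistent cited name to the existing titles it matches
    when capitalization is ignored."""
    by_folded = {}
    for title in existing_titles:
        by_folded.setdefault(title.casefold(), []).append(title)

    report = {}
    for name in cited_names:
        if name in existing_titles:
            continue  # the page exists, so it is not a "nonexistent" variant
        matches = by_folded.get(name.casefold())
        if matches:
            report[name] = sorted(matches)
    return report
```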

Exclude patterns

Would it be possible to have something like

  • {{JCW-pattern|identifiers|.*doi.*|!vaudois!|!doing!|!bowdoin!|!macédoine!}}

To exclude things that match vaudois/doing/bowdoin/macédoine from the things that match doi? Such as 'Bull. Soc. Vaudoise Sci. Nat.'?

Open to alternate syntax, but for this it would be good to keep everything in the {{JCW-pattern}} template with unnamed parameters. Headbomb {t · c · p · b} 13:15, 11 September 2019 (UTC)

Implemented for WP:JCW/Patterns. I will extend to WP:CITEWATCH in the next day or so. For precedence, .* will trump ! so if both appear in a pattern, it will look for citations containing the pattern including the exclamation. Similar to .*, it also doesn't have to be at the start and end. You can put it only at the start to exclude anything ending with the term, etc. -- JLaTondre (talk) 01:09, 13 September 2019 (UTC)
.* will trump !... the idea here is that ! excludes what it finds with .*. If .* trumps !, wouldn't that defeat the purpose of ! in the first place? Or do you mean for exact pattern match, like .*foobar.* + !foo!, which would be a rather silly thing to ask the bot to do? Headbomb {t · c · p · b} 02:09, 14 September 2019 (UTC)
I mean that if .* and ! occur in the same pattern, it will be treated as an include pattern (i.e. .*!.* will search for all citations containing an exclamation mark). -- JLaTondre (talk) 13:06, 14 September 2019 (UTC)
Ah, yes, that makes perfect sense. Headbomb {t · c · p · b} 16:12, 14 September 2019 (UTC)

WP:JCW/PUB and WP:CITEWATCH now support excludes as well. -- JLaTondre (talk) 13:55, 14 September 2019 (UTC)
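One way to read the include/exclude rules discussed above, as a hedged sketch: a parameter containing .* is an include pattern (even if it also contains !), a parameter containing only ! is an exclude pattern, and either marker may appear at just one end. Case-insensitive matching is an assumption made here so that vaudois matches 'Vaudoise'; the helper names are hypothetical.

```python
import re

def _to_regex(pattern, wildcard):
    # Escape the literal pieces and rejoin them with the regex wildcard.
    pieces = [re.escape(piece) for piece in pattern.split(wildcard)]
    return re.compile(".*".join(pieces), re.IGNORECASE)

def classify(pattern):
    """'.*' takes precedence over '!', as described in the thread."""
    if ".*" in pattern:
        return "include"
    if "!" in pattern:
        return "exclude"
    return "exact"

def is_reported(name, patterns):
    """True if some include (or exact) pattern matches and no exclude does."""
    included = excluded = False
    for pattern in patterns:
        kind = classify(pattern)
        if kind == "include":
            included = included or bool(_to_regex(pattern, ".*").fullmatch(name))
        elif kind == "exclude":
            excluded = excluded or bool(_to_regex(pattern, "!").fullmatch(name))
        else:
            included = included or pattern.casefold() == name.casefold()
    return included and not excluded
```

Under these assumptions, 'Bull. Soc. Vaudoise Sci. Nat.' matches .*doi.* but is dropped by !vaudois!, while .*!.* is treated as an include pattern for names containing an exclamation mark.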

Weird edit?

In [10], many linked articles are moved around for no apparent reason.

Not sure what's the sorting logic, but I recall those being presented alphabetically in the past. Headbomb {t · c · p · b} 13:17, 25 September 2019 (UTC)

Those are the links for the DOI articles. Forgot to put a sort on those. Added. Will be consistent after next update. -- JLaTondre (talk) 23:18, 25 September 2019 (UTC)

Support note= for publishers.

User:JL-Bot/Publishers.cfg has several |note= that are ignored by the bot. Those should be reported in WP:JCW/PUB. Same as WP:JCW/CRAP, except that page just won't even have a |source= in it. Headbomb {t · c · p · b} 16:33, 30 September 2019 (UTC)

Implemented. -- JLaTondre (talk) 22:40, 30 September 2019 (UTC)

Vital Articles

I've been thinking it would be useful to have lists of Vital Articles by WikiProject, and the Recognized Content listings seem like they might be a good fit. How feasible would it be to incorporate this? PC78 (talk) 00:29, 16 September 2019 (UTC)

It's a bit different, in the sense that this isn't really a recognition of anything save for importance of the topic. So not quite within the WP:RECOG. That said, it's not like it hurts anything either. If it's implemented, there should be a ' Level #' indicator, rather than just '' because a Level 3 vital article is nowhere near the same thing as a level 5 vital article. Headbomb {t · c · p · b} 07:04, 16 September 2019 (UTC)
Recognition of importance seems just as significant as recognition of quality. Hopefully it would be possible to configure the layout. I've got a few ideas if this is a goer. PC78 (talk) 08:08, 16 September 2019 (UTC)
So adding vital articles to recognized content? Or a new list? And just all project scope articles tagged as a vital article (i.e. Category:All Wikipedia vital articles)? Or broken out by level (i.e. separate section for Category:All Wikipedia level-1 vital articles, etc.)? -- JLaTondre (talk) 23:00, 23 September 2019 (UTC)
It's your bot so I would need to be guided by you and what you think would be best. I was thinking just add it to recognized content (presumably you could configure a vital articles only listing if necessary?), but it would probably need support for an overflow page as per some of the other content types. And it would probably be wise to break them down with separate sections for each level, as suggested above (a different parameter for each level?). Perhaps an optional parameter to display WikiProject assessment, e.g. Example (B-Class)? PC78 (talk) 03:14, 24 September 2019 (UTC)
Implemented. Example is at User:JL-Bot/Project content/Trial 5.1. Added Vital levels 1 to 5 as well as B and C class. It supports overflow for all (also added overflow for A class). See documentation change for the new fields. Let me know if you have any questions using it. -- JLaTondre (talk) 21:33, 24 September 2019 (UTC)
@JLaTondre: probably would be nice to add a |vital-icons option or something to add , , etc... next to each article to get a quick assessment of where things stand. Headbomb {t · c · p · b} 21:43, 24 September 2019 (UTC)
@JLaTondre: Thanks for that, it's late though so I'll have a proper look tomorrow. Not sure if you misinterpreted my comment about assessment, I meant specifically for the vital articles, kind of like what Headbomb suggested. PC78 (talk) 22:56, 24 September 2019 (UTC)
So different icons within the vital sections based on the status of the article... I'll have to think about how best to do that. Currently, the bot applies a single icon for each type, both at the header and at the article (when not in compact mode). Adding the new types was pretty easy as it was an extension of the prior logic. The icons will be a larger change. There are some changes I have been wanting to make to improve efficiency. I'll factor this into that. -- JLaTondre (talk) 23:33, 25 September 2019 (UTC)

I've set up a couple of listings at Wikipedia:WikiProject Film/Vital articles and Wikipedia:WikiProject Korea/Vital articles. How long before the bot picks them up? PC78 (talk) 00:43, 29 September 2019 (UTC)

That task typically runs each weekend. It ran yesterday, but takes most of the day to complete. Results are up now. -- JLaTondre (talk) 12:19, 29 September 2019 (UTC)
@PC78: Icons have been updated to display the associated icon for the vital article class. I have updated your two pages. -- JLaTondre (talk) 21:52, 3 October 2019 (UTC)
Looks good, the class icons are probably more useful here. PC78 (talk) 02:40, 4 October 2019 (UTC)

More aggressive section matching

Very often, you've got subtitles/section titles appended to journal titles, e.g.

which doesn't (at the moment) have a redirect entry, unlike the very close match

The bot should be able to figure this out. Specifically, when you have a redirect/base name with

  • Part X
  • Section X
  • Series X
  • Teil X
  • Volume X

in the title, where X is a single capitalized letter, ignore everything after that single capitalized letter for purpose of matching. There might be other similar words (look in words that are ignored to make this list more extensive).

Likewise, if you have

It should, based on that, be able to pickup

The logic for this would be if you have a redirect/base name that ends in a single capitalized letter, match everything that has that pattern, regardless of what comes after that letter.

Headbomb {t · c · p · b} 19:13, 23 August 2019 (UTC)

First part (stripping stuff after Section X) is done. It is processing and results will be up when done (several hours). Second part (matching based on redirects) is more complicated as it doesn't fit the current normalization processing. I can add a post cleanup step, but that will be more work. What is the priority of this vs. handling DOI matching? -- JLaTondre (talk) 18:17, 2 September 2019 (UTC)
The DOI matching would be more useful for WP:CRAPWATCH cleanup. More powerful section matching is nice but it's like a refinement on a refinement. Of the two, the DOI matching is more useful to Wikipedia in general, whereas section matching is really only of interest to WikiProject Journals. Headbomb {t · c · p · b} 19:47, 2 September 2019 (UTC)
@Headbomb: The second part (redirects ending in single letter) is done and results are saving. I took the "match everything that has that pattern, regardless of what comes after that letter" literally. So if the redirect is "Name W", it will match "Name Word". That has the potential to catch a few false positives, but it also found some valid cases. I can change it to anything that comes after the letter plus a word boundary (space, punctuation) so it would skip "Name Word", but catch "Name W extra", "Name W: extra", etc. if false positives outweigh valid. -- JLaTondre (talk) 22:04, 5 October 2019 (UTC)
Looking forward to it. I can't see the false positives being all that common. Headbomb {t · c · p · b} 01:06, 6 October 2019 (UTC)
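Both parts of the request can be sketched as below. This is an illustration under assumptions, not the bot's normalization code: the keyword list is the one given in the request, the function names are hypothetical, and the optional `require_boundary` flag corresponds to the stricter word-boundary variant mentioned as a possible follow-up.

```python
import re

# Keywords from the request; other similar words could be added.
_SECTION_RE = re.compile(r"(.*?\b(?:Part|Section|Series|Teil|Volume) [A-Z])\b")

def strip_section_suffix(title):
    """Drop everything after 'Part X', 'Section X', etc. (X = one capital letter)."""
    match = _SECTION_RE.match(title)
    return match.group(1) if match else title

def matches_redirect_prefix(redirect, citation, require_boundary=False):
    """If `redirect` ends in a single capital letter, match any citation that
    starts with it. With require_boundary=True, the next character (if any)
    must not be a letter or digit, so 'Name W' no longer matches 'Name Word'."""
    if not re.search(r"(?:^| )[A-Z]$", redirect):
        return False
    if not citation.startswith(redirect):
        return False
    if not require_boundary:
        return True
    rest = citation[len(redirect):]
    return rest == "" or not rest[0].isalnum()
```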

Novaya Seriya = New Series

In whatever bit handles the 'New Series / Nouvelle Série / Neue Folge / etc...', if you could add "Novaya Seriya", that would be great. Headbomb {t · c · p · b} 19:47, 7 October 2019 (UTC)

Implemented. Will be in tonight's run. -- JLaTondre (talk) 01:03, 15 October 2019 (UTC)
WP:JCW/Target26#Matematicheskii Sbornik still doesn't report Matematicheskii Sbornik, Novaya Seriya (see WP:JCW/M9) Headbomb {t · c · p · b} 17:06, 16 October 2019 (UTC)
Looks like I didn't run the re-normalization like I thought I did. It's in process & results will be up in awhile. -- JLaTondre (talk) 00:47, 17 October 2019 (UTC)

How hard/feasible would it be to have a 'doi' match?

Mostly thinking of WP:CRAPWATCH here. For instance DOIs that start with 10.4172/... are 10.4172 for OMICS Publishing Group. Having

  • {{JCW-selected|OMICS Publishing Group|Category:OMICS Publishing Group academic journals|Journal of Surgery|Medicinal Chemistry (journal)|doi=10.4172|source=BLP}}

would add flexibility/future proofing to the crapwatch. Headbomb {t · c · p · b} 19:48, 20 August 2019 (UTC)

What is it supposed to match to? -- JLaTondre (talk) 23:21, 21 August 2019 (UTC)

The |doi=10.4172/... of citation templates. Alternatively, {{doi|10.4172/...}} itself if feasible. Creating something like

Rank  Target/Group                                     Entries (Citations, Articles)  Total Citations  Distinct Articles  Citations/article
28    OMICS Publishing Group [Beall's publisher list]                                 160              146                1.096

Headbomb {t · c · p · b} 00:42, 22 August 2019 (UTC)

The doi from the citation template is pretty straightforward. What are you thinking for the {{doi}} template since it doesn't contain the journal name? It is supposed to follow the journal name, I believe? But that would be hard to pull out of the article (too much variability in formats). There is also {{doi-inline}} which has an optional parameter for the journal name. That one would be easy if the journal is provided. If not, it has the same issue as doi. It would be easy enough to track all the doi numbers and if a number occurs in both a citation template and a doi template, use the citation template name for both. Don't know how often that will happen though. Assume you would like a maintenance page for doi entries that don't follow an expected format? -- JLaTondre (talk) 00:57, 23 August 2019 (UTC)
It would just be a raw listing grouped under the associated 'header'. See the updated mockup. Headbomb {t · c · p · b} 01:23, 23 August 2019 (UTC)

I have the template parsing done. The doi values are all properly being parsed from the database dump. Next will be working generating the output. -- JLaTondre (talk) 01:11, 13 September 2019 (UTC)

Results are uploaded to WP:CITEWATCH. One large caveat is that there is no de-duping between the name matching results and the doi matching ones. Therefore, some results are listed in both places. However, the doi matching is also picking up results that are not caught by the normalization process. -- JLaTondre (talk) 03:24, 15 September 2019 (UTC)
Immediate feedback would be that it's missing the article links for DOIs with 5 or fewer links. E.g. WP:JCW/Questionable4#Ashdin Publishing. Headbomb {t · c · p · b} 01:16, 16 September 2019 (UTC)
Should be fixed. 09/20 dump running so will be in those results. -- JLaTondre (talk) 23:13, 23 September 2019 (UTC)
Also [11]. Headbomb {t · c · p · b} 06:58, 16 September 2019 (UTC)
Sorry, I had fixed that, but the change didn't propagate through until the change in the configuration page forced a whole update. -- JLaTondre (talk) 23:13, 23 September 2019 (UTC)
Yeah it works fine now, although the links for the 5 or fewer are still missing. Those really help in dealing with bad inputs / quickly finding the problem page. Headbomb {t · c · p · b} 23:16, 23 September 2019 (UTC)
And now the links are there. And the children rejoiced! Headbomb {t · c · p · b} 12:24, 24 September 2019 (UTC)

The next step for the bot would likely be dupe elimination from the DOI stuff. The logic being if it's redundant with what gets picked up without the DOI checking, ignore it. If it's not redundant, keep it and list it under the DOI header. Headbomb {t · c · p · b} 05:58, 6 October 2019 (UTC)

Any updates on redundancy elimination? Headbomb {t · c · p · b} 01:15, 19 October 2019 (UTC)
Side tracked by the saving update. They were more work than I expected. I will do this next. -- JLaTondre (talk) 00:28, 29 October 2019 (UTC)
Implemented. Results will appear with tonight's run. -- JLaTondre (talk) 22:55, 30 October 2019 (UTC)
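The DOI matching and the later redundancy elimination might look roughly like this. It is a sketch under assumptions: citations are taken to be (journal name, DOI) pairs already parsed from |journal= and |doi=, `name_matched` stands for whatever the name-based normalization already caught, and the sample DOIs in the usage note are made up.

```python
def doi_prefix(doi):
    """Return the registrant prefix, e.g. '10.4172' from '10.4172/...'."""
    return doi.strip().split("/", 1)[0]

def doi_only_matches(citations, prefixes, name_matched):
    """Group journal names by configured DOI prefix, skipping any name that
    the name-based matching already reports (so nothing is listed twice)."""
    result = {prefix: [] for prefix in prefixes}
    for journal, doi in citations:
        prefix = doi_prefix(doi)
        if prefix in result and journal not in name_matched:
            result[prefix].append(journal)
    return result
```

For example, with a configured prefix of 10.4172 and 'Journal of Surgery' already matched by name, a citation of 'Journal of Surgery' with a 10.4172 DOI would be ignored, while an otherwise unmatched title with a 10.4172 DOI would be listed under the DOI header.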

Cutting down on near-pointless edits

Looking at Special:Contributions/JL-Bot, there's a lot of pointless edits. Making use of

Similar to

alongside this, well that would cut down on a lot of those. Headbomb {t · c · p · b} 19:57, 20 August 2019 (UTC)

Rather have a single template per page type which would get passed to JCW-bottom to display the date line. I'll look at how to best structure it. Probably won't be able to get to it until next month. -- JLaTondre (talk) 23:28, 21 August 2019 (UTC)
Could have {{JCW-date}} contain the extra info and emit it. E.g. this + this. Headbomb {t · c · p · b} 00:51, 22 August 2019 (UTC)
Implemented. I restructured how saving works so that common, publishers, and questionable are now saved independently (prior, they were linked so if one changed, they were all saved). I also created {{JCW-bottom-common}}, {{JCW-bottom-publishers}}, {{JCW-bottom-questionable}}, and {{MCW-bottom-common}}. These transclude {{JCW-bottom}} with the parameters needed for each applicable page type (i.e. the publishers pages will all transclude {{JCW-bottom-publishers}} & the bot will update that template at the end of saving the publishers pages). {{JCW-date}} is no longer used and has been deleted. -- JLaTondre (talk) 00:41, 28 October 2019 (UTC)
That's a bit ugly, but whatever. If it works, it works. Is the empty pipe intended in here and elsewhere? Headbomb {t · c · p · b} 04:08, 28 October 2019 (UTC)
No, typo. Corrected. -- JLaTondre (talk) 00:05, 29 October 2019 (UTC)
Also, are those oversights? Seems weird to treat things differently for the letters. Headbomb {t · c · p · b} 04:10, 28 October 2019 (UTC)
No, that is correct. The individual pages only get updated with a new database dump so no need for a separate template. -- JLaTondre (talk) 00:05, 29 October 2019 (UTC)

@JLaTondre: WP:CRAPWATCH hasn't updated following new exclusions. Only WP:JCW/PUB and WP:JCW/TAR. Headbomb {t · c · p · b} 07:26, 31 October 2019 (UTC)

Nevermind, it just took a few hours after the other updates. Headbomb {t · c · p · b} 08:38, 31 October 2019 (UTC)

Multiple DOI prefixes

For example, Hindawi has both 10.5402 and 10.1155... What would be the best way to handle those?

  • {{JCW-selected|Hindawi Publishing Corporation|Category:Hindawi Publishing Corporation academic journals|...|doi=10.1155|doi2=10.5402|...}}

? Headbomb {t · c · p · b} 04:11, 18 October 2019 (UTC)

It already allowed for multiple doi= params. I changed it to allow an optional number to follow the doi. The output will be sorted by the values, not by the parameter numbers. -- JLaTondre (talk) 00:21, 19 October 2019 (UTC)
So what's the structure then? |doi=/|doi2=/|doi3=... or |doi=10.1234, 10.4321, 10.0987 or |doi=10.1234; 10.4321; 10.0987? The first would be the most convenient for me, but I can adapt either way. Headbomb {t · c · p · b} 01:07, 19 October 2019 (UTC)
doi=, doi1=, etc. Multiple parameters each with a single value. Not a single parameter with multiple values. -- JLaTondre (talk) 01:12, 19 October 2019 (UTC)
Oh, or do you mean for the output? There will be separate *doi= sections for each doi in the configuration. -- JLaTondre (talk) 01:15, 19 October 2019 (UTC)
I meant for input. I figured for output, it would just be the same as now, with an additional doi subheading thing. Headbomb {t · c · p · b} 01:16, 19 October 2019 (UTC)
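As an illustration of the input format settled on above (|doi=, |doi1=, |doi2=, ... each holding a single value, with output sorted by value rather than by parameter number), here is a minimal parsing sketch. It assumes the template contains no nested templates or piped links, and the function name is hypothetical.

```python
import re

def doi_prefixes(template):
    """Collect the values of doi=, doi1=, doi2=, ... from a template call
    and return them sorted by value."""
    inner = template.strip()
    if inner.startswith("{{") and inner.endswith("}}"):
        inner = inner[2:-2]
    values = []
    for param in inner.split("|")[1:]:  # skip the template name
        name, sep, value = param.partition("=")
        if sep and re.fullmatch(r"doi\d*", name.strip()):
            values.append(value.strip())
    return sorted(set(values))
```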

Exclusions not followed?

In WP:JCW/PUB10, 'AIEE Proceedings of the International Electrical Congress' is listed under 'Institution of Engineering and Technology' (#92). However, in WP:JCW/EXCLUDE, there is

So what's happening here? Is it because what's picked up by patterns isn't overruled by the exclusion page? Headbomb {t · c · p · b} 15:45, 23 October 2019 (UTC)

Yes, patterns are not checking false positives. I will add that. -- JLaTondre (talk) 00:40, 25 October 2019 (UTC)
Implemented for questionable & publishers. Results will appear with tonight's run. I did not implement for the maintenance patterns page since it doesn't seem to apply there. -- JLaTondre (talk) 22:57, 30 October 2019 (UTC)
Yeah for the patterns page, I can just tweak the patterns exclusion logic. Headbomb {t · c · p · b} 23:50, 30 October 2019 (UTC)


Handle {{ill}} in |journal=

Things like |journal={{ill|Visions (magazine)|lt=Visions|de|Visions}} cause some issues. Namely, things are treated as Visions (magazine) (first parameter), rather than Visions (last parameter). Headbomb {t · c · p · b} 19:01, 28 October 2019 (UTC)

The last parameter is the German language name. It just happens to match the English one in this case. I assume you really mean to use the lt= parameter since that is the displayed value?
  • {{ill|English|de|German}} use English
  • {{ill|English|lt=Display|de|German}} use Display
If you are truly saying you want the foreign language version, what do you want done when there are multiple (ex. {{ill|English|es|Spanish|it|Italian|de|German}})? -- JLaTondre (talk) 00:27, 29 October 2019 (UTC)
I want whatever is the displayed name. If |lt= is the displayed name (which it seems to be according to doc), then use |lt=. Headbomb {t · c · p · b} 01:48, 29 October 2019 (UTC)
Implemented. I'm not planning on updating the individual letter pages, but will let that wait for the next dump. However, changes will be reflected in targets, questionable, & publishers in tonight's run. -- JLaTondre (talk) 23:01, 30 October 2019 (UTC)
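The rule agreed on above (use |lt= when present, otherwise the first unnamed parameter) can be sketched as follows. The function name is hypothetical, and the sketch assumes a simple {{ill|...}} call with no nested templates.

```python
def ill_display_name(template):
    """Return the displayed name of an {{ill}} call: the lt= value if given,
    otherwise the first unnamed parameter. Returns None for other input."""
    text = template.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        return None
    parts = text[2:-2].split("|")
    if parts[0].strip().lower() != "ill":
        return None
    positional = []
    for part in parts[1:]:
        name, sep, value = part.partition("=")
        if sep and name.strip() == "lt":
            return value.strip()
        if not sep:
            positional.append(part.strip())
    return positional[0] if positional else None
```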

Further efforts to reduce WP:JCW/EXCLUDE

If you have something like

It's very possible that this was added because of some temporary one-time GIGO stuff in citations. While it is still a theoretical match, it's unlikely to happen again after things have been cleaned up. So, if the bot could add a count, like so

  • {{JCW-exclude|2012 (film)|201 File|c=3}}

To indicate that 201 File appears 3 times in |journal=, that would let us more easily find things that are no longer needed. It could be that 0 uses get automatically cleaned up in the future, but for now, a count would be enough. Headbomb {t · c · p · b} 06:55, 13 October 2019 (UTC)

@Headbomb: Do you want a single count? Or count by type (target, questionable, publisher)? -- JLaTondre (talk) 23:16, 1 November 2019 (UTC)
Just a plain dumb count. If 201 File (e.g. parameter 2) appears 3 times in |journal=, then the bot would report that as
  • {{JCW-exclude|2012 (film)|201 File|c=3}}
This would apply to all exclusions with 201 File in parameter 2. Headbomb {t · c · p · b} 00:02, 2 November 2019 (UTC)

For example, for 201 File, which features in |journal= twice according to WP:JCW/A2, all these would get |c=2

Headbomb {t · c · p · b} 00:08, 2 November 2019 (UTC)

Implemented. Exclusions updated with a test case to verify working. Now the full run is in progress. -- JLaTondre (talk) 17:17, 2 November 2019 (UTC)
I take it [12] was the test? Headbomb {t · c · p · b} 17:27, 2 November 2019 (UTC)
Correct. -- JLaTondre (talk) 01:11, 4 November 2019 (UTC)
Full version uploaded. -- JLaTondre (talk) 01:11, 4 November 2019 (UTC)
@JLaTondre: seems to work pretty well. However, I think it might be skipping |magazine= for counts. Could you confirm? Headbomb {t · c · p · b} 22:11, 4 November 2019 (UTC)
It counts across both JCW and MCW. Is there something specific you think is missing? -- JLaTondre (talk) 23:50, 4 November 2019 (UTC)
I just saw a lot of c=0 on several Foobar (magazine) entries, like Europe (magazine). So I was wondering if the magazine parameter was getting ignored. Headbomb {t · c · p · b} 01:38, 5 November 2019 (UTC)

There might be something else going on that's fishy, since the bot says

  • {{JCW-exclude|Europe (magazine)|EURODL|c=0}}

But EURODL is listed in WP:JCW/E27 as being cited once. Headbomb {t · c · p · b} 02:20, 5 November 2019 (UTC)

There's definitely something fishy going on with parentheses

  • {{JCW-exclude|Geoscience e-Journals|Geosciences Journal|c=1}}
  • {{JCW-exclude|Geosciences (journal)|Geosciences Journal|c=0}}

Headbomb {t · c · p · b} 02:25, 5 November 2019 (UTC)

I read more into the request than was asked for. It is generating a count of how many times the false positive is seen in the generation of the target, questionable, and publisher lists. Not a count of how many citations they have in the individual lists. For "Europe (magazine)", that is not on any of those pages so the false positive is never registered. For the Geoscience ones, I would have to do more digging as to what logic is in play there. However, based on what was really asked for, that is much easier processing than what I actually implemented. Sounds like I should switch over to that (i.e. report the corresponding citation count from the combined journal and magazine individual pages)? -- JLaTondre (talk) 21:51, 6 November 2019 (UTC)
Sounds like that yes. The reason for the unfancy logic is that if you have an exclusion rule that pushes something off WP:CRAPWATCH entirely, or would push something further down WP:JCW/TAR (like at WP:JCW/Target42), then the exclusion is still needed, even if there are no 'hits' on these pages. Headbomb {t · c · p · b} 00:09, 7 November 2019 (UTC)
Updated to display citation counts from individual listings. Since that will only change with a database dump, it will no longer run with the nightly runs either. -- JLaTondre (talk) 00:03, 9 November 2019 (UTC)
@JLaTondre: well it might need updating following additions, e.g. [13]. Headbomb {t · c · p · b} 07:12, 9 November 2019 (UTC)
Okay, re-added the nightly run. -- JLaTondre (talk) 13:44, 9 November 2019 (UTC)
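The "plain dumb count" that was eventually adopted can be sketched like this, assuming the combined journal and magazine citation counts are available as a simple name-to-count mapping. The function name and data shapes are illustrative assumptions.

```python
def annotate_exclusions(exclusions, citation_counts):
    """Emit one {{JCW-exclude}} line per (target, name) pair, with c= set to
    how many times `name` (parameter 2) is cited; 0 if it never appears."""
    lines = []
    for target, name in exclusions:
        count = citation_counts.get(name, 0)
        lines.append("{{JCW-exclude|" + target + "|" + name + "|c=" + str(count) + "}}")
    return lines
```

Every exclusion sharing the same parameter 2 gets the same count, so 201 File cited twice would yield |c=2 on each of its entries.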

Improved matches with = or /

If you have an entry with a = or a /, the bot should normalize the entry to the part before the = or / in WP:JCW/TAR, WP:JCW/PUB and WP:JCW/CRAP.

For example, treat

as

If multiple / or = are found, normalize things to before the first one, e.g. treat

as

for purpose of matching. Headbomb {t · c · p · b} 03:13, 10 November 2019 (UTC)

The = portion has been implemented. They should show up with tonight's run.
For the / portion, that produces some problems. There are plenty of citations like "IEEE/ACM Transactions on Networking" that would get truncated to "IEEE". I could ignore cases where capitals are on both sides of the slash, but those are not the only false positives. -- JLaTondre (talk) 04:21, 19 November 2019 (UTC)
Those cases were in my mind, but I didn't know how common they were. I think a possible solution is to do the / truncation iff the / happens after, say, 10 characters. Or some other threshold. Perhaps in conjunction with the allcaps on both sides rule. I'll know more when results are up. Headbomb {t · c · p · b} 04:34, 19 November 2019 (UTC)
The / portion has been implemented. Results will be up tonight & you can decide how you want to limit the matching. -- JLaTondre (talk) 01:25, 20 November 2019 (UTC)
With what logic? Truncate iff / occurs after 10 characters, and ignoring CAPS/CAPS? Headbomb {t · c · p · b} 05:57, 20 November 2019 (UTC)
It seems to have skipped WP:JCW/TAR entirely. Headbomb {t · c · p · b} 10:22, 20 November 2019 (UTC)

Upon review, it seems that truncating at 9 or above characters would get rid of nearly every false positive (I think I saw 4 that would remain), and keep all the good matches. So if you've got 123 45678/9abcdef don't truncate, and if you've got 123 456 789/0abc def, truncate. I didn't see anything that a fancier CAPS/CAPS type of rule would change that the first one wouldn't cover, so you can hold off on that one, but that might just be because things are currently getting lost in the signal to noise ratio. Headbomb {t · c · p · b} 11:01, 20 November 2019 (UTC)

Truncating at 9 implemented for "/". -- JLaTondre (talk) 01:34, 21 November 2019 (UTC)
Seems to work mostly fine. Shame it's not the 19th, I'd have done a lot of cleanup before the next time. Oh well, I'll just set up more exemptions in the meantime. Headbomb {t · c · p · b} 09:10, 21 November 2019 (UTC)
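A rough sketch of the = and / normalization discussed above. Two things are assumptions made for the sake of the example: the = rule is applied before the / rule, and the threshold of 9 is counted in non-space characters before the first slash, which is one reading that agrees with both examples given (123 45678/9abcdef kept, 123 456 789/0abc def truncated).

```python
def normalize_for_matching(name, min_chars=9):
    """Keep the part before the first '='; keep the part before the first '/'
    only when at least `min_chars` non-space characters precede it."""
    head = name.split("=", 1)[0]
    slash = head.find("/")
    if slash != -1:
        before = head[:slash]
        # Assumed counting rule: ignore whitespace when applying the threshold.
        if sum(1 for ch in before if not ch.isspace()) >= min_chars:
            head = before
    return head.strip()
```

Under this reading, "IEEE/ACM Transactions on Networking" is left alone because only four characters precede the slash.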

Equal issues

One thing that makes things tricky for exclusions is that

  • {{JCW-exclude|Society for Sedimentary Geology|Se Pu = Chinese Journal of Chromatography}}

will no longer 'display' correctly. I suggest supporting something like

  • {{JCW-exclude|Society for Sedimentary Geology|2=Se Pu = Chinese Journal of Chromatography}}

where you would strip |2=| in {{JCW-exclude}}/{{JCW-pattern}}. Headbomb {t · c · p · b} 07:31, 19 November 2019 (UTC)

Implemented. -- JLaTondre (talk) 01:25, 20 November 2019 (UTC)
[14] only partially. This one didn't work. The counts are also off. Headbomb {t · c · p · b} 10:10, 20 November 2019 (UTC)
Both fixed. -- JLaTondre (talk) 01:34, 21 November 2019 (UTC)
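The reason the explicit 2= helps is that MediaWiki treats everything before the first = in a parameter as its name, so an unnamed value containing = stops rendering as parameter 2. A sketch of reading both forms to the same result follows; the function name is hypothetical, and other named parameters such as |c= are deliberately not handled here.

```python
import re

def positional_params(template):
    """Return {index: value} for a template call, accepting both unnamed
    parameters and explicitly numbered ones such as '2=...'."""
    inner = template.strip()[2:-2]
    params = {}
    next_index = 1
    for raw in inner.split("|")[1:]:  # skip the template name
        numbered = re.match(r"\s*(\d+)\s*=(.*)", raw, re.DOTALL)
        if numbered:
            params[int(numbered.group(1))] = numbered.group(2).strip()
        else:
            params[next_index] = raw.strip()
            next_index += 1
    return params
```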

Normalization issues

In WP:JCW/Target7#UBV Photoelectric Photometry Catalogue, many things are grouped under UBV Photoelectric Photometry Catalogue, which really shouldn't be. Several of those entries are linked and resolve elsewhere, like VizieR On-line Data Catalog: B/GCVS and VizieR On-line Data Catalog: B/gcvs. Originally Published in: 2009yCat....102025S. Likewise in WP:JCW/Target13#Bright Star Catalogue.

The redlinks are fine. Headbomb {t · c · p · b} 16:20, 22 November 2019 (UTC)

VizieR On-line Data Catalog: II/168. Originally Published in: Institut d'Astronomie and VizieR On-line Data Catalog: II/168. Originally published in: Institut d'Astronomie are redirects to UBV Photoelectric Photometry Catalogue. With the new slash logic, they reduce to "VizieR On-line Data Catalog: II" which results in a one or two character difference from all the other "VizieR On-line Data Catalog:" results. None of the results exist in the 11/01 database and so are properly captured here. The three regular blue links (indicating not present in the dump) are redirects created after the dump (you created them on the 7th). With the next dump (which became available today and I'll process tonight), they will go away since they will now be in the dump. -- JLaTondre (talk) 22:25, 22 November 2019 (UTC)
Nevermind then. I thought those redirects were older than that. Headbomb {t · c · p · b} 14:31, 23 November 2019 (UTC)

WP:JCW/PUB seems to ignore DOI matches

Compare WP:JCW/Publisher5#Hindawi Publishing Corporation's 403 entries with WP:JCW/Questionable1#Hindawi Publishing Corporation's 509 entries for example. Headbomb {t · c · p · b} 11:02, 13 November 2019 (UTC)

DOI extended to publishers. Results will be up with the 11/20 dump processing. -- JLaTondre (talk) 23:39, 22 November 2019 (UTC)

Exclusion ignored

does not seem to exclude 'Coconut' from WP:JCW/CW#Pseudo-scholarship. It's added via Category:Pre-Columbian trans-oceanic contact. Would

work instead? Headbomb {t · c · p · b} 10:00, 15 November 2019 (UTC)

The exclusions are only excluding citations found by matching against the configuration. They aren't excluding items included by the configuration. The assumption was that since the configuration was saying include it, it should be included. I can update it to apply against the configuration also. Is that just for the questionable processing or also for the publisher processing? -- JLaTondre (talk) 02:04, 22 November 2019 (UTC)
AFAIK, both /Questionable and /Publisher should have the same core logic for everything. I don't recall any situation where the logic between them should differ, although feel free to point out if there are inconsistencies.
As for excluding something told to be included, this is a bit of a rare situation, but it's the perfect corner case. Coconut (the food) is the item included in the pre-Columbian category. But Coconut (a non-notable poetry magazine from a non-notable publisher) is what's being 'caught' by this. Normally this could/would be fixed by creating a Coconut (magazine) redirect, but here the redirect can't be created, so an override needs to be available. Headbomb {t · c · p · b} 07:04, 22 November 2019 (UTC)
Implemented. -- JLaTondre (talk) 04:31, 24 November 2019 (UTC)

Add |doi= to PUB/CRAP compilations

Just like this [15]. This would be fetched from the config pages, and would be added even if there are no 'distinct' DOI matches, to produce something like.

Rank  Target/Group                                                               Entries (Citations, Articles)  Total Citations  Distinct Articles  Citations/article
520   Foobar Society | {{doi|10.1234}} [Beall's publisher list]                                                 1                1                  1.000
521   KSP Journals | {{doi|10.1453}} · {{doi|10.6547}} [Beall's publisher list]                                 1                1                  1.000

with |doi1=/|doi2=/|doi3=... following/replacing |doi= as needed. Headbomb {t · c · p · b} 15:36, 25 November 2019 (UTC)

Implemented. Will show up with next run. -- JLaTondre (talk) 01:35, 26 November 2019 (UTC)
@JLaTondre: missed the /PUB stuff, I believe. Headbomb {t · c · p · b} 11:40, 26 November 2019 (UTC)
It will run tonight. In testing the changes, it logged the latest configuration timestamp & I forgot to set it back. -- JLaTondre (talk) 23:07, 26 November 2019 (UTC)
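The requested display can be illustrated with a small formatting helper. The name `target_cell` and the idea that the configured DOI prefixes arrive as a list are assumptions; the separator and template syntax follow the mockup above.

```python
def target_cell(name, dois):
    """Build the Target/Group text: the name followed by one {{doi|...}}
    per configured prefix, shown even when no distinct DOI matches exist."""
    if not dois:
        return name
    links = " · ".join("{{doi|" + doi + "}}" for doi in dois)
    return name + " | " + links
```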