User:Robert McClenon/Oversized Queries

From WikiProjectMed

In 2007, I was one of the system testers for a system that tracked occupational health and industrial hygiene information on the brave men and women of the United States armed forces. One day, when we had set up a production mirror system for testing, one of the users submitted a query for personnel but forgot to specify any criteria. The system rolled over. That is, it crashed, and then restarted itself. This is not as bad as a crash after which the system stays down until a system administrator restarts it, but it is bad. The problem was that the query was executed on the database server, which ran out of memory, and the database server software didn't have a good mechanism for stopping the query at that point. We (the developers) had to put in a patch to return an error message if a query on personnel had no criteria.

Someone said: "But we don't have five million users. We only have a few hundred users. How can we have five million records?" Personnel didn't mean users. Personnel meant service members and veterans. The number of personnel records increased in a linear fashion over time. Records were not deleted when personnel left the armed forces, because the data was needed if the personnel were later treated in veterans' hospitals for possible job-related conditions, such as exposure to chemicals, to insects, or to insecticides used to control the insects. The number of personnel records was unlimited and expanding over time.

The issue had not been encountered earlier because testing was normally done on a test system that did not have five million personnel records, whereas the production mirror was set up with a copy of the production database, with its five million personnel records.
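The kind of patch described above, rejecting a personnel query that has no criteria before it ever reaches the database server, can be sketched roughly as follows. This is only an illustration in Python, not the code of the actual system: the table name `personnel`, the column names in `ALLOWED_COLUMNS`, the exception `EmptyCriteriaError`, and the function `build_personnel_query` are all hypothetical.

```python
# Hypothetical searchable columns; the real system's schema is not known.
ALLOWED_COLUMNS = {"last_name", "service_branch", "unit"}


class EmptyCriteriaError(ValueError):
    """Raised when a personnel query has no filtering criteria."""


def build_personnel_query(criteria):
    """Return (sql, params) for a personnel search, refusing unfiltered queries.

    `criteria` maps column names to the values they must equal.
    """
    # Treat blank form fields as "not specified".
    filters = {col: val for col, val in criteria.items() if val not in (None, "")}

    unknown = set(filters) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"Unknown search columns: {sorted(unknown)}")

    # The guard itself: no criteria means no query, only an error message.
    if not filters:
        raise EmptyCriteriaError("Specify at least one search criterion for personnel.")

    columns = sorted(filters)  # stable order for the WHERE clause
    where = " AND ".join(f"{col} = ?" for col in columns)
    return f"SELECT * FROM personnel WHERE {where}", [filters[col] for col in columns]
```

The point of the sketch is that the check happens before any SQL is sent, so the database server is never asked to build the five-million-row result in the first place.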

By the way, in order to have access to the system, users were required to complete HIPAA training and sign HIPAA non-disclosure agreements for sensitive personally identifying information.

So it is with Wikipedia, which has more than five million articles, and the number of articles is unlimited and expanding over time. Wikipedia performs well under normal conditions. A query that retrieves "too many" records in client space is only a problem for client performance, that is, the performance of the user's web browser, and, in worst cases, the user's computer. If you scroll through five million articles, one at a time, you will soon realize that that isn't what you wanted. However, a query that retrieves "too many" records in the database server should not be permitted.

In deciding whether any large subset (and the whole set is a subset of itself) of articles is problematic, the most important question is whether the large subset is being built on the database server. If so, it is likely to be a problem. If the large subset is only being built in the web browser, or if the large subset is never actually all being loaded at once (e.g., because it is being scrolled or paged), it can be treated as Someone Else's Problem because it is invisible to everyone except the person querying the large subset.

For example, a category with thousands, tens of thousands, or hundreds of thousands of articles is a large subset of Wikipedia pages.
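The paged case mentioned above, where a large subset is never all loaded at once, can be illustrated with a minimal Python sketch. It is not how MediaWiki actually implements category listings; the `article` table, the `iter_pages` function, and the page size of 200 are assumptions made for the example, which uses an in-memory SQLite database so that it is self-contained.

```python
import sqlite3


def iter_pages(conn, page_size=200):
    """Yield lists of (id, title) rows, never more than page_size at a time.

    Each page is fetched by a separate query that resumes after the last id
    seen, so the server only ever builds one page of results.
    Assumes ids are positive integers.
    """
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, title FROM article WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]


# Demonstration data: 1,000 hypothetical articles in a throwaway database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO article (title) VALUES (?)",
    [(f"Article {n}",) for n in range(1, 1001)],
)

# A reader who only looks at the first page costs the server one small query.
first_page = next(iter_pages(conn, page_size=200))
```

However large the category, each request touches at most one page of rows; the cost of going through the whole set falls on whoever keeps asking for the next page.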

Robert McClenon (talk) 22:38, 26 November 2018 (UTC)