Incident Documentation - An Unexpected Journey
by Željko Filipin
(Originally published at Doing the needful blog.)
Introduction
The Release Engineering team wants to continually improve the quality of our software over time. One of the ways in which we hoped to do that this year is by creating more useful Selenium smoke tests. (From now on, test will be used instead of Selenium test.) This blog post is about how we determined where the tests should focus and the relative priority.
At first, I thought this would be a trivial task. A few hours of work. A few days at most. A week or two if I’ve completely underestimated it. A couple of months later, I know I have completely underestimated it.
Things I needed to do:
- Define prioritization scheme.
- Prioritize target repositories.
Define Prioritization Scheme
In general:
- Does a repository have stewards? (Do the stewards want tests?)
- Does a repository have existing tests?
For the last year:
- How much change happened in a repository? Simply put: more change can lead to more risk.
- How many incidents is a repository connected to? We wanted to make sure we didn’t miss any obvious problematic areas.
Does a Repository Have Stewards?
This was a relatively simple task. The best source of information is the Developers/Maintainers page.
Does a Repository Have Existing Tests?
This was also easy. The Selenium/Node.js page has a list of repositories that have tests in Node.js. I already had all repositories with Node.js and Ruby tests on my machine, so a quick search for webdriverio (Node.js) and mediawiki_selenium (Ruby) found all the tests. To be really sure I had found all repositories with tests, I cloned all repositories from Gerrit.
$ ack --json webdriverio
extensions/Echo/package.json
27: "webdriverio": "4.12.0"
...
$ ack --type-add=lock:ext:lock --lock mediawiki_selenium
skins/MinervaNeue/Gemfile.lock
42: mediawiki_selenium (1.7.3)
...
To make extra sure I had not missed any repositories, I also used MediaWiki code search (mediawiki_selenium, webdriverio) and GitHub search (org:wikimedia extension:lock mediawiki_selenium, org:wikimedia extension:json webdriverio).
This is the list.
Repository | Language |
---|---|
mediawiki/core | JavaScript |
mediawiki/extensions/AdvancedSearch | JavaScript |
mediawiki/extensions/CentralAuth | Ruby |
mediawiki/extensions/CentralNotice | Ruby |
mediawiki/extensions/CirrusSearch | JavaScript |
mediawiki/extensions/Cite | JavaScript |
mediawiki/extensions/Echo | JavaScript |
mediawiki/extensions/ElectronPdfService | JavaScript |
mediawiki/extensions/GettingStarted | Ruby |
mediawiki/extensions/Math | JavaScript |
mediawiki/extensions/MobileFrontend | Ruby |
mediawiki/extensions/MultimediaViewer | Ruby |
mediawiki/extensions/Newsletter | JavaScript |
mediawiki/extensions/ORES | JavaScript |
mediawiki/extensions/Popups | JavaScript |
mediawiki/extensions/QuickSurveys | Ruby |
mediawiki/extensions/RelatedArticles | JavaScript |
mediawiki/extensions/RevisionSlider | Ruby |
mediawiki/extensions/TwoColConflict | JavaScript, Ruby |
mediawiki/extensions/Wikibase | JavaScript, Ruby |
mediawiki/extensions/WikibaseLexeme | JavaScript, Ruby |
mediawiki/extensions/WikimediaEvents | PHP |
mediawiki/skins/MinervaNeue | Ruby |
phab-deployment | JavaScript |
wikimedia/community-tech-tools | Ruby |
wikimedia/portals/deploy | JavaScript |
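The local search described above can also be sketched in a few lines of Python. This is only an illustration, not the method actually used (that was ack plus code search): the function name `find_test_frameworks` and the `MARKERS` table are hypothetical, and the sketch assumes that a repository with tests mentions webdriverio in a package.json or mediawiki_selenium in a Gemfile.lock, as in the ack output.

```python
from pathlib import Path

# Marker strings come from the searches in the text. The file names they are
# looked up in (package.json, Gemfile.lock) follow the ack output shown there.
MARKERS = {
    "JavaScript": ("package.json", "webdriverio"),
    "Ruby": ("Gemfile.lock", "mediawiki_selenium"),
}


def find_test_frameworks(root):
    """Map each directory under root that mentions a marker to its languages."""
    root = Path(root)
    found = {}
    for language, (file_name, marker) in MARKERS.items():
        for path in root.rglob(file_name):
            if "node_modules" in path.parts:
                continue  # skip installed dependencies
            try:
                text = path.read_text(encoding="utf-8", errors="ignore")
            except OSError:
                continue  # unreadable file, ignore it
            if marker in text:
                repo = path.parent.relative_to(root).as_posix()
                found.setdefault(repo, set()).add(language)
    return found
```

Calling `find_test_frameworks(".")` from the folder that holds the cloned repositories returns a dictionary keyed by relative path. A plain substring match can produce false positives, so the result would still need a manual check.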
How Much Change Happened in a Repository?
After reviewing several tools, I’ve found that we already use Bitergia for various metrics. There is even a nice list of top 50 repositories by the number of commits. The tool even supports limiting the report to a date range. Exactly what I needed.
Bitergia > Last 90 days > Absolute > From 2017-11-01 00:00:00.000 > To 2018-10-31 23:59:59.999 > Go > Git > Overview > Repositories (raw data: P7776, direct link).
This is the top 50 list (excludes empty commits and bots).
Repository | Commits |
---|---|
mediawiki/extensions | 11300 |
operations/puppet | 7988 |
mediawiki/core | 4590 |
operations/mediawiki-config | 4005 |
integration/config | 1652 |
operations/software/librenms | 1169 |
pywikibot/core | 927 |
mediawiki/extensions/Wikibase | 806 |
apps/android/wikipedia | 789 |
mediawiki/services/parsoid | 700 |
mediawiki/extensions/VisualEditor | 692 |
operations/dns | 653 |
VisualEditor/VisualEditor | 599 |
mediawiki/skins | 570 |
mediawiki/extensions/MobileFrontend | 504 |
mediawiki/extensions/ContentTranslation | 491 |
translatewiki | 486 |
oojs/ui | 469 |
wikimedia/fundraising/crm | 457 |
mediawiki/extensions/BlueSpiceFoundation | 414 |
mediawiki/extensions/CirrusSearch | 357 |
mediawiki/extensions/AbuseFilter | 306 |
phabricator/phabricator | 302 |
mediawiki/services/restbase | 290 |
mediawiki/extensions/Flow | 232 |
mediawiki/extensions/Echo | 223 |
mediawiki/vagrant | 221 |
mediawiki/extensions/Popups | 184 |
mediawiki/extensions/Translate | 182 |
mediawiki/extensions/DonationInterface | 180 |
analytics/refinery | 178 |
mediawiki/extensions/PageTriage | 177 |
mediawiki/extensions/Cargo | 176 |
mediawiki/tools/codesniffer | 156 |
mediawiki/extensions/TimedMediaHandler | 152 |
mediawiki/extensions/UniversalLanguageSelector | 142 |
mediawiki/vendor | 140 |
mediawiki/extensions/SocialProfile | 139 |
analytics/refinery/source | 138 |
operations/software | 137 |
mediawiki/services/restbase/deploy | 136 |
operations/debs/pybal | 123 |
mediawiki/extensions/CentralAuth | 116 |
mediawiki/tools/release | 116 |
mediawiki/services/cxserver | 112 |
mediawiki/extensions/BlueSpiceExtensions | 110 |
mediawiki/extensions/WikimediaEvents | 110 |
labs/private | 108 |
operations/debs/python-kafka | 104 |
labs/tools/heritage | 96 |
I got similar results by running git rev-list for all repositories (script, results: P7834).
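A commit count for one repository and one date range can be obtained with `git rev-list --count`. The sketch below is not the script linked above, whose contents are not reproduced here; the function names are hypothetical and the date range is the one used for the Bitergia report. Unlike Bitergia, this count does not exclude bots or empty commits.

```python
import subprocess


def rev_list_command(since, until, ref="HEAD"):
    """Build a git rev-list invocation that counts commits in a date range."""
    return [
        "git", "rev-list", "--count",
        f"--since={since}", f"--until={until}",
        ref,
    ]


def count_commits(repo_path, since, until):
    """Run the command inside repo_path and return the count as an int."""
    result = subprocess.run(
        rev_list_command(since, until),
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


# Example call, assuming a clone exists at this (hypothetical) path:
# count_commits("mediawiki/core", "2017-11-01", "2018-10-31")
```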
How Many Incidents Is a Repository Connected To?
This proved to be the most time-consuming task.
I have started by reviewing existing incident documentation. Take a look at a few incidents. Can you tell which incident report is connected to which repository? I couldn’t. (If you can, please let me know. I need your help.)
Incident reports are a wall of text. It was really hard for me to connect an incident report to a repository. An incident report has a title and text (for example, 20180724-Train). The text has several sections, including Actionables, and contains links to Gerrit patches and Phabricator tasks. (From now on, I’ll use patches instead of Gerrit patches and tasks instead of Phabricator tasks.)
A patch belongs to a repository. Wikitext [[gerrit:448103]] is patch mediawiki/extensions/Wikibase/+/448103, so the repository is mediawiki/extensions/Wikibase. That is the strongest link between an incident and a repository.
A task usually has patches associated with it. Wikitext [[phab:T181315]] is task T181315. Gerrit search bug:T181315 finds many connected patches, many of them in operations/puppet and one in mediawiki/vagrant. That is a useful, but not a strong, link between an incident and a repository. Some tasks have several related patches, so it provides a lot of data.
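To make the task-to-repository step concrete: Gerrit's REST API returns change lists as JSON in which each change carries a `project` field, and every response body starts with the line `)]}'` to guard against cross-site script inclusion. The sketch below only parses such a response; it makes no network request, the helper names are hypothetical, and the sample body with its change numbers is invented for illustration.

```python
import json

# Gerrit prefixes every JSON response with this line.
XSSI_PREFIX = ")]}'"


def parse_gerrit_json(body):
    """Strip the XSSI prefix (if present) and decode the JSON payload."""
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)


def projects_for_changes(body):
    """Return the sorted, de-duplicated repositories in a change list."""
    return sorted({change["project"] for change in parse_gerrit_json(body)})


# Invented sample of what a query such as bug:T181315 might return.
sample_body = """)]}'
[
  {"project": "operations/puppet", "_number": 1},
  {"project": "operations/puppet", "_number": 2},
  {"project": "mediawiki/vagrant", "_number": 3}
]"""

print(projects_for_changes(sample_body))
```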
A task also usually has several tags. Most of them are not useful in this context, but tags that are components (as opposed to, for example, milestones or plain tags) could be useful, if the component can be linked to a repository. This is also not a strong link between an incident and a repository, and it usually does not provide a lot of data.
In the end, I wrote a tool with an imaginative name, Incident Documentation. The tool currently collects data from patches and tasks in the Actionables section of the incident report. It does not collect data from task components. That is tracked as issue #5.
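The collection step can be pictured roughly as follows. This is not the tool's source code, only a sketch under stated assumptions: the section is assumed to be a level-two wikitext heading written as `== Actionables ==`, only the [[gerrit:]] and [[phab:]] link forms are recognized, and all function names are hypothetical. The sample page is invented and merely reuses the two link examples from the text.

```python
import re

GERRIT_LINK = re.compile(r"\[\[gerrit:(\d+)")
PHAB_LINK = re.compile(r"\[\[phab:(T\d+)")
# Matches level-two headings only; "=== Sub ===" stays part of the section.
LEVEL_TWO_HEADING = re.compile(r"==\s*([^=].*?)\s*==")


def actionables_section(wikitext):
    """Return the text under the Actionables heading, or an empty string."""
    lines = []
    inside = False
    for line in wikitext.splitlines():
        heading = LEVEL_TWO_HEADING.fullmatch(line.strip())
        if heading:
            inside = heading.group(1).lower() == "actionables"
            continue
        if inside:
            lines.append(line)
    return "\n".join(lines)


def extract_links(wikitext):
    """Collect unique patch numbers and task IDs from the Actionables section."""
    section = actionables_section(wikitext)
    return {
        "patches": sorted(set(GERRIT_LINK.findall(section)), key=int),
        "tasks": sorted(set(PHAB_LINK.findall(section))),
    }


sample_page = """== Summary ==
See [[phab:T1]].
== Actionables ==
* [[gerrit:448103]]
* [[phab:T181315]] follow-up
* [[gerrit:448103|same patch again]]
== Timeline ==
[[gerrit:1]]
"""

print(extract_links(sample_page))
```

Links outside the Actionables section (here T1 and patch 1) are ignored, which mirrors the current limitation described above.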
Incident Review 2017-11-01 to 2018-10-31
After reviewing the Actionables section of each incident report, and the related patches and tasks, here are the results. Please note that this table only connects incident reports and repositories. It does not show how many patches from a repository are connected to an incident report. That is tracked as issue #11.
Repository | Incidents |
---|---|
operations/puppet | 22 |
mediawiki/core | 6 |
operations/mediawiki-config | 4 |
mediawiki/extensions/Wikibase | 4 |
wikidata/query/rdf | 2 |
operations/debs/pybal | 2 |
mediawiki/extensions/ORES | 2 |
integration/config | 2 |
wikidata/query/blazegraph | 1 |
operations/software | 1 |
operations/dns | 1 |
mediawiki/vagrant | 1 |
mediawiki/tools/release | 1 |
mediawiki/services/ores/deploy | 1 |
mediawiki/services/eventstreams | 1 |
mediawiki/extensions/WikibaseQualityConstraints | 1 |
mediawiki/extensions/PropertySuggester | 1 |
mediawiki/extensions/PageTriage | 1 |
mediawiki/extensions/Cognate | 1 |
mediawiki/extensions/Babel | 1 |
maps/tilerator/deploy | 1 |
maps/kartotherian/deploy | 1 |
integration/jenkins | 1 |
eventlogging | 1 |
analytics/refinery/source | 1 |
analytics/refinery | 1 |
All-Projects | 1 |
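The counting behind a table like this one is simple once each incident has a list of repositories. In the hypothetical sketch below a repository is counted once per incident, no matter how many of its patches the report links to, which matches the note that the table does not show patch counts. The incident names and data are invented.

```python
from collections import Counter


def incidents_per_repository(incident_repos):
    """Count, for each repository, the incidents it is connected to.

    incident_repos maps an incident name to the repositories of its linked
    patches; duplicates within one incident are counted once.
    """
    counts = Counter()
    for repos in incident_repos.values():
        counts.update(set(repos))
    # Most incidents first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented example data.
sample = {
    "incident-a": ["operations/puppet", "operations/puppet", "mediawiki/core"],
    "incident-b": ["operations/puppet"],
}

for repository, incidents in incidents_per_repository(sample):
    print(f"{repository} | {incidents} |")
```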
Selecting Repositories
This table is sorted by the amount of change. The only column that needs explanation is Selected. It shows if a test makes sense for the repository, taking into account all available data. Repositories without stewards or with existing tests are excluded.
Repository | Change | Stewards | Coverage | Incidents | Selected |
---|---|---|---|---|---|
mediawiki/extensions | 11300 | | | | |
operations/puppet | 7988 | SRE | | 22 | |
mediawiki/core | 4590 | Core Platform | JavaScript | 6 | |
operations/mediawiki-config | 4005 | Release Engineering | | 4 | |
integration/config | 1652 | Release Engineering | | 2 | |
operations/software/librenms | 1169 | SRE | | | |
pywikibot/core | 927 | | | | |
mediawiki/extensions/Wikibase | 806 | WMDE | JavaScript, Ruby | 4 | |
apps/android/wikipedia | 789 | | | | |
mediawiki/services/parsoid | 700 | Parsing | | | |
mediawiki/extensions/VisualEditor | 692 | Editing | | | ✅ |
operations/dns | 653 | SRE | | 1 | |
VisualEditor/VisualEditor | 599 | Editing | | | |
mediawiki/skins | 570 | Reading | | | |
mediawiki/extensions/MobileFrontend | 504 | Reading | Ruby | | |
mediawiki/extensions/ContentTranslation | 491 | Language engineering | | | ✅ |
translatewiki | 486 | | | | |
oojs/ui | 469 | | | | |
wikimedia/fundraising/crm | 457 | Fundraising tech | | | |
mediawiki/extensions/BlueSpiceFoundation | 414 | | | | |
mediawiki/extensions/CirrusSearch | 357 | Search Platform | JavaScript | | |
mediawiki/extensions/AbuseFilter | 306 | Contributors | | | ✅ |
phabricator/phabricator | 302 | Release Engineering | | | ✅ |
mediawiki/services/restbase | 290 | Core Platform | | | |
mediawiki/extensions/Flow | 232 | Growth | | | ✅ |
mediawiki/extensions/Echo | 223 | Growth | JavaScript | | |
mediawiki/vagrant | 221 | Release Engineering | | 1 | |
mediawiki/extensions/Popups | 184 | Reading | JavaScript | | |
mediawiki/extensions/Translate | 182 | Language engineering | | | ✅ |
mediawiki/extensions/DonationInterface | 180 | Fundraising tech | | | ✅ |
analytics/refinery | 178 | Analytics | | 1 | |
mediawiki/extensions/PageTriage | 177 | Growth | | 1 | ✅ |
mediawiki/extensions/Cargo | 176 | | | | |
mediawiki/tools/codesniffer | 156 | | | | |
mediawiki/extensions/TimedMediaHandler | 152 | Reading | | | ✅ |
mediawiki/extensions/UniversalLanguageSelector | 142 | Language engineering | | | ✅ |
mediawiki/vendor | 140 | | | | |
mediawiki/extensions/SocialProfile | 139 | | | | |
analytics/refinery/source | 138 | Analytics | | 1 | |
operations/software | 137 | SRE | | 1 | |
mediawiki/services/restbase/deploy | 136 | Core Platform | | | |
operations/debs/pybal | 123 | SRE | | 2 | |
mediawiki/extensions/CentralAuth | 116 | | Ruby | | |
mediawiki/tools/release | 116 | | | 1 | |
mediawiki/services/cxserver | 112 | | | | |
mediawiki/extensions/BlueSpiceExtensions | 110 | | | | |
mediawiki/extensions/WikimediaEvents | 110 | | PHP | | |
labs/private | 108 | | | | |
operations/debs/python-kafka | 104 | SRE | | | |
labs/tools/heritage | 96 | | | | |
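The exclusion rule stated before the table (no stewards, or tests already exist) is mechanical and can be written as a filter. The sketch below uses hypothetical names and a handful of rows copied from the table. It only produces candidates: the Selected column also reflects a judgment about whether a Selenium test makes sense at all, which is why a repository such as operations/puppet passes the filter but is not selected.

```python
from dataclasses import dataclass


@dataclass
class Repository:
    name: str
    change: int
    stewards: str = ""   # empty string means no known stewards
    coverage: str = ""   # empty string means no existing tests
    incidents: int = 0


def candidates(repositories):
    """Keep repositories with stewards and no existing tests, most changed first."""
    kept = [r for r in repositories if r.stewards and not r.coverage]
    return sorted(kept, key=lambda r: r.change, reverse=True)


rows = [
    Repository("mediawiki/extensions/VisualEditor", 692, stewards="Editing"),
    Repository("mediawiki/core", 4590, "Core Platform", "JavaScript", 6),
    Repository("pywikibot/core", 927),
    Repository("operations/puppet", 7988, stewards="SRE", incidents=22),
]

for repository in candidates(rows):
    print(repository.name, repository.change)
```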
Since some of the repositories connected to incidents are not in the top 50 Bitergia report, I’ve used git rev-list to sort them. The numbers are different because Bitergia excludes empty commits and bots (script, results: P7834).
Repository | Change | Stewards | Coverage | Incidents | Selected |
---|---|---|---|---|---|
mediawiki/extensions/WikibaseQualityConstraints | 910 | WMDE | | 1 | ✅ |
mediawiki/extensions/ORES | 364 | Growth | JavaScript | 2 | |
wikidata/query/rdf | 204 | WMDE | | 2 | |
mediawiki/extensions/Babel | 146 | Editing | | 1 | ✅ |
mediawiki/services/ores/deploy | 84 | Growth | | 1 | |
maps/kartotherian/deploy | 80 | | | 1 | |
mediawiki/extensions/PropertySuggester | 67 | WMDE | | 1 | ✅ |
maps/tilerator/deploy | 61 | | | 1 | |
mediawiki/extensions/Cognate | 47 | WMDE | | 1 | ✅ |
All-Projects | 37 | | | 1 | |
eventlogging | 26 | | | 1 | |
integration/jenkins | 19 | Release Engineering | | 1 | |
mediawiki/services/eventstreams | 16 | | | 1 | |
wikidata/query/blazegraph | 10 | WMDE | | 1 | |
Prioritize Repositories
The Change column uses Bitergia numbers. Numbers in italics are from git rev-list.
Repository | Change | Stewards | Coverage | Incidents | Selected |
---|---|---|---|---|---|
mediawiki/extensions/VisualEditor | 692 | Editing | | | ✅ |
mediawiki/extensions/ContentTranslation | 491 | Language engineering | | | ✅ |
mediawiki/extensions/AbuseFilter | 306 | Contributors | | | ✅ |
phabricator/phabricator | 302 | Release Engineering | | | ✅ |
mediawiki/extensions/Flow | 232 | Growth | | | ✅ |
mediawiki/extensions/Translate | 182 | Language engineering | | | ✅ |
mediawiki/extensions/DonationInterface | 180 | Fundraising tech | | | ✅ |
mediawiki/extensions/PageTriage | 177 | Growth | | 1 | ✅ |
mediawiki/extensions/TimedMediaHandler | 152 | Reading | | | ✅ |
mediawiki/extensions/UniversalLanguageSelector | 142 | Language engineering | | | ✅ |
mediawiki/extensions/WikibaseQualityConstraints | *910* | WMDE | | 1 | ✅ |
mediawiki/extensions/Babel | *146* | Editing | | 1 | ✅ |
mediawiki/extensions/PropertySuggester | *67* | WMDE | | 1 | ✅ |
mediawiki/extensions/Cognate | *47* | WMDE | | 1 | ✅ |
The same table grouped by stewards.
Repository | Change | Stewards | Coverage | Incidents | Selected |
---|---|---|---|---|---|
mediawiki/extensions/VisualEditor | 692 | Editing | | | ✅ |
mediawiki/extensions/Babel | *146* | Editing | | 1 | ✅ |
mediawiki/extensions/ContentTranslation | 491 | Language engineering | | | ✅ |
mediawiki/extensions/Translate | 182 | Language engineering | | | ✅ |
mediawiki/extensions/UniversalLanguageSelector | 142 | Language engineering | | | ✅ |
mediawiki/extensions/AbuseFilter | 306 | Contributors | | | ✅ |
phabricator/phabricator | 302 | Release Engineering | | | ✅ |
mediawiki/extensions/Flow | 232 | Growth | | | ✅ |
mediawiki/extensions/PageTriage | 177 | Growth | | 1 | ✅ |
mediawiki/extensions/DonationInterface | 180 | Fundraising tech | | | ✅ |
mediawiki/extensions/TimedMediaHandler | 152 | Reading | | | ✅ |
mediawiki/extensions/WikibaseQualityConstraints | *910* | WMDE | | 1 | ✅ |
mediawiki/extensions/PropertySuggester | *67* | WMDE | | 1 | ✅ |
mediawiki/extensions/Cognate | *47* | WMDE | | 1 | ✅ |
Conclusions
- There are some repositories that do not fit the Selenium/end-to-end testing model (e.g. operations/puppet or operations/mediawiki-config) but could benefit from other testing mechanisms or deployment practices.
- A test could prevent an outage if it runs:
  - Every time a patch is uploaded to Gerrit. That way it could find a problem during development. That is already done for repositories that have tests.
  - After deployment. That way it could find a problem that was not found during development. In the ideal case, deployment would be made to a test server in production, and a test would run targeting that test server. If it fails, further deployment would be cancelled. This is not yet done.
- Automattic runs tests targeting WordPress.com production: "We decided to implement some basic e2e test scenarios which would only run in production – both after someone deploys a change and a few times a day to cover situations where someone makes some changes to a server or something."
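The after-deployment idea from the list above amounts to a gate between two deployment stages. The sketch below is purely illustrative and does not describe any existing deployment tooling: the function and its three callables are hypothetical stand-ins for "deploy to the test server", "run the test against it", and "continue the deployment".

```python
def staged_deploy(deploy_to_test_server, run_smoke_test, deploy_everywhere):
    """Deploy to a test server first and continue only if the test passes.

    Returns True when the full deployment went ahead, False when it was
    cancelled because the smoke test failed.
    """
    deploy_to_test_server()
    if not run_smoke_test():
        return False  # cancel further deployment
    deploy_everywhere()
    return True


# Usage with stand-in callables that only record what happened.
log = []
finished = staged_deploy(
    lambda: log.append("test server"),
    lambda: False,  # pretend the smoke test failed
    lambda: log.append("everywhere"),
)
print(finished, log)
```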
Next steps:
- I will contact owners of selected repositories (see the Prioritize Repositories section) and offer help in creating the first test.
- I will add results from the Incident Documentation tool to incident reports as a new Related Repositories section. The section will link to the tool and explain how it got the data. It will also ask for edits if the data is not correct.
- I will reach out to people who created (or edited) incident reports and ask them to populate the Related Repositories section. This might have mixed results. For best results, the section will already be populated with the data from the Incident Documentation tool.
- I will add a Related Repositories section to the incident report template (https://wikitech.wikimedia.org/wiki/Incident_documentation/Report_Template).
Incident Documentation tool improvements:
- There are several ways to link from a wiki page to a patch or task. The tool for now only supports [[gerrit:]] and [[phab:]]. Tracked as issue #6.
- Gerrit patches and Phabricator tasks from the Actionables section do not provide enough data. The entire incident report should be used. I limited it at first because I was collecting data manually (and Actionables looked like the most important part of the incident report), and later because of #6. Tracked as issue #4.
- Find Gerrit repository from task component. Tracked as issue #5.
- A table with the number of patches from each repository would be helpful. Tracked as issue #11.
- A report with the folders/filenames from a repository that are mentioned the most. Especially useful for big repositories like operations/puppet and mediawiki/core. Tracked as issue #12.