Wikidata:Requests for comment/Improve bot policy for data import and data modification
An editor has requested the community to provide input on "Improve bot policy for data import and data modification" via the Requests for comment (RFC) process. This is the discussion page regarding the issue.
If you have an opinion regarding this issue, feel free to comment below. Thank you!
THIS RFC IS CLOSED. Please do NOT vote nor add comments.
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
Closing this RFC with the following results:
- Bots importing from Wikipedia should, in addition to imported from Wikimedia project (P143), also add reference URL (P854) with the full URL as value, and either retrieved (P813) or the version id of the source page included in the full URL.
- Bots with a new source for an existing statement should add it as a new source, rather than create a new statement.
--Pasleim (talk) 18:58, 13 April 2016 (UTC)[reply]
Contents
- 1 Policy for data modification by bot
- 2 Improve data quality by bots
- 3 Constrain the data import from Wikipedia by bots
- 4 New Proposals by User:ArthurPSmith
- 4.1 Bots importing from Wikipedia should add a reference URL to the page imported from
- 4.2 Any bot, whatever the source, should not add a statement that has previously been deleted
- 4.3 Bots with a new source for an existing statement should add it as a new source, rather than create a new statement
- 4.4 Bots with a new value for an existing claim with the same source should replace the old claim value
- 4.5 Bots importing from Wikipedia should wherever possible attempt to extract the original source
- 5 What P143 (imported from) is good for
- 6 To identify the correct item a fact should be added to
- 7 Maintenance category for imported from statements
Policy for data modification by bot[edit]
Can we require bots to check the data already present in an item before making any modification, and make that check a requirement for obtaining the bot flag? If a bot does not perform the check, it will lose its bot flag, and other actions can be taken depending on the extent of the damage to WD. The required check is to verify whether a value already exists before adding a new one; if a value exists, the bot has to add a new value rather than erase the previous one.
- Support/ Oppose Comment
- Support unless the objective of the bot edit is to remove or replace the existing value with another one; for example, if consensus says that a property now has to be a qualifier, deleting the property to add it as a qualifier should not be a problem. -- Agabi10 (talk) 14:57, 20 November 2015 (UTC)[reply]
- Oppose At least from my experience with pywikibot, the only way to overwrite/delete a claim is to first fetch it; otherwise you are necessarily just adding a new claim. What this suggestion describes, only adding new values, is essentially what happens by default with no "check" at all. Unless you are suggesting the addition happen only if the new value is *different* from the old one? What about modification of references and qualifiers on an existing claim: if you have a new source and the value of the claim is the same as the existing one, should you just add the source to the existing claim, or add a new claim with the same value but the new source? I think adding the source to the existing claim is better. Another point on that: if the value is different but the *source* is the same, it seems to me the bot is perfectly within its rights to replace the old value with the new value, since it is presumably from an updated version of the same source. Rules or suggestions on that? In any case, I don't think the above proposal makes any sense as it stands. ArthurPSmith (talk) 16:20, 20 November 2015 (UTC)[reply]
- Strong oppose Agree with ArthurPSmith --I9606 (talk) 19:48, 20 November 2015 (UTC)[reply]
- Comment ArthurPSmith, I9606: If that is how pywikibot behaves, what is the reason to oppose? If pywikibot creates a new claim by default, it is easier to implement. At least as I understand it, when Snipre talks about the addition of a new value it means a new, different value. Obviously, in the case of the same value the correct action is adding a new reference to the existing value. -- Agabi10 (talk) 19:57, 20 November 2015 (UTC)[reply]
- Comment Agabi10 - the proposed "required check" is exactly equivalent to no check at all; what this proposed policy is implying is rather the opposite, that any bot that does a check and deletes the old claim or replaces it with a new one should not be allowed a bot flag, which I don't think is reasonable. However there are some policies along these lines that might make sense. Would it be appropriate to add other proposals for bot policies at the bottom of this RFC? ArthurPSmith (talk) 15:31, 23 November 2015 (UTC)[reply]
- Comment ArthurPSmith That's not what I understand, but seeing all the opposes, maybe I am the one who understood it incorrectly. Maybe Snipre could expand the explanation to make it clearer what they mean in this section. By the way, what other proposals do you want to add? At least I'm open to suggestions to improve data quality, as long as they don't make human corrections harder or more futile, and/or shrink the data input throughput too much. -- Agabi10 (talk) 17:26, 23 November 2015 (UTC)[reply]
- Agabi10 there are 3 or 4 specific things that have been mentioned in the discussion on this page which maybe could be their own separate proposals - for example recommending that bots not add a claim that was previously deleted. ArthurPSmith (talk)
- Oppose Although it only makes sense to check existing claims in the data, threatening the removal of one's bot flag is in my opinion too harsh. What checks are we talking about? There will be situations where the bot will have checks in place, but where people will disagree on the level of detail checked. These issues should be resolved through discussion, where there is no place for threats to remove bot flags. In our bot, we do have checks in place which rely on either WDQ or WDQS. These are mainly to check across Wikidata items for similar/identical claims. These checks sometimes do fail, because of update issues with WDQ/WDQS. I assume these are not the checks being discussed here, but is this a correct assumption? Another example of an issue that might get punished is where, according to one resource, a claim is deprecated while other resources annotate it as either normal or even preferred. In that case I would create a new "identical" claim with the appropriate rank; again, could this be picked up as being a flaw in my bot? There are simply too many possible grey areas for a strong policy to be anything but counterproductive. So, I agree with ArthurPSmith --Andrawaag (talk) 09:49, 21 November 2015 (UTC)[reply]
- Oppose Deletions by bot are just as valuable as additions by bot. I agree that checks should be required, but such a rule shouldn't say that the bot must add a new value and never remove the previous one. Instead I would like to see checks to make sure that a bot doesn't add duplicate values and, more importantly, checks that the values added haven't been previously deleted (for example, imported by another bot previously but then removed by a human because that statement is wrong). --Pajn (talk) 11:12, 22 November 2015 (UTC)[reply]
- Oppose I don't see the value added by add-only data. Depending on the task, sometimes a bot can add a new value, sometimes replace one. It is more important to have precise instructions when requesting approval for a bot task. --ValterVB (talk) 19:38, 22 November 2015 (UTC)[reply]
- Oppose; Bots shall edit in compliance with all existing rules, that would be enough. There are totally valid situations that allow addition, replacement and/or deletion of statements. No need to restrict bot behavior here. —MisterSynergy (talk) 18:54, 8 December 2015 (UTC)[reply]
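The behavior this proposal and ArthurPSmith's reply circle around (add when no claim exists, attach a reference when the value is identical, add alongside rather than overwrite when it differs) can be sketched in plain Python. This is a hypothetical illustration only: the flattened claim dicts are a much-simplified slice of Wikibase JSON, and the function names are invented, not part of pywikibot or any real bot framework.

```python
# Illustrative sketch: `decide_edit` and the simplified claim dicts are
# hypothetical, not part of pywikibot or the Wikibase API.

def claim_value(claim):
    """Extract the plain value from a simplified Wikidata-style claim."""
    return claim["mainsnak"]["datavalue"]["value"]

def decide_edit(existing_claims, new_value):
    """Decide what a conforming bot would do for one property:
    no claim yet -> add; same value -> attach the new reference to the
    existing claim; different value -> add alongside, never overwrite."""
    values = [claim_value(c) for c in existing_claims]
    if not values:
        return "add"
    if new_value in values:
        return "add_reference"
    return "add_alongside"

# Example: an item already holds population (P1082) = 1000.
claims_p1082 = [{"mainsnak": {"datavalue": {"value": 1000}}}]
print(decide_edit(claims_p1082, 1000))  # add_reference
print(decide_edit(claims_p1082, 1200))  # add_alongside
print(decide_edit([], 1200))            # add
```

Note that, as ArthurPSmith observes above, a bot that never fetches existing claims implements only the "add" branch by default; the other two branches are where the actual checking happens.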
Improve data quality by bots[edit]
Wikidata doesn't look for the truth but aims to have reliable and verifiable data. To accelerate the replacement of values that have no reference or an unreliable reference (typically data imported from Wikipedia, given the rule that WP is not a source), can we agree on the deletion of existing values that are unreferenced, or that use imported from Wikimedia project (P143) to record their Wikipedia origin, if and only if the new value has an external reference?
- Support/ Oppose Comment
- Weak support The bot should only delete a statement referenced with imported from Wikimedia project (P143) if it doesn't give more information than the sourced one, or if the information imported from Wikipedia is incorrect based on the source. For example, there are some biographies in Wikidata in which the value referenced with imported from Wikimedia project (P143) is a full date while the correctly sourced one is only the year. Deleting the imported from Wikimedia project (P143) statement by rule in this case would make us lose information, so it is not the desired behavior. In this example it could be acceptable to delete the information if the data in the source is supposed to be true (we can also import deprecated data). Following the last example, this could be applied if the sourced year is different from the year of the unsourced full date. -- Agabi10 (talk) 14:57, 20 November 2015 (UTC)[reply]
- Support This seems sensible; Agabi10's qualifier regarding "more information" seems appropriate too. I would even more strongly support this in the case of replacing claims with no source reference at all. However, I would hope that for each bot task the reviewers will look and ensure that the external reference being relied on is MORE reliable than Wikipedia's; this isn't a good rule if we're allowing any external site to replace consensus data which does have some value. In that vein, perhaps we need an additional rule that every specific external source that has data being pulled in by a bot should have its own designated bot task with a required approval process. ArthurPSmith (talk) 16:28, 20 November 2015 (UTC)[reply]
- Comment Why should an external reference be more reliable than Wikipedia? I was a Wikipedia author for many years and I would say that the references used in de:WP are reliable. And Wikidata has no problem with more than one claim about the same information with different references. The problem is, I guess, that so far infoboxes can't decide which data + references to choose. But why couldn't the infobox (the module) say: if there are two values + references, choose the one that is not imported from Wikipedia? There could be more than one rule coded into a module. --Molarus 17:12, 20 November 2015 (UTC)[reply]
- Comment Molarus If the data in Wikipedia is correctly referenced, the idea should be adding that reference to the Wikidata entity and removing the imported from Wikimedia project (P143). What imported from Wikimedia project (P143) Wikipedia means at the moment is something like "data added from Wikipedia without a human check; it could be vandalism, or unreferenced data that was later corrected". We are not necessarily doubting the reliability of the data imported from Wikipedia; the main doubts are about how trustworthy the data is, and in the case of claims with only imported from Wikimedia project (P143) references we simply can't know. I think that the ideal solution would be updating those references. Or do you think it would be acceptable in de:WP if the only line in the references section of an article were "This article is translated from the English Wikipedia and it was correctly referenced, so we can trust it"? -- Agabi10 (talk) 17:34, 20 November 2015 (UTC)[reply]
- At de:WP it is not allowed to import Wikidata data without external references. That means, if there are two claims, one with a reference from Wikipedia and the other with an external reference, the infobox should choose the second claim (and ignore the first). Of course, the Wikidata module in de:WP doesn't work this way. And we would need a bot that checks if someone changes the data with an external reference and undoes this. By the way, does
{{#property:P36}}
import references? PS: I'm writing a JS tool at the moment for importing numbers from WP infoboxes, and I will make the tool import the references too. Weblinks should be transformed into P1476 (title) + P854 (url). --Molarus 17:59, 20 November 2015 (UTC)[reply]
- Molarus I don't know if it loads any references at all; in eswiki we use a Lua module to get the info in most cases, and I can't remember any function to get the sources, nor do I remember having them in the loaded entity object. Anyway, if it doesn't import the sources, that should be a must-have feature. If we don't show the sources from Wikidata in the Wikipedia because we are too lazy or too busy to add that to the module, it's our fault; if it is not possible, it is a big error of the Wikibase client. If the clients can't get the sources to reference the data correctly, adding the sources here would be completely worthless. And as long as the sources are not readable from the clients, the Wikipedias will offer more resistance to using Wikidata's data. -- Agabi10 (talk) 18:55, 20 November 2015 (UTC)[reply]
- OK, I checked it, the references are inside the loaded entity object. -- Agabi10 (talk) 18:59, 20 November 2015 (UTC)[reply]
- At de:wp the module needs to have just one claim. Therefore, at this moment, this proposal would be at least a small improvement. I have changed my vote. --Molarus 19:36, 20 November 2015 (UTC)[reply]
- Oppose While I agree with the goal "to accelerate the replacement of values without references or with unreliable references", I'm not sure that going through and deleting all such statements is the best way to achieve it. It seems to me that people are more likely to act to improve a claim (delete it if invalid, or improve the reference if valid) if they can first see that it exists. Would you delete all sentences in Wikipedia articles with a "citation needed" tag? Tools that query the data can easily add filters to hide unreferenced data or data that was only referenced to Wikipedia. Perhaps another idea would be to consider adding an alert in the wikidata.org UI to notify users that claims are missing references (and thus should be treated with caution and could be improved by the user). I agree with the goal, but is there any evidence that this strategy would move us in that direction? --I9606 (talk) 20:11, 20 November 2015 (UTC)[reply]
- If we ask the developers to add that alert in the UI, we have to take into account that there are claims that are self-referencing, for example IMDb identifiers. The way to check whether such a statement is correct is to follow its link. Adding a reference of "stated in (P248): IMDb" doesn't make sense in this situation. The code related to the UI change should be aware of these things. -- Agabi10 (talk) 20:19, 20 November 2015 (UTC)[reply]
- Strong oppose One example: date of birth (P569) in Joseph Stalin (Q855). The statement from the authorities Integrated Authority File (Q36578) and Bibliothèque nationale de France (Q193563) is the official date, 21/12/1879 (Gregorian), which seems to be wrong and has been deprecated. As written in Wikipedia, the real date of birth seems to be the one stated with normal rank in Wikidata, 18/12/1878 (Gregorian), which has Wikipedia as source. So if this request is applied, the real date of birth (P569) of Joseph Stalin (Q855) will be lost.
Now imagine that a cataloguer working in an institution which provides authority data sees the Wikipedia article and the references in it, and rationally decides to change the date of birth. The authority data for the date of birth is not established from Wikipedia, but it is "very influenced" by Wikipedia. If we then import from this source and decide to replace Wikipedia as source with the authority as source, we will have a circular reference and lose sight of the initial source. WP is de facto a source, even with all its known issues; it is better to admit it than to try to erase it. And this isn't science fiction: there is a case in SUDOC (Q2597810) where the date of death is explicitly referenced from Wikipedia, http://www.idref.fr/029191149. The source matters, even if it's Wikipedia. --Shonagon (talk) 20:23, 22 November 2015 (UTC)[reply]
- Oppose Most info on Wikipedia is not clearly sourced, and so is data on Wikidata. That does not mean it is all wrong; it means you have to put in more effort yourself to double-check information, something you need to do anyway. Even with a source, information might be wrong. It might even be worse: with a source, chances are huge that a person will only double-check that one source, so if our source is wrong, we validate our information with wrong information. We, as a community, should try to find people that add false information to our project, instead of trying to find information that is not more than 99.999% likely to be correct. It would leave us with only 0.001% of what we have right now, and it would all fit in one printed edition of a book. Edoderoo (talk) 06:58, 29 February 2016 (UTC)[reply]
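I9606's point above, that data consumers can filter out unreferenced or Wikipedia-only claims rather than deleting them, can be illustrated with a small sketch. This is hypothetical code: the helper name is invented and the dicts mimic a much-simplified version of the Wikibase JSON reference layout, not the real API objects.

```python
# Illustrative sketch: `has_external_reference` is hypothetical; the dicts
# mimic a much-simplified Wikibase JSON reference layout.

def has_external_reference(claim):
    """True if at least one reference group on the claim uses any property
    other than P143 (imported from Wikimedia project)."""
    for ref in claim.get("references", []):
        if any(prop != "P143" for prop in ref.get("snaks", {})):
            return True
    return False

claim_wp_only = {"references": [{"snaks": {"P143": [{}]}}]}                # Wikipedia-only
claim_sourced = {"references": [{"snaks": {"P248": [{}], "P813": [{}]}}]}  # external source
claim_bare = {}                                                            # no references

trusted = [c for c in (claim_wp_only, claim_sourced, claim_bare)
           if has_external_reference(c)]
print(len(trusted))  # 1: only the externally sourced claim survives
```

A client module (for example a Lua infobox module, as Shlomo suggests below for cs.wiki) could apply the same test and simply ignore claims that fail it, leaving the data in place for editors to improve.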
Constrain the data import from Wikipedia by bots[edit]
Bots have imported quite a lot of data over the past two years to fill the items of Wikidata. But at the same time people are working on the data to eliminate errors and improve data quality. Current bot imports have some impact on the work of Wikidatans by sometimes erasing manual edits. To avoid that problem, a strict limitation or even a ban on data import from Wikipedia is proposed. The goal is to get feedback from contributors and, if a clear trend emerges, to determine which limitations can be applied.
- Support/ Oppose Comment
- Strong oppose I don't think that we have yet reached a point where we can no longer import useful data from the Wikipedias, and as long as the imported values are referenced with imported from Wikimedia project (P143) I think that we have to continue accepting those values. At least as long as we don't have a better-known, referenced and easy-to-import place to get that data. It is easier to get good references for manual, isolated statement creations than for automated ones, and there are not enough of us here to add all the required information manually if we want to have a useful database. -- Agabi10 (talk) 14:57, 20 November 2015 (UTC)[reply]
- WP:de forbids the use of data with imported from Wikimedia project (P143), and WP:fr is going in the same direction. I don't know about the rest of the WPs, but already now all data imported using imported from Wikimedia project (P143) is considered scrap data. So all imports based on raw data without any external reference are useless for WD. So I don't know when you will concede that these imports are not useful, when the final users already consider them useless. Snipre (talk) 16:57, 23 November 2015 (UTC)[reply]
- I don't think forbidding the use of data with imported from Wikimedia project (P143) is a good way for the Wikipedias (Wikisources etc.). The imported from Wikimedia project (P143) "pseudo-reference" doesn't make a statement worse than no reference at all; it just doesn't make it any better either. Besides, many (really many) statements were (and maybe still are) imported from Wikipedias (or even worse: guessed on the basis of [often misunderstood] Wikipedia categorization) without tagging them with imported from Wikimedia project (P143), so filtering out the tagged ones (which we can suppose were made by the more responsible users...) only solves the less painful part of the problem. What I suggested on cs.wiki (and what AFAIK was applied for the properties most damaged by irresponsible bot or WiDaR users) is not to use statements without references, where imported from Wikimedia project (P143) is not considered a reference. It needs a bit of Lua programming, but at this moment this seems to be the only way to make some use of this hodge-podge of valuable data and annoying garbage. --Shlomo (talk) 09:54, 18 December 2015 (UTC)[reply]
- Strong oppose Although data imports from Wikipedia have certainly caused problems, as a data source it is not unique in that regard.. and is a vitally important data source. It would be extremely bad to impose this block because (1) there is a lot of useful data in infoboxes that should be moved into wikidata and (2) it would likely be very annoying to potential new Wikidatans coming from the Wikipedia community. Better tools for avoiding errors with bots in general would be great. Blanket blocks like this would be terrible. --I9606 (talk) 19:54, 20 November 2015 (UTC)[reply]
- A "vitally important data source" for whom? Some Wikipedias forbid the use of data tagged with imported from Wikimedia project (P143). They don't want to source Wikipedia claims with Wikipedia as the source. Snipre (talk) 16:57, 23 November 2015 (UTC)[reply]
- Strong oppose Of course it could be a major issue that bots erase manual edits, but what is the relationship with imports from Wikipedia? If the problem is data erasure by bots, please make a specific request.
Wikidata is first an emanation of Wikipedia. Blocking data import from Wikipedia, while every day new items are created from Wikipedia articles, would be a pity. When we see what we have and could make now with data from Wikipedia, such a policy would be catastrophic. There is still a lot of data that could be imported from Wikipedia. Of course we have to be very careful, but we are; that was a point of my article [in French] Journey from DBpedia to Wikidata on a bot, written in October 2013. In many domains we still lack data, manual edits are not enough, and thankfully there are (or were) imports from Wikipedia. Let's continue carefully. --Shonagon (talk) 20:50, 22 November 2015 (UTC)[reply]
- Sorry, but you are wrong when you say we are careful when importing data from WP. I have seen examples of the opposite. Then you neglect the famous principle "WP is not a source" when saying that Wikidata is first an emanation of Wikipedia. The import from Wikipedia could be supported if the extraction of the sources were performed at the same time. But this is not the case, and it is even impossible, because data imports from WP use categories and infoboxes, which are rarely sourced. The problem is data being erased by lower-quality data. If, as proposed in the section above, we replace low-quality data with higher-quality data, then we gain. But not the inverse. Snipre (talk) 16:48, 23 November 2015 (UTC)[reply]
- Strong support per above discussion. -- Vlsergey (talk) 22:35, 17 December 2015 (UTC)[reply]
- Support The reliability of the data is at such a level that, if nothing is done to raise it, it wouldn't matter whether we added a hundred or a million new statements a day. Nobody would ask for them... --Shlomo (talk) 09:54, 18 December 2015 (UTC)[reply]
New Proposals by User:ArthurPSmith[edit]
Based on the above discussion and some comments elsewhere I'd like to propose the following: ArthurPSmith (talk) 22:05, 7 December 2015 (UTC)[reply]
- ArthurPSmith: I hope this is closer to your intention? --Succu (talk) 22:51, 7 December 2015 (UTC)[reply]
Bots importing from Wikipedia should add a reference URL to the page imported from[edit]
In addition to imported from Wikimedia project (P143) German Wikipedia (Q48183) (or whichever Wikipedia is the original), all bots importing from Wikipedias should henceforth include a link to the specific Wikipedia article and version that is the source, via reference URL (P854) with the full URL as value (http://de.wikipedia.org/...), together with either a retrieved (P813) timestamp or the version id (oldid?) of the source page included in the full URL. ArthurPSmith (talk) 22:05, 7 December 2015 (UTC)[reply]
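As a rough illustration of the proposed reference shape, the sketch below assembles a simplified Wikibase-style reference group combining P143, a version-pinned P854 URL, and P813. The function name, the sample oldid, and the flattened datavalue shapes are made up for the example; real Wikibase snaks carry datatype and precision fields omitted here.

```python
# Illustrative sketch: `build_wikipedia_reference`, the sample oldid, and the
# flattened datavalue shapes are hypothetical simplifications.

def build_wikipedia_reference(wiki_qid, article_url, oldid, retrieved):
    """Assemble one simplified reference group combining P143 (imported from),
    P854 (reference URL pinned to a page version) and P813 (retrieved)."""
    return {
        "snaks": {
            "P143": [{"datavalue": {"value": {"id": wiki_qid}}}],
            "P854": [{"datavalue": {"value": f"{article_url}?oldid={oldid}"}}],
            "P813": [{"datavalue": {"value": retrieved}}],
        },
        "snaks-order": ["P143", "P854", "P813"],
    }

ref = build_wikipedia_reference(
    "Q48183",                               # German Wikipedia
    "https://de.wikipedia.org/wiki/Berlin", # source article
    148912345,                              # hypothetical revision id
    "+2015-12-08T00:00:00Z",                # retrieval timestamp
)
print(ref["snaks"]["P854"][0]["datavalue"]["value"])
```

Pinning the oldid into the P854 URL is what makes the reference stable against later edits, merges, and sitelink changes discussed in the votes below.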
- Support --Andreas JN466 03:06, 8 December 2015 (UTC)[reply]
- Weak support In general I support this one, but if tools like QuickStatements are included, it should only be considered a recommendation, because depending on the way the list is generated it is not always possible. Also, at this moment, at least for QuickStatements, it doesn't look like it is possible to add two properties to the same reference, so it should be updated to allow that if this becomes mandatory in the end. -- Agabi10 (talk) 06:53, 8 December 2015 (UTC)[reply]
- I am sure Magnus would implement this in a certain way if we asked him to. Yellowcard (talk) 15:28, 8 December 2015 (UTC)[reply]
- Strong oppose The problem is not adding more reference data; the problem is that Wikipedia is not a source. If Wikidata is to be a reference database for the Wikipedias, no data from Wikipedia can be used. Snipre (talk) 08:57, 8 December 2015 (UTC)[reply]
- I agree in principle. But having a reference to a Wikipedia without having a link to the specific article version is even worse than having one with such a link. Andreas JN466 20:03, 8 December 2015 (UTC)[reply]
- Support; As long as we import from Wikipedia, the reference should be as specific as possible and point to a fixed version of an article. I meanwhile found enough outdated and confusing Wikipedia-references to know that Wikipedia articles are not stable enough to always point to the most recent version. —MisterSynergy (talk) 09:12, 8 December 2015 (UTC)[reply]
- Strong support The current way of referencing bot imports is really bad. No one can retrace the source data in the corresponding Wikipedia. Currently, there is neither a date nor a Wikipedia article nor an article version in the reference. This has to be changed as soon as possible to improve the quality of Wikidata. Yellowcard (talk) 15:30, 8 December 2015 (UTC)[reply]
- Weak support for retrieved (P813) A good idea to improve the quality of the reference, but with the reservations concerning tools like QuickStatements expressed by Agabi10. If it's not a rule, it could be a strong recommendation.
Oppose for reference URL (P854), a useless constraint in more than 99.9% of cases, where the Wikipedia URL is implicit from the page linked to the Q-item. Yes, it could be different (so it is conceivable that the import is done from a Wikipedia page other than the linked one; I don't see any example), or it could change (merges or interwiki changes), but such cases are so rare, and not necessarily an issue, that this constraint should not be required for all imports. imported from Wikimedia project (P143) + retrieved (P813) seems to be a relevant, reasonable and useful constraint for referencing all new imports from Wikipedia. Shonagon (talk) 18:56, 8 December 2015 (UTC)[reply]
- Comment Shonagon, the idea of using reference URL (P854) is not to specify the article; it is to link to the specific version of the article. Using retrieved (P813) could help to get the date, but if there were a lot of edits on that date it can be hard to find the specific version. -- Agabi10 (talk) 19:30, 8 December 2015 (UTC)[reply]
- Thanks Agabi10 for making that clearer for me. So I still prefer imported from Wikimedia project (P143) + retrieved (P813) instead of imported from Wikimedia project (P143) + reference URL (P854) with a specific version, which is harder to get and harder to read. Shonagon (talk) 19:40, 8 December 2015 (UTC)[reply]
- @Shonagon: retrieved (P813) but of what page? sitelinks can be changed without notice and from what I know, you do not have to use the sitelinked article to use P143. Often list-articles, categories and other things are used. The more detailed we are, the better! -- Innocent bystander (talk) 20:24, 8 December 2015 (UTC)[reply]
- @Innocent bystander: yes, sitelinks can be changed, but as written above, it's rare and not necessarily an issue. I'm just looking for a reasonable balance between efficiency and quality, to keep the dynamism and to improve quality, as proposed. In practice, a lot of current mass edits do not meet the proposed criteria. I prefer a reasonable step with a constraint for retrieved (P813), which is easy to adopt and already strongly improves the quality of the reference. The difference is not one of principle: if it's easy to do with reference URL (P854) and the version of a Wikipedia page, yes, that would be better, and I would agree. Shonagon (talk) 21:05, 8 December 2015 (UTC)[reply]
- Strong support I often find it very hard today to identify how a wrong claim has been included in the items. -- Innocent bystander (talk) 19:02, 8 December 2015 (UTC)[reply]
- Support. --Epìdosis 17:13, 10 December 2015 (UTC)[reply]
- Support --Anthonyhcole (talk) 01:40, 11 December 2015 (UTC)[reply]
- Support, a link to the specific version of the WP article is better than the nothing-status quo, for sure. --Wuerzele (talk) 11:21, 14 December 2015 (UTC)[reply]
- Support, weakly, with some limitations and concerns. It is a generally good idea, but it'd probably be better if we have a separate property for "Wikimedia page version" that we link to. That way, if we had an item on, say, someone called John Smith, and we imported some fact about Smith from English Wikipedia, that fact would have a reference of the form ( imported from: English Wikipedia, Wikimedia page version: numbering of particular diff ). This is to help distinguish stuff that has been machine imported from Wikipedia from pointers to other references. There are other issues: what about pages that have been revdelled? What about pages that have been deleted completely? If someone submits an article to Wikipedia that contains false and misleading information backed by faked citations, and Wikipedians quickly rumble this and delete the page, what processes do we have to ensure that this good article quality curation work gets transmitted up the stream to Wikidata? There are bigger concerns about how we ensure Wikidata builds quality control. —Tom Morris (talk) 10:48, 18 December 2015 (UTC)[reply]
Any bot, whatever the source, should not add a statement that has previously been deleted[edit]
This maybe should be a constraint built into the server side of the API to disallow adding a previously deleted statement if the property and value are identical - respond with an error message that the edit requires human attention or something like that. I'm not sure that typical bot software would have access to the edit history to check on this anyway. ArthurPSmith (talk) 22:05, 7 December 2015 (UTC)[reply]
- Weak support If this is done there should also be a parameter to override this behavior and allow adding the value again; even if it is not the standard behavior, for some actions it can be required. -- Agabi10 (talk) 06:53, 8 December 2015 (UTC)[reply]
- Comment I don’t see the problem here. Why would this be useful? —MisterSynergy (talk) 09:12, 8 December 2015 (UTC)[reply]
- @ MisterSynergy: one of the complaints was that bots were recreating false statements that had been previously examined by a human and deleted for good cause. The Signpost op-ed in particular mentioned some famous hoaxes and the "Brazilian aardvark" case. The concern here is with bots reverting the efforts of manual curation. It maybe should cover more issues (adding aliases and sitelinks previously deleted, creating items with labels and descriptions identical to one previously deleted, ...?). I think there is a real genuine concern that overuse of bots for wikidata editing is disempowering to the human editors and this would be one way to address it. ArthurPSmith (talk) 14:26, 9 December 2015 (UTC)[reply]
- Okay, Thanks for clarification. I prefer not to vote in this section then. —MisterSynergy (talk) 14:57, 9 December 2015 (UTC)[reply]
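The server-side check proposed above could be sketched roughly as follows. This is an illustrative sketch only, not the real Wikibase API: the history format, function names, and error type are assumptions, and the actual implementation would need access to the item's revision history.

```python
def is_previously_deleted(history, prop, value):
    """Return True if the (property, value) pair was removed at some
    point in the item's history and has not been re-added since."""
    deleted = set()
    for action, p, v in history:  # history in chronological order
        if action == "remove":
            deleted.add((p, v))
        elif action == "add":
            deleted.discard((p, v))
    return (prop, value) in deleted

def bot_add_statement(history, prop, value, override=False):
    """Refuse the edit unless explicitly overridden, so re-adding a
    human-deleted statement requires human attention (the override
    parameter reflects Agabi10's suggestion above)."""
    if is_previously_deleted(history, prop, value) and not override:
        raise PermissionError("statement was previously deleted; "
                              "human attention required")
    return ("add", prop, value)
```

The override flag keeps the check advisory rather than absolute, since a human (or a supervised run) may legitimately want to restore a deleted statement.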
Bots with a new source for an existing statement should add it as a new source, rather than create a new statement[edit]
If the property and value and qualifiers (modification of proposal) are identical, several sources reporting the same piece of data seems like a good thing. Even if one of them is a wikipedia. ArthurPSmith (talk) 22:05, 7 December 2015 (UTC)[reply]
- Support; This would enable us to add (external) references to already existing and maybe Wikipedia-imported statements. Would be great if we could do this with tools such as QuickStatements as well. —MisterSynergy (talk) 09:12, 8 December 2015 (UTC)[reply]
- Support, thought this was the Status Quo anyway. Yellowcard (talk) 15:49, 8 December 2015 (UTC)[reply]
- Support to make it explicitly a rule. I think it is already the case in practice. I remember having seen some issues of duplicate statements from automated imports, but it was a long time ago. Shonagon (talk) 19:02, 8 December 2015 (UTC)[reply]
- Comment This should always be the way to do it. But I guess such sources as Wikipedia do not have to be added at all when there already are better sources. Personally I normally replace such sources as Wikipedia when I can. -- Innocent bystander (talk) 19:05, 8 December 2015 (UTC)[reply]
- Comment @MisterSynergy, Yellowcard, Shonagon, Innocent bystander: my original proposal here missed the fact that you need to include qualifiers in the description of what makes two entries identical. For example if two claims on a person holding a political position are qualified with different date ranges, they are distinct even if the property and value are the same. I assume the above statements of support still hold but I wanted you to be aware of the modification. ArthurPSmith (talk) 16:04, 9 December 2015 (UTC)[reply]
- Support Though it should be noted that some values (dates at least) may look the same to a human editor but differ in the code making them impossible to easily compare. /Lokal Profil (talk) 01:26, 10 December 2015 (UTC)[reply]
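The amended matching rule (property, value and qualifiers must all be identical) could be sketched like this. The dict-based claim shape is illustrative, not the real Wikibase data model, and as Lokal Profil notes above, real value comparison (dates especially) is harder than a plain equality check.

```python
def find_matching_claim(claims, prop, value, qualifiers):
    """A claim matches only if property, value AND qualifiers are
    identical, per the amended proposal above."""
    for claim in claims:
        if (claim["property"] == prop and claim["value"] == value
                and claim["qualifiers"] == qualifiers):
            return claim
    return None

def add_with_source(claims, prop, value, qualifiers, source):
    """Attach the source to an existing matching claim instead of
    creating a duplicate statement."""
    claim = find_matching_claim(claims, prop, value, qualifiers)
    if claim is None:
        claim = {"property": prop, "value": value,
                 "qualifiers": qualifiers, "sources": []}
        claims.append(claim)
    if source not in claim["sources"]:
        claim["sources"].append(source)
    return claims
```

Two claims about the same political position with different date-range qualifiers stay distinct, while the same claim reported by two wikis collects both sources.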
Bots with a new value for an existing claim with the same source should replace the old claim value[edit]
If the same source has a new value for something (same source meaning identical source claims other than for retrieved (P813)) a bot should delete the old claim and add the new value. If there was more than one source on the previous statement, it should remove the matching source on the old statement and add the new as an additional statement. ArthurPSmith (talk) 22:05, 7 December 2015 (UTC)[reply]
- Weak oppose We should not be deleting valid information only because the same source is saying different things in different points of time. If this is accepted it should be specified that there are cases in which all the values should be kept and the incorrect ones should have their rank changed to deprecated. -- Agabi10 (talk) 06:53, 8 December 2015 (UTC)[reply]
- Oppose; Add a new statement with the reference and mark the old one with a low rank. —MisterSynergy (talk) 09:12, 8 December 2015 (UTC)[reply]
- Strong oppose per above. Yellowcard (talk) 15:52, 8 December 2015 (UTC)[reply]
- Comment Not sure that making this a general rule is a good idea. This could be relevant for some data, but not for others where the former value makes sense as deprecated with low rank, like a former attribution for an artwork. Shonagon (talk) 19:15, 8 December 2015 (UTC)[reply]
- Comment it was my feeling that changes are recorded in history, so using ranks for this wasn't necessary. However, maybe it is in some cases and not in others? Is there a way to restate this policy that makes sense? Maybe "bots with a new value for an existing claim with the same source should either replace the old claim value or change the rank of the old claim to 'deprecated'"? ArthurPSmith (talk) 16:11, 9 December 2015 (UTC)[reply]
- Comment - I can imagine situations where this would be true, but I can imagine situations where it is not true too. It really depends on the situation, and therefore it is not suitable as a general rule. Some sources change their data at a regular interval (maybe even constantly?), making the old data less useful. The number of goals a player has scored is not something we need to visualize in Wikidata on a weekly/monthly basis by adding more and more statements, in my view. But the population of a country every 3 years might give extra information by showing more data. Edoderoo (talk) 07:09, 29 February 2016 (UTC)[reply]
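The alternative several commenters above prefer to outright deletion, deprecating the old value and adding the new one, could be sketched as follows. The claim shape and rank strings are illustrative only, not the real Wikibase data model.

```python
def update_claim_same_source(claims, prop, new_value, source):
    """When a source now reports a different value, keep the old
    claim but lower its rank to 'deprecated', then add the new value
    as a separate normal-rank claim (rather than deleting history)."""
    for claim in claims:
        if (claim["property"] == prop
                and source in claim["sources"]
                and claim["value"] != new_value):
            claim["rank"] = "deprecated"
    # only append if the new value is not already recorded
    if not any(c["property"] == prop and c["value"] == new_value
               for c in claims):
        claims.append({"property": prop, "value": new_value,
                       "sources": [source], "rank": "normal"})
    return claims
```

As Edoderoo's comment suggests, whether to keep the superseded value at all (e.g. historical population figures vs. a running goal tally) would still be a per-dataset decision.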
Bots importing from wikipedia should wherever possible attempt to extract the original source[edit]
Infoboxes don't typically include source references, but if a date of birth for example is pulled from a wikipedia infobox, that date should be referenced somewhere in the article as well. If a bot can reliably locate the corresponding piece of information within the article text and finds a reference at the end of the associated sentence, the bot should provide that reference as a second source in addition to "imported from". ArthurPSmith (talk) 22:26, 7 December 2015 (UTC)[reply]
- Support --Andreas JN466 03:06, 8 December 2015 (UTC)[reply]
- Question Is it really possible to automatically know if a statement is referenced in the text with a good reference? -- Agabi10 (talk) 06:53, 8 December 2015 (UTC)[reply]
- Oppose Can’t imagine that this works properly. —MisterSynergy (talk) 09:12, 8 December 2015 (UTC)[reply]
- Comment Would be nice, but is very hard (and in some/many cases impossible) in practice to be done by a bot. Yellowcard (talk) 15:54, 8 December 2015 (UTC)[reply]
- Strong oppose Generally not a reliable way to do it. The exception is maybe some sets of bot-created articles that have not been touched by carbon-based lifeforms yet. -- Innocent bystander (talk) 19:08, 8 December 2015 (UTC)[reply]
- Oppose It seems to be a unicorn. If it were easily possible, it would already have been adopted. And even if possible, like MisterSynergy said, it is not sure that it would work properly. Shonagon (talk) 19:21, 8 December 2015 (UTC)[reply]
- Support --Anthonyhcole (talk) 01:48, 11 December 2015 (UTC)[reply]
- Comment "Whenever possible" is rather vague. If admins (or a future bot approval group?) are approving a bot, "whenever possible" will be rather limited by both the context of the type of import, and by the skill of the programmer involved. —Tom Morris (talk) 10:57, 18 December 2015 (UTC)[reply]
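For concreteness, the kind of best-effort extraction the proposal describes, and that most commenters consider unreliable, might look like the sketch below: find a sentence containing the infobox value and return the `<ref>` immediately following it. This is a deliberately naive illustration of why the approach is fragile, not a recommended implementation; real wikitext (templates, named refs, multi-sentence support) defeats it easily.

```python
import re

def find_supporting_ref(wikitext, value):
    """Scan sentence by sentence for the first occurrence of the
    infobox value that is immediately followed by a <ref>...</ref>.
    Returns the ref markup, or None if no sentence both contains the
    value and carries a trailing reference."""
    pattern = re.compile(r"[^.]*\.\s*(<ref>.*?</ref>)?")
    for match in pattern.finditer(wikitext):
        if value in match.group(0) and match.group(1):
            return match.group(1)
    return None
```

A bot using something like this would still only add the extracted reference as a second source alongside imported from (P143), never as a replacement, and would skip the claim entirely when no ref is found.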
For what P143 (imported from) is good for[edit]
If I want to show references from WD in Wikipedia, I would like to show references in my language. Therefore I would suggest using this property to group references of the same language together. A Lua module could look for this property, and if there are no refs in "my" language, the module could take for example the en refs as a default. If there is no P143 the module has to take as references whatever is there. Therefore, bots that import references should use P143 to show where the references are from. I know that the result could be 300+ refs for one statement. --Molarus 19:41, 8 December 2015 (UTC)
- @Molarus: Are you aware that there is a property that can be used to show the language of a source? See Q21538626#P1448. -- Innocent bystander (talk) 08:17, 9 December 2015 (UTC)[reply]
- I wasn't aware of that property, but P143 is already in most references. Therefore we don't have to add millions of P407 statements for the same purpose. The idea behind my proposal is that if we agree on this (false) use of this property, we could adapt the Lua modules to use it for that purpose. --Molarus 08:42, 9 December 2015 (UTC)
- I have written a lot of articles about astronomy on svwiki, but there are almost no good sources in Swedish in this subject. I have always used English sources. So if you import data from Astronomy-articles on svwiki, the sources are written in English. -- Innocent bystander (talk) 08:55, 9 December 2015 (UTC)[reply]
- You are right that a language property would be better, but no bot would know that your source is written in English. In my proposal the solution would be that the standard values are changed in the wp article, so that the infobox would show your Swedish references in case there were also e.g. French references, written in French (which I can't read, while I can read English). If your reference is the only one, the module would take it, even if you have given one written in Swedish. Better a reference I can't read than no reference at all. --Molarus 14:58, 9 December 2015 (UTC)
- @Molarus: Sometimes one single reference only supports parts of a claim. "X is an:Swedish urban area from:1960/to:1985" can have one source for it starting in 1960 and one for it ending in 1985. (There is no good source for all parts of such a claim.) In fact, I would like to see a reference for every five years from 1960 to 1990, to be sure that it has not lost its status anytime between 1960 and 1985. To then modify the modules to show only one reference (maybe the best according to your references) would not support the whole claim. -- Innocent bystander (talk) 09:44, 10 December 2015 (UTC)[reply]
- @Innocent bystander: I don't think this example has anything to do with references. Just add the same statement with different qualifiers. For example I had a similar problem with Q1444786, an old building that started as a tower in the 12th century, then was a prison and in the end was destroyed in 1856. Things change over time. I think something like that could be a real showcase item. Another similar problem is this Q152136 Roman province. I invested some time for a better Reasonator article. I think the whole-history items are not maintained well, but they could host so much information. --Molarus 19:12, 10 December 2015 (UTC)
- @Molarus: If a place is an urban area 1960-1985 and then again 1995-2010, that would fit well into two different claims, yes. Setting them into four different claims: startdate:1960, enddate:1985, startdate:1995, enddate:2010 does not look like a good idea at all. -- Innocent bystander (talk) 19:21, 10 December 2015 (UTC)[reply]
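The fallback logic Molarus describes for such a module could be sketched like this, in Python for illustration (the real implementation would be a Lua module reading P143 from the reference snaks; the dict shape here is an assumption).

```python
def pick_references(references, lang, fallback="en"):
    """Prefer references imported from the reader's language
    Wikipedia (recorded via P143), fall back to the en refs, and
    otherwise take whatever references are there. Each reference is
    modeled as a dict with an optional 'imported_from' language
    code."""
    by_lang = {}
    for ref in references:
        by_lang.setdefault(ref.get("imported_from"), []).append(ref)
    for key in (lang, fallback):
        if by_lang.get(key):
            return by_lang[key]
    return references
```

As Innocent bystander points out in the thread above, filtering down to one language group risks dropping references that each support only part of a claim, so a real module might need to keep all references whose combination covers the claim.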
To identify the correct item a fact should be added to[edit]
When I have a group of facts I want to add to Wikidata, I have to identify three things:
- What item is this fact about?
- How do I add this fact?
- How do I source it?
The first is often the most difficult. If I have a fact that "John Smith was born 1 April 1650" I have to identify which of all the items is about my John Smith. The source here may be a very reliable one. But I probably have to look into the pages on Wikipedia to identify the correct item. If Wikipedia mixes up the sons of Mrs Smith, it then does not matter how reliable my source is. My statements will still be added to the wrong item.
This is one of the problems I am currently working on. Yesterday, I removed a number of statements from Gothenburg (Q25287). They were in properties like twinned administrative body (P190), head of government (P6), coat of arms image (P94), area (P2046), population (P1082). All of these statements were more or less correct, but they were all added to the wrong item. The English description of this item is "Swedish city". What even many Swedes then miss is that Q25287 is not at all about the "City of Gothenburg". It is about the urban area of Gothenburg. And urban areas do not have coats of arms, mayors or sister cities. And the area and population of the urban area are not the same as those of the City.
Some of these statements were probably added by carbon-based lifeforms, not by robots. But they have the same problems. You have to find a reliable way to identify which item you should add the statements to. And if you then rely on Wikipedia, there can still be very large mistakes. -- Innocent bystander (talk) 08:55, 9 December 2015 (UTC)[reply]
- I have discovered a new way to approach your problem lately. I was looking into a WP article which should be about a lake (sometimes those articles are about a river, not a lake). To verify that, I took the template and looked for the name of the English version of the template. It was a lake template, so my assumption was right. As far as I know, at the moment there is no connection in WD between an article and the templates the article uses. With one, it would be very easy to distinguish between a city and an urban area. Without it, the verification process is much more difficult for a bot, if done at all. --Molarus 15:19, 9 December 2015 (UTC)
- Innocent bystander: This is more about basic item verification that has to be done mostly manually anyway: adding meaningful labels and descriptions and getting the interwikis right. If the interwikis stem from the initial import in March/April 2013, the situation is probably messy and needs to be fixed before useful items are present. Before Wikidata, very different entities were connected with each other by interwikis, and many of these problems persist until today. If you then import from different Wikipedias that have articles about different entities, it is of course a messy item then. However, a diligent bot maintainer should identify these problems in advance of the bot run and either clean up or skip the item. —MisterSynergy (talk) 15:56, 9 December 2015 (UTC)[reply]
- I am afraid it still is made "messy". Big parts of the interwiki fixing are done by non-Wikidata users, who prefer their own way of doing it. -- Innocent bystander (talk) 19:27, 10 December 2015 (UTC)[reply]
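One mechanical guard a bot maintainer could apply for the problem described in this section is to check the target item's instance of (P31) values against the class the bot expects before adding anything. A minimal sketch, with an illustrative item layout rather than the real API shape (Q515 is city; the urban-area QID below is hypothetical):

```python
def safe_to_add(item, expected_class):
    """Only import into items whose instance of (P31) values include
    the class the bot expects, e.g. refuse to add a mayor or a coat
    of arms to an item that is an urban area rather than a city."""
    return expected_class in item.get("claims", {}).get("P31", [])
```

This would not have caught every mistake described above (P31 can itself be wrong or missing), but it turns a silent wrong-item edit into a skipped edit the maintainer can review.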
Maintenance category for imported from statements[edit]
On svwiki Q21700437 has been installed as a maintenance category for articles with Lua templates that have "imported from" in the sources. Are you aware of any similar systems in other wikis? -- Innocent bystander (talk) 16:45, 11 December 2015 (UTC)[reply]