European Court of Human Rights Mapping Project
Principal Investigator(s): Jessica Greenberg, University of Illinois at Urbana-Champaign; Benjamin Krupp, University of Illinois at Urbana-Champaign; Stephanie Auer
Version: V1
Name | File Type | Size | Last Modified |
---|---|---|---|
final_for_viz.csv | text/csv | 28 MB | 11/24/2021 02:24 AM |
Project Citation:
Greenberg, Jessica, Krupp, Benjamin, and Auer, Stephanie. European Court of Human Rights Mapping Project. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2021-12-02. https://doi.org/10.3886/E155781V1
Project Description
Summary:
This database and the related web-based map application consist of easily searchable, cleaned, and standardized data scraped from the European Court of Human Rights online database (HuDOC). The records cover all judgments of violation and nonviolation from the Court’s first cases through judgments delivered on or before October 24, 2021. The .csv file includes fields for case name, application number, country, conclusion (violation/nonviolation), article number, paragraph number, importance (based on the HuDOC key), and date of judgment. The related web map application can be found at: https://auerdatascience.shinyapps.io/shiny_app/. All code for this project is on GitHub at https://github.com/sauer3/human-rights-app.
Please see "collection notes" for complete methodology.
Funding Sources:
National Science Foundation. Law and Social Sciences Program
Scope of Project
Subject Terms:
European Court of Human Rights;
human rights violations;
HuDOC
Geographic Coverage:
Europe
Time Period(s):
1968 – 2021 (European Court of Human Rights case law through October 24, 2021)
Universe:
Judgments of violation and nonviolation in European Court of Human Rights case law from 1968 through October 24, 2021.
Data Type(s):
other
Collection Notes:
This dataset was scraped from the European Court of Human Rights’ official database, HuDOC. The dataset includes fields for the following categories of case information: application_number, document_title, document_type, originating_body, date, conclusion, violation, language, document collection, raw_country, and importance. The violation field is further broken down into article, paragraph, and sub-paragraph. Every violation or nonviolation has a separate entry; an application with multiple conclusions (violation or nonviolation) will appear as separate entries with the same application number and case name.
We started by scraping data from the HuDOC site, programmatically generating a URL to download the violations and non-violations database for every month of every year from 1968 to 2021. These tables were then concatenated into a master data frame. This original master had a single column that contained all the article/paragraph/sub-paragraph information as non-uniform text. Because the information in this field was not standardized, it was difficult to clean into sortable article/paragraph/sub-paragraph form. To solve this problem, we altered the scraping script by adding article, paragraph, and sub-paragraph to the parameters by which we downloaded the dataset. We also added a separate scrape for the importance class (key, 1, 2, 3), which we joined to the master set by application number. We then cleaned and ordered this final master set.
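As an illustration of the month-by-month download loop described above, the following is a minimal Python sketch that only generates the date windows. The function name `monthly_windows` is hypothetical, and the actual HuDOC query URL and parameters are deliberately not reproduced here because the notes do not give them; the project's own script (see the GitHub repository) may be organized differently.

```python
from calendar import monthrange
from datetime import date


def monthly_windows(first_year, last_year):
    """Yield (start, end) dates covering each calendar month in the range."""
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            # monthrange returns (weekday of day 1, number of days in month)
            last_day = monthrange(year, month)[1]
            yield date(year, month, 1), date(year, month, last_day)


# One window per month from January 1968 through December 2021.
windows = list(monthly_windows(1968, 2021))
print(len(windows))
print(windows[0], windows[-1])
```

In a scrape of this shape, each window would be combined with the article, paragraph, and sub-paragraph parameters mentioned above, and the resulting tables concatenated into one master frame.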
Once we arrived at a master data frame that contained all of the categories we desired, the next task was to clean for inconsistencies in HuDOC’s manual coding. We wrote a data cleaning script to prepare the data set for exploratory analysis and visualization. This script targeted four areas of inconsistency within the data. (1) It pulls the country name from the title of the case and resolves inconsistencies in country names, spellings, and languages used. (2) It formats dates to be consistent and machine-readable. (3) It separates cases that involve multiple countries. (4) It removes the lower court ruling when the Grand Chamber overturned a Chamber judgment. This means the dataset reflects the Court’s final ruling of violation or nonviolation.
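Cleaning steps (1) to (3) can be sketched roughly as follows in Python. This is not the project's cleaning script: the title pattern ("... v. COUNTRY" or "... c. COUNTRY"), the `dd/mm/yyyy` input date format, the `ALIASES` table, and all function names are assumptions made for illustration. Step (4), dropping Chamber rulings overturned by the Grand Chamber, is not shown.

```python
import re
from datetime import datetime

# Hypothetical alias table mapping title spellings to one standard name.
ALIASES = {
    "THE UNITED KINGDOM": "United Kingdom",
    "ROYAUME-UNI": "United Kingdom",
    "TURQUIE": "Turkey",
    "BOSNIA AND HERZEGOVINA": "Bosnia and Herzegovina",
}
# Country names that contain "AND" and must not be split into two countries.
COMPOUND_NAMES = ("BOSNIA AND HERZEGOVINA",)


def countries_from_title(title):
    """Return the standardized respondent countries named in a case title."""
    # Assumes an upper-case title with a lower-case "v." or "c." separator.
    match = re.search(r"\s(?:v\.|c\.)\s+(.+)$", title.strip())
    if match is None:
        return []
    respondent = match.group(1).upper()
    # Shield compound names so the split on "AND" leaves them intact.
    for name in COMPOUND_NAMES:
        respondent = respondent.replace(name, name.replace(" AND ", "_AND_"))
    cleaned = []
    for part in re.split(r"\s+AND\s+|\s+ET\s+|,\s*", respondent):
        name = part.replace("_AND_", " AND ").strip()
        if name:
            cleaned.append(ALIASES.get(name, name.title()))
    return cleaned


def iso_date(raw):
    """Convert an assumed dd/mm/yyyy string to ISO 8601 (yyyy-mm-dd)."""
    return datetime.strptime(raw.strip(), "%d/%m/%Y").date().isoformat()


print(countries_from_title("CASE OF ILASCU AND OTHERS v. MOLDOVA AND RUSSIA"))
print(iso_date("08/07/2004"))
```

A case with several respondent countries yields a list with one entry per country, which can then be expanded into one row per country as in step (3).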
We tested separately for the comprehensiveness and accuracy of our master set. In comprehensiveness tests, our objectives were to make sure that the dataset included all cases and mirrored both internal court reporting and HuDOC. In accuracy tests, our objective was to test our dataset at the highest resolution (specific article, paragraph, and sub-paragraph information) against both HuDOC and original court documents (judgments) to identify possible bugs in our scraping script and/or inconsistencies in HuDOC’s internal coding. In our comprehensiveness testing, we checked our cumulative dataset numbers against those reported in the ECtHR’s annual yearbooks. Over the entirety of the dataset, there was a 6% discrepancy in total violations between these two sets: our master dataset included only 94% of the cases listed in the annual yearbooks. We attribute this discrepancy to cases that were not listed in English (our scrape selected English versions of the cases), cases of overturned violations, and HuDOC’s internal miscoding and human error.
To test the master frame for accuracy, we manually coded (based on judgment documents) 193 cases selected at random in proportion to the total number of violation judgments in a given year. The ratio was 1 test judgment for every 100 cases, with a minimum of 1 test case per year, in order to maintain the historical breadth of the tests. We manually coded these cases for article violations, application number, date, and country of origin. We compared this manually coded test frame against our master frame and found that our data set was 99% accurate to HuDOC, but that HuDOC was only 92.8% accurate to case documents. After analysis, this 7.2% error was due to HuDOC miscoding, specifically missing paragraph and sub-paragraph designations when the case was coded/entered into the HuDOC system. The vast majority (around 80%) of these mistakes were in Articles 5 and 6 and Protocol 1, areas where paragraph and sub-paragraph designations are most meaningful.
The HuDOC system records the article as a general number (Article 5, Article 6, etc.). Where a paragraph specifies a particular subcategory of violation, it also breaks that general number down by paragraph (e.g. Article 5-1). In our tests we found coding inconsistencies in how these sub-designations were used. For example, a violation of Article 5, paragraph 1, was at times coded as simply an Article 5 violation; in other cases it was coded as both a 5 and a 5-1 violation (the standard and proper way to code this to maximize search accuracy); and in still other cases it was coded as only 5-1. This means that searches for Article 5 violations would not return all Article 5 cases if they had not been comprehensively coded.
We found these coding inconsistencies in as many as 14.6% of Article 6 cases and 13.6% of Article 5 cases. Roughly a third of these hand-coding errors were correctable in the script (for example, by ensuring that every 5-1 violation was also listed as a 5 violation).
However, for the cases in which the paragraph was omitted in HuDOC, the only corrective path would be hand-coding the court documents ourselves. This means our dataset returns more accurate results than HuDOC for queries relating to articles with the most paragraph sub-designations (for example 5, 6, and some protocols), but it is not 100% accurate to case documents (judgments) and cannot achieve full accuracy to case documents in those same areas.
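The script-level correction mentioned above (every 5-1 violation is also listed as a 5 violation) can be sketched in Python as follows. The function name and the code format ("5-1", "6-3-c") are assumptions for illustration; the sketch only handles plain Convention article codes that start with a number and leaves protocol-style codes untouched, since the notes do not say how those were treated.

```python
import re


def with_parent_articles(codes):
    """Add the general article number for every paragraph-level code."""
    expanded = set(codes)
    for code in codes:
        # "5-1" or "6-3-c" -> general article "5" or "6".
        match = re.match(r"^(\d+)-", code)
        if match:
            expanded.add(match.group(1))
    return sorted(expanded)


# A case coded only as 5-1 becomes findable by a search on Article 5.
print(with_parent_articles(["5-1"]))
print(with_parent_articles(["5-1", "6-3-c"]))
```

The reverse problem, a case coded only as "5" when the judgment found a 5-1 violation, cannot be repaired this way because the missing paragraph is not present in the source data.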
We are confident that any errors in the dataset are due to the HuDOC source material. Our dataset is accurate to within 1% of HuDOC and corrects for HuDOC errors (likely due to human coding error) where possible, particularly with regard to paragraph and sub-paragraph designations as outlined above.
Methodology
Collection Mode(s):
web scraping
This material is distributed exactly as it arrived from the data depositor. ICPSR has not checked or processed this material. Users should consult the investigator(s) if further information is desired.