Data Critique – Global Data Centers

Data Critique

Data center adjacent to residential housing in Loudoun County, Virginia.

This image illustrates how data center infrastructure is often situated in close proximity to local communities, raising concerns about environmental impact, resource consumption, and land use.

Dataset Overview

The dataset our team chose for this project is the Global Data Center Dataset from Kaggle, which reports statistics on global data infrastructure by country as of 2025. Each row represents one country, and the twenty columns include the total number of data centers, floor space occupied, power and renewable energy consumed, the key companies that run the data centers, cloud service providers, the country’s internet penetration, and more. The dataset can therefore shed light on the scale and impact of data centers at the country level, and it can help answer more specific questions about environmental justice, such as how cloud provider presence and internet penetration reveal global power imbalances.

Data Sources & Construction

The dataset is compiled by Shashank Tripathi, a student at IIT Guwahati, and is expected to be updated quarterly. It is a secondary dataset that aggregates information from multiple publicly accessible sources rather than relying on a single standardized collection process. According to the dataset documentation, sources include government reports, DataReportal, ISOC Pulse, Statista, Cloudscene, the International Energy Agency (IEA), national operator disclosures, Wikipedia, Reddit, and other industry publications. Because the dataset draws on a mix of institutional, commercial, and crowdsourced material, the level of detail and reliability may differ across countries. This variation reflects broader disparities in how data center infrastructure is reported and documented around the world, particularly across countries with differing levels of transparency and technical capacity.

Funding & Institutional Context

The funding source for this Kaggle dataset could not be identified. The dataset was uploaded by Shashank Tripathi, a student at IIT Guwahati, with no disclosed institutional backing, professional credentials, or acknowledgement of financial support. We therefore assume that it is an unfunded individual compilation rather than a formally sponsored research project. However, its underlying sources, such as ISOC Pulse, Statista, and the IEA, have distinct funding structures that systematically shape what information is collected and what is omitted. For example, the International Energy Agency (IEA), funded by wealthy OECD nations, provides robust data for industrialized countries, while developing regions remain statistically marginalized. This is not because the creator chose to exclude these perspectives but because no funded institution systematically collects that data.

Missing Data & Limitations

Information is left out of the spreadsheet because many of its values are aggregated, country-level estimates, so granular data is excluded. We don’t know how data centers are distributed within each country, and for large countries in particular, the impact on specific regions cannot be analyzed. For some of our research questions, we will likely need outside data about each country, specifically population, GDP, and total electricity consumption, none of which are included here. Within the dataset itself, several countries have missing values, denoted with ‘unknown’. The dataset doesn’t identify the sources of the renewable electricity it reports, nor does it specify what the other energy sources are. The categorical data, specifically tier_distribution, is less specific for smaller countries; where percentages are unavailable, we estimate values from the descriptive wording in that field.
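To make these gaps concrete, the following is a minimal sketch, not our actual analysis pipeline, of how the ‘unknown’ markers could be treated as missing values and how an outside figure such as population could be joined in. The sample rows, the country names, the column names total_data_centers and renewable_energy_usage, and the population figures are all hypothetical placeholders chosen for illustration; the real dataset’s column names and values may differ.

```python
import csv
import io

# Hypothetical sample rows; column names and values are illustrative only
# and are not copied from the Kaggle dataset.
SAMPLE_CSV = """country,total_data_centers,renewable_energy_usage
Country A,120,40%
Country B,unknown,unknown
Country C,15,unknown
"""

# Hypothetical outside data (population in millions), not part of the dataset.
POPULATION_MILLIONS = {"Country A": 60.0, "Country B": 8.0, "Country C": 5.0}

# Strings that should be treated as missing rather than as real values.
MISSING_MARKERS = {"unknown", ""}


def load_rows(text):
    """Parse CSV text and replace missing-value markers with None."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for key, value in raw.items():
            cleaned = value.strip()
            row[key] = None if cleaned.lower() in MISSING_MARKERS else cleaned
        rows.append(row)
    return rows


def missing_counts(rows):
    """Count how many rows have a missing value in each column."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (value is None)
    return counts


def centers_per_million(rows, population):
    """Data centers per million people; None where either input is missing."""
    result = {}
    for row in rows:
        count = row["total_data_centers"]
        people = population.get(row["country"])
        if count is None or not people:
            result[row["country"]] = None
        else:
            result[row["country"]] = int(count) / people
    return result


rows = load_rows(SAMPLE_CSV)
print(missing_counts(rows))
print(centers_per_million(rows, POPULATION_MILLIONS))
```

Keeping missing values as None, rather than silently converting them to zero, means a country with unreported figures is excluded from a per-capita comparison instead of appearing to have no data centers at all.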

Scope & Implications

If this dataset were our only source, integral information about these countries would be left out. Because the dataset measures global data center infrastructure at the country level, it omits regional information within each country. Its generalized perspective also means that it does not cover individual sectors within a country and likely captures only surface-level information, so it cannot account for local variation in each country’s infrastructure. These regional factors shape the country-level figures the dataset reports, yet they are absent from the data itself, even though they are part of the larger picture.