Building a single, cohesive domain list requires combining multiple sources. We prefer open sources available to anyone, though there is often a cost. Details of each source are provided below.

Common Crawl

Common Crawl is a project to index the web. The data is available on S3 to those with AWS accounts, and there is a transfer cost. We use the subset of the index stored in Parquet files that includes only URLs. This is still a couple of terabytes! Find more information at and browse the index files (requires AWS login) at:
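Once the Parquet index has been fetched, a typical first step is reducing each URL to its hostname before deduplicating into a domain list. A minimal stdlib sketch (the sample URLs are made up for illustration):

```python
from urllib.parse import urlsplit

def host_from_url(url: str):
    """Extract the lowercased hostname from a crawled URL, or None if absent."""
    host = urlsplit(url).hostname
    return host if host else None

# Reduce a handful of index rows to their hostnames, dropping unparsable entries.
sample_urls = [
    "https://Example.com/page?id=1",
    "http://news.example.co.uk/story",
    "not a url",
]
hosts = [h for h in (host_from_url(u) for u in sample_urls) if h]
```

At Common Crawl scale this per-URL step would run inside a columnar reader (e.g. over the Parquet files directly) rather than over Python lists, but the host extraction itself is the same.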


Centralized Zone Data Service

The Centralized Zone Data Service is provided by ICANN at . ICANN is the international body that coordinates internet domain names. The service itself is a website and web API for downloading "zone files" for generic top-level domains. This does not include country-code TLDs such as ".uk", and it only contains entries for domain names active in DNS. It is possible for a domain name to exist (or be reserved) and not show up in this list, even if it belongs to a supported TLD.
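Zone files are in DNS master-file format, where the owner name is the first whitespace-separated field of each record. A minimal sketch of collecting unique names, assuming one record per line (real CZDS downloads are large, usually gzipped, and can contain multi-line records, so this is illustrative only):

```python
def domains_from_zone(lines):
    """Collect unique, normalized owner names from zone-file record lines."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):  # skip blank lines and comments
            continue
        # Owner name is the first field; strip the trailing root dot and lowercase.
        seen.add(line.split()[0].rstrip(".").lower())
    return sorted(seen)

# Hypothetical excerpt in the shape CZDS zone files use.
sample = [
    "; zone file excerpt",
    "EXAMPLE.COM. 172800 in ns ns1.example.com.",
    "example.com. 172800 in ns ns2.example.com.",
    "other.com. 172800 in ns ns1.other.com.",
]
```

Because the same owner name appears once per NS record, deduplication is what turns a zone file into a domain list.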

Hacker News

The social news site run by Y Combinator at is also available as a public dataset in BigQuery. This provides a much more constrained view of the domain space, but one that is very active.
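A sketch of the kind of query the BigQuery dataset enables: pulling the distinct hostnames from submitted story URLs. The dataset and column names (`bigquery-public-data.hacker_news.full`, `url`) should be verified against the current public dataset listing; `NET.HOST` is BigQuery's built-in host-extraction function. Running it requires a Google Cloud project and a client such as `google-cloud-bigquery`.

```python
# SQL for the hypothetical extraction; execute with a BigQuery client of your choice.
sql = """
SELECT DISTINCT NET.HOST(url) AS host
FROM `bigquery-public-data.hacker_news.full`
WHERE url IS NOT NULL
"""
```

The result is small enough (on the order of hundreds of thousands of hosts) to export directly rather than staging through cloud storage.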