Building a single, cohesive domain list requires combining multiple sources. We prefer open sources available to anyone, though there is often a cost involved. Details of each source are provided below.

Common Crawl

Common Crawl is a project to crawl and index the web. The data is available on S3 to anyone with an AWS account, though there is a transfer cost. Dir.domains uses the columnar index subset stored in Parquet files, which contains only URLs. This is still a couple of terabytes! Find more information at http://index.commoncrawl.org/ and browse the index files (requires an AWS login) at:

https://s3.console.aws.amazon.com/s3/buckets/commoncrawl?prefix=cc-index/table/cc-main/warc/
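As a rough sketch of how that subset can be consumed, the Python snippet below downloads a single Parquet shard and extracts the registered domains it mentions. It assumes AWS credentials are configured locally, and the object key and the url_host_registered_domain column name reflect the index layout at the time of writing; verify both against the bucket before relying on them.

    import boto3
    import pyarrow.parquet as pq

    BUCKET = "commoncrawl"
    # Illustrative key: real keys are partitioned by crawl and subset, and the
    # part-file names carry long suffixes; list the prefix and substitute one.
    KEY = "cc-index/table/cc-main/warc/crawl=CC-MAIN-2023-50/subset=warc/part-00000.parquet"

    s3 = boto3.client("s3")
    s3.download_file(BUCKET, KEY, "shard.parquet")

    # Read only the one column we need; the shard holds one row per crawled URL.
    table = pq.read_table("shard.parquet", columns=["url_host_registered_domain"])
    domains = set(table.column("url_host_registered_domain").to_pylist())
    print(f"{len(domains)} distinct registered domains in this shard")

Being able to project a single column like this is the main reason the Parquet subset is attractive: the full index is terabytes, but the host columns alone are a small fraction of that.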

CZDS

The Centralized Zone Data Service is provided by ICANN at https://czds.icann.org/ . ICANN is the international regulatory body for internet domain names. The service itself is a website and web API for downloading "zone files" for generic top-level domains (gTLDs). It does not cover country-code TLDs such as ".uk", and it only contains entries for domain names that are active in DNS. A domain name can therefore exist (or be reserved) and still not appear in this list, even if it belongs to a supported TLD.
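A minimal sketch of the API flow, assuming the authentication and download endpoints ICANN documents for CZDS; the credentials are placeholders, and an approved CZDS request is required for each TLD you download:

    import requests

    AUTH_URL = "https://account-api.icann.org/api/authenticate"
    ZONE_URL = "https://czds-api.icann.org/czds/downloads/org.zone"  # one URL per approved TLD

    # Exchange account credentials for a bearer token.
    resp = requests.post(AUTH_URL, json={"username": "you@example.com", "password": "<password>"})
    resp.raise_for_status()
    token = resp.json()["accessToken"]

    # Stream the zone file to disk; CZDS serves these gzip-compressed.
    zone = requests.get(ZONE_URL, headers={"Authorization": f"Bearer {token}"}, stream=True)
    zone.raise_for_status()
    with open("org.zone.gz", "wb") as f:
        for chunk in zone.iter_content(chunk_size=1 << 20):
            f.write(chunk)

Each record in the resulting zone file names a delegated domain, which is exactly why reserved but undelegated names never show up in this source.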

Hacker News

The social news site published by Y Combinator at https://news.ycombinator.com is also available as a public dataset in BigQuery. This provides a far smaller set of domains than the other sources, but one that is very actively updated.
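As a sketch of pulling domains from that dataset, assuming the bigquery-public-data.hacker_news.full table and default Google Cloud credentials, BigQuery's NET.REG_DOMAIN function collapses each submitted URL down to its registered domain:

    from google.cloud import bigquery

    client = bigquery.Client()

    # NET.REG_DOMAIN reduces a full URL to its registered domain,
    # e.g. "https://blog.example.co.uk/post" -> "example.co.uk".
    query = """
        SELECT DISTINCT NET.REG_DOMAIN(url) AS domain
        FROM `bigquery-public-data.hacker_news.full`
        WHERE url IS NOT NULL
    """
    for row in client.query(query).result():
        print(row.domain)

Doing the deduplication inside BigQuery keeps the result set small enough to merge directly into the combined list.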