Reddit comments dataset
This dataset contains publicly available comments on Reddit that go back to December, 2005, to March, 2023, and contains over 14B rows of data. The raw data is in JSON format in compressed files and the rows look like the following:
A shoutout to Percona for the motivation behind ingesting this dataset, which we have downloaded and stored in an S3 bucket.
Creating a table
The following commands were executed on a Production instance of ClickHouse Cloud with the minimum memory set to 720GB. To run this on your own cluster, replace default in the s3Cluster function call with the name of your cluster. If you do not have a cluster, then replace the s3Cluster function with the s3 function.
- Let's create a table for the Reddit data:
The names of the files in S3 start with RC_YYYY-MM where YYYY-MM goes from 2005-12 to 2023-02. The compression changes a couple of times though, so the file extensions are not consistent. For example:
- the file names are initially RC_2005-12.bz2toRC_2017-11.bz2
- then they look like RC_2017-12.xztoRC_2018-09.xz
- and finally RC_2018-10.zsttoRC_2023-02.zst
Load data
- The following file has 86M records from December, 2017:
- It will take a while depending on your resources, but when it's done verify it worked:
- Let's see how many unique subreddits were in December of 2017:
Example queries
- This query returns the top 10 subreddits (in terms of number of comments):
- Here are the top 10 authors in December of 2017, in terms of number of comments posted:
Loading the entire dataset
- We already inserted some data, but we will start over:
- This is a fun dataset and it looks like we can find some great information, so let's go ahead and insert the entire dataset from 2005 to 2023. For practical reasons, it works well to insert the data by years starting with...
...and ending with:
If you do not have a cluster, use s3 instead of s3Cluster:
- To verify it worked, here are the number of rows per year (as of February, 2023):
- Let's see how many rows were inserted and how much disk space the table is using:
Notice the compression of disk storage is about 1/3 of the uncompressed size:
Example query - comments, authors and subreddits per month
- The following query shows how many comments, authors and subreddits we have for each month:
This is a substantial query that has to process all 14.69 billion rows, but we still get an impressive response time (about 48 seconds):
More queries
- Here are the top 10 subreddits of 2022:
- Let's see which subreddits had the biggest increase in comments from 2018 to 2019:
It looks like memes and teenagers were busy on Reddit in 2019:
Other queries
- One more query: let's compare ClickHouse mentions to other technologies like Snowflake and Postgres. This query is a big one because it has to search all 14.69 billion comments three times for a substring, but the performance is actually quite impressive. (Unfortunately ClickHouse users are not very active on Reddit yet):
