New Version of the Yelp Dataset!

Yelp has released an updated academic dataset!

In the seminar on Neo4j we use part of the Yelp Academic Dataset (available here), and I use it in my MIS Big Data class (BUS4 118D). Yelp updates the dataset roughly once a year, usually in early spring (a number of years back it was every six months, when they ran a dataset challenge for students to compete in).

When downloading the data, it’s best to have a fast Internet connection (such as a wired connection at school) since even zipped up it’s about 5GB.

This dataset covers 11 metro areas: 10 in the United States and 1 in Canada. This is a different set of metro areas than in the prior two datasets, so if you don’t mind having your data stop in 2020, you could compare 29 metro areas if you also have the two prior datasets. Yelp has also posted the dataset on Kaggle, where you can download a few earlier versions as well: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset  However, the files on Kaggle are not zipped, so the downloads are larger.

The data files are in the JSON format, but the download comes as a zipped tarball (if you are a Linux/Unix user, that will mean something). To open it on a Windows machine, you can use 7-Zip, an easy-to-use free utility that works with many compressed data formats. If you are a Mac user, you can use the tar utility on the command line in a terminal window (since your OS underneath is similar to Linux).

If you are unzipping on a PC, keep in mind that you need to unzip twice. A tarball combines a set of files into one file, and then that tarball is zipped. So on a PC, unzipping the download gives you a tarball, and you then need to unzip the tarball to get at the files. Alternatively, when using 7-Zip, you can open the archive instead of unzipping it (opening will take a few minutes), then open the tarball folder within the archive, and finally drag the JSON files and the PDF documentation out of the tarball into a folder on your PC.
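If you would rather not juggle two unzip steps, Python's standard tarfile module can do both at once on Windows, Mac, or Linux. Below is a minimal sketch, not the only way to do it. The function name extract_tarball, the archive name yelp_dataset.tar, and the output folder yelp_data are my own placeholders, so substitute whatever your download is actually called.

```python
import tarfile
from pathlib import Path


def extract_tarball(archive_path, dest_dir):
    """Extract a (possibly compressed) tarball and return the extracted file names.

    extract_tarball is a hypothetical helper name used for this example.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Mode "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none),
    # so decompressing and unpacking happen in a single step.
    with tarfile.open(archive_path, mode="r:*") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a safer extraction filter.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)

    # List every file that ended up under the destination folder.
    return sorted(p.name for p in dest.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Placeholder names: change these to match your own download.
    for name in extract_tarball("yelp_dataset.tar", "yelp_data"):
        print(name)
```

Expect the extraction to take a few minutes given the size of the download. Afterwards the JSON files and the PDF documentation should be sitting in the output folder.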