Pathways in Data Science and Machine Learning – panel discussion on 5/18

DeepLearning.AI is hosting a 1-hour panel discussion titled Pathways in Data Science and Machine Learning with four women in the field of machine learning and AI, including the chief data scientist at IBM and the director of machine learning at Databricks. DeepLearning.AI was founded by Andrew Ng, who is a co-founder of Coursera, has one of the most popular AI courses on Coursera, and is one of the keynote speakers at the Data+AI Summit. We use the Databricks community platform in the Spark and Jupyter notebooks seminar. The panel discussion is free (as are most things posted on our blog).

The panel will be discussing pivoting your career to data science or machine learning. Although many of you may be more at the start of your careers than pivoting, this should still be an interesting discussion, and it’s targeted at both technical and non-technical audiences.

Although the discussion is not about women in data science, if you are a woman looking to work in this field, it’s always good to see women at the C-level talking about the field. Guys are obviously welcome to attend too, and I have signed up.

Since this is during final exams here at SJSU, maybe you don’t have a lot of free time on the 18th, but if you register (it’s free), they will send you a link to the recording afterwards, and it should be an interesting discussion!

Videos from the Data Visualization Society’s Outlier Conference

If you participated in the Data Science for All seminar Telling Your Data Story Using Tableau, you may find some of the talks from the Outlier conference inspiring. In the seminar, one of the references we recommend is a paper from Tableau on Visual Analysis Best Practices, particularly pages 7-13, which match chart types with the type of data story you are trying to tell. One interesting Outlier presentation was This Should Have Been a Bar Chart by Robert Kosara of Tableau.

You can find the index of all 71 videos from the conference at the YouTube page for the Outlier 2022 conference.

IBM’s Call for Code 2022 Has Started!!!

 

Are you looking for some way to flex your new Python skills and explore some data?

Consider registering either individually or as a team for IBM’s Call for Code 2022 Global Challenge. Each year IBM runs a contest aimed at addressing a social need, and this year the theme is sustainability:

How can technology improve sustainable production, consumption, and management of resources, reduce pollution creation, and protect biodiversity to create a greener future?

The winning team takes home a cool $200,000! That, of course, draws in a lot of competition!

Register Here for the Call for Code – submissions due October 31, 2022

However, even if you don’t win, this gives you a good goal to work on, something you can share, and access to learning materials, datasets from the sponsors, The Weather Company’s data API, technical sessions, and other tools. You also get a free account on IBM’s cloud without having to enter a credit card (even for the free level, you usually have to enter your credit card, which can be a hassle as a student).

In 2019, even before everyone went virtual with the pandemic, the winning team met face-to-face for the first time at the awards ceremony when they were collecting their prize – so you could work on a virtual team with friends on the other side of the country or around the globe!

 

 

New Version of the Yelp Dataset!

Yelp has released an updated academic dataset!

In the seminar on Neo4j we use part of the Yelp Academic Dataset (available here), and I use it in my MIS Big Data class (BUS4 118D). Yelp updates the dataset roughly every year, usually in the early spring (a number of years back it was every 6 months, when they had a dataset challenge for students to compete in).

When downloading the data, it’s best to have a fast Internet connection (such as a wired connection at school) since even zipped up it’s about 5GB.

This dataset covers 11 metro areas, 10 in the United States and 1 in Canada. This is a different set of metro areas than in the prior two datasets, so if you don’t mind having your data stop in 2020, you could compare 29 metro areas if you have the two prior datasets also. Yelp has also posted the dataset on Kaggle, where you can download a few earlier versions as well: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset  However, the files on Kaggle are not zipped, so they are larger downloads.

The data files are in the JSON format, but the download is a zipped tarball (if you are a Linux/Unix user, that will mean something). To open it on a Windows machine, you can use 7-Zip, which is an easy-to-use free utility for working with multiple compressed data formats. If you are a Mac user, you can use the tar utility on the command line in a terminal window (since your OS underneath is similar to Linux). If you are unzipping on a PC, keep in mind that you need to unzip twice. A tarball combines a set of files, and then the tarball is zipped, so on a PC, when you unzip the download, you will have a tarball, and then you need to unzip the tarball to get at the files. Alternatively, when using 7-Zip, you can open the archive instead of unzipping it (opening will take a few minutes), then open the tarball folder within the archive, and finally drag the JSON files and the PDF documentation out of the tarball into a folder on your PC.
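Since many of you are learning Python, here is a minimal sketch of doing the same two-step unpacking with Python's standard library instead of 7-Zip or tar. It rests on a few assumptions that are not guaranteed by the download itself: the archive and file names in the usage comment are hypothetical placeholders, the helper names extract_tarball and read_json_lines are just illustrative, and the reader assumes each data file holds one JSON object per line (check the PDF documentation for the files you actually get).

```python
import json
import tarfile
from pathlib import Path


def extract_tarball(archive_path, dest_dir):
    """Extract a (possibly compressed) tar archive; return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect gzip/bz2/xz compression by itself,
    # so decompressing and untarring happen in a single step.
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a safer extraction filter.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return names


def read_json_lines(path, limit=None):
    """Yield one dict per non-empty line, stopping after `limit` records.

    Assumes the file stores one JSON object per line.
    """
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if limit is not None and count >= limit:
                break
            line = line.strip()
            if not line:
                continue
            yield json.loads(line)
            count += 1


# Example usage (hypothetical file names; substitute the ones you downloaded):
# extract_tarball("yelp_download.tgz", "yelp_data")
# for record in read_json_lines("yelp_data/some_file.json", limit=5):
#     print(record)
```

Reading with a limit is handy here because the full files are large, so you can peek at the first few records before deciding how to load the rest.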

Join a Tableau User Group

If you participated in the seminar Telling Your Data Story Using Tableau, then you will find this is also in the “Tell Me More!” module.

Early in 2022, Tableau started a user group for Tableau newbies, and you can sign up here: https://usergroups.tableau.com/tableaunewbiesusergroup

Their next meeting is on May 26th. The group is led by a couple of experienced Tableau users.

Also, you may want to check out The Tableau Student Guide. This is a blog that Tableau started early in 2022, and it posts a new dataset and challenge each week. You can then submit your visualization and get feedback from the group. If you are trying to build a portfolio of work, this is a way to get started since they provide a dataset and a goal (early on it’s often hard for new users to decide what to analyze, so this gives you a starting point, a goal, and feedback).