The Data Daily

Insights Of Their Own: Visualizing Women’s Baseball, Nightingale

Baseball is the “national pastime” in the United States, even as it draws players and fans from all over the world. With few exceptions, it has been a men-only sport. While coaches like Alyssa Nakken and Rachel Balkovec have made recent inroads into the sport, and players like Julie Croteau have played on men’s college teams, there has yet to be a female major league player. One of the most important things that good data visualization does is offer new ways of seeing information, such as baseball statistics, and give new life to the accomplishments of the AAGPBL’s pioneering players, who played their last games nearly seven decades ago.

The All-American Girls Professional Baseball League (AAGPBL) was the first, and thus far the only, women’s major baseball league, predating Title IX of the Education Amendments of 1972 by nearly 30 years. Title IX marks its 50th anniversary this year, and with Amazon Prime releasing a new series inspired by the Penny Marshall film A League of Their Own this August, I wanted to see whether the league’s statistics could come alive visually.

During World War II, Major League Baseball faced the prospect of shutting down as players answered the military’s call for the defense of democracy, and fuel rationing made any sort of road trip far more difficult than before the war. While women could not at the time serve in combat roles, they took men’s places on factory floors and on the field. As Anika Orrock shows in her brilliantly written and illustrated book, The Incredible Women of the All-American Girls Professional Baseball League, the AAGPBL held its first tryouts at Wrigley Field in spring 1943. It stood up four teams that began play that summer: the Racine, Wisconsin Belles; the Rockford, Illinois Peaches; the Kenosha, Wisconsin Comets; and the South Bend, Indiana Blue Sox. The league played for 12 seasons between 1943 and 1954, expanding to ten teams before shrinking again to five by its end.

As ahead of its time as it was, the league was rooted in the gender roles of the 1940s. Players were chaperoned and received tutoring in makeup, body carriage, and decorum from Helena Rubinstein’s cosmetics company. They wore dresses, not trousers or jeans, when out in public, and their game uniforms were above-the-knee skirts, which were far from ideal for sliding on the basepaths.

Baseball is unusual in that it embraces statistics to an even greater degree than other major spectator sports (although sports like hockey and American football have become more stats-oriented in recent years). I’m a member of the Society for American Baseball Research (SABR), a global organization of baseball fans that conducts historical and quantitative research using historical statistics and biomechanical data, so baseball statistics are something that I think about quite often.

The emerging practice of data visualization hasn’t yet been widely adopted in the baseball research community. While a tradition of analytics is deeply rooted in the history of the game, stats have been presented in the same way for decades. The most visible element of baseball statistics, the bubblegum card, carries a simple table of the player’s career statistics on its back, and that table has not changed significantly, in format or content, in more than 70 years. Now, though, is a unique opportunity to expand the use of data visualization, as media become ever more visual and the sport seeks to attract new fans.

With an emerging environment in the U.S. that supports equal access, spanning beyond race and gender, now is a great time to take a new look at the AAGPBL and to visualize its players’ notable accomplishments.

AAGPBL data presented an interesting challenge. While the league has an active alumni association featuring individual player records, online data services like Stathead have yet to integrate the full set of records. The best resource for league data is a hard copy record book first published in 2000 by W.C. Madden, and compiled from a mix of league records, newspaper accounts, and other contemporary sources.

Madden’s work has inspired some valuable research over the years; for example, Kriss Barnhart examined the relationship between the league’s rule changes, including the size of the ball and the length of the base paths, and player performance.

The greatest challenge for this analysis was constructing a workable data set using the available figures from Madden’s record book. Madden notes that for the first half of its life (1943 to 1948), the league kept accurate individual records, but from 1949 onwards league-wide records became more of a challenge and by 1951, the central league office had been disbanded. 

Madden has compiled an authoritative record, but as with any publication, there are a few misprints and mathematical errors, and I needed to do a significant amount of work (about 40 hours’ worth) to enter, clean, and correct the data. (There was unfortunately no substitute for the tedious task of manually keying in the data, checking every line in the book to make sure that I didn’t miss anything.) While I chose Tableau Public to build the visualization, I created my database in Google Sheets, which I linked to my Tableau workbook. I built two worksheets, one for pitching statistics and another for batting statistics.

This results in a simple database that still enables time-series analysis by year and player, in order to track the overall evolution of the league. Some assumptions also had to be made for certain statistics, like on-base percentage (OBP), which traditionally includes all the ways that a player can reach base, including on a fielder’s error, some of which are not recorded in Madden’s record book.

The database is in two separate files: one for batting and another for pitching. Each contains twelve separate sheets, one for each season. In addition to the figures that Madden collected, I calculated five statistics that have become popular in more recent decades. For batters, these are on-base percentage (the percentage of the time that a batter gets on base, whether by getting a hit or by walking on four pitches out of the strike zone); slugging percentage (a weighted average of singles, doubles, triples, and home runs); and on-base plus slugging (OPS), the sum of the two. For pitchers, they are the ratio of strikeouts (good for a pitcher) to walks (not good for a pitcher), abbreviated K/BB, and walks and hits per inning pitched (WHIP).
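To make these five calculations concrete, here is a minimal Python sketch. It is an illustration rather than the workbook's actual formulas: the field names and the sample numbers are hypothetical, and the on-base formula is a simplified one that counts only hits and walks, on the assumption that other ways of reaching base are unavailable in the source data.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """One batter-season; field names are hypothetical, not Madden's column labels."""
    at_bats: int
    hits: int
    doubles: int
    triples: int
    home_runs: int
    walks: int


def on_base_pct(b: BattingLine) -> float:
    # Simplified OBP (an assumption): only hits and walks count as reaching base.
    chances = b.at_bats + b.walks
    return (b.hits + b.walks) / chances if chances else 0.0


def slugging_pct(b: BattingLine) -> float:
    # Total bases per at-bat: singles count 1, doubles 2, triples 3, home runs 4.
    singles = b.hits - b.doubles - b.triples - b.home_runs
    total_bases = singles + 2 * b.doubles + 3 * b.triples + 4 * b.home_runs
    return total_bases / b.at_bats if b.at_bats else 0.0


def ops(b: BattingLine) -> float:
    # On-base plus slugging is simply the sum of the two rates.
    return on_base_pct(b) + slugging_pct(b)


def strikeout_walk_ratio(strikeouts: int, walks: int):
    # Undefined (None) for a pitcher who issued no walks.
    return strikeouts / walks if walks else None


def whip(walks: int, hits_allowed: int, innings_pitched: float):
    # Walks and hits allowed per inning pitched; lower is better.
    return (walks + hits_allowed) / innings_pitched if innings_pitched else None


# Made-up season line, purely for illustration.
sample = BattingLine(at_bats=400, hits=120, doubles=20, triples=5, home_runs=10, walks=50)
print(round(on_base_pct(sample), 3), round(slugging_pct(sample), 3), round(ops(sample), 3))
```

Returning `None` for undefined ratios, rather than zero, keeps a pitcher with no walks from looking like the worst in the league on a K/BB chart.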

The two files are combined in a union, based on the player name and year fields, so that the visualization can include both sets of statistics. This is important because in the 1940s and 1950s, pitchers still took their turn in the batting order rather than being replaced by a designated hitter, a rule that the American League adopted in 1973.

Creating a union for the tables allows them to function as a single table, which makes year-to-year comparisons much easier.
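Tableau handles this combination internally, but the idea of lining up batting and pitching records by player and year can be sketched in plain Python. The field names and sample rows below are hypothetical, and this is only one way to model the result, not a description of what Tableau does under the hood.

```python
def combine_by_player_year(batting_rows, pitching_rows):
    """Merge batting and pitching records that share the same (player, year) key."""
    combined = {}
    for row in list(batting_rows) + list(pitching_rows):
        key = (row["player"], row["year"])
        # Create the record on first sight, then fold in whatever stats this row has.
        combined.setdefault(key, {}).update(row)
    # Sort by season, then by name, so year-to-year comparisons read in order.
    return sorted(combined.values(), key=lambda r: (r["year"], r["player"]))


# Made-up rows: a pitcher who also bats appears in both inputs.
batting = [
    {"player": "Player A", "year": 1943, "avg": 0.250},
    {"player": "Player B", "year": 1943, "avg": 0.300},
]
pitching = [
    {"player": "Player A", "year": 1943, "era": 2.10},
]

for record in combine_by_player_year(batting, pitching):
    print(record)
```

A player who both pitched and batted in a season ends up with a single record carrying both sets of fields, while a position player simply has no pitching fields.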

And now, at long last, the data are ready to visualize. The game is so complex, with so many statistical measures collected, that showing them comprehensively on a chart is next to impossible. So these visualizations show statistics that are especially relevant or insightful for the performance of the league.

The first chart is an overview of key batting and pitching statistics. Each statistic is shown as a sparkline, a simple line graph with no labels on either axis, and the sparklines are linked to the two matrices in the middle: one plotting batting average against OPS, the other plotting ERA against K/BB for pitching. Each dot denotes a player’s season statistics, with the size of the dot related to the number of plate appearances and the color of the dot based on the year. By clicking or hovering on a dot in the visualization, a viewer can find the statistics for a particular player, or move from dot to dot to compare players.

One important note for the charts on this page is that in baseball, as in many fields, a low value signals a good outcome for certain measures, such as earned run average and WHIP for pitchers. The axes for those measures are reversed so that the best outcomes appear in the upper right-hand corner of the large matrix, as they usually do, and so that improvement in the sparklines is shown by upward movement along the y axis.
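The axis reversal described above amounts to flipping the scale for "lower is better" measures before plotting. Charting tools usually expose this as an axis setting; the sketch below only illustrates the underlying arithmetic, with a hypothetical function name and made-up ranges.

```python
# Measures where a smaller number is the better outcome (taken from the text above).
LOWER_IS_BETTER = {"ERA", "WHIP"}


def axis_position(value, lowest, highest, measure):
    """Map a value to a 0..1 plot position where 1.0 is always the best outcome."""
    if highest == lowest:
        return 0.5  # A flat range has no meaningful direction; centre the point.
    fraction = (value - lowest) / (highest - lowest)
    # Flip the scale for measures such as ERA so that low values plot high.
    return 1.0 - fraction if measure in LOWER_IS_BETTER else fraction


# Made-up ranges: the lowest ERA lands at the top, as does the highest average.
print(axis_position(2.0, 2.0, 4.0, "ERA"))
print(axis_position(0.300, 0.200, 0.300, "AVG"))
```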

The sparklines reveal the most interesting insight about the AAGPBL: the steady increase in offense, measured by hits, home runs, and OPS, and the corresponding decline in pitching performance, especially ERA and WHIP. These are average measures across the entire league, and they reflect individual performances less than they do the changing dynamics of a league that evolved from underhand pitching, a ball the size of a softball, and bases 40 feet apart into a sport much more like the men’s major leagues, with a ball the size of a baseball, overhand pitching, and bases 55 feet apart.

The second chart, “Key Ranges by Season,” shows how typical performance changed during the life of the league, both for batters and pitchers.

This chart shows the same six measures as in the first chart: batting average, hits, and home runs for batting; and WHIP, IP, and wins for pitching. While the sparklines show the average trend, this graphic shows the spread: the median, quartiles, and upper and lower limits for each year. Interestingly, the range for the batting measures increased (especially for home runs, and especially during the last five years of the league), while the ranges for innings pitched and strikeout/walk ratio both decreased over the last eight years of the league.
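The per-season spread can be sketched with Python's standard library. This is a hedged illustration, not the chart's actual calculation: the field names and numbers are made up, the "upper and lower limits" are assumed here to be the plain minimum and maximum, and the inclusive quartile method is only one of several conventions a charting tool might use.

```python
from collections import defaultdict
from statistics import quantiles


def season_spread(rows, measure):
    """Return low, quartiles, median, and high of one measure for each season."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["year"]].append(row[measure])

    summary = {}
    for year, values in sorted(by_year.items()):
        if len(values) < 2:
            continue  # Quartiles need at least two player-seasons.
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        summary[year] = {
            "low": min(values),   # Assumed lower limit: the minimum.
            "q1": q1,
            "median": median,
            "q3": q3,
            "high": max(values),  # Assumed upper limit: the maximum.
        }
    return summary


# Made-up hit totals for five hypothetical player-seasons in one year.
rows = [{"year": 1943, "hits": h} for h in (10, 20, 30, 40, 50)]
print(season_spread(rows, "hits"))
```

Comparing `high - low` or `q3 - q1` from one season to the next gives a numeric version of the widening and narrowing ranges the chart shows.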

The third chart, “Four AAGPBL Luminaries,” is the most accessible for someone new to baseball and to the AAGPBL in particular: a survey of players that highlights four whom the audience can learn more about on their own. This is where the true power of data visualization lies: not merely in revealing facts, but in empowering audiences to discover facts that are especially relevant for them.

This chart uses tree diagrams that go from “hot” (the career leaders in each of the four categories) to “cool” (roughly the bottom part of the top 20 in each category). The four luminaries include:

I have only scratched the surface of the accomplishments of the All-American Girls Professional Baseball League, but this brief look at the data highlights a few intriguing insights about the league, in both hitting and pitching, that are worth exploring more deeply. Ultimately, the power of data visualization is creating new ways of seeing data, whether this year’s Major League statistics or the entire lifetime of a professional league such as the AAGPBL. Even though few AAGPBL players remain living, their accomplishments, recorded for posterity and able to tell stories of their own, live on. The best way to follow up on this article is to interact with the visualizations themselves by visiting the Tableau Public dashboard. In the end, the value of data visualization for baseball is the same as it is in any other field: to create a new way of seeing data that are otherwise contained in reams of spreadsheet columns.
