How to use nflfastR with Google BigQuery?
Posted on November 5, 2022 by rstats-tips.net in R bloggers | 0 Comments
[This article was first published on rstats-tips.net, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)
Lately I wanted to play around with nflfastR. That's a great package giving you access to the NFL's play-by-play data since 1999. It lets you download all the data and store it in several different databases.
Unfortunately I ran into trouble when I tried to import the data to Google’s BigQuery.
Two problems
My solutions
The open source way
The solution for the first bug was relatively simple: check whether the table exists, and if it doesn't, don't set the append parameter to TRUE when calling DBI::dbWriteTable(). (See here.)
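That check can be sketched as follows. This is a minimal illustration, not the package's actual code: it uses an in-memory SQLite database as a stand-in for BigQuery, and the table name and sample data are made up.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
chunk <- data.frame(season = 1999, weather = NA_character_)

# Only append when the table already exists; on the first write,
# let dbWriteTable() create the table instead of appending.
if (dbExistsTable(con, "nflfastr_pbp")) {
  dbWriteTable(con, "nflfastr_pbp", chunk, append = TRUE)
} else {
  dbWriteTable(con, "nflfastr_pbp", chunk)
}
```

The same pattern works with a bigrquery connection; only the dbConnect() call changes.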
The other one was a little trickier. As it turned out, the new table is created using data consisting of
the first season loaded (usually 1999), and
a default_play. Both are bound together (see here) and then passed to DBI::dbWriteTable().
The default_play is a template row created by the package author.
DBI (or the database) uses this data to guess which column type should be created for each column. And that's where the trouble starts:
The column weather is empty for seasons 1999 and 2000. So the only value provided for this column is in default_play. The value is a very long string of repeating “NA”.
As it turned out, DBI::dbWriteTable() (or BigQuery) picks the column type BYTES as the best fitting type for this value. (Why this happens is not quite clear; it depends on the length of the string. See my question here.) Unfortunately, this column type can't handle the Unicode values which appear in the weather column from season 2001 onwards.
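To make the failure mode concrete, here is a simplified illustration. The exact placeholder value and its length are assumptions for the sketch (see the linked commits for the real default_play), and the weather string is an invented example of the kind of Unicode text later seasons contain.

```r
# A long string of repeated "NA", standing in for default_play$weather.
# Its length is what nudges the backend toward the BYTES type.
placeholder <- strrep("NA", 100)
nchar(placeholder)

# From 2001 onwards the weather column holds real text, which can
# include Unicode such as a degree sign -- text a BYTES column
# created from the placeholder cannot store:
real_weather <- "Temp: 20\u00b0 F, Humidity: 70%"
```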
So I provided two ways for fixing this issue:
Change the value of default_play$weather; see commit
Load the data in reverse order, so that correct data is provided at creation time; see commit
Both solutions were rejected by the package maintainers. They argue that the DBI implementation for BigQuery is the reason for the problems. This might be true, but it doesn't help someone who wants to use nflfastR with BigQuery.
What else can you do?
So if you want to use nflfastR with BigQuery, one option is to use my fork of the package. I'll try to update it when the upstream package is updated.
Pre-Import Script
But I also looked for another solution that tackles the main reason why the original package fails.
The main reason the import of the data fails is that the database guesses the best column types for our data. I think that's a bad habit; it's much better to be precise and tell the database the correct column types for the new table. If this were done in the original package, all these problems would be gone.
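One way to be explicit is DBI's field.types argument to dbWriteTable(), which overrides type guessing. A hedged sketch against an in-memory SQLite database (for BigQuery the type name would be STRING rather than TEXT; the table name and sample data are illustrative):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
chunk <- data.frame(season = 1999, weather = NA_character_)

# Declare the column types up front instead of letting the backend
# infer them from whatever happens to be in the first chunk of data.
dbWriteTable(con, "nflfastr_pbp", chunk,
             field.types = c(season = "INTEGER", weather = "TEXT"))
```

With the type pinned to a text type, later seasons with Unicode weather strings append cleanly.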
So here’s my script you can run before importing the data:
library(tidyverse)
library(DBI)
project
