Dataset of ~29K coronavirus related research papers

AI2 released a large dataset of ~29k coronavirus-related research papers in a somewhat-convenient-to-process JSON format. The data is hosted on Kaggle, with an accompanying list of research questions (provided by the White House and NIH), including the information they want extracted from the data.
These are the questions:

This can be an interesting opportunity for some relevant, hands-on NLP work, especially for those of you who are already into relation extraction / health / biomed. Any approach goes: clustering, ML, sci-BERT, hand-built heuristics… Happy to brainstorm about this.
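As a starting point for any of these approaches, here is a minimal Python sketch that flattens one paper's JSON into plain strings. It assumes each file is a single JSON object with `paper_id`, `metadata.title`, and `abstract` / `body_text` lists of paragraphs that each carry a `text` field; please check these names against the actual files before relying on them. The helper names `paper_text` and `iter_papers`, and the directory passed to `iter_papers`, are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def paper_text(record: Dict) -> Dict[str, str]:
    """Flatten one paper record into id, title, abstract, and body strings.

    The field names used here are assumptions about the JSON layout.
    Missing fields fall back to empty strings instead of raising.
    """
    metadata = record.get("metadata", {})
    abstract = " ".join(p.get("text", "") for p in record.get("abstract", []))
    body = " ".join(p.get("text", "") for p in record.get("body_text", []))
    return {
        "paper_id": record.get("paper_id", ""),
        "title": metadata.get("title", ""),
        "abstract": abstract,
        "body": body,
    }


def iter_papers(root: str) -> Iterator[Dict[str, str]]:
    """Yield flattened papers for every .json file under root (hypothetical path)."""
    for path in sorted(Path(root).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield paper_text(json.load(handle))


# A tiny in-memory record, so the sketch runs without downloading the data.
sample = {
    "paper_id": "abc123",
    "metadata": {"title": "Sample coronavirus paper"},
    "abstract": [
        {"text": "First abstract paragraph."},
        {"text": "Second abstract paragraph."},
    ],
    "body_text": [{"text": "Body paragraph.", "section": "Introduction"}],
}

flat = paper_text(sample)
print(flat["title"])
print(flat["abstract"])
```

With the real data, something like `for paper in iter_papers("path/to/dataset"): ...` would give one flat record per paper, which can then be fed to clustering, a sci-BERT encoder, or hand-built heuristics.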
Moreover, some of you may remember that I briefly presented a tool that allows querying syntax-based patterns over Wikipedia data. We (at AI2 Israel) indexed this coronavirus corpus with the tool, to allow syntactic queries over it. This can also be an interesting avenue for exploring the data. While the tool is not yet ready for public release, and is a bit rough around the edges, I will be happy to share it with the lab if there is interest. If anyone wants, I will present the tool again in a Zoom meeting tomorrow, probably around 15:00; email me if you are interested.
Following are two papers that we submitted to the ACL demo track: one describes the search tool (read sections 6 and 7), and the other is about Aryeh’s work (joint with Reut) on BART, an extended syntactic representation that exposes many relation-extraction-related relations in a more unified way. (It is relevant, as the tool indexes syntactic graphs after Aryeh’s enhancements. It is also a really cool paper, imo.)

Yet_another_Enhanced_UD_Representation (2)

