Microsoft Academic Graph (MAG) on Azure Data Lake (ADL)

Introduction

Visual Studio Code (VSC)

VSC is a lightweight yet powerful integrated development environment (IDE) from Microsoft. It is a free download and is cross-platform: in addition to Windows, there are versions for macOS and various Linux distributions, all of which behave in much the same way.

To remain nimble, VSC does not preinstall as many programming languages and supporting tools as Visual Studio does. In some ways, this makes VSC a pleasant and comfortable environment for tasks like data analytics and big data development, including using MAG on ADL.

After successfully installing VSC, you will be greeted with a familiar interface that looks like Figure 1.

VSC-Start-Page Figure 1

To use ADL, install the Azure Data Lake Tools extension by clicking the "Extensions" option on the vertical bar on the left and selecting the extension of that name. VSC automatically installs the other extensions it depends on, such as Azure Account support in this case. Installing the ADL Tools extension also adds support for U-SQL, a SQL-like language tailored for cloud computing in Azure. U-SQL relieves you of having to manually coordinate the chores of running parallel computations and combining their results across multiple nodes in the cluster. Since U-SQL also allows you to write custom procedures in C#, it is a good idea to install the C# extension as well. After installing the three extensions, your screen should look like Figure 2.

VSC-Start-Page Figure 2

The next step is to connect VSC to your Azure account, which is done through commands. You can see all the commands by choosing Command Palette from the View menu, or by pressing the shortcut Ctrl+Shift+P. By default, VSC displays a reminder of commonly used commands once you dismiss the welcome screen.

VSC-Start-Page Figure 3

As you may have guessed from Figure 3, the command to connect VSC to your ADL is "ADL: Login". You will be prompted to log in to your Azure account and authorize VSC to remember your credentials for future connections. You might need to consult your ADL administrator for some of the information, most importantly where in the system MAG resides. If all goes well, you should see your Azure credentials in the status bar at the bottom of the screen, and the Azure resources you have access to are shown in the Explorer pane (which you can open by clicking the Explorer button on the vertical toolbar on the left). Figure 4 shows an example where MAG is made available in an ADL Storage account under a folder named "mag". Its "graph" subfolder holds the various snapshots of MAG taken from November 2017 through March 2018.

VSC-Start-Page Figure 4

With the setup in place, you are ready to run your first MAG application! Each snapshot folder contains some sample scripts, all with the file extension '.usql'. Run "CreateFunctions.usql" before running any of the sample scripts: it defines the functions for extracting the various MAG files. In each sample script, the first few lines of "DECLARE" statements specify where the MAG data can be found and where you want the outputs to go. Modify those paths to match your ADL account setup, and you are ready to proceed. Two commands you will use often are "ADL: Compile Script" (Ctrl+Q Ctrl+B) and "ADL: Submit Job" (Ctrl+Q Ctrl+S). Do not despair if you forget these commands: VSC conveniently keeps track of the context-sensitive commands you might need and tucks them under the popup menu of the '…' at the upper right corner inside the IDE. Give it a try and check that your results show up in the locations you specified in your U-SQL code.
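
To make the role of the "DECLARE" statements concrete, here is a minimal U-SQL sketch of a script with the same general shape. It is not one of the samples shipped with MAG: the folder names, the file name "Example.txt", and the two-column layout are all hypothetical and stand in for whatever your ADL administrator has set up. It also reads the file directly with the built-in Extractors.Tsv() rather than through the functions defined in "CreateFunctions.usql", whose names are not shown here.

```usql
// Hypothetical locations: replace with the paths in your own ADL account.
DECLARE @dataRoot string = "/mag/graph/2018-03-01/";   // assumed snapshot folder
DECLARE @outputRoot string = "/output/";                // assumed output folder

// Build full file paths from the roots declared above.
DECLARE @inputFile string = @dataRoot + "Example.txt";  // hypothetical input file
DECLARE @outputFile string = @outputRoot + "RowCount.tsv";

// Read a tab-separated file, assuming it has exactly these two columns.
@rows =
    EXTRACT Id long,
            Name string
    FROM @inputFile
    USING Extractors.Tsv();

// Count the rows; ADL distributes the work across nodes for you.
@summary =
    SELECT COUNT(*) AS NumRows
    FROM @rows;

// Write the result as a tab-separated file to the output location.
OUTPUT @summary
TO @outputFile
USING Outputters.Tsv();
```

In a script shaped like this, only the first two "DECLARE" lines need to change when you move between snapshots or accounts, which is why the samples keep their paths at the top.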