Monash University is opening access to its data to staff, students and researchers via a data “lakehouse” environment that it has stood up on Azure using Databricks.
Academic and data technology services director Cliff Ashford told a Databricks cloud migration and modernisation forum last week that the university had rebuilt its data environment “from the ground up”.
The work has been underway for the past year and a half, and is already resolving key bottlenecks associated with the university’s previous data infrastructure.
That previous infrastructure comprised a 2016-built data lake with an Oracle database and Redshift at its core, and a 2006-built data warehouse that was also largely Oracle-based.
Ashford noted those descriptions were “hugely simplified”.
“The [data lake] is a plate of spaghetti,” he said.
“The [data warehouse] is an entire pot of spaghetti. There are so many things in that that I can’t put them on a single diagram.”
Ashford said that dealing with the twin repositories was particularly challenging during Covid, a time when data became even more critical to decision-making.
The user-facing impact of the infrastructure was, in Ashford’s words, that “we’d lost all ability to be agile, to enhance this environment, and obviously our certainty about the data coming out of it was low.”
A decision was taken to create a new data infrastructure - based on Azure and Databricks.
Structurally, data is ingested from a range of source systems - the university has in the region of 1200 of them - into a “raw tier”.
Ingestion occurs via defined patterns that have the stamp of approval from the university’s security team.
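By way of illustration, an ingestion pattern on this kind of stack might look like the minimal sketch below, assuming Databricks Auto Loader writing into a Delta table; the paths, file format and table name are hypothetical, as the article does not detail Monash’s approved patterns.

```python
# A minimal sketch of a governed ingestion pattern into the "raw tier",
# assuming Databricks Auto Loader and Delta Lake. Paths, file format and
# table name are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_stream = (
    spark.readStream
    .format("cloudFiles")                       # Databricks Auto Loader
    .option("cloudFiles.format", "json")        # assume the source drops JSON files
    .option("cloudFiles.schemaLocation", "/mnt/raw/_schemas/student_system")
    .load("/mnt/landing/student_system/")       # landing zone for one source system
)

(
    raw_stream.writeStream
    .option("checkpointLocation", "/mnt/raw/_checkpoints/student_system")
    .trigger(availableNow=True)                 # drain what has landed, then stop
    .toTable("raw.student_system_events")       # append into the raw tier
)
```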
Transformation of the data starts in a ‘persistent storage area’ tier. This tier also has embedded governance controls that, for example, recognise personally identifiable information in an ingested dataset and draw it away into an encrypted repository, which itself contains “a double-encrypted sub-repository inside that for some of the extremely sensitive data at the institution.”
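That control could plausibly take a shape like the following sketch, in which columns matching a PII register are encrypted into a separate repository and dropped from the general tier; the column names, key handling and tables are illustrative only, since the article does not describe Monash’s implementation.

```python
# A hedged sketch of the persistent-storage-tier PII control. The PII register,
# key handling, record_id column and table names are all hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

PII_COLUMNS = {"name", "email", "date_of_birth"}    # hypothetical PII register
KEY = "0123456789abcdef"    # placeholder only; use a managed secret in practice

df = spark.table("raw.student_system_events")
pii_cols = [c for c in df.columns if c.lower() in PII_COLUMNS]

if pii_cols:
    # Draw the PII away into an encrypted repository, keyed by a record id
    # (assumed to exist on the dataset for re-linking under approval).
    encrypted = df.select(
        "record_id",
        *[
            F.expr(f"aes_encrypt(cast({c} as string), '{KEY}')").alias(c)
            for c in pii_cols
        ],
    )
    encrypted.write.mode("append").saveAsTable("psa_secure.student_pii")
    df = df.drop(*pii_cols)

# Only de-identified columns continue into the general persistent storage area.
df.write.mode("append").saveAsTable("psa.student_system_events")
```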
Use-case-specific transformation also occurs further up the stack, for example to prepare data for use in machine learning algorithms.
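Such a transformation might look something like this sketch, which reshapes persistent-storage data into features for a hypothetical model; the tables, grain and feature columns are assumptions, not Monash’s pipeline.

```python
# Illustrative only: a use-case-specific transformation further up the stack,
# aggregating persistent-storage data into ML features. Names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

features = (
    spark.table("psa.student_system_events")
    .groupBy("student_id")                      # assumed grain of the model
    .agg(
        F.count("*").alias("event_count"),
        F.countDistinct("unit_code").alias("units_touched"),
    )
)

features.write.mode("overwrite").saveAsTable("curated.engagement_features")
```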
How people then access data that sits in the ‘lakehouse’ environment is an area in which the university has invested considerable effort.
Ashford said the ultimate goal is to provide “everyone in our institution - our 85,000 students, our 15,000 staff” with “access to data, all the time.”
“We want to minimise the hurdles to access data.”
To do that, the university has created essentially two data access methods.
First, it is possible to consume data through pre-defined and approved dashboards and APIs. The enterprise standard for dashboards is Power BI.
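An exposed API over an approved dataset might, for example, look like the sketch below; FastAPI and the databricks-sql-connector are assumptions layered on top of the article, and the endpoint, credentials and table are hypothetical.

```python
# A sketch of the "exposed data via APIs" access method: a read-only endpoint
# over a pre-approved, aggregated dataset. FastAPI and databricks-sql-connector
# are assumptions; the article names neither.
import os

from databricks import sql        # pip install databricks-sql-connector
from fastapi import FastAPI

app = FastAPI()

@app.get("/api/v1/enrolment-summary")
def enrolment_summary():
    # Credentials come from the environment, never from the client.
    with sql.connect(
        server_hostname=os.environ["DATABRICKS_HOST"],
        http_path=os.environ["DATABRICKS_HTTP_PATH"],
        access_token=os.environ["DATABRICKS_TOKEN"],
    ) as conn:
        with conn.cursor() as cur:
            # Only this approved, aggregated view is ever exposed.
            cur.execute("SELECT faculty, enrolled FROM curated.enrolment_summary")
            rows = cur.fetchall()
    return [{"faculty": faculty, "enrolled": enrolled} for faculty, enrolled in rows]
```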
However, the university has also created what it calls PENs - a kind of sandbox environment where users can experiment, build new models and tools, or run ad hoc analytics on datasets they are approved to access.
“The logic of this is we don’t give anyone direct access to the ‘lakehouse’ at all,” Ashford said.
“All access is provided either via such things as dashboards or some exposed data via APIs, or through the dedicated sandboxes where you go in, and permission is granted to you to access certain datasets.”
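In practice, that per-dataset permissioning could be as simple as the following minimal sketch, which assumes Unity Catalog-style GRANTs and one access group per PEN; the group and table names are hypothetical.

```python
# A hedged sketch of granting a PEN access to certain datasets only, assuming
# Unity Catalog-style GRANTs and one access group per PEN. Names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pen_group = "pen_research_project_42"       # access group tied to one PEN
approved_tables = [
    "curated.enrolment_summary",
    "curated.engagement_features",
]

for table in approved_tables:
    # Read-only access to each approved dataset; nothing else in the lakehouse.
    spark.sql(f"GRANT SELECT ON TABLE {table} TO `{pen_group}`")
```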
Within the PEN, any tool that can be hosted on Azure can be made available.
The target ‘users’ for PENs include the standard data scientists, analysts, engineers, dashboard developers and statisticians.
However, Ashford said the university also wanted to see its researchers - and even its students - apply to use PENs for projects.
“We want students to be able to use that in their own projects when they’re learning to be data scientists,” he said.
That could see the number of PENs in operation grow considerably.
“I expect ultimately to have tens of thousands of PENs in the institution, all managed and monitored [centrally],” Ashford said.
However, to avoid creating a sprawl of sandboxes, there are some restrictions on use.
“We don’t let people have a PEN for more than a year,” Ashford said.
“What I don’t want is essentially ‘shadow IT’ taking place in that area, where people are publishing that as part of general business processes.
“It’s really there to develop new stuff or to do one-off analytics.”
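A lifetime rule like that lends itself to a scheduled sweep. The sketch below assumes a central PEN registry table, which is not something the article describes.

```python
# Illustrative sketch of the one-year PEN lifetime rule as a scheduled sweep.
# The registry table and its columns are hypothetical.
from datetime import datetime, timedelta, timezone

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

cutoff = datetime.now(timezone.utc) - timedelta(days=365)

expired = (
    spark.table("ops.pen_registry")         # assumed central register of PENs
    .where(F.col("created_at") < F.lit(cutoff))
    .select("pen_id", "owner_email", "created_at")
)

# Queue expired PENs for decommissioning rather than deleting them outright.
expired.write.mode("overwrite").saveAsTable("ops.pen_decommission_queue")
```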
Additionally, while PEN users are free to choose tooling, if they decide they want to use a model or dashboard in production, it needs to be re-coded to fit with the existing enterprise standards.
“If someone builds something in the PEN, there is no restriction on them as an analyst, but if they want to build a dashboard for enterprise level, we’d mandate that either they build it in Power BI, or if they build it in Tableau or something else, when we implement it for the overall institution, we will re-code it to our standards in Power BI,” Ashford said.