Materials Science and Engineering Associate Professor Dane Morgan and his students use atomic-scale computer modeling to understand and design new materials. As co-director of the Wisconsin Materials Institute, he is part of an effort to leverage computational, experimental and data analysis infrastructure and expertise at UW-Madison—ultimately to help increase the speed with which the U.S. discovers, develops and manufactures new materials. WMI researchers are addressing challenges associated with the vast amounts of experimental and computational data generated by materials analysis, synthesis and characterization efforts.
Q1. What is “big data?”
A. Most broadly, I think of “big data” as a catch-all phrase referring to the enormous scale of data we are now experiencing. In this sense it is a big development. It will change our world completely and is not a passing fad. To provide a more detailed understanding of the phenomenon that is big data, it is often described using five Vs: volume, velocity, variety, veracity and value.
Volume refers to the vast amounts of data generated. Just think of all the emails, Twitter messages, photos, video clips, sensor data, etc., that we produce and share every second. People now talk about zettabytes (10²¹ bytes) or even brontobytes (10²⁷ bytes) of data. This increasingly makes data sets too large to store and analyze using traditional database technology. With big data technology, we can manipulate these data sets. For example, we can now store and use these data sets with the help of distributed systems, where parts of the data are stored in different locations and brought together by software.
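The "bring it together by software" idea above can be sketched in a few lines. This is a minimal, illustrative example only, not any particular framework's API: data too large for one machine is split into partitions, each partition is summarized independently (as if on a separate storage node), and the partial results are combined into a global answer.

```python
def summarize_partition(partition):
    """Compute a partial result (count and sum) for one chunk of data."""
    return len(partition), sum(partition)

def combine(partials):
    """Merge the per-partition summaries into a global answer."""
    total_count = sum(c for c, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_count, total_sum

# Imagine each partition living on a different storage node.
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
partials = [summarize_partition(p) for p in partitions]
count, total = combine(partials)
print(count, total)  # 6 items, total 21.0
```

Real distributed systems add fault tolerance, scheduling and data locality on top of this pattern, but the split-summarize-combine structure is the core of it.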
Velocity refers to the speed at which new data is generated and the speed at which data moves around. Just think of social media messages going viral in seconds, the speed at which credit card transactions are checked for fraudulent activities, or the milliseconds it takes trading systems to analyze social media networks to pick up signals that trigger decisions to buy or sell shares. Big data technology allows us now to analyze the data while it is being generated, providing highly valuable real-time responses.
Variety refers to the increasingly different types of data we are now trying to use. In the past, data-centered activities often focused on structured data that neatly fit into tables or relational databases, such as financial data (think sales by product or region). Much of the data we wish to work with now is unstructured and comes in a wide variety—think of photos, video sequences or social media updates. With big data technology, we can now harness different types of data—structured and unstructured—including messages, social media conversations, photos, sensor data, video or voice recordings, and bring them together with more traditional, structured data.
Veracity refers to the messiness or trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable—just think of Twitter posts with hashtags, abbreviations, typos and colloquial speech, as well as the reliability and accuracy of content. But big data and analytics technology now allow us to work with these types of data. The volume often makes up for the lack of quality or accuracy.
It’s all well and good having access to big data, but unless we can turn it into value, it is useless. So value is a critical “V” for big data. It is important that anyone developing in this space, be it in business or science, make a case for the potential value of collecting and analyzing big data. It is perhaps easy to fall into the “buzz” trap and embark on big data initiatives without a clear understanding of costs and benefits. However, there is also a strong driving force to collect data whose value is not yet totally clear as this leaves open opportunities for finding value in it later.
A key point of the 5Vs is that despite the catchy short name, “big data,” you shouldn’t define it just by size, although size is commonly a critical aspect. For me, a practical working definition of big data is any data that forces me to go beyond the standard methods of manipulation and communication. For example, if I can’t email it or easily share it in the cloud, can’t read it fast enough to know what to do with it, can’t store it, can’t easily plot it, etc.—these are all properties that might mark something as in some way “big data” for me or a typical research scientist—and working with such data sets could benefit from sophisticated big data solutions.
Q2. What opportunities does big data present for engineering in general?
A. I think they’re very extensive. The long-term goal is that big data enables you to solve harder problems faster. Very broadly in engineering, you’re trying to design some approach or system, be it a new material or a nuclear reactor or a welding tool. In almost all of these cases, you tend to have a lot of complex phenomena coupling together that you need to be able to understand and control, and I think big data as a philosophy represents this idea of collecting enormous amounts of data on all the different complex interacting aspects of the problem—and then assembling it in a way where you can access it, mine it and use it much more optimally than you could before.
Q3. Why are big data approaches useful in the materials field?
A. The unprecedented ability to organize and manipulate data that is provided by “big data” approaches will be transformative for materials researchers and developers at many levels. For example, materials design, which is focused on developing new materials, already relies extensively on databases of key properties ranging from electronic conductivities to phase stability. In the future, we will have access to far more materials data, with tools to rapidly organize and process it, and this will accelerate the pace of designing new materials with properties precisely tailored to a specific application.
I also think we will see a change from researchers simply providing their final processed understanding and selective data in research papers to providing full records of all the data they produced, greatly enhancing the long-term value of the materials data that is being generated.
Q4. What are the primary challenges related to big data research?
A. A major challenge to big data in materials is building the cyber infrastructure, as it’s sometimes called, for sharing data, analyzing data, making it discoverable, and linking it to traditional ways of sharing—for example, publications.
So-called “metadata” is another big challenge. Metadata is the information about your data that people need to use it—the labels, or “tags,” that tell people what kind of data they’re looking at and allow them to search, sort and use the data in a logical way. For example, if I have measured the light-absorbing properties of a material for a solar cell, I may need to describe what material I used, how I made it, and how I measured its properties for people to be able to use my data optimally.
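A metadata record for the solar-cell measurement described above might look like the following. The field names and values here are purely illustrative assumptions, not an actual community standard; in practice, materials communities negotiate these schemas among themselves, which is exactly the human effort discussed next.

```python
import json

# Hypothetical metadata record for an optical absorption measurement.
# All field names and values are illustrative examples.
record = {
    "material": "CH3NH3PbI3",              # what material was used
    "synthesis": "spin-coated thin film",  # how it was made
    "measurement": {                       # how the property was measured
        "property": "optical absorption",
        "instrument": "UV-Vis spectrophotometer",
        "wavelength_range_nm": [300, 800],
    },
    "units": "absorbance (dimensionless)",
}

# Serializing to a standard format like JSON makes the record
# machine-readable, so the data can be searched and sorted.
print(json.dumps(record, indent=2))
```

Once records like this accompany the raw data, software can filter an archive by material, synthesis route or measurement type instead of a human reading every file.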
If you have the necessary cyber infrastructure set up and, for example, an exabyte, or 1 million terabytes, of stored research data, it’s going to be hard to use that data unless it is labeled with essential metadata that provides guidance for the user. You’ll need to figure out how to describe and categorize that data with appropriate metadata so that people will know how to use it. And developing such metadata, while it can perhaps be done by computer at some stage, presently takes significant human time, including decisions among people in user communities to set standards for metadata. Obtaining good metadata is often a big stumbling block in materials-related big data activities.
Q5. What makes UW-Madison a good place to do work related to big data?
A. I think it’s exciting that within the College of Engineering we have the Wisconsin Materials Institute—which is very committed to promoting data-related concepts, tools and activities—and has the financial resources to do it, with the support of Dean Robertson. We also have many researchers directly working at the interface of big data and engineering, for example, through integrating statistical design of experiments and nanomaterials discovery, as is being done by Statistics Professor Peter Qian and Materials Science and Engineering Professor Xudong Wang, or coupling advanced data reconstruction tools with electron microscopy, as Materials Science and Engineering Professor Paul Voyles is doing.
At UW-Madison, we also have a very strong group of scientists—many of whom are outside of engineering—who are very interested in data analytics and can impact the engineering discipline. For example, we have the Wisconsin Institute for Discovery, which has teams of people who work in these areas and do extremely innovative work.
So our strong cross-disciplinary culture, the strengths in data technologies on campus already, and the commitment of those in their domain-specific fields to develop these data-centric activities all make the university a very vibrant place for this kind of work.