Posted on Feb. 19, 2015 by perspective
Wisconsin engineers find meaning in big data
For all the scale and promise the term “big data” evokes, the exploding field of big data research really comes down to constraints. Even as high-throughput computers become increasingly powerful, it’s not feasible for them to simply brute-force their way through every massive set of information that might provide potential insight. So, like most of us in this information-laden age, big-data researchers experience a dual feeling of opportunity and inundation.
“We’ve got many, many ways to collect new data, we have a ton of existing data, and now we have limited resources,” says McFarland-Bascom Professor of Electrical and Computer Engineering Rob Nowak.
Luckily, finding solutions within constraints is what engineers do. Across the College of Engineering, researchers are forging multidisciplinary partnerships to apply big-data concepts to areas ranging from manufacturing to astronomy to healthcare. Meet some of the engineers who are making UW-Madison a big-data leader, tackling both the theoretical underpinnings of computing and the practical challenges of turning data sets into real-world benefits.
New tools for winnowing and learning
While the application-oriented side of big data is a common theme in the College of Engineering, UW-Madison engineers also want to make transformative contributions to the mathematical and theoretical underpinnings of analyzing massive data sets. For Electrical and Computer Engineering Associate Professor Rebecca Willett and McFarland-Bascom Professor of Electrical and Computer Engineering Rob Nowak (along with colleagues including Industrial and Systems Engineering and Computer Sciences Professor Stephen Wright), that effort centers on the intersection of human expertise and computing power. They focus primarily on developing algorithms to winnow down big data sets to their most important elements for human analysis, and on creating algorithms and theory that enable machines to learn from human experts with a minimum of human interaction. This research means confronting the limitations of current data-analysis tools from a computing angle, and also confronting the bottlenecks that human experts create when carrying out their role in the data-analysis process. “There are certain big-data problems that we don’t yet know how to solve, where we need new tools,” Willett says.
While they are focused on advancing the fundamental underpinnings of big data research, they do think about the myriad opportunities for applications, in data sets that emerge in settings ranging from astronomers’ satellites to neuroimaging studies. Some of the fundamental questions Willett and Nowak tackle are informed by collaborations, including one with UW-Madison astronomy professor Christy Tremonti, and by their roles with the newly established UW-Madison Center for Predictive Computational Phenotyping, which aims to turn massive amounts of patient data into useful information that could inform treatments and health risk assessments.
On the importance of human input: “Methods that Google and Facebook use to automatically analyze images do not always translate to scientific data,” Willett says. “The features that an astronomer may use to distinguish between two galaxies may be totally different from the features that allow me to distinguish between a cat and a dog. We need easy-to-use methods that incorporate that astronomer’s expertise without it being a burden.”
On the promise and challenges of big data: “Any time you’re trying to do data analysis, there are costs,” Willett says. “But computational costs are decreasing, so we can do more computation with fewer resources—this is not the bottleneck. Rather, the bottleneck is the human expert that can only examine a small fraction of the available data.”
On what the “value of information” means for patient data: “In biomedical research, there may be many experiments you could run, or tons of data for a human analyst to look at,” Nowak says. “The ‘value of information’ is a mathematical framework that can predict what experiment is going to provide the most new information, or help us to decide which data we should ask a scientific expert to look at and interpret.”
On informing theoretical research with practical applications: “Certain things we might want to do mathematically are physically impossible or prohibitively expensive computationally or otherwise, so working with real-world applications grounds the work and also might expose problems, weaknesses, or challenges that haven’t been addressed by our current theoretical understanding of big data problems,” Nowak says.
On the nuances of sifting through patient data: “If you’re trying to analyze images, the data are highly structured,” Willett says. “But an electronic health record is more complex, containing the times at which different lab tests are conducted, pharmaceutical records, images corresponding to different scans, and physicians’ cryptic notes. Unfortunately, these data often contain significant errors, making it very difficult to identify new risk factors or best practices.”
On where “big data” begins and ends: “I like the term ‘found data,’ which is a good term for a lot of the big data we hear about in the news,” Nowak says. “It’s a byproduct of all our online commercial and social activities. In engineering and science, we usually purposely gather data to help us answer specific questions. So big data research in science and engineering naturally involves the data gathering process in addition to data analytics.”
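To make Nowak’s “value of information” idea above a little more concrete, here is a minimal Python sketch of one common way to score candidate experiments: expected information gain, the average reduction in uncertainty (Shannon entropy) about a set of hypotheses. This is an illustrative textbook formulation, not the researchers’ own method. The function names, the two hypothetical experiments and all probabilities are assumptions made up for this example.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, likelihoods):
    """Expected entropy reduction from running one experiment.

    prior: prior[h] = P(hypothesis h)
    likelihoods: likelihoods[h][o] = P(outcome o | hypothesis h)
    """
    n_hypotheses = len(prior)
    n_outcomes = len(likelihoods[0])
    gain = entropy(prior)
    for o in range(n_outcomes):
        # Marginal probability of observing outcome o.
        p_o = sum(prior[h] * likelihoods[h][o] for h in range(n_hypotheses))
        if p_o == 0:
            continue
        # Posterior over hypotheses after observing outcome o (Bayes' rule).
        posterior = [prior[h] * likelihoods[h][o] / p_o for h in range(n_hypotheses)]
        gain -= p_o * entropy(posterior)
    return gain


def most_informative(prior, experiments):
    """Return the name of the experiment with the largest expected gain."""
    return max(
        experiments,
        key=lambda name: expected_information_gain(prior, experiments[name]),
    )


# Hypothetical example: two equally likely hypotheses, two candidate experiments.
prior = [0.5, 0.5]
experiments = {
    "sharp_test": [[0.9, 0.1], [0.1, 0.9]],  # outcome depends strongly on the hypothesis
    "noisy_test": [[0.6, 0.4], [0.4, 0.6]],  # outcome barely depends on the hypothesis
}
best = most_informative(prior, experiments)
```

In this toy setup the sharper test scores higher because its outcomes separate the two hypotheses more cleanly. The same scoring could, in principle, rank which data an expert should be asked to review first.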
Applying computational tools to experimental materials data
Materials Science and Engineering Professor Paul Voyles applies computational tools—such as machine learning, image and pattern recognition, optimization and data mining—to solve difficult problems like measuring the atomic structure of glass. The atoms in a glass structure are all jumbled up and disorganized, rather than arranged on a geometric grid, so it’s particularly difficult to analyze the positions of atoms and to pull out what’s significant and what controls the properties of the material. To address this, Voyles’ research group is using optimization techniques to build better computer models that agree with the experimental data. He then can interrogate the models to extract useful information. “These techniques, which fall broadly under the category of machine learning, may be fruitful for getting the data to tell us what’s important, and to point us in the right direction instead of having us guess,” Voyles says.
On working with data that isn’t especially big in size: “The amount of experimental data my group deals with isn’t ‘Facebook big,’ but it’s ‘big’ in terms of dimensionality and complexity—lots of different factors in the data all matter and all interact with one another in complicated ways. So while we’ve got hundreds of gigabytes instead of petabytes of data, the underlying relationships are still complicated and hard to access. There’s a lot of exciting science and potential for rapid speedup in materials engineering development cycles by applying tools that have been developed for big data to our complex but not-so-big data.”
On the importance of interdisciplinary collaboration and communication: “What’s really required is for experts in both areas, engineering and data science, to work together closely because it isn’t enough for an educated non-expert like me to pull techniques and software from the literature or a website and try to simply apply them. On the other hand, data scientists can generate beautiful and theoretically important breakthroughs that don’t solve practical problems. So you really need communication back and forth between the people who have data expertise and who have science and engineering problems.”
On how big data could change the materials science field: “From the 10,000-foot view, I think we’re moving toward a new way of doing materials science and engineering, which erases boundaries between experiments and computation. When I was a graduate student, you could either do experiments or you could do simulations, and each was so hard that you couldn’t do both. Now, mostly because the computational tools have gotten more powerful and easier to use, it really is possible to do both.”
New data for old problems
Chemical engineers have long focused on how to monitor and adjust processes for optimal quality and efficiency. But as all kinds of data sets become available in different fields, researchers like Paul A. Elfers and W. Harmon Ray Professor of Chemical and Biological Engineering Jim Rawlings can go beyond industrial chemical reactions and into other large-scale processes that need to become more efficient. One of Rawlings’ current projects involves collaborating with Johnson Controls to develop algorithms and computer models that will continually monitor and adjust HVAC systems in large commercial buildings, drawing on the huge amount of information that increasingly cheap and versatile sensors provide.
On responding to the skeptics: “The reason people think the term big data is overhyped is we have been doing that kind of thing forever in the field. Now we’re still doing it and availability of more data is changing slightly the techniques we’re using, and people say, ‘But that’s what we’ve always been doing,’ and there’s this disenchantment. I think that overlooks the fact that the data sets are getting bigger and richer and they are growing fast, and that’s a very good thing in terms of opportunities to make significant improvements in process operations.”
On the opportunities big data presents in industrial applications: “Because you have so much more data, you can validate models, and if you get them right, they’re going to tell you things you never used to know, and you’re going to make incredibly sharp decisions about how to operate to lower energy costs, in the case of an HVAC system, or higher quality products or higher throughputs, in the case of a chemical plant.”
On the connection between big data and primal human instincts: “One of the exciting new forms of data is video images. These are high-resolution pictures of process operations such as material coatings, flames, etc. Your mind is a really good visual processor, because our eyes have been one of our key sensors for finding food, detecting threats, scanning the horizon, knowing what’s going on. We are working on how to compress video images and turn that information into decision-making, automatically, so the human operator does not have to do it. How do we build smart enough algorithms that can do what the human can do very easily?”
Analyzing bone shape data to gain insight into hip osteoarthritis
Mechanical Engineering Associate Professor Xiaoping Qian uses computational techniques to analyze large sets of 3D shape data. Using high-resolution 3D sensors, Qian scans a physical object and acquires massive point cloud data: the tens of thousands of data points that make up the contours of the object’s surface. He then uses computer models to make sense of the data and study the object’s shape. Qian is applying these techniques to research on how the shape of the bones in a person’s hip joint could affect the risk of developing osteoarthritis in the hip. Qian and his collaborators are building a large database of massive scan data from patients with hip impingement syndrome. They are scanning the acetabular and femur bones on both the left and right sides, and since most patients report that one side is suffering from the impingement, they deem the other side “normal.”
Qian will crunch this data with computer algorithms to analyze the bone shapes, identifying shape variation patterns and establishing a normal range of shape variation. The team will then compare these normal patterns against those of hips with impingement to pinpoint any abnormalities in bone shape. Qian says the eventual goal is to see whether the team can discover biomarkers that identify what makes some people more likely to develop hip osteoarthritis.
On the challenges of working with big data: “One major challenge in working with large sets of shape data is the shape correspondence issue. That is, for a given point in one shape, you need to figure out the corresponding points in other shapes so an accurate statistical analysis can be conducted.”
On analyzing large data sets for design and manufacturing applications: “When you laser scan a physical object, like a mechanical part, you can acquire massive scan data, which is usually raw data. I’m interested in how to make sense of the data, process it, extract meaningful information, and how to feed that data into design and manufacturing applications. The use of statistical shape modeling for product design, analysis and manufacturing represents an uncharted research area.”
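The shape correspondence problem Qian describes above can be illustrated with a deliberately naive Python sketch: for each point on one scanned shape, pick the closest point on another. This nearest-neighbor matching is only a simple stand-in for illustration, not the method Qian’s group uses, and real pipelines typically align the shapes first and use far more robust matching. The function name and the toy coordinates are hypothetical.

```python
import math


def nearest_neighbor_correspondence(source, target):
    """For each point in source, return the index of the closest point in target."""
    matches = []
    for p in source:
        # Brute-force search over every target point (fine for a toy example).
        best_index = min(range(len(target)), key=lambda j: math.dist(p, target[j]))
        matches.append(best_index)
    return matches


# Hypothetical toy data: three surface points and a slightly shifted, reordered copy.
shape_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shape_b = [(0.1, 1.0, 0.0), (0.1, 0.0, 0.0), (1.1, 0.0, 0.0)]
matches = nearest_neighbor_correspondence(shape_a, shape_b)
```

Once every point has a counterpart, statistics such as the mean position and the spread of each matched point across many scans can be computed, which is the kind of analysis the correspondence step makes possible.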
Varied applications for data analytics
A self-described quick learner, Industrial and Systems Engineering Assistant Professor Kaibo Liu develops data fusion methodologies and applies the principles of data analytics to large data sets in areas ranging from solar image monitoring to pharmaceutical manufacturing and Alzheimer’s disease research. Currently, Liu’s two primary areas of concentration are the monitoring of high-dimensional streaming data and degradation and prognostics research involving multiple sensors.
For example, Liu developed data fusion methodologies that select and combine the information from multiple-sensor data for monitoring, diagnosing and prognosticating aircraft turbine engines. He says about 90 percent of current degradation prognostics research takes in data from only a single sensor. Liu’s approach, analyzing data from multiple sensors, treats the parts of a turbine engine as related to and dependent on one another. His research leverages the dependencies among parts that are indicated by real-time multiple-sensor signals to predict the behavior of the engine system as a whole. In this way, his degradation and prognostic analysis could impact the entire manufacture of turbine engines, from better designs and more responsive maintenance to better allocation of resources, lower costs and an overall better quality product.
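As a rough illustration of why combining several sensors can beat relying on one, the Python sketch below fuses independent noisy readings of the same quantity using inverse-variance weighting, a standard textbook technique. It is not Liu’s methodology, which the article does not detail. The function name, the readings and the variances are all assumptions chosen for the example.

```python
def fuse_readings(readings, variances):
    """Combine independent noisy readings of one quantity by inverse-variance weighting.

    Returns a tuple (fused_estimate, fused_variance).
    """
    if not readings or len(readings) != len(variances):
        raise ValueError("need one variance per reading, and at least one reading")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    # More precise sensors (smaller variance) receive larger weights.
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, readings)) / total
    return estimate, 1.0 / total


# Hypothetical example: three sensors estimating the same wear indicator.
estimate, variance = fuse_readings([10.0, 12.0, 11.0], [1.0, 4.0, 2.0])
```

Under the independence assumption, the fused variance is smaller than that of the best single sensor, which is one simple sense in which multiple-sensor data can sharpen a prediction.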
On the first challenge of big data: “A lot of companies and researchers are fascinated with collecting data. With so much data, the proportion of useful information is usually very small. If you do not have a clear objective and application for the data, the data’s value remains to be seen, and any decision you make may not be optimal for the whole system.”
On how “big data” can be a relative term: “Data analytics can be intended for different applications such as manufacturing, healthcare or natural science. Different research areas have different definitions for big data. Some disciplines think a gigabyte of data is big. For me, that’s not big. It’s more like ‘relative-size data.’”
Trafficking in new data sets
As the IT manager of the Wisconsin Traffic Operations and Safety (TOPS) Laboratory and a leading researcher in the area of transportation safety, Steven Parker uses computer analysis of traffic data to help transportation officials create safer, more efficient road systems. To illustrate how the realities of big data come to bear on traffic research, Parker points to the federal Strategic Highway Research Program 2 Naturalistic Driving Study, which has collected approximately two petabytes of video and operational data on how drivers behave, much of it sourced from sensors and cameras installed in the vehicles. Such data could yield crucial insights into why crashes happen and how to prevent them, but it presents big data challenges in integrating this data effectively with traditional data sets and pulling out the most relevant details.
In addition to sensors that count vehicles and monitor their speed, technologies like LIDAR are helping traffic engineers make detailed analyses of the topography and curvature of roads, including necessary traffic control and visual challenges. One of Parker’s current research efforts is enhancing the advanced traffic management system (ATMS) software that the Wisconsin Department of Transportation uses to monitor traffic statewide.
On how data can make traffic engineers more proactive: “To a certain extent we know about crash history, but how do we study near-crashes? One of the promises of big data is that we can be more predictive rather than reactive. We can know beforehand that we are getting into a high-danger situation. Can we start doing statewide analysis and making apples-to-apples comparisons around the state to prioritize certain areas that might be high risk, rather than waiting for those areas to come to the top after crashes happen?”
On big data and emergency decision-making: “When it comes down to severe-injury crashes, seconds count, so how can we make the best decisions about where to send someone and what resources to send to the ground based on what we’re detecting in the field and what we know from past experience?”
On the next generation of ATMS: “The next generation ATMS will leverage real-time intelligence made possible from big data. We have to be aware that big data is on the horizon, and how can we start tracking initial capabilities and what are the things we want it to do? It will need to have capabilities to process and bring in all these data sets without overloading the people who are charged with managing traffic on the roadway.”
The TOPS Lab has been working with extremely large transportation engineering data sets since its inception in 2003. According to David Noyce, CEE transportation engineering professor and TOPS Lab director, “Dr. Parker and his research team have been working in the burgeoning area of big data for many years, before we used the term ‘big data.’ The incredible ability of our transportation engineering research team will undoubtedly create new and efficient ways to process expansive amounts of big data in timely and proactive ways that lead to safety and operational improvements across our entire transportation system.”