Translating Data for the Masses – One NCAA Basketball Game at a Time
Feb 24 | Data Visualization

Sometimes, I’m stumped when people ask me what I do for a living. When I reply that I help organizations communicate better with and about data, or call myself a data translator, I watch the eyes of the person I’m talking to glaze over. You may be familiar with this look – it’s a common reaction many people have once data or numbers are introduced into a conversation.

A perfect example of what being a data translator means came up this past weekend. My partner and a bunch of his college buddies are huge UNC basketball fans, and on Saturday, the Tar Heels (UNC) played Duke – a fierce and famed rivalry. The game was tight – much closer than anyone who has been following either team this season could have expected. In the post-game recap conversation, our friend shared this analysis of the probability of the Tar Heels winning the game with the group. I sensed an immediate opportunity to translate the chart for the members of our group chat who do not have a master’s degree in statistics.

Thanks to Luke Benz (@recspecs730) for putting out this visualization (and many others) about college hoops on his Twitter feed.

This chart demonstrates the hope, and ultimate defeat, that Tar Heel fans felt throughout the game. But if you’re unfamiliar with an output like this one, the heartbreak it captures may not be immediately apparent.

My job as a data translator and storyteller is to take charts like the one above (from the ncaahoopR package developed by Luke Benz) and transform them to be accessible and easily understood by the general population. Think of it as a version of Google Translate dedicated to breaking down the work of data scientists.

I’d make a few changes to this chart to help interpret the game for my Tar Heel fan friends.

  • Remove Duke from the equation – in a two-team game the win probabilities are complementary (they sum to 100%), so the other team’s values add no information and are ultimately a bit confusing.
  • Add in some annotations to call out key points in the game.
  • Create a title that makes clear what the chart is about and the additional context that will be important for a Tar Heel fan to walk away with.

My version below contains the same data and information but transformed for an audience of college basketball enthusiasts. 
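
As a rough illustration of those three changes, here is a minimal sketch in Python with matplotlib. This is not the actual redesigned chart: the minutes, probabilities, annotation labels, and title below are invented placeholders standing in for the ncaahoopR win-probability output.

import matplotlib.pyplot as plt

# Hypothetical win-probability series for UNC only (placeholder values,
# not the real ncaahoopR output): minutes elapsed vs. probability of winning.
minutes = [0, 5, 10, 15, 20, 25, 30, 35, 40]
unc_win_prob = [0.50, 0.55, 0.62, 0.41, 0.45, 0.57, 0.66, 0.52, 0.02]

fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(minutes, unc_win_prob, color="#7BAFD4", linewidth=2)  # one line only; no Duke mirror

# A title that states the takeaway instead of naming the data set
ax.set_title("UNC hung with Duke for 35 minutes before the game slipped away")

# Annotations calling out key moments (placeholder points and labels)
ax.annotate("Trailing slightly at halftime", xy=(20, 0.45), xytext=(8, 0.15),
            arrowprops=dict(arrowstyle="->"))
ax.annotate("Final minutes: hope fades", xy=(40, 0.02), xytext=(26, 0.25),
            arrowprops=dict(arrowstyle="->"))

ax.set_xlabel("Minutes played")
ax.set_ylabel("UNC win probability")
ax.set_ylim(0, 1)
plt.tight_layout()
plt.show()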

We are living in an incredible time for data – never before have we had access to so many data sets (big and small) or visualization tools.  With this access comes the need to really put thought into the data we share – the why and the how – because we no longer need to put a ton of thought or work into creating visualizations.  The most impressive data analysis is useless without the ability to clearly communicate essential takeaways and offer up persuasive recommendations.  

I challenge you to think about the data and dataviz you create and distribute through the lens of a translator.  Be particular and intentional about the data and visualizations you share. Determine their importance and how these data points help influence decision-makers and tell clear stories to your audience.  Consider the possibility that everyone who uses your data does not have your background, and instead, help them learn through your expertise via clear insights and uncluttered visualizations.

I’m excited to share more about creating strong data stories at ODSC East.  Please come check out my session “The Art (and Importance) of Data Storytelling” to learn more about strategic choices in visualization design and the influential power you can harness as a data translator. 


Bio:

Diedre Downing is a Lead Data Storytelling Trainer at StoryIQ where she helps organizations improve their communication with and about data. An accidental math teacher, Diedre learned the power of demystifying numbers in New York City classrooms and the power of influencing decision-makers with data during her time running WeTeachNYC.org for the NYC Department of Education. Diedre is an Adjunct Lecturer at Hunter College in New York and has spoken at NCTM, iNACOL, and Learning Forward about adult learning methodology and best practices in professional learning.

 


Why We Need Graph Analytics for Real-World Predictions
Oct 08 | Data Visualization

As data becomes increasingly interconnected and systems increasingly sophisticated, it’s essential to make use of the rich and evolving relationships within our data. Graphs are uniquely suited to this task because they are, very simply, a mathematical representation of a network. The objects that make up graphs are called nodes (or vertices), and the links between them are called relationships (or edges).

[See more articles from ODSC West 2019 speakers here!]

Graph Analytics

A property graph model consists of entities, often called nodes, and links between them, often called relationships. Nodes and relationships can also contain properties and attributes.

Graph algorithms are built to operate on relationships and are exceptionally capable of finding structures and revealing patterns in connected data. This is important because real-world networks tend to form highly dense groups with structure and “lumpy” distributions. We see this behavior in everything from IT and social networks to economic and transportation systems. Traditional statistical approaches don’t fully utilize the topology of the data itself and often “average out” distributions. Graph analytics differs from conventional analysis by calculating metrics based on the relationships between things.

Graph Algorithms

Graph algorithms are used when we need to understand structures and relationships to answer questions about the pathways that things might take, how they flow, who influences that flow, and how groups interact. This is essential for tasks like forecasting behavior, understanding dynamic groups, or finding predictive components and patterns.

There are many types of graph algorithms, but the three classic categories consider the overall nature of the graph: pathfinding, centrality, and community detection. Other graph algorithms, such as similarity and link prediction, instead consider and compare specific nodes. Each category is summarized below, followed by a short code sketch.

  • Pathfinding algorithms are fundamental to graph analytics and explore routes between nodes.
  • Centrality algorithms help us understand the impact of individual nodes on the overall network. They identify the most influential nodes and help us understand group dynamics.
  • Community detection algorithms find groups whose members have more relationships within the group than outside it. This helps infer similar behavior or preferences, estimate resiliency, and prepare data for other analyses.
  • Similarity algorithms look at how alike individual nodes are by comparing the properties and attributes of nodes.
  • Link Prediction algorithms consider the proximity of nodes as well as structural elements, such as potential triangles, to estimate the formation of new relationships or the existence of undocumented connections.
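
For a quick, concrete illustration of these categories, here is the minimal sketch referenced above, using Python and the networkx library on a tiny invented graph. The node names and edges are placeholders, and the equivalent Neo4j procedures would look different.

import networkx as nx
from networkx.algorithms import community  # Louvain requires networkx >= 2.8

# A tiny invented network: two tight groups joined by a single bridge
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
    ("dave", "erin"), ("erin", "frank"), ("frank", "dave"),
    ("carol", "dave"),  # the bridge
])

# Pathfinding: a route between two nodes
print(nx.shortest_path(G, "alice", "frank"))

# Centrality: which nodes sit on the most paths (influence over flow)
print(nx.betweenness_centrality(G))

# Community detection: densely connected groups
print(community.louvain_communities(G, seed=42))

# Similarity / link prediction: Jaccard coefficient for a candidate pair
print(list(nx.jaccard_coefficient(G, [("alice", "dave")])))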

Example: Combating Fraud

Let’s say we’re trying to combat fraud in online orders. We likely already have profile information or behavioral indicators that would flag fraudulent behavior. However, it can be difficult to differentiate between behaviors that indicate a minor offense, unusual activity, and a fraud ring. This can lead us into a lose-lose choice: Chase all suspicious orders—which is costly and slows business—or let most suspicious activity go by. Moreover, as criminal activity evolves, we could be blind to new patterns.

Graph algorithms such as Louvain Modularity can be used for more advanced community detection to find groups interacting at different levels. For example, in a fraud scenario, we may want to correlate tightly knit groups of accounts with a certain threshold of returned products. Or perhaps we want to identify which accounts in each group have the most overall incoming transactions, including indirect paths, using the PageRank algorithm.
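
Here is a rough sketch of that two-step idea, again in Python with networkx rather than the Neo4j procedures named above; the transaction graph, per-account return rates, and flagging threshold are all invented placeholders.

import networkx as nx
from networkx.algorithms import community

# Invented directed transaction graph: an edge u -> v means u paid v
T = nx.DiGraph()
T.add_edges_from([
    ("a1", "a2"), ("a2", "a3"), ("a3", "a1"),  # a tightly knit ring
    ("a4", "a5"), ("a5", "a6"), ("a6", "a4"),
    ("a3", "a4"),
])
# Placeholder share of returned products per account
returns = {"a1": 0.6, "a2": 0.7, "a3": 0.5, "a4": 0.1, "a5": 0.2, "a6": 0.1}

# Step 1: community detection (Louvain, over an undirected view of the graph)
groups = community.louvain_communities(T.to_undirected(), seed=7)

# Step 2: PageRank over the directed graph captures incoming transactions,
# including influence that arrives via indirect paths
rank = nx.pagerank(T)

for g in groups:
    avg_return_rate = sum(returns[a] for a in g) / len(g)
    flagged = avg_return_rate > 0.4        # placeholder threshold for suspicious groups
    top_account = max(g, key=rank.get)     # most influential account within the group
    print(sorted(g), "flagged:", flagged, "top by PageRank:", top_account)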

To illustrate these algorithms, below is a screenshot using Louvain and PageRank on season two of Game of Thrones. It finds community groups and the most influential characters using our experimental tool, the Graph Algorithms Playground. Notice how Jon is influential in a weakly-connected community but not overall, and that the Daenerys group is isolated. Interestingly, it’s been noted that highly connected “islands” of communities can signal fraud in certain financial networks.

[Screenshot: Louvain community groups and PageRank influence scores for Game of Thrones, season two, generated with the Graph Algorithms Playground]

Conclusion

We’ve quickly overviewed what graphs are and how graph algorithms are uniquely suited to today’s connected data; however, we’ve just scratched the surface of what’s possible. If you’re interested in diving deeper, consider attending our training, “Reveal Predictive Patterns with Neo4j Graph Algorithms,” at ODSC West 2019 on Wednesday, October 30th.

[Related Article: Creating Multiple Visualizations in a Single Python Notebook]

We also recommend downloading a free copy of the O’Reilly book, “Graph Algorithms: Practical Examples in Apache Spark and Neo4j” while it’s still available.  This book walks through hands-on examples of how to use graph algorithms in Apache Spark and Neo4j, including a chapter dedicated to machine learning.


Authors:

Jennifer Reif

Jennifer is a Developer Relations Engineer at Neo4j, conference speaker, blogger, and an avid developer and problem-solver. She has worked with a variety of commercial and open source tools and enjoys learning new technologies, sometimes on a daily basis! Her passion is finding ways to organize chaos and deliver software more effectively.

Amy E. Hodler

Amy is a network science devotee and a program director for AI and graph analytics at Neo4j. Amy is the co-author of Graph Algorithms: Practical Examples in Apache Spark and Neo4j. She tweets @amyhodler.

Originally posted on OpenDataScience.com


Smart Image Analysis for E-Commerce Applications
Sep 25 | Data Visualization

Editor’s note: Abon is a speaker for ODSC West this Fall! Consider attending his talk, “Computer Vision for E-Commerce: Intelligent Analysis and Selection of Product Images at Scale” then.


In e-commerce, product images play a critical role in delivering a satisfactory customer experience. Images help online shoppers gain confidence in a product and increase their engagement with it, increasing the likelihood of purchase. Hence, from the perspective of an e-commerce company like Walmart, images are an integral, vast, and valuable component of its catalog. We are motivated to analyze images for several reasons: to measure the quality of images, to detect and discard offensive pictures, and to select and rank them by their relevance to a product.

[Related Article: 4 Examples of Businesses Solving Problems with AI]

Image Analysis for e-Commerce

Image analysis for an e-commerce catalog such as ours typically goes through the following stages (not necessarily in this order):

  • Filter images by content or quality – This covers several binary classification problems, each addressing a quality issue (such as sharpness) or a compliance issue (such as violent or adult content). Some of them, especially the compliance problems, are ill-posed with severe class imbalance. Some are better treated as object detection than classification.
  • Classify images based on content – This stage categorizes the images of a product into several buckets based on their viewpoint or other characteristics, such as lifestyle vs. solo images, product image vs. packaging image, and so on.
  • Extract content from images – In this step, specific information such as textual attributes is extracted from images wherever possible. One example is detection of the drug facts table in the picture of a medicine and extraction of the ingredients. Another is intelligent extraction of a flat or textured square region from apparel images that can be used as a thumbnail or swatch on the website.

For curious readers, here are the links to our recent papers that discuss our image analysis pipeline in more detail:
https://arxiv.org/abs/1811.07996
https://arxiv.org/abs/1905.02234

We deal with a number of challenges while building models or algorithms that are part of the above-mentioned pipeline. Let us focus on one of them in this article: shortage of training data. The challenge, far more commonplace than you think, arises primarily for two reasons:

1. The problem we want to detect manifests rarely, making it difficult to find examples of the “positive” class. A typical example would be an offensive image, such as a racially inappropriate symbol on a hat. Usually, one such image is found and reported by a customer by accident. In reality, less than 0.01% of the products have such an image, and there is no easy way to sift through the catalog to find more examples.
2. The scale of the problem makes image annotation prohibitively expensive. Let us say that we want to classify images into five viewpoints – top, bottom, left, right, and close-up – for a product, and then scale it to 10,000 types of products. This means we have an extreme-scale classification problem with ~50,000 classes to solve. Since different viewpoints are often close to each other in color and shape, we do need a decent amount of annotated images for each class. Also, for such a fine-grained task, we should probably use a trained crowd, which is more expensive than a completely anonymous one, to ensure better-quality responses. Even if we ask for only 10 annotated images per class, the cost of annotating half a million images can be too high for many projects in many companies.

We often take recourse to a number of practical strategies – some conventional and some ad hoc or unique – to deal with this challenge.

1. Data augmentation – Standard techniques of image data augmentation include color and geometric transformations. They often cannot produce enough useful data, so problem-specific custom techniques such as superimposition or image synthesis are applied.
2. Few-shot learning – When only a few examples of a class are known in advance, a mix of few-shot learning and a conventional classifier often produces better results than either alone. For example, consider the picture of an “energy guide” of a television. Its look remains almost the same regardless of the brand and model of the television, so it is possible to build a classifier for this class with very few examples. However, the “close-up” views of different television models vary so much that a few-shot learner will easily overfit.
3. Iterative training – When we do not have enough training data to build a high-precision model, we start with shallow linear classifiers, very small neural nets, or heuristic-based classifiers. These baseline models, which often work as low-precision, moderate-recall predictors, are used to generate predictions. Depending on the crowdsourcing budget, a percentage of the high-confidence predictions are sent for manual review. The reviewed images are fed back to the baseline classifiers, which are then retrained. This process is repeated until some base models are good enough, or we have enough data to develop a full-scale model.
4. Multi-stage inference – When it is expensive to procure training data for a complex task, or it is compute-intensive to run a complex model on the entire set of products, we try to divide the problem in two. A precursor, namely a simpler image model or a non-image model that learns from context such as product title and category, is added before the main model. A typical example would be the detection of nudity, a problem more likely to occur in certain categories such as wall art or books. Adding a faster, lightweight classifier that separates books and wall art from the rest of the catalog reduces the load on the slower, deeper object detection network that is trained to detect nudity. The training data can also be collected from those categories only, instead of the entire catalog.
5. Transfer (and meta) learning – Last but not least, appropriate use of transfer learning and meta-learning, when possible, can produce high-quality results with a relatively small amount of data. If the problem at hand involves classes similar to the ones available in public datasets such as ImageNet or COCO (for example, we want to detect rocking chairs, which are similar to dining chairs), or requires finer classification of such a class (for example, we want to distinguish between left-facing and right-facing pictures of shoes), beginning from a pre-trained model and fine-tuning it is a great practical idea. A minimal fine-tuning sketch follows this list.
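
As referenced in the last item, here is a minimal, generic fine-tuning sketch using PyTorch and torchvision (0.13 or later for the weights API). The dataset path, augmentations, and hyperparameters are placeholders, and this is a textbook recipe rather than the pipeline described in this article.

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Standard augmentation (color and geometric transformations) plus ImageNet normalization
train_tfms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Placeholder dataset path; ImageFolder expects one sub-folder per class
train_ds = datasets.ImageFolder("data/train", transform=train_tfms)
loader = DataLoader(train_ds, batch_size=32, shuffle=True)

# Start from an ImageNet-pretrained backbone and replace the classification head
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in model.parameters():
    p.requires_grad = False                     # freeze the backbone
model.fc = nn.Linear(model.fc.in_features, len(train_ds.classes))

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):                          # placeholder number of epochs
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()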

[Related Article: Deep-Text Analysis for Journalism]

Overall, developing classification and object recognition solutions for real problems and scaling them for an enormous catalog is an intriguingly complex problem. To understand more about how we learn from the textbooks and then deviate from them to solve these problems, consider attending my talk at ODSC West in San Francisco from October 29th-November 1st.

Originally posted on OpenDataScience.com


Telling Human Stories with Data
Sep 25 | Data Visualization

Editor’s Note: Be sure to attend Alan’s talk on this idea of telling human stories with data at ODSC Europe this November 19 – 22 in London! Register now for “Bringing Data to the Masses Through Visualization.”

If you’re reading this and you’re part of the ODSC community, the chances are you’re an expert in the data you work with. You understand the methodologies required to make the data robust and meaningful, and you are already using tools to visualize data for yourself to better see patterns and relationships. Communicating the meaning of the data to non-experts, however, brings new challenges.

[Related Article: Why Effective and Ethical AI Needs Human-Centered Design]

I work with government departments, corporations, startups and charities on developing their data visualization practice–from the simplest PowerPoint slide through to complex, interactive dashboards–and I consistently find that this storytelling aspect of data science is the one that causes the most problems.

Non-experts in data are often your most important stakeholders: senior leaders, colleagues in non-data roles, investors, customers, regulators, donors, or members of the public. Convincing them of the importance and urgency of a data pattern or trend is what drives change, alters behavior, prompts actions and gets decisions made.

Over my time working with different organizations, I’ve come to rely on a simple three-part framework for creating data visualizations that are compelling for human audiences—telling human stories with data. It’s not a guaranteed formula for success, but in my experience when a visualization isn’t working, it’s because one of these elements has either been poorly defined, or not discussed at all.

Audience

With visualizations, there is a natural tendency to start with the data and work out from there. This makes intuitive sense–but actually we need to jump ahead and think about our audience, and the context in which they’re going to receive the information.

Human beings have different prior experience, knowledge, and motivations–depending on their background and role. They may be domain experts, but not data experts. They will be looking at your visualization from the perspective of what it means for them–so present the data in a way that makes that easy to see and understand. Remember, you’re probably not trying to simply win an argument–you’re trying to win people over. This requires empathy.

Format also matters here. There is no single ideal way to visualize a data set. The output should and will look completely different depending on whether it’s a presentation slide, a poster, an image on social media, an online interactive, or a long, printed report.

If you have multiple, diverse audiences, you’ll probably need to create a separate data communication for each one. If you’re trying to aim something at ‘everyone’ it’s likely that it won’t resonate with anyone. Defining your audience at the start of the process makes it much easier to decide what data to include (and what you should leave out), what format will work best, and how much contextual information they will need.

Story

Once you’ve established ‘who’ you’re talking to, you need to define ‘what’ you’re telling them. You should be able to write the story (or key message) down as a single, short sentence. This should be the title that sits above your visualization–rather than the title being the name of the data set.

From a big piece of research or a large data set, you’re likely to have multiple stories–but tell them one at a time. Put simply, ten slides with one clear chart per slide is much better than one slide with ten charts piled on top of each other.

Talking in terms of stories can make data scientists nervous–they worry about cherry-picking, being biased, or not showing the whole picture. But a crucial part of your role is to act as an expert filter, and separate the signal from the noise. Think about what you would be comfortable verbally telling somebody about the data set–your story just needs to match (not exceed) that level of confidence.

There’s a reason you’ve taken people’s time and attention and asked them to look at this data–tell them what that reason is. If you’re not sure what the key message is, the person you’re communicating with has no chance of understanding it.

Action

This is the “why” part of the equation: we are showing you this data because we think you need to do X. You could be prompting a decision, asking a question, calling for a specific action, or cajoling your audience into some kind of behavior change–but there has to be a reason why you’re presenting them with this data in the first place.

This is another thing that can cause concern for data scientists–particularly if you feel that your role is to provide insight, not dictate the decision-making. Your rank and role within the organization will determine whether the action is an order, a recommendation, or a suggestion.

But in nearly every case, you can flag up the patterns, trends, anomalies, areas of concern, or potential opportunities–and then simply recommend that resource and attention from the senior team should be directed towards those areas.

In Summary: Telling Human Stories with Data

This simple three-part framework (Audience, Story, Action) gives you a mini brief to work to when you’re building a visualization that needs to communicate effectively to non-experts.

[Related Article: 3 Things Your Boss Won’t Care About in Your Data Visualizations]

Obviously, there are additional complexities and nuances that come with applying this in practice, and I’ll be expanding on these in my talk at ODSC Europe in November.

Editor’s Note: Be sure to attend Alan’s talk on this idea of telling human stories with data at ODSC Europe this November 19 – 22 in London! Register now for “Bringing Data to the Masses Through Visualization.”


Your Data is Garbage Unless it Tells a Story
Aug 28 | Data Visualization

Bill is a speaker for ODSC West 2019 this November in San Francisco! Attend his talk “From Numbers to Narrative: Turning Raw Data into Compelling Stories with Impact” then!


If a tree falls in the forest… you know the aphorism. I’m using it here to remind you that your most important job is not data science but communications – OK, maybe communications isn’t more important, but it’s equally important. When you do work and your work remains in a vacuum, it serves no purpose. If your colleagues, bosses, and clients don’t learn from your data analysis, then it is literally useless.

Did you know that one of the most-referenced skill requirements in job postings for data science professionals is communications? Yes, your potential employer knows how important it is too – it’s not just me saying this!

 

So communications is important – really important. Agree? Keep reading.

How do you communicate? Do you just show someone a spreadsheet and say nothing? Do you print out a chart and deliver it to your colleagues via inter-office mail (does that still exist?!?) without context? Do you email someone a giant list of facts in bullet-point form? I hope not!

No, you tell a story. Story is defined in overly complicated ways, so let me simplify: I just mean you tell a logical, flowing, linear sequence of information with continuity and completeness. There is a beginning, a middle, and an end. And when you’re done, your audience is not left doubting your credibility or thinking you’re hiding something or holding back.

[Related article: Advantages and Best Uses of Four Popular Data Visualization Tools]

Here’s the thing – storytelling is an evolutionary imperative. Story is literally the most important thing that has helped humans survive and evolve as a species. As Lisa Cron says, in Wired for Story, “Story, as it turns out, was crucial to our evolution—more so than opposable thumbs. Opposable thumbs let us hang on; story told us what to hang on to.” It’s the story of the Neanderthal who ate the berries that poisoned him and led to a horrible, painful death that taught the entire village to avoid that one type of berry. Not everyone could witness it first-hand. And that story was told in a logical flowing way. It was not told like so: “Red berries. Death. Pain.” That might have worked…but a more dramatic, emotional, flowing story with memorable details would be far more effective.

For you doubters out there – for the data purists who insist “we are scientists – it’s different for us!” I point to this study, which found that the most influential scientists (those whose research is cited most often by other scientists) are those who use narrative techniques in their research reports.

Storytelling is critical, effective, and will help you do your job better. I’ll be talking about data storytelling at ODSC West this fall, and I can’t wait to share tips, tricks, and research-driven best practices for storytelling and visualization that will help you think differently about your next project and implement specific changes to your work the moment you leave the room!

[Related article: 6 Reasons Why Data Science Projects Fail]

—

Editor’s note: Be sure to attend his talk “From Numbers to Narrative: Turning Raw Data into Compelling Stories with Impact” at ODSC West later this year!

