Our Work

We focus on solving big data problems that impact the Intelligence Community and national security.

All of our completed work is available on GitHub.

Projects

VennData

In Progress

Determine the optimal composition of your data.

Data creation, data curation, data augmentation, Data Shapley, active learning, synthetic data

As machine learning expands into new domains, data is becoming more expensive and difficult to label. One possible solution is to build cheaper analogous datasets, for example training on data from Los Angeles and deploying in San Francisco, or training on synthetic data and testing in the real world. But this violates one of the main assumptions of machine learning: that the training and test data come from the same distribution. VennData is working to resolve this discrepancy by building metrics and a pipeline to analyze when and how the training dataset deviates from the testing dataset.
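
The specific metrics VennData is building aren't spelled out here, but a minimal sketch of the underlying idea, testing each feature for train/test distribution drift with a two-sample Kolmogorov–Smirnov test, might look like this (the feature names, data, and significance threshold are all illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

def distribution_shift_report(train, test, feature_names, alpha=0.01):
    """Flag features whose train/test distributions differ significantly.

    train, test: 2-D arrays of shape (n_samples, n_features).
    Returns (feature, KS statistic, p-value) for each drifted feature.
    """
    drifted = []
    for j, name in enumerate(feature_names):
        stat, p = ks_2samp(train[:, j], test[:, j])
        if p < alpha:  # distributions are unlikely to be the same
            drifted.append((name, stat, p))
    return drifted

# Toy example: the second feature is deliberately shifted.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 2))
test = np.column_stack([rng.normal(0.0, 1.0, 1000),
                        rng.normal(0.5, 1.0, 1000)])
print(distribution_shift_report(train, test, ["color", "size"]))
```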

GANomede

In Progress

Understanding the fingerprints left by deep learning generative models on synthetic data.

Synthetic data, Generative models

Deep learning models show impressive results creating synthetic examples of the data they've learned. But how is that data represented in the models, and what fingerprints are left in synthetically generated data? The GANomede project revolves around understanding the capabilities and limitations of deep learning generative models, their capacity for data representation, and their utility for detecting synthetically generated data.
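
GANomede's own detectors aren't described here, but one well-known family of fingerprints lives in the frequency domain: GAN upsampling layers tend to leave periodic artifacts that inflate high-frequency spectral energy. A minimal sketch of that statistic (the band cutoff is illustrative):

```python
import numpy as np

def highfreq_energy(image):
    """Fraction of spectral energy in the highest-frequency band of an image.

    GAN upsampling artifacts often show up as excess energy far from the
    center of the 2-D Fourier spectrum.
    image: 2-D grayscale array.
    """
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    h, w = spectrum.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)   # radial frequency
    outer = r > 0.75 * r.max()             # outermost band only
    return spectrum[outer].sum() / spectrum.sum()

# Real vs. synthetic images could then be separated by thresholding (or
# classifying) this statistic across a labeled set of images.
rng = np.random.default_rng(0)
print(highfreq_energy(rng.normal(size=(256, 256))))
```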

VeriCAT

In Progress

Machine Translation Quality Estimation

Natural language processing, machine translation, model evaluation

Launch UI

Is your translation accurate? The VeriCAT project quantifies trust in machine translation so that end users can act on translations effectively. The project includes a novel translation quality estimation model, a dataset, and a user interface.
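
VeriCAT's quality estimation model itself isn't detailed here; as a flavor of the task, a common reference-free baseline scores a translation by its similarity to the source in a shared multilingual embedding space. A minimal sketch using the sentence-transformers library (the checkpoint and interpretation are stand-ins, not VeriCAT's model):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Any multilingual sentence encoder works here; this checkpoint is one
# public option, not the model VeriCAT ships.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def qe_score(source, translation):
    """Crude quality estimate: cosine similarity of multilingual embeddings."""
    embeddings = model.encode([source, translation], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

score = qe_score("Le chat est sur la table.", "The cat is on the table.")
print(f"quality proxy: {score:.2f}")  # closer to 1.0 suggests higher fidelity
```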

Cyphercat

In Progress

Security & Machine Learning

Machine Learning Model Pipeline Security

Launch GitHub

Is your training data safe? Cyphercat demonstrates security vulnerabilities in machine learning pipelines and training data by attacking various model architectures. Attacks include model inversion (reconstructing an image from the training data) and membership inference (determining whether a specific piece of data is contained in the training data).
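
The repository contains the full attack implementations; the simplest membership inference variant exploits the fact that models tend to be more confident on examples they were trained on. A minimal sketch (the threshold is illustrative):

```python
import numpy as np

def membership_inference(confidences, threshold=0.9):
    """Guess 'member' when the model's top softmax score exceeds a threshold.

    confidences: array of shape (n_examples, n_classes) of softmax outputs.
    Models are often overconfident on their own training data, so high
    confidence weakly signals membership.
    """
    return confidences.max(axis=1) > threshold

# Toy softmax outputs for two examples.
probs = np.array([[0.98, 0.01, 0.01],   # likely a training member
                  [0.40, 0.35, 0.25]])  # likely unseen
print(membership_inference(probs))      # [ True False]
```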

VOiCES

In Progress

Open Source Speech Dataset

Transcription, Denoising, Speaker Separation, Speaker Identification

Download VOiCES Dataset

SRI International and In-Q-Tel's Lab41 are proud to release the Voices Obscured in Complex Environmental Settings (VOiCES) corpus, a collaborative effort that brings speech data recorded in acoustically challenging, reverberant environments to researchers. Clean speech was recorded in rooms of different sizes, each with a distinct acoustic profile, while background noise played concurrently. These recordings provide audio data that better represents real-use scenarios. The intended purpose of this corpus is to promote acoustic research.
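
The corpus ships as ordinary audio files, so a first look at a downloaded recording can be as simple as the snippet below (the file path is a placeholder, not an actual corpus path):

```python
import soundfile as sf  # pip install soundfile

# Placeholder path; substitute a real file from the downloaded corpus.
audio, sample_rate = sf.read("VOiCES/some_room/recording.wav")
duration = len(audio) / sample_rate
print(f"{duration:.1f} s at {sample_rate} Hz, "
      f"peak amplitude {abs(audio).max():.3f}")
```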

Magnolia

Complete

Speech Isolation using Deep Learning

Speech, Audio, Deep learning, Tensorflow, Neural Networks

Launch GitHub

At cocktail parties, it is often difficult to make out what someone is saying because several people are talking at once. Humans do a decent job of understanding anyway, in part because we have two ears that can determine the direction of a speaker. The same idea can be applied to microphones: many microphones can resolve many speakers and isolate their speech signals. Current technologies use expensive microphone arrays, are limited in the environments they can operate in, or can isolate only a limited number of speakers. Magnolia proposes to use deep learning to break these constraints and isolate speech with COTS microphones in a variety of environmental conditions.
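
Magnolia's models live in the repository; many deep speech-separation systems share the same backbone, estimating a time-frequency mask per speaker and applying it to the mixture's spectrogram. A minimal sketch of that masking step (random data and a random mask stand in for a real recording and a network's prediction):

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
fs = 16000
mixture = rng.normal(size=fs * 2)   # 2 s stand-in for a recorded mixture

# Transform to the time-frequency domain, where speakers overlap less.
f, t, spec = stft(mixture, fs=fs, nperseg=512)

# A separation network would predict one soft mask per speaker; here a
# random mask stands in for that prediction.
mask = rng.uniform(size=spec.shape)

# Apply the mask and invert back to a waveform for one speaker.
_, speaker_estimate = istft(spec * mask, fs=fs, nperseg=512)
print(speaker_estimate.shape)
```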

Pelops

Complete

Car Recognition using Deep Learning

Python, TensorFlow, Keras, Docker

Launch GitHub

Cars are ubiquitous in urban life. They are uniquely identifiable via their license plates, but unfortunately license plates are only visible from certain angles, and even then they are hard to read at a distance. Pelops will use deep learning based methods to automatically identify cars by their large-scale features: color, shape, light configuration, and so on. Pelops will also attempt to re-identify specific cars that are seen multiple times, allowing automatic pattern-of-life discovery.
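
Re-identification systems of this kind typically map each sighting to an embedding vector and declare a match when two embeddings are close. A minimal sketch with an off-the-shelf backbone (Pelops trained task-specific models; the file names and threshold here are hypothetical):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# An off-the-shelf CNN as a feature extractor; only a stand-in for the
# vehicle-specific models Pelops trained.
backbone = models.resnet18(weights="DEFAULT")
backbone.fc = torch.nn.Identity()          # keep the 512-d embedding
backbone.eval()

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

def embed(path):
    """Map a car image (the file paths below are hypothetical) to a vector."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(x).squeeze(0)

# Two sightings are declared the same car when their embeddings are close.
a, b = embed("sighting_1.jpg"), embed("sighting_2.jpg")
similarity = torch.nn.functional.cosine_similarity(a, b, dim=0)
print(f"same car? {similarity.item() > 0.9}")  # threshold is illustrative
```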

Altair

Complete

Recommending Code to Coders

Jupyter Notebooks, Docker, Spark, Mesos, Python

Poseidon

Complete

Software Defined Network Situational Awareness

Launch GitHub

This challenge is a joint effort between two IQT Labs: Lab41 and Cyber Reboot. Current software-defined networking offerings lack a tangible security emphasis, much less methods to enhance operational security. Without situational awareness and context, defending a network remains a difficult proposition. This challenge will use SDN and machine learning to determine what is on the network and what it is doing, helping sponsors leverage SDN to provide situational awareness and better defend their networks.
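
The "what is on the network" half is essentially a classification problem over observed traffic. A minimal sketch with scikit-learn (the flow features, labels, and numbers are invented for illustration; Poseidon's actual feature set differs):

```python
# pip install scikit-learn
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-host flow features: [mean packet size, flows/minute,
# fraction of TCP traffic].
X_train = np.array([[900, 4.0, 0.95],    # server-like hosts
                    [850, 5.0, 0.90],
                    [200, 0.5, 0.40],    # printer-like hosts
                    [180, 0.3, 0.35]])
y_train = ["server", "server", "printer", "printer"]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Classify a newly observed host from its traffic statistics.
print(clf.predict([[880, 4.5, 0.93]]))   # -> ['server']
```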

Attalos

Complete

Multimodal Joint Vector Representations

Launch GitHub

Current machine learning methods are focused on classifying items into one of many classes. These techniques are often trained on one type of data (e.g., images) but ignore other information in the dataset (e.g., tags and metadata). The Attalos challenge is focused on building representations of images, text, and social networks that leverage all of this information together. Doing so will enable training classifiers that work across a variety of datasets.
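
One simple way to build such a joint representation is to learn a linear map from image features into a pre-trained word-vector space, so images and their tags become directly comparable. A minimal sketch (dimensions and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: CNN image features (2048-d) and word vectors (300-d) for
# each image's tags; real systems would use learned features and
# pre-trained embeddings.
image_feats = rng.normal(size=(500, 2048))
tag_vectors = rng.normal(size=(500, 300))

# Learn a least-squares linear map from image space into word-vector space.
W, *_ = np.linalg.lstsq(image_feats, tag_vectors, rcond=None)

# A new image is tagged by projecting it and finding nearby word vectors.
projected = image_feats[0] @ W   # now lives in the 300-d text space
print(projected.shape)           # (300,)
```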

Gestalt

Complete

Visual Data Storytelling

Vega, Lyra, Cognitive and Perceptual Principles, Human Centered Design, Front-End Technologies

Launch GitHub

This challenge will employ various data visualization tools and user experience frameworks to construct cohesive data stories focused on communicating ripple-effect scenarios. Several event-based datasets from different disciplines will serve as the basis for data story development. Lab41 will develop an optimal front-end visualization development stack in which user experience is a driver. Our ultimate goal is to create a roadmap for how to approach visual data stories, from technical considerations to user engagement.

Pythia

Complete

Natural Language Processing & Text Classification

Python, TensorFlow, Docker, Neon

Launch GitHub

Pythia discovers new and useful information in large, dynamic document stores. It constructs systems to measure and locate new information in a document as it is ingested into a corpus, and explores predictive analytics that make existing structured metadata more informative by modeling document content.
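
At its core, novelty detection asks how similar an incoming document is to everything already in the corpus. A minimal sketch with TF-IDF vectors (the documents and threshold are toy examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the satellite launch was delayed by weather",
          "weather delayed the rocket launch again"]
incoming = "a new deep learning model for translation was released"

# Vectorize the existing corpus and the incoming document together.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus + [incoming])

# Novelty = low similarity to everything already in the corpus.
similarity_to_corpus = cosine_similarity(matrix[-1], matrix[:-1]).max()
print(f"novel: {similarity_to_corpus < 0.2}")  # threshold is illustrative
```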

MagicHour

Complete

Scalable Security Log File Ingest and Analysis

Jupyter, Spark, Mesos, Python

Launch GitHub

The challenge will evaluate text clustering machine learning algorithms and graph modeling for scalable system log ingest and analysis. Lab41 will create a solution that can automatically identify and parse multiple log file formats, obviating the need to write a specialized parser for each new type. Our ultimate goal is to transform disparate text-based event log content into a graph model for advanced analytics and reduced storage requirements.
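
A common first step in automatic log parsing is to mask the variable fields in each line so that lines differing only in their parameters collapse into shared templates. A minimal sketch (the regexes and log lines are illustrative):

```python
import re
from collections import Counter

def template(line):
    """Collapse a log line to its template by masking variable fields."""
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", line)
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\b\d+\b", "<NUM>", line)
    return line

logs = ["Accepted password for alice from 10.0.0.5 port 52311",
        "Accepted password for bob from 10.0.0.9 port 40122",
        "Connection closed by 10.0.0.5"]

# The first two lines now share one template despite different parameters.
print(Counter(template(l) for l in logs).most_common())
```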

D*Script

Complete

Handwriting Authorship Recognition

TensorFlow, Neon, Torch, Caffe, Theano

Launch GitHub

This challenge will evaluate the potential of neural networks to recognize authorship across a variety of unstructured handwriting images. The data will include different document types, paper textures, pens, and each writer's range of variation. We will implement state-of-the-art computer vision techniques that have shown promise in visual attention (which features to pay attention to) and sequential modeling (the order in which writers make pen strokes).
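
The sequential-modeling half could be prototyped as a recurrent network over pen-stroke sequences; a minimal sketch in PyTorch (the architecture, feature encoding, and sizes are illustrative, not the project's models):

```python
import torch
import torch.nn as nn

class StrokeWriterID(nn.Module):
    """Classify the writer of a handwriting sample from its pen strokes.

    Each time step is a (dx, dy, pen_down) triple; an LSTM summarizes the
    stroke order that the project highlights as a writer-specific signal.
    """
    def __init__(self, n_writers, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_writers)

    def forward(self, strokes):            # (batch, time, 3)
        _, (h, _) = self.lstm(strokes)
        return self.head(h[-1])            # writer logits

model = StrokeWriterID(n_writers=10)
logits = model(torch.randn(4, 120, 3))     # 4 samples, 120 stroke points each
print(logits.shape)                        # torch.Size([4, 10])
```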

Hermes

Complete

Recommender System Analysis

Jupyter, Spark, Mesos, Python

Launch GitHub

Hermes will compare the results of multiple recommender systems on a variety of datasets. These include conventional datasets traditionally used for recommender systems: movies, books, and news. However, the challenge will also explore programmatic datasets from GitHub and data from internal sources. Each dataset will then be subjected to a variety of recommender systems so that we can compare and contrast a wide variety of performance metrics.
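
Comparing recommenders comes down to scoring each system's ranked output against held-out user behavior. A minimal sketch of one common metric, precision@k (the items and recommendations are toy data):

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommendations the user actually liked."""
    return len(set(recommended[:k]) & set(relevant)) / k

# Held-out items one user liked, plus two recommenders' ranked outputs.
relevant = {"item_2", "item_7", "item_9"}
popularity_recs = ["item_1", "item_2", "item_3", "item_4", "item_5"]
cf_recs = ["item_2", "item_9", "item_7", "item_1", "item_4"]

for name, recs in [("popularity", popularity_recs),
                   ("collaborative", cf_recs)]:
    print(name, precision_at_k(recs, relevant))
```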

Sunny-Side-Up

Complete

Deep Learning Sentiment Analysis

Torch, Caffe, Theano, Pylearn2, Neon, Lua, Python, Docker, Spark, GPUs

Launch GitHub

This challenge will evaluate the feasibility of using architectures such as Convolutional and Recurrent Neural Networks to classify the positive, negative, or neutral sentiment of Twitter messages towards a specific topic. The ultimate goal is to help government sponsors better characterize opinions expressed towards topics and events of national security importance.
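
A small convolutional classifier of the kind this challenge evaluated can be expressed in a few lines of TensorFlow; the vocabulary size, sequence length, and layer sizes below are illustrative:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),    # 100 token ids per tweet
    tf.keras.layers.Embedding(input_dim=20000, output_dim=64),
    tf.keras.layers.Conv1D(128, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(3, activation="softmax"),  # pos / neg / neutral
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(token_ids, labels, ...) would then train on tokenized tweets.
```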

Soft Boiled

Complete

Geo-Inference of Social Media Data

IPython, Spark, Mesos, Python

Launch GitHub

This challenge employed various geospatial inference methods to determine the locations of Twitter users. Lab41 created and evaluated novel network-based and content-based approaches to inferring where users are located when they post a tweet.
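
The network-based family of approaches rests on a simple observation: most users live near their friends, so an unknown user's location can be estimated from friends whose locations are known. A minimal sketch (a coordinate-wise median stands in for the more robust geometric median used in published variants):

```python
import numpy as np

def infer_location(friend_coords):
    """Network-based guess at a user's location from friends' coordinates.

    friend_coords: array of (latitude, longitude) pairs for friends with
    known locations. The median keeps one far-away friend from dragging
    the estimate off course.
    """
    return np.median(friend_coords, axis=0)

friends = np.array([[38.90, -77.04],    # D.C.
                    [38.88, -77.10],
                    [34.05, -118.24]])  # one outlying friend in L.A.
print(infer_location(friends))          # stays near D.C.
```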

Interested in participating?

Join us on any of our In Progress projects, or our next challenge!

Work with us

Have any interesting techniques or challenges we should consider?

We’d love to hear from you.

Talk with us

Next: Process

Get more insight into how we work, and where you fit in.

Let's Go