We focus on solving big data problems that impact the Intelligence Community and national security.
All of our completed work is available on GitHub.
Determine the optimal composition of your data.
Data creation, data curation, data augmentation, data shapley, active learning, synthetic data
As we expand into new machine learning domains, data is becoming more expensive and difficult to label. One possible solution is to build cheaper analogous datasets, for example training on data from Los Angeles and deploying in San Francisco, or training on synthetic data and testing in the real world. But this violates one of the core assumptions of machine learning: that the training and test data come from the same distribution. VennData is working to resolve this discrepancy by building metrics and a pipeline that analyze when and how the training dataset deviates from the testing dataset.
Understanding the fingerprints left by deep learning generative models on synthetic data.
Synthetic data, Generative models
Deep learning models show impressive results when creating synthetic examples of the data they’ve learned. But how is that data represented in the models, and what fingerprints are left in synthetically generated data? The GANomede project revolves around understanding the capabilities and limitations of deep learning generative models, their capacity for data representation, and their utility for detecting synthetically generated data.
Machine Translation Quality Estimation
Natural language processing, machine translation, model evaluation
Is your translation accurate? The VeriCAT project quantifies trust in machine translation to empower an end user to use translations effectively. This project includes a novel translation quality estimation model, dataset, and user interface.
Security & Machine Learning
Machine Learning Model Pipeline Security
Is your training data safe? Cyphercat demonstrates security vulnerabilities in machine learning pipelines and training data by attacking various model architectures. Attacks include model inversion (reconstructing an image from the training data) and membership inference (determining whether a specific piece of data is contained within the training data).
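To illustrate the intuition behind membership inference (a toy sketch with made-up confidence scores, not the Cyphercat code): overfit models tend to be more confident on records they were trained on, so a naive attack simply thresholds that confidence. Real attacks train shadow models to learn this decision boundary rather than fixing it by hand.

```python
def membership_inference(confidences, threshold=0.9):
    """Toy threshold attack: flag a record as a training-set member
    when the model's confidence on it exceeds the threshold."""
    return [conf > threshold for conf in confidences]

# Hypothetical model confidences: members tend to score higher (overfitting).
member_scores     = [0.99, 0.97, 0.95, 0.88]
non_member_scores = [0.70, 0.85, 0.92, 0.60]
guesses = membership_inference(member_scores + non_member_scores)
```

Even this crude attack recovers most members here, which is why defenses focus on reducing the confidence gap between seen and unseen data.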
Open Source Speech Dataset
Transcription, Denoising, Speaker Separation, Speaker Identification
SRI International and Lab41, In-Q-Tel, are proud to release the Voices Obscured in Complex Environmental Settings (VOiCES) corpus, a collaborative effort that brings speech data recorded in acoustically challenging, reverberant environments to the researcher. Clean speech was recorded in rooms of different sizes, each with a distinct room acoustic profile, while background noise played concurrently. These recordings provide audio data that better represents real-use scenarios. The intended purpose of this corpus is to promote acoustic research.
Speech Isolation using Deep Learning
Speech, Audio, Deep learning, Tensorflow, Neural Networks
At cocktail parties, it is often difficult to make out what someone is saying because several people are talking at once. Humans do a decent job of understanding anyway, in part because we have two ears that can determine the direction of a speaker. The same idea can be applied to microphones: using many microphones to resolve many speakers and isolate their speech signals. Current technologies use expensive microphone arrays, are limited in the environments they can operate in, or can isolate only a limited number of speakers. Magnolia proposes to use deep learning to break these constraints and isolate speech with COTS microphones in a variety of environmental conditions.
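One common formulation that deep separation models build on is the time-frequency mask. The toy sketch below (with made-up 2×3 magnitude spectrograms, not Magnolia's actual pipeline) shows an oracle binary mask that keeps only the bins where the target speaker dominates the interference; in practice a network learns to predict such masks from the mixture alone.

```python
def ideal_binary_mask(source_mag, interferer_mag):
    """Oracle time-frequency mask: 1 where the target source dominates
    the interferer, 0 elsewhere."""
    return [[1 if s >= n else 0 for s, n in zip(srow, nrow)]
            for srow, nrow in zip(source_mag, interferer_mag)]

def apply_mask(mix_mag, mask):
    """Zero out the mixture bins assigned to the interferer."""
    return [[m * b for m, b in zip(mrow, brow)]
            for mrow, brow in zip(mix_mag, mask)]

# Toy 2x3 magnitude spectrograms (rows = frequency bins, cols = frames).
speech = [[5, 1, 4], [0, 3, 2]]
noise  = [[1, 2, 1], [4, 1, 3]]
mix    = [[s + n for s, n in zip(sr, nr)] for sr, nr in zip(speech, noise)]
mask     = ideal_binary_mask(speech, noise)
estimate = apply_mask(mix, mask)
```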
Car Recognition using Deep Learning
Python, TensorFlow, Keras, Docker
Cars are ubiquitous in urban life. They are uniquely identifiable via their license plates, but unfortunately license plates are only visible from certain angles, and even then they are hard to read at a distance. Pelops will use deep-learning-based methods to automatically identify cars by their large-scale features—color, shape, light configuration, etc. Pelops will also attempt to re-identify specific cars if they are seen multiple times, allowing automatic pattern-of-life discovery.
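Re-identification systems of this kind typically compare learned feature embeddings. The sketch below (hypothetical embeddings and vehicle IDs, not the Pelops model) matches a new sighting to the closest gallery vehicle by cosine similarity:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def reidentify(query, gallery):
    """Return the gallery id whose embedding is most similar to the query."""
    return max(gallery, key=lambda cid: cosine(query, gallery[cid]))

# Hypothetical embeddings a CNN might produce from color/shape features.
gallery = {
    "car_A": [0.9, 0.1, 0.3],
    "car_B": [0.2, 0.8, 0.5],
}
sighting = [0.85, 0.15, 0.35]  # new sighting of one of the known cars
match = reidentify(sighting, gallery)
```

Repeated matches of the same vehicle over time are what enable the pattern-of-life discovery described above.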
Recommending Code to Coders
Jupyter Notebooks, Docker, Spark, Mesos, Python
Software Defined Network Situational Awareness
This is a joint challenge between two IQT Labs: Lab41 and Cyber Reboot. Current software defined network offerings lack a tangible security emphasis, much less methods to enhance operational security. Without situational awareness and context, defending a network remains a difficult proposition. This challenge will use SDN and machine learning to determine what is on the network and what it is doing, helping sponsors leverage SDN to provide situational awareness and better defend their networks.
Multimodal Joint Vector Representations
Current machine learning methods focus on classifying items into one of many classes. These techniques are often trained on one type of data (e.g., images) but ignore other information in the dataset (e.g., tags, metadata, etc.). The Attalos Challenge is focused on building joint representations of images, text, and social networks, leveraging all of the information together. Doing so will enable training classifiers that work across a variety of datasets.
Visual Data Story Telling
Vega, Lyra, Cognitive and Perceptual Principles, Human Centered Design, Front-End Technologies
This challenge will employ various data visualization tools and user experience frameworks to construct cohesive data stories focused on communicating ripple-effect scenarios. Several event-based datasets from different disciplines will serve as the basis for data story development. Lab41 will develop an optimal front-end visualization development stack in which user experience is a driver. Our ultimate goal is to create a roadmap for how to approach visual data stories from technical considerations to user engagement.
Natural Language Processing & Text Classification
Python, TensorFlow, Docker, Neon
Pythia discovers new and useful information in large, dynamic document stores. It builds systems that measure and locate novel information in a document as it is ingested into a corpus, and explores predictive analytics that model document content to make existing structured metadata more informative.
Scalable Security Log File Ingest and Analysis
Jupyter, Spark, Mesos, Python
This challenge will evaluate text-clustering machine learning algorithms and graph modeling for scalable system log ingest and analysis. Lab41 will create a solution that can automatically identify and parse multiple log file formats, obviating the need to write a specialized parser for each new type. Our ultimate goal is to transform disparate text-based event log content into a graph model for advanced analytics and reduced storage requirements.
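A minimal sketch of format-agnostic log parsing (illustrative only, with made-up log lines; not the project's actual approach) masks out variable fields so that structurally identical lines collapse to a shared template, which can then serve as a graph node type:

```python
import re
from collections import defaultdict

def template(line):
    """Collapse variable fields (IPs, then any remaining numbers) so
    structurally identical log lines map to the same template string."""
    line = re.sub(r"\d+\.\d+\.\d+\.\d+", "<IP>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

def group_logs(lines):
    """Group raw log lines by their extracted template."""
    groups = defaultdict(list)
    for line in lines:
        groups[template(line)].append(line)
    return dict(groups)

logs = [
    "Accepted password for alice from 10.0.0.5 port 5022",
    "Accepted password for alice from 10.0.0.9 port 6144",
    "Failed password for root from 10.0.0.7 port 2201",
]
groups = group_logs(logs)
```

Storing one template plus its variable fields, instead of every raw line, is one way such a scheme reduces storage while preserving structure for analytics.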
TensorFlow, Neon, Torch, Caffe, Theano
This challenge will evaluate the potential of neural networks to recognize authorship across a variety of unstructured handwriting images. The data will include different document types, paper textures, pens, and ranges of writer variation. We will implement state-of-the-art computer vision techniques that have shown promise in visual attention (which features to pay attention to) and sequential modeling (the order in which writers produce pen strokes).
Recommender System Analysis
Jupyter, Spark, Mesos, Python
Hermes will compare the results of multiple recommender systems on a variety of datasets. These datasets include common, conventional datasets that are traditionally used in recommender systems: movies, books, and news. However, the challenge will also explore programmatic datasets from GitHub, as well as data from internal sources. Each of these datasets will then be subjected to a variety of recommender systems, across which we can compare and contrast a wide variety of performance metrics.
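One metric commonly used in such comparisons is precision@k. The sketch below (with made-up recommendation lists and item IDs, not Hermes code) scores two hypothetical recommenders against a user's known-relevant items:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations the user actually found relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

# Hypothetical top-5 lists from two recommenders for the same user.
relevant = {"m1", "m4", "m7"}          # items the user actually liked
recs_a = ["m1", "m2", "m4", "m9", "m7"]  # e.g., a collaborative-filtering model
recs_b = ["m3", "m5", "m1", "m8", "m6"]  # e.g., a content-based model

score_a = precision_at_k(recs_a, relevant, 5)
score_b = precision_at_k(recs_b, relevant, 5)
```

Running the same metric over every dataset/recommender pair is what makes the head-to-head comparison described above possible.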
Deep Learning Sentiment Analysis
Torch, Caffe, Theano, Pylearn2, Neon, Lua, Python, Docker, Spark, GPUs
This challenge will evaluate the feasibility of using architectures such as Convolutional and Recurrent Neural Networks to classify the positive, negative, or neutral sentiment of Twitter messages towards a specific topic. The ultimate goal is to help government sponsors better characterize opinions expressed towards topics and events of national security importance.
Geo-Inference of Social Media Data
IPython, Spark, Mesos, Python
This challenge employed various geospatial inference methods to determine the location of Twitter users. Lab41 created and evaluated novel network-based and content-based approaches to inferring where users were based when posting a tweet.
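A minimal sketch of the network-based idea (illustrative, with hypothetical user IDs; not Lab41's actual method) labels a user with the most common known location among the accounts they interact with. Label-propagation variants iterate this vote until the network converges.

```python
from collections import Counter

def infer_location(user, friends, known_locations):
    """Network-based geo-inference: majority vote over the known
    locations of a user's connections; None if no neighbor is labeled."""
    votes = Counter(known_locations[f]
                    for f in friends.get(user, [])
                    if f in known_locations)
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical interaction graph: u0's location is unknown.
friends = {"u0": ["u1", "u2", "u3", "u4"]}
known_locations = {"u1": "SF", "u2": "SF", "u3": "NYC"}  # u4 unlabeled
guess = infer_location("u0", friends, known_locations)
```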
Interested in participating?
Join us on any of our in-progress projects, or our next challenge!
Have any interesting techniques or challenges we should consider?
We’d love to hear from you.
Get more insight into how we work, and where you fit in.