HarvardX DH101


Two years ago, I took the HarvardX Digital Humanities course. Here are some of my notes on it, and from it.

It was a good reminder of how much more I know now, and how much I have grown as a Digital Humanities practitioner. This was, in many ways, my first foray into this world, and even now I find the field as enthralling as ever.

In my two years of self-isolated work from home, far, far away from mainstream western academia, I had a single-minded drive to be present among humanities academics. As a tech student, I figured I could fulfill that through my fascination with Digital Humanities. My entire aesthetic when I was younger (Tumblr 2014: earthy dark blue librarian plaid outfits, glassy eyes, a brown leather satchel, cobblestones and brown-bricked buildings amidst shrubbery and forests) fed into a research interest that enmeshes all forms of the pursuit of knowledge, and that culminated in the Digital Humanities.

Despite the reduced fervor surrounding MOOCs (Massive Open Online Courses) over the years, I loved auditing courses like The University of Edinburgh’s Introduction to Philosophy, writing exercise courses, Historical Fiction: Plagues, Witches, and Wars, etc. With all their faults, they did draw attention to online classes, which set a precedent for today’s universities’ COVID-19 response. They also paved the way to access for people interested in a wide variety of subject areas but with no structure to hitch the knowledge train to. I was extremely excited to see an online course on Digital Humanities, and by Harvard University, too! The primary banner for the course was bytes eating up a manuscript, or a manuscript converting to its digital form, or even a cellular automaton model of bookish digitization, depending on whom you asked. (The last one would have to be one of those social science professors.)


  1. Yes, the course was a valuable addition to my skill-set and critical thinking capabilities in understanding Digital Humanities better.
  2. No, it does not cover the depth or breadth of Digital Humanities as a subject.
  3. Yes, it is a great jumping off point.
  4. Yes, it does add value for technical students, if only to catch a glimpse of how humanities scholars consider digital humanities themselves.
  5. Digital humanities is still very mainstream USA-Europe-centric.


I was curious about how this topic would be covered in an online undergraduate-level course. While there are many syllabi online, from Miriam Posner’s UCLA DH101 series to Lincoln Mullen’s Computational History CLIO-2 series, here was the content of the course itself, apart from the reading material. Programming Historian is another example of guided learning, but it is a learning resource, not a structured course. My goals for this HarvardX DH101 course were to go in open-minded, carefully critical, and with boundless enthusiasm for learning more about this field that has fascinated me so much and that seems to have been made just for me. The course stated the following aims:

  • To learn various digital tools and techniques
  • New claims with different perspectives on data
  • New research and teaching methodology
  • Technical grasp of 
    • File types 
    • Create, gather, organize data
    • Text analysis with command-line functions (wget)
    • Visual text Analysis
    • Optical Character Recognition (OCR)
    • Text Encoding Initiative (TEI) 

MOOC Target Audience

  1. Humanists
  2. Social Scientists
  3. Digital Scholars
  4. Librarians
  5. Archivists
  6. Museum Curators
  7. Public Historians

I won’t begrudge them for not including data scientists or computer science students in this list. The course only brushes on data-related projects. A course for Digital Humanists coming into the field from a tech perspective would, now that I think of it, be a different beast altogether. Developing critical thinking, ethical inquiry, researching the topics of interest even superficially, and understanding the humanities thought process and epistemology would perhaps matter more in such a course. So why am I, a Computational Sciences graduate student, taking this course? I believe it still lets me be a part of this world. Do I expect even a certificate to be a bona fide attestation of my status as a Digital Humanist? No. I do consider it a part of my self-training, and here is a structured resource on the subject. It also gave me a way to learn more about the people and projects that are included here.


  • Bill Barthelmy, Senior Technical Architect, Academic Technology for the Faculty of Arts and Sciences
  • Francesca Bewer, Research Curator, Conservation and Technical Study Programs and Director of Summer Institute for Technical Studies in Art, Harvard Art Museums
  • Cole Crawford, Humanities Research Computing Specialist
  • Jeff Steward, Director of Digital Infrastructure and Emerging Technology, Harvard Art Museums
  • Peri Green, Graduate Student, Technology in Education
  • Kelly O’Neill
  • Suzanne P. Blier
  • Peter Bol
  • Racha Kirakosian
  • Derek Miller
  • Vincent Brown
  • Laura Woods
  • Stephen Osadetz

MOOC Outline

The course has quite a learning curve: we start with definitions of Digital Humanities and wade into the command line as well as dedicated software by the end of it. (I still think software is easier to play around with than the command line, and yes, that is a Computer Science and Engineering graduate student talking.) I appreciate that it tries to instill critical inquiry in the process of imbibing the values of a Digital Humanist. That was vastly lacking in my undergraduate education, perhaps as a result of STEM studies in India (but that’s a whole other story). An entire weekly section is devoted to data studies, which I believe is imperative. I enjoyed the incorporation of the Word Cloud at the beginning, a very visual Digital Humanities method for measuring the frequency of words in surveys and text corpora.

Computational Methods and the Humanities

How would you describe computational methods applied to humanities research?

Can you imagine applying computational methods to your own work in the humanities? How do Jeffrey Schnapp’s comments change or challenge your thinking about Digital Humanities? All of these ideas will help you get a sense of each other’s perspective here at the start of the course.

The computational methods used for a certain project reflect the type of humanities study taking place. Digital Humanities gives attention not only to detail, the anomalies and exceptions, but also to the larger patterns. I am interested in the spatial, visual, and network analysis of data. Specific computational methods like data visualization using Tableau, Gephi, Storymap.js, or even D3.js would each serve a different purpose from archival collection methods like Omeka, Jupyter, or databases. No two methods make the same assumptions, use the same tools, or even ask the same questions. As a CS student, it didn’t occur to me that humanities studies usually focus on anomalies and exceptions until Dr. Schnapp mentioned it. Humanities scholars tend to study the why, how, and what of a drastic change rather than what is merely prevalent. That means going deeper into a research question’s epistemology than most technical studies, which are somewhat more practical. I wrote about these fundamental differences between the two fields in my first blog post two whole years ago: Origins of Digital Humanities in the Western World. So I am glad I managed to make some of the points mentioned in the course even as I began my foray into this field.

“What does it mean to have a (CRT, LCD) monitor say ‘Hello, World’ to us?”

Dr. Jeffrey Schnapp

I love that line of thinking more than actually ever having had to printf that statement. (Or its variations.)

According to the MOOC, Digital Humanities projects can be categorized in two ways:

  1. Infrastructure Building
  2. Creative, Expressive, Critical, and Experimental Practice

In my blog post on The Role of Digital Humanities in the Western Museums, I categorized data science work with archives as:

  1. Data Collection and Archival
  2. Analysis of Collections





In what you’ve seen so far, how do these examples fit with your own work and your own professional interests? What opportunities can you identify that you might like to explore further or learn more about? Feel free to write about any of the DH projects that you explored online, or any ideas that you heard the speakers discuss in the video.

In my post My Journey to Data Visualizations, I used datasets on Broadway plays and musicals to understand and analyze the trends in average audience attendance along with playing weeks and overall gross value, so Dr. Derek Miller’s work Visualizing Broadway is extremely fascinating to me.
I followed Dr. Peter Bol’s advice on talking to people to find out how a DH project works. Many fields within the humanities seem exciting, and I want to develop the technical approach to DH. In my post on DH in art and museums, The Role of Digital Humanities in the Western Museums, I studied the way scholars approached the technical side of their studies. This was mainly for me to get a better grasp of the academic side of the research. Even when I applied to universities for an MS in CS, I made sure that there was a Digital Scholarship lab with multiple Digital Humanities research projects taking place. This was in an effort to study DH without compromising my technical studies.

Library in a Digital Age

Role of Librarians in the Digital Age

  • Acquisition, Curation, Stewardship of content, including metadata
  • Map Librarians – GIS in various disciplines
  • Multimedia creation – Communication Technology
  • Visualization techniques
  • Digital Preservation for long term storage

Digital Scholarship in libraries (I am the GRA at my university’s DiSc!) is a culmination of accessing digital methodology and skills using larger digital corpora.

Data services

Capturing, acquiring, gathering, and identifying the required data, and directing researchers to resources based on their data literacy levels, is a core function of Digital Scholarship centers.

Born Digital Content

Digitization priority depends on which kinds of data researchers would benefit most from having digitized.

Acquiring and Managing Digital Resources

  • Bundle resources licenses from vendors
  • Digital versions of books
  • Text Collections as Data
  • For text mining efforts
  • Catalog with descriptive cues (tags) for Metadata and Search-ability
Collections by a vendor | Collections from a Library
Permission Renewal to Data Mine & Analyze | Copyright Status
Permission to Digitize | Advice for future terms
Partnership with vendors | Google Books

Library Collections and the Research Journey

Literature Search Engine

  1. Recognizing the intellectual contribution is the fundamental principle of library collections.
  2. Split up entire corpora into blocks of 1000 words
  3. Researchers know what they are looking for, and can look for particular language
  4. Evaluate for the most common words in contextual meaning
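
Steps 2 and 4 above can be sketched in a few lines of Python. This is only a minimal sketch, not the search engine shown in the course: the function names `split_into_blocks` and `most_common_words`, the simple regex tokenizer, and the tiny repeated sample text are all my own assumptions for illustration.

```python
import re
from collections import Counter


def split_into_blocks(text, block_size=1000):
    """Split a text into consecutive blocks of at most block_size words."""
    # Assumed tokenizer: runs of word characters, allowing inner apostrophes.
    words = re.findall(r"\w+(?:'\w+)*", text.lower())
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]


def most_common_words(block, n=3):
    """Return the n most frequent words in one block with their counts."""
    return Counter(block).most_common(n)


# Hypothetical sample text; a real corpus would be read from files.
sample = "the whale and the sea and the ship " * 500
blocks = split_into_blocks(sample, block_size=1000)
for number, block in enumerate(blocks, start=1):
    print(number, len(block), most_common_words(block))
```

A real workflow would usually drop very common function words ("the", "and") before counting, since they dominate every block regardless of its subject.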

Concept Searching: Machine Learning in the Humanities

Supervised machine learning allows scholars to keep tinkering with their requirements. For a long time in the humanities, when there was no tool for your research, you built the tool yourself. This gave rise to many alt-ac (alternative academic) positions in universities, each with its own specific considerations.

Re-imagining the Research Process

Librarians do a lot more than stacking books. This led to the epistemological intellectual question in critical pedagogy and library sciences: Should a scholar not be working on scholarship alone?

Metadata as a Research Tool

Data about the data enriches the usefulness of information. Metadata search is built on top of a regular search engine. Data from Wikidata, including images, can be used to learn more about this data about data. The Dublin Core Metadata Element Set is a standardized, formalized set of rules for recording metadata.

Art Museum Collections

Martha Tedeschi and Francesca Bewer from Harvard Art Museums talk of creating new meaning with works of art and providing new connections between audience and artworks. As a National Gallery of Art intern, going back to review these notes further illuminated the possibilities of Digital Humanities work in museum spaces.

Van Gogh’s Three Pairs of Shoes

The course uses the painting as a reference for extracting data from an analog object, to look beneath a painting. The goal of a digital tool in museum spaces is collaboration in scholarship amongst artists, conservators, technical subject matter experts, educators, and researchers: to look more closely and find what lies beyond a painting. For example, digitally captured images are produced via X-ray fluorescence analysis and digital radiography.

Benefits of Technology in Museum Scholarship

How does the technical affect ongoing scholarly research? The museum becomes more audience-oriented: viewers discover the secrets of the art like a detective and, as a corollary, share perspectives with other disciplines.

  • Digital Access and Museum Curatorial Practice
    • For convenience’s sake, the technical art history is preserved.
  • New Approaches to Museum Scholarship
    • Readily available for crowdsourcing information, as is the case with Wikimedia.

Harvard Art Museum and Open Access

The Institute of Museum and Library Services (IMLS) funds efforts like the Harvard API, which offers broad, clean datasets, including coin collections, with collaborators from across the world.

Recall the way in which Jeffrey Schnapp described two categories of digital humanities projects: those that identify and study large scale trends and those that focus on smaller, specific examples or exceptions. How does his perspective relate to Martha Tedeschi’s study of James McNeill Whistler or Francesca Bewer’s study of Van Gogh’s painting Three Pairs of Shoes?

Digital Humanities gives attention not only to detail, the anomalies and exceptions, but also to the larger patterns. At the individual painting level, there are multiple paintings of various subjects to be uncovered. Through this emerges a larger pattern of the painter’s secrets. On being encouraged to look beyond the surface-level painting, the audience is connected to an artwork in new ways. DH, and digital tools in particular, facilitate this.

Another important feature of these tools is their increased accessibility and availability. It might seem like a small part of scholarship itself, but it is much simpler to study an entire corpus from one place than to travel to each physical location. Funding for travel grants can be used for more immediately academic pursuits in this way. Every drop of water together is an ocean.

What is Data?

Traditionally, information gathering in academia meant collecting, studying, interpreting, inferring, and then publishing results. Data is a construct where requirements and assumptions must be accounted for.

What is Digital Scholarship?


  • Thoughtful
  • Theoretically informed
  • Ethically informed
  • Collaborative


  1. Digital evidence
  2. Methods of inquiry
  3. Research
  4. Publication
  5. Preservation


  • Visual Humanities
  • Digital History
  • New Media
  • Computational methods in humanities research
In his essay, “How Not to Teach Digital Humanities,” Ryan Cordell writes, “. . . I have become increasingly convinced that DH will only be a revolutionary interdisciplinary movement if its various practitioners bring to it methods of distinct disciplines and take insights from it back to those disciplines.” (Debates in the Digital Humanities 2016) Based on what you already know about the humanities and any categories of computing or digital technology, can you identify some of the benefits that digital tools of analysis provide to humanities research? Based on what you already know, can you identify some of the drawbacks or risks that we should all keep in mind when considering how digital tools, methods, and sources shape our understanding of specific research questions?

Advantages of using digital tools for humanities research:

  • New perspective with distant reading techniques
  • Speed and Efficiency, depending on the tools and familiarity with them
  • Wider reach and Propagation

Possible disadvantages of using digital tools, methods, and sources for humanities research:

  • Inaccuracies in qualitative data will affect quantitative data.
  • Dependent on data availability, robustness, density, balance considerations.
  • Extra work and completely new research skill set for humanities scholars.

Critical Reflections on Digital Humanities

History of Digital Humanities

1956 – Humanities Informatics, Humanities Computing, Computing in Humanities, Computational Humanities, Contemporary DH
First-ever National Endowment for the Humanities project: Dartmouth Dante Project
Turning point:

  1. Internet
  2. Personal devices

Critical Code Studies

Interpretative practices in the humanities applied to software.

Study of the backend:

  • Structure
  • Effects
    • Effects to social realm
  • Interaction models
  • Ask questions on each aspect of the code
  • Field of humanistic study

Digital Humanities and Design

  1. Infrastructure building
    1. Archival building
    2. Sustaining traditional forms of scholarship
  2. Creative, Expressive, Critical, and Experimental Practice
    1. Genres of scholarship
    2. Cultural forms
    3. Models of communication
    4. Re-imagine boundaries

Contemporary DH:

  • Scale of Human Experience
  • Computational culture fields / Cultural analytics – Lev Manovich’s Selfie City geo-tagging of self representation and urban mapping
  • Reinventing storytelling modes of meaningful experiences (visualizations)
  • Immersive interactive museum – metaLab Lightbox

Computation is Not Value Neutral

1st Cybernetic Revolution Marshall McLuhan – Media is an extension of human capabilities.
metaLab’s Curricle (Curriculum and Horse carts) 

Now that you have learned more about some of the critical theory behind digital humanities work, we want to know how your thinking has changed? What topics or ideas surprised you the most? What might have been most relevant to your own research interests?
It constantly surprises me how varied this field can get. In its attempts to break boundaries, our imagination is the only limit. I was thinking about the many possibilities in DH as I listened to Dr. Schnapp talking about Curricle.

Technology and Ethics

  • Databases do not exist in a silo; they are technical, social, and cultural constructs.
  • Studying the design choices behind platforms, software, databases, and tools is integral to understanding how they are used.
  • Studying the complexities, consequences, and ethics of how these systems function matters for all of DH and beyond, extending to tech itself and the social consequences of algorithm design.


Digital_Humanities by Anne Burdick, Johanna Drucker, Peter Lunenfeld, Todd Presner, and Jeffrey Schnapp

Do you agree or disagree that in addition to writing, other teaching and research practices grounded in digital tools and formats should be considered part of the “vital engine” of education? Respond in the discussion forum.
While this probably places a higher amount of effort and labour on teachers and students, it is an inspired way of interacting further with the subject of study. When I work on a project that includes data from the subject of study, I go to the grassroots of the stated problem to think of a solution.

  • Level of engagement
  • Depth of understanding
  • Learning new methodologies that can be applicable elsewhere

Project – Visualizing Broadway

Derek Miller on the Creation and delivery of a DH project

  • Broadway Theater Data from 1915-2015 for different genres interactions analysis
  • 17th and 18th century Comedie-Francaise data analysis
  • 18th century NEH funded British Theatre NY Philharmonic data analysis
  • Theater history data analysis

There’s a story even in the process of cleaning data, or in what is missing from the datasets. Visualizing the opening and closing of Broadway shows gives a visceral sense of the number of attempts and failures in the industry.
You can learn more about Professor Miller’s project, Visualizing Broadway on this website.

To see another network analysis example, explore Debra Caplan’s website, Visualizing the Vilna Troupe showing connections among the Vilna Troupe members and friends.

To see another network analysis example, explore Six Degrees of Francis Bacon. 

To see another network analysis example, explore Kindred Britain.

Explore the Comedie-Francaise Registers Project here.

Tool for Network Analysis – Gephi & Cytoscape

To import and graph the networks, Dr. Derek Miller converted XML data into two CSV files with Python. Additional calculations for more connections are also possible.
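
A conversion of this kind might look like the sketch below. It is not Dr. Miller’s actual code or data: the XML layout (`<show>` elements containing `<person>` elements), the names, and the helper functions are hypothetical. The only real convention relied on is that Gephi’s spreadsheet import recognizes `Id`/`Label` columns for nodes and `Source`/`Target`/`Weight` columns for edges.

```python
import csv
import io
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical XML: each <show> lists the people who worked on it.
SAMPLE_XML = """
<shows>
  <show title="Show A">
    <person>Ada</person>
    <person>Ben</person>
    <person>Cy</person>
  </show>
  <show title="Show B">
    <person>Ada</person>
    <person>Ben</person>
  </show>
</shows>
"""


def xml_to_network(xml_text):
    """Return (nodes, edges); edges maps a sorted name pair to a weight."""
    root = ET.fromstring(xml_text)
    nodes = set()
    edges = {}
    for show in root.iter("show"):
        people = sorted({p.text.strip() for p in show.iter("person")})
        nodes.update(people)
        # Two people are connected once for every show they share.
        for pair in combinations(people, 2):
            edges[pair] = edges.get(pair, 0) + 1
    return sorted(nodes), edges


def to_csv(header, rows):
    """Serialize a header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


nodes, edges = xml_to_network(SAMPLE_XML)
nodes_csv = to_csv(["Id", "Label"], [(name, name) for name in nodes])
edges_csv = to_csv(
    ["Source", "Target", "Weight"],
    [(a, b, weight) for (a, b), weight in sorted(edges.items())],
)
print(nodes_csv)
print(edges_csv)
```

The sketch keeps the two CSVs as strings so it runs anywhere; to hand them to Gephi, each string would be written to its own file.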

Project –   Exploring Medieval Mary Magdalene

Prof. Racha Kirakosian (École nationale des chartes in Paris) and Eleanor Goerss: Translation and Text Encoding. The critical apparatus (Latin: apparatus criticus) is the critical and primary source material that accompanies an edition of a text. A critical apparatus is often a by-product of textual criticism.
When there are multiple manuscripts with no direct interdependency between them, putting them together is hard even with a critical apparatus.


  • Side by side image and text
  • Editing digitally with code
  • Prepare text for future data analysis
  • Levels of Recognition and Re-categorization of information


  • Transcription / Tagging mistakes fixes
  • Stability Maintenance
  • Availability
  • Computer Stick eye – Training computers to understand medieval script

Tool for Text Analysis – TEI and XML Editing

Text Encoding Initiative
Markup and display of text in various online formats

  • XML based meta language
    • 1987
    • DTD – Document Type Definition
    • Defines structural rules of XML doc
      • Elements
      • Tags
      • Attributes
  • Oxygen Editor 
  • Extensible Markup Language
    • Create
    • Store
    • Edit
    • Structure

Why encode?

Analyze and categorize text in XML without visual markers

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_lite.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_lite.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Title</title>
      </titleStmt>
      <publicationStmt>
        <p>Publication information</p>
      </publicationStmt>
      <sourceDesc>
        <p>Information about the source</p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <editorialDecl><ab>Text that does not change.</ab></editorialDecl>
    </encodingDesc>
    <!--Comment-->
  </teiHeader>
  <text>
    <body>
      <p>Some text here.</p>
    </body>
  </text>
</TEI>
Oxygen is by Jeff Witt at Loyola University. Mirador viewer is Harvard's manuscript viewer integrated with the text editor.

Imperiia: Mapping the Russian Empire

With GIS Kelly O’Neill

Spatial history of the Russian Empire: a new methodology to understand the Russian past.
Geo-referencing makes an image of the Earth’s surface readable by tying it to real-world coordinates. For a map of 19th-century Russia, 59 sheets were cut, fit, stretched, and manipulated to geo-reference them.
How is geo referencing different from whatever Google Maps is?

Tools for Geospatial Analysis

Kelly O’Neill asks, “Why does where matter?” Map-making for public health, public policy, and urban development is not that different from humanists’ map-making. Using different artifacts as reference points for data points can be painful but fruitful.

Three Rules of Data Organization:

  1. Transparency
  2. Consistency
  3. One piece of Information in One cell

  1. Creating Data

Source: a deck of playing cards from 1856, one card per province, transferred to an Excel sheet.

  2. Identifying Location

An online tool for finding latitude and longitude is a gazetteer, such as GeoNames. For places that go by a different name now, use an older map atlas and upload it to the Map Warper tool to stitch the digitized old map onto the actual surface of a digital Earth location.

  3. Geo-referencing

For historical geo-referencing, a GeoTIFF on world map data can be created using the ArcMap GIS tool for feature extraction (tracing boundaries of provinces, districts, etc.). A spatial join will then match location information to the place on the map.
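
To make the idea of a spatial join concrete, here is a minimal pure-Python sketch. It is not how ArcMap does it: it assumes flat (planar) coordinates, uses a basic ray-casting point-in-polygon test, and the square "provinces" and town coordinates are invented for illustration. Points that fall exactly on a boundary are not handled carefully.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray going east crosses."""
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def spatial_join(points, regions):
    """Attach to each point the name of the first region containing it."""
    joined = {}
    for name, (lon, lat) in points.items():
        joined[name] = next(
            (region for region, polygon in regions.items()
             if point_in_polygon(lon, lat, polygon)),
            None,  # the point lies outside every traced boundary
        )
    return joined


# Hypothetical traced boundaries and locations.
regions = {
    "West Province": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "East Province": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
points = {"Town A": (5, 5), "Town B": (15, 5), "Town C": (25, 5)}

print(spatial_join(points, regions))
```

A GIS tool does the same matching at scale, with proper map projections and spatial indexes instead of a loop over every polygon.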

  4. Visualization results

Tableau can visualize the results, and the output can be uploaded to Tableau Public.

Take Lord of the Rings, Song of Ice and Fire, Narnia, etc maps and “warp” them onto the earth surface for educational purposes

China Local

Peter Bol and Bill Barthelmy

China Local. “Who Cares about Chinese Culture?” Harvard Magazine, May-June 2010. Omeka was used.

Slave Revolt in Jamaica, 1760-1761

Vincent Brown

Tacky’s Revolt took place in Britain’s most profitable, military-strategic, and politically connected colony.
Source material: Primary – Edward Long; Secondary – military records, diaries, letters, newspapers.
To create the location database, hand-drawn old maps were used as a graphic interpretation, on top of which fuzzy circles, solid circles, and lines are placed.
Access Maps – David Heinemann, Ben Sheesley, and Andy Woodruff
Going too far will lead to inaccuracies, so a running list of sources is important to keep the details in check.
Slave Revolt in Jamaica

The Giza Project

Peter Der Manuelian, Museum of Fine Arts, Boston to Harvard University
Creating photo-realistic 3D visualizations that bring together pieces of walls now held in different parts of the world.

World Map

Suzanne P. Blier, Center for Geographic Analysis at Harvard University
worldmap.harvard.edu: 2 million viewers, with 78,000 layers and 2 million points

Harvard Art Museum Tool Builder

Martha Tedeschi and Francesca Bewer: Visual Essays
Template for online exhibitions. This seemed like an alternative to Omeka at first glance, then I realized each project has its own specifications.
https://www.harvardartmuseums.org/tour/digital-tool-builder-tutorial/stop/653

More Digital Humanities Project Examples

Cleveland Historical

The Kilpatrick Collection of Cherokee Manuscripts

Digital Transgender Archive

The Drawings of the Florentine Painters

Women’s Worlds in Qajar Iran

Tech Advancements

Cultural Impact

Type of Project

Structure of project:

  • A repository of files or digital assets
  • Information architecture or structure
  • Suite of services
  • Display for user experience

Out of all the digital humanities projects shown in this lesson, which one or two resonated with you the most? Which projects may be similar to your interests or particularly inspiring to you?



Unstructured Data

  • Usually textual
  • Easy for humans to understand
  • Difficult for computers to understand

Semi-structured Data

  • Uses formal constructs
  • No formal structure

Structured Data

  • Pre-defined semantic units
  • Organized with respect to a data model
  • Further data can be calculated

The Shape of Data is not an AI-human rom-com. It is a way of categorizing how words and numbers are analyzed by a human or a computer, what is called human-readable and machine-readable. Just as there are low-level, mid-level, and high-level languages, there are three categories of data too.

Practice: Applying Structure – Peer Assessment

In this activity you will practice applying structure to a short poem. In this case, we will show you how a short poem can be encoded using TEI tagging and then you will have the opportunity to practice tagging another short poem. Start by reading and analyzing the example provided, including the poem by Milton. Then apply the same type of tagging to the exercise that includes the poem by Keats. (Hint: You may find it easier to copy and paste the poem from the exercise into your own text editing document, apply your tagging, and then copy and paste your final answer into the space provided. Once you submit your response you will not be able to edit it further.)

"On First Looking into Chapman's Homer" by John Keats (1816)
 Much have I travell'd in the realms of gold, 
 And many goodly states and kingdoms 

  Round many western islands have I 

  Which bards in fealty to Apollo 

 Oft of one wide expanse had I been 

 That deep-brow'd Homer ruled as his 

 Yet did I never breathe its pure 

 Till I heard Chapman speak out loud and 

 Then felt I like some watcher of the 

 When a new planet swims into his 

 Or like stout Cortez when with eagle 

 He star'd at the Pacific—and all his 

 Look'd at each other with a wild 

 Silent, upon a peak in Darien. 
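
For anyone who would rather script the tagging than type it, here is a minimal sketch. It assumes the common TEI verse elements `<lg>` (line group) and `<l>` (verse line); the function name `tag_poem` and the `type="sonnet"` attribute value are my own choices, and the course’s expected answer may well use different elements or attributes. Only the two lines that appear complete in my notes are used.

```python
from xml.sax.saxutils import escape


def tag_poem(lines, lg_type="sonnet"):
    """Wrap verse lines in TEI <l> elements inside a single <lg> line group."""
    tagged = [f'<lg type="{lg_type}">']
    for line in lines:
        # escape() protects &, < and > so the output stays well-formed XML.
        tagged.append(f"  <l>{escape(line.strip())}</l>")
    tagged.append("</lg>")
    return "\n".join(tagged)


# The first and last lines of the Keats sonnet as given in the exercise.
excerpt = [
    " Much have I travell'd in the realms of gold, ",
    " Silent, upon a peak in Darien. ",
]
print(tag_poem(excerpt))
```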

File Types

  • Plain Text
  • UTF-8 encodes Unicode characters

HarvardX goes one step further by presenting the entire Plain Text section of the unit in actual plain text. It might be computational linguistics territory, but at the computer level every character in every language, including invisible characters like the line break, has a Unicode code point assigned to it.

Because early text processing was limited to one byte per character, and so to 256 possible values, processing many languages requires more storage space per character.
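
A quick way to see this for yourself is to ask Python for each character’s code point and its size in UTF-8. This is just an illustrative sketch; the helper name `describe` and the four sample characters (a Latin letter, an accented letter, a Devanagari letter, and the invisible line break) are my own picks.

```python
def describe(ch):
    """Return the Unicode code point label and UTF-8 byte length of a character."""
    return f"U+{ord(ch):04X}", len(ch.encode("utf-8"))


# "A", "é", the Devanagari letter "ह", and a line break.
samples = ["A", "\u00e9", "\u0939", "\n"]
for ch in samples:
    label, size = describe(ch)
    print(repr(ch), label, size, "byte(s)")
```

The plain Latin letter and the line break fit in a single byte, while the accented and Devanagari letters need two and three bytes respectively.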

Uses of Plain Text:

  • Store and exchange un-formatted text between programs.
  • Text mining applications
  • Textual analysis


  • Flexible
  • Most used
    • Widely supported
    • Word processors
    • Web browsers
    • Other utility programs


  • Lack of structure
  • Lack of info on interpretation or processing of data
  • Restrictions on structure even on 


  • Comma Separated Values
  • Technically also plain text
  • Represents tabular data
  • Data is arranged in a 2 dimensional grid 
  • One data in One cell


  • A comma separates the current cell from the cell to its right.
  • A line break separates the current row from the row below.
  • Symbols avoid ambiguity


  • Represent one table of a database


  • Simple format
  • Clear, unambiguous structure


  • Rigid structure
  • One cell, One table of data only
  • Software incompatibilities
    • Programs do not have a formal standard for implementing CSV
  • Data Corruption
  • Incorrect encoding used to decode file contents causes data corruption
  • Does not specify text encoding type used
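
The CSV notes above (separators, symbols that avoid ambiguity, and the encoding pitfall) can be tried out with Python’s standard `csv` module. This is a small sketch with invented rows: one cell contains a comma, another contains quote marks, and one name has a non-ASCII letter so that decoding with the wrong encoding visibly corrupts it.

```python
import csv
import io

# Hypothetical rows: a comma inside one cell, quote marks inside another.
rows = [
    ["name", "note"],
    ["Smith, Zo\u00eb", "curator"],
    ["Lee", 'said "hello"'],
]

# Write the table; the writer adds quote symbols wherever a cell is ambiguous.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# The file itself does not record its encoding, so the reader has to know it.
data = csv_text.encode("utf-8")
parsed = list(csv.reader(io.StringIO(data.decode("utf-8"))))
print(parsed == rows)  # decoding with the right encoding restores the table

# Decoding the same bytes with the wrong encoding corrupts the text.
garbled = data.decode("latin-1")
print(garbled.splitlines()[1])
```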

JSON (JavaScript Object Notation)

  • Information is represented logically
  • Two types of structure:
    • Lists of items
      • In a sequence of items
    • Dictionaries of items
      • Item associated with unique key or label
  • Type of Items:
    • String
      • Piece of text
    • List
    • Dictionary
  • Recursive
  • Represents complex structures
  • Conventions
    • “Strings and keys”
    • [ Lists ]
    • item1, item2, …
    • { dictionaries }
  • Each entry format: 
    • “key”: value
  • Uses
    • lightweight
    • flexible
    • stores and exchanges data between different systems 
    • web-based Application Programming Interfaces (APIs)
      • Facilitate communication between independent online platforms
  • Advantages
    • Easily processed programmatically
      • Limited number of structure types
    • Represent complex data with minimal set of structures
  • Disadvantages
    • Strings don’t have further structure by themselves
      • Additional info or meta data 
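
Here is how those conventions look in practice, using Python’s standard `json` module. The record is made up for illustration. Note that JSON values can also be numbers, booleans, and null, in addition to the strings, lists, and dictionaries listed above.

```python
import json

# Hypothetical record: a dictionary holding a string, a list, and a nested dictionary.
record = {
    "title": "Example project",
    "tools": ["Gephi", "Omeka"],
    "team": {"lead": "A. Scholar", "size": 3},
}

# Serialize to JSON text, e.g. to store it or send it through a web API.
text = json.dumps(record, indent=2)
print(text)

# Parse the text back into Python structures.
restored = json.loads(text)
print(restored["tools"][0], restored["team"]["lead"])
```

The nesting is what makes the format recursive: a list or dictionary can hold further lists and dictionaries to any depth.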


  • Uses
    • Represent and transmit web content
    • Interactive web browser rendering
    • Processed programmatically for data extraction
  • Advantages
    • Standardized
    • Semantic info representation
  • Disadvantages
    • Still evolving
      • Standardized version being revised
      • Different software 

Introduction to Digital Data Creation

Physical Analog sources -> Photographs -> Perform Optical Character Recognition (OCR) -> images to text file format -> Data

Forms of Digitization

Digitizing Text with OCR


  • Quick 
  • Effective


  • Handwriting is unclear and not machine readable

Digitizing Text with Crowdsourcing

Image and Full Transcription: (successful transcription of handwritten text): https://transcription.si.edu/transcribe/15654/NMAAHC-007675740_00588
Organization: Smithsonian Transcription Center (https://transcription.si.edu)
Title of Specific Project in Example: District of Columbia Education, Records Relating to School Buildings, Grounds, and Supplies, Quartermaster’s Monthly Reports (https://transcription.si.edu/project/15654)
Link to Catalog of All Projects Undergoing Crowdsourced Transcription by Smithsonian Transcription Center: https://transcription.si.edu/browse?filter=&sort=

Digitizing Objects

2D or 3D renditions, e.g. the Giza Project

Digitizing Audio/Visual Information

We have provided a brief introduction to some of the most prominent ways of creating Digital Humanities data. Now is the time for you to share your thoughts about creating digital data. Which digitization approaches are you most interested in and why?

Baudelaire Song Project

Different Ways of Getting Data

Downloading Data from Online Repositories

The Malaysian government publishes Open Data on a website run by MAMPU.
Table of different data repositories

  • Registry of Research Data Repositories
    • Description: Searchable registry of over 2,000 repositories that host research data. Individual datasets may be subject to use restrictions.
    • Data types: Archived Data, Audiovisual data, Configuration data, Network-based data, Plain text, Raw Data, Scientific and Statistical Data Formats, Software applications, Source Code, Standard Office Documents, Structured Graphics, Structured Text
  • Harvard Dataverse
    • Description: Searchable repository of research data in a variety of formats. Individual datasets may be subject to use restrictions.
    • Data types: Applications, FITS Images, Tabular Data, Compressed Files (e.g. ZIP)
  • Full-text corpus data
    • Description: This site contains downloadable, full-text corpus data from six large English corpora (iWeb, NOW, Wikipedia, COCA, COHA, GloWbE), as well as the Spanish language Corpus del Español. The data is being used at hundreds of universities throughout the world, as well as in a wide range of companies. Individual datasets may be subject to use restrictions or require purchase.
    • Data types: Databases, Plain text
    • Description: This site contains downloadable corpora developed by Mark Davies, Brigham Young University. Individual datasets may be subject to use restrictions or require purchase.
    • Data types: Databases, Plain text
  • Project Gutenberg
    • Description: Project Gutenberg offers over 58,000 free eBooks in a variety of formats and languages.
    • Data types: ePub, Plain text
  • Spatial Data Repository
    • Description: The Spatial Data Repository provides geographically-linked health and demographic data from The DHS Program and the U.S. Census Bureau for mapping in geographic information systems (GIS).
    • Data types: Various Geo-spatial formats
  • Natural Earth
    • Description: Free vector and raster map data
    • Data types: ESRI shapefile format, TIFF format, TFW format
  • New York University (NYU) Spatial Data Repository
    • Description: Catalog of geo-spatial data and maps available from New York University.
    • Data types: Image, Raster Line

Relational Databases

A database is an organized collection of data held in a digital system that supports data creation, modification, and retrieval. Queries are requests for the data that matches a set of formal requirements, written in Structured Query Language (SQL). 
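
To show what a query looks like in practice, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `books` and its columns are my own assumptions for illustration, not something from the course.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Creation: define one table (the name and columns are illustrative).
conn.execute("CREATE TABLE books (title TEXT, author TEXT, year INTEGER)")

# Modification: insert a few rows.
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("A Tale of Two Cities", "Charles Dickens", 1859),
        ("Great Expectations", "Charles Dickens", 1861),
        ("Oliver Twist", "Charles Dickens", 1838),
    ],
)

# Retrieval: the query states formal requirements (year before 1860)
# and the database returns only the rows that match them.
rows = conn.execute(
    "SELECT title FROM books WHERE year < ? ORDER BY year", (1860,)
).fetchall()
conn.close()

print(rows)
```

The `?` placeholders keep the values separate from the SQL text, which is the usual way to pass parameters safely.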

Web APIs

To get semi-structured data, computer software interacts with an API (Application Programming Interface) rather than with the pages rendered for a web browser.

Use Cases:

  • Extract machine readable data from online systems.
  • Real-time data exchange between independent projects and apps
  • “Mashups”
    • function as a single system from the user’s perspective
    • draw on multiple independent systems
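
A rough sketch of what talking to a web API involves, using only the Python standard library. The endpoint `https://api.example.org/texts`, its parameters, and the canned response are all hypothetical and are not the API of any real project. Nothing is sent over the network here: the code only builds the request URL and parses a sample JSON reply.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.org/texts"


def build_request_url(params):
    """Encode query parameters into a URL a program could request."""
    return f"{BASE_URL}?{urlencode(params)}"


url = build_request_url({"title": "A Tale of Two Cities", "format": "json"})

# A canned reply standing in for what such an API might send back.
sample_response = '{"results": [{"id": "t1", "title": "A Tale of Two Cities"}]}'

# The program reads machine readable data directly, with no web page involved.
titles = [item["title"] for item in json.loads(sample_response)["results"]]

print(url)
print(titles)
```

In a real project the URL would be fetched with an HTTP client and the reply parsed in the same way, following that API's own documentation.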

Chinese Text Project API, Text Tools, and MARKUS

Web Scraping

Use Cases: Search engines, data extraction

  • Main content of individual page
  • Boilerplate navigation

“Analysing and understanding news consumption patterns by tracking online user behaviour with a multimodal research design”
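
The split between main content and boilerplate navigation can be sketched with Python's standard `html.parser`. This is a simplified illustration on a made-up page, assuming that boilerplate lives inside `nav`, `header`, and `footer` tags. Real pages are messier and usually need a dedicated scraping library.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect text while skipping tags assumed to hold boilerplate."""

    SKIP = {"nav", "header", "footer", "script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


# A made-up page with navigation, an article, and a footer.
page = """<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article><h1>Headline</h1><p>Main story text.</p></article>
<footer>Contact us</footer>
</body></html>"""

parser = MainTextExtractor()
parser.feed(page)
main_text = " ".join(parser.chunks)

print(main_text)
```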


Kyle Courtney: the author holds a limited monopoly of rights over a work until it enters the public domain.

  • Right to make copies
  • Right to make derivative works
  • Right to distribute those works

Public Domain

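In the US, copyright in an individual's work generally lasts for the author's life plus 70 years before the work enters the public domain. Corporate works (works made for hire) are protected for 95 years from publication or 120 years from creation, whichever is shorter.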

Originality has two components:

  1. Independent creation
  2. Sufficient creativity

There is no creativity in facts; hence, facts are not copyrightable.
Under the Feist standard (Feist v. Rural Telephone Service), one telephone directory company sued another, but the listings were held not to be copyrightable. Data can be copyrightable only if some amount of creativity is applied.
How do you measure creativity in this?


A contract is a legally binding agreement between two or more parties that is enforceable by law. Beyond copyright, certain vendors or companies might hold the license over some data. The law recognizes whoever holds the rights as the owner of the data, even if they are not its actual creator.
For every digital humanities project, remember to keep track of licenses. Check the use, reuse, distribution, and publication policies.
These licenses have sections that are pretty standard: definitions, scope, authorized uses, licensing fees, termination, and governing law. Most important are the authorized uses (the rights of the licensee, that is, you, the signer) and the restricted uses.
Limits can be placed on tech usage, space usage, utilization, attribution, or commercial aspects.

Creative Commons

Networked Infrastructure for Nineteenth Century Electronic Scholarship (NINES database)

Open Access and Open Source

Open Access

  • Digital
  • Online
  • Free of Charge
  • Free of Copyright
  • Free of licensing restrictions

Non-exclusive Transfer: not limited to one party
Peter Suber  – Office for Scholarly Communication. 
Advantages of Open Access:

  • Research usefulness
  • Research visibility
  • Research retrieval
  • Increase in Citation and Integration 

Open Source

  • Source code
  • Tech neutral
  • Free of charge
  • Free of most copyright restrictions
  • Free of licensing restrictions
  • Free of distribution restrictions
  • Derivative works

Open Source / Free software / Copyleft / Community software / Public software
DH has taken extensive advantage of Open Source software

Fair Use

4 Factors of Fair Use:

  • Purpose of Use
  • Nature of the Work
  • Amount of Usage
  • Harm to the Market 

Google Books, HathiTrust
Peter Murray-Rust, University of Cambridge

Dissemination of Data and Scholarship

Sharing or licensing work

You want to identify your audience and venue. Who are you aiming for? And you want to define what your deliverables are. Is it data visualization, which offers macro and micro views for data points or large data sets? Is it a software tool you’ve developed that lets users process data? Or are you putting it into infographics, which are a narrative form of digital storytelling? You may also want to consider whether you are going to present this on a platform or web host it yourself, and consider data storage providers. All of these choices have legal and ethical implications. See the World Wide Web Consortium and look at the Web Content Accessibility Guidelines.

The Command Line chapter involved installing a virtual machine (virtualbox.org)

Running Ubuntu 18.04 – a free Linux OS download

Working with Tools – Voyant

It is especially wonderful to be able to use a pre-installed Voyant on the laptops, learning to use it as part of my DiSc Lab training.

Objective: Create data from split text files, Compare data results across text files

Visualizing text files with Command line ->

  • Automating batch operations
  • Manipulating files
  • Auto writing spreadsheets
  • Customizing summary stats
  • Standard summary stats+viz
  • Charles Dickens text data
  • Corpus of presidential speeches 
  • Voyant Tools by Stéfan Sinclair is a go-to for distant reading
  • Download tale_of_two_cities.txt
    • Concordances mapped out
    • Word cloud – most mentioned ~ bigger
    • Summary Stats – vocab density
    • TermsBerry tool – collocated terms shown next to each other
    • Export – I can click this to export a URL, an embedded tool, data, or a bibliography reference, or a PNG or SVG file
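
To get a feel for what the summary stats are doing, here is a rough Python sketch that counts terms and computes a vocabulary density for two short passages. I am assuming vocabulary density means unique words divided by total words. The tokenizer is deliberately simple and will not match Voyant's numbers exactly.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def summary_stats(text, top_n=3):
    """Return token count, unique-word count, density, and top terms."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    density = len(counts) / len(tokens) if tokens else 0.0
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "density": density,
        "top": counts.most_common(top_n),
    }


# Two short passages standing in for split text files.
part_one = "It was the best of times, it was the worst of times"
part_two = "it was the age of wisdom, it was the age of foolishness"

stats_one = summary_stats(part_one)
stats_two = summary_stats(part_two)

# Compare the results across the two pieces.
for name, stats in [("part one", stats_one), ("part two", stats_two)]:
    print(name, stats["tokens"], stats["types"], round(stats["density"], 3))
```

The same function could be run over each split file of tale_of_two_cities.txt, with the results written into a spreadsheet for comparison.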

How would you describe computational methods applied to humanities research? Can you imagine applying computational methods to your own work in the humanities? How do Jeffrey Schnapp’s comments change or challenge your thinking about Digital Humanities? All of these ideas will help you get a sense of each other’s perspective here at the start of the course.

Computational methods applied to humanities research are quantitative, qualitative, and representational methodologies that employ software tools to arrive at questions and answers for a given sphere of the humanities. Using computational methods is not the same as using the computer, even if it involves using the computer to arrive at a result. Today, I attended a seminar on DH blogs, specifically “How are DH blogs doing these days?” by Laura Crossley, a George Mason University PhD student and Digital History fellow at the Roy Rosenzweig Center for History and New Media.

A DH interest survey

I keep wondering as I go through this course – Should I pick a particular project and imitate it as closely as possible to prove I can? As in, take up a Broadway shows map timeline (maybe even coin a term for this kind of work)? Will I be able to find out how to map a timeline of the various books I read along with the book clubs I am a part of? The opportunities are obviously endless. My interest is boundless. I have time, I think. If only the work came naturally to me.
