There are many fossil remains of plants that could help write new evolutionary histories of botanical families. Common among the remains are leaves, which could provide an excellent way to identify plants. But resources for leaf identification can be hard to access. Peter Wilf and colleagues have addressed this knowledge gap by assembling an open-access database of 30,252 images of vouchered leaf specimens vetted to family level. As well as providing an excellent educational resource for human students, the database could supply machine learning projects with material to improve their systems.

“The complexity of leaves is off the charts, and the terminology we have to describe them is only the tiniest beginning of what is needed,” Peter Wilf said in a press release. “Researchers need much more accessible visual references to study what the differences are among the many plant groups, so we can put more of that into words. There are a lot of plant families that look superficially similar, and this collection provides an opportunity to see new patterns.”
It’s not only complexity that’s the issue. In their article, Wilf and colleagues also highlight the recording of leaf architecture, or rather the lack of it. “To build their knowledge of leaf architecture, researchers still rely primarily on “oral tradition” from a dwindling number of knowledgeable colleagues and a handful of survey papers and field guides that emphasize purportedly diagnostic leaf features… There is significant literature on the leaf architecture and leaf-fossil records of various taxa… However, many of the most diverse and ecologically significant groups of angiosperms have virtually no documentation of diagnostic leaf-blade features (e.g., Asteraceae, Rubiaceae), and thus their leaf fossils remain largely unrecognized, though probably hidden in plain sight in museum collections…”
Accessing these collections can be a challenge. They are scattered around the world, so visiting them in person means expensive travel. Some herbaria are digitising their collections, but only the bigger and better-funded ones can afford to do so. In the article, Wilf and colleagues add that merely being available online often isn’t enough for a research project. “In most of the online image sets, bulk downloads are not easily done, images are downsampled to low resolution, and the filenames are not standardized, requiring significant manual effort to re-organize and collate them for a particular project. Adding further complications to data modularity, taxonomic data have often become partially obsolete.”
“What we have done here is to make this massive educational resource available to everyone by vetting and standardizing all these images from different legacy sources,” Wilf said. “It took 15 years for us all to do that and convert all the filenames, but now you can have the whole package on your desktop with a single browser click. Every filename has the key information embedded, in the same order for rapid alpha-sorting: family, genus, species, and specimen number. The filenames can be rapidly searched in seconds for the item you are interested in and the images viewed using standard tools, such as the Windows search bar. All images are original resolution; nothing is downsampled.”
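Because every filename carries the same four fields in the same order, the images can also be sorted and grouped by a short script as well as by eye. The following is only a rough sketch of that idea in Python, not the authors' tooling: the underscore separator, the example filenames, and the function names `parse_leaf_filename` and `group_by_family` are all assumptions made for illustration, and the dataset's own documentation should be checked for the actual naming pattern.

```python
from collections import defaultdict
from pathlib import PurePath


def parse_leaf_filename(filename, sep="_"):
    """Split a filename into family, genus, species and specimen number.

    Assumption: the four fields are joined by `sep` (an underscore here).
    The real separator used in the dataset may differ.
    """
    stem = PurePath(filename).stem  # drop any directory and the extension
    # Split at most three times so that a specimen number containing
    # the separator stays in one piece.
    parts = stem.split(sep, 3)
    if len(parts) < 4:
        raise ValueError(f"expected 4 fields in {filename!r}")
    family, genus, species, specimen = parts
    return {
        "family": family,
        "genus": genus,
        "species": species,
        "specimen": specimen,
    }


def group_by_family(filenames, sep="_"):
    """Group filenames by their family field, skipping names that do not fit."""
    groups = defaultdict(list)
    for name in filenames:
        try:
            record = parse_leaf_filename(name, sep)
        except ValueError:
            continue  # not an image filename in the assumed pattern
        groups[record["family"]].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical filenames, invented for this example.
    names = [
        "Fagaceae_Quercus_alba_ABC123.jpg",
        "Rubiaceae_Coffea_arabica_7.jpg",
        "notes.txt",
    ]
    for family, files in sorted(group_by_family(names).items()):
        print(family, len(files))
```

Grouping by the leading family field mirrors the alpha-sorting Wilf describes, and a family-level grouping like this is also the kind of labelling a family-level machine learning project would need.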
It’s not just human eyes that can benefit from the database. The authors also talk about machine learning. They describe a few apps as “making spectacular breakthroughs” in identifying plants. But they also point out some problems. First, the algorithms are opaque – it’s unclear what the computers have recognised as diagnostic features when identifying plants.
Another problem is that few algorithms identify plants above the species level. The public likes to know the species of a plant, but it can be helpful to know what connects a family of plants. For leaf fossils, there might well be no extant species or genus to connect to an image, so being able to identify a family through machine learning would be extremely useful.
“This database makes the information in these collections available to people around the world in a form that is easier to search than the original and more amenable to digital analyses,” said Scott Wing, a co-author of the article. “We think the database will encourage new research and also open the museum collections to people.”
READ THE ARTICLE
Wilf, P., Wing, S.L., Meyer, H.W., Rose, J.A., Saha, R., Serre, T., Cúneo, N.R., Donovan, M.P., Erwin, D.M., Gandolfo, M.A., González-Akre, E., Herrera, F., Hu, S., Iglesias, A., Johnson, K.R., Karim, T.S. and Zou, X. (2021) “An image dataset of cleared, x-rayed, and fossil leaves vetted to plant family for human and machine learning,” PhytoKeys, https://doi.org/10.3897/phytokeys.187.72350
ACCESS THE DATABASE
Wilf, P., Wing, S.L., Meyer, H.W., Rose, J.A., Saha, R., Serre, T., Rubén Cúneo, N., Donovan, M., Erwin, D.M., Gandolfo, M.A., Gonzalez-Akre, E.B., Herrera, F., Hu, S., Iglesias, A., Johnson, K.R., Karim, T.S. and Zou, X. (2021) “Image collection and supporting data for: An image dataset of cleared, x-rayed, and fossil leaves vetted to plant family for human and machine learning.” Figshare+, https://doi.org/10.25452/figshare.plus.14980698